이슈

학교 홈페이지 공지사항을 크롤링하여 보여주는 간단한 애플리케이션 개발 과정에서 AWS Lambda를 통해 serverless 환경을 구현하여 1시간 마다 주기적인 크롤링을 하도록 했다. 그런데 학교 홈페이지 특성상 게시판을 크롤링하려면 브라우저 기반 크롤링 엔진인 selenium이 필요했고, 람다에서 셀레니움을 구동하려 했다. 그런데 셀레니움은 브라우저 환경에 맞는 드라이버의 import가 필수이기 때문에 람다에 리눅스용 크롬 드라이버를 import 시켜 사용하려 했지만 계속해서 알 수 없는 오류를 뿜었다. 구글링을 통해 발견한 해결책도 통하지 않아, 크롤링 코드와 환경을 직접 docker 이미지로 빌드하고 람다에서 직접 실행하게끔 하기로 했다.

Selenium 크롬 드라이버 설정

아래의 코드를 통해 selenium의 드라이버에 chrome driver option을 적용할 수 있다.

from selenium import webdriver

def get_driver():
    options = webdriver.ChromeOptions()
    options.binary_location = '/opt/chrome/chrome'
    options.add_argument('--headless')
    options.add_argument('--no-sandbox')
    options.add_argument("--disable-gpu")
    options.add_argument("--window-size=1280x1696")
    options.add_argument("--single-process")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-dev-tools")
    options.add_argument("--no-zygote")
    options.add_argument('user-agent=Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36')
    
    driver = webdriver.Chrome("/opt/chromedriver", options=options)
    return driver

def crawl():
    driver = get_driver()
    driver.get('YOUR_URL')

    # Crawling code . . .

Dockerfile 작성

프로젝트 폴더 루트 경로에 아래와 같은 Dockerfile을 작성한다.

FROM public.ecr.aws/lambda/python@sha256:b15110cfc524c410f9c3b3e906b4fa2fe268c38811c4c34f048e8e5d4c9669c8 as build

RUN yum install -y unzip && \
    curl -Lo "/tmp/chromedriver.zip" "https://chromedriver.storage.googleapis.com/97.0.4692.71/chromedriver_linux64.zip" && \
    curl -Lo "/tmp/chrome-linux.zip" "https://www.googleapis.com/download/storage/v1/b/chromium-browser-snapshots/o/Linux_x64%2F938549%2Fchrome-linux.zip?alt=media" && \
    unzip /tmp/chromedriver.zip -d /opt/ && \
    unzip /tmp/chrome-linux.zip -d /opt/


FROM public.ecr.aws/lambda/python@sha256:b15110cfc524c410f9c3b3e906b4fa2fe268c38811c4c34f048e8e5d4c9669c8

RUN yum install atk cups-libs gtk3 libXcomposite alsa-lib \
    libXcursor libXdamage libXext libXi libXrandr libXScrnSaver \
    libXtst pango at-spi2-atk libXt xorg-x11-server-Xvfb \
    xorg-x11-xauth dbus-glib dbus-glib-devel -y

RUN pip install --upgrade pip

# RUN pip install requests
# RUN pip install beautifulsoup4
# RUN pip install pymysql
RUN pip install selenium

COPY --from=build /opt/chrome-linux /opt/chrome

COPY --from=build /opt/chromedriver /opt/

COPY *.py ./

CMD [ "lambda_function.lambda_handler" ]

이때 CMD 커맨드의 "lambda_function.lambda_handler"는 람다 함수의 진입점에 해당하는 function의 경로이다.

Docker 이미지를 Amazon ECR에 업로드

도커 이미지 빌드

$docker build -t {IMAGE_TAG} .   

액세스 키 생성

AWS Console 에서 IAM 서비스 접속
좌측 사용자 탭 -> 본인 계정 -> 보안 자격 증명 -> 액세스 키 만들기
.csv 파일 다운로드 후 Access key ID, Secret access key ID 확인

AWS Configure 등록

$aws configure

AWS Access Key ID [None]:
AWS Secret Access Key [None]: 
Default region name [None]: # ex) ap-northeast-2
Default output format [None]: # ex) json

Docker CLI 를 Amazon ECR 에 인증

$aws ecr get-login-password --region {YOUR_REGION} | docker login --username AWS --password-stdin {YOUR_AWS_ID}.dkr.ecr.{YOUR_REGION}.amazonaws.com    

YOUR_REGION 예시) ap-northeast-2
YOUR_AWS_ID 예시) 123456789012

Amazon ECR 에 레포지토리 생성

$aws ecr create-repository --repository-name {IMAGE_TAG} --image-scanning-configuration scanOnPush=true --image-tag-mutability MUTABLE

Amazon ECR 에 이미지 배포

$docker tag {IMAGE_TAG}:latest {YOUR_AWS_ID}.dkr.ecr.{YOUR_REGION}.amazonaws.com/{IMAGE_TAG}:latest

$docker push {YOUR_AWS_ID}.dkr.ecr.{YOUR_REGION}.amazonaws.com/{IMAGE_TAG}:latest        

AWS Lambda 에서 컨테이너 이미지 배포

AWS Console 접속
AWS Lambda 서비스 접속
생성 옵션을 컨테이너 이미지로 선택
이미지 찾아보기를 통해 방금 ECR에 업로드한 이미지 선택

이 이후 AWS EventBridge를 통해 람다를 주기적으로 실행하게끔 하여 서버리스 셀레니움 크롤러를 구축할 수 있다.

Post author: Yeojun Yoon
Post link: https://blog.yjyoon.dev/aws/2022/01/26/aws-01/
Copyright Notice: All articles in this blog are licensed under CC BY-NC-SA 3.0 unless stating additionally.

Android Developer

[AWS] Lambda에서 Docker 이미지로 Selenium 구동하기

이슈