[selenium] A routine for crawling dynamic web pages

바틀비 2024. 2. 17. 06:00

Running a basic crawl

Import the packages and set up the overall skeleton

In [4]:

import selenium
import pandas as pd
import time
import random

# Download ChromeDriver first, then set it up
# https://chromedriver.chromium.org/downloads
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Configure ChromeDriver
CHROME_DRIVER_PATH = './chromedriver.exe'
service = Service(executable_path=CHROME_DRIVER_PATH)
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)

# Open the URL to crawl
url = 'https://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu&divpage=85'
driver.get(url=url)

Analyzing the CSS path: general approach

Analyze the pattern in the CSS paths of the elements you want to crawl.
Developer tools -> Elements -> right-click the element -> Copy -> Copy selector

Example 1)

Real-time search keyword ranking, ranks 1 to 10. Ranks 1 to 5 and ranks 6 to 10 are grouped into two separate tables.
Ranks 1 to 5
Only the index in the div:nth-child(1) > a > span.rank-text part differs.
- #app > div > main > div > section > div > section > section:nth-child(2) > div:nth-child(2) > div > div:nth-child(1) > div:nth-child(1) > a > span.rank-text
- #app > div > main > div > section > div > section > section:nth-child(2) > div:nth-child(2) > div > div:nth-child(1) > div:nth-child(2) > a > span.rank-text
- #app > div > main > div > section > div > section > section:nth-child(2) > div:nth-child(2) > div > div:nth-child(1) > div:nth-child(3) > a > span.rank-text
Ranks 6 to 10
Same pattern as ranks 1 to 5, except for the div > div:nth-child(2) > div:nth-child(1) > a > span.rank-text part: the enclosing div is nth-child(2) instead of nth-child(1).
- #app > div > main > div > section > div > section > section:nth-child(2) > div:nth-child(2) > div > div:nth-child(2) > div:nth-child(1) > a > span.rank-text
- #app > div > main > div > section > div > section > section:nth-child(2) > div:nth-child(2) > div > div:nth-child(2) > div:nth-child(2) > a > span.rank-text
- #app > div > main > div > section > div > section > section:nth-child(2) > div:nth-child(2) > div > div:nth-child(2) > div:nth-child(3) > a > span.rank-text
Generalization
Keep the parts that are identical and loosen only the parts that vary. Here that means dropping :nth-child from the two div levels whose index changes.
#app > div > main > div > section > div > section > section:nth-child(2) > div:nth-child(2) > div > div > div > a > span.rank-text
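The generalization step can also be done mechanically. The following is a minimal sketch, not part of the original notebook: the helper name generalize_selectors is hypothetical, and it assumes that every copied selector has the same number of ' > '-separated steps and that the selectors start with #app.

```python
import re


def generalize_selectors(selectors):
    """Merge sibling selectors into one by loosening the steps that differ.

    Hypothetical helper for illustration: it assumes all selectors have the
    same number of ' > '-separated steps.
    """
    split = [s.split(' > ') for s in selectors]
    if len({len(steps) for steps in split}) != 1:
        raise ValueError('selectors must all have the same depth')

    merged = []
    for steps in zip(*split):
        if len(set(steps)) == 1:
            # Identical in every selector: keep the step unchanged.
            merged.append(steps[0])
        else:
            # Varying step: drop the :nth-child(n) qualifier.
            loosened = {re.sub(r':nth-child\(\d+\)', '', step) for step in steps}
            if len(loosened) != 1:
                raise ValueError(f'steps differ by more than nth-child: {steps}')
            merged.append(loosened.pop())
    return ' > '.join(merged)


PREFIX = ('#app > div > main > div > section > div > section > '
          'section:nth-child(2) > div:nth-child(2) > div')

# The six selectors listed above (two tables, three rows each).
copied = [
    f'{PREFIX} > div:nth-child({table}) > div:nth-child({row}) > a > span.rank-text'
    for table in (1, 2)
    for row in (1, 2, 3)
]

general = generalize_selectors(copied)
print(general)
```

The resulting string can be passed to find_elements with By.CSS_SELECTOR to match every rank in one call.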

Example 2)

Post titles
Posts
- #revolution_main_table > tbody > tr:nth-child(9) > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a > font
- #revolution_main_table > tbody > tr:nth-child(11) > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a > font
- #revolution_main_table > tbody > tr:nth-child(13) > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a > font
In code: only the tr index varies (9, 11, 13), so loop with for i in range(9, 14, 2)
'#revolution_main_table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a > font'
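Only the tr:nth-child index changes between the three posts listed above (9, 11, 13). As a small runnable sketch of the loop, with a hypothetical helper name title_selector and the '#' id prefix assumed from the selectors used later in the notebook:

```python
def title_selector(row):
    """Build the title selector for one table row (hypothetical helper)."""
    return ('#revolution_main_table > tbody > tr:nth-child(' + str(row) + ')'
            ' > td:nth-child(3) > table > tbody > tr > td:nth-child(2)'
            ' > div > a > font')


# Rows 9, 11 and 13: step by 2 so the rows in between are skipped.
selectors = [title_selector(i) for i in range(9, 14, 2)]
for selector in selectors:
    print(selector)
```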

Crawling

Open the URL with the driver
Set the CSS path
Find every element that matches the given CSS selector; the matches are returned as a list
Build the result by appending to an empty list created beforehand
Inspect the elements in the returned list to decide what to extract
For the title
- <font class="list_title">[인터파크] 삼성전자 비스포크 냉장고 RF85C9141AP 868리터 (2,189,840원/무료)</font>
- Take only the text
For the link
- <a href="view.php?id=ppomppu&page=1&divpage=85&no=517933">[인터파크] 삼성전자 비스포크 냉장고 RF85C9141AP 868리터 (2,189,840원/무료)</a>
- Take the link (the href), not the text
- Use get_attribute
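The href in the markup above is relative, whereas get_attribute('href') in Selenium returns the URL as the browser resolved it against the page address. As a rough illustration using only the standard library, urljoin stands in here for the browser's resolution; it is not what Selenium calls internally.

```python
from urllib.parse import urljoin

# Address of the list page and the raw href taken from the <a> tag above.
PAGE_URL = 'https://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu&divpage=85'
RAW_HREF = 'view.php?id=ppomppu&page=1&divpage=85&no=517933'

# Resolve the relative href against the page it appeared on.
absolute_link = urljoin(PAGE_URL, RAW_HREF)
print(absolute_link)
```

This is why the link column of the crawled result holds full https:// addresses rather than the short form seen in the page source.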

In [18]:

import selenium
import pandas as pd
import time
import random

# Download ChromeDriver first, then set it up
# https://chromedriver.chromium.org/downloads
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

# Configure ChromeDriver
CHROME_DRIVER_PATH = './chromedriver.exe'
service = Service(executable_path=CHROME_DRIVER_PATH)
options = webdriver.ChromeOptions()
driver = webdriver.Chrome(service=service, options=options)

# Open the URL to crawl
url = 'https://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu&divpage=85'
driver.get(url=url)

# Crawl
# Create empty lists
titles_list = []
links_list = []

# Set the CSS paths
titles_css_path = '#revolution_main_table > tbody > tr > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a > font'
links_css_path = '#revolution_main_table > tbody > tr > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a'

# find_elements returns the matching elements as a list
title_results = driver.find_elements(By.CSS_SELECTOR, titles_css_path)
link_results = driver.find_elements(By.CSS_SELECTOR, links_css_path)

# Extract only the needed part and append it to the lists
for title_result in title_results:
    titles_list.append(title_result.text)
for link_result in link_results:
    links_list.append(link_result.get_attribute('href'))

# Build a DataFrame
df = pd.DataFrame({'title': titles_list, 'link': links_list})
print(f'Successfully extracted {len(df)} posts')

# Quit the driver
driver.quit()

Successfully extracted 31 posts
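Titles and links are collected into two separate lists, and pd.DataFrame raises a ValueError if their lengths differ, for example when some <a> has no <font> child. One way to keep each title paired with its link is to read both from the same anchor element. This is a sketch under assumptions: extract_rows and FakeAnchor are hypothetical names, FakeAnchor only imitates the two members of a Selenium element used here, and on the real page the anchor's text may contain more than the title.

```python
class FakeAnchor:
    """Stand-in for a Selenium WebElement, for illustration only."""

    def __init__(self, text, href):
        self.text = text
        self._attrs = {'href': href}

    def get_attribute(self, name):
        return self._attrs.get(name)


def extract_rows(anchor_elements):
    """Read title and link from the same element so they cannot drift apart."""
    rows = []
    for anchor in anchor_elements:
        rows.append({
            'title': anchor.text.strip(),
            'link': anchor.get_attribute('href'),
        })
    return rows


# Placeholder data, not taken from the real site.
anchors = [
    FakeAnchor(' Post A ', 'https://example.com/view.php?no=1'),
    FakeAnchor('Post B', 'https://example.com/view.php?no=2'),
]
rows = extract_rows(anchors)
print(rows)
```

With a real driver, the input would be driver.find_elements(By.CSS_SELECTOR, links_css_path), and pd.DataFrame(rows) turns the list of dictionaries into the same two-column table.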

In [13]:

df.head()

Out[13]:

	title	link
0	[SSG라방]탑텐키즈 24 S/S/ 겨울 베스트템 특가LIVE(상품다양/3만원이상무료)	https://www.ppomppu.co.kr/zboard/view.php?id=p...
1	[위메프] 뉴트리나 건강백서 울트라 강아지 고양이사료 모음 (39,360원 /무배)	https://www.ppomppu.co.kr/zboard/view.php?id=p...
2	[위메프] 블루펫 강아지 배변패드 20g 소형 400매 (21,260원/무배)	https://www.ppomppu.co.kr/zboard/view.php?id=p...
3	[G마켓] ASUS 모니터 ROG Strix XG27ACS 사전예약판매 특가 (36...	https://www.ppomppu.co.kr/zboard/view.php?id=p...
4	[네이버] 파나소닉 제트워셔 구강세정기 2개 + 파우치 + 노즐한팩 108,000원...	https://www.ppomppu.co.kr/zboard/view.php?id=p...

Structuring the code

In [19]:

import selenium
import pandas as pd
import time
import random

# Download ChromeDriver first, then set it up
# https://chromedriver.chromium.org/downloads
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

def crawling(driver):

    # Create empty lists
    titles_list = []
    links_list = []

    # Set the CSS paths
    titles_css_path = '#revolution_main_table > tbody > tr > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a > font'
    links_css_path = '#revolution_main_table > tbody > tr > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a'

    # find_elements returns the matching elements as a list
    title_results = driver.find_elements(By.CSS_SELECTOR, titles_css_path)
    link_results = driver.find_elements(By.CSS_SELECTOR, links_css_path)

    # Extract only the needed part and append it to the lists
    for title_result in title_results:
        titles_list.append(title_result.text)
    for link_result in link_results:
        links_list.append(link_result.get_attribute('href'))

    # Build a DataFrame
    return pd.DataFrame({'title': titles_list, 'link': links_list})

def main():

    # Configure ChromeDriver
    CHROME_DRIVER_PATH = './chromedriver.exe'
    service = Service(executable_path=CHROME_DRIVER_PATH)
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(service=service, options=options)

    # Open the URL to crawl
    url = 'https://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu&divpage=85'
    driver.get(url=url)

    # Run the crawl
    df = crawling(driver)
    print(f'Successfully extracted {len(df)} posts')

    # Quit the driver
    driver.quit()

if __name__ == '__main__':
    main()

Successfully extracted 31 posts
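In main(), driver.quit() only runs if nothing before it raises, so a failed crawl can leave a browser window open. A possible refinement, sketched with hypothetical names (managed_driver, FakeDriver) and a fake driver so that it runs without a browser:

```python
from contextlib import contextmanager


@contextmanager
def managed_driver(make_driver):
    """Create a driver and guarantee quit() is called, even on errors."""
    driver = make_driver()
    try:
        yield driver
    finally:
        driver.quit()


class FakeDriver:
    """Minimal stand-in for webdriver.Chrome, for illustration only."""

    def __init__(self):
        self.closed = False

    def quit(self):
        self.closed = True


fake = FakeDriver()
try:
    with managed_driver(lambda: fake) as driver:
        raise RuntimeError('simulated crawl failure')
except RuntimeError:
    pass

print(fake.closed)  # True: quit() ran despite the exception
```

In the notebook's setting, make_driver would be a function that returns webdriver.Chrome(service=service, options=options).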

Improvement 1

Crawl multiple pages
Only main() changes

In [4]:

import selenium
import pandas as pd
import time
import random

# Download ChromeDriver first, then set it up
# https://chromedriver.chromium.org/downloads
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

def crawling(driver):

    # Create empty lists
    titles_list = []
    links_list = []

    # Set the CSS paths
    titles_css_path = '#revolution_main_table > tbody > tr > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a > font'
    links_css_path = '#revolution_main_table > tbody > tr > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a'

    # find_elements returns the matching elements as a list
    title_results = driver.find_elements(By.CSS_SELECTOR, titles_css_path)
    link_results = driver.find_elements(By.CSS_SELECTOR, links_css_path)

    # Extract only the needed part and append it to the lists
    for title_result in title_results:
        titles_list.append(title_result.text)
    for link_result in link_results:
        links_list.append(link_result.get_attribute('href'))

    # Build a DataFrame
    return pd.DataFrame({'title': titles_list, 'link': links_list})

def main():

    # Configure ChromeDriver
    CHROME_DRIVER_PATH = './chromedriver.exe'
    service = Service(executable_path=CHROME_DRIVER_PATH)
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(service=service, options=options)

    # Set up variables
    df = None
    last_page = 5

    for i in range(1, last_page + 1):
        # Open the URL of the page to crawl
        url = 'https://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu&page=' + str(i) + '&divpage=85'
        driver.get(url=url)

        # DataFrame crawled from this page
        result_crawl = crawling(driver)

        # Append to the DataFrame accumulated so far
        df = pd.concat([df, result_crawl], ignore_index=True)

        # Pause to avoid overloading the server
        time.sleep(random.uniform(2, 5))

    print(f'Successfully extracted {len(df)} posts')

    # Quit the driver
    driver.quit()

if __name__ == '__main__':
    main()

Successfully extracted 111 posts
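When a query string is assembled by hand, it is easy to drop a separator such as '&' between parameters. A minimal sketch that lets the standard library build the query instead; build_page_url and its default arguments are hypothetical, chosen to match the URL used in this post:

```python
from urllib.parse import urlencode

BASE_URL = 'https://www.ppomppu.co.kr/zboard/zboard.php'


def build_page_url(page, board='ppomppu', divpage=85):
    """Return the list-page URL for one page number (hypothetical helper)."""
    query = urlencode({'id': board, 'page': page, 'divpage': divpage})
    return f'{BASE_URL}?{query}'


last_page = 5
urls = [build_page_url(i) for i in range(1, last_page + 1)]
for page_url in urls:
    print(page_url)
```

Inside the loop in main(), url = build_page_url(i) would replace the string concatenation.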

Improvement 2

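Analysis of the target site
- Only page 1 has notices and alerts, and the bottom of the list contains unwanted posts such as 쇼핑블루 and 핫딜 (hot deals).
- On page 1, the posts run from nth-child(9) to nth-child(47), on odd indices
- On the other pages, they run from nth-child(6) to nth-child(44), on even indices
What changed
- The crawling function takes additional arguments
- The CSS path in the crawling function changed
- main() now has an if statement that picks the row range for each page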

In [7]:

import selenium
import pandas as pd
import time
import random

# Download ChromeDriver first, then set it up
# https://chromedriver.chromium.org/downloads
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

def crawling(driver, start_num, end_num):

    # Create empty lists
    titles_list = []
    links_list = []

    for i in range(start_num, end_num + 1):
        # Set the CSS paths for row i
        titles_css_path = '#revolution_main_table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a > font'
        links_css_path = '#revolution_main_table > tbody > tr:nth-child(' + str(i) + ') > td:nth-child(3) > table > tbody > tr > td:nth-child(2) > div > a'
        # find_elements returns the matching elements as a list
        title_results = driver.find_elements(By.CSS_SELECTOR, titles_css_path)
        link_results = driver.find_elements(By.CSS_SELECTOR, links_css_path)

        # Extract only the needed part and append it to the lists
        for title_result in title_results:
            titles_list.append(title_result.text)
        for link_result in link_results:
            links_list.append(link_result.get_attribute('href'))

    # Build a DataFrame
    return pd.DataFrame({'title': titles_list, 'link': links_list})

def main():

    # Configure ChromeDriver
    CHROME_DRIVER_PATH = './chromedriver.exe'
    service = Service(executable_path=CHROME_DRIVER_PATH)
    options = webdriver.ChromeOptions()
    driver = webdriver.Chrome(service=service, options=options)

    # Set up variables
    df = None
    last_page = 5

    # Run the crawl
    for i in range(1, last_page + 1):
        # Open the URL of the page to crawl
        url = 'https://www.ppomppu.co.kr/zboard/zboard.php?id=ppomppu&page=' + str(i) + '&divpage=85'
        driver.get(url=url)

        if i == 1: # page 1
            start_num, end_num = (9, 47)
        else:
            start_num, end_num = (6, 44)

        # DataFrame crawled from this page
        result_crawl = crawling(driver, start_num, end_num)

        # Append to the DataFrame accumulated so far
        df = pd.concat([df, result_crawl], ignore_index=True)

        # Pause to avoid overloading the server
        time.sleep(random.uniform(2, 5))

    print(f'Successfully extracted {len(df)} posts')
    # df.to_csv('RelativePath/result.csv', index=False)

    # Quit the driver
    driver.quit()

if __name__ == '__main__':
    main()

Successfully extracted 100 posts
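The loop in crawling() visits every index between start_num and end_num, although according to the site analysis only every other row holds a post. A sketch of stepping by 2 instead, assuming the row layout described above still holds; row_range is a hypothetical helper:

```python
def row_range(page):
    """Row indices that hold posts on a given page (hypothetical helper).

    Page 1: odd rows 9..47. Other pages: even rows 6..44.
    """
    if page == 1:
        return range(9, 47 + 1, 2)
    return range(6, 44 + 1, 2)


last_page = 5
counts = {page: len(row_range(page)) for page in range(1, last_page + 1)}
total = sum(counts.values())
print(counts)
print(total)
```

Each page yields 20 candidate rows, so five pages give 100, which is in line with the count printed above.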

License: Attribution-NonCommercial-NoDerivatives
