[Python] web scraping

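The BeautifulSoup library makes simple web scraping easy.

BeautifulSoup parses HTML and XML documents.

BeautifulSoup does not download pages itself, so to scrape a website's html it is paired here with the urllib library from the Python standard library.

Used together, the two libraries can fetch and parse a website's html.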

urllib option

import urllib.request as request

url = "https://990427.tistory.com"

data = request.urlopen(url)

The urlopen() function fetches the web data at the given url and returns a response object; calling read() on it gives the raw bytes.
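As a minimal sketch of reading what urlopen() returns, the helper below reads the response body and decodes it to a string. The name fetch_html, the timeout value, and the fallback to utf-8 are assumptions made for illustration, and the demo uses a data: url so that it runs without network access.

```python
import urllib.request as request


def fetch_html(url, timeout=10):
    """Fetch a url and return its body decoded to str (hypothetical helper)."""
    with request.urlopen(url, timeout=timeout) as response:
        raw = response.read()  # the body arrives as bytes
        # Use the charset declared in the headers; fall back to utf-8 (an assumption)
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")


# A data: url keeps the demo offline; a real page url is passed the same way
demo_url = "data:text/html;charset=utf-8,<h1>Hello</h1>"
page = fetch_html(demo_url)
print(page)
```

Replacing demo_url with a page such as "https://990427.tistory.com" should return that page's html, provided the server accepts the request.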

BeautifulSoup option

from bs4 import BeautifulSoup

html = 'html info'
soup = BeautifulSoup(html, 'html.parser')

Store the html in the html variable, either typed in directly or fetched from a website, and parse it with html.parser, the parser that ships with Python.
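To make the parsing step concrete, here is a small sketch that parses an html string and reads a few values out of the resulting soup object. The sample markup is made up for illustration and does not come from any real page.

```python
from bs4 import BeautifulSoup

# Made-up sample markup, used only for this illustration
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="top">Hello</h1>
    <a href="https://example.com/a">First link</a>
    <a href="https://example.com/b">Second link</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                   # text inside <title>
heading = soup.find("h1", id="top").get_text()   # first <h1> with id="top"
links = [a["href"] for a in soup.find_all("a")]  # href of every <a> tag

print(page_title)
print(heading)
print(links)
```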

Combining the two libraries lets you read the data fetched from a url.

import urllib.request as request
from bs4 import BeautifulSoup

url = "https://990427.tistory.com"
html = request.urlopen(url)

soup = BeautifulSoup(html, 'html.parser')
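The combined snippet above can be wrapped in a function, sketched here under a few assumptions: some servers may reject requests that carry urllib's default User-Agent, so a header is set explicitly, and fetch errors are reported instead of raised. The name fetch_soup and the USER_AGENT value are hypothetical, and the demo again uses a data: url so it runs offline.

```python
import urllib.request as request
from urllib.error import URLError

from bs4 import BeautifulSoup

# Hypothetical User-Agent string, chosen only for this example
USER_AGENT = "Mozilla/5.0 (compatible; example-scraper)"


def fetch_soup(url, timeout=10):
    """Fetch url and return a BeautifulSoup object, or None if the fetch fails."""
    req = request.Request(url, headers={"User-Agent": USER_AGENT})
    try:
        with request.urlopen(req, timeout=timeout) as response:
            # BeautifulSoup accepts the raw bytes and detects the encoding itself
            return BeautifulSoup(response.read(), "html.parser")
    except URLError as error:  # HTTPError is a subclass of URLError
        print(f"Could not fetch {url}: {error.reason}")
        return None


# Offline demo; a real page url is passed the same way
soup = fetch_soup("data:text/html,<title>Demo</title><p>body</p>")
demo_title = soup.title.string
print(demo_title)
```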


Extracting only the post titles from the Programming category of my blog

from bs4 import BeautifulSoup
import urllib.request as request

url = "https://990427.tistory.com/category/Programming"
html = request.urlopen(url)

soup = BeautifulSoup(html, 'html.parser')
# Select every span.title that is a direct child of a div.article-info
titles = soup.select("div.article-info > span.title")
for title in titles:
    print(title.string)
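To show what the selector matches without depending on the live page, the sketch below runs the same select() call on made-up markup that imitates the assumed structure (a div with class article-info holding a span with class title). It also shows one detail worth knowing: .string is None when a tag contains nested tags, whereas get_text() still returns the text.

```python
from bs4 import BeautifulSoup

# Made-up markup imitating the assumed page structure
html = """
<div class="article-info"><span class="title">First post</span><span class="date">2022.02.06</span></div>
<div class="article-info"><span class="title">Second <b>post</b></span></div>
<div class="sidebar"><span class="title">Not an article</span></div>
"""

soup = BeautifulSoup(html, "html.parser")

# "A > B" matches B only when it is a direct child of A,
# so the span inside div.sidebar is not selected
spans = soup.select("div.article-info > span.title")

strings = [span.string for span in spans]                     # None for the span with a nested <b>
texts = [span.get_text(" ", strip=True) for span in spans]    # text is recovered in both cases

print(strings)
print(texts)
```

If the titles on a real page contain nested tags, get_text() is probably the safer choice for printing them.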


For the remaining functions, see www.crummy.com/software/BeautifulSoup/bs4/doc
