sladuf
[Python] web scraping

2020. 9. 25. 15:27

 

 

The BeautifulSoup library makes simple web scraping straightforward.

BeautifulSoup parses HTML and XML.

It does not download pages itself, so to scrape a website's HTML you also need a way to fetch it; this post uses the urllib library for that.

Used together, the two libraries let you scrape the HTML of a website.

 

 

urllib option

import urllib.request as request

url = "https://990427.tistory.com"

data = request.urlopen(url)

The urlopen() function fetches the web data at the given url. It returns a response object; calling read() on it gives the raw bytes of the page.
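
Because urlopen() returns a response object rather than a string, the body still has to be read and decoded before it can be treated as text. A minimal sketch of that step follows. The helper name fetch_html, the timeout value, the browser-like User-Agent header, and the UTF-8 fallback are all assumptions made for illustration, not part of the original post.

```python
import urllib.request as request


def fetch_html(url, timeout=10):
    """Return the body at `url` as text (hypothetical helper, not from the post)."""
    # Some servers reject urllib's default User-Agent, so send a browser-like one.
    req = request.Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with request.urlopen(req, timeout=timeout) as response:
        raw = response.read()  # the body as bytes
        # Use the charset named in the Content-Type header; assume UTF-8 if absent.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling fetch_html("https://990427.tistory.com") would then return the page source as a str, ready to be handed to a parser.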

 

BeautifulSoup option

from bs4 import BeautifulSoup

html = 'html info'
soup = BeautifulSoup(html, 'html.parser')

Store the HTML, either typed in directly or fetched from a website, in the html variable, and parse it with html.parser.

 

Combining the two libraries lets you read the data fetched from a url. BeautifulSoup accepts a file-like object as well as a string, so the response from urlopen() can be passed in directly.

import urllib.request as request
from bs4 import BeautifulSoup

url = "https://990427.tistory.com"
html = request.urlopen(url)

soup = BeautifulSoup(html, 'html.parser')
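
Once the soup object exists, elements can be looked up by tag name. The sketch below parses a small hand-written HTML string instead of a live page, so it runs without network access. The markup, the id, and the link targets are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; not taken from any real site.
html = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <h1 id="top">Hello</h1>
    <a href="/first">First link</a>
    <a href="/second">Second link</a>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string                   # text inside <title>
heading = soup.find("h1").get_text()             # text of the first <h1>
links = [a["href"] for a in soup.find_all("a")]  # href of every <a>

print(page_title)
print(heading)
print(links)
```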

Extracting only the post titles from the Programming category of my blog

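from bs4 import BeautifulSoup
import urllib.request as request

url = "https://990427.tistory.com/category/Programming"
html = request.urlopen(url)

soup = BeautifulSoup(html, 'html.parser')
# Select each <span class="title"> that is a direct child of <div class="article-info">.
titles = soup.select("div.article-info > span.title")
for title in titles:
    # get_text() also works when a title contains nested tags, where .string would be None.
    print(title.get_text().strip())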
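
The selector div.article-info > span.title matches span elements with class title that are direct children of a div with class article-info. The sketch below applies the same selector to invented markup that only imitates such a post list (the real page structure may differ), and shows why get_text() can be a safer choice than .string when a title contains nested tags.

```python
from bs4 import BeautifulSoup

# Invented markup imitating a post list; the real blog's HTML may differ.
html = """
<div class="article-info">
  <span class="title">[Python] web scraping</span>
  <span class="date">2020. 9. 25.</span>
</div>
<div class="article-info">
  <span class="title">[Python] <b>sort()</b> and sorted()</span>
</div>
<div class="sidebar">
  <p><span class="title">Not a post title</span></p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
spans = soup.select("div.article-info > span.title")

# The sidebar span is skipped: its parent is <p>, not div.article-info.
titles = [span.get_text().strip() for span in spans]
print(titles)

# .string is None when an element has more than one child node.
nested_string = spans[1].string
print(nested_string)
```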

For the remaining functions, see www.crummy.com/software/BeautifulSoup/bs4/doc

 
