Beautiful Soup로 크롤링 하기 #1

이제 본격적으로 파이썬으로 크롤링을 하기 위해서 Beautiful Soup에 대해서 좀 더 자세히 알아보도록 하겠습니다.

설치

일단은 아래 명령어를 통해서 BeautifulSoup의 설치를 하도록 하겠습니다.

$ pip install BeautifulSoup4

그리고 URL을 열려면 urlopen이란 메소드도 필요합니다. 아래 명령어를 통해서 설치를 하도록 하겠습니다.

$ pip install urlopen

간단한 사용법

# coding=utf-8
from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://pythonscraping.com/pages/page1.html"
soup = BeautifulSoup(urlopen(url).read(), "html.parser") # html.parser를 이용해 웹 페이지 읽어서 bs4 객체에 담기

print soup.body # 태그 포함해서 가져오기
print soup.body.get_text() # .get_text()를 통해서 태그를 제외하고 텍스트만 가져오기

<body>
<h1>An Interesting Title</h1>
<div>
Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.
</div>
</body>

An Interesting Title

Lorem ipsum dolor sit amet, consectetur adipisicing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.

특정 클래스만 가져오고 싶을때

<span class="green">Green Text</span>

위와 같이 span태그 안에 class="green"인 부분만 가져오고 싶으면 아래와 같이 사용할 수 있습니다.

# coding=utf-8
from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://pythonscraping.com/pages/warandpeace.html"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")

green_list =  soup.find_all("span", {"class":"green"})

for text in green_list:
    print text

<span class="green">Anna
Pavlovna Scherer</span>
<span class="green">Empress Marya
Fedorovna</span>
<span class="green">Prince Vasili Kuragin</span>
<span class="green">Anna Pavlovna</span>
<span class="green">St. Petersburg</span>
<span class="green">the prince</span>
<span class="green">Anna Pavlovna</span>
<span class="green">Anna Pavlovna</span>
<span class="green">the prince</span>
<span class="green">the prince</span>
<span class="green">the prince</span>
<span class="green">Prince Vasili</span>
<span class="green">Anna Pavlovna</span>
<span class="green">Anna Pavlovna</span>
<span class="green">the prince</span>
<span class="green">Wintzingerode</span>
<span class="green">King of Prussia</span>
<span class="green">le Vicomte de Mortemart</span>
<span class="green">Montmorencys</span>
<span class="green">Rohans</span>
<span class="green">Abbe Morio</span>
<span class="green">the Emperor</span>
<span class="green">the prince</span>
<span class="green">Prince Vasili</span>
<span class="green">Dowager Empress Marya Fedorovna</span>
<span class="green">the baron</span>
<span class="green">Anna Pavlovna</span>
<span class="green">the Empress</span>
<span class="green">the Empress</span>
<span class="green">Anna Pavlovna's</span>
<span class="green">Her Majesty</span>
<span class="green">Baron
Funke</span>
<span class="green">The prince</span>
<span class="green">Anna
Pavlovna</span>
<span class="green">the Empress</span>
<span class="green">The prince</span>
<span class="green">Anatole</span>
<span class="green">the prince</span>
<span class="green">The prince</span>
<span class="green">Anna
Pavlovna</span>
<span class="green">Anna Pavlovna</span>

** get_text()를 쓸 때와 태그를 보존할 때

get_text()는 태그를 제거하고 텍스트만 가져올 때 사용합니다, 그렇기 때문에 .get_text()는 항상 마지막, 즉 최종 데이터를 출력하거나 저장, 조작하기 직전에만 써야 합니다. 일반적으로는 문서의 태그 구조를 최대한 유지하는 것이 좋습니다

find()와 findAll()

find() : 가장 먼저 찾는 첫번째 데이터
findAll() : 연관된 데이터 전부 찾아서 리스트로 반환

바로 예제 소스를 보는게 좋기 때문에 바로 예제로 들어가겠습니다.

soup.findAll({"h1", "h2", "h3", "h4", "h5", "h6"}) # h1 ~ h6까지 가져오기

soup.findAll(text="the prince") # text항목을 사용하면 'the prince' 텍스트를 가져옵니다
print (len(nameList)) # 다음과 같이 하면 'the prince'의 개수를 알 수 있습니다

soup.findAll(id="text") # id 속성이 "text"인 함수
soup.findAll("", {"id",:"text"}) # 위와 동일

soup.findAll(class="green") # class가 python의 예약어라 에러 발생
soup.findAll(class_="green") # 해결방법 1
soup.findAll("",{"class":"green"}) # 해결방법 2

트리 이동

html
- body
  - h1
  - table#giftList
    - tr
      - th
      - th
      - th
      - th
    - tr#gift1
      - td
      - td
      - td
      - td
        
        img
    - …

위와 같이 html 트리 구조가 정해져 있다고 가정 하겠습니다.

자식 : 바로 아래의 태그 ex) tr 밑에 td는 자식이자 자손
자손 : 자기 아래의 태그 전부 ex) tr 밑에 img는 자손

자식과 자손 다루기

soup.body.h1 # body 태그 아래의 h1 태그를 의미

soup.body.find("table", {"id":"giftList").children # children을 쓰면 자식만 가져옵니다

# coding=utf-8
from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://pythonscraping.com/pages/page3.html"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")

for child in soup.find("table",{"id":"giftList"}).children:
    print (child)

<tr><th>
Item Title
</th><th>
Description
</th><th>
Cost
</th><th>
Image
</th></tr>


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>


<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>

형제 다루기

# coding=utf-8
from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://pythonscraping.com/pages/page3.html"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")

# sibling : 동기, 형제라는 의미
for sibling in soup.find("table",{"id":"giftList"}).tr.next_siblings:
    print (sibling)

/Users/jungwoon/PycharmProjects/Crawler/venv/bin/python /Users/jungwoon/PycharmProjects/Crawler/example2.py


<tr class="gift" id="gift1"><td>
Vegetable Basket
</td><td>
This vegetable basket is the perfect gift for your health conscious (or overweight) friends!
<span class="excitingNote">Now with super-colorful bell peppers!</span>
</td><td>
$15.00
</td><td>
<img src="../img/gifts/img1.jpg"/>
</td></tr>


<tr class="gift" id="gift2"><td>
Russian Nesting Dolls
</td><td>
Hand-painted by trained monkeys, these exquisite dolls are priceless! And by "priceless," we mean "extremely expensive"! <span class="excitingNote">8 entire dolls per set! Octuple the presents!</span>
</td><td>
$10,000.52
</td><td>
<img src="../img/gifts/img2.jpg"/>
</td></tr>


<tr class="gift" id="gift3"><td>
Fish Painting
</td><td>
If something seems fishy about this painting, it's because it's a fish! <span class="excitingNote">Also hand-painted by trained monkeys!</span>
</td><td>
$10,005.00
</td><td>
<img src="../img/gifts/img3.jpg"/>
</td></tr>


<tr class="gift" id="gift4"><td>
Dead Parrot
</td><td>
This is an ex-parrot! <span class="excitingNote">Or maybe he's only resting?</span>
</td><td>
$0.50
</td><td>
<img src="../img/gifts/img4.jpg"/>
</td></tr>


<tr class="gift" id="gift5"><td>
Mystery Box
</td><td>
If you love suprises, this mystery box is for you! Do not place on light-colored surfaces. May cause oil staining. <span class="excitingNote">Keep your friends guessing!</span>
</td><td>
$1.50
</td><td>
<img src="../img/gifts/img6.jpg"/>
</td></tr>

부모 다루기

다음과 같은 구조로 되어 있을때 예제

tr
- td
- td
- td
  - $15.00
- td
  - <img src="../img/gifts/img1.jpg">

# coding=utf-8
from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://pythonscraping.com/pages/page3.html"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")

print soup.find("img", {"src":"../img/gifts/img1.jpg"}).parent.previous_sibling

<td>
$15.00
</td>

Beautiful Soup에서 정규표현식 다루기

# coding=utf-8
import re
from urllib import urlopen
from bs4 import BeautifulSoup

url = "http://pythonscraping.com/pages/page3.html"
soup = BeautifulSoup(urlopen(url).read(), "html.parser")

images = soup.findAll("img", {"src":re.compile("\.\.\/img\/gifts/img.*\.jpg")})

for image in images:
    print image # bs4.element.Tag 타입

for image in images:
    print image["src"] # ["속성"]을 해주면 해당 속성에 접근할 수 있습니다

<img src="../img/gifts/img1.jpg"/>
<img src="../img/gifts/img2.jpg"/>
<img src="../img/gifts/img3.jpg"/>
<img src="../img/gifts/img4.jpg"/>
<img src="../img/gifts/img6.jpg"/>
../img/gifts/img1.jpg
../img/gifts/img2.jpg
../img/gifts/img3.jpg
../img/gifts/img4.jpg
../img/gifts/img6.jpg