Beautiful Soup로 크롤링 하기 #2

이전 포스팅에서는 정적 페이지에서 크롤링을 하는 방법을 다루었고, 이번 포스팅에서는 단일 페이지 내의 이동과 여러 페이지, 여러 사이트를 이동하는 부분을 다뤄보도록 하겠습니다.

실습

페이지는 위키 피디아에서 진행하도록 하겠습니다, 위키피디아는 하나의 페이지 안에도 관련된 여러 링크들이 있는데, 일단 a 태그를 이용해 특정 페이지의 링크들을 가져와보도록 하겠습니다.

import re
from urllib import urlopen

from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon"), "html.parser")

for link in soup.findAll("a"):
    print link

<a id="top"></a>
<a href="/wiki/Wikipedia:Protection_policy#semi" title="This article is semi-protected to promote compliance with the policy on biographies of living people"><img alt="Page semi-protected" data-file-height="128" data-file-width="128" height="20" src="//upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/20px-Padlock-silver.svg.png" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/30px-Padlock-silver.svg.png 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/f/fc/Padlock-silver.svg/40px-Padlock-silver.svg.png 2x" width="20"/></a>
<a href="#mw-head">navigation</a>
<a href="#p-search">search</a>
<a class="mw-disambig" href="/wiki/Kevin_Bacon_(disambiguation)" title="Kevin Bacon (disambiguation)">Kevin Bacon (disambiguation)</a>
<a class="image" href="/wiki/File:Kevin_Bacon_SDCC_2014.jpg"><img alt="Kevin Bacon SDCC 2014.jpg" data-file-height="2649" data-file-width="1907" height="208" src="//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Kevin_Bacon_SDCC_2014.jpg/150px-Kevin_Bacon_SDCC_2014.jpg" srcset="//upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Kevin_Bacon_SDCC_2014.jpg/225px-Kevin_Bacon_SDCC_2014.jpg 1.5x, //upload.wikimedia.org/wikipedia/commons/thumb/d/d7/Kevin_Bacon_SDCC_2014.jpg/300px-Kevin_Bacon_SDCC_2014.jpg 2x" width="150"/></a>
<a href="/wiki/San_Diego_Comic-Con" title="San Diego Comic-Con">San Diego Comic-Con</a>
<a href="/wiki/Philadelphia" title="Philadelphia">Philadelphia</a>
<a href="/wiki/Pennsylvania" title="Pennsylvania">Pennsylvania</a>
<a href="/wiki/Kyra_Sedgwick" title="Kyra Sedgwick">Kyra Sedgwick</a>
<a href="/wiki/Sosie_Bacon" title="Sosie Bacon">Sosie Bacon</a>
<a href="/wiki/Edmund_Bacon_(architect)" title="Edmund Bacon (architect)">Edmund Bacon</a>
<a href="/wiki/Michael_Bacon_(musician)" title="Michael Bacon (musician)">Michael Bacon</a>
...

위에서 이제 불필요한 태그를 제외하고 href에 있는 주소만 가져와보도록 하겠습니다.

import re
from urllib import urlopen

from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon"), "html.parser")

for link in soup.findAll("a"):
    if 'href' in link.attrs: # 내부에 있는 항목들을 리스트로 가져옵니다 ex) {u'href': u'//www.wikimediafoundation.org/'}
        print (link.attrs['href'])

/w/index.php?title=Special:CiteThisPage&page=Kevin_Bacon&id=834092238
/w/index.php?title=Special:Book&bookcmd=book_creator&referer=Kevin+Bacon
/w/index.php?title=Special:ElectronPdf&page=Kevin+Bacon&action=show-download-screen
/w/index.php?title=Kevin_Bacon&printable=yes
https://commons.wikimedia.org/wiki/Category:Kevin_Bacon
https://af.wikipedia.org/wiki/Kevin_Bacon
https://ar.wikipedia.org/wiki/%D9%83%D9%8A%D9%81%D9%8A%D9%86_%D8%A8%D9%8A%D9%83%D9%86
https://an.wikipedia.org/wiki/Kevin_Bacon
https://ast.wikipedia.org/wiki/Kevin_Bacon
...

결과를 보면 사이드 바, 헤더 링크, 카테고리 페이지 등 불 필요한 링크도 많이 있습니다.

내부의 다른 페이지를 찾으려고 보니 다음과 같은 규칙이 있었습니다.

id가 bodyContent인 div안에 있습니다.
URL에는 세미콜론이 포함되어 있지 않습니다
URL은 /wiki/로 시작합니다.

이런 규칙을 가지고 아래와 같은 코드를 작성해 보겠습니다.

import re
from urllib import urlopen

from bs4 import BeautifulSoup

soup = BeautifulSoup(urlopen("https://en.wikipedia.org/wiki/Kevin_Bacon"), "html.parser")

for link in soup.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    if 'href' in link.attrs:
        print(link.attrs['href'])

/wiki/Erika_Christensen
/wiki/Clifton_Collins_Jr.
/wiki/Benicio_del_Toro
/wiki/Michael_Douglas
/wiki/Miguel_Ferrer
/wiki/Albert_Finney
/wiki/Topher_Grace
/wiki/Luis_Guzm%C3%A1n
/wiki/Amy_Irving
/wiki/Tomas_Milian
/wiki/D._W._Moffett
/wiki/Dennis_Quaid
/wiki/Peter_Riegert
...

이 사이트만 가지고는 하나의 사이트 내의 링크만 가져오게 되는거고, 무작위로 링크를 타고 들어가게 만들려면 다음과 같이 만들어야 합니다.

import random
import re
from urllib import urlopen
import datetime
from bs4 import BeautifulSoup


random.seed(datetime.datetime.now())

def getLinks(articleUrl) :
    html = urlopen("https://en.wikipedia.org" + articleUrl)
    soup = BeautifulSoup(html, "html.parser")
    return soup.find("div", {"id":"bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$"))

links = getLinks("/wiki/Kevin_Bacon")


while len(links) > 0:
    newArticle = links[random.randint(0, len(links)-1)].attrs["href"]
    print(newArticle)
    links = getLinks(newArticle)

** 재귀에 관한 경고
파이썬은 기본적으로 재귀 호출을 1,000회로 제한합니다. 위키 피디아의 링크 네트워크는 대단히
넓기 때문에 이 프로그램은 결국 재귀 제한에 걸려서 멈추게 됩니다. 이를 막으려면 재귀 카운터를 삽입하거나
다른 방법을 강구해야 합니다.

전체 사이트 크롤링 #1 - 동일한 페이지를 중복해서 수집하지 않게 하기

# coding=utf-8
import re
from urllib import urlopen
from bs4 import BeautifulSoup

pages = set() # 애초에 set은 중복을 허용하지 않음
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    soup = BeautifulSoup(html, "html.parser")

    for link in soup.findAll("a", href = re.compile("^(/wiki)")):
        if 'href' in link.attrs: # 위에서 찾은 link에 href 속성이 있는지 확인
            if link.attrs['href'] not in pages: # 새로운 페이지인지 확인
                newPage = link.attrs['href']
                print(newPage)
                pages.add(newPage)
                getLinks(newPage)


getLinks("")

/wiki/Wikipedia
/wiki/Wikipedia:Protection_policy#semi
/wiki/Predictions_of_Wikipedia%27s_end
/wiki/File:Will_Wikipedia_exist_in_20_years-.webm
/wiki/User:Geographyinitiative
/wiki/Wikipedia:WikiProject_East_Asia
/wiki/Wikipedia:WikiProject_East_Asia/About_us
/wiki/Wikipedia:WikiProject_East_Asia/Resources

전체 사이트 크롤링 - 위키 피디아 전체 사이트의 데이터 수집

역시 가장 먼저 해야할일은 사이트의 페이지 몇 개를 살펴보며 패턴을 찾는 일입니다. 위키 백과의 사이트는 아래와 같은 패턴이 있음을 확인할 수 있었습니다 (아래에서 '#' 뒤에는 'id'를 의미합니다)

제목은 항상 h1 태그 안에 있으며, 페이지당 하나만 존재합니다.
바디 텍스트는 div#bodyContent 태그에 들어 있습니다.
첫 번째 문단의 택스트만 선택하려면 div#mw-content-text->p만 선택합니다
편집 링크는 항목 페이지에만 존재합니다. 존재한다면 li#ca-edit -> span -> a로 찾을 수 있습니다.

# coding=utf-8
import re
from urllib import urlopen
from bs4 import BeautifulSoup

pages = set() # 애초에 set은 중복을 허용하지 않음
def getLinks(pageUrl):
    global pages
    html = urlopen("http://en.wikipedia.org" + pageUrl)
    soup = BeautifulSoup(html, "html.parser")

    try:
        print("Title : " + soup.h1.get_text())
        print("First paragraph : " + soup.find(id='mw-content-text').findAll('p')[0].get_text())
        print("Edit page link : " + soup.find(id='ca-edit').find('span').find('a').attrs['href'])
    except AttributeError:
        print("This page is mission something!")


    for link in soup.findAll("a", href = re.compile("^(/wiki)")):
        if 'href' in link.attrs: # 위에서 찾은 link에 href 속성이 있는지 확인
            if link.attrs['href'] not in pages: # 새로운 페이지인지 확인
                newPage = link.attrs['href']
                print("--\nNew Page : "  + newPage)
                pages.add(newPage)
                getLinks(newPage)


getLinks("")

Title : Main Page
First paragraph : The sinking of the RMS Titanic in the early morning of 15 April 1912, four days into the ship's maiden voyage from Southampton to New York City, was one of the deadliest peacetime maritime disasters in history, killing more than 1,500 people. The largest passenger liner in service at the time, Titanic had an estimated 2,224 people on board when she struck an iceberg in the North Atlantic. The ship had received six warnings of sea ice but was travelling at near maximum speed when the lookouts sighted the iceberg. Unable to turn quickly enough, the ship suffered a glancing blow that buckled the starboard (right) side and opened five of sixteen compartments to the sea. The disaster caused widespread outrage over the lack of lifeboats, lax regulations, and the unequal treatment of the three passenger classes during the evacuation. Inquiries recommended sweeping changes to maritime regulations, leading to the International Convention for the Safety of Life at Sea (1914), which continues to govern maritime safety. (Full article...)
This page is mission something!
--
New Page : /wiki/Wikipedia
Title : Wikipedia
First paragraph : Wikipedia (/ˌwɪkɪˈpiːdiə/ ( listen), /ˌwɪkiˈpiːdiə/ ( listen) WIK-ih-PEE-dee-ə) is a multilingual, web-based, free-content encyclopedia project supported by the Wikimedia Foundation and based on a model of openly editable content. Wikipedia is the largest and most popular general reference work on the Internet,[3][4][5] and is named as one of the most popular websites.[6] The project is owned by the Wikimedia Foundation, a non-profit which "operates on whatever monies it receives from its annual fund drives".[7][8][9]
This page is mission something!
--
New Page : /wiki/Wikipedia:Protection_policy#semi
Title : Bad Request
This page is mission something!
--
...

Quiz

아래 선택된 이미지와 같은 카드가 있는 경우 내부의 이미지를 저장하는 소스를 만드세요