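I'm trying to recursively scrape Wikipedia URLs by following every English article link. I want a depth-first traversal down to depth n, but for some reason my code doesn't fetch a new page on each recursive call. Any idea why?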
def crawler(url, depth):
    if depth == 0:
        return None
    links = bs.find("div",{"id" : "bodyContent"}).findAll("a" , href=re.compile("(/wiki/)+([A-Za-z0-9_:()])+"))
    print ("Level ",depth," ",url)
    for link in links:
        if ':' not in link['href']:
            crawler("https://en.wikipedia.org"+link['href'], depth - 1)
This is the call to the crawler:
url = "https://en.wikipedia.org/wiki/Harry_Potter"
html = urlopen(url)
bs = BeautifulSoup(html, "html.parser")
crawler(url,3)
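You need to fetch the page source for each different URL (that is, send a request to that page). Your crawler() function is missing that part. Lines placed outside the function run only once; they are not executed again on each recursive call, so every call keeps parsing the same page.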
import re
from urllib.request import urlopen

from bs4 import BeautifulSoup

def crawler(url, depth):
    if depth == 0:
        return None
    html = urlopen(url)                         # You were missing
    soup = BeautifulSoup(html, 'html.parser')   # these lines.
    # Collect the /wiki/ links inside the main content area of this page.
    links = soup.find("div",{"id" : "bodyContent"}).findAll("a", href=re.compile("(/wiki/)+([A-Za-z0-9_:()])+"))
    print("Level ", depth, url)
    for link in links:
        # Skip hrefs containing ':' (namespaced pages such as File: or Category:).
        if ':' not in link['href']:
            crawler("https://en.wikipedia.org"+link['href'], depth - 1)

url = "https://en.wikipedia.org/wiki/Big_data"
crawler(url, 3)
Partial output:
Level 3 https://en.wikipedia.org/wiki/Big_data
Level 2 https://en.wikipedia.org/wiki/Big_Data_(band)
Level 1 https://en.wikipedia.org/wiki/Brooklyn
Level 1 https://en.wikipedia.org/wiki/Electropop
Level 1 https://en.wikipedia.org/wiki/Alternative_dance
Level 1 https://en.wikipedia.org/wiki/Indietronica
Level 1 https://en.wikipedia.org/wiki/Indie_rock
Level 1 https://en.wikipedia.org/wiki/Warner_Bros._Records
Level 1 https://en.wikipedia.org/wiki/Joywave
Level 1 https://en.wikipedia.org/wiki/Electronic_music
Level 1 https://en.wikipedia.org/wiki/Dangerous_(Big_Data_song)
Level 1 https://en.wikipedia.org/wiki/Joywave
Level 1 https://en.wikipedia.org/wiki/Billboard_(magazine)
Level 1 https://en.wikipedia.org/wiki/Alternative_Songs
Level 1 https://en.wikipedia.org/wiki/2.0_(Big_Data_album)
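Note that Joywave shows up twice in the output above, so the same page gets requested more than once. One possible way to avoid that is to keep a set of URLs already visited. The sketch below is only an illustration of that idea, not part of the original answer: the names crawl, get_links, FAKE_LINKS and fake_get_links are hypothetical, and the link lookup is replaced by an in-memory dictionary so the example runs without network access.

```python
# Minimal sketch: depth-limited DFS that visits each URL at most once.
# All names here (crawl, get_links, FAKE_LINKS, fake_get_links) are
# hypothetical and exist only for illustration.

def crawl(url, depth, get_links, visited=None):
    """Visit url and follow its links depth-first, skipping repeats."""
    if visited is None:
        visited = set()
    if depth == 0 or url in visited:
        return visited
    visited.add(url)
    print("Level", depth, url)
    for link in get_links(url):
        crawl(link, depth - 1, get_links, visited)
    return visited


# Stand-in for real pages: maps a page name to the pages it links to.
FAKE_LINKS = {
    "Big_data": ["Big_Data_(band)", "Joywave"],
    "Big_Data_(band)": ["Joywave", "Electropop", "Joywave"],
    "Joywave": ["Big_data"],
    "Electropop": [],
}

fetched = []  # records every simulated page request, in order


def fake_get_links(url):
    """Pretend to download url and return the links found on it."""
    fetched.append(url)
    return FAKE_LINKS.get(url, [])


seen = crawl("Big_data", 3, fake_get_links)
```

To use this against Wikipedia, get_links would be a function that does the urlopen and BeautifulSoup link extraction from the answer above and returns full URLs. One trade-off to be aware of: a page first reached at the last level is marked as visited, so it is not expanded again if it is reached later at a shallower level.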