Beautiful Soup and large HTML

Question

I am trying to scrape some large Wikipedia pages, such as this one.

Unfortunately, BeautifulSoup cannot handle content this large, and it truncates the page.

Answer 1

I found a way to solve this with BeautifulSoup in beautifulsoup-where-are-you-putting-my-html, which I prefer because I find it easier than lxml.

The only thing you need to do is install html5lib:

pip install html5lib

and pass it to BeautifulSoup as the parser argument:

soup = BeautifulSoup(htmlContent, 'html5lib')

However, if you prefer, you can also use lxml directly, like this:

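import requests
import lxml.html

# lxml's own URL loader does not handle https, so download the page first
r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
doc = lxml.html.fromstring(r.content)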
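
To check whether the parser is really what causes the truncation, it can help to run the same markup through every parser you have installed and compare how many tags each one recovers. The following is only a sketch: the helper name `count_tags_per_parser` and the sample markup are made up for illustration, and parsers that are not installed are simply skipped.

```python
from bs4 import BeautifulSoup, FeatureNotFound


def count_tags_per_parser(html, parsers=("html.parser", "lxml", "html5lib")):
    """Return {parser_name: number_of_tags} for each parser that is installed."""
    counts = {}
    for name in parsers:
        try:
            soup = BeautifulSoup(html, name)
        except FeatureNotFound:
            # The library behind this parser is not installed; skip it.
            continue
        # find_all(True) matches every tag in the parsed tree.
        counts[name] = len(soup.find_all(True))
    return counts


# Hypothetical, deliberately malformed sample (unclosed <p> and <a> tags).
sample = "<div><p>first<p>second</div><a href='/x'>link"

for name, n in count_tags_per_parser(sample).items():
    print(name, n)
```

If one parser reports far fewer tags than the others on your real page, that parser is probably the one dropping content.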

Answer 2

I suggest you fetch the HTML content yourself and then pass it to BS:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
if r.ok:
    # name the parser explicitly so the result does not depend on what is installed
    soup = BeautifulSoup(r.content, 'html.parser')
    # get the div with links at the bottom of the page
    links_div = soup.find('div', id='catlinks')
    for a in links_div.find_all('a'):
        print(a.text)
else:
    print(r.status_code)
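
Note that `soup.find` returns `None` when no matching element exists, so the loop above would raise an `AttributeError` on a page without a `catlinks` div (or on a page that really was truncated before it). A possible way to guard against that is sketched below; the helper name `link_texts` and the sample markup are assumptions for illustration, not part of the original answer.

```python
from bs4 import BeautifulSoup


def link_texts(html, container_id="catlinks", parser="html.parser"):
    """Return the text of every <a> inside the div with the given id."""
    soup = BeautifulSoup(html, parser)
    container = soup.find("div", id=container_id)
    if container is None:
        # The div is missing, e.g. a different page layout or truncated input.
        return []
    return [a.get_text(strip=True) for a in container.find_all("a")]


# Hypothetical sample standing in for a downloaded page.
sample = (
    '<html><body><div id="catlinks">'
    '<a href="/c1">Cat one</a> <a href="/c2">Cat two</a>'
    "</div></body></html>"
)

print(link_texts(sample))
print(link_texts("<p>no such div</p>"))
```

With a real page you would call it as `link_texts(r.content)`, and an empty list then tells you the div was not found rather than crashing.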
