I found a way to solve this with BeautifulSoup in beautifulsoup-where-are-you-putting-my-html, and I went with it because I find it easier than lxml.
The only thing you need to do is install html5lib:
pip install html5lib
and pass it as the parser argument to BeautifulSoup:
soup = BeautifulSoup(htmlContent, 'html5lib')
However, if you prefer, you can also use lxml, like this:
import lxml.html
doc = lxml.html.parse('https://en.wikipedia.org/wiki/Talk:Game_theory')
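One caveat with the call above: lxml.html.parse hands the URL to libxml2's own loader, which may fail on https addresses depending on your build. A minimal sketch of an alternative, assuming you already have the page bytes in hand, is to parse them with lxml.html.fromstring. SAMPLE_HTML and category_link_texts are hypothetical names made up for this illustration, and the markup is a stand-in for a real download:

```python
import lxml.html

# Hypothetical sample markup standing in for a downloaded page.
SAMPLE_HTML = b"""
<html><body>
<div id="catlinks">
  <a href="/wiki/Category:A">Category A</a>
  <a href="/wiki/Category:B">Category B</a>
</div>
</body></html>
"""

def category_link_texts(html_bytes):
    """Return the text of every <a> inside the div with id="catlinks"."""
    doc = lxml.html.fromstring(html_bytes)
    # XPath: every anchor nested anywhere under the div with id="catlinks"
    anchors = doc.xpath('//div[@id="catlinks"]//a')
    return [a.text_content().strip() for a in anchors]

print(category_link_texts(SAMPLE_HTML))
```

In real use you would pass the bytes you downloaded yourself instead of SAMPLE_HTML.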
I suggest you fetch the HTML content yourself and then pass it to BeautifulSoup:
import requests
from bs4 import BeautifulSoup
r = requests.get('https://en.wikipedia.org/wiki/Talk:Game_theory')
if r.ok:
    # parse with html5lib, the parser installed above
    soup = BeautifulSoup(r.content, 'html5lib')
    # get the div with links at the bottom of the page
    links_div = soup.find('div', id='catlinks')
    for a in links_div.find_all('a'):
        print(a.text)
else:
    print(r.status_code)
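Note that soup.find returns None when no matching div exists, so the loop would raise an AttributeError on a page without that block. Here is a minimal sketch of the same extraction with a guard, run on an inline string so it works offline. The helper name catlinks_texts and the sample markup are my own, not part of any library, and I use Python's built-in 'html.parser' here only so the sketch runs without html5lib installed; pass 'html5lib' to match the answer above.

```python
from bs4 import BeautifulSoup

def catlinks_texts(html, parser='html.parser'):
    """Return the link texts inside div#catlinks, or [] if the div is absent."""
    soup = BeautifulSoup(html, parser)
    links_div = soup.find('div', id='catlinks')
    if links_div is None:
        # find() returns None when nothing matches
        return []
    return [a.get_text(strip=True) for a in links_div.find_all('a')]

# Hypothetical sample markup standing in for r.content
sample = ('<div id="catlinks">'
          '<a href="#">Game theory</a> <a href="#">Mathematics</a>'
          '</div>')

print(catlinks_texts(sample))
print(catlinks_texts('<p>no categories here</p>'))
```

With a real response you would call catlinks_texts(r.content, 'html5lib') inside the r.ok branch.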