Beautiful Soup 4: Segmentation fault (core dumped)

Question

I scraped the following page:

http://www.nasa.gov/topics/earth/features/plains-tornadoes-20120417.html

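However, calling BeautifulSoup(page_html) fails with a segmentation fault (core dumped), where page_html is the content returned by the requests library. Is this a bug in BeautifulSoup? Is there any way to work around it? Even something like try...except would help me keep my code running. Thanks in advance.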

The code is as follows:

import requests
from bs4 import BeautifulSoup

toy_url = 'http://www.nasa.gov/topics/earth/features/plains-tornadoes-20120417.html'
res = requests.get(toy_url, headers={"User-Agent": "Firefox/12.0"})
page = res.content
soup = BeautifulSoup(page)

Answer 1

This problem is caused by a bug in lxml that was fixed in lxml 2.3.5. You can upgrade lxml, or use Beautiful Soup with the html5lib or HTMLParser parser instead.
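
As a hedged sketch of that advice, the helper below picks a parser name based on the installed lxml version. The names `choose_parser`, `version_tuple`, and `FIXED_IN` are made up for this example and are not part of Beautiful Soup or lxml. The only fact taken from the answer is that the fix landed in lxml 2.3.5; treating every earlier release as unsafe is an assumption.

```python
import re
from importlib import metadata

# lxml release that contains the fix, per the answer above.
FIXED_IN = (2, 3, 5)


def version_tuple(version):
    """Turn '2.3.4' into (2, 3, 4), ignoring any non-numeric suffix."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        return ()
    return tuple(int(part) for part in match.group().split("."))


def choose_parser(lxml_version=None):
    """Return a parser name to pass as BeautifulSoup's second argument."""
    if lxml_version is None:
        try:
            lxml_version = metadata.version("lxml")
        except metadata.PackageNotFoundError:
            return "html.parser"  # lxml is not installed at all
    if version_tuple(lxml_version) < FIXED_IN:
        return "html.parser"  # assumption: anything before the fix is unsafe
    return "lxml"
```

The result would be passed as the second argument, for example `BeautifulSoup(page, choose_parser())`. `"html.parser"` is the name Beautiful Soup uses for Python's built-in parser, so the fallback needs no extra install.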

Answer 2

This is definitely a bug; it should not be possible to trigger a segfault this way. I can reproduce it (4.0.1):

>>> import bs4, urllib2
>>> url = "http://www.nasa.gov/topics/earth/features/plains-tornadoes-20120417.html"
>>> page = urllib2.urlopen(url).read()
>>> soup = bs4.BeautifulSoup(page)
Segmentation fault

After some bisecting, it looks like the DOCTYPE is the cause:

>>> page[:page.find(">")+1]
'<!DOCTYPE "xmlns:xsl=\'http://www.w3.org/1999/XSL/Transform\'">'

A crude hack lets bs4 parse it:

>>> soup = bs4.BeautifulSoup(page[page.find(">")+1:])
>>> soup.find_all("a")[:3]
[<a href="/home/How_to_enable_Javascript.html" target="_blank">› Learn How</a>, <a href="#maincontent">Follow this link to skip to the main content</a>, <a class="nasa_logo" href="/home/index.html"><span class="hide">NASA - National Aeronautics and Space Administration</span></a>]
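
The slice above also drops the first tag of any page that has no DOCTYPE at all. A slightly more careful variant, sketched here, removes only a leading `<!DOCTYPE ...>` declaration. `strip_doctype` is a hypothetical helper name, and the pattern assumes the declaration contains no `>` before its end (true for this page, not for a DOCTYPE with an internal subset).

```python
import re

# Optional leading whitespace followed by a single <!DOCTYPE ...> declaration.
_DOCTYPE = re.compile(rb"\A\s*<!DOCTYPE[^>]*>", re.IGNORECASE)


def strip_doctype(page: bytes) -> bytes:
    """Return the page without its leading DOCTYPE; other pages pass through unchanged."""
    return _DOCTYPE.sub(b"", page, count=1)
```

`bs4.BeautifulSoup(strip_doctype(page))` would then stand in for the slice. Whether that avoids the crash on other pages is untested here.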

Someone who knows more may be able to see what is really going on, but this might help you get started.
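
On the try...except idea from the question: a segmentation fault kills the whole interpreter, so an `except` clause never gets a chance to run. One possible workaround, sketched below, is to do the parsing in a child process and inspect its exit status. `run_isolated` and `PARSE_SNIPPET` are illustrative names, and the child script assumes bs4 and lxml are installed.

```python
import subprocess
import sys


def run_isolated(code, stdin_bytes=b"", timeout=60):
    """Run `code` in a fresh interpreter and return (returncode, stdout)."""
    result = subprocess.run(
        [sys.executable, "-c", code],
        input=stdin_bytes,
        stdout=subprocess.PIPE,
        timeout=timeout,
    )
    return result.returncode, result.stdout


# Child script: parse the HTML received on stdin and print the link count.
PARSE_SNIPPET = """
import sys
from bs4 import BeautifulSoup
soup = BeautifulSoup(sys.stdin.buffer.read(), "lxml")
print(len(soup.find_all("a")))
"""
```

Calling `run_isolated(PARSE_SNIPPET, page)` keeps the parent alive even if the child crashes. A non-zero return code signals failure; on POSIX a negative value means the child was killed by that signal number (for example -11 for SIGSEGV on Linux), and the caller can then skip the page or retry with another parser.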
