Python web scraping with Beautiful Soup on a messy website

Question

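I want to scrape the following three data points from this website: % verified, the FAR value, and the POD value. I am trying to do this in BeautifulSoup, but I have no practice traversing sites, so I can't describe where these elements are located.

What is the easiest way to do this?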

Answer 1

If you haven't already, install Firebug for Firefox (or use the browser's built-in developer tools) to inspect the page's HTML source.

Use urllib together with BeautifulSoup to handle retrieving and parsing the HTML. Here is a short example:

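from urllib.request import urlopen
from bs4 import BeautifulSoup

url = 'http://mesonet.agron.iastate.edu/cow/?syear=2009&smonth=9&sday=12&shour=12&eyear=2012&emonth=9&eday=12&ehour=12&wfo=ABQ&wtype[]=TO&hail=1.00&lsrbuffer=15&ltype[]=T&wind=58'
# Download the raw HTML of the results page
html = urlopen(url).read()
# Parse it with the parser that ships with Python
soup = BeautifulSoup(html, 'html.parser')

# Dump the parsed document to see what there is to work with
print(soup)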

From here, the links I provided should help you get started on retrieving the elements you are interested in.
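The long query string in the URL above can also be assembled from its parts, which may be handy when the same request has to be repeated for several forecast offices. This is only a minimal Python 3 sketch: `build_cow_url` is a hypothetical helper, and the parameter names and values are simply copied from the URL in the example, not taken from any documentation of the site.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

BASE_URL = 'http://mesonet.agron.iastate.edu/cow/'


def build_cow_url(wfo, wtypes=('TO',), ltypes=('T',)):
    """Return the query URL for one forecast office (hypothetical helper)."""
    # Values copied from the URL used in the answer above.
    params = [
        ('syear', 2009), ('smonth', 9), ('sday', 12), ('shour', 12),
        ('eyear', 2012), ('emonth', 9), ('eday', 12), ('ehour', 12),
        ('wfo', wfo),
        ('hail', '1.00'), ('lsrbuffer', 15), ('wind', 58),
    ]
    # 'wtype[]' and 'ltype[]' may repeat, so they are added one pair per value.
    params += [('wtype[]', w) for w in wtypes]
    params += [('ltype[]', t) for t in ltypes]
    # urlencode percent-encodes the brackets as %5B%5D.
    return BASE_URL + '?' + urlencode(params)


url = build_cow_url('ABQ')
print(url)
```

Passing the pairs as a list rather than a dict keeps repeated keys such as `wtype[]` possible.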

Answer 2

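As That1Guy said, you need to analyse the structure of the source page. In this case you are in luck: the numbers you are looking for are specifically highlighted in red using <span> elements.

This will do it: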

>>> from urllib.request import urlopen
>>> import lxml.html.soupparser
>>> url = ... # put your URL here
>>> html = urlopen(url).read()
>>> soup = lxml.html.soupparser.fromstring(html)
>>> elements = soup.xpath('//th/span')
>>> print(float(elements[0].text)) # FAR
0.67
>>> print(float(elements[1].text)) # POD
0.58

Note that lxml.html.soupparser is pretty much equivalent to the BeautifulSoup parser (which I do not have to hand at the moment).
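If neither lxml nor BeautifulSoup is to hand, the same idea (collect the text of each `<span>` inside a `<th>`) can be sketched with the standard library's `html.parser`. This is a stand-in for the XPath query, not an equivalent: it matches a `<span>` at any depth inside a `<th>`, and it relies on the tags being closed properly, which a messy page may not guarantee. `ThSpanCollector` and the `SAMPLE` fragment are invented for illustration; the real page layout may differ.

```python
from html.parser import HTMLParser


class ThSpanCollector(HTMLParser):
    """Collect the text of every <span> that sits inside a <th>."""

    def __init__(self):
        super().__init__()
        self.th_depth = 0    # greater than 0 while inside a <th>
        self.span_depth = 0  # greater than 0 while inside a <span> within a <th>
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag == 'th':
            self.th_depth += 1
        elif tag == 'span' and self.th_depth:
            self.span_depth += 1
            self.values.append('')

    def handle_endtag(self, tag):
        if tag == 'span' and self.span_depth:
            self.span_depth -= 1
        elif tag == 'th' and self.th_depth:
            self.th_depth -= 1

    def handle_data(self, data):
        # Only text inside a tracked <span> is kept.
        if self.span_depth:
            self.values[-1] += data


# Made-up fragment for illustration only.
SAMPLE = '''
<table>
  <tr><th>FAR: <span style="color: #f00;">0.67</span></th></tr>
  <tr><th>POD: <span style="color: #f00;">0.58</span></th></tr>
  <tr><td><span>not in a header cell</span></td></tr>
</table>
'''

collector = ThSpanCollector()
collector.feed(SAMPLE)
far, pod = (float(v) for v in collector.values)
print(far, pod)
```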

Answer 3

I ended up solving this myself. I used a strategy similar to isedev's, but I wish I could find a better way to get the "verified" data:

from urllib.request import urlopen
from bs4 import BeautifulSoup

URL = 'http://mesonet.agron.iastate.edu/cow/?syear=2009&smonth=9&sday=12&shour=12&eyear=2012&emonth=9&eday=12&ehour=12&wfo=ABQ&wtype%5B%5D=TO&hail=1.00&lsrbuffer=15&ltype%5B%5D=T&wind=58'

def main():
    # Office identifiers, one per line; blank lines are skipped (not used further yet)
    with open(r'C:\Python27\wfo.txt') as f:
        wfo = [line.strip() for line in f if line.strip()]

    soup = BeautifulSoup(urlopen(URL).read())
    elements = soup.find_all("span")
    find_verify = soup.find_all('th')

    far = float(elements[1].text)
    pod = float(elements[2].text)
    # Drop the last character of the cell text (presumably the '%' sign)
    verified = find_verify[13].text[:-1]
    print(far, pod, verified)

if __name__ == '__main__':
    main()
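One possible way to avoid the hard-coded index 13 is to scan the header-cell texts for one that looks like a percentage. The sketch below assumes the verified figure is the first `<th>` whose entire text is a percentage, which would need to be checked against the actual page. `first_percentage` and the sample list are hypothetical and not taken from the real site.

```python
import re

# Matches text that consists only of a number followed by '%', e.g. ' 45% '.
PERCENT = re.compile(r'^\s*(\d+(?:\.\d+)?)\s*%\s*$')


def first_percentage(cell_texts):
    """Return the first cell that is just a percentage, as a float, else None."""
    for text in cell_texts:
        match = PERCENT.match(text)
        if match:
            return float(match.group(1))
    return None


# With BeautifulSoup, the input could come from:
#   cell_texts = [th.get_text() for th in soup.find_all('th')]
# The list below is invented sample data.
cell_texts = ['Warnings', '12', ' 45% ', 'FAR', '0.67']
verified = first_percentage(cell_texts)
print(verified)
```

Unlike slicing off the last character, this yields a number directly and returns `None` when no such cell exists, so a layout change shows up as a missing value instead of a wrong one.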
