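I'm trying to scrape some Box Office Mojo pages with Beautiful Soup to get the Worldwide gross. The code below fetches the Domestic figure just fine, but it stops working when I substitute "Worldwide" for "Domestic Total Gross". Maybe that's because "Worldwide" appears several times on the page, or maybe it's something else.
Any help with a fix? I'm also including the page source for both sections. Thanks!
Here is the page source: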
<center><table border="0" border="0" cellspacing="1" cellpadding="4" bgcolor="#dcdcdc" width="95%"><tr bgcolor="#ffffff"><td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$172,825,435</b></font></td></tr><tr bgcolor="#ffffff"><td valign="top">Distributor: <b><a href="/studio/chart/?studio=mgm.htm">MGM</a></b></td><td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&release=theatrical&date=1988-12-16&p=.htm">December 16, 1988</a></nobr></b></td></tr><tr bgcolor="#ffffff"><td valign="top">Genre: <b>Drama</b></td><td valign="top">Runtime: <b>2 hrs. 13 min.</b></td></tr><tr bgcolor="#ffffff"><td valign="top">MPAA Rating: <b>R</b></td><td valign="top">Production Budget: <b>$25 million</b></td></tr></table> </td>
...skipped...
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td width="35%" align="right"> <b>$354,825,435</b></td>
<td width="25%"> </td>
</tr>
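Here is the Python code:
# Imports assumed; the original post omitted them (Python 2, Beautiful Soup 4).
import re
import urllib2
from bs4 import BeautifulSoup

BOG_titles = ['=RainMan.htm']

def get_movie_value(soup, field_name):
    # Find the first text node on the page that matches field_name.
    obj = soup.find(text=re.compile(field_name))
    if not obj:
        return "Nothing"
    # Take the tag that follows that text node inside the same parent.
    next_sibling = obj.findNextSibling()
    if next_sibling:
        return next_sibling.text
    else:
        return "Still Nothing"

BOG_data = []
for x in BOG_titles:
    y = 'http://www.boxofficemojo.com/movies/?id' + x
    page = urllib2.urlopen(y)
    soup = BeautifulSoup(page)
    m = get_movie_value(soup, "Worldwide")
    title_string = soup.find('title').text
    title = title_string.split('(')[0].strip()
    BOG_data.append([title, m])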
Use the table inside the div.mp_box element to get what you want:
In [1]: from bs4 import BeautifulSoup
In [2]: import requests
In [3]: r = requests.get("http://www.boxofficemojo.com/movies/?id=rainman.htm").content
In [4]: soup = BeautifulSoup(r,"lxml")
In [5]: table = soup.select_one("div.mp_box table")
In [6]: print(table)
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$172,825,435</b></td>
<td align="right" width="25%"> <b>48.7%</b></td>
</tr>
<tr>
<td width="40%">+ <a href="/movies/?page=intl&id=rainman.htm">Foreign:</a></td>
<td align="right" width="35%"> $182,000,000</td>
<td align="right" width="25%"> 51.3%</td>
</tr>
<tr>
<td colspan="3" width="100%"><hr/></td>
</tr>
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%"> <b>$354,825,435</b></td>
<td width="25%"> </td>
</tr>
</table>
In [7]: rows = table.select("tr")
In [8]: rows[0].select_one("td + td").text
Out[8]: u'\xa0$172,825,435'
In [9]: rows[1].select_one("td + td").text
Out[9]: u'\xa0$182,000,000'
In [10]: rows[-1].select_one("td + td").text
Out[10]: u'\xa0$354,825,435'
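The values above are still strings and carry a leading non-breaking space (`\xa0`). If you need numbers, something along these lines should work. It is only a sketch: `parse_gross` is a hypothetical helper name (not part of Beautiful Soup), and it assumes whole-dollar amounts formatted like the ones shown here, so a value such as "$25 million" would not be handled correctly.

```python
import re


def parse_gross(text):
    """Convert a string such as u'\\xa0$172,825,435' to an int number of dollars."""
    # Drop everything that is not an ASCII digit: '$', commas, non-breaking space.
    digits = re.sub(r"[^0-9]", "", text)
    if not digits:
        raise ValueError("no digits found in %r" % (text,))
    return int(digits)


print(parse_gross(u"\xa0$354,825,435"))
```

You would call it on each cell's text, for example `parse_gross(rows[-1].select_one("td + td").text)`.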
To look the values up by their label text instead of by row index:
In [27]: soup = BeautifulSoup(r,"lxml")
In [28]: table = soup.select_one("div.mp_box table")
In [29]: print(table.find("b", text="Domestic:").find_next("td").text)
$172,825,435
In [30]: print(table.find("b", text="Worldwide:").find_next("td").text)
$354,825,435
In [31]: print(table.find("a", text="Foreign:").find_next("td").text)
$182,000,000
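The label lookups above can be wrapped in a single helper. The sketch below makes a few assumptions: it runs against a trimmed, hand-copied version of the table printed earlier instead of the live page, it uses the built-in `html.parser` so it does not need lxml, and `get_gross` is a hypothetical name. It also suggests why `findNextSibling()` comes back empty for "Worldwide", at least for the markup shown in the question: the matched text is the only child of its `<b>` tag, so it has no sibling, whereas `find_next("td")` moves on to the following cell.

```python
import re

from bs4 import BeautifulSoup

# Trimmed copy of the div.mp_box table shown above (not fetched live).
HTML = u"""
<div class="mp_box"><table>
<tr><td><b>Domestic:</b></td><td>&nbsp;<b>$172,825,435</b></td><td>&nbsp;<b>48.7%</b></td></tr>
<tr><td>+ <a href="#">Foreign:</a></td><td>&nbsp;$182,000,000</td><td>&nbsp;51.3%</td></tr>
<tr><td>= <b>Worldwide:</b></td><td>&nbsp;<b>$354,825,435</b></td><td>&nbsp;</td></tr>
</table></div>
"""


def get_gross(table, label):
    """Return the text of the cell after the one containing `label`, or None."""
    text_node = table.find(string=re.compile(re.escape(label)))
    if text_node is None:
        return None
    # find_next walks forward in document order, so it leaves the label's cell.
    cell = text_node.find_next("td")
    if cell is None:
        return None
    return cell.get_text().replace(u"\xa0", u" ").strip()


soup = BeautifulSoup(HTML, "html.parser")
table = soup.select_one("div.mp_box table")

# "Worldwide:" is the only child of its <b> tag, so there is no sibling to find.
print(table.find(string=re.compile("Worldwide")).find_next_sibling())

for label in ("Domestic:", "Foreign:", "Worldwide:"):
    print(label, get_gross(table, label))
```

Matching with a compiled regular expression keeps the helper tolerant of extra whitespace around the label; an exact string match, as in the session above, works as well when the label text is known precisely.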