Scraping the "Worldwide" gross with Beautiful Soup

Question  Votes: 0  Answers: 1

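I'm trying to scrape some Box Office Mojo pages with Beautiful Soup to get the Worldwide gross figure. The code below fetches the Domestic figure fine, but it stops working when I substitute "Worldwide" for "Domestic Total Gross". Perhaps that is because "Worldwide" appears several times on the page, or for some other reason.

Any help with a fix? I'm also including the page source for both sections. Thanks!

The page source is below: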

<center><table border="0" border="0" cellspacing="1" cellpadding="4" bgcolor="#dcdcdc" width="95%"><tr bgcolor="#ffffff"><td align="center" colspan="2"><font size="4">Domestic Total Gross: <b>$172,825,435</b></font></td></tr><tr bgcolor="#ffffff"><td valign="top">Distributor: <b><a href="/studio/chart/?studio=mgm.htm">MGM</a></b></td><td valign="top">Release Date: <b><nobr><a href="/schedule/?view=bydate&release=theatrical&date=1988-12-16&p=.htm">December 16, 1988</a></nobr></b></td></tr><tr bgcolor="#ffffff"><td valign="top">Genre: <b>Drama</b></td><td valign="top">Runtime: <b>2 hrs. 13 min.</b></td></tr><tr bgcolor="#ffffff"><td valign="top">MPAA Rating: <b>R</b></td><td valign="top">Production Budget: <b>$25 million</b></td></tr></table>  </td>

...skipped...

<tr>
<td width="40%">=&nbsp;<b>Worldwide:</b></td>
<td width="35%" align="right">&nbsp;<b>$354,825,435</b></td>
<td width="25%">&nbsp;</td>
</tr>

The Python code is below:

import re
import urllib2  # Python 2 only

from bs4 import BeautifulSoup  # import assumed; the original post omitted it

BOG_titles = ['=RainMan.htm']

def get_movie_value(soup, field_name):
    # Find the first text node that matches field_name.
    obj = soup.find(text=re.compile(field_name))
    if not obj:
        return "Nothing"
    # Return the text of the tag that follows that text node as a sibling.
    next_sibling = obj.findNextSibling()
    if next_sibling:
        return next_sibling.text
    else:
        return "Still Nothing"

BOG_data = []
for x in BOG_titles:
    y = 'http://www.boxofficemojo.com/movies/?id' + x
    page = urllib2.urlopen(y)
    soup = BeautifulSoup(page)
    m = get_movie_value(soup, "Worldwide")
    title_string = soup.find('title').text
    title = title_string.split('(')[0].strip()
    BOG_data.append([title, m])
python web-scraping beautifulsoup
1 Answer
0 votes

Use the table inside the div.mp_box structure to get what you want:

In [1]: from bs4 import BeautifulSoup
In [2]: import requests
In [3]: r = requests.get("http://www.boxofficemojo.com/movies/?id=rainman.htm").content

In [4]: soup = BeautifulSoup(r,"lxml")

In [5]: table = soup.select_one("div.mp_box table")

In [6]: print(table)
<table border="0" cellpadding="0" cellspacing="0">
<tr>
<td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%"> <b>$172,825,435</b></td>
<td align="right" width="25%">   <b>48.7%</b></td>
</tr>
<tr>
<td width="40%">+ <a href="/movies/?page=intl&amp;id=rainman.htm">Foreign:</a></td>
<td align="right" width="35%"> $182,000,000</td>
<td align="right" width="25%">   51.3%</td>
</tr>
<tr>
<td colspan="3" width="100%"><hr/></td>
</tr>
<tr>
<td width="40%">= <b>Worldwide:</b></td>
<td align="right" width="35%"> <b>$354,825,435</b></td>
<td width="25%"> </td>
</tr>
</table>

In [7]: rows = table.select("tr")

In [8]: rows[0].select_one("td + td").text
Out[8]: u'\xa0$172,825,435'

In [9]: rows[1].select_one("td + td").text
Out[9]: u'\xa0$182,000,000'

In [10]: rows[-1].select_one("td + td").text
Out[10]: u'\xa0$354,825,435'
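
The row lookups above still return strings with a leading non-breaking space (`\xa0`). As a minimal sketch, assuming the table keeps the layout printed above, the rows could be collected into a dictionary of integers. `SAMPLE_HTML` is a trimmed copy of that table so the example runs offline, and `parse_dollars` and `gross_by_label` are hypothetical helper names, not part of Beautiful Soup. The built-in `html.parser` is used here instead of `lxml` only to avoid an extra dependency.

```python
import re
from bs4 import BeautifulSoup

# Trimmed copy of the table printed above, so the sketch runs offline.
SAMPLE_HTML = """
<div class="mp_box"><table border="0" cellpadding="0" cellspacing="0">
<tr><td width="40%"><b>Domestic:</b></td>
<td align="right" width="35%">&nbsp;<b>$172,825,435</b></td>
<td align="right" width="25%">&nbsp;&nbsp;&nbsp;<b>48.7%</b></td></tr>
<tr><td width="40%">+&nbsp;<a href="/movies/?page=intl&amp;id=rainman.htm">Foreign:</a></td>
<td align="right" width="35%">&nbsp;$182,000,000</td>
<td align="right" width="25%">&nbsp;&nbsp;&nbsp;51.3%</td></tr>
<tr><td colspan="3" width="100%"><hr/></td></tr>
<tr><td width="40%">=&nbsp;<b>Worldwide:</b></td>
<td align="right" width="35%">&nbsp;<b>$354,825,435</b></td>
<td width="25%">&nbsp;</td></tr>
</table></div>
"""

def parse_dollars(text):
    """Hypothetical helper: keep only the digits of a cell and return an int."""
    return int(re.sub(r"[^0-9]", "", text))

def gross_by_label(html):
    """Hypothetical helper: map each row label to its dollar amount."""
    soup = BeautifulSoup(html, "html.parser")
    table = soup.select_one("div.mp_box table")
    totals = {}
    for row in table.select("tr"):
        cells = row.find_all("td")
        if len(cells) < 2:
            continue  # skip the <hr/> separator row, which has a single cell
        # Drop the non-breaking space and the "+", "=", ":" decorations.
        label = cells[0].get_text().replace(u"\xa0", " ").strip("+= :")
        totals[label] = parse_dollars(cells[1].get_text())
    return totals

totals = gross_by_label(SAMPLE_HTML)
print(totals["Worldwide"])  # prints 354825435
```

Keying on the label rather than on the row index means the lookup does not depend on the row order staying the same.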

To look up a value by its label text without specifying the row:

In [27]: soup = BeautifulSoup(r,"lxml")

In [28]: table = soup.select_one("div.mp_box table")

In [29]: print(table.find("b",  text="Domestic:").find_next("td").text)
 $172,825,435

In [30]: print(table.find("b",  text="Worldwide:").find_next("td").text)
 $354,825,435

In [31]: print(table.find("a",  text="Foreign:").find_next("td").text)
 $182,000,000
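
For the "Worldwide" row quoted in the question, the likely reason the original function fails is that the matched text node is the only child of its `<b>` tag, so it has no next sibling. The "Domestic Total Gross: " text, by contrast, is followed by a `<b>` sibling. The sketch below checks this against that quoted row only (the live page may contain other "Worldwide" matches) and shows a `find_next("td")` variant limited to the `div.mp_box` table. `value_after_label` is a hypothetical helper name. It uses `string=`, the current spelling of the `text=` argument in bs4, and `html.parser` to avoid depending on `lxml`.

```python
import re
from bs4 import BeautifulSoup

# The "Worldwide" row from the page source quoted in the question,
# wrapped in a div.mp_box table as in the printed output above.
ROW_HTML = """
<div class="mp_box"><table>
<tr>
<td width="40%">=&nbsp;<b>Worldwide:</b></td>
<td width="35%" align="right">&nbsp;<b>$354,825,435</b></td>
<td width="25%">&nbsp;</td>
</tr>
</table></div>
"""

soup = BeautifulSoup(ROW_HTML, "html.parser")
table = soup.select_one("div.mp_box table")

# The question's approach: find the text node, then ask for its next sibling.
# "Worldwide:" is the only child of its <b>, so there is no sibling to return.
text_node = table.find(string=re.compile("Worldwide"))
sibling = text_node.find_next_sibling()

def value_after_label(scope, label):
    """Hypothetical helper: text of the first <td> that follows the label."""
    node = scope.find(string=re.compile(label))
    if node is None:
        return None
    cell = node.find_next("td")  # searches forward in the document, not just siblings
    return cell.get_text().strip() if cell is not None else None

worldwide = value_after_label(table, "Worldwide")
print(sibling)    # prints None
print(worldwide)
```

Searching inside `table` rather than the whole `soup` keeps other occurrences of the label elsewhere on the page from being matched first.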