bs4 python找不到文本

问题描述 投票:3回答:2

我有一个html文件,我抓住了美丽的汤。 html的摘录在这个问题的底部。我正在使用beautifulsoup和selenium。

有人告诉我,我只允许每小时提取这么多数据,当我让这个页面等待一段时间(一个小时)。

这就是我试图提取数据的方式:

def get_page_data(self):
    opts = Options()
    opts.headless = True
    assert opts.headless  # Operating in headless mode
    browser_detail = Firefox(options=opts)
    url = self.base_url.format(str(self.tracking_id))
    print(url)
    browser_detail.get(url)
    self.page_data = bs4(browser_detail.page_source, 'html.parser')
    Error_Check = 1 if len(self.page_data.findAll(text='Error Report Number')) > 0 else 0
    Error_Check = 2 if len(self.page_data.findAll(text='exceeded the maximum number of sessions per hour allowed')) > 0 else Error_Check
    print(self.page_data.findAll(text='waiting an hour and trying your query again')). ##<<--- The Problem is this line.
    print(self.page_data)
    return Error_Check

问题是这一行:

print(self.page_data.findAll(text='waiting an hour and trying your query again')). ##<<--- The Problem is this line.

代码无法在页面中找到该行。我错过了什么?谢谢

<html><head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type"/>
<link href="/CMPL/styles/ogm_style.css;jsessionid=rw9pc8-bncrIy_4KSZmJ8BxN2Z2hnKVwcr79Vho4-99gxTPrxNbo!-68716939" rel="stylesheet" type="text/css"/>
<body>
<!-- Content Area -->
<table style="width:100%; margin:auto;">
<tbody><tr valign="top">
<td class="ContentArea" style="width:100%;">
<span id="messageArea">
<!-- /tiles/messages.jsp BEGIN -->
<ul>
</ul><b>
</b><table style="width:100%; margin:auto; white-space: pre-wrap; text-align: left;">
<tbody><tr><td align="left"><b><li><font color="red"></font></li></b></td>
<td align="left"><font color="red">You have exceeded the maximum number of sessions per hour allowed for the public queries. You may still access the public</font></td>
</tr>
<tr><td><font color="red"><li style="list-style: none;"></li></font></td>
<td align="left"><font color="red">queries by waiting an hour and trying your query again. The RRC public queries are provided to facilitate online research and are not intended to be accessed by automated tools or scripts. For questions or concerns please contact the RRC HelpDesk at [email protected] or 512-463-7229</font></td>
</tr>
</tbody></table>
<p>....more html...</p>
</body></html>
python web-scraping beautifulsoup
2个回答
1
投票

我不确定这是你在找什么,但如果你是:

html = [your code above]
from bs4 import BeautifulSoup as bs4
soup = bs4(html, 'lxml')
data = soup.find_all('font', color="red")
data[3].text

输出:

'queries by waiting an hour and trying your query again. The RRC public queries are provided to facilitate online research and are not intended to be accessed by automated tools or scripts. For questions or concerns please contact the RRC HelpDesk at [email protected] or 512-463-7229'

2
投票

您可以使用以下css选择器

tr:last-child:not([valign])

from bs4 import BeautifulSoup as bs
html = '''yourHTML'''    
soup = bs(html, 'lxml')   
item = soup.select_one('tr:last-child:not([valign])')
print(item.text)

如果这返回多个项目,您可以循环列表过滤包含感兴趣字符串的项目。你可以限制只是一个td的选择器,并做类似的事情。

items = soup.select('tr:last-child:not([valign])')
for item in items:
    if 'queries by waiting an hour' in item.text:
        print(item.text)

BeautifulSoup 4.7.1

© www.soinside.com 2019 - 2024. All rights reserved.