Web scraping Yahoo Finance

Question  Votes: 0  Answers: 2

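I am trying to get the scores from the Corporate Governance section of this page: https://ca.finance.yahoo.com/quote/AMP/profile?p=AMP

I am new to Python, and I don't understand why I get an error when scraping the page.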

import requests
from bs4 import BeautifulSoup

# URL of the webpage to be scraped
url = "https://ca.finance.yahoo.com/quote/AMP/profile?p=AMP"

# Send a GET request to the URL and store the response in a variable
response = requests.get(url)

# Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')

# Find the section containing the governance scores
governance_section = soup.find('section', {'class': 'Mt(30px) corporate-governance-container'})

# Find the required elements using their text content and extract their values
scores = governance_section.find('span', text='The pillar scores are')
score_text = scores[0].text.strip() if scores else ''
audit_score = score_text[score_text.find('Audit: ') + 6 : score_text.find(';', score_text.find('Audit: '))].strip()
shareholder_score = score_text[score_text.find('Shareholder Rights: ') + 20 : score_text.find(';', score_text.find('Shareholder Rights: '))].strip()
compensation_score = score_text[score_text.find('Compensation: ') + 14 :].strip()

# Print the extracted information
print("Corporate Governance Score:", governance_section.find('span', {'class': 'Va(m) Fw(600) D(ib) Lh(23px)'}).text.strip())
print("Audit and Risk Oversight Score:", audit_score)
print("Shareholders' Rights Score:", shareholder_score)
print("Compensation Score:", compensation_score)
print("Board Structure Score:", board_score)

This is the error:

AttributeError                            Traceback (most recent call last)
<ipython-input-15-140e65f0615d> in <cell line: 17>()
     15 
     16 # Find the required elements using their text content and extract their values
---> 17 scores = governance_section.find('span', text='The pillar scores are')
     18 score_text = scores[0].text.strip() if scores else ''
     19 audit_score = score_text[score_text.find('Audit: ') + 6 : score_text.find(';', score_text.find('Audit: '))].strip()

AttributeError: 'NoneType' object has no attribute 'find'

This is the HTML source:

Minneapolis, Minnesota.</p></section><section class="Mt(30px) corporate-governance-container"><h2 class="Fz(m) Lh(1) Fw(b) Mt(0) Mb(18px)"><span>Corporate Governance</span></h2><div><p class="Fz(s)"><span>Ameriprise Financial, Inc.’s ISS Governance QualityScore as of <span>April 1, 2023</span> is 5.</span>  <span>The pillar scores are Audit: 3; Board: 7; Shareholder Rights: 5; Compensation: 6.</span></p><div class="Mt(20px)"><span>Corporate governance scores courtesy of</span> <a href="https://issgovernance.com/quickscore" target="_blank" rel="noopener noreferrer" title="Institutional Shareholder Services (ISS)">Institutional Shareholder Services (ISS)</a>.  <span>Scores indicate decile rank relative to index or region. A decile score of 1 indicates lower governance risk, while a 10 indicates higher governance risk.</span></div></div></section></section></div></div><script>if (window.performance) {window.performance.mark && window.performance.mark('Col1-0-Profile');window.performance.measure && window.performance.measure('Col1-0-ProfileDone','PageStart','Col1-0-Profile');}</script></div><div>

Can you help me?

python error-handling attributes
2 Answers
0
votes

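The error occurs because governance_section is None: soup.find() returns None when nothing matches, and calling .find() on None raises the AttributeError. Nothing matches because the response from the server is not what you expect. You actually get <Response [404]>, which means the HTML does not contain the section you are trying to scrape.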

To fix this, send a user-agent header with your request:

requests.get(url, headers={'user-agent': 'custom'})

With this change you should get <Response [200]>, which lets you parse the HTML and extract the information you need.
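As a minimal sketch of the same idea, the request and the lookup can each fail loudly instead of passing None along. The helper names fetch_html and find_governance_section and the 10-second timeout are assumptions for illustration; the header value is the one suggested above, and the class name comes from the HTML in the question. Yahoo may change its markup or its handling of user agents, so treat this as a starting point rather than a guaranteed fix.

```python
import requests
from bs4 import BeautifulSoup

# Header suggested in the answer above; any non-default user agent may work.
HEADERS = {"user-agent": "custom"}


def fetch_html(url):
    """Download the page and raise requests.HTTPError on a 4xx/5xx status."""
    response = requests.get(url, headers=HEADERS, timeout=10)
    response.raise_for_status()  # turns a 404 into an explicit exception
    return response.text


def find_governance_section(html):
    """Return the governance <section>, or raise a clear error if it is absent."""
    soup = BeautifulSoup(html, "html.parser")
    # Match on one class name rather than the full "Mt(30px) ..." class string.
    section = soup.find("section", class_="corporate-governance-container")
    if section is None:
        raise ValueError("corporate governance section not found in the HTML")
    return section
```

Calling find_governance_section(fetch_html(url)) then either returns the section or reports which step went wrong, rather than failing later with "'NoneType' object has no attribute 'find'".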


0
votes

Yes, you need a user agent to connect to it.

With one, I get a 200 status code:

import requests
from bs4 import BeautifulSoup

# URL of the webpage to be scraped
url = "https://ca.finance.yahoo.com/quote/AMP/profile/?p=AMP"

# User agent
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"

# Try requesting the site
res = requests.get(url, headers={'user-agent': user_agent})
print(res)

<Response [200]>
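Once the request succeeds, the scores still have to be pulled out of the sentence shown in the question's HTML ("The pillar scores are Audit: 3; Board: 7; ..."). The question's code slices the string by hand and never defines board_score, so its last print would fail even with a 200 response. The sketch below is one possible alternative, assuming the sentence keeps the "Name: number;" layout; parse_pillar_scores is a hypothetical helper name.

```python
PREFIX = "The pillar scores are"


def parse_pillar_scores(sentence):
    """Map each pillar name to its integer score; return {} if the prefix is missing."""
    _, found, rest = sentence.partition(PREFIX)
    if not found:
        return {}
    scores = {}
    # Entries are separated by ';' and each one looks like "Name: number".
    for item in rest.strip().rstrip(".").split(";"):
        name, sep, value = item.partition(":")
        if sep and value.strip().isdigit():
            scores[name.strip()] = int(value.strip())
    return scores


text = "The pillar scores are Audit: 3; Board: 7; Shareholder Rights: 5; Compensation: 6."
print(parse_pillar_scores(text))
```

With BeautifulSoup, the sentence itself could be located by checking each span's get_text() for the prefix, since an exact-match text= filter on 'The pillar scores are' would not match the full sentence.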
