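I am trying to get the scores from the Corporate Governance section of this page: https://ca.finance.yahoo.com/quote/AMP/profile?p=AMP
I am new to Python, and I don't understand why I get an error when scraping the page.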
import requests
from bs4 import BeautifulSoup
# URL of the webpage to be scraped
url = "https://ca.finance.yahoo.com/quote/AMP/profile?p=AMP"
# Send a GET request to the URL and store the response in a variable
response = requests.get(url)
# Parse the HTML content of the response using BeautifulSoup
soup = BeautifulSoup(response.content, 'html.parser')
# Find the section containing the governance scores
governance_section = soup.find('section', {'class': 'Mt(30px) corporate-governance-container'})
# Find the required elements using their text content and extract their values
scores = governance_section.find('span', text='The pillar scores are')
score_text = scores[0].text.strip() if scores else ''
audit_score = score_text[score_text.find('Audit: ') + 6 : score_text.find(';', score_text.find('Audit: '))].strip()
shareholder_score = score_text[score_text.find('Shareholder Rights: ') + 20 : score_text.find(';', score_text.find('Shareholder Rights: '))].strip()
compensation_score = score_text[score_text.find('Compensation: ') + 14 :].strip()
# Print the extracted information
print("Corporate Governance Score:", governance_section.find('span', {'class': 'Va(m) Fw(600) D(ib) Lh(23px)'}).text.strip())
print("Audit and Risk Oversight Score:", audit_score)
print("Shareholders' Rights Score:", shareholder_score)
print("Compensation Score:", compensation_score)
print("Board Structure Score:", board_score)
AttributeError Traceback (most recent call last)
<ipython-input-15-140e65f0615d> in <cell line: 17>()
15
16 # Find the required elements using their text content and extract their values
---> 17 scores = governance_section.find('span', text='The pillar scores are')
18 score_text = scores[0].text.strip() if scores else ''
19 audit_score = score_text[score_text.find('Audit: ') + 6 : score_text.find(';', score_text.find('Audit: '))].strip()
AttributeError: 'NoneType' object has no attribute 'find'
Here is the HTML source:
Minneapolis, Minnesota.</p></section><section class="Mt(30px) corporate-governance-container"><h2 class="Fz(m) Lh(1) Fw(b) Mt(0) Mb(18px)"><span>Corporate Governance</span></h2><div><p class="Fz(s)"><span>Ameriprise Financial, Inc.’s ISS Governance QualityScore as of <span>April 1, 2023</span> is 5.</span> <span>The pillar scores are Audit: 3; Board: 7; Shareholder Rights: 5; Compensation: 6.</span></p><div class="Mt(20px)"><span>Corporate governance scores courtesy of</span> <a href="https://issgovernance.com/quickscore" target="_blank" rel="noopener noreferrer" title="Institutional Shareholder Services (ISS)">Institutional Shareholder Services (ISS)</a>. <span>Scores indicate decile rank relative to index or region. A decile score of 1 indicates lower governance risk, while a 10 indicates higher governance risk.</span></div></div></section></section></div></div><script>if (window.performance) {window.performance.mark && window.performance.mark('Col1-0-Profile');window.performance.measure && window.performance.measure('Col1-0-ProfileDone','PageStart','Col1-0-Profile');}</script></div><div>
Could you help me?
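The error occurs because governance_section is None: soup.find() found no matching section, because the response from the server is not what you expected. In fact, requests.get(url) returns <Response [404]>, so the HTML does not contain the section you are trying to scrape.
To fix this, send a User-Agent header with your request:
requests.get(url, headers={'user-agent': 'custom'})
With this change you should get <Response [200]>, which lets you parse the HTML and extract the information you need.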
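Once the page loads, the string slicing in the question is still fragile (for example, 'Audit: ' is 7 characters long, not 6, and board_score is never assigned). A minimal sketch of an alternative, assuming the sentence keeps the "Name: number; ..." layout shown in the HTML from the question; parse_pillar_scores is a hypothetical helper name, not part of any library:

```python
PREFIX = "The pillar scores are"

def parse_pillar_scores(sentence):
    """Turn 'The pillar scores are Audit: 3; Board: 7; ...' into a dict."""
    # Keep only the text after the fixed prefix; empty if the prefix is absent.
    _, _, tail = sentence.partition(PREFIX)
    scores = {}
    # Drop the trailing period, then split the "Name: value" pairs on ';'.
    for part in tail.rstrip(". ").split(";"):
        name, sep, value = part.partition(":")
        if sep and value.strip().isdigit():
            scores[name.strip()] = int(value)
    return scores

# Sentence copied from the HTML in the question.
sentence = "The pillar scores are Audit: 3; Board: 7; Shareholder Rights: 5; Compensation: 6."
scores = parse_pillar_scores(sentence)
print(scores)
```

Looking the values up by name (scores["Board"], scores["Audit"], and so on) avoids hand-counted offsets, and an unexpected sentence yields an empty dict rather than wrong slices.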
Yes, you need a user agent to connect to it.
With one, I get a 200 status code:
import requests
from bs4 import BeautifulSoup
# URL of the webpage to be scraped
url = "https://ca.finance.yahoo.com/quote/AMP/profile/?p=AMP"
# User agent
user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"
# Try requesting the site
res = requests.get(url, headers={'user-agent': user_agent})
print(res)
<Response [200]>
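With a 200 response in hand, the section lookup can be guarded so that a missing element gives None instead of an AttributeError. The sketch below is only an illustration: it runs on a trimmed copy of the HTML from the question rather than a live request, and find_pillar_sentence is a hypothetical helper name. It matches on a single class name and on the start of the span text, because find('span', text='The pillar scores are') only matches a span whose entire text equals that string.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML from the question, so no network call is needed.
html = (
    '<section class="Mt(30px) corporate-governance-container">'
    '<h2><span>Corporate Governance</span></h2><div><p class="Fz(s)">'
    '<span>Ameriprise Financial, Inc.’s ISS Governance QualityScore as of '
    '<span>April 1, 2023</span> is 5.</span> '
    '<span>The pillar scores are Audit: 3; Board: 7; '
    'Shareholder Rights: 5; Compensation: 6.</span></p></div></section>'
)

def find_pillar_sentence(html):
    """Return the pillar-score sentence, or None if it is not on the page."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any tag that has this class among its classes.
    section = soup.find("section", class_="corporate-governance-container")
    if section is None:
        return None
    for span in section.find_all("span"):
        text = span.get_text(strip=True)
        if text.startswith("The pillar scores are"):
            return text
    return None

pillar_sentence = find_pillar_sentence(html)
print(pillar_sentence)
```

On the live page, res.text from the request above would take the place of the html string, assuming Yahoo still serves the same markup.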