I am trying to scrape the pitching stats from the URL below and then save the resulting DataFrame to a CSV file.
https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml
My current code is below (Python 3.9.7):
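import pandas as pd

def getBoxScore(team, date):  # parameter names assumed; only the call getBoxScore('ARI','2022-04-07') is shown
    _URL = "https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml"
    data = pd.read_html(_URL,attrs={'id': 'ArizonaDiamondbackspitching'},header=1)[0]
    data.to_csv('boxscore.csv', index=False)  # index expects a bool; the string 'False' is truthy
    return data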
When I run this code, I get the following error:
Traceback (most recent call last):
  File "d:\BaseballAlgo\Baseball_WhoWins.py", line 205, in <module>
    getBoxScore('ARI','2022-04-07')
  File "d:\BaseballAlgo\Baseball_WhoWins.py", line 99, in getBoxScore
    data = pd.read_html(_URL,attrs={'id': 'ArizonaDiamondbackspitching'},header=1)[0]
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 1240, in read_html
    return _parse(
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 1003, in _parse
    raise retained
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 983, in _parse
    tables = p.parse_tables()
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 249, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 598, in _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found
An earlier iteration of the code:
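from io import StringIO
import pandas as pd
from bs4 import BeautifulSoup

session = BRefSession()  # session helper whose definition is not shown in the question
_URL = "https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml"
content = session.get(_URL).content
soup = BeautifulSoup(content, "html.parser")
table = soup.find_all('table', id="ArizonaDiamondbackspitching")
print(table)  # prints []
data = pd.read_html(StringIO(str(table)))[0]  # raises ValueError: No tables found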
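This code runs, and when it prints the table the output is "[]". The last line then produces the same traceback as above.
I understand what the error means, but I don't see how it is possible. It seems that soup.find_all cannot find the specific table I need, but I'm not sure why. How can I fix this?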
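The main issue here is that the table is hidden inside an HTML comment, so you have to bring it out of the comment before BeautifulSoup (or pandas) can find it. I think the simplest solution in this case is to remove the comment markers: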
.replace('<!--','').replace('-->','')
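To see why the lookup comes back empty, here is a minimal offline sketch using only the standard library's HTMLParser. The SAMPLE_HTML string, the TableIdCollector class and the find_table_ids helper are all made up for illustration; they only mimic a table wrapped in a comment and are not the real page or part of any library.

```python
from html.parser import HTMLParser

# Hypothetical miniature page: the table sits inside an HTML comment,
# mimicking the situation described above (this is not the real page).
SAMPLE_HTML = """
<div id="all_pitching">
<!--
<table id="ArizonaDiamondbackspitching">
  <tr><th>Pitching</th><th>IP</th></tr>
  <tr><td>Example Pitcher</td><td>5.0</td></tr>
</table>
-->
</div>
"""


class TableIdCollector(HTMLParser):
    """Collects the id attribute of every <table> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_ids.append(dict(attrs).get("id"))


def find_table_ids(html):
    """Parse the markup and return the ids of all <table> elements found."""
    collector = TableIdCollector()
    collector.feed(html)
    collector.close()
    return collector.table_ids


# Comment contents are treated as comment text, so no <table> tag is seen.
ids_before = find_table_ids(SAMPLE_HTML)

# After removing the comment markers the same markup is parsed as real tags.
uncommented = SAMPLE_HTML.replace("<!--", "").replace("-->", "")
ids_after = find_table_ids(uncommented)

print(ids_before)  # []
print(ids_after)   # ['ArizonaDiamondbackspitching']
```

The parser hands everything between the comment markers over as comment text, which is why a search for table elements finds nothing until the markers are stripped.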
Example with requests and pandas:
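from io import StringIO

import pandas as pd
import requests

url = 'https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml'
# Strip the comment markers so the commented-out tables become regular markup
html = requests.get(url).text.replace('<!--', '').replace('-->', '')
# Wrap the string in StringIO; recent pandas versions deprecate passing literal HTML
df = pd.read_html(StringIO(html), attrs={'id': 'ArizonaDiamondbackspitching'})[0]
df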
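If removing every comment marker on the page feels too broad, a possible variation is to pull out only the comment that contains the wanted table. The sketch below is an assumption-laden illustration, not part of the answer above: extract_commented_table and the sample string are hypothetical, and the helper assumes the id attribute is written with double quotes.

```python
import re

# Matches one HTML comment and captures its body (non-greedy, across lines).
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)


def extract_commented_table(html, table_id):
    """Return the body of the first comment holding a table with this id, else None.

    Hypothetical helper; assumes the attribute is written as id="...".
    """
    marker = f'id="{table_id}"'
    for match in COMMENT_RE.finditer(html):
        body = match.group(1)
        if "<table" in body and marker in body:
            return body.strip()
    return None


# Made-up sample markup with one unrelated comment and one commented-out table.
sample = (
    '<!-- unrelated note -->'
    '<div><!--\n<table id="ArizonaDiamondbackspitching">'
    '<tr><td>5.0</td></tr></table>\n--></div>'
)

table_html = extract_commented_table(sample, "ArizonaDiamondbackspitching")
print(table_html)
```

The returned string could then be wrapped in StringIO and passed to pd.read_html in the same way as the full page. Unrelated comments elsewhere in the page stay untouched.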