How do I scrape a table from Baseball Reference using pandas and Beautiful Soup?

I am trying to scrape the pitching stats from this URL and then save the DataFrame to a CSV file.

https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml

My current code is as follows (Python 3.9.7):

_URL = "https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml"
data = pd.read_html(_URL,attrs={'id': 'ArizonaDiamondbackspitching'},header=1)[0]
data.to_csv('boxscore.csv', index='False')
return data

When I run this code, I get the following error:

Traceback (most recent call last):
  File "d:\BaseballAlgo\Baseball_WhoWins.py", line 205, in <module>
    getBoxScore('ARI','2022-04-07')
  File "d:\BaseballAlgo\Baseball_WhoWins.py", line 99, in getBoxScore
    data = pd.read_html(_URL,attrs={'id': 'ArizonaDiamondbackspitching'},header=1)[0]
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 1240, in   read_html
    return _parse(
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 1003, in _parse
    raise retained
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 983, in   _parse
    tables = p.parse_tables()
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 249, in parse_tables
    tables = self._parse_tables(self._build_doc(), self.match, self.attrs)
  File "D:\BaseballAlgo\.venv\lib\site-packages\pandas\io\html.py", line 598, in   _parse_tables
    raise ValueError("No tables found")
ValueError: No tables found

An earlier iteration of the code:

session = BRefSession()
_URL = "https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml"
content =session.get(_URL).content
soup = BeautifulSoup(content, "html.parser")
table = soup.find_all('table', id="ArizonaDiamondbackspitching")
print (table)
data = pd.read_html(StringIO(str(table)))[0]

This code runs, but when it prints the table the output is "[]". The last line then raises the same traceback as above.

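I understand what the error means, but I don't understand how it is possible. It seems that soup.find_all cannot find the specific table I need, and I'm not sure why. How can I fix this?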

python pandas web-scraping beautifulsoup
1 Answer

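The main issue here is that the table is hidden inside an HTML comment (<!-- ... -->), so you have to uncomment it before BeautifulSoup or pandas can find it. I think the simplest solution in this case is to strip the comment markers from the page source: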

.replace('<!--','').replace('-->','')
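
To see offline why stripping the markers helps, here is a minimal sketch that uses only the standard library. The SAMPLE_HTML string is a made-up miniature of the page, not the real markup, and TableIdCollector and find_table_ids are hypothetical helper names introduced for illustration. An HTML parser hands comment content to a separate comment handler, so the tags inside it are never seen as elements:

```python
from html.parser import HTMLParser

# Hypothetical miniature of the page (an assumption, not the real markup):
# the table sits inside an HTML comment.
SAMPLE_HTML = """
<div>
<!--
<table id="ArizonaDiamondbackspitching">
  <tr><th>Pitching</th><th>IP</th></tr>
  <tr><td>Pitcher A</td><td>6.0</td></tr>
</table>
-->
</div>
"""


class TableIdCollector(HTMLParser):
    """Collects the id attribute of every <table> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_ids.append(dict(attrs).get("id"))


def find_table_ids(html):
    """Return the ids of all <table> elements visible to the parser."""
    collector = TableIdCollector()
    collector.feed(html)
    collector.close()
    return collector.table_ids


# Same replacement as above: drop the comment markers, keep their content.
uncommented = SAMPLE_HTML.replace("<!--", "").replace("-->", "")

print(find_table_ids(SAMPLE_HTML))  # table is inside a comment, so not seen
print(find_table_ids(uncommented))  # table is now regular markup
```

The first call finds no tables and the second finds the one with the expected id, which matches the empty list and the "No tables found" error in the question.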

Example with requests and pandas:

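from io import StringIO

import pandas as pd
import requests

url = 'https://www.baseball-reference.com/boxes/ARI/ARI202204070.shtml'

# Strip the comment markers so the commented-out tables become regular markup.
html = requests.get(url).text.replace('<!--', '').replace('-->', '')

# Wrapping the text in StringIO avoids the literal-HTML deprecation warning in pandas >= 2.1.
df = pd.read_html(StringIO(html), attrs={'id': 'ArizonaDiamondbackspitching'})[0]
df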
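
If you would rather stay with BeautifulSoup, as in the earlier attempt from the question, an alternative to string replacement is to walk the Comment nodes and parse their content separately. This is only a sketch under assumptions: SAMPLE_HTML is a made-up stand-in for the downloaded page, and find_commented_table is a hypothetical helper name, not part of bs4.

```python
from bs4 import BeautifulSoup, Comment

# Hypothetical stand-in for the downloaded page (an assumption, not real markup).
SAMPLE_HTML = """
<div>
<!--
<table id="ArizonaDiamondbackspitching">
  <tr><th>Pitching</th><th>IP</th></tr>
  <tr><td>Pitcher A</td><td>6.0</td></tr>
</table>
-->
</div>
"""


def find_commented_table(html, table_id):
    """Return the <table> with the given id, also searching inside HTML comments."""
    soup = BeautifulSoup(html, "html.parser")

    # Check the regular document first in case the table is not commented out.
    table = soup.find("table", id=table_id)
    if table is not None:
        return table

    # Otherwise parse the content of each comment as its own HTML fragment.
    for comment in soup.find_all(string=lambda text: isinstance(text, Comment)):
        inner = BeautifulSoup(comment, "html.parser")
        table = inner.find("table", id=table_id)
        if table is not None:
            return table
    return None


table = find_commented_table(SAMPLE_HTML, "ArizonaDiamondbackspitching")

# Flatten the table into a list of rows to confirm it was recovered.
rows = [
    [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    for tr in table.find_all("tr")
]
print(rows)
```

Once the table is found, str(table) can be wrapped in StringIO and passed to pd.read_html, as the last line of the earlier attempt in the question does.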