How would I modify the attached Python script so that the MLB standings results are formatted correctly when saved to Excel?
from selenium import webdriver
from selenium.webdriver.common.by import By
import pandas as pd
# Setup WebDriver
driver = webdriver.Chrome(executable_path='/usr/local/bin/chromedriver')
# Open the page
driver.get('https://www.espn.com/mlb/standings')
# Wait for JavaScript to load (adjust time as needed)
driver.implicitly_wait(10)
# Extract data
standings = []
for row in driver.find_elements(By.CSS_SELECTOR, 'tr'):
    cells = row.find_elements(By.CSS_SELECTOR, 'td, th')
    if cells:
        standings.append([cell.text for cell in cells])
# Convert to DataFrame
df = pd.DataFrame(standings)
# Save to Excel
df.to_excel('mlb_standings.xlsx', index=False)
# Print or process the data
for team in standings:
    print(team)
# Clean up
driver.quit()
Results are being returned, but they are not formatted correctly...
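If you want to format the results better, you should fetch the data in a better way. One <table> contains only the names, and another <table> contains only the values. So you should first get all the tables, then use the first table to get the names and the second table to get the values. Later you can use the values to create a DataFrame, and add the names as another column in that DataFrame.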
# Get every table on the page
tables = driver.find_elements(By.CSS_SELECTOR, 'table')
# The first table holds the names: keep the first cell of each row
names = []
for row in tables[0].find_elements(By.CSS_SELECTOR, 'tr'):
    cells = row.find_elements(By.CSS_SELECTOR, 'td, th')
    print(cells)  # debug output
    if cells:
        names.append([cell.text for cell in cells][0])
# The second table holds the values: keep every cell of each row
values = []
for row in tables[1].find_elements(By.CSS_SELECTOR, 'tr'):
    cells = row.find_elements(By.CSS_SELECTOR, 'td, th')
    if cells:
        values.append([cell.text for cell in cells])
df = pd.DataFrame(values)
df['names'] = names
If you want names as the first column:
columns = df.columns.tolist()
columns = columns[-1:] + columns[:-1] # move last to first
df = df[columns]
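As an alternative to reordering the columns afterwards, the names can be placed at position 0 directly with `DataFrame.insert`. This is only a minimal sketch: the `names` and `values` lists below are hypothetical sample data standing in for what the scraper collects.

```python
import pandas as pd

# Hypothetical sample data standing in for the scraped names and values
names = ['Team A', 'Team B']
values = [['10', '5'], ['8', '7']]

df = pd.DataFrame(values)
# Insert the new column at position 0 instead of appending and reordering
df.insert(0, 'names', names)
first_column = df.columns[0]
```

Both approaches should give the same column order; `insert` just skips the intermediate step.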
There is another issue: the page has 4 tables.
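If you also want the National League, you have to repeat the same steps with tables[2] (to get the names) and tables[3] (to get the values). Later you have to merge/concatenate those rows with the previous DataFrame, or write them out as a separate DataFrame.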
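The concatenation step could look roughly like the sketch below. The helper `build_league_frame`, the `league` column, and the sample data are all hypothetical and only illustrate the idea; in the real script the names and values would come from tables[0]/tables[1] and tables[2]/tables[3].

```python
import pandas as pd

def build_league_frame(names, values, league):
    """Hypothetical helper: combine names and value rows for one league."""
    df = pd.DataFrame(values)
    df.insert(0, 'names', names)    # names as the first data column
    df.insert(0, 'league', league)  # label each row with its league
    return df

# Hypothetical sample data standing in for the scraped tables
american = build_league_frame(
    ['Team A', 'Team B'], [['10', '5'], ['8', '7']], 'American League')
national = build_league_frame(
    ['Team C', 'Team D'], [['9', '6'], ['7', '8']], 'National League')

# Stack both leagues into one DataFrame with a fresh 0..n-1 index
combined = pd.concat([american, national], ignore_index=True)
```

The combined frame can then be saved the same way as in the question, for example with combined.to_excel('mlb_standings.xlsx', index=False).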