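I'm building a web scraper with Python and Selenium to scrape Basketball Reference, and I need some help fine-tuning the XPath expression that returns the data I'm looking for. Specifically, I need an XPath expression that returns every column except the last one, the "Awards" column, which sometimes contains text (if the player won any kind of award that year) and is otherwise blank. My code works and mostly selects what I need, but every variation of the XPath expression I've tried is either not a valid XPath expression or gives me all of the data including the last column, which I don't want. Here is my working code, including the Selenium driver setup, which retrieves every element of the table and returns it.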
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import pandas as pd
class PlayerPerGameStats():
    def __init__(self, player_name):
        self.player_name = player_name.lower()
        self.options = Options()
        #No popup window when called
        self.options.add_argument("--headless=new")
        #No image loading for performance
        self.options.add_experimental_option(
            "prefs", {
                "profile.managed_default_content_settings.images" : 2,
            }
        )
        self.browser = webdriver.Chrome(options=self.options)
        self.url = f"https://www.basketball-reference.com/players/{self.player_name[0]}/{self.player_name}01.html"
        self.browser.get(self.url)
        #Add wait for page load
        WebDriverWait(self.browser, 10).until(
            EC.presence_of_element_located((By.ID, 'per_game_stats'))
        )

    def get_player_row_stats(self) -> list:
        try:
            table = self.browser.find_element(By.ID, 'per_game_stats')
            rows = table.find_elements(By.XPATH, './tbody')
            stat_rows = [row.text for row in rows[0].find_elements(By.XPATH, './tr')]
            #List split to get each stat as its own index
            player_data = [y for x in stat_rows for y in x.split(' ')]
            print(player_data)
            return player_data
        except Exception as e:
            print(f"Error extracting row stats: {e}")
            return None

#To run it
stats = PlayerPerGameStats("lillada")
player_stats = stats.get_player_row_stats()
Here is the DOM snippet I'm working with.
Some of the XPath variants I've tried include:
stat_rows = [row.text for row in rows[0].find_elements(By.XPATH, './tr[position() < last()]')]
stat_rows = [row.text for row in rows[0].find_elements(By.XPATH, "./tr[not(contains(@data-stat, 'awards'))]")]
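But neither of these works; they either return every column, including the one mentioned above, or return nothing at all.
Thank you for taking the time to read this. If any additional information or code is needed, I'm more than happy to provide it. This problem has been bugging me for weeks and I just want to figure out how to solve it.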
I changed your locators and your approach. This excludes the last column of each row.
Working code:
from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
player_name = "lillada"
url = f"https://www.basketball-reference.com/players/{player_name[0]}/{player_name}01.html"
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)
wait = WebDriverWait(driver, 10)
for row in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "#per_game_stats tbody tr"))):
    player_data = [cell.text for cell in row.find_elements(By.CSS_SELECTOR, "th,td")[:-1]]
    print(player_data)
Output:
['2012-13', '22', 'POR', 'NBA', 'PG', '82', '82', '38.6', '6.7', '15.7', '.429', '2.3', '6.1', '.368', '4.5', '9.6', '.469', '.501', '3.3', '3.9', '.844', '0.5', '2.6', '3.1', '6.5', '0.9', '0.2', '3.0', '2.1', '19.0']
['2013-14', '23', 'POR', 'NBA', 'PG', '82', '82', '35.8', '6.7', '15.9', '.424', '2.7', '6.8', '.394', '4.1', '9.1', '.447', '.508', '4.5', '5.2', '.871', '0.4', '3.1', '3.5', '5.6', '0.8', '0.3', '2.4', '2.4', '20.7']
['2014-15', ...
The simplest solution is simply
stat_rows = [row.text for row in rows[0].find_elements(By.XPATH, './tr')][:-1]
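One thing worth checking before relying on this: the slice is applied to the list of rows, so it appears to drop the last row rather than the last column. A minimal sketch with a few values copied from the table (the nested-list layout is my own stand-in for the scraped data, not what Selenium returns directly):

```python
# Each inner list stands for the cell texts of one row: season, age, points, awards.
table = [
    ["2012-13", "22", "19.0", "ROY-1"],
    ["2013-14", "23", "20.7", "AS,NBA3"],
    ["2016-17", "26", "27.0", ""],  # no award that season
]

# Slicing the outer list removes the last ROW.
without_last_row = table[:-1]

# Slicing each inner list removes the last COLUMN.
without_last_column = [cells[:-1] for cells in table]

print(without_last_row)
print(without_last_column)
```

To drop the column, the slice has to be applied to the cells of each row, which also avoids splitting row.text on spaces, where a blank Awards cell leaves no token to remove.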
I don't know XPath, but with a CSS selector you can do
stat_rows = [row.text for row in rows[0].find_elements(By.CSS_SELECTOR, 'tr td:not(:last-of-type)')]
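May I suggest not using Selenium here? It's overkill when the data is already in the page's source HTML. Just fetch the HTML and let pandas parse the table (it uses a parser such as lxml or BeautifulSoup under the hood). Some tables sit inside HTML comments, so for those tables you need to strip the comment markers first (which this code does).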
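To see why the comment markers matter, here is a small standard-library sketch. The page fragment and the hidden_table id are made up for illustration; the point is only that an HTML parser does not look inside comments, so a commented-out table is invisible until the markers are removed.

```python
from html.parser import HTMLParser


class TableIdCollector(HTMLParser):
    """Records the id attribute of every <table> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_ids.append(dict(attrs).get("id"))


def find_table_ids(html: str) -> list:
    collector = TableIdCollector()
    collector.feed(html)
    return collector.table_ids


# Hypothetical fragment: one visible table and one wrapped in an HTML comment.
page = (
    '<table id="per_game_stats"><tr><td>19.0</td></tr></table>'
    '<!-- <table id="hidden_table"><tr><td>1</td></tr></table> -->'
)

# Same replacement as in the answer's code: drop the comment markers.
uncommented = page.replace("<!--", "").replace("-->", "")

print(find_table_ids(page))
print(find_table_ids(uncommented))
```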
The second thing to fix is that you never actually call your get_player_row_stats() function.
Code:
import requests
import pandas as pd
from io import StringIO

class PlayerPerGameStats():
    def __init__(self, player_name):
        self.player_name = player_name.lower()
        self.url = f"https://www.basketball-reference.com/players/{self.player_name[0]}/{self.player_name}01.html"
        response = requests.get(self.url)
        if response.status_code == 200:
            # Strip the comment markers so commented-out tables become parseable
            self.html = response.text.replace('<!--', '').replace('-->', '')
        else:
            raise Exception(f"Failed to fetch data. HTTP Status Code: {response.status_code}")

    def get_player_row_stats(self) -> pd.DataFrame:
        try:
            # Wrap the HTML in StringIO; recent pandas versions deprecate passing a literal HTML string
            player_data = pd.read_html(StringIO(self.html), attrs={'id':'per_game_stats'})[0]
            print(player_data)
            return player_data
        except Exception as e:
            print(f"Error extracting row stats: {e}")
            return None

#To run it
stats = PlayerPerGameStats("lillada")
player_stats = stats.get_player_row_stats()
Output:
Season Age Team ... PF PTS Awards
0 2012-13 22 POR ... 2.1 19.0 ROY-1
1 2013-14 23 POR ... 2.4 20.7 AS,NBA3
2 2014-15 24 POR ... 2.0 21.0 AS
3 2015-16 25 POR ... 2.2 25.1 MVP-8,NBA2
4 2016-17 26 POR ... 2.0 27.0 NaN
5 2017-18 27 POR ... 1.6 26.9 MVP-4,AS,NBA1
6 2018-19 28 POR ... 1.9 25.8 MVP-6,AS,NBA2
7 2019-20 29 POR ... 1.7 30.0 MVP-8,AS,NBA2
8 2020-21 30 POR ... 1.5 28.8 MVP-7,AS,NBA2
9 2021-22 31 POR ... 1.3 24.0 NaN
10 2022-23 32 POR ... 1.9 32.2 CPOY-10,AS,NBA3
11 2023-24 33 MIL ... 1.8 24.3 CPOY-11,AS
12 2024-25 34 MIL ... 1.8 25.9 NaN
13 13 Yrs 13 Yrs 13 Yrs ... 1.9 25.1 NaN
14 NaN NaN NaN ... NaN NaN NaN
15 POR (11 Yrs) POR (11 Yrs) POR (11 Yrs) ... 1.9 25.2 NaN
16 MIL (2 Yrs) MIL (2 Yrs) MIL (2 Yrs) ... 1.8 24.6 NaN
[17 rows x 31 columns]
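The DataFrame above still contains the Awards column the question wants to exclude, plus the career and per-team summary rows. A minimal sketch of trimming both, using a small made-up frame in place of the real one (the column names Season, PTS and Awards are taken from the output above; the season-pattern filter is my own assumption about which rows count as seasons):

```python
import pandas as pd

# Small stand-in for the DataFrame returned by get_player_row_stats().
player_data = pd.DataFrame({
    "Season": ["2012-13", "2013-14", "13 Yrs"],
    "PTS": [19.0, 20.7, 25.1],
    "Awards": ["ROY-1", "AS,NBA3", None],
})

# Drop the column by name; errors="ignore" avoids a KeyError if it is absent.
trimmed = player_data.drop(columns=["Awards"], errors="ignore")

# Keep only rows whose Season looks like "2012-13", which removes summary rows.
season_rows = trimmed[trimmed["Season"].str.fullmatch(r"\d{4}-\d{2}", na=False)]

print(season_rows)
```

If the column should be dropped by position instead of by name, player_data.iloc[:, :-1] does the same thing as long as Awards stays the last column.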