XPath - Select all columns except the last one

Question  Votes: 0  Answers: 3

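I'm building a web scraper with Python and Selenium to scrape Basketball Reference, and I need help fine-tuning the XPath expression that returns the data I'm after. Specifically, I need an XPath expression that returns every column except the last one, the "Awards" column, which sometimes contains text (if the player won any kind of award that year) and is otherwise blank. My code works and mostly selects what I need, but every XPath variation I've tried is either not a valid XPath expression or returns all of the data, including the last column, which I don't need. Below is my working code, including the Selenium driver code that retrieves each element of the table and returns it.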

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import NoSuchElementException, TimeoutException
import pandas as pd

class PlayerPerGameStats():
    def __init__(self, player_name):
        self.player_name = player_name.lower()
        self.options = Options()

        #No popup window when called
        self.options.add_argument("--headless=new")

        #No image loading for performance
        self.options.add_experimental_option(
            "prefs", {
                "profile.managed_default_content_settings.images" : 2,
            }
        )
        self.browser = webdriver.Chrome(options=self.options)
        self.url = f"https://www.basketball-reference.com/players/{self.player_name[0]}/{self.player_name}01.html"
        self.browser.get(self.url)

        #Add wait for page load
        WebDriverWait(self.browser, 10).until(
            EC.presence_of_element_located((By.ID, 'per_game_stats'))
        )

    def get_player_row_stats(self) -> list:
        try:
            table = self.browser.find_element(By.ID, 'per_game_stats')
            rows = table.find_elements(By.XPATH, './tbody')
            stat_rows = [row.text for row in rows[0].find_elements(By.XPATH, './tr')]

            #Split each row's text so every stat gets its own index
            player_data = [y for x in stat_rows for y in x.split(' ')]

            print(player_data)

            return player_data

        except Exception as e:
            print(f"Error extracting row stats: {e}")
            return None


#To run it
stats = PlayerPerGameStats("lillada")
player_stats = stats.get_player_row_stats()

Here is a snippet of the DOM I'm working with.

[Image: snippet of the Basketball Reference DOM]

Some of the XPath variations I've tried include:

stat_rows = [row.text for row in rows[0].find_elements(By.XPATH, './tr[position() < last()]')]
stat_rows = [row.text for row in rows[0].find_elements(By.XPATH, "./tr[not(contains(@data-stat, 'awards'))]")]

But neither of these works: they either return every column, as above, or return nothing at all.

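Thanks for taking the time to read this. If any other information or code is needed, I'm more than happy to provide it. This problem has been bugging me for weeks, and I just want to figure out how to solve it.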

python selenium-webdriver web-scraping xpath
3 Answers
1
vote

I changed your locator and your approach. This excludes the last column of each row.

Working code:

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

player_name = "lillada"
url = f"https://www.basketball-reference.com/players/{player_name[0]}/{player_name}01.html"
driver = webdriver.Chrome()
driver.maximize_window()
driver.get(url)

wait = WebDriverWait(driver, 10)
for row in wait.until(EC.visibility_of_all_elements_located((By.CSS_SELECTOR, "#per_game_stats tbody tr"))):
    player_data = [cell.text for cell in row.find_elements(By.CSS_SELECTOR, "th,td")[:-1]]
    print(player_data)

Output

['2012-13', '22', 'POR', 'NBA', 'PG', '82', '82', '38.6', '6.7', '15.7', '.429', '2.3', '6.1', '.368', '4.5', '9.6', '.469', '.501', '3.3', '3.9', '.844', '0.5', '2.6', '3.1', '6.5', '0.9', '0.2', '3.0', '2.1', '19.0']
['2013-14', '23', 'POR', 'NBA', 'PG', '82', '82', '35.8', '6.7', '15.9', '.424', '2.7', '6.8', '.394', '4.1', '9.1', '.447', '.508', '4.5', '5.2', '.871', '0.4', '3.1', '3.5', '5.6', '0.8', '0.3', '2.4', '2.4', '20.7']
['2014-15', ...

0
votes

The simplest solution is simply

stat_rows = [row.text for row in rows[0].find_elements(By.XPATH, './tr')][:-1]

I don't know XPath, but with a CSS selector you can do

stat_rows = [row.text for row in rows[0].find_elements(By.CSS_SELECTOR, 'tr td:not(:last-of-type)')]

0
votes

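May I suggest not using Selenium here? It's overkill when the data is already in the page's source HTML. Just fetch the HTML and let pandas parse the table (under the hood it uses lxml, or BeautifulSoup with html5lib as a fallback). Some tables sit inside HTML comments, so for those you need to strip the comment markers first (which is what this code does).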
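
To illustrate why stripping the comment markers matters, here is a small standard-library sketch. The page fragment and the `hidden_example` table id are hypothetical; the point is only that an HTML parser does not see tags inside a comment until the `<!--` and `-->` markers are removed.

```python
from html.parser import HTMLParser


class TableIdCollector(HTMLParser):
    """Collect the id attribute of every <table> start tag the parser sees."""

    def __init__(self):
        super().__init__()
        self.table_ids = []

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.table_ids.append(dict(attrs).get("id"))


def table_ids(html: str) -> list:
    """Return the ids of all tables visible to the parser, in document order."""
    parser = TableIdCollector()
    parser.feed(html)
    parser.close()
    return parser.table_ids


# Hypothetical fragment: one live table and one wrapped in an HTML comment.
PAGE = """
<div><table id="per_game_stats"><tr><td>19.0</td></tr></table></div>
<div><!--
<table id="hidden_example"><tr><td>1</td></tr></table>
--></div>
"""

# Same transformation the answer applies to response.text.
uncommented = PAGE.replace("<!--", "").replace("-->", "")

print(table_ids(PAGE))         # only the live table is visible
print(table_ids(uncommented))  # the commented-out table is now visible too
```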

The second thing to fix is that you never actually call your get_player_row_stats() function.

Code:

import requests
import pandas as pd
from io import StringIO

class PlayerPerGameStats():
    def __init__(self, player_name):
        self.player_name = player_name.lower()

        self.url = f"https://www.basketball-reference.com/players/{self.player_name[0]}/{self.player_name}01.html"

        response = requests.get(self.url)
        if response.status_code == 200:
            # Strip comment markers so commented-out tables become parseable
            self.html = response.text.replace('<!--', '').replace('-->', '')
        else:
            raise Exception(f"Failed to fetch data. HTTP Status Code: {response.status_code}")

    def get_player_row_stats(self) -> pd.DataFrame:
        try:
            # Wrap the string in StringIO: passing literal HTML to read_html is deprecated in recent pandas
            player_data = pd.read_html(StringIO(self.html), attrs={'id': 'per_game_stats'})[0]
            print(player_data)
            return player_data

        except Exception as e:
            print(f"Error extracting row stats: {e}")
            return None


#To run it
stats = PlayerPerGameStats("lillada")
player_stats = stats.get_player_row_stats()

Output:

         Season           Age          Team  ...   PF   PTS           Awards
0        2012-13            22           POR  ...  2.1  19.0            ROY-1
1        2013-14            23           POR  ...  2.4  20.7          AS,NBA3
2        2014-15            24           POR  ...  2.0  21.0               AS
3        2015-16            25           POR  ...  2.2  25.1       MVP-8,NBA2
4        2016-17            26           POR  ...  2.0  27.0              NaN
5        2017-18            27           POR  ...  1.6  26.9    MVP-4,AS,NBA1
6        2018-19            28           POR  ...  1.9  25.8    MVP-6,AS,NBA2
7        2019-20            29           POR  ...  1.7  30.0    MVP-8,AS,NBA2
8        2020-21            30           POR  ...  1.5  28.8    MVP-7,AS,NBA2
9        2021-22            31           POR  ...  1.3  24.0              NaN
10       2022-23            32           POR  ...  1.9  32.2  CPOY-10,AS,NBA3
11       2023-24            33           MIL  ...  1.8  24.3       CPOY-11,AS
12       2024-25            34           MIL  ...  1.8  25.9              NaN
13        13 Yrs        13 Yrs        13 Yrs  ...  1.9  25.1              NaN
14           NaN           NaN           NaN  ...  NaN   NaN              NaN
15  POR (11 Yrs)  POR (11 Yrs)  POR (11 Yrs)  ...  1.9  25.2              NaN
16   MIL (2 Yrs)   MIL (2 Yrs)   MIL (2 Yrs)  ...  1.8  24.6              NaN

[17 rows x 31 columns]
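
The output above still includes the Awards column, which the question wants to leave out. One way to drop it from the DataFrame is sketched below, on a hypothetical three-column stand-in for the parsed table (the values are taken from the output above; the reduced column set is an assumption made to keep the example short).

```python
import pandas as pd

# Hypothetical, shortened stand-in for the parsed per_game_stats table.
df = pd.DataFrame(
    {
        "Season": ["2012-13", "2016-17"],
        "PTS": [19.0, 27.0],
        "Awards": ["ROY-1", None],
    }
)

# Positional: keep every column except the last one.
without_last = df.iloc[:, :-1]

# By name: same result here; errors="ignore" avoids a KeyError if the
# column is absent.
without_awards = df.drop(columns=["Awards"], errors="ignore")

print(list(without_last.columns))
```

Dropping by name is arguably the safer choice if the site ever appends another column after Awards, while the positional slice matches the question's wording of "everything except the last column".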