Here is the link to the site: https://www.sofascore.com/tournament/football/azerbaijan/misli-premier-league/709#id:64075
I have drafted several versions of the code and gotten nowhere with any of them, so I will paste my code here:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

driver = webdriver.Chrome()  # Make sure ChromeDriver is available in your environment
url = "https://www.sofascore.com/tournament/football/azerbaijan/misli-premier-league/709#id:64075"

try:
    # Open the page in the browser
    driver.get(url)

    # Wait for the table to be fully loaded (wait for the table class to appear)
    wait = WebDriverWait(driver, 30)
    table = wait.until(EC.presence_of_element_located((By.XPATH, "//table[contains(@class, 'sc-c347313f-8.kLmLmP')]")))

    # Scrape the rows of the table
    rows = table.find_elements(By.XPATH, ".//tr")
    table_data = []
    for row in rows:
        # Get the cells in each row, checking both td and th
        cells = row.find_elements(By.XPATH, ".//td")
        if not cells:  # For header rows, use th
            cells = row.find_elements(By.XPATH, ".//th")
        row_data = [cell.text.strip() for cell in cells if cell.text.strip() != '']
        if row_data:
            table_data.append(row_data)

    # Convert the scraped data into a pandas DataFrame
    df = pd.DataFrame(table_data)

    # Print the DataFrame
    print("Extracted Table Data:")
    print(df)
    df.to_csv("scraped_table.csv", index=False)
finally:
    # Close the browser once done
    driver.quit()
I have basically tried everything I can think of. I have made sure the problem is not caused by anything external such as the web driver. At this point I am at a loss. I have tried roughly 50 different versions and still have nothing. All I need is to scrape the table further down the page called "Player statistics".
The problem you are facing is that the table element does not exist in the DOM until you scroll down to a certain element. It is loaded dynamically once its parent element is scrolled into view.
To work around this, you first need to scroll down to the parent div element and then wait for the table to become available in the DOM.
Try the following code.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

driver = webdriver.Chrome()
url = "https://www.sofascore.com/tournament/football/azerbaijan/misli-premier-league/709#id:64075"

try:
    # Open the page in the browser
    driver.get(url)

    # Wait for the table to be fully loaded (wait for the table class to appear)
    wait = WebDriverWait(driver, 30)

    # First get the parent element that is present on page load without scrolling
    parent = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div[data-testid='league_info']")))

    # Scroll down to bring the parent into view
    driver.execute_script("arguments[0].scrollIntoView();", parent)

    # Then you'll find that the table is present
    table = wait.until(EC.presence_of_element_located((By.XPATH, "//table")))

    # Scrape the rows of the table
    rows = table.find_elements(By.XPATH, ".//tr")
    table_data = []
    for row in rows:
        # Get the cells in each row, checking both td and th
        cells = row.find_elements(By.XPATH, ".//td")
        if not cells:  # For header rows, use th
            cells = row.find_elements(By.XPATH, ".//th")
        row_data = [cell.text.strip() for cell in cells if cell.text.strip() != '']
        if row_data:
            table_data.append(row_data)

    # Convert the scraped data into a pandas DataFrame
    df = pd.DataFrame(table_data)

    # Print the DataFrame
    print("Extracted Table Data:")
    print(df)
    df.to_csv("scraped_table.csv", index=False)
finally:
    # Close the browser once done
    driver.quit()
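One caveat: the div[data-testid='league_info'] hook and the bare //table XPath are tied to SofaScore's current markup and could change at any time. Below is a minimal sketch of a more generic fallback that only assumes the table appears somewhere below the fold once you scroll; the helper name, step size, attempt count, and per-attempt timeout are all arbitrary choices of mine, not anything SofaScore-specific.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def scroll_until_present(driver, locator, step=800, max_scrolls=20, per_try_timeout=3):
    """Scroll the page down in fixed steps until the element described by
    `locator` is present in the DOM, then return it."""
    wait = WebDriverWait(driver, per_try_timeout)
    for _ in range(max_scrolls):
        try:
            return wait.until(EC.presence_of_element_located(locator))
        except TimeoutException:
            # Not in the DOM yet -- scroll a bit further and try again
            driver.execute_script(f"window.scrollBy(0, {step});")
    raise TimeoutException(f"Element {locator} never appeared after {max_scrolls} scrolls")

# Usage: replace the parent/table wait block above with
#   table = scroll_until_present(driver, (By.XPATH, "//table"))

This keeps the scrape working even if SofaScore renames its container test IDs, at the cost of a few short extra waits while scrolling.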