Here is the link to the site: https://www.sofascore.com/tournament/football/azerbaijan/misli-premier-league/709#id:64075
I have drafted several versions of the code and gotten nowhere with any of them, so I will paste my code here:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

driver = webdriver.Chrome()  # Make sure ChromeDriver is available in your environment
url = "https://www.sofascore.com/tournament/football/azerbaijan/misli-premier-league/709#id:64075"

try:
    # Open the page in the browser
    driver.get(url)

    # Wait for the table to be fully loaded (wait for the table class to appear)
    wait = WebDriverWait(driver, 30)
    table = wait.until(EC.presence_of_element_located((By.XPATH, "//table[contains(@class, 'sc-c347313f-8.kLmLmP')]")))

    # Scrape the rows of the table
    rows = table.find_elements(By.XPATH, ".//tr")
    table_data = []
    for row in rows:
        # Get the cells in each row, checking both td and th
        cells = row.find_elements(By.XPATH, ".//td")
        if not cells:  # For header rows, use th
            cells = row.find_elements(By.XPATH, ".//th")
        row_data = [cell.text.strip() for cell in cells if cell.text.strip() != '']
        if row_data:
            table_data.append(row_data)

    # Convert the scraped data into a pandas DataFrame
    df = pd.DataFrame(table_data)

    # Print the DataFrame
    print("Extracted Table Data:")
    print(df)
    df.to_csv("scraped_table.csv", index=False)
finally:
    # Close the browser once done
    driver.quit()
I have basically tried everything I can think of. I have made sure the problem is not caused by anything external such as the web driver. At this point I am at a loss. I have tried roughly 50 different versions and still have nothing. All I need is to scrape the table further down the page called "Player statistics".
The problem you are facing is that the table element does not exist in the DOM until you scroll down to a certain element. It is loaded dynamically once its parent element is scrolled into view.
To work around this, you first need to scroll down to the parent div element and then wait for the table to become available in the DOM.
Try the following code.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pandas as pd
import time

driver = webdriver.Chrome()
url = "https://www.sofascore.com/tournament/football/azerbaijan/misli-premier-league/709#id:64075"

try:
    # Open the page in the browser
    driver.get(url)

    # Wait for the table to be fully loaded (wait for the table class to appear)
    wait = WebDriverWait(driver, 30)

    # First get the parent element that is present on page load without scrolling
    parent = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, "div[data-testid='league_info']")))

    # Scroll down to bring the parent into view
    driver.execute_script("arguments[0].scrollIntoView();", parent)

    # Then you'll find that the table is present
    table = wait.until(EC.presence_of_element_located((By.XPATH, "//table")))

    # Scrape the rows of the table
    rows = table.find_elements(By.XPATH, ".//tr")
    table_data = []
    for row in rows:
        # Get the cells in each row, checking both td and th
        cells = row.find_elements(By.XPATH, ".//td")
        if not cells:  # For header rows, use th
            cells = row.find_elements(By.XPATH, ".//th")
        row_data = [cell.text.strip() for cell in cells if cell.text.strip() != '']
        if row_data:
            table_data.append(row_data)

    # Convert the scraped data into a pandas DataFrame
    df = pd.DataFrame(table_data)

    # Print the DataFrame
    print("Extracted Table Data:")
    print(df)
    df.to_csv("scraped_table.csv", index=False)
finally:
    # Close the browser once done
    driver.quit()
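One caveat: the div[data-testid='league_info'] hook and the bare //table XPath are tied to SofaScore's current markup and could change at any time. Below is a minimal sketch of a more generic fallback that only assumes the table appears somewhere below the fold once you scroll; the helper name, step size, attempt count, and per-attempt timeout are all arbitrary choices of mine, not anything SofaScore-specific.

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException

def scroll_until_present(driver, locator, step=800, max_scrolls=20, per_try_timeout=3):
    """Scroll the page down in fixed steps until the element described by
    `locator` is present in the DOM, then return it."""
    wait = WebDriverWait(driver, per_try_timeout)
    for _ in range(max_scrolls):
        try:
            return wait.until(EC.presence_of_element_located(locator))
        except TimeoutException:
            # Not in the DOM yet -- scroll a bit further and try again
            driver.execute_script(f"window.scrollBy(0, {step});")
    raise TimeoutException(f"Element {locator} never appeared after {max_scrolls} scrolls")

# Usage: replace the parent/table wait block above with
#   table = scroll_until_present(driver, (By.XPATH, "//table"))

This keeps the scrape working even if SofaScore renames its container test IDs, at the cost of a few short extra waits while scrolling.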