Python and Selenium cannot find all the tables on a page

Question · votes: 0 · answers: 1

As a newcomer to Selenium and Python, I have been tasked with extracting all the data from the three tables on a Wikipedia page. In every test I am able to pull the relevant data from the first table, but the code cannot find anything about the second or third table. I know this shouldn't be that hard, but I've been at it for three straight days with no progress. What exactly is missing from my code? I can open the page, but then it reports that there are only 2, or sometimes just 1, table on the page, even though I know for a fact there are 3.

The page in question is: https://es.wikipedia.org/wiki/Anexo:Entidades_federativas_de_M%C3%A9xico_por_superficie,_poblaci%C3%B3n_y_densidad

My code is below:

# Libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time
from io import StringIO

# Add debugging statements
print("Starting script...")

# Initialize Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--start-maximized')
options.add_argument('--disable-extensions')

# Use webdriver-manager to get the appropriate ChromeDriver
service = Service(ChromeDriverManager().install())

# Initialize the WebDriver
driver = webdriver.Chrome(service=service, options=options)

try:
    # Start on 2nd Monitor
    driver.set_window_position(2000, 0)
    driver.maximize_window()
    time.sleep(5)

    # Initiate Browser
    driver.get('https://es.wikipedia.org/wiki/Anexo:Entidades_federativas_de_M%C3%A9xico_por_superficie,_poblaci%C3%B3n_y_densidad')

    # Wait for the page to fully load by waiting for a specific element to appear
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="firstHeading"]'))
    )

    print("Page loaded successfully")

    # Extract data from the tables using specific XPath expressions
    first_table = driver.find_element(By.XPATH, '//table[contains(.//caption, "Entidades federativas de México por superficie, población y densidad")]')
    second_table = driver.find_element(By.XPATH, '(//table[contains(.//caption, "Población histórica de México")])[1]')
    third_table = driver.find_element(By.XPATH, '(//table[contains(.//caption, "Población histórica de México")])[2]')

    print("All tables found")

    # First table extraction
    first_table_html = first_table.get_attribute('outerHTML')
    first_table_df = pd.read_html(StringIO(first_table_html))[0]  # wrap in StringIO; passing literal HTML to read_html is deprecated
    first_table_df = first_table_df.iloc[2:34, :]  # Remove header rows and ensure 32 rows of data
    first_table_df.columns = first_table_df.iloc[0]  # Set the first row as header
    first_table_df = first_table_df[1:]  # Remove the header row from the data
    print("First table extracted successfully")

    # Second table extraction
    second_table_html = second_table.get_attribute('outerHTML')
    second_table_df = pd.read_html(StringIO(second_table_html))[0]
    second_table_df.columns = ['Pos', 'Entidad', '2020', '2010', '2000', '1990', '1980', '1970', '1960', '1950', '1940', '1930', '1921', '1910']
    print("Second table extracted successfully")

    # Third table extraction
    third_table_html = third_table.get_attribute('outerHTML')
    third_table_df = pd.read_html(StringIO(third_table_html))[0]
    third_table_df.columns = ['Pos', 'Entidad', '2010', '2015', '2020', '2025', '2030']
    print("Third table extracted successfully")

    # Save to Excel with each table on a different sheet
    with pd.ExcelWriter('mexico_population_data.xlsx') as writer:
        first_table_df.to_excel(writer, sheet_name='Superficie_Poblacion_Densidad', index=False)
        second_table_df.to_excel(writer, sheet_name='Poblacion_Historica', index=False)
        third_table_df.to_excel(writer, sheet_name='Poblacion_Futura', index=False)

    print("Data extraction and Excel file creation successful")

except Exception as e:
    print(f"An error occurred: {e}")

finally:
    # Close the browser after a delay to see the loaded page
    time.sleep(10)
    driver.quit()

Example error output:

Page loaded successfully
An error occurred: Message: no such element: Unable to locate element: {"method":"xpath","selector":"(//table[contains(.//caption, "Población histórica de México")])[1]"}
  (Session info: chrome=125.0.6422.141); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Tags: python, selenium-webdriver
1 Answer (0 votes)

Since you say you're new to this, here are a few tips to make your scraping easy and efficient. I use selenium to fetch the page source and requests-html to parse it (it's straightforward).

In your case you need the tables that have a caption tag, and you can do that like this:

from requests_html import HTML

# ... your initial Selenium code that opens the page ...
src = driver.page_source              # the fully rendered HTML of the page
html = HTML(html=src)                 # parse it with requests-html
tables = html.find("table")           # every <table> element on the page
# keep only the tables that actually contain a <caption> element
tables = [table.html for table in tables if table.find("caption")]
# do whatever you want to do with each table's HTML
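If requests-html isn't available, the same caption-based filtering can be sanity-checked with nothing but the standard library. The sketch below (the HTML snippet is made up for illustration) counts how many `<table>` elements in a piece of markup contain a `<caption>` — a quick way to verify how many captioned tables the rendered page source really has before writing XPath against them:

```python
from html.parser import HTMLParser

class CaptionedTableCounter(HTMLParser):
    """Counts <table> elements, and how many of them contain a <caption>."""

    def __init__(self):
        super().__init__()
        self._stack = []      # one bool per currently-open <table>: has a caption?
        self.total = 0        # total <table> elements seen
        self.captioned = 0    # tables that contained a <caption>

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self._stack.append(False)
        elif tag == "caption" and self._stack:
            self._stack[-1] = True  # the innermost open table has a caption

    def handle_endtag(self, tag):
        if tag == "table" and self._stack:
            self.total += 1
            if self._stack.pop():
                self.captioned += 1

# Made-up HTML for illustration only — not the actual Wikipedia markup.
demo = """
<table><caption>Superficie</caption><tr><td>1</td></tr></table>
<table><tr><td>no caption here</td></tr></table>
<table><caption>Población histórica</caption><tr><td>2</td></tr></table>
"""
scanner = CaptionedTableCounter()
scanner.feed(demo)
print(scanner.total, scanner.captioned)  # prints: 3 2
```

Running this on `driver.page_source` instead of `demo` would tell you directly whether the page really exposes three captioned tables to your script.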