As a newcomer to Selenium and Python, I've been tasked with extracting all the data from the three tables on a Wikipedia page. In every test I'm able to get the relevant data from the first table, but the code can't find anything about the second or third table. I know this shouldn't be that hard, but I've been at it for three straight days with no progress. What exactly is my code missing? I can open the page, but then it reports only 2 tables on the page, or sometimes just one, when I know for a fact there are 3.

The page in question is: https://es.wikipedia.org/wiki/Anexo:Entidades_federativas_de_M%C3%A9xico_por_superficie,_poblaci%C3%B3n_y_densidad

My code is as follows:
# Libraries
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
import pandas as pd
import time
from io import StringIO
# Add debugging statements
print("Starting script...")
# Initialize Chrome options
options = webdriver.ChromeOptions()
options.add_argument('--start-maximized')
options.add_argument('--disable-extensions')
# Use webdriver-manager to get the appropriate ChromeDriver
service = Service(ChromeDriverManager().install())
# Initialize the WebDriver
driver = webdriver.Chrome(service=service, options=options)
try:
    # Start on 2nd monitor
    driver.set_window_position(2000, 0)
    driver.maximize_window()
    time.sleep(5)
    # Open the page
    driver.get('https://es.wikipedia.org/wiki/Anexo:Entidades_federativas_de_M%C3%A9xico_por_superficie,_poblaci%C3%B3n_y_densidad')
    # Wait for the page to fully load by waiting for a specific element to appear
    WebDriverWait(driver, 30).until(
        EC.presence_of_element_located((By.XPATH, '//*[@id="firstHeading"]'))
    )
    print("Page loaded successfully")
    # Extract data from the tables using specific XPath expressions
    first_table = driver.find_element(By.XPATH, '//table[contains(.//caption, "Entidades federativas de México por superficie, población y densidad")]')
    second_table = driver.find_element(By.XPATH, '(//table[contains(.//caption, "Población histórica de México")])[1]')
    third_table = driver.find_element(By.XPATH, '(//table[contains(.//caption, "Población histórica de México")])[2]')
    print("All tables found")
    # First table extraction
    first_table_html = first_table.get_attribute('outerHTML')
    first_table_df = pd.read_html(StringIO(first_table_html))[0]
    first_table_df = first_table_df.iloc[2:34, :]  # Drop header rows, keep the 32 data rows
    first_table_df.columns = first_table_df.iloc[0]  # Use the first remaining row as the header
    first_table_df = first_table_df[1:]  # Drop that header row from the data
    print("First table extracted successfully")
    # Second table extraction
    second_table_html = second_table.get_attribute('outerHTML')
    second_table_df = pd.read_html(StringIO(second_table_html))[0]
    second_table_df.columns = ['Pos', 'Entidad', '2020', '2010', '2000', '1990', '1980', '1970', '1960', '1950', '1940', '1930', '1921', '1910']
    print("Second table extracted successfully")
    # Third table extraction
    third_table_html = third_table.get_attribute('outerHTML')
    third_table_df = pd.read_html(StringIO(third_table_html))[0]
    third_table_df.columns = ['Pos', 'Entidad', '2010', '2015', '2020', '2025', '2030']
    print("Third table extracted successfully")
    # Save to Excel with each table on a different sheet
    with pd.ExcelWriter('mexico_population_data.xlsx') as writer:
        first_table_df.to_excel(writer, sheet_name='Superficie_Poblacion_Densidad', index=False)
        second_table_df.to_excel(writer, sheet_name='Poblacion_Historica', index=False)
        third_table_df.to_excel(writer, sheet_name='Poblacion_Futura', index=False)
    print("Data extraction and Excel file creation successful")
except Exception as e:
    print(f"An error occurred: {e}")
finally:
    # Close the browser after a delay to see the loaded page
    time.sleep(10)
    driver.quit()
Example of the error:
Page loaded successfully
An error occurred: Message: no such element: Unable to locate element: {"method":"xpath","selector":"(//table[contains(.//caption, "Población histórica de México")])[1]"}
(Session info: chrome=125.0.6422.141); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
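One detail of that locator worth knowing, whichever fix you end up using: in XPath 1.0 (what browsers evaluate), `contains()` applied to a node-set stringifies only the *first* node of the set, and for a table with no `<caption>` at all the node-set is empty and stringifies to `""`, which can never match. A minimal, self-contained reproduction using `lxml` (the markup is made up for illustration, not the actual Wikipedia HTML):

```python
from lxml import html  # lxml evaluates the same XPath 1.0 semantics as the browser

# Hypothetical markup: one table with the caption, one with no caption at all.
doc = html.fromstring("""
<body>
  <table><caption>Población histórica de México</caption>
    <tr><td>a</td></tr></table>
  <table><tr><td>b</td></tr></table>
</body>
""")

# contains(node-set, str) uses the string value of the FIRST node only;
# an empty node-set becomes "", so caption-less tables never match.
matches = doc.xpath('//table[contains(.//caption, "Población histórica")]')
print(len(matches))  # 1 -- only the table that really carries that caption
```

So if the second and third tables on the live page carry their title outside a `<caption>` element (or in differently-marked-up captions), this predicate silently skips them; printing each table's caption (or lack of one) from `driver.find_elements(By.TAG_NAME, "table")` is a quick way to check.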
Since you say you're new, here are a few tips to make your scraping easy and efficient. I use selenium to get the page source and requests-html to parse it (it's simple). In your case you want the tables that have a caption tag, and you can do it like this:
from requests_html import HTML

# your initial code
src = driver.page_source
html = HTML(html=src)

# Keep only the tables that actually contain a <caption> element
tables = html.find("table")
tables = [table.html for table in tables if table.find('caption')]
# do whatever you want to do with each table's html
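From there, each HTML string in `tables` can go straight into pandas, just like in your original script. A minimal sketch of that last step, run here on a hypothetical stand-in table (made-up rows, not data from the live page):

```python
from io import StringIO
import pandas as pd

# Hypothetical stand-in for one entry of the `tables` list above
table_html = """<table>
  <caption>Población histórica de México</caption>
  <tr><th>Entidad</th><th>2020</th></tr>
  <tr><td>Jalisco</td><td>8348151</td></tr>
</table>"""

# read_html returns one DataFrame per <table> in the input;
# the <th> row becomes the header and the <caption> is ignored
df = pd.read_html(StringIO(table_html))[0]
print(list(df.columns))  # ['Entidad', '2020']
```

Wrapping the string in `StringIO` avoids the deprecation warning newer pandas versions emit when `read_html` is handed a raw HTML string.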