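The URL below opens a search form. We only need to select a fiscal year and click Search to get that year's data, but the search results open at the same URL as the form:
https://cfpub.epa.gov/compliance/criminal_prosecution/index.cfm
I wrote the code below by manually entering the XPath of the option for the year 2023:
from selenium import webdriver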
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
url = "https://cfpub.epa.gov/compliance/criminal_prosecution/index.cfm"
b = webdriver.Chrome()
b.get(url)
time.sleep(10)
total_article_xpath = "//*[@id=\"main-content\"]/div[2]/div[1]/div/div/form/table/tbody/tr[8]/td/div/div[2]/select/option[42]"
element = WebDriverWait(b, 10).until(EC.presence_of_element_located((By.XPATH, total_article_xpath)))
time.sleep(10)
print(element)
getdetails = element.find_element(By.XPATH, total_article_xpath)
button_val = "//*[@id=\"searchButton\"]"
b.find_element(By.XPATH, button_val).click()
print(b)
vals = b.current_url
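How can we navigate to and scrape the fiscal-year results when their URL is the same as the home page URL? Any help would be greatly appreciated.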
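You need to select a value from the "Fiscal Year" drop-down list and then click the "Search" button. After that, you can scrape the results page as usual.
Here is an example: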
from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
from selenium.webdriver import ChromeOptions
URL = "https://cfpub.epa.gov/compliance/criminal_prosecution/index.cfm"
SELECT_XPATH = "//*[@id='main-content']/div[2]/div[1]/div/div/form/table/tbody/tr[8]/td/div/div[2]/select"
SEARCH_XPATH = "//*[@id='searchButton']"
TBODY_XPATH = "//*[@id='main-content']/div[2]/div[1]/div/div/table/tbody/tr[1]/td[2]/table/tbody"
FISCAL_YEAR = "2000"
options = ChromeOptions()
options.add_argument("--headless")
with webdriver.Chrome(options=options) as driver:
    driver.get(URL)
    # Pick the fiscal year from the drop-down, then submit the search form.
    select = Select(WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, SELECT_XPATH))))
    select.select_by_visible_text(FISCAL_YEAR)
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, SEARCH_XPATH))).click()
    # Wait for the results table to appear before reading it.
    tbody = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, TBODY_XPATH)))
    # Print the text of every other matching cell.
    for td in tbody.find_elements(By.CSS_SELECTOR, "td.valign-top")[::2]:
        print(td.text)
Output:
Allen Sinclair
BP Exploration-Alaska (BPXA)
Ben Shafsky
Doyon Drilling, Inc.
Michael Krupa
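
As for why the URL does not change after searching: one plausible explanation, which I have not verified against this site, is that the form is submitted with POST, so the selected year travels in the request body rather than in the URL. The sketch below only builds such a request with the standard library and never sends it. The helper `build_search_request` and the field name `fiscal_year` are hypothetical; the real field name would have to be read from the form in the browser's developer tools.

```python
from urllib.parse import urlencode
from urllib.request import Request

URL = "https://cfpub.epa.gov/compliance/criminal_prosecution/index.cfm"


def build_search_request(url: str, fiscal_year: str) -> Request:
    """Build (but do not send) a POST request carrying the chosen year."""
    # "fiscal_year" is a hypothetical field name used for illustration only.
    body = urlencode({"fiscal_year": fiscal_year}).encode("ascii")
    # Attaching a body makes urllib treat this as a POST request.
    return Request(url, data=body)


if __name__ == "__main__":
    req = build_search_request(URL, "2000")
    print(req.get_method())  # HTTP method of the request
    print(req.full_url)      # the address, which does not include the year
    print(req.data)          # the body, which does include the year
```

If the site does behave this way, the address bar alone can never identify a fiscal year, which is why driving the form with Selenium as shown above is the simpler route.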