How do I scrape a web page in Python when its URL is the same as the home page URL?

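The URL below opens a form where we only need to select a fiscal year and click Search to get that year's data. However, searching for a year opens the same URL as the form itself: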

https://cfpub.epa.gov/compliance/criminal_prosecution/index.cfm

I wrote the code below, hard-coding the XPath of the option for the year 2023:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import Select
import time
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
b = webdriver.Chrome()
url = "https://cfpub.epa.gov/compliance/criminal_prosecution/index.cfm"
b.get(url)
time.sleep(10)
total_article_xpath = "//*[@id=\"main-content\"]/div[2]/div[1]/div/div/form/table/tbody/tr[8]/td/div/div[2]/select/option[42]"
element = WebDriverWait(b, 10).until(EC.presence_of_element_located((By.XPATH, total_article_xpath)))
time.sleep(10)
print(element)
getdetails = element.find_element(By.XPATH, total_article_xpath)
button_val = "//*[@id=\"searchButton\"]"
b.find_element(By.XPATH, button_val).click()
print(b)
vals = b.current_url

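How can we navigate to and scrape the fiscal-year results page when its URL is the same as the home page URL? Any help would be appreciated.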

python selenium-webdriver xpath
1 answer

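You need to select a value from the "Fiscal Year" dropdown and then click the "Search" button. After that you can scrape the results page as usual. Because the URL does not change, read the results from the page the driver is holding after the click rather than relying on current_url.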

Here is an example:

from selenium import webdriver
from selenium.webdriver import ChromeOptions
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import Select, WebDriverWait


URL = "https://cfpub.epa.gov/compliance/criminal_prosecution/index.cfm"
# Locators for the fiscal year dropdown, the search button and the results table body
SELECT_XPATH = "//*[@id='main-content']/div[2]/div[1]/div/div/form/table/tbody/tr[8]/td/div/div[2]/select"
SEARCH_XPATH = "//*[@id='searchButton']"
TBODY_XPATH = "//*[@id='main-content']/div[2]/div[1]/div/div/table/tbody/tr[1]/td[2]/table/tbody"
FISCAL_YEAR = "2000"

# Run Chrome without opening a browser window
options = ChromeOptions()
options.add_argument("--headless")

with webdriver.Chrome(options=options) as driver:
    driver.get(URL)
    # Wait for the dropdown, then pick the year by its visible text instead of by option index
    select = Select(WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, SELECT_XPATH))))
    select.select_by_visible_text(FISCAL_YEAR)
    # Submit the form; the results load under the same URL
    WebDriverWait(driver, 5).until(EC.element_to_be_clickable((By.XPATH, SEARCH_XPATH))).click()
    # Wait for the results table body to be present
    tbody = WebDriverWait(driver, 5).until(EC.presence_of_element_located((By.XPATH, TBODY_XPATH)))
    # Print every other matching cell
    for td in tbody.find_elements(By.CSS_SELECTOR, "td.valign-top")[::2]:
        print(td.text)

Output:

Allen Sinclair
BP Exploration-Alaska (BPXA)
Ben Shafsky
Doyon Drilling, Inc.
Michael Krupa
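
The [::2] slice in the loop keeps only the matched cells at indices 0, 2, 4 and so on, so every second td.valign-top cell is skipped. If the skipped cells are needed too, the flat list of cell texts can be grouped into rows instead. The following is a minimal sketch, assuming each result row contributes exactly two matched cells. The helper name pair_cells and the "<second cell>" placeholders are hypothetical; what the second cell actually holds should be checked against the page.

```python
def pair_cells(texts, width=2):
    """Group a flat list of cell texts into tuples of `width` cells each."""
    if len(texts) % width != 0:
        raise ValueError("number of cells is not a multiple of the row width")
    return [tuple(texts[i:i + width]) for i in range(0, len(texts), width)]


# Placeholder data standing in for the texts of the matched cells.
# The "<second cell>" entries are made up; only the names come from the output above.
cells = ["Allen Sinclair", "<second cell>", "Ben Shafsky", "<second cell>"]

rows = pair_cells(cells)
names = [row[0] for row in rows]  # the same values that cells[::2] selects
```

With a live driver, the input list could be built as [td.text for td in tbody.find_elements(By.CSS_SELECTOR, "td.valign-top")] and passed to pair_cells. The length check makes the two-cells-per-row assumption fail loudly if the table layout differs.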