I want to extract the link that is nested at this XPath:
/html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a
See also the detailed nesting image.
In case it helps, these divs also have classes.
I tried:
from selenium import webdriver
from bs4 import BeautifulSoup
browser=webdriver.Chrome()
browser.get('https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024')
soup=BeautifulSoup(browser.page_source)
element = soup.find_element_by_xpath("./html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a")
href = element.get_attribute('href')
print(href)
This code raises an error:
line 9, in <module>
element = soup.find_element_by_xpath("./html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a")
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable
I also tried another approach:
from selenium import webdriver
from bs4 import BeautifulSoup
browser=webdriver.Chrome()
browser.get('https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024')
soup=BeautifulSoup(browser.page_source)
href = soup('a')('div')[1]('div')[2]('div')[1]('div')[0]('div')[0]('div')[0]('div')[0]('div')[0][href]
#href = element.get_attribute('href')
print(href)
This gives the error:
href = soup('a')('div')[1]('div')[2]('div')[1]('div')[0]('div')[0]('div')[0]('div')[0]('div')[0][href]
^^^^^^^^^^^^^^^^
TypeError: 'ResultSet' object is not callable
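Expected result: https://www.visionias.in/resources/material/?id=3731&type=daily_current_affairs or material/?id=3731&type=daily_current_affairs
Some other links share the same nesting as the one above. Is there a way to filter the links using the text inside
/html/body/div[1]/div[2]/div[1]/div/div/p
? For example, the text here is 18 May 2024. This p tag also has an id, but the ids are inconsistent and follow no pattern, so they are not much use to me.
I have seen other answers on Stack Overflow, but they did not work for me.
Please explain the answer in detail if possible, because I have to apply the same code to some other websites as well.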
Refer to the Selenium code below, which extracts all the links and prints them to the console:
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024")
# Wait up to 10 seconds for every <a> inside <div class="center"> to become visible.
wait = WebDriverWait(driver, 10)
links = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='center']//a")))
# get_attribute("href") returns the resolved, absolute URL of each link.
for link in links:
    print(link.get_attribute("href"))
Console output:
https://www.visionias.in/resources/material?id=3731&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3729&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3727&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3723&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3717&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3715&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3705&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3703&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3701&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3699&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3690&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3688&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3687&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3684&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3682&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3676&type=daily_current_affairs
Process finished with exit code 0
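On why the two attempts in the question failed: BeautifulSoup has no XPath support and no find_element_by_xpath method. Accessing an unknown attribute on a soup object is shorthand for find() on a tag of that name, so soup.find_element_by_xpath evaluates to None, and calling it raises "'NoneType' object is not callable". Likewise, soup('a') is shorthand for soup.find_all('a') and returns a ResultSet (a list), which cannot be called again with ('div'). If you would rather keep BeautifulSoup, a CSS selector can do roughly the same job as the XPath above. The following is only a sketch: SAMPLE_HTML is a made-up stand-in for browser.page_source, and extract_links is a hypothetical helper name, so the selector may need adjusting for the real page.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

PAGE_URL = "https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024"

# Hypothetical, simplified stand-in for browser.page_source; the real markup may differ.
SAMPLE_HTML = """
<html><body>
  <div class="center">
    <div><p>18 May 2024</p>
      <div><a href="material?id=3731&amp;type=daily_current_affairs">Open</a></div>
    </div>
    <div><p>17 May 2024</p>
      <div><a href="material?id=3729&amp;type=daily_current_affairs">Open</a></div>
    </div>
  </div>
</body></html>
"""


def extract_links(html, base_url):
    """Return absolute URLs for every <a href> nested inside <div class="center">."""
    # Naming the parser explicitly avoids BeautifulSoup's "no parser specified" warning.
    soup = BeautifulSoup(html, "html.parser")
    # "div.center a[href]" is roughly the CSS counterpart of //div[@class='center']//a.
    anchors = soup.select("div.center a[href]")
    # BeautifulSoup returns the raw attribute value, so resolve it against the page URL.
    return [urljoin(base_url, a["href"]) for a in anchors]


for url in extract_links(SAMPLE_HTML, PAGE_URL):
    print(url)
```

With Selenium in the picture, you would pass browser.page_source and the page URL instead of the sample string. Unlike Selenium's get_attribute("href"), BeautifulSoup does not resolve relative links for you, which is why urljoin is used here.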
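Suggestion: I strongly recommend reading up on absolute versus relative XPath, and on the advantages of relative XPath over absolute XPath. Here are a few links for your reference: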
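As a minimal sketch of that difference, and of the date filtering asked about in the question, the example below runs both styles of path against a small made-up HTML fragment using the standard library's xml.etree.ElementTree, so it needs no browser. The fragment, the "card" class, the date strings, and the links_for_date helper are all assumptions for illustration; the real page's markup may differ. ElementTree supports only a subset of XPath, so in Selenium the date filter would more likely be written as something like //div[p[normalize-space()='18 May 2024']]//a, which is likewise untested against the live page.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the page; it must be well-formed XML for ElementTree.
SAMPLE = """
<html><body>
  <div>
    <div class="center">
      <div class="card">
        <p>18 May 2024</p>
        <div><a href="material?id=3731&amp;type=daily_current_affairs">Open</a></div>
      </div>
      <div class="card">
        <p>17 May 2024</p>
        <div><a href="material?id=3729&amp;type=daily_current_affairs">Open</a></div>
      </div>
    </div>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE)  # root is the <html> element

# Absolute-style path: every step from the root is spelled out, so it matches
# exactly one link and breaks as soon as a wrapper div is added or removed.
absolute = root.findall("./body/div[1]/div[1]/div[1]/div/a")

# Relative path: anchored on a stable attribute, it matches every link under
# <div class="center"> no matter how deeply each one is nested.
relative = root.findall(".//div[@class='center']//a")


def links_for_date(tree, date_text):
    """Return hrefs of links inside any div whose direct <p> child has exactly date_text."""
    return [a.get("href") for a in tree.findall(f".//div[p='{date_text}']//a")]


print([a.get("href") for a in absolute])
print([a.get("href") for a in relative])
print(links_for_date(root, "18 May 2024"))
```

The relative path keeps working if the site adds or removes wrapper elements above the links, which is the main reason it carries over to other websites more easily than a copied absolute path.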