Link extraction using XPath and BeautifulSoup not working

Question · Votes: 0 · Answers: 1

I want to extract a link that is nested at the XPath

/html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a

(see also the detailed nesting image).

If it helps, these divs also have some classes.

Here is what I tried:

from selenium import webdriver
from bs4 import BeautifulSoup

browser=webdriver.Chrome()
browser.get('https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024')

soup=BeautifulSoup(browser.page_source)

element = soup.find_element_by_xpath("./html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a")
href = element.get_attribute('href')
print(href)

This code raises the error:

 line 9, in <module>
    element = soup.find_element_by_xpath("./html/body/div[1]/div[2]/div[1]/div/div/div/div/div/a")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable
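For context on that traceback: `find_element_by_xpath` was a (since-removed) Selenium WebDriver method and has never existed on BeautifulSoup, which does not support XPath at all. A minimal sketch with made-up markup showing what happens and the bs4 equivalent (a CSS selector):

```python
from bs4 import BeautifulSoup

# bs4's __getattr__ treats an unknown attribute such as
# soup.find_element_by_xpath as a search for a <find_element_by_xpath>
# tag, which yields None; calling that None is what produces
# "TypeError: 'NoneType' object is not callable".
soup = BeautifulSoup("<div class='center'><a href='/x'>go</a></div>", "html.parser")

# BeautifulSoup speaks CSS selectors rather than XPath:
a = soup.select_one("div.center > a")
print(a["href"])  # /x
```

If XPath specifically is required outside Selenium, the usual route is the lxml library rather than BeautifulSoup.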

I also tried a different approach:

from selenium import webdriver
from bs4 import BeautifulSoup

browser=webdriver.Chrome()
browser.get('https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024')

soup=BeautifulSoup(browser.page_source)

href = soup('a')('div')[1]('div')[2]('div')[1]('div')[0]('div')[0]('div')[0]('div')[0]('div')[0][href]
#href = element.get_attribute('href')
print(href)

This gives the error:

    href = soup('a')('div')[1]('div')[2]('div')[1]('div')[0]('div')[0]('div')[0]('div')[0]('div')[0][href]
           ^^^^^^^^^^^^^^^^
TypeError: 'ResultSet' object is not callable
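That error comes from bs4's call shorthand: `soup('a')` is shorthand for `soup.find_all('a')` and returns a `ResultSet`, which is just a list of tags and cannot itself be called like a function. A small sketch with hypothetical markup showing the failure mode and the idiomatic chaining instead:

```python
from bs4 import BeautifulSoup

html_doc = "<div><div><a href='material?id=1'>view</a></div></div>"
soup = BeautifulSoup(html_doc, "html.parser")

# soup('a') == soup.find_all('a'); the ResultSet it returns is a plain
# list of Tags, so soup('a')('div') raises
# "TypeError: 'ResultSet' object is not callable".
links = soup("a")
print(type(links).__name__)  # ResultSet

# To descend through nesting, chain find()/find_all() on individual
# Tags, and read attributes with subscripting:
href = soup.find("div").find("a")["href"]
print(href)  # material?id=1
```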

The expected result should be: https://www.visionias.in/resources/material/?id=3731&type=daily_current_affairs or material/?id=3731&type=daily_current_affairs

Some of the other links have the same nesting as above. Is there a way to filter the links using the text inside

/html/body/div[1]/div[2]/div[1]/div/div/p

? For example, here the text is "18 May 2024". That p tag also has an id, but it is inconsistent and follows no pattern, so it is not much use to me.
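On the filtering follow-up: one bs4 approach is to locate the p tag by its string and then search its enclosing div for the link. The markup and the date format below are assumptions for illustration, not the site's actual structure:

```python
from bs4 import BeautifulSoup

# Hypothetical sketch of the nesting described in the question:
# each block holds a date <p> and, somewhere below it, the target <a>.
html_doc = """
<div class="center">
  <p id="x7">18 May 2024</p>
  <div><a href="material?id=3731&amp;type=daily_current_affairs">View</a></div>
</div>
<div class="center">
  <p id="q2">17 May 2024</p>
  <div><a href="material?id=3729&amp;type=daily_current_affairs">View</a></div>
</div>
"""
soup = BeautifulSoup(html_doc, "html.parser")

# Find the <p> whose text is the wanted date, step up to its enclosing
# <div>, then grab the first <a> anywhere inside that block.
p = soup.find("p", string="18 May 2024")
link = p.find_parent("div").find("a")["href"]
print(link)  # material?id=3731&type=daily_current_affairs
```

With Selenium the same idea can be expressed as a relative XPath such as `//div[.//p[contains(text(), '18 May 2024')]]//a`, again assuming that date format.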

I have seen other answers on Stack Overflow, but they did not work for me.

If possible, please explain the answer in detail, since I will also have to apply the same code to some other websites.

python selenium-webdriver beautifulsoup href
1 Answer

0 votes

Refer to the Selenium code below, which extracts all the links and prints them to the console:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()
driver.maximize_window()
driver.get("https://www.visionias.in/resources/daily_current_affairs_programs.php?type=1&m=05&y=2024")
wait = WebDriverWait(driver, 10)

links = wait.until(EC.visibility_of_all_elements_located((By.XPATH, "//div[@class='center']//a")))

for link in links:
    print(link.get_attribute("href"))

Console output:

https://www.visionias.in/resources/material?id=3731&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3729&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3727&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3723&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3717&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3715&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3705&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3703&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3701&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3699&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3690&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3688&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3687&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3684&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3682&type=daily_current_affairs
https://www.visionias.in/resources/material?id=3676&type=daily_current_affairs

Process finished with exit code 0

Suggestion: I strongly recommend reading up on absolute vs. relative XPath, and on the advantages relative XPath has over absolute XPath. A few links below for your reference:
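The point can be demonstrated without a browser using lxml, which also evaluates XPath. In this made-up example, a site redesign wraps the content in one extra div: the position-based absolute path stops matching, while the attribute-based relative path still works:

```python
from lxml import html

# Two versions of a page: the "redesign" adds a wrapper <div>.
original = "<html><body><div class='center'><a href='m?id=1'>x</a></div></body></html>"
redesigned = ("<html><body><div id='wrap'>"
              "<div class='center'><a href='m?id=1'>x</a></div>"
              "</div></body></html>")

absolute = "/html/body/div[1]/a"        # position-based, brittle
relative = "//div[@class='center']//a"  # attribute-based, resilient

hits = {}
for name, page in [("original", original), ("redesigned", redesigned)]:
    tree = html.fromstring(page)
    hits[name] = (len(tree.xpath(absolute)), len(tree.xpath(relative)))

print(hits)  # the absolute path finds nothing after the redesign
```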
