Scraping a dynamic website with Python Selenium

Problem description

The goal is to scrape information from each job card to build a database. To do this, I am trying to follow these steps:

  • Get the maximum number of result pages
  • Get the ID from each job card, in order to visit each card by modifying the base URL
  • Save the data to a CSV file with pandas, or build an SQL database (a minimal sketch of this step follows the list)
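For the last step, here is a minimal sketch of the saving part, assuming the scraping loop has already filled the two lists emp_title_list and emp_id_list used in the code below (the output file names are placeholders):

import sqlite3
import pandas as pd

# Build a table from the two lists the scraping loop fills
df = pd.DataFrame({'offer_id': emp_id_list, 'title': emp_title_list})

# Option 1: save to a CSV file
df.to_csv('iefp_offers.csv', index=False)

# Option 2: write to a local SQLite database
conn = sqlite3.connect('iefp_offers.db')
df.to_sql('offers', conn, if_exists='replace', index=False)
conn.close()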

So far, I have tried to get the title from each job card on the first page (10 cards), but the code returns either an empty list or an error message.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd


#Instantiate the webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
#Define url
url = 'https://iefponline.iefp.pt/IEFP/pesquisas/search.do'
# load the web page
driver.get(url)

# set maximum time to wait for the page to load, in seconds
driver.implicitly_wait(15)

# collect the data that is within the main ID block
contents = driver.find_element(By.ID, 'resultados-pesquisa')
# Find all elements with the class name 'offer-card horizontal'
emp_offers = contents.find_elements(By.CLASS_NAME, 'offer-card')

emp_title_list = []
emp_id_list = []

for emp_offer in emp_offers:

    offer_title = emp_offer.get_attribute('title')
    emp_title_list.append(offer_title)

    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-code")]/span[2]').text
    emp_id_list.append(offer_id)

print(emp_title_list)
print(emp_id_list)
    
# Close the WebDriver
driver.quit()

['', '', '', '', '', '', '', '', '', '']
[None, None, None, None, None, None, None, None, None, None]

"DevTools listening on ws://127.0.0.1:65003/devtools/browser/327b17e5-a97d-4d84-9ae0-c1c03122286a
Traceback (most recent call last):
  File "c:\Users\dbelt\Documents\scrape\selenium_iefp.py", line 36, in <module>
    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-code")]/span[2]').text
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 416, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT, {"using": by, "value": value})["value"]
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 394, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 344, in execute
    self.error_handler.check_response(response)
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":".//div[contains(@class, "offer-code")]/span[2]"}
  (Session info: chrome=117.0.5938.134); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
        GetHandleVerifier [0x0062CFE3+45267]
        (No symbol) [0x005B9741]
        (No symbol) [0x004ABE1D]
        (No symbol) [0x004DED30]
        (No symbol) [0x004DF1FB]
        (No symbol) [0x004D8041]
        (No symbol) [0x004FB084]
        (No symbol) [0x004D7F96]
        (No symbol) [0x004FB2B4]
        (No symbol) [0x0050DDDA]
        (No symbol) [0x004FAE36]
        (No symbol) [0x004D674E]
        (No symbol) [0x004D78ED]
        GetHandleVerifier [0x008E5659+2897737]
        GetHandleVerifier [0x0092E78B+3197051]
        GetHandleVerifier [0x00928571+3171937]
        GetHandleVerifier [0x006B5E40+606000]
        (No symbol) [0x005C338C]
        (No symbol) [0x005BF508]
        (No symbol) [0x005BF62F]
        (No symbol) [0x005B1D27]
        BaseThreadInitThunk [0x757B7BA9+25]
        RtlInitializeExceptionChain [0x7711B79B+107]
        RtlClearBits [0x7711B71F+191]

Which one I get depends on whether I try to fetch the information with get_attribute or with an XPath.

I have also noticed that when I copy an XPath from this site, the path I get is much longer and more fragile than on other websites I have worked with.
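For illustration (both locators below are examples, not the exact paths from the site): a browser-copied XPath is typically absolute, so it breaks as soon as the page layout shifts, while a relative, attribute-based locator tends to be more robust.

# A copied XPath is usually absolute and brittle (hypothetical example):
offer = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[3]/div[1]/div/a')

# A relative, attribute-based locator survives layout changes better:
offer = driver.find_element(By.XPATH, "//*[contains(@id, 'ofertacard_')]")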

Finally, many of the class names contain spaces, and I currently don't know the best way to use find_elements with class names like these.

python selenium-webdriver web-scraping beautifulsoup
2 Answers
0 votes
  1. It looks like emp_title_list ends up full of empty strings because the elements you access via their class do not carry a 'title' attribute. You need to locate the element inside each card that actually holds the title.

  2. Regarding the XPath problem, your XPath expression does not match the page structure.

  3. If you want to locate an element by its class attribute and the element has multiple space-separated class names, By.CLASS_NAME will not accept the compound name; use By.CSS_SELECTOR instead and separate the class names with dots (see the sketch after the corrected code). Here is the corrected code:

for emp_offer in emp_offers:
    # Find the title element within the offer using TAG_NAME and get its 'title' attribute
    offer_title = emp_offer.find_element(By.TAG_NAME, "a").get_attribute('title')
    emp_title_list.append(offer_title)

    # Corrected XPATH to find the offer_id element
    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-card-footer")]/div/div[2]/span[2]').text
    emp_id_list.append(offer_id)
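On point 3, a minimal sketch of selecting by a compound class name, assuming the cards carry the classes offer-card and horizontal as the question suggests:

# By.CLASS_NAME expects a single class name, so the full string
# 'offer-card horizontal' will not match the cards.
# Join the class names with dots in a CSS selector instead:
emp_offers = driver.find_elements(By.CSS_SELECTOR, '.offer-card.horizontal')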

0 votes

You can also try this approach, with a slight change:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://iefponline.iefp.pt/IEFP/pesquisas/search.do')

# set maximum time to wait for the page to load, in seconds
driver.implicitly_wait(15)

# click on dropdown menu to change results per page to 50
driver.find_element("xpath", '//*[@id="resultado"]/div/div[3]/div[4]/div/div/div/div').click()
driver.find_element('xpath', '//*[@id="resultado"]/div/div[3]/div[4]/div/div/div/ul/li[3]').click()

page_count = driver.find_element('xpath', '//*[@id="resultado"]/div/div[3]/div[2]/div/ul/li[5]/a').text
print(f"Total pages with 50 results per page: {page_count}")

emp_offers = driver.find_elements('xpath', "//*[contains(@id, 'ofertacard_')]")

emp_title_list = []
emp_id_list = []

for emp_offer in emp_offers:
    # split the card's visible text into lines
    lines = emp_offer.text.split('\n')
    offer_title = lines[0]
    offer_id = lines[2].split()[1]
    emp_title_list.append(offer_title)
    emp_id_list.append(offer_id)

print(emp_title_list)
print(emp_id_list)
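Since the results are injected dynamically, an explicit wait can also be more reliable than implicitly_wait: the script then proceeds only once the cards are actually attached to the DOM. A minimal sketch, reusing the WebDriverWait and expected_conditions imports already present in the question:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for at least one job card to be present
wait = WebDriverWait(driver, 15)
emp_offers = wait.until(
    EC.presence_of_all_elements_located((By.XPATH, "//*[contains(@id, 'ofertacard_')]"))
)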