Scraping a dynamic website with Python Selenium

Problem description

The goal is to scrape information from each job card to build a database. To do this, I am trying to follow these steps:

  • Get the maximum number of result pages
  • Get the ID from each job card, in order to visit each card by modifying the base URL
  • Save the data to a CSV file with pandas, or build an SQL database (a minimal sketch of this step follows the list)
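For the last step, here is a minimal sketch of the saving part, assuming the scraping loop has already filled the two lists emp_title_list and emp_id_list used in the code below (the output file names are placeholders):

import sqlite3
import pandas as pd

# Build a table from the two lists the scraping loop fills
df = pd.DataFrame({'offer_id': emp_id_list, 'title': emp_title_list})

# Option 1: save to a CSV file
df.to_csv('iefp_offers.csv', index=False)

# Option 2: write to a local SQLite database
conn = sqlite3.connect('iefp_offers.db')
df.to_sql('offers', conn, if_exists='replace', index=False)
conn.close()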

So far, I have tried to get the title from each job card on the first page (10 cards), but the code returns either an empty list or an error message.

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd


#Instantiate the webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
#Define url
url = 'https://iefponline.iefp.pt/IEFP/pesquisas/search.do'
# load the web page
driver.get(url)

# set maximum time to wait for the page to load, in seconds
driver.implicitly_wait(15)

# collect the data that is within the main ID block
contents = driver.find_element(By.ID, 'resultados-pesquisa')
# Find all elements with the class name 'offer-card horizontal'
emp_offers = contents.find_elements(By.CLASS_NAME, 'offer-card')

emp_title_list = []
emp_id_list = []

for emp_offer in emp_offers:

    offer_title = emp_offer.get_attribute('title')
    emp_title_list.append(offer_title)

    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-code")]/span[2]').text
    emp_id_list.append(offer_id)

print(emp_title_list)
print(emp_id_list)
    
# Close the WebDriver
driver.quit()

['', '', '', '', '', '', '', '', '', '']
[None, None, None, None, None, None, None, None, None, None]

"DevTools listening on ws://127.0.0.1:65003/devtools/browser/327b17e5-a97d-4d84-9ae0-c1c03122286a
Traceback (most recent call last):
  File "c:\Users\dbelt\Documents\scrape\selenium_iefp.py", line 36, in <module>
    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-code")]/span[2]').text
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 416, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT, {"using": by, "value": value})["value"]
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 394, in _execute
    return self._parent.execute(command, params)
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 344, in execute
    self.error_handler.check_response(response)
  File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":".//div[contains(@class, "offer-code")]/span[2]"}
  (Session info: chrome=117.0.5938.134); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
        GetHandleVerifier [0x0062CFE3+45267]
        (No symbol) [0x005B9741]
        (No symbol) [0x004ABE1D]
        (No symbol) [0x004DED30]
        (No symbol) [0x004DF1FB]
        (No symbol) [0x004D8041]
        (No symbol) [0x004FB084]
        (No symbol) [0x004D7F96]
        (No symbol) [0x004FB2B4]
        (No symbol) [0x0050DDDA]
        (No symbol) [0x004FAE36]
        (No symbol) [0x004D674E]
        (No symbol) [0x004D78ED]
        GetHandleVerifier [0x008E5659+2897737]
        GetHandleVerifier [0x0092E78B+3197051]
        GetHandleVerifier [0x00928571+3171937]
        GetHandleVerifier [0x006B5E40+606000]
        (No symbol) [0x005C338C]
        (No symbol) [0x005BF508]
        (No symbol) [0x005BF62F]
        (No symbol) [0x005B1D27]
        BaseThreadInitThunk [0x757B7BA9+25]
        RtlInitializeExceptionChain [0x7711B79B+107]
        RtlClearBits [0x7711B71F+191]

Which one I get depends on whether I try to fetch the information with get_attribute or with an XPath.

I have also noticed that when I copy an XPath from this site, the path I get is much longer and more fragile than on other websites I have worked with.
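For illustration (both locators below are examples, not the exact paths from the site): a browser-copied XPath is typically absolute, so it breaks as soon as the page layout shifts, while a relative, attribute-based locator tends to be more robust.

# A copied XPath is usually absolute and brittle (hypothetical example):
offer = driver.find_element(By.XPATH, '/html/body/div[2]/div/div[3]/div[1]/div/a')

# A relative, attribute-based locator survives layout changes better:
offer = driver.find_element(By.XPATH, "//*[contains(@id, 'ofertacard_')]")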

Finally, many of the class names contain spaces, and I currently don't know the best way to use find_elements with class names like these.

python selenium-webdriver web-scraping beautifulsoup
2 Answers
0 votes
  1. It looks like emp_title_list ends up full of empty strings because the elements you access via their class do not carry a 'title' attribute. You need to locate the element inside each card that actually holds the title.

  2. Regarding the XPath problem, your XPath expression does not match the page structure.

  3. If you want to locate an element by its class attribute and the element has multiple space-separated class names, By.CLASS_NAME will not accept the compound name; use By.CSS_SELECTOR instead and separate the class names with dots (see the sketch after the corrected code). Here is the corrected code:

for emp_offer in emp_offers:
    # Find the title element within the offer using TAG_NAME and get its 'title' attribute
    offer_title = emp_offer.find_element(By.TAG_NAME, "a").get_attribute('title')
    emp_title_list.append(offer_title)

    # Corrected XPATH to find the offer_id element
    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-card-footer")]/div/div[2]/span[2]').text
    emp_id_list.append(offer_id)
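On point 3, a minimal sketch of selecting by a compound class name, assuming the cards carry the classes offer-card and horizontal as the question suggests:

# By.CLASS_NAME expects a single class name, so the full string
# 'offer-card horizontal' will not match the cards.
# Join the class names with dots in a CSS selector instead:
emp_offers = driver.find_elements(By.CSS_SELECTOR, '.offer-card.horizontal')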

0 votes

You can also try this approach, with a slight change:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://iefponline.iefp.pt/IEFP/pesquisas/search.do')

# set maximum time to wait for the page to load, in seconds
driver.implicitly_wait(15)

# click on dropdown menu to change results per page to 50
driver.find_element("xpath", '//*[@id="resultado"]/div/div[3]/div[4]/div/div/div/div').click()
driver.find_element('xpath', '//*[@id="resultado"]/div/div[3]/div[4]/div/div/div/ul/li[3]').click()

page_count = driver.find_element('xpath', '//*[@id="resultado"]/div/div[3]/div[2]/div/ul/li[5]/a').text
print(f"Total pages with 50 results per page: {page_count}")

emp_offers = driver.find_elements('xpath', "//*[contains(@id, 'ofertacard_')]")

emp_title_list = []
emp_id_list = []

for emp_offer in emp_offers:
    # split the card's visible text into lines
    lines = emp_offer.text.split('\n')
    offer_title = lines[0]
    offer_id = lines[2].split()[1]
    emp_title_list.append(offer_title)
    emp_id_list.append(offer_id)

print(emp_title_list)
print(emp_id_list)
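Since the results are injected dynamically, an explicit wait can also be more reliable than implicitly_wait: the script then proceeds only once the cards are actually attached to the DOM. A minimal sketch, reusing the WebDriverWait and expected_conditions imports already present in the question:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Wait up to 15 seconds for at least one job card to be present
wait = WebDriverWait(driver, 15)
emp_offers = wait.until(
    EC.presence_of_all_elements_located((By.XPATH, "//*[contains(@id, 'ofertacard_')]"))
)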