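The goal is to scrape the information from each job card in order to build a database. To do that, I tried the steps below.
So far I have tried to get the title from each of the job cards on the first page (10 cards), but the code returns either a list of empty values or an error message.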
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import time
import pandas as pd
# Instantiate the webdriver
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
# Define the URL
url = 'https://iefponline.iefp.pt/IEFP/pesquisas/search.do'
# Load the web page
driver.get(url)
# Set the maximum time to wait, in seconds
driver.implicitly_wait(15)
# Collect the data that is within the main ID block
contents = driver.find_element(By.ID, 'resultados-pesquisa')
# Find all elements with the class name 'offer-card horizontal'
emp_offers = contents.find_elements(By.CLASS_NAME, 'offer-card')
emp_title_list = []
emp_id_list = []
for emp_offer in emp_offers:
    offer_title = emp_offer.get_attribute('title')
    emp_title_list.append(offer_title)
    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-code")]/span[2]').text
    emp_id_list.append(offer_id)
print(emp_title_list)
print(emp_id_list)
# Close the WebDriver
driver.quit()
['', '', '', '', '', '', '', '', '', '']
[None, None, None, None, None, None, None, None, None, None]
or
"DevTools listening on ws://127.0.0.1:65003/devtools/browser/327b17e5-a97d-4d84-9ae0-c1c03122286a
Traceback (most recent call last):
File "c:\Users\dbelt\Documents\scrape\selenium_iefp.py", line 36, in <module>
offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-code")]/span[2]').text
File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 416, in find_element
return self._execute(Command.FIND_CHILD_ELEMENT, {"using": by, "value": value})["value"]
File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webelement.py", line 394, in _execute
return self._parent.execute(command, params)
File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 344, in execute
self.error_handler.check_response(response)
File "C:\Users\dbelt\anaconda3\lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 229, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.NoSuchElementException: Message: no such element: Unable to locate element: {"method":"xpath","selector":".//div[contains(@class, "offer-code")]/span[2]"}
(Session info: chrome=117.0.5938.134); For documentation on this error, please visit: https://www.selenium.dev/documentation/webdriver/troubleshooting/errors#no-such-element-exception
Stacktrace:
GetHandleVerifier [0x0062CFE3+45267]
(No symbol) [0x005B9741]
(No symbol) [0x004ABE1D]
(No symbol) [0x004DED30]
(No symbol) [0x004DF1FB]
(No symbol) [0x004D8041]
(No symbol) [0x004FB084]
(No symbol) [0x004D7F96]
(No symbol) [0x004FB2B4]
(No symbol) [0x0050DDDA]
(No symbol) [0x004FAE36]
(No symbol) [0x004D674E]
(No symbol) [0x004D78ED]
GetHandleVerifier [0x008E5659+2897737]
GetHandleVerifier [0x0092E78B+3197051]
GetHandleVerifier [0x00928571+3171937]
GetHandleVerifier [0x006B5E40+606000]
(No symbol) [0x005C338C]
(No symbol) [0x005BF508]
(No symbol) [0x005BF62F]
(No symbol) [0x005B1D27]
BaseThreadInitThunk [0x757B7BA9+25]
RtlInitializeExceptionChain [0x7711B79B+107]
RtlClearBits [0x7711B71F+191]"
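depending on whether I try to get the information with get_attribute or with XPath.
I also noticed that when I copy an XPath from this site, the path is much more relative than the ones I get on other websites.
Finally, many of the class names contain spaces, and I don't know the best way to use find_elements with class names like that.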
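It looks like emp_title_list contains only empty strings because the element you are selecting by its class has no 'title' attribute. You probably need to find the title element inside the element with that class.
As for the XPath problem, it looks like your XPath expression is incorrect.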
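If you want to locate an element by its class attribute and the element has several space-separated class names, join the class names with dots (.) instead of spaces and pass the result as a CSS selector with By.CSS_SELECTOR, for example '.offer-card.horizontal'. By.CLASS_NAME is intended for a single class name. Here is the corrected code: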
for emp_offer in emp_offers:
    # Find the title element within the offer using TAG_NAME and get its 'title' attribute
    offer_title = emp_offer.find_element(By.TAG_NAME, "a").get_attribute('title')
    emp_title_list.append(offer_title)
    # Corrected XPATH to find the offer_id element
    offer_id = emp_offer.find_element(By.XPATH, './/div[contains(@class, "offer-card-footer")]/div/div[2]/span[2]').text
    emp_id_list.append(offer_id)
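For the compound class names raised in the question, one option is to convert the class attribute as it appears in the HTML into a CSS selector. The helper below is only a minimal sketch: class_attr_to_css is a hypothetical name, not part of Selenium, and the 'offer-card horizontal' value comes from the comment in the question's code, not from checking the live page.

```python
def class_attr_to_css(class_attr: str, tag: str = "") -> str:
    """Turn a space-separated class attribute into a CSS selector.

    Hypothetical helper for illustration; it is not part of Selenium.
    """
    names = class_attr.split()  # split() also collapses repeated whitespace
    if not names:
        raise ValueError("class_attr must contain at least one class name")
    # Prefix every class name with a dot and join them without spaces
    return tag + "".join("." + name for name in names)


selector = class_attr_to_css("offer-card horizontal")
print(selector)  # .offer-card.horizontal
```

The resulting string can be passed as contents.find_elements(By.CSS_SELECTOR, selector), which matches only elements that carry every listed class.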
You can also try this approach, which changes things slightly:
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
driver.get('https://iefponline.iefp.pt/IEFP/pesquisas/search.do')
# wait up to 15 seconds for elements to be located (implicit wait)
driver.implicitly_wait(15)
# click on dropdown menu to change results per page to 50
driver.find_element("xpath", '//*[@id="resultado"]/div/div[3]/div[4]/div/div/div/div').click()
driver.find_element('xpath', '//*[@id="resultado"]/div/div[3]/div[4]/div/div/div/ul/li[3]').click()
page_count = driver.find_element('xpath', '//*[@id="resultado"]/div/div[3]/div[2]/div/ul/li[5]/a').text
print(f"Total pages with 50 results per page: {page_count}")
emp_offers = driver.find_elements('xpath', "//*[contains(@id, 'ofertacard_')]")
emp_title_list = []
emp_id_list = []
for emp_offer in emp_offers:
    # split the visible text of the card into lines
    lines = emp_offer.text.split('\n')
    offer_title = lines[0]
    offer_id = lines[2].split()[1]
    emp_title_list.append(offer_title)
    emp_id_list.append(offer_id)
print(emp_title_list)
print(emp_id_list)
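Indexing into the split text raises an IndexError as soon as a card has fewer lines than expected. A slightly more defensive variant of the same parsing is sketched below. parse_card_text is a hypothetical helper, and the sample string is made up to match the layout the loop above assumes (title on the first line, the offer number as the second word of the third line); it is not copied from the site.

```python
from typing import Dict, List, Optional


def parse_card_text(card_text: str) -> Dict[str, Optional[str]]:
    """Extract title and offer id from a card's visible text.

    Assumes the title is on line 1 and the id is the second word of line 3;
    returns None for the id when that line or word is missing.
    """
    lines = [line.strip() for line in card_text.split("\n")]
    title = lines[0]  # split() always yields at least one item
    offer_id = None
    if len(lines) > 2:
        parts = lines[2].split()
        if len(parts) > 1:
            offer_id = parts[1]
    return {"title": title, "offer_id": offer_id}


# Made-up sample text; the real card layout may differ.
sample = "Cozinheiro\nLisboa\nOferta 123456789\nTempo completo"
rows: List[Dict[str, Optional[str]]] = [
    parse_card_text(sample),
    parse_card_text("Only a title"),
]
print(rows)
```

Since the goal is a database, a list of dictionaries like rows can be handed to pandas.DataFrame(rows) to get one column per field.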