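I am scraping job postings from the web. Many sites directly include an "a" tag whose "href" attribute holds the link to the exact posting, and I can extract that link like this: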
link = job.find_element(By.TAG_NAME, 'a')
link = link.get_attribute('href')
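I am using Selenium. For ADP sites that host job postings in particular, I can't figure out how to get the link to a specific posting, because the listing element has no href.
The other information can be scraped fine. Here is an example that scrapes the company Metrea: I can get the information, but not the link.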
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from urllib.parse import urljoin
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
def scrape_metrea():
    # Set up Chrome options
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Ensure GUI is off
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")
    # Set up the Chrome webdriver service
    webdriver_service = Service(ChromeDriverManager().install())
    # Start the browser
    driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)
    # Open the webpage
    driver.get('https://workforcenow.adp.com/mascsr/default/mdf/recruitment/recruitment.html?cid=39bc5efc-a576-4a4d-83eb-32e95d4e96c8&ccId=19000101_000001&lang=en_US&selectedMenuKey=CurrentOpenings')
    # Set implicit wait
    driver.implicitly_wait(2)  # Wait up to 2 seconds for elements to be found
    try:
        # Get the dynamic content
        job_listings = driver.find_elements(By.CLASS_NAME, 'current-openings-item')
        jobs_list = []
        for job in job_listings:
            location = ''
            job_info = {}
            title = job.find_element(By.CSS_SELECTOR, 'span.current-opening-title').text
            location = job.find_element(By.CSS_SELECTOR, 'label.current-opening-location-item').text
            date = job.find_element(By.CSS_SELECTOR, 'span.current-opening-post-date').text
            job_info['Company'] = 'Metrea'
            job_info['Job Title'] = title
            job_info['Job Link'] = None
            job_info['Location'] = location
            job_info['Date Posted'] = date
            job_info['ID'] = None
            job_info['Category'] = None
            job_info['Job Type'] = None
            jobs_list.append(job_info)
            # NO LINK, CAN'T FIND OUT HOW TO PULL IT
    except Exception:
        jobs_list = []
        message = {}
        message['Message'] = 'Error Scraping Company Metrea'
        jobs_list.append(message)
        print('Error scraping metrea')
    print('scraping Metrea')
    # Close the browser
    driver.quit()
    return jobs_list
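I have been trying to use the href attribute, but the element doesn't have one, so I'm confused. I am also considering going through the postings, clicking each one, and then grabbing the link, but I'm a bit lost as to whether there is a better way to get the links, since they are not visible.
Try something like this: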
import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
options = Options()
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")
driver = webdriver.Remote(
    "http://127.0.0.1:4444/wd/hub",
    options=options
)
driver.get('https://workforcenow.adp.com/mascsr/default/mdf/recruitment/recruitment.html?cid=39bc5efc-a576-4a4d-83eb-32e95d4e96c8&ccId=19000101_000001&lang=en_US&selectedMenuKey=CurrentOpenings')
jobs_list = []
# Wait for job listings to show.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".current-openings-item"))
)
# Get all job listings.
job_listings = driver.find_elements(By.CSS_SELECTOR, ".current-openings-item")
for job in job_listings:
    title = job.find_element(By.CSS_SELECTOR, 'span.current-opening-title').text
    location = job.find_element(By.CSS_SELECTOR, 'label.current-opening-location-item').text
    date = job.find_element(By.CSS_SELECTOR, 'span.current-opening-post-date').text
    jobs_list.append({
        "Company": "Metrea",
        "Job Title": title,
        "Job Link": None,
        "Location": location,
        "Date Posted": date,
        "ID": None,
        "Category": None,
        "Job Type": None
    })
driver.quit()
print(json.dumps(jobs_list, indent=2))
The key is to wait for the content to show up on the page before you start trying to scrape anything.
🚨 I prefer to run Selenium in a Docker container and connect to it as a remote webdriver. You can change the webdriver.Remote() call back to webdriver.Chrome().
Output of the script (first two elements):
[
  {
    "Company": "Metrea",
    "Job Title": "Commercial Associate",
    "Job Link": null,
    "Location": "MAM HQ, Washington, DC, US",
    "Date Posted": "Yesterday",
    "ID": null,
    "Category": null,
    "Job Type": null
  },
  {
    "Company": "Metrea",
    "Job Title": "Accounts Payable Administrator",
    "Job Link": null,
    "Location": "Oklahoma, Bethany, OK, US",
    "Date Posted": "Yesterday",
    "ID": null,
    "Category": null,
    "Job Type": null
  }
]
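The output above still has "Job Link" set to null. One possible way to fill it, along the lines the question already suggests, is to click each posting and read driver.current_url afterwards. The sketch below is untested against the ADP page and rests on two assumptions: that clicking the title element opens the posting, and that the address bar URL changes when it does. The names wait_for and collect_job_links are made up for illustration. The string "css selector" is the value that By.CSS_SELECTOR holds, and the small polling helper stands in for WebDriverWait so the sketch needs nothing beyond the standard library.

```python
import time

CSS = "css selector"  # the string value behind selenium's By.CSS_SELECTOR


def wait_for(condition, timeout=10.0, poll=0.25):
    """Call condition() until it returns something truthy or time runs out.

    Returns the truthy value, or None on timeout.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)


def collect_job_links(driver, listing_url, timeout=10.0):
    """Click each posting in turn and record the URL the browser lands on.

    Returns a list of (title, url) pairs. url is None when the click did not
    change the address within `timeout` seconds.
    """
    results = []
    index = 0
    while True:
        # Reload the listing on every pass: element references obtained
        # before a navigation go stale afterwards.
        driver.get(listing_url)
        # An empty list is falsy, so this polls until the items have rendered.
        items = wait_for(
            lambda: driver.find_elements(CSS, ".current-openings-item"),
            timeout,
        ) or []
        if index >= len(items):
            break
        start_url = driver.current_url
        # Assumption: the title element is the clickable part of the item.
        title_el = items[index].find_element(CSS, "span.current-opening-title")
        title = title_el.text
        title_el.click()
        # Assumption: opening a posting changes the address bar URL.
        changed = wait_for(lambda: driver.current_url != start_url, timeout)
        results.append((title, driver.current_url if changed else None))
        index += 1
    return results
```

Called as collect_job_links(driver, url) before driver.quit(), the returned pairs could be matched to jobs_list by title. With Selenium's helpers imported as in the script above, WebDriverWait(driver, 10).until(EC.url_changes(start_url)) would be the more usual way to wait for the address to change; note that it raises TimeoutException on timeout instead of returning None. Reloading the listing once per posting is slow, so this is only worth it if the page really exposes no href or ID in its markup.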