Web scraping links


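I'm scraping job postings from the web. Many sites include an 'a' tag with an 'href' attribute that links directly to the individual posting, and I can extract that link like this: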

link = job.find_element(By.TAG_NAME, 'a')
link = link.get_attribute('href')
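
When the same loop runs across many sites, some listing elements may have no anchor at all. A minimal sketch of a helper that returns None in that case instead of raising is shown here. The name `first_href` is hypothetical, and the string `"tag name"` is the value of Selenium's `By.TAG_NAME`, written out so the helper depends only on the element passed in. It relies on `find_elements` returning an empty list when nothing matches, whereas `find_element` raises `NoSuchElementException`.

```python
TAG_NAME = "tag name"  # same string as selenium's By.TAG_NAME


def first_href(job):
    """Return the href of the first <a> inside `job`, or None if there is none."""
    for anchor in job.find_elements(TAG_NAME, "a"):
        href = anchor.get_attribute("href")
        if href:  # skip anchors that have no href (get_attribute returns None)
            return href
    return None
```

With this, `job_info['Job Link'] = first_href(job)` would simply store None for listings like the ADP ones that carry no href.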

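I'm using Selenium. For ADP sites that host job postings in particular, I can't figure out how to get the link to a specific posting, because the listing doesn't contain an href.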

The other information can be scraped without trouble. Here is an example that scrapes the company Metrea:

https://workforcenow.adp.com/mascsr/default/mdf/recruitment/recruitment.html?cid=39bc5efc-a576-4a4d-83eb-32e95d4e96c8&ccId=19000101_000001&lang=en_US&selectedMenuKey=CurrentOpenings

Here I can get the information, but not the link.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.chrome.options import Options
from webdriver_manager.chrome import ChromeDriverManager
from urllib.parse import urljoin
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def scrape_metrea():
    # Setup chrome options
    chrome_options = Options()
    chrome_options.add_argument("--headless")  # Ensure GUI is off
    chrome_options.add_argument("--no-sandbox")
    chrome_options.add_argument("--disable-dev-shm-usage")

    # Set up the Chrome webdriver service
    webdriver_service = Service(ChromeDriverManager().install())

    # Start the browser
    driver = webdriver.Chrome(service=webdriver_service, options=chrome_options)

    # Open the webpage
    driver.get('https://workforcenow.adp.com/mascsr/default/mdf/recruitment/recruitment.html?cid=39bc5efc-a576-4a4d-83eb-32e95d4e96c8&ccId=19000101_000001&lang=en_US&selectedMenuKey=CurrentOpenings')

    # Set implicit wait
    driver.implicitly_wait(2)  # Wait up to 2 seconds for elements to be found

    try:
        # Get the dynamic content
        job_listings = driver.find_elements(By.CLASS_NAME, 'current-openings-item')

        jobs_list = []

        for job in job_listings:
            title = job.find_element(By.CSS_SELECTOR, 'span.current-opening-title').text
            location = job.find_element(By.CSS_SELECTOR, 'label.current-opening-location-item').text
            date = job.find_element(By.CSS_SELECTOR, 'span.current-opening-post-date').text

            job_info = {}
            job_info['Company'] = 'Metrea'
            job_info['Job Title'] = title
            # No link: can't figure out how to pull it
            job_info['Job Link'] = None
            job_info['Location'] = location
            job_info['Date Posted'] = date
            job_info['ID'] = None
            job_info['Category'] = None
            job_info['Job Type'] = None

            jobs_list.append(job_info)
    except Exception as exc:
        # Catch Exception rather than using a bare except, and report the cause
        jobs_list = [{'Message': 'Error Scraping Company Metrea'}]
        print(f'Error scraping metrea: {exc}')
    finally:
        # Close the browser even if scraping failed
        driver.quit()

    print('scraping Metrea')
    return jobs_list

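I've been trying to read the href attribute, but the element doesn't have one, so I'm stuck. I'm also considering looping through the listings, clicking each one, and then grabbing the link, but I'm a bit lost as to whether there is a better way to get the links, since they aren't visible.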
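
The click-through idea mentioned above could be sketched roughly as follows. This is only a sketch under a stated assumption, not a verified solution for the ADP page: it assumes that clicking a listing changes `driver.current_url` to a URL specific to that posting, which may not hold for this site. The function name `collect_links_by_clicking` is hypothetical, and `"css selector"` is the string value of Selenium's `By.CSS_SELECTOR`.

```python
CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def collect_links_by_clicking(driver, item_selector):
    """Click each listing in turn and record the URL the browser ends up on.

    Assumption: clicking a listing navigates to a posting-specific URL.
    If the URL does not change after the click, None is stored instead.
    """
    links = []
    start_url = driver.current_url
    count = len(driver.find_elements(CSS, item_selector))

    for index in range(count):
        # Re-find the items on every pass: element references obtained
        # before a navigation usually go stale once the page re-renders.
        items = driver.find_elements(CSS, item_selector)
        if index >= len(items):
            break
        items[index].click()

        url = driver.current_url
        if url != start_url:
            links.append(url)
            driver.back()  # return to the listing page for the next item
        else:
            links.append(None)

    return links
```

On a real page the URL check would normally need an explicit wait after the click and again after `driver.back()`, since both trigger asynchronous rendering.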

selenium-webdriver web-scraping
1 Answer

Try something like this:

import json
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

options = Options()
options.add_argument("--no-sandbox")
options.add_argument("--disable-dev-shm-usage")

driver = webdriver.Remote(
    "http://127.0.0.1:4444/wd/hub",
    options=options
)

driver.get('https://workforcenow.adp.com/mascsr/default/mdf/recruitment/recruitment.html?cid=39bc5efc-a576-4a4d-83eb-32e95d4e96c8&ccId=19000101_000001&lang=en_US&selectedMenuKey=CurrentOpenings')

jobs_list = []

# Wait for job listings to show.
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, ".current-openings-item"))
)
# Get all job listings.
job_listings = driver.find_elements(By.CSS_SELECTOR, ".current-openings-item")

for job in job_listings:
    title = job.find_element(By.CSS_SELECTOR, 'span.current-opening-title').text
    location = job.find_element(By.CSS_SELECTOR, 'label.current-opening-location-item').text
    date = job.find_element(By.CSS_SELECTOR, 'span.current-opening-post-date').text

    jobs_list.append({
        "Company": "Metrea",
        "Job Title": title,
        "Job Link": None,
        "Location": location,
        "Date Posted": date,
        "ID": None,
        "Category": None,
        "Job Type": None
    })

driver.quit()

print(json.dumps(jobs_list, indent=2))

The key is to wait for the content to appear on the page before you start trying to scrape anything.
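
To illustrate what such a wait does, here is a minimal plain-Python sketch of the polling idea behind an explicit wait. The helper `wait_for_elements` is hypothetical and is not part of Selenium. In a real script, `WebDriverWait(driver, 10).until(...)` as used in the code above does this job. The string `"css selector"` is the value of `By.CSS_SELECTOR`.

```python
import time

CSS = "css selector"  # same string as selenium's By.CSS_SELECTOR


def wait_for_elements(driver, selector, timeout=10.0, poll=0.5):
    """Poll until at least one element matches `selector`, then return all matches.

    Raises TimeoutError if nothing matches within `timeout` seconds.
    """
    deadline = time.monotonic() + timeout
    while True:
        elements = driver.find_elements(CSS, selector)
        if elements:
            return elements
        if time.monotonic() >= deadline:
            raise TimeoutError(f"no element matched {selector!r} within {timeout}s")
        time.sleep(poll)
```

Calling `wait_for_elements(driver, ".current-openings-item")` before the loop would serve the same purpose as the wait-then-find pair in the script.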

🚨 I prefer to run Selenium in a Docker container and connect to it as a remote WebDriver. You can change the webdriver.Remote() call back to webdriver.Chrome().

Output of the script (first two elements):

[
  {
    "Company": "Metrea",
    "Job Title": "Commercial Associate",
    "Job Link": null,
    "Location": "MAM HQ, Washington, DC, US",
    "Date Posted": "Yesterday",
    "ID": null,
    "Category": null,
    "Job Type": null
  },
  {
    "Company": "Metrea",
    "Job Title": "Accounts Payable Administrator",
    "Job Link": null,
    "Location": "Oklahoma, Bethany, OK, US",
    "Date Posted": "Yesterday",
    "ID": null,
    "Category": null,
    "Job Type": null
  }
]