Selenium: how do I scrape the links inside a list?

Question (votes: 0, answers: 2)

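I have this code, and it successfully scrapes all the listing links from the site's page.

The problem is that I can't then visit each listing link and scrape the data inside the listing page.

Here is my code so far: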

from selenium import webdriver
#to enable Wait for Page Loading
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#to enable scrolling
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests
import time

#Initialize headless browser
options = webdriver.ChromeOptions()
#options.add_argument('--headless')  # Run Chrome in headless mode
driver = webdriver.Chrome(options=options)

url = "https://www.archify.com/id/professionals"
driver.get(url)

#Click Load More Button
l = driver.find_element("xpath", "//button[text()='Load More']")
l.click()

# scroll for 2 times while waiting 1 second between each scroll
for i in range(2):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)

# get HTML
soup = BeautifulSoup(driver.page_source, 'html.parser')

# Extract listing links
product_elements = soup.find_all('div', class_='professional-box') # find all div element 
product_link = []
for product_element in product_elements:
    content = product_element.find('div', class_='text-box type-a') 
    if content:
        link = content.find('a').get('href')    #get link
        product_link.append({'link':link})

#Visit all listing links and scrape the data
product_info =[]
for product in product_link:
    driver.get(product_link)
    btn = driver.find_elements(By.CLASS_NAME, "text-box left-pad-25")
    btn.click()
    information = product.find('div', class_='category-list menu-left-area') 
    if information:
        name = information.find('div', class_='text-box').text   #name
        phone = information.find('div', class_='left-phone phone-number').text   #phone number
        website = information.find('div', class_='left-website phone-number').text   #website
        instagram = information.find('div', class_='left-instagram phone-number').get('href')   #insta 
        facebook = information.find('div', class_='left-facebook phone-number').get('href')   #fb link
        whatsapp = information.find('div', class_='left-whatsapp phone-number').get('href')   #wa link
        product_info.append({'Name': name, 'Phone': phone, 'Web': website, 'Insta': instagram, 'FB': facebook, 'WA': whatsapp})   #append all data
    
driver.quit

I tried passing the product_link list to driver.get(product_link), but it raises an error.

The traceback is:

DevTools listening on ws://127.0.0.1:51136/devtools/browser/3aeae395-36b0-4612-9a75-49c062e6e8eb
Traceback (most recent call last):
  File "c:\Users\user\Desktop\Code\Scrape_Selenium.py", line 68, in <module>
    driver.get(product_link)
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 356, in get  
    self.execute(Command.GET, {"url": url})
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 347, in execute
    self.error_handler.check_response(response)
  File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 229, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: 'url' must be a string
  (Session info: chrome=122.0.6261.112)
Stacktrace:
        GetHandleVerifier [0x00007FF64228AD02+56930]
        (No symbol) [0x00007FF6421FF602]
        (No symbol) [0x00007FF6420B42E5]
        (No symbol) [0x00007FF642138C1A]
        (No symbol) [0x00007FF64211BC9A]
        (No symbol) [0x00007FF6421381E2]
        (No symbol) [0x00007FF64211BA43]
        (No symbol) [0x00007FF6420ED438]
        (No symbol) [0x00007FF6420EE4D1]
        GetHandleVerifier [0x00007FF642606F8D+3711213]
        GetHandleVerifier [0x00007FF6426604CD+4077101]
        GetHandleVerifier [0x00007FF64265865F+4044735]
        GetHandleVerifier [0x00007FF642329736+706710]
        (No symbol) [0x00007FF64220B8DF]
        (No symbol) [0x00007FF642206AC4]
        (No symbol) [0x00007FF642206C1C]
        (No symbol) [0x00007FF6421F68D4]
        BaseThreadInitThunk [0x00007FFA3C817344+20]
        RtlUserThreadStart [0x00007FFA3E4426B1+33]
selenium-webdriver
2 Answers
0 votes

Change these lines:

for product in product_link:
    driver.get(product_link)

to:

for product in product_link:
    driver.get(product['link'])
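
To illustrate why this works: the question builds product_link as a list of dicts, so each loop item is a dict and the URL string sits under its 'link' key. The sketch below uses made-up example.com URLs and a hypothetical helper name (urls_to_visit); it only shows the data shape, not a real browser session.

```python
# Hypothetical stand-in for the scraped data: a list of dicts,
# shaped like the product_link list built in the question.
product_link = [
    {"link": "https://example.com/a"},
    {"link": "https://example.com/b"},
]


def urls_to_visit(items):
    """Return the URL string stored under 'link' in each dict."""
    return [item["link"] for item in items]


urls = urls_to_visit(product_link)

# Every entry is now a plain str, which is what driver.get() expects.
all_strings = all(isinstance(url, str) for url in urls)
```

Passing the whole list (or one of its dicts) to driver.get() is what triggers "'url' must be a string"; passing product['link'] hands it a single string.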

0 votes

You are iterating over the list (product_link) with a for loop:

for product in product_link:
    driver.get(product_link)

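So inside the loop you should pass the current item, not the whole list. Because each item here is a dict of the form {'link': ...}, use

driver.get(product['link'])
instead of
driver.get(product_link)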
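
As a rough sketch of the whole visit-and-scrape loop, the function below loads each URL string and hands the resulting page HTML to a parser. The names scrape_all, parse_page and FakeDriver are hypothetical and not part of Selenium; FakeDriver only mimics the .get() and .page_source members of a WebDriver so the example runs without a browser.

```python
def scrape_all(driver, urls, parse_page):
    """Visit each URL and return one parsed record per page.

    driver     -- any object with .get(url) and .page_source
                  (for example a Selenium WebDriver)
    parse_page -- hypothetical callback that turns page HTML into a dict
    """
    records = []
    for url in urls:
        driver.get(url)            # url is a str, one page per iteration
        html = driver.page_source  # HTML of the page that was just loaded
        records.append(parse_page(html))
    return records


class FakeDriver:
    """Tiny stand-in for a WebDriver so the sketch runs without a browser."""

    def __init__(self, pages):
        self._pages = pages
        self.page_source = ""

    def get(self, url):
        if not isinstance(url, str):
            raise TypeError("'url' must be a string")
        self.page_source = self._pages[url]


# Made-up pages keyed by made-up URLs.
pages = {
    "https://example.com/a": "<h1>A</h1>",
    "https://example.com/b": "<h1>B</h1>",
}
product_link = [{"link": url} for url in pages]

result = scrape_all(
    FakeDriver(pages),
    [product["link"] for product in product_link],
    lambda html: {"html": html},
)
```

With the real driver from the question, parse_page would presumably build BeautifulSoup(html, 'html.parser') and look up the fields there, since calling .find() on the product dict itself would not search the page.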
