So I have this code, and I can successfully scrape all the listing links from the website page.
The problem is that I can't go through each of those listing links and scrape the data inside them.
Here is my code so far:
from selenium import webdriver
#to enable Wait for Page Loading
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
#to enable scrolling
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import requests
import time
#Initialize headless browser
options = webdriver.ChromeOptions()
#options.add_argument('--headless') # Run Chrome in headless mode
driver = webdriver.Chrome(options=options)
url = "https://www.archify.com/id/professionals"
driver.get(url)
#Click Load More Button
l = driver.find_element("xpath", "//button[text()='Load More']")
l.click()
# scroll for 2 times while waiting 1 second between each scroll
for i in range(2):
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(1)
# get HTML
soup = BeautifulSoup(driver.page_source, 'html.parser')
# Extract listing links
product_elements = soup.find_all('div', class_='professional-box') # find all div element
product_link = []
for product_element in product_elements:
    content = product_element.find('div', class_='text-box type-a')
    if content:
        link = content.find('a').get('href') #get link
        product_link.append({'link':link})
#Visit all listing links and scrape the data
product_info = []
for product in product_link:
    driver.get(product_link)
    btn = driver.find_elements(By.CLASS_NAME, "text-box left-pad-25")
    btn.click()
    information = product.find('div', class_='category-list menu-left-area')
    if information:
        name = information.find('div', class_='text-box').text #name
        phone = information.find('div', class_='left-phone phone-number').text #phone number
        website = information.find('div', class_='left-website phone-number').text #website
        instagram = information.find('div', class_='left-instagram phone-number').get('href') #insta
        facebook = information.find('div', class_='left-facebook phone-number').get('href') #fb link
        whatsapp = information.find('div', class_='left-whatsapp phone-number').get('href') #wa link
        product_info.append({'Name': name, 'Phone': phone, 'Web': website, 'Insta': instagram, 'FB': facebook, 'WA': whatsapp}) #append all data
driver.quit
I tried calling the list with driver.get(product_link), but it throws an error.
The traceback is:
DevTools listening on ws://127.0.0.1:51136/devtools/browser/3aeae395-36b0-4612-9a75-49c062e6e8eb
Traceback (most recent call last):
File "c:\Users\user\Desktop\Code\Scrape_Selenium.py", line 68, in <module>
driver.get(product_link)
File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 356, in get
self.execute(Command.GET, {"url": url})
File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 347, in execute
self.error_handler.check_response(response)
File "C:\Users\user\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 229, in check_response
raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidArgumentException: Message: invalid argument: 'url' must be a string
(Session info: chrome=122.0.6261.112)
Stacktrace:
GetHandleVerifier [0x00007FF64228AD02+56930]
(No symbol) [0x00007FF6421FF602]
(No symbol) [0x00007FF6420B42E5]
(No symbol) [0x00007FF642138C1A]
(No symbol) [0x00007FF64211BC9A]
(No symbol) [0x00007FF6421381E2]
(No symbol) [0x00007FF64211BA43]
(No symbol) [0x00007FF6420ED438]
(No symbol) [0x00007FF6420EE4D1]
GetHandleVerifier [0x00007FF642606F8D+3711213]
GetHandleVerifier [0x00007FF6426604CD+4077101]
GetHandleVerifier [0x00007FF64265865F+4044735]
GetHandleVerifier [0x00007FF642329736+706710]
(No symbol) [0x00007FF64220B8DF]
(No symbol) [0x00007FF642206AC4]
(No symbol) [0x00007FF642206C1C]
(No symbol) [0x00007FF6421F68D4]
BaseThreadInitThunk [0x00007FFA3C817344+20]
RtlUserThreadStart [0x00007FFA3E4426B1+33]
Change this:
for product in product_link:
    driver.get(product_link)
to:
for product in product_link:
    driver.get(product['link'])
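A couple of other issues further down the loop will bite you right after this one: find_elements returns a list (and By.CLASS_NAME only takes a single class name, so "text-box left-pad-25" will not match), product is a dict rather than parsed HTML, and driver.quit needs parentheses. Here is a minimal sketch of the corrected detail-page loop, assuming the class names from your code actually match the site, and joining each href with the base URL in case it is relative:

from urllib.parse import urljoin

product_info = []
for product in product_link:
    driver.get(urljoin(url, product['link']))  # each entry is a dict, so pull the URL string out of it
    # find_elements returns a list; a compound class needs a CSS selector, not By.CLASS_NAME
    buttons = driver.find_elements(By.CSS_SELECTOR, ".text-box.left-pad-25")
    if buttons:
        buttons[0].click()
    # re-parse the detail page before calling .find on it
    detail_soup = BeautifulSoup(driver.page_source, 'html.parser')
    information = detail_soup.find('div', class_='category-list menu-left-area')
    if information:
        name = information.find('div', class_='text-box')
        phone = information.find('div', class_='left-phone phone-number')
        product_info.append({
            'Name': name.text if name else None,
            'Phone': phone.text if phone else None,
            # same pattern for website, instagram, facebook and whatsapp
        })
driver.quit()  # note the parentheses; driver.quit without them does nothing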
You are iterating over the list (product_link) with a for loop:
for product in product_link:
    driver.get(product_link)
so inside the loop you should use the loop variable, not the whole list. Each element of product_link is a dict of the form {'link': ...}, so that means
driver.get(product['link'])
instead of driver.get(product_link). (If you append the plain href strings rather than dicts, driver.get(product) works as well.)
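If you prefer that last variant, here is a small sketch of the collection loop adjusted to store plain URL strings (selectors unchanged from your code):

product_link = []
for product_element in product_elements:
    content = product_element.find('div', class_='text-box type-a')
    if content:
        product_link.append(content.find('a').get('href'))  # store the href string directly

for product in product_link:
    driver.get(product)  # the loop variable is already a string now
    # ...scrape the detail page here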