Problem with web scraping using Selenium and BeautifulSoup

Question (votes: 0, answers: 1)

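I am building a price comparison website for my university project. I am trying to print the items and prices from this website, https://www.lotuss.com.my/en/category/fresh-product?sort=relevance:DESC, but I get the following error.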

Exception has occurred: TypeError
'NoneType' object is not callable
  File "C:\xampp\htdocs\Price\test.py", line 36, in <module>
    grocery_items = soup.findall('div', class_='product-grid-item')
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not callable

Here is the code:

from bs4 import BeautifulSoup
import requests
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

chrome_options = Options()

chrome_options.add_argument("--disable-gpu")
chrome_options.add_argument("--no-sandbox")
chrome_options.add_argument("--disable-dev-shm-usage")

service = Service(executable_path='C:/chromedriver/chromedriver.exe')
driver = webdriver.Chrome(service=service, options=chrome_options)

# Open the webpage
driver.get('https://www.lotuss.com.my/en/category/fresh-produce?sort=relevance:DESC')

# Wait for the page to fully load
try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, "iframe"))
    )
    print("Please solve the CAPTCHA manually in the opened browser window.")
finally:

    input("Press Enter after solving the CAPTCHA...")

    html_text = driver.page_source

    driver.quit()
soup = BeautifulSoup(html_text, 'lxml')
grocery_items = soup.findall('div', class_='product-grid-item')
grocery_price = soup.findall('span', class_='sc-kHxTfl hwpbzy')

print(grocery_items)
print(grocery_price)

Tags: python selenium-webdriver web-scraping beautifulsoup
1 Answer

Votes: 1

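The error occurs because the wrong method is called on the soup object: the method is find_all, not findall. BeautifulSoup treats an unknown attribute name such as findall as a lookup for a tag with that name. The page has no <findall> tag, so soup.findall evaluates to None, and calling None raises "TypeError: 'NoneType' object is not callable".

The code should look like this: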

grocery_items = soup.find_all('div', class_='product-grid-item')
grocery_price = soup.find_all('span', class_='sc-kHxTfl hwpbzy')
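
To make the cause of the TypeError concrete, here is a minimal sketch that reproduces it without Selenium. The sample markup is hypothetical and only stands in for the real page; it uses the built-in "html.parser" so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup, used only to illustrate the lookup behaviour.
html = (
    '<div class="product-grid-item">Apple</div>'
    '<div class="product-grid-item">Banana</div>'
)
soup = BeautifulSoup(html, "html.parser")

# An unknown attribute name is treated as a tag name, so soup.findall
# behaves like soup.find("findall") and returns None here.
missing = soup.findall
print(missing)

# Calling that None value is what raises the TypeError from the question.
try:
    soup.findall("div", class_="product-grid-item")
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)

# The correctly spelled method returns the matching tags.
items = soup.find_all("div", class_="product-grid-item")
print(len(items))
```

The same mechanism explains why any misspelled BeautifulSoup method name tends to fail with this message rather than with an AttributeError.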

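I tested your code and found a few more problems. After fixing the error raised in this question, you may notice that nothing is printed to the console. Follow these steps: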

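  1. By default, Selenium does not open the browser window maximized. This can leave some elements outside the visible area, so not all target elements may be found. Use the following code to maximize the Chrome window: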

    driver.get('https://www.lotuss.com.my/en/category/fresh-produce?sort=relevance:DESC')
    driver.maximize_window()
    
  2. The last lines of the code print only the raw HTML elements:

    print(grocery_items)
    print(grocery_price)
    

    You need to print the text content of those elements instead. Use the following code:

    for item in grocery_items:
        print(item.get_text(strip=True))
    
    for price in grocery_price:
        print(price.get_text(strip=True))
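
The parsing steps above can be pulled together into one small function. This is only a sketch under stated assumptions: extract_products and sample_html are hypothetical names introduced here, the sample markup is invented to mimic the class names from the question, and pairing names with prices by position assumes both lists come back in the same order and with the same length, which the real page may not guarantee.

```python
from bs4 import BeautifulSoup

# Class names taken from the question; they may change when the site is rebuilt.
ITEM_CLASS = "product-grid-item"
PRICE_CLASS = "sc-kHxTfl hwpbzy"


def extract_products(html_text):
    """Return (name, price) text pairs found in the given HTML (hypothetical helper)."""
    soup = BeautifulSoup(html_text, "html.parser")
    names = [tag.get_text(strip=True)
             for tag in soup.find_all("div", class_=ITEM_CLASS)]
    prices = [tag.get_text(strip=True)
              for tag in soup.find_all("span", class_=PRICE_CLASS)]
    # Assumption: the n-th price belongs to the n-th item.
    return list(zip(names, prices))


# Invented sample markup standing in for driver.page_source.
sample_html = """
<div class="product-grid-item"> Red Apple </div>
<span class="sc-kHxTfl hwpbzy"> RM 4.99 </span>
<div class="product-grid-item"> Banana </div>
<span class="sc-kHxTfl hwpbzy"> RM 3.50 </span>
"""

products = extract_products(sample_html)
for name, price in products:
    print(f"{name}: {price}")
```

In the original script, the function would be called with html_text after driver.quit(). If the price element actually sits inside each product element on the live page, looking it up per item (item.find on each result) would likely be more robust than pairing two separate lists.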
    