Scrapy spider doesn't work with multiple URLs


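I wrote a Scrapy spider that uses Selenium to scrape products from "devgrossonline.com".

It doesn't work when I give it multiple category URLs, but it works when I give it only one. Any help would be appreciated.

Here is my spider: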

import time
from datetime import datetime

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


class DevgrossSpider(scrapy.Spider):
    name = 'devgross'
    allowed_domains = ['devgrossonline.com']
    start_urls = [
        'https://devgrossonline.com/sut-kahvaltilik',
        'https://devgrossonline.com/meyve-sebze',
        'https://devgrossonline.com/et-sarkuteri'
    ]

    def __init__(self, *args, **kwargs):
        super(DevgrossSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
        self.driver.maximize_window()

    def closed(self):
        self.driver.quit()

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(5)

        # find how many pages there are
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.pagination'))
        )
        page_count_selector = "ul.pagination li:last-child"
        page_count = self.driver.find_element(By.CSS_SELECTOR, page_count_selector).text
        try:
            page_count = int(page_count)
        except:
            page_count = 1

        # page iteration
        for i in range(1, page_count + 1):
            # go next page
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.pagination'))
            )
            if i != 1:
                current_page = self.driver.find_element(By.CSS_SELECTOR, 'a#secili-sayfa')
                next_page = current_page.find_element(By.XPATH, 'following::li/a[@onclick][1]')
                next_page.click()
                time.sleep(1)

            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.pagination')))
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'div.urun-kutusu')))
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.ID, 'cat_rw_05')))
            time.sleep(2)

            product_elements = self.driver.find_elements(By.CSS_SELECTOR, 'div.urun-kutusu')

            for product_element in product_elements:
                product_name = product_element.find_element(By.CSS_SELECTOR, 'h2 a').text
                product_price = product_element.find_element(By.CSS_SELECTOR, 'div.urun-fiyat span').text
                try:
                    product_price = float(product_price.replace(' TL', '').replace(',', '.'))
                except:
                    product_price = -1.0
                product_url = product_element.find_element(By.CSS_SELECTOR,
                                                           'div.kutu-urun-resmi a.kutu-link').get_attribute('href')

                yield {
                    'name': product_name,
                    'price': product_price,
                    'URL': product_url,
                    'date': datetime.now(),
                }

I have been working on this for hours, but I haven't been able to solve it.

selenium-webdriver web-scraping scrapy web-crawler
1 Answer

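I'd respectfully suggest that you consider using Playwright instead of Selenium with Scrapy. Playwright offers functionality similar to Selenium but, IMHO, its integration with Scrapy is more up to date.

Install these packages: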

scrapy-playwright==0.0.34
beautifulsoup4==4.12.3

Install Chrome using Playwright.

playwright install --force chrome

settings.py

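# Route HTTP and HTTPS requests through Playwright.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
# scrapy-playwright needs the asyncio-based Twisted reactor
# (projects generated by newer Scrapy versions may already set this).
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000  # milliseconds
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": False}  # show the browser window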

And a simple spider that opens all three start pages.

import scrapy
from bs4 import BeautifulSoup


class DevgrossSpider(scrapy.Spider):
    name = "devgross"
    allowed_domains = ["devgrossonline.com"]
    start_urls = [
        'https://devgrossonline.com/sut-kahvaltilik',
        'https://devgrossonline.com/meyve-sebze',
        'https://devgrossonline.com/et-sarkuteri'
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                ),
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]

        await page.wait_for_selector("ul.pagination")

        html = await page.content()
        soup = BeautifulSoup(html, "html.parser")

        await page.close()

        products = soup.select("div.urun-kutusu")

        for product in products:
            name = product.select_one("h2 a").string.strip()
            price = product.select_one("div.urun-fiyat span").string.strip()

            yield {
                "name": name,
                "price": price,
            }
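The spider above yields the price as raw text such as '120,89 TL', whereas the spider in the question converted it to a float. If a numeric price is wanted, a small helper along these lines might do. The name `parse_price` is hypothetical, and the treatment of '.' as a thousands separator is an assumption about the site's formatting, not something confirmed by the output.

```python
def parse_price(text):
    """Convert a price string such as '120,89 TL' to a float.

    Returns None when the text cannot be parsed.
    """
    cleaned = text.replace("TL", "").strip()
    # Assumption: '.' is a thousands separator and ',' is the decimal mark.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None
```

In `parse`, this would be used as `"price": parse_price(price)`.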

Pagination is not implemented yet, but you can see in the output below that all three start pages were opened.

2024-07-02 06:09:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://devgrossonline.com/sut-kahvaltilik> (referer: None) ['playwright']
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'VADİM TOST PEYNİRİ 600 GR', 'price': '120,89 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'TAHSİLDAROĞLU ÇEÇİL PEYNİRİ 180 GR', 'price': '87,89 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'TAHSİLDAROĞLU ÖRGÜ PEYNİRİ 180 GR', 'price': '87,89 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'YÖRE YEŞİL ÇİZİK ZEYTİN (351-400) KG', 'price': '153,89 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'SÜTAŞ ÇİLEKLİ SÜT 6X180 ML', 'price': '49,39 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'SÜTAŞ MUZLU SÜT 6X180 ML', 'price': '49,39 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'TORKU TEREYAĞI RULO 750 GR', 'price': '258,39 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'TEMPO FINDIK EZMELİ BAR 24 GR', 'price': '16,39 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'TEMPO ANTEP EZME 24 GR', 'price': '29,59 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'YAYMER PANCAR PEKMEZİ 450 GR', 'price': '87,89 TL'}
2024-07-02 06:09:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://devgrossonline.com/meyve-sebze> (referer: None) ['playwright']
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'KANDEMİR MANTAR 400 GR', 'price': '58,19 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'ZÜMRÜT KUŞ ÜZÜMÜ 40G', 'price': '18,59 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'İĞDE KG', 'price': '164,89 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'SARI ÜZÜM 11 NO KG', 'price': '307,89 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'SİYAH ÜZÜM ÖZEL KG', 'price': '274,89 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'YER ÇİLEĞİ KG', 'price': '247,39 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'YARIM YABAN MERSİNİ KG', 'price': '313,39 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'DOMATES PEMBE KG', 'price': '43,99 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'SOĞAN ARPACIK KG', 'price': '43,99 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'BİBER RENKLİ DOLMA KG', 'price': '43,99 TL'}
2024-07-02 06:09:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://devgrossonline.com/et-sarkuteri> (referer: None) ['playwright']
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'BANVİT PİLİÇ KOKTEYL SOSİS KÜVET 500 GR', 'price': '63,69 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AKDENİZ TOROS PİLİÇ 5 Lİ SOSİS 225 GR', 'price': '65,89 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AKDENİZ TOROS PİLİÇ SALAM TOMBİK 250 GR', 'price': '62,59 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AKDENİZ TOROS DANA SALAM TOMBİK 250 GR', 'price': '184,69 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'ŞENPİLİÇ DOYFRAM DİLİMLİ PİLİÇ SUCUK 240 GR', 'price': '53,85 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AYTAÇ BATON PİLİÇ SUCUK 300 GR', 'price': '73,59 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AKDENİZ TOROS DANA KANGAL SUCUK 200 GR', 'price': '194,59 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AKDENİZ TOROS PİLİÇ KANGAL SUCUK 250 GR', 'price': '116,49 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'ŞENPİLİÇ PİLİÇ SALAM 200GR', 'price': '22,55 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'NAMET DANA SUCUK 300GR', 'price': '153,89 TL'}
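For the pagination that the spider leaves out, one possible approach is to click through the pages on the Playwright page object and collect the HTML of each one. The following is only a rough sketch: the selectors are copied from the Selenium code in the question and have not been checked against the live site, the function name `collect_pages` and the `max_pages` safety cap are hypothetical, and the fixed one-second pause mirrors the question's `time.sleep(1)` rather than a proper wait condition.

```python
import asyncio

# Selectors taken from the question's Selenium code (untested assumptions).
CURRENT_PAGE = "a#secili-sayfa"
NEXT_PAGE_XPATH = "xpath=following::li/a[@onclick][1]"


async def collect_pages(page, max_pages=50):
    """Return the HTML of each listing page, clicking through the pagination.

    `page` is expected to behave like a Playwright Page object.
    `max_pages` is a safety cap so the loop always terminates.
    """
    pages = []
    for _ in range(max_pages):
        await page.wait_for_selector("div.urun-kutusu")
        pages.append(await page.content())

        # First pagination link that follows the currently selected page.
        next_link = page.locator(CURRENT_PAGE).locator(NEXT_PAGE_XPATH).first
        if await next_link.count() == 0:
            break  # nothing after the selected page: last page reached
        await next_link.click()
        # Crude pause for the reload, like time.sleep(1) in the question.
        await page.wait_for_timeout(1000)
    return pages
```

Inside `parse`, this could replace the single `page.content()` call: loop over `await collect_pages(page)`, build a `BeautifulSoup` object for each HTML string, and close the page only after the loop finishes.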