I wrote a Scrapy spider that uses Selenium to scrape products from "devgrossonline.com".
It works when I give it a single category URL, but not when I give it several. Any help would be appreciated.
Here is my spider:
import time
from datetime import datetime

import scrapy
from selenium import webdriver
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait
from webdriver_manager.chrome import ChromeDriverManager


class DevgrossSpider(scrapy.Spider):
    name = 'devgross'
    allowed_domains = ['devgrossonline.com']
    start_urls = [
        'https://devgrossonline.com/sut-kahvaltilik',
        'https://devgrossonline.com/meyve-sebze',
        'https://devgrossonline.com/et-sarkuteri'
    ]

    def __init__(self, *args, **kwargs):
        super(DevgrossSpider, self).__init__(*args, **kwargs)
        self.driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()))
        self.driver.maximize_window()

    def closed(self):
        self.driver.quit()

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        self.driver.get(response.url)
        time.sleep(5)
        # find how many pages there are
        WebDriverWait(self.driver, 10).until(
            EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.pagination'))
        )
        page_count_selector = "ul.pagination li:last-child"
        page_count = self.driver.find_element(By.CSS_SELECTOR, page_count_selector).text
        try:
            page_count = int(page_count)
        except:
            page_count = 1
        # page iteration
        for i in range(1, page_count + 1):
            # go next page
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.pagination'))
            )
            if i != 1:
                current_page = self.driver.find_element(By.CSS_SELECTOR, 'a#secili-sayfa')
                next_page = current_page.find_element(By.XPATH, 'following::li/a[@onclick][1]')
                next_page.click()
                time.sleep(1)
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'ul.pagination')))
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.CSS_SELECTOR, 'div.urun-kutusu')))
            WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.ID, 'cat_rw_05')))
            time.sleep(2)
            product_elements = self.driver.find_elements(By.CSS_SELECTOR, 'div.urun-kutusu')
            for product_element in product_elements:
                product_name = product_element.find_element(By.CSS_SELECTOR, 'h2 a').text
                product_price = product_element.find_element(By.CSS_SELECTOR, 'div.urun-fiyat span').text
                try:
                    product_price = float(product_price.replace(' TL', '').replace(',', '.'))
                except:
                    product_price = -1.0
                product_url = product_element.find_element(By.CSS_SELECTOR,
                                                           'div.kutu-urun-resmi a.kutu-link').get_attribute('href')
                yield {
                    'name': product_name,
                    'price': product_price,
                    'URL': product_url,
                    'date': datetime.now(),
                }
I have been working on this for hours but have not been able to solve it.
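I would respectfully suggest using Playwright with Scrapy instead of Selenium. Playwright offers similar functionality to Selenium but, IMHO, its integration with Scrapy is more up to date.
Install these packages: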
scrapy-playwright==0.0.34
beautifulsoup4==4.12.3
Install Chrome using Playwright.
playwright install --force chrome
In settings.py:
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}
PLAYWRIGHT_DEFAULT_NAVIGATION_TIMEOUT = 30 * 1000
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": False}
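One setting not listed above may also be needed. As far as I know, scrapy-playwright relies on Twisted's asyncio-based reactor. Projects generated by recent Scrapy versions usually have this in settings.py already; if yours does not, a line like the following is probably required. This is an addition based on the scrapy-playwright documentation, not something taken from the settings shown above.

```python
# Assumption: only needed if settings.py does not already define it.
# scrapy-playwright expects Twisted's asyncio-based reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"
```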
And here is a simple spider that opens all three start pages.
import scrapy
from bs4 import BeautifulSoup


class DevgrossSpider(scrapy.Spider):
    name = "devgross"
    allowed_domains = ["devgrossonline.com"]
    start_urls = [
        'https://devgrossonline.com/sut-kahvaltilik',
        'https://devgrossonline.com/meyve-sebze',
        'https://devgrossonline.com/et-sarkuteri'
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url,
                meta=dict(
                    playwright=True,
                    playwright_include_page=True,
                ),
            )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.wait_for_selector("ul.pagination")
        html = await page.content()
        soup = BeautifulSoup(html, "html.parser")
        await page.close()
        products = soup.select("div.urun-kutusu")
        for product in products:
            name = product.select_one("h2 a").string.strip()
            price = product.select_one("div.urun-fiyat span").string.strip()
            yield {
                "name": name,
                "price": price,
            }
Pagination is not implemented yet, but you can see in the output below that all three start pages were opened.
2024-07-02 06:09:43 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://devgrossonline.com/sut-kahvaltilik> (referer: None) ['playwright']
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'VADİM TOST PEYNİRİ 600 GR', 'price': '120,89 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'TAHSİLDAROĞLU ÇEÇİL PEYNİRİ 180 GR', 'price': '87,89 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'TAHSİLDAROĞLU ÖRGÜ PEYNİRİ 180 GR', 'price': '87,89 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'YÖRE YEŞİL ÇİZİK ZEYTİN (351-400) KG', 'price': '153,89 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'SÜTAŞ ÇİLEKLİ SÜT 6X180 ML', 'price': '49,39 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'SÜTAŞ MUZLU SÜT 6X180 ML', 'price': '49,39 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'TORKU TEREYAĞI RULO 750 GR', 'price': '258,39 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'TEMPO FINDIK EZMELİ BAR 24 GR', 'price': '16,39 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'TEMPO ANTEP EZME 24 GR', 'price': '29,59 TL'}
2024-07-02 06:09:43 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/sut-kahvaltilik>
{'name': 'YAYMER PANCAR PEKMEZİ 450 GR', 'price': '87,89 TL'}
2024-07-02 06:09:47 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://devgrossonline.com/meyve-sebze> (referer: None) ['playwright']
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'KANDEMİR MANTAR 400 GR', 'price': '58,19 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'ZÜMRÜT KUŞ ÜZÜMÜ 40G', 'price': '18,59 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'İĞDE KG', 'price': '164,89 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'SARI ÜZÜM 11 NO KG', 'price': '307,89 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'SİYAH ÜZÜM ÖZEL KG', 'price': '274,89 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'YER ÇİLEĞİ KG', 'price': '247,39 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'YARIM YABAN MERSİNİ KG', 'price': '313,39 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'DOMATES PEMBE KG', 'price': '43,99 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'SOĞAN ARPACIK KG', 'price': '43,99 TL'}
2024-07-02 06:09:48 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/meyve-sebze>
{'name': 'BİBER RENKLİ DOLMA KG', 'price': '43,99 TL'}
2024-07-02 06:09:52 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://devgrossonline.com/et-sarkuteri> (referer: None) ['playwright']
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'BANVİT PİLİÇ KOKTEYL SOSİS KÜVET 500 GR', 'price': '63,69 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AKDENİZ TOROS PİLİÇ 5 Lİ SOSİS 225 GR', 'price': '65,89 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AKDENİZ TOROS PİLİÇ SALAM TOMBİK 250 GR', 'price': '62,59 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AKDENİZ TOROS DANA SALAM TOMBİK 250 GR', 'price': '184,69 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'ŞENPİLİÇ DOYFRAM DİLİMLİ PİLİÇ SUCUK 240 GR', 'price': '53,85 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AYTAÇ BATON PİLİÇ SUCUK 300 GR', 'price': '73,59 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AKDENİZ TOROS DANA KANGAL SUCUK 200 GR', 'price': '194,59 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'AKDENİZ TOROS PİLİÇ KANGAL SUCUK 250 GR', 'price': '116,49 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'ŞENPİLİÇ PİLİÇ SALAM 200GR', 'price': '22,55 TL'}
2024-07-02 06:09:52 [scrapy.core.scraper] DEBUG: Scraped from <200 https://devgrossonline.com/et-sarkuteri>
{'name': 'NAMET DANA SUCUK 300GR', 'price': '153,89 TL'}
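The prices in the output above are still strings such as '120,89 TL', whereas the spider in the question converted them to floats. A minimal sketch of that conversion follows. The helper name parse_price is hypothetical, and returning None for unparseable input (rather than the -1.0 used in the question) is my own choice. It also assumes that a '.' in the price, if one ever appears, is a thousands separator; none of the prices shown above contain one, so that part is untested against the site.

```python
from typing import Optional


def parse_price(text: str) -> Optional[float]:
    """Convert a price string such as '120,89 TL' to a float.

    Returns None when the text cannot be parsed.
    """
    # Drop the currency suffix and surrounding whitespace.
    cleaned = text.replace("TL", "").strip()
    # Assumption: '.' is a thousands separator and ',' is the decimal separator.
    cleaned = cleaned.replace(".", "").replace(",", ".")
    try:
        return float(cleaned)
    except ValueError:
        return None
```

Inside parse, this could be applied as "price": parse_price(price) when building each item.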