Scraping Amazon ASINs with scrapy-playwright

Question (votes: 0, answers: 1)

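My code runs fine when I try to scrape the Amazon Standard Identification Number (ASIN), but some ASIN values are missing from the output. I checked the HTML, and every ASIN value is in the "data-asin" attribute of a "div" element. I don't understand why some values are missing.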

import scrapy
from amazon.items import AmazonItem
from scrapy_playwright.page import PageMethod

class AmazonSpider(scrapy.Spider):
    name = 'amazon'

    def start_requests(self):
        url = 'https://www.amazon.com/s?k=laptop&crid=3HPDQBP5QI5QM&sprefix=lap%2Caps%2C672&ref=nb_sb_ss_ts-doa-p_1_3'
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    PageMethod('wait_for_selector', 'div[data-asin]'),
                    PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageMethod("wait_for_selector", "div[data-asin]:nth-child(23)"),
                ],
            ),
            # errback is a Request argument; placed inside meta it is never called
            errback=self.errback,
        )

    async def parse(self, response):
        page = response.meta["playwright_page"]
        await page.close()

        for product in response.css('div[data-asin]'):
            asin = product.css('::attr(data-asin)').get()
            yield {
                "asin": asin
            }

    async def errback(self,failure):
        page = failure.request.meta["playwright_page"]
        await page.close()

This is the code I wrote.

scrapy scrapy-playwright
1 Answer (0 votes)

Try this to see what your spider actually sees:

# add this
from scrapy.utils.response import open_in_browser

and

async def parse(self, response):
    open_in_browser(response)

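This should open the page content, as your spider received it, in your browser. Look for the missing details there and fix the spider accordingly.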
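
To go with the visual check, it may help to count what the "div[data-asin]" elements in the received HTML actually contain. One possible cause, not confirmed in this thread, is that some of those elements carry an empty "data-asin" value. Below is a minimal sketch using only the standard library. The names AsinCollector and summarize_asins are hypothetical helpers, and the sample markup is made up for illustration.

```python
from html.parser import HTMLParser


class AsinCollector(HTMLParser):
    """Collect the data-asin value of every <div> start tag that has one."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        for name, value in attrs:
            if name == "data-asin":
                # value is None for a bare attribute and "" for data-asin=""
                self.values.append(value or "")


def summarize_asins(html):
    """Return (non_empty_asins, empty_count) for the given HTML text."""
    collector = AsinCollector()
    collector.feed(html)
    collector.close()
    non_empty = [v for v in collector.values if v]
    return non_empty, len(collector.values) - len(non_empty)


# Made-up sample markup, only to show the shape of the result
sample = (
    '<div data-asin="B000000001"></div>'
    '<div data-asin=""></div>'
    '<div data-asin="B000000002"></div>'
)
asins, empty = summarize_asins(sample)
print(asins, empty)
```

Inside the spider, the same helper could be called as summarize_asins(response.text) in parse and both numbers logged. If the empty count is non-zero, filtering out blank values before yielding is one option. If the total is lower than what the browser shows, the page was probably captured before all results had rendered.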
