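I am trying to scrape the Amazon Standard Identification Number (ASIN). My code runs without errors, but some ASIN values are missing from the output. I inspected the HTML, and every ASIN value is in the "data-asin" attribute of a "div" element. I don't understand why some values are missing.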
import scrapy
from amazon.items import AmazonItem
from scrapy_playwright.page import PageMethod


class AmazonSpider(scrapy.Spider):
    name = 'amazon'

    def start_requests(self):
        url = 'https://www.amazon.com/s?k=laptop&crid=3HPDQBP5QI5QM&sprefix=lap%2Caps%2C672&ref=nb_sb_ss_ts-doa-p_1_3'
        yield scrapy.Request(
            url,
            meta=dict(
                playwright=True,
                playwright_include_page=True,
                playwright_page_methods=[
                    # Wait until at least one result container is present.
                    PageMethod('wait_for_selector', 'div[data-asin]'),
                    # Scroll down so lazily loaded results get rendered.
                    PageMethod("evaluate", "window.scrollBy(0, document.body.scrollHeight)"),
                    PageMethod("wait_for_selector", "div[data-asin]:nth-child(23)"),
                ],
            ),
            # errback is an argument of Request, not a key of meta.
            errback=self.errback,
        )

    async def parse(self, response):
        # The page object is not needed after rendering, so close it first.
        page = response.meta["playwright_page"]
        await page.close()
        for product in response.css('div[data-asin]'):
            asin = product.css('::attr(data-asin)').get()
            yield {
                "asin": asin
            }

    async def errback(self, failure):
        # Close the page on failure as well, so it does not leak.
        page = failure.request.meta["playwright_page"]
        await page.close()
This is the code I wrote.
Try this to see what your spider actually sees:
# add this
from scrapy.utils.response import open_in_browser
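and
# in the spider class
async def parse(self, response):
    open_in_browser(response)
This should open the response content in your browser, exactly as the spider received it. Inspect it for details and fix the spider from there.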
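Once the page is open, one thing worth checking is whether some "div" elements carry a "data-asin" attribute with an empty value, since those would match the selector but yield no usable ASIN. That is only a guess about this page, not something confirmed above. The sketch below uses just the standard library on a made-up HTML snippet; `AsinCollector`, `split_asins`, and `SAMPLE_HTML` are hypothetical names introduced for illustration.

```python
from html.parser import HTMLParser


class AsinCollector(HTMLParser):
    """Collect the data-asin value of every <div> that has the attribute."""

    def __init__(self):
        super().__init__()
        self.values = []

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        for name, value in attrs:
            if name == "data-asin":
                # A bare attribute (no "=") is reported as None.
                self.values.append(value or "")


def split_asins(html):
    """Return (non_empty_values, empty_count) for div[data-asin] elements."""
    collector = AsinCollector()
    collector.feed(html)
    collector.close()
    non_empty = [v for v in collector.values if v.strip()]
    return non_empty, len(collector.values) - len(non_empty)


# Hypothetical sample markup, not copied from a real Amazon page.
SAMPLE_HTML = """
<div data-asin="B0EXAMPLE1">Laptop A</div>
<div data-asin="">Banner slot</div>
<div data-asin="B0EXAMPLE2">Laptop B</div>
<div class="no-asin">Footer</div>
"""

if __name__ == "__main__":
    asins, empty = split_asins(SAMPLE_HTML)
    print("non-empty:", asins)
    print("empty data-asin attributes:", empty)
```

If the count of empty attributes is not zero on the real page, skipping falsy values in the spider (for example `if asin:` before the `yield`) may be enough. If the count is zero, the missing items are more likely not rendered at the time the response was captured.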