Scraping ASINs from Amazon's search results page

Question

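I am trying to scrape ASINs from Amazon. Note that this is not about product detail pages (as in https://www.youtube.com/watch?v=qRVRIh3GZgI), but about the results page you get when searching for a keyword (in this example "trimmer"; try https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2). The result is a long list of products, and I am able to scrape all the titles.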

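What is not visible on the page is the ASIN (a unique Amazon identifier). When inspecting the HTML, I see that the link (href) on the title text contains the ASIN. In the example below, ASIN = B01MSHQ5IQ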

<a class="a-link-normal a-text-normal" href="/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3?keywords=trimmer&amp;qid=1554462204&amp;s=gateway&amp;sr=8-3">

So my question is: how can I retrieve all product titles and ASINs on the page? For example:

Number     Title                       ASIN
 1       Braun, Beardtrimmer          B07JH1LLYR 
 2       TNT Pro Series Waist         B00R84J2PK
 ...     ...                          ...

So far I am using Scrapy (but I am open to other Python solutions), and I am able to scrape the titles.

My code so far:

First, run on the command line:

scrapy startproject tutorial

Then adjust the spider file (see Example 1) and items.py (see Example 2).

Example 1

import scrapy

# Assumes the project was created with "scrapy startproject tutorial"
from tutorial.items import AmazonItem


class AmazonProductSpider(scrapy.Spider):
    name = "AmazonDeals"
    allowed_domains = ["amazon.com"]

    # Use working product URL below
    start_urls = [
        "https://www.amazon.com/s?k=trimmer&ref=nb_sb_noss_2",
    ]
    # Run with: scrapy crawl AmazonDeals -o Asin_Titles.json

    def parse(self, response):
        items = AmazonItem()

        # Collect the text of every element with class "a-text-normal"
        Title = response.css('.a-text-normal').css('::text').extract()
        items['title_Products'] = Title
        yield items

At @glhr's request, here is the items.py code:

Example 2

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy

class AmazonItem(scrapy.Item):
  # define the fields for your item here like:
  title_Products = scrapy.Field()

Answer 1

You can get the product links by extracting the href attribute of <a class="a-link-normal a-text-normal" href="...">:

Link = response.css('.a-text-normal').css('a::attr(href)').extract()
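Since the question is open to other Python solutions, the same idea can also be sketched without Scrapy, using the standard library's html.parser. This swaps in a different technique from the CSS selector above and is only an illustration: the class name ResultLinkParser is hypothetical, and it keeps only <a> tags that carry both classes from the example link.

```python
from html.parser import HTMLParser


class ResultLinkParser(HTMLParser):
    """Collect href values of <a> tags carrying both result-link classes."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if "a-link-normal" in classes and "a-text-normal" in classes:
            href = attr_map.get("href")
            if href:
                self.links.append(href)


# Snippet copied from the question; a real results page contains many such tags
html = ('<a class="a-link-normal a-text-normal" '
        'href="/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/ref=sr_1_3'
        '?keywords=trimmer&amp;qid=1554462204&amp;s=gateway&amp;sr=8-3">')

parser = ResultLinkParser()
parser.feed(html)
print(parser.links)
```

The parser decodes &amp; in attribute values, so the collected href contains plain & characters.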

From each link, you can then extract the ASIN with a regular expression:

(?<=dp/)[A-Z0-9]{10}

The regular expression above matches 10 characters (uppercase letters or digits) that are preceded by dp/. See the demo here: https://regex101.com/r/mLMv3k/1
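As a quick check of that pattern outside Scrapy, here is a minimal sketch using only Python's re module. The helper name extract_asin is hypothetical; the href is the example link from the question (with &amp; written as & as it would appear after HTML decoding).

```python
import re

# Same pattern as above: 10 uppercase letters or digits preceded by "dp/"
ASIN_PATTERN = re.compile(r"(?<=dp/)[A-Z0-9]{10}")


def extract_asin(href):
    """Return the ASIN found in href, or None if there is no match."""
    match = ASIN_PATTERN.search(href)
    return match.group(0) if match else None


# href taken from the example link in the question
href = ("/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ/"
        "ref=sr_1_3?keywords=trimmer&qid=1554462204&s=gateway&sr=8-3")
print(extract_asin(href))                 # B01MSHQ5IQ
print(extract_asin("/gp/help/customer"))  # None
```

Returning None for links without a dp/ segment lets the caller skip non-product links instead of failing on them.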

Here is a working implementation of the parse() method:

import re  # add this at the top of the spider module

def parse(self, response):
    Link = response.css('.a-text-normal').css('a::attr(href)').extract()
    Title = response.css('span.a-text-normal').css('::text').extract()

    # for each product, create an AmazonItem, populate the fields and yield the item
    for result in zip(Link, Title):
        # extract ASIN from link; skip links that do not contain one
        match = re.search(r"(?<=dp/)[A-Z0-9]{10}", result[0])
        if match is None:
            continue
        item = AmazonItem()
        item['title_Product'] = result[1]
        item['link_Product'] = result[0]
        item['ASIN_Product'] = match.group(0)
        yield item

This requires extending AmazonItem with the new fields:

class AmazonItem(scrapy.Item):
    # define the fields for your item here like:
    title_Product = scrapy.Field()
    link_Product = scrapy.Field()
    ASIN_Product = scrapy.Field()

Sample output:

{'ASIN_Product': 'B01MSHQ5IQ',
 'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
 'title_Product': 'Philips Norelco Multigroom Series 3000, 13 attachments, '
                  'FFP, MG3750'}
{'ASIN_Product': 'B01MSHQ5IQ',
 'link_Product': '/Philips-Norelco-Groomer-MG3750-50/dp/B01MSHQ5IQ',
 'title_Product': 'Philips Norelco Multi Groomer MG7750/49-23 piece, beard, '
                  'body, face, nose, and ear hair trimmer, shaver, and clipper'}
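To get from items like these to the numbered Title/ASIN layout asked for in the question, the items could be post-processed along the following lines. This is only a sketch: format_table is a hypothetical helper, the column widths are arbitrary, and the two rows are the example rows from the question's table rather than scraped data.

```python
def format_table(items):
    """Return lines of a numbered Title/ASIN table for the given items."""
    lines = [f"{'Number':<8}{'Title':<28}{'ASIN'}"]
    for number, item in enumerate(items, start=1):
        lines.append(
            f"{number:<8}{item['title_Product']:<28}{item['ASIN_Product']}"
        )
    return lines


# Example rows taken from the table in the question
items = [
    {"title_Product": "Braun, Beardtrimmer", "ASIN_Product": "B07JH1LLYR"},
    {"title_Product": "TNT Pro Series Waist", "ASIN_Product": "B00R84J2PK"},
]
for line in format_table(items):
    print(line)
```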

To write the output to a JSON file, simply specify the feed export settings in the spider:

https://repl.it/@glhr/55534679-AmazonSpider
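As a rough sketch of what such feed export settings can look like, assuming Scrapy 2.1 or later (where the FEEDS setting is available): the name FEED_EXPORT_SETTINGS is hypothetical, and the file name Asin_Titles.json is taken from the crawl command in the question.

```python
# Hypothetical fragment: inside the spider class (e.g. AmazonProductSpider),
# assign this dict as `custom_settings = FEED_EXPORT_SETTINGS`, or copy the
# FEEDS entry into the project's settings.py.
FEED_EXPORT_SETTINGS = {
    "FEEDS": {
        # output file name mapped to its export options
        "Asin_Titles.json": {"format": "json", "encoding": "utf8"},
    },
}
```

With this in place, running scrapy crawl AmazonDeals should write the items to Asin_Titles.json without the -o option.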
