Unable to scrape next-page content with Scrapy


I want to scrape content from the next page, but my spider never moves on to it. My code is:

import scrapy

class AggregatorSpider(scrapy.Spider):
    name = 'aggregator'
    allowed_domains = ['startech.com.bd/component/processor']
    start_urls = ['https://startech.com.bd/component/processor']

    def parse(self, response):
        processor_details = response.xpath('//*[@class="col-xs-12 col-md-4 product-layout grid"]')
        for processor in processor_details:
            name = processor.xpath('.//h4/a/text()').extract_first()
            price = processor.xpath('.//*[@class="price space-between"]/span/text()').extract_first()
            print('\n')
            print(name)
            print(price)
            print('\n')
        next_page_url = response.xpath('//*[@class="pagination"]/li/a/@href').extract_first()
        # absolute_next_page_url = response.urljoin(next_page_url)
        yield scrapy.Request(next_page_url)

I didn't use urljoin because next_page_url already gives me the full URL. I also tried the dont_filter=True argument in the Request, which got me into an infinite loop on page 1. The message I get from the terminal is: [scrapy.spidermiddlewares.offsite] DEBUG: Filtered offsite request to 'www.startech.com.bd': <GET https://www.startech.com.bd/component/processor?page=2>
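For reference, the dont_filter variant mentioned above would look like the line below (a sketch, not the asker's exact code). Setting dont_filter=True makes both the offsite middleware and the duplicate filter let the request through, so the first pagination link, which points back at page 1, keeps being re-requested; that is where the infinite loop comes from.

    # dont_filter=True bypasses the offsite check and the dupefilter, so the
    # first pagination link (page 1) is fetched over and over again.
    yield scrapy.Request(next_page_url, dont_filter=True)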

Tags: python, web-scraping, scrapy, scrapy-shell
1 Answer

This happens because your allowed_domains variable is wrong: use allowed_domains = ['www.startech.com.bd'] instead (see the docs).

You can also tighten the next-page selector so the spider does not go back to the first page:

import scrapy

class AggregatorSpider(scrapy.Spider):
    name = 'aggregator'
    # Must match the host the requests actually go to (www.startech.com.bd),
    # otherwise the offsite middleware filters the pagination requests.
    allowed_domains = ['www.startech.com.bd']
    start_urls = ['https://startech.com.bd/component/processor']

    def parse(self, response):
        processor_details = response.xpath('//*[@class="col-xs-12 col-md-4 product-layout grid"]')
        for processor in processor_details:
            name = processor.xpath('.//h4/a/text()').extract_first()
            price = processor.xpath('.//*[@class="price space-between"]/span/text()').extract_first()
            yield {'name': name, 'price': price}
        # The last item in the pagination list is the "next" link, so we no
        # longer pick the first link, which points back at page 1.
        next_page_url = response.css('.pagination li:last-child a::attr(href)').extract_first()
        if next_page_url:
            yield scrapy.Request(next_page_url)
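If you want to try the fixed spider outside a full Scrapy project, here is a minimal runner sketch (the filename, the FEEDS output path, and the assumption that AggregatorSpider is defined in the same file are mine, not the answer's):

    # run_aggregator.py -- hypothetical standalone runner; paste the spider
    # class above this block or import it from wherever you saved it.
    from scrapy.crawler import CrawlerProcess

    process = CrawlerProcess(settings={
        'FEEDS': {'processors.json': {'format': 'json'}},  # collect items as JSON
    })
    process.crawl(AggregatorSpider)
    process.start()  # blocks until the crawl is finished

As a design note, response.follow(next_page_url) would work in place of scrapy.Request here; it also resolves relative URLs against the current page, which makes the urljoin question moot.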