How to scrape data with Scrapy when the next-page buttons include an ellipsis


I need to keep getting the data through the next-page buttons <1 2 3 ... 5>, but there is no href link provided in the source and there is also an ellipsis. Any ideas? Here is my code:

def start_requests(self):
    # each entry pairs a callback with its start URL
    urls = (
        (self.parse_2, 'https://www.forever21.com/us/shop/catalog/category/f21/sale'),
    )
    for cb, url in urls:
        yield scrapy.Request(url, callback=cb)


def parse_2(self, response):
    # each product card on the listing page lives in a div.pi_container
    for product_item_forever in response.css('div.pi_container'):
        forever_item = {
            'forever-title': product_item_forever.css('p.p_name::text').extract_first(),
            'forever-regular-price': product_item_forever.css('span.p_old_price::text').extract_first(),
            'forever-sale-price': product_item_forever.css('span.p_sale.t_pink::text').extract_first(),
            'forever-photo-url': product_item_forever.css('img::attr(data-original)').extract_first(),
            'forever-description-url': product_item_forever.css('a.item_slider.product_link::attr(href)').extract_first(),
        }
        yield forever_item

Please help, thank you.

python web-scraping scrapy
2 Answers
2 votes

It looks like this pagination uses additional requests to an API. So, there are two options:

  1. Render the page with Splash/Selenium, following @QHarr's pattern;
  2. Make the same calls to the API. Check the developer tools and you will find a POST request to https://www.forever21.com/us/shop/Catalog/GetProducts with all the right parameters (they are too long, so I won't post the full list here); a sketch of that request is shown below.
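
In case it helps, here is a minimal sketch of option 2 using Scrapy's FormRequest. The payload keys below are hypothetical placeholders, not the real parameter names: copy the exact form data of the GetProducts POST from the Network tab of your developer tools. The endpoint is assumed to return JSON.

import scrapy


class Forever21ApiSpider(scrapy.Spider):
    name = 'forever21_api'

    def start_requests(self):
        # Hypothetical payload: replace with the exact form data the site
        # sends to GetProducts (visible in the developer tools Network tab).
        payload = {
            'brand': 'f21',
            'category': 'sale',
            'pageno': '1',
            # ... the remaining parameters are omitted here, as in the answer
        }
        yield scrapy.FormRequest(
            'https://www.forever21.com/us/shop/Catalog/GetProducts',
            formdata=payload,
            callback=self.parse_api,
        )

    def parse_api(self, response):
        # Assumes the endpoint returns JSON (response.json() needs Scrapy >= 2.2);
        # inspect the structure first to find where the product list lives.
        data = response.json()
        self.logger.info('top-level keys: %s', list(data))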

1 vote

The url changes, so you can specify the page number and results per page in it, e.g.

https://www.forever21.com/uk/shop/catalog/category/f21/sale/#pageno=2&pageSize=120&filter=price:0,250

As mentioned by @vezunchik and the OP's feedback, this approach requires selenium/splash to allow the js to run on the page. If you were going down that route, you could just click the next button (.p_next) until you reach the final page, as it is easy to grab the last page number from the document (.dot + .pageno); a sketch of that click-through variant follows.
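
A minimal sketch of the click-through variant, assuming the same selectors (.p_next for the next button, .dot + .pageno for the last page number) still apply:

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

d = webdriver.Chrome()
d.get('https://www.forever21.com/uk/shop/catalog/category/f21/sale')
d.find_element(By.CSS_SELECTOR, '[onclick="fnAcceptCookieUse()"]').click()  # dismiss cookie banner
last_page = int(d.find_element(By.CSS_SELECTOR, '.dot + .pageno').text)  # last page number

for _ in range(last_page - 1):
    # wait until the next button is clickable, then advance one page
    WebDriverWait(d, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, '.p_next'))
    ).click()
    # do something with the newly loaded page here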


I appreciate that you are trying scrapy.

Demoing the idea with selenium in case it helps.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

url_loop = 'https://www.forever21.com/uk/shop/catalog/category/f21/sale/#pageno={}&pageSize=120&filter=price:0,250'
url = 'https://www.forever21.com/uk/shop/catalog/category/f21/sale'
d = webdriver.Chrome()
d.get(url)

d.find_element(By.CSS_SELECTOR, '[onclick="fnAcceptCookieUse()"]').click() #get rid of cookies
items = WebDriverWait(d, 10).until(EC.presence_of_all_elements_located((By.CSS_SELECTOR, "#products .p_item"))) #wait for products to load
d.find_element(By.CSS_SELECTOR, '.selectedpagesize').click()
d.find_elements(By.CSS_SELECTOR, '.pagesize')[-1].click() #set page result count to 120
last_page = int(d.find_element(By.CSS_SELECTOR, '.dot + .pageno').text) #get last page

if last_page > 1:
    for page in range(2, last_page + 1):
        url = url_loop.format(page)
        d.get(url)
        try:
            d.find_element(By.CSS_SELECTOR, '[type=reset]').click() #reject offer
        except Exception:
            pass
        # do something with page
        break #delete later