Is there a faster way to scrape a predefined list of URLs with Scrapy when authentication is required first?

Question

I have two Scrapy spiders:

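Spider 1 crawls a list of product links (~10000) and saves them to a csv file using a feed export. It doesn't visit every link, only the categories (each of which has multiple pages). This spider rarely needs to run, so speed is not an issue here.
Spider 2 visits each link and extracts some product information, which is also saved to a csv feed. It has to run at least once a day, so speed is critical.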

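Since both visit the same website (just different subpages), they are structured similarly. Both need to authenticate first (multi-step authentication), which is why I can't just put all the links they need to crawl into start_urls. However, although they are very similar, Spider 1 crawls about 500 pages per minute while Spider 2 crawls only about 50 pages per minute, which is far too slow.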

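The only key difference is how they "discover" the pages to scrape:

Spider 1 starts from a small list and then discovers further pages one by one
Spider 2 reads a large predefined list from a file and then loops over it to request the pages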

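Since neither spider does any heavy processing on the crawled pages, my guess is that looping over the pages the way Spider 2 does (in the method after_2nd_step) prevents Scrapy from crawling them asynchronously, but I don't know whether that is really the case.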

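My other hypothesis is that, for whatever reason, the website itself takes longer to load the product information pages. However, I haven't found a way to see the relevant data, such as the average page load time.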
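
One way to check this hypothesis: Scrapy stores the time spent fetching each response, in seconds, in response.meta['download_latency']. The sketch below is a minimal, hypothetical helper (LatencyStats is not part of Scrapy) that accumulates those values so the averages of the two spiders can be compared. It takes the meta dict as a plain mapping, so it can be tried without running a crawl.

```python
from statistics import mean


class LatencyStats:
    """Hypothetical helper: collects per-response download latencies in seconds."""

    def __init__(self):
        self.samples = []

    def record(self, meta):
        # Scrapy sets meta["download_latency"] once a response has been downloaded;
        # entries without that key are skipped.
        latency = meta.get("download_latency")
        if latency is not None:
            self.samples.append(float(latency))

    def summary(self):
        # Return count, mean and max, or None if nothing was recorded.
        if not self.samples:
            return None
        return {
            "count": len(self.samples),
            "mean": mean(self.samples),
            "max": max(self.samples),
        }


# Stand-in meta dicts; in a spider these would be response.meta.
stats = LatencyStats()
for fake_meta in ({"download_latency": 0.5}, {"download_latency": 1.5}, {}):
    stats.record(fake_meta)
print(stats.summary())
```

In a spider, one would call something like self.latency_stats.record(response.meta) at the top of each page callback and log summary() from the spider's closed() method.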

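So my questions are: Why is Spider 2 ten times slower than Spider 1, even though it runs on the same machine and scrapes the same website? Is there a faster way to scrape the long predefined list of URLs in Spider 2? And if the problem isn't obvious from my code, is there a good way to analyze how long page loads take?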

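Here is my code. I anonymized some sensitive parts; rest assured that the original is valid Python. I tried to keep most of it intact, since the problem may lie in something that looks irrelevant to me: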

# Spider 1
import scrapy
from scrapy.http import TextResponse

from product_pages import PRODUKTSEITEN # a list of ~27 predefined URLs for all the categories

class ProductLinkSpider(scrapy.Spider):
    name = 'my_name'
    start_urls = ['URL to login page']

    custom_settings = {
        'FEEDS': {
            'path\\product_links.csv': {
                'format': 'csv',
                'encoding': 'utf8',
                'overwrite': True,
            },
        },
    }

    def parse(self, response):
        # First, we handle the login form
        return scrapy.FormRequest.from_response(
            response,
            formid='loginForm',
            formdata={
                #entering password and username
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Check for login success by verifying the response content
        if "Your User ID or Password is incorrect. Please try again" in response.text:
            self.logger.error("Login failed")
        # 2nd login step
        return scrapy.FormRequest.from_response(
            response,
            formxpath='//form[@action="/lorem ipsum"]',
            formdata={
                # Formdata is filled in
            },
            callback=self.after_2nd_step
        )

    def after_2nd_step(self, response):
        for page in PRODUKTSEITEN:
            yield scrapy.Request(
                page+"?sort=asc&q=%3Arelevance#&page=0",
                callback=self.parse_products_list_page,
                cb_kwargs=dict(url_without_page=page+"?sort=asc&q=%3Arelevance", page=0)
                )

    def parse_products_list_page(self, response: TextResponse, url_without_page: str, page: int):
        product_links = response.css('a css selector to get the links').getall()

        if There is another page:
            # Checks if there is another page, and requests it next
            yield scrapy.Request(
                url_without_page + f"&page={page+1}",
                callback=self.parse_products_list_page,
                cb_kwargs=dict(url_without_page=url_without_page, page=page+1)
            )

        for link in product_links:
            yield {"url": link}

And the second spider:

# Spider 2
import scrapy
from scrapy.http import TextResponse


class ProductInfoSpider(scrapy.Spider):
    name = 'lorem ipsum'
    start_urls = ['login page']
    custom_settings = {
        'FEEDS': {
            'path\\product_info_added_2.csv': {
                'format': 'csv',
                'encoding': 'utf8',
                'overwrite': True,
            },
        },
    }

    def parse(self, response):
        # First, we handle the login form
        return scrapy.FormRequest.from_response(
            response,
            formid='loginForm',
            formdata={
                # Enter username and password
            },
            callback=self.after_login
        )

    def after_login(self, response):
        # Check for login success by verifying the response content
        if "Your User ID or Password is incorrect. Please try again" in response.text:
            self.logger.error("Login failed")

        return scrapy.FormRequest.from_response(
            response,
            formxpath='//form[@action="/lorem ipsum"]',
            formdata={
                # Enter the same data as in the first Spider
            },
            callback=self.after_2nd_step
        )

    def after_2nd_step(self, response):
        with open(r"path\product_links.csv", "r") as file:
            # HERE is the only key difference
            lines = file.readlines()
            # Skip the CSV header row
            product_links = lines[1:]
        for link in product_links:
            # readlines() keeps the trailing newline, so strip it before building the URL
            yield scrapy.Request("domain" + link.strip(), callback=self.parse_product_page)
        
    def parse_product_page(self, response):
        # extract some info about the product using 4  different css selectors to get the data
        yield {Some of the extracted info, 6 columns}
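
The file-reading step in after_2nd_step can also be written with the csv module, which takes care of the header row and line endings. A minimal sketch, assuming the feed written by Spider 1 has a single url column (as its yield {"url": link} suggests) and that the links are relative to one base URL. load_product_urls and the base URL passed to it are hypothetical, not part of the original code:

```python
import csv
from urllib.parse import urljoin


def load_product_urls(csv_path, base_url):
    """Hypothetical helper: yield absolute product URLs from Spider 1's CSV feed."""
    with open(csv_path, newline="", encoding="utf8") as file:
        # DictReader consumes the header row, so no manual lines[1:] is needed.
        for row in csv.DictReader(file):
            link = (row.get("url") or "").strip()
            if link:  # skip rows with an empty url field
                yield urljoin(base_url, link)
```

Inside after_2nd_step, the loop would then iterate over load_product_urls(...) and yield one scrapy.Request per URL, with the same callback as before.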

Answer 1

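You could try using Scrapy's start_requests() method to send authenticated requests to every URL in the predefined list. Just include your login credentials in the headers or cookies of each request. This presumes the site accepts a reusable token or session cookie, which the multi-step login in the question may not allow.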

def start_requests(self):
    # urls_list and the auth_token cookie are placeholders
    for url in urls_list:
        yield scrapy.Request(url, cookies={'auth_token': 'your_token'})
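
Whichever way the requests are generated, a rough sanity check on the numbers in the question is possible: sustained throughput is approximately the number of requests in flight divided by the average response time. The sketch below applies that relation to the reported rates, assuming Scrapy's default of 8 concurrent requests per domain (CONCURRENT_REQUESTS_PER_DOMAIN) is in effect and no download delay is configured. The question does not show the project settings, so that figure is an assumption, and implied_latency_seconds is a hypothetical helper.

```python
def implied_latency_seconds(concurrent_requests, pages_per_minute):
    """Average seconds per response implied by throughput = concurrency / latency."""
    return concurrent_requests * 60 / pages_per_minute


ASSUMED_CONCURRENCY = 8  # assumption: Scrapy's default CONCURRENT_REQUESTS_PER_DOMAIN

for name, rate in (("Spider 1", 500), ("Spider 2", 50)):
    latency = implied_latency_seconds(ASSUMED_CONCURRENCY, rate)
    print(f"{name}: ~{latency:.2f} s per page")
```

If measured download latencies turn out to be in that range, the product pages themselves are probably slow to respond. In that case raising CONCURRENT_REQUESTS_PER_DOMAIN, within what the site tolerates, would be the setting to experiment with, rather than changing how the URL list is iterated.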
