Async start_requests in Scrapy

Problem description (votes: 0, answers: 1)

Hi, I wrote some code with Scrapy that works, but when I change it to async code I get this error:



    await process.crawl(WebSpider, start_urls=urls)

  File "/home/z/PycharmProjects/news-link-extractor/.venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1180, in send

    raise result.value

TypeError: 'async_generator' object is not iterable
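The error itself is easy to reproduce outside Scrapy: declaring `start_requests` with `async def` and `yield` makes it return an async generator, and an async generator does not support plain `iter()`, which is what Scrapy's Twisted-based machinery effectively attempts here. A standalone illustration (no Scrapy required):

```python
def sync_requests():
    yield "https://example.com"   # plain generator: iterable

async def async_requests():
    yield "https://example.com"   # async generator: only usable with `async for`

iter(sync_requests())             # works fine
try:
    iter(async_requests())        # what the crawler effectively attempts
except TypeError as exc:
    print(exc)                    # 'async_generator' object is not iterable
```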

Here is my code:

class WebSpider(scrapy.Spider):
    name = 'webspider'
    allowed_domains = allowed_domains

    custom_settings = {
        'RETRY_TIMES': 8,
        'RETRY_DELAY': 5,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 400, 403, 404, 408],
    }

    @staticmethod
    async def get_start_urls():
        async with AsyncMySQLConnection() as mysql_connection:
            urls = await mysql_connection.select_urls(settings.DATABASE_NAME, settings.TABLE_NAME)
        return urls

    async def start_requests(self):
        urls = await self.get_start_urls()
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    async def parse(self, response):
        links = LinkExtractor(allowed_domains).extract_links(response)
        all_urls = [link.url for link in links]
        await URLManager().save_new_url_in_redis(all_urls)
        print(all_urls)

if __name__ == '__main__':
    from scrapy import cmdline, Request

    cmdline.execute("scrapy crawl webspider".split())

I tried several approaches, such as ensureDeferred and inlineCallbacks. It works if I write it as a plain list, e.g. start_requests = [].

python async-await scrapy
1 Answer
0 votes

Until support for

async def start_requests()
is implemented in Scrapy, you can use the following workaround:

class MySpider(Spider):
    ...

    def start_requests(self):
        yield Request("data:,", callback=self.parse_initial)

    async def parse_initial(self, response):
        ...  # produce the actual initial requests

    ...

This lets you use async code to produce the actual initial requests (

"data:,"
is an empty URL that triggers no network request).
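The point of the workaround is that an async generator can only be drained with `async for` from inside a running coroutine, which the dummy-request callback arranges for you. The same mechanics can be sketched with plain asyncio, using hypothetical stand-in names (`get_start_urls` mimics the awaited database lookup from the question; in Scrapy each yielded item would be a `scrapy.Request` rather than a URL string):

```python
import asyncio

async def get_start_urls():
    # Stand-in for the awaited MySQL lookup in the question.
    await asyncio.sleep(0)
    return ["https://example.com/a", "https://example.com/b"]

async def start_requests():
    # Async generator, like the failing start_requests in the question.
    for url in await get_start_urls():
        yield url

async def main():
    # `async for` (here via a comprehension) is how async generators are consumed.
    return [url async for url in start_requests()]

print(asyncio.run(main()))  # ['https://example.com/a', 'https://example.com/b']
```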
