Hi, I wrote a spider with Scrapy and it works, but when I changed it to async code I get this error:

```
await process.crawl(WebSpider, start_urls=urls)
File "/home/z/PycharmProjects/news-link-extractor/.venv/lib/python3.10/site-packages/twisted/internet/defer.py", line 1180, in send
    raise result.value
TypeError: 'async_generator' object is not iterable
```
Here is my code:

```python
import scrapy
from scrapy.linkextractors import LinkExtractor


class WebSpider(scrapy.Spider):
    name = 'webspider'
    allowed_domains = allowed_domains
    custom_settings = {
        'RETRY_TIMES': 8,
        'RETRY_DELAY': 5,
        'RETRY_HTTP_CODES': [500, 502, 503, 504, 400, 403, 404, 408],
    }

    @staticmethod
    async def get_start_urls():
        async with AsyncMySQLConnection() as mysql_connection:
            urls = await mysql_connection.select_urls(settings.DATABASE_NAME, settings.TABLE_NAME)
        return urls

    async def start_requests(self):
        urls = await self.get_start_urls()
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    async def parse(self, response):
        links = LinkExtractor(allowed_domains).extract_links(response)
        all_urls = [link.url for link in links]
        await URLManager().save_new_url_in_redis(all_urls)
        print(all_urls)


if __name__ == '__main__':
    from scrapy import cmdline
    cmdline.execute("scrapy crawl webspider".split())
```
I tried several approaches, such as `ensureDeferred` and `inlineCallbacks`. If I write `start_requests = []` instead, it runs fine.
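The error can be reproduced without Scrapy at all: an `async def` function that contains `yield` returns an `async_generator`, and plain synchronous iteration (which is what Scrapy's Twisted-based engine applies to `start_requests`) cannot consume one. A minimal sketch:

```python
import asyncio


async def start_requests():
    # "async def" + "yield" makes this an async generator, not a
    # regular generator.
    for url in ["https://a.example", "https://b.example"]:
        yield url


# Synchronous iteration fails exactly like the traceback above.
try:
    list(start_requests())
except TypeError as exc:
    print(exc)  # 'async_generator' object is not iterable


# An async generator can only be consumed with "async for",
# inside a running event loop.
async def collect():
    return [url async for url in start_requests()]

print(asyncio.run(collect()))
```

This is why replacing the method body with `start_requests = []` makes the error disappear: a list is synchronously iterable, an async generator is not.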
Until support for `async def start_requests()` is implemented in Scrapy, you can use the following workaround:
```python
class MySpider(Spider):
    ...

    def start_requests(self):
        yield Request("data:,", callback=self.parse_initial)

    async def parse_initial(self, response):
        ...  # produce the actual initial requests
```
"data:,"
是一个空 URL,不会引起任何网络请求)。
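To see why this bridge works, here is a Scrapy-free simulation (names like `FakeRequest` and `fake_engine` are made up for illustration): the engine only ever applies plain iteration to `start_requests()`, which now yields a single ordinary placeholder request; the async callback that produces the real requests is driven later by the event loop, where `async for` is available.

```python
import asyncio


class FakeRequest:
    def __init__(self, url, callback=None):
        self.url = url
        self.callback = callback


class MySpider:
    def start_requests(self):
        # Synchronous generator: safe for the engine to iterate.
        yield FakeRequest("data:,", callback=self.parse_initial)

    async def parse_initial(self, response):
        # Async generator: yields the actual initial requests.
        for url in ["https://a.example", "https://b.example"]:
            yield FakeRequest(url)


def fake_engine(spider):
    # Plain iteration succeeds because start_requests is now sync.
    placeholder = list(spider.start_requests())[0]

    async def run_callback():
        # The callback's async generator is consumed under the loop.
        return [r.url async for r in placeholder.callback(response=None)]

    return asyncio.run(run_callback())


print(fake_engine(MySpider()))  # ['https://a.example', 'https://b.example']
```

In the real spider, `parse_initial` would `await self.get_start_urls()` and then `yield scrapy.Request(...)` for each URL, exactly as the original `start_requests` did.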