Here is the code I wrote:
# spider.py
import json
import logging
import scrapy
from scrapy.loader import ItemLoader
from .items import FeedItem

def start_requests(self):
    yield scrapy.Request(url=self.url, method='POST', callback=self.parse,
                         dont_filter=True, flags=['this is the start request method'])

def parse(self, response):
    logging.info('started parsing')
    l = ItemLoader(FeedItem())
    json_response = json.loads(response.text)
    l.add_value('feed', json_response)
    yield l.load_item()  # ItemLoader exposes load_item(), not load_items()
    yield scrapy.Request(url=self.url, method='POST', callback=self.parse,
                         flags=['this is the parse method'])
# middlewares.py
import re
import sqlite3

def process_request(self, request, spider):
    # fetchone() returns a row tuple (or None), so unpack the first column
    row = self.cur.execute('SELECT sinceId FROM proposal').fetchone()
    sinceId = row[0] if row else None
    row = self.cur.execute('SELECT jobT FROM proposal').fetchone()
    jobT = row[0] if row else None
    if not sinceId:
        body = self.body.encode('utf-8')
        request = request.replace(body=body)
        spider.logger.info(f'{request.flags}')
        spider.logger.info('Returning unchanged request')
        return request
    body = re.sub(r'("sinceId":")(\d+)(")', f'"sinceId":"{sinceId}"', self.body)  # change sinceId
    body = re.sub(r'("jobTime":")(\d+)(")', f'"jobTime":"{jobT}"', body)  # change jobTime
    body = body.encode('utf-8')  # encode the substituted body, not the original self.body
    spider.logger.info('Body changed')
    request = request.replace(body=body)
    spider.logger.info(f'{request.flags}')
    spider.logger.info('Returning changed request')
    return request
def spider_opened(self, spider):
    self.body = '{"query":"\n query($queryParams: UserSavedSearchesParams) {\n userSavedSearches(params: $queryParams) {\n results {\n id\n uid:id\n title\n ciphertext\n description\n type\n recno\n freelancersToHire\n duration\n durationLabel\n engagement\n amount {\n amount:displayValue\n }\n createdOn:createdDateTime\n publishedOn:publishedDateTime\n renewedOn:renewedDateTime\n prefFreelancerLocation\n prefFreelancerLocationMandatory\n connectPrice\n client {\n totalHires\n totalPostedJobs\n totalSpent {\n rawValue\n currency\n displayValue\n }\n paymentVerificationStatus,\n location {\n country\n }\n totalReviews\n totalFeedback\n companyRid\n edcUserId\n lastContractRid\n companyOrgUid\n hasFinancialPrivacy\n }\n enterpriseJob\n premium\n jobTs:jobTime\n skills {\n id\n name\n prettyName\n highlighted\n }\n contractorTier\n jobStatus\n relevanceEncoded\n totalApplicants\n proposalsTier\n isLocal:local\n locations {\n city\n country\n }\n isApplied:applied\n attrs {\n id\n uid:id\n prettyName:prefLabel\n parentSkillId\n prefLabel\n highlighted\n freeText\n }\n hourlyBudget {\n type\n min\n max\n }\n clientRelation {\n companyRid\n companyName\n edcUserId\n lastContractPlatform\n lastContractRid\n lastContractTitle\n }\n totalFreelancersToHire\n contractToHire\n }\n paging {\n total\n count\n resultSetTs:resultSetTime\n }\n }\n }\n ","variables":{"queryParams":{"sinceId":"1015914410","jobTime":"1728208823055","paging":"0;20"}}}'
    self.body = self.body.replace('\n', '\\n')  # escape the raw newlines so the JSON body stays valid
    self.con = sqlite3.connect('data.db')
    self.cur = self.con.cursor()
    self.cur.execute('CREATE TABLE IF NOT EXISTS proposal(title, description, _type, duration, salary, sinceID, jobT, UNIQUE(title, description, _type, duration, salary, sinceID, jobT))')
    spider.logger.info("Spider opened: %s" % spider.name)
This is how I expect it to work:
However, it gets stuck on the second step indefinitely.
Here is the log output:
2024-10-08 11:12:30 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: feed)
2024-10-08 11:12:31 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.12.6 (tags/v3.12.6:a4a2d2b, Sep 6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Windows-11-10.0.22631-SP0
2024-10-08 11:12:31 [scrapy.addons] INFO: Enabled addons:
[]
2024-10-08 11:12:31 [asyncio] DEBUG: Using selector: SelectSelector
2024-10-08 11:12:31 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-10-08 11:12:31 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-10-08 11:12:31 [scrapy.extensions.telnet] INFO: Telnet Password: 443b0e3cce0e2a7f
2024-10-08 11:12:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2024-10-08 11:12:31 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'feed',
'CONCURRENT_REQUESTS': 1,
'DOWNLOAD_DELAY': 80,
'FEED_EXPORT_ENCODING': 'utf-8',
'NEWSPIDER_MODULE': 'feed.spiders',
'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
'SPIDER_MODULES': ['feed.spiders'],
'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-10-08 11:12:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'feed.middlewares.FeedDownloaderMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-10-08 11:12:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-10-08 11:12:32 [scrapy.middleware] INFO: Enabled item pipelines:
['feed.pipelines.FeedPipeline']
2024-10-08 11:12:32 [scrapy.core.engine] INFO: Spider opened
2024-10-08 11:12:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-10-08 11:12:32 [feed] INFO: Spider opened: feed
2024-10-08 11:12:32 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-10-08 11:12:32 [feed] INFO: ['this is the start request method']
2024-10-08 11:12:32 [feed] INFO: Returning unchanged request
2024-10-08 11:12:33 [feed] INFO: ['this is the start request method']
2024-10-08 11:12:33 [feed] INFO: Returning unchanged request
2024-10-08 11:12:34 [feed] INFO: ['this is the start request method']
2024-10-08 11:12:34 [feed] INFO: Returning unchanged request
2024-10-08 11:12:35 [feed] INFO: ['this is the start request method']
2024-10-08 11:12:35 [feed] INFO: Returning unchanged request
2024-10-08 11:12:36 [feed] INFO: ['this is the start request method']
2024-10-08 11:12:36 [feed] INFO: Returning unchanged request
I don't understand why it keeps sending new requests and never reaches the callback.
You are returning a Request from the process_request method of your downloader middleware. When process_request returns a Request, Scrapy does not continue processing your original request; it puts the newly returned request back into the queue to be scheduled again. That rescheduled request then passes through the same middleware, which returns yet another request, so the request never reaches the downloader and your callback is never called.
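One common way to break that loop is to mark the rewritten request so the middleware lets it through on its second pass. This is a minimal sketch of that pattern, not your actual middleware; the meta key body_rewritten is a name invented here for illustration:

def process_request(self, request, spider):
    if request.meta.get('body_rewritten'):
        # second pass: the body is already set, so return None and let
        # Scrapy continue down the middleware chain to the downloader
        return None
    new_request = request.replace(body=self.body.encode('utf-8'))
    new_request.meta['body_rewritten'] = True  # hypothetical marker, not a Scrapy built-in
    # returning a Request reschedules it, but the marker above makes the
    # middleware wave it through on the next pass
    return new_request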
This is from the documentation for the downloader middleware's process_request method:
This method is called for each request that goes through the download middleware.
process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.
If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request is performed (and its response downloaded).
If it returns a Response object, Scrapy won't bother calling any other process_request() or process_exception() methods, or the appropriate download function; it'll return that response. The process_response() methods of installed middleware are always called on every response.
If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.
If it raises an IgnoreRequest exception, the process_exception() methods of installed downloader middleware will be called. If none of them handle the exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
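In other words, if you only want to modify a request before it is downloaded, mutate it in place and return None instead of returning a new Request. A minimal sketch, assuming the change can be expressed as a header mutation (the header name here is just an example):

def process_request(self, request, spider):
    # headers can be mutated in place on the existing request object
    request.headers['X-Requested-With'] = 'XMLHttpRequest'
    # returning None lets Scrapy keep processing this same request through
    # the remaining middlewares and on to the downloader, so the response
    # eventually reaches the spider callback
    return None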