Start request stuck in an endless loop and never reaches the parse callback


Here is the code I wrote:

# spider.py
def start_requests(self):
    yield scrapy.Request(url=self.url, method='POST', callback=self.parse, dont_filter=True, flags=['this is the start request method'])


def parse(self, response):
    logging.info('started parsing')
    l = ItemLoader(FeedItem())
    json_response = json.loads(response.text)

    l.add_value('feed', json_response)

    yield l.load_item()


    yield scrapy.Request(url=self.url, method='POST', callback=self.parse, flags=['this is the parse method'])
# middlewares.py
def process_request(self, request, spider):
    sinceId = self.cur.execute('SELECT sinceId FROM proposal').fetchone()
    jobT = self.cur.execute('SELECT jobT FROM proposal').fetchone()
    if not sinceId:
        body = self.body.encode('utf-8')
        request = request.replace(body=body)
        spider.logger.info(f'{request.flags}')
        spider.logger.info('Returning unchanged request')
        return request
    
    body = re.sub(r'("sinceId":")(\d+)(")', '"sinceId":' + f'"{sinceId}"', self.body) # change sinceId
    body = re.sub(r'("jobTime":")(\d+)(")', '"jobTime":' + f'"{jobT}"', body) # changed jobTime
    body = body.encode('utf-8')
    spider.logger.info('Body changed')

    request = request.replace(body=body)
    spider.logger.info(f'{request.flags}')
    spider.logger.info('Returning changed request')
    return request

def spider_opened(self, spider):
    self.body = '{"query":"\n          query($queryParams: UserSavedSearchesParams) {\n            userSavedSearches(params: $queryParams) {\n              results {\n                id\n                uid:id\n                title\n                ciphertext\n                description\n                type\n                recno\n                freelancersToHire\n                duration\n                durationLabel\n                engagement\n                amount {\n                  amount:displayValue\n                }\n                createdOn:createdDateTime\n                publishedOn:publishedDateTime\n                renewedOn:renewedDateTime\n                prefFreelancerLocation\n                prefFreelancerLocationMandatory\n                connectPrice\n                client {\n                  totalHires\n                  totalPostedJobs\n                  totalSpent {\n                    rawValue\n                    currency\n                    displayValue\n                  }\n                  paymentVerificationStatus,\n                  location {\n                    country\n                  }\n                  totalReviews\n                  totalFeedback\n                  companyRid\n                  edcUserId\n                  lastContractRid\n                  companyOrgUid\n                  hasFinancialPrivacy\n                }\n                enterpriseJob\n                premium\n                jobTs:jobTime\n                skills {\n                  id\n                  name\n                  prettyName\n                  highlighted\n                }\n                contractorTier\n                jobStatus\n                relevanceEncoded\n                totalApplicants\n                proposalsTier\n                isLocal:local\n                locations {\n                  city\n                  country\n                }\n                isApplied:applied\n                attrs {\n                  id\n                  uid:id\n                  prettyName:prefLabel\n                  parentSkillId\n                  prefLabel\n                  highlighted\n                  freeText\n                }\n                hourlyBudget {\n                  type\n                  min\n                  max\n                }\n                clientRelation {\n                  companyRid\n                  companyName\n                  edcUserId\n                  lastContractPlatform\n                  lastContractRid\n                  lastContractTitle\n                }\n                totalFreelancersToHire\n                contractToHire\n              }\n              paging {\n                  total\n                  count\n                  resultSetTs:resultSetTime\n              }\n            }\n          }\n        ","variables":{"queryParams":{"sinceId":"1015914410","jobTime":"1728208823055","paging":"0;20"}}}'
    self.body = self.body.replace('\n','\\n')
    self.con = sqlite3.connect('data.db')
    self.cur = self.con.cursor()
    self.cur.execute('CREATE TABLE IF NOT EXISTS proposal(title, description, _type, duration, salary, sinceID, jobT, UNIQUE(title, description, _type, duration, salary, sinceID, jobT))')

    spider.logger.info("Spider opened: %s" % spider.name)

Here is how I expected it to work:

  1. It sends the initial request.
  2. The middleware modifies the body.
  3. The response comes back to the parse callback.
  4. ...

However, it stays stuck at step 2 indefinitely.

Here is the log output:

2024-10-08 11:12:30 [scrapy.utils.log] INFO: Scrapy 2.11.2 started (bot: feed)
2024-10-08 11:12:31 [scrapy.utils.log] INFO: Versions: lxml 5.3.0.0, libxml2 2.11.7, cssselect 1.2.0, parsel 1.9.1, w3lib 2.2.1, Twisted 24.7.0, Python 3.12.6 (tags/v3.12.6:a4a2d2b, Sep  6 2024, 20:11:23) [MSC v.1940 64 bit (AMD64)], pyOpenSSL 24.2.1 (OpenSSL 3.3.2 3 Sep 2024), cryptography 43.0.1, Platform Windows-11-10.0.22631-SP0
2024-10-08 11:12:31 [scrapy.addons] INFO: Enabled addons:
[]
2024-10-08 11:12:31 [asyncio] DEBUG: Using selector: SelectSelector
2024-10-08 11:12:31 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.asyncioreactor.AsyncioSelectorReactor
2024-10-08 11:12:31 [scrapy.utils.log] DEBUG: Using asyncio event loop: asyncio.windows_events._WindowsSelectorEventLoop
2024-10-08 11:12:31 [scrapy.extensions.telnet] INFO: Telnet Password: 443b0e3cce0e2a7f
2024-10-08 11:12:31 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2024-10-08 11:12:31 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'feed',
 'CONCURRENT_REQUESTS': 1,
 'DOWNLOAD_DELAY': 80,
 'FEED_EXPORT_ENCODING': 'utf-8',
 'NEWSPIDER_MODULE': 'feed.spiders',
 'REQUEST_FINGERPRINTER_IMPLEMENTATION': '2.7',
 'SPIDER_MODULES': ['feed.spiders'],
 'TWISTED_REACTOR': 'twisted.internet.asyncioreactor.AsyncioSelectorReactor'}
2024-10-08 11:12:32 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'feed.middlewares.FeedDownloaderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2024-10-08 11:12:32 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2024-10-08 11:12:32 [scrapy.middleware] INFO: Enabled item pipelines:
['feed.pipelines.FeedPipeline']
2024-10-08 11:12:32 [scrapy.core.engine] INFO: Spider opened
2024-10-08 11:12:32 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2024-10-08 11:12:32 [feed] INFO: Spider opened: feed
2024-10-08 11:12:32 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2024-10-08 11:12:32 [feed] INFO: ['this is the start request method']
2024-10-08 11:12:32 [feed] INFO: Returning unchanged request
2024-10-08 11:12:33 [feed] INFO: ['this is the start request method']
2024-10-08 11:12:33 [feed] INFO: Returning unchanged request
2024-10-08 11:12:34 [feed] INFO: ['this is the start request method']
2024-10-08 11:12:34 [feed] INFO: Returning unchanged request
2024-10-08 11:12:35 [feed] INFO: ['this is the start request method']
2024-10-08 11:12:35 [feed] INFO: Returning unchanged request
2024-10-08 11:12:36 [feed] INFO: ['this is the start request method']
2024-10-08 11:12:36 [feed] INFO: Returning unchanged request

I don't understand why it keeps sending requests and never reaches the callback.

python web-scraping scrapy

1 Answer

You are returning a Request object from the process_request method of your downloader middleware. When that happens, Scrapy does not continue processing your original request; instead, it puts the newly returned request back into the scheduler queue, so the download never runs and your parse callback is never called.
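One common fix (a sketch, not the asker's exact code) is to return the replaced request only when the body still needs changing, and return None once the request already carries the new body; on the rescheduled pass the guard matches and the request proceeds to the downloader. The `Request` class below is a minimal stand-in that only mimics Scrapy's immutable `body` / `replace()` semantics, so the control flow can be shown without a running crawler:

```python
from dataclasses import dataclass, replace as dc_replace

@dataclass(frozen=True)
class Request:
    """Minimal stand-in for scrapy.Request: body is read-only and
    replace() returns a new Request with the given fields overridden."""
    url: str
    body: bytes = b''

    def replace(self, **kwargs):
        return dc_replace(self, **kwargs)

NEW_BODY = b'{"sinceId":"123"}'

def process_request(request, spider=None):
    # Guard: if the body is already what we want, return None so the
    # request continues down the middleware chain to the downloader.
    if request.body == NEW_BODY:
        return None
    # Otherwise return a replaced request; Scrapy reschedules it and
    # this method runs again on the rescheduled request.
    return request.replace(body=NEW_BODY)

# First pass: a replaced request is returned and rescheduled.
r1 = Request(url='https://example.com')
r2 = process_request(r1)
assert r2 is not None and r2.body == NEW_BODY

# Second pass (the rescheduled request): the guard matches, None is
# returned, and the request finally reaches the downloader.
assert process_request(r2) is None
```

Without the guard, every pass produces a fresh request and the middleware chain restarts forever, which is exactly the repeating "Returning unchanged request" pattern in the log.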

This is from the documentation of the

process_request

method of downloader middlewares:

This method is called for each request that goes through the download middleware.

process_request() should either: return None, return a Response object, return a Request object, or raise IgnoreRequest.

If it returns None, Scrapy will continue processing this request, executing all other middlewares until, finally, the appropriate downloader handler is called and the request performed (and its response downloaded).

If it returns a Response object, Scrapy won't bother calling any other process_request() or process_exception() methods, or the appropriate download function; it'll return that response. The process_response() methods of installed middleware are always called on every response.

If it returns a Request object, Scrapy will stop calling process_request methods and reschedule the returned request. Once the newly returned request is performed, the appropriate middleware chain will be called on the downloaded response.

If it raises an IgnoreRequest exception, the process_exception() methods of installed downloader middleware will be called. If none of them handle the exception, the errback function of the request (Request.errback) is called. If no code handles the raised exception, it is ignored and not logged (unlike other exceptions).
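That rescheduling rule is exactly what the question's log shows: a process_request that always returns a replaced request never lets anything reach the downloader. A toy dispatch loop (hypothetical names, not Scrapy's actual engine code) makes the difference between returning None and returning a request visible:

```python
def run(scheduler_queue, middleware, max_steps=5):
    """Dispatch like Scrapy's engine does: None -> download the request,
    a returned request -> put it back on the scheduler queue."""
    downloaded = []
    steps = 0
    while scheduler_queue and steps < max_steps:
        steps += 1
        req = scheduler_queue.pop(0)
        result = middleware(req)
        if result is None:
            downloaded.append(req)          # request reaches the downloader
        else:
            scheduler_queue.append(result)  # request returned: rescheduled
    return downloaded, steps

# A middleware that always returns a new request (the bug in the question):
always_replace = lambda req: dict(req, replaced=True)
downloaded, steps = run([{'url': 'u'}], always_replace)
assert downloaded == [] and steps == 5  # never downloads; loops until cut off

# A middleware that returns None lets the request through on the first pass:
downloaded, steps = run([{'url': 'u'}], lambda req: None)
assert len(downloaded) == 1 and steps == 1
```

With the asker's middleware, `sinceId` is never in the database on the first run, so the `if not sinceId` branch always replaces the body and returns the request, and the cycle repeats forever.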
