I'm trying to scrape data with Scrapy, but my JSON/CSV output is empty. This is not my first scraper, and I really don't understand why it isn't working.
Here is my spider.
import scrapy

from ..items import NumItem


class ColekaSpider(scrapy.Spider):
    name = 'extract_test'
    allowed_domains = ['www.coleka.com/fr']
    start_urls = [
        'https://www.coleka.com/fr/cartes-de-collection/cartes-pokemon/pokemon-epee-et-bouclier/voltage-eclatant/aspicot-reverse_i804899'
    ]

    def parse(self, response):
        items = NumItem()
        for k in response.xpath('//div[@class="bigTitle clearfix hasImg "]'):
            extract = k.xpath('//div[@class="darker"]').extract()
            items['extract'] = extract
            yield items
Any suggestions would be greatly appreciated.
Thanks in advance.
Here is the console output from my crawl command, posted in reply to Harada; please don't delete it.
2020-12-15 14:57:56 [scrapy.utils.log] INFO: Scrapy 1.7.3 started (bot: coleka)
2020-12-15 14:57:56 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.5, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 19.7.0, Python 3.7.0 (v3.7.0:1bf9cc5093, Jun 27 2018, 04:59:51) [MSC v.1914 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1c 28 May 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2020-12-15 14:57:56 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'coleka', 'FEED_FORMAT': 'json', 'FEED_URI': 'test20.json', 'NEWSPIDER_MODULE': 'coleka.spiders', 'SPIDER_MODULES': ['coleka.spiders']}
2020-12-15 14:57:56 [scrapy.extensions.telnet] INFO: Telnet Password: c6c871dbada3c7f5
2020-12-15 14:57:56 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2020-12-15 14:57:57 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-12-15 14:57:57 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-12-15 14:57:57 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-12-15 14:57:57 [scrapy.core.engine] INFO: Spider opened
2020-12-15 14:57:57 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-12-15 14:57:57 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-12-15 14:57:57 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.coleka.com/fr/cartes-de-collection/cartes-pokemon/pokemon-epee-et-bouclier/voltage-eclatant/aspicot-reverse_i804899> (referer: None)
2020-12-15 14:57:57 [scrapy.core.engine] INFO: Closing spider (finished)
2020-12-15 14:57:57 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 317,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'downloader/response_bytes': 413,
'downloader/response_count': 1,
'downloader/response_status_count/200': 1,
'elapsed_time_seconds': 0.163562,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 12, 15, 13, 57, 57, 522702),
'log_count/DEBUG': 1,
'log_count/INFO': 10,
'response_received_count': 1,
'scheduler/dequeued': 1,
'scheduler/dequeued/memory': 1,
'scheduler/enqueued': 1,
'scheduler/enqueued/memory': 1,
'start_time': datetime.datetime(2020, 12, 15, 13, 57, 57, 359140)}
2020-12-15 14:57:57 [scrapy.core.engine] INFO: Spider closed (finished)
OK, I think I know how to work this out. From the console log you provided, we can see that the crawl does reach the target page (a single GET returning 200), yet no items are scraped. Consider reading the documentation on why response.css is the better choice in this case (selecting by class).
def parse(self, response):
    # make sure the selectors are correct; use the logger to check what the selector gives us
    selector = response.css('div.bigTitle.clearfix.hasImg')
    self.logger.info(f'the selector: {selector}')
    for k in selector:
        # try debugging with the logger
        self.logger.info(f'k in the selector loop is: {k}')
        # create a fresh item per match, so each yield isn't overwriting the same object
        items = NumItem()
        items['extract'] = k.css('div.darker').getall()  # .getall() is equivalent to .extract()
        yield items
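If you would rather stay with XPath, two details of your original parse are worth knowing about anyway: @class="..." in XPath compares the exact attribute string (so the trailing space in "bigTitle clearfix hasImg " must appear in the HTML character for character), and an XPath starting with // inside the loop searches the whole document rather than just the current node k. Here is a minimal sketch of the same method with those two points addressed; it assumes, as in your code, that NumItem has an 'extract' field:

def parse(self, response):
    # contains(@class, ...) matches a class name inside the attribute instead of
    # requiring the exact string, so stray whitespace in the HTML won't break it
    for k in response.xpath('//div[contains(@class, "bigTitle")]'):
        items = NumItem()
        # the leading dot makes the query relative to k; a bare //div[...] here
        # would re-select every matching div on the page on each iteration
        items['extract'] = k.xpath('.//div[@class="darker"]').getall()
        yield items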
Hopefully this helps you find out why no data is being generated and written to your file. If the logger shows nothing for selector and k, then you know your selector is wrong and matches nothing on the page.
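A quick way to test this without re-running the whole crawl is Scrapy's interactive shell:

scrapy shell "https://www.coleka.com/fr/cartes-de-collection/cartes-pokemon/pokemon-epee-et-bouclier/voltage-eclatant/aspicot-reverse_i804899"
>>> response.css('div.bigTitle.clearfix.hasImg')
>>> response.css('div.darker').getall()

If the first query returns an empty list, the outer selector is what needs fixing before anything else.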