I want to crawl a website that contains URLs in this format:
Here is the code:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "test"  # Spider name
    allowed_domains = ["test.com"]  # Which (sub-)domains shall be scraped?
    start_urls = ["https://test.com/"]  # Start with this one
    # Must be a plain string, not a list, for Scrapy's UserAgentMiddleware to pick it up.
    user_agent = "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.1 (KHTML, like Gecko) Chrome/22.0.1207.1 Safari/537.1"
    # Follow any link scrapy finds (that is allowed).
    rules = [Rule(LinkExtractor(allow=(r'/[a-z-]+/[0-9]+\.html$',)), callback='parse_item', follow=True)]

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
        # extract_first() returns None instead of raising IndexError if <title> is missing.
        title = response.xpath('//title/text()').extract_first()
        post = ''
        for line in response.xpath('//div[@id="article_body"]/p/text()').extract():
            post += line
        print('TITLE: %s \n' % title)
        print('CONTENT: %s \n' % post)
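For reference, the allow pattern in that rule only matches URLs whose path ends in /<lowercase-words>/<digits>.html. A quick standalone check of the same regex (the sample URLs below are made up for illustration):

import re

pattern = re.compile(r'/[a-z-]+/[0-9]+\.html$')
for url in ['https://test.com/world-news/123456.html',   # matches
            'https://test.com/about.html',                # no match: no numeric id segment
            'https://test.com/World-News/123456.html']:   # no match: uppercase letters
    print(url, '->', bool(pattern.search(url)))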
Results:
2017-11-22 12:19:19 [scrapy.core.engine] INFO: Closing spider (finished)
2017-11-22 12:19:19 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 132266,
'downloader/request_count': 315,
'downloader/request_method_count/GET': 315,
'downloader/response_bytes': 9204814,
'downloader/response_count': 315,
'downloader/response_status_count/200': 313,
'downloader/response_status_count/301': 2,
'dupefilter/filtered': 21126,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2017, 11, 22, 12, 19, 19, 295516),
'log_count/DEBUG': 318,
'log_count/INFO': 11,
'offsite/domains': 1,
'offsite/filtered': 312,
'request_depth_max': 4,
'response_received_count': 313,
'scheduler/dequeued': 315,
'scheduler/dequeued/memory': 315,
'scheduler/enqueued': 315,
'scheduler/enqueued/memory': 315,
'start_time': datetime.datetime(2017, 11, 22, 12, 14, 41, 591030)}
2017-11-22 12:19:19 [scrapy.core.engine] INFO: Spider closed (finished)
The spider stops after a minute and only returns the most recent content! Any solution?
Scrapy has flagged some of your requests as duplicate links, so they are being dropped; you can see this in the stats line 'dupefilter/filtered': 21126.
You can add the following line to the settings.py file in your Scrapy project folder:
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
This disables duplicate filtering completely (for the whole project), but then you have to detect and filter duplicate requests yourself.
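If you go that route, one option is to plug in your own dupefilter instead of disabling filtering outright. Below is a minimal sketch, assuming a hypothetical module myproject/dupefilters.py (the class name and module path are placeholders, not part of the original answer); it drops a request only when its exact URL string has already been seen:

from scrapy.dupefilters import BaseDupeFilter

class URLDupeFilter(BaseDupeFilter):
    """Drop a request only if its exact URL was already scheduled."""

    def __init__(self):
        self.seen_urls = set()  # URLs scheduled so far (in-memory only)

    def request_seen(self, request):
        # The scheduler calls this for every request; returning True drops it.
        if request.url in self.seen_urls:
            return True
        self.seen_urls.add(request.url)
        return False

Then point the setting at your class instead:

DUPEFILTER_CLASS = 'myproject.dupefilters.URLDupeFilter'

Unlike the default fingerprint-based filter, this compares raw URL strings, so the same page reached with query parameters in a different order would be crawled twice.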