How can I get all headlines and articles with Scrapy?

Question

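I am currently scraping some information from the web. I don't know why, but my spider just isn't working properly. I would appreciate it if someone could correct my code.

This is just an example, but what I want to do is start from the start URL, visit every article listed on it, and extract the title and the article text from each one. (All article URLs look like http://www.bbc.com/sport/tennis/42610656.)

Here is my code.

Thank you very much for your help!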

# -*- coding: utf-8 -*-
import scrapy
from myproject.NLP_items import Headline
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors import LinkExtractor




class BBC_sport_Spider(CrawlSpider):
    name = 'sport'
    allowed_domains = ['www.bbc.com']
    start_urls = ['http://www.bbc.com/sport/']
    allow_list = ['.*']

    rules = (
        Rule(LinkExtractor( allow=allow_list), callback='parse_item'),
        Rule(LinkExtractor(), follow=True),
    )


    def parse(self, response):

        for url in response.xpath('//div[@id="top-stories"]//a/@href').extract():
            yield(scrapy.Request(response.urljoin(url), self.parse_topics))

    def parse_topics(self, response):


        item=Headline()
        item["title"]=response.xpath('//div[@class="gel-layout__item gel-2/3@l"]//h1').extract()
        item["article"]=response.xpath('//div[@id="story-body"]//p').extract()


        yield item


Answer 1

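Quoting the Scrapy documentation:

When writing crawl spider rules, avoid using parse as callback, since the CrawlSpider uses the parse method itself to implement its logic. So if you override the parse method, the crawl spider will no longer work.

So, by overriding the parse method, you make the rules you defined useless.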

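In addition, allow=['.*'] in your link extractor is a no-op, and so is follow=True in a rule without a callback, because follow already defaults to True when no callback is given. If you want a rule to both parse items and follow links, specify both callback and follow on a single rule instead of creating two rules.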

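Another problem is the parse method itself. Its XPath selects links from a div that does not exist (although a section with that id does exist). But since you are already using CrawlSpider, you should specify appropriate rules in the first place rather than override parse.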

Finally, note that scrapy.contrib.spiders and scrapy.contrib.linkextractors are deprecated; you should use scrapy.spiders and scrapy.linkextractors instead.
