Scrapy - filtering yielded items

Question

I am trying to scrape some items, as shown below:

def parse(self, response):

    item = GameItem()
    item['game_commentary'] = response.css('tr td:nth-child(2)[style*=vertical-align]::text').extract()
    item['game_movement'] = response.xpath("//tr/td[1][contains(@style,'vertical-align: top')]/text()").extract()

    yield item    

My problem is that I don't want to yield every value that the current response.xpath and response.css selectors extract.

Before assigning the results to item['game_commentary'] and item['game_movement'], is there a way to apply a regex or some other method to filter out the values I don't want to yield?

python-3.x xpath scrapy xml-parsing css-selectors
1 answer

I would look into Item Loaders to accomplish this. You would have to rewrite your parse method as follows:

def parse(self, response):
    loader = GameItemLoader(item=GameItem(), response=response)
    loader.add_css('game_commentary', 'tr td:nth-child(2)[style*=vertical-align]::text')
    loader.add_xpath('game_movement', "//tr/td[1][contains(@style,'vertical-align: top')]/text()")
    item = loader.load_item()
    yield item    

Your items.py would then look something like this:

from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
# Newer Scrapy releases expose these processors from itemloaders.processors instead
from scrapy.loader.processors import TakeFirst

class GameItemLoader(ItemLoader):
    # Default input and output processors are executed for every field loaded,
    # unless a specific input or output processor is defined for that field.
    # TakeFirst keeps only the first non-empty value collected for a field.
    default_output_processor = TakeFirst()

    # You can define specific input and output processors per field;
    # the '...' values are placeholders for your own processors.
    game_commentary_in = '...'
    game_commentary_out = '...'

class GameItem(Item):
    game_commentary = Field()
    game_movement = Field()
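
As a rough illustration of what a regex-based filter for those processors might look like, here is a minimal sketch in plain Python. The names `WANTED`, `keep_wanted` and `filter_values` and the pattern itself are hypothetical and not part of the original answer; the pattern assumes that "unwanted" means values with no letters or digits, which may not match your data.

```python
import re

# Hypothetical pattern: keep only values that contain a letter or a digit.
# Replace it with whatever rule separates wanted from unwanted values.
WANTED = re.compile(r"[A-Za-z0-9]")


def keep_wanted(value):
    """Return the stripped value if it matches WANTED, otherwise None."""
    value = value.strip()
    return value if WANTED.search(value) else None


def filter_values(values):
    """Apply keep_wanted to each value and drop the None results."""
    kept = (keep_wanted(v) for v in values)
    return [v for v in kept if v is not None]


# Made-up sample of what .extract() might return for a table column
raw = ["  Kick-off ", "\n", "45'", " - ", ""]
print(filter_values(raw))
```

With Item Loaders, `keep_wanted` could be wrapped as `MapCompose(keep_wanted)` and assigned to `game_commentary_in`, since `MapCompose` discards a value when a function returns `None` for it. Without a loader, `filter_values` could be called directly on the list returned by `.extract()` before assigning it to the item.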
