I'm trying to scrape some items, as shown below:
def parse(self, response):
    item = GameItem()
    item['game_commentary'] = response.css('tr td:nth-child(2)[style*=vertical-align]::text').extract()
    item['game_movement'] = response.xpath("//tr/td[1][contains(@style,'vertical-align: top')]/text()").extract()
    yield item
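My problem is that I don't want to yield every value that the current response.xpath or response.css selectors extract. Before assigning the results to item['game_commentary'] and item['game_movement'], is there a way to apply a regex or some other method to filter out the values I don't want?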
I would look into Item Loaders to accomplish this. You would have to rewrite your parse method as follows:
def parse(self, response):
    loader = GameItemLoader(item=GameItem(), response=response)
    loader.add_css('game_commentary', 'tr td:nth-child(2)[style*=vertical-align]::text')
    loader.add_xpath('game_movement', "//tr/td[1][contains(@style,'vertical-align: top')]/text()")
    item = loader.load_item()
    yield item
Your items.py would look something like this:
from scrapy.item import Item, Field
from scrapy.loader import ItemLoader
from scrapy.loader.processors import TakeFirst


class GameItemLoader(ItemLoader):
    # default input & output processors
    # will be executed for each item loaded,
    # except if a specific in or output processor is specified
    default_output_processor = TakeFirst()

    # you can specify specific input & output processors per field
    game_commentary_in = '...'
    game_commentary_out = '...'


class GameItem(Item):
    game_commentary = Field()
    game_movement = Field()
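
The '...' values for game_commentary_in and game_commentary_out are placeholders for you to fill in. As a rough sketch of the kind of filtering function you might plug in there, here is a plain-Python example that needs nothing from Scrapy to run. The names WANTED, clean_text and filter_values, and the pattern itself, are my own assumptions for illustration; they are not part of Scrapy or of your project.

```python
import re

# Hypothetical pattern: keep only values containing at least one
# letter, digit or underscore. Replace it with an expression that
# describes the values you actually want to keep.
WANTED = re.compile(r"\w")


def clean_text(value):
    """Return the stripped value if it matches WANTED, otherwise None."""
    text = value.strip()
    if WANTED.search(text):
        return text
    return None


def filter_values(values):
    """Apply clean_text to each extracted string and drop the None results."""
    cleaned = (clean_text(v) for v in values)
    return [v for v in cleaned if v is not None]


if __name__ == "__main__":
    # Made-up sample of what a ::text selector might return,
    # including whitespace-only nodes.
    extracted = ["  Goal by Smith ", "\n", "   ", "45'"]
    print(filter_values(extracted))
```

Inside a loader you would probably not need filter_values at all: wrapping the per-value function as MapCompose(clean_text) and assigning that to game_commentary_in should give the same effect, since MapCompose ignores values for which a function returns None.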