Updated after a hint from @wRAR
I am following this example to scrape a news website. When I check the type returned in his example, it is scrapy.selector.unified.SelectorList.
In my case, the data I am interested in sits inside a <script> tag, so I managed to extract and parse it into a list with the Python code below.
fetch('https://newswebsite.com/news/national')
data = re.findall("<script type=.application.ld.json. id=.listing-ld.>{.@graph.:(.+?),.@context.:.http:..schema.org..<.script>", response.body.decode("utf-8"), re.S)
#convert list to string before converting to json
jsonData = json.loads(''.join(data))
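Now that I get a list back, I can no longer follow the example to implement the item loader.
Could you point me to the Python concepts used in the code below, so that I can get familiar with them and adapt the code to my use case?
In particular, I am trying to understand what the for loop is doing.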
from itemloaders.processors import TakeFirst, MapCompose
from scrapy.loader import ItemLoader

class ChocolateProductLoader(ItemLoader):
    default_output_processor = TakeFirst()
    price_in = MapCompose(lambda x: x.split("£")[-1])
    url_in = MapCompose(lambda x: 'https://www.chocolate.co.uk' + x)
import scrapy
from chocolatescraper.itemloaders import ChocolateProductLoader
from chocolatescraper.items import ChocolateProduct

class ChocolateSpider(scrapy.Spider):
    # The name of the spider
    name = 'chocolatespider'
    # These are the urls that we will start scraping
    start_urls = ['https://www.chocolate.co.uk/collections/all']

    def parse(self, response):
        products = response.css('product-item')
        for product in products:
            chocolate = ChocolateProductLoader(item=ChocolateProduct(), selector=product)
            chocolate.add_css('name', "a.product-item-meta__title::text")
            chocolate.add_css('price', 'span.price', re='<span class="price">\n <span class="visually-hidden">Sale price</span>(.*)</span>')
            chocolate.add_css('url', 'div.product-item-meta a::attr(href)')
            yield chocolate.load_item()

        next_page = response.css('[rel="next"] ::attr(href)').get()
        if next_page is not None:
            next_page_url = 'https://www.chocolate.co.uk' + next_page
            yield response.follow(next_page_url, callback=self.parse)
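Hi!
As a solution, I suggest using ItemLoader in a separate function. The problem is that you are trying to use yield together with ItemLoader, but the item loader only works with return.
There are two options here. The first is to yield an instance of the item class right away: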
for product in products:
    yield ChocolateProductItem(
        # query the current product, not the whole response
        name=product.css('a.product-item-meta__title::text').get(),
        ...
    )
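Since the question asks what the for loop is doing: a function that contains yield is a generator, so the loop hands back one item per iteration instead of building a list and returning it at the end. A minimal sketch in plain Python, where the `products` list is made-up data standing in for what `response.css('product-item')` would return:

```python
# Hypothetical stand-in for the selectors returned by response.css('product-item').
products = [
    {"name": "Dark Bar", "price": "£4.50"},
    {"name": "Milk Bar", "price": "£3.95"},
]


def parse(products):
    """Generator: hands back one item per loop iteration."""
    for product in products:
        # Execution pauses at yield and resumes when the caller asks for the next item.
        yield {"name": product["name"], "price": product["price"].split("£")[-1]}


# Scrapy iterates over the generator for you; list() does the same thing here.
items = list(parse(products))
print(items)
```

Inside a spider you never call `list()` yourself; Scrapy consumes whatever `parse` yields, one item or request at a time.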
Or, and I think this is the best solution, define a separate function that builds the entry for each new item, and use ItemLoader + return inside it.
        ...
        for product in products:
            # Get the URL from the product and pass it to the request below.
            # The selector is borrowed from your spider; it depends on the contents of the product.
            url = product.css('div.product-item-meta a::attr(href)').get()
            # response.follow resolves relative links, which scrapy.Request would reject
            yield response.follow(
                url,
                callback=self.parse_item,
            )

    def parse_item(self, response):
        """
        Scrape a product page.
        """
        loader = ItemLoader(item=ChocolateProductItem(), response=response)
        loader.add_css('name', 'a.product-item-meta__title')
        # Note that I don't select ::text right away: the ItemLoader's input and
        # output processors offer a neat way to pull the needed data out of the
        # string. More on that below.
        ...
        return loader.load_item()
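For the JSON-LD list from the question there is no selector to hand to a loader, so one option is to loop over the parsed entries directly. Below is a rough sketch in plain Python, assuming each entry is a dict; the sample HTML, the `headline` and `url` keys, and the helper names `extract_graph` and `parse_entries` are all made up for illustration. Scrapy accepts plain dicts as items, and inside a spider the same values could instead be fed to a loader with `add_value`.

```python
import json
import re

# Made-up page source; the real markup on the news site may differ.
html = """
<html><head>
<script type="application/ld+json" id="listing-ld">
{"@graph": [
  {"headline": "First story", "url": "/news/1"},
  {"headline": "Second story", "url": "/news/2"}
], "@context": "http://schema.org"}
</script>
</head></html>
"""


def extract_graph(page_source):
    """Return the @graph list from the listing-ld script tag, or [] if absent."""
    match = re.search(
        r'<script[^>]*id="listing-ld"[^>]*>(.*?)</script>', page_source, re.S
    )
    if match is None:
        return []
    # Parse the whole JSON document, then pick out the list we care about.
    return json.loads(match.group(1)).get("@graph", [])


def parse_entries(entries):
    """Yield one plain-dict item per JSON entry; the field names are assumptions."""
    for entry in entries:
        yield {"headline": entry.get("headline"), "url": entry.get("url")}


news_items = list(parse_entries(extract_graph(html)))
print(news_items)
```

Parsing the full script body and then reading `@graph` keeps the regular expression short and avoids joining regex matches back into a string before calling `json.loads`.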
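# items.py
# You can, of course, leave it as you had it, that is fine too; I'm just sharing my approach.
import scrapy
from itemloaders.processors import MapCompose, TakeFirst
from w3lib.html import remove_tags

# clean_data is a helper function of your own (not shown here)
class ChocolateProductItem(scrapy.Item):
    name = scrapy.Field(
        input_processor=MapCompose(remove_tags, clean_data),
        output_processor=TakeFirst(),
    )
    ...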
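To give a feel for what the two processors do to a value, here is a rough imitation in plain Python so it runs without Scrapy installed. The functions `map_compose`, `take_first` and `remove_tags_sketch` are simplified stand-ins written for this sketch, not the real `MapCompose`, `TakeFirst` and `remove_tags`, and the body of `clean_data` is an assumption.

```python
import re


def remove_tags_sketch(text):
    """Crude stand-in for w3lib.html.remove_tags: drop anything that looks like a tag."""
    return re.sub(r"<[^>]+>", "", text)


def clean_data(text):
    """Hypothetical cleaner; here it only strips surrounding whitespace."""
    return text.strip()


def map_compose(*functions):
    """Rough imitation of MapCompose: run each function over every value in turn."""
    def process(values):
        for function in functions:
            values = [function(value) for value in values]
            # Values that a function turns into None are dropped.
            values = [value for value in values if value is not None]
        return values
    return process


def take_first(values):
    """Rough imitation of TakeFirst: first value that is neither None nor ''."""
    for value in values:
        if value is not None and value != "":
            return value
    return None


# What add_css would collect without ::text: the whole element as a string.
raw = ['<a class="product-item-meta__title"> Dark Bar </a>']
name_in = map_compose(remove_tags_sketch, clean_data)
name = take_first(name_in(raw))
print(name)
```

The input processor cleans every extracted string, and the output processor then collapses the list into a single value, which is why the selector can leave out `::text`.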
Hope this helps! Good luck!