Scrapy - how do I get data from yielded requests back into the main parse method's item?


I've been stuck on this for a couple of days. I'm parsing a JSON object that contains some information. Inside this object there is a list of n contacts. Each contact has an ID that can be used to build a URL, and the page at that URL contains the contact's phone number.

So I want to start by creating an item and adding some information to it, then loop over the contacts; on each iteration I want to add the phone number found at the contact's URL to that original item.

My problem: how do I return the scraped phone number and add it to the item? If I end the main parse method with `yield items`, none of the data scraped in the loop gets added to the item. But if I instead end parseContact with `yield items`, the whole item is duplicated once per loop iteration.

Please help, I'm about to lose my mind :D

Here is the code:

def parse(self, response):
    items = projectItem()
    rData = response.xpath('//*[@id="data"]/text()').get()
    dData = json.loads(rData)
    listOfContacts = dData["contacts"]
    Data = dData["customer"]

    items['customername'] = Data["companyName"]
    items['vatnumber'] = Data["vatNo"]
    items['contacts'] = []

    i = 0
    for p in listOfContacts:
        id = json.dumps(p["key"])
        pid = id.replace("\"", "")
        urlP = urljoin("https://example.com/?contactid=", pid)
        items['contacts'].append({"pid": pid, "name": p["name"]})

        yield scrapy.Request(urlP, callback=self.parseContact, dont_filter=True,
                             cb_kwargs={'items': items}, meta={"counter": i})
        i += 1
    # IF I YIELD HERE, NONE OF THE DATA IN THE LOOP GETS SAVED
    yield items


def parseContact(self, response, items):
    i = response.meta['counter']

    data = response.xpath('//*[@id="contactData"]/script/text()').get()
    items['contacts'][i].update({"data": data})
    # IF I YIELD HERE THE ITEM IS DUPLICATED N TIMES
    yield items
1 Answer

If you want one item per company, you need to fully build the item before yielding it. I would do it like this:

import json
import scrapy
from urllib.parse import urljoin

def parse(self, response):
    items = projectItem()
    rData = response.xpath('//*[@id="data"]/text()').get()
    dData = json.loads(rData)
    listOfContacts = dData["contacts"]
    Data = dData["customer"]
    items['contacts'] = []

    items['customername'] = Data["companyName"]
    items['vatnumber'] = Data["vatNo"]
    contacts_info = []
    # prepare list with the contact urls, pid & name
    for p in listOfContacts:
        id = json.dumps(p["key"])
        pid = id.replace("\"", "")
        urlP = urljoin("https://example.com/?contactid=", pid)
        contacts_info.append((urlP, pid, p["name"]))
    # take the first contact from the list, and pass the rest of the list along in the meta
    urlP, pid, name = contacts_info.pop(0)
    yield scrapy.Request(urlP,
                         callback=self.parseContact,
                         dont_filter=True,
                         meta={"contacts_info": contacts_info,
                               "items": items,
                               "count": 0})

def parseContact(self, response):
    contacts_info = response.meta['contacts_info']
    # index of the contact this response belongs to (defaults to 0)
    count = response.meta.get('count', 0)
    items = response.meta['items']
    data = response.xpath('//*[@id="contactData"]/script/text()').get()
    items['contacts'][count].update({"data": data})
    try:
        urlP, pid, name = contacts_info.pop(0)
    except IndexError:
        # contacts_info is empty, so the item is finished and can be yielded
        yield items
    else:
        yield scrapy.Request(urlP,
                             callback=self.parseContact,
                             dont_filter=True,
                             meta={"contacts_info": contacts_info,
                                   "items": items,
                                   "count": count + 1})

I'm not sure what the link between pid and counter is (hence the part of the code that also passes along pid and name), but I hope you get the idea here.
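Stripped of the Scrapy specifics, the core idea is a sequential chain: each callback fills in one contact's slot, then either "schedules" the next request or emits the finished item. A minimal pure-Python sketch of the same pattern (`fetch_phone` and the sample data are hypothetical stand-ins for the real HTTP requests and JSON):

```python
def fetch_phone(url):
    # hypothetical stand-in for the HTTP request Scrapy would perform
    return "phone-for-" + url

def parse_contact(contacts_info, item, count):
    # fill in the slot for the contact this "response" belongs to
    url, pid, name = contacts_info[count]
    item["contacts"][count]["data"] = fetch_phone(url)
    if count + 1 < len(contacts_info):
        # more contacts left: chain the next "request"
        return parse_contact(contacts_info, item, count + 1)
    # chain finished: the item is complete and can be emitted
    return item

contacts_info = [("https://example.com/?contactid=a1", "a1", "Alice"),
                 ("https://example.com/?contactid=b2", "b2", "Bob")]
item = {"contacts": [{"pid": pid, "name": name} for _, pid, name in contacts_info]}
finished = parse_contact(contacts_info, item, 0)
```

The trade-off of chaining is that the contact pages are fetched one at a time instead of concurrently; in exchange, exactly one complete item is yielded at the end of the chain.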
