I've been stuck on this for a few days now. I'm parsing a JSON object that contains some information. Inside this object there is a list of n contacts. Each of them has an ID that can be used to build a URL, and the page at that URL contains the contact's phone number.
So I want to start by creating one item, add some information to it, and then loop over the contacts; on each iteration I want to add the phone number found at the URL to that original item.
My question: how do I return the scraped phone number and add it to the item? If I end the main parse method with "yield items", none of the data scraped in the loop gets added to the item. But if I instead end parseContact with "yield items", the whole item is duplicated once per contact.
Please help, I'm about to break down :D
Here is the code:
def parse(self, response):
    items = projectItem()
    rData = response.xpath('//*[@id="data"]/text()').get()
    dData = json.loads(rData)
    listOfContacts = dData["contacts"]
    Data = dData["customer"]
    items['customername'] = Data["companyName"]
    items['vatnumber'] = Data["vatNo"]
    items['contacts'] = []
    i = 0
    for p in listOfContacts:
        id = json.dumps(p["key"])
        pid = id.replace("\"", "")
        urlP = urljoin("https://example.com/?contactid=", pid)
        items['contacts'].append({"pid": pid, "name": p["name"]})
        yield scrapy.Request(urlP, callback=self.parseContact, dont_filter=True,
                             cb_kwargs={'items': items}, meta={"counter": i})
        i += 1
    # IF I YIELD HERE, NONE OF THE DATA IN THE LOOP GETS SAVED
    yield items

def parseContact(self, response, items):
    i = response.meta['counter']
    data = response.xpath('//*[@id="contactData"]/script/text()').get()
    items['contacts'][i].update({"data": data})
    # IF I YIELD HERE THE ITEM IS DUPLICATED N TIMES
    yield items
If you want one item per company, the item needs to be fully built before it is yielded: yielding at the end of parse hands the item to the pipelines before any of the contact responses have arrived, while yielding in parseContact emits the shared item once per response. Instead, you can chain the requests so that each response triggers the next one, and only yield the item once the list of contacts is exhausted. I would do it like this:
import json
import scrapy
from urllib.parse import urljoin

def parse(self, response):
    items = projectItem()
    rData = response.xpath('//*[@id="data"]/text()').get()
    dData = json.loads(rData)
    listOfContacts = dData["contacts"]
    Data = dData["customer"]
    items['contacts'] = []
    items['customername'] = Data["companyName"]
    items['vatnumber'] = Data["vatNo"]
    contacts_info = []
    # prepare a list with the contact urls, pid & name
    for p in listOfContacts:
        id = json.dumps(p["key"])
        pid = id.replace("\"", "")
        urlP = urljoin("https://example.com/?contactid=", pid)
        items['contacts'].append({"pid": pid, "name": p["name"]})
        contacts_info.append((urlP, pid, p["name"]))
    # take the first entry from the list, and pass the rest of the list along in the meta
    urlP, pid, name = contacts_info.pop(0)
    yield scrapy.Request(urlP,
                         callback=self.parseContact,
                         dont_filter=True,
                         meta={"contacts_info": contacts_info,
                               "items": items})

def parseContact(self, response):
    contacts_info = response.meta['contacts_info']
    # get the count from the meta, or default to 0 for the first response
    count = response.meta.get('count', 0)
    items = response.meta['items']
    data = response.xpath('//*[@id="contactData"]/script/text()').get()
    items['contacts'][count].update({"data": data})
    try:
        urlP, pid, name = contacts_info.pop(0)
    except IndexError:
        # contacts_info is empty, so the item is finished and can be yielded
        yield items
    else:
        yield scrapy.Request(urlP,
                             callback=self.parseContact,
                             dont_filter=True,
                             meta={"contacts_info": contacts_info,
                                   "items": items,
                                   "count": count + 1})
I wasn't sure exactly how pid and the counter are linked, so I kept the part of your loop that appends the pid and name to items['contacts'] and matched each response to its entry with a simple running index, but I hope you get the idea.
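If relying on list order and a running index feels fragile, a variation (a sketch of my own, not tested against your site) is to pass each contact's pid along with the request via cb_kwargs, which Scrapy supports since 1.7 and which you already use in your own code, and then look the entry up by pid instead of by position. Here parse would build contacts_info exactly as above and start the chain with cb_kwargs={"items": items, "contacts_info": contacts_info, "pid": pid}:

def parseContact(self, response, items, contacts_info, pid):
    data = response.xpath('//*[@id="contactData"]/script/text()').get()
    # locate the contact entry by its pid instead of a positional counter
    for c in items['contacts']:
        if c['pid'] == pid:
            c.update({"data": data})
            break
    try:
        urlP, next_pid, name = contacts_info.pop(0)
    except IndexError:
        # no contacts left, so the item is complete and can be yielded
        yield items
    else:
        yield scrapy.Request(urlP,
                             callback=self.parseContact,
                             dont_filter=True,
                             cb_kwargs={"items": items,
                                        "contacts_info": contacts_info,
                                        "pid": next_pid})

Either way, keep in mind that chaining means the contact pages are fetched one at a time rather than in parallel, and that if any request in the chain fails the item is never yielded, so adding an errback that yields the partially filled item may be worthwhile.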