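I want to use Scrapy to extract information from this website. The information I need is in a JSON block, but that JSON contains unwanted literal line breaks, and only in the description field.
Here is a sample page; the JSON element I want to scrape is this one: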
<script type="application/ld+json">
{
"@context": "http://schema.org",
"@type": "Product",
"description": "Hamster ve Guinea Pig için tasarlanmış temizliği kolay mama kabıdır.
Hamster motifleriyle süslü ve son derece sevimlidir.
Ürün seramikten yapılmıştır
Ürün ölçüleri
Hacim: 100 ml
Çap: 8 cm",
"name": "Karlie Seramik Hamster ve Guinea Pigler İçin Yemlik 100ml 8cm",
"image": "https://www.petlebi.com/up/ecommerce/product/lg_karlie-hamster-mama-kaplari-359657192.jpg",
"brand": {
"@type": "Brand",
"name": "Karlie"
},
"category": "Guinea Pig Yemlikleri",
"sku": "4016598440834",
"gtin13": "4016598440834",
"offers": {
"@type": "Offer",
"availability": "http://schema.org/InStock",
"price": "149.00",
"priceCurrency": "TRY",
"itemCondition": "http://schema.org/NewCondition",
"url": "https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html"
},
"review": [
]
}
</script>
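As you can see, the description contains literal line breaks, which JSON does not allow inside a string. Here is the code I tried, which did not work: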
import scrapy
import json
import re


class JsonSpider(scrapy.Spider):
    name = 'json_spider'
    start_urls = ['https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html']

    def parse(self, response):
        # Extract the script content containing the JSON data
        script_content = response.xpath('/html/body/script[12]').get()
        if not script_content:
            self.logger.warning("Script content not found.")
            return
        json_data_match = re.search(r'<script type="application/ld\+json">(.*?)<\/script>', script_content, re.DOTALL)
        if json_data_match:
            json_data_str = json_data_match.group(1)
            try:
                json_obj = json.loads(json_data_str)
                product_info = {
                    "name": json_obj.get("name"),
                    "description": json_obj.get("description"),
                    "image": json_obj.get("image"),
                    "brand": json_obj.get("brand", {}).get("name"),
                    "category": json_obj.get("category"),
                    "sku": json_obj.get("sku"),
                    "price": json_obj.get("offers", {}).get("price"),
                    "url": json_obj.get("offers", {}).get("url")
                }
                self.logger.info("Extracted Product Information: %s", product_info)
                with open('product_info.json', 'w', encoding='utf-8') as json_file:
                    json.dump(product_info, json_file, ensure_ascii=False, indent=2)
            except json.JSONDecodeError as e:
                self.logger.error("Error decoding JSON: %s", e)

    def start_requests(self):
        yield scrapy.Request(
            url='https://www.petlebi.com/kemirgen-urunleri/karlie-seramik-hamster-ve-guinea-pig-mama-kabi-100ml-8cm.html',
            callback=self.parse,
        )
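I want the code to be dynamic so that it works for every product.
I used https://jsonlint.com/ to find the offending characters; when I remove the escape characters (the line breaks) from the description, it reports the JSON as valid. I tried html.unescape, but it did not help. The code fails at this line:
json_obj = json.loads(json_data_str)
What should I do?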
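Just remove the specific characters from the response text before converting it to a JSON object. Strings are immutable, so assign the result back:
json_data_str = json_data_str.replace("\n", "").replace("\r", "").replace("\t", "")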
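A minimal sketch of the character-stripping approach, run outside Scrapy. The sample string `raw` and the helper name `strip_control_chars` are made up for illustration; only `str.replace` and `json.loads` come from the answer.

```python
import json

# Hypothetical sample: a JSON-like string with raw line breaks inside a value.
raw = '{"description": "Line one.\nLine two.\r\nLine three.",\n\t"name": "Bowl"}'


def strip_control_chars(text: str) -> str:
    """Remove raw newlines, carriage returns and tabs from the text."""
    return text.replace("\n", "").replace("\r", "").replace("\t", "")


# Parsing the raw text fails because of the control characters in the string.
try:
    json.loads(raw)
    raw_is_valid = True
except json.JSONDecodeError:
    raw_is_valid = False

# After stripping, the text parses with the default strict parser.
data = json.loads(strip_control_chars(raw))
print(raw_is_valid, data["description"])
```

Note that the lines of the description are joined with nothing between them. If the line boundaries matter, replacing with a space instead of an empty string may be preferable.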
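Alternatively, you can pass the strict parameter to the json.loads function. With strict=False, control characters such as newlines are allowed inside strings:
json_obj = json.loads(json_data_str, strict=False)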
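The `strict=False` option can be sketched as follows, again outside Scrapy. The `html` sample, the `LD_JSON_RE` pattern name, and the `parse_ld_json` helper are assumptions for illustration; the regular expression mirrors the one in the question.

```python
import json
import re

# Hypothetical page fragment with a raw line break inside "description".
html = '''<script type="application/ld+json">
{
"@type": "Product",
"description": "First line.
Second line.",
"name": "Sample"
}
</script>'''

# Same pattern as in the question: capture the body of the ld+json script tag.
LD_JSON_RE = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)


def parse_ld_json(page_html):
    """Return the first ld+json block as a dict, or None if there is none."""
    match = LD_JSON_RE.search(page_html)
    if match is None:
        return None
    # strict=False lets the decoder accept control characters in strings.
    return json.loads(match.group(1), strict=False)


product = parse_ld_json(html)
print(repr(product["description"]))
```

Unlike stripping, this keeps the line breaks in the parsed description. In a Scrapy callback, the script body could probably be selected by its type attribute rather than by position, for example with `response.xpath('//script[@type="application/ld+json"]/text()').get()`, which should make the spider less dependent on one page's layout.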