在 python3 中使用 BeautifulSoup 提取 html 中的元素时出现问题<script>

问题描述 投票:0回答:1
python-3.x web-scraping beautifulsoup jupyter-notebook
1个回答
1
投票
只需使用

.find()

find_all()


当我这样做时,我看到它实际上是 json 格式,因此可以读取该元素并以这种方式存储所有数据。

import requests from bs4 import BeautifulSoup import json import re url = "https://www.daraz.pk/catalog/?q=dell&_keyori=ss&from=input&spm=a2a0e.searchlist.search.go.57446b5079XMO8" page = requests.get(url) print(page.status_code) print(page.text) soup = BeautifulSoup(page.text, 'html.parser') print(soup.prettify()) alpha = soup.find_all('script',{'type':'application/ld+json'}) jsonObj = json.loads(alpha[1].text) for item in jsonObj['itemListElement']: name = item['name'] price = item['offers']['price'] currency = item['offers']['priceCurrency'] availability = item['offers']['availability'].split('/')[-1] availability = [s for s in re.split("([A-Z][^A-Z]*)", availability) if s] availability = ' '.join(availability) url = item['url'] print('Availability: %s Price: %0.2f %s Name: %s' %(availability,float(price), currency,name))

输出:

Availability: In Stock Price: 82199.00 Rs. Name: DELL INSPIRON 15 5570 - 15.6"HD - CI5 - 8THGEN - 4GB - 1TB HDD - AMD RADEON 530 2GB GDDR5. Availability: In Stock Price: 94599.00 Rs. Name: DELL INSPIRON 15 3576 - 15.6"HD - CI7 - 8THGEN - 4GB - 1TB HRD - AMD Radeon 520 with 2GB GDDR5. Availability: In Stock Price: 106399.00 Rs. Name: DELL INSPIRON 15 5570 - 15.6"HD - CI7 - 8THGEN - 8GB - 2TB HRD - AMD RADEON 530 2GB GDDR5. Availability: In Stock Price: 17000.00 Rs. Name: Dell Latitude E6420 14-inch Notebook 2.50 GHz Intel Core i5 4GB 320GB Laptop Availability: In Stock Price: 20999.00 Rs. Name: Dell Core i5 6410 8GB Ram Wi-Fi Windows 10 Installed ( Refurb ) Availability: In Stock Price: 18500.00 Rs. Name: Core i-5 Laptop Dell 4GB Ram 15.6 " Display Windows 10 DVD+Rw ( Refurb ) Availability: In Stock Price: 8500.00 Rs. Name: Laptop Dell D620 Core 2 Duo 80_2Gb (Used) ...

编辑:查看 2 个 json 结构的差异:

jsonObj_0 = json.loads(alpha[0].text) jsonObj_1 = json.loads(alpha[1].text) print(json.dumps(jsonObj_0, indent=4, sort_keys=True)) print(json.dumps(jsonObj_1, indent=4, sort_keys=True))
    
© www.soinside.com 2019 - 2024. All rights reserved.