Extracting from a script - Beautiful Soup

Problem description (1 vote, 4 answers)

How do I extract the value of "tier1Category" from the source of this page? https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product

soup.find('script') 

returns only a subset of the source, and the following returns a different object from that code:

json.loads(soup.find("script", type="application/ld+json").text)
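
One way to narrow this down is to scan every script tag for the key and note which ones contain it; a minimal sketch:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = BeautifulSoup(r.text, 'html.parser')

# Print the index of every <script> tag whose text mentions the key.
for i, script in enumerate(soup.find_all('script')):
    if 'tier1Category' in script.text:
        print(i)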
python web-scraping beautifulsoup
4 answers
2 votes

Bitto and I have similar approaches, but I prefer not to rely on knowing which script tag contains the matching pattern, nor on the structure of that script.

import json
import requests
from collections import abc
from bs4 import BeautifulSoup as bs

def nested_dict_iter(nested):
    # Walk a nested mapping and yield every (key, value) leaf pair.
    for key, value in nested.items():
        if isinstance(value, abc.Mapping):
            yield from nested_dict_iter(value)
        else:
            yield key, value

r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
for script in soup.find_all('script'):
    if 'tier1Category' in script.text:
        # Strip the JavaScript assignment wrapper: keep the text from the
        # first '{' to the last ';', which is the JSON payload.
        j = json.loads(script.text[script.text.index('{'):script.text.rindex(';')])
        for k, v in nested_dict_iter(j):
            if k == 'tier1Category':
                print(v)
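
As a quick illustration of what nested_dict_iter yields, here is a toy run on a made-up dict shaped like the product JSON (the sample data below is invented for the example):

sample = {'product': {'results': {'productInfo': {'tier1Category': 'Medicines & Treatments'}}}}
for key, value in nested_dict_iter(sample):
    print(key, value)   # prints: tier1Category Medicines & Treatments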

2 votes

Here are the steps I used to get the output:

  • Use find_all and take the 10th script tag. This script tag contains the tier1Category value.
  • Take the script text from the first occurrence of { to the last occurrence of ;. This gives us proper JSON text (see the toy example after this list).
  • Load the text with json.loads.
  • Work out the structure of the JSON to see how to reach the tier1Category value.
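
Step 2 works because the page embeds the JSON inside a JavaScript assignment; slicing from the first { to the last ; strips the surrounding JavaScript. A toy illustration (the variable name here is made up; the real one on the page may differ):

script_text = 'window.__PRODUCT_STATE__ = {"product": {"name": "example"}};'
start = script_text.index('{')
end = script_text.rindex(';')
print(script_text[start:end])   # {"product": {"name": "example"}}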

Code:

import json
import requests
from bs4 import BeautifulSoup
r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = BeautifulSoup(r.text, 'html.parser')
# The 10th <script> tag (index 9) holds the product state JSON.
script_text = soup.find_all('script')[9].text
# Slice from the first '{' to the last ';' to get valid JSON.
start = script_text.index('{')
end = script_text.rindex(';')
proper_json_text = script_text[start:end]
our_json = json.loads(proper_json_text)
print(our_json['product']['results']['productInfo']['tier1Category'])

Output:

Medicines & Treatments

0 votes

I think you can use the id instead. I am assuming tier 1 is the level after Shop in the navigation tree; otherwise I don't see that value in that script tag. I do see it in a plain script tag (one without type="application/ld+json"), but there are a lot of regex matches for tier 1 there.

import requests
from bs4 import BeautifulSoup as bs

r = requests.get('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
soup = bs(r.content, 'lxml')
# Breadcrumb link assumed to hold the tier 1 category name (the entry after Shop).
data = soup.select_one("#bdCrumbDesktopUrls_0").text
print(data)
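
If the other breadcrumb tiers are exposed with the same id pattern (an assumption based on the _0 suffix, not verified against the page), they could all be collected with an attribute-prefix selector, reusing the soup object above:

# Assumption: the breadcrumb links are numbered bdCrumbDesktopUrls_0, _1, _2, ...
for crumb in soup.select('[id^="bdCrumbDesktopUrls_"]'):
    print(crumb.text)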

0 votes

I'm not sure exactly which data element from the <script> tag you need, but this does find tier1Category and extracts these three elements:

  • "tier1Category":"Medicines & Treatments",
  • "tier1CategoryId":"359438",
  • "tier1url":"/store/c/medicines-and-treatments/ID=359438-tier1"

import re
from urllib import request
from bs4 import BeautifulSoup

crawlRequest = request.urlopen('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')
raw_html = crawlRequest
soup = BeautifulSoup(raw_html, 'lxml')

for i, tag in enumerate(soup.findAll('script')):
    # There is a JSON, which could be parsed
    if 'tier1Category' in tag.text:
        tier_1_pattern = re.compile('(("tier1Category":"Medicines & Treatments".*)("tier1CategoryId".*)("tier1url":.*-tier1))', re.IGNORECASE | re.MULTILINE)
        extract_tier_1 = re.search(tier_1_pattern, tag.text)
        if extract_tier_1:
            print(extract_tier_1.group(2))  # outputs "tier1Category":"Medicines & Treatments",
            print(extract_tier_1.group(3))  # outputs "tier1CategoryId":"359438",
            print(extract_tier_1.group(4))  # outputs "tier1url":"/store/c/medicines-and-treatments/ID=359438-tier1"

As I mentioned in my previous post, the script section in question contains a JSON object, so this version focuses on extracting the elements listed above from that JSON. I'm curious about the difference between the tier1CategoryId and the prodID in the URL.

    from urllib import request
    from bs4 import BeautifulSoup
    import json

    crawlRequest = request.urlopen('https://www.walgreens.com/store/c/walgreens-wal-zyr-24-hour-allergy-tablets/ID=prod6205762-product')

    raw_html = crawlRequest
    soup = BeautifulSoup(raw_html, 'lxml')

    for i, tag in enumerate(soup.findAll('script')):
        if 'tier1Category' in tag.text:
            # Cut the JSON object out of the JavaScript assignment: first '{' to last ';'.
            json_data = json.loads(tag.text[tag.text.index('{'):tag.text.rindex(';')])
            category_type = json_data['product']['results']['productInfo']['tier1Category']
            category_id = json_data['product']['results']['productInfo']['tier1CategoryId']
            category_url = json_data['product']['results']['productInfo']['tier1url']
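
A minimal way to inspect what the loop extracted is to print the three variables afterwards; the values shown in the comments are the ones listed in the regex version above:

    print(category_type)   # Medicines & Treatments
    print(category_id)     # 359438
    print(category_url)    # /store/c/medicines-and-treatments/ID=359438-tier1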