Handling interpolation in web scraping (Beautiful Soup)

Question

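I'm doing some web scraping with Python and Beautiful Soup.

I've run into a problem: the results contain the raw JavaScript template interpolation rather than the rendered value.

So instead of

<span>2.4%</span>

which I can see in the Chrome inspector, I get:

<span> {{ item.rate }} </span>

in my results from Beautiful Soup.

a) Am I doing something wrong? (Similar code works on other sites, so I don't think so, but I could be mistaken.)

or

b) Is there a way to work around this?

My code: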

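import requests
from bs4 import BeautifulSoup

url = "https://example.com"
# requests returns the HTML as served; it does not execute JavaScript
response = requests.get(url)
soup = BeautifulSoup(response.text, "html.parser")
divs = soup.find_all("ul", {"class": "result-table--grid"})
print(divs[0])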

Thanks!

Answer 1

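The values are filled in by JavaScript after the page loads, so requests only ever sees the unrendered template. The data is available as a JSON response, which you can request directly and then pass through json_normalize. If you do that, you'll see that some columns still contain nested lists/dictionaries. So I'll also provide a second solution that flattens those as well, although it makes the table much wider.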

Code 1:

import requests
import pandas as pd

# Page shown in the browser; the data itself comes from the JSON endpoint below
url = "https://www.moneysupermarket.com/mortgages/results/#?goal=1&property=170000&borrow=150000&types=1&types=2&types=3&types=4&types=5"

request_url = 'https://www.moneysupermarket.com/bin/services/aggregation'

headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}

payload = {
    'channelId': '55',
    'enquiryId': '2e619c17-061a-4812-adad-40a9f9d8dcbc',
    'limit': '20',
    'offset': '0',
    'sort': 'initialMonthlyPayment',
}

jsonObj = requests.get(request_url, headers=headers, params=payload).json()

# pd.json_normalize (pandas 1.0+) replaces pandas.io.json.json_normalize,
# and pd.concat replaces DataFrame.append, which was removed in pandas 2.0
frames = [pd.json_normalize(each['quote']) for each in jsonObj['results']]
results = pd.concat(frames, ignore_index=True)

Output 1:

print (results)
                                               @class                        ...                                                         trackerDescription
0   com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
1   com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
2   com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
3   com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
4   com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
5   com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
6   com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
7   com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
8   com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
9   com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
10  com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
11  com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
12  com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
13  com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
14  com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
15  com.moneysupermarket.mortgages.entity.Mortgage...                        ...                          after 26 Months,BBR + 3.99% for the remaining ...
16  com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
17  com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
18  com.moneysupermarket.mortgages.entity.Mortgage...                        ...                                                                           
19  com.moneysupermarket.mortgages.entity.Mortgage...                        ...                          after 26 Months,BBR + 3.99% for the remaining ...

[20 rows x 51 columns]
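
To see why the first solution still leaves lists inside some columns, here is a minimal sketch on a made-up record. The field names `id`, `rate`, and `fees` are hypothetical and are not taken from the real response. `pd.json_normalize` expands nested dictionaries into dotted column names, but a list value is kept as a single cell:

```python
import pandas as pd

# Hypothetical record, shaped loosely like one nested 'quote' entry
sample = [{
    "id": 1,
    "rate": {"initial": 2.4},
    "fees": [{"name": "arrangement", "amount": 999}],
}]

df = pd.json_normalize(sample)

# The nested dict becomes a 'rate.initial' column;
# the list stays as one Python object in the 'fees' column
print(sorted(df.columns))
print(type(df.loc[0, "fees"]))
```

Columns like the hypothetical `fees` here are the ones the second solution spreads out into separate columns.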

Code 2:

import requests
import pandas as pd

# Page shown in the browser; the data itself comes from the JSON endpoint below
url = "https://www.moneysupermarket.com/mortgages/results/#?goal=1&property=170000&borrow=150000&types=1&types=2&types=3&types=4&types=5"

request_url = 'https://www.moneysupermarket.com/bin/services/aggregation'
headers = {'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
payload = {
    'channelId': '55',
    'enquiryId': '2e619c17-061a-4812-adad-40a9f9d8dcbc',
    'limit': '20',
    'offset': '0',
    'sort': 'initialMonthlyPayment',
}

data = requests.get(request_url, headers=headers, params=payload).json()

def flatten_json(y):
    """Flatten nested dicts/lists into a single dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=''):
        if isinstance(x, dict):
            for a in x:
                flatten(x[a], name + a + '_')
        elif isinstance(x, list):
            # List items are keyed by their position
            for i, a in enumerate(x):
                flatten(a, name + str(i) + '_')
        else:
            # Leaf value: drop the trailing '_' from the accumulated key
            out[name[:-1]] = x

    flatten(y)
    return out


# Flatten every result and build the DataFrame in one step
# (DataFrame.append was removed in pandas 2.0)
results = pd.DataFrame([flatten_json(each) for each in data['results']])

Output 2:

print (results)
    apply_active  apply_desktop   ...    straplineLinkLabel  topTip
0           True           True   ...                  None    None
1           True           True   ...                  None    None
2           True           True   ...                  None    None
3           True           True   ...                  None    None
4           True           True   ...                  None    None
5           True           True   ...                  None    None
6           True           True   ...                  None    None
7           True           True   ...                  None    None
8           True           True   ...                  None    None
9           True           True   ...                  None    None
10          True           True   ...                  None    None
11          True           True   ...                  None    None
12          True           True   ...                  None    None
13          True           True   ...                  None    None
14          True           True   ...                  None    None
15          True           True   ...                  None    None
16          True           True   ...                  None    None
17          True           True   ...                  None    None
18          True           True   ...                  None    None
19          True           True   ...                  None    None

[20 rows x 131 columns]
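
The column count grows because every nested key and every list index becomes part of a column name. A minimal stand-alone sketch of the same flattening idea, run on a made-up record (the field names are hypothetical, not the real response), shows the naming scheme:

```python
def flatten_json(y):
    """Flatten nested dicts/lists into a single dict with '_'-joined keys."""
    out = {}

    def flatten(x, name=""):
        if isinstance(x, dict):
            for key in x:
                flatten(x[key], name + key + "_")
        elif isinstance(x, list):
            for i, item in enumerate(x):
                flatten(item, name + str(i) + "_")
        else:
            # Leaf value: drop the trailing '_' from the accumulated key
            out[name[:-1]] = x

    flatten(y)
    return out


# Hypothetical nested record with a list of dicts inside it
sample = {
    "apply": {"active": True},
    "quote": {"fees": [{"amount": 999}, {"amount": 0}]},
}

flat = flatten_json(sample)
print(flat)
```

Each list element gets its own set of columns, so a record with long lists produces many columns even if it has few distinct fields.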
