Compare two JSON-formatted strings in pandas and assign labels based on matches


My dataframe has two columns, diff and diff2.

Examples of diff:

{'paths': {'modified': {'/v1/authorization/details/byDate': {'operations': {'modified': {'POST': {'requestBody': {'added': True}}}}}}}, 'endpoints': {'modified': {'{ method: POST, path: /v1/authorization/details/byDate }': {'requestBody': {'added': True}}}}}
{'info': {'version': {'from': '1.0.2', 'to': '1.0.3'}}, 'paths': {'modified': {'/equipment-status': {'operations': {'modified': {'GET': {'parameters': {'modified': {'query': {'pei': {'schema': {'pattern': {'from': '^(imei-[0-9]{15}|imeisv-[0-9]{16}|.+)$', 'to': '^(imei-[0-9]{15}|imeisv-[0-9]{16}|mac([0-9a-fA-F]{2})((-[0-9a-fA-F]{2}){5})|.+)$'}}}}}}}}}}}}, 'endpoints': {'modified': {'{ method: GET, path: /equipment-status }': {'parameters': {'modified': {'query': {'pei': {'schema': {'pattern': {'from': '^(imei-[0-9]{15}|imeisv-[0-9]{16}|.+)$', 'to': '^(imei-[0-9]{15}|imeisv-[0-9]{16}|mac([0-9a-fA-F]{2})((-[0-9a-fA-F]{2}){5})|.+)$'}}}}}}}}}, 'externalDocs': {'description': {'from': '3GPP TS 29.511 V15.4.0; 5G System; Equipment Identity Register Services; Stage 3', 'to': '3GPP TS 29.511 V16.0.0; 5G System; Equipment Identity Register Services; Stage 3'}}}
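
The title calls these JSON strings, but the samples above use Python literal syntax (single quotes, True), which json.loads would reject. If the column really holds strings rather than dict objects, one option is ast.literal_eval. The following is only a minimal sketch under that assumption; parse_diff is a hypothetical helper name that does not appear in the original post.

```python
import ast


def parse_diff(value):
    """Return a dict for a Python-literal string, or None if it cannot be parsed."""
    if isinstance(value, dict):
        return value  # already a dict, nothing to do
    if not isinstance(value, str):
        return None  # NaN, None, numbers, ...
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return None  # not a valid Python literal
    return parsed if isinstance(parsed, dict) else None


raw = (
    "{'paths': {'modified': {'/v1/authorization/details/byDate': "
    "{'operations': {'modified': {'POST': {'requestBody': {'added': True}}}}}}}}"
)
parsed = parse_diff(raw)
print(list(parsed))  # ['paths']
```

On a dataframe, this could be applied with df["diff"].apply(parse_diff) before any key extraction.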

Examples of diff2:

Backward compatibility errors (1):
error at specs/389643.json, in API POST /v1/authorization/details/byDate added required request body [added-required-request-body].
Backward compatibility errors (1):
warning at specs/419378.json, in API GET /equipment-status changed the pattern for the 'query' request parameter 'pei' from '^(imei-[0-9]{15}|imeisv-[0-9]{16}|.+)$' to '^(imei-[0-9]{15}|imeisv-[0-9]{16}|mac([0-9a-fA-F]{2})((-[0-9a-fA-F]{2}){5})|.+)$' [request-parameter-pattern-changed]. This is a warning because it is difficult to automatically analyze if the new pattern is a superset of the previous pattern(e.g. changed from '[0-9]+' to '[0-9]*')
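
Because the interesting part of each message follows the word API, the HTTP method and the path can be pulled out with a regular expression. This is a sketch that assumes every message has the form "in API <METHOD> <path> ..." as in the two samples above; extract_method_and_path is a hypothetical name introduced for illustration.

```python
import re

# "API", then an upper-case HTTP method, then the path (no spaces inside it).
API_PATTERN = re.compile(r"\bAPI\s+([A-Z]+)\s+(\S+)")


def extract_method_and_path(message):
    """Return (method, path) from a diff2 message, or None if there is no match."""
    if not isinstance(message, str):
        return None  # NaN or other non-string values
    match = API_PATTERN.search(message)
    return match.groups() if match else None


message = (
    "Backward compatibility errors (1): "
    "error at specs/389643.json, in API POST /v1/authorization/details/byDate "
    "added required request body [added-required-request-body]."
)
print(extract_method_and_path(message))
# ('POST', '/v1/authorization/details/byDate')
```

Both values also appear as keys in the corresponding diff dictionary, which is what makes a key-based comparison possible.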

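I want to check whether the keywords in diff2 (they always start right after the word API) match any of the keywords present in diff, and assign a label based on that. If all the keywords match and there is no unmatched set of words, I want to label the change Breaking. If some words match (those from diff2) and others do not (all the remaining ones from diff), I want the label to be Both.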
If diff2 is NaN, then the change is Non-Breaking.

So for the first example the change is Breaking, and for the second it is Both.

The expected output looks like this:

diff                                                            diff_2                                             Change
{'paths': {'modified': {'/v1/authorization/details/byDate'      ./ API POST /v1/authorization/details/byDate      Breaking    

Any suggestions or ideas on how to do this would be greatly appreciated.

python pandas string-comparison
1 answer

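I am not entirely sure what you are trying to do, because your example is not fully reproducible, so here is my own, in which:

  • the first row is the "Breaking" case (all keywords match)
  • the second row illustrates "Both" (some keywords match)
  • the third row is the "Non-Breaking" case (zero matches)
  • the last three rows have an empty diff2 and a missing or non-dict diff, so they end up with an empty label
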
import pandas as pd

df = pd.DataFrame(
    {
        "diff": [
            {
                "paths": {
                    "modified": {
                        "/v1/authorization/details/byDate": {
                            "operations": {
                                "modified": {"POST": {"requestBody": {"added": True}}}
                            }
                        }
                    }
                },
            },
        ]
        * 3
        + [pd.NA, 2, "aaa"],
        "diff2": [
            "Backward compatibility errors (1): error at specs/389643.json, in API POST /v1/authorization/details/byDate added",
            "Backward compatibility errors (1): error at specs/390643.json, in API GET /v1/authorization/details/byDate added",
            "Backward compatibility errors (1): error at specs/391643.json, in API PUSH /v2/authorization/details/byDate removed",
            "",
            "",
            "",
        ],
    }
)

First, define a recursive helper function to get all the keys from a nested dictionary:

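def get_keys_from_dict(d, keys=None):
    # Start a fresh accumulator on the first call; recursive calls share it.
    keys = [] if keys is None else keys
    # Anything that is not a dict (NA, numbers, strings) has no keys.
    if not isinstance(d, dict):
        return None
    for k, v in d.items():
        keys.append(k)
        if isinstance(v, dict):
            get_keys_from_dict(v, keys)
        elif isinstance(v, list):
            # Lists may hold further dicts; non-dict items are skipped.
            for i in v:
                get_keys_from_dict(i, keys)
    return keys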
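
As an alternative sketch of the same recursive idea, the keys can be produced lazily with a generator. iter_keys is a hypothetical name, not part of the answer. Unlike the helper above, it yields nothing (rather than returning None) for non-dict input, and it also descends into lists nested inside lists.

```python
def iter_keys(obj):
    """Yield every key found in nested dicts, descending into lists as well."""
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield key
            yield from iter_keys(value)
    elif isinstance(obj, list):
        for item in obj:
            yield from iter_keys(item)
    # Any other type (str, int, NaN, ...) contributes no keys.


diff = {
    "paths": {
        "modified": {
            "/v1/authorization/details/byDate": {
                "operations": {
                    "modified": {"POST": {"requestBody": {"added": True}}}
                }
            }
        }
    }
}
keys = list(iter_keys(diff))
print(keys)
# ['paths', 'modified', '/v1/authorization/details/byDate', 'operations',
#  'modified', 'POST', 'requestBody', 'added']
```

Keys are returned in traversal order and may repeat ("modified" appears twice here), which does not matter for membership tests.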

Define another helper function, using str.split, to get all the keywords that follow the word "API" in a string:

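def get_keywords_from_string(string):
    # Missing values (NaN, None) and strings without "API" yield no keywords.
    if not isinstance(string, str) or "API" not in string:
        return []
    # Keep the text after the first "API" and drop empty tokens.
    return [item for item in string.split("API", 1)[1].split(" ") if item]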
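
To see what such a helper returns on one of the full messages from the question, here is a small variant built on str.partition. tokens_after_api is a hypothetical name used only in this sketch. It assumes the first occurrence of "API" is the marker and that tokens are separated by whitespace.

```python
def tokens_after_api(message):
    """Return the whitespace-separated tokens that follow the first 'API'."""
    if not isinstance(message, str):
        return []  # NaN or other non-string values
    _, found, rest = message.partition("API")
    # str.split() without arguments splits on any whitespace and drops empties.
    return rest.split() if found else []


message = (
    "error at specs/389643.json, in API POST /v1/authorization/details/byDate "
    "added required request body [added-required-request-body]."
)
tokens = tokens_after_api(message)
print(tokens)
# ['POST', '/v1/authorization/details/byDate', 'added', 'required',
#  'request', 'body', '[added-required-request-body].']
```

Note that descriptive words such as "required" and the bracketed rule name come back as keywords too. Whether those should take part in the comparison is something the question leaves open.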

A third helper compares the two keyword lists using Python's built-in functions all and any:

def compare(keywords, other_keywords):
    if not keywords or not other_keywords:
        return ""
    results = [item in keywords for item in other_keywords]
    if all(results):
        return "Breaking"
    if any(results):
        return "Both"
    return "Non-Breaking"
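
The same decision rule can also be written with set operations, which some readers find easier to follow. This is a sketch rather than the answer's own code; compare_sets is a hypothetical name, and the sample keys are those of the example dictionary used in this answer.

```python
def compare_sets(keys, keywords):
    """Label a row from the dict keys (diff) and the message keywords (diff2)."""
    keys, keywords = set(keys or []), set(keywords or [])
    if not keys or not keywords:
        return ""  # nothing to compare, same result as the list-based helper
    matched = keywords & keys
    if matched == keywords:
        return "Breaking"  # every keyword is a key of the diff
    if matched:
        return "Both"  # some keywords match, some do not
    return "Non-Breaking"  # no keyword matches


keys = [
    "paths", "modified", "/v1/authorization/details/byDate",
    "operations", "modified", "POST", "requestBody", "added",
]
print(compare_sets(keys, ["POST", "/v1/authorization/details/byDate", "added"]))
# Breaking
print(compare_sets(keys, ["GET", "/v1/authorization/details/byDate", "added"]))
# Both
print(compare_sets(keys, ["PUSH", "/v2/authorization/details/byDate", "removed"]))
# Non-Breaking
```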

Finally, combine these functions and apply them to the dataframe:

df["Change"] = df.apply(
    lambda x: compare(
        get_keys_from_dict(x["diff"], []),
        get_keywords_from_string(x["diff2"]),
    ),
    axis=1,
)

Then:

print(df)
# Output

                                                diff  ...        Change
0  {'paths': {'modified': {'/v1/authorization/det...  ...      Breaking
1  {'paths': {'modified': {'/v1/authorization/det...  ...          Both
2  {'paths': {'modified': {'/v1/authorization/det...  ...  Non-Breaking
3                                               <NA>  ...
4                                                  2  ...
5                                                aaa  ...
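
In the output above, rows with an empty diff2 get an empty label, whereas the question asks for Non-Breaking when diff2 is NaN. One possible adjustment is sketched below. is_missing and finalize_label are hypothetical names, and treating an empty string like NaN is an assumption, not something stated in the question.

```python
import math


def is_missing(value):
    """True for None, float NaN, or an empty / whitespace-only string."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    return isinstance(value, str) and not value.strip()


def finalize_label(label, diff2):
    """Override the computed label with 'Non-Breaking' when diff2 is missing."""
    return "Non-Breaking" if is_missing(diff2) else label


print(finalize_label("", float("nan")))                     # Non-Breaking
print(finalize_label("", ""))                               # Non-Breaking
print(finalize_label("Breaking", "in API POST /x added"))   # Breaking
```

With the dataframe above, this could be applied as df["Change"] = [finalize_label(c, d) for c, d in zip(df["Change"], df["diff2"])]. If the column may contain pd.NA, pd.isna could replace the float check.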