Converting deeply nested JSON to a Pandas DataFrame

Question (votes: 0, answers: 2)

My data in JSON format:

[
    {
        "UNIT": "UNIT1",
        "PROJECTS": [
            {
                "PROJECT": "A",
                "PERIODS": [
                    {
                        "PERIOD": "2019",
                        "TEAMS": [
                            {
                                "TEAM": "Team A",
                                "MEMBERS": [
                                    {
                                        "NAME": "FANNY",
                                        "ID": 111
                                    },
                                    {
                                        "NAME": "TANG",
                                        "ID": 222
                                    }
                                ]
                            },
                            {
                                "TEAM": "Team B",
                                "MEMBERS": [
                                    {
                                        "NAME": "TIM",
                                        "ID": 444
                                    },
                                    {
                                        "NAME": "PAUL",
                                        "ID": 555
                                    }
                                ]
                            }
                        ]
                    }
                ]
            },
            {
                "PROJECT": "B",
                "PERIODS": [
                    {
                        "PERIOD": "2021",
                        "TEAMS": [
                            {
                                "TEAM": "Team A",
                                "MEMBERS": [
                                    {
                                        "NAME": "BENNY",
                                        "ID": 121
                                    },
                                    {
                                        "NAME": "JENNY",
                                        "ID": 122
                                    }
                                ]
                            },
                            {
                                "TEAM": "Team B",
                                "MEMBERS": [
                                    {
                                        "NAME": "CHRIS",
                                        "ID": 123
                                    },
                                    {
                                        "NAME": "TANG",
                                        "ID": 124
                                    }
                                ]
                            }
                        ]
                    }
                ]
            }
        ]
    }
]

Expected output DataFrame:

    UNIT PROJECT PERIOD   NAME   ID
0  UNIT1       A   2019  FANNY  111
1  UNIT1       A   2019   TANG  222
2  UNIT1       A   2019    TIM  444
3  UNIT1       A   2019   PAUL  555
4  UNIT1       B   2021  BENNY  121
5  UNIT1       B   2021  JENNY  122
6  UNIT1       B   2021  CHRIS  123
7  UNIT1       B   2021   TANG  124

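I want to store my data in the JSON format shown above. The data structure may grow large in the future, so I chose this nested layout. However, I am finding it difficult to convert it back into a DataFrame.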

The JSON above is deeply nested. I have already tried pd.json_normalize, but could not get the expected output.
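For context, a minimal sketch of what a plain pd.json_normalize call does with this kind of structure, assuming no record_path or meta arguments are passed. The reduced `data` sample below is an illustrative stand-in with the same nesting as the JSON above, not the full data set.

```python
import pandas as pd

# Reduced sample with the same nesting as the JSON above (one project, one team).
data = [{
    "UNIT": "UNIT1",
    "PROJECTS": [{
        "PROJECT": "A",
        "PERIODS": [{
            "PERIOD": "2019",
            "TEAMS": [{
                "TEAM": "Team A",
                "MEMBERS": [{"NAME": "FANNY", "ID": 111},
                            {"NAME": "TANG", "ID": 222}],
            }],
        }],
    }],
}]

# Without record_path, nested dicts are flattened but lists stay as cell values.
naive = pd.json_normalize(data)
print(naive.shape)                      # one row, two columns
print(list(naive.columns))              # UNIT plus an unflattened PROJECTS column
print(type(naive.loc[0, "PROJECTS"]))   # the whole project list sits in one cell
```

The single row with a list-valued PROJECTS column is why the call on its own does not reach the member-level rows.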

python json pandas multi-index
2 Answers
0 votes

You can use the json_normalize function from the Pandas library with a few parameters.

It should look something like this:

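import pandas as pd

# data is the parsed JSON (e.g. the result of json.load), not a file name
df = pd.json_normalize(
    data,
    # path through the nested lists down to the innermost records
    record_path=['PROJECTS', 'PERIODS', 'TEAMS', 'MEMBERS'],
    # fields to carry over from each parent level
    meta=[
        'UNIT',
        ['PROJECTS', 'PROJECT'],
        ['PROJECTS', 'PERIODS', 'PERIOD']
    ]
)

In the record_path parameter, give the path through the nested lists down to the records that should become rows (here the members, whose NAME and ID become columns). In the meta parameter, write the JSON path of each parent-level field you want repeated on every row of the DataFrame. Keys are case-sensitive, so they must match the uppercase keys in the JSON. The meta columns are named by their dotted path, for example PROJECTS.PROJECT.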
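A minimal sketch of this approach on a reduced sample, assuming the nested lists are walked with record_path and the parent fields are listed in meta using the uppercase keys from the question. The `data` sample and the final renaming and reordering step are illustrative additions, not part of the original answer.

```python
import pandas as pd

# Reduced sample with the same nesting as the question's JSON.
data = [{
    "UNIT": "UNIT1",
    "PROJECTS": [
        {"PROJECT": "A", "PERIODS": [{"PERIOD": "2019", "TEAMS": [
            {"TEAM": "Team A", "MEMBERS": [{"NAME": "FANNY", "ID": 111},
                                           {"NAME": "TANG", "ID": 222}]},
            {"TEAM": "Team B", "MEMBERS": [{"NAME": "TIM", "ID": 444}]},
        ]}]},
        {"PROJECT": "B", "PERIODS": [{"PERIOD": "2021", "TEAMS": [
            {"TEAM": "Team A", "MEMBERS": [{"NAME": "BENNY", "ID": 121}]},
        ]}]},
    ],
}]

df = pd.json_normalize(
    data,
    # one output row per element of the innermost MEMBERS lists
    record_path=["PROJECTS", "PERIODS", "TEAMS", "MEMBERS"],
    # parent fields repeated on every member row
    meta=["UNIT", ["PROJECTS", "PROJECT"], ["PROJECTS", "PERIODS", "PERIOD"]],
)

# Meta columns are named by their dotted path; keep only the last component.
df = df.rename(columns=lambda c: c.split(".")[-1])
df = df[["UNIT", "PROJECT", "PERIOD", "NAME", "ID"]]
print(df)
```

Adding ["PROJECTS", "PERIODS", "TEAMS", "TEAM"] to meta would carry the team name along as well, if it is needed.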


0 votes

You can do it like this. The advantage of this solution is that you never need to care about the paths inside the JSON.

Define the following function (it is meant to work for any nested JSON):

import json
import pandas as pd

def _columns_of_type(df, columns, kind):
    # names of the given columns whose values are all instances of `kind`
    return [c for c in columns if df[c].map(lambda v: isinstance(v, kind)).all()]

def flatten_nested_json_df(df):
    df = df.reset_index()
    list_columns = _columns_of_type(df, df.columns, list)
    dict_columns = _columns_of_type(df, df.columns, dict)

    while len(list_columns) > 0 or len(dict_columns) > 0:
        new_columns = []

        for col in dict_columns:
            # expand dicts horizontally into one prefixed column per key
            horiz_exploded = pd.json_normalize(df[col]).add_prefix(f'{col}.')
            horiz_exploded.index = df.index
            df = pd.concat([df, horiz_exploded], axis=1).drop(columns=[col])
            new_columns.extend(horiz_exploded.columns)

        for col in list_columns:
            # expand lists vertically into one row per element
            df = df.drop(columns=[col]).join(df[col].explode().to_frame())
            # restore unique row labels; joining on duplicated labels
            # in a later pass would multiply rows
            df = df.reset_index(drop=True)
            new_columns.append(col)

        # only the columns created in this pass can still hold lists or dicts
        list_columns = _columns_of_type(df, new_columns, list)
        dict_columns = _columns_of_type(df, new_columns, dict)
    return df

Then do the following:

with open(your_json_file) as f:  # your_json_file: path to the JSON file
    data = json.load(f)
df = pd.json_normalize(data)

outdf = flatten_nested_json_df(df)

This returns:

   index   UNIT PROJECTS.PROJECT PROJECTS.PERIODS.PERIOD  \
0      0  UNIT1                A                    2019
1      0  UNIT1                A                    2019
2      0  UNIT1                A                    2019
3      0  UNIT1                A                    2019
4      0  UNIT1                B                    2021
5      0  UNIT1                B                    2021
6      0  UNIT1                B                    2021
7      0  UNIT1                B                    2021

  PROJECTS.PERIODS.TEAMS.TEAM PROJECTS.PERIODS.TEAMS.MEMBERS.NAME  \
0                      Team A                               FANNY
1                      Team A                                TANG
2                      Team B                                 TIM
3                      Team B                                PAUL
4                      Team A                               BENNY
5                      Team A                               JENNY
6                      Team B                               CHRIS
7                      Team B                                TANG

   PROJECTS.PERIODS.TEAMS.MEMBERS.ID
0                                111
1                                222
2                                444
3                                555
4                                121
5                                122
6                                123
7                                124

[8 rows x 7 columns]
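The flattened frame carries an extra index column and dotted column names built from the nesting path. A small sketch of trimming it to the column layout asked for in the question, assuming the last component of each dotted name is unique. The `flat` frame below is a hypothetical two-row stand-in with the same column names, used only to keep the example self-contained.

```python
import pandas as pd

# Hypothetical stand-in for the flattened frame, with the same dotted column names.
flat = pd.DataFrame({
    "index": [0, 0],
    "UNIT": ["UNIT1", "UNIT1"],
    "PROJECTS.PROJECT": ["A", "A"],
    "PROJECTS.PERIODS.PERIOD": ["2019", "2019"],
    "PROJECTS.PERIODS.TEAMS.TEAM": ["Team A", "Team A"],
    "PROJECTS.PERIODS.TEAMS.MEMBERS.NAME": ["FANNY", "TANG"],
    "PROJECTS.PERIODS.TEAMS.MEMBERS.ID": [111, 222],
})

tidy = (
    flat.drop(columns=["index"])                          # helper column from reset_index()
        .rename(columns=lambda c: c.rsplit(".", 1)[-1])   # keep the last path component
)
# Select and order the columns as in the expected output.
tidy = tidy[["UNIT", "PROJECT", "PERIOD", "NAME", "ID"]]
print(tidy)
```

If two nesting levels shared a key name, the rename step would produce duplicate column names, so an explicit mapping would be the safer choice in that case.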