Merge dataframe columns based on a tagged substring of the column name, keeping the original column labels

Question (votes: 0, answers: 1)

I have a dataframe whose column labels follow the pattern name/start datetime/end datetime:

import pandas as pd
df = pd.DataFrame({
    "[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1],
    "[RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z": [2],
    "[RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z": [3],
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [4],
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T08:00:00Z": [5],
    "[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z": [6],
    "[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T08:00:00Z": [7],
    "[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z": [8],
    "[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T08:00:00Z": [9],
    "[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z": [10],
    "[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T08:00:00Z": [11],
    "[RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:00:00Z": [12],
    "[RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T09:00:00Z": [13],
    "[RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:00:00Z": [14],
    "[RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T09:00:00Z": [15],
})

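I want to merge the columns that share the same name and the same start date (ignoring the time) by summing their values. Each merged column should keep the label of the first original column in its group.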

This should give the following result:

pd.DataFrame({
        "[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1],
        "[RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z": [2],
        "[RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z": [3],
        "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [9],
        "[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z": [13],
        "[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z": [17],
        "[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z": [21],
        ...
    })

In my example each column holds a single value, but in the real data each column has multiple rows indexed by a datetime index.

python pandas dataframe sorting merge
1 Answer
0 votes

Here is the approach I would use. First, I build a new dataframe that lists all the column labels of the original dataframe df:

import pandas as pd
df_col = pd.DataFrame({'col': df.columns})

Then I extract the name and the start date (the first 10 characters of the start datetime) from each label:

df_col['name'] = df_col['col'].str.split('/').str[0]
df_col['date'] = df_col['col'].str.split('/').str[1].str[:10]
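
To make the grouping key concrete, here is a minimal sketch of the same extraction on a single label in plain Python. It assumes, as the pattern in the question suggests, that the name part never contains a "/" and that the start datetime begins with a 10-character YYYY-MM-DD date. The variable names are only illustrative.

```python
# One of the column labels from the question
label = "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T08:00:00Z"

# Split into name, start datetime and end datetime
name, start, end = label.split("/")

# The grouping key is the name plus the date part of the start datetime
key = (name, start[:10])
print(key)
```

Two labels that differ only in their end datetime, or only in the time of day of the start, produce the same key and therefore end up in the same group.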

Then I use groupby to find, for each (name, date) pair, the first column label and the list of all column labels in that group:

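# Use a list (not a set) of aggregations so the resulting column order is deterministic
grp = df_col.groupby(['name', 'date'])\
    .agg({'col': ['first', 'unique']})
# Flatten the ('col', 'first') / ('col', 'unique') MultiIndex to 'first' / 'unique'
grp.columns = [col[1] for col in grp.columns]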

Finally, I build the output:

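# Empty frame whose columns are the first label of each group
df_out = pd.DataFrame(columns=grp['first'])

# Sum each group's columns row-wise and store the result under the group's first label
for _, row in grp.iterrows():
    df_out[row['first']] = df[row['unique']].sum(axis=1)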

Printing df_out gives:

first  [RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z  \
0                                                      2

first  [RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z  \
0                                                      3

first  [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z  \
0                                                      9

first  [RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z  \
0                                                     13

first  [RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z  \
0                                                     17

first  [RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z  \
0                                                     21

first  [RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:00:00Z  \
0                                                     25

first  [RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T09:00:00Z-placeholder
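
As a possible variant (a sketch, not the method above), the same idea can be wrapped in a function that avoids the explicit loop and keeps the groups in their original column order. By default groupby sorts the group keys, which is why the presser column ends up last in the output above; passing sort=False keeps first-appearance order. The function name merge_by_name_and_day is hypothetical, the column labels are assumed to be unique, and the second row of values in the sample frame is invented purely to illustrate multi-row data with a datetime index.

```python
import pandas as pd

def merge_by_name_and_day(df: pd.DataFrame) -> pd.DataFrame:
    """Sum columns that share name and start date; keep the first label of each group."""
    labels = df.columns.to_series()
    parts = labels.str.split("/")
    name = parts.str[0]           # e.g. "[RATE] BOJ"
    day = parts.str[1].str[:10]   # e.g. "2024-04-26"
    # For every column, look up the first label of its (name, day) group
    first_label = labels.groupby([name, day], sort=False).transform("first")
    # Group the transposed frame by that label, sum, and transpose back
    return df.T.groupby(first_label.to_numpy(), sort=False).sum().T

# Sample frame: labels from the question, second row of values made up
df = pd.DataFrame(
    {
        "[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1, 10],
        "[RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z": [3, 30],
        "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [4, 40],
        "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T08:00:00Z": [5, 50],
    },
    index=pd.to_datetime(["2024-01-01", "2024-01-02"]),
)

merged = merge_by_name_and_day(df)
print(merged.T)  # transposed only to make the long labels easier to read
```

Transposing twice is convenient for a sketch like this, but it may change dtypes when the columns have mixed types, so the loop-based version may be preferable for heterogeneous data.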