I have a dataframe whose column labels follow the pattern name/start datetime/end datetime:
import pandas as pd
df = pd.DataFrame({
"[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1],
"[RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z": [2],
"[RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z": [3],
"[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [4],
"[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T08:00:00Z": [5],
"[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z": [6],
"[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T08:00:00Z": [7],
"[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z": [8],
"[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T08:00:00Z": [9],
"[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z": [10],
"[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T08:00:00Z": [11],
"[RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:00:00Z": [12],
"[RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T09:00:00Z": [13],
"[RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:00:00Z": [14],
"[RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T09:00:00Z": [15],
})
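I want to merge the columns that share the same name and start date (ignoring the time) by summing their values. The merged column should keep the label of the original column (the first one encountered).
This should give the following result: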
pd.DataFrame({
"[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1],
"[RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z": [2],
"[RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z": [3],
"[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [9],
"[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z": [13],
"[RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z": [17],
"[RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z": [21],
...
})
In my example each column holds a single row of data, but in reality each column has multiple rows on a datetime index.
Here is the approach I would use. First, I define a new dataframe that holds all the column labels of the original data df.
import pandas as pd
df_col = pd.DataFrame({'col': df.columns})
Then I extract the name and the date:
df_col['name'] = df_col['col'].str.split('/').str[0]
df_col['date'] = df_col['col'].str.split('/').str[1].str[:10]
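To make the extraction step concrete, here is a minimal sketch on three of the labels from the question. The variable names `labels`, `parts`, `name` and `date` are only for illustration. The grouping key is the name plus the first ten characters of the start timestamp, which is its `YYYY-MM-DD` part.

```python
import pandas as pd

# Three labels taken from the question; `labels` is an illustrative name.
labels = pd.Series([
    "[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z",
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z",
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T08:00:00Z",
])

# Split once on '/' and reuse the pieces for both fields.
parts = labels.str.split('/')
name = parts.str[0]           # text before the first '/'
date = parts.str[1].str[:10]  # YYYY-MM-DD of the start timestamp, time dropped

print(name.tolist())
print(date.tolist())
```

The last two labels end up with the same (name, date) pair even though their end times differ, which is what makes them candidates for merging.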
Then I use groupby to find, for each (name, date) pair, the first column and all the columns that share that name and date:
grp = df_col.groupby(['name', 'date'])\
    .agg({'col': ['first', 'unique']})  # a list keeps the column order deterministic
grp.columns = [col[1] for col in grp.columns]
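As a rough illustration of what the grouped table contains, the sketch below runs the same steps on three labels from the question and prints one line per group. The list `cols` is an illustrative name, and the aggregation functions are passed as a list here.

```python
import pandas as pd

# Illustrative subset of the question's labels.
cols = [
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z",
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T08:00:00Z",
    "[RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z",
]
df_col = pd.DataFrame({'col': cols})
df_col['name'] = df_col['col'].str.split('/').str[0]
df_col['date'] = df_col['col'].str.split('/').str[1].str[:10]

# 'first' gives the label to keep, 'unique' gives every label in the group.
grp = df_col.groupby(['name', 'date']).agg({'col': ['first', 'unique']})
grp.columns = [col[1] for col in grp.columns]  # ('col', 'first') -> 'first'

for (name, date), row in grp.iterrows():
    print(name, date, row['first'], list(row['unique']))
```

Each row of `grp` therefore pairs the label that survives with the list of columns that will be summed into it.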
Finally, I build the output:
df_out = pd.DataFrame(columns=grp['first'])
for i, row in grp.iterrows():
    # sum all columns of the group row-wise and store them under the first label
    df_out[row['first']] = df[row['unique']].sum(axis=1)
first  [RATE] BOJ/2024-01-23T04:00:00Z/2024-01-23T07:00:00Z  \
0                                                         2
first  [RATE] BOJ/2024-03-19T04:00:00Z/2024-03-19T07:00:00Z  \
0                                                         3
first  [RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z  \
0                                                         9
first  [RATE] BOJ/2024-06-14T03:00:00Z/2024-06-14T06:00:00Z  \
0                                                        13
first  [RATE] BOJ/2024-07-31T03:00:00Z/2024-07-31T06:00:00Z  \
0                                                        17
first  [RATE] BOJ/2024-09-20T03:00:00Z/2024-09-20T06:00:00Z  \
0                                                        21
first  [RATE] BOJ/2024-10-31T04:00:00Z/2024-10-31T07:00:00Z  \
0                                                        25
first  [RATE] BOJ/2024-12-19T04:00:00Z/2024-12-19T07:00:00Z  \
0                                                        29
first  [RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z
0                                                                 1
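Since the real data has several rows on a datetime index, it may help to see the same idea wrapped in one function and run on a two-row frame. This is only a sketch under some assumptions: the function name `merge_same_day_columns` is hypothetical, the column labels are assumed to be unique, and `sort=False` is used so that the merged columns keep the order of first appearance, which differs from the sorted order of the output above.

```python
import pandas as pd

def merge_same_day_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Sum columns sharing a name and start date; keep the first label."""
    cols = pd.Series(df.columns)
    parts = cols.str.split('/')
    keys = pd.DataFrame({
        'col': cols,
        'name': parts.str[0],
        'date': parts.str[1].str[:10],  # start date without the time
    })
    merged = {}
    # sort=False iterates the groups in order of first appearance.
    for _, group in keys.groupby(['name', 'date'], sort=False):
        members = group['col'].tolist()
        merged[members[0]] = df[members].sum(axis=1)
    return pd.DataFrame(merged, index=df.index)

# Two-row example on a datetime index (values are made up for illustration).
idx = pd.to_datetime(["2024-04-26T03:00:00Z", "2024-04-26T04:00:00Z"])
df = pd.DataFrame({
    "[RATE] BOJ presser/2024-03-19T07:30:00Z/2024-03-19T10:30:00Z": [1, 10],
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T06:00:00Z": [4, 40],
    "[RATE] BOJ/2024-04-26T03:00:00Z/2024-04-26T08:00:00Z": [5, 50],
}, index=idx)

out = merge_same_day_columns(df)
print(out)
```

The sums are taken row by row, so the datetime index of the input is carried over unchanged to the result.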