我有几个带有 ID、一天中的时间和数字的数据框。我想对每个数据帧编号进行加权,然后对每个 id/时间进行求和。举个例子:
weighted 0.2
ID TOD M
0 10 morning 1
1 13 afternoon 3
2 32 evening 2
3 10 evening 2
weighted 0.4
ID TOD W
0 10 morning 1
1 13 morning 3
2 32 afternoon 2
3 10 evening 3
weighted sum:
ID TOD weighed_sum_mw
0 10 morning (0.2*1 + 0.4*1)
1 10 evening (0.2*2 + 0.4*3)
2 13 morning (0.4*3)
3 13 afternoon (0.4*2)
3 32 evening (0.2*2)
4 32 afternoon (0.4*2)
以下策略有效,但非常消耗内存,我不确定是否有办法在不合并它们的情况下做到这一点。我最终还只需要一天中每个 ID 总和最大的行,因此如果这可以简化流程,那么也可以! (相等最大加权和的决胜局将首先保留下午,然后是晚上,然后是早上)。我目前使用 4 个数据帧执行此操作,但可能会添加更多数据帧,每个数据帧大约有 10M 行
merged_oc= pd.merge(dfs[0], dfs[3], on=['ID', 'TIME_OF_DAY'], suffixes=('_O', '_C'), how='outer')
merged_s = pd.merge(dfs[1], dfs[2], on=['ID', 'TIME_OF_DAY'], suffixes=('_W', 'M'), how='outer')
# merge and weighted sum of O and C
merged_oc['COUNTS_O_weighted_02']= merged_oc['COUNTS_O'].fillna(0).multiply(0.2)
merged_oc['COUNTS_C_weighted_04'] = merged_oc['COUNTS_C'].fillna(0).multiply(0.4)
merged_oc['COUNTS'] = merged_oc['COUNTS_O_weighted_02'] + merged_oc['COUNTS_C_weighted_04']
result_oc = merged_oc[['ID', 'TIME_OF_DAY', 'COUNTS', 'COUNTS_O_weighted_02', 'COUNTS_C_weighted_04']]
merged_s['COUNTS_W_weighted_04'] = merged_s['COUNTS_W'].fillna(0).multiply(0.4)
merged_s['COUNTS_M_weighted_04'] = merged_s['COUNTS_M'].fillna(0).multiply(0.4)
merged_s['COUNTS'] = merged_s['COUNTS_W_weighted_04'] + merged_s['COUNTS_M_weighted_04']
result_s = merged_s[['ID', 'TIME_OF_DAY', 'COUNTS', 'COUNTS_W_weighted_04', 'COUNTS_M_weighted_04']]
merged_final = pd.merge(result_oc, result_s, on=['ID', 'TIME_OF_DAY'], suffixes=('_OC', '_S'), how='outer')
merged_final['COUNTS_OC']= merged_final['COUNTS_OC'].fillna(0)
merged_final['COUNTS_S'] = merged_final['COUNTS_S'].fillna(0)
merged_final['WEIGHTED_SUM'] = merged_final['COUNTS_OC'] + merged_final['COUNTS_SESSION']
merged_final = merged_final[['ID', 'TIME_OF_DAY', 'WEIGHTED_SUM', 'COUNTS_O_weighted_02', 'COUNTS_C_weighted_04', 'COUNTS_W_weighted_04', 'COUNTS_M_weighted_04']].fillna(0)
IIUC,您可以在设置索引并乘以每个数据帧的权重后尝试
pd.concat
数据帧,然后使用groupby
和sum
:
df_out = pd.concat([df1_2.set_index(['ID', 'TOD']).mul(.2),
df2_4.set_index(['ID', 'TOD']).mul(.4)])\
.sum(axis=1)\
.groupby(level=[0,1])\
.sum()\
.reset_index()
df_out
输出:
ID TOD 0
0 10 evening 1.6
1 10 morning 0.6
2 13 afternoon 0.6
3 13 morning 1.2
4 32 afternoon 0.8
5 32 evening 0.4