Weighted sum of multiple dataframes


I have several dataframes, each with an ID, a time of day, and a number. I want to weight each dataframe's number and then sum per ID/time of day. For example:

weighted 0.2
    ID   TOD         M 
0    10   morning    1  
1    13   afternoon  3  
2    32   evening    2
3    10   evening    2

weighted 0.4
    ID   TOD         W 
0    10   morning    1  
1    13   morning    3  
2    32   afternoon  2
3    10   evening    3

weighted sum:

    ID   TOD         weighted_sum_mw
0    10   morning    (0.2*1 + 0.4*1)
1    10   evening    (0.2*2 + 0.4*3)
2    13   morning    (0.4*3)
3    13   afternoon  (0.2*3)
4    32   evening    (0.2*2)
5    32   afternoon  (0.4*2)

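The following strategy works, but it is very memory-intensive, and I'm not sure whether there is a way to do this without merging the dataframes. In the end I also only need, for each ID, the row with the largest sum across times of day, so if that simplifies the process, that works too! (Ties on the maximum weighted sum are broken by keeping afternoon first, then evening, then morning.) I currently do this with 4 dataframes, but more may be added, and each dataframe has about 10M rows.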

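import pandas as pd

# dfs is the list of four dataframes, each with columns ID, TIME_OF_DAY, COUNTS
merged_oc = pd.merge(dfs[0], dfs[3], on=['ID', 'TIME_OF_DAY'], suffixes=('_O', '_C'), how='outer')
merged_s = pd.merge(dfs[1], dfs[2], on=['ID', 'TIME_OF_DAY'], suffixes=('_W', '_M'), how='outer')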

# merge and weighted sum of O and C
merged_oc['COUNTS_O_weighted_02']= merged_oc['COUNTS_O'].fillna(0).multiply(0.2)
merged_oc['COUNTS_C_weighted_04'] = merged_oc['COUNTS_C'].fillna(0).multiply(0.4)
merged_oc['COUNTS'] = merged_oc['COUNTS_O_weighted_02'] + merged_oc['COUNTS_C_weighted_04']
result_oc = merged_oc[['ID', 'TIME_OF_DAY', 'COUNTS', 'COUNTS_O_weighted_02', 'COUNTS_C_weighted_04']]

merged_s['COUNTS_W_weighted_04'] = merged_s['COUNTS_W'].fillna(0).multiply(0.4)
merged_s['COUNTS_M_weighted_04'] = merged_s['COUNTS_M'].fillna(0).multiply(0.4)
merged_s['COUNTS'] = merged_s['COUNTS_W_weighted_04'] + merged_s['COUNTS_M_weighted_04']
result_s = merged_s[['ID', 'TIME_OF_DAY', 'COUNTS', 'COUNTS_W_weighted_04', 'COUNTS_M_weighted_04']]

merged_final = pd.merge(result_oc, result_s, on=['ID', 'TIME_OF_DAY'], suffixes=('_OC', '_S'), how='outer')

merged_final['COUNTS_OC']= merged_final['COUNTS_OC'].fillna(0)
merged_final['COUNTS_S'] = merged_final['COUNTS_S'].fillna(0)
merged_final['WEIGHTED_SUM'] = merged_final['COUNTS_OC'] + merged_final['COUNTS_S']
merged_final = merged_final[['ID', 'TIME_OF_DAY', 'WEIGHTED_SUM', 'COUNTS_O_weighted_02', 'COUNTS_C_weighted_04', 'COUNTS_W_weighted_04', 'COUNTS_M_weighted_04']].fillna(0)
python-3.x pandas dataframe weighted
1 Answer

IIUC, you can try pd.concat on the dataframes after setting the index and multiplying each one by its weight, then use groupby and sum:

df_out = pd.concat([df1_2.set_index(['ID', 'TOD']).mul(.2), 
                    df2_4.set_index(['ID', 'TOD']).mul(.4)])\
           .sum(axis=1)\
           .groupby(level=[0,1])\
           .sum()\
           .reset_index()

df_out

Output:

   ID        TOD    0
0  10    evening  1.6
1  10    morning  0.6
2  13  afternoon  0.6
3  13    morning  1.2
4  32  afternoon  0.8
5  32    evening  0.4
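
The question also asks for any number of dataframes and for only the top row per ID, with ties going to afternoon, then evening, then morning. One way this could be sketched is below. The helper names weighted_sum and top_row_per_id are hypothetical, and the sketch assumes each dataframe has exactly one value column besides ID and TOD. Stacking the frames with a shared value column keeps the intermediate result long rather than wide, which may use less memory than chained outer merges, though that has not been measured here.

```python
import pandas as pd

# Example frames from the question; the value column is named differently in each.
df_m = pd.DataFrame({
    "ID": [10, 13, 32, 10],
    "TOD": ["morning", "afternoon", "evening", "evening"],
    "M": [1, 3, 2, 2],
})
df_w = pd.DataFrame({
    "ID": [10, 13, 32, 10],
    "TOD": ["morning", "morning", "afternoon", "evening"],
    "W": [1, 3, 2, 3],
})


def weighted_sum(frames_and_weights, keys=("ID", "TOD")):
    """Hypothetical helper: weight each frame, stack them long, then group and sum."""
    keys = list(keys)
    parts = []
    for frame, weight in frames_and_weights:
        # Assumption: exactly one value column besides the key columns.
        value_col = frame.columns.difference(keys)[0]
        part = frame[keys].copy()
        part["WEIGHTED_SUM"] = frame[value_col] * weight
        parts.append(part)
    stacked = pd.concat(parts, ignore_index=True)
    return stacked.groupby(keys, as_index=False)["WEIGHTED_SUM"].sum()


def top_row_per_id(summed, tod_priority=("afternoon", "evening", "morning")):
    """Hypothetical helper: keep the largest sum per ID, breaking ties by tod_priority."""
    rank = {tod: i for i, tod in enumerate(tod_priority)}
    ordered = summed.assign(_rank=summed["TOD"].map(rank)).sort_values(
        ["ID", "WEIGHTED_SUM", "_rank"], ascending=[True, False, True]
    )
    # After sorting, the first row of each ID is the winner.
    return ordered.drop_duplicates("ID").drop(columns="_rank").reset_index(drop=True)


summed = weighted_sum([(df_m, 0.2), (df_w, 0.4)])
best = top_row_per_id(summed)
print(best)
```

With the example data this keeps evening for ID 10, morning for ID 13, and afternoon for ID 32. Adding more dataframes only means appending more (frame, weight) pairs to the list.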