I ran into a problem that surprised me and that I had never encountered before. Starting from a DataFrame:
import pandas as pd
import numpy as np

df = pd.DataFrame({'mrn': [263, 263, 263, 273, 273, 273],
                   'study_id': ['st_1', 'st_1', 'st_1', 'st_2', 'st_2', 'st_2'],
                   'data_categ': ['pre_event', 'during_event', 'post_event', 'pre_event', 'during_event', 'post_event'],
                   'data_val': [12, 15, 9, 2, 9, 0]},
                  columns=['mrn', 'study_id', 'data_categ', 'data_val'])
As expected, I can use
df[df.mrn==263]
to select specific rows, and I get the following output:
   mrn study_id    data_categ  data_val
0  263     st_1     pre_event        12
1  263     st_1  during_event        15
2  263     st_1    post_event         9
Now I want to group the data. I do the following:
df2 = df.groupby(["mrn","study_id","data_categ"]).agg({sum,np.median})
df2.columns = [df2.columns.get_level_values(1)+"_zz"]
df2=df2.reset_index()
This gives me the output I want:
>>> df2
   mrn study_id    data_categ  sum_zz  median_zz
0  263     st_1  during_event      15       15.0
1  263     st_1    post_event       9        9.0
2  263     st_1     pre_event      12       12.0
3  273     st_2  during_event       9        9.0
4  273     st_2    post_event       0        0.0
5  273     st_2     pre_event       2        2.0
However, when I now try to select specific rows, I don't get what I want:
>>> df2[df2.mrn==263]
     mrn study_id data_categ  sum_zz  median_zz
0  263.0      NaN        NaN     NaN        NaN
1  263.0      NaN        NaN     NaN        NaN
2  263.0      NaN        NaN     NaN        NaN
3    NaN      NaN        NaN     NaN        NaN
4    NaN      NaN        NaN     NaN        NaN
5    NaN      NaN        NaN     NaN        NaN
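Why does this happen? Why doesn't it return only the rows that satisfy the condition (there should be 3 of them), together with the values in the remaining columns? Thanks!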
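By using square brackets in
df2.columns = [df2.columns.get_level_values(1)+"_zz"]
you created a MultiIndex with a single level. That is what triggers your problem: df2.mrn is then a DataFrame rather than a Series, and only a Series can be used to select rows of a whole DataFrame by boolean indexing.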
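To see the effect in isolation, here is a minimal sketch on a tiny made-up frame (the names `wrapped`, `plain`, `mask_frame` and `mask_series` are mine, not from the question). It assumes that a boolean DataFrame used as a key is applied element-wise, like `DataFrame.where`, which matches the NaN-filled output in the question.

```python
import pandas as pd

# Small illustrative frame (not the asker's data)
df = pd.DataFrame({"mrn": [263, 263, 273], "sum_zz": [15, 9, 2]})
flat = pd.Index(["mrn", "sum_zz"])

wrapped = df.copy()
wrapped.columns = [flat]   # a list holding one Index -> one-level MultiIndex

plain = df.copy()
plain.columns = flat       # the Index itself -> ordinary flat Index

print(type(wrapped.columns).__name__, wrapped.columns.nlevels)
print(type(plain.columns).__name__)

mask_frame = wrapped["mrn"] == 263   # comparison on a DataFrame -> boolean DataFrame
mask_series = plain["mrn"] == 263    # comparison on a Series -> boolean Series

masked = wrapped[mask_frame]    # element-wise mask: same shape, NaN where not True
selected = plain[mask_series]   # row selection: only the matching rows

print(masked)
print(selected)
```

With the boolean DataFrame every row is kept and non-matching cells become NaN; with the boolean Series only the matching rows come back.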
Remove them:
df2 = df.groupby(["mrn","study_id","data_categ"]).agg({sum,np.median})
df2.columns = df2.columns.get_level_values(1)+"_zz"
df2 = df2.reset_index()
print(df2[df2['mrn']==263])
Output:
   mrn study_id    data_categ  sum_zz  median_zz
0  263     st_1  during_event      15       15.0
1  263     st_1    post_event       9        9.0
2  263     st_1     pre_event      12       12.0
With the square brackets (one-level MultiIndex):
df2.columns
MultiIndex([( 'mrn',),
( 'study_id',),
('data_categ',),
( 'sum_zz',),
( 'median_zz',)],
)
df2['mrn'] # this is a DataFrame
mrn
0 263
1 263
2 263
3 273
4 273
5 273
Without the square brackets (flat Index):
df2.columns
Index(['mrn', 'study_id', 'data_categ', 'sum_zz', 'median_zz'], dtype='object')
df2['mrn'] # this is a Series
0 263
1 263
2 263
3 273
4 273
5 273
Name: mrn, dtype: int64
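If you already have a frame whose columns ended up as a one-level MultiIndex, one possible way to repair it in place is to pull out level 0 as a flat Index. This is only a sketch on a small made-up frame, not part of the original code:

```python
import pandas as pd

# Reproduce the accidental one-level MultiIndex on a small illustrative frame
df2 = pd.DataFrame({"mrn": [263, 263, 273], "sum_zz": [15, 9, 2]})
df2.columns = [df2.columns]
print(type(df2.columns).__name__)   # MultiIndex

# Flatten: level 0 of the MultiIndex is returned as an ordinary Index
df2.columns = df2.columns.get_level_values(0)
print(type(df2.columns).__name__)

# df2["mrn"] is a Series again, so the boolean mask selects rows
rows = df2[df2["mrn"] == 263]
print(rows)
```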
You can get the desired result directly with:
df2 = (df.groupby(["mrn","study_id","data_categ"])['data_val'].agg({sum,np.median})
.add_suffix('_zz').reset_index()
)
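As a variation on the same idea (a different technique from the `add_suffix` call above, not what the answer itself uses), pandas named aggregation lets you choose the output column names in the `agg` call, so no renaming step is needed. A sketch, using the string names 'sum' and 'median' instead of the function objects:

```python
import pandas as pd

df = pd.DataFrame({
    "mrn": [263, 263, 263, 273, 273, 273],
    "study_id": ["st_1", "st_1", "st_1", "st_2", "st_2", "st_2"],
    "data_categ": ["pre_event", "during_event", "post_event"] * 2,
    "data_val": [12, 15, 9, 2, 9, 0],
})

# Named aggregation: keyword = output column name, value = aggregation to apply
df2 = (
    df.groupby(["mrn", "study_id", "data_categ"])["data_val"]
      .agg(sum_zz="sum", median_zz="median")
      .reset_index()
)

subset = df2[df2["mrn"] == 263]
print(subset)
```

The column order then follows the keyword order, which is not guaranteed when the functions are passed as a set.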
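You are trying to rename the columns using only the second-level values. Instead of joining the two levels together, you are replacing the first-level values with the second-level ones.
The main differences from @mozway's answer are the way the column names are changed after aggregation and the fact that the column to aggregate is specified explicitly. Both approaches select the same rows (only the resulting column names differ), so you can use either one.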
import pandas as pd
import numpy as np
df = pd.DataFrame({'mrn': [263, 263, 263, 273, 273, 273], 'study_id': ['st_1', 'st_1', 'st_1', 'st_2', 'st_2', 'st_2'],
'data_categ': ['pre_event', 'during_event', 'post_event', 'pre_event', 'during_event', 'post_event'],
'data_val': [12, 15, 9, 2, 9, 0]}, columns=['mrn', 'study_id', 'data_categ', 'data_val'])
df2 = df.groupby(["mrn", "study_id", "data_categ"]).agg({"data_val": [sum, np.median]})
df2.columns = ['_'.join(col).strip() for col in df2.columns]
df2 = df2.reset_index()
print(df2[df2.mrn == 263])
Output:
   mrn study_id    data_categ  data_val_sum  data_val_median
0  263     st_1  during_event            15             15.0
1  263     st_1    post_event             9              9.0
2  263     st_1     pre_event            12             12.0
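To make the difference between joining the levels and replacing them concrete, this sketch prints the two-level column tuples produced by the dict-style `agg` and then builds both kinds of flat names from them. The variable names `joined` and `second_level` are only for illustration:

```python
import pandas as pd

df = pd.DataFrame({
    "mrn": [263, 263, 263, 273, 273, 273],
    "study_id": ["st_1", "st_1", "st_1", "st_2", "st_2", "st_2"],
    "data_categ": ["pre_event", "during_event", "post_event"] * 2,
    "data_val": [12, 15, 9, 2, 9, 0],
})

agg = df.groupby(["mrn", "study_id", "data_categ"]).agg({"data_val": ["sum", "median"]})

# Each column label is a (column, function) tuple
print(agg.columns.tolist())

# Joining both levels keeps the source column name in the result
joined = ["_".join(col) for col in agg.columns]
print(joined)

# Using only the second level drops the source column name
second_level = agg.columns.get_level_values(1) + "_zz"
print(list(second_level))
```

Either list of names can be assigned directly to `agg.columns`, as long as it is not wrapped in an extra pair of square brackets.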
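The other answers here work in the general case, but if you are only aggregating a single column, the following solution will do.
When you do the groupby, make sure to name the column you want to aggregate: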
df2 = df.groupby(["mrn","study_id","data_categ"])['data_val'].agg({sum,np.median})
That way, the DataFrame will not have MultiIndex columns after you reset the index:
df2 = df2.reset_index()
print(df2[df2.mrn==263])
This gives you the result set you expect.