I am trying to group a DataFrame that has MultiIndex columns, using a Series (without a MultiIndex) as the groupby key. Specifically, given the following DataFrame
>>> df
            X        Y
            A  B  C  A  B  C
2020-01-01  9  1  2  1  6  5
2020-01-02  5  7  8  0  6  9
2020-01-03  6  3  4  8  6  1
2020-01-06  0  0  9  0  5  1
2020-01-07  8  7  4  8  3  1
and a Series that defines the groups
>>> groups
A    D
B    D
C    E
dtype: object
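For anyone who wants to reproduce this, the two objects can be built roughly as follows. This is a minimal sketch: the DatetimeIndex and the integer dtype are assumptions inferred from the printed output, not something stated above.

```python
import pandas as pd

# Assumed construction: a DatetimeIndex for the rows and a two-level
# column MultiIndex (X/Y on top, A/B/C below), matching the printed frame.
index = pd.to_datetime(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]
)
columns = pd.MultiIndex.from_product([["X", "Y"], ["A", "B", "C"]])
df = pd.DataFrame(
    [
        [9, 1, 2, 1, 6, 5],
        [5, 7, 8, 0, 6, 9],
        [6, 3, 4, 8, 6, 1],
        [0, 0, 9, 0, 5, 1],
        [8, 7, 4, 8, 3, 1],
    ],
    index=index,
    columns=columns,
)

# Maps each second-level column label to the group it belongs to.
groups = pd.Series({"A": "D", "B": "D", "C": "E"})
```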
I tried running the following command
>>> df.groupby(groups, axis=1, level=1).sum()
and expected to get
             X      Y
             D  E   D  E
2020-01-01  10  2   7  5
2020-01-02  12  8   6  9
2020-01-03   9  4  14  1
2020-01-06   0  9   5  1
2020-01-07  15  4  11  1
but I get the following error instead:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/frame.py", line 6717, in groupby
    return DataFrameGroupBy(
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/groupby/groupby.py", line 560, in __init__
    grouper, exclusions, obj = get_grouper(
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 828, in get_grouper
    Grouping(
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/groupby/grouper.py", line 485, in __init__
    ) = index._get_grouper_for_level(self.grouper, level)
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/indexes/multi.py", line 1487, in _get_grouper_for_level
    grouper = level_values.map(mapper)
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/indexes/base.py", line 5098, in map
    new_values = super()._map_values(mapper, na_action=na_action)
  File "/home/zak/anaconda3/envs/lib/python3.8/site-packages/pandas/core/base.py", line 937, in _map_values
    new_values = map_f(values, mapper)
  File "pandas/_libs/lib.pyx", line 2467, in pandas._libs.lib.map_infer
TypeError: 'numpy.ndarray' object is not callable
I am using Python 3.8.8 and pandas 1.2.3.
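One way I have found to get the desired result is the code below, but I would specifically like to know whether there is a cleaner way. If not, why not? To me, the attempt above looks like the intended behavior of the groupby method, but I seem to have misunderstood the logic behind it.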
>>> df, groups = df.align(groups, axis=1, level=1)
>>> df.groupby(groups, axis=1).apply(lambda x: x.sum(axis=1, level=0)).swaplevel(axis=1).sort_index(axis=1)
             X      Y
             D  E   D  E
2020-01-01  10  2   7  5
2020-01-02  12  8   6  9
2020-01-03   9  4  14  1
2020-01-06   0  9   5  1
2020-01-07  15  4  11  1
You can use rename on the second level of the MultiIndex, and then aggregate by both levels:
df = df.rename(columns=groups, level=1).sum(axis=1, level=[0,1])
# equivalent to:
# df = df.rename(columns=groups, level=1).groupby(axis=1, level=[0,1]).sum()
print(df)
             X      Y
             D  E   D  E
2020-01-01  10  2   7  5
2020-01-02  12  8   6  9
2020-01-03   9  4  14  1
2020-01-06   0  9   5  1
2020-01-07  15  4  11  1
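Note that the `level` argument of `sum` and the `axis=1` argument of `groupby` were both deprecated in later pandas releases, so the lines above may warn or fail on newer versions. A sketch that avoids both is to transpose, group on the row index, and transpose back. It assumes a purely numeric frame (so transposing does not change dtypes); the name `result` and the rebuilt sample data are only for illustration.

```python
import pandas as pd

# Sample data rebuilt from the question (index type is an assumption).
index = pd.to_datetime(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]
)
columns = pd.MultiIndex.from_product([["X", "Y"], ["A", "B", "C"]])
df = pd.DataFrame(
    [
        [9, 1, 2, 1, 6, 5],
        [5, 7, 8, 0, 6, 9],
        [6, 3, 4, 8, 6, 1],
        [0, 0, 9, 0, 5, 1],
        [8, 7, 4, 8, 3, 1],
    ],
    index=index,
    columns=columns,
)
groups = pd.Series({"A": "D", "B": "D", "C": "E"})

# Relabel the second column level (A, B -> D; C -> E).
renamed = df.rename(columns=groups.to_dict(), level=1)

# Transpose so the column MultiIndex becomes the row index, sum the rows
# that share the same (level 0, level 1) pair, then transpose back.
result = renamed.T.groupby(level=[0, 1]).sum().T
print(result)
```

The rename step is the same idea as above; only the aggregation is expressed without `axis=1`.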
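Your original attempt works if you pass a lambda function instead of the Series, but the output is different, because the first column level is dropped: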
df = df.groupby(lambda x: groups[x], axis=1, level=1).sum()
print(df)
             D   E
2020-01-01  17   7
2020-01-02  18  17
2020-01-03  23   5
2020-01-06   5  10
2020-01-07  26   5
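The first level disappears here because, with `level=1`, the key function appears to be applied only to the second-level labels, so the `X` and `Y` columns that map to the same group are summed together. One possible way to keep both levels is to group on a pair of keys: the original first-level labels and the mapped second-level labels. The sketch below assumes a numeric frame and uses the transpose form so it does not rely on `axis=1`; the names `top`, `mapped` and `result` are illustrative.

```python
import pandas as pd

# Sample data rebuilt from the question (index type is an assumption).
index = pd.to_datetime(
    ["2020-01-01", "2020-01-02", "2020-01-03", "2020-01-06", "2020-01-07"]
)
columns = pd.MultiIndex.from_product([["X", "Y"], ["A", "B", "C"]])
df = pd.DataFrame(
    [
        [9, 1, 2, 1, 6, 5],
        [5, 7, 8, 0, 6, 9],
        [6, 3, 4, 8, 6, 1],
        [0, 0, 9, 0, 5, 1],
        [8, 7, 4, 8, 3, 1],
    ],
    index=index,
    columns=columns,
)
groups = pd.Series({"A": "D", "B": "D", "C": "E"})

# One key per column: the untouched first level ...
top = df.columns.get_level_values(0)
# ... and the second level translated through the groups Series.
mapped = df.columns.get_level_values(1).map(groups)

# Group the transposed frame on both keys, then transpose back.
result = df.T.groupby([top, mapped]).sum().T
print(result)
```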