Selecting multiple columns in a "DataFrameGroupBy" (based on a "MultiIndex")

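I have a complex DataFrame with many columns, all of them based on a MultiIndex. At some point I wanted to be very specific when estimating some metrics, so I started experimenting with the .groupby method. I can manage the basics: 1) computing an aggregation over the whole DataFrame, or 2) computing an aggregation for one specific column. However, I am interested in computing an aggregation by naming several entries of the first column level. This is easy to do when the columns have only one level. To make this easier to follow, I created the following MRE (minimal reproducible example), which reproduces my idea and the errors I ran into: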

import numpy as np
import pandas as pd


columns = pd.MultiIndex.from_tuples(
    [
        ("Dimensions", "x"),
        ("Dimensions", "y"),
        ("Dimensions", "z"),
        ("Coefficient", ""),
        ("Comments", ""),
    ],
    names=["Category", "Details"],
)

df = pd.DataFrame(index=range(11), columns=columns)
df[("Dimensions", "x")] = np.random.randint(1, 100, size=11)
df[("Dimensions", "y")] = np.random.randint(1, 100, size=11)
df[("Dimensions", "z")] = np.random.randint(1, 100, size=11)
df[("Coefficient", "")] = np.random.randint(1, 50, size=11)  # Coefficient as a random integer
df[("Comments", "")] = np.random.choice(["Good", "Average", "Bad"], size=11)
df["Comments"] = df["Comments"].astype("category")

# Basic metrics
print(df.groupby("Comments").mean())  # It works
print(df.groupby("Comments")["Dimensions"].mean())  # It works

# Selecting multiple columns within a MultiIndex based one. Different ideas I tried:
df.groupby("Comments")["Dimensions", "Coefficient"].mean()  # It does not work
df.groupby("Comments")[["Dimensions", "Coefficient"]].mean()  # It does not work
df.groupby("Comments").agg({"Dimensions": "mean", "Coefficient": "mean"})  # It does not work

python pandas group-by data-science
1 Answer

If you run print(df.columns), you will see that the actual column names are tuples rather than single strings.
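As a quick check of that point, here is a minimal sketch that rebuilds only the column index from the question and lists its labels. The variable name labels is just for illustration.

```python
import pandas as pd

# Same column layout as in the question.
columns = pd.MultiIndex.from_tuples(
    [
        ("Dimensions", "x"),
        ("Dimensions", "y"),
        ("Dimensions", "z"),
        ("Coefficient", ""),
        ("Comments", ""),
    ],
    names=["Category", "Details"],
)

# Each column label is a full (Category, Details) tuple, not a bare string.
labels = columns.tolist()
for label in labels:
    print(type(label).__name__, label)
```

So a list such as ["Dimensions", "Coefficient"] contains no complete column label, which is likely why the list-based selection in the question fails.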

Try this:

df.groupby("Comments")[[('Dimensions', 'x'), ('Dimensions', 'y'), ('Dimensions', 'z'), ('Coefficient',  '')]].mean()


Category   Dimensions                       Coefficient
Details             x          y          z
Comments
Average         35.00  55.166667  59.333333   21.833333
Bad             81.75  24.250000  45.750000   35.750000
Good            36.00   1.000000  42.000000   20.000000
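If you would rather not type every tuple by hand, one possible variation is to build the tuple list from the first column level. This is only a sketch on a small deterministic frame with the same column layout as the question. The names wanted, mask and selected are illustrative, and observed=True is passed only so that the behaviour does not depend on the pandas version if the grouping column is categorical.

```python
import numpy as np
import pandas as pd

# Small deterministic frame mirroring the question's column layout.
df = pd.DataFrame(
    {
        ("Dimensions", "x"): np.arange(1, 12),
        ("Dimensions", "y"): np.arange(11, 0, -1),
        ("Dimensions", "z"): np.arange(1, 12) * 2,
        ("Coefficient", ""): np.arange(1, 12) * 3,
        ("Comments", ""): ["Good", "Average", "Bad"] * 3 + ["Good", "Bad"],
    }
)
df.columns = df.columns.set_names(["Category", "Details"])

# Top-level names we want to aggregate (illustrative choice).
wanted = ["Dimensions", "Coefficient"]

# Keep every column whose first level is one of the wanted names,
# then turn the resulting MultiIndex into a plain list of tuples.
mask = df.columns.get_level_values("Category").isin(wanted)
selected = df.columns[mask].tolist()

result = df.groupby("Comments", observed=True)[selected].mean()
print(result)
```

Here selected ends up being the same list of tuples as in the explicit version above, so the groupby call itself is unchanged.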