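I have a complex DataFrame with many columns, all of them organized under a MultiIndex. At some point I needed to be quite specific when computing some metrics, so I started experimenting with the .groupby method. I can manage the basics: 1) computing an aggregation over the whole DataFrame, or 2) computing an aggregation for one specific column. However, what I want is to compute an aggregation by naming several entries of the first column level. This is easy when the columns have only one level. To make this easier to follow, I created the following minimal reproducible example (MRE), which reproduces my idea and the errors I run into: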
import numpy as np
import pandas as pd
columns = pd.MultiIndex.from_tuples(
[
("Dimensions", "x"),
("Dimensions", "y"),
("Dimensions", "z"),
("Coefficient", ""),
("Comments", ""),
],
names=["Category", "Details"],
)
df = pd.DataFrame(index=range(11), columns=columns)
df[("Dimensions", "x")] = np.random.randint(1, 100, size=11)
df[("Dimensions", "y")] = np.random.randint(1, 100, size=11)
df[("Dimensions", "z")] = np.random.randint(1, 100, size=11)
df[("Coefficient", "")] = np.random.randint(1, 50, size=11) # Coefficient as a random integer
df[("Comments", "")] = np.random.choice(["Good", "Average", "Bad"], size=11)
df["Comments"] = df["Comments"].astype("category")
# Basic metrics
print(df.groupby("Comments").mean()) # It works
print(df.groupby("Comments")["Dimensions"].mean()) # It works
# Selecting multiple columns when the columns are a MultiIndex. Different ideas I tried:
df.groupby("Comments")["Dimensions", "Coefficient"].mean() # It does not work
df.groupby("Comments")[["Dimensions", "Coefficient"]].mean() # It does not work
df.groupby("Comments").agg({"Dimensions": "mean", "Coefficient": "mean"}) # It does not work
If you run
print(df.columns)
you will see that the actual column names are tuples rather than single strings.
Try this:
df.groupby("Comments")[[('Dimensions', 'x'), ('Dimensions', 'y'), ('Dimensions', 'z'), ('Coefficient', '')]].mean()
Category  Dimensions                        Coefficient
Details            x          y          z
Comments
Average        35.00  55.166667  59.333333   21.833333
Bad            81.75  24.250000  45.750000   35.750000
Good           36.00   1.000000  42.000000   20.000000
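If typing out every tuple gets tedious, one option is to build the list of tuples from the names in the first column level. The following is only a sketch under assumptions: it uses a small made-up frame with the same column layout as in the question, and the names wanted, mask and selected are illustrative, not part of the original code.

```python
import pandas as pd

# Small made-up frame mirroring the question's column layout (assumed data).
columns = pd.MultiIndex.from_tuples(
    [
        ("Dimensions", "x"),
        ("Dimensions", "y"),
        ("Coefficient", ""),
        ("Comments", ""),
    ],
    names=["Category", "Details"],
)
df = pd.DataFrame(
    [
        [1, 2, 10, "Good"],
        [7, 4, 40, "Bad"],
        [5, 8, 30, "Good"],
    ],
    columns=columns,
)

# Names from the first column level that should be aggregated.
wanted = ["Dimensions", "Coefficient"]

# Boolean mask over the columns: True where the "Category" level is wanted.
mask = df.columns.get_level_values("Category").isin(wanted)

# Full tuple names of the matching columns, in their original order.
selected = df.columns[mask].tolist()

# Same pattern as above: select with a list of tuples, then aggregate.
result = df.groupby("Comments")[selected].mean()
print(result)
```

Here selected ends up holding the same kind of tuple list that was written by hand above, so the groupby call itself does not change; only the way the list is produced does.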