I have a DataFrame, df_things, that looks like this, and I want to estimate how well classification will work before training:
A B C CLASS
-----------------------
al1 bal1 cal1 Ship
al1 bal1 cal1 Ship
al1 bal2 cal2 Ship
al2 bal2 cal2 Cow
al3 bal3 cal3 Car
al1 bal2 cal3 Car
al3 bal3 cal3 Car
I want to group the rows by class so that I can see how the features are distributed. This is what I do (for column "B", for example):
df_B = df_things.groupby('CLASS').B.value_counts()
This gives me the result:
CLASS B
-------------
ship bal1 2
bal2 1
cow bal2 2
car bal2 1
bal3 2
What I want is to show only the groups that have more than one value, so that it looks like this:
CLASS B
-------------
ship bal1 2
bal2 1
car bal2 1
bal3 2
I'm a bit stuck. Any ideas?
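You can group the counts by the first index level (CLASS) and keep only the classes that have more than one entry.
v = df_things.groupby('CLASS').B.value_counts()
# 'size' counts the B values per class; 'nunique' here would count distinct
# count values and drop a class whose B values all occur equally often
v[v.groupby(level=0).transform('size').gt(1)]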
CLASS B
Car bal3 2
bal2 1
Ship bal1 2
bal2 1
Name: B, dtype: int64
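When filtering a value_counts() result per class, it matters whether you count the entries per class or the distinct count values per class. The following is a minimal sketch of the difference. The 'Bus' class is a made-up addition (it is not in the question's data) whose two B values each occur twice, and the variable names are illustrative only.

```python
import pandas as pd

# The question's B and CLASS columns, plus a hypothetical 'Bus' class
# whose two B values occur equally often (2 and 2).
df = pd.DataFrame({
    'B': ['bal1', 'bal1', 'bal2', 'bal2', 'bal3', 'bal2', 'bal3',
          'bal1', 'bal1', 'bal2', 'bal2'],
    'CLASS': ['Ship', 'Ship', 'Ship', 'Cow', 'Car', 'Car', 'Car',
              'Bus', 'Bus', 'Bus', 'Bus'],
})

v = df.groupby('CLASS').B.value_counts()
per_class = v.groupby(level='CLASS')

# 'size': number of distinct B values in each class
by_size = v[per_class.transform('size') > 1]
# 'nunique': number of distinct *count values* in each class
by_nunique = v[per_class.transform('nunique') > 1]

kept_by_size = sorted(by_size.index.get_level_values('CLASS').unique())
kept_by_nunique = sorted(by_nunique.index.get_level_values('CLASS').unique())
print(kept_by_size)     # ['Bus', 'Car', 'Ship']
print(kept_by_nunique)  # ['Car', 'Ship']
```

On the question's original data both variants return the same rows; they only diverge when a class has several B values with identical counts.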
A solution using crosstab:
import numpy as np
s = pd.crosstab(df_things.CLASS, df_things.B)
s[s.ne(0).sum(axis=1) > 1].replace(0, np.nan).stack()
CLASS B
Car bal2 1.0
bal3 2.0
Ship bal1 2.0
bal2 1.0
dtype: float64
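The crosstab route turns the counts into floats because zeros are replaced by NaN before stacking. If integer counts are preferred, one possible variant (a sketch, not the answerer's code) stacks first and then drops the zero cells explicitly, so it does not depend on NaN handling in stack:

```python
import pandas as pd

df_things = pd.DataFrame({
    'B': ['bal1', 'bal1', 'bal2', 'bal2', 'bal3', 'bal2', 'bal3'],
    'CLASS': ['Ship', 'Ship', 'Ship', 'Cow', 'Car', 'Car', 'Car'],
})

# Rows are classes, columns are B values, cells are occurrence counts.
s = pd.crosstab(df_things.CLASS, df_things.B)

# Keep the classes that have more than one non-zero column.
multi = s[s.gt(0).sum(axis=1) > 1]

# Long format indexed by (CLASS, B); then discard the empty combinations.
counts = multi.stack()
counts = counts[counts > 0]
print(counts)
```

The result has the same four (CLASS, B) rows as above, with an integer dtype.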
Here is another approach.
Set up the input data:
In [1]:
import pandas as pd
df_things = pd.DataFrame({
    'A': ['al1', 'al1', 'al1', 'al2', 'al3', 'al1', 'al3'],
    'B': ['bal1', 'bal1', 'bal2', 'bal2', 'bal3', 'bal2', 'bal3'],
    'C': ['cal1', 'cal1', 'cal2', 'cal2', 'cal3', 'cal3', 'cal3'],
    'CLASS': ['Ship', 'Ship', 'Ship', 'Cow', 'Car', 'Car', 'Car']
})
print(df_things)
A B C CLASS
0 al1 bal1 cal1 Ship
1 al1 bal1 cal1 Ship
2 al1 bal2 cal2 Ship
3 al2 bal2 cal2 Cow
4 al3 bal3 cal3 Car
5 al1 bal2 cal3 Car
6 al3 bal3 cal3 Car
Reduce it to the groups that have more than one unique value:
In [2]:
df_reduced = df_things.groupby(['CLASS']).filter(lambda grp: grp['B'].nunique() > 1)
print(df_reduced)
A B C CLASS
0 al1 bal1 cal1 Ship
1 al1 bal1 cal1 Ship
2 al1 bal2 cal2 Ship
4 al3 bal3 cal3 Car
5 al1 bal2 cal3 Car
6 al3 bal3 cal3 Car
Apply groupby to get the desired output:
In [3]:
df_reduced.groupby(['CLASS'])['B'].value_counts()
Out[3]:
CLASS B
Car bal3 2
bal2 1
Ship bal1 2
bal2 1
Name: B, dtype: int64
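Since the question is about the distribution of all the features, the two steps above could be wrapped in a small helper and applied to each column in turn. This is only a sketch: the function name multi_value_counts and its parameters are hypothetical, not part of pandas.

```python
import pandas as pd

def multi_value_counts(df, feature, label='CLASS'):
    """Per-class value counts of `feature`, limited to the classes
    in which `feature` takes more than one distinct value."""
    reduced = df.groupby(label).filter(lambda grp: grp[feature].nunique() > 1)
    return reduced.groupby(label)[feature].value_counts()

df_things = pd.DataFrame({
    'A': ['al1', 'al1', 'al1', 'al2', 'al3', 'al1', 'al3'],
    'B': ['bal1', 'bal1', 'bal2', 'bal2', 'bal3', 'bal2', 'bal3'],
    'C': ['cal1', 'cal1', 'cal2', 'cal2', 'cal3', 'cal3', 'cal3'],
    'CLASS': ['Ship', 'Ship', 'Ship', 'Cow', 'Car', 'Car', 'Car'],
})

# One filtered table per feature column.
for feature in ['A', 'B', 'C']:
    print(multi_value_counts(df_things, feature))
    print()
```

With the sample data, column A keeps only Car, column B keeps Car and Ship, and column C keeps only Ship.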
By the way, the df_B output in your question contains a typo. It should be:
In [4]:
df_B = df_things.groupby('CLASS').B.value_counts()
print(df_B)
CLASS B
Car bal3 2
bal2 1
Cow bal2 1