Python: Stratified sampling of imbalanced data by proportion

Question  Votes: 0  Answers: 1

This is my dataframe:

import pandas as pd

df = pd.DataFrame({'var1': [1,2,3,4,5,6,7,8,9,10,11,12,13,14],
                   'var2': ['a','a','a','a','b','b','b','b','b','b','b','c','d','d'],
                   'var3': ['y','y','y','y','r','r','r','r','r','r','r','q','q', 'r'],
                   'var4': [0,1,0,0,1,1,0,0,0,0,0,0,0,0]})

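var4 is imbalanced, so I plan to do stratified sampling that, within each group defined by var2 and var3, takes twice as many var4 = 0 rows as var4 = 1 rows. As a result, group ('a', 'y') will have one 1 and two 0s; group ('b', 'r') will have two 1s and four 0s. The other groups will have no rows. The expected result looks like this: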

df_sampled = pd.DataFrame({'var1': [1,2,3,5,6,7,8,10,11],
                   'var2': ['a','a','a','b','b','b','b','b','b'],
                   'var3': ['y','y','y','r','r','r','r','r','r'],
                   'var4': [0,1,0,1,1,0,0,0,0]})

I tried to find the number of var4 = 1 rows in each group:

df.var4 = df.var4.mask(df.var4.ne(1))
dd = df.groupby(['var2', 'var3']).var4.count().tolist()

I also tried to use sample() with the list dd:

df.loc[df['var4'] == 0].groupby(['var2','var3'], group_keys=False).apply(lambda x: x.sample(dd))

However, it doesn't work. Any suggestions?

python pandas dataframe sampling
1 answer
0 votes

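Let's try it this way: sort so that the var4 = 1 rows come first in each group, number the rows within each group, and keep only the rows whose position is at most three times the group's count of 1s (the 1s themselves plus twice as many 0s).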

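# Start from the original df, where var4 still holds 0/1.
# A stable sort puts the 1s first and keeps ties in their original order.
df = df.sort_values('var4', ascending=False, kind='stable')
gb = df.groupby(['var2', 'var3'])
s = gb.cumcount().add(1)         # 1-based position of each row within its group
s1 = gb.var4.transform('sum')    # number of var4 == 1 rows in the group
# Keep the s1 ones plus at most 2 * s1 zeros per group.
df_final = df[(s - s1) <= (s1 * 2)].sort_index()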

Out[1758]:
   var1 var2 var3  var4
0     1    a    y     0
1     2    a    y     1
2     3    a    y     0
4     5    b    r     1
5     6    b    r     1
6     7    b    r     0
7     8    b    r     0
8     9    b    r     0
9    10    b    r     0
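
Note that this filter always keeps the first zeros of each group, so the selection is deterministic rather than random. If a random draw of the zeros is wanted instead, a minimal sketch along the following lines might work. The helper name `sample_group` and its `ratio` and `seed` parameters are my own assumptions for illustration, not part of the code above.

```python
import pandas as pd

df = pd.DataFrame({'var1': [1,2,3,4,5,6,7,8,9,10,11,12,13,14],
                   'var2': ['a','a','a','a','b','b','b','b','b','b','b','c','d','d'],
                   'var3': ['y','y','y','y','r','r','r','r','r','r','r','q','q', 'r'],
                   'var4': [0,1,0,0,1,1,0,0,0,0,0,0,0,0]})

def sample_group(g, ratio=2, seed=0):
    """Hypothetical helper: keep every var4 == 1 row of one group plus
    up to `ratio` times as many randomly drawn var4 == 0 rows."""
    ones = g[g['var4'] == 1]
    zeros = g[g['var4'] == 0]
    # Never ask for more zeros than the group actually contains.
    n_zeros = min(len(zeros), ratio * len(ones))
    return pd.concat([ones, zeros.sample(n=n_zeros, random_state=seed)])

# Iterate over the groups explicitly; groups without any var4 == 1 row are skipped.
parts = [sample_group(g) for _, g in df.groupby(['var2', 'var3'])
         if (g['var4'] == 1).any()]
df_random = pd.concat(parts).sort_index()
print(df_random)
```

Which zeros end up in `df_random` depends on `seed`, but the per-group counts should match the requested ratio: one 1 and two 0s for ('a', 'y'), two 1s and four 0s for ('b', 'r').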