Sampling with multiple conditions and percentages in Python

Person ID  Condition 1  Condition 2  Condition 3
A          Yes          No           Yes
B          No           Yes          No
C          Yes          No           No

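Hi! I have to generate a sample from a fairly large dataset, and the inclusion criteria are a bit more complex than what I have dealt with before. For this sample I need 100 people: 25 should have Condition 1, 25 should have Condition 2, and 25 should have Condition 3. The last 25 should have none of the conditions (this one is simple, no problem here).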

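The conditions do not have to be mutually exclusive, so one person can have both Condition 1 and Condition 2, and that would count as one case for each condition. Because of the nature of the dataset, most people will likely have some combination of conditions (which is actually preferable to a sample in which the 3 conditions are completely separate). Does that make sense? It is not conceptually difficult, I just haven't found a solution yet!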
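To make the counting rule concrete, here is a minimal sketch on a made-up three-row frame that mirrors the table above. The column names and the "Yes"/"No" encoding are assumptions for illustration only; the real dataset may be encoded differently.

```python
import pandas as pd

# Hypothetical toy data mirroring the table in the question
people = pd.DataFrame({
    'Person ID': ['A', 'B', 'C'],
    'Condition 1': ['Yes', 'No', 'Yes'],
    'Condition 2': ['No', 'Yes', 'No'],
    'Condition 3': ['Yes', 'No', 'No'],
})

cond_cols = ['Condition 1', 'Condition 2', 'Condition 3']

# A person with several conditions counts once toward each of them
cases_per_condition = people[cond_cols].eq('Yes').sum()

# People with none of the conditions would form the fourth group
no_condition = int(people[cond_cols].eq('No').all(axis=1).sum())

print(cases_per_condition)
print('no condition:', no_condition)
```

Person A contributes one case to Condition 1 and one to Condition 3, which is why a sample that fills all four quotas of 25 can contain fewer than 100 distinct people.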

Normally I would filter for the variables a sample needs with a line like this:

sample = df[(df['Condition 1'] == "Yes") & (df["Condition 2"] != 0) & (df["Condition 3"] != 0)]

But in this case the sampling scheme is more complex, and this approach would take a lot of trial and error.

python pandas filtering subset sampling
1 Answer

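Assuming that you want random sampling, that no row is selected more than once, and that you want a specific number of cases for each condition while a single row may carry several conditions (so the total may be fewer than 100 rows), you can use the following approach, shown here with a simplified example. While collecting the samples you need to keep track of the counts for every condition and handle the case where not enough rows meet a condition.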

import pandas as pd

num = 2   # number of rows in each of the 4 samples

# create DF with hard-coded, arbitrarily chosen 0/1 values for each Cond column
df = pd.DataFrame({'id': list('abcdefghijklmnopqrstuvwxyz'),
                   'Cond1': [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
                   'Cond2': [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
                   'Cond3': [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
                   })

# take sample after checking maximum available to meet condition
x = min(num,df[df['Cond1'].eq(1)].shape[0])
sample1 = df[df['Cond1'].eq(1)].sample(x)

#drop all rows meeting Cond1 as now have required samples or max available for Cond1
rest = df.drop(df[df['Cond1'].eq(1)].index)

#count number of samples already obtained for the remaining Conds
s1c2 = sample1[sample1['Cond2'].eq(1)].shape[0]
s1c3 = sample1[sample1['Cond3'].eq(1)].shape[0]

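# obtain the remaining samples required for Cond2, or as many as are available in rest
# (counting availability in rest, not df, because the Cond1 rows were already dropped)
# and make sure the Cond3 quota is not exceeded (otherwise draw again)
cond2_rest = rest[rest['Cond2'].eq(1)]
x = min(num - s1c2, cond2_rest.shape[0])
for _ in range(1000):   # retry cap (arbitrary) so an unsatisfiable draw cannot loop forever
    sample2 = cond2_rest.sample(x)
    c3 = s1c3 + sample2[sample2['Cond3'].eq(1)].shape[0]
    if c3 <= num:
        break
else:
    raise RuntimeError('could not draw a Cond2 sample without exceeding the Cond3 quota')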

#drop all rows meeting Cond2 as now have required samples for Cond2
rest = rest.drop(rest[rest['Cond2'].eq(1)].index)

#obtain any remaining samples required for Cond3 or as many as available in rest
x = min(num-c3,rest[rest['Cond3'].eq(1)].shape[0])
sample3 = rest[rest['Cond3'].eq(1)].sample(x)

#obtain last sample or as many as available
x = min(num,df[df[['Cond1', 'Cond2', 'Cond3']].eq(0).all(1)].shape[0])
sample4 = df[df[['Cond1', 'Cond2', 'Cond3']].eq(0).all(1)].sample(x)

# combine all samples to give DF with num x 4 rows or as many as meet the conditions
final = pd.concat([sample1,sample2, sample3, sample4])
#add .reset_index(drop=True) if sequential index required

print(final)

Running the code 3 times gives:

   id  Cond1  Cond2  Cond3
19  t      1      0      0
20  u      1      1      0
2   c      0      1      1
18  s      0      0      1
16  q      0      0      0
15  p      0      0      0

   id  Cond1  Cond2  Cond3
10  k      1      0      1
24  y      1      1      0
25  z      0      1      1
15  p      0      0      0
1   b      0      0      0

   id  Cond1  Cond2  Cond3
12  m      1      0      0
19  t      1      0      0
9   j      0      1      0
6   g      0      1      0
22  w      0      0      1
5   f      0      0      1
1   b      0      0      0
15  p      0      0      0