Sampling with multiple conditions and percentages in Python

Question

Person ID	Condition 1	Condition 2	Condition 3
A	Yes	No	Yes
B	No	Yes	No
C	Yes	No	No

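Hi! I have to generate a sample from a fairly large dataset, and the inclusion criteria are a bit more complex than the ones I have handled before. For this sample I need 100 people: 25 should have Condition 1, 25 should have Condition 2, and 25 should have Condition 3. The final 25 should have no conditions (that one is simple, no problems here).

The conditions do not have to be mutually exclusive, so one person can have both Condition 1 and Condition 2, but that would count as one case for each condition. Given the nature of the dataset, most people will likely have some combination of conditions (which is actually preferable to a sample where the 3 conditions are completely distinct from each other). Does that make sense? It isn't conceptually difficult, I just haven't found a solution yet!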
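
To make the counting rule concrete, here is a minimal sketch, assuming Yes/No string columns named as in the table above. The DataFrame `people` and the helper `condition_counts` are hypothetical names used only for illustration.

```python
import pandas as pd

# Hypothetical toy data mirroring the table in the question.
people = pd.DataFrame({
    "Person ID": ["A", "B", "C"],
    "Condition 1": ["Yes", "No", "Yes"],
    "Condition 2": ["No", "Yes", "No"],
    "Condition 3": ["Yes", "No", "No"],
})

COND_COLS = ["Condition 1", "Condition 2", "Condition 3"]


def condition_counts(sample: pd.DataFrame) -> pd.Series:
    """Count the rows that have each condition.

    A row with several conditions contributes one case to each of them.
    """
    return sample[COND_COLS].eq("Yes").sum()


counts = condition_counts(people)
# Rows where every condition column is "No" belong to the "no condition" group.
none_count = int(people[COND_COLS].eq("No").all(axis=1).sum())

print(counts)
print("no condition:", none_count)
```

Person A appears in both the Condition 1 and the Condition 3 tally, which is the behaviour described above. A finished sample would need each tally to reach 25.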

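Normally I would use a line like this to filter on the variables needed for the sample:

sample = df[(df['Condition 1'] == "Yes") & (df["Condition 2"] != 0) & (df["Condition 3"] != 0)]

But in this case the sampling scheme is more complex, and using this approach would require a lot of trial and error.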
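
One way to see why a single combined filter does not fit: it keeps only rows that satisfy every test at once, whereas the quotas need a separate (possibly overlapping) pool per condition. A rough sketch, assuming Yes/No string columns; the toy data and the names `flags` and `pools` are hypothetical.

```python
import pandas as pd

# Hypothetical toy frame; column names follow the table in the question.
df = pd.DataFrame({
    "Person ID": list("ABCDEF"),
    "Condition 1": ["Yes", "No", "Yes", "No", "Yes", "No"],
    "Condition 2": ["No", "Yes", "No", "No", "Yes", "No"],
    "Condition 3": ["Yes", "No", "No", "No", "Yes", "No"],
})
cond_cols = ["Condition 1", "Condition 2", "Condition 3"]

# Convert the Yes/No strings to booleans once, so later tests are uniform.
flags = df[cond_cols].eq("Yes")

# One candidate pool per condition (pools may overlap) ...
pools = {col: df[flags[col]] for col in cond_cols}
# ... plus the pool of people with no condition at all.
pools["none"] = df[~flags.any(axis=1)]

# The single combined filter keeps only people who have every condition.
all_three = df[flags.all(axis=1)]

for name, pool in pools.items():
    print(name, list(pool["Person ID"]))
print("all three:", list(all_three["Person ID"]))
```

The per-condition pools are the starting point for quota sampling; the overlap between them is what has to be tracked while drawing.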

Answer 1

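Assumptions: you want random sampling; no row is selected more than once; and you want a specific number of rows for each condition, where a single row may carry more than one condition (so the total can be fewer than 100 rows). If so, you can use the following approach, shown here with a simplified example. While collecting the samples it is necessary to keep track of all the conditions and to handle the case where there are not enough rows to meet a condition.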

import pandas as pd

num = 2   # number of rows in each of the 4 samples

# create DF with random 0/1 values for each Cond column
df= pd.DataFrame({'id': ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'],
                   'Cond1': [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
                  'Cond2': [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
                  'Cond3': [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1]
                  })

# take sample after checking maximum available to meet condition
x = min(num,df[df['Cond1'].eq(1)].shape[0])
sample1 = df[df['Cond1'].eq(1)].sample(x)

#drop all rows meeting Cond1 as now have required samples or max available for Cond1
rest = df.drop(df[df['Cond1'].eq(1)].index)

#count number of samples already collected for the remaining Conds
s1c2 = sample1[sample1['Cond2'].eq(1)].shape[0]
s1c3 = sample1[sample1['Cond3'].eq(1)].shape[0]

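# obtain required remaining samples for Cond2 or as many as available in rest
# also must not exceed required samples for Cond3 (otherwise draw again)
cond2_rows = rest[rest['Cond2'].eq(1)]
x = min(num-s1c2,cond2_rows.shape[0])   # count rows in rest, not df, so sample() cannot ask for too many
for attempt in range(1000):   # bounded retries (1000 is an arbitrary cap) instead of a possibly endless loop
    sample2 = cond2_rows.sample(x)
    c3 = s1c3 + sample2[sample2['Cond3'].eq(1)].shape[0]
    if c3 <= num:
        break
else:
    raise RuntimeError('no Cond2 sample found that keeps Cond3 within its quota')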

#drop all rows meeting Cond2 as now have required samples for Cond2
rest = rest.drop(rest[rest['Cond2'].eq(1)].index)

#obtain any remaining samples required for Cond3 or as many as available
#(count what is left in rest, not df, so sample() cannot ask for too many rows)
x = min(num-c3,rest[rest['Cond3'].eq(1)].shape[0])
sample3 = rest[rest['Cond3'].eq(1)].sample(x)

#obtain last sample or as many as available
x = min(num,df[df[['Cond1', 'Cond2', 'Cond3']].eq(0).all(1)].shape[0])
sample4 = df[df[['Cond1', 'Cond2', 'Cond3']].eq(0).all(1)].sample(x)

# combine all samples to give DF with num x 4 rows or as many as meet the conditions
final = pd.concat([sample1,sample2, sample3, sample4])
#add .reset_index(drop=True) if sequential index required

print(final)

Running the code 3 times gives:

   id  Cond1  Cond2  Cond3
19  t      1      0      0
20  u      1      1      0
2   c      0      1      1
18  s      0      0      1
16  q      0      0      0
15  p      0      0      0

   id  Cond1  Cond2  Cond3
10  k      1      0      1
24  y      1      1      0
25  z      0      1      1
15  p      0      0      0
1   b      0      0      0

   id  Cond1  Cond2  Cond3
12  m      1      0      0
19  t      1      0      0
9   j      0      1      0
6   g      0      1      0
22  w      0      0      1
5   f      0      0      1
1   b      0      0      0
15  p      0      0      0
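
The step-by-step code above hard-codes three conditions. As a rough sketch only (not the answer's exact procedure), the same bookkeeping can be written for any number of 0/1 condition columns by shuffling the rows once and skipping any candidate that would push a condition over its quota. The function `quota_sample`, its parameters, and the demo data are hypothetical, and the sketch assumes the DataFrame index is unique.

```python
import pandas as pd


def quota_sample(df, cond_cols, num, seed=None):
    """Draw up to `num` rows per condition plus up to `num` rows with none.

    A row with several conditions counts toward each of them. No row is
    picked twice and no condition ends up with more than `num` rows.
    Assumes 0/1 condition columns and a unique index.
    """
    # Shuffle once so that walking the rows in order is a random draw.
    shuffled = df.sample(frac=1, random_state=seed)
    counts = {c: 0 for c in cond_cols}
    picked = []
    for col in cond_cols:
        for idx, row in shuffled.iterrows():
            if counts[col] >= num:
                break  # quota for this condition is already met
            if idx in picked or row[col] != 1:
                continue
            # Skip rows that would overfill the quota of any condition.
            if any(row[c] == 1 and counts[c] >= num for c in cond_cols):
                continue
            picked.append(idx)
            for c in cond_cols:
                counts[c] += int(row[c])
    # Rows with no condition at all, as many as available up to `num`.
    none_rows = shuffled[shuffled[cond_cols].eq(0).all(axis=1)]
    picked.extend(none_rows.index[:num])
    return df.loc[picked]


# Hypothetical demo data in the same 0/1 layout as the example above.
demo = pd.DataFrame({
    "id": list("abcdefghij"),
    "Cond1": [1, 0, 0, 1, 0, 0, 1, 0, 1, 0],
    "Cond2": [0, 0, 1, 1, 0, 1, 0, 0, 1, 0],
    "Cond3": [0, 0, 1, 0, 0, 1, 0, 1, 1, 0],
})
conds = ["Cond1", "Cond2", "Cond3"]
result = quota_sample(demo, conds, num=2, seed=0)
print(result)
print(result[conds].sum())
```

Unlike the retry loop, this never redraws a whole batch, but it can return fewer than `num` rows for a condition when the remaining candidates would all overfill another quota. The draw is random in order but is not claimed to be uniform over all valid samples.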
