Person ID | Condition 1 | Condition 2 | Condition 3 |
---|---|---|---|
A | Yes | No | Yes |
B | No | Yes | No |
C | Yes | No | No |
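Hi! I have to generate a sample from a fairly large dataset, and the inclusion criteria are a bit more complex than what I have handled before. For this sample I need 100 people: 25 should have Condition 1, 25 should have Condition 2, and 25 should have Condition 3. The final 25 should have none of the conditions (that part is easy, no problem there).
The conditions do not have to be mutually exclusive, so one person can have both Condition 1 and Condition 2, but that would count as one case for each condition. Given the nature of the dataset, most people are likely to have some combination of conditions (which is actually preferable to a sample in which the 3 conditions are completely distinct from each other). Does that make sense? It isn't conceptually difficult, I just haven't found a solution yet!
Normally I would use a line like this to filter on the variables needed for the sample:
sample = df[(df['Condition 1'] == "Yes") & (df["Condition 2"] != 0) & (df["Condition 3"] != 0)]
But in this case the sampling scheme is more complex, and this approach would require a lot of trial and error.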
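To make the counting rule concrete (a person with several conditions counts once toward each of them), here is a minimal sketch that tallies a candidate sample. The English column names, the "Yes"/"No" coding, and the helper name `condition_counts` are assumptions made for illustration, not part of the real dataset.

```python
import pandas as pd

# Hypothetical toy data; column names and the "Yes"/"No" coding are assumptions
people = pd.DataFrame({
    "Person ID": ["A", "B", "C"],
    "Condition 1": ["Yes", "No", "Yes"],
    "Condition 2": ["No", "Yes", "No"],
    "Condition 3": ["Yes", "No", "No"],
})
cond_cols = ["Condition 1", "Condition 2", "Condition 3"]

def condition_counts(sample: pd.DataFrame) -> pd.Series:
    """Count rows per condition; a row with several conditions counts once for each."""
    counts = sample[cond_cols].eq("Yes").sum()
    # Rows in which no condition column is "Yes"
    counts["No condition"] = int(sample[cond_cols].ne("Yes").all(axis=1).sum())
    return counts

counts = condition_counts(people)
print(counts)
```

A sample would meet the target when each of the four counts equals 25.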
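Assumptions: you want a random sample, and no row is selected more than once. If you want a specific number of rows for each condition, but a single row may carry several conditions (so the total is fewer than 100 rows), then you can use the following approach, shown here on a simplified example. You need to keep track of the counts for every condition as the samples are collected, and handle the case where there are not enough rows meeting a condition.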
import pandas as pd
num = 2 # number of rows in each of the 4 samples
# create DF with random 0/1 values for each Cond column
df = pd.DataFrame({
    'id': ['a','b','c','d','e','f','g','h','i','j','k','l','m','n','o','p','q','r','s','t','u','v','w','x','y','z'],
    'Cond1': [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    'Cond2': [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
    'Cond3': [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1],
})
# take sample after checking maximum available to meet condition
x = min(num,df[df['Cond1'].eq(1)].shape[0])
sample1 = df[df['Cond1'].eq(1)].sample(x)
# drop all rows meeting Cond1, as we now have the required number (or the maximum available) for Cond1
rest = df.drop(df[df['Cond1'].eq(1)].index)
# count the rows already sampled that also meet the remaining conditions
s1c2 = sample1[sample1['Cond2'].eq(1)].shape[0]
s1c3 = sample1[sample1['Cond3'].eq(1)].shape[0]
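# obtain the remaining rows required for Cond2, or as many as are still available in rest
# (the count must come from rest, not df, or .sample(x) can ask for more rows than exist)
x = min(num - s1c2, rest[rest['Cond2'].eq(1)].shape[0])
# redraw until the Cond3 total does not exceed num
# note: this loops forever if no possible draw satisfies the Cond3 limit
while True:
    sample2 = rest[rest['Cond2'].eq(1)].sample(x)
    c3 = s1c3 + sample2[sample2['Cond3'].eq(1)].shape[0]
    if c3 > num:
        continue
    break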
# drop all rows meeting Cond2, as we now have the required number for Cond2
rest = rest.drop(rest[rest['Cond2'].eq(1)].index)
# obtain any remaining rows required for Cond3, or as many as are still available in rest
x = min(num - c3, rest[rest['Cond3'].eq(1)].shape[0])
sample3 = rest[rest['Cond3'].eq(1)].sample(x)
# obtain the last sample (rows with none of the conditions), or as many as are available
x = min(num, df[df[['Cond1', 'Cond2', 'Cond3']].eq(0).all(1)].shape[0])
sample4 = df[df[['Cond1', 'Cond2', 'Cond3']].eq(0).all(1)].sample(x)
# combine all samples: at most num x 4 rows, fewer when a row counts toward
# several conditions or when too few rows meet a condition
final = pd.concat([sample1, sample2, sample3, sample4])
# add .reset_index(drop=True) if a sequential index is required
print(final)
Running the code 3 times gives:
id Cond1 Cond2 Cond3
19 t 1 0 0
20 u 1 1 0
2 c 0 1 1
18 s 0 0 1
16 q 0 0 0
15 p 0 0 0
id Cond1 Cond2 Cond3
10 k 1 0 1
24 y 1 1 0
25 z 0 1 1
15 p 0 0 0
1 b 0 0 0
id Cond1 Cond2 Cond3
12 m 1 0 0
19 t 1 0 0
9 j 0 1 0
6 g 0 1 0
22 w 0 0 1
5 f 0 0 1
1 b 0 0 0
15 p 0 0 0
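The same steps can be wrapped in a function that handles any number of 0/1 condition columns. The following is a minimal sketch under those assumptions, not a definitive implementation: `quota_sample`, `seed`, and `max_tries` are hypothetical names introduced here. The retry cap is a deliberate change from the open-ended loop, so that an impossible draw raises an error instead of running forever.

```python
import numpy as np
import pandas as pd

def quota_sample(df, cond_cols, num, seed=None, max_tries=1000):
    """Greedy sketch: fill each condition's quota in turn without exceeding num for any condition."""
    rng = np.random.RandomState(seed)      # shared generator, so repeated draws differ
    rest = df
    have = {c: 0 for c in cond_cols}       # running count of sampled rows per condition
    picked = []
    for i, col in enumerate(cond_cols):
        pool = rest[rest[col].eq(1)]
        need = min(num - have[col], len(pool))
        later = cond_cols[i + 1:]
        for _ in range(max_tries):
            draw = pool.sample(need, random_state=rng)
            # accept the draw only if no later condition would exceed its quota
            if all(have[c] + draw[c].sum() <= num for c in later):
                break
        else:
            raise RuntimeError(f"no valid draw for {col} after {max_tries} tries")
        picked.append(draw)
        for c in cond_cols:
            have[c] += int(draw[c].sum())
        rest = rest.drop(pool.index)       # rows with this condition are no longer eligible
    # rows with none of the conditions, or as many as are available
    none_pool = df[df[cond_cols].eq(0).all(axis=1)]
    picked.append(none_pool.sample(min(num, len(none_pool)), random_state=rng))
    return pd.concat(picked)

# Same simplified data as in the example above
demo = pd.DataFrame({
    'id': list('abcdefghijklmnopqrstuvwxyz'),
    'Cond1': [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0],
    'Cond2': [0, 0, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 1],
    'Cond3': [0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1],
})
cond_cols = ['Cond1', 'Cond2', 'Cond3']
final = quota_sample(demo, cond_cols, num=2, seed=0)
print(final)
print(final[cond_cols].sum())              # per-condition totals in the combined sample
```

For the real data the call would use `num=25` and the three actual condition columns, recoded to 0/1 first. Because the approach is greedy, the order of `cond_cols` affects which combinations of conditions end up in the sample.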