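I am working with a dataset of 2,500 samples, and I need to extract a random subset of 666 samples while satisfying specific conditions on two Boolean columns.
The dataset (df) contains the following columns:
- ID
- cond_1 (Boolean: True/False)
- cond_2 (Boolean: True/False)
When sampling the subset, I need to make sure these conditions are met:
- The number of True values in cond_1 must be exactly 181.
- The number of True values in cond_2 must be exactly 181.
- The number of False values in cond_1 must be exactly 485.
- The number of False values in cond_2 must be exactly 485.
Additionally:
- The subset must not consist only of samples where cond_1 and cond_2 are both True or both False. It must also contain True/False and False/True combinations.
Is there a recommended way to do this kind of constrained sampling in Python?
Any code examples or libraries that could help would be greatly appreciated.
Thanks in advance.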
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'ID': range(1, 2501),
    'cond_1': np.random.choice([True, False], size=2500),
    'cond_2': np.random.choice([True, False], size=2500)
})
# Create the groups based on conditions
group_cond1_true = df[df['cond_1'] == True]
group_cond1_false = df[df['cond_1'] == False]
group_cond2_true = df[df['cond_2'] == True]
group_cond2_false = df[df['cond_2'] == False]
# Sample from each group to meet constraints
sample_cond1_true = group_cond1_true.sample(n=181, random_state=42)
sample_cond1_false = group_cond1_false.sample(n=485, random_state=42)
sample_cond2_true = group_cond2_true.sample(n=181, random_state=42)
sample_cond2_false = group_cond2_false.sample(n=485, random_state=42)
# Combine the samples to create the final dataset
df_extracted = pd.concat([sample_cond1_true, sample_cond1_false, sample_cond2_true, sample_cond2_false]).drop_duplicates()
# Check if the conditions are met
print(f"Total length: {len(df_extracted)}")
print(f"cond_1 True count: {df_extracted['cond_1'].sum()}")
print(f"cond_2 True count: {df_extracted['cond_2'].sum()}")
print(f"cond_1 False count: {(~df_extracted['cond_1']).sum()}")
print(f"cond_2 False count: {(~df_extracted['cond_2']).sum()}")
print()
count = df_extracted.groupby(['cond_1', 'cond_2']).size()
print("Extracted counts:\n", count)
The code above produces a df_extracted with 1148 rows instead of 666.
Total length: 1148
cond_1 True count: 454
cond_2 True count: 460
cond_1 False count: 694
cond_2 False count: 688
Extracted counts:
 cond_1  cond_2
False   False     396
        True      298
True    False     292
        True      162
dtype: int64
Code
First, create a DataFrame whose two columns are random permutations of the same True/False array, and add a cumcount to it.
import numpy as np
import pandas as pd
# Store the required numbers of True and False values in variables
n_true, n_false = 181, 485
# Create a True/False array
values = np.array([True] * n_true + [False] * n_false)
# Randomly permute the array twice to create two columns in the DataFrame
df_sample = pd.DataFrame({
    'cond_1': np.random.permutation(values),
    'cond_2': np.random.permutation(values)
})
# Add a cumcount column 'cc' in each group of 'cond_1' and 'cond_2'
df_sample['cc'] = df_sample.groupby(['cond_1', 'cond_2']).cumcount()
df_sample:
     cond_1  cond_2   cc
0      True   False    0
1      True   False    1
2     False   False    0
..      ...     ...  ...
663    True    True   50
664   False    True  128
665   False    True  129
[666 rows x 3 columns]
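Because both columns are permutations of the same array, the True/False and False/True groups always end up the same size, and the four groups always add up to 666. A quick way to see this is a 2x2 table of joint counts. This is only a sketch: the fixed seed and the name `cells` are my additions, and I use `np.random.default_rng` instead of the global `np.random` state purely so the result is repeatable.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seed chosen only to make the sketch repeatable

n_true, n_false = 181, 485
values = np.array([True] * n_true + [False] * n_false)
df_sample = pd.DataFrame({
    'cond_1': rng.permutation(values),
    'cond_2': rng.permutation(values),
})

# 2x2 table of joint counts: rows are cond_1, columns are cond_2
cells = pd.crosstab(df_sample['cond_1'], df_sample['cond_2'])
print(cells)

# The two mixed groups, counted directly with Boolean masks
n_tf = (df_sample['cond_1'] & ~df_sample['cond_2']).sum()
n_ft = (~df_sample['cond_1'] & df_sample['cond_2']).sum()
print(n_tf, n_ft)
```

How the 666 rows split across the four groups changes from run to run (or from seed to seed); only the column totals are fixed.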
Second, shuffle df, then add a cumcount to it so that the IDs matching df_sample can be extracted.
target_id = df_sample.merge(
    df.assign(cc=df.sample(frac=1).groupby(['cond_1', 'cond_2']).cumcount()),
    how='left')['ID']
out = df[df['ID'].isin(target_id)]
out:
        ID  cond_1  cond_2
5        6    True    True
8        9   False    True
12      13   False   False
...    ...     ...     ...
2487  2488    True   False
2489  2490    True   False
2490  2491   False   False
[666 rows x 3 columns]
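One thing that may be worth checking in this step: with how='left', a row of df_sample that finds no partner in df gets ID = NaN, and the isin filter then quietly returns fewer than 666 rows. As far as I can tell, that happens whenever df has fewer rows in one of the four (cond_1, cond_2) groups than df_sample asks for. Below is a sketch of a guard that compares the two before merging. The names `need`, `have`, and `short` are mine, and the data is a seeded stand-in shaped like the question's df.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, only so the sketch is repeatable

# Stand-in data shaped like the question's df
df = pd.DataFrame({
    'ID': range(1, 2501),
    'cond_1': rng.choice([True, False], size=2500),
    'cond_2': rng.choice([True, False], size=2500),
})

values = np.array([True] * 181 + [False] * 485)
df_sample = pd.DataFrame({
    'cond_1': rng.permutation(values),
    'cond_2': rng.permutation(values),
})

keys = ['cond_1', 'cond_2']
# Rows wanted per group
need = df_sample.groupby(keys).size()
# Rows available per group (0 if the group is missing from df)
have = df.groupby(keys).size().reindex(need.index, fill_value=0)

# Groups where df cannot supply enough rows
short = need[need > have]
if not short.empty:
    raise ValueError(f"df has too few rows in these groups:\n{short}")
```

With 2,500 roughly balanced rows this check passes comfortably, but it becomes relevant if df is small or one of the columns is heavily skewed.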
Third, verify that the sample was extracted correctly.
print('number of rows :', len(out))
print('Number of True in all columns : \n', out.filter(like='cond').sum())
Output:
number of rows : 666
Number of True in all columns :
cond_1 181
cond_2 181
dtype: int64
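For reference, the column totals leave exactly one number free. If k is the number of True/True rows, then True/False and False/True must each be 181 - k, and False/False must be 485 - (181 - k) = 304 + k, which adds up to 666 for any k. The approach above picks k implicitly through the random permutations. If you would rather choose k yourself, a sketch along these lines might work. The function name `sample_with_margins` and its parameters are hypothetical (not from any library), and the data is a seeded stand-in for the question's df.

```python
import numpy as np
import pandas as pd

def sample_with_margins(df, n_true, n_total, k, seed=None):
    """Sample n_total rows so that cond_1 and cond_2 each have n_true True values.

    k is the number of sampled rows where both columns are True.
    """
    n_mixed = n_true - k                # size of the True/False and the False/True group
    n_ff = n_total - k - 2 * n_mixed    # size of the False/False group
    if min(k, n_mixed, n_ff) < 0:
        raise ValueError("k is not compatible with n_true and n_total")
    sizes = {(True, True): k, (True, False): n_mixed,
             (False, True): n_mixed, (False, False): n_ff}
    parts = []
    for (c1, c2), n in sizes.items():
        cell = df[(df['cond_1'] == c1) & (df['cond_2'] == c2)]
        # DataFrame.sample raises ValueError if the group has fewer than n rows
        parts.append(cell.sample(n=n, random_state=seed))
    return pd.concat(parts).sort_index()

# Stand-in data shaped like the question's df
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'ID': range(1, 2501),
    'cond_1': rng.choice([True, False], size=2500),
    'cond_2': rng.choice([True, False], size=2500),
})

# k = 50 is an arbitrary choice; any 0 < k < 181 keeps all four groups non-empty
out = sample_with_margins(df, n_true=181, n_total=666, k=50, seed=0)
print(out.groupby(['cond_1', 'cond_2']).size())
```

Choosing k strictly between 0 and 181 guarantees that the subset contains both matching and mixed combinations, which is the extra requirement in the question.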