How can I randomly sample a dataset in Python under specific constraints on two Boolean conditions?

Question

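I am working with a dataset of 2,500 samples, and I need to extract a random subset of 666 samples that also satisfies specific conditions based on two Boolean columns.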

The dataset (`df`) contains the following columns:

`ID`
`cond_1` (Boolean: True/False)
`cond_2` (Boolean: True/False)

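When sampling the subset, I need to make sure these conditions are met:

The number of `True` values in `cond_1` should be exactly 181.
The number of `True` values in `cond_2` should be exactly 181.
The number of `False` values in `cond_1` should be exactly 485.
The number of `False` values in `cond_2` should be exactly 485.
The total number of samples in the subset should be exactly 666.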

Additionally:

Because this is random sampling, the subset should not consist entirely of samples in which `cond_1` and `cond_2` are both `True` or both `False`. It must also include `True/False` or `False/True` combinations.

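Is there a recommended way to implement this kind of constrained sampling in Python?

Any code examples or libraries that could help would be greatly appreciated.

Thanks in advance.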

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'ID': range(1, 2501),
    'cond_1': np.random.choice([True, False], size=2500),
    'cond_2': np.random.choice([True, False], size=2500)
})

# Create the groups based on conditions
group_cond1_true = df[df['cond_1'] == True]
group_cond1_false = df[df['cond_1'] == False]

group_cond2_true = df[df['cond_2'] == True]
group_cond2_false = df[df['cond_2'] == False]

# Sample from each group to meet constraints
sample_cond1_true = group_cond1_true.sample(n=181, random_state=42)
sample_cond1_false = group_cond1_false.sample(n=485, random_state=42)

sample_cond2_true = group_cond2_true.sample(n=181, random_state=42)
sample_cond2_false = group_cond2_false.sample(n=485, random_state=42)

# Combine the samples to create the final dataset
df_extracted = pd.concat([sample_cond1_true, sample_cond1_false, sample_cond2_true, sample_cond2_false]).drop_duplicates()

# Check if the conditions are met
print(f"Total length: {len(df_extracted)}")
print(f"cond_1 True count: {df_extracted['cond_1'].sum()}")
print(f"cond_2 True count: {df_extracted['cond_2'].sum()}")
print(f"cond_1 False count: {(~df_extracted['cond_1']).sum()}")
print(f"cond_2 False count: {(~df_extracted['cond_2']).sum()}")
print()
count = df_extracted.groupby(['cond_1', 'cond_2']).size()
print("Extracted counts:\n", count)

The code above produces a `df_extracted` of size 1148 instead of 666.

Total length: 1148
cond_1 True count: 454
cond_2 True count: 460
cond_1 False count: 694
cond_2 False count: 688

Extracted counts:
 cond_1  cond_2
False   False     396
        True      298
True    False     292
        True      162
dtype: int64

Answer 1

Code

First, create a DataFrame from two randomly permuted arrays of True and False values, and add a cumcount within each (`cond_1`, `cond_2`) group.

import numpy as np
import pandas as pd

# Store the required numbers of True and False values
n_true, n_false = 181, 485

# Create a True/False array
values = np.array([True] * n_true + [False] * n_false)

# Randomly permute the array twice to create two columns in the DataFrame
df_sample = pd.DataFrame({
    'cond_1': np.random.permutation(values),
    'cond_2': np.random.permutation(values)
})

# Add a cumcount column 'cc' in each group of 'cond_1' and 'cond_2'
df_sample['cc'] = df_sample.groupby(['cond_1', 'cond_2']).cumcount()

df_sample:

     cond_1  cond_2   cc
0      True   False    0
1      True   False    1
2     False   False    0
..      ...     ...  ...
663    True    True   50
664   False    True  128
665   False    True  129

[666 rows x 3 columns]
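The two columns are permuted independently, so the number of rows in each of the four combinations is random, but the column totals tie them together: if `k` rows are True/True, then True/False and False/True each hold `181 - k` rows and False/False holds `485 - 181 + k`. A minimal sketch of that arithmetic, assuming a fixed seed purely for repeatability (the names `both_true`, `only_c1`, `only_c2` and `neither` are illustrative, not from the answer):

```python
import numpy as np

n_true, n_false = 181, 485
rng = np.random.default_rng(0)  # assumed seed, only so the run is repeatable

# Same construction as df_sample: one array, permuted twice independently
values = np.array([True] * n_true + [False] * n_false)
c1 = rng.permutation(values)
c2 = rng.permutation(values)

# Count the rows in each of the four (cond_1, cond_2) combinations
both_true = int((c1 & c2).sum())
only_c1 = int((c1 & ~c2).sum())   # True/False
only_c2 = int((~c1 & c2).sum())   # False/True
neither = int((~c1 & ~c2).sum())

print(both_true, only_c1, only_c2, neither)

# The column totals force these relationships, whatever the permutation
assert only_c1 == n_true - both_true
assert only_c2 == n_true - both_true
assert neither == n_false - n_true + both_true
```

Since both mixed combinations have the same size, the requirement that True/False and False/True rows appear is met whenever `both_true` is below 181.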

Second, shuffle `df`, then add a cumcount so that the IDs matching `df_sample` can be extracted.

target_id = df_sample.merge(
    df.assign(cc=df.sample(frac=1).groupby(['cond_1', 'cond_2']).cumcount()), 
    how='left')['ID']
out = df[df['ID'].isin(target_id)]

out:

        ID  cond_1  cond_2
5        6    True    True
8        9   False    True
12      13   False   False
...    ...     ...     ...
2487  2488    True   False
2489  2490    True   False
2490  2491   False   False

[666 rows x 3 columns]
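The left merge quietly yields a missing `ID` whenever `df` has fewer rows in some combination than `df_sample` asks for. As a hedged alternative sketch (not the answer's method), the same per-combination counts can be drawn directly with `DataFrame.sample` after an explicit feasibility check. The seed and the names `template`, `need` and `have` are assumptions made for this illustration.

```python
import numpy as np
import pandas as pd

seed = 42  # arbitrary, assumed only for repeatability
rng = np.random.default_rng(seed)

# Stand-in for the question's dataset
df = pd.DataFrame({
    "ID": range(1, 2501),
    "cond_1": rng.choice([True, False], size=2500),
    "cond_2": rng.choice([True, False], size=2500),
})

n_true, n_false = 181, 485
values = np.array([True] * n_true + [False] * n_false)
template = pd.DataFrame({
    "cond_1": rng.permutation(values),
    "cond_2": rng.permutation(values),
})

# Rows required and rows available per (cond_1, cond_2) combination
need = template.groupby(["cond_1", "cond_2"]).size()
have = df.groupby(["cond_1", "cond_2"]).size().reindex(need.index, fill_value=0)

short = need[need > have]
if not short.empty:
    raise ValueError(f"not enough rows for combinations: {short.to_dict()}")

# Draw the required number of rows from each combination
parts = []
for i, ((c1, c2), n) in enumerate(need.items()):
    cell = df[(df["cond_1"] == c1) & (df["cond_2"] == c2)]
    parts.append(cell.sample(n=int(n), random_state=seed + i))

out = pd.concat(parts).sort_index()
print(len(out), int(out["cond_1"].sum()), int(out["cond_2"].sum()))
```

Raising an error on an infeasible draw is one design choice; redrawing `template` until it fits would be another.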

Third, verify that the sample was extracted correctly.

print('number of rows :', len(out))
print('Number of True in all columns : \n', out.filter(like='cond').sum())

Output:

number of rows : 666
Number of True in all columns : 
 cond_1    181
cond_2    181
dtype: int64
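The check above covers the row count and the True counts. A slightly fuller check, sketched here under the assumption that the result has the columns `ID`, `cond_1` and `cond_2`, could also confirm the False counts, the uniqueness of `ID`, and the presence of at least one mixed combination. The helper `check_sample` and the toy frame are hypothetical and used only for illustration.

```python
import pandas as pd

def check_sample(sample, n_true, n_false):
    """Return True if the sample meets every constraint from the question."""
    c1, c2 = sample["cond_1"], sample["cond_2"]
    total_ok = len(sample) == n_true + n_false
    true_ok = c1.sum() == n_true and c2.sum() == n_true
    false_ok = (~c1).sum() == n_false and (~c2).sum() == n_false
    mixed_ok = (c1 != c2).any()        # at least one True/False or False/True row
    unique_ok = sample["ID"].is_unique  # no row was picked twice
    return bool(total_ok and true_ok and false_ok and mixed_ok and unique_ok)

# Toy data: 2 True and 3 False values per column, with mixed rows present
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4, 5],
    "cond_1": [True, True, False, False, False],
    "cond_2": [True, False, True, False, False],
})
print(check_sample(toy, n_true=2, n_false=3))
```

For the real result, the call would be `check_sample(out, 181, 485)`.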
