You can use StratifiedShuffleSplit, which splits a dataset into two parts, data_train and data_test. Both parts are stratified based on the vector y.
For your problem, you can set the test size to the number of rows you need and then take the resulting test subset as your sample.
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit
# dataset
X = np.random.random(size = [1000, 2])
# your 'ids/groups' for stratification
y = np.random.randint(10, size=(X.shape[0],))
# n_splits - number of different samples to create
# test_size - number of rows required in sampled dataset
sss = StratifiedShuffleSplit(n_splits=500, test_size=100, random_state=0)
# iterate through generations.
# stratifications will be done based on `y`
for _, test_index in sss.split(X, y):
    print(test_index.shape)
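If only one sample is needed, the loop above can be reduced to a single split. The following is a minimal sketch under that assumption; it also checks that the class proportions of y are roughly preserved in the sample. The names X_sample, y_sample, full_share, sample_share and max_gap are illustrative and not part of scikit-learn.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic data, same shape as in the snippet above
rng = np.random.default_rng(0)
X = rng.random(size=(1000, 2))
y = rng.integers(10, size=X.shape[0])

# One stratified draw of exactly 100 rows
sss = StratifiedShuffleSplit(n_splits=1, test_size=100, random_state=0)
_, test_index = next(sss.split(X, y))

# The "test" part is the stratified sample (names are illustrative)
X_sample, y_sample = X[test_index], y[test_index]

# Compare per-class shares in the full data and in the sample
full_share = np.bincount(y, minlength=10) / y.size
sample_share = np.bincount(y_sample, minlength=10) / y_sample.size
max_gap = np.abs(full_share - sample_share).max()

print(X_sample.shape)
print(max_gap)
```

Because per-class counts have to be rounded to whole rows, the shares will not match exactly, but the gap should stay small for a sample of this size.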