How can I use sklearn's cross_validate to weight samples for scoring only?


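I am running a regression task on a dataset composed of authentic and augmented samples. The augmented samples are generated by jittering the authentic ones. I want to select the best-performing model through cross-validation with sklearn.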

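To do this, I want to:

  • Train the model on the full set of authentic and augmented samples. The fit should not take the origin of a sample into account (i.e. it should be equivalent to running
    estimator.fit(..., sample_weight=[1, 1, ..., 1])
    ).
  • Score each model on its performance on the authentic samples only. To that end, I am considering setting the weight of augmented samples to 0 and the weight of authentic samples to 1.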

How can I achieve this with sklearn's cross_validate?

I tried the following:

from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
import numpy as np

n_smpl, n_feats = 100, 5
arr_source = np.random.random((n_smpl, n_feats))
arr_target = np.random.random((n_smpl, n_feats))
arr_weight = np.random.randint(0, 2, n_smpl)  # 0 for augmented, 1 for authentic

model = RandomForestRegressor()
kfold_splitter = model_selection.KFold(n_splits=5, random_state=7, shuffle=True)
my_scorers = {
    "r2_weighted": make_scorer(r2_score, sample_weight=arr_weight),
    "mse_weighted": make_scorer(mean_squared_error, greater_is_better=False, sample_weight=arr_weight)
}

cv_results = model_selection.cross_validate(model, arr_source, arr_target, scoring = my_scorers, cv=kfold_splitter)

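But this raises

ValueError: Found input variables with inconsistent numbers of samples: [20, 20, 100]

I understand this happens because cross_validate cannot split the sample weights according to the folds: each test fold holds 20 samples, while the weight array fixed inside make_scorer still has all 100 entries.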
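The mismatch can be reproduced without cross_validate. The following is a minimal sketch, assuming a hypothetical 20-sample fold (test_idx) and deterministic 0/1 weights chosen purely for illustration; it only shows that the metric accepts weights once they are sliced with the same indices as the fold.

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
n_smpl = 100
y_true = rng.random(n_smpl)
y_pred = rng.random(n_smpl)
# Illustrative weights: alternating 0 (augmented) and 1 (authentic).
weights = np.tile([0, 1], n_smpl // 2)

# Hypothetical stand-in for the indices of one test fold.
test_idx = np.arange(20)

# Full-length weights against a 20-sample fold: the lengths disagree.
try:
    r2_score(y_true[test_idx], y_pred[test_idx], sample_weight=weights)
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print(err)

# Weights sliced with the same indices as the fold: the call succeeds.
fold_score = r2_score(
    y_true[test_idx], y_pred[test_idx], sample_weight=weights[test_idx]
)
```

Whatever mechanism is used in the end, it has to perform this per-fold slicing of the weights on the scoring side.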

Is there a way to make this work with cross_validate? Or is there another approach?

python machine-learning scikit-learn cross-validation sampling
1 Answer

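I found what I was looking for in the metadata routing feature, which makes it possible to pass different parameters to the scorers and to the estimator.

The steps are:

  1. Enable metadata routing:
    sklearn.set_config(enable_metadata_routing=True)
  2. Turn off routing of sample_weight to the estimator's fit:
    RandomForestRegressor().set_fit_request(sample_weight=False)
  3. Turn on routing of sample_weight to the scorer:
    make_scorer(r2_score).set_score_request(sample_weight=True)
  4. Pass the sample weights through the params argument of cross_validate:
    model_selection.cross_validate(..., params={"sample_weight": arr_weight})

The full code is: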

from sklearn import model_selection, set_config
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
import numpy as np

n_smpl, n_feats = 100, 5
arr_source = np.random.random((n_smpl, n_feats))
arr_target = np.random.random((n_smpl, n_feats))
arr_weight = np.random.randint(0, 2, n_smpl)  # 0 for augmented, 1 for authentic

set_config(enable_metadata_routing=True)
model = RandomForestRegressor().set_fit_request(sample_weight=False)
kfold_splitter = model_selection.KFold(n_splits=5, random_state=7, shuffle=True)
my_scorers = {
    "r2_weighted": make_scorer(r2_score).set_score_request(sample_weight=True),
    "mse_weighted": make_scorer(mean_squared_error, greater_is_better=False).set_score_request(sample_weight=True)
}

cv_results = model_selection.cross_validate(model, arr_source, arr_target, scoring = my_scorers, cv=kfold_splitter, params={"sample_weight": arr_weight})
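As a cross-check that does not rely on metadata routing, the same idea can be written as a manual loop. This is only a sketch under assumptions: the names X, y and is_authentic, the single target column, the alternating 0/1 weights and the smaller forest are illustrative choices, not part of the code above. It fits on every training sample without weights, scores each fold with the sliced weights, and compares that with scoring only the authentic rows.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold

rng = np.random.default_rng(7)
n_smpl, n_feats = 100, 5
X = rng.random((n_smpl, n_feats))
y = rng.random(n_smpl)  # single target for brevity (assumption)
is_authentic = np.tile([0, 1], n_smpl // 2)  # 0 = augmented, 1 = authentic

splitter = KFold(n_splits=5, shuffle=True, random_state=7)
weighted_scores, subset_scores = [], []
for train_idx, test_idx in splitter.split(X):
    # Small forest to keep the sketch fast (assumption).
    model = RandomForestRegressor(n_estimators=10, random_state=0)
    # Unweighted fit on all training samples, authentic and augmented alike.
    model.fit(X[train_idx], y[train_idx])
    y_pred = model.predict(X[test_idx])

    # Score with the weights sliced to this fold.
    w = is_authentic[test_idx]
    weighted_scores.append(r2_score(y[test_idx], y_pred, sample_weight=w))

    # Score on the authentic rows only, for comparison.
    keep = w == 1
    subset_scores.append(r2_score(y[test_idx][keep], y_pred[keep]))

print(np.allclose(weighted_scores, subset_scores))
```

With 0/1 weights the two lists should agree, which is the behaviour the weighted scorers are meant to provide: augmented samples contribute to training but not to the score.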