I'm running a regression task on a dataset made up of authentic samples and augmented samples, where the augmented samples are generated by jittering the authentic ones. I want to use
sklearn
cross-validation to select the best-performing model, but I want the scoring to count only the authentic samples. In other words, I'd like to pass per-sample weights, along the lines of
estimator.fit(..., sample_weight=[1, 1, ..., 1])
, and have cross_validate route them correctly. How can I do this?
I tried the following:
from sklearn import model_selection
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
import numpy as np
n_smpl, n_feats = 100, 5
arr_source = np.random.random((n_smpl, n_feats))
arr_target = np.random.random((n_smpl, n_feats))
arr_weight = np.random.randint(0, 2, n_smpl) # 0 for augmented, 1 for authentic
model = RandomForestRegressor()
kfold_splitter = model_selection.KFold(n_splits=5, random_state=7, shuffle=True)
my_scorers = {
"r2_weighted": make_scorer(r2_score, sample_weight=arr_weight),
"mse_weighted": make_scorer(mean_squared_error, greater_is_better=False, sample_weight=arr_weight)
}
cv_results = model_selection.cross_validate(model, arr_source, arr_target, scoring=my_scorers, cv=kfold_splitter)
But this returns
ValueError: Found input variables with inconsistent numbers of samples: [20, 20, 100]
I understand why: cross_validate does not split the sample weights along with the folds, so the full 100-element weight array is handed to a scorer that only sees a 20-sample fold.
Is there a way to make this work through cross-validation? Or is there some other approach?
I found what I was looking for in scikit-learn's metadata routing feature, which can route different parameters to the scorers and to the estimator (note this needs a recent scikit-learn; cross_validate's params argument was added in 1.4). The steps are:
sklearn.set_config(enable_metadata_routing=True)
RandomForestRegressor().set_fit_request(sample_weight=False)
make_scorer(r2_score).set_score_request(sample_weight=True)
model_selection.cross_validate(..., params={"sample_weight": arr_weight})
The full code:
from sklearn import model_selection, set_config
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score, mean_squared_error, make_scorer
import numpy as np
n_smpl, n_feats = 100, 5
arr_source = np.random.random((n_smpl, n_feats))
arr_target = np.random.random((n_smpl, n_feats))
arr_weight = np.random.randint(0, 2, n_smpl) # 0 for augmented, 1 for authentic
set_config(enable_metadata_routing=True)
model = RandomForestRegressor().set_fit_request(sample_weight=False)
kfold_splitter = model_selection.KFold(n_splits=5, random_state=7, shuffle=True)
my_scorers = {
"r2_weighted": make_scorer(r2_score).set_score_request(sample_weight=True),
"mse_weighted": make_scorer(mean_squared_error, greater_is_better=False).set_score_request(sample_weight=True)
}
cv_results = model_selection.cross_validate(
    model, arr_source, arr_target,
    scoring=my_scorers, cv=kfold_splitter,
    params={"sample_weight": arr_weight},
)