Implementing GridSearchCV and Pipelines to tune the hyperparameters of a KNN algorithm

Problem description

I have been reading about hyperparameter tuning for the KNN algorithm, and I understand that the best practice is to make sure that, for every fold, my dataset is normalized and oversampled inside a pipeline (to avoid data leakage and overfitting). What I am trying to do is determine the best possible number of neighbors (n_neighbors) that gives me the highest accuracy in training. In the code, I set the candidate numbers of neighbors to the list range(1, 50) and the number of cross-validation folds to cv=10.

My code is as follows:

# dataset reading & preprocessing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler

#oversampling
from imblearn.over_sampling import SMOTE

#KNN Model related Libraries
import cuml 
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier

#loading the dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/IanDataset.csv")

#filling missing values with zeros
df = df.fillna(0)

#replace the byte-string values with plain numeric strings
df["command response"].replace({"b'0'": "0", "b'1'": "1"}, inplace=True)
df["binary result"].replace({"b'0'": "0", "b'1'": "1"}, inplace=True)

#change the datatype of these features so they can be used later
df["command response"] = pd.to_numeric(df["command response"]).astype(float)
df["binary result"] = pd.to_numeric(df["binary result"]).astype(int)

# dataset splitting
X = df.iloc[:, 0:17]
y_bin = df.iloc[:, 17]

# splitting the dataset into train and test for binary classification
X_train, X_test, y_bin_train, y_bin_test = train_test_split(X, y_bin, random_state=0, test_size=0.2)

#making a pipeline that normalizes, oversamples and applies the classifier before GridSearchCV
pipe = Pipeline([
        ('normalization', MinMaxScaler()),
        ('oversampling', SMOTE()),
        ('classifier', KNeighborsClassifier(metric='eculidean', output='input'))
])

#Using GridSearchCV
neighbors = list(range(1,50))
parameters = {
    'classifier__n_neighbors': neighbors 
}

grid_search = GridSearchCV(pipe, parameters, cv=10)
grid_search.fit(X_train, y_bin_train)

print("Best Accuracy: {}" .format(grid_search.best_score_))
print("Best num of neighbors: {}" .format(grid_search.best_estimator_.get_params()['n_neighbors']))

At the step grid_search.fit(X_train, y_bin_train), the error that the program keeps repeating is:

/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:619: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py", line 266, in fit
    self._final_estimator.fit(Xt, yt, **fit_params_last_step)
  File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
    return func(*args, **kwargs)
  File "cuml/neighbors/kneighbors_classifier.pyx", line 176, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.fit
  File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
    return func(*args, **kwargs)
  File "cuml/neighbors/nearest_neighbors.pyx", line 397, in cuml.neighbors.nearest_neighbors.NearestNeighbors.fit
ValueError: Metric eculidean is not valid. Use sorted(cuml.neighbors.VALID_METRICS[brute]) to get valid options.

I am not sure which side this error is coming from. Is it because I am importing the KNN algorithm from the cuML library instead of sklearn? Or is there a problem with my Pipeline and GridSearchCV implementation?

python machine-learning scikit-learn gridsearchcv rapids
1 Answer

This error means that you passed an invalid value for the metric parameter (this applies to both scikit-learn and cuML). You misspelled "euclidean".
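To double-check which metric names cuML accepts, you can inspect the mapping that the traceback itself points to (a quick sketch, assuming cuml.neighbors exposes VALID_METRICS the way the error message suggests; 'euclidean' should appear under the brute-force backend):

# Print the metric names accepted by cuML's brute-force nearest-neighbors backend;
# VALID_METRICS is the dictionary referenced in the ValueError above.
import cuml.neighbors
print(sorted(cuml.neighbors.VALID_METRICS['brute']))

With the spelling corrected to 'euclidean', the same pipeline fits without the error: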

import cuml
from sklearn import datasets

from sklearn.preprocessing import MinMaxScaler

from imblearn.over_sampling import SMOTE

from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier

X, y = datasets.make_classification(
    n_samples=100
)

pipe = Pipeline([
        ('normalization', MinMaxScaler()),
        ('oversampling', SMOTE()),
        ('classifier', KNeighborsClassifier(metric='euclidean', output='input'))
])

parameters = {
    'classifier__n_neighbors': [1,3,6]
}

grid_search = GridSearchCV(pipe, parameters, cv=2)
grid_search.fit(X, y)

GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('normalization', MinMaxScaler()),
                                       ('oversampling', SMOTE()),
                                       ('classifier', KNeighborsClassifier())]),
             param_grid={'classifier__n_neighbors': [1, 3, 6]})
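After the search finishes, a minimal follow-up to read out the winning configuration (assuming the classifier step is still named 'classifier', as above) would be:

# Best cross-validated accuracy found by the grid search
print("Best Accuracy: {}".format(grid_search.best_score_))

# Because the KNN classifier sits inside a pipeline, the tuned value lives under
# the prefixed key 'classifier__n_neighbors', not plain 'n_neighbors'.
print("Best num of neighbors: {}".format(grid_search.best_params_['classifier__n_neighbors']))

Note that grid_search.best_estimator_.get_params()['n_neighbors'] in the original code would raise a KeyError, since the best estimator is the whole pipeline; use the pipeline-prefixed key (or grid_search.best_params_) instead.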