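I have been reading about hyperparameter tuning for the KNN algorithm, and I understand that the best practice is to normalize and oversample the dataset inside a pipeline, so that both steps are applied separately within each fold (to avoid data leakage and overfitting). What I am trying to do is find the number of neighbors (n_neighbors) that gives me the best accuracy in training. In the code, I set the candidate numbers of neighbors to the list range(1, 50) and the number of cross-validation folds to cv=10.
My code is below: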
# dataset reading & preprocessing libraries
import pandas as pd
import numpy as np
from sklearn.preprocessing import MinMaxScaler
# oversampling
from imblearn.over_sampling import SMOTE
#KNN Model related Libraries
import cuml
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier
#loading the dataset
df = pd.read_csv("/content/drive/MyDrive/Colab Notebooks/dataset/IanDataset.csv")
#filling missing values with zeros
df = df.fillna(0)
# replace the byte-string labels (stored as objects) with plain digit strings
df["command response"].replace({"b'0'": "0", "b'1'": "1"}, inplace=True)
df["binary result"].replace({"b'0'": "0", "b'1'": "1"}, inplace=True)
#change the datatype of some features to be able to be used later
df["command response"] = pd.to_numeric(df["command response"]).astype(float)
df["binary result"] = pd.to_numeric(df["binary result"]).astype(int)
# dataset splitting
X = df.iloc[:, 0:17]
y_bin = df.iloc[:, 17]
# splitting the dataset into train and test sets for binary classification
X_train, X_test, y_bin_train, y_bin_test = train_test_split(X, y_bin, random_state=0, test_size=0.2)
# making a pipeline that normalizes, oversamples and classifies, to pass to GridSearchCV
pipe = Pipeline([
    ('normalization', MinMaxScaler()),
    ('oversampling', SMOTE()),
    ('classifier', KNeighborsClassifier(metric='eculidean', output='input'))
])
#Using GridSearchCV
neighbors = list(range(1, 50))
parameters = {
    'classifier__n_neighbors': neighbors
}
grid_search = GridSearchCV(pipe, parameters, cv=10)
grid_search.fit(X_train, y_bin_train)
print("Best Accuracy: {}" .format(grid_search.best_score_))
print("Best num of neighbors: {}" .format(grid_search.best_estimator_.get_params()['n_neighbors']))
At the step grid_search.fit(X_train, y_bin_train), the program keeps repeating the following error:
/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py:619: FitFailedWarning: Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details:
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/site-packages/sklearn/model_selection/_validation.py", line 598, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/usr/local/lib/python3.7/dist-packages/imblearn/pipeline.py", line 266, in fit
    self._final_estimator.fit(Xt, yt, **fit_params_last_step)
  File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
    return func(*args, **kwargs)
  File "cuml/neighbors/kneighbors_classifier.pyx", line 176, in cuml.neighbors.kneighbors_classifier.KNeighborsClassifier.fit
  File "/usr/local/lib/python3.7/site-packages/cuml/internals/api_decorators.py", line 409, in inner_with_setters
    return func(*args, **kwargs)
  File "cuml/neighbors/nearest_neighbors.pyx", line 397, in cuml.neighbors.nearest_neighbors.NearestNeighbors.fit
ValueError: Metric is not valid. Use sorted(cuml.neighbors.VALID_METRICSeculidean[brute]) to get valid options.
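I am not sure where this error comes from. Is it because I am importing the KNN algorithm from the cuML library instead of sklearn? Or is there something wrong with my Pipeline and GridSearchCV implementation?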
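This error means that you passed an invalid value for the metric parameter (the value would be invalid in both scikit-learn and cuML). You misspelled "euclidean" as "eculidean".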
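Because the failure only surfaces as a FitFailedWarning on every fold, it can help to check the metric string before starting the grid search. The following is a minimal sketch using only the standard library; the helper check_metric and the list KNOWN_METRICS are hypothetical names introduced for illustration, and the list is an assumption rather than cuML's actual set of supported metrics (the error message above points to sorted(cuml.neighbors.VALID_METRICS[...]) for the authoritative list).

```python
import difflib

# Hypothetical list for illustration only; it is NOT cuML's real set of metrics.
KNOWN_METRICS = ("euclidean", "manhattan", "chebyshev", "minkowski", "cosine")


def check_metric(name, known=KNOWN_METRICS):
    """Return `name` if it is in `known`; otherwise raise ValueError with a suggestion."""
    if name in known:
        return name
    # Suggest the closest known spelling, if any is similar enough.
    suggestions = difflib.get_close_matches(name, known, n=1)
    hint = f" Did you mean {suggestions[0]!r}?" if suggestions else ""
    raise ValueError(f"Unknown metric {name!r}.{hint}")


try:
    check_metric("eculidean")  # the misspelling from the question
except ValueError as err:
    message = str(err)
    print(message)
```

Calling check_metric on the metric string before building the pipeline turns the typo into a single readable error instead of one warning per fold and parameter combination.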
# The same pipeline with the metric spelled correctly, fitted on a small synthetic dataset
import cuml
from sklearn import datasets
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import train_test_split, GridSearchCV
from cuml.neighbors import KNeighborsClassifier
X, y = datasets.make_classification(
    n_samples=100
)
pipe = Pipeline([
    ('normalization', MinMaxScaler()),
    ('oversampling', SMOTE()),
    ('classifier', KNeighborsClassifier(metric='euclidean', output='input'))
])
parameters = {
    'classifier__n_neighbors': [1, 3, 6]
}
grid_search = GridSearchCV(pipe, parameters, cv=2)
grid_search.fit(X, y)
GridSearchCV(cv=2,
             estimator=Pipeline(steps=[('normalization', MinMaxScaler()),
                                       ('oversampling', SMOTE()),
                                       ('classifier', KNeighborsClassifier())]),
             param_grid={'classifier__n_neighbors': [1, 3, 6]})
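Separately from the metric typo, the last print in the question looks up 'n_neighbors' on best_estimator_.get_params(). Since the best estimator is a pipeline, the parameter should be stored under the step-prefixed name 'classifier__n_neighbors', and it is also available directly from best_params_. The sketch below illustrates this under stated assumptions: it uses scikit-learn's own Pipeline and KNeighborsClassifier as CPU stand-ins for the imblearn pipeline and the cuML classifier, and it leaves out SMOTE so that it only depends on scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier  # CPU stand-in for cuml's classifier
from sklearn.pipeline import Pipeline               # stand-in for imblearn.pipeline.Pipeline
from sklearn.preprocessing import MinMaxScaler

# Small synthetic dataset, as in the answer above (seed fixed for repeatability)
X, y = make_classification(n_samples=100, random_state=0)

pipe = Pipeline([
    ("normalization", MinMaxScaler()),
    ("classifier", KNeighborsClassifier(metric="euclidean")),
])
grid_search = GridSearchCV(pipe, {"classifier__n_neighbors": [1, 3, 6]}, cv=2)
grid_search.fit(X, y)

# The tuned value is keyed by "<step name>__<parameter name>"
best_k = grid_search.best_params_["classifier__n_neighbors"]
same_k = grid_search.best_estimator_.get_params()["classifier__n_neighbors"]

print("Best accuracy: {}".format(grid_search.best_score_))
print("Best num of neighbors: {}".format(best_k))
```

Reading the value from best_params_ keeps the lookup tied to the same key used in the parameter grid, so it stays valid if the step is renamed in both places.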