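When I change the order of the columns (the feature order) in scikit-learn linear models with regularization, I get different scores. I have tried this with ElasticNet and Lasso. I am using scikit-learn==0.23.1.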
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': [16, 32, 64, 12, 5, 256],
    'col3': [7, 8, 9, 10, 12, 11],
    'out': [40, 5, 60, 7, 9, 100]})
print(df)
X_df = df[['col1', 'col2', 'col3']]
y_df = df['out']
regr = linear_model.ElasticNet(alpha=0.1, random_state=0)
regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))
# change the order to: [col2, col1, col3]
first_cols = ['col2']
cols = first_cols.copy()
for c in X_df.columns:
    if c not in cols:
        cols.append(c)
X_df = X_df[cols]
regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print("\nReorder:")
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))
The output of the above is:
   col1  col2  col3  out
0     1    16     7   40
1     2    32     8    5
2     3    64     9   60
3     4    12    10    7
4     5     5    12    9
5     6   256    11  100
R2: 0.8277462579081043
MSE: 207.13034003933535
Reorder:
R2: 0.8277586094134455
MSE: 207.11548769725997
Why does this happen?
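The difference is due to the tol parameter.
From the documentation:
tol : float, default=1e-4
The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
The solver stops as soon as this tolerance is met, so the two column orders can stop at slightly different points near the same optimum. To get the precision you want, just pass tol=1e-12 in both cases.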
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics
df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': [16, 32, 64, 12, 5, 256],
    'col3': [7, 8, 9, 10, 12, 11],
    'out': [40, 5, 60, 7, 9, 100]})
# print(df)
X_df = df[['col1', 'col2', 'col3']]
y_df = df['out']
regr = linear_model.ElasticNet(alpha=0.1, random_state=0, tol=1e-12)
regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print(regr.coef_)
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))
# change the order to: [col2, col1, col3]
first_cols = ['col2']
cols = first_cols.copy()
for c in X_df.columns:
    if c not in cols:
        cols.append(c)
X_df = X_df[cols]
regr = linear_model.ElasticNet(alpha=0.1, random_state=0, tol=1e-12)
regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print("\nReorder:")
print(regr.coef_)
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))
Output:
[-8.92519779 0.42980208 3.59812779]
R2: 0.8277593357239204
MSE: 207.11461432908925
Reorder:
[ 0.42980208 -8.92519779 3.59812779]
R2: 0.8277593357240851
MSE: 207.11461432889112
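To check the effect of tol without comparing printed numbers by eye, the coefficients from both column orders can be aligned by column name and subtracted. The following is only a minimal sketch under some assumptions: the helper names fit_coefs and max_coef_gap are hypothetical, and max_iter is raised as a precaution that is not part of the original code. random_state is left out because it only affects the fit when selection='random'.

```python
import pandas as pd
from sklearn.linear_model import ElasticNet

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': [16, 32, 64, 12, 5, 256],
    'col3': [7, 8, 9, 10, 12, 11],
    'out': [40, 5, 60, 7, 9, 100]})


def fit_coefs(columns, tol):
    """Fit ElasticNet on the given column order (hypothetical helper).

    Returns the coefficients as a Series keyed by column name.
    """
    # Assumption: max_iter is raised so that a very tight tol can be reached.
    model = ElasticNet(alpha=0.1, tol=tol, max_iter=100_000)
    model.fit(df[columns], df['out'])
    return pd.Series(model.coef_, index=columns)


def max_coef_gap(tol):
    """Largest absolute coefficient difference between the two column orders."""
    a = fit_coefs(['col1', 'col2', 'col3'], tol)
    b = fit_coefs(['col2', 'col1', 'col3'], tol)
    # Series subtraction aligns on the index (column names), not on position.
    return (a - b).abs().max()


gap_default = max_coef_gap(1e-4)
gap_tight = max_coef_gap(1e-12)
print("tol=1e-4: ", gap_default)
print("tol=1e-12:", gap_tight)
```

If the explanation above holds, the gap for tol=1e-12 should be far smaller than the gap for the default tolerance.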
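Changing the column order changes the order of operations during training: the coordinate descent solver updates the coefficients one feature at a time, by default in column order. In exact arithmetic, and with the solver run to full convergence, this would make no difference, but because of floating-point precision loss you can get slightly different values after reordering the columns.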
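The floating-point part of this can be seen without scikit-learn at all. As a small illustration (not taken from the original posts), adding the same three numbers in a different order gives results that differ in the last digit:

```python
import math

a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # add a and b first
right = a + (b + c)   # add b and c first

print(left, right)
print(left == right)               # False: the rounding differs with the order
print(math.isclose(left, right))   # True: the values agree within tolerance
```

For this reason, results from differently ordered features are better compared with a tolerance (for example math.isclose or numpy.allclose) than with exact equality.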