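When I change the column order (feature order) in scikit-learn linear models with regularization, I get different scores. I have tested this with ElasticNet and Lasso. I am using scikit-learn==0.23.1.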
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics
df = pd.DataFrame({
'col1': [1, 2, 3, 4, 5, 6],
'col2': [16, 32, 64, 12, 5, 256],
'col3': [7, 8, 9, 10, 12, 11],
'out': [40, 5, 60, 7, 9, 100]})
print(df)
X_df = df[['col1', 'col2', 'col3']]
y_df = df['out']
regr = linear_model.ElasticNet(alpha=0.1, random_state=0)
regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))
# change the order to: [col2, col1, col3]
first_cols = ['col2']
cols = first_cols.copy()
for c in X_df.columns:
    if c not in cols:
        cols.append(c)
X_df = X_df[cols]
regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print("\nReorder:")
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))
The output of the above is:
col1 col2 col3 out
0 1 16 7 40
1 2 32 8 5
2 3 64 9 60
3 4 12 10 7
4 5 5 12 9
5 6 256 11 100
R2: 0.8277462579081043
MSE: 207.13034003933535
Reorder:
R2: 0.8277586094134455
MSE: 207.11548769725997
Why does this happen?
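The difference is due to the tol parameter.
From the documentation:
tol : float, default=1e-4
The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.
Just add tol=1e-12, or whatever level of precision you want, to get the same result in both cases.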
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics
df = pd.DataFrame({
'col1': [1, 2, 3, 4, 5, 6],
'col2': [16, 32, 64, 12, 5, 256],
'col3': [7, 8, 9, 10, 12, 11],
'out': [40, 5, 60, 7, 9, 100]})
# print(df)
X_df = df[['col1', 'col2', 'col3']]
y_df = df['out']
regr = linear_model.ElasticNet(alpha=0.1, random_state=0, tol=1e-12)
regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print(regr.coef_)
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))
# change the order to: [col2, col1, col3]
first_cols = ['col2']
cols = first_cols.copy()
for c in X_df.columns:
    if c not in cols:
        cols.append(c)
X_df = X_df[cols]
regr = linear_model.ElasticNet(alpha=0.1, random_state=0, tol=1e-12)
regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print("\nReorder:")
print(regr.coef_)
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))
Output:
[-8.92519779 0.42980208 3.59812779]
R2: 0.8277593357239204
MSE: 207.11461432908925
Reorder:
[ 0.42980208 -8.92519779 3.59812779]
R2: 0.8277593357240851
MSE: 207.11461432889112
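To see how the gap between the two column orders depends on tol, a small sweep can help. This is only a sketch: it rebuilds the question's data as NumPy arrays, and the helper order_gap, the chosen tol values, and the raised max_iter are my own assumptions, not part of the original code.

```python
import numpy as np
from sklearn.linear_model import ElasticNet

# Same toy data as in the question, as plain arrays (columns: col1, col2, col3).
X = np.array([
    [1, 16, 7],
    [2, 32, 8],
    [3, 64, 9],
    [4, 12, 10],
    [5, 5, 12],
    [6, 256, 11],
], dtype=float)
y = np.array([40, 5, 60, 7, 9, 100], dtype=float)

perm = [1, 0, 2]  # reordered columns: [col2, col1, col3]


def order_gap(tol):
    """Largest coefficient difference between the two column orders.

    Hypothetical helper for illustration. max_iter is raised (an assumption)
    so that a very small tol is not cut short by the iteration limit.
    """
    a = ElasticNet(alpha=0.1, tol=tol, max_iter=100_000).fit(X, y)
    b = ElasticNet(alpha=0.1, tol=tol, max_iter=100_000).fit(X[:, perm], y)
    # b.coef_ follows the permuted order, so permute a.coef_ the same way.
    return float(np.max(np.abs(a.coef_[perm] - b.coef_)))


gaps = {tol: order_gap(tol) for tol in (1e-4, 1e-8, 1e-12)}
for tol, gap in gaps.items():
    print(f"tol={tol:g}  max |coef difference| = {gap:.3e}")
```

If the explanation above holds, the printed difference should be far smaller at tol=1e-12 than at the default tol=1e-4.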
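Changing the order of the columns changes the order of operations during training. In an ideal world this would not matter, but because of floating-point precision loss you can get slightly different values just by changing the column order.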
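The floating-point effect mentioned above can be seen in isolation with plain Python. This is a generic illustration that floating-point addition is not associative, not a reproduction of what scikit-learn does internally; the variable names are arbitrary.

```python
import math

a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c   # add a and b first
right = a + (b + c)  # add b and c first

print(left, right)                # 0.6000000000000001 0.6
print(left == right)              # False: same numbers, different order
print(math.isclose(left, right))  # True: they agree up to rounding error
```

For the same reason, results from two column orders are better compared with a tolerance (for example math.isclose or numpy.allclose) than with exact equality.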