Linear model convergence changes when the column order is changed

Question

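When I change the column order (feature order) in scikit-learn linear models with regularization, I get different scores. I have tested this with ElasticNet and Lasso. I am using scikit-learn==0.23.1.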

import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': [16, 32, 64, 12, 5, 256],
    'col3': [7, 8, 9, 10, 12, 11],
    'out': [40, 5, 60, 7, 9, 100]})

print(df)
X_df = df[['col1', 'col2', 'col3']]
y_df = df['out']
regr = linear_model.ElasticNet(alpha=0.1, random_state=0)

regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))

# change the order to: [col2, col1, col3]
first_cols = ['col2']
cols = first_cols.copy()
for c in X_df.columns:
    if c not in cols:
        cols.append(c)
X_df = X_df[cols]

regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print("\nReorder:")
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))

The output of the above is:

   col1  col2  col3  out
0     1    16     7   40
1     2    32     8    5
2     3    64     9   60
3     4    12    10    7
4     5     5    12    9
5     6   256    11  100
R2: 0.8277462579081043
MSE: 207.13034003933535

Reorder:
R2: 0.8277586094134455
MSE: 207.11548769725997

Why does this happen?

Answer 1

The difference is due to the tol parameter.

From the documentation:

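tol : float, default=1e-4
The tolerance for the optimization: if the updates are smaller than tol, the optimization code checks the dual gap for optimality and continues until it is smaller than tol.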

To get the same level of precision in both cases, just add tol=1e-12.

import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn import metrics

df = pd.DataFrame({
    'col1': [1, 2, 3, 4, 5, 6],
    'col2': [16, 32, 64, 12, 5, 256],
    'col3': [7, 8, 9, 10, 12, 11],
    'out': [40, 5, 60, 7, 9, 100]})

# print(df)
X_df = df[['col1', 'col2', 'col3']]
y_df = df['out']
regr = linear_model.ElasticNet(alpha=0.1, random_state=0, tol=1e-12)

regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print(regr.coef_)
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))

# change the order to: [col2, col1, col3]
first_cols = ['col2']
cols = first_cols.copy()
for c in X_df.columns:
    if c not in cols:
        cols.append(c)
X_df = X_df[cols]


regr = linear_model.ElasticNet(alpha=0.1, random_state=0, tol=1e-12)
regr.fit(X_df, y_df)
y_pred = regr.predict(X_df)
print("\nReorder:")
print(regr.coef_)
print("R2:", regr.score(X_df, y_df))
print("MSE:", metrics.mean_squared_error(y_df, y_pred))

Output:

[-8.92519779  0.42980208  3.59812779]
R2: 0.8277593357239204
MSE: 207.11461432908925

Reorder:
[ 0.42980208 -8.92519779  3.59812779]
R2: 0.8277593357240851
MSE: 207.11461432889112
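
To see why tol interacts with column order, it may help to look at a stripped-down solver. ElasticNet and Lasso are fit by coordinate descent, which updates one coefficient at a time. The sketch below is not scikit-learn's implementation: elastic_net_cd is a hypothetical helper, it assumes the default l1_ratio=0.5, handles the intercept by centering, and uses a simplified stopping rule (largest coefficient update relative to the largest coefficient) without the dual gap check. It visits coefficients in column order, so two column orders follow different paths and a loose tol stops them at slightly different points.

```python
import numpy as np

def elastic_net_cd(X, y, alpha=0.1, l1_ratio=0.5, tol=1e-4, max_sweeps=10_000):
    """Cyclic coordinate descent for the elastic net (simplified sketch)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    n = X.shape[0]
    Xc = X - X.mean(axis=0)           # centering stands in for the intercept
    yc = y - y.mean()
    l1 = alpha * l1_ratio * n         # L1 penalty, scaled by sample count
    l2 = alpha * (1.0 - l1_ratio) * n # L2 penalty, scaled by sample count
    col_sq = (Xc ** 2).sum(axis=0)
    w = np.zeros(X.shape[1])
    resid = yc.copy()                 # residual yc - Xc @ w
    for sweep in range(1, max_sweeps + 1):
        max_update = 0.0
        for j in range(X.shape[1]):   # coefficients are visited in column order
            w_old = w[j]
            rho = Xc[:, j] @ resid + col_sq[j] * w_old
            # Soft-threshold by the L1 term, then shrink by the L2 term.
            w[j] = np.sign(rho) * max(abs(rho) - l1, 0.0) / (col_sq[j] + l2)
            resid -= Xc[:, j] * (w[j] - w_old)
            max_update = max(max_update, abs(w[j] - w_old))
        # Simplified stopping rule (assumption): no dual gap check here.
        if max_update <= tol * max(np.abs(w).max(), 1e-12):
            break
    return w, sweep

X = np.array([[1, 16, 7], [2, 32, 8], [3, 64, 9],
              [4, 12, 10], [5, 5, 12], [6, 256, 11]], dtype=float)
y = np.array([40, 5, 60, 7, 9, 100], dtype=float)
perm = [1, 0, 2]                      # column order [col2, col1, col3]

results = {}
for tol in (1e-4, 1e-12):
    w_a, sweeps_a = elastic_net_cd(X, y, tol=tol)
    w_b, sweeps_b = elastic_net_cd(X[:, perm], y, tol=tol)
    # Map the reordered coefficients back to [col1, col2, col3] before comparing.
    w_b_aligned = np.empty_like(w_b)
    w_b_aligned[perm] = w_b
    results[tol] = float(np.abs(w_a - w_b_aligned).max())
    print(f"tol={tol:g}: sweeps {sweeps_a} vs {sweeps_b}, "
          f"max |difference| = {results[tol]:.2e}")
```

With the tighter tolerance both orderings run more sweeps and end up much closer to the same solution, which is the same effect as passing tol=1e-12 to ElasticNet above.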

Answer 2

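Changing the order of the columns affects the order of operations during training. In an ideal world this would not matter, but because floating-point arithmetic loses precision, you can get slightly different values just by changing the column order.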
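
A minimal illustration of that point, using plain Python floats rather than anything from scikit-learn: floating-point addition is not associative, so adding the same three numbers in a different grouping can change the last digit. The variable names are just for illustration.

```python
import math

a, b, c = 0.1, 0.2, 0.3

left_first = (a + b) + c    # add a and b first
right_first = a + (b + c)   # add b and c first

print(left_first, right_first)
print(left_first == right_first)              # False: the results differ in the last bit
print(math.isclose(left_first, right_first))  # True: they agree to within rounding
```

Differences of this kind are on the order of machine precision, far smaller than the gap in the question's scores, so they are only what remains once the solver has been run to a tight tolerance.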
