从StandardScaler VS正规化线性回归结果进行比较

Question

我通过不同情景下的线性回归的一些实例工作，使用Normalizer和StandardScaler比较结果，并将结果令人费解。

我使用的是波士顿住房数据集，并准备推出这样说：

import numpy as np
import pandas as pd
from sklearn.datasets import load_boston
from sklearn.preprocessing import Normalizer
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

#load the data
df = pd.DataFrame(boston.data)
df.columns = boston.feature_names
df['PRICE'] = boston.target

我目前正试图以推理的结果我从以下方案获得：

初始化线性回归用参数normalize=True VS使用Normalizer
初始化线性回归模型中参数fit_intercept = False有和没有标准化。

总的来说，我觉得结果令人困惑。

下面是我如何设置的一切行动：

# Prep the data
X = df.iloc[:, :-1]
y = df.iloc[:, -1:]
normal_X = Normalizer().fit_transform(X)
scaled_X = StandardScaler().fit_transform(X)

#now prepare some of the models
reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)
reg3 = LinearRegression().fit(normal_X, y)
reg4 = LinearRegression().fit(scaled_X, y)
reg5 = LinearRegression(fit_intercept=False).fit(scaled_X, y)

然后，我在3个独立dataframes比较R_score，系数值，并且从每个模型预测。

要创建数据帧从比较每种模式系数值，我做了以下内容：

#Create a dataframe of the coefficients
coef = pd.DataFrame({
    'coeff':                       reg1.coef_[0],
    'coeff_normalize_true':        reg2.coef_[0],
    'coeff_normalizer':            reg3.coef_[0],
    'coeff_scaler':                reg4.coef_[0],
    'coeff_scaler_no_int':         reg5.coef_[0]
})

以下是我创建的数据帧来比较每个模型的R ^ 2点的值：

scores = pd.DataFrame({
    'score':                        reg1.score(X, y),
    'score_normalize_true':         reg2.score(X, y),
    'score_normalizer':             reg3.score(normal_X, y),
    'score_scaler':                 reg4.score(scaled_X, y),
    'score_scaler_no_int':          reg5.score(scaled_X, y)
    }, index=range(1)
)

最后，这里的这些预测从每个比较数据帧：

predictions = pd.DataFrame({
    'pred':                        reg1.predict(X).ravel(),
    'pred_normalize_true':         reg2.predict(X).ravel(),
    'pred_normalizer':             reg3.predict(normal_X).ravel(),
    'pred_scaler':                 reg4.predict(scaled_X).ravel(),
    'pred_scaler_no_int':          reg5.predict(scaled_X).ravel()
}, index=range(len(y)))

下面是导致dataframes：

系数：

成绩：

预测：

我有三个问题，我无法调和：

为什么会出现前两种模式之间绝对没有区别？看来，设置normalize=False什么都不做。我可以理解预测和R ^ 2倍的值是相同的，但我的特点有不同的数值尺度，所以我不知道为什么正火没有任何作用的。当你考虑到使用StandardScaler改变系数大大这是极为混乱。
我不明白为什么使用Normalizer模型导致从其他诸如完全不同的系数值，特别是当与LinearRegression(normalize=True)模型使得完全没有变化。

如果你看看文档，每个，看来他们如果不是相同的非常相似。

从sklearn.linear_model.LinearRegression()文档：

标准化：布尔值，可选，默认为false

当fit_intercept设置为False忽略此参数。如果真，回归系数X将回归前通过减去平均值和由L2范数除以标准化。

同时，sklearn.preprocessing.Normalizer states that it normalizes to the l2 norm by default的文档。

我看不出有什么之间这两个选项做一个区别，我不明白为什么一个会具有与其他的系数值这些激进的差异。

从模型使用StandardScaler的结果是一致的给我，但我不明白为什么使用StandardScaler和设置set_intercept=False模型进行如此糟糕。

从上Linear Regression module文档：

fit_intercept：布尔值，可选，默认为true

是否计算拦截此模型。如果设置为false，没有截距将在计算中使用（例如，数据预期为已经居中）。

该中心StandardScaler你的数据，所以我不明白为什么使用它与fit_intercept=False产生不连贯的结果。

Answer 1

究其原因，前两种模式之间的共同efficients没有什么区别的是，Sklearn反规范化计算：从标准的输入数据共同EFFS后的幕后共同efficients。 Reference

这种去正常化已经这样做了，任何的测试数据，我们可以直接申请共同EFFS。并获得预测与标准化的测试数据。

因此，设置normalize=True做有共同efficients的影响，但他们无论如何不影响最佳拟合线。

Normalizer确实正常化相对于每个样品（意味着按行）。你看参考代码here。

From documentation:

个别规范化样本单位规范。

而normalize=True确实正常化相对于每列/特征。 Reference

例子来理解正常化在数据的不同尺寸的影响。让我们采取两种尺寸x1和x2和y为目标变量。目标变量值是颜色图中的编码。

import matplotlib.pyplot as plt
from sklearn.preprocessing import Normalizer,StandardScaler
from sklearn.preprocessing.data import normalize

n=50
x1 = np.random.normal(0, 2, size=n)
x2 = np.random.normal(0, 2, size=n)
noise = np.random.normal(0, 1, size=n)
y = 5 + 0.5*x1 + 2.5*x2 + noise

fig,ax=plt.subplots(1,4,figsize=(20,6))

ax[0].scatter(x1,x2,c=y)
ax[0].set_title('raw_data',size=15)

X = np.column_stack((x1,x2))

column_normalized=normalize(X, axis=0)
ax[1].scatter(column_normalized[:,0],column_normalized[:,1],c=y)
ax[1].set_title('column_normalized data',size=15)

row_normalized=Normalizer().fit_transform(X)
ax[2].scatter(row_normalized[:,0],row_normalized[:,1],c=y)
ax[2].set_title('row_normalized data',size=15)

standardized_data=StandardScaler().fit_transform(X)
ax[3].scatter(standardized_data[:,0],standardized_data[:,1],c=y)
ax[3].set_title('standardized data',size=15)

plt.subplots_adjust(left=0.3, bottom=None, right=0.9, top=None, wspace=0.3, hspace=None)
plt.show()

你可以看到在图1,2的数据最佳拟合线和4将是相同的;表示该R2_分数不会改变由于柱/特征规格化或标准化的数据。只是，它有不同的合作EFFS结束。值。

注：fig3最佳拟合线会有所不同。

当您设置fit_intercept =假，偏项从预测中扣除。含义截距被设定为零，否则将已经平均目标变量的。

与截距为零的prediction将预期对于其中目标变量不是缩放问题（平均= 0）执行坏。你可以看到22.532每一行，这意味着输出的影响，有差别。

Answer 2

答案Q1

我假设你的意思与第2款是reg1和reg2。让我们知道，如果事实并非如此。

线性回归具有相同的预测能力，如果你规范化数据或没有。因此，使用normalize=True对预测没有影响。理解这一点的一种方式是看到正常化（列方向）是在各列（(x-a)/b）和一个线性回归的数据不影响系数估计的线性变换的线性操作中，仅改变它们的值。请注意，这种说法是不正确的套索/里奇/ ElasticNet。

那么，为什么不系数不同？那么，normalize=True还考虑什么用户通常希望是在原有的特色，而不是标准化的特征系数。因此，它调整系数。检查，这是有道理的一种方法是使用一个简单的例子：

# two features, normal distributed with sigma=10
x1 = np.random.normal(0, 10, size=100)
x2 = np.random.normal(0, 10, size=100)

# y is related to each of them plus some noise
y = 3 + 2*x1 + 1*x2 + np.random.normal(0, 1, size=100)

X = np.array([x1, x2]).T  # X has two columns

reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)

# check that coefficients are the same and equal to [2,1]
np.testing.assert_allclose(reg1.coef_, reg2.coef_) 
np.testing.assert_allclose(reg1.coef_, np.array([2, 1]), rtol=0.01)

这证实了两种方法正确地捕获即，分别[X1，X2]和y，2位和1之间的实数信号。

答案Q2

Normalizer是不是你所期望的。它正常化每一行逐行。那么，结果将发生巨大的变化，并有可能破坏要避免除特定情况下（例如，TF-IDF）的特点和目标之间的关系。

怎么看，假设上面的例子，但考虑到不同的功能，x3，不与y有关。使用Normalizer导致x1由x3的值来修饰的，降低其与y关系的实力。

模型（1,2）之间，系数的差异（4,5）

系数之间的差异是，当你安装前规范，系数将是关于标准化的功能，同样的系数提到我在回答第一部分。它们可以被映射到使用reg4.coef_ / scaler.scale_原来的参数：

x1 = np.random.normal(0, 10, size=100)
x2 = np.random.normal(0, 10, size=100)
y = 3 + 2*x1 + 1*x2 + np.random.normal(0, 1, size=100)
X = np.array([x1, x2]).T

reg1 = LinearRegression().fit(X, y)
reg2 = LinearRegression(normalize=True).fit(X, y)
scaler = StandardScaler()
reg4 = LinearRegression().fit(scaler.fit_transform(X), y)

np.testing.assert_allclose(reg1.coef_, reg2.coef_) 
np.testing.assert_allclose(reg1.coef_, np.array([2, 1]), rtol=0.01)

# here
coefficients = reg4.coef_ / scaler.scale_
np.testing.assert_allclose(coefficients, np.array([2, 1]), rtol=0.01)

这是因为，在数学上，设置z = (x - mu)/sigma，模型REG4是解决y = a1*z1 + a2*z2 + a0。我们可以恢复通过简单的代数y和x之间的关系：y = a1*[(x1 - mu1)/sigma1] + a2*[(x2 - mu2)/sigma2] + a0，这可以简化为y = (a1/sigma1)*x1 + (a2/sigma2)*x2 + (a0 - a1*mu1/sigma1 - a2*mu2/sigma2)。

reg4.coef_ / scaler.scale_代表[a1/sigma1, a2/sigma2]在上面的符号，这是normalize=True究竟是干什么的，以保证系数是相同的。

模型5分的差距。

标准化的特点是零均值，但目标变量未必。因此，不fiting截距使模型无视目标的平均值。在我已经使用的例子中，“3”中y = 3 + ...未安装，这自然降低了模型的预测能力。 :)

从StandardScaler VS正规化线性回归结果进行比较

问题描述投票：16回答：2

2个回答

答案Q1

答案Q2

模型（1,2）之间，系数的差异（4,5）

模型5分的差距。

最新问题

从StandardScaler VS正规化线性回归结果进行比较

问题描述 投票：16回答：2

2个回答

答案Q1

答案Q2

模型（1,2）之间，系数的差异（4,5）

模型5分的差距。

最新问题

问题描述投票：16回答：2