Logistic regression with only one numeric feature

Question

What is the correct way to use scikit-learn's LogisticRegression solver when you have only one numeric feature?

I ran a simple example whose results I find hard to explain. Can anyone explain what I am doing wrong here?

import pandas
import numpy as np
from sklearn.linear_model import LogisticRegression

X = [1, 2, 3, 10, 11, 12]
X = np.reshape(X, (6, 1))
Y = [0, 0, 0, 1, 1, 1]
Y = np.reshape(Y, (6, 1))

lr = LogisticRegression()

lr.fit(X, Y)
print("2 --> {0}".format(lr.predict([[2]])))  # predict expects a 2D array
print("4 --> {0}".format(lr.predict([[4]])))

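This is the output I get once the script finishes. Shouldn't the prediction for 4 be 0, since under a Gaussian distribution 4 is closer to the distribution that is classified as 0 according to the training data?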

2 --> [0]
4 --> [1]

What approach does logistic regression take when there is only one column of numeric data?

Answer 1

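You are handling the single feature correctly, but you are wrongly assuming that just because 4 is close to the class 0 features, it will also be predicted as class 0.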

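You can plot the training data together with the sigmoid function, assuming a classification threshold of y=0.5 and using the coefficient and intercept learned by the regression model: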

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LogisticRegression

X = [1, 2, 3, 10, 11, 12]
X = np.reshape(X, (6, 1))
Y = [0, 0, 0, 1, 1, 1]
Y = np.reshape(Y, (6, 1))

lr = LogisticRegression()
lr.fit(X, Y)

plt.figure(1, figsize=(4, 3))
plt.scatter(X.ravel(), Y, color='black', zorder=20)

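def model(x):
    # Logistic (sigmoid) function
    return 1 / (1 + np.exp(-x))

X_test = np.linspace(-5, 15, 300)
# Despite its name, `loss` holds the predicted probability of class 1 for each X_test value
loss = model(X_test * lr.coef_ + lr.intercept_).ravel()

plt.plot(X_test, loss, color='red', linewidth=3)
plt.axhline(y=0, color='k', linestyle='-')
plt.axhline(y=1, color='k', linestyle='-')
plt.axhline(y=0.5, color='b', linestyle='--')  # classification threshold
# X_test[123] is the first grid point above the threshold for this particular fit
plt.axvline(x=X_test[123], color='b', linestyle='--')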

plt.ylabel('y')
plt.xlabel('X')
plt.xlim(0, 13)
plt.show()

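Here is what the sigmoid function looks like in your case:

[Figure: sigmoid curve fitted to the training data]

Zoomed in a bit:

[Figure: zoomed-in view around the 0.5 threshold]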

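For your particular model, X is between 3.161 and 3.227 when Y is at the 0.5 classification threshold. You can check this by comparing the loss and X_test arrays (X_test[123] is the X value associated with the upper bound; if needed, you can use a function optimization method to get the exact value).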
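For a one-feature model, the exact crossing point can also be read off the fitted parameters: the sigmoid equals 0.5 exactly when coef * x + intercept = 0, so x = -intercept / coef. The following is a minimal sketch rather than part of the original answer. It assumes the `liblinear` solver, on the guess that the numbers quoted here came from an older scikit-learn release where it was the default; the names `w`, `b` and `boundary` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y = np.array([0, 0, 0, 1, 1, 1])

# Assumption: solver="liblinear" is chosen to mirror older scikit-learn defaults.
lr = LogisticRegression(solver="liblinear")
lr.fit(X, y)

w = lr.coef_[0, 0]      # learned slope for the single feature
b = lr.intercept_[0]    # learned intercept

# sigmoid(w * x + b) == 0.5 exactly when w * x + b == 0
boundary = -b / w
print("decision boundary at x =", boundary)

# Sanity check: the predicted probability of class 1 at the boundary is 0.5
p_at_boundary = lr.predict_proba([[boundary]])[0, 1]
print("P(class 1) at the boundary =", p_at_boundary)
```

Any X above `boundary` is assigned class 1 and any X below it class 0, so no grid search or optimizer is needed in the single-feature case.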

So 4 is predicted as class 1 because 4 lies above the bound where Y == 0.5.

You can demonstrate this further with:

# predict expects a 2D array of shape (n_samples, n_features)
print("2 --> {0}".format(lr.predict([[2]])))
print("3 --> {0}".format(lr.predict([[3]])))
print("3.1 --> {0}".format(lr.predict([[3.1]])))
print("3.3 --> {0}".format(lr.predict([[3.3]])))
print("4 --> {0}".format(lr.predict([[4]])))

This prints the following:

2 --> [0]
3 --> [0]
3.1 --> [0]  # Below threshold
3.3 --> [1]  # Above threshold
4 --> [1]

Answer 2

I changed a few things in the code, and the expected results appeared:

import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y_train = np.array([0, 0, 0, 1, 1, 1])

logistic_regression = LogisticRegression()
logistic_regression.fit(X_train, y_train)
results = logistic_regression.predict(np.array([2,4,6.4,6.5]).reshape(-1,1))

print('2--> {}'.format(results[0]))
print('4--> {}'.format(results[1]))
print('6.4 --> {}'.format(results[2]))
print('6.5 --> {}'.format(results[3]))

The result is:

2--> 0
4--> 0
6.4 --> 0
6.5 --> 1

I think you got the wrong result because you don't need to reshape the Y array.
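Another possible explanation for the differing outputs, offered as an assumption rather than something confirmed in this thread: the default solver of `LogisticRegression` changed from `liblinear` to `lbfgs` in scikit-learn 0.22, and `liblinear` also applies the L2 penalty to the intercept, which pulls the decision boundary toward the origin. A rough sketch comparing the two solvers on the same data (the dictionaries `boundaries` and `predictions_for_4` are illustrative names):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

X_train = np.array([1, 2, 3, 10, 11, 12]).reshape(-1, 1)
y_train = np.array([0, 0, 0, 1, 1, 1])

boundaries = {}
predictions_for_4 = {}
for solver in ("liblinear", "lbfgs"):
    clf = LogisticRegression(solver=solver)
    clf.fit(X_train, y_train)
    # Point where the predicted probability crosses 0.5
    boundaries[solver] = -clf.intercept_[0] / clf.coef_[0, 0]
    predictions_for_4[solver] = int(clf.predict([[4]])[0])
    print(solver, "boundary:", boundaries[solver],
          "predict(4):", predictions_for_4[solver])
```

If this assumption holds, `liblinear` places the boundary a little above 3, while `lbfgs` places it near 6.5, the midpoint between the two groups. That would account for both sets of results without the shape of Y being the deciding factor.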

Answer 3

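I have some input such as T, T, F, T, F, T, F, T, F, F, ..., F, F, where T stands for true and F for false, and that is all. There is only one column. What is the correct way to predict the next outcome (T/F)? Can anyone help me write the code?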
