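I successfully implemented multiple linear regression on the Iris dataset using only numpy. I want to do the same for the Boston housing dataset, but my model fails to learn and I don't know why.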
import numpy as np
import pandas as pd
# read data and split into test and training sets
data = pd.read_csv('train.csv')
data = (data - data.mean()) / data.std() # normalize data
split_data = np.random.rand(len(data)) < 0.8
train_data = data[split_data].round(5)
test_data = data[~split_data]
# create matrices
input_features_train = train_data.drop(['ID', 'medv'], axis=1).values
output_feature_train = train_data.medv.values.reshape(-1, 1)
ones = np.ones([input_features_train.shape[0], 1])
input_features_train = np.concatenate((ones, input_features_train), 1)
weight = np.zeros([1, 14])
def computeCost(X, y, theta):
    # half the mean squared error between predictions X @ theta.T and targets y
    summed = np.power(((X @ theta.T) - y), 2)
    return np.sum(summed) / (2 * len(X))
def gradientDescent(X, y, theta, iters, alpha):
    costs = np.zeros(iters)
    for i in range(iters):
        # batch update; theta has shape (1, n_features), the sum runs over the rows of X
        theta = theta - (alpha / len(X)) * np.sum(X * (X @ theta.T - y), 0)
        costs[i] = computeCost(X, y, theta)
    return theta, costs
learning_rate = 0.01
iterations = 100000
weights, cost = gradientDescent(input_features_train, output_feature_train, weight, iterations, learning_rate)
print("Weights: ", weights)
finalCost = computeCost(input_features_train, output_feature_train, weights)
# test
input_features_test = test_data.drop(['ID', 'medv'], axis=1).values
output_feature_test = test_data.medv.values.reshape(-1, 1)
ones = np.ones([input_features_test.shape[0], 1])
input_features_test = np.concatenate((ones, input_features_test), 1)
# renamed from test_data so the function no longer shadows the test_data DataFrame
def report_predictions(input_features, output_feature, weights):
    predictions = np.round(np.dot(input_features, weights.T))
    for i in range(len(output_feature)):
        predicted = predictions[i, 0]
        success = predictions[i] == output_feature[i]
        print('For features: ', input_features[i], ' housing price should be ', output_feature[i])
        print("Predicted: %f" % predicted)
        print("Is success? ", success)
        print()
report_predictions(input_features_test, output_feature_test, weights)
predictions = np.round(np.dot(input_features_test, weights.T))
accuracy = (sum(predictions == output_feature_test) / float(len(output_feature_test)) * 100)[0]
print("Accuracy of the model is ", accuracy, "% after ", iterations, "iterations")
Sample output:
Weights: [[ 0.01465871 -0.11583742 0.17729105 0.01249782 0.09822299 -0.31249182
0.25208063 -0.00937766 -0.48751822 0.46772537 -0.27637035 -0.1590125
0.12926108 -0.48910136]]
For features: [ 1. -0.44852959 -0.47141352 0.09095532 -0.25240023 0.13793157
0.46506236 0.03105118 -0.62153314 -0.98758424 -0.79769195 1.18594974
0.37563165 -0.40259248] housing price should be [-0.04019949]
Predicted: 0.000000
Is success? [False]
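I even tried 10,000,000 iterations, but it still fails every test and the accuracy is 0%. On the Iris dataset I managed to get 100% accuracy with this model, so I don't understand why it doesn't work here.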
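A note on the 0% figure: the targets here are standardized floats such as -0.04019949, so a rounded prediction like 0.0 will almost never equal one exactly, however good the fit is. The same check can reach 100% on Iris, presumably because the labels there are integers. The sketch below uses synthetic data and hypothetical helper names (`exact_match_accuracy`, `mse`, `r_squared`) that are not from the original post; it compares the exact-match check with mean squared error and R², which are the usual metrics for regression.

```python
import numpy as np

def exact_match_accuracy(y_true, y_pred):
    """Percentage of rounded predictions that equal the target exactly."""
    return float(np.mean(np.round(y_pred) == y_true) * 100)

def mse(y_true, y_pred):
    """Mean squared error: average squared distance between prediction and target."""
    return float(np.mean((y_pred - y_true) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 is a perfect fit, 0 matches predicting the mean."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Synthetic stand-in for a standardized target column (assumption, not the real medv data)
rng = np.random.default_rng(0)
y_true = rng.normal(size=(100, 1))
# Predictions that are close to the targets but never bit-for-bit identical
y_pred = y_true + rng.normal(scale=0.1, size=(100, 1))

acc = exact_match_accuracy(y_true, y_pred)
mse_value = mse(y_true, y_pred)
r2_value = r_squared(y_true, y_pred)
print("exact-match accuracy (%):", acc)
print("MSE:", mse_value)
print("R^2:", r2_value)
```

Under these assumptions the exact-match accuracy stays at zero even though the predictions track the targets closely, so a low value from that check says little about whether the weights were learned.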
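I suspect this has something to do with the data normalization, because without it I get RuntimeWarning: overflow encountered in power
summed = np.power(((X @ theta.T) - y), 2)
and I don't know why that happens either. Can you point me in the right direction? Thanks!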
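One plausible explanation for the overflow, not confirmed anywhere in this thread, is that gradient descent with a fixed step of 0.01 diverges when a feature lives on a scale of hundreds: each update overshoots, the weights grow without bound, and squaring the error eventually exceeds the float64 range. The following is a minimal sketch on synthetic data; the feature range, the helper names `gradient_descent` and `cost`, and the iteration counts are all assumptions chosen for illustration.

```python
import numpy as np

def gradient_descent(X, y, alpha, iters):
    """Plain batch gradient descent for least squares; theta is a column vector."""
    theta = np.zeros((X.shape[1], 1))
    for _ in range(iters):
        grad = X.T @ (X @ theta - y) / len(X)
        theta = theta - alpha * grad
    return theta

def cost(X, y, theta):
    """Half the mean squared error, the same quantity as computeCost in the question."""
    return float(np.sum((X @ theta - y) ** 2) / (2 * len(X)))

rng = np.random.default_rng(0)
raw = rng.uniform(200, 700, size=(200, 1))      # hypothetical feature on a large scale
y = 0.05 * raw + rng.normal(size=(200, 1))      # linear target plus unit-variance noise

# Design matrices with a bias column, once unscaled and once standardized
X_raw = np.hstack([np.ones_like(raw), raw])
scaled = (raw - raw.mean()) / raw.std()
X_scaled = np.hstack([np.ones_like(scaled), scaled])

# Silence the overflow/invalid warnings that the diverging run is expected to raise
with np.errstate(over="ignore", invalid="ignore"):
    theta_raw = gradient_descent(X_raw, y, alpha=0.01, iters=200)
theta_scaled = gradient_descent(X_scaled, y, alpha=0.01, iters=2000)

print("unscaled weights finite:", bool(np.all(np.isfinite(theta_raw))))
print("scaled weights finite:  ", bool(np.all(np.isfinite(theta_scaled))))
print("scaled cost:", cost(X_scaled, y, theta_scaled))
```

If this is what happens with the real data, normalizing the features (or using a much smaller learning rate) is the fix for the warning, and the normalization itself is not the reason for the 0% accuracy.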
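I'd really recommend using scikit-learn. You can use SGDRegressor, or CatBoostRegressor from the separate catboost package; both have built-in support for this kind of method.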
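The main motivation for this suggestion is that implementing gradient descent by hand can introduce logic errors that are hard to spot.
Try solving it with scikit-learn. It may help.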
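As a rough illustration of the suggestion, here is a minimal sketch with scikit-learn's SGDRegressor. It runs on synthetic data with 13 features rather than the asker's train.csv, and the feature scales, true weights and hyperparameters are assumptions picked for the example.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: 13 features on very different scales (assumption)
rng = np.random.default_rng(0)
scales = rng.uniform(1, 100, size=13)
X = rng.normal(size=(500, 13)) * scales
true_w = rng.normal(size=13)
y = (X / scales) @ true_w + rng.normal(scale=0.5, size=500)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Scale the features inside the pipeline, then fit a linear model with SGD
model = make_pipeline(
    StandardScaler(),
    SGDRegressor(max_iter=1000, tol=1e-3, random_state=0),
)
model.fit(X_train, y_train)

r2 = model.score(X_test, y_test)                           # R^2 on held-out data
test_mse = float(np.mean((model.predict(X_test) - y_test) ** 2))
print("R^2:", r2)
print("MSE:", test_mse)
```

Putting the scaler in the pipeline means the test set is transformed with statistics learned from the training set only, and `score` reports R² instead of an exact-match accuracy.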