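I have removed trend and seasonality from a dataset of 911 fire calls for a particular city, recorded hourly over 17 years. I then fit a linear regression to it and try to predict the values for the upcoming 24-hour period. However, my R^2 is usually close to 0 (and often negative), and my predicted values are all within a tenth (or less) of each other, so when plotted the prediction looks essentially like a horizontal line that roughly reflects the mean.
What am I doing wrong?
Here is my code: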
from datetime import timedelta

import sklearn.linear_model
import sklearn.metrics


def run_regression(df, dependent, label):
    cut_datetime = df[dependent].max() - timedelta(hours=26)  # 24 hour lag plus 4 hours to predict
    # train == data before cut_datetime
    train = df[df[dependent] < cut_datetime][['julian_datetime', label]].dropna(how='any')
    # test == data after cut_datetime (copied so a column can be added safely)
    test = df[df[dependent] >= cut_datetime][['julian_datetime', label]].dropna(how='any').copy()

    regress = sklearn.linear_model.LinearRegression().fit(
        X=train[['julian_datetime']],
        y=train[label])
    test['predicted_value'] = regress.predict(
        X=test[['julian_datetime']])

    # Plots: residuals, then actual vs. predicted
    (test[label] - test['predicted_value']).plot()
    test[[label, 'predicted_value']].plot()

    # Metrics
    print('MSE: ', sklearn.metrics.mean_squared_error(test[label], test['predicted_value']))
    print('R^2: ', sklearn.metrics.r2_score(test[label], test['predicted_value']))
    print('Sample of predicted values: ', '\n', test['predicted_value'][:10])


run_regression(exp_model_df, 'incident_hour', 'label')
incident_hour -> the datetime format of the julian_date referenced at the beginning of the function
Here is a sample of the dataset:
incident_hour julian_datetime label
0 2003-11-07 09:00:00 2.452951e+06 6.696136
1 2003-11-07 10:00:00 2.452951e+06 -5.293884
2 2003-11-07 11:00:00 2.452951e+06 5.679681
3 2003-11-07 12:00:00 2.452951e+06 4.411278
4 2003-11-07 13:00:00 2.452951e+06 5.837476
5 2003-11-07 14:00:00 2.452951e+06 6.469543
6 2003-11-07 15:00:00 2.452951e+06 2.191286
7 2003-11-07 16:00:00 2.452951e+06 0.347877
8 2003-11-07 17:00:00 2.452951e+06 0.151539
9 2003-11-07 18:00:00 2.452951e+06 5.925230
10 2003-11-07 19:00:00 2.452951e+06 8.563340
11 2003-11-07 20:00:00 2.452951e+06 3.151843
12 2003-11-07 21:00:00 2.452951e+06 3.751080
13 2003-11-07 22:00:00 2.452951e+06 5.476664
14 2003-11-07 23:00:00 2.452951e+06 0.146253
15 2003-11-08 00:00:00 2.452952e+06 2.879449
16 2003-11-08 01:00:00 2.452952e+06 0.712886
17 2003-11-08 02:00:00 2.452952e+06 6.118765
18 2003-11-08 03:00:00 2.452952e+06 6.052857
19 2003-11-08 04:00:00 2.452952e+06 0.892937
20 2003-11-08 05:00:00 2.452952e+06 -3.009876
21 2003-11-08 06:00:00 2.452952e+06 -3.525916
22 2003-11-08 07:00:00 2.452952e+06 -0.076345
23 2003-11-08 08:00:00 2.452952e+06 -3.236072
24 2003-11-08 09:00:00 2.452952e+06 -2.855910
25 2003-11-08 10:00:00 2.452952e+06 3.599330
26 2003-11-08 11:00:00 2.452952e+06 6.845144
27 2003-11-08 12:00:00 2.452952e+06 6.764351
28 2003-11-08 13:00:00 2.452952e+06 -1.896929
29 2003-11-08 14:00:00 2.452952e+06 0.370614
30 2003-11-08 15:00:00 2.452952e+06 4.899800
31 2003-11-08 16:00:00 2.452952e+06 7.245627
32 2003-11-08 17:00:00 2.452952e+06 1.559531
33 2003-11-08 18:00:00 2.452952e+06 8.437391
34 2003-11-08 19:00:00 2.452952e+06 4.957201
35 2003-11-08 20:00:00 2.452952e+06 1.349833
36 2003-11-08 21:00:00 2.452952e+06 6.257467
37 2003-11-08 22:00:00 2.452952e+06 -1.221531
38 2003-11-08 23:00:00 2.452952e+06 0.552749
39 2003-11-09 00:00:00 2.452952e+06 -0.917920
40 2003-11-09 01:00:00 2.452953e+06 -4.394944
41 2003-11-09 02:00:00 2.452953e+06 -2.238189
42 2003-11-09 03:00:00 2.452953e+06 -1.062656
43 2003-11-09 04:00:00 2.452953e+06 3.813087
44 2003-11-09 05:00:00 2.452953e+06 -4.540094
45 2003-11-09 06:00:00 2.452953e+06 2.680210
46 2003-11-09 07:00:00 2.452953e+06 4.581881
47 2003-11-09 08:00:00 2.452953e+06 3.803750
48 2003-11-09 09:00:00 2.452953e+06 6.590574
49 2003-11-09 10:00:00 2.452953e+06 8.227202
Here are the resulting plots:
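I can't be sure unless you provide the test set that produced these predictions. According to the code, you fit the model with a single independent variable, julian_datetime. Judging by your sample data, the julian_datetime variable has a large number of repeated values. I wouldn't be surprised if every observation in your test set had the same julian_datetime value. Because every input would then be identical, the model would return the same prediction for each of them.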
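One way to check this is to count the distinct julian_datetime values in the test set at full float precision, since the printed sample shows only seven significant digits. The sketch below is illustrative only. The helper feature_spread and the synthetic train, same and hourly frames are hypothetical stand-ins, not the asker's data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def feature_spread(frame, feature="julian_datetime"):
    """Return (number of distinct values, max - min) at full float precision."""
    values = frame[feature].to_numpy(dtype=float)
    return int(np.unique(values).size), float(values.max() - values.min())


# Hypothetical training data: 200 hourly rows with a noisy, detrended label.
rng = np.random.default_rng(0)
start = pd.Timestamp("2020-01-01")
train_hours = pd.DatetimeIndex(start + pd.to_timedelta(np.arange(200), unit="h"))
train = pd.DataFrame({
    "julian_datetime": train_hours.to_julian_date(),
    "label": rng.normal(size=200),
})
model = LinearRegression().fit(train[["julian_datetime"]], train["label"])

# Case 1 (assumed): every test row carries the same feature value.
same = pd.DataFrame(
    {"julian_datetime": np.full(26, train["julian_datetime"].iloc[-1])}
)
same_pred = model.predict(same[["julian_datetime"]])

# Case 2 (assumed): 26 consecutive hourly rows after the training window.
test_hours = pd.DatetimeIndex(
    train_hours[-1] + pd.to_timedelta(np.arange(1, 27), unit="h")
)
hourly = pd.DataFrame({"julian_datetime": test_hours.to_julian_date()})

print("identical inputs:", feature_spread(same))
print("hourly inputs:   ", feature_spread(hourly))
print("prediction range for identical inputs:", same_pred.max() - same_pred.min())
```

If the distinct count for your real test set comes back as 1, identical predictions follow directly. If it is larger, the flat line would need another explanation, such as a fitted slope that is close to zero.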