Why are my predicted values all nearly identical (and close to the mean)?


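I have removed trend and seasonality from a dataset of 911 fire calls for a particular city, recorded hourly over 17 years. I then fit a linear regression to it and try to predict the values for the upcoming 24-hour cycle. However, my R^2 values are typically near 0 (and often negative), and my predicted values are all within a tenth (or less) of each other, so when plotted they look essentially like a horizontal line that roughly mirrors the mean.

What am I doing wrong?

Here is my code: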

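from datetime import timedelta

import sklearn.linear_model  # sklearn submodules must be imported explicitly
import sklearn.metrics

def run_regression(df, dependent, label):
    cut_datetime = df[dependent].max()-timedelta(hours=26) #hold out the final 26 hours as the test window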

    train = df[df[dependent] < cut_datetime][['julian_datetime', label]].dropna(how='any') #train == data before cut_datetime
    test = df[df[dependent] >= cut_datetime][['julian_datetime', label]].dropna(how='any') #test == data after cut_datetime

    regress = sklearn.linear_model.LinearRegression().fit(
                                              X = train[['julian_datetime']],
                                              y = train[label])

    test['predicted_value'] = regress.predict(
                                              X = test[['julian_datetime']])

    #Plots
    (test[label] - test['predicted_value']).plot()
    test[[label, 'predicted_value']].plot()

    #Metrics
    print('MSE: ', sklearn.metrics.mean_squared_error(test[label], test['predicted_value']))
    print('R^2: ', sklearn.metrics.r2_score(test[label], test['predicted_value']))
    print('Sample of predicted values: ', '\n', test['predicted_value'][:10])

run_regression(exp_model_df, 'incident_hour', 'label')

incident_hour -> the datetime form of julian_datetime; it is the column referenced at the start of the function

Here is a sample of the dataset:

incident_hour   julian_datetime     label
0   2003-11-07 09:00:00     2.452951e+06    6.696136
1   2003-11-07 10:00:00     2.452951e+06    -5.293884
2   2003-11-07 11:00:00     2.452951e+06    5.679681
3   2003-11-07 12:00:00     2.452951e+06    4.411278
4   2003-11-07 13:00:00     2.452951e+06    5.837476
5   2003-11-07 14:00:00     2.452951e+06    6.469543
6   2003-11-07 15:00:00     2.452951e+06    2.191286
7   2003-11-07 16:00:00     2.452951e+06    0.347877
8   2003-11-07 17:00:00     2.452951e+06    0.151539
9   2003-11-07 18:00:00     2.452951e+06    5.925230
10  2003-11-07 19:00:00     2.452951e+06    8.563340
11  2003-11-07 20:00:00     2.452951e+06    3.151843
12  2003-11-07 21:00:00     2.452951e+06    3.751080
13  2003-11-07 22:00:00     2.452951e+06    5.476664
14  2003-11-07 23:00:00     2.452951e+06    0.146253
15  2003-11-08 00:00:00     2.452952e+06    2.879449
16  2003-11-08 01:00:00     2.452952e+06    0.712886
17  2003-11-08 02:00:00     2.452952e+06    6.118765
18  2003-11-08 03:00:00     2.452952e+06    6.052857
19  2003-11-08 04:00:00     2.452952e+06    0.892937
20  2003-11-08 05:00:00     2.452952e+06    -3.009876
21  2003-11-08 06:00:00     2.452952e+06    -3.525916
22  2003-11-08 07:00:00     2.452952e+06    -0.076345
23  2003-11-08 08:00:00     2.452952e+06    -3.236072
24  2003-11-08 09:00:00     2.452952e+06    -2.855910
25  2003-11-08 10:00:00     2.452952e+06    3.599330
26  2003-11-08 11:00:00     2.452952e+06    6.845144
27  2003-11-08 12:00:00     2.452952e+06    6.764351
28  2003-11-08 13:00:00     2.452952e+06    -1.896929
29  2003-11-08 14:00:00     2.452952e+06    0.370614
30  2003-11-08 15:00:00     2.452952e+06    4.899800
31  2003-11-08 16:00:00     2.452952e+06    7.245627
32  2003-11-08 17:00:00     2.452952e+06    1.559531
33  2003-11-08 18:00:00     2.452952e+06    8.437391
34  2003-11-08 19:00:00     2.452952e+06    4.957201
35  2003-11-08 20:00:00     2.452952e+06    1.349833
36  2003-11-08 21:00:00     2.452952e+06    6.257467
37  2003-11-08 22:00:00     2.452952e+06    -1.221531
38  2003-11-08 23:00:00     2.452952e+06    0.552749
39  2003-11-09 00:00:00     2.452952e+06    -0.917920
40  2003-11-09 01:00:00     2.452953e+06    -4.394944
41  2003-11-09 02:00:00     2.452953e+06    -2.238189
42  2003-11-09 03:00:00     2.452953e+06    -1.062656
43  2003-11-09 04:00:00     2.452953e+06    3.813087
44  2003-11-09 05:00:00     2.452953e+06    -4.540094
45  2003-11-09 06:00:00     2.452953e+06    2.680210
46  2003-11-09 07:00:00     2.452953e+06    4.581881
47  2003-11-09 08:00:00     2.452953e+06    3.803750
48  2003-11-09 09:00:00     2.452953e+06    6.590574
49  2003-11-09 10:00:00     2.452953e+06    8.227202
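A note on reading the sample above: julian_datetime is printed with only seven significant digits, so hourly steps (1/24 of a day) disappear in the display even when the stored values differ. The sketch below rebuilds the first six rows and prints them at higher precision. It assumes the column was derived with pandas' `to_julian_date`, which the question does not state; the frame name `sample` is hypothetical.

```python
import pandas as pd

# Hypothetical reconstruction of the first six sample rows (hourly timestamps).
hours = pd.date_range("2003-11-07 09:00", periods=6, freq=pd.Timedelta(hours=1))
sample = pd.DataFrame({"incident_hour": hours})

# Assumption: julian_datetime came from pandas' Julian date conversion.
sample["julian_datetime"] = pd.DatetimeIndex(sample["incident_hour"]).to_julian_date()

# Print with four decimals so the hourly fraction of a day is visible.
with pd.option_context("display.float_format", "{:.4f}".format):
    print(sample)

# Count distinct values and measure the step between consecutive rows.
n_unique = sample["julian_datetime"].nunique()
steps = sample["julian_datetime"].diff().dropna()
print("distinct values:", n_unique)
print("step between rows (days):", steps.iloc[0])
```

Running `df['julian_datetime'].nunique()` on the real frame is a quick way to tell display rounding apart from genuinely repeated values.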

Here is the resulting plot:

[image: resulting plot]

python machine-learning scikit-learn time-series linear-regression
1 Answer

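I can't be certain unless you provide the test set that produced these predictions. Based on the code, you are fitting the model with a single independent variable, julian_datetime. Based on your sample data, julian_datetime has a lot of repeated values. I wouldn't be surprised if every observation in your test set had the same julian_datetime value. Since every input would then be the same, the model would return the same prediction for each of them.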
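One way to check this hypothesis is to count the distinct julian_datetime values in the test slice and compare the spread of the predictions with the fitted line. The sketch below does that on synthetic data, not the asker's: it assumes a detrended label is roughly noise around a constant, uses one year of hourly rows instead of 17, and picks the noise parameters arbitrarily. In this setup the inputs are all distinct, yet the predictions still barely move. With the trend removed, a line fit on time alone has a slope close to zero, so every prediction lands near the training mean and R^2 on a short holdout comes out near zero or negative.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Synthetic stand-in for the detrended series (assumption: noise around a constant).
rng = np.random.default_rng(0)
n_hours = 24 * 365
frame = pd.DataFrame({
    "julian_datetime": 2452950.875 + np.arange(n_hours) / 24.0,
    "label": rng.normal(loc=2.0, scale=4.0, size=n_hours),
})

# Hold out the final 26 hourly rows, mirroring the cut in the question.
train, test = frame.iloc[:-26], frame.iloc[-26:]

model = LinearRegression().fit(train[["julian_datetime"]], train["label"])
pred = model.predict(test[["julian_datetime"]])

# Diagnostics: are the test inputs distinct, and how flat are the predictions?
n_distinct_inputs = test["julian_datetime"].nunique()
pred_spread = pred.max() - pred.min()
gap_to_mean = abs(pred.mean() - train["label"].mean())
r2 = r2_score(test["label"], pred)

print("distinct test inputs:", n_distinct_inputs)
print("slope per day:", model.coef_[0])
print("prediction spread:", pred_spread)
print("gap to training mean:", gap_to_mean)
print("R^2 on holdout:", r2)
```

If the same diagnostics on the real data show distinct inputs and a near-zero slope, the flat line comes from the model having nothing but time to work with. Features that carry hour-to-hour information, such as lagged values of the label, would be the usual next step.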
