Using linear regression on yearly time series data to get a prediction N years ahead

Question

I'm stuck on a rather unusual problem. I have time series data covering 2009 to 2018, and I have to use it to answer a rather odd question.

The data sheet contains the energy generation statistics of each Australian state/territory in GWh (gigawatt hours) for the years 2009 to 2018.

It has the following fields:

State: Names of the different Australian states.
Fuel_Type: The type of fuel that is consumed.
Category: Determines whether a fuel is considered renewable or non-renewable.
Years: Years for which the energy consumption is recorded.

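Question:

How can I use a linear regression model to predict what percentage of state X's (say Victoria's) energy generation will come from source Y (say renewable energy) in year Z (say 2100)?

How should I use a linear regression model to solve this? I have not been able to work it out.

The data comes from this link.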

Answer 1

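I think you first need to consider what your model should look like in the end. You probably want something that relates the dependent variable y (the fraction of renewable energy) to your input features. One of those features should probably be the year, since you are interested in predicting how y changes as that quantity changes. So a very basic linear model could be y = beta1 * x + beta0, where x is the year, beta1 and beta0 are the parameters you want to fit, and y is the fraction of renewable energy. This of course ignores the state component, but I think a simple start would be to fit such a model to the state you are interested in. The code for such an approach could look like this: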

import matplotlib
matplotlib.use("agg")
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sbn
from scipy.stats import linregress
import numpy as np

def fracRenewable(df):
    # share of the total amount that comes from renewable fuels
    renewable = df.loc[df["Category"] == "Renewable fuels", "amount"]
    return renewable.sum() / df["amount"].sum()


# load in data

data = pd.read_csv("./energy_data.csv")

# convert data to tidy format and rename columns
# (the chained calls are wrapped in parentheses so they can span several lines)
molten = (
    pd.melt(data, id_vars=["State", "Fuel_Type", "Category"])
    .rename(columns={"variable": "year", "value": "amount"})
)

# calculate fraction of renewable fuel per year (summed over all states)
grouped = (
    molten.groupby(["year"])
    .apply(fracRenewable)
    .reset_index()
    .rename(columns={0: "amount"})
)
grouped["year"] = grouped["year"].astype(int)

# >>> grouped
#    year    amount
# 0  2009  0.029338
# 1  2010  0.029207
# 2  2011  0.032219
# 3  2012  0.053738
# 4  2013  0.061332
# 5  2014  0.066198
# 6  2015  0.069404
# 7  2016  0.066531
# 8  2017  0.074625
# 9  2018  0.077445

# fit linear model
slope, intercept, r_value, p_value, std_err = linregress(grouped["year"], grouped["amount"])

# plot result
f, ax = plt.subplots()
sbn.scatterplot(x="year", y="amount", ax=ax, data=grouped)
ax.plot(range(2009, 2030), [i*slope + intercept for i in range(2009, 2030)], color="red")
ax.set_title("Renewable fuels (simple prediction)")
ax.set(ylabel="Fraction renewable fuel")
f.savefig("test11.png", bbox_inches="tight")

This gives you a (very simple) model for predicting the fraction of renewable fuel in a given year.
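To answer the original question for a specific year such as 2100, the fitted line can simply be evaluated at that year. The following is a minimal sketch, assuming the yearly fractions printed in the code above. It refits the line with the closed-form least-squares formulas in plain Python rather than with linregress, and the helper names fit_line and predict_fraction are made up for illustration. Clipping the result to the range 0 to 1 is an extra assumption on my side, since a straight line extrapolated far enough will eventually leave the range a fraction can take.

```python
# yearly renewable fractions, copied from the `grouped` output shown earlier
years = list(range(2009, 2019))
fractions = [0.029338, 0.029207, 0.032219, 0.053738, 0.061332,
             0.066198, 0.069404, 0.066531, 0.074625, 0.077445]


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept (hypothetical helper)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_fraction(year, slope, intercept):
    """Evaluate the line at `year`, clipped to the valid range [0, 1] (an assumption)."""
    return min(1.0, max(0.0, slope * year + intercept))


slope, intercept = fit_line(years, fractions)
for target_year in (2030, 2100):
    share = predict_fraction(target_year, slope, intercept)
    print(f"{target_year}: {share:.1%} renewable")
```

Keep in mind that ten years of data are being extrapolated more than eighty years ahead, so the number for 2100 is better read as "what the current trend implies" than as a reliable forecast.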

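If you want to refine the model further, I think a good start would be to group the states by similarity (based either on prior knowledge or on a clustering method) and then make predictions for those groups.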
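As a rough sketch of that refinement, the helper below computes the yearly renewable fraction for an arbitrary group of states, so that a line can be fitted per state or per group. It assumes tidy records with the keys State, Category, year and amount (the column names of molten in the code above). The function name renewable_fraction_by_year, the "Non-renewable fuels" label and the three sample rows are invented for illustration and are not taken from the real data set.

```python
from collections import defaultdict


def renewable_fraction_by_year(rows, states):
    """Return {year: renewable share} for the given set of states (hypothetical helper)."""
    renewable = defaultdict(float)
    total = defaultdict(float)
    for row in rows:
        if row["State"] not in states:
            continue
        total[row["year"]] += row["amount"]
        if row["Category"] == "Renewable fuels":
            renewable[row["year"]] += row["amount"]
    # skip years whose total is zero to avoid dividing by zero
    return {year: renewable[year] / total[year]
            for year in sorted(total) if total[year] > 0}


# made-up sample rows, only to show the expected shape of the input
rows = [
    {"State": "Victoria", "Category": "Renewable fuels", "year": 2009, "amount": 10.0},
    {"State": "Victoria", "Category": "Non-renewable fuels", "year": 2009, "amount": 30.0},
    {"State": "Tasmania", "Category": "Renewable fuels", "year": 2009, "amount": 5.0},
]

print(renewable_fraction_by_year(rows, {"Victoria"}))
print(renewable_fraction_by_year(rows, {"Victoria", "Tasmania"}))
```

The resulting year-to-fraction mapping for a single state such as Victoria, or for a whole group, can then be fed to the same kind of line fit, one fit per state or per group.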

Answer 2

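Yes, you can use linear regression for forecasting. There are different ways to do so. You can:

1. fit a line to the training data and extrapolate that fitted line into the future; this is sometimes also called the drift method,
2. reduce the problem to a tabular regression problem by splitting the time series into fixed-length windows and stacking them on top of each other, then use linear regression,
3. use other common trend methods.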

Here is what (1) and (2) look like with sktime (disclaimer: I'm one of the developers):

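# note: the import paths below match the sktime release current when this answer
# was written; later releases may have moved or renamed some of them
import numpy as np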
from sktime.datasets import load_airline
from sktime.forecasting.model_selection import temporal_train_test_split
from sktime.performance_metrics.forecasting import smape_loss
from sktime.forecasting.trend import PolynomialTrendForecaster
from sktime.utils.plotting.forecasting import plot_ys
from sktime.forecasting.compose import ReducedRegressionForecaster
from sklearn.linear_model import LinearRegression

y = load_airline()  # load 1-dimensional time series
y_train, y_test = temporal_train_test_split(y)

# here I forecast all observations of the test series;
# in your case you could select only the years you are interested in
fh = np.arange(1, len(y_test) + 1)

# option 1
forecaster = PolynomialTrendForecaster(degree=1)
forecaster.fit(y_train)
y_pred_1 = forecaster.predict(fh)

# option 2
forecaster = ReducedRegressionForecaster(LinearRegression(), window_length=10)
forecaster.fit(y_train)
y_pred_2 = forecaster.predict(fh)
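
To make option 2 less of a black box, here is a minimal plain-Python sketch of the reduction step itself: the series is cut into fixed-length windows, each window becomes one row of features, and the value that follows the window becomes the target. The function name make_windows is hypothetical, and this only illustrates the idea; it is not a description of how sktime implements it internally.

```python
def make_windows(series, window_length):
    """Turn a 1-dimensional series into a tabular (X, y) regression data set."""
    X, y = [], []
    for start in range(len(series) - window_length):
        X.append(list(series[start:start + window_length]))  # past values as features
        y.append(series[start + window_length])               # next value as target
    return X, y


X, y = make_windows([1, 2, 3, 4, 5], window_length=2)
for features, target in zip(X, y):
    print(features, "->", target)
```

Each row of X holds two consecutive values and y holds the value that comes right after them, so any tabular regressor, for example LinearRegression, can be fitted on X and y.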
