How can I best redistribute forecasted daily sales to the hourly level?


I have hourly sales data (0600H-2200H). I aggregate it to daily and forecast at the daily level, because I found that daily forecasts give higher accuracy.

I also have another dataframe containing the hourly (0600-2200H) sales proportions.

I tried the simple approach of mapping daily to hourly according to the proportions, but it isn't very accurate (the hourly values don't add back up to the daily total sales).

I would like to use a hierarchical forecasting library to redistribute the daily-level forecasts according to the historical hourly sales proportions. Or, how else can I best redistribute the forecasted daily level to the hourly level?
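For the simple proportional mapping itself, the totals usually fail to add up because the stored per-day proportions don't sum to exactly 1. A minimal pandas sketch (my own, not from the question; daily_df and prop_df are assumed names for the two frames shown below) that renormalizes within each day so the hourly values always sum back to the daily value:

import pandas as pd

# Assumed inputs: daily_df has columns ['date', 'sales'],
# prop_df has columns ['date', 'hour', 'proportion']
out = prop_df.merge(daily_df, on='date', how='left')

# Renormalize so each day's proportions sum to exactly 1
out['proportion'] /= out.groupby('date')['proportion'].transform('sum')
out['hourly_sales'] = out['sales'] * out['proportion']

# Sanity check: hourly values now sum back to the daily totals
print(out.groupby('date')['hourly_sales'].sum())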

Average hourly proportion df:

date hour proportion
0 2023-01-01 6 0.000000
1 2023-01-01 7 0.000000
2 2023-01-01 8 0.016458
3 2023-01-01 9 0.017825
4 2023-01-01 10 0.039075

Daily df:

date sales
0 2023-01-01 3840.10
1 2023-01-02 3323.80
2 2023-01-03 2605.40
3 2023-01-04 2616.60
4 2023-01-05 2622.89
5 2023-01-06 3596.50
6 2023-01-07 3769.77
7 2023-01-08 3572.30
8 2023-01-09 2381.50
9 2023-01-10 2900.00

hourly_x:

date hour sales proportion
0 2023-01-01 6 0.00 0.000000
1 2023-01-01 7 0.00 0.000000
2 2023-01-01 8 63.20 0.016458
3 2023-01-01 9 68.45 0.017825
4 2023-01-01 10 150.05 0.039075

combined hourly df:

date hour sales promo_available
0 2023-01-01 6 0.00 1
1 2023-01-01 7 0.00 1
2 2023-01-01 8 63.20 1
3 2023-01-01 9 68.45 1
4 2023-01-01 10 150.05 1
... ... ... ... ...
7220 2024-02-29 18 373.00 1
7221 2024-02-29 19 445.00 1
7222 2024-02-29 20 431.80 1
7223 2024-02-29 21 458.20 1
7224 2024-02-29 22 373.80 1
7225 rows × 4 columns

Code:

import numpy as np
import pandas as pd

from hierarchicalforecast.core import HierarchicalReconciliation
from hierarchicalforecast.methods import BottomUp, TopDown, MiddleOut
from statsforecast.core import StatsForecast
from statsforecast.models import AutoARIMA, Naive
# retain rows where date column is from 2023-01-01 to 2024-02-29
combined = combined[(combined['date'] >= '2023-01-01') & (combined['date'] <= '2024-02-29')]
# retain only columns 'date', 'hour', 'promo_available' and 'sales'
combined = combined[['date', 'hour', 'promo_available', 'sales']]
combined_hourly = combined.groupby(['date', 'hour']).agg({'sales': 'sum', 'promo_available': 'max'}).reset_index()

# Define the levels
days = combined_hourly['date'].dt.date.unique()
hours = combined_hourly['hour'].unique()

# Create the summing matrix S
num_days = len(days)
num_hours = len(hours)

# Bottom-level (hourly) series
bottom_series = combined_hourly.groupby(['date', 'hour']).size().reset_index(name='count')
num_bottom_series = bottom_series.shape[0]

# Higher-level (daily) series
daily_series = combined_hourly.groupby(['date']).size().reset_index(name='count')
num_daily_series = daily_series.shape[0]

# Initialize the summing matrix with zeros
S = np.zeros((num_daily_series, num_bottom_series))

# Fill the summing matrix
for i, day in enumerate(days):
    daily_indices = bottom_series[bottom_series['date'] == pd.to_datetime(day)].index
    S[i, daily_indices] = 1

S_df = pd.DataFrame(S)

# Create tags dictionary
# Create tags dictionary (labels match the unique_id formats built below)
tags = {
    'levels': ['Daily', 'Hourly'],
    'labels': {
        'Daily': [f"{day}_D" for day in days],
        'Hourly': [f"{day}_H{hour}" for day in days for hour in hours]
    }
}

# # Output the summing matrix and tags dictionary
# print("Summing Matrix S:")
# print(S_df)

# print("\nTags Dictionary:")
# print(tags)

combined_hourly_copying = combined_hourly.copy()

# Create hourly series
combined_hourly_copying['unique_id'] = combined_hourly_copying['date'].dt.strftime('%Y-%m-%d') + '_H' + combined_hourly_copying['hour'].astype(str)
combined_hourly_copying['ds'] = combined_hourly_copying['date'] + pd.to_timedelta(combined_hourly_copying['hour'], unit='h')
combined_hourly_copying_hourly = combined_hourly_copying[['unique_id', 'ds', 'sales']].rename(columns={'sales': 'y'})

# Aggregate to create daily series
combined_hourly_copying_daily = combined_hourly_copying.groupby('date').agg({'sales': 'sum'}).reset_index()
combined_hourly_copying_daily['unique_id'] = combined_hourly_copying_daily['date'].dt.strftime('%Y-%m-%d') + '_D'
combined_hourly_copying_daily['ds'] = combined_hourly_copying_daily['date']
combined_hourly_copying_daily = combined_hourly_copying_daily[['unique_id', 'ds', 'sales']].rename(columns={'sales': 'y'})

# Combine hourly and daily series
Y_df = pd.concat([combined_hourly_copying_hourly, combined_hourly_copying_daily], ignore_index=True)

# Output the Y_df DataFrame
print(Y_df)

Y_df['ds'] = pd.to_datetime(Y_df['ds'])

# Calculate the split point (80% of the data)
split_index = int(len(Y_df) * 0.8)

# Ensure the split respects the chronological order
Y_train_df = Y_df.iloc[:split_index]
Y_test_df = Y_df.iloc[split_index:]

h = len(Y_test_df)  # Forecast horizon (number of periods to forecast)

# Define the models you want to use
models = [
    AutoARIMA(season_length=119),  # 119 = 17 hours (0600H-2200H) x 7 days, i.e. weekly seasonality at hourly granularity
    Naive()
]

# Create the StatsForecast object
fcst = StatsForecast(
    df=Y_train_df,
    models=models, 
    freq='H',  # Set frequency to hourly ('H') since your data is hourly
    n_jobs=-1
)

# Fit the models and generate forecasts
forecasts = fcst.forecast(h=h)

# Example reconcilers
reconcilers = [
    # BottomUp(),
    TopDown(method='forecast_proportions'),
    # MiddleOut(middle_level='Country/Purpose/State', top_down_method='forecast_proportions')
]

# Initialize HierarchicalReconciliation
hrec = HierarchicalReconciliation(reconcilers=reconcilers)

Y_rec_df = hrec.reconcile(Y_hat_df=forecasts, Y_df=Y_train_df, S=S_df, tags=tags)

This gives:

Exception: Check `S_df`, `Y_hat_df` series difference, S\Y_hat=425, Y_hat\S=1

Please help?

python time-series
1 Answer

Option A: The problem seems to come from your forecasts not lining up with the indices from your train-test split. You can fix this by doing the following:

h = 4
# Take the last h observations of every series as the test set
Y_test_df = Y_df.groupby('unique_id').tail(h)

# Drop the test rows from the training set before resetting the test index
Y_train_df = Y_df.drop(Y_test_df.index)
Y_test_df = Y_test_df.reset_index(drop=True)
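With that split, h counts periods per series rather than total rows, so every unique_id gets the same number of forecast rows. A sketch of re-running the question's pipeline on top of it (assuming the models, S_df, and tags defined in the question):

# Re-fit on the per-series training split and forecast h steps per unique_id
fcst = StatsForecast(df=Y_train_df, models=models, freq='H', n_jobs=-1)
Y_hat_df = fcst.forecast(h=h)

# Reconcile as in the question
hrec = HierarchicalReconciliation(reconcilers=[TopDown(method='forecast_proportions')])
Y_rec_df = hrec.reconcile(Y_hat_df=Y_hat_df, Y_df=Y_train_df, S=S_df, tags=tags)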

Just to check which indices don't line up:

count_diff = len(S_df.index.difference(Y_hat_df.index))
print(len(S_df.index.unique()))
print(len(Y_hat_df.index.unique()))
print(count_diff)
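To see the actual series names rather than just counts (plain pandas, my own addition):

# Which series are in S_df but missing from the forecasts, and vice versa
print(S_df.index.difference(Y_hat_df.index)[:10])
print(Y_hat_df.index.difference(S_df.index)[:10])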

Option B: This can also happen when the levels of your hierarchy don't roll up to a true top level. That doesn't seem to be your case, but a brute-force way to check is to rearrange the levels.
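A quick structural sanity check (my own sketch, not from the library): in the question's two-level setup, S has one row per day and one column per day-hour, so every column should contain exactly one 1, i.e. each hour rolls up into exactly one day:

# Each bottom (hourly) series should belong to exactly one daily series
col_sums = S_df.sum(axis=0)
print((col_sums == 1).all())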
