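I have data with 3 dimensions. Every combination of dimensions has the same date range and a numeric label. My goal is to add a column containing the average of the label over the previous n days.
I have a working solution, but it takes a long time (about 20 minutes for 2,270,400 rows and 2,400 possible dimension combinations). I think the main problem is using the df.loc lookup as the insert method.
Do you have any suggestions for improving performance? I would also be happy with a different approach that gives the same result.
Test setup code: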
## create data to simulate
import pandas as pd
import random
## create test dataframes
df1 = pd.DataFrame({'A':[1,2,3,4,5,6,7,8,9,10,11,12]})
df2 = pd.DataFrame({'B':["r","s","t","u","v","w","x","y","z"]})
df3 = pd.DataFrame({'C':["a","b","c","d","e","f","g","k","h"]})
numdays = 600
## pd.datetime is not available in recent pandas versions, so use pd.Timestamp
date_list = pd.date_range(pd.Timestamp.today(), periods=numdays).tolist()
df4 = pd.DataFrame({'date':pd.to_datetime(date_list)})
df4['date'] = df4['date'].dt.date
## add dummy keys
df1['key'] = 0
df2['key'] = 0
df3['key'] = 0
df4['key'] = 0
## merge all together
dfn = df1.merge(df2, how='outer',on="key")
dfn = dfn.merge(df3, how='outer',on="key")
dfn = dfn.merge(df4, how='outer',on="key")
## drop dummy key
dfn.drop(columns=['key'],inplace=True)
## add vector
dfn['dim_vector'] = dfn.apply(lambda row: str(row.A) + '_' + row.B + '_' + row.C, axis=1)
## add random labels
dfn['label'] = dfn.apply(lambda x: random.randrange(0,10, 1),axis=1)
## set date as index
dfn = dfn.set_index(dfn['date'])
My (slow) solution:
def add_last_n_days_avg_with_days_at_index(df, match_on_col='dim_vector', label_col='label', count_of_days=7, round_to=0):
    vectors = df[match_on_col].unique()
    new_label_col_name = label_col + '_' + str(count_of_days) + 'D'
    for vector in vectors:
        # work on a copy of the rows belonging to one dimension combination
        chunk = df.loc[df[match_on_col] == vector].copy()
        chunk[new_label_col_name] = chunk[label_col].rolling(count_of_days, min_periods=count_of_days).mean()
        # shift by one row so the current day is excluded from its own average
        chunk[new_label_col_name] = chunk[new_label_col_name].shift()
        df.loc[df[match_on_col] == vector, new_label_col_name] = round(chunk[new_label_col_name], round_to)

add_last_n_days_avg_with_days_at_index(df=dfn, match_on_col='dim_vector', label_col='label', count_of_days=7, round_to=0)
dfn.head(50)
Result when there are only 9 days per combination:
date A B C date dim_vector label label_7D
2018-12-14 1 r a 2018-12-14 1_r_a 1 NaN
2018-12-15 1 r a 2018-12-15 1_r_a 1 NaN
2018-12-16 1 r a 2018-12-16 1_r_a 0 NaN
2018-12-17 1 r a 2018-12-17 1_r_a 3 NaN
2018-12-18 1 r a 2018-12-18 1_r_a 0 NaN
2018-12-19 1 r a 2018-12-19 1_r_a 6 NaN
2018-12-20 1 r a 2018-12-20 1_r_a 7 NaN
2018-12-21 1 r a 2018-12-21 1_r_a 3 3.0
2018-12-22 1 r a 2018-12-22 1_r_a 0 3.0
2018-12-14 1 r b 2018-12-14 1_r_b 5 NaN
2018-12-15 1 r b 2018-12-15 1_r_b 2 NaN
2018-12-16 1 r b 2018-12-16 1_r_b 5 NaN
2018-12-17 1 r b 2018-12-17 1_r_b 2 NaN
2018-12-18 1 r b 2018-12-18 1_r_b 3 NaN
2018-12-19 1 r b 2018-12-19 1_r_b 0 NaN
2018-12-20 1 r b 2018-12-20 1_r_b 8 NaN
2018-12-21 1 r b 2018-12-21 1_r_b 2 4.0
2018-12-22 1 r b 2018-12-22 1_r_b 2 3.0
You don't even need a loop or chunks to apply the rolling function. It is much simpler than that.
def add_last_n_days_avg_with_days_at_index(df,
        label_col='label', count_of_days=7, round_to=0):
    new_label_col_name = label_col + '_' + str(count_of_days) + 'D'
    # create a new column, apply mean and round
    df[new_label_col_name] = df[label_col].rolling(count_of_days).mean().round(round_to)

# I removed match_on_col parameter
add_last_n_days_avg_with_days_at_index(df=dfn, label_col='label', count_of_days=7, round_to=0)
Result:
A B C date dim_vector label label_7D
date
2018-12-14 1 r a 2018-12-14 1_r_a 1 NaN
2018-12-15 1 r a 2018-12-15 1_r_a 7 NaN
2018-12-16 1 r a 2018-12-16 1_r_a 8 NaN
2018-12-17 1 r a 2018-12-17 1_r_a 7 NaN
2018-12-18 1 r a 2018-12-18 1_r_a 5 NaN
2018-12-19 1 r a 2018-12-19 1_r_a 7 NaN
2018-12-20 1 r a 2018-12-20 1_r_a 1 5.0
2018-12-21 1 r a 2018-12-21 1_r_a 6 6.0
2018-12-22 1 r a 2018-12-22 1_r_a 9 6.0
2018-12-23 1 r a 2018-12-23 1_r_a 1 5.0
2018-12-24 1 r a 2018-12-24 1_r_a 1 4.0
2018-12-25 1 r a 2018-12-25 1_r_a 0 4.0
2018-12-26 1 r a 2018-12-26 1_r_a 3 3.0
2018-12-27 1 r a 2018-12-27 1_r_a 0 3.0
2018-12-28 1 r a 2018-12-28 1_r_a 0 2.0
2018-12-29 1 r a 2018-12-29 1_r_a 9 2.0
2018-12-30 1 r a 2018-12-30 1_r_a 1 2.0
2018-12-31 1 r a 2018-12-31 1_r_a 2 2.0
2019-01-01 1 r a 2019-01-01 1_r_a 1 2.0
2019-01-02 1 r a 2019-01-02 1_r_a 9 3.0
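One caveat worth checking: a rolling window applied to the whole column does not know about dim_vector, so at the start of each new combination the window would still contain rows from the previous one. The rows shown above all belong to 1_r_a, so the effect would not be visible there. A minimal sketch on made-up data (the groups g1/g2 and the column names plain and grouped are hypothetical, chosen only for illustration) compares the two behaviours:

```python
import pandas as pd

# Two hypothetical groups with three consecutive rows each.
df = pd.DataFrame({
    "dim_vector": ["g1"] * 3 + ["g2"] * 3,
    "label": [1, 2, 3, 10, 20, 30],
})

# Rolling over the whole column: the window runs across the g1/g2 boundary.
df["plain"] = df["label"].rolling(2).mean()

# Rolling inside each group: every group starts with its own NaN warm-up.
df["grouped"] = df.groupby("dim_vector")["label"].transform(
    lambda s: s.rolling(2).mean()
)

print(df)
```

In the plain column, the first g2 row is averaged with the last g1 row, while in the grouped column it stays NaN. If the averages must not mix combinations, the per-group form is probably the one to use.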
The following code should do what you are trying to achieve:
new_df = dfn['label'].groupby(dfn['dim_vector'], group_keys=False).rolling(8).mean().reset_index()
pd.merge(left=dfn, right=new_df, left_on=['dim_vector', dfn.index], right_on=['dim_vector', 'date']).rename(columns={'label_x': 'label', 'label_y': 'label_rolling'}).reindex(columns=['date', 'A', 'B', 'C', 'dim_vector', 'label', 'label_rolling'])
which gives the output:
date A B C dim_vector label label_rolling
0 2024-06-11 1 r a 1_r_a 8 NaN
1 2024-06-12 1 r a 1_r_a 3 NaN
2 2024-06-13 1 r a 1_r_a 5 NaN
3 2024-06-14 1 r a 1_r_a 7 NaN
4 2024-06-15 1 r a 1_r_a 2 NaN
5 2024-06-16 1 r a 1_r_a 4 NaN
6 2024-06-17 1 r a 1_r_a 7 NaN
7 2024-06-18 1 r a 1_r_a 2 4.750
8 2024-06-19 1 r a 1_r_a 1 3.875
9 2024-06-11 1 r b 1_r_b 3 NaN
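If the target is the question's original definition (the mean of the previous n days, excluding the current day, computed separately for each dim_vector), a groupby with transform can write the column directly without the merge. This is only a minimal sketch, assuming the rows are already sorted by date within each dim_vector. The function name add_prev_n_days_avg and the small demo frame are hypothetical stand-ins for the real dfn.

```python
import numpy as np
import pandas as pd

def add_prev_n_days_avg(df, group_col="dim_vector", label_col="label", n=7, round_to=0):
    """Add <label_col>_<n>D: mean of the previous n rows within each group."""
    new_col = f"{label_col}_{n}D"
    rolled = df.groupby(group_col, sort=False)[label_col].transform(
        # full n-row window, then shift so the current row is excluded
        lambda s: s.rolling(n, min_periods=n).mean().shift()
    )
    # transform keeps the original row order, so assign by position;
    # this sidesteps alignment on the duplicated date index.
    df[new_col] = rolled.round(round_to).to_numpy()
    return df

# Small stand-in for dfn: two combinations, nine days each (assumed data).
rng = np.random.default_rng(0)
dates = pd.date_range("2018-12-14", periods=9).date
demo = pd.DataFrame({
    "date": np.tile(dates, 2),
    "dim_vector": np.repeat(["1_r_a", "1_r_b"], len(dates)),
    "label": rng.integers(0, 10, size=2 * len(dates)),
}).set_index("date")

add_prev_n_days_avg(demo, n=7)
print(demo)
```

With nine days per combination, the first seven rows of each group stay NaN and only the last two get a value, which matches the shape of the 9-day result in the question. On the real data the call would be add_prev_n_days_avg(dfn, n=7). This has not been timed on the full data set, but it avoids scanning the whole frame with a boolean mask once per vector.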