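I have some transaction data grouped by Date and FacilityID; the grouped result is shown below. I am trying to calculate the year-over-year change for a quarter: the total spend, summed across all facilities that have spend in all 3 months of the current quarter and in all 3 months of the same quarter of the previous year. So in this example, I only need the total spend of facility #1 for April through June 2024 compared with facility #1's total spend for April through June 2023 to get the change. Facility 2 should be excluded because it has no spend in April 2023 or April 2024.
This is the code I have tried so far, but it also includes facility 2, which should be excluded because it has no data for April 2024 or April 2023.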
import pandas as pd
import datetime

def open_file(path, quarter_number, months):
    df_raw = pd.DataFrame({'Date':["2024-04-01","2024-05-01","2024-06-01", "2024-06-01","2024-05-01","2023-04-01","2023-05-01","2023-06-01","2024-05-01","2024-06-01","2023-05-01","2023-06-01", "2023-04-01","2024-05-01","2024-06-01"],
                           'FacilityID': [1,1,1,1,1,1,1,1,2,2,2,2,3,4,4],
                           'TotalSpend': [100,110,120,50,70,90,100,110,150,140,120,60,90,190,150]
                           }).set_index('Date')
    df = df_raw.groupby(['Date', 'FacilityID'])['TotalSpend'].sum()
    print(df)
    cur_dates = []
    prev_dates = []
    for month in months:
        cur_date = datetime.date(2024, month, 1)
        prev_date = datetime.date(cur_date.year - 1, month, 1)
        cur_dates.append(cur_date.strftime('%Y-%m-%d'))
        prev_dates.append(prev_date.strftime('%Y-%m-%d'))
    cur_quarter_data = pd.concat(
        [df.loc[date] if date in df.index.levels[0] else pd.Series(dtype='float64') for date in cur_dates])
    prev_quarter_data = pd.concat(
        [df.loc[date] if date in df.index.levels[0] else pd.Series(dtype='float64') for date in prev_dates])
    common_facilities = cur_quarter_data.index.intersection(prev_quarter_data.index)
    cur_quarter_vals = cur_quarter_data.loc[common_facilities]
    prev_quarter_vals = prev_quarter_data.loc[common_facilities]
    yoy_change = (cur_quarter_vals.sum() - prev_quarter_vals.sum()) / prev_quarter_vals.sum() * 100
    return yoy_change

if __name__ == "__main__":
    change = open_file("path", 2, [4, 5, 6])
    print(change)
Example Code
import pandas as pd
df = pd.DataFrame({'Date':["2024-04-01","2024-05-01","2024-06-01", "2024-06-01","2024-05-01","2023-04-01","2023-05-01","2023-06-01","2024-05-01","2024-06-01","2023-05-01","2023-06-01", "2023-04-01","2024-05-01","2024-06-01"], 'FacilityID': [1,1,1,1,1,1,1,1,2,2,2,2,3,4,4], 'TotalSpend': [100,110,120,50,70,90,100,110,150,140,120,60,90,190,150]})
df
          Date  FacilityID  TotalSpend
0   2024-04-01           1         100
1   2024-05-01           1         110
2   2024-06-01           1         120
3   2024-06-01           1          50  <-- duplicated date
4   2024-05-01           1          70  <-- duplicated date
5   2023-04-01           1          90
6   2023-05-01           1         100
7   2023-06-01           1         110
8   2024-05-01           2         150
9   2024-06-01           2         140
10  2023-05-01           2         120
11  2023-06-01           2          60
12  2023-04-01           3          90
13  2024-05-01           4         190
14  2024-06-01           4         150
Your sample has duplicated dates. I assume this is intentional, so I will go ahead and combine them to get the result.
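As a small illustration of what combining duplicated dates means here, the sketch below sums rows that share the same facility and date. It uses only facility 1's 2024 rows from the sample; the names `sample` and `combined` are made up for this example and are not used in the solution code.

```python
import pandas as pd

# Facility 1's 2024 rows from the sample: May and June each appear twice
sample = pd.DataFrame({
    'Date': pd.to_datetime(['2024-04-01', '2024-05-01', '2024-06-01',
                            '2024-06-01', '2024-05-01']),
    'FacilityID': [1, 1, 1, 1, 1],
    'TotalSpend': [100, 110, 120, 50, 70],
})

# Sum the rows that share the same facility and date
combined = sample.groupby(['FacilityID', 'Date'], as_index=False)['TotalSpend'].sum()
print(combined)
```

After this step each facility has at most one row per month (100 for April, 180 for May, 170 for June), which is what makes counting months per quarter meaningful.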
Code
# Convert the 'Date' column to datetime
df['Date'] = pd.to_datetime(df['Date'])

# Group by facility and resample twice: first to monthly totals
# (min_count=1 leaves months without data as NaN), then to quarters,
# where 'count' is the number of months in the quarter that have data
tmp = (df.groupby('FacilityID')
         .resample('MS', on='Date')['TotalSpend']
         .sum(min_count=1)
         .reset_index()
         .groupby('FacilityID')
         .resample('QS', on='Date')['TotalSpend']
         .agg(['sum', 'count'])
)

# Shift the dates forward by 12 months so that each quarter of the
# previous year lines up with the same quarter of the current year
tmp_prev = (tmp.reset_index(level=0)
               .shift(freq='12MS')
               .reset_index()
)

# Merge the current and previous periods, keeping only rows where both counts == 3
out = (
    tmp.merge(tmp_prev, on=['Date', 'FacilityID'], how='left', suffixes=['_cur', '_prev'])
    [lambda x: x.pop('count_cur').eq(3) & x.pop('count_prev').eq(3)]
)
Output:
        Date  FacilityID  sum_cur  sum_prev
4 2024-04-01           1    450.0     300.0
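To get from these two sums to the percentage change the question asks for, the formula from the question's own code can be applied to the totals. A minimal sketch, assuming a frame with the `sum_cur` and `sum_prev` columns shown above; the frame is rebuilt by hand here so the snippet runs on its own, and `cur_total`, `prev_total`, and `yoy_change` are illustrative names:

```python
import pandas as pd

# Hand-built copy of the result row above, so this snippet is self-contained
out = pd.DataFrame({
    'Date': pd.to_datetime(['2024-04-01']),
    'FacilityID': [1],
    'sum_cur': [450.0],
    'sum_prev': [300.0],
})

# Total spend across all qualifying facilities, current vs. previous year
cur_total = out['sum_cur'].sum()
prev_total = out['sum_prev'].sum()

# Year-over-year change in percent: (450 - 300) / 300 * 100
yoy_change = (cur_total - prev_total) / prev_total * 100
print(yoy_change)  # 50.0
```

Summing before dividing means that, if several facilities qualify, the change is computed on their combined spend rather than averaged per facility, which matches the description in the question.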