I have a transaction dataset with, for example, the following columns: UserID, IBAN, Timestamp, Amount.
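I want to consider a 24-hour rolling window for each user while also taking the IBAN into account. For example, for each transaction I want to count how many transactions the same user made to the same IBAN in the preceding 24 hours.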
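To pin down what this feature means, here is a minimal brute-force sketch in plain Python (no pandas). The function name `rolling_count_sum` and the tuple layout are hypothetical, chosen only for illustration. It assumes the window covers the 24 hours strictly before each transaction, so the current transaction itself is not counted.

```python
from datetime import datetime, timedelta

def rolling_count_sum(rows, window=timedelta(hours=24)):
    """For each (user, iban, ts, amount) row, count and sum the other rows
    with the same user and IBAN whose timestamp lies in [ts - window, ts)."""
    out = []
    for user, iban, ts, _ in rows:
        prior = [a for u, i, t, a in rows
                 if u == user and i == iban and ts - window <= t < ts]
        out.append((len(prior), sum(prior)))
    return out

rows = [
    (1, "A", datetime(2024, 9, 1, 10), 100),
    (1, "B", datetime(2024, 9, 1, 11), 200),
    (1, "A", datetime(2024, 9, 1, 12), 150),
]
print(rolling_count_sum(rows))  # [(0, 0), (0, 0), (1, 100)]
```

This is quadratic in the number of rows, so it is only useful as a reference to check a vectorised version against on small samples.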
I have already done this with a single element in the group-by, considering only the UserID column, like this:
import pandas as pd
from typing import List, Union

def sort_and_reset(df_to_sort, by: Union[List[str], str] = "Timestamp", inplace: bool = True):
    """
    Sorts a DataFrame by Timestamp (unless otherwise specified) and then resets its index.
    With inplace=True both operations modify df_to_sort; otherwise a sorted copy is returned.
    """
    if inplace:
        df_to_sort.sort_values(by=by, inplace=True)
        df_to_sort.reset_index(drop=True, inplace=True)
        return df_to_sort
    df = df_to_sort.sort_values(by=by)
    df = df.reset_index(drop=True)
    return df
def add_rolling_count_feature(df: pd.DataFrame, time_column_name: str, groupby_on: str, agg_on: str, agg_period: str) -> pd.DataFrame:
    """
    Adds a new column to the DataFrame that represents the rolling count of occurrences and the sum of the amounts given a specified feature
    within a given time period for each group.
    Parameters:
    - df: pd.DataFrame
        The original DataFrame containing the data.
    - time_column_name: str
        The name of the column containing the time or datetime values.
    - groupby_on: str
        The name of the column to group the data by (e.g., user ID, card number).
    - agg_on: str
        The name of the column for which the rolling count is calculated, e.g. "Amount"
    - agg_period: str
        The rolling window period as a string (e.g., '30D' for 30 days, '24h' for 24 hours).
    Returns:
    - pd.DataFrame
        The original DataFrame with an additional column containing the rolling count of the specified feature.
    re(df, time_column_name='Time', groupby_on='UserID', agg_on='Amount', agg_period='30D')
    """
    # Ensure the time column is in datetime format. Check maybe not needed
    df[time_column_name] = pd.to_datetime(df[time_column_name])
    # sort_and_reset sorts and then resets the index
    df = sort_and_reset(df, by=["UserID", "Timestamp"], inplace=False)
    # Set the time column as the index
    df = df.set_index(time_column_name)
    # groupby_on is a single column here
    # Compute the rolling count of the specified feature
    df["Xa1"] = df.groupby(groupby_on)[agg_on] \
        .rolling(agg_period, min_periods=0, closed="left") \
        .count() \
        .reset_index(level=0, drop=True)
    # Compute the rolling sum and add it as a new column
    df['Xa2'] = df.groupby(groupby_on)[agg_on] \
        .rolling(agg_period, min_periods=0, closed='left') \
        .sum() \
        .reset_index(level=0, drop=True)  # Drop the groupby index level
    # Reset the index to bring the time column back as a regular column
    df = df.reset_index()
    return df
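However, when I try to group by two columns it does not work, or rather, it returns all NaN values (the window is correct, so it should return something). I also tried changing the call to .reset_index(drop=True) so that all index levels are dropped. I think there is some problem with the multi-index or something similar.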
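One way to check the multi-index suspicion is to look at the index that groupby(...).rolling(...) returns. The sketch below uses a small hypothetical frame (`demo`, not the real data) and shows that grouping by two columns yields three index levels, so dropping only level 0 leaves an index that no longer has the same shape as the frame's Timestamp index. That is consistent with the all-NaN column described above, although the exact behaviour on assignment may depend on the pandas version.

```python
import pandas as pd

# Hypothetical three-row frame, indexed by time as in the function above.
demo = pd.DataFrame({
    "UserID": [1, 1, 1],
    "IBAN": ["A", "B", "A"],
    "Timestamp": pd.to_datetime(
        ["2024-09-01 10:00", "2024-09-01 11:00", "2024-09-01 12:00"]),
    "Amount": [100, 200, 150],
}).set_index("Timestamp")

counts = (demo.groupby(["UserID", "IBAN"])["Amount"]
              .rolling("24h", min_periods=0, closed="left")
              .count())

# Two group keys plus the original index: three levels in total.
print(list(counts.index.names))   # ['UserID', 'IBAN', 'Timestamp']

# Dropping only level 0 leaves (IBAN, Timestamp), not a plain Timestamp index.
partial = counts.reset_index(level=0, drop=True)
print(partial.index.nlevels)      # 2

# Dropping every group level gives a plain Timestamp index, in group order.
aligned = counts.droplevel(["UserID", "IBAN"])
print(aligned.index.nlevels)      # 1
```

Even the fully dropped version only aligns unambiguously when timestamps are unique across the whole frame, which is not the case when several users transact at the same time.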
Example input df:
df_x = pd.DataFrame({
    'UserID': [1, 1, 1, 2, 2, 3, 3, 3],
    'Timestamp': [
        '2024-09-01 10:00:00',
        '2024-09-01 11:00:00',
        '2024-09-01 12:00:00',
        '2024-09-01 10:00:00',
        '2024-09-01 12:00:00',
        '2024-09-01 09:00:00',
        '2024-09-01 10:00:00',
        '2024-09-01 11:00:00'
    ],
    'IBAN': ['A', 'B', 'A', 'A', 'B', 'A', 'A', 'B'],
    'Amount': [100, 200, 150, 300, 250, 120, 180, 220]
})
df_x = add_rolling_count_feature(df_x, "Timestamp", "UserID", "Amount", "24h")
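Your original approach messes up the index. With two grouping columns, groupby.rolling returns a three-level index (both keys plus the timestamp), so dropping one level, or all of them, no longer lines up with the frame's timestamp index. Since you also have duplicated dates across groups, the safest approach is to merge the result back onto the data: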
def add_rolling_count_feature(df: pd.DataFrame, time_column_name: str,
                              groupby_on: Union[str, List[str]],
                              agg_on: str, agg_period: str) -> pd.DataFrame:
    # Accept a single column name or a list of column names
    keys = [groupby_on] if isinstance(groupby_on, str) else list(groupby_on)
    # Ensure the time column is in datetime format
    df[time_column_name] = pd.to_datetime(df[time_column_name])
    # Sort by the first grouping key, then by time, and reset the index
    df = sort_and_reset(df, by=[keys[0], time_column_name], inplace=False)
    # Prepare the groupby.rolling object
    r = (df.set_index(time_column_name)
           .groupby(keys)[agg_on]
           .rolling(agg_period, min_periods=0, closed='left')
        )
    # Compute count/sum, then merge back onto the input on the group keys and time
    stats = pd.concat({'Xa1': r.count(), 'Xa2': r.sum()}, axis=1).reset_index()
    return (df.merge(stats, on=[*keys, time_column_name], how='left')
              .set_index(df.index)
           )
Example, using only the raw input columns so that any Xa1/Xa2 columns left over from an earlier call do not interfere with the merge:
raw = df_x[['UserID', 'Timestamp', 'IBAN', 'Amount']].copy()
add_rolling_count_feature(raw, 'Timestamp', ['UserID', 'IBAN'], 'Amount', '24h')
   UserID           Timestamp IBAN  Amount  Xa1    Xa2
0       1 2024-09-01 10:00:00    A     100  0.0    0.0
1       1 2024-09-01 11:00:00    B     200  0.0    0.0
2       1 2024-09-01 12:00:00    A     150  1.0  100.0
3       2 2024-09-01 10:00:00    A     300  0.0    0.0
4       2 2024-09-01 12:00:00    B     250  0.0    0.0
5       3 2024-09-01 09:00:00    A     120  0.0    0.0
6       3 2024-09-01 10:00:00    A     180  1.0  120.0
7       3 2024-09-01 11:00:00    B     220  0.0    0.0
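As a possible alternative to merging, the rolling result can be written back by position if the frame is first sorted by the grouping keys and then by time, because groupby returns its groups in sorted key order and keeps the row order within each group. This is a sketch under that assumption and is not part of the approach above. The function name `add_rolling_features_positional` is hypothetical, and it assumes there are no missing values in the key columns (groupby drops NaN keys by default, which would break the length match).

```python
import pandas as pd

def add_rolling_features_positional(df, time_col, keys, agg_on, period):
    keys = [keys] if isinstance(keys, str) else list(keys)
    out = df.copy()
    out[time_col] = pd.to_datetime(out[time_col])
    # Sort by keys, then time: the row order now matches groupby's group order.
    out = out.sort_values([*keys, time_col]).reset_index(drop=True)
    r = (out.set_index(time_col)
            .groupby(keys)[agg_on]
            .rolling(period, min_periods=0, closed="left"))
    # Same length and order as `out`, so assign raw values without index alignment.
    out["Xa1"] = r.count().to_numpy()
    out["Xa2"] = r.sum().to_numpy()
    return out

df_demo = pd.DataFrame({
    "UserID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Timestamp": [
        "2024-09-01 10:00:00", "2024-09-01 11:00:00", "2024-09-01 12:00:00",
        "2024-09-01 10:00:00", "2024-09-01 12:00:00", "2024-09-01 09:00:00",
        "2024-09-01 10:00:00", "2024-09-01 11:00:00",
    ],
    "IBAN": ["A", "B", "A", "A", "B", "A", "A", "B"],
    "Amount": [100, 200, 150, 300, 250, 120, 180, 220],
})
res = add_rolling_features_positional(
    df_demo, "Timestamp", ["UserID", "IBAN"], "Amount", "24h")
print(res)
```

Note that the rows come back ordered by UserID, IBAN and Timestamp rather than by UserID and Timestamp. Unlike the merge, this version does not multiply rows when the same user has two transactions to the same IBAN at exactly the same timestamp.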