Pandas groupby on multiple columns + rolling + adding new columns

I have a dataset of transactions with, for example, the following columns:

UserID
IBAN
Timestamp
Amount

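For each user, I want to consider a 24-hour rolling window while also taking the IBAN into account. For example, for each transaction I want to count how many transactions the same user made to the same IBAN in the preceding 24 hours.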
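To pin down the target, here is a deliberately naive reference computation, a sketch only. It assumes the window is half-open, [t - 24h, t), so the current transaction is not counted; the helper name `naive_prior_count` and the three-row `toy` frame are made up for illustration.

```python
import pandas as pd


def naive_prior_count(df: pd.DataFrame, window: str = "24h") -> pd.Series:
    """For each row, count earlier rows with the same UserID and IBAN in [t - window, t)."""
    delta = pd.Timedelta(window)
    counts = []
    for _, row in df.iterrows():
        same_pair = (df["UserID"] == row["UserID"]) & (df["IBAN"] == row["IBAN"])
        in_window = (df["Timestamp"] >= row["Timestamp"] - delta) & (df["Timestamp"] < row["Timestamp"])
        counts.append(int((same_pair & in_window).sum()))
    return pd.Series(counts, index=df.index, name="prior_count")


# Hypothetical toy data: one user, two IBANs
toy = pd.DataFrame({
    "UserID": [1, 1, 1],
    "IBAN": ["A", "B", "A"],
    "Timestamp": pd.to_datetime(["2024-09-01 10:00", "2024-09-01 11:00", "2024-09-01 12:00"]),
    "Amount": [100, 200, 150],
})
prior = naive_prior_count(toy)
```

This loop is quadratic in the number of rows, so it is only useful as a correctness check on small samples, not as the actual feature computation.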

I have already done this with a single column in the groupby, considering only the UserID column, like this:

from typing import List, Union

import pandas as pd


def sort_and_reset(df_to_sort, by: Union[List[str], str] = "Timestamp", inplace: bool = True):
    """
    Sorts a DataFrame by Timestamp (unless otherwise specified), and then resets its indexes (both operations are performed inplace)
    """
    if inplace:
        df_to_sort.sort_values(by=by, inplace=True)
        df_to_sort.reset_index(drop=True, inplace=True)
    else:
        df = df_to_sort.sort_values(by=by)
        df = df.reset_index(drop=True)
        return df



def add_rolling_count_feature(df: pd.DataFrame, time_column_name: str, groupby_on: str, agg_on: str, agg_period: str) -> pd.DataFrame:
    """
    Adds a new column to the DataFrame that represents the rolling count of occurrences and the sum of the amounts given a specified feature 
    within a given time period for each group.

    Parameters:
    - df: pd.DataFrame
        The original DataFrame containing the data.
    - time_column_name: str
        The name of the column containing the time or datetime values.
    - groupby_on: str
        The name of the column to group the data by (e.g., user ID, card number).
    - agg_on: str
        The name of the column for which the rolling count is calculated e.g "Amount"
    - agg_period: str
        The rolling window period as a string (e.g., '30D' for 30 days, '24h' for 24 hours).

    Returns:
    - pd.DataFrame
       The original DataFrame with an additional column containing the rolling count of the specified feature.

    re(df, time_column_name='Time', groupby_on='UserID', agg_on='Amount', agg_period='30D')
    """
    # Ensure the time column is in datetime format.Check maybe not needed
    df[time_column_name] = pd.to_datetime(df[time_column_name])
    #sort and reset basically sort and then reset the indexes
    df = sort_and_reset(df, by=["UserID","Timestamp"], inplace=False)
 


    # Set the time column as the index
    df = df.set_index(time_column_name)

    #groupby_on is a single column here

    # Compute the rolling count of the specified feature 
    df["Xa1"] = df.groupby(groupby_on)[agg_on] \
                  .rolling(agg_period, min_periods=0, closed="left") \
                  .count() \
                  .reset_index(level=0, drop=True)
    
    # Compute the rolling sum and add it as a new column
    df['Xa2'] = df.groupby(groupby_on)[agg_on] \
                  .rolling(agg_period, min_periods=0, closed='left') \
                  .sum() \
                  .reset_index(level=0, drop=True)  # Drop the groupby index level


    # Reset the index to bring the time column back as a regular column
    df = df.reset_index()
    return df

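However, when I try to group by 2 columns, it doesn't work; more precisely, it returns all NaN values (the window is correct, so it should return something). I also changed the last step to

.reset_index(drop=True)
so that all index levels are dropped. I think the problem is related to the MultiIndex or something similar.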
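The NaN values can be reproduced in isolation. The sketch below uses a made-up three-row frame (not the real dataset) to show that the groupby.rolling result carries the group keys as extra index levels, and that after `reset_index(drop=True)` the result has a 0..n-1 index that shares no labels with the Timestamp-indexed frame.

```python
import pandas as pd

# Hypothetical three-row frame, indexed by Timestamp as in the function above
df = pd.DataFrame({
    "UserID": [1, 1, 1],
    "IBAN": ["A", "B", "A"],
    "Timestamp": pd.to_datetime([
        "2024-09-01 10:00:00",
        "2024-09-01 11:00:00",
        "2024-09-01 12:00:00",
    ]),
    "Amount": [100, 200, 150],
}).set_index("Timestamp")

counts = (
    df.groupby(["UserID", "IBAN"])["Amount"]
      .rolling("24h", min_periods=0, closed="left")
      .count()
)

# The group keys are prepended to the original index: UserID, IBAN, Timestamp
index_names = list(counts.index.names)

# Dropping every level leaves a RangeIndex; assignment aligns on labels,
# and no integer label matches a timestamp, so every value becomes NaN
flat = counts.reset_index(drop=True)
df["Xa1"] = flat
all_nan = bool(df["Xa1"].isna().all())
```

With a single group key, `reset_index(level=0, drop=True)` happens to leave a Timestamp index in the same order as the sorted frame, which is presumably why the one-column version appeared to work.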

Example input df:

df_x = pd.DataFrame({
    'UserID': [1, 1, 1, 2, 2, 3, 3, 3],
    'Timestamp': [
        '2024-09-01 10:00:00',
        '2024-09-01 11:00:00',
        '2024-09-01 12:00:00',
        '2024-09-01 10:00:00',
        '2024-09-01 12:00:00',
        '2024-09-01 09:00:00',
        '2024-09-01 10:00:00',
        '2024-09-01 11:00:00'
    ],
    'IBAN': ['A', 'B', 'A', 'A', 'B', 'A', 'A', 'B'],
    'Amount': [100, 200, 150, 300, 250, 120, 180, 220]
})

df_x = add_rolling_count_feature(df_x,"Timestamp","UserID","Amount","24h")

1 Answer

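Your original approach messes up the index. The groupby.rolling result is indexed by the group keys plus Timestamp, so once index levels are dropped it no longer aligns with the frame it is assigned to. Since you also have duplicated dates across groups, the safest approach is to merge the result back onto the data: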

def add_rolling_count_feature(df: pd.DataFrame, time_column_name: str, groupby_on: Union[str, List[str]], agg_on: str, agg_period: str) -> pd.DataFrame:
    # Ensure the time column is in datetime format.Check maybe not needed
    df[time_column_name] = pd.to_datetime(df[time_column_name])
    #sort and reset basically sort and then reset the indexes
    df = sort_and_reset(df, by=["UserID","Timestamp"], inplace=False)

    # prepare the groupby.rolling
    r = (df.set_index(time_column_name)
           .groupby(groupby_on)[agg_on]
           .rolling(agg_period, min_periods=0, closed='left')
      )

    # compute count/sum and merge to input
    return (df.merge(pd.concat({'Xa1': r.count(), 'Xa2': r.sum()}, axis=1)
                       .reset_index(), how='left')
              .set_index(df.index)
            )

Example:

add_rolling_count_feature(df_x, 'Timestamp', ['UserID', 'IBAN'], 'Amount', '24h')

   UserID           Timestamp IBAN  Amount  Xa1    Xa2
0       1 2024-09-01 10:00:00    A     100  0.0    0.0
1       1 2024-09-01 11:00:00    B     200  0.0    0.0
2       1 2024-09-01 12:00:00    A     150  1.0  100.0
3       2 2024-09-01 10:00:00    A     300  0.0    0.0
4       2 2024-09-01 12:00:00    B     250  0.0    0.0
5       3 2024-09-01 09:00:00    A     120  0.0    0.0
6       3 2024-09-01 10:00:00    A     180  1.0  120.0
7       3 2024-09-01 11:00:00    B     220  0.0    0.0
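For a quick standalone check, the same merge-back idea can be run end to end on the sample data. This is a sketch rather than a drop-in replacement: the variable names (`sample`, `keys`, `stats`, `out`) are illustrative, the merge keys are spelled out with `on=`, and `validate="one_to_one"` is added as an assumption-checking guard that raises if the same (UserID, IBAN, Timestamp) combination occurs more than once.

```python
import pandas as pd

sample = pd.DataFrame({
    "UserID": [1, 1, 1, 2, 2, 3, 3, 3],
    "Timestamp": pd.to_datetime([
        "2024-09-01 10:00:00", "2024-09-01 11:00:00", "2024-09-01 12:00:00",
        "2024-09-01 10:00:00", "2024-09-01 12:00:00",
        "2024-09-01 09:00:00", "2024-09-01 10:00:00", "2024-09-01 11:00:00",
    ]),
    "IBAN": ["A", "B", "A", "A", "B", "A", "A", "B"],
    "Amount": [100, 200, 150, 300, 250, 120, 180, 220],
}).sort_values(["UserID", "Timestamp"]).reset_index(drop=True)

keys = ["UserID", "IBAN"]

# Rolling window per (UserID, IBAN) pair; closed="left" excludes the current row
r = (
    sample.set_index("Timestamp")
          .groupby(keys)["Amount"]
          .rolling("24h", min_periods=0, closed="left")
)

# One row per (UserID, IBAN, Timestamp), with the two new features as columns
stats = pd.concat({"Xa1": r.count(), "Xa2": r.sum()}, axis=1).reset_index()

# Merge on explicit keys; validate raises if any key combination is duplicated
out = sample.merge(stats, on=keys + ["Timestamp"], how="left", validate="one_to_one")
```

Here user 1's 12:00 transaction to IBAN A sees exactly one earlier transaction to A (amount 100), while the 11:00 transaction to IBAN B sees none. If exact duplicate transactions are possible in the real data, the merge keys would need an extra tie-breaking column.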