Optimizing DataFrame processing of user login intervals with Pandas

Question

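I am developing a function that processes user login times on a website. The times are stored in a Pandas DataFrame in which the first column holds the time interval and the remaining columns indicate whether each user was logged in during that interval. My program takes this DataFrame and groups the users, creating one column for every possible combination of users. It then checks whether those users were connected during the time interval defined by each row.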

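For example, with 3 users A, B and C, if A and B are logged in on a given row, the combined column A,B will be 1 while the individual columns A and B will be 0. If all three users are active at the same time, the column A,B,C will be 1 and all the others will be 0.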

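In my real case there are many columns, so the exponential cost of the function makes it prohibitive. I have been trying to write code that finds groups that never coincide, to avoid building redundant columns. For example, if B and C never share a row where both are 1, there is no point in generating the columns B,C or A,B,C.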
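One way to check the "never coincide" idea is a pairwise co-occurrence matrix. This is a minimal sketch, assuming the login columns hold only 0/1 values; the `logins` frame and the `never_together` name are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data: B and C never share a row with value 1
logins = pd.DataFrame({
    'A': [1, 1, 1, 0],
    'B': [1, 0, 0, 1],
    'C': [0, 1, 1, 0],
})

# Entry (i, j) counts the rows where users i and j are both logged in
co_occurrence = logins.T @ logins

# Pairs whose count is zero never coincide, so any combination
# containing such a pair can be skipped
never_together = [
    (a, b)
    for i, a in enumerate(logins.columns)
    for b in logins.columns[i + 1:]
    if co_occurrence.loc[a, b] == 0
]
print(never_together)  # [('B', 'C')]
```

This only prunes by pairs. A combination of three or more users can still be empty even when every pair inside it coincides somewhere.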

I tried GitHub Copilot, but it did not produce a useful solution. Can anyone help me optimize my code?

Here is the code I am using:

def process_dataframe_opt2(df):
    # List of columns for combinations, excluding 'fecha_hora'
    columns = [col for col in df.columns if col != 'fecha_hora']
    
    # Generate all possible combinations of the columns
    for r in range(1, len(columns) + 1):
        for comb in combinations(columns, r):
            col_name = ','.join(comb)
            df[col_name] = df[list(comb)].all(axis=1).astype(int)
    
    # Create a copy of the original DataFrame to modify it
    df_copy = df.copy()
    
    # Process combinations from largest to smallest
    for r in range(len(columns), 1, -1):
        for comb in combinations(columns, r):
            col_name = ','.join(comb)
            active_rows = df[col_name] == 1
            if active_rows.any():
                for sub_comb in combinations(comb, r-1):
                    sub_col_name = ','.join(sub_comb)
                    df_copy.loc[active_rows, sub_col_name] = 0
    
    # Remove columns that only contain 0
    df_copy = df_copy.loc[:, (df_copy != 0).any(axis=0)]
    
    return df_copy

And this generates an example:

import pandas as pd
from itertools import combinations
# Create a range of time
date_rng = pd.date_range(start='2024-05-13 15:52:00', end='2024-05-13 16:04:00', freq='min')

# Create an empty DataFrame
df = pd.DataFrame(date_rng, columns=['fecha_hora'])

# Add the login columns with corresponding values
df['A'] = 1  # Always active
df['B'] = [1] * 6 + [0] * 5 + [1] * 2  # Active in the first 6 and last 2 intervals
df['C'] = [1] * 5 + [0] * 6 + [1] * 2  # Active in the first 5 and last 2 intervals
df['D'] = [1] * 4 + [0] * 7 + [1] * 2  # Active in the first 4 and last 2 intervals
df['E'] = [1] * 3 + [0] * 3 + [1] * 2 + [0] * 3 + [1]*2  # Active in two blocks
df['F'] = [0] * 7 + [1] * 3 + [0] * 3  # Active in a single block towards the end
df['alfa'] = [0] * 10 + [1] * 1 + [0] * 2

# Adjust some rows to have more than one '1'
df.loc[1, ['A', 'B', 'C']] = 1  # Row with multiple '1's
df.loc[8, ['D', 'E', 'F']] = 1  # Another row with multiple '1's

df_copy = process_dataframe_opt2(df)

Can anyone offer insights or suggestions on how to optimize this function to avoid the exponential cost and improve performance?

Answer 1

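Every record of the expected result contains exactly one field equal to 1, located in the column named after the active users, while the remaining fields equal 0. So we can build this table as follows: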

Replace each record with the sequence of its active users;
Convert the resulting sequence into a DataFrame using pivot, unstack, get_dummies, etc.

Example with get_dummies:

result = pd.get_dummies(
    df.astype(bool).apply(lambda x: ','.join(df.columns[x]), axis=1),
    dtype=int
)

Example with unstack:

result = (
    df.astype(bool)
    .apply(lambda x: ','.join(df.columns[x]), axis=1)   # select active users
    .to_frame('users')                                            
    .set_index('users', append=True)       # push users to the second index level
    .assign(mark=1)                        # mark records before pivoting
    .squeeze()
    .unstack(fill_value=0)
)
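The list of options above also names pivot, which the answer does not spell out. A rough sketch of how that variant might look, assuming the time stamps are unique and sit in the index; the tiny `df` and the column names `time`, `users` and `mark` are illustrative, not taken from the answer:

```python
import pandas as pd

# Hypothetical three-row frame with the time stamps as the index
df = pd.DataFrame(
    {'A': [1, 1, 1], 'B': [1, 0, 1], 'C': [0, 0, 1]},
    index=pd.date_range('2024-05-13 15:52:00', periods=3, freq='min'),
)

# Label each row with the comma-joined names of its active users
labels = df.astype(bool).apply(
    lambda row: ','.join(df.columns[row.to_numpy()]), axis=1
)

# Long form: one (time, users, mark) record per row, then pivot to wide
long_form = pd.DataFrame(
    {'time': labels.index, 'users': labels.to_numpy(), 'mark': 1}
)
result = (
    long_form.pivot(index='time', columns='users', values='mark')
    .fillna(0)       # combinations absent from a row become 0
    .astype(int)
)
print(result)
```

Compared with the unstack version, pivot leaves missing cells as NaN, hence the extra fillna and astype steps.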

Test code:

import pandas as pd

data = { 
    'A': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
    'B': [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    'C': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1],
    'D': [1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 1],
    'E': [1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1],
    'F': [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0],
    'alfa': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]
}
index = pd.date_range(
    start='2024-05-13 15:52:00', 
    periods=len(data['A']), freq='min'
)
df = pd.DataFrame(data, index)

# option with get dummies
result1 = pd.get_dummies(
    df.astype(bool).agg(lambda x: ','.join(df.columns[x]), axis=1)
    , dtype=int
)

# option with unstacking
result2 = (
    df.astype(bool)
    .apply(lambda x: ','.join(df.columns[x]), axis=1)
    .to_frame(None)
    .set_index(None, append=True)
    .assign(just_mark=1).squeeze()
    .unstack(fill_value=0)
)

assert result1.equals(result2)

>>> print(result1)
                     A,B  A,B,C  A,B,C,D  A,B,C,D,E  A,D,E,F  A,E  A,E,F  A,F  A,alfa
2024-05-13 15:52:00    0      0        0          1        0    0      0    0       0
2024-05-13 15:53:00    0      0        0          1        0    0      0    0       0
2024-05-13 15:54:00    0      0        0          1        0    0      0    0       0
2024-05-13 15:55:00    0      0        1          0        0    0      0    0       0
2024-05-13 15:56:00    0      1        0          0        0    0      0    0       0
2024-05-13 15:57:00    1      0        0          0        0    0      0    0       0
2024-05-13 15:58:00    0      0        0          0        0    1      0    0       0
2024-05-13 15:59:00    0      0        0          0        0    0      1    0       0
2024-05-13 16:00:00    0      0        0          0        1    0      0    0       0
2024-05-13 16:01:00    0      0        0          0        0    0      0    1       0
2024-05-13 16:02:00    0      0        0          0        0    0      0    0       1
2024-05-13 16:03:00    0      0        0          1        0    0      0    0       0
2024-05-13 16:04:00    0      0        0          1        0    0      0    0       0
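The test code above keeps the time stamps in the index, whereas the question stores them in a 'fecha_hora' column. A possible wrapper for that layout is sketched below; the function name `group_active_users` and the miniature `sample` frame are hypothetical, and it assumes every other column holds 0/1 flags.

```python
import pandas as pd

def group_active_users(df, time_col='fecha_hora'):
    """Return one indicator column per user combination that actually occurs."""
    # Move the time column into the index so only the 0/1 login flags remain
    flags = df.set_index(time_col).astype(bool)
    # Label each row with the comma-joined names of its active users
    labels = flags.apply(
        lambda row: ','.join(flags.columns[row.to_numpy()]), axis=1
    )
    # One column per distinct label, so never more columns than input rows
    return pd.get_dummies(labels, dtype=int)

# Hypothetical miniature of the question's frame
sample = pd.DataFrame({
    'fecha_hora': pd.date_range('2024-05-13 15:52:00', periods=3, freq='min'),
    'A': [1, 1, 1],
    'B': [1, 0, 1],
})
grouped = group_active_users(sample)

# Every row should carry exactly one 1
assert (grouped.sum(axis=1) == 1).all()
print(grouped)
```

Because only combinations that occur in the data get a column, the number of columns is bounded by the number of rows rather than growing exponentially with the number of users. A row with no active user would end up under a column named with the empty string.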
