Pandas: creating complex rolling calculations within a groupby

Question

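I have a Pandas dataset containing phone call logs, from multiple caller numbers to multiple destination numbers. Each call is logged as soon as it completes.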

Main features:

unix_timestamp (index)
caller_number
dest_number (US toll-free number)
real_duration

Sample df

unix_ts	caller_number	dest_number	real_duration
1674567050.6435	16175962600.0	18448213248	435.0
1680545624.27747	14103914538.0	18775057141	497.0
1681923808.21773	19182890899.0	18006485010	132.0
1684535428.48401	15202197200.0	18883507446	450.0
1697056646.38694	13236327145.0	18888544812	390.0
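For anyone who wants to reproduce the problem, the sample rows above can be loaded roughly as follows. This is only a sketch: the column name caller_number and the extra call_time column are assumptions for illustration, while dest_number and real_duration follow the names used in the code further down.

```python
import pandas as pd

# The five sample rows from the table above.
rows = [
    (1674567050.6435, 16175962600.0, 18448213248, 435.0),
    (1680545624.27747, 14103914538.0, 18775057141, 497.0),
    (1681923808.21773, 19182890899.0, 18006485010, 132.0),
    (1684535428.48401, 15202197200.0, 18883507446, 450.0),
    (1697056646.38694, 13236327145.0, 18888544812, 390.0),
]

# "caller_number" is an assumed column name; the other names match the question's code.
df = pd.DataFrame(
    rows, columns=["unix_ts", "caller_number", "dest_number", "real_duration"]
)

# Use the Unix timestamp as the index and keep the calls in chronological order.
df = df.set_index("unix_ts").sort_index()

# Optional, hypothetical helper column: a human-readable timestamp.
df["call_time"] = pd.to_datetime(df.index, unit="s")

print(df)
```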

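I need to identify destination numbers that cap call duration, specifically those where at least the last five calls have the same duration (a flat line).

The pseudocode for identifying flat-line cases is as follows:

Split stage: groupby dest_number, without applying any calculation
Apply stage: for each individual frame after the split, add 5 new columns holding the shifted values of real_duration, then compute, row-wise across those columns, the mean and standard deviation of real_duration over the last 5 calls. Then add a column ratio_pre = std / mean.
Combine stage: the output is a df indexed by dest_number with one column holding the last_ratio_pre of each frame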
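The three stages above could be sketched with a grouped rolling window instead of explicit shifted columns. This is a minimal illustration under stated assumptions, not a drop-in replacement: the toy data and the WINDOW constant are hypothetical, and the window here covers exactly 5 calls including the current one.

```python
import pandas as pd

# Hypothetical toy data: destination 111 ends with five identical durations.
df = pd.DataFrame({
    "unix_ts": list(range(1, 13)),
    "dest_number": [111] * 6 + [222] * 6,
    "real_duration": [300.0, 450.0, 450.0, 450.0, 450.0, 450.0,
                      10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
}).set_index("unix_ts").sort_index()

WINDOW = 5  # assumed number of most recent calls per window

# Split + apply: rolling mean and std of real_duration within each destination.
rolling = df.groupby("dest_number")["real_duration"].rolling(WINDOW)
ratio_pre = rolling.std() / rolling.mean()

# Combine: keep the last available ratio for each destination.
last_ratio_pre = (
    ratio_pre.groupby(level="dest_number").last().rename("last_ratio_pre")
)

print(last_ratio_pre)
```

Two differences from the shifted-column approach are worth keeping in mind. Stacking real_duration with five shifted copies spans six calls, so WINDOW = 6 would mirror that more closely. Also, rolling(WINDOW) yields NaN until a group has WINDOW calls, whereas mean(axis=1) and std(axis=1) skip missing values.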

After these transformations, the resulting dataframe should look like this:

dest_number	real_duration_sum	last_ratio_pre
18008259452	107278.0	2.1310975127055425
18773310081	94171.0	0.001336663795067271
18885436765	56977.0	0.03930810818390873
18009423141	40031.0	0.14554363343695811
18886450451	22803.0	0.9798228336647965

After extensive research, my code implementing this solution is as follows:


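# create a pandas groupby object, without applying any calculation
# (a scalar key keeps the group labels scalar when iterating)
df_dest_temp = df.groupby('dest_number', sort=True)

# assumption: df_dest is not defined in the original post; here it is taken to be
# the per-destination summary frame holding the summed real_duration
df_dest = df_dest_temp[['real_duration']].sum()

# add column last_ratio_pre and fill it with ones as placeholders
df_dest['last_ratio_pre'] = 1.0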

# define how many previous values
control = 5

# iterate over groupby components, and assign to df_dest one by one *** SUPER SLOW!! ***
for num, frame in df_dest_temp:
    
    # create a temp df (copy, so the new columns are not written to a view of frame)
    df_temp = frame[['dest_number', 'real_duration']].copy()
    
    # create 5 new columns that hold the shifted values of real_duration
    for i in range(control):
        df_temp['pre_'+str(i+1)] = df_temp.real_duration.shift(i+1)

    # select only the columns related to real_duration
    cols = ['real_duration'] + [col for col in df_temp.columns if col.startswith('pre_')]

    # calculate mean of last 5 real_durations, by columns
    mean_pre  = df_temp[cols].mean(axis=1)
    
    # calculate std of last 5 real_durations, by columns
    std_pre = df_temp[cols].std(axis=1)

    # assign results to new columns
    df_temp['mean_pre'] = mean_pre
    df_temp['std_pre']  = std_pre
    
    # calculate the ratio between std and mean
    df_temp['ratio_pre']  = std_pre / mean_pre

    # take each frame, group by dest_number and choose the last value of ratio_pre
    df_temp_grouped = df_temp.groupby('dest_number').ratio_pre.last()

    # assign result to the placeholder column in df_dest
    df_dest.loc[num, 'last_ratio_pre'] = df_temp_grouped.iloc[0]

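As you can see, the code above solves the problem, but it is very slow. I have not found a way to fit all of this logic into a lambda function or to express it in a vectorized way.

Comments, criticism, and suggestions for making the code more efficient are of course welcome.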

Answer 1

Does this work?

import pandas as pd

# sample data
data = {
    "unix_ts": [1, 2, 3, 4, 5, 6],
    "dest_number": [111, 111, 222, 222, 333, 333],
    "real_duration": [100, 100, 10, 20, 1000, 0],
}

df = pd.DataFrame(data)

# sort (most recent first)
df = df.sort_values(by="unix_ts", ascending=False)

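# calculate ratio of std to mean of real_duration for the last 5 calls for each dest_number
# (std / mean, as in the question; a flat line then gives 0 rather than a division by zero)
def func(group):
    last_five = group.iloc[:5]  # last 5 calls (rows are sorted most recent first)
    ratio = last_five["real_duration"].std() / last_five["real_duration"].mean()
    return ratio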


df.groupby("dest_number").apply(func)
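If calling a Python function per group turns out to be slow on a large log, the same idea might be expressed with built-in groupby operations only. The following is a sketch under assumptions: the toy data, the N_CALLS and FLAT_THRESHOLD constants, and the n_calls and flat_line columns are all hypothetical names introduced for illustration.

```python
import pandas as pd

# Hypothetical toy data: 111 is flat over its last five calls, 222 varies,
# and 333 has too few calls to judge.
df = pd.DataFrame({
    "unix_ts": list(range(1, 15)),
    "dest_number": [111] * 6 + [222] * 6 + [333] * 2,
    "real_duration": [300, 120, 120, 120, 120, 120,
                      10, 20, 30, 40, 50, 60,
                      90, 90],
})

N_CALLS = 5            # assumed number of most recent calls to inspect
FLAT_THRESHOLD = 0.01  # assumed cut-off for std / mean; tune for real data

# Oldest first, so tail(N_CALLS) picks the most recent calls of each group.
df = df.sort_values("unix_ts")
grouped = df.groupby("dest_number")["real_duration"]
last_calls = df.groupby("dest_number").tail(N_CALLS)
stats = last_calls.groupby("dest_number")["real_duration"].agg(["mean", "std"])

# One row per destination: total duration, call count, and std / mean.
summary = pd.DataFrame({
    "real_duration_sum": grouped.sum(),
    "n_calls": grouped.size(),
    "last_ratio_pre": stats["std"] / stats["mean"],
})

# Flag a flat line only when enough calls exist and the ratio is near zero.
summary["flat_line"] = (summary["n_calls"] >= N_CALLS) & (
    summary["last_ratio_pre"] < FLAT_THRESHOLD
)

print(summary)
```

The n_calls guard matters because a destination with only two identical calls also has a ratio of zero, which on its own says little about a duration cap.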
