Pandas: creating a complex rolling calculation inside groupby

Question (votes: 0, answers: 1)

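I have a Pandas dataset containing phone call logs, from multiple caller numbers to multiple destination numbers. Each call is logged as soon as it completes.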

Main features:

  • unix_timestamp (index)
  • caller number
  • dest_number (US toll-free number)
  • real_duration
Sample df

unix_ts           caller_number  dest_number  real_duration
1674567050.6435   16175962600.0  18448213248  435.0
1680545624.27747  14103914538.0  18775057141  497.0
1681923808.21773  19182890899.0  18006485010  132.0
1684535428.48401  15202197200.0  18883507446  450.0
1697056646.38694  13236327145.0  18888544812  390.0

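I need to identify the destination numbers that cap call duration, specifically those where at least the last five calls all have the same duration (a flat line).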
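To make the flat-line idea concrete, here is a minimal sketch using made-up durations (not taken from the real logs). When the last five calls all have the same duration, the standard deviation is zero, so the ratio std / mean is zero too. The helper name `std_mean_ratio` is hypothetical.

```python
import pandas as pd


def std_mean_ratio(durations):
    """Return the sample standard deviation divided by the mean."""
    s = pd.Series(durations, dtype="float64")
    return s.std() / s.mean()


# Made-up durations for illustration only.
flat = [450.0, 450.0, 450.0, 450.0, 450.0]    # capped calls: identical durations
varied = [435.0, 497.0, 132.0, 450.0, 390.0]  # ordinary calls: durations differ

print(std_mean_ratio(flat))    # zero for a flat line
print(std_mean_ratio(varied))  # clearly above zero
```

A ratio at or very close to zero is therefore the signal to look for in each destination number's most recent calls.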

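The pseudocode for identifying flat-line cases is as follows:

  • Split stage: groupby dest_number, without applying any calculation
  • Apply stage: for each individual frame produced by the split, add 5 new columns holding shifted values of real_duration, then compute, row-wise across those columns, the mean and standard deviation of the last 5 calls' real_duration. Then add a column ratio_pre = std / mean.
  • Combine stage: the output is a df indexed by dest_number with one column holding the last_ratio_pre of each frame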
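The three stages above can be sketched without an explicit Python loop by using a grouped rolling window. This is only a sketch under stated assumptions: the data is made up, `window = 5` is taken from the prose, `min_periods=2` is my choice so that a standard deviation is defined, and the output column name `real_duration_sum` is hypothetical.

```python
import pandas as pd

# Made-up sample data; dest_number and real_duration follow the question's names.
df = pd.DataFrame({
    "unix_ts": range(1, 15),
    "dest_number": [111] * 7 + [222] * 7,
    "real_duration": [300.0, 120.0, 450.0, 450.0, 450.0, 450.0, 450.0,
                      100.0, 250.0, 80.0, 400.0, 60.0, 330.0, 150.0],
})

window = 5  # number of most recent calls to inspect (assumption from the prose)

# Oldest call first, so the last row of each group is the most recent call.
df = df.sort_values("unix_ts")

# Split + apply: a rolling window evaluated separately inside each dest_number.
rolled = df.groupby("dest_number")["real_duration"].rolling(window, min_periods=2)
ratio = rolled.std() / rolled.mean()

# Combine: keep only the final ratio of each group, indexed by dest_number.
last_ratio_pre = ratio.groupby(level="dest_number").tail(1).droplevel(1)

summary = df.groupby("dest_number")["real_duration"].sum().to_frame("real_duration_sum")
summary["last_ratio_pre"] = last_ratio_pre
print(summary)
```

Because rolling statistics are computed incrementally, a flat line may come out as a tiny non-zero number on some pandas versions, so comparing the ratio against a small tolerance is safer than testing for exactly zero.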

After these transformations, the resulting dataframe should look like this:

dest_number  real_duration (sum)  last_ratio_pre
18008259452  107278.0             2.1310975127055425
18773310081  94171.0              0.001336663795067271
18885436765  56977.0              0.03930810818390873
18009423141  40031.0              0.14554363343695811
18886450451  22803.0              0.9798228336647965

After extensive research, the code I wrote to implement this solution is as follows:


import pandas as pd

# assumption: df_dest is the per-destination summary frame, indexed by dest_number
df_dest = df.groupby('dest_number')[['real_duration']].sum()

# create a pandas groupby object, without applying any calculation
df_dest_temp = df.groupby('dest_number', sort=True)

# add column last_ratio_pre and fill it with ones as placeholders
df_dest['last_ratio_pre'] = 1.0

# define how many previous values
control = 5

# iterate over groupby components, and assign to df_dest one by one *** SUPER SLOW!! ***
for num, frame in df_dest_temp:

    # create a temp df (a copy, so that adding columns does not touch df)
    df_temp = frame[['dest_number', 'real_duration']].copy()

    # create 5 new columns that hold the shifted values of real_duration
    for i in range(control):
        df_temp[f'pre_{i + 1}'] = df_temp['real_duration'].shift(i + 1)

    # select only the columns related to real_duration
    cols = ['real_duration'] + [col for col in df_temp.columns if col.startswith('pre_')]

    # row-wise mean and std over the current call plus the `control` previous ones
    mean_pre = df_temp[cols].mean(axis=1)
    std_pre = df_temp[cols].std(axis=1)

    # assign results to new columns
    df_temp['mean_pre'] = mean_pre
    df_temp['std_pre'] = std_pre

    # calculate the ratio between std and mean
    df_temp['ratio_pre'] = std_pre / mean_pre

    # the last row of the frame holds the ratio for the most recent calls
    df_dest.loc[num, 'last_ratio_pre'] = df_temp['ratio_pre'].iloc[-1]

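As you can see, the code above solves the problem, but it is extremely slow. I have not found a way to fit all of this logic into a lambda function or to vectorize it.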

Comments, criticism, and suggestions for making the code more efficient are of course welcome.

python pandas group-by vectorization split-apply-combine
1 answer
0 votes

Does this work?

import pandas as pd

# sample data
data = {
    "unix_ts": [1, 2, 3, 4, 5, 6],
    "dest_number": [111, 111, 222, 222, 333, 333],
    "real_duration": [100, 100, 10, 20, 1000, 0],
}

df = pd.DataFrame(data)

# sort (most recent first)
df = df.sort_values(by="unix_ts", ascending=False)

# calculate the ratio of std to mean of real_duration for the last 5 calls of each dest_number
def func(durations):
    last = durations.iloc[:5]  # the 5 most recent calls (rows are sorted newest first)
    return last.std() / last.mean()


# select the column first so that func receives a Series per dest_number
df.groupby("dest_number")["real_duration"].apply(func)
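If the per-group Python function turns out to be the bottleneck, the same idea can be expressed with built-in grouped operations only. The following is a sketch rather than a benchmarked solution: it reuses the sample data from this answer, and the names `n_calls`, `recent`, `stats`, and `tolerance` are hypothetical.

```python
import pandas as pd

# Same sample data as in the answer above.
df = pd.DataFrame({
    "unix_ts": [1, 2, 3, 4, 5, 6],
    "dest_number": [111, 111, 222, 222, 333, 333],
    "real_duration": [100, 100, 10, 20, 1000, 0],
})

n_calls = 5       # how many of the most recent calls to inspect
tolerance = 1e-9  # assumption: ratios below this count as a flat line

# Sort oldest first, then keep the n most recent rows of each destination.
recent = df.sort_values("unix_ts").groupby("dest_number").tail(n_calls)

# Built-in aggregations avoid calling a Python function once per group.
stats = recent.groupby("dest_number")["real_duration"].agg(["mean", "std"])
stats["last_ratio_pre"] = stats["std"] / stats["mean"]

# Destinations whose recent calls all have (almost) the same duration.
flat = stats.index[stats["last_ratio_pre"] < tolerance]
print(stats)
print(list(flat))
```

Note that this looks at exactly the last `n_calls` calls per destination, whereas the loop in the question combines the current call with five shifted ones, so the two can differ slightly for destinations with more than five calls.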
