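I have a Pandas dataset containing phone call logs from multiple caller numbers to multiple destination numbers. Each call is logged as soon as it completes.
Key characteristics:
Sample df: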
unix_ts | caller_number | dest_number | real_duration |
---|---|---|---|
1674567050.6435 | 16175962600.0 | 18448213248 | 435.0 |
1680545624.27747 | 14103914538.0 | 18775057141 | 497.0 |
1681923808.21773 | 19182890899.0 | 18006485010 | 132.0 |
1684535428.48401 | 15202197200.0 | 18883507446 | 450.0 |
1697056646.38694 | 13236327145.0 | 18888544812 | 390.0 |
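I need to identify the destination numbers that cap call duration, specifically those where at least the last five calls have the same duration (a flatline).
The pseudocode for identifying flatline cases is as follows:
After these transformations, the resulting dataframe should look like this: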
dest_number | real_duration_sum | last_ratio_pre |
---|---|---|
18008259452 | 107278.0 | 2.1310975127055425 |
18773310081 | 94171.0 | 0.001336663795067271 |
18885436765 | 56977.0 | 0.03930810818390873 |
18009423141 | 40031.0 | 0.14554363343695811 |
18886450451 | 22803.0 | 0.9798228336647965 |
After extensive research, this is the code I wrote to implement the solution:
# assumption: df_dest holds the per-destination totals, indexed by dest_number
df_dest = df.groupby('dest_number')[['real_duration']].sum()
# create a pandas groupby object, without applying any calculation
df_dest_temp = df.groupby('dest_number', sort=True)
# add column last_ratio_pre and fill it with ones as placeholders
df_dest['last_ratio_pre'] = 1.0
# define how many previous values to look back on
control = 5
# iterate over groupby components, and assign to df_dest one by one *** SUPER SLOW!! ***
for num, frame in df_dest_temp:
    # create a temp df (copied, so adding columns does not trigger SettingWithCopyWarning)
    df_temp = frame[['dest_number', 'real_duration']].copy()
    # create 5 new columns that hold the shifted values of real_duration
    for i in range(control):
        df_temp['pre_' + str(i + 1)] = df_temp.real_duration.shift(i + 1)
    # select only the columns related to real_duration
    cols = ['real_duration'] + [col for col in df_temp.columns if col.startswith('pre_')]
    # row-wise mean of the current real_duration and its 5 shifted values
    mean_pre = df_temp[cols].mean(axis=1)
    # row-wise std of the current real_duration and its 5 shifted values
    std_pre = df_temp[cols].std(axis=1)
    # assign results to new columns
    df_temp['mean_pre'] = mean_pre
    df_temp['std_pre'] = std_pre
    # calculate the ratio between std and mean
    df_temp['ratio_pre'] = std_pre / mean_pre
    # take each frame, group by dest_number and choose the last value of ratio_pre
    df_temp_grouped = df_temp.groupby('dest_number').ratio_pre.last()
    # assign result to df_dest (same column that was created as a placeholder above)
    df_dest.loc[num, 'last_ratio_pre'] = df_temp_grouped.iloc[0]
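As you can see, the code above solves the problem, but it is very slow. I have not found a way to fold all of this into a lambda function or to express it in a vectorized way.
I would welcome any comments, critiques, and suggestions for making the code more efficient.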
Does this work?
import pandas as pd
# sample data
data = {
    "unix_ts": [1, 2, 3, 4, 5, 6],
    "dest_number": [111, 111, 222, 222, 333, 333],
    "real_duration": [100, 100, 10, 20, 1000, 0],
}
df = pd.DataFrame(data)
# sort (most recent first)
df = df.sort_values(by="unix_ts", ascending=False)
# calculate the ratio of std to mean of real_duration (as in the question) for the last 5 calls of each dest_number
def func(durations):
    last_calls = durations.iloc[:5]  # last 5 calls, since rows are sorted most recent first
    # a ratio of 0 means every one of these calls had the same duration
    return last_calls.std() / last_calls.mean()
df.groupby("dest_number")["real_duration"].apply(func)
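If the per-group `apply` is still too slow on a large log, the same idea can be sketched with built-in groupby aggregations only. This is a minimal sketch, not a drop-in replacement: the helper name `summarize_destinations`, the `is_flat` column and the sample data are hypothetical, the output column names are taken from the target table in the question, and the window is assumed to be the five most recent calls (the loop in the question actually compares the current call with its five predecessors, which would correspond to `window=6`).

```python
import pandas as pd

# Hypothetical sample data; column names follow the code in the question.
df = pd.DataFrame({
    "unix_ts": range(1, 15),
    "dest_number": [111] * 6 + [222] * 6 + [333] * 2,
    "real_duration": [30, 60, 60, 60, 60, 60,
                      10, 20, 30, 40, 50, 60,
                      100, 300],
})


def summarize_destinations(calls: pd.DataFrame, window: int = 5) -> pd.DataFrame:
    """Total duration, std/mean ratio and flatline flag per destination,
    with the ratio and flag computed over the most recent `window` calls."""
    # Oldest first, so that tail() picks the most recent calls of each group.
    ordered = calls.sort_values("unix_ts")
    recent = ordered.groupby("dest_number").tail(window)
    stats = recent.groupby("dest_number")["real_duration"].agg(
        ["mean", "std", "nunique", "count"]
    )
    out = (
        ordered.groupby("dest_number")["real_duration"]
        .sum()
        .to_frame("real_duration_sum")
    )
    out["last_ratio_pre"] = stats["std"] / stats["mean"]
    # Flat only if a full window of calls exists and all share one duration.
    out["is_flat"] = (stats["count"] >= window) & (stats["nunique"] == 1)
    return out.sort_values("real_duration_sum", ascending=False)


result = summarize_destinations(df)
print(result)
```

The `nunique` check is kept alongside the ratio because the ratio alone is undefined (NaN) when every recent call has a duration of 0, and it is also NaN for destinations with a single call.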