Python: a 3-sigma "anomaly" detector for order quantities and line counts


I want to analyze transaction data in an e-commerce setting, focusing on detecting atypical activity in order patterns. The data is grouped by a customer identifier, SoldTo, and for each group we apply a simple statistical technique to detect anomalies based on order quantity and line count. Specifically, the steps are:

  1. Data preparation: make sure the date column (Created_on) is in the correct format.
  2. Group the data: group by the SoldTo field to isolate each customer's transactions.
  3. Rolling-window calculations: for each group, compute the rolling mean and standard deviation of order quantity and line count.
  4. Apply the 3-sigma rule: use the 3-sigma rule to identify transactions that deviate markedly from normal levels and flag them as atypical or suspicious.
  5. Independent processing: process each customer group (SoldTo) independently, so that detection for one customer is not influenced by data from other customers.
  6. Combine results: after processing, write the data to a .csv file.
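
The per-group rolling 3-sigma rule can be sketched on a toy frame first (made-up data and column names, just to illustrate the idea, not my real data):

```python
import pandas as pd

# Made-up data: one obvious quantity spike per customer
toy = pd.DataFrame({
    'SoldTo': ['A'] * 4 + ['B'] * 4,
    'qty':    [10, 10, 10, 95, 3, 3, 3, 40],
})

# Rolling mean/std and upper 3-sigma threshold, computed per customer
g = toy.groupby('SoldTo')['qty']
toy['mean']  = g.transform(lambda s: s.rolling(3, min_periods=1).mean())
toy['std']   = g.transform(lambda s: s.rolling(3, min_periods=1).std(ddof=0))
toy['upper'] = toy['mean'] + 3 * toy['std']

# A row is atypical if it exceeds the previous row's threshold in its group
toy['flag'] = toy['qty'].gt(toy.groupby('SoldTo')['upper'].shift())
print(toy[toy['flag']].index.tolist())  # -> [3, 7], the two spike rows
```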

My problem: as a test, I fed known anomalous lines for a single SoldTo, and the code detects them as expected! However, when I introduce multiple SoldTos (including the one with the known anomalous lines), those lines are no longer detected. Why is that?

Here is my attempt, with my code and (I hope) two easy-to-load datasets: one with a single SoldTo, on which my code detects the known anomalous transactions, and another with two (2) SoldTos, on which my code no longer detects the known anomalous lines when 2+ SoldTos are used together.

import pandas as pd

# Create DataFrame ('fraud' is the raw dataset; to load the list-of-lists
# samples shown below, you could use: pd.DataFrame(fraud[1:], columns=fraud[0]))
df = pd.DataFrame(fraud)

df['Created_on'] = pd.to_datetime(df['Created_on'])

# Group by 'SoldTo' and 'Created_on'
grouped = df.groupby(['SoldTo', 'Created_on', 'Sales_Doc']).agg(
    total_quantity=('Order_Quantity', 'sum'),
    line_count=('Sales_Doc', 'count')    # Modified this line so the provided data sets can be used.   Thanks @Timus
).reset_index()

# Compute rolling statistics and 3-sigma for each SoldTo group
grouped['avg_line'] = grouped.groupby('SoldTo')['line_count'].transform(lambda x: x.rolling(3, min_periods=1).mean())
grouped['ma_qty'] = grouped.groupby('SoldTo')['total_quantity'].transform(lambda x: x.rolling(3, min_periods=1).mean())
grouped['stDev_of_qty'] = grouped.groupby('SoldTo')['total_quantity'].transform(lambda x: x.rolling(3, min_periods=1).std(ddof=0))
grouped['stDev_of_lines'] = grouped.groupby('SoldTo')['line_count'].transform(lambda x: x.rolling(3, min_periods=1).std(ddof=0))

# Compute the 3-sigma thresholds
grouped['avg_qty_sigma_trigger'] = ((3 * grouped['stDev_of_qty']) + grouped['ma_qty'])
grouped['avg_line_sigma_trigger'] = ((3 * grouped['stDev_of_lines']) + grouped['avg_line'])

# Function to identify atypical rows based on 3-sigma rule within each SoldTo group
def identify_atypical(df):
    atypical_indices = []

    for sold_to, group in df.groupby('SoldTo'):
#        group = group.reset_index(drop=True) # Removed this line. Thx @Timus
        
        for i in range(len(group) - 1):
            current_row = group.iloc[i]
            next_row = group.iloc[i + 1]

            if (next_row['line_count'] > current_row['avg_line_sigma_trigger'] or
                next_row['total_quantity'] > current_row['avg_qty_sigma_trigger']):
                atypical_indices.append(group.index[i + 1])

    # Mark atypical rows in the dataframe
    df['is_atypical'] = False
    df.loc[atypical_indices, 'is_atypical'] = True

    return df, atypical_indices


# Identify atypical rows
grouped, atypical_indices = identify_atypical(grouped)

# Print the dataframe and indices of atypical rows
print("Atypical rows indices:", atypical_indices)
print("")

print(grouped)

# Filter atypical rows within a specified date range
#check_these = grouped[(grouped['is_atypical'] == True) & (grouped['Created_on'] >= '2024-06-01')]
check_these = grouped[(grouped['is_atypical'] == True) & (grouped['total_quantity'] != 1) & (grouped['line_count'] != 1) ]
#check_these = grouped[(grouped['is_atypical'] == True)]

# Save the cleaned dataframe to a CSV file
check_these.sort_values(by='SoldTo', ascending=True).to_csv('order_behavior_analysis_3.csv', index=False)

With this data, which has only one SoldTo, the code returns results as desired: Atypical rows indices: [5, 6, 11]

[['SoldTo', 'Created_on', 'Sales_Doc', 'Order_Quantity'],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 17],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 2],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 11],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 6],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 11],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 10],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902368, 33],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 10],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 4],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 8],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 6],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 16],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 4],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 1],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 8],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 3],
 ['59908158', Timestamp('2023-11-13 00:00:00'), 110966070, 52],
 ['59908158', Timestamp('2023-11-15 00:00:00'), 111035845, 15],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 18],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 5],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 20],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 11],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 8],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 16],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 12],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 3],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 7],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
 ['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 9],
 ['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 13]]

But with this data, which has two (2) SoldTo values, the code returns new rows but no longer "detects" the known atypical rows, and reports different indices: Atypical rows indices: [1, 6, 12, 14, 17, 5, 6, 11]

[['SoldTo', 'Created_on', 'Sales_Doc', 'Order_Quantity'],
 ['56619720', Timestamp('2023-01-13 00:00:00'), 108036530, 10],
 ['56619720', Timestamp('2023-01-13 00:00:00'), 108036530, 1],
 ['56619720', Timestamp('2023-03-03 00:00:00'), 108391209, 20],
 ['56619720', Timestamp('2023-03-03 00:00:00'), 108391209, 2],
 ['56619720', Timestamp('2023-04-13 00:00:00'), 108738953, 30],
 ['56619720', Timestamp('2023-07-24 00:00:00'), 109827151, 20],
 ['56619720', Timestamp('2023-09-20 00:00:00'), 110467726, 30],
 ['56619720', Timestamp('2023-10-11 00:00:00'), 110658107, 10],
 ['56619720', Timestamp('2023-11-10 00:00:00'), 110946376, 2],
 ['56619720', Timestamp('2023-11-10 00:00:00'), 110946376, 3],
 ['56619720', Timestamp('2023-11-10 00:00:00'), 110946376, 5],
 ['56619720', Timestamp('2023-12-13 00:00:00'), 111681360, 5],
 ['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 6],
 ['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 4],
 ['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 2],
 ['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 2],
 ['56619720', Timestamp('2024-01-25 00:00:00'), 112057996, 5],
 ['56619720', Timestamp('2024-02-23 00:00:00'), 112322261, 12],
 ['56619720', Timestamp('2024-03-07 00:00:00'), 112453024, 5],
 ['56619720', Timestamp('2024-03-25 00:00:00'), 112625572, 5],
 ['56619720', Timestamp('2024-03-25 00:00:00'), 112625572, 3],
 ['56619720', Timestamp('2024-03-27 00:00:00'), 112651496, 2],
 ['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 5],
 ['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 5],
 ['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 2],
 ['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 2],
 ['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 2],
 ['56619720', Timestamp('2024-05-09 00:00:00'), 113200232, 2],
 ['56619720', Timestamp('2024-05-22 00:00:00'), 113359192, 2],
 ['56619720', Timestamp('2024-06-10 00:00:00'), 113534221, 1],
 ['56619720', Timestamp('2024-06-10 00:00:00'), 113534221, 34],
 ['56619720', Timestamp('2024-06-10 00:00:00'), 113534221, 20],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 17],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 2],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 11],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 6],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 11],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 10],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902368, 33],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 10],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 4],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 8],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 6],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 16],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 4],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 1],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 8],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 3],
 ['59908158', Timestamp('2023-11-13 00:00:00'), 110966070, 52],
 ['59908158', Timestamp('2023-11-15 00:00:00'), 111035845, 15],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 18],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 5],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 20],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 11],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 8],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 16],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 12],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 3],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 7],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
 ['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 9],
 ['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 13]]

Thanks for any help explaining a way forward.

python pandas statistics

1 Answer

As I wrote in the comments, I'm fairly sure that resetting the group index inside identify_atypical messes up the final update of the original dataframe, so I suggest you try the following instead:

def identify_atypical(df):
    atypical_idxs = set()
    for _, group in df.groupby('SoldTo'):
        # Compare each row against the previous row's thresholds within its group
        m = (
            group['line_count'].gt(group['avg_line_sigma_trigger'].shift())
            | group['total_quantity'].gt(group['avg_qty_sigma_trigger'].shift())
        )
        atypical_idxs.update(group[m].index)
    return df.assign(is_atypical=df.index.isin(atypical_idxs)), atypical_idxs

(I couldn't run a full test because the example is incomplete, but it looks right on a simplified dataframe.)
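
As a sanity check: running the function on a small made-up frame with two customers (column names as in the question, thresholds computed the same rolling way) flags exactly one spike per customer, with indices that refer to the original frame:

```python
import pandas as pd

def identify_atypical(df):
    atypical_idxs = set()
    for _, group in df.groupby('SoldTo'):
        # Compare each row against the previous row's thresholds within its group
        m = (
            group['line_count'].gt(group['avg_line_sigma_trigger'].shift())
            | group['total_quantity'].gt(group['avg_qty_sigma_trigger'].shift())
        )
        atypical_idxs.update(group[m].index)
    return df.assign(is_atypical=df.index.isin(atypical_idxs)), atypical_idxs

# Made-up aggregated data: one quantity spike per customer
toy = pd.DataFrame({
    'SoldTo':         ['A', 'A', 'A', 'B', 'B', 'B'],
    'total_quantity': [10, 10, 90, 5, 5, 70],
    'line_count':     [2, 2, 2, 1, 1, 1],
})

# Rolling 3-sigma thresholds per SoldTo, as in the question
for col, trig in [('total_quantity', 'avg_qty_sigma_trigger'),
                  ('line_count', 'avg_line_sigma_trigger')]:
    grp = toy.groupby('SoldTo')[col]
    mean = grp.transform(lambda s: s.rolling(3, min_periods=1).mean())
    std = grp.transform(lambda s: s.rolling(3, min_periods=1).std(ddof=0))
    toy[trig] = mean + 3 * std

result, idxs = identify_atypical(toy)
print(sorted(idxs))  # -> [2, 5]
```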
