I want to analyze transaction data in an e-commerce setting, with a focus on detecting atypical activity in ordering patterns. The data is grouped by the customer identifier SoldTo, and for each group we apply a simple statistical technique to detect anomalies based on order quantity and line count. Specifically, the steps are: group the rows by SoldTo, Created_on, and Sales_Doc; compute rolling means and standard deviations of total quantity and line count within each SoldTo group; and flag any row that exceeds the previous row's mean-plus-3-sigma threshold.
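As a quick illustration of the rule with made-up numbers (a sketch, not part of the pipeline itself): a value is flagged when it exceeds the previous window's rolling mean plus three rolling standard deviations.

import pandas as pd

qty = pd.Series([4, 4, 2, 2, 17])                # made-up order quantities
ma = qty.rolling(3, min_periods=1).mean()        # rolling mean
sd = qty.rolling(3, min_periods=1).std(ddof=0)   # population standard deviation
trigger = ma + 3 * sd                            # 3-sigma threshold
print(qty.gt(trigger.shift()))                   # True only for the 17, which breaches the prior row's threshold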
My question: As a test, I fed anomalous lines for a single known SoldTo into the code, and the code detects the anomalous lines as expected! But when I introduce multiple SoldTo values (including the one with the known anomalous lines), those lines are no longer detected. Why is that?
Below is my code together with (I hope) two conveniently loadable data sets: one with a single SoldTo, for which my code detects the known anomalous transactions, and a second set with (2) SoldTo values, for which my code no longer detects the known anomalous lines once 2+ SoldTos are used together.
import pandas as pd

# Create DataFrame
df = pd.DataFrame(fraud)
df['Created_on'] = pd.to_datetime(df['Created_on'])

# Group by 'SoldTo', 'Created_on' and 'Sales_Doc'
grouped = df.groupby(['SoldTo', 'Created_on', 'Sales_Doc']).agg(
    total_quantity=('Order_Quantity', 'sum'),
    line_count=('Sales_Doc', 'count')  # Modified this line so the provided data sets can be used. Thanks @Timus
).reset_index()

# Compute rolling statistics and 3-sigma for each SoldTo group
grouped['avg_line'] = grouped.groupby('SoldTo')['line_count'].transform(lambda x: x.rolling(3, min_periods=1).mean())
grouped['ma_qty'] = grouped.groupby('SoldTo')['total_quantity'].transform(lambda x: x.rolling(3, min_periods=1).mean())
grouped['stDev_of_qty'] = grouped.groupby('SoldTo')['total_quantity'].transform(lambda x: x.rolling(3, min_periods=1).std(ddof=0))
grouped['stDev_of_lines'] = grouped.groupby('SoldTo')['line_count'].transform(lambda x: x.rolling(3, min_periods=1).std(ddof=0))

# Compute the 3-sigma thresholds
grouped['avg_qty_sigma_trigger'] = (3 * grouped['stDev_of_qty']) + grouped['ma_qty']
grouped['avg_line_sigma_trigger'] = (3 * grouped['stDev_of_lines']) + grouped['avg_line']

# Function to identify atypical rows based on the 3-sigma rule within each SoldTo group
def identify_atypical(df):
    atypical_indices = []
    for sold_to, group in df.groupby('SoldTo'):
        # group = group.reset_index(drop=True)  # Removed this line. Thx @Timus
        for i in range(len(group) - 1):
            current_row = group.iloc[i]
            next_row = group.iloc[i + 1]
            if (next_row['line_count'] > current_row['avg_line_sigma_trigger'] or
                    next_row['total_quantity'] > current_row['avg_qty_sigma_trigger']):
                atypical_indices.append(group.index[i + 1])
    # Mark atypical rows in the dataframe
    df['is_atypical'] = False
    df.loc[atypical_indices, 'is_atypical'] = True
    return df, atypical_indices

# Identify atypical rows
grouped, atypical_indices = identify_atypical(grouped)

# Print the dataframe and the indices of atypical rows
print("Atypical rows indices:", atypical_indices)
print("")
print(grouped)

# Filter atypical rows within a specified date range
#check_these = grouped[(grouped['is_atypical'] == True) & (grouped['Created_on'] >= '2024-06-01')]
check_these = grouped[(grouped['is_atypical'] == True) & (grouped['total_quantity'] != 1) & (grouped['line_count'] != 1)]
#check_these = grouped[(grouped['is_atypical'] == True)]

# Save the filtered dataframe to a CSV file
check_these.sort_values(by='SoldTo', ascending=True).to_csv('order_behavior_analysis_3.csv', index=False)
When this data with only one SoldTo is used, the code returns the desired result: Atypical rows indices: [5, 6, 11]
[['SoldTo', 'Created_on', 'Sales_Doc', 'Order_Quantity'],
['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 17],
['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 2],
['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 11],
['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 6],
['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 11],
['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 10],
['59908158', Timestamp('2023-11-07 00:00:00'), 110902368, 33],
['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 10],
['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 4],
['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 8],
['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 6],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 16],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 4],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 1],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 8],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 3],
['59908158', Timestamp('2023-11-13 00:00:00'), 110966070, 52],
['59908158', Timestamp('2023-11-15 00:00:00'), 111035845, 15],
['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 18],
['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 5],
['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 20],
['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 11],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 8],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 16],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 12],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 3],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 7],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 9],
['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 13]]
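To turn either pasted list-of-lists into the fraud DataFrame the code expects, one option (my assumption: the first inner list is the header row, and Timestamp is pandas' Timestamp) is:

import pandas as pd
from pandas import Timestamp

data = [
    ['SoldTo', 'Created_on', 'Sales_Doc', 'Order_Quantity'],
    ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
    # ... remaining rows exactly as pasted above ...
]
fraud = pd.DataFrame(data[1:], columns=data[0])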
But with this data, which has (2) SoldTo values, the code returns new rows yet no longer "detects" the known atypical lines, and it reports different indices: Atypical rows indices: [1, 6, 12, 14, 17, 5, 6, 11]
[['SoldTo', 'Created_on', 'Sales_Doc', 'Order_Quantity'],
['56619720', Timestamp('2023-01-13 00:00:00'), 108036530, 10],
['56619720', Timestamp('2023-01-13 00:00:00'), 108036530, 1],
['56619720', Timestamp('2023-03-03 00:00:00'), 108391209, 20],
['56619720', Timestamp('2023-03-03 00:00:00'), 108391209, 2],
['56619720', Timestamp('2023-04-13 00:00:00'), 108738953, 30],
['56619720', Timestamp('2023-07-24 00:00:00'), 109827151, 20],
['56619720', Timestamp('2023-09-20 00:00:00'), 110467726, 30],
['56619720', Timestamp('2023-10-11 00:00:00'), 110658107, 10],
['56619720', Timestamp('2023-11-10 00:00:00'), 110946376, 2],
['56619720', Timestamp('2023-11-10 00:00:00'), 110946376, 3],
['56619720', Timestamp('2023-11-10 00:00:00'), 110946376, 5],
['56619720', Timestamp('2023-12-13 00:00:00'), 111681360, 5],
['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 6],
['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 4],
['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 2],
['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 2],
['56619720', Timestamp('2024-01-25 00:00:00'), 112057996, 5],
['56619720', Timestamp('2024-02-23 00:00:00'), 112322261, 12],
['56619720', Timestamp('2024-03-07 00:00:00'), 112453024, 5],
['56619720', Timestamp('2024-03-25 00:00:00'), 112625572, 5],
['56619720', Timestamp('2024-03-25 00:00:00'), 112625572, 3],
['56619720', Timestamp('2024-03-27 00:00:00'), 112651496, 2],
['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 5],
['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 5],
['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 2],
['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 2],
['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 2],
['56619720', Timestamp('2024-05-09 00:00:00'), 113200232, 2],
['56619720', Timestamp('2024-05-22 00:00:00'), 113359192, 2],
['56619720', Timestamp('2024-06-10 00:00:00'), 113534221, 1],
['56619720', Timestamp('2024-06-10 00:00:00'), 113534221, 34],
['56619720', Timestamp('2024-06-10 00:00:00'), 113534221, 20],
['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 17],
['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 2],
['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 11],
['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 6],
['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 11],
['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 10],
['59908158', Timestamp('2023-11-07 00:00:00'), 110902368, 33],
['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 10],
['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 4],
['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 8],
['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 6],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 16],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 4],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 1],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 8],
['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 3],
['59908158', Timestamp('2023-11-13 00:00:00'), 110966070, 52],
['59908158', Timestamp('2023-11-15 00:00:00'), 111035845, 15],
['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 18],
['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 5],
['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 20],
['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 11],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 8],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 16],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 12],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 3],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 7],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 9],
['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 13]]
Thanks for any help explaining a path forward.
As I wrote in the comments, I'm fairly certain that the reset of the group index inside the identify_atypical function messes up the final update of the original dataframe. So I'd suggest you try the following:
def identify_atypical(df):
    atypical_idxs = set()
    for _, group in df.groupby('SoldTo'):
        # Shift the trigger columns so each row is compared against the
        # previous row's thresholds, keeping the original index intact
        m = (
            group['line_count'].gt(group['avg_line_sigma_trigger'].shift())
            | group['total_quantity'].gt(group['avg_qty_sigma_trigger'].shift())
        )
        atypical_idxs.update(group[m].index)
    return df.assign(is_atypical=df.index.isin(atypical_idxs)), atypical_idxs
(I couldn't run a full test since the example is incomplete, but it looks fine on a simplified dataframe.)
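To see the indexing behavior on a toy frame (all numbers and trigger values invented, not the data from the question): with two groups, the shift-based mask keeps the original dataframe index, so the reported positions refer to rows of the full frame rather than positions within each group.

import pandas as pd

# Toy dataframe with two SoldTo groups and precomputed trigger columns;
# the values are made up purely to demonstrate the indexing behavior.
toy = pd.DataFrame({
    'SoldTo':                 ['A', 'A', 'A', 'B', 'B', 'B'],
    'line_count':             [2, 2, 9, 1, 1, 8],
    'total_quantity':         [10, 12, 11, 5, 6, 50],
    'avg_line_sigma_trigger': [4.0, 4.0, 4.0, 3.0, 3.0, 3.0],
    'avg_qty_sigma_trigger':  [20.0, 20.0, 20.0, 9.0, 9.0, 9.0],
})

flagged, idxs = identify_atypical(toy)
print(sorted(idxs))                      # [2, 5]  (indices into the full frame)
print(flagged['is_atypical'].tolist())   # [False, False, True, False, False, True]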