Checking a dataframe for count discrepancies from one date to the next

Question

Suppose I have this data:

data = {'site': ['ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY', 'ACY'],
       'usage_date': ['2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-08-25', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01', '2019-09-01'],
       'item_id': ['COR30013', 'PAC10463', 'COR30018', 'PAC10958', 'PAC11188', 'PAC20467', 'COR20275', 'PAC20702', 'COR30020', 'PAC10137', 'PAC10445', 'COR30029', 'COR30025', 'PAC10457', 'COR10746', 'PAC11136', 'COR10346', 'PAC11050', 'PAC11132', 'PAC11135', 'PAC10964', 'COR10439', 'PAC11131', 'COR10695', 'PAC11128', 'COR10433', 'COR10432', 'PAC11051', 'PAC10137', 'COR10695', 'COR30029', 'COR10346', 'COR10432', 'COR10746', 'COR10439', 'COR10433', 'COR20275', 'COR30020', 'COR30018', 'PAC11135', 'PAC10964', 'PAC11136', 'PAC10445', 'PAC11050', 'PAC11132', 'PAC20467', 'PAC11188', 'PAC10463', 'PAC20702', 'PAC10457', 'PAC10958', 'PAC11051', 'PAC11128', 'PAC11131'],
       'start_count':[400.0, 96000.0, 315.0, 45000.0, 2739.0, 2232.0, 2800.0, 283500.0, 280.0, 200000.0, 96000.0, 481.0, 600.0, 18000.0, 400.0, 5500.0, 1200.0, 5850.0, 5500.0, 5500.0, 36000.0, 600.0, 5500.0, 550.0, 300.0, 4800.0, 1800.0, 1800.0, 108000.0, 500.0, 481.0, 1200.0, 1800.0, 400.0, 600.0, 3300.0, 2800.0, 455.0, 315.0, 5500.0, 36000.0, 5500.0, 96000.0, 5400.0, 5500.0, 2232.0, 2739.0, 96000.0, 283500.0, 18000.0, 72000.0, 1800.0, 300.0, 5500.0],
       'received_total': [0.0, 0.0, 0.0, 0.0, 3168.0, 0.0, 0.0, 0.0, 280.0, 0.0, 0.0, 0.0, 0.0, 0.0, 400.0, 0.0, 1800.0, 0.0, 0.0, 0.0, 0.0, 400.0, 0.0, 0.0, 0.0, 0.0, 0.0, 3600.0, 0.0, 0.0, 0.0, 1800.0, 2400.0, 400.0, 400.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1800.0, 0.0, 0.0, 3168.0, 0.0, 0.0, 0.0, 45000.0, 3600.0, 0.0, 0.0],
       'end_count': [240.0, 84000.0, 280.0, 27000.0, 3432.0, 2160.0, 2000.0, 90000.0, 455.0, 108000.0, 96000.0, 437.0, 500.0, 9000.0, 600.0, 5500.0, 1950.0, 4950.0, 5500.0, 5500.0, 36000.0, 600.0, 5500.0, 550.0, 270.0, 3300.0, 1200.0, 4200.0, 192000.0, 450.0, 350.0, 1890.0, 3600.0, 600.0, 525.0, 2835.0, 1600.0, 420.0, 187.0, 5500.0, 36000.0, 5500.0, 96000.0, 6750.0, 5500.0, 1992.0, 1881.0, 84000.0, 58500.0, 9000.0, 85500.0, 3300.0, 252.0, 5500.0]}

import pandas as pd

df_sample = pd.DataFrame(data=data)

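For each `item_id`, we need to check whether the current (9/1/2019) `end_count` is greater than the previous (8/25/2019) `end_count` while the current `received_total` is 0. If so, there is a counting error.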

I have this working code:

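def check_end_count(df):
    # Work on a copy so the caller's frame (or slice) is not modified in place.
    df = df.copy()
    for loc, df_loc in df.groupby(['site', 'item_id']):
        try:
            # Assumes each group is ordered by date: row 0 is previous, row 1 is current.
            ending_count_previous = df_loc['end_count'].iloc[0]
            ending_count_current = df_loc['end_count'].iloc[1]
            received_total_current = df_loc['received_total'].iloc[1]

            if ending_count_current > ending_count_previous and received_total_current == 0:
                label = "Ending count discrepancy"
            else:
                label = "Good Row"
        except IndexError:
            # The group has only one row, so there is no second date to compare.
            label = "Nothing to compare"

        # Assign by index label so each row gets its own group's result,
        # regardless of the row order in df.
        df.loc[df_loc.index, 'ending_count_check'] = label
    return df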

df_sample = check_end_count(df_sample)

But it is not very Pythonic. Also, in my case I have to check a whole series of dates, for which I have this list of date pairs:

print(sliding_window_dates[:3])

[array(['2019-08-25', '2019-09-01'], dtype=object),
 array(['2019-09-01', '2019-09-08'], dtype=object),
 array(['2019-09-08', '2019-09-15'], dtype=object)]

So what I want to do, on a larger dataframe, is the following:

df_list = []
for date1, date2 in sliding_window_dates:
    df_check = df_test[(df_test['usage_date'] == date1) | (df_test['usage_date'] == date2)]
    for loc, df_loc in df_check.groupby(['sort_center', 'item_id']):
        df_list.append(check_end_count(df_loc))

But once again I am doing this in two nested for loops, so I think there must be a better way. Any suggestions would be appreciated.

python pandas
1 Answer

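Whenever I see a problem that requires comparing a specific attribute across dates, my first thought is: "what is the right index for this dataframe?" In this case, a good index plus a little reshaping makes the problem very simple.

I did this: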

indexed = df_sample.set_index(["site", "item_id", "usage_date"]).unstack("usage_date")
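To make the reshaping concrete, here is a minimal sketch on a made-up four-row frame (the `toy` data and the `toy_indexed` name are illustrative assumptions, not part of the question's data). After `unstack("usage_date")` there is one row per `(site, item_id)` pair and one column per `(measure, date)` pair. Note that this reshape only works when each `(site, item_id, usage_date)` combination occurs at most once.

```python
import pandas as pd

# Hypothetical toy data: two items observed on two dates.
toy = pd.DataFrame({
    "site": ["ACY"] * 4,
    "item_id": ["A", "A", "B", "B"],
    "usage_date": ["2019-08-25", "2019-09-01"] * 2,
    "end_count": [10.0, 12.0, 7.0, 5.0],
    "received_total": [0.0, 0.0, 0.0, 0.0],
})

# Rows become (site, item_id); usage_date moves into the inner column level.
toy_indexed = toy.set_index(["site", "item_id", "usage_date"]).unstack("usage_date")

print(toy_indexed.shape)                          # rows x (measures * dates)
print(toy_indexed[("end_count", "2019-09-01")])   # one value per item
```

Selecting a column with a `(measure, date)` tuple returns a Series indexed by `(site, item_id)`, which is what makes the date-to-date comparison a one-liner.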

Then, with

current = '2019-09-01'
previous = '2019-08-25'

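we can state the condition almost one-to-one with the problem statement:

If the current ... `end_count` is greater than the previous ... `end_count` and the current `received_total` is 0 ... then there is a counting error.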

bad_rows = (indexed[("end_count", current)] > indexed[("end_count", previous)]) & (indexed[("received_total", current)] == 0)
indexed[bad_rows]

This gives:

              start_count            received_total             end_count
usage_date     2019-08-25 2019-09-01     2019-08-25 2019-09-01 2019-08-25 2019-09-01
site item_id
ACY  PAC10137    200000.0   108000.0            0.0        0.0   108000.0   192000.0
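If you also want a per-row label column like the `ending_count_check` in the question, one possible approach is to turn the boolean mask into labels and join them back onto the long frame. This is only a sketch on made-up data: `toy`, `comparable` and `labels` are hypothetical names, and the label strings are copied from the question's function.

```python
import pandas as pd

# Hypothetical toy data: A is a discrepancy, B is fine, C has only one date.
rows = [
    ("ACY", "A", "2019-08-25", 0.0, 10.0),
    ("ACY", "A", "2019-09-01", 0.0, 12.0),
    ("ACY", "B", "2019-08-25", 0.0, 7.0),
    ("ACY", "B", "2019-09-01", 0.0, 5.0),
    ("ACY", "C", "2019-08-25", 0.0, 3.0),
]
toy = pd.DataFrame(rows, columns=["site", "item_id", "usage_date",
                                  "received_total", "end_count"])

indexed = toy.set_index(["site", "item_id", "usage_date"]).unstack("usage_date")
previous, current = "2019-08-25", "2019-09-01"

# Same condition as above; comparisons against NaN evaluate to False.
bad = (indexed[("end_count", current)] > indexed[("end_count", previous)]) & (
    indexed[("received_total", current)] == 0
)
# Items that have an end_count on both dates.
comparable = (indexed[("end_count", previous)].notna()
              & indexed[("end_count", current)].notna())

# One label per (site, item_id).
labels = pd.Series("Good Row", index=indexed.index, name="ending_count_check")
labels[bad] = "Ending count discrepancy"
labels[~comparable] = "Nothing to compare"

# Attach the label to every row of the original long frame.
result = toy.join(labels, on=["site", "item_id"])
print(result[["item_id", "usage_date", "ending_count_check"]])
```

Because the labels are joined on the `site` and `item_id` columns, the row order of the original frame does not matter.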

Now, for the multi-date case, you can do this:

from itertools import pairwise  # available in Python 3.10+

for previous, current in pairwise(sorted(indexed.columns.levels[1])):
    indexed[("bad", current)] = (indexed[("end_count", current)] > indexed[("end_count", previous)]) & (indexed[("received_total", current)] == 0)
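As an alternative to looping over date pairs, the same check can be sketched on the long-format frame with `groupby(...).shift()`. This is a different technique from the unstack approach above, shown on made-up data (`toy` and `prev_end` are hypothetical names). It assumes the dates are ISO-formatted strings (so they sort chronologically) and that "previous" means the previous available row for that item; if an item skips a week, the comparison spans the gap.

```python
import pandas as pd

# Hypothetical toy data: two items, three consecutive weekly dates.
toy = pd.DataFrame({
    "site": ["ACY"] * 6,
    "item_id": ["A", "A", "A", "B", "B", "B"],
    "usage_date": ["2019-08-25", "2019-09-01", "2019-09-08"] * 2,
    "received_total": [0.0, 0.0, 5.0, 0.0, 0.0, 0.0],
    "end_count": [10.0, 12.0, 15.0, 7.0, 5.0, 6.0],
})

# Sort so that, within each item, rows run from oldest to newest.
toy = toy.sort_values(["site", "item_id", "usage_date"])

# end_count of the previous date for the same (site, item_id); NaN on the first date.
prev_end = toy.groupby(["site", "item_id"])["end_count"].shift()

# Flag rows where the count went up although nothing was received.
toy["bad"] = (toy["end_count"] > prev_end) & (toy["received_total"] == 0)
print(toy)
```

The first date of each item is never flagged, because comparing against the missing previous value yields `False`.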

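To get the dataframe back into its original (long) form, move the `usage_date` level from the columns back into the index with `.stack`, the inverse of the `.unstack("usage_date")` used above:

indexed.stack("usage_date").reset_index()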
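As a small sketch of the round trip on made-up data (`toy`, `wide` and `long_again` are hypothetical names): stacking the `usage_date` column level is the inverse of the earlier `unstack("usage_date")`, and `reset_index()` turns the index levels back into ordinary columns. Depending on the pandas version, `stack` may emit a `FutureWarning` about its implementation, and the handling of item/date combinations that were missing in the input (all-NaN rows) can differ, so treat this as an illustration rather than a guaranteed exact inverse.

```python
import pandas as pd

# Hypothetical toy data with every item present on every date.
toy = pd.DataFrame({
    "site": ["ACY"] * 4,
    "item_id": ["A", "A", "B", "B"],
    "usage_date": ["2019-08-25", "2019-09-01"] * 2,
    "end_count": [10.0, 12.0, 7.0, 5.0],
    "received_total": [0.0, 0.0, 0.0, 0.0],
})

# Long -> wide: usage_date becomes a column level.
wide = toy.set_index(["site", "item_id", "usage_date"]).unstack("usage_date")

# Wide -> long: move usage_date back into the index, then flatten the index.
long_again = wide.stack("usage_date").reset_index()
print(long_again)
```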