Speeding up my function that builds a bill of materials with pandas

Question (votes: 0, answers: 2)

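As a Python newcomer, I am writing code that builds a bill of materials (BOM). It reads the item IDs and purchase quantities that customers need from an order sheet in an Excel file, and uses a BOM sheet (in the same Excel file) to calculate the quantities of all raw materials required. After the stock of raw materials on hand is subtracted, the remaining demand is written to an output dictionary such as {id_material: quantity}. The BOM table is formatted as follows: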

item_id process_id process_No. IN/OUT material_id quantity_in quantity_out
A z420 1 12125 100
A z420 1 A-z512-2 100
A z512 2 A-z512-2 100
A z512 2 A-z600-3 120
A z600 3 A-z600-3 120
A z600 3 14551 -20
A z600 3 A 100

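attr: process_id: the ID of the process used

attr: process_No.: the order of the process within the process route. It is not always consecutive or regular like the natural numbers, e.g. (51, 60, 70, 100)

attr: IN/OUT: indicates whether the material is a raw material or an output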

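The pandas DataFrame I use looks the same, but with two extra attr columns: "count_demand", which holds the required quantity, and "flag", which my function uses to identify the materials that still need to be processed. Let's call it "df_demand".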

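I was able to write functions that serve my purpose, but the speed is not satisfactory. I tested with modules such as timeit and found that some operations take a lot of time, but I can't think of a way to optimize them, so I came here for help.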

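  1. The first part I'm unhappy with, and the most time-consuming part of the whole process, is a function that finds the raw materials of the currently demanded material and calculates their required quantities in proportion to the demanded quantity according to the BOM. The code is as follows: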
import numpy as np

def calculate_demand_raw(row, df_demand):
    try:
        if np.isnan(row['quantity_out']):
            raise ValueError('To avoid including those recycled materials with negative outputs')

        # Search for the index of the IN rows of the required generating process
        list_index = list(df_demand['item_id'].isin([row['item_id']]) &
                          df_demand['process_No.'].isin([row['process_No.']]) &
                          df_demand['IN/OUT'].isin(['IN']))
        index = [i for i, x in enumerate(list_index) if x]

        # Calculate the quantity of raw materials in proportion to the demand
        df_demand.loc[index, 'count_demand'] = (row['count_demand'] / row['quantity_out'] *
                                                df_demand.loc[index, 'quantity_in'])
        df_demand.loc[index, 'flag'] = 1

    except ValueError:
        pass  # The queried material is a base material with no generating process
    df_demand.loc[row.name, 'flag'] = 0

df_demand[df_demand['flag'].isin([1])].apply(lambda row: calculate_demand_raw(row, df_demand), axis=1)

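timeit tells me that, within this function, finding the indices of the matching rows takes three times as long as computing the raw-material quantities, and calculate_demand_raw is also the most time-consuming function in the loop. Can anyone reduce the index search time?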

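  2. The second most time-consuming function in the loop is the one that fills the aggregated raw-material demand into the 'count_demand' attr of df_demand, for the processes that generate raw-material demand.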
def fill_demand(row, qty_sum_demand, df_demand):
    df_demand.loc[row.name, 'count_demand'] += qty_sum_demand.loc[
                                qty_sum_demand['IN/OUT'].isin([row['IN/OUT']]).tolist(),
                                'count_demand'].tolist()
    df_demand.loc[row.name, 'flag'] = 1

df_demand.loc[index_generated_process].apply(lambda row:
                                fill_demand(row, qty_sum_demand, df_demand), axis=1)

Is the conditional search in the function what takes so long, just as in calculate_demand_raw? Can this operation be turned into a faster vectorized numpy operation?

Any help and suggestions are greatly appreciated.

python-3.x pandas numpy datatable
2 Answers
1
vote

A modified version of the example, to show some of the functionality:

import io
import pandas as pd

df = pd.read_csv(io.StringIO(
"""
item_id,process_id,process_No.,IN/OUT,material_id,quantity_in,quantity_out,flag
A,z420,1,IN,12125,100.0,,0
A,z420,1,OUT,A-z512-2,,100.0,0
A,z512,2,IN,A-z512-2,100.0,,0
A,z512,2,OUT,A-z600-3,,120.0,0
A,z600,3,IN,A-z600-2,,400,1
A,z600,3,IN,A-z600-3,120.0,200,1
A,z600,3,IN,14551,,-20.0,0
A,z600,3,OUT,A,,100.0,0
""".strip()
))
  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag
0       A       z420            1     IN       12125        100.0           NaN     0
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0
2       A       z512            2     IN    A-z512-2        100.0           NaN     0
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0
4       A       z600            3     IN    A-z600-2          NaN         400.0     1
5       A       z600            3     IN    A-z600-3        120.0         200.0     1 # <- only valid flag (non-na quantity_in)
6       A       z600            3     IN       14551          NaN         -20.0     0
7       A       z600            3    OUT           A          NaN         100.0     0

Here is a way to implement calculate_demand_raw without the .apply + .loc lookups.

Generally, in this situation you want .merge, so that you have all of the data "side by side" and can work in a "vectorized" way.

flags = df[df['flag'] == 1].dropna(subset='quantity_in')

df_m = df.merge(flags, on=['item_id', 'process_No.'], how='left', suffixes=('', '_y'))

df_m.loc[ 
   (df_m['flag'] == 1) | (df_m['IN/OUT'] == 'OUT'), 
   df.columns.difference(['item_id', 'process_No.']) + '_y' 
] = float('nan')

rows = df_m['process_id_y'].notna()
df_m.loc[rows, 'quantity_out'] *= df_m.loc[rows, 'quantity_in_y']

df_m.loc[df_m['flag'] == 1, 'flag'] = 0
df_m.loc[rows, 'flag'] = 1

Step-by-step breakdown:

Find all flagged rows.

flags = df[df['flag'] == 1].dropna(subset='quantity_in')

.dropna() is used to mimic the if np.isnan(row['quantity_out']) line in your code.

  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag
5       A       z600            3     IN    A-z600-3        120.0         200.0     1
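The filter-then-dropna step can also be written as a single boolean mask. The following is only a minimal sketch on a small made-up frame (the three rows and the variable names flags_dropna and flags_mask are illustrative, not part of the original data); it shows that both spellings select the same rows.

```python
import pandas as pd

# Hypothetical three-row frame with the same column names as the example.
df = pd.DataFrame({
    "material_id": ["A-z600-2", "A-z600-3", "14551"],
    "quantity_in": [float("nan"), 120.0, float("nan")],
    "flag": [1, 1, 0],
})

# Variant 1: filter on the flag, then drop rows that lack quantity_in.
flags_dropna = df[df["flag"] == 1].dropna(subset=["quantity_in"])

# Variant 2: one combined boolean mask, no intermediate frame.
flags_mask = df[(df["flag"] == 1) & df["quantity_in"].notna()]

print(flags_dropna.equals(flags_mask))
```

Either form avoids a Python-level loop; which one reads better is a matter of taste.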

Left-merge the flags:

df_m = df.merge(flags, on=['item_id', 'process_No.'], how='left', suffixes=('', '_y'))
  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     1         z600       IN      A-z600-3          120.0           200.0     1.0
5       A       z600            3     IN    A-z600-3        120.0         200.0     1         z600       IN      A-z600-3          120.0           200.0     1.0
6       A       z600            3     IN       14551          NaN         -20.0     0         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0         z600       IN      A-z600-3          120.0           200.0     1.0

You want to discard the OUT rows. It is unclear whether you want to compare the flagged rows with themselves, so I discarded those here as well.

You can reset the _y columns back to NaN on the rows you want to discard:

df_m.loc[ 
   (df_m['flag'] == 1) | (df_m['IN/OUT'] == 'OUT'), 
   df.columns.difference(['item_id', 'process_No.']) + '_y' 
] = float('nan')
  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     1          NaN      NaN           NaN            NaN             NaN     NaN
5       A       z600            3     IN    A-z600-3        120.0         200.0     1          NaN      NaN           NaN            NaN             NaN     NaN
6       A       z600            3     IN       14551          NaN         -20.0     0         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
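The column selector in that step may look unusual, so here is a small sketch of what it evaluates to. It assumes a shortened, made-up list of column names; Index.difference drops the merge keys (and sorts the remaining labels), and adding a string to an Index of strings appends it to every label.

```python
import pandas as pd

# Shortened, illustrative set of column names.
columns = pd.Index(["item_id", "process_No.", "process_id", "flag"])
merge_keys = ["item_id", "process_No."]

# Labels that are not merge keys, each with the '_y' suffix appended.
suffixed = columns.difference(merge_keys) + "_y"
print(list(suffixed))
```

These are exactly the names that the merge gave to the right-hand copies of the non-key columns, which is why they can be blanked out in one assignment.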

You can then perform the calculation on the rows that have non-NA _y values:

rows = df_m['process_id_y'].notna()
df_m.loc[rows, 'quantity_out'] *= df_m.loc[rows, 'quantity_in_y']
  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     1          NaN      NaN           NaN            NaN             NaN     NaN
5       A       z600            3     IN    A-z600-3        120.0         200.0     1          NaN      NaN           NaN            NaN             NaN     NaN
6       A       z600            3     IN       14551          NaN       -2400.0     0         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN

Toggle the flag values:

df_m.loc[df_m['flag'] == 1, 'flag'] = 0
df_m.loc[rows, 'flag'] = 1
  item_id process_id  process_No. IN/OUT material_id  quantity_in  quantity_out  flag process_id_y IN/OUT_y material_id_y  quantity_in_y  quantity_out_y  flag_y
0       A       z420            1     IN       12125        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
1       A       z420            1    OUT    A-z512-2          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
2       A       z512            2     IN    A-z512-2        100.0           NaN     0          NaN      NaN           NaN            NaN             NaN     NaN
3       A       z512            2    OUT    A-z600-3          NaN         120.0     0          NaN      NaN           NaN            NaN             NaN     NaN
4       A       z600            3     IN    A-z600-2          NaN         400.0     0          NaN      NaN           NaN            NaN             NaN     NaN
5       A       z600            3     IN    A-z600-3        120.0         200.0     0          NaN      NaN           NaN            NaN             NaN     NaN
6       A       z600            3     IN       14551          NaN       -2400.0     1         z600       IN      A-z600-3          120.0           200.0     1.0
7       A       z600            3    OUT           A          NaN         100.0     0          NaN      NaN           NaN            NaN             NaN     NaN
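For reference, the same merge idea can be applied to the proportional formula from the question (count_demand / quantity_out * quantity_in). This is only a sketch under stated assumptions: the function name expand_one_level and the frames bom and demand are hypothetical, each (item_id, process_No.) pair is assumed to have exactly one OUT row with a positive quantity_out, and it expands the demand by a single process level.

```python
import pandas as pd

NAN = float("nan")
KEYS = ["item_id", "process_No."]

def expand_one_level(bom, demand):
    """Hypothetical helper: turn demand for process outputs into demand for their inputs."""
    # Keep only real outputs; NaN and negative (recycled) quantities are excluded.
    outs = bom.loc[(bom["IN/OUT"] == "OUT") & (bom["quantity_out"] > 0),
                   KEYS + ["quantity_out"]]
    ins = bom.loc[bom["IN/OUT"] == "IN", KEYS + ["material_id", "quantity_in"]]
    # Put every input row next to the output quantity and demand of its process.
    side = ins.merge(outs, on=KEYS).merge(demand, on=KEYS)
    # Vectorized version of the per-row formula in the question.
    side["count_demand"] = (side["count_demand"] / side["quantity_out"]
                            * side["quantity_in"])
    return side[KEYS + ["material_id", "count_demand"]]

# Illustrative data only.
bom = pd.DataFrame({
    "item_id": ["A", "A", "A", "A"],
    "process_No.": [2, 2, 3, 3],
    "IN/OUT": ["IN", "OUT", "IN", "OUT"],
    "material_id": ["A-z512-2", "A-z600-3", "A-z600-3", "A"],
    "quantity_in": [100.0, NAN, 120.0, NAN],
    "quantity_out": [NAN, 120.0, NAN, 100.0],
})
demand = pd.DataFrame({"item_id": ["A"], "process_No.": [3], "count_demand": [50.0]})

result = expand_one_level(bom, demand)
print(result)
```

With a demand of 50 units of A from process 3, the sketch yields a demand of 60 for A-z600-3 (50 / 100 * 120). Walking the whole process route would mean repeating this step level by level, which is not shown here.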

0
votes

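I don't know why the data table I carefully edited in edit mode turned out so badly, haha. To make up for it, I've included in this reply screenshots of the table and attributes from the preview. [screenshot]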
