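Vectorized, or otherwise reasonably close to C speed, computation where each result depends on the previous result. Currently using Pandas apply() without an explicit loop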

Question (votes: 0, answers: 2)

I have a DataFrame with the following structure, whose columns hold boolean values:

     enabler1  disabler1  enabler2  disabler2
0       FALSE      FALSE     FALSE      FALSE
1        TRUE      FALSE     FALSE      FALSE
2        TRUE      FALSE     FALSE      FALSE
3       FALSE      FALSE      TRUE      FALSE
4       FALSE       TRUE     FALSE      FALSE
5       FALSE      FALSE     FALSE      FALSE
6       FALSE      FALSE      TRUE      FALSE
7       FALSE      FALSE     FALSE      FALSE
8       FALSE      FALSE     FALSE       TRUE
9       FALSE      FALSE     FALSE      FALSE
10      FALSE      FALSE     FALSE      FALSE
11       TRUE      FALSE     FALSE      FALSE
12      FALSE       TRUE     FALSE      FALSE
13      FALSE      FALSE      TRUE      FALSE
14      FALSE      FALSE     FALSE      FALSE
15      FALSE      FALSE      TRUE      FALSE
16      FALSE      FALSE     FALSE      FALSE
17      FALSE      FALSE     FALSE       TRUE
18      FALSE      FALSE     FALSE      FALSE
19      FALSE      FALSE     FALSE      FALSE

Next, I need a `result` column that is computed from the `enabler1`, `disabler1`, `enabler2` and `disabler2` columns and also from the previous value of `result`:

     enabler1  disabler1  enabler2  disabler2  result
0       FALSE      FALSE     FALSE      FALSE       0
1        TRUE      FALSE     FALSE      FALSE       1
2        TRUE      FALSE     FALSE      FALSE       1
3       FALSE      FALSE      TRUE      FALSE       1
4       FALSE       TRUE     FALSE      FALSE       0
5       FALSE      FALSE     FALSE      FALSE       0
6       FALSE      FALSE      TRUE      FALSE      -1
7       FALSE      FALSE     FALSE      FALSE      -1
8       FALSE      FALSE     FALSE       TRUE       0
9       FALSE      FALSE     FALSE      FALSE       0
10      FALSE      FALSE     FALSE      FALSE       0
11       TRUE      FALSE     FALSE      FALSE       1
12      FALSE       TRUE     FALSE      FALSE       0
13      FALSE      FALSE      TRUE      FALSE      -1
14      FALSE      FALSE     FALSE      FALSE      -1
15      FALSE      FALSE      TRUE      FALSE      -1
16      FALSE      FALSE     FALSE      FALSE      -1
17      FALSE      FALSE     FALSE       TRUE       0
18      FALSE      FALSE     FALSE      FALSE       0
19      FALSE      FALSE     FALSE      FALSE       0

As you can see, the `result` column behaves like a variable in a classic for loop: it holds an ON value until a disabler resets it to OFF (the default value).

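I am looking for other points of view on the fastest and most efficient way to handle this kind of computation, where each value depends on previous values of the same column/array.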

I know there can be simpler solutions based on Numba, for example to make a `for` loop faster, but in this case I would like to avoid an explicit Python loop if possible.

Note that this may be used in an .ipynb file (Jupyter Notebook or VSC) as well as in a .py file.

This is the latest version (still in development), without an explicit Python loop:

import pandas as pd
import numpy as np


df = pd.DataFrame({
    'enabler1': [False, True, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False],
    'disabler1': [False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False],
    'enabler2': [False, False, False, True, False, False, True, False, False, False, False, False, False, True, False, True, False, False, False, False],
    'disabler2': [False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, True, False, False]
})

df['result'] = None
#Here also tried with `df['result'] = np.nan` and `df['result'] = 0` but they raise the next warning:
#FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas"...
#..."Value has dtype incompatible with int64, please explicitly cast to a compatible dtype first."

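...the next code follows directly after the previous one; it is shown separately only so that it fits in a view without vertical scrolling.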

def myFuntion(enabler_1, disabler_1, enabler_2, disabler_2):
    global tempResult #The idea with this variable is to have its most recent previous value.

    #These 6 conditions can be reduced to 3, but here just trying to better show the different possible cases:

    conditions = [
        (tempResult == 0) & (enabler_1 == True) & (disabler_1 == False) & (enabler_2 == False),
        (tempResult == 0) & (enabler_2 == True) & (disabler_2 == False) & (enabler_1 == False),
        (tempResult == 1) & (disabler_1 == True),
        (tempResult == -1) & (disabler_2 == True),
        (tempResult == 1) & (disabler_1 == False),
        (tempResult == -1) & (disabler_2 == False),
    ]

    choices = [
        1,
        -1,
        0,
        0,
        1,
        -1,
    ]

    tempResult = np.select(conditions, choices, default=0) #tempResult = np.select(conditions, choices)
    
    return tempResult


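tempResult = 0 #Initial state; myFuntion reads and updates this global, so it must exist before the first call.
df.loc[0, 'result'] = 0 #default initial value in the first row.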

df.loc[1:, 'result'] = df.loc[1:].apply(lambda row: myFuntion(*row[['enabler1', 'disabler1', 'enabler2', 'disabler2']]), axis=1)

print(df)
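
Regarding the dtype FutureWarning mentioned in the comments above, one option that may help is a nullable integer column. The following is only a minimal sketch under assumptions: the small `demo` frame is hypothetical, and it assumes the per-row function returns plain Python ints (for example by wrapping the `np.select` call in `int(...)`).

```python
import numpy as np
import pandas as pd

# Hypothetical three-row frame, only to illustrate the column dtype.
demo = pd.DataFrame({"enabler1": [False, True, False]})

# Nullable integer column: starts as all-NA and accepts integer assignments.
demo["result"] = pd.array([pd.NA] * len(demo), dtype="Int64")
demo.loc[0, "result"] = 0

# int(...) turns the 0-d array returned by np.select into a plain Python int.
state = int(np.select([True], [1], default=0))
demo.loc[1:, "result"] = [state, 0]

print(demo["result"].dtype)
print(demo)
```

The column keeps a single integer dtype from start to finish, so no value has to be cast when it is written.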



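Update 1

Notes:

  • `tempResult` is just an integer variable that stores a kind of "state": OFF by default, and an ON-type value once the corresponding conditions are met.

  • In the specific case of this thread, to get a non-zero result in any row, the two enablers cannot both be `True` at the same time.

  • The most basic part of the condition for getting a non-zero result is that the most recent previous `tempResult` value must be `0`. If the value is already of an ON type, there is no need to re-evaluate the conditions that would produce the same ON-type value.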


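As a reference for clearer visualization, here is a version of the code with an explicit Python loop that produces the same result. It shows exactly which conditions are needed to get an **ON** value (a non-zero value) and which are needed to get an **OFF** value:
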
df['result'] = 0


def myFuntionLoop(enabler1, disabler1, enabler2, disabler2): 
    tempResult = 0  #The initial default value in `result` is `0`.
    resultList = [0] * len(enabler1)  #A list with the needed length and with `0` as initial values.

    for i in range(1, len(enabler1)):  #Starting from the row 1, i.e. skipping the row 0.
        if tempResult == 0 and enabler1[i] and disabler1[i] == False and enabler2[i] == False:
            tempResult = 1
        elif tempResult == 0 and enabler2[i] and disabler2[i] == False and enabler1[i] == False:
            tempResult = -1
        elif tempResult == 1 and disabler1[i] == True:
            tempResult = 0
        elif tempResult == -1 and disabler2[i] == True:
            tempResult = 0
        
        resultList[i] = tempResult

    return resultList


df['result'] = myFuntionLoop(df['enabler1'], df['disabler1'], df['enabler2'], df['disabler2'])


print(df)

Tags: python pandas numpy vectorization cython

2 Answers

0 votes

`apply()` is not always faster; sometimes a `for` loop works better. Please check this: link
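
To check that claim on your own machine, a small timing harness along these lines could be used. It is only a sketch: the random `demo` frame, the stateless per-row rule and the helper names `with_apply` and `with_loop` are all made up for illustration, and no particular timings are implied, since they depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical random boolean frame, only for timing purposes.
rng = np.random.default_rng(0)
demo = pd.DataFrame(
    rng.random((2_000, 4)) < 0.1,
    columns=["enabler1", "disabler1", "enabler2", "disabler2"],
)


def with_apply(frame):
    # Row-wise apply: pandas builds a Series object for every row.
    return frame.apply(
        lambda row: int(row["enabler1"] and not row["disabler1"]), axis=1
    ).tolist()


def with_loop(frame):
    # Plain Python loop over the underlying NumPy arrays.
    e1 = frame["enabler1"].to_numpy()
    d1 = frame["disabler1"].to_numpy()
    return [int(e and not d) for e, d in zip(e1, d1)]


# Both variants must agree before their timings are worth comparing.
assert with_apply(demo) == with_loop(demo)
print("apply:", timeit.timeit(lambda: with_apply(demo), number=3))
print("loop: ", timeit.timeit(lambda: with_loop(demo), number=3))
```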

For this scenario, consider the `numba` package; see the following code:

import pandas as pd
import numpy as np
from numba import njit


df = pd.DataFrame({
    'enabler1': [False, True, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False],
    'disabler1': [False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False],
    'enabler2': [False, False, False, True, False, False, True, False, False, False, False, False, False, True, False, True, False, False, False, False],
    'disabler2': [False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, True, False, False]
})

df['result'] = 0


@njit
def calculate_result(enabler1, disabler1, enabler2, disabler2):
    result = np.zeros(enabler1.shape[0], dtype=np.int32)
    for i in range(1, len(enabler1)):
        prev_result = result[i-1]
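        # Same enabling conditions as the explicit loop in the question.
        if prev_result == 0 and enabler1[i] and not disabler1[i] and not enabler2[i]:
            result[i] = 1
        elif prev_result == 0 and enabler2[i] and not disabler2[i] and not enabler1[i]:
            result[i] = -1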
        elif prev_result == 1 and disabler1[i]:
            result[i] = 0
        elif prev_result == -1 and disabler2[i]:
            result[i] = 0
        else:
            result[i] = prev_result
    return result


df['result'] = calculate_result(df['enabler1'].values, df['disabler1'].values, df['enabler2'].values, df['disabler2'].values)

print(df)

0 votes

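The only optimization I can see is this: there are rows for which the next value is the same as the previous one, no matter what the earlier data was. That is, if `e1, d1, e2, d2 = False, False, False, False`, then next value = previous value, always. These rows can in fact be excluded from the computation. You can leave them as NA and fill them at the end with `fillna` (the code below uses a forward fill, `ffill`).

Like this: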

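def calculate_result(df):
    def inner(row, tempResult):
        # Same transitions as the explicit loop in the question.
        if tempResult == 0 and row.enabler1 and not row.disabler1 and not row.enabler2:
            return 1
        if tempResult == 0 and row.enabler2 and not row.disabler2 and not row.enabler1:
            return -1
        if tempResult == 1 and row.disabler1:
            return 0
        if tempResult == -1 and row.disabler2:
            return 0
        return tempResult  # no transition: keep the previous state

    tempResult = 0
    df['result'] = pd.NA  # skipped rows stay NA until the forward fill
    df.loc[df.index[0], 'result'] = 0  # default initial value in the first row
    noEffect = (
        (~df.enabler1 & ~df.disabler1 & ~df.enabler2 & ~df.disabler2)
        # | (...)  # other possible combinations which don't change anything
    )
    for num, row in df[
        ~noEffect &
        (df.index != df.index[0])
    ].iterrows():
        df.loc[num, 'result'] = tempResult = inner(row, tempResult)
    df['result'] = df.result.astype(pd.Int64Dtype()).ffill()
    return df
df = calculate_result(df)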

This way you simply jump from one meaningful row to the next and spend no time on the rows that merely copy the previous state.
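
The forward fill that makes this skipping work can be seen in isolation. A minimal sketch, assuming a hypothetical `sparse` Series that holds a state only at rows that can change something and NA everywhere else (the values are made up for illustration):

```python
import pandas as pd

# States computed only at the rows that can change something; NA elsewhere.
sparse = pd.Series([0, 1, pd.NA, pd.NA, 0, pd.NA, -1, pd.NA], dtype="Int64")

# Forward fill copies the last known state into every skipped row.
filled = sparse.ffill()
print(filled.tolist())  # [0, 1, 1, 1, 0, 0, -1, -1]
```

Using the nullable `Int64` dtype keeps the states as integers even while some rows are still NA.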

If you have many such "empty" rows, this may pay off.
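
The same observation can be pushed one step further: every row, not only the "empty" ones, just maps a previous state to a next state, and such mappings compose associatively. That allows a prefix scan over whole arrays with about log2(n) NumPy steps instead of one Python iteration per row. The sketch below is not taken from either answer. The names `scan_states` and `table` are hypothetical, the transition rules are assumed to be those of the explicit loop in the question, and whether it beats the Numba version is untested.

```python
import numpy as np


def scan_states(e1, d1, e2, d2):
    """Compute the state column with a prefix scan over transition tables."""
    e1, d1, e2, d2 = (np.asarray(a, dtype=bool) for a in (e1, d1, e2, d2))
    n = e1.shape[0]
    # State codes: 0 = OFF (0), 1 = ON (+1), 2 = ON (-1).
    # table[i, s] is the state code after row i if the state before it was s.
    table = np.empty((n, 3), dtype=np.intp)
    to_pos = e1 & ~d1 & ~e2
    to_neg = e2 & ~d2 & ~e1
    table[:, 0] = np.where(to_pos, 1, np.where(to_neg, 2, 0))
    table[:, 1] = np.where(d1, 0, 1)
    table[:, 2] = np.where(d2, 0, 2)
    if n:
        table[0] = 0  # the first row is always OFF, as in the question
    # Doubling scan: afterwards table[i] is the composition of rows 0..i.
    shift = 1
    while shift < n:
        table[shift:] = np.take_along_axis(table[shift:], table[:-shift], axis=1)
        shift *= 2
    # Start from OFF (code 0) and map the codes back to 0 / 1 / -1.
    return np.array([0, 1, -1])[table[:, 0]]


# The question's data, written as the row numbers where each flag is True.
rows = np.arange(20)
e1 = np.isin(rows, [1, 2, 11])
d1 = np.isin(rows, [4, 12])
e2 = np.isin(rows, [3, 6, 13, 15])
d2 = np.isin(rows, [8, 17])
print(scan_states(e1, d1, e2, d2).tolist())
```

For the question's data this reproduces the `result` column shown there. The scan does O(n log n) work in total, so it trades extra arithmetic for the absence of a per-row Python loop.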
