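Recursive calculation where each result depends on the previous result. Currently using Pandas apply() with no explicit loop. Need a more efficient solution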


I have a DataFrame with the following structure, whose columns hold boolean values:

     enabler1  disabler1  enabler2  disabler2
0       FALSE      FALSE     FALSE      FALSE
1        TRUE      FALSE     FALSE      FALSE
2        TRUE      FALSE     FALSE      FALSE
3       FALSE      FALSE      TRUE      FALSE
4       FALSE       TRUE     FALSE      FALSE
5       FALSE      FALSE     FALSE      FALSE
6       FALSE      FALSE      TRUE      FALSE
7       FALSE      FALSE     FALSE      FALSE
8       FALSE      FALSE     FALSE       TRUE
9       FALSE      FALSE     FALSE      FALSE
10      FALSE      FALSE     FALSE      FALSE
11       TRUE      FALSE     FALSE      FALSE
12      FALSE       TRUE     FALSE      FALSE
13      FALSE      FALSE      TRUE      FALSE
14      FALSE      FALSE     FALSE      FALSE
15      FALSE      FALSE      TRUE      FALSE
16      FALSE      FALSE     FALSE      FALSE
17      FALSE      FALSE     FALSE       TRUE
18      FALSE      FALSE     FALSE      FALSE
19      FALSE      FALSE     FALSE      FALSE

Next, I need a result column that is computed from the enabler1, disabler1, enabler2 and disabler2 columns and also from the previous value of result:

     enabler1  disabler1  enabler2  disabler2  result
0       FALSE      FALSE     FALSE      FALSE       0
1        TRUE      FALSE     FALSE      FALSE       1
2        TRUE      FALSE     FALSE      FALSE       1
3       FALSE      FALSE      TRUE      FALSE       1
4       FALSE       TRUE     FALSE      FALSE       0
5       FALSE      FALSE     FALSE      FALSE       0
6       FALSE      FALSE      TRUE      FALSE      -1
7       FALSE      FALSE     FALSE      FALSE      -1
8       FALSE      FALSE     FALSE       TRUE       0
9       FALSE      FALSE     FALSE      FALSE       0
10      FALSE      FALSE     FALSE      FALSE       0
11       TRUE      FALSE     FALSE      FALSE       1
12      FALSE       TRUE     FALSE      FALSE       0
13      FALSE      FALSE      TRUE      FALSE      -1
14      FALSE      FALSE     FALSE      FALSE      -1
15      FALSE      FALSE      TRUE      FALSE      -1
16      FALSE      FALSE     FALSE      FALSE      -1
17      FALSE      FALSE     FALSE       TRUE       0
18      FALSE      FALSE     FALSE      FALSE       0
19      FALSE      FALSE     FALSE      FALSE       0

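As you may notice, the result column behaves like a variable in a classic for loop: it holds an ON value (1 or -1) until the matching disabler resets it to OFF (0, the default).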
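For reference, that classic loop could be sketched as below. This is only a baseline under assumptions: the name reference_result is illustrative, and the tie-breaking (an enabler only fires from OFF when its own disabler and the other enabler are both False) is taken from the conditions listed further down in myFuntion.

```python
def reference_result(enabler1, disabler1, enabler2, disabler2):
    """Plain-Python baseline: carry the previous state from row to row."""
    n = len(enabler1)
    result = [0] * n  # row 0 keeps the default OFF value (0)
    for i in range(1, n):
        prev = result[i - 1]
        state = prev  # by default the previous state is held
        if prev == 0:
            # OFF: an enabler switches ON only if nothing contradicts it
            if enabler1[i] and not disabler1[i] and not enabler2[i]:
                state = 1
            elif enabler2[i] and not disabler2[i] and not enabler1[i]:
                state = -1
        elif prev == 1 and disabler1[i]:
            state = 0  # disabler1 resets the +1 state
        elif prev == -1 and disabler2[i]:
            state = 0  # disabler2 resets the -1 state
        result[i] = state
    return result
```

It takes plain sequences, so with the DataFrame it could be called as reference_result(df['enabler1'].tolist(), df['disabler1'].tolist(), df['enabler2'].tolist(), df['disabler2'].tolist()).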

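I am looking for other ideas on the fastest, most efficient way to handle this kind of calculation, where each value depends on previous values of the same column/array.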

I know there could be simpler solutions based on Numba, for example to make a for loop faster, but in this case I would like to avoid explicit Python loops if possible.

Note that this may be used in an .ipynb file (Jupyter Notebook or VS Code) as well as in a .py file.

This is my latest attempt (still a work in progress), with no explicit Python loop:

import pandas as pd
import numpy as np


df = pd.DataFrame({
    'enabler1': [False, True, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False],
    'disabler1': [False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False],
    'enabler2': [False, False, False, True, False, False, True, False, False, False, False, False, False, True, False, True, False, False, False, False],
    'disabler2': [False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, True, False, False]
})

df['result'] = None
#Here also tried with `df['result'] = np.nan` and `df['result'] = 0` but they raise the next warning:
#FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas"...
#..."Value has dtype incompatible with int64, please explicitly cast to a compatible dtype first."

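...the next block continues directly from the previous one; it is shown separately only so that it fits in a view without vertical scrolling.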

def myFuntion(enabler_1, disabler_1, enabler_2, disabler_2):
    global gPrevVal #The idea with this variable is to have its most recent previous value.

    #These 6 conditions can be reduced to 3, but here just trying to better show the different possible cases:

    conditions = [
        (gPrevVal == 0) & (enabler_1 == True) & (disabler_1 == False) & (enabler_2 == False),
        (gPrevVal == 0) & (enabler_2 == True) & (disabler_2 == False) & (enabler_1 == False),
        (gPrevVal == 1) & (disabler_1 == True),
        (gPrevVal == -1) & (disabler_2 == True),
        (gPrevVal == 1) & (disabler_1 == False),
        (gPrevVal == -1) & (disabler_2 == False),
    ]

    choices = [
        1,
        -1,
        0,
        0,
        1,
        -1,
    ]

    gPrevVal = np.select(conditions, choices, default=0) #gPrevVal = np.select(conditions, choices)
    
    return gPrevVal


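gPrevVal = 0 #initial state; it must exist at module level before myFuntion reads it through `global`, otherwise a NameError is raised.
df.loc[0, 'result'] = 0 #default initial value in the first row.

df.loc[1:, 'result'] = df.loc[1:].apply(lambda row: myFuntion(*row[['enabler1', 'disabler1', 'enabler2', 'disabler2']]), axis=1)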

print(df)

Tags: python, pandas, numpy, vectorization, cython

1 Answer

apply() is not always faster; sometimes a for loop performs better. See this: link

For this scenario, use the numba package; see the following code:

import pandas as pd
import numpy as np
from numba import njit


df = pd.DataFrame({
    'enabler1': [False, True, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False],
    'disabler1': [False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False],
    'enabler2': [False, False, False, True, False, False, True, False, False, False, False, False, False, True, False, True, False, False, False, False],
    'disabler2': [False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, True, False, False]
})

df['result'] = 0


@njit
def calculate_result(enabler1, disabler1, enabler2, disabler2):
    result = np.zeros(enabler1.shape[0], dtype=np.int32)
    for i in range(1, len(enabler1)):
        prev_result = result[i-1]
        if prev_result == 0 and enabler1[i]:
            result[i] = 1
        elif prev_result == 0 and enabler2[i]:
            result[i] = -1
        elif prev_result == 1 and disabler1[i]:
            result[i] = 0
        elif prev_result == -1 and disabler2[i]:
            result[i] = 0
        else:
            result[i] = prev_result
    return result


df['result'] = calculate_result(df['enabler1'].values, df['disabler1'].values, df['enabler2'].values, df['disabler2'].values)

print(df)
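
If a per-row loop really has to be avoided, one possible alternative is a prefix scan over per-row transition tables. The sketch below is an illustration, not part of the answer above. The name scan_states is hypothetical, it mirrors the transition rules of calculate_result, and it has not been benchmarked against the Numba version. Each row is turned into a table mapping every possible previous state to the next state, and the tables are composed with a doubling scan, so the only Python-level loop runs about log2(n) times rather than once per row.

```python
import numpy as np


def scan_states(enabler1, disabler1, enabler2, disabler2):
    """Doubling prefix scan over per-row state-transition tables."""
    e1, d1, e2, d2 = (np.asarray(a, dtype=bool)
                      for a in (enabler1, disabler1, enabler2, disabler2))
    n = e1.shape[0]
    # State indices: 0 -> OFF (0), 1 -> ON (+1), 2 -> ON (-1).
    # T[i, s] is the state index after row i when the previous state was s.
    T = np.empty((n, 3), dtype=np.intp)
    T[:, 0] = np.where(e1, 1, np.where(e2, 2, 0))  # from OFF
    T[:, 1] = np.where(d1, 0, 1)                   # from +1
    T[:, 2] = np.where(d2, 0, 2)                   # from -1
    if n:
        T[0] = 0  # the first row is always OFF, as in calculate_result
    shift = 1
    while shift < n:
        # Compose each table with the one `shift` rows earlier:
        # new T[i][s] = T[i][T[i - shift][s]]; the right-hand side is
        # evaluated into a fresh array before the assignment happens.
        T[shift:] = np.take_along_axis(T[shift:], T[:-shift], axis=1)
        shift *= 2
    # Start from OFF (column 0) and map state indices back to 0 / 1 / -1.
    return np.array([0, 1, -1])[T[:, 0]]
```

With the DataFrame above it could be used as df['result'] = scan_states(df['enabler1'], df['disabler1'], df['enabler2'], df['disabler2']). The scan does O(n log n) work instead of O(n), so whether it beats the compiled loop depends on the data size and is worth measuring.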