I have a DataFrame with the following structure, whose columns hold boolean values:
enabler1 disabler1 enabler2 disabler2
0 FALSE FALSE FALSE FALSE
1 TRUE FALSE FALSE FALSE
2 TRUE FALSE FALSE FALSE
3 FALSE FALSE TRUE FALSE
4 FALSE TRUE FALSE FALSE
5 FALSE FALSE FALSE FALSE
6 FALSE FALSE TRUE FALSE
7 FALSE FALSE FALSE FALSE
8 FALSE FALSE FALSE TRUE
9 FALSE FALSE FALSE FALSE
10 FALSE FALSE FALSE FALSE
11 TRUE FALSE FALSE FALSE
12 FALSE TRUE FALSE FALSE
13 FALSE FALSE TRUE FALSE
14 FALSE FALSE FALSE FALSE
15 FALSE FALSE TRUE FALSE
16 FALSE FALSE FALSE FALSE
17 FALSE FALSE FALSE TRUE
18 FALSE FALSE FALSE FALSE
19 FALSE FALSE FALSE FALSE
I then need a result column that is computed from the enabler1, disabler1, enabler2 and disabler2 columns, and also from the previous value of result itself:
enabler1 disabler1 enabler2 disabler2 result
0 FALSE FALSE FALSE FALSE 0
1 TRUE FALSE FALSE FALSE 1
2 TRUE FALSE FALSE FALSE 1
3 FALSE FALSE TRUE FALSE 1
4 FALSE TRUE FALSE FALSE 0
5 FALSE FALSE FALSE FALSE 0
6 FALSE FALSE TRUE FALSE -1
7 FALSE FALSE FALSE FALSE -1
8 FALSE FALSE FALSE TRUE 0
9 FALSE FALSE FALSE FALSE 0
10 FALSE FALSE FALSE FALSE 0
11 TRUE FALSE FALSE FALSE 1
12 FALSE TRUE FALSE FALSE 0
13 FALSE FALSE TRUE FALSE -1
14 FALSE FALSE FALSE FALSE -1
15 FALSE FALSE TRUE FALSE -1
16 FALSE FALSE FALSE FALSE -1
17 FALSE FALSE FALSE TRUE 0
18 FALSE FALSE FALSE FALSE 0
19 FALSE FALSE FALSE FALSE 0
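As you may notice, the result column behaves like a state variable in a classic for loop: it holds an ON value (1 or -1) until the matching disabler resets it to OFF (0, the default value).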
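I am looking for other viewpoints on the fastest, most efficient way to handle this kind of scenario, where each value is computed from the previous value of the same column/array.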
I know there can be simpler Numba-based solutions, for example to make a for loop faster, but in this case I would like to avoid an explicit Python loop if possible.
Note that this may be used in an .ipynb file (Jupyter Notebook or VS Code) as well as in a .py file.
This is my latest attempt (still in progress), without an explicit Python loop:
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'enabler1': [False, True, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False],
    'disabler1': [False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False],
    'enabler2': [False, False, False, True, False, False, True, False, False, False, False, False, False, True, False, True, False, False, False, False],
    'disabler2': [False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, True, False, False]
})
df['result'] = None
# I also tried `df['result'] = np.nan` and `df['result'] = 0` here, but they raise the following warning:
# "FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas ...
# ... Value has dtype incompatible with int64, please explicitly cast to a compatible dtype first."
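...the next code continues directly from the previous block; it is only shown separately so that each part fits in a view without vertical scrolling.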
gPrevVal = 0  # Module-level state: the most recent result value, starting at OFF (0).

def myFuntion(enabler_1, disabler_1, enabler_2, disabler_2):
    global gPrevVal  # The idea with this variable is to keep its most recent previous value.
    # These 6 conditions can be reduced to 3; they are spelled out here to show the different possible cases:
    conditions = [
        (gPrevVal == 0) & (enabler_1 == True) & (disabler_1 == False) & (enabler_2 == False),
        (gPrevVal == 0) & (enabler_2 == True) & (disabler_2 == False) & (enabler_1 == False),
        (gPrevVal == 1) & (disabler_1 == True),
        (gPrevVal == -1) & (disabler_2 == True),
        (gPrevVal == 1) & (disabler_1 == False),
        (gPrevVal == -1) & (disabler_2 == False),
    ]
    choices = [
        1,
        -1,
        0,
        0,
        1,
        -1,
    ]
    # With scalar inputs np.select returns a 0-d array, so convert it to a plain int.
    gPrevVal = int(np.select(conditions, choices, default=0))
    return gPrevVal

df.loc[0, 'result'] = 0  # Default initial value in the first row.
df.loc[1:, 'result'] = df.loc[1:].apply(lambda row: myFuntion(*row[['enabler1', 'disabler1', 'enabler2', 'disabler2']]), axis=1)
print(df)
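The FutureWarning quoted in the code comments is about values whose dtype does not match the int64 column. One plausible contributor, stated here as an assumption rather than a diagnosis, is that np.select called on scalars hands back a 0-d NumPy array instead of a plain integer. The small sketch below only demonstrates that behaviour and shows a conversion with int(); the three-row frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# np.select on scalar conditions gives back a 0-d ndarray, not a Python int.
value = np.select([np.True_, np.False_], [1, -1], default=0)
print(type(value).__name__, value.shape)  # ndarray ()

# Converting explicitly yields a plain integer that fits an int64 column.
as_int = int(value)
print(as_int)  # 1

# Hypothetical three-row frame with an integer result column.
demo = pd.DataFrame({"result": np.zeros(3, dtype="int64")})
demo.loc[1, "result"] = as_int
print(demo["result"].dtype)  # int64
```

Creating the result column with an integer dtype up front and assigning plain integers keeps the column as int64 in this sketch.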
apply() is not always faster; sometimes a for loop performs better. Please check this: link
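A rough way to check this on your own machine is to time both variants. The sketch below is only an illustration: the helper names step, with_apply and with_loop, the random test data and the row count are all made up, and the transition rule is one reading of the expected result column. No timings are claimed here, since they depend on the pandas version and the data size.

```python
import timeit

import numpy as np
import pandas as pd

COLS = ["enabler1", "disabler1", "enabler2", "disabler2"]

# Hypothetical random test data: roughly 10% True in every column.
rng = np.random.default_rng(0)
n_rows = 2_000
df_test = pd.DataFrame(rng.random((n_rows, 4)) < 0.1, columns=COLS)

def step(prev, e1, d1, e2, d2):
    """Next state (-1, 0 or 1) for one row, given the previous state."""
    if prev == 0:
        return 1 if e1 else (-1 if e2 else 0)
    if prev == 1:
        return 0 if d1 else 1
    return 0 if d2 else -1

def with_apply(frame):
    """Row-wise apply that carries the previous state in a closure."""
    state = {"prev": 0}
    def per_row(row):
        state["prev"] = step(state["prev"], *row[COLS])
        return state["prev"]
    tail = frame.iloc[1:].apply(per_row, axis=1)
    return [0] + [int(v) for v in tail]

def with_loop(frame):
    """Plain Python loop over the underlying NumPy arrays."""
    arrays = [frame[c].to_numpy()[1:] for c in COLS]
    out = [0]
    for e1, d1, e2, d2 in zip(*arrays):
        out.append(step(out[-1], e1, d1, e2, d2))
    return out

# Both variants must agree before their timings mean anything.
assert with_apply(df_test) == with_loop(df_test)
print("apply:", timeit.timeit(lambda: with_apply(df_test), number=3))
print("loop :", timeit.timeit(lambda: with_loop(df_test), number=3))
```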
For this scenario, use the numba package; please refer to the following code:
import pandas as pd
import numpy as np
from numba import njit
df = pd.DataFrame({
    'enabler1': [False, True, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False],
    'disabler1': [False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False],
    'enabler2': [False, False, False, True, False, False, True, False, False, False, False, False, False, True, False, True, False, False, False, False],
    'disabler2': [False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, True, False, False]
})
df['result'] = 0
@njit
def calculate_result(enabler1, disabler1, enabler2, disabler2):
    # result[0] stays 0 (OFF); every later row depends on the previous one.
    result = np.zeros(enabler1.shape[0], dtype=np.int32)
    for i in range(1, len(enabler1)):
        prev_result = result[i-1]
        if prev_result == 0 and enabler1[i]:
            result[i] = 1
        elif prev_result == 0 and enabler2[i]:
            result[i] = -1
        elif prev_result == 1 and disabler1[i]:
            result[i] = 0
        elif prev_result == -1 and disabler2[i]:
            result[i] = 0
        else:
            result[i] = prev_result
    return result
df['result'] = calculate_result(df['enabler1'].values, df['disabler1'].values, df['enabler2'].values, df['disabler2'].values)
print(df)
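If a per-row loop really has to be avoided, one possible alternative is a prefix scan over per-row transition tables. This is a swapped-in technique, not part of the numba code above. Each row is turned into a small table saying "from state s, go to state t", and because composing such tables is associative, all prefixes can be combined in about log2(n) vectorized NumPy passes. The sketch assumes the same transition rules as calculate_result; the function name scan_states is made up, and no speed advantage over Numba is claimed, since the scan does O(n log n) work.

```python
import numpy as np

def scan_states(enabler1, disabler1, enabler2, disabler2):
    """Return the -1/0/1 state per row without a per-row Python loop."""
    e1, d1, e2, d2 = (np.asarray(a, dtype=bool)
                      for a in (enabler1, disabler1, enabler2, disabler2))
    n = e1.shape[0]
    # table[i, s + 1] is the state after row i, given state s (-1, 0 or 1) before it.
    table = np.empty((n, 3), dtype=np.intp)
    table[:, 0] = np.where(d2, 0, -1)                   # coming from -1
    table[:, 1] = np.where(e1, 1, np.where(e2, -1, 0))  # coming from 0
    table[:, 2] = np.where(d1, 0, 1)                    # coming from 1
    if n:
        table[0] = 0  # the first row is fixed at OFF, as in the loop version
    # Inclusive scan: after each pass, row i covers twice as many earlier rows.
    step = 1
    while step < n:
        merged = table.copy()
        # Apply the earlier block first, then look the outcome up in the later block.
        merged[step:] = np.take_along_axis(table[step:], table[:-step] + 1, axis=1)
        table = merged
        step *= 2
    return table[:, 1]

# The 20-row example from the question, built from the positions of the True values.
n = 20
e1 = np.zeros(n, dtype=bool); e1[[1, 2, 11]] = True
d1 = np.zeros(n, dtype=bool); d1[[4, 12]] = True
e2 = np.zeros(n, dtype=bool); e2[[3, 6, 13, 15]] = True
d2 = np.zeros(n, dtype=bool); d2[[8, 17]] = True
print(scan_states(e1, d1, e2, d2).tolist())
# [0, 1, 1, 1, 0, 0, -1, -1, 0, 0, 0, 1, 0, -1, -1, -1, -1, 0, 0, 0]
```

The while loop here runs once per doubling of the window rather than once per row, so it takes 5 passes for the 20 rows above.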