I have a DataFrame with the following structure, where every column holds boolean values:
enabler1 disabler1 enabler2 disabler2
0 FALSE FALSE FALSE FALSE
1 TRUE FALSE FALSE FALSE
2 TRUE FALSE FALSE FALSE
3 FALSE FALSE TRUE FALSE
4 FALSE TRUE FALSE FALSE
5 FALSE FALSE FALSE FALSE
6 FALSE FALSE TRUE FALSE
7 FALSE FALSE FALSE FALSE
8 FALSE FALSE FALSE TRUE
9 FALSE FALSE FALSE FALSE
10 FALSE FALSE FALSE FALSE
11 TRUE FALSE FALSE FALSE
12 FALSE TRUE FALSE FALSE
13 FALSE FALSE TRUE FALSE
14 FALSE FALSE FALSE FALSE
15 FALSE FALSE TRUE FALSE
16 FALSE FALSE FALSE FALSE
17 FALSE FALSE FALSE TRUE
18 FALSE FALSE FALSE FALSE
19 FALSE FALSE FALSE FALSE
Then, a `result` column is needed, computed from the `enabler1`, `disabler1`, `enabler2` and `disabler2` columns and also from the previous value of `result` itself:
enabler1 disabler1 enabler2 disabler2 result
0 FALSE FALSE FALSE FALSE 0
1 TRUE FALSE FALSE FALSE 1
2 TRUE FALSE FALSE FALSE 1
3 FALSE FALSE TRUE FALSE 1
4 FALSE TRUE FALSE FALSE 0
5 FALSE FALSE FALSE FALSE 0
6 FALSE FALSE TRUE FALSE -1
7 FALSE FALSE FALSE FALSE -1
8 FALSE FALSE FALSE TRUE 0
9 FALSE FALSE FALSE FALSE 0
10 FALSE FALSE FALSE FALSE 0
11 TRUE FALSE FALSE FALSE 1
12 FALSE TRUE FALSE FALSE 0
13 FALSE FALSE TRUE FALSE -1
14 FALSE FALSE FALSE FALSE -1
15 FALSE FALSE TRUE FALSE -1
16 FALSE FALSE FALSE FALSE -1
17 FALSE FALSE FALSE TRUE 0
18 FALSE FALSE FALSE FALSE 0
19 FALSE FALSE FALSE FALSE 0
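As you can see, the `result` column behaves like a variable in a classic for loop: it keeps an ON value (1 or -1) until the matching disabler resets it to OFF (0, the default).
I am looking for other points of view on the fastest, most efficient way to handle this kind of calculation, where each value depends on the previous values of the same column/array.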
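I know there can be simpler solutions based on Numba, for example to make a for loop faster, but in this case I would like to avoid explicit Python loops if possible.
Note that this may be used in an .ipynb file (Jupyter Notebook or VSC) as well as in a .py file.
This is my latest attempt (still in development), without an explicit Python loop: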
import pandas as pd
import numpy as np
df = pd.DataFrame({
'enabler1': [False, True, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False],
'disabler1': [False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False],
'enabler2': [False, False, False, True, False, False, True, False, False, False, False, False, False, True, False, True, False, False, False, False],
'disabler2': [False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, True, False, False]
})
df['result'] = None
#Here also tried with `df['result'] = np.nan` and `df['result'] = 0` but they raise the next warning:
#FutureWarning: Setting an item of incompatible dtype is deprecated and will raise an error in a future version of pandas"...
#..."Value has dtype incompatible with int64, please explicitly cast to a compatible dtype first."
...the next code continues directly from the previous one; it is only split so that each part can be shown without vertical scrolling.
tempResult = 0 #Module-level state; it must exist before the first call, otherwise `global tempResult` raises NameError.
def myFuntion(enabler_1, disabler_1, enabler_2, disabler_2):
    global tempResult #The idea with this variable is to have its most recent previous value.
    #These 6 conditions can be reduced to 3, but here just trying to better show the different possible cases:
    conditions = [
        (tempResult == 0) & (enabler_1 == True) & (disabler_1 == False) & (enabler_2 == False),
        (tempResult == 0) & (enabler_2 == True) & (disabler_2 == False) & (enabler_1 == False),
        (tempResult == 1) & (disabler_1 == True),
        (tempResult == -1) & (disabler_2 == True),
        (tempResult == 1) & (disabler_1 == False),
        (tempResult == -1) & (disabler_2 == False),
    ]
    choices = [
        1,
        -1,
        0,
        0,
        1,
        -1,
    ]
    tempResult = int(np.select(conditions, choices, default=0)) #np.select returns a 0-d array here, so cast it to a plain int.
    return tempResult
df.loc[0, 'result'] = 0 #default initial value in the first row.
df.loc[1:, 'result'] = df.loc[1:].apply(lambda row: myFuntion(*row[['enabler1', 'disabler1', 'enabler2', 'disabler2']]), axis=1)
print(df)
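Update 1:
Notes:
`tempResult` is just an integer variable that stores a kind of "state": its default value is OFF, and it takes an ON-type value when the corresponding conditions are met.
In the specific case of this thread, for any row to produce a nonzero result, the two enablers must not be True at the same time.
The most basic part of the condition for getting a nonzero result is that the most recent previous `tempResult` value must be 0: if that value is already an ON type, there is no need to evaluate again the conditions that would produce the same ON-type value.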
df['result'] = 0
def myFuntionLoop(enabler1, disabler1, enabler2, disabler2):
    tempResult = 0 #The initial default value in `result` is `0`.
    resultList = [0] * len(enabler1) #A list with the needed length and with `0` as initial values.
    for i in range(1, len(enabler1)): #Starting from the row 1, i.e. skipping the row 0.
        if tempResult == 0 and enabler1[i] and disabler1[i] == False and enabler2[i] == False:
            tempResult = 1
        elif tempResult == 0 and enabler2[i] and disabler2[i] == False and enabler1[i] == False:
            tempResult = -1
        elif tempResult == 1 and disabler1[i] == True:
            tempResult = 0
        elif tempResult == -1 and disabler2[i] == True:
            tempResult = 0
        resultList[i] = tempResult
    return resultList
df['result'] = myFuntionLoop(df['enabler1'], df['disabler1'], df['enabler2'], df['disabler2'])
print(df)
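The rules in the notes above describe a small state machine, so one way to avoid a per-row Python loop is a prefix scan over per-row transition tables. This is a different technique from anything posted in this thread, shown only as a minimal sketch. The state encoding (index 0 = OFF, 1 = ON with value 1, 2 = ON with value -1) and the names `STATE_VALUES`, `transition_tables` and `scan_states` are assumptions made up for this illustration. The remaining `while` loop runs roughly log2(n) times, and each pass is a single vectorized NumPy operation.

```python
import numpy as np

# Assumed encoding for this sketch: index 0 = OFF (0), 1 = ON (+1), 2 = ON (-1).
STATE_VALUES = np.array([0, 1, -1])

def transition_tables(e1, d1, e2, d2):
    """Return an (n, 3) array T where T[i, s] is the state index after row i
    when the state index before row i was s."""
    e1, d1, e2, d2 = (np.asarray(a, dtype=bool) for a in (e1, d1, e2, d2))
    from_off = np.where(e1 & ~d1 & ~e2, 1, np.where(e2 & ~d2 & ~e1, 2, 0))
    from_pos = np.where(d1, 0, 1)   # ON(+1) is only reset by disabler1
    from_neg = np.where(d2, 0, 2)   # ON(-1) is only reset by disabler2
    tables = np.stack([from_off, from_pos, from_neg], axis=1).astype(np.intp)
    tables[:1] = 0                  # row 0 always holds the default OFF state
    return tables

def scan_states(e1, d1, e2, d2):
    """Compose the transition tables with a doubling (Hillis-Steele) scan."""
    prefix = transition_tables(e1, d1, e2, d2)
    n = len(prefix)
    shift = 1
    while shift < n:
        # composed[i][s] = prefix[i + shift][ prefix[i][s] ]: earlier rows first.
        composed = np.take_along_axis(prefix[shift:], prefix[:-shift], axis=1)
        prefix = np.concatenate([prefix[:shift], composed])
        shift *= 2
    # The initial state is OFF, so read column 0 of every composed table.
    return STATE_VALUES[prefix[:, 0]]

# Sample data from the question, given as the row numbers that are True.
rows = np.arange(20)
enabler1 = np.isin(rows, [1, 2, 11])
disabler1 = np.isin(rows, [4, 12])
enabler2 = np.isin(rows, [3, 6, 13, 15])
disabler2 = np.isin(rows, [8, 17])

result = scan_states(enabler1, disabler1, enabler2, disabler2)
print(result)
```

The scan does O(n log n) work instead of O(n), so it is not necessarily faster than a compiled loop; it only removes the row-by-row Python iteration.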
`apply()` is not always faster; sometimes a plain for loop performs better. Please check this: link
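A quick way to check that claim on your own machine is to time both variants on the same data. The sketch below is only illustrative: `step`, `with_apply`, `with_loop` and `df_bench` are hypothetical names, the random test frame is an assumption, and the stateful `apply` relies on pandas calling the function exactly once per row, in order. No timings are quoted here because they depend on the pandas version and the hardware.

```python
import timeit
import numpy as np
import pandas as pd

COLUMNS = ['enabler1', 'disabler1', 'enabler2', 'disabler2']

def step(state, e1, d1, e2, d2):
    """One transition of the ON/OFF state, following the rules in the question."""
    if state == 0:
        if e1 and not d1 and not e2:
            return 1
        if e2 and not d2 and not e1:
            return -1
        return 0
    if state == 1:
        return 0 if d1 else 1
    return 0 if d2 else -1

def with_apply(df):
    state = 0
    def update(row):
        nonlocal state                      # carry the state between rows
        state = step(state, *row)
        return state
    tail = df[COLUMNS].iloc[1:].apply(update, axis=1)
    return [0] + tail.tolist()              # row 0 keeps the default value

def with_loop(df):
    state = 0
    out = [0]                               # row 0 keeps the default value
    for e1, d1, e2, d2 in df[COLUMNS].to_numpy()[1:]:
        state = step(state, e1, d1, e2, d2)
        out.append(state)
    return out

# Assumed benchmark data: 2000 rows, each flag True about 10% of the time.
rng = np.random.default_rng(0)
df_bench = pd.DataFrame(rng.random((2000, 4)) < 0.1, columns=COLUMNS)

for fn in (with_apply, with_loop):
    seconds = timeit.timeit(lambda: fn(df_bench), number=3) / 3
    print(f"{fn.__name__}: {seconds * 1e3:.1f} ms per run")
```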
For this scenario, use the `numba` package; see the following code:
import pandas as pd
import numpy as np
from numba import njit
df = pd.DataFrame({
'enabler1': [False, True, True, False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False],
'disabler1': [False, False, False, False, True, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False],
'enabler2': [False, False, False, True, False, False, True, False, False, False, False, False, False, True, False, True, False, False, False, False],
'disabler2': [False, False, False, False, False, False, False, False, True, False, False, False, False, False, False, False, False, True, False, False]
})
df['result'] = 0
@njit
def calculate_result(enabler1, disabler1, enabler2, disabler2):
    result = np.zeros(enabler1.shape[0], dtype=np.int32)
    for i in range(1, len(enabler1)):
        prev_result = result[i-1]
        if prev_result == 0 and enabler1[i]:
            result[i] = 1
        elif prev_result == 0 and enabler2[i]:
            result[i] = -1
        elif prev_result == 1 and disabler1[i]:
            result[i] = 0
        elif prev_result == -1 and disabler2[i]:
            result[i] = 0
        else:
            result[i] = prev_result
    return result
df['result'] = calculate_result(df['enabler1'].values, df['disabler1'].values, df['enabler2'].values, df['disabler2'].values)
print(df)
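The only optimization I can see: for some rows the next value is the same as the previous one, whatever the earlier data was. That is, if `e1, d1, e2, d2 = False, False, False, False`, then next value = previous value, always. Such rows can be excluded from the calculation: leave them as NA and forward-fill them at the end (`fillna`/`ffill`).
Like this: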
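def calculate_result(df):
    def inner(row, tempResult):
        # All logic with enablers/disablers (same rules as in the question)
        if tempResult == 0 and row.enabler1 and not row.disabler1 and not row.enabler2:
            return 1
        if tempResult == 0 and row.enabler2 and not row.disabler2 and not row.enabler1:
            return -1
        if tempResult == 1 and row.disabler1:
            return 0
        if tempResult == -1 and row.disabler2:
            return 0
        return tempResult

    tempResult = 0
    df['result'] = pd.NA               # skipped rows stay NA until the final ffill
    df.loc[df.index[0], 'result'] = 0  # default initial value in the first row
    noEffect = (
        (~df.enabler1 & ~df.disabler1 & ~df.enabler2 & ~df.disabler2)
        # | (...)  other possible combinations which don't change anything
    )
    for num, row in df[
        ~noEffect &
        (df.index != df.index[0])
    ].iterrows():
        df.loc[num, 'result'] = tempResult = inner(row, tempResult)
    df['result'] = df.result.astype(pd.Int64Dtype()).ffill()
    return df

df = calculate_result(df)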
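This way you jump from one meaningful row straight to the next and spend no time on the rows that merely copy the previous state.
If you have a lot of such "empty" rows, this can pay off.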
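The same row-skipping idea can also be tried on plain NumPy arrays, which avoids `iterrows()` and the per-row `df.loc` writes. The following is a rough sketch under assumptions, not a measured improvement: the name `result_skipping_idle_rows` is hypothetical, a row is treated as "idle" only when all four flags are False, and the forward fill is done with `np.searchsorted` instead of pandas.

```python
import numpy as np

def result_skipping_idle_rows(e1, d1, e2, d2):
    """Evaluate the state only on rows with at least one True flag,
    then copy each state forward over the idle rows that follow it."""
    e1, d1, e2, d2 = (np.asarray(a, dtype=bool) for a in (e1, d1, e2, d2))
    n = len(e1)
    active = e1 | d1 | e2 | d2
    active[:1] = False                      # row 0 keeps the default value 0
    positions = np.flatnonzero(active)      # row numbers that can change the state
    if len(positions) == 0:
        return np.zeros(n, dtype=np.int64)

    state = 0
    states = np.empty(len(positions), dtype=np.int64)
    for k, i in enumerate(positions):       # loop over active rows only
        if state == 0:
            if e1[i] and not d1[i] and not e2[i]:
                state = 1
            elif e2[i] and not d2[i] and not e1[i]:
                state = -1
        elif state == 1:
            if d1[i]:
                state = 0
        elif d2[i]:                         # state == -1
            state = 0
        states[k] = state

    # For every row, find the most recent active row at or before it (-1 if none).
    last = np.searchsorted(positions, np.arange(n), side='right') - 1
    return np.where(last >= 0, states[np.maximum(last, 0)], 0)

# Sample data from the question, given as the row numbers that are True.
rows = np.arange(20)
result = result_skipping_idle_rows(
    np.isin(rows, [1, 2, 11]),       # enabler1
    np.isin(rows, [4, 12]),          # disabler1
    np.isin(rows, [3, 6, 13, 15]),   # enabler2
    np.isin(rows, [8, 17]),          # disabler2
)
print(result)
```

With a DataFrame, the four arguments would be `df['enabler1'].to_numpy()` and so on. The gain depends entirely on how sparse the True flags are, so it is worth timing on real data before adopting it.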