I have this code, which computes the value of a column from the values of some existing columns in a pandas DataFrame.
    import pandas

    def get_prj_yield(row):
        try:
            prj_yield = row['prj_rev'] / (row['ds'] + row['otb_demand'])
            # Fall back to otb_rev / otb_demand when the first ratio is NaN
            if pandas.isnull(prj_yield):
                prj_yield = row['otb_rev'] / row['otb_demand']
            return prj_yield
        except ZeroDivisionError:
            return 0
I call this function on the DataFrame with `apply`:

    df['prj_yield'] = output_df.apply(get_prj_yield, axis=1)
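The existing DataFrame has more than 1M rows, and I would like to know whether this function can be rewritten using only simple DataFrame operations. Would that reduce resource consumption?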
Don't loop; use vectorized code:
    import numpy as np
    s1 = df['prj_rev'].div(df['ds'] + df['otb_demand'])
    s2 = df['otb_rev'].div(df['otb_demand'])
    # Use s2 where s1 is 0, then turn inf and -inf (division by zero) into 0
    df['prj_yield'] = s1.mask(s1.eq(0), s2).replace({np.inf: 0, -np.inf: 0})
Alternative, which also sets any remaining NaN to 0:
    import numpy as np

    s1 = df['prj_rev'].div(df['ds'] + df['otb_demand'])
    s2 = df['otb_rev'].div(df['otb_demand'])
    s3 = s1.mask(s1.eq(0), s2)
    # Keep finite values; inf, -inf and NaN all become 0
    df['prj_yield'] = s3.where(np.isfinite(s3), 0)
Example output:
       prj_rev  ds  otb_demand  otb_rev  prj_yield
    0        0   2           2        2        1.0
    1        2   0           2        2        1.0
    2        2   2           0        2        1.0
    3        2   2           2        0        0.5
    4        0   1           0        1        0.0
    5       -2   1          -1        1        0.0
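To check the vectorized version against the sample above, here is a minimal, self-contained sketch. The helper name `add_prj_yield` is hypothetical (it does not appear in the original code), and the input values are copied from the example table.

```python
import numpy as np
import pandas as pd


def add_prj_yield(df):
    """Hypothetical helper: return a copy of df with a vectorized prj_yield column."""
    out = df.copy()
    # First ratio; pandas yields inf/-inf (or NaN for 0/0) on division by zero
    s1 = out["prj_rev"].div(out["ds"] + out["otb_demand"])
    # Fallback ratio
    s2 = out["otb_rev"].div(out["otb_demand"])
    # Use the fallback where the first ratio is exactly 0
    s3 = s1.mask(s1.eq(0), s2)
    # Keep finite values; inf, -inf and NaN all become 0
    out["prj_yield"] = s3.where(np.isfinite(s3), 0)
    return out


# Input columns copied from the example table
sample = pd.DataFrame({
    "prj_rev":    [0, 2, 2, 2, 0, -2],
    "ds":         [2, 0, 2, 2, 1, 1],
    "otb_demand": [2, 2, 0, 2, 0, -1],
    "otb_rev":    [2, 2, 2, 0, 1, 1],
})

result = add_prj_yield(sample)
print(result)
```

Note that this falls back to `otb_rev / otb_demand` where the first ratio is 0, whereas the function in the question checks `pandas.isnull`. If the null check is the intended behaviour, replacing `s1.eq(0)` with `s1.isna()` would be one way to mirror it.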