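A common term in finance and reinforcement learning is the discounted cumulative reward C[i] based on a time series of raw rewards R[i]. Given an array R, we'd like to calculate C[i] satisfying the recurrence C[i] = R[i] + discount * C[i+1] with C[-1] = R[-1] (and return the full array C).
A numerically stable way of calculating this in Python with NumPy arrays might be: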
import numpy as np

def cumulative_discount(rewards, discount):
    future_cumulative_reward = 0
    assert np.issubdtype(rewards.dtype, np.floating), rewards.dtype
    cumulative_rewards = np.empty_like(rewards)
    # Walk backwards so that C[i+1] is already known when C[i] is computed.
    for i in range(len(rewards) - 1, -1, -1):
        cumulative_rewards[i] = rewards[i] + discount * future_cumulative_reward
        future_cumulative_reward = cumulative_rewards[i]
    return cumulative_rewards
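However, this relies on a Python loop. Given that this is such a common calculation, surely there's an existing vectorized solution relying on some other standard functions without resorting to cythonization.
Note that any solution using something like np.power(discount, np.arange(len(rewards))) won't be stable, since the powers underflow or overflow for long arrays.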
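To illustrate the stability concern, here is a minimal sketch of one such power-based approach. The function name naive_power_discount is hypothetical and not from the question; it uses the identity C[i] = (sum over j >= i of discount**j * R[j]) / discount**i. With a small discount and a long array, discount**i underflows to zero, so the division produces non-finite values.

```python
import numpy as np

def naive_power_discount(rewards, discount):
    # Hypothetical helper: vectorized, but numerically fragile.
    powers = np.power(discount, np.arange(len(rewards)))
    weighted = rewards * powers
    # Suffix sums of discount**j * R[j], computed via a reversed cumsum.
    suffix_sums = np.cumsum(weighted[::-1])[::-1]
    # Silence the divide warnings so the bad values can be inspected.
    with np.errstate(divide="ignore", invalid="ignore"):
        return suffix_sums / powers

# Fine on a short array: values are 1.75, 1.5, 1.0
short_result = naive_power_discount(np.ones(3), 0.5)

# On a long array 0.5**i underflows to 0.0, so the tail becomes 0/0.
long_result = naive_power_discount(np.ones(2000), 0.5)
print(np.isfinite(long_result).all())
```

The backward loop does not have this problem because it never forms discount**i explicitly.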
You can use scipy.signal.lfilter to solve the recurrence relation:
import scipy.signal as signal

def alt(rewards, discount):
    """
    C[i] = R[i] + discount * C[i+1]
    signal.lfilter(b, a, x, axis=-1, zi=None)
    a[0]*y[n] = b[0]*x[n] + b[1]*x[n-1] + ... + b[M]*x[n-M]
                          - a[1]*y[n-1] - ... - a[N]*y[n-N]
    """
    # Reverse so the backward recurrence becomes a forward (causal) filter.
    r = rewards[::-1]
    a = [1, -discount]
    b = [1]
    y = signal.lfilter(b, a, x=r)
    return y[::-1]
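As a small worked example of why these coefficients fit: after reversing the array, the recurrence C[i] = R[i] + discount * C[i+1] reads y[n] = x[n] + discount * y[n-1], which is the lfilter difference equation with b = [1] and a = [1, -discount]. The sketch below uses a hypothetical function name, discounted_cumsum, and a three-element input that can be checked by hand.

```python
import numpy as np
from scipy import signal

def discounted_cumsum(rewards, discount):
    # Hypothetical name; same idea as the lfilter approach described above.
    reversed_rewards = np.asarray(rewards, dtype=np.float64)[::-1]
    # y[n] = x[n] + discount * y[n-1]  <=>  b = [1], a = [1, -discount]
    filtered = signal.lfilter([1.0], [1.0, -discount], reversed_rewards)
    return filtered[::-1]

rewards = np.array([1.0, 2.0, 3.0])
result = discounted_cumsum(rewards, 0.5)
# By hand: C[2] = 3, C[1] = 2 + 0.5 * 3 = 3.5, C[0] = 1 + 0.5 * 3.5 = 2.75
print(result)
```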
This script tests that the result is the same:
import scipy.signal as signal
import numpy as np

def orig(rewards, discount):
    future_cumulative_reward = 0
    cumulative_rewards = np.empty_like(rewards, dtype=np.float64)
    for i in range(len(rewards) - 1, -1, -1):
        cumulative_rewards[i] = rewards[i] + discount * future_cumulative_reward
        future_cumulative_reward = cumulative_rewards[i]
    return cumulative_rewards

def alt(rewards, discount):
    """
    C[i] = R[i] + discount * C[i+1]
    signal.lfilter(b, a, x, axis=-1, zi=None)
    a[0]*y[n] = b[0]*x[n] + b[1]*x[n-1] + ... + b[M]*x[n-M]
                          - a[1]*y[n-1] - ... - a[N]*y[n-N]
    """
    r = rewards[::-1]
    a = [1, -discount]
    b = [1]
    y = signal.lfilter(b, a, x=r)
    return y[::-1]

# test that the result is the same
np.random.seed(2017)

for i in range(100):
    rewards = np.random.random(10000)
    discount = 1.01
    expected = orig(rewards, discount)
    result = alt(rewards, discount)
    if not np.allclose(expected, result):
        print('FAIL: {}({}, {})'.format('alt', rewards, discount))
        break
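The computation you describe is known as Horner's rule, or Horner's method of evaluating polynomials. It is implemented in NumPy's polynomial.polyval.
However, you want the whole cumulative_rewards list, i.e., all the intermediate steps of Horner's rule. The NumPy method does not return those intermediate values. Your function, decorated with Numba's @jit, could be the best option for that.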
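A minimal sketch of that suggestion, assuming Numba is installed. It uses njit (Numba's nopython-mode decorator) rather than plain @jit, the function name cumulative_discount_jit is hypothetical, and the no-op fallback exists only so the snippet still runs without Numba. No speedup is measured or claimed here.

```python
import numpy as np

try:
    from numba import njit  # assumption: Numba is available
except ImportError:
    def njit(func):
        # Fallback: leave the function as plain Python.
        return func

@njit
def cumulative_discount_jit(rewards, discount):
    # Same backward loop as in the question; rewards must be a float array.
    cumulative_rewards = np.empty_like(rewards)
    future_cumulative_reward = 0.0
    for i in range(len(rewards) - 1, -1, -1):
        cumulative_rewards[i] = rewards[i] + discount * future_cumulative_reward
        future_cumulative_reward = cumulative_rewards[i]
    return cumulative_rewards

jit_result = cumulative_discount_jit(np.array([1.0, 2.0, 3.0]), 0.5)
print(jit_result)  # values 2.75, 3.5, 3.0
```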
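As a theoretical possibility, I'll point out that polyval can return the entire list if given a Hankel matrix of coefficients. This is vectorized but ultimately less efficient than the Python loop, because each value of cumulative_reward is computed from scratch, independently of the others.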
import numpy as np
from numpy.polynomial.polynomial import polyval
from scipy.linalg import hankel

rewards = np.random.uniform(10, 100, size=(100,))
discount = 0.9
print(polyval(discount, hankel(rewards)))
This matches the output of
print(cumulative_discount(rewards, discount))
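To make the Hankel trick concrete, here is a small sketch on a three-element input chosen for illustration. Column j of hankel(rewards) holds rewards[j], rewards[j+1], ... padded with zeros, and polyval treats each column as the coefficients of a polynomial in discount, so column j evaluates to C[j]. Note that the matrix has n * n entries, which is where the extra work comes from.

```python
import numpy as np
from numpy.polynomial.polynomial import polyval
from scipy.linalg import hankel

rewards = np.array([1.0, 2.0, 3.0])
discount = 0.5

# Each column j lists the coefficients rewards[j], rewards[j+1], ..., 0:
# [[1, 2, 3],
#  [2, 3, 0],
#  [3, 0, 0]]
coefficients = hankel(rewards)

# Column 0: 1 + 2*0.5 + 3*0.25 = 2.75; column 1: 2 + 3*0.5 = 3.5; column 2: 3
cumulative = polyval(discount, coefficients)
print(cumulative)
```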