Adam optimizer updates momentum and velocity differently in eager and non-eager execution (TensorFlow)

Question

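I am using TensorFlow and Gymnasium for reinforcement learning on Atari games, and I noticed that lazy (graph) execution with tf.function needs more iterations to converge than eager execution. Lazy execution is still faster overall, even with the extra iterations.

Lazy execution seems to need more iterations because of how the optimizer's momentum and velocity variables are updated. Below is a minimal example showing that the momentum and velocity variables differ after five training iterations (the first two defs only set up helper functions).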

import tensorflow as tf
import numpy as np
import gymnasium as gym

### FOR FORUM

@tf.function
def discounted_cumulative_sums( #from https://www.tensorflow.org/tutorials/reinforcement_learning/actor_critic
    rewards: tf.Tensor,
    gamma: float,
    standardize: bool = True) -> tf.Tensor:
    """Compute expected returns per timestep."""

    n = tf.shape(rewards)[0]
    returns = tf.TensorArray(dtype=tf.float32, size=n)

    # Start from the end of `rewards` and accumulate reward sums
    # into the `returns` array
    rewards = tf.cast(rewards[::-1], dtype=tf.float32)
    discounted_sum = tf.constant(0.0)
    discounted_sum_shape = discounted_sum.shape
    for i in tf.range(n):
        reward = rewards[i]
        discounted_sum = reward + gamma * discounted_sum
        discounted_sum.set_shape(discounted_sum_shape)
        returns = returns.write(i, discounted_sum)
    returns = returns.stack()[::-1]

    if standardize:
        returns = ((returns - tf.math.reduce_mean(returns)) /
                   tf.math.reduce_std(returns))

    return returns

def make_model(input_shape,output_nodes) : 
    tf.keras.utils.set_random_seed(1956) #set seed so it always returns identical model    
    input_layer = tf.keras.layers.Input(input_shape)
    x = tf.keras.layers.Dense(64,activation = 'relu')(input_layer)
    x = tf.keras.layers.Dense(64,activation = 'relu')(x)
    output_layer = tf.keras.layers.Dense(output_nodes,activation = 'linear')(x)
    model = tf.keras.Model(input_layer,output_layer)
    return model

@tf.function
def train_model(old_states, A, model, optimizer) : 
    for i in tf.range(5) : 
        with tf.GradientTape() as tape : 
            loss = tf.reduce_mean(tf.squeeze(model(old_states),1) * A)

        grads = tape.gradient(loss,model.trainable_variables)
        optimizer.apply_gradients(zip(grads,model.trainable_variables))

    return optimizer.variables

# create data
td_error = tf.random.normal(shape=(100,))
old_states = tf.random.normal(shape=(100,8))

# create models and optimizers
eager_model = make_model((8,),1)
non_eager_model = make_model((8,),1)
eager_optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0005)
non_eager_optimizer = tf.keras.optimizers.Adam(learning_rate = 0.0005)

tf.config.run_functions_eagerly(True) # EAGER EXECUTION
A = discounted_cumulative_sums(td_error,.99)
eager_opt_variables = train_model(old_states,A,eager_model,eager_optimizer)

tf.config.run_functions_eagerly(False) # NON-EAGER EXECUTION
A = discounted_cumulative_sums(td_error,.99) # if this is commented out, there will be no differences between optimizer variables
non_eager_opt_variables = train_model(old_states,A,non_eager_model,non_eager_optimizer)

# Show differences between optimizer variables after training
for x in range(len(eager_opt_variables)) : # iterate over however many optimizer variables exist
    print(eager_opt_variables[x].numpy()==non_eager_opt_variables[x].numpy())

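Running this code shows that eager_opt_variables and non_eager_opt_variables differ in many entries. The differences are small, but they either grow over time or have an outsized effect on convergence speed.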
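To illustrate how a tiny difference in the gradients shows up in both optimizer slots, here is a minimal sketch of the textbook Adam moment updates for a single scalar weight. It is not Keras's exact implementation; the function name adam_moments, the gradient values, and the 1e-7 perturbation (roughly one float32 rounding step) are assumptions made for illustration.

```python
import math

def adam_moments(grads, beta_1=0.9, beta_2=0.999):
    """Return the first (m) and second (v) moment estimates after len(grads) steps."""
    m, v = 0.0, 0.0
    for g in grads:
        m = beta_1 * m + (1.0 - beta_1) * g      # momentum: running mean of gradients
        v = beta_2 * v + (1.0 - beta_2) * g * g  # velocity: running mean of squared gradients
    return m, v

# Hypothetical gradients for one scalar weight over five steps.
grads = [0.5, -0.2, 0.1, 0.4, -0.3]
# The same gradients with a tiny relative perturbation.
perturbed = [g * (1.0 + 1e-7) for g in grads]

m_a, v_a = adam_moments(grads)
m_b, v_b = adam_moments(perturbed)

# Exact comparison versus tolerance-based comparison.
print(m_a == m_b, v_a == v_b)
print(math.isclose(m_a, m_b, rel_tol=1e-6), math.isclose(v_a, v_b, rel_tol=1e-6))
```

In this sketch the exact comparison fails for both slots while the tolerance-based one passes, which is the same pattern as an element-wise == on the optimizer variables.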

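Interestingly, this seems to be caused by the A = discounted_cumulative_sums(td_error) part. If we skip the discounted_cumulative_sums function (for example by setting A = td_error), the optimizer variables are identical between eager and lazy execution.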

My question is: why does this happen, and how can I make lazy execution produce the same results as eager execution?
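One way to narrow this down might be to check whether A itself already differs between the two modes, by comparing both against a plain-Python reference. The sketch below mirrors the logic of discounted_cumulative_sums (including the population standard deviation, as tf.math.reduce_std computes); the name reference_discounted_returns is hypothetical, and it runs in float64 rather than float32, so any comparison against TensorFlow output should use a tolerance.

```python
import math

def reference_discounted_returns(rewards, gamma, standardize=True):
    """Plain-Python discounted cumulative sums, accumulated from the last reward backwards."""
    running = 0.0
    returns = []
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()

    if standardize:
        mean = sum(returns) / len(returns)
        # Population standard deviation (divide by N, not N - 1).
        std = math.sqrt(sum((x - mean) ** 2 for x in returns) / len(returns))
        returns = [(x - mean) / std for x in returns]
    return returns

print(reference_discounted_returns([1.0, 1.0, 1.0], gamma=0.5, standardize=False))
# -> [1.75, 1.5, 1.0]
```

If the eager and graph versions of A both match this reference within float32 precision but not each other exactly, the mismatch starts in the computation of A rather than in the optimizer.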

Answer 1

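For me, the optimizer variables are not exactly identical even if I comment out the second discounted_cumulative_sums call. They are almost identical, though, which you can test with NumPy's allclose: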

for x in range(len(eager_opt_variables)):
  print(np.allclose(eager_opt_variables[x].numpy(), non_eager_opt_variables[x].numpy()), end=', ')
print()
for x in range(len(eager_opt_variables)):
  print(np.allclose(non_eager_opt_variables[x].numpy(), eager_opt_variables[x].numpy()), end=', ')

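For me, these loops all evaluate to True. Note that np.allclose is not symmetric (np.isclose(a, b) can give a different result from np.isclose(b, a)), so I tested both directions. This means the variables are close to each other, but the numbers are not exactly identical.
For example, if you loosen the relative tolerance while making the absolute tolerance stricter, you can see that not all variables are "close enough" (the defaults are rtol=1e-5 and atol=1e-8):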

for x in range(len(eager_opt_variables)):
  print(np.allclose(eager_opt_variables[x].numpy(), non_eager_opt_variables[x].numpy(), rtol=1e-4, atol=0.0), end=', ')
print()
for x in range(len(eager_opt_variables)):
  print(np.allclose(non_eager_opt_variables[x].numpy(), eager_opt_variables[x].numpy(), rtol=1e-4, atol=0.0), end=', ')

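True, True, True, True, True, True, True, True, True, True, True, False, False,
True, True, True, True, True, True, True, True, True, True, True, False, False,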

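The interpretation is that, for all variables except those of the last layer, the error is less than 0.01% of the reference value (rtol=1e-4 with atol=0.0).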
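The tolerance check can be written out for scalars to show both the asymmetry and the effect of atol. The sketch below follows the formula in the NumPy documentation for np.isclose, |a - b| <= atol + rtol * |b|; the helper name is_close and the sample numbers are assumptions chosen for illustration, not values from the run above.

```python
def is_close(a, b, rtol=1e-5, atol=1e-8):
    """Scalar version of the documented np.isclose check: |a - b| <= atol + rtol * |b|."""
    return abs(a - b) <= atol + rtol * abs(b)

# Asymmetry: only the second argument scales the relative tolerance.
print(is_close(10.0, 12.0, rtol=0.18, atol=0.0))  # compares 2.0 against 0.18 * 12.0
print(is_close(12.0, 10.0, rtol=0.18, atol=0.0))  # compares 2.0 against 0.18 * 10.0

# A near-zero pair: the default atol hides a large relative difference.
a, b = 1.0e-9, 2.0e-9
print(is_close(a, b))                        # default tolerances
print(is_close(a, b, rtol=1e-4, atol=0.0))   # relative-only check
```

One possible reading of the two False entries is therefore that those variables hold values small enough for the default atol to dominate, so they only fail once atol is set to 0.0.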

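My answer to the "why" is only a hypothesis, but I think TensorFlow can order the math operations differently in graph mode (the lazy mode in the question) and in eager mode. I believe that in graph mode some operations get optimized, which leads to small rounding differences. These errors accumulate over the optimizer's epochs and across the layers. I do not know TensorFlow's internals well enough to prove this hypothesis, though.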
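The underlying effect, that reordering floating-point operations changes the rounding, can be reproduced without TensorFlow. This is a minimal sketch in plain Python (float64); it only illustrates the general mechanism and does not show what TensorFlow's graph optimizer actually does.

```python
# Floating-point addition is not associative: the grouping changes the rounding.
left = (0.1 + 0.2) + 0.3
right = 0.1 + (0.2 + 0.3)
print(left == right, left, right)

# The same three numbers accumulated in two different orders.
values = [1e16, 1.0, -1e16]

forward = 0.0
for x in values:
    forward += x  # the 1.0 is lost when added to 1e16

reordered = 0.0
for x in [values[0], values[2], values[1]]:
    reordered += x  # the large terms cancel first, so the 1.0 survives

print(forward, reordered)
```

In float32, which TensorFlow uses by default, the same effect appears at a much coarser scale (about 7 significant decimal digits), so bit-identical results across execution modes are not something the arithmetic guarantees.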
