I'm trying to implement PPO to beat cartpole-v2. If I keep things as A2C (i.e., no clipped loss and a single epoch) I can get it to work, but when I use the clipped loss and multiple epochs there is no learning. I've been trying to find the problem in my implementation for about a week and I can't figure out what's wrong.
Here is the function responsible for the optimization:
def finish_episode():
    # Calculating losses and performing backprop
    R = 0
    saved_actions = actor.saved_actions
    returns = []
    epsilon = 0.3
    num_epochs = 1  # When num_epochs is greater than one my network won't learn

    for r in actor.rewards[::-1]:
        R = r + 0.99 * R  # Gamma is 0.99
        returns.insert(0, R)

    returns = torch.tensor(returns, device=device)
    returns = (returns - returns.mean()) / (returns.std() + eps)  # eps defined elsewhere (machine epsilon)

    old_probs, state_values, states, actions = zip(*saved_actions)
    old_probs = torch.stack(old_probs).to(device)
    state_values = torch.stack(state_values).to(device)
    states = torch.stack(states).to(device)
    actions = torch.stack(actions).to(device)

    advantages = returns - state_values.squeeze()

    for epoch in range(num_epochs):
        new_probs = actor(states).gather(1, actions.unsqueeze(-1)).squeeze()
        ratios = new_probs / old_probs
        surr1 = ratios * advantages
        surr2 = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
        # actor_loss = -torch.min(surr1, surr2).mean()  # When using this (clipped) loss my network won't learn
        actor_loss = -surr1.mean()
        actor_optimizer.zero_grad()
        actor_loss.backward(retain_graph=True)
        actor_optimizer.step()
        if epoch == num_epochs - 1:
            critic_loss = F.smooth_l1_loss(state_values.squeeze(), returns)
            critic_optimizer.zero_grad()
            critic_loss.backward(retain_graph=False)
            critic_optimizer.step()

    del actor.rewards[:]
    del actor.saved_actions[:]
I've tried different hyperparameters and used GAE instead of full Monte Carlo returns/advantages, and I can't see anything wrong when combing through my code.
In the
select_action
function, instead of getting the probability itself, you should get the log of the probability. So your function should be:
def select_action(state):
    state = torch.from_numpy(state).float().to(device)
    mean = F.softmax(actor(state), dim=-1)  # This is the mean value of the output
    m = Categorical(mean)  # This is our distribution based on mean
    action = m.sample()
    probs = m.log_prob(action)  # This is the log_prob (I didn't change the name)
    state_value = critic(state)
    actor.saved_actions.append((probs.detach(), state_value, state, action.detach()))  # changed to probs.detach()
    return action.item()
Also, in the
finish_episode
function, you need to change the equation for the ratios:
ratios = torch.exp(new_probs - old_probs)
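For this to work, `new_probs` inside the epoch loop also has to be a log-probability of the action under the current policy (e.g. via `Categorical(...).log_prob(...)`), not a raw network output, so that `exp(new_log_prob - old_log_prob)` recovers the probability ratio p_new / p_old. Here is a minimal standalone sketch of that epoch loop; the stand-in linear policy and the made-up batch data (`states`, `actions`, `advantages`) are illustrative, not the asker's actual setup:

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
actor = torch.nn.Linear(4, 2)        # stand-in policy network (4 obs -> 2 actions)
states = torch.randn(8, 4)           # batch of stored states
actions = torch.randint(0, 2, (8,))  # stored actions
advantages = torch.randn(8)          # stored advantages
epsilon = 0.2

# Old log-probs are detached snapshots of the behavior policy
with torch.no_grad():
    old_log_probs = Categorical(logits=actor(states)).log_prob(actions)

for epoch in range(4):  # multiple epochs can now reuse the same batch
    # Recompute log-probs under the current policy, then exponentiate the
    # difference to get the PPO probability ratio p_new / p_old
    new_log_probs = Categorical(logits=actor(states)).log_prob(actions)
    ratios = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    actor_loss = -torch.min(surr1, surr2).mean()
    actor_loss.backward()  # graph is rebuilt each epoch, so no retain_graph=True
    # optimizer.step() / optimizer.zero_grad() would go here
```

Because the old log-probs are detached snapshots, each epoch builds a fresh graph from `actor(states)`, which is why `retain_graph=True` is no longer needed.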
I implemented this myself and got:
Episode 370 Reward: 99.00 Average reward: 130.40
Episode 380 Reward: 158.00 Average reward: 136.79
Episode 390 Reward: 139.00 Average reward: 129.28
Episode 400 Reward: 129.00 Average reward: 126.12
Episode 410 Reward: 134.00 Average reward: 118.20
Episode 420 Reward: 135.00 Average reward: 126.12
Episode 430 Reward: 263.00 Average reward: 158.45
Episode 440 Reward: 149.00 Average reward: 184.14
Episode 450 Reward: 250.00 Average reward: 197.46
Episode 460 Reward: 205.00 Average reward: 203.21
Episode 470 Reward: 180.00 Average reward: 228.05
Episode 480 Reward: 209.00 Average reward: 224.86
Episode 490 Reward: 226.00 Average reward: 249.10
Episode 500 Reward: 335.00 Average reward: 261.40
Episode 510 Reward: 500.00 Average reward: 308.77
Give it some time to train.