I'm trying to implement PPO to beat cartpole-v2. If I keep things as A2C (i.e., no clipped loss and a single epoch) I can get it to work, but when I use the clipped loss and multiple epochs there is no learning. I've been trying to find the problem in my implementation for about a week and I can't figure out what's wrong.
Here is the function responsible for the optimization:
def finish_episode():
    # Calculating losses and performing backprop
    R = 0
    saved_actions = actor.saved_actions
    returns = []
    epsilon = 0.3
    num_epochs = 1  # When num_epochs is greater than one my network won't learn

    for r in actor.rewards[::-1]:
        R = r + 0.99 * R  # Gamma is 0.99
        returns.insert(0, R)

    returns = torch.tensor(returns, device=device)
    returns = (returns - returns.mean()) / (returns.std() + eps)  # eps defined elsewhere (machine epsilon)

    old_probs, state_values, states, actions = zip(*saved_actions)
    old_probs = torch.stack(old_probs).to(device)
    state_values = torch.stack(state_values).to(device)
    states = torch.stack(states).to(device)
    actions = torch.stack(actions).to(device)

    advantages = returns - state_values.squeeze()

    for epoch in range(num_epochs):
        new_probs = actor(states).gather(1, actions.unsqueeze(-1)).squeeze()
        ratios = new_probs / old_probs
        surr1 = ratios * advantages
        surr2 = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
        # actor_loss = -torch.min(surr1, surr2).mean()  # When using this (clipped) loss my network won't learn
        actor_loss = -surr1.mean()
        actor_optimizer.zero_grad()
        actor_loss.backward(retain_graph=True)
        actor_optimizer.step()
        if epoch == num_epochs - 1:
            critic_loss = F.smooth_l1_loss(state_values.squeeze(), returns)
            critic_optimizer.zero_grad()
            critic_loss.backward(retain_graph=False)
            critic_optimizer.step()

    del actor.rewards[:]
    del actor.saved_actions[:]
I've tried different hyperparameters and used GAE instead of full Monte Carlo returns/advantages, and I can't see anything wrong when combing through my code.
In the
select_action
function, instead of getting the probability itself, you should get the log of the probability. So your function should be:
def select_action(state):
    state = torch.from_numpy(state).float().to(device)
    mean = F.softmax(actor(state), dim=-1)  # This is the mean value of the output
    m = Categorical(mean)  # This is our distribution based on mean
    action = m.sample()
    probs = m.log_prob(action)  # This is the log_prob (I didn't change the name)
    state_value = critic(state)
    actor.saved_actions.append((probs.detach(), state_value, state, action.detach()))  # changed to probs.detach()
    return action.item()
Also, in the
finish_episode
function, you need to change the equation for the ratios:
ratios = torch.exp(new_probs - old_probs)
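For this to work, `new_probs` inside the epoch loop also has to be a log-probability of the action under the current policy (e.g. via `Categorical(...).log_prob(...)`), not a raw network output, so that `exp(new_log_prob - old_log_prob)` recovers the probability ratio p_new / p_old. Here is a minimal standalone sketch of that epoch loop; the stand-in linear policy and the made-up batch data (`states`, `actions`, `advantages`) are illustrative, not the asker's actual setup:

```python
import torch
from torch.distributions import Categorical

torch.manual_seed(0)
actor = torch.nn.Linear(4, 2)        # stand-in policy network (4 obs -> 2 actions)
states = torch.randn(8, 4)           # batch of stored states
actions = torch.randint(0, 2, (8,))  # stored actions
advantages = torch.randn(8)          # stored advantages
epsilon = 0.2

# Old log-probs are detached snapshots of the behavior policy
with torch.no_grad():
    old_log_probs = Categorical(logits=actor(states)).log_prob(actions)

for epoch in range(4):  # multiple epochs can now reuse the same batch
    # Recompute log-probs under the current policy, then exponentiate the
    # difference to get the PPO probability ratio p_new / p_old
    new_log_probs = Categorical(logits=actor(states)).log_prob(actions)
    ratios = torch.exp(new_log_probs - old_log_probs)
    surr1 = ratios * advantages
    surr2 = torch.clamp(ratios, 1 - epsilon, 1 + epsilon) * advantages
    actor_loss = -torch.min(surr1, surr2).mean()
    actor_loss.backward()  # graph is rebuilt each epoch, so no retain_graph=True
    # optimizer.step() / optimizer.zero_grad() would go here
```

Because the old log-probs are detached snapshots, each epoch builds a fresh graph from `actor(states)`, which is why `retain_graph=True` is no longer needed.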
I implemented this myself and got:
Episode 370 Reward: 99.00 Average reward: 130.40
Episode 380 Reward: 158.00 Average reward: 136.79
Episode 390 Reward: 139.00 Average reward: 129.28
Episode 400 Reward: 129.00 Average reward: 126.12
Episode 410 Reward: 134.00 Average reward: 118.20
Episode 420 Reward: 135.00 Average reward: 126.12
Episode 430 Reward: 263.00 Average reward: 158.45
Episode 440 Reward: 149.00 Average reward: 184.14
Episode 450 Reward: 250.00 Average reward: 197.46
Episode 460 Reward: 205.00 Average reward: 203.21
Episode 470 Reward: 180.00 Average reward: 228.05
Episode 480 Reward: 209.00 Average reward: 224.86
Episode 490 Reward: 226.00 Average reward: 249.10
Episode 500 Reward: 335.00 Average reward: 261.40
Episode 510 Reward: 500.00 Average reward: 308.77
Give it some time to train.