Why do policy iteration and value iteration give different results for the optimal value and optimal policy?

Problem description

I am currently studying dynamic programming in reinforcement learning, where I came across two concepts: Value-Iteration and Policy-Iteration. To understand them, I am implementing the gridworld example from Sutton, which says:

The nonterminal states are S = {1, 2, ..., 14}. There are four actions possible in each state, A = {up, down, right, left}, which deterministically cause the corresponding state transitions, except that actions that would take the agent off the grid leave the state unchanged. Thus, for instance, p(6, -1 | 5, right) = 1, p(7, -1 | 7, right) = 1, and p(10, r | 5, right) = 0 for all r in R. This is an undiscounted, episodic task. The reward is -1 on all transitions until the terminal state is reached. The terminal state is shaded in the figure (although it is shown in two places, it is formally one state). The expected reward function is thus r(s, a, s') = -1 for all states s, s' and actions a. Suppose the agent follows the equiprobable random policy (all actions equally likely).

Below are my implementations of the two methods:

# imports assumed by the snippet (not shown in the original post)
import numpy as np
import time


class GridWorld:
    def __init__(self, grid_size=5, gamma=0.9, penalty=-1.0, theta=1e-2):
        self.grid_size = grid_size
        self.discount = gamma
        # four deterministic moves: left, up, right, down
        self.actions = [np.array([0, -1]), np.array([-1, 0]), np.array([0, 1]), np.array([1, 0])]
        self.action_prob = 1 / len(self.actions)
        self.theta = theta
        print('action prob : ', self.action_prob)
        self.penalty_reward = penalty
        self.re_init()

    def re_init(self):
        self.values = np.zeros((self.grid_size, self.grid_size))
        self.policy = np.zeros(self.values.shape, dtype=int)

    def checkTerminal(self, state):
        x, y = state
        if x == 0 and y == 0:
            return 1
        elif (x == self.grid_size - 1 and y == self.grid_size - 1):
            return 1
        else:
            return 0

    def step(self, state, action):
        # print(state)
        if self.checkTerminal(state):
            next_state = state
            reward = 0
        else:
            next_state = (np.array(state) + action).tolist()
            x, y = next_state
            if x < 0 or x >= self.grid_size or y < 0 or y >= self.grid_size:
                next_state = state
            reward = self.penalty_reward
        return next_state, reward

    def compValueIteration(self):
        new_state_values = np.zeros((self.grid_size, self.grid_size))
        policy = np.zeros((self.grid_size, self.grid_size))
        iter_cnt = 0
        while True:
            # delta = 0
            state_values = new_state_values.copy()
            old_state_values = state_values.copy()
            for i in range(self.grid_size):
                for j in range(self.grid_size):
                    values = []
                    for action in self.actions:
                        (next_i, next_j), reward = self.step([i, j], action)
                        values.append(reward + self.discount * state_values[next_i, next_j])
                    new_state_values[i, j] = np.max(values)
                    policy[i, j] = np.argmax(values)
                    # delta = max(delta, np.abs(old_state_values[i, j] - new_state_values[i, j]))
            delta = np.abs(old_state_values - new_state_values).max()
            print(f'Difference: {delta}')
            if delta < self.theta:
                break
            iter_cnt += 1
        return new_state_values, policy, iter_cnt

    def policyEvaluation(self, policy, new_state_values):
        # new_state_values = np.zeros((self.grid_size, self.grid_size))
        iter_cnt = 0
        while True:
            delta = 0
            state_values = new_state_values.copy()
            old_state_values = state_values.copy()
            for i in range(self.grid_size):
                for j in range(self.grid_size):
                    action = policy[i, j]
                    (next_i, next_j), reward = self.step([i, j], action)
                    value = self.action_prob * (reward + self.discount * state_values[next_i, next_j])
                    new_state_values[i, j] = value
                    delta = max(delta, np.abs(old_state_values[i, j] - new_state_values[i, j]))
            print(f'Difference: {delta}')
            if delta < self.theta:
                break
            iter_cnt += 1
        return new_state_values

    def policyImprovement(self, policy, values, actions):
        # expected_action_returns = np.zeros((self.grid_size, self.grid_size, np.size(actions)))
        policy_stable = True
        for i in range(self.grid_size):
            for j in range(self.grid_size):
                old_action = policy[i, j]
                act_cnt = 0
                expected_rewards = []
                for action in self.actions:
                    (next_i, next_j), reward = self.step([i, j], action)
                    expected_rewards.append(self.action_prob * (reward + self.discount * values[next_i, next_j]))
                # max_reward = np.max(expected_rewards)
                # new_action = np.random.choice(np.where(expected_rewards == max_reward)[0])
                new_action = np.argmax(expected_rewards)
                # print('new_action : ', new_action)
                # print('old_action : ', old_action)
                if old_action != new_action:
                    policy_stable = False
                policy[i, j] = new_action
        return policy, policy_stable

    def compPolicyIteration(self):
        iterations = 0
        total_start_time = time.time()
        while True:
            start_time = time.time()
            self.values = self.policyEvaluation(self.policy, self.values)
            elapsed_time = time.time() - start_time
            print(f'PE => Elapsed time {elapsed_time} seconds')
            start_time = time.time()
            self.policy, policy_stable = self.policyImprovement(self.policy, self.values, self.actions)
            elapsed_time = time.time() - start_time
            print(f'PI => Elapsed time {elapsed_time} seconds')
            if policy_stable:
                break
            iterations += 1
        total_elapsed_time = time.time() - total_start_time
        print(f'Optimal policy is reached after {iterations} iterations in {total_elapsed_time} seconds')
        return self.values, self.policy
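The post does not show how the class is driven; judging from the 4x4 matrices and the undiscounted values reported below, it was presumably run with something like the following (a hypothetical driver; grid_size=4 and gamma=1.0 are my assumptions, not taken from the question):

# Hypothetical driver, not part of the original post: a 4x4 grid and an
# undiscounted task (gamma=1.0) are assumed from the results shown below.
world = GridWorld(grid_size=4, gamma=1.0)
vi_values, vi_policy, vi_iters = world.compValueIteration()
world.re_init()  # reset values and policy before running policy iteration
pi_values, pi_policy = world.compPolicyIteration()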

However, my two implementations give different optimal policies and optimal values, even though I followed the algorithms exactly as given in the book.

Results with policy iteration:

values :
[[ 0.         -0.33300781 -0.33300781 -0.33300781]
 [-0.33300781 -0.33300781 -0.33300781 -0.33300781]
 [-0.33300781 -0.33300781 -0.33300781 -0.33300781]
 [-0.33300781 -0.33300781 -0.33300781  0.        ]]

policy :
[[0 0 0 0]
 [1 0 0 0]
 [0 0 0 3]
 [0 0 2 0]]

Results with value iteration:

values :
[[ 0.0 -1.0 -2.0 -3.0]
 [-1.0 -2.0 -3.0 -2.0]
 [-2.0 -3.0 -2.0 -1.0]
 [-3.0 -2.0 -1.0  0.0]]

policy :
[[0. 0. 0. 0.]
 [1. 0. 0. 3.]
 [1. 0. 2. 3.]
 [1. 2. 2. 0.]]
In addition, value iteration converges after 4 iterations, while policy iteration converges after 2 iterations.

Where am I going wrong? Can the two methods give different optimal policies? I believe the book states that the policy is already optimal after the third sweep of values, so there must be some problem in my policy iteration code that I cannot see. Basically, how should I compute the policy?

python dynamic-programming reinforcement-learning policy value-iteration
1 Answer
I think the problem lies in these lines:

(1) value = self.action_prob * (reward + self.discount * state_values[next_i, next_j])
(2) new_state_values[i, j] = value

Here you directly assign the value obtained from only one action. If you look at the Bellman expectation equation, there is a sum over all actions at the front. You have to take all actions in a state into account: perform the computation in (1) for every possible action, sum the results, and assign that sum in (2) as the new value of (i, j).
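In equation form, the Bellman expectation backup is v(s) <- sum_a pi(a|s) * (r(s, a) + gamma * v(s')). A minimal sketch of what the inner update of policyEvaluation could look like under this suggestion, assuming self.action_prob plays the role of pi(a|s) for the equiprobable random policy:

# Sketch only: replace the single-action assignment with a sum over all actions,
# each weighted by its probability under the policy being evaluated.
value = 0.0
for action in self.actions:
    (next_i, next_j), reward = self.step([i, j], action)
    value += self.action_prob * (reward + self.discount * state_values[next_i, next_j])
new_state_values[i, j] = value

If instead you want to evaluate the deterministic greedy policy stored in self.policy, the chosen action has probability 1, so the self.action_prob factor should be dropped rather than applied to a single action.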