I'm currently implementing Q-learning for the FrozenLake-v1 environment in OpenAI Gym. However, my agent seems to take a lot of unnecessary steps to reach the goal. I've gone through my code several times, but I can't pinpoint the problem.
Here is the code I'm using:
import random

import numpy as np
import gymnasium as gym


def argmax(arr):
    # break ties between equally valued actions at random
    arr_max = np.max(arr)
    return np.random.choice(np.where(arr == arr_max)[0])


def save_q_table(Q):
    np.savetxt("q_table.csv", Q, delimiter=",")


def load_q_table():
    return np.loadtxt("q_table.csv", delimiter=",")


def run(training):
    if not training:
        env = gym.make("FrozenLake-v1", render_mode='human')
    else:
        env = gym.make("FrozenLake-v1")

    Q = np.zeros((env.observation_space.n, env.action_space.n))  # empty q_table
    if not training:
        Q = load_q_table()

    alpha = 0.8
    gamma = 0.95
    episode = 0
    episodes = 10000
    epsilon = 0.95
    epsilon_decay = (2 * epsilon) / episodes
    epsilon_min = 0.05

    env.metadata['render_fps'] = 10
    state, info = env.reset()

    while episode < episodes:
        # epsilon-greedy action selection (explore only while training)
        if random.random() < epsilon and training:
            action = env.action_space.sample()
        else:
            action = argmax(Q[state])

        new_state, reward, terminated, truncated, info = env.step(action)

        if training:
            # Q-learning update
            Q[state, action] = Q[state, action] + alpha * (
                float(reward) + gamma * np.max(Q[new_state]) - Q[state, action])

        state = new_state

        if terminated or truncated:
            if epsilon > epsilon_min:
                epsilon -= epsilon_decay
            episode += 1
            # save on last episode
            if training and episode == episodes:
                print("Saving Q table")
                save_q_table(Q)
            print("Episode: ", episode, "Epsilon: ", round(epsilon, 2), "Reward: ", reward)
            state, info = env.reset()  # Reset the environment

    env.close()


run(training=False)
I tried lowering the reward when the step count gets high, for example removing 0.01 reward per step taken if the goal is found. I hoped this would help the agent understand that it should take fewer steps, but it still behaves the same way. Lowering the reward on every step even when the goal isn't found also seemed like an idea, but since the reward would then become negative, I don't think you can do that.
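Concretely, the shaping I tried was applied to the reward just before the Q-update and looked roughly like this (0.01 per step is the value I mentioned above; the function name and arguments are just for illustration):

def shape_reward(reward, steps_taken, terminated, step_penalty=0.01):
    # subtract a small penalty per step taken, but only when the goal was reached
    if terminated and reward > 0:
        return float(reward) - step_penalty * steps_taken
    return float(reward)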
Isn't this an is_slippery problem? is_slippery defaults to True, which gives a 2/3 probability that the move is not the one you intended.
See the description:
The lake is slippery (unless disabled), so the player may sometimes move perpendicular to the intended direction (see is_slippery).
You can turn this off by setting:
env = gym.make("FrozenLake-v1", render_mode='human', is_slippery=False)
or, for the training environment without rendering:
env = gym.make("FrozenLake-v1", is_slippery=False)
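If you want to see the effect directly, you can inspect the tabular transition model that FrozenLake builds internally. This is just a debugging sketch; it assumes the Gymnasium implementation, which exposes that table as unwrapped.P:

import gymnasium as gym

env = gym.make("FrozenLake-v1", is_slippery=True)
state, action = 0, 2  # action 2 = move right from the start state
# each entry is (probability, next_state, reward, terminated)
for transition in env.unwrapped.P[state][action]:
    print(transition)
env.close()

With is_slippery=True you should see three entries with probability 1/3 each (the intended move plus the two perpendicular ones); with is_slippery=False there is a single entry with probability 1.0. Note that after changing the setting you need to retrain (run with training=True), since the saved Q-table was learned on the slippery dynamics.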