Q-learning agent takes too many steps to reach the goal

Question · 0 votes · 1 answer

I'm currently implementing Q-learning for the FrozenLake-v1 environment (Gymnasium). However, my agent seems to take a lot of unnecessary steps to reach the goal. I've gone over my code several times, but I can't pinpoint the problem.

Here is the code I'm using:

import random
import numpy as np
import gymnasium as gym

def argmax(arr):
    # Break ties randomly among equally valued actions,
    # instead of always picking the first one like np.argmax
    arr_max = np.max(arr)
    return np.random.choice(np.where(arr == arr_max)[0])


def save_q_table(Q):
    np.savetxt("q_table.csv", Q, delimiter=",")


def load_q_table():
    return np.loadtxt("q_table.csv", delimiter=",")


def run(training):
    if not training:
        env = gym.make("FrozenLake-v1", render_mode='human')
    else:
        env = gym.make("FrozenLake-v1")

    Q = np.zeros((env.observation_space.n, env.action_space.n))  # empty q_table

    if not training:
        Q = load_q_table()

    alpha = 0.8
    gamma = 0.95
    episode = 0
    episodes = 10000
    epsilon = 0.95
    epsilon_decay = (2 * epsilon) / episodes  # linear decay, reaching 0 about halfway through training
    epsilon_min = 0.05
    env.metadata['render_fps'] = 10

    state, info = env.reset()

    while episode < episodes:

        # Epsilon-greedy: explore with probability epsilon during training,
        # otherwise act greedily on the learned Q-values
        if random.random() < epsilon and training:
            action = env.action_space.sample()
        else:
            action = argmax(Q[state])

        new_state, reward, terminated, truncated, info = env.step(action)

        if training:
            Q[state, action] = Q[state, action] + alpha * (
                        float(reward) + gamma * np.max(Q[new_state]) - Q[state, action])

        state = new_state

        if terminated or truncated:

            if epsilon > epsilon_min:
                epsilon -= epsilon_decay

            episode += 1

            # save on last episode
            if training and episode == episodes:
                print("Saving Q table")
                save_q_table(Q)

            print("Episode: ", episode, "Epsilon: ", round(epsilon, 2), "Reward: ", reward)

            state, info = env.reset()  # Reset the environment

    env.close()


run(training=False)  # run with training=True first so q_table.csv exists

I tried lowering the reward when the step count is high, e.g. subtracting 0.01 from the reward per step taken when the goal is found. I hoped this would teach the agent to take fewer steps, but it still behaves the same way. Penalizing every step even when the goal isn't reached also seemed like an idea, but since the reward would go negative, I assumed that isn't allowed.
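For what it's worth, a negative reward is not a problem in itself: Q-learning handles negative rewards fine, and a small per-step cost is a common way to encourage shorter paths. A minimal sketch of such a shaped update, reusing the alpha/gamma from the code above (the `step_penalty` value and the `q_update` helper are illustrative choices, not from the original post):

```python
import numpy as np

def q_update(Q, state, action, reward, next_state,
             alpha=0.8, gamma=0.95, step_penalty=0.01):
    # Hypothetical shaped reward: subtract a small cost on every step.
    # Negative rewards are valid in Q-learning; the agent simply learns
    # that longer paths accumulate more cost.
    shaped = reward - step_penalty
    Q[state, action] += alpha * (shaped + gamma * np.max(Q[next_state]) - Q[state, action])
    return Q

Q = np.zeros((16, 4))  # 4x4 FrozenLake: 16 states, 4 actions
Q = q_update(Q, state=0, action=2, reward=0.0, next_state=1)
print(Q[0, 2])  # the step cost alone drives this to 0.8 * -0.01 = -0.008
```

Note that shaping alone may not fix the wandering here, since with the default slippery dynamics the agent often cannot execute the short path it has learned.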

python reinforcement-learning openai-gym q-learning
1 Answer

0 votes

Isn't this an is_slippery problem? is_slippery defaults to True, which gives a 2/3 probability of moving in an unintended direction. See the description:

The lake is slippery (unless disabled), so the player may sometimes move perpendicular to the intended direction (see is_slippery).

You can turn this off by creating the environment with:

env = gym.make("FrozenLake-v1", render_mode='human', is_slippery=False)

instead of:

env = gym.make("FrozenLake-v1", render_mode='human')
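To see why is_slippery=True makes even a well-trained greedy policy look wasteful: per the Gymnasium docs, the executed move is the intended action or either perpendicular one, each with probability 1/3. A standalone sketch of those dynamics (the `slippery_action` helper is mine, not part of the Gymnasium API):

```python
import random

# FrozenLake action encoding
LEFT, DOWN, RIGHT, UP = 0, 1, 2, 3

def slippery_action(intended):
    # With is_slippery=True, the intended action and the two perpendicular
    # directions ((intended - 1) % 4 and (intended + 1) % 4) are each
    # executed with probability 1/3
    return random.choice([(intended - 1) % 4, intended, (intended + 1) % 4])

random.seed(0)
moves = [slippery_action(RIGHT) for _ in range(9000)]
print(moves.count(RIGHT) / len(moves))  # close to 1/3
```

So only about a third of the agent's steps go where it intended, which is why trajectories look long and aimless even when the Q-table itself is fine.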