rle-foundation / rlexplore


RLeXplore provides stable baselines of exploration methods in reinforcement learning, such as intrinsic curiosity module (ICM), random network distillation (RND) and rewarding impact-driven exploration (RIDE).

Home Page: https://docs.rllte.dev/

License: MIT License

Languages: Python 49.28%, Jupyter Notebook 50.72%
Topics: reinforcement-learning, efficient-algorithm, exploration-strategy, baselines, gym, machine-learning, pybullet, pytorch, robotics, toolbox

rlexplore's Introduction



RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning

RLeXplore is a unified, highly modularized, and plug-and-play toolkit that currently provides high-quality and reliable implementations of eight representative intrinsic reward algorithms. Comparing intrinsic reward algorithms used to be challenging due to various confounding factors, including distinct implementations, optimization strategies, and evaluation methodologies. RLeXplore is therefore designed to provide unified and standardized procedures for constructing, computing, and optimizing intrinsic reward modules.

The workflow of RLeXplore is illustrated as follows:

Installation

  • with pip (recommended)

Open a terminal and install rllte with pip:

conda create -n rllte python=3.8
conda activate rllte
pip install rllte-core
  • with git

Open a terminal and clone the repository from GitHub with git:

git clone https://github.com/RLE-Foundation/rllte.git
cd rllte
pip install -e .

Now you can import the intrinsic reward modules:

from rllte.xplore.reward import ICM, RIDE, ...
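
For instance, a reward module can be paired with one of RLLTE's built-in agents. The following minimal sketch is assembled from the usage shown in the issues further below; the extra keyword arguments to make_atari_env and the step count are illustrative assumptions, so check the documentation for exact signatures:

import torch as th

from rllte.agent import PPO
from rllte.env import make_atari_env
from rllte.xplore.reward import RND

if __name__ == "__main__":
    device = "cuda" if th.cuda.is_available() else "cpu"
    # Vectorized Atari environment provided by rllte (kwargs are assumptions)
    envs = make_atari_env("MontezumaRevenge-v5", device=device, num_envs=8)
    # Build the intrinsic reward module and attach it to the agent
    irs = RND(envs, device=device)
    agent = PPO(envs, device=device)
    agent.set(reward=irs)
    # Train with the intrinsic reward added to the extrinsic reward
    agent.train(5_000_000)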

Module List

Type                     | Modules
Count-based              | PseudoCounts, RND, E3B
Curiosity-driven         | ICM, Disagreement, RIDE
Memory-based             | NGU
Information theory-based | RE3

Tutorials

Click the following links to open the code notebooks:

  1. Quick Start
  2. RLeXplore with RLLTE
  3. RLeXplore with Stable-Baselines3
  4. RLeXplore with CleanRL
  5. Exploring Mixed Intrinsic Rewards
  6. Custom Intrinsic Rewards

Benchmark Results

  • RLLTE's PPO+RLeXplore on SuperMarioBros
  • CleanRL's PPO+RLeXplore's RND on Montezuma's Revenge

Cite Us

To cite this repository in publications:

@article{yuan_roger2024rlexplore,
  title={RLeXplore: Accelerating Research in Intrinsically-Motivated Reinforcement Learning},
  author={Yuan, Mingqi and Castanyer, Roger Creus and Li, Bo and Jin, Xin and Berseth, Glen and Zeng, Wenjun},
  journal={arXiv preprint arXiv:2405.19548},
  year={2024}
}

rlexplore's People

Contributors

cameronberg, yuanmingqi


rlexplore's Issues

RLeXplore with Stable-Baselines3 example issue

When I am running the following code:

import torch as th

from rllte.xplore.reward import RND
from rllte.env import make_mario_env
from rllte.agent import PPO, DDPG

if __name__ == '__main__':
    n_steps: int = 2048 * 16
    device = 'cuda' if th.cuda.is_available() else 'cpu'
    envs = make_mario_env('SuperMarioBros-1-1-v0', device=device, num_envs=1,
                          asynchronous=False, frame_stack=4, gray_scale=True)
    print(device, envs.observation_space, envs.action_space)
    # create the intrinsic reward module
    irs = RND(envs, device=device)
    # create the PPO agent
    agent = PPO(envs, device=device)
    # set the intrinsic reward module
    agent.set(reward=irs)
    # train the agent
    agent.train(n_steps * 153, eval_interval=n_steps // 8, save_interval=n_steps)

I receive the following error:

/opt/conda/lib/python3.10/site-packages/gym/envs/registration.py:555: UserWarning: WARN: The environment SuperMarioBros-1-1-v0 is out of date. You should consider upgrading to version `v3`.
  logger.warn(
/opt/conda/lib/python3.10/site-packages/gym/envs/registration.py:627: UserWarning: WARN: The environment creator metadata doesn't include `render_modes`, contains: ['render.modes', 'video.frames_per_second']
  logger.warn(
/opt/conda/lib/python3.10/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.metadata to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.metadata` for environment variables or `env.get_wrapper_attr('metadata')` that will search the reminding wrappers.
  logger.warn(
/opt/conda/lib/python3.10/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.single_observation_space to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.single_observation_space` for environment variables or `env.get_wrapper_attr('single_observation_space')` that will search the reminding wrappers.
  logger.warn(
/opt/conda/lib/python3.10/site-packages/gymnasium/core.py:311: UserWarning: WARN: env.single_action_space to get variables from other wrappers is deprecated and will be removed in v1.0, to get this variable you can do `env.unwrapped.single_action_space` for environment variables or `env.get_wrapper_attr('single_action_space')` that will search the reminding wrappers.
  logger.warn(
cuda Box(0, 255, (4, 84, 84), uint8) Discrete(7)
/opt/conda/lib/python3.10/site-packages/gym/utils/passive_env_checker.py:233: DeprecationWarning: `np.bool8` is a deprecated alias for `np.bool_`.  (Deprecated NumPy 1.24)
  if not isinstance(terminated, (bool, np.bool8)):
[05/24/2024 04:14:52 PM] - [INFO.] - Invoking RLLTE Engine...
[05/24/2024 04:14:52 PM] - [INFO.] - ================================================================================
[05/24/2024 04:14:52 PM] - [INFO.] - Tag               : default
[05/24/2024 04:14:52 PM] - [INFO.] - Device            : NVIDIA A100-SXM4-40GB
[05/24/2024 04:14:52 PM] - [DEBUG] - Agent             : PPO
[05/24/2024 04:14:52 PM] - [DEBUG] - Encoder           : MnihCnnEncoder
[05/24/2024 04:14:52 PM] - [DEBUG] - Policy            : OnPolicySharedActorCritic
[05/24/2024 04:14:52 PM] - [DEBUG] - Storage           : VanillaRolloutStorage
[05/24/2024 04:14:52 PM] - [DEBUG] - Distribution      : Categorical
[05/24/2024 04:14:52 PM] - [DEBUG] - Augmentation      : None
[05/24/2024 04:14:52 PM] - [DEBUG] - Intrinsic Reward  : RND
[05/24/2024 04:14:52 PM] - [DEBUG] - ================================================================================
Traceback (most recent call last):
  File "/workdir/got-it-memorized/src/run_rnd2.py", line 20, in <module>
    agent.train(n_steps * 153, eval_interval=n_steps // 8, save_interval=n_steps)
  File "/opt/conda/lib/python3.10/site-packages/rllte/common/prototype/on_policy_agent.py", line 105, in train
    obs, infos = self.env.reset(seed=self.seed)
  File "/opt/conda/lib/python3.10/site-packages/rllte/env/utils.py", line 152, in reset
    obs, infos = self.env.reset(seed=seed, options=options)
  File "/opt/conda/lib/python3.10/site-packages/gymnasium/wrappers/record_episode_statistics.py", line 78, in reset
    obs, info = super().reset(**kwargs)
  File "/opt/conda/lib/python3.10/site-packages/gymnasium/core.py", line 467, in reset
    return self.env.reset(seed=seed, options=options)
  File "/opt/conda/lib/python3.10/site-packages/gymnasium/vector/vector_env.py", line 140, in reset
    return self.reset_wait(seed=seed, options=options)
  File "/opt/conda/lib/python3.10/site-packages/gymnasium/vector/sync_vector_env.py", line 122, in reset_wait
    observation, info = env.reset(**kwargs)
ValueError: too many values to unpack (expected 2)
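
For context, this unpack error usually points to an old-Gym/Gymnasium API mismatch: the underlying gym_super_mario_bros environment's reset() returns only obs, while the Gymnasium vector wrapper expects (obs, info). The following is a hypothetical workaround sketch, assuming a Gymnasium version that still ships the EnvCompatibility wrapper; it is not the repository's recommended fix:

import gym_super_mario_bros
import gymnasium as gym

# Old-Gym API: reset() returns only obs
old_env = gym_super_mario_bros.make("SuperMarioBros-1-1-v0")
# Adapt to the new API: reset() returns (obs, info)
env = gym.wrappers.EnvCompatibility(old_env)

obs, info = env.reset(seed=0)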

NGU implementation

Hi, first of all, thank you for providing these implementations to the community.

I have a few questions about your NGU implementation. The original work uses two networks, a randomly initialized fixed network (as in RND) and an embedding network, to calculate the exploration rewards. The idea of the embedding network is to represent states in episodic memory and then use those embeddings to calculate intrinsic rewards. The embedding network is also trained each iteration on state-action pairs (s, a), with batches sampled from the replay buffer.

My questions are:

  • How does this implementation handle the episodic memory and the training of the embedding network? If I understood your implementation correctly, you assume that the buffer (either replay or rollout) is the episodic memory and use it to embed states.
  • While the embedding network is used to calculate intrinsic rewards, a predictor network is the one that is trained and used for the RND rewards. I didn't understand this part quite well. Can you elaborate on this point, please?
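
For reference, the two-network structure the question describes can be summarized as below. This is a simplified sketch loosely following the NGU paper (Badia et al., 2020), not this repository's implementation, and the small networks and memory are hypothetical stand-ins:

import torch
import torch.nn as nn

embedding = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 16))  # trained with inverse dynamics
rnd_target = nn.Linear(4, 16)     # randomly initialized and kept frozen
rnd_predictor = nn.Linear(4, 16)  # trained to match the frozen target

def ngu_reward(obs, episodic_memory, k=10, max_modulator=5.0):
    with torch.no_grad():
        # Episodic novelty: k-NN distance of the current embedding to the episodic memory
        e = embedding(obs)                        # shape (1, 16)
        dists = torch.cdist(e, episodic_memory)   # distances to stored embeddings
        knn = dists.topk(k, largest=False).values
        r_episodic = 1.0 / (knn.mean().sqrt() + 1e-3)
        # Lifelong novelty: RND prediction error used as a multiplicative modulator
        alpha = (rnd_target(obs) - rnd_predictor(obs)).pow(2).mean()
    return r_episodic * torch.clamp(1.0 + alpha, 1.0, max_modulator)

memory = torch.randn(50, 16)  # hypothetical episodic memory of past embeddings
obs = torch.rand(1, 4)        # hypothetical observation
print(ngu_reward(obs, memory))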

training performance of intrinsic module different from mlagents icm module

Hello,

I used the example code provided here: https://github.com/yuanmingqi/rl-exploration-baselines/blob/main/examples/ppo_re3_bullet.py
to compute intrinsic rewards with the ICM module. The results were different from what I got when running ML-Agents' ICM module. This was against a custom Unity game wrapped in a Gym wrapper. Can you please let me know what might be different, i.e., what I am missing? Thank you so much for the help. I can also share the code.

Is this project suitable for our own environment?

We want to use the exploration algorithms with our own RL algorithm in a new environment with a continuous action space. From our understanding of your code, there still seem to be some differences between our env and the env format you require.

So is there any documentation showing how to write an env in your required format, or the standard process for wrapping an env? An example would also be a great help to us! Thank you.
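
For reference, the rllte helpers used elsewhere on this page return Gymnasium-style vectorized environments, so a custom env generally needs to follow the Gymnasium API and then be vectorized. The sketch below is a hypothetical illustration of such an env with a continuous action space, not the repository's documented wrapping procedure; whether additional rllte-specific wrapping is needed should be verified against the docs:

import gymnasium as gym
import numpy as np
from gymnasium.vector import SyncVectorEnv

class MyEnv(gym.Env):
    # Toy continuous-action environment following the Gymnasium API
    def __init__(self):
        self.observation_space = gym.spaces.Box(-1.0, 1.0, shape=(8,), dtype=np.float32)
        self.action_space = gym.spaces.Box(-1.0, 1.0, shape=(2,), dtype=np.float32)

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        return self.observation_space.sample(), {}

    def step(self, action):
        obs = self.observation_space.sample()
        reward, terminated, truncated, info = 0.0, False, False, {}
        return obs, reward, terminated, truncated, info

envs = SyncVectorEnv([lambda: MyEnv() for _ in range(4)])
print(envs.single_observation_space, envs.single_action_space)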

Atari environment error

When executing the following code:

from stable_baselines3.common.env_util import make_vec_env, make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack

envs = make_atari_env("MontezumaRevenge-v4")
envs = VecFrameStack(envs, n_stack=4)

print(envs.observation_space, envs.action_space)

the output is:

Box(0, 255, (84, 84, 4), uint8) Discrete(18)

This indicates that the observation space is a Box with shape (84, 84, 4).

In contrast, when executing this code:

from rllte.env import make_atari_env

envs = make_atari_env("MontezumaRevenge-v5")
print(envs.observation_space, envs.action_space)

the output is:

Box(0, 255, (4, 84, 84), uint8) Discrete(18)

Here, the observation space has been transformed to a Box with shape (4, 84, 84).
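
This difference is just channel ordering: SB3's Atari helpers return channel-last (H, W, C) observations, while rllte's return channel-first (C, H, W), which is the layout PyTorch CNNs expect. If you need to feed the SB3-style env to a channel-first model, SB3's VecTransposeImage wrapper converts the layout; a small sketch under that assumption:

from stable_baselines3.common.env_util import make_atari_env
from stable_baselines3.common.vec_env import VecFrameStack, VecTransposeImage

envs = make_atari_env("MontezumaRevenge-v4")
envs = VecFrameStack(envs, n_stack=4)
# Transpose image observations from (84, 84, 4) to (4, 84, 84), matching rllte's layout
envs = VecTransposeImage(envs)

print(envs.observation_space, envs.action_space)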

Save and load models

Hello there,
Can we save the trained model in this example? And is it then possible to test the trained model in another environment? How would we do that? That way we could see the success and performance of the trained model more clearly. Could you help?
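
For reference, the snippets in the other issues on this page suggest one pattern: checkpoints are written during training via the save_interval argument, and a saved model is loaded for evaluation with agent.freeze(init_model_path=...) followed by agent.eval(...). The sketch below is assembled from those snippets; the checkpoint filename is hypothetical, and the exact save/load API should be verified against the rllte documentation:

import torch as th

from rllte.agent import PPO
from rllte.env import make_mario_env

device = "cuda" if th.cuda.is_available() else "cpu"
envs = make_mario_env("SuperMarioBros-1-1-v0", device=device, num_envs=1,
                      asynchronous=False, frame_stack=4, gray_scale=True)

# Training run: checkpoints are written every `save_interval` steps
agent = PPO(envs, device=device)
agent.train(1_000_000, save_interval=10_000)

# Later: load a saved checkpoint (hypothetical path) and evaluate it
eval_agent = PPO(envs, device=device)
eval_agent.freeze(init_model_path="pretrained_1507328.pth")
eval_agent.eval_env = envs
eval_agent.eval(3)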

RIDE Code issues

I noticed that there is no network training in the RIDE code: only two random networks are used directly for the encoding, which seems inconsistent with the original paper. Why is that?
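
For context, here is a sketch of the bonus the question refers to, as described in the RIDE paper (Raileanu and Rocktäschel, 2020), where the state embedding phi is trained with ICM-style forward and inverse dynamics losses. This is not this repository's implementation:

import torch

def ride_reward(phi_s, phi_next, episodic_visit_count):
    # Impact-driven bonus: change in the learned embedding between consecutive
    # states, discounted by the episodic visitation count of the next state
    impact = torch.norm(phi_next - phi_s, p=2, dim=-1)
    return impact / torch.sqrt(episodic_visit_count)

# Hypothetical example: embeddings for a batch of 4 transitions
phi_s = torch.randn(4, 16)
phi_next = torch.randn(4, 16)
counts = torch.tensor([1.0, 2.0, 5.0, 1.0])
print(ride_reward(phi_s, phi_next, counts))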

Deterministic behavior in evaluation

I am running the following code to evaluate the model I obtained:

import torch as th
import os
from rllte.env import make_mario_env
from rllte.agent import PPO, DDPG
import rllte

if __name__ == '__main__':
    n_steps: int = 2048 * 16
    device = 'cuda' if th.cuda.is_available() else 'cpu'
    envs = make_mario_env('SuperMarioBros-1-1-v0', device=device, num_envs=1,
                          asynchronous=False, frame_stack=4, gray_scale=True)
    print(device, envs.observation_space, envs.action_space)

    agent = PPO(envs,
                device=device,
                batch_size=512,
                n_epochs=10,
                num_steps=n_steps//8,
                pretraining=False)

    agent.freeze(init_model_path="pretrained_1507328.pth")
    agent.eval_env = envs
    agent.eval(3)

But checking Mario's x_pos at the end of each episode, I noticed that the algorithm behaves deterministically across all three evaluations, returning the same result. Is there a way to avoid this?

Intrinsic reward

Hello,

Thank you for the great code and implementation.
I am trying to use Pathak's intrinsic reward (ICM) together with Stable-Baselines on a custom env, and I would like to use your code. Can you please let me know what I need to do to accomplish this? And how do I know whether the intrinsic reward computation is correct?
Thank you!

code for paper

Can you provide some of the paper's code so the results can be reproduced?
I am having difficulties reproducing them.

Training in custom environments

Hi, thanks for the great repository. Is there functionality for testing on custom, user-defined envs? If so, how do we do it?

State of this project?

Hey,

I was just wondering what the state of this project is, since there has been new activity recently. I understand that it has been merged into RLLTE, which seems to be more of a replacement for stable-baselines3 than an addition to its ecosystem.
My particular interest is combining MaskablePPO with ICM, which this project would have been ideal for, since the other projects are each unfortunately missing one of the building blocks.

So is there any interest in continuing the development of exploration add-ons for stable-baselines3?

Correct usage with SB3 / Callbacks?

Hi, this looks like a really interesting set of algorithms. I wanted to try some out using the SB3 Zoo and was hoping for a plug-and-play approach. I wondered if I could integrate rlexplore using callbacks, so I came up with the following:

from stable_baselines3.common.callbacks import BaseCallback
from rlexplore import REVD
from stable_baselines3.common.on_policy_algorithm import OnPolicyAlgorithm
from stable_baselines3.common.off_policy_algorithm import OffPolicyAlgorithm

class RLeXploreCallback(BaseCallback):
    def __init__(self):
        super().__init__()
        self.explorer = None
        self.buffer = None
        pass

    def init_callback(self, model: "base_class.BaseAlgorithm") -> None:
        super().init_callback(model)
        env = self.training_env
        self.explorer = REVD(obs_shape=env.observation_space.shape, action_shape=env.action_space.shape, device=model.device, latent_dim=128, beta=1e-2, kappa=1e-5)

        if isinstance(self.model, OnPolicyAlgorithm):
            self.buffer = self.model.rollout_buffer
        elif isinstance(self.model, OffPolicyAlgorithm):
            self.buffer = self.model.replay_buffer
        pass

    def _on_rollout_end(self) -> None:
        intrinsic_rewards = self.explorer.compute_irs(
            rollouts={'observations': self.buffer.observations},
            time_steps=self.num_timesteps,
            k=3)
        self.buffer.rewards += intrinsic_rewards[:, :, 0]
        pass

    def _on_step(self) -> bool:
        # TODO maybe log to TensorBoard?
        return True

Then I include it in my list of callbacks and it seems to run. However, I'm still poking around without fully understanding what I'm doing (dangerous!), so does the above look correct? If it is, maybe it can be added as an example for others.

Second question: did I get this bit right: time_steps=self.num_timesteps?

Third question: the sample in the examples directory uses rollout_buffer, but is it also valid to use this with off-policy algorithms like DQN, switching to the replay_buffer instead?

The modules cannot be called!

Hello,
I tried to run the code that you provided in the example .ipynb file, but I got the following error:

AttributeError: module 'huggingface_hub.constants' has no attribute 'HF_HUB_CACHE'

A screenshot with the complete error is attached to the issue.
