
mbrl-lib's Introduction


MBRL-Lib

mbrl is a toolbox for facilitating development of Model-Based Reinforcement Learning algorithms. It provides easily interchangeable modeling and planning components, and a set of utility functions that allow writing model-based RL algorithms with only a few lines of code.

See also our companion paper.

Getting Started

Installation

Standard Installation

mbrl requires Python 3.8+ and PyTorch (>= 1.7). To install the latest stable version, run

pip install mbrl

Developer installation

If you are interested in modifying the library, clone the repository and set up a development environment as follows

git clone https://github.com/facebookresearch/mbrl-lib.git
cd mbrl-lib
pip install -e ".[dev]"

Then test the installation by running the following from the root folder of the repository:

python -m pytest tests/core
python -m pytest tests/algorithms

Basic example

As a starting point, check out our tutorial notebook on how to write the PETS algorithm (Chua et al., NeurIPS 2018) using our toolbox, and how to run it on a continuous version of the cartpole environment.

Provided algorithm implementations

MBRL-Lib provides implementations of popular MBRL algorithms as examples of how to use this library. You can find them in the mbrl/algorithms folder. Currently, we have implemented PETS, MBPO, and PlaNet, and we plan to keep increasing this list in the future.

The implementations rely on Hydra to handle configuration. You can see the configuration files in this folder. The overrides subfolder contains environment-specific configurations, overriding the default configurations with the best hyperparameter values we have found so far for each combination of algorithm and environment. You can run training by passing the desired override option via the command line. For example, to run MBPO on the Gymnasium version of HalfCheetah, you should call

python -m mbrl.examples.main algorithm=mbpo overrides=mbpo_halfcheetah 

By default, all algorithms will save results in a csv file called results.csv, inside a folder whose path looks like ./exp/mbpo/default/gym___HalfCheetah-v2/yyyy.mm.dd/hhmmss; you can change the root directory (./exp) by passing root_dir=path-to-your-dir, and the experiment sub-folder (default) by passing experiment=your-name. The logger will also save a file called model_train.csv with training information for the dynamics model.
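
Since the logs are plain csv files, they can be inspected with standard tools; for example (the path below is only an illustration of the pattern described above, and the column names match the results shown in the issues further down this page):

import pandas as pd

results_path = "./exp/mbpo/default/gym___HalfCheetah-v2/2021.05.01/120000/results.csv"  # hypothetical path following the pattern above
df = pd.read_csv(results_path)
print(df[["env_step", "episode_reward"]].tail())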

Beyond the override defaults, you can also change other configuration options, such as the type of dynamics model (e.g., dynamics_model=basic_ensemble), or the number of models in the ensemble (e.g., dynamics_model.model.ensemble_size=some-number). To learn more about all the available options, take a look at the provided configuration files.
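
For instance, several of these options can be combined in a single command (the particular values below are only an illustration):

python -m mbrl.examples.main algorithm=mbpo overrides=mbpo_halfcheetah dynamics_model=basic_ensemble dynamics_model.model.ensemble_size=7 root_dir=./my_exp_root experiment=ensemble_test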

Supported environments

Our example configurations are largely based on Mujoco, but our library components (and algorithms) are compatible with any environment that follows the standard Gymnasium syntax. You can try our utilities in other environments by creating your own entry script and Hydra configuration, using our default entry point main.py as a guiding template. See also the example override configurations.

Without any modifications, our provided main.py can be used to launch experiments with Mujoco, dm_control, and PyBullet environments.

You can test your Mujoco and PyBullet installations by running

python -m pytest tests/mujoco
python -m pytest tests/pybullet

To specify the environment to use for main.py, there are two possibilities:

  • Preferred way: Use a Hydra dictionary to specify arguments for your env constructor. See example.
  • Less flexible alternative: A single string with the following syntax:
    • mujoco-gym: "gym___<env-name>", where env-name is the name of the environment in Gymnasium (e.g., "HalfCheetah-v2").
    • dm_control: "dmcontrol___<domain>--<task>", where domain/task are defined as in DMControl (e.g., "cheetah--run").
    • pybullet-gym: "pybulletgym___<env-name>", where env-name is the name of the environment in pybullet gym (e.g., "HopperPyBulletEnv-v0").

Visualization and diagnostics tools

Our library also contains a set of diagnostics tools meant to facilitate development and debugging of models and controllers. With the exception of the CPU controller, which also supports PyBullet, these currently require a Mujoco installation, but we are planning to add support for other environments and extensions in the future. Currently, the following tools are provided:

  • Visualizer: Creates a video to qualitatively assess model predictions over a rolling horizon. Specifically, it runs a user-specified policy in a given environment, and at each time step computes the model's predicted observations/rewards over a lookahead horizon for the same policy. The predictions are plotted as line plots, one for each observation dimension (blue lines) and the reward (red line), along with the result of applying the same policy to the real environment (black lines). The model's uncertainty is visualized by plotting lines for the maximum and minimum predictions at each time step. The model and policy are specified by passing directories containing configuration files for each; they can be trained independently. The following gif shows an example of 200 steps of a pre-trained MBPO policy on the Inverted Pendulum environment.

    Example of Visualizer

  • DatasetEvaluator: Loads a pre-trained model and a dataset (can be loaded from separate directories), and computes predictions of the model for each output dimension. The evaluator then creates a scatter plot for each dimension comparing the ground truth output vs. the model's prediction. If the model is an ensemble, the plot shows the mean prediction as well as the individual predictions of each ensemble member.

    Example of DatasetEvaluator

  • FineTuner: Can be used to train a model on a dataset produced by a given agent/controller. The model and agent can be loaded from separate directories, and the fine-tuner will run the environment for some number of steps using actions obtained from the controller. The final model and dataset will then be saved under directory "model_dir/diagnostics/subdir", where subdir is provided by the user.

  • True Dynamics Multi-CPU Controller: This script can run a trajectory optimizer agent on the true environment using Python's multiprocessing. Each environment runs on its own CPU, which can significantly speed up costly sampling algorithms such as CEM. The controller will also save a video if the render argument is passed. Below is an example on HalfCheetah-v2 using CEM for trajectory optimization. To specify the environment, follow the single-string syntax described here.

    Control Half-Cheetah True Dynamics

  • TrainingBrowser: This script launches a lightweight training browser for plotting rewards obtained after training runs (as long as the runs use our logger). The browser allows aggregating multiple runs and displaying mean/std, and also lets the user save the image to hard drive. The legend and axes labels can be edited in the pane at the bottom left. Requires installing PyQt5. Thanks to a3ahmad for the contribution!

    Training Browser Example

Note that, except for the training browser, all the tools above require a Mujoco installation and are specific to models of type OneDimTransitionRewardModel. We are planning to extend this in the future; if you have useful suggestions, don't hesitate to raise an issue or submit a pull request!

Advanced Examples

MBRL-Lib can be used for many different research projects in the subject area. Below are some community-contributed examples:

Documentation

Please check out our documentation and don't hesitate to raise issues or contribute if anything is unclear!

License

mbrl is released under the MIT license. See LICENSE for additional details about it. See also our Terms of Use and Privacy Policy.

Citing

If you use this project in your research, please cite:

@Article{Pineda2021MBRL,
  author  = {Luis Pineda and Brandon Amos and Amy Zhang and Nathan O. Lambert and Roberto Calandra},
  journal = {Arxiv},
  title   = {MBRL-Lib: A Modular Library for Model-based Reinforcement Learning},
  year    = {2021},
  url     = {https://arxiv.org/abs/2104.10159},
}

mbrl-lib's People

Contributors

a3ahmad, boneyag, dtch1997, eugenevinitsky, franktiantt, freiberg-roman, gauravmm, hxri, jan1854, jarach-209, luisenp, marbaga, marcoke, matthiaskiller, odelalleau, raghavauppuluri13, robertocalandra, rohan138, shivakanthsujit, sradicwebster


mbrl-lib's Issues

[Feature Request] Support for Dreamers

🚀 Feature Request

Add support for Dreamer and Dreamerv2

Motivation

Dreamer and DreamerV2 are important and currently SOTA in MBRL. A few standard implementations are available for both Dreamer (rllib, author's implementation) and DreamerV2 (author's implementation). Some good implementations include https://github.com/RajGhugare19/dreamerv2, https://github.com/juliusfrost/dreamer-pytorch and https://github.com/jurgisp/pydreamer

Pitch

Both Dreamer and DreamerV2 can be built on top of the PlaNet implementation.

Implement iCEM

There is a newer version of CEM that uses colored noise, which is correlated in time. This paper describes the method, along with several additional improvements. Having this as a new agent (or a modified version of CEMAgent) would be a great addition.
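
For reference, the library already exposes a colored-noise sampler that such an agent could build on; the call below mirrors the snippet that appears in a traceback later on this page, while the shapes and exponent value are only an illustration:

import torch
import mbrl.util.math as math_util

population_size, horizon, action_dim = 350, 15, 6
beta = 2.0  # colored-noise exponent; larger values give smoother, more time-correlated samples
noise = math_util.powerlaw_psd_gaussian(
    beta, size=(population_size, action_dim, horizon), device=torch.device("cpu")
).transpose(1, 2)  # -> (population, horizon, action_dim)
print(noise.shape)  # torch.Size([350, 15, 6])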

[Feature Request] (partially) known dynamics

Hi,
Usually in practice we have some dynamics model that at least approximates the true environment x_{t+1} = f(x_t, u_t) (e.g., from Newton's laws). My question is how to include this a priori knowledge in the dynamics model? It seems like we could get a better starting point, faster learning, more stability, and more reasonable outputs.

Regards,

PETS Example Notebook Can't create Dynamics Model

The example notebook pets_example.ipynb fails at the line:
dynamics_model = common_util.create_one_dim_tr_model(cfg, obs_shape, act_shape)
in code cell number 8 in the notebook.

The error appears to come from dictconfig.py in omegaconf, which is used by the Hydra configuration handler.

ConfigAttributeError: Missing key model
full_key: dynamics_model.model
object_type=dict

From what I can tell, dictconfig.py is expecting the object defined as cfg_dict, which is defined in code cell number 7 in the notebook, to have a subkey called model within the main key dynamics_model.

I'm unsure how to force this subkey to be optional, or how to add an appropriate dict-type subkey within the cfg_dict object to avoid the error, but as written this example notebook isn't running past this point. I've tried this with hydra-core-1.0.3, which the current version of mbrl has as a dependency, and also with the most up-to-date version of Hydra, hydra-core-1.1.1, with the same result.

As this is the 'Basic Example' that is used to introduce the MBRL package, I think it's important to get it working.

[Bug] Issue with loading best weights in ModelTrainer

Hi, thanks for open-sourcing this library! It has really helped clarify some of the MBRL algorithms out there.

I have a specific question regarding this line

https://github.com/facebookresearch/mbrl-lib/blob/master/mbrl/models/model_trainer.py#L228

Maybe I am missing something, but I just want to clarify whether this is actually loading the best weights according to the validation loss, as expected? I am less familiar with PyTorch (I am mostly using JAX nowadays), but I was under the impression that to load model weights one calls module.load_state_dict(state_dict) instead of module.state_dict(state_dict). I checked the PyTorch documentation and it seems that the second function puts the weights into state_dict instead of loading them. See

https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.state_dict

If this is indeed a bug, I would be very interested in seeing some updated results for the PETS implementation. I don't think this makes much of a difference for environments like CartPole or Pendulum, but it might be for tasks such as HalfCheetah.

*Sorry for not adopting the issue template. I feel this is a minor issue and the full template does not apply.

[Feature Request] Add PyBullet implementations of envs

🚀 Feature Request

Add PyBullet-backed implementations of standard envs such as those described here: https://github.com/benelot/pybullet-gym

Motivation

Mujoco is paid software, whereas PyBullet is open source and of comparable quality. This would allow those without Mujoco licenses to have access to the codebase.

Pitch

Describe the solution you'd like
PyBullet implementations of standard environments are added to the repository.

Are you willing to open a pull request?
Yes

[Bug] GaussianMLP's ensemble propagation error

Steps to reproduce

  1. Run either 'python -m pytest tests/mujoco' or 'python -m pytest tests/algorithms'

Observed Results

mbrl/planning/trajectory_opt.py:340: in act
plan = self.optimizer.optimize(trajectory_eval_fn)
mbrl/planning/trajectory_opt.py:227: in optimize
callback=callback,
mbrl/planning/trajectory_opt.py:134: in optimize
values = obj_fun(population)
mbrl/planning/trajectory_opt.py:337: in trajectory_eval_fn
return self.trajectory_eval_fn(obs, action_sequences)
mbrl/planning/trajectory_opt.py:399: in trajectory_eval_fn
action_sequences, initial_state=initial_state, num_particles=num_particles
mbrl/models/model_env.py:164: in evaluate_action_sequences
self.reset(initial_obs_batch, return_as_np=False)
mbrl/models/model_env.py:85: in reset
self._current_obs = self.dynamics_model.reset(batch, rng=self._rng)
mbrl/models/one_dim_tr_model.py:295: in reset
return self.model.reset(obs, rng=rng)
mbrl/models/gaussian_mlp.py:357: in reset
self._propagation_indices = self._sample_propagation_indices(x.shape[0], rng)


self = GaussianMLP(
(hidden_layers): Sequential(
(0): Sequential(
(0): EnsembleLinearLayer(num_members=7, in_size... (1): SiLU()
)
)
(mean_and_logvar): EnsembleLinearLayer(num_members=7, in_size=200, out_size=36, bias=True)
), batch_size = 10000, _rng = <torch._C.Generator object at 0x7f7a70ce8d10>

def _sample_propagation_indices(
    self, batch_size: int, _rng: torch.Generator
) -> torch.Tensor:
    """Returns a random permutation of integers in [0, ``batch_size``)."""
    model_len = (
        len(self.elite_models) if self.elite_models is not None else len(self)
    )
    if batch_size % model_len != 0:
        raise ValueError(
          "To use GaussianMLP's ensemble propagation, the batch size must "
            "be a multiple of the number of models in the ensemble."
        )

E ValueError: To use GaussianMLP's ensemble propagation, the batch size must be a multiple of the number of models in the ensemble.

mbrl/models/gaussian_mlp.py:369: ValueError
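
For reference, the numbers in the traceback already show the mismatch (a quick check, nothing library-specific):

num_members, batch_size = 7, 10000  # values taken from the traceback above
print(batch_size % num_members)  # 4, so the batch is not a multiple of the ensemble size, hence the ValueError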

Expected Results

I expected the standard tests to pass cleanly


[Bug] Training with (n, 1) dimensional Box/MultiDiscrete action spaces throwing error

Steps to reproduce

  1. Wrote a gym environment using a MultiDiscrete action space
  2. Copied training code from the PETS example (threw error)
  3. Tried replacing MultiDiscrete action space with a similarly-shaped Box shape (threw the same error)

Observed Results

In the traceback below, I have a length 9 observation space and a length 2 action space; I believe the code might be concatenating the two together, but only a length 1 set of actions is being generated.

Traceback (most recent call last):
  File "train_swarm.py", line 170, in <module>
    env, obs, agent, {}, replay_buffer)
  File "/home/jennomai/miniconda3/envs/botnetenv/lib/python3.7/site-packages/mbrl/util/common.py", line 570, in step_env_and_add_to_buffer
    action = agent.act(obs, **agent_kwargs)
  File "/home/jennomai/miniconda3/envs/botnetenv/lib/python3.7/site-packages/mbrl/planning/trajectory_opt.py", line 650, in act
    trajectory_eval_fn, callback=optimizer_callback
  File "/home/jennomai/miniconda3/envs/botnetenv/lib/python3.7/site-packages/mbrl/planning/trajectory_opt.py", line 526, in optimize
    callback=callback,
  File "/home/jennomai/miniconda3/envs/botnetenv/lib/python3.7/site-packages/mbrl/planning/trajectory_opt.py", line 134, in optimize
    values = obj_fun(population)
  File "/home/jennomai/miniconda3/envs/botnetenv/lib/python3.7/site-packages/mbrl/planning/trajectory_opt.py", line 646, in trajectory_eval_fn
    return self.trajectory_eval_fn(obs, action_sequences)
  File "/home/jennomai/miniconda3/envs/botnetenv/lib/python3.7/site-packages/mbrl/planning/trajectory_opt.py", line 710, in trajectory_eval_fn
    action_sequences, initial_state=initial_state, num_particles=num_particles
  File "/home/jennomai/miniconda3/envs/botnetenv/lib/python3.7/site-packages/mbrl/models/model_env.py", line 173, in evaluate_action_sequences
    _, rewards, dones, _ = self.step(action_batch, sample=True)
  File "/home/jennomai/miniconda3/envs/botnetenv/lib/python3.7/site-packages/mbrl/models/model_env.py", line 119, in step
    rng=self._rng,
  File "/home/jennomai/miniconda3/envs/botnetenv/lib/python3.7/site-packages/mbrl/models/one_dim_tr_model.py", line 289, in sample
    model_in = self._get_model_input_from_tensors(obs, actions)
  File "/home/jennomai/miniconda3/envs/botnetenv/lib/python3.7/site-packages/mbrl/models/one_dim_tr_model.py", line 128, in _get_model_input_from_tensors
    model_in = self.input_normalizer.normalize(model_in).float()
  File "/home/jennomai/miniconda3/envs/botnetenv/lib/python3.7/site-packages/mbrl/util/math.py", line 144, in normalize
    return (val - self.mean) / self.std
RuntimeError: The size of tensor a (10) must match the size of tensor b (11) at non-singleton dimension 1

Expected Results

This runtime error shouldn't be thrown.

Relevant Code

The gym environment I'm using is very messy right now but can be found here, and the corresponding training code is here.
However, the code depends heavily on the Botnet simulator, so it may be easier to try to replicate using the MultiAgentEnv here?

Add tests covering scripts in diagnostics folder

Currently there is no way to automatically tell if major changes are affecting diagnostics tool w/o running the scripts (which requires firing a training job, getting the config file, running, etc.). This should be automated to avoid problematic development cycles.

[Bug] Algorithm tests fail in Pytorch 1.7.1

Steps to reproduce

  1. Install this library with Python 3.7 and PyTorch 1.7.1
  2. Run the test python3 -m pytest tests/algorithms

Observed Results

You get the errors

FAILED tests/algorithms/test_algorithms.py::test_pets_icem_gaussian_mlp_ensemble - TypeError: clamp(): argument 'min' (position 2) must be Number, not Tensor
FAILED tests/algorithms/test_algorithms.py::test_pets_icem_basic_ensemble_deterministic_mlp - TypeError: clamp(): argument 'min' (position 2) must be Number, not Tensor

because torch.clamp doesn't support tensors for the min and max arguments until version 1.9.0. So either the README should be updated to require the newer PyTorch version, or the test/code should be fixed to keep working with 1.7.x.

Expected Results

Tests pass successfully.

Relevant Code

The error occurs at

population = mbrl.util.math.powerlaw_psd_gaussian(
    self.colored_noise_exponent,
    size=(decay_population_size, x0.shape[1], x0.shape[0]),
    device=self.device,
).transpose(1, 2)
population = torch.clamp(
    population * torch.sqrt(var) + mu, self.lower_bound, self.upper_bound
)

E           TypeError: clamp(): argument 'min' (position 2) must be Number, not Tensor

mbrl/planning/trajectory_opt.py:439: TypeError
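
A possible workaround that stays compatible with PyTorch 1.7 is to replace the tensor-argument clamp with a pair of element-wise torch.max/torch.min calls. This is only a standalone sketch of the idea, not necessarily the fix adopted upstream:

import torch

population = torch.randn(4, 5)
mu, var = torch.zeros(5), torch.ones(5)
lower_bound, upper_bound = -torch.ones(5), torch.ones(5)
# element-wise max/min are equivalent to torch.clamp with tensor bounds, but work on PyTorch < 1.9
population = torch.min(torch.max(population * torch.sqrt(var) + mu, lower_bound), upper_bound)
print(population.min().item() >= -1.0, population.max().item() <= 1.0)  # True True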

[Bug] ReplayBuffer.sample_trajectory() returns transitions from two different trajectories

Steps to reproduce

import random

import numpy as np
from mbrl.util import ReplayBuffer

random.seed(0)
replay = ReplayBuffer(4, (1,), (1,), max_trajectory_length=2)

# Add the trajectory 1 --> 2 --> 3
replay.add(np.array([1]), np.array([1]), np.array([2]), 1, False)
replay.add(np.array([2]), np.array([1]), np.array([3]), 1, True)
# Add the trajectory 11 --> 12 --> 13
replay.add(np.array([11]), np.array([1]), np.array([12]), 1, False)
replay.add(np.array([12]), np.array([1]), np.array([13]), 1, True)
# Add the incomplete trajectory 21 --> 22
replay.add(np.array([21]), np.array([1]), np.array([22]), 1, False)


print(replay.sample_trajectory().obs)

Observed Results

The script above returns the state sequence [[21], [2]]. These states do not belong to the same trajectory. The 21 comes from the third (unfinished) trajectory that was fed into the buffer and the 2 comes from the first trajectory.

Expected Results

The state sequence [[11],[12]] should be returned every time.

The first four add() commands add two complete trajectories to the replay buffer: 1 --> 2 --> 3 and 11 --> 12 --> 13. Since the buffer has a capacity of 4, the fifth add() overwrites the first transition of the first trajectory. Hence, the only valid and complete trajectory in the buffer is 11 --> 12 --> 13.

[Bug] ModelTrainer.maybe_get_best_weights() does not deal properly with negative evaluation scores

improvement = (best_val_score - val_score) / best_val_score

The above calculation of the relative improvement of the evaluation score in ModelTrainer seems to be wrong for negative evaluation scores. This can be fixed by adding a torch.abs() around the divisor.
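
A minimal numerical illustration of the proposed fix (a sketch, not necessarily the exact change in the library):

import torch

best_val_score = torch.tensor(-1.0)  # previous evaluation value
val_score = torch.tensor(-10.0)      # current evaluation value
improvement = (best_val_score - val_score) / torch.abs(best_val_score)  # abs() guards the divisor's sign
print(improvement)  # tensor(9.), i.e. the 900% improvement described below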

Steps to reproduce

import torch
from mbrl.models import ModelTrainer, GaussianMLP

dummy = GaussianMLP(1, 1, "cpu")
model_trainer = ModelTrainer(dummy)
previous_eval_value = torch.tensor(-1.0)
current_eval_value = torch.tensor(-10.0)
print(model_trainer.maybe_get_best_weights(previous_eval_value, current_eval_value))

Observed Results

model_trainer.maybe_get_best_weights() returns None, which should indicate that the evaluation value did not improve from previous_eval_value to current_eval_value.

Expected Results

The relative improvement from previous_eval_value to current_eval_value is 900%. Thus, model_trainer.maybe_get_best_weights() should return the parameters of the model, which would indicate that the evaluation value improved.

[Feature Request] Rich Logging Option

🚀 Feature Request

A configuration option to save all data for every trial.

Motivation

When debugging it can be very useful to see the intricate details of how different trials played out, without having to re-run computationally intensive experiments.

Is your feature request related to a problem? Please describe.
The replay buffer class is very efficient and elegant for running the algorithms, but saving more information in a more structured manner could be useful.

Pitch

Describe the solution you'd like
Core things to be logged in this mode:

  • raw trajectory data of each trial,
  • dynamics model of each trial,
  • any optimizer parameters of each trial that may change.

Advanced things to be logged in this mode:

  • planned action sequences from steps in a trial,
  • some sort of snapshot of the training and validation data at each trial.

Describe alternatives you've considered
The trajectory buffer is a step towards this direction.


Using Wrapper Class for Custom GYM Env

I have a custom OpenAI Gym env and I am trying to use the mbrl wrapper, but I am getting the error name 'model_env_args' is not defined. I am trying to follow the example here: https://arxiv.org/pdf/2104.10159.pdf. Here's my code.

import gym
import numpy as np

import mbrl.models as models

net = models.GaussianMLP(in_size=14, out_size=12, device="cpu")
wrapper = models.OneDTransitionRewardModel(net, target_is_delta=True, learned_rewards=True)
model_env = models.ModelEnv(wrapper, *model_env_args, term_fn=hopper)
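
One way to make the last line well-formed (a sketch only; it assumes a call like the one used in the PETS example code further down this page, which passes the environment, the wrapped model, a termination function, and a torch generator):

import torch
import mbrl.env.termination_fns as termination_fns

generator = torch.Generator(device="cpu")
# env is the custom gym environment from the issue; the positional order mirrors the
# ModelEnv(env, dynamics_model, term_fn, generator=generator) call in the PETS example code below
model_env = models.ModelEnv(env, wrapper, termination_fns.hopper, generator=generator)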

[Feature Request] Logging of custom training metrics

🚀 Feature Request

When training a model with ModelTrainer, it would be nice to be able to log some custom metrics (ideally in tensorboard), defined by the model (e.g., the values of the individual loss terms if the loss of the model is a sum of multiple terms). Right now one can only access the overall loss of the model.

Motivation

Is your feature request related to a problem? Please describe.

At the moment I am working on a model that optimizes a sum of reconstruction loss, reward prediction loss, and a kl divergence term. For debugging purposes it would be nice to monitor how the individual losses evolve over time.
This logging can not be done by the model class on its own since it needs some information from the RL algorithm (e.g. the current iteration of the algorithm / number of samples drawn from the environment) for the logged values to be meaningful.

Pitch

Describe the solution you'd like

The simplest solution certainly is to just allow passing kwargs to ModelTrainer.train(), which are passed through to Model.update(). This would allow passing a custom logging function/object that then logs values provided by the model implementation.
This is of course not the most elegant solution, but the kwargs could also be used for other purposes (e.g. passing some additional information to Model.update() if a model implementation requires this).

Describe alternatives you've considered

An alternative to this would be to let Model.update() return a dictionary of metrics in addition to the loss. This dictionary could then be returned by ModelTrainer.train() or it could be processed by the callback passed to the function.
This would of course cause breaking changes since the method signature of Model would need to be changed.

Are you willing to open a pull request? (See CONTRIBUTING)
Yes

[Bug] MBPO loss explode on humanoid and walker (sometimes)

Steps to reproduce

  1. Install the latest mbrl-lib with Python 3.7, PyTorch 1.7.1, and mujoco 2.0.2.13
  2. On humanoid: python -m mbrl.examples.main algorithm=mbpo overrides=mbpo_humanoid dynamics_model.hid_size=400

On walker (sometimes): python -m mbrl.examples.main algorithm=mbpo overrides=mbpo_walker

Observed Results

  • on humanoid:

I changed some configs, but the loss explosion on humanoid always happens, just like what you said in your paper :-(
| results | S: 999 | R: 305.9237 | E: 0 | RL: 1
| train | S: 1000 | BR: 4.7887 | ALOSS: -6.8791 | CLOSS: 2.3326 | TLOSS: 2.2119 | TVAL: 0.1731 | AENT: 10.7918
| train | S: 2000 | BR: 4.7896 | ALOSS: -8.1911 | CLOSS: 6.5361 | TLOSS: 1.1089 | TVAL: 0.1353 | AENT: 6.0317
| train | S: 3000 | BR: 4.7901 | ALOSS: -28.7093 | CLOSS: 235.4853 | TLOSS: -1.3383 | TVAL: 0.1363 | AENT: -11.1760
| train | S: 4000 | BR: 4.7891 | ALOSS: -171.7771 | CLOSS: 10692.9198 | TLOSS: -7.0062 | TVAL: 0.2216 | AENT: -33.1193
| train | S: 5000 | BR: 4.7898 | ALOSS: -778.7793 | CLOSS: 251870.5247 | TLOSS: -16.1013 | TVAL: 0.3613 | AENT: -45.4896
| train | S: 6000 | BR: 4.7901 | ALOSS: -2556.9596 | CLOSS: 3338786.4839 | TLOSS: -31.5729 | TVAL: 0.5803 | AENT: -56.1798
| train | S: 7000 | BR: 4.7895 | ALOSS: -7508.6594 | CLOSS: 26147583.6990 | TLOSS: -62.3435 | TVAL: 0.9089 | AENT: -69.8062

  • on walker: (not always)
    | train | S: 2061000 | BR: 0.9034 | ALOSS: -57449213.0880 | CLOSS: 30649331655639.0391 | TLOSS: 34.0375 | TVAL: 92662.5002 | AENT: -3.0001
    | train | S: 2062000 | BR: 0.9086 | ALOSS: -57937056.9280 | CLOSS: 30568978606718.9766 | TLOSS: -794.3958 | TVAL: 95031.0067 | AENT: -3.0092
    | train | S: 2063000 | BR: 0.9088 | ALOSS: -58598286.2720 | CLOSS: 30982745023840.2578 | TLOSS: -1195.4651 | TVAL: 98324.6478 | AENT: -3.0133
    | train | S: 2064000 | BR: 0.9027 | ALOSS: -58989812.7840 | CLOSS: 32028101398495.2305 | TLOSS: 325.0210 | TVAL: 99747.0493 | AENT: -2.9972
    | train | S: 2065000 | BR: 0.9084 | ALOSS: -59708167.0400 | CLOSS: 33024967268368.3828 | TLOSS: 47.7344 | TVAL: 99133.9851 | AENT: -3.0003

Expected Results

Just like the result on cheetah...

Add unit tests for util.py

Most of the functions in util.py are missing unit tests, and they provide a lot of the library's functionality, so they should be tested carefully.

pets_example.ipynb problem

I ran pets_example.ipynb and got the following errors:

I am not sure if it's a package compatibility problem, so I am not sure whether the following errors are bugs or not.
python: 3.7.10
numpy: 1.20.1
matplotlib: 3.4.2
torch:1.7.1 py3.7_cuda10.1.243_cudnn7.6.3_0

TypeError: normal() received an invalid combination of arguments, when running the main loop.
I found that the model_env arg 'rng' is np.random.default_rng(seed=0), not the torch generator that torch.normal expects.

# Create a gym-like environment to encapsulate the model
#model_env = models.ModelEnv(env, dynamics_model, term_fn, reward_fn, rng)

TypeError: can't convert cuda:0 device type tensor to numpy, when running the plot part.
When the GPU is on, the val_score tensor is tensor(0.0023, device='cuda:0'), which causes the error in the plot part.

def train_callback(_model, _total_calls, _epoch, tr_loss, val_score, _best_val):
   train_losses.append(tr_loss)
   #val_scores.append(val_score.mean())   # this returns val score per ensemble model

[Feature Request] cartpole_continuous.py is not the standard continuous cartpole

🚀 Feature Request

The cartpole system implemented in cartpole_continuous is not the standard cartpole used in most of the control literature. We should also implement the standard cartpole.

Motivation

Is your feature request related to a problem? Please describe.
The current cartpole_continuous uses discrete actions, and a reward function which is 1 within the given angle cone.
The traditional control cartpole instead uses continuous actions, and as a reward function the Euclidean distance between the tip of the pendulum and a given point upright. Moreover, the cartpole starts from a down-pointing position. Overall this configuration is a harder problem, and more commonly used in the control community, compared to the one currently implemented.

Pitch

Describe the solution you'd like
I suggest renaming the current cartpole_continuous to cartpole_discrete, and instead create a new cartpole_continuous following the implementation by Chua et al. (https://github.com/kchua/handful-of-trials/blob/master/dmbrl/env/cartpole.py)

[Feature Request] Implement PlaNet

🚀 Feature Request

Implement the PlaNet algorithm by Hafner et al. (ICML 2019).

Motivation

PlaNet is a strong MBRL method for visual input and it's a common request when discussing the library with people.


[Bug] `__len__` method of Model should always return non-negative integer

Steps to reproduce

from mbrl.models import Model
class MyModel(Model):
    def eval_score(self, *args, **kwargs):
        return
    def loss(self, *args, **kwargs):
        return

my_model = MyModel()
len(my_model)

Observed Results

TypeError: 'NoneType' object cannot be interpreted as an integer

Expected Results

Here is the __len__ implementation for Model:

def __len__(self):
    return None

The intended behavior seems to be to return None by default, but that is not valid in Python; see https://docs.python.org/3/reference/datamodel.html#object.__len__

[Bug] Hopper-v2 crashing using MBPO

Hi, I encountered a problem running Hopper-v2 with MBPO.

Steps to reproduce

run python -m mbrl.examples.main algorithm=mbpo overrides=mbpo_hopper device="cpu"

Observed Results

The code crashed, with the following error message:
File "/home/user/mbrl-lib/mbrl/env/termination_fns.py", line 19, in hopper *(next_obs[:, 1:] < 100).abs().all(-1)
RuntimeError: "abs_cpu" not implemented for 'Bool' Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.

I have tried to correct that bug by using np.abs() instead of .abs(), but then I ran into new error messages.
Thanks for considering my issue!
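
For reference, a minimal demonstration of one possible workaround for the Bool-tensor abs() error (a sketch only, not necessarily the fix that was merged): take the absolute value of the float observations before comparing, so abs() is never called on a Bool tensor.

import torch

next_obs = torch.randn(8, 12)  # hypothetical batch of Hopper observations
# (next_obs[:, 1:] < 100).abs() calls abs() on a Bool tensor, which older CPU builds reject;
# comparing absolute values instead keeps the result boolean without ever calling abs() on it
healthy = (next_obs[:, 1:].abs() < 100).all(-1)
print(healthy)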

[Question] Model Interface: forward arguments

I am not sure how to use the forward function of the Model interface in my subclass, as the interface only accepts a single input argument. If I want to have a control input, should I 1) concatenate the obs and action tensors, 2) use the TransitionBatch class, or 3) introduce another function argument with a default value (such as action=None)? What would be the preferred alternative for you?

Thanks!
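
For context, the bundled one-dimensional transition models effectively take a single concatenated observation-action input (the example code later on this page builds GaussianMLP with an input size equal to observation size plus action size); a minimal sketch of option 1:

import torch

obs = torch.randn(32, 11)  # hypothetical batch of observations
act = torch.randn(32, 3)   # hypothetical batch of actions
model_in = torch.cat([obs, act], dim=-1)  # option 1: a single input tensor for forward()
print(model_in.shape)  # torch.Size([32, 14])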

Difference in PETS implementation from the original TF version.

This follows from the conversation in #98. I have noticed some discrepancies between the TF and mbrl-lib implementations of PETS.

Difference in normalization.

https://github.com/kchua/handful-of-trials/blob/master/dmbrl/modeling/utils/TensorStandardScaler.py#L45

In the original version, the normalization is guarded against observation dimensions with small stddev by setting the dimensions with small stddev to 1. This prevents the normalized inputs from exploding when the stddev is small. This happens in environments such as Reacher or Pusher where some observation dimensions consist of goals. In that situation, it seems that the goal is never changing during an episode and the stddev will be 0. Hence setting the small stddev to be 1.0 would be helpful in that case.

Another very subtle thing happening in the above code is that the normalization is performed with NumPy instead of in TF, and I think the inputs here are in float64. In that case, the stddev computation is more accurate than those in float32, so the threshold 1e-12 is sensible. Using PyTorch to perform normalization, for example, would require changes to the threshold. I think some values like 1e-5 would be more appropriate in that case (not backed up by any numerical analysis).

Difference in activation function

The original implementation uses the swish activation function whereas in mbrl-lib we use silu. I am confused about the choice of silu in mbrl-lib and would love to know more about the difference in empirical performance.

Difference in CEM stopping criteria

In the TF implementation, the CEM optimizer uses an additional termination criterion on the variance:
https://github.com/kchua/handful-of-trials/blob/77fd8802cc30b7683f0227c90527b5414c0df34c/dmbrl/misc/optimizers/cem.py#L71
I doubt that criterion is ever satisfied during training but I am mentioning this here for completeness.

Difference in optimizer weight decay

The original TF implementation uses a carefully selected set of weight decays for different layers of the dynamics model whereas the decay in mbrl-lib is the same for all layers. However, the original implementation does not add weight decays on the biases. See

https://github.com/kchua/handful-of-trials/blob/master/dmbrl/modeling/layers/FC.py#L219

In PyTorch, the default Adam will add weight decay on all parameters. That also means that they are added to the max_logvar and min_logvar whereas in the TF version the only regularization on the max/min-logvars is through the var_loss.

Maybe a side note, have the authors tried using AdamW instead of Adam for the weight decays? I recently learned that naive weight decay in Adam does not behave as you may expect. See https://arxiv.org/abs/1711.05101

Difference in optimizer parameters

The default epsilon in TensorFlow's Adam is 1e-7, https://www.tensorflow.org/api_docs/python/tf/keras/optimizers/Adam
Scratch this, they are 1e-8 in TF 1 https://www.tensorflow.org/versions/r1.15/api_docs/python/tf/keras/optimizers/Adam.

Anyway, I am mentioning these here after a thorough look at both mbrl-lib and TF PETS to debug my own JAX implementation. It turns out my mistake was in the MPC code. I hope these notes are useful, since the authors mention that the current implementation does not get good performance on Half-Cheetah. Maybe it's because of one of these details; if not, fingers crossed the difference can be spotted by someone else :D

Reward setup of the HalfCheetah environment

Hello! Firstly, thanks for the awesome codebase :)

I have a very simple question regarding the way the halfcheetah environment is set up. Clearly, you seem to have just ported the implementation of the environment in the original PETS repo, but I wanted to ask here just in case you have an idea of what's going on.

So, when you compute the reward, you use the following equations

reward_ctrl = -0.1 * np.square(action).sum(axis=action.ndim - 1)
reward_run = next_ob[..., 0] - 0.0 * np.square(next_ob[..., 2])
reward = reward_run + reward_ctrl

In the second line of the above snippet, the second term that is being subtracted will always have the value of zero. I wonder what's then the purpose of having the term in the first place.

Thanks!

[Bug] Pets CartPoleEnv preprocess_fn changes the dimensions of an observation.

Steps to reproduce

  1. Run learning tasks with the PETS cartpole Mujoco environment, with the preprocess function preprocessing model data

Observed Results

  • What happened?

File "/Users/MarkSelden/Work/Research/mbrl-lib/mbrl/util/math.py line 123, in update_stats
assert data.ndim == 2 and data.shape[1] == self.mean.shape[1]
AssertionError

Expected Results

  • What did you expect to happen?
    I expected the code to run without throwing an error.

Relevant Code


import hydra
import numpy as np
import omegaconf
import torch
import mbrl.env.mujoco_envs
import mbrl.env.termination_fns
import mbrl.env.reward_fns
import mbrl.algorithms.pets as pets


@hydra.main(config_path="exp/conf", config_name="main") #This config File is specific to my local machine. 
def run(cfg: omegaconf.DictConfig):
  env = mbrl.env.mujoco_envs.CartPoleEnv()
  term_fn = mbrl.env.termination_fns.no_termination
  reward_fn = mbrl.env.reward_fns.cartpole_pets
  cfg.device = 'cuda:0' if torch.cuda.is_available() else 'cpu'
  np.random.seed(cfg.seed)
  torch.manual_seed(cfg.seed)

  return pets.train(env, term_fn, reward_fn, cfg)


if __name__ == "__main__":
  run()

[Feature Request] Add batch support to CEM

🚀 Feature Request

Add a batch dimension to all operations in CEM so that it can optimize several problems at once.

Motivation

The current implementation of our CEM optimizer only supports one optimization problem, which means that, for example, it cannot be used to compute action for several observations in parallel, which in turn slows down model evaluation.

Is your feature request related to a problem? Please describe.

Pitch

Add batch support to all operations involved in CEM, so that the return is an optimal value for each element in the batch.

[Bug] Normalization of model inputs causes PETS to crash on Pybullet Half Cheetah

Steps to reproduce

  1. Run the PETS algorithm from the example notebook with the PyBullet HalfCheetah env instead of the cartpole env, using the code I put at the bottom.
  2. Alternatively, set normalize: true in the algorithm parameters for any environment with an observation dimension that doesn't change before normalization is attempted.

Observed Results

When running the code below I get the following Traceback and error:

Traceback (most recent call last):
File "/Users/markselden/learning/mbrl-lib/tests/algorithms/test_halfcheetah_normalizer.py", line 161, in
next_obs, reward, done, _ = common_util.step_env_and_add_to_buffer(env, obs, agent, {}, replay_buffer)
File "/Users/markselden/learning/mbrl-lib/mbrl/util/common.py", line 398, in step_env_and_add_to_buffer
action = agent.act(obs, **agent_kwargs)
File "/Users/markselden/learning/mbrl-lib/mbrl/planning/trajectory_opt.py", line 340, in act
plan = self.optimizer.optimize(trajectory_eval_fn)
File "/Users/markselden/learning/mbrl-lib/mbrl/planning/trajectory_opt.py", line 224, in optimize
best_solution = self.optimizer.optimize(
File "/Users/markselden/learning/mbrl-lib/mbrl/planning/trajectory_opt.py", line 134, in optimize
values = obj_fun(population)
File "/Users/markselden/learning/mbrl-lib/mbrl/planning/trajectory_opt.py", line 337, in trajectory_eval_fn
return self.trajectory_eval_fn(obs, action_sequences)
File "/Users/markselden/learning/mbrl-lib/mbrl/planning/trajectory_opt.py", line 398, in trajectory_eval_fn
return model_env.evaluate_action_sequences(
File "/Users/markselden/learning/mbrl-lib/mbrl/models/model_env.py", line 173, in evaluate_action_sequences
_, rewards, dones, _ = self.step(action_batch, sample=True)
File "/Users/markselden/learning/mbrl-lib/mbrl/models/model_env.py", line 116, in step
next_observs, pred_rewards = self.dynamics_model.sample(
File "/Users/markselden/learning/mbrl-lib/mbrl/models/one_dim_tr_model.py", line 274, in sample
preds = self.model.sample(model_in, rng=rng, deterministic=deterministic)[0]
File "/Users/markselden/learning/mbrl-lib/mbrl/models/model.py", line 319, in sample
return (torch.normal(means, stds, generator=rng),)
RuntimeError: normal expects all elements of std >= 0.0

After debugging, I have located the source of the error in line 138 of util/math.py (linked here). The normalize function divides the model input tensor by the std tensor. For input features which the model has not yet seen change, the std is 0. Dividing by 0 places NaNs in the normalized model input tensor, which the model then cannot handle.
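
A minimal sketch of the guard described in the PETS-differences notes earlier on this page (replace effectively constant standard deviations with 1 before dividing; the threshold and the exact fix adopted by the library may differ):

import torch

eps = 1e-5  # threshold for "effectively constant" features; the TF code uses 1e-12 in float64
mean = torch.tensor([1.0, 0.0, -1.0])
std = torch.tensor([0.0, 2.0, 0.5])  # the first feature never changed in the data
val = torch.tensor([1.0, 4.0, 0.0])
safe_std = torch.where(std < eps, torch.ones_like(std), std)  # constant dims get std = 1
print((val - mean) / safe_std)  # tensor([0., 2., 2.]), no NaNs for the constant feature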

Expected Results

The PETS algorithm to run smoothly.

Relevant Code

This is the code I am running, which produces the error; it is mostly pulled from pets_example.ipynb:

import matplotlib as mpl
import numpy as np
import torch
import omegaconf


import mbrl.env.termination_fns as termination_fns
import mbrl.models as models
import mbrl.planning as planning
import mbrl.util.common as common_util
import mbrl.util as util
import pybullet_envs
import gym
print('test')

mpl.rcParams.update({"font.size": 16})

device = 'cuda:0' if torch.cuda.is_available() else 'cpu'

seed = 16
env = gym.make('HalfCheetahBulletEnv-v0')
env.seed(seed)
rng = np.random.default_rng(seed=0)
#I believe the generator is used for random sampling
generator = torch.Generator(device=device)
generator.manual_seed(seed)
obs_shape = env.observation_space.shape
act_shape = env.action_space.shape

# This functions allows the model to evaluate the true rewards given an observation
# This function allows the model to know if an observation should make the episode end
term_fn = termination_fns.no_termination

trial_length = 200
num_trials = 10
ensemble_size = 5

# Everything with "???" indicates an option with a missing value.
# Our utility functions will fill in these details using the
# environment information
cfg_dict = {
  # dynamics model configuration
  "dynamics_model": {
      "model": {
          "_target_": "mbrl.models.GaussianMLP",
          "device": device,
          "num_layers": 3,
          "ensemble_size": ensemble_size,
          "hid_size": 200,
          "use_silu": True,
          "in_size": "???",
          "out_size": "???",
          "deterministic": False,
          "propagation_method": "fixed_model"
      }
  },
  # options for training the dynamics model
  "algorithm": {
      "learned_rewards": True,
      "target_is_delta": True,
      "normalize": True,
  },
  # these are experiment specific options
  "overrides": {
      "trial_length": trial_length,
      "num_steps": num_trials * trial_length,
      "model_batch_size": 32,
      "validation_ratio": 0.05
  }
}
cfg = omegaconf.OmegaConf.create(cfg_dict)

# Create a dynamics model for this environment
#dynamics_model = models.GaussianMLP(obs_shape + act_shape, obs_shape + 1)
dynamics_model = util.common.create_one_dim_tr_model(cfg, obs_shape, act_shape)

replay_buffer = util.common.create_replay_buffer(
  cfg, obs_shape, act_shape, rng=rng
)

util.common.rollout_agent_trajectories(
  env,
  50,
  planning.RandomAgent(env),
  {},
  replay_buffer=replay_buffer,
)

# Create a gym-like environment to encapsulate the model
model_env = models.ModelEnv(
      env, dynamics_model, term_fn, generator=generator
  )


agent_cfg = omegaconf.OmegaConf.create({
  # this class evaluates many trajectories and picks the best one
  "_target_": "mbrl.planning.TrajectoryOptimizerAgent",
  "planning_horizon": 30,
  "replan_freq": 5,
  "verbose": False,
  "action_lb": "???",
  "action_ub": "???",
  # this is the optimizer to generate and choose a trajectory
  "optimizer_cfg": {
      "_target_": "mbrl.planning.CEMOptimizer",
      "num_iterations": 5,
      "elite_ratio": 0.1,
      "population_size": 500,
      "alpha": 0.1,
      "device": device,
      "lower_bound": "???",
      "upper_bound": "???",
      "return_mean_elites": True
  }
})

agent = planning.create_trajectory_optim_agent_for_model(
  model_env,
  agent_cfg,
  num_particles=15
)
train_losses = []
val_scores = []

def train_callback(_model, _total_calls, _epoch, tr_loss, val_score, _best_val):
  train_losses.append(tr_loss)
  val_scores.append(val_score.mean().item())   # this returns val score per ensemble model


# Create a trainer for the model
model_trainer = models.ModelTrainer(dynamics_model, optim_lr=1e-3, weight_decay=5e-5)


# Main PETS loop
all_rewards = [0]
for trial in range(num_trials):
  obs = env.reset()
  agent.reset()

  done = False
  total_reward = 0.0
  steps_trial = 0
  while not done:
      # --------------- Model Training -----------------
      if steps_trial == 0:
          dynamics_model.update_normalizer(replay_buffer.get_all())  # update normalizer stats

          dataset_train, dataset_val = replay_buffer.get_iterators(
              batch_size=cfg.overrides.model_batch_size,
              val_ratio=cfg.overrides.validation_ratio,
              train_ensemble=True,
              ensemble_size=ensemble_size,
              shuffle_each_epoch=True,
              bootstrap_permutes=False,  # build bootstrap dataset using sampling with replacement
          )

          model_trainer.train(
              dataset_train, dataset_val=dataset_val, num_epochs=50, patience=50, callback=train_callback)

      # --- Doing env step using the agent and adding to model dataset ---
      next_obs, reward, done, _ = common_util.step_env_and_add_to_buffer(env, obs, agent, {}, replay_buffer)

      obs = next_obs
      total_reward += reward
      steps_trial += 1

      if steps_trial == trial_length:
          break

  all_rewards.append(total_reward)


[Bug] PETS not working

Steps to reproduce

  1. install mbrl with python3.8 & mujoco_py 2.0.2.0
  2. python -m mbrl.examples.main algorithm=pets overrides=pets_halfcheetah

Observed Results

env_step,episode_reward,step
1000.0,-224.74164192363065,1
2000.0,-216.55716608141833,2
3000.0,-23.61229154142554,3
4000.0,-226.04264782442579,4
5000.0,299.97272326884257,5
6000.0,-424.2352836475372,6
7000.0,-605.4988140825888,7
8000.0,-276.8960448750668,8
9000.0,-570.0111469500497,9
10000.0,-510.15227529837796,10
11000.0,-521.2191905188236,11
12000.0,-380.6738015630948,12
13000.0,-401.0656166902861,13
14000.0,-342.89326195274214,14
15000.0,-387.0973047072805,15
16000.0,271.654545187927,16
17000.0,-357.9662191309233,17
18000.0,-144.4911364581224,18
19000.0,-227.65608581868534,19
20000.0,-270.1466421280269,20
21000.0,-218.2495164661332,21
22000.0,-291.59770272027646,22
23000.0,5.605493817390425,23
24000.0,-260.5804876267262,24
25000.0,-311.1006996761441,25
26000.0,-87.68273024315891,26
27000.0,-224.6058292677028,27
28000.0,-243.66672977662145,28
29000.0,-417.3611859069211,29
30000.0,-205.45597669987774,30
31000.0,-220.6631462332176,31
32000.0,-306.92107250798256,32
33000.0,-321.6192194136308,33
34000.0,156.56899647240394,34
35000.0,-373.6946869809165,35
36000.0,-297.54081355112413,36
37000.0,-403.86887923659464,37
38000.0,-394.61809157238,38
39000.0,-397.597218596027,39
40000.0,-270.5546716816992,40
41000.0,-275.0500238719418,41
42000.0,-339.1503604637613,42
43000.0,-394.371951392158,43
44000.0,-284.8456374765922,44
45000.0,-230.30455468451476,45
46000.0,-452.69669066476587,46
47000.0,-369.8052064885858,47
48000.0,-277.8216601977107,48
49000.0,83.44271984210994,49
50000.0,-165.98679718221237,50
51000.0,-286.4235189537889,51
52000.0,-420.1238034618763,52
53000.0,-348.4956325925755,53
54000.0,-262.9499726805828,54
55000.0,-82.70856034802993,55
56000.0,-283.44756999937294,56
57000.0,-296.14589401299133,57
58000.0,-310.71395667647914,58
59000.0,-92.32547170477757,59
60000.0,-343.62926472041903,60
61000.0,194.0718436837866,61
62000.0,-449.34500076620725,62
63000.0,-317.03787784175205,63
64000.0,-203.2571831873085,64
65000.0,-90.52911874178189,65
66000.0,-188.53310534801767,66
67000.0,-131.71672373665217,67
68000.0,-241.95741966590174,68
69000.0,-329.25808904770525,69
70000.0,-146.0802349071957,70
71000.0,-474.47665284478336,71
72000.0,-191.43021635327702,72

Expected Results

like results in #97

Could you please provide the data corresponding to the figures of the experiments in the PETS paper? We would like to cite PETS as one of the state-of-the-art works in our paper, but time is limited. Thank you!


[Feature Request] Add batch support to ModelEnv.evaluate_action_sequences

🚀 Feature Request

Add a batch dimension to all operations in ModelEnv.evaluate_action_sequences so that it can find rewards for several trajectories in parallel, each starting at a different observation.

This issue depends on #131, so if that one is open, it'd be better to start with that one first.

Motivation

The current implementation of ModelEnv.evaluate_action_sequences only supports one observation at a time, which means that it's not possible to evaluate several environments in parallel, wasting GPU parallelization of the model rollouts.

Pitch

Add batch support (in terms of possible initial states) to all operations involved in ModelEnv.evaluate_action_sequences, and aggregate particles correctly, so that the return is a set of total rewards batched for each initial observation.

[Bug] The generator in ModelEnv Class

Current Code

        if generator:
            self._rng = torch.Generator(device=self.device)    
        else:
            self._rng = generator
            

Correct Code

        if generator:
            self._rng = generator
        else:
            self._rng = torch.Generator(device=self.device)
            

[Feature Request] Training data selection: Create more "interesting" Replay Buffer Iterators

🚀 Feature Request

Create Replay Buffer Iterators that can select training and validation data in various "interesting" ways, similar to TransitionIterator and BootStrapIterator in
https://github.com/facebookresearch/mbrl-lib/blob/b0aabd79941efe8b56bcabbd1b43bf497b9b1746/mbrl/replay_buffer.py

Examples:

  1. Select transitions from highly-rewarding trajectories - this could be used to perform analyses of how data selection impacts MBRL, objective mismatch, etc.
  2. Select transitions randomly from the replay buffer to have a fixed size of training/validation data.

Motivation

This would make analysis similar to https://arxiv.org/abs/2002.04523 and https://arxiv.org/abs/2102.13651 easy to perform.

Pitch

It should be fairly easy to implement similar to TransitionIterator and BootStrapIterator above. (Taking care of trajectory/episodic boundaries could be a bit tricky.)

[Bug] MBPO does not work on HumanoidTruncatedObsEnv and the original Humanoid env

Steps to reproduce

  1. I tried to run MBPO on HumanoidTruncatedObsEnv with the default parameters in this repo, but the final reward is around 180 (it seems like a random policy and does not work)
  2. I tried to run MBPO on the original Humanoid env (without truncated obs) and it still does not work

I have tried different seeds and none of them work.

Observed Results

  • The results of episode reward are shown in the attached image.

Expected Results

  • The expected results (episode reward) may be around 6k.

[Feature Request] Clean up the PlaNet visualizer

🚀 Feature Request

The PlaNet visualizer is an ad-hoc version used during the debugging stages. Now that the PlaNet implementation is stable, we should update it to take a results directory, which specifies the Hydra config and also contains a checkpoint for the model. The rest of the script can operate the same way for the purposes of this pull request.

[Bug] MBPO on HalfCheetah-v2 not learning

Hello !
Thank you for this new library.

Steps to reproduce

I have tried to run MBPO on HalfCheetah-v2 using
python -m mbrl.examples.main algorithm=mbpo overrides=mbpo_halfcheetah
after having installed the library, following the README steps.

Observed Results

Looking at the results.csv file while training MBPO, the agent doesn't seem to be learning:

env_step,episode_reward,epoch,rollout_length,step
999.0,-0.8456277073670116,0.0,1.0,1
1999.0,-0.9440948246565134,1.0,1.0,2
2999.0,-0.7490316223516877,2.0,1.0,3
3999.0,-0.7571950947415443,3.0,1.0,4
4999.0,-0.8274641044949901,4.0,1.0,5
5999.0,-3.238894348513289,5.0,1.0,6
6999.0,-6.492029299434498,6.0,1.0,7
7999.0,-17.30643762007886,7.0,1.0,8
8999.0,-31.60495724347696,8.0,1.0,9
9999.0,-46.125412228730156,9.0,1.0,10
10999.0,-71.67442837269994,10.0,1.0,11
11999.0,-58.69499608366046,11.0,1.0,12
12999.0,-69.69717022249019,12.0,1.0,13
13999.0,-86.82171969107995,13.0,1.0,14
14999.0,-121.96192444262107,14.0,1.0,15
15999.0,-145.69370999969092,15.0,1.0,16
16999.0,-168.29146007014336,16.0,1.0,17
17999.0,-181.85075223574438,17.0,1.0,18
18999.0,-158.5637693244303,18.0,1.0,19
19999.0,-158.52343716747603,19.0,1.0,20
20999.0,-167.62715189227546,20.0,1.0,21
21999.0,-143.24211684101195,21.0,1.0,22
22999.0,-162.0094559119758,22.0,1.0,23
23999.0,-132.65619703115092,23.0,1.0,24
24999.0,-139.2981737548027,24.0,1.0,25

And the episode_reward values don't get any better (I have tested it for about 400,000 environment steps).

However, python -m mbrl.examples.main algorithm=pets overrides=pets_halfcheetah gets much better results, with the following results.csv:

env_step,episode_reward,step
1000.0,-152.01008098334424,1
2000.0,15.21179995149966,2
3000.0,-49.4050625863483,3
4000.0,16.849760830275926,4
5000.0,-219.33720385147777,5
6000.0,-210.56151771405823,6
7000.0,652.796055680682,7
8000.0,902.5875139452987,8
9000.0,1312.0837729671678,9
10000.0,-279.0000574725552,10
11000.0,1753.5399394982003,11
12000.0,1215.2626283022482,12
13000.0,947.9160259748666,13
14000.0,2514.5715560595763,14
15000.0,2619.5086821951263,15
16000.0,3259.5381123355673,16
17000.0,3626.519190884455,17
18000.0,4070.194763445176,18
19000.0,4121.591162199583,19
20000.0,4602.759264114635,20
21000.0,4568.923964106727,21
22000.0,4919.350936975073,22
23000.0,4902.718673117782,23
24000.0,5196.1713305116145,24
25000.0,5124.92839588049,25
26000.0,5283.068158021224,26
27000.0,5191.014333635436,27
28000.0,5345.012072796462,28
29000.0,5019.163125754766,29

Expected Results

  • I expected the MBPO agent to improve its episode_reward performance, as env_step increases.

Thank you !

[Feature Request] More general reward

Hi, currently reward_fn is independent of the environment class (mbrl.models.ModelEnv) and accepts actions and the next observation as input. In practice, more general reward functions that depend on environment parameters are needed. For example,

  • we've got some reference trajectory or obstacles that are fixed or periodically updated
  • we want to include progress in the reward function, e.g., reward - prev_reward

My initial thought is to change reward_fn from an external function to a method of ModelEnv, so that we could use attributes (self.parameter) of that class. I wonder if this is "safe" and doesn't mess with other features.
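
For what it's worth, one lighter-weight alternative (a sketch only, not an endorsed API change) is to build reward_fn as a closure over whatever environment parameters are needed, so its signature stays (actions, next_obs) as ModelEnv currently expects:

import torch

def make_goal_reward_fn(goal: torch.Tensor):
    # goal is a hypothetical environment parameter captured by the closure; it can be swapped out
    # by rebuilding the closure whenever the environment parameters are updated
    def reward_fn(actions: torch.Tensor, next_obs: torch.Tensor) -> torch.Tensor:
        return -((next_obs - goal) ** 2).sum(dim=-1, keepdim=True)
    return reward_fn

reward_fn = make_goal_reward_fn(torch.zeros(3))
print(reward_fn(torch.zeros(8, 2), torch.randn(8, 3)).shape)  # torch.Size([8, 1])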

Regards,
