
mbbl's Introduction

Model Based Reinforcement Learning Benchmarking Library (MBBL)

Introduction

Arxiv Link | PDF | Project Page

Abstract: Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample efficient than model-free RL. However, research in model-based RL has not been very standardized. It is fairly common for authors to experiment with self-designed environments, and there are several separate lines of research, which are sometimes closed-sourced or not reproducible. Accordingly, it is an open question how these various existing MBRL algorithms perform relative to each other. To facilitate research in MBRL, in this paper we gather a wide collection of MBRL algorithms and propose over 18 benchmarking environments specially designed for MBRL. We benchmark these MBRL algorithms with unified problem settings, including noisy environments. Beyond cataloguing performance, we explore and unify the underlying algorithmic differences across MBRL algorithms. We characterize three key research challenges for future MBRL research: the dynamics coupling effect, the planning horizon dilemma, and the early-termination dilemma.

Installation

Install the project with pip from the top-level directory:

pip install --user -e .
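
As a quick sanity check (assuming the package installs under the name mbbl, which is the import path used in the issue snippets below), the environment registry import should succeed:

python -c "from mbbl.env.env_register import make_env"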

For sub-packages of algorithms not integrated here, please refer to the respective readmes.

Algorithms

Some of the algorithms are not yet merged into this repo. We use the following tags to indicate their status: [Merged] means the algorithm is merged into this repo; [Separate repo] means it lives in a separate repository.

Shooting Algorithms

1. Random Shooting (RS) [Merged]

Rao, Anil V. "A survey of numerical methods for optimal control." Advances in the Astronautical Sciences 135.1 (2009): 497-528. Link

python main/rs_main.py --exp_id rs_gym_cheetah_seed_1234 \
    --task gym_cheetah \
    --num_planning_traj 1000 --planning_depth 10 --random_timesteps 10000 \
    --timesteps_per_batch 3000 --num_workers 20 --max_timesteps 200000 --seed 1234

The following script will test the performance when using ground-truth dynamics:

python main/rs_main.py --exp_id rs_${env_type} \
    --task gym_cheetah \
    --num_planning_traj 1000 --planning_depth 10 --random_timesteps 0 \
    --timesteps_per_batch 1 --num_workers 20 --max_timesteps 20000 \
    --gt_dynamics 1

Also, set --check_done 1 so that the agent detects early episode termination (needed for gym_fant and gym_fhopper); see the example below.
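
For instance, the ground-truth RS command above can be adapted to gym_fant as follows (a sketch: the flags come from the examples and notes above, and the hyperparameters are simply carried over, so they may need tuning):

python main/rs_main.py --exp_id rs_gym_fant \
    --task gym_fant \
    --num_planning_traj 1000 --planning_depth 10 --random_timesteps 0 \
    --timesteps_per_batch 1 --num_workers 20 --max_timesteps 20000 \
    --gt_dynamics 1 --check_done 1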

2. Model-Based Model-Free (MB-MF) [Merged]

Nagabandi, Anusha, et al. "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning." arXiv preprint arXiv:1708.02596 (2017). Link

python main/mbmf_main.py --exp_id mbmf_gym_cheetah_ppo_seed_1234 \
    --task gym_cheetah --trust_region_method ppo \
    --num_planning_traj 5000 --planning_depth 20 --random_timesteps 1000 \
    --timesteps_per_batch 1000 --dynamics_epochs 30 \
    --num_workers 20 --mb_timesteps 7000 --dagger_epoch 300 \
    --dagger_timesteps_per_iter 1750 --max_timesteps 200000 \
    --seed 1234 --dynamics_batch_size 500

3. Probabilistic Ensembles with Trajectory Sampling (PETS-RS and PETS-CEM) [Merged] [Separate repo]

Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems (pp. 4754-4765). Link

See the POPLIN codebase, where you can benchmark PETS-RS and PETS-CEM following its readme. PETS-RS with ground-truth dynamics is essentially RS with ground-truth dynamics; to run PETS-CEM with ground-truth dynamics:

python main/pets_main.py --exp_id pets-gt-gym_cheetah \
    --task gym_cheetah \
    --num_planning_traj 500 --planning_depth 30 --random_timesteps 0 \
    --timesteps_per_batch 1 --num_workers 10 --max_timesteps 20000 \
    --gt_dynamics 1

Policy Search with Backpropagation through Time

4. Probabilistic Inference for Learning Control (PILCO) [Separate repo]

Deisenroth, M., & Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11) (pp. 465-472). Link

We implemented and benchmarked the environments in a separate repo: PILCO.

5. Iterative Linear-Quadratic Gaussian (iLQG) [Merged]

Tassa, Y., Erez, T., & Todorov, E. (2012, October). Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 4906-4913). IEEE. Link

python main/ilqr_main.py --exp_id ilqr-gym_cheetah \
    --max_timesteps 2000 --task gym_cheetah \
    --timesteps_per_batch 1 --ilqr_iteration 10 --ilqr_depth 30 \
    --max_ilqr_linesearch_backtrack 10 --num_workers 2 \
    --gt_dynamics 1

6. Guided Policy Search (GPS) [Separate repo]

Levine, Sergey, and Vladlen Koltun. "Guided policy search." International Conference on Machine Learning. 2013. Link

We implemented and benchmarked the environments in a separate repo: GPS.

7. Stochastic Value Gradients (SVG) [Separate repo]

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., & Tassa, Y. (2015). Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems (pp. 2944-2952). Link

We implemented and benchmarked the environments in a separate repo: SVG (will be made public soon).

Dyna-Style Algorithms

8. Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) [Separate repo]

Kurutach, Thanard, et al. "Model-Ensemble Trust-Region Policy Optimization." arXiv preprint arXiv:1802.10592 (2018). Link

We implemented and benchmarked the environments in a separate repo: ME-TRPO.

9. Stochastic Lower Bound Optimization (SLBO) [Separate repo]

Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., & Ma, T. (2018). Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical Guarantees. Link

We implemented and benchmarked the environments in a separate repo: SLBO.

10. Model-Based Meta-Policy Optimization (MB-MPO) [Separate repo]

Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., & Abbeel, P. (2018). Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214. Link

We implemented and benchmarked the environments in a separate repo: MB-MPO (will be made public soon).

Model-free Baselines

11. Trust-Region Policy Optimization (TRPO) [Merged]

Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015. Link

python main/mf_main.py --exp_id trpo_gym_cheetah_seed1234 \
    --timesteps_per_batch 2000 --task gym_cheetah \
    --num_workers 5 --trust_region_method trpo --max_timesteps 200000

12. Proximal Policy Optimization (PPO) [Merged]

Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017). Link

python main/mf_main.py --exp_id ppo_gym_cheetah_seed1234 \
    --timesteps_per_batch 2000 --task gym_cheetah \
    --num_workers 5 --trust_region_method ppo --max_timesteps 200000

13. Twin Delayed Deep Deterministic Policy Gradient (TD3) [Separate repo]

Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Link

We implemented and benchmarked the environments in a separate repo: TD3.

14. Soft Actor-Critic (SAC) [Separate repo]

Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Link

We implemented and benchmarked the environments in a separate repo: SAC.

Disclaimer

As mentioned on the project webpage, this is a developing (unfinished) project. We are working towards a unified package for MBRL algorithms, but it might take a while given that we lack the manpower.

Engineering Stats and 1 Million Performance

Env

Here are the available environments and their mapping to the names used in the paper.

Mapping Table

Env                Repo-Name
Pendulum           gym_pendulum
InvertedPendulum   gym_invertedPendulum
Acrobot            gym_acrobot
CartPole           gym_cartPole
Mountain Car       gym_mountain
Reacher            gym_reacher
HalfCheetah        gym_cheetah
Swimmer-v0         gym_swimmer
Swimmer            gym_fswimmer
Ant                gym_ant
Ant-ET             gym_fant
Walker2D           gym_walker2d
Walker2D-ET        gym_fwalker2d
Hopper             gym_hopper
Hopper-ET          gym_fhopper
SlimHumanoid       gym_nostopslimhumanoid
SlimHumanoid-ET    gym_slimhumanoid
Humanoid-ET        gym_humanoid
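
Each repo name above can be passed to the environment registry. Below is a minimal usage sketch, mirroring the API used in the pendulum snippet from the issues section (the task name and seed are arbitrary):

from mbbl.env.env_register import make_env

# Build a benchmark environment by its repo name (see the mapping table above).
env, _ = make_env('gym_cheetah', rand_seed=1234)
env.reset()

# Step with a random action; the underlying gym environment is exposed as env._env.
obs, reward, done, info = env.step(env._env.action_space.sample())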

mbbl's People

Contributors

edlanglois, wilsonwangthu

mbbl's Issues

Stochastic Value Gradients implementation

Hello,

thank you for your hard work in providing and polishing this repository.
I would like to replicate the results using Stochastic Value Gradients. I see in the readme that it will be set public: can you estimate how long that will take?

Thank you,

Pierluca

Modify code to experiment on MuJoCo envs (e.g. Walker) in a newer version of Gym

Thank you so much for providing a benchmark for model-based reinforcement learning.

I want to adapt your code to a newer version of Gym, and I have some questions about it: in mbbl/mbbl/env/gym_env/walker.py, in the reset() function, after line 75 you do the following:

  1. Get the observation and store it in self._old_ob
  2. Call self._env.reset()
  3. Set self._old_ob back

As your comment on line 75 says: # the following is a hack, there is some precision issue in mujoco_py. I'm not familiar with old versions of Gym, but in newer versions (e.g. 0.12.1), self._env.reset() returns the new observation.

I wonder whether you intend to restore the previous state after calling self._env.reset(), or whether this only works around an issue in the old version of Gym.

If it is only an old-version issue: can I simply do self._old_ob = self._env.reset() and then return it? (For my current work, I skip the ground-truth dynamics.)

Thank you so much for your support!

Different Gym Version

Different algorithms benchmarked in the repo use different Gym versions, including 0.7.4, 0.9.4, and 0.10.5.

To make the comparison fairer, should all experiments be run on the same Gym version?

Thanks

Off-by-one error in `gym_pendulum`

Hi, just a minor issue here. The gym_pendulum environment runs for 201 steps instead of the 200 steps that the original Gym pendulum uses, and I think it's because this line should be an inequality.

Code to see this issue

#!/usr/bin/env python3

import gym
from mbbl.env.env_register import make_env

env = gym.make('Pendulum-v0')
env.reset()

done = False
t = 0
while not done:
    obs, reward, done, _ = env.step(env.action_space.sample())
    t += 1

print(f'Gym Pendulum-v0: {t} steps')

env, _ = make_env('gym_pendulum', rand_seed=0)
env.reset()

done = False
t = 0
while not done:
    obs, reward, done, _ = env.step(env._env.action_space.sample())
    t += 1

print(f'mbbl pendulum: {t} steps')

Output

Gym Pendulum-v0: 200 steps
mbbl pendulum: 201 steps

BNN_MLP is missing.

Thank you so much for providing a Benchmark for Model-Based Reinforcement Learning.

I found that in bayesian_forward_dynamics.py, mbbl.util.bnn (line 12) and BNN_MLP (line 74) are missing.

DeepMind Control Suite: Environment Dimensionality

When loading the reacher-easy task, I received an error about the observation dimensionality (6 vs. 7).

The original DeepMind paper states that the observation dimensionality is 7, which is also set in MBBL's env_register.py. However, debugging into dm_control/suite/reacher.py, the observations seem to have 6 dimensions (2 pos, 2 vel, 2 goal).

I was wondering whether this is an issue on my end or whether running the reacher-easy task also fails for you? Thanks a lot!

Option to save and load models

Is there an option to save the learned model and then load it for further training? I did not see any flag for that. Thanks in advance.
