
mbbl's Introduction

Model Based Reinforcement Learning Benchmarking Library (MBBL)

Introduction

Arxiv Link | PDF | Project Page

Abstract: Model-based reinforcement learning (MBRL) is widely seen as having the potential to be significantly more sample efficient than model-free RL. However, research in model-based RL has not been very standardized. It is fairly common for authors to experiment with self-designed environments, and there are several separate lines of research, which are sometimes closed-sourced or not reproducible. Accordingly, it is an open question how these various existing MBRL algorithms perform relative to each other. To facilitate research in MBRL, in this paper we gather a wide collection of MBRL algorithms and propose over 18 benchmarking environments specially designed for MBRL. We benchmark these MBRL algorithms with unified problem settings, including noisy environments. Beyond cataloguing performance, we explore and unify the underlying algorithmic differences across MBRL algorithms. We characterize three key research challenges for future MBRL research: the dynamics coupling effect, the planning horizon dilemma, and the early-termination dilemma.

Installation

Install the project with pip from the top-level directory:

pip install --user -e .
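
As a quick sanity check (assuming the package installs under the name mbbl, which is the import path used in the issue snippets below), the environment registry import should succeed:

python -c "from mbbl.env.env_register import make_env"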

For sub-packages of algorithms not integrated here, please refer to the respective readmes.

Algorithms

Some of the algorithms are not yet merged into this repo. We use the following tags to indicate their status: [Merged] means the algorithm is merged into this repo; [Separate repo] means it lives in a separate repository.

Shooting Algorithms

1. Random Shooting (RS) [Merged]

Rao, Anil V. "A survey of numerical methods for optimal control." Advances in the Astronautical Sciences 135.1 (2009): 497-528. Link

python main/rs_main.py --exp_id rs_gym_cheetah_seed_1234 \
    --task gym_cheetah \
    --num_planning_traj 1000 --planning_depth 10 --random_timesteps 10000 \
    --timesteps_per_batch 3000 --num_workers 20 --max_timesteps 200000 --seed 1234

The following script will test the performance when using ground-truth dynamics:

python main/rs_main.py --exp_id rs_${env_type} \
    --task gym_cheetah \
    --num_planning_traj 1000 --planning_depth 10 --random_timesteps 0 \
    --timesteps_per_batch 1 --num_workers 20 --max_timesteps 20000 \
    --gt_dynamics 1

Also, set --check_done 1 so that the agent detects early episode termination (needed for gym_fant and gym_fhopper); see the example below.
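
For instance, the ground-truth RS command above can be adapted to gym_fant as follows (a sketch: the flags come from the examples and notes above, and the hyperparameters are simply carried over, so they may need tuning):

python main/rs_main.py --exp_id rs_gym_fant \
    --task gym_fant \
    --num_planning_traj 1000 --planning_depth 10 --random_timesteps 0 \
    --timesteps_per_batch 1 --num_workers 20 --max_timesteps 20000 \
    --gt_dynamics 1 --check_done 1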

2. Model-Based Model-Free (MB-MF) [Merged]

Nagabandi, Anusha, et al. "Neural network dynamics for model-based deep reinforcement learning with model-free fine-tuning." arXiv preprint arXiv:1708.02596 (2017). Link

python main/mbmf_main.py --exp_id mbmf_gym_cheetah_ppo_seed_1234 \
    --task gym_cheetah --trust_region_method ppo \
    --num_planning_traj 5000 --planning_depth 20 --random_timesteps 1000 \
    --timesteps_per_batch 1000 --dynamics_epochs 30 \
    --num_workers 20 --mb_timesteps 7000 --dagger_epoch 300 \
    --dagger_timesteps_per_iter 1750 --max_timesteps 200000 \
    --seed 1234 --dynamics_batch_size 500

3. Probabilistic Ensembles with Trajectory Sampling (PETS-RS and PETS-CEM) [Merged] [Separate repo]

Chua, K., Calandra, R., McAllister, R., & Levine, S. (2018). Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems (pp. 4754-4765). Link

See the POPLIN codebase, where you can benchmark PETS-RS and PETS-CEM following its readme. PETS-RS with ground-truth dynamics is essentially RS with ground-truth dynamics; to run PETS-CEM with ground-truth dynamics:

python main/pets_main.py --exp_id pets-gt-gym_cheetah \
    --task gym_cheetah \
    --num_planning_traj 500 --planning_depth 30 --random_timesteps 0 \
    --timesteps_per_batch 1 --num_workers 10 --max_timesteps 20000 \
    --gt_dynamics 1

Policy Search with Backpropagation through Time

4. Probabilistic Inference for Learning Control (PILCO) [Separate repo]

Deisenroth, M., & Rasmussen, C. E. (2011). PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on machine learning (ICML-11) (pp. 465-472). Link

We implemented and benchmarked the environments in a separate repo: PILCO.

5. Iterative Linear-Quadratic Gaussian (iLQG) [Merged]

Tassa, Y., Erez, T., & Todorov, E. (2012, October). Synthesis and stabilization of complex behaviors through online trajectory optimization. In 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems (pp. 4906-4913). IEEE. Link

python main/ilqr_main.py --exp_id ilqr-gym_cheetah \
    --max_timesteps 2000 --task gym_cheetah \
    --timesteps_per_batch 1 --ilqr_iteration 10 --ilqr_depth 30 \
    --max_ilqr_linesearch_backtrack 10 --num_workers 2 \
    --gt_dynamics 1

6. Guided Policy Search (GPS) [Separate repo]

Levine, Sergey, and Vladlen Koltun. "Guided policy search." International Conference on Machine Learning. 2013. Link

We implemented and benchmarked the environments in a separate repo: GPS.

7. Stochastic Value Gradients (SVG) [Separate repo]

Heess, N., Wayne, G., Silver, D., Lillicrap, T., Erez, T., & Tassa, Y. (2015). Learning continuous control policies by stochastic value gradients. In Advances in Neural Information Processing Systems (pp. 2944-2952). Link

We implemented and benchmarked the environments in a separate repo: SVG (will be made public soon).

Dyna-Style Algorithms

8. Model-Ensemble Trust-Region Policy Optimization (ME-TRPO) [Separate repo]

Kurutach, Thanard, et al. "Model-Ensemble Trust-Region Policy Optimization." arXiv preprint arXiv:1802.10592 (2018). Link

We implemented and benchmarked the environments in a separate repo: ME-TRPO.

9. Stochastic Lower Bound Optimization (SLBO) [Separate repo]

Luo, Y., Xu, H., Li, Y., Tian, Y., Darrell, T., & Ma, T. (2018). Algorithmic Framework for Model-based Deep Reinforcement Learning with Theoretical Guarantees. Link

We implemented and benchmarked the environments in a separate repo: SLBO.

10. Model-Based Meta-Policy Optimization (MB-MPO) [Separate repo]

Clavera, I., Rothfuss, J., Schulman, J., Fujita, Y., Asfour, T., & Abbeel, P. (2018). Model-based reinforcement learning via meta-policy optimization. arXiv preprint arXiv:1809.05214. Link

We implemented and benchmarked the environments in a separate repo: MB-MPO (will be made public soon).

Model-free Baselines

11. Trust-Region Policy Optimization (TRPO) [Merged]

Schulman, John, et al. "Trust region policy optimization." International Conference on Machine Learning. 2015. Link

python main/mf_main.py --exp_id trpo_gym_cheetah_seed1234 \
    --timesteps_per_batch 2000 --task gym_cheetah \
    --num_workers 5 --trust_region_method trpo --max_timesteps 200000

12. Proximal Policy Optimization (PPO) [Merged]

Schulman, John, et al. "Proximal policy optimization algorithms." arXiv preprint arXiv:1707.06347 (2017). Link

python main/mf_main.py --exp_id ppo_gym_cheetah_seed1234 \
    --timesteps_per_batch 2000 --task gym_cheetah \
    --num_workers 5 --trust_region_method ppo --max_timesteps 200000

13. Twin Delayed Deep Deterministic Policy Gradient (TD3) [Separate repo]

Fujimoto, S., van Hoof, H., & Meger, D. (2018). Addressing function approximation error in actor-critic methods. arXiv preprint arXiv:1802.09477. Link

We implemented and benchmarked the environments in a separate repo: TD3.

14. Soft Actor-Critic (SAC) [Separate repo]

Haarnoja, T., Zhou, A., Abbeel, P., & Levine, S. (2018). Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor. arXiv preprint arXiv:1801.01290. Link

We implemented and benchmarked the environments in a separate repo: SAC.

Disclaimer

As mentioned on the project webpage, this is a developing (unfinished) project. We are working towards a unified package for MBRL algorithms, but it might take a while given that we lack the manpower.

Engineering Stats and 1 Million Performance

Env

Here are the available environments and their mapping to the names used in the paper.

Mapping Table

Env                Repo-Name
Pendulum           gym_pendulum
InvertedPendulum   gym_invertedPendulum
Acrobot            gym_acrobot
CartPole           gym_cartPole
Mountain Car       gym_mountain
Reacher            gym_reacher
HalfCheetah        gym_cheetah
Swimmer-v0         gym_swimmer
Swimmer            gym_fswimmer
Ant                gym_ant
Ant-ET             gym_fant
Walker2D           gym_walker2d
Walker2D-ET        gym_fwalker2d
Hopper             gym_hopper
Hopper-ET          gym_fhopper
SlimHumanoid       gym_nostopslimhumanoid
SlimHumanoid-ET    gym_slimhumanoid
Humanoid-ET        gym_humanoid
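
Each repo name above can be passed to the environment registry. Below is a minimal usage sketch, mirroring the API used in the pendulum snippet from the issues section (the task name and seed are arbitrary):

from mbbl.env.env_register import make_env

# Build a benchmark environment by its repo name (see the mapping table above).
env, _ = make_env('gym_cheetah', rand_seed=1234)
env.reset()

# Step with a random action; the underlying gym environment is exposed as env._env.
obs, reward, done, info = env.step(env._env.action_space.sample())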

mbbl's People

Contributors

edlanglois, wilsonwangthu

mbbl's Issues

Stochastic Value Gradients implementation

Hello,

thank you for your hard work in providing and polishing this repository.
I would like to replicate the results using Stochastic Value Gradients. I see in the readme that it will be set public: can you estimate how long that will take?

Thank you,

Pierluca

Modify code to experiment on MuJoCo envs (e.g. Walker) in a newer version of Gym

Thank you so much for providing a benchmark for model-based reinforcement learning.

I want to adapt your code to a newer version of Gym, and I have some questions about it: in mbbl/mbbl/env/gym_env/walker.py, in the reset() function, after line 75 you do the following:

  1. Get the observation and store it in self._old_ob
  2. Call self._env.reset()
  3. Set self._old_ob back

As your comment on line 75 says: # the following is a hack, there is some precision issue in mujoco_py. I'm not familiar with old versions of Gym, but in newer versions (e.g. 0.12.1), self._env.reset() returns the new observation.

I wonder whether you intend to restore the previous state after calling self._env.reset(), or whether this only works around an issue in the old version of Gym.

If it is only an old-version issue: can I simply do self._old_ob = self._env.reset() and then return it? (For my current work, I skip the ground-truth dynamics.)

Thank you so much for your support!

Different Gym Version

Different algorithms benchmarked in the repo use different Gym versions, including 0.7.4, 0.9.4, and 0.10.5.

To make the comparison fairer, should all experiments be run on the same Gym version?

Thanks

Off-by-one error in `gym_pendulum`

Hi, just a minor issue here. The gym_pendulum environment runs for 201 steps instead of the 200 steps that the original Gym pendulum uses, and I think it's because this line should be an inequality.

Code to see this issue

#!/usr/bin/env python3

import gym
from mbbl.env.env_register import make_env

env = gym.make('Pendulum-v0')
env.reset()

done = False
t = 0
while not done:
    obs, reward, done, _ = env.step(env.action_space.sample())
    t += 1

print(f'Gym Pendulum-v0: {t} steps')

env, _ = make_env('gym_pendulum', rand_seed=0)
env.reset()

done = False
t = 0
while not done:
    obs, reward, done, _ = env.step(env._env.action_space.sample())
    t += 1

print(f'mbbl pendulum: {t} steps')

Output

Gym Pendulum-v0: 200 steps
mbbl pendulum: 201 steps

BNN_MLP is missing.

Thank you so much for providing a Benchmark for Model-Based Reinforcement Learning.

I found that in bayesian_forward_dynamics.py, mbbl.util.bnn (line 12) and BNN_MLP (line 74) are missing.

DeepMind Control Suite: Environment Dimensionality

When loading the reacher-easy task, I received an error about the observation dimensionality (6 vs. 7).

The original DeepMind paper states that the observation dimensionality is 7, which is also set in MBBL's env_register.py. However, debugging into dm_control/suite/reacher.py, the observations seem to have 6 dimensions (2 pos, 2 vel, 2 goal).

I was wondering whether this is an issue on my end or whether running the reacher-easy task also fails for you? Thanks a lot!

Option to save and load models

Is there an option to save the learned model and then load it for further training? I did not see any flag for that. Thanks in advance.
