mpo's Introduction

MPO (Maximum a Posteriori Policy Optimization)

PyTorch implementation of MPO (papers cited below), written with the help of other repositories (also cited below).

Policy evaluation is done using Retrace.
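
For reference, the Retrace target (Munos et al., 2016) that this kind of policy-evaluation step computes can be written as below; the exact form used in mpo.py (for example, the choice of λ) may differ in detail:

Q^{ret}(x_t, a_t) = Q(x_t, a_t) + \sum_{s \ge t} \gamma^{s-t} \left( \prod_{i=t+1}^{s} c_i \right) \left[ r_s + \gamma \, \mathbb{E}_{\pi} Q(x_{s+1}, \cdot) - Q(x_s, a_s) \right],
\qquad c_i = \lambda \min\!\left( 1, \frac{\pi(a_i \mid x_i)}{\mu(a_i \mid x_i)} \right)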

Currently, only discrete Gym environments are supported.

Usage

Look at main.py for examples of using MPO.

The architectures for Actor and Critic can be changed in mpo_net.py.
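
As an illustration of what such Actor and Critic networks can look like for a discrete action space, here is a minimal PyTorch sketch; the class layouts and layer sizes below are assumptions for illustration, not the actual definitions in mpo_net.py.

import torch
import torch.nn as nn

# Illustrative only: a minimal Actor/Critic pair for a discrete action space.
# The real architectures live in mpo_net.py and may differ from this sketch.

class Actor(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        # Action probabilities for a categorical policy.
        return torch.softmax(self.net(obs), dim=-1)


class Critic(nn.Module):
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, obs):
        # Q(s, a) for every discrete action a.
        return self.net(obs)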

Citations

  • Maximum a Posteriori Policy Optimisation (Original MPO algorithm)

https://arxiv.org/abs/1806.06920

  • Relative Entropy Regularized Policy Iteration (Improved MPO algorithm)

https://arxiv.org/abs/1812.02256

  • daisatojp's mpo GitHub repository (MPO implementation used as a reference)

https://github.com/daisatojp/mpo

  • OpenAI's ACER GitHub repository (replay buffer implementation used as a reference)

https://github.com/openai/baselines/tree/master/baselines/acer

Training Results

MPO on LunarLander-v2 (training-curve plot)

  • 5 parallel environments

MPO on Acrobot-v1 (training-curve plot)

  • 5 parallel environments

mpo's People

Contributors

acyclics

mpo's Issues

q_ret update not used

I have enjoyed your really clean implementation of MPO. Thank you for making it available. I was looking at the critic update and think I may have spotted a bug. You update q_ret on line 163 according to Retrace, but as far as I can see, you do not actually use it anywhere. I think you might want to use it recursively on line 161 in place of q_retraces[step + 1].

MPO/mpo.py

Lines 160 to 163 in c84bf23

for step in reversed(range(nsteps)):
    q_ret = reward_batch[step] + self.γ * q_retraces[step + 1] * (1 - done_batch[step + 1])
    q_retraces[step] = q_ret
    q_ret = (rho_i[step] * (q_retraces[step] - q_i[step])) + val[step]
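
For context, a sketch of how the recursion could carry q_ret backward, as suggested above; the bootstrap initialisation from q_retraces[nsteps] is an assumption about the surrounding code, not something visible in the quoted lines:

q_ret = q_retraces[nsteps]  # bootstrap from the stored final value (assumption)
for step in reversed(range(nsteps)):
    # use the corrected q_ret from the later step instead of q_retraces[step + 1]
    q_ret = reward_batch[step] + self.γ * q_ret * (1 - done_batch[step + 1])
    q_retraces[step] = q_ret
    # Retrace correction: pull the target toward the current estimate q_i with the
    # truncated importance weight rho_i before propagating it to the earlier step
    q_ret = (rho_i[step] * (q_retraces[step] - q_i[step])) + val[step]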

My 'loss_l' goes to 1.837 and the model never improves

Using the code, loss_l goes to 1.837 and never improves from there.
I notice that when loss_l reaches 1.837, the values of kl_μ and kl_Σ are both zero.

I'm using your code here to train a model in MuJoCo. I wonder if some of the constraints should have different values or if there is a more obvious explanation for what is going wrong here.
