BipedalWalker_ARS_ES

BipedalWalker is an environment provided by OpenAI Gym.

Reward is given for moving forward, totaling 300+ points if the agent reaches the far end. If the robot falls, it gets -100. Applying motor torque costs a small number of points, so a more efficient agent earns a better score. The state consists of the hull angle, angular velocity, horizontal speed, vertical speed, joint positions and joint angular speeds, leg contact with the ground, and 10 lidar rangefinder measurements. There are no coordinates in the state vector.
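
Below is a minimal sketch of interacting with the environment through the classic gym API (the env name and the 4-tuple step return match the gym versions current when this repo was written); it just runs a random policy for one episode:

import gym

env = gym.make("BipedalWalker-v2")      # 24-dim observation, 4-dim continuous action
state = env.reset()
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()  # random torques in [-1, 1]^4
    state, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)  # a random policy typically ends near -100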

Problem

The problem is posed as a finite-horizon, non-deterministic Markov decision process (MDP), and is as interesting as it is difficult. The high dimensionality and continuous ranges of the inputs (states) and outputs (actions) pose especially challenging instances of the problems of delayed reward, credit assignment, and exploration vs. exploitation. Moreover, while the MDP might guarantee convergence to a deterministic optimal policy in the limit, the dimensionality and continuous ranges mean that the state-action space cannot be enumerated in finite space.

A successful learner will "solve" the reinforcement learning problem by achieving an average reward of 300 over 100 consecutive trials. The study will both optimize hyperparameters, e.g. learning rates and the discount factor, and examine their effect on performance. Lastly, a holistic comparison of the reinforcement learner and the evolutionary learner will be provided.

Algorithms

The chief algorithms employed are a synthesis of DQN and DDPG in an actor-critic (Konda 2000) architecture with batch normalization (Ioffe 2015), and an evolution strategy (ES) applied as a surrogate for traditional reinforcement learning.

The reinforcement learning algorithm is well-defined in (Lillicrap 2016), and borrows extensively from (Mnih 2013), using the concepts of experience replay and separate target networks:

randomly initialize critic network Q(s, a) and actor μ(s)
initialize target networks Q' and μ' with the same weights
initialize replay buffer R
for episode = 1, M do:
  initialize a random process N for action exploration
  receive initial observation state s_1
  for t = 1, T do:
    select action a_t = μ(s_t) + N_t according to the current policy and exploration noise
    execute action a_t and observe reward r_t and new state s_t+1
    store transition (s_t, a_t, r_t, s_t+1) in replay buffer R
    sample a random minibatch of N transitions from R
    set the targets y_i = r_i + γ Q'(s_i+1, μ'(s_i+1))
    update the critic by minimizing the loss L = 1/N Σ_i (y_i - Q(s_i, a_i))^2
    update the actor policy using the sampled policy gradient
    softly update the target networks: θ' ← τθ + (1 - τ)θ'
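
As a rough illustration, here is a hedged PyTorch sketch of the inner-loop update above; the network sizes, learning rates, and τ are illustrative assumptions, not values from the paper or from this repo (which also uses batch normalization, omitted here):

import torch
import torch.nn as nn

obs_dim, act_dim, gamma, tau = 24, 4, 0.99, 0.001

def mlp(in_dim, out_dim, out_act):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                         nn.Linear(64, out_dim), out_act)

actor = mlp(obs_dim, act_dim, nn.Tanh())            # μ(s): torques in [-1, 1]^4
critic = mlp(obs_dim + act_dim, 1, nn.Identity())   # Q(s, a)
actor_targ = mlp(obs_dim, act_dim, nn.Tanh())
critic_targ = mlp(obs_dim + act_dim, 1, nn.Identity())
actor_targ.load_state_dict(actor.state_dict())      # Q' and μ' start as copies
critic_targ.load_state_dict(critic.state_dict())
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def ddpg_update(s, a, r, s2, done):
    # targets y_i = r_i + γ Q'(s_{i+1}, μ'(s_{i+1})), cut off at episode ends
    with torch.no_grad():
        y = r + gamma * (1 - done) * critic_targ(torch.cat([s2, actor_targ(s2)], 1))
    critic_loss = ((critic(torch.cat([s, a], 1)) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()
    # sampled policy gradient: push μ toward actions the critic scores highly
    actor_loss = -critic(torch.cat([s, actor(s)], 1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
    # soft target updates: θ' ← τθ + (1 - τ)θ'
    for net, targ in ((actor, actor_targ), (critic, critic_targ)):
        for p, p_targ in zip(net.parameters(), targ.parameters()):
            p_targ.data.mul_(1 - tau).add_(tau * p.data)

In the full algorithm this function is called once per environment step on a minibatch (s, a, r, s2, done) sampled from R.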

The evolution strategy (ES) proposed in (Salimans 2017), in brief, applies randomized, "black box" optimization as an alternative to traditional reinforcement learning:

input: learning rate α, noise standard deviation σ, initial policy parameters θ_0
for t = 0, 1, 2, ... do:
  sample ϵ_1, ..., ϵ_n ~ N(0, I)
  compute returns F_i = F(θ_t + σϵ_i) for i = 1, ..., n
  update the policy parameters: θ_t+1 = θ_t + α 1/(nσ) Σ_i F_i ϵ_i
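
Concretely, assuming the policy parameters are flattened into a single vector, one such update can be sketched in a few lines of NumPy (evaluate stands in for any routine that runs one episode and returns its total reward; the fitness normalization is a common practical detail, not part of the pseudocode above):

import numpy as np

def es_step(theta, evaluate, npop=50, sigma=0.1, alpha=0.01):
    epsilon = np.random.randn(npop, theta.size)     # ϵ_1 ... ϵ_n
    returns = np.array([evaluate(theta + sigma * e) for e in epsilon])
    fitness = (returns - returns.mean()) / (returns.std() + 1e-8)
    # θ_{t+1} = θ_t + α/(nσ) Σ_i F_i ϵ_i
    return theta + alpha / (npop * sigma) * epsilon.T @ fitness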

ARS is a random search method for training linear policies on continuous control problems, introduced in the paper "Simple random search provides a competitive approach to reinforcement learning" (Mania et al. 2018).
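
A minimal NumPy sketch of one ARS update for a linear policy (action = M @ state) is shown below; rollout stands in for a routine returning an episode's total reward, and the repo's Ars_BipedalWalker.py additionally normalizes states online and keeps only the best-performing directions, which this sketch omits:

import numpy as np

def ars_step(M, rollout, n_dirs=16, nu=0.03, alpha=0.02):
    # sample random directions and evaluate the policy perturbed both ways
    deltas = [np.random.randn(*M.shape) for _ in range(n_dirs)]
    r_pos = np.array([rollout(M + nu * d) for d in deltas])
    r_neg = np.array([rollout(M - nu * d) for d in deltas])
    sigma_r = np.concatenate([r_pos, r_neg]).std() + 1e-8   # scale steps by the reward std
    step = sum((rp - rn) * d for rp, rn, d in zip(r_pos, r_neg, deltas))
    return M + alpha / (n_dirs * sigma_r) * step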

Dependencies

pip install gym[box2d]

(BipedalWalker needs the Box2D physics engine, which the plain gym package does not install.)

Augmented Random Search

The credit for this code goes to colinskow; I've merely created a wrapper. He explains ARS in his video, and I have added comments to the code for better understanding.

Details of colinskow's ARS implementation are explained in the Move37 course by SirajRaval.

Evolution Strategies

ES is somewhat unfortunately named, as I view it as more of a variation on random search than anything related to the principles of evolution. Before describing ES, it's helpful to understand random search. Consider the problem of finding weights for a neural network in a reinforcement learning task, where the network is given a score based on how well it performs in a particular environment. Random search samples points (potential weights) in a hyper-sphere around the origin. The 'best' point is selected, a hyper-sphere around that best point is sampled, the next best point is found, and the cycle repeats until some stopping criterion is met.

ES is similar to random search, but instead of the new 'best' point being the single best-performing sample, the new 'best' is computed by averaging the differences between the previous 'best' and the sampled points, weighted by their 'fitness' (the value returned from the environment). In this way, the new 'best' moves in the direction that maximizes performance in the environment. This can be considered a type of gradient descent based on finite differences, even though no gradient is explicitly calculated. A short sketch contrasting the two update rules follows.
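
In the sketch below (all names are illustrative), w is the current 'best' point, samples holds the sampled candidate weights, and scores holds their returns from the environment:

import numpy as np

def random_search_update(w, samples, scores):
    return samples[np.argmax(scores)]       # jump to the single best-performing point

def es_update(w, samples, scores, alpha=0.01):
    fitness = (scores - scores.mean()) / (scores.std() + 1e-8)
    # fitness-weighted average of the differences between the samples and w
    return w + alpha * fitness @ (samples - w) / len(samples)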

For further reading on ES, see https://blog.openai.com/evolution-strategies/

I tried implementing this using alirezamika's evostra, but the reward was not increasing and I don't know why. I trained the agent for over 1000 iterations.


For the sake of simplicity, I first implemented ES on the CartPole environment in Es_CartPole.py and then transferred that concept to the BipedalWalker environment in Es_BipedalWalker.py, but it did not work well. I then found the Python ES implementation by hardmaru in his estool repo; it is an easy-to-use tool for implementing various ES strategies.

I then used ARS to train the BipedalWalker in Ars_BipedalWalker.py.

References

Bipedal Walker by Allan Reyes

Evolution Strategies by Fabrizio Frigeni

Evolution Strategies as a Scalable Alternative to Reinforcement Learning — Paper Summary

Augmented-Random-Search

Evolution Strategies with Keras

ESTool by hardmaru

A Visual Guide to Evolution Strategies

Evostra: Evolution Strategy for Python

An AI agent learning to walk in gym's BipedalWalker environment.

Move37 Course by SirajRaval

Augmented Random Search Tutorial (Teach a Robot to Walk)
