N-step-Dueling-DDQN-PER-Pacman

Using an N-step Dueling DDQN with PER to learn how to play a Pac-Man game

Summary

DeepMind published its famous paper Playing Atari with Deep Reinforcement Learning, which introduced a new algorithm called DQN. It showed that an AI agent can learn to play games simply by watching the screen, without any prior knowledge about the game. In this project, I add a few enhancements to the vanilla DQN drawn from various papers and test them on OpenAI Gym's MsPacman-v4 environment.

Demo


Preprocessing

  • To keep the downsampled image from distorting, the pixels are dilated twice with a (3, 3) kernel.
  • The game image is downsampled to 88x80.
  • Pac-Man's pixels are recolored so the agent stands out in the observation (a rough sketch of these steps follows the list).
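
A rough sketch of the preprocessing above, assuming OpenCV. The crop offset is an assumption, and the recoloring of Pac-Man's pixels (which needs the exact sprite color) is omitted:

import cv2
import numpy as np

def preprocess(frame):
    # frame: raw 210x160x3 RGB observation from MsPacman-v4
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    # Dilate twice with a (3, 3) kernel so thin sprites survive downsampling
    dilated = cv2.dilate(gray, np.ones((3, 3), np.uint8), iterations=2)
    # Crop the score area (assumed offset) and downsample to 88x80
    return cv2.resize(dilated[:176, :], (80, 88), interpolation=cv2.INTER_AREA)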

DQN

  • Instead of stacking four images and feeding them to the network, the four most recent frames are averaged into a single input image (see the sketch after the architecture summary).
  • The target network is synchronized with the main network every 1000 steps.
  • The replay buffer holds 100,000 transitions.
  • The network architecture is:
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_1 (InputLayer)            (None, 88, 80, 1)    0                                            
__________________________________________________________________________________________________
conv2d_1 (Conv2D)               (None, 22, 20, 32)   2080        input_1[0][0]                    
__________________________________________________________________________________________________
conv2d_2 (Conv2D)               (None, 11, 10, 64)   32832       conv2d_1[0][0]                   
__________________________________________________________________________________________________
conv2d_3 (Conv2D)               (None, 11, 10, 64)   36928       conv2d_2[0][0]                   
__________________________________________________________________________________________________
flatten_1 (Flatten)             (None, 7040)         0           conv2d_3[0][0]                   
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 5)            35205       flatten_1[0][0]                  
__________________________________________________________________________________________________
lambda_1 (Lambda)               (None, 1)            0           dense_2[0][0]                    
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 1)            7041        flatten_1[0][0]                  
__________________________________________________________________________________________________
subtract_1 (Subtract)           (None, 5)            0           dense_2[0][0]                    
                                                               lambda_1[0][0]                   
__________________________________________________________________________________________________
add_1 (Add)                     (None, 5)            0           dense_1[0][0]                    
                                                               subtract_1[0][0]                 
==================================================================================================
Total params: 114,086
Trainable params: 114,086
Non-trainable params: 0
__________________________________________________________________________________________________
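
A minimal sketch of the averaged-frame input from the first bullet above; the deque and the normalization constant are my own choices, not taken from the repo:

from collections import deque
import numpy as np

frames = deque(maxlen=4)

def make_state(preprocessed_frame):
    # Average the 4 most recent 88x80 frames into a single-channel input,
    # matching the (None, 88, 80, 1) input layer above.
    frames.append(preprocessed_frame)
    avg = np.mean(np.stack(frames, axis=0), axis=0)
    return (avg / 255.0).astype(np.float32)[..., np.newaxis]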

Double DQN

In the max operator of its update target, DQN uses the same network both to select and to evaluate an action. This makes it more likely to pick overestimated values, resulting in overoptimistic value estimates. To address this, the main network is used to select the action and the target network to evaluate it.
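
A minimal sketch of that target computation; main_q and target_q stand for the two networks' batched prediction functions and are placeholders, not names from the repo:

import numpy as np

def double_dqn_targets(rewards, next_states, dones, main_q, target_q, gamma=0.99):
    # The main (online) network selects the greedy next action ...
    best_actions = np.argmax(main_q(next_states), axis=1)
    # ... and the target network evaluates that action.
    next_q = target_q(next_states)[np.arange(len(rewards)), best_actions]
    return rewards + gamma * (1.0 - dones) * next_q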

N-Step Return

DQN accumulates a single reward and then bootstraps using the greedy action at the next step. Alternatively, a forward-view multi-step target can be used, accumulating rewards over several steps before bootstrapping (5 steps here).
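
A minimal sketch of the 5-step target; the bootstrap value would come from the Double DQN estimate above, and the function name is mine:

def n_step_return(rewards, bootstrap_value, gamma=0.99, n=5):
    # rewards[0] is the oldest of the n accumulated rewards
    g = 0.0
    for r in reversed(rewards[:n]):
        g = r + gamma * g
    return g + (gamma ** n) * bootstrap_value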

Dueling Network

The dueling architecture can learn which states are (or are not) valuable without having to learn the effect of every action in every state. This is particularly useful in states where no action affects the environment in any relevant way. For a more stable optimization, the advantage stream is centered with an average baseline, i.e., its mean is subtracted before it is added to the state value.
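
A minimal Keras sketch of this head, mirroring the model summary above; the kernel sizes and strides are inferred from the parameter counts, while activations and layer names are assumptions:

from tensorflow.keras import layers, Model
import tensorflow.keras.backend as K

inp = layers.Input(shape=(88, 80, 1))
x = layers.Conv2D(32, 8, strides=4, padding="same", activation="relu")(inp)
x = layers.Conv2D(64, 4, strides=2, padding="same", activation="relu")(x)
x = layers.Conv2D(64, 3, strides=1, padding="same", activation="relu")(x)
x = layers.Flatten()(x)

advantage = layers.Dense(5)(x)                        # A(s, a), one per action
value = layers.Dense(1)(x)                            # V(s)
mean_adv = layers.Lambda(lambda a: K.mean(a, axis=1, keepdims=True))(advantage)
centered = layers.Subtract()([advantage, mean_adv])   # A(s, a) - mean_a A(s, a)
q_values = layers.Add()([value, centered])            # Q(s, a) = V(s) + centered

model = Model(inp, q_values)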

Prioritized Experience Replay

Lastly, transitions are prioritized by the magnitude of their TD error. To avoid replaying a small subset of high-priority transitions too frequently, a stochastic sampling method is used that interpolates between pure greedy prioritization and uniform random sampling: priorities are kept in a min-heap, and roughly 60% of each batch is chosen by TD-error priority and 40% uniformly.
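
A rough sketch of that mixed sampling; the real implementation's heap bookkeeping is omitted, and the buffer layout and batch size here are assumptions:

import heapq
import random

def sample_batch(buffer, priorities, batch_size=32, greedy_frac=0.6):
    # ~60% of the batch: transitions with the largest |TD error|
    n_greedy = int(batch_size * greedy_frac)
    greedy_idx = heapq.nlargest(n_greedy, range(len(buffer)),
                                key=lambda i: priorities[i])
    # ~40% of the batch: drawn uniformly at random
    uniform_idx = random.sample(range(len(buffer)), batch_size - n_greedy)
    return [buffer[i] for i in greedy_idx + uniform_idx]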

Related Papers

  1. V. Mnih et al., "Human-level control through deep reinforcement learning." Nature, 518 (7540):529–533, 2015.
  2. H. van Hasselt et al., "Deep Reinforcement Learning with Double Q-learning." arXiv preprint arXiv:1509.06461, 2015.
  3. T. Schaul et al., "Prioritized Experience Replay." arXiv preprint arXiv:1511.05952, 2015.
  4. Z. Wang et al., "Dueling Network Architectures for Deep Reinforcement Learning." arXiv preprint arXiv:1511.06581, 2015.
  5. R. S. Sutton, "Learning to predict by the methods of temporal differences." Machine learning, 3(1):9–44, 1988.

Author

cocolico14

License

This project is licensed under the MIT License - see the LICENSE file for details
