
asynchronous-methods-for-deep-reinforcement-learning's Introduction

Asynchronous-Methods-for-Deep-Reinforcement-Learning

Based on a paper from Google DeepMind (http://arxiv.org/pdf/1602.01783v1.pdf), I've developed a new version of the DQN that uses asynchronous thread-based exploration instead of experience replay. I implemented the one-step Q-learning pseudocode, and we can now train the Pong game in less than 20 hours without any GPU or distributed setup.
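As a rough illustration of that scheme, here is a self-contained toy sketch (not this repo's code): several threads run epsilon-greedy one-step Q-learning on their own copy of a tiny chain environment and update one shared tabular Q function, with no replay memory and the target network omitted. The environment and all names are made up for the example.

```python
import threading
import random

N_STATES, N_ACTIONS, GOAL = 6, 2, 5       # tiny chain world: walk right to reach the goal
GAMMA, ALPHA, STEPS_PER_THREAD = 0.99, 0.1, 20000

Q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]   # shared parameters (tabular here)
lock = threading.Lock()

def step(state, action):
    nxt = min(state + 1, GOAL) if action == 1 else max(state - 1, 0)
    return nxt, (1.0 if nxt == GOAL else 0.0), nxt == GOAL

def actor_learner(final_epsilon):
    state, epsilon = 0, 1.0
    for _ in range(STEPS_PER_THREAD):
        # anneal epsilon from 1.0 down to this thread's final value
        epsilon = max(final_epsilon, epsilon - (1.0 - final_epsilon) / STEPS_PER_THREAD)
        if random.random() < epsilon:
            action = random.randrange(N_ACTIONS)
        else:
            action = max(range(N_ACTIONS), key=lambda a: Q[state][a])
        nxt, reward, terminal = step(state, action)
        # one-step Q-learning target: r + gamma * max_a' Q(s', a')
        target = reward if terminal else reward + GAMMA * max(Q[nxt])
        with lock:  # asynchronous update of the shared Q function, no replay memory
            Q[state][action] += ALPHA * (target - Q[state][action])
        state = 0 if terminal else nxt

threads = [threading.Thread(target=actor_learner, args=(eps,))
           for eps in (0.1, 0.01, 0.5, 0.1)]
for th in threads:
    th.start()
for th in threads:
    th.join()
print([round(max(q), 3) for q in Q])   # learned state values along the chain
```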


asynchronous-methods-for-deep-reinforcement-learning's Issues

It is GREAT!

Hi zeta:

I saw your reply and came here to have a look.

This is so cool! I really need a method that avoids memory replay, since the replay memory takes a lot of space and is time-consuming.

Thank you. I will look into it further.
Mingyan

Different final epsilons from the paper

The paper states that the final epsilons should be [0.1, 0.01, 0.5], but I noticed in your code they are [0.01, 0.01, 0.05] (strangely, there are two 0.01s). Is this a mistake or an intentional improvement?

I'm tuning the model myself, but I'm not sure which hyperparameters are important.
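For reference, the paper samples each thread's final epsilon from {0.1, 0.01, 0.5}; if I recall correctly, the probabilities are 0.4, 0.3 and 0.3. A minimal sketch of that sampling (not this repo's code; double-check the probabilities against the paper):

```python
import numpy as np

# Per-thread final epsilon as described in the paper; the 0.4/0.3/0.3
# probabilities are quoted from memory, so verify them before relying on this.
FINAL_EPSILONS = [0.1, 0.01, 0.5]
PROBABILITIES = [0.4, 0.3, 0.3]

def sample_final_epsilon():
    return float(np.random.choice(FINAL_EPSILONS, p=PROBABILITIES))

# One value per actor-learner thread:
final_epsilons = [sample_final_epsilon() for _ in range(8)]
print(final_epsilons)
```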

Training in process/core level parallelism

Hi @Zeta36

Great project! I'm trying to run some experiments with the code. It seems that the code currently uses Python threading with TensorFlow, and from my observation the training loop is not truly parallel because it runs on threads instead of processes (the GIL serializes the Python parts). Ideally, each learner would run in its own process to fully utilize a modern machine.

This might be relevant:
http://stackoverflow.com/questions/34900246/tensorflow-passing-a-session-to-a-python-multiprocess

But it looks like bad news: I can't just spawn a bunch of processes and let them share the same TensorFlow session. So maybe a distributed TensorFlow session is what we need:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/g3doc/how_tos/distributed/index.md
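For what it's worth, here is a minimal sketch of what that might look like with the distributed TensorFlow (TF 1.x) API: a parameter-server process holds the shared variables and each learner runs as a separate worker process connected to the same cluster. The job names, ports and the single shared variable are placeholders, not this repo's code.

```python
import tensorflow as tf

# Hypothetical two-worker cluster; addresses and ports are assumptions.
cluster = tf.train.ClusterSpec({
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223", "localhost:2224"],
})

# Each process runs one task; job_name/task_index would normally come from argv.
# (A separate process must start the "ps" task for the session to connect.)
server = tf.train.Server(cluster, job_name="worker", task_index=0)

with tf.device(tf.train.replica_device_setter(cluster=cluster)):
    # Shared parameters (here just a counter) live on the parameter server,
    # so every worker process updates the same variables.
    global_step = tf.Variable(0, name="global_step", trainable=False)
    increment = tf.assign_add(global_step, 1)

with tf.Session(server.target) as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(increment))
```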

exception for thread

When I start the threads, it shows this exception:

Exception in thread Thread-31:
Traceback (most recent call last):
  File "/home/anderson/.conda/envs/tensorflow/lib/python3.6/threading.py", line 916, in _bootstrap_inner
    self.run()
  File "/home/anderson/.conda/envs/tensorflow/lib/python3.6/threading.py", line 864, in run
    self._target(*self._args, **self._kwargs)
  File "", line 141, in actorLearner
    x_t1_col, r_t, terminal, info = env.step(KEYMAP[GAME][action_index])
  File "/home/anderson/Videos/gym/gym/wrappers/time_limit.py", line 31, in step
    observation, reward, done, info = self.env.step(action)
  File "/home/anderson/Videos/gym/gym/envs/atari/atari_env.py", line 68, in step
    action = self._action_set[a]
IndexError: index 5 is out of bounds for axis 0 with size 4
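The IndexError suggests the action index comes from a KEYMAP (or network output) with more actions than the environment actually exposes (index 5 against an action set of size 4). A small sanity check one could run, with an example game id and variable names that are not the repo's:

```python
import gym

# The number of valid actions differs per Atari game; querying the environment
# avoids hard-coding a keymap sized for a different game.
env = gym.make("Breakout-v0")       # example id only
num_actions = env.action_space.n    # e.g. 4 for Breakout, 6 for Pong

# The network's output layer and the keymap should both use this size.
keymap = list(range(num_actions))
print(num_actions, keymap)
```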

Implement the actor-critic methods

Hello,
In the asynchronous DQN paper they also describe an on-policy method, advantage actor-critic (A3C), which achieved better results than the other methods. Do you have any plans to include it in this repo as well?
I am working off this repo as a starting point and attempting to reproduce the A3C results on the continuous action domain, but I am still trying to figure out the network model they used for the physical-state case (MuJoCo) and how the policy gradient is accumulated.
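For the discrete-action case, the gradient the paper accumulates is roughly that of log pi(a|s; theta) * (R - V(s; theta_v)), plus a value-regression term and an entropy bonus. Below is a rough TensorFlow 1.x sketch of those loss terms; the layer sizes, coefficients and names are assumptions, not this repo's code or the paper's exact architecture.

```python
import tensorflow as tf

states = tf.placeholder(tf.float32, [None, 4])     # toy feature vector, not Atari frames
actions = tf.placeholder(tf.int32, [None])
returns = tf.placeholder(tf.float32, [None])       # n-step discounted return R

hidden = tf.layers.dense(states, 64, tf.nn.relu)
logits = tf.layers.dense(hidden, 2)                # policy head pi(a|s; theta)
value = tf.squeeze(tf.layers.dense(hidden, 1), 1)  # value head V(s; theta_v)

log_probs = tf.nn.log_softmax(logits)
chosen_log_prob = tf.reduce_sum(log_probs * tf.one_hot(actions, 2), axis=1)
advantage = returns - value

# Policy gradient uses the advantage as a baseline-corrected return; the value head
# is trained by regression on R; the entropy bonus discourages premature convergence.
policy_loss = -tf.reduce_sum(chosen_log_prob * tf.stop_gradient(advantage))
value_loss = 0.5 * tf.reduce_sum(tf.square(advantage))
entropy = -tf.reduce_sum(tf.nn.softmax(logits) * log_probs)
loss = policy_loss + 0.5 * value_loss - 0.01 * entropy

train_op = tf.train.RMSPropOptimizer(1e-4).minimize(loss)
```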
