async-rl's Introduction

Asynchronous RL in TensorFlow + Keras + OpenAI's Gym

This is a TensorFlow + Keras implementation of asynchronous 1-step Q-learning as described in "Asynchronous Methods for Deep Reinforcement Learning".

Since we're using multiple actor-learner threads to stabilize learning in place of experience replay (which is very memory intensive), this runs comfortably on a MacBook with 4 GB of RAM.

It uses Keras to define the deep Q-network (see model.py), OpenAI's gym library to interact with the Arcade Learning Environment (see atari_environment.py), and TensorFlow for optimization/execution (see async_dqn.py).
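For reference, the core of 1-step Q-learning is the target each actor-learner thread regresses its Q-network toward. Here is a minimal NumPy sketch of that target (illustrative only, not the repo's actual code):

import numpy as np

def one_step_q_target(reward, terminal, next_q_values, gamma=0.99):
    # 1-step Q-learning target: r if the episode ended, else r + gamma * max_a' Q(s', a').
    if terminal:
        return reward
    return reward + gamma * np.max(next_q_values)

# Example: Q-values for the next state, as predicted by the slowly-updated target network.
print(one_step_q_target(reward=1.0, terminal=False, next_q_values=np.array([0.2, 0.5, 0.1])))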

Requirements

Usage

Training

To kick off training, run:

python async_dqn.py --experiment breakout --game "Breakout-v0" --num_concurrent 8

Here we're organizing the outputs for the current experiment under a folder called 'breakout', choosing "Breakout-v0" as our gym environment, and running 8 actor-learner threads concurrently. See the gym environment registry for a full list of game names you can hand to --game.
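If you want to print the available environment IDs yourself, something like the following works with gym versions from that era (a sketch; it assumes gym.envs.registry.all() is available):

import gym

# Print every environment id registered with gym; the Atari ones (e.g. "Breakout-v0")
# are the names this repo accepts via --game.
for spec in sorted(gym.envs.registry.all(), key=lambda s: s.id):
    print(spec.id)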

Visualizing training with TensorBoard

We collect episode reward stats and max Q values that can be visualized with TensorBoard by running the following:

tensorboard --logdir /tmp/summaries/breakout

This is what my per-episode reward and average max Q value curves looked like over the training period.
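For context, those stats are written as TensorFlow scalar summaries, roughly like the sketch below (TF 1.x-style API; the placeholder and summary names here are illustrative, not necessarily the ones in async_dqn.py):

import tensorflow as tf

# Scalars fed in at the end of each episode.
episode_reward = tf.placeholder(tf.float32, name="episode_reward")
episode_avg_max_q = tf.placeholder(tf.float32, name="episode_avg_max_q")
tf.summary.scalar("Episode Reward", episode_reward)
tf.summary.scalar("Average Max Q", episode_avg_max_q)
summary_op = tf.summary.merge_all()

writer = tf.summary.FileWriter("/tmp/summaries/breakout")
with tf.Session() as sess:
    summary = sess.run(summary_op, {episode_reward: 12.0, episode_avg_max_q: 1.7})
    writer.add_summary(summary, global_step=0)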

Evaluation

To run a gym evaluation, set the --testing flag to True and pass in a checkpoint file:

python async_dqn.py --experiment breakout --testing True --checkpoint_path /tmp/breakout.ckpt-2690000 --num_eval_episodes 100

After completing the eval, we can upload our eval file to OpenAI's site as follows:

import gym
gym.upload('/tmp/breakout/eval', api_key='YOUR_API_KEY')

Now we can find the eval at https://gym.openai.com/evaluations/eval_uwwAN0U3SKSkocC0PJEwQ

Next Steps

See a3c.py for a work-in-progress implementation of asynchronous advantage actor-critic (A3C).
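As a rough reminder of what A3C involves: the value head regresses toward discounted n-step returns and the policy gradient is scaled by the advantage. A minimal NumPy sketch of the return/advantage computation (illustrative, not the code in a3c.py):

import numpy as np

def n_step_returns_and_advantages(rewards, values, bootstrap_value, gamma=0.99):
    # Discounted n-step returns R_t and advantages R_t - V(s_t) for one rollout.
    returns = np.zeros(len(rewards))
    R = bootstrap_value  # V(s_{t_max}) from the value network, or 0 if the episode ended
    for t in reversed(range(len(rewards))):
        R = rewards[t] + gamma * R
        returns[t] = R
    advantages = returns - np.asarray(values)
    return returns, advantages

# Example rollout of length 3 with hypothetical rewards and value estimates.
print(n_step_returns_and_advantages([0.0, 0.0, 1.0], [0.4, 0.5, 0.6], bootstrap_value=0.0))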

Resources

I found these super helpful as general background materials for deep RL:

Important notes

  • In the paper the authors mention that "for asynchronous methods we average over the best 5 models from 50 experiments". I overlooked this point when I was writing this, but I think it's important. These async methods seem to vary a lot in performance from run to run (at least in my implementation of them!). It's a good idea to run multiple seeded versions at the same time and average over their performance to get a clear picture of whether an architectural change actually helps (see the averaging sketch after this list). Likewise, don't get discouraged if you don't see good performance on your task right away; try rerunning the same code a few more times with different seeds.
  • This repo has no affiliation with DeepMind or the authors; it was just a simple project I was using to learn TensorFlow. Feedback is highly appreciated.
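
A minimal sketch of averaging per-episode reward curves over several seeded runs (the file names are hypothetical; the repo doesn't write these files out for you):

import numpy as np

# Hypothetical: each file holds the per-episode rewards logged by one seeded run.
runs = [np.loadtxt("breakout_seed%d_rewards.txt" % seed) for seed in range(5)]

# Truncate to the shortest run so the curves line up, then average across runs.
horizon = min(len(r) for r in runs)
mean_curve = np.mean([r[:horizon] for r in runs], axis=0)
print(mean_curve[:10])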

async-rl's People

Contributors

bryant1410, coreylynch, ei-grad, osh, xu-song

async-rl's Issues

About the randomness of the performance

I am currently trying to run your code and get the same performance, but the mean reward is stuck around a score of 5. I have tried to run it three times and I got the same performance each time. The code seems to run fine though.

How random is the performance? How many trials did you do before obtaining the results presented in the README?

pretrained model

Hi @coreylynch, thanks for the awesome project!

I was wondering, do you have the Keras weights of a pretrained agent somewhere? I was looking to do some quick visualizations with Breakout.

Best,
-eder

tf.Variable unexpected keyword 'dtype'

Using TensorFlow backend.
can't determine number of CPU cores: assuming 4
I tensorflow/core/common_runtime/local_device.cc:25] Local device intra op parallelism threads: 4
can't determine number of CPU cores: assuming 4
I tensorflow/core/common_runtime/local_session.cc:45] Local session inter op parallelism threads: 4
INFO:gym.envs.registration:Making new env: Breakout-v0
[2016-07-16 22:50:28,278] Making new env: Breakout-v0
Traceback (most recent call last):
File "async_dqn.py", line 310, in
tf.app.run()
File "/Library/Python/2.7/site-packages/tensorflow/python/platform/default/_app.py", line 11, in run
sys.exit(main(sys.argv))
File "async_dqn.py", line 301, in main
graph_ops = build_graph(num_actions)
File "async_dqn.py", line 173, in build_graph
s, q_network = build_network(num_actions=num_actions, agent_history_length=FLAGS.agent_history_length, resized_width=FLAGS.resized_width, resized_height=FLAGS.resized_height)
File "/Users/nathaniel/Downloads/async-rl-master/model.py", line 10, in build_network
model = Convolution2D(nb_filter=16, nb_row=8, nb_col=8, subsample=(4,4), activation='relu', border_mode='same')(inputs)
File "/usr/local/lib/python2.7/site-packages/keras/engine/topology.py", line 458, in call
self.build(input_shapes[0])
File "/usr/local/lib/python2.7/site-packages/keras/layers/convolutional.py", line 296, in build
self.W = self.init(self.W_shape, name='{}_W'.format(self.name))
File "/usr/local/lib/python2.7/site-packages/keras/initializations.py", line 61, in glorot_uniform
return uniform(shape, s, name=name)
File "/usr/local/lib/python2.7/site-packages/keras/initializations.py", line 33, in uniform
name=name)
File "/usr/local/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 103, in variable
v = tf.Variable(value, dtype=_convert_string_dtype(dtype), name=name)
TypeError: __init__() got an unexpected keyword argument 'dtype'

Reward doesn't go up ....

I ran the async DQN model out of the box with 3 seeds on 7 Atari games with 24 threads -- Pong, Breakout, SeaQuest, BeamRider, SpaceInvaders, Qbert, and Enduro. However, the reward stays the same for all the games up to 11M global time steps. I've also run Breakout up to 30M global steps with 5 seeds and the reward doesn't go up either. Has anybody else had this issue?

OSError, why?

OSError: dlopen(/Users/weyn/anaconda/lib/python2.7/site-packages/atari_py/ale_interface/build/libale_c.so, 6): Symbol not found: __ZNKSt5ctypeIcE13_M_widen_initEv
Referenced from: /Users/weyn/anaconda/lib/python2.7/site-packages/atari_py/ale_interface/build/libale_c.so
Expected in: /usr/lib/libstdc++.6.dylib
in /Users/weyn/anaconda/lib/python2.7/site-packages/atari_py/ale_interface/build/libale_c.so

Do results differ only because of the seed?

You write that one should try experiments with multiple seeds. Did you find that results differ substantially given only different seeds?

I'm asking because in the paper, Mnih et al. take the best 5 out of 50 runs with different learning rates. However, from the paper it's not clear to me whether the methods are sensitive to the choice of learning rate or unstable in general.

No local network synchronization

I'm interested as to why you decided not to create a local copy of the variables in the worker threads and sync them with the global network at the end of the rollout. Does that create issues with the global network (which is used for inference during the rollout) being updated in the middle of a rollout? Is there a reason why you changed your algorithm from the one described in the async methods for RL paper?

ValueError: need more than 4 values to unpack

When I try to run a3c.py, I run into the following problem:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 754, in run
self.__target(*self.__args, **self.__kwargs)
File "a3c.py", line 71, in actor_learner_thread
s, a, R, minimize, p_network, v_network = graph_ops
ValueError: need more than 4 values to unpack
Following the suggested solution on Stack Overflow, I added a comma to the code, but it failed.
I would appreciate it if anyone could help me.

t_max = 32

Hello,

In the A3C paper they state t_max = 5; is there any reason you set it to 32?

Actually, I don't really understand why the batch size should be so small. Why shouldn't we use traditional batch sizes of 128 or more frames? Shouldn't this make learning stronger?

How to speed up training with GPU?

Hey! Thanks a bunch for sharing this.
I've made some attempts at speeding up the training with a GPU, but if there is any increase at all, it's very little. I get about 10 global frames/steps per second when running the algorithm not on ALE but on a very simple Python script I've written myself. I've tried other GPU-compatible DL algorithms and the slowdown doesn't seem to originate from the script I've written. Do you have any idea how to manage this issue?

When are you planning to have A3C FF (Algorithm 2) and A3C LSTM (Algorithm 3) done?

What is your timeline for having n-step Q-learning / A3C FF (Algorithm 2) and A3C LSTM (Algorithm 3) done, as per your next steps, in Keras + TensorFlow? I have some code for a stock trading game that uses standard deep Q-learning with experience replay, but I would like to use A3C LSTM with experience replay as per the research paper. Let me know if you are interested in working to incorporate the stock trading game into your code (I will email you the zip; it is 6 small Python files). It is in Keras + TensorFlow.

Tensorflow outdated

I guess this code is written for an old version of TensorFlow?

x = tf.reshape(x, tf.pack([-1, prod(shape(x)[1:])]))
AttributeError: 'module' object has no attribute 'pack'

Is it possible to update this code to the latest TensorFlow?

Thanks!
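
For what it's worth, tf.pack was renamed to tf.stack in TensorFlow 1.0, so the offending call can be updated one-for-one. A small self-contained sketch (the shapes and variable names are illustrative, not the repo's):

import tensorflow as tf

# tf.pack was renamed to tf.stack in TensorFlow 1.0; the call is otherwise identical.
x = tf.placeholder(tf.float32, shape=[None, 4, 4, 16])
flat_dim = 4 * 4 * 16  # product of the non-batch dimensions
x_flat = tf.reshape(x, tf.stack([-1, flat_dim]))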

RGB image

How can I use the raw RGB image instead of the grayscaled one?
I am having some trouble with the neural network's input shape, which doesn't match the observation shape (84, 84, 3).

clipping

In the code, the rewards returned from the environment are clipped between -1 and 1. But I believe Breakout will give rewards higher than 1 for bricks in rows nearer the top. What is the rationale for clipping?
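
For reference, the clipping being asked about amounts to something like this (a sketch, not the exact line in the repo):

import numpy as np

# Clip every reward into [-1, 1] so the scale of the updates is comparable across games.
reward = 7.0  # e.g. a brick from one of the upper rows in Breakout
clipped_reward = np.clip(reward, -1.0, 1.0)
print(clipped_reward)  # prints 1.0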
