kaixhin / atari Goto Github PK

View Code? Open in Web Editor NEW

263.0 263.0 74.0 626 KB

Persistent advantage learning dueling double DQN for the Arcade Learning Environment

License: MIT License

Lua 96.28% Shell 3.72%

deep-learning deep-reinforcement-learning

atari's People

Contributors

Stargazers

Watchers

atari's Issues

About A3C

I just read the code and find that only policy network(critic) is trained,while there is no value work(actor).So I think it is a policy based RL network rather than a actor-critic network?Maybe I don't understand A3C right or maybe I just don't read the code carefully enough.

Hi，Kaixhin.Ask you for a question

In profile of experience class,I can't understand the transition for creating QPrimes. I think transition[i][1] is always zero. transition[i][histlen] is transition[i][4]??? Could you explain it.Thank you so much!!!

Partition number and segments

@Kaixhin Thank you for your excellent work!

I did not quite understand why partition number is not the same as segments number k, which is the batch size suggested in the PER paper, in your implementation. Could you please help me understand this?

Why is the current sharedRmsprop thread safe?

Hi,

I've read the discussion per #15 and #50, but I still don't understand why the current sharedRmsprop impl avoids thread racing? Actually, the code still occasionally outputs NaN on my machine unless I set the thread be one. By tracking the error I can tell it is due to these two lines:

 state.g:mul(momentum):addcmul(1 - momentum, dfdx, dfdx)
 state.tmp:copy(state.g):add(epsilon):sqrt()

as state.tmp can become zero while being divided in the rest code and I guess the zeros are due to state.tmp:copy(state.g) from other thread where state.g happens to include 0s...

Meanwhile, by changing them to

  state.g:mul(momentum):addcmul(1 - momentum, dfdx, dfdx)
  state.tmp:sqrt(torch.add(state.g, epsilon))

the error seems to disappear.

It my modification reasonable? Or I just have to update OpenBLAS or something?

actor-critic based

I look through the code and find that it seems the A3C algorithm has not been implemented?Will you release the A3C code based on Torch?

gnuplots memory unreleased

The gnuplots created by the simulations stay in the RAM as long as the simulation is running, effectively acting as a memory leak.
Item already appears in a TODO comment.

Can I convert rank-based prioritized experience replay to a python version

I want to implement the prioritized by python, and now I find your torch version and get inspired.
So can I convert rank-based prioritized experience replay to a python version, maybe some place will refer to your code?

Exploration with pseudo counts

New paper with method that performs well on Montezuma's revenge. Implementation could be used with both DDQN ER and async A3C. The probability used for the pseudo count is computed using Context Tree Switching that could be implemented based on this implementation.

Implement asynchronous methods

http://arxiv.org/pdf/1602.01783v1.pdf describes asynchronous methods using off policy (1 step /n step Q learning) and even on policy (sarsa and advantage actor-critic (A3C)) reinforcement learning.

These algorithms converge faster with less resources (cpu only multithreaded on a single machine without using a large replay memory) and can achieve better results than other methods.

I think the Hogwild method they use for the lockfree updating of the shared network can be implemented with Torch/Lua threads.sharedserialize

Allow non-visual environments

As an enhancement to #26, it would be great for the network to take in any kind of state as specified by the rlenvs API, as long as the state is a single tensor. This is slightly restrictive, but with the -modelBody option a lot can still be done.

Major changes would be to a) make visual preprocessing more explicit as options and b) have environments output a 3D array as part of a getDisplay method so that visuals can still be provided. An option also needs to added to discretise/not discretise the experience replay memory.

A "roadmap" can be seen in the readme of the nonvis branch.

Test different implementations

Async A3C Network Outputs NaN

Fresh Torch7 install here on Linux Mint 17 (not using CUDA). I can run all of the demo examples (demo, demo-grid, demo-async, and demo-async-a3c) without issue. Regular DQN and async-nstep also run without issue on Montezuma's Revenge. However, when running async-a3c, I get an error bad argument #2 to '?' (invalid multinomial distribution (sum of probabilities <= 0) at <torchPath>/lib/TH/generic/THTensorRandom.c:120) shortly after training begins.

The problem occurs at A3CAgent.lua, line 54 -- my own print statements have confirmed that the outputs of the network (probability, obtained on the previous line) are all NaN. Adding NaN checks in Model.lua showed that NaNs are being found in the nn.SpatialConvolution 64x64 layer after only a few iterations of training. The problem occurs intermittently (you may need to run it several times before getting the error).

Neither an update nor complete reinstall of Torch solved the issue. I have verified that the inputs to the network (passed into A3CAgent.lua, line 54 as state) are between 0 and 1, and it does not appear as if any of the training gradients in A3CAgent:accumulateGradients() are producing Inf or NaN.

The issue also occurs when running on a Redhat cluster GPU. Any thoughts?

Finish prioritised experience replay

Rank-based prioritised experience replay appears to be working, but technically needs some changes. Instead of storing terminal states with a priority of 0, they should not be stored at all. This requires more checks, as the elements in the experience replay memory and the elements in the priority queue will differ.

Secondly, proportional prioritised experience replay still needs to be implemented. See here and here for an implementation of the sum binary tree.

For reference, below are results from a working implementation of rank-based PER on Frostbite:

Unify ER and Async validation logic

The validation code is duplicated in async/ValidationAgent and Validation because of differences in agent design, but the code is structured such the two validators can be unified with some work. async/ValidationAgent does not plot tdErr and loss, so unification would add that for async as well.

Hierarchical Dqn

Interesting paper http://arxiv.org/abs/1604.06057 that tackles Montezuma's revenge, where Dqn does not work at all because of the delayed sparse rewards, it requires a longer strategy to be followed without any external rewards. They solve this by setting intrinsic goals that can be set in a general way by auto detecting shapes and rewarding their proximity. Their solution for defining the goals from the shapes is handcrafted for this game, but probably could be done in a general way across games.

Not a straightforward implementation issue though...

Refactor DQN train function into separate functions

Make function less monolithic by factoring out update rules e.g. persistent advantage learning.

Refactoring master before async merge

Before the async methods from async branch are merged master could be refactored to allow for reusing most of the code between async and ER methods. Also I believe this refactoring would improve the structure of the existing ER solution in itself. Currently single files (classes) contain functionality for multiple aspects, would be better to create objects dealing with one aspect at a time. I already started this approach in async if followed in master as well merge will be smooth and easy.

I suggest the following:

main would only contain option parsing. Currently it is also responsible for

training the ER Agent
validating the Agent
evaluating or demoing the Agent with one episode

These I would separate in classes

ERTrainer training loop
ValidationAgent validation loop, also the validation related functionality would be moved here from the ER Agent, ie. stats, report, saliency. This agent would also work with the async agents in its own thread in async mode later.
EvaluationAgent showcasing, movie making

So main would parse the options and then either create a

a single threaded ERTrainer + ValidationAgent
an EvaluationAgent
later the multithreaded async training / evaluation

I would not share the ER Agent code with async agents (1/n Q and A3C) for now. In the async agents I took the approach of having separate (subclassed) implementations of different methods to make it easier to look at one algorithm at a time. The ER Agent contains many methods, but all based on ER logic. I think this separation is fine. Much of the simplicity of async code comes from not having to deal with sampled batches.

Splitting up main is not strictly necessary the way I'm suggesting, we only gain the reuse of validation logic with it. Other parts can be still reused from async (eg. Model, CircularQueue, BinaryHeap) without doing this, but I do believe doing it would improve the ER code as well.

I would submit multiple PRs for the suggested steps, so would be easier to look at what's going on. @Kaixhin let me know if you like this or have other idea!

Possible improvements on speeding up

I found some possible speedup improvements which may speed the training time by some constant, list them as follows.

In Atari/async/AC3Agent.lua:53,
```
local V, probability = table.unpack(self.policyNet_:forward(state))
```
Maybe we can store the V and probability as we store the actions in self.actions, then in the gradient accumulating phrase, the redundant forwarding at line 87 can be avoided.
In Atari/async/NstepQAgent.lua, similar with 1, if we can store the Q values, we can avoid the redundant forwarding at line 81.

A problem about logging is that if the running of async model suspends, although it can be resume from the save data, but the plotted statics graph will not include the previous scores.

And by the way, I notice the comments at AsyncMaster.lua:146-147, I wonder maybe it's due to the serialization, as the threads package doc has addressed the problem. Maybe move the requires to the torchSetup will solve it.

How to process with the salient map?

Hi, I notice that the salient map is just the gradients of the guided backpropagation w.r.t. the input image.
And I did not catch any further processing on this gradient (I am not familiar with torch). Thus I was wondering why the scale of the gradients do not affect the plotted image. I mean the gradients' scale might be relative smaller or larger than the original image's scale, should merely putting the gradients in the red channel be unreasonable? Do we need to rescale the gradients to the same as the images?

Many thanks.

Implement Retrace(λ)

Safe and efficient off-policy reinforcement learning implements this new algorithm with experience replay, but actually uses asynchrononous agents with experience replay for testing (the combination was going to happen soon enough). Which means that this repo is a unique position of having both components already implemented.

correct SarsaAgent

Currently the Sarsa agent is not exactly as described in the async paper, should be corrected and tested again.

Agent.valMemory and validate()

In the Agent there is a valMemory filled, but when validation() prepares the indices that reference valMemory and calls learn(), there the indices are always used with memory and not valMemory.

I assume the intention was to use valMemory when called from validate().

Not much of an issue, only makes a difference for prioritised replay, otherwise memory and valMemory have the same data anyway.

Disagreements with the async paper

I am a newcomer to DRL and have just read through the code of this project, it is really an amazing job! As the title says, I found some implementations maybe different from the setting of the Asynchronous Methods for Deep Reinforcement Learning paper, list them as follows.

The network setting. By the part 8. Experimental Setup of the paper, it applies a smaller neural network, I think the body maybe like this

net:add(nn.SpatialConvolution(histLen*self.stateSpec[2][1], 16, 8, 8, 4, 4))
net:add(nn.ReLU(true))
net:add(nn.SpatialConvolution(16, 32, 4, 4, 2, 2))
net:add(nn.ReLU(true))
net:add(nn.Linear(#sizeof_previous_layer, 256))
net:add(nn.ReLU(true))
net:add(nn.Linear(256, #moves))

but the project use the standard DQN's network, much bigger than the proposed network.

No action repeat? The paper says they apply an action repeat of 4, but I have not found in the code... Is it because the environment automatically does this?
In Atari/async/A3CAgent.lua:90,
```
self.vTarget[1] = -0.5 * (R - V)
```
I have not figure out why there is a 0.5, but anyway it's no problem to keep it...

In Atari/async/Qagent.lua:6-7

local EPSILON_ENDS = { 0.01, 0.1, 0.5}
local EPSILON_PROBS = { 0.4, 0.7, 1 }

the probability of setting epsilon is different from the paper, should it be like this?

local EPSILON_ENDS = { 0.1, 0.01, 0.5}
local EPSILON_PROBS = { 0.4, 0.7, 1 }

And below might be a small but essential bug on asyncAgent using Q learning.

Look the line 59 and 67 of Atari/async/QAgent.lua,

56:function QAgent:eGreedy(state, net)
57:  self.epsilon = math.max(self.epsilonStart + (self.step - 1)*self.epsilonGrad, self.epsilonEnd)
58:
59:  if self.alwaysComputeGreedyQ then
60:    self.QCurr = net:forward(state):squeeze()
61:  end
62:
63:  if torch.uniform() < self.epsilon then
64:    return torch.random(1,self.m)
65:  end
66:
67:  if not self.alwaysComputeGreedyQ then
68:    self.QCurr = net:forward(state):squeeze()
69:  end
70:
71:  local _, maxIdx = self.QCurr:max(1)
72:  return maxIdx[1]
73:end

should them be exchanged? Otherwise if set self.alwaysComputeGreedyQ = false, the eGreedy on the contrary will always Compute Greedy Q.

bootstraps version when evaluating

Hi Kaixhin,
it's a really amazing repo here what you did, it's really great work!

One small problem here:
opt.bootstraps > 0 and when in evaluation mode, in Agent.lua line 175:
aIndex = torch.mode(QHeadsMaxInds)

Here, when running on gpu, QHeadsMaxInds is a cudaTensor, but "mode" is not implemented in cutorch, so this line might be changed to:
aIndex = torch.mode(QHeadsMaxInds:reshape(QHeadsMaxInds:size()[1]):float())[1]

Can you maybe look into this issue?
Also any plan on adding on this paper: Asynchronous Methods for Deep Reinforcement Learning ? :)

Multi gpu support

I got some speedup by using dual gpus dedicating one for the policy, other for the target network, so policy and target network forwards can be done in parallel (in the vanilla dqn), but not the policy backward, so target gpu is not utilized that much, speed up not 2x, rather 30%. At least there's no need to synchronize the networks between gpus as target net is seldom updated. The advanced methods do more forwards, so may benefit more.

I'll measure the speed gain with the dqn convnet and will add this simple multigpu method if makes sense. Of course one could do a complete scalable parallel solution like the Gorila, but only makes sense if someone can afford a multigpu cluster for running Atari Dqn.

Decouple Catch vs. Atari from code

At the moment the codebase supports 2 environments from rlenvs - Catch and Atari. To make the code cleaner, and to make it easier for others to develop from it, the code to support the difference in environments should be made more generic.

Implement optimality tightening

Learning to Play in a Day: Faster Deep Reinforcement Learning by Optimality Tightening potentially speeds up Q-learning by an order of magnitude! Apparently not too hard to implement either.

Implement Pop-Art

Learning functions across many orders of magnitudes introduces Preserving Outputs Precisely, while Adaptively Rescaling Targets (Pop-Art). In summary it normalises outputs across orders of magnitudes and gets rid of the clipping (i.e. counting) rewards heuristic for Atari games. The normalisation is also better for non-stationary problems, i.e., any decent real world problem.

The below is a picture of extra notes from the authors, next to their poster at NIPS 2016:

OpenAI integration

Would be interesting to be able to run Atari against the OpenAI gym environment. Currently OpenAI is in python, maybe would be possible to have a pytorch runner in Atari that integrates the two. Or also there could be general torch support for OpenAI, ie. this issue with fb-python.

(btw with the Deepmind capitulation things are going towards python/tf, not that I would follow at this point...)

Load models like environments

Currently models are loaded by checking for file existing on the provided path. If a class name is given this check will fail as paths.filep doesn't check LUA_PATH. Handling this like environments will add some consistency and allow users to more easily store their models in external repositories.

Atari/Model.lua

Line 73 in 1e48018

if paths.filep(self.modelBody .. '.lua') then

Will put together a PR where Atari and Catch models are placed in separate files and modelBody is setup in the same way env is currently.

Add LSTM support to all async modes

Only OneStepQ is supported, but should be supported in all. Especially A3C!

Recurrent Dqn

One central element of the Atari DQN is the use of 4 consecutive frames as input making the state more Markov, ie. having the vital dynamic movement information. This paper http://arxiv.org/abs/1507.06527v3 discusses DRQN: the multiframe input can be substituted with LSTM with the same effect (but no systematic advantage for one or the other). Also the Deepmind async paper mentions using LSTM instead of multi frame inputs for more challenging visual domains (Torcs and Labyrinth).

I think this would fit well in this codebase, I'll try to contribute this at one point.

stateBuffer issue with Catch on CPU

I noticed the CircularQueue stateBuffer will have the same state for all 4 steps when running catch on CPU. Ale is fine because Model:preprocess() returns a scaled copy of the screen, but for catch its returning the same screen tensor that is then put in the table in the CircularQueue at different positions without copying.

If running on the gpu it happens to be fine, because tensor:typeAs(self.queue[1]) in CircularQueue will make a copy, because the tensor types differ.

This could be another reason of poor CPU performance I saw (I thought its the random seeding). For now I put return observation:clone() in preprocess in my local async source to fix, but you may want to fix this differently.

What is the actual performance?

I would like to know what scores do the agents achieve.
Could this information be provided in the README?

Implement rank-based prioritised experience replay

Requires a "sum tree" binary heap for efficient execution.

Edit 2016-06-02: Please keep to the contributing guidelines.

Questions about training A3C

I am trying to use A3C to deal with the task of action detection.When training A3C,I found that the gradient of actor network is too large,mainly because
self.policyTarget[action] = -(R - V) / probability[action]
the "probability[action]"may be too small.After about 3 thousand times of training process(backprop),the network tend to do the same action?
May you give me some advice about how to fix this?
Thank you so much.

Implement Memory Q-networks

Control of Memory, Active Perception, and Action in Minecraft introduces a memory Q-network (MQN) and recurrent MQN (RMQN), based on a relatively simple key-value soft attention memory. These could feasibly be added as model options, although the feedback RMQN (FRMQN) would add another level of complexity to the architecture.

Edit 2016-06-14: DeepMind have been working on something similar. Expect to see something from them in the future.

problem in Agent.lua

I have ran the default experiment like
th main.lua
when it comes to the 50000th episode, it tells me that in Agent.lua 359 line
self.tdErr = torch.max(torch.cat(tdErrAL, self.tdErr:csub((self.VPrime:csub(QPrime):mul(self.PALpha))), 3), 3):view(N, 1)
the input of :view is 32 * 10 * 1 while the output is 32*1

so I change the code like :view(N , 10 , 1)

is that OK or I do the wrong thing?

Fix bootstrapped DQN

The test on Beam Rider is failing badly, and does not look promising.

kaixhin / atari Goto Github PK

atari's People

Contributors

Stargazers

Watchers

Forkers

atari's Issues

Recommend Projects

Recommend Topics

Recommend Org