awjuliani / deeprl-agents

A set of Deep Reinforcement Learning Agents implemented in Tensorflow.

License: MIT License

Jupyter Notebook 96.32% Python 3.68%

reinforcement-learning tensorflow

deeprl-agents's Issues

DQN python3

see dqn.zip below

corrected code from Deep-Recurrent-Q-Network.ipynb

In Python 3, zip returns a lazy zip object rather than a list, so it needs to be converted to a list:
episodeBuffer = list(zip(bufferArray))
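A minimal sketch of why the conversion matters, using a made-up two-row buffer (the name buffer_array and its contents are placeholders, not the notebook's data):

```python
# Hypothetical stand-in for an episode buffer: a list of transition rows.
buffer_array = [[0, 1, 0.5], [1, 0, -1.0]]

# In Python 3, zip() returns a lazy zip object without list methods.
lazy = zip(buffer_array)
has_append = hasattr(lazy, "append")

# Wrapping it in list() materialises the items, so append() works again.
episode_buffer = list(zip(buffer_array))
episode_buffer.append(([2, 1, 0.0],))
```

The same wrapping should apply wherever the helper code indexes into or appends to a zip result.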

corrected helper.py

dqn.zip

Setting networks to be equal

the statement:
updateTarget(targetOps,sess) #Set the target network to be equal to the primary network.
uses the op list targetOps which is built with tau=0.001.

Therefore the networks are not the same after the op is executed.

Can you confirm this issue? Does it have an impact on the results?
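To make the concern concrete, here is a small NumPy sketch of the interpolation update. The function name soft_update and the toy weight vectors are assumptions for illustration, not code from the notebook:

```python
import numpy as np

def soft_update(target, primary, tau):
    """Move the target weights a fraction tau of the way towards the primary weights."""
    return tau * primary + (1.0 - tau) * target

primary = np.array([1.0, 2.0, 3.0])
target = np.zeros(3)

# With tau=0.001 the target only moves 0.1% of the way: the networks are not equal.
after_soft = soft_update(target, primary, tau=0.001)

# With tau=1.0 the same formula becomes a plain copy.
after_hard = soft_update(target, primary, tau=1.0)
```

Under this reading, an op list built with tau=0.001 cannot make the two networks equal in one call; a separate op list built with tau=1 would.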

Another thing:
The idea of ending up, after the update, with a network whose weights are a convex combination of the target and main network weights seems a bit odd to me, even if it comes from a DeepMind paper. Starting from the same initialization and using very small update weights it might work, but in general I would not expect it to work at all. Interpolating between network weights from different runs can disrupt performance. (I will have a look at the paper, though I would love to hear your comments as an expert in this field.)

Learning rates and initial random weights

Dear all

In the A3C paper (Mnih et al.), it is mentioned that:

Figure 2: This shows that A3C is quite robust to learning rates and initial random weights

However, I think the performance is affected by learning rates and initial random weights! In other words, it is not robust!

For example see #18

What do you recommend?

Segmentation fault (core dumped)

Hi,
I have read your blog posts about RL and like them very much.
When I tried to run A3C-Doom, I got a Segmentation fault (core dumped) error right after the terminal printed "starting workers".
My computer has 32 GB of memory and two E5 CPUs.
This error has been troubling me a lot, and I wonder if you could give me some advice.
Thanks.

How to set hyper-parameters? "The right recipe!"

@DMTSource
@awjuliani

Is there a way to set hyper parameters?

Reward value
Parameter Initialization method
LSTM length
Learning rate
Optimizer (Adam or RMSProp)
Gradient Clipping value

After tens of experiments, I found that any tiny change in one of these affects the whole training dramatically, usually for the worse.

A grid search over these parameters is also not practical, because a single experiment may take hours or days and cost a lot of money.

One trick I usually use is a large network with dropout to reduce or eliminate overfitting, but what about all of the above?

Another trick: adjust the learning rate so that learning rate * gradient is about 1e-3 of the parameters. (In other words, make each parameter update around 1/1000 of the parameter value, to prevent updates that are too large or too small.)
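The update-size rule of thumb can be checked numerically. This is a sketch for a plain SGD step only; with Adam or RMSProp the actual step differs from learning rate * gradient, and the function name and toy arrays are assumptions:

```python
import numpy as np

def update_ratio(params, grads, learning_rate):
    """Norm of a plain SGD step relative to the norm of the parameters."""
    step_norm = learning_rate * np.linalg.norm(grads)
    return step_norm / (np.linalg.norm(params) + 1e-12)

# Toy values: 100 parameters of 0.5 each, 100 gradient entries of 2.0 each.
params = np.full(100, 0.5)
grads = np.full(100, 2.0)

# Here a learning rate of 2.5e-4 gives a ratio of about 1e-3.
ratio = update_ratio(params, grads, learning_rate=2.5e-4)
```

Logging this ratio per layer during training is one way to see whether a learning rate is far too large or too small before committing to a long run.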

What do you recommend?

A3C Basic Doom: effect of different Buffer sizes (Discuss)

Dear @awjuliani

I would like to share results when using different buffer sizes (for all workers or per worker), to understand how buffer size may affect the convergence of an agent.

In other words, how to select buffer size?
What are other factors that should be adjusted given certain buffer size?
Is it correct to give each worker its own buffer size?

Your comments and suggestions are very important!

1- The original case, buffer length = 30 for all agents
The result looks smooth and the agent starts to converge at 200 (it seems that 30 is the magic number for this scenario!)

self.experience_length = 30
.
.
.
if len(episode_buffer) == self.experience_length and d != True and episode_step_count != max_episode_length - 1:

2- buffer size = 50 for all agents
Delayed convergence (around 400) compared to buffer size = 30 (around 200). In this case, the agent has a longer memory to analyze sequences and sees more frames through time.

3- buffer size = 10 for all agents
It seems that only 10 frames is too short for the agent to learn!

4- buffer size = 25 for agent_0, 30 for agent_1, 35 for agent_2 and so on
Here, each worker has its own buffer size. Convergence is delayed (after 400) compared to 30 (around 200).

Actually, I thought this setup would give better results, but it was worse!

Warning Converting sparse IndexedSlices to a dense Tensor of unknown shape on Vanilla Policy example

UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

Any ideas? It still seems to work, though.

A3C Doom: Health Gathering

Hi Arthur

Instead of the basic scenario, I used the health_gathering.cfg scenario

Where:

        game.set_doom_scenario_path("health_gathering.wad")
        game.set_screen_resolution(ScreenResolution.RES_160X120) 
        game.set_screen_format(ScreenFormat.GRAY8)
        game.set_render_hud(False)
        game.set_render_crosshair(False)
        game.set_render_weapon(False)
        game.set_render_decals(False)
        game.set_render_particles(False)
        
        game.add_available_button(Button.TURN_LEFT)
        game.add_available_button(Button.TURN_RIGHT)
        game.add_available_button(Button.MOVE_FORWARD)

        game.add_available_game_variable(GameVariable.HEALTH)

        game.set_episode_timeout(2100)
        game.set_episode_start_time(10)

        game.set_window_visible(False)
        game.set_sound_enabled(False)
        
        game.set_living_reward(1) # Each step is good for you!
        game.set_death_penalty(100) # And death is not!
        
        game.set_mode(Mode.PLAYER)
        game.init()
        self.actions = [[True,False,False],[False,True,False],[False,False,True]]
        #End Doom set-up
        self.env = game

and

r = self.env.make_action(self.actions[a]) * 1.0

It seems that the agent is not learning!

I am thinking of using HEALTH to help the agent:

r = self.env.get_game_variable(GameVariable.HEALTH)

OR, to reshape the reward:

r = r + self.env.get_game_variable(GameVariable.HEALTH)

Or should I simply train for longer?!

What do you think?

Model-Network occasionally outputs unreasonably big mean reward

I copied your code into a Python file and ran the simulation several times.
Usually it works fine, but occasionally the mean reward becomes very large.
Below is a copy of the output log produced by your code:

 World Perf: Episode 247.000000. Reward 35.333333. action: 0.000000. mean reward 35.000038.
 World Perf: Episode 250.000000. Reward 29.333333. action: 1.000000. mean reward 34.979885.
 World Perf: Episode 253.000000. Reward 39.666667. action: 0.000000. mean reward 34.893707.
 World Perf: Episode 256.000000. Reward 21.000000. action: 1.000000. mean reward 34.590328.
 World Perf: Episode 259.000000. Reward 62.333333. action: 0.000000. mean reward 34.643253.
 World Perf: Episode 262.000000. Reward 40.666667. action: 1.000000. mean reward 34.418655.
 World Perf: Episode 265.000000. Reward 31.000000. action: 1.000000. mean reward 34.128536.
 World Perf: Episode 268.000000. Reward 25.000000. action: 1.000000. mean reward 3763953194369116274688.000000.
 World Perf: Episode 271.000000. Reward 50.333333. action: 0.000000. mean reward 3689050732741573738496.000000.
 World Perf: Episode 274.000000. Reward 20.333333. action: 0.000000. mean reward 3615638681115714125824.000000.
 World Perf: Episode 277.000000. Reward 26.666667. action: 1.000000. mean reward 3543687766093959528448.000000.
 World Perf: Episode 280.000000. Reward 44.000000. action: 0.000000. mean reward 3473168432803755327488.000000.
 World Perf: Episode 283.000000. Reward 19.000000. action: 1.000000. mean reward 3404052533747430457344.000000.
 World Perf: Episode 286.000000. Reward 59.666667. action: 1.000000. mean reward 3336311921427313852416.000000.

It seems that the predicted reward of the model sometimes becomes too large.
Do you know what the problem is?
Is it just that the model fails to learn in some cases?

Why not support Python 3?

Hi, I ran your code with Python 3 and found that I had to change a few things, such as / to // and print 'xxx' to print('xxx').

Why not import from __future__ so that the code supports both 2.7 and 3?

Also, in GridWorld, when the env takes an action, does it always return done as False?

A3C episode length unbounded

Episodes are not terminated when episode_step_count exceeds max_episode_length.

It is not necessarily a problem, but as both summary writing and model saving are dependent on episode_count, these actions become less frequent if the episodes get too long.

In my case (I'm using the algorithm with a different environment, not Doom), as the agent got better, episodes exceeded 8000 steps, so that really influenced the model saving frequency.
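One possible way to bound the episode, sketched without any Doom or TensorFlow dependency. The names run_episode and step_fn are hypothetical; step_fn stands in for one environment step and returns True when the episode is done:

```python
def run_episode(step_fn, max_episode_length):
    """Run until the environment reports done or the step cap is reached."""
    steps = 0
    done = False
    while not done and steps < max_episode_length:
        done = step_fn()
        steps += 1
    return steps

# A stand-in environment that never terminates on its own.
def never_done():
    return False

steps = run_episode(never_done, max_episode_length=300)
```

With a cap like this, episode_count keeps advancing at a predictable rate, so summaries and checkpoints tied to it are written regularly even when the agent survives for a long time.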

IOError: [Errno 2] No such file or directory: './Center/log.csv'

Any ideas my friend? Trying to run Deep-Recurrent-Q-Network.ipynb

File "D-R-Q.py", line 153, in <module>
with open('./Center/log.csv', 'w') as myfile:
IOError: [Errno 2] No such file or directory: './Center/log.csv'
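open() with mode 'w' creates the file but not its parent directory, so the error is what one would expect if ./Center does not exist yet. A minimal sketch of creating the directory first; it uses a temporary base directory so it can run anywhere, and the CSV header is a placeholder:

```python
import os
import tempfile

# A temporary base directory stands in for the notebook's working directory.
base_dir = tempfile.mkdtemp()
center_dir = os.path.join(base_dir, "Center")

# Create the directory before opening a file inside it.
if not os.path.isdir(center_dir):
    os.makedirs(center_dir)

log_path = os.path.join(center_dir, "log.csv")
with open(log_path, "w") as myfile:
    myfile.write("episode,reward\n")  # placeholder header
```

In the script itself, the equivalent would be creating './Center' before the open call.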

What is `grad_norms` in AC_Network?

Hi,

I came across your A3C implementation and found the following 2 lines in AC_network.py:

self.var_norms = tf.global_norm(local_vars)
grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)

I wonder what grad_norms is for? It seems to me that it is not used.
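For reference, tf.clip_by_global_norm returns two things: the clipped tensors and the global norm measured before clipping. The second value is handy for monitoring even if the training step never reads it. A NumPy sketch of the same computation, written for illustration and not taken from the repo:

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Rescale all gradients together so their joint L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Toy gradients with a joint norm of 50.
grads = [np.array([30.0, 40.0]), np.array([0.0])]
clipped, grad_norm = clip_by_global_norm(grads, clip_norm=40.0)
```

Here grad_norm is the pre-clipping norm (50), while the clipped gradients are scaled by 40/50 so that their joint norm equals the threshold.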

Thanks!

A3C: Doom: cfg

Is it better to use basic.cfg:
game.load_config("basic.cfg")

where the wad and cfg files can be found at:
https://github.com/mwydmuch/ViZDoom/tree/master/scenarios

instead of:
game.set_screen_resolution(ScreenResolution.RES_160X120)
game.set_screen_format(ScreenFormat.GRAY8)
game.set_render_hud(False)
game.set_render_crosshair(False)
game.set_render_weapon(True)
game.set_render_decals(False)
game.set_render_particles(False)
game.add_available_button(Button.MOVE_LEFT)
game.add_available_button(Button.MOVE_RIGHT)
game.add_available_button(Button.ATTACK)
game.add_available_game_variable(GameVariable.AMMO2)
game.add_available_game_variable(GameVariable.POSITION_X)
game.add_available_game_variable(GameVariable.POSITION_Y)
game.set_episode_timeout(300)
game.set_episode_start_time(10)
game.set_window_visible(False)
game.set_sound_enabled(False)
game.set_living_reward(-1)
game.set_mode(Mode.PLAYER)

A3C-Doom w MultiRNNCell?

Wondering if anyone's had success implementing a deep RNN in AC_Network()? At first glance it looks as easy as

cell = tf.nn.rnn_cell.LSTMCell(256, state_is_tuple=True)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * 2, state_is_tuple=True)
# same stuff as before

but the output is now a tuple of tuples, which will change handling of c_in & h_in placeholders and feed_dicts throughout Worker(). Definitely a support request rather than bug, just I've tried my hand at this for a couple days & my newbiness is blocking me.

Simple Policy Faulty Loss Function

Your loss function for the simple policy doesn't really make sense

"Loss=-Log(pi)*A"

If you have a weight of .9 and reward of 1
your loss is .045.

but if you have a weight of .9 and your reward is 3
your loss increases to .09 .

So the only reason your function works at all is that you only assign a single amount of reward.
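A quick numeric check of the figures above, as a sketch. Here "weight" is read as the probability pi of the chosen action, which is an assumption about what the poster meant:

```python
import math

def policy_loss(prob, advantage):
    """Loss = -log(pi) * A, using the natural logarithm."""
    return -math.log(prob) * advantage

loss_r1 = policy_loss(0.9, 1.0)   # about 0.105
loss_r3 = policy_loss(0.9, 3.0)   # exactly three times loss_r1

# The 0.045 figure quoted above is close to what a base-10 log gives.
loss_r1_log10 = -math.log10(0.9)  # about 0.046
```

The loss value does scale with the reward, but the optimizer follows the gradient rather than the value. The gradient with respect to pi is -A/pi, so a larger advantage gives a proportionally stronger push towards that action, which is the usual behaviour of REINFORCE-style updates.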

Solving NaN values

Dear all
Dear @DMTSource

Sometimes NaN values appear and the agent just collapses. I used the following code to solve this issue.

After policy ...

          self.policy = slim.fully_connected(rnn_out,a_size,
               activation_fn=tf.nn.softmax,
               weights_initializer=normalized_columns_initializer(0.01),
               biases_initializer=None)

just add this line (to prevent log(0) )

self.policy += 1e-7

This line should protect the following code ...

self.entropy = - tf.reduce_sum(self.policy * tf.log(self.policy))
self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs)*self.advantages)
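The effect of the small offset can be reproduced in NumPy. This is only a sketch of the numerical issue, with a made-up saturated policy; it is not the TensorFlow graph itself:

```python
import numpy as np

def entropy(policy, eps=0.0):
    """Entropy -sum(p * log p), with an optional offset added to p first."""
    p = policy + eps
    with np.errstate(divide="ignore", invalid="ignore"):
        return -np.sum(p * np.log(p))

# A saturated softmax output: one action has probability exactly 0.
saturated = np.array([1.0, 0.0, 0.0])

raw = entropy(saturated)            # 0 * log(0) evaluates to nan
shifted = entropy(saturated, 1e-7)  # finite once the offset is added
```

Clipping the probabilities to a small positive minimum before taking the log (for example with tf.clip_by_value) is an alternative with the same purpose.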

If you have any comments and suggestions please share ...

invalid value encountered in less

Sometimes when working with new levels or environments I get the error:

RuntimeWarning: invalid value encountered in less
a = np.random.choice(a_dist[0],p=a_dist[0])

I'm guessing there is a NaN in the network? This usually occurs after many steps. I have tried strong and weak gradient clipping, and my rewards are always within a small range such as ~(0-1). My gradient norms tend to grow with time and soar past the clipping value.

Any suggestions for how to prevent this issue with the network?
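This does not prevent the NaN, but failing early makes it easier to find the step where the policy output first goes bad. A sketch with a hypothetical helper sample_action; it samples an action index directly, which differs slightly from the snippet above:

```python
import numpy as np

def sample_action(action_probs, rng):
    """Sample an action index, refusing to sample from a broken distribution."""
    if not np.all(np.isfinite(action_probs)):
        raise ValueError("policy output contains NaN or inf: %r" % (action_probs,))
    return rng.choice(len(action_probs), p=action_probs)

rng = np.random.RandomState(0)
action = sample_action(np.array([0.2, 0.5, 0.3]), rng)
```

Catching the ValueError in the worker loop and dumping the last observations and gradient norms would give a starting point for debugging.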

A possible typo?

In https://github.com/awjuliani/DeepRL-Agents/blob/master/Vanilla-Policy.ipynb, there's a comment:

#Get our reward for taking an action given a bandit.

That tutorial uses the CartPole-v0 environment right? I don't think there is a bandit in that problem :-)

Are you sure Deep recurrent notebook is correct?

In the notebook I don't see where your recurrent Q value model gets its trace dimension. You're just reshaping the output of a convnet and feeding this directly into an LSTM. Furthermore, should you not also provide the non-zero initial state determined at play time? I.e. the internal state should be stored in the experience buffer and used during training. Correct me if I'm wrong, please.

Fails to learn

I ran the notebook without any changes on the vizdoom environment. After around an hour the reward became non-negative and peaked at around 0.7, but continuing to run the code resulted in the reward going back to -3.0 (I assume the most negative reward possible) and remaining stagnant for over 24 hours. A view of the produced gifs shows the agent walking to the left continuously without choosing any other action.

I also attempted to change the environment to OpenAI's Pong-v0 and have run this for over 24 hours without the average reward improving at all. If anyone knows what variables might be worth changing here I'd be grateful. I'm using 80x80 pong images and allowing for all 6 actions to be chosen. Code otherwise is not different (apart from of course modifying the 'game' variable to work with the openai environment (tested manually - successful).

Low Rewards for DRQN

I tried the DRQN code for both the partial and full observability cases. However, I found that it sometimes gets trapped in repeated actions and obtains very low rewards. Have you seen the same problem before? Thanks

A3C Doom: Visualize Gradients (tricky)

Dear all
@awjuliani
@DMTSource

Inspired by this paper

I wanted to know where the agent is looking while taking decisions. In other words, what are the pixels of input frame that have large influence on the decision it takes.

I have added the following line in AC_Network class:

self.input_gradients = tf.gradients(self.responsible_outputs,self.imageIn)[0]

so, given a batch of input frames, for each frame we have a map that tells us what are the important pixels.

It looks like this:

Looks nice! After good training, at this particular moment and for a given input frame, the agent looks at the center of the input frame, where the demon is probably located.

My Question is:

We are using LSTM and batch size = 30 (or less if episode ends). For example:
If we have batch of 30 frames, then we will have 30 corresponding maps. I want to know if each map depends on:

Only the current frame (ex: map number 11 depends on input frame number 11 in this batch)
OR, The previous frames (ex: map number 11 depends on input frames number 1 to 11 in this batch)
Or, All frames in the batch (ex: map number X depends on all input frames from 1 to 30 in this batch)

Reason for asking:
In some cases, after training, I turn off some frames to test how the agent acts with noisy input. For example, in a batch of 30 frames, around 7 random frames are just black images. The agent performed well. However, the gradient for these black frames still looks like the above figure.

These pixels were important to the agent even though the whole input frame was black. Why?!

Hint: I think the correct answer is number (2). The decision is taken given all previous frames. If the current frame is black, the gradient of the decision depends on the current frame and all previous frames too.

What do you think?!

Regards

When I run dueling DQN, there is an error, 'The kernel appears to have died. It will restart automatically.'.

I've got this error,

The kernel appears to have died. It will restart automatically.

I also ran it as a .py script in the terminal,
but I got a similar error:

(10000, 1.7, 1)
F tensorflow/stream_executor/cuda/cuda_dnn.cc:222] Check failed: s.ok() could not find cudnnCreate in cudnn DSO; dlerror: /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: cudnnCreate
Aborted (core dumped)

..... Could you please help me?

Adapting A3C LSTM for Pong

Did anyone manage to get the A3C LSTM of this repo to work for Pong (using the openai gym)?

I have already tried several different optimizers, learning rates, network architectures, but still no success. I even altered the code in this repo to try to replicate the architecture used in the A3C from the OpenAI starter agent, but no success.....the agent maintains a mean reward of about -20.5 forever.......I left it training until it reached 70k global episodes, but it didn't get any better. In some architectures, the agent would just diverge to a policy where it executes only a single action all the time.

If anyone managed to get this implementation to work for Pong, I would really appreciate some hints.

A3C Basic Doom: Loss Function

Our goal is to minimize the loss. Loss consists of three parts:

Value loss
Policy loss
Entropy (to encourage exploration)

As follows:


self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value,[-1])))
self.entropy = - tf.reduce_sum(self.policy * tf.log(self.policy))
self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs)*self.advantages)
self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01

Last line:
self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01

I think we do not need to multiply self.value_loss by 0.5 in the last line, correct?
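The arithmetic behind the question, as a small sketch with made-up target and predicted values:

```python
import numpy as np

target_v = np.array([1.0, 2.0])
value = np.array([0.0, 0.0])

squared_error = np.sum((target_v - value) ** 2)  # 1 + 4 = 5.0
value_loss = 0.5 * squared_error                 # 2.5
value_term_in_loss = 0.5 * value_loss            # 1.25

# The value term ends up weighted by 0.25 relative to the raw squared error.
effective_coefficient = value_term_in_loss / squared_error
```

So the two factors combine into an effective weight of 0.25 on the summed squared error. Whether that is intended or not, it acts as a hyper-parameter that balances the value term against the policy and entropy terms.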

A3C-Doom: worker_0 plot is lost

Dear

When using:

tensorboard --logdir=worker_0:'./train_0',worker_1:'./train_1',worker_2:'./train_2',worker_3:'./train_3'

worker_0 is not plotted

A3C: Questions

Dear Juliani

Excellent work!

I would like to know how long you trained the A3C, and how many frames were used.

How do your results compare to the original paper? (Denny's code did not achieve the results of the original paper.)

Did you use the same A3C code for Atari games (openai/gym)? Is that easy to do?

In the case of the Breakout Atari game, do we still need the LSTM layer, and why?

Regards


A3C Doom: Basic scenario: How to select clipping?

Why 40.0?

grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)

A3C - Updating global network

Hi Arthur, I have been following your blog about RL with TensorFlow and it has been very useful. Thanks a lot for taking the time.

I have a question/comment about your last post regarding the A3C implementation. If I understand correctly, the workers should update the parameters of the global networks once in a while. However, I do not see that happening in your code. I see you are using the function update_target_graph to copy the parameters from one network to another, but this function is only called as:

self.update_local_ops = update_target_graph('global',self.name)

which means you are only using it to synchronize your workers with the global but not the other way around. Am I missing something here?

I also noticed that you input the global network master_network into the function work, but never use it for updating...
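For context, in A3C as described in the paper the global parameters are not updated by copying weights back. Each worker computes gradients on its local copy and applies them to the global parameters. A toy NumPy sketch of that flow follows; the loss, learning rate, and names are all assumptions, and it does not claim to show how this repo wires it up:

```python
import numpy as np

global_params = np.array([1.0, -1.0])

# Step 1: the worker syncs its local copy from the global parameters
# (the direction the update_local_ops line above covers).
local_params = global_params.copy()

# Step 2: the worker computes gradients on its local copy.
# Toy loss: sum(p ** 2), whose gradient is 2 * p.
local_grads = 2.0 * local_params

# Step 3: the gradient step is applied to the global parameters.
learning_rate = 0.1
global_params = global_params - learning_rate * local_grads
```

In a TensorFlow 1.x graph this pattern usually shows up as gradients taken with respect to the local variables being passed to apply_gradients together with the global variables, so that is the place to look for the update.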

A3C Doom: execute agent

Hi,

Sorry for the stupid question.
I've finished the training process. Now I would like to run the agent to play the Doom game, but I didn't see that piece of code. Could you please show me how to run it?

A3C loss functions

In A3C,

self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value,[-1])))
self.entropy = - tf.reduce_sum(self.policy * tf.log(self.policy))
self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs)*self.advantages)
self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01

Since self.value_loss already includes the factor 0.5, should the last line instead be:

self.loss = self.value_loss + self.policy_loss - self.entropy * 0.01

DRQN-AttributeError: 'zip' object has no attribute 'append'

I ran the code and hit a problem like this:

Target Set Success
5000 0.65 1
98%|█████████████████████████████████████████▏| 50/51 [00:00<00:00, 665.63it/s]
Traceback (most recent call last):
File "Deep_Recurrent_Q-Network.py", line 235, in <module>
summaryLength,h_size,sess,mainQN,time_per_step)
File "/home/jimmy/desktop/tensorflowcode/DRL/helper.py", line 52, in saveToCenter
images.append(bufferArray[-1,3])
AttributeError: 'zip' object has no attribute 'append'

I have no idea how to solve it and need some help, my friends.
Thanks a lot.

Double-Dueling-DQN: question about the rate to update target network

I've encountered something I can't understand while following Double-Dueling-DQN.ipynb.

There's a function like the one below:

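def updateTargetGraph(tfVars,tau):
    # Assumes tfVars lists the primary network's variables first and the target network's second.
    total_vars = len(tfVars)
    op_holder = []  # collects one assign op per target-network variable
    for idx, var in enumerate(tfVars[0:total_vars//2]):
        # target <- tau * primary + (1 - tau) * target
        op_holder.append(tfVars[idx+total_vars//2].assign((var.value()*tau) + ((1-tau)*tfVars[idx+total_vars//2].value())))
    return op_holder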

What does op_holder mean, and what is its role?

I skimmed the paper of Double DQN and Dueling DQN again, but I could not find out about the 'rate to update target network', which is indicated as 'tau' in this code.

inputs to worker.train of A3C

Environment

Hello,

I am attempting to adapt your A3C Doom code, and I was wondering which versions you're using? Perhaps a pip freeze?

A3C Doom: Delayed convergence

I faced delayed convergence in the second run. Is there an explanation?

First Run: (around 200 and smooth)

Second Run: (around 300 and sudden)

A3C Doom: Typo in comment

File: https://github.com/awjuliani/DeepRL-Agents/blob/master/A3C-Doom.ipynb

Where: A typo in a comment of the final gist

What: "# Start the "work" process for each worker in a separate threat."

Correction: threat => thread

Something wrong with Contextual-Policy.ipython

Dear Arthur,

I am following your tutorials for reinforcement learning. They are very helpful. However, when I try to run "Contextual-Policy.ipython", I encounter some problems. Could you tell me how to solve them?

`TypeError Traceback (most recent call last)
in ()
2
3 cBandit = contextual_bandit() #Load the bandits.
----> 4 myAgent = agent(lr=0.001,s_size=cBandit.num_bandits,a_size=cBandit.num_actions) #Load the agent.
5 weights = tf.trainable_variables()[0] #The weights we will evaluate to look into the network.
6

in init(self, lr, s_size, a_size)
4 self.state_in= tf.placeholder(shape=[1],dtype=tf.int32)
5 state_in_OH = slim.one_hot_encoding(self.state_in,s_size)
----> 6 output = slim.fully_connected(state_in_OH,a_size, biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones)
7 self.output = tf.reshape(output,[-1])
8 self.chosen_action = tf.argmax(self.output,0)

/home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.pyc in func_with_args(*args, **kwargs)
175 current_args = current_scope[key_func].copy()
176 current_args.update(kwargs)
--> 177 return func(*args, **current_args)
178 _add_op(func)
179 setattr(func_with_args, '_key_op', _key_op(func))

/home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.pyc in fully_connected(inputs, num_outputs, activation_fn, normalizer_fn, normalizer_params, weights_initializer, weights_regularizer, biases_initializer, biases_regularizer, reuse, variables_collections, outputs_collections, trainable, scope)
841 regularizer=weights_regularizer,
842 collections=weights_collections,
--> 843 trainable=trainable)
844 if len(static_shape) > 2:
845 # Reshape inputs

/home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.pyc in model_variable(name, shape, dtype, initializer, regularizer, trainable, collections, caching_device, device)
267 initializer=initializer, regularizer=regularizer,
268 trainable=trainable, collections=collections,
--> 269 caching_device=caching_device, device=device)
270
271
`

A3C Basic Doom: effect of gamma (Discuss)

Hi, I would like to share with you the effect of gamma on performance.

I believe that gamma = 0.99 means that we think future states have a large effect on our estimate of discounted future rewards,
and that gamma = 0.8 means that we think future states have a smaller effect on our estimate of discounted future rewards.

Correct?
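A small sketch that puts numbers on this intuition. The reward sequence is made up: a single reward of 1.0 arriving on the 30th step of an episode.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t, accumulated backwards from the last step."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total

# One reward of 1.0 on the 30th step (index 29), zeros before it.
rewards = [0.0] * 29 + [1.0]

ret_099 = discounted_return(rewards, 0.99)  # 0.99 ** 29, roughly 0.75
ret_080 = discounted_return(rewards, 0.80)  # 0.80 ** 29, roughly 0.0015
```

A common rule of thumb puts the effective horizon at about 1 / (1 - gamma) steps: roughly 100 steps for 0.99 but only 5 for 0.8. That would fit the observation that 0.8 barely converges if the rewards in this scenario arrive tens of steps after the actions that earn them.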

Case 1: gamma = 0.99

Case 2: gamma = 0.95

Convergence looks smoother than with 0.99 (what do you think?)

Case 3: gamma = 0.8
Almost no convergence

Finally:

gamma = 0.99 # discount rate for advantage estimation and reward discounting

Is it logical to use 2 different gammas?

A3C Doom: What is the number of trainable variables?

I am trying to find how many trainable variables there are ...

When I try this code (just before: with tf.Session() as sess:)

np.sum([np.product([xi.value for xi in x.get_shape()]) for x in tf.trainable_variables()])

And this code

total_parameters = 0
for variable in tf.trainable_variables():
    shape = variable.get_shape()
    print(shape)
    variable_parameters = 1
    for dim in shape:
        variable_parameters *= dim.value
    total_parameters += variable_parameters
print(total_parameters)

I get the same answer (10794672)

However I faced two issues:

The answer depends on how many cpus I have, So I enforce the number of cpus = 1 in order to get reliable answer. Now the answer is 2398816
But it seems that the shared network is counted twice! See below:

(8, 8, 1, 16)
(16,)
(4, 4, 16, 32)
(32,)
(2592, 256)
(256,)
(512, 1024)
(1024,)
(256, 3)
(256, 1)
(8, 8, 1, 16)
(16,)
(4, 4, 16, 32)
(32,)
(2592, 256)
(256,)
(512, 1024)
(1024,)
(256, 3)
(256, 1)

What is the accurate number of trainable variables?
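The numbers above can be cross-checked from the printed shapes alone. A sketch in plain Python; the interpretation that each worker builds its own copy of the network next to the global one is an assumption based on the repeated shape list:

```python
from functools import reduce
import operator

# The ten shapes printed once per network in the listing above.
shapes = [
    (8, 8, 1, 16), (16,),
    (4, 4, 16, 32), (32,),
    (2592, 256), (256,),
    (512, 1024), (1024,),
    (256, 3), (256, 1),
]

def num_elements(shape):
    """Number of scalar parameters in a tensor of the given shape."""
    return reduce(operator.mul, shape, 1)

per_network = sum(num_elements(s) for s in shapes)  # 1199408

two_copies = 2 * per_network    # 2398816: global network plus one worker
nine_copies = 9 * per_network   # 10794672: global network plus eight workers
```

Both reported totals are exact multiples of 1,199,408, which suggests that this is the size of a single network and that tf.trainable_variables() is simply listing the global copy and every worker copy. Restricting the count to one variable scope should give the per-network figure.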

Thank you

A3C Doom Basic: Skip Count

Based on ViZDoom paper figure 7, I tried to use skip count to speed up the training, as follows:

r = self.env.make_action(self.actions[a], 4) / 100.0

However, the agent performed very poorly (compared to the original code and performance where it should converge around 200~300)

The ideal average episode length should be around 30, however it is around 70

I think the agent found some sub-optimal policy and stuck to it!

why the apply_gradient could be totally externally assigned?

I want to ask something about this part in policy network example:

loss = -tf.reduce_mean((tf.log(input_y - probability)) * advantages) 
newGrads = tf.gradients(loss,tvars)

adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
W1Grad = tf.placeholder(tf.float32,name="batch_grad1") # Placeholders to send the final gradients through when we update.
W2Grad = tf.placeholder(tf.float32,name="batch_grad2")
batchGrad = [W1Grad,W2Grad]
updateGrads = adam.apply_gradients(zip(batchGrad,tvars))

Here newGrads leaves the graph and, after postprocessing, is fed back into the graph through W1Grad and W2Grad.

I am wondering how TensorFlow knows that the variable gradients will be fed into batchGrad?

I mean, if we call apply_gradients on arbitrary placeholders that are not computed from any trainable tf.Variable, shouldn't it raise the error "No gradients provided for any variable"?

Your code works, but I am still trying to figure out how.

slim?

Vanilla-Policy.ipynb

I am getting this when trying to run. I guess I didn't get a dependency?

Traceback (most recent call last):
File "agent2.py", line 68, in <module>
myAgent = agent(lr=1e-2,s_size=5,a_size=3,h_size=10) #Load the agent.
File "agent2.py", line 41, in __init__
hidden = slim.fully_connected(self.state_in,h_size,biases_initializer=None,activation_fn=tf.nn.relu)
NameError: global name 'slim' is not defined

DRQN: Error prefix tensor must be either a scalar or vector

I tried the DRQN for partial observations, but I got the error:

ValueError: prefix tensor must be either a scalar or vector, but saw tensor: Tensor("Placeholder_2:0", dtype=int32)

----Error happens in this line-------

self.state_in = rnn_cell.zero_state(self.batch_size, tf.float32)

# For Policy Network Problem

Thanks for your code, but I have a question: if the rewards are negative, does the code still work?
If not, how can I fix it or ensure the loss stays positive?

A3C Basic Doom: effect of episode length (Discuss)

This is to discuss how the episode length may affect the learning process.

Case 1: The default as in the repo

Smoothed steady Reward is around 0.55 (see figure below)

game.set_episode_timeout(300)

Case 2: Shorter episode

game.set_episode_timeout(150)

Very similar to Case 1

Case 3: very short episode

game.set_episode_timeout(70)

The agent should find the policy fast because it has very limited time window to explore.
Delayed convergence (after 500 episodes)

However reward is around 0.65 > case 1 (0.55) (see figure below)
Why? Convergence is delayed, but on the other hand we get better rewards. I mean that the agent usually accomplishes the task in less time than in Case 1, or that the agent is more efficient and focused than in Case 1. What do you think?!

Case 4: Longer episode

game.set_episode_timeout(450)

smoothed reward is around 0.62
smoothed length is around 33
delayed convergence compared to case 1

Case 5: each worker has its own length

Is it even a valid idea?!

Where: episode length = 75 + (number *25)

worker_0: episode length = 75
worker_1: episode length = 100
worker_2: episode length = 125
.
worker_7: episode length = 250
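The per-worker schedule above can be written as a one-line helper. A sketch; the function name and the worker count of 8 are read off the list above rather than taken from the code:

```python
def episode_length(worker_number, base=75, step=25):
    """Episode timeout for a given worker: base + worker_number * step."""
    return base + worker_number * step

# Timeouts for worker_0 .. worker_7.
lengths = [episode_length(n) for n in range(8)]
```

Passing base=100 and step=50 reproduces a wider schedule that runs from 100 to 450 over the same eight workers.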

It seems that worker_0 with episode length 250 converged faster than worker_7 with episode length 75

The following figure includes all workers:

The following figure includes only worker_0 (episode length = 75) and worker_7 (episode length = 250)

However, all workers share the same global network. Do you think that having different episode lengths could affect or enhance the learning? What do you think?

Again: Is it even a valid idea?!

Case 6: each worker has its own length, with larger range

Where: episode length = 100 + (number *50)

worker_0: episode length = 100
worker_1: episode length = 150
worker_2: episode length = 200
.
worker_7: episode length = 450

The following figure includes all workers: