awjuliani / deeprl-agents
A set of Deep Reinforcement Learning Agents implemented in Tensorflow.
License: MIT License
See dqn.zip below.
In Deep-Recurrent-Q-Network.ipynb, zip is now a class (in Python 3 it returns an iterator rather than a list), so its result needs to be converted to a list:
episodeBuffer = list(zip(bufferArray))
In helper.py, the statement:
updateTarget(targetOps,sess) #Set the target network to be equal to the primary network.
uses the op list targetOps, which is built with tau=0.001.
Therefore the networks are not the same after the op is executed.
Can you confirm this issue? Does it have an impact on the results?
Another thing:
The idea of ending up, after the update, with a network whose weights are a convex combination of the target and main network weights seems a bit odd to me, even if it comes from a DeepMind paper. Starting from the same initialization and using very small update weights it might work, but in general I would not expect it to work at all: interpolating between the weights of networks from different runs can disrupt performance. (I will have a look at the paper, but I would love to hear your comments as an expert in this field.)
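A minimal NumPy sketch of the update rule being discussed, assuming the target weights are set to tau * primary + (1 - tau) * target. The function name soft_update and the toy weight vectors are illustrative assumptions, not code from the repository. It shows that a single application with tau=0.001 leaves the two sets of weights different, while tau=1.0 produces an exact copy.

```python
import numpy as np

def soft_update(primary, target, tau):
    """Hypothetical helper: blend target weights towards the primary weights."""
    return tau * primary + (1.0 - tau) * target

# Toy weights standing in for real network parameters.
primary = np.array([1.0, 2.0, 3.0])
target = np.zeros(3)

after_soft = soft_update(primary, target, tau=0.001)  # moves only 0.1% of the way
after_hard = soft_update(primary, target, tau=1.0)    # full copy

print("soft update equals primary:", np.allclose(after_soft, primary))
print("hard update equals primary:", np.allclose(after_hard, primary))
```

If an exact copy is wanted at the start of training, one option under this assumption is to build a second op list with tau=1.0 and run it once.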
Dears
In the A3C paper (Mnih et al.), it was mentioned that:
Figure 2: This shows that A3C is quite robust to learning rates and initial random weights
However, I think the performance is affected by learning rates and initial random weights! In other words, it is not robust!
For example see #18
What do you recommend?
Hi,
I have read your blog posts about RL and like them very much.
When I tried to run A3C-Doom, I got a Segmentation fault (core dumped) error right after the terminal output "starting workers".
My computer has 32 GB of memory and two E5 CPUs.
This error is troubling me a lot, and I wonder if you could give me some advice.
Thanks.
Hi
Is there a principled way to set hyperparameters?
After tens of experiments, I found that any tiny change in one of these affects the whole training dramatically, usually for the worse.
It is also not practical to run a grid search over the parameters, because a single experiment may take hours or days and cost a lot of money.
One trick I usually use is a large network with dropout to reduce or eliminate overfitting, but what about all of the above?
Another trick: adjust the learning rate so that learning rate * gradient is about 1e-3 of the parameter value. (In other words, make each parameter update around 1/1000 of the parameter value, so that updates are neither too large nor too small.)
What do you recommend?
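The 1/1000 heuristic above can be monitored with a few lines of NumPy. This is only a sketch under the assumption of a plain gradient step (update = learning rate * gradient); the function name update_ratio and the toy values are made up for illustration.

```python
import numpy as np

def update_ratio(params, grads, learning_rate):
    """Ratio of the update norm to the parameter norm for one plain gradient step."""
    update_norm = np.linalg.norm(learning_rate * grads)
    param_norm = np.linalg.norm(params)
    return update_norm / param_norm

# Toy values: the gradient is 10x the parameters, so lr=1e-4 gives a ratio of 1e-3.
params = np.array([0.5, -1.0, 2.0])
grads = np.array([5.0, -10.0, 20.0])

ratio = update_ratio(params, grads, learning_rate=1e-4)
print("update / parameter ratio:", ratio)
```

With an adaptive optimizer such as Adam the actual step is not simply learning rate * gradient, so the ratio would have to be measured from the weights before and after a step instead.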
Dear @awjuliani
I would like to share results obtained with different buffer sizes (the same for all workers, or one per worker), to understand how buffer size affects the convergence of an agent.
In other words, how should the buffer size be selected?
What other factors should be adjusted for a given buffer size?
Is it valid to give each worker its own buffer size?
Your comments and suggestions are very important!
1- The original case, buffer length = 30 for all agents
The result looks smooth and the agent starts to converge at around 200 (it seems that 30 is the magic number for this scenario!)
self.experience_length = 30
.
.
.
if len(episode_buffer) == self.experience_length and d != True and episode_step_count != max_episode_length - 1:
2- Buffer size = 50 for all agents
Delayed convergence (around 400) compared to buffer size = 30 (around 200). In this case, the agent has a longer memory to analyze sequences and sees more frames through time.
3- Buffer size = 10 for all agents
It seems that only 10 frames is too short for the agent to learn!
4- Buffer size = 25 for agent_0, 30 for agent_1, 35 for agent_2 and so on
Here, each worker has its own buffer size. Convergence is delayed (after 400) compared to buffer size = 30 (around 200).
I expected this setup to give better results, but it was worse!
UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
"Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
Any ideas? It still seems to work, though.
Hi Arthur
Instead of the basic scenario, I used the health_gathering.cfg scenario
Where:
game.set_doom_scenario_path("health_gathering.wad")
game.set_screen_resolution(ScreenResolution.RES_160X120)
game.set_screen_format(ScreenFormat.GRAY8)
game.set_render_hud(False)
game.set_render_crosshair(False)
game.set_render_weapon(False)
game.set_render_decals(False)
game.set_render_particles(False)
game.add_available_button(Button.TURN_LEFT)
game.add_available_button(Button.TURN_RIGHT)
game.add_available_button(Button.MOVE_FORWARD)
game.add_available_game_variable(GameVariable.HEALTH)
game.set_episode_timeout(2100)
game.set_episode_start_time(10)
game.set_window_visible(False)
game.set_sound_enabled(False)
game.set_living_reward(1) # Each step is good for you!
game.set_death_penalty(100) # And death is not!
game.set_mode(Mode.PLAYER)
game.init()
self.actions = [[True,False,False],[False,True,False],[False,False,True]]
#End Doom set-up
self.env = game
and
r = self.env.make_action(self.actions[a]) * 1.0
r = self.env.get_game_variable(GameVariable.HEALTH)
r = r + self.env.get_game_variable(GameVariable.HEALTH)
What do you think?
I copied your code into a Python file and ran the simulation several times.
Usually it works fine, but occasionally the mean reward becomes very large.
Below is a copy of the output log that your code produced:
World Perf: Episode 247.000000. Reward 35.333333. action: 0.000000. mean reward 35.000038.
World Perf: Episode 250.000000. Reward 29.333333. action: 1.000000. mean reward 34.979885.
World Perf: Episode 253.000000. Reward 39.666667. action: 0.000000. mean reward 34.893707.
World Perf: Episode 256.000000. Reward 21.000000. action: 1.000000. mean reward 34.590328.
World Perf: Episode 259.000000. Reward 62.333333. action: 0.000000. mean reward 34.643253.
World Perf: Episode 262.000000. Reward 40.666667. action: 1.000000. mean reward 34.418655.
World Perf: Episode 265.000000. Reward 31.000000. action: 1.000000. mean reward 34.128536.
World Perf: Episode 268.000000. Reward 25.000000. action: 1.000000. mean reward 3763953194369116274688.000000.
World Perf: Episode 271.000000. Reward 50.333333. action: 0.000000. mean reward 3689050732741573738496.000000.
World Perf: Episode 274.000000. Reward 20.333333. action: 0.000000. mean reward 3615638681115714125824.000000.
World Perf: Episode 277.000000. Reward 26.666667. action: 1.000000. mean reward 3543687766093959528448.000000.
World Perf: Episode 280.000000. Reward 44.000000. action: 0.000000. mean reward 3473168432803755327488.000000.
World Perf: Episode 283.000000. Reward 19.000000. action: 1.000000. mean reward 3404052533747430457344.000000.
World Perf: Episode 286.000000. Reward 59.666667. action: 1.000000. mean reward 3336311921427313852416.000000.
It seems that the predicted reward of the model sometimes becomes too large.
Do you know what the problem is?
Is it just that in some cases the model fails to learn?
Hi, I ran your code with Python 3 and found that some things must be changed, such as / to // and print 'xxx' to print('xxx').
Why not import from __future__ so that the code supports both 2.7 and 3?
Also, in GridWorld, when the env takes an action, does it always return done as False?
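For the Python 2/3 point, a small sketch of the __future__ approach: with these two imports at the very top of a file, the same division and print syntax behaves identically on Python 2.7 and Python 3. The variable names are illustrative only.

```python
from __future__ import division, print_function

# With the imports above, "/" is true division on both Python 2.7 and 3,
# and print is a function on both.
true_quotient = 7 / 2    # 3.5
floor_quotient = 7 // 2  # 3, explicit floor division where an integer index is needed

print("true division:", true_quotient)
print("floor division:", floor_quotient)
```

Places that need an integer (for example slicing with total_vars//2) would still have to use // explicitly.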
Episodes are not terminated when episode_step_count exceeds max_episode_length.
It is not necessarily a problem, but both summary writing and model saving depend on episode_count, so if the episodes get too long these actions become less frequent.
In my case (I'm using the algorithm with a different environment, not Doom), as the agent got better, episodes exceeded 8000 steps, which really reduced the model saving frequency.
IOError: [Errno 2] No such file or directory: './Center/log.csv'
Any ideas, my friend? I am trying to run Deep-Recurrent-Q-Network.ipynb
File "D-R-Q.py", line 153, in
with open('./Center/log.csv', 'w') as myfile:
IOError: [Errno 2] No such file or directory: './Center/log.csv'
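This error usually means the ./Center directory does not exist yet, since open(..., 'w') creates the file but not its parent folder. A minimal sketch of creating the directory first is shown below; it uses a temporary directory as a stand-in for the working directory, and the CSV header written here is a placeholder, not necessarily the columns the notebook uses.

```python
import os
import tempfile

# Stand-in for the notebook's working directory (assumption for this sketch).
base_dir = tempfile.mkdtemp()
log_path = os.path.join(base_dir, "Center", "log.csv")

# Create the parent directory before opening the file for writing.
log_dir = os.path.dirname(log_path)
if not os.path.isdir(log_dir):
    os.makedirs(log_dir)

with open(log_path, "w") as myfile:
    myfile.write("placeholder_header\n")  # placeholder content

print("log file created:", os.path.isfile(log_path))
```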
Hi,
I came across your A3C implementation and found the following 2 lines in AC_network.py:
self.var_norms = tf.global_norm(local_vars)
grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)
I wonder what grad_norms is for? It seems to me that it is not used.
Thanks!
Is it better to use basic.cfg
game.load_config("basic.cfg") ;
where the wad and cfg files can be found at:
https://github.com/mwydmuch/ViZDoom/tree/master/scenarios
instead of:
game.set_screen_resolution(ScreenResolution.RES_160X120)
game.set_screen_format(ScreenFormat.GRAY8)
game.set_render_hud(False)
game.set_render_crosshair(False)
game.set_render_weapon(True)
game.set_render_decals(False)
game.set_render_particles(False)
game.add_available_button(Button.MOVE_LEFT)
game.add_available_button(Button.MOVE_RIGHT)
game.add_available_button(Button.ATTACK)
game.add_available_game_variable(GameVariable.AMMO2)
game.add_available_game_variable(GameVariable.POSITION_X)
game.add_available_game_variable(GameVariable.POSITION_Y)
game.set_episode_timeout(300)
game.set_episode_start_time(10)
game.set_window_visible(False)
game.set_sound_enabled(False)
game.set_living_reward(-1)
game.set_mode(Mode.PLAYER)
Wondering if anyone's had success implementing a deep RNN in AC_Network()? At first glance it looks as easy as
cell = tf.nn.rnn_cell.LSTMCell(256, state_is_tuple=True)
cell = tf.nn.rnn_cell.MultiRNNCell([cell] * 2, state_is_tuple=True)
# same stuff as before
but the output is now a tuple of tuples, which will change the handling of the c_in & h_in placeholders and feed_dicts throughout Worker(). This is definitely a support request rather than a bug; I've tried my hand at this for a couple of days and my newbiness is blocking me.
Your loss function for the simple policy doesn't really make sense
"Loss=-Log(pi)*A"
If you have a weight of .9 and reward of 1
your loss is .045.
but if you have a weight of .9 and your reward is 3
your loss increases to .09 .
So the only reason your function works at all is that you only assign a single amount of reward.
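A quick check of the arithmetic for Loss = -log(pi) * A, assuming the natural logarithm (which is what tf.log computes), so the absolute numbers differ from the ones quoted above. The function name policy_loss is illustrative. The value of the loss does scale with the advantage, but what matters for learning is the gradient: a larger positive advantage gives a proportionally larger push towards increasing pi for that action.

```python
import math

def policy_loss(pi, advantage):
    """Single-sample policy-gradient loss: -log(pi) * advantage (natural log)."""
    return -math.log(pi) * advantage

loss_a1 = policy_loss(0.9, 1.0)
loss_a3 = policy_loss(0.9, 3.0)

print("advantage 1:", round(loss_a1, 4))
print("advantage 3:", round(loss_a3, 4))
print("ratio:", loss_a3 / loss_a1)
```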
Dears
Dear @DMTSource
Sometimes NaN values appear and the agent just collapses. I used the following code to solve this issue.
After policy ...
self.policy = slim.fully_connected(rnn_out,a_size,
activation_fn=tf.nn.softmax,
weights_initializer=normalized_columns_initializer(0.01),
biases_initializer=None)
just add this line (to prevent log(0) )
self.policy += 1e-7
This line should protect the following code ...
self.entropy = - tf.reduce_sum(self.policy * tf.log(self.policy))
self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs)*self.advantages)
If you have any comments and suggestions please share ...
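To illustrate why log(0) is the problem, here is a small NumPy sketch (not the TensorFlow graph itself) comparing the entropy of a saturated policy with and without a guard. It uses clipping rather than adding 1e-7, which is a swapped-in variant of the fix above; in TensorFlow the analogous call would be tf.clip_by_value. The eps value and the example policy are assumptions.

```python
import numpy as np

policy = np.array([1.0, 0.0, 0.0])  # a saturated softmax output
eps = 1e-7

# Naive entropy: 0 * log(0) evaluates to 0 * -inf = nan.
with np.errstate(divide="ignore", invalid="ignore"):
    naive_entropy = -np.sum(policy * np.log(policy))

# Guarded entropy: clip probabilities away from zero before taking the log.
safe_policy = np.clip(policy, eps, 1.0)
safe_entropy = -np.sum(safe_policy * np.log(safe_policy))

print("naive entropy is nan:", np.isnan(naive_entropy))
print("guarded entropy is finite:", np.isfinite(safe_entropy))
```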
Sometimes when working with new levels or environments I get the error:
RuntimeWarning: invalid value encountered in less
a = np.random.choice(a_dist[0],p=a_dist[0])
I'm guessing there is a NaN in the network? This usually occurs after many steps. I have tried strong and weak gradient clipping, and my rewards are always within a small range such as ~(0-1). My gradient norms tend to grow with time and soar past the clipping value.
Any suggestions for how to prevent this issue with the network?
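One way to catch this earlier is to validate the policy output before sampling, so the failure is reported at the step where the NaN first appears. The sketch below is an assumption about how such a check could look; sample_action is a hypothetical helper, and unlike the line quoted above it samples an action index directly instead of sampling a probability value.

```python
import numpy as np

def sample_action(action_probs):
    """Sample an action index after checking that the probabilities are usable."""
    probs = np.asarray(action_probs, dtype=np.float64)
    if not np.all(np.isfinite(probs)):
        raise ValueError("policy output contains NaN or inf: %r" % (probs,))
    probs = probs / probs.sum()  # renormalize to absorb small rounding errors
    return int(np.random.choice(len(probs), p=probs))

action = sample_action([0.2, 0.5, 0.3])
print("sampled action index:", action)
```

This does not remove the cause of the NaN; it only makes the point of failure explicit so the preceding gradients and losses can be inspected.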
In https://github.com/awjuliani/DeepRL-Agents/blob/master/Vanilla-Policy.ipynb, there's a comment:
#Get our reward for taking an action given a bandit.
That tutorial uses the CartPole-v0 environment right? I don't think there is a bandit in that problem :-)
In the notebook I don't see where your recurrent Q-value model gets its trace dimension. You're just reshaping the output of a convnet and feeding it directly into an LSTM. Furthermore, should you not also provide the non-zero initial state determined at play time? I.e. the internal state should be stored in the experience buffer and used during training. Correct me if I'm wrong, please.
I ran the notebook without any changes on the vizdoom environment. After around an hour the reward became non-negative and peaked at around 0.7, but continuing to run the code resulted in the reward going back to -3.0 (I assume the most negative reward possible) and remaining stagnant for over 24 hours. A view of the produced gifs shows the agent walking to the left continuously without choosing any other action.
I also attempted to change the environment to OpenAI's Pong-v0 and have run this for over 24 hours without the average reward improving at all. If anyone knows what variables might be worth changing here I'd be grateful. I'm using 80x80 Pong images and allowing all 6 actions to be chosen. The code is otherwise unchanged, apart from modifying the 'game' variable to work with the OpenAI environment (tested manually, successful).
I tried the DRQN code for both the partial and full observability cases. However, I found that it sometimes gets trapped in repeated actions and obtains very low rewards. Have you had the same problem before? Thanks
Dears
@awjuliani
@DMTSource
Inspired by this paper
I wanted to know where the agent is looking while making decisions. In other words, which pixels of the input frame have a large influence on the decision it takes.
I have added the following line in AC_Network class:
self.input_gradients = tf.gradients(self.responsible_outputs,self.imageIn)[0]
So, given a batch of input frames, for each frame we have a map that tells us which pixels are important.
It looks like this:
Looks nice! After good training, at this particular moment and for this input frame, the agent looks at the center of the frame, where the demon is probably located.
My Question is:
We are using LSTM and batch size = 30 (or less if episode ends). For example:
If we have batch of 30 frames, then we will have 30 corresponding maps. I want to know if each map depends on:
Reason for asking:
In some cases, after training, I black out some frames to test how the agent acts with noisy input. For example, in a batch of 30 frames, around 7 random frames are just black images. The agent performed well. However, the gradients for these black frames still look like the above figure.
These pixels were important to the agent even though the whole input frame was black. Why?!
Hint: I think the correct answer is number (2). The decision is taken given all previous frames. If the current frame is black, the gradient of the decision with respect to the current black frame depends on the current frame and all previous frames too.
What do you think?!
Regards
I've got this error:
The kernel appears to have died. It will restart automatically.
I also ran it as a .py script in the terminal,
but I got a similar error:
(10000, 1.7, 1)
F tensorflow/stream_executor/cuda/cuda_dnn.cc:222] Check failed: s.ok() could not find cudnnCreate in cudnn DSO; dlerror: /usr/local/lib/python2.7/dist-packages/tensorflow/python/_pywrap_tensorflow.so: undefined symbol: cudnnCreate
Aborted (core dumped)
..... Could you please help me?
Has anyone managed to get the A3C LSTM of this repo to work for Pong (using the OpenAI gym)?
I have already tried several different optimizers, learning rates, and network architectures, but still no success. I even altered the code in this repo to try to replicate the architecture used in the A3C from the OpenAI starter agent, but no success: the agent maintains a mean reward of about -20.5 forever. I left it training until it reached 70k global episodes, but it didn't get any better. With some architectures, the agent would just diverge to a policy where it executes only a single action all the time.
If anyone managed to get this implementation to work for Pong, I would really appreciate some hints.
Hi
Our goal is to minimize the loss. Loss consists of three parts:
As follows:
self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value,[-1])))
self.entropy = - tf.reduce_sum(self.policy * tf.log(self.policy))
self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs)*self.advantages)
self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01
Last line:
self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01
I think we do not need to multiply self.value_loss by 0.5 in the last line, correct?
Dear
When using:
tensorboard --logdir=worker_0:'./train_0',worker_1:'./train_1',worker_2:'./train_2',worker_3:'./train_3'
worker_0 is not plotted
Dear Juliani
Excellent work!
I would like to know how long you trained the A3C agent for, and the number of frames used.
How do your results compare to the original paper? (Denny's code did not achieve the results of the original paper.)
Did you use the same A3C code for Atari games (openai/gym)? Is it easy to do?
For the Breakout Atari game, do we still need the LSTM layer, and why?
Regards
Why 40.0?
grads,self.grad_norms = tf.clip_by_global_norm(self.gradients,40.0)
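For reference, a NumPy sketch of what clipping by global norm does, following the formula documented for tf.clip_by_global_norm: every gradient is scaled by clip_norm / max(global_norm, clip_norm), and the second return value is the global norm before clipping (which is what ends up in self.grad_norms). The function below is an illustrative re-implementation with toy gradients, and it does not explain why 40.0 in particular was chosen.

```python
import numpy as np

def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients jointly so their combined L2 norm is at most clip_norm."""
    global_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = clip_norm / max(global_norm, clip_norm)
    return [g * scale for g in grads], global_norm

# Toy gradients with a combined norm of sqrt(30^2 + 40^2 + 120^2) = 130.
grads = [np.array([30.0, 40.0]), np.array([0.0, 120.0])]

clipped, norm_before = clip_by_global_norm(grads, 40.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

print("norm before clipping:", norm_before)
print("norm after clipping:", norm_after)
```

Because all gradients share one scale factor, the direction of the overall update is preserved; only its length is capped.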
Hi Arthur, I have been following your blog about RL with TensorFlow and it has been very useful. Thanks a lot for taking the time.
I have a question/comment about your last post regarding the A3C implementation. If I understand correctly, the workers should update the parameters of the global network once in a while. However, I do not see that happening in your code. I see you are using the function update_target_graph to copy the parameters from one network to another, but this function is only called as:
self.update_local_ops = update_target_graph('global',self.name)
which means you are only using it to synchronize your workers with the global network, but not the other way around. Am I missing something here?
I also noticed that you pass the global network master_network into the function work, but never use it for updating...
Hi,
Sorry for the stupid question.
I've finished the training process. Now I want to run the trained agent to play the Doom game, but I couldn't find that piece of code. Could you please show me how to run it?
In A3C,
self.value_loss = 0.5 * tf.reduce_sum(tf.square(self.target_v - tf.reshape(self.value,[-1])))
self.entropy = - tf.reduce_sum(self.policy * tf.log(self.policy))
self.policy_loss = -tf.reduce_sum(tf.log(self.responsible_outputs)*self.advantages)
self.loss = 0.5 * self.value_loss + self.policy_loss - self.entropy * 0.01
Since 0.5 is already multiplied in self.value_loss, should the last line instead be:
self.loss = self.value_loss + self.policy_loss - self.entropy * 0.01
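The effect of the two factors can be checked numerically. In this NumPy sketch (toy targets and value predictions, assumed for illustration) the value term that enters the total loss is 0.25 times the sum of squared errors, so the second 0.5 acts as a weighting coefficient on the value loss relative to the policy and entropy terms rather than as part of the squared-error definition.

```python
import numpy as np

# Toy targets and value predictions (assumptions for illustration).
target_v = np.array([1.0, 2.0, 3.0])
value = np.array([0.0, 0.0, 0.0])

sum_sq = np.sum(np.square(target_v - value))  # 1 + 4 + 9 = 14
value_loss = 0.5 * sum_sq                     # as in self.value_loss
value_term = 0.5 * value_loss                 # as it appears inside self.loss

print("effective coefficient:", value_term / sum_sq)
```

Dropping the second 0.5 would double the weight of the value gradient relative to the other two terms; whether that helps is an empirical question.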
I ran the code and hit the following problem:
Target Set Success
5000 0.65 1
98%|█████████████████████████████████████████▏| 50/51 [00:00<00:00, 665.63it/s]
Traceback (most recent call last):
File "Deep_Recurrent_Q-Network.py", line 235, in
summaryLength,h_size,sess,mainQN,time_per_step)
File "/home/jimmy/desktop/tensorflowcode/DRL/helper.py", line 52, in saveToCenter
images.append(bufferArray[-1,3])
AttributeError: 'zip' object has no attribute 'append'
I have no idea how to solve it and need some help, my friends.
Thanks a lot.
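The AttributeError comes from Python 3, where zip returns an iterator object instead of a list. A minimal sketch with toy data (the variable names are illustrative, not the ones in helper.py) showing the difference and the list(...) conversion:

```python
buffer_a = [1, 2, 3]
buffer_b = ["a", "b", "c"]

# On Python 3, zip(...) returns a zip object, which has no append method.
pairs = zip(buffer_a, buffer_b)
has_append = hasattr(pairs, "append")

# Wrapping the call in list(...) restores list behaviour such as append.
pairs = list(zip(buffer_a, buffer_b))
pairs.append((4, "d"))

print("zip object has append:", has_append)
print("last element after append:", pairs[-1])
```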
I've encountered something I can't understand while following Double-Dueling-DQN.ipynb.
There's a function like the one below:
def updateTargetGraph(tfVars,tau):
    total_vars = len(tfVars)
    op_holder = []
    for idx, var in enumerate(tfVars[0:total_vars//2]):
        op_holder.append(tfVars[idx+total_vars//2].assign((var.value()*tau) + ((1-tau)*tfVars[idx+total_vars//2].value())))
    return op_holder
What is op_holder and what is its role?
I skimmed the Double DQN and Dueling DQN papers again, but I could not find anything about the 'rate to update target network', which is 'tau' in this code.
Hello,
I am attempting to adapt your A3C Doom code, and I was wondering what versions you're using? Perhaps a pip freeze?
File: https://github.com/awjuliani/DeepRL-Agents/blob/master/A3C-Doom.ipynb
Where: A typo in a comment of the final gist
What: "# Start the "work" process for each worker in a separate threat."
Correction: threat => thread
Dear Arthur,
I am following your tutorials for reinforcement learning. They are very helpful. However, when I try to run "Contextual-Policy.ipynb", I encounter some problems. Could you tell me how to solve them?
`TypeError Traceback (most recent call last)
in ()
2
3 cBandit = contextual_bandit() #Load the bandits.
----> 4 myAgent = agent(lr=0.001,s_size=cBandit.num_bandits,a_size=cBandit.num_actions) #Load the agent.
5 weights = tf.trainable_variables()[0] #The weights we will evaluate to look into the network.
6
in init(self, lr, s_size, a_size)
4 self.state_in= tf.placeholder(shape=[1],dtype=tf.int32)
5 state_in_OH = slim.one_hot_encoding(self.state_in,s_size)
----> 6 output = slim.fully_connected(state_in_OH,a_size, biases_initializer=None,activation_fn=tf.nn.sigmoid,weights_initializer=tf.ones)
7 self.output = tf.reshape(output,[-1])
8 self.chosen_action = tf.argmax(self.output,0)
/home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.pyc in func_with_args(*args, **kwargs)
175 current_args = current_scope[key_func].copy()
176 current_args.update(kwargs)
--> 177 return func(*args, **current_args)
178 _add_op(func)
179 setattr(func_with_args, '_key_op', _key_op(func))
/home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/layers/python/layers/layers.pyc in fully_connected(inputs, num_outputs, activation_fn, normalizer_fn, normalizer_params, weights_initializer, weights_regularizer, biases_initializer, biases_regularizer, reuse, variables_collections, outputs_collections, trainable, scope)
841 regularizer=weights_regularizer,
842 collections=weights_collections,
--> 843 trainable=trainable)
844 if len(static_shape) > 2:
845 # Reshape inputs
/home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/arg_scope.pyc in func_with_args(*args, **kwargs)
175 current_args = current_scope[key_func].copy()
176 current_args.update(kwargs)
--> 177 return func(*args, **current_args)
178 _add_op(func)
179 setattr(func_with_args, '_key_op', _key_op(func))
/home/rlig/anaconda2/envs/tensorflow/lib/python2.7/site-packages/tensorflow/contrib/framework/python/ops/variables.pyc in model_variable(name, shape, dtype, initializer, regularizer, trainable, collections, caching_device, device)
267 initializer=initializer, regularizer=regularizer,
268 trainable=trainable, collections=collections,
--> 269 caching_device=caching_device, device=device)
270
271
`
Hi I would like to share with you the effect of gamma on performance
Correct?
Case 1: gamma = 0.99
Case 2: gamma = 0.95
Convergence looks smoother than with 0.99 (what do you think?)
Case 3: gamma = 0.8
Almost no convergence
Finally:
gamma = 0.99 # discount rate for advantage estimation and reward discounting
Is it logical to use 2 different gammas?
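To make the role of gamma concrete, here is a small sketch of discounted returns, G_t = r_t + gamma * G_{t+1}, computed for the three values compared above. The function name discount and the constant reward sequence are assumptions for illustration; the repository has its own discounting helper.

```python
import numpy as np

def discount(rewards, gamma):
    """Discounted return at every step: G_t = r_t + gamma * G_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

rewards = [1.0] * 30  # toy episode with a constant reward
for gamma in (0.99, 0.95, 0.8):
    print(gamma, discount(rewards, gamma)[0])
```

A lower gamma shrinks how far ahead a reward still contributes to the return of the first step, which is consistent with gamma = 0.8 struggling when rewards arrive late in an episode.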
Hi
I am trying to find how many trainable variables are there ...
When I try this code (just before: with tf.Session() as sess:)
np.sum([np.product([xi.value for xi in x.get_shape()]) for x in tf.trainable_variables()])
And this code
total_parameters = 0
for variable in tf.trainable_variables():
    shape = variable.get_shape()
    print(shape)
    variable_parametes = 1
    for dim in shape:
        variable_parametes *= dim.value
    total_parameters += variable_parametes
print(total_parameters)
I get the same answer (10794672)
However I faced two issues:
The answer depends on how many CPUs I have, so I forced the number of CPUs to 1 in order to get a reliable answer. Now the answer is 2398816
But it seems that the shared network is counted twice! See below:
(8, 8, 1, 16)
(16,)
(4, 4, 16, 32)
(32,)
(2592, 256)
(256,)
(512, 1024)
(1024,)
(256, 3)
(256, 1)
(8, 8, 1, 16)
(16,)
(4, 4, 16, 32)
(32,)
(2592, 256)
(256,)
(512, 1024)
(1024,)
(256, 3)
(256, 1)
What is the accurate number of trainable variables?
Thank you
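The numbers above can be reproduced from the printed shapes alone. This sketch sums the parameters of one copy of the network; doubling it gives 2398816, and multiplying by 9 gives 10794672, which would be consistent with tf.trainable_variables() listing the global network plus one local copy per worker (one worker in the single-CPU run, eight in the first run, the latter being an inference from the totals rather than something stated above).

```python
import numpy as np

# Shapes of one copy of the network, taken from the printed list above.
shapes = [
    (8, 8, 1, 16), (16,),
    (4, 4, 16, 32), (32,),
    (2592, 256), (256,),
    (512, 1024), (1024,),
    (256, 3), (256, 1),
]

per_network = sum(int(np.prod(shape)) for shape in shapes)

print("one network:", per_network)
print("global + 1 worker:", per_network * 2)
print("global + 8 workers:", per_network * 9)
```

Under that reading, the number of distinct trainable parameters of the model is the per-network figure, and the larger totals count the same architecture once per scope.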
hi
Based on ViZDoom paper figure 7, I tried to use skip count to speed up the training, as follows:
r = self.env.make_action(self.actions[a], 4) / 100.0
However, the agent performed very poorly (compared to the original code and performance where it should converge around 200~300)
The ideal average episode length should be around 30, however it is around 70
I think the agent found some sub-optimal policy and stuck to it!
I want to ask something about this part in policy network example:
loss = -tf.reduce_mean((tf.log(input_y - probability)) * advantages)
newGrads = tf.gradients(loss,tvars)
adam = tf.train.AdamOptimizer(learning_rate=learning_rate) # Our optimizer
W1Grad = tf.placeholder(tf.float32,name="batch_grad1") # Placeholders to send the final gradients through when we update.
W2Grad = tf.placeholder(tf.float32,name="batch_grad2")
batchGrad = [W1Grad,W2Grad]
updateGrads = adam.apply_gradients(zip(batchGrad,tvars))
where newGrads go out of the graph and, after postprocessing, are inserted back into the graph through W1Grad and W2Grad.
I am wondering how TensorFlow knows that the variable gradients will be put into batchGrad?
I mean, if we just call apply_gradients on random placeholders which are not evaluated from some trainable tf.Variable, it should raise the error: "No gradients provided for any variable".
Your code works, but I am still trying to figure out how it works.
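As far as the API is concerned, apply_gradients takes a list of (gradient, variable) pairs, and the pairing is fixed by position in zip(batchGrad, tvars); the gradient side can be any tensor, including a placeholder fed at run time. The NumPy sketch below mimics that flow with a plain gradient-descent step instead of Adam (a deliberate simplification); the function apply_gradients here is a toy stand-in, not the TensorFlow method.

```python
import numpy as np

def apply_gradients(grads_and_vars, learning_rate):
    """Toy stand-in: update each variable in place with its paired gradient."""
    for grad, var in grads_and_vars:
        var -= learning_rate * grad

# Two "trainable variables" and a buffer that accumulates gradients outside the graph.
tvars = [np.ones(2), np.ones(3)]
grad_buffer = [np.zeros_like(v) for v in tvars]

# Gradients from two episodes are summed into the buffer (the post-processing step).
for episode_grads in ([np.ones(2), np.ones(3)], [np.ones(2), 2 * np.ones(3)]):
    for buf, grad in zip(grad_buffer, episode_grads):
        buf += grad

# The buffered values are then paired with the variables by position and applied.
apply_gradients(zip(grad_buffer, tvars), learning_rate=0.1)
print(tvars)
```

The optimizer never needs to know where the gradient values came from; it only needs one value per variable, in matching order.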
Vanilla-Policy.ipynb
I am getting this when trying to run it. I guess I am missing a dependency?
Traceback (most recent call last):
File "agent2.py", line 68, in
myAgent = agent(lr=1e-2,s_size=5,a_size=3,h_size=10) #Load the agent.
File "agent2.py", line 41, in init
hidden = slim.fully_connected(self.state_in,h_size,biases_initializer=None,activation_fn=tf.nn.relu)
NameError: global name 'slim' is not defined
I tried the DRQN for partial observations, but I got the error:
ValueError: prefix tensor must be either a scalar or vector, but saw tensor: Tensor("Placeholder_2:0", dtype=int32)
self.state_in = rnn_cell.zero_state(self.batch_size, tf.float32)
Thanks for your code. I have a question: if the rewards are negative, does the code still work?
If not, how can I fix it or ensure the loss stays positive?
Hi
This is to discuss how the episode length may affect the learning process.
Case 1: The default as in the repo
Smoothed steady Reward is around 0.55 (see figure below)
Case 2: Shorter episode
game.set_episode_timeout(150)
Very similar to Case 1
Case 3: very short episode
game.set_episode_timeout(70)
The agent should find the policy fast because it has very limited time window to explore.
Delayed convergence (after 500 episodes)
Case 4: Longer episode
game.set_episode_timeout(450)
smoothed reward is around 0.62
smoothed length is around 33
delayed convergence compared to case 1
Case 5: each worker has its own length
Is it even a valid idea?!
Where: episode length = 75 + (number *25)
worker_0: episode length = 75
worker_1: episode length = 100
worker_2: episode length = 125
.
worker_7: episode length = 250
It seems that worker_7 with episode length 250 converged faster than worker_0 with episode length 75
The following figure includes all workers:
The following figure includes only worker_0 (episode length = 75) and worker_7 (episode length = 250)
However, all workers share the same global network. Do you think having different episode lengths could affect or enhance the learning? What do you think?
Again: Is it even a valid idea?!
Case 6: each worker has its own length, with a larger range
Where: episode length = 100 + (number *50)
worker_0: episode length = 100
worker_1: episode length = 150
worker_2: episode length = 200
.
worker_7: episode length = 450
The following figure includes all workers:
The following figure includes only worker_0 (episode length = 100), worker_4 (episode length = 300) worker_7 (episode length = 450)
The longer the episode, the faster the learning
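For reference, the per-worker episode lengths used in Case 5 and Case 6 follow directly from the two formulas. The sketch assumes 8 workers (worker_0 to worker_7, as listed above); the function name is illustrative, and in the actual experiment each value would be passed to game.set_episode_timeout for the corresponding worker.

```python
def episode_length(worker_number, base, step):
    """Episode timeout for a worker: base + worker_number * step."""
    return base + worker_number * step

num_workers = 8  # worker_0 .. worker_7

case5 = [episode_length(n, base=75, step=25) for n in range(num_workers)]
case6 = [episode_length(n, base=100, step=50) for n in range(num_workers)]

print("Case 5:", case5)
print("Case 6:", case6)
```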
Great tutorial!
If you recently installed TensorFlow from pip, it is now version 1.0.0+. In order to run the tutorial you will need to update:
tf.nn.rnn_cell.BasicLSTMCell
to:
tf.contrib.rnn.BasicLSTMCell