Comments (10)
I have been having an issue that I think might be related to, or even explain, what you are seeing. When running the A3C Doom code on 8-16 CPUs, I noticed that sometimes one or two threads would not launch, or at least were failing silently. The thread name, when printed from worker.work or other places, was repeated or mixed up, so I would see two "1" threads or some other repeated index in the worker name.
When extending the code for personal use I ran into this repeatedly, as my environment was much lighter and workers started up faster. To work around the issue I added a simple sleep(0.5) after starting each thread (see code below). Now, when I print the thread name from worker.work, I no longer see repeated names, and the mixed-up print locations and other bugs caused by the issue are gone.
It appears the workers were spooling up too quickly in my case and repeating or mixing up their context? I'm used to the Scoop and Multiprocessing modules, so I am unsure whether this is a common issue with global scope and Threading.
for worker in workers:
    worker_work = lambda: worker.work(max_episode_length,gamma,sess,coord,saver)
    t = threading.Thread(target=(worker_work))
    t.start()
    worker_threads.append(t)
    sleep(0.5)
coord.join(worker_threads)
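A possible explanation for the repeated names, offered as a sketch rather than a confirmed diagnosis of the notebook: a Python lambda looks up `worker` when it runs, not when it is defined. If a thread is slow to start, the loop may already have moved on to the next worker, so two threads end up calling `work` on the same object. The example below uses a hypothetical `Worker` class (not the A3C one) and starts the threads only after the loop has finished, which makes the effect deterministic.

```python
import threading


class Worker:
    """Hypothetical stand-in for the A3C worker; it only records its own name."""

    def __init__(self, name):
        self.name = name

    def work(self, seen, lock):
        with lock:
            seen.append(self.name)


def launch(workers, bind_early):
    seen, lock, threads = [], threading.Lock(), []
    for worker in workers:
        if bind_early:
            # The bound method carries its own worker, so nothing is looked up later.
            t = threading.Thread(target=worker.work, args=(seen, lock))
        else:
            # Late binding: `worker` is resolved only when the lambda is called.
            t = threading.Thread(target=lambda: worker.work(seen, lock))
        threads.append(t)
    # Start after the loop so every lambda sees the final value of `worker`.
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(seen)


workers = [Worker("worker_%d" % i) for i in range(4)]
print(launch(workers, bind_early=True))
print(launch(workers, bind_early=False))
```

In the launch loop above, the equivalent change would be to pass `target=worker.work` with `args=(max_episode_length, gamma, sess, coord, saver)` instead of the lambda. Whether that fully replaces the sleep in the notebook is untested here.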
from deeprl-agents.
Thanks for the suggestion, DMTSource! I have incorporated the sleep line into the notebook.
from deeprl-agents.
So plots are showing and all workers are alive and well with their respective names... until what appears to be step ~10 (saving time).
It looks like there is trouble with the model saving, as the "master" worker makes it to that point and then shuts down. If the crash is truly silent, you might want to add some print statements to see how far the code gets once it reaches the block responsible for saving the model.
My uninformed guess is that something like ffmpeg is causing the trouble, as it is a very external tool to this code, and saving checkpoint files should be trivial for TensorFlow regardless of the system. You could try commenting out the gif generation code to check whether that is the case. I had trouble getting a working ffmpeg installation on my system the first time I ran the code, as some versions threw errors (Ubuntu 14.04), but I was able to get it working once the issue was identified.
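One way to act on the print-statement suggestion is to wrap the thread target so that any exception is recorded and printed with its traceback, rather than only ending the thread. This is a minimal sketch: `traced` and `flaky_save` are hypothetical names for illustration and are not part of the notebook.

```python
import threading
import traceback


def traced(fn, errors):
    """Wrap a thread target so exceptions are recorded and printed, not lost."""

    def wrapper(*args, **kwargs):
        name = threading.current_thread().name
        try:
            return fn(*args, **kwargs)
        except Exception as exc:
            errors.append((name, exc))
            print("[%s] crashed:\n%s" % (name, traceback.format_exc()))

    return wrapper


def flaky_save(episode_count):
    # Hypothetical stand-in for the gif/checkpoint block.
    print("reached save block at episode", episode_count)
    raise OSError("could not write gif")


errors = []
t = threading.Thread(target=traced(flaky_save, errors), args=(25,), name="worker_0")
t.start()
t.join()
```

After the join, `errors` holds the thread name and the exception, so the main thread can report which worker stopped and why.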
from deeprl-agents.
Hi Ibrahim,
Did you encounter an error in the worker_0 process? Otherwise it should certainly be plotting.
from deeprl-agents.
I have added the sleep as follows:
worker_threads = []
for worker in workers:
    worker_work = lambda: worker.work(max_episode_length,gamma,sess,coord,saver)
    t = threading.Thread(target=(worker_work))
    t.start()
    worker_threads.append(t)
    sleep(0.5) # here it is
coord.join(worker_threads)
worker_0 is in orange color
- However, worker_0 seems to stop very early
- Moreover, model and frames are not saved (the code for them is based on worker_0)
Then I used worker_1 instead of worker_0 for saving the model and frames, but then worker_1 stopped.
I also tried adding a sleep in the code that is responsible for saving the model and frames, but the problem remained.
Regards
from deeprl-agents.
Possible cause:
After removing model and gif saving code, things worked fine!
Removed code:
if self.name == 'worker_1' and episode_count % 25 == 0:
    time_per_step = 0.05
    images = np.array(episode_frames)
    make_gif(images,'./frames/image'+str(episode_count)+'.gif',
        duration=len(images)*time_per_step,true_image=True,salience=False)
if episode_count % 250 == 0 and self.name == 'worker_1':
    saver.save(sess,self.model_path+'/model-'+str(episode_count)+'.cptk')
    print ("Saved Model")
Figure: (all threads are there) I think I have some error in saving!
Any clue?
Regards
from deeprl-agents.
Thanks, DMTSource.
I can save the model but not the frames!
Is there any other way to save gifs or video?
from deeprl-agents.
Hi Ibrahim,
Are you sure that you have both moviepy and ffmpeg installed? You will also need to ensure the version of imageio you have is 1.6.
from deeprl-agents.
Hi Arthur,
imageio:
print imageio.__version__
2.1.2
ffmpeg -version
ffmpeg version N-80901-gfebc862 Copyright (c) 2000-2016 the FFmpeg developers
built with gcc 4.8 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
configuration: --extra-libs=-ldl --prefix=/opt/ffmpeg --mandir=/usr/share/man --enable-avresample --disable-debug --enable-nonfree --enable-gpl --enable-version3 --enable-libopencore-amrnb --enable-libopencore-amrwb --disable-decoder=amrnb --disable-decoder=amrwb --enable-libpulse --enable-libfreetype --enable-gnutls --enable-libx264 --enable-libx265 --enable-libfdk-aac --enable-libvorbis --enable-libmp3lame --enable-libopus --enable-libvpx --enable-libspeex --enable-libass --enable-avisynth --enable-libsoxr --enable-libxvid --enable-libvidstab
libavutil 55. 28.100 / 55. 28.100
libavcodec 57. 48.101 / 57. 48.101
libavformat 57. 41.100 / 57. 41.100
libavdevice 57. 0.102 / 57. 0.102
libavfilter 6. 47.100 / 6. 47.100
libavresample 3. 0. 0 / 3. 0. 0
libswscale 4. 1.100 / 4. 1.100
libswresample 2. 1.100 / 2. 1.100
libpostproc 54. 0.100 / 54. 0.100
from deeprl-agents.
I believe you will need imageio 1.6, and not 2.1 in order for the gif generation to work. Unfortunately they changed the encoder in 2.1 and broke the gif code I used. If you have a fix that works with 2.1, I would be happy to incorporate it.
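Since the gif code depends on the imageio version, a small guard can report a mismatch before training starts instead of failing at the first save. This is a sketch only: `gif_code_supported` is a hypothetical helper, and the 1.6 requirement is taken from the comment above, not from imageio's documentation.

```python
from itertools import takewhile


def parse_version(text):
    """Turn '2.1.2' into (2, 1, 2), keeping only the leading digits of each part."""
    parts = []
    for piece in text.split("."):
        digits = "".join(takewhile(str.isdigit, piece))
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def gif_code_supported(imageio_version):
    # Assumption from this thread: the gif code needs imageio 1.6 and breaks with 2.1.
    return parse_version(imageio_version)[:2] == (1, 6)


for version in ("1.6", "2.1.2"):
    status = "ok" if gif_code_supported(version) else "gif generation may fail"
    print("imageio %s: %s" % (version, status))
```

In practice the argument would be `imageio.__version__`, the same attribute printed earlier in the thread.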
from deeprl-agents.