
Comments (10)

DMTSource avatar DMTSource commented on September 26, 2024 1

I have been having an issue that I think might be related to, or explain, what you are seeing. When running the A3C Doom code on 8-16 CPUs, I noticed that sometimes one or two threads would not launch, or at least were failing silently. The thread name, when printed from worker.work or elsewhere, was sometimes repeated or mixed up, so I would have two "1" threads or some other repeated index in the worker name.

When extending the code for personal use I ran into this repeatedly, as my environment was much lighter and its workers started up faster. To fix the issue I added a simple sleep(0.5) after each thread start (see code below). Now, when I print the thread name from worker.work, I no longer see repeated names, and the mixed-up print locations and other bugs caused by the issue are gone.

It appears the workers were spinning up too quickly in my case and repeating or mixing up their context. I'm used to the Scoop or Multiprocessing modules, so I am unsure whether this is a common issue with global scope and Threading.

    for worker in workers:
        worker_work = lambda: worker.work(max_episode_length,gamma,sess,coord,saver)
        t = threading.Thread(target=(worker_work))
        t.start()
        worker_threads.append(t)
        sleep(0.5)
    coord.join(worker_threads)
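
One possible explanation for the repeated names, not confirmed anywhere in this thread: a Python lambda looks up `worker` when it is called, not when it is defined. If a thread only reaches that lookup after the loop has moved on, it runs the next worker's `work` method, and the sleep would hide this by giving each thread time to start first. The sketch below avoids the lookup entirely by handing the bound method and its arguments to `threading.Thread`. The `Worker` class, `run_workers`, `seen` and `lock` are hypothetical stand-ins, not the notebook's code.

```python
import threading


class Worker:
    """Hypothetical stand-in for the A3C worker; it only records its name."""

    def __init__(self, name):
        self.name = name

    def work(self, seen, lock):
        # The real worker.work takes (max_episode_length, gamma, sess, coord, saver).
        with lock:
            seen.append(self.name)


def run_workers(n=8):
    workers = [Worker("worker_%d" % i) for i in range(n)]
    seen, lock = [], threading.Lock()
    threads = []
    for worker in workers:
        # Pass the bound method and its arguments directly, so each thread
        # holds its own worker instead of reading the loop variable later.
        t = threading.Thread(target=worker.work, args=(seen, lock))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return sorted(seen)


names = run_workers()
print(names)
```

In the original loop, the equivalent change would be `threading.Thread(target=worker.work, args=(max_episode_length, gamma, sess, coord, saver))`, or a lambda with a default argument such as `lambda w=worker: w.work(...)`.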

from deeprl-agents.

awjuliani avatar awjuliani commented on September 26, 2024 1

Thanks for the suggestion, DMTSource! I have incorporated the sleep line into the notebook.


DMTSource avatar DMTSource commented on September 26, 2024 1

So plots are showing and all workers are alive and well with their respective names... until what appears to be step ~10 (saving time).

Looks like there is trouble with the model saving as the "master" worker is making it to that point and then shutting down. If the crash is truly silent you might want to add some print statements to study how far the code is getting once it reaches the code block relevant to saving the model.

My ignorant guess is that something like ffmpeg is causing the trouble, as it's a very external tool to this code, and saving checkpoint files should be trivial for TensorFlow regardless of the system. You could try commenting out the gif generation code if that is the case. I had trouble getting a working ffmpeg installation on my system (Ubuntu 14.04) the first time I ran the code, as some versions threw errors. But I was able to get it working once the issue was identified.
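
As a rough sketch of the "add some print statements" idea: wrapping a thread's target in a try/except makes a failure in the saving step visible instead of letting the thread end quietly. The names `run_logged`, `failing_save` and `errors` are hypothetical, and `failing_save` only simulates a failing gif or checkpoint step.

```python
import threading
import traceback


def run_logged(name, fn, errors):
    """Call fn(); on failure, print the traceback and record it in errors."""
    try:
        fn()
    except Exception:
        errors[name] = traceback.format_exc()
        print("[%s] crashed:\n%s" % (name, errors[name]))


def failing_save():
    # Hypothetical stand-in for the gif/checkpoint step raising an error.
    raise RuntimeError("gif encoder failed")


errors = {}
t = threading.Thread(target=run_logged, args=("worker_0", failing_save, errors))
t.start()
t.join()
```

After the join, `errors` holds the full traceback for any worker that died, which should show whether the gif code or the checkpoint code is at fault.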


awjuliani avatar awjuliani commented on September 26, 2024

Hi Ibrahim,

Did you encounter an error in the worker_0 process? Otherwise it should certainly be plotting.


IbrahimSobh avatar IbrahimSobh commented on September 26, 2024

I have added the sleep as follows:

    worker_threads = []
    for worker in workers:
        worker_work = lambda: worker.work(max_episode_length,gamma,sess,coord,saver)
        t = threading.Thread(target=(worker_work))
        t.start()
        worker_threads.append(t)
        sleep(0.5)  # the added sleep
    coord.join(worker_threads)

worker_0 is in orange color

w_0

  • However, worker_0 seems to stop very early
  • Moreover, model and frames are not saved (the code for them is based on worker_0)

Then I used worker_1 instead of worker_0 for saving the model and frames, but then worker_1 stopped.

w_1

I tried sleep in the code that is responsible for saving model and frames ... but the same problem.

Regards


IbrahimSobh avatar IbrahimSobh commented on September 26, 2024

Possible cause:

After removing model and gif saving code, things worked fine!

Removed code:

                    if self.name == 'worker_1' and episode_count % 25 == 0:
                        time_per_step = 0.05
                        images = np.array(episode_frames)
                        make_gif(images,'./frames/image'+str(episode_count)+'.gif',
                            duration=len(images)*time_per_step,true_image=True,salience=False)
                    if episode_count % 250 == 0 and self.name == 'worker_1':
                        saver.save(sess,self.model_path+'/model-'+str(episode_count)+'.cptk')
                        print ("Saved Model")

Figure: (all threads are there) I think I have some error in saving!

w_no_save

Any clue?

Regards


IbrahimSobh avatar IbrahimSobh commented on September 26, 2024

Thanks DMTSource

I can save the model but not the frames!

Is there any other way to save gifs or video?


awjuliani avatar awjuliani commented on September 26, 2024

Hi Ibrahim,

Are you sure that you have both moviepy and ffmpeg installed? You will also need to ensure the version of imageio you have is 1.6.


IbrahimSobh avatar IbrahimSobh commented on September 26, 2024

Hi Arthur

imageio
print imageio.__version__
2.1.2

ffmpeg -version
ffmpeg version N-80901-gfebc862 Copyright (c) 2000-2016 the FFmpeg developers
built with gcc 4.8 (Ubuntu 4.8.4-2ubuntu1~14.04.3)
configuration: --extra-libs=-ldl --prefix=/opt/ffmpeg --mandir=/usr/share/man --enable-avresample --disable-debug --enable-nonfree --enable-gpl --enable-version3 --enable-libopencore-amrnb --enable-libopencore-amrwb --disable-decoder=amrnb --disable-decoder=amrwb --enable-libpulse --enable-libfreetype --enable-gnutls --enable-libx264 --enable-libx265 --enable-libfdk-aac --enable-libvorbis --enable-libmp3lame --enable-libopus --enable-libvpx --enable-libspeex --enable-libass --enable-avisynth --enable-libsoxr --enable-libxvid --enable-libvidstab
libavutil 55. 28.100 / 55. 28.100
libavcodec 57. 48.101 / 57. 48.101
libavformat 57. 41.100 / 57. 41.100
libavdevice 57. 0.102 / 57. 0.102
libavfilter 6. 47.100 / 6. 47.100
libavresample 3. 0. 0 / 3. 0. 0
libswscale 4. 1.100 / 4. 1.100
libswresample 2. 1.100 / 2. 1.100
libpostproc 54. 0.100 / 54. 0.100


awjuliani avatar awjuliani commented on September 26, 2024

I believe you will need imageio 1.6, and not 2.1 in order for the gif generation to work. Unfortunately they changed the encoder in 2.1 and broke the gif code I used. If you have a fix that works with 2.1, I would be happy to incorporate it.
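
A small guard along these lines could skip gif generation when the installed imageio is not the version the gif code was written for, so training continues instead of stopping the saving worker. This is only a sketch: `parse_version` and `gif_code_expected_to_work` are hypothetical helpers, and the rule "1.6 works, 2.1 does not" is taken from the comment above rather than tested here.

```python
def parse_version(text):
    """Turn a string like '2.1.2' into (2, 1, 2); non-numeric parts are skipped."""
    parts = []
    for piece in text.split("."):
        digits = "".join(ch for ch in piece if ch.isdigit())
        if digits:
            parts.append(int(digits))
    return tuple(parts)


def gif_code_expected_to_work(imageio_version):
    # Assumption from this thread: the 1.6 series works, 2.1 does not.
    return parse_version(imageio_version)[:2] == (1, 6)


print(gif_code_expected_to_work("1.6"))
print(gif_code_expected_to_work("2.1.2"))
```

In the worker, the check would take `imageio.__version__` as its argument and wrap the `make_gif` call.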

