Comments (11)

rodrigo-castellon commented on May 30, 2024

Hi,
I am not sure which Docker container you are running there (what is 393fa1440720?).

It looks like whatever Docker container you are running has a fundamental NVIDIA driver problem that needs to be resolved (a quick web search suggests this is fairly common).

Also, as a heads-up, Jukebox's multi-GPU code is not strictly necessary here. If it keeps causing you issues down the line, you could probably remove the rank, local_rank, device = setup_dist_from_mpi() line from main.py (the one shown in your traceback) and replace device with "cuda".
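
As a rough sketch of that edit (untested against the image): the helper name drop_mpi_setup is made up here, and it assumes the line appears in main.py in the form quoted above. The replacement could be scripted like this:

```python
import re

# Matches the distributed-setup line quoted above, keeping its indentation.
# Assumption: main.py contains the line in this form.
_MPI_SETUP = re.compile(
    r"^(?P<indent>[ \t]*)rank\s*,\s*local_rank\s*,\s*device\s*=\s*"
    r"setup_dist_from_mpi\(.*\)[ \t]*$",
    re.MULTILINE,
)


def drop_mpi_setup(source: str, device: str = "cuda") -> str:
    """Return `source` with the MPI setup line replaced by a plain device string."""
    patched, count = _MPI_SETUP.subn(
        lambda m: f'{m.group("indent")}device = "{device}"', source
    )
    if count == 0:
        # Fail loudly so an unexpected main.py layout is noticed.
        raise ValueError("setup_dist_from_mpi() line not found; nothing patched")
    return patched
```

If main.py uses rank or local_rank further down, those names would need default values as well.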

Hope this helps!

from jukemir.

borishanzju commented on May 30, 2024

I used docker pull jukemir/representations_jukebox to get this Docker image.

borishanzju commented on May 30, 2024

How can I modify main.py? The container runs automatically and I cannot get a shell inside it.

rodrigo-castellon commented on May 30, 2024

I see, that is odd. Worst case, you can replace the device variable with "cpu", and it should then run on the CPU (though I have not tested this). For what it's worth, I have not seen this error with the official Docker image on a machine with a GPU.

You can modify main.py by adding the --entrypoint bash flag to the docker run command. That gives you a shell in which you can edit whatever files you want (you may have to apt install nano or vim first). Note that the original entrypoint is python main.py, so after making your changes you can run python main.py inside that shell.
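
For illustration, here is one way to assemble that command from Python. The -it and --entrypoint flags are standard docker run options; the helper name shell_command is made up, and any volume or GPU flags from your original command are not shown and would have to be passed in through extra_args.

```python
import shlex
from typing import List, Sequence

IMAGE = "jukemir/representations_jukebox"


def shell_command(image: str = IMAGE, extra_args: Sequence[str] = ()) -> List[str]:
    """Build a `docker run` argv that opens bash instead of `python main.py`."""
    return ["docker", "run", "-it", *extra_args, "--entrypoint", "bash", image]


# Print the command rather than running it, so it can be inspected first.
print(" ".join(shlex.quote(arg) for arg in shell_command()))
```

The resulting list can be handed to subprocess.run once it looks right.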

If you want to make your changes permanent, you could automate those changes within a new Dockerfile that reads something like:

FROM jukemir/representations_jukebox

RUN # patch main.py command here

Then run docker build -t juke_modified . and, in your docker run command, replace the image name with juke_modified.
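
One way to fill in that RUN step is a small Python helper copied into the image and run during the build. The sketch below is an assumption, not something shipped with jukemir: patch_file is a hypothetical name. It raises if the text to replace is missing, so a stale patch breaks the build instead of silently doing nothing.

```python
from pathlib import Path


def patch_file(path: str, old: str, new: str) -> int:
    """Replace every occurrence of `old` with `new` in the file at `path`.

    Returns the number of replacements and raises if `old` is absent.
    """
    target = Path(path)
    text = target.read_text()
    count = text.count(old)
    if count == 0:
        raise ValueError(f"{old!r} not found in {path}")
    target.write_text(text.replace(old, new))
    return count
```

The RUN step could then call patch_file on main.py, for example to swap the device string.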

borishanzju commented on May 30, 2024

When I modify main.py according to your comment, I get the same error:

root@9284416e805e:/code# python main.py
0%| | 0/2858 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 165, in <module>
    setup_hparams(vqvae, dict(sample_length=1048576)), "cpu"
  File "/code/jukebox/jukebox/make_models.py", line 92, in make_vqvae
    **block_kwargs)
  File "/code/jukebox/jukebox/vqvae/vqvae.py", line 79, in __init__
    self.bottleneck = Bottleneck(l_bins, emb_width, mu, levels)
  File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 189, in __init__
    self.level_blocks.append(level_block(level))
  File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 186, in <lambda>
    level_block = lambda level: BottleneckBlock(l_bins, emb_width, mu)
  File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 13, in __init__
    self.reset_k()
  File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 20, in reset_k
    self.register_buffer('k', t.zeros(self.k_bins, self.emb_width).cuda())
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 196, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 101, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx

rodrigo-castellon commented on May 30, 2024

I see. You could try removing the .cuda() call in /code/jukebox/jukebox/vqvae/bottleneck.py as well. I believe there are a couple of other places in the Jukebox code that explicitly assume CUDA is available.
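
A rough sketch of that edit, assuming the only change wanted is dropping the trailing .cuda() calls so tensors stay on the CPU. The function name strip_cuda_calls is made up, and calls that pass arguments (for example a device index) are deliberately left alone:

```python
import re

# Matches argument-less `.cuda()` calls only.
_CUDA_CALL = re.compile(r"\.cuda\(\s*\)")


def strip_cuda_calls(source: str) -> str:
    """Remove `.cuda()` calls so tensors are created on the default (CPU) device."""
    return _CUDA_CALL.sub("", source)


# The line from bottleneck.py shown in the traceback.
line = "self.register_buffer('k', t.zeros(self.k_bins, self.emb_width).cuda())"
print(strip_cuda_calls(line))
# prints: self.register_buffer('k', t.zeros(self.k_bins, self.emb_width))
```

Running the same function over the other Jukebox source files would be one way to find the remaining places that assume CUDA, though each hit is worth reviewing by hand.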

borishanzju commented on May 30, 2024

But now I get a new error:

Traceback (most recent call last):
  File "main.py", line 165, in <module>
    setup_hparams(vqvae, dict(sample_length=1048576)), "cpu"
  File "/code/jukebox/jukebox/make_models.py", line 95, in make_vqvae
    restore_model(hps, vqvae, hps.restore_vqvae)
  File "/code/jukebox/jukebox/make_models.py", line 55, in restore_model
    checkpoint = load_checkpoint(checkpoint_path)
  File "/code/jukebox/jukebox/make_models.py", line 29, in load_checkpoint
    if dist.get_rank() % 8 == 0:
  File "/code/jukebox/jukebox/utils/dist_adapter.py", line 23, in get_rank
    return _get_rank()
  File "/code/jukebox/jukebox/utils/dist_adapter.py", line 65, in _get_rank
    return dist.get_rank()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 564, in get_rank
    _check_default_pg()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 193, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized

rodrigo-castellon commented on May 30, 2024

Yeah, this is another piece of Jukebox's multi-node/multi-GPU logic that you can probably also remove, this time in /code/jukebox/jukebox/make_models.py.

borishanzju commented on May 30, 2024

Get rid of what?

borishanzju commented on May 30, 2024

I solved the problem, but I do not have the checkpoint. Can you share it?

rodrigo-castellon commented on May 30, 2024

The checkpoint should download automatically when you run that code (see https://github.com/openai/jukebox/blob/08efbbc1d4ed1a3cef96e08a931944c8b4d63bb3/jukebox/make_models.py#L34).
