Comments (11)
Hi,
I am not sure which Docker container you are running there (what is 393fa1440720?).
It appears to me that whatever Docker container you are running has some fundamental NVIDIA driver issues that need to be resolved (this appears to be common enough if you search for it on Google).
Also, as a heads up, Jukebox's multi-GPU code is not strictly necessary. If it keeps causing you issues down the line, you could probably get away with removing the rank, local_rank, device = setup_dist_from_mpi() line from main.py (the one shown in your traceback) and replacing device with "cuda".
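A minimal sketch of what that replacement could look like. The helper name pick_device is hypothetical (it is not part of main.py or Jukebox), the "cpu" fallback is an addition to the suggestion above, and setting rank and local_rank to 0 assumes a single process. The import guard is only there so the snippet also runs where PyTorch is not installed.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # guarded so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Stands in for: rank, local_rank, device = setup_dist_from_mpi()
# rank and local_rank are hard-coded on the assumption of a single process.
rank, local_rank, device = 0, 0, pick_device()
print(device)
```

Hard-coding "cuda" as suggested is simpler. The availability check just makes a missing driver show up as a "cpu" device instead of a crash at that line.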
Hope this helps!
from jukemir.
I am using this Docker image: docker pull jukemir/representations_jukebox
from jukemir.
How can I modify main.py? The container runs automatically and I cannot get a shell inside it.
from jukemir.
I see, that is odd. In the worst case, you can replace the device variable with "cpu". I have not tested it, but it should work, at least on the CPU. That said, this has not happened to me with the official Docker image on a machine with a GPU.
You can modify main.py by adding the --entrypoint bash flag to the docker run command. That gives you a shell in which you can edit whatever files you want. You may have to apt install nano or vim to do this. Note that the original entrypoint is python main.py, so after making your changes you can run python main.py inside that shell.
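For illustration, here is a small sketch that assembles such a command. The function name debug_shell_command is hypothetical, and the --rm, -it, and --gpus all flags are assumptions about a typical setup (--gpus needs Docker 19.03+ with the NVIDIA container toolkit). Only --entrypoint bash and the image name come from this thread.

```python
import shlex


def debug_shell_command(image, use_gpu=True):
    """Build a docker run argv that opens bash instead of python main.py."""
    cmd = ["docker", "run", "--rm", "-it"]
    if use_gpu:
        # Assumption: Docker 19.03+ with the NVIDIA container toolkit installed.
        cmd += ["--gpus", "all"]
    # --entrypoint has to appear before the image name.
    cmd += ["--entrypoint", "bash", image]
    return cmd


argv = debug_shell_command("jukemir/representations_jukebox")
print(" ".join(shlex.quote(part) for part in argv))
```

Any volume mounts or other flags from your original docker run command are not shown in this thread and would need to be added to the list.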
If you want to make your changes permanent, you could automate those changes within a new Dockerfile that reads something like:
FROM jukemir/representations_jukebox
RUN # patch main.py command here
Then run docker build -t juke_modified . and, in your docker run command, replace the name of the image with juke_modified.
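One way the "patch main.py" step could be scripted is sketched below. The function names are hypothetical, the regular expression assumes the line appears in main.py exactly as quoted earlier in this thread, and rank and local_rank are set to 0 on the assumption of a single process.

```python
import re
from pathlib import Path

# Matches the line quoted in the thread and keeps its leading indentation.
DIST_LINE = re.compile(
    r"^([ \t]*)rank,[ \t]*local_rank,[ \t]*device[ \t]*=[ \t]*"
    r"setup_dist_from_mpi\(.*\)[ \t]*$",
    re.MULTILINE,
)


def patch_source(text, device="cuda"):
    """Replace the setup_dist_from_mpi() line with fixed single-process values."""
    replacement = r'\g<1>rank, local_rank, device = 0, 0, "%s"' % device
    patched, count = DIST_LINE.subn(replacement, text)
    if count == 0:
        raise ValueError("setup_dist_from_mpi() line not found; nothing patched")
    return patched


def patch_file(path, device="cuda"):
    """Apply patch_source to a file in place."""
    target = Path(path)
    target.write_text(patch_source(target.read_text(), device))


# Demo on a string rather than on a real main.py.
sample = "import x\n    rank, local_rank, device = setup_dist_from_mpi()\nprint(device)\n"
patched_sample = patch_source(sample, "cpu")
print(patched_sample)
```

Saved as a script (say, a hypothetical patch_main.py that calls patch_file on main.py), it could be copied into the image and run from the RUN step. The shell prompt in this thread suggests main.py lives under /code, but that path is an inference.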
from jukemir.
When I modified main.py according to your comment, I got the same error:
root@9284416e805e:/code# python main.py
0%| | 0/2858 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "main.py", line 165, in <module>
    setup_hparams(vqvae, dict(sample_length=1048576)), "cpu"
  File "/code/jukebox/jukebox/make_models.py", line 92, in make_vqvae
    **block_kwargs)
  File "/code/jukebox/jukebox/vqvae/vqvae.py", line 79, in __init__
    self.bottleneck = Bottleneck(l_bins, emb_width, mu, levels)
  File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 189, in __init__
    self.level_blocks.append(level_block(level))
  File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 186, in <lambda>
    level_block = lambda level: BottleneckBlock(l_bins, emb_width, mu)
  File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 13, in __init__
    self.reset_k()
  File "/code/jukebox/jukebox/vqvae/bottleneck.py", line 20, in reset_k
    self.register_buffer('k', t.zeros(self.k_bins, self.emb_width).cuda())
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 196, in _lazy_init
    _check_driver()
  File "/usr/local/lib/python3.6/dist-packages/torch/cuda/__init__.py", line 101, in _check_driver
    http://www.nvidia.com/Download/index.aspx""")
AssertionError:
Found no NVIDIA driver on your system. Please check that you
have an NVIDIA GPU and installed a driver from
http://www.nvidia.com/Download/index.aspx
from jukemir.
I see. You could try removing the .cuda() call in /code/jukebox/jukebox/vqvae/bottleneck.py as well. I believe there are a couple of other places in the Jukebox code that explicitly assume CUDA is available.
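To locate those other places, a quick scan along these lines might help. It is only a sketch: the function name find_cuda_calls is hypothetical, and it matches the literal text ".cuda(" line by line, so it will not catch GPU use expressed in other ways (for example .to("cuda")).

```python
import tempfile
from pathlib import Path


def find_cuda_calls(root):
    """Return (path, line number, stripped line) for each line containing .cuda("""
    hits = []
    for path in sorted(Path(root).rglob("*.py")):
        lines = path.read_text(errors="ignore").splitlines()
        for lineno, line in enumerate(lines, start=1):
            if ".cuda(" in line:
                hits.append((str(path), lineno, line.strip()))
    return hits


# Demo on a throwaway directory so the sketch runs anywhere.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "bottleneck_demo.py").write_text(
        "x = 1\nbuf = t.zeros(2, 3).cuda()\n"
    )
    demo_hits = find_cuda_calls(tmp)

for path, lineno, line in demo_hits:
    print(lineno, line)
```

Inside the container you would point it at the Jukebox source tree instead, which the traceback places under /code/jukebox/jukebox.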
from jukemir.
But now I get a new error:
Traceback (most recent call last):
  File "main.py", line 165, in <module>
    setup_hparams(vqvae, dict(sample_length=1048576)), "cpu"
  File "/code/jukebox/jukebox/make_models.py", line 95, in make_vqvae
    restore_model(hps, vqvae, hps.restore_vqvae)
  File "/code/jukebox/jukebox/make_models.py", line 55, in restore_model
    checkpoint = load_checkpoint(checkpoint_path)
  File "/code/jukebox/jukebox/make_models.py", line 29, in load_checkpoint
    if dist.get_rank() % 8 == 0:
  File "/code/jukebox/jukebox/utils/dist_adapter.py", line 23, in get_rank
    return _get_rank()
  File "/code/jukebox/jukebox/utils/dist_adapter.py", line 65, in _get_rank
    return dist.get_rank()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 564, in get_rank
    _check_default_pg()
  File "/usr/local/lib/python3.6/dist-packages/torch/distributed/distributed_c10d.py", line 193, in _check_default_pg
    "Default process group is not initialized"
AssertionError: Default process group is not initialized
from jukemir.
Yeah, this is one of Jukebox's multi-node/multi-GPU things. You can probably get rid of it as well, in the file /code/jukebox/jukebox/make_models.py.
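Judging from the traceback, the failing piece is the dist.get_rank() call in load_checkpoint, which needs an initialized process group. One possible workaround, sketched below, is a rank lookup that falls back to 0 when nothing has been initialized. The helper name safe_get_rank is hypothetical and not part of Jukebox, and the import guard is only there so the snippet runs without PyTorch.

```python
def safe_get_rank() -> int:
    """Return the distributed rank, or 0 when no process group exists."""
    try:
        import torch.distributed as dist  # guarded so the sketch runs without PyTorch
    except ImportError:
        return 0
    # is_available() is checked first because is_initialized() only exists
    # in builds of PyTorch that include the distributed package.
    if dist.is_available() and dist.is_initialized():
        return dist.get_rank()
    return 0


# Mirrors the shape of the check in load_checkpoint: rank 0 passes the test.
should_download = safe_get_rank() % 8 == 0
print(should_download)
```

In a single-process run this always evaluates to rank 0, so the condition holds and the branch guarded by it still runs.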
from jukemir.
Get rid of what?
from jukemir.
I solved the problem, but I do not have the checkpoint. Can you share it?
from jukemir.
The checkpoint should download automatically when you run that code (see https://github.com/openai/jukebox/blob/08efbbc1d4ed1a3cef96e08a931944c8b4d63bb3/jukebox/make_models.py#L34).
from jukemir.