Comments (21)

galitus commented on June 16, 2024 (3)

This library is essential; it is required whenever you want to use the GPU part of TensorFlow.

Here are the requirements for tensorflow-gpu (see the "Software requirements" section):
https://www.tensorflow.org/install/gpu

I guess for a Docker container specialized for GPU acceleration, I would recommend including the cuDNN library.

Greetings

ChristophSchranz commented on June 16, 2024 (2)

> Hi guys,
>
> I have the same problem in my AWS/EKS JupyterHub setup.
> Apart from adding the cuDNN library to the gpulibs, it might also be possible to change the base image from nvidia/cuda:10.1-base-ubuntu18.04 to nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04. I couldn't try it locally due to some strange Docker build errors, which is why I wanted to ask you.
>
> The resulting image will be bigger, but TensorFlow might work. What do you think?

It seems to work fine. I guess it is better to fetch directly from nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04, as the compatibility of CUDA and cuDNN would then be guaranteed. Do you agree?
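(For reference only: in the Dockerfile this is just a swap of the FROM line, and the cudnn7-runtime image should already ship the cuDNN runtime library, so no separate apt install would be needed:)

FROM nvidia/cuda:10.1-base-ubuntu18.04             # current base
FROM nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04   # proposed base with cuDNN included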

ChristophSchranz commented on June 16, 2024 (1)

@ph-lp
Installing libcudnn7 takes about 900 MB of additional disk space:

Reading package lists...
Building dependency tree...
Reading state information...
The following additional packages will be installed:
  libcudnn7
The following NEW packages will be installed:
  libcudnn7 libcudnn7-dev
0 upgraded, 2 newly installed, 0 to remove and 35 not upgraded.
Need to get 355 MB of archives.
After this operation, 892 MB of additional disk space will be used.

Maybe it is a good idea to use the nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04 base image as you've suggested.
I'll test it today.

mathematicalmichael commented on June 16, 2024 (1)

> Solved with commit e5eb8f6
>
> Use nvidia/cuda:10.1-base-ubuntu18.04 as base image.
>
> Thanks to everyone here! 😃

I agree it's better to use that as the base. Storage isn't that big a problem, but I think crossing 10 GB puts us into "be careful now" territory (assume you get 25-50 GB of cloud storage with an instance; models can easily take up 10-20 GB even without any datasets).

DNNs are popular, so I agree they should be included by default, but technically they can be unnecessary even for some very sophisticated programs. I ran a bunch of NLP networks that gobbled up 10 GB of VRAM (for inference!) and those still weren't using deep network architectures. Half a gig in image savings is potentially half a million pages of rich-text corpus to train on.

mathematicalmichael commented on June 16, 2024

Spawners handle attaching GPUs differently, so this is a case-by-case thing. Would be interested to know how you solve this with kube.

CUDA does need to be installed on the host, yes, but it shouldn't require special mounting. The nvidia-docker runtime is used, and it sounds (to my inexperienced ears) like you're hitting an issue where kube doesn't quite behave the same way, analogous to running this image with plain docker instead of nvidia-docker: it'll work, you'll just be missing the GPUs.

I have multiple GPUs and need to pass --gpus all or --gpus device=1 when initializing.
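(Purely as an illustration, a local run would look roughly like this; the image tag is a placeholder:)

docker run --rm -it --gpus all -p 8888:8888 gpu-jupyter:latest       # expose all GPUs
docker run --rm -it --gpus device=1 -p 8888:8888 gpu-jupyter:latest  # expose only GPU index 1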

galitus commented on June 16, 2024

hey,

I use the nvidia-container-runtime, which applies "docker --runtime nvidia" to every Docker container by default.
I also need the nvidia-device-plugin, which advertises the number of GPUs on the Kubernetes nodes.
Then you can use 'extra_resource_guarantees': {"nvidia.com/gpu": "1"} in the KubeSpawner to assign a GPU to the container,
and that GPU is there and accessible: in a terminal in JupyterHub, I can run nvidia-smi and see the card attached to the container.
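(A rough sketch of that setup in jupyterhub_config.py; extra_resource_guarantees is the option named above, the other values are only illustrative:)

c.JupyterHub.spawner_class = 'kubespawner.KubeSpawner'
c.KubeSpawner.image = 'gpu-jupyter:latest'                         # placeholder image tag
c.KubeSpawner.extra_resource_guarantees = {"nvidia.com/gpu": "1"}  # request one GPU per user pod
c.KubeSpawner.extra_resource_limits = {"nvidia.com/gpu": "1"}      # and cap it at one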

The problem is: if you want to use the TensorFlow GPU support, you need CUDA, TensorFlow, and cuDNN.
CUDA and TensorFlow are installed in the container, but cuDNN does not seem to be installed.

I guess a container run with plain Docker gets access to this library from the host?
Because otherwise a lot of people would have complained about it already, right?

Log of the container:

tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcuda.so.1
tensorflow/core/common_runtime/gpu/gpu_device.cc:1716] Found device 0 with properties: 
pciBusID: 0000:06:00.0 name: Quadro RTX 4000 computeCapability: 7.5
coreClock: 1.545GHz coreCount: 36 deviceMemorySize: 7.79GiB deviceMemoryBandwidth: 387.49GiB/s
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcudart.so.10.1
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcublas.so.10
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcufft.so.10
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcurand.so.10
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusolver.so.10
tensorflow/stream_executor/platform/default/dso_loader.cc:48] Successfully opened dynamic library libcusparse.so.10
tensorflow/stream_executor/platform/default/dso_loader.cc:59] Could not load dynamic library 'libcudnn.so.7'; dlerror: libcudnn.so.7: cannot open shared object file: No such file or directory

As you can see, the device is found and a lot of CUDA libraries are loaded, but cuDNN is missing.
I ran find / -iname '*cudnn*' inside the container and it isn't there.
Maybe $PATH or another environment variable from the host has to be passed through?
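(Once libcudnn is installed, a quick sanity check from a container terminal is something along these lines; tf.config.list_physical_devices is available in TensorFlow 2.1+, older versions have tf.test.is_gpu_available() instead:)

python -c "import tensorflow as tf; print(tf.config.list_physical_devices('GPU'))"

It should print at least one PhysicalDevice entry with device_type 'GPU' once cuDNN loads correctly.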

Thanks again and greetings

mathematicalmichael commented on June 16, 2024

Maybe you just ran into a missing library? You should have root privileges within the container, so try installing it from the Ubuntu PPAs, and do report back if you find any extra libs that you believe should be added to the default configuration.

galitus commented on June 16, 2024

Hey,

Thanks, yep, it was missing. Running

apt install libcudnn7-dev

helped.

Big thanks, and now I need to figure out how to make it permanent.
I guess my own Docker registry with a modified image based on this one would be the right approach.
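(A minimal sketch of such a derived image; the FROM tag is a placeholder for whatever gpu-jupyter image you actually run:)

# Dockerfile for a personal image with cuDNN baked in
FROM gpu-jupyter:latest
USER root
RUN apt-get update && \
    apt-get install -y --no-install-recommends libcudnn7 libcudnn7-dev && \
    rm -rf /var/lib/apt/lists/*
# switch back to the unprivileged notebook user used by the jupyter docker-stacks images
USER $NB_UID

Build and push it to your own registry, e.g. docker build -t myregistry/gpu-jupyter-cudnn . followed by docker push myregistry/gpu-jupyter-cudnn (registry and tag names are placeholders).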

Thanks again!

mathematicalmichael commented on June 16, 2024

Would you say this library is important enough to warrant including it in this build by default?

ChristophSchranz commented on June 16, 2024

Thanks for figuring that out!

@galitus would you like to make a PR with a versioned dependency in src/Dockerfile.gpulibs?

ph-lp commented on June 16, 2024

Hi guys,

I have the same problem in my AWS/EKS JupyterHub setup.
Apart from adding the cuDNN library to the gpulibs, it might also be possible to change the base image from nvidia/cuda:10.1-base-ubuntu18.04 to nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04. I couldn't try it locally due to some strange Docker build errors, which is why I wanted to ask you.

The resulting image will be bigger, but TensorFlow might work. What do you think?

mathematicalmichael commented on June 16, 2024

> Hi guys,
>
> I have the same problem in my AWS/EKS JupyterHub setup.
> Apart from adding the cuDNN library to the gpulibs, it might also be possible to change the base image from nvidia/cuda:10.1-base-ubuntu18.04 to nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04. I couldn't try it locally due to some strange Docker build errors, which is why I wanted to ask you.
>
> The resulting image will be bigger, but TensorFlow might work. What do you think?

I'm trying the manual install in #26; can you see if that solves your problem as well? I would definitely prefer a slimmer image.

ChristophSchranz commented on June 16, 2024

> Hi guys,
>
> I have the same problem in my AWS/EKS JupyterHub setup.
> Apart from adding the cuDNN library to the gpulibs, it might also be possible to change the base image from nvidia/cuda:10.1-base-ubuntu18.04 to nvidia/cuda:10.1-cudnn7-runtime-ubuntu18.04. I couldn't try it locally due to some strange Docker build errors, which is why I wanted to ask you.
>
> The resulting image will be bigger, but TensorFlow might work. What do you think?

Good to know and important to have in scope. Is there an explanation for why the image is bigger? Are there other packages pre-installed that improve the toolstack or that would be installed later on anyway?

galitus commented on June 16, 2024

hey,

Thanks for taking care of this problem.
Like I said before, I'm new to this game, so I haven't opened a PR (yet).

Yeah, the cuDNN download alone is already about 500 MB gzipped,
so I don't know if the other base image would be smaller.

Maybe a bit off-topic:
I noticed that there are ~700 MB in /home/jovyan/.cache/yarn/.
I don't know if it is still used; I think it is only needed while installing some packages.
So that space could be freed up to make room for cuDNN?
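(For anyone who wants to experiment with that, clearing the cache would be a one-liner at the end of the build; note the caution in the reply further down that JupyterLab may still need it:)

RUN rm -rf /home/jovyan/.cache/yarn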

ph-lp commented on June 16, 2024

@mathematicalmichael From the logs of the docker-default pipeline I can see a build error which matches the one on my machine with your updated Dockerfile:
Step 72/93 : RUN apt-get update && apt-get install -y libcudnn7-dev && rm -rf /var/lib/apt/lists/*
 ---> Running in 8dd883761a73
Reading package lists...
E: List directory /var/lib/apt/lists/partial is missing. - Acquire (13: Permission denied)
Service 'gpu-jupyter' failed to build: The command '/bin/bash -o pipefail -c apt-get update && apt-get install -y libcudnn7-dev && rm -rf /var/lib/apt/lists/*' returned a non-zero code: 100

I'm still trying to make it work on my machine using the other image.

mathematicalmichael commented on June 16, 2024

@ph-lp That is peculiar. I didn't test this on my machine; I made the edit on mobile and hoped the CI pipelines on the PR would help me debug if need be. I'll take a look.

Curious that it didn't fail originally. @ChristophSchranz, isn't that file run as root? The error @ph-lp pasted looks like a missing USER root directive.

ChristophSchranz commented on June 16, 2024

@ph-lp

Add USER root before the command
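(Sketched out, the relevant step in src/Dockerfile.gpulibs would then look roughly like this; NB_UID is the variable the jupyter docker-stacks images use for the notebook user:)

USER root

RUN apt-get update && \
    apt-get install -y libcudnn7-dev && \
    rm -rf /var/lib/apt/lists/*

# drop privileges again for everything that follows
USER $NB_UID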

mathematicalmichael commented on June 16, 2024

> hey,
>
> Thanks for taking care of this problem.
> Like I said before, I'm new to this game, so I haven't opened a PR (yet).
>
> Yeah, the cuDNN download alone is already about 500 MB gzipped,
> so I don't know if the other base image would be smaller.
>
> Maybe a bit off-topic:
> I noticed that there are ~700 MB in /home/jovyan/.cache/yarn/.
> I don't know if it is still used; I think it is only needed while installing some packages.
> So that space could be freed up to make room for cuDNN?

I think the stuff in the yarn cache is used by JupyterLab, so I'm not sure it's safe to delete. If you try it and it works... let us know.

ChristophSchranz commented on June 16, 2024

Solved with commit e5eb8f6

Use nvidia/cuda:10.1-base-ubuntu18.04 as base image.

Thanks to everyone here! 😃

ca-scribner commented on June 16, 2024

Yeah, I think these drivers are essential, but they do make for a big image. Add a few extra installs and I'm having trouble staying under the 14 GB limit for GitHub Actions. Haven't figured out how to shave any more space off yet, though.

mathematicalmichael commented on June 16, 2024

@ca-scribner If you're using the image for Actions, consider trimming off all the Jupyter installs and scrapping conda. While it's convenient to use the same image for development as for testing, you run into the situation you just outlined more often than not. I started deploying projects with a minimal image that matches prod/testing, and all my development tools are then not published in a registry but rather built on top of it locally.
