Comments (14)
Adding NVIDIA GPU support to CRIU is a fundamental goal of the cuda-checkpoint project.
We'd be happy to contribute and work with the CRIU developers to help make that happen!
from criu.
@alexfrolov thanks for your thoughts. Today we can already checkpoint and restore AMD GPU containers with Podman. So we know it is doable, but from my point of view NVIDIA needs to do the work to make it fully work, just like AMD came along and implemented it. We are also closely following what NVIDIA does with their checkpoint tool. It is extremely limited at this point, but it looks promising for the future.
The actual error about the mount point looks fixable by correctly specifying all mountpoints in config.json from runc.
from criu.
@adrianreber Nvidia chose another way to implement C/R, and there's nothing wrong with that. I looked at the https://github.com/NVIDIA/cuda-checkpoint tool, and I think we need to implement support for it in CRIU. The only thing we need to do is run this tool for all processes that use CUDA and NVML (NVML isn't supported yet, but they are working on it). It has to be run before the dump and after the restore.
Even without support for this tool in CRIU, users can checkpoint/restore CUDA workloads, but they will need to run the tool themselves for the CUDA processes.
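The manual sequence described above (toggle each CUDA process, dump, restore, toggle again) could be sketched roughly as follows. This is only an illustration, not something CRIU does today: the function names, the PIDs, and the image directory are hypothetical, and the script only builds and prints the command lines instead of running them. It assumes `cuda-checkpoint --toggle --pid <PID>` as shown in the cuda-checkpoint README and the standard `criu dump` / `criu restore` options.

```python
import shlex
from typing import List


def toggle_command(pid: int) -> List[str]:
    # cuda-checkpoint toggles a process between "running" and "checkpointed"
    # CUDA state; the same invocation is used before dump and after restore.
    return ["cuda-checkpoint", "--toggle", "--pid", str(pid)]


def dump_commands(cuda_pids: List[int], root_pid: int, images_dir: str) -> List[List[str]]:
    """Commands to run, in order, to dump a process tree that uses CUDA."""
    cmds = [toggle_command(pid) for pid in cuda_pids]
    cmds.append(
        ["criu", "dump", "--tree", str(root_pid),
         "--images-dir", images_dir, "--shell-job"]
    )
    return cmds


def restore_commands(cuda_pids: List[int], images_dir: str) -> List[List[str]]:
    """Commands to run, in order, to restore the tree and resume CUDA."""
    # --restore-detached lets criu return so the toggles can run afterwards.
    cmds = [
        ["criu", "restore", "--images-dir", images_dir,
         "--shell-job", "--restore-detached"]
    ]
    # CRIU restores processes with their original PIDs, so the same list applies.
    cmds.extend(toggle_command(pid) for pid in cuda_pids)
    return cmds


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    for cmd in dump_commands([4242], 4242, "/tmp/ckpt"):
        print(shlex.join(cmd))
    for cmd in restore_commands([4242], "/tmp/ckpt"):
        print(shlex.join(cmd))
```

Keeping the commands as data makes the ordering explicit: every toggle happens before `criu dump`, and every toggle after restore happens only once `criu restore` has returned.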
from criu.
Checkpointing Kubernetes containers with Nvidia GPUs is not working as far as we know.
We have seen success with AMD GPUs.
from criu.
I checked, and containerd caused this error; containerd uses CRIU. Do you know how to skip the NVIDIA devices when checkpointing an NVIDIA GPU container?
from criu.
I just want to keep the environment inside the container, mainly the files.
from criu.
Then checkpointing is the wrong approach.
from criu.
What I mean is that I want to preserve the environment inside the container. After checkpointing, the export is built into an image. This is a very fast way to build a runtime image.
from criu.
Sorry, I do not understand what you want to do. First you said you want to checkpoint the container, then you said you just want to keep the environment inside the container.
Anyway, checkpointing containers with Nvidia GPUs does not work. You need to talk to Nvidia to enable it.
from criu.
thanks
from criu.
Hi!
I want to put in my 2 cents here. Nvidia recently uploaded a utility called cuda-checkpoint to GitHub (binary only), which provides a method for checkpointing applications when they do not have any kernel running. This method relies on new capabilities of the Nvidia driver (550). After the application's data stored on the GPU has been copied to host memory, the application can be safely dumped with criu. The restore process looks the same as in the common case, but after the application has been restored, it needs to be toggled with cuda-checkpoint.
However, to be able to use this in Docker in the future, it seems that some more work has to be done. Currently, the error when docker checkpoint create is invoked appears to be related to the /dev/nvidia* bindings.
(00.009747) mnt: Found /dev/null mapping for ./proc/timer_list mountpoint
(00.009750) mnt: Found /dev/null mapping for ./proc/keys mountpoint
(00.009752) mnt: Found /dev/null mapping for ./proc/kcore mountpoint
(00.009773) mnt: Found /etc/hosts mapping for ./etc/hosts mountpoint
(00.009776) mnt: Found /etc/hostname mapping for ./etc/hostname mountpoint
(00.009778) mnt: Found /etc/resolv.conf mapping for ./etc/resolv.conf mountpoint
(00.009781) mnt: Found /sys/fs/cgroup/blkio mapping for ./sys/fs/cgroup/blkio mountpoint
(00.009783) mnt: Found /sys/fs/cgroup/memory mapping for ./sys/fs/cgroup/memory mountpoint
(00.009786) mnt: Found /sys/fs/cgroup/devices mapping for ./sys/fs/cgroup/devices mountpoint
(00.009788) mnt: Found /sys/fs/cgroup/net_cls,net_prio mapping for ./sys/fs/cgroup/net_cls,net_prio mountpoint
(00.009790) mnt: Found /sys/fs/cgroup/cpu,cpuacct mapping for ./sys/fs/cgroup/cpu,cpuacct mountpoint
(00.009793) mnt: Found /sys/fs/cgroup/hugetlb mapping for ./sys/fs/cgroup/hugetlb mountpoint
(00.009795) mnt: Found /sys/fs/cgroup/perf_event mapping for ./sys/fs/cgroup/perf_event mountpoint
(00.009797) mnt: Found /sys/fs/cgroup/freezer mapping for ./sys/fs/cgroup/freezer mountpoint
(00.009799) mnt: Found /sys/fs/cgroup/cpuset mapping for ./sys/fs/cgroup/cpuset mountpoint
(00.009802) mnt: Found /sys/fs/cgroup/pids mapping for ./sys/fs/cgroup/pids mountpoint
(00.009804) mnt: Found /sys/fs/cgroup/systemd mapping for ./sys/fs/cgroup/systemd mountpoint
(00.009809) mnt: Inspecting sharing on 308 shared_id 0 master_id 0 (@./sys/firmware)
(00.009812) mnt: Inspecting sharing on 307 shared_id 0 master_id 0 (@./proc/scsi)
(00.009815) mnt: Inspecting sharing on 306 shared_id 0 master_id 0 (@./proc/timer_list)
(00.009817) mnt: The mount 305 is bind for 306 (@./proc/keys -> @./proc/timer_list)
(00.009820) mnt: The mount 304 is bind for 306 (@./proc/kcore -> @./proc/timer_list)
(00.009826) mnt: The mount 340 is bind for 306 (@./dev -> @./proc/timer_list)
(00.009828) mnt: Inspecting sharing on 305 shared_id 0 master_id 0 (@./proc/keys)
(00.009831) mnt: Inspecting sharing on 304 shared_id 0 master_id 0 (@./proc/kcore)
(00.009833) mnt: Inspecting sharing on 303 shared_id 0 master_id 0 (@./proc/acpi)
(00.009835) mnt: Inspecting sharing on 302 shared_id 0 master_id 0 (@./proc/sysrq-trigger)
(00.009837) mnt: The mount 301 is bind for 302 (@./proc/sys -> @./proc/sysrq-trigger)
(00.009840) mnt: The mount 300 is bind for 302 (@./proc/irq -> @./proc/sysrq-trigger)
(00.009842) mnt: The mount 299 is bind for 302 (@./proc/fs -> @./proc/sysrq-trigger)
(00.009844) mnt: The mount 298 is bind for 302 (@./proc/bus -> @./proc/sysrq-trigger)
(00.009846) mnt: The mount 339 is bind for 302 (@./proc -> @./proc/sysrq-trigger)
(00.009849) mnt: Inspecting sharing on 301 shared_id 0 master_id 0 (@./proc/sys)
(00.009851) mnt: Inspecting sharing on 300 shared_id 0 master_id 0 (@./proc/irq)
(00.009853) mnt: Inspecting sharing on 299 shared_id 0 master_id 0 (@./proc/fs)
(00.009855) mnt: Inspecting sharing on 298 shared_id 0 master_id 0 (@./proc/bus)
(00.009857) mnt: Inspecting sharing on 297 shared_id 0 master_id 0 (@./dev/console)
(00.009860) mnt: The mount 341 is bind for 297 (@./dev/pts -> @./dev/console)
(00.009862) mnt: Inspecting sharing on 389 shared_id 0 master_id 14 (@./proc/driver/nvidia/gpus/0000:00:06.0)
(00.009864) Error (criu/mount.c:926): mnt: Mount 389 ./proc/driver/nvidia/gpus/0000:00:06.0 (master_id: 14 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
(00.009885) Unlock network
(00.009890) Running network-unlock scripts
(00.009892) RPC
(00.100301) Unfreezing tasks into 1
(00.100330) Unseizing 2961004 into 1
(00.100359) Error (criu/cr-dump.c:1781): Dumping FAILED.
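In a long dump log like the one above, the two lines that matter are the ones marked `Error`. A small script along these lines could pull them out; this is a sketch that assumes the `(timestamp) Error (file:line): message` layout seen in this log, and the names `CriuError` and `find_errors` are made up for illustration.

```python
import re
from typing import List, NamedTuple


class CriuError(NamedTuple):
    timestamp: str  # e.g. "00.009864"
    source: str     # e.g. "criu/mount.c:926"
    message: str    # remainder of the line


# Assumed line layout: "(SS.UUUUUU) Error (file.c:LINE): message"
ERROR_RE = re.compile(
    r"^\((?P<ts>\d+\.\d+)\)\s+Error \((?P<src>[^)]+)\):\s*(?P<msg>.*)$"
)


def find_errors(log_text: str) -> List[CriuError]:
    """Return every Error entry found in a CRIU log, in order of appearance."""
    errors = []
    for line in log_text.splitlines():
        match = ERROR_RE.match(line.strip())
        if match:
            errors.append(CriuError(match["ts"], match["src"], match["msg"]))
    return errors


# Three lines copied from the log above.
SAMPLE = """\
(00.009862) mnt: Inspecting sharing on 389 shared_id 0 master_id 14 (@./proc/driver/nvidia/gpus/0000:00:06.0)
(00.009864) Error (criu/mount.c:926): mnt: Mount 389 ./proc/driver/nvidia/gpus/0000:00:06.0 (master_id: 14 shared_id: 0) has unreachable sharing. Try --enable-external-masters.
(00.100359) Error (criu/cr-dump.c:1781): Dumping FAILED.
"""

if __name__ == "__main__":
    for err in find_errors(SAMPLE):
        print(f"{err.timestamp} {err.source}: {err.message}")
```

On this log, the first reported error points at the `./proc/driver/nvidia/gpus/0000:00:06.0` mount, which is the entry to look at before the generic "Dumping FAILED." line.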
from criu.
Some more thoughts on that.
Another potentially interesting case is a Docker container running an application that generates periodic GPU load by forking a new process each time, while one or more other processes hold temporary data in host memory.
process A (stores temporary data in host memory, makes no CUDA calls)
processes B_1, ..., B_n (one-shot processes that work with the GPU, send some data to process A, and then terminate)
In this case, snapshotting the Docker container would be useful to preserve the state of process A, but that is currently not possible.
from criu.
cc: @sgurfinkel
from criu.
@sgurfinkel sounds great, thanks.
from criu.