Comments (14)
Thanks for the feedback! I'm sorry that nvitop does not support MIG-enabled devices yet, but we are working on it. It would be great if you could help us make nvitop better.
When running nvitop on a MIG-enabled A100 GPU, nvitop fails to detect the running GPU processes and their GPU memory consumption, which can otherwise be viewed by running nvidia-smi.
nvitop has not been tested on MIG-enabled devices. (I don't have any A100/A30 GPU available.) Could you please run the following commands on your device? The output would be very helpful for identifying the error.
python3 -m venv venv # create a virtual environment
source venv/bin/activate
pip3 install nvidia-ml-py==11.450.51 # the pinned version for nvitop
python3 test.py
pip3 install nvidia-ml-py==11.450.129 # the newer version
python3 test.py
deactivate
rm -rf venv
nvidia-smi
The content of test.py:
from pynvml import *
nvmlInit()
print('Driver version: {}'.format(nvmlSystemGetDriverVersion().decode()))
device = nvmlDeviceGetHandleByIndex(index=0) # change the GPU index here
print('MIG mode: {}'.format(nvmlDeviceGetMigMode(device)))
print('MIG count: {}'.format(nvmlDeviceGetMaxMigDeviceCount(device)))
print('Memory info from GPU: {}'.format(nvmlDeviceGetMemoryInfo(device)))
print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
print('Processes from GPU: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(device)))))
migDevice = nvmlDeviceGetMigDeviceHandleByIndex(device, index=0) # change the MIG device index here
print('Memory info from MIG device: {}'.format(nvmlDeviceGetMemoryInfo(migDevice)))
print('Utilization rates from MIG device: {}'.format(nvmlDeviceGetUtilizationRates(migDevice)))
print('Processes from MIG device: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(migDevice)))))
nvmlShutdown()
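One failed query in this script aborts the whole run (as the tracebacks below show). A more defensive variant could wrap each NVML call individually; here is a minimal sketch of that pattern, runnable without a GPU because the NVMLError class and the query functions are stubs standing in for the real pynvml ones (safe_query is a hypothetical helper, not part of pynvml):

```python
# Sketch of per-call error handling for the NVML queries above.
# NVMLError here is a stand-in for pynvml.NVMLError so the pattern
# runs without a GPU; with pynvml installed, catch pynvml.NVMLError.
class NVMLError(Exception):
    pass

def safe_query(func, *args):
    """Run one NVML-style query; return None on failure instead of
    raising, so later queries in the script still execute."""
    try:
        return func(*args)
    except NVMLError:
        return None

# Stub queries modeled on the behavior reported in this thread:
def get_memory_info(handle):
    return {'total': 42505273344, 'used': 14987362304}

def get_utilization(handle):
    raise NVMLError('Not Supported')  # what a MIG-enabled GPU returns

print('Memory info:', safe_query(get_memory_info, 0))
print('Utilization:', safe_query(get_utilization, 0))
```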
Possible Solutions
I think the MIG naming convention is different from the regular one, and looks something like this:
MIG 7g.80gb Device 0:
rather than just Device 0:
as is currently assumed in the nvitop repo.
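A display-name parser that accepts both forms could be sketched as follows. The optional profile part ("7g.80gb") is a guess based only on the two example strings above, so treat the pattern as an assumption:

```python
import re

# Matches both "Device 0:" and "MIG 7g.80gb Device 0:" headers.
DEVICE_HEADER = re.compile(r'^(?:MIG\s+(?P<profile>\S+)\s+)?Device\s+(?P<index>\d+):')

def parse_device_header(line):
    """Return (profile_or_None, index) for a device header line."""
    match = DEVICE_HEADER.match(line)
    if match is None:
        return None
    return match.group('profile'), int(match.group('index'))
```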
Agreed. I think we should redesign the UI and add a new panel for MIG devices.
Steps to reproduce
- Run the A100 in MIG mode
- Start nvitop:
watch -n 0.5 nvitop
You can use the monitor mode of nvitop with:
nvitop -m
Type nvitop --help for more command-line options.
from nvitop.
Hi,
Thanks for the quick response! Here are the outputs of running the above commands:
- For the pinned version (pip3 install nvidia-ml-py==11.450.51 followed by python3 test.py), the console output is:
Driver version: 450.142.00
MIG mode: [1, 1]
MIG count: 7
Traceback (most recent call last):
File "test.py", line 11, in <module>
print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
_nvmlCheckReturn(ret)
File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
- For the newer version (pip3 install nvidia-ml-py==11.450.129 followed by python3 test.py), the console output is:
MIG mode: [1, 1]
MIG count: 7
Traceback (most recent call last):
File "test.py", line 11, in <module>
print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 2009, in nvmlDeviceGetUtilizationRates
_nvmlCheckReturn(ret)
File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 703, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
AFAIK this is expected: to use MIG mode you've got to stop the A100s from using Fabric Manager and NVLink.
Let me know how else I can help with this - this library's pretty cool :)
Thanks to @ki-arie !
It seems that we cannot get GPU-level info (fan speed, memory usage, GPU utilization) on MIG-enabled devices, either from the NVML Python bindings or from the nvidia-smi output.
I'm sorry for the poor exception handling in the example code. Can you try the Python code above again, but in the Python REPL (just type python3 on the command line)?
$ # create virtual environment and pip3 install ...
$ python3
Python 3.9.6 (default, Jun 28 2021, 08:57:49)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pynvml import *
>>> nvmlInit()
...
It would also help if some processes were running on the MIG device while you test the NVML bindings. You can try:
pip3 install cupy-cuda102 # replace the suffix here to your CUDA version (e.g. `cuda110` for CUDA 11.0)
python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &
If you have PyTorch installed, you can try:
python3 -c 'import time; import torch; x = torch.zeros((1, 1), device="cuda:0"); time.sleep(120)' &
This command will use the GPU for 2 minutes in the background.
works fine on 2x3090 NVLINK (MIG)
@zabique Thanks for the report. Glad to see people using nvitop on Windows!
According to your screenshot, you are running dual 3090s on Windows, which is not a MIG setup. BTW, you can change the font of your terminal for a better experience (some glyphs are missing, rendered as ?s in boxes, in the graph views and at the ends of the bars).
NVIDIA Multi-Instance GPU User Guide: Introduction
The new Multi-Instance GPU (MIG) feature allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization. This feature is particularly beneficial for workloads that do not fully saturate the GPU’s compute capacity and therefore users may want to run different workloads in parallel to maximize utilization.
The MIG feature is to split one physical GPU into multiple separate GPU instances.
Currently, only A100-series and A30 GPUs support MIG mode, and MIG is only available on Linux (NVIDIA Multi-Instance GPU User Guide: Supported GPUs).
Thanks for the reply and the font hint; I was too shy to ask about it :).
I thought MIG was enabled on my GPUs because nvidia-smi shows it in the top-right corner.
I also compared performance on Ubuntu 20.04 and Windows: I can run the same model with pretty much the same performance, and nvidia-smi on Windows allows a lot more hardware control.
Feel free to ask for any testing.
Your nvitop is great!
I tested in the Python REPL, and it seems that only a small fix is needed to add MIG support to nvitop. I hope the information below helps, and thanks for developing this awesome tool.
BTW, if I want to study GPU performance, can I start from these GPU info APIs? How are they different from nvvp or Nsight?
$ python testnvitop.py
Driver version: 470.42.01
MIG mode: [1, 1]
MIG count: 7
Memory info from GPU: c_nvmlMemory_t(total: 42505273344 B, free: 27517911040 B, used: 14987362304 B)
Traceback (most recent call last):
File "/home/testnvitop.py", line 12, in <module>
print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
_nvmlCheckReturn(ret)
File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
$ python
Python 3.9.7 (default, Sep 16 2021, 13:09:58)
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pynvml import *
>>> nvmlInit()
>>> print('Driver version: {}'.format(nvmlSystemGetDriverVersion().decode()))
Driver version: 470.42.01
>>> device = nvmlDeviceGetHandleByIndex(index=0)
>>> print('MIG mode: {}'.format(nvmlDeviceGetMigMode(device)))
MIG mode: [1, 1]
>>> print('MIG count: {}'.format(nvmlDeviceGetMaxMigDeviceCount(device)))
MIG count: 7
>>> print('Memory info from GPU: {}'.format(nvmlDeviceGetMemoryInfo(device)))
Memory info from GPU: c_nvmlMemory_t(total: 42505273344 B, free: 27517911040 B, used: 14987362304 B)
>>> print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
_nvmlCheckReturn(ret)
File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
>>> print('Processes from GPU: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(device)))))
Processes from GPU: ["{'pid': 1562472, 'usedGpuMemory': 2104492032}", "{'pid': 1695752, 'usedGpuMemory': 2775580672}", "{'pid': 1698220, 'usedGpuMemory': 2832203776}", "{'pid': 1701131, 'usedGpuMemory': 2345664512}", "{'pid': 1702015, 'usedGpuMemory': 2588934144}", "{'pid': 1733844, 'usedGpuMemory': 2291138560}"]
>>> migDevice = nvmlDeviceGetMigDeviceHandleByIndex(device, index=0)
>>> print('Memory info from MIG device: {}'.format(nvmlDeviceGetMemoryInfo(migDevice)))
Memory info from MIG device: c_nvmlMemory_t(total: 10468982784 B, free: 5387976704 B, used: 5081006080 B)
>>> print('Utilization rates from MIG device: {}'.format(nvmlDeviceGetUtilizationRates(migDevice)))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
_nvmlCheckReturn(ret)
File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
raise NVMLError(ret)
pynvml.NVMLError_InvalidArgument: Invalid Argument
>>> print('Processes from MIG device: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(migDevice)))))
Processes from MIG device: ["{'pid': 1695752, 'usedGpuMemory': 2775580672}", "{'pid': 1733844, 'usedGpuMemory': 2291138560}"]
>>> nvmlShutdown()
>>> exit()
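The REPL output above suggests one workable approach for nvitop: the MIG instance's process list is a subset of the parent GPU's, so PIDs can be mapped back to MIG instances by querying each instance and intersecting the lists. A sketch of that idea, using the PIDs printed above (map_pids_to_mig is a hypothetical helper, not a pynvml function):

```python
def map_pids_to_mig(gpu_procs, mig_procs_by_index):
    """Map each PID seen on the parent GPU to the MIG instance running it.

    gpu_procs: list of {'pid': ..., 'usedGpuMemory': ...} dicts from the
    parent GPU; mig_procs_by_index: the same kind of lists keyed by MIG
    device index. PIDs not found on any queried MIG instance map to None.
    """
    pid_to_mig = {proc['pid']: None for proc in gpu_procs}
    for mig_index, procs in mig_procs_by_index.items():
        for proc in procs:
            pid_to_mig[proc['pid']] = mig_index
    return pid_to_mig

# PIDs from the REPL output above (only MIG device 0 was queried there):
gpu_procs = [{'pid': 1695752, 'usedGpuMemory': 2775580672},
             {'pid': 1733844, 'usedGpuMemory': 2291138560},
             {'pid': 1562472, 'usedGpuMemory': 2104492032}]
mig_procs = {0: [{'pid': 1695752, 'usedGpuMemory': 2775580672},
                 {'pid': 1733844, 'usedGpuMemory': 2291138560}]}
pid_to_mig = map_pids_to_mig(gpu_procs, mig_procs)
```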
And here is the process situation on the GPU at the same time:
$ nvidia-smi
Wed Nov 17 11:58:34 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01 Driver Version: 470.42.01 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA A100-PCI... Off | 00000000:05:00.0 Off | On |
| N/A 84C P0 154W / 250W | 14293MiB / 40536MiB | N/A Default |
| | | Enabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| MIG devices: |
+------------------+----------------------+-----------+-----------------------+
| GPU GI CI MIG | Memory-Usage | Vol| Shared |
| ID ID Dev | BAR1-Usage | SM Unc| CE ENC DEC OFA JPG|
| | | ECC| |
|==================+======================+===========+=======================|
| 0 3 0 0 | 4845MiB / 9984MiB | 28 0 | 2 0 1 0 0 |
| | 8MiB / 16383MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 9 0 1 | 2014MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 8MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 10 0 2 | 2708MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 4MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 11 0 3 | 2244MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 8MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 12 0 4 | 2476MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 8MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
| 0 13 0 5 | 3MiB / 4864MiB | 14 0 | 1 0 0 0 0 |
| | 6MiB / 8191MiB | | |
+------------------+----------------------+-----------+-----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 3 0 1695752 C python 2647MiB |
| 0 3 0 1733844 C python 2185MiB |
| 0 9 0 1562472 C python 2007MiB |
| 0 10 0 1698220 C python 2701MiB |
| 0 11 0 1701131 C python 2237MiB |
| 0 12 0 1702015 C python 2469MiB |
+-----------------------------------------------------------------------------+
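Where NVML refuses a query, another (fragile) fallback is scraping the nvidia-smi table itself. Here is a sketch of parsing the MIG-devices rows above; the column layout is an assumption taken from this one driver version (470.42.01) and may change, so this is illustrative, not a robust parser:

```python
import re

# Matches one MIG-device row of the table above, e.g.
# "|  0    3   0   0  |   4845MiB /  9984MiB  | ..."
MIG_ROW = re.compile(
    r'\|\s*(?P<gpu>\d+)\s+(?P<gi>\d+)\s+(?P<ci>\d+)\s+(?P<dev>\d+)\s*\|'
    r'\s*(?P<used>\d+)MiB\s*/\s*(?P<total>\d+)MiB')

def parse_mig_rows(smi_output):
    """Extract (GI ID, device index, memory usage) from nvidia-smi text."""
    return [{'gi': int(m.group('gi')),
             'dev': int(m.group('dev')),
             'used_mib': int(m.group('used')),
             'total_mib': int(m.group('total'))}
            for m in MIG_ROW.finditer(smi_output)]

# One row copied from the output above:
sample = '|  0    3   0   0  |   4845MiB /  9984MiB  | 28      0 |  2   0    1    0    0 |'
rows = parse_mig_rows(sample)
```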
I tested in the Python REPL, and it seems that only a small fix is needed to add MIG support to nvitop. I hope the information below helps.
@lixeon Thanks a lot for the informative results. I'll try to improve nvitop
on MIG enabled devices.
BTW, if I want to study GPU performance, can I start from these GPU info APIs? How are they different from nvvp or Nsight?
From NVML API Reference:
The NVIDIA Management Library (NVML) is a C-based programmatic interface for monitoring and managing various states within NVIDIA Tesla™ GPUs.
NVML and the applications based on it (nvidia-smi, nvidia-ml-py, nvitop, nvtop, gpustat, etc.) are designed to monitor GPU states in a global view. These tools can only capture the overall GPU SM and VRAM usage of each process; they are not designed for code profiling.
Nsight is a profiling tool that can capture more fine-grained GPU usage information (nvvp is deprecated); it records the running time of each API call.
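The distinction can be made concrete: NVML-style monitors simply sample a device-wide counter at a fixed interval, with no attribution to individual kernels or API calls. A minimal sketch of that sampling loop (the query argument stands in for an NVML call such as reading nvmlDeviceGetMemoryInfo; sample_counter is a hypothetical helper):

```python
import time

def sample_counter(query, interval=0.5, count=3):
    """Poll a device-wide counter at a fixed interval, nvitop-style.

    This sees only the global value at each tick; unlike a profiler
    such as Nsight, it cannot attribute time to individual kernels
    or API calls.
    """
    samples = []
    for _ in range(count):
        samples.append((time.monotonic(), query()))
        time.sleep(interval)
    return samples

# Stand-in readings for a real NVML query:
readings = iter([100, 150, 120])
result = sample_counter(lambda: next(readings), interval=0.01)
```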
Hi all! I've added MIG support to the GUI. To install:
git clone --branch=mig-support https://github.com/XuehaiPan/nvitop.git
cd nvitop
python3 -m venv --upgrade-deps venv
source venv/bin/activate
pip3 install -r requirements.txt
python3 nvitop.py -m
Any feedback is welcome.
Closed as resolved by PR #8.
I just got sudo access to an A100 GPU. I've tweaked the visual output of the CLI and may release it soon.
Omg incredible work! 🤩
So nvitop can't get the MIG device's GPU utilization and SM usage?
So nvitop can't get the MIG device's GPU utilization and SM usage?
@ytaoeer nvitop is based on the NVML library. The API reference for nvmlDeviceGetUtilizationRates notes:
Note:
- During driver initialization when ECC is enabled one can see high GPU and Memory Utilization readings. This is caused by ECC Memory Scrubbing mechanism that is performed during driver initialization.
- On MIG-enabled GPUs, querying device utilization rates is not currently supported.
No NVML-based monitoring tool can track the GPU utilization of MIG instances (including nvidia-smi). You can submit a feature request to the NVML upstream to ask for this support.
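Since nvmlDeviceGetMemoryInfo does work on MIG instances (see the REPL output earlier in this thread), one crude fallback a monitor could display is memory occupancy instead of SM utilization. A sketch with the numbers from that output (memory_percent is a hypothetical helper, and memory occupancy is only a rough proxy for activity):

```python
def memory_percent(used, total):
    """Memory occupancy as a percentage; a crude stand-in for the
    SM utilization that NVML does not report for MIG instances."""
    return 100.0 * used / total

# nvmlDeviceGetMemoryInfo values for MIG device 0 from the REPL above:
pct = memory_percent(used=5081006080, total=10468982784)
```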