
Comments (14)

XuehaiPan commented on August 16, 2024

Thanks for the feedback! I'm sorry that nvitop does not support MIG-enabled devices yet, but we are working on it. It would be very nice if you could help us make nvitop better.

Ref wookayin/gpustat#102


When running nvitop on a MIG-enabled A100 GPU, nvitop fails to detect the running GPU processes and GPU memory consumption, which can otherwise be viewed by running the command nvidia-smi.

nvitop has not been tested on MIG-enabled devices. (I don't have any A100/A30 GPU available, though.) Could you please run the following commands on your device? The output would be very helpful for identifying the error.

python3 -m venv venv  # create a virtual environment
source venv/bin/activate
pip3 install nvidia-ml-py==11.450.51   # the pinned version for nvitop
python3 test.py
pip3 install nvidia-ml-py==11.450.129  # the newer version
python3 test.py
deactivate
rm -rf venv

nvidia-smi

The content of test.py:

from pynvml import *

nvmlInit()
print('Driver version: {}'.format(nvmlSystemGetDriverVersion().decode()))

device = nvmlDeviceGetHandleByIndex(index=0)  # change the GPU index here

print('MIG mode: {}'.format(nvmlDeviceGetMigMode(device)))
print('MIG count: {}'.format(nvmlDeviceGetMaxMigDeviceCount(device)))

print('Memory info from GPU: {}'.format(nvmlDeviceGetMemoryInfo(device)))
print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))

print('Processes from GPU: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(device)))))

migDevice = nvmlDeviceGetMigDeviceHandleByIndex(device, index=0)  # change the MIG device index here

print('Memory info from MIG device: {}'.format(nvmlDeviceGetMemoryInfo(migDevice)))
print('Utilization rates from MIG device: {}'.format(nvmlDeviceGetUtilizationRates(migDevice)))

print('Processes from MIG device: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(migDevice)))))

nvmlShutdown()

Possible Solutions
I think that the MIG naming convention is different from the regular naming convention, and looks something like this:
MIG 7g.80gb Device 0: rather than just Device 0: as is currently set up in the nvitop repo.

Agreed. I think we should redesign the UI and add a new panel for MIG devices.
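
For reference, here is a minimal sketch (not nvitop's actual implementation) of how such a panel could enumerate the MIG children of a GPU with the NVML bindings and build display labels. Whether nvmlDeviceGetName returns a MIG-specific profile name (e.g. MIG 1g.5gb) may depend on the driver version, so the label below also includes the instance's memory size:

# Sketch only: enumerate the MIG children of GPU 0 and build display labels.
from pynvml import *

nvmlInit()
try:
    parent = nvmlDeviceGetHandleByIndex(0)  # change the GPU index here
    for mig_index in range(nvmlDeviceGetMaxMigDeviceCount(parent)):
        try:
            mig = nvmlDeviceGetMigDeviceHandleByIndex(parent, mig_index)
        except NVMLError:
            continue  # this slot is not populated with a MIG instance
        name = nvmlDeviceGetName(mig)
        if isinstance(name, bytes):  # older bindings return bytes
            name = name.decode()
        total_gib = nvmlDeviceGetMemoryInfo(mig).total / (1 << 30)
        print('MIG Device {}: {} ({:.1f} GiB)'.format(mig_index, name, total_gib))
finally:
    nvmlShutdown()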


Steps to reproduce

  • Run the A100 in MIG mode
  • Start nvitop: watch -n 0.5 nvitop

You can use the monitor mode of nvitop by:

nvitop -m

Type nvitop --help for more command line options.


ki-arie commented on August 16, 2024

Hi,
Thanks for the quick response! Here are the outputs of running the above commands:

  • For the pinned version (pip3 install nvidia-ml-py==11.450.51, then python3 test.py), the console output is:
Driver version: 450.142.00
MIG mode: [1, 1]
MIG count: 7
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
  File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
  • For the newer install (pip3 install nvidia-ml-py==11.450.129, then python3 test.py), the console output is:
MIG mode: [1, 1]
MIG count: 7
Traceback (most recent call last):
  File "test.py", line 11, in <module>
    print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
  File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 2009, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/opt/conda/envs/nvitop/lib/python3.8/site-packages/pynvml.py", line 703, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported

AFAIK this is expected, since using MIG mode requires disabling Fabric Manager and NVLink on the A100s.

  • Running nvidia-smi results in this screen:
    [screenshot of nvidia-smi output]

Let me know how else I can help with this - this library's pretty cool :)


XuehaiPan commented on August 16, 2024

Thanks, @ki-arie!

It seems that we cannot get the GPU-level info (fan speed, memory usage, GPU utilization) on MIG-enabled devices, either from the NVML Python bindings or from the nvidia-smi output.

I'm sorry for the poor exception handling in the example code. Could you try the Python code above again, but in the Python REPL (just type python3 on the command line)?

$ # create virtual environment and pip3 install ...
$ python3
Python 3.9.6 (default, Jun 28 2021, 08:57:49) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pynvml import *
>>> nvmlInit()
...
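
Alternatively, here is a more defensive sketch of test.py (an illustration, not a required step) that wraps each query in try/except so that a single NVML error, like the Not Supported above, does not abort the remaining queries:

# Defensive sketch of test.py: one failing query no longer aborts the rest.
from pynvml import *

def query(label, func, *args):
    try:
        print('{}: {}'.format(label, func(*args)))
    except NVMLError as ex:
        print('{}: NVML error: {}'.format(label, ex))

nvmlInit()
query('Driver version', nvmlSystemGetDriverVersion)
device = nvmlDeviceGetHandleByIndex(index=0)  # change the GPU index here
query('MIG mode', nvmlDeviceGetMigMode, device)
query('MIG count', nvmlDeviceGetMaxMigDeviceCount, device)
query('Memory info from GPU', nvmlDeviceGetMemoryInfo, device)
query('Utilization rates from GPU', nvmlDeviceGetUtilizationRates, device)
query('Processes from GPU', nvmlDeviceGetComputeRunningProcesses, device)
try:
    migDevice = nvmlDeviceGetMigDeviceHandleByIndex(device, index=0)  # change the MIG device index here
except NVMLError as ex:
    print('No MIG device handle: {}'.format(ex))
else:
    query('Memory info from MIG device', nvmlDeviceGetMemoryInfo, migDevice)
    query('Utilization rates from MIG device', nvmlDeviceGetUtilizationRates, migDevice)
    query('Processes from MIG device', nvmlDeviceGetComputeRunningProcesses, migDevice)
nvmlShutdown()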

It would also be better if some processes were running on the MIG device while you test with the NVML bindings. You can try:

pip3 install cupy-cuda102  # replace the suffix here with your CUDA version (e.g. `cuda110` for CUDA 11.0)
python3 -c 'import time; import cupy as cp; x = cp.zeros((1, 1)); time.sleep(120)' &

If you have installed TensorFlow or PyTorch, you can try:

python3 -c 'import time; import torch; x = torch.zeros((1, 1), device="cuda:0"); time.sleep(120)' &

This command will use the GPU for 2 minutes in the background.


zabique commented on August 16, 2024

[screenshot of nvitop running on a dual-RTX-3090 Windows machine]

works fine on 2x3090 NVLINK (MIG)


XuehaiPan commented on August 16, 2024

@zabique Thanks for the report. Glad to see people using nvitop on Windows!

According to your screenshot, you are using dual 3090s on Windows, which is not a MIG setup. BTW, you can change the font of your terminal to get a better experience (some characters are missing, shown as ?s in boxes, in the graph views and at the ends of the bars).


NVIDIA Multi-Instance GPU User Guide: Introduction

Introduction

The new Multi-Instance GPU (MIG) feature allows GPUs based on the NVIDIA Ampere architecture (such as NVIDIA A100) to be securely partitioned into up to seven separate GPU Instances for CUDA applications, providing multiple users with separate GPU resources for optimal GPU utilization. This feature is particularly beneficial for workloads that do not fully saturate the GPU’s compute capacity and therefore users may want to run different workloads in parallel to maximize utilization.

The MIG feature splits one physical GPU into multiple separate GPU instances.

As of now, only A100-series and A30 GPUs support MIG mode, and it is only available on Linux (NVIDIA Multi-Instance GPU User Guide: Supported GPUs).


zabique commented on August 16, 2024

Thanks for the reply and the font hint, as I was too shy to ask about it :).

I thought MIG was enabled on my GPUs because nvidia-smi shows it in the top-right corner.
I also compared performance on Ubuntu 20.04 and Windows: I can run the same model with pretty much the same performance, plus nvidia-smi on Windows allows a lot more hardware control.

Feel free to ask for any testing.
Your nvitop is great!


lixeon commented on August 16, 2024

I tested in the Python REPL, and it seems that with just a few small fixes we can add this MIG feature to nvitop.
I hope the information below helps you. Thanks for developing this awesome tool.
BTW, if I want to study GPU performance, can I start from these GPU info API calls? How is this different from nvvp or Nsight?

$ python testnvitop.py 
Driver version: 470.42.01
MIG mode: [1, 1]
MIG count: 7
Memory info from GPU: c_nvmlMemory_t(total: 42505273344 B, free: 27517911040 B, used: 14987362304 B)
Traceback (most recent call last):
  File "/home/testnvitop.py", line 12, in <module>
    print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
$ python
Python 3.9.7 (default, Sep 16 2021, 13:09:58) 
[GCC 7.5.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from pynvml import *
>>> nvmlInit()
>>> print('Driver version: {}'.format(nvmlSystemGetDriverVersion().decode()))
Driver version: 470.42.01
>>> device = nvmlDeviceGetHandleByIndex(index=0) 
>>> print('MIG mode: {}'.format(nvmlDeviceGetMigMode(device)))
MIG mode: [1, 1]
>>> print('MIG count: {}'.format(nvmlDeviceGetMaxMigDeviceCount(device)))
MIG count: 7
>>> print('Memory info from GPU: {}'.format(nvmlDeviceGetMemoryInfo(device)))
Memory info from GPU: c_nvmlMemory_t(total: 42505273344 B, free: 27517911040 B, used: 14987362304 B)
>>> print('Utilization rates from GPU: {}'.format(nvmlDeviceGetUtilizationRates(device)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_NotSupported: Not Supported
>>> print('Processes from GPU: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(device)))))
Processes from GPU: ["{'pid': 1562472, 'usedGpuMemory': 2104492032}", "{'pid': 1695752, 'usedGpuMemory': 2775580672}", "{'pid': 1698220, 'usedGpuMemory': 2832203776}", "{'pid': 1701131, 'usedGpuMemory': 2345664512}", "{'pid': 1702015, 'usedGpuMemory': 2588934144}", "{'pid': 1733844, 'usedGpuMemory': 2291138560}"]
>>> migDevice = nvmlDeviceGetMigDeviceHandleByIndex(device, index=0) 
>>> print('Memory info from MIG device: {}'.format(nvmlDeviceGetMemoryInfo(migDevice)))
Memory info from MIG device: c_nvmlMemory_t(total: 10468982784 B, free: 5387976704 B, used: 5081006080 B)
>>> print('Utilization rates from MIG device: {}'.format(nvmlDeviceGetUtilizationRates(migDevice)))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 1993, in nvmlDeviceGetUtilizationRates
    _nvmlCheckReturn(ret)
  File "/home/anaconda3/envs/testnvitop/lib/python3.9/site-packages/pynvml.py", line 697, in _nvmlCheckReturn
    raise NVMLError(ret)
pynvml.NVMLError_InvalidArgument: Invalid Argument
>>> print('Processes from MIG device: {}'.format(list(map(str, nvmlDeviceGetComputeRunningProcesses(migDevice)))))
Processes from MIG device: ["{'pid': 1695752, 'usedGpuMemory': 2775580672}", "{'pid': 1733844, 'usedGpuMemory': 2291138560}"]
>>> nvmlShutdown()
>>> exit()

Here is the current process situation on the GPU:

$ nvidia-smi
Wed Nov 17 11:58:34 2021       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.42.01    Driver Version: 470.42.01    CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA A100-PCI...  Off  | 00000000:05:00.0 Off |                   On |
| N/A   84C    P0   154W / 250W |  14293MiB / 40536MiB |     N/A      Default |
|                               |                      |              Enabled |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| MIG devices:                                                                |
+------------------+----------------------+-----------+-----------------------+
| GPU  GI  CI  MIG |         Memory-Usage |        Vol|         Shared        |
|      ID  ID  Dev |           BAR1-Usage | SM     Unc| CE  ENC  DEC  OFA  JPG|
|                  |                      |        ECC|                       |
|==================+======================+===========+=======================|
|  0    3   0   0  |   4845MiB /  9984MiB | 28      0 |  2   0    1    0    0 |
|                  |      8MiB / 16383MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0    9   0   1  |   2014MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      8MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   10   0   2  |   2708MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      4MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   11   0   3  |   2244MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      8MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   12   0   4  |   2476MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      8MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
|  0   13   0   5  |      3MiB /  4864MiB | 14      0 |  1   0    0    0    0 |
|                  |      6MiB /  8191MiB |           |                       |
+------------------+----------------------+-----------+-----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0    3    0    1695752      C   python                           2647MiB |
|    0    3    0    1733844      C   python                           2185MiB |
|    0    9    0    1562472      C   python                           2007MiB |
|    0   10    0    1698220      C   python                           2701MiB |
|    0   11    0    1701131      C   python                           2237MiB |
|    0   12    0    1702015      C   python                           2469MiB |
+-----------------------------------------------------------------------------+


XuehaiPan commented on August 16, 2024

I tested in the Python REPL, and it seems that with just a few small fixes we can add this MIG feature to nvitop.
I hope the information below helps you.

@lixeon Thanks a lot for the informative results. I'll try to improve nvitop on MIG-enabled devices.
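
Based on your results, the per-MIG-device memory and process queries do work. As a rough sketch (assuming nvmlDeviceGetGpuInstanceId and nvmlDeviceGetComputeInstanceId are available in the installed nvidia-ml-py version), something like the following could reproduce the MIG devices / Processes tables shown by nvidia-smi above:

# Sketch: list each MIG instance of GPU 0 with its GI/CI IDs and compute processes.
# Assumes nvmlDeviceGetGpuInstanceId / nvmlDeviceGetComputeInstanceId are available.
from pynvml import *

nvmlInit()
parent = nvmlDeviceGetHandleByIndex(0)
for mig_index in range(nvmlDeviceGetMaxMigDeviceCount(parent)):
    try:
        mig = nvmlDeviceGetMigDeviceHandleByIndex(parent, mig_index)
    except NVMLError:
        continue  # no MIG instance in this slot
    gi = nvmlDeviceGetGpuInstanceId(mig)
    ci = nvmlDeviceGetComputeInstanceId(mig)
    mem = nvmlDeviceGetMemoryInfo(mig)
    print('MIG Dev {} (GI {}, CI {}): {} MiB / {} MiB'.format(
        mig_index, gi, ci, mem.used >> 20, mem.total >> 20))
    for proc in nvmlDeviceGetComputeRunningProcesses(mig):
        print('    PID {}: {} MiB'.format(proc.pid, (proc.usedGpuMemory or 0) >> 20))
nvmlShutdown()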


BTW, if I want to study GPU performance, can I start from these GPU info API calls? How is this different from nvvp or Nsight?

From the NVML API Reference:

The NVIDIA Management Library (NVML) is a C-based programmatic interface for monitoring and managing various states within NVIDIA Tesla™ GPUs.

NVML and the applications based on it (nvidia-smi, nvidia-ml-py, nvitop, nvtop, gpustat, etc.) are designed to monitor GPU states from a global view. These tools can only capture the overall SM and VRAM usage of each process; they are not designed for code profiling.

Nsight is a profiling tool that can capture more fine-grained GPU usage information (nvvp is deprecated); it measures the running time of each API call.


XuehaiPan commented on August 16, 2024

Hi, you guys! I have added MIG support to the GUI. To install it:

git clone --branch=mig-support https://github.com/XuehaiPan/nvitop.git
cd nvitop

python3 -m venv --upgrade-deps venv
source venv/bin/activate

pip3 install -r requirements.txt
python3 nvitop.py -m

Any feedback is welcome.


XuehaiPan commented on August 16, 2024

Closing as resolved by PR #8.


XuehaiPan commented on August 16, 2024

I just got sudo access to an A100 GPU. I have tweaked the visual output of the CLI and may release it soon.

[screenshot of the tweaked nvitop CLI showing MIG devices]


ki-arie commented on August 16, 2024

Omg incredible work! 🤩


ytaoeer commented on August 16, 2024

So nvitop can't get the MIG device's GPU utilization and SM usage?


XuehaiPan commented on August 16, 2024

So nvitop can't get the MIG device's GPU utilization and SM usage?

@ytaoeer nvitop is based on the NVML library. The API reference of nvmlDeviceGetUtilizationRates notes that:

Note:

  • During driver initialization when ECC is enabled one can see high GPU and Memory Utilization readings. This is caused by ECC Memory Scrubbing mechanism that is performed during driver initialization.
  • On MIG-enabled GPUs, querying device utilization rates is not currently supported.

No NVML-based monitoring tool (including nvidia-smi) can track the GPU utilization of MIG instances. You can submit a feature request to the NVML upstream to ask for this support.
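
In the meantime, a monitor can only degrade gracefully. Here is a minimal sketch of that kind of fallback (illustrative only, not nvitop's actual code), reporting N/A when the utilization query fails:

# Sketch of graceful degradation: report 'N/A' when NVML cannot answer the query,
# e.g. Not Supported on MIG-enabled GPUs or Invalid Argument on MIG device handles
# (see the tracebacks earlier in this thread).
from pynvml import *

def utilization_or_na(handle):
    try:
        return '{}%'.format(nvmlDeviceGetUtilizationRates(handle).gpu)
    except NVMLError:
        return 'N/A'

nvmlInit()
device = nvmlDeviceGetHandleByIndex(0)
print('GPU 0 utilization: {}'.format(utilization_or_na(device)))
nvmlShutdown()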

