Comments (11)
We're simulating distributed machine learning in combination with containernet.
How would you describe your "buggy" experience with the A100? The only thing that felt buggy to me was disabling and enabling MIG mode, as it always thought there was a process still using the GPU. But that was fixable by rebooting the machine. I usually just leave it enabled the entire time and apply a 7g.40gb partition whenever I need the entire GPU.
from nvtop.
I haven't looked at the details, but from what I understand, each MIG instance should show up as a separate handle (a device in nvtop).
Since I haven't updated the API in a while, my guess is that the handles right now are only physical GPUs.
from nvtop.
I also tried MIG on an A100, but it was buggy for us. What is your use case, and what MIG configuration do you use? Multiple users? Multiple jobs from the same user? Something else?
from nvtop.
It's been a long time, let me check if I can remember/reconstruct it from history.
Starting at 2023-02-20T09:22:19+0100: nvidia-smi -i 0 -mig 1
Some time (weeks) passed...
The users wanted PyTorch and MinkowskiEngine, and I think the problem was with PyTorch:
it would just fail to run. Back then we had CUDA 11.3 and CUDA 11.6, and we also tried 11.7 and 11.8. The torch version was 1.13.1.
Whatever it was, it worked again when MIG was disabled. (It's important to know that if you turn on MIG, it stays that way after reboots.)
from nvtop.
Did you maybe try running jobs on the GPU itself and not on MIG instances while MIG mode was enabled? That didn't work for me either. The fix is to create a 7g.40gb MIG instance and let the job run on that.
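To make the "let the job run on that" step concrete, here is a minimal Python sketch. It assumes the MIG instance already exists and that `nvidia-smi -L` prints one indented `MIG ... (UUID: MIG-...)` line per instance; the sample listing and its UUIDs are made up, and the helper names `find_mig_uuids` and `env_for_mig` are hypothetical. It relies on CUDA accepting a MIG UUID in `CUDA_VISIBLE_DEVICES`.

```python
import os
import re

# Sample text in the shape printed by `nvidia-smi -L` on a MIG-enabled GPU.
# The UUIDs below are invented for illustration only.
SAMPLE_LISTING = """\
GPU 0: NVIDIA A100-PCIE-40GB (UUID: GPU-11111111-2222-3333-4444-555555555555)
  MIG 7g.40gb     Device  0: (UUID: MIG-aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee)
"""

# Matches the indented MIG lines and captures profile, index, and UUID.
MIG_LINE = re.compile(
    r"^\s*MIG\s+(\S+)\s+Device\s+(\d+):\s+\(UUID:\s*(MIG-[^)\s]+)\)"
)


def find_mig_uuids(listing):
    """Return (profile, uuid) pairs for every MIG line in the listing."""
    found = []
    for line in listing.splitlines():
        match = MIG_LINE.match(line)
        if match:
            found.append((match.group(1), match.group(3)))
    return found


def env_for_mig(uuid, base_env=None):
    """Copy an environment and pin CUDA to a single MIG instance."""
    env = dict(os.environ if base_env is None else base_env)
    env["CUDA_VISIBLE_DEVICES"] = uuid
    return env


if __name__ == "__main__":
    for profile, uuid in find_mig_uuids(SAMPLE_LISTING):
        print(profile, env_for_mig(uuid, {})["CUDA_VISIBLE_DEVICES"])
```

In practice the listing would come from running `nvidia-smi -L` on the machine, and the returned environment would be passed to whatever launches the job, for example via the `env` argument of `subprocess.run`.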
from nvtop.
@Greenscreen23 that is quite possibly what happened. Thanks, TIL.
from nvtop.
Hello,
I think that I might have to update the Nvidia backend to support MIG device handles.
Although there is a notice in the documentation: "In MIG mode, if device handle is provided, the API returns aggregate information, only if the caller has appropriate privileges. Per-instance information can be queried by using specific MIG device handles. Querying per-instance information using MIG device handles is not supported if the device is in vGPU Host virtualization mode."
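For reference, a rough sketch of what querying "specific MIG device handles" could look like. This is not nvtop's code (the backend is written in C); it uses the Python NVML bindings only for brevity. The function name `list_mig_handles` is hypothetical, and the bindings module is passed in as a parameter so the logic can be exercised without a GPU. The NVML calls themselves (`nvmlDeviceGetMigMode`, `nvmlDeviceGetMaxMigDeviceCount`, `nvmlDeviceGetMigDeviceHandleByIndex`) are the ones the bindings expose.

```python
def list_mig_handles(nvml, gpu_index):
    """Return the MIG device handles under one physical GPU.

    `nvml` is expected to be the pynvml module (or an object with the same
    functions). Returns an empty list when MIG is unsupported or disabled.
    """
    gpu = nvml.nvmlDeviceGetHandleByIndex(gpu_index)
    try:
        current_mode, _pending_mode = nvml.nvmlDeviceGetMigMode(gpu)
    except nvml.NVMLError:
        return []  # GPU or driver without MIG support
    if current_mode != nvml.NVML_DEVICE_MIG_ENABLE:
        return []

    handles = []
    for slot in range(nvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
        try:
            handles.append(nvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, slot))
        except nvml.NVMLError:
            continue  # no MIG device configured in this slot
    return handles
```

With the real bindings installed, this would be called as `list_mig_handles(pynvml, 0)` after `pynvml.nvmlInit()`. Which per-instance queries then succeed on the returned handles depends on the driver and on the restrictions quoted above.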
from nvtop.
Hi,
Updating the backend sounds like a good idea. I'll have access to the machine next Monday and will be glad to test any fixes :)
However, the documentation seems to suggest that one has to query multiple handles to get the total load of the GPU. Maybe it's easier to treat a GPU in MIG mode as a set of MIG instances instead of aggregating the load over all MIG devices? So if an A100 has MIG mode disabled, it is displayed as normal, and if it has MIG mode enabled, it is omitted from the visualization and each MIG device is treated like a separate GPU. Or we could show the MIG devices and the GPU with our own aggregated values. I think there might be value in visualizing each MIG device separately.
I have not yet looked into the code of nvtop, so I don't know how easy / hard this would be to implement, but I'd be happy to help :). I also don't know how well nvtop visualization scales with multiple GPUs. In my scenario, I might have 2x A100 split into 7 MIG devices each, resulting in 14 different devices.
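To illustrate the two display options described above, here is a small Python sketch. It is not based on nvtop's code: the `Device` type, the `display_rows` function, and the `show_parent` flag are all hypothetical, and memory use is picked as the aggregated value only because it can simply be summed across instances.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """A physical GPU or a MIG instance (hypothetical model, not nvtop's)."""
    name: str
    memory_used_mib: int
    memory_total_mib: int
    mig_instances: list = field(default_factory=list)


def display_rows(gpus, show_parent=False):
    """Build the list of devices to draw, one row per entry.

    A GPU without MIG instances is shown as-is. A GPU with MIG instances is
    replaced by its instances; with show_parent=True an extra row carrying
    the summed memory use of the instances is placed in front of them.
    """
    rows = []
    for gpu in gpus:
        if not gpu.mig_instances:
            rows.append(gpu)
            continue
        if show_parent:
            used = sum(m.memory_used_mib for m in gpu.mig_instances)
            rows.append(
                Device(gpu.name + " (aggregate)", used, gpu.memory_total_mib)
            )
        rows.extend(gpu.mig_instances)
    return rows
```

For the scenario mentioned here, two A100s with 7 MIG devices each would produce 14 rows, or 16 with the aggregate rows enabled.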
from nvtop.
Great, let me know if I can help :)
from nvtop.
All right. I have good news and bad news:
- The good: I updated the code to use the latest NVML API functions to retrieve Nvidia GPU info.
- The bad: it seems that retrieving per-process utilization is not supported in MIG mode (see comment in header). My last resort was to support accounting, but according to this comment it cannot be enabled in MIG mode either.
So I'm out of options for providing this info in MIG mode!
from nvtop.
No worries, thanks for trying!
Feel free to leave this issue as it is as a reminder, in case there is some nvidia update down the road, or close it if you want to mark it as currently impossible :)
from nvtop.