run-ai / rntop
A top-like tool for monitoring GPUs in a cluster
License: GNU General Public License v3.0
Is there any reason that the docker image for rntop leaves the user as root? This is contrary to security best practices.
A quick reading of the README and the source suggests that rntop works by ssh'ing to remote machines and running nvidia-smi there. This means that root is not needed on the local machine, on the remote machine, or within the container.
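The flow described above (ssh to a remote machine, run nvidia-smi there) can be sketched in a few lines of Python to show that no step requires elevated privileges. This is an illustration only, not rntop's actual implementation: the helper names `build_ssh_command` and `query_remote_gpus` are hypothetical, and the sketch assumes a local `ssh` client with key-based login as a regular user.

```python
import subprocess

# Standard nvidia-smi query flags; the field list is an illustrative choice.
NVIDIA_SMI = [
    "nvidia-smi",
    "--query-gpu=index,name,utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def build_ssh_command(target):
    """Return the argv that runs nvidia-smi on `target` (e.g. "user@machine")."""
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    return ["ssh", "-o", "BatchMode=yes", target, " ".join(NVIDIA_SMI)]


def query_remote_gpus(target, timeout=10):
    """Run nvidia-smi remotely as an unprivileged user and return its raw output."""
    result = subprocess.run(
        build_ssh_command(target),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return result.stdout
```

Nothing here touches privileged resources: ssh reads the calling user's own key, and nvidia-smi queries are normally available to regular users on the remote side.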
$ sudo docker run -it --rm -v ~/.ssh:/root/.ssh runai/rntop [email protected]
Error authenticating client side: Failed to read private key: /root/.ssh/id_rsa
terminate called after throwing an instance of 'std::exception'
what(): std::exception
Any idea how I can resolve this error?
Hi, I have an issue running rntop.
This is my setup:
The following command works:
docker run -it --rm -v $HOME/.ssh:/root/.ssh --entrypoint bash runai/rntop -c "ssh user@machine nvidia-smi"
According to the README, this means that the container can connect to the machine and it is the rntop application itself that can't.
However, when I run
sudo docker run -it --rm -v $HOME/.ssh:/root/.ssh --entrypoint bash runai/rntop:latest user@machine
it fails with the error "GPUs wmove() failed", followed by "terminate called after throwing an instance of 'std::exception'". The printed output also shows no cluster or node information.
I am not sure how to troubleshoot further. Any advice? Thanks
Is there any option to get GPU names in the list? Not the short names from the default nvidia-smi view, but the longer versions from nvidia-smi -L (e.g. there are 3+ Titan cards, such as TITAN X (Pascal) vs GeForce GTX TITAN X, whose short name is GeForce GTX TIT...).
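For reference, the full names printed by nvidia-smi -L are easy to extract. The sketch below is in Python rather than rntop's own code, and the function name `parse_gpu_list` is hypothetical. It relies on the usual `GPU <index>: <name> (UUID: ...)` line layout, and the UUIDs in the sample are made up.

```python
import re

# Matches lines such as: "GPU 0: TITAN X (Pascal) (UUID: GPU-...)"
_GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: [^)]+\)\s*$")


def parse_gpu_list(output):
    """Map GPU index -> full product name from `nvidia-smi -L` output."""
    names = {}
    for line in output.splitlines():
        match = _GPU_LINE.match(line.strip())
        if match:
            names[int(match.group(1))] = match.group(2)
    return names


# Sample text for illustration only; the UUIDs are placeholders.
sample = (
    "GPU 0: TITAN X (Pascal) (UUID: GPU-00000000-0000-0000-0000-000000000000)\n"
    "GPU 1: GeForce GTX TITAN X (UUID: GPU-11111111-1111-1111-1111-111111111111)\n"
)
print(parse_gpu_list(sample))
```

Because the name is captured lazily up to the trailing `(UUID: ...)` group, product names that contain parentheses themselves, such as TITAN X (Pascal), are kept whole.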
Hi!
Thanks for the great work. Could you consider adding GPU temperature monitoring to the tool? It would be helpful information for monitoring machine health.
Thanks!
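As a rough illustration of what this would involve, nvidia-smi already exposes the temperature through its `temperature.gpu` query field. The Python sketch below parses that output; it is not part of rntop, and the name `parse_temperatures` is hypothetical.

```python
import csv
import io

# Command whose output the parser below expects (standard nvidia-smi flags).
QUERY = (
    "nvidia-smi --query-gpu=index,name,temperature.gpu "
    "--format=csv,noheader,nounits"
)


def parse_temperatures(output):
    """Return (index, name, temperature in degrees C) tuples from QUERY output."""
    rows = []
    reader = csv.reader(io.StringIO(output), skipinitialspace=True)
    for fields in reader:
        if len(fields) != 3:
            continue  # blank line or an error message instead of CSV
        index, name, temp = (field.strip() for field in fields)
        try:
            rows.append((int(index), name, int(temp)))
        except ValueError:
            continue  # non-numeric value, e.g. a GPU without a readable sensor
    return rows


# Sample output for illustration only.
sample_temps = "0, TITAN X (Pascal), 45\n1, GeForce GTX TITAN X, 61\n"
print(parse_temperatures(sample_temps))
```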
Feature request: a way to switch to GB, a more human-readable form.
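The conversion itself is small. nvidia-smi reports memory in MiB, so a display toggle mostly needs a formatter like the hypothetical `format_mib` below. It is a Python sketch, not rntop code, and it uses the binary unit GiB (1024 MiB) rather than decimal GB, which is an assumption about what the request means.

```python
def format_mib(mib, decimals=1):
    """Format a MiB value as GiB once it reaches 1 GiB, otherwise keep MiB."""
    if mib >= 1024:
        return f"{mib / 1024:.{decimals}f} GiB"
    return f"{mib} MiB"


# Illustrative values only.
for value in (512, 11178, 12288):
    print(format_mib(value))
```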
Hi,
Thanks for the great tool. I'm running it for 9 machines, and inevitably sometimes some machines' nvidia-smi might be down. For example:
$ nvidia-smi
Failed to initialize NVML: Driver/library version mismatch
Then the
sudo docker run -it --rm -v $HOME/.ssh:/root/.ssh runai/rntop ...
will result in
terminate called after throwing an instance of 'std::out_of_range'
what(): vector::_M_range_check: __n (which is 1) >= this->size() (which is 1)
Is this the expected behavior, and any plans to fix it? Thanks!
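The std::out_of_range message indicates an index past the end of a vector. Whether that comes from parsing the failed node's output is an assumption, but the general remedy is to validate each node's output before indexing into it and to report broken nodes separately. A Python sketch of that pattern follows; the names `parse_node_output` and `summarize` and the three-column layout are hypothetical, and this is not rntop's code.

```python
def parse_node_output(output):
    """Return [(index, used_mib, total_mib), ...] or None if the output is not valid CSV."""
    gpus = []
    for line in output.splitlines():
        if not line.strip():
            continue  # ignore blank lines
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            return None  # e.g. "Failed to initialize NVML: ..."
        try:
            gpus.append(tuple(int(field) for field in fields))
        except ValueError:
            return None  # a column that should be numeric is not
    return gpus or None  # an empty result also counts as a failure


def summarize(cluster):
    """Split a {node: raw_output} mapping into healthy nodes and failed nodes."""
    healthy, failed = {}, []
    for node, output in cluster.items():
        gpus = parse_node_output(output)
        if gpus is None:
            failed.append(node)
        else:
            healthy[node] = gpus
    return healthy, failed


# Illustrative input: one working node and one with the NVML error quoted above.
cluster = {
    "node1": "0, 1024, 12288\n",
    "node2": "Failed to initialize NVML: Driver/library version mismatch\n",
}
print(summarize(cluster))
```

With this structure a single broken machine ends up in the failed list instead of terminating the whole monitor.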