gpudash

The gpudash command displays a GPU utilization dashboard in text (no graphics) for the last hour:

[screenshot: gpudash example]

The dashboard can be generated for a specific user:

[screenshot: gpudash user example]

The gpudash command is part of the Jobstats platform. Here is the help menu:

usage: gpudash [-h] [-u NETID] [-n] [-c]

GPU utilization dashboard for the last hour

optional arguments:
  -h, --help       show this help message and exit
  -u NETID         create dashboard for a single user
  -n, --no-legend  flag to hide the legend

Utilization is the percentage of time during a sampling window (< 1 second) that
a kernel was running on the GPU. The format of each entry in the dashboard is
username:utilization (e.g., aturing:90). Utilization varies between 0 and 100%.

Examples:

  Show dashboard for all users:
    $ gpudash

  Show dashboard for the user aturing:
    $ gpudash -u aturing

  Show dashboard for all users without displaying legend:
    $ gpudash -n

Getting Started

The gpudash command builds on the Jobstats platform. The software requires Python 3.6+ and version 1.17+ of the Python blessed package.
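
A quick way to check both requirements (this assumes blessed exposes __version__, which recent releases do):

$ python3 -c "import sys, blessed; print(sys.version.split()[0], blessed.__version__)"

If blessed is missing or too old, it can be installed from PyPI with, for example, python3 -m pip install 'blessed>=1.17'.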

1. Create a script to pull data from Prometheus

The query_prometheus.sh script below makes three queries to Prometheus and removes old data files. It then calls the extract.py Python script, which extracts the data and writes column files. The column files are read by gpudash.

$ cat query_prometheus.sh
#!/bin/bash

printf -v SECS '%(%s)T' -1
DATA='/path/to/gpudash/data'
PROM_QUERY='http://vigilant2.sn17:8480/api/v1/query?'

curl -s ${PROM_QUERY}query=nvidia_gpu_duty_cycle > ${DATA}/util.${SECS}
curl -s ${PROM_QUERY}query=nvidia_gpu_jobUid     > ${DATA}/uid.${SECS}
curl -s ${PROM_QUERY}query=nvidia_gpu_jobId      > ${DATA}/jobid.${SECS}

# remove any files that are more than 70 minutes old
find ${DATA} -type f -mmin +70 -exec rm -f {} \;

# extract the data from the Prometheus queries, convert UIDs to usernames, write column files
python3 /path/to/extract.py
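
Before adding the cron entry in step 3, the script can be run once by hand to confirm that the query files (util.*, uid.*, jobid.*) and the column files appear in the data directory:

$ chmod 755 query_prometheus.sh
$ ./query_prometheus.sh
$ ls /path/to/gpudash/data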

Be sure to customize nodelist in extract.py for the given system. Running the above Bash script will generate column files in the following format:

$ head -n 5 column.1
{"timestamp": "1678144802", "host": "comp-g1", "index": "0", "user": "ft130", "util": "92", "jobid": "46034275"}
{"timestamp": "1678144802", "host": "comp-g1", "index": "1", "user": "ft130", "util": "99", "jobid": "46015684"}
{"timestamp": "1678144802", "host": "comp-g1", "index": "2", "user": "ft130", "util": "99", "jobid": "46015684"}
{"timestamp": "1678144802", "host": "comp-g1", "index": "3", "user": "kt415", "util": "44", "jobid": "46048505"}
{"timestamp": "1678144802", "host": "comp-g2", "index": "0", "user": "kt415", "util": "82", "jobid": "46015407"}

The column files are read by gpudash to generate the dashboard.
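
For orientation, here is a minimal sketch of the kind of processing such a script performs (extract.py itself ships with gpudash and may differ). The Prometheus label names (instance, minor_number), the number of column files, and the rotation scheme below are assumptions, not taken from the source:

# extract_sketch.py -- illustrative only; see extract.py for the real logic
import glob
import json
import os

DATA = "/path/to/gpudash/data"
UID2USER = "/path/to/uid2user.csv"
NUM_COLS = 7                       # assumed: seven 10-minute samples span the hour
nodelist = ["comp-g1", "comp-g2"]  # customize for the given system

# map UIDs to usernames using the CSV from step 2
uid2user = {}
with open(UID2USER) as f:
    for line in f:
        uid_str, user = line.strip().split(",")
        uid2user[uid_str] = user

def newest(prefix):
    """Return the most recent query file written by query_prometheus.sh."""
    files = sorted(glob.glob(os.path.join(DATA, prefix + ".*")),
                   key=lambda p: int(p.rsplit(".", 1)[1]))
    return files[-1]

def parse(path):
    """Map (host, GPU index) to the metric value of an instant-query result."""
    with open(path) as f:
        payload = json.load(f)
    out = {}
    for item in payload["data"]["result"]:
        host = item["metric"]["instance"].split(":")[0]  # assumed label name
        index = item["metric"]["minor_number"]           # assumed label name
        out[(host, index)] = item["value"][1]            # value is [time, "str"]
    return out

util_file = newest("util")
util = parse(util_file)
uid = parse(newest("uid"))
jobid = parse(newest("jobid"))
timestamp = util_file.rsplit(".", 1)[1]

# rotate the existing column files (column.1 becomes column.2, and so on)
for i in range(NUM_COLS - 1, 0, -1):
    src = os.path.join(DATA, f"column.{i}")
    if os.path.exists(src):
        os.replace(src, os.path.join(DATA, f"column.{i + 1}"))

# write the newest sample as column.1, one JSON entry per GPU
with open(os.path.join(DATA, "column.1"), "w") as f:
    for host in nodelist:
        for (h, index), u in sorted(util.items()):
            if h != host:
                continue
            entry = {"timestamp": timestamp, "host": h, "index": index,
                     "user": uid2user.get(uid.get((h, index)), "unknown"),
                     "util": u, "jobid": jobid.get((h, index), "")}
            f.write(json.dumps(entry) + "\n")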

2. Generate a CSV file containing UIDs and the corresponding usernames

Here is a sample of the file:

$ head -n 5 uid2user.csv
153441,ft130
150919,lc455
224256,sh235
329819,bb274
347117,kt415

The above file can be generated by running the following command:

$ getent passwd | awk -F":" '{print $3","$1}' > /path/to/uid2user.csv
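
An equivalent approach is a short Python script that reads the password database through the standard pwd module (the output path is illustrative):

# uid2user.py -- write the UID-to-username CSV using only the standard library
import pwd

with open("/path/to/uid2user.csv", "w") as f:
    for entry in pwd.getpwall():  # one entry per account in the password database
        f.write(f"{entry.pw_uid},{entry.pw_name}\n")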

3. Create two entries in crontab

0,10,20,30,40,50 * * * * /path/to/query_prometheus.sh > /dev/null 2>&1
0 6 * * 1 getent passwd | awk -F":" '{print $3","$1}' > /path/to/uid2user.csv 2> /dev/null

The first entry above calls the script that queries the Prometheus server every 10 minutes. The second entry recreates the CSV file of UIDs and usernames every Monday at 6 AM.
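
Both entries belong in the crontab of an account that can write to the data directory; edit it with crontab -e. On cron implementations that support step values, the first schedule can be written more compactly:

*/10 * * * * /path/to/query_prometheus.sh > /dev/null 2>&1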

4. Download gpudash

gpudash is pure Python. Its only dependency is the blessed Python package. On Ubuntu Linux, this can be installed with:

$ apt-get install python3-blessed

Then put gpudash in a location like /usr/local/bin:

$ cd /usr/local/bin
$ wget https://raw.githubusercontent.com/PrincetonUniversity/gpudash/main/gpudash
$ chmod 755 gpudash

Next, edit gpudash by replacing cluster1 with the beginning of the login node name. Modify all_nodes to generate the list of compute node names. Lastly, set SBASE to the path containing the column files produced by extract.py, and make sure that the shebang line at the very top points to python3.
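
For example, on a cluster whose login nodes are named vigilant1, vigilant2, and so on, with GPU nodes comp-g1 through comp-g8, the edited lines might look roughly like the following (the node names and the exact variable layout inside gpudash are illustrative):

cluster = "vigilant"                             # was "cluster1": beginning of the login node name
all_nodes = [f"comp-g{i}" for i in range(1, 9)]  # generate the list of compute node names
SBASE = "/path/to/gpudash/data"                  # path to the column files written by extract.py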

With these steps in place, you can use the gpudash command:

$ gpudash

About the Design

The node names are entered directly in the script (i.e., all_nodes) rather than read from the Prometheus configuration file or taken from the output of the sinfo command. The code looks for data on each of the specified nodes and only updates the values for a given node when data is found. Calling sinfo has the disadvantage that no node names are available at all if the command fails, and one would also have to specify partitions. Reading the Prometheus server configuration file(s) is reasonable, but changes would be required if Prometheus were replaced with an alternative.

Troubleshooting

The two most common problems are (1) setting the correct paths throughout the procedure and (2) installing the Python blessed package.

Getting Help

Please post an issue to this repo. Extensions to the code are welcome via pull requests.
