
vesta's Introduction

Requirements

This script depends on Python 3 and the nvidia-smi, awk, and ps commands.

Setup

pip install -r requirements.txt

If a package is missing, please install it yourself using pip.
You might also need to adjust some configurations for your own environment.

Configuration

gpu_status_server.py and gpu_info_sender.py accept many options for changing settings, and you can also use a .yaml file to overwrite the arguments.
Use -h to see all arguments, and use the --local_settings_yaml_path option to point at the overwrite file.
The .yaml file can be written like:

# every key must be in capital letters

# this IP address is the server address that the clients send the GPU information to
IP: "192.168.0.1"

# server's open port
PORT_NUM: 8080

MAIN_PAGE_TITLE: "AWSOME GPUs"
MAIN_PAGE_DESCRIPTION: "awsome description"
TABLE_PAGE_TITLE: "AWSOME Table"
TABLE_PAGE_DESCRIPTION: "awsome description"

# DD/MM/YYYY
TIMESTAMP_FORMAT: "DMY"

# this will filter the networks
# it will be fed into Python's `re.search()`, so you can use regular expressions
VALID_NETWORK: "192.168.11.(129|1[3-9][0-9]|2[0-5][0-9])"
# this allows 192.168.11.129~255
...

An example is in example/local_settings.yaml
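As a sketch of how the VALID_NETWORK filter behaves, the pattern above can be checked against client addresses with `re.search()` (the helper function name here is hypothetical, not vesta's actual code):

```python
import re

# VALID_NETWORK value from the example configuration above
VALID_NETWORK = r"192.168.11.(129|1[3-9][0-9]|2[0-5][0-9])"

def is_valid_host(ip: str) -> bool:
    # the pattern is fed to re.search(), matching anywhere in the address
    return re.search(VALID_NETWORK, ip) is not None

print(is_valid_host("192.168.11.130"))  # True
print(is_valid_host("192.168.11.128"))  # False (below the allowed 129~255 range)
```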

nvidia-smi's information printing format has changed between driver versions, so you need to specify a parsing version for the client script (which sends the GPU information).
Please specify the format version (1 or 2) using --nvidia-smi_parse_version, or set NVIDIA_SMI_PARSE_VER in the local .yaml file.

Version 1 is for the following format:

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0     16163    C   python                                         240MiB |
|    1     16163    C   python                                        8522MiB |
+-----------------------------------------------------------------------------+

Version 2 is for the following format (this is now the default):

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     24898      C   python                          17939MiB |
|    1   N/A  N/A     24899      C   python                          17063MiB |
+-----------------------------------------------------------------------------+
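To illustrate what parsing the version-2 layout involves, the sketch below extracts the fields of one process line with a regular expression. The pattern and field names are assumptions following the table header shown above, not vesta's actual parser:

```python
import re

# a process line from the version-2 nvidia-smi layout shown above
line = "|    0   N/A  N/A     24898      C   python                          17939MiB |"

# hypothetical parser: GI ID and CI ID columns are skipped with \S+
pattern = re.compile(
    r"\|\s+(?P<gpu>\d+)\s+\S+\s+\S+\s+(?P<pid>\d+)\s+(?P<type>\S+)\s+"
    r"(?P<name>.+?)\s+(?P<mem>\d+)MiB\s+\|"
)
m = pattern.match(line)
print(m.group("gpu"), m.group("pid"), m.group("name"), m.group("mem"))
# → 0 24898 python 17939
```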

Usage

You can use the simple wrappers.
For the server:

python gpu_status_server.py

For the nodes:

python gpu_info_sender.py

For automation, systemd and crontab will do the job.
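As a sketch, a crontab entry for a node might look like the following; the install path and Python binary are assumptions for illustration, not part of vesta:

```shell
# hypothetical crontab entry: post GPU information every 5 minutes
*/5 * * * * cd /opt/vesta && /usr/bin/python3 gpu_info_sender.py
```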

from Terminal

To get GPU information from a terminal app, use curl to access http://<server_address>/?term=true.
You will get output like:

$ curl "http://0.0.0.0:8080/?term=true"
+------------------------------------------------------------------------------+
| vesta ver. 1.2.4                                                   gpu info. |
+------------------+------------------------+-----------------+--------+-------+
| host             | gpu                    | memory usage    | volat. | temp. |
+------------------+------------------------+-----------------+--------+-------+
|mau_local         | 0:GeForce GTX 1080 Ti  |   8018 /  11169 |    92 %|  80 °C|
|                  | 1:GeForce GTX 1080 Ti  |      2 /  11172 |     0 %|  38 °C|
+------------------+------------------------+-----------------+--------+-------+
|mau_local_11a7c5eb| 0:GeForce GTX 1080 Ti  |   2400 /  11169 |    78 %|  79 °C|
|                  | 1:GeForce GTX 1080 Ti  |      2 /  11172 |     0 %|  38 °C|
+------------------+------------------------+-----------------+--------+-------+
|mau_local_ac993634| 0:GeForce GTX 1080 Ti  |   8897 /  11169 |    98 %|  82 °C|
|                  | 1:GeForce GTX 1080 Ti  |      2 /  11172 |     0 %|  38 °C|
|                  | 2:GeForce GTX 1080 Ti  |      2 /  11172 |     0 %|  36 °C|
|                  | 3:GeForce GTX 1080 Ti  |      2 /  11172 |     0 %|  40 °C|
+------------------+------------------------+-----------------+--------+-------+

If you want to see detailed information, use the detail option: http://<server_address>/?term=true&detail=true.
You will get output like:

$ curl "http://0.0.0.0:8080/?term=true&detail=true"
vesta ver. 1.2.4

#### mau_local :: 127.0.0.1 ####################################################
  last update: 24/03/2019 20:27:10
--------------------------------------------------------------------------------
  ┌[ gpu:0 GeForce GTX 1080 Ti 2019/03/24 20:00:00.000 ]─────────────────────┐
  │      memory used  memory available  gpu volatile  temperature            │
  │  8018 / 11169MiB           3151MiB           92%         80°C            │
  │                                                                          │
  │ mem [///////////////////////////////////////////                  ]  71% │
  │  ├── train1                      6400MiB user1                           │
  │  └── train2                      1618MiB user1                           │
  └──────────────────────────────────────────────────────────────────────────┘

  ┌[ gpu:1 GeForce GTX 1080 Ti 2019/03/24 20:00:00.000 ]─────────────────────┐
  │      memory used  memory available  gpu volatile  temperature            │
  │     2 / 11172MiB          11170MiB            0%         38°C            │
  │                                                                          │
  │ mem [                                                             ]   0% │
  └──────────────────────────────────────────────────────────────────────────┘

________________________________________________________________________________
.
.
.

The server also provides host data as JSON. Access http://<server_address>/states/, or http://<server_address>/states/<host_name>/ for a specific host.
You can use the URL parameter fetch_num=<number you want> to choose how many log entries to fetch.
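A minimal consumer of this endpoint might look like the sketch below. The `fetch_states` helper and the server address are assumptions for illustration; the dict shape follows the API Response section later in this README:

```python
import json
import urllib.request

def latest_timestamps(states: dict) -> dict:
    """Map each host to its most recent server-recorded timestamp."""
    # "data" is in ascending time order, so the last entry is the newest
    return {host: info["data"][-1]["timestamp"] for host, info in states.items()}

def fetch_states(server: str, fetch_num: int = 1) -> dict:
    # hypothetical helper; <server_address> is whatever your server binds to
    url = f"http://{server}/states/?fetch_num={fetch_num}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

# sample shaped like the API Response section of this README
states = {"host1": {"data": [{"timestamp": 20181130232947}], "ip_address": "127.0.0.1"}}
print(latest_timestamps(states))  # {'host1': 20181130232947}
```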

from Web Browser

Just access http://<server_address>/
You will get something like:
sample web browser image

API Response

You can get the GPU information by accessing http://<server_address>/states/.
The JSON response looks like:

{
    "host1":{
        # the data are in ascending order in time
        "data":
            # host_name logs are in an array
            [
                {   # each GPU is denoted by "gpu:<device_num>"
                    "gpu_data":{
                        "gpu:0":{
                            "available_memory": "10934",
                            "device_num": 0,
                            "gpu_name": "GeForce GTX 1080 Ti",
                            "gpu_volatile": 92,
                            "processes": [
                                  {
                                    "name": "train1",
                                    "pid": "31415",
                                    "used_memory": 6400,
                                    "user": "user1"
                                  },
                                  {
                                    "name": "train2",
                                    "pid": "27182",
                                    "used_memory": 1618,
                                    "user": "user1"
                                  }
                            ],
                            "temperature": 80,
                            "timestamp": "2019/03/24 20:00:00.000",
                            "total_memory": 11169,
                            "used_memory": 8018,
                            "uuid": "GPU-..."
                        },
                        "gpu:1":{
                            "available_memory": "11170",
                            "device_num": "1",
                            "gpu_name": "GeForce GTX 1080 Ti",
                            "gpu_volatile": "0",
                            "processes": [],
                               .
                               .
                               .

                        }
                    },
                    "timestamp": 20181130232947 # server-recorded timestamp, YYYYMMDDhhmmss
                }
            ],
        "ip_address": "127.0.0.1" # host IP address
    },
    "host2":{...}
}
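As a sketch of working with one host's entry, the helper below (hypothetical, not part of vesta) sums used memory across all GPUs in the newest log entry:

```python
def total_used_memory(host_entry: dict) -> int:
    """Sum used_memory over all GPUs in the newest log entry of one host."""
    newest = host_entry["data"][-1]  # data is in ascending time order
    return sum(int(gpu["used_memory"]) for gpu in newest["gpu_data"].values())

# minimal sample mirroring the response structure above
host = {"data": [{"gpu_data": {
    "gpu:0": {"used_memory": 8018},
    "gpu:1": {"used_memory": 2},
}, "timestamp": 20190324200000}], "ip_address": "127.0.0.1"}
print(total_used_memory(host))  # 8020
```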

Slack Notification

If you set up Slack's webhook and bot settings, you can receive notifications via Slack.

up and down

sample notification image

interact with bot

sample interact image

To configure Slack, use --slack_webhook, --slack_bot_token, and --slack_bot_post_channel with gpu_status_server.py.
Or you can use a .yaml file; see example/local_settings.yaml

Topology

The topology is a very simple master (server) and slave (each local machine) style, but it is ad hoc.
The server simply waits for the slaves to post their GPU information.

Database

machine table

The machine table is a lookup table from hash_code (id) to host name.
The table fields are:

id (TEXT)     name (TEXT)   ip_address (TEXT)
hash_code_1   host_1        host_1_ip
hash_code_2   host_2        host_2_ip
...           ...           ...
hash_code_n   host_n        host_n_ip

hash_code is generated by the Python code

hash_code = random.getrandbits(128)
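A minimal sketch of the lookup, assuming an sqlite3 database (the in-memory connection and host values here are illustrative; vesta's actual database path and schema handling may differ):

```python
import random
import sqlite3

conn = sqlite3.connect(":memory:")  # vesta's actual database file path may differ
conn.execute("CREATE TABLE machine (id TEXT, name TEXT, ip_address TEXT)")

# id is a 128-bit random hash_code, stored as TEXT
hash_code = random.getrandbits(128)
conn.execute(
    "INSERT INTO machine VALUES (?, ?, ?)",
    (str(hash_code), "host_1", "192.168.11.130"),
)

# look up the host name from the hash_code
name, = conn.execute(
    "SELECT name FROM machine WHERE id = ?", (str(hash_code),)
).fetchone()
print(name)  # host_1
```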

{host} table

Each host has its own table for logging.
The table fields are:

timestamp (INTEGER)   data (BLOB)
timestamp_1           data_1
timestamp_2           data_2
...                   ...
timestamp_n           data_n

timestamp is based on the server's time zone, in "YYYYMMDDhhmmss" style.
data is a Python dict object, serialized and compressed with Python's pickle and bz2.
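The serialization round trip can be sketched as follows; the sample dict contents are illustrative, not vesta's exact log format:

```python
import bz2
import pickle
from datetime import datetime

# timestamp in the server's time zone, "YYYYMMDDhhmmss" style
timestamp = int(datetime.now().strftime("%Y%m%d%H%M%S"))

gpu_log = {"gpu:0": {"used_memory": 8018, "temperature": 80}}

# serialize with pickle, then compress with bz2 (as the {host} table stores it)
blob = bz2.compress(pickle.dumps(gpu_log))

# reading a row back: decompress, then unpickle
restored = pickle.loads(bz2.decompress(blob))
print(restored == gpu_log)  # True
```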

vesta's People

Contributors

a-maumau


vesta's Issues

todo

add user information to the terminal format

  • from v0.5.3

✅ add left ellipsis

  • from nvidia-smi, long command names already seem to be ellipsized on the left.
  • from v0.5.3

add more flexible setting/configuration

  • currently, the server always needs to be restarted.
  • add theme or color configuration.
  • we can check the setting file's modification date to detect updates.

✅ add no-log-recording mode (partially implemented -> added an interval time for storing data.)

  • for high-frequency logging, it might be a bottleneck and consume a large amount of storage space.

  • we actually don't need the logs (?).

  • or record at some interval.

    added an interval and made it possible to force no recording. (v0.5.1 >=)

✅ fix websocket reconnection

  • it does not seem to work in the current implementation.

  • the state of the websocket object needs to be watched.

    fixed ٩( 'ω' )و✨

fix use of magic numbers.

✅ fix typo in LICENSE
fixed ٩( 'ω' )و✨

CVE-2017-18342

Upgrade pyyaml to version 4.2b1;
note that the latest stable release is 3.13.

Suggest to loosen the dependency on schedule

Hi, your project vesta (commit id: 658b9f1) requires "schedule==0.5.0" in its dependencies. After analyzing the source code, we found that the following versions of schedule are also suitable, i.e., schedule 0.4.3, 0.6.0, and 1.0.0, since all functions that you use directly (1 API: schedule.init.run_pending) or indirectly (propagating to 3 of schedule's internal APIs and 0 outside APIs) are unchanged in these versions, so your usage is not affected.

Therefore, we believe it is quite safe to loosen your dependency on schedule from "schedule==0.5.0" to "schedule>=0.4.3,<=1.0.0". This will improve the applicability of vesta and reduce the possibility of dependency conflicts with other projects.

May I open a pull request to loosen the dependency on schedule?

By the way, could you please tell us whether such an automatic tool for dependency analysis might be helpful for maintaining dependencies during your development?
