blox's Introduction

Blox

This repository contains the source code implementation of the EuroSys 2024 paper "Blox: A Modular Toolkit for Deep Learning Schedulers". This work was done as part of Microsoft Research's Project Fiddle. The source code is available under the MIT License.

Blox provides a modular framework for implementing deep learning research schedulers. Its abstractions can be easily swapped or modified, enabling researchers to implement novel scheduling and placement policies.

Abstractions Provided

  • Job Admission Policy - Allows researchers to implement any admission policy and provides an interface to accept jobs.
  • Cluster Management - Handles the addition and removal of available nodes.
  • Job Scheduling Policy - Implements the scheduling policy, i.e., decides which jobs to run.
  • Job Placement Policy - Implements the placement policy, i.e., decides which specific GPUs a job runs on (see the sketch after this list).
  • Job Launch and Preemption - Launches or preempts specific jobs on any node in the cluster.
  • Metric Collection - Collects any metrics needed to make informed scheduling or placement decisions.
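
To give a flavor of how these pieces compose, here is a minimal sketch of a placement policy in Python. The class name, method signature, and job fields are illustrative assumptions, not Blox's actual API; see the /placement directory for the real interfaces.

# A minimal placement-policy sketch (hypothetical interface, not Blox's actual API).
class FirstFitPlacement:
    """Decide which specific GPUs a scheduled job runs on."""

    def place(self, job, free_gpu_ids):
        # First-fit: hand the job the first num_gpus free GPU IDs.
        if len(free_gpu_ids) < job["num_gpus"]:
            return None  # not enough free GPUs; the job waits this round
        return free_gpu_ids[: job["num_gpus"]]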

Using blox

Components of Blox are designed to be easily swapped based on different objectives. However, in our experience, most existing deep learning schedulers can be implemented by adding a new scheduling policy and modifying the placement policy.

Blox Simulator

For large-scale experiments it is common practice to simulate how a policy will behave under different loads. Blox provides a built-in simulator for this purpose. The simulator is implemented in simulator.py; researchers can configure the load values and load arrival patterns.

Blox utilities

Blox ships with several plotting and metric-parsing utilities. Based on the configuration, Blox will automatically output metrics such as Job Completion Times (JCTs) and JCT CDFs.
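
For reference, a JCT CDF is simply the empirical distribution of job completion times. The standalone sketch below (not Blox's actual plotting code) shows how such a curve is computed with Matplotlib, which Blox already depends on; the sample values are made up.

import numpy as np
import matplotlib.pyplot as plt

# Illustrative only: plot an empirical CDF of job completion times.
jcts = np.sort(np.array([120.0, 300.0, 450.0, 900.0, 1800.0]))  # dummy JCTs in seconds
fractions = np.arange(1, len(jcts) + 1) / len(jcts)             # cumulative fraction of jobs
plt.step(jcts, fractions, where="post")
plt.xlabel("Job Completion Time (s)")
plt.ylabel("Fraction of jobs completed")
plt.savefig("jct_cdf.png")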

Writing a new scheduler in Blox

To implement a new scheduler in Blox, a user first needs to determine which part of the scheduler they want to modify.

Once they have determined the specific location of their contribution, the following guide tells them which code to modify.

The files are located as follows:

  • Scheduling Policy - /schedulers
  • Placement Policy - /placement
  • Workload Policy - /workload
  • Admission Policy - /admission_control

For an example, see las_scheduler.py, which implements the Least Attained Service (LAS) scheduler.
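
To illustrate the shape of such a policy, here is a minimal LAS-style sketch. The function signature and job fields are assumptions made for illustration; las_scheduler.py defines the actual interface.

# A minimal LAS-style scheduling sketch (hypothetical interface).
def schedule(job_queue, free_gpus):
    """Pick the jobs to run this round, least attained service first."""
    # LAS: jobs that have received the least GPU time so far run first.
    ordered = sorted(job_queue, key=lambda job: job["attained_service"])
    to_run = []
    for job in ordered:
        if job["num_gpus"] <= free_gpus:
            to_run.append(job["job_id"])
            free_gpus -= job["num_gpus"]
    return to_run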

Running Blox

Blox has two modes of operation: running real cluster workloads and running simulations.

Simulation Mode

The simplest way to get started with Blox is in simulation mode.

The following commands run the LAS scheduling policy in simulation mode on the Philly trace, with jobs arriving at an average load of 1.0 jobs per hour.

In one terminal, launch:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python simulator_simple.py --cluster-job-log ./cluster_job_log --sim-type trace-synthetic --jobs-per-hour 1 --exp-prefix test

In a second terminal, launch:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python las_scheduler.py --simulate --load 1 --exp-prefix test

Make sure only one instance of each is running per machine/Docker container. We use fixed gRPC ports to communicate; if more than one instance is launched, there can be unintended consequences.

Cluster mode

On the node where you are running the scheduler, launch:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python las_scheduler.py --exp-prefix cluster_run

On each node in the cluster, launch:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python node_manager.py --ipaddr ip_address_scheduler

In certain cases you may want to explicitly specify the interface the node manager should bind to. For example, on AWS you might want to select eth0 to bind to the local IP for communication; similarly, on CloudLab the preferred interface is enp94s0f0. In those cases, launch the node manager as follows:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python node_manager.py --ipaddr ip_address_scheduler --interface interface_name

Reproducing the artifact results

These are the instructions for reproducing the Blox paper's artifact results.

Installation

Blox uses gRPC for communication and Matplotlib for plotting the collected metrics. We suggest creating a virtual environment before installing the dependencies.

pip install grpcio
pip install matplotlib
pip install pandas==1.3.0
pip install grpcio-tools

Running Blox Code

To run a simulation, in one terminal window:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python simulator.py --cluster-job-log ./cluster_job_log --sim-type trace-synthetic --jobs-per-hour 6 --exp-prefix test

In a second terminal window:

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python blox_new_flow_multi_run.py --simulate --load 6 --exp-prefix test

The above experiment takes around 8 hours to run and generates the JCT CDFs, JCTs, and runtimes for the FIFO, LAS, and Optimus schedulers, as in Figure 6.

The following runs the LAS scheduler with different acceptance policies; this provides the average JCTs for Figure 12 and Figure 13.

Replicating only Figure 12

In one terminal

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python simulator_acceptance_policy.py --cluster-job-log ./cluster_job_log --sim-type trace-synthetic --jobs-per-hour 6 --exp-prefix test

In a second terminal

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python blox_new_flow_multi_run.py --simulate --load 6 --exp-prefix test

Replicating only Figure 13

In one terminal

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python simulator_dual_load.py --cluster-job-log ./cluster_job_log --sim-type trace-synthetic --jobs-per-hour 6 --exp-prefix test

In a second terminal

PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python python blox_new_flow_multi_run.py --simulate --load 6 --exp-prefix test

Running Multiple Solutions in Parallel

Blox supports running multiple simulations on the same machine; users just need to specify distinct port numbers. Here is an example of running multiple simulations. Run the following commands in different terminals.

Running the first simulation:

python simulator_simple.py --cluster-job-log ./cluster_job_log --sim-type trace-synthetic --jobs-per-hour 1 --exp-prefix test --simulator-rpc-port 50511
python las_scheduler.py --simulate --load 1 --exp-prefix test --simulator-rpc-port 50511

Running the second simulation:

python simulator_simple.py --cluster-job-log ./cluster_job_log --sim-type trace-synthetic --jobs-per-hour 1 --exp-prefix test --simulator-rpc-port 50501
python las_scheduler.py --simulate --load 1 --exp-prefix test --simulator-rpc-port 50501

Running cluster experiments in Blox

Blox allows users to run experiments on a real cluster. However, this requires additional setup.

First, on one node, launch the scheduler you would like to run. For example:

python las_scheduler_cluster.py --round-duration 30 --start-id-track 0 --stop-id-track 10

The above command will launch the cluster scheduler.

Next, on each node you plan to run jobs on, start redis-server and launch the node manager.

redis-server

After starting redis-server, launch the node manager:

python node_manager.py --ipaddr {ip_addr_of_scheduler} --interface {network_interface_to_use}

Starting the node manager takes two mandatory arguments: the IP address of the scheduler and the network interface you want the node manager to use. To list the available network interfaces, run ip a.

Finally, you need to send jobs to the scheduler. For an example of how to submit jobs, see job_submit_script.py.

To launch a job on the cluster, two fields are mandatory. The first is launch_command, which gives the command to launch. You can additionally pass command-line arguments through the launch_params key, as done in job_submit_script.py. For each application, Blox also sets the environment variables listed below. For multi-worker jobs, Blox launches the command multiple times, each time with different environment variables; users are responsible for reading these variables and configuring their applications accordingly. The environment variables and their descriptions are (see the sketch after this list):

local_gpu_id: the local GPU ID this job runs on
master_ip_address: for a distributed job, the master IP address to use
world_size: the total number of workers running this command
dist_rank: the rank of the current process
job_id: the job ID assigned to this particular job by Blox
local_accessible_gpu: all GPUs that can be accessed by this job; especially useful for setting CUDA_VISIBLE_DEVICES
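
As a sketch of how a submitted job and its training script fit together, the snippet below builds a job description with the two mandatory keys and reads the environment variables above. The dictionary layout and the value formats are illustrative assumptions; job_submit_script.py shows the actual submission format.

import os

# Hypothetical job description using the two mandatory keys
# (illustrative; see job_submit_script.py for the real format).
job = {
    "launch_command": "python train.py",
    "launch_params": ["--batch-size", "64"],
}

# Inside the launched training script, the Blox-set environment
# variables can then be read as follows (value formats are assumptions):
local_gpu_id = os.environ["local_gpu_id"]    # local GPU this worker uses
master_ip = os.environ["master_ip_address"]  # rendezvous address for distributed jobs
world_size = int(os.environ["world_size"])   # total number of workers
rank = int(os.environ["dist_rank"])          # rank of the current process
job_id = os.environ["job_id"]                # Blox-assigned job id
# Restrict visible devices to the GPUs Blox assigned to this job.
os.environ["CUDA_VISIBLE_DEVICES"] = os.environ["local_accessible_gpu"]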


blox's Issues

The given execution code for reproduction does not work

Traceback (most recent call last):
  File "blox_new_flow_multi_run.py", line 196, in <module>
    main(args)
  File "blox_new_flow_multi_run.py", line 25, in main
    blox_mgr = BloxManager(args)
  File "/home/wxh/blox/blox/blox_manager.py", line 41, in __init__
    node_manager_port=args.node_manager_port
AttributeError: 'Namespace' object has no attribute 'node_manager_port'
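
A plausible cause is that blox_new_flow_multi_run.py never registers the flag that BloxManager reads. A hedged sketch of the kind of fix (the default port is an assumption; argparse maps --node-manager-port to args.node_manager_port):

import argparse

parser = argparse.ArgumentParser()
# Register the missing flag so args.node_manager_port exists.
parser.add_argument(
    "--node-manager-port",
    type=int,
    default=50052,  # assumed default; match the port your node manager uses
    help="port on which the node manager listens",
)
args = parser.parse_args()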

Metric Fix

We currently track attained service from the perspective of the scheduler.

This does not account for the time that scheduler operations take. The metric should be augmented with whatever additional information is necessary to reflect the real time a job has used on the cluster.

Job Preemption to finish before new launch

In an earlier version of Blox, we used to check whether the GPU was free before launching. To support packing, we disabled this check.

I want to add a new check that makes sure a job has freed the GPU before new jobs are launched on it.
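
A sketch of what such a check might look like, assuming the gpu_df cluster-state dataframe mentioned elsewhere on this page carries a per-GPU in-use flag (the column names are assumptions):

import pandas as pd

# Hypothetical check: launch only when every GPU assigned to the new job
# has been released by its previous (preempted) job.
def safe_to_launch(gpu_df: pd.DataFrame, gpu_ids: list) -> bool:
    rows = gpu_df[gpu_df["GPU_ID"].isin(gpu_ids)]
    return not rows["IN_USE"].any()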

checkpoint for real cluster exp

Is it possible to set a checkpoint for a real cluster experiment? For example, suppose we need to do 100 rounds of scheduling, the scheduler has already finished 21 rounds, and we hit an issue at round 22. I am wondering whether we can restart the real cluster experiment from round 21 by saving a checkpoint. Thanks!

API for Job Launch

In the current setup, users typically have to go and change the job-launch parameters by hand, depending on how they want the job launched: https://github.com/msr-fiddle/blox/blob/main/blox/deployment/grpc_client_rm.py#L28

This is ugly. Here are a couple of alternatives, in order of preference based on some internal discussion:

(i) Set environment variables that applications can read. We need to test whether these environment variables get overwritten by the subprocess environment.

(ii) Set the environment in Redis on the node manager.

(iii) Provide a function that users can pass to Blox, which they can use to massage the data into the form they want.

Adding GPU UUID to cluster state when running real cluster experiments

nvidia-smi -L

In shell script:
UUID_list=(`nvidia-smi -L | awk '{print $NF}' | tr -d '[)]'`)

can be used to produce a list of GPU UUIDs on a node using nvidia-smi. Requesting the addition of UUIDs to the cluster state so that they can be used to update gpu_df with the appropriate attributes.
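
For reference, a Python equivalent of the shell one-liner above (illustrative; the parsing assumes the standard nvidia-smi -L output format):

import subprocess

# Parse GPU UUIDs from `nvidia-smi -L`, whose lines look like:
#   GPU 0: Tesla V100-SXM2-16GB (UUID: GPU-8f4d...)
def gpu_uuids():
    out = subprocess.check_output(["nvidia-smi", "-L"], text=True)
    return [
        line.split("UUID:")[1].strip(" )")
        for line in out.splitlines()
        if "UUID:" in line
    ]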
