blox's People

Contributors

iidsample, msr-fiddle, waterpine

blox's Issues

checkpoint for real cluster exp

Is it possible to checkpoint a real-cluster experiment? For example, suppose we need 100 rounds of scheduling, the scheduler has finished 21 rounds, and we hit an issue in round 22. Can we save a checkpoint and restart the real-cluster experiment from round 21? Thanks!
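To make the request concrete, here is a minimal sketch of round-level checkpointing on the scheduler side. It is not an existing Blox feature: `save_checkpoint`, `load_checkpoint`, and the shape of `state` are all hypothetical names chosen for illustration. It also covers only the scheduler's own bookkeeping; jobs already running on the cluster would need their own recovery.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, round_idx, state):
    """Atomically record the scheduler state reached after `round_idx` rounds."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file first so a crash never leaves a half-written checkpoint.
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump({"round": round_idx, "state": state}, f)
    os.replace(tmp_path, path)  # atomic rename over the previous checkpoint


def load_checkpoint(path):
    """Return (round_idx, state), or (0, None) when no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, None
    with open(path, "rb") as f:
        data = pickle.load(f)
    return data["round"], data["state"]
```

A scheduling loop could call `load_checkpoint` at startup, resume from the returned round, and call `save_checkpoint` at the end of every completed round.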

Adding GPU UUID to cluster state when running real cluster experiments

The following shell snippet produces a list of the GPU UUIDs on a node from the output of nvidia-smi -L:

UUID_list=(`nvidia-smi -L | awk '{print $NF}' | tr -d '[)]'`)

Requesting that the UUID be added to the cluster state so that it can be used to update gpu_df with the appropriate attributes.
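Since Blox is written in Python, the same extraction could also be done without a shell pipeline. The sketch below assumes `nvidia-smi -L` prints one line per device ending in `(UUID: ...)`. The helper names `parse_gpu_uuids` and `local_gpu_uuids` are hypothetical, and how the result would be attached to the cluster state is left open.

```python
import re
import subprocess

# Matches the "(UUID: GPU-...)" suffix that `nvidia-smi -L` prints for each device.
UUID_PATTERN = re.compile(r"\(UUID:\s*([^)\s]+)\)")


def parse_gpu_uuids(listing):
    """Extract the UUIDs, in order, from the text output of `nvidia-smi -L`."""
    return UUID_PATTERN.findall(listing)


def local_gpu_uuids():
    """Run `nvidia-smi -L` on this node and return the list of UUIDs it reports."""
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True, check=True
    )
    return parse_gpu_uuids(result.stdout)
```

Keeping the parsing separate from the subprocess call lets the parser be tested on machines without a GPU.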

Job Preemption to finish before new launch

An earlier version of Blox checked whether the GPU was free before launching a job. We disabled this check in order to support packing.

I want to add a new check that makes sure a preempted job has freed its GPU before new jobs are launched on it.
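One possible shape for such a check is a bounded polling loop that waits only on the GPUs being vacated by preempted jobs, so that packing onto GPUs that are intentionally shared is unaffected. This is a sketch under assumptions: `wait_until_free` and the `is_gpu_busy` probe are hypothetical, and the way the probe decides that a GPU is busy (for example by asking the node manager) is not specified here.

```python
import time


def wait_until_free(gpu_ids, is_gpu_busy, timeout_s=30.0, poll_s=0.5):
    """Poll until none of `gpu_ids` is busy; return False if the timeout expires.

    `is_gpu_busy` is a caller-supplied callable taking a GPU id and returning
    True while the preempted job still holds that GPU.
    """
    deadline = time.monotonic() + timeout_s
    pending = set(gpu_ids)
    while True:
        # Keep only the GPUs that the probe still reports as occupied.
        pending = {gpu for gpu in pending if is_gpu_busy(gpu)}
        if not pending:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)
```

The launcher would call this with the GPUs assigned to the new job that were previously held by a preempted job, and postpone the launch when it returns False.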

Metric Fix

We currently track attained service from the perspective of the scheduler.

This does not account for the time that scheduler operations take. Modify this metric to include the additional information needed to reflect the real time the job has used the cluster.

The given execution code for reproduction does not work

Traceback (most recent call last):
  File "blox_new_flow_multi_run.py", line 196, in <module>
    main(args)
  File "blox_new_flow_multi_run.py", line 25, in main
    blox_mgr = BloxManager(args)
  File "/home/wxh/blox/blox/blox_manager.py", line 41, in __init__
    node_manager_port=args.node_manager_port
AttributeError: 'Namespace' object has no attribute 'node_manager_port'
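The error suggests that the argument parser in the reproduction script never defines `node_manager_port`, so the attribute is missing from the parsed `Namespace`. A likely fix, sketched below, is to register the argument. The flag spelling, the default value 50052, and the `build_parser` helper are assumptions for illustration; the actual port expected by the node manager should be taken from the Blox sources.

```python
import argparse


def build_parser():
    """Build a parser that defines the attribute BloxManager reads."""
    parser = argparse.ArgumentParser()
    # argparse turns "--node-manager-port" into the attribute `node_manager_port`.
    parser.add_argument(
        "--node-manager-port",
        type=int,
        default=50052,  # placeholder value, not confirmed by the Blox sources
        help="port on which the node manager listens",
    )
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.node_manager_port)
```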

API for Job Launch

In the current setup, users typically have to edit the job-launch parameters by hand to match how they want their jobs launched: https://github.com/msr-fiddle/blox/blob/main/blox/deployment/grpc_client_rm.py#L28

This is ugly. Here are a couple of alternatives, in order of preference based on some internal discussion:

(i) Set environment variables that applications can read. We need to test whether these variables are overwritten by the subprocess environment.

(ii) Store the environment in Redis on the node manager.

(iii) Let users pass Blox a function with which they can massage the data into the form they want.
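Option (i) can be sketched as follows. Copying `os.environ` and layering the job parameters on top is one way to avoid the overwriting concern, because the child then sees both the inherited environment and the job-specific values. The `launch_with_env` helper and the `BLOX_` prefix are hypothetical and not part of the current Blox code.

```python
import os
import subprocess
import sys


def launch_with_env(command, job_params):
    """Run `command` with job parameters exposed as BLOX_* environment variables."""
    env = os.environ.copy()  # start from the parent's environment
    # Job parameters are applied last, so they win over inherited values.
    env.update({f"BLOX_{key.upper()}": str(value) for key, value in job_params.items()})
    return subprocess.run(command, env=env, capture_output=True, text=True, check=True)


if __name__ == "__main__":
    child = [sys.executable, "-c", "import os; print(os.environ['BLOX_JOB_ID'])"]
    print(launch_with_env(child, {"job_id": 7}).stdout.strip())
```

The application side then only needs to read `os.environ`, which keeps the launch code in Blox independent of each user's preferred command-line layout.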
