
hivedscheduler's Introduction

Microsoft OpenPAI HiveDScheduler


HiveD is a scheduler for deep learning workloads.

As a standalone component of Microsoft OpenPAI, HiveD is designed to be a Kubernetes Scheduler Extender for multi-tenant GPU clusters. A multi-tenant GPU cluster assumes multiple tenants (teams) share the same GPU pool in a single physical cluster (PC) and provides resource guarantees to each tenant. HiveD models each tenant as a virtual cluster (VC), so that one tenant can use its own VC as if it were a private cluster, while it can also use other VCs' free resources at lower priority.

Why You Need HiveD

HiveD provides several key features for deep learning workloads as follows.

The killer feature that distinguishes HiveD is that it provides a resource guarantee to each VC not only in terms of quantity, a numeric value, but also in terms of topology, a key requirement of GPU-based training jobs. For example, a traditional scheduler guarantees that a VC can use 8 GPUs. However, it does not know the topology of these 8 GPUs. It is possible that an 8-GPU training job that has to run within a single node cannot be allocated even if its VC still has 8 free GPUs, because these 8 free GPUs may belong to multiple nodes.

HiveD protects VCs' resources in terms of cell, a user-defined resource type that encodes both quantity and other kinds of information, such as topology and hardware type. In the above example, a user can define a cell type of 8-GPU node, and the VC can be assigned one such cell. HiveD will then ensure that there is always one 8-GPU node available to the VC, regardless of the other workloads in the cluster.
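A hedged sketch of such a cell definition, using the skuTypes/cellTypes configuration format that HiveD uses (all type names and sizes here are illustrative; see the user manual for the exact schema):

```yaml
physicalCluster:
  skuTypes:
    V100:                   # leaf cell: one V100 GPU with its CPU/memory share
      gpu: 1
      cpu: 6
      memory: 64Gi
  cellTypes:
    V100-PCIE:
      childCellType: V100
      childCellNumber: 4    # 4 GPUs under one PCI-e switch
    V100-NODE:              # the "8-GPU node" cell from the example above
      childCellType: V100-PCIE
      childCellNumber: 2
      isNodeLevel: true
    V100-NODE-POOL:
      childCellType: V100-NODE
      childCellNumber: 4
virtualClusters:
  vc1:
    virtualCells:
    - cellType: V100-NODE-POOL.V100-NODE
      cellNumber: 1         # guarantee vc1 one whole 8-GPU node
```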

HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different device models (e.g., NVIDIA V100 GPU, AMD Radeon MI100 GPU, Cloud TPU v3), or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.

HiveD optimizes the performance of gang scheduling, a typical scheduling requirement for deep learning training jobs, where all containers must be allocated before the training job can begin. Multiple gang-scheduled jobs competing for the same set of resources may lead to starvation, where each job only gets partial resources and has to wait indefinitely.

HiveD schedules all containers within a job in a transactional manner, i.e., all these containers' requirements will be granted or denied as a whole, thus avoiding partial resource allocation and starvation.
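The all-or-nothing idea can be sketched as follows. This is a minimal illustration, not HiveD's real data structures or algorithm: `tryAllocate` works on a trial copy of the free-GPU map, so a gang whose containers cannot all be placed consumes nothing.

```go
package main

import "fmt"

// tryAllocate places every container of a gang (requests[i] = GPUs needed
// by container i) on the free nodes, transactionally: either every request
// fits and the allocation is committed, or nil is returned and the cluster
// state is left untouched. Simplified sketch, not HiveD's actual code.
func tryAllocate(freeGPUs map[string]int, requests []int) map[string]string {
	// Work on a copy so a failed attempt leaves freeGPUs unchanged.
	trial := make(map[string]int, len(freeGPUs))
	for n, g := range freeGPUs {
		trial[n] = g
	}
	placement := make(map[string]string) // container -> node
	for i, need := range requests {
		placed := false
		for node, free := range trial {
			if free >= need {
				trial[node] -= need
				placement[fmt.Sprintf("container-%d", i)] = node
				placed = true
				break
			}
		}
		if !placed {
			return nil // deny the whole gang: no partial allocation
		}
	}
	// Commit the trial state.
	for n, g := range trial {
		freeGPUs[n] = g
	}
	return placement
}

func main() {
	free := map[string]int{"node-a": 4, "node-b": 4}
	// A 2-container gang needing 4 GPUs each fits as a whole.
	if p := tryAllocate(free, []int{4, 4}); p != nil {
		fmt.Println("gang granted:", len(p), "containers")
	}
	// A further 1-GPU request now finds no free GPU at all.
	if p := tryAllocate(free, []int{1}); p == nil {
		fmt.Println("gang denied")
	}
}
```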

Priorities

HiveD supports multiple job priorities. Higher-priority jobs can preempt lower-priority jobs. HiveD also introduces opportunistic jobs, i.e., jobs with the lowest priority which can use other VCs' free resource when possible (without breaking the resource guarantees to other VCs).
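For reference, a pod states its priority in HiveD's per-pod scheduling annotation (the same annotation appears in full in an issue below). A hedged sketch with illustrative values; in HiveD, opportunistic jobs are commonly expressed with the lowest priority value, -1 (check the user manual for the exact priority range):

```yaml
metadata:
  annotations:
    hivedscheduler.microsoft.com/pod-scheduling-spec: |-
      virtualCluster: vc1
      priority: 100      # higher-priority jobs can preempt lower-priority ones
      leafCellType: V100
      leafCellNumber: 1
```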

Features

  1. Multi-Tenancy: Virtual Cluster (VC)
  2. Fine-Grained VC Resource Guarantee: Quantity, Topology, Type, Pinned VC Resource, etc.
  3. Flexible Intra-VC Scheduling: Topology-Awareness, Flexible Hardware Types, Pinned VC Resource, Scheduling Policy Customization, etc.
  4. Optimized Resource Fragmentation and Less Starvation
  5. Priorities, Overuse with Low Priority, and Inter-/Intra-VC Preemption
  6. Job (Full/Partial) Gang Scheduling/Preemption
  7. Fault-Tolerance, Bad Hardware Awareness, Work-Preserving Reconfiguration

Prerequisite

  1. A Kubernetes cluster, v1.14.2 or above, on-cloud or on-premise.

Quick Start

  1. Config Scheduler
  2. Run Scheduler
  3. Submit Workload to Scheduler

Doc

  1. User Manual
  2. Feature Demo
  3. Design

Official Image

Related Projects

  • FrameworkController: A General-Purpose Kubernetes Pod Controller, which can easily leverage HiveD to schedule jobs.
  • OpenPAI: A complete AI platform solution. HiveD is more user-friendly when working in tandem with OpenPAI.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Reference

Please cite HiveD in your publications if it helps your research:

@inproceedings {hived-osdi2020,
author = {Hanyu Zhao and Zhenhua Han and Zhi Yang and Quanlu Zhang and Fan Yang and Lidong Zhou and Mao Yang and Francis C.M. Lau and Yuqi Wang and Yifan Xiong and Bin Wang},
title = {{HiveD}: Sharing a {GPU} Cluster for Deep Learning with Guarantees},
booktitle = {14th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 20)},
year = {2020},
isbn = {978-1-939133-19-9},
pages = {515--532},
url = {https://www.usenix.org/conference/osdi20/presentation/zhao-hanyu},
publisher = {{USENIX} Association},
month = nov,
}

hivedscheduler's People

Contributors

abuccts, hzhua, microsoftopensource, squirrelsc, yqwang-ms, zhypku


hivedscheduler's Issues

algorithm simulations and metrics

I use HiveD in a production environment with 20+ NVIDIA V100 GPUs, and I want to reproduce the reports on reduced GPU fragmentation and job wait time from Chapter 5 of the HiveD paper. How can I run such simulations? Also, is there a plan to add metrics to the HiveD scheduler?

Infinitely retry in intra-VC scheduler due to inconsistent view of bad cells.

If bad cells are not bound to VCs, the intra-VC scheduler may keep scheduling a job to bad cells, for which no healthy cell can be found in the physical cluster. Once the intra-VC scheduler is aware of the bad cells, it will avoid placing jobs on them.

Example:
A physical cluster has two L2 cells, each of which can be split into two L1 cells. There is only one VC, assigned both L2 cells.
Each physical L2 cell has one bad GPU (L1 cell).
If the intra-VC scheduler is unaware of the bad GPUs, it may place two 1-GPU requests in a single L2 cell, following a packing policy (suppose this is a gang allocation).
However, there is no healthy L2 cell in the physical cluster, so buddy allocation returns an allocation failure. Because the intra-VC scheduler cannot see the bad cells in its VC, it may retry the same placement (two 1-GPU requests in one L2 cell) indefinitely.

It is therefore necessary to pre-bind the bad cells to VCs, which exposes the bad GPUs to the intra-VC scheduler.

v0.4.0 release plan

Features:

  • Cell as SKU in intra-vc scheduler. @abuccts #34
    • Spec and algorithm design.
    • Intra-vc implementation.
    • End-to-end tests.
    • Document and test cases.
  • Visualizations, including physical cluster, virtual cluster, jobs, etc. @hzhua
    • Visualization prototype.
    • UX design.
    • Webportal implementation.

Handling cells with hardware failure

  • 1. When allocating a level-k cell, first allocate the "healthiest" cell in the level-k free cell list (healthiness is defined as # of good GPUs / # of total GPUs). @zhypku
  • 2. When all free level-k cells are bad cells, check whether we can get a new level-k cell by splitting a higher-level cell (leveraging the initial cell assignment of each VC). @abuccts #27
  • 3. If we cannot get a new level-k cell from step 2, all "allocatable" level-k cells are bad; we then pre-bind these bad level-k cells to all VCs assigned level-k cells. @zhypku #25
  • 4. If we get a set of new level-k cells (buddy cells) from step 2 but they are all still bad, repeat step 2. @abuccts #27
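Step 1 above can be sketched as a simple selection over the free cell list. The `cell` type is a hypothetical stand-in for HiveD's real cell structure:

```go
package main

import "fmt"

// cell is a simplified stand-in for a free HiveD cell: it only tracks how
// many of its leaf GPUs are healthy.
type cell struct {
	name      string
	goodGPUs  int
	totalGPUs int
}

// healthiness = # of good GPUs / # of total GPUs, as defined in step 1.
func healthiness(c cell) float64 {
	return float64(c.goodGPUs) / float64(c.totalGPUs)
}

// pickHealthiest returns the free cell with the highest healthiness,
// mirroring step 1: prefer "healthier" cells when allocating.
func pickHealthiest(free []cell) (cell, bool) {
	if len(free) == 0 {
		return cell{}, false
	}
	best := free[0]
	for _, c := range free[1:] {
		if healthiness(c) > healthiness(best) {
			best = c
		}
	}
	return best, true
}

func main() {
	free := []cell{
		{"node-1", 6, 8}, // 2 bad GPUs
		{"node-2", 8, 8}, // fully healthy
		{"node-3", 0, 8}, // all bad
	}
	if best, ok := pickHealthiest(free); ok {
		fmt.Println("allocate from:", best.name)
	}
}
```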

Virtual cluster will not show bad resource if enough nodes are guaranteed

There are 14 nodes and 3 virtual clusters in total in the PAI int cluster:

Default: 10 nodes
VC1: 2 nodes
VC2: 2 nodes

When I shut down 2 nodes in PAI, the home page resource chart does not show the bad node status, because there are still enough idle nodes in PAI for all VCs.

(screenshot)

When I submitted several jobs and then shut down another node, the cluster was too busy to keep enough idle nodes for the VCs, so the home page chart showed the bad node status.

(screenshot)

Can hived provide an overallocation strategy?

Following this microsoft/pai#5546 thread.

I think this should be a hived issue, so I hope to discuss it here.

I thought of a compromise solution: allow users to set hivedscheduler.config.physicalCluster.skuTypes.skuname.memory to -1 or none. In that case, PAI should not limit memory when submitting jobs. This way, the decision is handed over to the user.

This just gives the user one more choice: whether or not HiveD manages memory. When HiveD manages memory, the system behaves as it does now. When HiveD does not manage memory, all jobs can be considered to compete for the host's memory. Apart from memory, the remaining resources (CPU, GPU) are still managed by HiveD.

I think the system can be designed this way to give users more flexibility. If a user really needs all jobs to share the memory space, they can configure it like this, of course at their own risk.
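The proposal expressed in config form (purely hypothetical; `-1` is not a value HiveD currently accepts, and the sku name is illustrative):

```yaml
physicalCluster:
  skuTypes:
    V100:
      gpu: 1
      cpu: 6
      memory: -1   # proposed: -1 (or omitted) = HiveD does not manage memory
```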

HiveD is not aware of GPU topology

I run an MPIJob on a P4 node in Kubernetes 1.11, with one GPU per pod.
The P4 GPU topology is as follows:

(screenshot)

The worker-0 pod sees the GPU as follows:

(screenshot)

The worker-1 pod sees the GPU as follows:

(screenshot)

The HiveD config is as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: hivedscheduler-config
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [
        {
          "urlPrefix": "http://10.220.187.143:30096/v1/extender",
          "filterVerb": "filter",
          "preemptVerb": "preempt",
          "bindVerb": "bind",
          "enableHttps": false,
          "httpTimeout": 5000000000,
          "nodeCacheCapable": true,
          "ignorable": false,
          "managedResources": [
            {
              "name": "hivedscheduler.microsoft.com/pod-scheduling-enable",
              "ignoredByScheduler": true
            }
          ]
        }
      ]
    }
  hivedscheduler.yaml: |
    webServerAddress: ":30096"
    waitingPodSchedulingBlockMilliSec: 50
    physicalCluster:
      skuTypes:
        V100:
          gpu: 1
          cpu: 6
          memory: 6Gi
        P4:
          gpu: 1
          cpu: 1
          memory: 2Gi
      cellTypes:
        V100-PCIE:
          childCellType: V100
          childCellNumber: 4
        P4-CPU:
          childCellType: P4
          childCellNumber: 2
        V100-NODE:
          childCellType: V100-PCIE
          childCellNumber: 2
          isNodeLevel: true
        P4-NODE:
          childCellType: P4-CPU
          childCellNumber: 2
          isNodeLevel: true
        V100-NODE-POOL:
          childCellType: V100-NODE
          childCellNumber: 1
        P4-NODE-POOL:
          childCellType: P4-NODE
          childCellNumber: 2
      physicalCells:
      - cellType: V100-NODE-POOL
        cellChildren:
        - cellAddress: tx-220-189-58.h.chinabank.com.cn
      - cellType: P4-NODE-POOL
        cellChildren:
        - cellAddress: tx-220-189-26.h.chinabank.com.cn
        - cellAddress: tx-220-189-33.h.chinabank.com.cn
    virtualClusters:
      vc1:
        virtualCells:
        - cellType: P4-NODE-POOL.P4-NODE
          cellNumber: 2
      vc2:
        virtualCells:
        - cellType: V100-NODE-POOL.V100-NODE
          cellNumber: 1

The MPIJob YAML is as follows:

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: mpi-hived-cpu
  namespace: kubeflow
spec:
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            hivedscheduler.microsoft.com/pod-scheduling-spec: |-
              virtualCluster: vc1
              priority: 1
              leafCellType: P4
              leafCellNumber: 1
              affinityGroup:
                name: mpi-hived-cpu
                members:
                - podNumber: 1
                  leafCellNumber: 1
                - podNumber: 2
                  leafCellNumber: 1
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - horovodrun -np 2 python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model resnet50 --batch_size 32 --variable_update horovod --num_epochs=1
            image: idockerhub.jd.com/zhouzijiang/horovod-training:v1.2
            imagePullPolicy: Always
            name: mpi-hived
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
          nodeSelector:
            nvidia.com/accelerator: nvidia-tesla-p4
          schedulerName: hivedscheduler
          tolerations:
          - effect: NoSchedule
            key: dedicated
            value: lambda-training
          - effect: NoSchedule
            key: nvidia.com/gpu
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            hivedscheduler.microsoft.com/pod-scheduling-spec: |-
              virtualCluster: vc1
              priority: 1
              leafCellType: P4
              leafCellNumber: 1
              affinityGroup:
                name: mpi-hived-cpu
                members:
                - podNumber: 2
                  leafCellNumber: 1
        spec:
          containers:
          - image: idockerhub.jd.com/zhouzijiang/horovod-training:v1.2
            imagePullPolicy: Always
            name: mpi-hived
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
                nvidia.com/gpu: "1"
                hivedscheduler.microsoft.com/pod-scheduling-enable: 1
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
          nodeSelector:
            nvidia.com/accelerator: nvidia-tesla-p4
          schedulerName: hivedscheduler
          serviceAccountName: mpi-operator
          tolerations:
          - effect: NoSchedule
            key: dedicated
            value: lambda-training
          - effect: NoSchedule
            key: nvidia.com/gpu

The pods cannot be allocated GPU0 and GPU1 on the P4 node.

Refactor test cases

Motivation

  1. The current test cases are mainly located in https://github.com/microsoft/hivedscheduler/blob/v0.3.4/pkg/algorithm/hived_algorithm_test.go. They rely on many global variables, referenced through many helper functions, which hurts readability.

  2. As we are moving to the v2 schema, the old test cases will be outdated. It is a good time to refactor them.

Proposal

Define a hivedAlgorithmTester interface as follows:

type hivedAlgorithmTester interface {
	SchedulePod(podName string, pgsr v2.PodGroupSchedulingRequest, isDryRun bool)
	AssertPodScheduleSucceed(podName string, psr internal.PodScheduleResult)
	AssertPodScheduleFail(podName string)

	SetAllNodesToHealthy()
	SetAllNodesToBad()
	SetNodeToBad(nodeName string)
	SetNodeToHealthy(nodeName string)

	ExecuteCasesFromYaml(yamlFilename string)
}

func NewHivedAlgorithmTester(t *testing.T, configFilePath string) hivedAlgorithmTester {
	// implementation omitted
}

After this tester is implemented, we will be able to express test cases within a yaml file, e.g.

- method: SchedulePod
  parameters:
  - pod1
  - vc: vc1
    pinnedCellId: ""
    chain: "DGX2-V100-Node"
    priority: 0
    podRootGroup:
      Name: "pod1-group"
      WithinOneCell: ""
      Pods:
        PodMinNumber: 1
        PodMaxNumber: 1
        CellsPerPod:
          CellType: "DGX2-V100-gpu"
          CellNumber: 1
        ContainsCurrentPod: true
  - false
- method: AssertPodScheduleSucceed
  parameters:
  - pod1
  - psr:
      PodWaitInfo:
      PodPreemptInfo:
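The dispatch idea behind ExecuteCasesFromYaml can be sketched as a lookup table from method names to handlers. The `testCase` type and the two fake handlers below are hypothetical simplifications, not the real tester:

```go
package main

import "fmt"

// testCase mirrors one entry of the YAML case file: a tester method name
// plus its parameters (reduced to strings for this sketch).
type testCase struct {
	Method     string
	Parameters []string
}

// runCases looks each case's method up in a handler table and invokes it,
// failing fast on an unknown method name.
func runCases(cases []testCase) error {
	handlers := map[string]func([]string){
		"SchedulePod":              func(p []string) { fmt.Println("schedule", p[0]) },
		"AssertPodScheduleSucceed": func(p []string) { fmt.Println("assert ok", p[0]) },
	}
	for _, c := range cases {
		h, ok := handlers[c.Method]
		if !ok {
			return fmt.Errorf("unknown method %q", c.Method)
		}
		h(c.Parameters)
	}
	return nil
}

func main() {
	cases := []testCase{
		{Method: "SchedulePod", Parameters: []string{"pod1"}},
		{Method: "AssertPodScheduleSucceed", Parameters: []string{"pod1"}},
	}
	if err := runCases(cases); err != nil {
		fmt.Println("error:", err)
	}
}
```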
