
hivedscheduler's Introduction

Microsoft OpenPAI HiveDScheduler


HiveD is a scheduler for deep learning workloads.

As a standalone component of Microsoft OpenPAI, HiveD is designed to be a Kubernetes Scheduler Extender for multi-tenant GPU clusters. A multi-tenant GPU cluster assumes multiple tenants (teams) share the same GPU pool in a single physical cluster (PC) and provides resource guarantees to each tenant. HiveD models each tenant as a virtual cluster (VC), so that one tenant can use its own VC as if it were a private cluster, while it can also use other VCs' free resources at lower priority.

Why You Need HiveD

HiveD provides several key features for deep learning workloads as follows.

The killer feature that distinguishes HiveD is that it provides a resource guarantee to each VC not only in terms of quantity, a numeric value, but also in terms of topology, a key requirement of GPU-based training jobs. For example, a traditional scheduler guarantees that a VC can use 8 GPUs. However, it does not know the topology of these 8 GPUs. It is possible that an 8-GPU training job that has to run within a single node cannot be allocated even if its VC still has 8 free GPUs, because these 8 free GPUs may belong to multiple nodes.

HiveD protects VCs' resources in terms of cell, a user-defined resource type that encodes both quantity and other kinds of information, such as topology and hardware type. In the above example, a user can define a cell type of 8-GPU node, and the VC can be assigned one such cell. HiveD will then ensure that there is always one 8-GPU node available to the VC, regardless of the other workloads in the cluster.
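A hedged sketch of such a cell definition, using the skuTypes/cellTypes configuration format that HiveD uses (all type names and sizes here are illustrative; see the user manual for the exact schema):

```yaml
physicalCluster:
  skuTypes:
    V100:                   # leaf cell: one V100 GPU with its CPU/memory share
      gpu: 1
      cpu: 6
      memory: 64Gi
  cellTypes:
    V100-PCIE:
      childCellType: V100
      childCellNumber: 4    # 4 GPUs under one PCI-e switch
    V100-NODE:              # the "8-GPU node" cell from the example above
      childCellType: V100-PCIE
      childCellNumber: 2
      isNodeLevel: true
    V100-NODE-POOL:
      childCellType: V100-NODE
      childCellNumber: 4
virtualClusters:
  vc1:
    virtualCells:
    - cellType: V100-NODE-POOL.V100-NODE
      cellNumber: 1         # guarantee vc1 one whole 8-GPU node
```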

HiveD allows flexible cell definitions for fine-grained resource guarantees. For example, users can define cells at multiple topology levels (e.g., PCI-e switch), for different device models (e.g., NVIDIA V100 GPU, AMD Radeon MI100 GPU, Cloud TPU v3), or networking configurations (e.g., InfiniBand domain). A VC can have various types of cells, and HiveD will guarantee all of them.

HiveD optimizes the performance of gang scheduling, a typical scheduling requirement for deep learning training jobs, where all containers must be allocated before the training job can begin. Multiple gang-scheduled jobs competing for the same set of resources may lead to starvation, where each job only gets partial resources and has to wait indefinitely.

HiveD schedules all containers within a job in a transactional manner, i.e., all these containers' requirements will be granted or denied as a whole, thus avoiding partial resource allocation and starvation.
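The all-or-nothing idea can be sketched as follows. This is a minimal illustration, not HiveD's real data structures or algorithm: `tryAllocate` works on a trial copy of the free-GPU map, so a gang whose containers cannot all be placed consumes nothing.

```go
package main

import "fmt"

// tryAllocate places every container of a gang (requests[i] = GPUs needed
// by container i) on the free nodes, transactionally: either every request
// fits and the allocation is committed, or nil is returned and the cluster
// state is left untouched. Simplified sketch, not HiveD's actual code.
func tryAllocate(freeGPUs map[string]int, requests []int) map[string]string {
	// Work on a copy so a failed attempt leaves freeGPUs unchanged.
	trial := make(map[string]int, len(freeGPUs))
	for n, g := range freeGPUs {
		trial[n] = g
	}
	placement := make(map[string]string) // container -> node
	for i, need := range requests {
		placed := false
		for node, free := range trial {
			if free >= need {
				trial[node] -= need
				placement[fmt.Sprintf("container-%d", i)] = node
				placed = true
				break
			}
		}
		if !placed {
			return nil // deny the whole gang: no partial allocation
		}
	}
	// Commit the trial state.
	for n, g := range trial {
		freeGPUs[n] = g
	}
	return placement
}

func main() {
	free := map[string]int{"node-a": 4, "node-b": 4}
	// A 2-container gang needing 4 GPUs each fits as a whole.
	if p := tryAllocate(free, []int{4, 4}); p != nil {
		fmt.Println("gang granted:", len(p), "containers")
	}
	// A further 1-GPU request now finds no free GPU at all.
	if p := tryAllocate(free, []int{1}); p == nil {
		fmt.Println("gang denied")
	}
}
```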

Priorities

HiveD supports multiple job priorities. Higher-priority jobs can preempt lower-priority jobs. HiveD also introduces opportunistic jobs, i.e., jobs with the lowest priority which can use other VCs' free resource when possible (without breaking the resource guarantees to other VCs).
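For reference, a pod states its priority in HiveD's per-pod scheduling annotation (the same annotation appears in full in an issue below). A hedged sketch with illustrative values; in HiveD, opportunistic jobs are commonly expressed with the lowest priority value, -1 (check the user manual for the exact priority range):

```yaml
metadata:
  annotations:
    hivedscheduler.microsoft.com/pod-scheduling-spec: |-
      virtualCluster: vc1
      priority: 100      # higher-priority jobs can preempt lower-priority ones
      leafCellType: V100
      leafCellNumber: 1
```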

Features

  1. Multi-Tenancy: Virtual Cluster (VC)
  2. Fine-Grained VC Resource Guarantee: Quantity, Topology, Type, Pinned VC Resource, etc.
  3. Flexible Intra-VC Scheduling: Topology-Awareness, Flexible Hardware Types, Pinned VC Resource, Scheduling Policy Customization, etc.
  4. Optimized Resource Fragmentation and Less Starvation
  5. Priorities, Overuse with Low Priority, and Inter-/Intra-VC Preemption
  6. Job (Full/Partial) Gang Scheduling/Preemption
  7. Fault-Tolerance, Bad Hardware Awareness, Work-Preserving Reconfiguration

Prerequisite

  1. A Kubernetes cluster, v1.14.2 or above, on-cloud or on-premise.

Quick Start

  1. Config Scheduler
  2. Run Scheduler
  3. Submit Workload to Scheduler

Doc

  1. User Manual
  2. Feature Demo
  3. Design

Official Image

Related Projects

  • FrameworkController: A General-Purpose Kubernetes Pod Controller, which can easily leverage HiveD to schedule jobs.
  • OpenPAI: A complete AI platform solution. HiveD is more user-friendly when working in tandem with OpenPAI.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Reference

Please cite HiveD in your publications if it helps your research:

@inproceedings {hived-osdi2020,
author = {Hanyu Zhao and Zhenhua Han and Zhi Yang and Quanlu Zhang and Fan Yang and Lidong Zhou and Mao Yang and Francis C.M. Lau and Yuqi Wang and Yifan Xiong and Bin Wang},
title = {{HiveD}: Sharing a {GPU} Cluster for Deep Learning with Guarantees},
booktitle = {14th {USENIX} Symposium on Operating Systems Design and Implementation ({OSDI} 20)},
year = {2020},
isbn = {978-1-939133-19-9},
pages = {515--532},
url = {https://www.usenix.org/conference/osdi20/presentation/zhao-hanyu},
publisher = {{USENIX} Association},
month = nov,
}

hivedscheduler's People

Contributors

abuccts, hzhua, microsoftopensource, squirrelsc, yqwang-ms, zhypku


hivedscheduler's Issues

algorithm simulations and metrics

I use HiveD in a production environment with 20+ NVIDIA V100 GPUs, and I want to reproduce the reports on reduced GPU fragmentation and job wait time from Chapter 5 of the HiveD paper. How can I run such simulations? Also, is there a plan to add metrics to the HiveD scheduler?

Infinitely retry in intra-VC scheduler due to inconsistent view of bad cells.

If bad cells are not bound to VCs, the intra-VC scheduler may keep scheduling a job to bad cells, for which no healthy cell can be found in the physical cluster. Once the intra-VC scheduler is aware of the bad cells, it will avoid placing jobs on them.

Example:
A physical cluster has two L2 cells, each of which can be split into two L1 cells. There is only one VC, assigned both L2 cells.
Each physical L2 cell has one bad GPU (L1 cell).
If the intra-VC scheduler is unaware of the bad GPUs, it may place two 1-GPU requests in a single L2 cell, following a packing policy (suppose this is a gang allocation).
However, there is no healthy L2 cell in the physical cluster, so buddy allocation returns an allocation failure. Because the intra-VC scheduler cannot see the bad cells in its VC, it may retry the same placement (two 1-GPU requests in one L2 cell) indefinitely.

It is therefore necessary to pre-bind the bad cells to VCs, which exposes the bad GPUs to the intra-VC scheduler.

v0.4.0 release plan

Features:

  • Cell as SKU in intra-vc scheduler. @abuccts #34
    • Spec and algorithm design.
    • Intra-vc implementation.
    • End-to-end tests.
    • Document and test cases.
  • Visualizations, including physical cluster, virtual cluster, jobs, etc. @hzhua
    • Visualization prototype.
    • UX design.
    • Webportal implementation.

Handling cells with hardware failure

  • 1. When allocating a level-k cell, first allocate the "healthiest" cell in the level-k free cell list (healthiness is defined as # of good GPUs / # of total GPUs). @zhypku
  • 2. When all free level-k cells are bad cells, check whether we can get a new level-k cell by splitting a higher-level cell (leveraging the initial cell assignment of each VC). @abuccts #27
  • 3. If we cannot get a new level-k cell from step 2, all "allocatable" level-k cells are bad; we then pre-bind these bad level-k cells to all VCs assigned level-k cells. @zhypku #25
  • 4. If we get a set of new level-k cells (buddy cells) from step 2 but they are all still bad, repeat step 2. @abuccts #27
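Step 1 above can be sketched as a simple selection over the free cell list. The `cell` type is a hypothetical stand-in for HiveD's real cell structure:

```go
package main

import "fmt"

// cell is a simplified stand-in for a free HiveD cell: it only tracks how
// many of its leaf GPUs are healthy.
type cell struct {
	name      string
	goodGPUs  int
	totalGPUs int
}

// healthiness = # of good GPUs / # of total GPUs, as defined in step 1.
func healthiness(c cell) float64 {
	return float64(c.goodGPUs) / float64(c.totalGPUs)
}

// pickHealthiest returns the free cell with the highest healthiness,
// mirroring step 1: prefer "healthier" cells when allocating.
func pickHealthiest(free []cell) (cell, bool) {
	if len(free) == 0 {
		return cell{}, false
	}
	best := free[0]
	for _, c := range free[1:] {
		if healthiness(c) > healthiness(best) {
			best = c
		}
	}
	return best, true
}

func main() {
	free := []cell{
		{"node-1", 6, 8}, // 2 bad GPUs
		{"node-2", 8, 8}, // fully healthy
		{"node-3", 0, 8}, // all bad
	}
	if best, ok := pickHealthiest(free); ok {
		fmt.Println("allocate from:", best.name)
	}
}
```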

Virtual cluster will not show bad resource if enough nodes are guaranteed

There are 14 nodes and 3 virtual clusters in total in the PAI int cluster:

Default: 10 nodes
VC1: 2 nodes
VC2: 2 nodes

When I shut down 2 nodes in PAI, the home page resource chart does not show the bad node status, because there are still enough idle nodes in PAI for all VCs.

(screenshot)

When I submitted several jobs and then shut down another node, the cluster was too busy to keep enough idle nodes for the VCs, so the home page chart showed the bad node status.

(screenshot)

Can hived provide an overallocation strategy?

Following this microsoft/pai#5546 thread.

I think this should be a hived issue, so I hope to discuss it here.

I thought of a compromise solution: allow users to set hivedscheduler.config.physicalCluster.skuTypes.skuname.memory to -1 or none. In that case, PAI should not limit memory when submitting jobs. This way, the decision is handed over to the user.

This just gives the user one more choice: whether or not HiveD manages memory. When HiveD manages memory, the system behaves as it does now. When HiveD does not manage memory, all jobs can be considered to compete for the host's memory. Apart from memory, the remaining resources (CPU, GPU) are still managed by HiveD.

I think the system can be designed this way to give users more flexibility. If a user really needs all jobs to share the memory space, they can configure it like this, of course at their own risk.
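The proposal expressed in config form (purely hypothetical; `-1` is not a value HiveD currently accepts, and the sku name is illustrative):

```yaml
physicalCluster:
  skuTypes:
    V100:
      gpu: 1
      cpu: 6
      memory: -1   # proposed: -1 (or omitted) = HiveD does not manage memory
```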

HiveD is not aware of GPU topology

I run an MPIJob on a P4 node in Kubernetes 1.11, with one GPU per pod.
The P4 GPU topology is as follows:

(screenshot)

The worker-0 pod sees the GPU as follows:

(screenshot)

The worker-1 pod sees the GPU as follows:

(screenshot)

The HiveD config is as follows:

apiVersion: v1
kind: ConfigMap
metadata:
  name: hivedscheduler-config
  namespace: kube-system
data:
  policy.cfg: |
    {
      "kind": "Policy",
      "apiVersion": "v1",
      "extenders": [
        {
          "urlPrefix": "http://10.220.187.143:30096/v1/extender",
          "filterVerb": "filter",
          "preemptVerb": "preempt",
          "bindVerb": "bind",
          "enableHttps": false,
          "httpTimeout": 5000000000,
          "nodeCacheCapable": true,
          "ignorable": false,
          "managedResources": [
            {
              "name": "hivedscheduler.microsoft.com/pod-scheduling-enable",
              "ignoredByScheduler": true
            }
          ]
        }
      ]
    }
  hivedscheduler.yaml: |
    webServerAddress: ":30096"
    waitingPodSchedulingBlockMilliSec: 50
    physicalCluster:
      skuTypes:
        V100:
          gpu: 1
          cpu: 6
          memory: 6Gi
        P4:
          gpu: 1
          cpu: 1
          memory: 2Gi
      cellTypes:
        V100-PCIE:
          childCellType: V100
          childCellNumber: 4
        P4-CPU:
          childCellType: P4
          childCellNumber: 2
        V100-NODE:
          childCellType: V100-PCIE
          childCellNumber: 2
          isNodeLevel: true
        P4-NODE:
          childCellType: P4-CPU
          childCellNumber: 2
          isNodeLevel: true
        V100-NODE-POOL:
          childCellType: V100-NODE
          childCellNumber: 1
        P4-NODE-POOL:
          childCellType: P4-NODE
          childCellNumber: 2
      physicalCells:
      - cellType: V100-NODE-POOL
        cellChildren:
        - cellAddress: tx-220-189-58.h.chinabank.com.cn
      - cellType: P4-NODE-POOL
        cellChildren:
        - cellAddress: tx-220-189-26.h.chinabank.com.cn
        - cellAddress: tx-220-189-33.h.chinabank.com.cn
    virtualClusters:
      vc1:
        virtualCells:
        - cellType: P4-NODE-POOL.P4-NODE
          cellNumber: 2
      vc2:
        virtualCells:
        - cellType: V100-NODE-POOL.V100-NODE
          cellNumber: 1

The MPIJob YAML is as follows:

apiVersion: kubeflow.org/v1alpha2
kind: MPIJob
metadata:
  name: mpi-hived-cpu
  namespace: kubeflow
spec:
  cleanPodPolicy: Running
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          annotations:
            hivedscheduler.microsoft.com/pod-scheduling-spec: |-
              virtualCluster: vc1
              priority: 1
              leafCellType: P4
              leafCellNumber: 1
              affinityGroup:
                name: mpi-hived-cpu
                members:
                - podNumber: 1
                  leafCellNumber: 1
                - podNumber: 2
                  leafCellNumber: 1
        spec:
          containers:
          - command:
            - /bin/bash
            - -c
            - horovodrun -np 2 python /benchmarks/scripts/tf_cnn_benchmarks/tf_cnn_benchmarks.py
              --model resnet50 --batch_size 32 --variable_update horovod --num_epochs=1
            image: idockerhub.jd.com/zhouzijiang/horovod-training:v1.2
            imagePullPolicy: Always
            name: mpi-hived
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
          nodeSelector:
            nvidia.com/accelerator: nvidia-tesla-p4
          schedulerName: hivedscheduler
          tolerations:
          - effect: NoSchedule
            key: dedicated
            value: lambda-training
          - effect: NoSchedule
            key: nvidia.com/gpu
    Worker:
      replicas: 2
      template:
        metadata:
          annotations:
            hivedscheduler.microsoft.com/pod-scheduling-spec: |-
              virtualCluster: vc1
              priority: 1
              leafCellType: P4
              leafCellNumber: 1
              affinityGroup:
                name: mpi-hived-cpu
                members:
                - podNumber: 2
                  leafCellNumber: 1
        spec:
          containers:
          - image: idockerhub.jd.com/zhouzijiang/horovod-training:v1.2
            imagePullPolicy: Always
            name: mpi-hived
            resources:
              limits:
                cpu: "1"
                memory: 2Gi
                nvidia.com/gpu: "1"
                hivedscheduler.microsoft.com/pod-scheduling-enable: 1
            securityContext:
              capabilities:
                add:
                - IPC_LOCK
          nodeSelector:
            nvidia.com/accelerator: nvidia-tesla-p4
          schedulerName: hivedscheduler
          serviceAccountName: mpi-operator
          tolerations:
          - effect: NoSchedule
            key: dedicated
            value: lambda-training
          - effect: NoSchedule
            key: nvidia.com/gpu

The pods cannot be allocated GPU0 and GPU1 on the P4 node.

Refactor test cases

Motivation

  1. The current test cases are mainly located in https://github.com/microsoft/hivedscheduler/blob/v0.3.4/pkg/algorithm/hived_algorithm_test.go. They rely on many global variables, referenced through many helper functions, which hurts readability.

  2. As we are moving to the v2 schema, the old test cases will be outdated. It is a good time to refactor them.

Proposal

Define a hivedAlgorithmTester interface as follows:

type hivedAlgorithmTester interface {
	SchedulePod(podName string, pgsr v2.PodGroupSchedulingRequest, isDryRun bool)
	AssertPodScheduleSucceed(podName string, psr internal.PodScheduleResult)
	AssertPodScheduleFail(podName string)

	SetAllNodesToHealthy()
	SetAllNodesToBad()
	SetNodeToBad(nodeName string)
	SetNodeToHealthy(nodeName string)

	ExecuteCasesFromYaml(yamlFilename string)
}

func NewHivedAlgorithmTester(t *testing.T, configFilePath string) hivedAlgorithmTester {
	// implementation omitted
}

After this tester is implemented, we will be able to express test cases within a yaml file, e.g.

- method: SchedulePod
  parameters:
  - pod1
  - vc: vc1
    pinnedCellId: ""
    chain: "DGX2-V100-Node"
    priority: 0
    podRootGroup:
      Name: "pod1-group"
      WithinOneCell: ""
      Pods:
        PodMinNumber: 1
        PodMaxNumber: 1
        CellsPerPod:
          CellType: "DGX2-V100-gpu"
          CellNumber: 1
        ContainsCurrentPod: true
  - false
- method: AssertPodScheduleSucceed
  parameters:
  - pod1
  - psr:
      PodWaitInfo:
      PodPreemptInfo:
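The dispatch idea behind ExecuteCasesFromYaml can be sketched as a lookup table from method names to handlers. The `testCase` type and the two fake handlers below are hypothetical simplifications, not the real tester:

```go
package main

import "fmt"

// testCase mirrors one entry of the YAML case file: a tester method name
// plus its parameters (reduced to strings for this sketch).
type testCase struct {
	Method     string
	Parameters []string
}

// runCases looks each case's method up in a handler table and invokes it,
// failing fast on an unknown method name.
func runCases(cases []testCase) error {
	handlers := map[string]func([]string){
		"SchedulePod":              func(p []string) { fmt.Println("schedule", p[0]) },
		"AssertPodScheduleSucceed": func(p []string) { fmt.Println("assert ok", p[0]) },
	}
	for _, c := range cases {
		h, ok := handlers[c.Method]
		if !ok {
			return fmt.Errorf("unknown method %q", c.Method)
		}
		h(c.Parameters)
	}
	return nil
}

func main() {
	cases := []testCase{
		{Method: "SchedulePod", Parameters: []string{"pod1"}},
		{Method: "AssertPodScheduleSucceed", Parameters: []string{"pod1"}},
	}
	if err := runCases(cases); err != nil {
		fmt.Println("error:", err)
	}
}
```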
