kube-batch's Introduction

kube-batch

kube-batch is a batch scheduler for Kubernetes, providing mechanisms for applications that would like to run batch jobs on Kubernetes. It builds upon a decade and a half of experience running batch workloads at scale using several systems, combined with best-of-breed ideas and practices from the open source community.

Refer to the tutorial on how to use kube-batch to run a batch job in Kubernetes.
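As a minimal sketch (the names and resource numbers here are illustrative, not taken from the tutorial), a batch Job opts into kube-batch by setting schedulerName on its pod template:

apiVersion: batch/v1
kind: Job
metadata:
  name: qj-1                       # illustrative name
spec:
  parallelism: 3
  template:
    metadata:
      labels:
        app: qj-1
    spec:
      schedulerName: kube-batch    # hand these pods to kube-batch instead of the default scheduler
      restartPolicy: Never
      containers:
      - name: work
        image: busybox
        command: ["sleep", "10"]
        resources:
          requests:
            cpu: "1"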

Overall Architecture

The following figure describes the overall architecture and scope of kube-batch; the out-of-scope parts will be handled by other projects.

[Figure: kube-batch overall architecture]

Who uses kube-batch?

As the kube-batch Community grows, we'd like to keep track of our users. Please send a PR with your organization name.

Currently officially using kube-batch:

  1. openBCE
  2. Kubeflow
  3. Volcano
  4. Baidu Inc
  5. TuSimple
  6. MOGU Inc
  7. Vivo

Community, discussion, contribution, and support

Learn how to engage with the Kubernetes community on the community page.

You can reach the maintainers of this project at:

Code of conduct

Participation in the Kubernetes community is governed by the Kubernetes Code of Conduct.

kube-batch's People

Contributors

adam-marek, animeshsingh, asifdxtreme, bysph, chanyilin, chenyangxuehdu, denkensk, dmatch01, gaocegege, hex108, hzxuzhonghu, ingvagabund, jeffwan, jiaxuanzhou, jinzhejz, k82cn, k8s-ci-robot, leileiwan, mateuszlitwin, mitake, scostache, shivramsrivastava, suleisl2000, swiftdiaries, thandayuthapani, tizhou86, tommylike, wackxu, xichengliudui, zionwu

kube-batch's Issues

Data Placement, Rack Awareness and Scheduling

Not sure if this really fits into the goals for this Incubator, if it does not, please feel free to close the issue.

At some level, a BatchJob processing a large corpus of data should have awareness of the data (location, skew, etc.) for efficient scheduling.

  • If there are multiple copies of the data block, the scheduler should be aware of their locations for enhanced scheduling.
  • The scheduling algorithm should have the ability to optimize the cost of data access: local disk --> single network hop (within rack) --> multiple network hops.
  • Intelligence to address data skews: the performance of a BatchJob is directly related to the ability to handle data skews efficiently. Schedulers should have the intelligence to address stragglers.

These requirements may be more appropriate for a layer above the scheduler; if so, there should be a way to identify how they would translate into changes at the scheduler layer.

Schedule pods by priority for minAvailable

Currently, kube-batchd assigns a job's pods randomly when satisfying minAvailable.

For example, if a job has three pods p1, p2, p3 and its minAvailable is 2, kube-batchd may start the minAvailable pods in any of three combinations:

  • p1 and p2
  • p1 and p3
  • p2 and p3

For some jobs, such as a TensorFlow job, certain pods must start first; otherwise, the job cannot start up.

To handle this issue, the user could specify a different priority for each pod; kube-batchd would then start the minAvailable pods for the job based on their priority. A hypothetical sketch follows.
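A hypothetical sketch of that idea, assuming kube-batchd would order minAvailable admission by standard Kubernetes pod priority (the PriorityClass name and values are illustrative):

# Hypothetical: assumes kube-batchd starts higher-priority pods first when filling minAvailable.
apiVersion: scheduling.k8s.io/v1beta1   # use the scheduling.k8s.io version your cluster serves
kind: PriorityClass
metadata:
  name: must-start-first
value: 1000
---
apiVersion: v1
kind: Pod
metadata:
  name: p1                              # the pod that must start before p2/p3
  labels:
    app: tf-job
spec:
  schedulerName: kube-batchd
  priorityClassName: must-start-first
  containers:
  - name: ps
    image: busybox
    command: ["sleep", "infinity"]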

Queue job level resources assignment

Kube-arbitrator currently allocates resources only at the queue level.

However, a queue can also contain multiple QueueJobs, as described in the design doc, and we need to support allocating/assigning resources to QueueJobs.

There may be two ways:

  1. Allocate resources to queuejobs directly by policy
  2. Assign resources to QueueJobs after the policy allocates resources to each queue, e.g. assigning resources to QueueJobs in Interface.Assign()

The integration test fails

The integration test fails in the CI environment. We need to install etcd before testing:

The command "make test" exited with 0.
0.04s$ make test-integration
hack/make-rules/test-integration.sh 
+++ [0929 07:05:29] Checking etcd is on PATH
+++ [0929 07:05:29] Cannot find etcd, cannot run integration tests.
+++ [0929 07:05:29] Please see https://github.com/kubernetes/community/blob/master/contributors/devel/testing.md#install-etcd-dependency for instructions.
You can use 'hack/install-etcd.sh' to install a copy in third_party/.
!!! Error in hack/make-rules/test-integration.sh:87
  Error in hack/make-rules/test-integration.sh:87. 'return 1' exited with status 1
Call stack:
  1: hack/make-rules/test-integration.sh:87 main(...)
Exiting with status 1
make: *** [test-integration] Error 1

@jinzhejz, please check the travis-ci related docs to fix it :).

Consider relationship between QueueJob and Application

I noticed today the proposal for an Application API object which, at first glance, appears to be a way to group multiple API objects needed to launch an application. This seems pretty similar to the QueueJob concept from the original kube arbitrator proposal, at least at a high level (in particular see the ApplicationSpec.Components field). Anyway, I just wanted to point this out so you might look into whether there is anything to leverage from Application. Maybe there is not a useful connection, but it is probably worth investigating.

Add node selector to identify a sub-cluster for batch job

Currently, we cache all nodes in kube-batchd and assign nodes to pods accordingly. But for batch jobs, the conflicts are heavy :(. We'll use a 'static' zone for batch jobs until we find a good way to work together with kube-scheduler.

Some tasks, including but not limited to (a node-selector sketch follows the list):

  • Filter pods with scheduler name for kube-batchd (#170)
  • Filter nodes with labels for kube-batchd (#203)
  • Filter nodes with NodeAffinity for kube-batchd
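A minimal sketch of the 'static' zone idea under these assumptions (the zone label and node names are illustrative):

# Label the nodes that form the batch sub-cluster, e.g.:
#   kubectl label nodes node-1 node-2 zone=batch
apiVersion: v1
kind: Pod
metadata:
  name: batch-pod
spec:
  schedulerName: kube-batchd   # picked up by kube-batchd via scheduler-name filtering (#170)
  nodeSelector:
    zone: batch                # restricted to the labeled sub-cluster (#203)
  containers:
  - name: work
    image: busybox
    command: ["sleep", "60"]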

Refine code to use the new Queue informer

An informer for the Queue custom resource definition has already been added (pkg/client/informers/queue); we need to go through the code to see where the new informer can be used.

Error handling: create/delete quota for new/old queue

Kube-arbitrator uses resource quota to limit resource usage of each queue.

Now the quota must be created for each queue manually, and it still exists even if the queue is deleted.

Need to do the following (a per-queue quota sketch follows this list):

  1. Create a quota if there is no quota yet for a new queue
  2. Delete or reset the quota limitation to zero if a queue is deleted
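A minimal sketch of the per-queue quota the controller would manage (the namespace, name, and limits are illustrative):

apiVersion: v1
kind: ResourceQuota
metadata:
  name: quota-queue-a      # one quota per queue, created automatically for a new queue
  namespace: queue-a       # the queue's namespace
spec:
  hard:
    requests.cpu: "10"     # managed by the allocator; reset to zero (or deleted) when the queue is deleted
    requests.memory: 20Gi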

Enhance object cache for policy.

For the policy, we need a snapshot of the cluster (e.g. Pods, Nodes) to calculate how many resources should be allocated to each tenant. This task provides such a cache and the related helper functions.

Sub-Tasks:

  • Resource type and related helper functions, e.g. Resource.Add
  • Pod and Node cache
  • Consumer cache

Enhance QueueJob controller to handle QueueJob/Pod lifecycle

The QueueJob controller in kube-batchd can simply create pods for a QueueJob now that #194 is merged; however, this is not enough for the QueueJob lifecycle, and the controller does not handle the pod lifecycle of a QueueJob.

So kube-batchd should be enhanced to support the following tasks:

  • Handle the QueueJob lifecycle, including add/update/delete
  • Handle the Pod lifecycle of a QueueJob, including add/update/delete

Preemption design doc

The Queue API doc contains a basic functional description of preemption. We need to provide a detailed preemption design doc.

Roadmap: Using QueueQuota and QueueJobQuota to replace ResourceQuota

Currently, ResourceQuota is used to limit the resource usage of each queue (at the namespace level, in fact).

In the roadmap, we need new admission controls for Queue and QueueJob to limit resource usage; a hypothetical QueueQuota sketch follows the list.

  • QueueQuota: resource usage limitation of a Queue
  • QueueQuotaController: admission controller for QueueQuota
  • QueueJobQuota: resource usage limitation of a QueueJob
  • QueueJobQuotaController: admission controller for QueueJobQuota
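Purely hypothetical sketch of what a QueueQuota object might look like once this roadmap item lands; the API group, version, and fields below do not exist yet and are assumptions:

apiVersion: arbitrator.incubator.k8s.io/v1alpha1   # assumed group/version
kind: QueueQuota                                    # hypothetical, not an existing API
metadata:
  name: quota-for-queue-a
spec:
  queue: queue-a        # the Queue this quota constrains
  hard:
    cpu: "100"
    memory: 512Gi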

Refine vendor directory

Some code in the vendor directory is redundant and also needs to be synced up from GitHub. Including but not limited to:

  • github.com:kubernetes/api.git
  • github.com:kubernetes/apiextensions-apiserver.git
  • github.com:kubernetes/apimachinery.git
  • github.com:kubernetes/apiserver.git
  • github.com:kubernetes/client-go.git
  • github.com:kubernetes/metrics.git
  • github.com:kubernetes/kube-aggregator.git

Add support for complex QueueJob objects

According to the proposal https://docs.google.com/document/d/1-H2hnZap7gQivcSU-9j4ZrJ8wE_WwcfOkTeAGjzUyLA/edit#heading=h.a1k69dgabg0w, a new abstraction is to be incorporated in Kubernetes to support complex batch jobs, also known as QueueJobs. These jobs can be composed of services, replica sets, deployments, stateful sets, etc. and might benefit from atomic (all-or-nothing) allocation, preemption, and prioritization of QueueJobs and within the same QueueJob.

We have developed a proof-of-concept prototype that supports jobs composed of services, pods, replica sets, and deployments (https://github.com/hanghliu/kube-arbitrator) using a CRD, as well as prioritization and preemption for them. We would like to integrate this prototype with the current kube-arbitrator project to enhance its QueueJob support.

Build and push kube-batchd/kube-quotalloc images to a public repository?

Currently, users are required to download the source and build kube-batchd/kube-quotalloc themselves.

I am trying to deploy kube-batchd and kube-quotalloc on a minikube cluster. I have to SSH into the minikube VM, build kube-batchd/kube-quotalloc from source, and build the Docker images there. Unfortunately, minikube doesn't even have apt-get, make, or Go installed, so the build process is a nightmare.

Is it possible to open a public repository on Docker Hub? I'd be glad to write a script to automate the build-and-push process.

[Question] Using kube-arbitrator for multinode MPI jobs

Hi, all!

I've been investigating how applicable kube-arbitrator might be for running multinode MPI jobs on a kubernetes cluster. Here are my observations - please let me know if I was doing something wrong or whether the outcome was as expected. Also, please note that I don't have too much experience with Kubernetes.

For the experiments, I had three nodes with 80 CPUs each.

I first experimented with kube-batchd, hoping that it wouldn't schedule a job until the requested resources are available. I set up one dormant (sleep infinity) pod that used 50 CPUs. I then created the pod disruption budget with matchLabels: app: mpi3pod with minAvailable: 3. I then created a deployment with 3 replicas, label app: mpi3pod, using the kube-arbitrator as scheduler and requesting 50 CPUs per replica.

It seems that minAvailable is not honoured in the way I expected it to be - the whole deployment was marked as "not ready" (good), but 2 of the 3 replicas were launched. I hoped that kube-arbitrator would not start the pods until all the required resources were available. That's not how it works, I presume?

The second experiment was with kube-quotalloc. My intended use was to generate a separate namespace/QuotaAllocator/ResourceQuota for each MPI job that would be submitted. So I created three namespaces, three QuotaAllocators (with 100 CPUs each), and three ResourceQuotas, but upon running kube-quotalloc, I saw that the quotas were clipped to fit within one node:

$ kubectl get quota rq01 -n allocator-ns01 -o yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  creationTimestamp: 2018-02-05T11:08:38Z
  name: rq01
  namespace: allocator-ns01
  resourceVersion: "7507551"
  selfLink: /api/v1/namespaces/allocator-ns01/resourcequotas/rq01
  uid: eac65704-0a64-11e8-b0e9-54ab3a8ca064
spec:
  hard:
    limits.cpu: "80"
    limits.memory: "540954869760"
    pods: "100"
    requests.cpu: "80"
    requests.memory: "540954869760"
status:
  hard:
    limits.cpu: "80"
    limits.memory: "540954869760"
    pods: "100"
    requests.cpu: "80"
    requests.memory: "540954869760"
  used:
    limits.cpu: "0"
    limits.memory: "0"
    pods: "0"
    requests.cpu: "0"
    requests.memory: "0"

It was initially 100 CPUs, but was trimmed down to 80 CPUs (which is how many CPUs are present on each node).

How can I use kube-arbitrator as a resource manager/scheduler for multinode MPI jobs? Ideally, I'd wish for the Job's pods to only be scheduled once all the resources for the whole job become available.

Also, can kube-batchd handle the case of two applications being submitted at pretty much the same time, which could lead to a deadlock? That is, imagine we have 5 nodes and two applications that require 4 nodes each (as a hard request, i.e. they wait for all their jobs to start running before making progress, thus indefinitely blocking whatever nodes they have been assigned). If they were submitted at the same time, we could run into a deadlock: app1 takes, say, 3 nodes and app2 takes the remaining 2 nodes in the meantime. Neither app can make progress until it gets more nodes, and hence a deadlock occurs.

Add UT cases for DRF policy

PR #151 fixed the following issue in the DRF policy; we need to add UT cases for it.

  • The priority of a PodSet is not updated after assigning minAvailable

Migrate integration test from upstream.

At this early stage, the integration tests are OK for verifying our code; it's better to reuse the upstream integration tests here.

Overall, we prefer to build the prototype quickly to get all developers on the same page, and keep improving it before 0.1.

Add golint check for source code

There is currently a gofmt check via hack/update-gofmt.sh.

We need to add a golint check for the source code and refine the code according to the golint results, for example by adding a new script such as hack/update-golint.sh for the check.

  • Add golint tools into this repo (#168)
  • Enable golint check, at least policy directory

schedulercache needs to notify controllers when it is updated

Is this a BUG REPORT or FEATURE REQUEST?: FEATURE REQUEST

The queuejob controller needs to get notifications when Queue/QueueJob/Pods are updated.

This could be implemented as a kind of shared informer and lister.

/kind feature

[question] The relationship between quotalloc and batchd

Hi, kube-arbitrator members

I am trying to run ML(especially TF) workload on Kubernetes with the support of kube-arbitrator, and have some questions about this repo.

First, what's the relationship between kube-arbitrator and IBM EGO? Is it an open-source re-implementation of EGO? If I'm not mistaken, EGO is similar to Mesos:

EGO also provides, in its kernel, a highly configurable and efficient placement service (布局服务).

I was wondering what a 布局服务 ("placement service") is; I cannot find a description of it in the post.

Second, what's the relationship between batchd and quotalloc? I found that they have similar CRDs, QuotaAllocator and Queue, but I am not sure what each is used for.

Finally, I have a question about the behaviour of batchd:

When resources are not sufficient, Kube-batchd just tries to start minAvailable pods of each application as much as possible.

Suppose I have two applications: app1 with minAvailable 4 and app2 with minAvailable 6, and there are only 5 slots in the machine. What will happen when batchd schedules the two applications?

I'd appreciate it if you could help me 😄

Enhancement: allocate CPU by decimals

Currently, the minimum CPU allocation unit in kube-arbitrator is 1, and the CPU used by pods must be an integer to work with kube-arbitrator.

We need to support a smaller minimum CPU allocation unit, such as 1m (1 CPU = 1000m), to make kube-arbitrator more flexible; see the sketch below.
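For reference, Kubernetes pod specs already accept fractional CPU in milliCPU units; the enhancement is for kube-arbitrator to account for such values instead of whole CPUs only (the pod name and numbers below are illustrative):

apiVersion: v1
kind: Pod
metadata:
  name: fractional-cpu-pod
spec:
  schedulerName: kube-batchd
  containers:
  - name: work
    image: busybox
    command: ["sleep", "60"]
    resources:
      requests:
        cpu: "500m"      # 0.5 CPU; values like this should become allocatable once the enhancement lands
      limits:
        cpu: "1500m"     # 1.5 CPUs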

Enhancement: out of sync between quota manager and queue controller

Currently, the queue controller updates the allocation result to the Queue, and the quota manager then propagates the allocation result from the Queue to the ResourceQuota. There is a time window between the two updates, which may cause some strange behavior.

Opening this issue to track it.

Support pods preemption between PodSet

Pod preemption between PodSets is not currently supported by kube-batchd.

This may cause the first PodSet to get all resources while the second PodSet gets none, unless some pods of the first PodSet finish.

So kube-batchd needs the ability to balance resources between PodSets, preempting resources from one PodSet and assigning them to another.

hack/verify-golint.sh: Incorrect check for Go version

Problem Description:
Upon executing:
chmod +x hack/verify-golint.sh; ./hack/verify-golint.sh

I received the following error message:

Detected go version: go version go1.10 linux/amd64.
Kubernetes requires go1.8.3 or greater.
Please install go1.8.3 or later.

!!! Error in ./hack/verify-golint.sh:334
Error in ./hack/verify-golint.sh:334. 'return 2' exited with status 2
Call stack:
1: ./hack/verify-golint.sh:334 main(...)
Exiting with status 1

I verified my go version with go version which also confirmed:
go version go1.10 linux/amd64

Define the MVP of kube-arbitrator

Currently, there are several features on kube-arbitrator's roadmap, plus other requirements from the community and users. We cannot build them all in one or two days, so I'm opening this issue to track the discussion of the kube-arbitrator MVP.

The MVP will be the first version of kube-arbitrator; if you have any requirements for the MVP, please feel free to ask :).

Add integration test cases

Currently, there is only one integration test case for policy/preemption. Need to add more integration test cases.

Group pods into one PodSet when they belong to same deployment

Currently, kube-batchd groups pods into a PodSet by their owner references. However, this causes problems in the following cases:

K8S Deployment
A deployment may contain more than one ReplicaSet when a rolling upgrade happens; pods in different ReplicaSets have different owner references, so kube-batchd will group them into different PodSets. However, these pods belong to the same deployment and should be in one PodSet.

Kubeflow job
In a TFJob, Kubeflow creates Master/PS/Worker as separate k8s Jobs, each containing one pod. kube-batchd will group these pods into different PodSets because of their different owner references. However, they belong to the same TFJob and should be in one PodSet.

Race condition between cache and policy

There is a race condition between the cache and the policy, which can cause the policy to schedule too many pods on the same host. Here are the details:

The brief workflow of each thread/component:

T1. Cache thread:
  1. Update Pod/Node/Consumer information from the api-server
T2. Policy - resource alloc thread:
  1. Get a Pod/Node/Consumer snapshot from the cache
  2. Allocate resources for Pods/PodSets in the cache snapshot
    2.1. Assign MinAvailable to each PodSet one by one
    2.2. Assign the remaining resources to each PodSet by DRF
  3. Add the allocation result to a queue, named the AllocationQueue
T3. Policy - process alloc decision thread:
  1. Get the scheduling result from the AllocationQueue and update it in the api-server
T4. Kubelet:
  1. Get Pod information from the api-server, start/stop the pod, and then update its status to the api-server

Case 01:

  1. In step T2.2.1: Policy allocate MinAvailable to PodSet
  2. In step T2.3: Policy add allocate result to AllocationQueue (Pods still pending)
  3. In step T3.1: Policy update allocate result to api-server (Pods still pending, with Nodename)
  4. In step T1.1: Cache update information from api-server in step T1.1 (Pods still pending, with Nodename)
  5. In step T2.1: Policy get cache snapshot (Pods still pending, with Nodename)
  6. In step T2.2.1: Policy allocates MinAvailable to the PodSet; however, the policy does not count the pending pods that already have a NodeName and still takes pods from the pending queue one by one.
  7. In step T3.1: Policy update allocate result to api-server (more Pods still pending, with Nodename)
  8. In step T4.1: Kubelet start pods

Case 02:

  1. In step T2.2.1: Policy allocate MinAvailable to PodSet
  2. In step T2.3: Policy add allocate result to AllocationQueue (Pods still pending)
  3. In step T3.1: Policy update allocate result to api-server (Pods still pending, with Nodename)
  4. In step T2.1: Policy get cache snapshot (Pods still pending)
  5. In step T1.1: Cache update information from api-server (Pods still pending, with Nodename)
  6. In step T2.2.1: Policy allocates MinAvailable to the PodSet; however, all the pods in the pending queue appear without a NodeName, which may cause the policy to allocate extra pods beyond MinAvailable in this step.
  7. In step T3.1: Policy update allocate result to api-server (more Pods still pending, with Nodename)
  8. In step T4.1: Kubelet start pods

Support gang scheduling (hard) for podset

When resources are insufficient, kube-batchd just tries to start as many of each application's minAvailable pods as possible. This can leave a PodSet running fewer pods than minAvailable, which is not acceptable for some applications, e.g. MPI.

To handle this issue, kube-batchd needs to support not assigning any resources to a job unless its minAvailable can be met; a minimal PDB-based sketch follows.
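Since kube-batchd derives minAvailable from a PodDisruptionBudget (see the PDB-related issues below), a gang-scheduled job would roughly be expressed like this; the labels and counts are illustrative:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: mpi-job-pdb
spec:
  minAvailable: 4        # the gang size: with hard gang scheduling, schedule all 4 pods or none
  selector:
    matchLabels:
      app: mpi-job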

Enhancement: multiple strategy of pod selection

The policy will terminate pods to release resources if the Queue is overused (Used > Allocated). Currently, the pod is selected randomly. We need to support more pod-selection strategies based on factors such as:

  • Running time
  • Priority
  • Labels

Queue API & Controller

Queue is an important API object in kube-arbitrator. This issue is used to track all related features.

  • Design doc of Queue

Merge branch release-pre0.1 to master

Plan to merge release-pre0.1 branch to master, and provide two independent binaries kube-batchd and kube-quotalloc.

kube-batchd contains the current master branch functionality; it supports batch job scheduling.
kube-quotalloc contains the current release-pre0.1 branch functionality; it is for resource management.

Need to handle the following items:

  • Rename Consumer to Queue in the master branch to avoid conflicts
  • Move pkg/* into pkg/batchd/ for master branch and build kube-batchd
  • Move release-pre0.1 (which only includes resource allocation for the original Queue) into pkg/quotalloc/ in the master branch and rename Queue to ResourceQuotaAllocation

[Request] Publish scheduler events to pod

The default scheduler publishes events to pod:

  FirstSeen	LastSeen	Count	From			SubObjectPath				Type		Reason		Message
  ---------	--------	-----	----			-------------				--------	------		-------
  17m		17m		1	default-scheduler						Normal		Scheduled	Successfully assigned kube-batchd-5f86bd5b6-hx8kt to ist

But batchd does not implement this; it seems that batchd does not have an event recorder. I think one should be added so that users know which scheduler scheduled the pod.

Race condition between cache and policy

kube-batchd reuses the Kubernetes PDB for minAvailable. The scheduler cache syncs PDBs from the api-server, and the policy gets a cache snapshot for scheduling.

However, the policy may get the cache snapshot before the cache has synced the PDB from the api-server. This may cause the policy to miss the minAvailable of a PodSet when scheduling its pods.

StatefulSet is not working well with pdb minAvailable

Steps to reproduce:
The cluster can run 3 pods. Whatever replica count I put in the YAML file, as long as the PDB minAvailable is larger than 1, the cluster cannot run the pods.

pdb.yaml

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: pdb-01
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nginx

ss.yaml

apiVersion: v1
kind: Service
metadata:
  name: nginx
  labels:
    app: nginx
spec:
  ports:
  - port: 80
    name: web
  clusterIP: None
  selector:
    app: nginx
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  selector:
    matchLabels:
      app: nginx # has to match .spec.template.metadata.labels
  serviceName: "nginx"
  replicas: 3
  template:
    metadata:
      labels:
        app: nginx # has to match .spec.selector.matchLabels
    spec:
      schedulerName: kube-batchd
      terminationGracePeriodSeconds: 10
      containers:
      - name: nginx
        image: k8s.gcr.io/nginx-slim:0.8
        resources:
          limits:
            memory: "3Gi"
            cpu: "3"
          requests:
            memory: "3Gi"
            cpu: "3"
