Hi, all!
I've been investigating how applicable kube-arbitrator might be for running multinode MPI jobs on a Kubernetes cluster. Here are my observations - please let me know if I was doing something wrong or whether the outcome was as expected. Also, please note that I don't have much experience with Kubernetes.
For the experiments, I had three nodes with 80 CPUs each.
I first experimented with kube-batchd, hoping that it wouldn't schedule a job until the requested resources were available. I set up one dormant (sleep infinity) pod that used 50 CPUs. I then created a pod disruption budget with matchLabels: app: mpi3pod and minAvailable: 3. Finally, I created a deployment with 3 replicas, labelled app: mpi3pod, using kube-arbitrator as the scheduler and requesting 50 CPUs per replica. A sketch of the manifests is below.
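Roughly, this is what I applied. The names and container image here are placeholders, and I'm assuming the batch scheduler registers under the schedulerName kube-batchd - adjust to whatever your deployment actually uses:

apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: mpi3pod-pdb
spec:
  minAvailable: 3
  selector:
    matchLabels:
      app: mpi3pod
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mpi3pod
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mpi3pod
  template:
    metadata:
      labels:
        app: mpi3pod
    spec:
      schedulerName: kube-batchd   # assumption: the name the batch scheduler registers under
      containers:
      - name: mpi
        image: centos:7            # placeholder image
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: "50"              # 50 CPUs per replica, as described above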
It seems that minAvailable is not honoured in the way I expected: the whole deployment was marked as "not ready" (good), but 2 of the 3 replicas were launched anyway. I had hoped that kube-arbitrator would not start any of the pods until all the required resources were available. That's not how it works, I presume?
The second experiment was with kube-quotaalloc. My intended use was to generate a separate namespace/QuotaAllocator/ResourceQuota for each MPI job submitted. So I created three namespaces, three QuotaAllocators (with 100 CPUs each) and three resource quotas, but upon running kube-quotaalloc I saw that the quotas were clipped to fit within one node:
$ kubectl get quota rq01 -n allocator-ns01 -o yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  creationTimestamp: 2018-02-05T11:08:38Z
  name: rq01
  namespace: allocator-ns01
  resourceVersion: "7507551"
  selfLink: /api/v1/namespaces/allocator-ns01/resourcequotas/rq01
  uid: eac65704-0a64-11e8-b0e9-54ab3a8ca064
spec:
  hard:
    limits.cpu: "80"
    limits.memory: "540954869760"
    pods: "100"
    requests.cpu: "80"
    requests.memory: "540954869760"
status:
  hard:
    limits.cpu: "80"
    limits.memory: "540954869760"
    pods: "100"
    requests.cpu: "80"
    requests.memory: "540954869760"
  used:
    limits.cpu: "0"
    limits.memory: "0"
    pods: "0"
    requests.cpu: "0"
    requests.memory: "0"
The quota was initially 100 CPUs, but was trimmed down to 80, i.e. the CPU count of a single node.
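For reference, this is approximately what I created for each job. The Namespace and ResourceQuota are standard objects; the QuotaAllocator spec fields are my reading of the kube-arbitrator tutorial and may not be exact:

apiVersion: v1
kind: Namespace
metadata:
  name: allocator-ns01
---
apiVersion: arbitrator.incubator.k8s.io/v1
kind: QuotaAllocator
metadata:
  name: qa01
  namespace: allocator-ns01
spec:
  weight: 1                  # assumption: share weight relative to other allocators
  request:
    resources:
      cpu: "100"             # assumption: how the 100-CPU request per job is expressed
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: rq01
  namespace: allocator-ns01
spec:
  hard:
    requests.cpu: "100"
    limits.cpu: "100"
    pods: "100"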
How can I use kube-arbitrator as a resource manager/scheduler for multinode MPI jobs? Ideally, I'd want a Job's pods to only be scheduled once the resources for the whole job become available.
Also, can kube-batchd handle two applications being submitted at roughly the same time in a way that avoids deadlock? Imagine we have 5 nodes and two applications that each require 4 nodes as a hard request, i.e. each waits for all of its pods to be running before making progress, indefinitely blocking whatever nodes it has been assigned. If both were submitted at the same time, app1 might take e.g. 3 nodes while app2 takes the remaining 2. Neither app can make progress until it gets more nodes, and hence a deadlock occurs. A sketch of the scenario is below.
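For concreteness, expressed with the PDB approach from my first experiment (all names hypothetical, and the same schedulerName assumption as above), the racy pair would look something like this:

# Both apps need 4 whole nodes; with 5 nodes total, a 3/2 split wedges both.
apiVersion: policy/v1beta1
kind: PodDisruptionBudget
metadata:
  name: app1-pdb
spec:
  minAvailable: 4
  selector:
    matchLabels:
      app: app1
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1
spec:
  replicas: 4                # one replica per node
  selector:
    matchLabels:
      app: app1
  template:
    metadata:
      labels:
        app: app1
    spec:
      schedulerName: kube-batchd   # assumption, as in the first experiment
      containers:
      - name: mpi
        image: centos:7            # placeholder image
        command: ["sleep", "infinity"]
        resources:
          requests:
            cpu: "80"              # claims a whole 80-CPU node
# app2 is identical except app: app2. If app1 gets 3 pods placed and app2
# gets the remaining 2, no free node is left and neither gang ever completes.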