armadaproject / armada
A multi-cluster batch queuing system for high-throughput workloads on Kubernetes.
Home Page: https://armadaproject.io
License: Apache License 2.0
Currently there is no way to see how long the background loops are taking, or to tell when they slow down.
If they slow down significantly, it could impair the component's ability to function at all,
e.g. if leasing takes 10 minutes, every lease will expire.
We can use either Kind or a real Kubernetes cluster.
Tests should cover all basic operations at a small scale.
https://github.com/bsycorp/kind
This is meant to be optimized for CI, so it only takes 30 seconds to start a cluster with:
The job definition needs to be stored and placed onto the queue.
The job should be removed from storage after it finishes (job history will be recorded in the Events Recording).
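The store-then-queue lifecycle above can be sketched in memory. This is only an illustration of the semantics (persist the definition, enqueue the id, delete the record on completion); the type and method names are assumptions, and the real store would be Redis or similar rather than a map.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a minimal stand-in for an Armada job definition.
type Job struct {
	Id   string
	Spec string
}

// JobRepository sketches the lifecycle: the definition is persisted, the
// id is queued for scheduling, and the record is deleted once finished.
type JobRepository struct {
	mu    sync.Mutex
	store map[string]Job
	queue []string
}

func NewJobRepository() *JobRepository {
	return &JobRepository{store: map[string]Job{}}
}

func (r *JobRepository) Submit(j Job) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.store[j.Id] = j               // persist the definition
	r.queue = append(r.queue, j.Id) // enqueue for scheduling
}

// Finish removes the job from storage; history is assumed to live in the
// separate events recording, so nothing is kept here.
func (r *JobRepository) Finish(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.store, id)
}

func main() {
	repo := NewJobRepository()
	repo.Submit(Job{Id: "job-1", Spec: "sleep 60"})
	fmt.Println(len(repo.store), len(repo.queue)) // 1 1
	repo.Finish("job-1")
	fmt.Println(len(repo.store)) // 0
}
```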
Problem
If you run a load test with multiple jobsets, the load test will watch all of them.
However, there is no way to stop or restart this watching of multiple jobsets, meaning you have to leave it running and cannot re-watch an experiment as it happened.
Suggested solution
@itamarst suggested
Problem
When submitting 10000 - 100000 jobs at a time, it takes a long time to complete submission.
This is because the jobs are submitted one at a time.
Suggested solution
The API should be updated to support bulk submission, and the client should be updated to submit in groups of, say, 100 at a time.
This should significantly speed up submitting many jobs at once.
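The client-side half of this is just chunking the job list before calling the (proposed) bulk endpoint. A minimal sketch, where the batch size of 100 is the figure suggested above rather than a tuned value:

```go
package main

import "fmt"

// chunk splits jobs into batches of at most n, so the client can make one
// bulk-submit call per batch instead of one call per job.
func chunk(jobs []string, n int) [][]string {
	var batches [][]string
	for len(jobs) > 0 {
		end := n
		if len(jobs) < n {
			end = len(jobs)
		}
		batches = append(batches, jobs[:end])
		jobs = jobs[end:]
	}
	return batches
}

func main() {
	jobs := make([]string, 250)
	batches := chunk(jobs, 100)
	fmt.Println(len(batches), len(batches[0]), len(batches[2])) // 3 100 50
}
```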
Problem
All of our command-line tool code (armadactl/armada-load-tester) has to use fmt.Println to log messages to the user.
This is inconsistent with the rest of the codebase and less flexible (if, say, we wanted to also write a log of what the user had done to a file).
Suggested solution
We could implement a custom logrus formatter, so the output looks like fmt.Println but actually runs through logrus.
Some details on how to do that:
https://stackoverflow.com/questions/43022607/how-do-i-disable-field-name-in-logrus-while-logging-to-file
... and update logo
Problem
When watching a very large jobset (100000+ jobs) it takes a very long time to catch up to the most recent events in the stream.
Cause
Currently the code that uses watch generates a summary of the current state each time an event comes in.
This involves looking through all jobs seen so far and reporting the state each one is in.
For large jobsets this is highly inefficient: the application is slow and uses a lot of CPU.
Suggested solution
We should just make a state struct that holds the current state. Each new event would simply update the state, rather than regenerating a state summary each time.
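The incremental approach can be sketched as follows: each event moves one job between per-state buckets in O(1) instead of rescanning every job seen so far. The struct shape and state names are illustrative assumptions.

```go
package main

import "fmt"

// JobSetState keeps running counts per job state, updated incrementally
// as events arrive rather than rebuilt from scratch on every event.
type JobSetState struct {
	states map[string]string // jobId -> current state
	counts map[string]int    // state -> number of jobs in it
}

func NewJobSetState() *JobSetState {
	return &JobSetState{states: map[string]string{}, counts: map[string]int{}}
}

// Apply records an event moving jobId into newState: decrement the old
// bucket (if any), increment the new one.
func (s *JobSetState) Apply(jobId, newState string) {
	if old, ok := s.states[jobId]; ok {
		s.counts[old]--
	}
	s.states[jobId] = newState
	s.counts[newState]++
}

func main() {
	s := NewJobSetState()
	s.Apply("job-1", "Queued")
	s.Apply("job-2", "Queued")
	s.Apply("job-1", "Running")
	fmt.Println(s.counts["Queued"], s.counts["Running"]) // 1 1
}
```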
This functionality will be implemented in the Executor component.
Acceptance criteria: We can submit and run job on Kubernetes cluster.
Problem
Currently the executor reports pod usage to the server simply as the pod's request values.
This is problematic:
If request != limit, the executor could over-allocate the cluster (potentially by a lot if request and limit differ significantly).
Potential solution
The executor should report pod usage as:
max(pod request, pod actual usage)
This means that when pod usage > pod request, we don't over-allocate the cluster.
This will be further complicated if we want to start doing preemption based on pods exceeding their request.
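The proposed reporting rule is a per-resource max over the two maps. A minimal sketch, with resource names and integer units chosen purely for illustration:

```go
package main

import "fmt"

// reportedUsage returns max(request, actual) per resource, so a pod
// exceeding its request is accounted at its real consumption and the
// cluster is not over-allocated.
func reportedUsage(request, actual map[string]int64) map[string]int64 {
	usage := map[string]int64{}
	for res, req := range request {
		usage[res] = req
	}
	for res, act := range actual {
		if act > usage[res] {
			usage[res] = act
		}
	}
	return usage
}

func main() {
	request := map[string]int64{"cpu": 1000, "memory": 512}
	actual := map[string]int64{"cpu": 1500, "memory": 256}
	fmt.Println(reportedUsage(request, actual)) // map[cpu:1500 memory:512]
}
```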
Considerations
Let's record all job events into Redis Streams, though there might be better technology for this.
Problem
There is no native support for .Net client applications.
This means there is no way for users to programmatically interact with Armada without writing their own gRPC client.
Suggested solution
Create a .Net client that implements all the gRPC endpoints.
Hopefully this can be generated from the .proto files, and possibly generated for several languages at once.
User usage and priority should be stored in Redis. Initially we will implement a Condor-like priority algorithm (https://htcondor.readthedocs.io/en/v8_8_3/admin-manual/user-priorities-negotiation.html#priority-calculation).
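The linked HTCondor calculation treats a user's priority as an exponential moving average of their resource usage, decaying toward current usage with a configurable half-life. This is a sketch of that update rule under those assumptions; the function signature and constants are not Armada's final design.

```go
package main

import (
	"fmt"
	"math"
)

// updatePriority moves a user's priority toward their current usage,
// Condor-style: after one half-life, priority has covered half the
// distance from its old value to the current usage.
func updatePriority(oldPriority, currentUsage, deltaSeconds, halfLifeSeconds float64) float64 {
	beta := math.Pow(0.5, deltaSeconds/halfLifeSeconds)
	return beta*oldPriority + (1-beta)*currentUsage
}

func main() {
	// Starting from priority 0 with usage 100, after exactly one
	// half-life the priority is 50.
	fmt.Println(updatePriority(0, 100, 3600, 3600)) // 50
}
```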
Problem
Sometimes helm doesn't update configmaps/deployments correctly, particularly when going from A -> B -> A.
There have been many helm issues tracking this over time; the current one appears to be:
helm/helm#5915
Fix ideas
Alternatives
As a developer I need to run e2e and unit tests with one simple command.
Swap the current reporting from limit to request.
For now all batch pods will use request = limit.
This makes scheduling better match the current state in Kubernetes (which schedules based on request rather than limit).
Problem
When a pod is cancelled, the executor never sends an event confirming that the pod was killed.
Instead it is just assumed that because the API will refuse to lease the job, it will get cancelled. However, from a user's point of view there is no confirmation when this occurs.
Suggested solution
For cancelled pods due to lease renew failure, report a terminated event.
When Kubernetes is slow to delete jobs:
This appears to happen when there is a lot of scheduling and deletion is given a lower priority (I don't know if there is a way to fix it).
Our application tells Kubernetes to delete a pod, but it doesn't get deleted.
We only delete when the Armada server says the job has been "cleaned" when calling ReportDone.
If ReportDone has already been called on the job, the server will no longer report it as "cleaned". So we won't tell Kubernetes to delete it again, and the job just never goes away.
Let's label all pods before deletion so we know which ones were already reported to the Armada server.
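One way to do this is to patch a marker label onto the pod just before issuing the delete, so a later sweep can retry deletion of anything still carrying the label. The sketch below only builds the strategic-merge-patch body (with client-go it would be sent via `Pods(ns).Patch` before the `Delete` call); the label key is a made-up example, not an existing Armada convention.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// markedForDeletionPatch builds a strategic-merge-patch body that adds a
// label to a pod before we ask Kubernetes to delete it. If the pod is
// still around later, the label tells us it was already reported done,
// so the deletion can safely be retried.
func markedForDeletionPatch() ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": map[string]string{
				"armada/marked-for-deletion": "true",
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	p, _ := markedForDeletionPatch()
	fmt.Println(string(p))
}
```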
The current algorithm uses priority to decide how much of a cluster's lease request should be allocated to each queue. If the requests are small enough, the queue with the highest priority becomes oversubscribed.
We should leverage information about current queue usage to schedule more optimally and distribute resources more evenly.
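One usage-aware variant is to scale each queue's priority down by its current usage before dividing up the lease, so heavily-used queues yield to idle ones. The `priority / (1 + usage)` weighting below is an illustrative assumption, not the algorithm the project settled on:

```go
package main

import "fmt"

// allocateShares divides a total resource amount across queues by
// usage-discounted priority: each queue's weight is its priority divided
// by (1 + current usage), then shares are normalized to the total.
func allocateShares(priority, usage map[string]float64, total float64) map[string]float64 {
	weights := map[string]float64{}
	sum := 0.0
	for q, p := range priority {
		w := p / (1 + usage[q])
		weights[q] = w
		sum += w
	}
	shares := map[string]float64{}
	for q, w := range weights {
		shares[q] = total * w / sum
	}
	return shares
}

func main() {
	// Equal priorities, but queue "a" is already using 3 units while
	// "b" is idle, so "b" receives the larger share.
	priority := map[string]float64{"a": 1, "b": 1}
	usage := map[string]float64{"a": 3, "b": 0}
	fmt.Println(allocateShares(priority, usage, 100)) // map[a:20 b:80]
}
```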