armadaproject / armada
A multi-cluster batch queuing system for high-throughput workloads on Kubernetes.
Home Page: https://armadaproject.io
License: Apache License 2.0
Currently there is no way to see how long the background loops are taking, or to tell when they slow down.
If they slow down significantly, it could impair the component's ability to function at all,
e.g. if leasing takes 10 minutes, every lease will expire.
We can use either Kind or a real Kubernetes cluster.
Tests should cover all basic operations at a small scale.
https://github.com/bsycorp/kind
This is meant to be optimized for CI, so it only takes 30 seconds to start a cluster with:
The job definition needs to be stored and placed onto the queue.
The job should be removed from storage after it finishes (job history will be recorded in the Events Recording).
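The store-then-queue lifecycle above can be sketched in memory. This is only an illustration of the semantics (persist the definition, enqueue the id, delete the record on completion); the type and method names are assumptions, and the real store would be Redis or similar rather than a map.

```go
package main

import (
	"fmt"
	"sync"
)

// Job is a minimal stand-in for an Armada job definition.
type Job struct {
	Id   string
	Spec string
}

// JobRepository sketches the lifecycle: the definition is persisted, the
// id is queued for scheduling, and the record is deleted once finished.
type JobRepository struct {
	mu    sync.Mutex
	store map[string]Job
	queue []string
}

func NewJobRepository() *JobRepository {
	return &JobRepository{store: map[string]Job{}}
}

func (r *JobRepository) Submit(j Job) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.store[j.Id] = j               // persist the definition
	r.queue = append(r.queue, j.Id) // enqueue for scheduling
}

// Finish removes the job from storage; history is assumed to live in the
// separate events recording, so nothing is kept here.
func (r *JobRepository) Finish(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.store, id)
}

func main() {
	repo := NewJobRepository()
	repo.Submit(Job{Id: "job-1", Spec: "sleep 60"})
	fmt.Println(len(repo.store), len(repo.queue)) // 1 1
	repo.Finish("job-1")
	fmt.Println(len(repo.store)) // 0
}
```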
Problem
If you run a load test with multiple jobsets, the load test will watch all of them.
However, there is no way to stop or restart this watching of multiple jobsets, meaning you have to leave it running and cannot re-watch an experiment as it happened.
Suggested solution
@itamarst suggested
Problem
When submitting 10000 - 100000 jobs at a time, it takes a long time to complete submission.
This is because the jobs are submitted one at a time.
Suggested solution
The API should be updated to support bulk submission, and the client should be updated to submit in groups of, say, 100 at a time.
This should significantly speed up submitting many jobs at once.
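The client-side half of this is just chunking the job list before calling the (proposed) bulk endpoint. A minimal sketch, where the batch size of 100 is the figure suggested above rather than a tuned value:

```go
package main

import "fmt"

// chunk splits jobs into batches of at most n, so the client can make one
// bulk-submit call per batch instead of one call per job.
func chunk(jobs []string, n int) [][]string {
	var batches [][]string
	for len(jobs) > 0 {
		end := n
		if len(jobs) < n {
			end = len(jobs)
		}
		batches = append(batches, jobs[:end])
		jobs = jobs[end:]
	}
	return batches
}

func main() {
	jobs := make([]string, 250)
	batches := chunk(jobs, 100)
	fmt.Println(len(batches), len(batches[0]), len(batches[2])) // 3 100 50
}
```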
Problem
All of our command-line tool code (armadactl/armada-load-tester) has to use fmt.Println to log messages to the user.
This is inconsistent with the rest of the codebase and less flexible (if, say, we wanted to also write a log of what the user had done to a file).
Suggested solution
We could implement a custom logrus formatter, so the output looks like fmt.Println but actually runs through logrus.
Some details on how to do that:
https://stackoverflow.com/questions/43022607/how-do-i-disable-field-name-in-logrus-while-logging-to-file
... and update logo
Problem
When watching a very large jobset (100000+ jobs) it takes a very long time to catch up to the most recent events in the stream.
Cause
Currently the code that uses watch generates a summary of the current state each time an event comes in.
This involves looking through all jobs seen so far and reporting the state each one is in.
For large jobsets this is highly inefficient: the application is slow and uses a lot of CPU.
Suggested solution
We should just make a state struct that holds the current state. Each new event would simply update the state, rather than regenerating a state summary each time.
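The incremental approach can be sketched as follows: each event moves one job between per-state buckets in O(1) instead of rescanning every job seen so far. The struct shape and state names are illustrative assumptions.

```go
package main

import "fmt"

// JobSetState keeps running counts per job state, updated incrementally
// as events arrive rather than rebuilt from scratch on every event.
type JobSetState struct {
	states map[string]string // jobId -> current state
	counts map[string]int    // state -> number of jobs in it
}

func NewJobSetState() *JobSetState {
	return &JobSetState{states: map[string]string{}, counts: map[string]int{}}
}

// Apply records an event moving jobId into newState: decrement the old
// bucket (if any), increment the new one.
func (s *JobSetState) Apply(jobId, newState string) {
	if old, ok := s.states[jobId]; ok {
		s.counts[old]--
	}
	s.states[jobId] = newState
	s.counts[newState]++
}

func main() {
	s := NewJobSetState()
	s.Apply("job-1", "Queued")
	s.Apply("job-2", "Queued")
	s.Apply("job-1", "Running")
	fmt.Println(s.counts["Queued"], s.counts["Running"]) // 1 1
}
```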
This functionality will be implemented in the Executor component.
Acceptance criteria: We can submit and run job on Kubernetes cluster.
Problem
Currently the executor reports pod usage to the server simply as the pod's request values.
This is problematic:
If request != limit, the executor could over-allocate the cluster (potentially by a lot if request and limit differ significantly).
Potential solution
The executor should report pod usage as:
max(pod request, pod actual usage)
This means that when pod usage > pod request, we don't over-allocate the cluster.
This will be further complicated if we want to start doing preemption based on pods exceeding their request.
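The proposed reporting rule is a per-resource max over the two maps. A minimal sketch, with resource names and integer units chosen purely for illustration:

```go
package main

import "fmt"

// reportedUsage returns max(request, actual) per resource, so a pod
// exceeding its request is accounted at its real consumption and the
// cluster is not over-allocated.
func reportedUsage(request, actual map[string]int64) map[string]int64 {
	usage := map[string]int64{}
	for res, req := range request {
		usage[res] = req
	}
	for res, act := range actual {
		if act > usage[res] {
			usage[res] = act
		}
	}
	return usage
}

func main() {
	request := map[string]int64{"cpu": 1000, "memory": 512}
	actual := map[string]int64{"cpu": 1500, "memory": 256}
	fmt.Println(reportedUsage(request, actual)) // map[cpu:1500 memory:512]
}
```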
Considerations
Let's record all job events into Redis Streams, though there might be better technology for this.
Problem
There is no native support for .Net client applications.
This means there is no way for users to programmatically interact with Armada without writing their own gRPC client.
Suggested solution
Create a .Net client that implements all the gRPC endpoints.
Hopefully this can be generated from the .proto files, and possibly generated for several languages at once.
User usage and priority should be stored in Redis. Initially we will implement a Condor-like priority algorithm (https://htcondor.readthedocs.io/en/v8_8_3/admin-manual/user-priorities-negotiation.html#priority-calculation).
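The linked HTCondor calculation treats a user's priority as an exponential moving average of their resource usage, decaying toward current usage with a configurable half-life. This is a sketch of that update rule under those assumptions; the function signature and constants are not Armada's final design.

```go
package main

import (
	"fmt"
	"math"
)

// updatePriority moves a user's priority toward their current usage,
// Condor-style: after one half-life, priority has covered half the
// distance from its old value to the current usage.
func updatePriority(oldPriority, currentUsage, deltaSeconds, halfLifeSeconds float64) float64 {
	beta := math.Pow(0.5, deltaSeconds/halfLifeSeconds)
	return beta*oldPriority + (1-beta)*currentUsage
}

func main() {
	// Starting from priority 0 with usage 100, after exactly one
	// half-life the priority is 50.
	fmt.Println(updatePriority(0, 100, 3600, 3600)) // 50
}
```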
Problem
Sometimes helm doesn't update configmaps/deployments correctly, particularly when going from A -> B -> A.
There have been many helm issues tracking this over time; the current one appears to be:
helm/helm#5915
Fix ideas
Alternatives
As a developer I need to run e2e and unit tests with one simple command.
Swap the current reporting from limit to request.
For now all batch pods will use request = limit.
This makes scheduling better match the current state in Kubernetes (which schedules based on request rather than limit).
Problem
When a pod is cancelled, the executor never sends an event confirming that the pod was killed.
Instead it is just assumed that because the API will refuse to lease the job, it will get cancelled. However, from a user's point of view there is no confirmation when this occurs.
Suggested solution
For cancelled pods due to lease renew failure, report a terminated event.
When Kubernetes is slow to delete jobs:
This appears to happen when there is a lot of scheduling and deletion is given a lower priority (I don't know if there is a way to fix it).
Our application tells Kubernetes to delete a pod, but it doesn't get deleted.
We only delete when the Armada server says the job has been "cleaned" when calling ReportDone.
If ReportDone has already been called on the job, the server will no longer report it as "cleaned". So we won't tell Kubernetes to delete it again, and the job just never goes away.
Let's label all pods before deletion so we know which ones were already reported to the Armada server.
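One way to do this is to patch a marker label onto the pod just before issuing the delete, so a later sweep can retry deletion of anything still carrying the label. The sketch below only builds the strategic-merge-patch body (with client-go it would be sent via `Pods(ns).Patch` before the `Delete` call); the label key is a made-up example, not an existing Armada convention.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// markedForDeletionPatch builds a strategic-merge-patch body that adds a
// label to a pod before we ask Kubernetes to delete it. If the pod is
// still around later, the label tells us it was already reported done,
// so the deletion can safely be retried.
func markedForDeletionPatch() ([]byte, error) {
	patch := map[string]interface{}{
		"metadata": map[string]interface{}{
			"labels": map[string]string{
				"armada/marked-for-deletion": "true",
			},
		},
	}
	return json.Marshal(patch)
}

func main() {
	p, _ := markedForDeletionPatch()
	fmt.Println(string(p))
}
```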
The current algorithm uses priority to decide how much of a cluster's lease request should be allocated to each queue. If the requests are small enough, the queue with the highest priority becomes oversubscribed.
We should leverage information about current queue usage to schedule more optimally and distribute resources more evenly.
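One usage-aware variant is to scale each queue's priority down by its current usage before dividing up the lease, so heavily-used queues yield to idle ones. The `priority / (1 + usage)` weighting below is an illustrative assumption, not the algorithm the project settled on:

```go
package main

import "fmt"

// allocateShares divides a total resource amount across queues by
// usage-discounted priority: each queue's weight is its priority divided
// by (1 + current usage), then shares are normalized to the total.
func allocateShares(priority, usage map[string]float64, total float64) map[string]float64 {
	weights := map[string]float64{}
	sum := 0.0
	for q, p := range priority {
		w := p / (1 + usage[q])
		weights[q] = w
		sum += w
	}
	shares := map[string]float64{}
	for q, w := range weights {
		shares[q] = total * w / sum
	}
	return shares
}

func main() {
	// Equal priorities, but queue "a" is already using 3 units while
	// "b" is idle, so "b" receives the larger share.
	priority := map[string]float64{"a": 1, "b": 1}
	usage := map[string]float64{"a": 3, "b": 0}
	fmt.Println(allocateShares(priority, usage, 100)) // map[a:20 b:80]
}
```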