Tortoise

Get cute Tortoises into your Kubernetes garden and say goodbye to the days of optimizing your rigid autoscalers.

Tortoise is still at an early stage, and we recommend carefully evaluating its behavior with your services in your development environment.

Motivation

At Mercari, the responsibilities of the Platform team and the service development teams are clearly distinguished. Not all service owners possess expert knowledge of Kubernetes.

Also, Mercari has embraced a microservices architecture, currently managing over 1000 Deployments, each with its dedicated development team.

To effectively drive FinOps across such a sprawling landscape, it's clear that the platform team cannot individually optimize all services. As a result, they provide a plethora of tools and guidelines to simplify the process of Kubernetes optimization for service owners.

But even with these tools, manually optimizing various parameters across different resources, such as resource requests/limits, HPA parameters, and Golang runtime environment variables, presents a substantial challenge.

Furthermore, this optimization demands constant engineering effort from each team: adjustments are necessary whenever a change impacts resource usage, which happens frequently. Changes in implementation can alter resource consumption patterns, fluctuations in traffic volume are common, and so on.

Therefore, keeping our Kubernetes clusters optimized would require mandating that all teams engage in complex manual optimization processes indefinitely, or until Mercari goes out of business.

To address these challenges, the platform team has embarked on developing Tortoise, an automated solution designed to meet all Kubernetes resource optimization needs.

This approach shifts the optimization responsibility from service owners to the platform team (Tortoises), allowing the platform team to tune things comprehensively and ensure that every Tortoise in the cluster adapts to its workload. Service owners, on the other hand, only need to configure a minimal number of parameters to start autoscaling with Tortoise, significantly simplifying their involvement.

See more details in the blog post.

Install

You cannot get it from a breeder; you need to get it from GitHub instead.

# Install CRDs into the K8s cluster specified in ~/.kube/config.
make install
# Deploy controller to the K8s cluster specified in ~/.kube/config.
make deploy

You don't need a rearing cage, but you do need the Vertical Pod Autoscaler (VPA) installed in your Kubernetes cluster before installing Tortoise.

Usage

As described in the Motivation section, Tortoise exposes many global parameters to the cluster admin, while exposing only a few parameters on the Tortoise resource itself.

Cluster admin

See the Admin guide to understand how to configure the tortoise controller to fit the workloads in your cluster.

Tortoise users

Tortoise CRD itself has a very simple interface:

apiVersion: autoscaling.mercari.com/v1beta3
kind: Tortoise
metadata:
  name: lovely-tortoise
  namespace: zoo
spec:
  updateMode: Auto 
  targetRefs:
    scaleTargetRef:
      kind: Deployment
      name: sample

Then, Tortoise creates an HPA and a VPA under the hood. Despite its simple appearance, each tortoise stores a rich collection of historical data on resource utilization beneath its shell, and cleverly uses it to manage the parameters of the autoscalers.
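
For illustration, the HPA that a tortoise manages looks roughly like the following. This is a minimal sketch: the object is generated and owned by the controller, the name is hypothetical, and every value shown is continuously re-tuned from the stored history:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: tortoise-hpa-sample   # hypothetical; the actual name is controller-defined
  namespace: zoo
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: sample
  minReplicas: 3    # continuously tuned by Tortoise
  maxReplicas: 20   # continuously tuned by Tortoise
  metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: app
      target:
        type: Utilization
        averageUtilization: 75   # utilization target recommended by Tortoise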

Please refer to the User guide to learn more about other parameters.

Documentation

  • User guide: describes the minimum knowledge that end users have to know, and how they can configure Tortoise so that tortoises autoscale their workloads.
  • Admin guide: describes how the cluster admin can configure the global behavior of tortoise.
  • Emergency mode: describes the emergency mode.
  • Horizontal scaling: describes how the Tortoise does the horizontal autoscaling internally.
  • Vertical scaling: describes how the Tortoise does the vertical autoscaling internally.
  • Technical details: describes the technical details of Tortoise. (mostly for contributors)
  • Contributor guide: describes other topics for contributors. (testing etc.)

API definition

Notes

Here are some notes that you may want to pay attention to before starting to use Tortoise.

  • Tortoise only supports Deployment at the moment. In the future, we'll support all resources that support the scale subresource.
  • At Mercari, we've evaluated Tortoise with many Golang microservices, while only a few services implemented in other languages use Tortoise. Any contributions to enhance the recommendations for services in your language are welcome!

Contribution

Before implementing any feature changes as Pull Requests, please raise an Issue and discuss what you propose with the maintainers.

A major change may have to be proposed via the proposal process.

Also, please read the CLA carefully before submitting your contribution to Mercari. Under any circumstances, by submitting your contribution, you are deemed to accept and agree to be bound by the terms and conditions of the CLA.

https://www.mercari.com/cla/

LICENSE

Copyright 2023 Mercari, Inc.

Licensed under the MIT License.


Issues

mutable `.spec.ResourcePolicy`

Currently, ResourcePolicy is supposed not to be changed after creation. If we make it mutable:

  • For VPA, we don't need to change anything.
  • If the HPA is created by tortoise, we need to update the HPA.
  • If the HPA is given by users, we don't need to do anything with the HPA, but users need to update their HPA to have the metric for a new container.
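
For reference, a minimal sketch of a Tortoise with a resourcePolicy, assuming the v1beta3 schema (field names are illustrative and may differ across API versions):

apiVersion: autoscaling.mercari.com/v1beta3
kind: Tortoise
metadata:
  name: lovely-tortoise
  namespace: zoo
spec:
  updateMode: Auto
  targetRefs:
    scaleTargetRef:
      kind: Deployment
      name: sample
  resourcePolicy:
    - containerName: app
      minAllocatedResources:   # assumed field: a floor for vertical recommendations
        cpu: 100m
        memory: 100Mi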

implement `TortoiseBackToNormal` phase

After the emergency mode ends, we'd like to prevent scaling down too rapidly. Instead, it's better to scale down gradually, and Tortoise would have a TortoiseBackToNormal phase during that period.

VPA patches the deployment instead of individual pods

I think by default VPA should patch the deployment.spec for requests instead of pod.spec

This ensures that any other pod from the ReplicaSet can also handle an uneven spike if one hits it. Also, when ReplicaSets are recreated, these new VPA-recommended values persist across pod creation.

This is highly dependent on the controller type though; e.g., DaemonSets and StatefulSets should not apply the same changes across all pods.

The integration test for the controller and webhook

Currently, the major functions mostly have enough unit tests.
But we don't have integration tests for the controller package.

This issue is about integration tests (not e2e tests), so we don't need to spin up clusters (kind, minikube, etc.), which are too troublesome to wait for at startup. Let's just use envtest.
https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/envtest

  • [single container] test for the reconcile with the tortoise TortoisePhaseWorking #22
  • [single container] test for the reconcile with the tortoise TortoisePhaseEmergency #57
  • [multi container] test for the reconcile with the tortoise TortoisePhaseWorking #70
  • [multi container] test for the reconcile with the tortoise TortoisePhaseEmergency #70
  • test for the webhook

observability: the metrics to show each recommendation

  • proposed_hpa_utilization_target (Histogram): HPA utilization target that tortoise recommends
    • label tortoise_name: tortoise name
    • label namespace: tortoise namespace
    • label index: the metric index. (0-indexed)
    • label hpa_name: hpa name
  • proposed_hpa_minreplica (Histogram): HPA minReplicas that tortoise recommends
    • label tortoise_name
    • label namespace
    • label hpa_name: hpa name
  • proposed_hpa_maxreplica (Histogram): HPA maxReplicas that tortoise recommends
    • label tortoise_name
    • label namespace
    • label hpa_name: hpa name
  • proposed_cpu_request (Histogram): CPU request that tortoise recommends
    • label tortoise_name
    • label namespace
    • label container_name
  • proposed_memory_request (Histogram): Memory request that tortoise recommends
    • label tortoise_name
    • label namespace
    • label container_name

Support only `Vertical` tortoise

Currently, a tortoise needs to have at least one Horizontal autoscaling policy.
But people may want a tortoise with only Vertical for both CPU and memory.
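
A minimal sketch of such a Vertical-only tortoise, assuming a per-container autoscalingPolicy field (the exact field names are illustrative):

apiVersion: autoscaling.mercari.com/v1beta3
kind: Tortoise
metadata:
  name: vertical-tortoise
  namespace: zoo
spec:
  updateMode: Auto
  targetRefs:
    scaleTargetRef:
      kind: Deployment
      name: sample
  autoscalingPolicy:           # illustrative; exact schema may differ
    - containerName: app
      policy:
        cpu: Vertical          # no Horizontal anywhere: no HPA needed
        memory: Vertical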

The multi container specific optimization for `Horizontal`

Let's say

  • the Pod has two containers (app and istio-proxy)
  • both containers' CPU scales via HPA (both target 75%)
  • the app container usually runs at around 75%, whereas the istio-proxy container usually runs at around 50%.
    • This can happen because the app container always kicks the HPA to scale out, while istio-proxy never needs to.

In this case, some of the CPU given to istio-proxy is always wasted, and we should either make the CPU request of istio-proxy smaller or make the CPU request of the app container bigger.

This issue is to automate such optimization in the tortoise controller.
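
To make the scenario concrete, here is what the relevant HPA metrics look like (standard autoscaling/v2 fields; the numbers mirror the example above):

# The HPA scales out when any container hits its target. With both targets
# at 75%, the app container drives every scale-out, so istio-proxy never
# reaches its own target and idles at ~50% utilization.
metrics:
  - type: ContainerResource
    containerResource:
      name: cpu
      container: app           # runs at ~75%: triggers all scaling
      target:
        type: Utilization
        averageUtilization: 75
  - type: ContainerResource
    containerResource:
      name: cpu
      container: istio-proxy   # runs at ~50%: permanently over-provisioned
      target:
        type: Utilization
        averageUtilization: 75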

Implement `GatheringData` phase on tortoise

Tortoise generates its recommendations from past data.
So, at the very least, users need to run Tortoise in dry-run mode first so that it can gather enough data to generate accurate recommendations.

During that period, the controller marks the tortoise as being in the GatheringData phase, and it marks it as Working once enough time has passed.
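
Sketched on the Tortoise status, this would look something like the following (the field name is assumed from the phases listed elsewhere in this README):

status:
  tortoisePhase: GatheringData   # assumed field name
  # once enough history is collected, the controller flips this to Working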

observability: expose the metrics to monitor tortoise controller

  • tortoise_count: (Gauge) – the current number of controlled tortoise objects
    • label update_mode with values: off, emergency, auto
    • label phase with values: initializing, gathering_data, working, emergency, back_to_normal
  • reconciliations_total: (Counter) -- the number of reconciliations split by result.
    • label error with values: none, internal?
    • EDIT: controller_runtime_reconcile_total created by kubebuilder already does this.
  • etc

single container HPA support

Currently, the code everywhere assumes that the Deployment has multiple containers, and thus that the HPA has external metrics or type: ContainerResource metrics.
But users may want to use tortoises with a single-container Deployment and an HPA with type: Resource metrics.
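
For comparison, the single-container case would use the plain Resource metric type from autoscaling/v2:

metrics:
  - type: Resource    # whole-Pod metric; no per-container breakdown needed
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 75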

Update HPA and VPA in the mutating webhook

Currently, the tortoise controller checks the HPA and VPA periodically and modifies them based on the calculated recommendations.
But people may apply changes to those HPA and VPA objects by themselves.

Here, let's create a mutating webhook against the HPA and VPA objects so that the recommendations are always applied to them.
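
A minimal sketch of the HPA side of such a webhook (the webhook name, service name, and path are hypothetical):

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: tortoise-hpa-webhook                      # hypothetical
webhooks:
  - name: hpa.tortoise.autoscaling.mercari.com    # hypothetical
    rules:
      - apiGroups: ["autoscaling"]
        apiVersions: ["v2"]
        operations: ["CREATE", "UPDATE"]
        resources: ["horizontalpodautoscalers"]
    clientConfig:
      service:
        name: tortoise-webhook-service            # hypothetical
        namespace: tortoise-system
        path: /mutate-autoscaling-v2-horizontalpodautoscaler
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Ignore   # don't block HPA updates if the webhook is down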

Emergency mode on Vertical

Currently, when emergency mode is enabled, we only increase the replica number; but we may also want to give an additional resource buffer to containers under the Vertical autoscaling policy.
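
For context, emergency mode is turned on through the updateMode field; today this only affects the horizontal side, and a vertical buffer is what this issue proposes:

spec:
  updateMode: Emergency   # currently only scales out replicas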

buffer replicas / resource request based on label

Spot Nodes are very cheap, but they have a high risk of being evicted at very short notice.

In order to use Spot VMs efficiently, we can always run a much higher replica count compared to running on on-demand Nodes. This solves two problems:

  • Pods are spread over more nodes, so there is less risk of 100% of replicas being evicted at the same time.
  • The eviction period is quite short (120 seconds at best), but new pods take a long time to become ready (sometimes more than 5 minutes), so existing replicas should be able to handle the traffic even when a lot of replicas are down. This means the CPU utilization per pod will be lower than on on-demand Nodes.

So the suggestion is to run 25% (or even 50%) more replicas, ideally when the Deployment is running completely on Spot Nodes.
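
A possible shape for this, sketched with a purely hypothetical label (nothing like it exists yet):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: sample
  labels:
    # hypothetical label: a tortoise could read this and inflate its
    # replica recommendations by 25% for workloads on Spot Nodes
    tortoise.autoscaling.mercari.com/replica-buffer-percent: "25"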

feature: scheduled scaling up

Sometimes we can predict an increase in resource consumption before it actually happens (like a TV feature, a push notification in the app, or load testing of an upstream service in dev).
This feature allows people to schedule scaling up before the increase actually happens.
They would configure it with "when to scale up" and "how long to scale up for" so that it can go back to normal afterward.

Maybe this feature will eventually replace resourcePolicy in Tortoise. (plus, Emergency mode as well?)
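
A hypothetical sketch of how this could be configured (none of these fields exist today):

spec:
  scheduledScaling:                       # hypothetical field
    - name: tv-feature
      startAt: "2023-12-01T09:00:00Z"     # "when to scale up"
      duration: 2h                        # "how long to scale up for"
      minReplicas: 30                     # guaranteed floor during the window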
