

Giant Swarm Product Roadmap

This repository represents Giant Swarm's product roadmap.

The Giant Swarm Roadmap project provides a Kanban-style overview of the things we are working on, grouped by the development stage they are in, including items we are still only thinking about.

Refer to this issues list to search roadmap stories.

FAQs

Q: Why did we build this?

A: In the spirit of the open source ecosystem and towards the goal of delighting our customers, we are completely transparent with our roadmap. We want to give our customers the opportunity to better understand how we work, how decisions are made, and what priorities we have.

Q: Why are there no dates on our roadmap?

A: Our primary focus is on security and operational stability. As such, we can't provide specific target dates for features.

Q: Is it possible to know which features will be included in which release?

A: Yes! We use Milestones to group features by releases so our customers can plan ahead.

An explicit version number, e.g. v13.0.0, means that the feature will be included in that Major release. A non-explicit version number, e.g. v13.x.x, means that the feature will be included in a Minor or Patch release for that Major version.

Closer to the release date, the Milestone's description will be updated with a link to the release PR, where you can find the full scope of the release, including release notes and a detailed changelog.

Q: What do the roadmap categories mean?

A: We use 6 columns so that we can be very clear about the stages of the development process:

  • Under consideration: This lists the features we are thinking about. We have not committed to them, since we are still checking with our customers, upstream and the market as a whole.

  • Future: We intend to work on it in the future, but there is no active development yet.

  • Near Term: At this stage we know what we want to build and when we plan to build it. These features are next up to be allocated to development teams. We are working on technical specifications, designs, PoCs.

  • In Progress: This is our construction site and you get a clear view of what it is that we are building at the moment. These items will be ready in 1-3 months.

  • Ready Soon: We are testing these features and preparing them for release. Expect the features in this column to be available within 4 weeks.

  • Released: Congratulations! The feature is now available to you. Let us know what you think.

Q: How to navigate the roadmap?

A: Teams in Giant Swarm are divided into 3 product areas:

  • KaaS (Kubernetes as a Service): Improving Giant Swarm's product offering for customers running their operations on Azure, AWS, and on-prem, as well as keeping up to date with new Kubernetes versions. Use the label area/kaas to filter features for this area, or the corresponding provider labels to view only provider-related features (e.g. provider/aws, provider/azure).

  • Managed Apps: Enabling our customers to get the most out of their cloud-native stack, by offering fully managed optional Apps. Use label area/managed-apps to filter features for this area.

  • Empowerment: Empowering other product teams by improving our internal platforms, particularly towards release engineering, observability, and operations. Use label area/empowerment to filter features for this area.

Q: Are all our plans and projects on the roadmap?

A: Typically, yes. Please note that things may move around: we may do things we didn't list and cancel things that we had planned. We track postmortems and security-related issues privately and don't include them in the roadmap, to ensure the safety of our customers.

Q: How can I provide feedback or ask for more information?

A: We encourage you to comment and provide feedback on the issues themselves.

Q: How can I request a feature be added to the roadmap?

A: Please open an issue! Use only the Feature Request template for submitting your ideas. Community-submitted issues will be tagged "Feature-request" and reviewed by the team.

Security disclosures

If you think you’ve found a potential security issue, please follow the instructions here.

License

This project is under the Apache 2.0 license. See the LICENSE file for details.

Learn more about Giant Swarm at https://www.giantswarm.io


Roadmap Issues

Centralized Metrics

User Story

- As a Giant Swarm engineer, I want to have a high-level overview of all installations so that I can use this data for decision making.

- As a Giant Swarm sales manager / marketing manager, I want to collect business metrics so that I can use this data for decision making.

Details, Background

With an increasing number of installations, getting an overview of metrics becomes more difficult. We need a method of accessing all our metrics from one place.

TODO

  • 1. Evaluate Cortex for cross-installation querying (see the config sketch below)
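If we go the Cortex route, the smallest possible integration on the Prometheus side is a remote_write block per installation. A minimal sketch, assuming a reachable Cortex endpoint; the URL, auth mechanism, and label naming are placeholders, not decisions:

# prometheus.yml fragment -- illustrative only
global:
  external_labels:
    installation: installation-1   # lets us tell installations apart when querying centrally
remote_write:
  - url: https://cortex.example.giantswarm.io/api/prom/push
    basic_auth:
      username: installation-1
      password_file: /etc/prometheus/cortex-password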

Versioning of Operators

User Story

As a GS Engineer, I want each operator codebase to contain only one (the latest) version, for all operators.

Background / Current state

  • Currently, (almost) every operator's codebase contains multiple versions; that is, we have multiple packages, each representing one version.
  • Each relevant package is wired up into a version bundle.
  • When an operator sees a Custom Resource, which has the version bundle embedded, it dispatches to the right controller version depending on that version bundle (see the sketch below).
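To make the flattened model concrete, the idea is that each CR carries the version of the operator release that should reconcile it, and a flattened operator watches only CRs matching its own version via a label selector, so several operator releases can run side by side. A sketch, assuming we keep using a version label on the CR (the label key and value here are illustrative):

# Illustrative only -- a CR labelled with the operator version that owns it.
# A flattened aws-operator at version 8.7.0 would watch with the label
# selector "aws-operator.giantswarm.io/version=8.7.0" and ignore all other CRs.
apiVersion: provider.giantswarm.io/v1alpha1
kind: AWSConfig
metadata:
  name: abc12
  labels:
    aws-operator.giantswarm.io/version: 8.7.0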

TODO

  • Observability of control plane capacity
  • release-operator ensures App CRs
  • Deploy azure-operator to a test environment
  • Having branches in test catalog
  • Running aws-operator from catalog in control planes
  • Deploy multiple aws-operators
  • Run multiple aws-operators
  • Handle changelogs with multiple aws-operators
  • Correctly reconcile multiple clusters with aws-operator
  • CI for deploying multiple aws-operators
  • Flatten aws-operator
  • Add support for 'opsctl deploy' to override existing App CRs during deployment
  • Remove unused fields from releases repository
  • Flatten cluster-operator
  • Add support for ensuring that multiple operators are not reconciling the same CR
  • Flatten cert-operator
  • Flatten chart-operator
  • Flatten Azure-operator
  • Flatten kvm-operator
  • Flatten app-operator
  • Use label selectors for watching aws-operator CRs
  • Tag repository on release version in project.go

Deliverables

  • Deployed software
  • New CI architecture
  • Process change
  • Internal documentation

Control Plane / Tenant Cluster

Control Plane

Provider

all

Blocked by / depends on

n/a

Customers driving/requiring this

GS product teams

Autoscaling Ingress Controllers | Add HPA to our tenant ingress controllers

Goals

  • Make our ingress controllers more dynamic with autoscaling worker nodes.
  • Consume fewer resources in tenant clusters where the IC is not used or handles little traffic.
  • Respond better to peaks in traffic to the Ingress Controller.
  • Have load test coverage for Ingress Controller HPA and cluster-autoscaler.

Current state:

  • When launching a new cluster we set the number of Ingress Controller replicas to the number of nodes in the workers array in cluster-operator.
  • This causes problems, as the workers array is no longer supported as a source of truth for worker counts once cluster autoscaling is in play.
  • A cluster with autoscaling will not change the number of ICs running as the number of nodes changes (see the HPA sketch below).
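A minimal HPA sketch, assuming the IC runs as a Deployment named nginx-ingress-controller in kube-system; names and thresholds are placeholders, and the real metric and values would come out of the load testing mentioned above:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress-controller
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-controller
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # placeholder threshold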

TODO

EDIT: Updated checklist with current state.

Spec Representation of Catalog Entries (Apps) within the Control Plane

Mon, Mar 23, '20 status in this story: #95
Wed, Apr 8, '20 Spec: Representation of apps in the CP
Mon, Apr 20, '20 update: When spec is done, team needs a call to spec out implementation.

User Story

This is an internal story to solve several internal technical needs and requirements of user-facing stories.

This is the basis. The user stories depending on this are:

  • Listing of Apps
  • Updating of Apps (update events/notifications)
  • Defaulting of App CR fields (user just says I want prometheus and we default to latest version)
  • Validation of App CRs (specifically, validate that app name and version exist in said catalog)

The latter two stories are concerned with having functionality baked into our Control Plane and its Kubernetes API so that we are not dependent on the to-be-deprecated GS API.

Future stories depending on this might be:

  • Dependency management between apps and other apps
  • Dependency management between apps and GS versions
  • Making required config fields explicit
  • Other metadata to apps (Screenshots, long descriptions, example configs)

Current State

Currently, an App Catalog is represented solely by its appcatalog CR in the CP. There's no information about its contents in the CP.

Happa talks directly to the storage of the Helm repository referenced in the appcatalog CR (i.e. in our case GitHub) to get the index.yaml, parse it, and use it for some defaulting and representation cases.

The functionality to list apps is not present in our API. Thus, the UX amounts to looking at the SSoT, i.e. GitHub, and following it to check which apps are present.

Requirements

What we need is a way to get the SSoT, i.e. a list of apps, their versions, as well as metadata, from a definite place inside the Control Plane.

Options

As we are extending the Kubernetes API of the CP, there are basically two ways that are native:

  1. CRD + Controller (i.e. Operator Pattern)
  2. Microservice that offers an aggregated API

While option 1 has been our go-to solution for most issues, we have often mitigated the impact of such an asynchronous, event-driven API on the user by adding a microservice or API logic of our own. We want to get rid of the latter pattern, as our API and the microservices that "translate" towards CRDs are something we want to deprecate mid-term.

For option 2 we do not have any production experience. It is used upstream for the metrics API, and Team Ludacris is considering using it for #13 (Health of TCs). There's a PoC that they built during the Hackathon last week.

Team Batman has extensively evaluated Option 1 (see also https://docs.google.com/document/d/1LCd4XeLh8fYKILjTfyvBILrWZc07Gn0zdrbpC-MZUOg/edit), so we will describe this and also a lower fidelity spec of Option 2 in the following.

Keep in mind that no matter which option we go with, the SSoT stays within the Helm repository, whether that repository is blob storage with an index.yaml like right now, or has a better API that can offer such information, as in ChartMuseum and Harbor. Currently, I do not see a way around having some local representation (more on this in the Harbor case below). Correct me if this assumption is wrong.

1) AppCatalogEntry CRD

The idea is to have a cluster-scoped AppCatalogEntry CR per app in a catalog. This CR is to be created by a new controller in app-operator based on the SSoT, which is the appCatalog storage, i.e. GitHub or another Helm repo and its index.yaml.

Sample CR (details can be discussed in apiextensions PR once we have a better plan here)

metadata:
  name: kong-incubator
  labels:
    application.giantswarm.io/catalog: giantswarm-incubator
spec:
  name: kong-app
  description: The Cloud-Native Ingress and Service Mesh for APIs and Microservices
  engine: gotpl
  home: https://KongHQ.com/
  icon: https://s3.amazonaws.com/downloads.kong/universe/assets/icon-kong-inc-large.png
  sources:
    - https://github.com/Kong/kong
  urls:
    - https://giantswarm.github.com/giantswarm-incubator-catalog/kong-app-0.2.0.tgz
  versions:
    - version: 0.2.1
      appVersion: 1.2.3
      created: "2019-09-28T13:18:25.002458784Z"
    - version: 0.2.0
      appVersion: 1.2.2
      created: "2019-08-28T13:18:22.004458784Z"

This CRD can then be used for all possible use cases mentioned above. It can be extended to include additional metadata with time, but the idea would be to start small.

2) appcatalog-service

A microservice serving an aggregated API that does one of the 2 following things:

  1. In case of blob storage for Helm Repository, it downloads the index.yaml, keeps a purged representation of that in its "cache" and answers requests for apps, versions, and metadata.
  2. In case of chart-museum/harbor it proxies (and maybe caches) requests to said chart registry.

The reason for a proxy in case 2 is that it abstracts away from the actual representation and storage in case those change. Another reason is that, for better UX, a single place to get information about app and version availability, and to install from, would make more sense. Having the user switch to a Chart Registry frontend to explore which apps they want and then come back to install and configure them in clusters sounds icky. Again, correct me if I'm wrong.

Make Managed EFK more production-ready

We offer the EFK Stack as we have some experience with it and it is common in the CNCF ecosystem. Loki is not an alternative for all customers, as some cannot adhere to the structured logging requirements for using it.

Managed App Testing | Testing strategy for Managed Apps charts

We currently use the same approach for testing charts that we use for operators. But we only need to test Helm Charts and the resources they create.

This ticket is to improve the dev UX for chart developers. Current problems are:

  • Requiring Go skills
  • Regular vendor updates for e2e library changes and upstream components like client-go.

We should consider community tooling like Sonobuoy as well as our own internal tooling.

Setup Prometheus server per tenant cluster

User Story

- As a Giant Swarm engineer, I want to have a scalable monitoring architecture so that Prometheus resource usage is not tied to the number of tenant clusters on an installation.

- As a Giant Swarm engineer, I want to have a stable monitoring architecture so that Prometheus uptime is 100%.

- As a Giant Swarm engineer, I want to be able to add as many metrics to Prometheus as I need so that I have good observability of apps and services.

Details, Background

Currently, we have one Prometheus, Alertmanager and Grafana instance per installation running in a control plane. So the resource usage of Prometheus is tied to the number of tenant clusters. As the number of tenant clusters increases, Prometheus resource usage grows. This results in memory issues and downtimes of Prometheus.

To ensure Prometheus stability, we need to have a separate Prometheus server per tenant cluster installed on the Control Plane.

(diagram: cp_prometheus)

TODO

Spikes

Targets

  • Docker daemon
  • K8s api server / controller manager / scheduler
  • Kube-proxy
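For the targets listed above, a rough sketch of what per-tenant-cluster scrape configs could look like. The API endpoint, certificate paths and the kube-proxy metrics port (10249) are placeholders, the Docker daemon job is omitted, and the real setup would need per-cluster credentials wired in:

scrape_configs:
  - job_name: abc12-apiserver
    scheme: https
    static_configs:
      - targets: ['api.abc12.k8s.example.gigantic.io:443']
    tls_config:
      ca_file: /etc/prometheus/certs/abc12/ca.pem
      cert_file: /etc/prometheus/certs/abc12/crt.pem
      key_file: /etc/prometheus/certs/abc12/key.pem
  - job_name: abc12-kube-proxy
    kubernetes_sd_configs:
      - role: node
        api_server: https://api.abc12.k8s.example.gigantic.io
        tls_config:
          ca_file: /etc/prometheus/certs/abc12/ca.pem
          cert_file: /etc/prometheus/certs/abc12/crt.pem
          key_file: /etc/prometheus/certs/abc12/key.pem
    relabel_configs:
      # point the scrape at the kube-proxy metrics port instead of the kubelet port
      - source_labels: [__address__]
        regex: '(.+):\d+'
        replacement: '${1}:10249'
        target_label: __address__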

Support peering on cloud providers

V and AA are peering VPCs with our clusters. Our automation isn't aware of this configuration. This can lead to trouble during upgrades, and we should support it out of the box (a CloudFormation sketch of the AWS case follows below).

From Puja in Slack:

  1. peering between 2 GS owned VPCs (i.e. two workload clusters)
  2. peering between workload cluster VPC and other VPC in same account (e.g. China MongoDB runs on VMs in own VPC in same account AFAIK)
  3. peering between workload cluster VPC and other VPC in different account
    Then there’s bi-directional vs 1-directional peering
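For reference, the AWS side of case 3 (peering into a VPC in a different account) boils down to a peering connection plus routes. A CloudFormation sketch with placeholder IDs, showing only the requester side; routes back from the peer VPC still have to be added on the accepter side:

Resources:
  PeeringToCustomerVPC:
    Type: AWS::EC2::VPCPeeringConnection
    Properties:
      VpcId: vpc-0aaaaaaaaaaaaaaaa        # workload cluster VPC (placeholder)
      PeerVpcId: vpc-0bbbbbbbbbbbbbbbb    # peer VPC (placeholder)
      PeerOwnerId: "111111111111"         # peer account; omit for same-account peering
  RouteToPeer:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: rtb-0ccccccccccccccc  # a workload cluster route table (placeholder)
      DestinationCidrBlock: 10.2.0.0/16   # peer VPC CIDR (placeholder)
      VpcPeeringConnectionId: !Ref PeeringToCustomerVPC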

Clean Up App Catalogs (Only Giant Swarm and Playground catalogs remaining)

The outcome of this is just to delete the incubator and move things to the right catalog. No quality bars.

User story:

  1. As a GS person giving an external demo, I want one catalog I could demo, so that I don’t waste time asking someone every time which of the many catalogs I can demo.
  2. As an end user developer, I want to easily see which apps I could actually use for my work.

Todos

Incubator status Mon, Mar 16, '20

  • kubernetes-elastic-stack-elastic-logging -- to be deleted in favor of efk-stack
  • redis-app — decided to delete
  • kong-app — @yasn77 this cycle, to GS catalog
  • prometheus-operator-app-chart -- to GS catalog, @glitchcrab talk to customer
  • Aqua, to GS catalog -- @glitchcrab
  • Follow up with @pipo02mix: k8s-initiator-app, to playground catalog. Decided to delete. When Fer has time he will add it to Playground.

Playground status Mon, Mar 16, '20

  • app-mesh (to playground)
  • efk-logging - delete (removed from playground catalogs as well)
  • efk-stack - move to Giant Swarm @webwurst will ping @glitchcrab for steps to do
  • eventrouter (to playground)
  • fluent-logshipping (to playground)
  • giantswarm-todo (to playground)
  • linkerd2 (to playground)
  • loki-stack (to playground)

Next story: https://github.com/giantswarm/giantswarm/issues/8444

Kubernetes Release v1.16

Kubernetes 1.16

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1-16.md

Provide a new release with the kubernetes version

Check migration recommendations

Check migration recommendations from kubernetes and decide what we need to document for the customer and what we should migrate automatically for them

  • Do we need to change our own resources within the tenant clusters?
  • What do we need to communicate to our customers?
  • Anything else that would be better automated for the upgrade?

Run e2e and conformance tests

  • Start a new cluster and run conformance tests against it
  • Start a new cluster and run e2e tests against it

Check Core Components

  • Check core components for updates and include them in cluster-operator PR giantswarm/giantswarm#788
  • All helm releases in the giantswarm namespace should be in status deployed
  • Core components are running without errors
    • Cert-exporter
    • Chart-operator
    • Cluster-autoscaler
    • CoreDNS
    • Kube-state-metrics
    • Net-exporter
    • Nginx ingress controller
    • Node-exporter
  • Cluster-operator is running without errors
  • Cluster-operator is able to create chart-configs in the tenant

Test migration (both cluster functionality itself and workloads)

  • Start a cluster with an older version and upgrade the cluster to the new release
  • Upgrade a cluster that is not critical but has some more real workloads (eg giantswarm website, dev cluster of customer)
  • All Providers

Write summary for the release and update docs

Aha! Link: https://giantswarm.aha.io/features/L-7

Kubernetes Release v1.17

Preparation steps

  • Read upstream changelogs for any important changes.
  • Establish component upgrade targets
    • Calico v3.11/v3.12
    • etcd v3.3.18/v3.4.3
    • Container Linux v2303.3.0
  • Ensure that retagger has correctly tagged all targeted component versions during the automatic nightly retagging in Quay and Aliyun
  • Create a PR in k8scloudconfig/ignition-operator with the new component versions and any other required changes
  • Check Prometheus rules for deprecated metrics and fix them if necessary
  • Establish common release notes and create WIP releases for each platform
  • Do we need to change our own resources within the tenant clusters?
  • What do we need to communicate to our customers?
  • Anything else that would be better automated for the upgrade?

Child issues

Managed Loki

We are testing Loki, and might offer it as a lightweight, more cloud-native alternative.

Control plane API through Kubernetes itself

Replace our existing API with Kubernetes itself.

We would give our customers access to the Kubernetes API of the control plane and define roles for them so that they can see and edit the Custom resources we provide to manage the clusters and also see the components that manage the clusters.

The first approach is to let the API talk to Kubernetes directly instead of using the services in between. A later goal would be to remove the API in front of Kubernetes.

This has been discussed here: https://github.com/giantswarm/giantswarm/issues/4135

A related UX story can be found here: https://github.com/giantswarm/giantswarm/issues/4370

An idea to involve some customers early would be to give them access to a read-only role in the control plane.
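A minimal sketch of that read-only idea, assuming standard RBAC on the control plane; the API groups listed are examples of our CRD groups and the group binding is a placeholder:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: customer-read-only
rules:
  - apiGroups: ["provider.giantswarm.io", "core.giantswarm.io", "application.giantswarm.io"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: customer-read-only
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: customer-read-only
subjects:
  - kind: Group
    name: customers          # placeholder group, e.g. mapped from the customer's IdP
    apiGroup: rbac.authorization.k8s.io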

All stories around more visibility into the system would be affected by this, in a positive way. Standard Kubernetes tooling (like kubectl) would already give our users much more insight into what is going on inside the control plane and its tenant clusters.

Authentication for customers will also be affected.

These roadmap items would then be obsolete:

Make workload cluster k8s CIS compliant

Enable direct write access to App Catalog through Kubernetes API of the Control Plane

The thought behind this is that, as the App Catalog is mainly green-field, we could enable write access to it earlier than for the rest of Giant Swarm.

  • Structural validation can be done using an OpenAPI schema (see the sketch below).
  • This would require some work on defaulting and validating the (user-facing) CRDs. We already have a PoC for that using OPA.
  • This might also include SSO for the Kubernetes API of Control Planes.
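As referenced in the first bullet, structural validation could look roughly like the snippet below, sketched against the App CRD; the field names and version pattern are illustrative and may not match the real CRD exactly:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: apps.application.giantswarm.io
spec:
  group: application.giantswarm.io
  names:
    kind: App
    plural: apps
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
  validation:
    openAPIV3Schema:
      type: object
      properties:
        spec:
          type: object
          required: ["catalog", "name", "version"]
          properties:
            catalog:
              type: string
            name:
              type: string
            version:
              type: string
              pattern: '^[0-9]+\.[0-9]+\.[0-9]+.*$'   # illustrative semver check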

Blocked by

  1. Defaulting and Validation is blocked by #26
  2. We have to work with the new WG to define the scope

Note: Moved from Managed Services Roadmap 2019 https://github.com/giantswarm/giantswarm/blob/1f0dbd05bd806df2670ccd2a7cc906c1a8990d4f/areas/managed-services/roadmap/2019-ROADMAP-MANAGED-SERVICES.md#stories-for-2019q4 This issue is intentionally left incomplete, Chiara needs to clarify the user story.

Status / health of workload clusters

Epic Story

As a customer, I want to be able to easily and efficiently check the health of my workload clusters so I know whether they require attention or not.

(This Epic is in ideation phase and the following stories below have been created to collect the various sources of information we require in order to build the MVP of this Epic)

Linked UserStories

  • Competitor Analysis giantswarm/giantswarm#6136
  • Customer Focus Sessions/ Feedback giantswarm/giantswarm#6137
  • Available Data giantswarm/giantswarm#6138
  • Mock ups giantswarm/giantswarm#6139
  • Architectural implications giantswarm/giantswarm#6312
  • In scope/ out of scope

User Personas

Linked Stories

Build a Multi Tenancy Helper

We see many customers struggling with implementing best-practices both when it comes to security but also on scheduling matters.

Current thoughts go towards improving documentation (and that is important), as offering defaults alone would only be straightforward if you controlled all namespaces and knew them in advance.

I was thinking though that we could offer an actual tool (be that a controller or just an external helper tool) that could make the lives of our customers (and hopefully also other k8s users) easier.

I had lots of thoughts about this and wrote down some stuff, which I will dump below in this issue.

I'd like to get your thoughts, especially from @giantswarm/sig-product @giantswarm/sig-customer @giantswarm/sig-ux @giantswarm/sig-support

Better release crafting & testing.

Epic Story

  • As a Swarmie struggling with the constant battle of releasing via a manual, laborious and time consuming process, I would like our release process to be as automated, pain free and delightful as possible so that I can spend my time on other worthwhile value generating activities.

Epic Acceptance Criteria

Iteration 1: Quick fixes whilst we are waiting for VOO (see User Stories 1-6)
Iteration 2: The full end-to-end release process is automated (see User Stories 7-8)

Linked User Stories in priority order:

Iteration 1.

  • 1. Improve the reliability of E2E tests giantswarm/giantswarm#5965
  • 2. Automate slack channels for release notes giantswarm/giantswarm#5966
  • 3. Step by step guide to release notes giantswarm/giantswarm#5967
  • 4. Automate slack channel for K8s releases (not needed?)
  • 5. Re-tagger should automatically retag images for k8s releases giantswarm/giantswarm#5968
  • 6. Move release info into separate repository giantswarm/giantswarm#5969
  • 10. Automate China E2E tests giantswarm/giantswarm#6052

Iteration 2. stories which are dependent on the completion of VOO giantswarm/giantswarm#5021

  • 7. Add support to devctl to automatically list PRs since the last release giantswarm/giantswarm#5970
  • 8. Do a full installation E2E tests from installations giantswarm/giantswarm#5971
  • 9. Add official conformance tests to our E2E tests giantswarm/giantswarm#5972

Dependencies

  • VOO giantswarm/giantswarm#5021 !!

Other Tasks (non userstories)

  • Add k8s next release to calendar and code freeze (Jessica)
  • Start preparing new release before new k8s is out by checking upcoming e.g. 1.14.2 (collect updates from components to include etc)
  • Write more e2e tests

Aha! Link: https://giantswarm.aha.io/features/L-2

Expose cluster creator via our API / front end

Currently we ask everyone to include their username in the cluster name when creating test clusters. This is important so we can contact the owner if they are generating errors or unnecessary costs.

However it's easy to forget and we have lots of new starters who don't know this.

Idea from @marians is to add a created_by field to the API object. It can then be exposed in Happa and gsctl.

Idea from @corest is to use this for the Cluster Police bot, to notify the owner directly so we don't spam the #ops channel with these notifications.
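Purely as an illustration of the idea (the label key and where it is set are made up here), the creator could end up as a label on the cluster resource that Happa, gsctl and the Cluster Police bot can all read:

metadata:
  name: abc12
  labels:
    giantswarm.io/created-by: marian   # hypothetical label, set by the API at creation time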

Threads:

#sig-product https://gigantic.slack.com/archives/C02521EC0/p1573809116294100
#product-kaas https://gigantic.slack.com/archives/CNX3Z835G/p1573809431036900

Managed OPA (for the customer)

We can use Gatekeeper for validation and defaulting, and have a separate OPA for authorization (at least the Gatekeeper devs told us that it will be separate).

Revamp Ingress Controller setup (Optional NGINX Ingress Controller)

User Story

As a customer, I would like to have the option to not have Nginx Ingress Controller installed in all my clusters by default, so we don't waste resources in cases where we don't need to expose any endpoints.

As a customer, I would like to have the option to not have the Nginx Ingress Controller, so I can run a different Ingress Controller that we need; for example, I need an API gateway, or I use a service mesh that contains an ingress controller.

As a customer, I would like to have the option to install the Nginx Ingress Controller only internally, so my services can use it to communicate internally without exposing endpoints to the Internet.

TODO

  • Move LB/DNS logic out of the aws operator
    • Remove DNS/ELB set up logic from aws operator
    • Modify IC to provide LB annotation to provision the ELB
    • Add cert-manager as default managed application
      • monitor for not satisfied
    • Add kiam as a default managed application
      • Precreate kiam role / reuse master role
      • monitor for not satisfied
    • Add external DNS with the proper configuration
      • Precreate external-dns role with proper policy in operator
      • Add cleanup resource for hosted zones which will ensure all the records except SOA and NS are cleaned up from hosted zone so that CF can delete it
      • Set proper IAM policies for the service for Fer's customer
      • Add external dns as managed service
      • monitor for not satisfied
  • Make it possible to select the ingress class for the NGINX Ingress Controller
  • Make it possible to select between an internal or external ingress controller (i.e. add an annotation option for the Service of type LoadBalancer to the IC user config); see the sketch below
  • Migrate the NGINX IC to an App CR and make it the default everywhere except Cluster API
  • Update docs on the hello-world app with examples of installing the IC from the catalog
  • Stretch Goal: the Ingress Controller should be completely optional (this depends on us being able to add apps to clusters at creation time)
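For the internal/external switch referenced above, the relevant knob on AWS is the in-tree internal load balancer annotation on the IC Service. A sketch of how that could surface as user values for the NGINX IC app; the values structure is illustrative, only the annotation itself is the standard upstream one:

# Illustrative user values for the NGINX IC App CR
controller:
  service:
    annotations:
      # older Kubernetes releases expect "0.0.0.0/0" instead of "true"
      service.beta.kubernetes.io/aws-load-balancer-internal: "true"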

Details, Background

Right now our ingress controller (IC) is installed by default in all clusters. We have customers that run a different ingress controller, for whom the default one just wastes resources. Also, some customers would like to not expose ingress endpoints by default. Currently, in AWS and Azure, we provision an LB pointing to the workers for the default ingress controller (exposed as a nodePort).

On the other hand, we have moved the ingress controller to be run by the chart-operator. The configuration of the IC chart is managed in cluster-operator; this is moving to the App Catalog operators. Ideally, the ingress controller will be part of the catalog, and users will choose which IC is deployed and in which flavor (external/internal). At the same time, this will allow us to offer more than one IC solution (out of scope right now).

Customers driving/requiring this

I think all customers have one interest or another in this development.

UPDATE 17.10.19

  • For full optionality of IC with a similar simple setup like now (i.e. not requiring customer to set up custom things in AWS) we would additionally need ‘kiam’,‘external-dns’, and ‘cert-manager’
  • ‘external-dns’ is already a default app on Azure
  • ‘kiam’ would increase the overall security of our setup, especially making customer workloads that need access to AWS resources more secure
  • ‘cert-manager’ is an app that most customers are anyway running and would welcome us taking over its management
  • Having those apps in AWS clusters would mean we could get rid of a lot of hard coding in aws-operator
  • It would also mean that the load balancer would be configurable, e.g. to set it to internal
  • It would mean we do not need to take care of it when we add node pools and stuff
  • It would mean that even the default DNS entry for ingress is configurable
  • based on the above we decided to make those 3 apps new default apps for AWS clusters
  • the actual Ingress Controller then is fully replaceable, so a customer could use our default components to run Traefik, and at some point we would also be able to run other ICs, e.g. for service meshes or if there are clusters that will only have Kong

Azure: HA Control Plane

Epic Story:

As a Customer, I want GS to provide highly available master nodes on my workload clusters, so that if the working master goes down, for example, the cluster automatically fails over to the next available master.

OR

As a cluster admin, I want to configure a cluster to use multiple master nodes, so that the risk of a Kubernetes API downtime is minimal

Rationale

Even if workloads are still functioning, a temporary downtime of the Kubernetes API seriously impacts our customers in that they can't update any deployments, scale replicas etc.

Background

Reliability will be increased by making multiple masters reside in different availability zones (AZ), as this reduces the likelihood of all masters being unavailable at the same time. Cluster upgrades would benefit from this, as with a single master an upgrade automatically implies a downtime. This might be implemented as optional, as not all use cases may require the high availability and justify the increased cost (e. g. dev, testing).

Enable disabling 24/7 support for workload clusters

User story

- As a Giant Swarm engineer, I want to be able to disable alerting for a CAPI cluster so that I won’t get paged for a testing cluster but still will be able to check monitoring data.

Description

As a user I'd like to disable the 24/7 support on test clusters. If a customer wants to play with a cluster and try out a service mesh or other things that are currently not officially supported they can turn off our 24/7 support and just play with the cluster.

The same would work for the SEs. They can disable 24/7 on clusters that are in a known broken state.

An additional benefit would be that we could start clusters with support disabled as the default. This way it would be a conscious decision of the user to turn on 24/7 support.

The last point would also allow us to give more partners access to control planes to play with our product. The clusters would all be without 24/7 support and nobody would get paged during the night.
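One purely hypothetical way to model the switch (label key and routing are invented here for illustration): a flag on the cluster resource that is propagated into alert labels, which Alertmanager then routes to a no-op receiver so nobody gets paged while monitoring data stays available.

# Hypothetical flag on the cluster resource:
metadata:
  name: abc12
  labels:
    giantswarm.io/supported-24x7: "false"

# Hypothetical Alertmanager routing, assuming the flag is injected into
# alerts as the label "supported_24x7":
route:
  receiver: opsgenie
  routes:
    - match:
        supported_24x7: "false"
      receiver: blackhole
receivers:
  - name: opsgenie
  - name: blackhole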

Setup management cluster Prometheus

User Story

- As a Giant Swarm engineer, I want to have observability of Control Plane components (operators, etcd, etc.) so that I can avoid outages and other issues.

Details

Setup Control Plane Prometheus to monitor Control Plane components (operators, etcd etc.) and workload cluster Prometheus servers.

(diagram: cp_prometheus)

TODO

Notes / Questions

  • We want to allow customers to use our Grafana on the control plane. We might also want to think about routing alerts to them. An example use case is IP capacity: when there are only 5 IPs left for a cluster we could have an alert, but that would only trigger us to tell the customer; instead, we could alert customers directly.

Real-time event stream

User story

  • If I as a cluster-admin create/upgrade/scale/delete a cluster I want to get feedback about the steps of the initiated process.

  • If I as a cluster-admin create/upgrade/scale/delete a cluster I want to see errors within the initiated process immediately.

Rationale

Get real-time information on status changes, so that we can display e.g. readiness after cluster creation.

Background

We don't provide any feedback during cluster creation at the moment, and the same will happen during an upgrade
of a cluster.

Since cluster creation, upgrades, and scaling, to name a few, are complex processes that can take some time,
it would be beneficial to be able to display detailed progress information in our UIs. This would also help
users to diagnose errors when something does not work as expected.

An event stream as part of our API would also enable custom integrations. For example, customers could trigger
the creation of an external uptime check whenever a new ingress is created in a guest cluster. Note that this
aspect is especially targeted in the EVENT-HOOKS story.

As an obvious technology choice, a WebSocket API should be offered for clients to consume the stream.

With "real time" we mean a delay of a few seconds between an event occurring and being presented via the API.

In cluster creation, these are a few examples for the types of events we would like to get from the API:

  • Cluster creation event types we can think of as being reported via the API:

    • Key-pair backend available (for creating key pairs for the cluster)
    • Kubernetes API server is up
    • Individual workers are "known" with their identifier or IP address
    • Individual worker nodes are in Ready state
    • Ingress load balancer is working
    • Calico network is set up and working
    • Internal DNS is up and working
    • Prometheus scrape targets for the cluster have been created
  • Upgrades

    • Individual node is cordoned, drained, shut down
    • Individual node is in Ready state (in new version)
  • Scaling up

    • Individual workers are "known" with their identifier or IP address
    • Individual worker nodes are in Ready state
  • Scaling down

    • Individual worker is cordoned, drained, shut down
    • Individual worker is removed from the cluster
  • Deletion

    • Individual node is removed
    • Individual worker is removed from the cluster (TBD)
    • Cluster object is deleted

We are aware that in deletion the events don't provide actionable information; still, having
them will improve the experience and raise trust in the system.

MVP and further iteration

As a minimal version, for cluster creation we expect these event types:

  • Kubernetes API server is up
  • Individual worker nodes are in Ready state
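To make this minimal version tangible, a hypothetical shape for messages on the WebSocket stream; every field name here is invented for illustration and is not a proposal of the final schema:

- type: kubernetes-api-up
  cluster_id: abc12
  timestamp: "2020-04-01T12:00:00Z"
- type: worker-node-ready
  cluster_id: abc12
  node: ip-10-1-2-3.eu-central-1.compute.internal
  timestamp: "2020-04-01T12:07:31Z"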

Open questions

  • Do we want to watch custom object status changes, or do we rather prefer a messaging API to be used
    from within operators?

Possible side effects

  • Load on the kubernetes API server when watching custom object status changes.

Dependencies/prerequisites

  • Differentiation between actual state and desired state in custom objects would allow for
    watching object changes, which would be the basis of the implementation. That should be
    covered in the STATE story.
  • We should have finalizers in place, to be able to observe a cluster deletion.
  • A unified worker identifier working over all providers would be beneficial.
  • Guest cluster upgrades must be working in order to have upgrade events

Related stories

  • Transparency Regarding Cluster State #75
  • History of a Cluster #57
  • Webhooks for Events #74

Multi-AZ tenants in Azure

User Story

As an Azure user I want to be able to spread clusters over multiple availability zones to decrease the possibility of service interruption in case of datacenter failure.

Details, Background

Azure Availability Zones are separate data center units within Microsoft Azure, each with its own power, cooling and networking. By running services on multiple availability zones, you can make your applications resilient to failure or disruption in your primary data center.

Provider

  • Azure

Azure: Node Pools

Epic User Story

As a user creating multiple clusters, I want to be able to define the individual character of each of these clusters, so that I can serve multiple use cases in one cluster.

Acceptance Criteria

I can have different node pools in a cluster which:

  • are spread across different availability zones
  • have different levels of storage
  • have different availability zones, which I can manage through the UI (Happa), gsctl, and the API.

Implementation Plan

  • Split master and workers management into two different resources (effectively: split instance resource)
  • Sort out the CRDs (both Cluster API and ours) that are required for MachinePool (and provider-specific parts of it); see the sketch after this list. #161
  • Create controllers required for MachinePool (provider-independent stuff to cluster-operator and rest provider-specific stuff to azure-operator). #157
  • Copy workers instance resource to the machine pool controller.
  • Copy ipam resource from aws-operator and adapt it to azure-operator.
  • Refactor the machine pool controller's workers resource to fully manage node pool
    • Allocate and create subnet for workers.
    • Configure security groups if needed.
    • Configure routing if needed.
    • Handle the specific VMSS and its status.
  • Create migration resource in AzureConfig controller
    • Create MachinePool (and required related) CRs to mirror existing worker VMSS.
    • When the created machine pool CRs have a Status showing that all workers are up, running and Ready, drain the old workers.
    • When old workers are drained, delete the old worker VMSS.
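As referenced in the CRD item above, the upstream Cluster API shape we would build on looks roughly like the sketch below. The API group and field names are from the v1alpha3 MachinePool experiment and the bootstrap reference is a placeholder; the exact wiring into our CRs is what #161/#157 are about.

apiVersion: exp.cluster.x-k8s.io/v1alpha3
kind: MachinePool
metadata:
  name: abc12-np001
spec:
  clusterName: abc12
  replicas: 3
  template:
    spec:
      clusterName: abc12
      version: v1.16.8
      bootstrap:
        configRef:                 # placeholder; our bootstrap mechanism may differ
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: KubeadmConfig
          name: abc12-np001
      infrastructureRef:
        apiVersion: exp.infrastructure.cluster.x-k8s.io/v1alpha3
        kind: AzureMachinePool
        name: abc12-np001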

Links

  • Dimensions of Complexity analysis: report

  • Please view this slide-deck for an overview of the 'AS-IS - TO-BE' design: design

Managed Service: Image scanning

Here I'm just dropping thoughts on creating an operator to manage CoreOS Clair.

The main features I see with this operator would be:

  • Clair server reconciliation
  • Watch Pods creation events and send analyze request for every new images
  • Manage Notifications subscription

I think this can be useful for Managed Services; Julien's customer already asked if we can help them with Clair.

Upgrade to Helm 3

User Story: As a GS team member, I want to be able to use Helm 3, to benefit from the improved security model and to no longer have to manage Tiller, which takes time and adds complexity.

References

Epics

Upgrade tenant clusters to Helm 3 https://github.com/giantswarm/giantswarm/issues/11748 (blocked until after CP upgrade)

Releases

Update issue: [epic] Managed Kong in Giant Swarm Catalog (Preview Quality)

Outcome: Get Kong to the specific managed service it is supposed to be. It’s hanging at 70% there. This means we have decided what it means (what we do and don't do) for Kong to be a Managed App. "Done" means having the answers in written form somewhere (whether in app catalog, a google doc, etc.) and sent to customer.

By when?

By March 31, 2020, we will have provided our Kong to the customer with documentation on (1) how they can use it, (2) how they can add their plugins, and (3) how they can go live with it. Not fully SLA production-ready, but usable, with possible bugs.

People involved

Related Docs:

Obsolete

Engineering Todos

Product Todos

Waiting for feedback on these docs PRs

  1. They will use our Kong. This is clear now. In DBLess mode.

After first draft of Managed Kong definition

  • Document - after deciding where (e.g. in app catalog, in a doc, etc.)
  • Communicate to customer

Managed App Catalog - Phase II

User Story

This phase is about getting more apps into the catalog, the automation around that, and the release process.

  1. Move existing Managed Services (CoreDNS, Ingress Controller etc) to apps using the new release process
  2. Versions of required Managed Services are linked to Giant Swarm releases
  3. cluster-operator creates/updates app CRs, the rest is moved to app-operator
  4. UX lets you manage and update app CRs

TODO (Cycle 2)

UX @oponder

Tracked here: https://github.com/giantswarm/giantswarm/issues/6659

Migrate core components to App Catalog

cluster-operator

e2e

  • Create basicapp e2e test to replace managedservices @tomahawk28

Migrate Charts

Spec

Related Issues

Aha! Link: https://giantswarm.aha.io/features/M-14

Enable Central Services

This would most probably involve some kind of multi-peered "ops-cluster", which holds central services, instead of hosting the central services inside our Control Plane.

Service Broker API is moving to CRDs, so we will wait for that

Just closed. Managed Prometheus Operator (Monitoring) for customers

Towards #175


We have a Prometheus Operator in the Giant Swarm Catalog that customers already use.

In a Prometheus-focused Cycle, we will improve how we support it in production. This includes:

Prior to the cycle, we will also talk to customers and gather feedback for improvement.

Managed Service: CI

Can we provide our customers with a managed service that adds CI across all their clusters? What would this look like?

Enable App Updates

Right now, you can update by choosing the version in the version picker. However, things will break if things are not configured correctly. There is no migration path for configuration values.

As an app user, I want to be able to update apps.

Related Stories

  • UX for Updates (version picker)
  • Notification about Updates (As an end user developer, I want to get notified where there is a new version of an app I have installed.)
  • Automated updates through "update-operator" (As an end user developer, I want to be able to specify how up to date I want my app to be within certain major/minor boundaries and have the app be automatically updated according to those specifications.) - but don't encourage until configurations are safe

Upgrades (more graceful, more flexible, less aggressive, more robust)

User Story:

  • As a customer, I would like Giant Swarm to issue less aggressive upgrades so that they do not interrupt my SDLC.
  • As a customer, I would like Giant Swarm to offer more upgrade options to adapt to the idiosyncrasies of my applications.

Testing Acceptance:

  • GS upgrades are scheduled around:
    • my cluster's schedule, such as a pause
    • the batch size or termination period can be configured (allowing a single-swap upgrade)
    • my SDLC environments, such as dev first, stage and production later
    • a code freeze (out of scope)
  • GS upgrades are almost invisible

We collected lots of feedback with the first version of our upgrades. We reached a state we can work with on all providers. Now we need to think about improvements to finally reach a state that allows us to activate them automatically.

The goal of this story is not to activate upgrades automatically. We would need to schedule them and we need to allow customers to influence this based on individual clusters (pause), based on environment (dev first, stage and production later), based on a freeze (pause everything). This is a separate story.

But this story here should bring upgrades into a state where they become almost invisible for our customers.

Upgrades should be less aggressive:

General

  • Lockdown a cluster for upgrades #186
  • Using the node-operator for draining on all providers (implement this independently of the actual provider)
  • Determine node readiness correctly
  • Mark old nodes unschedulable so pods do not get moved twice (or taint them, e.g. kubectl taint nodes node_name key=value:NoSchedule)

AWS

  • Ensure TCPN stack(s) are not updated until the master TCPN stack is in UPDATE_COMPLETE (this is recommended practice in the community and avoids problems when the master stack fails and leaves the cluster in an inconsistent state)
  • Be able to select batch size (slow - 10%, regular - 33%, fast - 50%, single swap - 100% nodes) #216
  • Time between node rolling #217
  • Consider the possibility of disabling the DisableRollback option on the CloudFormation stack to avoid a larger impact in big clusters
  • Be able to pause an upgrade (it could be achieved by letting the customer upgrade each node pool separately)
  • Don't trigger a new batch upgrade until the last node batch runs fine
  • Rollback option for GS version patches #221

Azure

  • Starting new nodes first

Related issues:

App Catalog Partner Integration

User Story

As a Giant Swarm partner company, I want to enable my users/customers to install my software easily into their clusters.

As a Giant Swarm partner company, I want to be featured and have a special position (compared to community apps) in the App Catalog section.

TODO

  • Find a way to do easy release management for a separate partner catalog
    • evaluate options
    • talk with partners if options will work out
    • build tooling
  • Expand on UX for partners
  • Documentation
    • Document/explain what the partner catalog is for and what support level can be expected as a customer
    • Document how to add new partners

Details, Background

We started our first partnership with Instana, which is installed through their chart in the Helm upstream community catalog. However, the community catalog is full of apps and it is hard to find our partners among them.

The current list of planned partner integrations is tracked in https://github.com/orgs/giantswarm/projects/74 and can be found by the label growth/partners.

There would be the option of prioritized display of certain charts in the community catalog, but they would still be in the community catalog, which will say something along the lines of "apps in here are not supported by Giant Swarm, all support is purely open source".

Another option would be having partners provide their own catalogs, but that would mean we have tons of single-app catalogs, which is bad UX too.

A third option, currently favored, would be a general partner catalog with release tooling that differs somewhat from our own. The basic thought is a tool similar to https://github.com/giantswarm/retagger, i.e. a tool that we can point at a chart repository and that syncs the charts to our partner catalog. This option was suggested by @rossf7.

Aha! Link: https://giantswarm.aha.io/features/B-9

Spark

We have a customer that might need this soon.

Chiara Example

User Story

As a (customer|user|admin|...) I want to ... so that ...

TODO

  • item 1
    • item 1.1
  • item 2

Please complete the TODO list as far as possible

Details, Background

Please give all possible details, requirements, etc. Technical or not.

Deliverables

  • Deployed software
  • Monitoring/alerting
  • API changes
  • Public documentation
  • Internal documentation
  • Frontend changes
  • Other (please describe)

Host/guest cluster

  • Host cluster
  • Guest cluster

Please remove non-applicable items

Provider

  • AWS
  • Azure
  • KVM / bare metal

Please remove non-applicable items

Blocked by / depends on

Please link issues/epics here which have to be solved before

Customers driving/requiring this

Please add customer names, if applicable.

Timeline

Please add info regarding deadlines/timelines, if available.

Aha! Link: https://giantswarm.aha.io/features/B-3

Kafka

Tue, Dec 10, '19

Related

General steps of adding an app

  1. Get a rough idea of demand. Who has requested this? @giantswarm/sig-sales @giantswarm/sig-customer
  2. Have a standard idea of demand before we say we want to do the work of estimating the cost of this app. Smoke test, something like: do we reasonably believe at least 5 customers would be willing to pay something like 60k a year for it?
  3. If this app passes 2, estimate: what is the estimated effort to build and then maintain it? (Halo, but let's not do this before deciding on priority, or could give a rough estimate? How much effort does it take to estimate?) https://github.com/giantswarm/giantswarm/issues/8036

Oliver: I think the next step is really having a list and an estimated effort and a price idea that is validated with customers.

Aha! Link: https://giantswarm.aha.io/features/B-18

Manage Invasive Config and Addons by Customers better

As we have seen time and time again, there are configuration options and addons that customers can add that are so invasive and disruptive that they impact our components, our SLA, and in some cases the whole cluster.

Some of my thoughts on that, as I do NOT want to limit the freedom we give customers:

  • moar education and more control through involving SEs
    • We need to make clear what an "invasive" configuration and/or component is
      • We do that already for block config in CoreDNS
      • Anything that pertains to security and auth (incl admission of ALL types)
    • Customers need to know that they need to contact us before they do the changes
    • We need to have a good testing plan for such invasive actions
      • Just running it in a cluster is not enough
      • There needs to be forced rescheduling of pods
      • There needs to be an upgrade rolling both workers and master

Azure: Spot VMs

User Story

As an Azure user, I want to configure NodePools to use Spot VMs so that I can save costs by using the unused machines.

TODO

  • investigate technical complexity
  • prepare documentation and instructions for customers

Details, Background

This is a feature that can help save costs for our customers. This type of VM is not appropriate for all types of workloads, but there are use cases where it is viable.
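Part of the technical investigation above: upstream Cluster API Provider Azure exposes spot instances through something like the spotVMOptions block below on the machine pool template; the exact field names and how we would surface them in node pools still need verifying.

apiVersion: exp.infrastructure.cluster.x-k8s.io/v1alpha3
kind: AzureMachinePool
metadata:
  name: abc12-np002
spec:
  location: westeurope
  template:
    vmSize: Standard_D4s_v3
    spotVMOptions:
      maxPrice: "-1"    # -1 = cap at the on-demand price; field names need verifying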

Provider

  • Azure
