

Giant Swarm Product Roadmap

This repository represents Giant Swarm's product roadmap.

The Giant Swarm Roadmap project provides a Kanban-style overview of the things we are working on, grouped by the development stage they are in, including items we are still only thinking about.

Refer to this issues list to search roadmap stories.

FAQs

Q: Why did we build this?

A: In the spirit of the open source ecosystem and towards the goal of delighting our customers, we are completely transparent with our roadmap. We want to give our customers the opportunity to better understand how we work, how decisions are made, and what priorities we have.

Q: Why are there no dates on our roadmap?

A: Our primary focus is on security and operational stability. As such, we can't provide specific target dates for features.

Q: Is it possible to know which features will be included in which release?

A: Yes! We use Milestones to group features by releases so our customers can plan ahead.

An explicit version number, e.g. v13.0.0, means that the feature will be included in that Major release. A non-explicit version number, e.g. v13.x.x, means that the feature will be included in a Minor or Patch release for that Major version.

Closer to the release date, the Milestone's description will be updated with a link to the release PR, where you can find the full scope of the release, including release notes and a detailed changelog.

Q: What do the roadmap categories mean?

A: We use 6 columns so that we can be very clear about the stages of the development process:

  • Under consideration: This lists the features we are thinking about. We have not committed to them, since we are still checking with our customers, upstream and the market as a whole.

  • Future: We intend to work on it in the future, but there is no active development yet.

  • Near Term: At this stage we know what we want to build and when we plan to build it. These features are next up to be allocated to development teams. We are working on technical specifications, designs, PoCs.

  • In Progress: This is our construction site and you get a clear view of what it is that we are building at the moment. These items will be ready in 1-3 months.

  • Ready Soon: We are testing these features and preparing them for release. Expect the features in this column to be available within 4 weeks.

  • Released: Congratulations! The feature is now available to you. Let us know what you think.

Q: How to navigate the roadmap?

A: Teams in Giant Swarm are divided into 3 product areas:

  • KaaS (Kubernetes as a Service): Improving Giant Swarm's product offering for customers running their operations on Azure, AWS, and on-prem, as well as keeping up to date with new Kubernetes versions. Use the label area/kaas to filter features for this area, or the corresponding provider labels to view only provider-related features (e.g. provider/aws, provider/azure).

  • Managed Apps: Enabling our customers to get the most out of their cloud-native stack, by offering fully managed optional Apps. Use label area/managed-apps to filter features for this area.

  • Empowerment: Empowering other product teams by improving our internal platforms, particularly towards release engineering, observability, and operations. Use label area/empowerment to filter features for this area.

Q: Are all our plans and projects on the roadmap?

A: Typically, yes. Please note that things may move around: we may do things we didn't list and cancel things that we had planned. We track postmortems and security-related issues privately and don't include them in the roadmap, to ensure the safety of our customers.

Q: How can I provide feedback or ask for more information?

A: We encourage you to comment and provide feedback on the issues themselves.

Q: How can I request a feature be added to the roadmap?

A: Please open an issue! Use only the Feature Request template for submitting your ideas. Community-submitted issues will be tagged "Feature-request" and reviewed by the team.

Security disclosures

If you think you’ve found a potential security issue, please follow the instructions here.

License

This project is under the Apache 2.0 license. See the LICENSE file for details.

Learn more about Giant Swarm at https://www.giantswarm.io


Roadmap Issues

Centralized Metrics

User Story

- As a Giant Swarm engineer, I want to have a high-level overview of all installations so that I can use this data for decision making.

- As a Giant Swarm sales manager / marketing manager, I want to collect business metrics so that I can use this data for decision making.

Details, Background

With an increasing number of installations, getting an overview of metrics becomes more difficult. We need a method of accessing all our metrics from one place.

TODO

  • 1. Evaluate Cortex for cross-installation querying (see the config sketch below)
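If we go the Cortex route, the smallest possible integration on the Prometheus side is a remote_write block per installation. A minimal sketch, assuming a reachable Cortex endpoint; the URL, auth mechanism, and label naming are placeholders, not decisions:

# prometheus.yml fragment -- illustrative only
global:
  external_labels:
    installation: installation-1   # lets us tell installations apart when querying centrally
remote_write:
  - url: https://cortex.example.giantswarm.io/api/prom/push
    basic_auth:
      username: installation-1
      password_file: /etc/prometheus/cortex-password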

Versioning of Operators

User Story

As a GS Engineer, I want each operator codebase to contain only one (the latest) version, for all operators.

Background / Current state

  • Currently, (almost) every operator's codebase contains multiple versions; that is, we have multiple packages, each representing one version.
  • Each relevant package is wired up into a version bundle.
  • When an operator sees a Custom Resource, which has the version bundle embedded, it dispatches to the right controller version depending on that version bundle (see the sketch below).
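To make the flattened model concrete, the idea is that each CR carries the version of the operator release that should reconcile it, and a flattened operator watches only CRs matching its own version via a label selector, so several operator releases can run side by side. A sketch, assuming we keep using a version label on the CR (the label key and value here are illustrative):

# Illustrative only -- a CR labelled with the operator version that owns it.
# A flattened aws-operator at version 8.7.0 would watch with the label
# selector "aws-operator.giantswarm.io/version=8.7.0" and ignore all other CRs.
apiVersion: provider.giantswarm.io/v1alpha1
kind: AWSConfig
metadata:
  name: abc12
  labels:
    aws-operator.giantswarm.io/version: 8.7.0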

TODO

  • Observability of control plane capacity
  • release-operator ensures App CRs
  • Deploy azure-operator to a test environment
  • Having branches in test catalog
  • Running aws-operator from catalog in control planes
  • Deploy multiple aws-operators
  • Run multiple aws-operators
  • Handle changelogs with multiple aws-operators
  • Correctly reconcile multiple clusters with aws-operator
  • CI for deploying multiple aws-operators
  • Flatten aws-operator
  • Add support for 'opsctl deploy' to override existing App CRs during deployment
  • Remove unused fields from releases repository
  • Flatten cluster-operator
  • Add support for ensuring that multiple operators are not reconciling the same CR
  • Flatten cert-operator
  • Flatten chart-operator
  • Flatten Azure-operator
  • Flatten kvm-operator
  • Flatten app-operator
  • Use label selectors for watching aws-operator CRs
  • Tag repository on release version in project.go

Deliverables

  • Deployed software
  • New CI architecture
  • Process change
  • Internal documentation

Control Plane / Tenant Cluster

Control Plane

Provider

all

Blocked by / depends on

n/a

Customers driving/requiring this

GS product teams

Autoscaling Ingress Controllers | Add HPA to our tenant ingress controllers

Goals

  • Make our ingress controllers more dynamic with autoscaling worker nodes.
  • Consume fewer resources in tenant clusters where the IC is not used or handles little traffic.
  • Respond better to peaks in traffic to the Ingress Controller.
  • Have load test coverage for Ingress Controller HPA and cluster-autoscaler.

Current state:

  • When launching a new cluster we set the number of Ingress Controller replicas to the number of nodes in the workers array in cluster-operator.
  • This causes problems, as the workers array is no longer supported as a source of truth for worker counts once cluster autoscaling is in play.
  • A cluster with autoscaling will not change the number of ICs running as the number of nodes changes (see the HPA sketch below).
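A minimal HPA sketch, assuming the IC runs as a Deployment named nginx-ingress-controller in kube-system; names and thresholds are placeholders, and the real metric and values would come out of the load testing mentioned above:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: nginx-ingress-controller
  namespace: kube-system
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nginx-ingress-controller
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # placeholder threshold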

TODO

EDIT: Updated checklist with current state.

Spec Representation of Catalog Entries (Apps) within the Control Plane

Mon, Mar 23, '20 status in this story: #95
Wed, Apr 8, '20 Spec: Representation of apps in the CP
Mon, Apr 20, '20 update: When spec is done, team needs a call to spec out implementation.

User Story

This is an internal story to solve several internal technical needs and requirements of user-facing stories.

This is the basis. The user stories depending on this are:

  • Listing of Apps
  • Updating of Apps (update events/notifications)
  • Defaulting of App CR fields (user just says I want prometheus and we default to latest version)
  • Validation of App CRs (specifically, validate that app name and version exist in said catalog)

The latter two stories are concerned with having functionality baked into our Control Plane and its Kubernetes API so that we are not dependent on the to-be-deprecated GS API.

Future stories depending on this might be:

  • Dependency management between apps and other apps
  • Dependency management between apps and GS versions
  • Making required config fields explicit
  • Other metadata to apps (Screenshots, long descriptions, example configs)

Current State

Currently, an App Catalog is represented solely by its appcatalog CR in the CP. There's no information about its contents in the CP.

Happa talks directly to the storage of the Helm repository referenced in the appcatalog CR (i.e. in our case GitHub) to get the index.yaml, parse it, and use it for some defaulting and representation cases.

The functionality to list apps is not present in our API. Thus, the UX amounts to looking at the SSoT, i.e. GitHub, and following it to check which apps are present.

Requirements

What we need is a way to get the SSoT, i.e. a list of apps, their versions, as well as metadata, from a definite place inside the Control Plane.

Options

As we are extending the Kubernetes API of the CP, there are basically two ways that are native:

  1. CRD + Controller (i.e. Operator Pattern)
  2. Microservice that offers an aggregated API

While option 1 has been our go-to solution for most issues, we have often mitigated the impact of such an asynchronous, event-driven API on the user by adding a microservice or API logic of our own. We want to get rid of the latter pattern, as our API and the microservices that "translate" towards CRDs are something we want to deprecate mid-term.

For option 2 we do not have any production experience. It is used upstream for the metrics API, and Team Ludacris is considering using it for #13 (Health of TCs). There's a PoC that they built during the Hackathon last week.

Team Batman has extensively evaluated Option 1 (see also https://docs.google.com/document/d/1LCd4XeLh8fYKILjTfyvBILrWZc07Gn0zdrbpC-MZUOg/edit), so we will describe this and also a lower fidelity spec of Option 2 in the following.

Keep in mind that no matter which option we go with, the SSoT stays within the Helm repository, whether that repository is blob storage with an index.yaml like right now, or has a better API that can offer such information, as in ChartMuseum and Harbor. Currently, I do not see a way around having some local representation (more on this in the Harbor case below). Correct me if this assumption is wrong.

1) AppCatalogEntry CRD

The idea is to have a cluster-scoped AppCatalogEntry CR per app in a catalog. This CR is to be created by a new controller in app-operator based on the SSoT, which is the appCatalog storage, i.e. GitHub or another Helm repo and its index.yaml.

Sample CR (details can be discussed in apiextensions PR once we have a better plan here)

metadata:
  name: kong-incubator
  labels:
    application.giantswarm.io/catalog: giantswarm-incubator
spec:
  name: kong-app
  description: The Cloud-Native Ingress and Service Mesh for APIs and Microservices
  engine: gotpl
  home: https://KongHQ.com/
  icon: https://s3.amazonaws.com/downloads.kong/universe/assets/icon-kong-inc-large.png
  sources:
    - https://github.com/Kong/kong
  urls:
    - https://giantswarm.github.com/giantswarm-incubator-catalog/kong-app-0.2.0.tgz
  versions:
    - version: 0.2.1
      appVersion: 1.2.3
      created: "2019-09-28T13:18:25.002458784Z"
    - version: 0.2.0
      appVersion: 1.2.2
      created: "2019-08-28T13:18:22.004458784Z"

This CRD can then be used for all possible use cases mentioned above. It can be extended to include additional metadata with time, but the idea would be to start small.

2) appcatalog-service

A microservice serving an aggregated API that does one of the 2 following things:

  1. In case of blob storage for Helm Repository, it downloads the index.yaml, keeps a purged representation of that in its "cache" and answers requests for apps, versions, and metadata.
  2. In case of chart-museum/harbor it proxies (and maybe caches) requests to said chart registry.

The reason for a proxy in case 2 is that it abstracts away from the actual representation and storage in case those change. Another reason is that, for better UX, a single place to get information about app and version availability, and to install from, would make more sense. Having the user switch to a Chart Registry frontend to explore which apps they want and then come back to install and configure them in clusters sounds icky. Again, correct me if I'm wrong.

Make Managed EFK more production-ready

We offer the EFK Stack as we have some experience with it and it is common in the CNCF ecosystem. Loki is not an alternative for all customers, as some cannot adhere to the structured logging requirements for using it.

Managed App Testing | Testing strategy for Managed Apps charts

We currently use the same approach for testing charts that we use for operators. But we only need to test Helm Charts and the resources they create.

This ticket is to improve the dev UX for chart developers. Current problems are:

  • Requiring Go skills
  • Regular vendor updates for e2e library changes and upstream components like client-go.

We should consider community tooling like Sonobuoy as well as our own internal tooling.

Setup Prometheus server per tenant cluster

User Story

- As a Giant Swarm engineer, I want to have a scalable monitoring architecture so that Prometheus resource usage is not tied to the number of tenant clusters on an installation.

- As a Giant Swarm engineer, I want to have a stable monitoring architecture so that Prometheus uptime is 100%.

- As a Giant Swarm engineer, I want to be able to add as many metrics to Prometheus as I need so that I have good observability of apps and services.

Details, Background

Currently, we have one Prometheus, Alertmanager and Grafana instance per installation running in a control plane. So the resource usage of Prometheus is tied to the number of tenant clusters. As the number of tenant clusters increases, Prometheus resource usage grows. This results in memory issues and downtimes of Prometheus.

To ensure Prometheus stability, we need to have a separate Prometheus server per tenant cluster installed on the Control Plane.

(diagram: cp_prometheus)

TODO

Spikes

Targets

  • Docker daemon
  • K8s api server / controller manager / scheduler
  • Kube-proxy
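For the targets listed above, a rough sketch of what per-tenant-cluster scrape configs could look like. The API endpoint, certificate paths and the kube-proxy metrics port (10249) are placeholders, the Docker daemon job is omitted, and the real setup would need per-cluster credentials wired in:

scrape_configs:
  - job_name: abc12-apiserver
    scheme: https
    static_configs:
      - targets: ['api.abc12.k8s.example.gigantic.io:443']
    tls_config:
      ca_file: /etc/prometheus/certs/abc12/ca.pem
      cert_file: /etc/prometheus/certs/abc12/crt.pem
      key_file: /etc/prometheus/certs/abc12/key.pem
  - job_name: abc12-kube-proxy
    kubernetes_sd_configs:
      - role: node
        api_server: https://api.abc12.k8s.example.gigantic.io
        tls_config:
          ca_file: /etc/prometheus/certs/abc12/ca.pem
          cert_file: /etc/prometheus/certs/abc12/crt.pem
          key_file: /etc/prometheus/certs/abc12/key.pem
    relabel_configs:
      # point the scrape at the kube-proxy metrics port instead of the kubelet port
      - source_labels: [__address__]
        regex: '(.+):\d+'
        replacement: '${1}:10249'
        target_label: __address__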

Support peering on cloud providers

V and AA are peering VPCs with our clusters. Our automation isn't aware of this configuration. This can lead to trouble during upgrades, and we should support it out of the box (a CloudFormation sketch of the AWS case follows below).

From Puja in Slack:

  1. peering between 2 GS owned VPCs (i.e. two workload clusters)
  2. peering between workload cluster VPC and other VPC in same account (e.g. China MongoDB runs on VMs in own VPC in same account AFAIK)
  3. peering between workload cluster VPC and other VPC in different account
    Then there’s bi-directional vs 1-directional peering
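For reference, the AWS side of case 3 (peering into a VPC in a different account) boils down to a peering connection plus routes. A CloudFormation sketch with placeholder IDs, showing only the requester side; routes back from the peer VPC still have to be added on the accepter side:

Resources:
  PeeringToCustomerVPC:
    Type: AWS::EC2::VPCPeeringConnection
    Properties:
      VpcId: vpc-0aaaaaaaaaaaaaaaa        # workload cluster VPC (placeholder)
      PeerVpcId: vpc-0bbbbbbbbbbbbbbbb    # peer VPC (placeholder)
      PeerOwnerId: "111111111111"         # peer account; omit for same-account peering
  RouteToPeer:
    Type: AWS::EC2::Route
    Properties:
      RouteTableId: rtb-0ccccccccccccccc  # a workload cluster route table (placeholder)
      DestinationCidrBlock: 10.2.0.0/16   # peer VPC CIDR (placeholder)
      VpcPeeringConnectionId: !Ref PeeringToCustomerVPC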

Clean Up App Catalogs (Only Giant Swarm and Playground catalogs remaining)

The outcome of this is just to delete the incubator and move things to the right catalog. No quality bars.

User story:

  1. As a GS person giving an external demo, I want one catalog I could demo, so that I don’t waste time asking someone every time which of the many catalogs I can demo.
  2. As an end user developer, I want to easily see which apps I could actually use for my work.

Todos

Incubator status Mon, Mar 16, '20

  • kubernetes-elastic-stack-elastic-logging -- to be deleted in favor of efk-stack
  • redis-app — decided to delete
  • kong-app — @yasn77 this cycle, to GS catalog
  • prometheus-operator-app-chart -- to GS catalog, @glitchcrab talk to customer
  • Aqua, to GS catalog -- @glitchcrab
  • Follow up with @pipo02mix: k8s-initiator-app, to playground catalog. Decided to delete. When Fer has time he will add it to Playground.

Playground status Mon, Mar 16, '20

  • app-mesh (to playground)
  • efk-logging - delete (removed from playground catalogs as well)
  • efk-stack - move to Giant Swarm @webwurst will ping @glitchcrab for steps to do
  • eventrouter (to playground)
  • fluent-logshipping (to playground)
  • giantswarm-todo (to playground)
  • linkerd2 (to playground)
  • loki-stack (to playground)

Next story: https://github.com/giantswarm/giantswarm/issues/8444

Kubernetes Release v1.16

Kubernetes 1.16

https://github.com/kubernetes/kubernetes/blob/master/CHANGELOG-1-16.md

Provide a new release with the kubernetes version

Check migration recommendations

Check migration recommendations from kubernetes and decide what we need to document for the customer and what we should migrate automatically for them

  • Do we need to change our own resources within the tenant clusters?
  • What do we need to communicate to our customers?
  • Anything else that would be better automated for the upgrade?

Run e2e and conformance tests

  • Start a new cluster and run conformance tests against it
  • Start a new cluster and run e2e tests against it

Check Core Components

  • Check core components for updates and include them in cluster-operator PR giantswarm/giantswarm#788
  • All helm releases in the giantswarm namespace should be in status deployed
  • Core components are running without errors
    • Cert-exporter
    • Chart-operator
    • Cluster-autoscaler
    • CoreDNS
    • Kube-state-metrics
    • Net-exporter
    • Nginx ingress controller
    • Node-exporter
  • Cluster-operator is running without errors
  • Cluster-operator is able to create chart-configs in the tenant

Test migration (both cluster functionality itself and workloads)

  • Start a cluster with an older version and upgrade the cluster to the new release
  • Upgrade a cluster that is not critical but has some more real workloads (eg giantswarm website, dev cluster of customer)
  • All Providers

Write summary for the release and update docs

Aha! Link: https://giantswarm.aha.io/features/L-7

Kubernetes Release v1.17

Preparation steps

  • Read upstream changelogs for any important changes.
  • Establish component upgrade targets
    • Calico v3.11/v3.12
    • etcd v3.3.18/v3.4.3
    • Container Linux v2303.3.0
  • Ensure that retagger has correctly tagged all targeted component versions during the automatic nightly retagging in Quay and Aliyun
  • Create a PR in k8scloudconfig/ignition-operator with the new component versions and any other required changes
  • Check Prometheus rules for deprecated metrics and fix them if necessary
  • Establish common release notes and create WIP releases for each platform
  • Do we need to change our own resources within the tenant clusters?
  • What do we need to communicate to our customers?
  • Anything else that would be better automated for the upgrade?

Child issues

Managed Loki

We are testing Loki, and might offer it as a lightweight, more cloud-native alternative.

Control plane API through Kubernetes itself

Replace our existing API with Kubernetes itself.

We would give our customers access to the Kubernetes API of the control plane and define roles for them so that they can see and edit the Custom resources we provide to manage the clusters and also see the components that manage the clusters.

The first approach is to let the API talk to Kubernetes directly instead of using the services in between. A later goal would be to remove the API in front of Kubernetes.

This has been discussed here: https://github.com/giantswarm/giantswarm/issues/4135

A related UX story can be found here: https://github.com/giantswarm/giantswarm/issues/4370

An idea to involve some customers early would be to give them access to a read-only role in the control plane.
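A minimal sketch of that read-only idea, assuming standard RBAC on the control plane; the API groups listed are examples of our CRD groups and the group binding is a placeholder:

apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: customer-read-only
rules:
  - apiGroups: ["provider.giantswarm.io", "core.giantswarm.io", "application.giantswarm.io"]
    resources: ["*"]
    verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: customer-read-only
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: customer-read-only
subjects:
  - kind: Group
    name: customers          # placeholder group, e.g. mapped from the customer's IdP
    apiGroup: rbac.authorization.k8s.io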

All stories around more visibility into the system would be affected by this, in a positive way. Standard Kubernetes tooling (like kubectl) would already give our users much more insight into what is going on inside the control plane and its tenant clusters.

Authentication for customers will also be affected.

These roadmap items would then be obsolete:

Make workload cluster k8s CIS compliant

Enable direct write access to App Catalog through Kubernetes API of the Control Plane

The thought behind this is that, as the App Catalog is mainly green-field, we could enable write access to it earlier than for the rest of Giant Swarm.

  • Structural validation can be done using an OpenAPI schema (see the sketch below).
  • This would require some work on defaulting and validating the (user-facing) CRDs. We already have a PoC for that using OPA.
  • This might also include SSO for the Kubernetes API of Control Planes.
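As referenced in the first bullet, structural validation could look roughly like the snippet below, sketched against the App CRD; the field names and version pattern are illustrative and may not match the real CRD exactly:

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: apps.application.giantswarm.io
spec:
  group: application.giantswarm.io
  names:
    kind: App
    plural: apps
  scope: Namespaced
  versions:
    - name: v1alpha1
      served: true
      storage: true
  validation:
    openAPIV3Schema:
      type: object
      properties:
        spec:
          type: object
          required: ["catalog", "name", "version"]
          properties:
            catalog:
              type: string
            name:
              type: string
            version:
              type: string
              pattern: '^[0-9]+\.[0-9]+\.[0-9]+.*$'   # illustrative semver check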

Blocked by

  1. Defaulting and Validation is blocked by #26
  2. We have to work with the new WG to define the scope

Note: Moved from Managed Services Roadmap 2019 https://github.com/giantswarm/giantswarm/blob/1f0dbd05bd806df2670ccd2a7cc906c1a8990d4f/areas/managed-services/roadmap/2019-ROADMAP-MANAGED-SERVICES.md#stories-for-2019q4 This issue is intentionally left incomplete, Chiara needs to clarify the user story.

Status / health of workload clusters

Epic Story

As a customer, I want to be able to easily and efficiently check the health of my workload clusters so I know whether they require attention or not.

(This Epic is in ideation phase and the following stories below have been created to collect the various sources of information we require in order to build the MVP of this Epic)

Linked UserStories

  • Competitor Analysis giantswarm/giantswarm#6136
  • Customer Focus Sessions/ Feedback giantswarm/giantswarm#6137
  • Available Data giantswarm/giantswarm#6138
  • Mock ups giantswarm/giantswarm#6139
  • Architectural implications giantswarm/giantswarm#6312
  • In scope/ out of scope

User Personas

Linked Stories

Build a Multi Tenancy Helper

We see many customers struggling with implementing best-practices both when it comes to security but also on scheduling matters.

Current thoughts go towards improving documentation (and that is important), as offering defaults alone would only be straightforward if you controlled all namespaces and knew them in advance.

I was thinking though that we could offer an actual tool (be that a controller or just an external helper tool) that could make the lives of our customers (and hopefully also other k8s users) easier.

I had lots of thoughts about this and wrote down some stuff, which I will dump below in this issue.

I'd like to get your thoughts, especially from @giantswarm/sig-product @giantswarm/sig-customer @giantswarm/sig-ux @giantswarm/sig-support

Better release crafting & testing.

Epic Story

  • As a Swarmie struggling with the constant battle of releasing via a manual, laborious and time consuming process, I would like our release process to be as automated, pain free and delightful as possible so that I can spend my time on other worthwhile value generating activities.

Epic Acceptance Criteria

Iteration 1: Quick fixes whilst we are waiting for VOO (see User Stories 1-6)
Iteration 2: The full end-to-end release process is automated (see User Stories 7-8)

Linked User Stories in priority order:

Iteration 1.

  • 1. Improve the reliability of E2E tests giantswarm/giantswarm#5965
  • 2. Automate slack channels for release notes giantswarm/giantswarm#5966
  • 3. Step by step guide to release notes giantswarm/giantswarm#5967
  • 4. Automate slack channel for K8s releases (not needed?)
  • 5. Re-tagger should automatically retag images for k8s releases giantswarm/giantswarm#5968
  • 6. Move release info into separate repository giantswarm/giantswarm#5969
  • 10. Automate China E2E tests giantswarm/giantswarm#6052

Iteration 2. stories which are dependent on the completion of VOO giantswarm/giantswarm#5021

  • 7. Add support to devctl to automatically list PRs since the last release giantswarm/giantswarm#5970
  • 8. Do a full installation E2E tests from installations giantswarm/giantswarm#5971
  • 9. Add official conformance tests to our E2E tests giantswarm/giantswarm#5972

Dependencies

  • VOO giantswarm/giantswarm#5021 !!

Other Tasks (non userstories)

  • Add k8s next release to calendar and code freeze (Jessica)
  • Start preparing new release before new k8s is out by checking upcoming e.g. 1.14.2 (collect updates from components to include etc)
  • Write more e2e tests

Aha! Link: https://giantswarm.aha.io/features/L-2

Expose cluster creator via our API / front end

Currently we ask everyone to include their username in the cluster name when creating test clusters. This is important so we can contact the owner if they are generating errors or unnecessary costs.

However it's easy to forget and we have lots of new starters who don't know this.

Idea from @marians is to add a created_by field to the API object. It can then be exposed in Happa and gsctl.

Idea from @corest is to use this for the Cluster Police bot, to notify the owner directly so we don't spam the #ops channel with these notifications.
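Purely as an illustration of the idea (the label key and where it is set are made up here), the creator could end up as a label on the cluster resource that Happa, gsctl and the Cluster Police bot can all read:

metadata:
  name: abc12
  labels:
    giantswarm.io/created-by: marian   # hypothetical label, set by the API at creation time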

Threads:

#sig-product https://gigantic.slack.com/archives/C02521EC0/p1573809116294100
#product-kaas https://gigantic.slack.com/archives/CNX3Z835G/p1573809431036900

Managed OPA (for the customer)

We can use Gatekeeper for validation and defaulting, and have a separate OPA for authorization (at least the Gatekeeper devs told us that it will be separate).

Revamp Ingress Controller setup (Optional NGINX Ingress Controller)

User Story

As a customer, I would like to have the option to not have Nginx Ingress Controller installed in all my clusters by default, so we don't waste resources in cases where we don't need to expose any endpoints.

As a customer, I would like to have the option to not have the Nginx Ingress Controller, so I can run a different Ingress Controller that we need; for example, I need an API gateway, or I use a service mesh that contains an ingress controller.

As a customer, I would like to have the option to install the Nginx Ingress Controller only internally, so my services can use it to communicate internally without exposing endpoints to the Internet.

TODO

  • Move LB/DNS logic out of the aws operator
    • Remove DNS/ELB set up logic from aws operator
    • Modify IC to provide LB annotation to provision the ELB
    • Add cert-manager as default managed application
      • monitor for not satisfied
    • Add kiam as a default managed application
      • Precreate kiam role / reuse master role
      • monitor for not satisfied
    • Add external DNS with the proper configuration
      • Precreate external-dns role with proper policy in operator
      • Add cleanup resource for hosted zones which will ensure all the records except SOA and NS are cleaned up from hosted zone so that CF can delete it
      • Set proper IAM policies for the service for Fer's customer
      • Add external dns as managed service
      • monitor for not satisfied
  • Make it possible to select the ingress class for the NGINX Ingress Controller
  • Make it possible to select between an internal or external ingress controller (i.e. add an annotation option for the Service of type LoadBalancer to the IC user config); see the sketch below
  • Migrate the NGINX IC to an App CR and make it the default everywhere except Cluster API
  • Update docs on the hello-world app with examples of installing the IC from the catalog
  • Stretch Goal: the Ingress Controller should be completely optional (this depends on us being able to add apps to clusters at creation time)
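For the internal/external switch referenced above, the relevant knob on AWS is the in-tree internal load balancer annotation on the IC Service. A sketch of how that could surface as user values for the NGINX IC app; the values structure is illustrative, only the annotation itself is the standard upstream one:

# Illustrative user values for the NGINX IC App CR
controller:
  service:
    annotations:
      # older Kubernetes releases expect "0.0.0.0/0" instead of "true"
      service.beta.kubernetes.io/aws-load-balancer-internal: "true"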

Details, Background

Right now our ingress controller (IC) is installed by default in all clusters. We have customers that run a different ingress controller, for whom the default one just wastes resources. Also, some customers would like to not expose ingress endpoints by default. Currently, in AWS and Azure, we provision an LB pointing to the workers for the default ingress controller (exposed as a nodePort).

On the other hand, we have moved the ingress controller to be run by the chart-operator. The configuration of the IC chart is managed in cluster-operator; this is moving to the App Catalog operators. Ideally, the ingress controller will be part of the catalog, and users will choose which IC is deployed and in which flavor (external/internal). At the same time, this will allow us to offer more than one IC solution (out of scope right now).

Customers driving/requiring this

I think all customers have one interest or another in this development.

UPDATE 17.10.19

  • For full optionality of IC with a similar simple setup like now (i.e. not requiring customer to set up custom things in AWS) we would additionally need ‘kiam’,‘external-dns’, and ‘cert-manager’
  • ‘external-dns’ is already a default app on Azure
  • ‘kiam’ would increase the overall security of our setup, especially making customer workloads that need access to AWS resources more secure
  • ‘cert-manager’ is an app that most customers are anyway running and would welcome us taking over its management
  • Having those apps in AWS clusters would mean we could get rid of a lot of hard coding in aws-operator
  • It would also mean that the load balancer would be configurable, e.g. to set it to internal
  • It would mean we do not need to take care of it when we add node pools and stuff
  • It would mean that even the default DNS entry for ingress is configurable
  • based on the above we decided to make those 3 apps new default apps for AWS clusters
  • the actual Ingress Controller then is fully replaceable, so a customer could use our default components to run Traefik, and at some point we would also be able to run other ICs, e.g. for service meshes or if there are clusters that will only have Kong

Azure: HA Control Plane

Epic Story:

As a Customer, I want GS to provide highly available master nodes on my workload clusters, so that if the working master goes down, for example, the cluster automatically fails over to the next available master.

OR

As a cluster admin, I want to configure a cluster to use multiple master nodes, so that the risk of a Kubernetes API downtime is minimal

Rationale

Even if workloads are still functioning, a temporary downtime of the Kubernetes API seriously impacts our customers in that they can't update any deployments, scale replicas etc.

Background

Reliability will be increased by making multiple masters reside in different availability zones (AZ), as this reduces the likelihood of all masters being unavailable at the same time. Cluster upgrades would benefit from this, as with a single master an upgrade automatically implies a downtime. This might be implemented as optional, as not all use cases may require the high availability and justify the increased cost (e. g. dev, testing).

Enable disabling 24/7 support for workload clusters

User story

- As a Giant Swarm engineer, I want to be able to disable alerting for a CAPI cluster so that I won’t get paged for a testing cluster but still will be able to check monitoring data.

Description

As a user I'd like to disable the 24/7 support on test clusters. If a customer wants to play with a cluster and try out a service mesh or other things that are currently not officially supported they can turn off our 24/7 support and just play with the cluster.

The same would work for the SEs. They can disable 24/7 on clusters that are in a known broken state.

An additional benefit would be that we could start clusters with support disabled as the default. This way it would be a conscious decision of the user to turn on 24/7 support.

The last point would also allow us to give more partners access to control planes to play with our product. The clusters would all be without 24/7 support and nobody would get paged during the night.
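One purely hypothetical way to model the switch (label key and routing are invented here for illustration): a flag on the cluster resource that is propagated into alert labels, which Alertmanager then routes to a no-op receiver so nobody gets paged while monitoring data stays available.

# Hypothetical flag on the cluster resource:
metadata:
  name: abc12
  labels:
    giantswarm.io/supported-24x7: "false"

# Hypothetical Alertmanager routing, assuming the flag is injected into
# alerts as the label "supported_24x7":
route:
  receiver: opsgenie
  routes:
    - match:
        supported_24x7: "false"
      receiver: blackhole
receivers:
  - name: opsgenie
  - name: blackhole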

Setup management cluster Prometheus

User Story

- As a Giant Swarm engineer, I want to have observability of Control Plane components (operators, etcd, etc.) so that I can avoid outages and other issues.

Details

Setup Control Plane Prometheus to monitor Control Plane components (operators, etcd etc.) and workload cluster Prometheus servers.

(diagram: cp_prometheus)

TODO

Notes / Questions

  • We want to allow customers to use our Grafana on the control plane. We might also want to think about routing alerts to them. An example use case is IP capacity: when there are only 5 IPs left for a cluster we could have an alert, but that would only trigger us to tell the customer; instead, we could alert customers directly.

Real-time event stream

User story

  • If I as a cluster-admin create/upgrade/scale/delete a cluster I want to get feedback about the steps of the initiated process.

  • If I as a cluster-admin create/upgrade/scale/delete a cluster I want to see errors within the initiated process immediately.

Rationale

Get real-time information on status changes, so that we can display e.g. readiness after cluster creation.

Background

We don't provide any feedback during cluster creation at the moment, and the same will happen during an upgrade
of a cluster.

Since cluster creation, upgrades, and scaling, to name a few, are complex processes that can take some time,
it would be beneficial to be able to display detailed progress information in our UIs. This would also help
users to diagnose errors when something does not work as expected.

An event stream as part of our API would also enable custom integrations. For example, customers could trigger
the creation of an external uptime check whenever a new ingress is created in a guest cluster. Note that this
aspect is especially targeted in the EVENT-HOOKS story.

As an obvious technology choice, a WebSocket API should be offered for clients to consume the stream.

With "real time" we mean a delay of a few seconds between an event occurring and being presented via the API.

In cluster creation, these are a few examples for the types of events we would like to get from the API:

  • Cluster creation event types we can think of as being reported via the API:

    • Key-pair backend available (for creating key pairs for the cluster)
    • Kubernetes API server is up
    • Individual workers are "known" with their identifier or IP address
    • Individual worker nodes are in Ready state
    • Ingress load balancer is working
    • Calico network is set up and working
    • Internal DNS is up and working
    • Prometheus scrape targets for the cluster have been created
  • Upgrades

    • Individual node is cordoned, drained, shut down
    • Individual node is in Ready state (in new version)
  • Scaling up

    • Individual workers are "known" with their identifier or IP address
    • Individual worker nodes are in Ready state
  • Scaling down

    • Individual worker is cordoned, drained, shut down
    • Individual worker is removed from the cluster
  • Deletion

    • Individual node is removed
    • Individual worker is removed from the cluster (TBD)
    • Cluster object is deleted

We are aware that in deletion the events don't provide actionable information; still, having
them will improve the experience and raise trust in the system.

MVP and further iteration

As a minimal version, for cluster creation we expect these event types:

  • Kubernetes API server is up
  • Individual worker nodes are in Ready state
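To make this minimal version tangible, a hypothetical shape for messages on the WebSocket stream; every field name here is invented for illustration and is not a proposal of the final schema:

- type: kubernetes-api-up
  cluster_id: abc12
  timestamp: "2020-04-01T12:00:00Z"
- type: worker-node-ready
  cluster_id: abc12
  node: ip-10-1-2-3.eu-central-1.compute.internal
  timestamp: "2020-04-01T12:07:31Z"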

Open questions

  • Do we want to watch custom object status changes, or do we rather prefer a messaging API to be used
    from within operators?

Possible side effects

  • Load on the kubernetes API server when watching custom object status changes.

Dependencies/prerequisites

  • Differentiation between actual state and desired state in custom objects would allow for
    watching object changes, which would be the basis of the implementation. That should be
    covered in the STATE story.
  • We should have finalizers in place, to be able to observe a cluster deletion.
  • A unified worker identifier working over all providers would be beneficial.
  • Guest cluster upgrades must be working in order to have upgrade events

Related stories

  • Transparency Regarding Cluster State #75
  • History of a Cluster #57
  • Webhooks for Events #74

Multi-AZ tenants in Azure

User Story

As an Azure user I want to be able to spread clusters over multiple availability zones to decrease the possibility of service interruption in case of datacenter failure.

Details, Background

Azure Availability Zones are separate data center units within Microsoft Azure, each with its own power, cooling and networking. By running services on multiple availability zones, you can make your applications resilient to failure or disruption in your primary data center.

Provider

  • Azure

Azure: Node Pools

Epic User Story

As a user creating multiple clusters, I want to be able to define the individual character of each of these clusters, so that I can serve multiple use cases in one cluster.

Acceptance Criteria

I can have different node pools in a cluster which:

  • are spread across different availability zones
  • have different levels of storage
  • have different availability zones, which I can manage through the UI (Happa), gsctl, and the API.

Implementation Plan

  • Split master and workers management into two different resources (effectively: split instance resource)
  • Sort out the CRDs (both Cluster API and ours) that are required for MachinePool (and provider-specific parts of it); see the sketch after this list. #161
  • Create controllers required for MachinePool (provider-independent stuff to cluster-operator and rest provider-specific stuff to azure-operator). #157
  • Copy workers instance resource to the machine pool controller.
  • Copy ipam resource from aws-operator and adapt it to azure-operator.
  • Refactor the machine pool controller's workers resource to fully manage node pool
    • Allocate and create subnet for workers.
    • Configure security groups if needed.
    • Configure routing if needed.
    • Handle the specific VMSS and its status.
  • Create migration resource in AzureConfig controller
    • Create MachinePool (and required related) CRs to mirror existing worker VMSS.
    • When the created machine pool CRs have a Status showing that all workers are up, running and Ready, drain the old workers.
    • When old workers are drained, delete the old worker VMSS.
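As referenced in the CRD item above, the upstream Cluster API shape we would build on looks roughly like the sketch below. The API group and field names are from the v1alpha3 MachinePool experiment and the bootstrap reference is a placeholder; the exact wiring into our CRs is what #161/#157 are about.

apiVersion: exp.cluster.x-k8s.io/v1alpha3
kind: MachinePool
metadata:
  name: abc12-np001
spec:
  clusterName: abc12
  replicas: 3
  template:
    spec:
      clusterName: abc12
      version: v1.16.8
      bootstrap:
        configRef:                 # placeholder; our bootstrap mechanism may differ
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: KubeadmConfig
          name: abc12-np001
      infrastructureRef:
        apiVersion: exp.infrastructure.cluster.x-k8s.io/v1alpha3
        kind: AzureMachinePool
        name: abc12-np001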

Links

  • Dimensions of Complexity analysis: report

  • Please view this slide-deck for an overview of the 'AS-IS - TO-BE' design: design

Managed Service: Image scanning

Here I'm just dropping thoughts on creating an operator to manage CoreOS Clair.

The main features I see with this operator would be:

  • Clair server reconciliation
  • Watch Pods creation events and send analyze request for every new images
  • Manage Notifications subscription

I think this can be useful for Managed Services; Julien's customer already asked if we can help them with Clair.

Upgrade to Helm 3

User Story: As a GS team member, I want to be able to use Helm 3, to benefit from the improved security model and to no longer have to manage Tiller, which takes time and adds complexity.

References

Epics

Upgrade tenant clusters to Helm 3 https://github.com/giantswarm/giantswarm/issues/11748 (blocked until after CP upgrade)

Releases

Update issue: [epic] Managed Kong in Giant Swarm Catalog (Preview Quality)

Outcome: Get Kong to the specific managed service it is supposed to be. It’s hanging at 70% there. This means we have decided what it means (what we do and don't do) for Kong to be a Managed App. "Done" means having the answers in written form somewhere (whether in app catalog, a google doc, etc.) and sent to customer.

By when?

By March 31, 2020, we will have provided our Kong to the customer with documentation on (1) how they can use it, (2) how they can add their plugins, and (3) how they can go live with it. Not fully SLA production-ready, but usable, with possible bugs.

People involved

Related Docs:

Obsolete

Engineering Todos

Product Todos

Waiting for feedback on these docs PRs

  1. They will use our Kong. This is clear now. In DBLess mode.

After first draft of Managed Kong definition

  • Document - after deciding where (e.g. in app catalog, in a doc, etc.)
  • Communicate to customer

Managed App Catalog - Phase II

User Story

This phase is about getting more apps into the catalog, the automation around that, and the release process.

  1. Move existing Managed Services (CoreDNS, Ingress Controller etc) to apps using the new release process
  2. Versions of required Managed Services are linked to Giant Swarm releases
  3. cluster-operator creates/updates app CRs, the rest is moved to app-operator
  4. UX lets you manage and update app CRs

TODO (Cycle 2)

UX @oponder

Tracked here: https://github.com/giantswarm/giantswarm/issues/6659

Migrate core components to App Catalog

cluster-operator

e2e

  • Create basicapp e2e test to replace managedservices @tomahawk28

Migrate Charts

Spec

Related Issues

Aha! Link: https://giantswarm.aha.io/features/M-14

Enable Central Services

This would most probably involve some kind of multi-peered "ops-cluster", which holds central services, instead of hosting the central services inside our Control Plane.

Service Broker API is moving to CRDs, so we will wait for that

Just closed. Managed Prometheus Operator (Monitoring) for customers

Towards #175


We have a Prometheus Operator in the Giant Swarm Catalog that customers already use.

In a Prometheus-focused Cycle, we will improve how we support it in production. This includes:

Prior to the cycle, we will also talk to customers and gather feedback for improvement.

Managed Service: CI

Can we provide our customers with a managed service that adds CI across all their clusters? What would this look like?

Enable App Updates

Right now, you can update by choosing the version in the version picker. However, things will break if things are not configured correctly. There is no migration path for configuration values.

As an app user, I want to be able to update apps.

Related Stories

  • UX for Updates (version picker)
  • Notification about Updates (As an end user developer, I want to get notified where there is a new version of an app I have installed.)
  • Automated updates through "update-operator" (As an end user developer, I want to be able to specify how up to date I want my app to be within certain major/minor boundaries and have the app be automatically updated according to those specifications.) - but don't encourage until configurations are safe

Upgrades (more graceful, more flexible, less aggressive, more robust)

User Story:

  • As a customer, I would like Giant Swarm to issue less aggressive upgrades so that they do not interrupt my SDLC.
  • As a customer, I would like Giant Swarm to offer more upgrade options to adapt to the idiosyncrasies of my applications.

Testing Acceptance:

  • GS upgrades are scheduled around:
    • my cluster's schedule, such as a pause
    • the batch size or termination period can be configured (allowing a single-swap upgrade)
    • my SDLC environments, such as dev first, stage and production later
    • a code freeze (out of scope)
  • GS upgrades are almost invisible

We collected lots of feedback with the first version of our upgrades. We reached a state we can work with on all providers. Now we need to think about improvements to finally reach a state that allows us to activate them automatically.

The goal of this story is not to activate upgrades automatically. We would need to schedule them and we need to allow customers to influence this based on individual clusters (pause), based on environment (dev first, stage and production later), based on a freeze (pause everything). This is a separate story.

But this story here should bring upgrades into a state where they become almost invisible for our customers.

Upgrades should be less aggressive:

General

  • Lockdown a cluster for upgrades #186
  • Using the node-operator for draining on all providers (implement this independently of the actual provider)
  • Determine node readiness correctly
  • Mark old nodes unschedulable so pods do not get moved twice (or taint them, e.g. kubectl taint nodes node_name key=value:NoSchedule)

AWS

  • Ensure TCPN stack(s) are not updated until the master TCPN stack is in UPDATE_COMPLETE (this is recommended practice in the community and avoids problems when the master stack fails and leaves the cluster in an inconsistent state)
  • Be able to select batch size (slow - 10%, regular - 33%, fast - 50%, single swap - 100% nodes) #216
  • Time between node rolling #217
  • Consider the possibility of disabling the DisableRollback option on the CloudFormation stack to avoid a larger impact in big clusters
  • Be able to pause an upgrade (it could be achieved by letting the customer upgrade each node pool separately)
  • Don't trigger a new batch upgrade until the last node batch runs fine
  • Rollback option for GS version patches #221

Azure

  • Starting new nodes first

Related issues:

App Catalog Partner Integration

User Story

As a Giant Swarm partner company, I want to enable my users/customers to install my software easily into their clusters.

As a Giant Swarm partner company, I want to be featured and have a special position (compared to community apps) in the App Catalog section.

TODO

  • Find a way to do easy release management for a separate partner catalog
    • evaluate options
    • talk with partners if options will work out
    • build tooling
  • Expand on UX for partners
  • Documentation
    • Document/explain what the partner catalog is for and what support level can be expected as a customer
    • Document how to add new partners

Details, Background

We started our first partnership with Instana, which is installed through their chart in the Helm upstream community catalog. However, the community catalog is full of apps and it is hard to find our partners among them.

The current list of planned partner integrations is tracked in https://github.com/orgs/giantswarm/projects/74 and can be found by the label growth/partners.

There would be the option of prioritized display of certain charts in the community catalog, but they would still be in the community catalog, which will say something along the lines of "apps in here are not supported by Giant Swarm, all support is purely open source".

Another option would be having partners provide their own catalogs, but that would mean we have tons of single-app catalogs, which is bad UX too.

A third option, currently favored, would be a general partner catalog with release tooling that differs somewhat from our own. The basic thought is a tool similar to https://github.com/giantswarm/retagger, i.e. a tool that we can point at a chart repository and that syncs the charts to our partner catalog. This option was suggested by @rossf7.

Aha! Link: https://giantswarm.aha.io/features/B-9

Spark

We have a customer that might need this soon.

Chiara Example

User Story

As a (customer|user|admin|...) I want to ... so that ...

TODO

  • item 1
    • item 1.1
  • item 2

Please complete the TODO list as far as possible

Details, Background

Please give all possible details, requirements, etc. Technical or not.

Deliverables

  • Deployed software
  • Monitoring/alerting
  • API changes
  • Public documentation
  • Internal documentation
  • Frontend changes
  • Other (please describe)

Host/guest cluster

  • Host cluster
  • Guest cluster

Please remove non-applicable items

Provider

  • AWS
  • Azure
  • KVM / bare metal

Please remove non-applicable items

Blocked by / depends on

Please link issues/epics here which have to be solved before

Customers driving/requiring this

Please add customer names, if applicable.

Timeline

Please add info regarding deadlines/timelines, if available.

Aha! Link: https://giantswarm.aha.io/features/B-3

Kafka

Tue, Dec 10, '19

Related

General steps of adding an app

  1. Get a rough idea of demand. Who has requested this? @giantswarm/sig-sales @giantswarm/sig-customer
  2. Have a standard idea of demand before we say we want to do the work of estimating the cost of this app. Smoke test, something like: do we reasonably believe at least 5 customers would be willing to pay something like 60k a year for it?
  3. If this app passes 2, estimate: what is the estimated effort to build and then maintain it? (Halo, but let's not do this before deciding on priority, or could give a rough estimate? How much effort does it take to estimate?) https://github.com/giantswarm/giantswarm/issues/8036

Oliver: I think the next step is really having a list and an estimated effort and a price idea that is validated with customers.

Aha! Link: https://giantswarm.aha.io/features/B-18

Manage Invasive Config and Addons by Customers better

As we have seen time and time again, there are configuration options and addons that customers can add that are so invasive and disruptive that they impact our components, our SLA, and in some cases the whole cluster.

Some of my thoughts on that, as I do NOT want to limit the freedom we give customers:

  • moar education and more control through involving SEs
    • We need to make clear what an "invasive" configuration and/or component is
      • We do that already for block config in CoreDNS
      • Anything that pertains to security and auth (incl admission of ALL types)
    • Customers need to know that they need to contact us before they do the changes
    • We need to have a good testing plan for such invasive actions
      • Just running it in a cluster is not enough
      • There needs to be forced rescheduling of pods
      • There needs to be an upgrade rolling both workers and master

Azure: Spot VMs

User Story

As an Azure user, I want to configure NodePools to use Spot VMs so that I can save costs by using the unused machines.

TODO

  • investigate technical complexity
  • prepare documentation and instructions for customers

Details, Background

This is a feature that can help save costs for our customers. This type of VM is not appropriate for all types of workloads, but there are use cases where it is viable.
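Part of the technical investigation above: upstream Cluster API Provider Azure exposes spot instances through something like the spotVMOptions block below on the machine pool template; the exact field names and how we would surface them in node pools still need verifying.

apiVersion: exp.infrastructure.cluster.x-k8s.io/v1alpha3
kind: AzureMachinePool
metadata:
  name: abc12-np002
spec:
  location: westeurope
  template:
    vmSize: Standard_D4s_v3
    spotVMOptions:
      maxPrice: "-1"    # -1 = cap at the on-demand price; field names need verifying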

Provider

  • Azure
