reconciler's Issues

Create OpenAPI specification (Swagger) for REST API of mothership reconciler

Description

The current REST API exposed by the mothership reconciler is not described by an OpenAPI specification (Swagger spec). This has to change: the exposed REST API and related models should be generated based on the specification.

The pattern of supporting different API versions in parallel should be kept: the URL should still include an indicator of the API version in use (e.g. https://host/v1/...).

AC:

  • Describe the REST API using an OpenAPI specification (Swagger spec)
  • Introduce code generators into the build process to generate models (and, if meaningful, also boilerplate and middleware code) for the exposed REST API

Reasons

Establishing an OpenAPI specification makes the API easier for clients to consume, adds code-generation capabilities for REST API models and middleware code, and can simplify discussions about API changes.

Attachments

Component-reconciler is not rendering ISTIO chart completely

Description

It turned out that the manifest-renderer call for ISTIO does not return the fully rendered manifest (Job resources were missing).

Expected result

Manifest includes all resources of a Helm chart.

Actual result

The manifest does not include all resources. For example, the Helm chart for ISTIO also defines Job resources, which are missing from the rendered manifest.

Steps to reproduce

Render the manifest resources for the ISTIO Helm chart and compare the result with the YAML templates inside the ISTIO component. Access to the rendered manifests is possible by setting a breakpoint in the runner.install() function where manifests are rendered.

Troubleshooting

P1: Ensure CRDs and ISTIO are always installed before any other components

Description

Kyma has a few mandatory resources which have to be installed before any other component can be installed. These are:

  • Kyma CRDs
  • ISTIO component

To ensure both resources are available, the mothership-reconciler has to take care that

  1. CRDs are always installed at the very beginning of each reconciliation run
  2. ISTIO is also installed at the very beginning but only if it's part of the components-list for this cluster

After the reconciliation of CRDs and ISTIO is finished, all other components listed in the component-list of the cluster will be reconciled.
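
A minimal ordering sketch under these rules (the component identifiers "crds" and "istio" are assumptions for illustration):

package scheduler

// orderComponents applies the rules above: CRDs always first, then ISTIO
// (only when it is part of the cluster's component list), then the rest.
func orderComponents(components []string) []string {
    ordered := []string{"crds"} // rule 1: CRDs always lead every reconciliation run
    rest := make([]string, 0, len(components))
    for _, c := range components {
        switch c {
        case "crds":
            // already placed at the very beginning
        case "istio":
            ordered = append(ordered, c) // rule 2: ISTIO next, but only if listed
        default:
            rest = append(rest, c)
        }
    }
    return append(ordered, rest...)
}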

AC:

  • Mothership-reconciler always installs CRDs first
  • Mothership-reconciler also installs the ISTIO component first but only if it's listed in the components-list of the cluster

Reasons

Ensure resources which are common pre-requisites for Kyma components are made available before the reconciliation of these components starts.

Attachments

Reconciler can't handle boolean values properly

Description

Boolean flags in charts are not properly handled. Example from tracing chart:

{{- if .Values.virtualservice.enabled }}
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: {{ template "jaeger-operator.fullname" . }}
  labels:
{{ include "jaeger-operator.labels" . | indent 4 }}
spec:
  hosts:
  - jaeger.{{ .Values.global.ingress.domainName }}
...

Such resource should not be rendered if you set virtualservice.enabled to false.
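
The underlying mechanics can be reproduced with plain Go templates, which Helm builds on: a non-empty string like "false" is truthy, so passing boolean flags as strings enters the guarded block and then trips over nil values such as .Values.global:

package main

import (
    "os"
    "text/template"
)

func main() {
    tmpl := template.Must(template.New("vs").Parse(
        "{{ if .Values.enabled }}rendered{{ else }}skipped{{ end }}\n"))

    // The string "false" is a non-empty string, hence truthy in Go templates:
    tmpl.Execute(os.Stdout, map[string]interface{}{
        "Values": map[string]interface{}{"enabled": "false"}, // prints "rendered"
    })

    // A real boolean false skips the block as intended:
    tmpl.Execute(os.Stdout, map[string]interface{}{
        "Values": map[string]interface{}{"enabled": false}, // prints "skipped"
    })
}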

Expected result

Virtual service is not rendered.

Actual result

Error:

Default-reconciliation of 'tracing' with version 'main' failed: Failed to get manifest for component 'tracing' in Kyma version 'main': Failed to render HELM template for component 'tracing': template: tracing/templates/kyma-additions/virtualservice.yaml:10:21: executing "tracing/templates/kyma-additions/virtualservice.yaml" at <.Values.global.ingress.domainName>: nil pointer evaluating interface {}.domainName

Steps to reproduce

Run reconciler with such command:

./bin/reconciler-darwin local  --components tracing --value tracing.virtualservice.enabled=false 

Links
This PR demonstrates the issue: #139

Implement status tracker

Description

The status tracker has the responsibility to record the progress/state of the ongoing cluster reconciliation (e.g. state of deployments/pods which were updated [tbc]).

Each reconciliation worker has a communication channel to the status tracker and sends progress updates. The status tracker will track each status change / installation result per cluster and store these data in a change log. The purpose is to give full transparency about the status of modified cluster resources.

In case of a restart of the reconciliation service, the status tracker information can be used to identify non-finished reconciliation processes, and the scheduler can re-schedule them.

After a reconciliation process was finished, the status tracker will update the cluster state in the cluster inventory.

ACs:

  • Be able to receive and track any reconciliation update which modified a cluster resource
  • Support querying of resource states related to a particular reconciliation run
  • Allow querying for non-finished reconciliation processes (e.g. used by scheduler after restart of the reconciliation service)
  • After a reconciliation process was finished, update the cluster state in the cluster inventory

Reasons

Track status changes of cluster resources during a reconciliation run.

Attachments

Reconciler has to synchronise cluster inventory with KEB

Description

The reconciler stores copies of cluster data in its cluster inventory, but the data-leading system is KEB. To avoid discrepancies between the copy the reconciler uses and the single point of truth on the KEB side, a regular synchronisation of cluster inventory data between the reconciler and KEB has to happen (at least once a night).

AC

  • Reconciler is retrieving inventory data from KEB
  • Reconciler is synchronising the cluster inventory and identifying data gaps (the data-leading system is KEB); see the diff sketch after this list:
    • Data of existing cluster entries have to be equal to the KEB-provided data
    • Missing cluster entries have to be identified and reported in logs + communicated to SRE via the monitoring system (has to be clarified/aligned with SRE)
    • Data which exist only in the reconciler database have to be marked as deleted
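
A minimal diff sketch for the synchronisation (the Cluster type and function name are hypothetical stand-ins for the inventory model):

package inventory

// Cluster is a hypothetical, minimal stand-in for the inventory entity.
type Cluster struct {
    ID string
}

// diffInventory compares the KEB inventory (source of truth) with the local
// copy: clusters missing locally are reported, clusters existing only locally
// are marked as deleted.
func diffInventory(keb, local map[string]Cluster) (missing, deleted []string) {
    for id := range keb {
        if _, ok := local[id]; !ok {
            missing = append(missing, id) // exists in KEB, absent locally: report
        }
    }
    for id := range local {
        if _, ok := keb[id]; !ok {
            deleted = append(deleted, id) // exists only locally: mark as deleted
        }
    }
    return missing, deleted
}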

Reasons

Identify and avoid gaps in cluster data between reconciler and KEB.

Attachments

Implement cluster-inventory

Description

The cluster inventory is managing all clusters handled by the reconciliation service. The inventory has an interface to KEB to receive information about new clusters or clusters which require an upgrade.

The inventory will store all cluster data in a database and support querying for cluster entities. Each entity has, besides its name and some metadata (like its configuration, component list, etc.), also a particular state. The cluster state will be updated several times during its lifetime (each reconciliation causes several cluster data updates).

The inventory will support efficient access to clusters which require any kind of reconciliation (e.g. used by the scheduler) and allows efficient queries for clusters which are in a failed/error or transition state (e.g. required by the metrics exporter).

ACs:

  • REST interface to KEB implemented
  • Support for CRUD operations on database cluster-entities to change cluster metadata (done only by KEB) or cluster states (done only by the scheduler)
  • Query possibilities to retrieve clusters in particular states (query performance has to be verified and, where needed, optimised, as it will be used quite frequently).
  • The inventory is capable of retrieving kubeconfig files/strings for particular clusters from either Gardener or the underlying KCP cluster.

Reasons

Centralised managing of cluster data.

Attachments

Create OpenAPI specification (Swagger) for REST API of component reconcilers

Description

The component reconciler REST API is not described by an OpenAPI specification (Swagger spec). This has to change: the exposed REST API and related models should be generated based on the specification.

The pattern of supporting different API versions in parallel should be kept: the URL should still include an indicator of the API version in use (e.g. https://host/v1/...).

Please consider #109 before implementing this feature, as it also impacts the REST API implementation and both tickets have to be aligned.

AC:

  • Describe the REST API using an OpenAPI specification (Swagger spec)
  • Introduce code generators into the build process to generate models (and, if meaningful, also boilerplate and middleware code) for the exposed REST API

Reasons

Establishing an OpenAPI specification makes the API easier for clients to consume, adds code-generation capabilities for REST API models and middleware code, and can simplify discussions about API changes.

Attachments

Delete namespaces which were managed by reconciler only if they do not include any further resources

Description

When deleting resources based on a manifest, namespaces defined in the manifest should be excluded by default and processed in a second step:

After the deletion of all manifest resources is finished, the namespace deletion can happen but only if no further resources from Kyma exist in the namespace. If a namespace is not empty, the deletion is not allowed.

AC:

  • Resources created by the reconciler are removed (deletion should happen based on the resources defined in the component manifests)
  • Remaining resources have to be verified as to whether they were created by Kyma or by a customer. Only resources created by Kyma should be deleted.
  • A namespace can only be removed if it is completely empty (not including any customer resources).

Reasons

Don't delete namespaces which include resources.

Attachments

Implement metrics exporter

Description

The metrics exporter is responsible for exposing the list of clusters which are currently in a failure, error, or transition state to external monitoring systems (e.g. SRE's Prometheus).

These data will be evaluated by the SRE monitoring system and used to reduce false-positive alerts during reconciliation runs. SRE will disable the alerting for a particular time range when a cluster enters a transition state.

The output of the metrics exporter is Prometheus metrics, containing all clusters which are in a transition or error/failure state, including the date when the transition state started.

ACs:

  • Prometheus metrics have to be exposed over an HTTP socket
  • Exposed data have to include all clusters which are in a transition/error/failure state, including the timestamp when this state started

Reasons

Reduce false-positive alerts on SRE side for clusters which are currently reconciled.

Attachments

Implement reconciliation strategies (related to strategy manager)

Description

The reconciliation executes different logic depending on the cluster state (e.g. install, upgrade, or reconcile the Kyma cluster). Each logic block has to be implemented as a strategy which can be passed between Go entities. The strategy has to be added to the strategy manager.

ACs:

  • The different reconciliation logics are encapsulated in dedicated objects and known by the strategy manager.
  • Internally, the strategies use the installer library (for install, upgrade, or reconciliation).
  • Strategies have to send progress updates to the status tracker: it has to be aligned with the stakeholders how detailed this progress/status tracking has to happen (which resources we have to verify during the reconciliation, etc.).

Reasons

Encapsulation of different reconciliation logic.

Attachments

Implement strategy manager

Description

Depending on the state of a cluster, different reconciliation strategies have to be used:

  • New Cluster: run the installer library
  • Cluster exists and has Kyma installed: run reconciliation
  • Cluster exists and requires an upgrade: run installer library

The strategy manager is responsible for defining which logic a reconciliation worker has to execute. It evaluates the cluster state and returns a closure object which wraps the logic the worker has to run (see the sketch below).
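
A minimal sketch of this dispatch (all type and state names are hypothetical): the manager evaluates the cluster state and hands the worker a closure wrapping either the installer library or the regular reconciliation:

package strategy

import "context"

// Cluster is a hypothetical stand-in for the inventory entity.
type Cluster struct {
    Status string
}

// ReconcileFn is the closure handed to a reconciliation worker.
type ReconcileFn func(ctx context.Context) error

// Installer is a hypothetical wrapper around the installer library (Hydroform).
type Installer interface {
    Install(ctx context.Context, c Cluster) error
    Reconcile(ctx context.Context, c Cluster) error
}

type Manager struct {
    installer Installer
}

// StrategyFor evaluates the cluster state and wraps the matching logic in a closure.
func (m *Manager) StrategyFor(c Cluster) ReconcileFn {
    switch c.Status {
    case "new", "upgrade_pending": // new cluster or upgrade: run the installer library
        return func(ctx context.Context) error { return m.installer.Install(ctx, c) }
    default: // Kyma already installed: run a regular reconciliation
        return func(ctx context.Context) error { return m.installer.Reconcile(ctx, c) }
    }
}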

ACs:

  • It receives the cluster information from the reconciliation worker and decides which reconciliation strategy has to be applied by evaluating the cluster state.
  • It either wraps the call of the installer library (Hydroform) in case of install or upgrade actions, or requests the component templates from the chart provider for any other actions, and returns the wrapped logic as a closure object to the reconciliation worker.

Reasons

Encapsulation of different reconciliation logic blocks.

Attachments

Adjust reconciler API for CLI compatibility

Description

The Kyma CLI requires several smaller changes in the reconciler API to get their requirements implemented:

  1. Workspace factory has to support local Kyma sources (no cloning of sources from Github)
  2. Heartbeat messages failed and error should include the error as a message
  3. The model used by component reconcilers should provide an error message in case of reporting a failure or error.

AC:

  • Workspace factory supports the configuration of a local workspace directory (cloning of Kyma sources from Github will no longer happen).
  • The communicated heartbeat message has to include an error field which is filled in case of an error or failed status of the component reconciler.
  • Component reconciler model includes an error message in case of a failure or error status

Reasons

Proper integration of the reconciler API into the Kyma CLI.

Attachments

Introduce simple dependency management for component reconcilers into mothership reconcilers

Description

Component reconcilers can define dependencies on other components. Such dependencies can vary between Kyma versions.

Currently the mothership reconciler (MSR) is notified about missing dependencies by the component reconcilers (CR). Such CRs will be triggered again by the MSR as soon as other components were successfully reconciled.

The MSR and CRs have to be enriched to handle dependencies between CRs more efficiently.

AC:

  1. MSR has to detect cyclic dependencies between CRs: if CR1 depends on CR2 and CR2 at the same time depends on CR1, the reconciliation has to be cancelled and an error has to be logged (see the cycle-check sketch after this list).
  2. MSR has to detect non-fulfillable dependencies: if a CR depends on another component which is not part of the component-list of a cluster, the reconciliation has to be cancelled and an error has to be logged.
  3. A CR which reported a missing dependency should only be triggered again after the missing dependency became available. If the dependency never becomes available (because the required component was reconciled and ended in an error status), the depending CR won't be called for this reconciliation run.
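
A sketch of the cycle check from AC 1, as a standard depth-first search with three-color marking over the dependency graph (the graph representation is an assumption):

package deps

// hasCycle runs a DFS over the component dependency graph
// (component name -> names it depends on) and reports back edges.
func hasCycle(deps map[string][]string) bool {
    const (
        white = iota // not visited yet
        gray         // on the current DFS path
        black        // fully processed
    )
    state := map[string]int{}
    var visit func(n string) bool
    visit = func(n string) bool {
        state[n] = gray
        for _, d := range deps[n] {
            if state[d] == gray {
                return true // back edge: cycle found
            }
            if state[d] == white && visit(d) {
                return true
            }
        }
        state[n] = black
        return false
    }
    for n := range deps {
        if state[n] == white && visit(n) {
            return true
        }
    }
    return false
}

For the example from AC 1, hasCycle(map[string][]string{"cr1": {"cr2"}, "cr2": {"cr1"}}) returns true.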

Reasons

Establish simple dependency management in the mothership reconciler to detect failure cases and reduce the amount of failing requests to component reconcilers.

Attachments

Setup E2E CI pipeline for reconciler

Description

The reconciler has to be tested continuously by executing end-to-end tests.

The test should include:

  1. Create a new cluster in Gardener
  2. Simulate the KEB REST-call to the reconciler and register the cluster in the reconciler's inventory
  3. Wait until reconciler installed Kyma on the new cluster: in between, check the /metrics-URL of the reconciler, the amount of clusters in transition has to increase / decrease (after cluster was reconciled)
  4. Modify Kyma resources on the cluster and wait for reconciliation of them
  5. Simulate the KEB REST-call for the reconciler and upgrade the cluster in the reconciler's inventory
  6. Wait until the reconciler upgraded Kyma on the cluster
  7. Simulate the KEB REST-call to the reconciler and delete the cluster in the inventory
  8. Modify Kyma resources on the cluster and ensure reconciler won't revert the change
  9. Erase the created cluster in Gardener

Reasons

Continuous testing as a best-practice development approach.

Attachments

Remove dependency to Hydroform API and migrate required code into reconciler

Description

Currently, the Hydroform API parallel-install is used for rendering Helm templates. As only the Helm templating functionality is required by the reconciler, the overhead of calling the parallel-install API is no longer justified.

The goal is to cherry-pick the Helm templating code and migrate it into the reconciler. Afterwards, the dependency on Hydroform has to be removed.

AC:

  • Rendering Helm templates is possible for the reconciler without calling the parallel-install API of Hydroform
  • The dependency on the Hydroform parallel-install API is removed from the reconciler go.mod file

Reasons

Reduce execution overhead and avoid indirect dependencies coming with the Hydroform parallel-install API.

Attachments

Setup operational concept with SRE

Description

The reconciler requires an operational concept which is aligned with SREs. The concept has to cover:

  • How the mothership- and component-reconciler(s) will be integrated into existing deployment pipelines (e.g. adding them to pipelines which are also used to deploy KEB, the provisioner, etc.)
  • What the operational requirements for mothership- and component-reconcilers are (expected deliverables like a troubleshooting guide, further documentation, integration into SRE's logging/monitoring system)
  • What the requirements/expectations from SRE side are for analysing reconciler incidents (e.g. access to reconciler logs, mandatory CLI features, etc.): see #115

AC:

  • A written concept (could also be a checklist or similar stored inside of this Github issue) which covers the agreed action items/decisions and was reviewed by SREs
  • If tickets are created for some of these items, the tickets have to be referenced in the concept

Reasons

Ensure the reconciler is addressing operational constraints from SREs properly.

Attachments

Cutover plan for Reconciler

Description

For the rollout of the reconciler (which is ultimately a prerequisite before Kyma 2 can go live), a cutover plan is required.

The plan has to cover:

  • Checklist of tasks to be fulfilled before the rollout can happen. Example:
    • update CI pipelines
    • alignment of SRE's troubleshooting guidelines to consider reconciler before touching a cluster
    • SRE drills to simulate outages and how to interact with the reconciler
  • Action plan with step-by-step guide to rollout the reconciler (e.g. rollout together with KEB, provisioner, update KEB configuration, updates in on-call structures etc.)

AC:

  • Cutover plan created (in written form) which includes the required steps, their execution order (considering dependencies to other tasks, e.g. by using a Gantt chart or similar), and defines the owner of each task (recommended is an easily shareable and editable document like a SharePoint Excel sheet, a GitHub document, or similar)
  • Plan is consolidated and reviewed by all involved parties (SRE, KEB team, Provisioner team, Reconciler team, release-manager, etc.)
  • If a ticket is needed for some cutover steps, the ticket has to be referenced in the cutover plan

Reasons

Provide transparency to all teams and make the rollout properly manageable and traceable.

Attachments

Checkmarx has to be added as additional quality gate to the CI pipeline

Description

Checkmarx is currently not part of the regular CI pipeline checks. This step has to be added to the CI pipeline of the reconciler.

AC:

  • Checkmarx scanner is automatically executed at least once a day for the main branch by a CI pipeline

Reasons

A finding of the security threat-modelling workshop; required to fulfil security requirements.

Attachments

Delete function of KubeClient should block until resources are fully deleted

Description

The kubeClient implementation should also support a blocking deletion call for Kubernetes entities. The Delete function should wait until all deleted resources were fully removed by Kubernetes. See this pull-based example of how to wait until a resource reaches a particular state.

The Delete function of the Client interface should be changed to define whether the client will block or directly continue when the resources of a manifest are deleted.

type Client interface {
    Delete(manifest string, blocking bool) error
    ...
}
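
A minimal sketch of the pull-based wait, assuming client-go's dynamic client and the apimachinery wait helpers (the function name and polling intervals are illustrative):

package kubeclient

import (
    "context"
    "time"

    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/util/wait"
    "k8s.io/client-go/dynamic"
)

// waitUntilDeleted polls until the resource is gone or the timeout is reached.
func waitUntilDeleted(client dynamic.Interface, gvr schema.GroupVersionResource, namespace, name string) error {
    return wait.PollImmediate(2*time.Second, 2*time.Minute, func() (bool, error) {
        _, err := client.Resource(gvr).Namespace(namespace).Get(context.Background(), name, metav1.GetOptions{})
        if apierrors.IsNotFound(err) {
            return true, nil // resource is fully removed
        }
        return false, err // still terminating (or a real error, which aborts the poll)
    })
}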

AC:

  • Deletion of a deployed manifest can be configured to block until all manifest resources were deleted
  • Verify the functionality with a unit test which ensures that no deployed resources exist after the deletion call returns
  • Deletion of deployed manifest is still possible in a non-blocking way

Reasons

Currently the call of the delete method returns immediately which can cause side-effects when the same resource is re-created while the old resource still terminates.

Attachments

Replace kubectl-client with Golang Kubernetes-client API

Description

The goal is to use the Golang Kubernetes client API for any interaction with Kubernetes clusters (see the sketch after this list). This has several advantages:

  1. Avoid version conflicts caused by differences between the internally used Kubernetes client API and kubectl.
  2. For security reasons, calling dynamic exec commands is too risky; we mitigate that issue by dropping exec calls completely when switching to a Golang implementation.
  3. Consistent usage of Golang APIs for any Kubernetes interactions
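
A minimal sketch of the replacement, assuming client-go (the function names are illustrative): the client is built from the kubeconfig the reconciler already holds, and typed API calls replace kubectl invocations:

package kubeclient

import (
    "context"
    "fmt"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/clientcmd"
)

// newClientset builds a typed client directly from a kubeconfig string,
// replacing any shell-out to the kubectl binary.
func newClientset(kubeconfig []byte) (kubernetes.Interface, error) {
    config, err := clientcmd.RESTConfigFromKubeConfig(kubeconfig)
    if err != nil {
        return nil, err
    }
    return kubernetes.NewForConfig(config)
}

// listPods is the equivalent of `kubectl get pods -n kyma-system`.
func listPods(ctx context.Context, clientset kubernetes.Interface) error {
    pods, err := clientset.CoreV1().Pods("kyma-system").List(ctx, metav1.ListOptions{})
    if err != nil {
        return err
    }
    for _, pod := range pods.Items {
        fmt.Println(pod.Name)
    }
    return nil
}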

ACs:

Reasons

Consolidation of current Kubernetes interactions and security risk mitigation.

Attachments

Administrate reconciler remotely via CLI using REST API

Description

The reconciler has to support remote administration via the CLI. The communication between the CLI and the mothership-reconciler has to be handled via a REST API.

The REST API has to be specified using OpenAPI specification (e.g. Swagger, see #116), support a secure and trusted communication (HTTPS) and be integrated into the SAP SSO solution (ORY?). Any user-action triggered by a client has to be recorded in an audit log.

AC:

  • CLI communicates remotely via REST API with the mothership reconciler
  • API is described in an OpenAPI specification (Swagger spec) - see #116
  • SSO integration is available to allow users to login using their SAP account
  • Any action triggered via the REST API is recorded in an audit log
  • Following features have to be supported by the REST API and can be used via the CLI (acting as client of the REST API):
  1. Show reconciliation runs of a cluster
  2. Show details of a reconciliation run (start time, end-time, reported progress of the component-reconcilers)
  3. Show details of the reconciliation output created for a particular component (requires integration with SRE's logging system?)
  4. Disable the reconciliation of a cluster (either for a particular time range or indefinitely) / Enable the reconciliation of a cluster if it was disabled - #188

Reasons

Establish a standardised tooling to control and administrate the mothership reconciler which fulfils security requirements and is integrated with the SAP SSO system.

Attachments

Need eventing readiness status after reconciliation

Description

In order to mark the "Eventing" component as ready, there needs to be a mechanism to give this information back to the client of the mothership who triggered the reconciliation for eventing.

One way to do that:

  • Mothership reconciler queries for EventingBackend CR and checks for the overall readiness status. This field is reported back to the client of the statusURL.

Reasons

Attachments

Kubeconfig has to be passed from KEB-contract model to reconciler-DB entity

Description

The kubeconfig is currently not passed from KEB to the reconciler and is missing in the contract.

The contract has to be adjusted, and the kubeconfig value has to be considered when creating new cluster entities in the reconciler.

See inventory.createCluster function for further details.

Expected result

Kubeconfig is another attribute of a ClusterEntity and stored in the DB.

Actual result

Kubeconfig is missing in models and in DB.

Steps to reproduce

Troubleshooting

Implement reconciliation controller logic

Description

The reconciliation controller has to be highly scalable, and each reconciliation controller should run in a dedicated goroutine. Creating and dispatching/reusing goroutines has to be handled in an efficient way. The pool size of worker routines has to be configurable (see the worker-pool sketch below).
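
A minimal worker-pool sketch under these requirements (all names are illustrative): a fixed, configurable number of long-lived goroutines consume tasks from a shared channel instead of spawning one goroutine per reconciliation:

package worker

import "context"

// Task is a hypothetical unit of reconciliation work.
type Task func(ctx context.Context)

// Pool dispatches tasks to a configurable number of long-lived goroutines.
type Pool struct {
    tasks chan Task
}

// NewPool starts `size` workers which run until the context is cancelled.
func NewPool(ctx context.Context, size int) *Pool {
    p := &Pool{tasks: make(chan Task)}
    for i := 0; i < size; i++ {
        go func() {
            for {
                select {
                case <-ctx.Done():
                    return
                case task := <-p.tasks:
                    task(ctx)
                }
            }
        }()
    }
    return p
}

// Submit blocks until a worker is free or the pool is shut down.
func (p *Pool) Submit(ctx context.Context, t Task) {
    select {
    case <-ctx.Done():
    case p.tasks <- t:
    }
}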

ACs:

  • Receive a set of component manifests and call the component reconcilers (passing the list of all components to process + their state + the manifest of the component to reconcile + a callback URL)
  • Receive updates from the scheduler about the component-reconciler status
  • In case a component-reconciler has missing dependencies, retry the call to the component-reconciler after X seconds

Reasons

Enable scaling of reconciliation workers.

Attachments

Setup CI pipeline for reconciliation

Description

Code changes committed to the reconciler repository have to be picked up by a CI system (Prow) and trigger a build and a unit-test/e2e test run.

Reasons

Establish CI driven development approach.

Attachments

Encrypt sensitive data in database

Description

The database layer has to support encryption of sensitive data columns in the database. Encryption has to happen by using a secure algorithm like AES256 (e.g. https://www.melvinvivas.com/how-to-encrypt-and-decrypt-data-using-aes/).

Key rotation has to be considered and code should be provided which allows the rotation of a key in a reliable and idempotent way (e.g. by storing the AES key-hash as prefix to the binary entry or similar).
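
A minimal sketch of such an encryption helper, assuming Go's standard crypto packages and AES-256-GCM (the key-hash prefix scheme follows the suggestion above; all names are illustrative):

package db

import (
    "crypto/aes"
    "crypto/cipher"
    "crypto/rand"
    "crypto/sha256"
    "io"
)

// encrypt seals a value with AES-256-GCM and prefixes the ciphertext with the
// first 8 bytes of the key hash, so a rotation job can tell which key was used
// and skip entries already encrypted with the current key (idempotency).
func encrypt(key, plaintext []byte) ([]byte, error) {
    block, err := aes.NewCipher(key) // a 32-byte key selects AES-256
    if err != nil {
        return nil, err
    }
    gcm, err := cipher.NewGCM(block)
    if err != nil {
        return nil, err
    }
    nonce := make([]byte, gcm.NonceSize())
    if _, err := io.ReadFull(rand.Reader, nonce); err != nil {
        return nil, err
    }
    keyID := sha256.Sum256(key)
    out := append(keyID[:8], nonce...) // layout: key-hash prefix | nonce | ciphertext
    return gcm.Seal(out, nonce, plaintext, nil), nil
}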

ACs:

  • Columns can be configured to be encrypted (e.g. using tags in the Golang model)
  • Encryption has to use a secure encryption algorithm (e.g. AES-256 or comparable)
  • Key rotation of encrypted values has to be supported and has to be implemented in an idempotent way (rotating multiple times with the same key should not lead to different results)

Reasons

Sensitive data have to be encrypted to increase security.

Attachments

Scaffolding script for component reconcilers creates invalid pkg names

Description

The scaffolding script pkg/reconciler/instances/reconcilerctl.sh creates package names with hyphens if the reconciler name also includes a hyphen. Such package names are not allowed in Golang, and the script has to remove the hyphens.

Expected result

Valid package names are generated even if the component reconciler name includes a hyphen.

Actual result

Package names with hyphens are generated, which leads to invalid Go code.

Steps to reproduce

  • Call reconcilerctl.sh and use a reconciler name which includes a hyphen

Troubleshooting

Each reconciliation run adds another correlation-ID entry to log messages as a JSON string

Description

With each reconciliation run, another correlation ID is added to the log messages. The correlation ID should be set just once.

Also, the correlation ID is added as JSON - but the rest of the log-message is plain text.

Log messages should either be in JSON or in plain-text.
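
A plausible cause, sketched with the zap logging library: re-assigning a shared logger via With on every reconciliation run accumulates one correlation-id field per run, because each child logger keeps the fields of its parent. Whether this is the actual code path in the reconciler is an assumption:

package main

import "go.uber.org/zap"

func main() {
    logger, _ := zap.NewDevelopment()
    for _, id := range []string{"id-1", "id-2", "id-3"} {
        // Re-assigning the shared logger accumulates one "correlation-id"
        // field per run.
        logger = logger.With(zap.String("correlation-id", id))
    }
    logger.Debug("status communicated") // logs three correlation-id entries

    // Fix: derive a fresh child from the base logger per reconciliation run.
    base, _ := zap.NewDevelopment()
    runLogger := base.With(zap.String("correlation-id", "id-4"))
    runLogger.Debug("status communicated") // logs exactly one correlation-id
}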

Expected result

Just one correlation ID is added to log messages. The log message is either JSON or plain text.

Actual result

   2021-08-09T18:17:20.735+0200	DEBUG	status/status.go:102	Status 'success' successfully communicated: stopping update loop	{"correlation-id": "6a18a54c-98da-4998-b934-e50577279097", "correlation-id": "53ecbf04-a427-4e08-bdd7-a828d8aaa148", "correlation-id": "91db97a3-d318-4c6a-bd3d-406ec82fe989", "correlation-id": "f3d0a53d-574f-4150-af87-919a5c1b1208"}

Steps to reproduce

Start the mothership-reconciler and the "helm" component-reconciler. Each triggering of the component-reconciler by the ms-reconciler adds another correlation ID to the log message.

Troubleshooting

Implement end2end test-suite for reconciler (includes KEB, mothership- and component-reconcilers)

Description

To establish a reliable delivery process, a full end-to-end test is required for the interaction between:

  1. KEB to the mothership-reconciler
  2. mothership-reconciler to component-reconcilers

The used mothership- and component-reconcilers run as dedicated services and communicate via REST with each other. The deployment of the mothership- and component-reconcilers has to be based on a Kubernetes deployment (comparable to the deployment used for the productive setup in KCP).

AC:

  • End-to-end test can be executed in a dedicated pipeline in PROW
  • A dedicated Kubernetes cluster has to be created which is used by the component-reconcilers to reconcile Kyma (e.g. a newly created Gardener cluster or a K3d/K3s cluster).
  • Kyma reconciliation is triggered by simulated KEB requests
  • After the reconciler is finished, the Kyma installation is verified (tbc)

Reasons

Ensure delivery of fully working reconciliation services, covering also edge-case scenarios.

Attachments

Implement component reconciler framework

Description

The reconciler team will offer a framework for external teams to easily implement a component specific reconciler.

ACs:

  • Basic build script to wrap the framework into a container (Dockerfile)
  • Implement an API which can be easily consumed by other teams and allows running a component-reconciler service. Teams have the possibility to define custom pre- and post-installation logic which will be executed by the component-reconciler.
  • Finalize draft implementation:
    • Status updater (sending updates in intervals and retry if scheduler is not available until timeout is reached)
    • Apply-Manifest logic
    • Measure progress of applied manifests (tbc how we will do that)
    • Define the final model (used to marshal/unmarshal web calls)
    • Run reconciliations in dedicated go-routines (reconciling has to be a non-blocking action)

Reasons

Base framework for the component reconciler.

Attachments

Add CLI support to run a component reconciler without calling a REST API

Description

Currently, the CLI always starts a component reconciler as a standalone microservice which has to be called via its REST API. This makes testing unnecessarily complicated for reconciler teams if they only want to trigger their component reconciler.

To address this disadvantage, the CLI command ./bin/reconciler reconciler test ... has to support the option to start a particular component reconciler embedded (without a surrounding webserver) and to call it directly.

AC:

  • CLI supports starting a component reconciler without a REST API (embedded) and runs it directly
  • CLI is extended (maybe by a flag) to indicate whether the component reconciler should be started with a REST API or embedded

Reasons

Simplify development and testing of component reconcilers.

Attachments

Switch to event based cluster state updates

Description

Currently, the update of the cluster state happens in a predefined interval (e.g. every 10 seconds). This has the benefit of a linearly scaling database load depending on the amount of parallel running reconciliations, but also the disadvantage that, in case of an outage, the mothership-reconciler won't update the cluster status within the given interval window.

It's possible to reduce the load on the database by establishing an event-based cluster-status update approach: the operation-registry informs the workers in the mothership-reconciler when it's worth updating the cluster state. The operation-registry can make intelligent decisions (e.g. by changing the cluster state only if there is a high likelihood that the state won't change again soon) and reduce the amount of applied status changes.

Cluster states will be updated by these rules (a decision sketch follows the list):

  1. ERROR cluster-state is set immediately when >= 1 component reconciler reports an ERROR status
  2. READY cluster-state is set immediately when all component reconcilers report SUCCESS status
  3. Switches from RECONCILING to RECONCILE_FAILED cluster-state are only triggered if a component-reconciler has >= 2 times reported a FAILED status
  4. Switches from RECONCILE_FAILED to RECONCILING cluster-state are triggered if all failing component-reconcilers reported a SUCCESS status and other component-reconcilers are still running
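
A decision-function sketch implementing the four rules (state names, statuses, and the failure counter are assumptions for illustration):

package registry

// Status values reported by component reconcilers.
type Status string

const (
    StatusError       Status = "error"
    StatusFailed      Status = "failed"
    StatusSuccess     Status = "success"
    StatusReconciling Status = "reconciling"
)

// clusterState applies the update rules above to the latest component results.
// failureCounts tracks how often each component reported FAILED in this run.
func clusterState(current string, results map[string]Status, failureCounts map[string]int) string {
    allSuccess, anyFailing := true, false
    for comp, s := range results {
        if s == StatusError {
            return "error" // rule 1: one ERROR flips the cluster immediately
        }
        if s == StatusFailed {
            anyFailing = true
            if failureCounts[comp] >= 2 && current == "reconciling" {
                current = "reconcile_failed" // rule 3: only after repeated failures
            }
        }
        if s != StatusSuccess {
            allSuccess = false
        }
    }
    if allSuccess {
        return "ready" // rule 2
    }
    if current == "reconcile_failed" && !anyFailing {
        return "reconciling" // rule 4: failing components recovered, others still running
    }
    return current
}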

AC:

  • Cluster status changes are triggered by the operation-registry (e.g. using a shared channel between worker and operation-registry)
  • The operation registry makes a decision based on the results provided by the component-reconcilers, following the rules above

Reasons

Switch to event-based cluster-state updates by letting the operation-registry decide when it's time for a cluster-status change.

Attachments

Progress tracker of deployed resources returns success for terminating pods

Description

The progress tracker handles pods which are in "Terminating" state as ready.

Expected result

Only pods in Running-state should be treated as ready.

Actual result

Pods in "Terminating" state are also treated as ready.

Steps to reproduce

Create a pod, delete it and run progress tracker during termination phase.
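
A sketch of the expected readiness check, assuming client-go's core/v1 types (the function name is illustrative):

package progress

import corev1 "k8s.io/api/core/v1"

// podReady treats terminating pods as not ready.
func podReady(pod *corev1.Pod) bool {
    if pod.DeletionTimestamp != nil {
        return false // pod is terminating, even if its phase is still "Running"
    }
    if pod.Status.Phase != corev1.PodRunning {
        return false
    }
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodReady {
            return cond.Status == corev1.ConditionTrue
        }
    }
    return false
}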

Troubleshooting

Implement security setup for Kyma reconciler

Description

The reconciler requires a security concept to ensure authentication, authorisation, and auditing/accounting are established.

The goal is to fulfil the SAP security requirements, which cover:

  1. Encrypted communication is used for any HTTP endpoints (even between component-reconcilers and mothership-reconciler) [ISTIO]
  2. Authentication is mandatory for calls from the outside to the mothership-HTTP endpoint [ISTIO]
  3. Authorisation is mandatory for calls from the outside to the mothership-HTTP endpoint [ISTIO]
  4. Service-accounts are established for communication between mothership-reconciler and component-reconcilers [ISTIO]
  5. Auditing/Accounting happens for any action the mothership reconciler receives. [User-ID is provided as HTTP header] - See #291

ACs:

  • HTTPS communication is mandatory
  • Requests have to be authenticated and authorised, e.g. based on OAuth2 (e.g. bearer token)
  • Auditing of critical operations is mandatory (auditlogs per user-call etc.) on MS-reconciler side.

Reasons

Fulfil SAP security requirements.

Attachments

Component reconciler integration test

Description

An integration test suite is required to verify the correct behaviour of the component reconcilers (considering edge cases).

AC:

Scope of the integration test covers:

  • Reconciliation of a cluster with a manipulated Kyma installation (e.g. a Kyma component was partially deleted)
  • Reconciliation of a cluster with a manipulated Kyma installation which cannot be recovered (e.g. a component was changed in a way that a reconciliation won't fix it)
  • Reconciliation of a cluster which is not reachable (K8s down / access to API blocked)
  • Reconciliation of a cluster with insufficient permissions (e.g. K8s-user has no permissions on kyma-system namespace)
  • Reconciliation of a cluster with a defective Kyma component (e.g. syntax error in HELM template)

Reasons

Verify correct behaviour of component reconcilers in expected edge cases.

Attachments

Implement scheduler

Description

The scheduler is responsible for reacting to clusters which require a transition (e.g. pending for an upgrade, installation, or a regular reconciliation). The scheduler queries the cluster inventory to retrieve such clusters.

It requests the reconciliation logic (strategy) for each cluster from the strategy factory and passes both (the cluster data + the strategy) to the worker pool.

ACs:

  • Query clusters which require a reconciliation from the cluster inventory
  • Gather the reconciliation strategy from the strategy factory and assign a reconciliation worker to it which runs the logic
  • A failing/restarted scheduler has to be able to recover its state (means: re-trigger the reconciliation for clusters which were in a non-finished reconciliation state by querying the history/archive for non-finished transitions)
  • Track the applied manifests in a history/archive (for making cluster changes easier to trace)

Reasons

Identify clusters which require a reconciliation, pass them to the worker pool and track all changes applied to a cluster and its results.

Attachments

Component namespace provided by KEB has precedence over component-internal namespace and has to be created if missing

Description

Currently, the namespace is passed to the Helm chart rendering, but it's not guaranteed that the provided namespace will be properly used in all resources declared by the chart.

To enforce the usage of the correct namespaces, the deployment logic has to set the namespace provided by KEB. It also has to ensure that this namespace is created before the component gets deployed.
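
A minimal sketch of both steps, assuming client-go and the unstructured helpers (function names are illustrative; cluster-scoped resources would need to be skipped when overriding the namespace):

package deploy

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    apierrors "k8s.io/apimachinery/pkg/api/errors"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
    "k8s.io/client-go/kubernetes"
)

// ensureNamespace creates the KEB-provided namespace if it doesn't exist yet.
func ensureNamespace(ctx context.Context, clientset kubernetes.Interface, ns string) error {
    _, err := clientset.CoreV1().Namespaces().Create(ctx,
        &corev1.Namespace{ObjectMeta: metav1.ObjectMeta{Name: ns}}, metav1.CreateOptions{})
    if apierrors.IsAlreadyExists(err) {
        return nil
    }
    return err
}

// overrideNamespace enforces the KEB namespace on the rendered resources,
// taking precedence over whatever the chart template declared.
func overrideNamespace(resources []*unstructured.Unstructured, ns string) {
    for _, res := range resources {
        res.SetNamespace(ns)
    }
}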

Expected result

The namespace provided by KEB always has to take precedence over the namespaces defined by a component. If the namespace doesn't exist, it has to be created before the component gets deployed.

A unit test has to be implemented which sets a namespace different from the one defined by the component.

Actual result

Namespace given by KEB is not enforced before deploying a Kyma component.

Steps to reproduce

Troubleshooting

Improve security of database layer in reconciler

Description

The following risks related to the data layer on the reconciler side have to be mitigated:

  • Encrypt the column containing the Kyma component configurations (it can contain SSL keys etc. which have to be protected)
  • Add a detection mechanism for potential SQL injection points (verify SQL queries for missing placeholders etc.) and report warnings in logs (see the heuristic sketch below).
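
A simple heuristic sketch for the placeholder check (the function name and the heuristic itself are assumptions, not the reconciler's actual detection logic):

package db

import (
    "regexp"

    "go.uber.org/zap"
)

var placeholderRx = regexp.MustCompile(`\$\d+`)

// warnOnMissingPlaceholders flags queries which carry fewer bind placeholders
// than arguments: a hint that values were concatenated into the SQL string.
func warnOnMissingPlaceholders(logger *zap.SugaredLogger, query string, args ...interface{}) {
    if len(placeholderRx.FindAllString(query, -1)) < len(args) {
        logger.Warnf("SQL query %q has fewer placeholders than arguments (%d): potential injection point",
            query, len(args))
    }
}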

AC:

  • Any generated SQL query has to be verified for missing placeholders to detect attack points for SQL injection. Findings have to be reported at WARN level in the log output.
  • The column containing Kyma component configurations has to be encrypted at database level

Reasons

Increase security on database layer.

Attachments

Scheduler sets cluster-state to 'reconciling' even when component-reconciler is down

Description

The scheduler is setting the cluster state to reconciling even when the component-reconciler is not reachable. This is basically acceptable, but it should retry connecting to the component-reconciler or set the cluster state to failed.

Expected result

Cluster state is either not changed or set to failed when the component-reconciler cannot be reached.

Actual result

The cluster state is set to reconciling, but the scheduler doesn't retry connecting to the component-reconciler. The inventory also does not return this cluster when querying for "clusters to reconcile", because it considers only clusters which are in error or failed state, or which are too old.

Steps to reproduce

Start ms-reconciler and register a new cluster without starting the required component-reconcilers.

Troubleshooting

Support multiple mothership reconcilers without the risk of race conditions

Description

We should be able to start multiple mothership reconcilers without the risk of potential race conditions. Currently, these parts are identified as potential conflicts:

  1. It has to be guaranteed that no cluster is reconciled multiple times at the same time => picking a cluster for reconciliation by a mothership-reconciler has to be atomic
  2. Failing component-reconciler runs (= outdated operations, because heartbeat updates didn't happen) are not allowed to be re-triggered more than once => picking a reconciler and restarting a component reconciliation has to be atomic

Potential solutions:

The detection of race conditions will be handled by using standard DB features (isolation level + primary keys):

  1. Multiple ms-reconcilers are allowed to query for clusters which have to be reconciled
  2. The following steps have to happen in the same DB transaction:
    1. Before a cluster will be marked as reconciling, the ms-reconciler has to create an operation for the cluster (handled in a dedicated DB table - the cluster-ID will be a unique value/PK). Only if no entry for this cluster exists will the DB create a new operation entry, and the ms-reconciler is allowed to continue. Otherwise the DB will complain about the violation of the unique-key constraint, and the ms-reconciler will know that another ms-reconciler has already picked this cluster.
    2. After creating the operation, all required reconciliation operations are created
    3. Finally, the cluster status can be set to reconciling
  3. Multiple ms-reconcilers are allowed to query the operation registry for operations which are in status new and were not yet assigned to a component reconciler OR which are in status orphaned (the ms-reconcilers will have a mechanism which updates running operations to status orphaned if they haven't received an update for longer than X minutes).
  4. An ms-reconciler will now try to update the state of an operation from new (respectively orphaned) to in_progress. Before triggering the component reconciler, it has to ensure that the update was successful (meaning the affected-rows count is == 1) by using an update query which considers the operation + previous status (e.g. UPDATE operation SET status='in_progress' WHERE operation_id=$1 AND status=$2) - a sketch follows the list.
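
A sketch of the atomic claim described in step 4, using database/sql (table and column names follow the example query above; they are assumptions):

package registry

import (
    "context"
    "database/sql"
)

// claimOperation atomically moves an operation from its expected previous
// status (e.g. "new" or "orphaned") to "in_progress". Only the instance that
// sees exactly one affected row may trigger the component reconciler.
func claimOperation(ctx context.Context, db *sql.DB, opID, prevStatus string) (bool, error) {
    res, err := db.ExecContext(ctx,
        `UPDATE operation SET status='in_progress' WHERE operation_id=$1 AND status=$2`,
        opID, prevStatus)
    if err != nil {
        return false, err
    }
    affected, err := res.RowsAffected()
    if err != nil {
        return false, err
    }
    return affected == 1, nil // false: another ms-reconciler already claimed it
}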

AC:

  • Multiple ms-instances work in parallel without causing duplicate cluster reconciliations, because starting the reconciliation (= creating an operation entry in the DB) is restricted by the DB to one entry per cluster. Creating multiple operations for the same cluster will cause a DB error (unique-key constraint violation).
  • MS-reconcilers identify orphaned operations (latest update longer than X minutes ago) and mark them as orphaned.
    • This feature should also be responsible for triggering a cluster status update (see #194)
  • Multiple ms-instances work in parallel without causing duplicate component reconciliations, because an operation will only be picked up by one ms-instance by checking that no parallel DB updates of the operation happened in between (check for affected rows).

Reasons

Make mothership-reconcilers scalable without the risk of race conditions.

Attachments

Integrate reconciler into Kyma CLI

Description

The reconciler is the new approach for installing Kyma on a cluster and will replace the parallel-install module of Hydroform. The Kyma CLI has to be adjusted to replace the used parallel-install API with the Kyma reconciler API.

AC:

  • Replace the Hydroform parallel-install module in the Kyma CLI and integrate the reconciler API instead
  • Ensure existing CI pipelines stay green after switching to the reconciler based installation approach

Reasons

Consolidated and consistent approach how Kyma gets installed on clusters.

Attachments

Reconciliation is not stopping in CLI when pressing CTRL+C

Description

Running the local reconciliation via CLI (bin/reconciler local) and pressing CTRL+C leads to an interrupt event (the execution context gets cancelled), but the process doesn't shut down properly.

Only pressing CTRL+C a second time triggers the hard shutdown which finally stops the execution.
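
A sketch of the expected graceful-shutdown wiring, assuming Go's signal.NotifyContext (the entry point is hypothetical): the first CTRL+C cancels the context, and all workers must return before the process exits:

package main

import (
    "context"
    "os"
    "os/signal"
)

func main() {
    ctx, stop := signal.NotifyContext(context.Background(), os.Interrupt)
    defer stop()

    // After the first CTRL+C, ctx is cancelled; workers should drain and
    // return so the process exits without needing a second, hard interrupt.
    runReconciliation(ctx)
}

// runReconciliation is a hypothetical entry point; real workers would
// select on ctx.Done() and shut down cleanly.
func runReconciliation(ctx context.Context) {
    <-ctx.Done()
}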

Expected result

Pressing CTRL+C the first time should trigger a graceful shutdown and the process should stop running (at least a clean shutdown should be visible).

Actual result

Pressing CTRL+C has no impact on the running process.

Steps to reproduce

Start the local reconciliation via CLI (bin/reconciler local) and press CTRL+C - the shutdown happens only after triggering a hard shutdown (i.e. after pressing CTRL+C a second time).

Troubleshooting

Setup security scanners for reconciliation code base

Description

Security scanners have to be enabled for any implemented code in Kyma. The reconciliation code base has to be scanned by SAP security scanners regularly.

Reasons

Required by company policy.

Attachments

Align naming of operation-/correlation-ID and review cluster-status updates

Description

This can be achieved by replacing the ticker channel with a channel which is used by the operation registry to send status updates to the worker. The mothership-reconciler should track the cluster status based on these conditions:

  • If all component-reconcilers are reporting a RECONCILING status: cluster state becomes RECONCILING
  • If one or more component-reconcilers are reporting an ERROR status: cluster state becomes ERROR
  • If all component-reconcilers are reporting a SUCCESS status: cluster state becomes READY

(Also documented in the Wiki)

Additionally, the naming of the shared processing ID between mothership-reconciler and component-reconciler should be aligned. In the mothership-reconciler, the variable is normally called operationID, but on the component-reconciler side we use the name correlationID. This should be aligned to one name to make the code base more consistent.

ACs:

  • Establish consistent naming for operationID/correlationID in mothership- and component-reconciler
  • Verify that the cluster status is updated according to the conditions listed above

Reasons

Reduce the risk of lost data which can be caused by time-based updates, and make the code base easier to read by using consistent naming.

Attachments

Move HTTP server creation for component-reconcilers into CLI

Description

The creation of the HTTP-server instance for component-reconcilers happens inside of the component reconciler instance (see here).

In the mothership, the HTTP server and routing are created as part of the CLI command. To be consistent, the same pattern should be used for the component-reconcilers: configuring and starting the HTTP server should happen as part of the related CLI command.

AC:

  • Refactor component-reconcilers and move the HTTP-server creation code + related unit-tests to the CLI

Reasons

Establish a standardised approach for creating the HTTP interfaces of the mothership- and component-reconcilers.

Attachments

Reconciliation tracing in comp-reconciler: introduce correlation-ID to log-messages and send logs to reconciler-ctrl in failure/error cases

Description

For increasing transparency and to make debugging much simpler, each log message which is related to a component-reconciliation process has to include a kind of correlation ID which has to be provided by the reconciler-controller. The correlation ID allows the mapping of reconciler-controller calls to component-reconciler processes.

The log messages should also be stored temporarily on the component-reconciler side (e.g. as a file) to be available to the reconciler-controller for debugging purposes.

Additionally, the component-reconciler has to send the latest log messages related to a particular reconciliation process to the reconciliation-controller as part of the status updates (heartbeat messages) as soon as the process fails or reaches the error state.

ACs:

  • Reconciler-controller has to send a correlation-ID to the component-reconciler.
  • Log messages which are generated as part of a reconciliation process inside of a component-reconciler have to include a correlation ID.
  • Logs have to be stored temporarily on disk per correlation-ID (= per reconciliation process). It has to be possible to retrieve these logs via a REST URL (for debugging purposes via CLI or reconciler-controller)
  • Component-reconciler has to send the logs tracked for a particular correlation-ID as part of the status messages (= our heartbeat messages sent to the reconciler-controller) as soon as a process falls into a failure or error state.
  • Reconciler-controller should track the logs as part of the process status

Reasons

Improve traceability of reconciliation runs and improve debugging possibilities.

Attachments

Implement chart-provider

Description

The chart provider is responsible for rendering Helm charts to YAML. The output of the chart compiler will be a list of rendered Kyma component objects. Each object includes the rendered Kubernetes YAML and offers functions to verify the installation status of the component (e.g. checking the state of the K8s deployments, pods, etc.).

ACs:

  1. Rendering of charts should follow the same flow as charts are processed by the installer library (to avoid differences in overwrites etc.):
    1. Clone the Kyma version (branch/tag) from the repository (e.g. for Kyma version 1.19, the corresponding repo)
    2. Read the component-list YAML from the data model
    3. Render the Kyma components by considering the provided cluster configuration
  2. Return a list of rendered component YAMLs (see the rendering sketch after this list)
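
A minimal rendering sketch following this flow, assuming the Helm v3 SDK (chart path, release name, and namespace are illustrative):

package chartprovider

import (
    "helm.sh/helm/v3/pkg/chart/loader"
    "helm.sh/helm/v3/pkg/chartutil"
    "helm.sh/helm/v3/pkg/engine"
)

// render produces the YAML manifests of one Kyma component chart. The chart
// directory and override values come from the workspace/data model.
func render(chartDir string, overrides map[string]interface{}) (map[string]string, error) {
    chrt, err := loader.Load(chartDir)
    if err != nil {
        return nil, err
    }
    opts := chartutil.ReleaseOptions{Name: "kyma", Namespace: "kyma-system"}
    vals, err := chartutil.ToRenderValues(chrt, overrides, opts, chartutil.DefaultCapabilities)
    if err != nil {
        return nil, err
    }
    // engine.Render returns a map of template file name -> rendered YAML.
    return engine.Render(chrt, vals)
}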

Reasons

Rendering of Helm components is a mandatory feature of the Kyma reconciler.

Attachments

Workspace factory has to detect incomplete Git repository clones

Description

Currently, the workspace factory checks only for directory names to decide whether a workspace exists. This is not sufficient if the clone of the Git repository was interrupted and couldn't finish.

The clone process should create a marker file after the clone was successfully completed. If the file exists, it can be assumed the workspace is ready to use; otherwise, the next goroutine should delete the workspace and clone it again.

See https://github.com/kyma-incubator/reconciler/blob/main/pkg/workspace/factory.go#L72 - check for the marker file instead of the directory itself.
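
A minimal sketch of the marker-file check (the marker name and function signature are assumptions):

package workspace

import (
    "os"
    "path/filepath"
)

const markerFile = ".clone-complete" // hypothetical marker name

// ensureWorkspace treats a directory without the marker file as an
// interrupted clone: it is deleted and cloned again.
func ensureWorkspace(dir string, clone func(dir string) error) error {
    marker := filepath.Join(dir, markerFile)
    if _, err := os.Stat(marker); err == nil {
        return nil // marker exists: workspace is complete and ready to use
    }
    if err := os.RemoveAll(dir); err != nil { // drop any partial clone
        return err
    }
    if err := clone(dir); err != nil {
        return err
    }
    // Write the marker only after the clone finished successfully.
    f, err := os.Create(marker)
    if err != nil {
        return err
    }
    return f.Close()
}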

AC:

  • Incomplete Git clones are detected and replaced with new clones
  • A warning log message has to be written if an incomplete workspace was detected
  • A unit test verifies that a missing marker file causes a re-creation of the workspace
  • Multithreading is tested in a unit test to ensure that no duplicate Git clones happen.

Expected result

Interrupted Git clones don't cause incomplete workspaces. Such workspaces are automatically renewed when detected.

Actual result

Steps to reproduce

Troubleshooting
