migtools / mig-controller
OpenShift Migration Controller
License: Apache License 2.0
When setting the openshift.io/migration-registry annotation, if there's more than one service we should return an error (or fail validation on the plan when calling ensureRegistry). Currently, if two services happen to be associated with the plan (possibly due to an improper prior resource cleanup), we just set an annotation matching the first one found, which may be the wrong one.
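A minimal sketch of the proposed check, assuming ensureRegistry has already listed the registry services for the plan (the helper name and fields are illustrative, not the controller's actual code):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// validateRegistryServices is a hypothetical helper: given the registry
// services found for a plan, fail unless exactly one exists instead of
// silently annotating the first match.
func validateRegistryServices(planName string, services []corev1.Service) (*corev1.Service, error) {
	switch len(services) {
	case 0:
		return nil, fmt.Errorf("no migration registry service found for plan %q", planName)
	case 1:
		return &services[0], nil
	default:
		// More than one service likely means an improper prior cleanup;
		// surface the problem rather than guessing.
		return nil, fmt.Errorf("found %d migration registry services for plan %q, expected 1",
			len(services), planName)
	}
}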
The current implementation of the (temporary) migration registry creates a registry pod which doesn't use https or require login credentials. We should update this to allow for securing the registry with certs and credentials. This will also require plugin support. The plugin will need to obtain certs and credentials created by the controller (probably via secrets created in the velero namespace with some supporting annotations on the velero CR) and use them to access the registry.
This issue is to ensure we track future concerns of running multiple actions in parallel. By multiple actions, I mean: if we have multiple MigrationPlans, in the future we will likely want to allow running Stage on each MigrationPlan at the same time, assuming they do not overlap (no shared namespaces). Within a single MigrationPlan, we will continue to queue actions and run them sequentially on a per-MigrationPlan basis.
So... assuming we have multiple MigrationPlans that do not overlap.
In the future, we would like to be able to have, say, a Stage running on each MigrationPlan at the same time.
If we did this with our mountPropagation workaround we'd have potential issues, as we restart the Restic daemonset on a per-action-run basis.
I don't believe this is a concern as of today; my understanding is that Velero limits a cluster to 1 running action cluster-wide, with future work to allow running multiple actions concurrently.
If Velero begins to support running multiple actions concurrently, we will want to revisit the mountPropagation workaround, or perhaps continue to enforce a sequential/serial run of actions globally.
There is a remote watch on pods. The migration creates/deletes (potentially) large numbers of stage pods. This will generate a ton of cluster, plan, and migration reconciles, which is not necessary or desirable.
Proposed solution:
In remote/watch.go, add a constant:
const (
	RemoteWatchAnnotation = "remotewatch" // bool
)
In the remote watch controller's Pod predicate, ignore pods with the annotation set to FALSE.
Create the stage pods with the annotation (a sketch of the predicate follows).
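A sketch of the predicate side, assuming controller-runtime predicate funcs (the annotation-value convention here is illustrative):

package remote

import (
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"
)

const (
	RemoteWatchAnnotation = "remotewatch" // bool
)

// watched returns false for pods annotated remotewatch=false, so stage pods
// created with the annotation don't trigger cluster/plan/migration reconciles.
func watched(pod *corev1.Pod) bool {
	return pod.Annotations[RemoteWatchAnnotation] != "false"
}

// Example create-event filter for the remote watch Pod predicate.
func createFilter(e event.CreateEvent) bool {
	pod, cast := e.Object.(*corev1.Pod)
	if !cast {
		return true
	}
	return watched(pod)
}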
Example:
Source namespace 'foo' may be migrated to Destination namespace 'bar'
It's good practice to gate PR acceptance on Travis runs (or any CI, for that matter). We spent some time on this in CPMA; the make ci target is particularly useful. It ensures the code is formatted according to expectations, the code builds, and tests are green, and because it's a make target, devs can (and should) run it before any submission (https://github.com/fusor/cpma/blob/master/Makefile#L25). This isn't based on any violations I've seen; it's just a practice that's served us nicely, especially when/if we start to receive community submissions.
Goals:
Add more granular progress reporting for PV migration.
As reported by @garethahealy, mig-controller was unable to migrate a Jenkins app deployed by the Service Catalog in OCP 3.11 onto an OpenShift 4.1 cluster.
This appears to be an issue with the way that we (or Velero) handle setup of service accounts / dockercfg pull secrets when running a Restore from a Backup, and the lack of a linked OpenShift 4.1 pull secret, which is necessary to pull the Jenkins (and likely other included) images on a 4.1 cluster.
Possibly relevant section from Velero source: https://github.com/fusor/velero/blob/fusor-dev/pkg/restore/merge_service_account.go
Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/jenkins@sha256:2600aaa3b380a0be3a2b8c67229209968625d4bfff9579326fd3bee811e951bc": rpc error: code = Unknown desc = Error reading manifest sha256:2600aaa3b380a0be3a2b8c67229209968625d4bfff9579326fd3bee811e951bc in image-registry.openshift-image-registry.svc:5000/openshift/jenkins: unauthorized: authentication required
oc process jenkins-ephemeral -n openshift | oc apply -n anotherproject -f -
We now have basic failure handling in place to properly set Migration state if something goes wrong during backup or restore. Longer-term, we need some sort of recovery workflow. After a failure, in some cases a simple retry is all a user might need (especially if it's on backup). In other cases, we may need to clean up from the failure (remove destination namespaces, undo quiesce actions on the source) before allowing the user to re-run another stage and then a final migration.
Need to add a critical category validation to prevent running a migration after the plan is Closed.
The controller can predict a good default for PV actions. The user would still review and make the final selection before running migrations; this just saves the user a lot of clicks/typing. The default for local storage would be copy, and for mounted storage it would be move.
This is the initial validation condition that we raise after we perform PV discovery, before the user has selected an action:
PV in 'persistentVolumes' [pv2] has an unsupported 'action'.
When the MigPlan looks something like this:
$ oc describe migplan migplan-sample -n mig
[...]
Kind:  MigPlan
Metadata:
  [...]
Persistent Volumes:
  Name:  pv2
  Supported Actions:
    copy
    move
[...]
Status:
  Conditions:
    Category:              Error
    Last Transition Time:  2019-05-29T20:35:44Z
    Message:               PV in 'persistentVolumes' [pv2] has an unsupported 'action'.
    Reason:                NotDone
    Status:                True
    Type:                  PvInvalidAction
    Category:              Required
    Last Transition Time:  2019-05-29T21:13:25Z
    Message:               The 'persistentVolumes' list has been updated with discovered PVs.
    Reason:                Done
    Status:                True
    Type:                  PvsDiscovered
[...]
Events:  <none>
This is confusing because the user has not yet selected an 'action' at all, so telling them that a PV has an unsupported action almost sounds like a bug with the controller (e.g. the generated supported action list is incorrect).
I'm thinking a better initial message (when 'action' is "" or nil) could be something more along the lines of:
Select an 'action' for PV [pv2] in 'persistentVolumes'.
And we can keep the current message if the user actually provides an unsupported action.
After the user provides an action the condition would go away.
Persistent Volumes:
  Name:    pv2
  Action:  move
  Supported Actions:
    copy
    move
I corrected the namespaces field on a MigPlan from a namespace that didn't exist (nginx-example) to one that existed and had an associated PV. The MigPlan controller performed PV discovery and kept the MigPlan in "Ready" state for a few seconds before the "Invalid PV action selected" message showed up.
This is an issue because it allows a MigMigration to start prematurely, before a PV action is selected.
MigPlan with INVALID namespace
apiVersion: v1
items:
- apiVersion: migration.openshift.io/v1alpha1
  kind: MigPlan
  metadata:
    creationTimestamp: 2019-06-10T21:07:42Z
    generation: 1
    labels:
      controller-tools.k8s.io: "1.0"
    name: migplan-sample
    namespace: mig
  spec:
    destMigClusterRef:
      name: migcluster-aws
      namespace: mig
    migStorageRef:
      name: migstorage-sample
      namespace: mig
    namespaces:
    - nginx-example
    srcMigClusterRef:
      name: migcluster-local
      namespace: mig
  status:
    conditions:
    - category: Critical
      lastTransitionTime: 2019-06-10T21:47:51Z
      message: Namespaces [nginx-example] not found on the source cluster.
      reason: NotFound
      status: "True"
      type: NamespaceNotFoundOnSourceCluster
    - category: Required
      lastTransitionTime: 2019-06-10T21:47:56Z
      message: The `persistentVolumes` list has been updated with discovered PVs.
      reason: Done
      status: "True"
      type: PvsDiscovered
MigPlan right after changing the spec to a VALID namespace with PVs
Notice how PV discovery has completed, but there is no error condition for an unsupported action yet.
apiVersion: v1
items:
- apiVersion: migration.openshift.io/v1alpha1
  kind: MigPlan
  metadata:
    creationTimestamp: 2019-06-10T21:07:42Z
    generation: 1
    name: migplan-sample
    namespace: mig
    resourceVersion: "23003"
  spec:
    destMigClusterRef:
      name: migcluster-aws
      namespace: mig
    migStorageRef:
      name: migstorage-sample
      namespace: mig
    namespaces:
    - mysql-persistent
    persistentVolumes:
    - name: pv5
      supportedActions:
      - copy
      - move
    srcMigClusterRef:
      name: migcluster-local
      namespace: mig
  status:
    conditions:
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:42Z
      message: The `persistentVolumes` list has been updated with discovered PVs.
      reason: Done
      status: "True"
      type: PvsDiscovered
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:43Z
      message: The storage resources have been created.
      status: "True"
      type: StorageEnsured
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:46Z
      message: The migration registry resources have been created.
      status: "True"
      type: RegistriesEnsured
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:46Z
      message: The migration plan is ready.
      status: "True"
      type: Ready
MigPlan after a few more seconds; the error condition for the invalid PV action has been added
apiVersion: v1
items:
- apiVersion: migration.openshift.io/v1alpha1
  kind: MigPlan
  metadata:
    creationTimestamp: 2019-06-10T21:07:42Z
    generation: 1
    name: migplan-sample
    namespace: mig
    resourceVersion: "23140"
  spec:
    destMigClusterRef:
      name: migcluster-aws
      namespace: mig
    migStorageRef:
      name: migstorage-sample
      namespace: mig
    namespaces:
    - mysql-persistent
    persistentVolumes:
    - name: pv5
      supportedActions:
      - copy
      - move
    srcMigClusterRef:
      name: migcluster-local
      namespace: mig
  status:
    conditions:
    - category: Error
      lastTransitionTime: 2019-06-10T21:42:49Z
      message: PV in `persistentVolumes` [pv5] has an unsupported `action`.
      reason: NotDone
      status: "True"
      type: PvInvalidAction
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:55Z
      message: The `persistentVolumes` list has been updated with discovered PVs.
      reason: Done
      status: "True"
      type: PvsDiscovered
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:57Z
      message: The storage resources have been created.
      status: "True"
      type: StorageEnsured
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:59Z
      message: The migration registry resources have been created.
      status: "True"
      type: RegistriesEnsured
Was running a MigMigration and noticed a "MigPlan is not ready" status condition pop up on the MigMigration status a few times, for a few seconds each time, during the migration. Not sure what caused this (and it didn't cause any real problems for me) but wanted to document it in case it's an issue later.
I didn't isolate this all the way down the chain; it disappeared very quickly.
Reminder to ensure we have documented any gotchas we've learned from NFS PV handling.
I believe there was an issue with no_root_squash that we need to make users aware of.
There are some potential use-cases where we might want to be able to connect multiple mig-controller instances to a single cluster.
In both of the above cases, we could work around the issue by installing mig-controller on the side with only 1 cluster involved (source and destination, respectively). If for some reason it was not possible to choose the side we would deploy mig-controller to, the flexibility of controller placement could come in handy.
Having the capability for the mig-controller to peacefully co-manage with another mig-controller may be possible if we were to tag all resources created on the remote cluster with some kind of unique identifier.
This use-case was brought up by @sseago
Now that we can detect when the set of conditions has been modified during validation, I think it could be a nice feature to log when resources gain or lose the "Ready" state.
I don't think this would be too spammy as long as we only log when readiness changes. I also wonder if showing error conditions in the logs would be useful.
I have not seen evidence that we log anything during creation of BackupStorageLocation / VolumeSnapshotLocations currently. I think it makes sense to log that we are creating these resources.
Velero upstream supports multiple sets of BSL/VSL credentials as of vmware-tanzu/velero#1548 -- once this commit makes it into our velero fork, we need to update the controller to support this.
Currently, when we create the BSL/VSL resources for a plan, we overwrite the existing cloud-credentials secret with a new default set of AWS credentials corresponding to the current plan, clobbering whatever credentials were there before. The new Velero feature still uses a single cloud-credentials secret, but allows the creation of multiple AWS profiles. When we create a plan, instead of overwriting cloud-credentials with the current plan's credentials, we need to read in the existing cloud-credentials secret, pull the "[default]" section from it, and rewrite the rest of the secret using the associated backup storage (and volume snapshot, if present) credentials for all open plans:
[default]
aws_access_key_id=<whatever is there already>
aws_secret_access_key=<whatever is there already>
[bsl-<plan1UID>]
aws_access_key_id=<from plan1 backup storage credentials secret>
aws_secret_access_key=<from plan1 backup storage credentials secret>
[vsl-<plan1UID>]
aws_access_key_id=<from plan1 volume snapshot credentials secret>
aws_secret_access_key=<from plan1 volume snapshot credentials secret>
[bsl-<plan2UID>]
aws_access_key_id=<from plan2 backup storage credentials secret>
aws_secret_access_key=<from plan2 backup storage credentials secret>
[vsl-<plan2UID>]
aws_access_key_id=<from plan2 volume snapshot credentials secret>
aws_secret_access_key=<from plan2 volume snapshot credentials secret>
Plan storage cleanup operations need to do this as well to remove this plan's credentials from the secret.
When creating the velero BSL/VSL CRs, the profile above should be set in location.Spec.Config.Profile
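A rough sketch of the secret rewrite, assuming the [default] section is preserved verbatim and each open plan contributes its bsl-/vsl- profiles (planCreds and its fields are illustrative, not the actual types):

package main

import (
	"fmt"
	"strings"
)

// planCreds is a hypothetical holder for one open plan's AWS credentials.
type planCreds struct {
	UID, BSLKeyID, BSLSecret, VSLKeyID, VSLSecret string
}

// rebuildCloudCredentials keeps the existing [default] section and rewrites
// the per-plan profiles from the open plans' credential secrets.
func rebuildCloudCredentials(defaultSection string, plans []planCreds) string {
	b := strings.Builder{}
	b.WriteString(strings.TrimSpace(defaultSection) + "\n")
	for _, p := range plans {
		fmt.Fprintf(&b, "[bsl-%s]\naws_access_key_id=%s\naws_secret_access_key=%s\n",
			p.UID, p.BSLKeyID, p.BSLSecret)
		if p.VSLKeyID != "" { // volume snapshot credentials are optional
			fmt.Fprintf(&b, "[vsl-%s]\naws_access_key_id=%s\naws_secret_access_key=%s\n",
				p.UID, p.VSLKeyID, p.VSLSecret)
		}
	}
	return b.String()
}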
Note that supporting this is currently incompatible with supporting multiple controller instances (#47), since the two controllers would clobber each other's cloud-credentials changes (although they'll also clobber the default credentials with the current implementation). To support multiple controllers, we need to make another upstream Velero change to support multiple separate cloud-credentials secrets, so that adding/removing MigPlans would only require adding/removing a plan-specific cloud-credentials file.
Also note: We can't really start working on this until we have the underlying upstream Velero commit in our fork, either as a result of updating to a future Velero release including it, or (if we need this sooner), rebasing to a prerelease commit that's newer than 1.0.0.
Delete BSL/VSL when plan closed.
I deployed Velero onto OCP 4 and started running a migration. I noticed two things:
$ oc get pods -n velero
NAME READY STATUS RESTARTS AGE
migplan-sample-registry-xbvhs-1-deploy 0/1 ContainerCreating 0 7s
restic-cgshq 0/1 ContainerCreating 0 6s
restic-m4wx9 1/1 Running 0 12m
restic-w5krg 1/1 Running 0 12m
velero-78949985fd-m6cfn 1/1 Running 0 12m
{"level":"info","ts":1558550568.6081119,"logger":"migration|2fw6c","msg":"Restic pod successfully restarted"}
I am realizing that we may have done our initial tests of this feature on OCP 3 setups with a single node, so a DaemonSet in that environment only yields a single pod. On OCP 4, where the default is 3 nodes for AWS, you get the above 3 pods running and the bounce doesn't work as intended.
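A sketch of a node-count-aware bounce, assuming a recent client-go and that restic pods carry the name=restic label (the namespace, label selector, and timeout are assumptions):

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bounceRestic deletes every restic pod and waits for the DaemonSet to report
// all desired pods ready again, instead of assuming a single-node cluster.
func bounceRestic(client kubernetes.Interface, ns string) error {
	ctx := context.TODO()
	pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: "name=restic"})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if err := client.CoreV1().Pods(ns).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	for i := 0; i < 60; i++ { // poll for up to ~5 minutes
		ds, err := client.AppsV1().DaemonSets(ns).Get(ctx, "restic", metav1.GetOptions{})
		if err != nil {
			return err
		}
		if ds.Status.DesiredNumberScheduled > 0 && ds.Status.NumberReady == ds.Status.DesiredNumberScheduled {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("restic DaemonSet in %q did not become ready", ns)
}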
After following the documentation at https://github.com/fusor/mig-controller#quick-start, I am getting errors:
E0603 10:12:08.953346 1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.Secret: couldn't get version/kind; json parse error: invalid character 'T' after top-level value
{"level":"error","ts":1559556729.5387843,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"migplan-controller","request":"mig/migplan-sample","error":"secrets \"cloud-credentials\" already exists","stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"error","ts":1559556747.359269,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"migplan-controller","request":"mig/migplan-sample","error":"secrets \"cloud-credentials\" already exists","stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
E0603 10:14:58.950502 1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.Secret: couldn't get version/kind; json parse error: invalid character 'T' after top-level value
MigPlan Status
status:
  conditions:
  - category: Required
    lastTransitionTime: 2019-06-03T10:12:19Z
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
More status needs to be reported:
OpenShift cluster version
Our MigCluster CR holds a reference to a cluster-registry Cluster, which has a spot for the user to input a caBundle.
We are currently ignoring any caBundle input provided and always establish "insecure" connections to remote clusters.
I think it would be best to switch over to "opt-in to running without a caBundle", meaning that we would require the user to turn on an insecure: true bool on the MigCluster if they don't want to use a caBundle.
My thinking is that if we automatically use insecure mode when a user doesn't provide a caBundle, they may not realize they are operating insecurely, which would be undesirable. We should make the user take a conscious step to operate insecurely.
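A minimal validation sketch under that proposal (the insecure field is the proposed addition, not an existing API):

package main

import "errors"

// validateTLS is a sketch: require a caBundle unless the user explicitly
// opted in to insecure connections on the MigCluster.
func validateTLS(caBundle []byte, insecure bool) error {
	if len(caBundle) == 0 && !insecure {
		return errors.New("no caBundle provided; set insecure: true to connect without one")
	}
	return nil
}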
From @jmontleon's and @dymurray's investigations:
Velero's restic implementation mounts the 'pods' directory to do backups and restores.
In OpenShift 3 this directory is at /var/lib/origin/openshift.local.volumes/pods/ and has an SELinux context of system_u:object_r:container_file_t:s0.
In OpenShift 4 the directory is now in the default Kubernetes location, /var/lib/kubelet/pods, which is not a problem, but the SELinux context now appears to default to system_u:object_r:var_lib_t:s0.
Running privileged seems to get us working; we need to learn whether there is a better path forward:
oc adm policy add-scc-to-user privileged system:serviceaccount:velero:velero
oc patch -n velero ds restic -p '{"spec":{"template":{"spec":{"containers":[{"name": "velero", "securityContext":{"privileged":true}}]}}}}'
Assuming that for Stage we want to cancel any running operations.
For Migrate, the ideal case would be that we cancel any running operations and restore the associated namespaces on the source cluster to the same state as prior to the Migrate running. Essentially, we'd want to cancel the running Migrate operations and roll back any changes on the source side.
Completed migrations should no longer have the Ready condition. The Ready condition indicates that a migration is ready to run, so it makes sense that it loses the condition after completion.
Possibly related to #107, but I noticed the deploy script for velero creates some initial cloud credentials with a secret that decodes to:
[default]
aws_access_key_id=
aws_secret_access_key=
Shouldn't the controller check for the presence of this, and if it doesn't already exist, create it? Why create empty credentials?
Since the MigStage and MigMigration objects share a need to validate planRefs, we should break the Plan validation out into a common package.
We should change https://github.com/fusor/mig-controller/blob/master/pkg/controller/migmigration/validation.go#L63 to take a MigPlanRef as an input and have a common validation pkg.
This message doesn't go away after the namespace in question is created.
status:
  conditions:
  - category: Critical
    lastTransitionTime: 2019-05-22T18:34:11Z
    message: Namespaces [nginx-example] not found on the source cluster.
    reason: NotFound
    status: "True"
    type: NamespaceNotFoundOnSourceCluster
For each migration plan, the controller runs migrations serially. The goal is to prevent migrations from doing conflicting things on the clusters involved, or otherwise allowing the migrations to interfere with each other. However, I don't think this approach alone is sufficient. Currently, multiple plans can be created for the same source and destination clusters, and with overlapping namespaces. This defeats the migration (run) orchestration. Running 2 migrations concurrently on different plans with the same source and destination clusters is effectively the same as running 2 migrations concurrently on the same plan.
It's my understanding that the use case for multiple plans with the same source & destination cluster is to support migrating workloads by groups of namespaces in sequential passes/efforts.
Let's start with a complete understanding of the conflict.
There are 2 options for preventing this (that I can think of).
Option 1 seems like the best option, assuming it does not block any known use cases.
Option 1: Add a validation that will block running migrations on (open) plans that have the same source or destination cluster (see the sketch after this list). The validation would set a Conflict condition that will prevent the Ready condition, which will block migrations on overlapping plans. Would this block any known use cases?
Pros:
Cons:
Option 2: Enhance the migration (run) orchestration to run migrations serially across plans that have the same source or destination cluster.
Pros:
Cons:
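A sketch of the Option 1 conflict check (planRefs is an illustrative reduction of MigPlan, not the actual type):

package main

// planRefs is a hypothetical reduction of MigPlan to the fields the check needs.
type planRefs struct {
	Name, SrcCluster, DestCluster string
	Closed                        bool
}

// conflictsWith reports whether two open plans share a source or destination
// cluster; a match would set a Conflict condition and block the Ready condition.
func conflictsWith(a, b planRefs) bool {
	if a.Closed || b.Closed || a.Name == b.Name {
		return false
	}
	return a.SrcCluster == b.SrcCluster || a.DestCluster == b.DestCluster
}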
Related to #156
Seeing this on every migration I run from OCP4->OCP4.
@sseago I think this is from the registry creation code. Doesn't seem to stop migration from completing but would be good to either re-arrange logic or ignore this type of error so it doesn't pollute logs.
I think the problem here is that once a Service is defined, you are not allowed to modify its clusterIP, so we need to treat that field as immutable in our reconcile logic.
{"level":"info"
"ts":1558560727.27501
"logger":"migration|2f8nr"
"msg":"Migration: resumed."
"name":"migmigration-sample"
"phase":"RestoreStarted"}
{"level":"info"
"ts":1558560727.3467073
"logger":"migration|2f8nr"
"msg":"Migration: waiting."
"name":"migmigration-sample"
"phase":"BackupReplicated"
"backup":"migmigration-sample-8gbmh"
"restore":""}
{"level":"error"
"ts":1558560727.3468246
"logger":"kubebuilder.controller"
"msg":"Reconciler error"
"controller":"migmigration-controller"
"request":"mig/migmigration-sample"
"error":"Service \"migplan-sample-registry-p9k57\" is invalid: spec.clusterIP: Invalid value: \"\": field is immutable"
"stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
This is related to #151
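One way to treat the field as immutable in reconcile (a sketch, assuming we fetch the live Service before updating the desired one):

package main

import corev1 "k8s.io/api/core/v1"

// preserveImmutable carries the immutable/server-assigned fields of the live
// Service over to the desired one before Update, so the API server doesn't
// reject the change with "spec.clusterIP: ... field is immutable".
func preserveImmutable(desired, live *corev1.Service) {
	desired.Spec.ClusterIP = live.Spec.ClusterIP
	desired.ResourceVersion = live.ResourceVersion
}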
I have OCP v3.11.104 and OCP v4.1.0 environments deployed with agnosticd. I am trying to migrate an MsSQL server app (with an NFS PV) deployed on OCP 3. I have the controller deployed on the v4 side.
Problem
The backup stage completes without errors. However, the restic restore on the destination is not working. The pod gets restored, but is stuck in Pending state. The migration is stuck at the CreateStagePodsStarted phase. The Velero restore is in the InProgress phase.
Identified Cause
Whenever a pod to be restored has an associated PV (the copy action, in our case), Velero adds an init container to that pod at the time of restore. Velero then restores the pod. The init container waits until the Restic restore process is completed; on the other hand, the restic process does not start until the init container is started. In an ideal scenario, as soon as the pod is restored, the init container runs and triggers the restic restore process. The restic restore process finishes, the init container exits, and the pod comes up with the PV attached to it.
In my case, the pod which got backed up had a nodeSelector looking for the label node-role.kubernetes.io/compute: "true". This label is not present on any of the OCP4 nodes in my setup. At the time of restore, the pod got restored with the nodeSelector. Since the label was absent, the pod never got scheduled. Since it never got scheduled, the init container never started. Since the init container never started, the Restic process never started. And the migration is stuck waiting for the restore to finish.
Possible Solution
The migration plugin could replace the old OCP3 compute labels with the new OCP4 worker labels (a sketch follows).
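A sketch of what the plugin-side remap might look like (label names follow the issue; the empty worker value matches the usual OCP4 node label):

package main

// remapNodeSelector is a hypothetical sketch of the plugin fix: swap the OCP3
// compute role selector for the OCP4 worker role label on restored pods.
func remapNodeSelector(nodeSelector map[string]string) map[string]string {
	if _, found := nodeSelector["node-role.kubernetes.io/compute"]; found {
		delete(nodeSelector, "node-role.kubernetes.io/compute")
		nodeSelector["node-role.kubernetes.io/worker"] = ""
	}
	return nodeSelector
}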
Fix
The fix needs to be made in the plugin; follow fusor/ocp-velero-plugin#61.
The current policy in the controller is that raised errors are bubbled up to the main Reconcile(), which returns them to controller-runtime (the caller). This was based on the assumption that an error carries the original stack trace. That does not seem to be true; as a result, the logged stack trace is not very useful.
I propose that the controllers log.Error() where the error is originally detected (omitting conflict errors). This will ensure the stack trace is useful. Then, Reconcile() will return reconcile.Result{Requeue: true}, nil when an error is encountered. To keep the code clean, we may want a custom logger that performs an IsConflict(err) test (see the sketch below).
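A sketch of the proposed pattern with a conflict-aware logging helper (names are illustrative):

package main

import (
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	logf "sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

var log = logf.Log.WithName("migration")

// logError logs at the detection site so the stack trace is useful, but
// stays quiet for expected optimistic-concurrency conflicts.
func logError(err error) {
	if k8serrors.IsConflict(err) {
		return
	}
	log.Error(err, "")
}

// requeueOnError shows the Reconcile() ending: requeue instead of returning
// the error to controller-runtime.
func requeueOnError(err error) (reconcile.Result, error) {
	if err != nil {
		logError(err)
		return reconcile.Result{Requeue: true}, nil
	}
	return reconcile.Result{}, nil
}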
Need to confirm our API can handle Azure configuration along with S3, and test compatibility with Velero.
Currently the reported pv objects look like this when written back to the MigPlan:
persistentVolumes:
- name: pv1
  supportedActions:
  - copy
  - move
The UI requires a number of additional fields for us to render the PV tables as shown in the mocks. The following fields are missing (a sketch of the extended entry follows this list):
namespace - the namespace the PV was found in
size - size of the PV
claim - associated PVC name
storageClass - associated storageClass, although we'll probably need to think about this more in depth given the issues that Dylan raised around differing storage classes in two different clusters.
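A sketch of the extended persistentVolumes entry (field names and json tags are assumptions, not a settled API):

package main

// PV sketches the reported persistentVolumes entry with the extra fields the
// UI needs; the names here are illustrative.
type PV struct {
	Name             string   `json:"name"`
	Namespace        string   `json:"namespace"`        // namespace the PV was found in
	Size             string   `json:"size"`             // size of the PV, e.g. "10Gi"
	Claim            string   `json:"claim"`            // associated PVC name
	StorageClass     string   `json:"storageClass"`     // may differ between clusters
	SupportedActions []string `json:"supportedActions"` // e.g. copy, move
}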
The current MigMigration CRD has a stage ref, but migrations can be run without a stage. This should reference a MigPlan instead.
Through Slack discussion, it became clear that we need something to create a MigCluster that represents the cluster hosting the controller and the UI. Both will need to be able to distinguish it from MigCluster objects that have been added by users, because it should be immutable in the UI, and the controller will construct a client for this cluster differently than if it were a user-added cluster with an associated ServiceAccount token. We settled on a few conclusions:
It will be the "migration operator's" responsibility to ensure that a MigCluster representing the host cluster is created during installation, and that MigCluster should be considered immutable (the operator will enforce it exists unchanged).
We should add an IsHostCluster boolean field to the MigCluster CRD that will act as a flag. The controller should validate that only one of these exists in a Ready state at any given time, and it should default to false (a sketch follows).
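A sketch of the uniqueness validation (MigClusterInfo is an illustrative reduction; the real check would look at the IsHostCluster field and the Ready condition):

package main

// MigClusterInfo is a hypothetical reduction of a MigCluster's state.
type MigClusterInfo struct {
	IsHostCluster bool
	Ready         bool
}

// countReadyHostClusters supports the validation: more than one Ready host
// cluster should raise a critical condition.
func countReadyHostClusters(clusters []MigClusterInfo) int {
	n := 0
	for _, c := range clusters {
		if c.IsHostCluster && c.Ready {
			n++
		}
	}
	return n
}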
Have been seeing several instances of "cloud-credentials" already exists in rapid succession during migrations recently. @jortel I think this is an unintended side-effect of the ensureStorage pieces that have recently been added.
I've never seen this block a migration, but it would be nice to fix the underlying issue, or silence the error message if we don't care about it (see the sketch after the log below).
{"level":"info"
"ts":1558560519.166244
"logger":"cluster|6fd7s"
"msg":"[rWatch] Starting manager"}
{"level":"info"
"ts":1558560519.1662652
"logger":"cluster|6fd7s"
"msg":"[rWatch] Manager started"}
{"level":"info"
"ts":1558560519.1662774
"logger":"cluster|6fd7s"
"msg":"Remote watch started."
"cluster":"migcluster-local"}
{"level":"error"
"ts":1558560525.3750854
"logger":"kubebuilder.controller"
"msg":"Reconciler error"
"controller":"migplan-controller"
"request":"mig/migplan-sample"
"error":"secrets \"cloud-credentials\" already exists"
"stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
Need to keep track of the sigStopChan and build capability to send a stop signal to cleanly exit:
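A sketch of what the tracking might look like, assuming each remote watch keeps a stop channel keyed by cluster (all names here are hypothetical):

package main

import "sync"

// remoteWatches keeps each remote watch's stop channel so we can signal a
// clean shutdown when a cluster is deleted or its watch must restart.
type remoteWatches struct {
	mu    sync.Mutex
	stops map[string]chan struct{} // keyed by cluster name
}

func newRemoteWatches() *remoteWatches {
	return &remoteWatches{stops: map[string]chan struct{}{}}
}

func (r *remoteWatches) add(cluster string, stop chan struct{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.stops[cluster] = stop
}

// stop closes the channel (the sigStopChan) so the watch manager exits cleanly.
func (r *remoteWatches) stop(cluster string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if ch, found := r.stops[cluster]; found {
		close(ch)
		delete(r.stops, cluster)
	}
}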
Manual management of reference relationships is error prone. Since predicates can see both the old and new object when changes occur, we should be able to reliably attach and break reference relationships from predicates.
Example of manual management of relationships in reconcile: https://github.com/fusor/mig-controller/blob/master/pkg/controller/migcluster/migcluster_controller.go#L142-L150
Example predicate: https://github.com/fusor/mig-controller/blob/master/pkg/controller/migplan/predicate.go
Functions for managing reference map: https://github.com/fusor/mig-controller/blob/master/pkg/util/resource_parent_map.go
We want to ensure that for a given MigrationPlan only one action may be running at any point in time.
For example, if a Stage is running, another Stage or Migrate will not be run in parallel.
Perform Restic restart (mount-propagation workaround) only when cluster is OCP 3.7-3.9.
Currently Velero (and by extension the controller) will not migrate SCC profiles across clusters, since these are cluster-scoped resources that don't have any link to Kube resources. As a first pass, we should provide documentation to the end user that they will need to manually migrate custom SCC profiles to the destination cluster prior to the migration of their applications.
Down the line, we should add support in the UI for a user to select custom SCC profiles they wish to migrate, and the controller should set the SCCs in the IncludedResources field.
As of the 1.0 release, it's possible for a Backup or a Restore to complete with a PartiallyFailed state, in addition to Failed or Completed. At this point, I think we should probably treat a PartiallyFailed backup or restore the same as a Failed one. The two examples I've come across today are ones that should definitely fail the action. We may want to eventually be more granular and dig into the partial-failure reasons to distinguish fatal from non-fatal errors, but for now setting task.Phase to RestoreFailed or BackupFailed should be sufficient.
Where we currently check for velero.RestorePhaseFailed or velero.BackupPhaseFailed, we should also check for velero.RestorePhasePartiallyFailed or velero.BackupPhasePartiallyFailed.
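A sketch of the widened checks, assuming the velero v1 API package's phase constants (the helper names are illustrative):

package main

import velero "github.com/heptio/velero/pkg/apis/velero/v1"

// backupFailed treats PartiallyFailed the same as Failed when deciding
// whether to set task.Phase to BackupFailed.
func backupFailed(phase velero.BackupPhase) bool {
	return phase == velero.BackupPhaseFailed || phase == velero.BackupPhasePartiallyFailed
}

// restoreFailed does the same for restores.
func restoreFailed(phase velero.RestorePhase) bool {
	return phase == velero.RestorePhaseFailed || phase == velero.RestorePhasePartiallyFailed
}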
As a result of a discussion with Paul Morie, it turns out Federation had an API review, and it was suggested that their current usage of cluster-registry exposed them to a potential security escalation; hence they've removed usage of cluster-registry for the time being and are keeping the cluster details in their own resource. The recommendation is for us to do similarly: REMOVE cluster-registry usage, move the cluster details into MigCluster, and just use MigCluster with secretRefs for now.
The plugin depends on annotations to function properly. The list of annotations to set is here: https://github.com/fusor/ocp-velero-plugin/blob/master/velero-plugins/common/types.go#L28
To solve this, I think it makes the most sense to add an Annotations field to the Task struct that the stage and migration controllers can set before running Task.Run().
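A sketch of the Task change (existing fields elided; the plumbing inside Run() is illustrative):

package main

// Task sketches the proposed change: carry the plugin's required annotations
// so the stage and migration controllers can set them before Task.Run().
type Task struct {
	// ... existing fields elided ...
	Annotations map[string]string
}

// Run would apply t.Annotations to the Backup/Restore resources it creates,
// so the ocp-velero-plugin finds the annotations it depends on.
func (t *Task) Run() error {
	// ... create Backup/Restore with t.Annotations applied ...
	return nil
}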