
mig-controller's Issues

When setting the openshift.io/migration-registry annotation, if there's more than one service we should return an error (or fail validation on the plan when calling ensureRegistry)

When setting the openshift.io/migration-registry annotation, if there's more than one service we should return an error (or fail validation on the plan when calling ensureRegistry). Currently, if there happen to be two services associated with the plan (possibly due to improper cleanup of prior resources), we simply set the annotation to match the first one found, which may be the wrong one.
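
A minimal sketch of the proposed check, assuming the registry Services for a plan have already been listed; the helper name and types are illustrative, not the controller's actual code:

package migplan

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// registryService returns the single Service backing the plan's migration
// registry, or an error when more than one is found (which usually means a
// prior cleanup did not complete).
func registryService(planName string, services []corev1.Service) (*corev1.Service, error) {
	switch len(services) {
	case 0:
		return nil, nil // nothing to annotate yet
	case 1:
		return &services[0], nil
	default:
		// Fail validation instead of annotating with the first match.
		return nil, fmt.Errorf(
			"plan %s has %d registry services, expected at most 1",
			planName, len(services))
	}
}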

Add security to migration registry

The current implementation of the (temporary) migration registry creates a registry pod which doesn't use https or require login credentials. We should update this to allow for securing the registry with certs and credentials. This will also require plugin support. The plugin will need to obtain certs and credentials created by the controller (probably via secrets created in the velero namespace with some supporting annotations on the velero CR) and use them to access the registry.

Restriction of needing to ensure only 1 action runs at a time when the mountPropagation workaround is in use.

This issue is to track future concerns about running multiple actions in parallel. By multiple actions, I mean that if we have multiple MigrationPlans, we will likely want to allow running a Stage on each MigrationPlan at the same time, assuming they do not overlap (no shared namespaces). Within a single MigrationPlan, we will continue to restrict execution by queueing actions and running them sequentially.

So, assuming we have multiple MigrationPlans that do not overlap, in the future we would like to be able to run, say, a Stage on each MigrationPlan at the same time.

If we did this with our mountPropagation workaround we'd have potential issues, since we restart the Restic daemonset once per action run.

I don't believe this is a concern as of today; my understanding is that Velero limits execution to a single action running cluster-wide, with future work planned to allow running multiple actions concurrently.

If Velero begins to support running multiple actions concurrently we will want to revisit the mountPropagation workaround, or perhaps continue to enforce a sequential/serial run of actions globally.

Ignore remote watch events on stage pods.

There is a remote watch on pods. The migration creates and deletes (potentially) large numbers of stage pods. This will generate a ton of cluster, plan and migration reconciles, which is neither necessary nor desirable.

Proposed solution.

In remote/watch.go: Add a constant:

const (
    RemoteWatchAnnotation = "remotewatch"  // bool
)

In the remote watch controller Pod predicate, ignore pods with the annotation = FALSE.

Create the stage pods with the annotation.
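
A sketch of what the predicate side might look like, assuming the controller-runtime version vendored at the time (event objects exposed via Meta/MetaNew); everything other than the proposed constant is illustrative:

package remote

import (
	"sigs.k8s.io/controller-runtime/pkg/event"
	"sigs.k8s.io/controller-runtime/pkg/predicate"
)

// RemoteWatchAnnotation marks pods (stage pods) whose events the remote
// watch should ignore; the "false" semantics are an assumption based on
// the proposal above.
const RemoteWatchAnnotation = "remotewatch" // bool

// ignored reports whether the object is annotated to be skipped.
func ignored(annotations map[string]string) bool {
	return annotations[RemoteWatchAnnotation] == "false"
}

// PodPredicate drops create/update/delete events for annotated stage pods.
// The event field names (Meta, MetaNew) match the controller-runtime
// version vendored at the time; newer releases expose the object directly.
var PodPredicate = predicate.Funcs{
	CreateFunc: func(e event.CreateEvent) bool {
		return !ignored(e.Meta.GetAnnotations())
	},
	UpdateFunc: func(e event.UpdateEvent) bool {
		return !ignored(e.MetaNew.GetAnnotations())
	},
	DeleteFunc: func(e event.DeleteEvent) bool {
		return !ignored(e.Meta.GetAnnotations())
	},
}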

Travis should be set up to gate PRs on test and style

It's good practice to gate PR acceptance on Travis runs (or any CI, for that matter). We spent some time on this in CPMA; the make ci target is particularly useful. It ensures the code is formatted according to expectations, the code builds, and tests are green, and because it's a make target, devs can (and should) run it before any submission (https://github.com/fusor/cpma/blob/master/Makefile#L25). This isn't based on any violations I've seen; it's just a practice that's served us nicely, especially when/if we start to receive community submissions.

Support migration of apps that will depend on 4.1 images requiring "pull secret"

As reported by @garethahealy, mig-controller was unable to migrate a Jenkins app deployed by the Service Catalog in OCP 3.11 onto an OpenShift 4.1 cluster.

This appears to be an issue with the way that we (or Velero) handle setup of service accounts / dockercfg pull secrets when running a Restore from a Backup, combined with the lack of a linked OpenShift 4.1 pull secret, which is necessary to pull the Jenkins (and likely other included) images on a 4.1 cluster.

Possibly relevant section from Velero source: https://github.com/fusor/velero/blob/fusor-dev/pkg/restore/merge_service_account.go


Reproducer steps

  1. On source 3.11 cluster, deploy "nodejs-mongodb-example" from "Pipeline Build Example" in the Service Catalog (Jenkins icon)
  2. Create migration to 4.1 cluster
  3. Observe that the Jenkins pod is unable to pull its image on 4.1 due to:
    1. Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/jenkins@sha256:2600aaa3b380a0be3a2b8c67229209968625d4bfff9579326fd3bee811e951bc": rpc error: code = Unknown desc = Error reading manifest sha256:2600aaa3b380a0be3a2b8c67229209968625d4bfff9579326fd3bee811e951bc in image-registry.openshift-image-registry.svc:5000/openshift/jenkins: unauthorized: authentication required
  4. Create another project and deploy Jenkins on 4.1:
    1. oc process jenkins-ephemeral -n openshift | oc apply -n anotherproject -f -
  5. Get the Jenkins dockercfg secret from the above, and create it in your migrated project
  6. Link it to the Jenkins service account
  7. oc secrets link jenkins jenkins-dockercfg --for=pull
  8. Delete the Jenkins pod to force a re-pull
  9. Watch Jenkins start up correctly in 4.1

Recovery workflows after migration failures

We now have basic failure handling in place to properly set Migration state if something goes wrong during backup or restore. Longer term, we need some sort of recovery workflow. After a failure, in some cases a simple retry is all a user might need (especially if the failure was on backup). In other cases, we may need to clean up from the failure (remove destination namespaces, undo quiesce actions on the source) before allowing the user to re-run another stage and then a final migration.

Plan controller should predict a default PV action.

The controller can predict a good default for PV actions. The user would still review and make the final selection before running migrations; this just saves the user a lot of clicks/typing. The default for local storage would be copy, and for mounted storage it would be move.
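
A minimal sketch of the default selection, assuming discovery can already classify a PV as local or mounted storage; the function and parameter names are illustrative:

package migplan

// defaultPVAction suggests a default the user can still review and override.
func defaultPVAction(isLocalStorage bool) string {
	if isLocalStorage {
		return "copy" // default for local storage
	}
	return "move" // default for mounted storage
}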

Clarify user interaction needed during PV migration

This is the initial validation condition that we raise after we perform PV discovery, before the user has selected an action:

PV in 'persistentVolumes' [pv2] has an unsupported 'action'.

When the MigPlan looks something like this

$ oc describe migplan migplan-sample -n mig
[...]
Kind:         MigPlan
Metadata:
  [...]
  Persistent Volumes:
    Name:  pv2
    Supported Actions:
      copy
      move
  [...]
Status:
  Conditions:
    Category:              Error
    Last Transition Time:  2019-05-29T20:35:44Z
    Message:               PV in 'persistentVolumes' [pv2] has an unsupported 'action'.
    Reason:                NotDone
    Status:                True
    Type:                  PvInvalidAction

    Category:              Required
    Last Transition Time:  2019-05-29T21:13:25Z
    Message:               The 'persistentVolumes' list has been updated with discovered PVs.
    Reason:                Done
    Status:                True
    Type:                  PvsDiscovered
    [...]
Events:                    <none>

This is confusing because the user has not yet selected an 'action' at all, so telling them that a PV has an unsupported action almost sounds like a bug with the controller (e.g. the generated supported action list is incorrect).

I'm thinking a better initial message (when 'action' is "" or nil) could be something more along the lines of:

Select an 'action' for PV [pv2] in 'persistentVolumes'.

And we can keep the current message if the user actually provides an unsupported action.

After the user provides an action the condition would go away.

  Persistent Volumes:
    Name:  pv2
    Action: move
    Supported Actions:
      copy
      move

MigPlan can be in "Ready" state with PVs discovered but no action selected after namespace switch

I corrected the namespaces field on a MigPlan from a namespace that didn't exist (nginx-example) to one that existed and had an associated PV. The MigPlan controller performed PV discovery and kept the MigPlan in "Ready" state for a few seconds before the "Invalid PV action selected" message showed up.

This is an issue because it allows a MigMigration to start prematurely before a PV action is selected.

MigPlan with INVALID namespace

apiVersion: v1
items:
- apiVersion: migration.openshift.io/v1alpha1
  kind: MigPlan
  metadata:
    creationTimestamp: 2019-06-10T21:07:42Z
    generation: 1
    labels:
      controller-tools.k8s.io: "1.0"
    name: migplan-sample
    namespace: mig
  spec:
    destMigClusterRef:
      name: migcluster-aws
      namespace: mig
    migStorageRef:
      name: migstorage-sample
      namespace: mig
    namespaces:
    - nginx-example
    srcMigClusterRef:
      name: migcluster-local
      namespace: mig
  status:
    conditions:
    - category: Critical
      lastTransitionTime: 2019-06-10T21:47:51Z
      message: Namespaces [nginx-example] not found on the source cluster.
      reason: NotFound
      status: "True"
      type: NamespaceNotFoundOnSourceCluster
    - category: Required
      lastTransitionTime: 2019-06-10T21:47:56Z
      message: The `persistentVolumes` list has been updated with discovered PVs.
      reason: Done
      status: "True"
      type: PvsDiscovered

MigPlan right after changing spec to a VALID namespace with PVs

Notice how PV discovery has completed but there is no error condition yet for the unsupported action.

apiVersion: v1
items:
- apiVersion: migration.openshift.io/v1alpha1
  kind: MigPlan
  metadata:
    creationTimestamp: 2019-06-10T21:07:42Z
    generation: 1
    name: migplan-sample
    namespace: mig
    resourceVersion: "23003"
  spec:
    destMigClusterRef:
      name: migcluster-aws
      namespace: mig
    migStorageRef:
      name: migstorage-sample
      namespace: mig
    namespaces:
    - mysql-persistent
    persistentVolumes:
    - name: pv5
      supportedActions:
      - copy
      - move
    srcMigClusterRef:
      name: migcluster-local
      namespace: mig
  status:
    conditions:
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:42Z
      message: The `persistentVolumes` list has been updated with discovered PVs.
      reason: Done
      status: "True"
      type: PvsDiscovered
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:43Z
      message: The storage resources have been created.
      status: "True"
      type: StorageEnsured
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:46Z
      message: The migration registry resources have been created.
      status: "True"
      type: RegistriesEnsured
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:46Z
      message: The migration plan is ready.
      status: "True"
      type: Ready

MigPlan after a few more seconds, error condition for invalid PV action has been added

apiVersion: v1
items:
- apiVersion: migration.openshift.io/v1alpha1
  kind: MigPlan
  metadata:
    creationTimestamp: 2019-06-10T21:07:42Z
    generation: 1
    name: migplan-sample
    namespace: mig
    resourceVersion: "23140"
  spec:
    destMigClusterRef:
      name: migcluster-aws
      namespace: mig
    migStorageRef:
      name: migstorage-sample
      namespace: mig
    namespaces:
    - mysql-persistent
    persistentVolumes:
    - name: pv5
      supportedActions:
      - copy
      - move
    srcMigClusterRef:
      name: migcluster-local
      namespace: mig
  status:
    conditions:
    - category: Error
      lastTransitionTime: 2019-06-10T21:42:49Z
      message: PV in `persistentVolumes` [pv5] has an unsupported `action`.
      reason: NotDone
      status: "True"
      type: PvInvalidAction
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:55Z
      message: The `persistentVolumes` list has been updated with discovered PVs.
      reason: Done
      status: "True"
      type: PvsDiscovered
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:57Z
      message: The storage resources have been created.
      status: "True"
      type: StorageEnsured
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:59Z
      message: The migration registry resources have been created.
      status: "True"
      type: RegistriesEnsured

Transient "MigPlan not ready" when migration is running

Was running a MigMigration and noticed a "MigPlan is not ready" status condition pop up on the MigMigration status a few times, for a few seconds each time, during the migration. Not sure what caused this (and it didn't cause any real problems for me), but I wanted to document it in case it becomes an issue later.

I didn't isolate this all the way down the chain; it disappeared very quickly.

Document issues with NFS PV usage

Reminder to ensure we have documented any gotchas we've learned from NFS PV handling.

I believe there was an issue with no_root_squash that we need to make users aware of.

Add support for multiple mig-controller instances connected to a single cluster

There are some potential use-cases where we might want to be able to connect multiple mig-controller instances to a single cluster.

  • Moving content from 1 source cluster to 2 destination clusters, with the controllers running on the 2 destination clusters
  • Moving contents from 2 source clusters to 1 destination cluster, with the controllers running on the 2 source clusters

In both of the above cases, we could work around the issue by installing mig-controller on the side with only 1 cluster involved (the source and the destination, respectively). If for some reason it were not possible to choose the side we deploy mig-controller to, the flexibility of controller placement could come in handy.

Having mig-controller peacefully co-manage with another mig-controller may be possible if we tagged all resources created on the remote cluster with some kind of unique identifier.


This use-case was brought up by @sseago

Log when validation "Ready" condition changes

Since we now have the ability to detect when the set of conditions has been modified during validation, I think it would be a nice feature to log when resources gain or lose the "Ready" state.

I don't think this would be too spammy as long as we only log when readiness changes. I also wonder if showing error conditions in the logs would be useful.
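
A small sketch of the proposed logging, assuming validation can report whether the resource had and now has a "Ready" condition; the helper and its parameters are illustrative:

package migplan

import (
	"github.com/go-logr/logr"
)

// logReadyTransition logs only when readiness actually changes, which
// keeps the noise down as noted above.
func logReadyTransition(log logr.Logger, kind, name string, hadReady, hasReady bool) {
	switch {
	case !hadReady && hasReady:
		log.Info("resource became Ready", "kind", kind, "name", name)
	case hadReady && !hasReady:
		log.Info("resource lost Ready", "kind", kind, "name", name)
	}
}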

Support multiple sets of BSL/VSL credentials

Velero upstream supports multiple sets of BSL/VSL credentials as of vmware-tanzu/velero#1548 -- once this commit makes it into our velero fork, we need to update the controller to support this.

Currently, when we create the BSL/VSL resources for a plan, we overwrite the existing cloud-credentials secret with a new default set of AWS credentials corresponding to the current plan, clobbering whatever credentials were there before. The new Velero feature still uses a single cloud-credentials secret, but allows the creation of multiple AWS profiles. When we create a plan, instead of overwriting cloud-credentials with the current plan's credentials, we need to read in the existing cloud-credentials secret, pull the "[default]" section from it, and rewrite the rest of the secret using the associated backup storage (and volume snapshot, if present) credentials for all open plans:

[default]
aws_access_key_id=<whatever is there already>
aws_secret_access_key=<whatever is there already>
[bsl-<plan1UID>]
aws_access_key_id=<from plan1 backup storage credentials secret>
aws_secret_access_key=<from plan1 backup storage credentials secret>
[vsl-<plan1UID>]
aws_access_key_id=<from plan1 volume snapshot credentials secret>
aws_secret_access_key=<from plan1 volume snapshot credentials secret>
[bsl-<plan2UID>]
aws_access_key_id=<from plan2 backup storage credentials secret>
aws_secret_access_key=<from plan2 backup storage credentials secret>
[vsl-<plan2UID>]
aws_access_key_id=<from plan2 volume snapshot credentials secret>
aws_secret_access_key=<from plan2 volume snapshot credentials secret>

Plan storage cleanup operations need to do this as well to remove this plan's credentials from the secret.

When creating the Velero BSL/VSL CRs, the profile above should be set in location.Spec.Config.Profile.
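
A sketch of that wiring, assuming the secret has already been rewritten with the per-plan sections shown above. The import path and helper are assumptions that depend on the vendored Velero fork; in the Velero v1 API, Spec.Config is a map[string]string, so the "profile" key stands in for the Spec.Config.Profile mentioned above:

package migplan

import (
	velerov1 "github.com/heptio/velero/pkg/apis/velero/v1"
)

// setProfiles points the plan's BSL/VSL at the per-plan AWS profiles
// written into the cloud-credentials secret.
func setProfiles(bsl *velerov1.BackupStorageLocation, vsl *velerov1.VolumeSnapshotLocation, planUID string) {
	if bsl.Spec.Config == nil {
		bsl.Spec.Config = map[string]string{}
	}
	bsl.Spec.Config["profile"] = "bsl-" + planUID

	if vsl.Spec.Config == nil {
		vsl.Spec.Config = map[string]string{}
	}
	vsl.Spec.Config["profile"] = "vsl-" + planUID
}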

Note that supporting this is currently incompatible with supporting multiple controller instances (#47), since the two controllers would clobber each other's cloud-credentials changes (although they'll also clobber the default credentials with the current implementation). To support multiple controllers, we need to make another upstream Velero change to support multiple separate cloud-credentials secrets, so that adding/removing MigPlans would only require adding/removing a plan-specific cloud-credentials file.

Also note: we can't really start working on this until we have the underlying upstream Velero commit in our fork, either by updating to a future Velero release that includes it or (if we need this sooner) by rebasing onto a prerelease commit newer than 1.0.0.

Premature "Restic pod successfully restarted"

I deployed Velero onto OCP 4 and started running a migration. I noticed two things:

  • There are 3 restic pods in the Velero namespace
$ oc get pods -n velero
NAME                                     READY     STATUS              RESTARTS   AGE
migplan-sample-registry-xbvhs-1-deploy   0/1       ContainerCreating   0          7s
restic-cgshq                             0/1       ContainerCreating   0          6s
restic-m4wx9                             1/1       Running             0          12m
restic-w5krg                             1/1       Running             0          12m
velero-78949985fd-m6cfn                  1/1       Running             0          12m
  • The "Restic pod successfully restarted" log message pops up while the above restic pod is still in "ContainerCreating" status.
{"level":"info","ts":1558550568.6081119,"logger":"migration|2fw6c","msg":"Restic pod successfully restarted"}

I am realizing that we may have done our initial tests of this feature on OCP 3 setups with a single node, where a DaemonSet only yields a single pod. On OCP 4, where the default is 3 nodes on AWS, you get the 3 restic pods above and the bounce doesn't work as intended.
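
A sketch of a stricter check, assuming the fix is to wait for the whole restic DaemonSet rather than declaring success as soon as one pod is back (which only happened to work on single-node OCP 3 setups):

package migmigration

import (
	appsv1 "k8s.io/api/apps/v1"
)

// resticRestarted reports whether every scheduled restic pod is ready and
// running the updated template, not just the first one observed.
func resticRestarted(ds *appsv1.DaemonSet) bool {
	return ds.Status.DesiredNumberScheduled > 0 &&
		ds.Status.UpdatedNumberScheduled == ds.Status.DesiredNumberScheduled &&
		ds.Status.NumberReady == ds.Status.DesiredNumberScheduled
}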

Controller has an error "Failed to list *v1.Secret: couldn't get version/kind; json parse error: invalid character 'T' after top-level value"

After following the documentation at https://github.com/fusor/mig-controller#quick-start, I am getting this error:

E0603 10:12:08.953346       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.Secret: couldn't get version/kind; json parse error: invalid character 'T' after top-level value
{"level":"error","ts":1559556729.5387843,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"migplan-controller","request":"mig/migplan-sample","error":"secrets \"cloud-credentials\" already exists","stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"error","ts":1559556747.359269,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"migplan-controller","request":"mig/migplan-sample","error":"secrets \"cloud-credentials\" already exists","stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
E0603 10:14:58.950502       1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.Secret: couldn't get version/kind; json parse error: invalid character 'T' after top-level value

MigPlan Status

 status:
    conditions:
    - category: Required
      lastTransitionTime: 2019-06-03T10:12:19Z
      message: The `persistentVolumes` list has been updated with discovered PVs.
      reason: Done
      status: "True"

More status needs to report

Openshift cluster version

  • Source cluster - OCP 3
  • Destination cluster - OCP 4

Support use of `caBundle` when connecting to remote cluster

Our MigCluster CR holds a reference to a cluster-registry Cluster, which has a spot for the user to input a caBundle.

We are currently ignoring any caBundle input provided and always establish "insecure" connections to remote clusters.

I think it would be best to switch to "opt-in to running without a caBundle", meaning we would require the user to set an insecure: true bool on the MigCluster if they don't want to use a caBundle.

My thinking is that if we automatically fall back to insecure mode when a user doesn't provide a caBundle, they may not realize they are operating insecurely, which would be undesirable. We should make the user take a conscious step to operate insecurely.

https://github.com/fusor/mig-controller/blob/f47f24237cfe601803f405f8f4e7217f511fab5f/pkg/apis/migration/v1alpha1/migcluster_types.go#L172.
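
A sketch of the proposed behaviour, assuming MigCluster grows an insecure bool next to the cluster-registry caBundle reference; the function is illustrative, not the controller's current client construction:

package migcluster

import (
	"errors"

	"k8s.io/client-go/rest"
)

// buildRestConfig only allows a connection without a caBundle when the
// user has explicitly opted in via insecure: true.
func buildRestConfig(host, token string, caBundle []byte, insecure bool) (*rest.Config, error) {
	cfg := &rest.Config{
		Host:        host,
		BearerToken: token,
	}
	switch {
	case len(caBundle) > 0:
		cfg.TLSClientConfig = rest.TLSClientConfig{CAData: caBundle}
	case insecure:
		cfg.TLSClientConfig = rest.TLSClientConfig{Insecure: true}
	default:
		return nil, errors.New("no caBundle provided and 'insecure' is not set on the MigCluster")
	}
	return cfg, nil
}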


Selinux issues with Velero restic pod on Openshift 4

From @jmontleon's and @dymurray's investigations:

Velero's restic implementation mounts the 'pods' directory to do backups and restores.

In Openshift 3 this directory is at /var/lib/origin/openshift.local.volumes/pods/ and has a selinux context of system_u:object_r:container_file_t:s0.

In Openshift 4 the directory is now in the default location for Kubernetes, /var/lib/kubelet/pods, which is not a problem, but the selinux context now appears to default to system_u:object_r:var_lib_t:s0.


Running privileged seems to get us working; we need to learn if there is a better path forward.

oc adm policy add-scc-to-user privileged system:serviceaccount:velero:velero

oc patch -n velero ds restic -p '{"spec":{"template":{"spec":{"containers":[{"name": "velero", "securityContext":{"privileged":true}}]}}}}'

Add ability to cancel a Stage or Migrate

Assuming that for Stage we want to cancel any running operations.

For Migrate, the ideal case would be that we cancel any running operations and restore the associated namespaces on the source cluster to the same state they were in before the Migrate ran. Essentially, we'd want to cancel the running Migrate operations and roll back any changes on the source side.

Memory consumption should be reduced

mig-controller memory usage has grown significantly, although consumption does level off eventually.

Observed controller memory usage over a period of several minutes peaked at 224.36 MiB.


Migrations on overlapping plans can run concurrently.

Overview

For each migration plan, the controller runs migrations serially. The goal is to prevent migrations from doing conflicting things on the clusters involved, or otherwise interfering with each other. However, I don't think this approach alone is sufficient. Currently, multiple plans can be created for the same source and destination clusters, and with overlapping namespaces. This defeats the migration (run) orchestration. Running 2 migrations concurrently on different plans with the same source and destination clusters is effectively the same as running 2 migrations concurrently on the same plan.

It's my understanding that the use case for multiple plans with the same source & destination cluster is to support migrating workloads by groups of namespaces in sequential passes/efforts.

The conflict

Let's start with a complete understanding of the conflict.

  • The source cluster overlap will result in a Restic pod restart being triggered by one migration in the middle of another migration's backup. With overlapping namespaces: the pod quiesce will conflict; duplicate stage pods will be created/deleted; the PV copy/move will likely conflict.
  • The destination cluster overlap with overlapping namespaces: the stage pods will conflict; the PV copy/move will conflict (or just overlap?).

Proposed Solution

There are 2 options for preventing this (that I can think of).

Option 1 seems like the best option assuming it does not block any known use cases.

Option 1: Add a validation that blocks running migrations on (open) plans that have the same source or destination cluster. The validation would set a Conflict condition that prevents the Ready condition, which in turn blocks migrations on overlapping plans (a sketch of this validation follows the pros/cons below). Would this block any known use cases?

Pros:

  • Simple.
  • Alerts the user to the overlap.
  • Seems more intuitive for the user.
  • Keeps complexity low in the migration (run) orchestration algorithm.
  • Promotes work flow of closing completed plans.

Cons:

  • May give the user impression that overlapping plans are unhealthy when they're not.
  • May block a use case I'm not aware of.
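
A rough sketch of Option 1's validation, assuming open plans are already loaded and that the cluster refs are corev1.ObjectReference pointers; the helper names and the Conflict condition wiring are illustrative:

package migplan

import (
	corev1 "k8s.io/api/core/v1"

	migapi "github.com/fusor/mig-controller/pkg/apis/migration/v1alpha1"
)

// conflictingPlans returns the names of other open plans that share a
// source or destination cluster with this plan; a non-empty result would
// set a Conflict condition and block Ready.
func conflictingPlans(plan *migapi.MigPlan, openPlans []migapi.MigPlan) []string {
	conflicts := []string{}
	for i := range openPlans {
		other := &openPlans[i]
		if other.UID == plan.UID {
			continue
		}
		if refEqual(plan.Spec.SrcMigClusterRef, other.Spec.SrcMigClusterRef) ||
			refEqual(plan.Spec.DestMigClusterRef, other.Spec.DestMigClusterRef) {
			conflicts = append(conflicts, other.Name)
		}
	}
	return conflicts
}

func refEqual(a, b *corev1.ObjectReference) bool {
	return a != nil && b != nil && a.Namespace == b.Namespace && a.Name == b.Name
}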

Option 2: Enhance the migration (run) orchestration to run migrations serially across plans that have the same source or destination cluster.

Pros:

  • Migrations will seem to sort themselves out and just eventually run.
  • Overlapping plans will remain Ready, which gives the user the feeling that things are healthy.

Cons:

  • Migration (run) orchestration across plans adds a good bit of complexity.
  • The user may get confused about why the migrations on a plan are being postponed, prompting questions like "When will my migration run?"
  • The changes (edits) on one plan will block/interrupt migrations running on another plan.
  • User not alerted to the overlap.
  • Does not promote plan closing work flow.
  • Requires a (run) ordering of overlapping migration plans. The order created? This fixed ordering could conflict with user intentions. Then what?

Related to #156

Controller has error "spec.clusterIP" invalid: field is immutable during migrations

Seeing this on every migration I run from OCP4->OCP4.

@sseago I think this is from the registry creation code. It doesn't seem to stop migration from completing, but it would be good to either re-arrange the logic or ignore this type of error so it doesn't pollute the logs.

Thinking the problem here is that once you define a Service, you are not allowed to modify its clusterIP, so we need to treat that field as immutable in our reconcile logic.

{"level":"info"
"ts":1558560727.27501
"logger":"migration|2f8nr"
"msg":"Migration: resumed."
"name":"migmigration-sample"
"phase":"RestoreStarted"}

{"level":"info"
"ts":1558560727.3467073
"logger":"migration|2f8nr"
"msg":"Migration: waiting."
"name":"migmigration-sample"
"phase":"BackupReplicated"
"backup":"migmigration-sample-8gbmh"
"restore":""}

{"level":"error"
"ts":1558560727.3468246
"logger":"kubebuilder.controller"
"msg":"Reconciler error"
"controller":"migmigration-controller"
"request":"mig/migmigration-sample"
"error":"Service \"migplan-sample-registry-p9k57\" is invalid: spec.clusterIP: Invalid value: \"\": field is immutable"
"stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Restic PV restore does not work when pod to be restored is `Unschedulable` on destination cluster

This is related to #151

I have OCP v3.11.104 and OCP v4.1.0 environments deployed with agnosticd. I am trying to migrate an MsSQL server app (with an NFS PV) deployed on OCP 3. I have the controller deployed on the v4 side.

Problem
The backup stage completes without errors. However, the restic restore on the destination is not working. The pod gets restored, but is stuck in Pending state. The migration is stuck at the CreateStagePodsStarted phase. The Velero restore is in the InProgress phase.

Identified Cause

Whenever a pod to be restored has a PV associated (copy action in our case), Velero adds an init container to that pod at restore time and then restores the pod. The init container waits until the Restic restore process is completed; on the other hand, the restic process does not start until the init container has started. In an ideal scenario, as soon as the pod is restored, the init container runs and triggers the restic restore process. The restic restore process finishes, the init container exits, and the pod comes up with the PV attached to it.

In my case, the pod that got backed up had a nodeSelector looking for the label node-role.kubernetes.io/compute: "true". This label is not present on any of the OCP 4 nodes in my setup. At restore time, the pod was restored with the nodeSelector intact. Since the label was absent, the pod never got scheduled. Since it never got scheduled, the init container never started. Since the init container never started, the Restic process never started. And the migration is stuck hoping that the restore will finish.

Possible Solution
Migration plugin could replace the old OCP3 compute labels with new OCP4 worker labels.

Fix
Fixes need to be made in the plugin. Follow this issue: fusor/ocp-velero-plugin#61

Logged stack traces not useful.

The current policy in the controller is that errors raised are bubbled up to the main Reconcile(), which returns them to controller-runtime (the caller). This was based on an assumption that an error carries the original stack trace. That does not seem to be true. As a result, the logged stack trace is not very useful.

I propose that the controllers call log.Error() where the error is originally detected (omitting conflict errors). This will ensure the stack trace is useful. Then, Reconcile() will return reconcile.Result{Requeue: true}, nil when an error is encountered. To keep the code clean, we may want to use a custom logger to do an IsConflict(err) test.
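
A sketch of the proposed pattern, assuming the controllers' reconcile bodies can be wrapped; doReconcile and the logger wiring are placeholders, not the controller's real code:

package migplan

import (
	"github.com/go-logr/logr"
	k8serr "k8s.io/apimachinery/pkg/api/errors"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

// reconcileWrapper logs at the point of detection (where the stack/context
// is still useful) and requeues instead of returning the error to
// controller-runtime, whose own logging produces the unhelpful traces.
func reconcileWrapper(log logr.Logger, request reconcile.Request,
	doReconcile func(reconcile.Request) error) (reconcile.Result, error) {
	if err := doReconcile(request); err != nil {
		// Skip expected optimistic-concurrency conflicts.
		if !k8serr.IsConflict(err) {
			log.Error(err, "reconcile failed", "request", request.NamespacedName)
		}
		return reconcile.Result{Requeue: true}, nil
	}
	return reconcile.Result{}, nil
}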

Need additional fields during PV discovery

Currently the reported pv objects look like this when written back to the MigPlan:

  persistentVolumes:
  - name: pv1
    supportedActions:
    - copy
    - move

The UI requires a number of additional fields for us to render the PV tables as shown in the mocks. The following fields are missing (see the sketch below):

  • namespace - the namespace the PV was found in
  • size - size of the PV
  • claim - associated PVC name
  • storageClass - associated storageClass, although we'll probably need to think about this more in depth given the issues that Dylan raised around differing storage classes in the two clusters.
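
A sketch of how the reported PV entry could grow; the existing fields mirror the MigPlan output above, and the new field and json names are assumptions:

package v1alpha1

// PV is the entry written back to spec.persistentVolumes on the MigPlan.
type PV struct {
	Name             string   `json:"name"`
	SupportedActions []string `json:"supportedActions"`

	// Proposed additions for the UI:
	Namespace    string `json:"namespace,omitempty"`    // namespace the PV was found in
	Size         string `json:"size,omitempty"`         // size of the PV, e.g. "10Gi"
	Claim        string `json:"claim,omitempty"`        // associated PVC name
	StorageClass string `json:"storageClass,omitempty"` // associated storage class
}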

Add IsHostCluster to MigCluster CRD

Through slack discussion, it became clear that we need a way to create a MigCluster that represents the cluster hosting the controller + the UI. Both will need to be able to distinguish it from MigCluster objects that have been added by users, because it should be immutable in the UI, and the controller will construct a client for this cluster differently than for a user-added cluster with an associated ServiceAccount token. We settled on a few conclusions:

  1. It will be the "migration operator's" responsibility to ensure that a MigCluster representing the host cluster is created during installation, which should be considered immutable (the operator will enforce it exists unchanged).

  2. We should add an IsHostCluster boolean field to the MigCluster CRD that will act as a flag (see the sketch below). The controller should validate that only one of these exists in a Ready state at any given time, and it should default to false.
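
A sketch of the flag on the MigCluster spec; the json tag and placement are assumptions:

package v1alpha1

type MigClusterSpec struct {
	// ... existing fields ...

	// IsHostCluster marks the MigCluster representing the cluster that
	// hosts the controller and UI.  Defaults to false; the controller
	// should validate that only one Ready MigCluster sets it.
	IsHostCluster bool `json:"isHostCluster,omitempty"`
}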

Controller has error "secrets \"cloud-credentials\" already exists" several times during migration

Have been seeing several instances of "cloud-credentials already exists" in rapid succession during migrations recently. @jortel I think this is an unintended side effect of the ensureStorage pieces that have recently been added.

I've never seen this block a migration, but it would be nice to fix the underlying issue, or silence the error message if we don't care about it.

{"level":"info"
"ts":1558560519.166244
"logger":"cluster|6fd7s"
"msg":"[rWatch] Starting manager"}


{"level":"info"
"ts":1558560519.1662652
"logger":"cluster|6fd7s"
"msg":"[rWatch] Manager started"}


{"level":"info"
"ts":1558560519.1662774
"logger":"cluster|6fd7s"
"msg":"Remote watch started."
"cluster":"migcluster-local"}


{"level":"error"
"ts":1558560525.3750854
"logger":"kubebuilder.controller"
"msg":"Reconciler error"
"controller":"migplan-controller"
"request":"mig/migplan-sample"
"error":"secrets \"cloud-credentials\" already exists"
"stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until
	/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

Predicates should handle maintenance of the resource reference map

Manual management of reference relationships is error-prone. Since predicates can see both the old and new object when changes occur, we should be able to reliably attach and break reference relationships from predicates.

Example of manual management of relationships in reconcile: https://github.com/fusor/mig-controller/blob/master/pkg/controller/migcluster/migcluster_controller.go#L142-L150

Example predicate: https://github.com/fusor/mig-controller/blob/master/pkg/controller/migplan/predicate.go

Functions for managing reference map: https://github.com/fusor/mig-controller/blob/master/pkg/util/resource_parent_map.go

Investigate path for migrating SCCs from source to destination cluster

Currently Velero (and by extension the controller) will not migrate SCC profiles across clusters, since these are cluster-scoped resources that don't have any link to Kube resources. As a first pass, we should provide documentation telling the end user that they will need to manually migrate custom SCC profiles to the destination cluster prior to the migration of their applications.

Down the line, we should add support in the UI for a user to select custom SCC profiles they wish to migrate, and the controller should set the SCCs in the IncludedResources field.

Task phase handling needs to account for PartiallyFailed backups and restores

As of the 1.0 release, it's possible for a Backup or a Restore to complete in a PartiallyFailed state, in addition to Failed or Completed. At this point, I think we should treat a PartiallyFailed backup or restore the same as a Failed one. The two examples I've come across today are ones that should definitely fail the action. We may eventually want to be more granular and dig into the partial-failure reasons to distinguish fatal from non-fatal errors, but for now setting task.Phase to RestoreFailed or BackupFailed should be sufficient.

Wherever we currently check for velero.RestorePhaseFailed or velero.BackupPhaseFailed, we should also check for velero.RestorePhasePartiallyFailed or velero.BackupPhasePartiallyFailed.
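
A sketch of the adjusted checks using the Velero v1 phase constants; the import path depends on the vendored fork:

package migmigration

import (
	velerov1 "github.com/heptio/velero/pkg/apis/velero/v1"
)

// backupFailed treats PartiallyFailed the same as Failed, per the above.
func backupFailed(backup *velerov1.Backup) bool {
	return backup.Status.Phase == velerov1.BackupPhaseFailed ||
		backup.Status.Phase == velerov1.BackupPhasePartiallyFailed
}

// restoreFailed treats PartiallyFailed the same as Failed, per the above.
func restoreFailed(restore *velerov1.Restore) bool {
	return restore.Status.Phase == velerov1.RestorePhaseFailed ||
		restore.Status.Phase == velerov1.RestorePhasePartiallyFailed
}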

Remove use of cluster-registry.

As a result of a discussion with Paul Morie, it turns out Federation had an API review in which it was suggested that their usage of cluster-registry exposed them to a potential security escalation; hence they've removed cluster-registry usage for the time being and are keeping the cluster details in their own resource. The recommendation is for us to do similarly and REMOVE cluster-registry usage: move the cluster details into MigCluster and just use MigCluster with secretRefs for now.
