migtools / mig-controller
OpenShift Migration Controller
License: Apache License 2.0
When setting the openshift.io/migration-registry annotation, if there's more than one service we should return an error (or fail validation on the plan when calling ensureRegistry). Currently, if two services happen to be associated with the plan (possibly due to an improper prior resource cleanup), we just set an annotation matching the first one found, which may be the wrong one.
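A minimal sketch of the proposed check, assuming ensureRegistry has already listed the registry services for the plan (the helper name and fields are illustrative, not the controller's actual code):

package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

// validateRegistryServices is a hypothetical helper: given the registry
// services found for a plan, fail unless exactly one exists instead of
// silently annotating the first match.
func validateRegistryServices(planName string, services []corev1.Service) (*corev1.Service, error) {
	switch len(services) {
	case 0:
		return nil, fmt.Errorf("no migration registry service found for plan %q", planName)
	case 1:
		return &services[0], nil
	default:
		// More than one service likely means an improper prior cleanup;
		// surface the problem rather than guessing.
		return nil, fmt.Errorf("found %d migration registry services for plan %q, expected 1",
			len(services), planName)
	}
}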
The current implementation of the (temporary) migration registry creates a registry pod which doesn't use https or require login credentials. We should update this to allow for securing the registry with certs and credentials. This will also require plugin support. The plugin will need to obtain certs and credentials created by the controller (probably via secrets created in the velero namespace with some supporting annotations on the velero CR) and use them to access the registry.
This issue is to ensure we track future concerns of running multiple actions in parallel. By multiple actions, I mean: if we have multiple MigrationPlans, in the future we will likely want to allow running Stage on each MigrationPlan at the same time, assuming they do not overlap (no shared namespaces). Within a single MigrationPlan, we will continue to queue actions and run them sequentially on a per-MigrationPlan basis.
So... assuming we have multiple MigrationPlans that do not overlap.
In the future, we would like to be able to have, say, a Stage running on each MigrationPlan at the same time.
If we did this with our mountPropagation workaround we'd have potential issues, as we restart the Restic daemonset on a per-action-run basis.
I don't believe this is a concern as of today; my understanding is that Velero limits a cluster to 1 running action cluster-wide, with future work to allow running multiple actions concurrently.
If Velero begins to support running multiple actions concurrently, we will want to revisit the mountPropagation workaround, or perhaps continue to enforce a sequential/serial run of actions globally.
There is a remote watch on pods. The migration creates/deletes (potentially) large numbers of stage pods. This will generate a ton of cluster, plan, and migration reconciles, which is not necessary or desirable.
Proposed solution:
In remote/watch.go, add a constant:
const (
	RemoteWatchAnnotation = "remotewatch" // bool
)
In the remote watch controller's Pod predicate, ignore pods with the annotation set to FALSE.
Create the stage pods with the annotation (a sketch of the predicate follows).
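A sketch of the predicate side, assuming controller-runtime predicate funcs (the annotation-value convention here is illustrative):

package remote

import (
	corev1 "k8s.io/api/core/v1"
	"sigs.k8s.io/controller-runtime/pkg/event"
)

const (
	RemoteWatchAnnotation = "remotewatch" // bool
)

// watched returns false for pods annotated remotewatch=false, so stage pods
// created with the annotation don't trigger cluster/plan/migration reconciles.
func watched(pod *corev1.Pod) bool {
	return pod.Annotations[RemoteWatchAnnotation] != "false"
}

// Example create-event filter for the remote watch Pod predicate.
func createFilter(e event.CreateEvent) bool {
	pod, cast := e.Object.(*corev1.Pod)
	if !cast {
		return true
	}
	return watched(pod)
}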
Example:
Source namespace 'foo' may be migrated to Destination namespace 'bar'
It's good practice to gate PR acceptance on Travis runs (or any CI, for that matter). We spent some time on this in CPMA; the make ci target is particularly useful. It ensures the code is formatted according to expectations, the code builds, and tests are green, and because it's a make target, devs can (and should) run it before any submission (https://github.com/fusor/cpma/blob/master/Makefile#L25). This isn't based on any violations I've seen; it's just a practice that's served us nicely, especially when/if we start to receive community submissions.
Goals:
Add more granular progress reporting for PV migration.
As reported by @garethahealy, mig-controller was unable to migrate a Jenkins app deployed by the Service Catalog in OCP 3.11 onto an OpenShift 4.1 cluster.
This appears to be an issue with the way that we (or Velero) handle setup of service accounts / dockercfg pull secrets when running a Restore from a Backup, and the lack of a linked OpenShift 4.1 pull secret, which is necessary to pull the Jenkins (and likely other included) images on a 4.1 cluster.
Possibly relevant section from Velero source: https://github.com/fusor/velero/blob/fusor-dev/pkg/restore/merge_service_account.go
Failed to pull image "image-registry.openshift-image-registry.svc:5000/openshift/jenkins@sha256:2600aaa3b380a0be3a2b8c67229209968625d4bfff9579326fd3bee811e951bc": rpc error: code = Unknown desc = Error reading manifest sha256:2600aaa3b380a0be3a2b8c67229209968625d4bfff9579326fd3bee811e951bc in image-registry.openshift-image-registry.svc:5000/openshift/jenkins: unauthorized: authentication required
oc process jenkins-ephemeral -n openshift | oc apply -n anotherproject -f -
We now have basic failure handling in place to properly set Migration state if something goes wrong during backup or restore. Longer-term, we need some sort of recovery workflow. After a failure, in some cases a simple retry is all a user might need (especially if it's on backup). In other cases, we may need to clean up from the failure (remove destination namespaces, undo quiesce actions on the source) before allowing the user to re-run another stage and then a final migration.
Need to add a critical category validation to prevent running a migration after the plan is Closed.
The controller can predict a good default for PV actions. The user would still review and make the final selection before running migrations; this just saves the user a lot of clicks/typing. The default for local storage would be copy, and for mounted storage it would be move.
This is the initial validation condition that we raise after we perform PV discovery, before the user has selected an action:
PV in 'persistentVolumes' [pv2] has an unsupported 'action'.
When the MigPlan looks something like this:
$ oc describe migplan migplan-sample -n mig
[...]
Kind:  MigPlan
Metadata:
  [...]
Persistent Volumes:
  Name:  pv2
  Supported Actions:
    copy
    move
[...]
Status:
  Conditions:
    Category:              Error
    Last Transition Time:  2019-05-29T20:35:44Z
    Message:               PV in 'persistentVolumes' [pv2] has an unsupported 'action'.
    Reason:                NotDone
    Status:                True
    Type:                  PvInvalidAction
    Category:              Required
    Last Transition Time:  2019-05-29T21:13:25Z
    Message:               The 'persistentVolumes' list has been updated with discovered PVs.
    Reason:                Done
    Status:                True
    Type:                  PvsDiscovered
[...]
Events:  <none>
This is confusing because the user has not yet selected an 'action' at all, so telling them that a PV has an unsupported action almost sounds like a bug with the controller (e.g. the generated supported action list is incorrect).
I'm thinking a better initial message (when 'action' is "" or nil) could be something more along the lines of:
Select an 'action' for PV [pv2] in 'persistentVolumes'.
And we can keep the current message if the user actually provides an unsupported action.
After the user provides an action the condition would go away.
Persistent Volumes:
  Name:    pv2
  Action:  move
  Supported Actions:
    copy
    move
I corrected the namespaces field on a MigPlan from a namespace that didn't exist (nginx-example) to one that existed and had an associated PV. The MigPlan controller performed PV discovery and kept the MigPlan in "Ready" state for a few seconds before the "Invalid PV action selected" message showed up.
This is an issue because it allows a MigMigration to start prematurely, before a PV action is selected.
MigPlan with INVALID namespace
apiVersion: v1
items:
- apiVersion: migration.openshift.io/v1alpha1
  kind: MigPlan
  metadata:
    creationTimestamp: 2019-06-10T21:07:42Z
    generation: 1
    labels:
      controller-tools.k8s.io: "1.0"
    name: migplan-sample
    namespace: mig
  spec:
    destMigClusterRef:
      name: migcluster-aws
      namespace: mig
    migStorageRef:
      name: migstorage-sample
      namespace: mig
    namespaces:
    - nginx-example
    srcMigClusterRef:
      name: migcluster-local
      namespace: mig
  status:
    conditions:
    - category: Critical
      lastTransitionTime: 2019-06-10T21:47:51Z
      message: Namespaces [nginx-example] not found on the source cluster.
      reason: NotFound
      status: "True"
      type: NamespaceNotFoundOnSourceCluster
    - category: Required
      lastTransitionTime: 2019-06-10T21:47:56Z
      message: The `persistentVolumes` list has been updated with discovered PVs.
      reason: Done
      status: "True"
      type: PvsDiscovered
MigPlan right after changing the spec to a VALID namespace with PVs
Notice how PV discovery has completed, but there is no error condition for an unsupported action yet.
apiVersion: v1
items:
- apiVersion: migration.openshift.io/v1alpha1
  kind: MigPlan
  metadata:
    creationTimestamp: 2019-06-10T21:07:42Z
    generation: 1
    name: migplan-sample
    namespace: mig
    resourceVersion: "23003"
  spec:
    destMigClusterRef:
      name: migcluster-aws
      namespace: mig
    migStorageRef:
      name: migstorage-sample
      namespace: mig
    namespaces:
    - mysql-persistent
    persistentVolumes:
    - name: pv5
      supportedActions:
      - copy
      - move
    srcMigClusterRef:
      name: migcluster-local
      namespace: mig
  status:
    conditions:
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:42Z
      message: The `persistentVolumes` list has been updated with discovered PVs.
      reason: Done
      status: "True"
      type: PvsDiscovered
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:43Z
      message: The storage resources have been created.
      status: "True"
      type: StorageEnsured
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:46Z
      message: The migration registry resources have been created.
      status: "True"
      type: RegistriesEnsured
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:46Z
      message: The migration plan is ready.
      status: "True"
      type: Ready
MigPlan after a few more seconds; the error condition for the invalid PV action has been added
apiVersion: v1
items:
- apiVersion: migration.openshift.io/v1alpha1
  kind: MigPlan
  metadata:
    creationTimestamp: 2019-06-10T21:07:42Z
    generation: 1
    name: migplan-sample
    namespace: mig
    resourceVersion: "23140"
  spec:
    destMigClusterRef:
      name: migcluster-aws
      namespace: mig
    migStorageRef:
      name: migstorage-sample
      namespace: mig
    namespaces:
    - mysql-persistent
    persistentVolumes:
    - name: pv5
      supportedActions:
      - copy
      - move
    srcMigClusterRef:
      name: migcluster-local
      namespace: mig
  status:
    conditions:
    - category: Error
      lastTransitionTime: 2019-06-10T21:42:49Z
      message: PV in `persistentVolumes` [pv5] has an unsupported `action`.
      reason: NotDone
      status: "True"
      type: PvInvalidAction
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:55Z
      message: The `persistentVolumes` list has been updated with discovered PVs.
      reason: Done
      status: "True"
      type: PvsDiscovered
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:57Z
      message: The storage resources have been created.
      status: "True"
      type: StorageEnsured
    - category: Required
      lastTransitionTime: 2019-06-10T21:42:59Z
      message: The migration registry resources have been created.
      status: "True"
      type: RegistriesEnsured
Was running a MigMigration and noticed a "MigPlan is not ready" status condition pop up on the MigMigration status a few times, for a few seconds each time, during the migration. Not sure what caused this (and it didn't cause any real problems for me) but wanted to document it in case it's an issue later.
I didn't isolate this all the way down the chain; it disappeared very quickly.
Reminder to ensure we have documented any gotchas we've learned from NFS PV handling.
I believe there was an issue with no_root_squash that we need to make users aware of.
There are some potential use-cases where we might want to be able to connect multiple mig-controller instances to a single cluster.
In both of the above cases, we could work around the issue by installing mig-controller on the side with only 1 cluster involved (source and destination, respectively). If for some reason it was not possible to choose the side we would deploy mig-controller to, the flexibility of controller placement could come in handy.
Having the capability for the mig-controller to peacefully co-manage with another mig-controller may be possible if we were to tag all resources created on the remote cluster with some kind of unique identifier.
This use-case was brought up by @sseago
Now that we can detect when the set of conditions has been modified during validation, I think it could be a nice feature to log when resources gain or lose the "Ready" state.
I don't think this would be too spammy as long as we only log when readiness changes. I also wonder if showing error conditions in the logs would be useful.
I have not seen evidence that we log anything during creation of BackupStorageLocation / VolumeSnapshotLocations currently. I think it makes sense to log that we are creating these resources.
Velero upstream supports multiple sets of BSL/VSL credentials as of vmware-tanzu/velero#1548 -- once this commit makes it into our velero fork, we need to update the controller to support this.
Currently, when we create the BSL/VSL resources for a plan, we overwrite the existing cloud-credentials secret with a new default set of AWS credentials corresponding to the current plan, clobbering whatever credentials were there before. The new Velero feature still uses a single cloud-credentials secret, but allows the creation of multiple AWS profiles. When we create a plan, instead of overwriting cloud-credentials with the current plan's credentials, we need to read in the existing cloud-credentials secret, pull the "[default]" section from it, and rewrite the rest of the secret using the associated backup storage (and volume snapshot, if present) credentials for all open plans:
[default]
aws_access_key_id=<whatever is there already>
aws_secret_access_key=<whatever is there already>
[bsl-<plan1UID>]
aws_access_key_id=<from plan1 backup storage credentials secret>
aws_secret_access_key=<from plan1 backup storage credentials secret>
[vsl-<plan1UID>]
aws_access_key_id=<from plan1 volume snapshot credentials secret>
aws_secret_access_key=<from plan1 volume snapshot credentials secret>
[bsl-<plan2UID>]
aws_access_key_id=<from plan2 backup storage credentials secret>
aws_secret_access_key=<from plan2 backup storage credentials secret>
[vsl-<plan2UID>]
aws_access_key_id=<from plan2 volume snapshot credentials secret>
aws_secret_access_key=<from plan2 volume snapshot credentials secret>
Plan storage cleanup operations need to do this as well to remove this plan's credentials from the secret.
When creating the velero BSL/VSL CRs, the profile above should be set in location.Spec.Config.Profile
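A rough sketch of the secret rewrite, assuming the [default] section is preserved verbatim and each open plan contributes its bsl-/vsl- profiles (planCreds and its fields are illustrative, not the actual types):

package main

import (
	"fmt"
	"strings"
)

// planCreds is a hypothetical holder for one open plan's AWS credentials.
type planCreds struct {
	UID, BSLKeyID, BSLSecret, VSLKeyID, VSLSecret string
}

// rebuildCloudCredentials keeps the existing [default] section and rewrites
// the per-plan profiles from the open plans' credential secrets.
func rebuildCloudCredentials(defaultSection string, plans []planCreds) string {
	b := strings.Builder{}
	b.WriteString(strings.TrimSpace(defaultSection) + "\n")
	for _, p := range plans {
		fmt.Fprintf(&b, "[bsl-%s]\naws_access_key_id=%s\naws_secret_access_key=%s\n",
			p.UID, p.BSLKeyID, p.BSLSecret)
		if p.VSLKeyID != "" { // volume snapshot credentials are optional
			fmt.Fprintf(&b, "[vsl-%s]\naws_access_key_id=%s\naws_secret_access_key=%s\n",
				p.UID, p.VSLKeyID, p.VSLSecret)
		}
	}
	return b.String()
}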
Note that supporting this is currently incompatible with supporting multiple controller instances (#47), since the two controllers would clobber each other's cloud-credentials changes (although they'll also clobber the default credentials with the current implementation). To support multiple controllers, we need to make another upstream Velero change to support multiple separate cloud-credentials secrets, so that adding/removing MigPlans would only require adding/removing a plan-specific cloud-credentials file.
Also note: We can't really start working on this until we have the underlying upstream Velero commit in our fork, either as a result of updating to a future Velero release including it, or (if we need this sooner), rebasing to a prerelease commit that's newer than 1.0.0.
Delete BSL/VSL when plan closed.
I deployed Velero onto OCP 4 and started running a migration. I noticed two things:
$ oc get pods -n velero
NAME READY STATUS RESTARTS AGE
migplan-sample-registry-xbvhs-1-deploy 0/1 ContainerCreating 0 7s
restic-cgshq 0/1 ContainerCreating 0 6s
restic-m4wx9 1/1 Running 0 12m
restic-w5krg 1/1 Running 0 12m
velero-78949985fd-m6cfn 1/1 Running 0 12m
{"level":"info","ts":1558550568.6081119,"logger":"migration|2fw6c","msg":"Restic pod successfully restarted"}
I am realizing that we may have done our initial tests of this feature on OCP 3 setups with a single node, so a DaemonSet in that environment only yields a single pod. On OCP 4, where the default is 3 nodes for AWS, you get the above 3 pods running and the bounce doesn't work as intended.
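A sketch of a node-count-aware bounce, assuming a recent client-go and that restic pods carry the name=restic label (the namespace, label selector, and timeout are assumptions):

package main

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// bounceRestic deletes every restic pod and waits for the DaemonSet to report
// all desired pods ready again, instead of assuming a single-node cluster.
func bounceRestic(client kubernetes.Interface, ns string) error {
	ctx := context.TODO()
	pods, err := client.CoreV1().Pods(ns).List(ctx, metav1.ListOptions{LabelSelector: "name=restic"})
	if err != nil {
		return err
	}
	for _, pod := range pods.Items {
		if err := client.CoreV1().Pods(ns).Delete(ctx, pod.Name, metav1.DeleteOptions{}); err != nil {
			return err
		}
	}
	for i := 0; i < 60; i++ { // poll for up to ~5 minutes
		ds, err := client.AppsV1().DaemonSets(ns).Get(ctx, "restic", metav1.GetOptions{})
		if err != nil {
			return err
		}
		if ds.Status.DesiredNumberScheduled > 0 && ds.Status.NumberReady == ds.Status.DesiredNumberScheduled {
			return nil
		}
		time.Sleep(5 * time.Second)
	}
	return fmt.Errorf("restic DaemonSet in %q did not become ready", ns)
}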
After following the documentation at https://github.com/fusor/mig-controller#quick-start, I am getting errors:
E0603 10:12:08.953346 1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.Secret: couldn't get version/kind; json parse error: invalid character 'T' after top-level value
{"level":"error","ts":1559556729.5387843,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"migplan-controller","request":"mig/migplan-sample","error":"secrets \"cloud-credentials\" already exists","stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
{"level":"error","ts":1559556747.359269,"logger":"kubebuilder.controller","msg":"Reconciler error","controller":"migplan-controller","request":"mig/migplan-sample","error":"secrets \"cloud-credentials\" already exists","stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217\ngithub.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134\ngithub.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
E0603 10:14:58.950502 1 reflector.go:134] sigs.k8s.io/controller-runtime/pkg/cache/internal/informers_map.go:126: Failed to list *v1.Secret: couldn't get version/kind; json parse error: invalid character 'T' after top-level value
MigPlan Status
status:
  conditions:
  - category: Required
    lastTransitionTime: 2019-06-03T10:12:19Z
    message: The `persistentVolumes` list has been updated with discovered PVs.
    reason: Done
    status: "True"
More status needs to be reported:
OpenShift cluster version
Our MigCluster CR holds a reference to a cluster-registry Cluster, which has a spot for the user to input a caBundle.
We are currently ignoring any caBundle input provided and always establish "insecure" connections to remote clusters.
I think it would be best to switch over to "opt-in to running without a caBundle", meaning that we would require the user to turn on an insecure: true bool on the MigCluster if they don't want to use a caBundle.
My thinking is that if we automatically use insecure mode when a user doesn't provide a caBundle, they may not realize they are operating insecurely, which would be undesirable. We should make the user take a conscious step to operate insecurely.
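A minimal validation sketch under that proposal (the insecure field is the proposed addition, not an existing API):

package main

import "errors"

// validateTLS is a sketch: require a caBundle unless the user explicitly
// opted in to insecure connections on the MigCluster.
func validateTLS(caBundle []byte, insecure bool) error {
	if len(caBundle) == 0 && !insecure {
		return errors.New("no caBundle provided; set insecure: true to connect without one")
	}
	return nil
}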
From @jmontleon's and @dymurray's investigations:
Velero's restic implementation mounts the 'pods' directory to do backups and restores.
In OpenShift 3 this directory is at /var/lib/origin/openshift.local.volumes/pods/ and has an SELinux context of system_u:object_r:container_file_t:s0.
In OpenShift 4 the directory is now in the default Kubernetes location, /var/lib/kubelet/pods, which is not a problem, but the SELinux context now appears to default to system_u:object_r:var_lib_t:s0.
Running privileged seems to get us working; we need to learn whether there is a better path forward:
oc adm policy add-scc-to-user privileged system:serviceaccount:velero:velero
oc patch -n velero ds restic -p '{"spec":{"template":{"spec":{"containers":[{"name": "velero", "securityContext":{"privileged":true}}]}}}}'
Assuming that for Stage we want to cancel any running operations.
For Migrate, the ideal case would be that we cancel any running operations and restore the associated namespaces on the source cluster to the same state as prior to the Migrate running. Essentially, we'd want to cancel the running Migrate operations and roll back any changes on the source side.
Completed migrations should no longer have the Ready condition. The Ready condition indicates that a migration is ready to run, so it makes sense that it loses the condition after completion.
Possibly related to #107, but I noticed the deploy script for velero creates some initial cloud credentials with a secret that decodes to:
[default]
aws_access_key_id=
aws_secret_access_key=
Shouldn't the controller check for the presence of this, and if it doesn't already exist, create it? Why create empty credentials?
Since the MigStage and MigMigration objects share a need to validate planRefs, we should break the Plan validation out into a common package.
We should change https://github.com/fusor/mig-controller/blob/master/pkg/controller/migmigration/validation.go#L63 to take a MigPlanRef as an input and have a common validation pkg.
This message doesn't go away after the namespace in question is created.
status:
  conditions:
  - category: Critical
    lastTransitionTime: 2019-05-22T18:34:11Z
    message: Namespaces [nginx-example] not found on the source cluster.
    reason: NotFound
    status: "True"
    type: NamespaceNotFoundOnSourceCluster
For each migration plan, the controller runs migrations serially. The goal is to prevent migrations from doing conflicting things on the clusters involved, or otherwise allowing the migrations to interfere with each other. However, I don't think this approach alone is sufficient. Currently, multiple plans can be created for the same source and destination clusters, and with overlapping namespaces. This defeats the migration (run) orchestration. Running 2 migrations concurrently on different plans with the same source and destination clusters is effectively the same as running 2 migrations concurrently on the same plan.
It's my understanding that the use case for multiple plans with the same source & destination cluster is to support migrating workloads by groups of namespaces in sequential passes/efforts.
Let's start with a complete understanding of the conflict.
There are 2 options for preventing this (that I can think of).
Option 1 seems like the best option, assuming it does not block any known use cases.
Option 1: Add a validation that will block running migrations on (open) plans that have the same source or destination cluster (see the sketch after this list). The validation would set a Conflict condition that will prevent the Ready condition, which will block migrations on overlapping plans. Would this block any known use cases?
Pros:
Cons:
Option 2: Enhance the migration (run) orchestration to run migrations serially across plans that have the same source or destination cluster.
Pros:
Cons:
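A sketch of the Option 1 conflict check (planRefs is an illustrative reduction of MigPlan, not the actual type):

package main

// planRefs is a hypothetical reduction of MigPlan to the fields the check needs.
type planRefs struct {
	Name, SrcCluster, DestCluster string
	Closed                        bool
}

// conflictsWith reports whether two open plans share a source or destination
// cluster; a match would set a Conflict condition and block the Ready condition.
func conflictsWith(a, b planRefs) bool {
	if a.Closed || b.Closed || a.Name == b.Name {
		return false
	}
	return a.SrcCluster == b.SrcCluster || a.DestCluster == b.DestCluster
}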
Related to #156
Seeing this on every migration I run from OCP4->OCP4.
@sseago I think this is from the registry creation code. Doesn't seem to stop migration from completing but would be good to either re-arrange logic or ignore this type of error so it doesn't pollute logs.
I think the problem here is that once a Service is defined, you are not allowed to modify its clusterIP, so we need to treat that field as immutable in our reconcile logic.
{"level":"info"
"ts":1558560727.27501
"logger":"migration|2f8nr"
"msg":"Migration: resumed."
"name":"migmigration-sample"
"phase":"RestoreStarted"}
{"level":"info"
"ts":1558560727.3467073
"logger":"migration|2f8nr"
"msg":"Migration: waiting."
"name":"migmigration-sample"
"phase":"BackupReplicated"
"backup":"migmigration-sample-8gbmh"
"restore":""}
{"level":"error"
"ts":1558560727.3468246
"logger":"kubebuilder.controller"
"msg":"Reconciler error"
"controller":"migmigration-controller"
"request":"mig/migmigration-sample"
"error":"Service \"migplan-sample-registry-p9k57\" is invalid: spec.clusterIP: Invalid value: \"\": field is immutable"
"stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
This is related to #151
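One way to treat the field as immutable in reconcile (a sketch, assuming we fetch the live Service before updating the desired one):

package main

import corev1 "k8s.io/api/core/v1"

// preserveImmutable carries the immutable/server-assigned fields of the live
// Service over to the desired one before Update, so the API server doesn't
// reject the change with "spec.clusterIP: ... field is immutable".
func preserveImmutable(desired, live *corev1.Service) {
	desired.Spec.ClusterIP = live.Spec.ClusterIP
	desired.ResourceVersion = live.ResourceVersion
}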
I have OCP v3.11.104 and OCP v4.1.0 environments deployed with agnosticd. I am trying to migrate an MsSQL server app (with an NFS PV) deployed on OCP 3. I have the controller deployed on the v4 side.
Problem
The backup stage completes without errors. However, the restic restore on the destination is not working. The pod gets restored, but is stuck in Pending state. The migration is stuck at the CreateStagePodsStarted phase. The Velero restore is in the InProgress phase.
Identified Cause
Whenever a pod to be restored has an associated PV (the copy action, in our case), Velero adds an init container to that pod at the time of restore. Velero then restores the pod. The init container waits until the Restic restore process is completed; on the other hand, the restic process does not start until the init container is started. In an ideal scenario, as soon as the pod is restored, the init container runs and triggers the restic restore process. The restic restore process finishes, the init container exits, and the pod comes up with the PV attached to it.
In my case, the pod which got backed up had a nodeSelector looking for the label node-role.kubernetes.io/compute: "true". This label is not present on any of the OCP4 nodes in my setup. At the time of restore, the pod got restored with the nodeSelector. Since the label was absent, the pod never got scheduled. Since it never got scheduled, the init container never started. Since the init container never started, the Restic process never started. And the migration is stuck waiting for the restore to finish.
Possible Solution
The migration plugin could replace the old OCP3 compute labels with the new OCP4 worker labels (a sketch follows).
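A sketch of what the plugin-side remap might look like (label names follow the issue; the empty worker value matches the usual OCP4 node label):

package main

// remapNodeSelector is a hypothetical sketch of the plugin fix: swap the OCP3
// compute role selector for the OCP4 worker role label on restored pods.
func remapNodeSelector(nodeSelector map[string]string) map[string]string {
	if _, found := nodeSelector["node-role.kubernetes.io/compute"]; found {
		delete(nodeSelector, "node-role.kubernetes.io/compute")
		nodeSelector["node-role.kubernetes.io/worker"] = ""
	}
	return nodeSelector
}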
Fix
The fix needs to be made in the plugin; follow fusor/ocp-velero-plugin#61.
The current policy in the controller is that raised errors are bubbled up to the main Reconcile(), which returns them to controller-runtime (the caller). This was based on the assumption that an error carries the original stack trace. That does not seem to be true; as a result, the logged stack trace is not very useful.
I propose that the controllers log.Error() where the error is originally detected (omitting conflict errors). This will ensure the stack trace is useful. Then, Reconcile() will return reconcile.Result{Requeue: true}, nil when an error is encountered. To keep the code clean, we may want a custom logger that performs an IsConflict(err) test (see the sketch below).
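A sketch of the proposed pattern with a conflict-aware logging helper (names are illustrative):

package main

import (
	k8serrors "k8s.io/apimachinery/pkg/api/errors"
	logf "sigs.k8s.io/controller-runtime/pkg/log"
	"sigs.k8s.io/controller-runtime/pkg/reconcile"
)

var log = logf.Log.WithName("migration")

// logError logs at the detection site so the stack trace is useful, but
// stays quiet for expected optimistic-concurrency conflicts.
func logError(err error) {
	if k8serrors.IsConflict(err) {
		return
	}
	log.Error(err, "")
}

// requeueOnError shows the Reconcile() ending: requeue instead of returning
// the error to controller-runtime.
func requeueOnError(err error) (reconcile.Result, error) {
	if err != nil {
		logError(err)
		return reconcile.Result{Requeue: true}, nil
	}
	return reconcile.Result{}, nil
}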
Need to confirm our API can handle Azure configuration along with S3, and test compatibility with Velero.
Currently the reported pv objects look like this when written back to the MigPlan:
persistentVolumes:
- name: pv1
  supportedActions:
  - copy
  - move
The UI requires a number of additional fields for us to render the PV tables as shown in the mocks. The following fields are missing (a sketch of the extended entry follows this list):
namespace - the namespace the PV was found in
size - size of the PV
claim - associated PVC name
storageClass - associated storageClass, although we'll probably need to think about this more in depth given the issues that Dylan raised around differing storage classes in two different clusters.
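A sketch of the extended persistentVolumes entry (field names and json tags are assumptions, not a settled API):

package main

// PV sketches the reported persistentVolumes entry with the extra fields the
// UI needs; the names here are illustrative.
type PV struct {
	Name             string   `json:"name"`
	Namespace        string   `json:"namespace"`        // namespace the PV was found in
	Size             string   `json:"size"`             // size of the PV, e.g. "10Gi"
	Claim            string   `json:"claim"`            // associated PVC name
	StorageClass     string   `json:"storageClass"`     // may differ between clusters
	SupportedActions []string `json:"supportedActions"` // e.g. copy, move
}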
The current MigMigration CRD has a stage ref, but migrations can be run without a stage. This should reference a MigPlan instead.
Through Slack discussion, it became clear that we need something to create a MigCluster that represents the cluster hosting the controller and the UI. Both will need to be able to distinguish it from MigCluster objects that have been added by users, because it should be immutable in the UI, and the controller will construct a client for this cluster differently than if it were a user-added cluster with an associated ServiceAccount token. We settled on a few conclusions:
It will be the "migration operator's" responsibility to ensure that a MigCluster representing the host cluster is created during installation, and that MigCluster should be considered immutable (the operator will enforce it exists unchanged).
We should add an IsHostCluster boolean field to the MigCluster CRD that will act as a flag. The controller should validate that only one of these exists in a Ready state at any given time, and it should default to false (a sketch follows).
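A sketch of the uniqueness validation (MigClusterInfo is an illustrative reduction; the real check would look at the IsHostCluster field and the Ready condition):

package main

// MigClusterInfo is a hypothetical reduction of a MigCluster's state.
type MigClusterInfo struct {
	IsHostCluster bool
	Ready         bool
}

// countReadyHostClusters supports the validation: more than one Ready host
// cluster should raise a critical condition.
func countReadyHostClusters(clusters []MigClusterInfo) int {
	n := 0
	for _, c := range clusters {
		if c.IsHostCluster && c.Ready {
			n++
		}
	}
	return n
}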
Have been seeing several instances of "cloud-credentials" already exists in rapid succession during migrations recently. @jortel I think this is an unintended side-effect of the ensureStorage pieces that have recently been added.
I've never seen this block a migration, but it would be nice to fix the underlying issue, or silence the error message if we don't care about it (see the sketch after the log below).
{"level":"info"
"ts":1558560519.166244
"logger":"cluster|6fd7s"
"msg":"[rWatch] Starting manager"}
{"level":"info"
"ts":1558560519.1662652
"logger":"cluster|6fd7s"
"msg":"[rWatch] Manager started"}
{"level":"info"
"ts":1558560519.1662774
"logger":"cluster|6fd7s"
"msg":"Remote watch started."
"cluster":"migcluster-local"}
{"level":"error"
"ts":1558560525.3750854
"logger":"kubebuilder.controller"
"msg":"Reconciler error"
"controller":"migplan-controller"
"request":"mig/migplan-sample"
"error":"secrets \"cloud-credentials\" already exists"
"stacktrace":"github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr.(*zapLogger).Error
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/github.com/go-logr/zapr/zapr.go:128
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:217
github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func1
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:158
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil.func1
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:133
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.JitterUntil
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:134
github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait.Until
/home/dwhatley/dev/GO/src/github.com/fusor/mig-controller/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
Need to keep track of the sigStopChan and build capability to send a stop signal to cleanly exit:
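A sketch of what the tracking might look like, assuming each remote watch keeps a stop channel keyed by cluster (all names here are hypothetical):

package main

import "sync"

// remoteWatches keeps each remote watch's stop channel so we can signal a
// clean shutdown when a cluster is deleted or its watch must restart.
type remoteWatches struct {
	mu    sync.Mutex
	stops map[string]chan struct{} // keyed by cluster name
}

func newRemoteWatches() *remoteWatches {
	return &remoteWatches{stops: map[string]chan struct{}{}}
}

func (r *remoteWatches) add(cluster string, stop chan struct{}) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.stops[cluster] = stop
}

// stop closes the channel (the sigStopChan) so the watch manager exits cleanly.
func (r *remoteWatches) stop(cluster string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if ch, found := r.stops[cluster]; found {
		close(ch)
		delete(r.stops, cluster)
	}
}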
Manual management of reference relationships is error prone. Since predicates can see both the old and new object when changes occur, we should be able to reliably attach and break reference relationships from predicates.
Example of manual management of relationships in reconcile: https://github.com/fusor/mig-controller/blob/master/pkg/controller/migcluster/migcluster_controller.go#L142-L150
Example predicate: https://github.com/fusor/mig-controller/blob/master/pkg/controller/migplan/predicate.go
Functions for managing reference map: https://github.com/fusor/mig-controller/blob/master/pkg/util/resource_parent_map.go
We want to ensure that for a given MigrationPlan only one action may be running at any point in time.
For example, if a Stage is running, another Stage or Migrate will not be run in parallel.
Perform Restic restart (mount-propagation workaround) only when cluster is OCP 3.7-3.9.
Currently Velero (and by extension the controller) will not migrate SCC profiles across clusters, since these are cluster-scoped resources that don't have any link to Kube resources. As a first pass, we should provide documentation to the end user that they will need to manually migrate custom SCC profiles to the destination cluster prior to the migration of their applications.
Down the line, we should add support in the UI for a user to select custom SCC profiles they wish to migrate, and the controller should set the SCCs in the IncludedResources field.
As of the 1.0 release, it's possible for a Backup or a Restore to complete with a PartiallyFailed state, in addition to Failed or Completed. At this point, I think we should probably treat a PartiallyFailed backup or restore the same as a Failed one. The two examples I've come across today are ones that should definitely fail the action. We may want to eventually be more granular and dig into the partial-failure reasons to distinguish fatal from non-fatal errors, but for now setting task.Phase to RestoreFailed or BackupFailed should be sufficient.
Where we currently check for velero.RestorePhaseFailed or velero.BackupPhaseFailed, we should also check for velero.RestorePhasePartiallyFailed or velero.BackupPhasePartiallyFailed.
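A sketch of the widened checks, assuming the velero v1 API package's phase constants (the helper names are illustrative):

package main

import velero "github.com/heptio/velero/pkg/apis/velero/v1"

// backupFailed treats PartiallyFailed the same as Failed when deciding
// whether to set task.Phase to BackupFailed.
func backupFailed(phase velero.BackupPhase) bool {
	return phase == velero.BackupPhaseFailed || phase == velero.BackupPhasePartiallyFailed
}

// restoreFailed does the same for restores.
func restoreFailed(phase velero.RestorePhase) bool {
	return phase == velero.RestorePhaseFailed || phase == velero.RestorePhasePartiallyFailed
}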
As a result of a discussion with Paul Morie, it turns out Federation had an API review, and it was suggested that their current usage of cluster-registry exposed them to a potential security escalation; hence they've removed usage of cluster-registry for the time being and are keeping the cluster details in their own resource. The recommendation is for us to do similarly: REMOVE cluster-registry usage, move the cluster details into MigCluster, and just use MigCluster with secretRefs for now.
The plugin depends on annotations to function properly. The list of annotations to set is here: https://github.com/fusor/ocp-velero-plugin/blob/master/velero-plugins/common/types.go#L28
To solve this, I think it makes the most sense to add an Annotations field to the Task struct that the stage and migration controllers can set before running Task.Run().
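A sketch of the Task change (existing fields elided; the plumbing inside Run() is illustrative):

package main

// Task sketches the proposed change: carry the plugin's required annotations
// so the stage and migration controllers can set them before Task.Run().
type Task struct {
	// ... existing fields elided ...
	Annotations map[string]string
}

// Run would apply t.Annotations to the Backup/Restore resources it creates,
// so the ocp-velero-plugin finds the annotations it depends on.
func (t *Task) Run() error {
	// ... create Backup/Restore with t.Annotations applied ...
	return nil
}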