keikoproj / instance-manager
Create and manage instance groups with Kubernetes
License: Apache License 2.0
We currently add the node role label with the following format:
node-role.kubernetes.io/<roleName>=""
in 1.15, --node-labels in the 'kubernetes.io' namespace must begin with an allowed prefix
node.kubernetes.io/role=<role>
We should address this once 1.15 is GA and find a backwards compatible approach for changing this.
The controller should accept a new flag --cluster-name, which would be the cluster the controller is operating under.
This will make .spec.configuration.clusterName optional, which will make things easier for users.
We can add the flag (not mandatory) in the next release and still keep .spec.configuration.clusterName around as an override, in case someone wants to run the controller in one cluster for provisioning in another cluster.
The flag will act as a default.
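A minimal sketch of the proposed precedence, assuming a hypothetical resolveClusterName helper (not the controller's actual code):

```go
package main

import "fmt"

// resolveClusterName sketches the proposed precedence: the per-resource
// .spec.configuration.clusterName, when set, overrides the controller-level
// --cluster-name flag, which acts only as a default.
func resolveClusterName(flagValue, specValue string) (string, error) {
	if specValue != "" {
		return specValue, nil // explicit per-resource override wins
	}
	if flagValue != "" {
		return flagValue, nil // fall back to the controller-wide default
	}
	return "", fmt.Errorf("clusterName must be set via --cluster-name or .spec.configuration.clusterName")
}

func main() {
	name, _ := resolveClusterName("default-cluster", "")
	fmt.Println(name) // flag value used as default
	name, _ = resolveClusterName("default-cluster", "other-cluster")
	fmt.Println(name) // spec value overrides
}
```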
When deleting an IG resource while its stack is still creating, it might end up in an error state due to:
2020-03-04T20:11:29.809Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "instancegroup", "request": "instance-manager/bdd-test-rolling"}
time="2020-03-04T20:11:29Z" level=info msg="reconcile started"
time="2020-03-04T20:11:29Z" level=info msg="resource document version: 35158460"
time="2020-03-04T20:11:29Z" level=info msg="resource namespace: instance-manager"
time="2020-03-04T20:11:29Z" level=info msg="resource name: bdd-test-rolling"
time="2020-03-04T20:11:29Z" level=info msg="resource provisioner: eks-cf"
time="2020-03-04T20:11:29Z" level=info msg="upgrade strategy: rollingupdate"
time="2020-03-04T20:11:36Z" level=info msg="state transitioned to: Init"
time="2020-03-04T20:11:36Z" level=info msg="starting cloud discovery"
time="2020-03-04T20:11:44Z" level=info msg="stack state: CREATE_COMPLETE"
time="2020-03-04T20:11:44Z" level=info msg="starting state discovery"
time="2020-03-04T20:11:44Z" level=info msg="state transitioned to: InitDelete"
time="2020-03-04T20:11:44Z" level=info msg="starting delete"
time="2020-03-04T20:11:44Z" level=error msg="failed to remove role from aws-auth configmap: could not find rolemap"
time="2020-03-04T20:11:44Z" level=info msg="could not find rolemap"
time="2020-03-04T20:11:44Z" level=info msg="state transitioned to: Error"
time="2020-03-04T20:11:44Z" level=error msg="could not find rolemap"
time="2020-03-04T20:11:44Z" level=error msg="reconcile failed"
Orkaproj has been renamed to keikoproj; we need to reflect this in all APIs and the codebase.
Right now when a stack is in an error state, there is no way of knowing why it got there without looking at the controller logs or the CF stack status.
We should collect this information and possibly place it under .status.message, and/or publish kubernetes events to describe what happened.
Looks like our CFN template is using Min for the desired parameter:
This works on the create flow, but update should never modify desired; if it's mandatory we should GET the current desired value and use that.
Provisioners should publish the following events:
instance-manager should support managedPolicies in InstanceGroup custom resource.
This should add managed policies to the role that is created for the instance group.
We are currently using Go templating to render a cloudformation template that lives in the controller config map.
It could be more elegant to generate this template from code using goformation.
https://github.com/awslabs/goformation
We should consider switching to it.
It seems that when the instance-profile name is different from the role name, and we use .spec.roleName, the IG stack will get stuck as CREATE_IN_PROGRESS forever (or until it times out).
This happens because the instance-profile does not exist.
We should also take in .spec.instanceProfileName to support this use case.
The Dockerfile is building the image using golang 1.12; we should move to 1.13.
In order to support IRSA (https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html), we should update the AWS SDK to a compatible version.
The current stack name is constructed as:
CLUSTERNAME-NAMESPACE-GROUPNAME
We should change it to:
<CONFIGURABLE_PREFIX>-CLUSTERNAME-GROUPNAME
The prefix would be configurable via stackNamePrefix.
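The change can be sketched as follows; buildStackName and its fallback behavior are illustrative assumptions, not the controller's implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// buildStackName sketches the proposed naming scheme:
// <CONFIGURABLE_PREFIX>-CLUSTERNAME-GROUPNAME, falling back to the current
// CLUSTERNAME-NAMESPACE-GROUPNAME format when no prefix is configured.
func buildStackName(prefix, cluster, namespace, group string) string {
	if prefix != "" {
		return strings.Join([]string{prefix, cluster, group}, "-")
	}
	return strings.Join([]string{cluster, namespace, group}, "-")
}

func main() {
	// current format
	fmt.Println(buildStackName("", "my-cluster", "instance-manager", "bdd-test"))
	// proposed format with a configured prefix
	fmt.Println(buildStackName("im", "my-cluster", "instance-manager", "bdd-test"))
}
```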
Looks like instance-manager is stripping out user permissions from the aws-auth configmap.
For example, given a configmap:
apiVersion: v1
data:
  mapRoles: |
    - rolearn: arn:aws:iam::1234567890:role/a-valid-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
    - rolearn: arn:aws:iam::1234567890:role/an-invalid-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
    - rolearn: arn:aws:iam::1234567890:role/a-valid-user
      username: user:some-user
      groups:
        - system:masters
kind: ConfigMap
metadata:
  annotations:
  name: aws-auth
  namespace: kube-system
After reconciling, instance-manager will strip out the last two mapRoles.
We should have some mechanism of NOT stripping out user access; possibly we can look at .username, and if it starts with user:, not touch that entry.
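The proposed filter could look roughly like this; mapRole and nodeRolesOnly are simplified stand-ins for the actual types, not the controller's code:

```go
package main

import (
	"fmt"
	"strings"
)

// mapRole is a simplified stand-in for an aws-auth mapRoles entry.
type mapRole struct {
	RoleARN  string
	Username string
}

// nodeRolesOnly returns only the entries instance-manager should manage,
// skipping anything whose username starts with "user:" so user access is
// never stripped during reconcile.
func nodeRolesOnly(roles []mapRole) []mapRole {
	managed := make([]mapRole, 0, len(roles))
	for _, r := range roles {
		if strings.HasPrefix(r.Username, "user:") {
			continue // user-managed entry: leave it alone
		}
		managed = append(managed, r)
	}
	return managed
}

func main() {
	roles := []mapRole{
		{RoleARN: "arn:aws:iam::1234567890:role/a-valid-node-role", Username: "system:node:{{EC2PrivateDNSName}}"},
		{RoleARN: "arn:aws:iam::1234567890:role/a-valid-user", Username: "user:some-user"},
	}
	fmt.Println(len(nodeRolesOnly(roles))) // only the node role is managed
}
```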
Document all features that went into 0.4.0.
Consider an additional doc that describes the various features, rather than using the walkthrough to describe them.
Is this a BUG REPORT or FEATURE REQUEST?:
Bug
What happened:
Not able to create an instance group using the latest release (0.5.0) and the provided sample yaml spec.
What you expected to happen:
We should be able to create an instance group using the sample yaml.
How to reproduce it (as minimally and precisely as possible):
Using the latest release (0.5.0):
Create an instance group using the instructions in the doc: https://github.com/keikoproj/instance-manager/tree/master/docs#create-an-intancegroup-object
Look at the controller log and you will see the error.
Other debugging information (if applicable):
time="2020-03-24T22:07:53Z" level=error msg="failed to load cloudformation configuration: template: InstanceGroup:204:38: executing \"InstanceGroup\" at <.Spec.AwsUpgradeStrategy.RollingUpgradeType.MinInstancesInService>: can't evaluate field MinInstancesInService in type *v1alpha1.RollingUpgradeStrategy"
time="2020-03-24T22:07:53Z" level=error msg="template: InstanceGroup:204:38: executing \"InstanceGroup\" at <.Spec.AwsUpgradeStrategy.RollingUpgradeType.MinInstancesInService>: can't evaluate field MinInstancesInService in type *v1alpha1.RollingUpgradeStrategy"
time="2020-03-24T22:07:53Z" level=info msg="reconcile completed with requeue"
I believe this is caused by this change: 5a6f76d#diff-fe058806af3b4349a786152df5cbc36cR85. Since RollingUpgradeType is changed to a pointer, it will now be nil when we don't specify the rollingUpdate block in the yaml spec (rather than resolving to an empty struct with default values). I verified this by creating an instance group using a yaml spec that contains the rollingUpdate block, and that worked. Example:
cat <<EOF | kubectl create -f -
apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroup
metadata:
  name: hello-world
  namespace: instance-manager
spec:
  provisioner: eks-cf
  strategy:
    type: rollingUpdate
    rollingUpdate:
      maxBatchSize: 0
      minInstancesInService: 0
      minSuccessfulInstancesPercent: 0
  eks-cf:
    maxSize: 6
    minSize: 3
    configuration:
      clusterName: $EKS_CLUSTER_NAME
      vpcId: $VPC_ID
      subnets: $NODE_GROUP_SUBNET_LIST
      keyPairName: $KEYPAIR_NAME
      image: $NODE_AMI_ID
      instanceType: m5.large
      volSize: 20
      securityGroups:
        - $NODE_SECURITY_GROUP
      tags:
        - key: k8s.io/cluster-autoscaler/enabled
          value: "true"
        - key: hello
          value: world
EOF
So it looks like rollingUpdate is now a mandatory field. Is that what we really want? Users shouldn't have to specify it if they're ok with the default values.
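One possible fix is to default the nil pointer before the template is executed; the struct below is a simplified stand-in for the v1alpha1 type, and the default values are assumptions for illustration:

```go
package main

import "fmt"

// RollingUpgradeStrategy loosely mirrors the spec fields; it is a
// simplified stand-in for the actual v1alpha1 type.
type RollingUpgradeStrategy struct {
	MaxBatchSize                  int
	MinInstancesInService         int
	MinSuccessfulInstancesPercent int
}

// withDefaults sketches the fix: when the rollingUpdate block is omitted
// from the yaml, the pointer is nil, so substitute a struct with sane
// defaults instead of letting the template dereference nil.
func withDefaults(s *RollingUpgradeStrategy) *RollingUpgradeStrategy {
	if s == nil {
		return &RollingUpgradeStrategy{
			MaxBatchSize:                  1,
			MinInstancesInService:         1,
			MinSuccessfulInstancesPercent: 100,
		}
	}
	return s
}

func main() {
	var missing *RollingUpgradeStrategy // user omitted rollingUpdate
	fmt.Println(withDefaults(missing).MaxBatchSize)
}
```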
Switching to https://github.com/keikoproj/aws-auth will mean we can get rid of a lot of duplicate code in instance-manager.
Currently, the sub-CR discovery stage (part of implementing concurrencyPolicy=Forbid) watches ALL sub-CRs in the namespace to determine whether it should requeue.
This works when there is a single instance group reconciling; however, when there are multiple, they may be blocked by sub-CRs belonging to other ASGs.
A quick solution is to annotate the sub-CR with the ASG value instead of a simple ownership annotation.
When discovery happens we should filter for sub-CRs belonging only to that ASG scope.
Currently, instance-manager creates the IAM role for the instance group as part of the cloudformation template, which means the controller pods need iam:* permissions.
Some users may find it inappropriate to allow arbitrary IAM role creation from the controller.
We should allow taking .spec.eks-cf.configuration.role, a pre-created IAM role that would be attached to the scaling group instead of a created one.
This should override other options such as managedPolicies.
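A hypothetical spec using the proposed field might look like this (the field shape follows the path named above, and the ARN is a placeholder):

```yaml
spec:
  provisioner: eks-cf
  eks-cf:
    configuration:
      # pre-created IAM role attached to the scaling group; when set, the
      # controller would not create a role and would ignore managedPolicies
      role: arn:aws:iam::1234567890:role/pre-created-node-role
```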
Need to add a badge that links to the orkaproj public slack channel.
When two instance groups use the same provided role, deleting one instance group will remove the role from aws-auth, leading to nodes going NotReady.
We should emphasize in the docs that every instance group should have its own role.
Alternatively, we can get all roles of IGs on the cluster and validate them whenever we call bootstrap; this would re-add a role if it's removed by a deleting IG.
In addition, when nodes are NotReady the BootstrapNodes() code cannot be reached, so if the reason for nodes not being ready is bootstrapping, they will be stuck forever.
We should call BootstrapNodes in the update flow, before UpdateNodeReadyCondition.
instance-manager should support launch templates.
This can be done as a separate kind, e.g. Kind: InstanceGroupTemplate (or different), which can then be referenced in InstanceGroup, e.g. via .spec.provisioner.configuration.template.
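A hypothetical shape for this (neither the kind nor the template reference field exists yet; both follow the names suggested above):

```yaml
apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroupTemplate
metadata:
  name: default-workers
  namespace: instance-manager
---
apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroup
metadata:
  name: hello-world
  namespace: instance-manager
spec:
  provisioner:
    configuration:
      # reference to the template above (proposed field)
      template: default-workers
```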
The EKS-CF provisioner will be deprecated in favor of the new EKS provisioner which does not use cloudformation for provisioning.
This approach was found to be faster and more reliable, and has less complex logic in terms of state handling.
It also removes the dependency on a cloudformation template and its versioning.
There is no real reason to maintain code for both provisioners when they set out to achieve the same goal.
The eks-managed provisioner needs to be refactored to pull its provisioner-specific logic from awsprovider into the provisioner package, mainly around the use of parameters and storing them inside AwsWorker instead of internally.
Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request
What happened:
Currently Travis CI only builds PRs, and pushes images post-merge. This is insufficient when wanting to test out a PR on a real cluster (or run the bdd test).
What you expected to happen:
Travis should build instance-manager:master and publish instance-manager:nightly as a result.

instance-manager should support KOPS-based clusters running on EC2.
In order to support spot instances using instance-manager, we will need to be able to reconcile based on events.
minion-manager (or other recommendation controllers) will emit events with reason SpotRecommendationGiven, and we should process the message and set the spot price on instance groups that have it enabled.
Current unit tests are becoming brittle. We should have a better design for unit tests (internal vs. external) and a better structure so that they don't break easily and are easy to extend.
This should also include some refactoring to make the code more testable.
We should also consider covering other packages, like the controller itself, which are currently not covered.
It would be nice if, pre-reconcile (i.e. before stack submission), we attempted to validate that requested objects exist, to prevent a subsequent cloudformation failure (fail fast).
For example, we should verify the following exist:
keyPairName
image
clusterName -- already being validated
subnets
securityGroups
Providing a wrong value in any of these will fail the stack and cause the behavior described in #49.
It seems that if a stack is in a finite error state (ROLLBACK_COMPLETE), deleting the CR will not delete the stack.
While it's good to leave the stack for a human to inspect, we should respond to the act of CR deletion, as this is an explicit operation.
Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
An InstanceGroup CR was modified such that new tags were added. This causes the instance-manager controller to reconcile. However, changes to tags do not change the launch-config. As a result instance-manager did not submit any rollingupgrade CRs.
What you expected to happen:
Instance-manager should submit CRs in such cases when tags are modified.
How to reproduce it (as minimally and precisely as possible):
Currently, the reconciler will play ping-pong with AWS when this happens.
Example:
Change an existing IG's KeyPair to a non-existent key pair.
This will start the CF update, which will then move to UPDATE_ROLLBACK_IN_PROGRESS because the key does not exist.
Once that is done, the stack is updatable again, so the stack will repeat this transition from UPDATE_IN_PROGRESS to UPDATE_ROLLBACK_IN_PROGRESS forever. This is correct behavior, but it should probably back off exponentially rather than keep trying to reconcile immediately.
For the case of UpdateRecoverableError, we should error out the instance group (with an appropriate message via #25) and re-queue with exponential backoff.
For example:
IG is changed > stack UPDATE_IN_PROGRESS > stack UPDATE_ROLLBACK_IN_PROGRESS > IG in error state with reason 'stack update failed' (or similar) > reconcile with re-queue (exponential).
There is a problem with the way we are parallelizing the functional test's update scenario.
We should make update scenario sequential for now and work on concurrency of tests down the road.
We should remove the vendor folder and all the vendored code; it makes the repo very heavy and PRs huge.
Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request
What happened:
Amazon announced Fargate support for EKS. Now pods can be scheduled via Fargate.
What you expected to happen:
Pods within a namespace (and optional tags) can be automatically scheduled to run via Fargate. Users should be able to create, modify and delete Fargate support.
Anything else we need to know?:
https://docs.aws.amazon.com/eks/latest/userguide/fargate.html
https://docs.aws.amazon.com/eks/latest/userguide/fargate-profile.html
We should add support in the EKS provisioner for enabling / disabling metric collection on the scaling group.
https://docs.aws.amazon.com/sdk-for-go/api/service/autoscaling/#AutoScaling.EnableMetricsCollection
https://docs.aws.amazon.com/sdk-for-go/api/service/autoscaling/#AutoScaling.DisableMetricsCollection
This can probably be added in the create/update flow given a new API, .spec.eks.configuration.metricsCollection, which is a list of metrics described here:
https://docs.aws.amazon.com/sdk-for-go/api/service/autoscaling/#EnableMetricsCollectionInput
If the list is empty, we should disable metrics collection.
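Under this proposal, a spec might look like the following (the field shape is assumed from the path above; the metric names are standard ASG group metrics):

```yaml
spec:
  provisioner: eks
  eks:
    configuration:
      # proposed field: list of scaling-group metrics to enable;
      # an empty or omitted list would disable metrics collection
      metricsCollection:
        - GroupMinSize
        - GroupMaxSize
        - GroupInServiceInstances
```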
Since we added #39, it now sets the rollingUpdate configuration even if you are using the CRD strategy.
For example, if you describe the YAML of an IG that has been using the CRD strategy you might see:
strategy:
  type: crd
  crd:
    crdName: rollingupgrades
    spec: |
      apiVersion: upgrademgr.orkaproj.io/v1alpha1
      kind: RollingUpgrade
      metadata:
        name: rollup-nodes
        namespace: instance-manager
      spec:
        postDrainDelaySeconds: 30
        nodeIntervalSeconds: 180
        region: {{ .ControllerRegion }}
        asgName: {{ .InstanceGroup.Status.ActiveScalingGroupName }}
        statusJSONPath: .status.currentStatus
        statusSuccessString: completed
  rollingUpdate:
    maxBatchSize: 1
    minInstancesInService: 1
    minSuccessfulInstancesPercent: 100
    pauseTime: PT5M
Setting these defaults when the user is using the crd strategy is just confusing; rollingUpdate should be set to {} or omitted, and populated only if type: rollingUpdate.
Scenario:
This seems to break things - we should have some sort of locking mechanism, or come up with an approach that accommodates the continuously converging nature of the controller vs. submitting workflows which are 'oneshot' or job-style executions.
There is currently no mechanism for forcing a node rotation when there are no configuration changes.
Ref #73
We should consider some mechanism to do this, possibly in the form of a label/annotation that can be added to trigger it.
Currently strategy: rollingUpdate uses some hard-coded default parameters in the cloudformation template:
{{$strategy := .Spec.AwsUpgradeStrategy.Type | ToLower}}
{{if (eq $strategy "rollingupdate")}}
UpdatePolicy:
  AutoScalingRollingUpdate:
    MinInstancesInService: '1'
    MaxBatchSize: '1'
{{end}}
However, the cloudformation API supports the following:
"UpdatePolicy" : {
"AutoScalingRollingUpdate" : {
"MaxBatchSize" : Integer,
"MinInstancesInService" : Integer,
"MinSuccessfulInstancesPercent" : Integer,
"PauseTime" : String,
"SuspendProcesses" : [ List of processes ],
"WaitOnResourceSignals" : Boolean
}
}
Caveats on parameters mentioned here:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html
Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
time="2019-08-31T03:51:07Z" level=info msg="stack state: CREATE_IN_PROGRESS"
What you expected to happen:
It should be possible to know which IG the log messages are for.
We should test scenarios with many instance groups and the effect on API limits.
We should probably explore caching via https://github.com/keikoproj/aws-sdk-go-cache -- this will give us better performance in those cases and also enable multiple worker threads on the reconciler.
It seems travis-ci has a hard limit on total job duration; since we changed to using upgrade-manager as a CRD strategy, the test now takes a few minutes longer, which may exceed this limit and fail.
We need to remove the env setup from the job (which takes 40+ minutes) and have a long-running cluster for testing purposes.
https://travis-ci.org/orkaproj/instance-manager/builds/571627175
Due to a bug in the test, I made the update of instance groups during the functional test sequential.
We should improve the polling mechanism and go back to using parallel updates; this will make the test faster and allow us to scale while staying within the travis limitation of 60m.
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
Currently, the IGs in EKS are provisioned using cloudformation templates (via the eks-cf provisioner). While CFN is great, it has a drawback: it prevents other components/tools from directly modifying the AWS resource or its dependencies. For example: (i) minion-manager directly modifies the launchconfig of an ASG to switch between spot and on-demand instances, and changing the instance type for scaling the instance size up/down would require a change in the launch-config. (ii) Adding/removing cloud labels (tags) is required for dynamic changes for cluster-autoscaler, AWS cost analysis tools, etc.
If direct changes are made to the ASG or the launch-config, it creates drift in the CFN stack.
What you expected to happen:
More of a question: Would it be easier if instance-manager had a new provisioner called eks-direct
(or something similar) that would directly use the AWS SDK to provision the ASGs? It would solve the above problem of working well with other tools. Another positive side-effect would also be that the ASGs will have deterministic (and prettier) names :-)
Thoughts?
Currently the spec does not allow enabling MetricsCollection.
We should probably add support for it.
More info in the API docs:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-metricscollection.html
Unit-test coverage is around 60%; we should get it above 75% at the very least.
The current BDD test submits an argo workflow that does nothing; instead we should submit a rollingupgrade via upgrade-manager.
Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request
What happened:
Can't find where to place the UserData to inject into the AutoScaling Group through CloudFormation.
What you expected to happen:
Be able to inject UserData into CloudFormation.
1.15, 1.16
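For reference, in CloudFormation userdata is typically set on the launch configuration resource; a minimal sketch (resource and parameter names here are illustrative, not instance-manager's template):

```yaml
Resources:
  NodeLaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: !Ref NodeImageId
      InstanceType: !Ref NodeInstanceType
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          set -o xtrace
          /etc/eks/bootstrap.sh ${ClusterName}
```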
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
AWS now supports managed worker nodegroups. These are not currently supported by instance-manager.
What you expected to happen:
It should be possible to create/read/update/delete these managed node groups as IGs.
Anything else we need to know?:
There are CFNs for provisioning of managed node group: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-eks-nodegroup.html
Other links:
Upgrades: https://docs.aws.amazon.com/eks/latest/userguide/update-managed-node-group.html