keikoproj / instance-manager
Create and manage instance groups with Kubernetes
License: Apache License 2.0
We currently add the node role label with the following format:
node-role.kubernetes.io/<roleName>=""
in 1.15, --node-labels in the 'kubernetes.io' namespace must begin with an allowed prefix
node.kubernetes.io/role=<role>
We should address this once 1.15 is GA and find a backwards compatible approach for changing this.
The controller should accept a new flag --cluster-name, which would be the cluster the controller is operating under.
This will make .spec.configuration.clusterName optional, which will make things easier for users.
We can add the flag (not mandatory) in the next release and still keep .spec.configuration.clusterName around as an override, in case someone wants to run the controller in one cluster for provisioning in another cluster.
The flag will act as a default.
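A minimal sketch of the proposed precedence, assuming a hypothetical resolveClusterName helper (not the controller's actual code):

```go
package main

import "fmt"

// resolveClusterName sketches the proposed precedence: the per-resource
// .spec.configuration.clusterName, when set, overrides the controller-level
// --cluster-name flag, which acts only as a default.
func resolveClusterName(flagValue, specValue string) (string, error) {
	if specValue != "" {
		return specValue, nil // explicit per-resource override wins
	}
	if flagValue != "" {
		return flagValue, nil // fall back to the controller-wide default
	}
	return "", fmt.Errorf("clusterName must be set via --cluster-name or .spec.configuration.clusterName")
}

func main() {
	name, _ := resolveClusterName("default-cluster", "")
	fmt.Println(name) // flag value used as default
	name, _ = resolveClusterName("default-cluster", "other-cluster")
	fmt.Println(name) // spec value overrides
}
```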
When deleting an IG resource while its stack is still creating, it might end up in an error state due to:
2020-03-04T20:11:29.809Z DEBUG controller-runtime.controller Successfully Reconciled {"controller": "instancegroup", "request": "instance-manager/bdd-test-rolling"}
time="2020-03-04T20:11:29Z" level=info msg="reconcile started"
time="2020-03-04T20:11:29Z" level=info msg="resource document version: 35158460"
time="2020-03-04T20:11:29Z" level=info msg="resource namespace: instance-manager"
time="2020-03-04T20:11:29Z" level=info msg="resource name: bdd-test-rolling"
time="2020-03-04T20:11:29Z" level=info msg="resource provisioner: eks-cf"
time="2020-03-04T20:11:29Z" level=info msg="upgrade strategy: rollingupdate"
time="2020-03-04T20:11:36Z" level=info msg="state transitioned to: Init"
time="2020-03-04T20:11:36Z" level=info msg="starting cloud discovery"
time="2020-03-04T20:11:44Z" level=info msg="stack state: CREATE_COMPLETE"
time="2020-03-04T20:11:44Z" level=info msg="starting state discovery"
time="2020-03-04T20:11:44Z" level=info msg="state transitioned to: InitDelete"
time="2020-03-04T20:11:44Z" level=info msg="starting delete"
time="2020-03-04T20:11:44Z" level=error msg="failed to remove role from aws-auth configmap: could not find rolemap"
time="2020-03-04T20:11:44Z" level=info msg="could not find rolemap"
time="2020-03-04T20:11:44Z" level=info msg="state transitioned to: Error"
time="2020-03-04T20:11:44Z" level=error msg="could not find rolemap"
time="2020-03-04T20:11:44Z" level=error msg="reconcile failed"
Orkaproj has been renamed to keikoproj; we need to reflect this in all APIs and the codebase.
Right now when a stack is in an error state, there is no way of knowing why it got there without looking at the controller logs or the CF stack status.
We should collect this information and possibly place it under .status.message, and/or publish kubernetes events to describe what happened.
Looks like our CFN template is using Min for the desired parameter:
This works on the create flow, but update should never modify desired; if it's mandatory we should GET the current desired value and use that.
Provisioners should publish the following events:
instance-manager should support managedPolicies in InstanceGroup custom resource.
This should add managed policies to the role that is created for the instance group.
We are currently using Go templating to render a cloudformation template that lives in the controller config map.
It could be more elegant to generate this template from code using goformation.
https://github.com/awslabs/goformation
We should consider switching to it.
It seems that when the instance-profile name is different from the role name, and we use .spec.roleName, the IG stack will get stuck as CREATE_IN_PROGRESS forever (or until it times out).
This happens because the instance-profile does not exist.
We should also take in .spec.instanceProfileName to support this use case.
The Dockerfile is building the image using golang 1.12; we should move to 1.13.
In order to support IRSA (https://docs.aws.amazon.com/eks/latest/userguide/iam-roles-for-service-accounts-minimum-sdk.html), we should update the AWS SDK to a compatible version.
The current stack name is constructed as:
CLUSTERNAME-NAMESPACE-GROUPNAME
We should change it to:
<CONFIGURABLE_PREFIX>-CLUSTERNAME-GROUPNAME
The prefix would be configurable via stackNamePrefix.
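The change can be sketched as follows; buildStackName and its fallback behavior are illustrative assumptions, not the controller's implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// buildStackName sketches the proposed naming scheme:
// <CONFIGURABLE_PREFIX>-CLUSTERNAME-GROUPNAME, falling back to the current
// CLUSTERNAME-NAMESPACE-GROUPNAME format when no prefix is configured.
func buildStackName(prefix, cluster, namespace, group string) string {
	if prefix != "" {
		return strings.Join([]string{prefix, cluster, group}, "-")
	}
	return strings.Join([]string{cluster, namespace, group}, "-")
}

func main() {
	// current format
	fmt.Println(buildStackName("", "my-cluster", "instance-manager", "bdd-test"))
	// proposed format with a configured prefix
	fmt.Println(buildStackName("im", "my-cluster", "instance-manager", "bdd-test"))
}
```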
Looks like instance-manager is stripping out user permissions from the aws-auth configmap.
For example, given a configmap:
apiVersion: v1
data:
  mapRoles: |
    - rolearn: arn:aws:iam::1234567890:role/a-valid-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
    - rolearn: arn:aws:iam::1234567890:role/an-invalid-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
    - rolearn: arn:aws:iam::1234567890:role/a-valid-user
      username: user:some-user
      groups:
        - system:masters
kind: ConfigMap
metadata:
  annotations:
  name: aws-auth
  namespace: kube-system
After reconciling, instance-manager will strip out the last two mapRoles.
We should have some mechanism of NOT stripping out user access; possibly we can look at .username, and if it starts with user:, not touch that entry.
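The proposed filter could look roughly like this; mapRole and nodeRolesOnly are simplified stand-ins for the actual types, not the controller's code:

```go
package main

import (
	"fmt"
	"strings"
)

// mapRole is a simplified stand-in for an aws-auth mapRoles entry.
type mapRole struct {
	RoleARN  string
	Username string
}

// nodeRolesOnly returns only the entries instance-manager should manage,
// skipping anything whose username starts with "user:" so user access is
// never stripped during reconcile.
func nodeRolesOnly(roles []mapRole) []mapRole {
	managed := make([]mapRole, 0, len(roles))
	for _, r := range roles {
		if strings.HasPrefix(r.Username, "user:") {
			continue // user-managed entry: leave it alone
		}
		managed = append(managed, r)
	}
	return managed
}

func main() {
	roles := []mapRole{
		{RoleARN: "arn:aws:iam::1234567890:role/a-valid-node-role", Username: "system:node:{{EC2PrivateDNSName}}"},
		{RoleARN: "arn:aws:iam::1234567890:role/a-valid-user", Username: "user:some-user"},
	}
	fmt.Println(len(nodeRolesOnly(roles))) // only the node role is managed
}
```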
Document all features that went into 0.4.0.
Consider an additional doc that describes the various features, rather than using the walkthrough to describe them.
Is this a BUG REPORT or FEATURE REQUEST?:
Bug
What happened:
Not able to create an instance group using the latest release (0.5.0) and the provided sample yaml spec.
What you expected to happen:
We should be able to create an instance group using the sample yaml.
How to reproduce it (as minimally and precisely as possible):
Using the latest release (0.5.0):
Create an instance group using the instructions in the doc: https://github.com/keikoproj/instance-manager/tree/master/docs#create-an-intancegroup-object
Look at the controller log and you will see the error.
Other debugging information (if applicable):
time="2020-03-24T22:07:53Z" level=error msg="failed to load cloudformation configuration: template: InstanceGroup:204:38: executing \"InstanceGroup\" at <.Spec.AwsUpgradeStrategy.RollingUpgradeType.MinInstancesInService>: can't evaluate field MinInstancesInService in type *v1alpha1.RollingUpgradeStrategy"
time="2020-03-24T22:07:53Z" level=error msg="template: InstanceGroup:204:38: executing \"InstanceGroup\" at <.Spec.AwsUpgradeStrategy.RollingUpgradeType.MinInstancesInService>: can't evaluate field MinInstancesInService in type *v1alpha1.RollingUpgradeStrategy"
time="2020-03-24T22:07:53Z" level=info msg="reconcile completed with requeue"
I believe this is caused by this change: 5a6f76d#diff-fe058806af3b4349a786152df5cbc36cR85. Since RollingUpgradeType is changed to a pointer, it will now be nil when we don't specify the rollingUpdate block in the yaml spec (rather than resolving to an empty struct with default values). I verified this by creating an instance group using a yaml spec that contains the rollingUpdate block, and that worked. Example:
cat <<EOF | kubectl create -f -
apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroup
metadata:
  name: hello-world
  namespace: instance-manager
spec:
  provisioner: eks-cf
  strategy:
    type: rollingUpdate
    rollingUpdate:
      maxBatchSize: 0
      minInstancesInService: 0
      minSuccessfulInstancesPercent: 0
  eks-cf:
    maxSize: 6
    minSize: 3
    configuration:
      clusterName: $EKS_CLUSTER_NAME
      vpcId: $VPC_ID
      subnets: $NODE_GROUP_SUBNET_LIST
      keyPairName: $KEYPAIR_NAME
      image: $NODE_AMI_ID
      instanceType: m5.large
      volSize: 20
      securityGroups:
        - $NODE_SECURITY_GROUP
      tags:
        - key: k8s.io/cluster-autoscaler/enabled
          value: "true"
        - key: hello
          value: world
EOF
So it looks like rollingUpdate is now a mandatory field. Is that what we really want? Users shouldn't have to specify it if they're ok with the default values.
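One possible fix is to default the nil pointer before the template is executed; the struct below is a simplified stand-in for the v1alpha1 type, and the default values are assumptions for illustration:

```go
package main

import "fmt"

// RollingUpgradeStrategy loosely mirrors the spec fields; it is a
// simplified stand-in for the actual v1alpha1 type.
type RollingUpgradeStrategy struct {
	MaxBatchSize                  int
	MinInstancesInService         int
	MinSuccessfulInstancesPercent int
}

// withDefaults sketches the fix: when the rollingUpdate block is omitted
// from the yaml, the pointer is nil, so substitute a struct with sane
// defaults instead of letting the template dereference nil.
func withDefaults(s *RollingUpgradeStrategy) *RollingUpgradeStrategy {
	if s == nil {
		return &RollingUpgradeStrategy{
			MaxBatchSize:                  1,
			MinInstancesInService:         1,
			MinSuccessfulInstancesPercent: 100,
		}
	}
	return s
}

func main() {
	var missing *RollingUpgradeStrategy // user omitted rollingUpdate
	fmt.Println(withDefaults(missing).MaxBatchSize)
}
```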
Switching to https://github.com/keikoproj/aws-auth will mean we can get rid of a lot of duplicate code in instance-manager.
Currently, the sub-CR discovery stage (part of implementing concurrencyPolicy=Forbid) watches ALL sub-CRs in the namespace to determine whether it should requeue.
This works when there is a single instance group reconciling; however, when there are multiple, they may be blocked by sub-CRs belonging to other ASGs.
A quick solution is to annotate the sub-CR with the ASG value instead of a simple ownership annotation.
When discovery happens we should filter for sub-CRs belonging only to that ASG scope.
Currently, instance-manager creates the IAM role for the instance group as part of the cloudformation template, which means the controller pods need iam:* permissions.
Some users may find it inappropriate to allow arbitrary IAM role creation from the controller.
We should allow taking .spec.eks-cf.configuration.role, a pre-created IAM role that would be attached to the scaling group instead of a created one.
This should override other options such as managedPolicies.
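A hypothetical spec using the proposed field might look like this (the field shape follows the path named above, and the ARN is a placeholder):

```yaml
spec:
  provisioner: eks-cf
  eks-cf:
    configuration:
      # pre-created IAM role attached to the scaling group; when set, the
      # controller would not create a role and would ignore managedPolicies
      role: arn:aws:iam::1234567890:role/pre-created-node-role
```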
Need to add a badge that links to the orkaproj public slack channel.
When two instance groups use the same provided role, deleting one instance group will remove the role from aws-auth, leading to nodes going NotReady.
We should emphasize in the docs that every instance group should have its own role.
Alternatively, we can get all roles of IGs on the cluster and validate them whenever we call bootstrap; this would re-add a role if it's removed by a deleting IG.
In addition, when nodes are NotReady the BootstrapNodes() code cannot be reached, so if the reason for nodes not being ready is bootstrapping, they will be stuck forever.
We should call BootstrapNodes in the update flow, before UpdateNodeReadyCondition.
instance-manager should support launch templates.
This can be done as a separate kind, e.g. Kind: InstanceGroupTemplate (or different), which can then be referenced in InstanceGroup, e.g. via .spec.provisioner.configuration.template.
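A hypothetical shape for this (neither the kind nor the template reference field exists yet; both follow the names suggested above):

```yaml
apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroupTemplate
metadata:
  name: default-workers
  namespace: instance-manager
---
apiVersion: instancemgr.keikoproj.io/v1alpha1
kind: InstanceGroup
metadata:
  name: hello-world
  namespace: instance-manager
spec:
  provisioner:
    configuration:
      # reference to the template above (proposed field)
      template: default-workers
```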
The EKS-CF provisioner will be deprecated in favor of the new EKS provisioner which does not use cloudformation for provisioning.
This approach was found to be faster and more reliable, and has less complex logic in terms of state handling.
It also removes the dependency on a cloudformation template and its versioning.
There is no real reason to maintain code for both provisioners when they set out to achieve the same goal.
The eks-managed provisioner needs to be refactored to pull its provisioner-specific logic from awsprovider into the provisioner package, mainly around the use of parameters and storing them inside AwsWorker instead of internally.
Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request
What happened:
Currently Travis CI only builds PRs, and pushes images post-merge. This is insufficient when wanting to test out a PR on a real cluster (or run the bdd test).
What you expected to happen:
Travis should build instance-manager:master and publish instance-manager:nightly as a result.

instance-manager should support KOPS-based clusters running on EC2.
In order to support spot instances using instance-manager, we will need to be able to reconcile based on events.
minion-manager (or other recommendation controllers) will emit events with reason SpotRecommendationGiven, and we should process the message and set the spot price on instance groups that have it enabled.
Current unit tests are becoming brittle. We should have a better design for unit tests (internal vs. external) and a better structure so that they don't break easily and are easy to extend.
This should also include some refactoring to make the code more testable.
We should also consider covering other packages, like the controller itself, which are currently not covered.
It would be nice if, pre-reconcile (i.e. before stack submission), we attempted to validate that requested objects exist, to prevent a subsequent cloudformation failure (fail fast).
For example, we should verify the following exist:
keyPairName
image
clusterName -- already being validated
subnets
securityGroups
Providing a wrong value in any of these will fail the stack and cause the behavior described in #49.
It seems that if a stack is in a finite error state (ROLLBACK_COMPLETE), deleting the CR will not delete the stack.
While it's good to leave the stack for a human to inspect, we should respond to the act of CR deletion, as this is an explicit operation.
Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
An InstanceGroup CR was modified such that new tags were added. This causes the instance-manager controller to reconcile. However, changes to tags do not change the launch-config. As a result instance-manager did not submit any rollingupgrade CRs.
What you expected to happen:
Instance-manager should submit CRs in such cases when tags are modified.
How to reproduce it (as minimally and precisely as possible):
Currently, the reconciler will play ping-pong with AWS when this happens.
Example:
Change an existing IG's KeyPair to a non-existent key pair.
This will start the CF update, which will then move to UPDATE_ROLLBACK_IN_PROGRESS because the key does not exist.
Once that is done, the stack is updatable again, so the stack will repeat this transition from UPDATE_IN_PROGRESS to UPDATE_ROLLBACK_IN_PROGRESS forever. This is correct behavior, but it should probably back off exponentially rather than keep trying to reconcile immediately.
For the case of UpdateRecoverableError, we should error out the instance group (with an appropriate message via #25) and re-queue with exponential backoff.
For example:
IG is changed > stack UPDATE_IN_PROGRESS > stack UPDATE_ROLLBACK_IN_PROGRESS > IG in error state with reason 'stack update failed' (or similar) > reconcile with re-queue (exponential).
There is a problem with the way we are parallelizing the functional test's update scenario.
We should make update scenario sequential for now and work on concurrency of tests down the road.
We should remove the vendor folder and all the vendored code; it makes the repo very heavy and PRs huge.
Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request
What happened:
Amazon announced Fargate support for EKS. Now pods can be scheduled via Fargate.
What you expected to happen:
Pods within a namespace (and optional tags) can be automatically scheduled to run via Fargate. Users should be able to create, modify and delete Fargate support.
Anything else we need to know?:
https://docs.aws.amazon.com/eks/latest/userguide/fargate.html
https://docs.aws.amazon.com/eks/latest/userguide/fargate-profile.html
We should add support in the EKS provisioner for enabling / disabling metric collection on the scaling group.
https://docs.aws.amazon.com/sdk-for-go/api/service/autoscaling/#AutoScaling.EnableMetricsCollection
https://docs.aws.amazon.com/sdk-for-go/api/service/autoscaling/#AutoScaling.DisableMetricsCollection
This can probably be added in the create/update flow given a new API, .spec.eks.configuration.metricsCollection, which is a list of metrics described here:
https://docs.aws.amazon.com/sdk-for-go/api/service/autoscaling/#EnableMetricsCollectionInput
If the list is empty, we should disable metrics collection.
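Under this proposal, a spec might look like the following (the field shape is assumed from the path above; the metric names are standard ASG group metrics):

```yaml
spec:
  provisioner: eks
  eks:
    configuration:
      # proposed field: list of scaling-group metrics to enable;
      # an empty or omitted list would disable metrics collection
      metricsCollection:
        - GroupMinSize
        - GroupMaxSize
        - GroupInServiceInstances
```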
Since we added #39, it now sets the rollingUpdate configuration even if you are using the CRD strategy.
For example, if you describe the YAML of an IG that has been using the CRD strategy you might see:
strategy:
  type: crd
  crd:
    crdName: rollingupgrades
    spec: |
      apiVersion: upgrademgr.orkaproj.io/v1alpha1
      kind: RollingUpgrade
      metadata:
        name: rollup-nodes
        namespace: instance-manager
      spec:
        postDrainDelaySeconds: 30
        nodeIntervalSeconds: 180
        region: {{ .ControllerRegion }}
        asgName: {{ .InstanceGroup.Status.ActiveScalingGroupName }}
        statusJSONPath: .status.currentStatus
        statusSuccessString: completed
  rollingUpdate:
    maxBatchSize: 1
    minInstancesInService: 1
    minSuccessfulInstancesPercent: 100
    pauseTime: PT5M
Setting these defaults when the user is using the crd strategy is just confusing; rollingUpdate should be set to {} or omitted, and populated only if type: rollingUpdate.
Scenario:
This seems to break things - we should have some sort of locking mechanism, or come up with an approach that accommodates the continuously converging nature of the controller vs. submitting workflows which are 'oneshot' or job-style executions.
There is currently no mechanism for forcing a node rotation when there are no configuration changes.
Ref #73
We should consider some mechanism to do this, possibly in the form of a label/annotation that can be added to trigger it.
Currently strategy: rollingUpdate uses some hard-coded default parameters in the cloudformation template:
{{$strategy := .Spec.AwsUpgradeStrategy.Type | ToLower}}
{{if (eq $strategy "rollingupdate")}}
UpdatePolicy:
  AutoScalingRollingUpdate:
    MinInstancesInService: '1'
    MaxBatchSize: '1'
{{end}}
However, the cloudformation API supports the following:
"UpdatePolicy" : {
"AutoScalingRollingUpdate" : {
"MaxBatchSize" : Integer,
"MinInstancesInService" : Integer,
"MinSuccessfulInstancesPercent" : Integer,
"PauseTime" : String,
"SuspendProcesses" : [ List of processes ],
"WaitOnResourceSignals" : Boolean
}
}
Caveats on parameters mentioned here:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-attribute-updatepolicy.html
Is this a BUG REPORT or FEATURE REQUEST?:
BUG REPORT
What happened:
time="2019-08-31T03:51:07Z" level=info msg="stack state: CREATE_IN_PROGRESS"
What you expected to happen:
It should be possible to know which IG the log messages are for.
We should test scenarios with many instance groups and the effect on API limits.
We should probably explore caching via https://github.com/keikoproj/aws-sdk-go-cache -- this will give us better performance in those cases and also enable multiple worker threads on the reconciler.
It seems travis-ci has a hard limit on total job duration; since we changed to using upgrade-manager as a CRD strategy, the test now takes a few minutes longer, which may exceed this limit and fail.
We need to remove the env setup from the job (which takes 40+ minutes) and have a long-running cluster for testing purposes.
https://travis-ci.org/orkaproj/instance-manager/builds/571627175
Due to a bug in the test, I made the update of instance groups during the functional test sequential.
We should improve the polling mechanism and go back to using parallel updates; this will make the test faster and allow us to scale while staying within the travis limitation of 60m.
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
Currently, the IGs in EKS are provisioned using cloudformation templates (via the eks-cf provisioner). While CFN is great, it has a drawback: it prevents other components/tools from directly modifying the AWS resource or its dependencies. For example: (i) minion-manager directly modifies the launchconfig of an ASG to switch between spot and on-demand instances, and changing the instance type for scaling the instance size up/down would require a change in the launch-config. (ii) Adding/removing cloud labels (tags) is required for dynamic changes for cluster-autoscaler, AWS cost analysis tools, etc.
If direct changes are made to the ASG or the launch-config, it creates drift in the CFN stack.
What you expected to happen:
More of a question: Would it be easier if instance-manager had a new provisioner called eks-direct
(or something similar) that would directly use the AWS SDK to provision the ASGs? It would solve the above problem of working well with other tools. Another positive side-effect would also be that the ASGs will have deterministic (and prettier) names :-)
Thoughts?
Currently the spec does not allow enabling MetricsCollection.
We should probably add support for it.
More info in the API docs:
https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-properties-as-metricscollection.html
Unit-test coverage is around 60%; we should get it above 75% at the very least.
The current BDD test submits an argo workflow that does nothing; instead we should submit a rollingupgrade via upgrade-manager.
Is this a BUG REPORT or FEATURE REQUEST?:
Feature Request
What happened:
Can't find where to place the UserData to inject into the AutoScaling Group through CloudFormation.
What you expected to happen:
Be able to inject UserData into CloudFormation.
1.15, 1.16
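For reference, in CloudFormation userdata is typically set on the launch configuration resource; a minimal sketch (resource and parameter names here are illustrative, not instance-manager's template):

```yaml
Resources:
  NodeLaunchConfig:
    Type: AWS::AutoScaling::LaunchConfiguration
    Properties:
      ImageId: !Ref NodeImageId
      InstanceType: !Ref NodeInstanceType
      UserData:
        Fn::Base64: !Sub |
          #!/bin/bash
          set -o xtrace
          /etc/eks/bootstrap.sh ${ClusterName}
```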
Is this a BUG REPORT or FEATURE REQUEST?:
FEATURE REQUEST
What happened:
AWS now supports managed worker nodegroups. These are not currently supported by instance-manager.
What you expected to happen:
It should be possible to create/read/update/delete these managed node groups as IGs.
Anything else we need to know?:
There are CFNs for provisioning of managed node group: https://docs.aws.amazon.com/AWSCloudFormation/latest/UserGuide/aws-resource-eks-nodegroup.html
Other links:
Upgrades: https://docs.aws.amazon.com/eks/latest/userguide/update-managed-node-group.html