rancherfederal / rke2-aws-tf
License: MIT License
Governance policies often prevent the creation of security groups, so heavy modifications are needed to use the rke2-aws-tf repository in our environment. We request a feature flag to turn creation of the security groups on/off and allow passing our own pre-existing values. The implementation should mimic the IAM modules' feature flag.
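As a sketch of what is being asked for (variable and resource names here are illustrative, following the create-or-bring-your-own pattern the IAM modules use):

```hcl
variable "create_security_groups" {
  description = "Whether to create security groups, or use pre-existing IDs"
  type        = bool
  default     = true
}

variable "security_group_ids" {
  description = "Pre-existing security group IDs, used when create_security_groups = false"
  type        = list(string)
  default     = []
}

# Only created when the flag is on.
resource "aws_security_group" "this" {
  count  = var.create_security_groups ? 1 : 0
  name   = "${var.cluster_name}-sg"
  vpc_id = var.vpc_id
}

locals {
  # Downstream resources reference this local instead of the resource directly.
  security_group_ids = var.create_security_groups ? aws_security_group.this[*].id : var.security_group_ids
}
```

Consumers in governed environments could then set `create_security_groups = false` and supply their approved IDs.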
Oversight: user-defined optional additional rke2 config is not being passed into userdata properly.
It is a bit of an ironic omission that SUSE OSes are not covered in the context of this IaC. I have FIPS-enabled SUSE 15 SP5 AMIs in my account that I would like to use.
Side note: the RHEL, Ubuntu, etc. AMIs pulled in as data calls will build without FIPS (as far as I can tell), which may be of value for some to know.
First, thanks for the work on this, it's a great stepping stone.
Forgive me if I'm missing something and if so please point it out ...
Unfortunately, using Auto Scaling Groups for RKE2 cluster management without additional scripting (via Lambda or some other means) can seriously cripple the cluster if people don't know what they are doing.
ASG for the servers: the problem here is that while adding nodes is OK, removing or replacing them is not. For example, if you update the launch config to change the instance size and do an ASG refresh, all your servers will get replaced and etcd will be lost. If you do a refresh one server at a time, you have to manually go in and remove the old node so the new node will join; otherwise it'll complain about etcd not being healthy.
ASG for agents: while this isn't as big of a problem, any time an agent is replaced or removed the old entry will remain. It's unclear what this means when a node gets the same IP and DNS name later down the line; it could be problematic.
See hashicorp/terraform-provider-aws#34135. Right now a load balancer won't provision successfully because the latest provider attempts to set an attribute that is valid in commercial AWS but not in GovCloud. The version needs to be pinned to <= 5.22 until this is fixed upstream.
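Until then, a pin along these lines in the consuming root module should hold the provider back (constraint taken from the issue; adjust to your setup):

```hcl
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
      version = "<= 5.22.0"
    }
  }
}
```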
Requesting that documentation on contribution preferences for this repository be reflected in README.md.
I'm working through implementing IAM Roles for Service Accounts on an RKE2 deployment, which requires updates to some of the arguments in the kube-apiserver.yaml file. An issue is that the file is not persistent: if the main node goes down and is replaced, it reverts to the old configuration.
Is there a simple way to update arguments on deployment or is the kube-apiserver.yaml file configured somewhere in the repo that could be updated prior to deployment?
Essentially what needs to be configured is:
spec:
  containers:
  - command:
    - kube-apiserver
    - --service-account-issuer=<OIDC provider URL>
    - --service-account-key-file=/var/lib/rancher/rke2/server/irsa/sa-signer-pkcs8.pub
    - --service-account-key-file=/var/lib/rancher/rke2/server/tls/service.key
    - --service-account-signing-key-file=/var/lib/rancher/rke2/server/irsa/sa-signer.key
    volumeMounts:
    - mountPath: /var/lib/rancher/rke2/server/irsa
      name: dir3
  volumes:
  - hostPath:
      path: /var/lib/rancher/rke2/server/irsa
      type: DirectoryOrCreate
    name: dir3
The only real solution I've found that might work is updating rke2-init.sh and having it manually modify the file on the instance, or calling the RKE2 server CLI to inject those values.
Is there a better/supported way to do this that I'm not seeing?
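One avenue worth trying (a sketch, not verified against this module): RKE2's config.yaml accepts kube-apiserver-arg entries, and this module exposes a rke2_config input, so the flags could be injected declaratively instead of patching the generated manifest. The issuer URL below is a placeholder, the key files would still have to be staged on the host (e.g. via pre-userdata), and whether the static pod already mounts the irsa path needs checking:

```hcl
module "rke2" {
  source = "git::https://github.com/rancherfederal/rke2-aws-tf.git"
  # ... other inputs ...

  rke2_config = <<-EOT
    kube-apiserver-arg:
      - "service-account-issuer=https://oidc.example.com"  # placeholder OIDC provider URL
      - "service-account-key-file=/var/lib/rancher/rke2/server/irsa/sa-signer-pkcs8.pub"
      - "service-account-signing-key-file=/var/lib/rancher/rke2/server/irsa/sa-signer.key"
  EOT
}
```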
The optional() feature is used here. To enable it, this needs to be included in a Terraform template:
terraform {
  experiments = [module_variable_optional_attrs]
}
This should be shipped in the same directory as variables.tf.
When the elb name variable is the right length, controlplane_name, server_name, or supervisor_name can truncate to a name ending in -, which is not permitted.
rke2-aws-tf/modules/elb/main.tf
Line 3 in 71146b5
I created an RKE2 cluster (v1.18.13+rke2r1) in AWS using this Terraform project (https://github.com/rancherfederal/rke2-aws-tf), following the cloud-enabled example for AWS. The working Terraform project was configured to build 3 autoscaling nodes and 1 server.
The RKE2 cluster appears to work just fine: kubectl works great, and I'm able to deploy a highly customized Jenkins Helm chart with no issues; the ELB for the Service creates with no issues, etc.
Now I’m deploying the rancher helm chart into my RKE2 as follows:
helm upgrade rancher rancher-stable/rancher --install \
--version v2.5.7 \
--namespace cattle-system \
--debug --wait --timeout 5m \
--set hostname=rke-test.k8s.hla-associates.info \
--set ingress.tls.source=rancher \
--set ingress.enabled=true \
--set tls=external
Questions:
rke2-aws-tf/modules/nlb/main.tf
Line 58 in 1fe22df
This is problematic.
According to line 3 (rke2-aws-tf/modules/nlb/main.tf, line 3 in 1fe22df), controlplane_name is 31 characters. If we combine this with the -6443 that's stuck on the end, that's 36 characters.
Target groups can only have up to 32 characters, so this breaks if given a long name. Please handle this scenario.
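A possible fix (a sketch; the local name and length budget are illustrative, not the module's actual code) is to truncate with the suffix budget in mind and strip any trailing hyphen the cut leaves behind:

```hcl
locals {
  # The 32-char target group limit minus the 5-char "-6443" suffix leaves 27;
  # trimsuffix drops a trailing "-" if the substr cut lands on one.
  controlplane_name = trimsuffix(substr("${var.cluster_name}-rke2-cp", 0, 27), "-")
}
```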
Version 4 of the AWS provider introduces some major changes to the aws_s3_bucket resource. The acl and server_side_encryption_configuration attributes now have to be set using their own corresponding resources. See https://stackoverflow.com/questions/71078462/terraform-aws-provider-error-value-for-unconfigurable-attribute-cant-configur and https://registry.terraform.io/providers/hashicorp/aws/latest/docs/guides/version-4-upgrade#s3-bucket-refactor
These attributes now need to be set using the aws_s3_bucket_acl and aws_s3_bucket_server_side_encryption_configuration resources.
Trying to run terraform apply now will return the errors:
│ Error: Value for unconfigurable attribute
│
│ with module.rke2.module.statestore.aws_s3_bucket.bucket,
│ on ../../modules/statestore/main.tf line 1, in resource "aws_s3_bucket" "bucket":
│ 1: resource "aws_s3_bucket" "bucket" {
│
│ Can't configure a value for "server_side_encryption_configuration": its value will be decided
│ automatically based on the result of applying this configuration.
╵
╷
│ Error: Value for unconfigurable attribute
│
│ with module.rke2.module.statestore.aws_s3_bucket.bucket,
│ on ../../modules/statestore/main.tf line 3, in resource "aws_s3_bucket" "bucket":
│ 3: acl = "private"
│
│ Can't configure a value for "acl": its value will be decided automatically based on the result of
│ applying this configuration.
╵
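Per the v4 upgrade guide, the statestore bucket would be split roughly like this (bucket naming and encryption settings here are illustrative, not the module's exact values):

```hcl
resource "aws_s3_bucket" "bucket" {
  bucket = "${var.name}-rke2"  # illustrative name
}

# The ACL moves to its own resource under the v4 provider.
resource "aws_s3_bucket_acl" "bucket" {
  bucket = aws_s3_bucket.bucket.id
  acl    = "private"
}

# Encryption configuration likewise becomes a standalone resource.
resource "aws_s3_bucket_server_side_encryption_configuration" "bucket" {
  bucket = aws_s3_bucket.bucket.id
  rule {
    apply_server_side_encryption_default {
      sse_algorithm = "aws:kms"
    }
  }
}
```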
variable "asg" {
  description = "Node pool AutoScalingGroup scaling definition"
  type = object({
    min                  = number
    max                  = number
    desired              = number
    suspended_processes  = optional(list(string))
    termination_policies = optional(list(string))
  })
  default = {
    min                  = 1
    max                  = 10
    desired              = 1
    suspended_processes  = []
    termination_policies = ["Default"]
  }
}
https://github.com/rancherfederal/rke2-aws-tf/blob/v2.4.2/modules/agent-nodepool/versions.tf
terraform {
  required_version = ">= 0.13"
}
~ terraform --version
Terraform v0.13.0
~ cd modules/agent_nodepool && terraform init
There are some problems with the configuration, described below.
The Terraform configuration must be valid before initialization so that
Terraform can determine which modules and providers need to be installed.
Error: Invalid type specification
on variables.tf line 83, in variable "asg":
83: suspended_processes = optional(list(string))
Keyword "optional" is not a valid type constructor.
Error: Invalid type specification
on variables.tf line 84, in variable "asg":
84: termination_policies = optional(list(string))
Keyword "optional" is not a valid type constructor.
According to the Terraform changelog, the optional keyword was added as an experiment in Terraform 0.14 and made official in Terraform 1.3.
Solution: remove the optional keyword, OR bump the required Terraform version to >= 1.3.
This affects v2.4.2, the latest version of this module.
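If the optional() usage is kept, the fix would be to make the module's versions.tf declare the real floor:

```hcl
terraform {
  # optional() object attributes are only official as of Terraform 1.3.
  required_version = ">= 1.3"
}
```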
This doesn't work in certain environments. The following is more portable, but introduces a dependency on jq:
aws configure set default.region "$(curl -s http://169.254.169.254/latest/dynamic/instance-identity/document | jq -r .region)"
The Server ASG fails to complete successfully with an error:
module.rke2.module.servers.aws_autoscaling_group.this: Still creating... [9m50s elapsed]
module.rke2.module.servers.aws_autoscaling_group.this: Still creating... [10m0s elapsed]
Upon digging further, I found that the AMI I am using already has the AWS CLI installed.
Currently, running this on any up-to-date RHEL AMI will fail. This is due to this upstream issue, which I hope will be resolved soon:
While I am aware that the latest patch that can fix this is in the testing channel, it's a little inconvenient that this Terraform module has no official way to specify the release channel in the cloud-init userdata (without manually modifying the script). Please correct me if I'm wrong and there's some hack that can be done using the extra_cloud_config_config variable.
As a feature request, it would be nice to be able to pass in the environment variables, such as "INSTALL_RKE2_CHANNEL" and have them be read into the relevant parts of the userdata scripts.
If you have any guidance for alternate workarounds, such as copying the binary file directly to the node, please let me know.
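As a sketch of the requested knob (the variable name is hypothetical; the module does not expose it today), the channel could be threaded from a Terraform input into the userdata template, since the upstream installer honors the INSTALL_RKE2_CHANNEL environment variable:

```hcl
# Hypothetical module input.
variable "rke2_channel" {
  description = "Release channel handed to the installer via INSTALL_RKE2_CHANNEL"
  type        = string
  default     = "stable"
}

# The userdata template would then render something along the lines of:
#   INSTALL_RKE2_CHANNEL="${rke2_channel}" sh rke2-install.sh
```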
Following the best practices in the cloud-enabled example of main.tf, the rke2 and agent modules use subnets with public IPs (subnets = module.vpc.public_subnets); on changing to private subnets, I get private IPs.
The question is how to access the private-IP servers/agents; there should be code for a bastion host when using private subnets.
Any feedback, please?
It seems like the control plane NLB can't see the instances from the ASG as healthy. The funny thing is they all passed status and health checks. This is a blocker, and I would appreciate some help.
module.servers.aws_autoscaling_group.this: Still creating... [9m50s elapsed]
module.servers.aws_autoscaling_group.this: Still creating... [10m0s elapsed]
╷
│ Error: waiting for Auto Scaling Group (p1-il2-dev-nv-km3-server-rke2-nodepool) capacity satisfied: timeout while waiting for state to become 'ok' (last state: 'want at least 1 healthy instance(s) registered to Load Balancer, have 0', timeout: 10m0s)
│
│ with module.servers.aws_autoscaling_group.this,
│ on modules/nodepool/main.tf line 69, in resource "aws_autoscaling_group" "this":
│ 69: resource "aws_autoscaling_group" "this" {
│
╵
Releasing state lock. This may take a few moments...
ERRO[0790] 1 error occurred:
* exit status 1
The same issue has also been reported here => https://repo1.dso.mil/platform-one/distros/rancher-federal/rke2/rke2-aws-terraform/-/issues/5
Please consider publishing this module to the Terraform Registry.
Using terraform 1.27, there are 5 items flagged as deprecated.
{
"context": "output \"bucket\"",
"code": " value = aws_s3_bucket_object.token.bucket",
"detail": "The attribute \"bucket\" is deprecated. Refer to the provider documentation for details.",
"filename": "modules/statestore/outputs.tf",
"start_line": 2
}
{
"context": "resource \"aws_s3_bucket_object\" \"token\"",
"code": " bucket = aws_s3_bucket.bucket.id",
"detail": "Use the aws_s3_object resource instead",
"filename": "modules/statestore/main.tf",
"start_line": 24
}
{
"context": "resource \"aws_s3_bucket_object\" \"token\"",
"code": " key = \"token\"",
"detail": "Use the aws_s3_object resource instead",
"filename": "modules/statestore/main.tf",
"start_line": 25
}
{
"context": "resource \"random_string\" \"uid\"",
"code": " number = true",
"detail": "Use numeric instead.",
"filename": "main.tf",
"start_line": 27
}
{
"context": "output \"token\"",
"code": " bucket = aws_s3_bucket_object.token.bucket",
"detail": "The attribute \"bucket\" is deprecated. Refer to the provider documentation for details.",
"filename": "modules/statestore/outputs.tf",
"start_line": 15
}
You can use terraform validate -json | jq '.diagnostics[] | {context: .snippet.context, code: .snippet.code, detail: .detail, filename: .range.filename, start_line: .range.start.line}' to replicate the results above.
AWS changed S3 to default to ACLs disabled: https://aws.amazon.com/about-aws/whats-new/2022/12/amazon-s3-automatically-enable-block-public-access-disable-access-control-lists-buckets-april-2023/
This breaks the lines below:
rke2-aws-tf/modules/statestore/main.tf
Lines 8 to 11 in e60ad00
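One way out (a sketch; untested against this module) is to drop the acl argument entirely and state the new default explicitly via ownership controls:

```hcl
resource "aws_s3_bucket" "bucket" {
  bucket = "${var.name}-rke2"  # illustrative name; note: no acl argument
}

# With ACLs disabled by default, enforce bucket-owner ownership instead of an ACL.
resource "aws_s3_bucket_ownership_controls" "bucket" {
  bucket = aws_s3_bucket.bucket.id
  rule {
    object_ownership = "BucketOwnerEnforced"
  }
}
```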
In a recent update, the control plane was changed to use an NLB instead of a classic load balancer. Those upgrading the module to the latest version will see an error occur.
Not sure how to fix it.
I seem oddly unable to make any use of this module at all. When I deploy the example TF files (quickstart and cloud-enabled), I get a wonky deployment in which the NGINX backend, CoreDNS, and metrics-server pods cycle repeatedly in a crash loop.
Steps I performed:
cd into the quickstart or cloud-enabled folder
quickstart> kubectl logs -n kube-system rke2-coredns-rke2-coredns-6f7676fdf7-p9z7z -p
.:53
[INFO] plugin/reload: Running configuration MD5 = 7da3877dbcacfd983f39051ecafd33bd
CoreDNS-1.6.9
linux/amd64, go1.15.2b5, 17665683
[INFO] SIGTERM: Shutting down servers then terminating
[INFO] plugin/health: Going into lameduck mode for 5s
quickstart> kubectl logs -n kube-system rke2-ingress-nginx-default-backend-65f75d6664-nrckx -p
stopping http server...
quickstart> kubectl logs -n kube-system rke2-metrics-server-5d8c549c9f-297tx -p
I0826 18:51:11.018334 1 secure_serving.go:116] Serving securely on [::]:8443
I've been encountering issues with the NLB module name length:
Error: "name" cannot be longer than 32 characters
Specifically for these two lines:
I did notice there is already some name trimming/substringing happening here, but it appears the port is pushing it past the limit. It looks like the current names are trimmed down to 32 characters, but the ports add an extra - plus the port number (~4 characters).
I haven't had a chance to dig into this further yet, as a workaround we just shortened our name that was being passed in.
When supplying a list of private subnets to the "subnets" field, the resulting nodes will never join each other as the communication seems to break through the load balancer. The module logic for the lb uses the private subnets. Perhaps I am missing something, but this module seems to only work for public subnets.
My use (perhaps I am doing something wrong here):
module "rke2" {
  source = "git::https://github.com/rancherfederal/rke2-aws-tf.git"

  cluster_name  = local.cluster_name
  unique_suffix = false
  vpc_id        = module.vpc.vpc_id
  #subnets      = module.vpc.public_subnets
  subnets       = module.vpc.private_subnets
  ami           = data.aws_ami.ubuntu.id
  #enable_ccm   = true
  ssh_authorized_keys   = [tls_private_key.ssh_keygen.public_key_openssh]
  instance_type         = local.rke2_instance_type
  controlplane_internal = false
  servers               = local.rke2_servers
  #associate_public_ip_address = true
  controlplane_enable_cross_zone_load_balancing = true
  extra_security_group_ids = [aws_security_group.nlb_ui_sg.id]
  metadata_options = {
    http_endpoint               = "enabled"
    http_tokens                 = "optional"
    instance_metadata_tags      = "disabled"
    http_put_response_hop_limit = 1
  }
  rke2_version = local.rke2_version
  rke2_config  = <<-EOT
    node-label:
  EOT
}
The k8s CCM route-controller is querying the AWS API every 10 seconds (a normal interval) with DescribeRouteTables, with no request parameters or filtering. The documentation says it is supposed to use a tag filter when querying, but CloudTrail doesn't seem to agree.
"requestParameters": {
"routeTableIdSet": {},
"filterSet": {}
},
2020-12-16T16:30:36.981841663Z stderr F E1216 16:30:36.981657 1 route_controller.go:118] Couldn't reconcile node routes: error listing routes: found multiple matching AWS route tables for AWS cluster: kubernetes
Is there a way we can disable the route-controller in the k8s ccm that rke2 is using? The k8s fix for this is to set this in the yaml in the controller command section: - --configure-cloud-routes=false
https://kubernetes.io/docs/tasks/administer-cluster/running-cloud-controller/
# these flags will vary for every cloud provider
- --allocate-node-cidrs=true
- --configure-cloud-routes=true
- --cluster-cidr=172.17.0.0/16
Hi guys,
I'm having an issue running more than a single-node control plane, as nodes don't seem to be joining the cluster. Here's the error I am seeing:
2020-11-11 14:42:56,743 - util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/00_download.sh'] with allowed return codes [0] (shell=False, capture=False)
2020-11-11 14:44:34,107 - util.py[DEBUG]: Running command ['/var/lib/cloud/instance/scripts/01_rke2.sh'] with allowed return codes [0] (shell=False, capture=False)
2020-11-11 14:46:36,461 - util.py[WARNING]: Failed running /var/lib/cloud/instance/scripts/01_rke2.sh [1]
2020-11-11 14:46:36,461 - util.py[DEBUG]: Failed running /var/lib/cloud/instance/scripts/01_rke2.sh [1]
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/cloudinit/util.py", line 896, in runparts
subp(prefix + [exe_path], capture=False)
File "/usr/lib/python3.6/site-packages/cloudinit/util.py", line 2083, in subp
cmd=args)
cloudinit.util.ProcessExecutionError: Unexpected error while running command.
Command: ['/var/lib/cloud/instance/scripts/01_rke2.sh']
Exit code: 1
Reason: -
Stdout: -
Stderr: -
2020-11-11 14:46:36,463 - cc_scripts_user.py[WARNING]: Failed to run module scripts-user (scripts in /var/lib/cloud/instance/scripts)
2020-11-11 14:46:36,464 - handlers.py[DEBUG]: finish: modules-final/config-scripts-user: FAIL: running config-scripts-user with frequency once-per-instance
2020-11-11 14:46:36,464 - util.py[WARNING]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
2020-11-11 14:46:36,464 - util.py[DEBUG]: Running module scripts-user (<module 'cloudinit.config.cc_scripts_user' from '/usr/lib/python3.6/site-packages/cloudinit/config/cc_scripts_user.py'>) failed
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/cloudinit/stages.py", line 852, in _run_modules
freq=freq)
File "/usr/lib/python3.6/site-packages/cloudinit/cloud.py", line 54, in run
return self._runners.run(name, functor, args, freq, clear_on_fail)
File "/usr/lib/python3.6/site-packages/cloudinit/helpers.py", line 187, in run
results = functor(*args)
File "/usr/lib/python3.6/site-packages/cloudinit/config/cc_scripts_user.py", line 45, in handle
util.runparts(runparts_path)
File "/usr/lib/python3.6/site-packages/cloudinit/util.py", line 903, in runparts
% (len(failed), len(attempted)))
RuntimeError: Runparts: 1 failures in 2 attempted commands
and if I try to run the script manually afterwards, I get the following:
# /var/lib/cloud/instance/scripts/01_rke2.sh
[INFO] Beginning user defined pre userdata
[INFO] Beginning user defined pre userdata
[INFO] Fetching rke2 join token...
REDACTED
[INFO] Found token from s3 object
[INFO] API server available, identifying as server joining existing cluster
[INFO] Cluster is ready
[ERROR] Failed to create kubeconfig
I am also looking into that right now so I will share more insights unless you guys have any ideas? Not sure if important, but I am not using spot instances.
Thanks!
The rke2 server configuration allows multiple entries to be specified for tls-san; however, the rke2-init.sh script produces invalid YAML by creating a new tls-san entry rather than appending to the user-provided list.
server config passed to TF module:
# Server Configuration
write-kubeconfig-mode: "0644"
node-label:
- "name=server"
- "os=ubuntu"
kube-controller-manager-arg:
- "bind-address=0.0.0.0"
kube-scheduler-arg:
- "bind-address=0.0.0.0"
node-taint:
- "CriticalAddonsOnly=true:NoExecute"
tls-san:
- k8s.foo-demo.bar.com
server config on server node:
ubuntu@ip-10-1-1-68:~$ sudo cat /etc/rancher/rke2/config.yaml
# Additional user defined configuration
# Server Configuration
write-kubeconfig-mode: "0644"
node-label:
- "name=server"
- "os=ubuntu"
kube-controller-manager-arg:
- "bind-address=0.0.0.0"
kube-scheduler-arg:
- "bind-address=0.0.0.0"
node-taint:
- "CriticalAddonsOnly=true:NoExecute"
tls-san:
- k8s.foo-demo.bar.com
token: FmbtIMwa9TNy5pUHAAx2rs6XlK1qiphqwemAUpsC
cloud-provider-name: "aws"
tls-san:
- foo-rke2-atv-rke2-cp-8fdaf7078215333b.elb.us-east-1.amazonaws.com
This causes errors when invoking kubectl as follows:
$ kubectl get nodes
Unable to connect to the server: x509: certificate is valid for kubernetes, kubernetes.default, kubernetes.default.svc, kubernetes.default.svc.cluster.local, foo-rke2-atv-rke2-cp-8fdaf7078215333b.elb.us-east-1.amazonaws.com, localhost, ip-10-1-1-68.ec2.internal, not k8s.foo-demo.bar.com