Comments (38)
Got it working.
Apart from waiting for cloud-init to finish, I also had to set OCI_CLI_AUTH=instance_principal.
I fixed it by adding the following code at the beginning of create_service_account.sh:
while [ ! -f /home/opc/admin.finish ]; do
  echo "waiting for admin to be ready"
  sleep 10
done
sleep 10
export OCI_CLI_AUTH=instance_principal
....
Not the most elegant fix; it would be better not to connect to the admin host and run create_service_account.sh until the admin host is ready. The OCI_CLI_AUTH is in fact set by cloudinit, we are just logging onto admin host too early.
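A bounded variant of the wait loop above might look like the sketch below. It assumes the /home/opc/admin.finish marker mentioned in this thread; the wait_for_file name, the timeout, and the poll interval are illustrative and not part of the module.

```shell
#!/bin/sh
# Hypothetical helper: wait until a marker file exists, giving up after a timeout.
# Usage: wait_for_file <path> <timeout_seconds> [poll_interval_seconds]
# The body runs in a subshell, so its variables do not leak into the caller.
wait_for_file() (
  path=$1
  timeout=$2
  interval=${3:-10}
  waited=0
  while [ ! -f "$path" ]; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "timed out after ${waited}s waiting for $path" >&2
      exit 1
    fi
    echo "waiting for $path"
    sleep "$interval"
    waited=$((waited + interval))
  done
)
```

In create_service_account.sh this could be called as `wait_for_file /home/opc/admin.finish 600 10 || exit 1` before exporting OCI_CLI_AUTH, so that a host that never becomes ready fails the provisioner instead of looping forever. The 600 second limit is an arbitrary choice.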
I suppose generate_kubeconfig.sh can be fixed the same way; however, I didn't use this script to generate the kubeconfig on the admin host. Instead, I did it like this:
resource "null_resource" "write_kubeconfig_on_admin" {
  connection {
    host                = var.oke_admin.admin_private_ip
    private_key         = file(var.oke_ssh_keys.ssh_private_key_path)
    timeout             = "40m"
    type                = "ssh"
    user                = "opc"
    bastion_host        = var.oke_admin.bastion_public_ip
    bastion_user        = "opc"
    bastion_private_key = file(var.oke_ssh_keys.ssh_private_key_path)
  }

  depends_on = [oci_containerengine_cluster.k8s_cluster]

  provisioner "file" {
    content     = data.oci_containerengine_cluster_kube_config.kube_config.content
    destination = "~/.kube/config"
  }

  count = var.oke_admin.bastion_enabled == true && var.oke_admin.admin_enabled == true ? 1 : 0
}
from terraform-oci-oke.
Thanks for logging this issue. Can you please confirm you have:
bastion_enabled = true
admin_enabled = true
admin_instance_principal = true
?
from terraform-oci-oke.
Thanks for your response
All these are set in my tfvars.
admin_instance_principal = true
admin_enabled = true
bastion_enabled = true
from terraform-oci-oke.
It looks like this line is never run.
However, if you comment out this line
"rm -f $HOME/generate_kubeconfig.sh"
and ssh to the server, running the script manually on the server does generate the kubeconfig.
So I think the problem is that generate_kubeconfig.sh fails, and the service account creation then bombs out as a result.
from terraform-oci-oke.
The generate_kubeconfig.sh is supposed to run automatically once:
- the oci-cli has been installed on the admin
- and the oke cluster created
We have put depends_on in a few places to make the ordering of these actions deterministic. It looks like we may have missed some, or this was possibly introduced when we shifted the instance_principal from the bastion to the admin host. We'll hunt it down and fix it.
from terraform-oci-oke.
Could be related to #140
from terraform-oci-oke.
I think you are right here. This morning I made the following change to enable us to log the output of the scripts.
"$HOME/create_service_account.sh >>kubeconfig.log 2>&1",
"$HOME/generate_kubeconfig.sh >>kubeconfig.log 2>&1",
/home/opc/generate_kubeconfig.sh: line 5: /usr/local/bin/oci: No such file or directory
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
So essentially, what I think is happening is that generate_kubeconfig.sh is being run before admin_instance_principal is enabled, which means that oci is not available when generate_kubeconfig.sh is run.
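For anyone repeating this kind of debugging, a small wrapper can make the log easier to read by recording which command ran and how it exited. This is only a sketch: run_logged is a hypothetical helper name, and kubeconfig.log is simply the log file name used above.

```shell
#!/bin/sh
# Hypothetical helper: run a command, append its stdout and stderr to a log
# file, and record a timestamp and the exit status around it.
# Usage: run_logged <logfile> <command> [args...]
run_logged() (
  log=$1
  shift
  printf '=== %s: %s\n' "$(date -u '+%Y-%m-%dT%H:%M:%SZ')" "$*" >>"$log"
  "$@" >>"$log" 2>&1
  status=$?
  printf '=== exit status %s\n' "$status" >>"$log"
  exit "$status"
)
```

The remote-exec lines could then read `run_logged kubeconfig.log "$HOME/generate_kubeconfig.sh"`, which keeps the original exit status so a failing script still fails the provisioner.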
from terraform-oci-oke.
definitely an ordering issue here.
while [ ! -f /home/opc/admin.finish ]; do
  sleep 30
done
oci ce cluster create-kubeconfig --cluster-id ${cluster-id} --file $HOME/.kube/config --region ${region} --token-version 2.0.0
The above code got rid of the issue of the oci client being run before it is installed. The only thing left is that the oci command is being run before the admin instance_principal has been successfully enabled.
from terraform-oci-oke.
Yeah, this is an ordering issue. Stuck at the same place:
module.oke.null_resource.write_kubeconfig_on_admin[0] (remote-exec): ERROR: Could not find config file at /home/opc/.oci/config, please follow the instructions in the link to setup the config file https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm
Looks like the oci client is not installed before it ran "write_kubeconfig_on_admin".
The question is: how do we order admin_instance_principal before this?
from terraform-oci-oke.
The instance_principal, if enabled, is created immediately after the admin instance is created. See here:
https://github.com/oracle-terraform-modules/terraform-oci-oke/blob/master/docs/dependencies.adoc
By the time cloud-init has finished on the compute instance, the dynamic group and the policy for instance_principal will already have been created.
I'm adding @redscaresu's fix and also a dependency on install_kubectl. Given that the installation of kubectl on the admin host is done through null_resource.install_kubectl_admin and therefore requires the compute instance to be up, this should ensure that the instance_principal has been created by then. Together, I think these two should be enough. If not, we'll look at the instance_principal in the base module and maybe add an explicit dependency there.
The dependencies of all the additional functionality that we add are now documented here.
I've submitted a PR: #146 . Can you please test and let us know?
Thanks very much for your patience and help to hunt this.
from terraform-oci-oke.
@hyder looks like that has worked now. Thanks for your help.
from terraform-oci-oke.
Thanks @redscaresu. @sauraahu can you please confirm if this works for you as well? We can then merge and cut a new release for the registry.
from terraform-oci-oke.
Sorry, I've been testing a bit more. While this is an improvement, I don't think the issue is totally gone.
It looks like there is still an ordering issue here. On the first apply we still attempt to create the kubeconfig before the instance_principal is set up.
On the second apply it is able to create the kubeconfig and then create the service account. According to some logging I added, this is what happens.
On the first terraform apply, it tries to create the kubeconfig and fails because the instance_principal has not been applied to the admin host yet:
ERROR: Could not find config file at /home/opc/.oci/config, please follow the instructions in the link to setup the config file https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm
On the second terraform apply:
New config written to the Kubeconfig file /home/opc/.kube/config
serviceaccount/kubeconfigsa created
clusterrolebinding.rbac.authorization.k8s.io/kubeconfigsa-crb created
from terraform-oci-oke.
a bit more information.
On the first terraform apply I am still unable to log on to the bastion, so it looks like I am trying to do the remote-execs to the bastion and admin hosts before I have access to those machines.
A second terraform apply seems to resolve this ordering issue.
from terraform-oci-oke.
I've added an additional wait to ensure the instance_principal has been created before kubeconfig is generated. Can you please try again? You'll need to make a pull on the branch.
from terraform-oci-oke.
It still fails on the first terraform apply, but the ordering issue is solved on the second apply.
It looks like there is a certain amount of time between enabling the instance_principal and actually being granted this privilege.
https://gist.github.com/redscaresu/e1e989abf48f2024cacd9b15593f285a
The above error log shows that the write_kubeconfig_on_admin resource fails.
Looking at the kubeconfig.log log file:
waiting for admin to be ready
waiting for admin to be ready
ERROR: Could not find config file at /home/opc/.oci/config, please follow the instructions in the link to setup the config file https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm
We can see that we looped twice before trying to execute
oci ce cluster create-kubeconfig --cluster-id ${cluster-id} --file $HOME/.kube/config --region ${region} --token-version 2.0.0
That means that even though /home/opc/ip.finish
exists, that is not enough to know whether we have been granted the instance_principal, which means that creating oci_identity_dynamic_group and oci_identity_policy is not enough to ensure that the admin host has instance_principal privileges yet.
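One way to check for the privilege itself rather than for a marker file is to make a cheap authenticated call with instance principal auth forced on the command line. The sketch below assumes that `oci iam region list --auth instance_principal` is an acceptable probe; check_instance_principal and the OCI_BIN override are hypothetical names used here only for illustration. A successful call shows that the host can authenticate as an instance principal; this thread does not confirm that it also proves the cluster policy is in effect.

```shell
#!/bin/sh
# Hypothetical probe: succeeds only if the OCI CLI can authenticate with
# instance principal auth. OCI_BIN is an assumed override for the CLI path
# and defaults to "oci".
check_instance_principal() {
  if "${OCI_BIN:-oci}" iam region list --auth instance_principal >/dev/null 2>&1; then
    echo "instance_principal authentication works"
    return 0
  fi
  echo "instance_principal authentication failed (OCI_CLI_AUTH=${OCI_CLI_AUTH:-unset})" >&2
  return 1
}
```

Printing OCI_CLI_AUTH in the failure message helps tell a missing environment variable apart from a dynamic group or policy that has not been created.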
from terraform-oci-oke.
@redscaresu I've added a 30s sleep between instance_principal being detected and generating the kubeconfig. Can you please test again?
Thanks again for your patience.
from terraform-oci-oke.
So I tried that too, and unfortunately a simple sleep does not work.
So I tried to implement a rudimentary try/catch:
while [ ! -f /home/opc/admin.finish ] || [ ! -f /home/opc/ip.finish ]; do
  echo "waiting for admin to be ready"
  sleep 10
done
for i in $(seq 1 20); do
  oci ce cluster create-kubeconfig --cluster-id ${cluster-id} --file $HOME/.kube/config --region ${region} --token-version 2.0.0 && break
  sleep 20
done
The log is below
https://gist.github.com/redscaresu/f4c4ef86b9ad79c237a94b4458630541
It is interesting that no matter how long we wait, we never get the instance_principal permission we need. It's almost as if we are waiting on a Terraform operation to complete before we are given this permission; I just do not know which resource that is.
In the log you can see that we hit the maximum of 20 loops, each with its 20 seconds of sleep, before bombing out. I don't think it matters how long you wait, because it will always bomb out; we could wait for 100 loops and it would not matter.
Something needs to complete before we run resource "null_resource" "write_kubeconfig_on_admin". I just don't know what that is.
Below is the Terraform output. You can see that module.oke.null_resource.write_kubeconfig_on_admin finally bombs out after hitting the 20th loop in the bash script and subsequently causes create_service_account.sh to bomb out. module.oke.null_resource.write_kubeconfig_on_admin will never have instance_principal privileges on the first pass.
https://gist.github.com/redscaresu/697b82ded02c640e95aae3a04465f48a
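A side note on the retry loop in this comment: a `for` loop of the form `cmd && break; sleep 20` finishes with the status of the last `sleep` when every attempt fails, so the loop itself does not report the failure. A minimal sketch of a retry helper that does return non-zero on exhaustion follows; the retry name and its arguments are illustrative, not something from the module.

```shell
#!/bin/sh
# Hypothetical helper: run a command up to <max_attempts> times, sleeping
# <delay_seconds> between attempts. Returns 0 on the first success and 1 if
# every attempt fails.
# Usage: retry <max_attempts> <delay_seconds> <command> [args...]
retry() (
  max=$1
  delay=$2
  shift 2
  attempt=1
  while :; do
    "$@" && exit 0
    if [ "$attempt" -ge "$max" ]; then
      echo "giving up after $attempt attempts: $*" >&2
      exit 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay"
  done
)
```

Used as `retry 20 20 oci ce cluster create-kubeconfig ... || exit 1`, the script stops with a clear message instead of carrying on without a kubeconfig. It does not address the underlying cause discussed here, which retries alone did not fix.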
from terraform-oci-oke.
This is probably a red herring but....
module.oke.null_resource.write_kubeconfig_on_admin[0]: Still creating... [2m0s elapsed]
2020/04/08 11:41:12 [TRACE] dag/walk: vertex "provisioner.file (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:12 [TRACE] dag/walk: vertex "root" is waiting for "provisioner.file (close)"
2020/04/08 11:41:12 [TRACE] dag/walk: vertex "module.oke.null_resource.create_service_account[0]" is waiting for "module.oke.null_resource.write_kubeconfig_on_admin[0]"
2020/04/08 11:41:15 [TRACE] dag/walk: vertex "provider.null (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:15 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:15 [TRACE] dag/walk: vertex "provisioner.remote-exec (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:17 [TRACE] dag/walk: vertex "root" is waiting for "provisioner.file (close)"
2020/04/08 11:41:17 [TRACE] dag/walk: vertex "provisioner.file (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:17 [TRACE] dag/walk: vertex "module.oke.null_resource.create_service_account[0]" is waiting for "module.oke.null_resource.write_kubeconfig_on_admin[0]"
2020/04/08 11:41:20 [TRACE] dag/walk: vertex "provider.null (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:20 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:20 [TRACE] dag/walk: vertex "provisioner.remote-exec (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
This seems weird. Am I mistaken, or does it look like write_kubeconfig_on_admin is waiting for module.oke.null_resource.create_service_account? write_kubeconfig_on_admin must come first; we can't create the service account until that's done.
module.oke.null_resource.write_kubeconfig_on_admin[0]: Still creating... [2m0s elapsed] 2020/04/08 11:41:12 [TRACE] dag/walk: vertex "provisioner.file (close)" is waiting for "module.oke.null_resource.create_service_account[0]
I tested this by setting create_service_account = false
and it still failed with the same problem, so this is likely a red herring. While write_kubeconfig_on_admin was being created, I checked that the dynamic group was there with the associated admin instance and policy, and they were present. This is very strange.
from terraform-oci-oke.
I have modified the kubeconfig.tf so that kubeconfig on admin host is created the same way as on local machine (using data.oci_containerengine_cluster_kube_config.kube_config.content).
The kubeconfig got created, but the create_service_account.sh failed:
The connection to the server localhost:8080 was refused...
Running create_service_account.sh manually works.
from terraform-oci-oke.
Nice! Can you show me the dependencies of your create_service_account Terraform resource? Does it now depend on the resource that creates your kubeconfig?
from terraform-oci-oke.
Oops, I created the kubeconfig in the wrong place and didn't set the KUBECONFIG environment variable. Going to destroy and apply again, fingers crossed.
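The "connection to the server localhost:8080 was refused" message earlier in the thread is what kubectl typically prints when it finds no kubeconfig at all. A small guard along these lines can catch that before kubectl runs; use_kubeconfig is a hypothetical helper name and the path passed to it is up to the caller.

```shell
#!/bin/sh
# Hypothetical helper: export KUBECONFIG only if the file exists and is
# not empty, so a misplaced kubeconfig fails early with a clear message.
# Usage: use_kubeconfig <path>
use_kubeconfig() {
  if [ ! -s "$1" ]; then
    echo "kubeconfig not found or empty: $1" >&2
    return 1
  fi
  KUBECONFIG=$1
  export KUBECONFIG
}
```

For example, `use_kubeconfig "$HOME/.kube/config" || exit 1` at the top of create_service_account.sh.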
from terraform-oci-oke.
I have some progress:
- kubeconfig got created
- create_service_account.sh fails with
Unable to connect to the server: getting credentials: exec: exec: "oci": executable file not found in $PATH
from terraform-oci-oke.
Looks like cloud-init did not finish yet; we should wait for /home/opc/admin.finish.
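As a complement to waiting for the marker file, a script can check up front that the tools it needs are on the PATH and fail with a readable message. A minimal sketch, with require_cmds as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical helper: report every required command that is missing from
# PATH and return non-zero if any is missing.
# Usage: require_cmds <command>...
require_cmds() {
  missing=0
  for cmd in "$@"; do
    if ! command -v "$cmd" >/dev/null 2>&1; then
      echo "required command not found in PATH: $cmd" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For the scripts in this thread that would be something like `require_cmds oci kubectl || exit 1`.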
from terraform-oci-oke.
The OCI_CLI_AUTH is in fact set by cloudinit, we are just logging onto admin host too early.
Aha! I think this was the issue. So, I'm moving the delay to another null_resource instead.
We were rendering the kubeconfig before. However, that stored the kubeconfig in the state file, which I thought was not a good idea.
Oddly:
- I haven't run into any of the above issues at all
- The other scripts (e.g. ocirsecret, metricserver, calico) that all depend on the kubeconfig haven't run into this problem either.
I'll add @styledigger's findings and make another push soon. Can I trouble you to test again?
from terraform-oci-oke.
Ok, I've pushed an update to my branch. Can you please make a pull and test again?
from terraform-oci-oke.
Given I still couldn't replicate the issue, I'll need 2 confirmations from those who have been able to in order to confirm we've fixed it: @redscaresu and @styledigger at least.
from terraform-oci-oke.
Will test it now. @hyder Just to be sure, updates have been pushed to git@github.com:hyder/terraform-oci-oke.git?
from terraform-oci-oke.
Yes, in branch issue-143.
Use the following if you want to test with your existing clone:
git checkout -b hyder-issue-143 master
git pull https://github.com/hyder/terraform-oci-oke.git issue-143
Or you can do a fresh clone from my fork instead and checkout the issue-143 branch
from terraform-oci-oke.
I did a fresh clone:
git clone git@github.com:hyder/terraform-oci-oke.git
Switched to a new branch 'hyder-issue-143':
git checkout -b hyder-issue-143 master
git pull https://github.com/hyder/terraform-oci-oke.git issue-143
Now terraform plan fails:
...
module.base.module.vcn.data.oci_core_services.all_oci_services[0]: Refreshing state...
module.network.data.oci_core_services.all_oci_services[0]: Refreshing state...
Error: Null value found in list
on modules\policies\datasources.tf line 9, in data "oci_identity_regions" "home_region":
9: data "oci_identity_regions" "home_region" {
Null values are not allowed for this attribute value.
Error: Invalid function argument
on .terraform\modules\base\terraform-oci-base-1.1.3\datasources.tf line 9, in data "template_file" "ad_names":
9: count = length(data.oci_identity_availability_domains.ad_list.availability_domains)
|----------------
| data.oci_identity_availability_domains.ad_list.availability_domains is null
Invalid value for "value" parameter: argument must not be null.
Error: Null value found in list
on .terraform\modules\base\terraform-oci-base-1.1.3\datasources.tf line 18, in data "oci_identity_regions" "home_region":
18: data "oci_identity_regions" "home_region" {
Null values are not allowed for this attribute value.
Error: Attempt to index null value
on .terraform\modules\base\terraform-oci-base-1.1.3\modules\admin\locals.tf line 12, in locals:
12: admin_image_id = var.oci_admin.admin_image_id == "Oracle" ? data.oci_core_images.admin_images.images.0.id : var.oci_admin.admin_image_id
|----------------
| data.oci_core_images.admin_images.images is null
This value is null, so it does not have any indices.
Error: Attempt to index null value
on .terraform\modules\base\terraform-oci-base-1.1.3\modules\bastion\locals.tf line 12, in locals:
12: bastion_image_id = var.oci_bastion.bastion_image_id == "Autonomous" ? data.oci_core_images.autonomous_images.images.0.id : var.oci_bastion.bastion_image_id
|----------------
| data.oci_core_images.autonomous_images.images is null
This value is null, so it does not have any indices.
from terraform-oci-oke.
I just took https://github.com/hyder/terraform-oci-oke.git and checked out the issue-143 branch; it created the cluster successfully for me.
from terraform-oci-oke.
Thanks @sauraahu. I'll need 1 more confirmation from either @redscaresu or @styledigger as I still haven't been able to replicate their issue but I understand where it could be coming from.
from terraform-oci-oke.
Sure. Meanwhile, I am thinking of hosting an example of how to create an OKE cluster and deploy a sample hello world application, as that requires some additional steps: building a Docker image or using an existing one, uploading the image to OCIR, and creating a sample .yml with a deployment and service configured. What would be the right place to put that example?
from terraform-oci-oke.
The only issue I am getting is during terraform destroy; I don't know the reason:
First time:
module.network.oci_core_subnet.pub_lb[0]: Still destroying... [id=ocid1.subnet.oc1.ap-mumbai-1.aaaaaaaaqm...2xgvnzms2gyzwlmvkgjrogflef7zesuridbhwa, 10m10s elapsed]
Error: Service error:Conflict. The Subnet ocid1.subnet.oc1.ap-mumbai-1.aaaaaaaaqmzf2mjizqomds2xgvnzms2gyzwlmvkgjrogflef7zesuridbhwa references the VNIC ocid1.vnic.oc1.ap-mumbai-1.abrg6ljr4hnwbd4m4fsfmzldixv657vkzfbyeqlrmsyd7eusr6c4px4xcngq. You must remove the reference to proceed with this operation.. http status code: 409. Opc request id: d870a1beb10aeb1ab29a4110a93ae2b4/98933FF0DF57B2AF5B029C4FB4DF4B3A/EE342E4DD141CE0F6DDD1637CBABA34E
Second run:
module.base.module.vcn.oci_core_vcn.vcn: Destruction complete after 1s
Error: Error in function call
on modules/auth/outputs.tf line 5, in output "ocirtoken":
5: value = var.ocir.create_auth_token == true ? element(oci_identity_auth_token.ocirtoken.*.token, 0) : "none"
|----------------
| oci_identity_auth_token.ocirtoken is empty tuple
Call to function "element" failed: cannot use element function with an empty
list.
Error: Error in function call
on modules/auth/outputs.tf line 9, in output "ocirtoken_id":
9: value = var.ocir.create_auth_token == true ? element(oci_identity_auth_token.ocirtoken.*.id, 0) : "none"
|----------------
| oci_identity_auth_token.ocirtoken is empty tuple
Call to function "element" failed: cannot use element function with an empty
list.
Error: Invalid index
on .terraform/modules/base/terraform-oci-base-1.1.3/modules/admin/outputs.tf line 9, in output "admin_instance_principal_group_name":
9: value = var.oci_admin.admin_enabled == true && var.oci_admin.enable_instance_principal == true ? oci_identity_dynamic_group.admin_instance_principal[0].name : null
|----------------
| oci_identity_dynamic_group.admin_instance_principal is empty tuple
The given key does not identify an element in this collection value.
from terraform-oci-oke.
Both apply and destroy now work for me.
The problem is that there is no provider.tf in the issue-143 branch's root folder, so the OCI provider loads the required OCIDs from ~\.oci\config (which has the wrong values in my case).
from terraform-oci-oke.
Yeah, both apply and destroy work for me now. I am testing other features like the dashboard, OCIR secret, helm etc. I will open a separate bug if I find an issue there. Meanwhile, you can merge this, please.
from terraform-oci-oke.
Right, so let me summarize why this happened:
- we removed the provider.tf for #130
- this made the terraform provider use the oci config, which for some of you may have different permissions, particularly regarding the ability to create dynamic groups
- as a result of the dynamic group for the instance_principal not being created, the admin host didn't enjoy instance_principal privileges
- this resulted in the admin host being unable to use the oci cli to generate the kubeconfig
- since the kubeconfig is not generated, then the service accounts could not be created either
Adding the provider.tf is documented in the quickstart doc, although we only recently updated it, so all of us collectively forgot it should be added.
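To see which identity the provider would pick up when no provider.tf is present, it can help to print the relevant keys of the profile in the OCI config file. The sketch below assumes the usual INI layout of that file (a [DEFAULT] section with user, tenancy and region keys); show_oci_profile is a hypothetical helper name, and the default path is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: print the user, tenancy and region lines of one
# profile from an OCI config file.
# Usage: show_oci_profile [config_file] [profile]
show_oci_profile() {
  awk -v profile="${2:-DEFAULT}" '
    # A section header switches the "inside the wanted profile" flag.
    /^\[.*\]$/ { in_profile = ($0 == "[" profile "]"); next }
    in_profile && /^(tenancy|user|region) *=/ { print }
  ' "${1:-$HOME/.oci/config}"
}
```

Comparing that output with the OCIDs in terraform.tfvars shows quickly whether Terraform and the config file point at the same tenancy and user.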
I'll be merging now.
Thanks a lot everyone for your help and patience to troubleshoot this. On the plus side, we have consequently made the underlying base module more robust. So, anybody who's building on top of this repo and using the admin host to install things into their oke cluster can rely on a more definite pattern.
from terraform-oci-oke.
Fixed in #146
from terraform-oci-oke.