Comments (38)

styledigger commented on June 12, 2024

Got it working.
Apart from waiting for cloud-init to finish, I also had to set OCI_CLI_AUTH=instance_principal.

I fixed it by adding the following code at the beginning of create_service_account.sh:

while [ ! -f /home/opc/admin.finish ];
do
  echo "waiting for admin to be ready"; sleep 10;
done
sleep 10
export OCI_CLI_AUTH=instance_principal
....

Not the most elegant fix; it would be better not to connect to the admin host and run create_service_account.sh until admin is ready. OCI_CLI_AUTH is in fact set by cloud-init; we are just logging onto the admin host too early.
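
A minimal sketch of the same wait, assuming the /home/opc/admin.finish marker described above. The function name wait_for_file and the timeout values are hypothetical and not part of the module. Unlike the bare loop, it gives up after a bounded time, so a cloud-init run that never completes does not leave the script looping indefinitely:

```shell
# Wait until a marker file exists, giving up after a timeout.
# Usage: wait_for_file PATH [TIMEOUT_SECONDS] [POLL_INTERVAL_SECONDS]
wait_for_file() {
  wf_path=$1
  wf_timeout=${2:-600}   # assumed default: 10 minutes
  wf_interval=${3:-10}   # assumed default: poll every 10 seconds
  wf_waited=0
  while [ ! -f "$wf_path" ]; do
    if [ "$wf_waited" -ge "$wf_timeout" ]; then
      echo "timed out after ${wf_waited}s waiting for $wf_path" >&2
      return 1
    fi
    echo "waiting for $wf_path"
    sleep "$wf_interval"
    wf_waited=$((wf_waited + wf_interval))
  done
  return 0
}

# Possible use at the top of create_service_account.sh:
#   wait_for_file /home/opc/admin.finish 900 10 || exit 1
#   export OCI_CLI_AUTH=instance_principal
```

Returning a non-zero status on timeout lets the calling script stop with a clear message instead of running the remaining commands against a host that is not ready.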

I suppose generate_kubeconfig.sh can be fixed the same way. However, I didn't use this script to generate the kubeconfig on the admin host. Instead, I did it like this:


resource "null_resource" "write_kubeconfig_on_admin" {
  connection {
    host        = var.oke_admin.admin_private_ip
    private_key = file(var.oke_ssh_keys.ssh_private_key_path)
    timeout     = "40m"
    type        = "ssh"
    user        = "opc"

    bastion_host        = var.oke_admin.bastion_public_ip
    bastion_user        = "opc"
    bastion_private_key = file(var.oke_ssh_keys.ssh_private_key_path)
  }

  depends_on = [oci_containerengine_cluster.k8s_cluster]

  provisioner "file" {
    content     = data.oci_containerengine_cluster_kube_config.kube_config.content
    destination = "~/.kube/config"
  }

  count = var.oke_admin.bastion_enabled == true && var.oke_admin.admin_enabled == true ? 1 : 0
}

from terraform-oci-oke.

hyder commented on June 12, 2024

Thanks for logging this issue. Can you please confirm you have:

bastion_enabled = true
admin_enabled = true
admin_instance_principal = true

?

redscaresu commented on June 12, 2024

> Thanks for logging this issue. Can you please confirm you have:
>
> bastion_enabled = true
> admin_enabled = true
> admin_instance_principal = true
>
> ?

Thanks for your response.

All these are set in my tfvars.

admin_instance_principal = true
admin_enabled = true
bastion_enabled = true

redscaresu commented on June 12, 2024

It looks like this line is never run:

https://github.com/oracle-terraform-modules/terraform-oci-oke/blob/master/modules/oke/kubeconfig.tf#L97

However, if you comment out this line

"rm -f $HOME/generate_kubeconfig.sh"

and SSH to the server, running the script manually there does generate the kubeconfig.

So I think the problem is that generate_kubeconfig.sh fails, and the service account creation then bombs out as a result.

hyder commented on June 12, 2024

The generate_kubeconfig.sh is supposed to run automatically once:

  1. the oci-cli has been installed on the admin
  2. and the oke cluster created

We have put depends_on in a few places to make the ordering of these actions deterministic. Looks like we may have missed some. Or this was possibly introduced when we shifted the instance_principal to the admin from the bastion. We'll hunt it down and fix it.

hyder commented on June 12, 2024

Could be related to #140

redscaresu commented on June 12, 2024

I think you are right here. This morning I made the following change so that the output of the scripts is logged:

"$HOME/create_service_account.sh >>kubeconfig.log 2>&1",
"$HOME/generate_kubeconfig.sh >>kubeconfig.log 2>&1",

The resulting kubeconfig.log contains:

/home/opc/generate_kubeconfig.sh: line 5: /usr/local/bin/oci: No such file or directory
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?

So essentially, what I think is happening is that generate_kubeconfig.sh is being run before admin_instance_principal is enabled, which means that oci is not available when generate_kubeconfig.sh runs.
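
The first log line suggests the script ran before /usr/local/bin/oci existed. One way to make that failure explicit is to check for the required tools up front. This is only a sketch; require_cmds is a hypothetical helper, not something in the module:

```shell
# Return 0 if every named command is found on PATH, 1 otherwise.
# Each missing command is reported on stderr.
require_cmds() {
  rc_missing=0
  for rc_cmd in "$@"; do
    if ! command -v "$rc_cmd" >/dev/null 2>&1; then
      echo "required command not found: $rc_cmd" >&2
      rc_missing=1
    fi
  done
  return "$rc_missing"
}

# Possible use at the top of generate_kubeconfig.sh:
#   require_cmds oci kubectl || exit 1
```

A check like this does not fix the ordering, but it turns a confusing "connection refused" further down the log into an immediate, specific error.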

redscaresu commented on June 12, 2024

definitely an ordering issue here.

while [ ! -f /home/opc/admin.finish ]
do
  sleep 30
done
oci ce cluster create-kubeconfig --cluster-id ${cluster-id} --file $HOME/.kube/config  --region ${region} --token-version 2.0.0

The above code gets rid of the issue with the oci client being run before it's installed. The only thing left is that the oci command is being run before the instance principal has been successfully enabled.

saurabhuja commented on June 12, 2024

Yeah, this is an ordering issue. I'm stuck at the same place:
module.oke.null_resource.write_kubeconfig_on_admin[0] (remote-exec): ERROR: Could not find config file at /home/opc/.oci/config, please follow the instructions in the link to setup the config file https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm

Looks like the oci client is not installed before "write_kubeconfig_on_admin" runs.
The question is: how do we order admin_instance_principal before this?
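
The error message is consistent with the CLI falling back to config-file authentication because OCI_CLI_AUTH was not set in the session. A small diagnostic along these lines could be run on the admin host to see which path applies. It is a sketch under that assumption: oci_auth_mode and the labels config_file and unconfigured are made up for illustration, and it ignores other ways of pointing the CLI at a config file.

```shell
# Print the authentication path the OCI CLI would be expected to take,
# judged only from OCI_CLI_AUTH and the default config file location.
oci_auth_mode() {
  if [ -n "${OCI_CLI_AUTH:-}" ]; then
    # e.g. instance_principal, as exported by cloud-init on the admin host
    echo "$OCI_CLI_AUTH"
  elif [ -f "${HOME}/.oci/config" ]; then
    echo "config_file"     # illustrative label, not a CLI value
  else
    echo "unconfigured"    # illustrative label, not a CLI value
  fi
}

# Possible use before calling the CLI:
#   [ "$(oci_auth_mode)" = "instance_principal" ] || echo "OCI_CLI_AUTH not set yet" >&2
```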

hyder commented on June 12, 2024

The instance_principal, if enabled, is created immediately after the admin instance is created. See here:

https://github.com/oracle-terraform-modules/terraform-oci-oke/blob/master/docs/dependencies.adoc

By the time cloud-init has finished on the compute instance, the dynamic group and the policy for instance_principal will already have been created.

I'm adding @redscaresu's fix and also a dependency on install_kubectl. Given the installation of kubectl on admin is done through null_resource.install_kubectl_admin and therefore requires the compute instance to be up, this should ensure that the instance_principal would have been created by then. Together, I think these 2 should be enough. If not, then we'll look at the instance_principal in the base module, maybe add an explicit dependency there.

The dependencies of all the additional functionality that we add now are documented here.

I've submitted a PR: #146 . Can you please test and let us know?

Thanks very much for your patience and help to hunt this.

redscaresu commented on June 12, 2024

@hyder looks like that has worked now. Thanks for your help.

hyder commented on June 12, 2024

Thanks @redscaresu. @sauraahu can you please confirm if this works for you as well? We can then merge and cut a new release for the registry.

redscaresu commented on June 12, 2024

Sorry, I've been testing a bit more. While this is an improvement, I don't think the issue is totally gone.

It looks like there is still an ordering issue here. On the first apply we still attempt to create the kubeconfig before the instance principal is set up.

On the second apply it is able to create the kubeconfig and then create the service account. According to some logging I added, this is what happens.

On the first terraform apply, it tries to create the kubeconfig and fails because the admin host does not yet have the instance principal applied to it:
ERROR: Could not find config file at /home/opc/.oci/config, please follow the instructions in the link to setup the config file https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm

On second terraform apply

New config written to the Kubeconfig file /home/opc/.kube/config
serviceaccount/kubeconfigsa created
clusterrolebinding.rbac.authorization.k8s.io/kubeconfigsa-crb created

redscaresu commented on June 12, 2024

a bit more information.

On the first terraform apply I am still unable to log on to the bastion, so it looks like the remote-execs to the bastion and admin are being attempted before I have access to those machines.

A second terraform apply seems to resolve this ordering issue.

hyder commented on June 12, 2024

I've added an additional wait to ensure the instance_principal has been created before kubeconfig is generated. Can you please try again? You'll need to make a pull on the branch.

redscaresu commented on June 12, 2024

Still fails on the first terraform apply but the ordering issue is solved on the second apply.

It looks like there is a certain amount of time between enabling the instance principal and actually being granted this privilege.

https://gist.github.com/redscaresu/e1e989abf48f2024cacd9b15593f285a

The above error log shows that the write_kubeconfig_on_admin resource fails.

Looking at the log file, kubeconfig.log:

waiting for admin to be ready
waiting for admin to be ready
ERROR: Could not find config file at /home/opc/.oci/config, please follow the instructions in the link to setup the config file https://docs.cloud.oracle.com/en-us/iaas/Content/API/Concepts/sdkconfig.htm

We can see that we looped twice before trying to execute

oci ce cluster create-kubeconfig --cluster-id ${cluster-id} --file $HOME/.kube/config --region ${region} --token-version 2.0.0

That means that even though /home/opc/ip.finish exists, that's not enough to know whether we have been granted the instance principal, which means the existence of oci_identity_dynamic_group and oci_identity_policy is not enough to ensure that the admin host has instance principal privileges yet.

hyder commented on June 12, 2024

@redscaresu I've added a 30s sleep between instance_principal being detected and generating the kubeconfig. Can you please test again?

Thanks again for your patience.

redscaresu commented on June 12, 2024

I tried that too, and unfortunately a simple sleep does not work.

So I tried to implement a rudimentary try/catch:

while [ ! -f /home/opc/admin.finish ]  || [ ! -f /home/opc/ip.finish  ];
do
  echo "waiting for admin to be ready"; sleep 10;
done

for i in `seq 1 20`;
do
  oci ce cluster create-kubeconfig --cluster-id ${cluster-id} --file $HOME/.kube/config  --region ${region} --token-version 2.0.0 && break
  sleep 20
done

The log is below

https://gist.github.com/redscaresu/f4c4ef86b9ad79c237a94b4458630541

It is interesting that no matter how long we wait, we never get the instance principal permission we need. It's almost as if we are waiting on a terraform operation to complete before we are given this permission; I just do not know which resource that is.

In the log you can see we hit the maximum of 20 loops, each with its 20 seconds of sleep, before bombing out. I don't think it matters how long you wait because it will always bomb out; we could wait for 100 and it would not matter.

Something needs to complete before we run resource "null_resource" "write_kubeconfig_on_admin". I just don't know what that is.

Below is the terraform output. You can see that module.oke.null_resource.write_kubeconfig_on_admin finally bombs out after hitting the 20th loop in the bash script and subsequently causes create_service_account.sh to bomb out. module.oke.null_resource.write_kubeconfig_on_admin will never have the instance principal on the first pass.

https://gist.github.com/redscaresu/697b82ded02c640e95aae3a04465f48a
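
One detail worth noting about a for/break loop like the one above: when every attempt fails, the loop itself finishes with the exit status of the final sleep, so the loop does not report the failure. A sketch of a bounded retry that returns the last failure instead (retry is a hypothetical helper, and the attempt count and delay in the usage comment are just the values used in this thread):

```shell
# Run a command up to MAX_ATTEMPTS times, sleeping DELAY seconds between tries.
# Usage: retry MAX_ATTEMPTS DELAY COMMAND [ARGS...]
# Returns 0 on the first success, or the last exit status if all attempts fail.
retry() {
  rt_max=$1
  rt_delay=$2
  shift 2
  rt_attempt=1
  while :; do
    "$@" && return 0
    rt_status=$?   # exit status of the failed command
    if [ "$rt_attempt" -ge "$rt_max" ]; then
      echo "giving up after $rt_attempt attempts (last exit status $rt_status)" >&2
      return "$rt_status"
    fi
    rt_attempt=$((rt_attempt + 1))
    sleep "$rt_delay"
  done
}

# Possible use in generate_kubeconfig.sh, with the template variables
# left exactly as they appear in the script above:
#   retry 20 20 oci ce cluster create-kubeconfig ... || exit 1
```

As the log in this comment shows, retrying only helps when the missing permission eventually arrives; it makes the failure visible but does not address the cause.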

redscaresu commented on June 12, 2024

This is probably a red herring but....

module.oke.null_resource.write_kubeconfig_on_admin[0]: Still creating... [2m0s elapsed]
2020/04/08 11:41:12 [TRACE] dag/walk: vertex "provisioner.file (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:12 [TRACE] dag/walk: vertex "root" is waiting for "provisioner.file (close)"
2020/04/08 11:41:12 [TRACE] dag/walk: vertex "module.oke.null_resource.create_service_account[0]" is waiting for "module.oke.null_resource.write_kubeconfig_on_admin[0]"
2020/04/08 11:41:15 [TRACE] dag/walk: vertex "provider.null (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:15 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:15 [TRACE] dag/walk: vertex "provisioner.remote-exec (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:17 [TRACE] dag/walk: vertex "root" is waiting for "provisioner.file (close)"
2020/04/08 11:41:17 [TRACE] dag/walk: vertex "provisioner.file (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:17 [TRACE] dag/walk: vertex "module.oke.null_resource.create_service_account[0]" is waiting for "module.oke.null_resource.write_kubeconfig_on_admin[0]"
2020/04/08 11:41:20 [TRACE] dag/walk: vertex "provider.null (close)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:20 [TRACE] dag/walk: vertex "meta.count-boundary (EachMode fixup)" is waiting for "module.oke.null_resource.create_service_account[0]"
2020/04/08 11:41:20 [TRACE] dag/walk: vertex "provisioner.remote-exec (close)" is waiting for "module.oke.null_resource.create_service_account[0]"

This seems weird. Am I mistaken, or does it look like write_kubeconfig_on_admin is waiting for module.oke.null_resource.create_service_account? write_kubeconfig_on_admin must come first; we can't create the service account until that's done.

module.oke.null_resource.write_kubeconfig_on_admin[0]: Still creating... [2m0s elapsed] 2020/04/08 11:41:12 [TRACE] dag/walk: vertex "provisioner.file (close)" is waiting for "module.oke.null_resource.create_service_account[0]

I tested this by setting create_service_account = false and it still failed with the same problem, so this is likely a red herring. While write_kubeconfig_on_admin was being created, I checked that the dynamic group was there with the associated admin instance and policy, and they were present. This is very strange.

styledigger commented on June 12, 2024

I have modified the kubeconfig.tf so that kubeconfig on admin host is created the same way as on local machine (using data.oci_containerengine_cluster_kube_config.kube_config.content).

The kubeconfig got created, but the create_service_account.sh failed:
The connection to the server localhost:8080 was refused...

Running create_service_account.sh manually works.

redscaresu commented on June 12, 2024

> I have modified the kubeconfig.tf so that kubeconfig on admin host is created the same way as on local machine (using data.oci_containerengine_cluster_kube_config.kube_config.content).
>
> The kubeconfig got created, but the create_service_account.sh failed:
> The connection to the server localhost:8080 was refused...
>
> Running create_service_account.sh manually works.

Nice! Can you show me what the dependency on the create_service_account terraform resource is? Is it dependent on the resource that creates your kubeconfig now?

styledigger commented on June 12, 2024

Oops, I created the kubeconfig in the wrong place and didn't set the KUBECONFIG env var. Going to destroy and apply again, fingers crossed.
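
When kubectl cannot find a kubeconfig it typically falls back to localhost:8080, which would explain the "connection refused" error seen earlier. A quick check of where kubectl is expected to look can be sketched as below. kubeconfig_path and check_kubeconfig are hypothetical helpers, and only the first entry of a colon-separated KUBECONFIG is considered, which is a simplification.

```shell
# Print the kubeconfig path kubectl would be expected to read first:
# the first entry of KUBECONFIG if set, otherwise $HOME/.kube/config.
kubeconfig_path() {
  if [ -n "${KUBECONFIG:-}" ]; then
    echo "${KUBECONFIG%%:*}"
  else
    echo "${HOME}/.kube/config"
  fi
}

# Return 0 if that file exists and is non-empty, 1 otherwise.
check_kubeconfig() {
  kc_path=$(kubeconfig_path)
  if [ -s "$kc_path" ]; then
    return 0
  fi
  echo "no usable kubeconfig at $kc_path" >&2
  return 1
}

# Possible use at the top of create_service_account.sh:
#   check_kubeconfig || exit 1
```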

styledigger commented on June 12, 2024

I have some progress:

  • kubeconfig got created
  • create_service_account.sh fails with Unable to connect to the server: getting credentials: exec: exec: "oci": executable file not found in $PATH

styledigger commented on June 12, 2024

looks like cloudinit did not finish yet, we should wait for /home/opc/admin.finish

hyder commented on June 12, 2024

> The OCI_CLI_AUTH is in fact set by cloudinit, we are just logging onto admin host too early.

Aha! I think this was the issue. So, I'm moving the delay to another null_resource instead.

We were rendering the kubeconfig before. However, that stored the kubeconfig in the state file, which I thought was not a good idea.

Oddly:

  1. I haven't run into any of the above issues at all
  2. The other scripts e.g. ocirsecret, metricserver, calico etc that all depend on the kubeconfig haven't run into this problem either.

I'll add @styledigger's findings and make another push soon. Can I trouble you to test again?

hyder commented on June 12, 2024

Ok, I've pushed an update to my branch. Can you please make a pull and test again?

hyder commented on June 12, 2024

Given I still couldn't replicate the issue, I'll need 2 confirmations from those who have been able to in order to confirm we've fixed it: @redscaresu and @styledigger at least.

styledigger commented on June 12, 2024

Will test it now. @hyder Just to be sure, updates have been pushed to git@github.com:hyder/terraform-oci-oke.git?

hyder commented on June 12, 2024

Yes, in branch issue-143.

Use the following if you want to test with your existing clone:

git checkout -b hyder-issue-143 master
git pull https://github.com/hyder/terraform-oci-oke.git issue-143

Or you can do a fresh clone from my fork instead and checkout the issue-143 branch

styledigger commented on June 12, 2024

I did a fresh clone:
git clone git@github.com:hyder/terraform-oci-oke.git

Switched to a new branch 'hyder-issue-143':

git checkout -b hyder-issue-143 master 
git pull https://github.com/hyder/terraform-oci-oke.git issue-143

Now terraform plan fails:

...
module.base.module.vcn.data.oci_core_services.all_oci_services[0]: Refreshing state...
module.network.data.oci_core_services.all_oci_services[0]: Refreshing state...

Error: Null value found in list

  on modules\policies\datasources.tf line 9, in data "oci_identity_regions" "home_region":
   9: data "oci_identity_regions" "home_region" {

Null values are not allowed for this attribute value.


Error: Invalid function argument

  on .terraform\modules\base\terraform-oci-base-1.1.3\datasources.tf line 9, in data "template_file" "ad_names":
   9:   count    = length(data.oci_identity_availability_domains.ad_list.availability_domains)
    |----------------
    | data.oci_identity_availability_domains.ad_list.availability_domains is null

Invalid value for "value" parameter: argument must not be null.


Error: Null value found in list

  on .terraform\modules\base\terraform-oci-base-1.1.3\datasources.tf line 18, in data "oci_identity_regions" "home_region":
  18: data "oci_identity_regions" "home_region" {

Null values are not allowed for this attribute value.


Error: Attempt to index null value

  on .terraform\modules\base\terraform-oci-base-1.1.3\modules\admin\locals.tf line 12, in locals:
  12:   admin_image_id = var.oci_admin.admin_image_id == "Oracle" ? data.oci_core_images.admin_images.images.0.id : var.oci_admin.admin_image_id
    |----------------
    | data.oci_core_images.admin_images.images is null

This value is null, so it does not have any indices.


Error: Attempt to index null value

  on .terraform\modules\base\terraform-oci-base-1.1.3\modules\bastion\locals.tf line 12, in locals:
  12:   bastion_image_id = var.oci_bastion.bastion_image_id == "Autonomous" ? data.oci_core_images.autonomous_images.images.0.id : var.oci_bastion.bastion_image_id
    |----------------
    | data.oci_core_images.autonomous_images.images is null

This value is null, so it does not have any indices.

saurabhuja commented on June 12, 2024

I just took https://github.com/hyder/terraform-oci-oke.git and checkout issue-143 branch, it created cluster successfully for me.

hyder commented on June 12, 2024

> I just took https://github.com/hyder/terraform-oci-oke.git and checkout issue-143 branch, it created cluster successfully for me.

Thanks @sauraahu. I'll need 1 more confirmation from either @redscaresu or @styledigger as I still haven't been able to replicate their issue but I understand where it could be coming from.

saurabhuja commented on June 12, 2024

>> I just took https://github.com/hyder/terraform-oci-oke.git and checkout issue-143 branch, it created cluster successfully for me.
>
> Thanks @sauraahu. I'll need 1 more confirmation from either @redscaresu or @styledigger as I still haven't been able to replicate their issue but I understand where it could be coming from.

Sure. Meanwhile, I am thinking of hosting an example of how to create an OKE cluster and deploy a sample hello-world application, as that requires some additional steps such as building a Docker image (or using an existing one), uploading the image to OCIR, and creating a sample .yml with a deployment and service configured. What would be the right place to put that example?

saurabhuja commented on June 12, 2024

The only issue I am getting is during terraform destroy; I don't know the reason.
First run:
module.network.oci_core_subnet.pub_lb[0]: Still destroying... [id=ocid1.subnet.oc1.ap-mumbai-1.aaaaaaaaqm...2xgvnzms2gyzwlmvkgjrogflef7zesuridbhwa, 10m10s elapsed]

Error: Service error:Conflict. The Subnet ocid1.subnet.oc1.ap-mumbai-1.aaaaaaaaqmzf2mjizqomds2xgvnzms2gyzwlmvkgjrogflef7zesuridbhwa references the VNIC ocid1.vnic.oc1.ap-mumbai-1.abrg6ljr4hnwbd4m4fsfmzldixv657vkzfbyeqlrmsyd7eusr6c4px4xcngq. You must remove the reference to proceed with this operation.. http status code: 409. Opc request id: d870a1beb10aeb1ab29a4110a93ae2b4/98933FF0DF57B2AF5B029C4FB4DF4B3A/EE342E4DD141CE0F6DDD1637CBABA34E

Second run:
module.base.module.vcn.oci_core_vcn.vcn: Destruction complete after 1s

Error: Error in function call

on modules/auth/outputs.tf line 5, in output "ocirtoken":
5: value = var.ocir.create_auth_token == true ? element(oci_identity_auth_token.ocirtoken.*.token, 0) : "none"
|----------------
| oci_identity_auth_token.ocirtoken is empty tuple

Call to function "element" failed: cannot use element function with an empty
list.

Error: Error in function call

on modules/auth/outputs.tf line 9, in output "ocirtoken_id":
9: value = var.ocir.create_auth_token == true ? element(oci_identity_auth_token.ocirtoken.*.id, 0) : "none"
|----------------
| oci_identity_auth_token.ocirtoken is empty tuple

Call to function "element" failed: cannot use element function with an empty
list.

Error: Invalid index

on .terraform/modules/base/terraform-oci-base-1.1.3/modules/admin/outputs.tf line 9, in output "admin_instance_principal_group_name":
9: value = var.oci_admin.admin_enabled == true && var.oci_admin.enable_instance_principal == true ? oci_identity_dynamic_group.admin_instance_principal[0].name : null
|----------------
| oci_identity_dynamic_group.admin_instance_principal is empty tuple

The given key does not identify an element in this collection value.

styledigger commented on June 12, 2024

Both apply and destroy now work for me.
The problem is that there is no provider.tf in the issue-143 branch's root folder, so the OCI provider loads the required OCIDs from ~\.oci\config (which has wrong values in my case).

saurabhuja commented on June 12, 2024

Yeah, both apply and destroy work for me now. I am testing other features like dashboard, OCIR secret, helm etc. I will open a separate bug if I find issues there. Meanwhile, please go ahead and merge this.

hyder commented on June 12, 2024

Right, so let me summarize why this happened:

  1. we removed the provider.tf for #130
  2. this made the terraform provider use the oci config, which for some of you may have different permissions, particularly regarding the ability to create dynamic groups
  3. as a result of the dynamic group for the instance_principal not being created, the admin host didn't enjoy instance_principal privileges
  4. this resulted in the admin host being unable to use the oci cli to generate the kubeconfig
  5. since the kubeconfig was not generated, the service accounts could not be created either

Adding the provider.tf is documented in the quickstart doc, although we only recently updated it, so all of us collectively forgot it should be added.

I'll be merging now.

Thanks a lot everyone for your help and patience to troubleshoot this. On the plus side, we have consequently made the underlying base module more robust. So, anybody who's building on top of this repo and using the admin host to install things into their oke cluster can rely on a more definite pattern.

hyder commented on June 12, 2024

Fixed in #146
