
Comments (8)

chamcca commented on May 25, 2024 1

OK. InternetAccessible: false in your manifest does flag your subnets as "isolated", meaning there's no route to the internet. As you said, you have no route to 0.0.0.0/0. We don't get a lot of deployments set up like that, and there does appear to be a bug that makes the deployment fail if InternetAccessible: false is set and the SSM Agent isn't installed on the Nodes. Also, you don't have a compute NodeGroup defined in your manifest, so there won't be any nodes to deploy user notebooks on. I recommend adding the following to your manifest. It creates a NodeGroup (customize as you see fit) and forces installation of the SSM Agent on the Nodes (a recommended best practice, as it can allow SSM to patch the Nodes).

InstallSsmAgent: true
ManagedNodegroups:
-   Name: primary-compute
    InstanceType: m5.2xlarge
    LocalStorageSize: 128
    NodesNumDesired: 1
    NodesNumMax: 4
    NodesNumMin: 0
    Labels:
        instance-type: m5.2xlarge

Once added, deploy the env again.
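
For anyone who wants to catch this before deploying, here is a rough sketch (not part of the Orbit CLI) of the two checks described above. It assumes the manifest has already been parsed into a plain dict; loading the file is left out because the manifest uses custom tags such as !include and !SSM. The function name check_isolated_manifest is hypothetical, and treating a missing InternetAccessible key as "accessible" is an assumption.

```python
from typing import Any, Dict, List


def check_isolated_manifest(manifest: Dict[str, Any]) -> List[str]:
    """Return warnings for the two manifest problems discussed in this comment.

    `manifest` is the already-parsed manifest as a plain dict (hypothetical helper).
    """
    warnings: List[str] = []
    data = manifest.get("Networking", {}).get("Data", {})
    # Assumption: a missing InternetAccessible key means the subnets are not isolated.
    isolated = data.get("InternetAccessible", True) is False

    # Isolated subnets without the SSM Agent is the combination reported to fail.
    if isolated and not manifest.get("InstallSsmAgent", False):
        warnings.append("InternetAccessible is false but InstallSsmAgent is not true")

    # Without a node group there are no nodes to run user notebooks on.
    if not manifest.get("ManagedNodegroups"):
        warnings.append("no ManagedNodegroups defined")
    return warnings


if __name__ == "__main__":
    # Hypothetical, cut-down manifest for illustration only.
    example = {"Name": "orbit", "Networking": {"Data": {"InternetAccessible": False}}}
    for warning in check_isolated_manifest(example):
        print("WARNING:", warning)
```

With the snippet above added to the manifest, both warnings should disappear.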

from aws-orbit-workbench.

chamcca commented on May 25, 2024 1

This is not an error we've encountered before, and I was unable to reproduce it. I suppose you could try commenting out or removing line 21 from cli/aws_orbit/data/kubectl/kube_system/00-observability.yaml

But I don't know why this would be necessary. Haven't you already done a deployment?


chamcca commented on May 25, 2024

Getting the full YAML or JSON definition of the Pod will also return the Status, which may have additional info. Can you run:

kubectl get pods -n orbit-system -o json imagereplication-operator-658d9cdf94-iwfx5

or

kubectl get pods -n orbit-system -o yaml imagereplication-operator-658d9cdf94-iwfx5

Also, it might be worth deleting the pod and letting the Deployment/ReplicaSet recreate it:

kubectl delete pods -n orbit-system imagereplication-operator-658d9cdf94-iwfx5
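
If the JSON output is long, a small script can pull out the parts of the Status that usually explain a stuck pod. This is only a sketch: summarize_pod_status is a hypothetical helper, and the sample pod below is made up for illustration, not taken from this cluster. It relies only on the standard Pod status fields (phase, conditions, containerStatuses).

```python
from typing import Any, Dict, List


def summarize_pod_status(pod: Dict[str, Any]) -> List[str]:
    """Summarize the status block of a Pod parsed from `kubectl get pods -o json <name>`."""
    status = pod.get("status", {})
    lines = [f"phase: {status.get('phase', 'Unknown')}"]

    # Report every condition that is not currently satisfied.
    for cond in status.get("conditions", []):
        if cond.get("status") != "True":
            text = f"condition {cond.get('type')}: {cond.get('reason', '')} {cond.get('message', '')}"
            lines.append(text.strip())

    # Report containers that are waiting or terminated, with the reason given.
    for cs in status.get("containerStatuses", []):
        for state_name, detail in cs.get("state", {}).items():
            if state_name != "running":
                reason = (detail or {}).get("reason", "no reason given")
                lines.append(f"container {cs.get('name')}: {state_name} ({reason})")
    return lines


if __name__ == "__main__":
    # Hypothetical pod status, for illustration only.
    sample = {
        "status": {
            "phase": "Pending",
            "conditions": [{"type": "Ready", "status": "False", "reason": "ContainersNotReady"}],
            "containerStatuses": [
                {"name": "operator", "state": {"waiting": {"reason": "ImagePullBackOff"}}}
            ],
        }
    }
    for line in summarize_pod_status(sample):
        print(line)
```

To use it on the real pod, load the output of the kubectl get pods ... -o json command above with json.load and pass the resulting dict to the function.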


chamcca commented on May 25, 2024

Are you deploying orbit into isolated subnets? The image-replication operator and webhook should only be deployed if orbit is deployed in an isolated environment.
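
One way to answer that question is to look for a default route in the route tables associated with the node subnets. The sketch below assumes the route tables are dicts shaped like the entries in the EC2 DescribeRouteTables response (Routes, DestinationCidrBlock, DestinationIpv6CidrBlock, State). The function names are hypothetical, and "no active default route" is a simplification of what "isolated" means here.

```python
from typing import Any, Dict, List

# Destinations that represent "everything", i.e. a route towards the internet.
DEFAULT_ROUTES = {"0.0.0.0/0", "::/0"}


def has_default_route(route_table: Dict[str, Any]) -> bool:
    """True if the route table contains an active default route."""
    for route in route_table.get("Routes", []):
        dest = route.get("DestinationCidrBlock") or route.get("DestinationIpv6CidrBlock")
        # Blackhole routes do not carry traffic, so only active ones count.
        if dest in DEFAULT_ROUTES and route.get("State", "active") == "active":
            return True
    return False


def is_isolated(route_tables: List[Dict[str, Any]]) -> bool:
    """True if none of the given route tables has an active default route."""
    return not any(has_default_route(rt) for rt in route_tables)


if __name__ == "__main__":
    # Hypothetical route table with only the local VPC route.
    local_only = [{"Routes": [{"DestinationCidrBlock": "10.0.0.0/16",
                               "GatewayId": "local", "State": "active"}]}]
    print("isolated:", is_isolated(local_only))
```

Note that an empty list of route tables also counts as isolated in this sketch, so make sure the tables for the right subnets are passed in.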


bozethe commented on May 25, 2024

Below is our manifest file. Could you please confirm that it's fine? I believe our private subnet is more like an isolated subnet, since it doesn't have a route to 0.0.0.0/0 (see below).
[screenshot]

Name: orbit
ScratchBucketArn: arn:aws:s3:::orbit-poc0000
UserPoolId: eu-central-000000000
SharedEfsFsId: fs-00000000
SharedEfsSgId: sg-00000000
Networking:
    VpcId: vpc-00000000000
    PublicSubnets: ["subnet-11111111111111", "subnet-222222222222222"]
    PrivateSubnets: ["subnet-333333333333", "subnet-44444444444444"]
    Data:
        InternetAccessible: false
        NodesSubnets: ["subnet-333333333333", "subnet-44444444444444"]
    Frontend:
        LoadBalancersSubnets: ["subnet-11111111111111", "subnet-222222222222222"]
        #SslCertArn: !SSM ${/orbit-f/orbit/resources::SslCertArn}
Images:
    JupyterUser:
        Repository: 0000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/jupyter-user
        Version: latest
    OrbitController:
        Repository: 00000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/orbit-controller
        Version: latest
    UtilityData:
        Repository: 0000000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/utility-data
        Version: latest
Teams:
-   Name: sample-admin
    Policies:
    - None
    GrantSudo: true
    Fargate: true
    K8Admin: true
    JupyterhubInboundRanges:
    - 0.0.0.0/0
    EfsLifeCycle: AFTER_7_DAYS
    Plugins: !include common_plugins.yaml
    AuthenticationGroups:
    - sample-admin

I checked the pods again just now, and all of them seem to be running (see below).
[screenshot]

However, when I run the stack again (orbit deploy env), it now fails with a different error (see below).

[2022-07-21 15:46:53,380][k8s.py : 37] Endpoint Subsets: [{'addresses': [{'hostname': None, 'ip': '10.10.10.112', 'node_name': 'ip-10-19-15-31.eu-central-1.compute.internal', 'target_ref': {'api_version': None, 'field_path': None, 'kind': 'Pod', 'name': 'imagereplication-pod-webhook-58df9d9cb4-xtb9s', 'namespace': 'orbit-system', 'resource_version': '9885', 'uid': '3a4560f2-fe0b-47b3-ab3f-5459a9c5f5b8'}}], 'not_ready_addresses': None, 'ports': [{'name': 'https', 'port': 443, 'protocol': 'TCP'}]}]
[2022-07-21 15:46:53,380][kubectl.py :578] Service: imagereplication-pod-webhook Namespace: orbit-system Hostname: None IP: 10.19.13.112
[2022-07-21 15:46:53,380][sh.py : 28] + kubectl rollout restart daemonsets -n orbit-system-ssm-daemons ssm-agent-installer --context AWSCodeBuild-a372ab74-b1d3-478b-871d-546f9f18a0ec@orbit-orbit.eu-central-1.eksctl.io
[2022-07-21 15:46:53,486][sh.py : 30] Error from server (NotFound): namespaces "orbit-system-ssm-daemons" not found
[2022-07-21 15:46:53,487][sh.py : 30]
Traceback (most recent call last):
  File "/root/.venv/bin/codeseeder", line 8, in <module>
    sys.exit(main())
  File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
    cli()
  File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 153, in execute
    func(*fn_args["args"], **fn_args["kwargs"])
  File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 398, in deploy_env
    deploy_env(env_name=env_name, manifest_dir=manifest_dir)
  File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/codeseeder.py", line 229, in wrapper
    return func(*args, **kwargs)
  File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 385, in deploy_env
    kubectl.deploy_env(context=context)
  File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/kubectl.py", line 649, in deploy_env
    "kubectl rollout restart daemonsets -n orbit-system-ssm-daemons "
  File "/root/.venv/lib/python3.7/site-packages/aws_orbit/sh.py", line 29, in run
    for line in _run_iterating(cmd=cmd, cwd=cwd):
  File "/root/.venv/lib/python3.7/site-packages/aws_orbit/sh.py", line 23, in _run_iterating
    raise FailedShellCommand(f"Exit code: {p.returncode}")
aws_orbit.exceptions.FailedShellCommand: Exit code: 1

[Container] 2022/07/21 15:46:53 Command did not exit successfully codeseeder execute --args-file fn_args.json --debug exit status 1
[Container] 2022/07/21 15:46:53 Phase complete: BUILD State: FAILED
[Container] 2022/07/21 15:46:53 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: codeseeder execute --args-file fn_args.json --debug. Reason: exit status 1
[Container] 2022/07/21 15:46:53 Entering phase POST_BUILD
[Container] 2022/07/21 15:46:53 Running command . ~/.venv/bin/activate

[Container] 2022/07/21 15:46:53 Running command cd ${CODEBUILD_SRC_DIR}/bundle

[Container] 2022/07/21 15:46:53 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2022/07/21 15:46:53 Phase context status code: Message:


bozethe commented on May 25, 2024

I agree with you that if our VPC and subnets weren't set up the way they are now, the installation would have gone smoothly.
I had to download the cert-manager, cert-manager-cainjector and cert-manager-webhook images, push them to our private ECR, and update the deployment files to point to our ECR instead of the public ECR to avoid pod creation failures for those pods.


bozethe commented on May 25, 2024

The installation got past the previous error after applying your changes. However, this is the new error we get (error log below the screenshot). I included the screenshot to show what is currently running in the cluster.

[screenshot]

[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,076][sh.py : 30] deployment.apps/cluster-autoscaler configured
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,076][sh.py : 30] Error from server: error when creating ".orbit.out/orbit/kubectl/kube-system/00-observability.yaml": admission webhook "0500-amazon-eks-fargate-configmaps-admission.amazonaws.com" denied the request: Invalid value at auto_create_group On
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,079][sh.py : 30]
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] Traceback (most recent call last):
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD]   File "/root/.venv/bin/codeseeder", line 8, in <module>
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD]     sys.exit(main())
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD]   File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD]     cli()
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD]   File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__

[screenshot]


bozethe commented on May 25, 2024

Thanks @chamcca
