Comments (8)
ok. InternetAccessible: false in your manifest does flag your subnets as "isolated", meaning there's no route to the internet. as you said, you have no route to 0.0.0.0/0. we don't get a lot of deployments set up like that, and there does appear to be a bug in the deployment that makes it fail if InternetAccessible: false is set and the SSM Agent isn't installed on the Nodes. also, you don't have a compute NodeGroup defined in your manifest, so there won't be any nodes to deploy user notebooks on. I recommend adding the following to your manifest, which will create a NodeGroup (customize as you see fit) and force installation of the SSM Agent on the Nodes (a recommended best practice, as it can allow SSM to patch the Nodes).
InstallSsmAgent: true
ManagedNodegroups:
  - Name: primary-compute
    InstanceType: m5.2xlarge
    LocalStorageSize: 128
    NodesNumDesired: 1
    NodesNumMax: 4
    NodesNumMin: 0
    Labels:
      instance-type: m5.2xlarge
once added, deploy the env again.
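As a rough illustration of the two conditions described above, the following sketch checks a manifest that has already been parsed into a dictionary. The function `check_manifest` is a hypothetical helper and not part of the orbit CLI, and treating a missing `InternetAccessible` key as "not isolated" is an assumption.

```python
from typing import Any, Dict, List


def check_manifest(manifest: Dict[str, Any]) -> List[str]:
    """Return warnings for a parsed manifest (hypothetical helper, not part of the orbit CLI)."""
    warnings: List[str] = []
    data = manifest.get("Networking", {}).get("Data", {})
    # Assumption: a missing InternetAccessible key is treated as "not isolated".
    isolated = data.get("InternetAccessible", True) is False
    if isolated and manifest.get("InstallSsmAgent") is not True:
        warnings.append("isolated subnets: set InstallSsmAgent: true")
    if not manifest.get("ManagedNodegroups"):
        warnings.append("no ManagedNodegroups: no nodes for user notebooks")
    return warnings


# Minimal stand-ins for the manifest before and after the suggested change.
before = {"Networking": {"Data": {"InternetAccessible": False}}}
after = dict(before, InstallSsmAgent=True, ManagedNodegroups=[{"Name": "primary-compute"}])

print(check_manifest(before))  # two warnings
print(check_manifest(after))   # no warnings
```

A check like this could be run locally before `orbit deploy env` to catch the combination that appears to trigger the failure.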
from aws-orbit-workbench.
this is not an error we've encountered before, and i was unable to reproduce it. i suppose you could try commenting out or removing line 21 from cli/aws_orbit/data/kubectl/kube_system/00-observability.yaml, but i don't know why this would be necessary. haven't you already done a deployment?
getting the full yaml or json definition of the Pod will also return the Status which may have additional info. can you run:
kubectl get pods -n orbit-system -o json imagereplication-operator-658d9cdf94-iwfx5
or
kubectl get pods -n orbit-system -o yaml imagereplication-operator-658d9cdf94-iwfx5
also, it might be worth trying to delete the pod and let the Deployment/ReplicaSet recreate it:
kubectl delete pods -n orbit-system imagereplication-operator-658d9cdf94-iwfx5
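If the JSON output is long, a small script can pull out the parts of the Status that usually explain a stuck pod. This is only a sketch: `summarize_pod` is a hypothetical helper, and the sample document is made up for illustration (it is not output from this cluster). The field names `status.phase` and `status.containerStatuses[].state` follow the standard Kubernetes Pod schema.

```python
import json
from typing import Any, Dict, List


def summarize_pod(pod: Dict[str, Any]) -> List[str]:
    """Summarize the phase and any waiting/terminated containers of a Pod object."""
    status = pod.get("status", {})
    lines = [f"phase={status.get('phase', 'Unknown')}"]
    for cs in status.get("containerStatuses", []):
        state = cs.get("state", {})
        for kind in ("waiting", "terminated"):
            if kind in state:
                detail = state[kind]
                lines.append(f"{cs.get('name')}: {kind} reason={detail.get('reason')}")
    return lines


# Made-up sample shaped like `kubectl get pods -o json <pod-name>` output.
sample = json.loads("""
{
  "status": {
    "phase": "Pending",
    "containerStatuses": [
      {"name": "operator", "state": {"waiting": {"reason": "ImagePullBackOff"}}}
    ]
  }
}
""")

for line in summarize_pod(sample):
    print(line)
```

In practice the dictionary would come from `json.load(sys.stdin)` with the kubectl output piped in.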
are you deploying orbit into isolated subnets? the image-replication operator and webhook should only be deployed if orbit is deployed in an isolated environment.
Below is our manifest file. Could you please confirm whether it's fine? I believe our private subnets are more like isolated subnets, since they have no route to 0.0.0.0/0 (see below).
Name: orbit
ScratchBucketArn: arn:aws:s3:::orbit-poc0000
UserPoolId: eu-central-000000000
SharedEfsFsId: fs-00000000
SharedEfsSgId: sg-00000000
Networking:
  VpcId: vpc-00000000000
  PublicSubnets: ["subnet-11111111111111", "subnet-222222222222222"]
  PrivateSubnets: ["subnet-333333333333", "subnet-44444444444444"]
  Data:
    InternetAccessible: false
    NodesSubnets: ["subnet-333333333333", "subnet-44444444444444"]
  Frontend:
    LoadBalancersSubnets: ["subnet-11111111111111", "subnet-222222222222222"]
    #SslCertArn: !SSM ${/orbit-f/orbit/resources::SslCertArn}
Images:
  JupyterUser:
    Repository: 0000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/jupyter-user
    Version: latest
  OrbitController:
    Repository: 00000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/orbit-controller
    Version: latest
  UtilityData:
    Repository: 0000000000000.dkr.ecr.eu-central-1.amazonaws.com/orbit-orbit/utility-data
    Version: latest
Teams:
  - Name: sample-admin
    Policies:
      - None
    GrantSudo: true
    Fargate: true
    K8Admin: true
    JupyterhubInboundRanges:
      - 0.0.0.0/0
    EfsLifeCycle: AFTER_7_DAYS
    Plugins: !include common_plugins.yaml
    AuthenticationGroups:
      - sample-admin
- None
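One way to confirm that a subnet is isolated in the sense used here (no route to 0.0.0.0/0) is to inspect the routes of its route table. The sketch below assumes the routes are given as dictionaries shaped like the `Routes` entries returned by the EC2 `describe_route_tables` API; `has_default_route` is a hypothetical helper and the sample routes are invented.

```python
from typing import Any, Dict, Iterable


def has_default_route(routes: Iterable[Dict[str, Any]]) -> bool:
    """Return True if any active route targets 0.0.0.0/0."""
    return any(
        route.get("DestinationCidrBlock") == "0.0.0.0/0"
        and route.get("State", "active") == "active"
        for route in routes
    )


# Invented examples: a route table with only the local VPC route,
# and one that also routes 0.0.0.0/0 through a NAT gateway.
isolated_routes = [
    {"DestinationCidrBlock": "10.19.0.0/16", "GatewayId": "local", "State": "active"},
]
private_routes = isolated_routes + [
    {"DestinationCidrBlock": "0.0.0.0/0", "NatGatewayId": "nat-00000000", "State": "active"},
]

print(has_default_route(isolated_routes))  # False
print(has_default_route(private_routes))   # True
```

A subnet whose route table yields `False` here matches the "isolated" case that `InternetAccessible: false` is meant to describe.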
I checked the pods again and all of them now seem to be running (see below).
However, when I run the stack again (orbit deploy env), it now fails with a different error (see below).
[2022-07-21 15:46:53,380][k8s.py : 37] Endpoint Subsets: [{'addresses': [{'hostname': None, 'ip': '10.10.10.112', 'node_name': 'ip-10-19-15-31.eu-central-1.compute.internal', 'target_ref': {'api_version': None, 'field_path': None, 'kind': 'Pod', 'name': 'imagereplication-pod-webhook-58df9d9cb4-xtb9s', 'namespace': 'orbit-system', 'resource_version': '9885', 'uid': '3a4560f2-fe0b-47b3-ab3f-5459a9c5f5b8'}}], 'not_ready_addresses': None, 'ports': [{'name': 'https', 'port': 443, 'protocol': 'TCP'}]}]
[2022-07-21 15:46:53,380][kubectl.py :578] Service: imagereplication-pod-webhook Namespace: orbit-system Hostname: None IP: 10.19.13.112
[2022-07-21 15:46:53,380][sh.py : 28] + kubectl rollout restart daemonsets -n orbit-system-ssm-daemons ssm-agent-installer --context AWSCodeBuild-a372ab74-b1d3-478b-871d-546f9f18a0ec@orbit-orbit.eu-central-1.eksctl.io
[2022-07-21 15:46:53,486][sh.py : 30] Error from server (NotFound): namespaces "orbit-system-ssm-daemons" not found
[2022-07-21 15:46:53,487][sh.py : 30]
Traceback (most recent call last):
File "/root/.venv/bin/codeseeder", line 8, in <module>
sys.exit(main())
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
cli()
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
return self.main(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 782, in main
rv = self.invoke(ctx)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1259, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 610, in invoke
return callback(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 153, in execute
func(*fn_args["args"], **fn_args["kwargs"])
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 398, in deploy_env
deploy_env(env_name=env_name, manifest_dir=manifest_dir)
File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/codeseeder.py", line 229, in wrapper
return func(*args, **kwargs)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/deploy.py", line 385, in deploy_env
kubectl.deploy_env(context=context)
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/remote_files/kubectl.py", line 649, in deploy_env
"kubectl rollout restart daemonsets -n orbit-system-ssm-daemons "
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/sh.py", line 29, in run
for line in _run_iterating(cmd=cmd, cwd=cwd):
File "/root/.venv/lib/python3.7/site-packages/aws_orbit/sh.py", line 23, in _run_iterating
raise FailedShellCommand(f"Exit code: {p.returncode}")
aws_orbit.exceptions.FailedShellCommand: Exit code: 1
[Container] 2022/07/21 15:46:53 Command did not exit successfully codeseeder execute --args-file fn_args.json --debug exit status 1
[Container] 2022/07/21 15:46:53 Phase complete: BUILD State: FAILED
[Container] 2022/07/21 15:46:53 Phase context status code: COMMAND_EXECUTION_ERROR Message: Error while executing command: codeseeder execute --args-file fn_args.json --debug. Reason: exit status 1
[Container] 2022/07/21 15:46:53 Entering phase POST_BUILD
[Container] 2022/07/21 15:46:53 Running command . ~/.venv/bin/activate
[Container] 2022/07/21 15:46:53 Running command cd ${CODEBUILD_SRC_DIR}/bundle
[Container] 2022/07/21 15:46:53 Phase complete: POST_BUILD State: SUCCEEDED
[Container] 2022/07/21 15:46:53 Phase context status code: Message:
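The log shows `kubectl rollout restart` failing because the namespace `orbit-system-ssm-daemons` does not exist. A defensive version of that step could check for the namespace first. The sketch below is not the orbit CLI's code; `restart_daemonset_if_namespace_exists` is a hypothetical helper, and the `run` parameter exists only so the logic can be exercised without a cluster.

```python
import subprocess
from typing import Any, Callable


def restart_daemonset_if_namespace_exists(
    namespace: str,
    daemonset: str,
    run: Callable[..., Any] = subprocess.run,
) -> bool:
    """Restart a DaemonSet only if its namespace exists; return True if restarted."""
    # `kubectl get namespace NAME` exits non-zero when the namespace is not found.
    probe = run(["kubectl", "get", "namespace", namespace], capture_output=True, text=True)
    if probe.returncode != 0:
        return False
    run(
        ["kubectl", "rollout", "restart", "daemonsets", "-n", namespace, daemonset],
        check=True,
    )
    return True


if __name__ == "__main__":
    # Requires kubectl on PATH and a configured cluster context.
    done = restart_daemonset_if_namespace_exists("orbit-system-ssm-daemons", "ssm-agent-installer")
    print("restarted" if done else "namespace missing, skipped")
```

Skipping the restart only hides the symptom; with `InstallSsmAgent: true` set in the manifest, the namespace should presumably be created and the guard would pass.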
I agree that if our VPC and subnets were set up differently, the installation would have gone smoothly.
I had to download the cert-manager, cert-manager-cainjector and cert-manager-webhook images, push them to our private ECR, and update the deployment files to point to our ECR instead of the public ECR, to avoid creation failures for those pods.
The installation got past the previous error after applying your changes. However, this is the new error we get (error log below the screenshot). I included the screenshot to show what is currently running in the cluster.
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,076][sh.py : 30] deployment.apps/cluster-autoscaler configured
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,076][sh.py : 30] Error from server: error when creating ".orbit.out/orbit/kubectl/kube-system/00-observability.yaml": admission webhook "0500-amazon-eks-fargate-configmaps-admission.amazonaws.com" denied the request: Invalid value at auto_create_group On
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] [2022-07-21 20:06:03,079][sh.py : 30]
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] Traceback (most recent call last):
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/bin/codeseeder", line 8, in <module>
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] sys.exit(main())
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/lib/python3.7/site-packages/aws_codeseeder/main.py", line 161, in main
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] cli()
[2022-07-21 22:06:12,889][_remote.py : 30] [CODEBUILD] File "/root/.venv/lib/python3.7/site-packages/click/core.py", line 829, in __call__
Thanks @chamcca