Describe the bug The pods in the istio-system namespace report ru

awslabs,aws-orbit-workbench

Comments (28)

chamcca commented on May 25, 2024

that particular error message isn't in the pod logs, it is an error being returned by the kubectl command attempting to read the logs. the istio-pilot pod has two containers that can contain logs named "discovery" and "istio-proxy". you have to specify the container name when requesting the logs:

kubectl logs istio-pilot-557c-68674-4vqcv -n istio-system -c discovery

kubectl logs istio-pilot-557c-68674-4vqcv -n istio-system -c istio-proxy

from aws-orbit-workbench.

bozethe commented on May 25, 2024

Thanks, I see it's working fine. I have made a lot of changes so far to get orbit to work in our environment. The aws-ingress-controller 1.1.5 that orbit deploys wasn't working for us: it failed to discover the subnets in which to launch the ALB even though the tags were correct. I ended up deploying aws-ingress-controller 2.4.2 and it seems to work fine. I also had to recreate the ingress.

Below is the screenshot of our ingress.

Below is what I get when trying to access the orbit login page.

And I checked orbit landing page and below is the error (KeyError: 'HTTP_X_AMZN_OIDC_DATA').

nding-page-service.orbit-system.svc.cluster.local:80/orbit/*", "X-Istio-Attributes": "CjIKGGRlc3RpbmF0aW9uLnNlcnZpY2UubmFtZRIWEhRsYW5kaW5nLXBhZ2Utc2VydmljZQovCh1kZXN0aW5hdGlvbi5zZXJ2aWNlLm5hbWVzcGFjZRIOEgxvcmJpdC1zeXN0ZW0KTwoKc291cmNlLnVpZBJBEj9rdWJlcm5ldGVzOi8vaXN0aW8taW5ncmVzc2dhdGV3YXktNzc3YjU0ZDk2OC1zanp0ci5pc3Rpby1zeXN0ZW0KTwoXZGVzdGluYXRpb24uc2VydmljZS51aWQSNBIyaXN0aW86Ly9vcmJpdC1zeXN0ZW0vc2VydmljZXMvbGFuZGluZy1wYWdlLXNlcnZpY2UKUQoYZGVzdGluYXRpb24uc2VydmljZS5ob3N0EjUSM2xhbmRpbmctcGFnZS1zZXJ2aWNlLm9yYml0LXN5c3RlbS5zdmMuY2x1c3Rlci5sb2NhbA==", "X-B3-Traceid": "ebc065a24b50007fca4ba1ba30d3ce28", "X-B3-Spanid": "ca4ba1ba30d3ce28", "X-B3-Sampled": "1", "X-Envoy-Original-Path": "/orbit/login", "Content-Length": "0"}
[2022-08-02 06:42:41 +0000] [8] [ERROR] Error handling request /login
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 136, in handle
    self.handle_request(listener, req, client, addr)
  File "/usr/local/lib/python3.8/site-packages/gunicorn/workers/sync.py", line 179, in handle_request
    respiter = self.wsgi(environ, resp.start_response)
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 2464, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 2450, in wsgi_app
    response = self.handle_exception(e)
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1867, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/site-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/site-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/var/orbit-controller/orbit_controller/server.py", line 44, in login_request
    return login(logger=app.logger, app=app)
  File "/var/orbit-controller/orbit_controller/home.py", line 47, in login
    email, username, groups = _get_user_info_from_jwt(logger)
  File "/var/orbit-controller/orbit_controller/home.py", line 119, in _get_user_info_from_jwt
    encoded_jwt = request.headers["x-amzn-oidc-data"]
  File "/usr/local/lib/python3.8/site-packages/werkzeug/datastructures.py", line 1463, in __getitem__
    return _unicodify_header_value(self.environ["HTTP_" + key])
KeyError: 'HTTP_X_AMZN_OIDC_DATA'
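
The KeyError at the bottom of the traceback comes from indexing the request headers directly for x-amzn-oidc-data, a header the ALB adds only after it has authenticated the request. As a minimal sketch (not the Orbit implementation; `read_oidc_claims` is a hypothetical helper, and the JWT signature is deliberately not verified), a lookup that tolerates the missing header could look like this:

```python
import base64
import json
from typing import Any, Dict, Mapping, Optional


def read_oidc_claims(headers: Mapping[str, str]) -> Optional[Dict[str, Any]]:
    """Return the decoded JWT payload from x-amzn-oidc-data, or None if absent.

    Hypothetical helper for illustration only. It does NOT verify the
    signature, so use it just to inspect which claims reach the service.
    """
    # .get() avoids the KeyError seen in the traceback when the header is missing.
    # A plain dict is case-sensitive; Flask's request.headers is not.
    token = headers.get("x-amzn-oidc-data")
    if token is None:
        return None

    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with header.payload.signature segments")

    # JWT segments are base64url encoded; restore any stripped padding first.
    payload = parts[1]
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))
```

A `None` result suggests the request reached the service without passing through the ALB's authentication action, which is worth checking on a recreated ingress.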

If I put the alb url without "/orbit/login", below is what we get

bozethe commented on May 25, 2024

I am attaching our orbit ingress yaml here, just to check that it is not missing anything.
ingress.yaml.txt

bozethe commented on May 25, 2024

I managed to fix the problem I had.
I just need help with the issue below (see the screenshot).

chamcca commented on May 25, 2024

what are you using for your Identity Provider? if you haven't integrated with your own IdP and are using Cognito then you need to go into Cognito in the AWS Console, find the UserPool for your deployment (it will be named orbit-[ENV_NAME]-user-pool) and then make sure there is an orbit user. You will also need to create Groups in Cognito that match the Teams in your manifest. Groups should be named [ENV_NAME]-[TEAM_NAME]. Once Groups are created add the orbit User to the Groups
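
The naming rules above can be sketched as a small check. This is illustrative only: the function names are hypothetical, and the patterns are simply the ones quoted in this comment (orbit-[ENV_NAME]-user-pool and [ENV_NAME]-[TEAM_NAME]).

```python
from typing import Iterable, List


def user_pool_name(env_name: str) -> str:
    """User pool name as described in the thread: orbit-[ENV_NAME]-user-pool."""
    return f"orbit-{env_name}-user-pool"


def cognito_group_name(env_name: str, team_name: str) -> str:
    """Group name as described in the thread: [ENV_NAME]-[TEAM_NAME]."""
    return f"{env_name}-{team_name}"


def missing_groups(
    env_name: str, team_names: Iterable[str], existing_groups: Iterable[str]
) -> List[str]:
    """List the expected Cognito groups that do not exist yet."""
    existing = set(existing_groups)
    expected = [cognito_group_name(env_name, team) for team in team_names]
    return [group for group in expected if group not in existing]
```

For an environment named orbit with a team named sample-admin, the expected group is orbit-sample-admin.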

chamcca commented on May 25, 2024

also, you will need to logout and log back in to trigger creation of the Team Space. you can force this by visiting the /orbit/logout URL

bozethe commented on May 25, 2024

Below is what I see after adding a group into cognito.

Clicking on the kubeflow link opens the screen below.

Is this the only thing I am supposed to see?

chamcca commented on May 25, 2024

that's exactly what you should see. from there you can click on "Notebook Servers" on the left to launch or connect to previously launched notebooks.

bozethe commented on May 25, 2024

Connecting to the notebook server fails with "upstream connect error or disconnect/reset before headers. reset reason: connection failure", and when I check the logs on k8s I see it fails with "Readiness probe failed: HTTP probe failed with statuscode: 503". Please see the screenshots below.

chamcca commented on May 25, 2024

we've not encountered a 503 on the status check before. we have seen complex network setups cause timeouts, but not a 503 returned by the jupyter server. did the testing-0 notebook/pod ever start? there are some advanced things you can do to enable --debug mode on the jupyter server, which may give additional info in the server logs.

some assumptions about your setup i'm making based on the info in the screenshots: the Team is named tieho and the User is boqo. these are important. if this is correct, then in addition to the User specific namespace called tieho-boqo there should also be a namespace called tieho.

some info, we use a CRD (CustomResourceDefinition) and custom Operator to make Orbit specific changes to the jupyter Pods when they start up. the CRD is called a PodSetting. each Team gets a PodSetting that can be configured to make changes specific to the Team. to enable --debug logging on new Pods, we can modify the PodSetting for the Team and add a --debug command line parameter. all new notebooks will be started with the new parameter and should have additional logging.

you can try to enable debug logging, i'm not sure what you'll find.

run kubectl edit podsettings -n tieho orbit-pod-interactive-notebook (this will start an edit session in vi for the PodSetting)
locate the following line in the command section:
- /usr/local/bin/start.sh jupyter lab --ServerApp.notebook_dir=/home/jovyan --ServerApp.ip=0.0.0.0
press <esc> then i to enter insert mode
add the --debug flag to the command:
- /usr/local/bin/start.sh jupyter lab --debug --ServerApp.notebook_dir=/home/jovyan --ServerApp.ip=0.0.0.0
press <esc> then :wq<enter> to write (save) and quit the edit session

then try creating a new Notebook through the UI. once it starts, you can get the logs for it with kubectl again.

Any Notebooks you create for users on that Team will now have debug logs. you can undo this by repeating the procedure and removing the --debug flag.
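
The edit described above amounts to inserting one flag into the start command. As a rough sketch (the helper names are hypothetical and this only manipulates the command string; it does not talk to the cluster), the change and its reversal look like this:

```python
def add_debug_flag(command: str, subcommand: str = "jupyter lab") -> str:
    """Insert --debug right after the jupyter subcommand, if not already present."""
    if "--debug" in command.split():
        return command  # already enabled, leave untouched
    head, found, tail = command.partition(subcommand)
    if not found:
        raise ValueError(f"{subcommand!r} not found in command")
    return f"{head}{found} --debug{tail}"


def remove_debug_flag(command: str) -> str:
    """Drop every standalone --debug token from the command."""
    return " ".join(part for part in command.split() if part != "--debug")


original = (
    "/usr/local/bin/start.sh jupyter lab "
    "--ServerApp.notebook_dir=/home/jovyan --ServerApp.ip=0.0.0.0"
)
with_debug = add_debug_flag(original)
```

Only Pods started after the PodSetting change pick up the new command, so an existing notebook has to be recreated to see the extra logging.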

bozethe commented on May 25, 2024

Hi @chamcca, sorry for the late reply. I was on leave.

I checked the namespaces I have and below is what I see.

Below is a snippet from our manifest.yaml file. The Team name is sample-admin, not tieho.

I also checked if there are any resources in sample-admin or sample-admin-orbit namespaces but nothing.
❯ kubectl get all -n sample-admin-orbit
No resources found in sample-admin-orbit namespace.

I checked the podsettings we have in our cluster and below is what I see.

Are we supposed to have pods running in these other namespaces (sample-admin or sample-admin-orbit)?

If possible, could you please send a screenshot of what your EKS cluster looks like?

I will try to enable debugging and revert.

bozethe commented on May 25, 2024

I have enabled debug mode by running "kubectl edit podsettings -n sample-admin orbit-pod-interactive-notebook" and below is how it looks now.

However the logs are still showing the same.

I deleted all the "Notebook Servers" and created a new one called debug. see below.

I have also attached our manifest file. Could you please check if it's fine?
our-manifest.yaml.txt

chamcca commented on May 25, 2024

it appears as though you created another User (tieho-boqo) through the kubeflow onboarding mechanism. this User is not a member of a TeamSpace and is not managed by Orbit. you should be using the sample-admin-orbit User to start your Notebooks. Notebooks under the tieho-boqo User/namespace will not work.

revisit the /orbit/login page. you should be shown the "Welcome orbit!" message for your user (orbit) and the teams that you belong to (sample-admin). click on the Kubeflow icon next to the Team on this page to be taken to the Kubeflow UI for that Team/User. from there, access the Notebook Servers UI and try to start a Notebook. this will start in the sample-admin-orbit namespace, which is managed by Orbit and will apply PodSettings to correctly configure the Notebook.

bozethe commented on May 25, 2024

I logged out and logged in again.

Logged in successfully.

clicked on kubeflow logo and now creating notebook server

This is our cognito user pool.

group

Still getting
upstream connect error or disconnect/reset before headers. reset reason: connection failure

I believe there is something I am not understanding here.
Would you mind having a 10-minute zoom/teams call?

chamcca commented on May 25, 2024

you're still creating Notebooks in the wrong namespace (tieho-boqo). after logging in, before clicking on the Notebook Servers link, try clicking the drop down box in the upper left of the primary kubeflow UI, where it says "tieho-boqo (owner)". if your teams are set up correctly you should be able to switch to the sample-admin-orbit namespace.

Notebooks created in the tieho-boqo namespace will continue to fail. an alternative is to completely delete this namespace and all resources so that you don't get pushed to it by the UI. that can be done from kubectl. kubeflow creates a CRD called a "Profile". if you list the profiles using:

kubectl get profiles

you likely have 3: anonymous, sample-admin-orbit, tieho-boqo. deleting the tieho-boqo can be done with:

kubectl delete profiles tieho-boqo

this should leave your cognito orbit user assigned to a single profile, sample-admin-orbit.
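
To decide which profiles are safe to delete, the output of kubectl get profiles can be compared against the names Orbit is expected to manage. The sketch below is an assumption-laden illustration: the helper is hypothetical, and the [TEAM_NAME]-[USER_NAME] pattern is inferred from the names in this thread (sample-admin-orbit for team sample-admin and user orbit), not taken from Orbit's code.

```python
from typing import Iterable, List, Sequence


def unmanaged_profiles(
    profiles: Iterable[str],
    team_names: Iterable[str],
    user_name: str,
    keep: Sequence[str] = ("anonymous",),
) -> List[str]:
    """Return profile names that match no expected [TEAM_NAME]-[USER_NAME] profile.

    Names listed in `keep` are never reported; "anonymous" is kept by default
    because the comment above only suggests deleting tieho-boqo.
    """
    expected = {f"{team}-{user_name}" for team in team_names}
    return sorted(p for p in profiles if p not in expected and p not in keep)


strays = unmanaged_profiles(
    ["anonymous", "sample-admin-orbit", "tieho-boqo"],
    team_names=["sample-admin"],
    user_name="orbit",
)
```

Each name in `strays` is a candidate for kubectl delete profiles, after confirming nothing else depends on it.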

bozethe commented on May 25, 2024

I tried clicking the drop down arrow under my name "tieho-boqo owner" but there was nothing.

My cognito user group is "orbit-sample-admin". Is that ok or is it supposed to be "sample-admin-orbit"?

I deleted tieho-boqo profile and the pods that were running in the tieho-boqo namespace are gone.

below is what i get when trying to login now.

I checked the namespaces and noticed that the sample-admin-orbit and sample-admin namespaces still existed after deleting the profile. I deleted them, then logged out and logged in on the orbit home page, and the above is what I get. I tried several times and I see it created the namespace "sample-admin-orbit" again, but it still shows "Timeout while waiting for namespace creation. Something went wrong!! Consult the Orbit namespace watcher logs."

I see where the other namespace came from: accessing the orbit url without /orbit/login.

❯ kubectl get teamspace -n sample-admin -o yaml
apiVersion: v1
items:
- apiVersion: orbit.aws/v1
  kind: TeamSpace
  metadata:
    annotations:
      kubectl.kubernetes.io/last-applied-configuration: |
        {"apiVersion":"orbit.aws/v1","kind":"TeamSpace","metadata":{"annotations":{},"name":"sample-admin","namespace":"sample-admin"},"spec":{"env":"orbit","space":"team","team":"sample-admin"}}
    creationTimestamp: "2022-08-10T17:16:54Z"
    generation: 1
    name: sample-admin
    namespace: sample-admin
    resourceVersion: "12063392"
    uid: 60d7124f-99ab-4d58-848d-573001ac89ac
  spec:
    env: orbit
    space: team
    team: sample-admin
kind: List
metadata:
  resourceVersion: ""

chamcca commented on May 25, 2024

creating namespaces with the kubeflow Namespace UI is not supported, these namespaces will not function. team namespaces are created when orbit deploy teams is run. user namespaces are created when a user first visits /orbit/login immediately after logging in. unfortunately we don't have a way of disabling creation with the Namespace UI.

deleting the sample-admin namespace broke some things. this namespace belongs to the sample-admin Team and is created when the Orbit Teams are deployed. we can clean this up, though.

visit the /orbit/logout url. browsers sometimes cache this page, so you may need to refresh it to force the logout. if it routes you to /orbit/login and/or presents a login screen, do not login again yet.
we're going to clean up the user and team namespaces and profiles. some of these commands may fail if the resources don't exist, that's ok
```
kubectl delete namespace sample-admin-orbit
kubectl delete namespace sample-admin
kubectl delete profile sample-admin-orbit
```
you should also delete any namespaces and profiles that were created as a result of using the kubeflow Namespace UI
now we need to repair the Team namespace and teamspace resources:
```
orbit deploy teams -f [path_to_your_manifest]
```
this will recreate the kubernetes resources required for your sample-admin team
now we login to the UI again and force creation of the user namespace and userspace. visit the /orbit/login url, if logout was successful you should be presented with username and password prompt. you may need to force a refresh. we need to ensure that the username/password prompt is presented again and the user is authenticated again. this is what triggers creation of the user's namespace
once logged in, you should be presented with the sample-admin Team on the /orbit/login page, click on the Kubeflow icon next to this Team
in the kubeflow UI, verify in the drop down in the upper left that you are in the sample-admin-orbit namespace
click Notebook Servers and try to create a notebook

I have just broken and then followed this procedure on my development cluster to confirm it.
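
The command-line part of the procedure above can be sketched as a small script. This is only an illustration under stated assumptions: the helper names are hypothetical, the user namespace is assumed to follow the [TEAM_NAME]-[USER_NAME] pattern seen in this thread, and the logout and login steps in the browser still have to be done by hand before and after running it.

```python
import subprocess
from typing import Callable, List


def cleanup_plan(team: str, user: str, manifest_path: str) -> List[List[str]]:
    """Build the commands from the procedure, in the order given in the thread."""
    user_space = f"{team}-{user}"  # assumed naming, e.g. sample-admin-orbit
    return [
        ["kubectl", "delete", "namespace", user_space],
        ["kubectl", "delete", "namespace", team],
        ["kubectl", "delete", "profile", user_space],
        ["orbit", "deploy", "teams", "-f", manifest_path],
    ]


def run_plan(plan: List[List[str]], runner: Callable = subprocess.run) -> None:
    """Run each command; kubectl deletes may fail if the resource is already gone."""
    for argv in plan:
        must_succeed = argv[0] != "kubectl"  # only the redeploy has to succeed
        runner(argv, check=must_succeed)


plan = cleanup_plan("sample-admin", "orbit", "manifest.yaml")
```

Calling `run_plan(plan)` executes the commands against whatever cluster kubectl currently points at, so it is worth printing `plan` and reviewing it first.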

bozethe commented on May 25, 2024

I have tried the steps you provided above but the problem still exists. I also deleted the groups, created a new cognito user pool and tested again, but the problem still persists.

This is our cognito user pool name

User group. I followed the naming you suggested

My new teams config
Teams:
- Name: oryx
  Policies:
  - None
  GrantSudo: true
  Fargate: true
  K8Admin: true
  JupyterhubInboundRanges:
  - 0.0.0.0/0
  EfsLifeCycle: AFTER_7_DAYS
  Plugins: !include common_plugins.yaml
  AuthenticationGroups:
  - orbit-oryx

current namespaces in eks

Profiles

landing page logs show that the groups are not being returned

Still getting the same page.

chamcca commented on May 25, 2024

were you able to resolve the missing teams/groups?

bozethe commented on May 25, 2024

no, we still have the same problem. we are thinking of ways of hard coding the teams/groups for testing purposes. I am not sure if changing the code in home.py line 45 (the login function) will help.

Do you have any other suggestions for us to make this work?

bozethe commented on May 25, 2024

I managed to solve the problem. I deleted the orbit-system namespace and redeployed the env and teams stacks.

bozethe commented on May 25, 2024

I am now getting error

message: 'Internal error occurred: failed calling webhook "imagereplication-pod-webhook.orbit-system.svc":
Post "https://imagereplication-pod-webhook.orbit-system.svc:443/update-pod-images?timeout=30s":
service "imagereplication-pod-webhook" not found'

chamcca commented on May 25, 2024

when you deleted the orbit-system namespace you removed the webhook declaration and pods. the image-replicator webhook updates Pod images to point to ECR and initiates a replication of public images into ECR for environments with isolated subnets. in theory, redeploying the env should have recreated them, but i've never attempted that.

bozethe commented on May 25, 2024

Thank you for your continued help. I will remove the teams and env stacks, then redeploy them and see if everything is ok.

Is there a way to set up a proxy for image pulling for pods?

bozethe commented on May 25, 2024

I managed to fix the problem of image-replicator-webhook.
Now we get the error below when we create a notebook server.

pods don't finish creating

chamcca commented on May 25, 2024

this is usually a networking issue between EFS and the Nodes or Pods in the cluster. did you deploy EFS with the orbit deploy foundation or did you deploy your own EFS? can you confirm which Subnets the EFS shares are in and that they are the same as the Nodes/Pods. also check the Security Group attached to the EFS share and the Inbound rules on it. i know that you are operating in isolated subnets that were deployed outside the orbit tooling, there may just be something we need to get "hooked" up with the security groups.
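
One quick way to test the Security Group and routing part of this is a plain TCP connection to the EFS mount target on the NFS port (2049), run from a Node or a Pod. The sketch below is a generic reachability probe, not an Orbit tool; `nfs_port_reachable` is a hypothetical name and the host in the usage line is a placeholder to replace with your own mount target address.

```python
import socket


def nfs_port_reachable(host: str, port: int = 2049, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # create_connection resolves the name and completes the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS resolution failures.
        return False


if __name__ == "__main__":
    # Placeholder address: substitute the DNS name or IP of your EFS mount target.
    print(nfs_port_reachable("efs-mount-target.example.internal"))
```

A successful connection only rules out Security Group and routing problems; the mount itself can still fail for other reasons, such as the CSI driver.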

bozethe commented on May 25, 2024

I checked efs SG and it allows everything. EFS is in the same subnet as the nodes. I also logged into node where the pod is running and I was able to mount efs.

It seems like the problem is the efs-csi-controller, see below.

bozethe commented on May 25, 2024

I managed to fix the problem, I was hitting this bug (kubernetes-sigs/aws-efs-csi-driver#214)

[BUG] - error: a container name must be specified for pod istio-pilot about aws-orbit-workbench HOT 28 CLOSED
