awslabs / kubeflow-manifests

KubeFlow on AWS

Home Page: https://awslabs.github.io/kubeflow-manifests/

License: Apache License 2.0

YAML 90.00% Shell 0.02% Python 5.50% JSON 0.16% Makefile 0.13% Go 0.30% Dockerfile 0.16% SCSS 0.08% HTML 0.28% Jupyter Notebook 0.07% CSS 0.09% JavaScript 0.03% Smarty 0.81% HCL 2.39%
kubeflow eks kubernetes aws data-science mlops

kubeflow-manifests's Introduction


Kubeflow on AWS

Overview

Kubeflow on AWS is an open source distribution of Kubeflow that provides its own Kubeflow manifests to support integrations with various AWS managed services. Use Kubeflow on AWS to streamline data science tasks and build highly reliable, secure, and scalable machine learning systems with reduced operational overheads.

Getting Started

There are a number of deployment options for installing Kubeflow with AWS service integrations. To get started with deploying Kubeflow on Amazon Elastic Kubernetes Service (Amazon EKS), see the deployment section on the Kubeflow on AWS website.

Help & Feedback

For help, please consider the following venues (in order):

Contributing

We welcome community contributions and pull requests.

See our contribution guide for more information on how to report issues, set up a development environment, and submit code.

We adhere to the Amazon Open Source Code of Conduct.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

kubeflow-manifests's People

Contributors

akartsky, alexandrebrown, amazon-auto, amitkalawat, ananth102, blorby, briannaroskind, dependabot[bot], elanv, ghaering, goswamig, jrhode2, jsitu777, judyheflin, kjvjobin, krhoyt, mbaijal, monkesh, npepin-hub, prli, rd-pong, rddefauw, rrrkharse, ryansteakley, sagi-shimoni, surajkota, techwithshadab, theofpa


kubeflow-manifests's Issues

Remove katib-mysql pod and mysql pvc from katib with RDS component

Describe the bug
See #66 (comment)
When using RDS with Katib, keeping the katib-mysql-secrets secret, the katib-mysql deployment, and the katib-mysql PVC (10Gi) is confusing for customers and also costs money.

Steps To Reproduce
kustomize build apps/katib/upstream/installs/katib-external-db-with-kubeflow -o temp

Expected behavior
As stated above, the unused resources should be removed (one possible approach is sketched at the end of this issue).

Environment

  • RDS integration with Katib in Kubeflow
  • AWS service targeted (S3, RDS, etc.) RDS
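One way the removal could be done, sketched below as a kustomize strategic-merge delete patch (the Deployment name is taken from this issue; the overlay wiring is an assumption, not the distribution's actual layout):

# katib-mysql-delete-patch.yaml, referenced via patchesStrategicMerge in the overlay
apiVersion: apps/v1
kind: Deployment
metadata:
  name: katib-mysql
  namespace: kubeflow
$patch: delete

Similar delete patches could remove the katib-mysql PVC and the katib-mysql-secrets Secret.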

Facilitate configuration of RDS & S3 credentials access for Notebooks & Pipelines

Is your feature request related to a problem? Please describe.
It might not be obvious which Kubernetes secret name to use when creating a PodDefault so that the RDS & S3 credentials set up by the credentials manager can be used. It also might not be obvious to new Kubernetes users that they need to create something called a PodDefault in order to access these credentials from within a notebook and in the notebook creation UI.

Use Cases

  • Specify AWS credentials when using boto3
    • Solution: have a PodDefault make environment variables available
  • Use ml-metadata (requires database credentials in the notebook)
    • Solution: have a PodDefault make environment variables available

Describe the solution you'd like
I propose we add this as part of the Kubeflow install.
For instance, if the user installs Kubeflow via the S3 setup, their PodDefault could look like:

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: add-aws-secret
  namespace: <YOUR_NAMESPACE_HERE>
  labels:
    add-aws-secret : "true"
spec:
  desc: add-aws-secret
  selector:
    matchLabels:
      add-aws-secret: "true"
  env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: mlpipeline-minio-artifact
        key: accesskey
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: mlpipeline-minio-artifact
        key: secretkey

If they install via the RDS & S3 setup, then we would also add the RDS values (e.g. DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PWD from mysql-secret) so that the user can connect to the ML Metadata database in a notebook using environment variable values; see the sketch below.
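A sketch of what that could look like (the PodDefault name is hypothetical; the secret and ConfigMap key names are taken from the pipeline manifests quoted further down this page, and the sketch assumes those objects, or copies of them, exist in the user's namespace):

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: add-rds-secret                 # hypothetical name
  namespace: <YOUR_NAMESPACE_HERE>
  labels:
    add-rds-secret: "true"
spec:
  desc: add-rds-secret
  selector:
    matchLabels:
      add-rds-secret: "true"
  env:
  - name: DB_USER
    valueFrom:
      secretKeyRef:
        name: mysql-secret             # assumes this secret is available in the user namespace
        key: username
  - name: DB_PWD
    valueFrom:
      secretKeyRef:
        name: mysql-secret
        key: password
  - name: DB_HOST
    valueFrom:
      configMapKeyRef:
        name: pipeline-install-config  # assumes this ConfigMap is available in the user namespace
        key: dbHost
  - name: DB_PORT
    valueFrom:
      configMapKeyRef:
        name: pipeline-install-config
        key: dbPort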

This would also open the door to greater notebook integration, as we could modify spawner_ui_config.yaml and add the PodDefault created during setup as the default selected value.

  configurations:
    # List of labels to be selected, these are the labels from PodDefaults
    # value:
    #   - add-gcp-secret
    #   - default-editor
    value:
       - add-aws-secret
    readOnly: false

image

Kubeflow logout stuck in infinite loop

Describe the bug

I installed Kubeflow 1.4 using Cognito. I can access the Kubeflow UI and sign in properly. When I click the logout button, it gets stuck in an infinite loop. As a result, the user is not logged out.

Steps To Reproduce
1- Install Kubeflow from this guide: https://github.com/awslabs/kubeflow-manifests/tree/v1.4-branch/distributions/aws/examples/cognito
2- Sign in via Kubeflow UI
3- Logout via Kubeflow UI

Expected behavior
I expect the user to be logged out and redirected to the sign-in page.

Additional context
This bug is related to the feature: kubeflow/kubeflow#5174

Notebook server creation fails if PV is not used for workspace

Description of issue

Notebook creation succeeds with the scipy image and the default Kubeflow PyTorch images but fails with DLC-based images.

Logs

Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/skota-2-0. Please use `kubectl.kubernetes.io/default-container` instead
/opt/conda/lib/python3.8/site-packages/jupyter_server_mathjax/app.py:40: FutureWarning: The alias `_()` will be deprecated. Use `_i18n()` instead.
  help=_("""The MathJax.js configuration file that is to be used."""),
[I 2021-11-30 20:38:10.802 ServerApp] jupyter_server_mathjax | extension was successfully linked.
[W 2021-11-30 20:38:10.806 LabApp] 'token' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-11-30 20:38:10.806 LabApp] 'password' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-11-30 20:38:10.806 LabApp] 'allow_origin' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-11-30 20:38:10.806 LabApp] 'base_url' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[I 2021-11-30 20:38:10.813 ServerApp] jupyterlab | extension was successfully linked.
[I 2021-11-30 20:38:10.813 ServerApp] jupyterlab_git | extension was successfully linked.
[W 2021-11-30 20:38:10.822 ServerApp] [Errno 13] Permission denied: '/home/jovyan/.local'
[I 2021-11-30 20:38:10.822 ServerApp] nbdime | extension was successfully linked.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 535, in get
    value = obj._trait_values[self.name]
KeyError: 'runtime_dir'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/jupyter-lab", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/jupyter_server/extension/application.py", line 567, in launch_instance
    serverapp = cls.initialize_server(argv=args)
  File "/opt/conda/lib/python3.8/site-packages/jupyter_server/extension/application.py", line 537, in initialize_server
    serverapp.initialize(
  File "/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in inner
    return method(app, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 2320, in initialize
    self.init_configurables()
  File "/opt/conda/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 1739, in init_configurables
    connection_dir=self.runtime_dir,
  File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 575, in __get__
    return self.get(obj, cls)
  File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 538, in get
    default = obj.trait_defaults(self.name)
  File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 1578, in trait_defaults
    return self._get_trait_default_generator(names[0])(self)
  File "/opt/conda/lib/python3.8/site-packages/jupyter_core/application.py", line 95, in _runtime_dir_default
    ensure_dir_exists(rd, mode=0o700)
  File "/opt/conda/lib/python3.8/site-packages/jupyter_core/utils/__init__.py", line 11, in ensure_dir_exists
    os.makedirs(path, mode=mode)
  File "/opt/conda/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/opt/conda/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/opt/conda/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/opt/conda/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/home/jovyan/.local'

Steps to reproduce:

Create a notebook using one of the AWS images and check the option "Don't use Persistent Storage for User's home" under workspace volume.

KFP Testing

We should make sure that KFP is working well. Check whether any of the components of KFP (DB, storage, credentials, etc.) can be replaced by AWS services from an integration point of view. We should also point to a standard example from the official docs that has been tested.

Cloudwatch integration with Kubeflow

Describe the bug
The worker node group logging described here can be used for sending pod logs to CloudWatch. However, with the current setup it does not work. For example, the role that we assign to the worker nodes does not have a policy allowing it to describe log groups in CloudWatch.

Steps To Reproduce

  1. Go to the folder kubeflow-manifests/distributions/aws/fluentd-cloud-watch and update the base/params.env values.
  2. Run `kubectl apply -k fluentd-cloud-watch/base`
  3. Check CloudWatch to see whether it has the log group /aws/containerinsights/${AWS_CLUSTER_NAME}/containers
  4. Look at the logs from the fluentd-cloudwatch-xxxxx pods; they report errors like the following:
2021-12-26 06:03:51 +0000 [warn]: #0 [out_cloudwatch_logs_containers] failed to flush the buffer. retry_time=7 next_retry_seconds=2021-12-26 06:05:02 +0000 chunk="5d4065628678b563f9dd8c2962ea5455" error_class=Aws::CloudWatchLogs::Errors::AccessDeniedException error="User: arn:aws:sts::<my account_id>:assumed-role/eksctl-myclustername-nodegroup-li-NodeInstanceRole-PP7XWJ1Q1DWU/i-01083ed00c1fdfdff6 is not authorized to perform: logs:DescribeLogGroups on resource: arn:aws:logs:us-east-1:<my_aws_account_id>:log-group::log-stream:"
  2021-12-26 06:03:51 +0000 [warn]: #0 suppressed same stacktrace
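A possible fix (not validated here) is to make sure the node instance role carries CloudWatch Logs permissions, for example the managed CloudWatchAgentServerPolicy, which includes logs:DescribeLogGroups. With eksctl this could look roughly like the following sketch (cluster and nodegroup names are placeholders):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: <my-cluster>                 # placeholder
  region: us-east-1
nodeGroups:
  - name: <my-nodegroup>             # placeholder
    iam:
      attachPolicyARNs:
        # the default node policies must be listed explicitly when attachPolicyARNs is used
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        # grants logs:DescribeLogGroups, CreateLogGroup/Stream, PutLogEvents, etc.
        - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy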

Expected behavior
No error messages in the fluentd pods.
The CloudWatch log group should be created and pod logs should be pushed there.

Environment

  • Kubernetes version: 1.20
  • Using EKS (yes/no), if so version? yes
  • AWS service targeted (S3, RDS, etc.): cloud watch


Availability w.r.t secrets manager integration

Describe the bug
The secrets mount is modelled as a standalone pod, which is unreliable in case the node goes down or the secrets need to be rotated (although auto-rotation is not supported).

  • The secrets mount pod should be modelled as a Deployment (see the sketch below)
  • Istio sidecar injection should be disabled for it
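A minimal sketch of what that could look like, assuming the existing secrets-mount pod spec is wrapped as-is (the names, placeholder image, and the aws-secrets SecretProviderClass here are illustrative, not the distribution's actual manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-secrets-sync                   # hypothetical name
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aws-secrets-sync
  template:
    metadata:
      labels:
        app: aws-secrets-sync
      annotations:
        sidecar.istio.io/inject: "false"   # disable Istio sidecar injection
    spec:
      containers:
      - name: secrets-sync
        image: busybox:1.36                # placeholder; the real pod only needs to keep the CSI volume mounted
        command: ["sh", "-c", "while true; do sleep 3600; done"]
        volumeMounts:
        - name: secrets-store
          mountPath: /mnt/aws-secrets
          readOnly: true
      volumes:
      - name: secrets-store
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: aws-secrets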

[Doc] Support TensorBoard in Kubeflow Pipelines

"Support TensorBoard in Kubeflow Pipelines" section of document is outdated :
https://www.kubeflow.org/docs/distributions/aws/pipeline/#support-tensorboard-in-kubeflow-pipelines

Outdated Doc :

TensorBoard needs some extra settings on AWS like below:

  1. Create a Kubernetes secret aws-secret in the kubeflow namespace. Follow instructions here.

  2. Create a ConfigMap to store the configuration of TensorBoard on your cluster. Replace <your_region> with your S3 region.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ml-pipeline-ui-viewer-template
data:
  viewer-tensorboard-template.json: |
    {
        "spec": {
            "containers": [
                {
                    "env": [
                        {
                            "name": "AWS_ACCESS_KEY_ID",
                            "valueFrom": {
                                "secretKeyRef": {
                                    "name": "aws-secret",
                                    "key": "AWS_ACCESS_KEY_ID"
                                }
                            }
                        },
                        {
                            "name": "AWS_SECRET_ACCESS_KEY",
                            "valueFrom": {
                                "secretKeyRef": {
                                    "name": "aws-secret",
                                    "key": "AWS_SECRET_ACCESS_KEY"
                                }
                            }
                        },
                        {
                            "name": "AWS_REGION",
                            "value": "<your_region>"
                        }
                    ]
                }
            ]
        }
    }
  3. Update the ml-pipeline-ui deployment to use the ConfigMap by running kubectl edit deployment ml-pipeline-ui -n kubeflow.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ml-pipeline-ui
  namespace: kubeflow
  ...
spec:
  template:
    spec:
      containers:
      - env:
        - name: VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH
          value: /etc/config/viewer-tensorboard-template.json
        ....
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
      .....
      volumes:
      - configMap:
          defaultMode: 420
          name: ml-pipeline-ui-viewer-template
        name: config-volume

ALB Auth doesn't support Cognito - eu-north-1, eu-west-3, sa-east-1, us-west-1

This results in the following error message when provisioning an ALB in the alb-ingress-controller logs:

E1109 00:24:55.867169       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to reconcile listeners due to failed to create listener due to ValidationError: Action type 'authenticate-cognito' must be one of 'redirect,fixed-response,forward,authenticate-oidc'\n\tstatus code: 400, request id: a2a04577-cdd9-4a28-b735-f553883eccf6"  "controller"="alb-ingress-controller" "request"={"Namespace":"istio-system","Name":"istio-ingress"}

We need to find a workaround and/or determine whether we can support these regions for Cognito on Kubeflow; one possible direction is sketched below.
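One possible direction, not validated for this distribution: ALB's generic authenticate-oidc action is available in these regions, so the Ingress could authenticate against the Cognito user pool's standard OIDC endpoints (with the pool living in a region where Cognito is available) instead of using authenticate-cognito. With the AWS Load Balancer Controller v2 the annotations look roughly like this sketch (angle-bracket values are placeholders; the referenced Secret must hold the app client credentials, and key casing differs for older alb-ingress-controller versions):

metadata:
  annotations:
    alb.ingress.kubernetes.io/auth-type: oidc
    alb.ingress.kubernetes.io/auth-idp-oidc: >
      {
        "issuer": "https://cognito-idp.<region>.amazonaws.com/<userPoolId>",
        "authorizationEndpoint": "https://<domain>.auth.<region>.amazoncognito.com/oauth2/authorize",
        "tokenEndpoint": "https://<domain>.auth.<region>.amazoncognito.com/oauth2/token",
        "userInfoEndpoint": "https://<domain>.auth.<region>.amazoncognito.com/oauth2/userInfo",
        "secretName": "oidc-client-secret"
      }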

Error: secret "mlpipeline-minio-artifact"/"mysql-secret" not found in some deployments

Describe the bug
Error: secret "mlpipeline-minio-artifact" not found in the following deployments:

  • ml-pipeline-ui
  • kubeflow-pipelines-profile-controller
  • minio

Error: secret "mysql-secret" not found in the following deployments:

  • metadata-grpc-deployment
  • ml-pipeline
  • cache-server

It seems that these secrets are removed in the file disable-default-secrets.yaml in commit 6fdcc9b but are still referenced by several deployments.

Steps To Reproduce

  • Deploying with distributions/aws/examples/rds-s3 templates.
  • kubectl describe pod -n kubeflow $(kubectl get pod -n kubeflow -l app=ml-pipeline-ui -o jsonpath="{.items[0].metadata.name}")

Expected behavior
Above deployments run successfully.

Environment

  • Kubernetes version: 1.19
  • Using EKS (yes/no), if so version? Yes, EKS version: eks.7
  • AWS service targeted (S3, RDS, etc.) -> s3-rds

Screenshots
None

Additional context
I deploy with this custom kustomization.yaml, which is based on distributions/aws/examples/rds-s3 but modified to include only the services I need:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
# Istio
- ../common/istio-1-9/istio-crds/base
- ../common/istio-1-9/istio-namespace/base
- ../common/istio-1-9/istio-install/base
# OIDC Authservice
- ../common/oidc-authservice/base
# Dex
- ../common/dex/overlays/istio
# Kubeflow namespace
- ../common/kubeflow-namespace/base
# Kubeflow Roles
- ../common/kubeflow-roles/base
# Kubeflow Istio Resources
- ../common/istio-1-9/kubeflow-istio-resources/base

# Central Dashboard
- ../apps/centraldashboard/upstream/overlays/istio
# Profiles + KFAM
- ../apps/profiles/upstream/overlays/kubeflow
# User namespace
- ../common/user-namespace/base

# Configured for AWS RDS and AWS S3

# AWS Secret Manager
- ../distributions/aws/aws-secrets-manager/base
# Kubeflow Pipelines
- ../apps/pipeline/upstream/env/aws

Deploy RDS Doc Commands Do Not Work

Describe the bug
The current doc suggests the following (see screenshot):
image
(https://github.com/awslabs/kubeflow-manifests/tree/v1.3-branch/distributions/aws/examples/rds-s3)
But when trying the commands from the "Deploy Amazon RDS MySQL" section of the doc to set up the RDS instance, the commands do not work.

Steps To Reproduce

  1. Create an EKS cluster
  2. Try the commands listed here
    (screenshots: command output and the resulting error)

Expected behavior
The first command should return the VpcId, the second the list of SubnetIds, and the third the SecurityGroupId.
The returned values should contain only what is needed, so that we can use them as-is (no parsing required from the user).

Environment

  • Kubernetes version 1.21
  • Using EKS (yes/no), if so version? yes 1.21
  • AWS service targeted (S3, RDS, etc.) RDS

Use secret manager for AWS credentials required for Minio

Possible steps:
Install the Secrets Manager CSI driver: https://docs.aws.amazon.com/secretsmanager/latest/userguide/integrating_csi_driver.html

Then define a SecretProviderClass, for example:

apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
kind: SecretProviderClass
metadata:
  name: aws-secrets
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "arn:aws:secretsmanager:us-east-2:111122223333:secret:MySecret-00AACC"
        jmesPath:
          - path: minioAccesskey
            objectAlias: accesskey
          - path: minioSecretKey
            objectAlias: secretKey
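To make these values consumable as the Kubernetes secret that the Minio/pipeline deployments already reference, the same SecretProviderClass could additionally sync them into mlpipeline-minio-artifact via secretObjects. A sketch (the sync only materializes while at least one pod mounts the CSI volume):

apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
kind: SecretProviderClass
metadata:
  name: aws-secrets
  namespace: kubeflow
spec:
  provider: aws
  secretObjects:
  - secretName: mlpipeline-minio-artifact   # name expected by the pipeline deployments
    type: Opaque
    data:
    - objectName: accesskey                 # must match the objectAlias above
      key: accesskey
    - objectName: secretKey
      key: secretkey
  parameters:
    objects: |
      - objectName: "arn:aws:secretsmanager:us-east-2:111122223333:secret:MySecret-00AACC"
        jmesPath:
          - path: minioAccesskey
            objectAlias: accesskey
          - path: minioSecretKey
            objectAlias: secretKey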

@rrrkharse

Remove mysql-pv-claim from pipelines with RDS component

Describe the bug
Currently the RDS integration changes the RDS endpoint and credentials but does not remove the mysql PVC (20Gi) and possibly the mysql pod from the pipeline component. This costs the user money.

Steps To Reproduce
kustomize build awsconfigs/apps/pipeline/ -o temp

Expected behavior
As stated above, the unused resources should be removed (a delete patch similar to the Katib sketch above could be used).

Environment

  • Kubeflow 1.3 and above
  • AWS service targeted (S3, RDS, etc.) RDS

[Doc] Support S3 as source for in Kubeflow Pipelines Artifact Viewer

"Support S3 Artifact Store" section of document is outdated : https://www.kubeflow.org/docs/distributions/aws/pipeline/#support-s3-artifact-store

Outdated Doc :

Kubeflow Pipelines supports different artifact viewers. You can create files in S3 and reference them in output artifacts in your application as follows:

# Imports needed to run this excerpt; `vocab` is assumed to be defined earlier in the component.
import json
from tensorflow.python.lib.io import file_io

metadata = {
        'outputs' : [
          {
              'source': 's3://bucket/kubeflow/README.md',
              'type': 'markdown',
          },
          {
              'type': 'confusion_matrix',
              'format': 'csv',
              'schema': [
                  {'name': 'target', 'type': 'CATEGORY'},
                  {'name': 'predicted', 'type': 'CATEGORY'},
                  {'name': 'count', 'type': 'NUMBER'},
              ],
              'source': 's3://bucket/confusion_matrics.csv',
              # Convert vocab to string because for boolean values we want "True|False" to match the csv data.
              'labels': list(map(str, vocab)),
          },
          {
              'type': 'tensorboard',
              'source': 's3://bucket/tb-events',
          }
        ]
    }

with file_io.FileIO('/tmp/mlpipeline-ui-metadata.json', 'w') as f:
    json.dump(metadata, f)

KFServing Testing

We should make sure that KFServing is working well. Check whether any of the components of KFServing (DB, storage, credentials if any, etc.) can be replaced by AWS services from an integration point of view. We should also point to a standard example from the official docs that has been tested.

Metadata db is not populated after running a pipeline

Describe the bug
When using the Pipelines integration, if we specify a metadata db name that is different from the Kubeflow default of metadb, then metadata such as artifact types will still be pushed to a db named metadb.

Steps To Reproduce

  1. Specify a name different from metadb when setting up the RDS db; let's use the db name kubeflow for this example
  2. Install Kubeflow (tested using main branch + manifest v1.4.1)
  3. Notice how the database kubeflow gets created correctly
    image
    image
  4. Create a pipeline run that pushes some artifacts
    Eg:
from kfp.v2.dsl import Metrics, Model, Output, component  # imports assumed from the KFP v2 SDK

@component()
def test_comp(
    value: float,
    model: Output[Model],
    model_metrics: Output[Metrics]
):
    print(f"model_metrics.path {model_metrics.path}")
    print(f"model.path {model.path}")
        
    model_metrics.log_metric("test_loss", value)

    with open(model_metrics.path, 'w') as metrics_file:
        metrics_file.write(str(model_metrics.metadata))
    with open(model.path, 'w') as model_file:
        model_file.write("Some model data")
  5. Verify that the artifacts metadata was pushed to the kubeflow db

  6. Notice how the db kubeflow does not have a new row in Artifacts
    image

  7. Verify the table Artifacts in the metadb database instead
    image

  8. Notice how the metadata was pushed to the metadb database instead.

Expected behavior
I expected the metadata to be uploaded to the kubeflow db.

Environment

  • Kubernetes version 1.21
  • Using EKS (yes/no), if so version? yes eks.4
  • AWS service targeted (S3, RDS, etc.) RDS

Screenshots
image

Additional context
I could be wrong but I suspect that some changes are required to allow the renaming of the metadata db name.
If we take a look at upstream/apps/pipeline/upstream/base/metadata/base/metadata-grpc-deployment.yaml we see the following :

env:
- name: DBCONFIG_USER
  valueFrom:
    secretKeyRef:
      name: mysql-secret
      key: username
- name: DBCONFIG_PASSWORD
  valueFrom:
    secretKeyRef:
      name: mysql-secret
      key: password
- name: MYSQL_DATABASE
  valueFrom:
    configMapKeyRef:
      name: pipeline-install-config
      key: mlmdDb
- name: MYSQL_HOST
  valueFrom:
    configMapKeyRef:
      name: pipeline-install-config
      key: dbHost
- name: MYSQL_PORT
  valueFrom:
    configMapKeyRef:
      name: pipeline-install-config
      key: dbPort

So it looks like MYSQL_DATABASE is retrieved from upstream/apps/pipeline/upstream/base/installs/generic/pipeline-install-config.yaml.
From this ConfigMap we can see that the value is hard-coded:

mlmdDb: metadb

Maybe we'd need to add mlmdDb as a parameter in awsconfigs/apps/pipeline/params.env, along the lines of the sketch below?
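If that route is taken, the override in the AWS overlay could look roughly like this (a sketch; it assumes pipeline-install-config is produced by a configMapGenerator in the base, and that the literal would ultimately come from params.env):

configMapGenerator:
- name: pipeline-install-config
  namespace: kubeflow
  behavior: merge              # merge over the upstream generator
  literals:
  - mlmdDb=kubeflow            # hypothetical value sourced from params.env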

[e2e test] Central Dashboard

Is your feature request related to a problem? Please describe.
Test the central dashboard at a high level: login/logout, that the web apps are linked, etc.

Full scope: TBD

[v1.4-branch] Kubeflow Pipeline Does Not Work

Describe the bug
I installed Kubeflow 1.4 using v1.4-branch RDS & S3 Setup.
I tried executing the sample run [Tutorial] V2 lightweight Python components but it fails.

See Kubeflow 1.4 Progress #27

Steps To Reproduce

  1. Install Kubeflow 1.4 by following the steps described here
  2. Run the sample pipeline using default values [Tutorial] V2 lightweight Python components

Expected behavior
The pipeline succeeds.

Environment

  • Kubeflow v1.4-branch
  • Kubernetes version 1.21
  • Using EKS (yes/no), if so version? yes, platform version eks.4
  • AWS service targeted (S3, RDS, etc.) S3, RDS

Screenshots
image

Additional context

  • Pipeline Step Logs:
    logs.txt

  • kubectl get pods --all-namespaces Result :
    pods.txt

    • Notice that katib-mysql-7894994f88-tkngr is in CreateContainerConfigError
    • kubectl describe pod is showing the following error for this pod : Error: couldn't find key MYSQL_ROOT_PASSWORD in Secret kubeflow/katib-mysql-secrets
  • Some data seem to have been written to the bucket but the error logs suggest otherwise.
    image

My params.env looks like this:

dbHost=insert_the_host.us-east-2.rds.amazonaws.com

bucketName=test-bucket-name-here
minioServiceHost=s3.amazonaws.com
minioServiceRegion=us-east-2

My secret.env has the default values.

Also, even if I specify the pipeline-root in the pipeline UI, I get an error in the pipeline step logs: MissingRegion: could not find region configuration

Katib Testing

We should make sure that Katib is working well. Check whether any of the components of Katib (DB, storage, credentials, etc.) can be replaced by AWS services from an integration point of view. We should also point to a standard example from the official docs that has been tested.

How to setup LB endpoint for vanilla Kubeflow deployment

Is your feature request related to a problem? Please describe.
@AlexandreBrown on #platform-aws slack

We have the vanilla Kubeflow 1.4 manifests installed on EKS.
We used kubectl port-forward for development, and now we would like to expose Kubeflow so that we can perform inference requests from outside the cluster (without port-forward).
We are wondering which doc to use.

Describe the solution you'd like

Public endpoint to access KF dashboard with dex login

Should users be allowed to sign themselves up using Cognito hosted UI

Is your feature request related to a problem? Please describe.

Current Scenario:
The user pools created in the deployment docs require the AWS account/user pool admin to create users in Cognito with their username, email address, and a temporary password. Username is the only required parameter, whereas the email ID is mandatory for Kubeflow multi-user to work.

The sign-up page only asks for a username and password and creates a user with UNCONFIRMED account status. This user cannot log in, since the account is created in UNCONFIRMED status and requires the user pool admin to confirm the account. Moreover, the current sign-up page does not ask for an email address, so the users created via this page are not useful. The post-sign-up page also lands on the following error:
Screenshot 2022-03-07 at 2 05 48 PM

Irrespective of the above error, we need to take a step back and ask: do customers want to allow users to sign themselves up, or to strictly control the creation of user accounts?

Describe the solution you'd like
Solution 1: Disable the sign-up button in the Cognito hosted UI and let admins handle user creation. This is more secure, since the domain is hosted over the internet and customers strictly want to control who has access to the platform.
In addition, make the email ID a required attribute, and call out in the docs that it is a required user attribute and that it should appear as the email key in the user claims.

Describe alternatives you've considered
Solution 2: Keep the sign-up page and make the email address a mandatory parameter instead of the username. We would still keep the account confirmation control with admins for security reasons.

These settings are for demo purposes only, since Cognito also allows many other customizations/features such as federated login, and customers will configure the user pool according to their needs.

Kubeflow integration with AWS managed Prometheus, Grafana and CW

Is your feature request related to a problem? Please describe.
Logging and monitoring are two important aspects of an MLOps platform. The AWS distribution of Kubeflow should integrate with AWS-managed Prometheus, Grafana, and CloudWatch (CW).

Describe the solution you'd like
I believe we should order this by priority:

  1. Logging: CW and Prometheus integration
  2. Monitoring: Grafana integration.

As part of this issue, I would like to explore the viability of these options as well.

AWS distribution of Kubeflow v1.3

Feature list:

  • Cognito Integration
  • RDS integration for pipelines
  • RDS integration for Katib
  • S3 integration for pipelines
  • DLC in Jupyter Notebooks
  • Secrets manager integration for credentials
  • EFS
  • FSx
  • e2e testing framework
  • #92
  • Documentation for telemetry component #114
  • Documentation Updates to website (Currently in distributions/aws/examples)

Please see instructions on how to deploy: https://github.com/awslabs/kubeflow-manifests/tree/v1.3-branch/distributions/aws/examples

Cannot find target group nor listener when setting up load balancer

Describe the bug
I followed the steps to build a load balancer and am getting two errors from the ALB ingress controller pod:
"error"="failed to reconcile listeners due to failed to reconcile rules due to ListenerNotFound: One or more listeners not found
followed by
msg"="Reconciler error" "error"="failed to reconcile targetGroups due to failed to reconcile targetGroup tags due to TargetGroupNotFound: Target groups

Steps To Reproduce
On an EC2 instance with the minimum eksctl privileges, follow this guide: https://github.com/awslabs/kubeflow-manifests/tree/v1.3-branch/distributions/aws/examples/vanilla#connect-to-your-kubeflow-cluster
Then, to set up the load balancer:
#67 (comment)
Expected behavior
The first error I had was to do with kf-admin-eu-west-1 not having an OIDC provider URL, but that was my fault because $OIDC_PROVIDER_URL didn't export properly.
The two errors listed above occurred once I sorted that out, and the endpoint for the load balancer is inaccessible.

Environment

  • Kubernetes version: 1.19
  • Using EKS: latest
  • AWS service targeted None, vanilla but setting this up on an ec2 instance

Screenshots
Screenshot 2022-03-02 at 13 24 22

Additional context

[Component Guide] KFServing

Is your feature request related to a problem? Please describe.
Existing samples/tutorials in this repository focus on how to install Kubeflow on EKS.
The KServe docs show how to create an inference service and send requests through the ingress gateway if the user has a real DNS, etc. They do not talk about auth or how to set up a real DNS.

Describe the solution you'd like

  • E2e tutorials for users on how to use inference service in production on AWS with an ALB endpoint/custom domain and auth
  • How to use a model in S3

[TRACKING] AWS Distribution for Kubeflow v1.5

At the 02/15 Kubeflow community meeting, the manifests WG announced they are on track for the 1.5 release, with the distribution testing phase running from 02/16 to 03/08.

AWS needs to verify the following features:

  • Vanilla Kubeflow
  • Cognito Integration
  • RDS integration for pipelines
  • RDS integration for Katib
  • S3 integration for pipelines
  • DLC in Jupyter Notebooks
  • EFS
  • FSx
  • e2e tests
  • Documentation updates to website

We will be using the main branch for this

Unable to deploy notebook or tensorboard servers

Describe the bug
When trying to set up a notebook server or a TensorBoard server after deploying Kubeflow on EKS, I get the following error:
[403] Could not find CSRF cookie XSRF-TOKEN in the request.

Steps To Reproduce
Follow the Vanilla guide for v1.3 aws deployment of kubeflow

Expected behavior
The expected behaviour was to be able to deploy a notebook or TensorBoard server.

Environment

  • Kubernetes version 1.19
  • Using EKS: Latest
  • None, was deployed using an EC2 instance however

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
I assume this isn't working because the load balancer serves HTTP rather than HTTPS, and that is why it is failing. I have seen posts about similar issues where APP_SECURE_COOKIES=false (or something similar) is used, but they don't say where to put it. I assume in the env file?
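For reference, the commonly suggested workaround (assuming the HTTP-only load balancer is indeed the cause) is to set APP_SECURE_COOKIES=false on the web-app deployments, for example with a kustomize strategic-merge patch like the sketch below (deployment and container names assumed from the upstream Jupyter web app manifests; the TensorBoard web app would need the same change):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter-web-app-deployment
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: jupyter-web-app
        env:
        - name: APP_SECURE_COOKIES
          value: "false"       # allow the CSRF cookie over plain HTTP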

AWS ALB Ingress controller runs into "Reconciler error"

I am trying to install Kubeflow 1.3 on an EKS 1.19 cluster using the instructions provided here. We have a shared VPC setup with 2 private subnets; each subnet has the appropriate tags and is deployed in a different AZ:

Key                                  Value
kubernetes.io/cluster/kubeflow-dev   shared
kubernetes.io/role/internal-elb      1

I configured the IAM-role-backed service account as described in the instructions as well, and I checked that the IAM role has the appropriate permissions attached to it. When the AWS ingress controller comes up, it continuously runs into the following error:

E0106 03:07:35.678637       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to build LoadBalancer configuration due to retrieval of subnets failed to resolve 2 qualified subnets. Subnets must contain the kubernetes.io/cluster/\u003ccluster name\u003e tag with a value of shared or owned and the kubernetes.io/role/elb tag signifying it should be used for ALBs Additionally, there must be at least 2 subnets with unique availability zones as required by ALBs. Either tag subnets to meet this requirement or use the subnets annotation on the ingress resource to explicitly call out what subnets to use for ALB creation. The subnets that did resolve were []"  "controller"="alb-ingress-controller" "request"={"Namespace":"istio-system","Name":"istio-ingress"}
E0106 03:07:36.717626       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to build LoadBalancer configuration due to retrieval of subnets failed to resolve 2 qualified subnets. Subnets must contain the kubernetes.io/cluster/\u003ccluster name\u003e tag with a value of shared or owned and the kubernetes.io/role/elb tag signifying it should be used for ALBs Additionally, there must be at least 2 subnets with unique availability zones as required by ALBs. Either tag subnets to meet this requirement or use the subnets annotation on the ingress resource to explicitly call out what subnets to use for ALB creation. The subnets that did resolve were []"  "controller"="alb-ingress-controller" "request"={"Namespace":"istio-system","Name":"istio-ingress"}
E0106 03:07:37.757723       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to build LoadBalancer configuration due to retrieval of subnets failed to resolve 2 qualified subnets. Subnets must contain the kubernetes.io/cluster/\u003ccluster name\u003e tag with a value of shared or owned and the kubernetes.io/role/elb tag signifying it should be used for ALBs Additionally, there must be at least 2 subnets with unique availability zones as required by ALBs. Either tag subnets to meet this requirement or use the subnets annotation on the ingress resource to explicitly call out what subnets to use for ALB creation. The subnets that did resolve were []"  "controller"="alb-ingress-controller" "request"={"Namespace":"istio-system","Name":"istio-ingress"}

Here is the aws-alb-ingress-controller-config config map

apiVersion: v1
data:
  clusterName: kubeflow-dev
kind: ConfigMap
metadata:
  name: aws-alb-ingress-controller-config
  namespace: kubeflow

Here is the output of describe command on the alb-ingress-controller pod

kubectl describe pod alb-ingress-controller-7598c8999c-6lhzw -n kubeflow
Name:         alb-ingress-controller-7598c8999c-6lhzw
Namespace:    kubeflow
Priority:     0
Node:         ip-10-xxx-x-9.ec2.internal/10.xxx.x.9
Start Time:   Wed, 05 Jan 2022 22:07:18 -0500
Labels:       app=aws-alb-ingress-controller
              app.kubernetes.io/name=alb-ingress-controller
              kustomize.component=aws-alb-ingress-controller
              pod-template-hash=7598c8999c
Annotations:  kubernetes.io/psp: eks.privileged
              sidecar.istio.io/inject: false
Status:       Running
IP:           10.xxx.x.28
IPs:
  IP:           10.xxx.x.28
Controlled By:  ReplicaSet/alb-ingress-controller-7598c8999c
Containers:
  alb-ingress-controller:
    Container ID:  docker://8bf47ac7dc5857f0cde584ed4fc8ee9dbabfc09aad7c5462df6239d53bc6edd9
    Image:         docker.io/amazon/aws-alb-ingress-controller:v1.1.5
    Image ID:      docker-pullable://amazon/aws-alb-ingress-controller@sha256:e7d4a6a99b3e21eafae645d42b9d30ed6e790e322274f100fa068433e25b7be3
    Port:          <none>
    Host Port:     <none>
    Args:
      --ingress-class=alb
      --cluster-name=$(CLUSTER_NAME)
    State:          Running
      Started:      Wed, 05 Jan 2022 22:07:19 -0500
    Ready:          True
    Restart Count:  0
    Environment:
      CLUSTER_NAME:                 <set to the key 'clusterName' of config map 'aws-alb-ingress-controller-config'>  Optional: false
      AWS_DEFAULT_REGION:           us-east-1
      AWS_REGION:                   us-east-1
      AWS_ROLE_ARN:                 arn:aws:iam::xxxxxxxxxxxxxx:role/kf-admin-us-east-1-kubeflow-dev
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from alb-ingress-controller-token-nmdz6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  alb-ingress-controller-token-nmdz6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  alb-ingress-controller-token-nmdz6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  21m   default-scheduler  Successfully assigned kubeflow/alb-ingress-controller-7598c8999c-6lhzw to ip-10-242-4-9.ec2.internal
  Normal  Pulling    21m   kubelet            Pulling image "docker.io/amazon/aws-alb-ingress-controller:v1.1.5"
  Normal  Pulled     21m   kubelet            Successfully pulled image "docker.io/amazon/aws-alb-ingress-controller:v1.1.5" in 134.111466ms
  Normal  Created    21m   kubelet            Created container alb-ingress-controller
  Normal  Started    21m   kubelet            Started container alb-ingress-controller

Here is the controller's deployment configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-01-06T02:08:35Z"
  generation: 1
  labels:
    app: aws-alb-ingress-controller
    kustomize.component: aws-alb-ingress-controller
  name: alb-ingress-controller
  namespace: kubeflow
  resourceVersion: "178004"
  selfLink: /apis/apps/v1/namespaces/kubeflow/deployments/alb-ingress-controller
  uid: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: aws-alb-ingress-controller
      app.kubernetes.io/name: alb-ingress-controller
      kustomize.component: aws-alb-ingress-controller
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      creationTimestamp: null
      labels:
        app: aws-alb-ingress-controller
        app.kubernetes.io/name: alb-ingress-controller
        kustomize.component: aws-alb-ingress-controller
    spec:
      containers:
      - args:
        - --ingress-class=alb
        - --cluster-name=$(CLUSTER_NAME)
        env:
        - name: CLUSTER_NAME
          valueFrom:
            configMapKeyRef:
              key: clusterName
              name: aws-alb-ingress-controller-config
        image: docker.io/amazon/aws-alb-ingress-controller:v1.1.5
        imagePullPolicy: Always
        name: alb-ingress-controller
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: alb-ingress-controller
      serviceAccountName: alb-ingress-controller
      terminationGracePeriodSeconds: 30

Can someone help me figure out what is going on here?

Create CFN template for resources required by AWS Cognito

Enhance user experience

  • Creating certificates requires us-east-1 resources to be created before the resources for the deployment region
  • Cognito configuration requires creating users, an app client, etc., and configuring user pool domains

The process has several steps and is error-prone. CFN might be able to automate this process and reduce user error.

Create a master branch for AWS Distro Development

Background

The current development process followed for AWS-Kubeflow-distro-1.3.1 was to take a release branch from the manifests repository as the base (a.k.a. upstream) and make AWS-distribution-specific changes on the same branch. This approach presented the fastest way to get development going, considering the timeline, the kfctl tool getting deprecated, and vendors being required to set up their own repositories to host distribution-specific manifests.

Problem

However, the above approach presents several issues:

  1. Working on a new release:
    1. Every release requires a developer to create a copy of the upstream KF release branch and copy/cherry-pick changes from the previous awslabs release branch (e.g. v1.3-branch) to the new branch (e.g. v1.4-branch), which is cumbersome and error prone (e.g. #75)
    2. These release branches on upstream are not frozen because they are used for any patch releases. Let’s take an example for v1.4.0 and v1.4.1. Every patch release will require us to either:
      1. Create a new branch(e.g. v1.4.1-branch) from the tag/v1.4.1 from upstream and repeat the process in 1.a.
      2. Cherry pick commits related to patch release from upstream and create a PR to awslabs repo
      3. Rebase the awslabs/v1.4-branch from upstream/v1.4-branch and force push which is unacceptable
  2. Maintenance issues unrelated to the code owned by the team
    1. Security vulnerabilities for test dependencies coming from upstream
    2. Modifying the root pkg READMEs on each release branch (e.g. see README changes)
  3. AWS release version is coupled with Kubeflow version
    1. With this approach, we need to maintain 2 release versions, one from Kubeflow and another from AWS, because we may have to do patch releases or feature enhancements, e.g. aws-manifests-1.x-Kubeflow-1.3.1
  4. Other issues:
    1. PR build and test infra (when we have one) will need to be constantly updated to point to different branches instead of one master branch. Similarly, GitHub issue templates need to be ported
    2. Default branch on the repository will keep changing
    3. Might lose the history related to the PRs between releases if changes are copied instead of cherry-picked

Suggested Approach

The majority of the overlays do not change unless there are major changes in the upstream component. Create a base with AWS overlays that does not change between versions, and work only on the delta per release.

Development - Build and Test

Create a master branch to host only AWS specific overlays, tests and docs. e.g.

ubuntu@ip-172-31-0-119:~/kubeflow-manifests$ tree -L 3
.
├── README.md
├── awsconfig
│   ├── apps
│   │   ├── jupyter
│   │   ├── katib
│   │   └── pipelines
│   ├── common
│   │   ├── alb-controller
│   │   ├── istio-envoy-filter
│   │   └── istio-ingress
│   └── infra_configs
├── docs
│   ├── cognito
│   ├── rds
│   ├── s3
│   └── storage
├── tests
│   ├── canary
│   ├── e2e
│   └── unit

(Customer experience) This will introduce one additional step for users after cloning the awslabs repo.

export KUBEFLOW_RELEASE_VERSION=v1.4.1
git clone --branch ${KUBEFLOW_RELEASE_VERSION} https://github.com/kubeflow/manifests.git upstream

Once the user has cloned a tagged branch from the upstream manifests (kubeflow/manifests repo), they will continue with setting up AWS resources or running kustomize build and apply, just like today.

For this, we need to isolate the AWS-specific changes, in the form of overlays, into a separate directory.

  • I have already completed a POC for Jupyter, Cognito, RDS, and S3 in the v1.4-branch. There should be no changes for EFS and FSx since those changes are self-contained. The only piece left is the secrets-manager integration, which is also doable

Releases

New releases (major and patch from upstream): Cut a release once a Kubeflow release has been validated with updated README.

Patch for an old release (depends on support policy) where the latest overlays are incompatible: use the existing release tag branch to do a patch release or backport features for older releases.

  • Follow up: Will this base become outdated between several releases?
    • The intent of this development strategy is not to support multiple versions using the latest build. Making changes to a previous release will take the same amount of time, i.e. go to the release branch and make the changes.

How does this solve the problems stated earlier?

We take out the problems related to PR build/test, copying changes over from old release branches, and modifying parent READMEs by creating this base package with only AWS manifests instead of using a Kubeflow release branch as the base. Contributors only need to work on the delta per release. Problems related to security vulnerabilities in unrelated packages are handled by cloning the upstream only at build time.

Patch releases just become a small PR that changes a version in a config file and re-runs the tests using the PR build infra.

The release-versioning problem is not completely solved, since it also depends on the upstream, but we are able to cut a new release for this package independently that states the compatible version of Kubeflow (TBD, see below).

Next steps/TODOs

  • How to document/convey the supported upstream version to users?
  • What is the release versioning for this package?
  • Use the main branch to get started with this

Please comment your thoughts/feedback

Replace Minio by S3

Is your feature request related to a problem? Please describe.
We want to get rid of Minio completely from Kubeflow and use S3.

Describe the solution you'd like
Currently Minio acts as a gateway to S3; one solution is to use the S3 controller developed by ACK to act as the gateway for S3.

Describe alternatives you've considered
N/A

Additional context
N/A
