awslabs / kubeflow-manifests

KubeFlow on AWS

Home Page: https://awslabs.github.io/kubeflow-manifests/

License: Apache License 2.0

YAML 90.00% Shell 0.02% Python 5.50% JSON 0.16% Makefile 0.13% Go 0.30% Dockerfile 0.16% SCSS 0.08% HTML 0.28% Jupyter Notebook 0.07% CSS 0.09% JavaScript 0.03% Smarty 0.81% HCL 2.39%
kubeflow eks kubernetes aws data-science mlops

kubeflow-manifests's Introduction


Kubeflow on AWS

Overview

Kubeflow on AWS is an open source distribution of Kubeflow that provides its own Kubeflow manifests to support integrations with various AWS managed services. Use Kubeflow on AWS to streamline data science tasks and build highly reliable, secure, and scalable machine learning systems with reduced operational overheads.

Getting Started

There are a number of deployment options for installing Kubeflow with AWS service integrations. To get started with deploying Kubeflow on Amazon Elastic Kubernetes Service (Amazon EKS), see the deployment section on the Kubeflow on AWS website.

Help & Feedback

For help, please consider the following venues (in order):

Contributing

We welcome community contributions and pull requests.

See our contribution guide for more information on how to report issues, set up a development environment, and submit code.

We adhere to the Amazon Open Source Code of Conduct.

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

kubeflow-manifests's People

Contributors

akartsky, alexandrebrown, amazon-auto, amitkalawat, ananth102, blorby, briannaroskind, dependabot[bot], elanv, ghaering, goswamig, jrhode2, jsitu777, judyheflin, kjvjobin, krhoyt, mbaijal, monkesh, npepin-hub, prli, rd-pong, rddefauw, rrrkharse, ryansteakley, sagi-shimoni, surajkota, techwithshadab, theofpa


kubeflow-manifests's Issues

Remove katib-mysql pod and mysql pvc from katib with RDS component

Describe the bug
See #66 (comment)
When using RDS with Katib, keeping the katib-mysql-secrets secret, the katib-mysql deployment, and the katib-mysql PVC (10Gi) is confusing for customers and also costs money.

Steps To Reproduce
kustomize build apps/katib/upstream/installs/katib-external-db-with-kubeflow -o temp

Expected behavior
As stated above, the unused resources should be removed (one possible approach is sketched at the end of this issue).

Environment

  • RDS integration with Katib in Kubeflow
  • AWS service targeted (S3, RDS, etc.) RDS
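One way the removal could be done, sketched below as a kustomize strategic-merge delete patch (the Deployment name is taken from this issue; the overlay wiring is an assumption, not the distribution's actual layout):

# katib-mysql-delete-patch.yaml, referenced via patchesStrategicMerge in the overlay
apiVersion: apps/v1
kind: Deployment
metadata:
  name: katib-mysql
  namespace: kubeflow
$patch: delete

Similar delete patches could remove the katib-mysql PVC and the katib-mysql-secrets Secret.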

Facilitate configuration of RDS & S3 credentials access for Notebooks & Pipelines

Is your feature request related to a problem? Please describe.
It might not be obvious which Kubernetes secret name to use when creating a PodDefault so that the RDS & S3 credentials set up by the credentials manager can be used. It also might not be obvious to new Kubernetes users that they need to create something called a PodDefault in order to access these credentials from within a notebook and in the notebook creation UI.

Use Cases

  • Specify AWS credentials when using boto3
    • Solution: have a PodDefault make environment variables available
  • Use ml-metadata (requires database credentials in the notebook)
    • Solution: have a PodDefault make environment variables available

Describe the solution you'd like
I propose we add this as part of the Kubeflow install.
For instance, if the user installs Kubeflow via the S3 setup, their PodDefault could look like:

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: add-aws-secret
  namespace: <YOUR_NAMESPACE_HERE>
  labels:
    add-aws-secret : "true"
spec:
  desc: add-aws-secret
  selector:
    matchLabels:
      add-aws-secret: "true"
  env:
  - name: AWS_ACCESS_KEY_ID
    valueFrom:
      secretKeyRef:
        name: mlpipeline-minio-artifact
        key: accesskey
  - name: AWS_SECRET_ACCESS_KEY
    valueFrom:
      secretKeyRef:
        name: mlpipeline-minio-artifact
        key: secretkey

If they install via the RDS & S3 setup, then we would also add the RDS values (e.g. DB_HOST, DB_PORT, DB_NAME, DB_USER, DB_PWD from mysql-secret) so that the user can connect to the ML Metadata database in a notebook using environment variable values; see the sketch below.
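A sketch of what that could look like (the PodDefault name is hypothetical; the secret and ConfigMap key names are taken from the pipeline manifests quoted further down this page, and the sketch assumes those objects, or copies of them, exist in the user's namespace):

apiVersion: kubeflow.org/v1alpha1
kind: PodDefault
metadata:
  name: add-rds-secret                 # hypothetical name
  namespace: <YOUR_NAMESPACE_HERE>
  labels:
    add-rds-secret: "true"
spec:
  desc: add-rds-secret
  selector:
    matchLabels:
      add-rds-secret: "true"
  env:
  - name: DB_USER
    valueFrom:
      secretKeyRef:
        name: mysql-secret             # assumes this secret is available in the user namespace
        key: username
  - name: DB_PWD
    valueFrom:
      secretKeyRef:
        name: mysql-secret
        key: password
  - name: DB_HOST
    valueFrom:
      configMapKeyRef:
        name: pipeline-install-config  # assumes this ConfigMap is available in the user namespace
        key: dbHost
  - name: DB_PORT
    valueFrom:
      configMapKeyRef:
        name: pipeline-install-config
        key: dbPort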

This would also open the door to greater notebook integration, as we could modify spawner_ui_config.yaml and add the PodDefault created during setup as the default selected value.

  configurations:
    # List of labels to be selected, these are the labels from PodDefaults
    # value:
    #   - add-gcp-secret
    #   - default-editor
    value:
       - add-aws-secret
    readOnly: false

image

Kubeflow logout stuck in infinite loop

Describe the bug

I installed Kubeflow 1.4 using Cognito. I can access the Kubeflow UI and sign in properly. When I click the logout button, it gets stuck in an infinite loop. As a result, the user is not logged out.

Steps To Reproduce
1- Install Kubeflow from this guide: https://github.com/awslabs/kubeflow-manifests/tree/v1.4-branch/distributions/aws/examples/cognito
2- Sign in via Kubeflow UI
3- Logout via Kubeflow UI

Expected behavior
I expect the user to be logged out and redirected to the sign-in page.

Additional context
This bug is related to the feature: kubeflow/kubeflow#5174

Notebook server creation fails if PV is not used for workspace

Description of issue

Notebook creation succeeds with the scipy image and the default Kubeflow PyTorch images but fails with DLC-based images.

Logs

Using deprecated annotation `kubectl.kubernetes.io/default-logs-container` in pod/skota-2-0. Please use `kubectl.kubernetes.io/default-container` instead
/opt/conda/lib/python3.8/site-packages/jupyter_server_mathjax/app.py:40: FutureWarning: The alias `_()` will be deprecated. Use `_i18n()` instead.
  help=_("""The MathJax.js configuration file that is to be used."""),
[I 2021-11-30 20:38:10.802 ServerApp] jupyter_server_mathjax | extension was successfully linked.
[W 2021-11-30 20:38:10.806 LabApp] 'token' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-11-30 20:38:10.806 LabApp] 'password' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-11-30 20:38:10.806 LabApp] 'allow_origin' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[W 2021-11-30 20:38:10.806 LabApp] 'base_url' has moved from NotebookApp to ServerApp. This config will be passed to ServerApp. Be sure to update your config before our next release.
[I 2021-11-30 20:38:10.813 ServerApp] jupyterlab | extension was successfully linked.
[I 2021-11-30 20:38:10.813 ServerApp] jupyterlab_git | extension was successfully linked.
[W 2021-11-30 20:38:10.822 ServerApp] [Errno 13] Permission denied: '/home/jovyan/.local'
[I 2021-11-30 20:38:10.822 ServerApp] nbdime | extension was successfully linked.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 535, in get
    value = obj._trait_values[self.name]
KeyError: 'runtime_dir'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/jupyter-lab", line 8, in <module>
    sys.exit(main())
  File "/opt/conda/lib/python3.8/site-packages/jupyter_server/extension/application.py", line 567, in launch_instance
    serverapp = cls.initialize_server(argv=args)
  File "/opt/conda/lib/python3.8/site-packages/jupyter_server/extension/application.py", line 537, in initialize_server
    serverapp.initialize(
  File "/opt/conda/lib/python3.8/site-packages/traitlets/config/application.py", line 87, in inner
    return method(app, *args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 2320, in initialize
    self.init_configurables()
  File "/opt/conda/lib/python3.8/site-packages/jupyter_server/serverapp.py", line 1739, in init_configurables
    connection_dir=self.runtime_dir,
  File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 575, in __get__
    return self.get(obj, cls)
  File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 538, in get
    default = obj.trait_defaults(self.name)
  File "/opt/conda/lib/python3.8/site-packages/traitlets/traitlets.py", line 1578, in trait_defaults
    return self._get_trait_default_generator(names[0])(self)
  File "/opt/conda/lib/python3.8/site-packages/jupyter_core/application.py", line 95, in _runtime_dir_default
    ensure_dir_exists(rd, mode=0o700)
  File "/opt/conda/lib/python3.8/site-packages/jupyter_core/utils/__init__.py", line 11, in ensure_dir_exists
    os.makedirs(path, mode=mode)
  File "/opt/conda/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/opt/conda/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/opt/conda/lib/python3.8/os.py", line 213, in makedirs
    makedirs(head, exist_ok=exist_ok)
  File "/opt/conda/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/home/jovyan/.local'

Steps to reproduce:

Create a notebook using one of the AWS images and check the option "Don't use Persistent Storage for User's home" under workspace volume.

KFP Testing

We should make sure that KFP is working well. Check whether any of the components of KFP (DB, storage, credentials, etc.) can be replaced by AWS services from an integration point of view. We should also point to a standard example from the official docs that has been tested.

Cloudwatch integration with Kubeflow

Describe the bug
The worker node group logging described here can be used for sending pod logs to CloudWatch. However, with the current setup it does not work. For example, the role that we assign to the worker nodes does not have a policy allowing it to describe log groups in CloudWatch.

Steps To Reproduce

  1. Go to the folder kubeflow-manifests/distributions/aws/fluentd-cloud-watch and update the base/params.env values.
  2. Run `kubectl apply -k fluentd-cloud-watch/base`
  3. Check CloudWatch to see whether it has the log group /aws/containerinsights/${AWS_CLUSTER_NAME}/containers
  4. Look at the logs from the fluentd-cloudwatch-xxxxx pods; they report errors like the following:
2021-12-26 06:03:51 +0000 [warn]: #0 [out_cloudwatch_logs_containers] failed to flush the buffer. retry_time=7 next_retry_seconds=2021-12-26 06:05:02 +0000 chunk="5d4065628678b563f9dd8c2962ea5455" error_class=Aws::CloudWatchLogs::Errors::AccessDeniedException error="User: arn:aws:sts::<my account_id>:assumed-role/eksctl-myclustername-nodegroup-li-NodeInstanceRole-PP7XWJ1Q1DWU/i-01083ed00c1fdfdff6 is not authorized to perform: logs:DescribeLogGroups on resource: arn:aws:logs:us-east-1:<my_aws_account_id>:log-group::log-stream:"
  2021-12-26 06:03:51 +0000 [warn]: #0 suppressed same stacktrace
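A possible fix (not validated here) is to make sure the node instance role carries CloudWatch Logs permissions, for example the managed CloudWatchAgentServerPolicy, which includes logs:DescribeLogGroups. With eksctl this could look roughly like the following sketch (cluster and nodegroup names are placeholders):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: <my-cluster>                 # placeholder
  region: us-east-1
nodeGroups:
  - name: <my-nodegroup>             # placeholder
    iam:
      attachPolicyARNs:
        # the default node policies must be listed explicitly when attachPolicyARNs is used
        - arn:aws:iam::aws:policy/AmazonEKSWorkerNodePolicy
        - arn:aws:iam::aws:policy/AmazonEKS_CNI_Policy
        - arn:aws:iam::aws:policy/AmazonEC2ContainerRegistryReadOnly
        # grants logs:DescribeLogGroups, CreateLogGroup/Stream, PutLogEvents, etc.
        - arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy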

Expected behavior
No error messages in the fluentd pods.
The CloudWatch log group should be created and pod logs should be pushed there.

Environment

  • Kubernetes version: 1.20
  • Using EKS (yes/no), if so version? yes
  • AWS service targeted (S3, RDS, etc.): cloud watch


Availability w.r.t secrets manager integration

Describe the bug
The secrets mount is modelled as a standalone pod, which is unreliable in case the node goes down or the secrets need to be rotated (although auto-rotation is not supported).

  • The secrets mount pod should be modelled as a Deployment (see the sketch below)
  • Istio sidecar injection should be disabled for it
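A minimal sketch of what that could look like, assuming the existing secrets-mount pod spec is wrapped as-is (the names, placeholder image, and the aws-secrets SecretProviderClass here are illustrative, not the distribution's actual manifests):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: aws-secrets-sync                   # hypothetical name
  namespace: kubeflow
spec:
  replicas: 1
  selector:
    matchLabels:
      app: aws-secrets-sync
  template:
    metadata:
      labels:
        app: aws-secrets-sync
      annotations:
        sidecar.istio.io/inject: "false"   # disable Istio sidecar injection
    spec:
      containers:
      - name: secrets-sync
        image: busybox:1.36                # placeholder; the real pod only needs to keep the CSI volume mounted
        command: ["sh", "-c", "while true; do sleep 3600; done"]
        volumeMounts:
        - name: secrets-store
          mountPath: /mnt/aws-secrets
          readOnly: true
      volumes:
      - name: secrets-store
        csi:
          driver: secrets-store.csi.k8s.io
          readOnly: true
          volumeAttributes:
            secretProviderClass: aws-secrets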

[Doc] Support TensorBoard in Kubeflow Pipelines

"Support TensorBoard in Kubeflow Pipelines" section of document is outdated :
https://www.kubeflow.org/docs/distributions/aws/pipeline/#support-tensorboard-in-kubeflow-pipelines

Outdated Doc :

TensorBoard needs some extra settings on AWS like below:

  1. Create a Kubernetes secret aws-secret in the kubeflow namespace. Follow instructions here.

  2. Create a ConfigMap to store the configuration of TensorBoard on your cluster. Replace <your_region> with your S3 region.

apiVersion: v1
kind: ConfigMap
metadata:
  name: ml-pipeline-ui-viewer-template
data:
  viewer-tensorboard-template.json: |
    {
        "spec": {
            "containers": [
                {
                    "env": [
                        {
                            "name": "AWS_ACCESS_KEY_ID",
                            "valueFrom": {
                                "secretKeyRef": {
                                    "name": "aws-secret",
                                    "key": "AWS_ACCESS_KEY_ID"
                                }
                            }
                        },
                        {
                            "name": "AWS_SECRET_ACCESS_KEY",
                            "valueFrom": {
                                "secretKeyRef": {
                                    "name": "aws-secret",
                                    "key": "AWS_SECRET_ACCESS_KEY"
                                }
                            }
                        },
                        {
                            "name": "AWS_REGION",
                            "value": "<your_region>"
                        }
                    ]
                }
            ]
        }
    }
  3. Update the ml-pipeline-ui deployment to use the ConfigMap by running kubectl edit deployment ml-pipeline-ui -n kubeflow.
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: ml-pipeline-ui
  namespace: kubeflow
  ...
spec:
  template:
    spec:
      containers:
      - env:
        - name: VIEWER_TENSORBOARD_POD_TEMPLATE_SPEC_PATH
          value: /etc/config/viewer-tensorboard-template.json
        ....
        volumeMounts:
        - mountPath: /etc/config
          name: config-volume
      .....
      volumes:
      - configMap:
          defaultMode: 420
          name: ml-pipeline-ui-viewer-template
        name: config-volume

ALB Auth doesn't support Cognito - eu-north-1, eu-west-3, sa-east-1, us-west-1

This results in the following error message when provisioning an ALB in the alb-ingress-controller logs:

E1109 00:24:55.867169       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to reconcile listeners due to failed to create listener due to ValidationError: Action type 'authenticate-cognito' must be one of 'redirect,fixed-response,forward,authenticate-oidc'\n\tstatus code: 400, request id: a2a04577-cdd9-4a28-b735-f553883eccf6"  "controller"="alb-ingress-controller" "request"={"Namespace":"istio-system","Name":"istio-ingress"}

We need to find a workaround and/or determine whether we can support these regions for Cognito on Kubeflow; one possible direction is sketched below.
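One possible direction, not validated for this distribution: ALB's generic authenticate-oidc action is available in these regions, so the Ingress could authenticate against the Cognito user pool's standard OIDC endpoints (with the pool living in a region where Cognito is available) instead of using authenticate-cognito. With the AWS Load Balancer Controller v2 the annotations look roughly like this sketch (angle-bracket values are placeholders; the referenced Secret must hold the app client credentials, and key casing differs for older alb-ingress-controller versions):

metadata:
  annotations:
    alb.ingress.kubernetes.io/auth-type: oidc
    alb.ingress.kubernetes.io/auth-idp-oidc: >
      {
        "issuer": "https://cognito-idp.<region>.amazonaws.com/<userPoolId>",
        "authorizationEndpoint": "https://<domain>.auth.<region>.amazoncognito.com/oauth2/authorize",
        "tokenEndpoint": "https://<domain>.auth.<region>.amazoncognito.com/oauth2/token",
        "userInfoEndpoint": "https://<domain>.auth.<region>.amazoncognito.com/oauth2/userInfo",
        "secretName": "oidc-client-secret"
      }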

Error: secret "mlpipeline-minio-artifact"/"mysql-secret" not found in some deployments

Describe the bug
Error: secret "mlpipeline-minio-artifact" not found in the following deployments:

  • ml-pipeline-ui
  • kubeflow-pipelines-profile-controller
  • minio

Error: secret "mysql-secret" not found in the following deployments:

  • metadata-grpc-deployment
  • ml-pipeline
  • cache-server

It seems that these secrets are removed in the file disable-default-secrets.yaml in commit 6fdcc9b but are still referenced by several deployments.

Steps To Reproduce

  • Deploying with distributions/aws/examples/rds-s3 templates.
  • kubectl describe pod -n kubeflow $(kubectl get pod -n kubeflow -l app=ml-pipeline-ui -o jsonpath="{.items[0].metadata.name}")

Expected behavior
Above deployments run successfully.

Environment

  • Kubernetes version: 1.19
  • Using EKS (yes/no), if so version? Yes, EKS version: eks.7
  • AWS service targeted (S3, RDS, etc.) -> s3-rds

Screenshots
None

Additional context
I deploy with this custom kustomization.yaml, which is based on distributions/aws/examples/rds-s3 but modified to include only the services I need:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization

resources:
# Istio
- ../common/istio-1-9/istio-crds/base
- ../common/istio-1-9/istio-namespace/base
- ../common/istio-1-9/istio-install/base
# OIDC Authservice
- ../common/oidc-authservice/base
# Dex
- ../common/dex/overlays/istio
# Kubeflow namespace
- ../common/kubeflow-namespace/base
# Kubeflow Roles
- ../common/kubeflow-roles/base
# Kubeflow Istio Resources
- ../common/istio-1-9/kubeflow-istio-resources/base

# Central Dashboard
- ../apps/centraldashboard/upstream/overlays/istio
# Profiles + KFAM
- ../apps/profiles/upstream/overlays/kubeflow
# User namespace
- ../common/user-namespace/base

# Configured for AWS RDS and AWS S3

# AWS Secret Manager
- ../distributions/aws/aws-secrets-manager/base
# Kubeflow Pipelines
- ../apps/pipeline/upstream/env/aws

Deploy RDS Doc Commands Do Not Work

Describe the bug
The current doc suggests the following (see screenshot):
image
(https://github.com/awslabs/kubeflow-manifests/tree/v1.3-branch/distributions/aws/examples/rds-s3)
But when trying the commands from the "Deploy Amazon RDS MySQL" section of the doc to set up the RDS instance, the commands do not work.

Steps To Reproduce

  1. Create an EKS cluster
  2. Try the commands listed here
    (screenshots: command output and the resulting error)

Expected behavior
The first command should return the VpcId, the second the list of SubnetIds, and the third the SecurityGroupId.
The returned values should contain only what is needed, so that we can use them as-is (no parsing required from the user).

Environment

  • Kubernetes version 1.21
  • Using EKS (yes/no), if so version? yes 1.21
  • AWS service targeted (S3, RDS, etc.) RDS

Use secret manager for AWS credentials required for Minio

Possible steps:
Install the Secrets Manager CSI driver: https://docs.aws.amazon.com/secretsmanager/latest/userguide/integrating_csi_driver.html

Then define a SecretProviderClass, for example:

apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
kind: SecretProviderClass
metadata:
  name: aws-secrets
spec:
  provider: aws
  parameters:
    objects: |
      - objectName: "arn:aws:secretsmanager:us-east-2:111122223333:secret:MySecret-00AACC"
        jmesPath:
          - path: minioAccesskey
            objectAlias: accesskey
          - path: minioSecretKey
            objectAlias: secretKey
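To make these values consumable as the Kubernetes secret that the Minio/pipeline deployments already reference, the same SecretProviderClass could additionally sync them into mlpipeline-minio-artifact via secretObjects. A sketch (the sync only materializes while at least one pod mounts the CSI volume):

apiVersion: secrets-store.csi.x-k8s.io/v1alpha1
kind: SecretProviderClass
metadata:
  name: aws-secrets
  namespace: kubeflow
spec:
  provider: aws
  secretObjects:
  - secretName: mlpipeline-minio-artifact   # name expected by the pipeline deployments
    type: Opaque
    data:
    - objectName: accesskey                 # must match the objectAlias above
      key: accesskey
    - objectName: secretKey
      key: secretkey
  parameters:
    objects: |
      - objectName: "arn:aws:secretsmanager:us-east-2:111122223333:secret:MySecret-00AACC"
        jmesPath:
          - path: minioAccesskey
            objectAlias: accesskey
          - path: minioSecretKey
            objectAlias: secretKey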

@rrrkharse

Remove mysql-pv-claim from pipelines with RDS component

Describe the bug
Currently the RDS integration changes the RDS endpoint and credentials but does not remove the mysql PVC (20Gi) and possibly the mysql pod from the pipeline component. This costs the user money.

Steps To Reproduce
kustomize build awsconfigs/apps/pipeline/ -o temp

Expected behavior
As stated above, the unused resources should be removed (a delete patch similar to the Katib sketch above could be used).

Environment

  • Kubeflow 1.3 and above
  • AWS service targeted (S3, RDS, etc.) RDS

[Doc] Support S3 as source for in Kubeflow Pipelines Artifact Viewer

"Support S3 Artifact Store" section of document is outdated : https://www.kubeflow.org/docs/distributions/aws/pipeline/#support-s3-artifact-store

Outdated Doc :

Kubeflow Pipelines supports different artifact viewers. You can create files in S3 and reference them in output artifacts in your application as follows:

# Imports needed to run this excerpt; `vocab` is assumed to be defined earlier in the component.
import json
from tensorflow.python.lib.io import file_io

metadata = {
        'outputs' : [
          {
              'source': 's3://bucket/kubeflow/README.md',
              'type': 'markdown',
          },
          {
              'type': 'confusion_matrix',
              'format': 'csv',
              'schema': [
                  {'name': 'target', 'type': 'CATEGORY'},
                  {'name': 'predicted', 'type': 'CATEGORY'},
                  {'name': 'count', 'type': 'NUMBER'},
              ],
              'source': 's3://bucket/confusion_matrics.csv',
              # Convert vocab to string because for boolean values we want "True|False" to match the csv data.
              'labels': list(map(str, vocab)),
          },
          {
              'type': 'tensorboard',
              'source': 's3://bucket/tb-events',
          }
        ]
    }

with file_io.FileIO('/tmp/mlpipeline-ui-metadata.json', 'w') as f:
    json.dump(metadata, f)

KFServing Testing

We should make sure that KFServing is working well. Check whether any of the components of KFServing (DB, storage, credentials if any, etc.) can be replaced by AWS services from an integration point of view. We should also point to a standard example from the official docs that has been tested.

Metadata db is not populated after running a pipeline

Describe the bug
When using the Pipelines integration, if we specify a metadata db name that is different from the Kubeflow default of metadb, then metadata such as artifact types will still be pushed to a db named metadb.

Steps To Reproduce

  1. Specify a name different from metadb when setting up the RDS db; let's use the db name kubeflow for this example
  2. Install Kubeflow (tested using main branch + manifest v1.4.1)
  3. Notice how the database kubeflow gets created correctly
    image
    image
  4. Create a pipeline run that pushes some artifacts
    Eg:
from kfp.v2.dsl import Metrics, Model, Output, component  # imports assumed from the KFP v2 SDK

@component()
def test_comp(
    value: float,
    model: Output[Model],
    model_metrics: Output[Metrics]
):
    print(f"model_metrics.path {model_metrics.path}")
    print(f"model.path {model.path}")
        
    model_metrics.log_metric("test_loss", value)

    with open(model_metrics.path, 'w') as metrics_file:
        metrics_file.write(str(model_metrics.metadata))
    with open(model.path, 'w') as model_file:
        model_file.write("Some model data")
  5. Verify that the artifacts metadata was pushed to the kubeflow db

  6. Notice how the db kubeflow does not have a new row in Artifacts
    image

  7. Verify the table Artifacts in the metadb database instead
    image

  8. Notice how the metadata was pushed to the metadb database instead.

Expected behavior
I expected the metadata to be uploaded to the kubeflow db.

Environment

  • Kubernetes version 1.21
  • Using EKS (yes/no), if so version? yes eks.4
  • AWS service targeted (S3, RDS, etc.) RDS

Screenshots
image

Additional context
I could be wrong but I suspect that some changes are required to allow the renaming of the metadata db name.
If we take a look at upstream/apps/pipeline/upstream/base/metadata/base/metadata-grpc-deployment.yaml we see the following :

env:
- name: DBCONFIG_USER
  valueFrom:
    secretKeyRef:
      name: mysql-secret
      key: username
- name: DBCONFIG_PASSWORD
  valueFrom:
    secretKeyRef:
      name: mysql-secret
      key: password
- name: MYSQL_DATABASE
  valueFrom:
    configMapKeyRef:
      name: pipeline-install-config
      key: mlmdDb
- name: MYSQL_HOST
  valueFrom:
    configMapKeyRef:
      name: pipeline-install-config
      key: dbHost
- name: MYSQL_PORT
  valueFrom:
    configMapKeyRef:
      name: pipeline-install-config
      key: dbPort

So it looks like MYSQL_DATABASE is retrieved from upstream/apps/pipeline/upstream/base/installs/generic/pipeline-install-config.yaml.
From this ConfigMap we can see that the value is hard-coded:

mlmdDb: metadb

Maybe we'd need to add mlmdDb as a parameter in awsconfigs/apps/pipeline/params.env, along the lines of the sketch below?
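If that route is taken, the override in the AWS overlay could look roughly like this (a sketch; it assumes pipeline-install-config is produced by a configMapGenerator in the base, and that the literal would ultimately come from params.env):

configMapGenerator:
- name: pipeline-install-config
  namespace: kubeflow
  behavior: merge              # merge over the upstream generator
  literals:
  - mlmdDb=kubeflow            # hypothetical value sourced from params.env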

[e2e test] Central Dashboard

Is your feature request related to a problem? Please describe.
Test the central dashboard at a high level: login/logout, that the web apps are linked, etc.

Full scope: TBD

[v1.4-branch] Kubeflow Pipeline Does Not Work

Describe the bug
I installed Kubeflow 1.4 using v1.4-branch RDS & S3 Setup.
I tried executing the sample run [Tutorial] V2 lightweight Python components but it fails.

See Kubeflow 1.4 Progress #27

Steps To Reproduce

  1. Install Kubeflow 1.4 by following the steps described here
  2. Run the sample pipeline using default values [Tutorial] V2 lightweight Python components

Expected behavior
The pipeline succeeds.

Environment

  • Kubeflow v1.4-branch
  • Kubernetes version 1.21
  • Using EKS (yes/no), if so version? yes, platform version eks.4
  • AWS service targeted (S3, RDS, etc.) S3, RDS

Screenshots
image

Additional context

  • Pipeline Step Logs:
    logs.txt

  • kubectl get pods --all-namespaces Result :
    pods.txt

    • Notice that katib-mysql-7894994f88-tkngr is in CreateContainerConfigError
    • kubectl describe pod is showing the following error for this pod : Error: couldn't find key MYSQL_ROOT_PASSWORD in Secret kubeflow/katib-mysql-secrets
  • Some data seem to have been written to the bucket but the error logs suggest otherwise.
    image

My params.env looks like this:

dbHost=insert_the_host.us-east-2.rds.amazonaws.com

bucketName=test-bucket-name-here
minioServiceHost=s3.amazonaws.com
minioServiceRegion=us-east-2

My secret.env has the default values.

Also, even if I specify the pipeline-root in the pipeline UI, I get an error in the pipeline step logs: MissingRegion: could not find region configuration

Katib Testing

We should make sure that Katib is working well. Check whether any of the components of Katib (DB, storage, credentials, etc.) can be replaced by AWS services from an integration point of view. We should also point to a standard example from the official docs that has been tested.

How to setup LB endpoint for vanilla Kubeflow deployment

Is your feature request related to a problem? Please describe.
@AlexandreBrown on #platform-aws slack

We have the vanilla Kubeflow 1.4 manifests installed on EKS.
We used kubectl port-forward for development, and now we would like to expose Kubeflow so that we can perform inference requests from outside the cluster (without port-forward).
We are wondering which doc to use.

Describe the solution you'd like

Public endpoint to access KF dashboard with dex login

Should users be allowed to sign themselves up using Cognito hosted UI

Is your feature request related to a problem? Please describe.

Current Scenario:
The user pools created in the deployment docs require the AWS account/user pool admin to create users in Cognito with their username, email address, and a temporary password. Username is the only required parameter, whereas the email ID is mandatory for Kubeflow multi-user to work.

The sign-up page only asks for a username and password and creates a user with UNCONFIRMED account status. This user cannot log in, since the account is created in UNCONFIRMED status and requires the user pool admin to confirm the account. Moreover, the current sign-up page does not ask for an email address, so the users created via this page are not useful. The post-sign-up page also lands on the following error:
Screenshot 2022-03-07 at 2 05 48 PM

Irrespective of the above error, we need to take a step back and ask: do customers want to allow users to sign themselves up, or to strictly control the creation of user accounts?

Describe the solution you'd like
Solution 1: Disable the sign-up button in the Cognito hosted UI and let admins handle user creation. This is more secure, since the domain is hosted over the internet and customers strictly want to control who has access to the platform.
In addition, make the email ID a required attribute, and call out in the docs that it is a required user attribute and that it should appear as the email key in the user claims.

Describe alternatives you've considered
Solution 2: Keep the sign-up page and make the email address a mandatory parameter instead of the username. We would still keep the account confirmation control with admins for security reasons.

These settings are for demo purposes only, since Cognito also allows many other customizations/features such as federated login, and customers will configure the user pool according to their needs.

Kubeflow integration with AWS managed Prometheus, Grafana and CW

Is your feature request related to a problem? Please describe.
Logging and monitoring are two important aspects of an MLOps platform. The AWS distribution of Kubeflow should integrate with AWS-managed Prometheus, Grafana, and CloudWatch (CW).

Describe the solution you'd like
I believe we should order this by priority:

  1. Logging: CW and Prometheus integration
  2. Monitoring: Grafana integration.

As part of this issue, I would like to explore the viability of these options as well.

AWS distribution of Kubeflow v1.3

Feature list:

  • Cognito Integration
  • RDS integration for pipelines
  • RDS integration for Katib
  • S3 integration for pipelines
  • DLC in Jupyter Notebooks
  • Secrets manager integration for credentials
  • EFS
  • FSx
  • e2e testing framework
  • #92
  • Documentation for telemetry component #114
  • Documentation Updates to website (Currently in distributions/aws/examples)

Please see instructions on how to deploy: https://github.com/awslabs/kubeflow-manifests/tree/v1.3-branch/distributions/aws/examples

Cannot find target group nor listener when setting up load balancer

Describe the bug
I followed the steps to build a load balancer and am getting two errors from the ALB ingress controller pod:
"error"="failed to reconcile listeners due to failed to reconcile rules due to ListenerNotFound: One or more listeners not found
followed by
msg"="Reconciler error" "error"="failed to reconcile targetGroups due to failed to reconcile targetGroup tags due to TargetGroupNotFound: Target groups

Steps To Reproduce
On an EC2 instance with the minimum eksctl privileges, follow this guide: https://github.com/awslabs/kubeflow-manifests/tree/v1.3-branch/distributions/aws/examples/vanilla#connect-to-your-kubeflow-cluster
Then, to set up the load balancer:
#67 (comment)
Expected behavior
The first error I had was to do with kf-admin-eu-west-1 not having an OIDC provider URL, but that was my fault because $OIDC_PROVIDER_URL didn't export properly.
The two errors listed above occurred once I sorted that out, and the endpoint for the load balancer is inaccessible.

Environment

  • Kubernetes version: 1.19
  • Using EKS: latest
  • AWS service targeted None, vanilla but setting this up on an ec2 instance

Screenshots
Screenshot 2022-03-02 at 13 24 22

Additional context

[Component Guide] KFServing

Is your feature request related to a problem? Please describe.
Existing samples/tutorials in this repository focus on how to install Kubeflow on EKS.
The KServe docs show how to create an inference service and send requests through the ingress gateway if the user has a real DNS, etc. They do not talk about auth or how to set up a real DNS.

Describe the solution you'd like

  • E2e tutorials for users on how to use inference service in production on AWS with an ALB endpoint/custom domain and auth
  • How to use a model in S3

[TRACKING] AWS Distribution for Kubeflow v1.5

At the 02/15 Kubeflow community meeting, the manifests WG announced they are on track for the 1.5 release, with the distribution testing phase running from 02/16 to 03/08.

AWS needs to verify the following features:

  • Vanilla Kubeflow
  • Cognito Integration
  • RDS integration for pipelines
  • RDS integration for Katib
  • S3 integration for pipelines
  • DLC in Jupyter Notebooks
  • EFS
  • FSx
  • e2e tests
  • Documentation updates to website

We will be using the main branch for this

Unable to deploy notebook or tensorboard servers

Describe the bug
When trying to set up a notebook server or a TensorBoard server after deploying Kubeflow on EKS, I get the following error:
[403] Could not find CSRF cookie XSRF-TOKEN in the request.

Steps To Reproduce
Follow the Vanilla guide for v1.3 aws deployment of kubeflow

Expected behavior
The expected behaviour was to be able to deploy a notebook or TensorBoard server.

Environment

  • Kubernetes version 1.19
  • Using EKS: Latest
  • None, was deployed using an EC2 instance however

Screenshots
If applicable, add screenshots to help explain your problem.

Additional context
I assume this isn't working because the load balancer serves HTTP rather than HTTPS, and that is why it is failing. I have seen posts about similar issues where APP_SECURE_COOKIES=false (or something similar) is used, but they don't say where to put it. I assume in the env file?
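For reference, the commonly suggested workaround (assuming the HTTP-only load balancer is indeed the cause) is to set APP_SECURE_COOKIES=false on the web-app deployments, for example with a kustomize strategic-merge patch like the sketch below (deployment and container names assumed from the upstream Jupyter web app manifests; the TensorBoard web app would need the same change):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter-web-app-deployment
  namespace: kubeflow
spec:
  template:
    spec:
      containers:
      - name: jupyter-web-app
        env:
        - name: APP_SECURE_COOKIES
          value: "false"       # allow the CSRF cookie over plain HTTP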

AWS ALB Ingress controller runs into "Reconciler error"

I am trying to install Kubeflow 1.3 on an EKS 1.19 cluster using the instructions provided here. We have a shared VPC setup with 2 private subnets; each subnet has the appropriate tags and is deployed in a different AZ:

Key                                  Value
kubernetes.io/cluster/kubeflow-dev   shared
kubernetes.io/role/internal-elb      1

I configured the IAM-role-backed service account as described in the instructions as well, and I checked that the IAM role has the appropriate permissions attached to it. When the AWS ingress controller comes up, it continuously runs into the following error:

E0106 03:07:35.678637       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to build LoadBalancer configuration due to retrieval of subnets failed to resolve 2 qualified subnets. Subnets must contain the kubernetes.io/cluster/\u003ccluster name\u003e tag with a value of shared or owned and the kubernetes.io/role/elb tag signifying it should be used for ALBs Additionally, there must be at least 2 subnets with unique availability zones as required by ALBs. Either tag subnets to meet this requirement or use the subnets annotation on the ingress resource to explicitly call out what subnets to use for ALB creation. The subnets that did resolve were []"  "controller"="alb-ingress-controller" "request"={"Namespace":"istio-system","Name":"istio-ingress"}
E0106 03:07:36.717626       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to build LoadBalancer configuration due to retrieval of subnets failed to resolve 2 qualified subnets. Subnets must contain the kubernetes.io/cluster/\u003ccluster name\u003e tag with a value of shared or owned and the kubernetes.io/role/elb tag signifying it should be used for ALBs Additionally, there must be at least 2 subnets with unique availability zones as required by ALBs. Either tag subnets to meet this requirement or use the subnets annotation on the ingress resource to explicitly call out what subnets to use for ALB creation. The subnets that did resolve were []"  "controller"="alb-ingress-controller" "request"={"Namespace":"istio-system","Name":"istio-ingress"}
E0106 03:07:37.757723       1 controller.go:217] kubebuilder/controller "msg"="Reconciler error" "error"="failed to build LoadBalancer configuration due to retrieval of subnets failed to resolve 2 qualified subnets. Subnets must contain the kubernetes.io/cluster/\u003ccluster name\u003e tag with a value of shared or owned and the kubernetes.io/role/elb tag signifying it should be used for ALBs Additionally, there must be at least 2 subnets with unique availability zones as required by ALBs. Either tag subnets to meet this requirement or use the subnets annotation on the ingress resource to explicitly call out what subnets to use for ALB creation. The subnets that did resolve were []"  "controller"="alb-ingress-controller" "request"={"Namespace":"istio-system","Name":"istio-ingress"}

Here is the aws-alb-ingress-controller-config config map

apiVersion: v1
data:
  clusterName: kubeflow-dev
kind: ConfigMap
metadata:
  name: aws-alb-ingress-controller-config
  namespace: kubeflow

Here is the output of describe command on the alb-ingress-controller pod

kubectl describe pod alb-ingress-controller-7598c8999c-6lhzw -n kubeflow
Name:         alb-ingress-controller-7598c8999c-6lhzw
Namespace:    kubeflow
Priority:     0
Node:         ip-10-xxx-x-9.ec2.internal/10.xxx.x.9
Start Time:   Wed, 05 Jan 2022 22:07:18 -0500
Labels:       app=aws-alb-ingress-controller
              app.kubernetes.io/name=alb-ingress-controller
              kustomize.component=aws-alb-ingress-controller
              pod-template-hash=7598c8999c
Annotations:  kubernetes.io/psp: eks.privileged
              sidecar.istio.io/inject: false
Status:       Running
IP:           10.xxx.x.28
IPs:
  IP:           10.xxx.x.28
Controlled By:  ReplicaSet/alb-ingress-controller-7598c8999c
Containers:
  alb-ingress-controller:
    Container ID:  docker://8bf47ac7dc5857f0cde584ed4fc8ee9dbabfc09aad7c5462df6239d53bc6edd9
    Image:         docker.io/amazon/aws-alb-ingress-controller:v1.1.5
    Image ID:      docker-pullable://amazon/aws-alb-ingress-controller@sha256:e7d4a6a99b3e21eafae645d42b9d30ed6e790e322274f100fa068433e25b7be3
    Port:          <none>
    Host Port:     <none>
    Args:
      --ingress-class=alb
      --cluster-name=$(CLUSTER_NAME)
    State:          Running
      Started:      Wed, 05 Jan 2022 22:07:19 -0500
    Ready:          True
    Restart Count:  0
    Environment:
      CLUSTER_NAME:                 <set to the key 'clusterName' of config map 'aws-alb-ingress-controller-config'>  Optional: false
      AWS_DEFAULT_REGION:           us-east-1
      AWS_REGION:                   us-east-1
      AWS_ROLE_ARN:                 arn:aws:iam::xxxxxxxxxxxxxx:role/kf-admin-us-east-1-kubeflow-dev
      AWS_WEB_IDENTITY_TOKEN_FILE:  /var/run/secrets/eks.amazonaws.com/serviceaccount/token
    Mounts:
      /var/run/secrets/eks.amazonaws.com/serviceaccount from aws-iam-token (ro)
      /var/run/secrets/kubernetes.io/serviceaccount from alb-ingress-controller-token-nmdz6 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  aws-iam-token:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  86400
  alb-ingress-controller-token-nmdz6:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  alb-ingress-controller-token-nmdz6
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                 node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type    Reason     Age   From               Message
  ----    ------     ----  ----               -------
  Normal  Scheduled  21m   default-scheduler  Successfully assigned kubeflow/alb-ingress-controller-7598c8999c-6lhzw to ip-10-242-4-9.ec2.internal
  Normal  Pulling    21m   kubelet            Pulling image "docker.io/amazon/aws-alb-ingress-controller:v1.1.5"
  Normal  Pulled     21m   kubelet            Successfully pulled image "docker.io/amazon/aws-alb-ingress-controller:v1.1.5" in 134.111466ms
  Normal  Created    21m   kubelet            Created container alb-ingress-controller
  Normal  Started    21m   kubelet            Started container alb-ingress-controller

Here is the controller's deployment configuration

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations:
    deployment.kubernetes.io/revision: "1"
  creationTimestamp: "2022-01-06T02:08:35Z"
  generation: 1
  labels:
    app: aws-alb-ingress-controller
    kustomize.component: aws-alb-ingress-controller
  name: alb-ingress-controller
  namespace: kubeflow
  resourceVersion: "178004"
  selfLink: /apis/apps/v1/namespaces/kubeflow/deployments/alb-ingress-controller
  uid: xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: aws-alb-ingress-controller
      app.kubernetes.io/name: alb-ingress-controller
      kustomize.component: aws-alb-ingress-controller
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      creationTimestamp: null
      labels:
        app: aws-alb-ingress-controller
        app.kubernetes.io/name: alb-ingress-controller
        kustomize.component: aws-alb-ingress-controller
    spec:
      containers:
      - args:
        - --ingress-class=alb
        - --cluster-name=$(CLUSTER_NAME)
        env:
        - name: CLUSTER_NAME
          valueFrom:
            configMapKeyRef:
              key: clusterName
              name: aws-alb-ingress-controller-config
        image: docker.io/amazon/aws-alb-ingress-controller:v1.1.5
        imagePullPolicy: Always
        name: alb-ingress-controller
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      serviceAccount: alb-ingress-controller
      serviceAccountName: alb-ingress-controller
      terminationGracePeriodSeconds: 30

Can someone help me figure out what is going on here?

Create CFN template for resources required by AWS Cognito

Enhance user experience

  • Creating certificates requires us-east-1 resources to be created before the resources for the deployment region
  • Cognito configuration requires creating users, an app client, etc., and configuring user pool domains

The process has several steps and is error-prone. CFN might be able to automate this process and reduce user error.

Create a master branch for AWS Distro Development

Background

The current development process followed for AWS-Kubeflow-distro-1.3.1 was to take a release branch from the manifests repository as the base (a.k.a. upstream) and make AWS-distribution-specific changes on the same branch. This approach presented the fastest way to get development going, considering the timeline, the kfctl tool getting deprecated, and vendors being required to set up their own repositories to host distribution-specific manifests.

Problem

However, the above approach presents several issues:

  1. Working on a new release:
    1. Every release requires a developer to create a copy of the upstream KF release branch and copy/cherry-pick changes from the previous awslabs release branch (e.g. v1.3-branch) to the new branch (e.g. v1.4-branch), which is cumbersome and error prone (e.g. #75)
    2. These release branches on upstream are not frozen because they are used for any patch releases. Let’s take an example for v1.4.0 and v1.4.1. Every patch release will require us to either:
      1. Create a new branch(e.g. v1.4.1-branch) from the tag/v1.4.1 from upstream and repeat the process in 1.a.
      2. Cherry pick commits related to patch release from upstream and create a PR to awslabs repo
      3. Rebase the awslabs/v1.4-branch from upstream/v1.4-branch and force push which is unacceptable
  2. Maintenance issues unrelated to the code owned by the team
    1. Security vulnerabilities for test dependencies coming from upstream
    2. Modifying the root pkg READMEs on each release branch (e.g. see README changes)
  3. AWS release version is coupled with Kubeflow version
    1. With this approach, we need to maintain 2 release versions, one from Kubeflow and another from AWS, because we may have to do patch releases or feature enhancements, e.g. aws-manifests-1.x-Kubeflow-1.3.1
  4. Other issues:
    1. PR build and test infra (when we have one) will need to be constantly updated to point to different branches instead of one master branch. Similarly, GitHub issue templates need to be ported
    2. Default branch on the repository will keep changing
    3. Might lose the history related to the PRs between releases if changes are copied instead of cherry-picked

Suggested Approach

The majority of the overlays do not change unless there are major changes in the upstream component. Create a base with AWS overlays that does not change between versions, and work only on the delta per release.

Development - Build and Test

Create a master branch to host only AWS specific overlays, tests and docs. e.g.

ubuntu@ip-172-31-0-119:~/kubeflow-manifests$ tree -L 3
.
├── README.md
├── awsconfig
│   ├── apps
│   │   ├── jupyter
│   │   ├── katib
│   │   └── pipelines
│   ├── common
│   │   ├── alb-controller
│   │   ├── istio-envoy-filter
│   │   └── istio-ingress
│   └── infra_configs
├── docs
│   ├── cognito
│   ├── rds
│   ├── s3
│   └── storage
├── tests
│   ├── canary
│   ├── e2e
│   └── unit

(Customer experience) This will introduce one additional step for users after cloning the awslabs repo.

export KUBEFLOW_RELEASE_VERSION=v1.4.1
git clone --branch ${KUBEFLOW_RELEASE_VERSION} https://github.com/kubeflow/manifests.git upstream

Once the user has cloned a tagged branch from the upstream manifests (kubeflow/manifests repo), they will continue with setting up AWS resources or running kustomize build and apply, just like today.

For this, we need to isolate the AWS-specific changes, in the form of overlays, into a separate directory.

  • I have already completed a POC for Jupyter, Cognito, RDS, and S3 in the v1.4-branch. There should be no changes for EFS and FSx since those changes are self-contained. The only piece left is the secrets-manager integration, which is also doable

Releases

New releases (major and patch from upstream): Cut a release once a Kubeflow release has been validated with updated README.

Patch for an old release (depends on support policy) where the latest overlays are incompatible: use the existing release tag branch to do a patch release or backport features for older releases.

  • Follow up: Will this base become outdated between several releases?
    • The intent of this development strategy is not to support multiple versions using the latest build. Making changes to a previous release will take the same amount of time, i.e. go to the release branch and make the changes.

How does this solve the problems stated earlier?

We take out the problems related to PR build/test, copying changes over from old release branches, and modifying parent READMEs by creating this base package with only AWS manifests instead of using a Kubeflow release branch as the base. Contributors only need to work on the delta per release. Problems related to security vulnerabilities in unrelated packages are handled by cloning the upstream only at build time.

Patch releases just become a small PR that changes a version in a config file and re-runs the tests using the PR build infra.

The release-versioning problem is not completely solved, since it also depends on the upstream, but we are able to cut a new release for this package independently that states the compatible version of Kubeflow (TBD, see below).

Next steps/TODOs

  • How to document/convey the supported upstream version to users?
  • What is the release versioning for this package?
  • Use the main branch to get started with this

Please comment your thoughts/feedback

Replace Minio by S3

Is your feature request related to a problem? Please describe.
We want to get rid of Minio completely from Kubeflow and use S3.

Describe the solution you'd like
Currently Minio acts as a gateway to S3; one solution is to use the S3 controller developed by ACK to act as the gateway for S3.

Describe alternatives you've considered
N/A

Additional context
N/A
