awslabs / aws-orbit-workbench

A Data Platform built for AWS, powered by Kubernetes.

Home Page: https://awslabs.github.io/aws-orbit-workbench/

License: Apache License 2.0

Python 83.69% Shell 5.03% Dockerfile 0.76% JavaScript 0.07% HTML 0.56% CSS 0.22% Jupyter Notebook 0.63% TypeScript 7.66% Smarty 1.20% Mustache 0.16%
kubernetes eks jupyter jupyterhub data-analysis datalake mach orbit-workbench gpu eks-cluster

aws-orbit-workbench's Introduction


AWS Orbit Workbench is currently archived and is accessible in read-only mode.

Orbit Workbench is an open framework for building team-based, secured data environments. It is built on Kubernetes using Amazon Elastic Kubernetes Service (EKS), and provides a command line tool for rapid deployment as well as a Python SDK, Jupyter plugins, and more to accelerate data analysis and ML through integration with AWS analytics services such as Amazon Redshift, Amazon Athena, Amazon EMR, and Amazon SageMaker.

Orbit Workbench deploys secured team spaces that are mapped to Kubernetes namespaces and extend into AWS cloud resources. Each team is a secured zone where only members of the team can access allowed data and share data and code freely within the team. Orbit automatically creates file storage for each team using Amazon EFS, a security group and an IAM role per team, as well as each team's own JupyterHub and Jupyter Server. Orbit Workbench users can also launch Python code or Jupyter notebooks as Kubernetes containers or as Amazon Fargate containers. Orbit Workbench provides a CLI tool for users to build their own custom images and use them to deploy containers or customize their Jupyter environment.

Orbit easily supports GPU-based algorithms: it pre-configures EKS to allow GPU workloads and provides examples of how to build images that support GPU acceleration.

If you are looking to build your own data & ML platform for your company on AWS, give Orbit Workbench a chance to accelerate your business outcomes using AWS services.

Contributors are welcome!

Please see our Home for installation and usage guides.

Feature List**

  • Collaborative Team Spaces

    • Isolated Team spaces with pre-provisioned access to data sources
    • Team isolation enforced by EKS Namespace as well as AWS constructs such as security groups and roles
    • Shared file system between team users via EFS
    • A scratch space on S3, shared only within your team, with a defined expiration time
    • Jupyter plugin to support users with Kubernetes (Jobs, Pods, Volumes, and more) and AWS resources (Athena, Redshift, Glue Catalog, S3, and more)
  • Compute

    • Build your own Docker images using the Orbit CLI via remote AWS CodeBuild, published to an ECR repository
    • Support for GPU node pools
    • Support for Docker images with GPU drivers for use with PyTorch, TensorFlow, and others
    • Shared node pools for all teams with storage isolation
    • Auto-Scaling EKS Node pools (coming soon)
  • Security

    • JupyterHub integration with SSO providers via Amazon Cognito
    • Ability to map an SSO group to a team to control authentication
  • Deployment

    • Deployment of all AWS and EKS resources via a simple declarative manifest
    • Ability to add and remove teams dynamically
    • Support for Kubernetes Administrative Teams
  • AWS Analytics Services Integrations (see the sketch after this list)

    • Amazon Redshift
    • Amazon SageMaker API calls and Kubernetes Operator
    • Amazon EMR on EKS Kubernetes Operator
    • Amazon Athena
    • AWS Glue DataBrew
    • AWS Lake Formation
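
To make these integrations concrete, here is a minimal sketch of querying Amazon Athena from a team notebook with boto3; the region and output bucket are illustrative placeholders, not values defined by this project:

import time

import boto3

# Region and output bucket below are placeholders for illustration only.
athena = boto3.client("athena", region_name="us-east-1")

resp = athena.start_query_execution(
    QueryString="SELECT 1",
    ResultConfiguration={"OutputLocation": "s3://my-team-scratch-bucket/athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    q = athena.get_query_execution(QueryExecutionId=query_id)
    state = q["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)
print(query_id, state)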

Create an AWS Orbit Workbench trial environment

Feel free to create a full AWS Orbit Workbench environment in its own VPC.
You can always clone or fork this repo and install via the CLI, but if you are just investigating the Workbench, we have provided a standard deployment.

Please follow these steps.

1. Create the AWS Orbit Workbench

Deploy  Region Name               Region
🚀      US East (N. Virginia)     us-east-1
🚀      US East (Ohio)            us-east-2
🚀      US West (N. California)   us-west-1
🚀      US West (Oregon)          us-west-2
🚀      EU (London)               eu-west-2

This reference deployment can only be deployed to Regions denoted above.

The CloudFormation template has all the necessary parameters, but you may change them as needed (a scripted alternative follows the list below):

  • CloudFormation Parameters

    • Version: The version of Orbit Workbench (corresponds to the versions of aws-orbit on PyPI)
    • K8AdminRole: An existing role in your account that has admin access to the EKS cluster
  • The CloudFormation stack will create two (2) AWS CodePipelines:

    • Orbit_Deploy_trial - which will start automatically and create your workbench
    • Orbit_Destroy_trial - which will also start automatically and will destroy your workbench
      • this pipeline has a Manual Approval stage that prevents it from moving forward with the destroy process
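
If you prefer to script the stack creation instead of using the console, a sketch with boto3 follows; the TemplateURL, version, and role name are placeholders you must replace with real values:

import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

cfn.create_stack(
    StackName="trial",
    # Hypothetical URL -- substitute the actual trial template location.
    TemplateURL="https://example-bucket.s3.amazonaws.com/orbit-trial.yaml",
    Parameters=[
        {"ParameterKey": "Version", "ParameterValue": "1.0.0"},      # aws-orbit version on PyPI
        {"ParameterKey": "K8AdminRole", "ParameterValue": "Admin"},  # existing EKS-admin role
    ],
    Capabilities=["CAPABILITY_NAMED_IAM"],
)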

Once your pipelines are created, the Orbit_Destroy_trial pipeline will wait for you to approve the next stage (which we don't want to do yet).

Go to the Orbit_Destroy_trial pipeline, click Stop Execution, then Stop and Abandon. Abandoning the execution prevents the job from timing out and stopping at a later time.
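
The same stop-and-abandon action can be scripted; a sketch with boto3, assuming default credentials with CodePipeline permissions:

import boto3

cp = boto3.client("codepipeline")

# Find the in-progress execution of the destroy pipeline and abandon it.
execs = cp.list_pipeline_executions(pipelineName="Orbit_Destroy_trial")
for summary in execs["pipelineExecutionSummaries"]:
    if summary["status"] == "InProgress":
        cp.stop_pipeline_execution(
            pipelineName="Orbit_Destroy_trial",
            pipelineExecutionId=summary["pipelineExecutionId"],
            abandon=True,
            reason="Keep the trial workbench running",
        )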

The Orbit_Deploy_trial pipeline takes approximately 70-90 minutes to complete.

2. Get your access URL

When the Orbit_Deploy_trial pipeline completes, go to the EC2 page --> Load Balancing --> Load Balancers and look for the ALB we have created. It has a naming pattern of xxxxxxxx-istiosystem-istio-xxxx. Get the DNS name of the ALB.

The AWS Orbit Workbench homepage will be located at:

https://xxxxxxxx-istiosystem-istio-xxxx-1234567890.{region}.elb.amazonaws.com/orbit/login
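
If you would rather look the DNS name up programmatically, here is a sketch with boto3 that filters on the naming pattern above:

import boto3

elbv2 = boto3.client("elbv2")

# Print the login URL of any ALB matching the istio-system naming pattern.
for lb in elbv2.describe_load_balancers()["LoadBalancers"]:
    if "istiosystem" in lb["LoadBalancerName"]:
        print(f'https://{lb["DNSName"]}/orbit/login')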

You can browse to that URL. We are using self-signed certificates, so your browser may complain, but it is safe to Accept and Continue to the site.

The default username and password are:

Username: orbit
Password: OrbitPwd1!

You will be prompted to change the password.

Cleaning up the example resources

To remove all workbench resources, do the following:

  1. Go to the Orbit_Destroy_trial pipeline and click 'Release Change'
    • When the CLI_ApproveDestroy stage is active, click Review and then Approve so the pipeline will continue
  2. Wait until Orbit_Destroy_trial completes
  3. Delete the CloudFormation stack trial (a scripted version follows)
    • If the stack fails to delete due to objects left in the S3 bucket, it is OK to empty the bucket and delete the stack again
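
A minimal sketch of step 3 and its bucket cleanup with boto3; the bucket name is a placeholder for whichever bucket blocks the delete:

import boto3

# Empty the offending bucket (placeholder name), then retry the stack delete.
bucket = boto3.resource("s3").Bucket("my-orbit-trial-bucket")
bucket.object_versions.delete()  # also removes versioned objects
boto3.client("cloudformation").delete_stack(StackName="trial")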

Contributing

Contributing Guidelines: ./CONTRIBUTING.md

License

This project is licensed under the Apache-2.0 License.

**: For a detailed feature list by release, please see the release page in the wiki tab.

aws-orbit-workbench's People

Contributors

abaror, bdesert, benalta, chamcca, dependabot[bot], dgraeber, igorborgest, nickcorbett, rb201, srinivasreddych, stthoom


aws-orbit-workbench's Issues

ISSUE-70 Inconsistency in Deploy Mode

Repository Modes are inconsistent.
cli/datamaker_cli/docker.py:95 refers to a mode as "source",
while
cli/datamaker_cli/commands/deploy.py:49 refers to a mode as "code".

These need to be aligned.

Error: No such command 'env'.

bash-4.2$ orbit init
[ Info ] Env Manifest generated into conf folder  
[ Tip ] Recommended next step: orbit deploy foundation -f default-foundation.yaml
[ Tip ] Then, fill up the manifest file (default-env-manifest.yaml) and run: orbit env -f default-env-manifest.yaml
                                                  
Initializing |████████████████████████████| 100%
bash-4.2$ orbit env -f default-env-manifest.yaml
Usage: orbit [OPTIONS] COMMAND [ARGS]...
Try 'orbit --help' for help.

Error: No such command 'env'.

Same thing for ECS?

Hi,
Are there any plans to implement the same solution for deployment on ECS? EKS/K8s may be popular, but ECS is AWS's own container orchestrator, so why is it not treated with at least as much priority as EKS?
I read about the Jupyter plugin, so there may already be something there for K8s, but what about AWS providing an ECS plugin?
Thanks,

Update profiles with an input file

Currently one can only append or delete one profile at a time, which is inconvenient when modifying multiple profiles.
An option to overwrite the current profiles with an input file containing a list of profiles would make modification easier.

Primary Manifest too large for Parameter Store

The primary manifest stored in Parameter Store contains the manifests for all teams as well as the original raw manifest. When there are more than two teams, this results in a JSON document larger than the 8192 characters supported by advanced Parameter Store values.
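
One way to surface this limit early, a sketch assuming the CLI holds the assembled manifest as a dict (check_manifest_size is a hypothetical helper, not existing code):

import json

ADVANCED_PARAM_LIMIT = 8192  # max characters for an SSM advanced-tier value

def check_manifest_size(manifest: dict) -> str:
    """Fail fast before put_parameter rejects an oversized manifest."""
    body = json.dumps(manifest)
    if len(body) > ADVANCED_PARAM_LIMIT:
        raise ValueError(
            f"Primary manifest is {len(body)} characters; "
            "store per-team manifests as separate parameters instead."
        )
    return body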

ISSUE-125 Broken SDK renaming

Describe the Issue:

The rename script renamed the package to orbit_sdk instead of aws_orbit_sdk.

To Reproduce:

N/A

Additional Context:

Open controller.py.
The line reads from orbit_sdk.common import ...
It is supposed to be from aws_orbit_sdk.common import ...

ISSUE-104 hardcoded toolkit bucket

Describe the Issue:

Example 1 in Lake Creator has a hardcoded environment name in the toolkit bucket, so Example 1 fails to execute.

To Reproduce:

  1. Run samples/notebooks/A-LakeCreator/Example-1-Build-Lake.ipynb up to the dm_s3_bucket...
  2. It fails on the SSM key.

Additional Context:

The environment name should be taken from the workspace, and the SSM key should be parameterized:

import json
import boto3
ssm = boto3.client("ssm")
dm_s3_bucket = json.loads(ssm.get_parameter(
    Name=f'/datamaker/{workspace.get("env_name")}/manifest'
)['Parameter']['Value'])['toolkit-s3-bucket']

CDK Upgrades

Want to bring light to a larger question on how version bumps should be orchestrated with https://github.com/aws/aws-cdk

Currently, the Orbit CLI uses 1.67, and the CDK is moving quickly, with its latest release hitting 1.95 last week.

It's not obvious whether to version bump for the latest L2 CDK constructs, even though they are very helpful for future development. The CDK team makes it clear these minor versions will contain breaking changes.

I'm very new to CDK, so I'm still learning best practices, but would it be helpful to have test coverage validating the various CDK constructs? From my research, I see plenty of resources around testing CDK with TypeScript, but not a lot with Python. 🤔

Genesis for this was the unsupported UserPoolClient construct in CDK 1.67 #295 (comment)

Tensorboard connection timeout

TensorBoard is not able to connect. We assume there is some port we need to open in order to make TensorBoard work.

[Screenshot: TensorBoard connection timeout, 2021-04-19 10:49 AM]

ISSUE-57 ECS Containers cannot start from notebook due to the role

Specify a Type of Issue:

BUG

Describe the Issue:

controller.run_notebooks isn't working due to the missing service-linked role for ECS.

To Reproduce:

Run in a notebook:

from datamaker_sdk import controller

def run_file():
    notebooks = []
    notebook = {
        "notebookName": "Test-Container.ipynb",
        "sourcePath": "notebooks/input",
        "targetPath": "notebooks/output",
        "params": {
            # "bucketName": bucket_name,
        },
    }
    notebooks.append(notebook)

    notebooksToRun = {
        "compute": {
            "container": {
                "p_concurrent": "10"
            }
        },
        "tasks": notebooks,
    }
    containers = controller.run_notebooks(notebooksToRun)
    print(containers)
    controller.wait_for_tasks_to_complete(containers, 60, 10, False)

run_file()

Additional Context:

A workaround is available. Run in a terminal:

aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com
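
The same workaround in Python, using the equivalent boto3 call:

import boto3

# Create the ECS service-linked role that the notebook controller needs.
boto3.client("iam").create_service_linked_role(AWSServiceName="ecs.amazonaws.com")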

[Feature] Ability to monitor usage

The administrator should be able to view the jobs each user launches and the resources each user is using.
It would be even better if there were stats or graphs recording all of the usage.

Orbit destroy won't remove pypi upstream

Currently, destroying the Orbit env and foundation won't remove the PyPI upstream, which can be annoying and confusing for the user.
Solution: restore ~/.config/pip/pip.conf while destroying the foundation.
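
A sketch of that solution, assuming the deploy step had saved a backup alongside the file (the pip.conf.bak name is hypothetical):

from pathlib import Path

pip_conf = Path.home() / ".config" / "pip" / "pip.conf"
backup = pip_conf.parent / "pip.conf.bak"  # hypothetical backup written at deploy time

if backup.exists():
    backup.replace(pip_conf)   # restore the pre-deploy config
elif pip_conf.exists():
    pip_conf.unlink()          # otherwise just drop the upstream override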

Keep installed pip packages even when notebook is terminated

Issue:
Currently, installed pip packages are gone once the notebook is terminated. This is an issue because the notebook is terminated when it's not active, and users then have to reinstall everything again.

Tried Solution:
We tried changing the default pip installation path to a folder on EFS by setting environment variables in the Dockerfile, so that packages would not be removed. However, it looks like the folder on EFS gets recreated every time the notebook boots up. Below is a snippet of the code we used:

# Customize pip installation location
RUN mkdir -p /home/jovyan/private/site_packages
ENV PIP_TARGET=/home/jovyan/private/site_packages
ENV PYTHONPATH=$PYTHONPATH:/home/jovyan/private/site_packages

Question:
Isn't everything in private permanent and preserved across launches? Or is there some special setting in JupyterHub that prevents us from doing this? Also, is there a better solution to address this pip package issue?

ISSUE-81 Containers API is broken

Describe the Issue:

When running notebooks in containers using the controller API, they won't run.

To Reproduce:

Create a configuration to execute in a container.
Run the controller.run_notebooks(notebooksToRun) API.

Additional Context:

The controller API and notebook_runner seem to be broken.

ISSUE-103 Template for Issues isn't working

Describe the Issue:

Issue template isn't working.

To Reproduce:

  1. Click on Create new Issue - the body is empty

Additional Context:

Need to introduce a config.yml file with default issue template(s).

ISSUE-51 Add sample notebooks - Lake Creator

Specify a Type of Issue:

FEATURE

Describe the Issue:

Need to add example notebooks that show how the APIs are used. Use containers to run nested notebooks.

To Reproduce:

N/A

Additional Context:

N/A

[Feature] User should be able to reattach and delete their ebs

  1. Currently, an EBS volume is attached to a server based on its name. It would be more useful if users could choose which EBS volume they want to attach.
  2. Users should be able to delete their EBS volumes when they don't need them anymore, or some garbage-collection logic for EBS is needed. Otherwise, one can keep starting servers with different names and end up with a bunch of EBS volumes.

ISSUE-55 missing tools for notebooks

Specify a Type of Issue:

FEATURE

Describe the Issue:

missing tools for notebooks:

  • zip and unzip

To Reproduce:

  1. open landing page
  2. create a server
  3. open terminal
  4. type in command line: zip or unzip

Additional Context:

Workaround:
apt-get install zip unzip

if no sudo access, then:

  • mkdir -p $HOME/pkgs && cd $HOME/pkgs
  • apt-get download zip unzip
  • for f in *zip*.deb; do dpkg -x "$f" "$HOME/pkgs"; done
  • export PATH=$PATH:$HOME/pkgs/usr/bin

Note, this will not make those packages available in notebooks.

[FEATURE] Ray HPO Tune Integration

We've managed to run an example Ray cluster and HPO on Kubernetes (a minimal Tune sketch follows the steps). Here are the steps:

  1. Connect local kubectl to the eks cluster
  2. Pip install ray locally: pip install ray
  3. Clone ray repo: git clone https://github.com/ray-project/ray.git
  4. Launch up a ray cluster by: ray up ray/python/ray/autoscaler/kubernetes/example-full.yaml
  5. Check the cluster got launched by: kubectl -n ray get pods
  6. To create a sample job: kubectl create -f ray/doc/kubernetes/job-example.yaml
    • You can modify the script that the YAML file downloads in order to run a different script
  7. To check result: kubectl -n ray logs <launched job pod>
  8. To tear down the cluster: ray down ray/python/ray/autoscaler/kubernetes/example-full.yaml
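
Once the cluster is up, a minimal Tune sweep can serve as a smoke test; this sketch uses the Ray 1.x-era API (tune.report/tune.run) current when this issue was filed:

from ray import tune

def objective(config):
    # Toy objective: minimize x^2 over the grid below.
    tune.report(score=config["x"] ** 2)

analysis = tune.run(objective, config={"x": tune.grid_search([1, 2, 3])})
print(analysis.get_best_config(metric="score", mode="min"))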

Create run demo page on wiki

Have a page that starts from deployment of the sample manifest with lake-creator and lake-user, then downloads the demo data, runs the lake-creator notebook, and then runs the user regression. Document it step by step.

Fix logout on kubeflow

When user needs to terminate his session, we need to call this url:
/logout?response_type=code&client_id=&redirect_uri=&state=STATE&scope=openid+profile+aws.cognito.signin.user.admin

We need to create small service with html page offering the link to click on to terminate session and redirect them into a new login screen.

the small web server will use something like this form the redirect users into a new login.

<html>
<head>
    <title>This website has moved</title>
    <meta http-equiv="refresh" content="1;url=<cognito_ep>/logout?response_type=code&client_id=<clientid>&redirect_uri=<ingressalb>&state=STATE&scope=openid+profile+aws.cognito.signin.user.admin">
    <meta name="robot" content="noindex,follow">
</head>
<body>
Your session has been terminated, you will be redirected to a new login page now.
</body>
</html>
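
A sketch of the small web server itself, using only the Python standard library and assuming the HTML above was saved as logout.html:

from http.server import BaseHTTPRequestHandler, HTTPServer

PAGE = open("logout.html", "rb").read()  # the redirect page above

class LogoutHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Serve the redirect page for every request.
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(PAGE)

HTTPServer(("", 8080), LogoutHandler).serve_forever()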

ISSUE-58 Incorrect Output Dir in Container

Specify a Type of Issue:

BUG

Describe the Issue:

When a task is defined as follows:

      "notebookName": "Test-Container.ipynb",
      "sourcePath": "private/notebooks/input",
      "targetPath": "private/notebooks/output",

the input path resolves to:
/home/jovyan/private/notebooks/input/Test-Container.ipynb
but the output path resolves to:
/home/jovyan/private/outputs/private/notebooks/output/Test-Container/e1@20201119-15:06.ipynb
instead of:
/home/jovyan/private/notebooks/output/Test-Container/e1@20201119-15:06.ipynb

To Reproduce:

Run notebooks using the controller API as defined above.

Additional Context:

Recommend removing the hard-coded output prefix and using it only as a default when no explicit targetPath is provided (sketched below).
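
A minimal sketch of that recommendation; resolve_target and its parameters are hypothetical names for illustration:

from pathlib import Path
from typing import Optional

DEFAULT_OUTPUT_ROOT = Path("/home/jovyan/private/outputs")

def resolve_target(home: Path, target_path: Optional[str], notebook: str, run_id: str) -> Path:
    # Honor an explicit targetPath; fall back to the hard-coded root only by default.
    base = home / target_path if target_path else DEFAULT_OUTPUT_ROOT
    return base / Path(notebook).stem / f"{run_id}.ipynb"

# resolve_target(Path("/home/jovyan"), "private/notebooks/output",
#                "Test-Container.ipynb", "e1@20201119-15:06")
# -> /home/jovyan/private/notebooks/output/Test-Container/e1@20201119-15:06.ipynb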
