awslabs / aws-orbit-workbench Goto Github PK
View Code? Open in Web Editor NEW
A Data Platform built for AWS, powered by Kubernetes.
Home Page: https://awslabs.github.io/aws-orbit-workbench/
License: Apache License 2.0
The primary manifest stored in SSM Parameter Store contains the manifests for all teams as well as the original raw manifest. When there are more than two teams, this results in a JSON document larger than the 8,192 characters supported by Advanced tier Parameter Store values.
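One possible mitigation (a sketch, not the project's actual fix) is to split the serialized manifest across several SSM parameters and reassemble on read; `chunk_manifest` and `reassemble_manifest` are hypothetical helpers:

```python
import json

SSM_VALUE_LIMIT = 8192  # max characters of an Advanced tier parameter value

def chunk_manifest(manifest: dict, limit: int = SSM_VALUE_LIMIT) -> list:
    """Split a serialized manifest into SSM-sized chunks.

    Each chunk would be stored under a numbered key, e.g.
    /orbit/<env>/manifest/0, /orbit/<env>/manifest/1, ...
    """
    raw = json.dumps(manifest)
    return [raw[i:i + limit] for i in range(0, len(raw), limit)]

def reassemble_manifest(chunks: list) -> dict:
    """Concatenate the chunks in order and parse the JSON back."""
    return json.loads("".join(chunks))
```

Reads would list the numbered parameters, sort by suffix, and join before parsing.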
Removed the domain name from the username attribute while creating/searching for EKS job labels.
orbit deploy foundation -n dev --no-internet-accessibility
is causing this error from the cloudformation logs:
The VPC endpoint service com.amazonaws.us-east-1.lambda does not support the availability zone of the subnet: subnet-XXX. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidParameter; Request ID: XXX; Proxy: null)
Perhaps we can try this:
Issue:
Pip packages installed in a notebook are lost when the notebook is terminated. This is a problem because inactive notebooks are terminated automatically, and users then have to reinstall everything again.
Tried Solution:
We tried changing the default pip installation path to a folder on EFS by setting environment variables in the Dockerfile, so that packages would not be removed. However, it looks like the EFS folder gets recreated every time the notebook boots up. Below is the snippet of code we used:
# Customize pip installation location
RUN mkdir -p /home/jovyan/private/site_packages
ENV PIP_TARGET=/home/jovyan/private/site_packages
ENV PYTHONPATH=$PYTHONPATH:/home/jovyan/private/site_packages
Question:
Isn't everything under private permanent, surviving each relaunch? Or is there some JupyterHub setting that prevents this? Also, is there a better way to address this pip package issue?
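One alternative to baking `PIP_TARGET` into the Dockerfile is to register the persistent directory at runtime from a startup hook, since environment variables set at build time can be clobbered when the home directory is mounted. A minimal sketch, assuming the EFS mount lives at /home/jovyan/private (the demo uses a temp dir so it runs anywhere):

```python
import os
import site
import sys
import tempfile

def add_persistent_packages(pkg_dir: str) -> None:
    """Make a persistent (e.g. EFS-backed) package directory importable."""
    os.makedirs(pkg_dir, exist_ok=True)
    site.addsitedir(pkg_dir)  # appends to sys.path and honors any .pth files

# Demo: a temp dir stands in for /home/jovyan/private/site_packages
pkg_dir = os.path.join(tempfile.mkdtemp(), "site_packages")
add_persistent_packages(pkg_dir)

# Simulate a package that survived a restart on the persistent volume
with open(os.path.join(pkg_dir, "persisted_mod.py"), "w") as f:
    f.write("VALUE = 42\n")

import persisted_mod  # resolvable because pkg_dir is now on sys.path
```

In practice users would `pip install --target /home/jovyan/private/site_packages <pkg>` once, and the startup hook would call `add_persistent_packages` on every boot.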
The administrator should be able to view the jobs each user launches and the resources each user is consuming.
It would be even better to have stats or graphs recording overall usage.
BUG
controller.run_notebooks isn't working due to the missing service-linked role for ECS.
Run in a notebook:
from datamaker_sdk import controller

def run_file():
    notebooks = []
    notebook = {
        "notebookName": "Test-Container.ipynb",
        "sourcePath": "notebooks/input",
        "targetPath": "notebooks/output",
        "params": {
            # "bucketName": bucket_name,
        },
    }
    notebooks.append(notebook)
    notebooksToRun = {
        "compute": {
            "container": {
                "p_concurrent": "10"
            }
        },
        "tasks": notebooks,
    }
    containers = controller.run_notebooks(notebooksToRun)
    print(containers)
    controller.wait_for_tasks_to_complete(containers, 60, 10, False)

run_file()
workaround available:
run in a terminal:
aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com
Also clean JL UI refresh buttons.
BUG
When a task is defined as follows:
"notebookName": "Test-Container.ipynb",
"sourcePath": "private/notebooks/input",
"targetPath": "private/notebooks/output",
the input path resolves to:
/home/jovyan/private/notebooks/input/Test-Container.ipynb
but the output path resolves to:
/home/jovyan/private/outputs/private/notebooks/output/Test-Container/e1@20201119-15:06.ipynb
instead of:
/home/jovyan/private/notebooks/output/Test-Container/e1@20201119-15:06.ipynb
To reproduce, run notebooks using the controller API as defined above.
Recommendation: remove the hardcoded output prefix and use it only as a default when no explicit targetPath
is provided.
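The recommended behavior could be sketched like this (function and constant names are hypothetical; the default root mirrors the hardcoded path observed above):

```python
import os

HOME = "/home/jovyan"
DEFAULT_OUTPUT_ROOT = os.path.join(HOME, "private", "outputs")  # current hardcoded root

def resolve_output_dir(task: dict) -> str:
    """Honor an explicit targetPath verbatim; fall back to the default root otherwise."""
    target = task.get("targetPath")
    if target:
        # Explicit target: resolve relative to HOME only, no extra prefix
        return os.path.join(HOME, target)
    # No target given: use the hardcoded outputs root as the default
    stem = task["notebookName"].replace(".ipynb", "")
    return os.path.join(DEFAULT_OUTPUT_ROOT, stem)

task = {"notebookName": "Test-Container.ipynb", "targetPath": "private/notebooks/output"}
print(resolve_output_dir(task))  # /home/jovyan/private/notebooks/output
```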
Added conditional logic to fetch Glue table/view location details in the Orbit SDK, adding "" in case of a view without a location attribute.
bash-4.2$ orbit init
[ Info ] Env Manifest generated into conf folder
[ Tip ] Recommended next step: orbit deploy foundation -f default-foundation.yaml
[ Tip ] Then, fill up the manifest file (default-env-manifest.yaml) and run: orbit env -f default-env-manifest.yaml
Initializing |█████████████████████████████| 100%
bash-4.2$ orbit env -f default-env-manifest.yaml
Usage: orbit [OPTIONS] COMMAND [ARGS]...
Try 'orbit --help' for help.
Error: No such command 'env'.
Repository modes are inconsistent:
cli/datamaker_cli/docker.py:95 refers to a mode as "source",
while
cli/datamaker_cli/commands/deploy.py:49 refers to a mode as "code".
These need to be aligned.
When running notebooks in containers using the controller API, the run fails.
To reproduce: create a configuration that executes in a container, then run controller.run_notebooks(notebooksToRun).
The controller API and notebook_runner seem to be broken.
The issue template isn't working.
We need to introduce a config.yml file with default issue template(s).
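A minimal `.github/ISSUE_TEMPLATE/config.yml` could look like this (a sketch; the Discussions link target is an assumption):

```yaml
blank_issues_enabled: false
contact_links:
  - name: Question / Discussion
    url: https://github.com/awslabs/aws-orbit-workbench/discussions
    about: Ask questions here instead of opening an issue.
```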
See https://docs.aws.amazon.com/sagemaker/latest/dg/amazon-sagemaker-operators-for-kubernetes.html
Notice that there is a Helm deployment option toward the end.
We've managed to run an example Ray cluster and HPO on Kubernetes. Here are the steps:
pip install ray
git clone https://github.com/ray-project/ray.git
ray up ray/python/ray/autoscaler/kubernetes/example-full.yaml
kubectl -n ray get pods
kubectl create -f ray/doc/kubernetes/job-example.yaml
kubectl -n ray logs <launched job pod>
ray down ray/python/ray/autoscaler/kubernetes/example-full.yaml
Have a page that starts from deployment of the sample manifest with lake-creator and lake-user, then downloads the demo data, runs the lake-creator notebook, and finally runs the user regression. Document it step by step.
Currently, destroying the Orbit env and foundation does not remove the PyPI upstream configuration, which can be annoying and confuses the user.
Solution: restore ~/.config/pip/pip.conf while destroying the foundation.
Known pattern - aws/serverless-application-model#200 (comment)
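The proposed restore could be sketched like this (function and backup-file names are hypothetical; paths are passed in so the demo runs against a temp dir instead of the real ~/.config/pip):

```python
import os
import shutil
import tempfile

def backup_pip_conf(pip_conf: str) -> None:
    """Run during foundation deploy, before Orbit writes its upstream index URL."""
    if os.path.exists(pip_conf):
        shutil.copy2(pip_conf, pip_conf + ".orbit-backup")

def restore_pip_conf(pip_conf: str) -> None:
    """Run during foundation destroy: put back the user's original pip.conf."""
    backup = pip_conf + ".orbit-backup"
    if os.path.exists(backup):
        shutil.move(backup, pip_conf)
    elif os.path.exists(pip_conf):
        os.remove(pip_conf)  # Orbit created it; there was no prior config

# Demo: a temp dir stands in for ~/.config/pip
d = tempfile.mkdtemp()
conf = os.path.join(d, "pip.conf")
with open(conf, "w") as f:
    f.write("original user config")
backup_pip_conf(conf)                       # deploy: snapshot the user's file
with open(conf, "w") as f:
    f.write("orbit codeartifact upstream")  # orbit overwrites it
restore_pip_conf(conf)                      # destroy: original comes back
```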
manifest.toolkit_s3_bucket: datamaker-dev-env-toolkit-198245574422-45596.0 [ Error ] NoSuchBucket: An error occurred (NoSuchBucket) when calling the PutObject operation: The specified bucket does not exist
FEATURE
Need to add example notebooks that show how the APIs are used. Use containers to run nested notebooks.
When a user needs to terminate their session, we need to call this URL:
/logout?response_type=code&client_id=&redirect_uri=&state=STATE&scope=openid+profile+aws.cognito.signin.user.admin
We need to create a small service with an HTML page offering a link the user can click to terminate the session and redirect them to a new login screen.
The small web server will use something like the following to redirect users to a new login:
<html>
  <head>
    <title>This website has moved</title>
    <meta http-equiv="refresh" content="1;url=<cognito_ep>/logout?response_type=code&client_id=<clientid>&redirect_uri=<ingressalb>&state=STATE&scope=openid+profile+aws.cognito.signin.user.admin">
    <meta name="robots" content="noindex,follow">
  </head>
  <body>
    Your session has been terminated; you will be redirected to a new login page now.
  </body>
</html>
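The logout URL in the redirect above could be assembled like this (the function name is hypothetical; the endpoint, client id, and ALB URL are placeholders to be filled from the real Cognito setup):

```python
from urllib.parse import urlencode

def cognito_logout_url(cognito_ep: str, client_id: str,
                       redirect_uri: str, state: str = "STATE") -> str:
    """Build the Cognito /logout URL that ends the session and bounces to a fresh login."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "state": state,
        "scope": "openid profile aws.cognito.signin.user.admin",
    }
    return f"{cognito_ep}/logout?{urlencode(params)}"

url = cognito_logout_url(
    "https://example.auth.us-east-1.amazoncognito.com",  # placeholder endpoint
    "my-client-id",                                      # placeholder client id
    "https://my-alb.example.com",                        # placeholder ingress ALB
)
```

`urlencode` handles the escaping, so the spaces in the scope come out as the `+` separators shown in the URL above.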
Want to bring light to a larger question about how version bumps should be orchestrated with https://github.com/aws/aws-cdk.
Currently, the Orbit CLI uses CDK 1.67, and the CDK is moving quickly, with its latest release hitting 1.95 last week.
It's not obvious whether to bump versions for the latest L2 CDK constructs, even though they are very helpful for future development. The CDK team makes it clear these minor versions will contain breaking changes.
I'm very new to CDK, so I'm still learning best practices, but would it be helpful to have test coverage validating the various CDK constructs? From my research, I see plenty of resources on testing CDK with TypeScript, but not many with Python. 🤔
Genesis for this was the unsupported UserPoolClient construct in CDK 1.67 #295 (comment)
Example 1 in Lake Creator has a hardcoded environment name in the toolkit bucket, which causes Example 1 to fail to execute.
samples/notebooks/A-LakeCreator/Example-1-Build-Lake.ipynb
to the dm_s3_bucket...
The environment name should be taken from the workspace and the SSM key should be parameterized:
dm_s3_bucket = json.loads(ssm.get_parameter(
    Name=f'/datamaker/{workspace.get("env_name")}/manifest'
)['Parameter']['Value'])['toolkit-s3-bucket']
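Separating the JSON parsing from the SSM call makes the lookup testable without AWS credentials; a sketch (the sample manifest value is hypothetical):

```python
import json

def toolkit_bucket_from_manifest(manifest_json: str) -> str:
    """Extract the toolkit bucket name from the env manifest JSON stored in SSM."""
    return json.loads(manifest_json)["toolkit-s3-bucket"]

# In the notebook, the manifest value would come from SSM with the env name
# taken from the workspace, e.g.:
#   value = ssm.get_parameter(
#       Name=f'/datamaker/{workspace.get("env_name")}/manifest'
#   )["Parameter"]["Value"]
sample = '{"toolkit-s3-bucket": "datamaker-dev-env-toolkit-123456789012-12345"}'
print(toolkit_bucket_from_manifest(sample))
```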
FEATURE
Missing tools for notebooks: zip and unzip.
Workaround:
apt-get install zip unzip
If there is no sudo access, then:
mkdir $HOME/pkgs && cd $HOME/pkgs
apt-get download zip unzip
for f in *zip*.deb; do dpkg -x $f $HOME/pkgs; done
export PATH=$PATH:$HOME/pkgs/usr/bin
Note: this will not make those packages available in notebooks.
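Inside notebooks, the stdlib zipfile module can stand in for the missing binaries with no sudo access required; a minimal create-and-extract sketch:

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "data.txt")
with open(src, "w") as f:
    f.write("hello orbit")

# Equivalent of "zip": create a compressed archive
archive = os.path.join(workdir, "data.zip")
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(src, arcname="data.txt")

# Equivalent of "unzip": extract it elsewhere
out = os.path.join(workdir, "extracted")
with zipfile.ZipFile(archive) as zf:
    zf.extractall(out)
```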
The rename script renamed the package to orbit_sdk instead of aws_orbit_sdk.
Open controller.py:
the line reads from orbit_sdk.common import ...
but it should be from aws_orbit_sdk.common import ...
Building some images can take a long time, and currently users would have to modify the timeout in a local file and push it to CodeBuild.
Solution: add a timeout option so the user can pass in the intended timeout.
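A sketch of what the option could look like, shown with stdlib argparse for illustration (the flag name and default are assumptions; the value would be forwarded to the CodeBuild project's build timeout when it is created):

```python
import argparse

parser = argparse.ArgumentParser(prog="orbit")
parser.add_argument(
    "--timeout",
    type=int,
    default=60,
    help="CodeBuild timeout in minutes for image builds (hypothetical flag)",
)

args = parser.parse_args(["--timeout", "120"])
print(args.timeout)  # 120
```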
Add pipeline stages to build and twine upload the plugin modules to CodeArtifact.
Add dynamic pip install of plugin modules during _create_dockerfile().
OTHER - Github Project Administration
Templates for Issues and PRs need to be added.
It would be helpful to support EFA with GPUDirectRDMA and multiple NICs (see https://github.com/aws-samples/aws-efa-eks) to support distributed training jobs on p3dn.24xlarge, p4d.24xlarge or g4dn.metal.
Hi,
Are there any plans to implement the same solution for deployment with ECS too? EKS/K8s may be popular, but ECS is AWS's very own container orchestrator, so why is it not treated with at least as much priority as EKS?
I see there is a Jupyter plugin, so there may already be something for K8s, but what about AWS providing an ECS plugin?
Thanks,
Currently one can only append or delete one profile at a time, which is inconvenient when modifying multiple profiles.
An option to overwrite the current profiles with an input file containing a list of profiles would make modification easier.
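The overwrite semantics could be sketched like this (function name, the `slug` key, and the file format are assumptions; in practice the new list would come from `json.load` on the input file):

```python
def overwrite_profiles(current: list, new_profiles: list) -> list:
    """Replace the whole profile set at once, rejecting duplicate slugs."""
    seen = set()
    for p in new_profiles:
        if p["slug"] in seen:
            raise ValueError(f"duplicate profile slug: {p['slug']}")
        seen.add(p["slug"])
    return list(new_profiles)  # wholesale replacement, not append/delete

current = [{"slug": "small", "description": "1 CPU"}]
from_file = [
    {"slug": "small", "description": "2 CPU"},
    {"slug": "gpu", "description": "1 GPU"},
]
profiles = overwrite_profiles(current, from_file)
```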
In the destroy of the Redshift plugin, we should look for clusters that were created for the team (using tags) and delete them (with no snapshot) before continuing to delete the stacks. Otherwise the destroy will fail.
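The tag filtering could be sketched like this, operating on plain dicts shaped like a `describe_clusters` response so it runs without AWS (the `Team` tag key and the sample data are assumptions):

```python
def team_cluster_ids(clusters: list, team: str, tag_key: str = "Team") -> list:
    """Pick out cluster identifiers tagged for the given team."""
    return [
        c["ClusterIdentifier"]
        for c in clusters
        if any(t["Key"] == tag_key and t["Value"] == team for t in c.get("Tags", []))
    ]

# With boto3 this would feed the deletions before the stacks are destroyed, e.g.:
#   redshift.delete_cluster(ClusterIdentifier=cid, SkipFinalClusterSnapshot=True)
clusters = [
    {"ClusterIdentifier": "lake-user-db", "Tags": [{"Key": "Team", "Value": "lake-user"}]},
    {"ClusterIdentifier": "other-db", "Tags": [{"Key": "Team", "Value": "lake-creator"}]},
]
print(team_cluster_ids(clusters, "lake-user"))  # ['lake-user-db']
```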