awslabs / aws-orbit-workbench Goto Github PK
View Code? Open in Web Editor NEW
A Data Platform built for AWS, powered by Kubernetes.
Home Page: https://awslabs.github.io/aws-orbit-workbench/
License: Apache License 2.0
The primary manifest stored in SSM Parameter Store contains the manifests for all teams as well as the original raw manifest. When there are more than two teams, this results in a JSON document larger than the 8,192 characters supported by Advanced tier Parameter Store values.
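One possible mitigation (a sketch, not the project's actual fix) is to split the serialized manifest across several SSM parameters and reassemble on read; `chunk_manifest` and `reassemble_manifest` are hypothetical helpers:

```python
import json

SSM_VALUE_LIMIT = 8192  # max characters of an Advanced tier parameter value

def chunk_manifest(manifest: dict, limit: int = SSM_VALUE_LIMIT) -> list:
    """Split a serialized manifest into SSM-sized chunks.

    Each chunk would be stored under a numbered key, e.g.
    /orbit/<env>/manifest/0, /orbit/<env>/manifest/1, ...
    """
    raw = json.dumps(manifest)
    return [raw[i:i + limit] for i in range(0, len(raw), limit)]

def reassemble_manifest(chunks: list) -> dict:
    """Concatenate the chunks in order and parse the JSON back."""
    return json.loads("".join(chunks))
```

Reads would list the numbered parameters, sort by suffix, and join before parsing.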
Removed the domain name from the username attribute while creating/searching for EKS job labels.
orbit deploy foundation -n dev --no-internet-accessibility
is causing this error from the cloudformation logs:
The VPC endpoint service com.amazonaws.us-east-1.lambda does not support the availability zone of the subnet: subnet-XXX. (Service: AmazonEC2; Status Code: 400; Error Code: InvalidParameter; Request ID: XXX; Proxy: null)
Perhaps we can try this:
Issue:
Pip packages installed in a notebook are lost when the notebook is terminated. This is a problem because inactive notebooks are terminated automatically, and users then have to reinstall everything again.
Tried Solution:
We tried changing the default pip installation path to a folder on EFS by setting environment variables in the Dockerfile, so that packages would not be removed. However, it looks like the EFS folder gets recreated every time the notebook boots up. Below is the snippet of code we used:
# Customize pip installation location
RUN mkdir -p /home/jovyan/private/site_packages
ENV PIP_TARGET=/home/jovyan/private/site_packages
ENV PYTHONPATH=$PYTHONPATH:/home/jovyan/private/site_packages
Question:
Isn't everything under private permanent, surviving each relaunch? Or is there some JupyterHub setting that prevents this? Also, is there a better way to address this pip package issue?
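One alternative to baking `PIP_TARGET` into the Dockerfile is to register the persistent directory at runtime from a startup hook, since environment variables set at build time can be clobbered when the home directory is mounted. A minimal sketch, assuming the EFS mount lives at /home/jovyan/private (the demo uses a temp dir so it runs anywhere):

```python
import os
import site
import sys
import tempfile

def add_persistent_packages(pkg_dir: str) -> None:
    """Make a persistent (e.g. EFS-backed) package directory importable."""
    os.makedirs(pkg_dir, exist_ok=True)
    site.addsitedir(pkg_dir)  # appends to sys.path and honors any .pth files

# Demo: a temp dir stands in for /home/jovyan/private/site_packages
pkg_dir = os.path.join(tempfile.mkdtemp(), "site_packages")
add_persistent_packages(pkg_dir)

# Simulate a package that survived a restart on the persistent volume
with open(os.path.join(pkg_dir, "persisted_mod.py"), "w") as f:
    f.write("VALUE = 42\n")

import persisted_mod  # resolvable because pkg_dir is now on sys.path
```

In practice users would `pip install --target /home/jovyan/private/site_packages <pkg>` once, and the startup hook would call `add_persistent_packages` on every boot.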
The administrator should be able to view the jobs each user launches and the resources each user is consuming.
It would be even better to have stats or graphs recording overall usage.
BUG
controller.run_notebooks isn't working due to the missing service-linked role for ECS.
Run in a notebook:
from datamaker_sdk import controller

def run_file():
    notebooks = []
    notebook = {
        "notebookName": "Test-Container.ipynb",
        "sourcePath": "notebooks/input",
        "targetPath": "notebooks/output",
        "params": {
            # "bucketName": bucket_name,
        },
    }
    notebooks.append(notebook)
    notebooksToRun = {
        "compute": {
            "container": {
                "p_concurrent": "10"
            }
        },
        "tasks": notebooks,
    }
    containers = controller.run_notebooks(notebooksToRun)
    print(containers)
    controller.wait_for_tasks_to_complete(containers, 60, 10, False)

run_file()
workaround available:
run in a terminal:
aws iam create-service-linked-role --aws-service-name ecs.amazonaws.com
Also clean JL UI refresh buttons.
BUG
When a task is defined as follows:
"notebookName": "Test-Container.ipynb",
"sourcePath": "private/notebooks/input",
"targetPath": "private/notebooks/output",
the input path resolves to:
/home/jovyan/private/notebooks/input/Test-Container.ipynb
but the output path resolves to:
/home/jovyan/private/outputs/private/notebooks/output/Test-Container/e1@20201119-15:06.ipynb
instead of:
/home/jovyan/private/notebooks/output/Test-Container/e1@20201119-15:06.ipynb
To reproduce, run notebooks using the controller API as defined above.
Recommendation: remove the hardcoded output prefix and use it only as a default when no explicit targetPath
is provided.
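The recommended behavior could be sketched like this (function and constant names are hypothetical; the default root mirrors the hardcoded path observed above):

```python
import os

HOME = "/home/jovyan"
DEFAULT_OUTPUT_ROOT = os.path.join(HOME, "private", "outputs")  # current hardcoded root

def resolve_output_dir(task: dict) -> str:
    """Honor an explicit targetPath verbatim; fall back to the default root otherwise."""
    target = task.get("targetPath")
    if target:
        # Explicit target: resolve relative to HOME only, no extra prefix
        return os.path.join(HOME, target)
    # No target given: use the hardcoded outputs root as the default
    stem = task["notebookName"].replace(".ipynb", "")
    return os.path.join(DEFAULT_OUTPUT_ROOT, stem)

task = {"notebookName": "Test-Container.ipynb", "targetPath": "private/notebooks/output"}
print(resolve_output_dir(task))  # /home/jovyan/private/notebooks/output
```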
Added conditional logic to fetch Glue table/view location details in the Orbit SDK, adding "" in case of a view without a location attribute.
bash-4.2$ orbit init
[ Info ] Env Manifest generated into conf folder
[ Tip ] Recommended next step: orbit deploy foundation -f default-foundation.yaml
[ Tip ] Then, fill up the manifest file (default-env-manifest.yaml) and run: orbit env -f default-env-manifest.yaml
Initializing |█████████████████████████████| 100%
bash-4.2$ orbit env -f default-env-manifest.yaml
Usage: orbit [OPTIONS] COMMAND [ARGS]...
Try 'orbit --help' for help.
Error: No such command 'env'.
Repository modes are inconsistent:
cli/datamaker_cli/docker.py:95 refers to a mode as "source",
while
cli/datamaker_cli/commands/deploy.py:49 refers to a mode as "code".
These need to be aligned.
When running notebooks in containers using the controller API, the run fails.
To reproduce: create a configuration that executes in a container, then run controller.run_notebooks(notebooksToRun).
The controller API and notebook_runner seem to be broken.
The issue template isn't working.
We need to introduce a config.yml file with default issue template(s).
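A minimal `.github/ISSUE_TEMPLATE/config.yml` could look like this (a sketch; the Discussions link target is an assumption):

```yaml
blank_issues_enabled: false
contact_links:
  - name: Question / Discussion
    url: https://github.com/awslabs/aws-orbit-workbench/discussions
    about: Ask questions here instead of opening an issue.
```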
See https://docs.aws.amazon.com/sagemaker/latest/dg/amazon-sagemaker-operators-for-kubernetes.html
Notice that there is a Helm deployment option toward the end.
We've managed to run an example Ray cluster and HPO on Kubernetes. Here are the steps:
pip install ray
git clone https://github.com/ray-project/ray.git
ray up ray/python/ray/autoscaler/kubernetes/example-full.yaml
kubectl -n ray get pods
kubectl create -f ray/doc/kubernetes/job-example.yaml
kubectl -n ray logs <launched job pod>
ray down ray/python/ray/autoscaler/kubernetes/example-full.yaml
Have a page that starts from deployment of the sample manifest with lake-creator and lake-user, then downloads the demo data, runs the lake-creator notebook, and finally runs the user regression. Document it step by step.
Currently, destroying the Orbit env and foundation does not remove the PyPI upstream configuration, which can be annoying and confuses the user.
Solution: restore ~/.config/pip/pip.conf while destroying the foundation.
Known pattern - aws/serverless-application-model#200 (comment)
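The proposed restore could be sketched like this (function and backup-file names are hypothetical; paths are passed in so the demo runs against a temp dir instead of the real ~/.config/pip):

```python
import os
import shutil
import tempfile

def backup_pip_conf(pip_conf: str) -> None:
    """Run during foundation deploy, before Orbit writes its upstream index URL."""
    if os.path.exists(pip_conf):
        shutil.copy2(pip_conf, pip_conf + ".orbit-backup")

def restore_pip_conf(pip_conf: str) -> None:
    """Run during foundation destroy: put back the user's original pip.conf."""
    backup = pip_conf + ".orbit-backup"
    if os.path.exists(backup):
        shutil.move(backup, pip_conf)
    elif os.path.exists(pip_conf):
        os.remove(pip_conf)  # Orbit created it; there was no prior config

# Demo: a temp dir stands in for ~/.config/pip
d = tempfile.mkdtemp()
conf = os.path.join(d, "pip.conf")
with open(conf, "w") as f:
    f.write("original user config")
backup_pip_conf(conf)                       # deploy: snapshot the user's file
with open(conf, "w") as f:
    f.write("orbit codeartifact upstream")  # orbit overwrites it
restore_pip_conf(conf)                      # destroy: original comes back
```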
manifest.toolkit_s3_bucket: datamaker-dev-env-toolkit-198245574422-45596.0 [ Error ] NoSuchBucket: An error occurred (NoSuchBucket) when calling the PutObject operation: The specified bucket does not exist
FEATURE
Need to add example notebooks that show how the APIs are used. Use containers to run nested notebooks.
When a user needs to terminate their session, we need to call this URL:
/logout?response_type=code&client_id=&redirect_uri=&state=STATE&scope=openid+profile+aws.cognito.signin.user.admin
We need to create a small service with an HTML page offering a link the user can click to terminate the session and redirect them to a new login screen.
The small web server will use something like the following to redirect users to a new login:
<html>
  <head>
    <title>This website has moved</title>
    <meta http-equiv="refresh" content="1;url=<cognito_ep>/logout?response_type=code&client_id=<clientid>&redirect_uri=<ingressalb>&state=STATE&scope=openid+profile+aws.cognito.signin.user.admin">
    <meta name="robots" content="noindex,follow">
  </head>
  <body>
    Your session has been terminated; you will be redirected to a new login page now.
  </body>
</html>
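The logout URL in the redirect above could be assembled like this (the function name is hypothetical; the endpoint, client id, and ALB URL are placeholders to be filled from the real Cognito setup):

```python
from urllib.parse import urlencode

def cognito_logout_url(cognito_ep: str, client_id: str,
                       redirect_uri: str, state: str = "STATE") -> str:
    """Build the Cognito /logout URL that ends the session and bounces to a fresh login."""
    params = {
        "response_type": "code",
        "client_id": client_id,
        "redirect_uri": redirect_uri,
        "state": state,
        "scope": "openid profile aws.cognito.signin.user.admin",
    }
    return f"{cognito_ep}/logout?{urlencode(params)}"

url = cognito_logout_url(
    "https://example.auth.us-east-1.amazoncognito.com",  # placeholder endpoint
    "my-client-id",                                      # placeholder client id
    "https://my-alb.example.com",                        # placeholder ingress ALB
)
```

`urlencode` handles the escaping, so the spaces in the scope come out as the `+` separators shown in the URL above.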
Want to bring light to a larger question about how version bumps should be orchestrated with https://github.com/aws/aws-cdk.
Currently, the Orbit CLI uses CDK 1.67, and the CDK is moving quickly, with its latest release hitting 1.95 last week.
It's not obvious whether to bump versions for the latest L2 CDK constructs, even though they are very helpful for future development. The CDK team makes it clear these minor versions will contain breaking changes.
I'm very new to CDK, so I'm still learning best practices, but would it be helpful to have test coverage validating the various CDK constructs? From my research, I see plenty of resources on testing CDK with TypeScript, but not many with Python. 🤔
Genesis for this was the unsupported UserPoolClient construct in CDK 1.67 #295 (comment)
Example 1 in Lake Creator has a hardcoded environment name in the toolkit bucket, which causes Example 1 to fail to execute.
samples/notebooks/A-LakeCreator/Example-1-Build-Lake.ipynb
to the dm_s3_bucket...
The environment name should be taken from the workspace and the SSM key should be parameterized:
dm_s3_bucket = json.loads(ssm.get_parameter(
    Name=f'/datamaker/{workspace.get("env_name")}/manifest'
)['Parameter']['Value'])['toolkit-s3-bucket']
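Separating the JSON parsing from the SSM call makes the lookup testable without AWS credentials; a sketch (the sample manifest value is hypothetical):

```python
import json

def toolkit_bucket_from_manifest(manifest_json: str) -> str:
    """Extract the toolkit bucket name from the env manifest JSON stored in SSM."""
    return json.loads(manifest_json)["toolkit-s3-bucket"]

# In the notebook, the manifest value would come from SSM with the env name
# taken from the workspace, e.g.:
#   value = ssm.get_parameter(
#       Name=f'/datamaker/{workspace.get("env_name")}/manifest'
#   )["Parameter"]["Value"]
sample = '{"toolkit-s3-bucket": "datamaker-dev-env-toolkit-123456789012-12345"}'
print(toolkit_bucket_from_manifest(sample))
```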
FEATURE
Missing tools for notebooks: zip and unzip.
Workaround:
apt-get install zip unzip
If there is no sudo access, then:
mkdir $HOME/pkgs && cd $HOME/pkgs
apt-get download zip unzip
for f in *zip*.deb; do dpkg -x $f $HOME/pkgs; done
export PATH=$PATH:$HOME/pkgs/usr/bin
Note: this will not make those packages available in notebooks.
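Inside notebooks, the stdlib zipfile module can stand in for the missing binaries with no sudo access required; a minimal create-and-extract sketch:

```python
import os
import tempfile
import zipfile

workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "data.txt")
with open(src, "w") as f:
    f.write("hello orbit")

# Equivalent of "zip": create a compressed archive
archive = os.path.join(workdir, "data.zip")
with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.write(src, arcname="data.txt")

# Equivalent of "unzip": extract it elsewhere
out = os.path.join(workdir, "extracted")
with zipfile.ZipFile(archive) as zf:
    zf.extractall(out)
```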
The rename script renamed the package to orbit_sdk instead of aws_orbit_sdk.
Open controller.py:
the line reads from orbit_sdk.common import ...
but it should be from aws_orbit_sdk.common import ...
Building some images can take a long time, and currently users would have to modify the timeout in a local file and push it to CodeBuild.
Solution: add a timeout option so the user can pass in the intended timeout.
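A sketch of what the option could look like, shown with stdlib argparse for illustration (the flag name and default are assumptions; the value would be forwarded to the CodeBuild project's build timeout when it is created):

```python
import argparse

parser = argparse.ArgumentParser(prog="orbit")
parser.add_argument(
    "--timeout",
    type=int,
    default=60,
    help="CodeBuild timeout in minutes for image builds (hypothetical flag)",
)

args = parser.parse_args(["--timeout", "120"])
print(args.timeout)  # 120
```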
Add pipeline stages to build and twine upload the plugin modules to CodeArtifact.
Add dynamic pip install of plugin modules during _create_dockerfile().
OTHER - Github Project Administration
Templates for Issues and PRs need to be added.
It would be helpful to support EFA with GPUDirectRDMA and multiple NICs (see https://github.com/aws-samples/aws-efa-eks) to support distributed training jobs on p3dn.24xlarge, p4d.24xlarge or g4dn.metal.
Hi,
Are there any plans to implement the same solution for deployment with ECS too? EKS/K8s may be popular, but ECS is AWS's very own container orchestrator, so why is it not treated with at least as much priority as EKS?
I see there is a Jupyter plugin, so there may already be something for K8s, but what about AWS providing an ECS plugin?
Thanks,
Currently one can only append or delete one profile at a time, which is inconvenient when modifying multiple profiles.
An option to overwrite the current profiles with an input file containing a list of profiles would make modification easier.
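The overwrite semantics could be sketched like this (function name, the `slug` key, and the file format are assumptions; in practice the new list would come from `json.load` on the input file):

```python
def overwrite_profiles(current: list, new_profiles: list) -> list:
    """Replace the whole profile set at once, rejecting duplicate slugs."""
    seen = set()
    for p in new_profiles:
        if p["slug"] in seen:
            raise ValueError(f"duplicate profile slug: {p['slug']}")
        seen.add(p["slug"])
    return list(new_profiles)  # wholesale replacement, not append/delete

current = [{"slug": "small", "description": "1 CPU"}]
from_file = [
    {"slug": "small", "description": "2 CPU"},
    {"slug": "gpu", "description": "1 GPU"},
]
profiles = overwrite_profiles(current, from_file)
```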
In the destroy of the Redshift plugin, we should look for clusters that were created for the team (using tags) and delete them (with no snapshot) before continuing to delete the stacks. Otherwise the destroy will fail.
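The tag filtering could be sketched like this, operating on plain dicts shaped like a `describe_clusters` response so it runs without AWS (the `Team` tag key and the sample data are assumptions):

```python
def team_cluster_ids(clusters: list, team: str, tag_key: str = "Team") -> list:
    """Pick out cluster identifiers tagged for the given team."""
    return [
        c["ClusterIdentifier"]
        for c in clusters
        if any(t["Key"] == tag_key and t["Value"] == team for t in c.get("Tags", []))
    ]

# With boto3 this would feed the deletions before the stacks are destroyed, e.g.:
#   redshift.delete_cluster(ClusterIdentifier=cid, SkipFinalClusterSnapshot=True)
clusters = [
    {"ClusterIdentifier": "lake-user-db", "Tags": [{"Key": "Team", "Value": "lake-user"}]},
    {"ClusterIdentifier": "other-db", "Tags": [{"Key": "Team", "Value": "lake-creator"}]},
]
print(team_cluster_ids(clusters, "lake-user"))  # ['lake-user-db']
```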