kenza-ai / sagify
LLMs and Machine Learning done easily
Home Page: https://kenza-ai.github.io/sagify/
License: MIT License
Once the transform job completes (given we have waited for it), we can add a call to describe-transform-job, check the Status and FailureReason fields, and return them.
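A minimal sketch of what that check could look like. The helper is kept pure (it only interprets the describe-transform-job response dict) so it can be tested without AWS access; the function name and the `sm_client` variable in the comment are illustrative, not Sagify's actual code.

```python
def transform_job_outcome(description):
    """Map a describe_transform_job response to (succeeded, failure_reason)."""
    status = description.get("TransformJobStatus")
    if status == "Completed":
        return True, None
    # For 'Failed' (or any other terminal status) surface the reason.
    return False, description.get("FailureReason", "Unknown failure")

# With a real boto3 SageMaker client this would be roughly:
#   description = sm_client.describe_transform_job(TransformJobName=job_name)
#   ok, reason = transform_job_outcome(description)
```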
I’d like to be able to define a name for the endpoint when I call ‘deploy’ for easier later processing and testing.
Hello,
in the constructor of SageMakerClient class (sagemaker.py), I believe that the session should be created before the logic to assume the role. As it currently is, we try to assume a specific role with the default profile which might not contain credentials.
If a boto3 session is created beforehand and the STS client is derived from that session, e.g.
sts_client = self.boto_session.client('sts')
the client will use the defined profile and region and correctly locate the credentials in the config.
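A sketch of the proposed ordering. The session factory is injected here purely so the ordering can be exercised without AWS credentials; in the real constructor `make_session` would simply be `boto3.Session`.

```python
def build_sts_client(make_session, aws_profile, aws_region):
    # 1. The session is created first, bound to the chosen profile/region...
    session = make_session(profile_name=aws_profile, region_name=aws_region)
    # 2. ...and only then is the STS client derived from that session, so a
    #    later assume-role call uses the profile's credentials rather than
    #    whatever the default profile happens to contain.
    return session.client("sts")

# With boto3 this would be called as:
#   sts_client = build_sts_client(boto3.Session, "my-profile", "eu-west-1")
```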
Looks like the default value for the failure output directory passed to train should be 'failure' instead of 'model', based on the SageMaker docs.
That is, the following:
default=os.path.join(_DEFAULT_PREFIX_PATH, 'model'),
should be:
default=os.path.join(_DEFAULT_PREFIX_PATH, 'failure'),
Hi,
Sagemaker SDK v2 is coming very soon. It has a number of breaking changes that will impact Sagify. Nothing huge, but you may want to adapt the code.
aws/sagemaker-python-sdk#1459
https://sagemaker.readthedocs.io/en/v2.0.0.rc1/v2.html
Hi,
Thanks for the project.
While testing Sagify with sagify cloud train, I've noticed that when we do not set the --use-spot-instances flag, it fails with the following message.
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Invalid MaxWaitTimeInSeconds. It is only supported when EnableManagedSpotTraining is set to true
As far as I understood, setting train_max_wait=3600 here causes the problem.
Can we default it to None instead, or am I missing something?
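A hedged sketch of that default: only pass MaxWaitTimeInSeconds when spot instances are enabled, since SageMaker rejects it otherwise. The helper name and shape are illustrative, not Sagify's actual code.

```python
def stopping_condition_kwargs(use_spot_instances, train_max_wait=None):
    """Build the spot-related CreateTrainingJob arguments conditionally."""
    kwargs = {"EnableManagedSpotTraining": use_spot_instances}
    if use_spot_instances:
        # MaxWaitTimeInSeconds is only valid when managed spot training
        # is enabled, so it is added only on this branch.
        kwargs["StoppingCondition"] = {
            "MaxWaitTimeInSeconds": train_max_wait or 3600
        }
    return kwargs
```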
Using Sagify 0.20.4.
I'd like to be able to define the endpoint name when deploying a model for later analysis / configuration changes etc.
We'd have to add the new param on the deploy method as usual.
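A sketch of how the new param could be threaded through; the function name and the fallback naming scheme are illustrative, not Sagify's actual signature.

```python
def resolve_endpoint_name(model_name, endpoint_name=None):
    # Keep today's auto-generated behaviour as the default; only use the
    # caller-supplied name when one was passed to deploy.
    return endpoint_name or f"{model_name}-endpoint"

# deploy(..., endpoint_name=None) would then pass the resolved name on to
# the underlying SageMaker deploy call.
```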
I'm having issues when deploying a model on AWS.
gunicorn fails with ModuleNotFoundError: No module named 'sagify.prediction'. From what I can tell, this started when #106 was merged to fix an import clash.
Hi,
When I specify a role on sagify cloud train, it fails if I do not also specify the ExternalId, with the following error.
Invalid type for parameter ExternalId, value: None, type: <class 'NoneType'>, valid types: <class 'str'>
In the boto3 STS docs it is defined as ExternalId='string', so we cannot pass None.
Can we assume the role here without ExternalId if it is not specified via the CLI arguments?
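A sketch of the suggested fix: build the assume_role arguments conditionally so ExternalId is omitted entirely (rather than passed as None) when the CLI did not supply one. The helper name is illustrative.

```python
def assume_role_kwargs(role_arn, session_name, external_id=None):
    """Build kwargs for sts_client.assume_role, omitting ExternalId when unset."""
    kwargs = {"RoleArn": role_arn, "RoleSessionName": session_name}
    if external_id:
        # boto3 requires a string here, so leave the key out instead of
        # passing None when no external id was given on the CLI.
        kwargs["ExternalId"] = external_id
    return kwargs

# Then: sts_client.assume_role(**assume_role_kwargs(role_arn, session_name, external_id))
```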
Using Sagify 0.20.4
The latest SageMaker release as of this writing is 1.14.2, while Sagify currently targets ~1.5.
It's probably worth bringing it up to date.
Instead of creating a repository per image.
I get the following error with the latest version, but not with sagify version 0.10:
An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaw-2018-07-31-16-42-36-397' at 'trainingJobName' failed to satisfy constraint: Member must satisfy regular expression pattern: ^[a-zA-Z0-9](-*[a-zA-Z0-9])*
The problem:
I try to push a docker image to ECR using sagify push, following the steps in https://kenza-ai.github.io/sagify/, but I get this error message:
Error when retrieving credentials from Ec2InstanceMetadata: No credentials found in credential_source referenced in profile :aws:iam::...:role/SagemakerSagify
Traceback (most recent call last):
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/bin/sagify", line 8, in <module>
sys.exit(cli())
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/decorators.py", line 27, in new_func
return f(get_current_context().obj, *args, **kwargs)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/sagify/commands/push.py", line 58, in push
image_name=image_name)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/sagify/api/push.py", line 37, in push
image_name])
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['src/sagify_base/push.sh', 'latest', 'ap-southeast-1', '', 'default', '', 'sagify-demo']' returned non-zero exit status 255.
Additional Context:
In the select-IAM-user step, I chose the only user available: the admin user created according to the AWS guide.
In the credentials and config files the profile name is [default]. I added the role_arn according to the sagify guide and source_profile = default in the AWS config file.
If I run sagify push, it seems sagify created a new profile with the name of the role ARN I created.
I need help resolving this error.
Is there a way to orchestrate retraining of models with Sagify?
Example: I would like to retrain my model every week and redeploy my endpoint. Do you have any recommendations about this?
Thanks in advance.
tl;dr: Proposing the replacement of the init command with a configure command, a la aws configure, to allow certain config values to change after the creation of a project.
Issue
The init command hardcodes most (all?) of its arguments throughout the project for the cookiecutter step to pick up and replace in various places, for example in the build script.
This causes issues when someone wants to change any of the initial arguments, such as the project/image name, Python version, etc.
Suggestion
Addition of a configure command that can be called whenever someone wishes to change any of the original values. The command would do exactly what init does, but instead of hardcoding the values around the project via cookiecutter, it would place them all in a local config file that is updated every time configure is called again.
The other commands can consult that file to load their config each time they run.
To reproduce on my laptop, which uses Python 3.7.9:
python -m venv venv
source venv/bin/activate
pip install sagemaker
pip install sagify
sagify init
This throws
Traceback (most recent call last):
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 584, in _build_master
ws.require(__requires__)
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 901, in require
needed = self.resolve(parse_requirements(requirements))
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 792, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (urllib3 1.24.3 (/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages), Requirement.parse('urllib3<1.27,>=1.25.4; python_version != "3.4"'), {'botocore'})
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/bin/sagify", line 6, in <module>
from pkg_resources import load_entry_point
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3261, in <module>
@_call_aside
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3245, in _call_aside
f(*args, **kwargs)
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3274, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 586, in _build_master
return cls._build_from_requirements(__requires__)
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 599, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 792, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (urllib3 1.24.3 (/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages), Requirement.parse('urllib3<1.27,>=1.25.4; python_version != "3.4"'), {'botocore'})
Happy to provide more details about my environment. My current theory is that it's either related to my Python version (I have tried on AWS with Python 3.6 and it works) or to the sagemaker version.
Optimize the docker image for training and deployment, i.e. reduce its size so it uploads faster to ECR.
If I pass any name during init when setting up a new project, I always get 'src' as the sagify module dir.
Is this by design @pm3310 ?
Issue: no output is returned to the user when there is an error in the Python application during local deployment with Sagify.
Follow the Getting Started guide; when adding code to the predict() function, introduce a syntax error, e.g. inconsistent indentation. Then continue following the guide and deploy locally with sagify local deploy -d src.
 ____              _  __
/ ___| __ _ __ _(_)/ _|_ _
\___ \ / _` |/ _` | | |_| | | |
___) | (_| | (_| | | _| |_| |
|____/ \__,_|\__, |_|_| \__, |
|___/ |___/
Started local deployment at localhost:8080 ...
For others, from @pm3310, in order to debug the error, try this:
1. Comment out the ENTRYPOINT in the Dockerfile under the src/sagify/ module. It should look like this: #ENTRYPOINT ["sagify/executor.sh"]
2. sagify build -d src -r requirements.txt
3. docker run -it deep-learning-addition-img /bin/bash. You'll enter the docker image with this command.
4. Run sagify/executor.sh; this should give you your error.
Hi,
Facing a problem building the image:
The command '/bin/sh -c pip install -r ../sagify-requirements.txt && rm -rf /root/.cache' returned a non-zero code: 1
And
Command '[u'src/sagify/build.sh', u'src', u'src', u'src/sagify/Dockerfile', u'src/requirements.txt', u'latest']' returned non-zero exit status 1
I think the problem is with the requirements.txt. I am using the exact same file, just adding 2 new packages, sklearn and pandas.
Please, I need your help!
Running sagify init and selecting the default AWS profile by pressing Enter instead of typing 1 throws an IndexError:
Select AWS profile:
1 - default
Choose from 1 [1]:
Traceback (most recent call last):
File "/data/code/sagemaker/sagify-demo/venv/bin/sagify", line 8, in <module>
sys.exit(cli())
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/sagify/commands/initialize.py", line 129, in init
aws_profile, aws_region = ask_for_aws_details()
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/sagify/commands/initialize.py", line 97, in ask_for_aws_details
chosen_profile = available_profiles[chosen_profile_index]
IndexError: list index out of range
This is probably because the validation, which decrements the index, does not run in this case, so the default choice of 1 remains; with a single profile that throws an IndexError, and with multiple profiles the wrong profile is selected.
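A sketch of a possible fix: normalise the user's 1-based choice (including the accepted default) to a 0-based index in one place, instead of relying on validation that is skipped when the default is taken. The function name is illustrative.

```python
def choose_profile(available_profiles, raw_choice=None):
    """Return the selected profile; raw_choice is the user's 1-based input."""
    # Pressing Enter (no input) falls back to the first profile, exactly
    # like typing 1 would.
    choice = int(raw_choice) if raw_choice else 1
    if not 1 <= choice <= len(available_profiles):
        raise ValueError("choice out of range")
    # Single point where the 1-based prompt value becomes a 0-based index.
    return available_profiles[choice - 1]
```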
We currently support logging metrics for Hyperparameter Tuning Jobs; we should expand metrics support to non-hyperparameter-tuning Training Jobs too.
I would find it helpful to be able to specify a tag for the Docker image being used. This would allow a user to maintain a Docker build history, and to easily switch between different models.
For example, I would like to be able to run the command sagify --tag test build -d src
to build an image but tag it as 'test' instead of 'latest', the default tag.
Since all of the commands access the Docker image under the hood, I think this needs to be a global option available to all subcommands, similar to the --verbose flag.
I see two ways to approach this:
Use the click library's built-in context object. This would make it easy to pass the option to subcommands, but it would require changing the method signature for each subcommand.
Create a service class and store the current tag as a global var. This global could be imported where needed. This approach does not change any method signatures but requires an extra import.
I would prefer to go with option 1, since it would allow more global arguments to be added more easily in the future. It would also keep the structure cleaner by not requiring an extra class to be created and maintained.
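A minimal sketch of option 1. With click this would be `ctx.obj` populated in the group callback and handed to each subcommand via `@click.pass_obj`; a stand-in Context class is used here so the pattern can be shown without the click dependency, and all names are illustrative.

```python
class Context:
    """Stand-in for click's ctx.obj carrying global CLI options."""
    def __init__(self, tag="latest", verbose=False):
        self.tag = tag
        self.verbose = verbose


def cli(tag="latest", verbose=False):
    # click equivalent: @click.group() with
    # @click.option('--tag', default='latest') storing the value on ctx.obj.
    return Context(tag=tag, verbose=verbose)


def build(ctx, source_dir="src"):
    # Every subcommand reads the globally chosen tag from the shared
    # context instead of hardcoding 'latest'.
    return f"building image {source_dir}:{ctx.tag}"
```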
@pm3310 lmk which seems more consistent with the project, and I can build it.
What
Ability (a new command, or probably better an update option on the existing deploy command) to update an endpoint's configuration. As of today, if we deploy an endpoint there's no way of amending its configuration, e.g. the instance type.
Running deploy with different params and the same endpoint name does not work, unfortunately; SageMaker throws a duplicate endpoint exception.
The current deep learning addition example is a bit complicated, especially the part about where to place input data. It should have no manual steps.
All commands that end up running shell scripts (e.g. build, push) use subprocess.check_output for execution, which only returns the output after the subprocess running the script exits.
It would be nice to have more immediate feedback on what's happening. Pushing to ECR, for example, can take a really long time (sometimes hours), but a user only sees the following on the command line.
Started pushing Docker image to AWS ECS. It will take some time. Please, be patient...
It would be nice to have some indication of progress across the API since many (most?) commands have long execution time potential.
Suggestions
Use an alternative method of calling subprocess that allows streaming of output, or switch to Python scripts and a corresponding Python docker client.
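A minimal sketch of the first suggestion: replace check_output with subprocess.Popen and iterate over stdout line by line, echoing progress as it happens instead of returning everything at the end.

```python
import subprocess
import sys


def run_streaming(cmd):
    """Run cmd, echoing each output line as it arrives; return full output."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = []
    for line in proc.stdout:  # yields lines as the script produces them
        sys.stdout.write(line)
        lines.append(line)
    returncode = proc.wait()
    if returncode != 0:
        # Mirror check_output's behaviour on failure.
        raise subprocess.CalledProcessError(returncode, cmd, "".join(lines))
    return "".join(lines)
```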
I started using MLflow the other day and it looks very promising. I was happy with the MLflow API in general, but I couldn't find a way to deploy a model (trained via mlflow commands) on SageMaker.
Additionally, mlflow supports only local training. I think we could enhance mlflow by enabling training and deployment on SageMaker via Sagify.
What do you think about this @berlinquin ?
In the first get_model() call, the weights need to be loaded and the model initialized, which takes 10 minutes or so. The problem is that the SageMaker endpoint times out after 60 seconds, so the model never has time to initialize.
Would you have any suggestions on ways to address this?
Thanks
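One common workaround, sketched below under the assumption that loading can be moved to container startup (this is an illustration, not Sagify's actual serving code): cache the model and trigger the expensive load when the serving process boots, so the 60-second invocation timeout never covers weight loading.

```python
_MODEL = None


def _load_weights():
    # Stand-in for the real ten-minute weight load.
    return object()


def get_model():
    # Cache the model so the expensive load happens exactly once per process.
    global _MODEL
    if _MODEL is None:
        _MODEL = _load_weights()
    return _MODEL


# Trigger the load at container startup (e.g. at module import, or from a
# gunicorn worker init hook) so the first /invocations request finds a
# warm model instead of paying the load cost inside the request:
get_model()
```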
I was curious to know, because this library saved me a lot of time. I spent the whole weekend looking for ways to integrate a spaCy model and had given up hope.
dir is a required option in most commands. With the imminent implementation of #59, we can store it as a config variable and make it optional or remove it from all commands.
The advantage I see here is that this value rarely, if ever, changes, so passing it every time is unnecessary.
sagify push doesn't work if the optional arguments --aws-region and --aws-profile are not specified.
The reason is that in the file ./api/push.py the Python function subprocess.check_output([...]) complains if any of the arguments is None.
Additionally, [-z "$profile"] in push.sh needs to be written with spaces: [ -z "$profile" ]
Hello,
in the push.sh template file, we try to log in to ECR with
aws ecr get-login --profile ${profile} --region ${region} --no-include-email
This does not work with AWS CLI v2. The correct command is
aws --region <region> ecr get-login-password --profile <profile> | docker login --username AWS --password-stdin <aws account number>.dkr.ecr.<region>.amazonaws.com
Please see this issue for more information.
Some data scientists/ML engineers prefer to use conda envs. Sagify should support reading conda environment.yml
files when building docker images.
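A hedged sketch of how that support could work: translate an already-parsed environment.yml (PyYAML would handle the parsing step) into the install lines a Dockerfile template could use. The key names follow the standard environment.yml schema; the function name is illustrative.

```python
def docker_install_lines(env):
    """Turn a parsed environment.yml dict into Dockerfile RUN lines."""
    conda_deps, pip_deps = [], []
    for dep in env.get("dependencies", []):
        if isinstance(dep, dict) and "pip" in dep:
            # The nested 'pip:' section lists pip-only packages.
            pip_deps.extend(dep["pip"])
        else:
            conda_deps.append(dep)
    lines = []
    if conda_deps:
        lines.append("RUN conda install -y " + " ".join(conda_deps))
    if pip_deps:
        lines.append("RUN pip install " + " ".join(pip_deps))
    return lines
```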
When a batch transform job fails, Sagify still returns a 0 exit code.
We need to check the SageMaker status returned here and exit with 1 (or similar) if the status is Failed.
(Not sure if we can future-proof this in case SageMaker changes the literal from Failed to something else, but that should be easy to catch.)
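A sketch of the exit-code mapping. Treating anything other than Completed as a failure is one way to hedge against SageMaker renaming the Failed literal; the function name is illustrative.

```python
import sys


def exit_code_for_status(status):
    # SageMaker currently reports 'Failed' for failed batch transform jobs;
    # treating every non-'Completed' terminal status as a failure guards
    # against the literal changing.
    return 0 if status == "Completed" else 1


# At the end of the CLI command:
#   sys.exit(exit_code_for_status(job_status))
```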
Hi,
Can we add support for SageMaker's Autopilot functionality?
Best,
Please add the motivation for this package to the README.md.
Why should one use this? What limitation of SageMaker is it trying to solve?
What can it do that SageMaker cannot?
Support batch prediction jobs, given a trained ML model.
There are many use cases where a model should periodically score new data instances in batch, e.g. from S3.
Input
Output
Hi,
Receiving this error when running the sagify cloud train command:
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: failed to satisfy constraint: Member must satisfy regular expression pattern: ^[a-zA-Z0-9](-*[a-zA-Z0-9])*
I am following the rest of the steps exactly as described.
Please refer to the attachment for my training code:
tran_originalf.txt
Regards,
Vasu Roy
On all the devices I've tested (SageMaker Notebooks, SageMaker Notebook Instances on AL1 and AL2, a personal MacBook on 11.6) I haven't been able to download and install sagify (neither the current nor older versions). This is caused by dependencies that fail to coexist. In particular, pip tries to download and install every boto3 and botocore package in existence, until eventually finding that:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
anaconda-project 0.9.1 requires ruamel-yaml, which is not installed.
docker-compose 1.29.2 requires docker[ssh]>=5, but you have docker 3.7.3 which is incompatible.
black 20.8b1 requires click>=7.1.2, but you have click 7.0 which is incompatible.
awscli 1.21.3 requires botocore==1.22.3, but you have botocore 1.18.18 which is incompatible.
awscli 1.21.3 requires s3transfer<0.6.0,>=0.5.0, but you have s3transfer 0.3.7 which is incompatible.
anyio 2.1.0 requires idna>=2.8, but you have idna 2.7 which is incompatible.
aiobotocore 1.3.0 requires botocore<1.20.50,>=1.20.49, but you have botocore 1.18.18 which is incompatible.
I can provide multiple tracelogs here:
SageMaker Notebook Instance AL2:
tracelog-sagemaker-notebook-instance-al2.txt
SageMaker Notebooks (SM Studio):
tracelog-sagemaker-notebooks-studio.txt
Macbook Pro:
tracelog-macbook-pro.txt
Hi.
First, this is a great piece of software! Really enjoyed it!
A teammate made your demo work on his local machine (with CUDA enabled) and then pushed it seamlessly to the cloud, where it also worked.
Alas, I do not have a GPU on my MacBook, so I cannot use a CUDA docker image locally.
How would you recommend solving this?