kenza-ai / sagify
LLMs and Machine Learning done easily
Home Page: https://kenza-ai.github.io/sagify/
License: MIT License
Once the transform job completes (given we have waited for it), we can add a call to describe-transform-job, check the Status and FailureReason fields, and return them.
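A minimal sketch of what that check could look like. The helper is kept pure (it only interprets the describe-transform-job response dict) so it can be tested without AWS access; the function name and the `sm_client` variable in the comment are illustrative, not Sagify's actual code.

```python
def transform_job_outcome(description):
    """Map a describe_transform_job response to (succeeded, failure_reason)."""
    status = description.get("TransformJobStatus")
    if status == "Completed":
        return True, None
    # For 'Failed' (or any other terminal status) surface the reason.
    return False, description.get("FailureReason", "Unknown failure")

# With a real boto3 SageMaker client this would be roughly:
#   description = sm_client.describe_transform_job(TransformJobName=job_name)
#   ok, reason = transform_job_outcome(description)
```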
I’d like to be able to define a name for the endpoint when I call ‘deploy’ for easier later processing and testing.
Hello,
in the constructor of SageMakerClient class (sagemaker.py), I believe that the session should be created before the logic to assume the role. As it currently is, we try to assume a specific role with the default profile which might not contain credentials.
If a boto3 session is created beforehand and the STS client is derived from that session, e.g.
sts_client = self.boto_session.client('sts')
the client will use the defined profile and region and correctly locate the credentials in the config.
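A sketch of the proposed ordering. The session factory is injected here purely so the ordering can be exercised without AWS credentials; in the real constructor `make_session` would simply be `boto3.Session`.

```python
def build_sts_client(make_session, aws_profile, aws_region):
    # 1. The session is created first, bound to the chosen profile/region...
    session = make_session(profile_name=aws_profile, region_name=aws_region)
    # 2. ...and only then is the STS client derived from that session, so a
    #    later assume-role call uses the profile's credentials rather than
    #    whatever the default profile happens to contain.
    return session.client("sts")

# With boto3 this would be called as:
#   sts_client = build_sts_client(boto3.Session, "my-profile", "eu-west-1")
```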
Looks like the default value for the failure output directory passed to train should be 'failure' instead of 'model', based on the SageMaker docs.
That is, the following:
default=os.path.join(_DEFAULT_PREFIX_PATH, 'model'),
should be:
default=os.path.join(_DEFAULT_PREFIX_PATH, 'failure'),
Hi,
Sagemaker SDK v2 is coming very soon. It has a number of breaking changes that will impact Sagify. Nothing huge, but you may want to adapt the code.
aws/sagemaker-python-sdk#1459
https://sagemaker.readthedocs.io/en/v2.0.0.rc1/v2.html
Hi,
Thanks for the project.
While testing Sagify with sagify cloud train, I've noticed that when we do not set the --use-spot-instances flag, it fails with the following message.
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: Invalid MaxWaitTimeInSeconds. It is only supported when EnableManagedSpotTraining is set to true
As far as I understood, setting train_max_wait=3600 here causes the problem.
Can we default it to None instead, or am I missing something?
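A hedged sketch of that default: only pass MaxWaitTimeInSeconds when spot instances are enabled, since SageMaker rejects it otherwise. The helper name and shape are illustrative, not Sagify's actual code.

```python
def stopping_condition_kwargs(use_spot_instances, train_max_wait=None):
    """Build the spot-related CreateTrainingJob arguments conditionally."""
    kwargs = {"EnableManagedSpotTraining": use_spot_instances}
    if use_spot_instances:
        # MaxWaitTimeInSeconds is only valid when managed spot training
        # is enabled, so it is added only on this branch.
        kwargs["StoppingCondition"] = {
            "MaxWaitTimeInSeconds": train_max_wait or 3600
        }
    return kwargs
```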
Using Sagify 0.20.4.
I'd like to be able to define the endpoint name when deploying a model for later analysis / configuration changes etc.
We'd have to add the new param on the deploy method as usual.
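A sketch of how the new param could be threaded through; the function name and the fallback naming scheme are illustrative, not Sagify's actual signature.

```python
def resolve_endpoint_name(model_name, endpoint_name=None):
    # Keep today's auto-generated behaviour as the default; only use the
    # caller-supplied name when one was passed to deploy.
    return endpoint_name or f"{model_name}-endpoint"

# deploy(..., endpoint_name=None) would then pass the resolved name on to
# the underlying SageMaker deploy call.
```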
I'm having issues when deploying a model on AWS.
gunicorn fails with ModuleNotFoundError: No module named 'sagify.prediction'. From what I can tell, this started when #106 was merged to fix an import clash.
Hi,
When I specify a role on sagify cloud train, it fails if I do not also specify the ExternalId, with the following error.
Invalid type for parameter ExternalId, value: None, type: <class 'NoneType'>, valid types: <class 'str'>
In the boto3 STS docs it is defined as ExternalId='string', so we cannot pass None.
Can we assume the role here without ExternalId if it is not specified via the CLI arguments?
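A sketch of the suggested fix: build the assume_role arguments conditionally so ExternalId is omitted entirely (rather than passed as None) when the CLI did not supply one. The helper name is illustrative.

```python
def assume_role_kwargs(role_arn, session_name, external_id=None):
    """Build kwargs for sts_client.assume_role, omitting ExternalId when unset."""
    kwargs = {"RoleArn": role_arn, "RoleSessionName": session_name}
    if external_id:
        # boto3 requires a string here, so leave the key out instead of
        # passing None when no external id was given on the CLI.
        kwargs["ExternalId"] = external_id
    return kwargs

# Then: sts_client.assume_role(**assume_role_kwargs(role_arn, session_name, external_id))
```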
Using Sagify 0.20.4
The latest SageMaker release as of this writing is 1.14.2, while Sagify currently targets ~1.5.
It's probably worth bringing it up to date.
Instead of creating a repository per image.
I get the following error with the latest version, but not with sagify version 0.10:
An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: Value 'XXXXXXXXXXXX.dkr.ecr.us-east-1.amazonaw-2018-07-31-16-42-36-397' at 'trainingJobName' failed to satisfy constraint: Member must satisfy regular expression pattern: ^[a-zA-Z0-9](-*[a-zA-Z0-9])*
The problem:
I try to push a docker image to ECR using sagify push, following the steps in https://kenza-ai.github.io/sagify/, but I get this error message:
Error when retrieving credentials from Ec2InstanceMetadata: No credentials found in credential_source referenced in profile :aws:iam::...:role/SagemakerSagify
Traceback (most recent call last):
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/bin/sagify", line 8, in <module>
sys.exit(cli())
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/click/decorators.py", line 27, in new_func
return f(get_current_context().obj, *args, **kwargs)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/sagify/commands/push.py", line 58, in push
image_name=image_name)
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/site-packages/sagify/api/push.py", line 37, in push
image_name])
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/subprocess.py", line 411, in check_output
**kwargs).stdout
File "/media/daniel/Data/linuxprogs/miniconda3/envs/sagify/lib/python3.7/subprocess.py", line 512, in run
output=stdout, stderr=stderr)
subprocess.CalledProcessError: Command '['src/sagify_base/push.sh', 'latest', 'ap-southeast-1', '', 'default', '', 'sagify-demo']' returned non-zero exit status 255.
Additional Context:
In the select-IAM-user step, I chose the only user available: the admin user created according to the AWS guide.
In the credentials and config files the profile name is [default]. I added the role_arn according to the sagify guide and source_profile = default in the AWS config file.
If I run sagify push, it seems sagify created a new profile with the name of the role ARN I created.
I need help resolving this error.
Is there a way to orchestrate retraining of models with Sagify?
Example: I would like to retrain my model every week and redeploy my endpoint. Do you have any recommendations about this?
Thanks in advance.
tl;dr: Proposing the replacement of the init command with a configure command, a la aws configure, to allow certain config values to change after the creation of a project.
Issue
The init command hardcodes most (all?) of its arguments throughout the project for the cookiecutter step to pick up and replace in various places, for example in the build script.
This causes issues when someone wants to change any of the initial arguments, such as the project/image name, Python version, etc.
Suggestion
Addition of a configure command that can be called whenever someone wishes to change any of the original values. The command would do exactly what init does, but instead of hardcoding the values around the project via cookiecutter, it would place them all in a local config file that is updated every time configure is called again.
The other commands can consult that file to load their config each time they run.
To reproduce on my laptop, which uses Python 3.7.9:
python -m venv venv
source venv/bin/activate
pip install sagemaker
pip install sagify
sagify init
This throws
Traceback (most recent call last):
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 584, in _build_master
ws.require(__requires__)
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 901, in require
needed = self.resolve(parse_requirements(requirements))
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 792, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (urllib3 1.24.3 (/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages), Requirement.parse('urllib3<1.27,>=1.25.4; python_version != "3.4"'), {'botocore'})
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/bin/sagify", line 6, in <module>
from pkg_resources import load_entry_point
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3261, in <module>
@_call_aside
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3245, in _call_aside
f(*args, **kwargs)
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 3274, in _initialize_master_working_set
working_set = WorkingSet._build_master()
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 586, in _build_master
return cls._build_from_requirements(__requires__)
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 599, in _build_from_requirements
dists = ws.resolve(reqs, Environment())
File "/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages/pkg_resources/__init__.py", line 792, in resolve
raise VersionConflict(dist, req).with_context(dependent_req)
pkg_resources.ContextualVersionConflict: (urllib3 1.24.3 (/Users/nicksorros/code/test/sagemaker/sagify/venv/lib/python3.7/site-packages), Requirement.parse('urllib3<1.27,>=1.25.4; python_version != "3.4"'), {'botocore'})
Happy to provide more details about my environment. My current theory is that it's either related to my Python version (I have tried on AWS with Python 3.6 and it works) or to the sagemaker version.
Optimize the docker image for training and deployment, i.e. reduce its size so it uploads faster to ECR.
If I pass any name during init when setting up a new project, I always get 'src' as the sagify module dir.
Is this by design @pm3310 ?
Issue: no output is returned to the user when there is an error in the Python application during local deployment with Sagify.
Follow the Getting Started guide; when adding code to the predict() function, introduce a syntax error, e.g. inconsistent indentation. Then continue following the guide and deploy locally with sagify local deploy -d src.
 ____              _  __
/ ___| __ _ __ _(_)/ _|_ _
\___ \ / _` |/ _` | | |_| | | |
___) | (_| | (_| | | _| |_| |
|____/ \__,_|\__, |_|_| \__, |
|___/ |___/
Started local deployment at localhost:8080 ...
For others, from @pm3310, in order to debug the error, try this:
1. Comment out the ENTRYPOINT in the Dockerfile under the src/sagify/ module. It should look like this: #ENTRYPOINT ["sagify/executor.sh"]
2. sagify build -d src -r requirements.txt
3. docker run -it deep-learning-addition-img /bin/bash. You'll enter the docker image with this command.
4. Run sagify/executor.sh; this should give you your error.
Hi,
Facing a problem building the image:
The command '/bin/sh -c pip install -r ../sagify-requirements.txt && rm -rf /root/.cache' returned a non-zero code: 1
And
Command '[u'src/sagify/build.sh', u'src', u'src', u'src/sagify/Dockerfile', u'src/requirements.txt', u'latest']' returned non-zero exit status 1
I think the problem is with the requirements.txt. I am using the exact same file, just adding 2 new packages, sklearn and pandas.
Please, I need your help!
Running sagify init and selecting the default AWS profile by pressing Enter instead of typing 1 throws an IndexError:
Select AWS profile:
1 - default
Choose from 1 [1]:
Traceback (most recent call last):
File "/data/code/sagemaker/sagify-demo/venv/bin/sagify", line 8, in <module>
sys.exit(cli())
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/sagify/commands/initialize.py", line 129, in init
aws_profile, aws_region = ask_for_aws_details()
File "/data/code/sagemaker/sagify-demo/venv/lib/python3.6/site-packages/sagify/commands/initialize.py", line 97, in ask_for_aws_details
chosen_profile = available_profiles[chosen_profile_index]
IndexError: list index out of range
This is probably because the validation, which decrements the index, does not run in this case, so the default choice of 1 remains; with a single profile that throws an IndexError, and with multiple profiles the wrong profile is selected.
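A sketch of a possible fix: normalise the user's 1-based choice (including the accepted default) to a 0-based index in one place, instead of relying on validation that is skipped when the default is taken. The function name is illustrative.

```python
def choose_profile(available_profiles, raw_choice=None):
    """Return the selected profile; raw_choice is the user's 1-based input."""
    # Pressing Enter (no input) falls back to the first profile, exactly
    # like typing 1 would.
    choice = int(raw_choice) if raw_choice else 1
    if not 1 <= choice <= len(available_profiles):
        raise ValueError("choice out of range")
    # Single point where the 1-based prompt value becomes a 0-based index.
    return available_profiles[choice - 1]
```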
We currently support logging metrics for Hyperparameter Tuning Jobs; we should expand metrics support to non-hyperparameter-tuning Training Jobs too.
I would find it helpful to be able to specify a tag for the Docker image being used. This would allow a user to maintain a Docker build history, and to easily switch between different models.
For example, I would like to be able to run the command sagify --tag test build -d src
to build an image but tag it as 'test' instead of 'latest', the default tag.
Since all of the commands access the Docker image under the hood, I think this needs to be a global option available to all subcommands, similar to the --verbose flag.
I see two ways to approach this:
Use the click library's built-in context object. This would make it easy to pass the option to subcommands, but it would require changing the method signature for each subcommand.
Create a service class and store the current tag as a global var. This global could be imported where needed. This approach does not change any method signatures but requires an extra import.
I would prefer to go with option 1, since it would allow more global arguments to be added more easily in the future. It would also keep the structure cleaner by not requiring an extra class to be created and maintained.
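A minimal sketch of option 1. With click this would be `ctx.obj` populated in the group callback and handed to each subcommand via `@click.pass_obj`; a stand-in Context class is used here so the pattern can be shown without the click dependency, and all names are illustrative.

```python
class Context:
    """Stand-in for click's ctx.obj carrying global CLI options."""
    def __init__(self, tag="latest", verbose=False):
        self.tag = tag
        self.verbose = verbose


def cli(tag="latest", verbose=False):
    # click equivalent: @click.group() with
    # @click.option('--tag', default='latest') storing the value on ctx.obj.
    return Context(tag=tag, verbose=verbose)


def build(ctx, source_dir="src"):
    # Every subcommand reads the globally chosen tag from the shared
    # context instead of hardcoding 'latest'.
    return f"building image {source_dir}:{ctx.tag}"
```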
@pm3310 lmk which seems more consistent with the project, and I can build it.
What
Ability (a new command, or probably better an update option on the existing deploy command) to update an endpoint's configuration. As of today, if we deploy an endpoint there's no way of amending its configuration, e.g. the instance type.
Running deploy with different params and the same endpoint name does not work, unfortunately; SageMaker throws a duplicate endpoint exception.
The current deep learning addition example is a bit complicated, especially the part about where to place input data. It should have no manual steps.
All commands that end up running shell scripts (e.g. build, push) use subprocess.check_output for execution, which only returns the output after the subprocess running the script exits.
It would be nice to have more immediate feedback on what's happening. Pushing to ECR, for example, can take a really long time (sometimes hours), but a user only sees the following on the command line.
Started pushing Docker image to AWS ECS. It will take some time. Please, be patient...
It would be nice to have some indication of progress across the API since many (most?) commands have long execution time potential.
Suggestions
Use an alternative method of calling subprocess that allows streaming of output, or switch to Python scripts and a corresponding Python docker client.
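A minimal sketch of the first suggestion: replace check_output with subprocess.Popen and iterate over stdout line by line, echoing progress as it happens instead of returning everything at the end.

```python
import subprocess
import sys


def run_streaming(cmd):
    """Run cmd, echoing each output line as it arrives; return full output."""
    proc = subprocess.Popen(
        cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
    lines = []
    for line in proc.stdout:  # yields lines as the script produces them
        sys.stdout.write(line)
        lines.append(line)
    returncode = proc.wait()
    if returncode != 0:
        # Mirror check_output's behaviour on failure.
        raise subprocess.CalledProcessError(returncode, cmd, "".join(lines))
    return "".join(lines)
```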
I started using MLflow the other day and it looks very promising. I was happy with the MLflow API in general, but I couldn't find a way to deploy a model (trained via mlflow commands) on SageMaker.
Additionally, mlflow supports only local training. I think we could enhance mlflow by enabling training and deployment on SageMaker via Sagify.
What do you think about this @berlinquin ?
In the first get_model() call, the weights need to be loaded and the model initialized, which takes 10 minutes or so. The problem is that the SageMaker endpoint times out after 60 seconds, so the model never has time to initialize.
Would you have any suggestions on ways to address this?
Thanks
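One common workaround, sketched below under the assumption that loading can be moved to container startup (this is an illustration, not Sagify's actual serving code): cache the model and trigger the expensive load when the serving process boots, so the 60-second invocation timeout never covers weight loading.

```python
_MODEL = None


def _load_weights():
    # Stand-in for the real ten-minute weight load.
    return object()


def get_model():
    # Cache the model so the expensive load happens exactly once per process.
    global _MODEL
    if _MODEL is None:
        _MODEL = _load_weights()
    return _MODEL


# Trigger the load at container startup (e.g. at module import, or from a
# gunicorn worker init hook) so the first /invocations request finds a
# warm model instead of paying the load cost inside the request:
get_model()
```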
I was curious to know, because this library saved me a lot of time. I spent the whole weekend looking for ways to integrate a spaCy model and had given up hope.
dir is a required option in most commands. With the imminent implementation of #59, we can store it as a config variable and make it optional or remove it from all commands.
The advantage I see here is that this value rarely, if ever, changes, so passing it every time is unnecessary.
sagify push doesn't work if the optional arguments --aws-region and --aws-profile are not specified.
The reason is that in the file ./api/push.py the Python function subprocess.check_output([...]) complains if any of the arguments is None.
Additionally, [-z "$profile"] in push.sh needs to be written with spaces: [ -z "$profile" ]
Hello,
in the push.sh template file, we try to log in to ECR with
aws ecr get-login --profile ${profile} --region ${region} --no-include-email
This does not work with AWS CLI v2. The correct command is
aws --region <region> ecr get-login-password --profile <profile> | docker login --username AWS --password-stdin <aws account number>.dkr.ecr.<region>.amazonaws.com
Please see this issue for more information.
Some data scientists/ML engineers prefer to use conda envs. Sagify should support reading conda environment.yml
files when building docker images.
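A hedged sketch of how that support could work: translate an already-parsed environment.yml (PyYAML would handle the parsing step) into the install lines a Dockerfile template could use. The key names follow the standard environment.yml schema; the function name is illustrative.

```python
def docker_install_lines(env):
    """Turn a parsed environment.yml dict into Dockerfile RUN lines."""
    conda_deps, pip_deps = [], []
    for dep in env.get("dependencies", []):
        if isinstance(dep, dict) and "pip" in dep:
            # The nested 'pip:' section lists pip-only packages.
            pip_deps.extend(dep["pip"])
        else:
            conda_deps.append(dep)
    lines = []
    if conda_deps:
        lines.append("RUN conda install -y " + " ".join(conda_deps))
    if pip_deps:
        lines.append("RUN pip install " + " ".join(pip_deps))
    return lines
```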
When a batch transform job fails, Sagify still returns a 0 exit code.
We need to check the SageMaker status returned here and exit with 1 (or similar) if the status is Failed.
(Not sure if we can future-proof this in case SageMaker changes the literal from Failed to something else, but that should be easy to catch.)
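A sketch of the exit-code mapping. Treating anything other than Completed as a failure is one way to hedge against SageMaker renaming the Failed literal; the function name is illustrative.

```python
import sys


def exit_code_for_status(status):
    # SageMaker currently reports 'Failed' for failed batch transform jobs;
    # treating every non-'Completed' terminal status as a failure guards
    # against the literal changing.
    return 0 if status == "Completed" else 1


# At the end of the CLI command:
#   sys.exit(exit_code_for_status(job_status))
```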
Hi,
Can we add support for SageMaker's Autopilot functionality?
Best,
Please add the motivation for this package to the README.md.
Why should one use this? What limitation of SageMaker is it trying to solve?
What can it do that SageMaker cannot?
Support batch prediction jobs, given a trained ML model.
There are many use cases where a model should periodically score new data instances in batch, e.g. from S3.
Input
Output
Hi,
Receiving this error when running the sagify cloud train command:
botocore.exceptions.ClientError: An error occurred (ValidationException) when calling the CreateTrainingJob operation: 1 validation error detected: failed to satisfy constraint: Member must satisfy regular expression pattern: ^[a-zA-Z0-9](-*[a-zA-Z0-9])*
I am following the rest of the steps exactly as described.
Please refer to the attachment for my training code:
tran_originalf.txt
Regards,
Vasu Roy
On all the devices I've tested (SageMaker Notebooks, SageMaker Notebook Instances on AL1 and AL2, a personal MacBook on 11.6) I haven't been able to download and install sagify (neither the current nor older versions). This is caused by dependencies that fail to coexist. In particular, pip tries to download and install every boto3 and botocore package in existence, until eventually finding that:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
anaconda-project 0.9.1 requires ruamel-yaml, which is not installed.
docker-compose 1.29.2 requires docker[ssh]>=5, but you have docker 3.7.3 which is incompatible.
black 20.8b1 requires click>=7.1.2, but you have click 7.0 which is incompatible.
awscli 1.21.3 requires botocore==1.22.3, but you have botocore 1.18.18 which is incompatible.
awscli 1.21.3 requires s3transfer<0.6.0,>=0.5.0, but you have s3transfer 0.3.7 which is incompatible.
anyio 2.1.0 requires idna>=2.8, but you have idna 2.7 which is incompatible.
aiobotocore 1.3.0 requires botocore<1.20.50,>=1.20.49, but you have botocore 1.18.18 which is incompatible.
I can provide multiple tracelogs here:
SageMaker Notebook Instance AL2:
tracelog-sagemaker-notebook-instance-al2.txt
SageMaker Notebooks (SM Studio):
tracelog-sagemaker-notebooks-studio.txt
Macbook Pro:
tracelog-macbook-pro.txt
Hi.
First, this is a great piece of software! Really enjoyed it!
A teammate made your demo work on his local machine (with CUDA enabled) and then pushed it seamlessly to the cloud, where it also worked.
Alas, I do not have a GPU on my MacBook, so I cannot use a CUDA docker image locally.
How would you recommend solving this?