
amazon-sagemaker-mlops-with-featurestore-and-datawrangler's Introduction

Operationalize a Machine Learning model with Amazon SageMaker Featurestore and Amazon SageMaker DataWrangler Using CDK

Objective

The goal of this project is to demonstrate an end-to-end machine learning workflow that includes the following automated pipelines:

  • Feature processing jobs that store their output in Feature Store
  • Model training and validation
  • Real-time endpoint deployment, including an API Gateway and a Lambda function that integrates the request payload with features from the Feature Store
  • Batch inference, to periodically score a large dataset. The resulting inferences are automatically uploaded to DynamoDB and served through an API Gateway
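As a rough illustration of the real-time path, the sketch below shows how a Lambda function might combine a request payload with a record read from the Feature Store. It is not the project's actual handler: the function names, the feature names, and the rule that request values override stored features are all assumptions. The only API used is `get_record` on the boto3 `sagemaker-featurestore-runtime` client, which is passed in as an argument so the logic can be exercised without AWS access.

```python
from typing import Any, Dict, List


def record_to_features(record: List[Dict[str, str]]) -> Dict[str, str]:
    """Flatten a Feature Store record into a {feature_name: value} mapping.

    GetRecord returns a list of {"FeatureName": ..., "ValueAsString": ...} items.
    """
    return {item["FeatureName"]: item["ValueAsString"] for item in record}


def fetch_and_merge(
    featurestore_client: Any,
    feature_group_name: str,
    record_id: str,
    payload: Dict[str, str],
) -> Dict[str, str]:
    """Read one record and merge it with the request payload.

    `featurestore_client` is expected to behave like
    boto3.client("sagemaker-featurestore-runtime").
    """
    response = featurestore_client.get_record(
        FeatureGroupName=feature_group_name,
        RecordIdentifierValueAsString=record_id,
    )
    # "Record" is absent from the response when no record matches the identifier.
    features = record_to_features(response.get("Record", []))
    # Assumption: values sent in the request take precedence over stored features.
    return {**features, **payload}
```

In a Lambda handler, the merged dictionary would then be serialized in the order the model expects and sent to the SageMaker endpoint.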

Each pipeline is deployed by CodePipeline from its own repository.
The entire workflow is described by a single CloudFormation (CFN) template.
The CloudFormation template serves as the basis for a custom SageMaker Project.

Environment

Deploying (and developing) this project requires two environments to be installed and configured:

  • CDK
  • Python

CDK

The root project uses CDK to generate the CFN templates. For instructions on how to install CDK, check the relevant documentation.

Python

The minimum Python version is 3.8. Package dependencies are declared in the pyproject.toml file and managed with Poetry.
To set up the Python virtual environment:

  1. install poetry
  2. cd into the folder of the project
  3. poetry install --no-dev

To activate the venv created by poetry:

~$ poetry shell

Deployment using AWS Cloud9

To deploy the project from AWS Cloud9, install Python 3.8.x, npm, and CDK. Start from a new environment based on Amazon Linux 2 (AL2); a t3.small instance is sufficient.

Install python 3.8.*

~$ sudo amazon-linux-extras enable python3.8
~$ sudo yum install python38 -y

Install Poetry

~$ curl -sSL https://raw.githubusercontent.com/python-poetry/poetry/master/install-poetry.py | python -

Update npm and install CDK

~$ npm install -g npm@latest cdk@latest

If necessary, bootstrap CDK in the account

~$ export CDK_NEW_BOOTSTRAP=1
~$ cdk bootstrap
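Before moving on to the deployment, the prerequisites installed above can be checked with a short script. The sketch below is not part of the repository; it only verifies that the interpreter is at least Python 3.8 and that `poetry`, `npm`, and `cdk` are on the PATH. The function names are illustrative.

```python
import shutil
import sys
from typing import Callable, Iterable, List, Optional, Sequence

MIN_PYTHON = (3, 8)
REQUIRED_TOOLS = ("poetry", "npm", "cdk")


def python_is_supported(version_info: Sequence[int] = sys.version_info) -> bool:
    """Return True when the (major, minor) version is at least MIN_PYTHON."""
    return tuple(version_info[:2]) >= MIN_PYTHON


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> List[str]:
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


def preflight() -> List[str]:
    """Collect human-readable problems; an empty list means ready to deploy."""
    problems = []
    if not python_is_supported():
        problems.append("Python 3.8 or newer is required")
    problems.extend(f"'{tool}' was not found on PATH" for tool in missing_tools())
    return problems


if __name__ == "__main__":
    issues = preflight()
    for issue in issues:
        print(issue)
    sys.exit(1 if issues else 0)
```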

Deploy the CDK project

After cloning the repository, cd into the repository root folder, then

  1. Install and activate the poetry environment
~$ poetry install
~$ poetry shell
  2. Check that the project synthesizes without issues
(env-name)~$ cdk synth
  3. Deploy with CDK
(env-name)~$ cdk deploy

Solution

The solution consists of an Amazon SageMaker Project that deploys three CI/CD pipelines in CodePipeline.

cicd-diagram.drawio

Each pipeline consists of

  1. Source stage, a CodeCommit repository
  2. Synth stage, which synthesizes a CDK project into a CloudFormation template
  3. Manual approval
  4. Deploy stage, which deploys the CloudFormation template
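The ordering of the four stages, and the effect of the manual approval gate, can be modeled in a few lines. This is a toy sketch for illustration only: it does not call CodePipeline, and the stage identifiers are made up.

```python
from typing import List

# Stage order as listed above; the names are illustrative labels.
STAGES = ["Source", "Synth", "ManualApproval", "Deploy"]


def completed_stages(approved: bool) -> List[str]:
    """Return the stages that complete for a given approval decision."""
    done = []
    for stage in STAGES:
        if stage == "ManualApproval" and not approved:
            # Without approval the pipeline stops before Deploy.
            break
        done.append(stage)
    return done
```

With `approved=False` only the Source and Synth stages complete, so the synthesized CloudFormation template is never deployed.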

After all pipelines have completed their executions, the resulting architecture looks like the diagram below.

architecture.drawio

For the entire architecture to deploy successfully, the expected raw data must be uploaded to the specified S3 location. The location, along with other reference parameters, is stored in Systems Manager Parameter Store.
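The upload step could be scripted along the following lines. This is a sketch under stated assumptions: the parameter is assumed to hold an `s3://bucket/prefix` URI, and the parameter name in the usage note is hypothetical, so check the actual names and value format in Parameter Store after deployment. The clients are passed in as arguments; with boto3 they would be `boto3.client("ssm")` and `boto3.client("s3")`.

```python
import os
from typing import Any, Tuple
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> Tuple[str, str]:
    """Split 's3://bucket/prefix' into (bucket, prefix)."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"not an S3 URI: {uri!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def upload_raw_data(
    ssm_client: Any, s3_client: Any, parameter_name: str, local_path: str
) -> str:
    """Look up the raw-data location and upload one local file to it."""
    # Assumption: the parameter value is an S3 URI.
    location = ssm_client.get_parameter(Name=parameter_name)["Parameter"]["Value"]
    bucket, prefix = split_s3_uri(location)
    file_name = os.path.basename(local_path)
    key = f"{prefix.rstrip('/')}/{file_name}" if prefix else file_name
    s3_client.upload_file(local_path, bucket, key)
    return f"s3://{bucket}/{key}"
```

A call such as `upload_raw_data(boto3.client("ssm"), boto3.client("s3"), "/hypothetical/raw-data-location", "data.csv")` returns the URI of the uploaded object.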

The SageMaker Project template also includes a Demo repository that contains two Jupyter notebooks, which offer a walkthrough of the demo features and an overview of the data scientist's workflow.

Running Costs

This section outlines cost considerations for running the Drift Detection Pipeline. Completing the pipeline will deploy an endpoint with one production variant, which will cost less than $4 per day. Further cost breakdowns are below.

  • CodeBuild – Charges per minute used. First 100 minutes each month come at no charge. For information on pricing beyond the first 100 minutes, see AWS CodeBuild Pricing.

  • CodeCommit – $1/month.

  • CodePipeline – CodePipeline costs $1 per active pipeline* per month. Pipelines are free for the first 30 days after creation. More can be found at AWS CodePipeline Pricing.

  • SageMaker – Prices vary based on EC2 instance usage for the Studio Apps, Model Hosting, Model Training and Model Monitoring; each charged per hour of use. For more information, see Amazon SageMaker Pricing.

    • The four ml.m5.xlarge baseline, dataset creation and inference jobs run for approximately 1 minute at $0.23 an hour.
    • The one ml.m5.large instance for the staging hosting endpoint costs $0.144 per hour, or $3.456 per day.
    • The two ml.m5.4xlarge instances for DataWrangler processing jobs run for approximately 1 minute at $0.92 per hour.
    • The ml.m5.xlarge instances for the model monitor schedule are billed at $0.92 an hour and cost less than $1 per day.
    • The ml.c5.xlarge instance for Clarify runs for approximately 2 minutes at $0.235 an hour.
    • The ml.m4.xlarge instance for training runs for approximately 1 minute at $0.30 an hour.
  • S3 – Low cost; prices vary depending on the size of the models/artifacts stored. The first 50 TB each month will cost only $0.023 per GB stored. For more information, see Amazon S3 Pricing.

  • Lambda – Low cost, $0.20 per 1 million requests. See AWS Lambda Pricing.
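The arithmetic behind these figures can be reproduced with a short calculation. The sketch below uses only the rates and durations quoted in the list above, and assumes the Clarify rate is hourly like the others. Actual prices vary by region and over time, so treat the results as rough estimates.

```python
MINUTES_PER_HOUR = 60
HOURS_PER_DAY = 24


def job_cost(hourly_rate: float, minutes: float, count: int = 1) -> float:
    """Cost in USD of `count` jobs that each run `minutes` at `hourly_rate`."""
    return hourly_rate / MINUTES_PER_HOUR * minutes * count


# Always-on staging endpoint: 0.144 USD/hour for 24 hours.
endpoint_per_day = 0.144 * HOURS_PER_DAY

# Short-lived jobs for one pipeline run, using the figures quoted above.
short_jobs = {
    "ml.m5.xlarge baseline/dataset/inference jobs": job_cost(0.23, 1, count=4),
    "ml.m5.4xlarge DataWrangler processing jobs": job_cost(0.92, 1, count=2),
    "ml.c5.xlarge Clarify job (rate assumed hourly)": job_cost(0.235, 2),
    "ml.m4.xlarge training job": job_cost(0.30, 1),
}
jobs_per_run = sum(short_jobs.values())


def lambda_request_cost(requests: int) -> float:
    """Request charge in USD at 0.20 USD per 1 million requests."""
    return requests / 1_000_000 * 0.20


if __name__ == "__main__":
    print(f"Endpoint per day: ${endpoint_per_day:.3f}")
    for name, cost in short_jobs.items():
        print(f"{name}: ${cost:.4f}")
    print(f"Short-lived jobs per run: ${jobs_per_run:.4f}")
```

The always-on endpoint dominates the total: the short-lived jobs together come to a few cents per run.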

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

amazon-sagemaker-mlops-with-featurestore-and-datawrangler's People

Contributors

acere, amazon-auto, dalacan
