Amazon SageMaker Built-in Algorithm MLOps Pipeline using AWS CDK

This repository provides an MLOps pipeline covering data ETL, model re-training, model archiving, model serving, and event triggering. Although the solution uses XGBoost as its example, it can be extended to other SageMaker built-in algorithms because it abstracts the model-training step of SageMaker's built-in algorithms. Various AWS services (Amazon SageMaker, AWS Step Functions, AWS Lambda) make up the pipeline, and all resources are modeled and deployed through AWS CDK.

Other "Using AWS CDK" series can be found at:

Solution Architecture

  • Data ETL: AWS Glue job for data ETL (extract/transform/load)
  • Model Build/Train: Amazon SageMaker built-in algorithm (XGBoost) and SageMaker training job
  • Model Archive/Serve: Amazon SageMaker Model/Endpoint for real-time inference
  • Pipeline Orchestration: AWS Step Functions for configuring the state machine of the MLOps pipeline
  • Programming-based IaC: AWS CDK for modeling & provisioning all AWS cloud resources (TypeScript)

(Figure: solution architecture)
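
As a rough illustration of how these pieces are orchestrated, the following is a minimal CDK sketch that chains a Glue ETL job and a SageMaker training job into one Step Functions state machine. This is not the repository's actual code: the construct IDs, the Glue job name, and the S3 key prefixes are hypothetical, and the real stack adds model creation, endpoint deployment, validation, and event triggering on top.

import * as cdk from 'aws-cdk-lib';
import * as ec2 from 'aws-cdk-lib/aws-ec2';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as sfn from 'aws-cdk-lib/aws-stepfunctions';
import * as tasks from 'aws-cdk-lib/aws-stepfunctions-tasks';
import { Construct } from 'constructs';

export class PipelineSketchStack extends cdk.Stack {
  constructor(scope: Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const assetBucket = new s3.Bucket(this, 'AssetBucket');

    // 1. Data ETL: start the Glue job and wait for it to finish (RUN_JOB = .sync).
    const etl = new tasks.GlueStartJobRun(this, 'DataETL', {
      glueJobName: 'churn-xgboost-etl', // hypothetical job name
      integrationPattern: sfn.IntegrationPattern.RUN_JOB,
      resultPath: sfn.JsonPath.DISCARD,
    });

    // 2. Model build/train: SageMaker training job using a built-in algorithm image.
    const train = new tasks.SageMakerCreateTrainingJob(this, 'TrainModel', {
      trainingJobName: sfn.JsonPath.stringAt('$.TrainingJobName'),
      algorithmSpecification: {
        trainingImage: tasks.DockerImage.fromRegistry(
          '306986355934.dkr.ecr.ap-northeast-2.amazonaws.com/xgboost:1'),
        trainingInputMode: tasks.InputMode.FILE,
      },
      inputDataConfig: [{
        channelName: 'train',
        contentType: 'text/csv',
        dataSource: {
          s3DataSource: {
            s3Location: tasks.S3Location.fromBucket(assetBucket, 'output/train/'),
          },
        },
      }],
      outputDataConfig: {
        s3OutputLocation: tasks.S3Location.fromBucket(assetBucket, 'model/'),
      },
      resourceConfig: {
        instanceCount: 1,
        instanceType: new ec2.InstanceType('c5.xlarge'), // rendered as ml.c5.xlarge
        volumeSize: cdk.Size.gibibytes(10),
      },
      stoppingCondition: { maxRuntime: cdk.Duration.hours(1) },
      integrationPattern: sfn.IntegrationPattern.RUN_JOB,
    });

    // 3. Chain the states into the MLOps state machine.
    new sfn.StateMachine(this, 'MLOpsStateMachine', {
      definition: etl.next(train),
    });
  }
}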

This solution refers to the amazon-sagemaker-examples notebook automate_model_retraining_workflow for applying the SageMaker built-in XGBoost algorithm. Please refer to the SageMaker documentation when applying other built-in algorithms.

Note that using SageMaker built-in algorithms is very convenient because we only need to focus on the data, not the model.

CDK-Project Build & Deploy

To define and provision AWS cloud resources efficiently, this project uses the AWS Cloud Development Kit (CDK), an open-source software development framework for defining cloud application resources in familiar programming languages.

(Figure: AWS CDK introduction)

Because this solution is implemented in CDK, the cloud resources can be deployed with the CDK CLI. In particular, TypeScript's type checking makes it easy and safe to configure the many parameters of the various cloud resources. In addition, combining CDK's programmability with design patterns yields more reusable assets.
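
As a small illustration of that type-safety claim (not code from this repository), misspelling a resource property in TypeScript fails at compile time rather than at deployment:

import * as s3 from 'aws-cdk-lib/aws-s3';
import { Duration, Stack } from 'aws-cdk-lib';

declare const stack: Stack; // some enclosing stack

new s3.Bucket(stack, 'AssetBucket', {
  versioned: true,
  lifecycleRules: [{ expiration: Duration.days(30) }],
  // versionned: true, // <- caught by the TypeScript compiler, not by CloudFormation
});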

Useful CDK commands

  • npm install: install the dependencies (TypeScript only)
  • cdk list: list all stacks
  • cdk deploy: deploy this stack to your default AWS account/region
  • cdk diff: compare the deployed stack with the current state
  • cdk synth: emit the synthesized CloudFormation template

Prerequisites

First of all, an AWS account and an IAM user are required. Then the following tools must be installed.

  • AWS CLI: aws --version
  • Node.js: node --version
  • AWS CDK: cdk --version
  • jq: jq --version

Please refer to the guide in the CDK Workshop.

Configure AWS Credential

aws configure --profile [your-profile] 
AWS Access Key ID [None]: xxxxxx
AWS Secret Access Key [None]: yyyyyyyyyyyyyyyyyyyyyyyyyyyyyy
Default region name [None]: us-east-2 
Default output format [None]: json
    
aws sts get-caller-identity --profile [your-profile]
...
...
{
    "UserId": ".............",
    "Account": "75157*******",
    "Arn": "arn:aws:iam::75157*******:user/[your IAM User ID]"
}

Check CDK project's entry-point

In this CDK project, the entry-point file is infra/app-main.ts, which is specified in cdk.json.

Set up deploy configuration

This project is based on aws-cdk-project-template-for-devops, which adopts configuration-driven development (CDD). So let's set up the configuration file (config/app-config-demo.json), which describes the deployment target (account/region) and the properties of each stack.

First of all, change the deployment target (account/region) in config/app-config-demo.json to match your AWS account environment.

{
    "Project": {
        "Name": "MLOps",   <----- your project name, all stacks wil be prefixed with [Project.Name+Project.Stage]
        "Stage": "Demo",           <----- your project stage, all stacks wil be prefixed with [Project.Name+Project.Stage]
        "Account": "75157*******", <----- update according to your AWS Account
        "Region": "us-east-2",     <----- update according to your target resion
        "Profile": "cdk-v2"      <----- AWS Profile, keep empty string if no profile configured
    },
    ...
    ...
}

Then set the path of this JSON configuration file in an environment variable:

export APP_CONFIG=config/app-config-demo.json

Through this external configuration injection, multiple deployments (multiple accounts, regions, and stages) are possible without code modification. For example, we can maintain several configuration files, such as app-config-dev.json, app-config-test.json, and app-config-prod.json, at the same time.
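
A minimal sketch of what this injection can look like inside the entry point (illustrative only; the repository's infra/app-main.ts and its helper types differ in detail, and a plain cdk.Stack stands in here for the project's MLOpsPipelineStack):

import * as fs from 'fs';
import * as cdk from 'aws-cdk-lib';

// The configuration file path is injected through the APP_CONFIG environment variable.
const configPath = process.env.APP_CONFIG ?? 'config/app-config-demo.json';
const appConfig = JSON.parse(fs.readFileSync(configPath, 'utf8'));

const app = new cdk.App();

// All stack names are prefixed with Project.Name + Project.Stage, e.g. "MLOpsDemo".
const prefix = `${appConfig.Project.Name}${appConfig.Project.Stage}`;

new cdk.Stack(app, `${prefix}-${appConfig.Stack.ChurnXgboostPipeline.Name}`, {
  env: { account: appConfig.Project.Account, region: appConfig.Project.Region },
});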

Install dependencies & Bootstrap

Execute the following command, which will check versions and install the dependencies on our behalf. For more details, open the script/setup_initial.sh file.

sh script/setup_initial.sh config/app-config-demo.json

Deploy stacks

Before deployment, execute the following command to check whether all configurations are ready.

cdk list
...
...
==> CDK App-Config File is config/app-config-demo.json, which is from Environment-Variable.

MLOpsDemo-ChurnXgboostPipelineStack

...
...

Check that you can see the list of stacks as shown above. If there is no problem, run the following command:

cdk deploy *ChurnXgboostPipelineStack --profile [optional: your profile name]

or

sh script/deploy_stacks.sh config/app-config-demo.json

Caution: this solution uses AWS services that are not in the free tier, so be mindful of the possible costs.

Check Deployment Results

You can find the deployment results in AWS CloudFormation as shown in the following picture.

(Screenshot: CloudFormation stacks)

You can also see a new state machine in Step Functions, which looks like this:

(Screenshot: Step Functions state machine definition)

How to trigger the StateMachine

Many resources (Lambda function, SageMaker training job, Glue ETL job) have been deployed, but not yet executed. Let's trigger them simply by uploading input data.

Prepare the input data

Download the sample data by running the following command; the sed command is used to remove the " character from each line.

sh codes/glue/churn-xgboost/script/download_data.sh

The sample data will be downloaded to codes/glue/churn-xgboost/data/input.csv.

Trigger the StateMachine in Step Functions

Just execute the following command:

sh codes/glue/churn-xgboost/script/upload_input.sh config/app-config-demo.json data/request-01.csv
...
...
upload: codes/glue/churn-xgboost/data/input.csv to s3://mlopsdemo-churnxgboostpipelinestack-asset-[region]-[account 5 digits]/input/data/request-01.csv

This command uploads the input.csv file into an S3 bucket named like mlopsdemo-churnxgboostpipelinestack-asset-[region]-[account 5 digits] under the key input/data/request-01.csv.

(Screenshot: S3 input data)

Check the execution result

Let's go to the Step Functions service in the web console, where we can see that a new execution is currently running. Click on it to check its current status, which looks like this:

(Screenshot: state machine execution in progress)

When all steps are completed, you can see the following results.

AWS Glue ETL Job (Screenshot: Glue ETL job completed)

Amazon SageMaker Training Job (Screenshot: SageMaker training job completed)

Caution: sometimes the training job can fail because the container image path is wrong. In that case, you may see the following exception:

(Screenshot: training job exception)

In this case, visit the SageMaker Docker Registry Paths page, select your region, and then select the algorithm. Finally, you will find the expected Docker image path. For example, if your choice is the us-east-2 region and the XGBoost algorithm, you will see a page like this: https://docs.aws.amazon.com/sagemaker/latest/dg/ecr-us-east-2.html#xgboost-us-east-2.title

Update the Docker image path in the app-config-demo.json file and deploy the stack again. Finally, if you upload the input data to S3 with a different S3 key (data/request-02.csv), the StateMachine will start again.

sh script/deploy_stacks.sh config/app-config-demo.json
...
...
sh codes/glue/churn-xgboost/script/upload_input.sh config/app-config-demo.json data/request-02.csv
...
...
upload: codes/glue/churn-xgboost/data/input.csv to s3://mlopsdemo-churnxgboostpipelinestack-asset-[region]-[account 5 digits]/input/data/request-02.csv

Amazon SageMaker Endpoint (Screenshot: SageMaker endpoint completed)

Caution: in the first deployment, the SageMaker Endpoint is newly created (Create Endpoint path); in the second deployment, it is updated (Update Endpoint path).

AWS Step Functions StateMachine (Screenshot: state machine completed)

Check the execution output

Internally, intermediate results are archived according to the following rules.

(Figure: S3 output naming rule)

AWS Glue ETL Job result in the S3 bucket (Screenshot: Glue ETL output)

Amazon SageMaker Training Job result in the S3 bucket (Screenshot: SageMaker training output)

How to invoke SageMaker-Endpoint

Finally, let's invoke SageMaker Endpoint to make sure it works well.

Before invoking, open the codes/glue/churn-xgboost/script/test_invoke.py file and update the profile name and the endpoint name according to your configuration.

...
...

os.environ['AWS_PROFILE'] = 'cdk-v2'
_endpoint_name = 'MLOpsDemo-churn-xgboost'

...
...

Invoke the endpoint by executing the following command:

python3 codes/glue/churn-xgboost/script/test_invoke.py
...
...
0 Invocation ------------------
>>input:  106,0,274.4,120,198.6,82,160.8,62,6.0,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0
>>label:  0
>>prediction:  0.37959378957748413
1 Invocation ------------------
>>input:  28,0,187.8,94,248.6,86,208.8,124,10.6,5,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0,1,0
>>label:  0
>>prediction:  0.03738965839147568
2 Invocation ------------------
>>input:  148,0,279.3,104,201.6,87,280.8,99,7.9,2,2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,1,0
>>label:  1
>>prediction:  0.9195730090141296
3 Invocation ------------------
>>input:  132,0,191.9,107,206.9,127,272.0,88,12.6,2,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,1,0
>>label:  0
>>prediction:  0.025062650442123413
4 Invocation ------------------
>>input:  92,29,155.4,110,188.5,104,254.9,118,8.0,4,3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,1,0,0,1
>>label:  0
>>prediction:  0.028299745172262192
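
For reference, the same invocation could be made from TypeScript with the AWS SDK v3. This is a hedged sketch, not part of the repository; the region and endpoint name must match your deployment, and the CSV row is one feature line as produced by the ETL step.

import {
  SageMakerRuntimeClient,
  InvokeEndpointCommand,
} from '@aws-sdk/client-sagemaker-runtime';

const client = new SageMakerRuntimeClient({ region: 'us-east-2' });

async function predict(csvRow: string): Promise<number> {
  const response = await client.send(new InvokeEndpointCommand({
    EndpointName: 'MLOpsDemo-churn-xgboost', // must match your deployed endpoint
    ContentType: 'text/csv',
    Body: csvRow,
  }));
  // The XGBoost built-in container returns the prediction score as a plain string.
  return Number(new TextDecoder().decode(response.Body));
}

// Usage: pass one CSV feature row (without the label column), e.g. a line
// produced from codes/glue/churn-xgboost/data/input.csv by the ETL step:
// predict(csvRow).then(score => console.log(score));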

How to re-use or upgrade

How to re-trigger the StateMachine in Step Functions

Because an S3 event trigger is registered on the Lambda function, the pipeline restarts whenever you upload a file with a different name (S3 key) under input/ in mlopsdemo-churnxgboostpipelinestack-asset-[region]-[account].

(Screenshot: Lambda S3 event trigger)
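
The wiring behind this behavior looks roughly like the following CDK sketch (illustrative; in the actual stack the bucket and the trigger function are created elsewhere):

import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3n from 'aws-cdk-lib/aws-s3-notifications';

// Assumed to exist in the surrounding stack.
declare const assetBucket: s3.Bucket;
declare const triggerFunction: lambda.Function;

// Invoke the Lambda only for new objects under the "input/" prefix; the Lambda
// then starts a new StateMachine execution named after the uploaded key.
assetBucket.addEventNotification(
  s3.EventType.OBJECT_CREATED,
  new s3n.LambdaDestination(triggerFunction),
  { prefix: 'input/' },
);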

The input S3 key generates a unique title, which is used as the TrainingJobName, the StateMachine execution name, and the S3 output key.

(Figure: output naming rule)
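
A sketch of how such a unique title can be derived from the key (the repository's Lambda implements its own equivalent; the exact format here is assumed):

// "input/data/request-01.csv" -> e.g. "request-01-1700000000000"
function uniqueTitleFromKey(s3Key: string): string {
  const fileName = s3Key.split('/').pop() ?? s3Key;
  const base = fileName.replace(/\.[^.]+$/, ''); // strip the extension
  return `${base}-${Date.now()}`;                // timestamp keeps titles unique
}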

How to change deploy-configuration

The ChurnXgboostPipeline section in the config/app-config-demo.json file provides various deployment options. Change these values, re-deploy the stack, and trigger it again.

"ChurnXgboostPipeline": {
    "Name": "ChurnXgboostPipelineStack",

    "EndpointName": "churn-xgboost", <----- SageMaker Endpoint Name, and other resource name

    "GlueJobFilePath": "codes/glue/churn-xgboost/src/glue_etl.py", <----- Glue ETL Job Code
    "GlueJobTimeoutInMin": 30, <----- Glue ETL Job Timeout

    "TrainContainerImage": "306986355934.dkr.ecr.ap-northeast-2.amazonaws.com/xgboost:1", <------ This value is different according to SageMaker built-in algorithm & region
    "TrainParameters": {
        ...
    },
    "TrainInputContent": "text/csv", <----- This value is difference according to SageMaker built-in alorithm
    "TrainInstanceType": "c5.xlarge", <----- SageMaker training job instance type

    "ModelValidationEnable": true, <----- Enable/disable a model validation state
    "ModelErrorThreshold": 0.1, <----- Model accuracy validation metric threshold

    "EndpointInstanceType": "t2.2xlarge", <----- SageMaker endpoint instance number
    "EndpointInstanceCount": 1 <----- SageMaker ednpoint instance count
}
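
Because the project is TypeScript, this configuration block can be modeled as a typed interface so that typos in configuration keys surface at compile time. A sketch of such typing (the interface name is illustrative; the fields mirror the JSON above):

interface PipelineStackConfig {
  Name: string;
  EndpointName: string;
  GlueJobFilePath: string;
  GlueJobTimeoutInMin: number;
  TrainContainerImage: string;
  TrainParameters: Record<string, string>; // hyper-parameters are passed as strings
  TrainInputContent: string;               // e.g. "text/csv"
  TrainInstanceType: string;
  ModelValidationEnable: boolean;
  ModelErrorThreshold: number;
  EndpointInstanceType: string;
  EndpointInstanceCount: number;
}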

How to change the hyper-parameters of XGBoost

The ChurnXgboostPipeline section in the config/app-config-demo.json file includes hyper-parameters like this. Change these values, re-deploy the stack, and trigger it again.

"ChurnXgboostPipeline": {
    "Name": "ChurnXgboostPipelineStack",

    ...
    ...
    "TrainParameters": {
        "max_depth": "5",
        "eval_metric": "error",
        "eta": "0.2",
        "gamma": "4",
        "min_child_weight": "6",
        "subsample": "0.8",
        "objective": "binary:logistic",
        "silent": "0",
        "num_round": "100"
    },
    ...
    ...
}

How to extend other SageMaker built-in algorithms

MLOpsPipelineStack provides a general MLOps pipeline that abstracts SageMaker built-in algorithms as much as possible. As a result, it can be extended to other algorithms by injecting only configuration, without modifying the code.

For example, consider the Object2Vec algorithm.

Step 1: Prepare a new configuration in config/app-config-demo.json:

{
    "Project": {
        ...
        ...
    },

    "Stack": {
        "ChurnXgboostPipeline": {
            "Name": "ChurnXgboostPipelineStack",

            ...
            ...
        },
        "RecommendObject2VecPipeline": {
            "Name": "RecommendObject2VecPipelineStack",

            "EndpointName": "recommand-object2vec", <----- change according model or usecase

            "GlueJobFilePath": "codes/glue/recommand-object2vec/src/glue_etl.py", <----- change according to data format and etl-process
            "GlueJobTimeoutInMin": 30, <----- change this value to avoid over-processing and over-charging

            "TrainContainerImage": "835164637446.dkr.ecr.ap-northeast-2.amazonaws.com/object2vec:1", <----- change image according to SageMaker built-in alorithm and region
            "TrainParameters": {
                "_kvstore": "device",
                "_num_gpus": "auto",
                "_num_kv_servers": "auto",
                "bucket_width": "0",
                "early_stopping_patience": "3",
                "early_stopping_tolerance": "0.01",
                "enc0_cnn_filter_width": "3",
                "enc0_layers": "auto",
                "enc0_max_seq_len": "1",
                "enc0_network": "pooled_embedding",
                "enc0_token_embedding_dim": "300",
                "enc0_vocab_size": "944",
                "enc1_layers": "auto",
                "enc1_max_seq_len": "1",
                "enc1_network": "pooled_embedding",
                "enc1_token_embedding_dim": "300",
                "enc1_vocab_size": "1684",
                "enc_dim": "1024",
                "epochs": "20",
                "learning_rate": "0.001",
                "mini_batch_size": "64",
                "mlp_activation": "tanh",
                "mlp_dim": "256",
                "mlp_layers": "1",
                "num_classes": "2",
                "optimizer": "adam",
                "output_layer": "mean_squared_error"
            },
            "TrainInputContent": "application/jsonlines", <----- change according to algorithm supported types
            "TrainInstanceType": "m4.xlarge", <--- change model training environments

            "ModelValidationEnable": false, <----- disable if you don't want to validate model accuracy
            "ModelErrorThreshold": 0.1,

            "EndpointInstanceType": "m4.xlarge",  <--- change model training environments
            "EndpointInstanceCount": 1 
        }
    }
}

Step 2: Create a new object in infra/app-main.ts like this:

new MLOpsPipelineStack(appContext, appContext.appConfig.Stack.ChurnXgboostPipeline);
new MLOpsPipelineStack(appContext, appContext.appConfig.Stack.RecommendObject2VecPipeline);

Step 3: Deploy and trigger again with new data

cdk list
cdk deploy *RecommendObject2VecPipelineStack --profile [optional: your profile name]

You can also extend functionality by inheriting from this stack for further expansion.
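
A hypothetical sketch of that inheritance pattern (the constructor signature of MLOpsPipelineStack is assumed from Step 2 above):

// Ambient declaration standing in for the repository's real class.
declare class MLOpsPipelineStack {
  constructor(appContext: unknown, stackConfig: unknown);
}

// Extend the base pipeline and layer extra resources or states on top.
class AuditedPipelineStack extends MLOpsPipelineStack {
  constructor(appContext: unknown, stackConfig: unknown) {
    super(appContext, stackConfig);
    // e.g. add CloudWatch alarms, tags, or an extra validation step here.
  }
}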

How to clean up

Execute the following command, which will destroy all resources except the S3 bucket; delete the bucket manually in the AWS web console.

sh ./script/destroy_stacks.sh config/app-config-demo.json

or

cdk destroy *Stack --profile [optional: your profile name]

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.
