End-to-end ML workflow to highlight SageMaker Feature Store

This repository demonstrates an end-to-end ML workflow using various AWS services such as SageMaker (Feature Store, Endpoints), Kinesis Data Streams, Lambda and DynamoDB.

The dataset used here is the Expedia hotel recommendations dataset from Kaggle, and the use case is predicting a hotel cluster based on user inputs and destination features. We ingest the raw data from an S3 bucket into Amazon SageMaker Feature Store and then read data from the Feature Store to train an ML model for predicting a hotel cluster. The trained model is deployed as a SageMaker endpoint. A simulated inference pipeline is created using Amazon Kinesis Data Streams and Lambda. Test data is put on the stream, which triggers a Lambda function that joins the event data (customer inputs) read from the stream with destination features read from the online SageMaker Feature Store and then invokes the SageMaker model endpoint to get a prediction for the hotel cluster. The predicted hotel cluster, along with the input data, is stored in a DynamoDB table.

A blog post providing a full walkthrough of using a feature store should be coming soon.

For a full explanation of SageMaker Feature Store, see the official documentation, which describes the capability as follows:

Amazon SageMaker Feature Store is a purpose-built repository where you can store and access features so it’s much easier to name, organize, and reuse them across teams. SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent.

This implementation demonstrates how to do the following:

  • Create multiple online and offline SageMaker Feature Groups to store transformed data readily usable for training ML models.
  • Train a SageMaker XGBoost model and deploy it as an endpoint for real time inference.
  • Simulate an inference pipeline by putting events on a Kinesis Data Stream, triggering a Lambda function.
  • Read data from the online feature store in the Lambda function, combine it with event data, and invoke a SageMaker endpoint.
  • Store prediction results and input data in a DynamoDB table.
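The inference flow in the bullets above can be sketched as a Lambda handler. This is a minimal illustration, not the repo's actual code: the feature group name, endpoint name, DynamoDB table name, and payload layout are all hypothetical placeholders.

```python
import base64
import json


def parse_kinesis_record(record: dict) -> dict:
    """Kinesis delivers the payload base64-encoded under record['kinesis']['data']."""
    return json.loads(base64.b64decode(record["kinesis"]["data"]))


def handler(event, context):
    # boto3 is imported lazily so this module can be loaded without AWS credentials.
    import boto3

    featurestore = boto3.client("sagemaker-featurestore-runtime")
    sm_runtime = boto3.client("sagemaker-runtime")
    table = boto3.resource("dynamodb").Table("hotel-cluster-predictions")  # hypothetical name

    for record in event["Records"]:
        user_input = parse_kinesis_record(record)

        # Join the streamed customer inputs with destination features from the
        # online feature store (the feature group name is a placeholder).
        resp = featurestore.get_record(
            FeatureGroupName="destinations-feature-group",
            RecordIdentifierValueAsString=str(user_input["srch_destination_id"]),
        )
        features = {f["FeatureName"]: f["ValueAsString"] for f in resp.get("Record", [])}

        # Invoke the model endpoint with a combined CSV payload (layout is assumed).
        payload = ",".join(list(map(str, user_input.values())) + list(features.values()))
        prediction = sm_runtime.invoke_endpoint(
            EndpointName="hotel-cluster-xgb-endpoint",  # hypothetical endpoint name
            ContentType="text/csv",
            Body=payload,
        )["Body"].read().decode()

        # Persist the prediction together with the raw inputs.
        table.put_item(Item={**user_input, "predicted_cluster": prediction})
```

The `parse_kinesis_record` helper is kept pure (no AWS calls) so the decoding logic can be unit-tested without an AWS account.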

Prerequisites

Prior to running the steps under Instructions, you will need access to an AWS account where you have full admin privileges. The CloudFormation template will deploy an AWS Lambda function, an IAM role, and a new SageMaker notebook instance with this repo already cloned. In addition, basic knowledge of the following services will be valuable: Amazon Kinesis Data Streams, Amazon SageMaker, AWS Lambda, and IAM roles.

PRE-REQ 1: The CloudFormation template also deploys a Lambda function, and the code for that function must be staged in an S3 bucket created before running the template. Create an S3 bucket and place the hotel_cluster_predictions_v1.zip file in it. Keep the name of this bucket handy; it is required as input for the "Name of the S3 bucket for holding the zip file of the Lambda code" parameter in the CloudFormation template.
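This staging step can also be scripted with boto3 instead of the console. A minimal sketch, assuming the zip sits in the current directory; the bucket name you pass must be your own (S3 bucket names are globally unique):

```python
ZIP_FILE = "hotel_cluster_predictions_v1.zip"


def stage_lambda_code(bucket: str, region: str = "us-east-1") -> str:
    """Create the bucket if needed and upload the Lambda zip; returns the S3 key."""
    import boto3  # deferred so the module imports without AWS credentials

    s3 = boto3.client("s3", region_name=region)
    if region == "us-east-1":
        s3.create_bucket(Bucket=bucket)  # us-east-1 rejects a LocationConstraint
    else:
        s3.create_bucket(
            Bucket=bucket,
            CreateBucketConfiguration={"LocationConstraint": region},
        )
    s3.upload_file(ZIP_FILE, bucket, ZIP_FILE)
    return ZIP_FILE
```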

PRE-REQ 2: If you have a CloudTrail trail created for your account, make sure that the "Exclude AWS KMS events" checkbox is checked under the trail's Management events settings. Checking this box prevents AWS KMS events from being logged in CloudTrail. Ingesting data into the Feature Store triggers KMS events, and depending on the size of the data this can result in significant cost if not disabled; for the purpose of this demo it is therefore recommended that AWS KMS events not be logged in CloudTrail.

Instructions

  1. Use the CloudFormation template available in the templates folder of this repository to launch a CloudFormation stack. You must use expedia-feature-store-demo-v2 as the stack name (using a different name would require changing the notebook code). All parameters needed by the template have a default value; leave the defaults unchanged unless you have a reason to change them.

    NOTE: This code has been tested only in the us-east-1 region, although it is expected to work in other regions as well. You can view the CloudFormation template directly in the templates folder. The stack will take a few minutes to launch. When it completes, you can view the items created by clicking on the Resources tab.

  2. Once the stack is complete, browse to Amazon SageMaker in the AWS console and click on the 'Notebook Instances' tab on the left.

  3. Click either 'Jupyter' or 'JupyterLab' to access the SageMaker Notebook instance. The CloudFormation template has cloned this git repository into the notebook instance for you. All of the example code to work through is in the notebooks directory.

  4. The dataset used for this code is available on Kaggle and can be downloaded directly from the Kaggle website; it is NOT included in this repository. The CloudFormation template creates an S3 bucket to hold the raw data; its name appears as DataBucketName in the Outputs section of the CloudFormation stack. Create a folder called raw_data in this bucket and manually upload the files train.csv, test.csv, and destinations.csv from the Kaggle dataset into that folder.
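The upload described in step 4 can also be done with boto3. A sketch, assuming the three Kaggle CSVs are in the current working directory and the bucket name is taken from the stack's DataBucketName output:

```python
RAW_PREFIX = "raw_data"
RAW_FILES = ["train.csv", "test.csv", "destinations.csv"]


def raw_data_keys(files=RAW_FILES, prefix=RAW_PREFIX):
    """S3 keys the notebooks expect, e.g. 'raw_data/train.csv'."""
    return [f"{prefix}/{name}" for name in files]


def upload_raw_data(bucket: str) -> None:
    """Upload the Kaggle CSVs from the current directory to the raw_data/ folder."""
    import boto3  # deferred so the module imports without AWS credentials

    s3 = boto3.client("s3")
    for local_path, key in zip(RAW_FILES, raw_data_keys()):
        s3.upload_file(local_path, bucket, key)
```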

Running the Notebooks

There is a series of notebooks that should be run in order; follow the step-by-step guide in each notebook.

Optional steps

  • View the Kinesis Stream that is used to ingest records.
  • View the Lambda function that receives the Kinesis events, reads feature data from the online feature store, and triggers the model prediction.

CLEAN UP - IMPORTANT

To destroy the AWS resources created as part of this example, complete the following two steps:

  1. Run notebooks/5_cleanup.ipynb to delete the S3 objects and the SageMaker endpoint (these are resources not created by the CloudFormation template).

  2. Go to CloudFormation in the AWS console, select expedia-feature-store-demo-v2 and click 'Delete'. Verify all the resources (S3 buckets, SageMaker notebook, SageMaker endpoint, Lambda, DynamoDB) are indeed deleted.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Running Cost

The running cost for this demo is $15-$20 per day.

References

  1. SageMaker Feature Store end-to-end workshop

Contributors

aarora79, amazon-auto

Issues

Enhance the prediction lambda to work as an API as well

The Lambda function should also be able to serve as the backend for an API Gateway. This can be accomplished with minimal changes to the code; the only difference is in the structure of the "event" object. Additional checks can be added to recognize whether the invocation came from Kinesis or elsewhere, and the rest of the code can remain the same.

If the "Records" key is present in "event" and the first record contains the key "kinesis", the function was triggered via Kinesis; if these checks fail, we assume a REST API invocation and look for different keys (namely "data").
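The check described above can be expressed as a small helper. This is a sketch of the proposal, not existing code; in particular, the single "data" key for the REST path is the shape suggested in this issue, not an implemented contract:

```python
def invocation_source(event: dict) -> str:
    """Return 'kinesis' if the event came from a Kinesis trigger, else 'api'."""
    records = event.get("Records")
    if records and isinstance(records, list) and "kinesis" in records[0]:
        return "kinesis"
    return "api"


def extract_payloads(event: dict) -> list:
    """Normalize the input regardless of trigger (payload shapes are assumptions)."""
    if invocation_source(event) == "kinesis":
        return [r["kinesis"]["data"] for r in event["Records"]]
    return [event["data"]]  # proposed REST shape: a single 'data' key
```

With this split, the downstream feature lookup and endpoint invocation can stay identical for both trigger types.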
