Author: Pedro Paez (GitHub: https://github.com/pedrojpaez/dataweek.git)
In this lab we will walk through the entire data science workflow using Amazon SageMaker. The objective of this exercise is to build a data science project from scratch and to learn how SageMaker accelerates the process of building and deploying custom machine learning models in production. We will see how to leverage SageMaker's first-party algorithms as well as the high-level SDK for deep learning frameworks.
We will build an end-to-end Natural Language Processing pipeline to classify newspaper headlines into general categories. We will first build word embeddings (vector representations of the English vocabulary) to enrich our model.
For this lab you will need to have:
- A laptop
- Network connectivity
- An AWS account
- Basic Python scripting experience
- Basic knowledge of the data science workflow
Preferred knowledge:
- Basic knowledge of containers
- Basic knowledge of deep learning
- Go to the AWS Console in your account
- In the top right corner, select the N. Virginia (us-east-1) region
- Search for and click on Amazon SageMaker
- Under Notebook > select Notebook instances > click the "Create notebook instance" button (orange button)
- Give your project a name under "Notebook instance name"
- Select the ml.t2.medium notebook instance type
- Under "Permissions and encryption" > under "IAM role" > select "Create a new role" in the drop-down menu
- Select "Any S3 bucket" > click the "Create new role" button
- Finally, click "Create notebook instance" and wait until the status is "InService"
- Select "Open Jupyter". You should see a Jupyter notebook web interface.
- Select "New" in the top right corner > Click on "Terminal". A new tab will open with access to the Shell.
- You now have shell access to the notebook instance and full control and flexibility over your environment. We will cd (change directory) to the SageMaker home directory. From the root directory, type:
cd SageMaker
- We will clone the material for this lab from the git repo https://github.com/pedrojpaez/dataweek.git. In the terminal, type:
git clone https://github.com/pedrojpaez/dataweek.git
- Return to the previous tab (the Jupyter notebook web interface). The dataweek directory should now be available.
There are 4 elements in the dataweek directory:
- tf-src: This directory contains the MXNet training script for our document classifier.
- blazingtext_word2vec_text8.ipynb: Notebook to create word embeddings using BlazingText, a SageMaker first-party algorithm. We will use these embeddings as input for our headline classifier to enrich the model.
- headline-classifier-local.ipynb: Notebook to create the headline classifier using Keras (with the MXNet backend) on the local instance.
- headline-classifier-mxnet.ipynb: Notebook to create the headline classifier leveraging SageMaker training and deployment features. We will use the high-level MXNet SDK to bring our MXNet code and to run and deploy our model.
In this notebook we will run through the snippets of code. We will build a word embedding model (vector representations of the English vocabulary) to use as input for our document classification model.
For this notebook we will use the first-party BlazingText algorithm to build our word embeddings, and we will leverage SageMaker's one-click training and one-click deployment capabilities.
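At its core, the training job comes down to pointing a SageMaker estimator at the BlazingText container with word2vec hyperparameters. A hedged sketch of assembling that configuration (hyperparameter names follow the BlazingText documentation as we recall it; the values and S3 paths are illustrative, not the notebook's exact code):

```python
def blazingtext_word2vec_config(s3_train, s3_output):
    """Assemble an illustrative configuration for a BlazingText
    word2vec training job; in the notebook, hyperparameters like
    these are passed to the estimator via set_hyperparameters()."""
    return {
        "train_data": s3_train,
        "output_path": s3_output,
        "hyperparameters": {
            "mode": "batch_skipgram",  # distributed-friendly skip-gram
            "vector_dim": 100,         # embedding dimensionality
            "epochs": 5,
            "min_count": 5,            # drop rare words
            "window_size": 5,
        },
    }

cfg = blazingtext_word2vec_config("s3://my-bucket/text8", "s3://my-bucket/output")
```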
The general steps we will run through are:
- Configure the notebook
- Download the text8 corpus file
- Upload the data to S3
- Run a training job on SageMaker
- Deploy the model
- Download the model artifact and unpack the word vectors
- Clean up (delete the model endpoint)
Run through the notebook and read the instructions.
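When the training job finishes, the notebook downloads the model artifact from S3 and unpacks the word vectors. For BlazingText in word2vec mode the unpacked archive contains the vectors in the standard word2vec text format (a header line "<vocab_size> <dim>", then one word and its components per line). A minimal parsing sketch under that assumption, using made-up values:

```python
import numpy as np

def parse_word2vec_text(lines):
    """Parse word vectors in the standard word2vec text format:
    first line '<vocab_size> <dim>', then '<word> <v1> <v2> ...'."""
    header = lines[0].split()
    vocab_size, dim = int(header[0]), int(header[1])
    vectors = {}
    for line in lines[1:]:
        parts = line.rstrip().split(" ")
        vectors[parts[0]] = np.array(parts[1:], dtype=np.float32)
    return vocab_size, dim, vectors

# Tiny illustrative input (not real BlazingText output):
sample = ["2 3", "king 0.1 0.2 0.3", "queen 0.4 0.5 0.6"]
vocab_size, dim, vecs = parse_word2vec_text(sample)
```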
In this notebook we will run through the snippets of code. We will build a headline classifier model that classifies newspaper headlines into 4 classes. We will build a deep learning model using the Keras interface with the MXNet backend (and use the word embeddings we previously built as input to our model). We will run the training locally (on the notebook instance) to evaluate performance.
The general steps we will run through are:
- Configure the notebook
- Download the NewsAggregator dataset
- Upload the data to S3
- Run the training job locally
- Move to the next notebook
Run through the notebook and read the instructions.
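Inside the notebook, the raw headlines have to be turned into fixed-length integer sequences before they can feed a Keras embedding layer. A minimal sketch of that preprocessing step (the vocabulary and padding scheme here are illustrative assumptions, not the notebook's exact code):

```python
import numpy as np

def encode_headlines(headlines, vocab, maxlen=20):
    """Turn raw headlines into fixed-length integer sequences,
    a typical preprocessing step before a Keras embedding layer."""
    encoded = []
    for text in headlines:
        # Map each token to its vocabulary id; 0 is pad/out-of-vocab.
        ids = [vocab.get(tok, 0) for tok in text.lower().split()][:maxlen]
        ids += [0] * (maxlen - len(ids))   # right-pad to maxlen
        encoded.append(ids)
    return np.array(encoded, dtype=np.int64)

# Illustrative vocabulary, not built from the real dataset:
vocab = {"fed": 1, "raises": 2, "rates": 3}
X = encode_headlines(["Fed raises rates again"], vocab, maxlen=6)
```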
In this notebook we will run through the snippets of code. We will build a headline classifier model that classifies newspaper headlines into 4 classes. We will build a deep learning model using the Keras interface with the MXNet backend (and use the word embeddings we previously built as input to our model). We will package the MXNet code into a training script, run the training on SageMaker, and evaluate performance. Finally, we will deploy our model as a RESTful API.
The general steps we will run through are:
- Configure the notebook
- Upload the data to S3
- Run a training job on SageMaker
- Deploy the model on SageMaker
- Clean up (delete the model endpoint)
Run through the notebook and read the instructions.
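Once deployed, the model sits behind an HTTPS endpoint that can be invoked from any AWS SDK. A hedged sketch of calling it with boto3 (the endpoint name, payload format, and vocabulary here are assumptions for illustration, not the notebook's exact contract):

```python
import json

def build_payload(headline, vocab, maxlen=20):
    """Serialize one headline into the JSON body we assume the
    deployed endpoint accepts (format is illustrative)."""
    ids = [vocab.get(tok, 0) for tok in headline.lower().split()][:maxlen]
    ids += [0] * (maxlen - len(ids))
    return json.dumps({"instances": [ids]})

def classify(headline, vocab, endpoint_name, runtime):
    """Invoke a deployed SageMaker endpoint; `runtime` is a boto3
    'sagemaker-runtime' client (requires AWS credentials to call)."""
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_payload(headline, vocab),
    )
    return json.loads(response["Body"].read())

payload = build_payload("fed raises rates",
                        {"fed": 1, "raises": 2, "rates": 3}, maxlen=4)
```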
- Invoke a model endpoint deployed by Amazon SageMaker using API Gateway and AWS Lambda for additional functionality: https://aws.amazon.com/blogs/machine-learning/call-an-amazon-sagemaker-model-endpoint-using-amazon-api-gateway-and-aws-lambda/
- Analyze the results of your model responses to real-time data (for this, swap the Comprehend API for your SageMaker endpoint API): https://aws.amazon.com/blogs/machine-learning/build-a-social-media-dashboard-using-machine-learning-and-bi-services/
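The API Gateway pattern in the first link puts a small Lambda function in front of the SageMaker endpoint. A minimal handler sketch following that pattern (the endpoint name is an illustrative assumption; the `runtime` parameter is added so the sketch can be exercised without AWS credentials):

```python
import json

def lambda_handler(event, context, runtime=None):
    """Minimal Lambda handler: forward the HTTP request body to a
    SageMaker endpoint and return its prediction."""
    if runtime is None:
        # boto3 is available by default in Lambda's Python runtimes;
        # imported lazily so the sketch can run without it installed.
        import boto3
        runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName="headline-classifier",   # assumed endpoint name
        ContentType="application/json",
        Body=event["body"],
    )
    prediction = response["Body"].read().decode("utf-8")
    return {"statusCode": 200, "body": prediction}

# Exercise the handler with a stand-in client (no AWS calls made):
class _FakeBody:
    def read(self):
        return b'{"label": "b"}'

class _FakeRuntime:
    def invoke_endpoint(self, **kwargs):
        return {"Body": _FakeBody()}

result = lambda_handler({"body": "[1, 2, 3]"}, None, runtime=_FakeRuntime())
```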