aws-samples / sagemaker-101-workshop Goto Github PK

Hands-on demonstrations for data scientists exploring Amazon SageMaker

Makefile 0.40% Python 18.22% Jupyter Notebook 81.25% Shell 0.13%

aws sagemaker amazon-sagemaker workshops

sagemaker-101-workshop's Introduction

Getting Started with "Amazon SageMaker 101"

This repository accompanies a hands-on training event to introduce data scientists (and ML-ready developers / technical leaders) to core model training and deployment workflows with Amazon SageMaker.

Like a "101" course in the academic sense, this will likely not be the simplest introduction to SageMaker you can find; nor the fastest way to get started with advanced features like optimized SageMaker Distributed training or SageMaker Clarify for bias and explainability analyses.

Instead, these exercises are chosen to demonstrate some core build/train/deploy patterns that we've found help new users to first get productive with SageMaker - and to later understand how the more advanced features fit in.

Agenda

An interactive walkthrough of the content with screenshots is available at:

https://sagemaker-101-workshop.workshop.aws/

Sessions in suggested order:

builtin_algorithm_hpo_tabular: Explore some pre-built algorithms and tools for tabular data, including SageMaker Autopilot AutoML, the XGBoost built-in algorithm, and automatic hyperparameter tuning
- This module also includes a quick initial look at SageMaker Feature Store, SageMaker Model Registry, and the AutoGluon built-in algorithm - but you don't need to dive deep on these topics.
custom_script_demos: See how you can train and deploy your own models on SageMaker with custom Python scripts and the pre-built framework containers
- (Optional) Start with sklearn_reg for an introduction if you're new to deep learning but familiar with Scikit-Learn
- See huggingface_nlp (preferred) for a side-by-side comparison of in-notebook versus on-SageMaker model training and inference for text classification - or alternatively the custom CNN-based keras_nlp or pytorch_nlp examples.
migration_challenge: Apply what you learned to port an in-notebook workflow to a SageMaker training job + endpoint deployment on your own
- Choose the sklearn_cls, keras_mnist or pytorch_mnist challenge, depending which ML framework you're most comfortable with.

Deploying in Your Own Account

The recommended way to explore these exercises is to onboard to SageMaker Studio. Once you've done this, you can download this repository by launching a System terminal (From the "Utilities and files" section of the launcher screen inside Studio) and running git clone https://github.com/aws-samples/sagemaker-101-workshop.

If you prefer to use classic SageMaker Notebook Instances, you can find a CloudFormation template defining a simple setup at .simple.cf.yaml. This can be deployed via the AWS CloudFormation Console.

You can refer to the "How Are Amazon SageMaker Studio Notebooks Different from Notebook Instances?" docs page for more details on differences between the Studio and Notebook Instance environments.

Depending on your setup, you may be asked to choose a kernel when opening some notebooks. There should be guidance at the top of each notebook on suggested kernel types, but if you can't find any, Data Science 3.0 (Python 3) (on Studio) or conda_python3 (on Notebook Instances) are likely good options.

Setting up widgets and code completion (JupyterLab extensions)

Some of the examples depend on ipywidgets and ipycanvas for interactive inference demo widgets (but do provide code-only alternatives).

We also usually enable some additional JupyterLab extensions powered by jupyterlab-lsp and jupyterlab-s3-browser to improve user experience. You can find more information about these extensions in this AWS ML blog post

ipywidgets should be available by default on SageMaker Studio, but not on Notebook Instances when we last tested. The other extensions require installation.

To see how we automate these extra setup steps for AWS-run events, you can refer to the lifecycle configuration scripts in our CloudFormation templates. For a Notebook Instance LCC, see the AWS::SageMaker::NotebookInstanceLifecycleConfig in .simple.cf.yaml. For a SageMaker Studio LCC, see the Custom::StudioLifecycleConfig in .infrastructure/template.sam.yaml.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

sagemaker-101-workshop's People

Contributors

Stargazers

Watchers

sagemaker-101-workshop's Issues

generate_classification_report() does not show plot

In the first lab (SageMaker XGBoost HPO.ipynb), generate_classification_report() is called a couple of times. There does not appear to be a final call to plt.show() (from matplotlib.pyplot), and so the report does not display in the notebook.

Adding the lines import matplotlib.pyplot as plt and plt.show() to the notebook solves the issue, but plt.show() should probably be added to the utility function itself.

News Headline Classifier (SageMaker Version) - No module named 'tensorflow.python.eager'

I have tried this notebook with Python 3 (TensorFlow 1.15 Python 3.7 CPU Optimized) kernel then got this error.

https://github.com/apac-ml-tfc/sagemaker-workshop-101/blob/master/custom_tensorflow_keras_nlp/Headline%20Classifier%20SageMaker.ipynb

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<timed exec> in <module>

~/sagemaker-101-workshop/custom_tensorflow_keras_nlp/util/preprocessing.py in <module>
     13 import numpy as np
     14 from sklearn import preprocessing
---> 15 import tensorflow as tf
     16 from tensorflow.keras.preprocessing.text import Tokenizer
     17 from tensorflow.keras.preprocessing.sequence import pad_sequences

/usr/local/lib/python3.7/site-packages/tensorflow/__init__.py in <module>
     39 import sys as _sys
     40 
---> 41 from tensorflow.python.tools import module_util as _module_util
     42 from tensorflow.python.util.lazy_loader import LazyLoader as _LazyLoader
     43 

/usr/local/lib/python3.7/site-packages/tensorflow/python/__init__.py in <module>
     38 # pylint: disable=wildcard-import,g-bad-import-order,g-import-not-at-top
     39 
---> 40 from tensorflow.python.eager import context
     41 from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
     42 

ModuleNotFoundError: No module named 'tensorflow.python.eager'

Make the MNIST drawing widget work in JupyterLab/Studio

JupyterLab (and therefore also SageMaker Studio) uses a controlled extensions system and does not support arbitrary trusted HTML/JavaScript, so our current MNIST "draw a digit" widget only works in plain Jupyter.

Probably there are implementations out there that could support this if the right JupyterLab extensions were installed. ipycanvas looks like one possible candidate?

Swap out sklearn dataset

The optional/non-core custom_sklearn_rf exercise currently uses the Boston housing dataset which is deprecated due to ethical issues in its construction, as documented here.

For our purposes we're really just looking for an end-to-end tabular data SKLearn training & inference example. This example should be swapped out for some other standard sample dataset.

Demonstrate script testing/debugging in the script mode walkthroughs

Feature request

We already provide example shell commands to invoke and test training scripts for all variants of the migration challenge. For example from SKLearn:

!python3 src/main.py \
    --train ./data/train \
    --test ./data/test \
    --model_dir ./data/model \
    --class_names {class_names_str} \
    --n_estimators=100 \
    --min_samples_leaf=3

Since this is the recommended debugging workflow, we should also demonstrate it in the script mode walkthroughs by adding equivalent commands in the 'SageMaker' variants of these notebooks - before the Estimator gets created.

This will help these notebooks illustrate the process/workflow of translating from in-notebook to notebook+job, better than just showing the final result.

Background

Today, we use in-notebook shell commands as the recommended script debugging workflow for the migration challenge - because our options are somewhat constrained for a workshop:

SageMaker Warm Pools requires a quota increase to enable
SageMaker Local Mode isn't natively supported in Studio at this time
SageMaker SSH Helper has a bit of a learning curve for any data scientists not already familiar with online SSH-based debugging (and some setup required - although we could probably automate that).

We talk about these other options in the post-challenge wrap-up, but don't want to confuse the issue by introducing them up-front in the code.

MNIST model CPU training broken in TF v2.7 (conda_tensorflow2_p37 kernel on NBI ALv2 JLv3)

The current conda_tensorflow2_p38 kernel on the latest SageMaker Notebook Instance platform (notebook-al2-v2, as used in the CFn template) seems to break local CPU-only training for the MNIST migration challenge.

In this environment (TF v2.7.1, TF.Keras v2.7.0), tensorflow.keras.backend.image_data_format() asks for channels_first, but training fails because MaxPoolingOp only supports channels_last on CPU - per the error message below:

InvalidArgumentError:  Default MaxPoolingOp only supports NHWC on device type CPU
	 [[node sequential/max_pooling2d/MaxPool
 (defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/layers/pooling.py:357)
]] [Op:__inference_train_function_862]

Errors may have originated from an input operation.
Input Source operations connected to node sequential/max_pooling2d/MaxPool:
In[0] sequential/conv2d_1/Relu (defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/backend.py:4867)

Overriding the image_data_format() check (in "Pre-Process the Data for our CNN") to prepare data in different shape does not work because the model is incompatible (will raise ValueError in conv2d_2).

Still seems to be working fine in current SMStudio kernel (TensorFlow v2.3.2, TF.Keras v2.4.0).

[Built-in algos] Need to convert one-hot variables to numerics

Perhaps some Pandas behaviour changed? But currently the following line in notebook 1 Autopilot and XGBoost.ipynb:

df_model_data = pd.get_dummies(df_model_data)  # Convert categorical variables to sets of indicators

...is yielding boolean typed columns for all the one-hot encoded variables. This is consistent with the current pandas doc, and the notebook seems to train the XGBoost model fine - but the XGBoost evaluation Batch Transform job fails with:

RuntimeError: Loading csv data failed with Exception, please ensure data is in csv format:
 <class 'ValueError'>
 could not convert string to float: 'False'

I believe we need to add , dtype=int to the get_dummies() call to ensure the generated train/val/test datasets are fully numeric to be compatible with the SageMaker XGBoost algorithm. Haven't quite finished testing it through yet though.

Alternative word vector source?

The NLP example currently uses GloVe word vectors from Stanford's repository, but these are:

Sometimes slow to download on our typical instance type (~6min30) - because the combined zip of 50/100/200/300D vectors is downloaded and the unused files discarded. There don't seem to be separate downloads offered for the 100D vector size the model uses.
Only offered pre-trained in English which makes the exercise less transferable for ASEAN customers.

Could maybe instead consider:

Using FastText embeddings: they offer pre-trained "Word vectors for 157 languages", but only 300D... So we'd still need to downsample to 100D using their tool to adapt the dimension - and download would still be slower than necessary.
Using some other embedding source (?)
Pre-preparing and hosting embeddings (e.g. in S3) for optimized download times at the expense of transparency of how the vectors are created.

Can't upgrade torchtext with newer PyTorch

Per the torchtext README, our current pinned torchtext version (0.6) is a long way out of sync with our PyTorch version (PTv1.8=TTv0.9, PTv1.10=TTv0.11).

I explored pinning the PT version to current and allowing pip to solve, with a statement like this:

!pip install torch==`pip show torch | grep 'Version:' | sed 's/Version: //'` torchtext

On the SMStudio PyTorch v1.10 CPU kernel, this installs the expected version of torchtext (0.11), but import torchtext fails due to missing symbols. Perhaps due to something missing from the CPU-optimized version of PyTorch?

So for now torchtext remains pinned at a pretty old version. We only use it for basic English text tokenization (util tokenize_and_pad_docs()), so maybe can switch to some other solution if this can't be resolved.

Migrate to TFv2

It would be great to migrate the TensorFlow examples to version 2 of TensorFlow when possible

Simplify MNIST challenge via AWS Open Data

The current MNIST challenge downloads the source data in array format and then un-packs to folders-of-images, before reading back in.

Running the algorithm on folders-of-images is good for transferring the learning to real-world use cases, but the conversion process can be a bit confusing.

If we instead used s3://fast-ai-imageclas/mnist_png.tgz from FastAI on the AWS Open Data registry, then we could:

Potentially reduce download time / increase availability, as the dataset is already S3-hosted
Directly use their folders-of-PNGs format - no conversion needed

Possibly switch to Data Wrangler dataframe visualization?

Not sure yet whether it's worth it for the use-cases shown in this workshop, but we could consider adding:

import sagemaker_datawrangler

To enable the built-in notebook data preparation tools and visualisation from Data Wrangler?

Raising here as an issue to keep track