sagemaker-101-workshop's Issues

Swap out sklearn dataset

The optional/non-core custom_sklearn_rf exercise currently uses the Boston housing dataset, which is deprecated due to ethical issues in its construction (as documented in the scikit-learn documentation).

For our purposes we're really just looking for an end-to-end tabular-data SKLearn training & inference example, so the dataset should be swapped for some other standard sample dataset.
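
For instance, a minimal end-to-end sketch using the California housing dataset (just one candidate replacement, not a decision) could look like this:

# Hypothetical replacement: California housing instead of the deprecated Boston dataset
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))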

Migrate to TFv2

It would be great to migrate the TensorFlow examples to TensorFlow v2 when possible.

Can't upgrade torchtext with newer PyTorch

Per the torchtext README, our current pinned torchtext version (0.6) is a long way out of sync with our PyTorch version (PTv1.8=TTv0.9, PTv1.10=TTv0.11).

I explored pinning torch to the currently installed version and letting pip resolve torchtext, with a statement like this:

!pip install torch==`pip show torch | grep 'Version:' | sed 's/Version: //'` torchtext

On the SMStudio PyTorch v1.10 CPU kernel, this installs the expected version of torchtext (0.11), but import torchtext fails due to missing symbols. Perhaps due to something missing from the CPU-optimized version of PyTorch?

So for now torchtext remains pinned at a pretty old version. We only use it for basic English text tokenization (the tokenize_and_pad_docs() utility), so we could perhaps switch to some other solution if this can't be resolved.
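
If torchtext does get dropped, one possible direction is a small torchtext-free tokenizer. A rough sketch (illustrative only, not a drop-in replacement for the existing utility):

# A minimal, torchtext-free sketch of basic English tokenization + padding
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase and split on alphanumeric runs - roughly what 'basic_english' does."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tokenize_and_pad(docs, vocab, max_len, pad_idx=0, unk_idx=1):
    """Map each document to a fixed-length list of vocabulary indices."""
    out = []
    for doc in docs:
        ids = [vocab.get(tok, unk_idx) for tok in simple_tokenize(doc)][:max_len]
        ids += [pad_idx] * (max_len - len(ids))
        out.append(ids)
    return out

# Example usage with a toy vocabulary (0 = padding, 1 = unknown):
docs = ["Markets rally on strong earnings", "New phone launch delayed again"]
counts = Counter(tok for d in docs for tok in simple_tokenize(d))
vocab = {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common())}
print(tokenize_and_pad(docs, vocab, max_len=8))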

generate_classification_report() does not show plot

In the first lab (SageMaker XGBoost HPO.ipynb), generate_classification_report() is called a couple of times. There does not appear to be a final call to plt.show() (from matplotlib.pyplot), and so the report does not display in the notebook.

Adding the lines import matplotlib.pyplot as plt and plt.show() to the notebook solves the issue, but plt.show() should probably be added to the utility function itself.
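
A minimal sketch of the proposed change inside the utility function (the exact signature is assumed here, not copied from the repo):

import matplotlib.pyplot as plt

def generate_classification_report(*args, **kwargs):  # existing signature unchanged
    # ... existing logic that computes metrics and builds the report figure ...
    plt.show()  # render the figure so it displays without the caller needing plt.show()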

Demonstrate script testing/debugging in the script mode walkthroughs

Feature request

We already provide example shell commands to invoke and test training scripts for all variants of the migration challenge. For example, from the SKLearn variant:

!python3 src/main.py \
    --train ./data/train \
    --test ./data/test \
    --model_dir ./data/model \
    --class_names {class_names_str} \
    --n_estimators=100 \
    --min_samples_leaf=3

Since this is the recommended debugging workflow, we should also demonstrate it in the script mode walkthroughs by adding equivalent commands to the 'SageMaker' variants of these notebooks, before the Estimator is created.

This will help these notebooks illustrate the process/workflow of translating from in-notebook code to notebook + training job, better than just showing the final result.
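
For example, the SKLearn 'SageMaker' notebook might gain a cell like the sketch below just before the Estimator is defined (the instance type and framework version here are illustrative assumptions, not the notebook's actual values):

# Illustrative ordering only: quick local functional test of the script first
# (other args such as --class_names omitted here for brevity)
!python3 src/main.py \
    --train ./data/train \
    --test ./data/test \
    --model_dir ./data/model \
    --n_estimators=100 \
    --min_samples_leaf=3

# ...and only then configure the SageMaker training job with matching hyperparameters
import sagemaker
from sagemaker.sklearn import SKLearn

estimator = SKLearn(
    entry_point="src/main.py",
    role=sagemaker.get_execution_role(),
    instance_type="ml.m5.large",  # assumed instance type
    instance_count=1,
    framework_version="1.0-1",  # assumed - match whatever the notebook already pins
    hyperparameters={"n_estimators": 100, "min_samples_leaf": 3},
)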

Background

Today, we use in-notebook shell commands as the recommended script debugging workflow for the migration challenge - because our options are somewhat constrained for a workshop:

We talk about these other options in the post-challenge wrap-up, but don't want to confuse the issue by introducing them up-front in the code.

Make the MNIST drawing widget work in JupyterLab/Studio

JupyterLab (and therefore also SageMaker Studio) uses a controlled extensions system and does not support arbitrary trusted HTML/JavaScript, so our current MNIST "draw a digit" widget only works in plain Jupyter.

There are probably implementations out there that could support this if the right JupyterLab extensions were installed; ipycanvas looks like one possible candidate.
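
A rough sketch of what an ipycanvas-based widget might look like (API usage should be double-checked against the ipycanvas docs, and the extension would need to be installed in JupyterLab/Studio):

# Sketch of a "draw a digit" widget with ipycanvas (API details assumed)
from ipycanvas import Canvas

canvas = Canvas(width=280, height=280, sync_image_data=True)
canvas.fill_style = "black"
canvas.fill_rect(0, 0, 280, 280)  # black background, white "ink" like MNIST
canvas.fill_style = "white"

drawing = False

def on_mouse_down(x, y):
    global drawing
    drawing = True
    canvas.fill_circle(x, y, 10)

def on_mouse_move(x, y):
    if drawing:
        canvas.fill_circle(x, y, 10)

def on_mouse_up(x, y):
    global drawing
    drawing = False

canvas.on_mouse_down(on_mouse_down)
canvas.on_mouse_move(on_mouse_move)
canvas.on_mouse_up(on_mouse_up)
canvas  # display the widget; canvas.get_image_data() could then feed the model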

Alternative word vector source?

The NLP example currently uses GloVe word vectors from Stanford's repository, but these are:

  • Sometimes slow to download on our typical instance type (~6 min 30 s), because the combined zip of 50/100/200/300D vectors is downloaded and the unused files discarded. There don't seem to be separate downloads offered for the 100D vector size the model uses.
  • Only offered pre-trained in English, which makes the exercise less transferable for ASEAN customers.

We could instead consider:

  • Using FastText embeddings: they offer pre-trained "Word vectors for 157 languages", but only in 300D, so we'd still need to downsample to 100D using their reduction tool (see the sketch after this list), and the download would still be slower than necessary.
  • Using some other embedding source (?)
  • Pre-preparing and hosting embeddings (e.g. in S3) for optimized download times at the expense of transparency of how the vectors are created.
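
If we did go the FastText route, a rough sketch of the dimension-reduction step using the fasttext Python package's utilities (behaviour and download size should be verified before adopting):

# Sketch of downsampling FastText vectors to 100D (package behaviour assumed)
import fasttext
import fasttext.util

fasttext.util.download_model("en", if_exists="ignore")  # fetches cc.en.300.bin (several GB)
ft = fasttext.load_model("cc.en.300.bin")
fasttext.util.reduce_model(ft, 100)  # reduce from 300D to 100D, in place
print(ft.get_dimension())            # -> 100
print(ft.get_word_vector("headline")[:5])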

News Headline Classifier (SageMaker Version) - No module named 'tensorflow.python.eager'

I tried this notebook with the Python 3 (TensorFlow 1.15 Python 3.7 CPU Optimized) kernel and got this error.

https://github.com/apac-ml-tfc/sagemaker-workshop-101/blob/master/custom_tensorflow_keras_nlp/Headline%20Classifier%20SageMaker.ipynb

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<timed exec> in <module>

~/sagemaker-101-workshop/custom_tensorflow_keras_nlp/util/preprocessing.py in <module>
     13 import numpy as np
     14 from sklearn import preprocessing
---> 15 import tensorflow as tf
     16 from tensorflow.keras.preprocessing.text import Tokenizer
     17 from tensorflow.keras.preprocessing.sequence import pad_sequences

/usr/local/lib/python3.7/site-packages/tensorflow/__init__.py in <module>
     39 import sys as _sys
     40 
---> 41 from tensorflow.python.tools import module_util as _module_util
     42 from tensorflow.python.util.lazy_loader import LazyLoader as _LazyLoader
     43 

/usr/local/lib/python3.7/site-packages/tensorflow/python/__init__.py in <module>
     38 # pylint: disable=wildcard-import,g-bad-import-order,g-import-not-at-top
     39 
---> 40 from tensorflow.python.eager import context
     41 from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
     42 

ModuleNotFoundError: No module named 'tensorflow.python.eager'

MNIST model CPU training broken in TF v2.7 (conda_tensorflow2_p38 kernel on NBI ALv2 JLv3)

The current conda_tensorflow2_p38 kernel on the latest SageMaker Notebook Instance platform (notebook-al2-v2, as used in the CFn template) seems to break local CPU-only training for the MNIST migration challenge.

In this environment (TF v2.7.1, TF.Keras v2.7.0), tensorflow.keras.backend.image_data_format() reports channels_first, but training fails because MaxPoolingOp only supports channels_last on CPU, per the error message below:

InvalidArgumentError:  Default MaxPoolingOp only supports NHWC on device type CPU
	 [[node sequential/max_pooling2d/MaxPool
 (defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/layers/pooling.py:357)
]] [Op:__inference_train_function_862]

Errors may have originated from an input operation.
Input Source operations connected to node sequential/max_pooling2d/MaxPool:
In[0] sequential/conv2d_1/Relu (defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/backend.py:4867)

Overriding the image_data_format() check (in "Pre-Process the Data for our CNN") to prepare the data in a different shape does not work, because the model is then incompatible (raises a ValueError in conv2d_2).

Still seems to be working fine in current SMStudio kernel (TensorFlow v2.3.2, TF.Keras v2.4.0).
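
One possible workaround, not yet verified in this environment, would be to force channels_last globally before both the data prep and the model definition so the two stay consistent:

# Possible (unverified) workaround: force a CPU-friendly layout up front
import tensorflow as tf

tf.keras.backend.set_image_data_format("channels_last")
print(tf.keras.backend.image_data_format())  # -> "channels_last"

# Data would then be prepared as (batch, height, width, channels), e.g.:
# x_train = x_train.reshape((-1, 28, 28, 1))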

Simplify MNIST challenge via AWS Open Data

The current MNIST challenge downloads the source data in array format and then un-packs to folders-of-images, before reading back in.

Running the algorithm on folders-of-images is good for transferring the learning to real-world use cases, but the conversion process can be a bit confusing.

If we instead used s3://fast-ai-imageclas/mnist_png.tgz from FastAI on the AWS Open Data registry, then we could:

  1. Potentially reduce download time / increase availability, as the dataset is already S3-hosted
  2. Directly use their folders-of-PNGs format - no conversion needed
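
A rough sketch of what the simplified download could look like (assuming the bucket allows anonymous access and the archive unpacks to the usual mnist_png folder layout):

# Sketch of fetching the Open Data copy directly (access details assumed)
import tarfile
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
s3.download_file("fast-ai-imageclas", "mnist_png.tgz", "mnist_png.tgz")

with tarfile.open("mnist_png.tgz") as tar:
    tar.extractall("data")  # assumed to yield data/mnist_png/training/<digit>/*.png etc.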

[Built-in algos] Need to convert one-hot variables to numerics

Perhaps some pandas behaviour changed, but currently the following line in notebook 1 Autopilot and XGBoost.ipynb:

df_model_data = pd.get_dummies(df_model_data)  # Convert categorical variables to sets of indicators

...is yielding boolean-typed columns for all the one-hot encoded variables. This is consistent with the current pandas docs, and the notebook seems to train the XGBoost model fine, but the XGBoost evaluation Batch Transform job fails with:

RuntimeError: Loading csv data failed with Exception, please ensure data is in csv format:
 <class 'ValueError'>
 could not convert string to float: 'False'

I believe we need to add dtype=int to the get_dummies() call to ensure the generated train/val/test datasets are fully numeric and therefore compatible with the SageMaker XGBoost algorithm. I haven't quite finished testing this through yet, though.
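
The proposed change (pending testing) would be:

df_model_data = pd.get_dummies(df_model_data, dtype=int)  # Force numeric 0/1 indicator columns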
