aws-samples / sagemaker-101-workshop
Hands-on demonstrations for data scientists exploring Amazon SageMaker
The optional/non-core custom_sklearn_rf exercise currently uses the Boston housing dataset, which is deprecated due to ethical issues in its construction, as documented here.
For our purposes we're really just looking for an end-to-end tabular data SKLearn training & inference example. This example should be swapped out for some other standard sample dataset.
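As a rough illustration of what a replacement could look like, here is a minimal sketch of an end-to-end tabular train/evaluate flow on scikit-learn's bundled diabetes regression dataset. The choice of load_diabetes and the helper name train_and_evaluate are assumptions made for this example, not something decided in this issue.

```python
# Sketch only: load_diabetes is an assumed stand-in dataset, and
# train_and_evaluate is a hypothetical helper, not part of the workshop code.
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def train_and_evaluate(n_estimators=100, min_samples_leaf=3, seed=42):
    """Fit a random forest on a small tabular dataset and return (model, test MSE)."""
    X, y = load_diabetes(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    model = RandomForestRegressor(
        n_estimators=n_estimators,
        min_samples_leaf=min_samples_leaf,
        random_state=seed,
    )
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    return model, mse


model, mse = train_and_evaluate()
print(f"Test MSE: {mse:.1f}")
```

Any other bundled or openly licensed tabular dataset would serve equally well; the dataset is loaded in a single call, so swapping it should not disturb the rest of the exercise.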
It would be great to migrate the TensorFlow examples to TensorFlow 2 when possible.
Per the torchtext README, our current pinned torchtext version (0.6) is a long way out of sync with our PyTorch version (PTv1.8=TTv0.9, PTv1.10=TTv0.11).
I explored pinning the PT version to current and allowing pip to solve, with a statement like this:
!pip install torch==`pip show torch | grep 'Version:' | sed 's/Version: //'` torchtext
On the SMStudio PyTorch v1.10 CPU kernel, this installs the expected version of torchtext (0.11), but import torchtext fails due to missing symbols. Perhaps something is missing from the CPU-optimized build of PyTorch?
So for now torchtext remains pinned at a pretty old version. We only use it for basic English text tokenization (in the utility tokenize_and_pad_docs()), so we could switch to some other solution if this can't be resolved.
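If dropping the dependency turns out to be the easier route, a standard-library tokenizer may be enough for this use. The sketch below is an assumption about what such a replacement might look like: the function names, the special tokens, and the regular expression are all hypothetical, and it is not a drop-in replacement for the existing tokenize_and_pad_docs().

```python
import re
from typing import Dict, List, Optional, Tuple

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens for this sketch


def basic_english_tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into word and punctuation tokens."""
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\sa-z0-9]", text.lower())


def tokenize_and_pad(
    docs: List[str], max_len: int, vocab: Optional[Dict[str, int]] = None
) -> Tuple[List[List[int]], Dict[str, int]]:
    """Convert documents to fixed-length lists of token IDs.

    If no vocab is given, one is built from the documents themselves.
    Documents longer than max_len are truncated; shorter ones are padded.
    """
    tokenized = [basic_english_tokenize(doc)[:max_len] for doc in docs]
    if vocab is None:
        vocab = {PAD: 0, UNK: 1}
        for tokens in tokenized:
            for token in tokens:
                vocab.setdefault(token, len(vocab))
    pad_id, unk_id = vocab[PAD], vocab[UNK]
    padded = [
        [vocab.get(token, unk_id) for token in tokens]
        + [pad_id] * (max_len - len(tokens))
        for tokens in tokenized
    ]
    return padded, vocab


ids, vocab = tokenize_and_pad(["Hello, world!", "Hello again"], max_len=4)
print(ids)
```

Passing the returned vocab back in for validation or test documents maps any unseen word to the unknown-token ID rather than growing the vocabulary.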
In the first lab (SageMaker XGBoost HPO.ipynb), generate_classification_report() is called a couple of times. There does not appear to be a final call to plt.show() (from matplotlib.pyplot), so the report does not display in the notebook.
Adding the lines import matplotlib.pyplot as plt and plt.show() to the notebook solves the issue, but plt.show() should probably be added to the utility function itself.
We already provide example shell commands to invoke and test training scripts for all variants of the migration challenge. For example from SKLearn:
!python3 src/main.py \
--train ./data/train \
--test ./data/test \
--model_dir ./data/model \
--class_names {class_names_str} \
--n_estimators=100 \
--min_samples_leaf=3
Since this is the recommended debugging workflow, we should also demonstrate it in the script mode walkthroughs by adding equivalent commands in the 'SageMaker' variants of these notebooks, before the Estimator gets created.
This will help these notebooks illustrate the process/workflow of translating from in-notebook to notebook+job, better than just showing the final result.
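For context, the reason one command line can serve both local debugging and the training job is that the script can take its defaults from the SM_CHANNEL_* and SM_MODEL_DIR environment variables that SageMaker sets inside the container. A minimal sketch of such an argument parser follows; it assumes that --class_names is a comma-separated string and picks its own default values, neither of which is confirmed by the workshop's actual src/main.py.

```python
import argparse
import os


def parse_args(argv=None):
    """Parse training arguments, falling back to SageMaker environment variables.

    Hypothetical sketch: the comma-separated --class_names format and the
    default values are assumptions made for this example.
    """
    parser = argparse.ArgumentParser()
    # Data and model locations: set by SageMaker in a job, passed explicitly when local.
    parser.add_argument("--train", default=os.environ.get("SM_CHANNEL_TRAIN"))
    parser.add_argument("--test", default=os.environ.get("SM_CHANNEL_TEST"))
    parser.add_argument("--model_dir", default=os.environ.get("SM_MODEL_DIR"))
    parser.add_argument(
        "--class_names", type=lambda s: s.split(","), default=[]
    )
    # Hyperparameters
    parser.add_argument("--n_estimators", type=int, default=100)
    parser.add_argument("--min_samples_leaf", type=int, default=1)
    return parser.parse_args(argv)


# Local debugging: pass the same flags as in the shell command above.
args = parse_args([
    "--train", "./data/train",
    "--test", "./data/test",
    "--model_dir", "./data/model",
    "--class_names", "cat,dog",
    "--n_estimators=100",
    "--min_samples_leaf=3",
])
print(args)
```

Inside a training job the same function would be called with no arguments, so the paths come from the environment and the hyperparameters from the command line that SageMaker builds.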
Today, we use in-notebook shell commands as the recommended script debugging workflow for the migration challenge - because our options are somewhat constrained for a workshop:
We talk about these other options in the post-challenge wrap-up, but don't want to confuse the issue by introducing them up-front in the code.
JupyterLab (and therefore also SageMaker Studio) uses a controlled extensions system and does not support arbitrary trusted HTML/JavaScript, so our current MNIST "draw a digit" widget only works in plain Jupyter.
Probably there are implementations out there that could support this if the right JupyterLab extensions were installed. ipycanvas looks like one possible candidate?
Not sure yet whether it's worth it for the use-cases shown in this workshop, but we could consider adding:
import sagemaker_datawrangler
To enable the built-in notebook data preparation tools and visualisation from Data Wrangler?
Raising here as an issue to keep track
The NLP example currently uses GloVe word vectors from Stanford's repository, but these are:
Could maybe instead consider:
I tried this notebook with the Python 3 (TensorFlow 1.15 Python 3.7 CPU Optimized) kernel and got this error:
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
<timed exec> in <module>
~/sagemaker-101-workshop/custom_tensorflow_keras_nlp/util/preprocessing.py in <module>
13 import numpy as np
14 from sklearn import preprocessing
---> 15 import tensorflow as tf
16 from tensorflow.keras.preprocessing.text import Tokenizer
17 from tensorflow.keras.preprocessing.sequence import pad_sequences
/usr/local/lib/python3.7/site-packages/tensorflow/__init__.py in <module>
39 import sys as _sys
40
---> 41 from tensorflow.python.tools import module_util as _module_util
42 from tensorflow.python.util.lazy_loader import LazyLoader as _LazyLoader
43
/usr/local/lib/python3.7/site-packages/tensorflow/python/__init__.py in <module>
38 # pylint: disable=wildcard-import,g-bad-import-order,g-import-not-at-top
39
---> 40 from tensorflow.python.eager import context
41 from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
42
ModuleNotFoundError: No module named 'tensorflow.python.eager'
The current conda_tensorflow2_p38 kernel on the latest SageMaker Notebook Instance platform (notebook-al2-v2, as used in the CFn template) seems to break local CPU-only training for the MNIST migration challenge.
In this environment (TF v2.7.1, TF.Keras v2.7.0), tensorflow.keras.backend.image_data_format() asks for channels_first, but training fails because MaxPoolingOp only supports channels_last on CPU, per the error message below:
InvalidArgumentError: Default MaxPoolingOp only supports NHWC on device type CPU
[[node sequential/max_pooling2d/MaxPool
(defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/layers/pooling.py:357)
]] [Op:__inference_train_function_862]
Errors may have originated from an input operation.
Input Source operations connected to node sequential/max_pooling2d/MaxPool:
In[0] sequential/conv2d_1/Relu (defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/backend.py:4867)
Overriding the image_data_format() check (in "Pre-Process the Data for our CNN") to prepare the data in a different shape does not work because the model is then incompatible (it raises a ValueError in conv2d_2).
This still seems to work fine in the current SMStudio kernel (TensorFlow v2.3.2, TF.Keras v2.4.0).
The current MNIST challenge downloads the source data in array format and then un-packs to folders-of-images, before reading back in.
Running the algorithm on folders-of-images is good for transferring the learning to real-world use cases, but the conversion process can be a bit confusing.
If we instead used s3://fast-ai-imageclas/mnist_png.tgz from FastAI on the AWS Open Data registry, then we could:
Perhaps some pandas behaviour changed, but currently the following line in notebook 1 Autopilot and XGBoost.ipynb:
df_model_data = pd.get_dummies(df_model_data) # Convert categorical variables to sets of indicators
...is yielding boolean typed columns for all the one-hot encoded variables. This is consistent with the current pandas doc, and the notebook seems to train the XGBoost model fine - but the XGBoost evaluation Batch Transform job fails with:
RuntimeError: Loading csv data failed with Exception, please ensure data is in csv format:
<class 'ValueError'>
could not convert string to float: 'False'
I believe we need to add , dtype=int to the get_dummies() call to ensure the generated train/val/test datasets are fully numeric and therefore compatible with the SageMaker XGBoost algorithm. I haven't quite finished testing it through yet, though.
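To make the proposed change concrete, here is a small sketch on a toy DataFrame. The column names and values are invented for illustration and are not the workshop's dataset; the check at the end simply confirms that every CSV field parses as a float, which is the requirement behind the Batch Transform error above.

```python
import pandas as pd

# Toy data, invented for this sketch (not the workshop dataset).
df = pd.DataFrame(
    {"age": [31, 45, 27], "job": ["admin", "technician", "admin"]}
)

# dtype=int asks for 0/1 indicator columns instead of True/False.
encoded = pd.get_dummies(df, dtype=int)

# Write in the headerless CSV layout and confirm every field is numeric.
csv_text = encoded.to_csv(index=False, header=False)
for line in csv_text.strip().splitlines():
    for value in line.split(","):
        float(value)  # would raise ValueError on 'True' or 'False'

print(csv_text)
```

Casting after the fact with encoded.astype(int) would be another option, but passing dtype directly to get_dummies() keeps the fix on the line that caused the problem.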