sagemaker-101-workshop's Issues

Swap out sklearn dataset

The optional/non-core custom_sklearn_rf exercise currently uses the Boston housing dataset, which is deprecated due to ethical issues in its construction (as documented in the scikit-learn documentation).

For our purposes we're really just looking for an end-to-end tabular-data SKLearn training & inference example, so the dataset should be swapped for some other standard sample dataset.
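
For instance, a minimal end-to-end sketch using the California housing dataset (just one candidate replacement, not a decision) could look like this:

# Hypothetical replacement: California housing instead of the deprecated Boston dataset
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = fetch_california_housing(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = RandomForestRegressor(n_estimators=100, min_samples_leaf=3)
model.fit(X_train, y_train)
print("R^2 on held-out data:", model.score(X_test, y_test))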

Migrate to TFv2

It would be great to migrate the TensorFlow examples to TensorFlow v2 when possible.

Can't upgrade torchtext with newer PyTorch

Per the torchtext README, our current pinned torchtext version (0.6) is a long way out of sync with our PyTorch version (PTv1.8=TTv0.9, PTv1.10=TTv0.11).

I explored pinning torch to the currently installed version and letting pip resolve torchtext, with a statement like this:

!pip install torch==`pip show torch | grep 'Version:' | sed 's/Version: //'` torchtext

On the SMStudio PyTorch v1.10 CPU kernel, this installs the expected version of torchtext (0.11), but import torchtext fails due to missing symbols. Perhaps due to something missing from the CPU-optimized version of PyTorch?

So for now torchtext remains pinned at a pretty old version. We only use it for basic English text tokenization (the tokenize_and_pad_docs() utility), so we could perhaps switch to some other solution if this can't be resolved.
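
If torchtext does get dropped, one possible direction is a small torchtext-free tokenizer. A rough sketch (illustrative only, not a drop-in replacement for the existing utility):

# A minimal, torchtext-free sketch of basic English tokenization + padding
import re
from collections import Counter

def simple_tokenize(text):
    """Lowercase and split on alphanumeric runs - roughly what 'basic_english' does."""
    return re.findall(r"[a-z0-9]+", text.lower())

def tokenize_and_pad(docs, vocab, max_len, pad_idx=0, unk_idx=1):
    """Map each document to a fixed-length list of vocabulary indices."""
    out = []
    for doc in docs:
        ids = [vocab.get(tok, unk_idx) for tok in simple_tokenize(doc)][:max_len]
        ids += [pad_idx] * (max_len - len(ids))
        out.append(ids)
    return out

# Example usage with a toy vocabulary (0 = padding, 1 = unknown):
docs = ["Markets rally on strong earnings", "New phone launch delayed again"]
counts = Counter(tok for d in docs for tok in simple_tokenize(d))
vocab = {tok: i + 2 for i, (tok, _) in enumerate(counts.most_common())}
print(tokenize_and_pad(docs, vocab, max_len=8))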

generate_classification_report() does not show plot

In the first lab (SageMaker XGBoost HPO.ipynb), generate_classification_report() is called a couple of times. There does not appear to be a final call to plt.show() (from matplotlib.pyplot), and so the report does not display in the notebook.

Adding the lines import matplotlib.pyplot as plt and plt.show() to the notebook solves the issue, but plt.show() should probably be added to the utility function itself.
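
A minimal sketch of the proposed change inside the utility function (the exact signature is assumed here, not copied from the repo):

import matplotlib.pyplot as plt

def generate_classification_report(*args, **kwargs):  # existing signature unchanged
    # ... existing logic that computes metrics and builds the report figure ...
    plt.show()  # render the figure so it displays without the caller needing plt.show()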

Demonstrate script testing/debugging in the script mode walkthroughs

Feature request

We already provide example shell commands to invoke and test training scripts for all variants of the migration challenge. For example, from the SKLearn variant:

!python3 src/main.py \
    --train ./data/train \
    --test ./data/test \
    --model_dir ./data/model \
    --class_names {class_names_str} \
    --n_estimators=100 \
    --min_samples_leaf=3

Since this is the recommended debugging workflow, we should also demonstrate it in the script mode walkthroughs by adding equivalent commands to the 'SageMaker' variants of these notebooks, before the Estimator is created.

This will help these notebooks illustrate the process/workflow of translating from in-notebook code to notebook + training job, better than just showing the final result.
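
For example, the SKLearn 'SageMaker' notebook might gain a cell like the sketch below just before the Estimator is defined (the instance type and framework version here are illustrative assumptions, not the notebook's actual values):

# Illustrative ordering only: quick local functional test of the script first
# (other args such as --class_names omitted here for brevity)
!python3 src/main.py \
    --train ./data/train \
    --test ./data/test \
    --model_dir ./data/model \
    --n_estimators=100 \
    --min_samples_leaf=3

# ...and only then configure the SageMaker training job with matching hyperparameters
import sagemaker
from sagemaker.sklearn import SKLearn

estimator = SKLearn(
    entry_point="src/main.py",
    role=sagemaker.get_execution_role(),
    instance_type="ml.m5.large",  # assumed instance type
    instance_count=1,
    framework_version="1.0-1",  # assumed - match whatever the notebook already pins
    hyperparameters={"n_estimators": 100, "min_samples_leaf": 3},
)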

Background

Today, we use in-notebook shell commands as the recommended script debugging workflow for the migration challenge - because our options are somewhat constrained for a workshop:

We talk about these other options in the post-challenge wrap-up, but don't want to confuse the issue by introducing them up-front in the code.

Make the MNIST drawing widget work in JupyterLab/Studio

JupyterLab (and therefore also SageMaker Studio) uses a controlled extensions system and does not support arbitrary trusted HTML/JavaScript, so our current MNIST "draw a digit" widget only works in plain Jupyter.

There are probably implementations out there that could support this if the right JupyterLab extensions were installed; ipycanvas looks like one possible candidate.
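
A rough sketch of what an ipycanvas-based widget might look like (API usage should be double-checked against the ipycanvas docs, and the extension would need to be installed in JupyterLab/Studio):

# Sketch of a "draw a digit" widget with ipycanvas (API details assumed)
from ipycanvas import Canvas

canvas = Canvas(width=280, height=280, sync_image_data=True)
canvas.fill_style = "black"
canvas.fill_rect(0, 0, 280, 280)  # black background, white "ink" like MNIST
canvas.fill_style = "white"

drawing = False

def on_mouse_down(x, y):
    global drawing
    drawing = True
    canvas.fill_circle(x, y, 10)

def on_mouse_move(x, y):
    if drawing:
        canvas.fill_circle(x, y, 10)

def on_mouse_up(x, y):
    global drawing
    drawing = False

canvas.on_mouse_down(on_mouse_down)
canvas.on_mouse_move(on_mouse_move)
canvas.on_mouse_up(on_mouse_up)
canvas  # display the widget; canvas.get_image_data() could then feed the model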

Alternative word vector source?

The NLP example currently uses GloVe word vectors from Stanford's repository, but these are:

  • Sometimes slow to download on our typical instance type (~6 min 30 s), because the combined zip of 50/100/200/300D vectors is downloaded and the unused files discarded. There don't seem to be separate downloads offered for the 100D vector size the model uses.
  • Only offered pre-trained in English, which makes the exercise less transferable for ASEAN customers.

We could instead consider:

  • Using FastText embeddings: they offer pre-trained "Word vectors for 157 languages", but only in 300D, so we'd still need to downsample to 100D using their reduction tool (see the sketch after this list), and the download would still be slower than necessary.
  • Using some other embedding source (?)
  • Pre-preparing and hosting embeddings (e.g. in S3) for optimized download times at the expense of transparency of how the vectors are created.
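
If we did go the FastText route, a rough sketch of the dimension-reduction step using the fasttext Python package's utilities (behaviour and download size should be verified before adopting):

# Sketch of downsampling FastText vectors to 100D (package behaviour assumed)
import fasttext
import fasttext.util

fasttext.util.download_model("en", if_exists="ignore")  # fetches cc.en.300.bin (several GB)
ft = fasttext.load_model("cc.en.300.bin")
fasttext.util.reduce_model(ft, 100)  # reduce from 300D to 100D, in place
print(ft.get_dimension())            # -> 100
print(ft.get_word_vector("headline")[:5])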

News Headline Classifier (SageMaker Version) - No module named 'tensorflow.python.eager'

I tried this notebook with the Python 3 (TensorFlow 1.15 Python 3.7 CPU Optimized) kernel and got this error.

https://github.com/apac-ml-tfc/sagemaker-workshop-101/blob/master/custom_tensorflow_keras_nlp/Headline%20Classifier%20SageMaker.ipynb

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<timed exec> in <module>

~/sagemaker-101-workshop/custom_tensorflow_keras_nlp/util/preprocessing.py in <module>
     13 import numpy as np
     14 from sklearn import preprocessing
---> 15 import tensorflow as tf
     16 from tensorflow.keras.preprocessing.text import Tokenizer
     17 from tensorflow.keras.preprocessing.sequence import pad_sequences

/usr/local/lib/python3.7/site-packages/tensorflow/__init__.py in <module>
     39 import sys as _sys
     40 
---> 41 from tensorflow.python.tools import module_util as _module_util
     42 from tensorflow.python.util.lazy_loader import LazyLoader as _LazyLoader
     43 

/usr/local/lib/python3.7/site-packages/tensorflow/python/__init__.py in <module>
     38 # pylint: disable=wildcard-import,g-bad-import-order,g-import-not-at-top
     39 
---> 40 from tensorflow.python.eager import context
     41 from tensorflow.python import pywrap_tensorflow as _pywrap_tensorflow
     42 

ModuleNotFoundError: No module named 'tensorflow.python.eager'

MNIST model CPU training broken in TF v2.7 (conda_tensorflow2_p38 kernel on NBI ALv2 JLv3)

The current conda_tensorflow2_p38 kernel on the latest SageMaker Notebook Instance platform (notebook-al2-v2, as used in the CFn template) seems to break local CPU-only training for the MNIST migration challenge.

In this environment (TF v2.7.1, TF.Keras v2.7.0), tensorflow.keras.backend.image_data_format() reports channels_first, but training fails because MaxPoolingOp only supports channels_last on CPU, per the error message below:

InvalidArgumentError:  Default MaxPoolingOp only supports NHWC on device type CPU
	 [[node sequential/max_pooling2d/MaxPool
 (defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/layers/pooling.py:357)
]] [Op:__inference_train_function_862]

Errors may have originated from an input operation.
Input Source operations connected to node sequential/max_pooling2d/MaxPool:
In[0] sequential/conv2d_1/Relu (defined at /home/ec2-user/anaconda3/envs/tensorflow2_p38/lib/python3.8/site-packages/keras/backend.py:4867)

Overriding the image_data_format() check (in "Pre-Process the Data for our CNN") to prepare the data in a different shape does not work, because the model is then incompatible (raises a ValueError in conv2d_2).

Still seems to be working fine in current SMStudio kernel (TensorFlow v2.3.2, TF.Keras v2.4.0).
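
One possible workaround, not yet verified in this environment, would be to force channels_last globally before both the data prep and the model definition so the two stay consistent:

# Possible (unverified) workaround: force a CPU-friendly layout up front
import tensorflow as tf

tf.keras.backend.set_image_data_format("channels_last")
print(tf.keras.backend.image_data_format())  # -> "channels_last"

# Data would then be prepared as (batch, height, width, channels), e.g.:
# x_train = x_train.reshape((-1, 28, 28, 1))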

Simplify MNIST challenge via AWS Open Data

The current MNIST challenge downloads the source data in array format and then un-packs to folders-of-images, before reading back in.

Running the algorithm on folders-of-images is good for transferring the learning to real-world use cases, but the conversion process can be a bit confusing.

If we instead used s3://fast-ai-imageclas/mnist_png.tgz from FastAI on the AWS Open Data registry, then we could:

  1. Potentially reduce download time / increase availability, as the dataset is already S3-hosted
  2. Directly use their folders-of-PNGs format - no conversion needed
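
A rough sketch of what the simplified download could look like (assuming the bucket allows anonymous access and the archive unpacks to the usual mnist_png folder layout):

# Sketch of fetching the Open Data copy directly (access details assumed)
import tarfile
import boto3
from botocore import UNSIGNED
from botocore.config import Config

s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
s3.download_file("fast-ai-imageclas", "mnist_png.tgz", "mnist_png.tgz")

with tarfile.open("mnist_png.tgz") as tar:
    tar.extractall("data")  # assumed to yield data/mnist_png/training/<digit>/*.png etc.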

[Built-in algos] Need to convert one-hot variables to numerics

Perhaps some pandas behaviour changed, but currently the following line in notebook 1 Autopilot and XGBoost.ipynb:

df_model_data = pd.get_dummies(df_model_data)  # Convert categorical variables to sets of indicators

...is yielding boolean-typed columns for all the one-hot encoded variables. This is consistent with the current pandas docs, and the notebook seems to train the XGBoost model fine, but the XGBoost evaluation Batch Transform job fails with:

RuntimeError: Loading csv data failed with Exception, please ensure data is in csv format:
 <class 'ValueError'>
 could not convert string to float: 'False'

I believe we need to add dtype=int to the get_dummies() call to ensure the generated train/val/test datasets are fully numeric and therefore compatible with the SageMaker XGBoost algorithm. I haven't quite finished testing this through yet, though.
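
The proposed change (pending testing) would be:

df_model_data = pd.get_dummies(df_model_data, dtype=int)  # Force numeric 0/1 indicator columns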
