
active-learning-cml's Introduction

Interactive and visual workflow of active learning

Supervised machine learning, while powerful, needs labeled data to be effective. Active learning reduces the number of labeled examples needed to train a model, saving time and money while obtaining comparable performance to models trained with much more data.

This application serves as a complement to the prototype accompanying the report we released on Learning with Limited Labeled Data. To build an intuition for why active learning works, please see our blog post, A guide to learning with limited labeled data.

[Screenshot of the Active Learner application]

What is Active Learning?

Active learning is an iterative process that relies on human input to build up a smartly labeled dataset. The process typically looks like this (a runnable sketch follows the list):

  • Begin with a small set of labeled data (that is all we have)
  • Train a model
  • Use the trained model to run inference on a much larger pool of unlabeled data available to us
  • Use a selection strategy to pick out points that are difficult for the machine to predict correctly
  • Request labels from a human for those difficult points
  • Add these examples back to the labeled dataset and retrain on the expanded dataset
  • Repeat the training and labeling process until you have achieved your desired model performance
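
This repo implements the loop with a convolutional neural network on MNIST; the standalone sketch below uses scikit-learn's bundled digits data and a logistic regression purely to illustrate the loop's shape (all names here are illustrative, not the repo's code):

# Illustrative active-learning loop (not this repo's code): scikit-learn's
# bundled digits data and a logistic regression stand in for MNIST and the CNN.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=50, replace=False)     # small labeled seed
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)     # large unlabeled pool

for rnd in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[unlabeled])
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # selection strategy
    query = unlabeled[np.argsort(entropy)[-10:]]             # 10 hardest points
    # In the real workflow a human supplies these labels; here we read y.
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)
    print(f"round {rnd}: {len(labeled)} labeled, acc={model.score(X, y):.3f}")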

We use the MNIST handwritten digit dataset to illustrate the active learning workflow. To start, we train a convolutional neural network using only a few labeled datapoints. You can kick off training by selecting appropriate hyperparameters and clicking the "TRAIN" button. Once the model has completed training, you can visualize the embeddings of the convolutional neural network by using UMAP to project the high-dimensional representation down to 2D.
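
The projection step looks roughly like the following (a minimal sketch assuming the umap-learn package; `embeddings` is a hypothetical stand-in for the CNN's learned representation, not a variable from this repo):

# A minimal sketch of the UMAP projection step, assuming umap-learn.
import numpy as np
import umap

embeddings = np.random.rand(1000, 64)          # placeholder for real embeddings
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
print(coords.shape)                            # (1000, 2), ready to scatter-plot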

Structure

.
├── activelearning        # active learning scripts
├── apps                  # Dash application
├── assets                # Dash related stylesheets
├── cml                   # contains scripts that facilitate the project launch on CML
├── data                  # contains MNIST data
├── docs                  # images/snapshots for README
└── experiments           # contains a script that demonstrates the use of AL functions

The assets, data, and docs directories contain supporting files and can be ignored when exploring the code.

activelearning

activelearning
├── data.py               # loads train and validation data
├── dataset.py            # MNIST dataset handler
├── model.py              # Neural network architecture
├── sample.py             # defines the random, entropy, and entropy-dropout selection strategies (sketched below)
└── train.py              # helper functions for model training, generating embeddings and predictions, computing metrics, checkpointing, saving results to a text file, and so on
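
For flavor, here is a minimal sketch of an entropy-dropout (Monte Carlo dropout) selection strategy; it is an illustration under assumed names, not the actual contents of sample.py:

# Illustrative entropy-dropout selection (hypothetical; not sample.py's code):
# keep dropout active at inference, average several stochastic forward passes,
# and rank unlabeled points by the entropy of the mean prediction.
import torch

def entropy_dropout_select(model, x_unlabeled, n=10, passes=10):
    model.train()                            # leaves dropout layers active
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x_unlabeled), dim=1) for _ in range(passes)]
        ).mean(dim=0)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
    return torch.topk(entropy, n).indices    # the n most uncertain points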

apps

apps
├── app.py
├── demo.py 
├── demo_description.md
└── demo_intro.md

These scripts leverage the stylesheets from the assets folder to provide a UI for:

  • training a model based on user-selected hyperparameters,
  • visualizing the trained embeddings using UMAP,
  • visualizing model performance metrics on the training and validation sets, and
  • requesting labels from the user for 10 selected datapoints to retrain the model.

The final model is saved in the "/models" directory for future use.

experiments

experiments
├── main.py               # code to experiment and test data and model functions w/o UI

Note: You still need to go through the installation process below to be able to run this code.

data

data
└── MNIST

The project uses the MNIST handwritten digit dataset to illustrate the AL workflow. Only the 10,000-example test split is used, and it serves as the entire dataset for this project. The split works as follows (sketched in code after the list):

  • First, set aside 2,000 datapoints for validation/testing.
  • Out of the remaining 8,000 datapoints, allow the user to select 100, 500, or 1,000 as initial labeled examples; the rest are treated as unlabeled.
  • The user can then provide labels for the 10 examples shortlisted (based on the selection strategy) from the remaining training examples and continue training the model with the additional datapoints.
  • In the long run, model performance differs based on the selection strategy employed.
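
A rough sketch of that split (variable names are hypothetical; the repo's actual code may differ):

# Illustrative version of the validation / labeled / unlabeled split.
import numpy as np

rng = np.random.default_rng(0)
indices = rng.permutation(10_000)      # the 10k MNIST test split
val_idx = indices[:2_000]              # held out for validation/testing
pool_idx = indices[2_000:]             # 8,000 training candidates
initial = pool_idx[:1_000]             # user chooses 100, 500, or 1,000
unlabeled = pool_idx[1_000:]           # remainder awaits labeling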

Deploying on CML

There are three ways to launch this project on CML:

  1. From Prototype Catalog - Navigate to the Prototype Catalog on a CML workspace, select the "Active Learning" tile, click "Launch as Project", click "Configure Project"
  2. As ML Prototype - In a CML workspace, click "New Project", add a Project Name, select "ML Prototype" as the Initial Setup option, copy in the repo URL, click "Create Project", click "Configure Project"
  3. Manual Setup - In a CML workspace, click "New Project", add a Project Name, select "Git" as the Initial Setup option, copy in the repo URL, click "Create Project". Then, follow the installation instructions below.

Installation

The code and applications were developed using Python 3.6.9 and are likely to work with more recent versions of Python as well.

To install dependencies, first create and activate a new virtual environment through your preferred means, then pip install from the requirements file. We recommend:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

In CML or CDSW, no virtual env is necessary. Instead, inside a Python 3 session (with at least 4 vCPU / 6 GiB Memory), simply run

!pip3 install -r requirements.txt     # notice `pip3`, not `pip`

Starting the application as a normal Python session

  • First, specify the port in apps/app.py by uncommenting the lines marked for a normal Python session
    # for normal python session uncomment below
    server = app.server
    
    # Running server
    if __name__ == '__main__':
      # for running on CDSW / CML uncomment below
      # app.run_server(port=os.getenv("CDSW_APP_PORT"))
      # OR 
      # for normal python session uncomment below
      app.run_server(debug=True)
    
  • Run
    python apps/app.py
    

Starting the application within CML or CDSW

  • First, specify the port in apps/app.py by commenting out the lines marked for a normal Python session
    # for normal python session uncomment below
    # server = app.server
    
    # Running server
    if __name__ == '__main__':
      # for running on CDSW / CML uncomment below
      app.run_server(port=os.getenv("CDSW_APP_PORT"))
      # OR 
      # for normal python session uncomment below
      # app.run_server(debug=True)
    
  • Second, set the subdomain in CDSW's Applications tab.
  • Third, enter apps/app.py in the Script field in CDSW's Applications tab.
  • Fourth, start the application within CDSW.
  • Finally, access the demo at subdomain.ffl-4.cdsw.eng.cloudera.com

active-learning-cml's People

Contributors

nishamuktewar


active-learning-cml's Issues

AMP Review 2

💥 This AMP is looking awesome! I really like the workflow you've built in the UI - it really helps the user understand the process. The repo is in a really good spot for the most part. I just have a few comments on documentation in the README and in the UI itself that I've outlined below. If any of these are major hurdles that cannot be implemented easily, feel free to push back and we can discuss together.

General AMP Feedback

README.md

  • Minor re-wording
    [screenshot]

  • Might be good to specify which MNIST dataset (e.g. handwritten digits) since there are others. Also would be good to link to the original dataset here.
    [screenshot]

  • I think this should say apps/app.py and might also be good to include a link to that directory. This comment also applies to the "Starting the application within CML or CDSW" section
    [screenshot]

Active Learner UI Feedback

  • I think it would be helpful to add a bit more context at the introduction of the UI to help orient the user to what they are looking at. After the opening sentence (shown below with red arrow), it might be good to include something like:

In this application, we use the MNIST handwritten digit dataset to illustrate the active learning workflow. The workflow begins by training a convolutional neural network to recognize which digit is present in each image using only a small sample of labeled examples. We then compute the loss and accuracy of this model on a hold-out validation set and are presented with ten unseen, ambiguous (as defined by the model) examples which we can manually assign labels for. Those ten examples are then added into the original training set, which helps the model learn to distinguish between ambiguous classes more effectively. This process can be repeated until an acceptable level of performance is achieved. For step-by-step instructions, click the "Learn More" button below.

[screenshot]

The next few items correspond to the image below

[screenshot]

  • Would it be possible to make these bullets into an ordered list (1,2,3,4) instead of bullet points?
  • In the third bullet, what are the embeddings that are being referenced here? Can we clarify that they are the weights from the last layer (or whatever they happen to be)?
  • The last bullet says "Repeat the process until you have achieved the desired performance"... it seems there isn't currently an easy way to quickly obtain a performance metric without scanning the loss/accuracy plot for the maxima/minima. I think it would be super helpful to add a set of KPIs to the page that spells out:
    • here's the best loss/accuracy from the previous iteration
    • here's the best loss/accuracy from the current iteration
    • [maybe] here's the % change in performance by labeling 10 more datapoints (a sketch of one such computation follows)
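
One way such a KPI could be computed (a hypothetical helper, not part of this repo):

# Hypothetical KPI helper: percent change in best accuracy between iterations.
def kpi_delta(prev_best_acc: float, curr_best_acc: float) -> float:
    """Percent change in best accuracy after labeling 10 more datapoints."""
    return 100.0 * (curr_best_acc - prev_best_acc) / prev_best_acc

print(kpi_delta(0.82, 0.85))   # ~3.66 (% improvement)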

Scatter Plot & Loss/Accuracy Plots

  • Can we add a title (and maybe a description) to the top of the scatterplot to help orient users to the UI? Basically just helping people understand that each dot represents an image and is spatially organized by the UMAP projection of some set of weights/embeddings
  • Also could we add a title to both the loss and accuracy plots specifying this is the training profile for the most recent training cycle
