
active-learning-cml's Introduction

Interactive and visual workflow of active learning

Supervised machine learning, while powerful, needs labeled data to be effective. Active learning reduces the number of labeled examples needed to train a model, saving time and money while obtaining comparable performance to models trained with much more data.

This application serves as a complement to the prototype accompanying the report we released on Learning with Limited Labeled Data. To build an intuition for why active learning works, please see our blog post, A guide to learning with limited labeled data.

[Screenshot of the Active Learner application]

What is Active Learning?

Active learning is an iterative process that relies on human input to build up a smartly labeled dataset. The process typically looks like this (a runnable sketch follows the list):

  • Begin with a small set of labeled data (that is all we have)
  • Train a model
  • Use the trained model to run inference on a much larger pool of unlabeled data available to us
  • Use a selection strategy to pick out points that are difficult for the machine to predict correctly
  • Request labels from a human for those difficult points
  • Add these examples back to the labeled dataset and retrain on the expanded dataset
  • Repeat the training and labeling process until you have achieved your desired model performance
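
This repo implements the loop with a convolutional neural network on MNIST; the standalone sketch below uses scikit-learn's bundled digits data and a logistic regression purely to illustrate the loop's shape (all names here are illustrative, not the repo's code):

# Illustrative active-learning loop (not this repo's code): scikit-learn's
# bundled digits data and a logistic regression stand in for MNIST and the CNN.
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)
labeled = rng.choice(len(X), size=50, replace=False)     # small labeled seed
unlabeled = np.setdiff1d(np.arange(len(X)), labeled)     # large unlabeled pool

for rnd in range(5):
    model = LogisticRegression(max_iter=1000).fit(X[labeled], y[labeled])
    probs = model.predict_proba(X[unlabeled])
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)  # selection strategy
    query = unlabeled[np.argsort(entropy)[-10:]]             # 10 hardest points
    # In the real workflow a human supplies these labels; here we read y.
    labeled = np.concatenate([labeled, query])
    unlabeled = np.setdiff1d(unlabeled, query)
    print(f"round {rnd}: {len(labeled)} labeled, acc={model.score(X, y):.3f}")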

We use the MNIST handwritten digit dataset to illustrate the active learning workflow. To start, we train a convolutional neural network using only a few labeled datapoints. You can kick off training by selecting appropriate hyperparameters and clicking the "TRAIN" button. Once the model has completed training, you can visualize the embeddings of the convolutional neural network by using UMAP to project the high-dimensional representation down to 2D.
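
The projection step looks roughly like the following (a minimal sketch assuming the umap-learn package; `embeddings` is a hypothetical stand-in for the CNN's learned representation, not a variable from this repo):

# A minimal sketch of the UMAP projection step, assuming umap-learn.
import numpy as np
import umap

embeddings = np.random.rand(1000, 64)          # placeholder for real embeddings
coords = umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)
print(coords.shape)                            # (1000, 2), ready to scatter-plot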

Structure

.
├── activelearning        # active learning scripts
├── apps                  # Dash application
├── assets                # Dash related stylesheets
├── cml                   # contains scripts that facilitate the project launch on CML
├── data                  # contains MNIST data
├── docs                  # images/snapshots for README
└── experiments           # contains a script that demonstrates the use of AL functions

The assets, data, and docs directories contain supporting files and can be ignored when exploring the code.

activelearning

activelearning
├── data.py               # loads train and validation data
├── dataset.py            # MNIST dataset handler
├── model.py              # Neural network architecture
├── sample.py             # defines the random, entropy, and entropy-dropout selection strategies (sketched below)
└── train.py              # helper functions for model training, generating embeddings and predictions, computing metrics, checkpointing, saving results to a text file, and so on
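
For flavor, here is a minimal sketch of an entropy-dropout (Monte Carlo dropout) selection strategy; it is an illustration under assumed names, not the actual contents of sample.py:

# Illustrative entropy-dropout selection (hypothetical; not sample.py's code):
# keep dropout active at inference, average several stochastic forward passes,
# and rank unlabeled points by the entropy of the mean prediction.
import torch

def entropy_dropout_select(model, x_unlabeled, n=10, passes=10):
    model.train()                            # leaves dropout layers active
    with torch.no_grad():
        probs = torch.stack(
            [torch.softmax(model(x_unlabeled), dim=1) for _ in range(passes)]
        ).mean(dim=0)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
    return torch.topk(entropy, n).indices    # the n most uncertain points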

apps

apps
├── app.py
├── demo.py 
├── demo_description.md
└── demo_intro.md

These scripts leverage the stylesheets from the assets folder to provide a UI for:

  • training a model based on user-selected hyperparameters,
  • visualizing the trained embeddings using UMAP,
  • visualizing model performance metrics on the training and validation sets, and
  • requesting labels from the user for 10 selected datapoints to retrain the model.

The final model is saved in the "/models" directory for future use.

experiments

experiments
├── main.py               # code to experiment and test data and model functions w/o UI

Note: You still need to go through the installation process below to be able to run this code.

data

data
└── MNIST

The project uses the MNIST handwritten digit dataset to illustrate the AL workflow. Only the 10,000-example test split is used, and it serves as the entire dataset for this project. The split works as follows (sketched in code after the list):

  • First, set aside 2,000 datapoints for validation/testing.
  • Out of the remaining 8,000 datapoints, allow the user to select 100, 500, or 1,000 as initial labeled examples; the rest are treated as unlabeled.
  • The user can then provide labels for the 10 examples shortlisted (based on the selection strategy) from the remaining training examples and continue training the model with the additional datapoints.
  • In the long run, model performance differs based on the selection strategy employed.
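
A rough sketch of that split (variable names are hypothetical; the repo's actual code may differ):

# Illustrative version of the validation / labeled / unlabeled split.
import numpy as np

rng = np.random.default_rng(0)
indices = rng.permutation(10_000)      # the 10k MNIST test split
val_idx = indices[:2_000]              # held out for validation/testing
pool_idx = indices[2_000:]             # 8,000 training candidates
initial = pool_idx[:1_000]             # user chooses 100, 500, or 1,000
unlabeled = pool_idx[1_000:]           # remainder awaits labeling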

Deploying on CML

There are three ways to launch this project on CML:

  1. From Prototype Catalog - Navigate to the Prototype Catalog on a CML workspace, select the "Active Learning" tile, click "Launch as Project", click "Configure Project"
  2. As ML Prototype - In a CML workspace, click "New Project", add a Project Name, select "ML Prototype" as the Initial Setup option, copy in the repo URL, click "Create Project", click "Configure Project"
  3. Manual Setup - In a CML workspace, click "New Project", add a Project Name, select "Git" as the Initial Setup option, copy in the repo URL, click "Create Project". Then, follow the installation instructions below.

Installation

The code and applications were developed using Python 3.6.9 and are likely to work with more recent versions of Python as well.

To install dependencies, first create and activate a new virtual environment through your preferred means, then pip install from the requirements file. We recommend:

python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

In CML or CDSW, no virtual env is necessary. Instead, inside a Python 3 session (with at least 4 vCPU / 6 GiB Memory), simply run

!pip3 install -r requirements.txt     # notice `pip3`, not `pip`

Starting the application as a normal Python session

  • First, specify the port in apps/app.py by uncommenting the lines marked for a normal Python session
    # for normal python session uncomment below
    server = app.server
    
    # Running server
    if __name__ == '__main__':
      # for running on CDSW / CML uncomment below
      # app.run_server(port=os.getenv("CDSW_APP_PORT"))
      # OR 
      # for normal python session uncomment below
      app.run_server(debug=True)
    
  • Run
    python apps/app.py
    

Starting the application within CML or CDSW

  • First, specify the port in apps/app.py by commenting out the lines marked for a normal Python session
    # for normal python session uncomment below
    # server = app.server
    
    # Running server
    if __name__ == '__main__':
      # for running on CDSW / CML uncomment below
      app.run_server(port=os.getenv("CDSW_APP_PORT"))
      # OR 
      # for normal python session uncomment below
      # app.run_server(debug=True)
    
  • Second, set the subdomain in CDSW's Applications tab.
  • Third, enter apps/app.py in the Script field in CDSW's Applications tab.
  • Fourth, start the application within CDSW.
  • Finally, access the demo at subdomain.ffl-4.cdsw.eng.cloudera.com

active-learning-cml's People

Contributors

nishamuktewar


active-learning-cml's Issues

AMP Review 2

💥 This AMP is looking awesome! I really like the workflow you've built in the UI - it really helps the user understand the process. The repo is in a really good spot for the most part. I just have a few comments on documentation in the README and in the UI itself that I've outlined below. If any of these are major hurdles that cannot be implemented easily, feel free to push back and we can discuss together.

General AMP Feedback

README.md

  • Minor re-wording
    [screenshot]

  • Might be good to specify which MNIST dataset (e.g. handwritten digits) since there are others. Also would be good to link to the original dataset here.
    [screenshot]

  • I think this should say apps/app.py and might also be good to include a link to that directory. This comment also applies to the "Starting the application within CML or CDSW" section
    [screenshot]

Active Learner UI Feedback

  • I think it would be helpful to add a bit more context at the introduction of the UI to help orient the user to what they are looking at. After the opening sentence (shown below with red arrow), it might be good to include something like:

In this application, we use the MNIST handwritten digit dataset to illustrate the active learning workflow. The workflow begins by training a convolutional neural network to recognize which digit is present in each image using only a small sample of labeled examples. We then compute the loss and accuracy of this model on a hold-out validation set and are presented with ten unseen, ambiguous (as defined by the model) examples which we can manually assign labels for. Those ten examples are then added into the original training set, which helps the model learn to distinguish between ambiguous classes more effectively. This process can be repeated until an acceptable level of performance is achieved. For step-by-step instructions, click the "Learn More" button below.

[screenshot]

The next few items correspond to the image below

[screenshot]

  • Would it be possible to make these bullets into an ordered list (1,2,3,4) instead of bullet points?
  • In the third bullet, what are the embeddings that are being referenced here? Can we clarify that they are the weights from the last layer (or whatever they happen to be)?
  • The last bullet says "Repeat the process until you have achieved the desired performance"... it seems there isn't currently an easy way to quickly obtain a performance metric without scanning the loss/accuracy plot for the maxima/minima. I think it would be super helpful to add a set of KPIs to the page that spells out:
    • here's the best loss/accuracy from the previous iteration
    • here's the best loss/accuracy from the current iteration
    • [maybe] here's the % change in performance by labeling 10 more datapoints (a sketch of one such computation follows)
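
One way such a KPI could be computed (a hypothetical helper, not part of this repo):

# Hypothetical KPI helper: percent change in best accuracy between iterations.
def kpi_delta(prev_best_acc: float, curr_best_acc: float) -> float:
    """Percent change in best accuracy after labeling 10 more datapoints."""
    return 100.0 * (curr_best_acc - prev_best_acc) / prev_best_acc

print(kpi_delta(0.82, 0.85))   # ~3.66 (% improvement)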

Scatter Plot & Loss/Accuracy Plots

  • Can we add a title (and maybe a description) to the top of the scatterplot to help orient users to the UI? Basically just helping people understand that each dot represents an image and is spatially organized by the UMAP projection of some set of weights/embeddings
  • Also could we add a title to both the loss and accuracy plots specifying this is the training profile for the most recent training cycle
