COResets and Data Subset selection

Reduce end-to-end training time from days to hours (or hours to minutes), and energy requirements and costs by an order of magnitude, using coresets and data selection.

In this README

What is CORDS?
Installation
- Installing via pip
- Installing from source
Documentation
Tutorials
Results
Publications

What is CORDS?

CORDS is a COReset and Data Selection library for making machine learning time, energy, cost, and compute efficient. CORDS is built on top of PyTorch. Deep learning systems today are extremely compute intensive, with long turnaround times, energy inefficiencies, high costs, and heavy resource requirements [1,2]. CORDS is an effort to make deep learning more energy, cost, resource, and time efficient without sacrificing accuracy. CORDS aims to achieve the following goals:

Data Efficiency

Reducing End to End Training Time

Reducing Energy Requirement

Faster Hyper-parameter tuning

Reducing Resource (GPU) Requirement and Costs

The primary purpose of CORDS is to select the right representative data subsets from massive datasets, and it does so iteratively. CORDS uses recent advances in data subset selection, in particular ideas from coresets and submodularity, to select such subsets. CORDS implements a number of state-of-the-art data subset selection and coreset algorithms. The algorithms currently implemented in CORDS include:

GLISTER [3]
GradMatch [4]
CRAIG [4,5]
SubmodularSelection [6,7,8] (Facility Location, Feature Based Functions, Coverage, Diversity)
RandomSelection
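
To give a feel for what submodular subset selection does, here is a minimal, pure-Python sketch of greedy facility location maximization, one of the function families listed above. It is an illustration only and does not use the CORDS API: the function names, the similarity kernel, and the toy data are all assumptions made for this example.

```python
from typing import List, Sequence


def similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Toy kernel (an assumption): decays with squared Euclidean distance."""
    dist_sq = sum((x - y) ** 2 for x, y in zip(a, b))
    return 1.0 / (1.0 + dist_sq)


def facility_location_greedy(points: List[Sequence[float]], budget: int) -> List[int]:
    """Greedily pick `budget` indices S to maximize sum_i max_{j in S} sim(i, j)."""
    n = len(points)
    sim = [[similarity(points[i], points[j]) for j in range(n)] for i in range(n)]
    best_cover = [0.0] * n  # best similarity of each point to the chosen set so far
    selected: List[int] = []
    for _ in range(min(budget, n)):
        best_gain, best_j = -1.0, -1
        for j in range(n):
            if j in selected:
                continue
            # Marginal gain: how much adding j improves the coverage of every point.
            gain = sum(max(sim[i][j] - best_cover[i], 0.0) for i in range(n))
            if gain > best_gain:
                best_gain, best_j = gain, j
        selected.append(best_j)
        best_cover = [max(best_cover[i], sim[i][best_j]) for i in range(n)]
    return selected


# Two well-separated clusters (indices 0-2 and 3-5), hypothetical data.
data = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]
subset = facility_location_greedy(data, budget=2)
```

With a budget of 2, the greedy procedure takes one representative from each cluster, because a second point from an already covered cluster adds almost no marginal gain. This quadratic-memory version is only meant for small examples.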

We are continuously incorporating newer and better algorithms into CORDS. Some of the features of CORDS include:

Reproducibility of SOTA in Data Selection and Coresets: Enables easy reproduction of the SOTA algorithms listed above. We are also working to add more algorithms, so if you have an algorithm you would like us to include, please let us know.
Benchmarking: We have benchmarked CORDS (and the algorithms present right now) on several datasets including CIFAR-10, CIFAR-100, MNIST, SVHN and ImageNet.
Ease of Use: One of the main goals of CORDS is to be easy to use and easy to extend. Feel free to contribute to CORDS!
Modular design: The data selection algorithms are separate from the training loop, which keeps the design modular and lets the same algorithms serve varied scenarios.
Broad range of use cases: CORDS is currently implemented for simple image classification tasks and hyperparameter tuning, but we are working on integrating additional use cases such as object detection, speech recognition, semi-supervised learning, and AutoML.
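
The separation between selection and training described under "Modular design" can be sketched roughly as follows. This is a toy, pure-Python illustration and not the CORDS API: the `Selector` interface, both selector functions, and the `train` parameters (`select_every`, `fraction`) are hypothetical names chosen for this example. The model is a single weight fitted to noiseless data generated as y = 3x.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[float, float]  # (input x, target y)
# Hypothetical interface: (data, budget, current weight) -> chosen indices.
Selector = Callable[[List[Example], int, float], List[int]]

_rng = random.Random(0)  # seeded so the sketch is reproducible


def random_selector(data: List[Example], budget: int, weight: float) -> List[int]:
    """Baseline strategy: ignores the model and samples indices uniformly."""
    return _rng.sample(range(len(data)), budget)


def hardest_selector(data: List[Example], budget: int, weight: float) -> List[int]:
    """Model-dependent strategy: keep the examples with the largest current loss."""
    def loss(i: int) -> float:
        x, y = data[i]
        return (weight * x - y) ** 2
    return sorted(range(len(data)), key=loss, reverse=True)[:budget]


def train(data: List[Example], selector: Selector, epochs: int = 20,
          select_every: int = 5, fraction: float = 0.3, lr: float = 0.05) -> float:
    """Plain SGD on a subset that is re-selected every `select_every` epochs."""
    weight = 0.0
    budget = max(1, int(fraction * len(data)))
    subset = list(range(len(data)))
    for epoch in range(epochs):
        if epoch % select_every == 0:
            # The loop only knows the Selector interface, not the strategy behind it.
            subset = selector(data, budget, weight)
        for i in subset:
            x, y = data[i]
            weight -= lr * 2.0 * (weight * x - y) * x  # gradient of squared error
    return weight


# Hypothetical noiseless dataset with true weight 3.0.
dataset = [(0.5 + i / 100, 3.0 * (0.5 + i / 100)) for i in range(100)]
w_random = train(dataset, random_selector)
w_hard = train(dataset, hardest_selector)
```

Swapping the strategy only changes the argument passed to `train`; the loop itself stays untouched, and each epoch visits 30% of the data instead of all of it.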

Installation

To install the latest version of the CORDS package with pip (from the TestPyPI index):

pip install -i https://test.pypi.org/simple/ cords

To install from source:

git clone https://github.com/decile-team/cords.git
cd cords
pip install -r requirements/requirements.txt

Documentation

Learn more about CORDS at our documentation.

Tutorials

Here are some tutorials to get you started with CORDS.

Results

The link below opens the Jupyter notebook for the CIFAR10 timing analysis experiments.

Open In Colab CIFAR10 Notebook

Results are obtained by running each dataset with different strategies for 300 epochs. The following plots show relative test error versus speedup for the different strategies. Currently we see between 3x and 7x improvements in energy and runtime with around a 1-2% drop in accuracy. We expect to push the accuracy-speedup (or energy savings) frontier even further over time!
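
For readers who want to place their own runs on such a plot, the two quantities can be computed as in the short sketch below. The numbers are hypothetical and are not CORDS results. The sketch also assumes that the accuracy trade-off is measured as the difference in accuracy (in percentage points) between full-data training and subset training; the plots may define relative test error differently.

```python
def speedup(full_time: float, subset_time: float) -> float:
    """How many times faster subset training is than full-data training."""
    return full_time / subset_time


def accuracy_drop(full_acc: float, subset_acc: float) -> float:
    """Accuracy lost by training on a subset, in percentage points."""
    return full_acc - subset_acc


# Hypothetical measurements, for illustration only.
full_hours, full_accuracy = 10.0, 95.0
subset_hours, subset_accuracy = 2.0, 93.5

print(f"speedup: {speedup(full_hours, subset_hours):.1f}x")
print(f"accuracy drop: {accuracy_drop(full_accuracy, subset_accuracy):.1f} points")
```

The same ratio applies to energy measurements if energy per run is recorded instead of wall-clock time.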

CIFAR10

CIFAR100

MNIST

SVHN

ImageNet

Publications

[1] Roy Schwartz, et al. Green AI. arXiv preprint arXiv:1907.10597, 2019.

[2] Emma Strubell, Ananya Ganesh, and Andrew McCallum. Energy and Policy Considerations for Deep Learning in NLP. In ACL, 2019.

[3] Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer. GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning. In 35th AAAI Conference on Artificial Intelligence (AAAI), 2021.

[4] Krishnateja Killamsetty, Durga Sivasubramanian, Abir De, Ganesh Ramakrishnan, Baharan Mirzasoleiman, and Rishabh Iyer. Grad-Match: A Gradient Matching based Data Selection Framework for Efficient Learning. To appear in International Conference on Machine Learning (ICML), 2021.

[5] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for Data-efficient Training of Machine Learning Models. In International Conference on Machine Learning (ICML), 2020.

[6] Kai Wei, Rishabh Iyer, and Jeff Bilmes. Submodularity in Data Subset Selection and Active Learning. In International Conference on Machine Learning (ICML), 2015.

[7] Vishal Kaushal, Rishabh Iyer, Suraj Kothawade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan. Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision. In 7th IEEE Winter Conference on Applications of Computer Vision (WACV), Hawaii, USA, 2019.

[8] Kai Wei, et al. Submodular Subset Selection for Large-Scale Speech Training Data. In IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014.
