Git Product home page Git Product logo

cords's Introduction


            

COResets and Data Subset selection

GitHub Decile Documentation GitHub Stars GitHub Forks

Reduce end to end training time from days to hours (or hours to minutes), and energy requirements/costs by an order of magnitude using coresets and data selection.

In this README

What is CORDS?

CORDS is COReset and Data Selection library for making machine learning time, energy, cost, and compute efficient. CORDS is built on top of pytorch. Deep Learning systems are extremely compute intensive today with large turn around times, energy inefficiencies, higher costs and resourse requirements [1,2]. CORDS is an effort to make deep learning more energy, cost, resource and time efficient while not sacrificing accuracy. The following are the goals CORDS tries to achieve:

Data Efficiency

Reducing End to End Training Time

Reducing Energy Requirement

Faster Hyper-parameter tuning

Reducing Resource (GPU) Requirement and Costs

The primary purpose of CORDS is to select the right representative data subsets from massive datasets, and it does so iteratively. CORDS uses some recent advances in data subset selection and particularly, ideas of coresets and submodularity select such subsets. CORDS implements a number of state of the art data subset selection algorithms and coreset algorithms. Some of the algorithms currently implemented with CORDS include:

We are continuously incorporating newer and better algorithms into CORDS. Some of the features of CORDS includes:

  • Reproducability of SOTA in Data Selection and Coresets: Enable easy reproducability of SOTA described above. We are trying to also add more algorithms so if you have an algorithm you would like us to include, please let us know,.
  • Benchmarking: We have benchmarked CORDS (and the algorithms present right now) on several datasets including CIFAR-10, CIFAR-100, MNIST, SVHN and ImageNet.
  • Ease of Use: One of the main goals of CORDS is that it is easy to use and add to CORDS. Feel free to contribute to CORDS!
  • Modular design: The data selection algorithms are separate from the training loop, thereby enabling modular design and also varied scenarios of utility.
  • Broad number of usecases: CORDS is currently implemented for simple image classification tasks and hyperparameter tuning, but we are working on integrating a number of additional use cases like object detection, speech recognition, semi-supervised learning, Auto-ML, etc.

Installation

  1. To install latest version of CORDS package using PyPI:

    pip install -i https://test.pypi.org/simple/ cords
  2. To install using source:

    git clone https://github.com/decile-team/cords.git
    cd cords
    pip install -r requirements/requirements.txt

Documentation

Learn more about CORDS at our documentation.

Tutorials

Here are some tutorials to get you started with CORDS.

Results

The below link contains the jupyter notebook link for cifar10 timing analysis experiments

Open In Colab CIFAR10 Notebook

Results are obtained by running each dataset with different strategies for 300 epochs. The following experimental plots shows the relative test error vs speed up for different strategies. Currently we see between 3x to 7x improvements in energy and runtime with around 1 - 2% drop in accuracy. We expect to push the accuracy-speedup (or energy savings) frontier even more over time!

CIFAR10



CIFAR100



MNIST



SVHN



ImageNet



Publications

[1] Schwartz, Roy, et al., Green AI, arXiv preprint arXiv:1907.10597 (2019).

[2] Strubell, Emma, Ananya Ganesh, and Andrew McCallum, Energy and Policy Considerations for Deep Learning in NLP, In ACL 2019.

[3] Krishnateja Killamsetty, Durga Sivasubramanian, Ganesh Ramakrishnan, and Rishabh Iyer, GLISTER: Generalization based Data Subset Selection for Efficient and Robust Learning, 35th AAAI Conference on Artificial Intelligence, AAAI 2021

[4] Krishnateja Killamsetty, Durga Sivasubramanian, Abir De, Ganesh Ramakrishnan, Baharan Mirzasoleiman, Rishabh Iyer, Grad-Match: A Gradient Matching based Data Selection Framework for Efficient Learning, To Appear in International Conference on Machine Learning (ICML) 2021

[5] Baharan Mirzasoleiman, Jeff Bilmes, and Jure Leskovec. Coresets for Data-efficient Training of Machine Learning Models. In International Conference on Machine Learning (ICML), July 2020

[6] Kai Wei, Rishabh Iyer, Jeff Bilmes, Submodularity in Data Subset Selection and Active Learning, International Conference on Machine Learning (ICML) 2015

[7] Vishal Kaushal, Rishabh Iyer, Suraj Kothiwade, Rohan Mahadev, Khoshrav Doctor, and Ganesh Ramakrishnan, Learning From Less Data: A Unified Data Subset Selection and Active Learning Framework for Computer Vision, 7th IEEE Winter Conference on Applications of Computer Vision (WACV), 2019 Hawaii, USA

[8] Wei, Kai, et al. Submodular subset selection for large-scale speech training data, 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2014.

cords's People

Contributors

krishnatejakk avatar dheerajnbhat avatar rishabhk108 avatar dssresearch avatar pizhn avatar savan77 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.