
automl_benchmark's Introduction

The Big Auto-ML Showdown

[Images: automated pipeline, interactive dashboard]

Machine learning methods are often seen as black boxes which are difficult to understand. Automated machine learning frameworks tend to take these boxes, put them into a dark room, lock the door, and run away (watch out for sarcasm!). So instead of improving interpretability directly, let's conduct a benchmark in a meaningful way in order to learn more about them.

There have been some approaches in this direction already. See, e.g., the AutoML Benchmark, the work of STATWORX, and the benchmark accompanying the Auto-Sklearn publication. However, they all fall short in two respects: the underlying pipeline system used to set up the benchmark, and the capabilities for exploring the results.

Here, we build a sustainable (i.e. reproducible, adaptable, and transparent) workflow to automatically benchmark a multitude of models against a diverse set of data. The models are existing auto-ML frameworks which each have their own advantages and disadvantages related to ease-of-use, execution speed, and predictive performance. All of these features will become apparent in this benchmark. The datasets try to be as representative as possible and cover a wide range of applications. They thus serve as a reasonable playground for the aforementioned models. Finally, the results are displayed in an interactive dashboard which allows an in-depth exploration of the generated performance evaluation.

Check out the screencast (for LauzHack2020).

Resources

Models

  • XGBoost: optimized distributed gradient boosting (this will serve as a baseline to calibrate the results)
  • auto-sklearn: "extend Auto-sklearn with a new, simpler meta-learning technique"
  • PyCaret: "end-to-end machine learning and model management tool"
  • TPOT: "optimizes machine learning pipelines using genetic programming"

Datasets

  • test_dataset: just a dummy dataset to make sure everything works
  • iris: classic and very easy multi-class classification dataset
  • titanic: another classic dataset (classification of passenger survival)
  • MAGIC Gamma Telescope: classification of high-energy gamma particles

Usage

Execute the following command to run all models on all datasets:

$ snakemake -j 1 -pr --use-conda

Afterwards, execute python results/dashboard.py to enter an interactive dashboard for exploring the results.

Adding new datasets

To add a new dataset, you simply need to add a single CSV file to ./resources/datasets/{dataset}.csv. The response variable/column needs to be prefixed with target__, and there has to be exactly one such column. All other columns are treated as covariates.
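
Before dropping a new CSV file into the datasets folder, it can help to check the column convention described above. The following is only a minimal sketch using the Python standard library. The helper names (`read_header`, `find_target_column`) and the example columns are hypothetical and not part of this repository.

```python
import csv
import io

TARGET_PREFIX = "target__"


def read_header(csv_text):
    """Return the column names from the first row of a CSV string."""
    return next(csv.reader(io.StringIO(csv_text)))


def find_target_column(header):
    """Return the single column whose name starts with the target prefix.

    Raises ValueError unless exactly one such column exists.
    """
    targets = [name for name in header if name.startswith(TARGET_PREFIX)]
    if len(targets) != 1:
        raise ValueError(
            f"expected exactly one '{TARGET_PREFIX}' column, found {len(targets)}"
        )
    return targets[0]


# Hypothetical example content for ./resources/datasets/{dataset}.csv
example_csv = (
    "sepal_length,sepal_width,target__species\n"
    "5.1,3.5,setosa\n"
    "6.2,2.9,versicolor\n"
)

header = read_header(example_csv)
target = find_target_column(header)
# Every column other than the target is treated as a covariate.
covariates = [name for name in header if name != target]
print("target:", target)
print("covariates:", covariates)
```

To check a file on disk, read its text first and pass it to `read_header`.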

Adding new models

A new model can be added by implementing its training procedure and putting the script into ./resources/models/{model}.py. Each script consists of a single main function which takes X_train and y_train as input.
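
As an illustration, a model script could look roughly like the sketch below. It uses a trivial majority-class baseline so that it runs without any extra dependencies. The text above only states that `main` takes `X_train` and `y_train`. That `main` returns a fitted object with a `predict` method is an assumption made here, and the class name `MajorityClassModel` is hypothetical.

```python
from collections import Counter


class MajorityClassModel:
    """Toy model that always predicts the most frequent training label."""

    def __init__(self, label):
        self.label = label

    def predict(self, X):
        # One prediction per input row, regardless of the feature values.
        return [self.label for _ in X]


def main(X_train, y_train):
    """Train the model on the given covariates and response.

    Assumption: the benchmark expects a fitted object with a `predict`
    method as the return value; the original text does not specify this.
    """
    most_common_label, _count = Counter(y_train).most_common(1)[0]
    return MajorityClassModel(most_common_label)


if __name__ == "__main__":
    model = main([[5.1, 3.5], [6.2, 2.9], [4.9, 3.0]], ["a", "b", "a"])
    print(model.predict([[5.0, 3.4]]))
```

A real script would replace the baseline with a call into the auto-ML framework being benchmarked, keeping the same `main(X_train, y_train)` entry point.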

automl_benchmark's People

Contributors

kpj

