To ensure refactorings don't break performance To help look for optimiz



Add performance benchmarking for algorithms

sjwhitworth commented on June 17, 2024

Add performance benchmarking for algorithms

from golearn.

Comments (6)

Sentimentron commented on June 17, 2024

I like this idea a lot, but we have to be mindful of the practicalities of checking lots of data into the tree. We could host the data in a separate repo and use a download script.


sjwhitworth commented on June 17, 2024

Yes, this is how we should do it: store only code in the repo, and use a Go script to download the data.


amitkgupta commented on June 17, 2024

I've started writing a benchmarking suite. Here's a quick update on philosophy, features, current status, caveats, and open questions.

Philosophy: the general idea is to have a suite of tests that stress the algorithms in the golearn library in a number of ways, establishing benchmarks for accuracy and speed. I want the tests to be highly decoupled from the implementation (e.g. for classifiers, the suite should only know how to create them and then call Fit and Predict on them, and not much else) and also decoupled from the regular workflow on golearn (I don't want people to have to run slow tests or download large datasets to work on golearn). For those reasons, and because it's a new project that's likely to see a lot of churn for now, it lives in a separate repo from golearn, but it can be copy-pasted in later if keeping things in sync becomes painful.
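To make the decoupling concrete, here is a rough sketch in which the benchmark only receives a constructor and then calls Fit and Predict. The `Classifier` interface, `Result`, `Measure`, and the toy `majority` model are all hypothetical names for illustration; golearn's real types and signatures may differ.

```go
package main

import (
	"fmt"
	"time"
)

// Classifier is a hypothetical minimal interface; golearn's real
// interfaces may differ.
type Classifier interface {
	Fit(X [][]float64, y []string) error
	Predict(X [][]float64) ([]string, error)
}

// Result records the two quantities the suite cares about.
type Result struct {
	Accuracy float64
	Elapsed  time.Duration
}

// Measure builds a classifier via newC and then only calls Fit and
// Predict on it; it knows nothing else about the implementation.
func Measure(newC func() Classifier, trainX [][]float64, trainY []string, testX [][]float64, testY []string) (Result, error) {
	c := newC()
	start := time.Now()
	if err := c.Fit(trainX, trainY); err != nil {
		return Result{}, err
	}
	pred, err := c.Predict(testX)
	if err != nil {
		return Result{}, err
	}
	elapsed := time.Since(start)
	if len(pred) != len(testY) {
		return Result{}, fmt.Errorf("got %d predictions, want %d", len(pred), len(testY))
	}
	correct := 0
	for i := range pred {
		if pred[i] == testY[i] {
			correct++
		}
	}
	acc := 0.0
	if len(testY) > 0 {
		acc = float64(correct) / float64(len(testY))
	}
	return Result{Accuracy: acc, Elapsed: elapsed}, nil
}

// majority is a toy classifier that always predicts the most common
// training label. It exists only to exercise Measure.
type majority struct{ label string }

func (m *majority) Fit(X [][]float64, y []string) error {
	if len(y) == 0 {
		return fmt.Errorf("no training labels")
	}
	counts := map[string]int{}
	best := 0
	for _, l := range y {
		counts[l]++
		if counts[l] > best {
			best, m.label = counts[l], l
		}
	}
	return nil
}

func (m *majority) Predict(X [][]float64) ([]string, error) {
	out := make([]string, len(X))
	for i := range out {
		out[i] = m.label
	}
	return out, nil
}

func main() {
	X := [][]float64{{0}, {1}, {2}, {3}}
	y := []string{"a", "a", "a", "b"}
	res, err := Measure(func() Classifier { return &majority{} }, X, y, X, y)
	if err != nil {
		panic(err)
	}
	fmt.Printf("accuracy=%.2f elapsed=%v\n", res.Accuracy, res.Elapsed)
}
```

Passing a constructor rather than an instance means each measurement starts from a fresh, untrained model.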

Even though it's "out of the way", it still needs to provide value as a regression check against changes that hurt performance, and as a standard for deciding whether new algorithm optimizations actually improve things. I imagine that the Travis build should go get and then run the tests in the benchmark suite, so that it serves its purpose as a regression suite.

Features: Structurally, I plan for it to have a suite for classifiers, a suite for optimization algorithms, etc. Each suite will benchmark the behaviour of some number of algorithms (whichever ones are implemented in golearn) against a common set of datasets for the suite. So each suite will consist of three main things: (a) datasets, (b) shared behaviours that make assertions about how an algorithm in the suite performs against a given dataset, and (c) concrete applications of the shared behaviours for the specific algorithms in golearn. One thing that will be nice is that anyone can use (a), and anyone writing an ML library who wraps their algorithms with something that implements the interfaces defined in golearn can even use (b). The idea here fits with the "decoupled" philosophy, namely that this project tries to solve "how do you benchmark an ML library, then apply it to golearn" rather than just "how do you benchmark golearn."
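The three parts (a), (b), and (c) might be laid out roughly as follows. Every name here (`Dataset`, `Behaviour`, `ReachesAccuracy`, `Entry`, `RunSuite`, and the toy `signClassifier`) is hypothetical, and the `Classifier` interface is an assumption rather than golearn's actual API.

```go
package main

import "fmt"

// Classifier is a hypothetical interface; golearn's real one may differ.
type Classifier interface {
	Fit(X [][]float64, y []string) error
	Predict(X [][]float64) ([]string, error)
}

// (a) Dataset: a named, fixed train/test split.
type Dataset struct {
	Name          string
	TrainX, TestX [][]float64
	TrainY, TestY []string
}

// (b) Behaviour: an assertion that works for any classifier on any dataset.
type Behaviour func(newC func() Classifier, d Dataset) error

// ReachesAccuracy asserts that test accuracy is at least threshold.
func ReachesAccuracy(threshold float64) Behaviour {
	return func(newC func() Classifier, d Dataset) error {
		c := newC()
		if err := c.Fit(d.TrainX, d.TrainY); err != nil {
			return err
		}
		pred, err := c.Predict(d.TestX)
		if err != nil {
			return err
		}
		if len(pred) == 0 || len(pred) != len(d.TestY) {
			return fmt.Errorf("got %d predictions for %d test rows", len(pred), len(d.TestY))
		}
		correct := 0
		for i := range pred {
			if pred[i] == d.TestY[i] {
				correct++
			}
		}
		if acc := float64(correct) / float64(len(pred)); acc < threshold {
			return fmt.Errorf("accuracy %.2f below %.2f", acc, threshold)
		}
		return nil
	}
}

// (c) Entry: a concrete algorithm plugged into the shared behaviours.
type Entry struct {
	Name string
	New  func() Classifier
}

// RunSuite applies every behaviour to every (algorithm, dataset) pair.
func RunSuite(entries []Entry, datasets []Dataset, behaviours []Behaviour) []string {
	var failures []string
	for _, e := range entries {
		for _, d := range datasets {
			for _, b := range behaviours {
				if err := b(e.New, d); err != nil {
					failures = append(failures, fmt.Sprintf("%s on %s: %v", e.Name, d.Name, err))
				}
			}
		}
	}
	return failures
}

// signClassifier is a toy model: it ignores training and labels each
// row by the sign of its first feature.
type signClassifier struct{}

func (signClassifier) Fit(X [][]float64, y []string) error { return nil }

func (signClassifier) Predict(X [][]float64) ([]string, error) {
	out := make([]string, len(X))
	for i, row := range X {
		if len(row) == 0 {
			return nil, fmt.Errorf("row %d has no features", i)
		}
		out[i] = "neg"
		if row[0] > 0 {
			out[i] = "pos"
		}
	}
	return out, nil
}

func main() {
	toy := Dataset{Name: "toy", TestX: [][]float64{{-2}, {3}}, TestY: []string{"neg", "pos"}}
	entries := []Entry{{Name: "sign", New: func() Classifier { return signClassifier{} }}}
	fails := RunSuite(entries, []Dataset{toy}, []Behaviour{ReachesAccuracy(0.9)})
	fmt.Println(len(fails), "failure(s)")
}
```

Because (b) depends only on the interface, another library could reuse the behaviours by supplying its own `Entry` values.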

Current Status: I've only started on the Classifier suite, and the shared behaviours for that are done. I only have one basic dataset so far, and have only applied them to one algorithm. Adding more datasets and plugging in different classifiers will be easy now. Not sure yet what the next suite will be.

Caveat: It works against (the develop branch of) my fork of golearn. The only salient difference between that branch and master of this repo is that Fit and Predict now include errors in their return signatures. I noticed that Fit would just hang sometimes if the input wasn't what it expected; after fixing that, it was clear that Fit, and also Predict, should really be able to return errors.

Open Questions:

What's a good source for datasets?
What variety of datasets should we use? For the Classifier suite, my current thinking is to have (a) a basic dataset, (b) a very large dataset, (c) a dataset where the number of features is large, (d) a dataset with mixed-type features, (e) a dataset where the "boundaries" between classes are somewhat fuzzy.
What other suites can the algorithms be broken down into? Regressions, Classifiers, Optimizers, Clusterers, ...?
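For category (e), one option alongside real-world data is a synthetic dataset whose class overlap is controlled by a parameter. The sketch below is an assumption added for illustration, not something proposed in this thread: `TwoBlobs` and `WrongSide` are hypothetical names, and the two classes are unit-variance 2-D Gaussians whose centres are `sep` apart on the x axis.

```go
package main

import (
	"fmt"
	"math/rand"
)

// TwoBlobs draws n points per class from unit-variance 2-D Gaussians
// centred at (-sep/2, 0) and (+sep/2, 0). A smaller sep gives a
// fuzzier boundary. The seed makes the dataset reproducible.
func TwoBlobs(n int, sep float64, seed int64) ([][]float64, []string) {
	rng := rand.New(rand.NewSource(seed))
	X := make([][]float64, 0, 2*n)
	y := make([]string, 0, 2*n)
	for i := 0; i < n; i++ {
		X = append(X, []float64{-sep/2 + rng.NormFloat64(), rng.NormFloat64()})
		y = append(y, "neg")
		X = append(X, []float64{sep/2 + rng.NormFloat64(), rng.NormFloat64()})
		y = append(y, "pos")
	}
	return X, y
}

// WrongSide counts points lying on the opposite side of x = 0 from
// their own class centre, a crude measure of how fuzzy the boundary is.
func WrongSide(X [][]float64, y []string) int {
	n := 0
	for i, p := range X {
		if (y[i] == "pos") != (p[0] > 0) {
			n++
		}
	}
	return n
}

func main() {
	for _, sep := range []float64{6, 2, 0.5} {
		X, y := TwoBlobs(500, sep, 1)
		fmt.Printf("sep=%.1f: %d of %d points on the wrong side\n", sep, WrongSide(X, y), len(X))
	}
}
```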


joshrendek commented on June 17, 2024

@amitkgupta http://www.quandl.com/ is free and has a ton of nice data available


Sentimentron commented on June 17, 2024

This has been mentioned recently, but the libsvm project has some good datasets, though we can't read them yet.
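For reference, the libsvm text format stores one example per line as a label followed by sparse `index:value` pairs, with indices conventionally starting at 1. A reader for a single line could look roughly like this; `Example` and `ParseLine` are hypothetical names, and the sketch handles only single numeric labels, not the comma-separated labels used by multi-label datasets.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// Example is one parsed row: a numeric label plus sparse features
// keyed by the feature index as written in the file.
type Example struct {
	Label    float64
	Features map[int]float64
}

// ParseLine parses a line of the form "<label> <index>:<value> ...".
func ParseLine(line string) (Example, error) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return Example{}, fmt.Errorf("empty line")
	}
	label, err := strconv.ParseFloat(fields[0], 64)
	if err != nil {
		return Example{}, fmt.Errorf("bad label %q: %v", fields[0], err)
	}
	ex := Example{Label: label, Features: make(map[int]float64, len(fields)-1)}
	for _, f := range fields[1:] {
		parts := strings.SplitN(f, ":", 2)
		if len(parts) != 2 {
			return Example{}, fmt.Errorf("bad feature %q: want index:value", f)
		}
		idx, err := strconv.Atoi(parts[0])
		if err != nil {
			return Example{}, fmt.Errorf("bad index in %q: %v", f, err)
		}
		if idx < 0 {
			return Example{}, fmt.Errorf("negative index in %q", f)
		}
		val, err := strconv.ParseFloat(parts[1], 64)
		if err != nil {
			return Example{}, fmt.Errorf("bad value in %q: %v", f, err)
		}
		ex.Features[idx] = val
	}
	return ex, nil
}

func main() {
	ex, err := ParseLine("+1 1:0.5 3:2")
	if err != nil {
		panic(err)
	}
	fmt.Println(ex.Label, ex.Features)
}
```

Features absent from a line are implicitly zero, so a dense representation would need the total feature count from elsewhere, for example the maximum index seen across the file.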


amitkgupta commented on June 17, 2024

Nice, the libsvm datasets look great. The table even shows the number of features, classes, total size of each dataset, etc. Exactly the kind of breakdown I was hoping for.

On Fri, Aug 22, 2014 at 6:45 AM, Richard Townsend wrote:

Has been mentioned recently but the libsvm project have some good datasets (http://www.csie.ntu.edu.tw/%7Ecjlin/libsvmtools/datasets/), though we can't read them yet.
