
Comments (6)

Sentimentron avatar Sentimentron commented on June 17, 2024

I like this idea a lot, but we have to be mindful of the practicalities of checking lots of data into the tree. We could host the data in a separate repo and use a download script.

from golearn.

sjwhitworth avatar sjwhitworth commented on June 17, 2024

Yes, this is how we should do it. Only store code in the repo, and use a Go script to download all the data.


amitkgupta avatar amitkgupta commented on June 17, 2024

I've started writing a benchmarking suite, here's a quick update on philosophy, features, current status, caveats, and open questions.

Philosophy: the general idea is to have a suite of tests that stress the algorithms in the golearn library in a number of ways, establishing benchmarks for accuracy and speed. I want the tests to be highly decoupled from the implementation (so, e.g. for classifiers, it should only know how to create them, and then call Fit and Predict on them, and not much else) and also decoupled from the regular workflow on golearn (I don't want people to have to run slow tests or download large datasets to work on golearn). For those reasons, and also since it's a new project that's likely to see a lot of churn for now, it's a separate repo from golearn, but can be copy-pasted in later if keeping things in sync becomes painful.

Given that it's "out of the way", it still needs to provide value as a regression check against changes that hurt performance, and as a standard to decide whether new algorithm optimizations actually improve things. I imagine that the Travis build should go get and then run the tests in the benchmark suite, so that it serves its purpose as a regression suite.

Features: Structurally I plan for it to have a suite for classifiers, a suite for optimization algorithms, etc. Each suite will benchmark behaviour of some number of algorithms (whichever ones are implemented in golearn) against a common set of datasets for the suite. So each suite will consist of three main things: (a) datasets, (b) shared behaviours that make assertions about how an algorithm in the suite performs against a given dataset, and (c) concrete applications of the shared behaviours for the specific algorithms in golearn. One thing that will be nice is that anyone can use (a), and anyone writing an ML library who wraps their algorithms with something that implements the interfaces defined in golearn can even use (b). The idea here fits with the "decoupled" philosophy, namely that this project tries to solve "how do you benchmark an ML library, then apply it to golearn" rather than just "how do you benchmark golearn."

Current Status: I've only started on the Classifier suite, and the shared behaviours for that are done. I only have one basic dataset so far, and have only applied them to one algorithm. Adding more datasets and plugging in different classifiers will be easy now. Not sure yet what the next suite will be.

Caveat: It works against (the develop branch of) my fork of golearn. The only salient difference between that branch and master of this repo is Fit and Predict now include errors in their return signatures. I noticed that Fit would just hang sometimes if the input wasn't what it expected, so after fixing that it was clear Fit, and also Predict, should really be able to return errors.

Open Questions:

  • What's a good source for datasets?
  • What variety of datasets should we use? For the Classifier suite, my current thinking is to have (a) a basic dataset, (b) a very large dataset, (c) a dataset where the number of features is large, (d) a dataset with mixed-type features, (e) a dataset where the "boundaries" between classes are somewhat fuzzy.
  • What other suites can the algorithms be broken down into? Regressions, Classifiers, Optimizers, Clusterers, ...?


joshrendek avatar joshrendek commented on June 17, 2024

@amitkgupta http://www.quandl.com/ is free and has a ton of nice data available


Sentimentron avatar Sentimentron commented on June 17, 2024

This has been mentioned recently, but the libsvm project has some good datasets, though we can't read them yet.


amitkgupta avatar amitkgupta commented on June 17, 2024

Nice, the libsvm datasets look great. The table even shows the number of features, classes, total size of each dataset, etc. Exactly the kind of breakdown I was hoping for.

On Fri, Aug 22, 2014 at 6:45 AM, Richard Townsend wrote:

Has been mentioned recently but the libsvm project have some good datasets (http://www.csie.ntu.edu.tw/%7Ecjlin/libsvmtools/datasets/), though we can't read them yet.



