sjwhitworth / golearn
Machine Learning for Go
License: MIT License
I am currently trying to take natively typed data in a stream processing system and make predictions on it. All of the current examples only show how to create instances from CSV data, and the one example that shows how to create instances directly only works by converting string data to float64. I already have a map[string]float64 for all of my data; I want to put it into an instance and make predictions based on previously learned data.
Any help would be appreciated, if there is no current way to do this I would love to do the work necessary to support this.
Call me crazy, but `BatchGradientDescent` doesn't find the min, nor the argmin, as the `GradientDescent` part of the function name and the `optimisation` package name would suggest. It also doesn't do parameter estimation as the tests suggest. I actually don't know what "parameter estimation" even means in this context, but I'm guessing that, assuming y = a1*x1 + a2*x2 + ... + an*xn (a dot product), it attempts to figure out what the linear coefficients a1, ..., an are, given a bunch of observed values y and observed tuples (x1, ..., xn).
Can someone enlighten me on what this code does and/or is supposed to do?
Could you give me an example or an easy method to create instances for test data? I was able to follow the main example on the landing page to do a train-test split on data similar to the below:
10,hello,hello
9,helow,hello
8,mahalo,mahalo
12,helo,hello
7,hallo,hello
5,halo,hello
11,hellow,hello
8,mhalo,mahalo
12,mehalo,mahalo
But how do I now create a test data instance of something like below that I can pass to a Predict function?
10,melo
Is there a way to add it in a way similar to ParseCSVToInstances? Maybe a ParseStringToPredictInstance? Or ParseCSVToPredictInstance?
When training a random forest, I see vastly different, and poorer, results using golearn than using Python's scikit-learn. Unfortunately the dataset is confidential, so I can't share it here. However, I'm using the same train/test split, and have ensured that the data is represented the same way (all floats in both).
When using scikit-learn:
Auc: 0.943867958106
Confusion Matrix
[[35878 1876]
[ 5402 16388]]
precision recall f1-score support
0 0.87 0.95 0.91 37754
1 0.90 0.75 0.82 21790
avg / total 0.88 0.88 0.88 59544
When using golearn, with the same number of estimators:
Reference Class True Positives False Positives True Negatives Precision Recall F1 Score
--------------- -------------- --------------- -------------- --------- ------ --------
1.00 4199 1366 15858 0.7545 0.4263 0.5448
0.00 15858 5651 4199 0.7373 0.9207 0.8188
Overall accuracy: 0.7408
As you can see, there's a big drop in precision and recall, on both outcomes. Any ideas as to what could be the problem @Sentimentron ?
The installation README has instructions related to C dependencies for SUSE Linux which are missing for Ubuntu. The particular omissions are:

- `gonum/blas/cblas.go`. (After following the instructions to install OpenBLAS, you need to change the line `#cgo linux LDFLAGS: -lcblas` to `#cgo linux LDFLAGS: -openblas`, and may need to include a `-L/path/to/OpenBLAS` if it's installed to some non-standard location.)
- `liblinear`, which you can install with `sudo apt-get install liblinear-dev`.

I would create a PR myself with these additional instructions, BUT then `go get ./...` in the root of this project still fails, yielding the following error:

linear_models/liblinear.go:55: cannot use &c_y[0] (type *C.double) as type *C.int in assignment

Casting `c_y[0]` to a `C.int` "fixes" it, and `go test ./...` passes. Is the `liblinear` I'm `apt-get`ting somehow different from the library built from source as per the SUSE Linux instructions?
I was wondering if it would be better to focus on a particular class of computationally intensive machine learning techniques (dimensionality reduction, neural networks, Fourier transforms, etc.) rather than re-writing all the algorithms by hand. My take is that if we just use a source-to-source compiler (Python to Go, or C++ to Go), we might be able to save some time. The team can then just do some tests.
nnet implements some more exotic neural network architectures than those we currently support, so I think integration is logical.
It's a little bit of a mess at the moment in terms of duplication of code. I'm going to clean up and send a pull request. I'm working on the interfaces branch if you'd like to take a look.
Hi everyone. I'd like to formalise what features we want for a v0.1 release. What I mean by this is: the first version of GoLearn that is nearly ready for production use externally. We'll learn much more when it's in the hands of users. Docs need to be improved substantially, and we need a few more implementations of algorithms.
What does everyone think?
cc: @ifesdjeen @npbool @macmania @lazywei @marcoseravalli
When installing everything from scratch on my Raspberry Pi, I get this error when trying to run the tests.
# github.com/riobard/go-mmap
../riobard/go-mmap/mmap_linux.go:8: undefined: syscall.MAP_32BIT
../riobard/go-mmap/mmap_linux.go:8: const initializer (<T>)(syscall.MAP_32BIT) is not a constant
../riobard/go-mmap/mmap_linux.go:17: undefined: syscall.MAP_STACK
../riobard/go-mmap/mmap_linux.go:17: const initializer (<T>)(syscall.MAP_STACK) is not a constant
../riobard/go-mmap/mmap_linux.go:18: undefined: syscall.MAP_HUGETLB
../riobard/go-mmap/mmap_linux.go:18: const initializer (<T>)(syscall.MAP_HUGETLB) is not a constant
Any ideas @Sentimentron?
Hi. In the base.GetClass() function, the return type is set to string. This implies that all class labels will be strings. Is this always the case? I think it would be better to have this as an interface{} where the user can determine the type at learning time.
Is there any specific reason why this is set to string?
I've implemented this before, but only for a binary case. It requires some kind of spatial indexing, which we don't currently have support for either.
Currently golearn uses fmt.Println liberally to let you know what's going on. While this is nice for small training sets, it's over the top and completely unhelpful for larger sets.
For example, I'm training on a data set with 100,000 rows and categorical features with very high cardinality. This results in an absurdly huge amount of console spam. Additionally, writing this much to the screen can actually slow down program execution significantly.
One way to solve this would be to use a log.Logger to do all outputs. By default, set up the logger using stdout, but allow users to configure the logger to write to a file or write to dev/null to essentially turn logging off.
I may be able to submit a proper pull request, but wanted to post first to see if other people want this feature. What do you think?
Discussed in #72: LibSVM has some really awesome and large datasets that might be good for benchmarking, and might also be crucial for people wanting to try out golearn. We probably need a new FixedDataGrid which stores things in the LibSVM memory format, as well as the parsing functions.
OK, I've just been hitting my head against merging a high-resolution `DataFrame` with a low-resolution time `Series` in pandas, and it sucked harder than a dying star. I think we can do better.
Lots of ML applications involve predicting one or more dependent variables based on past measurements, usually at a given timestamp. Sometimes, these are totally linear and regular but often they aren't and I don't want to have to write bespoke interpolation functions, merging functions etc to get the job done. Sometimes I want to process stock data, where I've got measurements at defined dates and time, and sometimes I don't care what time really means.
I propose to add a new interface, `TimeGrid`, to `base`, which will look like this:
type TimeGrid interface {
	FixedDataGrid
	SetInterpolationMethod(InterpolationMethod)
	AtTime(int64, AttributeSpec) []byte
	SetTime(int64, AttributeSpec, []byte)
}
as well as two implementations of `TimeGrid`: `AbsoluteTimeSeries` and `RelativeTimeSeries`. They will both function as `Instances` do, but will always add a mandatory time `Attribute` (called `Time`): a `TimeAttribute` for `AbsoluteTimeSeries`, and a new `IntAttribute` for `RelativeTimeSeries`. Rows will be accessed in the order implied by those `Attributes`. Accesses between recorded time positions will trigger interpolation of any `FloatAttributes` using an interpolation function (currently expected to be `Nearest` and `Linear`).
Additionally, I want to write new filters to resample time series, detrend them, apply convolutions to them, and combine them with other time series data.
Again, a very useful method.
Apparently github.com/gonum/blas/cblas was recently changed to github.com/gonum/blas/cblas128. This, by the way, breaks the `go get -u -t ./...` command:
package github.com/sjwhitworth/golearn
imports github.com/sjwhitworth/golearn/base
imports github.com/gonum/blas
imports github.com/gonum/blas/blas64
imports github.com/gonum/blas/native
imports github.com/gonum/internal/asm
imports github.com/gonum/matrix/mat64
imports github.com/smartystreets/goconvey/convey
imports github.com/jtolds/gls
imports github.com/smartystreets/goconvey/convey/assertions
imports github.com/smartystreets/goconvey/convey/assertions/oglematchers
imports github.com/smartystreets/goconvey/convey/gotest
imports github.com/smartystreets/goconvey/convey/reporting
imports github.com/sjwhitworth/golearn/ensemble
imports github.com/gonum/blas/cblas
imports github.com/gonum/blas/cblas
imports github.com/gonum/blas/cblas: cannot find package "github.com/gonum/blas/cblas" in any of:
/usr/local/Cellar/go/1.4/libexec/src/github.com/gonum/blas/cblas (from $GOROOT)
/Users/mathDR/gocode/src/github.com/gonum/blas/cblas (from $GOPATH)
Please run `gofmt` on your source files. It will, for example, eliminate crazy inconsistent indentation. I recommend a simple git `pre-commit` hook:
#!/bin/sh
# Redirect output to stderr.
exec 1>&2
files="*.go */*.go"
nofmted=$(gofmt -l $files)
if [ $(echo "$nofmted" | wc -w) != 0 ]; then
echo "Some files are not gofmt'd:"
for f in $nofmted; do
echo $f
done
exit 1
fi
This will prevent you from committing Go code that isn't `gofmt`'d. Depending on your editor, there may be tools that can `gofmt` your code automatically when you save.
We still don't have it and we do need it. #90 offers a candidate API, where you call the function with a classifier and it gives you back k confusion matrices that you can feed to an evaluation function. Things we need to consider are:
I've started writing some documentation for the direction of the project. Please contribute your ideas when you get a minute. You'll have to send me your email so I can give you edit rights.
https://docs.google.com/a/hailocab.com/document/d/1x21Y-g1rga0LTwC_LnKHi0y7RjFzd2Il7YB47rp7kTA/edit
Would this be interesting / fit the roadmap? What do you guys think?
Hello! Does anyone want to do a google hangout sometime next week?
I'm new to this open source thing and I don't know how other groups do it, but it would be awesome to have a Google hangout just to check in on how the components are going and to see which parts are good to go, which ones need to be fully tested, and so on.
p.s. I still have yet to implement my part! So sorry about that, I got caught up on exams + a golang web app project. I should have more energy and time to spend on neural-nets and svms :)
If I wrap the code inside `TestRandomForest1` inside a 10-iteration for loop, I get the following panic:
panic: cannot allocate memory
I'm running this on an m3.2xlarge EC2 instance (Linux 3.13.0-32-generic x86_64). The first 8 or so iterations are likely to succeed, the problem almost always occurs when iterating 10 or more times.
Other things to note: if I increase the forest size to something like 50, then 2-3 iterations suffice to cause the panic. Also, if I remove the call to `rf.Predict(testData)` (and all subsequent code depending on the result of that `Predict`), then the panics do not occur.
Here's the full output, starting from the panic (not including the output of the first few iterations that succeed):
panic: cannot allocate memory
goroutine 184 [running]:
runtime.panic(0x5851e0, 0xc)
/home/ubuntu/.gvm/gos/go1.3/src/pkg/runtime/panic.c:279 +0xf5
github.com/sjwhitworth/golearn/base.NewDenseInstances(0xc2083f6b00)
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/base/dense.go:29 +0x67
github.com/sjwhitworth/golearn/base.GeneratePredictionVector(0x7fd6cf13a9f8, 0xc2083f6b00, 0x0, 0x0)
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/base/util_instances.go:16 +0xa3
github.com/sjwhitworth/golearn/trees.(*DecisionTreeNode).Predict(0xc2082189b0, 0x7fd6cf13a9f8, 0xc2083f6b00, 0x0, 0x0)
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/trees/id3.go:199 +0xad
github.com/sjwhitworth/golearn/trees.(*ID3DecisionTree).Predict(0xc20807bc40, 0x7fd6cf13a9f8, 0xc2083f6b00, 0x0, 0x0)
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/trees/id3.go:278 +0x51
github.com/sjwhitworth/golearn/meta.func·004()
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:143 +0x140
created by github.com/sjwhitworth/golearn/meta.(*BaggedModel).Predict
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:149 +0x25b
goroutine 16 [chan receive]:
testing.RunTests(0x601658, 0x69c010, 0x1, 0x1, 0x575401)
/home/ubuntu/.gvm/gos/go1.3/src/pkg/testing/testing.go:505 +0x923
testing.Main(0x601658, 0x69c010, 0x1, 0x1, 0x6a8700, 0x0, 0x0, 0x6a8700, 0x0, 0x0)
/home/ubuntu/.gvm/gos/go1.3/src/pkg/testing/testing.go:435 +0x84
main.main()
github.com/sjwhitworth/golearn/ensemble/_test/_testmain.go:47 +0x9c
goroutine 19 [finalizer wait]:
runtime.park(0x413090, 0x6a4ef8, 0x6a3a09)
/home/ubuntu/.gvm/gos/go1.3/src/pkg/runtime/proc.c:1369 +0x89
runtime.parkunlock(0x6a4ef8, 0x6a3a09)
/home/ubuntu/.gvm/gos/go1.3/src/pkg/runtime/proc.c:1385 +0x3b
runfinq()
/home/ubuntu/.gvm/gos/go1.3/src/pkg/runtime/mgc0.c:2644 +0xcf
runtime.goexit()
/home/ubuntu/.gvm/gos/go1.3/src/pkg/runtime/proc.c:1445
goroutine 20 [runnable]:
github.com/sjwhitworth/golearn/meta.(*BaggedModel).Predict(0xc208818380, 0x7fd6cf13a9f8, 0xc208818340, 0x0, 0x0)
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:154 +0x2cf
github.com/sjwhitworth/golearn/ensemble.(*RandomForest).Predict(0xc20807bc20, 0x7fd6cf13a9f8, 0xc208818340, 0x0, 0x0)
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/ensemble/randomforest.go:45 +0x51
github.com/sjwhitworth/golearn/ensemble.TestRandomForest1(0xc208050090)
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/ensemble/randomforest_test.go:29 +0x1a3
testing.tRunner(0xc208050090, 0x69c010)
/home/ubuntu/.gvm/gos/go1.3/src/pkg/testing/testing.go:422 +0x8b
created by testing.RunTests
/home/ubuntu/.gvm/gos/go1.3/src/pkg/testing/testing.go:504 +0x8db
goroutine 191 [runnable]:
github.com/sjwhitworth/golearn/meta.func·004()
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:138
created by github.com/sjwhitworth/golearn/meta.(*BaggedModel).Predict
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:149 +0x25b
goroutine 190 [runnable]:
github.com/sjwhitworth/golearn/meta.func·004()
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:138
created by github.com/sjwhitworth/golearn/meta.(*BaggedModel).Predict
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:149 +0x25b
goroutine 189 [runnable]:
github.com/sjwhitworth/golearn/meta.func·004()
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:138
created by github.com/sjwhitworth/golearn/meta.(*BaggedModel).Predict
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:149 +0x25b
goroutine 188 [runnable]:
github.com/sjwhitworth/golearn/meta.func·004()
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:138
created by github.com/sjwhitworth/golearn/meta.(*BaggedModel).Predict
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:149 +0x25b
goroutine 187 [runnable]:
github.com/sjwhitworth/golearn/meta.func·004()
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:138
created by github.com/sjwhitworth/golearn/meta.(*BaggedModel).Predict
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:149 +0x25b
goroutine 186 [runnable]:
github.com/sjwhitworth/golearn/meta.func·004()
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:138
created by github.com/sjwhitworth/golearn/meta.(*BaggedModel).Predict
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:149 +0x25b
goroutine 185 [runnable]:
github.com/sjwhitworth/golearn/meta.func·004()
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:138
created by github.com/sjwhitworth/golearn/meta.(*BaggedModel).Predict
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:149 +0x25b
goroutine 183 [chan receive]:
github.com/sjwhitworth/golearn/meta.func·003()
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:113 +0x8c
created by github.com/sjwhitworth/golearn/meta.(*BaggedModel).Predict
/home/ubuntu/.gvm/pkgsets/go1.3/global/src/github.com/sjwhitworth/golearn/meta/bagging.go:131 +0x175
exit status 2
This of course makes it hard to write benchmarking tests which expect to be able to execute the code multiple times in order to time it.
Trying to benchmark the k-NN Classifier with a large dataset with a large number of features. I'm using this in particular:
http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/multiclass.html#aloi
Had to do a fair bit of yak shaving to get `edf` to work:

- When `rowsPerPage` is about 3.96 and you have 108,000 rows, the lack-of-rounding error compounds and you end up not asking for enough pages. I think `rowsPerPage` needs to be `math.Floor`ed.
- `alloc.go` doesn't appear to consider `edfAnonMode`. I get a panic on line 21 because `e.f` is nil. I extracted a `fileSize` variable which I set to `os.Getpagesize()` in case we're in `edfAnonMode`. A few issues here: (a) `f` isn't a great name for a struct field; (b) my solution was hacky, and I'm not sure where else various modes aren't being considered, but I also don't want to riddle the package with switches on `mode`; (c) I'm not sure why `os.Getpagesize()` is the correct value — I copied it from map.go, but this value should probably be extracted as a constant somewhere.
- The `startBlock == 0` guard in AllocPages isn't clear, but I found I needed to move `e.extend(pagesRequested)` up before the if block.
- `fixed.go` blew up when I tried to use it for some debugging output. It assumes `f.alloc` has non-zero length, and that `f.alloc[0]` has length at least 61. Got `index out of range` panics. Not sure whether these assumptions are supposed to hold and weren't for some other reason that's broken, so I just worked around it by removing anything related to `alloc` from the `Sprintf` interpolation arguments, and got rid of the `if` altogether. The source of the 61 number wasn't clear.

I didn't want to submit a PR with these fixes because (a) the tests don't give me a lot of confidence that the changes haven't broken something else, and (b) a lot of my changes felt hacky, and it feels like the right solution would involve heavier refactoring.
Couple of big-picture questions about EDF:

- Why is it in the `golearn` repo? Seems like it's its own project.
- … the `testTrainSplit` function). For each of those files, `ParseCSVToInstances` takes about 30s, never mind actually doing the kNN classification (literally on the order of a few years; I made some tweaks, but it's still on the order of an hour). For comparison, I have code that parses both CSVs and does the kNN classification in about 30s total. Hard to pinpoint where all the slowness is coming from, but it seems scary.

`Perceptron` implementation, probably under `optimisation` or a new subpackage `linear`.

`BoostedModel` to the `ensemble` package.

A large number of tests don't make any assertions, and instead just print stuff out, e.g. a confusion matrix summary. This seems to defeat the purpose of having automated tests. There is some value to these tests: if they pass, they tell you the algorithm runs end to end without blowing up. But I've noticed when running in verbose mode that some of the algorithms produce 0% accuracy, and when tweaking parameters to improve the accuracy, the tests crashed with a memory panic.
Seems like having actual assertions (in addition to being valuable on their own) would've helped catch that kind of thing earlier. I'm willing to volunteer to go do this, work through the tests and make sure they're all making valuable assertions, someone just needs to assign it to me.
While I'm at it, any ideas for what sorts of things would be valuable to assert on? At minimum, for things that were printing out confusion matrices, asserting on the overall summary is a start.
Just wanted to bring up one minor bug, and ask about unseeded random number generation in Go.
There was a minor implementation bug in base/instances.go in regards to the Shuffle method on Instances that causes it to create a non-uniform distribution of shuffle permutations. I'll have it fixed with the correct Fisher-Yates algorithm and submit a pull request soon.
Another thing that came to my attention is that our libraries do not seed the math/rand generators. I'm not sure if this lack of seeding was a desired feature. As Go's rand package uses a fixed seed, behavior between runs is currently deterministic. Perhaps we could make it a convention to seed in our packages' init() function?
Trying to run the tests for linear_regression, I get an error at this line:

# include <linear.h>

But I've installed everything with `make.go` and set appropriate paths. Any ideas @njern?
`InstancesView`s are used by `InstancesTrainingTestSplit` and other methods to slice and dice `FixedDataGrid`s without using any more memory than necessary. However, this does mean there's less scope for optimisations, like the recent KNN one, which rely on having a contiguous data layout.

To fix: I propose adding an `InstancesDenseCopy` method to `base` which always returns a `DenseInstances` that replicates the original `FixedDataGrid`'s Attributes and all of the data it contains.
Moving gokmeans to golearn, making use of golearn's interfaces.
So you might have come across this piece on Hacker News the other week about implementing 1-NN in Rust. I decided it might be an interesting idea to see if we could achieve the same performance.
I installed a similar nightly build of Rust 0.11.0 on my Linode and compiled the sample code with full optimisations. The real user time (as measured by `time` and averaged over five runs) was 6.5 seconds. I then checked out the current golearn master and wrote an equivalent KNN program. This took 45.9 seconds (averaged over two runs). Most of the time is spent in matrix functions, and a small amount is spent sorting integer maps and garbage collecting.
To cut down on allocations and matrix operations, I moved storage to an `mmap`-able format and changed `Attributes` to use `[]byte` slices as their system representation. Using `reflect.SliceHeader` and some other tricks to cast between `float64`, `uint64` and `[]byte`, it's possible to access that (potentially disk-backed) memory with very little garbage-collection overhead. Surprisingly, while this transformed the profiling results, performance wasn't that different, saving only around 7-10 seconds on average.
I then radically overhauled the `Instances` type, allowing subsets of `Attributes` to be stored in column order, and re-wrote the Euclidean distance calculation in C to take advantage of this locality. The new running time (averaged over five runs) is 9.6 seconds. I think that with a few compiler flags and some more `Attribute` types, we could go even faster than that.
In light of this result, I'd like to recommend:

- `SelectAttributes`-style functions into utility functions
- `Attribute` grouping

The code I'm going to submit is not finished yet: it doesn't work in low-memory situations, it doesn't implement all of the safety requirements to use it, and these features definitely aren't mergeable in the v0.1 time-frame, but I'd be interested in your thoughts on the design.
I'm not an experienced golang programmer, but there are a couple of libraries made for golang that could be used:
https://github.com/datastream/libsvm
https://github.com/ryanbressler/CloudForest
https://gowalker.org/github.com/jbrukh/bayesian
Not sure if we want to stick with plain vanilla neural nets, or if we want to do deep learning. Do you have any experience with this @macmania ?
The installation instructions mention modifying some lines of code in `$GOPATH/src/github.com/gonum/blas/cblas/blas.go`. However, the lines the instructions reference do not exist in the latest version of gonum/blas. As far as I can tell, they were removed on April 28.

I found that setting the temporary environment variable CGO_LDFLAGS when running `go install ./...` works. So e.g. on OS X the command that worked for me was `CGO_LDFLAGS="-framework Accelerate" go install ./...`. This way you don't have to modify the source code for gonum/blas.
We should start off with this. It's a stable workhorse of ML applications, and would be really useful.
As mentioned in other issues, there are some decisions we need to make.

- `mat64`: lacks docs, but the author replies to issues very fast; optimised memory usage.
- `biogo.matrix`: the docs are quite good, but I have no experience using it.
- The `base` package, since it is related to many other packages in golearn.
- `linear_models/liblinear_src` in #23. We need to agree a convention for how to include 3rd-party libraries.

Please leave comments about the above issues. We should settle these first.
@sjwhitworth @ifesdjeen @npbool @marcoseravalli @macmania
A `CategoricalSetAttribute` type for holding up to 64 things.

We should design the interface for a base classifier/regressor to implement, so we can simply pass interfaces into funcs.
Overlooked in #88, requires some plumbing.
Started working on k-fold and leave-one-out.
Having a useful set of parsing packages makes life a whole lot easier for people. We should aim to natively support at least JSON and CSV from the start.
I am getting the following error message when trying to do anything:
# github.com/sjwhitworth/golearn/linear_models
../../../sjwhitworth/golearn/linear_models/liblinear.go:55: cannot use &c_y[0] (type *C.int) as type *C.double in assignment
It has to do with commit 94a562b. Changing the two lines 51 and 53 back to using the double type fixes the issue.
Is it possible this is a cross-platform or versioning issue? I'm on Mac OS X and installed liblinear version 1.94 with `brew install liblinear`.
Happy to provide clarification or more information. Is anyone else running into this?
I know there is a ticket for building out a deep learning Neural Network but I was curious if an averaged perceptron implementation might be welcome. I've been hacking in Go for a bit and I'm trying to find good projects to both learn from others and have an opportunity to build new things. Thanks!
Right now the TrainTestSplit() function that is implemented in the cross_validation package is never used anywhere, because the base package already implements InstancesTrainTestSplit().
Personally, I think CV is such an integral part of any ML algorithm/training that we could go ahead and delete the CV package and just keep any CV-related logic in base.
Thoughts?