
Comments (16)

Sentimentron commented on June 24, 2024
  • I think the math is off there, because it hasn't been tested with very large datasets.
  • GetContiguousOffset returns 0 if there isn't enough space in the free bitmap to satisfy the request, which is one of the conditions that triggers extension.
  • The String method is not intended for practical use (it was for debugging) and it should be removed.

My motivation for edf is to allow easy manipulation of large amounts of data which don't necessarily fit in memory; it's designed specifically for backing DenseInstances. KNN does need to be optimised, but DenseInstances does do some things (such as storing all FloatAttributes inside a contiguous AttributeGroup) which should make optimisation possible. Could you run go tool pprof and post the SVG somewhere so we can identify good optimisation targets?
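
For illustration, the kind of layout I mean is roughly the following (hypothetical names, not golearn's actual types); because every FloatAttribute lives in one contiguous row-major slab, fetching a row is just slice arithmetic:

package main

import "fmt"

// floatGroup is a simplified stand-in for an AttributeGroup: all float
// columns live in one contiguous, row-major slab.
type floatGroup struct {
	data []float64 // len == rows * cols
	cols int
}

// row returns a view of row i without copying: pure slice arithmetic.
func (g floatGroup) row(i int) []float64 {
	return g.data[i*g.cols : (i+1)*g.cols]
}

func main() {
	g := floatGroup{data: make([]float64, 4*3), cols: 3}
	for i := range g.data {
		g.data[i] = float64(i)
	}
	fmt.Println(g.row(2)) // [6 7 8]
}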


amitkgupta commented on June 24, 2024

@Sentimentron

Re getContiguousOffset, could you elaborate on that? I was able to move the call to extend up outside the guard, so it happens even if getContiguousOffset returns 0, and that works just fine; things weren't working with the extend left inside the if. Could you explain what "contiguous offset" is supposed to mean in this context, and why a value of 0 indicates that pages can't or shouldn't be extended onto the EdfFile?

Re optimizations, I was able to speed it up by several orders of magnitude simply by using plain old for loops instead of MapOverRows, but it was still too slow to finish in a reasonable amount of time. I also used goroutines to parallelize the computation, which bought another few orders of magnitude, but eventually something in mat64 started panicking, which I haven't debugged yet.
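
For illustration, the shape of the goroutine version was roughly the following (a from-memory sketch with made-up data, not the exact code that panicked):

package main

import (
	"fmt"
	"runtime"
	"sync"
)

// squaredDistance skips the sqrt since we only compare distances.
func squaredDistance(a, b []float64) float64 {
	var s float64
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

// nearestLabel does a brute-force 1-NN scan of the training set.
func nearestLabel(train [][]float64, labels []string, q []float64) string {
	best, bestDist := 0, squaredDistance(train[0], q)
	for i := 1; i < len(train); i++ {
		if d := squaredDistance(train[i], q); d < bestDist {
			best, bestDist = i, d
		}
	}
	return labels[best]
}

func main() {
	train := [][]float64{{0, 0}, {10, 10}}
	labels := []string{"a", "b"}
	tests := [][]float64{{1, 1}, {9, 9}, {2, 0}, {8, 10}}
	out := make([]string, len(tests))

	// Fan the test rows out over one goroutine per CPU; each index of
	// out is written by exactly one goroutine, so no locking is needed.
	var wg sync.WaitGroup
	workers := runtime.NumCPU()
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			for i := w; i < len(tests); i += workers {
				out[i] = nearestLabel(train, labels, tests[i])
			}
		}(w)
	}
	wg.Wait()
	fmt.Println(out) // [a b a b]
}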

Not sure what you mean about running pprof to generate an SVG. Do you mean for my implementation of CSV parsing and kNN classification, or for the golearn implementation? In my code, I tried adding

_ "net/http/pprof"
"log"
"net/http"

to my imports and put the following as the first thing in my main function:

go func() {
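    // Serve the profiling endpoints registered by the net/http/pprof import.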
    log.Println(http.ListenAndServe("localhost:8080", nil))
}()

Then I can go run the script, but what do I need to do to generate the SVG you mention?


Sentimentron commented on June 24, 2024

So the idea of getContiguousOffset is that you can request some amount of storage which doesn't break across mmap'd boundaries and which isn't occupied by previously allocated blocks. Because block 0 is reserved for system metadata, it's not available for allocation, so a return value of 0 indicates that the file needs to be extended.
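
The convention, much simplified (a hypothetical sketch, not the real edf internals; the real version also has to avoid runs that cross mmap'd segment boundaries):

package main

import "fmt"

// freeRun returns the first index of n contiguous free blocks, or 0 if
// no such run exists. Block 0 is reserved for system metadata, so 0 can
// never be a valid allocation and doubles as a "must extend" sentinel.
func freeRun(used []bool, n int) int {
	run := 0
	for i := 1; i < len(used); i++ { // block 0 is never allocatable
		if used[i] {
			run = 0
			continue
		}
		run++
		if run == n {
			return i - n + 1
		}
	}
	return 0
}

func main() {
	used := []bool{true, true, false, false, true, false, false, false}
	if off := freeRun(used, 3); off == 0 {
		fmt.Println("no room: extend the file first")
	} else {
		fmt.Println("allocate at block", off)
	}
}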

For testing, what I do is:

go build
Richards-MacBook-Air:knn rtownsend$ go test -cpuprofile=knn.cpu.prof
....PASS
ok      github.com/sjwhitworth/golearn/knn  0.042s
Richards-MacBook-Air:knn rtownsend$ go tool pprof ./knn.test knn.cpu.prof
Welcome to pprof!  For help, type 'help'.
(pprof) go web
Unknown command: try 'help'.
(pprof) web
Total: 1 samples
Loading web page file:////var/folders/mr/yzjksl8j3hz4qmc6k1wdclwr0000gn/T/kqvIuywmDN.0.svg
(pprof) 

I note on my Mac (10.8), the pprof output is broken, but when I run it on a Linux system, it produces something like this.
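
With the net/http/pprof route you were trying, I believe the equivalent is to point the tool at the live profiling endpoint (the binary path here is a placeholder for whatever you built):

go tool pprof ./your-binary http://localhost:8080/debug/pprof/profile
(pprof) web

Note that web shells out to graphviz's dot to produce the SVG, so graphviz needs to be installed; if it isn't, that may also explain the broken output on my Mac.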


amitkgupta commented on June 24, 2024

Okay, so if getContiguousOffset returns 0, the file needs to be extended. But if getContiguousOffset returns a non-zero value, couldn't it still be the case that the file needs to be extended? That's the exact case I'm running into. Is the hidden assumption that if getContiguousOffset returns non-zero, then something else has already extended the file by enough pages that it should never need to be extended again?


Sentimentron commented on June 24, 2024

Hmmm, the for { ... break } structure is quite unusual: I don't know what I was getting at there. Also, getFreeMapSize doesn't appear to take the new allocation strategy into account; it should use the number of pages allocated by truncateMem.


Sentimentron commented on June 24, 2024

Ah, it's skipping ContentsTable entries which are full: it's searching for the latest possible insertion point.


Sentimentron commented on June 24, 2024

@amitkgupta Try working off #80 and see if that helps mitigate any of those issues.


amitkgupta commented on June 24, 2024

I'm not getting any images: I get No nodes to print, and knn.cpu.prof is nearly empty:

ubuntu knn$ go build
ubuntu knn$ go test -cpuprofile=knn.cpu.prof
........
8 assertions thus far

PASS
ok      github.com/sjwhitworth/golearn/knn  0.006s
ubuntu knn$ go tool pprof ./knn.test knn.cpu.prof
Welcome to pprof!  For help, type 'help'.
(pprof) web
Total: 0 samples
No nodes to print
(pprof) ^C
ubuntu knn$ ls -al knn.cpu.prof
-rw-rw-r-- 1 ubuntu ubuntu 64 Aug 31 08:44 knn.cpu.prof
ubuntu knn$ cat knn.cpu.prof
'ubuntu knn$
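
Could it simply be that the test finishes too quickly? As far as I understand, the CPU profiler only samples around 100 times per second, so a 0.006s run would collect essentially nothing. If there's a benchmark to run, something like this might give the profiler enough to sample:

go test -bench=. -benchtime=10s -cpuprofile=knn.cpu.prof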


Sentimentron commented on June 24, 2024

OK, since we seem to be deadlocked on this: could you post some of the code? (I'll double check the profile and see if there's something obvious going on).


amitkgupta commented on June 24, 2024

I don't have the machine with the exact code anymore, but it's easy to reproduce with some simple modifications of existing code.

Here's my script; you just need to have the right content in the CSV files (lines highlighted), and maybe change float32 to float64 everywhere:
https://github.com/amitkgupta/nearest_neighbour/blob/master/golang-k-nn-speedup.go#L73-L94

The code in golearn's test for k-NN classification is essentially the golearn code I'm running: https://github.com/sjwhitworth/golearn/blob/master/knn/knn_test.go. For a fair comparison, I'd modify the number of neighbours from 2 to 1, and again make sure the right data is in the CSV files. Obviously, for profiling you'd want to put this in a main package and remove the testing/assertion-related code. Again, for a fair comparison with the previous script, you'd want to calculate what percentage of the predictions are correct, but my guess is it'll never get that far.

The CSV files I used are attached.

many_features_test.csv
https://docs.google.com/file/d/0B2oA2Fh-AtVAekVyUnFQemR6c00/edit?usp=drive_web

many_features_training.csv
https://docs.google.com/file/d/0B2oA2Fh-AtVATS00TDljWmJfblU/edit?usp=drive_web



sjwhitworth commented on June 24, 2024

I'm seeing similar problems when loading in datasets. I have a CSV dataset that will load in up to around 2,000 of its 250,000 rows; after that, golearn spits back a rather cryptic

Stephens-Air:fraud stephenwhitworth$ go run fraud.go 
panic: runtime error: index out of range

goroutine 1 [running]:
runtime.panic(0x13a620, 0x2f61d7)
    /opt/boxen/homebrew/Cellar/go/1.2.2/libexec/src/pkg/runtime/panic.c:266 +0xb6
github.com/sjwhitworth/golearn/base/edf.(*EdfFile).ResolveRange(0x210441a80, 0x0, 0x4000, 0x2e, 0xafff, ...)
    /Users/stephenwhitworth/src/github.com/sjwhitworth/golearn/base/edf/map.go:419 +0x3fb
github.com/sjwhitworth/golearn/base.(*DenseInstances).Extend(0x21041bf00, 0x3d912, 0x0, 0x0)
    /Users/stephenwhitworth/src/github.com/sjwhitworth/golearn/base/dense.go:319 +0x3f3
github.com/sjwhitworth/golearn/base.ParseCSVToInstances(0x16bab0, 0x8, 0x2f9701, 0x21041bf00, 0x0, ...)
    /Users/stephenwhitworth/src/github.com/sjwhitworth/golearn/base/csv.go:177 +0x266
main.main()
    /Users/stephenwhitworth/fraud/fraud.go:19 +0x7f

Do we need all of the EDF mapping behind the scenes, @Sentimentron? Can we not just keep everything in memory, or at least give the option to?


Sentimentron commented on June 24, 2024

One problem I'm having trying to do some KNN stuff is that, at the moment, Attributes are stored together in slabs (AttributeGroups) of a homogeneous type, and there isn't support (yet) for rows which span more than a single page in memory. I'll make it a priority. @sjwhitworth, can you give some details about the number of columns your dataset has, just so I can check whether it's the same issue?


mish15 commented on June 24, 2024

My two cents on this is to remove all the mmap for now. For the few benefits it brings, it will cause way, way more problems. We've been doing lots of mmap work for the past 12 months, and I don't think it's the best solution for the majority of this package. It will slow most things down, cause OS-specific glitches, complicate issue resolution, slow development, and even when it's working it will trip up users who don't know about overcommit, etc.

Everything we've done with mmap takes 5x more code and 5x more testing. I see it as a solution for inefficient memory usage (e.g. when a large proportion of RAM rarely gets read), but that's about it.


amitkgupta commented on June 24, 2024

Is there an immediate use case that you, your colleagues, or your clients have for handling large datasets that don't fit in memory? If not, this feels like a premature optimization, and I wouldn't expect it to be the default behaviour.

Does anyone know how industry-standard tools like scikit-learn handle, or don't handle, datasets that are that big? Realistically, I think someone with that much data would spin up a high-memory ephemeral machine, do some PCA or something to trim the dataset, save it off and tear down the machine, and then have longer-running prediction processes work against the trimmed dataset, which doesn't require as much memory. Or, in the case of a stateful thing like linear regression, just store off the coefficients and not worry about memory at all.

Being able to handle very large datasets is a nice goal, but the current implementation incurs costs in terms of performance, leaked abstractions (DenseInstances should not publicly expose Extend), and most importantly it doesn't work. Even with modest datasets that easily fit in memory on most people's personal computers, I can't use golearn, I just get crashes or execution times that look like they'll take on the order of months based on extrapolation.

Perhaps edf can be extracted into a separate repo with a clear focus, then fixed, improved, and rigorously tested and benchmarked. golearn can just naively chuck stuff in memory for now and focus on getting things working, keeping in mind the longer-term goal of working with large datasets (and so avoiding assuming that all the data is in memory whenever possible). Then when edf is ready, with a nice interface, it could be something that cleanly plugs into golearn, perhaps optionally based on a user-provided parameter or some logic in the code that decides a file is big enough to require an edf-backed DataGrid.

I can see that more decoupling between edf and golearn could maybe make certain optimizations difficult, but it feels like it'd be better to have something working now, and solve that problem later if it even is one.


Sentimentron commented on June 24, 2024

I reviewed the code yesterday and actually it has some more problems (like not allocating bitmaps properly). I would agree that the best way to proceed is to urgently remove edf. The way to do that is to:

  1. Remove all references to the storage field within DenseInstances.
  2. Change everything from the "Allocate those pages" comment onward within Extend to use make, and add storage to the AttributeGroups individually (roughly as sketched below).
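
A rough sketch of what step 2 could look like (hypothetical types, not the actual dense.go):

package main

// attributeGroup owns plain in-memory storage rather than pages
// resolved through an edf-backed file.
type attributeGroup struct {
	storage []byte
	rowSize int // bytes per row for this group
}

// extend grows the group by rows further rows with make/append,
// replacing the edf page-allocation path.
func (a *attributeGroup) extend(rows int) {
	a.storage = append(a.storage, make([]byte, rows*a.rowSize)...)
}

func main() {
	ag := &attributeGroup{rowSize: 8 * 784} // e.g. 784 float64 Attributes
	ag.extend(100)
	println(len(ag.storage)) // 627200 bytes == 100 rows
}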

I've now got a test case for this (using MNIST, 784-ish Attributes), so I'll do some surgery and see if I can get a patch out tonight. I'm also working on a patch to vectorise our KNN implementation in certain circumstances; time-series work is still ongoing, and integration of nnet is currently in planning.



Sentimentron commented on June 24, 2024

OK, #82 expunges it, but Travis is complaining against tip: some kind of liblinear segfault (so it may be unrelated to this change).

