Comments (16)
- I think the math is off there, because it hasn't been tested with very large datasets. `GetContiguousOffset` returns 0 if there isn't enough space in the free bitmap to store the requested allocation, which is one of the conditions needed for extension.
- The `String` method is not intended for practical use (it was for debugging) and should be removed.
My motivation for `edf` is to allow easy manipulation of large amounts of data which don't necessarily fit in memory; it's designed specifically for backing `DenseInstances`. KNN does need to be optimised, but `DenseInstances` does do some things (such as storing all `FloatAttributes` inside a contiguous `AttributeGroup`) which should make optimisation possible. Could you run `go tool pprof` and post the SVG somewhere so we can identify good optimisation targets?
from golearn.
Re `getContiguousOffset`, could you elaborate on that? I was able to move the call to `extend` up outside the guard, so it happens even when `getContiguousOffset` returns 0, and everything works fine. Things weren't working with the `extend` left inside the `if`. Could you explain what "contiguous offset" is supposed to mean in this context, and why a value of 0 indicates that pages can't/shouldn't be extended onto the `EdfFile`?
Re optimizations, I was able to speed it up by several orders of magnitude simply by using plain old `for` loops instead of `MapOverRows`, but it was still too slow to finish in a reasonable amount of time. I also used goroutines to parallelize the computation, which gained another few orders of magnitude, but eventually something in `mat64` started panicking, which I haven't debugged.
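For reference, the goroutine-based parallelization described above might look roughly like the following sketch. This is a minimal, self-contained illustration with a hypothetical data layout (plain `[][]float64` rows), not the actual code from the thread:

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// squaredDistance returns the squared Euclidean distance between two rows.
// (Squared distance preserves nearest-neighbour ordering, so the sqrt is skipped.)
func squaredDistance(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// nearestNeighbours finds, for each test row, the index of the closest
// training row, sharding the test rows across one goroutine per CPU.
func nearestNeighbours(train, test [][]float64) []int {
	out := make([]int, len(test))
	workers := runtime.NumCPU()
	var wg sync.WaitGroup
	for w := 0; w < workers; w++ {
		wg.Add(1)
		go func(w int) {
			defer wg.Done()
			// Each worker handles test rows w, w+workers, w+2*workers, ...
			for i := w; i < len(test); i += workers {
				best, bestDist := -1, 0.0
				for j, row := range train {
					if d := squaredDistance(test[i], row); best == -1 || d < bestDist {
						best, bestDist = j, d
					}
				}
				out[i] = best
			}
		}(w)
	}
	wg.Wait()
	return out
}

func main() {
	train := [][]float64{{0, 0}, {10, 10}}
	test := [][]float64{{1, 1}, {9, 9}}
	fmt.Println(nearestNeighbours(train, test)) // [0 1]
}
```

Each worker writes only to its own indices of `out`, so no locking is needed beyond the `WaitGroup`.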
Not sure what you mean about running `pprof` to generate an SVG. Do you mean for my implementation of CSV parsing and kNN classification, or for the `golearn` implementation? In my code, I tried adding

```go
_ "net/http/pprof"
"log"
"net/http"
```

to my imports and put the following as the first thing in my main function:

```go
go func() {
	log.Println(http.ListenAndServe("localhost:8080", nil))
}()
```

Then I can `go run` the script, but what do I need to do to generate the SVG you mention?
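For a standalone script (rather than a test, which is the approach taken later in the thread), one way to get a profile that `go tool pprof` can render is to write a CPU profile directly with `runtime/pprof`. A sketch, with `busyWork` standing in for the real workload:

```go
package main

import (
	"log"
	"os"
	"runtime/pprof"
)

// profileTo writes a CPU profile of work() to path. Afterwards,
// `go tool pprof ./binary cpu.prof` followed by the `web` command renders
// an SVG call graph (Graphviz must be installed for `web` to work).
func profileTo(path string, work func()) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()
	if err := pprof.StartCPUProfile(f); err != nil {
		return err
	}
	defer pprof.StopCPUProfile()
	work()
	return nil
}

// busyWork stands in for the real kNN workload so the profiler has
// enough samples to collect.
func busyWork() {
	sum := 0.0
	for i := 0; i < 50000000; i++ {
		sum += float64(i) * 1.0000001
	}
	_ = sum
}

func main() {
	if err := profileTo("cpu.prof", busyWork); err != nil {
		log.Fatal(err)
	}
}
```

Note that the profiler samples at roughly 100 Hz, so a run that finishes in a few milliseconds produces an almost-empty profile (which is what happens later in this thread).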
So the idea of `getContiguousOffset` is that you request some amount of storage which doesn't break across mmap'd boundaries and which isn't occupied by previously allocated blocks. Because block 0 is reserved for system metadata, it's not available for allocation, so a return value of 0 indicates that the file needs to be extended.
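To make the "zero means extend" convention concrete, here is a toy model of the caller-side pattern. All names, sizes, and the bump-pointer allocator are hypothetical simplifications, not golearn's real implementation:

```go
package main

import "fmt"

// file models an edf-style arena: offsets are handed out from a fixed-size
// region, offset 0 is reserved for metadata, so 0 doubles as the sentinel
// for "no room, extend the file first".
type file struct {
	next, size uint64
}

// getContiguousOffset returns an offset with room for n bytes,
// or 0 if the file must be extended first.
func (f *file) getContiguousOffset(n uint64) uint64 {
	if f.next+n > f.size {
		return 0
	}
	off := f.next
	f.next += n
	return off
}

// extend grows the file by at least n bytes.
func (f *file) extend(n uint64) { f.size += n }

// allocate shows the intended calling pattern: try, extend on 0, try again.
func (f *file) allocate(n uint64) uint64 {
	off := f.getContiguousOffset(n)
	if off == 0 {
		f.extend(n)
		off = f.getContiguousOffset(n)
	}
	return off
}

func main() {
	// A 4 KiB metadata block occupies offset 0; the arena is initially full.
	f := &file{next: 4096, size: 4096}
	fmt.Println(f.allocate(1024)) // 4096: extended, then allocated after the metadata block
}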
For testing, what I do is:

```
go build
Richards-MacBook-Air:knn rtownsend$ go test -cpuprofile=knn.cpu.prof
....PASS
ok      github.com/sjwhitworth/golearn/knn      0.042s
Richards-MacBook-Air:knn rtownsend$ go tool pprof ./knn.test knn.cpu.prof
Welcome to pprof!  For help, type 'help'.
(pprof) go web
Unknown command: try 'help'.
(pprof) web
Total: 1 samples
Loading web page file:////var/folders/mr/yzjksl8j3hz4qmc6k1wdclwr0000gn/T/kqvIuywmDN.0.svg
(pprof)
```

I note that on my Mac (10.8) the `pprof` output is broken, but when I run it on a Linux system, it produces something like this.
Okay, so if `getContiguousOffset` returns 0, the file needs to be extended. But if `getContiguousOffset` returns a non-zero value, couldn't it still be the case that the file needs to be extended? This is the exact case I'm running into. Is the hidden assumption that if `getContiguousOffset` returns non-zero, then somewhere else something has already extended the file by enough pages that it should never need to be extended again?
Hmmm, the `for { ... break }` structure is quite unusual: I don't know what I was getting at there. Also, `getFreeMapSize` doesn't appear to take the new allocation strategy into account; it should use the number of pages allocated by `truncateMem`.
Ah, it's skipping `ContentsTable` entries which are full: it's searching for the latest possible insertion point.
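The "skip full entries" search being described could be sketched like this (a hypothetical simplification, with made-up `entry` fields rather than the real `ContentsTable` layout):

```go
package main

import "fmt"

// entry is a hypothetical stand-in for a ContentsTable entry.
type entry struct {
	used, capacity int
}

// insertionPoint walks past every full entry and returns the index of the
// first entry with spare capacity: the latest possible insertion point.
func insertionPoint(table []entry) int {
	for i, e := range table {
		if e.used < e.capacity {
			return i
		}
	}
	return len(table) // every entry is full: a new one must be appended
}

func main() {
	table := []entry{{8, 8}, {8, 8}, {3, 8}}
	fmt.Println(insertionPoint(table)) // 2: the first two entries are full
}
```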
@amitkgupta Try working off #80 and see if that helps mitigate any of those issues.
I'm not getting any images; I get `No nodes to print`, and `knn.cpu.prof` is nearly empty:

```
ubuntu knn$ go build
ubuntu knn$ go test -cpuprofile=knn.cpu.prof
........
8 assertions thus far
PASS
ok      github.com/sjwhitworth/golearn/knn      0.006s
ubuntu knn$ go tool pprof ./knn.test knn.cpu.prof
Welcome to pprof!  For help, type 'help'.
(pprof) web
Total: 0 samples
No nodes to print
(pprof) ^C
ubuntu knn$ ls -al knn.cpu.prof
-rw-rw-r-- 1 ubuntu ubuntu 64 Aug 31 08:44 knn.cpu.prof
ubuntu knn$ cat knn.cpu.prof
'ubuntu knn$
```
OK, since we seem to be deadlocked on this: could you post some of the code? (I'll double check the profile and see if there's something obvious going on).
I don't have the machine with the exact code anymore, but it's easy to reproduce with some simple modifications of existing code. Here's my script; you just need to have the right content in the CSV files (lines highlighted), and maybe change `float32` to `float64` everywhere: https://github.com/amitkgupta/nearest_neighbour/blob/master/golang-k-nn-speedup.go#L73-L94

The code in golearn's test for k-NN classification is essentially the golearn code I'm running: https://github.com/sjwhitworth/golearn/blob/master/knn/knn_test.go. For a fair comparison, I'd modify the number of neighbours from 2 to 1, and again make sure the right data is in the CSV files. Obviously for profiling you'd want to put this in a main package and remove the testing/assertion-related code. Again, for a fair comparison with the previous script, you'd want to calculate what percentage of the predictions are correct, but my guess is it'll never get that far.

The CSV files I used are attached.

many_features_test.csv
https://docs.google.com/file/d/0B2oA2Fh-AtVAekVyUnFQemR6c00/edit?usp=drive_web

many_features_training.csv
https://docs.google.com/file/d/0B2oA2Fh-AtVATS00TDljWmJfblU/edit?usp=drive_web
I'm seeing similar problems when loading in datasets. I have a CSV dataset of 250,000 rows that will load in up to around 2,000 rows; after that, golearn spits back a rather cryptic:

```
Stephens-Air:fraud stephenwhitworth$ go run fraud.go
panic: runtime error: index out of range

goroutine 1 [running]:
runtime.panic(0x13a620, 0x2f61d7)
        /opt/boxen/homebrew/Cellar/go/1.2.2/libexec/src/pkg/runtime/panic.c:266 +0xb6
github.com/sjwhitworth/golearn/base/edf.(*EdfFile).ResolveRange(0x210441a80, 0x0, 0x4000, 0x2e, 0xafff, ...)
        /Users/stephenwhitworth/src/github.com/sjwhitworth/golearn/base/edf/map.go:419 +0x3fb
github.com/sjwhitworth/golearn/base.(*DenseInstances).Extend(0x21041bf00, 0x3d912, 0x0, 0x0)
        /Users/stephenwhitworth/src/github.com/sjwhitworth/golearn/base/dense.go:319 +0x3f3
github.com/sjwhitworth/golearn/base.ParseCSVToInstances(0x16bab0, 0x8, 0x2f9701, 0x21041bf00, 0x0, ...)
        /Users/stephenwhitworth/src/github.com/sjwhitworth/golearn/base/csv.go:177 +0x266
main.main()
        /Users/stephenwhitworth/fraud/fraud.go:19 +0x7f
```
Do we need all of the EDF mapping behind the scenes, @Sentimentron? Can't we just keep it all in memory, or at least provide the option to?
One problem I'm having trying to do some KNN stuff is that, at the moment, Attributes are stored together in slabs (`AttributeGroups`) of a homogeneous type, and there isn't support (yet) for rows which span more than a single page in memory. I'll make it a priority. @sjwhitworth, can you give some details about the number of columns your dataset has, just so I can check whether it's the same issue?
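A rough model of the slab layout and the single-page limitation being described (field names, the 4 KiB page size, and the row-major layout are assumptions for illustration, not golearn's actual internals):

```go
package main

import "fmt"

// With a 4 KiB page and 8-byte float64 values, a row wider than 512
// columns cannot fit in a single page: that is the limitation mentioned
// in the comment above.
const pageSize = 4096

// attributeGroup models a slab holding one homogeneous type contiguously,
// row-major.
type attributeGroup struct {
	cols int
	data []float64
}

// rowFitsInPage reports whether one row of this group fits inside a page.
func (g *attributeGroup) rowFitsInPage() bool {
	return g.cols*8 <= pageSize
}

func (g *attributeGroup) appendRow(row []float64) {
	g.data = append(g.data, row...)
}

func (g *attributeGroup) get(row, col int) float64 {
	return g.data[row*g.cols+col]
}

func main() {
	g := &attributeGroup{cols: 784} // MNIST-sized rows: 784 * 8 bytes > 4096
	fmt.Println(g.rowFitsInPage())  // false: such a row spans more than one page
	g.appendRow(make([]float64, 784))
	fmt.Println(g.get(0, 783)) // 0
}
```

Under this model, any dataset wider than 512 float columns (like MNIST's 784, mentioned below) would trigger the unsupported multi-page-row case.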
My two cents on this is to remove all the mmap for now. For the few benefits, it will cause way, way more problems. We've been doing lots of mmap work for the past 12 months and I don't think it's the best solution (for the majority of this package). It will slow most things down, cause OS-specific glitches, complicate issue resolution, slow development, and even when working it will trip up users that don't know about overcommit, etc.
Everything we've done with mmap takes 5x more code and 5x more testing. I see it as a solution for inefficient memory usage (e.g. large proportion of ram rarely gets read), but that's about it.
Is there an immediate use case that you, your colleagues, or your clients have for handling large datasets that don't fit in memory? If not, this feels like a premature optimization, and I wouldn't expect it to be the default behaviour.
Does anyone know how industry standard tools like scikit-learn handle, or don't handle, datasets that are that big? Realistically, I think someone with that much data would spin up a high-memory ephemeral machine, do some PCA or something to trim the dataset, save it off and tear down the machine, and then longer-running processes doing predictions do so against the trimmed dataset and don't require as much memory. Or, in the case of a stateful thing like linear regression, just store off the coefficients and not worry about memory at all.
Being able to handle very large datasets is a nice goal, but the current implementation incurs costs in terms of performance and leaked abstractions (`DenseInstances` should not publicly expose `Extend`), and most importantly it doesn't work. Even with modest datasets that easily fit in memory on most people's personal computers, I can't use golearn; I just get crashes, or execution times that extrapolate to the order of months.
Perhaps `edf` can be extracted as a separate repo with a clear focus, then fixed, improved, and rigorously tested and benchmarked. `golearn` can just naively chuck stuff in memory for now and focus on getting things working, keeping in mind the longer-term goal of working with large datasets (and so avoiding assuming that all the data is in memory whenever possible). Then, when `edf` is ready, with a nice interface, it could be something that cleanly plugs into `golearn`, perhaps optional based on a user-provided parameter or some logic in the code that decides a file is big enough to require an `edf`-backed DataGrid.

I can see that more decoupling between `edf` and `golearn` could make certain optimizations difficult, but it feels better to have something working now and solve that problem later, if it even is one.
I reviewed the code yesterday and it actually has some more problems (like not allocating bitmaps properly). I would agree that the best way to proceed is to urgently remove edf. The way to do that is to:

- Remove all references to the storage field within DenseInstances
- Change everything from the "Allocate those pages" comment onward within Extend to use make, and add storage to the AttributeGroups individually

I've now got a test case for this (using MNIST, 784-ish Attributes), so I'll do some surgery and see if I can get a patch out tonight. I'm also working on a patch to vectorise our KNN implementation in certain circumstances; time series work is still ongoing, and integration of nnet is currently in planning.
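The in-memory replacement proposed in the bullet points above might look something like this: a hypothetical, heavily simplified model of `DenseInstances.Extend`, with each `AttributeGroup` owning a plain slice instead of edf-backed pages:

```go
package main

import "fmt"

// attributeGroup owns its own in-memory, row-major storage.
type attributeGroup struct {
	cols int
	data []float64
}

// denseInstances is a toy stand-in for golearn's DenseInstances.
type denseInstances struct {
	rows   int
	groups []*attributeGroup
}

// extend grows every group's storage to hold `rows` additional rows,
// using make/copy instead of allocating edf pages.
func (d *denseInstances) extend(rows int) {
	for _, g := range d.groups {
		grown := make([]float64, (d.rows+rows)*g.cols)
		copy(grown, g.data)
		g.data = grown
	}
	d.rows += rows
}

func main() {
	d := &denseInstances{groups: []*attributeGroup{{cols: 784}}}
	d.extend(100)
	fmt.Println(d.rows, len(d.groups[0].data)) // 100 78400
}
```

Because each group's slice is contiguous, this keeps the cache-friendly slab layout while dropping the mmap machinery and its page-boundary constraints.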
OK, #82 expunges it, but Travis is complaining against tip: some kind of liblinear segfault against tip (so it may be unrelated to the change).