Comments (14)
This is a known problem which has been affecting some Travis builds. Just to confirm:
- If you're retaining a reference to each new DenseInstances created inside the loop, then that might cause the problem.
- If you're waiting for garbage collection to deallocate the DenseInstances, this might also cause the problem, as it currently needs three cycles (one to deallocate DenseInstances, another to call the finaliser on EdfMap, and I think another to actually unmap the memory).
- We might have to introduce a Deallocate method on DenseInstances, or a finalizer, to actually ensure the memory gets unmapped.
- Additionally, because EdfMap manages lots of pages outside Go's garbage collector (the EdfMap structure is actually pretty small), this may mean that garbage collection doesn't run often enough to release all of the memory each time.
Additionally, try cherry-picking commit 8e20799 from #69. Also try reducing EDF_SIZE. Hope those things help.
from golearn.
Hi @Sentimentron, thanks for the response. I believe your first bullet point does not apply, but your second does. Here's the code:
package ensemble

import (
	"fmt"
	base "github.com/sjwhitworth/golearn/base"
	eval "github.com/sjwhitworth/golearn/evaluation"
	filters "github.com/sjwhitworth/golearn/filters"
	"testing"
)

func TestRandomForest1(testEnv *testing.T) {
	for i := 0; i < 10; i++ {
		inst, err := base.ParseCSVToInstances("../examples/datasets/iris_headers.csv", true)
		if err != nil {
			panic(err)
		}
		filt := filters.NewChiMergeFilter(inst, 0.90)
		for _, a := range base.NonClassFloatAttributes(inst) {
			filt.AddAttribute(a)
		}
		filt.Train()
		instf := base.NewLazilyFilteredInstances(inst, filt)
		trainData, testData := base.InstancesTrainTestSplit(instf, 0.60)
		rf := NewRandomForest(10, 3)
		rf.Fit(trainData)
		predictions := rf.Predict(testData)
		fmt.Println(predictions)
		confusionMat := eval.GetConfusionMatrix(testData, predictions)
		fmt.Println(confusionMat)
		fmt.Println(eval.GetSummary(confusionMat))
	}
}
I don't know the implementation details well enough to comment on the third and fourth bullet, but I do think it's the library's responsibility to ensure memory is freed quickly and reliably.
Here's a target use-case: I want to be able to run the RandomForest classifier on the iris dataset so that it achieves a reasonable level of accuracy, and I want to see that it can achieve that accuracy in a reasonable amount of time. To test that it performs reasonably accurately, reasonably fast, I want to write a benchmark test that asserts the overall accuracy is greater than some threshold, and the average execution time is less than some threshold, when running the code over several iterations.
On the same dataset, the knn-classifier consistently achieves about 95% accuracy. For random forest, in order to get above even 70% accuracy, I had to bump the forest size to something like 50 (it's currently at 10 in the test). However, at that forest size, I can only iterate about 3 times before I get the memory panic. I'd say a minimal target would be that the Random Forest prediction can be run on the iris dataset 10 times, with an accuracy consistently over 70%, and an average runtime of at most 0.5s, without any memory panics.
I'll experiment with cherry-picking that commit and reducing EDF_SIZE tomorrow to see if the above target can be reached without changing or adding anything to the edf implementation.
Sounds like a challenge. I'll experiment with adding a finalizer to
DenseInstances and tweaking the allocation tonight.
OK: so I've spent lots of time looking at stack traces and I know what the problem is. By default, each call to NewDenseInstances (e.g. every time GeneratePredictionVector is called in a tree) maps in 128 MB (EDF_SIZE), but there are only a few hundred bytes of tracking structures allocated to keep track of that memory. Because there's no memory pressure on Go's working set, it doesn't run garbage collection often enough to call the finalizers which unmap that memory, so the virtual memory allocated just balloons until everything falls over. The solution, I suspect, is to change the implementation of EdfAnonMap so that it allocates byte slices using make from Go's working set.
OK, so #75 changes the backing of EdfAnonMap to use make; my quick survey of top indicates that the working set and VMem remain stable. See if this fixes the problem, and if not, let me know.
@Sentimentron I bet that took a while to find! Out of interest, is the main reason for using mmap to analyze data greater than the available memory?
@Amit-PivotalLabs if you want to solve this in the short term, given the above comments, setting the overcommit=1 on your AWS instance should do the trick. By default it is disabled.
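For reference, the overcommit setting mentioned above is standard Linux kernel tuning, not anything golearn-specific; on most distributions it can be enabled like this:

```shell
# Allow the kernel to overcommit virtual memory, so large, sparsely-used
# mmap regions can be created even when they exceed physical RAM + swap.
sudo sysctl -w vm.overcommit_memory=1

# Persist the setting across reboots:
echo 'vm.overcommit_memory = 1' | sudo tee -a /etc/sysctl.conf
```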
@mish15 Basically yes, but I still need to do some more work to support that.
I overcommit on my Linode, but with a gig of RAM it only gets to about 400 GB or so of overcommit before it falls over. That's not quite enough, so I think this is the way forward for the time being.
Reducing EDF_SIZE to 8 MB "worked" in that I was able to run the code in this comment with the size of the random forest increased from 10 to 50. However, it's still just a band-aid, because increasing the forest size further would eventually cause the memory panics again.
@Sentimentron I'll give your PR a shot and see if it's a more stable solution.
@Sentimentron makes perfect sense. Might be worth noting the overcommit setting in the install docs, or falling back to in-memory allocation if overcommit is not set.
400 GB virtual from 1 GB is pretty amazing really, I'm surprised you got that far! :) Out of interest, why the need for more than 400 GB?
@Sentimentron Thanks! It appears #75 is a stable solution, I can increase the number of iterations and the size of the forest quite a bit and I don't see any memory panics.
@Sentimentron, @mish15 To be clear about my purpose, I'm not personally trying to use this library at the moment to do any machine learning. I'm actually trying to write a benchmarking framework/test suite for the golearn library itself, so I'm not looking for workarounds.
Great. @sjwhitworth ready to merge #75 if you have no objections.
Looks good to me.
@Amit-PivotalLabs no problem and nice work!
#75 has been merged, closing issue.