Comments (5)
I have also posted this question in the MOA development Google Group here.
from moa.
I have resolved this issue in my fork of MOA as well as a couple of other issues that I discovered along the way.
In StreamKM.java:
-getClusteringResult() now obtains the current coreset and performs 5 k-means++ runs in order to return a non-empty clustering; and
-lloydPlusPlus() now calculates cluster radii.
In Point.java:
-toCluster() now takes the radius of the cluster as an argument, instead of making all radii equal to one.
Created CoresetCostTriple.java:
-This is a simple data structure which allows StreamKM's lloydPlusPlus() to return the coreset centres, radii of the clusters and cost of the clustering. Previously the associated centres were never returned or stored globally.
I have also included a new FlagOption parameter which permits only the final clustering to be evaluated, as in the original code.
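For reference, the k-means++ seeding step that such runs are built on can be sketched as follows. This is a self-contained illustration with my own class and method names, not MOA's actual implementation; the fix described above would run this kind of seeding (plus Lloyd iterations) 5 times and keep the lowest-cost clustering.

```java
import java.util.Random;

public class KMeansPlusPlusSeed {

    /** Picks k initial centres from points using k-means++ seeding:
     *  each new centre is chosen with probability proportional to its
     *  squared distance to the nearest centre chosen so far. */
    static double[][] seed(double[][] points, int k, Random rnd) {
        double[][] centres = new double[k][];
        centres[0] = points[rnd.nextInt(points.length)];
        double[] d2 = new double[points.length];
        for (int c = 1; c < k; c++) {
            double total = 0;
            for (int i = 0; i < points.length; i++) {
                double best = Double.MAX_VALUE;
                for (int j = 0; j < c; j++) {
                    best = Math.min(best, sqDist(points[i], centres[j]));
                }
                d2[i] = best;
                total += best;
            }
            // sample an index proportionally to d2[i]
            double r = rnd.nextDouble() * total, acc = 0;
            int pick = points.length - 1;
            for (int i = 0; i < points.length; i++) {
                acc += d2[i];
                if (acc >= r) { pick = i; break; }
            }
            centres[c] = points[pick];
        }
        return centres;
    }

    static double sqDist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) {
            double t = a[i] - b[i];
            s += t * t;
        }
        return s;
    }

    public static void main(String[] args) {
        double[][] pts = {{0, 0}, {0, 1}, {10, 10}, {10, 11}};
        double[][] centres = seed(pts, 2, new Random(7));
        for (double[] c : centres) System.out.println(c[0] + ", " + c[1]);
    }
}
```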
Pull request #100 was merged into MOA's master branch; issue is resolved.
@abifet
@richard-moulton
I have come across the same bug and got the following exception while training a StreamKM model:
java.lang.ArrayIndexOutOfBoundsException: 6
at moa.clusterers.streamkm.BucketManager.insertPoint(BucketManager.java:93)
at moa.clusterers.streamkm.StreamKM.trainOnInstanceImpl(StreamKM.java:77)
at com.evam.moakmeansplusplus.service.StreamKMService.train(StreamKMService.java:45)
at com.evam.moakmeansplusplus.communication.TrainQueueProcessor.runOperation(TrainQueueProcessor.java:50)
at com.evam.moakmeansplusplus.utils.EvamIntelligenceThread.run(EvamIntelligenceThread.java:31)
I traced the source code and found a bug in the following method:
/**
 * inserts a single point into the bucketmanager
 **/
void insertPoint(Point p) {
    // check if there is enough space in the first bucket
    int cursize = this.buckets[0].cursize;
    if (cursize >= this.maxBucketsize) {
        // printf("Bucket 0 full \n");
        // start spillover process
        int curbucket = 0;
        int nextbucket = 1;
        // check if the next bucket is empty
        if (this.buckets[nextbucket].cursize == 0) {
            // copy the bucket
            int i;
            for (i = 0; i < this.maxBucketsize; i++) {
                this.buckets[nextbucket].points[i] = this.buckets[curbucket].points[i].clone();
                // copyPointWithoutInit: we should not copy coordinates?
            }
            // bucket is now full
            this.buckets[nextbucket].cursize = this.maxBucketsize;
            // first bucket is now empty
            this.buckets[curbucket].cursize = 0;
            cursize = 0;
        } else {
            // printf("Bucket %d full \n", nextbucket);
            // copy bucket to spillover and continue
            int i;
            for (i = 0; i < this.maxBucketsize; i++) {
                this.buckets[nextbucket].spillover[i] = this.buckets[curbucket].points[i].clone();
                // copyPointWithoutInit: we should not copy coordinates?
            }
            this.buckets[0].cursize = 0;
            cursize = 0;
            curbucket++;
            nextbucket++;
            /*
             * as long as the next bucket is full, output the coreset to the
             * spillover of the next bucket
             */
            while (this.buckets[nextbucket].cursize == this.maxBucketsize) {
                // printf("Bucket %d full \n", nextbucket);
                this.treeCoreset.unionTreeCoreset(this.maxBucketsize, this.maxBucketsize,
                        this.maxBucketsize, p.dimension,
                        this.buckets[curbucket].points, this.buckets[curbucket].spillover,
                        this.buckets[nextbucket].spillover, this.clustererRandom);
                // bucket now empty
                this.buckets[curbucket].cursize = 0;
                curbucket++;
                nextbucket++;
            }
            this.treeCoreset.unionTreeCoreset(this.maxBucketsize, this.maxBucketsize,
                    this.maxBucketsize, p.dimension,
                    this.buckets[curbucket].points, this.buckets[curbucket].spillover,
                    this.buckets[nextbucket].points, this.clustererRandom);
            this.buckets[curbucket].cursize = 0;
            this.buckets[nextbucket].cursize = this.maxBucketsize;
        }
    }
    // insert point into the first bucket
    this.buckets[0].points[cursize] = p.clone();
    // copyPointWithoutInit: we should not copy coordinates?
    this.buckets[0].cursize++;
}
The exception is thrown at this line:
while(this.buckets[nextbucket].cursize == this.maxBucketsize){
It means that all of the BucketManager's buckets are full, and the BucketManager is trying to move points from the last bucket into a non-existent next bucket, as Richard mentioned above.
I have debugged the source code line by line and got the following result.
To make things clearer, assume my initial parameters for the StreamKM model are:
n (lengthOption) = 10000
maxsize (sizeCoresetOption) = 1000
BucketManager determines its maximum number of buckets with the following formula:
this.numberOfBuckets = (int) Math.ceil(Math.log((double)n/(double)maxsize) / Math.log(2) )+2;
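The arithmetic can be checked with a quick sketch (class and method names here are mine):

```java
public class BucketCountCheck {

    // Same formula as BucketManager's constructor.
    static int numberOfBuckets(int n, int maxsize) {
        return (int) Math.ceil(Math.log((double) n / (double) maxsize) / Math.log(2)) + 2;
    }

    public static void main(String[] args) {
        // ceil(log2(10000 / 1000)) + 2 = ceil(3.32...) + 2 = 4 + 2 = 6
        System.out.println(numberOfBuckets(10000, 1000)); // prints 6
    }
}
```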
With these n and maxsize values, BucketManager assigns 6 to the numberOfBuckets variable, which means it has 6 buckets and each bucket can store 2000 data-points: 1000 (maxsize) in the points array and 1000 (maxsize) in the spillover array. For a better understanding, take a look at the Bucket class:
protected class Bucket {
    int cursize;
    Point[] points;
    Point[] spillover;

    public Bucket(int d, int maxsize) { // d is the number of features, i.e. the dimension
        this.cursize = 0;
        this.points = new Point[maxsize];
        this.spillover = new Point[maxsize];
        for (int i = 0; i < maxsize; i++) {
            this.points[i] = new Point(d);
            this.spillover[i] = new Point(d);
        }
    }
};
You can follow the code flow in the picture; I have also put some notes on the drawings.
After the initialization, I have a BucketManager holding 6 buckets.
Each bucket has 2 arrays (points[1000] and spillover[1000]), giving it 1000+1000 capacity for data-points.
On each iteration, the code inserts the newly arriving data-point into the buckets[0].points array.
If the buckets[0].points array is full, it moves its contents to the buckets[1].points array.
If the buckets[1].points array is also full, it moves them to the buckets[1].spillover array instead.
When both the buckets[1].points and buckets[1].spillover arrays are full, the code merges[1] these two arrays and moves the result into buckets[2].points.
The same process continues up the chain.
At the last bucket (buckets[5]), when both buckets[5].points and buckets[5].spillover are full, the program merges these two arrays and tries to move the result into buckets[6].points.
However, there are only 6 buckets, buckets[5] is the last one, and this throws the ArrayIndexOutOfBoundsException.
Once the exception occurs, the code is left with only the buckets[5].points array and loses the buckets[5].spillover data-points.
It keeps running, but loses half of the data-points at the last bucket, and the last bucket's points array is never updated again; it still holds the first 1000 data-points.
That means StreamKM++ partially stops updating itself. [This part is complicated; take a look at the getCoresetFromManager method for a better understanding.]
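The overflow described above can be reproduced on a simplified bookkeeping model of the buckets, where each merge is assumed to yield exactly maxsize summary points (as unionTreeCoreset does). Class and method names here are mine, for illustration only:

```java
public class BucketCascadeDemo {

    /**
     * Simulates insertPoint()'s carry cascade using only the cursize
     * bookkeeping (no real points), and returns the number of the
     * insertion at which the index overflow occurs.
     */
    static long insertionsUntilOverflow(int n, int maxsize) {
        // Same formula as BucketManager's constructor: 6 buckets for the example values.
        int numberOfBuckets = (int) Math.ceil(
                Math.log((double) n / (double) maxsize) / Math.log(2)) + 2;
        int[] cursize = new int[numberOfBuckets];
        long inserted = 0;
        try {
            while (true) {
                inserted++;
                if (cursize[0] >= maxsize) {
                    // bucket 0 is full: carry upward, exactly as insertPoint() does
                    int curbucket = 0, nextbucket = 1;
                    if (cursize[nextbucket] == 0) {
                        cursize[nextbucket] = maxsize;  // move bucket 0 into bucket 1
                        cursize[curbucket] = 0;
                    } else {
                        cursize[curbucket] = 0;         // bucket 0 goes to spillover
                        curbucket++;
                        nextbucket++;
                        // walks past the last bucket once buckets 1..5 are all full
                        while (cursize[nextbucket] == maxsize) {
                            cursize[curbucket] = 0;
                            curbucket++;
                            nextbucket++;
                        }
                        cursize[curbucket] = 0;
                        cursize[nextbucket] = maxsize;  // merge result lands here
                    }
                }
                cursize[0]++;                           // finally insert the point
            }
        } catch (ArrayIndexOutOfBoundsException e) {
            return inserted;
        }
    }

    public static void main(String[] args) {
        System.out.println(insertionsUntilOverflow(10000, 1000));
    }
}
```

On this model, buckets 1..5 behave like a 5-bit binary counter of "bucket 0 flushes", so the cascade runs out of room on the 32nd flush, i.e. after roughly 32 × maxsize insertions. That is well beyond the configured n = 10000, which suggests the bucket count was sized for streams of length at most n.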
[Image: hand-drawn diagram of the code flow through the buckets, with notes]
My solution is:
When the last bucket is full, I am going to copy its data-points into the first bucket's array, i.e. buckets[0].points.
This moves the last bucket's data-points back to the first bucket and preserves them, so both the exception and the data-loss problem are gone.
I am implementing the code now and will let you know about the progress.
Any suggestion or correction would be wonderful.
[1] Some merging algorithm (unionTreeCoreset). It takes 1000 data-points from the points array and 1000 data-points from the spillover array and merges them. As output, it produces another array of 1000 data-points that is a summary of the 2000 input data-points.
from moa.
This issue should be reopened.