Good afternoon, I am trying to understand the source code for Stream

Pull request <a class="issue-link js-issue-link" data-error-text="Failed to load title

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

StreamKM "Width" Parameter about moa HOT 5 CLOSED

waikato commented on June 18, 2024

StreamKM "Width" Parameter

from moa.

Comments (5)

richard-moulton commented on June 18, 2024

I have also posted this question in the MOA development Google Group here.

from moa.

richard-moulton commented on June 18, 2024

I have resolved this issue in my fork of MOA as well as a couple of other issues that I discovered along the way.

In StreamKM.java:
-getClusteringResult() now obtains the current coreset and performs 5 k-means++ runs in order to return a non-empty clustering; and
-lloydPlusPlus() now calculates cluster radii.

In Point.java:
-toCluster() now takes the radius of the cluster to be passed as argument, instead of making all radii equal to one.

Created CoresetCostTriple.java:
-This is a simple data structure which allows StreamKM's lloydPlusPlus() to return the coreset centres, radii of the clusters and cost of the clustering. Previously the associated centres were never returned or stored globally.

I have also included a new FlagOption parameter which permits only the final clustering to be evaluated, as in the original code.

from moa.

richard-moulton commented on June 18, 2024

Pull request #100 was merged into MOA's master branch; issue is resolved.

from moa.

mustafaevam commented on June 18, 2024

@abifet
@richard-moulton
I have came across the same bug, and got the following exception, while training streamKM model.

java.lang.ArrayIndexOutOfBoundsException: 6
       at moa.clusterers.streamkm.BucketManager.insertPoint(BucketManager.java:93)
       at moa.clusterers.streamkm.StreamKM.trainOnInstanceImpl(StreamKM.java:77)
       at com.evam.moakmeansplusplus.service.StreamKMService.train(StreamKMService.java:45)
       at com.evam.moakmeansplusplus.communication.TrainQueueProcessor.runOperation(TrainQueueProcessor.java:50)
       at com.evam.moakmeansplusplus.utils.EvamIntelligenceThread.run(EvamIntelligenceThread.java:31)

I have traced the source code, and found a bug in the following method:

/**
inserts a single point into the bucketmanager
**/
void insertPoint(Point p){
   
   //check if there is enough space in the first bucket
   int cursize = this.buckets[0].cursize; 
   if(cursize >= this.maxBucketsize) {
      //printf("Bucket 0 full \n");
      //start spillover process
      int curbucket  = 0;
      int nextbucket = 1;

      //check if the next bucket is empty
      if(this.buckets[nextbucket].cursize == 0){
         //copy the bucket  
         int i;
         for(i=0; i<this.maxBucketsize; i++){
            this.buckets[nextbucket].points[i] = this.buckets[curbucket].points[i].clone();
            //copyPointWithoutInit: we should not copy coordinates? 
         }
         //bucket is now full
         this.buckets[nextbucket].cursize = this.maxBucketsize;
         //first bucket is now empty
         this.buckets[curbucket].cursize = 0;
         cursize = 0;
      } else {
         //printf("Bucket %d full \n",nextbucket);
         //copy bucket to spillover and continue
         int i;
         for(i=0;i<this.maxBucketsize;i++){
            this.buckets[nextbucket].spillover[i] = this.buckets[curbucket].points[i].clone();
            //copyPointWithoutInit: we should not copy coordinates? 
         }
         this.buckets[0].cursize=0;
         cursize = 0;
         curbucket++;
         nextbucket++;
         /*
         as long as the next bucket is full output the coreset to the spillover of the next bucket
         */
         while(this.buckets[nextbucket].cursize == this.maxBucketsize){
            //printf("Bucket %d full \n",nextbucket);
            this.treeCoreset.unionTreeCoreset(this.maxBucketsize,this.maxBucketsize,
               this.maxBucketsize,p.dimension, 
               this.buckets[curbucket].points,this.buckets[curbucket].spillover,
               this.buckets[nextbucket].spillover, this.clustererRandom);
            //bucket now empty
            this.buckets[curbucket].cursize = 0;
            curbucket++;
            nextbucket++;
         }
         this.treeCoreset.unionTreeCoreset(this.maxBucketsize,this.maxBucketsize,
               this.maxBucketsize,p.dimension, 
               this.buckets[curbucket].points,this.buckets[curbucket].spillover,
               this.buckets[nextbucket].points, this.clustererRandom);
         this.buckets[curbucket].cursize = 0;
         this.buckets[nextbucket].cursize = this.maxBucketsize;
      }
   }
   //insert point into the first bucket
   this.buckets[0].points[cursize] = p.clone();
   //copyPointWithoutInit: we should not copy coordinates? 
   this.buckets[0].cursize++;
}

The exception is thrown at line :
while(this.buckets[nextbucket].cursize == this.maxBucketsize){

It means that BucketManager's buckets are full, and BucketManager is trying to move points from last bucket to non-existing next last bucket as Richard mentioned above.

I have debugged the source code line by line and got the following result:
To be more clear let's assume that my initial parameter for streamkm model is following:
n(lengthOption) = 10000
maxsize(sizeCoresetOption) = 1000

By using following formula BucketManager determines its number of bucket limit.
this.numberOfBuckets = (int) Math.ceil(Math.log((double)n/(double)maxsize) / Math.log(2) )+2;
With my n and maxsize values, bucket manager assigns 6 to numberOfBuckets variable, which means BucketManager has 6 buckets and each bucket can store 2000 data-points. 1000(maxsize) in points array and 1000(maxsize) in spillover array. In order to have better understanding, take a look Bucket class:

protected class Bucket {
   int cursize;
   Point[] points;
   Point[] spillover;
   
   public Bucket(int d, int maxsize){  // d is number of feature, in other word dimension
      this.cursize = 0;
      this.points = new Point[maxsize];
      this.spillover = new Point[maxsize];
      for(int i=0; i<maxsize; i++){
         this.points[i] = new Point(d);
         this.spillover[i] = new Point(d);
      }
   }
   
};

You can follow the code flow on the picture, I also put some notes on drawings.
After the initialization part, I have a BucketManager which has 6 buckets in it.
Each bucket has 2 arrays (points[1000] and spillover[1000]) and 1000+1000 capacity to store data-points.

Each iteration, the code insert new coming data-point to bucket[0].point array.
If the bucket[0].point array is full, then it moves bucket[0].point array to bucket[1].point array.
If bucket[1].point array is also full, then move it to bucket[1].spillover array.
When both bucket[1].points and bucket[1].spillover arrays are full, the code merges[1] these two arrays and moves it into bucket[2].points.
Same iteration goes on.

At the last bucket(buckets[5]), when both bucket[5].points and bucket[5].spillover are full, the program merge these to arrays and tries to move into bucket[6].points.
However, I have only 6 buckets and bucket[5] is the last one and it throws ArrayOutOfBoundException.
After getting exception, the code only has bucket[5].points array, and loses the bucket[5].spillover data-points.
It runs fine, but loses half of the data-points at last bucket, and last bucket's points array is never updated, holds the first 1000 data-points.
That means streamkm++ is not updating itself, partially. [This part is complicated, take a look at getCoresetFromManager method to have better understanding.]
WhatsApp Image 2019-05-31 at 12.15.46.jpeg

My solution is:
When the last bucket is full, I am going to copy the data-points into first bucket array which is bucket[0].points array.
It will move the last bucket data-points to first bucket and saves them.
So, exception and data losing problem is gone.

Now, I am implementing the code. I will let you know the process.
Any suggestion or correction would be wonderful.

[1] some merging algorithm. It gets 1000 data-points from points array and 1000 data-points from spillover array, and merges them. As an output, it gives another 1000 data-points array which has some kind of summary of 2000 data-points

from moa.

mustafaevam commented on June 18, 2024

This issue should be opened again.

from moa.

StreamKM "Width" Parameter about moa HOT 5 CLOSED

Comments (5)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent