Git Product home page Git Product logo

Comments (5)

richard-moulton avatar richard-moulton commented on June 18, 2024

I have also posted this question in the MOA development Google Group here.

from moa.

richard-moulton avatar richard-moulton commented on June 18, 2024

I have resolved this issue in my fork of MOA as well as a couple of other issues that I discovered along the way.

In StreamKM.java:
-getClusteringResult() now obtains the current coreset and performs 5 k-means++ runs in order to return a non-empty clustering; and
-lloydPlusPlus() now calculates cluster radii.

In Point.java:
-toCluster() now takes the radius of the cluster to be passed as argument, instead of making all radii equal to one.

Created CoresetCostTriple.java:
-This is a simple data structure which allows StreamKM's lloydPlusPlus() to return the coreset centres, radii of the clusters and cost of the clustering. Previously the associated centres were never returned or stored globally.

I have also included a new FlagOption parameter which permits only the final clustering to be evaluated, as in the original code.

from moa.

richard-moulton avatar richard-moulton commented on June 18, 2024

Pull request #100 was merged into MOA's master branch; issue is resolved.

from moa.

mustafaevam avatar mustafaevam commented on June 18, 2024

@abifet
@richard-moulton
I have came across the same bug, and got the following exception, while training streamKM model.

java.lang.ArrayIndexOutOfBoundsException: 6
       at moa.clusterers.streamkm.BucketManager.insertPoint(BucketManager.java:93)
       at moa.clusterers.streamkm.StreamKM.trainOnInstanceImpl(StreamKM.java:77)
       at com.evam.moakmeansplusplus.service.StreamKMService.train(StreamKMService.java:45)
       at com.evam.moakmeansplusplus.communication.TrainQueueProcessor.runOperation(TrainQueueProcessor.java:50)
       at com.evam.moakmeansplusplus.utils.EvamIntelligenceThread.run(EvamIntelligenceThread.java:31)

I have traced the source code, and found a bug in the following method:

/**
inserts a single point into the bucketmanager
**/
void insertPoint(Point p){
   
   //check if there is enough space in the first bucket
   int cursize = this.buckets[0].cursize; 
   if(cursize >= this.maxBucketsize) {
      //printf("Bucket 0 full \n");
      //start spillover process
      int curbucket  = 0;
      int nextbucket = 1;

      //check if the next bucket is empty
      if(this.buckets[nextbucket].cursize == 0){
         //copy the bucket  
         int i;
         for(i=0; i<this.maxBucketsize; i++){
            this.buckets[nextbucket].points[i] = this.buckets[curbucket].points[i].clone();
            //copyPointWithoutInit: we should not copy coordinates? 
         }
         //bucket is now full
         this.buckets[nextbucket].cursize = this.maxBucketsize;
         //first bucket is now empty
         this.buckets[curbucket].cursize = 0;
         cursize = 0;
      } else {
         //printf("Bucket %d full \n",nextbucket);
         //copy bucket to spillover and continue
         int i;
         for(i=0;i<this.maxBucketsize;i++){
            this.buckets[nextbucket].spillover[i] = this.buckets[curbucket].points[i].clone();
            //copyPointWithoutInit: we should not copy coordinates? 
         }
         this.buckets[0].cursize=0;
         cursize = 0;
         curbucket++;
         nextbucket++;
         /*
         as long as the next bucket is full output the coreset to the spillover of the next bucket
         */
         while(this.buckets[nextbucket].cursize == this.maxBucketsize){
            //printf("Bucket %d full \n",nextbucket);
            this.treeCoreset.unionTreeCoreset(this.maxBucketsize,this.maxBucketsize,
               this.maxBucketsize,p.dimension, 
               this.buckets[curbucket].points,this.buckets[curbucket].spillover,
               this.buckets[nextbucket].spillover, this.clustererRandom);
            //bucket now empty
            this.buckets[curbucket].cursize = 0;
            curbucket++;
            nextbucket++;
         }
         this.treeCoreset.unionTreeCoreset(this.maxBucketsize,this.maxBucketsize,
               this.maxBucketsize,p.dimension, 
               this.buckets[curbucket].points,this.buckets[curbucket].spillover,
               this.buckets[nextbucket].points, this.clustererRandom);
         this.buckets[curbucket].cursize = 0;
         this.buckets[nextbucket].cursize = this.maxBucketsize;
      }
   }
   //insert point into the first bucket
   this.buckets[0].points[cursize] = p.clone();
   //copyPointWithoutInit: we should not copy coordinates? 
   this.buckets[0].cursize++;
}

The exception is thrown at line :
while(this.buckets[nextbucket].cursize == this.maxBucketsize){

It means that BucketManager's buckets are full, and BucketManager is trying to move points from last bucket to non-existing next last bucket as Richard mentioned above.

I have debugged the source code line by line and got the following result:
To be more clear let's assume that my initial parameter for streamkm model is following:
n(lengthOption) = 10000
maxsize(sizeCoresetOption) = 1000

By using following formula BucketManager determines its number of bucket limit.
this.numberOfBuckets = (int) Math.ceil(Math.log((double)n/(double)maxsize) / Math.log(2) )+2;
With my n and maxsize values, bucket manager assigns 6 to numberOfBuckets variable, which means BucketManager has 6 buckets and each bucket can store 2000 data-points. 1000(maxsize) in points array and 1000(maxsize) in spillover array. In order to have better understanding, take a look Bucket class:

protected class Bucket {
   int cursize;
   Point[] points;
   Point[] spillover;
   
   public Bucket(int d, int maxsize){  // d is number of feature, in other word dimension
      this.cursize = 0;
      this.points = new Point[maxsize];
      this.spillover = new Point[maxsize];
      for(int i=0; i<maxsize; i++){
         this.points[i] = new Point(d);
         this.spillover[i] = new Point(d);
      }
   }
   
};

You can follow the code flow on the picture, I also put some notes on drawings.
After the initialization part, I have a BucketManager which has 6 buckets in it.
Each bucket has 2 arrays (points[1000] and spillover[1000]) and 1000+1000 capacity to store data-points.

Each iteration, the code insert new coming data-point to bucket[0].point array.
If the bucket[0].point array is full, then it moves bucket[0].point array to bucket[1].point array.
If bucket[1].point array is also full, then move it to bucket[1].spillover array.
When both bucket[1].points and bucket[1].spillover arrays are full, the code merges[1] these two arrays and moves it into bucket[2].points.
Same iteration goes on.

At the last bucket(buckets[5]), when both bucket[5].points and bucket[5].spillover are full, the program merge these to arrays and tries to move into bucket[6].points.
However, I have only 6 buckets and bucket[5] is the last one and it throws ArrayOutOfBoundException.
After getting exception, the code only has bucket[5].points array, and loses the bucket[5].spillover data-points.
It runs fine, but loses half of the data-points at last bucket, and last bucket's points array is never updated, holds the first 1000 data-points.
That means streamkm++ is not updating itself, partially. [This part is complicated, take a look at getCoresetFromManager method to have better understanding.]
WhatsApp Image 2019-05-31 at 12.15.46.jpeg

My solution is:
When the last bucket is full, I am going to copy the data-points into first bucket array which is bucket[0].points array.
It will move the last bucket data-points to first bucket and saves them.
So, exception and data losing problem is gone.

Now, I am implementing the code. I will let you know the process.
Any suggestion or correction would be wonderful.

[1] some merging algorithm. It gets 1000 data-points from points array and 1000 data-points from spillover array, and merges them. As an output, it gives another 1000 data-points array which has some kind of summary of 2000 data-points

from moa.

mustafaevam avatar mustafaevam commented on June 18, 2024

This issue should be opened again.

from moa.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.