edwardraff / jsat

Java Statistical Analysis Tool, a Java library for Machine Learning

License: GNU General Public License v3.0

Java 100.00%

machine-learning java machine-learning-library machine-learning-algorithms svm tsne jsat

jsat's Issues

Eclipse/Netbeans Refactoring (including upgrading to java 8)

I ran Eclipse's and NetBeans' upgrade assistant / refactoring suggestions on the code base, and the suggested changes are quite extensive. In many places they are too pushy and/or break the code.

Some are great suggestions! Lots of safer braces, using Strings in switch statements, removing unnecessary else blocks after a return, Java 7 diamonds in the constructors, ditching unnecessary casts, etc etc. IMHO totally worth it.

Some suggestions are wrong, like how to handle generics with DataSet. In general, all of the raw type suggestions take a moment to look at. You may have a better solution.

Some of the Java 8 suggestions are iffy, like converting the for loops to streams, and I don't know if that would be a bad idea, or a really interesting idea because then they could be run in parallel. You may be able to slim down the existing parallel code using parallel streams and lambdas.
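To make the trade-off concrete, here is a minimal sketch in plain JDK code (not taken from JSAT; the class and method names are made up for this example) of a sequential for loop next to an equivalent parallel stream. The parallel form is only likely to pay off when the per-element work is independent and large enough to outweigh the scheduling overhead.

```java
import java.util.stream.IntStream;

// Hypothetical example class, not part of JSAT.
final class LoopVsStream {
    // Classic indexed loop: sum of squares of the array elements.
    static double sumSquaresLoop(double[] x) {
        double sum = 0.0;
        for (int i = 0; i < x.length; i++) {
            sum += x[i] * x[i];
        }
        return sum;
    }

    // Same reduction as a parallel stream; each element is processed independently.
    static double sumSquaresParallel(double[] x) {
        return IntStream.range(0, x.length)
                .parallel()
                .mapToDouble(i -> x[i] * x[i])
                .sum();
    }

    public static void main(String[] args) {
        double[] x = {1.0, 2.0, 3.0};
        System.out.println(sumSquaresLoop(x));
        System.out.println(sumSquaresParallel(x));
    }
}
```

Note that a parallel floating-point sum may combine partial results in a different order than the loop, so results can differ in the last bits for general inputs.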

Do not limit the elements in VPTree to vectors

One of the reasons to work in metric spaces is to abstract away from what the elements you are measuring distances between actually are.
They could be images, text, audio samples, Excel spreadsheets... whatever, as long as they come with a distance that defines a metric space. Why are you limiting this to numeric vectors only?
All that you would need is an interface,

public interface MetricDistance<T> {
    double distance(T a, T b);
}

and let the user provide an implementation of that.
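As an illustration of what a user-provided implementation could look like, here is a sketch that defines a generic MetricDistance interface and implements it for Strings using the Levenshtein edit distance, which is a true metric. Both the interface and the EditDistance class are hypothetical and not part of JSAT.

```java
// Hypothetical interface, not part of JSAT: any type that comes with a metric.
interface MetricDistance<T> {
    double distance(T a, T b);
}

// Levenshtein edit distance between two strings (insert, delete, substitute).
final class EditDistance implements MetricDistance<String> {
    @Override
    public double distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1),
                                  prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of a full table
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }
}
```

A tree that only ever calls distance(a, b) on its elements could then index strings, images, or anything else for which such an implementation exists.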

fr: make long running loops cancelable

When I call evaluateCrossValidation, I may want to give up after a certain amount of time. Instead of building in some "isRunning" boolean, would it be easier to check if (Thread.interrupted()) in key areas of the loops? See https://docs.oracle.com/javase/tutorial/essential/concurrency/interrupt.html
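A minimal sketch of that pattern, assuming the long-running work can be broken into iterations (the class name and the placeholder work are made up; this is not JSAT code). The loop polls the interrupt flag once per iteration, stops early if it is set, and restores the flag so that callers further up can also see the cancellation.

```java
// Hypothetical example of a cancelable loop, not part of JSAT.
final class CancelableLoop {
    // Runs up to maxIters units of work and returns how many were completed.
    static int run(int maxIters) {
        int done = 0;
        for (int i = 0; i < maxIters; i++) {
            if (Thread.interrupted()) {             // reads AND clears the flag
                Thread.currentThread().interrupt(); // restore it for the caller
                break;
            }
            done++; // placeholder for one fold / one training pass
        }
        return done;
    }
}
```

A caller could run this on a worker thread and call interrupt() on that thread (or cancel(true) on its Future) once a time budget is exceeded.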

instanceof Parameterized but autoAddParameters==0

What should be the interpretation of the case when classifier instanceof Parameterized, but autoAddParameters(trainDS)==0?

SAMME, ArcX4, ModestAdaBoost, DecisionStump, AdaBoostM1PL, StochasticMultinomialLogisticRegression, NaiveBayes, DDAG

Should I take it as "for this specific dataset, there were no good parameters for RandomSearch to chew on -- but keep trying, other DataSets may work better!" or as "author intended to make this algorithm Parameterized but likely hasn't gotten to implementing it yet"?

Tree-based regressors give null pointer exception on .regress

The following example gives me a NullPointerException with version 0.0.3 of the library. It seems that the root value in the tree learners is null. Did I do something wrong?

package demo;

import java.util.List;
import java.util.LinkedList;

import jsat.classifiers.DataPoint;
import jsat.linear.DenseVector;
import jsat.linear.Vec;
import jsat.regression.RegressionDataSet;
import jsat.classifiers.trees.RandomForest;
import jsat.classifiers.trees.DecisionTree;
import jsat.regression.MultipleLinearRegression;
import jsat.regression.Regressor;

public class TreeRegressionDemo
{
    public static void main(String[] args)
    {
        double[][]          data    = new double[][]{{1.0, 1.5}, {2.0, 2.5}, {3.0, 3.5}};
        double[]            y       = new double[]{2.0, 4.0, 3.0};
        List<DataPoint>     points  = new LinkedList<DataPoint>();

        for (int irow = 0; irow < data.length; irow++)
        {
            Vec             vec     = new DenseVector(data[irow].length+1);

            for (int icol = 0; icol < data[irow].length; icol++)
            {
                vec.set(icol, data[irow][icol]);
            }
            vec.set(data[irow].length, y[data[irow].length-1]);

            DataPoint       point   = new DataPoint(vec);
            points.add(point);
        }

        RegressionDataSet   dataSet = new RegressionDataSet(points, data[0].length);

        RandomForest        forest  = new RandomForest(10);
        DecisionTree        tree    = new DecisionTree();
        MultipleLinearRegression
                            mlr     = new MultipleLinearRegression();
        forest.autoFeatureSample();
        Regressor           regressor
                                    = forest;

        regressor.train(dataSet);

        DataPoint           testPoint
                                    = dataSet.getDataPoint(0);
        System.out.println(testPoint);
        double              result  = regressor.regress(testPoint);
        System.out.println(result);
    }
}

FR: Regression w/ bartMachine

This person over here had a comparison of various algos and said bartMachine was the winning Regression algorithm by a large margin.

Request that it be added to JSAT.

License

Hello Edward,

Your library is incredible. I am also developing a framework for ML, image processing, and other areas of computer science.

My project grew out of AForge.NET and Accord.NET and uses the same license, LGPL. I would like to know whether you could change the license from GPL to LGPL, at least for some portions.

Cheers !

license

Hi,

I've just stumbled on your project while looking for an alternative to Weka. Frankly, this looks amazing and exactly what we need... until I realised that the reason I was looking for an alternative to Weka in the first place (besides its other shortcomings), namely that it's GPL, means we can't use your library either to put models into production.

So I was wondering if you've given any thought to switching from GPL to LGPL, or even something more permissive, like MIT, BSD or Apache licences?

Kernel Density Estimation

Hi,
I have a 1D array of values and I want to plot its kernel density estimate, experimenting with the bandwidth and other settings. I have been using MATLAB, where there is a function called ksdensity. Now I want to create a kernel density estimate in Java. Could you please help me: how can I compute the kernel density estimate of a 1D array?
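For reference, a Gaussian kernel density estimate of a 1D sample can be written in a few lines of plain Java. The sketch below is not JSAT's implementation and will not necessarily match ksdensity's defaults. The class and method names are made up, and the default bandwidth uses the common normal-reference rule of thumb h = 1.06 * sigma * n^(-1/5), which is an assumption on my part rather than anything taken from MATLAB.

```java
// Hypothetical stand-alone sketch of a 1D Gaussian KDE.
final class Kde1D {
    // Density estimate at point x for the given sample and bandwidth h > 0.
    static double density(double[] data, double x, double h) {
        double sum = 0.0;
        for (double xi : data) {
            double u = (x - xi) / h;
            sum += Math.exp(-0.5 * u * u); // unnormalised Gaussian kernel
        }
        return sum / (data.length * h * Math.sqrt(2.0 * Math.PI));
    }

    // Normal-reference rule of thumb; needs at least two samples.
    static double ruleOfThumbBandwidth(double[] data) {
        int n = data.length;
        double mean = 0.0;
        for (double v : data) {
            mean += v;
        }
        mean /= n;
        double ss = 0.0;
        for (double v : data) {
            ss += (v - mean) * (v - mean);
        }
        double sigma = Math.sqrt(ss / (n - 1)); // sample standard deviation
        return 1.06 * sigma * Math.pow(n, -0.2);
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 4.0, 5.0};
        double h = ruleOfThumbBandwidth(data);
        // Evaluate on a small grid; these (x, density) pairs are what one would plot.
        for (double x = 0.0; x <= 6.0; x += 1.0) {
            System.out.println(x + "\t" + density(data, x, h));
        }
    }
}
```

Playing with the bandwidth then amounts to passing different values of h to density.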

Defaults for Classifiers/Regressors

Hello all,
Would it be possible to have a wiki entry that lists the default parameters that should be used with each of the classifiers/regressors? It is not clear from the source code what those parameter values should be. Thanks!

Tests fail

The following tests fail with windows and jdk 1.8 (I don't think this is windows or java related ;) ):

Failed tests:
ElkanKMeansTest.testCluster_3args_2:106 expected:<3> but was:<4>
RandomBallCoverTest.testSearch_Vec_double:114 jsat.linear.vectorcollection.RandomBallCover$RandomBallCoverFactory failed expected:<14> but was:<13>
RandomBallCoverTest.testSearch_Vec_int:168 jsat.linear.vectorcollection.RandomBallCover$RandomBallCoverFactory failed 10
VPTreeMVTest.testSearch_Vec_int:171 jsat.linear.vectorcollection.VPTreeMV$VPTreeMVFactory failed 10
VPTreeTest.testSearch_Vec_double:113 jsat.linear.vectorcollection.VPTree$VPTreeFactory failed expected:<12> but was:<11>
VPTreeTest.testSearch_Vec_int:167 jsat.linear.vectorcollection.VPTree$VPTreeFactory failed 1
FastMathTest.testPow:152 null

EDIT:
It seems that some of the tests sometimes fail and sometimes pass. I did not have a closer look but maybe there are some random issues in the tests.
This time I got the following result:

Failed tests:
NewGLMNETTest.testSetC:109 null
ElkanKMeansTest.testCluster_3args_2:106 expected:<3> but was:<4>
NaiveKMeansTest.testCluster_3args_1:89 expected:<10> but was:<9>
RandomBallCoverTest.testSearch_Vec_double:114 jsat.linear.vectorcollection.RandomBallCover$RandomBallCoverFactory failed expected:<14> but was:<7>
RandomBallCoverTest.testSearch_Vec_int:168 jsat.linear.vectorcollection.RandomBallCover$RandomBallCoverFactory failed 20
VPTreeMVTest.testSearch_Vec_double:114 jsat.linear.vectorcollection.VPTreeMV$VPTreeMVFactory failed expected:<13> but was:<0>
VPTreeMVTest.testSearch_Vec_int:171 jsat.linear.vectorcollection.VPTreeMV$VPTreeMVFactory failed 2
VPTreeTest.testSearch_Vec_double:113 jsat.linear.vectorcollection.VPTree$VPTreeFactory failed expected:<13> but was:<0>
VPTreeTest.testSearch_Vec_int:167 jsat.linear.vectorcollection.VPTree$VPTreeFactory failed 1
Tests in error:
PegasosKTest.testTrainC_ClassificationDataSet_ExecutorService:68 » FailedToFit
DivisiveGlobalClustererTest.testCluster_DataSet_int_int_ExecutorService:129 » ArrayIndexOutOfBounds
MiniBatchKMeansTest.testCluster_3args_1:94 » IndexOutOfBounds Index: 0, Size: ...

String attribute support

Hi,

It doesn't seem that JSAT supports string attributes.
Is this a feature that will be supported soon?

Thanks!

Serialization exception for CategoricalResults

Serializing a trained RandomForest classifier results in the following exception:
Class jsat.classifiers.CategoricalResults does not implement Serializable or externalizable

ArrayIndexOutOfBoundsException after dimensionality reduction

I tried to use PCA or MutualInfoFS for dimensionality reduction. After doing that, when I run the regression classifier, it crashes at index max dimension + 1.

MutualInfoFS transform = new MutualInfoFS(dataSet, 200);
dataSet.applyTransform(transform);
classifier = new StochasticMultinomialLogisticRegression(learningRate, iterations);
classifier.trainC(dataSet);

And the error is

Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 201
at jsat.linear.DenseVector.get(DenseVector.java:116)
at jsat.linear.SparseVector.dot(SparseVector.java:511)
at jsat.classifiers.linear.StochasticMultinomialLogisticRegression.classify(StochasticMultinomialLogisticRegression.java:539)

Is that a bug? Or did I miss something?

KMeans data load

How do I load a CSV file so that each row becomes one vector?
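One way to approach this without relying on any particular loader is to parse each line into a double[] first. The sketch below is plain JDK code with made-up names, and it assumes a simple comma-separated file with numeric cells only (no header, no quoted fields).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of JSAT: one double[] per non-empty CSV line.
final class CsvRows {
    static List<double[]> parse(List<String> lines) {
        List<double[]> rows = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            String[] cells = line.split(",");
            double[] row = new double[cells.length];
            for (int j = 0; j < cells.length; j++) {
                row[j] = Double.parseDouble(cells[j].trim());
            }
            rows.add(row);
        }
        return rows;
    }
}
```

The lines can be obtained with java.nio.file.Files.readAllLines, and each double[] can then be wrapped in a vector type for clustering. As far as I know jsat.linear.DenseVector accepts a double[] in its constructor, but please check that against your JSAT version.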

Save and Load Classifier

Hi,

Is there any way to save a classifier once it is trained so that it can be reloaded later?
I tried serializing/deserializing a DANN classifier using ObjectOutputStream/ObjectInputStream, but VPNode was not serializable on read.
Adding Serializable to VPNode gave "no valid constructor" on read.
Adding a no-argument constructor to VPNode gave the same error.

NaiveBayes classifier can be serialized/deserialized.

Thanks,
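For context, the round trip being attempted looks roughly like the sketch below (plain JDK, hypothetical helper names, not JSAT code). As far as I understand Java serialization, "no valid constructor" on read usually means that the nearest non-serializable superclass of the object being restored has no accessible no-argument constructor, so adding one to the Serializable class itself does not help.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;

// Hypothetical helper for saving/loading any Serializable model.
final class ModelIo {
    static byte[] toBytes(Serializable model) {
        try (ByteArrayOutputStream bos = new ByteArrayOutputStream();
             ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(model);
            oos.flush(); // make sure everything reaches the byte buffer
            return bos.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Swapping the byte array streams for FileOutputStream/FileInputStream gives save-to-disk and load-from-disk; every object reachable from the classifier has to be serializable for this to work.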

How to get that Snapshot goodness

The pom.xml has 0.0.7-SNAPSHOT,
and the README says: I will also host a snapshot directory, to access it - change "maven-repo" to "maven-snapshot-repo" for the "<url>" tag.

Both are great, but can anyone tell me how to make use of that? Which "maven-repo" do I change, and where?

NPE when getting parameters from (a few) Parameterized algos getValueString

jsat.classifiers.linear.BBR
Exception in thread "main" java.lang.NullPointerException at jsat.parameters.ObjectParameter.getValueString(ObjectParameter.java:38)

The context I was calling it in:

  public String toString() {
    if (classifier instanceof Parameterized) {
      state.putAll(((Parameterized) classifier).getParameters().stream()
          .collect(Collectors.toMap(Parameter::getName, Parameter::getValueString)));
    }
    return GSON.toJson(state);
  }

Unit tests fail (randomly?)

I ran the unit tests a few times in a row, and each time, a different set of 1 to 3 of them failed.

testTrainC_ClassificationDataSet(jsat.classifiers.trees.RandomForestTest) Time elapsed: 0.242 sec <<< FAILURE!
testSearch_Vec_int(jsat.linear.vectorcollection.RandomBallCoverTest) Time elapsed: 0.013 sec <<< FAILURE!

Normal Distribution returns values >1 and NaN

Hi,

I just found some strange behavior.
The following unit test methods fail:

import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;

import jsat.distributions.Normal;
public class TestNormal {
    static Normal n = new Normal(811.4250871080139d, 1540.8594859716793d);

    @BeforeClass
    public static void setUpBeforeClass() throws Exception {
    }

    @AfterClass
    public static void tearDownAfterClass() throws Exception {
    }

    @Before
    public void setUp() throws Exception {
        n = new Normal(811.4250871080139d, 1540.8594859716793d);
    }

    @After
    public void tearDown() throws Exception {
    }

    @Test
    public void test1() {
        assertTrue(n.cdf(44430.0d) <= 1);
    }

    @Test
    public void test2() {
        assertFalse(Double.isNaN(n.cdf(67043.0)));
    }
}

Can you reproduce this? Is it expected? Did I miss something?

Maven Repo

Very nice that you set up a Maven repository. Maybe, for those who are not familiar with Maven, you could upload the jar to the releases tab.
Could you also add source and javadoc jars to your Maven repo (with maven-source-plugin and maven-javadoc-plugin)?
What about adding JSAT to the central Maven repository? Then anyone could find JSAT without changing the pom file.

Example of how to build a DataSet from scratch

Getting data in is always painful - JSAT much less than most.
I've been going through the test classes and Loaders, and was looking for the most straightforward "load this data into a classifier" example.

I've got the data available as a Table of Strings and would need to figure out which cols were "doubles in disguise" vs. classes - but it feels like I'm reinventing wheels past that point. All the code of "take this list of strings, convert them to int indexed lookup tables, use that to create a data point, add those data points to a list, finally put the list into a Simple(?)DataSet" - can I do it in a generic way?
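To show the kind of generic step I mean, here is a sketch of the "strings to int-indexed lookup table" part in plain Java. All names are made up and nothing here is JSAT API. It decides whether a column is numeric, and otherwise assigns each distinct label an index in order of first appearance.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of JSAT.
final class ColumnEncoder {
    // True if every cell parses as a double ("doubles in disguise").
    static boolean isNumeric(List<String> column) {
        for (String s : column) {
            try {
                Double.parseDouble(s.trim());
            } catch (NumberFormatException e) {
                return false;
            }
        }
        return true;
    }

    // Maps each distinct label to an index, in order of first appearance.
    static Map<String, Integer> buildLookup(List<String> column) {
        Map<String, Integer> lookup = new LinkedHashMap<>();
        for (String s : column) {
            lookup.putIfAbsent(s, lookup.size());
        }
        return lookup;
    }

    // Replaces each label by its index; every label must be in the lookup.
    static int[] encode(List<String> column, Map<String, Integer> lookup) {
        int[] out = new int[column.size()];
        for (int i = 0; i < out.length; i++) {
            out[i] = lookup.get(column.get(i));
        }
        return out;
    }
}
```

The encoded int[] columns and the parsed numeric columns are then the raw material for building data points; how those map onto JSAT's constructors is exactly the part I am asking about.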

Question on image classification

Nice lib - seems to have all the common algos.

Question: which sections are most relevant to image classification? Given a large dataset in which each image is, say, 28x28 pixels, I'd like to compare an image against the dataset to determine its type. Let's take traffic signs for the sake of argument.

Plotting Library for big data

Hi Edward,
I am looking for a plotting library that can draw a scatter plot from a big dataset (2 million rows). Does your visualization library support plotting data of that size?
Thanks,

IndexOutOfBoundsException for ClassificationDataSet.stratSet

public List<ClassificationDataSet> stratSet(int folds, Random rnd)
{
    ArrayList<ClassificationDataSet> cvList = new ArrayList<ClassificationDataSet>();

    IntList rndOrder = new IntList();
    
    int curFold = 0;
    for(int c = 0; c < getClassSize(); c++)
    {
        List<DataPoint> subPoints = getSamples(c);
        rndOrder.clear();
        ListUtils.addRange(rndOrder, 0, subPoints.size(), 1);
        Collections.shuffle(rndOrder, rnd);
        
        for(int i : rndOrder)
        {
            cvList.get(curFold).datapoints.add(subPoints.get(i));
            cvList.get(curFold).category.add(c);
            curFold = (curFold + 1) % folds;
        }
    }
    
    return cvList;
}

I added this code:
while (cvList.size() < folds)
{
    ClassificationDataSet clone = new ClassificationDataSet(numNumerVals, categories, predicting.clone());
    cvList.add(clone);
}
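The underlying issue is that get(curFold) is called on a list that has no fold entries yet. A stand-alone sketch of the same round-robin stratification (plain Java on index lists, with made-up names, not the JSAT method itself) shows the fold containers being created before the loop:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical stand-alone version of the stratified split.
final class StratifiedFolds {
    // labels[i] is the class of sample i; returns the sample indices per fold.
    static List<List<Integer>> split(int[] labels, int numClasses, int folds, Random rnd) {
        List<List<Integer>> cv = new ArrayList<>();
        for (int f = 0; f < folds; f++) {
            cv.add(new ArrayList<>()); // create every fold up front
        }
        int curFold = 0;
        for (int c = 0; c < numClasses; c++) {
            List<Integer> members = new ArrayList<>();
            for (int i = 0; i < labels.length; i++) {
                if (labels[i] == c) {
                    members.add(i);
                }
            }
            Collections.shuffle(members, rnd);
            for (int i : members) {
                cv.get(curFold).add(i);          // safe: fold curFold exists
                curFold = (curFold + 1) % folds; // deal samples round-robin
            }
        }
        return cv;
    }
}
```

Because the fold counter carries over between classes, each class is spread across the folds as evenly as possible.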

Consider moving to a more liberal license

Hi, I would like to take parts of JSAT, cross-compile them using GWT (your pure-Java style will greatly help) and use them in a (maybe one day commercial) project. Why not just use MIT, BSD or Apache 2.0 license? Thanks for considering!

General Refactoring TODO

Notes for myself on some refactoring I want to do at some point. Probably going to do these all at once when I move to Java 8

use "size" as method name instead of "getSampleSize"
standardize Vec / Matrix method order to (Other obj, constant, Obj to mutate).
Make transforms have a train method that just takes a DataSet, and remove the factories used. It will be up to the implementation to throw an error if the data set isn't of the desired type.
Move ARFF into io package
Remove weight from data point object, make methods take in a weight vector if they are going to use weights. (Might be a Java 8 only one)
Remove factory concept from Transforms and just use a fit method.
Improve Text Loaders. Rename to TextCorpus for the current loaders. Make the base abstract classes non abstract so they can be used for unlabeled data loading.

Example from raw text to LDA?

I saw your old article on topic modeling http://jsatml.blogspot.com/2014/06/stochastic-lda.html
and was wondering how easy it was to take a text file of text-document-per-row and try out your LDA model? I've been trying mallet and gensim, but both are very specific for how you have to prepare your data. I was hoping for some snippet from your test.

I saw the update(List<Vec>)
I have a List<String> of documents (multiple sentences, plain text, with punctuation)
which I want to turn into a List<jsat.linear.Vec>
what is the best way to get from A to B?

It isn't quite DataSet from Loading-text-data-and-Spam-Classification

maybe HashedTextDataLoader.java

I checked out OnlineLDAsviTest.java but it didn't bridge the two.

Do you have a snippet that would test the topic modeling starting from a list of Strings?

Question: will this modification improve performance globally? It is certainly slow in SVD.

class DataSet:

/**
 * Creates a matrix from the data set, where each row represents a data
 * point, and each column is one of the numeric variables from the data set.
 *
 * This matrix can be altered and will not affect any of the values in the data set.
 *
 * @return a matrix of the data points.
 */
public Matrix getDataMatrix()
{
    // Modified by me: the dense matrix ran out of memory for text features
    int len = this.getNumNumericalVars();
    Vec first = getDataPoint(0).getNumericalValues();
    if (!first.isSparse())
    {
        DenseMatrix matrix = new DenseMatrix(this.getSampleSize(), len);

        for (int i = 0; i < getSampleSize(); i++)
        {
            Vec row = getDataPoint(i).getNumericalValues();
            for (int j = 0; j < row.length(); j++)
                matrix.set(i, j, row.get(j));
        }

        return matrix;
    }
    else
    {
        // Sparse data: keep the rows sparse instead of copying them into a dense matrix
        SparseVector[] rows = new SparseVector[this.getSampleSize()];
        for (int i = 0; i < getSampleSize(); i++)
        {
            Vec row = getDataPoint(i).getNumericalValues();
            rows[i] = (SparseVector) row.clone();
        }
        SparseMatrix matrix = new SparseMatrix(rows);

        return matrix;
    }
}

[Question] Disable certain comparisons?

Is there a way to only use certain comparisons or change weighting so that inconclusive classifications aren't being accounted for?

If I'm confusing you, here's what I mean (Screenshot from weka): http://prnt.sc/ea3ogz
Plots I squared in red are okay for classification and I want to use them while I have some (In black) that don't classify well (or at all for that matter) and therefore I don't want to use those. How would I go about doing that?

My data setup is five numerical features and a class. I'm using an LVQ classifier.

SystemInfo.L2CacheSize throws errors on Google App Engine

Even though it identifies as a Linux OS, you can't call binaries. I'm guessing the same might be true (sometimes) for containers. IMHO better to lower the log severity so as not to clutter logs.

https://github.com/EdwardRaff/JSAT/blob/master/JSAT/src/jsat/utils/SystemInfo.java

Build problem in Eclipse environment

Thank you for sharing the great job you have done with the community.

While the build works with Maven, I had trouble compiling the code from Eclipse (Luna, Java 1.8). For the methods loadSimple, loadClassification and loadRegression of class JSATData, compilation failed on the casts of the load method's return value, e.g.:
return (SimpleDataSet) load(inRaw, true);
Compile error is: Cannot cast from DataSet<DataSet<DataSet>> to SimpleDataSet

Guess it's specific to my dev environment, so I'm not sending a pull request but just sharing a workaround with "double casting":

return (SimpleDataSet) ((DataSet<?>)load(inRaw, true));

Save and load RandomForest

Hi,
I would like to save and load the result of the RandomForest training.
When I used Java serialization, I got an exception because getParam in the Parameter class creates a Parameter object that contains a Method field (the getMethod variable is final and is saved as part of the created Parameter object). The same happened when I used FST.
I also tried Kryo but got an exception because RandomDecisionTree does not have a no-arg constructor.
I have millions of training samples, so I can't retrain every time I want to use the model. What should I do?

Thanks.

PCA class

Not completely sure about this, but shouldn't the "return" statement at line 120 of jsat.datatransform.PCA be a "break" statement? Otherwise P may never get initialized.

Also, the "scores" list declared at line 85 is never used.

missing class - jsat.graphing.CategoryPlot;

Checking out your project. Looks promising!
Going through your wiki, I noticed you mention the class: CategoryPlot.
However, there doesn't seem to be such a class in your repo.

DataSet / DataPoint interface

To make the library easier to use, would it be possible to have an interface like other libraries do?
You can look at "Apache Commons Math", where points implement the Clusterable interface.

Here are some references:
Clustering algorithms and distance measures
Clusterable
DBSCANClusterer
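To sketch the idea: in Commons Math, Clusterable exposes a point as a double[] through getPoint(), so algorithms can work on any user type that implements it. A stand-alone illustration of that style (the PointLike interface and the Nearest class are made up for this example and are neither Commons Math nor JSAT code):

```java
import java.util.List;

// Hypothetical interface modelled on the Clusterable idea.
interface PointLike {
    double[] getPoint();
}

// An algorithm written against the interface works for any user type.
final class Nearest {
    // Returns the item whose point is closest to the query (squared Euclidean).
    static <T extends PointLike> T nearest(List<T> items, double[] query) {
        T best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (T item : items) {
            double[] p = item.getPoint();
            double d = 0.0;
            for (int i = 0; i < query.length; i++) {
                double diff = p[i] - query[i];
                d += diff * diff;
            }
            if (d < bestDist) {
                bestDist = d;
                best = item;
            }
        }
        return best; // null if items is empty
    }
}
```

With something like this, user classes could be passed to the algorithms directly instead of being converted into the library's own data point type first.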

Kernel Density Estimation

Hi,
I have a 1D array of values and I want to plot its kernel density estimate, experimenting with the bandwidth and other settings. I have been using MATLAB, where there is a function called ksdensity. Now I want to create a kernel density estimate in Python. The existing algorithm does not give me the same results as MATLAB. Could you please help me: how can I compute the kernel density estimate of a 1D array? Thank you.

Missing values?

Hi,

Is there a way to configure JSAT so that ExtraTrees handles missing values?

Thanks

Bug in the ARFFLoader

While loading an ARFF file, there's a bug with the naming of numeric variables:
Line 199: There is no need to use the helper variable k.
Simple Example: The list "variableNames" contains "A", "B", "C". "A" is a categorical variable.
Inside your loop, you check if the variable is numeric, so in the first iteration no name will be set, i will be 1 and k is 0. In the second iteration you use k to get name. Normally this should be "B", but because k is 0, you will get "A", which is wrong.
So just remove k and use i and everything will be fine.
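The effect can be reproduced in isolation. The sketch below is my reconstruction of the described behaviour in plain Java, not the actual ARFFLoader source; class and method names are made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical reconstruction of the naming loop, not the real ARFFLoader code.
final class NumericNameDemo {
    // Buggy variant: a separate counter k is used to index variableNames.
    static List<String> namesWithK(List<String> variableNames, boolean[] isNumeric) {
        List<String> out = new ArrayList<>();
        int k = 0;
        for (int i = 0; i < variableNames.size(); i++) {
            if (isNumeric[i]) {
                out.add(variableNames.get(k++)); // k lags behind i after a categorical
            }
        }
        return out;
    }

    // Fixed variant: index the name list with the loop variable i.
    static List<String> namesWithI(List<String> variableNames, boolean[] isNumeric) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < variableNames.size(); i++) {
            if (isNumeric[i]) {
                out.add(variableNames.get(i));
            }
        }
        return out;
    }
}
```

For the names "A", "B", "C" with "A" categorical, the first variant yields A and B as the numeric names, while the second yields B and C.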

No-Arg Constructors?

I did a bad thing :) and used reflection to try to instantiate, train, and test every classifier you got, using brain-dead no-arg or simple arg constructors. With autoAddParameters, cause why not.

https://gist.github.com/salamanders/8e7054f62b53eb772895

It exploded all over the place, of course. Which is why I'm so interested in as many classifiers as possible having a best-practice default.

Bagging - is there a weak classifier that in general can be assumed to be an "ok" starting point?
Caching & Parallel training - is there an interface possible for caching-enabled trainers?
Incompatible data - is there a way to upgrade a one-to-many if the classifier is expecting binary but the data assumes multiple?

On the plus side: when it works, it creates some really fun results that I don't think are nearly as easy to produce with competing libraries!

3 test errors

I had 3 test errors.

OS: Win 10

E:\data\finance\docs\ML\lib\JSAT-master\JSAT>java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)

E:\data\finance\docs\ML\lib\JSAT-master\JSAT>javac -version
javac 1.7.0_79

Results :

Failed tests:
LinearBatchTest.testTrainWarmCMultieFast:175 Warm start wasn't faster? 30 vs 28
FastMathTest.testPow:152 null
NadarayaWatsonTest.testTrainC_RegressionDataSet_ExecutorService:103 null

Tests run: 1049, Failures: 3, Errors: 0, Skipped: 0

[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:28 min
[INFO] Finished at: 2016-01-04T16:59:37-05:00
[INFO] Final Memory: 9M/244M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project JSAT: There are test failures.
[ERROR]
[ERROR] Please refer to E:\data\finance\docs\ML\lib\JSAT-master\JSAT\target\surefire-reports for the individual test results.
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project JSAT: There are test failures.

Please refer to E:\data\finance\docs\ML\lib\JSAT-master\JSAT\target\surefire-reports for the individual test results.
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.MojoFailureException: There are test failures.

Please refer to E:\data\finance\docs\ML\lib\JSAT-master\JSAT\target\surefire-reports for the individual test results.
at org.apache.maven.plugin.surefire.SurefireHelper.reportExecution(SurefireHelper.java:82)
at org.apache.maven.plugin.surefire.SurefirePlugin.handleSummary(SurefirePlugin.java:254)
at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:854)
at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:722)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
... 20 more

VecPaired does not override Iterator<IndexValue> getNonZeroIterator(int start)

Cannot load arff file

Hey, I tried to load my new ARFF file, but I'm getting the following stack trace:

java.lang.NullPointerException
at jsat.ARFFLoader.loadArffFile(ARFFLoader.java:182)

my file:
test4.arff

https://hastebin.com/ewopaxateh.pl

My Maven dependency:

<dependency>
    <groupId>com.edwardraff</groupId>
    <artifactId>JSAT</artifactId>
    <version>0.0.6</version>
</dependency>

My Code:
File in = new File("test4.arff");
SimpleDataSet simpleDataSet = ARFFLoader.loadArffFile(new FileReader(in));

FR: getParamsFromMethods - Could this be annotations?

This seems perfect for annotations:

@TunableParameter(
  name="RBFKernel_Sigma",
  minValue=0.001,
  maxValue=2_000,
  startValue=1832,
  tunePriority=TunableParameterPriority.HIGH
)
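A rough sketch of how such an annotation could be declared and discovered via reflection. Everything here is hypothetical (the annotation, the stub class, and the scanner are not JSAT code), and the tunePriority element is left out to keep the example short.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.List;

// Hypothetical annotation; RUNTIME retention makes it visible to reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface TunableParameter {
    String name();
    double minValue();
    double maxValue();
    double startValue();
}

// Stand-in for a kernel class with one tunable setter.
final class RbfKernelStub {
    private double sigma = 1832;

    @TunableParameter(name = "RBFKernel_Sigma", minValue = 0.001,
                      maxValue = 2_000, startValue = 1832)
    public void setSigma(double sigma) {
        this.sigma = sigma;
    }

    public double getSigma() {
        return sigma;
    }
}

// Collects the names of all annotated public methods of a class.
final class TunableScanner {
    static List<String> tunableNames(Class<?> type) {
        List<String> names = new ArrayList<>();
        for (Method m : type.getMethods()) {
            TunableParameter tp = m.getAnnotation(TunableParameter.class);
            if (tp != null) {
                names.add(tp.name());
            }
        }
        return names;
    }
}
```

A search routine could read minValue, maxValue and startValue from the annotation in the same way, instead of inferring parameters from method names.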

Production readiness

Hi, crazy library, I can't believe how much stuff you've implemented yourself. Must have been terrific for understanding all these algorithms.

I'm new to ML and am struggling to find pure Java machine learning libraries. JSAT seems to fit the bill, but testing seems thin on the ground in places. What is your impression of the reliability of the classifiers, for instance? I was going to start with a DecisionTree but found no tests, and thus no sample code, and am curious how healthy you think this classifier is.

In general, would you say this library is ready for use in a production environment? If you have doubts, in which areas, and would you mind documenting this in the README?

Thanks!

Guava Table<Integer, String, String> to JSAT DataSet

I got it working... but it was brutal, about 300 lines of code. I feel like I did it the hard way, but I wasn't sure if there was an easier way after reading the CSV parser code.

Parsing the Strings into Longs, Doubles, Strings
Finding out the "worst" type for each column and normalizing across the column
Making lookup tables for each column that needs it (small number of ints, or Strings)
Generate a dataset based on the output column name

Is there an easier way to do this?
Can it be part of the library?
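For reference, the "worst type per column" step can be sketched in a few lines of plain Java. The names are made up and this is a simplification of what my 300 lines do: each cell is classified as LONG, DOUBLE or STRING, and the column takes the least specific kind found.

```java
import java.util.List;

// Hypothetical sketch of per-column type inference.
final class ColumnTypes {
    // Ordered from most specific to least specific.
    enum Kind { LONG, DOUBLE, STRING }

    static Kind kindOf(String cell) {
        String s = cell.trim();
        try {
            Long.parseLong(s);
            return Kind.LONG;
        } catch (NumberFormatException ignored) {
            // not an integer, try double next
        }
        try {
            Double.parseDouble(s);
            return Kind.DOUBLE;
        } catch (NumberFormatException ignored) {
            // not numeric at all
        }
        return Kind.STRING;
    }

    // The "worst" (least specific) kind over all cells; LONG for an empty column.
    static Kind worstKind(List<String> column) {
        Kind worst = Kind.LONG;
        for (String cell : column) {
            Kind k = kindOf(cell);
            if (k.compareTo(worst) > 0) {
                worst = k;
            }
        }
        return worst;
    }
}
```

Columns that come out as STRING (or as LONG with only a few distinct values) are the ones that need a lookup table.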

class TableDataLoader

TableDataLoader(Table<Long, String, String>)
getDataSet(String)
tableToDataSet_Classification(ColumnInfo, List, SortedSet, int, int)
tableToDataSet_Regression(ColumnInfo, List, SortedSet, int, int)

class ColumnInfo

ColumnInfo(String, Map<Long, String>)
collectionToSortedUniqueStringList(Collection)
parseColumn(Map<Long, String>)
parseToLowestObject(String, Class<?>)
constructJSATCategoricalData()
constructLabelLookups()
getCategoricalData()
getName()
getType()
isLookup()
getRowValue(Number)
getKeyFromLookupId(int)
getAllRowKeys()

Docs

Are there any docs at all on JSAT?

TF-IDF NPE

I started with a purely numeric test project, but when I tried to adapt it to a Spark workflow we were trying to accelerate using tf-idf, it exploded. I raised n a little after looking at #33. I thought it looked kind of like #33, but it doesn't seem like a complete match.

public class App {
    public static void main(String[] args) {
        Random rng = new Random();
        rng.setSeed(0);
        int n = 1000;
        HashedTextVectorCreator htvc = new HashedTextVectorCreator(1000, new NaiveTokenizer(), new TfIdf());
        RegressionDataSet regressionDataSet = new RegressionDataSet(Stream
                .generate(UUID::randomUUID)
                .limit(n)
                .map(String::valueOf)
                .map(htvc::newText)
                .map(v -> new DataPointPair<>(new DataPoint(v), rng.nextDouble()))
                .collect(Collectors.toList()));
        RandomForest randomForest = new RandomForest();
        randomForest.train(regressionDataSet);
        double regress = randomForest.regress(new DataPoint(htvc.newText("asdf")));
        System.out.println(regress);
    }
}

/usr/lib/jvm/java-8-openjdk/bin/java -Didea.launcher.port=7533 -Didea.launcher.bin.path=/opt/idea-IU-145.597.3/bin -Dfile.encoding=UTF-8 -classpath /usr/lib/jvm/java-8-openjdk/jre/lib/charsets.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/cldrdata.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/dnsns.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/jaccess.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/localedata.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/nashorn.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/sunec.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/sunjce_provider.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/sunpkcs11.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/zipfs.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/jce.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/jsse.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/management-agent.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/resources.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/rt.jar:/home/automaticgiant/git/valet2k/jsat-test/target/classes:/home/automaticgiant/.m2/repository/com/edwardraff/JSAT/0.0.4/JSAT-0.0.4.jar:/opt/idea-IU-145.597.3/lib/idea_rt.jar com.intellij.rt.execution.application.AppMain asdf.App
Exception in thread "main" java.lang.NullPointerException
    at jsat.text.wordweighting.TfIdf.indexFunc(TfIdf.java:95)
    at jsat.linear.SparseVector.applyIndexFunction(SparseVector.java:882)
    at jsat.text.wordweighting.TfIdf.applyTo(TfIdf.java:105)
    at jsat.text.HashedTextVectorCreator.newText(HashedTextVectorCreator.java:52)
    at jsat.text.HashedTextVectorCreator.newText(HashedTextVectorCreator.java:41)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
    at java.util.stream.SliceOps$1$1.accept(SliceOps.java:204)
    at java.util.stream.StreamSpliterators$InfiniteSupplyingSpliterator$OfRef.tryAdvance(StreamSpliterators.java:1356)
    at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
    at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
    at asdf.App.main(App.java:32)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)

Process finished with exit code 1

Very pleased with how quickly the author updated. Other bugs I found:

jsat.clustering.FLAME
line 340
int len = Math.min(weights[i].length, knns.size()); // modified by me
// Without this bound, an IndexOutOfBoundsException is thrown when knns.size() is less than weights[i].length.
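The bound described above can be shown in isolation. This is a minimal sketch, assuming a weight array that may be longer than the neighbour list; the class BoundedWeightSum and the method weightedSum are hypothetical and are not taken from FLAME.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: iterate only over indices that exist in both the array and the list. */
public class BoundedWeightSum {

    /** Sums weights[k] * neighborValues.get(k) over the shared index range. */
    public static double weightedSum(double[] weights, List<Double> neighborValues) {
        // Clamp the loop bound so a short neighbour list cannot cause an IndexOutOfBoundsException.
        int len = Math.min(weights.length, neighborValues.size());
        double sum = 0;
        for (int k = 0; k < len; k++) {
            sum += weights[k] * neighborValues.get(k);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Three weights but only two neighbours: the third weight is ignored.
        System.out.println(weightedSum(new double[] {1.0, 2.0, 3.0}, Arrays.asList(10.0, 20.0))); // 50.0
    }
}
```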

jsat.clustering.kmeans.NaiveKMeans
line 170
final CountDownLatch latch = new CountDownLatch(blockSize>0? SystemInfo.LogicalCores : extra);
// If the data set size is less than LogicalCores, the CountDownLatch never counts down to 0.
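One common way to avoid this kind of hang is to size the latch from the number of tasks that are actually submitted, not from the core count. The sketch below uses only java.util.concurrent; it is not the NaiveKMeans code and not necessarily the fix applied in JSAT, and the names LatchSizingSketch and processAll are hypothetical.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch: the latch count always equals the number of submitted tasks. */
public class LatchSizingSketch {

    /** Splits dataSize items over at most `cores` tasks and returns how many items were processed. */
    public static int processAll(int dataSize, int cores) {
        int tasks = Math.min(dataSize, cores); // never more tasks than items
        if (tasks == 0) {
            return 0; // nothing to do, and a thread pool of size 0 is not allowed
        }
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        CountDownLatch latch = new CountDownLatch(tasks);
        AtomicInteger processed = new AtomicInteger();
        int blockSize = dataSize / tasks;
        int extra = dataSize % tasks;
        int start = 0;
        for (int t = 0; t < tasks; t++) {
            // The first `extra` tasks take one additional item each.
            final int from = start;
            final int to = start + blockSize + (t < extra ? 1 : 0);
            start = to;
            pool.submit(() -> {
                try {
                    for (int i = from; i < to; i++) {
                        processed.incrementAndGet(); // stand-in for real per-item work
                    }
                } finally {
                    latch.countDown(); // exactly one countDown per submitted task
                }
            });
        }
        try {
            if (!latch.await(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("latch did not reach zero");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting", e);
        } finally {
            pool.shutdown();
        }
        return processed.get();
    }

    public static void main(String[] args) {
        System.out.println(processAll(3, 8));   // 3 items on 8 cores: only 3 tasks, prints 3
        System.out.println(processAll(100, 8)); // prints 100
    }
}
```

Putting countDown() in a finally block also keeps the latch from stalling if a task throws.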

Question: Mean-Shift Starting point

Hello,

Very impressed with this library!

Here is my problem: I have a bunch of photos, each with latitude, longitude, and time data points. I need to cluster them to form groups of photos that were taken in the same area, around the same time. I've done some research on clustering, and I think the Mean-Shift algorithm will suit this nicely, mainly because I do not know in advance how many clusters there will be.

I wanted to use your library to help me with that, and I'm having trouble getting started. I have looked through your wiki, but I'm still a little lost on how to begin. Any advice or code snippets would be awesome!

Thanks!
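As a starting point for the question above, here is a library-free sketch of flat-kernel mean shift on (latitude, longitude, time) features. It is not JSAT's MeanShift class and does not show the JSAT API. The names MeanShiftSketch, feature and cluster, the scale factors, and the bandwidth are all assumptions chosen for illustration. Treating degrees as Euclidean coordinates is only a rough approximation that is reasonable for nearby photos.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch of flat-kernel mean shift; not JSAT's implementation. */
public class MeanShiftSketch {

    /** Scales a photo so that one unit equals degScale degrees or secScale seconds. */
    public static double[] feature(double lat, double lon, double epochSeconds,
                                   double degScale, double secScale) {
        return new double[] { lat / degScale, lon / degScale, epochSeconds / secScale };
    }

    /** Euclidean distance between two feature vectors of equal length. */
    static double dist(double[] a, double[] b) {
        double sum = 0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    /** Moves each point to the mean of its neighbours until it settles, then labels points by mode. */
    public static int[] cluster(double[][] pts, double bandwidth, int maxIter) {
        List<double[]> centers = new ArrayList<>();
        int[] labels = new int[pts.length];
        for (int i = 0; i < pts.length; i++) {
            double[] x = pts[i].clone();
            for (int it = 0; it < maxIter; it++) {
                double[] mean = new double[x.length];
                int count = 0;
                for (double[] p : pts) {
                    if (dist(x, p) <= bandwidth) { // flat kernel: neighbours count equally
                        for (int d = 0; d < mean.length; d++) mean[d] += p[d];
                        count++;
                    }
                }
                if (count == 0) break; // defensive guard against an empty window
                for (int d = 0; d < mean.length; d++) mean[d] /= count;
                double moved = dist(mean, x);
                x = mean;
                if (moved < 1e-9) break; // converged to a mode
            }
            // Points whose modes lie within half a bandwidth share a cluster.
            int label = -1;
            for (int c = 0; c < centers.size(); c++) {
                if (dist(x, centers.get(c)) < bandwidth / 2) { label = c; break; }
            }
            if (label < 0) {
                centers.add(x);
                label = centers.size() - 1;
            }
            labels[i] = label;
        }
        return labels;
    }

    public static void main(String[] args) {
        double deg = 0.01, sec = 3600; // assumed scales: 0.01 degree and one hour per unit
        double[][] photos = {
            feature(40.000, -74.000, 0, deg, sec),
            feature(40.001, -74.001, 600, deg, sec),
            feature(48.850, 2.350, 0, deg, sec),
            feature(48.851, 2.351, 300, deg, sec),
            feature(40.000, -74.000, 259200, deg, sec), // same place, three days later
        };
        System.out.println(Arrays.toString(cluster(photos, 1.0, 100))); // [0, 0, 1, 1, 2]
    }
}
```

The number of clusters is not given in advance; it falls out of the bandwidth and the feature scaling, so those two choices decide what "same area, around the same time" means. This sketch is quadratic in the number of photos per iteration, which is fine for small collections only.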
