edwardraff / jsat
Java Statistical Analysis Tool, a Java library for Machine Learning
License: GNU General Public License v3.0
I ran Eclipse's and NetBeans' upgrade assistant / refactor suggestions on the code base, and the results are quite extensive. In many places the suggestions are too pushy and/or break the code.
Some are great suggestions! Lots of safer braces, using Strings in switch statements, removing unnecessary else blocks after a return, Java 7 diamonds in the constructors, ditching unnecessary casts, etc. IMHO totally worth it.
Some are wrong suggestions, like how to handle templates with DataSet. In general, all of the raw type suggestions take a moment to look at. You may have a better solution.
Some of the Java 8 suggestions are iffy, like converting the for loops to streams, and I don't know if that would be a bad idea, or a really interesting idea because then they could be run in parallel. You may be able to slim down the existing parallel code using parallel streams and lambdas.
One of the reasons to work in metric spaces is to abstract away from what the elements you're measuring distances between actually are.
They could be images, text, audio samples, Excel spreadsheets, whatever, as long as they come with a distance that defines a metric space. Why are you limiting this to numeric vectors only?
All that you would need is an interface,
interface MetricDistance<SomeType> {
    double distance(SomeType a, SomeType b);
}
and let the user provide an implementation of that.
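A minimal sketch of what such a user-supplied metric could look like. The names `MetricDistance`, `EditDistance`, and `MetricSearch.nearest` are hypothetical and are not part of JSAT's API; the point is only that a search routine can be written against the metric alone, without ever touching vector coordinates.

```java
import java.util.List;

// Hypothetical interface (not part of JSAT): any type T that comes with a metric can be plugged in.
interface MetricDistance<T> {
    double distance(T a, T b);
}

// Example metric on strings: Levenshtein edit distance, which satisfies the metric axioms.
final class EditDistance implements MetricDistance<String> {
    @Override
    public double distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] cur = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            cur[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                cur[j] = Math.min(Math.min(cur[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev; // swap rows so prev always holds the last completed row
            prev = cur;
            cur = tmp;
        }
        return prev[b.length()];
    }
}

final class MetricSearch {
    // Brute-force nearest neighbour that relies only on the metric.
    static <T> T nearest(T query, List<T> items, MetricDistance<T> metric) {
        T best = null;
        double bestDist = Double.POSITIVE_INFINITY;
        for (T item : items) {
            double d = metric.distance(query, item);
            if (d < bestDist) {
                bestDist = d;
                best = item;
            }
        }
        return best;
    }
}
```

Any index structure that only needs the triangle inequality could, in principle, be written against such an interface in the same way.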
When I call evaluateCrossValidation, I may want to give up after a certain amount of time. Instead of building in some "isRunning" boolean, would it be easier to check for thread interrupts (https://docs.oracle.com/javase/tutorial/essential/concurrency/interrupt.html) with if (Thread.interrupted()) { ... } in key areas of the loops?
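As a rough illustration of the idea, here is a self-contained sketch of a long-running loop that polls the interrupt flag. `InterruptibleLoop` and its `run` method are made up for this example and do not correspond to any JSAT code.

```java
import java.util.concurrent.CancellationException;

final class InterruptibleLoop {
    // Stand-in for a long training loop; returns the number of iterations completed.
    static int run(int maxIterations) {
        int done = 0;
        for (int i = 0; i < maxIterations; i++) {
            // Thread.interrupted() returns true if an interrupt is pending and clears the flag.
            if (Thread.interrupted()) {
                throw new CancellationException("interrupted after " + done + " iterations");
            }
            done++; // placeholder for one unit of real work
        }
        return done;
    }
}
```

A caller could submit such work to an ExecutorService and call `Future.cancel(true)` once a time budget is exhausted; that interrupts the worker thread, and the loop stops at its next check.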
What should be the interpretation of the case when classifier instanceof Parameterized, but autoAddParameters(trainDS)==0?
SAMME, ArcX4, ModestAdaBoost, DecisionStump, AdaBoostM1PL, StochasticMultinomialLogisticRegression, NaiveBayes, DDAG
Should I take it as "for this specific dataset, there were no good parameters for RandomSearch to chew on -- but keep trying, other DataSets may work better!" or as "author intended to make this algorithm Parameterized but likely hasn't gotten to implementing it yet"?
The following example gives me a null-pointer exception for version 0.0.3 of the library. It seems that the root value in the tree learners is null? Did I do something wrong?
package demo;

import java.util.List;
import java.util.LinkedList;
import jsat.classifiers.DataPoint;
import jsat.linear.DenseVector;
import jsat.linear.Vec;
import jsat.regression.RegressionDataSet;
import jsat.classifiers.trees.RandomForest;
import jsat.classifiers.trees.DecisionTree;
import jsat.regression.MultipleLinearRegression;
import jsat.regression.Regressor;

public class TreeRegressionDemo
{
    public static void main(String[] args)
    {
        double[][] data = new double[][]{{1.0, 1.5}, {2.0, 2.5}, {3.0, 3.5}};
        double[] y = new double[]{2.0, 4.0, 3.0};
        List<DataPoint> points = new LinkedList<DataPoint>();
        for (int irow = 0; irow < data.length; irow++)
        {
            // one extra slot at the end holds the regression target
            Vec vec = new DenseVector(data[irow].length+1);
            for (int icol = 0; icol < data[irow].length; icol++)
            {
                vec.set(icol, data[irow][icol]);
            }
            // use the target of this row (the original indexed y by the column count)
            vec.set(data[irow].length, y[irow]);
            DataPoint point = new DataPoint(vec);
            points.add(point);
        }
        RegressionDataSet dataSet = new RegressionDataSet(points, data[0].length);
        RandomForest forest = new RandomForest(10);
        DecisionTree tree = new DecisionTree();
        MultipleLinearRegression mlr = new MultipleLinearRegression();
        forest.autoFeatureSample();
        Regressor regressor = forest;
        regressor.train(dataSet);
        DataPoint testPoint = dataSet.getDataPoint(0);
        System.out.println(testPoint);
        double result = regressor.regress(testPoint);
        System.out.println(result);
    }
}
This person over here had a comparison of various algos and said bartMachine was the winning Regression algorithm by a large margin.
Request that it be added to JSAT.
Hello Edward,
Your library is incredible. I'm also developing a framework for ML, image processing, and other areas of computer science.
My project grew out of AForge.NET and Accord.NET and uses the same LGPL license. I would like to know whether you could change the license from GPL to LGPL, at least for some portions.
Cheers !
Hi,
I've just stumbled on your project while looking for an alternative to Weka. Frankly, this looks amazing and exactly what we need... until I realised that the reason I was looking for an alternative to Weka in the first place (besides its other shortcomings), the fact that it's GPL, means we can't use your library to put models into production either.
So I was wondering if you've given any thought to switching from GPL to LGPL, or even something more permissive, like MIT, BSD or Apache licences?
Hi,
I have a 1D array of values. I want to plot the kernel density estimate of my array, and I want to play with the bandwidth and so on. I have been using MATLAB, where there is a function called ksdensity. Now I want to create a kernel density estimate using Java. Could you please help me? How can I find the kernel density estimate of a 1D array?
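For orientation, a Gaussian kernel density estimate is short enough to write by hand. The sketch below is plain Java and does not use any JSAT class; `GaussianKde` and its method names are invented for the example. The bandwidth helper uses the common normal reference rule h = 1.06 * sigma * n^(-1/5), which is an assumption on my part and not necessarily what ksdensity uses by default.

```java
final class GaussianKde {
    // Density estimate at x: (1 / (n * h)) * sum_i K((x - x_i) / h), with K the standard normal pdf.
    static double density(double[] samples, double bandwidth, double x) {
        if (samples.length == 0 || bandwidth <= 0) {
            throw new IllegalArgumentException("need at least one sample and a positive bandwidth");
        }
        double sum = 0;
        for (double s : samples) {
            double u = (x - s) / bandwidth;
            sum += Math.exp(-0.5 * u * u);
        }
        return sum / (samples.length * bandwidth * Math.sqrt(2 * Math.PI));
    }

    // Normal reference rule of thumb: h = 1.06 * sigma * n^(-1/5), using the sample standard deviation.
    static double normalReferenceBandwidth(double[] samples) {
        int n = samples.length;
        if (n < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double mean = 0;
        for (double s : samples) {
            mean += s;
        }
        mean /= n;
        double squares = 0;
        for (double s : samples) {
            squares += (s - mean) * (s - mean);
        }
        double sigma = Math.sqrt(squares / (n - 1));
        return 1.06 * sigma * Math.pow(n, -0.2);
    }
}
```

To plot the curve, evaluate `density` on a grid of x values spanning the data; a larger bandwidth gives a smoother curve and a smaller one follows the samples more closely.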
Hello all,
Would it be possible to have a wiki entry that lists the default parameters that should be used with each of the classifiers/regressors? It is not clear from the source code what those parameter values should be. Thanks!
The following tests fail with windows and jdk 1.8 (I don't think this is windows or java related ;) ):
Failed tests:
ElkanKMeansTest.testCluster_3args_2:106 expected:<3> but was:<4>
RandomBallCoverTest.testSearch_Vec_double:114 jsat.linear.vectorcollection.RandomBallCover$RandomBallCoverFactory failed expected:<14> but was:<13>
RandomBallCoverTest.testSearch_Vec_int:168 jsat.linear.vectorcollection.RandomBallCover$RandomBallCoverFactory failed 10
VPTreeMVTest.testSearch_Vec_int:171 jsat.linear.vectorcollection.VPTreeMV$VPTreeMVFactory failed 10
VPTreeTest.testSearch_Vec_double:113 jsat.linear.vectorcollection.VPTree$VPTreeFactory failed expected:<12> but was:<11>
VPTreeTest.testSearch_Vec_int:167 jsat.linear.vectorcollection.VPTree$VPTreeFactory failed 1
FastMathTest.testPow:152 null
EDIT:
It seems that some of the tests sometimes fail and sometimes pass. I did not have a closer look but maybe there are some random issues in the tests.
This time I got the following result:
Failed tests:
NewGLMNETTest.testSetC:109 null
ElkanKMeansTest.testCluster_3args_2:106 expected:<3> but was:<4>
NaiveKMeansTest.testCluster_3args_1:89 expected:<10> but was:<9>
RandomBallCoverTest.testSearch_Vec_double:114 jsat.linear.vectorcollection.RandomBallCover$RandomBallCoverFactory failed expected:<14> but was:<7>
RandomBallCoverTest.testSearch_Vec_int:168 jsat.linear.vectorcollection.RandomBallCover$RandomBallCoverFactory failed 20
VPTreeMVTest.testSearch_Vec_double:114 jsat.linear.vectorcollection.VPTreeMV$VPTreeMVFactory failed expected:<13> but was:<0>
VPTreeMVTest.testSearch_Vec_int:171 jsat.linear.vectorcollection.VPTreeMV$VPTreeMVFactory failed 2
VPTreeTest.testSearch_Vec_double:113 jsat.linear.vectorcollection.VPTree$VPTreeFactory failed expected:<13> but was:<0>
VPTreeTest.testSearch_Vec_int:167 jsat.linear.vectorcollection.VPTree$VPTreeFactory failed 1
Tests in error:
PegasosKTest.testTrainC_ClassificationDataSet_ExecutorService:68 » FailedToFit
DivisiveGlobalClustererTest.testCluster_DataSet_int_int_ExecutorService:129 » ArrayIndexOutOfBounds
MiniBatchKMeansTest.testCluster_3args_1:94 » IndexOutOfBounds Index: 0, Size: ...
Hi,
It doesn't seem that JSAT supports string attributes.
Is this a feature that will be supported soon?
Thanks!
Serialization of a trained RandomForest classifier results in Exception
Class jsat.classifiers.CategoricalResults does not implement Serializable or externalizable
I tried to use PCA or MutualInfoFS for dimensionality reduction. When I then run the classifier, it crashes at index "max dimension + 1".
MutualInfoFS transform = new MutualInfoFS(dataSet, 200);
dataSet.applyTransform(transform);
classifier = new StochasticMultinomialLogisticRegression(learningRate, iterations);
classifier.trainC(dataSet);
And the error is
Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: 201
at jsat.linear.DenseVector.get(DenseVector.java:116)
at jsat.linear.SparseVector.dot(SparseVector.java:511)
at jsat.classifiers.linear.StochasticMultinomialLogisticRegression.classify(StochasticMultinomialLogisticRegression.java:539)
Is that a bug? Or did I miss something?
How do I load a CSV file, with one row as one vector?
Hi,
Is there any way to save a classifier once trained so it can be reloaded later?
I tried serializing/deserializing a DANN classifier using ObjectOutputStream/ObjectInputStream, but VPNode was not serializable on read.
Adding Serializable to VPNode gave "no valid constructor" on read.
Adding a no-argument constructor to VPNode gave the same error.
The NaiveBayes classifier can be serialized/deserialized.
Thanks,
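For background, Java's built-in serialization reports "no valid constructor" (as an InvalidClassException on read) when the nearest non-serializable superclass of a Serializable class has no accessible no-argument constructor; adding a no-arg constructor to the Serializable class itself does not change that. Whether this is what happens with VPNode depends on its class hierarchy, which is not shown here, so the sketch below only demonstrates the general rule with made-up classes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InvalidClassException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

final class SerializationDemo {
    // Non-serializable parent without a no-arg constructor: this is what breaks deserialization.
    static class Base {
        final int id;
        Base(int id) { this.id = id; }
    }

    static class Broken extends Base implements Serializable {
        private static final long serialVersionUID = 1L;
        Broken() { this(0); }            // a no-arg constructor here does not help
        Broken(int id) { super(id); }
    }

    // One possible fix: make the parent Serializable as well.
    static class FixedBase implements Serializable {
        private static final long serialVersionUID = 1L;
        final int id;
        FixedBase(int id) { this.id = id; }
    }

    static class Fixed extends FixedBase {
        private static final long serialVersionUID = 1L;
        Fixed(int id) { super(id); }
    }

    // Writes the object to memory and reads it back; false means "no valid constructor" style failure.
    static boolean canRoundTrip(Object o) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(o);
            }
            try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                in.readObject();
            }
            return true;
        } catch (InvalidClassException e) {
            return false;
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

The other standard fix is to give the non-serializable parent an accessible no-arg constructor, which is not possible when that parent is a non-static inner class.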
The pom.xml has 0.0.7-SNAPSHOT, and the readme says: I will also host a snapshot directory, to access it - change "maven-repo" to "maven-snapshot-repo" for the "<url>" tag.
Both are great, but can anyone tell me how to make use of that? Which "maven-repo" do I change, and where?
jsat.classifiers.linear.BBR
Exception in thread "main" java.lang.NullPointerException at jsat.parameters.ObjectParameter.getValueString(ObjectParameter.java:38)
The context I was calling it in:
public String toString() {
    if (classifier instanceof Parameterized) {
        state.putAll(((Parameterized) classifier).getParameters().stream()
                .collect(Collectors.toMap(Parameter::getName, Parameter::getValueString)));
    }
    return GSON.toJson(state);
}
I ran the unit tests a few times in a row, and each time, a different set of 1 to 3 of them failed.
testTrainC_ClassificationDataSet(jsat.classifiers.trees.RandomForestTest) Time elapsed: 0.242 sec <<< FAILURE!
testSearch_Vec_int(jsat.linear.vectorcollection.RandomBallCoverTest) Time elapsed: 0.013 sec <<< FAILURE!
Hi,
I just found a strange behavior:
The following Unit test methods fail:
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;
import org.junit.After;
import org.junit.AfterClass;
import org.junit.Before;
import org.junit.BeforeClass;
import org.junit.Test;
import jsat.distributions.Normal;

public class TestNormal {
    static Normal n = new Normal(811.4250871080139d, 1540.8594859716793d);

    @BeforeClass
    public static void setUpBeforeClass() throws Exception {
    }

    @AfterClass
    public static void tearDownAfterClass() throws Exception {
    }

    @Before
    public void setUp() throws Exception {
        n = new Normal(811.4250871080139d, 1540.8594859716793d);
    }

    @After
    public void tearDown() throws Exception {
    }

    @Test
    public void test1() {
        assertTrue(n.cdf(44430.0d) <= 1);
    }

    @Test
    public void test2() {
        assertFalse(Double.isNaN(n.cdf(67043.0)));
    }
}
Can you reproduce this? Is it expected? Did I miss something?
Very nice, that you set up a maven repository. Maybe, for those who are not familiar with maven, you could upload the jar to the releases tab.
Could you also add src and doc jars to your maven repo? (with maven-source-plugin and maven-javadoc-plugin).
What about adding JSAT to the central maven repository?! Then anyone could find JSAT without changing the pom file
Getting data in is always painful - JSAT much less than most.
I've been going through the test classes and Loaders, and was looking for the most straightforward "load this data into a classifier" example.
I've got the data available as a Table of Strings and would need to figure out which cols were "doubles in disguise" vs. classes - but it feels like I'm reinventing wheels past that point. All the code of "take this list of strings, convert them to int indexed lookup tables, use that to create a data point, add those data points to a list, finally put the list into a Simple(?)DataSet" - can I do it in a generic way?
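To make the "doubles in disguise" step concrete, here is a small plain-Java sketch of the generic part: deciding which columns are numeric and building an int lookup table for a categorical column. `ColumnSniffer` and its methods are hypothetical helpers, not JSAT classes, and the final step of building JSAT data points from the result is deliberately left out.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class ColumnSniffer {
    // A column counts as numeric only if every non-blank cell parses as a double.
    static boolean[] numericColumns(List<String[]> rows, int columns) {
        boolean[] numeric = new boolean[columns];
        Arrays.fill(numeric, true);
        for (String[] row : rows) {
            for (int c = 0; c < columns; c++) {
                if (numeric[c] && !isDouble(row[c])) {
                    numeric[c] = false;
                }
            }
        }
        return numeric;
    }

    static boolean isDouble(String cell) {
        String s = cell.trim();
        if (s.isEmpty()) {
            return true; // assumption: blank cells are missing values, not evidence of a category
        }
        try {
            Double.parseDouble(s);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    // Maps each distinct value of a categorical column to an int, in order of first appearance.
    static Map<String, Integer> categoryIndex(List<String[]> rows, int column) {
        Map<String, Integer> index = new LinkedHashMap<>();
        for (String[] row : rows) {
            String value = row[column].trim();
            if (!index.containsKey(value)) {
                index.put(value, index.size());
            }
        }
        return index;
    }
}
```

Note that `Double.parseDouble` also accepts inputs such as "NaN" or "1d", so a stricter check may be needed for messy data.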
Nice lib - seems to have all the common algos.
Question: which sections are most relevant to image classification? Given a large dataset where each image is, say, 28x28 pixels, I'd like to compare a new image against the dataset to determine its type. Let's take traffic signs for the sake of argument.
Hi Edward,
I am looking for a plotting library that can draw a scatter plot from a big dataset (2 million rows). Does your visualization library support plotting data of that size?
Thanks,
public List<ClassificationDataSet> stratSet(int folds, Random rnd)
{
    ArrayList<ClassificationDataSet> cvList = new ArrayList<ClassificationDataSet>();
    IntList rndOrder = new IntList();
    int curFold = 0;
    for(int c = 0; c < getClassSize(); c++)
    {
        List<DataPoint> subPoints = getSamples(c);
        rndOrder.clear();
        ListUtils.addRange(rndOrder, 0, subPoints.size(), 1);
        Collections.shuffle(rndOrder, rnd);
        for(int i : rndOrder)
        {
            cvList.get(curFold).datapoints.add(subPoints.get(i));
            cvList.get(curFold).category.add(c);
            curFold = (curFold + 1) % folds;
        }
    }
    return cvList;
}
I added this code, so that cvList holds one data set per fold before cvList.get(curFold) is called:
while(cvList.size() < folds)
{
    ClassificationDataSet clone = new ClassificationDataSet(numNumerVals, categories, predicting.clone());
    cvList.add(clone);
}
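The same idea can be shown in isolation with plain collections: every fold has to exist before the round-robin assignment starts calling get(curFold). The sketch below works on class labels and row indices only; `StratifiedFolds` is a hypothetical helper and not the JSAT implementation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

final class StratifiedFolds {
    // Returns, for each fold, the indices of the rows assigned to it.
    static List<List<Integer>> split(int[] labels, int numClasses, int folds, Random rnd) {
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < folds; f++) {
            result.add(new ArrayList<>()); // create every fold up front, otherwise get() below fails
        }
        int curFold = 0;
        for (int c = 0; c < numClasses; c++) {
            List<Integer> members = new ArrayList<>();
            for (int i = 0; i < labels.length; i++) {
                if (labels[i] == c) {
                    members.add(i);
                }
            }
            Collections.shuffle(members, rnd);
            // deal the shuffled members of this class out round-robin, continuing where the last class stopped
            for (int idx : members) {
                result.get(curFold).add(idx);
                curFold = (curFold + 1) % folds;
            }
        }
        return result;
    }
}
```

Because curFold carries over between classes, fold sizes differ by at most one row overall, and each class is spread as evenly as possible.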
Hi, I would like to take parts of JSAT, cross-compile them using GWT (your pure-Java style will greatly help) and use them in a (maybe one day commercial) project. Why not just use MIT, BSD or Apache 2.0 license? Thanks for considering!
Notes for myself on some refactoring I want to do at some point. Probably going to do these all at once when I move to Java 8
I saw your old article on topic modeling http://jsatml.blogspot.com/2014/06/stochastic-lda.html
and was wondering how easy it was to take a text file of text-document-per-row and try out your LDA model? I've been trying mallet and gensim, but both are very specific for how you have to prepare your data. I was hoping for some snippet from your test.
I saw the update(List<Vec>) method. I have a List<String> of documents (multiple sentences, plain text, with punctuation) which I want to turn into a List<jsat.linear.Vec>. What is the best way to get from A to B?
It isn't quite the DataSet from Loading-text-data-and-Spam-Classification; maybe HashedTextDataLoader.java?
I checked out OnlineLDAsviTest.java but it didn't bridge the two.
Do you have a snippet that would test the topic modeling starting from a list of Strings?
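To illustrate what the "A to B" step involves, here is a plain-Java sketch of the hashing trick: each document is tokenized and every token is hashed into one of a fixed number of buckets, giving a sparse count vector. `HashedBagOfWords` is a made-up class that does not call any JSAT API; turning the resulting maps into jsat.linear.Vec instances is left open.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

final class HashedBagOfWords {
    // For each document: bucket index -> number of tokens hashed into that bucket.
    static List<Map<Integer, Integer>> vectorize(List<String> docs, int dimensions) {
        List<Map<Integer, Integer>> out = new ArrayList<>();
        for (String doc : docs) {
            Map<Integer, Integer> counts = new TreeMap<>();
            // split on anything that is not a letter or digit, which also drops punctuation
            for (String token : doc.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{Nd}]+")) {
                if (token.isEmpty()) {
                    continue;
                }
                int bucket = Math.floorMod(token.hashCode(), dimensions);
                counts.merge(bucket, 1, Integer::sum);
            }
            out.add(counts);
        }
        return out;
    }
}
```

Hashing avoids building a vocabulary first, at the cost of occasional collisions between unrelated words; more buckets means fewer collisions.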
class DataSet:
    /**
     * Creates a matrix from the data set, where each row represents a data
     * point, and each column is one of the numeric values from the data set.
     *
     * This matrix can be altered and will not affect any of the values in the data set.
     *
     * @return a matrix of the data points.
     */
    public Matrix getDataMatrix()
    {
        // modified by me: the dense matrix runs out of memory for text features
        int len = this.getNumNumericalVars();
        Vec first = getDataPoint(0).getNumericalValues();
        if(!first.isSparse())
        {
            DenseMatrix matrix = new DenseMatrix(this.getSampleSize(), len);
            for(int i = 0; i < getSampleSize(); i++)
            {
                Vec row = getDataPoint(i).getNumericalValues();
                for(int j = 0; j < row.length(); j++)
                    matrix.set(i, j, row.get(j));
            }
            return matrix;
        }
        else
        {
            // keep sparse rows sparse instead of copying them into a dense matrix
            SparseVector[] rows = new SparseVector[this.getSampleSize()];
            for(int i = 0; i < getSampleSize(); i++)
            {
                Vec row = getDataPoint(i).getNumericalValues();
                rows[i] = (SparseVector)row.clone();
            }
            SparseMatrix matrix = new SparseMatrix(rows);
            return matrix;
        }
    }
Is there a way to only use certain comparisons or change weighting so that inconclusive classifications aren't being accounted for?
If I'm confusing you, here's what I mean (Screenshot from weka): http://prnt.sc/ea3ogz
The plots I boxed in red are okay for classification and I want to use them, while some (in black) don't classify well (or at all, for that matter), and therefore I don't want to use those. How would I go about doing that?
My data setup is five numerical attributes and a class. I'm using an LVQ classifier.
Even though it identifies as a Linux OS, you can't call binaries. I'm guessing the same might be true (sometimes) for containers. IMHO better to lower the log severity so as not to clutter logs.
https://github.com/EdwardRaff/JSAT/blob/master/JSAT/src/jsat/utils/SystemInfo.java
Thank you for sharing the great job you have done with the community.
While the build works with Maven, I had trouble compiling the code from Eclipse (Luna, Java 1.8). For the methods loadSimple, loadClassification and loadRegression of class JSATData, compilation failed on the casts of the load method's result, e.g.:
return (SimpleDataSet) load(inRaw, true);
The compile error is: Cannot cast from DataSet<DataSet<DataSet>> to SimpleDataSet
I guess it's specific to my dev environment, so I'm not sending a pull request but just sharing a workaround with "double casting":
return (SimpleDataSet) ((DataSet<?>)load(inRaw, true));
Hi,
I would like to save and load the result of the RandomForest training.
When I used Java serialization I got an Exception, since getParam in the Parameter class creates a Parameter object that contains a Method field member (the getMethod variable is final and saved as part of the created Parameter object). The same happened when I used FST.
I also tried Kryo but got an Exception, since RandomDecisionTree does not have a no-arg constructor.
I have millions of training data points, so I can't retrain every time I want to use the model. What should I do?
Thanks.
Not completely sure about this, but shouldn't the "return" statement at line 120 of jsat.datatransform.PCA be a "break" statement? Otherwise P may never get initialized.
Also, the "scores" list declared at line 85 is never used.
Checking out your project. Looks promising!
Going through your wiki, I noticed you mention the class CategoryPlot.
However, there doesn't seem to be such a class in your repo.
To make the library easier to use, would it be possible to have an interface like the ones other libraries offer?
You can look at "Apache Commons Math", where points implement the Clusterable interface.
Here some reference:
Clustering algorithms and distance measures
Clusterable
DBSCANClusterer
Hi,
I have a 1D array of values. I want to plot the kernel density estimate of my array, and I want to play with the bandwidth and so on. I have been using MATLAB, where there is a function called ksdensity. Now I want to create a kernel density estimate using Python. The existing algorithm does not give me the same results as MATLAB. Could you please help me? How can I find the kernel density estimate of a 1D array? Thank you.
Hi,
Is there a way to configure JSAT so that ExtraTrees handles missing values?
Thanks
While loading an ARFF file, there's a bug with the naming of numeric variables:
Line 199: There is no need to use the helper variable k.
Simple Example: The list "variableNames" contains "A", "B", "C". "A" is a categorical variable.
Inside your loop, you check if the variable is numeric, so in the first iteration no name will be set, i will be 1 and k is 0. In the second iteration you use k to get name. Normally this should be "B", but because k is 0, you will get "A", which is wrong.
So just remove k and use i and everything will be fine.
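A stripped-down illustration of the indexing point, using the A/B/C example above. This is not the ARFFLoader source; `NumericNames` is a hypothetical stand-in that only shows why the lookup should use the attribute's own index i.

```java
import java.util.ArrayList;
import java.util.List;

final class NumericNames {
    // Collects the names of the numeric attributes in declaration order.
    static List<String> numericNames(List<String> variableNames, boolean[] isNumeric) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < variableNames.size(); i++) {
            if (isNumeric[i]) {
                // index with i: a separate counter k that only advances on numeric
                // attributes would lag behind i and pick up the name of an earlier attribute
                names.add(variableNames.get(i));
            }
        }
        return names;
    }
}
```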
I did a bad thing :) and used reflection to try to instantiate, train, and test every classifier you got, using brain-dead no-arg or simple arg constructors. With autoAddParameters, cause why not.
https://gist.github.com/salamanders/8e7054f62b53eb772895
It exploded all over the place, of course. Which is why I'm so interested in as many classifiers as possible having a best-practice default.
On the plus side: when it works, it creates some really fun results that I don't think are nearly as easy to produce with competing libraries!
I had 3 test errors.
OS: Win 10
E:\data\finance\docs\ML\lib\JSAT-master\JSAT>java -version
java version "1.7.0_79"
Java(TM) SE Runtime Environment (build 1.7.0_79-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.79-b02, mixed mode)
E:\data\finance\docs\ML\lib\JSAT-master\JSAT>javac -version
javac 1.7.0_79
Results :
Failed tests:
LinearBatchTest.testTrainWarmCMultieFast:175 Warm start wasn't faster? 30 vs 28
FastMathTest.testPow:152 null
NadarayaWatsonTest.testTrainC_RegressionDataSet_ExecutorService:103 null
Tests run: 1049, Failures: 3, Errors: 0, Skipped: 0
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 03:28 min
[INFO] Finished at: 2016-01-04T16:59:37-05:00
[INFO] Final Memory: 9M/244M
[INFO] ------------------------------------------------------------------------
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project JSAT: There are test failures.
[ERROR]
[ERROR] Please refer to E:\data\finance\docs\ML\lib\JSAT-master\JSAT\target\surefire-reports for the individual test results.
[ERROR] -> [Help 1]
org.apache.maven.lifecycle.LifecycleExecutionException: Failed to execute goal org.apache.maven.plugins:maven-surefire-plugin:2.18.1:test (default-test) on project JSAT: There are test failures.
Please refer to E:\data\finance\docs\ML\lib\JSAT-master\JSAT\target\surefire-reports for the individual test results.
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:212)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:153)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:145)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:116)
at org.apache.maven.lifecycle.internal.LifecycleModuleBuilder.buildProject(LifecycleModuleBuilder.java:80)
at org.apache.maven.lifecycle.internal.builder.singlethreaded.SingleThreadedBuilder.build(SingleThreadedBuilder.java:51)
at org.apache.maven.lifecycle.internal.LifecycleStarter.execute(LifecycleStarter.java:128)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:307)
at org.apache.maven.DefaultMaven.doExecute(DefaultMaven.java:193)
at org.apache.maven.DefaultMaven.execute(DefaultMaven.java:106)
at org.apache.maven.cli.MavenCli.execute(MavenCli.java:862)
at org.apache.maven.cli.MavenCli.doMain(MavenCli.java:286)
at org.apache.maven.cli.MavenCli.main(MavenCli.java:197)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.codehaus.plexus.classworlds.launcher.Launcher.launchEnhanced(Launcher.java:289)
at org.codehaus.plexus.classworlds.launcher.Launcher.launch(Launcher.java:229)
at org.codehaus.plexus.classworlds.launcher.Launcher.mainWithExitCode(Launcher.java:415)
at org.codehaus.plexus.classworlds.launcher.Launcher.main(Launcher.java:356)
Caused by: org.apache.maven.plugin.MojoFailureException: There are test failures.
Please refer to E:\data\finance\docs\ML\lib\JSAT-master\JSAT\target\surefire-reports for the individual test results.
at org.apache.maven.plugin.surefire.SurefireHelper.reportExecution(SurefireHelper.java:82)
at org.apache.maven.plugin.surefire.SurefirePlugin.handleSummary(SurefirePlugin.java:254)
at org.apache.maven.plugin.surefire.AbstractSurefireMojo.executeAfterPreconditionsChecked(AbstractSurefireMojo.java:854)
at org.apache.maven.plugin.surefire.AbstractSurefireMojo.execute(AbstractSurefireMojo.java:722)
at org.apache.maven.plugin.DefaultBuildPluginManager.executeMojo(DefaultBuildPluginManager.java:134)
at org.apache.maven.lifecycle.internal.MojoExecutor.execute(MojoExecutor.java:208)
... 20 more
Hey, I tried to load my new ARFF file, but I'm getting the following stack trace:
java.lang.NullPointerException
at jsat.ARFFLoader.loadArffFile(ARFFLoader.java:182)
My file:
test4.arff
https://hastebin.com/ewopaxateh.pl
My Maven dependency:
<dependency>
    <groupId>com.edwardraff</groupId>
    <artifactId>JSAT</artifactId>
    <version>0.0.6</version>
</dependency>
My code:
File in = new File("test4.arff");
SimpleDataSet simpleDataSet = ARFFLoader.loadArffFile(new FileReader(in));
This seems perfect for annotations:
@TunableParameter(
name="RBFKernel_Sigma",
minValue=0.001,
maxValue=2_000,
startValue=1832,
tunePriority=TunableParameterPriority.HIGH
)
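A sketch of how such an annotation could be declared and read back via reflection. Everything here (`TunableParameter`, `TunableParameterPriority`, `RbfKernelExample`, `TunableScanner`) is hypothetical and mirrors the proposal above; none of it exists in JSAT as far as this thread shows.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Field;
import java.util.ArrayList;
import java.util.List;

enum TunableParameterPriority { LOW, MEDIUM, HIGH }

// RUNTIME retention is required so a tuner can discover the annotation via reflection.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.FIELD)
@interface TunableParameter {
    String name();
    double minValue();
    double maxValue();
    double startValue();
    TunableParameterPriority tunePriority() default TunableParameterPriority.MEDIUM;
}

final class RbfKernelExample {
    @TunableParameter(
        name = "RBFKernel_Sigma",
        minValue = 0.001,
        maxValue = 2_000,
        startValue = 1832,
        tunePriority = TunableParameterPriority.HIGH
    )
    double sigma = 1832;

    double untouched = 1.0; // not annotated, so a tuner would skip it
}

final class TunableScanner {
    // Lists a one-line description of every annotated field of the given class.
    static List<String> describe(Class<?> type) {
        List<String> out = new ArrayList<>();
        for (Field field : type.getDeclaredFields()) {
            TunableParameter p = field.getAnnotation(TunableParameter.class);
            if (p != null) {
                out.add(p.name() + " in [" + p.minValue() + ", " + p.maxValue() + "], start "
                        + p.startValue() + ", priority " + p.tunePriority());
            }
        }
        return out;
    }
}
```

A search procedure could use the same scan to collect ranges and start values instead of relying on hand-written parameter lists.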
Hi, crazy library, I can't believe how much stuff you've implemented yourself. Must have been terrific for understanding all these algorithms.
I'm new to ML and am struggling to find pure Java machine learning libraries. JSAT seems to fit the bill, but testing seems thin on the ground in places. What is your impression of the reliability of the classifiers, for instance? I was going to start with a DecisionTree but found no tests, and thus no sample code, and am curious how healthy you think this classifier is.
In general, do you consider this library ready for use in a production environment? If you have doubts, in what areas, and would you mind documenting them in the README?
Thanks!
I got it working... but it was brutal, about 300 lines of code. I feel like I did it the hard way, but I wasn't sure if there was an easier way after reading the CSV parser code.
Is there an easier way to do this?
Can it be part of the library?
class TableDataLoader
class ColumnInfo
Are there any docs at all on JSAT?
I started with a purely numeric test project, but when I tried to adapt it to a Spark workflow we were trying to accelerate using tf-idf, it exploded. I jacked up n a little higher after looking at #33. I thought it looked kind of like #33, but it doesn't seem like a complete match.
public class App {
    public static void main(String[] args) {
        Random rng = new Random();
        rng.setSeed(0);
        int n = 1000;
        HashedTextVectorCreator htvc = new HashedTextVectorCreator(1000, new NaiveTokenizer(), new TfIdf());
        RegressionDataSet regressionDataSet = new RegressionDataSet(Stream
                .generate(UUID::randomUUID)
                .limit(n)
                .map(String::valueOf)
                .map(htvc::newText)
                .map(v -> new DataPointPair<>(new DataPoint(v), rng.nextDouble()))
                .collect(Collectors.toList()));
        RandomForest randomForest = new RandomForest();
        randomForest.train(regressionDataSet);
        double regress = randomForest.regress(new DataPoint(htvc.newText("asdf")));
        System.out.println(regress);
    }
}
/usr/lib/jvm/java-8-openjdk/bin/java -Didea.launcher.port=7533 -Didea.launcher.bin.path=/opt/idea-IU-145.597.3/bin -Dfile.encoding=UTF-8 -classpath /usr/lib/jvm/java-8-openjdk/jre/lib/charsets.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/cldrdata.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/dnsns.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/jaccess.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/localedata.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/nashorn.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/sunec.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/sunjce_provider.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/sunpkcs11.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/ext/zipfs.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/jce.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/jsse.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/management-agent.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/resources.jar:/usr/lib/jvm/java-8-openjdk/jre/lib/rt.jar:/home/automaticgiant/git/valet2k/jsat-test/target/classes:/home/automaticgiant/.m2/repository/com/edwardraff/JSAT/0.0.4/JSAT-0.0.4.jar:/opt/idea-IU-145.597.3/lib/idea_rt.jar com.intellij.rt.execution.application.AppMain asdf.App
Exception in thread "main" java.lang.NullPointerException
at jsat.text.wordweighting.TfIdf.indexFunc(TfIdf.java:95)
at jsat.linear.SparseVector.applyIndexFunction(SparseVector.java:882)
at jsat.text.wordweighting.TfIdf.applyTo(TfIdf.java:105)
at jsat.text.HashedTextVectorCreator.newText(HashedTextVectorCreator.java:52)
at jsat.text.HashedTextVectorCreator.newText(HashedTextVectorCreator.java:41)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193)
at java.util.stream.SliceOps$1$1.accept(SliceOps.java:204)
at java.util.stream.StreamSpliterators$InfiniteSupplyingSpliterator$OfRef.tryAdvance(StreamSpliterators.java:1356)
at java.util.stream.ReferencePipeline.forEachWithCancel(ReferencePipeline.java:126)
at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
at java.util.stream.ReduceOps$ReduceOp.evaluateSequential(ReduceOps.java:708)
at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
at java.util.stream.ReferencePipeline.collect(ReferencePipeline.java:499)
at asdf.App.main(App.java:32)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:144)
Process finished with exit code 1
jsat.clustering.FLAME
line 340
int len = Math.min(weights[i].length, knns.size()); // modified by me:
// if knns.size() is less than weights[i].length, an IndexOutOfBoundsException is thrown.
jsat.clustering.kmeans.NaiveKMeans
line 170
final CountDownLatch latch = new CountDownLatch(blockSize>0? SystemInfo.LogicalCores : extra);
// if the data set size is less than LogicalCores, the CountDownLatch never counts down to 0.
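The latch problem is a general one: the count has to equal the number of tasks that are actually started, or await() never returns. The sketch below shows the pattern with a made-up `LatchSizing.parallelSum` helper; it is not the NaiveKMeans code, only an illustration under the assumption that work is split into contiguous blocks.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

final class LatchSizing {
    static long parallelSum(int[] data, int threads) {
        // never wait for more workers than there are items to hand out
        int tasks = Math.min(threads, data.length);
        if (tasks == 0) {
            return 0;
        }
        ExecutorService pool = Executors.newFixedThreadPool(tasks);
        try {
            CountDownLatch latch = new CountDownLatch(tasks); // one count per submitted task
            AtomicLong total = new AtomicLong();
            int block = data.length / tasks;
            int extra = data.length % tasks;
            int start = 0;
            for (int t = 0; t < tasks; t++) {
                final int from = start;
                final int to = start + block + (t < extra ? 1 : 0); // first 'extra' tasks take one more item
                start = to;
                pool.execute(() -> {
                    try {
                        long sum = 0;
                        for (int i = from; i < to; i++) {
                            sum += data[i];
                        }
                        total.addAndGet(sum);
                    } finally {
                        latch.countDown(); // always count down, even if the work throws
                    }
                });
            }
            latch.await();
            return total.get();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for workers", e);
        } finally {
            pool.shutdown();
        }
    }
}
```

With three items and eight requested threads, only three tasks are created and the latch is sized to three, so the call returns instead of hanging.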
Hello,
Very impressed with this library!
Here is my problem: I have a bunch of photos, each with latitude, longitude, and time data points. I need to cluster them so as to form groups of photos that were taken in the same area, around the same time. I've done some research on clustering, and I think the Mean-Shift algorithm will suit this nicely, mainly because I do not know in advance how many clusters there will be.
I wanted to use your library to help me with that, and I'm having trouble getting started. I have looked through your wiki, but I'm still a little lost on how to begin. Any advice or code snippets would be awesome!
Thanks!