jmotif / sax Goto Github PK
View Code? Open in Web Editor NEWJava implementation of SAX, HOT-SAX, and EMMA
Home Page: http://jmotif.github.io/sax-vsm_site/
License: GNU General Public License v2.0
Java implementation of SAX, HOT-SAX, and EMMA
Home Page: http://jmotif.github.io/sax-vsm_site/
License: GNU General Public License v2.0
Impressive API. Thanks :)
Please let me suggest small performance improvements that I have verified to be effective with JMH.
net.seninp.jmotif.sax.TSProcessor
public double mean(double[] series) {
double res = 0D;
int count = 0;// series.length has it
for (double tp : series) {
res += tp;
count += 1;
}
if (count > 0) {
return res / ((Integer) count).doubleValue();// (double)count is faster as it does not need boxing
}
return Double.NaN;
}
So we get:
public double mean(double[] series) {
double res = 0D;
for (double tp : series) {
res += tp;
}
if (series.length > 0) {
return res / (double)series.length;
}
return Double.NaN;
}
and
public double mean(int[] series) {
double res = 0D;
for (int tp : series) {
res += (double) tp;
}
if (series.length > 0) {
return res / (double)series.length;
}
return Double.NaN;
}
and
public double var(double[] series) {
double res = 0D;
double mean = mean(series);
for (double tp : series) {
res += (tp - mean) * (tp - mean);
}
if (series.length > 0) {
return res / (double)series.length;
}
return Double.NaN;
}
and
public double stDev(double[] series) {
double num0 = 0D;
double sum = 0D;
for (double tp : series) {
num0 = num0 + tp * tp;
sum = sum + tp;
}
double len = (double)series.length;
return Math.sqrt((len * num0 - sum * sum) / (len * (len - 1)));
}
any python version?
Hi, I have a gps points dataset. It includes timestamp, position, velocity and other attributes.
I want to symbolize the velocity using the sax.
If the time interval between two consecutive point is not fixed, does sax support this case?
Ideally, create and throw SAXRuntimeException.
In the following case, a checked exception is thrown on a condition similar to where typically java.lang.IndexOutOfBoundsException is thrown.
However IndexOutOfBoundsException extends RuntimeException.
There is little chance in the following case to meaningfully catch the checked exception and recover from it. Once it is thrown, it is game over because of a programming error.
SAXProcessor
public char[] ts2string(double[] ts, int paaSize, double[] cuts, double nThreshold)
throws SAXException {
throws it because it calls
TSProcessor
public double[] paa(double[] ts, int paaSize) throws SAXException {
if (len < paaSize) {
throw new SAXException("PAA size can't be greater than the timeseries size.");
}
What one ends up doing as a developer using this library is to ring fence the exception in a way like this:
final char[] ts2string;
try {
ts2string = sp.ts2string(windowSubseries, paaSize, na.getCuts(alphabetSize), normalizationThreshold);
} catch (SAXException ex) {
throw new RuntimeException(ex);
}
which disrupts the work flow when writing software.
My request is motivated by the need to calculate the slope of a pattern.
So as to not let this request be too specific, it might be more generally useful to convert a pattern to a PAA (double[]) which can then be used to do calculations on it.
PAA conversion like SAXProcessor#ts2string() but in reverse.
A while ago I read a bit about SAX, and played around with different implementation options. The results have not been published. This was just some recreational thing. I now reviewed this code, and thought about bringing it into a publishable shape, but a quick websearch about java sax time series
led me to your implementation - and I'll probably not put any efforts into my implementation, because it most likely won't be able to compete with your library anyhow.
However, I was curious about how you implemented the breakpoint- and distance matrix computations. And ... admittedly, when I saw the giant arrays and switches in https://github.com/jMotif/SAX/blob/master/src/main/java/net/seninp/jmotif/sax/alphabet/NormalAlphabet.java , I twitched a little ;-)
The following is an implementation of your Alphabet
class that does the computation of the cuts and the distance matrix generically, on the fly, for arbitrary alphabet sizes.
package net.seninp.jmotif.sax.alphabet;
import net.seninp.jmotif.sax.SAXException;
public class GeneralAlphabet extends Alphabet
{
@Override
public Integer getMaxSize()
{
return Integer.MAX_VALUE;
}
@Override
public double[] getCuts(Integer size) throws SAXException
{
return computeEquiprobableRegions(size);
}
@Override
public double[][] getDistanceMatrix(Integer size) throws SAXException
{
return computeDistanceMatrix(getCuts(size));
}
/**
* The square root of two
*/
private static final double SQRT_TWO = Math.sqrt(2);
/**
* Create an array that describes the breakpoints for dividing a
* Gaussian distribution into equiprobable regions.
*
* @param n The number of bisection points
* @return The array
*/
private static double[] computeEquiprobableRegions(int n)
{
double b[] = new double[n - 1];
for (int s = 1; s <= n - 1; s++)
{
double x = ((double) s / n);
b[s - 1] = probit(x);
}
return b;
}
/**
* The probit function.<br>
* <br>
* (See http://en.wikipedia.org/wiki/Probit)
*
* @param x The argument
* @return The result
*/
private static double probit(double x)
{
return SQRT_TWO * erfInv(2 * x - 1);
}
/**
* The inverse error function.<br>
* <br>
* (See http://en.wikipedia.org/wiki/Error_function)
*
* @param x The argument
* @return The result
*/
private static double erfInv(double x)
{
// From https://people.maths.ox.ac.uk/gilesm/files/gems_erfinv.pdf
double w = -Math.log((1.0 - x) * (1.0 + x));
double p = 0;
if (w < 5.0)
{
w = w - 2.5;
p = 2.81022636e-08;
p = 3.43273939e-07 + p * w;
p = -3.5233877e-06 + p * w;
p = -4.39150654e-06 + p * w;
p = 0.00021858087 + p * w;
p = -0.00125372503 + p * w;
p = -0.00417768164 + p * w;
p = 0.246640727 + p * w;
p = 1.50140941 + p * w;
}
else
{
w = Math.sqrt(w) - 3.0;
p = -0.000200214257;
p = 0.000100950558 + p * w;
p = 0.00134934322 + p * w;
p = -0.00367342844 + p * w;
p = 0.00573950773 + p * w;
p = -0.0076224613 + p * w;
p = 0.00943887047 + p * w;
p = 1.00167406 + p * w;
p = 2.83297682 + p * w;
}
return p * x;
}
/**
* Computes the MINDIST lookup matrix, as described in the SAX paper
*
* @param b The breakpoints, as computed with
* {@link #computeEquiprobableRegions(int)}
* @return The distance matrix
*/
private static double[][] computeDistanceMatrix(double b[])
{
int n = b.length + 1;
double result[][] = new double[n][n];
for (int r = 0; r < n; r++)
{
for (int c = r; c < n; c++)
{
double value = 0;
if (Math.abs(r - c) > 1)
{
value = b[Math.max(r, c) - 1] - b[Math.min(r, c)];
}
result[r][c] = value;
result[c][r] = value;
}
}
return result;
}
}
I'm not so sure abut the usage pattern of instances of this class. If the methods are called repeatedly and are performance critical, one could consider "caching" the results, so that the overhead for the computation will vanish, and the performance will eventually be the same as for the current, switch
-based implementation - roughly like that:
package net.seninp.jmotif.sax.alphabet;
import java.util.LinkedHashMap;
import java.util.Map;
import net.seninp.jmotif.sax.SAXException;
public class CachingAlphabet extends Alphabet
{
private final Alphabet delegate;
private final Map<Integer, double[]> cuts;
private final Map<Integer, double[][]> distanceMatrices;
CachingAlphabet(Alphabet delegate)
{
this.delegate = delegate;
this.cuts = new LinkedHashMap<Integer, double[]>();
this.distanceMatrices = new LinkedHashMap<Integer, double[][]>();
}
@Override
public Integer getMaxSize()
{
return delegate.getMaxSize();
}
@Override
public double[] getCuts(Integer size) throws SAXException
{
double result[] = cuts.get(size);
if (result == null)
{
result = delegate.getCuts(size);
cuts.put(size, result);
}
return result;
}
@Override
public double[][] getDistanceMatrix(Integer size) throws SAXException
{
double result[][] = distanceMatrices.get(size);
if (result == null)
{
result = delegate.getDistanceMatrix(size);
distanceMatrices.put(size, result);
}
return result;
}
}
To be used as
Alphabet alphabet = new CachingAlphabet(new GenericAlphabet());
Regardless of the possible caching, here's a quick test showing that the results delivered by this implementation are epsilon-equal to the results from your current implementation.
package net.seninp.jmotif.sax.alphabet;
import java.util.Locale;
import net.seninp.jmotif.sax.SAXException;
public class AlphabetTest
{
public static void main(String[] args) throws SAXException
{
NormalAlphabet a0 = new NormalAlphabet();
GeneralAlphabet a1 = new GeneralAlphabet();
int max = Math.min(a0.getMaxSize(), a1.getMaxSize());
for (int i = 2; i < max; i++)
{
double[] c0 = a0.getCuts(i);
double[] c1 = a1.getCuts(i);
System.out.println("For " + i);
//System.out.println("c1: "+Arrays.toString(c0));
//System.out.println("c0: "+Arrays.toString(c1));
double d0[][] = a0.getDistanceMatrix(i);
double d1[][] = a1.getDistanceMatrix(i);
//System.out.println("d0:\n" + createString(d0));
//System.out.println("d1:\n" + createString(d1));
System.out.println("maxAbsoluteDifference for cuts : " +
maxAbsoluteDifference(c0, c1));
System.out.println("maxAbsoluteDifference for distance matrix: " +
maxAbsoluteDifference(d0, d1));
}
}
private static double maxAbsoluteDifference(double a0[], double a1[])
{
double maxAbsoluteDifference = Double.NEGATIVE_INFINITY;
for (int i = 0; i < a0.length; i++)
{
double d = Math.abs(a0[i] - a1[i]);
maxAbsoluteDifference = Math.max(maxAbsoluteDifference, d);
}
return maxAbsoluteDifference;
}
private static double maxAbsoluteDifference(double a0[][], double a1[][])
{
double maxAbsoluteDifference = Double.NEGATIVE_INFINITY;
for (int i = 0; i < a0.length; i++)
{
maxAbsoluteDifference = Math.max(maxAbsoluteDifference,
maxAbsoluteDifference(a0[i], a1[i]));
}
return maxAbsoluteDifference;
}
private static String createString(double a[][])
{
StringBuilder sb = new StringBuilder();
for (int r=0; r<a.length; r++)
{
if (r > 0)
{
sb.append("\n");
}
for (int c=0; c<a[r].length; c++)
{
if (c > 0)
{
sb.append(" ");
}
sb.append(String.format(Locale.ENGLISH, "%5.5f", a[r][c]));
}
}
return sb.toString();
}
}
I considered opening this as a pull request, but think that it might be better to post this here, because there are several degrees of freedom for a possible integration. (You could even use this class to generate the lookup tables for your implementation, so that at least the 26 (instead of 20) letters can be covered). But of course, whether you use it or not is up to you.
If I use and publish this code in one of my libraries, it will be under the MIT/X11 license. But, from my perspective, the above code can be considered as "public domain" (one function refers to a paper, the link is in the code), so everybody may do with this whatever he wants.
Hi Pavel, are there any plans to turn this lib available in any maven repository?
Thanks.
Hi Senin,
Recently, I came across approach (cosine similarity) that tries to use k-shingles to help classify text documents based on similarity in subsequent string patterns?
My quick search found that you actually have implemented public classes in Java (in SAXProcessor.java that indeed represents timeseries into Shingles using:
However, my question to you is have you attempted classifying time series based on k-shingles ? If yes, does it fair well than SAX-VSM? Any other findings from such..pls advice.
Thanks,
KMB
Have you got any equivalent implementation (java API) for the function that read test time series (CSV) dataset row-by-row and then converts them into SAX wordbag with each bag corresponding to individual time series?
for (i in c(1:length(data_test[,1]))) {
print(paste(i))
series = data_test[i,]
Appreciate your quick help !
Thanks,
KMB
In SAXProcessor, the following loop prevents from processing the last windowSize elements of a series.
E.g. if the series is the same size of the window, no record is generated.
for (int i = 0; i < ts.length - windowSize; i++)
I guess this should be
for (int i = 0; i <= ts.length - windowSize; i++)
But they do.
I was starting from your code to create a SAX processor that operates reactively.
While coding PAA to operates on enumerables/observable collections I found something that I guess it's an error in TSProcessor PAA implementation
.
Take for example an array [1,2,3,4,5,6,7] and PAA size = 4.
The following R code (taken from this repo) generates what seems to me the correct output
x <- array(1:7, dim=c(7))
l <- length(x)
PAA_number = 4
PAA <- array(0, PAA_number)
for (i in 1:PAA_number) {
PAA[i] <- mean(x[round((i - 1) * l/PAA_number + 1):round(i * l/PAA_number)])
print(paste("segment",i,"[", # a debug output
round((i - 1) * l/PAA_number + 1),
":",round(i * l/PAA_number),"] ->",
PAA[i]))
}
Output is:
[1] "segment 1 [ 1 : 2 ] -> 1.5"
[1] "segment 2 [ 3 : 4 ] -> 3.5"
[1] "segment 3 [ 4 : 5 ] -> 4.5"
[1] "segment 4 [ 6 : 7 ] -> 6.5"
To the contrary, the TSProcessor paa function generates:
[0] = 1.4285714285714286
[1] = 3.1428571428571428
[2] = 4.8571428571428568
[3] = 6.5714285714285712
Can you check?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.