tslearn-team / tslearn
The machine learning toolkit for time series analysis in Python
Home Page: https://tslearn.readthedocs.io
License: BSD 2-Clause "Simplified" License
Referencing issue #44, DBA can be expressed in matrix and vectorized terms that might be more suited for numpy's vectorized operations. I think it would be worthwhile to explore this to see if we can get a considerable performance boost by expressing the function in this form.
Instead of removing the internals of the original algorithm, it might be worth considering adding a parameter to the function that takes values such as 'mm' or 'original' to differentiate which internal operation it uses: either the form in the original paper, or the MM expression of the algorithm.
The relevant section in the Schultz & Jain (2017) paper is 4.2, The Majorize-Minimize Mean Algorithm, and its respective Algorithm 2.
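A minimal sketch of what that interface could look like; the parameter name and both update helpers below are hypothetical placeholders, not tslearn's actual API:
import numpy as np

def dba(X, max_iter=30, method="original"):
    # Hypothetical dispatch sketch; both update rules are placeholders standing in
    # for the Petitjean update and the vectorized MM update (Schultz & Jain, 2017).
    def _petitjean_update(X, barycenter):
        return barycenter  # placeholder for the original DBA update

    def _mm_update(X, barycenter):
        return barycenter  # placeholder for the matrix/vectorized MM update

    updates = {"original": _petitjean_update, "mm": _mm_update}
    if method not in updates:
        raise ValueError("method must be 'original' or 'mm'")
    barycenter = X.mean(axis=0)  # naive initialization, for the sketch only
    for _ in range(max_iter):
        barycenter = updates[method](X, barycenter)
    return barycenter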
I am trying to fit ShapeletModel, passing an X array with data (dim=(72, 78, 6)) and a Y array with labels (dim=(72,)) to the fit method.
I am getting a shape error:
Traceback (most recent call last):
File "/home/me/repos/core/ts/experiments/supervised_ts_clustering.py", line 50, in
instance.fit(x_train, y_train)
File "/home/me/repos/core/ts/sota/supervised/shapelet.py", line 35, in fit
self.instance.fit(TimeSeriesScalerMinMax().fit_transform(np.nan_to_num(data)), labels)
File "/home/me/anaconda2/lib/python2.7/site-packages/tslearn/shapelets.py", line 288, in fit
verbose=self.verbose_level)
File "/home/me/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1598, in fit
validation_steps=validation_steps)
File "/home/me/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1183, in _fit_loop
outs = f(ins_batch)
File "/home/me/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2273, in call
**self.session_kwargs)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
[[Node: shapelets_0_2/Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](false_conv_0_2/convolution/Squeeze, shapelets_0_2/Reshape_1/shape)]]
Caused by op u'shapelets_0_2/Reshape_1', defined at:
File "/home/me/repos/core/ts/experiments/supervised_ts_clustering.py", line 50, in
instance.fit(x_train, y_train)
File "/home/me/repos/core/ts/sota/supervised/shapelet.py", line 35, in fit
self.instance.fit(TimeSeriesScalerMinMax().fit_transform(np.nan_to_num(data)), labels)
File "/home/me/anaconda2/lib/python2.7/site-packages/tslearn/shapelets.py", line 274, in fit
self._set_model_layers(ts_sz=sz, d=d, n_classes=n_classes)
File "/home/me/anaconda2/lib/python2.7/site-packages/tslearn/shapelets.py", line 375, in _set_model_layers
for di in range(d)]
File "/home/me/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 602, in call
output = self.call(inputs, **kwargs)
File "/home/me/anaconda2/lib/python2.7/site-packages/tslearn/shapelets.py", line 85, in call
xy = K.dot(x, K.transpose(self.kernel))
File "/home/me/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 991, in dot
xt = tf.reshape(x, [-1, x_shape[-1]])
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2619, in reshape
name=name)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
[[Node: shapelets_0_2/Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](false_conv_0_2/convolution/Squeeze, shapelets_0_2/Reshape_1/shape)]]
Process finished with exit code 1
Has anyone met the same error?
(originally posted as part of #22)
With Python 3 and tslearn 0.1.10.8, after loading all datasets and running d.load_dataset('ItalyPowerDemand'), I'm getting an error for:
>>> d.baseline_accuracy()
ValueError Traceback (most recent call last)
-> 1 d.baseline_accuracy()
~/anaconda2/envs/ts-env/lib/python3.6/site-packages/tslearn/datasets.py in baseline_accuracy(self, list_datasets, list_methods)
149 for m in perfs_dict.keys():
150 if m != "" and (list_methods is None or m in list_methods):
--> 151 d_out[perfs_dict[""]][m] = float(perfs_dict[m])
152 return d_out
153
ValueError: could not convert string to float:
while d.baseline_accuracy('ItalyPowerDemand') works as expected. I guess some input validation is missing...
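If the problem is an empty accuracy field in the parsed performance file, a minimal fix sketch for the loop shown in the traceback could be (names taken from the traceback; this is a guess, not the actual patch):
for m in perfs_dict.keys():
    if m != "" and (list_methods is None or m in list_methods):
        if perfs_dict[m] == "":
            continue  # skip missing entries instead of crashing on float("")
        d_out[perfs_dict[""]][m] = float(perfs_dict[m])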
leads to a FileNotFoundError: [Errno 2] No such file or directory: '/tmp/Adiac.zip' error message.
I'd recommend searching for the temporary directory using the tempfile module and its gettempdir method, or building a temporary directory using the TemporaryDirectory method. The TemporaryDirectory method seems like the safer and more future-proof bet.
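A sketch of both options (the Adiac.zip filename is just the one from the error above):
import os
import tempfile

# Option 1: platform-appropriate temp dir instead of a hard-coded '/tmp'
zip_path = os.path.join(tempfile.gettempdir(), "Adiac.zip")

# Option 2: a self-cleaning directory, removed when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "Adiac.zip")
    # ... download and extract the archive here ...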
Hello @rtavenar,
I was trying out the TimeSeriesKMeans model with variable-length time series.
Unfortunately, the model seems to fail when using "softdtw" as a metric.
Here is the code I tried:
import numpy
import random
from tslearn.clustering import TimeSeriesKMeans
seed = 0
numpy.random.seed(seed)
ts = list()
# random time series of variable length
ts.append(random.sample(range(1, 1000), 100))
ts.append(random.sample(range(1, 1000), 99))
ts.append(random.sample(range(1, 1000), 98))
# Soft-DTW-k-means
print("Soft-DTW k-means")
sdtw_km = TimeSeriesKMeans(
    n_clusters=2, metric="softdtw", verbose=True, random_state=seed)
y_pred = sdtw_km.fit_predict(ts)
This raises:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Any help would be appreciated, thanks.
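If I had to guess, the raw Python lists get padded with NaN up to the longest length by to_time_series_dataset, and those NaNs then trip sklearn's input validation. The padding is easy to inspect:
from tslearn.utils import to_time_series_dataset

X = to_time_series_dataset(ts)  # ts from the snippet above
print(X.shape)  # (3, 100, 1); the tails of the two shorter series are NaN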
I've been looking at different DTW implementations (e.g. the DTW R package, fastdtw, the UCR Suite) and came upon this package. I have some questions/suggestions:
See the docs.
This line here:
https://github.com/rtavenar/tslearn/blob/702bbb24b5a85ee0857717c60d39bb60cafca291/tslearn/datasets.py#L218-L219
fails when it can't extract the zip. I think this should be wrapped in a try/except, and maybe return a (None, None, None, None) tuple on failure, to match the documentation. The StarlightCurves, UWaveGestureLibraryX, and UWaveGestureLibraryAll datasets caused these exceptions for me.
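A sketch of the suggested guard (the function and variable names are illustrative, not the ones in datasets.py):
import zipfile

def _extract_dataset(zip_fname, target_dir):
    # Illustrative wrapper: fall back to the documented None-tuple on a bad archive
    try:
        zipfile.ZipFile(zip_fname).extractall(path=target_dir)
    except zipfile.BadZipFile:
        return None, None, None, None
    # ... proceed to load X_train, y_train, X_test, y_test ...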
If I run this from the command line:
from tslearn.shapelets import ShapeletModel
sm = ShapeletModel({1:1})
sm.get_params()
I get
Traceback (most recent call last):
File "run.py", line 3, in <module>
AttributeError: 'ShapeletModel' object has no attribute 'get_params'
However, if I run the file from Spyder in IPython, it works as expected. The same thing happens with sm.set_params(batchsize=128), but calling sm.fit(X, y) works from the command line as well as from the IPython console.
Strange, right? Do you have any idea what might cause this?
EDIT: For the record, not all estimators from tslearn are affected:
from tslearn.shapelets import ShapeletModel
from tslearn.clustering import TimeSeriesKMeans
tskm = TimeSeriesKMeans()
sm = ShapeletModel({1:1})
print(tskm.get_params())
print(sm.get_params())
{'dtw_inertia': False, 'max_iter': 50, 'max_iter_barycenter': 100, 'metric': 'euclidean', 'metric_params': None, 'n_clusters': 3, 'n_init': 1, 'random_state': None, 'tol': 1e-06, 'verbose': True}
Traceback (most recent call last):
File "run.py", line 8, in <module>
print(sm.get_params())
AttributeError: 'ShapeletModel' object has no attribute 'get_params'
I would like to save my trained shapelet model. Any help will be appreciated.
Thanks.
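No idea whether this is officially supported, but a first thing to try could be plain pickling, with a fallback to saving just the learned shapelets (the shapelets_ attribute). ShapeletModel wraps a Keras model, so pickling the whole estimator may fail:
import pickle
import numpy as np

try:
    with open("shapelet_model.pkl", "wb") as f:
        pickle.dump(sm, f)  # sm: a fitted ShapeletModel; may fail on the Keras graph
except (TypeError, pickle.PicklingError):
    np.save("shapelets.npy", sm.shapelets_)  # persist the learned shapelets only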
First of all, thank you so much for sharing this wonderful package!
You're making my master's thesis much more enjoyable :)
I'm getting a weird error and haven't been able to figure it out. Here's a minimal working example (I can call TimeSeriesKMeans(metric='softdtw').fit(X) directly, without Pipeline(), without problems).
I'm on Python 3.6, Windows 7 x64. Thanks in advance!
import numpy as np
from sklearn.pipeline import Pipeline
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset
X = to_time_series_dataset(np.array([1,2]))
pipeline = Pipeline([
    ('kmeans', TimeSeriesKMeans())
])
parameters = {
    'kmeans__metric': 'softdtw'
}
pipeline.set_params(**parameters)
pipeline.fit(X)
Traceback (most recent call last):
File "<ipython-input-8-c4ba89694fd5>", line 10, in <module>
pipeline.fit(X)
File "C:\...\sklearn\pipeline.py", line 250, in fit
self._final_estimator.fit(Xt, y, **fit_params)
File "C:\...\tslearn\clustering.py", line 531, in fit
self._fit_one_init(X_, x_squared_norms, rs)
File "C:\...\tslearn\clustering.py", line 460, in _fit_one_init
self._assign(X)
File "C:\...\tslearn\clustering.py", line 480, in _assign
dists = cdist_soft_dtw(X, self.cluster_centers_, gamma=self.gamma_sdtw)
AttributeError: 'TimeSeriesKMeans' object has no attribute 'gamma_sdtw'
I can't get dtw_barycenter_averaging to work as advertised. Even running the example from the docs breaks for me.
$ ipython --no-banner
In [1]: import tslearn
In [2]: tslearn.__version__
Out[2]: '0.1.18.3'
In [3]: from tslearn.barycenters import dtw_barycenter_averaging
In [4]: from tslearn.utils import to_time_series_dataset
...: X = to_time_series_dataset([[1, 2, 3, 4], [1, 2, 3], [2, 5, 6, 7, 8, 9]])
...:
In [5]: bar = dtw_barycenter_averaging(X, barycenter_size=3)
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-5-f093b7f127f3> in <module>()
----> 1 bar = dtw_barycenter_averaging(X, barycenter_size=3)
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/barycenters.pyc in dtw_barycenter_averaging(X, barycenter_size, init_barycenter, max_iter, tol, weights, verbose)
327 if verbose:
328 print("[DBA] epoch %d, cost: %.3f" % (it + 1, cost))
--> 329 barycenter = _petitjean_update_barycenter(X_, assign, barycenter_size, weights)
330 if abs(cost_prev - cost) < tol:
331 break
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/barycenters.pyc in _petitjean_update_barycenter(X, assign, barycenter_size, weights)
248 barycenter = numpy.zeros((barycenter_size, X.shape[-1]))
249 for t in range(barycenter_size):
--> 250 barycenter[t] = numpy.average(X[assign[0][t], assign[1][t]], axis=0, weights=weights[assign[0][t]])
251 return barycenter
252
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/numpy/lib/function_base.pyc in average(a, axis, weights, returned)
1138 if np.any(scl == 0.0):
1139 raise ZeroDivisionError(
-> 1140 "Weights sum to zero, can't be normalized")
1141
1142 avg = np.multiply(a, wgt, dtype=result_dtype).sum(axis)/scl
ZeroDivisionError: Weights sum to zero, can't be normalized
The error gets thrown by numpy.average inside _petitjean_update_barycenter on the second iteration through the loop. The relevant variables are as follows:
ipdb> l 246, 252
246
247 def _petitjean_update_barycenter(X, assign, barycenter_size, weights):
248 barycenter = numpy.zeros((barycenter_size, X.shape[-1]))
249 for t in range(barycenter_size):
--> 250 barycenter[t] = numpy.average(X[assign[0][t], assign[1][t]], axis=0, weights=weights[assign[0][t]])
251 return barycenter
252
ipdb> X
array([[[ 1.],
[ 2.],
[ 3.],
[ 4.],
[ nan],
[ nan]],
[[ 1.],
[ 2.],
[ 3.],
[ nan],
[ nan],
[ nan]],
[[ 2.],
[ 5.],
[ 6.],
[ 7.],
[ 8.],
[ 9.]]])
ipdb> assign
([[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2], [], []], [[0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 5], [], []])
ipdb> weights
array([ 1., 1., 1.])
ipdb> t
1
Hey, I'm having trouble installing this with pip, which I believe this commit might fix.
The latest version I see in the releases is v0.0.27, but the latest I see on pip is v0.0.25.
Could you update the pip package? Thanks! ⭐️
Could someone help me figure out this error, and/or what's happening here? How do I interpret it, and does it mean something is wrong with my data? I'm trying to use the TimeSeriesKMeans clustering function. I have pre-processed my data so that all examples are the same length.
My variable data is a numpy array of shape (3207, 5), and k is just the number of clusters, in this case k=6.
This is my function:
def cluster_timeseries(data, k):
    seed = 0
    np.random.seed(seed)
    X_train = to_time_series_dataset(data)
    X_train = TimeSeriesScalerMeanVariance().fit_transform(X_train)
    sz = X_train.shape[1]
    sdtw_km = TimeSeriesKMeans(n_clusters=k, metric="softdtw",
                               metric_params={"gamma_sdtw": .01},
                               verbose=True, random_state=seed)
    y_pred = sdtw_km.fit_predict(X_train)
    return sdtw_km, np.array(X_train), sz
I get this error:
File "01_replicate.py", line 258, in cluster_timeseries
y_pred = sdtw_km.fit_predict(X_train)
File "/Library/Python/2.7/site-packages/tslearn/clustering.py", line 558, in fit_predict
return self.fit(X, y).labels_
File "/Library/Python/2.7/site-packages/tslearn/clustering.py", line 517, in fit
X_ = to_time_series_dataset(X)
File "/Library/Python/2.7/site-packages/tslearn/utils.py", line 119, in to_time_series_dataset
max_sz = max([ts_size(to_time_series(ts)) for ts in dataset])
File "/Library/Python/2.7/site-packages/tslearn/utils.py", line 315, in ts_size
while not numpy.any(numpy.isfinite(ts_[sz - 1])):
IndexError: index -6 is out of bounds for axis 0 with size 5
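One thing worth checking: to_time_series_dataset interprets a 2-D array as a set of univariate series, so the (3207, 5) array becomes 3207 series of length 5. If that is not the intended layout, or if some rows end up entirely NaN after preprocessing, the size scan in ts_size can run off the end, which would match the IndexError:
import numpy as np
from tslearn.utils import to_time_series_dataset

data = np.random.randn(3207, 5)  # stand-in for the reported array
X_train = to_time_series_dataset(data)
print(X_train.shape)  # (3207, 5, 1): 3207 univariate series of length 5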
Currently, running the tests with
pytest --doctest-modules --ignore tslearn/docs/ --ignore tslearn/shapelets.py tslearn
takes approximately 2.5 min on my laptop, which is a while.
A quick profiling run with the --durations=0 option shows that the few tests below take most of the run time:
86.10s call tslearn/svm.py::tslearn.svm.TimeSeriesSVC
36.81s call tslearn/svm.py::tslearn.svm.TimeSeriesSVR
8.25s call tslearn/clustering.py::tslearn.clustering.TimeSeriesKMeans
7.15s call tslearn/clustering.py::tslearn.clustering.KShape
3.51s call tslearn/clustering.py::tslearn.clustering.GlobalAlignmentKernelKMeans
2.41s call tslearn/clustering.py::tslearn.clustering.silhouette_score
0.05s call tslearn/datasets.py::tslearn.datasets.UCR_UEA_datasets.baseline_accuracy
0.04s call tslearn/datasets.py::tslearn.datasets.UCR_UEA_datasets.load_dataset
0.03s call tslearn/generators.py::tslearn.generators.random_walk_blobs
Profiling e.g. TimeSeriesSVC with pytest-profiling indicates that soft_dtw is the bottleneck (cf. below). Maybe reducing the dataset size further could help?
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 97.182 97.182 runner.py:106(pytest_runtest_call)
1 0.000 0.000 97.182 97.182 doctest.py:106(runtest)
1 0.000 0.000 97.182 97.182 doctest.py:1838(run)
1 0.000 0.000 97.182 97.182 doctest.py:1418(run)
1 0.000 0.000 97.178 97.178 doctest.py:1272(__run)
12 0.000 0.000 97.177 8.098 {built-in method builtins.exec}
5 0.000 0.000 97.161 19.432 base.py:358(_compute_kernel)
5 0.000 0.000 97.161 19.432 svm.py:33(<lambda>)
5 0.000 0.000 97.160 19.432 metrics.py:262(cdist_gak)
5 0.001 0.000 97.146 19.429 metrics.py:305(<listcomp>)
8000 0.331 0.000 97.059 0.012 metrics.py:221(gak)
8000 9.881 0.001 96.540 0.012 metrics.py:561(soft_dtw)
8000 0.124 0.000 67.245 0.008 metrics.py:693(compute)
8000 67.122 0.008 67.122 0.008 {tslearn.soft_dtw_fast._soft_dtw}
2 0.000 0.000 34.452 17.226 svm.py:189(predict_proba)
2 0.000 0.000 34.449 17.224 base.py:593(_predict_proba)
2 0.000 0.000 34.449 17.224 base.py:635(_dense_predict_proba)
Summary: stochastically explore the DTW averaging space. It's a useful algorithm for online settings and larger sample sizes. The paper is a bit of work to get through, as the authors rigorously detail a connection between the DTW sample mean and nonsmooth optimization methods for their proposed algorithm. They also detail a non-stochastic subgradient method and present a vectorized version of DBA that might provide additional performance benefits (I'll create a separate issue for refactoring DBA to that).
I think this package would benefit greatly from both the Subgradient (SG) and Stochastic Subgradient (SSG) methods.
Original Paper:
Nonsmooth Analysis and Subgradient Methods for Averaging in Dynamic Time Warping Spaces
Relevant sections are 4.1, The Subgradient Mean Algorithm, and 4.3, The Stochastic Subgradient Mean Algorithm, together with their respective Algorithms 1 and 3.
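For reference, a condensed sketch of the SSG update, based on my reading of Algorithm 3; the step-size schedule and initialization are simplified, and equal-length series are assumed:
import numpy as np
from tslearn.metrics import dtw_path

def ssg_mean(X, n_epochs=10, eta=0.05, seed=0):
    # Sketch of the stochastic subgradient mean (Schultz & Jain 2017, Alg. 3);
    # X: array of shape (n_ts, sz, d) with equal-length series, for simplicity.
    rng = np.random.RandomState(seed)
    z = X[rng.randint(len(X))].copy()  # initialize from a random sample
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            path, _ = dtw_path(z, X[i])
            grad = np.zeros_like(z)
            for a, b in path:
                grad[a] += 2.0 * (z[a] - X[i][b])  # subgradient of the squared DTW term
            z -= eta * grad
    return z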
Locally-Regularized DTW sounds really interesting. I'm familiar with regularization in regression problems, but I am not sure how it fits into this context. I am looking forward to your presentation on this topic!
Ben
max_iter needs to be set to an integer; according to the docs and the previous implementation, it should be 30.
After a first glance at the package, I was wondering: does tslearn require uniformly sampled, equal-sized time series?
Hello guys,
First of all, thanks for this great project!
Right now I am heavily using the k-Shape algorithm for clustering (imho) huge amounts of data:
n = 5,000,000; k = 6, 8, 10, 12; m = 56
O(max{n·k·m·log(m), n·m², k·m³}) (complexity of k-Shape from the original paper)
This takes about 150 hours on an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz. Sadly, the algorithm only uses one core. How can we fix this? How can we make this faster? I am quite new to Python, but maybe you have an idea?
Thanks in advance =)
I received the following by e-mail, and I prefer to deal with it through a GitHub issue.
Dear Dr Tavenard
Today I started with using tslearn.
My small home project is about trying to cluster time series data. I have data from my own cycling: various rides I made. From each ride file I extracted one time series (namely wattage). My rides are of various lengths, so my time series are of various lengths.
I have already put my rides in the right data format for tslearn using:
>>> from tslearn.utils import to_time_series_dataset
>>> formatted_dataset = to_time_series_dataset(appended_data_Garmin)
As expected, I get NaN filling up all the data points, given the longest time series I have.
However, when I try to run the simple:
>>> from tslearn.clustering import TimeSeriesKMeans
>>> km = TimeSeriesKMeans(n_clusters=3, metric="dtw", max_iter=5, verbose=False, random_state=0)
>>> km.fit(formatted_dataset)
I get:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Do you have any idea what might be causing this issue?
A matrix profile is a data structure built from a similarity join of two time series. The resulting data structure produces two objects: the matrix profile itself (a vector of nearest-neighbor distances) and the corresponding matrix profile index.
Once the data structure is built, it can produce responses to queries such as motif discovery, discord (anomaly) detection, and similarity search.
Resources:
from tslearn.clustering import KShape,TimeSeriesKMeans
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/tslearn-0.1.14-py2.7-macosx-10.11-intel.egg/tslearn/clustering.py", line 13, in <module>
from tslearn.metrics import cdist_gak, cdist_dtw, cdist_soft_dtw, dtw
File "/Library/Python/2.7/site-packages/tslearn-0.1.14-py2.7-macosx-10.11-intel.egg/tslearn/metrics.py", line 8, in <module>
from tslearn.soft_dtw_fast import _soft_dtw, _soft_dtw_grad, _jacobian_product_sq_euc
File "__init__.pxd", line 164, in init tslearn.soft_dtw_fast
ValueError: numpy.dtype has the wrong size, try recompiling. Expected 88, got 96
>>>
Re-compiled, re-installed, and now:
from tslearn.clustering import KShape,TimeSeriesKMeans
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "tslearn/clustering.py", line 13, in <module>
from tslearn.metrics import cdist_gak, cdist_dtw, cdist_soft_dtw, dtw
File "tslearn/metrics.py", line 8, in <module>
from tslearn.soft_dtw_fast import _soft_dtw, _soft_dtw_grad, _jacobian_product_sq_euc
ImportError: No module named soft_dtw_fast
You've got a soft_dtw_fast.c and a soft_dtw_fast.pyx -> is there supposed to also be a soft_dtw_fast.py?
It would be nice if it were possible to download a specific (e.g. small) dataset when running
d = UCR_UEA_datasets()
X_train, y_train, X_test, y_test = d.load_dataset('<some-id>')
as opposed to downloading everything and caching it. In particular, that would make it possible to use this in some gallery example as well.
Edit: moved part of this comment to a separate issue #23 as requested.
It would be nice to allow the user to evaluate the quality of a clustering by providing the equivalent of the silhouette score (or related metric) for time series clustering.
The following code is from the KNeighborsTimeSeriesClassifier example; when changing the classifier to SVC, an error occurs when using the pipeline.
Version: 0.1.18.3
from tslearn.utils import to_time_series,to_time_series_dataset
from tslearn.clustering import TimeSeriesKMeans,GlobalAlignmentKernelKMeans,silhouette_score
from tslearn.svm import TimeSeriesSVC,TimeSeriesSVR
from tslearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from tslearn.generators import random_walk_blobs
from tslearn.preprocessing import TimeSeriesScalerMinMax
from tslearn.neighbors import KNeighborsTimeSeriesClassifier,KNeighborsTimeSeries
from tslearn.piecewise import SymbolicAggregateApproximation
n_ts_per_blob,sz,d,n_blobs=20,100,1,2
x,y=random_walk_blobs(n_ts_per_blob=n_ts_per_blob,sz=sz,d=d,n_blobs=n_blobs)
print('x,y, shape:',x.shape,y.shape)
print(len(x[0]),x[0][:10],y[0])
scaler=TimeSeriesScalerMinMax(min=0,max=1)
x_scaled=scaler.fit_transform(x)
print(x_scaled[0][:10],y[0])
print(n_ts_per_blob*n_blobs)
indices_shuffle=np.random.permutation(n_ts_per_blob*n_blobs)
print(indices_shuffle)
x_shuffle=x_scaled[indices_shuffle]
y_shuffle=y[indices_shuffle]
nofTrain=n_ts_per_blob*n_blobs//2
print(nofTrain)
x_train=x_shuffle[:nofTrain]
y_train=y_shuffle[:nofTrain]
x_test=x_shuffle[nofTrain:]
y_test=y_shuffle[nofTrain:]
print(y_shuffle)
print(y_train)
print(y_test)
sax_trans=SymbolicAggregateApproximation(n_segments=15,alphabet_size_avg=5)
svc=TimeSeriesSVC(sz=100,d=1) #KNeighborsTimeSeriesClassifier(n_neighbors=3,metric='dtw')
pip= svc # Pipeline(steps=[('sax',sax_trans),('knn',svc)]) #works with svc but not the pipeline
pip.fit(x_train,y_train)
predictsPIP=pip.predict(x_test)
print(predictsPIP)
print(y_test)
print(accuracy_score(y_test,predictsPIP))
I've tried a couple of the examples in the docs, but none of them appear to work. It looks like the data shapes are different from what the examples expect. For instance, in the DTW example:
sz = X_train.shape[1]
results in IndexError: tuple index out of range. Similarly, in the kernel k-means example, dataset_scaled[0, :, 0] yields an index error, suggesting that there are only 2 dimensions.
Running from today's master branch on Python 3.6.
Trying to get barycenter averaging working on some audio data, but can't even get the examples in the docs to work.
$ ipython --no-banner
In [1]: from tslearn.utils import to_time_series_dataset
...: X = to_time_series_dataset([[1, 2, 3, 4], [1, 2, 3], [2, 5, 6, 7, 8, 9]])
...:
In [2]: from tslearn.barycenters import softdtw_barycenter
...: from tslearn.utils import ts_zeros
...: initial_barycenter = ts_zeros(sz=5)
...: bar = softdtw_barycenter(X, init=initial_barycenter)
...:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-7221f8ce968c> in <module>()
2 from tslearn.utils import ts_zeros
3 initial_barycenter = ts_zeros(sz=5)
----> 4 bar = softdtw_barycenter(X, init=initial_barycenter)
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/barycenters.pyc in softdtw_barycenter(X, gamma, weights, method, tol, max_iter, init)
484 # The function works with vectors so we need to vectorize barycenter.
485 res = minimize(f, barycenter.ravel(), method=method, jac=True, tol=tol,
--> 486 options=dict(maxiter=max_iter, disp=False))
487 return res.x.reshape(barycenter.shape)
488 else:
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/scipy/optimize/_minimize.pyc in minimize(fun, x0, args, method, jac, hess, hessp, bounds, constraints, tol, callback, options)
448 elif meth == 'l-bfgs-b':
449 return _minimize_lbfgsb(fun, x0, args, jac, bounds,
--> 450 callback=callback, **options)
451 elif meth == 'tnc':
452 return _minimize_tnc(fun, x0, args, jac, bounds, callback=callback,
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/scipy/optimize/lbfgsb.pyc in _minimize_lbfgsb(fun, x0, args, jac, bounds, disp, maxcor, ftol, gtol, eps, maxfun, maxiter, iprint, callback, maxls, **unknown_options)
326 # until the completion of the current minimization iteration.
327 # Overwrite f and g:
--> 328 f, g = func_and_grad(x)
329 elif task_str.startswith(b'NEW_X'):
330 # new iteration
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/scipy/optimize/lbfgsb.pyc in func_and_grad(x)
276 else:
277 def func_and_grad(x):
--> 278 f = fun(x, *args)
279 g = jac(x, *args)
280 return f, g
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/scipy/optimize/optimize.pyc in function_wrapper(*wrapper_args)
290 def function_wrapper(*wrapper_args):
291 ncalls[0] += 1
--> 292 return function(*(wrapper_args + args))
293
294 return ncalls, function_wrapper
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/scipy/optimize/optimize.pyc in __call__(self, x, *args)
61 def __call__(self, x, *args):
62 self.x = numpy.asarray(x).copy()
---> 63 fg = self.fun(x, *args)
64 self.jac = fg[1]
65 return fg[0]
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/barycenters.pyc in <lambda>(Z)
481
482 if max_iter > 0:
--> 483 f = lambda Z: _softdtw_func(Z, X_, weights, barycenter, gamma)
484 # The function works with vectors so we need to vectorize barycenter.
485 res = minimize(f, barycenter.ravel(), method=method, jac=True, tol=tol,
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/barycenters.pyc in _softdtw_func(Z, X, weights, barycenter, gamma)
427 for i in range(len(X)):
428 D = SquaredEuclidean(Z, X[i])
--> 429 sdtw = SoftDTW(D, gamma=gamma)
430 value = sdtw.compute()
431 E = sdtw.grad()
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/metrics.pyc in __init__(self, D, gamma)
662 """
663 if hasattr(D, "compute"):
--> 664 self.D = D.compute()
665 else:
666 self.D = D
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/metrics.pyc in compute(self)
750 Distance matrix.
751 """
--> 752 return euclidean_distances(self.X, self.Y, squared=True)
753
754 def jacobian_product(self, E):
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/sklearn/metrics/pairwise.pyc in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
221 paired_distances : distances betweens pairs of elements of X and Y.
222 """
--> 223 X, Y = check_pairwise_arrays(X, Y)
224
225 if X_norm_squared is not None:
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/sklearn/metrics/pairwise.pyc in check_pairwise_arrays(X, Y, precomputed, dtype)
110 warn_on_dtype=warn_on_dtype, estimator=estimator)
111 Y = check_array(Y, accept_sparse='csr', dtype=dtype,
--> 112 warn_on_dtype=warn_on_dtype, estimator=estimator)
113
114 if precomputed:
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
420 % (array.ndim, estimator_name))
421 if force_all_finite:
--> 422 _assert_all_finite(array)
423
424 shape_repr = _shape_repr(array.shape)
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
41 and not np.isfinite(X).all()):
42 raise ValueError("Input contains NaN, infinity"
---> 43 " or a value too large for %r." % X.dtype)
44
45
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I assume the problem here is that to_time_series_dataset imputes NaNs for unequal-length time series data, but I don't know for sure.
I'm really excited about tslearn; it will be super helpful for a project I'm working on, if these hiccups can be sorted out.
It would be nice to have a method such as BOSS (cf. here) implemented in tslearn.
The local feature extraction step should be implemented as a TransformerMixin.
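A skeleton of what the transformer contract would look like (the class name and internals are hypothetical; the SFA/bag-of-words logic is elided):
from sklearn.base import BaseEstimator, TransformerMixin

class BOSSTransformer(BaseEstimator, TransformerMixin):
    # Skeleton only: fit() would learn the SFA discretization, transform() would
    # map each series to its bag-of-SFA-symbols histogram.
    def __init__(self, word_size=4, alphabet_size=4):
        self.word_size = word_size
        self.alphabet_size = alphabet_size

    def fit(self, X, y=None):
        # learn discretization boundaries here
        return self

    def transform(self, X):
        # return bag-of-words features here
        return X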
Hi,
Like MinMaxScaler in scikit-learn, a transform method is needed for TimeSeriesScalerMinMax.
A few notes: calling .fit() and having it return an average sequence feels off. .fit() feels like it would initialize the average sequence through some method and perform initial computations; the transform() method feels more right here, as it would take those computations and perform the transformation of the dataset into an average sequence. It should also expose max_iter, init_barycenter, and tol parameters.
For the KShape algorithm (or k-means), it could be interesting to guide the algorithm to convergence with user knowledge. Imagine the user knows that there are three clusters and has at least one sample of each class; it could be interesting to use these samples as a first barycenter during the learning phase.
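Hypothetical usage, assuming an init parameter accepting an array of seed series (mirroring sklearn.cluster.KMeans); tslearn may not support this yet:
from tslearn.generators import random_walks
from tslearn.clustering import TimeSeriesKMeans

X_train = random_walks(n_ts=30, sz=16, d=1)
seeds = X_train[[0, 10, 20]]  # one known sample per class
km = TimeSeriesKMeans(n_clusters=3, init=seeds)  # `init` taking an array is the assumption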
As stated in Cuturi and Blondel (2017), there is a clear relationship between the Global Alignment Kernel and soft-DTW.
The Global Alignment Kernel (GAK) has, up to now, used a custom implementation in tslearn, while soft-DTW relies on the implementation released by Mathieu Blondel.
GAK tests were built so as to get similar results to those output by Adrien Gaidon's wrapper [link].
Recently, I tried to get rid of GAK's implementation by just computing GAK from soft-DTW using the inline equation from Section 2.1 in (Cuturi & Blondel, 2017), and I got different results; the difference is not marginal.
This is what makes tslearn tests fail at the moment.
If anyone is willing to help on this, feel free!
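For anyone picking this up, the consistency check I would expect to pass looks roughly like this. The gamma = 2 * sigma**2 mapping between the two parametrizations is my assumption, and normalization terms may be exactly where the discrepancy hides:
import numpy as np
from tslearn.metrics import gak, soft_dtw

rng = np.random.RandomState(0)
x, y = rng.randn(15, 1), rng.randn(15, 1)
sigma = 1.0
gamma = 2 * sigma ** 2  # assumed correspondence between the parametrizations
lhs = gak(x, y, sigma=sigma)
rhs = np.exp(-soft_dtw(x, y, gamma=gamma) / gamma)
print(lhs, rhs)  # these currently disagree, which is the point of this issue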
Hi @rtavenar, just wondering if there is any interest in moving towards a pytest-based testing framework for tslearn, perhaps something that mirrors sklearn?
If so, this is something I have some experience in and would be more than happy to begin helping out with!
Hi @rtavenar,
Does the DTW distance calculation routine normalize the time series before calculating distances, or does the user have to pass pre-normalized sequences?
The DTW distance calculation routine seems to accept sequences of different lengths, which makes sense as it warps the sequences. However, I am wondering if this routine also does subsequence search? By subsequence search, I mean that the algorithm finds the optimal starting point and computes the warp path and distance from it.
Thanks
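For what it's worth, a naive way to emulate subsequence search on top of a plain DTW routine is to slide a window and keep the best match (not a tslearn feature, just a sketch):
import numpy as np
from tslearn.metrics import dtw

def subsequence_search(query, long_series):
    # Slide a window the size of the query and keep the best DTW match.
    m = len(query)
    best_dist, best_start = float("inf"), 0
    for start in range(len(long_series) - m + 1):
        d = dtw(query, long_series[start:start + m])
        if d < best_dist:
            best_dist, best_start = d, start
    return best_start, best_dist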
I'm trying to use k-means, KShape, and SAX in class methods I've written, but whenever those three methods are called they return nothing. I am compelled to call these methods directly in the main file.
It would be nice to make kNN searches faster where Sakoe-Chiba-constrained DTW is concerned, using LB_Keogh-based pre-filtering.
This should be implemented in the kneighbors method of class KNeighborsTimeSeriesMixin from module neighbors.
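A minimal sketch of the pruning loop, assuming tslearn's lb_keogh lower bound (for 1-NN; the real kneighbors implementation would generalize to k neighbors and to the Sakoe-Chiba-constrained DTW):
from tslearn.metrics import dtw, lb_keogh

def nn_with_pruning(query, candidates, radius=5):
    # Cheap LB_Keogh bound first; compute full DTW only when the bound can still win.
    best_dist, best_idx = float("inf"), -1
    for i, cand in enumerate(candidates):
        if lb_keogh(query, cand, radius=radius) >= best_dist:
            continue  # pruned: the true DTW distance cannot beat the current best
        d = dtw(query, cand)
        if d < best_dist:
            best_dist, best_idx = d, i
    return best_idx, best_dist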
Hello @rtavenar,
Thank you for fixing #38; unfortunately, something similar also happens for the silhouette score.
An example to keep it simple, even though it doesn't make much sense:
from tslearn.clustering import silhouette_score
from tslearn.metrics import soft_dtw
ts = [[1, 2, 3], [4, 5, 6, 7], [8, 7, 4, 3, 2, 1]]
labels = [1, 0, 1]
print(silhouette_score(ts, labels, metric=soft_dtw))
Error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Hi @rtavenar,
I am impressed with the number of algorithms you have already implemented in tslearn. Actually, I had the idea for the same package, even the same name :D :D. Some of these algorithms I implemented in the past.
I have been working with time series models for the last few years, and there is definitely a lack of Python tools or ecosystems for time series analysis. So, I designed the packages https://github.com/blue-yonder/tsfresh and https://github.com/MaxBenChrist/tspreprocess.
If you are interested, I would like to have a chat with you to hear your ideas and vision for tslearn. We have a really good development team in tsfresh. Maybe we find a way to synchronize our efforts and support you.
Best, Max
It would make sense to have metric learning algorithms dedicated to time series in tslearn.
A good start could be Garreau et al. (2014), but maybe other methods could make more sense.
Hi,
Thanks for your brilliant work building this library!
I am trying to use tslearn.preprocessing.TimeSeriesScalerMeanVariance for standardization, but it fails when one of my data samples is all zeros. Here is the example:
import numpy as np
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
t = np.array([[0, 0, 0], [1, 2, 3]])
TimeSeriesScalerMeanVariance().fit_transform(t)
the output for the above code is:
/lib/python3.6/site-packages/tslearn/preprocessing.py:153: RuntimeWarning: invalid value encountered in true_divide
X_[i, :, d] = (X_[i, :, d] - cur_mean) * self.std_ / cur_std + self.mu_
array([[[ nan],
[ nan],
[ nan]],
[[-1.22474487],
[ 0. ],
[ 1.22474487]]])
The issue is that the standard deviation of [0, 0, 0] is 0. Are there any good ways of working around this?
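Until that's handled in the library, a hand-rolled standardization that leaves constant series untouched is straightforward:
import numpy as np

t = np.array([[0, 0, 0], [1, 2, 3]], dtype=float)
mean = t.mean(axis=1, keepdims=True)
std = t.std(axis=1, keepdims=True)
std[std == 0] = 1.0  # constant series: avoid dividing by zero
t_scaled = (t - mean) / std
print(t_scaled)  # first row stays all zeros instead of becoming NaN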
Hello,
First of all, really great work on this repository, very helpful!
I would like to extract the "best" K shapelets from a set of time series. The lengths of these shapelets should be within the interval [min_len, max_len]. I was wondering how I best approach this, since the n_shapelets_per_size hyper-parameter has an impact on the output.
I currently see two alternatives: (1) call the algorithm with n_shapelets_per_size containing each length between min_len and max_len, with each value equal to K; in the end, iterate over all extracted shapelets and select the best K based on a metric such as information gain; (2) extract a larger pool of candidate shapelets and select the best K from that list.
Is either alternative better than the other? Or is there a third option that I currently cannot think of?
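For the first alternative, the dictionary could be built like this (a sketch; the selection metric is left out):
K, min_len, max_len = 5, 8, 16  # illustrative values
n_shapelets_per_size = {length: K for length in range(min_len, max_len + 1)}
# fit a ShapeletModel with this dict, then rank all extracted shapelets
# (e.g. by information gain) and keep the top K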
Thanks in advance,
Gilles
It would be nice to have shapelet-based classifiers implemented in tslearn. Maybe a good start would be Learning Shapelets from Grabocka et al. (cf. here).
When using this class, what are the available values for the "metric" parameter? Only "dtw"? Any recommendation if I wanted to use the Euclidean distance or, for example, the SAX distance, when using this classifier on a dataset with a SAX representation?
When we cluster series (for instance with KShape), we obtain centers for normalized data. It would be convenient to have a method to retrieve the centers as original time series.
Use docs/examples/plot_shapelet.py and modify y_train to be suitable for binary classification tasks. I think this can reproduce the bug.
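A hedged reproduction sketch based on that description (the dataset and hyper-parameters are mine, not the example's):
from tslearn.generators import random_walk_blobs
from tslearn.shapelets import ShapeletModel

X, y = random_walk_blobs(n_ts_per_blob=20, sz=64, d=1, n_blobs=2)  # binary labels
clf = ShapeletModel(n_shapelets_per_size={16: 2}, max_iter=10)
clf.fit(X, y)  # reportedly fails with two classes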
As can be seen in the gallery of examples, inertia increases for DBA k-means during fitting.
It should be checked whether the problem comes from the evaluation of inertia or from the optimization process itself.
See here.
Hi,
One of the important features that make time series learning different from learning on generic data is variable length, which may also make tslearn unique.
It would be really appreciated if there were a specific paragraph elaborating which algorithms can be used with variable-length time series in tslearn. For shapelets, for example, the related demo only shows how to deal with equal-length data, although the algorithm itself supports variable-length time series.
Thanks
I have not been able to build the docs under RTD with a tensorflow version higher than 1.4.
I get the following error message:
python /home/docs/checkouts/readthedocs.org/user_builds/tslearn/envs/latest/bin/sphinx-build -T -E -b readthedocs -d _build/doctrees-readthedocs -D language=en . _build/html
Running Sphinx v1.7.4
loading translations [en]... done
making output directory...
[autosummary] generating autosummary for: auto_examples/index.rst, auto_examples/plot_barycenter_interpolate.rst, auto_examples/plot_barycenters.rst, auto_examples/plot_dtw.rst, auto_examples/plot_kernel_kmeans.rst, auto_examples/plot_kmeans.rst, auto_examples/plot_kshape.rst, auto_examples/plot_lb_keogh.rst, auto_examples/plot_neighbors.rst, auto_examples/plot_sax.rst, ..., gen_modules/tslearn.utils.rst, gen_modules/utils/tslearn.utils.load_timeseries_txt.rst, gen_modules/utils/tslearn.utils.save_timeseries_txt.rst, gen_modules/utils/tslearn.utils.to_time_series.rst, gen_modules/utils/tslearn.utils.to_time_series_dataset.rst, gen_modules/utils/tslearn.utils.ts_size.rst, gettingstarted.rst, index.rst, reference.rst, variablelength.rst
Illegal instruction (core dumped)
It seems related to the CPUs at RTD not having AVX functionalities, but I guess there should be a way to make things work.
The thing is: keras' sigmoid calls tf.nn.sigmoid with a keyword argument axis that did not exist in TF 1.4; hence we should check which keras version could be OK.
Definitely, even if we choose option 1 for a short-term fix, we should investigate option 2, because we might need new keras functionalities in the future.