tslearn-team / tslearn
The machine learning toolkit for time series analysis in Python
Home Page: https://tslearn.readthedocs.io
License: BSD 2-Clause "Simplified" License
Referencing issue #44, DBA can be expressed in matrix and vectorized terms that might be more suited for numpy's vectorized operations. I think it would be worthwhile to explore this to see if we can get a considerable performance boost by expressing the function in this form.
Instead of removing the internals of the original algorithm, it might be worth considering adding a parameter to the function that takes values such as 'mm' or 'original' to differentiate which internal operation it uses: either the form in the original paper, or the MM expression of the algorithm.
The relevant section in the Schultz & Jain (2017) paper is 4.2, The Majorize-Minimize Mean Algorithm, and its respective Algorithm 2.
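A minimal sketch of what that interface could look like; the parameter name and both update helpers below are hypothetical placeholders, not tslearn's actual API:
import numpy as np

def dba(X, max_iter=30, method="original"):
    # Hypothetical dispatch sketch; both update rules are placeholders standing in
    # for the Petitjean update and the vectorized MM update (Schultz & Jain, 2017).
    def _petitjean_update(X, barycenter):
        return barycenter  # placeholder for the original DBA update

    def _mm_update(X, barycenter):
        return barycenter  # placeholder for the matrix/vectorized MM update

    updates = {"original": _petitjean_update, "mm": _mm_update}
    if method not in updates:
        raise ValueError("method must be 'original' or 'mm'")
    barycenter = X.mean(axis=0)  # naive initialization, for the sketch only
    for _ in range(max_iter):
        barycenter = updates[method](X, barycenter)
    return barycenter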
I am trying to fit ShapeletModel, passing an X array with data (dim=(72, 78, 6)) and a Y array with labels (dim=(72,)) to the fit method.
I am getting a shape error:
Traceback (most recent call last):
File "/home/me/repos/core/ts/experiments/supervised_ts_clustering.py", line 50, in
instance.fit(x_train, y_train)
File "/home/me/repos/core/ts/sota/supervised/shapelet.py", line 35, in fit
self.instance.fit(TimeSeriesScalerMinMax().fit_transform(np.nan_to_num(data)), labels)
File "/home/me/anaconda2/lib/python2.7/site-packages/tslearn/shapelets.py", line 288, in fit
verbose=self.verbose_level)
File "/home/me/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1598, in fit
validation_steps=validation_steps)
File "/home/me/anaconda2/lib/python2.7/site-packages/keras/engine/training.py", line 1183, in _fit_loop
outs = f(ins_batch)
File "/home/me/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 2273, in call
**self.session_kwargs)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 895, in run
run_metadata_ptr)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1124, in _run
feed_dict_tensor, options, run_metadata)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1321, in _do_run
options, run_metadata)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/client/session.py", line 1340, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
[[Node: shapelets_0_2/Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](false_conv_0_2/convolution/Squeeze, shapelets_0_2/Reshape_1/shape)]]
Caused by op u'shapelets_0_2/Reshape_1', defined at:
File "/home/me/repos/core/ts/experiments/supervised_ts_clustering.py", line 50, in
instance.fit(x_train, y_train)
File "/home/me/repos/core/ts/sota/supervised/shapelet.py", line 35, in fit
self.instance.fit(TimeSeriesScalerMinMax().fit_transform(np.nan_to_num(data)), labels)
File "/home/me/anaconda2/lib/python2.7/site-packages/tslearn/shapelets.py", line 274, in fit
self._set_model_layers(ts_sz=sz, d=d, n_classes=n_classes)
File "/home/me/anaconda2/lib/python2.7/site-packages/tslearn/shapelets.py", line 375, in _set_model_layers
for di in range(d)]
File "/home/me/anaconda2/lib/python2.7/site-packages/keras/engine/topology.py", line 602, in call
output = self.call(inputs, **kwargs)
File "/home/me/anaconda2/lib/python2.7/site-packages/tslearn/shapelets.py", line 85, in call
xy = K.dot(x, K.transpose(self.kernel))
File "/home/me/anaconda2/lib/python2.7/site-packages/keras/backend/tensorflow_backend.py", line 991, in dot
xt = tf.reshape(x, [-1, x_shape[-1]])
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_array_ops.py", line 2619, in reshape
name=name)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2630, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/home/me/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1204, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access
InvalidArgumentError (see above for traceback): Reshape cannot infer the missing input size for an empty tensor unless all specified input sizes are non-zero
[[Node: shapelets_0_2/Reshape_1 = Reshape[T=DT_FLOAT, Tshape=DT_INT32, _device="/job:localhost/replica:0/task:0/cpu:0"](false_conv_0_2/convolution/Squeeze, shapelets_0_2/Reshape_1/shape)]]
Process finished with exit code 1
Has anyone met the same error?
(originally posted as part of #22)
With Python 3 and tslearn 0.1.10.8, after loading all datasets and running d.load_dataset('ItalyPowerDemand'), I'm getting an error for:
>>> d.baseline_accuracy()
ValueError Traceback (most recent call last)
-> 1 d.baseline_accuracy()
~/anaconda2/envs/ts-env/lib/python3.6/site-packages/tslearn/datasets.py in baseline_accuracy(self, list_datasets, list_methods)
149 for m in perfs_dict.keys():
150 if m != "" and (list_methods is None or m in list_methods):
--> 151 d_out[perfs_dict[""]][m] = float(perfs_dict[m])
152 return d_out
153
ValueError: could not convert string to float:
while d.baseline_accuracy('ItalyPowerDemand') works as expected. I guess some input validation is missing...
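If the problem is an empty accuracy field in the parsed performance file, a minimal fix sketch for the loop shown in the traceback could be (names taken from the traceback; this is a guess, not the actual patch):
for m in perfs_dict.keys():
    if m != "" and (list_methods is None or m in list_methods):
        if perfs_dict[m] == "":
            continue  # skip missing entries instead of crashing on float("")
        d_out[perfs_dict[""]][m] = float(perfs_dict[m])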
leads to a FileNotFoundError: [Errno 2] No such file or directory: '/tmp/Adiac.zip' error message.
I'd recommend searching for the temporary directory using the tempfile module and its gettempdir method, or building a temporary directory using the TemporaryDirectory method. The TemporaryDirectory method seems like the safer and more future-proof bet.
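A sketch of both options (the Adiac.zip filename is just the one from the error above):
import os
import tempfile

# Option 1: platform-appropriate temp dir instead of a hard-coded '/tmp'
zip_path = os.path.join(tempfile.gettempdir(), "Adiac.zip")

# Option 2: a self-cleaning directory, removed when the block exits
with tempfile.TemporaryDirectory() as tmp_dir:
    zip_path = os.path.join(tmp_dir, "Adiac.zip")
    # ... download and extract the archive here ...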
Hello @rtavenar,
I was trying out the TimeSeriesKMeans model with variable-length time series.
Unfortunately, the model seems to fail when using "softdtw" as a metric.
Here is the code I tried:
import numpy
import random
from tslearn.clustering import TimeSeriesKMeans
seed = 0
numpy.random.seed(seed)
ts = list()
# random time series of variable length
ts.append(random.sample(range(1, 1000), 100))
ts.append(random.sample(range(1, 1000), 99))
ts.append(random.sample(range(1, 1000), 98))
# Soft-DTW-k-means
print("Soft-DTW k-means")
sdtw_km = TimeSeriesKMeans(
    n_clusters=2, metric="softdtw", verbose=True, random_state=seed)
y_pred = sdtw_km.fit_predict(ts)
This raises:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Any help would be appreciated, thanks.
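If I had to guess, the raw Python lists get padded with NaN up to the longest length by to_time_series_dataset, and those NaNs then trip sklearn's input validation. The padding is easy to inspect:
from tslearn.utils import to_time_series_dataset

X = to_time_series_dataset(ts)  # ts from the snippet above
print(X.shape)  # (3, 100, 1); the tails of the two shorter series are NaN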
I've been looking at different DTW implementations (e.g. the DTW R package, fastdtw, the UCR Suite) and came upon this package. I have some questions/suggestions:
See the docs.
This line here:
https://github.com/rtavenar/tslearn/blob/702bbb24b5a85ee0857717c60d39bb60cafca291/tslearn/datasets.py#L218-L219
fails when it can't extract the zip. I think this should be wrapped in a try/except, and maybe return a (None, None, None, None) tuple on failure, to match the documentation. The StarlightCurves, UWaveGestureLibraryX, and UWaveGestureLibraryAll datasets caused these exceptions for me.
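A sketch of the suggested guard (the function and variable names are illustrative, not the ones in datasets.py):
import zipfile

def _extract_dataset(zip_fname, target_dir):
    # Illustrative wrapper: fall back to the documented None-tuple on a bad archive
    try:
        zipfile.ZipFile(zip_fname).extractall(path=target_dir)
    except zipfile.BadZipFile:
        return None, None, None, None
    # ... proceed to load X_train, y_train, X_test, y_test ...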
If I run this from the command line:
from tslearn.shapelets import ShapeletModel
sm = ShapeletModel({1:1})
sm.get_params()
I get
Traceback (most recent call last):
File "run.py", line 3, in <module>
AttributeError: 'ShapeletModel' object has no attribute 'get_params'
However, if I run the file from Spyder in IPython, it works as expected. The same thing happens with sm.set_params(batchsize=128), but calling sm.fit(X, y) works from the command line as well as from the IPython console.
Strange, right? Do you have any idea what might cause this?
EDIT: For the record, not all estimators from tslearn are affected:
from tslearn.shapelets import ShapeletModel
from tslearn.clustering import TimeSeriesKMeans
tskm = TimeSeriesKMeans()
sm = ShapeletModel({1:1})
print(tskm.get_params())
print(sm.get_params())
{'dtw_inertia': False, 'max_iter': 50, 'max_iter_barycenter': 100, 'metric': 'euclidean', 'metric_params': None, 'n_clusters': 3, 'n_init': 1, 'random_state': None, 'tol': 1e-06, 'verbose': True}
Traceback (most recent call last):
File "run.py", line 8, in <module>
print(sm.get_params())
AttributeError: 'ShapeletModel' object has no attribute 'get_params'
I would like to save my trained shapelet model. Any help will be appreciated.
Thanks.
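No idea whether this is officially supported, but a first thing to try could be plain pickling, with a fallback to saving just the learned shapelets (the shapelets_ attribute). ShapeletModel wraps a Keras model, so pickling the whole estimator may fail:
import pickle
import numpy as np

try:
    with open("shapelet_model.pkl", "wb") as f:
        pickle.dump(sm, f)  # sm: a fitted ShapeletModel; may fail on the Keras graph
except (TypeError, pickle.PicklingError):
    np.save("shapelets.npy", sm.shapelets_)  # persist the learned shapelets only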
First of all, thank you so much for sharing this wonderful package!
You're making my master's thesis much more enjoyable :)
I'm getting a weird error and haven't been able to figure it out. Here's a minimal working example (I can call TimeSeriesKMeans(metric='softdtw').fit(X) directly, without Pipeline(), without problems).
I'm on Python 3.6, Windows 7 x64. Thanks in advance!
import numpy as np
from sklearn.pipeline import Pipeline
from tslearn.clustering import TimeSeriesKMeans
from tslearn.utils import to_time_series_dataset
X = to_time_series_dataset(np.array([1,2]))
pipeline = Pipeline([
    ('kmeans', TimeSeriesKMeans())
])
parameters = {
    'kmeans__metric': 'softdtw'
}
pipeline.set_params(**parameters)
pipeline.fit(X)
Traceback (most recent call last):
File "<ipython-input-8-c4ba89694fd5>", line 10, in <module>
pipeline.fit(X)
File "C:\...\sklearn\pipeline.py", line 250, in fit
self._final_estimator.fit(Xt, y, **fit_params)
File "C:\...\tslearn\clustering.py", line 531, in fit
self._fit_one_init(X_, x_squared_norms, rs)
File "C:\...\tslearn\clustering.py", line 460, in _fit_one_init
self._assign(X)
File "C:\...\tslearn\clustering.py", line 480, in _assign
dists = cdist_soft_dtw(X, self.cluster_centers_, gamma=self.gamma_sdtw)
AttributeError: 'TimeSeriesKMeans' object has no attribute 'gamma_sdtw'
I can't get dtw_barycenter_averaging to work as advertised. Even running the example from the docs breaks for me.
$ ipython --no-banner
In [1]: import tslearn
In [2]: tslearn.__version__
Out[2]: '0.1.18.3'
In [3]: from tslearn.barycenters import dtw_barycenter_averaging
In [4]: from tslearn.utils import to_time_series_dataset
...: X = to_time_series_dataset([[1, 2, 3, 4], [1, 2, 3], [2, 5, 6, 7, 8, 9]])
...:
In [5]: bar = dtw_barycenter_averaging(X, barycenter_size=3)
---------------------------------------------------------------------------
ZeroDivisionError Traceback (most recent call last)
<ipython-input-5-f093b7f127f3> in <module>()
----> 1 bar = dtw_barycenter_averaging(X, barycenter_size=3)
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/barycenters.pyc in dtw_barycenter_averaging(X, barycenter_size, init_barycenter, max_iter, tol, weights, verbose)
327 if verbose:
328 print("[DBA] epoch %d, cost: %.3f" % (it + 1, cost))
--> 329 barycenter = _petitjean_update_barycenter(X_, assign, barycenter_size, weights)
330 if abs(cost_prev - cost) < tol:
331 break
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/barycenters.pyc in _petitjean_update_barycenter(X, assign, barycenter_size, weights)
248 barycenter = numpy.zeros((barycenter_size, X.shape[-1]))
249 for t in range(barycenter_size):
--> 250 barycenter[t] = numpy.average(X[assign[0][t], assign[1][t]], axis=0, weights=weights[assign[0][t]])
251 return barycenter
252
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/numpy/lib/function_base.pyc in average(a, axis, weights, returned)
1138 if np.any(scl == 0.0):
1139 raise ZeroDivisionError(
-> 1140 "Weights sum to zero, can't be normalized")
1141
1142 avg = np.multiply(a, wgt, dtype=result_dtype).sum(axis)/scl
ZeroDivisionError: Weights sum to zero, can't be normalized
The error gets thrown by numpy.average inside _petitjean_update_barycenter on the second iteration through the loop. The relevant variables are as follows:
ipdb> l 246, 252
246
247 def _petitjean_update_barycenter(X, assign, barycenter_size, weights):
248 barycenter = numpy.zeros((barycenter_size, X.shape[-1]))
249 for t in range(barycenter_size):
--> 250 barycenter[t] = numpy.average(X[assign[0][t], assign[1][t]], axis=0, weights=weights[assign[0][t]])
251 return barycenter
252
ipdb> X
array([[[ 1.],
[ 2.],
[ 3.],
[ 4.],
[ nan],
[ nan]],
[[ 1.],
[ 2.],
[ 3.],
[ nan],
[ nan],
[ nan]],
[[ 2.],
[ 5.],
[ 6.],
[ 7.],
[ 8.],
[ 9.]]])
ipdb> assign
([[0, 0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 2, 2], [], []], [[0, 1, 2, 3, 0, 1, 2, 0, 1, 2, 3, 4, 5], [], []])
ipdb> weights
array([ 1., 1., 1.])
ipdb> t
1
Hey, I'm having trouble installing this with pip, which I believe this commit might fix.
The latest version I see in the releases is v0.0.27, but the latest I see on pip is v0.0.25.
Could you update the pip package? Thanks! ⭐️
Could someone help me figure out this error, and/or what's happening here? How do I interpret it, and does it mean something is wrong with my data? I'm trying to use the TimeSeriesKMeans clustering function. I have pre-processed my data so that all examples are the same length.
My variable data is a numpy array of shape (3207, 5), and k is just the number of clusters, in this case k=6.
This is my function:
def cluster_timeseries(data, k):
    seed = 0
    np.random.seed(seed)
    X_train = to_time_series_dataset(data)
    X_train = TimeSeriesScalerMeanVariance().fit_transform(X_train)
    sz = X_train.shape[1]
    sdtw_km = TimeSeriesKMeans(n_clusters=k, metric="softdtw",
                               metric_params={"gamma_sdtw": .01},
                               verbose=True, random_state=seed)
    y_pred = sdtw_km.fit_predict(X_train)
    return sdtw_km, np.array(X_train), sz
I get this error:
File "01_replicate.py", line 258, in cluster_timeseries
y_pred = sdtw_km.fit_predict(X_train)
File "/Library/Python/2.7/site-packages/tslearn/clustering.py", line 558, in fit_predict
return self.fit(X, y).labels_
File "/Library/Python/2.7/site-packages/tslearn/clustering.py", line 517, in fit
X_ = to_time_series_dataset(X)
File "/Library/Python/2.7/site-packages/tslearn/utils.py", line 119, in to_time_series_dataset
max_sz = max([ts_size(to_time_series(ts)) for ts in dataset])
File "/Library/Python/2.7/site-packages/tslearn/utils.py", line 315, in ts_size
while not numpy.any(numpy.isfinite(ts_[sz - 1])):
IndexError: index -6 is out of bounds for axis 0 with size 5
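One thing worth checking: to_time_series_dataset interprets a 2-D array as a set of univariate series, so the (3207, 5) array becomes 3207 series of length 5. If that is not the intended layout, or if some rows end up entirely NaN after preprocessing, the size scan in ts_size can run off the end, which would match the IndexError:
import numpy as np
from tslearn.utils import to_time_series_dataset

data = np.random.randn(3207, 5)  # stand-in for the reported array
X_train = to_time_series_dataset(data)
print(X_train.shape)  # (3207, 5, 1): 3207 univariate series of length 5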
Currently, running the tests with
pytest --doctest-modules --ignore tslearn/docs/ --ignore tslearn/shapelets.py tslearn
takes approximately 2.5 min on my laptop, which is a while.
A quick profiling run with the --durations=0 option shows that the few tests below take most of the run time:
86.10s call tslearn/svm.py::tslearn.svm.TimeSeriesSVC
36.81s call tslearn/svm.py::tslearn.svm.TimeSeriesSVR
8.25s call tslearn/clustering.py::tslearn.clustering.TimeSeriesKMeans
7.15s call tslearn/clustering.py::tslearn.clustering.KShape
3.51s call tslearn/clustering.py::tslearn.clustering.GlobalAlignmentKernelKMeans
2.41s call tslearn/clustering.py::tslearn.clustering.silhouette_score
0.05s call tslearn/datasets.py::tslearn.datasets.UCR_UEA_datasets.baseline_accuracy
0.04s call tslearn/datasets.py::tslearn.datasets.UCR_UEA_datasets.load_dataset
0.03s call tslearn/generators.py::tslearn.generators.random_walk_blobs
Profiling e.g. TimeSeriesSVC with pytest-profiling indicates that soft_dtw is the bottleneck (cf. below). Maybe reducing the dataset size further could help?
ncalls tottime percall cumtime percall filename:lineno(function)
1 0.000 0.000 97.182 97.182 runner.py:106(pytest_runtest_call)
1 0.000 0.000 97.182 97.182 doctest.py:106(runtest)
1 0.000 0.000 97.182 97.182 doctest.py:1838(run)
1 0.000 0.000 97.182 97.182 doctest.py:1418(run)
1 0.000 0.000 97.178 97.178 doctest.py:1272(__run)
12 0.000 0.000 97.177 8.098 {built-in method builtins.exec}
5 0.000 0.000 97.161 19.432 base.py:358(_compute_kernel)
5 0.000 0.000 97.161 19.432 svm.py:33(<lambda>)
5 0.000 0.000 97.160 19.432 metrics.py:262(cdist_gak)
5 0.001 0.000 97.146 19.429 metrics.py:305(<listcomp>)
8000 0.331 0.000 97.059 0.012 metrics.py:221(gak)
8000 9.881 0.001 96.540 0.012 metrics.py:561(soft_dtw)
8000 0.124 0.000 67.245 0.008 metrics.py:693(compute)
8000 67.122 0.008 67.122 0.008 {tslearn.soft_dtw_fast._soft_dtw}
2 0.000 0.000 34.452 17.226 svm.py:189(predict_proba)
2 0.000 0.000 34.449 17.224 base.py:593(_predict_proba)
2 0.000 0.000 34.449 17.224 base.py:635(_dense_predict_proba)
Summary: stochastically explore the DTW averaging space. It's a useful algorithm for online settings and larger sample sizes. The paper is a bit of work to get through, as the authors rigorously detail a connection between the DTW sample mean and nonsmooth optimization methods for their proposed algorithm. They also detail a non-stochastic subgradient method and present a vectorized version of DBA that might provide additional performance benefits (I'll create a separate issue for refactoring DBA to that).
I think this package would benefit greatly from both the Subgradient (SG) and Stochastic Subgradient (SSG) methods.
Original Paper:
Nonsmooth Analysis and Subgradient Methods for Averaging in Dynamic Time Warping Spaces
Relevant sections are 4.1, The Subgradient Mean Algorithm, and 4.3, The Stochastic Subgradient Mean Algorithm, together with their respective Algorithms 1 and 3.
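For reference, a condensed sketch of the SSG update, based on my reading of Algorithm 3; the step-size schedule and initialization are simplified, and equal-length series are assumed:
import numpy as np
from tslearn.metrics import dtw_path

def ssg_mean(X, n_epochs=10, eta=0.05, seed=0):
    # Sketch of the stochastic subgradient mean (Schultz & Jain 2017, Alg. 3);
    # X: array of shape (n_ts, sz, d) with equal-length series, for simplicity.
    rng = np.random.RandomState(seed)
    z = X[rng.randint(len(X))].copy()  # initialize from a random sample
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):
            path, _ = dtw_path(z, X[i])
            grad = np.zeros_like(z)
            for a, b in path:
                grad[a] += 2.0 * (z[a] - X[i][b])  # subgradient of the squared DTW term
            z -= eta * grad
    return z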
Locally-Regularized DTW sounds really interesting. I'm familiar with regularization in regression problems, but I am not sure how it fits into this context. I am looking forward to your presentation on this topic!
Ben
max_iter needs to be set to an integer; according to the docs and the previous implementation, it should be 30.
After a first glance at the package, I was wondering: does tslearn require uniformly sampled, equal-sized time series?
Hello guys,
First of all, thanks for this great project!
Right now I am heavily using the k-Shape algorithm for clustering (imho) huge amounts of data:
n = 5,000,000; k = 6, 8, 10, 12; m = 56
O(max{n·k·m·log(m), n·m², k·m³}) (complexity of k-Shape from the original paper)
This takes about 150 hours on an Intel(R) Xeon(R) CPU E5-2690 v4 @ 2.60GHz. Sadly, the algorithm only uses one core. How can we fix this? How can we make this faster? I am quite new to Python, but maybe you have an idea?
Thanks in advance =)
I received the following by e-mail, and I prefer to deal with it through a GitHub issue.
Dear Dr Tavenard
Today I started with using tslearn.
My small home project is about trying to cluster time series data. I have data from my own cycling: various rides I made. From each ride file I extracted one time series (namely wattage). My rides are of various lengths, so my time series are of various lengths.
I have already put my rides in the right data format for tslearn using:
>>> from tslearn.utils import to_time_series_dataset
>>> formatted_dataset = to_time_series_dataset(appended_data_Garmin)
As expected, I get NaN filling up all the data points, given the longest time series I have.
However, when I try to run the simple:
>>> from tslearn.clustering import TimeSeriesKMeans
>>> km = TimeSeriesKMeans(n_clusters=3, metric="dtw", max_iter=5, verbose=False, random_state=0)
>>> km.fit(formatted_dataset)
I get:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Do you have any idea what might be causing this issue?
A matrix profile is a data structure built from a similarity join of two time series. The resulting data structure produces two objects: the matrix profile itself (a vector of nearest-neighbor distances) and the corresponding matrix profile index.
Once the data structure is built, it can produce responses to queries such as motif discovery, discord (anomaly) detection, and similarity search.
Resources:
from tslearn.clustering import KShape,TimeSeriesKMeans
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/Library/Python/2.7/site-packages/tslearn-0.1.14-py2.7-macosx-10.11-intel.egg/tslearn/clustering.py", line 13, in <module>
from tslearn.metrics import cdist_gak, cdist_dtw, cdist_soft_dtw, dtw
File "/Library/Python/2.7/site-packages/tslearn-0.1.14-py2.7-macosx-10.11-intel.egg/tslearn/metrics.py", line 8, in <module>
from tslearn.soft_dtw_fast import _soft_dtw, _soft_dtw_grad, _jacobian_product_sq_euc
File "__init__.pxd", line 164, in init tslearn.soft_dtw_fast
ValueError: numpy.dtype has the wrong size, try recompiling. Expected 88, got 96
>>>
Re-compiled, re-installed, and now:
from tslearn.clustering import KShape,TimeSeriesKMeans
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "tslearn/clustering.py", line 13, in <module>
from tslearn.metrics import cdist_gak, cdist_dtw, cdist_soft_dtw, dtw
File "tslearn/metrics.py", line 8, in <module>
from tslearn.soft_dtw_fast import _soft_dtw, _soft_dtw_grad, _jacobian_product_sq_euc
ImportError: No module named soft_dtw_fast
You've got a soft_dtw_fast.c and a soft_dtw_fast.pyx -> is there supposed to also be a soft_dtw_fast.py?
It would be nice if it were possible to download a specific (e.g. small) dataset when running
d = UCR_UEA_datasets()
X_train, y_train, X_test, y_test = d.load_dataset('<some-id>')
as opposed to downloading everything and caching it. In particular, that would make it possible to use this in some gallery example as well.
Edit: moved part of this comment to a separate issue #23 as requested.
It would be nice to allow the user to evaluate the quality of a clustering by providing the equivalent of the silhouette score (or related metric) for time series clustering.
The following code is from the KNeighborsTimeSeriesClassifier example; when changing the classifier to SVC, an error occurs when using the pipeline.
Version: 0.1.18.3
from tslearn.utils import to_time_series,to_time_series_dataset
from tslearn.clustering import TimeSeriesKMeans,GlobalAlignmentKernelKMeans,silhouette_score
from tslearn.svm import TimeSeriesSVC,TimeSeriesSVR
from tslearn.neighbors import KNeighborsClassifier
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from tslearn.generators import random_walk_blobs
from tslearn.preprocessing import TimeSeriesScalerMinMax
from tslearn.neighbors import KNeighborsTimeSeriesClassifier,KNeighborsTimeSeries
from tslearn.piecewise import SymbolicAggregateApproximation
n_ts_per_blob,sz,d,n_blobs=20,100,1,2
x,y=random_walk_blobs(n_ts_per_blob=n_ts_per_blob,sz=sz,d=d,n_blobs=n_blobs)
print('x,y, shape:',x.shape,y.shape)
print(len(x[0]),x[0][:10],y[0])
scaler=TimeSeriesScalerMinMax(min=0,max=1)
x_scaled=scaler.fit_transform(x)
print(x_scaled[0][:10],y[0])
print(n_ts_per_blob*n_blobs)
indices_shuffle=np.random.permutation(n_ts_per_blob*n_blobs)
print(indices_shuffle)
x_shuffle=x_scaled[indices_shuffle]
y_shuffle=y[indices_shuffle]
nofTrain=n_ts_per_blob*n_blobs//2
print(nofTrain)
x_train=x_shuffle[:nofTrain]
y_train=y_shuffle[:nofTrain]
x_test=x_shuffle[nofTrain:]
y_test=y_shuffle[nofTrain:]
print(y_shuffle)
print(y_train)
print(y_test)
sax_trans=SymbolicAggregateApproximation(n_segments=15,alphabet_size_avg=5)
svc=TimeSeriesSVC(sz=100,d=1) #KNeighborsTimeSeriesClassifier(n_neighbors=3,metric='dtw')
pip= svc # Pipeline(steps=[('sax',sax_trans),('knn',svc)]) #works with svc but not the pipeline
pip.fit(x_train,y_train)
predictsPIP=pip.predict(x_test)
print(predictsPIP)
print(y_test)
print(accuracy_score(y_test,predictsPIP))
I've tried a couple of the examples in the docs, but none of them appear to work. It looks like the data shapes are different from what the examples expect. For instance, in the DTW example:
sz = X_train.shape[1]
results in IndexError: tuple index out of range. Similarly, in the kernel k-means example, dataset_scaled[0, :, 0] yields an index error, suggesting that there are only 2 dimensions.
Running from today's master branch on Python 3.6.
Trying to get barycenter averaging working on some audio data, but can't even get the examples in the docs to work.
$ ipython --no-banner
In [1]: from tslearn.utils import to_time_series_dataset
...: X = to_time_series_dataset([[1, 2, 3, 4], [1, 2, 3], [2, 5, 6, 7, 8, 9]])
...:
In [2]: from tslearn.barycenters import softdtw_barycenter
...: from tslearn.utils import ts_zeros
...: initial_barycenter = ts_zeros(sz=5)
...: bar = softdtw_barycenter(X, init=initial_barycenter)
...:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-2-7221f8ce968c> in <module>()
2 from tslearn.utils import ts_zeros
3 initial_barycenter = ts_zeros(sz=5)
----> 4 bar = softdtw_barycenter(X, init=initial_barycenter)
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/barycenters.pyc in softdtw_barycenter(X, gamma, weights, method, tol, max_iter, init)
484 # The function works with vectors so we need to vectorize barycenter.
485 res = minimize(f, barycenter.ravel(), method=method, jac=True, tol=tol,
--> 486 options=dict(maxiter=max_iter, disp=False))
487 return res.x.reshape(barycenter.shape)
488 else:
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/scipy/optimize/_minimize.pyc in minimize(fun, x0, args, method, jac, hess, hessp, bounds, constraints, tol, callback, options)
448 elif meth == 'l-bfgs-b':
449 return _minimize_lbfgsb(fun, x0, args, jac, bounds,
--> 450 callback=callback, **options)
451 elif meth == 'tnc':
452 return _minimize_tnc(fun, x0, args, jac, bounds, callback=callback,
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/scipy/optimize/lbfgsb.pyc in _minimize_lbfgsb(fun, x0, args, jac, bounds, disp, maxcor, ftol, gtol, eps, maxfun, maxiter, iprint, callback, maxls, **unknown_options)
326 # until the completion of the current minimization iteration.
327 # Overwrite f and g:
--> 328 f, g = func_and_grad(x)
329 elif task_str.startswith(b'NEW_X'):
330 # new iteration
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/scipy/optimize/lbfgsb.pyc in func_and_grad(x)
276 else:
277 def func_and_grad(x):
--> 278 f = fun(x, *args)
279 g = jac(x, *args)
280 return f, g
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/scipy/optimize/optimize.pyc in function_wrapper(*wrapper_args)
290 def function_wrapper(*wrapper_args):
291 ncalls[0] += 1
--> 292 return function(*(wrapper_args + args))
293
294 return ncalls, function_wrapper
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/scipy/optimize/optimize.pyc in __call__(self, x, *args)
61 def __call__(self, x, *args):
62 self.x = numpy.asarray(x).copy()
---> 63 fg = self.fun(x, *args)
64 self.jac = fg[1]
65 return fg[0]
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/barycenters.pyc in <lambda>(Z)
481
482 if max_iter > 0:
--> 483 f = lambda Z: _softdtw_func(Z, X_, weights, barycenter, gamma)
484 # The function works with vectors so we need to vectorize barycenter.
485 res = minimize(f, barycenter.ravel(), method=method, jac=True, tol=tol,
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/barycenters.pyc in _softdtw_func(Z, X, weights, barycenter, gamma)
427 for i in range(len(X)):
428 D = SquaredEuclidean(Z, X[i])
--> 429 sdtw = SoftDTW(D, gamma=gamma)
430 value = sdtw.compute()
431 E = sdtw.grad()
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/metrics.pyc in __init__(self, D, gamma)
662 """
663 if hasattr(D, "compute"):
--> 664 self.D = D.compute()
665 else:
666 self.D = D
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/tslearn/metrics.pyc in compute(self)
750 Distance matrix.
751 """
--> 752 return euclidean_distances(self.X, self.Y, squared=True)
753
754 def jacobian_product(self, E):
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/sklearn/metrics/pairwise.pyc in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
221 paired_distances : distances betweens pairs of elements of X and Y.
222 """
--> 223 X, Y = check_pairwise_arrays(X, Y)
224
225 if X_norm_squared is not None:
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/sklearn/metrics/pairwise.pyc in check_pairwise_arrays(X, Y, precomputed, dtype)
110 warn_on_dtype=warn_on_dtype, estimator=estimator)
111 Y = check_array(Y, accept_sparse='csr', dtype=dtype,
--> 112 warn_on_dtype=warn_on_dtype, estimator=estimator)
113
114 if precomputed:
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/sklearn/utils/validation.pyc in check_array(array, accept_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
420 % (array.ndim, estimator_name))
421 if force_all_finite:
--> 422 _assert_all_finite(array)
423
424 shape_repr = _shape_repr(array.shape)
/Users/fredmailhot/anaconda/envs/turkish_nlp/lib/python2.7/site-packages/sklearn/utils/validation.pyc in _assert_all_finite(X)
41 and not np.isfinite(X).all()):
42 raise ValueError("Input contains NaN, infinity"
---> 43 " or a value too large for %r." % X.dtype)
44
45
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
I assume the problem here is that to_time_series_dataset imputes NaNs for unequal-length time series data, but I don't know for sure.
I'm really excited about tslearn; it will be super helpful for a project I'm working on, if these hiccups can be sorted out.
It would be nice to have a method such as BOSS (cf. here) implemented in tslearn.
The local feature extraction step should be implemented as a TransformerMixin.
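A skeleton of what the transformer contract would look like (the class name and internals are hypothetical; the SFA/bag-of-words logic is elided):
from sklearn.base import BaseEstimator, TransformerMixin

class BOSSTransformer(BaseEstimator, TransformerMixin):
    # Skeleton only: fit() would learn the SFA discretization, transform() would
    # map each series to its bag-of-SFA-symbols histogram.
    def __init__(self, word_size=4, alphabet_size=4):
        self.word_size = word_size
        self.alphabet_size = alphabet_size

    def fit(self, X, y=None):
        # learn discretization boundaries here
        return self

    def transform(self, X):
        # return bag-of-words features here
        return X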
Hi,
Like MinMaxScaler in scikit-learn, a transform method is needed for TimeSeriesScalerMinMax.
A few notes: calling .fit() and having it return an average sequence feels off. .fit() feels like it would initialize the average sequence through some method and perform initial computations; the transform() method feels more right here, as it would take those computations and perform the transformation of the dataset into an average sequence. It should also expose max_iter, init_barycenter, and tol parameters.
For the KShape algorithm (or k-means), it could be interesting to guide the algorithm to convergence with user knowledge. Imagine the user knows that there are three clusters and has at least one sample of each class; it could be interesting to use these samples as a first barycenter during the learning phase.
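Hypothetical usage, assuming an init parameter accepting an array of seed series (mirroring sklearn.cluster.KMeans); tslearn may not support this yet:
from tslearn.generators import random_walks
from tslearn.clustering import TimeSeriesKMeans

X_train = random_walks(n_ts=30, sz=16, d=1)
seeds = X_train[[0, 10, 20]]  # one known sample per class
km = TimeSeriesKMeans(n_clusters=3, init=seeds)  # `init` taking an array is the assumption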
As stated in Cuturi and Blondel (2017), there is a clear relationship between the Global Alignment Kernel and soft-DTW.
The Global Alignment Kernel (GAK) has, up to now, used a custom implementation in tslearn, while soft-DTW relies on the implementation released by Mathieu Blondel.
GAK tests were built so as to get similar results to those output by Adrien Gaidon's wrapper [link].
Recently, I tried to get rid of GAK's implementation by just computing GAK from soft-DTW using the inline equation from Section 2.1 in (Cuturi & Blondel, 2017), and I got different results; the difference is not marginal.
This is what makes tslearn tests fail at the moment.
If anyone is willing to help on this, feel free!
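For anyone picking this up, the consistency check I would expect to pass looks roughly like this. The gamma = 2 * sigma**2 mapping between the two parametrizations is my assumption, and normalization terms may be exactly where the discrepancy hides:
import numpy as np
from tslearn.metrics import gak, soft_dtw

rng = np.random.RandomState(0)
x, y = rng.randn(15, 1), rng.randn(15, 1)
sigma = 1.0
gamma = 2 * sigma ** 2  # assumed correspondence between the parametrizations
lhs = gak(x, y, sigma=sigma)
rhs = np.exp(-soft_dtw(x, y, gamma=gamma) / gamma)
print(lhs, rhs)  # these currently disagree, which is the point of this issue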
Hi @rtavenar, just wondering if there is any interest in moving towards a pytest-based testing framework for tslearn, perhaps something that mirrors sklearn?
If so, this is something I have some experience in and would be more than happy to begin helping out with!
Hi @rtavenar,
Does the DTW distance calculation routine normalize the time series before calculating distances, or does the user have to pass pre-normalized sequences?
The DTW distance calculation routine seems to accept sequences of different lengths, which makes sense as it warps the sequences. However, I am wondering if this routine also does subsequence search? By subsequence search, I mean that the algorithm finds the optimal starting point and computes the warp path and distance from it.
Thanks
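For what it's worth, a naive way to emulate subsequence search on top of a plain DTW routine is to slide a window and keep the best match (not a tslearn feature, just a sketch):
import numpy as np
from tslearn.metrics import dtw

def subsequence_search(query, long_series):
    # Slide a window the size of the query and keep the best DTW match.
    m = len(query)
    best_dist, best_start = float("inf"), 0
    for start in range(len(long_series) - m + 1):
        d = dtw(query, long_series[start:start + m])
        if d < best_dist:
            best_dist, best_start = d, start
    return best_start, best_dist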
I'm trying to use k-means, KShape, and SAX in class methods I've written, but whenever those three methods are called they return nothing. I am compelled to call these methods directly in the main file.
It would be nice to make kNN searches faster where Sakoe-Chiba-constrained DTW is concerned, using LB_Keogh-based pre-filtering.
This should be implemented in the kneighbors method of class KNeighborsTimeSeriesMixin from module neighbors.
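A minimal sketch of the pruning loop, assuming tslearn's lb_keogh lower bound (for 1-NN; the real kneighbors implementation would generalize to k neighbors and to the Sakoe-Chiba-constrained DTW):
from tslearn.metrics import dtw, lb_keogh

def nn_with_pruning(query, candidates, radius=5):
    # Cheap LB_Keogh bound first; compute full DTW only when the bound can still win.
    best_dist, best_idx = float("inf"), -1
    for i, cand in enumerate(candidates):
        if lb_keogh(query, cand, radius=radius) >= best_dist:
            continue  # pruned: the true DTW distance cannot beat the current best
        d = dtw(query, cand)
        if d < best_dist:
            best_dist, best_idx = d, i
    return best_idx, best_dist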
Hello @rtavenar,
Thank you for fixing #38; unfortunately, something similar also happens for the silhouette score.
An example to keep it simple, even though it doesn't make much sense:
from tslearn.clustering import silhouette_score
from tslearn.metrics import soft_dtw
ts = [[1, 2, 3], [4, 5, 6, 7], [8, 7, 4, 3, 2, 1]]
labels = [1, 0, 1]
print(silhouette_score(ts, labels, metric=soft_dtw))
Error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Hi @rtavenar,
I am impressed with the number of algorithms you have already implemented in tslearn. Actually, I had the idea for the same package, even the same name :D :D. Some of these algorithms I implemented in the past.
I have been working with time series models for the last few years, and there is definitely a lack of Python tools or ecosystems for time series analysis. So, I designed the packages https://github.com/blue-yonder/tsfresh and https://github.com/MaxBenChrist/tspreprocess.
If you are interested, I would like to have a chat with you to hear your ideas and vision for tslearn. We have a really good development team in tsfresh. Maybe we find a way to synchronize our efforts and support you.
Best, Max
It would make sense to have metric learning algorithms dedicated to time series in tslearn.
A good start could be Garreau et al. (2014), but maybe other methods could make more sense.
Hi,
Thanks for your brilliant work building this library!
I am trying to use tslearn.preprocessing.TimeSeriesScalerMeanVariance for standardization, but it fails when one of my data samples is all zeros. Here is the example:
import numpy as np
from tslearn.preprocessing import TimeSeriesScalerMeanVariance
t = np.array([[0, 0, 0], [1, 2, 3]])
TimeSeriesScalerMeanVariance().fit_transform(t)
the output for the above code is:
/lib/python3.6/site-packages/tslearn/preprocessing.py:153: RuntimeWarning: invalid value encountered in true_divide
X_[i, :, d] = (X_[i, :, d] - cur_mean) * self.std_ / cur_std + self.mu_
array([[[ nan],
[ nan],
[ nan]],
[[-1.22474487],
[ 0. ],
[ 1.22474487]]])
The issue is that the standard deviation of [0, 0, 0] is 0. Are there any good ways of working around this?
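Until that's handled in the library, a hand-rolled standardization that leaves constant series untouched is straightforward:
import numpy as np

t = np.array([[0, 0, 0], [1, 2, 3]], dtype=float)
mean = t.mean(axis=1, keepdims=True)
std = t.std(axis=1, keepdims=True)
std[std == 0] = 1.0  # constant series: avoid dividing by zero
t_scaled = (t - mean) / std
print(t_scaled)  # first row stays all zeros instead of becoming NaN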
Hello,
First of all, really great work on this repository, very helpful!
I would like to extract the "best" K shapelets from a set of time series. The lengths of these shapelets should be within the interval [min_len, max_len]. I was wondering how I best approach this, since the n_shapelets_per_size hyper-parameter has an impact on the output.
I currently see two alternatives: (1) call the algorithm with n_shapelets_per_size containing each length between min_len and max_len, with each value equal to K; in the end, iterate over all extracted shapelets and select the best K based on a metric such as information gain; (2) extract a larger pool of candidate shapelets and select the best K from that list.
Is either alternative better than the other? Or is there a third option that I currently cannot think of?
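For the first alternative, the dictionary could be built like this (a sketch; the selection metric is left out):
K, min_len, max_len = 5, 8, 16  # illustrative values
n_shapelets_per_size = {length: K for length in range(min_len, max_len + 1)}
# fit a ShapeletModel with this dict, then rank all extracted shapelets
# (e.g. by information gain) and keep the top K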
Thanks in advance,
Gilles
It would be nice to have shapelet-based classifiers implemented in tslearn. Maybe a good start would be Learning Shapelets from Grabocka et al. (cf. here).
When using this class, what are the available values for the "metric" parameter? Only "dtw"? Any recommendation if I wanted to use the Euclidean distance or, for example, the SAX distance, when using this classifier on a dataset with a SAX representation?
When we cluster series (for instance with KShape), we obtain centers for normalized data. It would be convenient to have a method to retrieve the centers as original time series.
Use docs/examples/plot_shapelet.py and modify y_train to be suitable for binary classification tasks. I think this can reproduce the bug.
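A hedged reproduction sketch based on that description (the dataset and hyper-parameters are mine, not the example's):
from tslearn.generators import random_walk_blobs
from tslearn.shapelets import ShapeletModel

X, y = random_walk_blobs(n_ts_per_blob=20, sz=64, d=1, n_blobs=2)  # binary labels
clf = ShapeletModel(n_shapelets_per_size={16: 2}, max_iter=10)
clf.fit(X, y)  # reportedly fails with two classes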
As can be seen in the gallery of examples, inertia increases for DBA k-means during fitting.
It should be checked whether the problem comes from the evaluation of inertia or from the optimization process itself.
See here.
Hi,
One of the important features that make time series learning different from learning on generic data is variable length, which may also make tslearn unique.
It would be really appreciated if there were a specific paragraph elaborating which algorithms can be used with variable-length time series in tslearn. For shapelets, for example, the related demo only shows how to deal with equal-length data, although the algorithm itself supports variable-length time series.
Thanks
I have not been able to build the docs under RTD with a tensorflow version higher than 1.4.
I get the following error message:
python /home/docs/checkouts/readthedocs.org/user_builds/tslearn/envs/latest/bin/sphinx-build -T -E -b readthedocs -d _build/doctrees-readthedocs -D language=en . _build/html
Running Sphinx v1.7.4
loading translations [en]... done
making output directory...
[autosummary] generating autosummary for: auto_examples/index.rst, auto_examples/plot_barycenter_interpolate.rst, auto_examples/plot_barycenters.rst, auto_examples/plot_dtw.rst, auto_examples/plot_kernel_kmeans.rst, auto_examples/plot_kmeans.rst, auto_examples/plot_kshape.rst, auto_examples/plot_lb_keogh.rst, auto_examples/plot_neighbors.rst, auto_examples/plot_sax.rst, ..., gen_modules/tslearn.utils.rst, gen_modules/utils/tslearn.utils.load_timeseries_txt.rst, gen_modules/utils/tslearn.utils.save_timeseries_txt.rst, gen_modules/utils/tslearn.utils.to_time_series.rst, gen_modules/utils/tslearn.utils.to_time_series_dataset.rst, gen_modules/utils/tslearn.utils.ts_size.rst, gettingstarted.rst, index.rst, reference.rst, variablelength.rst
Illegal instruction (core dumped)
It seems related to the CPUs at RTD not having AVX functionalities, but I guess there should be a way to make things work.
The thing is: keras' sigmoid calls tf.nn.sigmoid with a keyword argument axis that did not exist in TF 1.4; hence we should check which keras version could be OK.
Definitely, even if we choose option 1 for a short-term fix, we should investigate option 2, because we might need new keras functionalities in the future.