Git Product home page Git Product logo

stumpy's Introduction

PyPI Version

PyPI Downloads

Conda-forge Version

Conda-forge Downloads

License

ReadTheDocs Status

Code Coverage

Binder

JOSS

DOI

FOSSA

Twitter

STUMPY Logo

STUMPY

STUMPY is a powerful and scalable library that efficiently computes something called the matrix profile, which can be used for a variety of time series data mining tasks such as:

  • pattern/motif (approximately repeated subsequences within a longer time series) discovery
  • anomaly/novelty (discord) discovery
  • shapelet discovery
  • semantic segmentation
  • density estimation
  • time series chains (temporally ordered set of subsequence patterns)
  • and more ...

Whether you are an academic, data scientist, software developer, or time series enthusiast, STUMPY is straightforward to install and allows you to compute the matrix profile in the most efficient way. Our goal is to allow you to get to your time series insights faster. See documentation for more information.

How to use STUMPY

Typical usage (1-dimensional time series data) with `STUMP`:

import stumpy
import numpy as np

your_time_series = np.random.rand(10000)
window_size = 50  # Approximately, how many data points might be found in a pattern 

matrix_profile = stumpy.stump(your_time_series, m=window_size)

Distributed usage for 1-dimensional time series data with Dask Distributed via `STUMPED`:

import stumpy
import numpy as np
from dask.distributed import Client
dask_client = Client()

your_time_series = np.random.rand(10000)
window_size = 50  # Approximately, how many data points might be found in a pattern 

matrix_profile = stumpy.stumped(dask_client, your_time_series, m=window_size)

Multi-dimensional time series data with `MSTUMP`:

import stumpy
import numpy as np

your_time_series = np.random.rand(3, 1000)
window_size = 50  # Approximately, how many data points might be found in a pattern

matrix_profile, matrix_profile_indices = stumpy.mstump(your_time_series, m=window_size)

Distributed multi-dimensional time series data analysis with Dask Distributed `MSTUMPED`:

import stumpy
import numpy as np
from dask.distributed import Client
dask_client = Client()

your_time_series = np.random.rand(3, 1000)
window_size = 50  # Approximately, how many data points might be found in a pattern

matrix_profile, matrix_profile_indices = stumpy.mstumped(dask_client, your_time_series, m=window_size)

Time Series Chains:

import stumpy
import numpy as np

your_time_series = np.random.rand(10000)
window_size = 50  # Approximately, how many data points might be found in a pattern 

matrix_profile = stumpy.stump(your_time_series, m=window_size)

left_matrix_profile_index = matrix_profile[:, 2]
right_matrix_profile_index = matrix_profile[:, 3]
idx = 10  # Subsequence index for which to retrieve the anchored time series chain for

anchored_chain = stumpy.atsc(left_matrix_profile_index, right_matrix_profile_index, idx)

all_chain_set, longest_unanchored_chain = stumpy.allc(left_matrix_profile_index, right_matrix_profile_index)

Dependencies

Where to get it

Conda install (preferred):

conda install -c conda-forge stumpy

PyPI install, presuming you have numpy, scipy, and numba installed:

pip install stumpy

To install stumpy from source, see the instructions in the documentation.

Documentation

In order to fully understand and appreciate the underlying algorithms and applications, it is imperative that you read the original publications. For a more detailed example of how to use STUMPY please consult the latest documentation or explore the following tutorials:

  1. The Matrix Profile - Tutorial 0
  2. STUMPY Basics - Tutorial 1
  3. Time Series Chains - Tutorial 2

Performance

We tested the performance using the Numba JIT compiled version of the code on randomly generated data with various lengths (i.e., np.random.rand(n)).

STUMPY Performance Plot

The raw results are displayed below as Hours:Minutes:Seconds.

i n = 2i GPU-STOMP STUMP.16 STUMPED.128 STUMPED.256
6 64 00:00:10.00 00:00:00.00 00:00:05.77 00:00:06.08
7 128 00:00:10.00 00:00:00.00 00:00:05.93 00:00:07.29
8 256 00:00:10.00 00:00:00.01 00:00:05.95 00:00:07.59
9 512 00:00:10.00 00:00:00.02 00:00:05.97 00:00:07.47
10 1024 00:00:10.00 00:00:00.04 00:00:05.69 00:00:07.64
11 2048 NaN 00:00:00.09 00:00:05.60 00:00:07.83
12 4096 NaN 00:00:00.19 00:00:06.26 00:00:07.90
13 8192 NaN 00:00:00.41 00:00:06.29 00:00:07.73
14 16384 NaN 00:00:00.99 00:00:06.24 00:00:08.18
15 32768 NaN 00:00:02.39 00:00:06.48 00:00:08.29
16 65536 NaN 00:00:06.42 00:00:07.33 00:00:09.01
17 131072 00:00:10.00 00:00:19.52 00:00:09.75 00:00:10.53
18 262144 00:00:18.00 00:01:08.44 00:00:33.38 00:00:24.07
19 524288 00:00:46.00 00:03:56.82 00:01:35.27 00:03:43.66
20 1048576 00:02:30.00 00:19:54.75 00:04:37.15 00:03:01.16
21 2097152 00:09:15.00 03:05:07.64 00:13:36.51 00:08:47.47
22 4194304 NaN 10:37:51.21 00:55:44.43 00:32:06.70
23 8388608 NaN 38:42:51.42 03:33:30.53 02:00:49.37
24 16777216 NaN NaN 13:03:43.86 07:13:47.12
NaN 17729800 09:16:12.00 NaN NaN 07:18:42.54
25 33554432 NaN NaN NaN 26:27:41.29
26 67108864 NaN NaN NaN 106:40:17.17
NaN 100000000 291:07:12.00 NaN NaN 234:51:35.39
27 134217728 NaN NaN NaN NaN

GPU-STOMP: Results are reproduced from the original Matrix Profile II paper - NVIDIA Tesla K80 (contains 2 GPUs)

STUMP.16: 16 CPUs in Total - 16x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors parallelized with Numba on a single server without Dask.

STUMPED.128: 128 CPUs in Total - 8x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors x 16 servers, parallelized with Numba, and distributed with Dask Distributed.

STUMPED.256: 256 CPUs in Total - 8x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors x 32 servers, parallelized with Numba, and distributed with Dask Distributed.

Running Tests

Tests are written in the tests directory and processed using PyTest. and requires coverage.py for code coverage analysis. Tests can be executed with:

./test.sh

Python Version

STUMPY supports Python 3.6+ and, due to the use of unicode variable names/identifiers, is not compatible with Python 2.x. Given the small dependencies, STUMPY may work on older versions of Python but this is beyond the scope of our support and we strongly recommend that you upgrade to the most recent version of Python.

Getting Help

First, please check the issues on github to see if your question has already been answered there. If no solution is available there feel free to open a new issue and the authors will attempt to respond in a reasonably timely fashion.

Contributing

We welcome contributions in any form! Assistance with documentation, particularly expanding tutorials, is always welcome. To contribute please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

References

Yeh, Chin-Chia Michael, et al. (2016) Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords, and Shapelets. ICDM:1317-1322. Link

Zhu, Yan, et al. (2016) Matrix Profile II: Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for TIme Series Motifs and Joins. ICDM:739-748. Link

Yeh, Chin-Chia Michael, et al. (2017) Matrix Profile VI: Meaningful Multidimensional Motif Discovery. ICDM:565-574. Link

Zhu, Yan, et al. (2017) Matrix Profile VII: Time Series Chains: A New Primitive for Time Series Data Mining. ICDM:695-704. Link

Citing

If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Open Source Software article.

S. M. Law, STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining In: Journal of Open Source Software, The Open Journal, Volume 4, Number 39. 2019

@article{law2017stumpy,
  title={{STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining}},
  author={Law, Sean M.},
  journal={{The Journal of Open Source Software}},
  volume={4},
  number={39},
  pages={1504},
  year={2019}
}

License & Trademark

STUMPY
Copyright 2019 TD Ameritrade. Released under the terms of the 3-Clause BSD license.
STUMPY is a trademark of TD Ameritrade IP Company, Inc. All rights reserved.

stumpy's People

Contributors

seanlaw avatar bdice avatar bfattori-tda avatar cuichenli avatar 0xflotus avatar bharatr21 avatar ronaldhorner avatar xhochy avatar

Watchers

James Cloos avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.