tdameritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis

Home Page: https://stumpy.readthedocs.io/en/latest/

License: Other

Python 98.29% Shell 1.57% TeX 0.14%

data-science time-series-analysis dask numba python anomaly-detection pattern-matching pydata matrix-profile motif-discovery

stumpy's Introduction

STUMPY

STUMPY is a powerful and scalable Python library that efficiently computes something called the matrix profile, which is just an academic way of saying "for every (green) subsequence within your time series, automatically identify its corresponding nearest-neighbor (grey)":

What's important is that once you've computed your matrix profile (middle panel above) it can then be used for a variety of time series data mining tasks such as:

pattern/motif (approximately repeated subsequences within a longer time series) discovery
anomaly/novelty (discord) discovery
shapelet discovery
semantic segmentation
streaming (on-line) data
fast approximate matrix profiles
time series chains (temporally ordered set of subsequence patterns)
snippets for summarizing long time series
pan matrix profiles for selecting the best subsequence window size(s)
and more ...
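To make the "nearest neighbor for every subsequence" idea concrete, here is a minimal brute-force sketch in plain Python. It is not how STUMPY computes the matrix profile (STUMPY uses far faster algorithms), and the names `znorm` and `naive_matrix_profile` are hypothetical helpers introduced only for illustration. It assumes a z-normalized Euclidean distance and an exclusion zone of `ceil(m / 4)` around each subsequence to skip trivial self-matches.

```python
import math


def znorm(values):
    """Z-normalize a subsequence using the population standard deviation."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        # Constant subsequence: z-normalization is undefined, so fall back to zeros
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def naive_matrix_profile(ts, m):
    """Brute-force O(n^2 * m) matrix profile; for illustration only."""
    n_subseq = len(ts) - m + 1
    subseqs = [znorm(ts[i:i + m]) for i in range(n_subseq)]
    excl = math.ceil(m / 4)  # assumed exclusion zone around each subsequence
    profile, indices = [], []
    for i in range(n_subseq):
        best_dist, best_idx = math.inf, -1
        for j in range(n_subseq):
            if abs(i - j) <= excl:
                continue  # skip trivial matches close to i
            dist = math.dist(subseqs[i], subseqs[j])
            if dist < best_dist:
                best_dist, best_idx = dist, j
        profile.append(best_dist)
        indices.append(best_idx)
    return profile, indices


# The pattern [0, 1, 3, 1] appears at positions 0 and 8
ts = [0, 1, 3, 1, 0, 5, 2, 7, 0, 1, 3, 1, 0, 4, 4, 9]
profile, indices = naive_matrix_profile(ts, m=4)

# Smallest profile value: a motif; largest profile value: a discord (anomaly)
motif_idx = min(range(len(profile)), key=profile.__getitem__)
discord_idx = max(range(len(profile)), key=profile.__getitem__)
```

Low values in `profile` mark subsequences that repeat somewhere else in the series, while high values mark subsequences with no close match, which is why the same array serves both motif and discord discovery.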

Whether you are an academic, data scientist, software developer, or time series enthusiast, STUMPY is straightforward to install and our goal is to allow you to get to your time series insights faster. See documentation for more information.

How to use STUMPY

Please see our API documentation for a complete list of available functions and see our informative tutorials for more comprehensive example use cases. Below, you will find code snippets that quickly demonstrate how to use STUMPY.

Typical usage (1-dimensional time series data) with STUMP:

import stumpy
import numpy as np

if __name__ == "__main__":
    your_time_series = np.random.rand(10000)
    window_size = 50  # Approximately, how many data points might be found in a pattern

    matrix_profile = stumpy.stump(your_time_series, m=window_size)

Distributed usage for 1-dimensional time series data with Dask Distributed via STUMPED:

import stumpy
import numpy as np
from dask.distributed import Client

if __name__ == "__main__":
    with Client() as dask_client:
        your_time_series = np.random.rand(10000)
        window_size = 50  # Approximately, how many data points might be found in a pattern

        matrix_profile = stumpy.stumped(dask_client, your_time_series, m=window_size)

GPU usage for 1-dimensional time series data with GPU-STUMP:

import stumpy
import numpy as np
from numba import cuda

if __name__ == "__main__":
    your_time_series = np.random.rand(10000)
    window_size = 50  # Approximately, how many data points might be found in a pattern
    all_gpu_devices = [device.id for device in cuda.list_devices()]  # Get a list of all available GPU devices

    matrix_profile = stumpy.gpu_stump(your_time_series, m=window_size, device_id=all_gpu_devices)

Multi-dimensional time series data with MSTUMP:

import stumpy
import numpy as np

if __name__ == "__main__":
    your_time_series = np.random.rand(3, 1000)  # Each row represents data from a different dimension while each column represents data from the same dimension
    window_size = 50  # Approximately, how many data points might be found in a pattern

    matrix_profile, matrix_profile_indices = stumpy.mstump(your_time_series, m=window_size)

Distributed multi-dimensional time series data analysis with Dask Distributed MSTUMPED:

import stumpy
import numpy as np
from dask.distributed import Client

if __name__ == "__main__":
    with Client() as dask_client:
        your_time_series = np.random.rand(3, 1000)   # Each row represents data from a different dimension while each column represents data from the same dimension
        window_size = 50  # Approximately, how many data points might be found in a pattern

        matrix_profile, matrix_profile_indices = stumpy.mstumped(dask_client, your_time_series, m=window_size)

Time Series Chains with Anchored Time Series Chains (ATSC):

import stumpy
import numpy as np

if __name__ == "__main__":
    your_time_series = np.random.rand(10000)
    window_size = 50  # Approximately, how many data points might be found in a pattern

    matrix_profile = stumpy.stump(your_time_series, m=window_size)

    left_matrix_profile_index = matrix_profile[:, 2]
    right_matrix_profile_index = matrix_profile[:, 3]
    idx = 10  # Subsequence index for which to retrieve the anchored time series chain for

    anchored_chain = stumpy.atsc(left_matrix_profile_index, right_matrix_profile_index, idx)

    all_chain_set, longest_unanchored_chain = stumpy.allc(left_matrix_profile_index, right_matrix_profile_index)
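The idea behind an anchored chain can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not necessarily how `stumpy.atsc` is written: the function name `anchored_chain` is hypothetical, and `-1` is assumed to mean "no neighbor in that direction". Starting from the anchor, the sketch keeps following the right nearest neighbor for as long as that neighbor's left nearest neighbor points back to the current position.

```python
def anchored_chain(left_idx, right_idx, start):
    """Follow right neighbors while each link is confirmed by the left index."""
    chain = [start]
    j = start
    for _ in range(len(right_idx)):  # a chain cannot exceed the array length
        nxt = right_idx[j]
        if nxt == -1 or left_idx[nxt] != j:
            break  # no right neighbor, or the link is not bidirectional
        chain.append(nxt)
        j = nxt
    return chain


# Toy left/right matrix profile indices (made up for illustration)
left_idx = [-1, 0, 1, 1, 3]
right_idx = [1, 2, 4, 4, -1]

chain_from_0 = anchored_chain(left_idx, right_idx, 0)
chain_from_3 = anchored_chain(left_idx, right_idx, 3)
```

In the toy data, the chain anchored at 0 stops at 2 because the right neighbor of 2 is 4, but the left neighbor of 4 is 3 rather than 2.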

Semantic Segmentation with Fast Low-cost Unipotent Semantic Segmentation (FLUSS):

import stumpy
import numpy as np

if __name__ == "__main__":
    your_time_series = np.random.rand(10000)
    window_size = 50  # Approximately, how many data points might be found in a pattern

    matrix_profile = stumpy.stump(your_time_series, m=window_size)

    subseq_len = 50
    correct_arc_curve, regime_locations = stumpy.fluss(matrix_profile[:, 1],
                                                    L=subseq_len,
                                                    n_regimes=2,
                                                    excl_factor=1
                                                    )

Dependencies

Supported Python and NumPy versions are determined according to the NEP 29 deprecation policy.

Where to get it

Conda install (preferred):

conda install -c conda-forge stumpy

PyPI install, presuming you have numpy, scipy, and numba installed:

python -m pip install stumpy

To install stumpy from source, see the instructions in the documentation.

Documentation

In order to fully understand and appreciate the underlying algorithms and applications, it is imperative that you read the original publications. For a more detailed example of how to use STUMPY please consult the latest documentation or explore our hands-on tutorials.

Performance

We tested the performance of computing the exact matrix profile using the Numba JIT compiled version of the code on randomly generated time series data with various lengths (i.e., np.random.rand(n)) along with different CPU and GPU hardware resources.

The raw results are displayed in the table below as Hours:Minutes:Seconds.Milliseconds and with a constant window size of m = 50. Note that these reported runtimes include the time that it takes to move the data from the host to all of the GPU device(s). You may need to scroll to the right side of the table in order to see all of the runtimes.

i	n = 2ⁱ	GPU-STOMP	STUMP.2	STUMP.16	STUMPED.128	STUMPED.256	GPU-STUMP.1	GPU-STUMP.2	GPU-STUMP.DGX1	GPU-STUMP.DGX2
6	64	00:00:10.00	00:00:00.00	00:00:00.00	00:00:05.77	00:00:06.08	00:00:00.03	00:00:01.63	NaN	NaN
7	128	00:00:10.00	00:00:00.00	00:00:00.00	00:00:05.93	00:00:07.29	00:00:00.04	00:00:01.66	NaN	NaN
8	256	00:00:10.00	00:00:00.00	00:00:00.01	00:00:05.95	00:00:07.59	00:00:00.08	00:00:01.69	00:00:06.68	00:00:25.68
9	512	00:00:10.00	00:00:00.00	00:00:00.02	00:00:05.97	00:00:07.47	00:00:00.13	00:00:01.66	00:00:06.59	00:00:27.66
10	1024	00:00:10.00	00:00:00.02	00:00:00.04	00:00:05.69	00:00:07.64	00:00:00.24	00:00:01.72	00:00:06.70	00:00:30.49
11	2048	NaN	00:00:00.05	00:00:00.09	00:00:05.60	00:00:07.83	00:00:00.53	00:00:01.88	00:00:06.87	00:00:31.09
12	4096	NaN	00:00:00.22	00:00:00.19	00:00:06.26	00:00:07.90	00:00:01.04	00:00:02.19	00:00:06.91	00:00:33.93
13	8192	NaN	00:00:00.50	00:00:00.41	00:00:06.29	00:00:07.73	00:00:01.97	00:00:02.49	00:00:06.61	00:00:33.81
14	16384	NaN	00:00:01.79	00:00:00.99	00:00:06.24	00:00:08.18	00:00:03.69	00:00:03.29	00:00:07.36	00:00:35.23
15	32768	NaN	00:00:06.17	00:00:02.39	00:00:06.48	00:00:08.29	00:00:07.45	00:00:04.93	00:00:07.02	00:00:36.09
16	65536	NaN	00:00:22.94	00:00:06.42	00:00:07.33	00:00:09.01	00:00:14.89	00:00:08.12	00:00:08.10	00:00:36.54
17	131072	00:00:10.00	00:01:29.27	00:00:19.52	00:00:09.75	00:00:10.53	00:00:29.97	00:00:15.42	00:00:09.45	00:00:37.33
18	262144	00:00:18.00	00:05:56.50	00:01:08.44	00:00:33.38	00:00:24.07	00:00:59.62	00:00:27.41	00:00:13.18	00:00:39.30
19	524288	00:00:46.00	00:25:34.58	00:03:56.82	00:01:35.27	00:03:43.66	00:01:56.67	00:00:54.05	00:00:19.65	00:00:41.45
20	1048576	00:02:30.00	01:51:13.43	00:19:54.75	00:04:37.15	00:03:01.16	00:05:06.48	00:02:24.73	00:00:32.95	00:00:46.14
21	2097152	00:09:15.00	09:25:47.64	03:05:07.64	00:13:36.51	00:08:47.47	00:20:27.94	00:09:41.43	00:01:06.51	00:01:02.67
22	4194304	NaN	36:12:23.74	10:37:51.21	00:55:44.43	00:32:06.70	01:21:12.33	00:38:30.86	00:04:03.26	00:02:23.47
23	8388608	NaN	143:16:09.94	38:42:51.42	03:33:30.53	02:00:49.37	05:11:44.45	02:33:14.60	00:15:46.26	00:08:03.76
24	16777216	NaN	NaN	NaN	14:39:11.99	07:13:47.12	20:43:03.80	09:48:43.42	01:00:24.06	00:29:07.84
NaN	17729800	09:16:12.00	NaN	NaN	15:31:31.75	07:18:42.54	23:09:22.43	10:54:08.64	01:07:35.39	00:32:51.55
25	33554432	NaN	NaN	NaN	56:03:46.81	26:27:41.29	83:29:21.06	39:17:43.82	03:59:32.79	01:54:56.52
26	67108864	NaN	NaN	NaN	211:17:37.60	106:40:17.17	328:58:04.68	157:18:30.50	15:42:15.94	07:18:52.91
NaN	100000000	291:07:12.00	NaN	NaN	NaN	234:51:35.39	NaN	NaN	35:03:44.61	16:22:40.81
27	134217728	NaN	NaN	NaN	NaN	NaN	NaN	NaN	64:41:55.09	29:13:48.12
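The runtimes in the table above are formatted strings, so comparing columns requires converting them to seconds first. The helper below is a small sketch for that purpose; `runtime_to_seconds` is a hypothetical name and is not part of STUMPY. The two sample values are taken from the i = 23 row (STUMP.2 and GPU-STUMP.DGX2).

```python
def runtime_to_seconds(text):
    """Convert an 'HH:MM:SS.ss' table entry into seconds; 'NaN' becomes None."""
    if text == "NaN":
        return None
    hours, minutes, seconds = text.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


# Row i = 23 (n = 8388608) from the table
stump_2 = runtime_to_seconds("143:16:09.94")  # STUMP.2
dgx2 = runtime_to_seconds("00:08:03.76")      # GPU-STUMP.DGX2

speedup = stump_2 / dgx2
print(f"GPU-STUMP.DGX2 is about {speedup:.0f}x faster than STUMP.2 at i = 23")
```

For this row the ratio works out to roughly three orders of magnitude, which is hard to see at a glance in the raw table.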

Hardware Resources

GPU-STOMP: These results are reproduced from the original Matrix Profile II paper - NVIDIA Tesla K80 (contains 2 GPUs) and serves as the performance benchmark to compare against.

STUMP.2: stumpy.stump executed with 2 CPUs in Total - 2x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors parallelized with Numba on a single server without Dask.

STUMP.16: stumpy.stump executed with 16 CPUs in Total - 16x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors parallelized with Numba on a single server without Dask.

STUMPED.128: stumpy.stumped executed with 128 CPUs in Total - 8x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors x 16 servers, parallelized with Numba, and distributed with Dask Distributed.

STUMPED.256: stumpy.stumped executed with 256 CPUs in Total - 8x Intel(R) Xeon(R) CPU E5-2650 v4 @ 2.20GHz processors x 32 servers, parallelized with Numba, and distributed with Dask Distributed.

GPU-STUMP.1: stumpy.gpu_stump executed with 1x NVIDIA GeForce GTX 1080 Ti GPU, 512 threads per block, 200W power limit, compiled to CUDA with Numba, and parallelized with Python multiprocessing

GPU-STUMP.2: stumpy.gpu_stump executed with 2x NVIDIA GeForce GTX 1080 Ti GPU, 512 threads per block, 200W power limit, compiled to CUDA with Numba, and parallelized with Python multiprocessing

GPU-STUMP.DGX1: stumpy.gpu_stump executed with 8x NVIDIA Tesla V100, 512 threads per block, compiled to CUDA with Numba, and parallelized with Python multiprocessing

GPU-STUMP.DGX2: stumpy.gpu_stump executed with 16x NVIDIA Tesla V100, 512 threads per block, compiled to CUDA with Numba, and parallelized with Python multiprocessing

Running Tests

Tests are written in the tests directory and processed using PyTest; they require coverage.py for code coverage analysis. Tests can be executed with:

./test.sh

Python Version

STUMPY supports Python 3.8+ and, due to the use of unicode variable names/identifiers, is not compatible with Python 2.x. Given the small dependencies, STUMPY may work on older versions of Python but this is beyond the scope of our support and we strongly recommend that you upgrade to the most recent version of Python.

Getting Help

First, please check the discussions and issues on GitHub to see if your question has already been answered there. If no solution is available there, feel free to open a new discussion or issue and the authors will attempt to respond in a reasonably timely fashion.

Contributing

We welcome contributions in any form! Assistance with documentation, particularly expanding tutorials, is always welcome. To contribute please fork the project, make your changes, and submit a pull request. We will do our best to work through any issues with you and get your code merged into the main branch.

Citing

If you have used this codebase in a scientific publication and wish to cite it, please use the Journal of Open Source Software article.

S.M. Law, (2019). STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining. Journal of Open Source Software, 4(39), 1504.

@article{law2019stumpy,
  author  = {Law, Sean M.},
  title   = {{STUMPY: A Powerful and Scalable Python Library for Time Series Data Mining}},
  journal = {{The Journal of Open Source Software}},
  volume  = {4},
  number  = {39},
  pages   = {1504},
  year    = {2019}
}

References

Yeh, Chin-Chia Michael, et al. (2016) Matrix Profile I: All Pairs Similarity Joins for Time Series: A Unifying View that Includes Motifs, Discords, and Shapelets. ICDM:1317-1322.

Zhu, Yan, et al. (2016) Matrix Profile II: Exploiting a Novel Algorithm and GPUs to Break the One Hundred Million Barrier for Time Series Motifs and Joins. ICDM:739-748.

Yeh, Chin-Chia Michael, et al. (2017) Matrix Profile VI: Meaningful Multidimensional Motif Discovery. ICDM:565-574.

Zhu, Yan, et al. (2017) Matrix Profile VII: Time Series Chains: A New Primitive for Time Series Data Mining. ICDM:695-704.

Gharghabi, Shaghayegh, et al. (2017) Matrix Profile VIII: Domain Agnostic Online Semantic Segmentation at Superhuman Performance Levels. ICDM:117-126.

Zhu, Yan, et al. (2017) Exploiting a Novel Algorithm and GPUs to Break the Ten Quadrillion Pairwise Comparisons Barrier for Time Series Motifs and Joins. KAIS:203-236.

Zhu, Yan, et al. (2018) Matrix Profile XI: SCRIMP++: Time Series Motif Discovery at Interactive Speeds. ICDM:837-846.

Yeh, Chin-Chia Michael, et al. (2018) Time Series Joins, Motifs, Discords and Shapelets: a Unifying View that Exploits the Matrix Profile. Data Min Knowl Disc:83-123.

Gharghabi, Shaghayegh, et al. (2018) Matrix Profile XII: MPdist: A Novel Time Series Distance Measure to Allow Data Mining in More Challenging Scenarios. ICDM:965-970.

Zimmerman, Zachary, et al. (2019) Matrix Profile XIV: Scaling Time Series Motif Discovery with GPUs to Break a Quintillion Pairwise Comparisons a Day and Beyond. SoCC '19:74-86.

Akbarinia, Reza, and Bertrand Cloez. (2019) Efficient Matrix Profile Computation Using Different Distance Functions. arXiv:1901.05708.

Kamgar, Kaveh, et al. (2019) Matrix Profile XV: Exploiting Time Series Consensus Motifs to Find Structure in Time Series Sets. ICDM:1156-1161.

License & Trademark

STUMPY

stumpy's Issues

Add MPdist

According to the ICDM publication, the calculation of MPdist is pretty straightforward. For two time series, A and B, of identical length and a window size, m, equal to, say, 50:

Compute the matrix profile AB for stumpy.stump(A, m=50, T_B=B, ignore_trivial=False)
Compute the matrix profile BA for stumpy.stump(B, m=50, T_B=A, ignore_trivial=False)
Concatenate both matrix profiles into PABBA
Choose some k that is 5 percent of 2 * n

As mentioned on page 3 of the above paper, "section, this data structure PABBA contains all the information we need to compute the MPdist."
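The list above stops at choosing k; in the MPdist paper, the distance itself is then read off as the k-th smallest value of the sorted PABBA. A minimal sketch of that final step follows. It assumes the two matrix profiles have already been computed (for example with the `stumpy.stump` calls listed above) and are passed in as plain sequences of distances. The function name `mpdist_from_profiles` is hypothetical, and treating k as a zero-based index that is clamped to the array length is an assumption of this sketch.

```python
def mpdist_from_profiles(p_ab, p_ba, n, percentage=0.05):
    """Return the k-th smallest value of the concatenated profiles P_ABBA.

    p_ab, p_ba : matrix profile distances for the AB and BA joins
    n          : length of each time series
    percentage : fraction of 2 * n used to choose k (5 percent above)
    """
    p_abba = sorted(list(p_ab) + list(p_ba))
    k = int(percentage * 2 * n)
    k = min(k, len(p_abba) - 1)  # assumption: clamp k so it stays in bounds
    return p_abba[k]


# Toy profile values (made up for illustration), as if n = 20 and m = 16
p_ab = [0.9, 0.1, 0.5, 0.7, 0.3]
p_ba = [0.8, 0.2, 0.6, 0.4, 1.0]
distance = mpdist_from_profiles(p_ab, p_ba, n=20)
```

Because only the k-th smallest value matters, a handful of badly matching subsequences do not dominate the result, which is the robustness property the paper emphasizes.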

The pseudocode can be found here.

The supporting site can be found here

Improve Testing Script to Fail on Pytest Exit Code 2

Currently, our test.sh script runs a series of tests. However, Pytest returns exit code 2 if there is a failure and it does not cause the test.sh script to fail and exit. Instead, we need to:

Force pytest to exit on the first failure
Check the pytest exit code and exit test.sh if the pytest exit code is nonzero

Add NABDConf Presentation

Consider adding the NABDConf video to the left contents pane in RTD and to the README

Wrong Python Version Number

The minimum Python version should be Python 3.6+ and not Python 3.5+. It has never worked for Python 3.5 anyways due to the presence of f-strings (see #48). Keeping the Python version at 3.6+ also makes it consistent with black code formatter that is also only available for Python 3.6+.

Currently, CI is only tested on Python 3.6+

Incorrect GPU Output

In the unit tests, the gpu_stump test input data includes:

test_data = [
    (
        np.array([9, 8100, -60, 7], dtype=np.float64),
        np.array([584, -11, 23, 79, 1001, 0, -19], dtype=np.float64),
    ),
    (
        np.random.uniform(-1000, 1000, [8]).astype(np.float64),
        np.random.uniform(-1000, 1000, [64]).astype(np.float64),
    ),
]

Currently, the tests that use this data as input pass, and the output has been confirmed to match the output from stumpy.stump. However, this is only tested with a window size of m=3. When m=13 and T_B=np.random.uniform(-1000, 1000, [64]).astype(np.float64), the self-join test fails:

@pytest.mark.parametrize("T_A, T_B", test_data)
def test_stump_self_join(T_A, T_B):
    m = 13
    if len(T_B) > m:
        zone = int(np.ceil(m / 4))
        left = np.array(
            [
                naive_mass(Q, T_B, m, i, zone, True)
                for i, Q in enumerate(core.rolling_window(T_B, m))
            ],
            dtype=object,
        )
        right = gpu_stump(T_B, m, ignore_trivial=True, threads_per_block=THREADS_PER_BLOCK)
        replace_inf(left)
        replace_inf(right)
        npt.assert_almost_equal(left, right)

        right = gpu_stump(
            pd.Series(T_B), m, ignore_trivial=True, threads_per_block=THREADS_PER_BLOCK
        )
        replace_inf(right)
        npt.assert_almost_equal(left, right)

E           AssertionError: 
E           Arrays are not almost equal to 7 decimals
E           
E           Mismatch: 9.62%
E           Max absolute difference: 1.2708986280208605
E           Max relative difference: 0.35300318929522423
E            x: array([2.2705092863251686, 2.319594287314451, 2.3776967426085363,
E                  2.3915692674488906, 2.536081316513293, 2.969978484098601,
E                  2.2705092863251686, 2.319594287314451, 2.3776967426085363,...
E            y: array([2.270509286325168, 2.31959428731445, 2.3776967426085367,
E                  2.3915692674488898, 2.5360813165132923, 2.9699784840986023,
E                  3.509305221847841, 3.470126329813803, 3.6485953706293968,...

tests/test_gpu_stump.py:109: AssertionError
========================================================= 1 failed, 1 passed in 1.50 seconds =========

Add FOSSA dependency scanning

FOSSA provides a way to determine the open source dependencies used in the application and surface those, especially deep dependencies.

https://app.fossa.com/account/github_login

Use this link to set up STUMPY.

Replace python install with python -m pip install

According to Brett Cannon, it is much safer to be using python -m pip install <package> than straight pip install. It would be good to update our setup.sh script accordingly.

Add GPU-MSTUMP: Multi-dimensional STUMP on GPUs

Is there a way to run multi-dimensional time series data analysis on GPUs instead of using Dask Distributed MSTUMPED?

Using DataFrame Inputs with MSTUMP/MSTUMPED

@marcusau asked:

I have tried multi-dimensional array to mstump with stock prices and its technical indicators

#### step 2 : Feature creation
df1=df.copy()

df1['Close_pct']=np.log(df1['Close'] / df1['Close'].shift(1)).dropna()
df1['STDDEV']= ta.STDDEV(df1['Close'], timeperiod=5, nbdev=1)
### Volume Indicator Functions
df1['OBV']=ta.OBV(df1['Close'], df1['Volume'])
df1['Chaikin AD'] = ta.AD(df1['High'], df1['Low'], df1['Close'], df1['Volume'])

### momentum Indicator

macd, macdsignal, macdhist = ta.MACD(df1['Close'], fastperiod=12, slowperiod=26, signalperiod=9)
df1['macdhist']=macdhist

df1['RSI']= ta.RSI(df1['Close'], timeperiod=14)

# Volatility Indicator Functions
df1['NATR'] = ta.NATR(df1['High'], df1['Low'], df1['Close'], timeperiod=14)
df1['TRANGE'] = ta.TRANGE(df1['High'], df1['Low'], df1['Close'])
print(df1.tail())


print(list(set(df.columns) ^ set(df1.columns)))

feature_cols=list(set(df.columns) ^ set(df1.columns))
df1=df1.loc[:,feature_cols].dropna()
print(df1.head())


#Store these values in the NumPy array for using in our models later:
f_x=()
for f in feature_cols:
  f_x += (df1[f].values.reshape(-1,1),)
X = np.concatenate(f_x,axis=1)
X=X.T
print(X.shape)
>>>> (8, 2429)

window_size = 10  # Approximately, how many data points might be found in a pattern

matrix_profile, matrix_profile_indices = stumpy.mstump(X, m=window_size)

left_matrix_profile_index = matrix_profile[:, 2]
right_matrix_profile_index = matrix_profile[:, 3]

as you said, mstumpy only support 1-D data, however, what is the explanation of the results in mstumpy?

Add additional documentation to readthedocs.io

Currently, this repo is associated with https://stumpy.readthedocs.io/en/latest/ and all of the docstrings are in reStructuredText format. It would be a good first issue for anybody who'd like to contribute better documentation for the API.

Convert docstrings to documentation
Remove install-from-source from the README and add to readthedocs
Embed the tutorial notebooks into the Sphinx documentation on ReadTheDocs. (Example: freud)
Add badge to README.rst

Host Tutorial Notebooks on Binder

https://gke.mybinder.org/

Add FLUSS and FLOSS

Implement the FLUSS and FLOSS algorithms for offline and online semantic segmentation

Tutorial Notebooks Appear Empty or Blank

examples notebooks are empty?

https://github.com/TDAmeritrade/stumpy/tree/master/notebooks

Add Discourse Channel

Set up a Discourse channel for help/questions/comments/suggestions

Write GPU vs CPU Blog

Discuss lessons learned
Provide percentage speedup (relative to naive and STOMP)
Talk about future improvements

Review FOSSA Obligations

Deadline August 23, 2019

Add Changelog

Don’t let your friends dump git logs into changelogs

A good reference of what the changelog should contain and look like is here

Improve Tutorial #2 on Time Series Chains

Tutorial 2 lacks the narrative of Tutorial 1 and seems incomplete.

Typo in Times Series Chains Example in README

The time series chains example shows:

left_matrix_profile_index = matrix_profile[2]
right_matrix_profile_index = matrix_profile[3]

and, instead, should say:

left_matrix_profile_index = matrix_profile[:, 2]
right_matrix_profile_index = matrix_profile[:, 3]

Project Roadmap

Below are some opportunities for future development:

~~SCRIMP++~~
~~GPU-STUMP~~
top-K
GPU-MSTUMP
MPdist
snippets
~~FLOSS Semantic Segmentation YouTube~~
~~STOMPI Incremental STOMP (not interactive)~~

Add Earlier Reference to Documentation in README

It would be helpful to point out the RTD link earlier, say, in the first paragraph of the README since it is currently buried near the middle of the README.

Add Python 3.8 to Azure Pipelines

This is still not fully available yet but should be ready soon on Azure Pipelines

Transpose DataFrame Input for MSTUMP/MSTUMPED

In general, STUMPY assumes that each row of your input array represents data from a different dimension while each column in your input array represents data from the same dimension. In the case of a NumPy array:

import numpy as np
import stumpy

x = np.random.rand(10)
y = np.random.rand(3, 10)

mp_1d = stumpy.stump(x, 5)  # Python identifiers cannot start with a digit
mp_3d = stumpy.stump(y, 5)

This works fine. Similarly, STUMPY has Pandas support and so a Pandas Series also works:

import pandas as pd

mp_1d = stumpy.stump(pd.Series(x), 5)

Note that the transpose of x also gives the same answer:

mp_1d = stumpy.stump(pd.Series(x.T), 5)

In other words, stump isn't affected as long as your 1-dimensional input data is a row-wise 1-dimensional numpy array. STUMPY automatically converts your 1-dimensional input into a NumPy array by calling np.asarray on the stump time series input.

However, when we have a Pandas DataFrame (rather than a Series), the data is typically column-wise, where each column is a dimension and each row holds one time point across all dimensions. Calling np.asarray on this DataFrame ends up producing an undesirable input for mstump or mstumped since it is column-wise and not row-wise. We need to correct this by detecting that we have a DataFrame input and then automatically transposing the DataFrame before calling np.asarray.

Additionally, we should add some safeguards to check that we only have a 1-d array for stump/stumped and, equivalently, that we have n-d array for mstump/mstumped.

Replace Performance Table in README.rst with Graph

Currently, the performance comparisons are shown in the README.rst as a table. A graph might be a better way to express the data in the table of runtimes, which can be difficult to read with all the NaNs. A graph can be interpreted visually. Maybe use log scale, given that 100M data points were tested!

Add Motif Discovery to Tutorial #1

Tutorial #1 only discusses anomalies. It would be good to include a section on motif discovery using the steamgen data set (column #4 is the steam flow):

import pandas as pd

colnames = ['drum pressure',
            'excess oxygen',
            'water level',
            'steam flow']

# Raw string so that \s is passed to the regex engine unchanged
steam_df = pd.read_csv('https://www.cs.ucr.edu/~eamonn/iSAX/steamgen.dat', header=None, sep=r"\s+")
steam_df.columns = colnames

Add Inline Plotting to Tutorial 1

%matplotlib inline

Needs to be added to Tutorial 1 to allow plots to display pre-rendered on Github

Speed Up GPU-STUMP

Currently, the code works with CuPy but isn't performant. This issue is to explore the improvement of the performance.

python 3.5 f-string

Python 3.5 does not have f strings:

https://github.com/TDAmeritrade/stumpy/blob/master/stumpy/stomp.py#L131

Should stumpy be compatible with 3.5?

Typos in the tutorials

Clean up a few of the small typos in the tutorials

Add GPU-STUMP

According to these CuPy examples, it may be pretty straightforward to port the stumpy._stump function over to CuPy code. Specifically, the k-means example shows how to mix a Python for-loop with a CuPy GPU function call.

Try CuPy
Google Colab example
Add to Tutorial 1
Add unit test

Publish in JOSS

Submission process

Add NJIT to STAMP MASS and Other Functions

For speed, there's no reason why we can't make everything JIT compiled

window_size 1 and 2 doesn't seem to work

Even if ignore_trivial is explicitly set to True, there seems to be a problem when m is 1 or 2.

test = [3,8,9,2,5,1,17,4,11,18]
profile = stumpy.stump(np.float_(test), m=1, ignore_trivial = True)
print(profile)
print(2 ** 0.5)

[[1.4141633185218911 1 -1 -1]
 [1.4142135623730951 2 -1 -1]
 [1.4142135623730951 0 0 -1]
 [1.4142135623730951 0 0 -1]
 [1.4142135623730951 0 0 -1]
 [1.4142135623730951 1 1 -1]
 [1.4142135623730951 0 0 -1]
 [1.4142135623730951 1 1 -1]
 [1.4142135623730951 0 0 -1]
 [1.4142135623730951 0 0 -1]]
1.4142135623730951

test = [3,8,9,2,5,1,17,4,11,18]
profile = stumpy.stump(np.float_(test), m=2, ignore_trivial = True)
print(profile)

[[0.0 1 -1 -1]
 [0.0 3 -1 3]
 [0.0 4 0 4]
 [0.0 0 0 5]
 [0.0 2 2 6]
 [0.0 0 0 7]
 [0.0 2 2 -1]
 [0.0 0 0 -1]
 [0.0 0 0 -1]]
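One plausible explanation, offered here as an assumption rather than a confirmed diagnosis, is that z-normalization is degenerate for such small windows. With m = 1, every subsequence has zero standard deviation, so the z-normalized distance is not well defined. With m = 2, every non-constant subsequence z-normalizes to either (-1, 1) or (1, -1), so any two subsequences that move in the same direction are at distance 0. The sketch below checks this on the same test data; `znorm` is a hypothetical helper, not a STUMPY function.

```python
import math


def znorm(values):
    """Z-normalize a subsequence; return None when the std is zero."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    if std == 0:
        return None  # undefined: cannot divide by a zero standard deviation
    return tuple((v - mean) / std for v in values)


test = [3, 8, 9, 2, 5, 1, 17, 4, 11, 18]

# m = 1: every single-point subsequence has zero standard deviation
shapes_m1 = {znorm(test[i:i + 1]) for i in range(len(test))}

# m = 2: only two distinct z-normalized shapes are possible
shapes_m2 = {znorm(test[i:i + 2]) for i in range(len(test) - 1)}

print(shapes_m1)
print(shapes_m2)
```

If this reading is right, the outputs above are a property of z-normalized distances at tiny window sizes rather than a bug in the exclusion-zone logic, and a minimum window size of 3 would be a reasonable guard.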

Add Tutorial(s) that Reproduces “Matrix Profile Top Ten” Paper

It would be nice to add a tutorial(s) that reproduces the Matrix Profile Top Ten paper. The accompanying data at their Google sites page can be found here.

It might be best to make the individual top-ten sections separate items (i.e., a sub-list) that roll up under one tutorial line item in RTD, which can then expand to ten sub-items.

Add Parallel-GPU Support

According to this Numba unit test we can use multiple GPUs on the same server by using a context manager. However, it isn't clear if this is sequential execution on each GPU or concurrent.

Specifying package version

I was looking into adding support for ReadTheDocs when I realized that there isn't a great way to find the stumpy version programmatically. I would like to know the package version via stumpy.__version__, as defined in PEP396. For example:

>>> import stumpy
>>> stumpy.__version__
'1.0.0'

https://www.python.org/dev/peps/pep-0396/

This could be accomplished by simply defining a version identifier with the line __version__ = '1.0' in the __init__.py file.

Also, the setup.py file currently defines the version as 1.0. Does this package adhere to some defined version schema like semantic versioning? If so, it would be good to add a "patch version" (major.minor.patch, like 1.0.0) for future releases of the package.
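To illustrate why a full major.minor.patch identifier is convenient, the sketch below parses a version string into a tuple of integers that compares correctly. `parse_version` is a hypothetical helper written for this example; it is not part of STUMPY, and it only handles plain numeric versions. For real projects, `packaging.version.Version` is the more robust choice.

```python
def parse_version(version):
    """Parse 'major.minor[.patch]' into a (major, minor, patch) tuple."""
    parts = [int(p) for p in version.split(".")]
    while len(parts) < 3:
        parts.append(0)  # treat a missing component as 0, so "1.0" == "1.0.0"
    return tuple(parts[:3])


current = parse_version("1.0")
patched = parse_version("1.0.1")

# Tuples compare numerically, unlike strings ("1.10.0" < "1.2.3" as text)
print(current < patched)
print(parse_version("1.2.3") < parse_version("1.10.0"))
```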

Add FLUSS and FLOSS Examples to README

Add Continuous Integration (CI)

Currently, there is no continuous integration for pull requests and everything is performed manually. Sadly, I have no experience here and would appreciate some help and/or guidance.

Use Azure Pipelines
flake8
mention code style requirement in CONTRIBUTING.md
run unit tests (Dask is an additional dependency)
run coverage tests (Dask is an additional dependency)
Add CI badge(s) to README.rst (builds passing, test coverage, etc)

Add STUMPY to Python 3 Statement

python3statement/python3statement.github.io#257

Add First-class Support for Pandas Series/DataFrames

In the tutorial, it's a little awkward that one has to extract the values from the pandas dataframe. First-class support for pandas Series/DataFrames (casting to a NumPy array, or potentially even returning a DataFrame with the same keys, if a DataFrame is passed in) would be a really nice feature.

Add API Links to README

In the "How to use STUMPY" section, it would be nice to add links to the API RTD

Typo in JOSS Article

Unfortunately, there is a typo where it says "squence" instead of "sequence"

Failures with single-precision float32 data

Hi @seanlaw! I was experimenting with stumpy today and had some unexpected failures when using single-precision float32 data (the NumPy default is double-precision float64).

Script to reproduce

import stumpy
import numpy as np

your_time_series = np.random.rand(10000).astype(np.float32)
window_size = 50  # Approximately, how many data points might be found in a pattern

matrix_profile = stumpy.stump(your_time_series, m=window_size)

Error output

Traceback (most recent call last):
  File "stumpy_float32_bug.py", line 7, in <module>
    matrix_profile = stumpy.stump(your_time_series, m=window_size)
  File "/redacted/lib/python3.6/site-packages/stumpy/stump.py", line 354, in stump
    core.check_dtype(T_A)
  File "/redacted/lib/python3.6/site-packages/stumpy/core.py", line 74, in check_dtype
    raise TypeError(msg)
TypeError: <class 'float'> type expected but found <class 'numpy.float32'>

Check for NaN

Having NaN values in the input array can lead to NaN output. We should check for this and error out with an appropriate message to have the user fill in the missing values.
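A check along these lines could look like the following sketch. The function name `check_no_nan` and the wording of the error message are assumptions made for illustration, not STUMPY's actual implementation; the sketch uses plain Python so that it runs without extra dependencies.

```python
import math


def check_no_nan(time_series):
    """Raise a ValueError if the input contains any NaN values."""
    bad_positions = [i for i, v in enumerate(time_series) if math.isnan(v)]
    if bad_positions:
        raise ValueError(
            f"Input contains {len(bad_positions)} NaN value(s), "
            f"first at index {bad_positions[0]}. "
            "Please fill in or remove the missing values first."
        )


check_no_nan([1.0, 2.0, 3.0])  # passes silently

try:
    check_no_nan([1.0, float("nan"), 3.0])
except ValueError as err:
    print(err)
```

Failing early with the position of the first NaN gives the user something actionable, instead of a matrix profile that is silently filled with NaN.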

Add STUMPI

See a reference implementation of STOMPI in Table 5 in this paper.

This is the incremental version of STUMP (not interactive!).

STAMPI can be found here.

Add Ability to Select GPU Device for GPU-STUMP

Currently, gpu_stump uses device 0 for all computations. However, if multiple GPUs are present, it would be useful to be able to specify which GPU to use.

Update MSTUMP/MSTUMPED Input Dimensionality Check

Compared to STUMP, one may intuitively expect the behavior of MSTUMP when passing a multi-dimensional array into MSTUMP. Currently, three 1D matrix profiles are returned instead of a single matrix profile for the 3D data.

Realistically, STUMP should only accept a 1D array. I have no idea what happens when you pass a multi-dimensional array into STUMP.

It might make sense to check the shape of the input array and then simply:

Warn the user that they might want to use MSTUMP/MSTUMPED instead
Warn the user that only the first dimension is used and the rest is ignored
~~Only take the first dimension of the input array and use that to compute a 1D matrix profile using STUMP/STUMPED~~

Add required TDA language to CODE_OF_CONDUCT.md

The code of conduct needs to be modified to include the following text:

In addition to this Contributor Code of Conduct, TD Ameritrade Associates remain subject to all company policy including our internal Code of Conduct.

I will submit a pull request.

Add short description for Matrix Profile in README.rst

Currently, the README does not provide a description of what a "Matrix Profile" is and, instead, only points the user to a paper. It would be better to have a short description of the matrix profile in the README without referring to an external paper. This could be done in a section labeled "The Matrix Profile" with a description, which is what a user would expect from the anchor link in the intro.

Add "The Matrix Profile" description section to README.rst
Update link in the package description (first sentence) to reference/point to this section

Create a STUMPY Class Object

This object should store the matrix profile, matrix profile indices, and can access all of the relevant STUMPY functions that can act on a matrix profile, NumPy array, or Pandas dataframe.

Not sure how it should look or if it is overkill/not necessary so I want to open this up for discussion.
