tdameritrade / stumpy
STUMPY is a powerful and scalable Python library for modern time series analysis
Home Page: https://stumpy.readthedocs.io/en/latest/
License: Other
Tutorial 2 lacks the narrative of Tutorial 1 and seems incomplete.
Currently, the code works with CuPy but isn't performant. This issue is to explore the improvement of the performance.
The minimum Python version should be Python 3.6+ and not Python 3.5+. It has never worked for Python 3.5 anyways due to the presence of f-strings (see #48). Keeping the Python version at 3.6+ also makes it consistent with black code formatter that is also only available for Python 3.6+.
Currently, CI is only tested on Python 3.6+
FOSSA provides a way to determine the open source dependencies used in the application and surface those, especially deep dependencies.
https://app.fossa.com/account/github_login
Use this link to set up STUMPY.
According to this Numba unit test we can use multiple GPUs on the same server by using a context manager. However, it isn't clear if this is sequential execution on each GPU or concurrent.
Don’t let your friends dump git logs into changelogs
A good reference of what the changelog should contain and look like is here
According to Brett Cannon, it is much safer to use python -m pip install <package> than a bare pip install. It would be good to update our setup.sh script accordingly.
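A minimal sketch of the same idea from inside Python, assuming a helper named pip_install_command (hypothetical, not part of STUMPY or setup.sh). Building the command from sys.executable ties pip to the interpreter that is currently running, which is the point of the python -m pip form:

```python
import sys


def pip_install_command(package):
    """Build a `python -m pip install` command for the running interpreter.

    Hypothetical helper, for illustration only.
    """
    # sys.executable is the path of the interpreter executing this code, so
    # pip installs into that interpreter's environment and not whichever
    # `pip` happens to come first on PATH.
    return [sys.executable, "-m", "pip", "install", package]


cmd = pip_install_command("stumpy")
print(" ".join(cmd))
# To actually run the install: subprocess.check_call(cmd)
```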
In the "How to use STUMPY" section, it would be nice to add links to the API RTD
It would be nice to add one or more tutorials that reproduce the Matrix Profile Top Ten paper. The accompanying data at their Google sites page can be found here.
It might be best to make the individual top ten sections separate items (i.e., a sub-list) that roll up under one tutorial line item in RTD and that can expand to ten sub-items.
It would be helpful to point out the RTD link earlier, say, in the first paragraph of the README, since it is currently buried near the middle of the README.
Currently, the performance comparisons are shown in the README.rst as a table. A graph might be a better way to express the data in the table of runtimes, which can be difficult to read with all the NaNs. A graph can be interpreted visually. Maybe use log scale, given that 100M data points were tested!
Python 3.5 does not have f strings:
https://github.com/TDAmeritrade/stumpy/blob/master/stumpy/stomp.py#L131
Should stumpy be compatible with 3.5?
Currently, gpu_stump uses device 0 for all computations. However, if multiple GPUs are present, it would be useful to be able to specify which GPU to use.
@marcusau asked:
I have tried multi-dimensional array to mstump with stock prices and its technical indicators
#### step 2 : Feature creation
df1=df.copy()
df1['Close_pct']=np.log(df1['Close'] / df1['Close'].shift(1)).dropna()
df1['STDDEV']= ta.STDDEV(df1['Close'], timeperiod=5, nbdev=1)
### Volume Indicator Functions
df1['OBV']=ta.OBV(df1['Close'], df1['Volume'])
df1['Chaikin AD'] = ta.AD(df1['High'], df1['Low'], df1['Close'], df1['Volume'])
### momentum Indicator
macd, macdsignal, macdhist = ta.MACD(df1['Close'], fastperiod=12, slowperiod=26, signalperiod=9)
df1['macdhist']=macdhist
df1['RSI']= ta.RSI(df1['Close'], timeperiod=14)
# Volatility Indicator Functions
df1['NATR'] = ta.NATR(df1['High'], df1['Low'], df1['Close'], timeperiod=14)
df1['TRANGE'] = ta.TRANGE(df1['High'], df1['Low'], df1['Close'])
print(df1.tail())
print(list(set(df.columns) ^ set(df1.columns)))
feature_cols=list(set(df.columns) ^ set(df1.columns))
df1=df1.loc[:,feature_cols].dropna()
print(df1.head())
# Store these values in a NumPy array for use in our models later:
f_x = ()
for f in feature_cols:
    f_x += (df1[f].values.reshape(-1, 1),)
X = np.concatenate(f_x, axis=1)
X = X.T
print(X.shape)
>>>> (8, 2429)
window_size = 10 # Approximately, how many data points might be found in a pattern
matrix_profile, matrix_profile_indices = stumpy.mstump(X, m=window_size)
left_matrix_profile_index = matrix_profile[:, 2]
right_matrix_profile_index = matrix_profile[:, 3]
As you said, mstumpy only supports 1-D data. However, what is the explanation of the results in mstumpy?
Currently, the README does not provide a description of what a "Matrix Profile" is and, instead, only points the user to a paper. It would be better to have a short description of the matrix profile in the README without referring to an external paper. This could be done in a section labeled "The Matrix Profile" with a description, which is what a user would expect from the anchor link in the intro.
For speed, there's no reason why we can't make everything JIT compiled
Currently, there is no continuous integration for pull requests and everything is performed manually. Sadly, I have no experience here and would appreciate some help and/or guidance.
According to these CuPy examples, it may be pretty straightforward to port the stumpy._stump function over to CuPy code. Specifically, the k-means example shows how to mix a Python for-loop with a CuPy GPU function call.
Consider adding the NABDConf video to the left contents pane in RTD and to the README
See a reference implementation of STOMPI in Table 5 in this paper.
This is the incremental version of STUMP (not interactive!).
STAMPI can be found here.
Currently, our test.sh script runs a series of tests. However, Pytest returns exit code 2 if there is a failure and it does not cause the test.sh script to fail and exit. Instead, we need to exit test.sh if the pytest exit code is nonzero.
Tutorial #1 only discusses anomalies. It would be good to include a section on motif discovery using the steamgen data set (column #4 is the steam flow):
import pandas as pd

colnames = ['drum pressure',
            'excess oxygen',
            'water level',
            'steam flow']
# The file is whitespace-delimited and has no header row
steam_df = pd.read_csv('https://www.cs.ucr.edu/~eamonn/iSAX/steamgen.dat', header=None, sep=r"\s+")
steam_df.columns = colnames
Are the example notebooks empty?
https://github.com/TDAmeritrade/stumpy/tree/master/notebooks
The code of conduct needs to be modified to include the following text:
In addition to this Contributor Code of Conduct, TD Ameritrade Associates remain subject to all company policy including our internal Code of Conduct.
I will submit a pull request.
This is still not fully available yet but should be ready soon on Azure Pipelines
Unfortunately, there is a typo where it says "squence" instead of "sequence"
Compared to STUMP, one may intuitively expect different behavior from MSTUMP when passing in a multi-dimensional array. Currently, three 1D matrix profiles are returned instead of a single matrix profile for the 3D data.
Realistically, STUMP should only accept a 1D array. I have no idea what happens when you pass a multi-dimensional array into STUMP.
It might make sense to check the shape of the input array and then simply:
Implement the FLUSS and FLOSS algorithms for offline and online semantic segmentation
The time series chains example shows:
left_matrix_profile_index = matrix_profile[2]
right_matrix_profile_index = matrix_profile[3]
and, instead, should say:
left_matrix_profile_index = matrix_profile[:, 2]
right_matrix_profile_index = matrix_profile[:, 3]
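The difference between the two forms can be checked with plain NumPy. The sketch below uses a small made-up array as a stand-in for the stumpy.stump output (one row per subsequence, four columns); the values are invented for illustration and do not come from a real run:

```python
import numpy as np

# Stand-in for the output of stumpy.stump: one row per subsequence and four
# columns (matrix profile, index, left index, right index). Made-up values.
matrix_profile = np.array(
    [
        [0.5, 3, -1, 3],
        [0.7, 4, 0, 4],
        [0.2, 5, 1, 5],
        [0.5, 0, 0, -1],
        [0.7, 1, 1, -1],
    ],
    dtype=object,
)

row_2 = matrix_profile[2]         # third ROW: the four values for subsequence 2
left_idx = matrix_profile[:, 2]   # third COLUMN: left index of every subsequence
right_idx = matrix_profile[:, 3]  # fourth COLUMN: right index of every subsequence

print(row_2.shape, left_idx.shape, right_idx.shape)
```

Indexing with [2] returns a single row of length four, whereas [:, 2] returns one value per subsequence, which is what the chains example needs.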
Set up a Discourse channel for help/questions/comments/suggestions
In general, STUMPY assumes that each row of your input array represents data from a different dimension while each column in your input array represents data from the same dimension. In the case of a NumPy array:
import numpy as np
import stumpy
x = np.random.rand(10)
y = np.random.rand(3, 10)
# Python identifiers cannot start with a digit, so use mp_1d / mp_3d
mp_1d = stumpy.stump(x, 5)
mp_3d = stumpy.stump(y, 5)
This works fine. Similarly, STUMPY has Pandas support and so a Pandas Series also works:
import pandas as pd
mp_1d = stumpy.stump(pd.Series(x), 5)
Note that the transpose of x also gives the same answer:
mp_1d = stumpy.stump(pd.Series(x.T), 5)
In other words, stump isn't affected as long as your 1-dimensional input data is a row-wise 1-dimensional NumPy array. STUMPY automatically converts your 1-dimensional input into a NumPy array by calling np.asarray on the stump time series input.
However, when we have a Pandas DataFrame (rather than a Series), the data is typically column-wise, where each column is a dimension and each row is one observation across all dimensions. Calling np.asarray on this DataFrame ends up producing an undesirable input for mstump or mstumped since it is column-wise and not row-wise. We need to correct this by detecting that we have a DataFrame input and then automatically transposing the DataFrame before calling np.asarray.
Additionally, we should add some safeguards to check that we only have a 1-d array for stump/stumped and, equivalently, that we have an n-d array for mstump/mstumped.
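The transpose-and-check behavior proposed above could look roughly like the following. This is only a sketch: to_row_wise_array and its ndim argument are hypothetical names, not existing STUMPY functions, and "n-d" is assumed here to mean a 2-dimensional array for mstump/mstumped:

```python
import numpy as np
import pandas as pd


def to_row_wise_array(T, ndim):
    """Convert T to a NumPy array and check its dimensionality.

    Hypothetical helper: a DataFrame (column-wise) is transposed so that each
    row holds one dimension. `ndim` is the dimensionality the caller expects
    (assumed: 1 for stump/stumped, 2 for mstump/mstumped).
    """
    if isinstance(T, pd.DataFrame):
        T = T.to_numpy().T  # columns become rows
    T = np.asarray(T)
    if T.ndim != ndim:
        raise ValueError(
            f"Expected a {ndim}-dimensional array but got {T.ndim} dimension(s)"
        )
    return T


df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [5.0, 6.0, 7.0, 8.0]})
print(to_row_wise_array(df, ndim=2).shape)       # (2, 4): one row per column of df
print(to_row_wise_array(df["a"], ndim=1).shape)  # (4,): a Series stays 1-dimensional
```

Passing the DataFrame with ndim=1 raises a ValueError, which is the kind of safeguard suggested for stump/stumped.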
In the tutorial, it's a little awkward that one has to extract the values from the pandas dataframe. First-class support for pandas Series/DataFrames (casting to a NumPy array, or potentially even returning a DataFrame with the same keys, if a DataFrame is passed in) would be a really nice feature.
This object should store the matrix profile, matrix profile indices, and can access all of the relevant STUMPY functions that can act on a matrix profile, NumPy array, or Pandas dataframe.
Not sure how it should look or if it is overkill/not necessary so I want to open this up for discussion.
Deadline August 23, 2019
I was looking into adding support for ReadTheDocs when I realized that there isn't a great way to find the stumpy version programmatically. I would like to know the package version via stumpy.__version__, as defined in PEP 396. For example:
>>> import stumpy
>>> stumpy.__version__
'1.0.0'
https://www.python.org/dev/peps/pep-0396/
This could be accomplished by simply defining a version identifier with the line __version__ = '1.0' in the __init__.py file.
Also, the setup.py file currently defines the version as 1.0. Does this package adhere to some defined version schema like semantic versioning? If so, it would be good to add a "patch version" (major.minor.patch, like 1.0.0) for future releases of the package.
Is there a way to do multi-dimensional time series data analysis using a GPU instead of the Dask Distributed MSTUMPED?
Having NaN values in the input array can lead to NaN output. We should check for this and error out with an appropriate message to have the user fill in the missing values.
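One possible shape for such a check, as a sketch only: check_no_nan is a hypothetical name and the error message is illustrative, not STUMPY's actual behavior.

```python
import numpy as np


def check_no_nan(T):
    """Raise a ValueError if the time series contains NaN values.

    Hypothetical helper, for illustration only.
    """
    T = np.asarray(T, dtype=np.float64)
    n_nan = int(np.isnan(T).sum())
    if n_nan > 0:
        raise ValueError(
            f"Input contains {n_nan} NaN value(s); fill in the missing values first."
        )
    return T


clean = check_no_nan([1.0, 2.0, 3.0])  # passes through unchanged

try:
    check_no_nan([1.0, np.nan, 3.0])
except ValueError as err:
    print(err)
```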
In the unit tests, the gpu_stump test input data includes:
test_data = [
    (
        np.array([9, 8100, -60, 7], dtype=np.float64),
        np.array([584, -11, 23, 79, 1001, 0, -19], dtype=np.float64),
    ),
    (
        np.random.uniform(-1000, 1000, [8]).astype(np.float64),
        np.random.uniform(-1000, 1000, [64]).astype(np.float64),
    ),
]
Currently, the tests that use this data as input pass and the output has been confirmed to match the output from stumpy.stump. However, this is only tested with a window size of m=3. When m=13 and T_B=np.random.uniform(-1000, 1000, [64]).astype(np.float64), the self-join test fails:
@pytest.mark.parametrize("T_A, T_B", test_data)
def test_stump_self_join(T_A, T_B):
    m = 13
    if len(T_B) > m:
        zone = int(np.ceil(m / 4))
        left = np.array(
            [
                naive_mass(Q, T_B, m, i, zone, True)
                for i, Q in enumerate(core.rolling_window(T_B, m))
            ],
            dtype=object,
        )
        right = gpu_stump(T_B, m, ignore_trivial=True, threads_per_block=THREADS_PER_BLOCK)
        replace_inf(left)
        replace_inf(right)
        npt.assert_almost_equal(left, right)
        right = gpu_stump(
            pd.Series(T_B), m, ignore_trivial=True, threads_per_block=THREADS_PER_BLOCK
        )
        replace_inf(right)
        npt.assert_almost_equal(left, right)
E AssertionError:
E Arrays are not almost equal to 7 decimals
E
E Mismatch: 9.62%
E Max absolute difference: 1.2708986280208605
E Max relative difference: 0.35300318929522423
E x: array([2.2705092863251686, 2.319594287314451, 2.3776967426085363,
E 2.3915692674488906, 2.536081316513293, 2.969978484098601,
E 2.2705092863251686, 2.319594287314451, 2.3776967426085363,...
E y: array([2.270509286325168, 2.31959428731445, 2.3776967426085367,
E 2.3915692674488898, 2.5360813165132923, 2.9699784840986023,
E 3.509305221847841, 3.470126329813803, 3.6485953706293968,...
tests/test_gpu_stump.py:109: AssertionError
========================================================= 1 failed, 1 passed in 1.50 seconds =========
%matplotlib inline
This needs to be added to Tutorial 1 to allow plots to display pre-rendered on GitHub
Hi @seanlaw! I was experimenting with stumpy today and had some unexpected failures when using single-precision float32 data (the NumPy default is double-precision float64).
import stumpy
import numpy as np
your_time_series = np.random.rand(10000).astype(np.float32)
window_size = 50 # Approximately, how many data points might be found in a pattern
matrix_profile = stumpy.stump(your_time_series, m=window_size)
Traceback (most recent call last):
File "stumpy_float32_bug.py", line 7, in <module>
matrix_profile = stumpy.stump(your_time_series, m=window_size)
File "/redacted/lib/python3.6/site-packages/stumpy/stump.py", line 354, in stump
core.check_dtype(T_A)
File "/redacted/lib/python3.6/site-packages/stumpy/core.py", line 74, in check_dtype
raise TypeError(msg)
TypeError: <class 'float'> type expected but found <class 'numpy.float32'>
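One way such a check could accept single-precision input is to test for any floating-point dtype and then widen to float64. The sketch below is an assumption about a possible fix, not the actual core.check_dtype implementation; as_float64 is a hypothetical name:

```python
import numpy as np


def as_float64(T):
    """Accept any floating-point array and return it as float64.

    Hypothetical helper: np.issubdtype(..., np.floating) is True for float16,
    float32 and float64 alike, so float32 input is widened, not rejected.
    """
    T = np.asarray(T)
    if not np.issubdtype(T.dtype, np.floating):
        raise TypeError(f"Floating-point data expected but found {T.dtype}")
    # copy=False avoids a copy when the data is already float64
    return T.astype(np.float64, copy=False)


your_time_series = np.random.rand(10000).astype(np.float32)
converted = as_float64(your_time_series)
print(your_time_series.dtype, "->", converted.dtype)
```

Whether silently widening is preferable to raising a clearer error message is a design choice left open here.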
Clean up a few of the small typos in the tutorials
Even if ignore_trivial is explicitly set to True, there seems to be a problem when m is 1 or 2.
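import numpy as np
import stumpy

test = [3, 8, 9, 2, 5, 1, 17, 4, 11, 18]
# np.float_ was removed in NumPy 2.0; build a float64 array explicitly
profile = stumpy.stump(np.array(test, dtype=np.float64), m=1, ignore_trivial=True)
print(profile)
print(2 ** 0.5)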
[[1.4141633185218911 1 -1 -1]
[1.4142135623730951 2 -1 -1]
[1.4142135623730951 0 0 -1]
[1.4142135623730951 0 0 -1]
[1.4142135623730951 0 0 -1]
[1.4142135623730951 1 1 -1]
[1.4142135623730951 0 0 -1]
[1.4142135623730951 1 1 -1]
[1.4142135623730951 0 0 -1]
[1.4142135623730951 0 0 -1]]
1.4142135623730951
test = [3, 8, 9, 2, 5, 1, 17, 4, 11, 18]
# np.float_ was removed in NumPy 2.0; build a float64 array explicitly
profile = stumpy.stump(np.array(test, dtype=np.float64), m=2, ignore_trivial=True)
print(profile)
[[0.0 1 -1 -1]
[0.0 3 -1 3]
[0.0 4 0 4]
[0.0 0 0 5]
[0.0 2 2 6]
[0.0 0 0 7]
[0.0 2 2 -1]
[0.0 0 0 -1]
[0.0 0 0 -1]]
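A likely explanation, assuming the distances are z-normalized Euclidean distances, is that z-normalization is degenerate for such short windows. The sketch below uses a standalone z_normalize helper (hypothetical, not a STUMPY function) on the same data:

```python
import numpy as np


def z_normalize(Q):
    """Z-normalize a subsequence: subtract its mean, divide by its standard deviation."""
    Q = np.asarray(Q, dtype=np.float64)
    return (Q - Q.mean()) / Q.std()


test = [3, 8, 9, 2, 5, 1, 17, 4, 11, 18]

# m = 2: every subsequence with two distinct values collapses to either
# [-1, 1] (rising) or [1, -1] (falling), so any two subsequences that move
# in the same direction are at distance zero from each other.
for i in range(len(test) - 1):
    print(test[i : i + 2], z_normalize(test[i : i + 2]))

# m = 1: the standard deviation of a single point is 0, so the
# normalization is 0 / 0 and yields NaN.
with np.errstate(invalid="ignore"):
    print(z_normalize(test[:1]))
```

With m = 2 the data contains several rising and several falling pairs, which is consistent with the all-zero matrix profile shown above.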
According to the ICDM publication, the calculation of MPdist is pretty straightforward. For two time series, A and B, of identical length and a window size, m, equal to, say, 50:
stumpy.stump(A, m=50, T_B=B, ignore_trivial=False)
stumpy.stump(B, m=50, T_B=A, ignore_trivial=False)
Take the k-th smallest value of the two resulting matrix profiles combined, with k that is 5 percent of 2 * n
As mentioned on page 3 of the above paper, "... this data structure P_ABBA contains all the information we need to compute the MPdist."
The pseudocode can be found here.
The supporting site can be found here
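The final selection step can be sketched with NumPy alone, given the two matrix profiles as 1-D arrays of distances. The function name mpdist_from_profiles, the rounding of k with ceil, and the clamping of k to the available length are assumptions made for this illustration. The input values are made up.

```python
import math

import numpy as np


def mpdist_from_profiles(P_AB, P_BA, n, percentage=0.05):
    """Compute MPdist from two AB-join matrix profiles (illustrative sketch).

    P_AB and P_BA are 1-D arrays of matrix profile distances, n is the length
    of each time series, and k is taken as `percentage` of 2 * n.
    """
    # Combine both profiles into P_ABBA and sort in ascending order
    P_ABBA = np.sort(np.concatenate([P_AB, P_BA]).astype(np.float64))
    k = math.ceil(percentage * 2 * n)
    # Assumption: clamp k so it always points at a valid (1-based) position
    k = min(max(k, 1), len(P_ABBA))
    return P_ABBA[k - 1]  # k-th smallest value


P_AB = np.array([0.5, 3.0, 2.0])
P_BA = np.array([1.0, 4.0, 0.2])
print(mpdist_from_profiles(P_AB, P_BA, n=20))  # k = 2, so the 2nd smallest value: 0.5
```

With STUMPY, the two inputs would be the first column of each stumpy.stump(..., ignore_trivial=False) result listed in the steps above.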
Currently, this repo is associated with https://stumpy.readthedocs.io/en/latest/ and all of the docstrings are in reStructuredText format. It would be a good first issue for anybody who'd like to contribute better documentation for the API.
Convert docstrings to documentation
Remove install-from-source from the README and add to readthedocs
Embed the tutorial notebooks into the Sphinx documentation on ReadTheDocs. (Example: freud)
Add badge to README.rst