functime-org / functime

Time-series machine learning at scale. Built with Polars for embarrassingly parallel feature extraction and forecasts on panel data.

Home Page: https://docs.functime.ai

License: Apache License 2.0

Makefile 0.14% Python 98.23% Rust 1.63%
forecasting machine-learning python time-series polars feature-engineering panel-data

functime's Introduction

Time-series machine learning at scale




functime is a powerful Python library for production-ready global forecasting and time-series feature extraction on large panel datasets.

functime also comes with time-series preprocessing (Box-Cox, differencing, etc.), cross-validation splitters (expanding and sliding window), and forecast metrics (MASE, SMAPE, etc.). All are optimized as lazy Polars transforms.

Join us on Discord!

Highlights

  • Fast: Forecast and extract features (e.g. tsfresh, Catch22) across 100,000 time series in seconds on your laptop
  • Efficient: Embarrassingly parallel feature engineering for time-series using Polars
  • Battle-tested: Machine learning algorithms that deliver real business impact and win competitions
  • Exogenous features: supported by every forecaster
  • Backtesting with expanding window and sliding window splitters
  • Automated lags and hyperparameter tuning using FLAML

Additional Highlights

functime comes with a specialized LLM agent to analyze, describe, and compare your forecasts. Check out the walkthrough here.

Getting Started

Install functime via the pip package manager.

pip install functime

functime comes with extra options. For example, to install functime with large-language model (LLM) and lightgbm features:

pip install "functime[llm,lgb]"
  • cat: To use catboost forecaster
  • xgb: To use xgboost forecaster
  • lgb: To use lightgbm forecaster
  • llm: To use the LLM-powered forecast analyst

Forecasting

import polars as pl
from functime.cross_validation import train_test_split
from functime.seasonality import add_fourier_terms
from functime.forecasting import linear_model
from functime.preprocessing import scale
from functime.metrics import mase

# Load commodities price data
y = pl.read_parquet("https://github.com/functime-org/functime/raw/main/data/commodities.parquet")
entity_col, time_col = y.columns[:2]

# Time series split
y_train, y_test = y.pipe(train_test_split(test_size=3))

# Fit-predict
forecaster = linear_model(freq="1mo", lags=24)
forecaster.fit(y=y_train)
y_pred = forecaster.predict(fh=3)

# functime ❤️ functional design
# fit-predict in a single line
y_pred = linear_model(freq="1mo", lags=24)(y=y_train, fh=3)

# Score forecasts in parallel
scores = mase(y_true=y_test, y_pred=y_pred, y_train=y_train)

# Forecast with target transforms and feature transforms
forecaster = linear_model(
    freq="1mo",
    lags=24,
    target_transform=scale(),
    feature_transform=add_fourier_terms(sp=12, K=6)
)

# Forecast with exogenous regressors!
# Just pass them into X
X = (
    y.select([entity_col, time_col])
    .pipe(add_fourier_terms(sp=12, K=6)).collect()
)
# Split the exogenous features X (not y) at the same cutoff
X_train, X_future = X.pipe(train_test_split(test_size=3))
forecaster = linear_model(freq="1mo", lags=24)
forecaster.fit(y=y_train, X=X_train)
y_pred = forecaster.predict(fh=3, X=X_future)

View the full walkthrough on forecasting here.
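
To make the scoring step above more concrete, here is a minimal pure-Python sketch of the standard MASE definition for a single series: the out-of-sample mean absolute error divided by the in-sample error of a seasonal naive forecast. The function name `mase_single_series` and the `sp` parameter are assumptions for illustration only; this is not functime's implementation, which operates on panel DataFrames.

```python
def mase_single_series(y_true, y_pred, y_train, sp=1):
    """Mean absolute scaled error for one series (illustrative sketch)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_train) <= sp:
        raise ValueError("y_train must be longer than the seasonal period sp")
    # Out-of-sample mean absolute error of the forecast
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    # In-sample mean absolute error of the seasonal naive forecast y[t - sp]
    naive_errors = [abs(y_train[i] - y_train[i - sp]) for i in range(sp, len(y_train))]
    scale = sum(naive_errors) / len(naive_errors)
    return mae / scale


score = mase_single_series(
    y_true=[5.0, 6.0],
    y_pred=[5.0, 7.0],
    y_train=[1.0, 2.0, 3.0, 4.0],
)
print(score)
```

A score below 1 means the forecast beat the naive baseline on average; a constant training series would make the scale zero, which this sketch does not handle.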

Feature Extraction

functime comes with over 100 time-series feature extractors. Every feature is easily accessible via functime's custom ts (time-series) namespace, which works with any Polars Series or expression. To register the custom ts Polars namespace, you must first import functime in your module.

import polars as pl
import numpy as np
from functime.feature_extractors import FeatureExtractor, binned_entropy

# Load commodities price data
y = pl.read_parquet("https://github.com/functime-org/functime/raw/main/data/commodities.parquet")

# Get column names ("commodity_type", "time", "price")
entity_col, time_col, value_col = y.columns

# Extract a single feature from a single time-series
# Bind the result to a new name so the imported function is not shadowed
entropy = binned_entropy(
    pl.Series(np.random.normal(0, 1, size=10)),
    bin_count=10
)

# 🔥 Also works on LazyFrames with query optimization
features = (
    pl.LazyFrame({
        "index": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        "value": np.random.normal(0, 1, size=10)
    })
    .select(
        binned_entropy=pl.col("value").ts.binned_entropy(bin_count=10),
        lempel_ziv_complexity=pl.col("value").ts.lempel_ziv_complexity(threshold=3),
        longest_streak_above_mean=pl.col("value").ts.longest_streak_above_mean(),
    )
    .collect()
)

# 🚄 Extract features blazingly fast on many
# stacked time-series using `group_by`
features = (
    y.group_by(entity_col)
    .agg(
        binned_entropy=pl.col(value_col).ts.binned_entropy(bin_count=10),
        lempel_ziv_complexity=pl.col(value_col).ts.lempel_ziv_complexity(threshold=3),
        longest_streak_above_mean=pl.col(value_col).ts.longest_streak_above_mean(),
    )
)

# 🚄 Extract features blazingly fast on windows
# of many time-series using `group_by_dynamic`
features = (
    # Compute rolling features at yearly intervals
    y.group_by_dynamic(
        time_col,
        every="12mo",
        by=entity_col,
    )
    .agg(
        binned_entropy=pl.col(value_col).ts.binned_entropy(bin_count=10),
        lempel_ziv_complexity=pl.col(value_col).ts.lempel_ziv_complexity(threshold=3),
        longest_streak_above_mean=pl.col(value_col).ts.longest_streak_above_mean(),
    )
)
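
To give a sense of what one of these extractors computes, here is a plain-Python sketch of a "longest streak above mean" feature. It assumes the tsfresh-style definition (the longest run of consecutive values strictly greater than the series mean); the function name is hypothetical and this is not functime's Polars implementation.

```python
def longest_streak_above_mean_sketch(values):
    """Length of the longest run of consecutive values strictly above the mean."""
    if not values:
        return 0
    mean = sum(values) / len(values)
    longest = current = 0
    for v in values:
        # Extend the current run, or reset it when the value is not above the mean
        current = current + 1 if v > mean else 0
        longest = max(longest, current)
    return longest


streak = longest_streak_above_mean_sketch([1, 5, 6, 1, 7, 8, 9, 1])
print(streak)
```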

Related Projects

If you are interested in general data-science related plugins for Polars, you must check out polars-ds. polars-ds is a project created by one of functime's core maintainers and is the easiest way to extend your Polars pipelines with commonly used data-science operations made blazing fast with Rust!

License

functime is distributed under Apache-2.0.

functime's People

Contributors

abstractqqq, andreasoprani, baggiponte, claysmyth, daryllimyt, domenicocinque, fbruzzesi, khrapovs, mathieucayssol, metaboulie, miroslaavi, ngriffiths13, tomburdge, topher-lo, vienneraphael

functime's Issues

Possible Name Changes to Some Features

I am not sure how Tsfresh decided on some of the names.

For example, variation_coefficient is in fact the coefficient of variation, which in industry we call CV. I think coefficient_of_variation would be better.

Here is a list of name changes I am proposing, with reasons:

  1. variation_coefficient ---> coefficient_of_variation (stated above)
  2. absolute_sum_of_changes ---> sum_abs_changes (absolute_sum_of_changes makes people think it is abs(sum(changes)) instead of the actual value, which is sum(abs(changes)))

Personally, I prefer a more concise naming convention. For well known items, we can use abbreviations.
3. large_standard_deviation ---> large_std

The list is not exhaustive. Please comment if you think some other features deserve a name change, or if you think we should stick to Tsfresh's naming conventions.
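
The ambiguity behind the second proposal is easy to see with a small worked example. The snippet below is plain Python, not functime code, and the variable names are only illustrative.

```python
x = [1, 3, 2, 5]

# Consecutive differences: [2, -1, 3]
changes = [b - a for a, b in zip(x, x[1:])]

# What the feature actually computes: sum of absolute changes
sum_abs_changes = sum(abs(c) for c in changes)

# What the name "absolute_sum_of_changes" might suggest: absolute value of the sum
abs_sum_changes = abs(sum(changes))

print(sum_abs_changes, abs_sum_changes)
```

The two quantities differ whenever the series changes direction, which is why the name matters.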

[BUG] plotting functions don't work when the number of columns is >3

import polars as pl
from functime import plotting

y = pl.scan_parquet("https://github.com/descendant-ai/functime/raw/main/data/commodities.parquet")


(
    y
    .filter(pl.col("commodity_type").eq("Aluminum"))
    .with_columns(pl.lit(1).alias("fourth_col"))
    .collect()
    .pipe(plotting.plot_panel)
)

This fails. The proposed solution is to replace this line:

https://github.com/neocortexdb/functime/blob/90e0f8d66a5ea242d546adeeafd61c60dab1eae7/functime/plotting.py#L53

With this:

entity_col, time_col, target_col = y.columns[:3]

As we do in other places of the codebase.
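
As a plain-Python illustration of the failure mode, assuming the linked line unpacks every column into exactly three names (an assumption, since the line itself is not reproduced here), slicing before unpacking keeps the assignment valid however many extra columns exist. Both helper names are hypothetical.

```python
columns = ["commodity_type", "time", "price", "fourth_col"]


def unpack_first_three(cols):
    # Slice first so extra columns are ignored
    entity_col, time_col, target_col = cols[:3]
    return entity_col, time_col, target_col


def unpack_all(cols):
    # Raises ValueError when there are more (or fewer) than three columns
    entity_col, time_col, target_col = cols
    return entity_col, time_col, target_col


print(unpack_first_three(columns))
try:
    unpack_all(columns)
except ValueError as exc:
    print(f"unpack_all failed: {exc}")
```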

Support NumPy 1d Array Whenever Series is Supported

I think for general purpose computing, we still need to support NumPy 1d array inputs. Asking the user to do Series -> NumPy or NumPy -> Series conversion can be cumbersome.

Most code for Series will work the same for NumPy. Some code will need clear branches like

isinstance(x, (pl.Series, np.ndarray))

because x.len() won't work on NumPy. But that is just a quick fix, and generally speaking code duplication is not a big deal.

What do you guys think?

Analyzing multiple time series with dimension reduction

The idea is to analyze multiple time series with dimension reduction based on our set of feature calculators. The steps are:

  1. Select N time series
  2. Calculate multiple features on the N time series
    The resulting dataset has one column per feature and one row per time series
  3. Apply the dimension reduction algorithm
  4. Plot the results using plotly

Binned Entropy Implementation is wrong

Currently binned_entropy is wrong because the behavior of pl.Series's hist function differs from NumPy's np.histogram, and the generated bins are very different. I am not sure whether this is a bug on Polars' side or just how hist is intended to be used on Series. Personally, I don't think it is a Polars error, but rather a "quirk". Another detail that took me a while to notice is that NumPy's histogram seems to be left-closed, while the default behavior in Polars is right-closed.

image

I compared the current result with tsfresh's result, which relies on NumPy's histogram function, and we can see that the current result is wrong.

image

I made a rewrite that works on pl.Expr, which I think is preferable for the future of the package.

def binned_entropy_rewrite(x:pl.Expr, bin_count: int = 10) -> pl.Expr:
    step_size = 1/bin_count
    breaks = np.linspace(step_size, stop = 1 - step_size, num = bin_count - 1)
    scaled_x = (x - x.min()) / (x.max() - x.min()) # This step slows down the calculation
    # Left closed because we want to mimic the behavior of NumPy's histogram
    return scaled_x.cut(breaks, left_closed=True).value_counts().struct.field("counts").entropy().suffix("_binned_entropy")

If we want to work lazily, we cannot set the break points before scanning through the data. However, we can map the values of x into [0, 1] using min-max scaling, which preserves how values are distributed across the bins.

The only downside of this approach is that it currently seems to be much slower than the NumPy (tsfresh) implementation. On a Series/array of size 100_000, this took 3.7ms while tsfresh took 0.6ms. I don't currently see how to improve this further.
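
The binning convention discussed here can be sketched in plain Python: min-max scale the values into [0, 1], assign each to a left-closed bin, and put the maximum into the last bin (as np.histogram does with its final edge). The function name `binned_entropy_sketch` is hypothetical, and floating-point rounding at bin edges may differ slightly from NumPy, so treat this as a reference for the idea rather than a drop-in replacement.

```python
import math
from collections import Counter


def binned_entropy_sketch(values, bin_count=10):
    """Entropy of the histogram of values using equal-width, left-closed bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values fall into a single bin, so the entropy is zero
        return 0.0
    counts = Counter()
    for v in values:
        scaled = (v - lo) / (hi - lo)  # min-max scaling into [0, 1]
        # Bin k covers [k / bin_count, (k + 1) / bin_count); the maximum
        # (scaled == 1.0) is clamped into the last bin
        idx = min(int(scaled * bin_count), bin_count - 1)
        counts[idx] += 1
    n = len(values)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


entropy_value = binned_entropy_sketch([0, 1, 2, 3], bin_count=2)
print(entropy_value)
```

With two bins, the values 0 and 1 land in the first bin and 2 and 3 in the second, giving the entropy of a fair coin.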

Bug: Energy Ratios

Energy_ratio has the same problem as some others. x.len() is an expression which cannot be used in range(), which expects a concrete int.

The rewrite provides a lazy way to segment and do aggregation on the segments. However, the drawback is that this approach is currently 10x slower than tsfresh. (Individually comparing, this is 10x slower. But if we are doing things in batch, a lot of CSE might come in and save us a bit. But still, we need to improve the speed for this one.)

If we know df is in memory, we might have better control over that. But I don't think looping in Python is a good solution (like in the og code, if we assume it worked).

Need more suggestions on potential speed ups.

def energy_ratio_rewrite(
    x: pl.Expr,
    num_segments: int
) -> pl.Expr:
    '''
    Divide x into num_segments by row order. For each segment, compute the energy on the segment. The
    energy ratio of all segments will be returned as a list in order.

    Parameters
    ----------
    x : pl.Expr
        Input expression
    num_segments: int
        The number of segments
    '''
    segments = pl.int_range(pl.lit(0), pl.count()).floordiv(pl.count().floordiv(num_segments)).alias("segment")
    sum_over_segment = pl.struct(
        segments
        , x.pow(2).sum().over(segments).alias("segment_energy")
    ).unique().sort() # This is slow
    total_energy = sum_over_segment.struct.field("segment_energy").sum()

    return (
        (sum_over_segment.struct.field("segment_energy") / total_energy).implode().alias("segment_energy_ratio")
    )
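
The segment arithmetic in the rewrite above can be checked with a small pure-Python sketch. It mirrors the same idea (segment id = row index floor-divided by `len(x) // num_segments`) and is not the Polars expression itself; both function names are hypothetical. One consequence worth noting: when the length is not divisible by `num_segments`, this indexing yields an extra trailing segment.

```python
def segment_ids(n, num_segments):
    """Segment id per row: row index floor-divided by the segment size."""
    size = n // num_segments
    if size == 0:
        raise ValueError("num_segments must not exceed the number of rows")
    return [i // size for i in range(n)]


def energy_ratios_sketch(x, num_segments):
    """Share of total energy (sum of squares) contributed by each segment, in order."""
    energy = {}
    for seg, v in zip(segment_ids(len(x), num_segments), x):
        energy[seg] = energy.get(seg, 0.0) + v * v
    total = sum(energy.values())
    return [energy[seg] / total for seg in sorted(energy)]


print(energy_ratios_sketch([1, 1, 2, 2], num_segments=2))
print(segment_ids(10, 3))  # length not divisible by 3: a fourth segment appears
```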

Pandas is a required dependency for plotting with plotly

I did a fresh install of functime in a clean env and tried to run the following:

import polars as pl
from functime import plotting


def main():
    y = pl.scan_parquet(
        "https://github.com/descendant-ai/functime/raw/main/data/commodities.parquet"
    )

    plotting.plot_panel(y)


if __name__ == "__main__":
    main()

The following error is raised:

ImportError: Plotly express requires pandas to be installed.

Proposed solution

We should add pandas to the list of dependencies (duh). However, I would go a step further and suggest breaking up the dependencies into optional groups. For example, one could install functime[plotting].

I think this could be a pretty big issue in terms of deployability for container size, etc. For example, in a production/inference setting, I believe one might not be interested in plotting capabilities.

A fresh functime install (including dependencies) is as much as 992MB in my venv. Pandas alone is 135MB (though numpy is a common dependency) which would make the whole "plotting" functions be as much as 293MB (just for plotly and pandas, though we would have to take out numpy that we might still be using under the hood for other things).

By comparison, a clean install of statsforecast (with dependencies) is just 637MB, statsforecast + mlforecast is 680MB, while scikit-learn is 218MB (with dependencies) and polars alone is 93MB.

Let me know what you think :)
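
One common way to support optional dependency groups is to check for the optional package only when the feature is used and raise a helpful error otherwise. The sketch below uses only the standard library; the helper name `require_optional` and the `plotting` extra are assumptions based on the proposal above, not existing functime code.

```python
import importlib.util


def require_optional(module_name, extra):
    """Raise a helpful ImportError if an optional top-level module is missing."""
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"{module_name!r} is required for this feature; "
            f"try: pip install 'functime[{extra}]'"
        )


# A stdlib module is always present, so this call passes silently
require_optional("json", extra="plotting")

try:
    # Hypothetical missing package, used only to show the error path
    require_optional("surely_missing_module_xyz_123", extra="plotting")
except ImportError as exc:
    print(exc)
```

A plotting function could call such a helper for pandas and plotly at the top of its body, so that a production install without the extra only fails when plotting is actually requested.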

examples to forecast with multiple feature columns

As described in the title, there currently seems to be only an example like

[entity_col, time_col, price_col]

How can I forecast with more columns as:

[entity_col, time_col, pct_col, low_col, high_col, open_col, close_col]

I tried data with this column schema, but it did not work.

[FEAT] Frequency based cross-validation and gap parameter

In the Discord server, a while ago we mentioned the possibility of generating CV splits based on time intervals (e.g. the first day of the month or week). This might be especially useful in financial settings (e.g. retraining the model on Mondays).

Plus, it might be useful to provide a gap parameter to allow for gaps between the train and test set (supported e.g. by skforecast):

image

plot source
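
To make the gap idea concrete, here is a minimal index-based sketch of an expanding-window split in which `gap` observations between the end of the training window and the start of the test window are left out. The function name and signature are hypothetical and do not reflect functime's splitter API.

```python
def expanding_window_split_with_gap(n_samples, test_size, n_splits, gap=0):
    """Return (train_indices, test_indices) pairs, oldest split first."""
    splits = []
    for k in range(n_splits, 0, -1):
        # Test windows are laid out back-to-back from the end of the series
        test_end = n_samples - (k - 1) * test_size
        test_start = test_end - test_size
        # Training data stops `gap` observations before the test window
        train_end = test_start - gap
        if train_end <= 0:
            raise ValueError("Not enough samples for the requested splits and gap")
        splits.append((list(range(train_end)), list(range(test_start, test_end))))
    return splits


for train_idx, test_idx in expanding_window_split_with_gap(10, test_size=2, n_splits=2, gap=1):
    print(train_idx, test_idx)
```

A frequency-based variant would pick the cutoffs from calendar boundaries (for example, each Monday) rather than from fixed row counts.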

TypeError: date_range() got an unexpected keyword argument 'eager'

Ran the example forecasting code on colab, to eager:)?

TypeError Traceback (most recent call last)
in <cell line: 16>()
14 model = lightgbm(freq="1mo", lags=24, max_horizons=3, strategy="ensemble")
15 model.fit(y=y_train)
---> 16 y_pred = model.predict(fh=3)
17
18 # functime ❤️ functional design

2 frames
/usr/local/lib/python3.10/dist-packages/polars/utils/decorators.py in wrapper(*args, **kwargs)
35 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
36 _rename_kwargs(function.__name__, kwargs, aliases)
---> 37 return function(*args, **kwargs)
38
39 return wrapper

TypeError: date_range() got an unexpected keyword argument 'eager'

Let the user decide how to close bounds in range_count

The implementation of range_count could benefit from letting the user decide how to close intervals:

def range_count(x: TIME_SERIES_T, lower: float, upper: float) -> INT_EXPR:
    """
    Computes values of input expression that is between lower (inclusive) and upper (exclusive).

    Parameters
    ----------
    x : pl.Expr | pl.Series
        Input time-series.
    lower : float
        The lower bound, inclusive
    upper : float
        The upper bound, exclusive

    Returns
    -------
    int | Expr
    """
    if upper < lower:
        raise ValueError("Upper must be greater than lower.")
    return x.is_between(lower_bound=lower, upper_bound=upper, closed="left").sum()

The new version would be:

def range_count(x: TIME_SERIES_T, lower: float, upper: float, closed: Literal['both', 'left', 'right', 'none']) -> INT_EXPR:
    """
    Computes the number of values of the input expression that lie between lower and upper.

    Parameters
    ----------
    x : pl.Expr | pl.Series
        Input time-series.
    lower : float
        The lower bound
    upper : float
        The upper bound
    closed: Literal['both', 'left', 'right', 'none']
        How to close the interval.

    Returns
    -------
    int | Expr
    """
    if upper < lower:
        raise ValueError("Upper must be greater than lower.")
    return x.is_between(lower_bound=lower, upper_bound=upper, closed=closed).sum()

If you think that's worth it I can submit a PR for this. Let me know
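
For reference, the four closure modes behave as follows on a plain Python list. This is an illustrative sketch, not the Polars-based implementation; the function name is hypothetical, and the default `closed="left"` is an assumption chosen to match the current behavior described above.

```python
def range_count_sketch(values, lower, upper, closed="left"):
    """Count values between lower and upper under the given interval closure."""
    if upper < lower:
        raise ValueError("Upper must be greater than lower.")
    checks = {
        "both": lambda v: lower <= v <= upper,
        "left": lambda v: lower <= v < upper,
        "right": lambda v: lower < v <= upper,
        "none": lambda v: lower < v < upper,
    }
    if closed not in checks:
        raise ValueError(f"Unknown value for closed: {closed!r}")
    return sum(1 for v in values if checks[closed](v))


for mode in ("both", "left", "right", "none"):
    print(mode, range_count_sketch([1, 2, 3], lower=1, upper=3, closed=mode))
```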

Add `plot_interval_forecasts` function

Rationale

We currently have a plot_forecasts function in functime.plotting. However, we also support probabilistic forecasts (see forecaster.conformalize). We need to do a better job promoting this functionality. The best way is through a chart!

Prior Art

Let's just use this code example from https://plotly.com/python/continuous-error-bars/

Challenges

Plotting multiple forecasts at once. To do this, just reuse the multi-plot code from the other plotting functions.

`plotting.plot_panel` fails with LazyFrames

This code results in the following error: AttributeError: 'LazyFrame' object has no attribute 'get_column'

import polars as pl

y = pl.scan_parquet("https://github.com/descendant-ai/functime/raw/main/data/commodities.parquet")

plotting.plot_panel(y)

This is because get_column is available for pl.DataFrames only. I think plot_panel could benefit from lazy operations, since only number of entities * last_n points are plotted. The following line would just require adding a .collect() at the end if y is a LazyFrame.

https://github.com/neocortexdb/functime/blob/e02509c47882f556640b7c88299988cfbdde8836/functime/plotting.py#L54

Bug: change_quantile is lacking an aggregation

https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html#tsfresh.feature_extraction.feature_calculators.quantile

The change_quantile function should not return a series. Instead, after all the filtering, we should aggregate over the remaining values in the diff and return the aggregated value. I am proposing to add a type_alias module to add an AggStrategy type. See the issue here: #46

To add the aggregation, we simply need to do:

# I am proposing a new type_alias module which should contain AggStrategy
from typing_extensions import TypeAlias, Literal
AggStrategy: TypeAlias = Literal["mean", "sum", "median", "std", "max"]
def change_quantile_rewrite(
    x: pl.Expr, 
    q_low: float, 
    q_high: float, 
    is_abs: bool=True, 
    agg: AggStrategy = "mean",
    # Polars' default is nearest, but NumPy's default is linear
    # interpolation: InterpolationMethod = "linear"
) -> pl.Expr:
    
    if q_high <= q_low:
        return pl.lit(0.)

    # Use linear to conform to NumPy
    y = x.is_between(x.quantile(q_low, interpolation="linear"), x.quantile(q_high, interpolation="linear"))
    expr = x.filter(
        pl.all_horizontal(
            y
            , y.shift_and_fill(False, period=-1)
        )
    ).diff() 
    
    if is_abs:
        expr = expr.abs()

    if agg == "mean":
        return expr.mean()
    elif agg == "median":
        return expr.median()
    elif agg == "sum":
        return expr.sum()
    elif agg == "std":
        return expr.std()
    elif agg == "max":
        return expr.max()
    else:
        raise TypeError(f"The input: `{agg}` is not a valid aggregation strategy.")

Unable to poetry add functime

Hey guys, I'm able to install functime as a pip package, but I'm unable to add it with Poetry.

The problem lies with kaleido

(tsboost-py3.10) ferpapi tsboost $ poetry add functime
Using version ^0.2.4 for functime

Updating dependencies
Resolving dependencies... (4.2s)

Writing lock file

Package operations: 22 installs, 0 updates, 0 removals

  • Installing entrypoints (0.4)
  • Installing locket (1.0.0)
  • Installing mdurl (0.1.2)
  • Installing toolz (0.12.0)
  • Installing zipp (3.16.2)
  • Installing asciitree (0.3.3)
  • Installing fasteners (0.18)
  • Installing importlib-metadata (6.8.0)
  • Installing markdown-it-py (3.0.0)
  • Installing numcodecs (0.11.0)
  • Installing partd (1.4.0)
  • Installing pygments (2.15.1)
  • Installing pynndescent (0.5.8)
  • Installing dask (2023.7.0)
  • Installing flaml (1.2.4)
  • Installing kaleido (0.2.1.post1): Failed

  RuntimeError

  Unable to find installation candidates for kaleido (0.2.1.post1)

  at ~/.local/share/pypoetry/venv/lib/python3.10/site-packages/poetry/installation/chooser.py:109 in choose_for
      105│
      106│             links.append(link)
      107│
      108│         if not links:
    → 109│             raise RuntimeError(f"Unable to find installation candidates for {package}")
      110│
      111│         # Get the best link
      112│         chosen = max(links, key=lambda link: self._sort_key(package, link))
      113│

  • Installing polars (0.18.7)
  • Installing pylance (0.5.8)
  • Installing rich (13.4.2)
  • Installing umap-learn (0.5.3)
  • Installing zarr (2.15.0

any clues?

Auto LightGBM: unable to add a column of length 7644 to a dataframe of height 9334

I am following a similar pattern to the documentation for auto-lightgbm, but with the M4 weekly parquet file, and I receive an error several steps in. Maybe the error has to do with how much data is held out and the number of lags, coupled with some short time series, but I am not quite sure. Any thoughts?

import polars as pl
from flaml import tune
from functime.forecasting import auto_lightgbm


weekly_train_pl = pl.read_parquet("https://github.com/descendant-ai/functime/raw/main/data/m4_1w_train.parquet")
hourly_train_pl = pl.read_parquet('https://github.com/descendant-ai/functime/raw/main/data/m4_1d_train.parquet')

time_budget = 2

# Fit model
forecaster = auto_lightgbm(
    freq=-1,
    min_lags=4,
    max_lags=52,
    test_size=26,
    time_budget=time_budget,
)
forecaster.fit(y=weekly_train_pl)
[flaml.tune.tune: 07-23 00:17:53] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:17:57] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:04] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:12] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:20] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:26] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:35] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:45] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:52] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:01] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:12] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:20] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:32] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:43] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:57] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:20:10] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:20:22] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:20:36] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:20:50] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:21:04] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:21:18] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:21:36] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:21:53] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:22:09] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:22:25] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:22:43] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:23:00] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:23:17] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:23:39] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:23:57] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:24:18] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:24:37] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:25:02] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:25:26] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:25:47] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:26:11] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:26:34] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:26:56] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:27:22] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:27:45] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:28:09] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:28:33] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:29:00] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:29:26] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:29:52] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:30:18] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:30:46] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
---------------------------------------------------------------------------
ShapeError                                Traceback (most recent call last)
[<ipython-input-12-453b2f0928c1>](https://localhost:8080/#) in <cell line: 11>()
      9     time_budget=time_budget,
     10 )
---> 11 forecaster.fit(y=weekly_train_pl)
     12 
     13 # Get best lags and model hyperparameters

8 frames
[/usr/local/lib/python3.10/dist-packages/functime/base/forecaster.py](https://localhost:8080/#) in fit(self, y, X)
     78                 X = self._enforce_string_cache(X.lazy().collect())
     79             X = X.lazy()
---> 80         artifacts = self._fit(y=y, X=X)
     81         cutoffs = y.groupby(y.columns[0]).agg(pl.col(y.columns[1]).max().alias("low"))
     82         artifacts["__cutoffs"] = cutoffs.collect(streaming=True)

[/usr/local/lib/python3.10/dist-packages/functime/forecasting/automl.py](https://localhost:8080/#) in _fit(self, y, X)
    108         from functime.forecasting._ar import fit_cv
    109 
--> 110         return fit_cv(
    111             y=y,
    112             X=X,

[/usr/local/lib/python3.10/dist-packages/functime/forecasting/_ar.py](https://localhost:8080/#) in fit_cv(y, forecaster_cls, freq, min_lags, max_lags, max_horizons, strategy, test_size, step_size, n_splits, time_budget, search_space, points_to_evaluate, low_cost_partial_config, num_samples, cv, X, **kwargs)
    149     scores_path = []
    150     for lags in lags_path:
--> 151         score = evaluate(
    152             **{
    153                 "lags": lags,

[/usr/local/lib/python3.10/dist-packages/functime/forecasting/_evaluate.py](https://localhost:8080/#) in evaluate(lags, n_splits, time_budget, points_to_evaluate, num_samples, low_cost_partial_config, test_size, max_horizons, strategy, freq, forecaster_cls, y_splits, X_splits, search_space)
    134         score = result["mae"]
    135     else:
--> 136         tuner = flaml.tune.run(
    137             partial(
    138                 evaluate_windows,

[/usr/local/lib/python3.10/dist-packages/flaml/tune/tune.py](https://localhost:8080/#) in run(evaluation_function, config, low_cost_partial_config, cat_hp_cost, metric, mode, time_budget_s, points_to_evaluate, evaluated_rewards, resource_attr, min_resource, max_resource, reduction_factor, scheduler, search_alg, verbose, local_dir, num_samples, resources_per_trial, config_constraints, metric_constraints, max_failure, use_ray, use_spark, use_incumbent_result_in_evaluation, log_file_name, lexico_objectives, force_cancel, n_concurrent_trials, **ray_args)
    774                 result = None
    775                 with PySparkOvertimeMonitor(time_start, time_budget_s, force_cancel):
--> 776                     result = evaluation_function(trial_to_run.config)
    777                 if result is not None:
    778                     if isinstance(result, dict):

[/usr/local/lib/python3.10/dist-packages/functime/forecasting/_evaluate.py](https://localhost:8080/#) in evaluate_windows(config, lags, n_splits, test_size, max_horizons, strategy, freq, forecaster_cls, y_splits, X_splits)
     78         y_train, y_test = y_splits[i]
     79         X_train, X_test = X_splits[i] if X_splits is not None else None, None
---> 80         result = evaluate_window(
     81             y_train=y_train,
     82             y_test=y_test,

[/usr/local/lib/python3.10/dist-packages/functime/forecasting/_evaluate.py](https://localhost:8080/#) in evaluate_window(config, lags, test_size, max_horizons, strategy, freq, forecaster_cls, y_train, y_test, X_train, X_test)
     44         y_test = y_test.sort([entity_col, time_col])
     45         y_pred = y_pred.sort([entity_col, time_col])
---> 46         y_test = y_test.with_columns(**{time_col: y_pred.get_column(time_col)})
     47         score = mae(y_true=y_test, y_pred=y_pred).get_column("mae").mean()
     48         res = {"score": score}

[/usr/local/lib/python3.10/dist-packages/polars/dataframe/frame.py](https://localhost:8080/#) in with_columns(self, *exprs, **named_exprs)
   7331         """
   7332         return (
-> 7333             self.lazy()
   7334             .with_columns(*exprs, **named_exprs)
   7335             .collect(no_optimization=True)

[/usr/local/lib/python3.10/dist-packages/polars/lazyframe/frame.py](https://localhost:8080/#) in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1529             streaming,
   1530         )
-> 1531         return wrap_df(ldf.collect())
   1532 
   1533     def sink_parquet(

ShapeError: unable to add a column of length 7644 to a dataframe of height 9334

Analyzing multiple time series with dimension reduction

The idea consists in analyzing multiple time series with dimension reduction based on our set of feature calculators. The steps are:

  1. Select N time series
  2. Calculate multiple features on the N time series
    In the resulting dataset, each column is a feature and each row is one time series.
  3. Apply the dimension reduction algorithm
  4. Plot the results using plotly
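The steps above can be sketched in plain Python. This is only an illustration under stated assumptions: PCA (via power iteration) stands in for whichever dimension reduction algorithm ends up being used, the feature matrix is made up, and every function name here is hypothetical rather than part of functime.

```python
import math
from typing import List

Matrix = List[List[float]]


def standardize(rows: Matrix) -> Matrix:
    """Center every feature column and scale it to unit variance."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n_rows for j in range(n_cols)]
    # A constant column has zero spread; divide by 1.0 so it stays at zero.
    stds = [
        math.sqrt(sum((row[j] - means[j]) ** 2 for row in rows) / n_rows) or 1.0
        for j in range(n_cols)
    ]
    return [[(row[j] - means[j]) / stds[j] for j in range(n_cols)] for row in rows]


def covariance(rows: Matrix) -> Matrix:
    """Covariance matrix of rows that are already centered."""
    n_rows, n_cols = len(rows), len(rows[0])
    return [
        [sum(row[a] * row[b] for row in rows) / n_rows for b in range(n_cols)]
        for a in range(n_cols)
    ]


def leading_eigenvector(matrix: Matrix, n_iter: int = 500) -> List[float]:
    """Power iteration; returns a zero vector if the matrix is numerically zero."""
    size = len(matrix)
    # Deterministic start vector, assumed not orthogonal to the answer.
    vec = [float(i + 1) for i in range(size)]
    for _ in range(n_iter):
        nxt = [sum(matrix[i][j] * vec[j] for j in range(size)) for i in range(size)]
        norm = math.sqrt(sum(v * v for v in nxt))
        if norm < 1e-12:
            return [0.0] * size
        vec = [v / norm for v in nxt]
    return vec


def pca_2d(rows: Matrix) -> Matrix:
    """Project each row (one time series) onto the first two principal axes."""
    z = standardize(rows)
    cov = covariance(z)
    size = len(cov)
    first = leading_eigenvector(cov)
    # The Rayleigh quotient gives the eigenvalue that belongs to `first`.
    eigenvalue = sum(
        first[i] * cov[i][j] * first[j] for i in range(size) for j in range(size)
    )
    # Deflation removes the first component so the second can be found.
    deflated = [
        [cov[i][j] - eigenvalue * first[i] * first[j] for j in range(size)]
        for i in range(size)
    ]
    second = leading_eigenvector(deflated)
    return [
        [sum(a * b for a, b in zip(row, first)), sum(a * b for a, b in zip(row, second))]
        for row in z
    ]


# Rows are time series, columns are (made-up) feature values.
features = [[1.0, 5.0, 2.0], [2.0, 3.0, 9.0], [4.0, 1.0, 4.0], [8.0, 2.0, 7.0]]
coords = pca_2d(features)
```

The two columns of `coords` are what would be handed to a plotly scatter plot in step 4, one point per time series.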

Explain why forecasters drop very short time series

Problem

Time series with fewer observations than the number of lags are silently dropped at predict time. For example, during M5 benchmarking, time series shorter than lags=24 are dropped. This is intended behavior, but it is currently undocumented.

Rationale

functime is made for high-performance ML forecasting in production. Data engineers are responsible for upstream and downstream data quality (including the property of "no missing values"), not ML engineers. I made the explicit design decision not to include any data quality pre-checks within fit-predict in functime.

Solution

Document why functime has weaker data quality pre-conditions.

Additional comment

My goal is to eventually create a checks module with functions to support more defensive forecasting pipelines. But the choice to have checks will be an explicit pipeline design decision by the user, not the functime forecasting API.
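As a rough idea of what such an opt-in check could look like, here is a minimal sketch in plain Python. The function name `find_short_series` and its signature are hypothetical and are not part of functime; the threshold simply mirrors the "fewer observations than lags" rule described above.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def find_short_series(entity_ids: Iterable[str], min_length: int) -> List[Tuple[str, int]]:
    """Return (entity, n_observations) for entities with fewer than `min_length` rows."""
    counts = Counter(entity_ids)
    return sorted((entity, n) for entity, n in counts.items() if n < min_length)


# One entity id per observation, as in the entity column of a panel.
ids = ["a"] * 30 + ["b"] * 10 + ["c"] * 24
short = find_short_series(ids, min_length=24)
```

A pipeline that wants to be defensive could call this before fit and decide explicitly whether to drop, pad, or reject the listed entities.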

[FEAT] adding a `num_series` parameter to `plotting.plot_panel`

Currently, plot_panel draws a plot for every ID in the panel. The docs recommend:

Note: if you have over 10 entities / time-series, we recommend using the rank_ functions in functime.evaluation then df.head() before plotting.

  1. I would perhaps provide an example of this
  2. Ideally, though, I would like plot_panel to have a num_series param to select the series to plot. It could accept an integer (e.g. plot up to k series), a list of integers (plot the series 1, 3, 4...), or a list of strings with the names of the IDs.
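To make the three accepted forms concrete, here is a hedged sketch of the selection logic only. `select_entities` is a hypothetical helper, not an existing functime function, and treating integer lists as zero-based positions is an assumption (the proposal does not say whether positions start at 0 or 1).

```python
from typing import List, Sequence, Union

Selector = Union[int, Sequence[int], Sequence[str]]


def select_entities(entity_ids: Sequence[str], num_series: Selector) -> List[str]:
    """Resolve `num_series` into a list of entity ids to plot."""
    # Unique ids in order of first appearance.
    unique_ids = list(dict.fromkeys(entity_ids))
    if isinstance(num_series, (bool, str)):
        raise TypeError("num_series must be an int, a list of ints, or a list of str")
    if isinstance(num_series, int):
        if num_series < 0:
            raise ValueError("num_series must be non-negative")
        return unique_ids[:num_series]  # plot up to k series
    items = list(num_series)
    if all(isinstance(item, str) for item in items):
        missing = [item for item in items if item not in unique_ids]
        if missing:
            raise KeyError(f"unknown entity ids: {missing}")
        return items
    if all(isinstance(item, int) and not isinstance(item, bool) for item in items):
        # Assumption: integers are zero-based positions into the unique ids.
        return [unique_ids[item] for item in items]
    raise TypeError("num_series lists must contain only ints or only str")


ids = ["x", "y", "x", "z", "w"]
first_two = select_entities(ids, 2)
by_position = select_entities(ids, [0, 2])
by_name = select_entities(ids, ["w", "y"])
```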

[FEAT] [evaluation] Add rank_by to evaluation

We mention and use the coefficient of variation more than once, such as here. It would be interesting to have an evaluation.rank_cv function to see which entities in a panel display the greatest variation.

The way I see it, we should have a public method (perhaps even in feature_extraction?) to compute the CV across all entities. This would be used by rank_cv and possibly in plot_entities (see #83) to display additional information about all entities in the panel.
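A minimal sketch of the idea, in plain Python rather than Polars. The names `coefficient_of_variation` and `rank_cv` follow the proposal but the signatures are hypothetical; using the population standard deviation and skipping zero-mean entities are assumptions.

```python
import math
from statistics import mean, pstdev
from typing import Dict, List, Sequence, Tuple


def coefficient_of_variation(values: Sequence[float]) -> float:
    """Standard deviation divided by the absolute mean (NaN when the mean is 0)."""
    mu = mean(values)
    if mu == 0:
        return math.nan
    return pstdev(values) / abs(mu)


def rank_cv(panel: Dict[str, Sequence[float]]) -> List[Tuple[str, float]]:
    """Rank entities from most to least variable; undefined CVs are left out."""
    scores = {entity: coefficient_of_variation(values) for entity, values in panel.items()}
    valid = [(entity, cv) for entity, cv in scores.items() if not math.isnan(cv)]
    return sorted(valid, key=lambda pair: pair[1], reverse=True)


panel = {"flat": [5, 5, 5, 5], "wild": [1, 9, 1, 9], "mild": [4, 6, 4, 6]}
ranking = rank_cv(panel)
```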

functime has no attribute `__version__`

In a fresh env,

python -c "import functime; print(functime.__version__)"

Raises this: AttributeError: module 'functime' has no attribute '__version__'

Should functime define a __version__ in __init__.py?
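One common way to do this is to read the version from the installed package metadata. The sketch below uses only the standard library; `resolve_version` and the fallback string are hypothetical names chosen for illustration, and whether functime adopts this approach is up to the maintainers.

```python
from importlib.metadata import PackageNotFoundError, version


def resolve_version(distribution: str, fallback: str = "0.0.0+unknown") -> str:
    """Return the installed version of `distribution`, or `fallback` if not installed."""
    try:
        return version(distribution)
    except PackageNotFoundError:
        return fallback


# In a package's __init__.py this could read, for example:
# __version__ = resolve_version("functime")
missing = resolve_version("surely-not-an-installed-distribution-xyz")
```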

Rationale for the licensing

First of all, thanks for the great work on such a blazing-fast library for time-series forecasting. I think functime arrives at a great time for this problem space in the Python ecosystem.

I would like to know the rationale behind licensing the library under AGPLv3 (Affero General Public License v3.0).

The reason for my question is that most libraries in the time-series space use a different license:

Apache-2.0 license

MIT license

BSD-3-Clause license

[DOCS] Missing import in forecasting tutorial

There's a wrong import in the feature transform section of the forecasting tutorial:

from functime.forecasting import linear_model
from functime.feature_extraction import add_fourier_terms
-from functime.preprocessing import lag
+from functime.preprocessing import roll

Optimization: benford_correlation

There are three ways to optimize the existing benford_correlation:

  1. The current implementation relies on strings to extract the first digit of a number. We do not actually need strings: the first digit can be computed mathematically.
  2. Since corr is scale invariant, we do not need to compute counts.sum() to normalize the counts.
  3. The "Benford distribution" is static. We should keep it as a global variable computed at module initialization instead of repeatedly computing it at runtime.

A new implementation will be:


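import polars as pl

# The Benford distribution is static, so it is computed once at module level.
benford_dist_series = (1 + 1 / pl.int_range(1, 10, eager=True)).log10()


def benford_correlation2(x: pl.Expr) -> pl.Expr:
    counts = (
        # First digit without strings: |x| / 10^floor(log10(|x|))
        (x.abs() / (pl.lit(10).pow(x.abs().log10().floor())))
        .drop_nans()
        .cast(pl.UInt8)
        # Append 1..9 so that every digit appears at least once
        .append(pl.int_range(1, 10, eager=False))
        .sort()
        .value_counts()
        # Subtract the one extra occurrence added by the append above
        .struct.field("counts") - pl.lit(1)
    )
    # no need to divide because correlation is invariant under scaling
    return pl.corr(counts, pl.lit(benford_dist_series))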

To test, we can try this on the following df:

test_df = pl.DataFrame({
    "a": [float(i) + 0.1 for i in range(-100_000, 100_000)]
})

Only problem:

There seems to be a precision issue that only affects the value 1000. By my testing, for sequences without the 1000 in it, it is accurate. I can't guarantee 1000 is the only problematic value, however.
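For reference, here is the same first-digit arithmetic in plain Python with a guard against rounding at exact powers of ten. This is a sketch, not the Polars implementation: whether the 1000 case above has this cause is an assumption, and `first_digit` and `BENFORD_PROBS` are hypothetical names.

```python
import math

# Benford probabilities for leading digits 1..9; static, so built once.
BENFORD_PROBS = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(value: float) -> int:
    """Leading decimal digit of |value|, computed without string conversion."""
    magnitude = abs(value)
    if magnitude == 0 or not math.isfinite(magnitude):
        raise ValueError("leading digit is undefined for 0, NaN and infinity")
    exponent = math.floor(math.log10(magnitude))
    digit = int(magnitude / 10 ** exponent)
    # Guard: if log10 lands just below an integer (e.g. 2.999... for 1000),
    # the division yields 10; if it lands just above, it yields 0.
    if digit >= 10:
        digit = 1
    elif digit < 1:
        digit = 9
    return digit


digits = [first_digit(v) for v in (7, 999, 1000, -4200, 12345)]
```

A similar clamp in the expression version (mapping 10 to 1 before counting) might be enough to make the 1000 case agree, but that would need to be verified against the actual Polars output.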

3x Less Time (tested on my local system)


[UNIT TEST TRACKER] tsfresh

Tracker for the unit test. The unit tests should cover pl.Series, pl.Expr for eager and lazy (if implemented).

  • absolute_energy
  • absolute_maximum
  • absolute_sum_of_changes
  • approximate_entropy
  • autocorrelation
  • autoregressive_coefficients
  • benford_correlation
  • binned_entropy
  • c3
  • change_quantiles
  • cid_ce
  • count_above
  • count_above_mean
  • count_below
  • count_below_mean
  • energy_ratios
  • first_location_of_maximum
  • first_location_of_minimum
  • has_duplicate
  • has_duplicate_max
  • has_duplicate_min
  • index_mass_quantile
  • large_standard_deviation
  • last_location_of_maximum
  • last_location_of_minimum
  • longest_strike_above_mean
  • longest_strike_below_mean
  • mean_abs_change
  • mean_change
  • mean_n_absolute_max
  • mean_second_derivative_central
  • number_crossings
  • number_cwt_peaks
  • number_peaks
  • percent_reoccuring_values
  • percent_reocurring_points
  • permutation_entropy
  • range_count
  • ratio_beyond_r_sigma
  • ratio_n_unique_to_length
  • root_mean_square
  • sample_entropy
  • spkt_welch_density
  • sum_reocurring_points
  • sum_reocurring_values
  • symmetry_looking
  • time_reversal_asymmetry_statistic
  • variation_coefficient
  • var_gt_std
  • cwt_coefficients
  • fourier_entropy
  • friedrich_coefficients
  • lempel_ziv_complexity
  • linear_trend
  • partial_autocorrelation

Additional Features:

  • range_over_mean (This and range_change are tested but tests are non-exhaustive.)
  • range_change
  • longest_winning_streak (special case of longest_streak_above)
  • longest_losing_streak (special case of longest_streak_below)
  • streak_length_stats
  • longest_streak_above
  • longest_streak_below
  • max_abs_change

freq_to_sp documentation is wrong + some wrong mapping

Some offsets suggested in the documentation simply raise errors (1ns and 1us, for example). Also, the aliases in alias_to_sp are not mapped as in the shared link: the entry for 30m is wrong, and m means minute elsewhere but is treated as monthly in alias_to_sp.
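To illustrate the minute/month ambiguity, here is a small parsing sketch that follows the Polars duration-string units, where m is minute and mo is month. `parse_offset` is a hypothetical helper and is not functime's freq_to_sp; it deliberately stops short of assigning seasonal periods, since those values are the subject of this issue.

```python
import re
from typing import Tuple

# Units from the Polars duration string language; "mo" (month) and "m" (minute)
# are distinct units, so both must be listed explicitly.
_OFFSET_PATTERN = re.compile(r"(\d+)(ns|us|ms|mo|s|m|h|d|w|q|y|i)")


def parse_offset(offset: str) -> Tuple[int, str]:
    """Split an offset alias such as '30m' or '1mo' into (multiple, unit)."""
    match = _OFFSET_PATTERN.fullmatch(offset)
    if match is None:
        raise ValueError(f"unrecognized offset alias: {offset!r}")
    return int(match.group(1)), match.group(2)


minutes = parse_offset("30m")
months = parse_offset("1mo")
micros = parse_offset("1us")
```

Keying a seasonal-period lookup on the parsed unit rather than on the raw string would keep 30m and 1mo from colliding.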

Include a functime.type_alias Module

Typing is important. We all agree on that. Here we want to mimic Polars and have a functime-specific type_alias module for the following purposes:

  1. Custom typing for more complex and nested types. Right now I don't see any crazy types, but we might need this in the future.
  2. Use Literal types for strategies in function calls instead of str. The advantage of using Literal["a", "b", "c", ..] is that a good linter can tell the user which are the available strategies, and can suggest the strategy based on what the user has typed in. On the other hand, a linter cannot infer anything from strings. (This is similar to Rust Enum, although more basic and only helps the linter, not the Python "compiler" or "interpreter".)
  3. Have a centralized place to initialize constants, some parameters, and custom types, instead of having all of this scattered around.
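A minimal sketch of point 2, using only the standard library. The alias name `ForecastStrategy`, the helper `validate_strategy`, and the three strategy values are illustrative assumptions about what such a module might contain.

```python
from typing import Literal, get_args

# Hypothetical alias that would live in a central type_alias module.
ForecastStrategy = Literal["recursive", "direct", "ensemble"]


def validate_strategy(strategy: str) -> ForecastStrategy:
    """Runtime check that reuses the same Literal the linter sees."""
    allowed = get_args(ForecastStrategy)
    if strategy not in allowed:
        raise ValueError(f"strategy must be one of {allowed}, got {strategy!r}")
    return strategy  # type: ignore[return-value]


chosen = validate_strategy("direct")
options = get_args(ForecastStrategy)
```

Because the allowed values are read from the alias with `get_args`, the linter hints and the runtime error message cannot drift apart.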

Compatibility with scikit-learn 1.3.0

Thank you for making what looks like an amazing module, functime team!
The package currently requires scikit-learn 1.2.2. Is it possible to make it compatible with 1.3.0?
I want to try out functime's performance in a repo for which I need scikit-learn 1.3.0.

(Unnecessary extra info: This is for 1.3.0's hdbscan. I know I could just use the hdbscan module, but I expect the hdbscan module may become semi-redundant with hdbscan in scikit-learn).

Feature Bundles

Let's brainstorm some useful feature bundles (features that are likely to be generated together).

Some common ideas are:

  1. Diagnostic features. Features that detect anomalies in time series.
  2. Pattern detection features.
  3. Features specific to certain domains.
  4. Features that are commonly used together in a variety of domains.
  5. etc...
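One possible shape for a bundle is a named mapping from feature names to functions. The sketch below is plain Python with made-up bundle names and groupings; none of it reflects an agreed functime API.

```python
from statistics import mean
from typing import Callable, Dict, Sequence

Feature = Callable[[Sequence[float]], float]


def absolute_maximum(x: Sequence[float]) -> float:
    return max(abs(v) for v in x)


def mean_abs_change(x: Sequence[float]) -> float:
    return mean(abs(b - a) for a, b in zip(x, x[1:]))


def absolute_sum_of_changes(x: Sequence[float]) -> float:
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


# Hypothetical bundles: features that are likely to be generated together.
BUNDLES: Dict[str, Dict[str, Feature]] = {
    "diagnostic": {"absolute_maximum": absolute_maximum},
    "pattern": {
        "mean_abs_change": mean_abs_change,
        "absolute_sum_of_changes": absolute_sum_of_changes,
    },
}


def extract_bundle(x: Sequence[float], bundle_name: str) -> Dict[str, float]:
    """Compute every feature in the named bundle for one series."""
    return {name: feature(x) for name, feature in BUNDLES[bundle_name].items()}


result = extract_bundle([1.0, 3.0, 2.0, 6.0], "pattern")
```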

@MathieuCayssol @topher-lo @vienneraphael

[PROJECT TRACKER] `tsfresh` time-series features

This issue lists the time-series features in tsfresh, along with API design challenges and considerations.

IMPORTANT NOTE: To ensure fair and useful benchmarks between our polars / Rust FFI code vs the original numpy code, our rewrite will go through three stages:

  • Rewrite using polars expressions only. Take note of features that require external dependencies. Modify original tsfresh API to improve user-experience.
  • Setup Rust FFI
  • Full benchmarking

Proposed API Design

Each individual time-series feature should be implemented as a function, which takes a numeric type pl.Expr and returns another pl.Expr of any dtype (depends on the feature). Any parameter should be explicitly specified as function inputs. For example:

import polars as pl

def some_feature(x: pl.Expr, lag: int) -> pl.Expr:
  ...
  return some_result

We will then register the functions into a custom polars namespace (see guide on "Extending the API" here). So usage of tsfresh in functime will look like this:

X_features = X.select([
  pl.col("value").ts.binned_entropy(n_bins=42),
  pl.col("value").ts.acf(lags=12),
  pl.col("value").ts.abs_energy()
])

This API design makes it easy to extend to panel data and rolling windows! For example:

# Panel dataset
X_features = X.groupby("series_id").agg([
    pl.col("value").ts.binned_entropy(n_bins=42),
    pl.col("value").ts.acf(lags=12),
    pl.col("value").ts.abs_energy()
])

# With window + stride (time-series)
X_features = X.groupby_dynamic(every="10i").agg([
    pl.col("value").ts.binned_entropy(n_bins=42),
    pl.col("value").ts.acf(lags=12),
    pl.col("value").ts.abs_energy()
])

# With window + stride (panel)
X_features = X.groupby_dynamic(every="10i", by="series_id").agg([
    pl.col("value").ts.binned_entropy(n_bins=42),
    pl.col("value").ts.acf(lags=12),
    pl.col("value").ts.abs_energy()
])

Docstrings and Style

We use numpydoc and black formatting.

We also use pre-commit. After you clone the repo, remember to run the following:

pip install pre-commit
pre-commit install

Features Checklist

If you want to work on a feature, just create a comment in this issue thread. I'll then add your @ to the checklist below.

IMPORTANT NOTE: We are going to ignore any FFT (fast fourier transform) features for now!

Testing

White-box testing strategy. We use pytest. Each developer is responsible for their own unit test per feature. Should be relatively straightforward: can reuse most code from tsfresh unit tests

Development Workflow

Everybody should create ONE pull request with all of their implemented features and tests. The title of the PR should be feat: tsfresh features @your-github-handle

Challenges Checklist

Here is a list of challenges I discovered during a first attempt at the rewrite. Please add to this list in the comments below!

  1. Some features in tsfresh do not list their parameters in the function signature (e.g. cwt_coefficients). We should explicitly list them in our implementation. You can find the default parameters from the original implementation here.
  2. Some features in tsfresh return a pandas Series (e.g. ar_coefficients). For our implementation, we should return a pl.List instead.
  3. We should ignore agg_ features. Let the user aggregate using polars expressions e.g. x.list.to_explode().std() instead.

Import functime.forecasting (zarr) leads to error

With a MacBook M1 and zarr-2.16.1, I get the following error:

ImportError: dlopen(/opt/homebrew/lib/python3.10/site-packages/numcodecs/_shuffle.cpython-310-darwin.so, 0x0002): tried: '/opt/homebrew/lib/python3.10/site-packages/numcodecs/_shuffle.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/lib/python3.10/site-packages/numcodecs/_shuffle.cpython-310-darwin.so' (no such file), '/opt/homebrew/lib/python3.10/site-packages/numcodecs/_shuffle.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))'

Explore Parallel = True in value_counts

The docstring mentions that, in a group_by context, setting parallel=True in value_counts will likely not improve performance and might even make it worse. I have a few questions, which I have neither the time nor the domain knowledge to test:

  1. In time series, how often will these features be used in a group_by context? It seems to be very often, but I am not sure.
  2. Can someone run a benchmark and see the relative performance differences for value_counts in both select and group_by contexts with parallel=True/False? I think this is interesting knowledge and would help us later.
  3. It would be especially interesting to know when parallel=True outperforms parallel=False. How big does the data have to be? Let's assume we are running value_counts on integer columns for now. The result might differ for string columns.

@topher-lo @MathieuCayssol @

Feature Extraction Tsfresh Rewrite Quality Assurance

First, thank you everybody for contributing to the rewrite.

We are planning to make this project more public, which means we need to make sure that the quality is good. For this round of review, we want to focus on the following 3 items (ranked in terms of importance):

  • [99%] Correctness. Correctness with respect to the implementation and the feature definition; the final numerical result should also make sense. There are a few features that I haven't reviewed, or that I don't know whether anybody has reviewed: cwt_coefficients, autoregressive_coefficients, augmented_dickey_fuller, and the newly added fft_coefficients. Please comment if you think there are others we need to review.

  • [99%] Tests. Building out more tests for all the features. Thank you @MathieuCayssol for building out more tests.

  • Delayed FFT features. Now that fft_coefficients is implemented, we can start using it?

  • Ongoing, Performance. For some methods, there might be short cuts in eager mode, or vice versa. For methods (eager) that use NumPy under the hood, can we do better? Is there redundant computation? More efficient methods?

| Feature Name | Implemented Lazy (Expr) | Implemented Eager (Series) | Need More Review |
|---|---|---|---|
| absolute_energy | Y | Y | |
| absolute_maximum | Y | Y | |
| absolute_sum_of_changes | Y | Y | |
| approximate_entropy | N | Y | |
| augmented_dickey_fuller | N | Y | Y |
| autocorrelation | N | Y | |
| autoregressive_coefficients | N | Y | Y |
| benford_correlation | Y | Y | |
| binned_entropy | Y | Y | |
| c3 | Y | Y | |
| change_quantiles | Y | Y | |
| cid_ce | Y | Y | |
| count_above | Y | Y | |
| count_above_mean | Y | Y | |
| count_below | Y | Y | |
| count_below_mean | Y | Y | |
| cwt_coefficients | N | Y | Y |
| energy_ratios | Y | Y | |
| first_location_of_maximum | Y | Y | |
| first_location_of_minimum | Y | Y | |
| fourier_entropy | N | Y | Y |
| friedrich_coefficients | N | Y | |
| has_duplicate | Y | Y | |
| has_duplicate_max | Y | Y | |
| has_duplicate_min | Y | Y | |
| index_mass_quantile | Y | Y | |
| large_standard_deviation | Y | Y | |
| last_location_of_maximum | Y | Y | |
| last_location_of_minimum | Y | Y | |
| lempel_ziv_complexity | N | Y | |
| linear_trend | Y | Y | |
| longest_strike_above_mean | Y | Y | |
| longest_strike_below_mean | Y | Y | |
| mean_abs_change | Y | Y | |
| mean_change | Y | Y | |
| mean_n_absolute_max | Y | Y | |
| mean_second_derivative_central | Y | Y | |
| number_crossings | Y | Y | |
| number_cwt_peaks | N | Y | |
| number_peaks | Y | Y | |
| percent_reocurring_points | Y | Y | |
| percent_reoccuring_values | Y | Y | |
| permutation_entropy | Y | Y | |
| range_count | Y | Y | |
| ratio_beyond_r_sigma | Y | Y | |
| ratio_n_unique_to_length | Y | Y | |
| root_mean_square | Y | Y | |
| sample_entropy | N | Y | |
| spkt_welch_density | N | Y | |
| sum_reocurring_points | Y | Y | |
| sum_reocurring_values | Y | Y | |
| symmetry_looking | Y | Y | |
| time_reversal_asymmetry_statistic | Y | Y | |
| variation_coefficient | Y | Y | |
| harmonic_mean | Y | Y | |
| fft_coefficients | N | Y | Y |

Optimization: count_above_mean

def count_above_mean(x: pl.Expr) -> pl.Expr:
    """Count the number of values that are above the mean.

    Parameters
    ----------
    x : pl.Expr | pl.Series
        Input time-series.

    Returns
    -------
    int
    """
    return x.filter(x > x.mean()).count()

def count_above_mean_rewrite(x:pl.Expr) -> pl.Expr:

    return (x > x.mean()).sum()

The current implementation is good, but is somewhat wordy and not precise.

Performance-wise, the two implementations are really close when the input column has fewer than 200k rows, by my testing. The rewrite is about 10-20% faster, depending on length, at these relatively small data sizes. However, on really long series, say 1 million rows, the rewrite is significantly faster. I think this is because filter is essentially a redundant operation that seems to account for roughly half of the runtime on larger datasets.
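The two formulations compute the same number. The plain-Python analogue below only illustrates that equivalence; it says nothing about Polars performance, and both function names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def count_above_mean_filter(values: Sequence[float]) -> int:
    """Materialize the values above the mean, then count them."""
    mu = mean(values)
    return len([v for v in values if v > mu])


def count_above_mean_mask(values: Sequence[float]) -> int:
    """Sum the boolean mask directly; no intermediate list is built."""
    mu = mean(values)
    return sum(v > mu for v in values)


data = [1, 2, 3, 10]
via_filter = count_above_mean_filter(data)
via_mask = count_above_mean_mask(data)
```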

Here are some simple tests (two screenshots were attached to the original issue):

module 'functime' has no attribute 'embeddings'

I am trying to run the Embeddings example from the docs and I am getting the following error:

AttributeError: module 'functime' has no attribute 'embeddings' .

Here is the snippet, which raises the error.

import functime
import polars as pl
import numpy as np
from scipy.stats import iqr

# Load memory usage data
y = pl.read_parquet(
    "https://github.com/descendant-ai/functime/raw/main/data/behacom.parquet",
    columns=["user", "timestamp", "system_average_mem"]
)

# Create embeddings
embeddings = functime.embeddings.embed(y)

LightGBM: Unexpected behavior when introducing exogenous variables

I'm following the quickstart.py demo from the documentation page. I've noticed that the exogenous variables X are calculated (month) but are never passed to the model later on.

When I try to introduce the exogenous variables, the scores worsen significantly and the predicted values seem to be the same for all the entities. Is this the expected behavior?

y = pl.read_parquet(
    "https://github.com/descendant-ai/functime/raw/main/data/commodities.parquet"
)
entity_col, time_col = y.columns[:2]
X = (
    y.select([entity_col, time_col])
    .pipe(add_calendar_effects(["month"]))
)

test_size = 3
freq = "1mo"
y_train, y_test = train_test_split(test_size)(y)
X_train, X_test = train_test_split(test_size)(X)

forecaster = lightgbm(freq=freq, lags=12)
forecaster.fit(y=y_train, X=X_train)
y_pred = forecaster.predict(fh=test_size, X=X_test)

scores = mase(y_true=y_test, y_pred=y_pred, y_train=y_train)

print("💯 Scores (with X):\n", scores.sort(entity_col))
print("✅ Predictions (with X):\n", y_pred.sort(entity_col))

I'm getting the same results on a LightGBM model I'm porting from darts (that performed pretty well with similar lags and exogenous variables), so I'm not sure this is the correct way of introducing the exogenous variables.

Bug: c3 Implementation is wrong

The implementation is wrong because n is an expression, not an int, so k is also an expression rather than an integer and cannot be used in range(k).

  n = x.len()
  k = n - 2 * n_lags

A rewrite is simply:

def c3_rewrite(x: pl.Expr, lag: int) -> pl.Expr:
    twice_lag = 2 * lag
    return (x * x.shift(lag) * x.shift(twice_lag)).sum() / (x.count() - pl.lit(twice_lag))
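A plain-Python reference is handy for unit-testing the rewrite. The sketch below follows the tsfresh definition of c3 (the mean of x[i] * x[i + lag] * x[i + 2 * lag] over the n - 2 * lag valid positions); `c3_reference` is a hypothetical helper name.

```python
from typing import Sequence


def c3_reference(x: Sequence[float], lag: int) -> float:
    """Mean of x[i] * x[i + lag] * x[i + 2 * lag] over all valid i."""
    k = len(x) - 2 * lag
    if k <= 0:
        raise ValueError("series is too short for the requested lag")
    total = sum(x[i] * x[i + lag] * x[i + 2 * lag] for i in range(k))
    return total / k


lag_one = c3_reference([1, 2, 3, 4, 5], lag=1)
lag_two = c3_reference([1, 2, 3, 4, 5], lag=2)
```

Here k is a plain int, which is exactly what the expression-based version cannot provide to range.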

Sdist

Is it possible to publish an sdist archive on PyPI or make a release here on GitHub (either way is fine)? This looks like a great package that we could make available on conda-forge, but they need an sdist.

[FEAT] [PLOTTING] add `plotting.plot_entities` to display info about entities

I was wondering if we want a plot_entities histogram/barchart to display summary information about the entities, e.g. the number of observations for each entity.

This could be used to drop entities with too few obs, or (with future features) draw the number of missing values/zeroes in each series.

Here is a draft implementation:

import polars as pl
import plotly.express as px

url = "https://github.com/neocortexdb/functime/raw/main/data/commodities.parquet"
y = pl.scan_parquet(url).with_columns(pl.col("time").cast(pl.Date))

entity_col, time_col, target_col = y.columns


def plot_entities(y, **kwargs):
    # add logic to handle dataframe or lazyframes
    counts = (
        y
        .group_by(entity_col)
        .agg(pl.count())
        .collect()
    )

    height = len(counts) * 15 # sensible-ish default
    
    return (
        px.bar(
            data_frame = counts,
            x="count",
            y=entity_col,
            orientation="h",
            )
        .update_layout(height=height, **kwargs) # add logic to avoid `height` being passed twice
    )

plot_entities(y)
