functime-org / functime

Time-series machine learning at scale. Built with Polars for embarrassingly parallel feature extraction and forecasts on panel data.

Home Page: https://docs.functime.ai

License: Apache License 2.0

Makefile 0.14% Python 98.23% Rust 1.63%

forecasting machine-learning python time-series polars feature-engineering panel-data

functime's Introduction

Time-series machine learning at scale

functime is a powerful Python library for production-ready global forecasting and time-series feature extraction on large panel datasets.

functime also comes with time-series preprocessing (Box-Cox, differencing, etc.), cross-validation splitters (expanding and sliding window), and forecast metrics (MASE, SMAPE, etc.), all optimized as lazy Polars transforms.

Join us on Discord!

Highlights

Fast: Forecast and extract features (e.g. tsfresh, Catch22) across 100,000 time series in seconds on your laptop
Efficient: Embarrassingly parallel feature engineering for time-series using Polars
Battle-tested: Machine learning algorithms that deliver real business impact and win competitions
Exogenous features: supported by every forecaster
Backtesting with expanding window and sliding window splitters
Automated lags and hyperparameter tuning using FLAML

Additional Highlights

functime comes with a specialized LLM agent to analyze, describe, and compare your forecasts. Check out the walkthrough here.

Getting Started

Install functime via the pip package manager.

pip install functime

functime comes with extra options. For example, to install functime with large language model (LLM) and lightgbm features:

pip install "functime[llm,lgb]"

cat: To use catboost forecaster
xgb: To use xgboost forecaster
lgb: To use lightgbm forecaster
llm: To use the LLM-powered forecast analyst

Forecasting

import polars as pl
from functime.cross_validation import train_test_split
from functime.seasonality import add_fourier_terms
from functime.forecasting import linear_model
from functime.preprocessing import scale
from functime.metrics import mase

# Load commodities price data
y = pl.read_parquet("https://github.com/functime-org/functime/raw/main/data/commodities.parquet")
entity_col, time_col = y.columns[:2]

# Time series split
y_train, y_test = y.pipe(train_test_split(test_size=3))

# Fit-predict
forecaster = linear_model(freq="1mo", lags=24)
forecaster.fit(y=y_train)
y_pred = forecaster.predict(fh=3)

# functime ❤️ functional design
# fit-predict in a single line
y_pred = linear_model(freq="1mo", lags=24)(y=y_train, fh=3)

# Score forecasts in parallel
scores = mase(y_true=y_test, y_pred=y_pred, y_train=y_train)

# Forecast with target transforms and feature transforms
forecaster = linear_model(
    freq="1mo",
    lags=24,
    target_transform=scale(),
    feature_transform=add_fourier_terms(sp=12, K=6)
)

# Forecast with exogenous regressors!
# Just pass them into X
X = (
    y.select([entity_col, time_col])
    .pipe(add_fourier_terms(sp=12, K=6)).collect()
)
X_train, X_future = X.pipe(train_test_split(test_size=3))
forecaster = linear_model(freq="1mo", lags=24)
forecaster.fit(y=y_train, X=X_train)
y_pred = forecaster.predict(fh=3, X=X_future)

View the full walkthrough on forecasting here.
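For readers unfamiliar with the MASE score used above, the following is a minimal pure-Python sketch of the metric for a single series. It is not functime's implementation: the function name and signature are illustrative, and it assumes a one-step naive benchmark (seasonal period of 1).

```python
from typing import Sequence


def mase(y_true: Sequence[float], y_pred: Sequence[float], y_train: Sequence[float]) -> float:
    """Mean absolute scaled error with a one-step naive benchmark (illustrative only)."""
    # Mean absolute error of the forecast over the test window
    forecast_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    # In-sample mean absolute error of the naive "repeat the last value" forecast
    naive_mae = sum(abs(b - a) for a, b in zip(y_train, y_train[1:])) / (len(y_train) - 1)
    return forecast_mae / naive_mae


score = mase(y_true=[8.0, 10.0], y_pred=[9.0, 8.0], y_train=[1.0, 2.0, 4.0, 7.0])
print(score)  # 0.75
```

A score below 1 means the forecast beats the naive benchmark on average; here the forecast error (1.5) is divided by the naive in-sample error (2.0).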

Feature Extraction

functime comes with over 100 time-series feature extractors. Every feature is easily accessible via functime's custom ts (time-series) namespace, which works with any Polars Series or expression. To register the custom ts Polars namespace, you must first import functime in your module.

import polars as pl
import numpy as np
from functime.feature_extractors import FeatureExtractor, binned_entropy

# Load commodities price data
y = pl.read_parquet("https://github.com/functime-org/functime/raw/main/data/commodities.parquet")

# Get column names ("commodity_type", "time", "price")
entity_col, time_col, value_col = y.columns

# Extract a single feature from a single time-series
# (assigned to a new name so the imported function is not shadowed)
entropy = binned_entropy(
    pl.Series(np.random.normal(0, 1, size=10)),
    bin_count=10
)

# 🔥 Also works on LazyFrames with query optimization
features = (
    pl.LazyFrame({
        "index": [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
        "value": np.random.normal(0, 1, size=10)
    })
    .select(
        binned_entropy=pl.col("value").ts.binned_entropy(bin_count=10),
        lempel_ziv_complexity=pl.col("value").ts.lempel_ziv_complexity(threshold=3),
        longest_streak_above_mean=pl.col("value").ts.longest_streak_above_mean(),
    )
    .collect()
)

# 🚄 Extract features blazingly fast on many
# stacked time-series using `group_by`
features = (
    y.group_by(entity_col)
    .agg(
        binned_entropy=pl.col(value_col).ts.binned_entropy(bin_count=10),
        lempel_ziv_complexity=pl.col(value_col).ts.lempel_ziv_complexity(threshold=3),
        longest_streak_above_mean=pl.col(value_col).ts.longest_streak_above_mean(),
    )
)

# 🚄 Extract features blazingly fast on windows
# of many time-series using `group_by_dynamic`
features = (
    # Compute rolling features at yearly intervals
    y.group_by_dynamic(
        time_col,
        every="12mo",
        by=entity_col,
    )
    .agg(
        binned_entropy=pl.col(value_col).ts.binned_entropy(bin_count=10),
        lempel_ziv_complexity=pl.col(value_col).ts.lempel_ziv_complexity(threshold=3),
        longest_streak_above_mean=pl.col(value_col).ts.longest_streak_above_mean(),
    )
)

Related Projects

If you are interested in general data-science related plugins for Polars, you must check out polars-ds. polars-ds is a project created by one of functime's core maintainers and is the easiest way to extend your Polars pipelines with commonly used data-science operations made blazing fast with Rust!

License

functime is distributed under Apache-2.0.

functime's People

Contributors

Stargazers

Watchers

functime's Issues

[UNIT TEST TRACKER] tsfresh

Tracker for the unit test. The unit tests should cover pl.Series, pl.Expr for eager and lazy (if implemented).

Additional Features:

- range_over_mean (This and range_change are tested but tests are non-exhaustive.)
- range_change
- longest_winning_streak (special case of longest_streak_above)
- longest_losing_streak (special case of longest_streak_below)
- streak_length_stats
- longest_streak_above
- longest_streak_below
- max_abs_change

`autocorrelation` doesn't return the same result for pl.LazyFrame and pl.DataFrame

The eager-mode (pl.DataFrame) result is the correct one.

Optimization: benford_correlation

There are three ways to optimize the existing benford_correlation:

The current implementation relies on strings to extract the first digit of a number, but strings are not actually needed: the first digit can be computed mathematically.
Since corr is scale invariant, we do not need to compute counts.sum().
The "Benford distribution" is static. We should keep it as a global variable computed at module initialization instead of repeatedly computing it at runtime.

A new implementation will be:


benford_dist_series = (1 + 1 / pl.int_range(1, 10, eager=True)).log10()
def benford_correlation2(x: pl.Expr) -> pl.Expr:
    
    counts = (
        (x.abs()/(pl.lit(10).pow(x.abs().log10().floor())))
        .drop_nans()
        .cast(pl.UInt8)
        .append(pl.int_range(1, 10, eager=False))
        .sort()
        .value_counts()
        .struct.field("counts") - pl.lit(1)
    )
    # no need to divide because correlation is invariant under scaling
    return pl.corr(counts, pl.lit(benford_dist_series))

To test, we can try this on the following df:

test_df = pl.DataFrame({
    "a": [float(i) + 0.1 for i in range(-100_000, 100_000)]
})

Only problem:

There seems to be a precision issue that only affects the value 1000. In my testing, the result is accurate for sequences that do not contain 1000. I can't guarantee that 1000 is the only problematic value, however.

The new implementation takes about 3x less time (tested on my local system).
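The string-free digit extraction can be illustrated outside Polars with a short pure-Python sketch. The helper name first_digit is hypothetical, and the rounding guard is only an assumption about how a precision problem near exact powers of ten (such as 1000) might be handled; the issue does not confirm the root cause.

```python
import math

# Benford probabilities for leading digits 1..9, computed once at import time
BENFORD_DIST = [math.log10(1 + 1 / d) for d in range(1, 10)]


def first_digit(x: float) -> int:
    """Leading decimal digit of a nonzero number, without string conversion."""
    magnitude = abs(x)
    exponent = math.floor(math.log10(magnitude))
    digit = int(magnitude / 10 ** exponent)
    # Guard against floating-point rounding of log10 near exact powers of ten
    if digit >= 10:
        digit //= 10
    elif digit < 1:
        digit = int(magnitude * 10 / 10 ** exponent)
    return digit


print([first_digit(v) for v in (123.4, 0.0456, -987.0, 1000.0)])  # [1, 4, 9, 1]
```

The guard keeps the result in the range 1 to 9 even if the logarithm lands one unit off, at the cost of one extra comparison per value.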

[FEAT] Frequency based cross-validation and gap parameter

In the Discord server, a while ago we mentioned the possibility of generating CV splits based on time intervals (e.g. the first day of the month or week). This might be especially useful in financial settings (e.g. retraining the model on Mondays).

Plus, it might be useful to provide a gap parameter to allow for gaps between the train and test set (supported, for example, by skforecast).
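To make the gap idea concrete, here is a minimal index-based sketch of an expanding-window splitter with a gap. The function name and parameters are hypothetical and are not part of functime's API; it only shows how the train window would stop `gap` observations before each test window.

```python
from typing import Iterator, List, Tuple


def expanding_window_with_gap(
    n_samples: int, test_size: int, n_splits: int, gap: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_indices, test_indices) with `gap` skipped rows in between."""
    for i in range(n_splits, 0, -1):
        test_start = n_samples - i * test_size
        train_end = test_start - gap  # exclusive end of the training window
        if train_end <= 0:
            raise ValueError("Not enough observations for the requested splits.")
        yield list(range(train_end)), list(range(test_start, test_start + test_size))


splits = list(expanding_window_with_gap(n_samples=12, test_size=2, n_splits=3, gap=1))
for train_idx, test_idx in splits:
    print(train_idx[-1], test_idx)
# 4 [6, 7]
# 6 [8, 9]
# 8 [10, 11]
```

A frequency-based variant would choose `test_start` from calendar boundaries (for example, each Monday) rather than from fixed row counts.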

Include a functime.type_alias Module

Typing is important. We all agree on that. Here we want to mimic Polars and have a functime-specific type_alias module for the following purposes:

Custom typing for more complex and nested types. Right now I don't see any crazy types, but we might need this in the future.
Use Literal types for strategies in function calls instead of str. The advantage of using Literal["a", "b", "c", ..] is that a good linter can tell the user which strategies are available, and can suggest a strategy based on what the user has typed in. On the other hand, a linter cannot infer anything from a plain str. (This is similar to a Rust enum, although more basic: it only helps the linter, not the Python "compiler" or "interpreter".)
Have a centralized place to initialize constants, some parameters, and custom types, instead of having all of these scattered around.
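As a rough sketch of what such a module could offer, a Literal alias can serve both the linter and a runtime check. The alias contents and the validate_strategy helper below are illustrative assumptions, not existing functime code.

```python
from typing import Literal, get_args

# Hypothetical alias; the strategy names are placeholders for illustration
AggStrategy = Literal["mean", "sum", "median", "std", "max"]


def validate_strategy(agg: str) -> str:
    """Return `agg` unchanged if it is one of the values listed in AggStrategy."""
    valid = get_args(AggStrategy)
    if agg not in valid:
        raise ValueError(f"`{agg}` is not a valid strategy; expected one of {valid}.")
    return agg


print(validate_strategy("median"))  # median
```

Because get_args reads the allowed values from the alias itself, the type hint and the runtime validation cannot drift apart.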

Replace manual `.select` with `pl.selectors`

Reason: Polars selectors are (as of polars version 0.18.1) the idiomatic way to select multiple columns.

https://pola-rs.github.io/polars/py-polars/html/reference/selectors.html

[BUILD] Bump minimum required Python version to 3.9?

Dask, NumPy and SciPy only support Python >=3.9. Does this mean we have to update the minimum Python version? We currently use >=3.8.

LightGBM: Unexpected behavior when introducing exogenous variables

I'm following the quickstart.py demo from the documentation page, and I've noticed that the exogenous variables X are calculated (month) but are not passed into the model later on.

When I try to introduce the exogenous variables, the scores worsen significantly and the predicted values seem to be the same for all the entities. Is this the expected behavior?

y = pl.read_parquet(
    "https://github.com/descendant-ai/functime/raw/main/data/commodities.parquet"
)
entity_col, time_col = y.columns[:2]
X = (
    y.select([entity_col, time_col])
    .pipe(add_calendar_effects(["month"]))
)

test_size = 3
freq = "1mo"
y_train, y_test = train_test_split(test_size)(y)
X_train, X_test = train_test_split(test_size)(X)

forecaster = lightgbm(freq=freq, lags=12)
forecaster.fit(y=y_train, X=X_train)
y_pred = forecaster.predict(fh=test_size, X=X_test)

scores = mase(y_true=y_test, y_pred=y_pred, y_train=y_train)

print("💯 Scores (with X):\n", scores.sort(entity_col))
print("✅ Predictions (with X):\n", y_pred.sort(entity_col))

I'm getting the same results on a LightGBM model I'm porting from darts (that performed pretty well with similar lags and exogenous variables), so I'm not sure this is the correct way of introducing the exogenous variables.

Analyzing multiple time series with dimension reduction

The idea is to analyze multiple time series with dimension reduction, based on our set of feature calculators. The steps are:

Select N time series
Calculate multiple features on the N time series
Arrange the results as a dataset in which each column is a feature and each row is the corresponding time series
Apply the dimension reduction algorithm
Plot the results using plotly

`plotting.plot_panel` fails with LazyFrames

This code results in the following error: AttributeError: 'LazyFrame' object has no attribute 'get_column'

import polars as pl
from functime import plotting

y = pl.scan_parquet("https://github.com/descendant-ai/functime/raw/main/data/commodities.parquet")

plotting.plot_panel(y)

This is because get_column is available for pl.DataFrame only. I think plot_panel could benefit from lazy operations, since only number of entities * last_n points are plotted. The following line would just require adding a .collect() at the end if y is a LazyFrame.

https://github.com/neocortexdb/functime/blob/e02509c47882f556640b7c88299988cfbdde8836/functime/plotting.py#L54

energy_ratios is incorrect when `x.len() % n_chunks == 0`

We get this:

Instead of:

module 'functime' has no attribute 'embeddings'

I am trying to run the Embeddings example as described in the docs and am receiving the following error:

AttributeError: module 'functime' has no attribute 'embeddings'

Here is the snippet, which raises the error.

import functime
import polars as pl
import numpy as np
from scipy.stats import iqr

# Load memory usage data
y = pl.read_parquet(
    "https://github.com/descendant-ai/functime/raw/main/data/behacom.parquet",
    columns=["user", "timestamp", "system_average_mem"]
)

# Create embeddings
embeddings = functime.embeddings.embed(y)

functime has no attribute `version`

In a fresh env,

python -c "import functime; print(functime.__version__)"

Raises this: AttributeError: module 'functime' has no attribute '__version__'

Should functime define a __version__ in __init__.py?

Sdist

Is it possible to publish an sdist archive on PyPI or make a release here on GitHub (either way is fine)? This looks like a great package that we could make available on conda-forge, but they do need an sdist.

Add `plot_interval_forecasts` function

Rationale

We currently have a plot_forecasts function in functime.plotting. However, we also support probabilistic forecasts (see forecaster.conformalize). We need to do a better job promoting this functionality. The best way is through a chart!

Prior Art

Let's just use this code example from https://plotly.com/python/continuous-error-bars/

Challenges

Plotting multiple forecasts at once. To do this, just reuse the multi-plot code from the other plotting functions.

cross_validation functions require dataframes sorted by (entity, time) columns

If you pass a dataframe that has not first been sorted by the (entity, time) columns, you won't get the results you're expecting. It could be useful to document this behavior.
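The effect is easy to reproduce with a toy positional splitter in plain Python. The tail_split helper below is a hypothetical stand-in, not functime's train_test_split; it assumes the split simply takes the last test_size rows of each entity in whatever order they arrive.

```python
from collections import defaultdict
from typing import List, Tuple

Row = Tuple[str, int, float]  # (entity, time, value)


def tail_split(rows: List[Row], test_size: int) -> Tuple[List[Row], List[Row]]:
    """Send the last `test_size` rows of each entity (by position) to the test set."""
    by_entity = defaultdict(list)
    for row in rows:
        by_entity[row[0]].append(row)
    train, test = [], []
    for entity_rows in by_entity.values():
        train.extend(entity_rows[:-test_size])
        test.extend(entity_rows[-test_size:])
    return train, test


rows = [("a", 3, 30.0), ("a", 1, 10.0), ("a", 2, 20.0),
        ("b", 2, 2.0), ("b", 1, 1.0), ("b", 3, 3.0)]

_, test_unsorted = tail_split(rows, test_size=1)
_, test_sorted = tail_split(sorted(rows, key=lambda r: (r[0], r[1])), test_size=1)

print([r[1] for r in test_unsorted])  # [2, 3]
print([r[1] for r in test_sorted])    # [3, 3]
```

Without sorting, entity "a" contributes time step 2 to the test set even though step 3 is its most recent observation.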

Bug: Energy Ratios

Energy_ratio has the same problem as some others. x.len() is an expression which cannot be used in range(), which expects a concrete int.

The rewrite provides a lazy way to segment and do aggregation on the segments. However, the drawback is that this approach is currently 10x slower than tsfresh. (Individually comparing, this is 10x slower. But if we are doing things in batch, a lot of CSE might come in and save us a bit. But still, we need to improve the speed for this one.)

If we know df is in memory, we might have better control over that. But I don't think looping in Python is a good solution (like in the original code, if we assume it worked).

Need more suggestions on potential speed ups.

def energy_ratio_rewrite(
    x: pl.Expr,
    num_segments: int
) -> pl.Expr:
    '''
    Divide x into num_segments by row order. For each segment, compute the energy on the segment. The
    energy ratio of all segments will be returned as a list in order.

    Parameters
    ----------
    x : pl.Expr
        Input expression
    num_segments: int
        The number of segments
    '''
    segments = pl.int_range(pl.lit(0), pl.count()).floordiv(pl.count().floordiv(num_segments)).alias("segment")
    sum_over_segment = pl.struct(
        segments
        , x.pow(2).sum().over(segments).alias("segment_energy")
    ).unique().sort() # This is slow
    total_energy = sum_over_segment.struct.field("segment_energy").sum()

    return (
        (sum_over_segment.struct.field("segment_energy") / total_energy).implode().alias("segment_energy_ratio")
    )
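As a reference point for checking results, here is a pure-Python sketch of energy ratios by segment. It assumes the tsfresh-style convention in which, as with NumPy's array_split, the earlier segments absorb any remainder; the function name is illustrative. Note that the integer-division labelling in the rewrite above appears to produce an extra trailing segment when the length is not a multiple of num_segments, so the two may differ in that case.

```python
from typing import List, Sequence


def energy_ratios(x: Sequence[float], num_segments: int) -> List[float]:
    """Share of the total energy (sum of squares) held by each consecutive segment."""
    base, remainder = divmod(len(x), num_segments)
    total_energy = sum(v * v for v in x)
    ratios = []
    start = 0
    for i in range(num_segments):
        # The first `remainder` segments get one extra element
        size = base + (1 if i < remainder else 0)
        segment = x[start:start + size]
        ratios.append(sum(v * v for v in segment) / total_energy)
        start += size
    return ratios


ratios = energy_ratios([1.0, 2.0, 3.0, 4.0, 5.0, 6.0], num_segments=3)
```

For this input the segment energies are 5, 25 and 61 out of a total of 91, so the ratios sum to 1.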

Bug: change_quantile is lacking an aggregation

https://tsfresh.readthedocs.io/en/latest/api/tsfresh.feature_extraction.html#tsfresh.feature_extraction.feature_calculators.quantile

The change_quantile function should not return a series. Instead, after all the filtering, we should aggregate over the remaining values in the diff and return the aggregated value. I am proposing to add a type_alias module to add an AggStrategy type. See the issue here: #46

To add the aggregation, we simply need to do:

# I am proposing a new type_alias module which should contain AggStrategy
import polars as pl
from typing_extensions import Literal, TypeAlias

AggStrategy: TypeAlias = Literal["mean", "sum", "median", "std", "max"]

def change_quantile_rewrite(
    x: pl.Expr, 
    q_low: float, 
    q_high: float, 
    is_abs: bool=True, 
    agg: AggStrategy = "mean",
    # Polars' default is nearest, but NumPy's default is linear
    # interpolation: InterpolationMethod = "linear"
) -> pl.Expr:
    
    if q_high <= q_low:
        return pl.lit(0.)

    # Use linear to conform to NumPy
    y = x.is_between(x.quantile(q_low, interpolation="linear"), x.quantile(q_high, interpolation="linear"))
    expr = x.filter(
        pl.all_horizontal(
            y
            , y.shift_and_fill(False, period=-1)
        )
    ).diff() 
    
    if is_abs:
        expr = expr.abs()

    if agg == "mean":
        return expr.mean()
    elif agg == "median":
        return expr.median()
    elif agg == "sum":
        return expr.sum()
    elif agg == "std":
        return expr.std()
    elif agg == "max":
        return expr.max()
    else:
        raise TypeError(f"The input: `{agg}` is not a valid aggregation strategy.")
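The if/elif chain above could alternatively be written as a lookup table. The sketch below shows the idea on plain Python lists using the standard library; the AGGREGATORS table and aggregate helper are hypothetical names, and in Polars the analogous step might be a lookup of the expression method by name (an assumption, not tested here).

```python
import math
import statistics
from typing import Callable, Dict, Sequence

# Maps each strategy name to the function that performs the aggregation
AGGREGATORS: Dict[str, Callable[[Sequence[float]], float]] = {
    "mean": statistics.fmean,
    "median": statistics.median,
    "sum": math.fsum,
    "std": statistics.stdev,  # sample standard deviation
    "max": max,
}


def aggregate(values: Sequence[float], agg: str = "mean") -> float:
    """Aggregate `values` with the named strategy, rejecting unknown names."""
    try:
        func = AGGREGATORS[agg]
    except KeyError:
        raise ValueError(f"The input: `{agg}` is not a valid aggregation strategy.") from None
    return func(values)


print(aggregate([1.0, 2.0, 3.0, 4.0], "mean"))  # 2.5
```

Adding a new strategy then means adding one table entry rather than another branch.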

Fractional Differentiation

Would there be any interest in adding fractional differencing as a function to the library? This is discussed in M. L. Prado, "Advances in Financial Machine Learning". I already have an implementation in Polars that I could clean up and turn into a PR if there is any interest.

You can also see this repo for a numpy example
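For context, fractional differencing applies the binomial expansion of (1 - B)^d, where B is the backshift operator, so the weights follow the recursion w_0 = 1 and w_k = -w_{k-1} (d - k + 1) / k. The sketch below is a plain-Python illustration with a fixed number of weights; the function names are hypothetical and it is not the Polars implementation mentioned above.

```python
from typing import List, Sequence


def frac_diff_weights(d: float, n_weights: int) -> List[float]:
    """First `n_weights` coefficients of the expansion of (1 - B)^d."""
    weights = [1.0]
    for k in range(1, n_weights):
        weights.append(-weights[-1] * (d - k + 1) / k)
    return weights


def frac_diff(x: Sequence[float], d: float, n_weights: int) -> List[float]:
    """Apply the truncated weights as a fixed-width window over the series."""
    weights = frac_diff_weights(d, n_weights)
    return [
        sum(w * x[t - k] for k, w in enumerate(weights))
        for t in range(n_weights - 1, len(x))
    ]


print(frac_diff_weights(0.5, 4))               # [1.0, -0.5, -0.125, -0.0625]
print(frac_diff([1.0, 3.0, 6.0, 10.0], 1, 2))  # [2.0, 3.0, 4.0]
```

With d = 1 the weights reduce to [1, -1], which recovers ordinary first differencing; fractional values of d give slowly decaying weights that retain more of the series' memory.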

[DOCS] Duplicate args in forecasting tutorial?

The Target and Feature Transform section is barely commented but I guess it should look like this:

from functime.forecasting import linear_model

forecaster = linear_model(
    freq="1mo",
    lags=12,
    target_transform=scale(),
-   target_transform=add_fourier_terms(sp=12, K=3)
+   feature_transform=add_fourier_terms(sp=12, K=3)
)

Compatibility with scikit-learn 1.3.0

Thank you for making what looks like an amazing module, functime team!
The package currently requires scikit-learn 1.2.2. Is it possible to make it compatible with 1.3.0?
I want to try out functime's performance in a repo for which I need scikit-learn 1.3.0.

(Unnecessary extra info: This is for 1.3.0's hdbscan. I know I could just use the hdbscan module, but I expect the hdbscan module may become semi-redundant with hdbscan in scikit-learn.)

Feature Extraction Tsfresh Rewrite Quality Assurance

First, thank you everybody for contributing to the rewrite.

We are planning to make this project more public, which means we need to make sure that the quality is good. For this round of review, we want to focus on the following 3 items (ranked in terms of importance):

[99%] Correctness. Correctness with respect to implementation and feature definition, and the final numerical result should make sense. There are a few that I haven't reviewed, or that I don't know whether anybody has reviewed: cwt_coefficients, autoregressive_coefficients, augmented_dickey_fuller, and the newly added fft_coefficients. Please comment if you think there are others we need to review.
[99%] Tests. Building out more tests for all the features. Thank you @MathieuCayssol for building out more tests.
Delayed FFT features. Now that fft_coefficients is implemented, can we start using it?
Ongoing: Performance. For some methods, there might be shortcuts in eager mode, or vice versa. For (eager) methods that use NumPy under the hood, can we do better? Is there redundant computation? Are there more efficient methods?

Feature Name	Implemented Lazy (Expr)	Implemented Eager (Series)	Need More Review
absolute_energy	Y	Y
absolute_maximum	Y	Y
absolute_sum_of_changes	Y	Y
approximate_entropy	N	Y
augmented_dickey_fuller	N	Y	Y
autocorrelation	N	Y
autoregressive_coefficients	N	Y	Y
benford_correlation	Y	Y
binned_entropy	Y	Y
c3	Y	Y
change_quantiles	Y	Y
cid_ce	Y	Y
count_above	Y	Y
count_above_mean	Y	Y
count_below	Y	Y
count_below_mean	Y	Y
cwt_coefficients	N	Y	Y
energy_ratios	Y	Y
first_location_of_maximum	Y	Y
first_location_of_minimum	Y	Y
fourier_entropy	N	Y	Y
friedrich_coefficients	N	Y
has_duplicate	Y	Y
has_duplicate_max	Y	Y
has_duplicate_min	Y	Y
index_mass_quantile	Y	Y
large_standard_deviation	Y	Y
last_location_of_maximum	Y	Y
last_location_of_minimum	Y	Y
lempel_ziv_complexity	N	Y
linear_trend	Y	Y
longest_strike_above_mean	Y	Y
longest_strike_below_mean	Y	Y
mean_abs_change	Y	Y
mean_change	Y	Y
mean_n_absolute_max	Y	Y
mean_second_derivative_central	Y	Y
number_crossings	Y	Y
number_cwt_peaks	N	Y
number_peaks	Y	Y
percent_reocurring_points	Y	Y
percent_reoccuring_values	Y	Y
permutation_entropy	Y	Y
range_count	Y	Y
ratio_beyond_r_sigma	Y	Y
ratio_n_unique_to_length	Y	Y
root_mean_square	Y	Y
sample_entropy	N	Y
spkt_welch_density	N	Y
sum_reocurring_points	Y	Y
sum_reocurring_values	Y	Y
symmetry_looking	Y	Y
time_reversal_asymmetry_statistic	Y	Y
variation_coefficient	Y	Y
harmonic_mean	Y	Y
fft_coefficients	N	Y	Y

[PROJECT TRACKER] `tsfresh` time-series features

This issue lists the time-series features in tsfresh, along with API design challenges and considerations.

❗ IMPORTANT NOTE: To ensure fair and useful benchmarks between our polars / Rust FFI code vs the original numpy code, our rewrite will go through three stages:

Rewrite using polars expressions only. Take note of features that require external dependencies. Modify original tsfresh API to improve user-experience.
Set up Rust FFI
Full benchmarking

Proposed API Design

Each individual time-series feature should be implemented as a function, which takes a numeric type pl.Expr and returns another pl.Expr of any dtype (depends on the feature). Any parameter should be explicitly specified as function inputs. For example:

import polars as pl

def some_feature(x: pl.Expr, lag: int) -> pl.Expr:
  ...
  return some_result

We will then register the functions into a custom polars namespace (see guide on "Extending the API" here). So usage of tsfresh in functime will look like this:

X_features = X.select([
  pl.col("value").ts.binned_entropy(n_bins=42),
  pl.col("value").ts.acf(lags=12),
  pl.col("value").ts.abs_energy()
])

This API design makes it easy to extend to panel data and rolling windows! For example:

# Panel dataset
X_features = X.groupby("series_id").agg([
    pl.col("value").ts.binned_entropy(n_bins=42),
    pl.col("value").ts.acf(lags=12),
    pl.col("value").ts.abs_energy()
])

# With window + stride (time-series)
X_features = X.groupby_dynamic(every="10i").agg([
    pl.col("value").ts.binned_entropy(n_bins=42),
    pl.col("value").ts.acf(lags=12),
    pl.col("value").ts.abs_energy()
])

# With window + stride (panel)
X_features = X.groupby_dynamic(every="10i", by="series_id").agg([
    pl.col("value").ts.binned_entropy(n_bins=42),
    pl.col("value").ts.acf(lags=12),
    pl.col("value").ts.abs_energy()
])

Docstrings and Style

We use numpydoc and black formatting.

We also use pre-commit. After you clone the repo, remember to run the following:

pip install pre-commit
pre-commit install

Features Checklist

If you want to work on a feature, just create a comment in this issue thread. I'll then add your @ to the checklist below.

❗ IMPORTANT NOTE: We are going to ignore any FFT (fast Fourier transform) features for now!

abs_energy @topher-lo
abs_maximum @topher-lo
absolute_sum_of_changes @topher-lo
~~agg_linear_trend~~ Aggregated not supported.
approximate_entropy @topher-lo
ar_coeffs @metaboulie
augmented_dickey_fuller
autocorrelation @achasol
benford_correlation @MathieuCayssol
binned_entropy @claysmyth
c3 @mbignotti
change_quantiles @vienneraphael
cid_ce @mbignotti
count_above @achasol
count_above_mean @achasol
count_below @achasol
count_below_mean @achasol
cwt_coefficients
energy_ratio_by_chunks @mbignotti
~~fft_aggregated~~ Aggregated not supported.
fft_coefficient (ignore)
first_location_of_maximum @vienneraphael
first_location_of_minimum @vienneraphael
fourier_entropy @claysmyth
friedrich_coefficients @claysmyth
has_duplicate @achasol
has_duplicate_max @achasol
has_duplicate_min @achasol
index_mass_quantile @mbignotti
~~kurtosis~~
large_std @mbignotti
last_location_of_maximum @vienneraphael
last_location_of_minimum @vienneraphael
lempel_ziv_complexity @claysmyth
~~length~~
linear_trend
~~linear_trend_timewise~~ (not supported)
longest_strike_above_mean @MathieuCayssol
longest_strike_below_mean @MathieuCayssol
max_langevin_fixed_point @claysmyth
~~max~~
~~mean~~
mean_abs_change @vienneraphael
mean_change @vienneraphael
mean_n_abs_max @MathieuCayssol
mean_second_derivative_central
median (ignore...already in polars)
min (ignore...already in polars)
number_crossing_m @vienneraphael
number_cwt_peaks @MathieuCayssol
number_peaks @MathieuCayssol
partial_autocorr @claysmyth
percentage_of_reoccuring_datapoints_to_all @MathieuCayssol
percentage_of_reoccuring_values_to_all @MathieuCayssol
permutation_entropy @abstractqqq
~~quantile~~
~~query_similarity_count~~ (not supported)
range_count @abstractqqq
ratio_beyond_r_sigma @abstractqqq
ratio_value_number_to_length @abstractqqq
root_mean_square @abstractqqq
sample_entropy @abstractqqq
~~skewness~~
spkt_welch_density @abstractqqq
~~std~~
sum_of_reoccuring_data_points (see percentage_of_reoccuring_data_point_to_all) @MathieuCayssol
sum_of_reoccuring_values (see percentage_of_reoccuring_values_to_all) @MathieuCayssol
~~sum~~
symmetry_looking @metaboulie
time_reversal_asymmetry_statistic @metaboulie
~~value_count~~
~~variance~~
var_greater_than_std @vienneraphael
var_coeff @mbignotti

Testing

White-box testing strategy. We use pytest. Each developer is responsible for their own unit tests per feature. This should be relatively straightforward: we can reuse most of the code from the tsfresh unit tests.

Development Workflow

Everybody should create ONE pull request with all of their implemented features and tests. The title of the PR should be feat: tsfresh features @your-github-handle

Challenges Checklist

Here is a list of challenges I discovered during a first attempt at the rewrite. Please add to this list in the comments below!

Some features in tsfresh do not list their parameters in the function signature (e.g. cwt_coefficients). We should explicitly list them out in our implementation. You can find the default parameters from the original implementation here.
Some features in tsfresh return a pandas Series (e.g. ar_coefficients). For our implementation, we should return a pl.List instead.
We should ignore agg_ features. Let the user aggregate using polars expressions, e.g. x.list.to_explode().std(), instead.

Optimization: count_above_mean

def count_above_mean(x: pl.Expr) -> pl.Expr:
    """Count the number of values that are above the mean.

    Parameters
    ----------
    x : pl.Expr | pl.Series
        Input time-series.

    Returns
    -------
    int
    """
    return x.filter(x > x.mean()).count()

def count_above_mean_rewrite(x:pl.Expr) -> pl.Expr:

    return (x > x.mean()).sum()

The current implementation is good, but it is somewhat wordy and imprecise.

Performance-wise, the two implementations are really close when the input column has <200k rows in my testing: the rewrite is about 10-20% faster, depending on length, at these relatively small data sizes. However, on really long series, say 1 million rows, the rewrite is significantly faster. I think this is because "filter" is essentially a redundant operation (which seems to account for roughly half of the runtime on larger datasets).
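The equivalence of the two approaches can be seen in a small pure-Python analogue, offered as an illustration rather than a benchmark. The first function materializes the filtered values before counting, while the second sums a boolean mask directly; both function names are hypothetical.

```python
from statistics import fmean
from typing import Sequence


def count_above_mean_filter(xs: Sequence[float]) -> int:
    """Build the filtered list first, then count its elements."""
    mean = fmean(xs)
    return len([v for v in xs if v > mean])


def count_above_mean_mask(xs: Sequence[float]) -> int:
    """Sum the boolean comparisons directly; no intermediate list is built."""
    mean = fmean(xs)
    return sum(v > mean for v in xs)


data = [1.0, 2.0, 3.0, 4.0]
print(count_above_mean_filter(data), count_above_mean_mask(data))  # 2 2
```

Both return the same count; the mask version simply skips the intermediate collection, which is the redundancy the rewrite removes.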

Here are some simple tests:

[FEAT] [PLOTTING] add `plotting.plot_entities` to display info about entities

I was wondering if we want a plot_entities histogram/barchart to display summary information about the entities, e.g. the number of observations for each entity.

This could be used to drop entities with too few obs, or (with future features) draw the number of missing values/zeroes in each series.

Here is a draft implementation:

import polars as pl
import plotly.express as px

url = "https://github.com/neocortexdb/functime/raw/main/data/commodities.parquet"
y = pl.scan_parquet(url).with_columns(pl.col("time").cast(pl.Date))

entity_col, time_col, target_col = y.columns


def plot_entities(y, **kwargs):
    # add logic to handle dataframe or lazyframes
    counts = (
        y
        .group_by(entity_col)
        .agg(pl.count())
        .collect()
    )

    height = len(counts) * 15 # sensible-ish default
    
    return (
        px.bar(
            data_frame = counts,
            x="count",
            y=entity_col,
            orientation="h",
            )
        .update_layout(height=height, **kwargs) # add logic to avoid `height` being passed twice
    )

plot_entities(y)

TypeError: date_range() got an unexpected keyword argument 'eager'

I ran the example forecasting code on Colab. Too eager? :)

TypeError Traceback (most recent call last)
in <cell line: 16>()
14 model = lightgbm(freq="1mo", lags=24, max_horizons=3, strategy="ensemble")
15 model.fit(y=y_train)
---> 16 y_pred = model.predict(fh=3)
17
18 # functime ❤️ functional design

2 frames
/usr/local/lib/python3.10/dist-packages/polars/utils/decorators.py in wrapper(*args, **kwargs)
35 def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
36 _rename_kwargs(function.__name__, kwargs, aliases)
---> 37 return function(*args, **kwargs)
38
39 return wrapper

TypeError: date_range() got an unexpected keyword argument 'eager'

Auto LightGBM: unable to add a column of length 7644 to a dataframe of height 9334

I'm following a pattern similar to the one in the documentation for auto-lightgbm, but with the M4 weekly parquet file, and I receive an error several steps in. Maybe the error has to do with how much data is held out and the number of lags, coupled with some short time series, but I'm not quite sure. Any thoughts?

import polars as pl
from flaml import tune
from functime.forecasting import auto_lightgbm


weekly_train_pl = pl.read_parquet("https://github.com/descendant-ai/functime/raw/main/data/m4_1w_train.parquet")
hourly_train_pl = pl.read_parquet('https://github.com/descendant-ai/functime/raw/main/data/m4_1d_train.parquet')

time_budget = 2

# Fit model
forecaster = auto_lightgbm(
    freq=-1,
    min_lags=4,
    max_lags=52,
    test_size=26,
    time_budget=time_budget,
)
forecaster.fit(y=weekly_train_pl)

[flaml.tune.tune: 07-23 00:17:53] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:17:57] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:04] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:12] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:20] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:26] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:35] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:45] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:18:52] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:01] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:12] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:20] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:32] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:43] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:19:57] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:20:10] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:20:22] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:20:36] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:20:50] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:21:04] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:21:18] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:21:36] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:21:53] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:22:09] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:22:25] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:22:43] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:23:00] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:23:17] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:23:39] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:23:57] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:24:18] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:24:37] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:25:02] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:25:26] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:25:47] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:26:11] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:26:34] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:26:56] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:27:22] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:27:45] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:28:09] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:28:33] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:29:00] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:29:26] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:29:52] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:30:18] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
[flaml.tune.tune: 07-23 00:30:46] {773} INFO - trial 1 config: {'n_estimators': 40, 'num_leaves': 2, 'reg_alpha': 0.002624570650559949, 'reg_lambda': 0.6080101522778687, 'colsample_bytree': 0.6176855370868456, 'subsample': 0.6211956701608718, 'subsample_freq': 4, 'min_child_samples': 10}
---------------------------------------------------------------------------
ShapeError                                Traceback (most recent call last)
[<ipython-input-12-453b2f0928c1>](https://localhost:8080/#) in <cell line: 11>()
      9     time_budget=time_budget,
     10 )
---> 11 forecaster.fit(y=weekly_train_pl)
     12 
     13 # Get best lags and model hyperparameters

8 frames
[/usr/local/lib/python3.10/dist-packages/functime/base/forecaster.py](https://localhost:8080/#) in fit(self, y, X)
     78                 X = self._enforce_string_cache(X.lazy().collect())
     79             X = X.lazy()
---> 80         artifacts = self._fit(y=y, X=X)
     81         cutoffs = y.groupby(y.columns[0]).agg(pl.col(y.columns[1]).max().alias("low"))
     82         artifacts["__cutoffs"] = cutoffs.collect(streaming=True)

[/usr/local/lib/python3.10/dist-packages/functime/forecasting/automl.py](https://localhost:8080/#) in _fit(self, y, X)
    108         from functime.forecasting._ar import fit_cv
    109 
--> 110         return fit_cv(
    111             y=y,
    112             X=X,

[/usr/local/lib/python3.10/dist-packages/functime/forecasting/_ar.py](https://localhost:8080/#) in fit_cv(y, forecaster_cls, freq, min_lags, max_lags, max_horizons, strategy, test_size, step_size, n_splits, time_budget, search_space, points_to_evaluate, low_cost_partial_config, num_samples, cv, X, **kwargs)
    149     scores_path = []
    150     for lags in lags_path:
--> 151         score = evaluate(
    152             **{
    153                 "lags": lags,

[/usr/local/lib/python3.10/dist-packages/functime/forecasting/_evaluate.py](https://localhost:8080/#) in evaluate(lags, n_splits, time_budget, points_to_evaluate, num_samples, low_cost_partial_config, test_size, max_horizons, strategy, freq, forecaster_cls, y_splits, X_splits, search_space)
    134         score = result["mae"]
    135     else:
--> 136         tuner = flaml.tune.run(
    137             partial(
    138                 evaluate_windows,

[/usr/local/lib/python3.10/dist-packages/flaml/tune/tune.py](https://localhost:8080/#) in run(evaluation_function, config, low_cost_partial_config, cat_hp_cost, metric, mode, time_budget_s, points_to_evaluate, evaluated_rewards, resource_attr, min_resource, max_resource, reduction_factor, scheduler, search_alg, verbose, local_dir, num_samples, resources_per_trial, config_constraints, metric_constraints, max_failure, use_ray, use_spark, use_incumbent_result_in_evaluation, log_file_name, lexico_objectives, force_cancel, n_concurrent_trials, **ray_args)
    774                 result = None
    775                 with PySparkOvertimeMonitor(time_start, time_budget_s, force_cancel):
--> 776                     result = evaluation_function(trial_to_run.config)
    777                 if result is not None:
    778                     if isinstance(result, dict):

[/usr/local/lib/python3.10/dist-packages/functime/forecasting/_evaluate.py](https://localhost:8080/#) in evaluate_windows(config, lags, n_splits, test_size, max_horizons, strategy, freq, forecaster_cls, y_splits, X_splits)
     78         y_train, y_test = y_splits[i]
     79         X_train, X_test = X_splits[i] if X_splits is not None else None, None
---> 80         result = evaluate_window(
     81             y_train=y_train,
     82             y_test=y_test,

[/usr/local/lib/python3.10/dist-packages/functime/forecasting/_evaluate.py](https://localhost:8080/#) in evaluate_window(config, lags, test_size, max_horizons, strategy, freq, forecaster_cls, y_train, y_test, X_train, X_test)
     44         y_test = y_test.sort([entity_col, time_col])
     45         y_pred = y_pred.sort([entity_col, time_col])
---> 46         y_test = y_test.with_columns(**{time_col: y_pred.get_column(time_col)})
     47         score = mae(y_true=y_test, y_pred=y_pred).get_column("mae").mean()
     48         res = {"score": score}

[/usr/local/lib/python3.10/dist-packages/polars/dataframe/frame.py](https://localhost:8080/#) in with_columns(self, *exprs, **named_exprs)
   7331         """
   7332         return (
-> 7333             self.lazy()
   7334             .with_columns(*exprs, **named_exprs)
   7335             .collect(no_optimization=True)

[/usr/local/lib/python3.10/dist-packages/polars/lazyframe/frame.py](https://localhost:8080/#) in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
   1529             streaming,
   1530         )
-> 1531         return wrap_df(ldf.collect())
   1532 
   1533     def sink_parquet(

ShapeError: unable to add a column of length 7644 to a dataframe of height 9334

Rationale for the licensing

First of all, thanks for the great work on such a blazing-fast library for time-series forecasting. I think functime arrives at a great time in the Python ecosystem for this kind of problem.

I would like to know the rationale that led you to license the library under AGPLv3 (Affero General Public License v3.0).

I ask because most libraries in the time-series space use a different license:

Apache-2.0 license

MIT license

BSD-3-Clause license

sktime/sktime

Pandas is a required dependency for plotting with plotly

I did a fresh install of functime in a clean env and tried to run the following:

import polars as pl
from functime import plotting


def main():
    y = pl.scan_parquet(
        "https://github.com/descendant-ai/functime/raw/main/data/commodities.parquet"
    )

    plotting.plot_panel(y)


if __name__ == "__main__":
    main()

The following error is raised:

ImportError: Plotly express requires pandas to be installed.

Proposed solution

We should add pandas to the list of dependencies (duh). However, I would go a step further and suggest breaking up the dependencies into optional groups. For example, one could install functime[plotting].

I think this could be a pretty big issue in terms of deployability for container size, etc. For example, in a production/inference setting, I believe one might not be interested in plotting capabilities.

A fresh functime install (including dependencies) takes as much as 992MB in my venv. Pandas alone is 135MB (though numpy is a common dependency), which would bring the plotting functions to as much as 293MB (just for plotly and pandas, though we would have to subtract numpy, which we might still be using under the hood for other things).

By comparison, a clean install of statsforecast (with dependencies) is just 637MB, statsforecast + mlforecast is 680MB, scikit-learn is 218MB (with dependencies), and polars alone is 93MB.

Let me know what you think :)

Analyzing multiple time series with dimension reduction

The idea is to analyze multiple time series with dimension reduction based on our set of feature calculators. The steps are:

Select N time series
Calculate multiple features on the N time series
Build a dataset in which each column is a feature and each row is the corresponding time series
Apply the dimension reduction algorithm
Plot the results using plotly
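
The steps above could be sketched roughly as follows. This is a minimal illustration that assumes NumPy only: the three features and the PCA-via-SVD reduction are stand-ins chosen for the example, not functime's feature calculators or the algorithm the issue has in mind, and the final plotting step is left out.

```python
import numpy as np


def extract_features(series: np.ndarray) -> np.ndarray:
    """Three illustrative features: mean, standard deviation, mean absolute change."""
    return np.array([series.mean(), series.std(), np.abs(np.diff(series)).mean()])


def pca_2d(features: np.ndarray) -> np.ndarray:
    """Project a (n_series, n_features) matrix onto its first two principal components."""
    centered = features - features.mean(axis=0)
    scale = centered.std(axis=0)
    scale[scale == 0] = 1.0  # avoid dividing by zero for constant features
    standardized = centered / scale
    # Rows of vt are the principal axes, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(standardized, full_matrices=False)
    return standardized @ vt[:2].T


rng = np.random.default_rng(0)
# Step 1: select N time series (here: 20 synthetic random walks of length 100).
panel = [rng.normal(size=100).cumsum() for _ in range(20)]
# Steps 2-3: one row per series, one column per feature.
feature_matrix = np.vstack([extract_features(s) for s in panel])
# Step 4: dimension reduction down to two coordinates per series.
embedding = pca_2d(feature_matrix)
```

Each row of `embedding` is one time series, so the two columns can be passed directly to a scatter plot.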

Unable to poetry add functime

Hey guys, I'm able to install functime with pip, but unable to add it with the Poetry package manager.

The problem lies with kaleido

(tsboost-py3.10) ferpapi tsboost $ poetry add functime
Using version ^0.2.4 for functime

Updating dependencies
Resolving dependencies... (4.2s)

Writing lock file

Package operations: 22 installs, 0 updates, 0 removals

  • Installing entrypoints (0.4)
  • Installing locket (1.0.0)
  • Installing mdurl (0.1.2)
  • Installing toolz (0.12.0)
  • Installing zipp (3.16.2)
  • Installing asciitree (0.3.3)
  • Installing fasteners (0.18)
  • Installing importlib-metadata (6.8.0)
  • Installing markdown-it-py (3.0.0)
  • Installing numcodecs (0.11.0)
  • Installing partd (1.4.0)
  • Installing pygments (2.15.1)
  • Installing pynndescent (0.5.8)
  • Installing dask (2023.7.0)
  • Installing flaml (1.2.4)
  • Installing kaleido (0.2.1.post1): Failed

  RuntimeError

  Unable to find installation candidates for kaleido (0.2.1.post1)

  at ~/.local/share/pypoetry/venv/lib/python3.10/site-packages/poetry/installation/chooser.py:109 in choose_for
      105│
      106│             links.append(link)
      107│
      108│         if not links:
    → 109│             raise RuntimeError(f"Unable to find installation candidates for {package}")
      110│
      111│         # Get the best link
      112│         chosen = max(links, key=lambda link: self._sort_key(package, link))
      113│

  • Installing polars (0.18.7)
  • Installing pylance (0.5.8)
  • Installing rich (13.4.2)
  • Installing umap-learn (0.5.3)
  • Installing zarr (2.15.0

any clues?

[BUG] plotting functions don't work when the number of columns is >3

import polars as pl
from functime import plotting

y = pl.scan_parquet("https://github.com/descendant-ai/functime/raw/main/data/commodities.parquet")


(
    y
    .filter(pl.col("commodity_type").eq("Aluminum"))
    .with_columns(pl.lit(1).alias("fourth_col"))
    .collect()
    .pipe(plotting.plot_panel)
)

This fails. The proposed solution is to replace this line:

https://github.com/neocortexdb/functime/blob/90e0f8d66a5ea242d546adeeafd61c60dab1eae7/functime/plotting.py#L53

With this:

entity_col, time_col, target_col = y.columns[:3]

As we do in other places of the codebase.
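
To see why the slice matters, here is a small standalone illustration with a plain list standing in for `y.columns`. It assumes the failing line unpacks all columns at once; `time` and `price` are made-up column names.

```python
columns = ["commodity_type", "time", "price", "fourth_col"]

# Unpacking the whole list fails as soon as there are more than three columns.
try:
    entity_col, time_col, target_col = columns
except ValueError as exc:
    error = str(exc)

# Slicing to the first three columns ignores any extra ones.
entity_col, time_col, target_col = columns[:3]
```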

Support NumPy 1d Array Whenever Series is Supported

I think for general purpose computing, we still need to support NumPy 1d array inputs. Asking the user to do Series -> NumPy or NumPy -> Series conversion can be cumbersome.

Most code for Series will work the same for NumPy. Some code will need clear branches like

isinstance(x, (pl.Series, np.ndarray))

because x.len() won't work on NumPy. But that is just a quick fix, and generally speaking code duplication is not a big deal.

What do you guys think?

add entity column parameter to all cross_validation functions

Currently, train_test_split and other functions infer the entity column by simply picking the first column, which is a crude approach. Being able to specify which column to use would be more practical.
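
A rough sketch of what such a signature could look like, written against plain lists of dicts rather than Polars frames. The `entity_col` and `time_col` parameter names and the fallback to the first two keys are assumptions for illustration, not functime's actual API.

```python
from collections import defaultdict


def train_test_split(rows, test_size, entity_col=None, time_col=None):
    """Hold out the last `test_size` (> 0) observations of every entity.

    `rows` is a list of dicts. If `entity_col` or `time_col` is None, fall back
    to the first and second keys, mirroring the positional inference above.
    """
    keys = list(rows[0])
    entity_col = entity_col or keys[0]
    time_col = time_col or keys[1]

    by_entity = defaultdict(list)
    for row in rows:
        by_entity[row[entity_col]].append(row)

    train, test = [], []
    for group in by_entity.values():
        group = sorted(group, key=lambda r: r[time_col])
        train.extend(group[:-test_size])
        test.extend(group[-test_size:])
    return train, test


rows = [
    {"time": t, "store": s, "sales": 10 * t}
    for s in ("a", "b")
    for t in range(5)
]
# The entity column is not first here, so it has to be named explicitly.
train, test = train_test_split(rows, test_size=2, entity_col="store", time_col="time")
```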

Explain why forecasters drop very short time series

Problem

Time series with counts less than the number of lags are silently dropped at predict time. For example, during M5 benchmarking, time series with lengths less than lags=24 are dropped. This is intended behavior, but currently undocumented.

Rationale

functime is made for high-performance ML forecasting in production. Data engineers are responsible for upstream and downstream data quality (including the property of "no missing values"), not ML engineers. I made the explicit design decision not to include any data quality pre-checks within fit-predict in functime.

Solution

Document why functime has weaker data quality pre-conditions.

Additional comment

My goal is to eventually create a checks module with functions to support more defensive forecasting pipelines. But the choice to have checks will be an explicit pipeline design decision by the user, not the functime forecasting API.
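
As an illustration of the kind of opt-in check described here, the sketch below flags entities with fewer observations than the number of lags. `find_short_series` is a hypothetical helper name, and the input is a plain list of entity ids (one per observation) rather than a Polars frame.

```python
from collections import Counter


def find_short_series(entity_ids, lags):
    """Return entities with fewer than `lags` observations, with their counts."""
    counts = Counter(entity_ids)
    return {entity: n for entity, n in counts.items() if n < lags}


# One entity id per observation: "a" has 30 rows, "b" only 5.
entity_ids = ["a"] * 30 + ["b"] * 5
short = find_short_series(entity_ids, lags=24)
```

A pipeline could run a check like this before fit-predict and decide explicitly whether to drop, pad, or report the short series.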

[FEAT] [evaluation] Add rank_by to evaluation

We mention and use the coefficient of variation more than once, such as here. It would be interesting to have an evaluation.rank_cv function to see which entities in a panel display the greatest variation.

The way I see it, we should have a public method (perhaps even in feature_extraction?) to compute the CV across all entities. This would be used by rank_cv and possibly in plot_entities (see #83) to display additional information about all entities in the panel.
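
A minimal sketch of the idea, using only the standard library. The use of the population standard deviation and the dict-based panel are assumptions for the example; the proposed `rank_cv` does not exist yet, so this only shows the intended behavior.

```python
from statistics import mean, pstdev


def coefficient_of_variation(values):
    """Standard deviation divided by the mean (population std, an assumption)."""
    return pstdev(values) / mean(values)


def rank_cv(panel, descending=True):
    """Rank entities by coefficient of variation. `panel` maps entity -> values."""
    scores = {entity: coefficient_of_variation(v) for entity, v in panel.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=descending)


panel = {
    "stable": [10.0, 10.5, 9.5, 10.0],
    "volatile": [2.0, 18.0, 5.0, 15.0],
}
ranking = rank_cv(panel)
```

Note that a series with a zero mean would raise a `ZeroDivisionError` here, so a real implementation would need to decide how to handle that case.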

freq_to_sp documentation is wrong + some wrong mapping

Some offsets suggested in the documentation simply raise errors (1ns and 1us, for example). Also, the aliases in alias_to_sp are not mapped as in the shared link: 30m is wrong, and m means minute, but alias_to_sp treats it as monthly.

Add ta-lib to feature extraction

For the financial forecasting use-case, instead of implementing technical indicators from scratch, we can leverage talib's native integration with polars.

https://github.com/ta-lib/ta-lib-python

Possible Name Changes to Some Features

I am not sure how Tsfresh decided on some of the names.

E.g. variation_coefficient is in fact the coefficient of variation; in industry we call this CV. I think coefficient_of_variation would be better.

Here is a list of name changes I am proposing, with reasons:

1. variation_coefficient ---> coefficient_of_variation (stated above)
2. absolute_sum_of_changes ---> sum_abs_changes (absolute_sum_of_changes makes people think it is abs(sum(changes)) instead of the actual value, which is sum(abs(changes)))

Personally, I prefer a more concise naming convention. For well-known items, we can use abbreviations.
3. large_standard_deviation ---> large_std

The list is not exhaustive. Please comment if you think some other features deserve a name change, or if you think we should stick to Tsfresh's naming conventions.
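
The difference between the two readings of absolute_sum_of_changes is easy to see on a small example. The function names below are illustrative only.

```python
def sum_abs_changes(x):
    """Sum of absolute consecutive differences: sum(abs(changes))."""
    return sum(abs(b - a) for a, b in zip(x, x[1:]))


def abs_sum_changes(x):
    """Absolute value of the summed differences: abs(sum(changes))."""
    return abs(sum(b - a for a, b in zip(x, x[1:])))


x = [1, 4, 2, 5]
# Changes are +3, -2, +3.
assert sum_abs_changes(x) == 8
assert abs_sum_changes(x) == 4  # telescopes to abs(x[-1] - x[0])
```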

[DOCS] Missing import in forecasting tutorial

There's a wrong import in the feature transform section of the forecasting tutorial:

from functime.forecasting import linear_model
from functime.feature_extraction import add_fourier_terms
-from functime.preprocessing import lag
+from functime.preprocessing import roll

Import functime.forecasting (zarr) leads to error

With a MacBook M1 and zarr-2.16.1, I get the following error:

ImportError: dlopen(/opt/homebrew/lib/python3.10/site-packages/numcodecs/_shuffle.cpython-310-darwin.so, 0x0002): tried: '/opt/homebrew/lib/python3.10/site-packages/numcodecs/_shuffle.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64')), '/System/Volumes/Preboot/Cryptexes/OS/opt/homebrew/lib/python3.10/site-packages/numcodecs/_shuffle.cpython-310-darwin.so' (no such file), '/opt/homebrew/lib/python3.10/site-packages/numcodecs/_shuffle.cpython-310-darwin.so' (mach-o file, but is an incompatible architecture (have 'x86_64', need 'arm64'))'

Let the user decide how to close bounds in range_count

The implementation of range_count could benefit from letting the user decide how to close intervals:

def range_count(x: TIME_SERIES_T, lower: float, upper: float) -> INT_EXPR:
    """
    Computes values of input expression that is between lower (inclusive) and upper (exclusive).

    Parameters
    ----------
    x : pl.Expr | pl.Series
        Input time-series.
    lower : float
        The lower bound, inclusive
    upper : float
        The upper bound, exclusive

    Returns
    -------
    int | Expr
    """
    if upper < lower:
        raise ValueError("Upper must be greater than lower.")
    return x.is_between(lower_bound=lower, upper_bound=upper, closed="left").sum()

The new version would be:

def range_count(x: TIME_SERIES_T, lower: float, upper: float, closed: Literal["both", "left", "right", "none"]) -> INT_EXPR:
    """
    Computes values of input expression that is between lower and upper.

    Parameters
    ----------
    x : pl.Expr | pl.Series
        Input time-series.
    lower : float
        The lower bound (inclusive if closed is "both" or "left")
    upper : float
        The upper bound (inclusive if closed is "both" or "right")
    closed : Literal["both", "left", "right", "none"]
        How to close the interval.

    Returns
    -------
    int | Expr
    """
    if upper < lower:
        raise ValueError("Upper must be greater than lower.")
    return x.is_between(lower_bound=lower, upper_bound=upper, closed=closed).sum()

If you think that's worth it, I can submit a PR for this. Let me know.
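
To make the four options concrete, here is a pure-Python sketch of the same counting logic on a plain sequence. It is not the Polars implementation; the `"left"` default is an assumption that would keep today's behavior unchanged for existing callers.

```python
from typing import Literal, Sequence

ClosedInterval = Literal["both", "left", "right", "none"]


def range_count(
    x: Sequence[float], lower: float, upper: float, closed: ClosedInterval = "left"
) -> int:
    """Count values between `lower` and `upper`, with configurable bound inclusion."""
    if upper < lower:
        raise ValueError("Upper must not be less than lower.")
    include_lower = closed in ("both", "left")
    include_upper = closed in ("both", "right")

    def in_range(v: float) -> bool:
        above = v >= lower if include_lower else v > lower
        below = v <= upper if include_upper else v < upper
        return above and below

    return sum(in_range(v) for v in x)


x = [1, 2, 3, 4, 5]
counts = {c: range_count(x, 2, 4, closed=c) for c in ("both", "left", "right", "none")}
```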

[BUILD] dependencies needed for functime

In #72 I was looking at the functime package size, which is as much as 1GB. I was wondering which dependencies were actually required:

https://github.com/neocortexdb/functime/blob/5480874bb8ff968726955ace6b71d0be36b64fe4/pyproject.toml#L29-L46

kaleido is used for plots
pynndescent seems unused
rich, requests, tabulate, and fastapi seem unused too

Should some of those be removed?

examples to forecast with multiple feature columns

As described in the title, there currently seem to be only examples like

[entity_col, time_col, price_col]

How can I forecast with more columns, such as:

[entity_col, time_col, pct_col, low_col, high_col, open_col, close_col]

I tried data with this column schema, but it did not work.

Explore performance uplift from pre-sorting

https://www.rhosignal.com/posts/polars-sorted-data-2/

Binned Entropy Implementation is wrong

Currently, binned_entropy is wrong because the behavior of pl.Series's hist function differs wildly from NumPy's np.histogram: the generated bins are very different. I am not sure if this is a bug on Polars' side, or just how they intend people to use the hist function on Series. Personally, I don't think it is a Polars error, but rather a "quirk". Another detail that took me a while to notice is that NumPy's histogram seems to be left-closed, while Polars's default behavior is right-closed.

I compared the current result with tsfresh's result, which relies on NumPy's histogram function, and the current result is wrong.

I made a rewrite that works on pl.Exprs, which I think is preferable for the future of the package.

def binned_entropy_rewrite(x:pl.Expr, bin_count: int = 10) -> pl.Expr:
    step_size = 1/bin_count
    breaks = np.linspace(step_size, stop = 1 - step_size, num = bin_count - 1)
    scaled_x = (x - x.min()) / (x.max() - x.min()) # This step slows down the calculation
    # Left closed because we want to mimic the behavior of NumPy's histogram
    return scaled_x.cut(breaks, left_closed=True).value_counts().struct.field("counts").entropy().suffix("_binned_entropy")

If we want to work lazily, we cannot set the break points before scanning through the data. However, we can map the values of x into [0, 1] with min-max scaling, which preserves how the values are distributed across the bins.

The only downside to this approach is that it seems to be much slower than the NumPy (tsfresh) implementation as of now. On a Series/array of size 100_000, this took 3.7ms while tsfresh took 0.6ms. I don't currently see how to improve this further.
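
For reference, the min-max scaling idea can be written in plain Python as below. This is a sketch under stated assumptions, not the tsfresh or functime implementation: bins are equal-width and left-closed, with the maximum value assigned to the last bin, and values that sit exactly on a bin edge may land differently than in NumPy because of floating-point rounding.

```python
import math
from collections import Counter


def binned_entropy(x, bin_count=10):
    """Entropy of the histogram of `x` over `bin_count` equal-width bins.

    Values are min-max scaled to [0, 1]; bins are left-closed, except that the
    maximum falls into the last bin.
    """
    lo, hi = min(x), max(x)
    if hi == lo:
        return 0.0  # a constant series occupies a single bin
    counts = Counter()
    for v in x:
        scaled = (v - lo) / (hi - lo)
        # int() floors non-negative values; clamp so scaled == 1.0 lands in the last bin.
        counts[min(int(scaled * bin_count), bin_count - 1)] += 1
    n = len(x)
    return -sum((c / n) * math.log(c / n) for c in counts.values())


x = [0.0, 1.0, 2.0, 3.0]
entropy = binned_entropy(x, bin_count=2)  # two values per bin
```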

Feature Bundles

Let's brainstorm some useful feature bundles (features that are likely to be generated together).

Some common ideas are:

Diagnostic features. Features that detect anomalies in time series.
Pattern detection features.
Features specific to certain domains.
Features that are commonly used together in a variety of domains.
etc...

@MathieuCayssol @topher-lo @vienneraphael

Explore Parallel = True in value_counts

In the docstring, it is mentioned that in a group_by context, parallel = True will likely not improve perf and might even make it worse. I have two questions, which I have neither the time nor the domain knowledge to test.

In time series, how often will these features be used in a group_by context? It seems to be very often, but I am not sure.
Can someone run a benchmark and see the relative perf differences for value_counts in both select and group_by contexts with Parallel = True/False? I think this is interesting knowledge and would help us later.
One thing that would be super interesting to know is when Parallel=True outperforms Parallel=False. How big does the data have to be? Let's assume we are doing value_counts on integer columns for now. The result might differ for string columns.

@topher-lo @MathieuCayssol @

[FEAT] adding a `num_series` parameter to `plotting.plot_panel`

Currently, plot_panel draws a plot for every ID in the panel. The docs recommend:

Note: if you have over 10 entities / time-series, we recommend using the rank_ functions in functime.evaluation then df.head() before plotting.

I would perhaps provide an example of this.
Ideally, though, I would like plot_panel to have a num_series param to select the number of series to plot. It could accept an integer (e.g. plot up to k series), a list of integers (plot the series 1, 3, 4...), or a list of strings with the names of the IDs.
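
One way the three accepted forms could be resolved into a list of entities is sketched below. `select_entities` is a hypothetical helper, integer positions are assumed to be zero-based, and the commodity names other than Aluminum are made up for the example.

```python
def select_entities(entity_ids, num_series=None):
    """Pick which entities to plot.

    `num_series` may be None (all), an int k (first k), a list of ints
    (positions), or a list of strings (entity names).
    """
    unique = list(dict.fromkeys(entity_ids))  # de-duplicate, keep first-seen order
    if num_series is None:
        return unique
    if isinstance(num_series, int):
        return unique[:num_series]
    if all(isinstance(i, int) for i in num_series):
        return [unique[i] for i in num_series]
    wanted = set(num_series)
    return [e for e in unique if e in wanted]


ids = ["Aluminum", "Aluminum", "Cocoa", "Coffee", "Cocoa", "Copper"]
first_two = select_entities(ids, 2)
by_position = select_entities(ids, [0, 3])
by_name = select_entities(ids, ["Coffee", "Aluminum"])
```

The selected ids could then be used to filter the panel before it is handed to the plotting code.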

Bug: c3 Implementation is wrong

The implementation is wrong because n is an expression, not an int, so k is also an expression and cannot be used in range(k).

  n = x.len()
  k = n - 2 * n_lags

A rewrite is simply:

def c3_rewrite(x: pl.Expr, lag: int) -> pl.Expr:
    twice_lag = 2 * lag
    return (x * x.shift(lag) * x.shift(twice_lag)).sum() / (x.count() - pl.lit(twice_lag))
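
As a cross-check for the rewrite, the same quantity can be computed on a plain Python list, where the length is an ordinary int. This reference version is only a sketch for testing and is not part of functime.

```python
def c3(x, lag):
    """Mean of x[i] * x[i + lag] * x[i + 2 * lag] over all valid i."""
    k = len(x) - 2 * lag  # a plain int here, so range(k) is fine
    if k <= 0:
        raise ValueError("Series is too short for this lag.")
    return sum(x[i] * x[i + lag] * x[i + 2 * lag] for i in range(k)) / k


x = [1.0, 2.0, 3.0, 4.0, 5.0]
# Triples: 1*2*3 + 2*3*4 + 3*4*5 = 90, averaged over 3 terms.
value = c3(x, lag=1)
```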
