
scalecast's Introduction

Scalecast: Forecast everything at scale

pip install scalecast

pseudocode
estimators
error/accuracy metrics
tuning models
Xvars
normalizer
call_me
weighted average modeling
plotting
export
history
feature analysis
forecasting the same series at different levels
warnings
all functions

  • A flexible, minimal-code forecasting object meant to be used with loops to forecast many series or to focus on one series for maximum accuracy
  • Flexible enough to support forecasts at different integration levels (albeit with some caveats)
  • See examples/housing.py for an example of forecasting one series
  • examples/avocados.ipynb for an example of forecasting many series
  • examples/housing_different_levels.py for an example of forecasting one series at different levels
  • All forecasting with auto-regressive terms uses an iterative process that fills in future values with forecasts. This can slow down the evaluation of many models, but it makes every forecast dynamic and reduces the chance of leakage

pseudocode

f = Forecaster(y=y_vals,current_dates=y_dates) # initialize object
f.set_test_length(test_periods) # for accuracy metrics
f.generate_future_dates(forecast_length)
f.add_regressors(seasonal,ar,AR,combo,covid19,holidays,time_trend,polynomials,other)
f.set_validation_length(validation_periods) # to tune models, a period of time before the test set
for m in estimators:
  f.set_estimator(m)
  f.ingest_grid(dict)
  f.tune()
  f.auto_forecast() # uses best parameters from tuning process

f.set_estimator('combo') # combination modeling
f.manual_forecast(how='simple',models=[m1,m2,...],call_me='simple_avg')
f.manual_forecast(how='weighted',models='all',determine_best_by='ValidationMetricValue',call_me='weighted_avg') # be careful when specifying determine_best_by to not overfit/leak

f.plot(forecast,test_set,level_forecast,fitted_vals)
f.export(to_excel=True) # summary stats, forecasts, test set, etc.

estimators

arima
combo
elasticnet
gbt
hwes
knn
mlr
mlp
prophet
rf
svr
xgboost

_estimators_ = {'arima', 'mlr', 'mlp', 'gbt', 'xgboost', 'rf', 'prophet', 'hwes', 'elasticnet','svr','knn','combo'}

arima

  • Stats Models Documentation
  • uses no Xvars by default but does accept the Xvars argument
  • does not accept the normalizer argument
  • can be used effectively on level or differenced data
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.set_estimator('arima')
f.manual_forecast(order=(1,1,1),seasonal_order=(2,1,0,12),trend='ct')

combo

  • src
  • three combination models are available:
    • simple average of specified models
    • weighted average of specified models
      • weights are based on determine_best_by parameter -- see metrics
    • splice of specified models at specified splice point(s)
      • specify splice points in splice_points
        • should be one less in length than models
        • models[0] --> :splice_points[0]
        • models[-1] --> splice_points[-1]:
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.set_estimator('hwes')
f.manual_forecast(trend='add',seasonal='add',call_me='hwes_add')
f.manual_forecast(trend='mul',seasonal='mul',call_me='hwes_mul')

f.set_estimator('combo')
f.manual_forecast(how='simple',models=['hwes_add','hwes_mul'])
f.manual_forecast(how='weighted',determine_best_by='InSampleRMSE',models=['hwes_add','hwes_mul']) # this leaks data -- see auto_forecast for better weighted average modeling
f.manual_forecast(how='splice',models=['hwes_add','hwes_mul'],splice_points=['2022-01-01'])
  • the weighted average model above will probably overfit, since determine_best_by='InSampleRMSE' is a metric computed on data that includes the test set
  • the models argument can also be a str beginning with "top_" followed by a number (for example "top_2"); that many models, ranked by determine_best_by, will be averaged (see export)
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.set_estimator('hwes')
f.manual_forecast(trend='add',seasonal='add',call_me='hwes_add')
f.manual_forecast(trend='mul',seasonal='mul',call_me='hwes_mul')
f.manual_forecast(trend=None,seasonal='add',call_me='hwes_add_no_trend')

f.set_estimator('combo')
f.manual_forecast(how='simple',models='top_2',determine_best_by='InSampleRMSE') # this leaks data
f.manual_forecast(how='weighted',determine_best_by='InSampleRMSE',models='top_2') # this leaks data
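
The simple average and the splice rule described in this section (models[0] up to splice_points[0], models[-1] from splice_points[-1] onward) can be sketched in plain Python. This is only an illustration, not scalecast's implementation: the function names and the numbers are made up, and it assumes that a date equal to a splice point belongs to the later model.

```python
from datetime import date


def simple_average(forecasts):
    """Element-wise mean of several equal-length forecasts."""
    return [sum(values) / len(values) for values in zip(*forecasts)]


def splice(dates, forecasts, splice_points):
    """Stitch forecasts together, switching models at each splice point.

    Assumption: splice_points is sorted, and a date equal to a splice
    point is taken from the later model.
    """
    if len(splice_points) != len(forecasts) - 1:
        raise ValueError("splice_points must be one shorter than forecasts")
    spliced = []
    for i, d in enumerate(dates):
        # The model index is the number of splice points already reached.
        model_idx = sum(1 for point in splice_points if d >= point)
        spliced.append(forecasts[model_idx][i])
    return spliced


dates = [date(2021, 11, 1), date(2021, 12, 1), date(2022, 1, 1), date(2022, 2, 1)]
hwes_add = [100.0, 110.0, 120.0, 130.0]  # made-up forecast values
hwes_mul = [90.0, 100.0, 140.0, 150.0]   # made-up forecast values

avg = simple_average([hwes_add, hwes_mul])
spliced = splice(dates, [hwes_add, hwes_mul], [date(2022, 1, 1)])
```

With one splice point and two models, the first two dates come from hwes_add and the last two from hwes_mul.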

elasticnet

  • Sklearn Documentation
  • combines a lasso and ridge regressor
  • uses all Xvars and a MinMax normalizer by default
  • better on differenced data for non-stationary series
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.add_poly_terms('t') # t^2
f.diff() # non-stationary data forecasts better differenced with this model
f.set_estimator('elasticnet')
f.manual_forecast(alpha=.5,l1_ratio=.5,normalizer='scale')

gbt

  • Sklearn Documentation
  • Gradient Boosted Trees
  • robust to overfitting
  • takes longer to run than xgboost
  • uses all Xvars and a MinMax normalizer by default
  • better on differenced data for non-stationary series
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.diff() # non-stationary data forecasts better differenced with this model
f.set_estimator('gbt')
f.manual_forecast(max_depth=2,normalizer=None)

hwes

  • Stats Models Documentation
  • Holt-Winters Exponential Smoothing
  • does not accept the normalizer or Xvars argument
  • usually better on level data, whether or not the series is stationary
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.set_estimator('hwes')
f.manual_forecast(trend='add',seasonal='mul')

knn

  • Sklearn Documentation
  • K-nearest Neighbors
  • uses all Xvars and a MinMax normalizer by default
  • better on differenced data for non-stationary series
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.add_poly_terms('t') # t^2
f.diff() # non-stationary data forecasts better differenced with this model
f.set_estimator('knn')
f.manual_forecast(n_neighbors=5,weights='uniform')

mlp

  • Sklearn Documentation
  • Multi-Layer Perceptron (neural network)
  • uses all Xvars and a MinMax normalizer by default
  • better on differenced data for non-stationary series
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.add_poly_terms('t') # t^2
f.diff() # non-stationary data forecasts better differenced with this model
f.set_estimator('mlp')
f.manual_forecast(Xvars=['monthsin','monthcos','year','t'],solver='lbfgs') # solver is a single str here; lists of values belong in a tuning grid

mlr

  • Sklearn Documentation
  • Multiple Linear Regression
  • uses all Xvars and a MinMax normalizer by default
  • better on differenced data for non-stationary series
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.add_poly_terms('t') # t^2
f.diff() # non-stationary data forecasts better differenced with this model
f.set_estimator('mlr')
f.manual_forecast(normalizer=None)

prophet

  • Prophet Documentation
  • uses no Xvars by default but does accept the Xvars argument
  • does not accept the normalizer argument
  • whether it performs better on differenced or level data depends on the series but it should be okay with either
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.set_estimator('prophet')
f.manual_forecast(n_changepoints=3)

svr

  • Sklearn Documentation
  • Support Vector Regressor
  • uses all Xvars and a MinMax normalizer by default
  • better on differenced data for non-stationary series
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.diff() # non-stationary data forecasts better differenced with this model
f.set_estimator('svr')
f.manual_forecast(kernel='linear',gamma='scale',C=2,epsilon=0.01)

rf

  • Sklearn Documentation
  • Random Forest
  • uses all Xvars and a MinMax normalizer by default
  • better on differenced data for non-stationary series
  • prone to overfitting
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.diff() # non-stationary data forecasts better differenced with this model
f.set_estimator('rf')
f.manual_forecast(n_estimators=1000,max_depth=6)

xgboost

  • Xgboost Documentation
  • extreme gradient boosted tree model
  • uses all Xvars and a MinMax normalizer by default
  • better on differenced data for non-stationary series
  • runs faster than gbt
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.diff() # non-stationary data forecasts better differenced with this model
f.set_estimator('xgboost')
f.manual_forecast(max_depth=2)

error/accuracy metrics

  • relevant when combination modeling, plotting, and exporting
  • both level and integrated metrics are available
    • if the forecasts were performed on level data, these will be the same
    • if the series were differenced, these can offer interesting contrasts and views of accuracy
_metrics_ = {'r2','rmse','mape','mae'}
_determine_best_by_ = {'TestSetRMSE','TestSetMAPE','TestSetMAE','TestSetR2','InSampleRMSE','InSampleMAPE','InSampleMAE',
                        'InSampleR2','ValidationMetricValue','LevelTestSetRMSE','LevelTestSetMAPE','LevelTestSetMAE',
                        'LevelTestSetR2','LevelInSampleRMSE','LevelInSampleMAPE','LevelInSampleMAE','LevelInSampleR2',None}
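
For reference, the four metrics named in _metrics_ can be written out with their textbook definitions. This is a sketch under that assumption; the package's own implementation may differ in details such as how zero actuals are handled in MAPE.

```python
import math


def rmse(actual, forecast):
    """Root mean squared error."""
    return math.sqrt(sum((a - f) ** 2 for a, f in zip(actual, forecast)) / len(actual))


def mae(actual, forecast):
    """Mean absolute error."""
    return sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)


def mape(actual, forecast):
    """Mean absolute percentage error (undefined if any actual value is 0)."""
    return sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / len(actual)


def r2(actual, forecast):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


actual = [1.0, 2.0, 3.0]     # made-up values
forecast = [1.0, 2.0, 5.0]   # made-up values
scores = {
    "rmse": rmse(actual, forecast),
    "mae": mae(actual, forecast),
    "mape": mape(actual, forecast),
    "r2": r2(actual, forecast),
}
```

Lower is better for rmse, mae, and mape, while higher is better for r2, which matters when ranking models by any of them.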

in-sample metrics

  • 'InSampleRMSE','InSampleMAPE','InSampleMAE','InSampleR2'
  • These can be used to detect overfitting
  • Should not be used for determining best models/weights when combination modeling as these also include the test set within them
  • Still available for combination modeling in case you want to use them, but it should be understood that the accuracy metrics will be unreliable
  • stored in the history attribute

out-of-sample metrics

  • 'TestSetRMSE','TestSetMAPE','TestSetMAE','TestSetR2','LevelTestSetRMSE','LevelTestSetMAPE','LevelTestSetMAE','LevelTestSetR2'
  • These are good for ordering models from best to worst according to how well they predicted out-of-sample values
  • Should not be used for determining best models/weights when combination modeling, as it will lead to data leakage and overfitting
  • Compare to in-sample metrics for a good sense of how well-fit the model is
  • stored in the history attribute

validation metrics

  • 'ValidationMetricValue' is stored in the history attribute
    • based on 'ValidationMetric', also stored in history
      • one of {'r2','rmse','mape','mae'}
  • This will only be populated if you first tune the model with a grid and use the tune() method
  • This metric can be used for combination modeling without data leakage/overfitting, as it is derived from out-of-sample data that is not part of the test set
    • If you change the validation metric during the tuning process, this will no longer be reliable for combination modeling

tuning models

  • all models except combination models can be tuned with a straightforward process
  • set_validation_length() will set n periods aside before the test set chronologically to validate different model hyperparameters
  • grids are fed to the object as dictionaries with a keyword as the key and an array-like object as the value
  • for each model, there are different hyperparameters that can be tuned this way -- see the specific model's source documentation
  • all combinations will be tried, as with any other grid search; however, combinations that cannot be estimated together are passed over so that loops do not break (this comes up frequently with hwes estimators)
  • most estimators also accept an Xvars and a normalizer argument, and these can be added to the grid
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

elasticnet_grid = {
  'alpha':[i/10 for i in range(1,101)],
  'l1_ratio':[0,0.25,0.5,0.75,1],
  'normalizer':['scale','minmax',None]
}

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.add_poly_terms('t') # t^2
f.diff() # non-stationary data forecasts better differenced with this model
f.set_validation_length(6)
f.set_estimator('elasticnet')
f.ingest_grid(elasticnet_grid)
f.tune()
f.auto_forecast()
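
The chronological layout implied by set_test_length() and set_validation_length() (training data first, then the validation periods, then the test periods) can be sketched as follows. The function name is hypothetical and the code only illustrates the ordering described above, not scalecast's internals.

```python
def chronological_split(y, test_length, validation_length):
    """Split a series into train, validation, and test segments, oldest first."""
    if test_length + validation_length >= len(y):
        raise ValueError("not enough observations to split")
    test_start = len(y) - test_length
    val_start = test_start - validation_length
    return y[:val_start], y[val_start:test_start], y[test_start:]


y = list(range(1, 25))  # 24 made-up observations, oldest first
train, validation, test = chronological_split(y, test_length=12, validation_length=6)
```

Hyperparameters are compared on the validation segment only, so the test segment stays untouched until the tuned model is evaluated.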

Grids.py

  • instead of placing grids inside your code, you can create (or copy/paste) a file called Grids.py to store your grids and they will be read in automatically
  • you can pass the name of the grid as str to ingest_grid() but you can also skip straight to tune() and it will automatically search for a grid in Grids.py with the same name as the estimator
# Grids.py
elasticnet = {
  'alpha':[i/10 for i in range(1,101)],
  'l1_ratio':[0.5],
  'normalizer':['scale','minmax',None]  
}

lasso = {
  'alpha':[i/10 for i in range(1,101)],
  'l1_ratio':[1],
  'normalizer':['scale','minmax',None]
}

ridge = {
  'alpha':[i/10 for i in range(1,101)],
  'l1_ratio':[0],
  'normalizer':['scale','minmax',None]
}

# main.py
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.add_poly_terms('t') # t^2
f.diff() # non-stationary data forecasts better differenced with this model
f.set_validation_length(6)
f.set_estimator('elasticnet')

# GRIDS
f.tune() # automatically ingests the elasticnet grid since it is the same as the estimator
f.auto_forecast()

f.ingest_grid('lasso') # ingests the lasso grid in Grids.py
f.tune()
f.auto_forecast()

f.ingest_grid('ridge') # ingests the ridge grid in Grids.py
f.tune()
f.auto_forecast()
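
One way to picture the name-based lookup described above is a getattr call on the imported Grids module. The sketch below is an assumption about the general idea, not the package's code: find_grid is a hypothetical helper, and a stand-in module object replaces a real Grids.py file so the example is self-contained.

```python
import types


def find_grid(grids_module, grid_name):
    """Return the dict named `grid_name` from a module of grids."""
    grid = getattr(grids_module, grid_name, None)
    if not isinstance(grid, dict):
        raise ValueError(f"no grid named {grid_name!r} found")
    return grid


# Stand-in for `import Grids`; a real project would keep these in Grids.py.
Grids = types.ModuleType("Grids")
Grids.elasticnet = {"alpha": [0.1, 0.2], "l1_ratio": [0.5]}
Grids.lasso = {"alpha": [0.1, 0.2], "l1_ratio": [1]}

estimator = "elasticnet"
default_grid = find_grid(Grids, estimator)  # same name as the estimator
named_grid = find_grid(Grids, "lasso")      # explicitly named grid
```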

limit_grid_size()

  • Forecaster.limit_grid_size(n)
  • use to limit big grids to a smaller number of randomly kept rows
    • n: int or float
      • if int, that many random rows will be kept
      • if float, must satisfy 0 < n < 1 and that percentage of random rows will be kept
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

mlp = {
  'activation':['relu','tanh'],
  'hidden_layer_sizes':[(25,),(25,50),(25,50,25),(100,100,100),(100,100,100,100)],
  'solver':['lbfgs','adam'],
  'normalizer':['scale','minmax',None],
  'random_state':[20]
  }

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.add_poly_terms('t') # t^2
f.diff() # non-stationary data forecasts better differenced with this model
f.set_validation_length(6)
f.set_estimator('mlp')
f.ingest_grid(mlp)
f.limit_grid_size(10) # random 10 rows
f.ingest_grid(mlp)
f.limit_grid_size(.1) # random 10 percent
f.tune()
f.auto_forecast()
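
To make the idea of grid rows concrete, the sketch below expands a grid dictionary into every hyperparameter combination and then keeps a random subset, either a fixed count or a fraction. The helper names are hypothetical, and the rounding rule for a float n (truncation, with at least one row kept) is an assumption rather than documented behavior.

```python
import itertools
import random


def expand_grid(grid):
    """Turn {'param': [values, ...]} into a list of one dict per combination."""
    keys = list(grid)
    return [dict(zip(keys, combo)) for combo in itertools.product(*grid.values())]


def limit_grid(rows, n, seed=None):
    """Keep n random rows (int) or a random share of rows (float, 0 < n < 1)."""
    if isinstance(n, float):
        if not 0 < n < 1:
            raise ValueError("a float n must satisfy 0 < n < 1")
        keep = max(1, int(len(rows) * n))  # assumed rounding rule
    else:
        keep = min(n, len(rows))
    return random.Random(seed).sample(rows, keep)


mlp_grid = {
    "activation": ["relu", "tanh"],
    "hidden_layer_sizes": [(25,), (25, 50), (25, 50, 25), (100, 100, 100), (100, 100, 100, 100)],
    "solver": ["lbfgs", "adam"],
    "normalizer": ["scale", "minmax", None],
    "random_state": [20],
}

rows = expand_grid(mlp_grid)                  # 2 * 5 * 2 * 3 * 1 combinations
ten_rows = limit_grid(rows, 10, seed=20)      # 10 random rows
ten_percent = limit_grid(rows, 0.1, seed=20)  # 10 percent of the rows
```

Because each row is fit and validated separately, cutting 60 rows to 10 cuts tuning time roughly in proportion, at the cost of possibly missing the best combination.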

combo modeling

  • the only safe way to automatically select and weight top models for the "combo" estimator is to use it in conjunction with auto forecasting
  • use determine_best_by = 'ValidationMetricValue' (its default)
# Grids.py
elasticnet = {
  'alpha':[i/10 for i in range(1,101)],
  'l1_ratio':[0.5],
  'normalizer':['scale','minmax',None]  
}

lasso = {
  'alpha':[i/10 for i in range(1,101)],
  'l1_ratio':[1],
  'normalizer':['scale','minmax',None]
}

ridge = {
  'alpha':[i/10 for i in range(1,101)],
  'l1_ratio':[0],
  'normalizer':['scale','minmax',None]
}

# main.py
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.add_poly_terms('t') # t^2
f.diff() # non-stationary data forecasts better differenced with this model
f.set_validation_length(6)
f.set_estimator('elasticnet')

# GRIDS
f.tune() # automatically ingests the elasticnet grid since it is the same as the estimator
f.auto_forecast()

f.ingest_grid('lasso') # ingests the lasso grid in Grids.py
f.tune()
f.auto_forecast()

f.ingest_grid('ridge') # ingests the ridge grid in Grids.py
f.tune()
f.auto_forecast()

# COMBO
f.set_estimator('combo')
f.manual_forecast(how='simple',models='top_2',call_me='simple_avg')
f.manual_forecast(how='weighted',models='top_2',call_me='weighted_avg')

Validation metric

  • you can change validation metrics to any of 'rmse','mape','mae','r2'
# Grids.py
elasticnet = {
  'alpha':[i/10 for i in range(1,101)],
  'l1_ratio':[0.5],
  'normalizer':['scale','minmax',None]  
}

lasso = {
  'alpha':[i/10 for i in range(1,101)],
  'l1_ratio':[1],
  'normalizer':['scale','minmax',None]
}

ridge = {
  'alpha':[i/10 for i in range(1,101)],
  'l1_ratio':[0],
  'normalizer':['scale','minmax',None]
}

# main.py
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default is from when disney world closed to when U.S. cdc no longer recommended masks but can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.add_poly_terms('t') # t^2
f.diff() # non-stationary data forecasts better differenced with this model
f.set_validation_length(6)
f.set_validation_metric('r2')
f.set_estimator('elasticnet')

# GRIDS
f.tune() # automatically ingests the elasticnet grid since it is the same as the estimator
f.auto_forecast()

f.ingest_grid('lasso') # ingests the lasso grid in Grids.py
f.tune()
f.auto_forecast()

f.ingest_grid('ridge') # ingests the ridge grid in Grids.py
f.tune()
f.auto_forecast()

# COMBO
f.set_estimator('combo')
f.manual_forecast(how='simple',models='top_2',call_me='simple_avg')
f.manual_forecast(how='weighted',models='top_2',call_me='weighted_avg')

Xvars

  • all estimators except hwes and combo accept an Xvars argument
  • accepted arguments are an array-like of named regressors, a str of a single regressor name, 'all', or None
    • for estimators that require Xvars (sklearn models), None and 'all' will both use all Xvars
  • all Xvars must be numeric type

seasonal regressors
ar terms
time trend
combination regressors
poly terms
covid19
ingesting a dataframe of x variables
holidays/other

seasonal regressors

  • Forecaster.add_seasonal_regressors(*args,raw=True,sincos=False,dummy=False,drop_first=False)
    • args: includes all pandas.Series.dt attributes ('month','day','dayofyear','week',etc.) that return pandas.Series.astype(int)
      • I'm not sure there exists anywhere a complete list of possible attributes, but a good place to start is here
      • only use attributes that return a series of int type
    • raw: bool, default True
      • by default, the output of calling this method results in Xvars added to current_xreg and future_xreg that are int (ordinal) type
      • setting raw to False will bypass that
      • at least one of raw, sincos, dummy must be true
    • sincos: bool, default False
      • this creates two wave transformations out of the pandas output and stores them in future_xreg, current_xreg
        • formula: sin(pi*raw_output/(max(raw_output)/2)), cos(pi*raw_output/(max(raw_output)/2))
      • it uses the max from the pandas output to automatically set the cycle length, so if you think this might cause problems in the analysis, using dummy=True is a safer bet to achieve a similar result, but it adds more total variables
    • dummy: bool, default False
      • changes the raw int output into dummy 0/1 variables and stores them in future_xreg, current_xreg
    • drop_first: bool, default False
      • whether to drop one class for dummy variables
      • ignored when dummy=False
>>> import pandas as pd
>>> import pandas_datareader as pdr
>>> from scalecast.Forecaster import Forecaster

>>> df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
>>> f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
>>> f.generate_future_dates(24) # forecast length
>>> f.add_seasonal_regressors('month',dummy=True,sincos=True)
>>> f.add_seasonal_regressors('year')
>>> print(f)
"Forecaster object with no models evaluated. Data starts at 1959-01-01 00:00:00, ends at 2021-05-01 00:00:00, loaded to forecast out 24 periods, has 16 regressors."
>>> print(f.get_regressor_names())
['month', 'monthsin', 'monthcos', 'month_1', 'month_10', 'month_11', 'month_12', 'month_2', 'month_3', 'month_4', 'month_5', 'month_6', 'month_7', 'month_8', 'month_9', 'year']
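
The sincos formula given above, sin(pi*raw_output/(max(raw_output)/2)) and the matching cosine, can be checked with a few lines of plain Python. The function name is hypothetical; only the formula comes from this document.

```python
import math


def sincos_terms(raw):
    """Apply the documented wave transformation to raw integer seasonal values."""
    cycle = max(raw)  # the cycle length is taken from the data itself
    sin_vals = [math.sin(math.pi * v / (cycle / 2)) for v in raw]
    cos_vals = [math.cos(math.pi * v / (cycle / 2)) for v in raw]
    return sin_vals, cos_vals


months = list(range(1, 13))
monthsin, monthcos = sincos_terms(months)
```

Because the cycle length is max(raw), a series that happens to contain only months 1 through 6 would be treated as having a six-month cycle, which is the caveat noted in the sincos description.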

ar terms

  • Forecaster.add_ar_terms(n)

    • n: int
      • the number of ar terms to add; lags 1 through n will be added (AR1, ..., ARn)
  • Forecaster.add_AR_terms(N)

    • N: tuple([int,int])
      • tuple shape: (P,m)
        • P is the number of terms to add
        • m is the seasonal length (12 for monthly data, for example)
>>> import pandas as pd
>>> import pandas_datareader as pdr
>>> from scalecast.Forecaster import Forecaster

>>> df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
>>> f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
>>> f.generate_future_dates(24) # forecast length
>>> f.add_ar_terms(4)
>>> f.add_AR_terms((2,12)) # seasonal AR terms
>>> print(f)
"Forecaster object with no models evaluated. Data starts at 1959-01-01 00:00:00, ends at 2021-05-01 00:00:00, loaded to forecast out 24 periods, has 6 regressors."
>>> print(f.get_regressor_names())
['AR1', 'AR2', 'AR3', 'AR4', 'AR12', 'AR24']
  • the beautiful part of adding auto-regressive terms in this framework is that all metrics and forecasts use an iterative process that plugs in forecasted values to future terms, making all test set and validation predictions and forecasts dynamic
  • however, doing it this way means lots of loops in the evaluation process, which means some models run very slowly
  • add ar/AR terms before differencing (don't worry, they will be differenced as well)
  • don't begin any other regressor names you add with "AR" as it will confuse the forecasts
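
The iterative process described above, where each forecasted value becomes a lag for the next step, can be illustrated with a toy autoregressive model. The coefficients here are made up and fixed; in practice they would come from whichever estimator was fit, so treat this as a sketch of the loop only.

```python
def dynamic_forecast(history, coefs, horizon):
    """Forecast `horizon` steps ahead, feeding each prediction back in as a lag.

    coefs[k - 1] multiplies lag k (AR1, AR2, ...).
    """
    values = list(history)
    for _ in range(horizon):
        # Lag k is the value k steps back, which may itself be a forecast.
        lags = [values[-k] for k in range(1, len(coefs) + 1)]
        values.append(sum(c * lag for c, lag in zip(coefs, lags)))
    return values[len(history):]


history = [1.0, 2.0]  # made-up observations, oldest first
preds = dynamic_forecast(history, coefs=[0.5, 0.5], horizon=3)
```

The second prediction already depends on the first, which is why this loop cannot be vectorized into a single predict call and why models with ar terms take longer to evaluate.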

time trend

  • Forecaster.add_time_trend(called='t')
    • called: str, default 't'
      • what to call the resulting time trend
>>> import pandas as pd
>>> import pandas_datareader as pdr
>>> from scalecast.Forecaster import Forecaster

>>> df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
>>> f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
>>> f.generate_future_dates(24) # forecast length
>>> f.add_time_trend()
>>> print(f)
"Forecaster object with no models evaluated. Data starts at 1959-01-01 00:00:00, ends at 2021-05-01 00:00:00, loaded to forecast out 24 periods, has 1 regressors."
>>> print(f.get_regressor_names())
['t']

combination regressors

  • Forecaster.add_combo_regressors(*args,sep='_')
    • args: names of Xvars that already exist in the object
      • all vars in args will be multiplied together
    • sep: str, default "_"
      • the separator between each term in arg to create the final variable name
>>> import pandas as pd
>>> import pandas_datareader as pdr
>>> from scalecast.Forecaster import Forecaster

>>> df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
>>> f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
>>> f.generate_future_dates(24) # forecast length
>>> f.add_time_trend()
>>> f.add_covid19_regressor()
>>> f.add_combo_regressors('t','COVID19')
>>> print(f)
"Forecaster object with no models evaluated. Data starts at 1959-01-01 00:00:00, ends at 2021-05-01 00:00:00, loaded to forecast out 24 periods, has 3 regressors."
>>> print(f.get_regressor_names())
['t','COVID19','t_COVID19']
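
As a plain-Python illustration of what a combination regressor is (the element-wise product of existing regressors, named by joining their names with sep), consider the sketch below. The helper and the values are hypothetical.

```python
import math


def combo_regressor(xvars, *names, sep="_"):
    """Multiply the named regressors together, row by row."""
    new_name = sep.join(names)
    new_values = [math.prod(row) for row in zip(*(xvars[n] for n in names))]
    return new_name, new_values


xvars = {"t": [1, 2, 3, 4], "COVID19": [0, 0, 1, 1]}  # made-up values
name, values = combo_regressor(xvars, "t", "COVID19")
```

Here the product equals the time trend while the COVID19 dummy is 1 and zero otherwise, which lets a linear model fit a different trend during that period.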

poly terms

  • Forecaster.add_poly_terms(*args,pwr=2,sep='^')
    • args: names of Xvars that already exist in the object
    • pwr: int, default 2
      • the max power to add to each term in args (2 to this number will be added)
    • sep: str, default "^"
      • the separator between each term in arg and pwr to create the final variable name
>>> import pandas as pd
>>> import pandas_datareader as pdr
>>> from scalecast.Forecaster import Forecaster

>>> df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
>>> f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
>>> f.generate_future_dates(24) # forecast length
>>> f.add_time_trend()
>>> f.add_poly_terms('t',pwr=3)
>>> print(f)
"Forecaster object with no models evaluated. Data starts at 1959-01-01 00:00:00, ends at 2021-05-01 00:00:00, loaded to forecast out 24 periods, has 3 regressors."
>>> print(f.get_regressor_names())
['t','t^2','t^3']
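As a rough illustration of how the new names and values relate to `pwr` and `sep`, the sketch below builds the polynomial terms by hand. `poly_terms` is a hypothetical helper, not the library's implementation.

```python
def poly_terms(xvars, *names, pwr=2, sep="^"):
    """Return new regressors for powers 2..pwr of each named variable."""
    new_terms = {}
    for name in names:
        for p in range(2, pwr + 1):
            new_terms[f"{name}{sep}{p}"] = [v ** p for v in xvars[name]]
    return new_terms

terms = poly_terms({"t": [1, 2, 3]}, "t", pwr=3)
print(terms)  # {'t^2': [1, 4, 9], 't^3': [1, 8, 27]}
```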

covid19

  • Forecaster.add_covid19_regressor(called='COVID19',start=datetime.datetime(2020,3,15),end=datetime.datetime(2021,5,13))
  • adds dummy variable that is 1 during the time period that covid19 effects are present for the series, 0 otherwise
    • called: str, default 'COVID19'
      • what to call the resulting dummy variable
    • start: str or datetime object, default datetime.datetime(2020,3,15)
      • the start date (default is day Walt Disney World closed in the U.S.)
      • use format yyyy-mm-dd when passing strings
    • end: str or datetime object, default datetime.datetime(2021,5,13)
      • the end date (default is day the U.S. CDC dropped mask mandate/recommendation for vaccinated people)
      • use format yyyy-mm-dd when passing strings
>>> import pandas as pd
>>> import pandas_datareader as pdr
>>> from scalecast.Forecaster import Forecaster

>>> df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
>>> f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
>>> f.generate_future_dates(24) # forecast length
>>> f.add_covid19_regressor()
>>> print(f)
"Forecaster object with no models evaluated. Data starts at 1959-01-01 00:00:00, ends at 2021-05-01 00:00:00, loaded to forecast out 24 periods, has 1 regressors."
>>> print(f.get_regressor_names())
['COVID19']

ingesting a dataframe of x variables

  • Forecaster.ingest_Xvars_df(df,date_col='Date',drop_first=False,use_future_dates=False)
  • reads all variables from a dataframe and stores them in current_xreg and future_xreg; the future dates in the dataframe can be used in place of generate_future_dates(); all non-numeric variables are converted to dummies
    • df: pandas dataframe
      • must span the same time period as current_dates
      • if use_future_dates = False, must span at least the same time period as future_dates
    • date_col: str
      • the name of the date column in df
    • drop_first: bool, default False
      • whether to drop a class in any columns that have to be dummied
    • use_future_dates: bool, default False
      • whether to set the forecast periods in the object with the future dates in df
>>> import pandas as pd
>>> import pandas_datareader as pdr
>>> from scalecast.Forecaster import Forecaster

>>> df = pdr.get_data_fred('HOUSTNSA',start='1970-01-01',end='2021-03-01')
>>> ur = pdr.get_data_fred('UNRATE',start='1970-01-01',end='2021-05-01').reset_index()
>>> f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
>>> f.ingest_Xvars_df(ur,date_col='DATE',use_future_dates=True)
>>> print(f)
"Forecaster object with no models evaluated. Data starts at 1970-01-01 00:00:00, ends at 2021-03-01 00:00:00, loaded to forecast out 2 periods, has 1 regressors."
>>> print(f.get_regressor_names())
['UNRATE']
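The split between observed and future regressor values can be pictured with a small plain-Python sketch. It assumes rows dated after the last observed date become the future values; `split_xvars`, the column name `x`, and the numbers are all made up for illustration.

```python
import datetime

def split_xvars(rows, date_col, last_current_date):
    """Split rows into those covering observed dates and those covering future dates."""
    current = [row for row in rows if row[date_col] <= last_current_date]
    future = [row for row in rows if row[date_col] > last_current_date]
    return current, future

# Made-up regressor values, monthly from January to May 2021.
rows = [
    {"DATE": datetime.date(2021, month, 1), "x": float(month)}
    for month in range(1, 6)
]

# The y series is assumed to end in March, so April and May are "future" rows.
current, future = split_xvars(rows, "DATE", datetime.date(2021, 3, 1))
print(len(current), len(future))  # 3 2
```

In the example above this section, the regressor dataframe extends two months past the y series, which is why the object reports a forecast horizon of 2 periods.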

other

  • Forecaster.add_other_regressor(called,start,end)
  • adds dummy variable that is 1 during the specified time period, 0 otherwise
    • called: str
      • what to call the resulting dummy variable
    • start: str or datetime object
    • end: str or datetime object
>>> import pandas as pd
>>> import pandas_datareader as pdr
>>> from scalecast.Forecaster import Forecaster

>>> df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
>>> f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
>>> f.generate_future_dates(24) # forecast length
>>> f.add_other_regressor(called='Sept2001',start='2001-09-01',end='2001-09-01')
>>> print(f)
"Forecaster object with no models evaluated. Data starts at 1959-01-01 00:00:00, ends at 2021-05-01 00:00:00, loaded to forecast out 24 periods, has 1 regressors."
>>> print(f.get_regressor_names())
['Sept2001']
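A date-window dummy like the one above (and like the COVID19 regressor) can be sketched in a few lines. This version assumes both endpoints are inclusive, which matches passing the same date as start and end; `window_dummy` is a hypothetical helper.

```python
import datetime

def window_dummy(dates, start, end):
    """Return 1 for dates inside [start, end] (inclusive, an assumption), else 0."""
    start = datetime.date.fromisoformat(start)
    end = datetime.date.fromisoformat(end)
    return [1 if start <= d <= end else 0 for d in dates]

# Monthly dates from July to November 2001.
dates = [datetime.date(2001, m, 1) for m in (7, 8, 9, 10, 11)]
sept2001 = window_dummy(dates, "2001-09-01", "2001-09-01")
print(sept2001)  # [0, 0, 1, 0, 0]
```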

normalizer

  • ('minmax','scale',None)
  • all Sklearn and XGBOOST models have a normalizer argument that can be optimized in tune()
  • 'minmax': MinMaxScaler from Sklearn
  • 'scale': Normalizer from Sklearn
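For intuition about the 'minmax' option: MinMaxScaler with its default range rescales each feature to [0, 1] using the minimum and maximum seen when it was fit. The plain-Python sketch below mimics that for a single feature. Fitting on the observed values and reusing the same transform on future values is an assumption about usage, and `fit_minmax` is a hypothetical helper.

```python
def fit_minmax(train):
    """Learn min/max from training values and return a transform function."""
    lo, hi = min(train), max(train)
    span = (hi - lo) or 1.0  # avoid dividing by zero for a constant feature

    def transform(values):
        return [(v - lo) / span for v in values]

    return transform

scale = fit_minmax([10.0, 20.0, 30.0])
print(scale([10.0, 20.0, 30.0]))  # [0.0, 0.5, 1.0]
print(scale([40.0]))              # [1.5]
```

Values outside the fitted range, such as a future regressor value above the training maximum, fall outside [0, 1].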

call_me

  • in manual_forecast() and auto_forecast() you can use the call_me parameter to name the key in the history dict
  • by default, this will be the same as whatever the estimator is called

history

structure:

dict(call_me = 
  dict(
    'Estimator' = str: name of estimator in `_estimators_`, always set
    'Xvars' = list: name of utilized Xvars, None when no Xvars used, always set
    'HyperParams' = dict: name/value of hyperparams, empty when all defaults, always set 
    'Scaler' = str: name of normalizer used ('minmax','scale',None), always set
    'Forecast' = list: the forecast at whatever level it was run, always set
    'FittedVals' = list: the fitted values at whatever level the forecast was run, always set
    'Tuned' = bool: whether the model was auto-tuned, always set
    'Integration' = int: the integration of the model run, 0 when no series diff taken, never greater than 2, always set
    'TestSetLength' = int: the number of periods set aside to test the model, always set
    'TestSetRMSE' = float: the RMSE of the model on the test set at whatever level the forecast was run, always set
    'TestSetMAPE' = float: the MAPE of the model on the test set at whatever level the forecast was run, `None` when there is a 0 in the actuals, always set
    'TestSetMAE' = float: the MAE of the model on the test set at whatever level the forecast was run, always set
    'TestSetR2' = float: the R2 of the model on the test set at whatever level the forecast was run, never greater than 1, can be less than 0, always set
    'TestSetPredictions' = list: the predictions on the test set, always set
    'TestSetActuals' = list: the test-set actuals, always set 
    'InSampleRMSE' = float: the RMSE of the model on the entire y series using the fitted values to compare, always set
    'InSampleMAPE' = float: the MAPE of the model on the entire y series using the fitted values to compare, `None` when there is a 0 in the actuals, always set
    'InSampleMAE' = float: the MAE of the model on the entire y series using the fitted values to compare, always set
    'InSampleR2' = float: the R2 of the model on the entire y series using the fitted values to compare, always set
    'ValidationSetLength' = int: the number of periods before the test set periods to validate the model, only set when the model has been tuned
    'ValidationMetric' = str: the name of the metric used to validate the model, only set when the model has been tuned
    'ValidationMetricValue' = float: the value of the metric used to validate the model, only set when the model has been tuned
    'univariate' = bool: True if the model uses univariate features only (hwes, prophet, arima are the only estimators where this could be True), otherwise not set
    'first_obs' = list: the first y values from the undifferenced data, only set when `diff()` has been called
    'first_dates' = list: the first date values from the undifferenced data, only set when `diff()` has been called
    'grid_evaluated' = pandas dataframe: the evaluated grid, only set when the model has been tuned
    'models' = list: the models used in the combination, only set when the model is a 'combo' estimator
    'LevelForecast' = list: the forecast in level (undifferenced terms), when data has not been differenced this is the same as the 'Forecast' key, always set
    'LevelY' = list: the y value in level (undifferenced terms), when data has not been differenced this is the same as the y attribute, always set
    'LevelFittedVals' = list: the fitted values in level (undifferenced terms), when data has not been differenced this is the same as the 'FittedVals' key, always set
    'LevelTestSetPreds' = list: the test-set predictions in level (undifferenced terms), when data has not been differenced this is the same as the 'TestSetPredictions' key, always set
    'LevelTestSetRMSE' = float: the RMSE of the level test-set predictions vs. the level actuals, when data has not been differenced this is the same as the 'TestSetRMSE' key, always set
    'LevelTestSetMAPE' = float: the MAPE of the level test-set predictions vs. the level actuals, when data has not been differenced this is the same as the 'TestSetMAPE' key, None when there is a 0 in the level test-set actuals, always set
    'LevelTestSetMAE' = float: the MAE of the level test-set predictions vs. the level actuals, when data has not been differenced this is the same as the 'TestSetMAE' key, always set
    'LevelTestSetR2' = float: the R2 of the level test-set predictions vs. the level actuals, when data has not been differenced this is the same as the 'TestSetR2' key, always set
    'LevelInSampleRMSE' = float: the RMSE of the level fitted values vs. the level actuals, when data has not been differenced this is the same as the 'InSampleRMSE' key, always set
    'LevelInSampleMAPE' = float: the MAPE of the level fitted values vs. the level actuals, when data has not been differenced this is the same as the 'InSampleMAPE' key, None if there is a 0 in the level actuals, always set
    'LevelInSampleMAE' = float: the MAE of the level fitted values vs. the level actuals, when data has not been differenced this is the same as the 'InSampleMAE' key, always set
    'LevelInSampleR2' = float: the R2 of the level fitted values vs. the level actuals, when data has not been differenced this is the same as the 'InSampleR2' key, always set
    'feature_importance' = pandas dataframe: eli5 feature importance information (based on change in accuracy when a certain feature is filled with random data), only set when save_feature_importance() is called
    'summary_stats' = pandas dataframe: statsmodels summary stats information, only set when save_summary_stats() is called
  )
)
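Because some keys are always set and others only appear in certain situations, code that reads the history should tolerate missing keys. The sketch below uses a hand-built dictionary with the same shape as the structure above; the nicknames and numbers are made up, and `collect` is a hypothetical helper.

```python
# A hand-built stand-in for the history dict (values are invented).
history = {
    "mlr": {"Estimator": "mlr", "Tuned": True, "TestSetRMSE": 18.2,
            "ValidationMetricValue": 17.5},
    "arima": {"Estimator": "arima", "Tuned": False, "TestSetRMSE": 21.4,
              "univariate": True},
    "avg": {"Estimator": "combo", "Tuned": False, "TestSetRMSE": 16.9,
            "models": ["mlr", "arima"]},
}

def collect(history, key):
    """Map model nickname -> value, skipping models where the key is not set."""
    return {name: info[key] for name, info in history.items() if key in info}

print(collect(history, "TestSetRMSE"))            # set for every model
print(collect(history, "ValidationMetricValue"))  # {'mlr': 17.5}
print(collect(history, "models"))                 # {'avg': ['mlr', 'arima']}
```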

weighted average modeling

  • weighted averaging can easily mean data leakage and overfitting
  • when weights are set automatically, how much weight each passed model receives should be determined by 'ValidationMetricValue' only; you can also pass manually set weights
  • if your manual weights do not add to one, they will be rebalanced to do so
  • if they do add to one, no rebalancing will be performed
>>> import pandas as pd
>>> import pandas_datareader as pdr
>>> from scalecast.Forecaster import Forecaster

>>> df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
>>> f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
>>> f.set_test_length(12)
>>> f.generate_future_dates(24) # forecast length
>>> f.set_estimator('arima')
>>> f.manual_forecast(order=(1,1,1),seasonal_order=(2,1,0,12),trend='ct')
>>> f.set_estimator('hwes')
>>> f.manual_forecast(trend='mul',seasonal='add')
>>> f.set_estimator('combo')
>>> f.manual_forecast(how='weighted',models=['arima','hwes'],weights=(.75,.25))
>>> print(f.export('model_summaries'))
  ModelNickname Estimator Xvars  ... LevelTestSetMAE LevelTestSetR2  best_model
0         combo     combo  None  ...       29.327446      -7.696709        True
1          hwes      hwes  None  ...       38.758404     -10.243523       False
2         arima     arima  None  ...       48.316691     -21.313984       False
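The arithmetic behind manually weighted averaging, including the rescaling of weights that do not add to one, can be sketched as follows. The forecasts are made-up numbers and `weighted_average` is a hypothetical helper, not the library's code.

```python
def weighted_average(forecasts, weights):
    """Combine forecasts point by point after rescaling weights to sum to 1."""
    total = sum(weights)
    weights = [w / total for w in weights]
    horizon = len(forecasts[0])
    return [
        sum(w * fcst[i] for w, fcst in zip(weights, forecasts))
        for i in range(horizon)
    ]

# Invented two-step forecasts from two models.
arima_fcst = [100.0, 110.0]
hwes_fcst = [120.0, 90.0]

# Weights of (3, 1) do not add to one, so they are rescaled to (.75, .25).
combo = weighted_average([arima_fcst, hwes_fcst], weights=(3, 1))
print(combo)  # [105.0, 105.0]
```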

weight rebalancing

  • by default, weighted average will use minmax or maxmin scaling, depending on which metric is passed to it, to determine the weights
  • this means the lowest-performing model is always weighted at 0
  • to make sure all models are given some weight, automatic rebalancing is performed so that .1 is added to all standardized values, then all values are rebalanced to add to 1
  • this can be adjusted in the rebalance_weights parameter
    • the higher you set this value, the closer the weighted average becomes to a simple average, 0 means the worst model gets no weight
import pandas as pd
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import seaborn as sns

from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index) # to initialize, specify y and current_dates (must be arrays of the same length)

f.set_test_length(12) # specify a test length for your models--it's a good idea to keep this the same for all forecasts
f.generate_future_dates(25) # this will create future dates that are on the same interval as the current dates and it will also set the forecast length
f.add_ar_terms(4) # add AR terms before differencing
f.add_AR_terms((2,12)) # seasonal AR terms
f.adf_test() # will print out whether it thinks the series is stationary and return a bool representing stationarity based on the augmented dickey fuller test
f.diff() # differences the y term and all ar terms to make a series stationary (also supports 2-level integration)
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True) # uses pandas attributes: raw=True creates integers (default), sincos=True creates wave functions (not default), dummy=True creates dummy vars (not default)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # dates are flexible, default is from when disney world closed to when US CDC lifted mask recommendations
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies regressors together
f.add_poly_terms('t',pwr=3) # by default, creates an order 2 regressor, n-order polynomial terms are allowed
f.set_validation_length(6) # length, different than test_length, to tune the hyperparameters 

# automatically tune and forecast with a series of models
models = ('mlr','knn','svr','xgboost','gbt','elasticnet','mlp','prophet')
for m in models:
  f.set_estimator(m)
  #f.ingest_grid('mlr') # manually pull any grid name that is specified in Grids.py
  f.tune() # by default, will pull the grid with the same name as the estimator (mlr will pull the mlr grid, etc.)
  f.auto_forecast()

f.set_estimator('combo')
f.manual_forecast(how='weighted',rebalance_weights=0)
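The rebalancing steps described in this section can be sketched as: scale the metric so the best model scores 1 and the worst scores 0, add the rebalancing constant to every score, then rescale so the weights add to 1. The function below is a hypothetical illustration under those assumptions, with made-up metric values.

```python
def auto_weights(metric_values, lower_is_better=True, rebalance_weights=0.1):
    """Turn per-model metric values into weights that add to 1."""
    lo, hi = min(metric_values), max(metric_values)
    span = (hi - lo) or 1.0  # guard against identical metric values
    if lower_is_better:      # error metrics: lowest value scores 1
        scaled = [(hi - v) / span for v in metric_values]
    else:                    # R2-like metrics: highest value scores 1
        scaled = [(v - lo) / span for v in metric_values]
    shifted = [s + rebalance_weights for s in scaled]
    total = sum(shifted)
    if total == 0:           # nothing to distinguish the models: equal weights
        return [1 / len(shifted)] * len(shifted)
    return [s / total for s in shifted]

rmses = [10.0, 15.0, 20.0]  # invented validation RMSEs for three models
print(auto_weights(rmses))                         # worst model keeps a small weight
print(auto_weights(rmses, rebalance_weights=0))    # worst model gets weight 0
print(auto_weights(rmses, rebalance_weights=100))  # close to a simple average
```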

export

  • Forecaster.export(dfs=['all_fcsts','model_summaries','best_fcst','test_set_predictions','lvl_fcsts'], models='all', best_model='auto', determine_best_by='TestSetRMSE', to_excel=False, out_path='./', excel_name='results.xlsx')
  • exports any or all of five pandas dataframes; can write to excel with each dataframe on a separate sheet; returns a dictionary with dataframes as values, or a single dataframe if only one df is specified
    • dfs: list-like or str, default ['all_fcsts','model_summaries','best_fcst','test_set_predictions','lvl_fcsts']
      • a list or name of the specific dataframe(s) you want returned and/or written to excel
      • each name must be one of the defaults
    • models: list-like or str, default 'all'
      • the models to write information for
      • can start with "top_" and the metric specified in determine_best_by will be used to order the models appropriately
    • best_model: str, default 'auto'
      • the name of the best model, if "auto", will determine this by the metric in determine_best_by
      • if not "auto", must match a model nickname of an already-evaluated model
    • determine_best_by: one of _determine_best_by_, default 'TestSetRMSE'
    • to_excel: bool, default False
      • whether to save to excel
    • out_path: str, default './'
      • the path to save the excel file to (ignored when to_excel=False)
    • excel_name: str, default 'results.xlsx'
      • the name to call the excel file (ignored when to_excel=False)
import pandas as pd
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import seaborn as sns

from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index) # to initialize, specify y and current_dates (must be arrays of the same length)

f.set_test_length(12) # specify a test length for your models--it's a good idea to keep this the same for all forecasts
f.generate_future_dates(25) # this will create future dates that are on the same interval as the current dates and it will also set the forecast length
f.add_ar_terms(4) # add AR terms before differencing
f.add_AR_terms((2,12)) # seasonal AR terms
f.adf_test() # will print out whether it thinks the series is stationary and return a bool representing stationarity based on the augmented dickey fuller test
f.diff() # differences the y term and all ar terms to make a series stationary (also supports 2-level integration)
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True) # uses pandas attributes: raw=True creates integers (default), sincos=True creates wave functions (not default), dummy=True creates dummy vars (not default)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # dates are flexible, default is from when disney world closed to when US CDC lifted mask recommendations
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies regressors together
f.add_poly_terms('t',pwr=3) # by default, creates an order 2 regressor, n-order polynomial terms are allowed
f.set_validation_length(6) # length, different than test_length, to tune the hyperparameters 

# automatically tune and forecast with a series of models
models = ('mlr','knn','svr','xgboost','gbt','elasticnet','mlp','prophet')
for m in models:
  f.set_estimator(m)
  #f.ingest_grid('mlr') # manually pull any grid name that is specified in Grids.py
  f.tune() # by default, will pull the grid with the same name as the estimator (mlr will pull the mlr grid, etc.)
  f.auto_forecast()

f.export(to_excel=True,excel_name='all_results.xlsx') # will write all five dataframes as separate sheets to excel in the local directory as "all_results.xlsx"
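How a value such as "top_3" in the models argument might be resolved against determine_best_by can be sketched as below. The assumption that R2 metrics rank higher-is-better and all others lower-is-better is mine, the metric values are invented, and `resolve_models` is a hypothetical helper.

```python
def resolve_models(models, metrics, determine_best_by):
    """Return model nicknames for 'all', 'top_N', a single name, or a list."""
    higher_is_better = determine_best_by.endswith("R2")  # assumption
    ordered = sorted(
        metrics,
        key=lambda name: metrics[name][determine_best_by],
        reverse=higher_is_better,
    )
    if models == "all":
        return ordered
    if isinstance(models, str) and models.startswith("top_"):
        return ordered[: int(models.split("_")[1])]
    if isinstance(models, str):
        return [models]
    return list(models)

# Invented test-set metrics for three evaluated models.
metrics = {
    "mlr": {"TestSetRMSE": 18.2, "TestSetR2": 0.41},
    "knn": {"TestSetRMSE": 15.3, "TestSetR2": 0.58},
    "svr": {"TestSetRMSE": 16.7, "TestSetR2": 0.52},
}
print(resolve_models("top_2", metrics, "TestSetRMSE"))  # ['knn', 'svr']
print(resolve_models("all", metrics, "TestSetR2"))      # ['knn', 'svr', 'mlr']
```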

plotting

plot
plot_test_set
plot_fitted
plot_acf
plot_pacf
plot_periodogram
seasonal_decompose

plot

  • Forecaster.plot(models='all',order_by=None,level=False,print_attr=[])
    • models: list-like or str, default 'all'
      • the models to plot
      • can start with "top_" and the metric specified in order_by will be used to order the models appropriately
    • order_by: one of _determine_best_by_, default None
    • level: bool, default False
      • if True, will always plot level forecasts
      • if False, will plot the forecasts at whatever level they were called on
      • if False and there are a mix of models passed with different integrations, will default to True
    • print_attr: list-like, default []
      • attributes from history dict to print to console
      • if the attribute doesn't exist for a passed model, will not raise error, will just skip that element
>>> import pandas as pd
>>> import pandas_datareader as pdr
>>> import matplotlib.pyplot as plt
>>> import seaborn as sns

>>> from scalecast.Forecaster import Forecaster

>>> df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01')
>>> f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index) # to initialize, specify y and current_dates (must be arrays of the same length)

>>> f.set_test_length(12) # specify a test length for your models--it's a good idea to keep this the same for all forecasts
>>> f.generate_future_dates(25) # this will create future dates that are on the same interval as the current dates and it will also set the forecast length
>>> f.add_ar_terms(4) # add AR terms before differencing
>>> f.add_AR_terms((2,12)) # seasonal AR terms
>>> f.adf_test() # will print out whether it thinks the series is stationary and return a bool representing stationarity based on the augmented dickey fuller test
>>> f.diff() # differences the y term and all ar terms to make a series stationary (also supports 2-level integration)
>>> f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True) # uses pandas attributes: raw=True creates integers (default), sincos=True creates wave functions (not default), dummy=True creates dummy vars (not default)
>>> f.add_seasonal_regressors('year')
>>> f.add_covid19_regressor() # dates are flexible, default is from when disney world closed to when US CDC lifted mask recommendations
>>> f.add_time_trend()
>>> f.add_combo_regressors('t','COVID19') # multiplies regressors together
>>> f.add_poly_terms('t',pwr=3) # by default, creates an order 2 regressor, n-order polynomial terms are allowed
>>> f.set_validation_length(6) # length, different than test_length, to tune the hyperparameters 

>>> # automatically tune and forecast with a series of models
>>> models = ('mlr','knn','svr','xgboost','gbt','elasticnet','mlp','prophet')
>>> for m in models:
...   f.set_estimator(m)
...   #f.ingest_grid('mlr') # manually pull any grid name that is specified in Grids.py
...   f.tune() # by default, will pull the grid with the same name as the estimator (mlr will pull the mlr grid, etc.)
...   f.auto_forecast()

>>> f.plot(models='top_5',order_by='LevelTestSetMAPE',print_attr=['TestSetRMSE','HyperParams','Xvars']) # plots the forecast differences or levels based on the level the forecast was performed on
knn TestSetRMSE: 15.270860125581308
knn HyperParams: {'n_neighbors': 19, 'weights': 'uniform'}
knn Xvars: ['AR1', 'AR2', 'AR3', 'AR4', 'AR12', 'AR24', 'monthsin', 'monthcos', 'dayofyearsin', 'dayofyearcos', 'weeksin', 'weekcos', 'year', 'COVID19', 't', 't_COVID19', 't^2', 't^3']
prophet TestSetRMSE: 15.136371374950649
prophet HyperParams: {'n_changepoints': 2}
prophet Xvars: None
svr TestSetRMSE: 16.67679416471487
svr HyperParams: {'kernel': 'poly', 'degree': 2, 'gamma': 'scale', 'C': 3.0, 'epsilon': 0.1}
svr Xvars: ['AR1', 'AR2', 'AR3', 'AR4', 'AR12', 'AR24', 'monthsin', 'monthcos', 'dayofyearsin', 'dayofyearcos', 'weeksin', 'weekcos', 'year', 'COVID19', 't', 't_COVID19', 't^2', 't^3']
mlp TestSetRMSE: 16.27657072564657
mlp HyperParams: {'activation': 'tanh', 'hidden_layer_sizes': (25, 25), 'solver': 'lbfgs', 'random_state': 20}
mlp Xvars: ['AR1', 'AR2', 'AR3', 'AR4', 'AR12', 'AR24', 'monthsin', 'monthcos', 'dayofyearsin', 'dayofyearcos', 'weeksin', 'weekcos', 'year', 'COVID19', 't', 't_COVID19', 't^2', 't^3']
elasticnet TestSetRMSE: 16.269472253462983
elasticnet HyperParams: {'alpha': 0.1, 'l1_ratio': 0.0}
elasticnet Xvars: ['AR1', 'AR2', 'AR3', 'AR4', 'AR12', 'AR24', 'monthsin', 'monthcos', 'dayofyearsin', 'dayofyearcos', 'weeksin', 'weekcos', 'year', 'COVID19', 't', 't_COVID19', 't^2', 't^3']

plot_test_set

  • Forecaster.plot_test_set(models='all',order_by=None,include_train=True,level=False)
    • models: list-like or str, default 'all'
      • the models to plot
      • can start with "top_" and the metric specified in order_by will be used to order the models appropriately
    • order_by: one of _determine_best_by_, default None
    • include_train: bool or int, default True
      • use to zoom into training results
      • if True, plots the test results with the entire history in y
      • if False, matches y history to test results and only plots this
      • if int, plots that length of y to match to test results
    • level: bool, default False
      • if True, will always plot level forecasts
      • if False, will plot the forecasts at whatever level they were called on
      • if False and there are a mix of models passed with different integrations, will default to True
import pandas as pd
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import seaborn as sns

from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index) # to initialize, specify y and current_dates (must be arrays of the same length)

f.set_test_length(12) # specify a test length for your models--it's a good idea to keep this the same for all forecasts
f.generate_future_dates(25) # this will create future dates that are on the same interval as the current dates and it will also set the forecast length
f.add_ar_terms(4) # add AR terms before differencing
f.add_AR_terms((2,12)) # seasonal AR terms
f.adf_test() # will print out whether it thinks the series is stationary and return a bool representing stationarity based on the augmented dickey fuller test
f.diff() # differences the y term and all ar terms to make a series stationary (also supports 2-level integration)
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True) # uses pandas attributes: raw=True creates integers (default), sincos=True creates wave functions (not default), dummy=True creates dummy vars (not default)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # dates are flexible, default is from when disney world closed to when US CDC lifted mask recommendations
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies regressors together
f.add_poly_terms('t',pwr=3) # by default, creates an order 2 regressor, n-order polynomial terms are allowed
f.set_validation_length(6) # length, different than test_length, to tune the hyperparameters 

# automatically tune and forecast with a series of models
models = ('mlr','knn','svr','xgboost','gbt','elasticnet','mlp','prophet')
for m in models:
  f.set_estimator(m)
  #f.ingest_grid('mlr') # manually pull any grid name that is specified in Grids.py
  f.tune() # by default, will pull the grid with the same name as the estimator (mlr will pull the mlr grid, etc.)
  f.auto_forecast()

# combine models and run manually specified models of other varieties
f.set_estimator('combo')
f.manual_forecast(how='simple',models='top_3',determine_best_by='ValidationMetricValue',call_me='avg') # simple average of top_3 models based on performance in validation
f.manual_forecast(how='weighted',models=models,determine_best_by='ValidationMetricValue',call_me='weighted') # weighted average of all models based on metric specified in determine_best_by (default is the validation metric)

f.plot_test_set(models='top_5',order_by='TestSetR2',include_train=60) # see test-set performance visually of top 5 best models by r2 (last 60 obs only)

plot_fitted

  • Forecaster.plot_fitted(models='all',order_by=None)
    • models: list-like or str, default 'all'
      • the models to plot
      • can start with "top_" and the metric specified in order_by will be used to order the models appropriately
    • order_by: one of _determine_best_by_, default None
import pandas as pd
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import seaborn as sns

from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index) # to initialize, specify y and current_dates (must be arrays of the same length)

f.set_test_length(12) # specify a test length for your models--it's a good idea to keep this the same for all forecasts
f.generate_future_dates(25) # this will create future dates that are on the same interval as the current dates and it will also set the forecast length
f.add_ar_terms(4) # add AR terms before differencing
f.add_AR_terms((2,12)) # seasonal AR terms
f.adf_test() # will print out whether it thinks the series is stationary and return a bool representing stationarity based on the augmented dickey fuller test
f.diff() # differences the y term and all ar terms to make a series stationary (also supports 2-level integration)
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True) # uses pandas attributes: raw=True creates integers (default), sincos=True creates wave functions (not default), dummy=True creates dummy vars (not default)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # dates are flexible, default is from when disney world closed to when US CDC lifted mask recommendations
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies regressors together
f.add_poly_terms('t',pwr=3) # by default, creates an order 2 regressor, n-order polynomial terms are allowed
f.set_validation_length(6) # length, different than test_length, to tune the hyperparameters 

# automatically tune and forecast with a series of models
models = ('mlr','knn','svr','xgboost','gbt','elasticnet','mlp','prophet')
for m in models:
  f.set_estimator(m)
  #f.ingest_grid('mlr') # manually pull any grid name that is specified in Grids.py
  f.tune() # by default, will pull the grid with the same name as the estimator (mlr will pull the mlr grid, etc.)
  f.auto_forecast()

# combine models and run manually specified models of other varieties
f.set_estimator('combo')
f.manual_forecast(how='simple',models='top_3',determine_best_by='ValidationMetricValue',call_me='avg') # simple average of top_3 models based on performance in validation
f.manual_forecast(how='weighted',models=models,determine_best_by='ValidationMetricValue',call_me='weighted') # weighted average of all models based on metric specified in determine_best_by (default is the validation metric)

f.plot_fitted(order_by='TestSetR2') # plot fitted values of all models ordered by r2

plot_acf

  • Forecaster.plot_acf(diffy=False,**kwargs)
    • plot_acf() from statsmodels
    • diffy: bool, default False
      • whether to call the function on the first differenced y series
    • **kwargs passed to the sm function
import pandas as pd
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import seaborn as sns

from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index) # to initialize, specify y and current_dates (must be arrays of the same length)

# time series exploration
f.plot_acf()
plt.show()
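For readers who want to see what an autocorrelation plot is summarizing, here is a small plain-Python sketch of the sample autocorrelation, computed on a toy series and on its first difference (the idea behind diffy). The helpers `acf` and `diff` are hypothetical and only approximate what the plotting function displays.

```python
def acf(y, nlags):
    """Sample autocorrelation for lags 0..nlags (sum of lagged products / total variation)."""
    n = len(y)
    mean = sum(y) / n
    dev = [v - mean for v in y]
    denom = sum(d * d for d in dev)
    return [
        sum(dev[t] * dev[t - k] for t in range(k, n)) / denom
        for k in range(nlags + 1)
    ]

def diff(y):
    """First difference of a series (one observation shorter)."""
    return [b - a for a, b in zip(y, y[1:])]

# Toy series that repeats every 4 observations.
y = [1.0, 3.0, 2.0, 0.0] * 3
levels_acf = acf(y, 4)
diffed_acf = acf(diff(y), 4)
print([round(r, 3) for r in levels_acf])
print([round(r, 3) for r in diffed_acf])
```

A large positive value at lag 4 reflects the 4-period cycle in the toy series.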

plot_pacf

  • Forecaster.plot_pacf(diffy=False,**kwargs)
    • plot_pacf() from statsmodels
    • diffy: bool, default False
      • whether to call the function on the first differenced y series
    • **kwargs passed to the sm function
import pandas as pd
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import seaborn as sns

from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index) # to initialize, specify y and current_dates (must be arrays of the same length)

# time series exploration
f.plot_pacf(diffy=True)
plt.show()

plot_periodogram

  • Forecaster.plot_periodogram(diffy=False)
    • periodogram() from scipy
    • diffy: bool, default False
      • whether to call the function on the first differenced y series
import pandas as pd
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import seaborn as sns

from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index) # to initialize, specify y and current_dates (must be arrays of the same length)

a, b = f.plot_periodogram(diffy=True)
plt.semilogy(a, b)
plt.show()
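A periodogram shows how much of a series' variation sits at each frequency, so a seasonal cycle appears as a peak. The sketch below computes a simplified version with a direct Fourier sum on a synthetic series with a 12-period cycle. Its scaling is deliberately simple and is not meant to match scipy's output; `periodogram` here is a hypothetical stand-in.

```python
import cmath
import math

def periodogram(y):
    """Return (frequencies, power) for frequencies 0..0.5 cycles per observation."""
    n = len(y)
    mean = sum(y) / n
    centered = [v - mean for v in y]
    freqs, power = [], []
    for k in range(n // 2 + 1):
        coef = sum(
            v * cmath.exp(-2j * math.pi * k * t / n)
            for t, v in enumerate(centered)
        )
        freqs.append(k / n)
        power.append(abs(coef) ** 2 / n)
    return freqs, power

# Synthetic series: four full cycles of length 12.
y = [math.sin(2 * math.pi * t / 12) for t in range(48)]
freqs, power = periodogram(y)
peak = max(range(len(power)), key=power.__getitem__)
dominant_period = 1 / freqs[peak]
print(round(dominant_period, 6))  # 12.0
```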

seasonal_decompose

  • Forecaster.seasonal_decompose(diffy=False,**kwargs)
    • seasonal_decompose() from statsmodels
    • diffy: bool, default False
      • whether to call the function on the first differenced y series
    • **kwargs passed to the sm function
import pandas as pd
import pandas_datareader as pdr
import matplotlib.pyplot as plt
import seaborn as sns

from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index) # to initialize, specify y and current_dates (must be arrays of the same length)

f.seasonal_decompose().plot()
plt.show()

history

see call_me

feature analysis

  • the following models have eli5 feature importance attributes that can be saved to the history as dataframes
    • 'mlr', 'mlp', 'gbt', 'xgboost', 'rf', 'elasticnet', 'svr', 'knn'
  • the following models have summary stats:
    • 'arima', 'hwes'
  • you can save these to history (run right after a forecast is created):
    • Forecaster.save_feature_importance()
    • Forecaster.save_summary_stats()
      • does not raise error if the last model run does not support feature importance/summary stats so that it won't break loops
    • Forecaster.export_feature_importance(model)
    • Forecaster.export_summary_stats(model)
      • model should match what is passed to call_me

forecasting the same series at different levels

  • you can use undiff() to revert back to the series' original integration
    • will delete all regressors so you will have to re-add the ones you want
  • when differencing, diff(1) is default but diff(2) is also supported
  • do not combine forecasts ('combo' estimator) run at different levels -- will return nonsense
  • plot() and plot_test_set() plot at level by default when the models you call were run at different integrations; if every model you call was run at the same integration, they plot at that integration
  • plot_fitted() cannot mix models with different integrations
import pandas as pd
import pandas_datareader as pdr
from scalecast.Forecaster import Forecaster

df = pdr.get_data_fred('HOUSTNSA',start='1900-01-01',end='2021-05-01')
f = Forecaster(y=df['HOUSTNSA'],current_dates=df.index)
f.set_test_length(12)
f.generate_future_dates(24) # forecast length
f.add_ar_terms(4)
f.add_AR_terms((2,12)) # seasonal AR terms
f.add_seasonal_regressors('month','dayofyear','week',raw=False,sincos=True)
f.add_seasonal_regressors('year')
f.add_covid19_regressor() # default window runs from when Disney World closed to when the U.S. CDC no longer recommended masks, but it can be changed
f.add_time_trend()
f.add_combo_regressors('t','COVID19') # multiplies time trend and COVID19 regressor
f.add_poly_terms('t') # t^2
f.diff() # non-stationary data forecasts better differenced with this model
f.set_estimator('elasticnet')
f.manual_forecast(alpha=.5,l1_ratio=.5,normalizer='scale')

f.undiff() # drops all added regressors

f.set_estimator('arima')
f.manual_forecast(order=(1,1,1),seasonal_order=(2,1,0,12),trend='ct')

f.plot()
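
To make the idea of "levels" concrete, here is a minimal sketch in plain Python of what differencing does and how a forecast made on a differenced series can be brought back to level. This is not scalecast's internal code: the helper names `diff` and `undiff_forecast` are hypothetical and chosen only for illustration, and the data is made up.

```python
from itertools import accumulate

def diff(series, i=1):
    """Difference a series i times; each pass drops one observation."""
    out = list(series)
    for _ in range(i):
        out = [b - a for a, b in zip(out, out[1:])]
    return out

def undiff_forecast(diffed_fcst, history, i=1):
    """Bring a forecast made on an i-times differenced series back to level."""
    # anchors[k] is the last observed value of the k-times differenced history
    anchors = []
    h = list(history)
    for _ in range(i):
        anchors.append(h[-1])
        h = [b - a for a, b in zip(h, h[1:])]
    out = list(diffed_fcst)
    # integrate once per differencing pass, innermost difference first
    for anchor in reversed(anchors):
        out = list(accumulate(out, initial=anchor))[1:]
    return out

y = [10, 12, 15, 19, 24, 30]      # made-up series
history, future = y[:4], y[4:]    # pretend the last two values are unknown

level_fcsts = {}
for i in (1, 2):
    # a "perfect" forecast on the differenced scale, used only to check the round trip
    perfect_diffed_fcst = diff(y, i)[-2:]
    level_fcsts[i] = undiff_forecast(perfect_diffed_fcst, history, i)

print(level_fcsts)  # both entries equal the held-out values [24, 30]
```

The same values look very different on each scale (`[5, 6]` after one difference, `[1, 1]` after two, `[24, 30]` at level), which is why averaging forecasts that live on different scales produces meaningless numbers.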

warnings

  • in the process of specifying so many models, you may write code that returns warnings
  • rather than printing to console, these warnings are written to a log file that gets overwritten with every run
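
For readers who want the same behavior in their own scripts, the sketch below shows one way to send Python warnings to a log file that is overwritten on each run, using only the standard library. It is an illustration of the general technique, not necessarily how scalecast implements it; the log path and the warning text are hypothetical.

```python
import logging
import os
import tempfile
import warnings

# Hypothetical log location; scalecast's actual file name is not stated in this README.
LOG_PATH = os.path.join(tempfile.gettempdir(), "forecast_warnings_demo.log")

# mode="w" truncates the file on open, so each run starts with an empty log.
handler = logging.FileHandler(LOG_PATH, mode="w")
handler.setFormatter(logging.Formatter("%(asctime)s %(message)s"))

# Route warnings.warn() output to the "py.warnings" logger instead of stderr.
logging.captureWarnings(True)
warnings_logger = logging.getLogger("py.warnings")
warnings_logger.addHandler(handler)
warnings_logger.propagate = False  # do not pass the records on to root handlers

with warnings.catch_warnings():
    warnings.simplefilter("always")  # make sure the demo warning is not filtered out
    warnings.warn("model did not converge")  # stand-in for a warning raised while fitting

# Clean up so later code gets the default warning behavior back.
warnings_logger.removeHandler(handler)
handler.close()
logging.captureWarnings(False)

with open(LOG_PATH) as fh:
    logged = fh.read()
```

After the script runs, `logged` holds the text of the warning and nothing is printed to the console.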

all functions

  • there's a lot more you can do with this object, see below for a list of all functions
  • brief descriptions given for functions not explained anywhere else
Forecaster.add_AR_terms(N)
Forecaster.add_ar_terms(n)
Forecaster.add_combo_regressors(*args, sep='_')
Forecaster.add_covid19_regressor(called='COVID19', start=datetime.datetime(2020, 3, 15, 0, 0), end=datetime.datetime(2021, 5, 13, 0, 0))
Forecaster.add_other_regressor(called, start, end)
Forecaster.add_poly_terms(*args, pwr=2, sep='^')
Forecaster.add_seasonal_regressors(*args, raw=True, sincos=False, dummy=False, drop_first=False)
Forecaster.add_time_trend(called='t')
Forecaster.adf_test(critical_pval=0.05, quiet=True, full_res=False, **kwargs)
  # from statsmodels: Augmented Dickey-Fuller stationarity test; returns False if the series is not stationary, True otherwise; full_res=True returns the full output from statsmodels; **kwargs are passed to the statsmodels function
Forecaster.auto_forecast(call_me=None)
Forecaster.diff(i=1)
  # supports i = 1 | i = 2
Forecaster.drop_regressors(*args)
  # deletes regressors out of future xreg and current xreg
Forecaster.export(dfs=['all_fcsts', 'model_summaries', 'best_fcst', 'test_set_predictions', 'lvl_fcsts'], models='all', best_model='auto', determine_best_by='TestSetRMSE', to_excel=False, out_path='./', excel_name='results.xlsx')
Forecaster.export_feature_importance(model)
Forecaster.export_summary_stats(model)
Forecaster.export_validation_grid(model)
  # creates a pandas dataframe out of a validation grid for a given model with error/accuracy results saved to a column
Forecaster.fillna_y(how='ffill')
  # 'how' matches 'method' keyword from the pandas.DataFrame.fillna() method
Forecaster.generate_future_dates(n)
Forecaster.get_freq()
  # returns the freq attribute which is set from infer_freq() which matches the pandas function
Forecaster.get_regressor_names()
  # returns a list of the names of the regressors saved to the object
Forecaster.infer_freq()
  # pandas.infer_freq()
Forecaster.ingest_Xvars_df(df, date_col='Date', drop_first=False, use_future_dates=False)
Forecaster.ingest_grid(grid)
Forecaster.keep_smaller_history(n)
  # reduces the number of y observations; n can be a str in yyyy-mm-dd format, an int representing the number of most recent obs to keep, or a datetime/pandas date object
Forecaster.limit_grid_size(n)
Forecaster.manual_forecast(call_me=None, **kwargs)
Forecaster.order_fcsts(models, determine_best_by='TestSetRMSE')
  # returns a list of model names (matching call_me) in order from best to worst according to the determine_best_by arg
Forecaster.plot(models='all', order_by=None, level=False, print_attr=[])
Forecaster.plot_acf(diffy=False, **kwargs)
Forecaster.plot_fitted(models='all', order_by=None)
Forecaster.plot_pacf(diffy=False, **kwargs)
Forecaster.plot_periodogram(diffy=False)
Forecaster.plot_test_set(models='all', order_by=None, include_train=True, level=False)
Forecaster.pop(*args)
  # *args are names of models (matching call_me) that will be deleted from history
Forecaster.save_feature_importance(quiet=True)
Forecaster.save_summary_stats(quiet=True)
Forecaster.seasonal_decompose(diffy=False, **kwargs)
Forecaster.set_estimator(which)
Forecaster.set_last_future_date(date)
  # another way to fill future dates by stopping at the last date you want forecasted
Forecaster.set_test_length(n=1)
Forecaster.set_validation_length(n=1)
Forecaster.set_validation_metric(which='rmse')
  # one of 'rmse','mape','mae','r2'
Forecaster.tune()
Forecaster.typ_set()
  # sets data types of y, current_dates, etc. appropriately
Forecaster.undiff(suppress_error=False)
Forecaster.validate_regressor_names()
  # validates that all variable names in current_xreg and future_xreg match
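
As a reference for the options accepted by set_validation_metric(), the sketch below computes the four metrics from their textbook definitions in plain Python. It assumes scalecast follows these standard definitions, which this README section does not confirm; the function names and sample values are illustrative only.

```python
import math

def rmse(actual, pred):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))

def mae(actual, pred):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)

def mape(actual, pred):
    """Mean absolute percentage error as a fraction; undefined if any actual is 0."""
    return sum(abs((a - p) / a) for a, p in zip(actual, pred)) / len(actual)

def r2(actual, pred):
    """Coefficient of determination."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, pred))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

actual = [100.0, 200.0, 400.0]  # made-up validation values
pred = [110.0, 190.0, 400.0]    # made-up predictions

metrics = {
    "rmse": rmse(actual, pred),
    "mae": mae(actual, pred),
    "mape": mape(actual, pred),
    "r2": r2(actual, pred),
}
print(metrics)
```

For 'rmse', 'mae' and 'mape' a lower value is better, while for 'r2' a higher value is better, so any ranking of models has to respect the direction of the chosen metric.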

scalecast's People

Contributors

mikekeith52
