
mixvlmc's Introduction

Variable Length Markov Chains with Covariates


mixvlmc implements variable length Markov chains (VLMC) and variable length Markov chains with covariates (COVLMC), as described in:

mixvlmc includes functionalities similar to the ones available in VLMC and PST. The main advantages of mixvlmc are the support of time-varying covariates with COVLMC and the introduction of post-pruning of the models, which enables fast model selection via information criteria.

Installation

The package can be installed from CRAN with:

install.packages("mixvlmc")

The development version is available from GitHub:

# install.packages("devtools")
devtools::install_github("fabrice-rossi/mixvlmc")

Usage

Variable length Markov chains

Variable length Markov chains (VLMC) are sparse high order Markov chains. They can be used to model time series (sequences) with discrete values (states) with a mix of low order dependencies for certain states and higher order dependencies for others. For instance, with a binary time series, the probability of observing 1 at time $t$ could be constant regardless of older states whenever the state at time $t-1$ was 1, but could depend on the states at times $t-3$ and $t-2$ whenever the state at time $t-1$ was 0. A collection of past states that completely determines the transition probabilities is a context of the VLMC. Read vignette("context-trees") for details about contexts and context trees, and see vignette("variable-length-markov-chains") for a more detailed introduction to VLMC.

VLMC with covariates (COVLMC) are an extension of VLMC in which the transition probabilities (the probabilities of the next state given the past) can be influenced by the past values of some covariates (in addition to the past values of the time series itself). Each context is associated with a logistic model that maps the (past values of the) covariates to transition probabilities.
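The idea of mapping a covariate to transition probabilities through a logistic model can be illustrated with base R. This is only a sketch of the general principle on hypothetical data, not the package internals:

```r
## Illustrative sketch (not the package internals): for one fixed context,
## a logistic model maps a binary covariate to transition probabilities.
set.seed(1)
covariate <- rep(c(TRUE, FALSE), each = 50)
## hypothetical data: the next state depends on the covariate
next_state <- rbinom(100, 1, ifelse(covariate, 0.8, 0.3))
fit <- glm(next_state ~ covariate, family = binomial())
## estimated transition probabilities for each covariate value
predict(fit, newdata = data.frame(covariate = c(TRUE, FALSE)), type = "response")
```

In a COVLMC, one such model is fitted per context, using the past values of the covariates as predictors.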

Fitting a VLMC

The package is loaded in a standard way.

library(mixvlmc)
library(ggplot2) ## we load ggplot2 for the autoplot examples

The package's main function is vlmc(), which can be called on a time series represented by a numeric vector or a factor, for instance.

set.seed(0)
x <- sample(c(0L, 1L), 200, replace = TRUE)
model <- vlmc(x)
model
#> VLMC context tree on 0, 1 
#>  cutoff: 1.921 (quantile: 0.05)
#>  Number of contexts: 11 
#>  Maximum context length: 6

The default parameters of vlmc() will tend to produce overly complex VLMC in order to avoid missing potential structure in the time series. In the example above, we expect the optimal VLMC to be a constant distribution as the sample is independent and uniformly distributed (it has no temporal structure). The default parameters give here an overly complex model, as illustrated by its text-based representation:

draw(model)
#> ▪ (0.505, 0.495)
#> └─ 1 (0.4848, 0.5152)
#>    ├─ 0 (0.5319, 0.4681)
#>    │  └─ 1 (0.5, 0.5)
#>    │     └─ 0 (0.4444, 0.5556)
#>    │        └─ 0 (0.4286, 0.5714)
#>    │           ├─ 0 (1, 0)
#>    │           └─ 1 (0, 1)
#>    └─ 1 (0.4314, 0.5686)
#>       └─ 0 (0.2727, 0.7273)
#>          └─ 0 (0.3846, 0.6154)
#>             └─ 0 (0.5, 0.5)
#>                ├─ 0 (0, 1)
#>                └─ 1 (1, 0)

The representation uses simple ASCII art to display the contexts of the VLMC organized into a tree (see vignette("context-trees") for a more detailed introduction):

  • the root ▪ corresponds to an empty context;
  • one can read contexts by following branches (represented by ─) down to their ends (the leaves): for instance $(0, 0, 0, 1, 0, 1)$ is one of the contexts of the tree.

Here the context $(0, 0, 0, 1, 0, 1)$ is associated with the transition probabilities $(1, 0)$. This means that when one observes this context in the time series, it is always followed by a 0. Notice that contexts extend to the left as we go down the tree, since deeper nodes correspond to older values. Some papers prefer to write contexts from the most recent value to the oldest one. With this convention, the “reverse” context $(1, 0, 1, 0, 0, 0)$ corresponds to the sub time series $(0, 0, 0, 1, 0, 1)$. Unless otherwise specified, we write contexts in temporal order.
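The two ordering conventions are simply mirror images of each other, as can be checked in base R:

```r
## A context written in temporal order (oldest value first)
ctx_temporal <- c(0, 0, 0, 1, 0, 1)
## The same context written from the most recent value to the oldest
rev(ctx_temporal)
#> [1] 1 0 1 0 0 0
```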

BIC based model selection

The VLMC above is obviously overfitting the time series, as illustrated by the 0/1 transition probabilities. A classical way to select a good model is to minimize the BIC. In mixvlmc this can be done easily using tune_vlmc(), which first fits a complex VLMC and then prunes it (using a combination of cutoff() and prune()), as follows (see vignette("variable-length-markov-chains") for details):

best_model_tune <- tune_vlmc(x)
best_model <- as_vlmc(best_model_tune)
draw(best_model)
#> ▪ (0.505, 0.495)

As expected, we end up with a constant model.

In time series with actual temporal patterns, the optimal model will be more complex. As a very basic illustrative example, let us consider the sunspot.year time series and turn it into a binary one, with high activity associated with a number of sunspots larger than the median number.

sun_activity <- as.factor(ifelse(sunspot.year >= median(sunspot.year), "high", "low"))

We automatically fit an optimal VLMC as follows:

sun_model_tune <- tune_vlmc(sun_activity)
sun_model_tune
#> VLMC context tree on high, low 
#>  cutoff: 2.306 (quantile: 0.03175)
#>  Number of contexts: 9 
#>  Maximum context length: 5 
#>  Selected by BIC (236.262) with likelihood function "truncated" (-98.83247)

The results of the pruning process can be represented graphically:

print(autoplot(sun_model_tune) + geom_point())

The plot shows that the simplest models underfit: the BIC increases once pruning becomes strong enough. The best model remains rather complex (as expected, given the periodicity of the solar cycle):

best_sun_model <- as_vlmc(sun_model_tune)
draw(best_sun_model)
#> ▪ (0.5052, 0.4948)
#> ├─ high (0.8207, 0.1793)
#> │  ├─ high (0.7899, 0.2101)
#> │  │  ├─ high (0.7447, 0.2553)
#> │  │  │  ├─ high (0.6571, 0.3429)
#> │  │  │  │  └─ low (0.9167, 0.08333)
#> │  │  │  └─ low (1, 0)
#> │  │  └─ low (0.96, 0.04)
#> │  └─ low (0.9615, 0.03846)
#> └─ low (0.1888, 0.8112)
#>    ├─ high (0, 1)
#>    └─ low (0.2328, 0.7672)
#>       ├─ high (0, 1)
#>       └─ low (0.3034, 0.6966)
#>          └─ high (0.07692, 0.9231)

Fitting a VLMC with covariates

To illustrate the use of covariates, we use the power consumption data set included in the package (see vignette("covlmc") for details). We consider a week of electricity usage as follows:

pc_week_5 <- powerconsumption[powerconsumption$week == 5, ]
elec <- pc_week_5$active_power
ggplot(pc_week_5, aes(x = date_time, y = active_power)) +
  geom_line() +
  xlab("Date") +
  ylab("Active power (kW)")

The time series displays some typical patterns of electricity usage:

  • low active power at night (typically below 0.4 kW);
  • standard use between 0.4 and 2 kW;
  • peak use above 2 kW.

We build a discrete time series from those (somewhat arbitrary) thresholds:

elec_dts <- cut(elec, breaks = c(0, 0.4, 2, 8), labels = c("low", "typical", "high"))
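The behaviour of cut() with these breaks can be checked on a small stand-in vector (the values below are illustrative, not taken from the data set):

```r
## cut() assigns each value to the interval (0, 0.4], (0.4, 2] or (2, 8]
## and labels it accordingly; note that interval upper bounds are included.
demo_power <- c(0.25, 1.3, 3.5, 0.39, 2.0)
cut(demo_power, breaks = c(0, 0.4, 2, 8), labels = c("low", "typical", "high"))
```

Here 2.0 falls in the "typical" interval because cut() uses right-closed intervals by default.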

The best VLMC model is quite simple. It is almost a standard order one Markov chain, apart from the order two context used when the active power is typical.

elec_vlmc_tune <- tune_vlmc(elec_dts)
best_elec_vlmc <- as_vlmc(elec_vlmc_tune)
draw(best_elec_vlmc)
#> ▪ (0.1667, 0.5496, 0.2837)
#> ├─ low (0.7665, 0.2335, 0)
#> ├─ typical (0.0704, 0.8466, 0.08303)
#> │  └─ low (0.3846, 0.5385, 0.07692)
#> └─ high (0.003497, 0.1573, 0.8392)

As pointed out above, low active power tends to correspond to nighttime. We can include this information by introducing a day covariate as follows:

elec_cov <- data.frame(day = (pc_week_5$hour >= 7 & pc_week_5$hour <= 17))

A COVLMC is estimated using the covlmc() function:

elec_covlmc <- covlmc(elec_dts, elec_cov, min_size = 2, alpha = 0.5)
draw(elec_covlmc, model = "full")
#> ▪
#> ├─ low ([ (I)    • day_1TRUE
#> │         -1.558 • 1.006     ])
#> ├─ typical
#> │  ├─ low ([ (I)    • day_1TRUE ⁞ day_2TRUE
#> │  │         0.3567 • -27.81    ⁞ 27.81    
#> │  │         -1.253 • -14.39    ⁞ 13.69     ])
#> │  ├─ typical ([ (I)    • day_1TRUE
#> │  │             2.666  • 0.566    
#> │  │             0.2683 • 0.2426    ])
#> │  └─ high ([ (I)    • day_1TRUE
#> │             2.015  • 16.18    
#> │             0.6931 • 16.61     ])
#> └─ high ([ (I)   • day_1TRUE
#>            17.41 • -14.23   
#>            19.38 • -14.88    ])

The model appears a bit complex. To obtain a more parsimonious model, we use BIC-based model selection as follows:

elec_covlmc_tune <- tune_covlmc(elec_dts, elec_cov)
print(autoplot(elec_covlmc_tune))

best_elec_covlmc <- as_covlmc(elec_covlmc_tune)
draw(best_elec_covlmc, model = "full")
#> ▪
#> ├─ low ([ (I)    • day_1TRUE
#> │         -1.558 • 1.006     ])
#> ├─ typical
#> │  ├─ low ([ (I)   
#> │  │         0.3365
#> │  │         -1.609 ])
#> │  ├─ typical ([ (I)   
#> │  │             2.937 
#> │  │             0.3747 ])
#> │  └─ high ([ (I)  
#> │             2.773
#> │             1.705 ])
#> └─ high ([ (I)  
#>            3.807
#>            5.481 ])

As in the VLMC case, the optimal model remains rather simple:

  • the high context does not use the covariate and is equivalent to the VLMC context;
  • the low context is more interesting: it never switches to a high context (hence the single row of parameters) but uses the covariate. As expected, the probability of switching from low to typical is larger during the day;
  • the typical context is described in a more complex way than in the VLMC case, as the transition probabilities depend on the previous state.

Sampling

VLMC models can also be used to sample new time series, as in the VLMC bootstrap proposed by Bühlmann and Wyner. For instance, we can estimate the longest time period spent in the high active power regime. In this “predictive” setting, the AIC may be better suited to select the best model. Notice that some quantities can be computed directly from the model in the VLMC case, using classical results on Markov chains. See vignette("sampling") for details on sampling.

We first select two models based on the AIC.

best_elec_vlmc_aic <- as_vlmc(tune_vlmc(elec_dts, criterion = "AIC"))
best_elec_covlmc_aic <- as_covlmc(tune_covlmc(elec_dts, elec_cov, criterion = "AIC"))

Then we sample 100 new time series for each model, using the simulate() function as follows:

set.seed(0)
vlmc_simul <- vector(mode = "list", 100)
for (k in seq_along(vlmc_simul)) {
  vlmc_simul[[k]] <- simulate(best_elec_vlmc_aic, nsim = length(elec_dts), init = elec_dts[1:2])
}
set.seed(0)
covlmc_simul <- vector(mode = "list", 100)
for (k in seq_along(covlmc_simul)) {
  covlmc_simul[[k]] <- simulate(best_elec_covlmc_aic, nsim = length(elec_dts), covariate = elec_cov, init = elec_dts[1:2])
}

Then statistics can be computed on those time series. For instance, we look for the longest time period spent in the high active power regime.

longuest_high <- function(x) {
  high_length <- rle(x == "high")
  10 * max(high_length$lengths[high_length$values])
}
lh_vlmc <- sapply(vlmc_simul, longuest_high)
lh_covlmc <- sapply(covlmc_simul, longuest_high)

The average longest time spent consecutively in high is:

  • for the VLMC: 243.6 minutes with a standard error of 6.7337834;
  • for the VLMC with covariate: 286 minutes with a standard error of 8.9386235;
  • 410 minutes for the observed time series.
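The averages and standard errors above are simple summary statistics of the simulated values. A minimal sketch on a stand-in vector (in the text, lh_vlmc and lh_covlmc play this role):

```r
## Stand-in for the vector of per-simulation longest high durations
lh_values <- c(240, 250, 260, 230, 240)
mean(lh_values)                          # average longest time
sd(lh_values) / sqrt(length(lh_values))  # standard error of the mean
```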

The following figure shows the distributions of the times obtained by both models, as well as the observed value. The VLMC model with covariates is able to generate longer sequences in the high active power state than the bare VLMC model, as a consequence of its sensitivity to the day/night schedule.

lh <- data.frame(
  time = c(lh_vlmc, lh_covlmc),
  model = c(rep("VLMC", length(lh_vlmc)), rep("COVLMC", length(lh_covlmc)))
)
ggplot(lh, aes(x = time, color = model)) +
  geom_density() +
  geom_rug(alpha = 0.5) +
  geom_vline(xintercept = longuest_high(elec_dts), color = 3)

The VLMC with covariates can be used to investigate the effects of changes in those covariates. For instance, if daytime is longer, we expect high power usage to be less frequent. As an illustration, we simulate one week with daytime running from 6:00 to 20:00 as follows.

elec_cov_long_days <- data.frame(day = (pc_week_5$hour >= 6 & pc_week_5$hour <= 20))
set.seed(0)
covlmc_simul_ld <- vector(mode = "list", 100)
for (k in seq_along(covlmc_simul_ld)) {
  covlmc_simul_ld[[k]] <- simulate(best_elec_covlmc_aic, nsim = length(elec_dts), covariate = elec_cov_long_days, init = elec_dts[1:2])
}

As expected, the distribution of the longest time spent consecutively in high power usage is shifted to lower values when the day length is increased.

lh_covlmc_ld <- sapply(covlmc_simul_ld, longuest_high)
day_time_effect <- data.frame(
  time = c(lh_covlmc, lh_covlmc_ld),
  `day length` = c(rep("Short days", length(lh_covlmc)), rep("Long days", length(lh_covlmc_ld))),
  check.names = FALSE
)
ggplot(day_time_effect, aes(x = time, color = `day length`)) +
  geom_density() +
  geom_rug(alpha = 0.5)

mixvlmc's People

Contributors

fabrice-rossi, hugotmdd, guenoj


mixvlmc's Issues

Support logical values as state space

Currently vlmc() and covlmc() support only character, factor or numeric values for the state space. It should be easy to directly support logical values (in the internal function to_dts()).
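Until logical states are supported, a simple workaround is to convert the logical series to one of the supported types before fitting (a hypothetical usage sketch, not part of the package):

```r
## Convert a logical series to integers (0/1) so that vlmc() accepts it;
## as.factor(logical_series) would work as well.
logical_series <- c(TRUE, FALSE, TRUE, TRUE, FALSE)
as.integer(logical_series)
#> [1] 1 0 1 1 0
```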

Enable post-pruning for `covlmc` models

The cutoff and prune functions for vlmc enable an efficient search for an optimal model according to e.g. the BIC. A similar functionality for covlmc models would be very useful, especially considering that the estimation time for those models is significantly longer than for vlmc models.

The ground work for cutoff has already been implemented in commit a9600eb.

Post pruning needs the data to be saved in the covlmc but the added memory consumption should be compensated by the reduced total processing time when we look for an optimal model. Models fitted to the data but discarded could also be saved to speed up the process even more.

Add more input validation tests

Many functions do not sufficiently validate their parameters before calling low level functions, leading to uninformative error messages. In particular:

  • likelihood does not check the compatibility of the newdata (see simulate.vlmc() for the validation code).
  • draw.covlmc() does not validate model

Store discarded models in `covlmc` to speed up `prune.covlmc`

In some circumstances, it might be interesting to store discarded models to avoid recomputing them during post pruning. However, most of them might be invalidated once a leaf of the context tree is pruned. As currently models use a lot of memory, this should be done after implementing feature #12.

Make cutoff values more usable

Values computed by the cutoff.* functions correspond directly to the thresholds used in the pruning phase of VLMC or coVLMC. As the pruning tests are written as follows

if (p_value > cutoff) {
  prune
} else {
  do not prune
} 

using the cutoff values directly may not induce pruning.

This can be fixed by either:

  1. replacing the test by p_value >= cutoff
  2. slightly decreasing the thresholds reported by cutoff

The first solution departs from what is used in other VLMC implementations, leading to potential differences between the results of very similar codes. The second solution is easy to implement in Rcpp using the nextafter standard function.
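The second solution can also be sketched in plain R (the issue itself suggests std::nextafter in Rcpp; the helper name below is hypothetical):

```r
## Sketch of option 2: report a threshold just below the computed cutoff,
## so that the strict test `p_value > cutoff` succeeds at the boundary.
## For positive x, scaling by (1 - .Machine$double.eps) yields a strictly
## smaller representable double, similar in spirit to std::nextafter.
just_below <- function(x) x * (1 - .Machine$double.eps)
cutoff <- 1.921
just_below(cutoff) < cutoff
#> [1] TRUE
```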

Use suffix trees to speed up context tree construction

Fast VLMC construction can be done using a linear time suffix tree construction algorithm such as Ukkonen's algorithm. It makes sense to implement it in C++ and to keep the full representation in C++ as well. Steps:

  • implement Ukkonen's algorithm
  • add counts for support pruning
  • integrate the C++ representation for ctx tree
  • implement similarity pruning

This should be first limited to VLMC. The cost of COVLMC is dominated by the logistic model estimation and thus supporting a simple transformation of the C++ representation into the current R representation (with nested lists) should be sufficient as a first integration step.

Implement switchable predictive models in VLMC with covariates

The current implementation of the VLMC with covariates uses logistic regression to estimate the transition probabilities associated with a context from the covariates. When the state space has only two states, this is based on glm with the binomial link (and estimated via spaMM_glm.fit from the spaMM package), while larger state spaces use the VGAM package (more specifically vglm with the multinomial family).

It would be interesting to support other estimators such as multinom from the nnet package or simple tables for discrete covariates. The default case could be specified by some global options. Passing the estimators as a parameter of the fitting function might be a bit too complex.

Support initial="extended" in loglikelihood.covlmc

The implementation of `loglikelihood.covlmc` added by commit 06e4bb2 does not support the extended context matching mode available for vlmc. The main issue is the lack of logistic models in the internal nodes of the context tree. Those models were generally not evaluated during the construction of the covlmc (some may have been tested).

To enable extended context matching, we need to keep potentially useful models (see also issue #15) and to compute the other ones on the fly. If we do not want to store the data in the covlmc object, we need to compute all models preventively. Because of the rather large associated memory footprint, we may either wait for issue #12 to be solved or at least trim the models.

Specialized representation for logistic models

A covlmc object needs to store one logistic model per context, which induces a rather large memory consumption. In practice, we only need to

  • access the coefficients of the model;
  • compute the transition probabilities associated with the values of the covariates.

Storing a glm object for this is a waste of space. In addition, we already need to unify to some extent the internal representation of the coefficients between the different estimation engines that can be used (see #2). We should go one step further and replace the full objects returned by other packages with bare-bones representations.

Create a Context class used to represent contexts

This could be used by contexts to return more efficiently everything associated with each context (frequency of next symbols, positions, etc.). This could also be used by a new find_context function that returns a single context (if it exists).

Include examples for all the functions

Todo:

  • context_number.Rd
  • context_number.covlmc.Rd
  • contexts.Rd
  • contexts.ctx_tree.Rd
  • contexts.vlmc
  • contexts.covlmc.Rd
  • covlmc_control.Rd
  • covlmc.Rd
  • ctx_tree.Rd
  • cutoff.Rd
  • cutoff.vlmc.Rd
  • cutoff.covlmc.Rd
  • depth.Rd
  • draw.covlmc.Rd
  • draw.ctx_tree.Rd
  • draw.vlmc.Rd
  • draw_control.Rd
  • is_ctx_tree.Rd
  • is_vlmc.Rd
  • is_covlmc.Rd
  • loglikelihood.covlmc.Rd
  • loglikelihood.Rd
  • prune.Rd
  • prune.vlmc.Rd
  • prune.covlmc.Rd
  • simulate.covlmc.Rd
  • simulate.vlmc.Rd
  • states.Rd
  • vlmc.Rd

Should be easier with a data set (#9).

Add a burn_in_time parameter to simulate.*

As it is recommended to drop the initial samples produced by a (CO)VLMC when it is used for bootstrap estimation, it would be convenient to have this feature implemented directly by simulate.vlmc(). For simulate.covlmc() the situation is much more complex as covariates are needed.

Likelihood calculation for VLMC with covariates

Likelihood calculations for VLMC with covariates are missing. We need:

  • an implementation of the standard S3 method logLik.covlmc;
  • an implementation of loglikelihood that can compute the log likelihood of a new discrete time series with covariates.

Add predict functions for vlmc and covlmc

The goal is to implement a predict.* function which can work with a new series or the one used to estimate the model. For covlmc, the covariates must be specified as in simulate.covlmc.

  • vlmc
  • covlmc

Add extraction functions for tune_*vlmc results

Currently the only simple use of tune_*vlmc results consists in converting them to *vlmc objects. It would be interesting to add other uses such as:

  • plot functions
  • retune functions: this could be used to get directly the best AIC model from the BIC one and vice versa without running the full search again
  • possibly a direct extraction of the best AIC or BIC model when all models were saved

Ongoing work:

  • plot functions
  • autoplot functions for more complex representations
  • retune functions

Specialize `contexts` for `vlmc` and `covlmc`

The contexts function is useful to get the list of all contexts in a context tree, but it does not give access to the context specific data that are available for vlmc and covlmc objects. It would be interesting to specialize the function for those classes to report:

  • the conditional distribution of states for vlmc
  • the conditional logistic model for covlmc

Compute and report quality metrics for models in covlmc

VLMC with covariates may include poor predictive logistic models which perform barely better than constant models (a.k.a. conditional probabilities). In order to enable proper interpretation of a covlmc, the package should compute quality metrics for the models, on the learning time series and possibly on new time series. Metrics could include:

  • accuracy and balanced accuracy
  • AUC for binary models
  • F-measure and variants

The metrics could be reported by:

  • the context function via new columns in the data.frame format
  • the draw function
  • the summary function in aggregated form
  • possibly via a new function for new time series.

Add options to control tune_* functions

The fully automated aspect of the tune_* functions is convenient, but some control is needed in certain situations. In particular, when the computational burden is expected to be high, the conservative initial cutoff value can lead to a significant waste of resources. This can be fixed by allowing the user to specify the initial cutoff.

  • initial cutoff

Add automatic model selection functions

Model selection for vlmc is rather easy to implement but is slightly more complex for covlmc. It would be useful to include model selection directly in the package, using AIC and BIC for instance.

  • implement the core function for vlmc
  • implement tests for vlmc
  • add cross links to documentation of other methods (cutoff, prune, etc.)
  • implement the core function for covlmc
  • implement tests for covlmc
  • add cross links to documentation of other methods (cutoff, prune, etc.)
  • add examples to the relevant vignettes

Improve test coverage

  • print.* methods
  • summary.* methods
  • merged node cases for likelihood.covlmc
  • as_*
  • merged node cases and degenerate cases for simulate.covlmc
  • AIC for tune_*
  • error cases

Improve draw.* usability

The draw.* functions can be improved using different approaches:

  • With better defaults. We do not need p values in general, for instance.
  • By the use of UTF8 characters. We can take inspiration from the tree rendering in https://github.com/r-lib/cli. This could be at least an option or one of the default configuration obtained easily from draw_control().
  • By removing logistic models when possible. In some cases, logistic models are completely pruned in numerous contexts. Rather than reporting them as models, we could display the conditional distribution (as it is constant).
  • By providing a latex output. The ascii/UTF8 representation of context trees is not satisfactory for a latex export. It would be interesting to leverage e.g. the forest package.

Add graphical representations of context trees

Several solutions could be used:

  • direct solutions that produce a graphical representation in R
    • using plot/ggplot
    • possibly via igraph
  • indirect solutions
    • using an igraph or similar representation but leaving the user free to choose their preferred visualisation
    • using the dot format or any other format for which external tools are available

Allow VLMC models to be estimated on a collection of time series

  • basic context trees for collections of time series multi_ctx_tree
  • basic VLMC for collections of time series multi_vlmc
  • likelihood variants for multi VLMC (on new series)
  • model selection for multi VLMC multi_tune_vlmc
  • handle positions
  • handle metrics
  • likelihood for the collection used for estimation

Add a way to seed the model in `simulate`

When a new discrete time series is sampled with simulate, the first states are obtained via a context free distribution (for obvious reasons). In some situations, it would be useful to allow the user to specify an initial context.

Preserve the original type of the time series states

The code supports time series with states that can be numerical, factors or characters, but some of those types are not preserved. For instance

dts <- sample(c(0L, 1L), 100, replace=TRUE)
dts_tree <- ctx_tree(dts, max_depth = 4)
dts_contexts <- contexts(dts_tree)
typeof(dts_contexts[[1]])

gives "character" instead of "integer".

Improve draw for covlmc to show logistic models in a clear way

The current implementation of draw as of commit 2e97c38 shows logistic models as a flat list of coefficients. This is acceptable when the state space has only two values and there is a single covariate. With more values or more covariates, it is quite difficult to read. In addition, the ordering of the coefficients depends on the underlying implementation of the logistic model (see issue #2).

  • a minimal improvement is to have a unified representation regardless of the underlying model.
  • another improvement is to show explicitly the different "submodels" when there are more than two values in the state space
  • another improvement is to show explicitly the original covariates

Handle the empty context consistently

A fully simplified VLMC consists simply in a stationary model with an empty context. While prune and tune_vlmc produce this type of model, the way it is handled is not consistent with what happens for more complex models. For instance

set.seed(0)
dts <- sample(0:2, 1000, replace = TRUE)
model <- as_vlmc(tune_vlmc(dts))

produces a stationary model which is reported to have 0 contexts and thus 0 parameters, leading to a wrong value of the AIC/BIC, among other problems. In some situations, this will also lead to the loss of matched positions.

Notice that this problem does not manifest with the C++ backend.

Release mixvlmc 0.2.0

Prepare for release:

  • git pull
  • Check current CRAN check results
  • Polish NEWS
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • Update cran-comments.md
  • git push
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • Add preemptive link to blog post in pkgdown news menu
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)
  • Finish blog post
  • Tweet

Add trimming to VGAM based models

Currently trim.covlmc trims only glm and multinom models. We should at least remove the residuals, the fitted values, the predictors and the terms.

Add vignettes

We need to include at least a basic vignette. It will be easier to write using a simple data set (#9).

Planned vignette:

  • introduction to contexts and context trees (see commit 594b4db)
  • introduction to VLMC (see commit a28be40)
  • introduction to VLMC with covariates
  • post simulation

Clarify the semantics of `model$H0`

The covlmc code uses internally a model list with a H0 component. During the initial model fitting, it is true if the H0 hypothesis, that the true model is the simplest one in terms of historical covariates, cannot be rejected (by the likelihood ratio test). This is used to decide whether the enclosing context can be collapsed or merged. However, the use of H0 by node_prune_model is not clearly specified. It seems that no bug is currently triggered by this inconsistency, but a clearer definition of the role of this component should improve the robustness of the estimation procedure.

Expose and document the "bayesian" regularisation used in `constant_model`

Bug #3 was fixed using a new constant_model object which mimics a logistic regression when the target class is constant. To avoid numerical instabilities, we use a "bayesian" regularisation which amounts to adding fake observations of all target values that are not observed in the data set. The level of regularization (and its use) should be under user control (and documented).

Data set(s)

Documenting the package and implementing some of the tests is tedious as we do not have a simple data set to illustrate everything. We should add at least one.

Add summary functions for ctx_tree, vlmc and covlmc

The summary functions could report statistics in addition to the depth and size of the context tree, for instance:

  • the distribution of the context lengths;
  • balance (or lack thereof) in the context tree;
  • covariate use;
  • etc.

Handle factors that degenerate to a single level

Some contexts may be associated with constant covariates. When this happens for factor-valued covariates, it is detected by glm, which fails with a "contrasts can be applied only to factors with 2 or more levels" error. In practice, the logistic model is rank deficient, which will be detected by node_fit_glm, and the model will be discarded. A simple solution for this problem consists in catching the error and returning a fake model to trigger model discarding directly.
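The proposed fix can be sketched with tryCatch; the helper name is hypothetical and NULL stands in for the "fake model" that would trigger discarding:

```r
## Catch the glm failure on single-level factors instead of letting it
## propagate; the caller then treats NULL as a discarded model.
fit_or_discard <- function(formula, data) {
  tryCatch(glm(formula, data = data, family = binomial()),
           error = function(e) NULL)
}

## A covariate that is constant within a context: a single-level factor
d <- data.frame(y = c(0, 1, 0, 1), x = factor(rep("a", 4)))
is.null(fit_or_discard(y ~ x, d))
#> [1] TRUE
```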

Report merged contexts for covlmc

Contexts that share the same model, i.e. merged contexts, are reported as standard contexts by contexts.covlmc(). This is not an error stricto sensu, but it can be misleading in some situations. We should add an option that adds a specific column indicating whether each context is standard or merged.

Release mixvlmc 0.1.0

First release:

Prepare for release:

  • git pull
  • urlchecker::url_check()
  • devtools::build_readme()
  • devtools::check(remote = TRUE, manual = TRUE)
  • devtools::check_win_devel()
  • git push
  • Draft blog post

Submit to CRAN:

  • usethis::use_version('minor')
  • devtools::submit_cran()
  • Approve email

Wait for CRAN...

  • Accepted 🎉
  • Add preemptive link to blog post in pkgdown news menu
  • usethis::use_github_release()
  • usethis::use_dev_version(push = TRUE)
  • Finish blog post
  • Tweet

Show target values in draw.covlmc (optionally)

When the state space has three or more states, some contexts may be followed only by a strict subset of the states. In this case, VGAM::vglm() reduces automatically the adjusted model to estimate the transition probabilities. This should be shown in draw.covlmc() when model="full" (and possibly when model="coef").

Multinomial regression breaks when the context corresponds to a single value of the target variable.

The bug is triggered by

x <- rep(c(0, 1), 1000)
y <- data.frame(y = rep(0, length(x)))
options(mixvlmc.predictive = "multinom")
x_covlmc <- covlmc(x, y)

which produces the following error message:

Error in nnet::multinom(target ~ ., data = mm, trace = FALSE): need two or more classes to fit a multinom model

This corresponds to a situation where a context (here '0' or '1') selects a subset of the time series with a unique value of the state space.

In this situation, one should output a degenerate model with constant predictions. A possible solution to avoid degeneracy is to use a "Bayesian" approach with pseudo-observations of the non-observed states.

Create a global option for the C++ back end

The use of the C++ back end is currently specified at the function level when using ctx_tree(), vlmc() and tune_vlmc(). It would be easier to use with, in addition, a global option specifying the default back end. This would also reduce redundancy in the tests, as some of them are identical between the R and C++ versions.
