fabrice-rossi / mixvlmc

Variable Length Markov Chains with Covariates
Home Page: https://fabrice-rossi.github.io/mixvlmc/
License: GNU General Public License v3.0
The code supports time series with states that can be numerical, factors or characters, but some of those types are not preserved. For instance

dts <- sample(c(0L, 1L), 100, replace = TRUE)
dts_tree <- ctx_tree(dts, max_depth = 4)
dts_contexts <- contexts(dts_tree)
typeof(dts_contexts[[1]])

gives "character" instead of "integer".
Bug #3 was fixed using a new constant_model object which mimics a logistic regression when the target class is constant. To avoid numerical instabilities, we use a "Bayesian" regularization, which amounts to adding fake observations of all target values that are not observed in the data set. The level of regularization (and whether it is applied at all) should be under user control and documented.
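As a sketch of what this pseudo-observation regularization amounts to (the function name and the alpha parameter below are hypothetical, not part of the package API):

```r
# Hypothetical illustration of pseudo-observation regularization: adding
# `alpha` fake observations per state keeps unobserved states away from
# probability 0 (and the fit away from numerical instabilities).
regularized_probs <- function(counts, states, alpha = 1) {
  full <- setNames(rep(0, length(states)), states)
  full[names(counts)] <- counts
  (full + alpha) / (sum(full) + alpha * length(states))
}

# a context followed only by state "0" in the data:
regularized_probs(c("0" = 10), states = c("0", "1"))
# probabilities 11/12 and 1/12 instead of a degenerate 1 and 0
```

With alpha = 1 this corresponds to a uniform Dirichlet prior; whatever level is chosen, it should be exposed to the user and documented, as stated above.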
Contexts that share the same model, i.e. merged contexts, are reported as standard contexts by contexts.covlmc(). This is not an error stricto sensu, but it can be misleading in some situations. We should add an option that reports in a dedicated column whether each context is standard or merged.
The current implementations of simulate.* do not implement the full contract of stats::simulate, as they do not add a seed attribute with information about the random seed.
The current implementation of draw as of commit 2e97c38 shows logistic models as a flat list of coefficients. This is acceptable when the state space has only two values and a single covariate. With more values or more covariates, it is quite difficult to read. In addition, the ordering of the coefficients depends on the underlying implementation of the logistic model (see issue #2).
This could be used by contexts to return more efficiently everything associated with each context (frequency of next symbols, positions, etc.). It could also be used by a new find_context function that returns a single context (if it exists).
The current implementation of the VLMC with covariates uses logistic regression to estimate the transition probabilities associated with a context from the covariates. When the state space has only two states, this is based on glm with the binomial family (estimated via spaMM_glm.fit from the spaMM package), while larger state spaces use the VGAM package (more specifically, vglm with the multinomial family).
It would be interesting to support other estimators such as multinom
from the nnet package or simple tables for discrete covariates. The default case could be specified by some global options. Passing the estimators as a parameter of the fitting function might be a bit too complex.
Fast VLMC construction can be done using a linear time suffix tree construction algorithm such as Ukkonen's algorithm. It makes sense to implement it in C++ and to keep the full representation in C++ as well. Steps:
This should be first limited to VLMC. The cost of COVLMC is dominated by the logistic model estimation and thus supporting a simple transformation of the C++ representation into the current R representation (with nested lists) should be sufficient as a first integration step.
VLMC with covariates may include poor predictive logistic models which perform barely better than constant models (a.k.a. conditional probabilities). In order to enable proper interpretation of a covlmc, the package should compute quality metrics for the models, on the learning time series and possibly on new time series. Metrics could include:
The metrics could be reported by:
- the contexts function, via new columns in the data.frame format
- the draw function
- the summary function, in aggregated form

multi_ctx_tree
multi_vlmc
multi_tune_vlmc
The draw.* functions can be improved using different approaches, for instance through draw_control(). Several solutions could be used:
- plot/ggplot
- igraph
- exporting an igraph or similar representation, leaving the user free to choose their preferred visualisation

A fully simplified VLMC consists simply in a stationary model with an empty context. While prune and tune_vlmc produce this type of model, the way they are handled is not consistent with what happens with more complex models. For instance
set.seed(0)
dts <- sample(0:2, 1000, replace = TRUE)
model <- as_vlmc(tune_vlmc(dts))
produces a stationary model which is reported to have 0 contexts and thus 0 parameters, leading to a wrong value of the AIC/BIC, among other problems. In some situations, this will also lead to the loss of matched positions.
Notice that this problem does not manifest with the C++ backend.
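The intended accounting can be sketched in base R (no mixvlmc calls): a stationary model over k states has k - 1 free parameters, and the AIC/BIC should be computed from that count rather than from 0.

```r
set.seed(0)
dts <- sample(0:2, 1000, replace = TRUE)

counts <- table(dts)
# stationary log-likelihood: each observation contributes the log of its
# state's empirical frequency
ll <- sum(counts * log(counts / length(dts)))
n_par <- length(counts) - 1   # 2 free parameters for 3 states, not 0
aic <- -2 * ll + 2 * n_par
```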
When the state space has three or more states, some contexts may be followed by only a strict subset of the states. In this case, VGAM::vglm() automatically reduces the fitted model used to estimate the transition probabilities. This should be shown in draw.covlmc() when model="full" (and possibly when model="coef").
Values computed by the cutoff.*
functions correspond directly to the thresholds used in the pruning phase of VLMC or coVLMC. As the pruning tests are written as follows
if (p_value > cutoff) {
prune
} else {
do not prune
}
using directly the cutoff values may not induce pruning.
This can be fixed by either:
- changing the test to p_value >= cutoff
- reporting values slightly smaller than the internal cutoff values, so that the strict comparison still triggers pruning
The first solution departs from what is used in other VLMC implementations, leading to potential differences between the results of very similar codes. The second solution is easy to implement in Rcpp using the nextafter
standard function.
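The two options can be sketched in base R (the real fix would use std::nextafter in Rcpp; the (1 - .Machine$double.eps) scaling below only approximates a one-ulp step and is an illustration, not the package's code):

```r
prunes <- function(p_value, cutoff) p_value > cutoff

# solution 1: make the comparison inclusive
prunes_inclusive <- function(p_value, cutoff) p_value >= cutoff

# solution 2: report values nudged just below the internal thresholds so
# that the strict comparison still triggers pruning
nudge_below <- function(x) x * (1 - .Machine$double.eps)

p <- 0.05
prunes(p, p)                # FALSE: the raw value does not induce pruning
prunes_inclusive(p, p)      # TRUE
prunes(p, nudge_below(p))   # TRUE
```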
The bug is triggered by
x <- rep(c(0, 1), 1000)
y <- data.frame(y = rep(0, length(x)))
options(mixvlmc.predictive = "multinom")
x_covlmc <- covlmc(x, y)
which produces the following error message:
Error in nnet::multinom(target ~ ., data = mm, trace = FALSE): need two or more classes to fit a multinom model
This corresponds to a situation where a context (here '0' or '1') selects a subset of the time series with a single value of the state space.
In this situation, one should output a degenerate model with constant predictions. A possible way to avoid the degeneracy is a "Bayesian" approach with pseudo-observations of the unobserved states.
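A minimal sketch of such a degenerate model, assuming S3 dispatch; constant_model is the name used for bug #3 above, but the exact fields and the predict signature here are illustrative only:

```r
# Degenerate model for a context whose sub-series contains a single state:
# predict that state with probability one, whatever the covariates.
constant_model <- function(seen_state, states) {
  probs <- setNames(as.numeric(states == seen_state), states)
  structure(list(probs = probs), class = "constant_model")
}

predict.constant_model <- function(object, newdata, ...) {
  # one identical row of probabilities per observation in newdata
  matrix(object$probs, nrow = nrow(newdata), ncol = length(object$probs),
         byrow = TRUE, dimnames = list(NULL, names(object$probs)))
}

m <- constant_model("0", states = c("0", "1"))
predict(m, data.frame(y = c(0, 0, 0)))   # three rows of (1, 0)
```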
Currently vlmc()
and covlmc()
support only character, factor or numeric values for the state space. It should be easy to support directly logical values (in the internal function to_dts()
).
Documenting the package and implementing some of the tests is tedious as we do not have a simple data set to illustrate everything. We should add at least one.
A covlmc object needs to store one logistic model per context, which induces a rather large memory consumption. In practice, we only need these models to compute predictions. Storing a full glm object for this is a waste of space. In addition, we already need to unify to some extent the internal representation of the coefficients between the different estimation engines that can be used (see #2). We should go one step further and replace the full objects returned by other packages by bare-bones representations.
As it is recommended to drop the initial samples produced by a (CO)VLMC when used for bootstrap estimation, it would be convenient to have this feature implemented directly by simulate.vlmc()
. For simulate.covlmc()
the situation is much more complex as covariates are needed.
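A possible shape for this feature, sketched with a hypothetical burnin argument and a toy model class (neither is part of the package):

```r
# Wrapper sketch: draw nsim + burnin states, then drop the first burnin
# ones, as recommended for bootstrap estimation.
simulate_with_burnin <- function(object, nsim, burnin = 0, ...) {
  full <- simulate(object, nsim = nsim + burnin, ...)
  full[(burnin + 1):(burnin + nsim)]
}

# toy model whose simulate method just returns 1:nsim, for illustration
simulate.toy <- function(object, nsim, ...) seq_len(nsim)
simulate_with_burnin(structure(list(), class = "toy"), nsim = 3, burnin = 2)
# returns 3 4 5
```

For simulate.covlmc() the covariates passed by the user would need the same shift, which is why the covariate case is more complex.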
as_vlmc
contexts
cutoff
draw
logLik
loglikelihood
metrics
print
prune
simulate
summary
trim
tune_vlmc
Todo:
Should be easier with a data set (#9).
The covlmc code internally uses a model list with a H0 component. During the initial model fitting, it is true if the H0 hypothesis, that the true model is the simplest one in terms of historical covariates, cannot be rejected (by the likelihood ratio test). This is used to decide whether the enclosing context can be collapsed or merged. However, the use of H0 by node_prune_model is not clearly specified. It seems that no bug is currently triggered by this inconsistency, but a clearer definition of the role of this component should improve the robustness of the estimation procedure.
Some contexts may be associated with constant covariates. When this happens for factor-valued covariates, it is detected by glm, which fails with a "contrasts can be applied only to factors with 2 or more levels" error. In practice, the logistic model is rank deficient, which will be detected by node_fit_glm, and the model will be discarded. A simple solution consists in catching the error and returning a fake model that directly triggers model discarding.
The cutoff
and prune
functions for vlmc
enable an efficient search for an optimal model according to e.g. the BIC. A similar functionality for covlmc
models would be very useful, especially considering that the estimation time for those models is significantly longer than for vlmc
models.
The ground work for cutoff
has already been implemented in commit a9600eb.
Post pruning needs the data to be saved in the covlmc
but the added memory consumption should be compensated by the reduced total processing time when we look for an optimal model. Models fitted to the data but discarded could also be saved to speed up the process even more.
Model selection for vlmc is rather easy to implement but is slightly more complex for covlmc. It would be useful to include model selection directly in the package, using AIC and BIC for instance.
(cutoff, prune, etc.)

We use Bayesian estimation of the transition probabilities in some specific cases and only in covlmc
(see Issue #6). This could be extended to the general case, as in the PST R package.
The implementation of loglikelihood.covlmc added by commit 06e4bb2 does not support the extended context matching mode available for vlmc. The main issue is the lack of logistic models in internal nodes of the context tree. Those models were generally not evaluated during the construction of the covlmc (some may have been tested).
To enable extended context matching, we need to keep potentially useful models (see also issue #15) and to compute the other ones on the fly. If we do not want to store the data in the covlmc object, we need to compute all models preemptively. Because of the rather large associated memory occupation, we may either wait for issue #12 to be solved or at least trim the models.
The fully automated aspect of the tune_* functions is convenient, but some control is needed in certain situations. In particular, when the computational burden is expected to be high, the conservative initial cutoff value can lead to an important waste of resources. This can be fixed by allowing the user to specify the initial cutoff.
The package has numerous S3 methods which use varargs (...). Using https://github.com/r-lib/ellipsis to check their proper use would catch many basic errors, such as argument name misspellings.
Many functions do not sufficiently validate their parameters before calling low level functions, leading to uninformative error messages. In particular:
- loglikelihood does not check the compatibility of newdata (see simulate.vlmc() for the validation code)
- draw.covlmc() does not validate model
This is the most consistent choice with the rest of the functionalities provided by the package.
Currently trim.covlmc
trims only glm
and multinom
models. We should at least remove the residuals, the fitted values, the predictors and the terms.
SuffixTree should have a proper header file and the module export should be done in another file that includes headers of all the classes to export.
The contexts
function is useful to get the list of all contexts in a context tree, but it does not give access to the context specific data that are available for vlmc
and covlmc
objects. It would be interesting to specialize the function for those classes to report:
vlmc
covlmc
Currently the only simple use of tune_*vlmc
results consists in converting them to *vlmc
objects. It would be interesting to add other uses such as:
Ongoing work:
The goal is to implement a predict.* function which can work with a new series or the one used to estimate the model. For covlmc, the covariates must be specified as in simulate.covlmc
.
First release:
usethis::use_cran_comments()
Title:
and Description:
@return
and @examples
Authors@R:
includes a copyright holder (role 'cph')

Prepare for release:
git pull
urlchecker::url_check()
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)
When a new discrete time series is sampled with simulate
, the first states are obtained via a context free distribution (for obvious reasons). In some situations, it would be useful to allow the user to specify an initial context.
The use of the C++ back end is currently specified at the function level when using ctx_tree()
, vlmc()
and tune_vlmc()
. A global option specifying the default back end would make this easier to use, and would also reduce redundancy in the tests, as some of them are identical between the R and C++ versions.
In some circumstances, it might be interesting to store discarded models to avoid recomputing them during post pruning. However, most of them might be invalidated once a leaf of the context tree is pruned. As currently models use a lot of memory, this should be done after implementing feature #12.
Prepare for release:
git pull
urlchecker::url_check()
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
cran-comments.md
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)
Likelihood calculations for VLMC with covariates are missing. We need:
- logLik.covlmc
- a loglikelihood method that can compute the log likelihood of a new discrete time series with covariates

The effects of options are already documented in the impacted functions, but package level documentation would also be convenient.
The covlmc code expects a data.frame and does not work correctly when given a tibble.
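A defensive fix can be sketched as an entry-point coercion (check_covariates is a hypothetical helper, not a package function); the usual culprit is single-bracket column extraction, where a tibble returns a one-column tibble while a data.frame returns a vector:

```r
# Coerce tibbles (and other data.frame subclasses) to plain data.frames
# so that downstream code gets base data.frame subsetting semantics.
check_covariates <- function(x) {
  if (!is.data.frame(x)) stop("covariates must be provided as a data.frame")
  as.data.frame(x)
}

y <- check_covariates(data.frame(a = 1:3))
class(y)   # "data.frame"
```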
The summary functions could report statistics in addition to the depth and size of the context tree, for instance: