fabrice-rossi / mixvlmc

Variable Length Markov Chains with Covariates
Home Page: https://fabrice-rossi.github.io/mixvlmc/
License: GNU General Public License v3.0
The code supports time series with states that can be numerical, factors or characters, but some of those types are not preserved. For instance

dts <- sample(c(0L, 1L), 100, replace = TRUE)
dts_tree <- ctx_tree(dts, max_depth = 4)
dts_contexts <- contexts(dts_tree)
typeof(dts_contexts[[1]])

gives "character" instead of "integer".
Bug #3 was fixed using a new constant_model object which mimics a logistic regression when the target class is constant. To avoid numerical instabilities, we use a "Bayesian" regularization, which amounts to adding fake observations of all target values that are not observed in the data set. The level of regularization (and whether it is applied at all) should be under user control and documented.
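As a sketch of what this pseudo-observation regularization amounts to (the function name and the alpha parameter below are hypothetical, not part of the package API):

```r
# Hypothetical illustration of pseudo-observation regularization: adding
# `alpha` fake observations per state keeps unobserved states away from
# probability 0 (and the fit away from numerical instabilities).
regularized_probs <- function(counts, states, alpha = 1) {
  full <- setNames(rep(0, length(states)), states)
  full[names(counts)] <- counts
  (full + alpha) / (sum(full) + alpha * length(states))
}

# a context followed only by state "0" in the data:
regularized_probs(c("0" = 10), states = c("0", "1"))
# probabilities 11/12 and 1/12 instead of a degenerate 1 and 0
```

With alpha = 1 this corresponds to a uniform Dirichlet prior; whatever level is chosen, it should be exposed to the user and documented, as stated above.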
Contexts that share the same model, i.e. merged contexts, are reported as standard contexts by contexts.covlmc(). This is not an error stricto sensu, but it can be misleading in some situations. We should add an option that reports in a dedicated column whether each context is standard or merged.
The current implementations of simulate.* do not implement the full contract of stats::simulate, as they do not add a seed attribute with information about the random seed.
The current implementation of draw as of commit 2e97c38 shows logistic models as a flat list of coefficients. This is acceptable when the state space has only two values and a single covariate. With more values or more covariates, it is quite difficult to read. In addition, the ordering of the coefficients depends on the underlying implementation of the logistic model (see issue #2).
This could be used by contexts to return more efficiently everything associated with each context (frequency of next symbols, positions, etc.). It could also be used by a new find_context function that returns a single context (if it exists).
The current implementation of the VLMC with covariates uses logistic regression to estimate the transition probabilities associated with a context from the covariates. When the state space has only two states, this is based on glm with the binomial family (estimated via spaMM_glm.fit from the spaMM package), while larger state spaces use the VGAM package (more specifically, vglm with the multinomial family).
It would be interesting to support other estimators such as multinom
from the nnet package or simple tables for discrete covariates. The default case could be specified by some global options. Passing the estimators as a parameter of the fitting function might be a bit too complex.
Fast VLMC construction can be done using a linear time suffix tree construction algorithm such as Ukkonen's algorithm. It makes sense to implement it in C++ and to keep the full representation in C++ as well. Steps:
This should be first limited to VLMC. The cost of COVLMC is dominated by the logistic model estimation and thus supporting a simple transformation of the C++ representation into the current R representation (with nested lists) should be sufficient as a first integration step.
VLMC with covariates may include poor predictive logistic models which perform barely better than constant models (a.k.a. conditional probabilities). In order to enable proper interpretation of a covlmc, the package should compute quality metrics for the models, on the learning time series and possibly on new time series. Metrics could include:
The metrics could be reported by:
- the contexts function, via new columns in the data.frame format
- the draw function
- the summary function, in aggregated form

multi_ctx_tree
multi_vlmc
multi_tune_vlmc
The draw.* functions can be improved using different approaches, for instance through draw_control(). Several solutions could be used:
- plot/ggplot
- igraph
- exporting an igraph or similar representation, leaving the user free to choose their preferred visualisation

A fully simplified VLMC consists simply in a stationary model with an empty context. While prune and tune_vlmc produce this type of model, the way they are handled is not consistent with what happens with more complex models. For instance
set.seed(0)
dts <- sample(0:2, 1000, replace = TRUE)
model <- as_vlmc(tune_vlmc(dts))
produces a stationary model which is reported to have 0 contexts and thus 0 parameters, leading to a wrong value of the AIC/BIC, among other problems. In some situations, this will also lead to the loss of matched positions.
Notice that this problem does not manifest with the C++ backend.
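The intended accounting can be sketched in base R (no mixvlmc calls): a stationary model over k states has k - 1 free parameters, and the AIC/BIC should be computed from that count rather than from 0.

```r
set.seed(0)
dts <- sample(0:2, 1000, replace = TRUE)

counts <- table(dts)
# stationary log-likelihood: each observation contributes the log of its
# state's empirical frequency
ll <- sum(counts * log(counts / length(dts)))
n_par <- length(counts) - 1   # 2 free parameters for 3 states, not 0
aic <- -2 * ll + 2 * n_par
```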
When the state space has three or more states, some contexts may be followed by only a strict subset of the states. In this case, VGAM::vglm() automatically reduces the fitted model used to estimate the transition probabilities. This should be shown in draw.covlmc() when model="full" (and possibly when model="coef").
Values computed by the cutoff.*
functions correspond directly to the thresholds used in the pruning phase of VLMC or coVLMC. As the pruning tests are written as follows
if (p_value > cutoff) {
prune
} else {
do not prune
}
using directly the cutoff values may not induce pruning.
This can be fixed by either:
- changing the test to p_value >= cutoff
- reporting values slightly smaller than the internal cutoff values, so that the strict comparison still triggers pruning
The first solution departs from what is used in other VLMC implementations, leading to potential differences between the results of very similar codes. The second solution is easy to implement in Rcpp using the nextafter
standard function.
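The two options can be sketched in base R (the real fix would use std::nextafter in Rcpp; the (1 - .Machine$double.eps) scaling below only approximates a one-ulp step and is an illustration, not the package's code):

```r
prunes <- function(p_value, cutoff) p_value > cutoff

# solution 1: make the comparison inclusive
prunes_inclusive <- function(p_value, cutoff) p_value >= cutoff

# solution 2: report values nudged just below the internal thresholds so
# that the strict comparison still triggers pruning
nudge_below <- function(x) x * (1 - .Machine$double.eps)

p <- 0.05
prunes(p, p)                # FALSE: the raw value does not induce pruning
prunes_inclusive(p, p)      # TRUE
prunes(p, nudge_below(p))   # TRUE
```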
The bug is triggered by
x <- rep(c(0, 1), 1000)
y <- data.frame(y = rep(0, length(x)))
options(mixvlmc.predictive = "multinom")
x_covlmc <- covlmc(x, y)
which produces the following error message:
Error in nnet::multinom(target ~ ., data = mm, trace = FALSE): need two or more classes to fit a multinom model
This corresponds to a situation where a context (here '0' or '1') selects a subset of the time series with a single value of the state space.
In this situation, one should output a degenerate model with constant predictions. A possible way to avoid the degeneracy is a "Bayesian" approach with pseudo-observations of the unobserved states.
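A minimal sketch of such a degenerate model, assuming S3 dispatch; constant_model is the name used for bug #3 above, but the exact fields and the predict signature here are illustrative only:

```r
# Degenerate model for a context whose sub-series contains a single state:
# predict that state with probability one, whatever the covariates.
constant_model <- function(seen_state, states) {
  probs <- setNames(as.numeric(states == seen_state), states)
  structure(list(probs = probs), class = "constant_model")
}

predict.constant_model <- function(object, newdata, ...) {
  # one identical row of probabilities per observation in newdata
  matrix(object$probs, nrow = nrow(newdata), ncol = length(object$probs),
         byrow = TRUE, dimnames = list(NULL, names(object$probs)))
}

m <- constant_model("0", states = c("0", "1"))
predict(m, data.frame(y = c(0, 0, 0)))   # three rows of (1, 0)
```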
Currently vlmc()
and covlmc()
support only character, factor or numeric values for the state space. It should be easy to support directly logical values (in the internal function to_dts()
).
Documenting the package and implementing some of the tests is tedious as we do not have a simple data set to illustrate everything. We should add at least one.
A covlmc object needs to store one logistic model per context, which induces a rather large memory consumption. In practice, we only need these models to compute predictions. Storing a full glm object for this is a waste of space. In addition, we already need to unify to some extent the internal representation of the coefficients between the different estimation engines that can be used (see #2). We should go one step further and replace the full objects returned by other packages by bare-bones representations.
As it is recommended to drop the initial samples produced by a (CO)VLMC when used for bootstrap estimation, it would be convenient to have this feature implemented directly by simulate.vlmc()
. For simulate.covlmc()
the situation is much more complex as covariates are needed.
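A possible shape for this feature, sketched with a hypothetical burnin argument and a toy model class (neither is part of the package):

```r
# Wrapper sketch: draw nsim + burnin states, then drop the first burnin
# ones, as recommended for bootstrap estimation.
simulate_with_burnin <- function(object, nsim, burnin = 0, ...) {
  full <- simulate(object, nsim = nsim + burnin, ...)
  full[(burnin + 1):(burnin + nsim)]
}

# toy model whose simulate method just returns 1:nsim, for illustration
simulate.toy <- function(object, nsim, ...) seq_len(nsim)
simulate_with_burnin(structure(list(), class = "toy"), nsim = 3, burnin = 2)
# returns 3 4 5
```

For simulate.covlmc() the covariates passed by the user would need the same shift, which is why the covariate case is more complex.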
as_vlmc
contexts
cutoff
draw
logLik
loglikelihood
metrics
print
prune
simulate
summary
trim
tune_vlmc
Todo:
Should be easier with a data set (#9).
The covlmc code internally uses a model list with a H0 component. During the initial model fitting, it is true if the H0 hypothesis, that the true model is the simplest one in terms of historical covariates, cannot be rejected (by the likelihood ratio test). This is used to decide whether the enclosing context can be collapsed or merged. However, the use of H0 by node_prune_model is not clearly specified. It seems that no bug is currently triggered by this inconsistency, but a clearer definition of the role of this component should improve the robustness of the estimation procedure.
Some contexts may be associated with constant covariates. When this happens for factor-valued covariates, it is detected by glm, which fails with a "contrasts can be applied only to factors with 2 or more levels" error. In practice, the logistic model is rank deficient, which will be detected by node_fit_glm, and the model will be discarded. A simple solution consists in catching the error and returning a fake model that directly triggers model discarding.
The cutoff
and prune
functions for vlmc
enable an efficient search for an optimal model according to e.g. the BIC. A similar functionality for covlmc
models would be very useful, especially considering that the estimation time for those models is significantly longer than for vlmc
models.
The ground work for cutoff
has already been implemented in commit a9600eb.
Post pruning needs the data to be saved in the covlmc
but the added memory consumption should be compensated by the reduced total processing time when we look for an optimal model. Models fitted to the data but discarded could also be saved to speed up the process even more.
Model selection for vlmc is rather easy to implement but is slightly more complex for covlmc. It would be useful to include model selection directly in the package, using AIC and BIC for instance.
(cutoff, prune, etc.)

We use Bayesian estimation of the transition probabilities in some specific cases and only in covlmc
(see Issue #6). This could be extended to the general case, as in the PST R package.
The implementation of loglikelihood.covlmc added by commit 06e4bb2 does not support the extended context matching mode available for vlmc. The main issue is the lack of logistic models in internal nodes of the context tree. Those models were generally not evaluated during the construction of the covlmc (some may have been tested).
To enable extended context matching, we need to keep potentially useful models (see also issue #15) and to compute the other ones on the fly. If we do not want to store the data in the covlmc object, we need to compute all models preemptively. Because of the rather large associated memory occupation, we may either wait for issue #12 to be solved or at least trim the models.
The fully automated aspect of the tune_* functions is convenient, but some control is needed in certain situations. In particular, when the computational burden is expected to be high, the conservative initial cutoff value can lead to an important waste of resources. This can be fixed by allowing the user to specify the initial cutoff.
The package has numerous S3 methods which use varargs (...). Using https://github.com/r-lib/ellipsis to check their proper use would catch many basic errors, such as argument name misspellings.
Many functions do not sufficiently validate their parameters before calling low level functions, leading to uninformative error messages. In particular:
- loglikelihood does not check the compatibility of newdata (see simulate.vlmc() for the validation code)
- draw.covlmc() does not validate model
This is the most consistent choice with the rest of the functionalities provided by the package.
Currently trim.covlmc
trims only glm
and multinom
models. We should at least remove the residuals, the fitted values, the predictors and the terms.
SuffixTree should have a proper header file and the module export should be done in another file that includes headers of all the classes to export.
The contexts
function is useful to get the list of all contexts in a context tree, but it does not give access to the context specific data that are available for vlmc
and covlmc
objects. It would be interesting to specialize the function for those classes to report:
vlmc
covlmc
Currently the only simple use of tune_*vlmc
results consists in converting them to *vlmc
objects. It would be interesting to add other uses such as:
Ongoing work:
The goal is to implement a predict.* function which can work with a new series or the one used to estimate the model. For covlmc, the covariates must be specified as in simulate.covlmc
.
First release:
usethis::use_cran_comments()
Title:
and Description:
@return
and @examples
Authors@R:
includes a copyright holder (role 'cph')

Prepare for release:
git pull
urlchecker::url_check()
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)
When a new discrete time series is sampled with simulate
, the first states are obtained via a context free distribution (for obvious reasons). In some situations, it would be useful to allow the user to specify an initial context.
The use of the C++ back end is currently specified at the function level when using ctx_tree()
, vlmc()
and tune_vlmc()
. A global option specifying the default back end would make this easier to use, and would also reduce redundancy in the tests, as some of them are identical between the R and C++ versions.
In some circumstances, it might be interesting to store discarded models to avoid recomputing them during post pruning. However, most of them might be invalidated once a leaf of the context tree is pruned. As currently models use a lot of memory, this should be done after implementing feature #12.
Prepare for release:
git pull
urlchecker::url_check()
devtools::build_readme()
devtools::check(remote = TRUE, manual = TRUE)
devtools::check_win_devel()
cran-comments.md
git push
Submit to CRAN:
usethis::use_version('minor')
devtools::submit_cran()
Wait for CRAN...
usethis::use_github_release()
usethis::use_dev_version(push = TRUE)
Likelihood calculations for VLMC with covariates are missing. We need:
- logLik.covlmc
- a loglikelihood method that can compute the log likelihood of a new discrete time series with covariates

The effects of options are already documented in the impacted functions, but package level documentation would also be convenient.
The covlmc code expects a data.frame and does not work correctly when given a tibble.
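A defensive fix can be sketched as an entry-point coercion (check_covariates is a hypothetical helper, not a package function); the usual culprit is single-bracket column extraction, where a tibble returns a one-column tibble while a data.frame returns a vector:

```r
# Coerce tibbles (and other data.frame subclasses) to plain data.frames
# so that downstream code gets base data.frame subsetting semantics.
check_covariates <- function(x) {
  if (!is.data.frame(x)) stop("covariates must be provided as a data.frame")
  as.data.frame(x)
}

y <- check_covariates(data.frame(a = 1:3))
class(y)   # "data.frame"
```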
The summary functions could report statistics in addition to the depth and size of the context tree, for instance: