stan_models's Issues

allow separate priors for each tree's branch length modifiers

Currently there is a single flat prior partitioning the total meta variance. But, for instance, since I have zero confidence in my Symbiodinium tree's branch lengths but lots of confidence in the host tree's, it would be nice to allow the Symbiodinium branch lengths to be modified extremely broadly.
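A minimal sketch of what this could look like, with hypothetical parameter names and placeholder prior scales (the point is just that each tree's modifiers get their own scale):

    parameters {
      vector[NHostNodes] hostBranchLogMod;  // log-scale branch length modifiers
      vector[NSymNodes] symBranchLogMod;
      real<lower=0> hostModScale;           // separate prior scale per tree
      real<lower=0> symModScale;
    }
    model {
      hostModScale ~ normal(0, 0.1); // host branch lengths trusted: keep modifiers tight
      symModScale ~ normal(0, 2);    // Symbiodinium branch lengths not trusted: allow broad changes
      hostBranchLogMod ~ normal(0, hostModScale);
      symBranchLogMod ~ normal(0, symModScale);
    }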

implement sparse matrix multiplication on log scale

There is a built-in for sparse matrix multiplication, which might be necessary for scalability as the model matrix and the number of microbes increase. At the same time, these matrix products seem to be having issues with numerical stability, because they are exponentiated values being multiplied with one another. So I've written my own log-scale matrix multiplication functions, which use log_sum_exp(). These functions are much slower than both the built-in standard matrix multiplication and the built-in sparse matrix multiplication, so a custom implementation that combines the two might be necessary.
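For reference, a minimal version of the dense log-scale product (the general idea, not necessarily the exact function in the repo): each output element is a log_sum_exp over sums of log-scale entries, so nothing is ever exponentiated into an unstable range.

    functions {
      // out[i,j] = log(sum_k exp(logA[i,k] + logB[k,j]))
      matrix log_mat_prod(matrix logA, matrix logB) {
        matrix[rows(logA), cols(logB)] out;
        for (i in 1:rows(logA))
          for (j in 1:cols(logB))
            out[i, j] = log_sum_exp(logA[i]' + logB[, j]);
        return out;
      }
    }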

adding treatment effects to source mixture model

The current model calculates contributions from each mixture component, sums them up, applies a logit transformation, and adds treatment effects. The effects thus essentially act as multipliers of the final odds after mixing, which might not be very realistic. If a given taxon is 99% of the tissue community and only 1% of the mucus community, and these communities are mixed at a 1:1 ratio, the final odds of that taxon vs. all others would also be 1:1. A nutrient effect that is expected to multiply the odds of the taxon in each community by 2 would change the original proportions to 99.5% and 2%, for final odds of 1.03:1. But the same multiplicative effect applied after calculating the final mixture would give a result of 2:1! How would I interpret such an effect? It would be as if the effect were acting on the whole meta-community, which in theory doesn't exist in these proportions until the sample is taken. Maybe it could be interpreted as the members of the tissue community doubling in absolute numbers and then overwhelming the mucus community - but if that were the case, the final mixture proportions should reflect it, and the tissue community would be more than 50% of the final sample.

So, I should incorporate the treatment effects before calculating mixtures. What about overdispersion/error? Sample-to-sample variation could conceivably occur both before and after the sample is collected. Of course, each mixture component should vary independently of the others (e.g. taxon A might be higher than usual in the tissue of sample 1, but lower than usual in the mucus of sample 1). But it would be an identifiability nightmare to give separate error to each taxon in each component of each sample, and ultimately these are 'nuisance' parameters. So, is it better to have a single error parameter for each taxon in each sample, added before calculating the mixtures, or should it be added after? Before might be more consistent and would allow for variance partitioning with the other effects. A sketch of the pre-mixing version is below.
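A sketch of the pre-mixing version, with hypothetical names (K source communities; effects and errors are added on the logit scale within each component before the simplex mixing weights are applied):

    // for one taxon in one sample:
    vector[K] componentProb;
    for (k in 1:K)
      componentProb[k] = inv_logit(baseLogit[k] + treatmentEffect + sampleError);
    real mixedProb = dot_product(mixProportions, componentProb); // mixProportions: simplex[K]
    // here sampleError (one per taxon per sample) enters pre-mix; the post-mix
    // alternative would add it to logit(mixedProb) instead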

Distinguish between general increase in evolution vs codiv

The Hadfield model's coevolution term is more consistent with something like environmental filtering, where host traits evolve and change the chances that a microbe or clade of microbes will associate with the host. A process of strict codiversification, on the other hand, would imply that branch lengths are irrelevant - only the topology really matters, and every speciation should be associated with the same amount of variance (essentially, the chance that the microbe co-speciates). Because of this, a tree with the expected variance under codiv would have more total variance in clades with more branches, with all branch lengths equal. To incorporate such an expectation into the model, I would need to /add/ variance at each node rather than /scale/ the variance with multipliers. I should be able to sub-partition the variance due to host into two parts using a flat dirichlet, and then divide the codiv part again among all the nodes (maybe a sum-to-zero normal effect for each node, which is exponentiated and scaled so the mean is equal to the codiv variance, and then added to the node's existing variance); see the sketch below.
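A sketch of the additive scheme, with hypothetical names: a simplex splits the host variance into a filtering-like part and a codiv part, and centered normal effects are exponentiated (via softmax) and rescaled so the added per-node variances average to the codiv share.

    parameters {
      simplex[2] hostVarParts;        // [1] filtering-like, [2] codiv-like
      vector[NHostNodes] codivRaw;    // per-node deviations
    }
    transformed parameters {
      real codivVar = hostVar * hostVarParts[2];
      // softmax sums to one, so scaling by NHostNodes * codivVar makes the mean
      // added variance equal codivVar; this is *added* to, not multiplied with,
      // each node's existing variance
      vector[NHostNodes] nodeAddedVar = NHostNodes * codivVar * softmax(codivRaw);
    }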

add ultrametric-sustaining node modifications

Currently, the parameter that modifies a branch length deals with correlated rates - that is, it's essentially assumed that the ultrametric tree is already accurate in terms of the times of the nodes, and that variance can increase or decrease in an entire clade due to a speedup or slowdown of the rate of evolution. So, to increase an internal branch length, all descendant branch lengths must also increase. However, as is illustrated well by the multifurcating SymPortal tree I'm using for Symbiodinium, a more realistic modification may sometimes be almost the exact opposite: a node's /age/ is 'wrong', so to 'fix' it, the branch leading to it should be modified, but all descendant branches should be modified in the opposite direction, so that the total variance at the tips stays the same. This could be implemented by tracking node heights and using a simplex, so that the parameter is a proportion of the distance between the parent node and the closest descendant node; see the sketch below.
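A rough sketch of that idea, ignoring the knock-on constraints between nested nodes for simplicity (the bounding heights here come from the input tree as data, and all names are hypothetical):

    parameters {
      vector<lower=0, upper=1>[NInternalNodes] heightProp;
    }
    transformed parameters {
      vector[NInternalNodes] newHeight;
      for (n in 1:NInternalNodes)
        // slide each node between its closest descendant node and its parent;
        // tips stay fixed, so total root-to-tip variance is preserved
        newHeight[n] = closestChildHeight[n]
                       + heightProp[n] * (parentHeight[n] - closestChildHeight[n]);
    }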

implement symmetric priors on effect-coded variables

With 'simple' effect coding, a given category gets N-1 'effects' that act as raw values explaining the difference of N-1 levels from the grand mean, and the Nth level is the negative sum of all those effects. The marginal distribution for the Nth level is therefore different from that of all the other levels, and I noticed in the Symbiodinium analysis that this appears to have drastic practical consequences (effects of 'colony name' were non-significant and hovering from ~-1 to 1 for every single level except the composed reference level, which consistently had absurd values of ~-50 that didn't make any sense given prior expectations OR the data).

I want to modify the model so that the marginal distributions are perfectly symmetrical for all levels. This post on the Stan forums seems to offer a solution: https://discourse.mc-stan.org/t/test-soft-vs-hard-sum-to-zero-constrain-choosing-the-right-prior-for-soft-constrain/3884/27.
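One simple option in the spirit of that thread (not necessarily the exact one it recommends), with a hypothetical name: give all N levels their own raw parameter and subtract the mean, so every level has an identical marginal (normal with sd sqrt(1 - 1/N)), at the cost of one redundant parameter.

    parameters {
      vector[N] colonyEffectRaw;  // one parameter per level
    }
    transformed parameters {
      vector[N] colonyEffect = colonyEffectRaw - mean(colonyEffectRaw); // sums to zero
    }
    model {
      colonyEffectRaw ~ normal(0, 1); // identical, exchangeable marginals
    }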

incorporate within-sample information

Currently, the model conflates effects on alpha diversity with effects on generalism/specialism. This is because, for a given factor or host node, a general increase in the probability of finding things increases all symbiont types in all samples, which is probably easiest to interpret as a change in alpha diversity; but it also means less-likely symbiont types are more likely to be found in some samples, influencing generalism. What might be ideal is recognizing that an increase in alpha diversity would increase the probability for all types (the current model), whereas an increase in generalism would keep the overall probability of finding something (the sum of the various type-specific probabilities?) the same while essentially only increasing the probability of finding the rarer types (essentially evening out or smoothing the probabilities). To do this, the various probabilities within each sample would have to be 'aware' of each other somehow.

One thought is to explicitly model the number of types found in a sample using, say, a Poisson distribution, and then pass the current model output (slightly modified so it doesn't have a grand mean parameter) to a softmax function and a Wallenius' noncentral multivariate hypergeometric distribution (https://en.wikipedia.org/wiki/Wallenius%27_noncentral_hypergeometric_distribution) with all k=1 to determine which particular types are represented. Since the existing parameters would then be normalized on a per-sample basis, they would only represent generalism-specialism, not alpha diversity.

incorporate non-biological samples

Currently running the cophylogeny model on a dataset that has non-biological samples filtered out. In theory, it shouldn't be too hard to include the sediment and water samples - I should just add two more sum-to-zero effects (isBiological & waterVsSediment) and make sure the ancestry matrices are calculated properly so that the non-biological samples just get zeros across the board. The trick is making sure the current code doesn't get bugs added to it.

simplify log-scale sparse matrix multiplications

I've realized that, as the model exists, all instances of log-scale sparse matrix multiplication are even more trivial than they seem. The sparse 'ancestry' matrices I'm using are all completely binary, so no multiplication at all is needed once we have the indices of the 1's and 0's. I should just use the indices to pull the values from the second matrix and sum them.
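Assuming the products in question are of the form log(A * exp(logB)) with binary A (and that every row of A has at least one nonzero entry), a sketch with A stored as CSR-style index arrays (u: row start positions, v: column indices of the 1's):

    functions {
      matrix log_binary_prod(int[] u, int[] v, matrix logB) {
        int nRows = size(u) - 1;
        matrix[nRows, cols(logB)] out;
        for (i in 1:nRows)
          for (j in 1:cols(logB))
            // no multiplication at all: gather the indexed entries and combine
            out[i, j] = log_sum_exp(logB[segment(v, u[i], u[i + 1] - u[i]), j]);
        return out;
      }
    }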

improve model for phylo variance modifiers if interpreting as codiv

Currently, each host-microbe node pair's phylogenetic variance modifier is applied to the nodes' immediate leading edges and to all descendant edges. (The model is currently somewhat like the Hadfield model, except that it works independently on all subclades rather than testing the tree all at once.) However, codiversification is more likely to lead to an asymmetrical difference in variance among descendant edges, because the host clade that does not inherit child A will very uniformly not be associated with child A - there's not more variance with respect to child A, but actually less, even though variance should increase among the descendants of child B. Perhaps something like nested flat dirichlet priors could be applied to the variance modifier at all descendant nodes to distribute it among descendants differently.

more consistent phylogenetic variance modifiers

Currently, I use lognormal terms to modify branch lengths and generate new 'global' branch lengths, then have additional terms to modify branch lengths in a cophylogenetic setting specifically. However, the other factors (from the modelMat) could also theoretically use their own branch lengths, and if they don't, it is impossible to interpret the global changes as changes specific to 'alpha diversity' or 'prevalence'. I would like to change the current global modifier into a single grand mean modifier, and then use sum-to-zero effects to allow each factor and the host specificity to have their own changes.

log crossprodexp

Implement a way to normalize the variance multipliers in a numerically stable way, and save the normalized result rather than just passing it on. The normalized result should be identifiable, but the raw one isn't, and that could be what's leading to terrible nEff for those results.
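A sketch of the normalization itself, assuming the multipliers live on the log scale: subtracting log_sum_exp and adding log(N) makes them average to exactly 1 on the natural scale without ever leaving the log scale.

    // logMultRaw: unnormalized log-scale variance multipliers
    vector[N] logMultNorm = logMultRaw - log_sum_exp(logMultRaw) + log(N);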

formalize links between location effects and changes in variance

When non-biological samples were added to the model, it made me consider how to model other correlated effects of 'transitions' from a free-living to a host-associated lifestyle, and vice versa. For instance, it could be that a host-associated lifestyle increases or decreases the rate of evolution. So, a single parameter could be useful to assess such a linkage: an increase in the location effect for 'isBiological' has a corresponding increase in the variance effect, perhaps related via a single global scaling parameter?
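A minimal sketch of such a linkage, with hypothetical names (a single global lambda ties the two effects together):

    parameters {
      real locationEffectBio;  // effect of host association on prevalence/abundance
      real lambda;             // global scaling linking location and variance effects
    }
    transformed parameters {
      real logVarModBio = lambda * locationEffectBio; // implied log-scale variance modifier
    }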

allow a model matrix on sequence variants in addition to the samplewise model matrix

Although much less metadata is often available or used for microbial species/ASVs than is available for samples, some could be very useful, and it should be possible to incorporate it - for instance, whether a bacterium is aerobic vs. anaerobic, or the degree of heat tolerance of each Symbiodinium type. This could also be used so that ITS2 sequence variants take the place of the SymPortal ITS2 profiles, making the incorporation of the Symbiodinium phylogeny easier, while still incorporating the SymPortal profiles into the model as levels of a factor.

identify microbeScales etc.

microbeScales is currently normalized to one by dividing microbeVarRaw by its mean root-to-tip distance. Ideally, we would constrain microbeVarRaw to have a constant mean root-to-tip distance without the need for normalization. This should drop a single parameter and work on a simplex instead. There's potential for considerable performance improvement.
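A sketch of the simplex version, with hypothetical names (microbeAncestry is assumed to be a binary tips-by-nodes matrix, so microbeAncestry * props gives each tip's root-to-tip distance):

    parameters {
      simplex[NMicrobeNodes] microbeVarProps; // proportions replace the raw scales
    }
    transformed parameters {
      // rescale so the mean root-to-tip distance is exactly 1, with no separate
      // normalization parameter left in the model
      vector[NMicrobeNodes] microbeVar
        = microbeVarProps * (NMicrobeTips / sum(microbeAncestry * microbeVarProps));
    }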

incorporate rate of molecular evolution into variance estimates

Currently, I've written the model to explicitly use a chronogram or ultrametric tree, which makes interpretation of many terms easier. However, if a particular clade has had more rapid molecular evolution, as would be shown by a non-ultrametric tree, it is reasonable to ask whether the increase in molecular evolution is correlated with an increase in the rate of evolution of the traits assessed in our model. If so, a non-ultrametric tree would provide more accurate estimates of variance than a forcibly ultrametricized one.

Three ideas come to mind to implement this: (1) simplify the model and remove the time bins, at which point any tree could be used as input; (2) calculate the proportion of each branch that belongs to each time bin using an ultrametricized tree, but then fill the edgeToBin matrices with the raw branch lengths from the original tree; or (3) incorporate the original branch lengths into a model matrix that acts on variance parameters, and estimate another metaparameter for the correlation between those original branch lengths and the optimum variance.

gather code blocks into functions and annotate code

This one is relatively self-explanatory - unlike the current code. Things have gotten complex and are difficult even for me to follow. In particular, the block of code that partitions variance among subfactors is a mess.

implement QR

I had abandoned the QR parameterization a while ago because the way it's used in the manual doesn't allow each column of the model matrix to have its own variance, and I couldn't see how to implement it. I have now done so for one portion of the model that could use it - the rawNodeEffects on one dimension. I'm not entirely sure it will help, given the way the Stan developers have limited its use, but I keep getting a pesky single divergence out of 1250 transitions, and it's worth a shot to get rid of it and maybe improve the scaling of the model.

incorporate OU process evolution

A previous attempt to incorporate an Ornstein-Uhlenbeck parameter was misguided, as it used a tree transformation; especially given the model's ultimately non-ultrametric tree, the OU process cannot be represented with a tree structure. In fact, it may not be possible to use the noncentered matrix multiplication parameterization at all, because the absolute prior state has to be known in order to calculate the state of a descendant node.

dx.doi.org/10.1111/2041-210X.12201

The mean state at time t is given by the following formula (from the Wikipedia article on the Vasicek model: https://en.wikipedia.org/wiki/Vasicek_model), and this mean state corresponds to the deterministic adjustment that should be added on top of the Brownian motion:

\mathrm{E}[r_t] = r_0 e^{-at} + b(1 - e^{-at})

So, I need to iterate through nodes with a loop starting at the root, applying the adjustment to each one to calculate the actual state (see the sketch below). I assume this could be done for one of either host or microbe first, and then repeated for the other, but I would need to check to make sure they don't have to be done simultaneously. The scaled Brownian components prior to this adjustment would still be the interesting bits that tell how the evolution along a branch deviated from the marginal expectation.
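A sketch of that loop, assuming node indices are ordered so parents come before children, and using the Vasicek mean formula above (names hypothetical):

    // brownianDev: the scaled brownian (noncentered) component for each node
    vector[NNodes] state;
    state[root] = rootState;
    for (n in nodesPreorder) {                // parents always visited first
      real w = exp(-a * branchLength[n]);
      // deterministic pull toward the optimum b, plus the branch's own deviation
      state[n] = w * state[parent[n]] + b * (1 - w) + brownianDev[n];
    }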

implement stan 2.19's glm functions

Stan 2.19 introduced functions that automatically calculate the log likelihood for GLMs, combining the likelihood step with the previously separate matrix multiplication step. The calculations can even be passed to the GPU for possible drastic speedups (~30x??). Using these functions will require me to go back to the original Kronecker product implementation rather than my current two-step matrix multiplication on two dimensions. Shouldn't be too hard; I only need to make sure matrix-to-vector conversions match the order of the Kronecker product (which I can easily precalculate in R).
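A sketch of the target form, assuming a Bernoulli observation model and hypothetical names. R's as.vector() and Stan's to_vector() are both column-major, matching vec() in the identity (A (x) B) vec(X) = vec(B X A'), which is what has to line up with the precalculated Kronecker product.

    model {
      // kronMat precalculated in R as kronecker(A, B); yFlat flattened to match
      yFlat ~ bernoulli_logit_glm(kronMat, alpha, to_vector(effects));
    }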

merge marginalization of host tree into cophylogeny project

I originally explored the 'logistic_community_glm' model separately from the 'cophylogeny' model, and implemented multiple versions of it that allowed me to choose whether to marginalize over multiple versions of the host tree. I simultaneously developed the cophylogeny model to allow branch lengths to be modified during model fitting (among other things). Then I realized that a shortcoming of my 'community glm' was that it relied very strongly on the accuracy of the input branch lengths, which is a terrible assumption given the complete lack of branch lengths for the Symbiodinium tree. Thus the 'cophylogeny' model is likely to be useful more generally, not just for the codiversification question specifically. So I've shifted my attention to that one, but haven't yet implemented the ability to marginalize over multiple versions of the host tree.

clean up continuous indexing code in BGPFA

In order to reduce the size of inverted matrices in BGPFA, I implemented some complicated indexing that selects sub-model matrices that are independent of the rest, and processes their likelihoods separately. The code to do this is a nightmare in both transformed data{} and model{}, but it made a huge speed difference. Think about ways to simplify!

implement clade-by-clade variance modifiers for codivergence estimation

A model with clade-by-clade effects (as implemented in my hierarchical model, or with a multivariate normal and S^-1 as in Hadfield) will test whether there is a general association between clades, but doesn't necessarily test codivergence. For instance, if all bacteria are included in a model, the MRCA of all Endozoicomonas might be significantly associated with the ancestor of all Scleractinians, but the distribution of the various Endo types among Scleractinian species could still be random or uniform. The variance associated with the Hadfield cophylogeny term is a better measure than the effects themselves, because it implies that there are multiple clade-by-clade associations that lead to more specificity, not just one such association. However, since it is applied globally to the entire dataset, it doesn't allow us to identify the clades in which such specificity is greater (the GCMP Oz paper, for instance, suggests that cophylogeny is actually restricted to one sub-clade of Endozoicomonaceae rather than the whole family, even though the whole family was tested all at once and there was 'significant' cophylogenetic variance). A more general model might allow not just 'effects' to vary clade-by-clade, but the variance itself. The variance term should be identifiable relative to the 'effect' term because it would be applied to all descendant nodes, allowing some to have 'effects' that run strongly counter to the higher-level ones.

I imagine giving the root variance a positive-bounded prior such as exponential(1), and then giving each node a multiplier defined by a lognormal(0,sigma) prior. The actual variance used for the effect of a given node would be the product of its own multiplier, all the ancestral multipliers, the root variance, and the leading branch length.

The current issue is that my models have been partitioning variance with a grand mean and a dirichlet, which wouldn't really be compatible (or at least wouldn't be as interpretable) when combined with the above scheme. I would need to allow each node to have unconstrained variance, rather than lying on the simplex. Or, I suppose I could use a softmax transform of the sum of normal accumulators, and instead of having an exponential at the root just have nothing there at all. A sketch of the multiplier scheme is below.
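A sketch of the multiplier scheme itself, with hypothetical names: since the log of a product of multipliers is a sum, the products along each lineage can be accumulated with a single binary-ancestry matrix multiplication.

    parameters {
      real<lower=0> rootVariance;
      vector[NNodes] logNodeMult;   // one multiplier per node, lognormal(0, sigma)
      real<lower=0> sigmaMult;
    }
    transformed parameters {
      // nodeAncestry[i,k] = 1 if node k is node i or one of i's ancestors, so the
      // matrix product sums the logs of all relevant multipliers
      vector[NNodes] nodeVariance
        = exp(nodeAncestry * logNodeMult + log(rootVariance)) .* branchLengths;
    }
    model {
      rootVariance ~ exponential(1);
      logNodeMult ~ normal(0, sigmaMult);
    }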

splinephy using min() for positive definiteness

Currently, the 'splinephy' model creates a smooth function relating evolutionary distance to covariance, but the raw function does not guarantee that the resulting matrix will be positive definite. Thus, the minimum eigenvalue is added to the diagonal to force it. This could be a problem in Stan, because it should break the gradient, probably in many places. The model as written is divergent on almost all transitions, and this may be the primary reason.

https://github.com/rmcminds/stan_models/blob/eda67cee95b32e3e38fdb7f90dc6e2a226a9d1b6/cophylogeny/splinephy_logistic.stan

distinguish between ancestral states and shared effects

As implemented, for each node the model estimates the effects shared by all of its descendants. This is not the same as estimating the ancestral state of each node, because the OU transformation of the branch lengths reduces variance (or the rate of evolution) near the root, whereas the ancestral states would have had the same rate of evolution at all points of the tree. In addition to the lack of interpretability as ancestral states, this also might make the apparent advantages of soft-centering moot?

implement something similar to Hadfield 2012 for cophylogeny

The current cophylogeny script 'splinephy' tried to get fancy, but it would be good to first test the method and data on a model similar to the one already used. It is hard to say whether the original model with multiple hostXsymbiont interaction terms will run faster or slower, because the time saved by not having to recalculate the covariance matrix and Cholesky factors may be offset by all the extra parameters.

Alternate pre-existing models

My balance tree methods were initially based on gneiss in qiime2. Other methods also exist that get at some of the same problems:

'Mixed Effect Dirichlet-Tree Multinomial for Longitudinal Microbiome Data and Weight Prediction': apparently the 'DTM' has already been established and used in microbiome analyses, where the OTU phylogeny is incorporated much like in the gneiss balance tree approach, but with dirichlet distributions on the original counts. This paper incorporates covariates into the model as well.

allow OU parameter to vary for different interactions

To really ask whether one clade of microbes is 'more cophylogenetic' than the background, that clade's branch lengths shouldn't just be scaled linearly, but should give more influence to the older nodes, such as by decreasing the strength of the OU attraction. So if the microbe tree has an overall OU attraction strength of 10, perhaps the strength is only 4 within the Endozoicomonas group, meaning the ancestor of Endozoicomonas entered a new mode such that its descendants are no longer pulled toward the overall mean as strongly.

combination of branch length modifiers and time bins is difficult to interpret

The 'rate of evolution' in each time bin might not actually represent the mean rate of evolution, because the lognormal modifiers distort it. I could keep the time bin modifiers, but might consider ditching the dirichlet, because it isn't consistent with the other modifications and was mostly used for easy interpretation.

implement Bridge Sampling for across-tree comparisons

Currently, I incorporate uncertainty about the phylogeny by fitting a separate chain for each of multiple draws from the phylogenetic posterior, and then simply pooling all the draws across chains for comparable parameters. However, some phylogenetic draws will be less likely than others, and simple pooling across chains ignores this fact, lending no extra insight into the phylogeny and resulting in improper weights for the comparable parameters.

One solution appears to be the bridgesampling R package (https://arxiv.org/pdf/1710.08162.pdf), which can calculate the posterior probability of each entire model. Draws from each chain can then be resampled according to that chain's posterior probability, so a chain that uses a very unlikely tree will not contribute much to the parameter estimates.

To do this, I will need to:

  1. Change all "~" style sampling statements in the model to "target +=" style statements. The "~" style statements drop normalizing constants that are fixed within a chain, but since those constants vary among chains, they need to be included in the log posterior (see the sketch after this list).
  2. Attach 'bridgesampling' and apply bridge_sampler() to each chain independently, then post_prob() to all of them at once. This returns a result in itself: updated posterior probabilities for each tree, lending insight back into the phylogeny.
  3. monitor() each chain without resampling and then take the weighted mean of the estimates for each parameter, with weights corresponding to the posterior probabilities of each tree: https://discourse.mc-stan.org/t/question-about-lp-as-weight/2386. Think about how this affects hypothesis testing... (could we instead resample from the chains according to their probabilities??)
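For step 1, the conversion looks like this on a generic sampling statement (the _lpdf form retains the normalizing constants that the tilde form drops):

    model {
      // before: constants dropped; log posterior not comparable across chains
      // y ~ normal(mu, sigma);
      // after: full log density retained, as bridge sampling requires
      target += normal_lpdf(y | mu, sigma);
    }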

source mixtures does not converge; has divergent transitions

I still have not gotten the source-tracking mixtures model to work in any sense of the word. The constraints on the average source proportions do not appear to be strong enough to identify the model. I might try stronger scale priors (e.g. normals rather than cauchys), but I am concerned that this would pull all the various sink types too strongly toward even mixtures, when in reality we should expect (or hope for) mixtures that are very strongly biased toward their single respective source types.

Test data might ultimately be one of the biggest problems, because the coral-algal dataset is messy. I need to create better simulated data to test the model!
