TuringGLM.jl's Introduction

TuringGLM


TuringGLM makes it easy to specify Bayesian Generalized Linear Models using the formula syntax and returns an instantiated Turing model.

Heavily inspired by brms (uses RStan or CmdStanR) and bambi (uses PyMC3).

The @formula macro is extended from StatsModels.jl, along with MixedModels.jl for random effects (a.k.a. group-level predictors).
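For example, a minimal sketch with a hypothetical data frame df holding columns y and x:

using TuringGLM, Turing

# hypothetical data frame `df` with a response column :y and a predictor column :x
fm = @formula(y ~ x)
model = turing_model(fm, df)        # an instantiated Turing model
chn = sample(model, NUTS(), 2_000)  # sample it like any other Turing model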


TuringGLM.jl's People

Contributors

burtonjosh, dependabot[bot], devmotion, github-actions[bot], kleinschmidt, nilshg, pitmonticone, rikhuijzer, storopoli, torfjelde


TuringGLM.jl's Issues

New `return` values for Posterior Predictive Checks

We are not sure how you usually perform posterior predictive checks, but e.g. using rand(model) will just sample a NamedTuple of the variables (or maybe @torfjelde changed it to a Dict already, since NamedTuple does not work as generally, IIRC), so it's not necessary to explicitly return a named tuple of variables in general.

Originally posted by @devmotion in #64 (comment)
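For context, a minimal sketch of that approach (hypothetical data frame df; the parameter names α, β, and σ are assumed to match TuringGLM's linear model):

using TuringGLM

model = turing_model(@formula(y ~ x), df)

# rand(model) draws the model's random variables from the prior and returns them
# as a NamedTuple (newer DynamicPPL versions also support rand(Dict, model)),
# so the variables are accessible without an explicit `return` in the model body.
draw = rand(model)
draw.α, draw.β, draw.σ   # assumed parameter names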

Pretty printing of Turing models

I am trying to use TuringGLM to better learn Turing, so I wrote the simplest linear model and would like to re-write it in Turing.

using RDatasets
using TuringGLM

data = RDatasets.dataset("datasets", "mtcars")

fm = @formula(MPG ~ WT)
model = turing_model(fm, data)
model
DynamicPPL.Model{TuringGLM.var"#normal_model#19"{Int64,Int64,CustomPrior},(:y, :X, :predictors, :μ_X, :σ_X, :prior, :residual),(:predictors, :μ_X, :σ_X, :prior, :residual),(),Tuple{Vector{Float64},Matrix{Float64},Int64,Int64,Int64,CustomPrior,Float64},Tuple{Int64,Int64,Int64,CustomPrior,Float64},DynamicPPL.DefaultContext}(TuringGLM.var"#normal_model#19"{Int64,Int64,CustomPrior}(0, 1, CustomPrior(TDist{Float64}(ν=3.0), LocationScale{Float64,Continuous,TDist{Float64}}(
                μ:19.2σ:5.411498097545447ρ:TDist{Float64}(ν=3.0)
            ), nothing), Core.Box(TuringGLM.var"#normal_model#18#20"(Core.Box(TuringGLM.var"#normal_model#19"{Int64,Int64,CustomPrior}())))), (y=[21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2 … 15.2, 13.3, 19.2, 27.3, 26.0, 30.4, 15.8, 19.7, 15.0, 21.4], X=[2.62; 2.875…3.57; 2.78;;], predictors=1, μ_X=0, σ_X=1, prior=CustomPrior(TDist{Float64}(ν=3.0), LocationScale{Float64,Continuous,TDist{Float64}}( #= circular reference @-4 =#
                μ:19.2σ:5.411498097545447ρ:TDist{Float64}(ν=3.0)
            ), nothing), residual=6.026948052089105), (predictors=1, μ_X=0, σ_X=1, prior=CustomPrior(TDist{Float64}(ν=3.0), LocationScale{Float64,Continuous,TDist{Float64}}(
                μ:19.2σ:5.411498097545447ρ:TDist{Float64}(ν=3.0)
            ), nothing), residual=6.026948052089105), DynamicPPL.DefaultContext())

Unfortunately, I can't get my head around this output well enough to write it in standard Turing syntax. Is there a way to "pretty-print" or format the Turing model call? Thanks!
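For reference, a hand-written Turing model roughly equivalent to the one above might look like the sketch below; the prior constants are taken from the printed CustomPrior and residual values, and this is an approximation for readability rather than TuringGLM's exact internal model:

using Turing, LinearAlgebra

@model function mpg_regression(X, y)
    # location 19.2 and scale 5.411498097545447 are the values of the printed CustomPrior above
    α ~ 19.2 + TDist(3) * 5.411498097545447
    # one TDist(3) coefficient per predictor column
    β ~ filldist(TDist(3), size(X, 2))
    # 6.026948052089105 is the `residual` value shown in the printed model above
    σ ~ Exponential(6.026948052089105)
    # Gaussian likelihood
    y ~ MvNormal(α .+ X * β, σ^2 * I)
end

X = reshape(data.WT, :, 1)   # data from the snippet above (mtcars)
y = data.MPG
model_by_hand = mpg_regression(X, y)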

Feature list for 0.1.0 release

  • turing_model function:
    turing_model(formula, data;
          family=Normal,
          priors=DefaultPriors(),
          standardize=false)
    
  • Close interface with DataFrames.jl and CategoricalArrays.jl using the Tables.jl API:
    • Reading the length(unique(x)) of a group-level intercept/slope
    • Auto dummy variable ($K - 1$) with categorical vectors by reading the levels and using the first level as baseline.
  • Likelihoods:
    • Linear Regression: Gaussian and TDist (IdentityLink)
    • Logistic Regression: Bernoulli
    • Count Data: Poisson and NegativeBinomial (LogLink)
  • Priors:
    • Default Priors and Custom Priors
  • Group-level effects (Hierarchical Models):
    • Non-Centered Parameterization, recomputing the original parameters in the return statement
    • Varying-Intercept: (1 | group) (sketched in the example after this list)
  • Case Studies showcasing:
    • @formula syntax
    • random effects: random-intercept
    • likelihoods
    • CategoricalArrays.jl and DataFrames.jl (Tables.jl) interfaces
    • Use the {rstanarm} datasets (check LICENSE) and the sleep dataset from {brms}
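A minimal sketch of the varying-intercept syntax from the list above, with hypothetical column names and the default likelihood and priors:

using TuringGLM, Turing

# hypothetical data frame `df` with response :y, predictor :x, and grouping column :group
fm = @formula(y ~ x + (1 | group))   # fixed slope for x, varying intercept per group
model = turing_model(fm, df)         # default Gaussian likelihood and default priors
chn = sample(model, NUTS(), 2_000)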

Multivariate GLMs

Multivariate GLMs are very common in econometrics and also very useful in general. I think this is low-hanging fruit in terms of relative effort vs payoff: Multivariate GLMs should be simple to implement, since they just require generalizing a handful of formulas to accept matrices, but they're extremely useful because they solve the problems typically misidentified as "Multiple comparisons."


use intercept traits w/ `apply_schema` to handle implicit intercept

StatsModels provides the notion of "intercept traits" (https://github.com/JuliaStats/StatsModels.jl/blob/master/src/traits.jl) to control the automatic intercept-adding that normally happens in GLMs. This package is a great example of a model type with an "implicit intercept" or, rather, a "dropped intercept": everything is centered (AFAICT), so the intercept is always zero and shouldn't be included.

I think what you'd want to do is define a model type (if there isn't one already) and add the trait methods for it.
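A minimal sketch of what that could look like, assuming a hypothetical marker type TuringGLMModel; the trait functions implicit_intercept and drop_intercept are the ones defined in StatsModels' traits.jl:

using StatsModels, StatsAPI, DataFrames

# hypothetical marker type identifying formulas destined for this package's models
struct TuringGLMModel <: StatsAPI.StatisticalModel end

# intercept traits: don't add an implicit intercept, and drop the intercept term,
# since the data are centered and the intercept is handled separately
StatsModels.implicit_intercept(::Type{TuringGLMModel}) = false
StatsModels.drop_intercept(::Type{TuringGLMModel}) = true

df = DataFrame(y = randn(10), x = randn(10))      # toy data for illustration
f = @formula(y ~ x)
sch = schema(f, df)
f_applied = apply_schema(f, sch, TuringGLMModel)  # intercept behavior now follows the traits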

Clarification on `Logistic` (and possibly rename?)

For likelihoods, TuringGLM.jl supports:

  • Gaussian() (the default if not specified): linear regression
  • Student(): robust linear regression
  • **Logistic(): logistic regression**
  • Pois(): Poisson count data regression
  • NegBin(): negative binomial robust count data regression

I'm flagging this because the name seems to imply two different things, and I'm not sure which is meant. "Logistic regression" almost always means regression with a Binomial likelihood using the logit link (logistic inverse link). However, the Logistic distribution also exists and can be used to perform robust linear regression (it has slightly thicker tails than a normal distribution, but unlike the t-distribution's, they decay exponentially, making it a good efficiency/robustness compromise). If these names are supposed to refer to likelihoods, Logistic would be inappropriate and the name could result in misunderstandings.

Perhaps we should make a clear distinction between a likelihood and the link function associated with it? There's no reason you can't use a logistic link with a normal likelihood, for example.
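To make the two readings concrete, here is a hedged Turing sketch (not TuringGLM internals; StatsFuns is assumed for logistic) of a Bernoulli-logit "logistic regression" versus a robust linear regression with a Logistic likelihood:

using Turing, StatsFuns

# reading 1: logistic regression = Bernoulli likelihood with a logit link
@model function bernoulli_logit(X, y)
    β ~ filldist(TDist(3), size(X, 2))
    y .~ Bernoulli.(logistic.(X * β))   # y ∈ {0, 1}
end

# reading 2: robust linear regression with a Logistic likelihood (identity link)
@model function robust_linear(X, y)
    β ~ filldist(TDist(3), size(X, 2))
    σ ~ Exponential(1)
    y .~ Logistic.(X * β, σ)            # y continuous; heavier-than-normal tails
end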

Make TuringGLM.jl work with ReverseDiff

Hi

This issue is based on extensive discussion that took place here: Turing Samples Slowly #1851. One of the key observations was that:

  1. Turing-based Bayesian models can be specified with the AD (automatic differentiation) backend set to either ForwardDiff or ReverseDiff. For hierarchical models, posterior sampling turns out to be several orders of magnitude faster with ReverseDiff (7-9 minutes) than with the default ForwardDiff (7-10 hours) for the specific data and models I tested extensively across Turing, TuringGLM, brms, and PyMC.
  2. Testing showed that TuringGLM models only work with ForwardDiff (7-10 hours to sample) and give errors with ReverseDiff. It would be wonderful to configure TuringGLM to work with ReverseDiff.
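For reference, switching the AD backend happens on the Turing side once TuringGLM returns the model. A hedged sketch using the older setadbackend API (newer Turing versions instead take an adtype keyword in the sampler constructor); the data frame and formula are placeholders:

using Turing, TuringGLM
using ReverseDiff   # the backend must be loaded before selecting it

model = turing_model(@formula(y ~ x + (1 | group)), df)   # hypothetical data frame `df`

Turing.setadbackend(:reversediff)   # select ReverseDiff globally (older Turing API)
Turing.setrdcache(true)             # tape caching usually helps when the model structure is fixed
chn = sample(model, NUTS(), 2_000)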

TuringGLM.jl Roadmap and Internals/API Discussion

Hello, all!

I was speaking with @yebai about how Turing could support the whole Bayesian workflow, following Bob Carpenter's presentation at ProbProg 2021 in October 2021, and I floated the idea that an easy interface for Turing could attract more users and be a great transition for users coming from a stats and/or R background. I am deeply interested in this because I mainly lecture graduate students in the applied social sciences, and if I showed them either Stan code or Turing model code they would run away. Thus, I use the friendly RStan interface {brms}, which uses the formula syntax (as you can see in this still-ongoing course YouTube playlist).

I also use Turing a lot in my research, and most of my models are hierarchical/generalized linear models that could easily be specified by a formula. As a result, I find myself copying and pasting a LOT of boilerplate code to define a model.

I suggested this package along with the following objectives, initial release plan, API design, and roadmap. We discussed them and would really love feedback.

I am also committed to long-term maintenance/development of TuringGLM, since it would solve all my lecturing and research needs. I am tired of students complaining that they cannot install {brms} because of RStan issues on Windows, and it would let me move all my lecturing and research to Julia. Furthermore, I think this package is a great opportunity to attract users to Julia/Turing/Bayesian statistics, which I am all for.

Objective of the Package

Use the @formula syntax to specify several different (generalized) (hierarchical) linear models in Turing.jl. The syntax is defined by StatsModels.jl and can be easily extended to GLMs (see GLM.jl) and to hierarchical models (see MixedModels.jl).

Heavily inspired by {brms} (uses RStan or CmdStanR), bambi (uses PyMC3) and StatsModels.jl.

My idea is just to create syntactic sugar for model definition: a @formula macro that translates to an instantiated Turing @model along with the data. The user just needs to call sample, advi, or another future Turing function (e.g. pathfinder) on the model returned from @formula and they are good to go. This way the package stays easy to maintain and focused on what's important: the @formula -> @model translation.

Discussion Points

  • @formula syntax:
    turing_model will return an instantiated Turing model with the data, so the user just has to call sample or advi etc. on the model. Example:

    m = turing_model(@formula y ~ x; data=my_df)
    chn = sample(m, NUTS(), 2_000)
  • How does the user specify custom priors?
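One possible answer, which is the approach used later in the benchmark issue below, is a CustomPrior struct holding the priors for the predictors, the intercept, and the auxiliary parameter (in that order):

using TuringGLM, Distributions

# CustomPrior(predictors, intercept, auxiliary); `nothing` keeps the default auxiliary prior
priors = CustomPrior(Normal(0, 10), Normal(52, 14), nothing)

fm = @formula(y ~ x)                         # hypothetical columns
model = turing_model(fm, df; priors=priors)  # hypothetical data frame `df`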

Feature list for 0.1.0 release (public repo)

  • Close interface with DataFrames.jl and CategoricalArrays.jl
    • Reading the length(unique(x)) of a group-level intercept/slope
    • Auto dummy variable ($K - 1$) with categorical vectors by reading the levels and using the first level as baseline.
  • Likelihoods:
    • Linear Regression: Gaussian and TDist (IdentityLink)
    • Logistic Regression: Bernoulli, Binomial (how would we accept a matrix of [success n - success]) and Geometric (LogitLink)
    • Count Data: Poisson and NegativeBinomial (LogLink)
  • Priors (Should be Distributions.jl dependent):
    • Default Priors and Custom Priors
    • Intercept
    • Coefficients
    • Auxiliary parameter (residuals)
  • Group-level effect (Hierarchical Models):
    • Non-Centered Parameterization, recomputing the original parameters in the return statement
    • Varying-Intercept: (1 | group)
    • Varying-Slope: (x_1 | group)
    • Varying-Intercept-Slope: (1 + x_1 | group) or (1 | group) + (x1 | group)
    • Correlated Group-level effects: LKJ() (OBS: fast implementation depends on TuringLang/Turing.jl/issues/1629) ???
  • Automatic QR decomposition with a keyword argument inside @formula
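For the QR item above, the usual reparameterization (popularized by Stan's manual) looks roughly like the following generic sketch; it is an illustration, not TuringGLM code:

using Turing, LinearAlgebra

# thin QR decomposition of the predictor matrix: X ≈ Q_ast * R_ast, with Q_ast having
# (approximately) orthogonal, unit-scale columns; in practice compute the QR once outside the model
@model function qr_regression(X, y)
    n, k = size(X)
    Q, R = qr(X)
    Q_ast = Matrix(Q) * sqrt(n - 1)
    R_ast = R / sqrt(n - 1)

    α ~ TDist(3)
    θ ~ filldist(Normal(0, 1), k)   # coefficients on the orthogonalized scale
    σ ~ Exponential(1)
    y ~ MvNormal(α .+ Q_ast * θ, σ^2 * I)

    β = R_ast \ θ                   # recover coefficients on the original scale
    return (; β)
end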

Roadmap

  • Likelihoods:
    • Survival stuff: LogNormal, Gamma, Exponential and Weibull
    • Zero-Inflated stuff (Mostly Mixtures):
      • Zero-Inflated Count Data: ZeroInflatedPoisson and ZeroInflatedNegativeBinomial.
      • Zero-Inflated Binary Data: ZeroInflatedBinomial and ZeroInflatedBeta
    • Ordinal Regression
  • Priors:
    • Horseshoe Prior and R2D2 prior for Sparse Bayesian Regression
  • Multivariate Models (more than 1 likelihood and/or dependent variable):
    • Stuff like f1 = @formula y1 ~ x1 and f2 = @formula y2 ~ x2.
  • Autoregressive Models Correlation Structures:
    • AR(p): Autoregressive
    • MA(q): Moving Average
    • ARMA(p, q): Autoregressive Moving Average
  • Warmup with Pathfinder.jl?
  • Integration with Bijectors.jl
  • Normalizing Flows?

Feature Request [enhancement]: Support for Multiple Response Variables using Syntax in TuringGLM.jl

I would like to ask whether there is any ongoing work on a feature in TuringGLM.jl that supports syntax for specifying multiple response variables within a single model. As far as I understand, the current formula syntax in TuringGLM.jl only allows modeling a single response variable. However, the ability to model multiple response variables would significantly enhance the usability and convenience of TuringGLM.jl.

Requested Feature

Compact Syntax for Multiple Response Variables: Implement a compact syntax in TuringGLM.jl that enables users to specify multiple response variables within a single model specification. This would allow users to model the relationships between multiple dependent variables and independent variables.

Proposed Formula Syntax

A compact syntax for multiple response variables could look similar to that of brms:

formula = @formula(y ~ condition + (condition | id),
                   w ~ condition + (condition | id),
                   v ~ condition + (condition | id))

In this proposed syntax, each response variable (y, w, v) is specified on a separate line, followed by the fixed effects (condition) and the random effects ((condition | id)).

If someone could provide guidance on where to start with these modifications, I would be happy to contribute to the implementation.

Suggestion: Bayesian median of means

As laid out here, the Bayesian median of means is a robust but always-consistent estimator of the mean (unlike e.g. the median). It applies the Bayesian bootstrap to the data to create a posterior distribution, then uses the median of the posterior. BRMS suggests using robust estimators when setting weakly-informative priors; maybe this is a good option for centering data? I might add this later on if that’s ok with you guys.
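A minimal sketch of the estimator as described above (Bayesian bootstrap posterior of the mean, then its median); the function name and draw count are arbitrary:

using Distributions, Statistics

# Bayesian bootstrap: posterior draws of the mean are weighted means of the data,
# with weights drawn from a flat Dirichlet; the estimator is the posterior median.
function bayesian_median_of_means(y::AbstractVector; ndraws::Int=4_000)
    n = length(y)
    w = rand(Dirichlet(ones(n)), ndraws)   # n × ndraws weight matrix
    posterior_means = vec(y' * w)          # weighted mean for each draw
    return median(posterior_means)
end

y = rand(LogNormal(0, 1), 200)             # toy skewed data
bayesian_median_of_means(y)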

Performance dip when using ForwardDiff compared to Turing

I've noticed that there's a performance dip when using ForwardDiff with a model defined in TuringGLM, compared to defining the model directly in Turing. I've set up a MWE to show this.

First I set up 4 models, two in TuringGLM (with and without custom priors), and two in Turing, with the default and custom priors given to the TuringGLM models.

using Turing, TuringGLM, TuringBenchmarking, BenchmarkTools
using ReverseDiff: ReverseDiff
using CSV, DataFrames, LinearAlgebra
using Statistics: std   # for std(y) in the models below

hibbs_df = CSV.read(
    download("https://raw.githubusercontent.com/avehtari/ROS-Examples/master/ElectionsEconomy/data/hibbs.dat"),
    DataFrame
);

# TuringGLM model
f = @formula(vote ~ growth)
m_glm = turing_model(f, hibbs_df)

# TuringGLM model with custom priors
priors = CustomPrior(Normal(0, 10), Normal(52, 14), nothing)
m_glm_custom = turing_model(f, hibbs_df; priors=priors)

# extract data for Turing models
y = TuringGLM.data_response(f, hibbs_df)
X = TuringGLM.data_fixed_effects(f, hibbs_df)

# model with default priors
@model function regression_default(X, y; residual=std(y))
    α ~ 50.755 + TDist(3.0)*6.071256084780443
    β ~ filldist(TDist(3.0), size(X,2))
    σ ~ Exponential(residual)

    y ~ MvNormal(α .+ X*β, σ^2*I)
end

m_turing = regression_default(X, y; residual=std(y))

# model with custom priors
@model function regression_custom(X, y; residual=std(y))
    α ~ Normal(52, 14)
    β ~ filldist(Normal(0, 10), size(X,2))
    σ ~ Exponential(residual)

    y ~ MvNormal(α .+ X*β, σ^2*I)
end

m_turing_custom = regression_custom(X, y; residual=std(y))

Then using TuringBenchmarking.jl, I benchmark each of the four models with both Forward and Reverse diff backends:

The results of the benchmark are shown in the table below. You can see that for ReverseDiff the benchmarks are the same, but with ForwardDiff TuringGLM is ~20-30% slower than Turing (the full results are included below).

| Model | ForwardDiff, linked (μs) | ReverseDiff, linked (μs) | ForwardDiff, not linked (μs) | ReverseDiff, not linked (μs) |
|---|---|---|---|---|
| TuringGLM (default prior) | 3.967 | 2.772 | 3.976 | 1.990 |
| Turing (default prior) | 3.046 | 2.676 | 3.059 | 1.931 |
| TuringGLM (custom prior) | 4.013 | 2.102 | 3.905 | 1.868 |
| Turing (custom prior) | 2.776 | 1.986 | 2.827 | 1.829 |
Detailed output:

TuringGLM model 1 (default priors)

suite_glm = TuringBenchmarking.make_turing_suite(
    m_glm,
    adbackends = [TuringBenchmarking.ForwardDiffAD{40}(), TuringBenchmarking.ReverseDiffAD{true}()]
)
run(suite_glm)

Output:

2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "linked" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "evaluation" => Trial(2.882 μs)
          "Turing.Essential.ReverseDiffAD{true}()" => Trial(2.772 μs)
          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(3.967 μs)
  "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "evaluation" => Trial(2.836 μs)
          "Turing.Essential.ReverseDiffAD{true}()" => Trial(1.990 μs)
          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(3.976 μs)

Turing model 1 (default priors)

suite_turing = TuringBenchmarking.make_turing_suite(
    m_turing,
    adbackends = [TuringBenchmarking.ForwardDiffAD{40}(), TuringBenchmarking.ReverseDiffAD{true}()]
)
run(suite_turing)

Output:

2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "linked" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "evaluation" => Trial(1.256 μs)
          "Turing.Essential.ReverseDiffAD{true}()" => Trial(2.676 μs)
          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(3.046 μs)
  "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "evaluation" => Trial(1.207 μs)
          "Turing.Essential.ReverseDiffAD{true}()" => Trial(1.931 μs)
          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(3.059 μs)

TuringGLM model 2 (custom priors)

suite_glm_custom = TuringBenchmarking.make_turing_suite(
    m_glm_custom,
    adbackends = [TuringBenchmarking.ForwardDiffAD{40}(), TuringBenchmarking.ReverseDiffAD{true}()]
)
run(suite_glm_custom)

Output:

2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "linked" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "evaluation" => Trial(2.724 μs)
          "Turing.Essential.ReverseDiffAD{true}()" => Trial(2.102 μs)
          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(4.013 μs)
  "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "evaluation" => Trial(2.737 μs)
          "Turing.Essential.ReverseDiffAD{true}()" => Trial(1.868 μs)
          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(3.905 μs)

Turing model 2 (custom priors)

suite_turing_custom = TuringBenchmarking.make_turing_suite(
    m_turing_custom,
    adbackends = [TuringBenchmarking.ForwardDiffAD{40}(), TuringBenchmarking.ReverseDiffAD{true}()]
)
run(suite_turing_custom)

Output:

2-element BenchmarkTools.BenchmarkGroup:
  tags: []
  "linked" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "evaluation" => Trial(1.176 μs)
          "Turing.Essential.ReverseDiffAD{true}()" => Trial(1.986 μs)
          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(2.776 μs)
  "not_linked" => 3-element BenchmarkTools.BenchmarkGroup:
          tags: []
          "evaluation" => Trial(1.160 μs)
          "Turing.Essential.ReverseDiffAD{true}()" => Trial(1.829 μs)
          "Turing.Essential.ForwardDiffAD{40, true}()" => Trial(2.827 μs)

Modify $\tau$ for GLM hierarchical models standard prior

Modify tau to be std(y) instead of mad(y).

When trying to improve a Turing hierarchical-intercept logistic model by reviewing turing_model.jl, I noticed that `function _model(μ_X, σ_X, prior, intercept_ranef, idx, ::Type{Bernoulli})` includes a normalization based on the dependent variable, which here is 0/1. This gives me an error because mad(y) in my case is 0, which breaks the hyperparameter $\tau$ for the SD. I thought I'd bring it to the developers' attention.

mad_y = mad(y; normalize=true) (line 266)

Originally posted by @jfhawkin in #21 (comment)
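A quick illustration of the failure mode with toy data (mad here is StatsBase's):

using StatsBase

y = vcat(zeros(70), ones(30))   # a 0/1 dependent variable, majority zeros
mad(y; normalize=true)          # == 0.0: the median is 0, so the absolute deviations from it
                                # also have median 0, and the scale for τ collapses to zero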

Slightly Broader Name?

Might be worth it to give a broader name than TuringGLM in case we want to expand a bit past GLMs. For instance, being able to specify a generalized additive model where some components are estimated with a Gaussian process regression would be a very neat feature.

[Tutorial] LOO-CV

There are some excellent packages for estimating Bayesian evidence for Turing models. They would allow us to perform model comparisons for various priors and model choices. We should consider supporting these options - it could be a (killer) feature!

LOO-CV can be viewed as a proxy for Bayesian evidence / marginal likelihood.

See Vehtari, A., Gelman, A., & Gabry, J. (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing. https://link.springer.com/article/10.1007/s11222-016-9696-4

Roadmap for Hierarchical Models

  • Group-level effects (Hierarchical Models):
    • Varying-Slope: (x_1 | group)
    • Varying-Intercept-Slope: (1 + x_1 | group)
    • Correlated Group-level effects:
  • Case Studies showcasing:
    • random effects: random-slope, random-intercept-slope

Getting more numerical errors when setting `standardize=true`

With an unstandardized dataset and standardize=false (the default), I got 3 numerical errors on a Logistic model. With standardize=true, I got hundreds of numerical errors. This shouldn't happen, right? With standardization, the sampler already starts roughly in the right place, meaning that fewer numerical errors should occur.

Incorrect residual calculation for default Exponential prior

The default prior for σ is Exponential(residual), where residual = 1 / std(y). This follows the same approach as e.g. rstanarm. However, the Exponential in Distributions.jl (which TuringGLM uses) takes the scale as its argument, whereas Stan takes the rate = 1/scale. The correct value should therefore be residual = std(y).
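To see the difference concretely with Distributions.jl:

using Distributions, Statistics

y = randn(100) .* 3            # toy response with std ≈ 3

# Distributions.jl parameterizes Exponential by its scale θ, so mean(Exponential(θ)) == θ
mean(Exponential(1 / std(y)))  # ≈ 1/3: far too tight for σ
mean(Exponential(std(y)))      # ≈ 3:   matches a rate of 1/std(y) in Stan's convention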

Gaussian Quadrature and Laplace Approximation

With large datasets, MCMC can get very slow. Luckily, approximations based on numerical integration usually get better as the sample size increases and can work in some situations where MCMC is impractical. The code for this is already implemented in MixedModels.jl and just needs priors added for it to work; we should probably take a look at it.

TagBot trigger issue

This issue is used to trigger TagBot; feel free to unsubscribe.

If you haven't already, you should update your TagBot.yml to include issue comment triggers.
Please see this post on Discourse for instructions and more details.

If you'd like for me to do this for you, comment TagBot fix on this issue.
I'll open a PR within a few hours, please be patient!

[Tutorial] Elliptical Slice Sampling

We should have a tutorial comparing the ESS sampler against HMC samplers. HMC is very popular, but partially due to its availability in Stan. In many interesting cases with Gaussian priors, ESS could be a competitive alternative, I think, both computationally and statistically!

cc @devmotion
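A hedged sketch of what such a comparison could look like; Turing's ESS requires Gaussian priors on the sampled parameters, and the toy model and settings below are illustrative only:

using Turing

@model function gaussian_prior_model(y)
    μ ~ Normal(0, 10)        # Gaussian prior, as elliptical slice sampling requires
    y .~ Normal.(μ, 1.0)
end

y = randn(50) .+ 2
model = gaussian_prior_model(y)

chn_ess  = sample(model, ESS(), 2_000)
chn_nuts = sample(model, NUTS(), 2_000)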
