stan-dev / rstanarm Goto Github PK

rstanarm R package for Bayesian applied regression modeling

License: GNU General Public License v3.0

R 84.67% Shell 0.03% Stan 12.88% CSS 0.29% HTML 1.57% C++ 0.20% M4 0.12% TeX 0.23%

stan multilevel-models r-package bayesian bayesian-inference bayesian-methods bayesian-statistics bayesian-data-analysis statistical-modeling r

rstanarm's People

Contributors

Stargazers

Watchers

Forkers

fandres70 mhandreae jburos pgnepal paulhendricks tomwallis sambrilleman lixiangchun detcitty singmann harleyjean tjmahr joshnr13 bluelou fsolt anhnguyendepocen eunbeekim datalulu paasim mansmeg nesanders radovankavicky gapdata dpsimpson jjtwwdei andylin512 cfitzge1 nemochina2008 phoenixstar7 mejihero bachlaw fartist wjakethompson lauken13 julbautista malcolmbarrett mcdonohue mcol rborchering statwonk alexadibenedetto alfcrisci devincaughey csetraynor dirmeier ermeel86 lcomm gakkilovemath sremmers nyiuab mattforshaw decision-sciences snaketron stjordanis jyuu elzinga kaz-yos abikoushi davors pamelanluna lilymedina liuxiangyu127 karthy257 flynngo mriganav061 economer danschrage alexanderbor hrossman maj-biostat alexandre-caron fanzheng0352 nohemihuanca proof-by-accident nspindel ssp3nc3r sfarhd14 racarana enbrown galickda wlandau fbartos eddygassama joshualiuxu stevebronder tinnguyen96 ehoumanevans samueldonnelly arrendi quinfer jbarsotti allisterh hiroshimamiya yawomkobara lycium0605 pio82 r-study-cxb padpadpadpad brendonchau jacobenglert

rstanarm's Issues

numerical test failure on Solaris x86_32

https://www.r-project.org/nosvn/R.check/r-patched-solaris-x86/rstanarm-00check.html

Truncated priors in (g)lm(er) models?

Are there plans to implement the possibility of including truncated priors (for beta coefficients, or even intercept) in (g)lm(er) models, through beta or uniform distributions?

Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal

I am running the following type of model:

stan_glmer(log_y ~ (1 + log_x | group), 
                   prior = t7_prior, prior_intercept = t7_prior,
                   data = data, iter = 1000, chains = 4)

The dataset had about 20,000 rows with 99 groups.

...
Chain 3, Iteration: 1000 / 1000 [100%]  (Sampling)
#  Elapsed Time: 14649.7 seconds (Warm-up)
#                19422.5 seconds (Sampling)
#                34072.2 seconds (Total)


Warning message:
Warning message:
passing unknown arguments: show_messages. 
passing unknown arguments: show_messages. 
Warning message:
passing unknown arguments: show_messages. 
Warning message:
passing unknown arguments: show_messages. 

Chain 1, Iteration: 1000 / 1000 [100%]  (Sampling)
#  Elapsed Time: 17411.9 seconds (Warm-up)
#                18800 seconds (Sampling)
#                36211.9 seconds (Total)

Error in serialize(data, node$con, xdr = FALSE) : ignoring SIGPIPE signal
Calls: <Anonymous> ... doTryCatch -> sendData -> sendData.SOCK0node -> serialize
In addition: Warning message:
passing unknown arguments: show_messages. 
Execution halted

Session Info:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.0 (Maipo)

locale:
 [1] LC_CTYPE=en_US.UTF-8          LC_NUMERIC=C                  LC_TIME=en_US.UTF-8          
 [4] LC_COLLATE=en_US.UTF-8        LC_MONETARY=en_US.UTF-8       LC_MESSAGES=en_US.UTF-8      
 [7] LC_PAPER=en_US.UTF-8          LC_NAME=en_US.UTF-8           LC_ADDRESS=en_US.UTF-8       
[10] LC_TELEPHONE=en_US.UTF-8      LC_MEASUREMENT=en_US.UTF-8    LC_IDENTIFICATION=en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] markdown_0.7.7         shiny_0.12.2           rstanarm_2.8.0         lme4_1.1-10           
 [5] magrittr_1.5           dplyr_0.4.1            rstan_2.8.0            Rcpp_0.12.1           
 [9] Matrix_1.2-1           PricingModel_0.2.1.8-6 netezza_0.6-3.9        rJava_0.9-6           
[13] ggplot2_1.0.1         

loaded via a namespace (and not attached):
 [1] lattice_0.20-31    tidyr_0.2.0        zoo_1.7-12         gtools_3.5.0       assertthat_0.1    
 [6] digest_0.6.8       mime_0.4           R6_2.1.1           plyr_1.8.3         stats4_3.2.1      
[11] coda_0.17-1        lazyeval_0.1.10    minqa_1.2.4        nloptr_1.0.4       R.oo_1.19.0       
[16] DT_0.1             shinythemes_1.0.1  proto_0.3-10       splines_3.2.1      shinyjs_0.2.0     
[21] stringr_1.0.0      htmlwidgets_0.5    loo_0.1.3          munsell_0.4.2      httpuv_1.3.3      
[26] base64enc_0.1-3    htmltools_0.2.6    gridExtra_2.0.0    threejs_0.2.1      matrixStats_0.14.2
[31] MASS_7.3-40        R.methodsS3_1.7.0  grid_3.2.1         nlme_3.1-120       arm_1.8-6         
[36] xtable_1.7-4       gtable_0.1.2       DBI_0.3.1          RJDBC_0.2-5        scales_0.3.0      
[41] rredis_1.6.9       stringi_0.5-5      reshape2_1.4.1     dygraphs_0.4.5     xts_0.9-7         
[46] tools_3.2.1        shinystan_2.0.1    abind_1.4-3        parallel_3.2.1     inline_0.3.14     
[51] colorspace_1.2-6

Sorry, I do not have traceback.

default to QR=TRUE for advi?

@bgoodri I've also noticed that doing the QR composition before advi seems to be more reliable, the same you discussed with @akucukelbir in this thread on stan-dev. Do you think we should default to QR=TRUE if the user specifies "meanfield" or "fullrank" as the estimation algorithm?

demos

With either devtools::source_url or just source in the latest version of R, it is fairly easy to do many of the example-models via rstanarm with the following code idiom

{
source("https://raw.githubusercontent.com/stan-dev/example-models/*.data.R", local = TRUE)
stan_foo(formula, data = environment(), ...)
}

However, since this requires an internet connection, it cannot be part of R CMD check but I think it would be good to make many demos that do this using best practices (from https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Package-subdirectories ):

The demo subdirectory is for R scripts (for running via demo()) that demonstrate some of the functionality of the package. Demos may be interactive and are not checked automatically, so if testing is desired use code in the tests directory to achieve this. The script files must start with a (lower or upper case) letter and have one of the extensions .R or .r. If present, the demo subdirectory should also have a 00Index file with one line for each demo, giving its name and a description separated by a tab or at least three spaces. (This index file is not generated automatically.)

Separation of tensors and interactions in GAMM models

The gamm4 models use t2() in place of te() for tensor splines, but have to equivalent for ti(). This means one can't separate the interactions from the main effects, as in this example: http://stats.stackexchange.com/a/189437/28370

It would be great if stan_gamm4() could support ti() terms or something similar.

Related to #46

posterior_predict fails for stan_glmer if transformations used in formula

@Felipe-Osorio is the error thrown when running the code below the one you were referring to?

fit <- stan_glmer(mpg ~ log1p(wt) + (1|cyl), data = mtcars)
p <- posterior_predict(fit) # ok
p2 <- posterior_predict(fit, newdata = mtcars[1:10, ])
Error in `[.data.frame`(newdata, , colnames(frX)) : 
  undefined columns selected

printing the result of a stan_glm is broken

test <- stan_glm(mpg ~ wt, data = mtcars)
> test

Call:  stan_glm(formula = mpg ~ wt, data = mtcars)

Coefficients:
(Intercept)           wt  
     37.226       -5.322  

Degrees of Freedom: Total (i.e. Null);  30 Residual
Error in signif(x$null.deviance, digits) : 
  non-numeric argument to mathematical function

> traceback()
4: format(signif(x$null.deviance, digits))
3: cat("Null Deviance:\t   ", format(signif(x$null.deviance, digits)), 
       "\nResidual Deviance:", format(signif(x$deviance, digits)), 
       "\tAIC:", format(signif(x$aic, digits)))
2: print.glm(x)
1: function (x, ...) 
   UseMethod("print")(x)

This seems to have to do with the fact that it is calling print.glm now.

roxygen2 version 5.0.0 breaks rstanarm build

The error is

Quitting from lines 107-113 (aov.Rmd) 
Error: processing vignette 'aov.Rmd' failed with diagnostics:
'stan_lm' is not an exported object from 'namespace:rstanarm'
Execution halted

which appears to be consistent with the new auto-generated NAMESPACE, which may have to do with these warnings

roxygen2::roxygenize(clean = TRUE)
...
There were 14 warnings (use warnings() to see them)
> warnings()
Warning messages:
1: @export [posterior_predict.R#12]: may only span a single line
2: @export [predict.R#4]: may only span a single line
3: @export [rstanarm-package.R#14]: may only span a single line
4: @export [stan_glm.R#23]: may only span a single line
5: @export [stan_glm.fit.R#26]: may only span a single line
6: @export [stan_glmer.R#7]: may only span a single line
7: @export [stan_glmer.R#113]: may only span a single line
8: @export [stan_lm.R#25]: may only span a single line
9: @export [stan_polr.R#46]: may only span a single line
10: @export [stanreg-methods.R#99]: may only span a single line
11: @export [stanreg-methods.R#227]: may only span a single line
12: @export [stanreg-methods.R#239]: may only span a single line
13: @export [stanreg-methods.R#249]: may only span a single line
14: @export [summary.R#8]: may only span a single line

Conflict in R2 function with package 'caret'

I realize that you can't avoid all possible conflicts, but just wanted to let you know that R package caret also defines a function R2, which causes really puzzling errors with stan_lm. I don't have a good alternative (you might want to rename hs, etc, as well).

possibly emulate the gamm4 package / function

The gamm4 R package inputs syntax for estimating generalized additive models, converts it internally to lme4 syntax, and uses lmer to estimate the parameters. It sounds as if (from the appendix of the lme4 vignette and the documentation of gamm4) that we could just copy some of this R code and make a stan_gamm4() function with no additional Stan code and it is quite likely that the decov prior will make the estimates of the smoothing parameters more reliable.

Before First Release

These tasks (and probably other things I'm forgetting) still need to be done before first release:

Change periods to underscores
Roxygen templates for function args (b6c09dc)
Sparse matrix stuff
Copy glmer stuff from continuous.stan to most of the others
Copy horseshoe stuff from continuous.stan to others (and make needed changes on R side)
Negative binomial log-likelihood function
Add a help page for the plotting functions (839b03f provides links to the RStan help files. For now the links will only work if the ggplot RStan branch is installed.)
Check which other functions need tests

posterior prediction following stan_gamm4

Currently, there is just a stub that throws an error saying it is unimplemented (unless you are only doing in-sample prediction). The predict.gam() function in the mgcv package has the complicated stuff for handling the smooth terms when newdata is specified and if type = "lpmatrix", it returns the necessary matrix of predictors rather than the predictions. But we need to dig into the details of it to make sure we are multiplying the correct columns by the correct margins of the posterior.

errors after rstan update

@bgoodri

With rstan 2.9 I'm getting the following error when running

stan_glm(mpg ~ wt, data = mtcars)

sampling seems to proceeds ok, bun then

Error in dim(sssf) <- c(n2, object@sim$chains, length(tidx)) : 
dims [product 20000] do not match the length of object [40000]

which looks like it's coming from extract as defined in the stanfit-class.R file in rstan.

Should we really use confint method for Bayesian credible intervals?

@bgoodri I've been thinking recently that it might avoid a great deal of confusion if we create a separate method for obtaining so-called credible intervals rather than using R's standard confint. Perhaps an S3 generic credint and a credint.stanreg method? (Or just a credint function, skipping the S3 method stuff?) If we want to do this then I can make the change quite easily this afternoon.

Here are some reasons why I think it's worth doing this

The documentation for stats::confint, which we can't change, clearly says it's for confidence intervals.
It would give us the opportunity to document credint, where we can make it clear what the differences are between frequentist/Bayesian confidence/credible intervals. We don't really do this anywhere at the moment but I think the differences in interpretation are important.
The cost is small. The only downside I can think of is that some users will try to call confint for a model not estimated by optimization, but we can either issue a warning to use credint (and allow confint to still be usable) or an error that forces the use of credint.
If we wanted we could provide credible intervals based on posterior quantiles (as we do now) but also give the option to compute HPD intervals (that would be valid assuming there isn't severe multimodality).

Wrong fixed effects estimates when running stan_glm without an intercept

I get weird fixed effects estimates when fitting a model without an intercept:

library(rstanarm)
devtools::install_github("paul-buerkner/brms")
library(brms)
foo1 <- stan_glm(mpg ~ -1 + ., data=mtcars, chains=2, iter=2000, warmup=500)
foo1

foo2 <- brm(mpg ~ -1 + ., data=mtcars, chains=2, iter=2000, warmup=500)
foo2

foo3 <- glm(mpg ~ -1 + ., data = mtcars)
summary(foo3)

The results of foo2 and foo3 closely match each other while foo1 produces very different results. Especially the MD_SD (and SD) estimates are a bit off. This may be a result of centering the fixed effects design matrix X regardless of whether an intercept is present or not. I believe that centering is invalid in models without an intercept (at least when not corrected otherwise) but I may be mistaken.

memory size error with large random effects models in rstanarm

Executive summary:
While rstanarm fits a random effects logistic regression model to my large dataset (~50,000 units of observations) and runs four chains without an issue, on trying to generate the stanfit object, I get a memory size error. There are about 2000 random effects in this model. The issue arises even with only 10 iterations. I have 8 RAM and four cores on a virtual windows server.

formulaR2.0 <- ond ~ pay +age_group +sex + (1 | cpt)

stanfit2.0 <- stan_glmer(formulaR2.0, data = myAQI, family = binomial, iter= 10, cores =2)
Loading required namespace: rstudioapi

Error: cannot allocate vector of size 2.7 Gb In addition:

Warning messages:
1: In structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr") :
Reached total allocation of 8191Mb: see help(memory.size)
2: In structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr") :
Reached total allocation of 8191Mb: see help(memory.size)
3: In structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr") :
Reached total allocation of 8191Mb: see help(memory.size)
4: In structure(.Internal(La_qr(x)), useLAPACK = TRUE, class = "qr") :
Reached total allocation of 8191Mb: see help(memory.size)

sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server 2012 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] rstanarm_2.8.0 rstan_2.8.0 ggplot2_1.0.1 Rcpp_0.12.1 lme4_1.1-9
[6] Matrix_1.2-2 knitr_1.11

loaded via a namespace (and not attached):
[1] nloptr_1.0.4 plyr_1.8.3 shinyjs_0.2.0 xts_0.9-7
[5] base64enc_0.1-3 tools_3.2.2 digest_0.6.8 nlme_3.1-121
[9] gtable_0.1.2 lattice_0.20-33 rstudioapi_0.3.1 shiny_0.12.2
[13] shinystan_2.0.1 proto_0.3-10 loo_0.1.3 gridExtra_2.0.0
[17] stringr_1.0.0 gtools_3.5.0 dygraphs_0.4.5 htmlwidgets_0.5
[21] DT_0.1 stats4_3.2.2 grid_3.2.2 inline_0.3.14
[25] R6_2.1.1 minqa_1.2.4 reshape2_1.4.1 magrittr_1.5
[29] codetools_0.2-14 shinythemes_1.0.1 threejs_0.2.1 scales_0.3.0
[33] matrixStats_0.14.2 htmltools_0.2.6 MASS_7.3-43 splines_3.2.2
[37] xtable_1.7-4 mime_0.4 colorspace_1.2-6 httpuv_1.3.3
[41] stringi_0.5-5 munsell_0.4.2 markdown_0.7.7 zoo_1.7-12

add IRT model(s)

I think we are obligated to work IRT for one of Stan's grants so we might as well make it precompileable

make sure all tests pass a seed argument

We now have a non-reproducible failure on r-devel for Windows
https://www.r-project.org/nosvn/R.check/r-devel-windows-ix86+x86_64/rstanarm-00check.html

Posterior predict (.pp_data_mer) error with newdata using tbl_df

I noticed the following error when calling posterior_predict with newdata, where my newdata contains a tbl_df (which inherits from data.frame) instead of a data.frame.

Here is a reproducible example:

library(dplyr)
data(mtcars)
mtcars_tbl_df <- tbl_df(mtcars)

m1 <- stan_glmer(qsec ~ 1 + wt + (1 | cyl), data = mtcars)
ppred1_df <- posterior_predict(m1, newdata = mtcars)

m2 <- stan_glmer(qsec ~ 1 + wt + (1 | cyl), data = mtcars_tbl_df)
ppred2_tbl <- posterior_predict(m2, newdata = mtcars_tbl_df)

Which produces the following error:

> ppred2_tbl <- posterior_predict(m2, newdata = mtcars_tbl_df)
Error in .pp_data_mer(object, newdata, ...) : 
  New levels found in grouping variable(s) cyl

However, in this example, the following will work:

ppred2_tbl2 <- posterior_predict(m2, newdata = as.data.frame(mtcars_tbl_df))

This seems like a minor bug to me, since there is such an easy work-around (demonstrated above). Documenting in case it is helpful.

QR decompositions in rstanarm models

All the .stan programs in the rstanarm matrix utilize a design matrix X that is N by K. It would be interesting to explore an option to preprocess X with a QR decomposition

X = Q * R

where Q' * Q = I and R is upper triangular. Then if the linear predictor is

eta = X * beta = (Q * R) * beta = Q * (R * beta) = Q * theta

where

theta = R * beta

it is perhaps reasonable to put independent symmetric priors on the elements of theta because the columns of Q are mutually orthogonal and all have unit length. If so, you could recover the posterior distribution of beta as a generated quantity with the inverse of the fixed R matrix via

beta = R_inverse * theta

This idea may also help with the scaling problems we have seen with meanfield ADVI when doing regression problems.

warnings about VarCorr

Email from Martin Maechler:

Yesterday, a new version of 'nlme' was released on CRAN,
for which the VarCorr() S3 generic has a new '...' argument
--- so that others can write methods with extra arguments,
something really desirable.

However, this lead to the following WARNINGS for your packages,
which we'd ask you to fix.

I strongly recommend you to do what I (as co-author) did for the
lme4 package:

in your DESCRIPTION, depend on the very latest nlme,
i.e., put
nlme (>= 3.1-123)
in your 'Imports:' or 'Depends:' in your DESCRIPTION file
add ', ...' at the end of the argument list of your VarCorr.
function definition,
both in R/.R and man/.Rd

lme4/lme4@61de241

add predict method

add support for binomial models

Currently Bernoulli is supported but not binomial

Release checklist

(Edit this comment to add additional checklist items or mark any as complete)

Code

Documentation

Roxygen templates for function args
Add a help page for the plotting functions
Remove mentions of advi from doc
Add references to BDA3 and other sources

Vignettes

package overview
lm
aov
polr
binomial/bernoulli glm
count glm (poisson, neg-binomial)
continuous glm (gaussian, gamma, inverse-gaussian)
glmer (point to the bernoulli and count vignettes and the lme4 vignettes, and discuss prior on covariance matrices) and gamm4
Check size of vignettes (should be fine now with svglite)

Demos

stan_glmer stuff from the second half of the ARM book

add support for multinomial logit models

Missing images in cran vignettes

See, for example,
https://cran.rstudio.com/web/packages/rstanarm/vignettes/binomial.html
I can't see any images (just the broken image link symbols)

I just noticed that this was mentioned in the release announcement, but I'll leave this issue here for those who don't read release notes...

allow using student_t instead of gaussian for likelihood

For stan_glm with a suitable outcome (and maybe stan_lmer), we should provide an option to use the t distribution as the likelihood instead of a Gaussian.

Perhaps we just follow BDA3 17.2 and use the fact that

y_i ~ t_v(m, s^2)

is equivalent to

y_i | V_i ~ N(m, V_i)
V_i ~ scaled-inv-chi-sq(v, s^2)

'meanfield' results don't match 'sampling' results

Seeing as it's the intercept that's so different, does this have to do with the centering?

fit <- stan_glm(mpg ~ wt, data = mtcars)
fit_vb <- stan_glm(mpg ~ wt, data = mtcars, algorithm = "meanfield")

> summary(fit)
                mean mcse   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
(Intercept)    37.25 0.04 1.94  33.31  35.96  37.22  38.55  41.09  2477    1
wt             -5.33 0.01 0.58  -6.45  -5.72  -5.33  -4.94  -4.22  2477    1
sigma           3.15 0.01 0.43   2.45   2.85   3.10   3.39   4.12  1813    1
mean_PPD       20.10 0.02 0.79  18.59  19.58  20.10  20.62  21.67  2636    1
log-posterior -53.10 0.04 1.32 -56.61 -53.65 -52.76 -52.14 -51.62   938    1


> summary(fit_vb)
               mean   sd  2.5%   25%   50%   75% 97.5%
(Intercept)    0.62 0.07  0.50  0.58  0.62  0.67  0.77
wt            -5.37 0.63 -6.65 -5.77 -5.35 -4.96 -4.16
sigma          3.12 0.34  2.52  2.88  3.10  3.35  3.87
mean_PPD      37.46 2.13 33.29 36.02 37.38 38.92 41.74
log-posterior 20.16 0.88 18.52 19.54 20.19 20.75 21.84

stan_glmer( ) mistranslates varying slope, varying intercept model

I ran into an error earlier today which led to NaNs in the log-likelihood matrix. Upon further inspection, I noticed that sigma had been estimated twice, and its mean took a negative value. Without dissecting the functions, I can only speculate that this is some kind of variable indexing bug in the "translation" between lmer syntax and stan code.

In this reproducible example, the first sigma appears to represent the group-level intercept, b[(Intercept) 1] seems to be its respective slope, and every subsequent variable name is off by one, so to speak. An estimate for the real sigma is not given. Otherwise, stan does a fine job of recovering the data-generating parameters.

Inference for Stan model: Gaussian GLM.
4 chains, each with iter=1000; warmup=500; thin=1; 
post-warmup draws per chain=500, total post-warmup draws=2000.

                  mean se_mean   sd  2.5%   25%   50%   75% 97.5% n_eff Rhat
(Intercept)       1.92    0.04 0.71  0.47  1.54  1.91  2.29  3.45   338 1.01
x                 3.95    0.04 0.83  2.31  3.48  3.95  4.44  5.54   373 1.01
sigma            -0.96    0.04 0.72 -2.43 -1.34 -0.95 -0.60  0.52   341 1.01
b[(Intercept) 1] -1.97    0.04 0.85 -3.67 -2.43 -1.96 -1.46 -0.33   442 1.01
b[x 1]           -0.10    0.04 0.72 -1.66 -0.45 -0.09  0.29  1.40   359 1.01
b[(Intercept) 2]  0.22    0.04 0.84 -1.42 -0.26  0.20  0.71  1.90   441 1.01
b[x 2]            1.12    0.04 0.72 -0.39  0.75  1.12  1.50  2.62   328 1.01
b[(Intercept) 3]  1.99    0.04 0.86  0.33  1.50  1.97  2.50  3.67   372 1.01
sigma            -0.96    0.04 0.72 -2.43 -1.34 -0.95 -0.60  0.52   341 1.01
mean_PPD          3.98    0.00 0.06  3.87  3.95  3.98  4.02  4.10  1771 1.00

Samples were drawn using NUTS(diag_e) at Mon Aug 17 16:21:17 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at 
convergence, Rhat=1).

R code:

N <- 600
J <- 3
x <- runif(N)
group <- c(rep(1, N/J), rep(2, N/J), rep(3, N/J))
y <- group + 2.00 * group * x + rnorm(N)
simdat <- data.frame(cbind(y, x, group))

stan_1 <- stan_glmer(y ~ x + (1+x|group)
                     , data = simdat
                     , chains = 4
                     , iter = 1000
                     , family = gaussian()
                     , prior.for.intercept = normal(0,2)
                     , prior = normal(0,3)
                     )
stan_1
ll <- stan_1$log_lik
crossv <- loo::loo(ll, cores = 1) # NaNs not allowed error
#table(round(crossv$pareto_k)) # Inf?

rstanarm v. 2.7.0

stanreg-methods.R not parsing

Some weirdness in the latest edit around lines 424-432.

commit b2c73f0

error when getting prior.info if subset argument is supplied

@bgoodri

These lines

all_args <- mget(names(formals()), sys.frame(sys.nframe()))
prior.info <- all_args[grep("prior", names(all_args), fixed = TRUE)]

are intended to collect all the prior-related arguments so after the model is fit we or the user can see what choices were made for the priors. Because match.call won't include these if the user leaves the default priors I came up with the solution above. But I just discovered that something like

stan_glm(mpg ~ wt, data = mtcars, subset = cyl < 8)

leads to the following error:

Error in mget(names(formals()), sys.frame(sys.nframe())) : 
  object 'cyl' not found

which I guess makes sense, but I'm not sure how to work around it. Any ideas?

make "tidy" method for stanreg objects and submit PR to broom package

Should be pretty easy, just need to do it.

df.residual

@bgoodri Is there any reason why we should definitely report residual degrees of freedom? Right now we have a call to qr in stanreg that is only used to get the rank of x for computing df.residual. If x is really large we probably don't want to compute the QR decomposition in stanreg unless we really want to report df.residual. What do you think?

add support for ordinal and multinomial logit models

gaussian.stan can't handle stan_glmer models with no X matrix

For example, this model from the lme4 documentation:

stan_lmer(diameter ~ (1|plate) + (1|sample), data = Penicillin)

will result in an empty X matrix.

ignoring priors?

From @waynefoltaERI (via brms issue 24)

For example, while playing with the mtcars dataset for this issue, I found that brms' and rstanarm's answers differed considerably. And glm agreed with rstanarm... which gave me a clue and when I investigated further, it appears that rstanarm is ignoring priors. (Or I am misunderstanding how they are specified, but not getting any error messages.) So the reason for the agreement is that I was specifying priors, but rstanarm was ignoring them and using flat (improper, frequentist-like) priors. Maybe has to do with their pre-compilation.

allow binomial #trials to be specified using weights argument

stan_plot cannot exclude "log-posterior"

stan_plot(*, pars = "log-posterior", include = FALSE) does not actually exclude the log-posterior from the plot
stan_plot(*, pars = "lp__", include = FALSE) does, so I assume it is just a naming thing?

Fix errors caused by ggplot2 update

https://github.com/hadley/ggplot2/blob/master/NEWS.md

have option to draw from the prior of any model

We should just be able to pass N = 0 and skip over the likelihood contribution to the posterior, but it may take a bit of fiddling. People like to see how much the data contributes to the posterior distribution relative to the prior.

Gaussian processes

Not a short-term priority, but could be nice to have some (limited) functionality for GPs in future iterations of the package.

compiled C++ code for the model object is (sometimes) invalid

I'm working with some simulated data and started by running a few simple models. I'm assuming the generic Gaussian GLM is pre-compiled in rstanarm -- I'd installed 2.8.1 directly. This could be an idiosyncrasy of working off clone VMs, but I got the following error:

> rstan1 <- stan_glmer(qty ~ price + (1 + price | prod_key)
+                      , data = df
+                      , chains = 4
+                      , iter = 1000
+                      , prior = normal(0, 2)
+                      , prior_intercept = normal(100, 40)
+                      )
starting worker pid=24456 on localhost:11133 at 14:04:11.818
starting worker pid=24465 on localhost:11133 at 14:04:12.164
starting worker pid=24474 on localhost:11133 at 14:04:12.491
starting worker pid=24483 on localhost:11133 at 14:04:12.796
Error in checkForRemoteErrors(val) : 
  4 nodes produced errors; first error: the compiled object from C++ code for this model is invalid, possible reasons:
  - compiled with save_dso=FALSE;
  - compiled on a different platform;
  - not existed (created from reading csv files).
> stan_trace(rstan1)
Error in if (object$algorithm == "optimizing") stop("Plots not yet available for optimization (algorithm='optimizing')",  : 
  argument is of length zero
> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-redhat-linux-gnu (64-bit)
Running under: Red Hat Enterprise Linux Server 7.0 (Maipo)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] rstan_2.8.1     ggplot2_1.0.1   shinystan_2.0.1 shiny_0.12.2    rstanarm_2.8.1  Rcpp_0.12.1     arm_1.8-6       lme4_1.1-10     Matrix_1.2-1    MASS_7.3-40    
[11] lazyeval_0.1.10 dplyr_0.4.1    

loaded via a namespace (and not attached):
 [1] lattice_0.20-31     zoo_1.7-12          gtools_3.5.0        assertthat_0.1      digest_0.6.8        mime_0.4            R6_2.1.1            plyr_1.8.1         
 [9] stats4_3.2.1        coda_0.18-1         httr_0.6.1          minqa_1.2.4         nloptr_1.0.4        DT_0.1              shinythemes_1.0.1   proto_0.3-10       
[17] devtools_1.9.1.9000 splines_3.2.1       shinyjs_0.2.3       RcppEigen_0.3.2.5.1 stringr_0.6.2       htmlwidgets_0.5     loo_0.1.3.9000      RCurl_1.95-4.5     
[25] munsell_0.4.2       httpuv_1.3.3        base64enc_0.1-3     htmltools_0.2.6     gridExtra_2.0.0     threejs_0.2.1       codetools_0.2-11    matrixStats_0.15.0 
[33] withr_1.0.0         bitops_1.0-6        BH_1.58.0-1         grid_3.2.1          nlme_3.1-120        xtable_1.8-0        gtable_0.1.2        DBI_0.3.1          
[41] magrittr_1.5        StanHeaders_2.8.0   scales_0.2.4        reshape2_1.4.1      dygraphs_0.5        xts_0.9-7           tools_3.2.1         markdown_0.7.7     
[49] abind_1.4-3         parallel_3.2.1      inline_0.3.14       colorspace_1.2-4    memoise_0.2.1       knitr_1.11

The traceback() output was unusual -- mostly comprised of a long list of numbers, presumably sampling args. It's interesting that if I try this model on a single chain (eschewing the parallel calls), it suddenly works. Because this error only sometimes happens, it's been tricky to pinpoint and diagnose. I don't recall it happening prior to 2.8.0, but this may just be one of the drawbacks of working with the bleeding edge versions...

variable selection by projective prediction

Let's do this for rstanarm

https://groups.google.com/d/msg/stan-users/5IxMxY3tv2w/ptm2vyJmBwAJ

see the repo https://github.com/jtpi/rstan-varsel

Check that y is the same for model comparison (loo or waic)

The compare function in the loo package checks that models have the same number of observations, but we can also check that the outcome variable is the same. This would probably require a wrapper around loo::compare that first does this check, which means rstanarm would have its own model comparison function, instead of users just using the loo package. But I'm not sure that doing this check is worth adding a function since compare's check for the number of observations will catch almost all cases of y being different.

allow sigma to vary by group

For models with a Gaussian outcome and some form of grouping structure, allow sigma to vary by level of a group.

return "clean" stan code to user

Discussion here: https://groups.google.com/forum/?fromgroups#!topic/stan-dev/5pJdH72hoM8

glFormula error

@bgoodri I just triggered an error I didn't know about in glFormula. Apparently it won't allow a model with more random effects than observations because "the random-effects parameters are probably unidentifiable." That's a reasonable concern for glmer, but for stan_glmer this will rule out many models (like a lot of the ones Andrew fits) for which it's possible to overcome this with informative priors.

stan_glmer messed up after recent merge

@bgoodri I'm not sure what's going but after your recent merge for the posterior_predict stuff this example

stan_lmer(Reaction ~ Days + (Days | Subject), data = sleep study)

throws an error. It has something to do with the _NEW_ terms. We're getting b[(Intercept) Subject:_NEW_Subject] but missing the _NEW_ term for Days.

non-MCMC options

Each stan_foo() function should have an option to do optimizing rather than sampling (and probably an option for variational inference too). Consequently, each stan_foo() function should have an option to turn the priors off, so that the user can obtain pure MLEs. These changes can probably be accomodated by the existing .stan files if we pass a has_priors flag.