cjvanlissa / bain Goto Github PK

View Code? Open in Web Editor NEW

7.0 7.0 5.0 2.5 MB

Bayes Factors for Informative Hypotheses

License: GNU General Public License v3.0

R 93.00% Fortran 6.79% C 0.21%

bain's People

Contributors

Stargazers

Watchers

Forkers

cvzundert jomulder xingu123 herberthoijtink thomvolker

bain's Issues

Relegate much of bain documentation to a vignette

The bain help documentation is very extensive, but typically, documentation only consists of the essential information needed to run the function, and a few simple examples for automated testing.

Text as extensive as the current documentation is better suited for a Vignette (which is also automatically included with the package).

CJ will create a bare-bones vignette.
Herbert, can you edit both documents so the right info ends up in each of them?

Beste Caspar. Dit is dus de manier van communiceren? Via Github. Helder en handig. Als je dit bericht ontvangt, is het me gelukt je te bereiken. Laat even weten.

simplify output of print.bain

First 4 columns are not informative for applied researchers

DE VIJF REGELS HIERONDER GEVEN HET GOEDE RESULTAAT

varnames<-c("a1","b1","a2","b2")
hypotheses31 <-
"0.20 *a1 = 0.20 *a2&
0.20 *b1 = 0.20 *b2 &"
x1<-bain:::parse_hypothesis(varnames, hypotheses31)

MAAR .20 (ALS MULTIPLIER VAN EEN PARAMETER, EN, ZONDER 0 VOOR DE PUNT)

IPV 0.20 LEIDT TOT EEN FOUT RESULTAAT

varnames<-c("a1","b1","a2","b2")
hypotheses31 <-
".20 *a1 = .20 *a2&
.20 *b1 = .20 *b2 &"
x2<-bain:::parse_hypothesis(varnames, hypotheses31)

HIERONDER DE TESTTHAT

test_that("parser", {expect_equal(as.vector(t(do.call(rbind, x1$hyp_mat))),
as.vector(t(do.call(rbind, x2$hyp_mat))))})

VOOR EEN GETAL OPTELLEN BIJ EEN PARAMETER GAAT HET WEL

GOED ZOWEL ALS .2 ALS 0.2 ERBIJ OPGETELD WORDT

varnames<-c("a1","b1","c1","d1","e1","f1","a2","b2","c2","d2","e2","f2")
hypotheses31 <-
"a1 = a2 - 0.2&
b1 = b2 + .2 &
0.2c1 -.3 = c2 +.3&
d1 + .3= 3d2 + .3&
e1 -.3= e2 + .3&
4f1 -0.3 = 3.2f2"
x3<-bain:::parse_hypothesis(varnames, hypotheses31)

HIERONDER DE TESTTHAT

test_that("parser", {expect_equal(as.vector(t(do.call(rbind, x3$hyp_mat))),
c(1 , 0, 0.0 , 0, 0 , 0 ,-1 ,0 , 0 , 0, 0 , 0.0, -0.2,
0 , 1, 0.0, 0, 0 , 0 ,0, -1 ,0 ,0 , 0, 0.0, 0.2,
0 , 0, 0.2, 0, 0 , 0 , 0 ,0, -1, 0 , 0 , 0.0 , 0.6,
0 ,0, 0.0 , 1 , 0, 0 , 0 ,0 , 0, -3, 0 ,0.0 , 0.0,
0, 0, 0.0, 0, 1 , 0, 0, 0 , 0, 0 ,-1 , 0.0 , 0.6,
0, 0, 0.0, 0, 0, 4, 0 , 0, 0, 0 , 0 ,-3.2, 0.3)
)})

Implicitly removing the intercept in `bain` results in seemingly unexpected behavior

When estimating the effect of a categorical variable, such as in, for example, an ANOVA, bain by default drops the intercept. This seems like a logical decision, but results in seemingly unexpected behavior, because there is no warning message shown. Accordingly, the estimate as given in an lm model reflects the difference between two categories, while bain uses the same name to refer to the group specific mean. This issue becomes apparent in the following reprex.

# Load required packages
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
#> Warning: package 'magrittr' was built under R version 4.1.2
library(bain)

set.seed(1)

# Load data and specify sex as a factor variable
dat1 <- sesamesim %>%
  mutate(sex = factor(sex, labels = c("boy", "girl")))

# Fit model
mod1 <- lm(postnumb ~ sex, dat1)

# Summary of model
summary(mod1)
#> 
#> Call:
#> lm(formula = postnumb ~ sex, data = dat1)
#> 
#> Residuals:
#>     Min      1Q  Median      3Q     Max 
#> -30.096  -8.096  -0.856   8.144  32.904 
#> 
#> Coefficients:
#>             Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   30.096      1.175  25.616   <2e-16 ***
#> sexgirl       -1.240      1.628  -0.761    0.447    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 12.6 on 238 degrees of freedom
#> Multiple R-squared:  0.00243,    Adjusted R-squared:  -0.001761 
#> F-statistic: 0.5799 on 1 and 238 DF,  p-value: 0.4471

# Note that sexgirl here reflects the difference with sexboy
# If we use this model in bain, and specify a negative 
# effect for sexgirl (which seems like a logical thing to do),
# this results in unexpected behavior, because bain implicitly
# transforms the intercept and difference into two separate means.
# However, this only becomes apparent after inspecting the summary
# of the bain output.
bf1 <- bain(mod1, "sexgirl < 0")

bf1
#> Bayesian informative hypothesis testing for an object of class lm (ANOVA):
#> 
#>    Fit   Com   BF.u  BF.c  PMPa  PMPb 
#> H1 0.000 0.500 0.000 0.000 1.000 0.000
#> Hu                               1.000
#> 
#> Hypotheses:
#>   H1: sexgirl<0
#> 
#> Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.
# Fit is 0, which seems weird, because it is in line with the effect
# as estimated by the `lm` function.

# Only when calling summary(mod1) it becomes apparent that `sexgirl`
# suddenly refers to the mean of the girls, rather than the difference
# between boys and girls.
summary(bf1)
#>   Parameter   n Estimate       lb       ub
#> 1    sexboy 115 30.09565 27.79295 32.39836
#> 2   sexgirl 125 28.85600 26.64732 31.06468

^{Created on 2022-03-09 by the reprex package (v2.0.0)}

On a related note, researchers may also manually dummy code their categories. If this is the case, bain does not remove the intercept, but estimates the difference. However, this ignores the fact that there are multiple groups involved, which results in an inconsistent Bayes factor (Hoijtink, Gu & Mulder, 2019). Is it indeed problematic to create numeric dummy variables, and estimate the effect of these using bain, because the resulting Bayes factors are highly similar (see reprex below). If this is actually a problematic way to estimate the effects of categorical variables, it may be worthwhile to let bain recognize numeric dummy variables, and display a warning if these are observed.

# Load required packages
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
library(magrittr)
#> Warning: package 'magrittr' was built under R version 4.1.2
library(bain)

set.seed(1)

# Make the site variable a factor
bf_group_means <- sesamesim %>%
  mutate(site = as.factor(site)) %$%
  lm(postnumb ~ site + prenumb) %>%
  bain("site3 < site1 < site2 & prenumb > 0")

bf_group_means
#> Bayesian informative hypothesis testing for an object of class lm (ANCOVA):
#> 
#>    Fit   Com   BF.u  BF.c   PMPa  PMPb 
#> H1 0.443 0.057 7.794 13.189 1.000 0.886
#> Hu                                0.114
#> 
#> Hypotheses:
#>   H1: site3<site1<site2&prenumb>0
#> 
#> Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.

# Create dummies manually, I'll use site = 1 as the reference
bf_differences <- sesamesim %>%
  mutate(site1 = ifelse(site == 1, 1, 0),
         site2 = ifelse(site == 2, 1, 0),
         site3 = ifelse(site == 3, 1, 0),
         site4 = ifelse(site == 4, 1, 0),
         site5 = ifelse(site == 5, 1, 0)) %$%
  lm(postnumb ~ site2 + site3 + site4 + site5 + prenumb) %>%
  bain("site3 < 0 & site2 > 0 & prenumb > 0")

bf_differences
#> Bayesian informative hypothesis testing for an object of class lm (continuous predictors):
#> 
#>    Fit   Com   BF.u  BF.c   PMPa  PMPb 
#> H1 0.448 0.055 8.118 13.888 1.000 0.890
#> Hu                                0.110
#> 
#> Hypotheses:
#>   H1: site3<0&site2>0&prenumb>0
#> 
#> Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.
# Almost identical fit & complexity

summary(bf_group_means)
#>   Parameter   n   Estimate         lb        ub
#> 1     site1  60 27.4293820 25.1603844 29.698380
#> 2     site2  55 34.9818244 32.5529993 37.410650
#> 3     site3  64 27.6904082 25.4002689 29.980548
#> 4     site4  43 26.9840163 24.3250440 29.642989
#> 5     site5  18 31.4298837 27.3413936 35.518374
#> 6   prenumb 240  0.7159311  0.5986552  0.833207
summary(bf_differences)
#>   Parameter   n  Estimate         lb        ub
#> 1     site3 240 0.2610262 -3.0455948  3.567647
#> 2     site2 240 7.5524424  4.3017084 10.803176
#> 3   prenumb 240 0.7159311  0.5986552  0.833207
# In the latter, the grouping structure is ignored

^{Created on 2022-03-09 by the reprex package (v2.0.0)}

Warnings for different types of analyses

Herbert must examine which warnings are given. Especially when users just call bain on an lm object with mixed type predictors. Are the results valid?

Include functionality to recognize interaction terms

E.g., sepal_length:sepal_width

The BF turns from Inf to NaN with increasing Test value in a one-sample t-test

Sometimes, depending on the test value in a one-sample t-test, the Bayes factor turns from Inf to NaN. An example can be found below. I'm not sure how to handle this on the R side though, maybe you have some ideas?

data("sesamesim")

set.seed(100)
testValue <- 36
test <- bain::t_test(x = sesamesim$age)
b1 <- bain::bain(x = test, hypothesis = paste0("x=", testValue, "; x>", testValue, "; x<", testValue), fraction = 1)
BF_02_1 <- b1$BFmatrix[1,3]
BF_02_1

[1] 718.2452

set.seed(100)
testValue <- 35.5
test <- bain::t_test(x = sesamesim$age)
b2 <- bain::bain(x = test, hypothesis = paste0("x=", testValue, "; x>", testValue, "; x<", testValue), fraction = 1)
BF_02_2 <- b2$BFmatrix[1,3]
BF_02_2

[1] Inf

set.seed(100)
testValue <- 35
test <- bain::t_test(x = sesamesim$age)
b3 <- bain::bain(x = test, hypothesis = paste0("x=", testValue, "; x>", testValue, "; x<", testValue), fraction = 1)
BF_02_3 <- b3$BFmatrix[1,3]
BF_02_3

[1] NaN

Deze test failt:

estimate<-c(0,1,2)
names(estimate) <- c("a", "b", "c")
sampN <- 100
covariance <- matrix(c(1,0,0,0,1,0,0,0,1),nrow=3,ncol=3)

set.seed(199)
y<-bain(estimate,"3a + .5 + .5 > 2b -1 + 1> c -1.5 -.5",n=sampN,Sigma=covariance,group_parameters=0,joint_parameters = 3)

a <- rnorm(100000,0,1)
b <- rnorm(100000,1,1)
c <- rnorm(100000,2,1)
propf <- 0
for (i in 1:100000){
if (3 * a[i] + 1 > 2 * b[i] & 2 * b[i] > c[i]-2) {propf <- propf + 1/100000}
}
test_that("bain default", {expect_equal(y$fit$Fit[1], propf,tolerance = .01)})

b <- rnorm(100000,0,50)
c <- rnorm(100000,2,50)
propc <- 0
for (i in 1:100000){
if (3 * a[i] + 1 > 2 * b[i] & 2 * b[i] > c[i]-2) {propc <- propc + 1/100000}
}
test_that("bain default", {expect_equal(y$fit$Com[1], propc,tolerance = .01)})

t_test: Check if estimate and v are in the same order

Properly implement random seed

Because the random seed was fixed to 100 in Bain 0.1.3, I have temporarily done the same in the development version of bain 0.2.0. Release this before pushing to CRAN. Use sample.int() at every call to fortran.

Data type of "sigma"

Sigma is now an atomic vector, covariance matrix, or list of covariance matrices. Would it not be more practical to use an array? List implies that the covariance matrices can have different dimensions - but this seems like something that should not be allowed.

data("sesamesim")
formula <- age ~ site + peabody + prenumb + postnumb + funumb - 1
hypothesis <- "site1 = site2 = site3 = site4 = site5;site1 < site2 < site3 < site4 < site5;site1 > site2 > site3 > site4 > site5"
fraction <- 1
standardized <- FALSE
sesamesim$site <- factor(sesamesim$site)
fit <- stats::lm(formula = formula, data = sesamesim)
bainResult <- bain::bain(x = fit, hypothesis = hypothesis, fraction = fraction, standardize = standardized)
> print(bainResult)

Bayesian informative hypothesis testing for an object of class lm (ANCOVA):
   Fit   Com   BF.u  BF.c  PMPa  PMPb 
H1 0.000 0.000 0.000 0.000 0.015 0.000
H2 0.000 0.007 0.007 0.007 0.985 0.007
H3 0.000 0.007 0.000 0.000 0.000 0.000
Hu                               0.993
Hypotheses:
  H1: site1=site2=site3=site4=site5
  H2: site1<site2<site3<site4<site5
  H3: site1>site2>site3>site4>site5
Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.

>  bainResult$BFmatrix
             H1           H2         H3
H1  1.000000000 1.476255e-02   30175.92
H2 67.738958866 1.000000e+00 2044085.52
H3  0.000033139 4.892163e-07       1.00

Linux

data("sesamesim")
formula <- age ~ site + peabody + prenumb + postnumb + funumb - 1
hypothesis <- "site1 = site2 = site3 = site4 = site5;site1 < site2 < site3 < site4 < site5;site1 > site2 > site3 > site4 > site5"
fraction <- 1
standardized <- FALSE
sesamesim$site <- factor(sesamesim$site)
fit <- stats::lm(formula = formula, data = sesamesim)
bainResult <- bain::bain(x = fit, hypothesis = hypothesis, fraction = fraction, standardize = standardized)
> print(bainResult)

Bayesian informative hypothesis testing for an object of class lm (ANCOVA):
   Fit   Com   BF.u  BF.c  PMPa  PMPb 
H1 0.000 0.000 0.000 0.000 0.014 0.000
H2 0.000 0.007 0.007 0.007 0.986 0.007
H3 0.000 0.006 0.000 0.000 0.000 0.000
Hu                               0.992
Hypotheses:
  H1: site1=site2=site3=site4=site5
  H2: site1<site2<site3<site4<site5
  H3: site1>site2>site3>site4>site5
Note: BF.u denotes the Bayes factor of the hypothesis at hand versus the unconstrained hypothesis Hu. BF.c denotes the Bayes factor of the hypothesis at hand versus its complement.

> bainResult$BFmatrix
             H1           H2         H3
H1 1.000000e+00 1.404837e-02   22430.45
H2 7.118265e+01 1.000000e+00 1596659.07
H3 4.458225e-05 6.263078e-07       1.00

As you can see, the results between Mac/Windows and Linux diverge (for example BF23 = 30175.92 on windows versus BF23 = 22430.45 on Linux).

Division operator does not work in hypotheses

/ not recognized. /x could be replaced by *1/x

cjvanlissa / bain Goto Github PK

bain's People

Contributors

Stargazers

Watchers

Forkers

bain's Issues

DE VIJF REGELS HIERONDER GEVEN HET GOEDE RESULTAAT

MAAR .20 (ALS MULTIPLIER VAN EEN PARAMETER, EN, ZONDER 0 VOOR DE PUNT)

IPV 0.20 LEIDT TOT EEN FOUT RESULTAAT

HIERONDER DE TESTTHAT

VOOR EEN GETAL OPTELLEN BIJ EEN PARAMETER GAAT HET WEL

GOED ZOWEL ALS .2 ALS 0.2 ERBIJ OPGETELD WORDT

HIERONDER DE TESTTHAT

Recommend Projects

Recommend Topics

Recommend Org