gdlc / bglr-r Goto Github PK

License: GNU General Public License v3.0

R 71.99% C 22.01% TeX 6.00%

bglr-r's Introduction

BGLR: An R Package for (Bayesian) High-Dimensional Regression

The BGLR Package (Perez & de los Campos, 2014) implements a variety of shrinkage and variable selection regression procedures. In this repository we maintain the latest version beta version. The latest stable release can be downloaded from CRAN.

Citation

Please cite Perez & de los Campos, 2014 and Perez & de los Campos, 2022 for BGLR and Multitrait, respectively.

Installation

From CRAN (stable release).

  install.packages(pkg='BGLR',repos='https://cran.r-project.org/')

From GitHub (development version, added features).

   install.packages(pkg='devtools',repos='https://cran.r-project.org/')  #1# install devtools
   library(devtools)                                                     #2# load the library
   install_git('https://github.com/gdlc/BGLR-R')                         #3# install BGLR from GitHub

Note: when trying to install from github on a mac you may get the following error message

ld: library not found for -lgfortran

This can be fixed it by installing gfortan, for mac os you can use this

Useful references:

link1.
link2.

1. Single-Trait Models

Examples BGLR-function

Other Omics

Metabolomics (milk-spectra): Ferragina et al., J.D.Sci, 2015
Multi-omic (gene expression, methylation & CNV): Vazquez et al. (10th, WCGALP) & Vazquez et al. (Genetics, 2016)

Markers or Pedigree and Environmental Covariates

-Wheat (SNPs and env. covariates): Jarquin et al. (TAG, 2014)

-Cotton (Pedigree and env. covariates): Perez-Rodriguez et al.(Crop. Sci, 2015)

Image Data

-Maize (Image data): Aguate et al. (Crop. Sci, 2017)

2. Multi-trait models

The Multitrait function included in the BGLR package fits Bayesian multitrait models with arbitrary number of random effects using a Gibbs sampler. A functionality similar to this is implemented in the MTM package. In this implementation is possible to include regression on markers directly assigning Spike-slab or Gaussian priors for the regression coefficients and fixed effects can be different for all the traits. We also have improved the sampling routines to speed up computations. Next we include some examples.

Supplementary scripts for the draft of the paper "Multi-trait Bayesian Shrinkage and Variable Selection Models with the BGLR R-package"

bglr-r's People

Contributors

Stargazers

Watchers

bglr-r's Issues

warning message

The following warning message is produced when using BGLR() with response_type = "ordinal".

Warning message:
In sqrt(post_threshold2 - post_threshold^2) : NaNs produced

Change to new cran checks badge URL

👋🏽 I maintain the cran checks badges. Please change to the new cran checks badge URL (e.g., https://badges.cranchecks.info/worst/dplyr.svg). Old badges at (e.g. https://cranchecks.info/badges/worst/dplyr) will be unavailable as of Jan 1st 2023.

Blog post from June 2022 announcing the change: https://recology.info/2022/06/cran-checks-badges/
New badges repo: https://github.com/r-hub/cchecksbadges
Old badges repo: https://github.com/sckott/cchecksapi

Calculation of "Broad-sense heritability" with various GRMs

I am trying to get dominance and epistasis effects with the custom relation matrix.
I had 5 genomic relationship matrice, A (addictive), D (dominance), AxA, AxD, DxD
and run the below test.

ALL_MODEL <- Multitrait(y=y,ETA=list(list(K=A,   model="RKHS",saveEffects=T),
                                     list(K=D,   model="RKHS",saveEffects=T),
                                     list(K=AxA, model="RKHS",saveEffects=T),
                                     list(K=AxD, model="RKHS",saveEffects=T),
                                     list(K=DxD, model="RKHS",saveEffects=T)
                                    ), verbose=T, nIter=10000,burnIn=5000,saveAt='AD_')

With this model, how can I calculate "Broad-sense heritability"?

Error in factanal(covmat = Cov$Omega, factors = Cov$nF) : unable to optimize from this starting value

Hi, thanks for the BGLR package!

I'm trying to fit an FA1 model, but sometimes I get this error:

Checking variance co-variance matrix K  for linear term 1
Ok
Setting linear term 1
MSx=0.782358413511471
FA covariance matrix
df0 set to 23 for all the traits
S0 was set to: 
       DEH1        GAH1        GAH2        GEH1        IAH1        INH1        MIH1        MNH1        NCH1        NEH1        NEH2        NEH3        NYH2        NYH3        NYS1        SCH1        TXH1        TXH2        TXH3 
  7.9556863  20.2783135 173.7748939  43.4470011  56.2167504   3.7032637   6.7851079  15.6662375   5.6775808 101.9856404   8.3571794  15.3495887  47.5418310   1.9690249  51.6612751  58.9084196   0.6233221  17.9751277  19.7048923 
       WIH1        WIH2        WIH3 
 98.1848348  20.2203369   0.2596944 
var was set to 100
Error in factanal(covmat = Cov$Omega, factors = Cov$nF) : 
  unable to optimize from this starting value

Curiously, if I try to run a couple of times, the model works, so I suppose it could be to a random initialization.

After looking for this error, I saw this post in StackOverflow:
https://stackoverflow.com/questions/41146141/error-in-factor-analysis-starting-values

Does it make sense to increase the number of starting values or give us the option to change the lower bound?
The documentation of factanal says nstart = 1 and lower = 0.005.

Thanks!

is BRR in the BGLR package have exactly same statistics calculation as rrBLUP (Ridge-regression BLUP) ?

Hi,

Just want to confirm that is BRR in the BGLR package have exactly same statistics calculation as rrBLUP (Ridge-regression BLUP) ?

Kind regards,

Model with fixed effects and pedigree

I am trying to fit a model using the BGLR package in R. Unfortunately, I'had not success to do it. I would like to predict plant height using the following BGLR model (dataset with 2 sites, 50 genotypes with pedigree):
yij = u + Ei + gi(A) + e
yij = are the BLUEs from aerial data and ground data obtained by my "golden model" previously fitted in ASReml software (model with genotype, site, GxE interaction, repetition, and bloc - lattice design).
Ei= is the fixed effects for sites
gi(A)= random effects for genotype with pedigree (A).

How can I fit a BGLR model using sites with fixed effects and genotype (random) with a pedigree? Could you please provide example in R using the BGLR scripts?

Thank you very much!

Cheers,

I attached the sample dataset (data_BLUEs) and the pedigree (calculated as twice the coefficient of parentage) as well.
A.matrix.txt
data.txt

Leonardo

How to predict New y when adding new samples?

Hi,

How to predict NEW y using exists training results(like marker effect...) when adding new samples(Marker ID and marker number are not changed)?

Best wishes.
Xie Kun

QUE about Heritability Estimation

Dear Dr. Gustavo
The heritability I calculated using BGLR is significantly lower than the heritability reported in the literature. And no errors were reported during the BGLR operation.
The following is my own code. Is there something wrong? ?
(In the ETA settings, I set 'flock' and 'sex' in my dataset as fixed factors)

GBLUP: Heritability Estimation
library(BGLR)
setwd('F:\GS2\demo\275demo')
##dataset of phenotype
Y<-read.table("wool_1_2.txt",sep="\t",header=TRUE)
Y$SEX <- as.factor(Y$SEX) ##set sex into factor
Y$FLOCK <- as.factor(Y$FLOCK) ##set flock into factor
y<-Y[,7] select the trait “diameter of wool (DIA)”
##red_ped
out=read_ped("AMS286_QC_impute.ped")
p=out$p
n=out$n
out=out$x
out[out==2]=NA
out[out==3]=2
X=matrix(out,nrow=p,ncol=n, byrow=TRUE)
X=t(X)
X<-scale(X,center=TRUE,scale=TRUE)
G=tcrossprod(X)/ncol(X)
fm1=BGLR( y=y,ETA=list(list(~factor(FLOCK)+factor(SEX),data=Y,model='FIXED'),
list(K=G,model='RKHS')),
nIter=12000,burnIn=2000,saveAt='eig_'
)

varE=scan( 'eig_varE.dat')
varU=scan('eig_ETA_G_varU.dat')
h2_2=varU/(varU+varE)
h2_2
mean(h2_2)

The heritability of the diameter of wool in the literature is 0.57 ~ 0.59, but the heritability of the calculation after running the above code is 0.19, please help me check where the above code problems occur, why the difference is so great
Thanks a lot !

Meaning of d in BGLR output

Hi,
I was wondering what the meaning of 'd' is in the BGLR output. For instance if I run BayesC and and I only have markers as my linear predictors, what is fm$ETA[[1]]$d? By looking at the source code I assumed this was the model frequency. However, I would expect to see a correlation between the marker effect size and model frequency, but the correlation is close to 0.
Thanks,
Malachy

Missing value in genotype

Hi,

Just wonder how to represent missing genotype, it looks like the BGLR still not tolerate the missing value in the predictor ?

Great thanks,

Multitrait function not found in Multitrait model

When I run this line in multi-traits model: fm<-Multitrait(y=y,ETA=ETA,nIter=1000,burnIn=500), it says "could not find function "Multitrait".
Am I missing something here? Thanks

Feature Request: Option to disable saving the output

I would appreciate if we can disable saving the output of the BGLR function. There can be hard disk space limited cases or if the library is called in parallel writing to disk can be a bottleneck.

Thank you very much for making this package.

How to view the results of Cross-validation using a loop

Dear Dr. Gustavo
Thanks for your R package, it is very practical！
According to the provided wheat data set, I tried to do 5-fold cross-validation, but I only saw the results of the operation. How can I extract the prediction accuracy result of each verification?
when I code “cor(y,yHatCV)”, Is that right?
Yours
Yoledou

!!!help!

Adjust for genetic relationship matrix?

Dear Team,
I'd be interested to use BGLR on an admixed human population for binary outcome. However, my data are familial, that is related that are to be adjusted using a NxN genetic relationship matrix as a random variable.

I was wondering if this package allows for that?

Thanks

Need some help using GxE interaction model

Hi Marco,

Please help me design a GxE model by giving me some insights.
Please help me , i have already done random regression model by using environment variables.
Just some help i need from you. i am in desperation to design a gxE model. please please please help me to design something.

my mail [email protected]
skype [email protected]

How to use fixed variance components in Multitrait function?

I would like to do cross validation with fixed variance. How to use fixed variance components in Multitrait function?

Genomic heritability

Unexpected number of fields in ped file

I have a genotype data converted from vcf to plink format in TASSEL 5.0.

Then I tried to load the ped and map file to BGLR using read_ped, however, encountered the following error:

Error in read_ped(f_geno) : 
  Unexpected number of fields in  qxs123_104_GenomeModelTestParents_germplasms_3kset

The ped file has 4006 columns, which is false for (4006-6)%%2!=0

Any explaination? Thank you.

Saved sampler trace stored too lossy if options(digits= ...) is set to a low value

When R is run with a low number of printed digits, the saved sampler chains might loose information.

This is because the used write(...) calls are affected by the number of printed digits. For example, if I set options(digits=2), the output might look like this

151 151 151 151 151
152 151 151 151 151
151 151 152 151 151
...

However, options(digits=) is meant to affect the print output and not the accuracy of saving data. This gives the false impression that the sampler has convergence problems, for example, that it rejects almost all parameter changes.

I would suggest to use format() or similar around the write(.) statements in BGLR::BGLR to avoid the dependency of options(digits=).

cross-validation for binary or categorical trait

Dear Dr. Gustavo

To calculate the accuracy of the validation in the case of the binary or categorical trait, After fitting models with the binary and ordinal response which one of the outputs are supposed to use as a phenotype instead of the binary values to make a correlation with GEBV of the validation set? On the other hand, Which output of the binary model should be used to make a correlation with GEBV validation?

Regards

Sajjad

Error in eigen(LT$K) : infinite or missing values in 'x'

Dear Dr. Gustavo,
I fit a 'GBLUP' model using BGLR ,this is my code:

nlines=152
d<-read_ped(ped_file='/home/hgc-z3/zc/gs/1520cAset7.ped')
X<-data.frame(matrix(unlist(d[1]), nrow=nlines, byrow=FALSE),stringsAsFactors=FALSE)
Y<-read.table('/home/hgc-z3/zc/gs/cA.trait.txt',header=T,sep="\t")
y<-Y[,2]
nTST<-30
n<-nrow(X);p<-ncol(X);nRep<-100;nIter<-100000;burnIn<-10000
X<-scale(X,center=TRUE,scale=TRUE)
G<-tcrossprod(X)/p
ETA<-list(list(K=G,model='RKHS'))
COR<-matrix(nrow=nRep,ncol=1,NA)
for(i in 1:nRep){
tst<-sample(1:n,size=nTST,replace=FALSE)
yNA<-y;yNA[tst]<-NA
fm<-BGLR(y=yNA,ETA=ETA,nIter=nIter,burnIn=burnIn)
COR[i,1]<-cor(y[tst],fm$yHat[tst]);rm(fm)
}
write.table(COR,file="t.csv",quote=FALSE,sep=",")

However, it give error : Error in eigen(LT$K) : infinite or missing values in 'x' ,I donot where is wrong in my file and they seem no missing value.
where is the wrong?
Thanks a lot !

Mismatch between specified and final probIn (BayesC)

Hello,

I'm setting the probIn using the following command:

ETA=list( list(X=X,model='BayesC',probIn=0.5,counts=2)

After fitting the model the probIn attribute doesn't match the specified value.

fmBC$ETA[[1]]$probIn`

For example:
probIn=0.5, counts=2 $probIn=0.01377732
probIn=0.5, counts=10 $probIn=0.01513755
probIn=0.5, counts=1000000 $probIn=0.5000028

If it's possible can you explain the difference between the input probIn and the final $probIn?

Thank you again for the library.

Autocorrelation problem with MCMC steps

Dear all,

I'm working with Genomic Selection in sugarcane, so to understand the BGLR package, I picked up the data available in Spindel et al. (2015) and I'm running different Bayesian models (Bayesian gBLUP, Bayes A, Bayes B, Bayes C, Ridge Regression and Bayesian Lasso). So for Bayes A, I'm using the function as follow:

subs <- sample(rep(1:5, length.out = length(y)))
ref <- data.frame(i=1:length(y), sg=subs)

BayesA <- function(i, my.nIter, my.burnIn, my.thin){
trn.y <- y
trn.y[ref$sg==i] <- NA
BGLR(y=trn.y, ETA=list(list(X=genot, model='BayesA')), nIter=my.nIter, burnIn=my.burnIn, thin=my.thin, verbose=FALSE)$yHat
}

In the first time, I used:

system.time(result <- data.frame(mclapply(1:5, function(i) BayesA(i, my.nIter=10000, my.burnIn=0, my.thin=1), mc.preschedule=FALSE, mc.set.seed=TRUE, mc.cores=ncores)))

mcmc <- as.mcmc(data.frame(varE=varE$V1, varU=varU$V1, mu=mu$V1))
ac <- autocorr(mcmc, lags=seq(0, 1000, 10))
acor <- data.frame(seq(0, 1000, 10), stack(data.frame(varE=ac[,1,1], varU=ac[,2,2], mu=ac[,3,3])))
names(acor) <- c('interval', 'autocorrelation', 'parameter')

ggplot(acor, aes(x=interval, y=autocorrelation, colour=parameter, group=parameter)) +
geom_line(size=0.75) +
geom_point(size=1, pch=21, fill='grey30', alpha=0.3) +
theme_bw()

and the result was:

In the second time, I used:
system.time(result <- data.frame(mclapply(1:5, function(i) BayesA(i, my.nIter=1001000, my.burnIn=1000, my.thin=50), mc.preschedule=FALSE, mc.set.seed=TRUE, mc.cores=ncores)))

and the result was:

Through these results, I saw that after 1001000 iterations there are autocorrelation in the Markov chain. What your suggestion to solve this problem? How can i reach at least 1000 effective steps for each parameter?

Best regards
Renato

Setting the environments as fixed effects.

Hello,

I am new in genomic prediction. I am using the wheat dataset provided with your package to implement different models using BGLR. Right now, I am trying to implement GBLUP to predict grain yield using information from all four environments. I would like to set the environment as fixed effect and use the markers as random effects. I am facing difficulties setting the environment as fixed effect and I was hoping you could provide your help.

Thank you very much.

fit pedigree matrix in unbalanced design

Hi,

I'm curious if it's possible to fit pedigree relationship matrix for an unbalanced data (unbalanced field design).
Also, whether it's possible to have interaction terms with either GRM or A matrix.

Thanks,
Jia

within family genomic heritabiity

I want to use BGLR for estimating within family heritability using GRM. I tried the two heritability methods (with variance components and sample variance of genomic values) but I get different estimates. Which approach should I use to estimate heritability with a single family? Thanks in advance.

mixed recursive models

Resolved: On MacOS 12.6 (Monterey) ld: library not found for -lgfortran

Hi everyone,

I'm just offering up this solution (that worked for me to install BGLR-R) on my Mac (Intel processor) running OS 12.6.

I did this (which is a derivation of a Linux solution found here: https://stackoverflow.com/questions/6302209/building-r-package-and-error-ld-cannot-find-lgfortran):

install updated gcc using homebrew
found path to gfortran libraries (/usr/local/lib/gcc/12/)
create a file called "Makevar" under hidden .R directory
write line "FLIBS=-L/usr/local/lib/gcc/12/"
source Makevar file (not sure if this was necessary)
installed BLGR-R in R using: install_git('https://github.com/gdlc/BGLR-R')

Hope this helps someone!

Best wishes,
Sheina

Push this version to CRAN?

It seems as though you took out the OPENMP requirements for this package in src/Makevars, but the version on CRAN still has them.

Do you know when you may push this to CRAN? This causes a whole slew of errors if you don't use CRAN's clang compiler and use the native OS X clang.

A similar fix was done (checking for OpenMP headers) in jonclayden/mmand#4.

BayesB specifiation

I am trying to make a comparison between BayesA and BayesB with the same training and validation set. However, when I run BayesB with the option "probIn=0.01", I got almost a zero correlation between GEBV and phenotype for BayesB and here is how I calculated the accuracy as a following code:
ytst<-pheno[tst3]; Xtst<-X[tst3,]; ghat_tst<-Xtst%*%fm$ETA[[3]]$b; geno_accu_tst3<-cor(ghat_tst,ytst)/h

First, for specifying pi for BayesB, do I need to set "probIn" and "Count" to a specific value? and Second have I calculated the correlation of GEBV and phenotype in the proper way?

Regards
Sajjad

Scale parameter in LP 2 was missing and was set to NA

Hello, Why does the following situation occur when running BGLR?
{nIter=6000;burnIn=2000;thin=3;S0=NULL;weights=NULL;R2=0.5;df0=5;}
DF in LP 1 was missing and was set to 5
R2 in LP 1 was missing and was set to 0.25
Scale parameter in LP 1 was missing and was set to 1.68210235993004
DF in LP 2 was missing and was set to 5
R2 in LP 2 was missing and was set to 0.25
Scale parameter in LP 2 was missing and was set to NA

Iter=1 Time/Iter=3.842
VarE=NA

Iter=2 Time/Iter=3.832
VarE=NA

Thanks for the BGLR package, anyway!

Error in rinvGauss(n = ETA[[j]]$p, nu = nu, lambda = ETA[[j]]$lambda2) : nu must be positive

Dear Author
I am trying to run the following script for BGLR analysis, all the required parameters for the script have been passed in correctly, however as the number of iterations increases, the software starts to report an error that I can't handle：Error in rinvGauss(n = ETA[[j]]$p, nu = nu, lambda = ETA[[j]]$lambda2)
nu must be positive:
Script Content:

library(BGLR)
library(optparse)

option_list = list(
  make_option("--pheno", type="character", default=NULL, 
              help="Phenotype file name", metavar="phenofilename"),
  make_option("--geno", type="character", default=NULL, 
              help="Genotype file name, it is a 012 type matrix which created by vcftools", metavar="genofilename"),
  make_option("--out", type="character", default=NULL, 
              help="Output model result name", metavar="outfilename"),
  make_option("--cv", type="integer", default=10, 
              help="Number of cross-validation folds", metavar="cvnum"),
  make_option("--fold", type="integer", default=5, 
              help="Number of individuals to sample in each fold", metavar="foldnum"),
  make_option("--iter", type="integer", default=16000,
              help="Number of interation times", metavar="interation"),
  make_option("--burn", type="integer", default=5000,
              help="Number of burn times", metavar="burn"),
  make_option("--outputdir", type="character", default=NULL,
              help="providing a path to output the result and idx", metavar="outputdir")
)


opt_parser = OptionParser(option_list=option_list)
opt = parse_args(opt_parser)


set.seed(42)
options(digits = 4)

SNP=read.table(opt$geno)
META=read.table(opt$pheno,header = TRUE)

options(digits =4)

rrBLUPres = matrix(ncol = 5, nrow = nrow(META) * ncol(META) * opt$cv)
colnames(rrBLUPres) = c("trait", "heritability", "accuracy", "r2", "set")
idxlist = matrix(nrow = opt$cv, ncol = nrow(META) * 1 / opt$fold)
rownames(idxlist) = c(seq(1,opt$cv))

nIter=opt$iter
burnIn=opt$burn
set.seed(42)
n=nrow(META)
nTST=round(as.numeric(n/opt$fold),0)

X1=scale(SNP)
X1[is.na(X1)] <- 0
for (i in seq(1,opt$cv)){
  testidx = sample(as.numeric(rownames(META)),as.numeric(nrow(META) * 1 / opt$fold))
  sort_testidx = sort(testidx)
  idxlist[i,] = sort_testidx
  pheno = META
  pheno[sort_testidx,] = NA

  setwd(opt$outputdir)

  for (p in seq(2,ncol(META))){
    varfile_dir <- paste0("varfile/BLcv", i, "_META", p)
    dir.create(varfile_dir, recursive = TRUE)
    setwd(varfile_dir)
    y=pheno[,p]
    y_true = META[,c(1,p)]
    idx_pre = intersect(which(is.na(y_true[,2])==F),sort_testidx)
    idx_r2 = which(is.na(y_true[,2])==F)
    
    fmBL=BGLR(y=y,ETA=list(list(X=X1,model='BL',saveEffects=T)),nIter=nIter,burnIn=burnIn)
    prediction=fmBL$yHat
    
    varU = scan('ETA_1_lambda.dat')
    #varU = na.omit(varU)
    varE = scan('varE.dat')
    #varE = na.omit(varE)
    h2_2 = varU/(varU+varE)
    
    accuracy = cor(prediction[idx_pre], y_true[idx_pre,2],use = "complete.obs")
    r2=1 - sum((prediction[idx_r2]-y_true[idx_r2,2])^2)/
      sum((y_true[idx_r2,2]-mean(y_true[idx_r2,2]))^2)
    heritability = mean(h2_2)
    
    rrBLUPres[(i-1)*ncol(META)+p,1] = colnames(META)[p]
    rrBLUPres[(i-1)*ncol(META)+p,2] = heritability
    rrBLUPres[(i-1)*ncol(META)+p,3] = accuracy
    rrBLUPres[(i-1)*ncol(META)+p,4] = r2
    rrBLUPres[(i-1)*ncol(META)+p,5] = i
    
    rrBLUPres = rrBLUPres[order(rrBLUPres[,1]),]
    setwd(opt$outputdir)
    unlink(varfile_dir, recursive = TRUE)
  }
}
setwd(opt$outputdir)
rrBLUPres_clean <- na.omit(rrBLUPres)
write.table(rrBLUPres_clean, file = opt$out, 
            col.names = T, row.names = F, quote = F, sep="\t")
write.table(idxlist, file = "idx.BGLR.BL.out", 
            col.names = T, row.names = T, quote = F, sep="\t")

Error:

Iter=1648 Time/Iter=0.071
varE=28.741
---------------------------------------
Iter=1649 Time/Iter=0.072
varE=28.545
---------------------------------------
Iter=1650 Time/Iter=0.093
varE=28.623
Error in rinvGauss(n = ETA[[j]]$p, nu = nu, lambda = ETA[[j]]$lambda2) : 
  nu must be positive
Warning in BGLR(y = y, ETA = list(list(X = X1, model = "BL", saveEffects = T)),  :
  tau2 was not updated  in iteration 1651 due to numeric problems with beta

---------------------------------------
Iter=1651 Time/Iter=0.086
varE=NaN
Error in if (any(nu <= 0)) stop("nu must be positive") : 
  missing value where TRUE/FALSE needed
In addition: Warning messages:
1: In rnorm(n = nNa, sd = sdE) : NAs produced
2: In rnorm(n = 1, sd = sqrt(1/C)) : NAs produced
Warning in BGLR(y = y, ETA = list(list(X = X1, model = "BL", saveEffects = T)),  :
  tau2 was not updated  in iteration 1652 due to numeric problems with beta

Can I make any attempts to troubleshoot this issue or are you aware of the possible cause or location of the problem?
Thank you for your help, I would appreciate it!

Working with out-of-memory matrices

I was reading the vignette of this very interesting package. In the vignette, you say
"Also, we are currently working on modifying the software so that genotypes do not need to be stored in memory".

If you do want to use genotypes stored on disk, packages bigstatsr and bigsnpr (https://doi.org/10.1093/bioinformatics/bty185) might be able to help you with this.
I'm the author of those packages and I can help you and answer your questions if you want more information about those packages.

Reliability estimate

Gustavo,
Is there any built in function to calculate a reliability estimate for each of the predicted values in the BGLR code?
(this is similar to computing accuracy of prediction in Pedigree BLUP
sqrt(1-(std_err_pred_i)^2/(diag(Aii)*s2a))
Thanks,
Mak

Warnings since R 4.0 because class(MATRIX) = c("matrix", "array") now

The line BGLR.R:500 yields a warning since R 4.0 because matrices are now class "matrix" and "array". Therefore, there is now a 2-logical in the if() condition.

if(class(LT$K)!="matrix")

Proposal: Use inherits() and never use class(...) == x inside if (this happens multiple times in the code base)

Issue with sparse response matrix

I have used MTM for a while now and I was quite happy with it. I recently tried to use the "Multitrait" equivalent in BGLR but I faced an issue.

I have a very sparse response matrix for which some individuals have been observed in a single environment (NAs otherwise). The matrix looks like:
Env1 Env2 Env3 Env4 Env5
IND1 NA 2.52 NA NA 0.23
IND2 1.37 NA NA NA NA
IND3 NA 3.50 NA NA -2.20
IND4 -0.85 NA 2.10 -0.65 -0.22
IND5 0.24 0.07 NA NA NA
IND6 NA NA NA NA -0.86

When I try to run this code:

ETA<-list(list(K=K,model="RKHS"))
BGLR_results <- BGLR::Multitrait(y=Y,ETA=ETA,resCov=list(type="DIAG"),nIter=1000,burnIn=500)

I get the following error due to the absence of individuals for which observations are available in all columns:
Error in cov(y, use = "complete.obs") : paires d'éléments incomplètes

When running MTM, I did not have this issue for this kind of data. It seems to me that modifying cov(y, use = "complete.obs") to cov(y, use = "pairwise.complete.obs") would at least enable the calculation when there is pairwise overlap between environments. But ideally, I would like to be able to infer parameters when there is no overlap at all between environments, which was possible with MTM.

Thanks for your help!

option for a multivariate response?

Hi All,

This is not so much of an issue as a question. Is there any capability to fit a multivariate response for any of the regression models in this package?

Thanks,
Jacqueline

Substantial heritability with simulated random outcome

Dear all,

I wanted to use BGLR to estimate the genome-wide variance explained of a trait by SNPs and DNA methylation . A colleague and I noticed that BGLR estimates substantial SNP heritability or DNA methylation variance components in the range of 10-25% even if the outcome is randomly simulated. I included below R code using the example data and scripts from the paper to illustrate the problem. My expectation is that the variance component should be 0%, yet the example estimates a heritability of 24%. I wonder whether this is expected behavior. The only explanation I can think of is that the prior distribution has a large influence on the mean posterior distribution. If that is the case, however, I wonder how to interpret any outcomes with true variance components of 20-30%. These would seem to be indistinguishable from random associations.

Best,

Alex

### Boxes  9  and S3 #####################################################################

### Box 9 ################################################################################
#1# Loading and preparing the input data 
library(BGLR); data(wheat);  set.seed(123);
X<-wheat.X; A<-wheat.A/mean(diag(wheat.A));  
y<-rnorm(599)

#2# Computing the genomic relationship matrix 
X<-scale(X,center=TRUE,scale=TRUE) 
G<-tcrossprod(X)/ncol(X) 

#3# Computing the eigen-value decomposition of G 
EVD<-eigen(G) 

#3# Setting the linear predictor 
ETA<-list(PED=list(K=A, model="RKHS"), 
          MRK=list(V=EVD$vectors,d=EVD$values, model="RKHS") 
) 

#4# Fitting the model 
fm<-BGLR(y=y,ETA=ETA, nIter=12000, burnIn=2000,saveAt="PGBLUP_") 
save(fm,file="fmPG_BLUP.rda") 

# Residual variance 
varE<-scan("PGBLUP_varE.dat") 

#varA and varU 
varA<-scan("PGBLUP_ETA_PED_varU.dat") 
varU<-scan("PGBLUP_ETA_MRK_varU.dat") 


varG<-varU+varA 
h2<-varG/(varE+varG) 
mean(h2);sd(h2) 

mean(varU/varG) 
mean(varA/varG)

readBinMat for Multitrait

Hi,

the function readBinMat expects the first two elements in a given file to contain the the number of rows and
columns, respectively.
A file is initialised like this in BGLR.R :

writeBin(object=c(nRow,LT$p),con=LT$fileEffects,size=ifelse(LT$storageMode=="single",4,8))

In Multitrait.R, like this:

writeBin(object=c(nRow,traits,LT$p),con=LT$fileEffects,size=ifelse(LT$storageMode=="single",4,8))

So the first 3 elements are reserved to hold: number of samples, number of traits and number of levels for the ETA,
resulting in a 3 dimensional array.

The posterior samples from a multitrait model can be read like this:

library(BGLR)

data(wheat)
Y = wheat.Y
M = wheat.X

tmpDir = tempdir()
setwd(tmpDir)

niter = 1000
burnin = 500
thin = 1

ETA <- list(
	    G = list(
		     X = M, 
		     model = "BRR", 
		     saveEffects = TRUE
	    )
)

mod <- Multitrait(
		   Y, 
		   ETA = ETA,
		   nIter = niter, 
		   burnIn = burnin, 
		   thin = thin
)

fileIn = 'ETA_G_beta.bin'
dims = readBin(fileIn, n = 3, what = numeric(), size = 8)
tmp = readBin(fileIn,n = prod(dims) + 3, what=numeric(), size = 8)[-(1:3)]
post_beta = array(data = tmp, dim = dims[c(3,2,1)]

However, the built in function readBinMat cannot be used, as it is expecting a 2-dimensional object.

This could be resolved by always reserving the first 3 elements in the output file (e.g. 'ETA_G_beta_bin')
to hold: number of samples, number of traits and number of levels for the ETA.
For a single trait model from a BGLR call, the second dimension would be 1, and can be dropped when
creating the final output array, thereby not breaking existing workflows that use readBinMat, but still allowing
the same function to be used to read posteriors for multitrait models.

Warm starting the Gibbs sampling

Is there a way to warm start Gibbs sampling with particular initial values? I wasn't sure whether it's possible with adjusting the prior means as it's a hierarchical model.

Can I use FALSE for scale(), such as scale(X,center=FALSE, scale=FALSE)?

Hi ,when I use X<-scale(X,center=TRUE,scale=TRUE) , there are NaN in X and
Error in setLT.BayesBandC(LT = ETA[[i]], n = n, j = i, weights = weights, :
LP 1 has NAs in X

when I set X<-scale(X,center=FALSE,scale=FALSE) ,it ok ,
does this two parameters matter much ? Is this setting ok ?
thanks a lot!

Repository Consolidation

I created https://github.com/QuantGen/BGLR a few months ago. Should we delete that repository in favor of this one to not confuse people?

Option to save samples of marker effects

Would you consider adding an option to save samples of marker effects? At the moment only samples of variances are saved, but in some downstream applications samples of marker effects would be needed. I am happy to provide a pull request should this be deemed worthy.

Computing time

Hello,
I use BGLR for genomic predictions and I have very long computing times (days) with a dataset comprising 22K SNPs and 200K data records, and a GBLUP approach. I think this is mostly because the function takes a lot of time before starting the iterations.
I thought it was normal, but I saw in your article (Figure S7 of Fine mapping and accurate prediction of complex traits using Bayesian Variable Selection models applied to biobank-size data) that you had much lower computing times with 10K SNPs and 300K samples (<30minutes). What could be the reason for that ? I use a high performance computing cluster, the BGLR function and 40K iterations. My model has 3 fixed effects, 3 random effects (iid) and 1 random effect associated with the G matrix.
Thanks
David

R version requirement too high

I don't see why this package needs R >= 3.1.3. The version requirement makes it difficult to install BGLR in many non-cutting edge environments (such as Debian).

fille connection issue

Hi,

I encountered this error when fitting a BGLR model.
Error in file(description = fname, open = "w") : all connections are in use Calls: BGLR -> setLT.BRR -> file Execution halted

I managed to increase the ulimit -n to a fairly large number but the issue still exist. I did have a interaction term in the model which has many levels. A slightly smaller dataset works fine. I appreciate it if any insights can be provided!

Thanks,

Missing values X

Hi,

Great package. Any chance there will be an option for having missing values in your predictors (X) in the future? If not, what do you suggest as a workaround?

Cheers,
Peter

Installation failed on Windows

> install_git('https://github.com/gdlc/BGLR-R/')
Downloading git repo https://github.com/gdlc/BGLR-R/
Installing BGLR
"C:/PROGRA~1/R/R-32~1.1/bin/x64/R" --no-site-file --no-environ --no-save --no-restore CMD INSTALL  \
  "C:/Users/agonzalez/AppData/Local/Temp/Rtmpk7TOzw/file1a1845e25d3b"  \
  --library="C:/Users/agonzalez/Documents/R/win-library/3.2" --install-tests 

* installing *source* package 'BGLR' ...
** libs

*** arch - i386
Warning: running command 'make -f "Makevars.win" -f "C:/PROGRA~1/R/R-32~1.1/etc/i386/Makeconf" -f "C:/PROGRA~1/R/R-32~1.1/share/make/winshlib.mk" SHLIB="BGLR.dll" ' had status 127
ERROR: compilation failed for package 'BGLR'
* removing 'C:/Users/agonzalez/Documents/R/win-library/3.2/BGLR'
* restoring previous 'C:/Users/agonzalez/Documents/R/win-library/3.2/BGLR'
Error: Command failed (1)