quantgen / bgdata Goto Github PK
View Code? Open in Web Editor NEWA Suite of Packages for Analysis of Big Genomic Data
License: Other
A Suite of Packages for Analysis of Big Genomic Data
License: Other
I have to think about this a little, but this could be a separate function or in the constructor.
3.1.2 is not really needed, try something lower.
We have a formula interface, so the distinction between phenotypes and covariates is not as important as in PLINK. Maybe we should have an alias for alternatePhenotypeFile
in as.BGData
for covariates? They have the same file format.
I'm running RStudio on windows and have hit an error at the Summarizing Data step:
Summarising data
Use chunkedApply to count missing values (among others):
countNAs <- chunkedApply(X = geno(bg), MARGIN = 2, FUN = function(x) sum(is.na(x)))
Error in mclapply(X = seq_len(nChunks), FUN = chunkApply, ..., mc.cores = nCores) :
'mc.cores' > 1 is not supported on Windows
rayOLS should have a signature that starts with x
to not confuse apply etc. I don't think this is causing an issue right now because we are explicitly passing y
which takes y
out of the rotation, but it may bite back in the future.
This is the method that is run when just the variable name is typed into the console. For larger matrices, printing the whole thing can be quite annoying, so I think just printing the dimensions and number of chunks should be enough.
This implies that the data has already been centered, but since this is not the case, impute by mean instead.
How could I convert BEDmatrix to a dataframe? As I hope to run some specific tests.
Thanks
library(BGLR)
data(mice)
X <- mice.X
rownames(X) <- paste0("ID_", 1:nrow(X))
G3 <- getG.symDMatrix(X, scaleCol = TRUE, centerCol = TRUE, folder = "tmp", blockSize = 300, vmode = "double")
dim(G3) # returns 1500 1500 but should be 1814 1814
Hi,
I'm using a genomic data set with 180K SNPs to create G matrices. I want to exclude around 3K of these SNPs from the calculation, so I used the j argument to include all other columns than these 3K. This gave a wildly different answer than when not excluding columns, with generally much larger entries than what we should be seeing.
This problem persisted even when I excluded just 1 single randomly selected SNP column (again, out of 180K SNPs, so this should have a large effect, right?). This also shows that the problem is not specific to the columns I wanted to exclude, but to excluding columns with j at all.
Luckily I was able to get around this issue by instead making a new BGData object where the unwanted SNP columns were not included in the first place, and then not specifying j in getG. This gave the expected/right answer, so there seems to be some problem with using j.
Thank you for the package, it is very helpful.
For computations of G matrices gpuR can give use better performance that the current multi-core approach. I suggest to add a function crossprods_gpu(), see template below.
Problems to consider:
- Will we require gpuR? For now, while we test it and learn, we may not require gpuR
- Detecting whether gpu is availalbe, if not throw an error.
- Modifying getG by adding an argument useGPU, if TRUE, we call crossprods_gpu, that authomatically will override, nCores.
Gustavo
crossprods_gpu <- function(x, y = NULL, use_tcrossprod = FALSE) { #*# remove some args
dx <- dim(x)
if (!is.null(y)) {
y <- as.matrix(y)
dy <- dim(y)
if (use_tcrossprod) {
if (dx[2L] != ncol(y)) {
stop("Error in tcrossprod_gpu: non-conformable arguments.")
}
} else {
if (dx[1L] != dy[1L]) {
stop("Error in crossprod_gpu: non-conformable arguments.")
}
}
}
x <- gpuMatrix(x,"float")
if (!is.null(y)) {
y <- gpuMatrix(y,"float")
}
if (use_tcrossprod) {
Xy <- tcrossprod(x, y)
} else {
Xy <- crossprod(x, y)
}
return(Xy)
}
Some of the IMPUTE2 SNP names are just .
.
Similar to lm()
: Estimate, SE etc.
I'm attempting to run GWAS using BGData. I first went through the tutorial in the readme with test data and it worked properly. I'm now trying to run my data and am getting the following error.
> gwas <- GWAS(formula = my_phenotype ~ 1, data = bg)
Error in `colnames<-`(`*tmp*`, value = colnames(data@geno)[j]) :
attempt to set 'colnames' on an object with less than two dimensions
In addition: Warning message:
In parallel::mclapply(X = seq_len(nChunks), FUN = chunkApply, ..., :
all scheduled cores encountered errors in user code
The only thing I did differently from the tutorial was that I loaded the entire genome as a single .bed, rather than loading each chromosome separately and then combining them using ColumnLinkedMatrix. I don't know a good way to load and combine chromosomes/scaffolds in my case since there are over 1000 for the reference genome I'm working with. I ran the following:
wg <- BEDMatrix("my_filepath.bed")
bg <- as.BGData(wg, alternatePhenotypeFile = "my_phenodata")
...and I can't see what the problem with the object "geno" is. It appears to have two dimensions, contrary to the error I'm getting from the GWAS function.
> bg
An object of class "BGData"
Slot "geno":
BEDMatrix: 882 x 4428726 [my_filepath.bed]
Thanks for the help getting this to run with my data.
Some input files have the markers in the rows and the individuals in the columns. A simple transpose
parameter that defaults to false
and some logic that write rows into columns should do the trick.
ff
does not handle character data. We discussed recoding, but it would slow things down. Instead, we decided to throw an error with the respective PLINK command: plink --file ped.txt --recodeA
?
Hi,
I am getting this error and cannot figure out why it is happening:
Extracting number of samples and rownames from FAM file...
Extracting number of variants and colnames from BIM file...
Error in .local(.Object, ...) : No such device
obviously, it can find the fam
and bim
and the bed
file is there too but I don't see why I get this error. I would be thankful if you could help.
Thanks,
Kayhan
This happens when I run getG.symDMatrix
on a larger dataset:
Working pair 1-1 (0% 2274.482 seconds).
Error in `dimnames<-.ff_array`(`*tmp*`, value = dn) :
dimnames[[i]] is neither NULL nor matches dim[i]
Debugging information:
> dim(bg@geno)
[1] 13113 841820
> head(rownames(bg@geno))
[1] "0_230853" "0_147875" "0_373389" "0_354405" "0_500565" "0_383668"
> head(colnames(bg@geno))
[1] "SNP_A-4256704_C" "SNP_A-1964387_A" "SNP_A-1965574_A" "SNP_A-1968937_G"
[5] "SNP_A-1968946_C" "SNP_A-4212582_C"
The BC has three duplicate individual IDs that cause an error message when printing @pheno
. readPED
runs without complaining.
Check for try-error
objects, etc.
These examples illustrate a problem that we need to fix in getGi (note, getGij works fine)
library(BGData)
library(BGLR)
data(mice)
X=mice.X
scales=apply(FUN=sd,MARGIN=2,X=X)
means=apply(FUN=mean,MARGIN=2,X=X)
## Computing G using tcrossprod
G0=tcrossprod(scale(X))
## Computing G using getG with scales and centers computed internally
G1=getG(X,scaleG=F,scaleCol=T)
all.equal(G0,G1)
## Computing G using getG with user provided scales and centers
G2=getG(X,scaleG=F,scaleCol=T,i=1:nrow(X),centers=centers,scales=scales)
all.equal(G0,G2)
## Computing G using i and i2
G12=getG(X,scaleG=F,scaleCol=T,centers=centers,scales=scales,i=11:20,i2=11:20)
all.equal(as.vector(G12),as.vector(G0[rownames(G12),colnames(G12)]))
## Computin G using i only
G12=getG(X,scaleG=F,scaleCol=T,centers=centers,scales=scales,i=11:20)
all.equal(as.vector(G12),as.vector(G0[rownames(G12),colnames(G12)]))
That was a misunderstanding on my part, thinking that regular PED files are more common for this type of analysis than the ones that are generated by recodeA
.
This is necessary for #12.
getG
filters out constant columns based on the minVar
parameter, getG.symDMatrix
doesn't. If the X matrix contains constant columns, the G matrices created by getG.symDMatrix
will be NaN while the ones created getG
will be fine.
Thanks to @yrubio for reporting the problem.
A single parameter idCol
might not be enough: "The combination of family and individual ID should uniquely identify a person." (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml)
The following PED file will break @pheno
because of duplicates if idCol = 2
(the current default):
1 1 0 0 1 0 G G 2 2 C C
1 2 0 0 1 0 A A 0 0 A C
1 3 1 2 1 2 0 0 1 2 A C
2 1 0 0 1 0 A A 2 2 0 0
2 2 0 0 1 2 A A 2 2 0 0
2 3 1 2 1 2 A A 2 2 A A
Quick fix: allow vector as idCol
?
To calculate NA and MA frequencies.
Right now centering can only be disabled by using a null vector as centers
.
Over in #5 I mentioned that we need the storage location of an mmMatrix
object to add more chunks. This could be done using a parameter of *bind
, or as a property of the mmMatrix
object itself.
I think the latter is more convenient if we are planning to append to the object instead of creating a new one. I could image some attribute that is set when the mmMatrix
is created or loaded. The problem is that currently the object can be loaded using the load
function without us having a chance to interfere. In fact, if we rely on the parameter, this will have a negative impact on portability (paths could be wrong).
I think we have discussed this before, but I would like to hide the fact that the BGData object is saved using the save
command in favor of our loadBGData
and load2
functions. We would need a dedicated save function as well (that strips the attribute before saving, for instance).
To support more than one phenotype. Used with --pheno
in PLINK. The file does not depend on the order of the PED file, rather it should be merged by FID
and IID
with missing values if a value for an entry in PED is not found. There is also support for headers, but they have to start with FID
and IID
.
For now just with BEDMatrix elements.
We need to consolidate this.
Clearly we need one keyword to define the number of columns brought to RAM at a time, we need a different term to define how many cores used to process each chunk and probably a different term to define the block size in symDMatrix.
Here is a proposal
For now we could support a subset of allowed types in R, say integer
, double
, and character
. character
will only be available in readPED.matrix
for now.
rbind
will be easy for rmmMatrix
, cbind
for cmmMatrix
. The other ones will be tricky and probably involve creating new chunks.
Currently i2
is taken from X
, assuming that all samples are in X
. For distributing data to nodes, this can be inefficient. Need to add parameter Y
: if Y
is NULL
, i2
will be taken from X
as is, otherwise from Y
.
This should make installation easier for users that do not have the devtools package.
@agugonrey brought this up, could also be useful for other functions. A view feature for BEDMatrix could also be useful to implement this.
Thanks for implementing this package!
Is there already a workflow to write merged or subset data back to the file system in plink binary format?
A declarative, efficient, and flexible JavaScript library for building user interfaces.
๐ Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. ๐๐๐
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google โค๏ธ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.