quantgen / bgdata Goto Github PK

View Code? Open in Web Editor NEW

33.0 8.0 14.0 5.01 MB

A Suite of Packages for Analysis of Big Genomic Data

License: Other

R 82.96% C 17.04%

r r-pkg cran genetics genomics gwas

bgdata's People

Contributors

Stargazers

Watchers

Forkers

lian0090 sudhaveturi minghao2016 dnharry funkhou9 davan690 feraguate jingali ruixiangliu geneticresources quanrd gcostaneto rnaimehaom wook2014

bgdata's Issues

Add a way to load an mmMatrix independent of BGData

I have to think about this a little, but this could be a separate function or in the constructor.

Relax R version requirement

3.1.2 is not really needed, try something lower.

Consider covariate files in as.BGData

We have a formula interface, so the distinction between phenotypes and covariates is not as important as in PLINK. Maybe we should have an alias for alternatePhenotypeFile in as.BGData for covariates? They have the same file format.

'mc.cores' > 1 is not supported on Windows

I'm running RStudio on windows and have hit an error at the Summarizing Data step:

Summarising data

Use chunkedApply to count missing values (among others):

countNAs <- chunkedApply(X = geno(bg), MARGIN = 2, FUN = function(x) sum(is.na(x)))
Error in mclapply(X = seq_len(nChunks), FUN = chunkApply, ..., mc.cores = nCores) :
'mc.cores' > 1 is not supported on Windows

Fix signature of rayOLS

rayOLS should have a signature that starts with x to not confuse apply etc. I don't think this is causing an issue right now because we are explicitly passing y which takes y out of the rotation, but it may bite back in the future.

Implement show method for mmMatrix

This is the method that is run when just the variable name is typed into the console. For larger matrices, printing the whole thing can be quite annoying, so I think just printing the dimensions and number of chunks should be enough.

getG imputes by 0 if center = FALSE

This implies that the data has already been centered, but since this is not the case, impute by mean instead.

BEDmatrix conversion to data frame

How could I convert BEDmatrix to a dataframe? As I hope to run some specific tests.

Thanks

getG.symDMatrix: blockSize does not use all individuals

library(BGLR)
data(mice)
X <- mice.X
rownames(X) <- paste0("ID_", 1:nrow(X))
G3 <- getG.symDMatrix(X, scaleCol = TRUE, centerCol = TRUE, folder = "tmp", blockSize = 300, vmode = "double")
dim(G3) # returns 1500 1500 but should be 1814 1814

Check if i and j != integer()

Problem with j in getG

Hi,

I'm using a genomic data set with 180K SNPs to create G matrices. I want to exclude around 3K of these SNPs from the calculation, so I used the j argument to include all other columns than these 3K. This gave a wildly different answer than when not excluding columns, with generally much larger entries than what we should be seeing.

This problem persisted even when I excluded just 1 single randomly selected SNP column (again, out of 180K SNPs, so this should have a large effect, right?). This also shows that the problem is not specific to the columns I wanted to exclude, but to excluding columns with j at all.

Luckily I was able to get around this issue by instead making a new BGData object where the unwanted SNP columns were not included in the first place, and then not specifying j in getG. This gave the expected/right answer, so there seems to be some problem with using j.

Thank you for the package, it is very helpful.

use gpuR

For computations of G matrices gpuR can give use better performance that the current multi-core approach. I suggest to add a function crossprods_gpu(), see template below.

Problems to consider:
- Will we require gpuR? For now, while we test it and learn, we may not require gpuR
- Detecting whether gpu is availalbe, if not throw an error.
- Modifying getG by adding an argument useGPU, if TRUE, we call crossprods_gpu, that authomatically will override, nCores.

Gustavo

crossprods_gpu <- function(x, y = NULL, use_tcrossprod = FALSE) { #*# remove some args
dx <- dim(x)
if (!is.null(y)) {
y <- as.matrix(y)
dy <- dim(y)
if (use_tcrossprod) {
if (dx[2L] != ncol(y)) {
stop("Error in tcrossprod_gpu: non-conformable arguments.")
}
} else {
if (dx[1L] != dy[1L]) {
stop("Error in crossprod_gpu: non-conformable arguments.")
}
}
}

x <- gpuMatrix(x,"float")
if (!is.null(y)) {
    y <- gpuMatrix(y,"float")
}
if (use_tcrossprod) {
   Xy <- tcrossprod(x, y)
} else {
	Xy <- crossprod(x, y)
}

return(Xy)

}

Don't fail on duplicated row names in as.BGData

Some of the IMPUTE2 SNP names are just ..

Add colnames in GWAS.lsfit result

Similar to lm(): Estimate, SE etc.

GWAS: attempt to set 'colnames' on an object with less than two dimensions

I'm attempting to run GWAS using BGData. I first went through the tutorial in the readme with test data and it worked properly. I'm now trying to run my data and am getting the following error.

> gwas <- GWAS(formula = my_phenotype ~ 1, data = bg)
Error in `colnames<-`(`*tmp*`, value = colnames(data@geno)[j]) : 
  attempt to set 'colnames' on an object with less than two dimensions
In addition: Warning message:
In parallel::mclapply(X = seq_len(nChunks), FUN = chunkApply, ...,  :
  all scheduled cores encountered errors in user code

The only thing I did differently from the tutorial was that I loaded the entire genome as a single .bed, rather than loading each chromosome separately and then combining them using ColumnLinkedMatrix. I don't know a good way to load and combine chromosomes/scaffolds in my case since there are over 1000 for the reference genome I'm working with. I ran the following:

wg <- BEDMatrix("my_filepath.bed")
bg <- as.BGData(wg, alternatePhenotypeFile = "my_phenodata")

...and I can't see what the problem with the object "geno" is. It appears to have two dimensions, contrary to the error I'm getting from the GWAS function.

> bg
An object of class "BGData"
Slot "geno":
BEDMatrix: 882 x 4428726 [my_filepath.bed]

Thanks for the help getting this to run with my data.

Add option to transpose input data in readPED

Some input files have the markers in the rows and the individuals in the columns. A simple transpose parameter that defaults to false and some logic that write rows into columns should do the trick.

Handle character data

ff does not handle character data. We discussed recoding, but it would slow things down. Instead, we decided to throw an error with the respective PLINK command: plink --file ped.txt --recodeA?

Add options for i and j in getG.symDMatrix

Error in .local(.Object, ...) : No such device

Hi,

I am getting this error and cannot figure out why it is happening:

Extracting number of samples and rownames from FAM file...
Extracting number of variants and colnames from BIM file...
Error in .local(.Object, ...) : No such device

obviously, it can find the fam and bim and the bed file is there too but I don't see why I get this error. I would be thankful if you could help.

Thanks,
Kayhan

getG.symDMatrix: dimnames[[i]] is neither NULL nor matches dim[i]

This happens when I run getG.symDMatrix on a larger dataset:

 Working pair 1-1 (0% 2274.482 seconds).
Error in `dimnames<-.ff_array`(`*tmp*`, value = dn) : 
  dimnames[[i]] is neither NULL nor matches dim[i]

Debugging information:

> dim(bg@geno)
[1]  13113 841820
> head(rownames(bg@geno))
[1] "0_230853" "0_147875" "0_373389" "0_354405" "0_500565" "0_383668"
> head(colnames(bg@geno))
[1] "SNP_A-4256704_C" "SNP_A-1964387_A" "SNP_A-1965574_A" "SNP_A-1968937_G"
[5] "SNP_A-1968946_C" "SNP_A-4212582_C"

getG: Support character vectors for i, i2, and j

Handle duplicate row names in @pheno

The BC has three duplicate individual IDs that cause an error message when printing @pheno. readPED runs without complaining.

Try to improve error handling for mclapply calls

Check for try-error objects, etc.

as.BGData: Read BIM file

Add i and j to summarize

getGi versus getGij ( problems in getGi )

These examples illustrate a problem that we need to fix in getGi (note, getGij works fine)

library(BGData)
library(BGLR)
data(mice)
X=mice.X
scales=apply(FUN=sd,MARGIN=2,X=X) 
means=apply(FUN=mean,MARGIN=2,X=X)

## Computing G using tcrossprod
  G0=tcrossprod(scale(X))
## Computing G using getG with scales and centers computed internally
 G1=getG(X,scaleG=F,scaleCol=T)
 all.equal(G0,G1)
## Computing G using getG with user provided scales and centers
 G2=getG(X,scaleG=F,scaleCol=T,i=1:nrow(X),centers=centers,scales=scales)
 all.equal(G0,G2)
## Computing G using i and i2
 G12=getG(X,scaleG=F,scaleCol=T,centers=centers,scales=scales,i=11:20,i2=11:20)
 all.equal(as.vector(G12),as.vector(G0[rownames(G12),colnames(G12)]))
## Computin G using i only
 G12=getG(X,scaleG=F,scaleCol=T,centers=centers,scales=scales,i=11:20)
 all.equal(as.vector(G12),as.vector(G0[rownames(G12),colnames(G12)]))

Change na.strings of readPED back to NA to support raw files

That was a misunderstanding on my part, thinking that regular PED files are more common for this type of analysis than the ones that are generated by recodeA.

Implement custom save function

This is necessary for #12.

getG.symDMatrix: Support minVar

getG filters out constant columns based on the minVar parameter, getG.symDMatrix doesn't. If the X matrix contains constant columns, the G matrices created by getG.symDMatrix will be NaN while the ones created getG will be fine.

Thanks to @yrubio for reporting the problem.

Some PED files use both Family ID and Subject ID to uniquely identify individuals

A single parameter idCol might not be enough: "The combination of family and individual ID should uniquely identify a person." (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml)

The following PED file will break @pheno because of duplicates if idCol = 2 (the current default):

1 1 0 0 1  0    G G    2 2    C C
1 2 0 0 1  0    A A    0 0    A C
1 3 1 2 1  2    0 0    1 2    A C
2 1 0 0 1  0    A A    2 2    0 0
2 2 0 0 1  2    A A    2 2    0 0
2 3 1 2 1  2    A A    2 2    A A

Quick fix: allow vector as idCol?

GWAS: Check for columns with too many NAs / with zero variance

Add summarize function

To calculate NA and MA frequencies.

Add centerCol parameter for getG

Right now centering can only be disabled by using a null vector as centers.

Drop .RData extension of BGData object

Over in #5 I mentioned that we need the storage location of an mmMatrix object to add more chunks. This could be done using a parameter of *bind, or as a property of the mmMatrix object itself.
I think the latter is more convenient if we are planning to append to the object instead of creating a new one. I could image some attribute that is set when the mmMatrix is created or loaded. The problem is that currently the object can be loaded using the load function without us having a chance to interfere. In fact, if we rely on the parameter, this will have a negative impact on portability (paths could be wrong).
I think we have discussed this before, but I would like to hide the fact that the BGData object is saved using the save command in favor of our loadBGData and load2 functions. We would need a dedicated save function as well (that strips the attribute before saving, for instance).

Support phenotype files

To support more than one phenotype. Used with --pheno in PLINK. The file does not depend on the order of the PED file, rather it should be merged by FID and IID with missing values if a value for an entry in PED is not found. There is also support for headers, but they have to start with FID and IID.

Support LinkedMatrix in as.BGData

For now just with BEDMatrix elements.

nChunks, nChunks2, nTastks, mc.cores, chunkSize

We need to consolidate this.

Clearly we need one keyword to define the number of columns brought to RAM at a time, we need a different term to define how many cores used to process each chunk and probably a different term to define the block size in symDMatrix.

Here is a proposal

nChunks: define the size of the chunk to be brought to RAM.
chunkSize: used also to define the size of the chunk to brought to RAM. I suggest chunkSize over-rides nChunks.
nTasks used to define the number of tasks used to process one chunk.
mc.cores, I suggest to set it by default to min(detctCores(), nTasks)
blockSize: the number of columns/rows of a block in a symDMatrix

quantgen / bgdata Goto Github PK

bgdata's People

Contributors

Stargazers

Watchers

Forkers

bgdata's Issues

Summarising data

Use chunkedApply to count missing values (among others):

Recommend Projects

Recommend Topics

Recommend Org