bgdata's People

Contributors

agrueneberg, gdlc

bgdata's Issues

Consider covariate files in as.BGData

We have a formula interface, so the distinction between phenotypes and covariates is not as important as in PLINK. Maybe we should have an alias for alternatePhenotypeFile in as.BGData for covariates? They have the same file format.
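
A minimal sketch of what the alias could look like (the wrapper name is hypothetical; alternatePhenotypeFile is the existing argument):

library(BGData)

# Hypothetical wrapper: a covariate file is read and merged exactly like
# an alternate phenotype file, since the two formats are identical.
as.BGDataWithCovariates <- function(x, covariateFile = NULL, ...) {
    as.BGData(x, alternatePhenotypeFile = covariateFile, ...)
}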

'mc.cores' > 1 is not supported on Windows

I'm running RStudio on Windows and have hit an error at the Summarizing Data step:

Summarising data

Use chunkedApply to count missing values (among others):

countNAs <- chunkedApply(X = geno(bg), MARGIN = 2, FUN = function(x) sum(is.na(x)))
Error in mclapply(X = seq_len(nChunks), FUN = chunkApply, ..., mc.cores = nCores) :
'mc.cores' > 1 is not supported on Windows
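
On Windows, mclapply cannot fork, so a likely workaround is to run single-core (assuming the nCores argument shown in the traceback is exposed by chunkedApply):

# Single-core fallback on Windows: avoids fork-based mclapply parallelism
countNAs <- chunkedApply(X = geno(bg), MARGIN = 2, FUN = function(x) sum(is.na(x)), nCores = 1)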

Fix signature of rayOLS

rayOLS should have a signature that starts with x so as not to confuse apply and friends. I don't think this is causing an issue right now because we explicitly pass y, which takes y out of the rotation, but it may come back to bite us in the future.
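
A sketch of the suggested signature (illustrative body only, not the actual implementation):

# x comes first so apply-style callers that pass the column positionally
# do not collide with y
rayOLS <- function(x, y) {
    # illustrative single-marker OLS; the real rayOLS computes more than this
    fit <- lm(y ~ x)
    coef(summary(fit))["x", , drop = FALSE]
}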

Implement show method for mmMatrix

This is the method that is run when just the variable name is typed into the console. For larger matrices, printing the whole thing can be quite annoying, so I think just printing the dimensions and number of chunks should be enough.
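
A minimal sketch (assuming mmMatrix is an S4 class; nChunks() is a hypothetical accessor):

# Print only the dimensions and the number of chunks instead of the data
setMethod("show", "mmMatrix", function(object) {
    cat("mmMatrix:", nrow(object), "x", ncol(object), "in", nChunks(object), "chunks\n")
})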

getG.symDMatrix: blockSize does not use all individuals

library(BGLR)
data(mice)
X <- mice.X
rownames(X) <- paste0("ID_", 1:nrow(X))
G3 <- getG.symDMatrix(X, scaleCol = TRUE, centerCol = TRUE, folder = "tmp", blockSize = 300, vmode = "double")
dim(G3) # returns 1500 1500 but should be 1814 1814
        # (1500 = 5 * 300, so individuals that do not fill a complete block appear to be dropped)

Problem with j in getG

Hi,

I'm using a genomic data set with 180K SNPs to create G matrices. I want to exclude around 3K of these SNPs from the calculation, so I used the j argument to include all other columns than these 3K. This gave a wildly different answer than when not excluding columns, with generally much larger entries than what we should be seeing.

This problem persisted even when I excluded just a single randomly selected SNP column (again, out of 180K SNPs, so this should not have a large effect, right?). This also shows that the problem is not specific to the columns I wanted to exclude, but to excluding columns with j at all.
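
A sketch of the comparison (X stands for the genotype matrix; dropCols is a hypothetical index of the excluded column):

# Excluding a single column out of 180K via j should barely change G,
# but reportedly it does
dropCols <- sample(ncol(X), 1)
G_full <- getG(X)
G_sub <- getG(X, j = setdiff(seq_len(ncol(X)), dropCols))
range(G_full) # expected scale
range(G_sub)  # reportedly much larger entries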

Luckily I was able to get around this issue by instead making a new BGData object where the unwanted SNP columns were not included in the first place, and then not specifying j in getG. This gave the expected/right answer, so there seems to be some problem with using j.

Thank you for the package, it is very helpful.

use gpuR

For computations of G matrices, gpuR can give us better performance than the current multi-core approach. I suggest adding a function crossprods_gpu(); see the template below.

Problems to consider:
- Will we require gpuR? For now, while we test it and learn, we may not want to require it.
- Detecting whether a GPU is available; if not, throw an error.
- Modifying getG by adding an argument useGPU; if TRUE, we call crossprods_gpu, which automatically overrides nCores.

Gustavo

library(gpuR)

crossprods_gpu <- function(x, y = NULL, use_tcrossprod = FALSE) { #*# remove some args
    dx <- dim(x)
    if (!is.null(y)) {
        y <- as.matrix(y)
        dy <- dim(y)
        if (use_tcrossprod) {
            if (dx[2L] != dy[2L]) {
                stop("Error in crossprods_gpu: non-conformable arguments.")
            }
        } else {
            if (dx[1L] != dy[1L]) {
                stop("Error in crossprods_gpu: non-conformable arguments.")
            }
        }
    }
    # Move the inputs to the GPU as single-precision matrices
    x <- gpuMatrix(x, type = "float")
    if (!is.null(y)) {
        y <- gpuMatrix(y, type = "float")
    }
    # Dispatch explicitly on y being NULL so the single-matrix methods are used
    if (use_tcrossprod) {
        Xy <- if (is.null(y)) tcrossprod(x) else tcrossprod(x, y)
    } else {
        Xy <- if (is.null(y)) crossprod(x) else crossprod(x, y)
    }
    return(Xy)
}
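
Example usage, mirroring the CPU path G = tcrossprod(scale(X)) (assumes gpuR can find an OpenCL device and provides the single-argument tcrossprod method for gpuMatrix):

X <- matrix(rnorm(1000 * 200), nrow = 1000)
G <- crossprods_gpu(scale(X), use_tcrossprod = TRUE)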

GWAS: attempt to set 'colnames' on an object with less than two dimensions

I'm attempting to run a GWAS using BGData. I first went through the tutorial in the README with test data and it worked properly. I'm now trying to run my data and am getting the following error.

> gwas <- GWAS(formula = my_phenotype ~ 1, data = bg)
Error in `colnames<-`(`*tmp*`, value = colnames(data@geno)[j]) : 
  attempt to set 'colnames' on an object with less than two dimensions
In addition: Warning message:
In parallel::mclapply(X = seq_len(nChunks), FUN = chunkApply, ...,  :
  all scheduled cores encountered errors in user code

The only thing I did differently from the tutorial was that I loaded the entire genome as a single .bed file, rather than loading each chromosome separately and then combining them using ColumnLinkedMatrix. I don't know a good way to load and combine chromosomes/scaffolds in my case, since there are over 1000 of them in the reference genome I'm working with (but see the sketch at the end of this issue). I ran the following:

wg <- BEDMatrix("my_filepath.bed")
bg <- as.BGData(wg, alternatePhenotypeFile = "my_phenodata")

...and I can't see what the problem with the "geno" object is. It appears to have two dimensions, contrary to what the error from the GWAS function says.

> bg
An object of class "BGData"
Slot "geno":
BEDMatrix: 882 x 4428726 [my_filepath.bed]

Thanks for the help getting this to run with my data.
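
For what it's worth, a sketch of combining many per-chromosome .bed files without listing them by hand (the directory layout and file pattern are assumptions):

library(BEDMatrix)
library(LinkedMatrix)

# One .bed file per chromosome/scaffold in a single directory
bedFiles <- list.files("my_dir", pattern = "\\.bed$", full.names = TRUE)
wg <- do.call(ColumnLinkedMatrix, lapply(bedFiles, BEDMatrix))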

Add option to transpose input data in readPED

Some input files have the markers in the rows and the individuals in the columns. A simple transpose parameter that defaults to FALSE and some logic that writes rows into columns should do the trick.

Handle character data

ff does not handle character data. We discussed recoding, but it would slow things down. Instead, we decided to throw an error that suggests the respective PLINK command: plink --file ped.txt --recodeA?
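
The error might simply point the user at the PLINK recoding step, e.g. (sketch):

stop("character data is not supported by ff; recode the file first, e.g.: plink --file ped --recodeA")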

Error in .local(.Object, ...) : No such device

Hi,

I am getting this error and cannot figure out why it is happening:

Extracting number of samples and rownames from FAM file...
Extracting number of variants and colnames from BIM file...
Error in .local(.Object, ...) : No such device

Obviously it can find the .fam and .bim files, and the .bed file is there too, but I don't see why I get this error. I would be thankful if you could help.

Thanks,
Kayhan

getG.symDMatrix: dimnames[[i]] is neither NULL nor matches dim[i]

This happens when I run getG.symDMatrix on a larger dataset:

 Working pair 1-1 (0% 2274.482 seconds).
Error in `dimnames<-.ff_array`(`*tmp*`, value = dn) : 
  dimnames[[i]] is neither NULL nor matches dim[i]

Debugging information:

> dim(bg@geno)
[1]  13113 841820
> head(rownames(bg@geno))
[1] "0_230853" "0_147875" "0_373389" "0_354405" "0_500565" "0_383668"
> head(colnames(bg@geno))
[1] "SNP_A-4256704_C" "SNP_A-1964387_A" "SNP_A-1965574_A" "SNP_A-1968937_G"
[5] "SNP_A-1968946_C" "SNP_A-4212582_C"

getGi versus getGij (problems in getGi)

These examples illustrate a problem that we need to fix in getGi (note, getGij works fine)

library(BGData)
library(BGLR)
data(mice)
X <- mice.X
scales <- apply(X = X, MARGIN = 2, FUN = sd)
centers <- apply(X = X, MARGIN = 2, FUN = mean) # was `means`, but it is used as `centers` below

## Computing G using tcrossprod
G0 <- tcrossprod(scale(X))
## Computing G using getG with scales and centers computed internally
G1 <- getG(X, scaleG = FALSE, scaleCol = TRUE)
all.equal(G0, G1)
## Computing G using getG with user-provided scales and centers
G2 <- getG(X, scaleG = FALSE, scaleCol = TRUE, i = 1:nrow(X), centers = centers, scales = scales)
all.equal(G0, G2)
## Computing G using i and i2
G12 <- getG(X, scaleG = FALSE, scaleCol = TRUE, centers = centers, scales = scales, i = 11:20, i2 = 11:20)
all.equal(as.vector(G12), as.vector(G0[rownames(G12), colnames(G12)]))
## Computing G using i only
G12 <- getG(X, scaleG = FALSE, scaleCol = TRUE, centers = centers, scales = scales, i = 11:20)
all.equal(as.vector(G12), as.vector(G0[rownames(G12), colnames(G12)]))

getG.symDMatrix: Support minVar

getG filters out constant columns based on the minVar parameter; getG.symDMatrix doesn't. If the X matrix contains constant columns, the G matrices created by getG.symDMatrix will be NaN, while the ones created by getG will be fine.

Thanks to @yrubio for reporting the problem.
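
A sketch of the kind of filter getG applies and getG.symDMatrix could adopt (names and threshold illustrative):

minVar <- 1e-5 # illustrative threshold
# Drop (near-)constant columns before computing the block products
colVariances <- apply(X = X, MARGIN = 2, FUN = var, na.rm = TRUE)
keep <- which(colVariances > minVar)
X <- X[, keep, drop = FALSE]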

Some PED files use both Family ID and Subject ID to uniquely identify individuals

A single parameter idCol might not be enough: "The combination of family and individual ID should uniquely identify a person." (http://pngu.mgh.harvard.edu/~purcell/plink/data.shtml)

The following PED file will break @pheno because of duplicates if idCol = 2 (the current default):

1 1 0 0 1  0    G G    2 2    C C
1 2 0 0 1  0    A A    0 0    A C
1 3 1 2 1  2    0 0    1 2    A C
2 1 0 0 1  0    A A    2 2    0 0
2 2 0 0 1  2    A A    2 2    0 0
2 3 1 2 1  2    A A    2 2    A A

Quick fix: allow vector as idCol?
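
A sketch of that quick fix (makeIDs is a hypothetical helper):

# Combine multiple ID columns (e.g. FID and IID) into one unique key
makeIDs <- function(ped, idCol) {
    if (length(idCol) > 1L) {
        do.call(paste, c(ped[idCol], sep = "_"))
    } else {
        as.character(ped[[idCol]])
    }
}
# usage: rownames(pheno) <- makeIDs(ped, idCol = c(1, 2))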

Drop .RData extension of BGData object

Over in #5 I mentioned that we need the storage location of an mmMatrix object to add more chunks. This could be done using a parameter of *bind, or as a property of the mmMatrix object itself.

I think the latter is more convenient if we are planning to append to the object instead of creating a new one. I could imagine some attribute that is set when the mmMatrix is created or loaded. The problem is that the object can currently be loaded using the load function without us having a chance to interfere. In fact, if we rely on the parameter, this will have a negative impact on portability (paths could be wrong).

I think we have discussed this before, but I would like to hide the fact that the BGData object is saved using the save command, in favor of our loadBGData and load2 functions. We would need a dedicated save function as well (one that strips the attribute before saving, for instance).
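
A sketch of such a pair (function and attribute names hypothetical):

# Strip the storage location before saving so stale paths never travel
saveBGData <- function(object, file) {
    attr(object, "path") <- NULL
    save(object, file = file)
}

# Restore the path at load time, based on where the file actually is
loadBGData <- function(file) {
    name <- load(file) # loads into this function's environment
    object <- get(name)
    attr(object, "path") <- normalizePath(dirname(file))
    object
}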

Support phenotype files

To support more than one phenotype. Used with --pheno in PLINK. The file does not depend on the order of the PED file; rather, it should be merged by FID and IID, with missing values if a value for an entry in the PED file is not found. There is also support for headers, but they have to start with FID and IID.
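
The merge described here maps naturally onto base R's merge (a sketch; column names follow the PLINK convention):

pedSamples <- data.frame(FID = c(1, 1, 2), IID = c(1, 2, 1))
altPheno <- data.frame(FID = 1, IID = 2, bmi = 27.3)
# Left-join by FID + IID; PED entries without a match get NA phenotypes
pheno <- merge(pedSamples, altPheno, by = c("FID", "IID"), all.x = TRUE)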

nChunks, nChunks2, nTasks, mc.cores, chunkSize

We need to consolidate this.

Clearly we need one keyword to define the number of columns brought to RAM at a time, a different term to define how many cores are used to process each chunk, and probably another term to define the block size in symDMatrix.

Here is a proposal (see the sketch after the list):

  • nChunks: the number of chunks the columns are split into before being brought to RAM.
  • chunkSize: also defines how many columns are brought to RAM at a time; I suggest chunkSize overrides nChunks.
  • nTasks: the number of tasks used to process one chunk.
  • mc.cores: I suggest setting it by default to min(detectCores(), nTasks).
  • blockSize: the number of columns/rows of a block in a symDMatrix.
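
A sketch of how the proposed terms might interact in one signature (hypothetical, not the current API):

chunkedApply <- function(X, MARGIN, FUN, ...,
                         nChunks = 10L, # number of chunks the data is split into
                         chunkSize = NULL, # overrides nChunks when given
                         nTasks = 1L, # tasks used to process one chunk
                         mc.cores = min(parallel::detectCores(), nTasks)) {
    if (!is.null(chunkSize)) {
        nChunks <- ceiling(ncol(X) / chunkSize)
    }
    # placeholder return for the sketch: the resolved settings
    list(nChunks = nChunks, nTasks = nTasks, mc.cores = mc.cores)
}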

Treat dataType of readPED as a string

For now we could support a subset of the types allowed in R, say integer, double, and character; character will only be available in readPED.matrix for now.

Add Y parameter to getG

Currently i2 is taken from X, assuming that all samples are in X. For distributing data to nodes, this can be inefficient. We need to add a parameter Y: if Y is NULL, i2 will be taken from X as before, otherwise from Y.
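
Sketched (getG_sketch is a hypothetical stand-in; the real getG also handles chunking, centering, and scaling options):

# When Y is provided, the second block of rows comes from Y instead of X
getG_sketch <- function(X, Y = NULL, i = seq_len(nrow(X)),
                        i2 = seq_len(nrow(if (is.null(Y)) X else Y))) {
    B <- if (is.null(Y)) X else Y
    tcrossprod(scale(X[i, , drop = FALSE]), scale(B[i2, , drop = FALSE]))
}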

write.plink?

Thanks for implementing this package!

Is there already a workflow to write merged or subset data back to the file system in plink binary format?
