hansenlab / bsseq Goto Github PK

View Code? Open in Web Editor NEW

35.0 35.0 25.0 49.93 MB

Devel repository for bsseq

R 96.23% C++ 2.31% C 0.15% TeX 1.31%

bsseq's Introduction

hansenlab

We work on statistical genomics and application in biology.

Website
[Members](htps://www.hansenlab.org/people
Publications

bsseq's People

Stargazers

Watchers

bsseq's Issues

Use same permutation method for t-test and f-test

The new permutation code for f-tests is more general than that for t-test; should be easier to maintain by switching t-test permutations to use the same underlying code.

pData without row names returns DelayedMatrix error using BSmooth

I got stuck on this error for a while. Is there any way to check if row names are present before preprocessing so you avoid that step and the confusing error message?

library(bsseq)
 M <- matrix(0:8, 3, 3)
 Cov <- matrix(1:9, 3, 3)
 BStmp <- BSseq(chr = c("chr1", "chrX", "chr1"), pos = 1:3,
                 M = M, Cov = Cov, sampleNames = c("A1","A2", "B"))
pData(BStmp) <- data.frame(id = c("A1","A2", "B"), trt=c("A", "A", "B"))
BSmooth(BStmp)
# [BSmooth] preprocessing ... done in 0.1 sec
# [BSmooth] smoothing by 'sample' (mc.cores = 1, mc.preschedule = FALSE)
# Error: Could not construct DelayedMatrix from 'coef'

Adding row.names works.

pData(BStmp) <- data.frame(id = c("A1","A2", "B"), trt=c("A", "A", "B"), row.names=c("A1","A2", "B"))

Look at smoothing bug in capture data

See email from Fayaz

Move to working with clusters

Currently, most of my work is done with maxGap = 10^8 resulting in 1 chromosome = 1 cluster. To make the computations of more equal size, we should experiment with decreasing maxGap to something like 10^5. We need to be careful about this, since we would prefer this clustering to happen only based on the genome and not on the coverage data per default (so that each dataset in the same organism gets the same clustering).

Add Travis-CI support

I'm working on this. Email from Kasper:

What I would really need is for any pull request to be checked up against Bioc-devel with Bioc-devel packages and the right version of R for Bioc-devel. And then I would want to be able for the submitter and myself and possibly others to see the result of R CMD check on the pull request.

On disk support

We need to have on-disk support in bsseq. We could

use the NDF5 work currently done by Herve.
use ff
use bigmemory
something else

Need to think about matrix size limitations and also about computational / parallel performance (say reading data on a CpG cluster into memory one at a time).

This issue is tightly link to the question of size limitations (#20).

Problem in collapseBSseq

I updated bsseq (and DelayedMatrix) and collapseBSseq, which was working as expected prior to this, crashes now, giving me a connection error (Error in serialize(data, node$con) / failed to stop ‘SOCKcluster’ cluster), which corresponds to my RAM reaching the maximum (64 Go).

My SessionInfo():

[R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)

Matrix products: default

locale:
[1] LC_COLLATE=French_France.1252  LC_CTYPE=French_France.1252    LC_MONETARY=French_France.1252 LC_NUMERIC=C                  
[5] LC_TIME=French_France.1252    

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] data.table_1.10.4-3        gtools_3.5.0               bsseq_1.15.2               SummarizedExperiment_1.8.1 DelayedArray_0.5.27       
 [6] BiocParallel_1.12.0        matrixStats_0.53.1         Biobase_2.38.0             GenomicRanges_1.30.3       GenomeInfoDb_1.14.0       
[11] IRanges_2.12.0             S4Vectors_0.16.0           BiocGenerics_0.25.3       

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.16           XVector_0.18.0         zlibbioc_1.24.0        munsell_0.4.3          colorspace_1.3-2       lattice_0.20-35       
 [7] plyr_1.8.4             tools_3.4.4            grid_3.4.4             snow_0.4-2             R.oo_1.21.0            permute_0.9-4         
[13] Matrix_1.2-14          GenomeInfoDbData_1.0.0 R.utils_2.6.0          bitops_1.0-6           RCurl_1.95-4.10        limma_3.34.9          
[19] compiler_3.4.4         R.methodsS3_1.7.1      scales_0.5.0           locfit_1.5-9.1   `

Thanks

Move vignette to Rmd

Delay this until we are actually going to update vignette.

Clean up R CMD check

* checking dependencies in R code ... NOTE
Namespace in Imports field not imported from: 'HDF5Array'
  All declared Imports should be used.

Re-visit data import options

readr vs data.table. data.table always fastest on plaintext, but not always on gzipped. In-memory gzip support of readr is attractive.

See #1 (comment) and timings in #4 (done on laptop with non-SSD, if i recall correctly)

make dmrFinder support fisherTests

This could be made to work pretty easily, I think.

Speed fluctuations in `getCoverage` before and after recreating DelayedArrays

I am experiencing some behavior I don't quite understand with bsseq and DelayedArray. I was trying to troubleshoot a problem I was having with slower than expected operations on bsseq objects such as getCoverage, and I found that if I recreated the DelayedArrays within the bsseq object the issue went away (see minimal working example below). Is this expected?

require(bsseq)
require(BiocGenerics)

# Read some example data from the bsseq package:
infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz",
                      package = 'bsseq')

# Use the data multiple times and pretend it is different samples:
cpg <- BiocGenerics::combine(
  bsseq::read.bismark(files = infile, sampleNames = "c1a", strandCollapse = F),
  bsseq::read.bismark(files = infile, sampleNames = "c1b", strandCollapse = F),
  bsseq::read.bismark(files = infile, sampleNames = "c1c", strandCollapse = F),
  bsseq::read.bismark(files = infile, sampleNames = "c1d", strandCollapse = F),
  bsseq::read.bismark(files = infile, sampleNames = "c2a", strandCollapse = F),
  bsseq::read.bismark(files = infile, sampleNames = "c2b", strandCollapse = F),
  bsseq::read.bismark(files = infile, sampleNames = "c2c", strandCollapse = F),
  bsseq::read.bismark(files = infile, sampleNames = "c2d", strandCollapse = F)
)

# recreate M and Cov matrices and place in new bsseq object
M <- getCoverage(cpg, type="M")
M <- DelayedArray(as.matrix(M))
Cov <- getCoverage(cpg, type="Cov")
Cov <- DelayedArray(as.matrix(Cov))
cpg2 <- BSseq(M = M,
              Cov = Cov,
              chr = as.character(seqnames(cpg)),
              pos = start(cpg),
              sampleNames = sampleNames(cpg))

# compare run times of `getCoverage`
system.time(getCoverage(cpg))
   user  system elapsed 
  1.509   0.170   1.915 
system.time(getCoverage(cpg2))
   user  system elapsed 
  0.011   0.000   0.013 


library(BiocInstaller)
biocValid()

* sessionInfo()

R Under development (unstable) (2018-02-18 r74266)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4

Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets 
[8] methods   base     

other attached packages:
 [1] BiocInstaller_1.31.1        bsseq_1.17.0               
 [3] SummarizedExperiment_1.11.3 DelayedArray_0.7.2         
 [5] BiocParallel_1.15.3         matrixStats_0.53.1         
 [7] Biobase_2.41.0              GenomicRanges_1.33.5       
 [9] GenomeInfoDb_1.17.1         IRanges_2.15.13            
[11] S4Vectors_0.19.5            BiocGenerics_0.27.0        

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.17             XVector_0.21.1           zlibbioc_1.27.0         
 [4] munsell_0.4.3            colorspace_1.3-2         lattice_0.20-35         
 [7] plyr_1.8.4               tools_3.5.0              DelayedMatrixStats_1.3.0
[10] grid_3.5.0               rhdf5_2.25.0             data.table_1.11.2       
[13] R.oo_1.22.0              HDF5Array_1.9.0          gtools_3.5.0            
[16] permute_0.9-4            Matrix_1.2-14            GenomeInfoDbData_1.1.0  
[19] Rhdf5lib_1.3.1           R.utils_2.6.0            bitops_1.0-6            
[22] RCurl_1.95-4.10          limma_3.37.1             compiler_3.5.0          
[25] R.methodsS3_1.7.1        scales_0.5.0             locfit_1.5-9.1          

Library path directories:
  /Users/keegankorthauer/Library/R/3.5/library
  /Library/Frameworks/R.framework/Versions/3.5/Resources/library 

* Packages too new for Bioconductor version '3.8'

       Version LibPath                                       
dmrseq "1.1.6" "/Users/keegankorthauer/Library/R/3.5/library"
FDRreg "0.2-1" "/Users/keegankorthauer/Library/R/3.5/library"

downgrade with biocLite(c("dmrseq", "FDRreg"))

Error: 2 package(s) too new

better error message for small contigs

We need to catch the dreaded

[BSmooth.tstat] preprocessing ... done in 261.8 sec
[BSmooth.tstat] computing stats within groups ... done in 12.5 sec
[BSmooth.tstat] computing stats across groups ...
Error in approxfun(xx, yy) :
  need at least two non-NA values to interpolate

and supply a better error message.

PCA helper

Compute like in Brain paper

DMCs and DMRs of large numbers of biological replicates

BSseq, is it appropriate for RRBS data with large numbers of biological replicates ?
Such as 100 biological replicates of each group for detecting DMCs and DMRs.
Or which method should I choose for large numbers of biological replicates ?

BSseq constructor extremely slow

I am trying to construct a BSseq object from four data frames with 18519004, 15324125, 17354678 and 17382658 rows respectively (mouse, whole genome CpG methylation.) So far the constructor has been running for 30 hours on a CentOS 7 machine with 8 cores and 64GB of memory. Is this expected?

I construct the BSseq object with the following code:

library(tidyverse)
library(dmrseq)
library(bsseq)

make_bs <- function(df, sampleName) {
  M <- matrix(df$M, ncol = 1)
  colnames(M) <- sampleName
  Cov <- matrix(df$Cov, ncol = 1)
  colnames(Cov) <- sampleName
  BSseq(chr = df$chr, 
        pos = df$start,
        Cov = Cov,
        M = M,
        sampleNames = sampleName)
}

bs <- make_bs(df1, "1") %>%
  bsseq::combine(make_bs(df2, "2")) %>%
  bsseq::combine(make_bs(df3, "3")) %>%
  bsseq::combine(make_bs(df4, "4"))

Here's my sessionInfo:

> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS: /stornext/System/data/apps/R/R-3.5.0/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/apps/R/R-3.5.0/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] dmrseq_1.0.0                bsseq_1.16.0
 [3] SummarizedExperiment_1.10.1 DelayedArray_0.6.0
 [5] BiocParallel_1.14.1         matrixStats_0.53.1
 [7] Biobase_2.40.0              GenomicRanges_1.32.3
 [9] GenomeInfoDb_1.16.0         IRanges_2.14.10
[11] S4Vectors_0.18.2            BiocGenerics_0.26.0
[13] forcats_0.3.0               stringr_1.3.1
[15] dplyr_0.7.5                 purrr_0.2.5
[17] readr_1.1.1                 tidyr_0.8.1
[19] tibble_1.4.2                ggplot2_2.2.1
[21] tidyverse_1.2.1

loaded via a namespace (and not attached):
  [1] colorspace_1.4-0              XVector_0.20.0
  [3] rstudioapi_0.7                bit64_0.9-8
  [5] interactiveDisplayBase_1.18.0 AnnotationDbi_1.42.1
  [7] lubridate_1.7.4               xml2_1.2.0
  [9] splines_3.5.0                 codetools_0.2-15
 [11] R.methodsS3_1.7.1             mnormt_1.5-5
 [13] jsonlite_1.5                  Rsamtools_1.32.0
 [15] broom_0.4.4                   R.oo_1.22.0
 [17] shiny_1.1.0                   HDF5Array_1.8.0
 [19] compiler_3.5.0                httr_1.3.1
 [21] assertthat_0.2.0              Matrix_1.2-14
 [23] lazyeval_0.2.1                limma_3.36.1
 [25] cli_1.0.0                     later_0.7.3
 [27] htmltools_0.3.6               prettyunits_1.0.2
 [29] tools_3.5.0                   bindrcpp_0.2.2
 [31] gtable_0.2.0                  glue_1.2.0
 [33] GenomeInfoDbData_1.1.0        annotatr_1.6.0
 [35] reshape2_1.4.3                doRNG_1.6.6
 [37] Rcpp_0.12.17                  bumphunter_1.22.0
 [39] cellranger_1.1.0              Biostrings_2.48.0
 [41] nlme_3.1-137                  rtracklayer_1.40.2
 [43] iterators_1.0.9               DelayedMatrixStats_1.2.0
 [45] psych_1.8.4                   rvest_0.3.2
 [47] mime_0.5                      rngtools_1.3.1
 [49] gtools_3.5.0                  XML_3.98-1.11
 [51] AnnotationHub_2.12.0          zlibbioc_1.26.0
 [53] scales_0.5.0                  BSgenome_1.48.0
 [55] BiocInstaller_1.30.0          hms_0.4.2
 [57] promises_1.0.1                rhdf5_2.24.0
 [59] RColorBrewer_1.1-2            yaml_2.1.19
 [61] memoise_1.1.0                 pkgmaker_0.27
 [63] biomaRt_2.36.1                stringi_1.2.2
 [65] RSQLite_2.1.1                 foreach_1.4.6
 [67] permute_0.9-4                 GenomicFeatures_1.32.0
 [69] bibtex_0.4.2                  rlang_0.2.1
 [71] pkgconfig_2.0.1               bitops_1.0-6
 [73] lattice_0.20-35               Rhdf5lib_1.2.1
 [75] bindr_0.1.1                   GenomicAlignments_1.16.0
 [77] bit_1.1-14                    tidyselect_0.2.4
 [79] plyr_1.8.4                    magrittr_1.5
 [81] R6_2.2.2                      DBI_1.0.0
 [83] withr_2.1.2                   pillar_1.2.3
 [85] haven_1.1.1                   foreign_0.8-70
 [87] RCurl_1.95-4.10               modelr_0.1.2
 [89] crayon_1.3.4                  progress_1.1.2
 [91] locfit_1.5-9.1                grid_3.5.0
 [93] readxl_1.1.0                  data.table_1.11.4
 [95] blob_1.1.1                    digest_0.6.15
 [97] xtable_1.8-3                  httpuv_1.4.3
 [99] regioneR_1.12.0               outliers_0.14
[101] R.utils_2.6.0                 munsell_0.4.3
[103] registry_0.5

Problem in function BSseq

In BSseq constructor function BSseq, when checking for gr widths, the closing bracket is misplaced, it should be:
if(any(width(gr) != 1)) (instead of if(any(width(gr)) != 1) )

Check object is sorted before processing

It's clear from Lindsay's experience that we should probably check the BSseq object is sorted before smoothing (and likely before doing other operations too). I propose the following:

~~Request/contribute patch to add is.unsorted,RangedSummarizedExperiment-method to SummarizedExperiment - we then get it via inheritance for BSseq objects.~~ Turns out there already an is.unsorted,RangedSummarizedExperiment-method
Add is.unsorted(BSseq) where necessary to bsseq codebase.
If the object is unsorted then eitehr sort the object (with message/warning) or simply throw an error.

One complication is that sort() for GRanges-based objects uses the seqlevel order from the Seqinfo slot - if this is "wrong" (as in Lindsay's case where chr3 was first) then the sort is "wrong". However, the relative order of positions within a seqlevel will still be correct, which I think is enough for what we want. In general, defining and checking a canonical order of seqlevels for an arbitrary sample with potentially limited metadata is very difficult, so I don't think we want to do anything in that space.

error with small ns/h sizes

Hi,

I wanted to use bsmooth to generate only the methylation estimates per CpG for some other purpose than DMR calling. I want to smooth over windows of size around 500bp and I'm also interested in LMRs (lowly methylated regions). LMRs usually only contains a small number of CpGs and I don't want to smooth them "out". Hence I tried to use bsmooth with parameters like h=250 (it was written somewhere that h defines half the actual genome window size) and ns=3..5.

However, I often get errors like this:

bs_fit = BSmooth(bs, ns = 7, h = 250, maxGap=10^7, parallelBy = "chromosome", mc.cores = 10, mc.preschedule = FALSE)

[BSmooth] preprocessing ... done in 0.9 sec
[BSmooth] smoothing by 'chromosome' (mc.cores = 10, mc.preschedule = FALSE)
Fehler in BSmooth(bs, ns = 7, h = 250, maxGap = 10^7, parallelBy = "chromosome", :
BSmooth encountered smoothing errors
Zusätzlich: Warnmeldung:
In mclapply(1:length(clusterIdx), function(ii) { :
1 function calls resulted in an error

when I reduce the maxGap it runs through in some cases. Why is the maxGap parameter causing these errors? maxGap shouldn't play any role in finding a window of at least 3...5 CpGs, or?

Thanks a lot!
steffen

How to use bsseq to find differentially methylated promoter?

Hi, I learned that bsseq can be used to find DMR in specific regions, like promoter. I wonder that where can I find the protocol about how to use bsseq find differentially methylated promoter.

Clarify how cutoff is computed in vignette

See https://support.bioconductor.org/p/78227/#78249

Decimal point shifts for methylation values when using BSseq() constructor function

Hi,

I was attempting to use the BSseq constructor function to create a BSseq object for a test in my package and I ran into some weird behavior. The BSseq function is moving the decimal point of the input methylation values (dividing the methylation values by 10,000). The coverage values are correct however. See the example below:

methylVals = as.matrix(c(.1, .9, .4, .5))
covVals = as.matrix(rep(10000, 4))
testBSseq = BSseq(M =methylVals, Cov = covVals, pos = c(1000, 2000, 3000, 4000), chr=rep("chr1", 4))
methylVals
getMeth(testBSseq, type = "raw")
covVals
getCoverage(testBSseq)

On my system, the results from getMeth are:
DelayedMatrix object of 4 x 1 doubles:
V1
[1,] 1e-05
[2,] 9e-05
[3,] 4e-05
[4,] 5e-05

I wondered if this could be a DelayedArray thing but the DelayedArray constructor seems to work.

 DelayedArray::DelayedArray(methylVals)

I found this behavior on both Linux and Windows. In case it helps, here is my session info:

R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS

Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0

locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] grDevices datasets parallel stats graphics utils stats4 methods base

other attached packages:
[1] bsseq_1.12.2 SummarizedExperiment_1.6.3 Biobase_2.36.2 DelayedArray_0.2.4
[5] matrixStats_0.52.2 BiocInstaller_1.26.0 GenomicRanges_1.28.4 GenomeInfoDb_1.12.1
[9] IRanges_2.10.2 S4Vectors_0.14.2 BiocGenerics_0.22.0 ggplot2_2.2.1
[13] data.table_1.10.4

loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 XVector_0.16.0 zlibbioc_1.22.0 munsell_0.4.3 lattice_0.20-35
[6] colorspace_1.3-2 rlang_0.1.1 plyr_1.8.4 tools_3.4.0 grid_3.4.0
[11] gtable_0.2.0 R.oo_1.21.0 gtools_3.5.0 lazyeval_0.2.0 permute_0.9-4
[16] tibble_1.3.3 Matrix_1.2-8 GenomeInfoDbData_0.99.0 R.utils_2.5.0 bitops_1.0-6
[21] RCurl_1.95-4.8 limma_3.32.5 compiler_3.4.0 R.methodsS3_1.7.1 scales_0.4.1
[26] locfit_1.5-9.1

It was nice to meet you all at Bioc2017. Let me know if any additional information would help!

Best,
John Lawson

Clarify 0-based vs. 1-based co-ordinates in read.bismark documentaiton

A note to myself to follow-up https://support.bioconductor.org/p/75502/. Like Jim, I don't find the message ambiguous, but I'm happy to incorporate any suggestions.

Helper to estimate bisulfite-conversion efficiency

computeComversion(BSseq, seqnames = "$LAMBDA")

where $LAMBDA is the super long name we use for lambda. Then users can change at will

Multiple testing correction

Hello,

Is there any multiple testing correction implemented in the bsseq pipeline?

Thank you.

Better bismark import documentation

Perhaps add section in vignette.

Optimization: work with the identity transformation.

Currently, the smoothed values are stored on the logit scale. This results in a transformation (and memory copying) every time we do getMeth(type = "smooth"). We do this to support the use of confidence intervals from the smoothing which are based on x +/- y on the logit scale (which is not symmetric on the [0,1] methylation scale).However, we use this very infrequent. We support the use of a transformation slot which is any function. So we need a function doing the transformation once and for all, and then we need to write code/profile to make sure no copying happens when we run getMeth().

Size of subsetted BSseq objects from chrSelectBSseq()

Hi,

Is it expected that a small subset of a BSseq object obtained by chrSelectBSseq() be close to the size of the full object? I noticed this issue after obtaining much larger file sizes than expected of BSseq object subsets saved using saveRDS, but I'm pasting below a reprex that demonstrates in-memory size. I also compare to what the size would be with a manual subset by index, and recreation of the object with realize() applied to the DelayedArray components.

Is there a simple way to obtain a subset that realizes the DelayedArray components so that the object size is proportionally reduced, that doesn't require manually recreating the object as below?

Best,
Keegan

library(bsseq)
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#> 
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#> 
#>     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#>     clusterExport, clusterMap, parApply, parCapply, parLapply,
#>     parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#> 
#>     IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#> 
#>     anyDuplicated, append, as.data.frame, basename, cbind,
#>     colMeans, colnames, colSums, dirname, do.call, duplicated,
#>     eval, evalq, Filter, Find, get, grep, grepl, intersect,
#>     is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
#>     paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
#>     Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
#>     table, tapply, union, unique, unsplit, which, which.max,
#>     which.min
#> Loading required package: GenomicRanges
#> Warning: package 'GenomicRanges' was built under R version 3.5.1
#> Loading required package: stats4
#> Loading required package: S4Vectors
#> Warning: package 'S4Vectors' was built under R version 3.5.1
#> 
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#> 
#>     expand.grid
#> Loading required package: IRanges
#> Warning: package 'IRanges' was built under R version 3.5.1
#> Loading required package: GenomeInfoDb
#> Loading required package: SummarizedExperiment
#> Warning: package 'SummarizedExperiment' was built under R version 3.5.1
#> Loading required package: Biobase
#> Warning: package 'Biobase' was built under R version 3.5.1
#> Welcome to Bioconductor
#> 
#>     Vignettes contain introductory material; view with
#>     'browseVignettes()'. To cite Bioconductor, see
#>     'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: DelayedArray
#> Warning: package 'DelayedArray' was built under R version 3.5.1
#> Loading required package: matrixStats
#> 
#> Attaching package: 'matrixStats'
#> The following objects are masked from 'package:Biobase':
#> 
#>     anyMissing, rowMedians
#> Loading required package: BiocParallel
#> Warning: package 'BiocParallel' was built under R version 3.5.1
#> 
#> Attaching package: 'DelayedArray'
#> The following objects are masked from 'package:matrixStats':
#> 
#>     colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
#> The following objects are masked from 'package:base':
#> 
#>     aperm, apply
library(pryr)

data(BS.chr22)

# artificially create multiple chromosomes
BS.chr22 <- BSseq(M = getCoverage(BS.chr22, type="M"),
                  Cov = getCoverage(BS.chr22, type="Cov"),
                  pos = start(BS.chr22),
                  chr = c(rep("1", 100), rep("2", 100), rep("3", nrow(BS.chr22)-200)),
                  pData = pData(BS.chr22),
                  sampleNames = sampleNames(BS.chr22))

# size of full object
object_size(BS.chr22)
#> 20.4 MB

# subset of object
BS.chr22_subset <- chrSelectBSseq(BS.chr22, "1")

# size of subset
object_size(BS.chr22_subset)
#> 16.5 MB

# reconstruct subset
BS.chr22_subset <- BSseq(M = realize(getCoverage(BS.chr22, type="M")[1:100,]),
                         Cov = realize(getCoverage(BS.chr22, type="Cov")[1:100,]),
                         pos = start(BS.chr22)[1:100],
                         chr = seqnames(BS.chr22)[1:100],
                         pData = pData(BS.chr22),
                         sampleNames = sampleNames(BS.chr22))

# size of reconstructed subset
object_size(BS.chr22_subset)
#> 660 kB

sessionInfo()
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#> 
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> attached base packages:
#> [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
#> [8] methods   base     
#> 
#> other attached packages:
#>  [1] pryr_0.1.4                  bsseq_1.17.0               
#>  [3] SummarizedExperiment_1.11.6 DelayedArray_0.7.37        
#>  [5] BiocParallel_1.15.11        matrixStats_0.54.0         
#>  [7] Biobase_2.41.2              GenomicRanges_1.33.13      
#>  [9] GenomeInfoDb_1.17.1         IRanges_2.15.17            
#> [11] S4Vectors_0.19.19           BiocGenerics_0.27.1        
#> 
#> loaded via a namespace (and not attached):
#>  [1] Rcpp_0.12.18             compiler_3.5.0          
#>  [3] XVector_0.21.3           R.methodsS3_1.7.1       
#>  [5] R.utils_2.7.0            bitops_1.0-6            
#>  [7] tools_3.5.0              DelayedMatrixStats_1.3.8
#>  [9] zlibbioc_1.27.0          digest_0.6.16           
#> [11] evaluate_0.11            rhdf5_2.25.9            
#> [13] lattice_0.20-35          Matrix_1.2-14           
#> [15] yaml_2.2.0               GenomeInfoDbData_1.1.0  
#> [17] stringr_1.3.1            knitr_1.20              
#> [19] gtools_3.8.1             locfit_1.5-9.1          
#> [21] rprojroot_1.3-2          grid_3.5.0              
#> [23] data.table_1.11.4        HDF5Array_1.9.15        
#> [25] rmarkdown_1.10           limma_3.37.4            
#> [27] Rhdf5lib_1.3.3           magrittr_1.5            
#> [29] codetools_0.2-15         backports_1.1.2         
#> [31] scales_1.0.0             htmltools_0.3.6         
#> [33] permute_0.9-4            colorspace_1.3-2        
#> [35] stringi_1.2.4            RCurl_1.95-4.11         
#> [37] munsell_0.5.0            R.oo_1.22.0
BiocManager::valid()
#> [1] TRUE

Created on 2018-09-07 by the reprex
package (v0.2.0).

have problem with BS

list.files(pattern="*.txt") -> report.files
read.bismark(files=report.files,sampleNames=gsub(".txt","",report.files),rmZeroCov=TRUE, strandCollapse=FALSE,fileType="cytosineReporAssuming file type is cytosineReport
Read 3547206 rows and 7 (of 7) columns from 0.107 GB file in 00:00:04
done in 9.7 secs
Read 3547206 rows and 7 (of 7) columns from 0.107 GB file in 00:00:03
done in 6.7 secs
[read.bismark] Reading file 'LZH369L_scaffold1.txt' ... done in 6.2 secs
[read.bismark] Reading file 'LZH371L_scaffold1.txt' ... done in 6.5 secs
[read.bismark] Reading file 'LZH373L_scaffold1.txt' ... done in 6.2 secs
[read.bismark] Reading file 'MLS387L_scaffold1.txt' ... done in 6.4 secs
[read.bismark] Reading file 'MLS388L_scaffold1.txt' ... done in 6.5 secs
[read.bismark] Reading file 'MLS390L_scaffold1.txt' ... done in 6.3 secs
[read.bismark] Joining samples ... done in 19.7 secs
bsseq.data[[1]] <- c("high","low","low","low","low","high","high","high")
bsseq.data[[2]] <- c("pair2","pair1","pair1","pair1","pair1","pair2","pair2","pair2")
pData(bsseq.data) <- setNames(pData(bsseq.data),c("Type","pair"))
bsseq.data.smoothed <- BSmooth(bsseq.data, verbose=TRUE)
[BSmooth] preprocessing ... done in 2.6 sec
[BSmooth] smoothing by 'sample' (mc.cores = 1, mc.preschedule = FALSE)
[BSmooth] smoothing done in 91.9 sec
Error in colnames<-(*tmp*, value = c("HLiver_scaffold1", "LLiver_scaffold1", :
attempt to set 'colnames' on an object with less than two dimensions
I don't know the problem with this information.

read.bismark() needs updating

Looks like things have changed again with the bismark output. A few things that need to be updated in code and docs:

.cov files from recent versions of Bismark (at least v0.14.3 that I tested) do not include strand information. This contradicts the docs, but more importantly can lead to really bad behaviour by read.bismark(). Specifically, if a C is only measured on the reverse strand then its position is given by the complementary G on the forward strand - if another sample has data from the forward strand for that same CG, then this will result in the "C" having different co-ordinates in different samples yet both having * strand. In general, I think this is impossible to fix/detect by read.bismark(). I think we should strongly recommend that users use the cytosineReport files and report a warning/message if .cov files are used (as well as document this behaviour).
If retaining .cov as default fileType, then might be good idea to set default strandCollapse = FALSE since the default fileType = "cov" does not allow strandCollapse = TRUE.
It's increasingly fiddly to support all different versions of Bismark output (and the various fileType values). Perhaps we should only explicitly support some minimum version of Bismark, document as such (including the expected format) and leave the user to their own devices for older versions.

work around the trans slot

The trans slot is only used to get locfit derived confidence intervals which is not used anywhere in the code. This implies that for 99% of use cases we can do the trans transformation once at object creation and then bypass it anywhere it happens. We should have a special function in trans which makes sure we do not copy the input, ie. we do not want something like

function(x) x

because I am not sure it is not copying the input. The most secure workaround might be to replace code like

trans(object)

with

if(trans != SOMETHING) then
  trans(object)

so we don't call trans() at all.

epivizr support

Functionality of argument "rmZeroCov" in `read.bismark`

Perhaps I'm using a version of the R/bsseq package that's too old (v1.6.0, installed via Bioconductor's biocLite), but it appears that the argument rmZeroCov is not actually doing anything. I have just run the read.bismark function twice on my local machine (using rmZeroCov = TRUE and rmZeroCov = FALSE), with both returning objects of the same size. What's more the matrices returned by getCoverage(... , type = "Cov") on both of these BSseq objects include many CpG loci with values of 0. Though these are easily removed with a line or two of extra code, the function does not return any warning that loci of coverage zero are present in the matrix -- and, use of getMeth() returns NaN for all such loci. Looking at the code under branch master for read.bismark, I cannot actually find a place where the rmZeroCov argument gets used (it seems to get passed along through various functions with no actual purpose). I'd be glad to help fix this if it is actually a problem -- any clarification would be appreciated @PeteHaitch

Update read.bismark documentation

The docs have fallen behind updates made to the bismark_methylation_extractor. I think the underlying code is fine, but the docs refer to the wrong input file (should be the .cov file rather than the .bedGraph file). It'd be good to update the example data correspondingly in inst/extdata.

Ideally, I'll include some unit tests that keep pace with latest version of Bismark to automatically generate the input file, but I'd have to have a think about how to implement that.

Choosing cutoff: t-stat vs. f-stat

A reminder that we want to fix the code so that t-stat comparisons made in fstat.comparisons.pipeline() use cutoff and not cutoff^2. Need to think through how this should be implemented. This bit us when analysing Lindsay's data

DMRs analyses with bsseq comparing three different populations

Hi,

I have WGBS for three different groups of two non-model species (four individuals per group). One is a population of species A and the others are two populations of species B (representing the two extremes of the geographic distribution, 10,000 Km apart). Species A and B are very closely related. I have done a search and have seen that Bsseq is mainly used to do analyses for case-control studies (thus comparing two samples) and not to compare three groups. Could I analyse DMRs among the three populations using bsseq? How?

And a second question. I would like to compare also between species A (four individuals) and species B (eight individuals). Could I do that? (considering the different number of individuals) How?

Thanks a lot in advance.

All the best,

Begoña Martínez

DRM p-values to calculate FDR

Hi Kasper,

I have obtained a set of DMR between my two groups and I would like to estimate the FDR using the qvalue package in Bioconductor. I may have missed something during the process of analyses, but I am not able to recover the p-values associated to my DMRs. How could I get these p-values? My pipeline from the smoothing is writen below.
Many thanks in advance.

Best,
Begoña

File.data.smoothed <- BSmooth(File.data,mc.cores=7,verbose=TRUE)
File.data.tstats <- BSmooth.tstat(File.data.smoothed,group1=c("A","B","C","D"),
\ group2=c("E","F","G"),estimate.var="same",local.correct=TRUE, verbose=TRUE)
dmr.list <- dmrFinder(File.data.tstats,qcutoff=c(.005,.995))
dmrs <- subset(dmr.list, n>=10 & abs(meanDiff) >= 0.3)
write.table(x=dmrs, file="File.data.txt", sep="\t", row.names= FALSE, quote=FALSE)

Permuting sample identities paired design

I've been asked to permute sample identities to test how many DMRs emerge from our data randomly as it from a low CpG methylation genome (~0.7% genome wide). The design is paired and I have used a paired t-test for mother, I am interested in the result of condition.

condition mother 1 H 16 2 H 22 3 H 5 4 H 7 5 H 8 6 H 9 7 L 16 8 L 22 9 L 5 10 L 7 11 L 8 12 L 9

Is there a way of adapting the "permute.R" module to do this? If not how would you do this, permute condition within each pair, or randomly assign any of the twelve into one group (and break pairing)? This second method was proposed to see if the same number of DMRs emerge anyhow.

DMRs were thresholded at 4.6 -4.6, 0.1 mean difference between conditions of 0.1, and 3 CpGs required.

Thanks for your input!
Jack

Limitations on the number of samples.

Currently we have a limitation on the number of samples supported by bsseq, based on R limitations on the size of a matrix. This translates into ~85 samples with 26M CpGs. We need to address this.

SUGGESTION: More efficient code using matrixStats (>= 0.50.0)

In matrixStats (>= 0.50.0), which is now on CRAN, we introduce support for subsetted calculations that are optimized both for speed and memory usage. There are now new arguments idxs, rows and cols for most functions which allow you to very easily make use of this.

For your package, I believe the following functions could gain from this:

$bsseq$BSmooth.fstat
rowSds(allPs[, group2, drop = FALSE], na.rm = TRUE)
rowVars(allPs[, group1, drop = FALSE])
rowVars(allPs[, group2, drop = FALSE])

$bsseq$BSmooth.tstat
rowSds(allPs[, group2, drop = FALSE], na.rm = TRUE)
rowVars(allPs[, group1, drop = FALSE])
rowVars(allPs[, group2, drop = FALSE])

by updating to something like:

$bsseq$BSmooth.fstat
rowSds(allPs, cols = group2, na.rm = TRUE)
rowVars(allPs, cols = group1)
rowVars(allPs, cols = group2)

$bsseq$BSmooth.tstat
rowSds(allPs, cols = group2, na.rm = TRUE)
rowVars(allPs, cols = group1)
rowVars(allPs, cols = group2)

and adding matrixStats (>= 0.50.0) to your DESCRIPTION file.

BSseq to bigWig

Easy and useful for visualising mC data in genome browser

Add ability to open gzip'd bismark cytosine files to "read.bismark"

Would be helpful to keep file size low.

DMRs for FisherTest

Hi all,

I've performed fisherTests() for two samples since I've only one replicate for each samples. but fishertest result gives only two columns p.value and log2OR and I want to know its corresponding chromosome and position information. also I want to filter and plot fisherTest result.

How do I find DMRs ?

Thanks in advance!!
Any help appreciated!!

Support BPREDO when using BiocParallel

I'd tried implementing this for BSmooth() in the refactor branch, but hit some issues and so removed it (9392512). Long term, it'd be cool to have this functionality. I think it's achievable, but requires time that I don't have right now.

"Error in numeric(lw[2]) : invalid 'length' argument" when I smooth huge chr

I want to smooth a huge chromosome(about 196M bp, 8 samples) which contain 3411686 CpGs. When I used command: bismarkBSseq.fit.dmrs <- BSmooth(bismarkBSseq, mc.cores=4, parallelBy="sample", verbose=TRUE, ns=20, h=1000). Errors happened: Error in numeric(lw[2]) : invalid 'length' argument. In addition: Warning message: In mclapply(seq(along = sampleNames), function(sIdx) { : 8 function calls resulted in an error
It works when I truncated the size of this chromosome or changed the arguments(ns=500, h=20000). I'm new in R and I can't find argument "lw" in source code of BSmooth.
My R version is 3.4.1; Bsseq version is 1.14.0.
Any help is Appreciated!

Allow combine() to combine smoothed data from different chromosomes

Currently, if calling combine() on data from different chromosomes, the smoothed data is discarded. However, it seems perfectly reasonable to smooth it on a per-chromosome basis, so why not allow it?

Can not extract smoothed methylation data

I recently updated to R3.5 and continued using my "old" bsseq objects created under R3.4. Upon updating the objects it is no problem for me to handle the bsseq objects as usual, however when I try to extract the smoothed methylation matrix I get following error:

> getMeth(AM.BS.fit, type="smooth")
Error in register_delayed_op(q, "plogis", Rargs = list(location = location,  :
  could not find function "register_delayed_op"
> traceback()
3: getBSseq(BSseq, "trans", withDimnames)(coef)
2: getBSseq(BSseq, "trans", withDimnames)(coef)
1: getMeth(AM.BS.fit, type = "smooth")
> sessionInfo()

R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS

Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1

locale:
 [1] LC_CTYPE=C.UTF-8           LC_NUMERIC=C
 [3] LC_TIME=fr_FR.UTF-8        LC_COLLATE=C.UTF-8
 [5] LC_MONETARY=fr_FR.UTF-8    LC_MESSAGES=C.UTF-8
 [7] LC_PAPER=fr_FR.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets
[8] methods   base

other attached packages:
 [1] bsseq_1.18.0                SummarizedExperiment_1.12.0
 [3] DelayedArray_0.9.0          BiocParallel_1.16.2
 [5] matrixStats_0.54.0          Biobase_2.42.0
 [7] GenomicRanges_1.34.0        GenomeInfoDb_1.18.1
 [9] IRanges_2.16.0              S4Vectors_0.20.1
[11] BiocGenerics_0.28.0

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0               compiler_3.5.1           XVector_0.22.0
 [4] R.methodsS3_1.7.1        bitops_1.0-6             R.utils_2.7.0
 [7] DelayedMatrixStats_1.4.0 tools_3.5.1              zlibbioc_1.28.0
[10] rhdf5_2.26.0             lattice_0.20-38          BSgenome_1.50.0
[13] Matrix_1.2-15            GenomeInfoDbData_1.2.0   rtracklayer_1.42.1
[16] Biostrings_2.50.1        gtools_3.8.1             locfit_1.5-9.1
[19] grid_3.5.1               data.table_1.11.8        HDF5Array_1.10.1
[22] XML_3.98-1.16            limma_3.38.3             Rhdf5lib_1.4.2
[25] scales_1.0.0             Rsamtools_1.34.0         GenomicAlignments_1.18.0
[28] beachmat_1.4.0           permute_0.9-4            colorspace_1.3-2
[31] RCurl_1.95-4.11          munsell_0.5.0            R.oo_1.22.0

Thanks for your support.

Joschka

Problem with bsseq::combine() in bsseq_1.16.1

When trying to combine two BSseq objects with same methylation loci but different sample names, I get with the 1.16.1 version the following error:

> BS
An object of type 'BSseq' with
  68164 methylation loci
  12 samples
has not been smoothed
All assays are in-memory
> BS.avg
An object of type 'BSseq' with
  68164 methylation loci
  8 samples
has not been smoothed
All assays are in-memory
> BS.all=bsseq::combine(BS,BS.avg)
Error in .validate_names(colnames, ans_colnames, "assay colnames()", "colData rownames()") : 
  assay colnames() must be NULL or identical to colData rownames()

This definitly worked in an older BSseq version

Profile bsseq

Do a comprehensive profiling of bsseq using lineprof().

Request to remove requirement for gr to have widths 1 in BSseq constructor

Hi,

I'm a big fan of bsseq because it is exactly the data structure I need for a package I maintain (methylSig) that tests for differential methylation (at CpGs and over regions).

As of the Bioc 3.7 release it appears that a check in the BSseq constructor was added (I don't see a reference to it in the NEWS file).

    if(any(width(gr) != 1))
        stop("'gr' needs to have widths of 1")

It was incredibly convenient to be able to create BSseq objects that aggregated CpGs over regions so that there was still a unified methylation object to access and manipulate.

I would like to request that this check be removed to enable BSseq objects with ranges greater than 1bp. Of course, I understand there may be intra-package consistency reasons for the change, but I figured it was worth a shot to ask because I think there is value in BSseq objects that represent methylation over regions.

This is cross-posted on Bioc support (https://support.bioconductor.org/p/109098/)

Thanks,
Raymond

hansenlab / bsseq Goto Github PK

bsseq's Introduction

hansenlab

bsseq's People

Stargazers

Watchers

Forkers

bsseq's Issues

Recommend Projects

Recommend Topics

Recommend Org