We work on statistical genomics and application in biology.
- Website
- [Members](htps://www.hansenlab.org/people
- Publications
Devel repository for bsseq
We work on statistical genomics and application in biology.
The new permutation code for f-tests is more general than that for t-test; should be easier to maintain by switching t-test permutations to use the same underlying code.
I got stuck on this error for a while. Is there any way to check if row names are present before preprocessing so you avoid that step and the confusing error message?
library(bsseq)
M <- matrix(0:8, 3, 3)
Cov <- matrix(1:9, 3, 3)
BStmp <- BSseq(chr = c("chr1", "chrX", "chr1"), pos = 1:3,
M = M, Cov = Cov, sampleNames = c("A1","A2", "B"))
pData(BStmp) <- data.frame(id = c("A1","A2", "B"), trt=c("A", "A", "B"))
BSmooth(BStmp)
# [BSmooth] preprocessing ... done in 0.1 sec
# [BSmooth] smoothing by 'sample' (mc.cores = 1, mc.preschedule = FALSE)
# Error: Could not construct DelayedMatrix from 'coef'
Adding row.names
works.
pData(BStmp) <- data.frame(id = c("A1","A2", "B"), trt=c("A", "A", "B"), row.names=c("A1","A2", "B"))
See email from Fayaz
Currently, most of my work is done with maxGap = 10^8
resulting in 1 chromosome = 1 cluster. To make the computations of more equal size, we should experiment with decreasing maxGap
to something like 10^5. We need to be careful about this, since we would prefer this clustering to happen only based on the genome and not on the coverage data per default (so that each dataset in the same organism gets the same clustering).
I'm working on this. Email from Kasper:
What I would really need is for any pull request to be checked up against Bioc-devel with Bioc-devel packages and the right version of R for Bioc-devel. And then I would want to be able for the submitter and myself and possibly others to see the result of R CMD check on the pull request.
We need to have on-disk support in bsseq. We could
Need to think about matrix size limitations and also about computational / parallel performance (say reading data on a CpG cluster into memory one at a time).
This issue is tightly link to the question of size limitations (#20).
I updated bsseq
(and DelayedMatrix
) and collapseBSseq
, which was working as expected prior to this, crashes now, giving me a connection error (Error in serialize(data, node$con) / failed to stop ‘SOCKcluster’ cluster), which corresponds to my RAM reaching the maximum (64 Go).
My SessionInfo()
:
[R version 3.4.4 (2018-03-15)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 16299)
Matrix products: default
locale:
[1] LC_COLLATE=French_France.1252 LC_CTYPE=French_France.1252 LC_MONETARY=French_France.1252 LC_NUMERIC=C
[5] LC_TIME=French_France.1252
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4-3 gtools_3.5.0 bsseq_1.15.2 SummarizedExperiment_1.8.1 DelayedArray_0.5.27
[6] BiocParallel_1.12.0 matrixStats_0.53.1 Biobase_2.38.0 GenomicRanges_1.30.3 GenomeInfoDb_1.14.0
[11] IRanges_2.12.0 S4Vectors_0.16.0 BiocGenerics_0.25.3
loaded via a namespace (and not attached):
[1] Rcpp_0.12.16 XVector_0.18.0 zlibbioc_1.24.0 munsell_0.4.3 colorspace_1.3-2 lattice_0.20-35
[7] plyr_1.8.4 tools_3.4.4 grid_3.4.4 snow_0.4-2 R.oo_1.21.0 permute_0.9-4
[13] Matrix_1.2-14 GenomeInfoDbData_1.0.0 R.utils_2.6.0 bitops_1.0-6 RCurl_1.95-4.10 limma_3.34.9
[19] compiler_3.4.4 R.methodsS3_1.7.1 scales_0.5.0 locfit_1.5-9.1 `
Thanks
Delay this until we are actually going to update vignette.
* checking dependencies in R code ... NOTE
Namespace in Imports field not imported from: 'HDF5Array'
All declared Imports should be used.
readr vs data.table. data.table always fastest on plaintext, but not always on gzipped. In-memory gzip support of readr is attractive.
See #1 (comment) and timings in #4 (done on laptop with non-SSD, if i recall correctly)
This could be made to work pretty easily, I think.
I am experiencing some behavior I don't quite understand with bsseq
and DelayedArray
. I was trying to troubleshoot a problem I was having with slower than expected operations on bsseq
objects such as getCoverage
, and I found that if I recreated the DelayedArrays within the bsseq
object the issue went away (see minimal working example below). Is this expected?
require(bsseq)
require(BiocGenerics)
# Read some example data from the bsseq package:
infile <- system.file("extdata/test_data.fastq_bismark.bismark.cov.gz",
package = 'bsseq')
# Use the data multiple times and pretend it is different samples:
cpg <- BiocGenerics::combine(
bsseq::read.bismark(files = infile, sampleNames = "c1a", strandCollapse = F),
bsseq::read.bismark(files = infile, sampleNames = "c1b", strandCollapse = F),
bsseq::read.bismark(files = infile, sampleNames = "c1c", strandCollapse = F),
bsseq::read.bismark(files = infile, sampleNames = "c1d", strandCollapse = F),
bsseq::read.bismark(files = infile, sampleNames = "c2a", strandCollapse = F),
bsseq::read.bismark(files = infile, sampleNames = "c2b", strandCollapse = F),
bsseq::read.bismark(files = infile, sampleNames = "c2c", strandCollapse = F),
bsseq::read.bismark(files = infile, sampleNames = "c2d", strandCollapse = F)
)
# recreate M and Cov matrices and place in new bsseq object
M <- getCoverage(cpg, type="M")
M <- DelayedArray(as.matrix(M))
Cov <- getCoverage(cpg, type="Cov")
Cov <- DelayedArray(as.matrix(Cov))
cpg2 <- BSseq(M = M,
Cov = Cov,
chr = as.character(seqnames(cpg)),
pos = start(cpg),
sampleNames = sampleNames(cpg))
# compare run times of `getCoverage`
system.time(getCoverage(cpg))
user system elapsed
1.509 0.170 1.915
system.time(getCoverage(cpg2))
user system elapsed
0.011 0.000 0.013
library(BiocInstaller)
biocValid()
* sessionInfo()
R Under development (unstable) (2018-02-18 r74266)
Platform: x86_64-apple-darwin15.6.0 (64-bit)
Running under: macOS High Sierra 10.13.4
Matrix products: default
BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] BiocInstaller_1.31.1 bsseq_1.17.0
[3] SummarizedExperiment_1.11.3 DelayedArray_0.7.2
[5] BiocParallel_1.15.3 matrixStats_0.53.1
[7] Biobase_2.41.0 GenomicRanges_1.33.5
[9] GenomeInfoDb_1.17.1 IRanges_2.15.13
[11] S4Vectors_0.19.5 BiocGenerics_0.27.0
loaded via a namespace (and not attached):
[1] Rcpp_0.12.17 XVector_0.21.1 zlibbioc_1.27.0
[4] munsell_0.4.3 colorspace_1.3-2 lattice_0.20-35
[7] plyr_1.8.4 tools_3.5.0 DelayedMatrixStats_1.3.0
[10] grid_3.5.0 rhdf5_2.25.0 data.table_1.11.2
[13] R.oo_1.22.0 HDF5Array_1.9.0 gtools_3.5.0
[16] permute_0.9-4 Matrix_1.2-14 GenomeInfoDbData_1.1.0
[19] Rhdf5lib_1.3.1 R.utils_2.6.0 bitops_1.0-6
[22] RCurl_1.95-4.10 limma_3.37.1 compiler_3.5.0
[25] R.methodsS3_1.7.1 scales_0.5.0 locfit_1.5-9.1
Library path directories:
/Users/keegankorthauer/Library/R/3.5/library
/Library/Frameworks/R.framework/Versions/3.5/Resources/library
* Packages too new for Bioconductor version '3.8'
Version LibPath
dmrseq "1.1.6" "/Users/keegankorthauer/Library/R/3.5/library"
FDRreg "0.2-1" "/Users/keegankorthauer/Library/R/3.5/library"
downgrade with biocLite(c("dmrseq", "FDRreg"))
Error: 2 package(s) too new
We need to catch the dreaded
[BSmooth.tstat] preprocessing ... done in 261.8 sec
[BSmooth.tstat] computing stats within groups ... done in 12.5 sec
[BSmooth.tstat] computing stats across groups ...
Error in approxfun(xx, yy) :
need at least two non-NA values to interpolate
and supply a better error message.
Compute like in Brain paper
BSseq, is it appropriate for RRBS data with large numbers of biological replicates ?
Such as 100 biological replicates of each group for detecting DMCs and DMRs.
Or which method should I choose for large numbers of biological replicates ?
I am trying to construct a BSseq
object from four data frames with 18519004, 15324125, 17354678 and 17382658 rows respectively (mouse, whole genome CpG methylation.) So far the constructor has been running for 30 hours on a CentOS 7 machine with 8 cores and 64GB of memory. Is this expected?
I construct the BSseq
object with the following code:
library(tidyverse)
library(dmrseq)
library(bsseq)
make_bs <- function(df, sampleName) {
M <- matrix(df$M, ncol = 1)
colnames(M) <- sampleName
Cov <- matrix(df$Cov, ncol = 1)
colnames(Cov) <- sampleName
BSseq(chr = df$chr,
pos = df$start,
Cov = Cov,
M = M,
sampleNames = sampleName)
}
bs <- make_bs(df1, "1") %>%
bsseq::combine(make_bs(df2, "2")) %>%
bsseq::combine(make_bs(df3, "3")) %>%
bsseq::combine(make_bs(df4, "4"))
Here's my sessionInfo
:
> sessionInfo()
R version 3.5.0 (2018-04-23)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)
Matrix products: default
BLAS: /stornext/System/data/apps/R/R-3.5.0/lib64/R/lib/libRblas.so
LAPACK: /stornext/System/data/apps/R/R-3.5.0/lib64/R/lib/libRlapack.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] dmrseq_1.0.0 bsseq_1.16.0
[3] SummarizedExperiment_1.10.1 DelayedArray_0.6.0
[5] BiocParallel_1.14.1 matrixStats_0.53.1
[7] Biobase_2.40.0 GenomicRanges_1.32.3
[9] GenomeInfoDb_1.16.0 IRanges_2.14.10
[11] S4Vectors_0.18.2 BiocGenerics_0.26.0
[13] forcats_0.3.0 stringr_1.3.1
[15] dplyr_0.7.5 purrr_0.2.5
[17] readr_1.1.1 tidyr_0.8.1
[19] tibble_1.4.2 ggplot2_2.2.1
[21] tidyverse_1.2.1
loaded via a namespace (and not attached):
[1] colorspace_1.4-0 XVector_0.20.0
[3] rstudioapi_0.7 bit64_0.9-8
[5] interactiveDisplayBase_1.18.0 AnnotationDbi_1.42.1
[7] lubridate_1.7.4 xml2_1.2.0
[9] splines_3.5.0 codetools_0.2-15
[11] R.methodsS3_1.7.1 mnormt_1.5-5
[13] jsonlite_1.5 Rsamtools_1.32.0
[15] broom_0.4.4 R.oo_1.22.0
[17] shiny_1.1.0 HDF5Array_1.8.0
[19] compiler_3.5.0 httr_1.3.1
[21] assertthat_0.2.0 Matrix_1.2-14
[23] lazyeval_0.2.1 limma_3.36.1
[25] cli_1.0.0 later_0.7.3
[27] htmltools_0.3.6 prettyunits_1.0.2
[29] tools_3.5.0 bindrcpp_0.2.2
[31] gtable_0.2.0 glue_1.2.0
[33] GenomeInfoDbData_1.1.0 annotatr_1.6.0
[35] reshape2_1.4.3 doRNG_1.6.6
[37] Rcpp_0.12.17 bumphunter_1.22.0
[39] cellranger_1.1.0 Biostrings_2.48.0
[41] nlme_3.1-137 rtracklayer_1.40.2
[43] iterators_1.0.9 DelayedMatrixStats_1.2.0
[45] psych_1.8.4 rvest_0.3.2
[47] mime_0.5 rngtools_1.3.1
[49] gtools_3.5.0 XML_3.98-1.11
[51] AnnotationHub_2.12.0 zlibbioc_1.26.0
[53] scales_0.5.0 BSgenome_1.48.0
[55] BiocInstaller_1.30.0 hms_0.4.2
[57] promises_1.0.1 rhdf5_2.24.0
[59] RColorBrewer_1.1-2 yaml_2.1.19
[61] memoise_1.1.0 pkgmaker_0.27
[63] biomaRt_2.36.1 stringi_1.2.2
[65] RSQLite_2.1.1 foreach_1.4.6
[67] permute_0.9-4 GenomicFeatures_1.32.0
[69] bibtex_0.4.2 rlang_0.2.1
[71] pkgconfig_2.0.1 bitops_1.0-6
[73] lattice_0.20-35 Rhdf5lib_1.2.1
[75] bindr_0.1.1 GenomicAlignments_1.16.0
[77] bit_1.1-14 tidyselect_0.2.4
[79] plyr_1.8.4 magrittr_1.5
[81] R6_2.2.2 DBI_1.0.0
[83] withr_2.1.2 pillar_1.2.3
[85] haven_1.1.1 foreign_0.8-70
[87] RCurl_1.95-4.10 modelr_0.1.2
[89] crayon_1.3.4 progress_1.1.2
[91] locfit_1.5-9.1 grid_3.5.0
[93] readxl_1.1.0 data.table_1.11.4
[95] blob_1.1.1 digest_0.6.15
[97] xtable_1.8-3 httpuv_1.4.3
[99] regioneR_1.12.0 outliers_0.14
[101] R.utils_2.6.0 munsell_0.4.3
[103] registry_0.5
In BSseq constructor function BSseq
, when checking for gr widths, the closing bracket is misplaced, it should be:
if(any(width(gr) != 1))
(instead of if(any(width(gr)) != 1)
)
It's clear from Lindsay's experience that we should probably check the BSseq object is sorted before smoothing (and likely before doing other operations too). I propose the following:
is.unsorted,RangedSummarizedExperiment-method
to SummarizedExperiment - we then get it via inheritance for BSseq objects.is.unsorted,RangedSummarizedExperiment-method
is.unsorted(BSseq)
where necessary to bsseq codebase.One complication is that sort()
for GRanges-based objects uses the seqlevel order from the Seqinfo slot - if this is "wrong" (as in Lindsay's case where chr3 was first) then the sort is "wrong". However, the relative order of positions within a seqlevel will still be correct, which I think is enough for what we want. In general, defining and checking a canonical order of seqlevels for an arbitrary sample with potentially limited metadata is very difficult, so I don't think we want to do anything in that space.
Hi,
I wanted to use bsmooth to generate only the methylation estimates per CpG for some other purpose than DMR calling. I want to smooth over windows of size around 500bp and I'm also interested in LMRs (lowly methylated regions). LMRs usually only contains a small number of CpGs and I don't want to smooth them "out". Hence I tried to use bsmooth with parameters like h=250 (it was written somewhere that h defines half the actual genome window size) and ns=3..5.
However, I often get errors like this:
bs_fit = BSmooth(bs, ns = 7, h = 250, maxGap=10^7, parallelBy = "chromosome", mc.cores = 10, mc.preschedule = FALSE)
[BSmooth] preprocessing ... done in 0.9 sec
[BSmooth] smoothing by 'chromosome' (mc.cores = 10, mc.preschedule = FALSE)
Fehler in BSmooth(bs, ns = 7, h = 250, maxGap = 10^7, parallelBy = "chromosome", :
BSmooth encountered smoothing errors
Zusätzlich: Warnmeldung:
In mclapply(1:length(clusterIdx), function(ii) { :
1 function calls resulted in an error
when I reduce the maxGap it runs through in some cases. Why is the maxGap parameter causing these errors? maxGap shouldn't play any role in finding a window of at least 3...5 CpGs, or?
Thanks a lot!
steffen
Hi, I learned that bsseq can be used to find DMR in specific regions, like promoter. I wonder that where can I find the protocol about how to use bsseq find differentially methylated promoter.
Hi,
I was attempting to use the BSseq constructor function to create a BSseq object for a test in my package and I ran into some weird behavior. The BSseq function is moving the decimal point of the input methylation values (dividing the methylation values by 10,000). The coverage values are correct however. See the example below:
methylVals = as.matrix(c(.1, .9, .4, .5))
covVals = as.matrix(rep(10000, 4))
testBSseq = BSseq(M =methylVals, Cov = covVals, pos = c(1000, 2000, 3000, 4000), chr=rep("chr1", 4))
methylVals
getMeth(testBSseq, type = "raw")
covVals
getCoverage(testBSseq)
On my system, the results from getMeth are:
DelayedMatrix object of 4 x 1 doubles:
V1
[1,] 1e-05
[2,] 9e-05
[3,] 4e-05
[4,] 5e-05
I wondered if this could be a DelayedArray thing but the DelayedArray constructor seems to work.
DelayedArray::DelayedArray(methylVals)
I found this behavior on both Linux and Windows. In case it helps, here is my session info:
R version 3.4.0 (2017-04-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.2 LTS
Matrix products: default
BLAS: /usr/lib/libblas/libblas.so.3.6.0
LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8 LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] grDevices datasets parallel stats graphics utils stats4 methods base
other attached packages:
[1] bsseq_1.12.2 SummarizedExperiment_1.6.3 Biobase_2.36.2 DelayedArray_0.2.4
[5] matrixStats_0.52.2 BiocInstaller_1.26.0 GenomicRanges_1.28.4 GenomeInfoDb_1.12.1
[9] IRanges_2.10.2 S4Vectors_0.14.2 BiocGenerics_0.22.0 ggplot2_2.2.1
[13] data.table_1.10.4
loaded via a namespace (and not attached):
[1] Rcpp_0.12.11 XVector_0.16.0 zlibbioc_1.22.0 munsell_0.4.3 lattice_0.20-35
[6] colorspace_1.3-2 rlang_0.1.1 plyr_1.8.4 tools_3.4.0 grid_3.4.0
[11] gtable_0.2.0 R.oo_1.21.0 gtools_3.5.0 lazyeval_0.2.0 permute_0.9-4
[16] tibble_1.3.3 Matrix_1.2-8 GenomeInfoDbData_0.99.0 R.utils_2.5.0 bitops_1.0-6
[21] RCurl_1.95-4.8 limma_3.32.5 compiler_3.4.0 R.methodsS3_1.7.1 scales_0.4.1
[26] locfit_1.5-9.1
It was nice to meet you all at Bioc2017. Let me know if any additional information would help!
Best,
John Lawson
A note to myself to follow-up https://support.bioconductor.org/p/75502/. Like Jim, I don't find the message ambiguous, but I'm happy to incorporate any suggestions.
computeComversion(BSseq, seqnames = "$LAMBDA")
where $LAMBDA
is the super long name we use for lambda. Then users can change at will
Hello,
Is there any multiple testing correction implemented in the bsseq pipeline?
Thank you.
Perhaps add section in vignette.
Currently, the smoothed values are stored on the logit scale. This results in a transformation (and memory copying) every time we do getMeth(type = "smooth")
. We do this to support the use of confidence intervals from the smoothing which are based on x +/- y on the logit scale (which is not symmetric on the [0,1] methylation scale).However, we use this very infrequent. We support the use of a transformation slot which is any function. So we need a function doing the transformation once and for all, and then we need to write code/profile to make sure no copying happens when we run getMeth()
.
Hi,
Is it expected that a small subset of a BSseq object obtained by chrSelectBSseq()
be close to the size of the full object? I noticed this issue after obtaining much larger file sizes than expected of BSseq object subsets saved using saveRDS
, but I'm pasting below a reprex that demonstrates in-memory size. I also compare to what the size would be with a manual subset by index, and recreation of the object with realize()
applied to the DelayedArray components.
Is there a simple way to obtain a subset that realizes the DelayedArray components so that the object size is proportionally reduced, that doesn't require manually recreating the object as below?
Best,
Keegan
library(bsseq)
#> Loading required package: BiocGenerics
#> Loading required package: parallel
#>
#> Attaching package: 'BiocGenerics'
#> The following objects are masked from 'package:parallel':
#>
#> clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
#> clusterExport, clusterMap, parApply, parCapply, parLapply,
#> parLapplyLB, parRapply, parSapply, parSapplyLB
#> The following objects are masked from 'package:stats':
#>
#> IQR, mad, sd, var, xtabs
#> The following objects are masked from 'package:base':
#>
#> anyDuplicated, append, as.data.frame, basename, cbind,
#> colMeans, colnames, colSums, dirname, do.call, duplicated,
#> eval, evalq, Filter, Find, get, grep, grepl, intersect,
#> is.unsorted, lapply, lengths, Map, mapply, match, mget, order,
#> paste, pmax, pmax.int, pmin, pmin.int, Position, rank, rbind,
#> Reduce, rowMeans, rownames, rowSums, sapply, setdiff, sort,
#> table, tapply, union, unique, unsplit, which, which.max,
#> which.min
#> Loading required package: GenomicRanges
#> Warning: package 'GenomicRanges' was built under R version 3.5.1
#> Loading required package: stats4
#> Loading required package: S4Vectors
#> Warning: package 'S4Vectors' was built under R version 3.5.1
#>
#> Attaching package: 'S4Vectors'
#> The following object is masked from 'package:base':
#>
#> expand.grid
#> Loading required package: IRanges
#> Warning: package 'IRanges' was built under R version 3.5.1
#> Loading required package: GenomeInfoDb
#> Loading required package: SummarizedExperiment
#> Warning: package 'SummarizedExperiment' was built under R version 3.5.1
#> Loading required package: Biobase
#> Warning: package 'Biobase' was built under R version 3.5.1
#> Welcome to Bioconductor
#>
#> Vignettes contain introductory material; view with
#> 'browseVignettes()'. To cite Bioconductor, see
#> 'citation("Biobase")', and for packages 'citation("pkgname")'.
#> Loading required package: DelayedArray
#> Warning: package 'DelayedArray' was built under R version 3.5.1
#> Loading required package: matrixStats
#>
#> Attaching package: 'matrixStats'
#> The following objects are masked from 'package:Biobase':
#>
#> anyMissing, rowMedians
#> Loading required package: BiocParallel
#> Warning: package 'BiocParallel' was built under R version 3.5.1
#>
#> Attaching package: 'DelayedArray'
#> The following objects are masked from 'package:matrixStats':
#>
#> colMaxs, colMins, colRanges, rowMaxs, rowMins, rowRanges
#> The following objects are masked from 'package:base':
#>
#> aperm, apply
library(pryr)
data(BS.chr22)
# artificially create multiple chromosomes
BS.chr22 <- BSseq(M = getCoverage(BS.chr22, type="M"),
Cov = getCoverage(BS.chr22, type="Cov"),
pos = start(BS.chr22),
chr = c(rep("1", 100), rep("2", 100), rep("3", nrow(BS.chr22)-200)),
pData = pData(BS.chr22),
sampleNames = sampleNames(BS.chr22))
# size of full object
object_size(BS.chr22)
#> 20.4 MB
# subset of object
BS.chr22_subset <- chrSelectBSseq(BS.chr22, "1")
# size of subset
object_size(BS.chr22_subset)
#> 16.5 MB
# reconstruct subset
BS.chr22_subset <- BSseq(M = realize(getCoverage(BS.chr22, type="M")[1:100,]),
Cov = realize(getCoverage(BS.chr22, type="Cov")[1:100,]),
pos = start(BS.chr22)[1:100],
chr = seqnames(BS.chr22)[1:100],
pData = pData(BS.chr22),
sampleNames = sampleNames(BS.chr22))
# size of reconstructed subset
object_size(BS.chr22_subset)
#> 660 kB
sessionInfo()
#> R version 3.5.0 (2018-04-23)
#> Platform: x86_64-apple-darwin15.6.0 (64-bit)
#> Running under: macOS High Sierra 10.13.6
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> attached base packages:
#> [1] stats4 parallel stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] pryr_0.1.4 bsseq_1.17.0
#> [3] SummarizedExperiment_1.11.6 DelayedArray_0.7.37
#> [5] BiocParallel_1.15.11 matrixStats_0.54.0
#> [7] Biobase_2.41.2 GenomicRanges_1.33.13
#> [9] GenomeInfoDb_1.17.1 IRanges_2.15.17
#> [11] S4Vectors_0.19.19 BiocGenerics_0.27.1
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_0.12.18 compiler_3.5.0
#> [3] XVector_0.21.3 R.methodsS3_1.7.1
#> [5] R.utils_2.7.0 bitops_1.0-6
#> [7] tools_3.5.0 DelayedMatrixStats_1.3.8
#> [9] zlibbioc_1.27.0 digest_0.6.16
#> [11] evaluate_0.11 rhdf5_2.25.9
#> [13] lattice_0.20-35 Matrix_1.2-14
#> [15] yaml_2.2.0 GenomeInfoDbData_1.1.0
#> [17] stringr_1.3.1 knitr_1.20
#> [19] gtools_3.8.1 locfit_1.5-9.1
#> [21] rprojroot_1.3-2 grid_3.5.0
#> [23] data.table_1.11.4 HDF5Array_1.9.15
#> [25] rmarkdown_1.10 limma_3.37.4
#> [27] Rhdf5lib_1.3.3 magrittr_1.5
#> [29] codetools_0.2-15 backports_1.1.2
#> [31] scales_1.0.0 htmltools_0.3.6
#> [33] permute_0.9-4 colorspace_1.3-2
#> [35] stringi_1.2.4 RCurl_1.95-4.11
#> [37] munsell_0.5.0 R.oo_1.22.0
BiocManager::valid()
#> [1] TRUE
Created on 2018-09-07 by the reprex
package (v0.2.0).
list.files(pattern="*.txt") -> report.files
read.bismark(files=report.files,sampleNames=gsub(".txt","",report.files),rmZeroCov=TRUE, strandCollapse=FALSE,fileType="cytosineReporAssuming file type is cytosineReport
Read 3547206 rows and 7 (of 7) columns from 0.107 GB file in 00:00:04
done in 9.7 secs
Read 3547206 rows and 7 (of 7) columns from 0.107 GB file in 00:00:03
done in 6.7 secs
[read.bismark] Reading file 'LZH369L_scaffold1.txt' ... done in 6.2 secs
[read.bismark] Reading file 'LZH371L_scaffold1.txt' ... done in 6.5 secs
[read.bismark] Reading file 'LZH373L_scaffold1.txt' ... done in 6.2 secs
[read.bismark] Reading file 'MLS387L_scaffold1.txt' ... done in 6.4 secs
[read.bismark] Reading file 'MLS388L_scaffold1.txt' ... done in 6.5 secs
[read.bismark] Reading file 'MLS390L_scaffold1.txt' ... done in 6.3 secs
[read.bismark] Joining samples ... done in 19.7 secs
bsseq.data[[1]] <- c("high","low","low","low","low","high","high","high")
bsseq.data[[2]] <- c("pair2","pair1","pair1","pair1","pair1","pair2","pair2","pair2")
pData(bsseq.data) <- setNames(pData(bsseq.data),c("Type","pair"))
bsseq.data.smoothed <- BSmooth(bsseq.data, verbose=TRUE)
[BSmooth] preprocessing ... done in 2.6 sec
[BSmooth] smoothing by 'sample' (mc.cores = 1, mc.preschedule = FALSE)
[BSmooth] smoothing done in 91.9 sec
Error incolnames<-
(*tmp*
, value = c("HLiver_scaffold1", "LLiver_scaffold1", :
attempt to set 'colnames' on an object with less than two dimensions
I don't know the problem with this information.
Looks like things have changed again with the bismark output. A few things that need to be updated in code and docs:
.cov
files from recent versions of Bismark (at least v0.14.3
that I tested) do not include strand information. This contradicts the docs, but more importantly can lead to really bad behaviour by read.bismark()
. Specifically, if a C is only measured on the reverse strand then its position is given by the complementary G on the forward strand - if another sample has data from the forward strand for that same CG, then this will result in the "C" having different co-ordinates in different samples yet both having *
strand. In general, I think this is impossible to fix/detect by read.bismark()
. I think we should strongly recommend that users use the cytosineReport
files and report a warning/message if .cov
files are used (as well as document this behaviour)..cov
as default fileType
, then might be good idea to set default strandCollapse = FALSE
since the default fileType = "cov"
does not allow strandCollapse = TRUE
.fileType
values). Perhaps we should only explicitly support some minimum version of Bismark, document as such (including the expected format) and leave the user to their own devices for older versions.The trans
slot is only used to get locfit derived confidence intervals which is not used anywhere in the code. This implies that for 99% of use cases we can do the trans transformation once at object creation and then bypass it anywhere it happens. We should have a special function in trans
which makes sure we do not copy the input, ie. we do not want something like
function(x) x
because I am not sure it is not copying the input. The most secure workaround might be to replace code like
trans(object)
with
if(trans != SOMETHING) then
trans(object)
so we don't call trans()
at all.
Perhaps I'm using a version of the R/bsseq package that's too old (v1.6.0, installed via Bioconductor's biocLite
), but it appears that the argument rmZeroCov is not actually doing anything. I have just run the read.bismark
function twice on my local machine (using rmZeroCov = TRUE
and rmZeroCov = FALSE
), with both returning objects of the same size. What's more the matrices returned by getCoverage(... , type = "Cov")
on both of these BSseq objects include many CpG loci with values of 0. Though these are easily removed with a line or two of extra code, the function does not return any warning that loci of coverage zero are present in the matrix -- and, use of getMeth()
returns NaN for all such loci. Looking at the code under branch master for read.bismark
, I cannot actually find a place where the rmZeroCov argument gets used (it seems to get passed along through various functions with no actual purpose). I'd be glad to help fix this if it is actually a problem -- any clarification would be appreciated @PeteHaitch
The docs have fallen behind updates made to the bismark_methylation_extractor
. I think the underlying code is fine, but the docs refer to the wrong input file (should be the .cov
file rather than the .bedGraph
file). It'd be good to update the example data correspondingly in inst/extdata
.
Ideally, I'll include some unit tests that keep pace with latest version of Bismark to automatically generate the input file, but I'd have to have a think about how to implement that.
A reminder that we want to fix the code so that t-stat comparisons made in fstat.comparisons.pipeline()
use cutoff
and not cutoff^2
. Need to think through how this should be implemented. This bit us when analysing Lindsay's data
Hi,
I have WGBS for three different groups of two non-model species (four individuals per group). One is a population of species A and the others are two populations of species B (representing the two extremes of the geographic distribution, 10,000 Km apart). Species A and B are very closely related. I have done a search and have seen that Bsseq is mainly used to do analyses for case-control studies (thus comparing two samples) and not to compare three groups. Could I analyse DMRs among the three populations using bsseq? How?
And a second question. I would like to compare also between species A (four individuals) and species B (eight individuals). Could I do that? (considering the different number of individuals) How?
Thanks a lot in advance.
All the best,
Begoña Martínez
Hi Kasper,
I have obtained a set of DMR between my two groups and I would like to estimate the FDR using the qvalue package in Bioconductor. I may have missed something during the process of analyses, but I am not able to recover the p-values associated to my DMRs. How could I get these p-values? My pipeline from the smoothing is writen below.
Many thanks in advance.
Best,
Begoña
File.data.smoothed <- BSmooth(File.data,mc.cores=7,verbose=TRUE)
File.data.tstats <- BSmooth.tstat(File.data.smoothed,group1=c("A","B","C","D"),
\ group2=c("E","F","G"),estimate.var="same",local.correct=TRUE, verbose=TRUE)
dmr.list <- dmrFinder(File.data.tstats,qcutoff=c(.005,.995))
dmrs <- subset(dmr.list, n>=10 & abs(meanDiff) >= 0.3)
write.table(x=dmrs, file="File.data.txt", sep="\t", row.names= FALSE, quote=FALSE)
I've been asked to permute sample identities to test how many DMRs emerge from our data randomly as it from a low CpG methylation genome (~0.7% genome wide). The design is paired and I have used a paired t-test for mother, I am interested in the result of condition.
condition mother 1 H 16 2 H 22 3 H 5 4 H 7 5 H 8 6 H 9 7 L 16 8 L 22 9 L 5 10 L 7 11 L 8 12 L 9
Is there a way of adapting the "permute.R" module to do this? If not how would you do this, permute condition within each pair, or randomly assign any of the twelve into one group (and break pairing)? This second method was proposed to see if the same number of DMRs emerge anyhow.
DMRs were thresholded at 4.6 -4.6, 0.1 mean difference between conditions of 0.1, and 3 CpGs required.
Thanks for your input!
Jack
Currently we have a limitation on the number of samples supported by bsseq, based on R limitations on the size of a matrix. This translates into ~85 samples with 26M CpGs. We need to address this.
In matrixStats (>= 0.50.0), which is now on CRAN, we introduce support for subsetted calculations that are optimized both for speed and memory usage. There are now new arguments idxs
, rows
and cols
for most functions which allow you to very easily make use of this.
For your package, I believe the following functions could gain from this:
$bsseq$BSmooth.fstat
rowSds(allPs[, group2, drop = FALSE], na.rm = TRUE)
rowVars(allPs[, group1, drop = FALSE])
rowVars(allPs[, group2, drop = FALSE])
$bsseq$BSmooth.tstat
rowSds(allPs[, group2, drop = FALSE], na.rm = TRUE)
rowVars(allPs[, group1, drop = FALSE])
rowVars(allPs[, group2, drop = FALSE])
by updating to something like:
$bsseq$BSmooth.fstat
rowSds(allPs, cols = group2, na.rm = TRUE)
rowVars(allPs, cols = group1)
rowVars(allPs, cols = group2)
$bsseq$BSmooth.tstat
rowSds(allPs, cols = group2, na.rm = TRUE)
rowVars(allPs, cols = group1)
rowVars(allPs, cols = group2)
and adding matrixStats (>= 0.50.0)
to your DESCRIPTION
file.
Easy and useful for visualising mC data in genome browser
Would be helpful to keep file size low.
Hi all,
I've performed fisherTests() for two samples since I've only one replicate for each samples. but fishertest result gives only two columns p.value and log2OR and I want to know its corresponding chromosome and position information. also I want to filter and plot fisherTest result.
How do I find DMRs ?
Thanks in advance!!
Any help appreciated!!
I'd tried implementing this for BSmooth()
in the refactor
branch, but hit some issues and so removed it (9392512). Long term, it'd be cool to have this functionality. I think it's achievable, but requires time that I don't have right now.
I want to smooth a huge chromosome(about 196M bp, 8 samples) which contain 3411686 CpGs. When I used command: bismarkBSseq.fit.dmrs <- BSmooth(bismarkBSseq, mc.cores=4, parallelBy="sample", verbose=TRUE, ns=20, h=1000). Errors happened: Error in numeric(lw[2]) : invalid 'length' argument. In addition: Warning message: In mclapply(seq(along = sampleNames), function(sIdx) { : 8 function calls resulted in an error
It works when I truncated the size of this chromosome or changed the arguments(ns=500, h=20000). I'm new in R and I can't find argument "lw" in source code of BSmooth.
My R version is 3.4.1; Bsseq version is 1.14.0.
Any help is Appreciated!
Currently, if calling combine() on data from different chromosomes, the smoothed data is discarded. However, it seems perfectly reasonable to smooth it on a per-chromosome basis, so why not allow it?
I recently updated to R3.5 and continued using my "old" bsseq objects created under R3.4. Upon updating the objects it is no problem for me to handle the bsseq objects as usual, however when I try to extract the smoothed methylation matrix I get following error:
> getMeth(AM.BS.fit, type="smooth")
Error in register_delayed_op(q, "plogis", Rargs = list(location = location, :
could not find function "register_delayed_op"
> traceback()
3: getBSseq(BSseq, "trans", withDimnames)(coef)
2: getBSseq(BSseq, "trans", withDimnames)(coef)
1: getMeth(AM.BS.fit, type = "smooth")
> sessionInfo()
R version 3.5.1 (2018-07-02)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.1 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.7.1
locale:
[1] LC_CTYPE=C.UTF-8 LC_NUMERIC=C
[3] LC_TIME=fr_FR.UTF-8 LC_COLLATE=C.UTF-8
[5] LC_MONETARY=fr_FR.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=fr_FR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=fr_FR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 parallel stats graphics grDevices utils datasets
[8] methods base
other attached packages:
[1] bsseq_1.18.0 SummarizedExperiment_1.12.0
[3] DelayedArray_0.9.0 BiocParallel_1.16.2
[5] matrixStats_0.54.0 Biobase_2.42.0
[7] GenomicRanges_1.34.0 GenomeInfoDb_1.18.1
[9] IRanges_2.16.0 S4Vectors_0.20.1
[11] BiocGenerics_0.28.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.0 compiler_3.5.1 XVector_0.22.0
[4] R.methodsS3_1.7.1 bitops_1.0-6 R.utils_2.7.0
[7] DelayedMatrixStats_1.4.0 tools_3.5.1 zlibbioc_1.28.0
[10] rhdf5_2.26.0 lattice_0.20-38 BSgenome_1.50.0
[13] Matrix_1.2-15 GenomeInfoDbData_1.2.0 rtracklayer_1.42.1
[16] Biostrings_2.50.1 gtools_3.8.1 locfit_1.5-9.1
[19] grid_3.5.1 data.table_1.11.8 HDF5Array_1.10.1
[22] XML_3.98-1.16 limma_3.38.3 Rhdf5lib_1.4.2
[25] scales_1.0.0 Rsamtools_1.34.0 GenomicAlignments_1.18.0
[28] beachmat_1.4.0 permute_0.9-4 colorspace_1.3-2
[31] RCurl_1.95-4.11 munsell_0.5.0 R.oo_1.22.0
Thanks for your support.
Joschka
When trying to combine two BSseq objects with same methylation loci but different sample names, I get with the 1.16.1 version the following error:
> BS
An object of type 'BSseq' with
68164 methylation loci
12 samples
has not been smoothed
All assays are in-memory
> BS.avg
An object of type 'BSseq' with
68164 methylation loci
8 samples
has not been smoothed
All assays are in-memory
> BS.all=bsseq::combine(BS,BS.avg)
Error in .validate_names(colnames, ans_colnames, "assay colnames()", "colData rownames()") :
assay colnames() must be NULL or identical to colData rownames()
This definitly worked in an older BSseq version
Do a comprehensive profiling of bsseq using lineprof()
.
Hi,
I'm a big fan of bsseq
because it is exactly the data structure I need for a package I maintain (methylSig
) that tests for differential methylation (at CpGs and over regions).
As of the Bioc 3.7 release it appears that a check in the BSseq
constructor was added (I don't see a reference to it in the NEWS
file).
if(any(width(gr) != 1))
stop("'gr' needs to have widths of 1")
It was incredibly convenient to be able to create BSseq
objects that aggregated CpGs over regions so that there was still a unified methylation object to access and manipulate.
I would like to request that this check be removed to enable BSseq
objects with ranges greater than 1bp. Of course, I understand there may be intra-package consistency reasons for the change, but I figured it was worth a shot to ask because I think there is value in BSseq
objects that represent methylation over regions.
This is cross-posted on Bioc support (https://support.bioconductor.org/p/109098/)
Thanks,
Raymond
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.