
Comments (8)

PeteHaitch commented on August 17, 2024

Setting BACKEND = "HDF5Array" writes the result to an HDF5 file, and writing to disk is generally slower than keeping the data in memory.
Do you really want/need to do this?
Generally, you should only use this if you don't have enough RAM to hold the data in memory.

I'm guessing that methylSig reads into memory and hence is fast.

from bsseq.

shraddhapai commented on August 17, 2024

Hi @PeteHaitch, excluding that parameter doesn't change the result for me.
How long should I expect a call to read.bismark() to take, assuming just the CpG context for the mouse genome, or even a single small chromosome as in my example? Thanks.


PeteHaitch commented on August 17, 2024

Can you share an example file?
I can test locally (I'm not sure whether the Docker setup is related).


shraddhapai commented on August 17, 2024

Thanks Peter, emailed a couple test files to you.


PeteHaitch commented on August 17, 2024

Thanks for sharing the example files with me.
Reading these files takes only a few seconds on my Ubuntu laptop, even when writing to disk as an HDF5-backed BSseq object. (I think the HDF5-backed run being faster here is an artefact of parsing the same files a second time.)

suppressPackageStartupMessages(library(bsseq))
suppressPackageStartupMessages(library(HDF5Array))

files <- c(
  "~/Downloads/WT_VEH1_CX_report.txt.gz.chr18.CpG.txt.gz",
  "~/Downloads/WT_VEH2_CX_report.txt.gz.chr18.CpG.txt.gz")

system.time(bsseq <- read.bismark(files, verbose = TRUE))
#> [read.bismark] Parsing files and constructing valid loci ...
#> Done in 4.4 secs
#> [read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ...
#> Done in 2.1 secs
#> [read.bismark] Constructing BSseq object ...
#>    user  system elapsed 
#>   7.382   1.694   6.629

system.time(bsseq_hdf5 <- read.bismark(files, verbose = TRUE, BACKEND = "HDF5Array"))
#> [read.bismark] Parsing files and constructing valid loci ...
#> Done in 1.6 secs
#> [read.bismark] Parsing files and constructing 'M' and 'Cov' matrices ...
#> Done in 2.3 secs
#> [read.bismark] Constructing BSseq object ...
#>    user  system elapsed 
#>   5.244   1.260   4.151

Created on 2022-03-30 by the reprex package (v2.0.1)

Session info
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#>  setting  value
#>  version  R version 4.1.2 (2021-11-01)
#>  os       Ubuntu 20.04.4 LTS
#>  system   x86_64, linux-gnu
#>  ui       X11
#>  language en_AU:en
#>  collate  en_AU.UTF-8
#>  ctype    en_AU.UTF-8
#>  tz       Australia/Melbourne
#>  date     2022-03-30
#>  pandoc   2.17.1.1 @ /usr/lib/rstudio/bin/quarto/bin/ (via rmarkdown)
#> 
#> ─ Packages ───────────────────────────────────────────────────────────────────
#>  package              * version  date (UTC) lib source
#>  Biobase              * 2.54.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  BiocGenerics         * 0.40.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  BiocIO                 1.4.0    2021-10-26 [1] RSPM (R 4.1.2)
#>  BiocParallel           1.28.3   2021-12-09 [1] RSPM (R 4.1.2)
#>  Biostrings             2.62.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  bitops                 1.0-7    2021-04-24 [1] RSPM (R 4.1.0)
#>  BSgenome               1.62.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  bsseq                * 1.30.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  cli                    3.2.0    2022-02-14 [1] RSPM (R 4.1.2)
#>  colorspace             2.0-3    2022-02-21 [1] RSPM (R 4.1.0)
#>  crayon                 1.5.1    2022-03-26 [1] RSPM (R 4.1.2)
#>  data.table             1.14.2   2021-09-27 [1] RSPM (R 4.1.0)
#>  DelayedArray         * 0.20.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  DelayedMatrixStats     1.16.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  digest                 0.6.29   2021-12-01 [1] RSPM (R 4.1.2)
#>  evaluate               0.15     2022-02-18 [1] RSPM (R 4.1.0)
#>  fastmap                1.1.0    2021-01-25 [1] RSPM (R 4.1.0)
#>  fs                     1.5.2    2021-12-08 [1] RSPM (R 4.1.2)
#>  GenomeInfoDb         * 1.30.1   2022-01-30 [1] RSPM (R 4.1.2)
#>  GenomeInfoDbData       1.2.7    2021-10-28 [1] RSPM (R 4.1.1)
#>  GenomicAlignments      1.30.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  GenomicRanges        * 1.46.1   2021-11-18 [1] RSPM (R 4.1.2)
#>  glue                   1.6.2    2022-02-24 [1] RSPM (R 4.1.2)
#>  gtools                 3.9.2    2021-06-06 [1] RSPM (R 4.1.0)
#>  HDF5Array            * 1.22.1   2021-11-14 [1] RSPM (R 4.1.2)
#>  highr                  0.9      2021-04-16 [1] RSPM (R 4.1.0)
#>  htmltools              0.5.2    2021-08-25 [1] RSPM (R 4.1.0)
#>  IRanges              * 2.28.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  knitr                  1.38     2022-03-25 [1] RSPM (R 4.1.2)
#>  lattice                0.20-45  2021-09-22 [4] CRAN (R 4.1.1)
#>  lifecycle              1.0.1    2021-09-24 [1] RSPM (R 4.1.0)
#>  limma                  3.50.1   2022-02-17 [1] Bioconductor
#>  locfit                 1.5-9.5  2022-03-03 [1] RSPM (R 4.1.2)
#>  magrittr               2.0.2    2022-01-26 [1] RSPM (R 4.1.2)
#>  Matrix               * 1.4-0    2021-12-08 [4] CRAN (R 4.1.2)
#>  MatrixGenerics       * 1.6.0    2021-10-26 [1] RSPM (R 4.1.2)
#>  matrixStats          * 0.61.0   2021-09-17 [1] RSPM (R 4.1.1)
#>  munsell                0.5.0    2018-06-12 [1] RSPM (R 4.1.0)
#>  permute                0.9-7    2022-01-27 [1] RSPM (R 4.1.2)
#>  R.methodsS3            1.8.1    2020-08-26 [1] RSPM (R 4.1.0)
#>  R.oo                   1.24.0   2020-08-26 [1] RSPM (R 4.1.0)
#>  R.utils                2.11.0   2021-09-26 [1] RSPM (R 4.1.0)
#>  R6                     2.5.1    2021-08-19 [1] RSPM (R 4.1.1)
#>  Rcpp                   1.0.8.3  2022-03-17 [1] RSPM (R 4.1.2)
#>  RCurl                  1.98-1.6 2022-02-08 [1] RSPM (R 4.1.2)
#>  reprex                 2.0.1    2021-08-05 [1] RSPM (R 4.1.0)
#>  restfulr               0.0.13   2017-08-06 [1] RSPM (R 4.1.0)
#>  rhdf5                * 2.38.1   2022-03-10 [1] RSPM (R 4.1.2)
#>  rhdf5filters           1.6.0    2021-10-26 [1] RSPM (R 4.1.2)
#>  Rhdf5lib               1.16.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  rjson                  0.2.21   2022-01-09 [1] RSPM (R 4.1.0)
#>  rlang                  1.0.2    2022-03-04 [1] RSPM (R 4.1.2)
#>  rmarkdown              2.13     2022-03-10 [1] RSPM (R 4.1.2)
#>  Rsamtools              2.10.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  rstudioapi             0.13     2020-11-12 [1] RSPM (R 4.1.0)
#>  rtracklayer            1.54.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  S4Vectors            * 0.32.3   2021-11-21 [1] RSPM (R 4.1.2)
#>  scales                 1.1.1    2020-05-11 [1] RSPM (R 4.1.0)
#>  sessioninfo            1.2.2    2021-12-06 [1] RSPM (R 4.1.2)
#>  sparseMatrixStats      1.6.0    2021-10-26 [1] RSPM (R 4.1.2)
#>  stringi                1.7.6    2021-11-29 [1] RSPM (R 4.1.2)
#>  stringr                1.4.0    2019-02-10 [1] RSPM (R 4.1.0)
#>  SummarizedExperiment * 1.24.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  withr                  2.5.0    2022-03-03 [1] RSPM (R 4.1.2)
#>  xfun                   0.30     2022-03-02 [1] RSPM (R 4.1.2)
#>  XML                    3.99-0.9 2022-02-24 [1] RSPM (R 4.1.2)
#>  XVector                0.34.0   2021-10-26 [1] RSPM (R 4.1.2)
#>  yaml                   2.3.5    2022-02-21 [1] RSPM (R 4.1.0)
#>  zlibbioc               1.40.0   2021-10-26 [1] RSPM (R 4.1.2)
#> 
#>  [1] /home/peter/R/x86_64-pc-linux-gnu-library/4.1
#>  [2] /usr/local/lib/R/site-library
#>  [3] /usr/lib/R/site-library
#>  [4] /usr/lib/R/library
#> 
#> ──────────────────────────────────────────────────────────────────────────────


PeteHaitch commented on August 17, 2024

You might try setting nThread = 1 (its default), because I don't know how threading would interact with your code being run within Docker.
If you can share your code for setting up Docker I may be able to try to reproduce this on my end, but it's getting a bit beyond what I can support, since bsseq itself seems to have no problem reading these files.
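
For anyone wanting to check this on their own setup, here is a minimal sketch of such a comparison. It assumes the small example coverage file bundled with bsseq as a stand-in for real Bismark reports, and `time_read()` is a hypothetical helper written for this sketch, not part of bsseq.

```r
# Sketch only: compare read.bismark() timings across nThread values.
suppressPackageStartupMessages(library(bsseq))

# Assumption: the example coverage file shipped with bsseq stands in for
# the real Bismark reports discussed in this thread.
infile <- system.file(
  "extdata/test_data.fastq_bismark.bismark.cov.gz",
  package = "bsseq")

# Hypothetical helper: elapsed seconds for one read.bismark() call.
time_read <- function(n_thread) {
  timing <- system.time(
    read.bismark(
      files = infile,
      colData = DataFrame(row.names = "test_data"),
      strandCollapse = FALSE,
      nThread = n_thread,
      verbose = FALSE))
  unname(timing["elapsed"])
}

# Time the default (1 thread) against a multi-threaded read.
elapsed <- vapply(c(1L, 2L), time_read, numeric(1))
names(elapsed) <- c("nThread=1", "nThread=2")
print(elapsed)
```

On a file this small the two timings will likely be indistinguishable; the point is only the shape of the test. Running the same comparison on the real files, inside and outside the container, should show whether nThread > 1 is what slows things down.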


shraddhapai commented on August 17, 2024

Hi Peter, thanks for the code. It seems the nThread parameter was the culprit. Leaving all params at their defaults, the test files load quite fast for me (<10 sec). Setting nThread to 8L caused the slowdown again. This is all in the Docker environment.

As a further test, I ran read.bismark() on CpG calls for the entire mouse genome, 6 files. It read all the files in 1.2 minutes (44s for parsing files and 29s for constructing the M / Cov matrices).

I'll leave nThread at the default value for now as it seems to work.

Thanks for your help! Shraddha


PeteHaitch commented on August 17, 2024

Glad that worked!

