BiocBBSpack's Introduction

BiocBBSpack

See the bottom of this page for historical commentary about tarball-construction technicalities that may have gone away by Nov 2020.

Transcript of work in AnVIL, 15 Nov 2020

This workspace develops code for creating an R package library sufficient to install, build, and check all Bioconductor software packages. A fault-tolerant approach is needed. Crucial risks are

  • the /tmp folder will fill up, so, prior to starting R, set TMPDIR to a folder with sufficient space
  • R CMD INSTALL will not find the newly installed packages unless they are installed to a folder in the value of .libPaths(), so use .Rprofile to set this
  • downloads during package installation under R 4.0.3 will time out unless options(timeout=180) is set (see the sketch just below)
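A minimal setup sketch addressing these risks, assuming the new library folder is /home/rstudio/repo_3_12 (the path used later in this transcript). TMPDIR should be exported to a roomy folder in the shell before R starts; the R-side settings go in ~/.Rprofile:

# ~/.Rprofile: as the transcript below shows, setting .libPaths() interactively is not enough
.libPaths(c("/home/rstudio/repo_3_12", .libPaths()))
options(timeout=180)   # avoid timeouts on large downloads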

Here's what we started out with

BiocManager::install("vjcitn/BiocBBSpack")  # this has a function to acquire the list of packages in a release using the manifest repo
library(BiocBBSpack)
pks = get_bioc_packagelist(rel="RELEASE_3_12")
todo = unique(c(pks, unlist(lapply(.libPaths(), dir))))
length(todo)  # 2177
dir.create("repo_3_12")
BiocManager::install(pks, lib="repo_3_12", Ncpus=45)

The above yielded 1343 packages in repo_3_12. We returned and continued with:

library(BiocBBSpack)
pks = try(get_bioc_packagelist(rel="RELEASE_3_12"))
done = dir("repo_3_12") 
todo = setdiff(pks, done)
print(length(todo))  # 
BiocManager::install(todo, lib="repo_3_12", Ncpus=45)

but that did not succeed. We needed .Rprofile to update .libPaths() to include the new repo destination, so it contains the line

.libPaths(c("/home/rstudio/repo_3_12", .libPaths()))

and now

library(BiocBBSpack)
pks = try(get_bioc_packagelist(rel="RELEASE_3_12"))
done = dir("repo_3_12") 
todo = setdiff(pks, done)
print(length(todo))  #
.libPaths(c("repo_3_12", .libPaths())) # not enough because INSTALL uses R CMD ... and .Rprofile must be used to set this?
stopifnot("/home/rstudio/repo_3_12" %in% .libPaths())
options(timeout=180)  # helpful for big downloads
BiocManager::install(todo, lib="repo_3_12", Ncpus=50)

runs. repo_3_12 has 3125 packages at the end; 97 members of pks fail to install on 15 Nov 2020.
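To list the failures, the same setdiff pattern used above applies:

failed = setdiff(pks, dir("repo_3_12"))  # members of pks with no installed folder in the repo
length(failed)                           # 97 on 15 Nov 2020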

Here is how we go about acquiring sources for build and check

dir.create("gits_3_12")
pks = get_bioc_packagelist(rel="RELEASE_3_12")
ps = PackageSet(pks)
populate_local_gits(ps, "gits_3_12")

That process took 35 minutes on a large instance in GCP.

Historical commentary

Sketch of build system shortcuts:

Major issue to make this work: pkgbuild had to be modified to avoid querying the user about actions taken on inst/doc contents when present; the fork, with a special version number, is at github.com/vjcitn.

See this issue for the basic layout of tasks working as of July 7 2019, leading to configuration of a Linux builder and creation of 1700 software tarballs in half a day on a 24-core machine with 60 GB RAM; the builder configuration takes about 3h and does not need to be repeated in extenso; tarball creation averages 2 minutes per package per core.

basic ideas:

to populate a folder with clones of all software packages in Bioconductor git:

library(BiocBBSpack) # uses BiocPkgTools
sapply(bioc_software_packagelist(), getpk)

to build a tarball for a package (R CMD build, using the pkgbuild package), use build1(), which will install, as needed, all dependencies identified in BiocPkgTools::buildPkgDependencyDataFrame() before running pkgbuild::build()
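A minimal single-package sketch (the package name is illustrative; it assumes getpk() clones into the current folder and that build1() takes a package source directory and a destination, as in the parallel example later in this document):

library(BiocBBSpack)
getpk("limma")             # clone one package's git source (illustrative choice)
try(build1("limma", "."))  # install needed dependencies, then build the tarball here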

For Ubuntu 18.04, the Linux package set is listed in inst/ubuntu

BiocBBSpack's Issues

Current docker-based work

THIS IS RELATIVELY OBSOLETE. See the waljet script with

elif [ $envtype == "shell" ]; then
    docker run -ti -e TMPDIR=/tmp/r_tmpdir --user root \
        -v $DOCKER_HOME:/home/rstudio \
        -v $DOCKER_RTMP:/tmp/r_tmpdir \
        -v $DOCKER_RPKGS:/usr/local/lib/R/host-site-library \
        -w /home/rstudio \
        $waldronIMG bash
fi

This was added to improve handling of TMPDIR.

You do need to deal with LaTeX.
--- end of Jan 26 comment

This is for Dec 2019 work with the build system.

Basic idea: use vjcitn/BiocBBSpack with the docker image
bioconductor/bioconductor_full:devel to get a sense of
the current situation of tarball creation.

Once that is done, go over the 3.11 node management
tools by Herve.

Ultimately we want to take a look at GitHub Actions for non-Linux
builds.

Key tasks.

  1. set TMPDIR to use the volume -- need to pass the mount point
    and the environment setting

sudo docker run -v /vol_c/R_TMPDIR:/vol_c/R_TMPDIR -e TMPDIR=/vol_c/R_TMPDIR -ti bioconductor/bioconductor_full:devel bash

  2. set .libPaths to use the volume. To do this we will have a persistent
    bash work directory

sudo docker run -v /vol_c/WORK_BBS:/vol_c/WORK_BBS -v /vol_c/R_TMPDIR:/vol_c/R_TMPDIR -e TMPDIR=/vol_c/R_TMPDIR -ti bioconductor/bioconductor_full:devel bash

Then the .Rprofile

.libPaths(c("/vol_c/WORK_BBS/BINARIES", .libPaths()))

will create binaries in /vol_c/WORK_BBS/BINARIES
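A quick in-container check that the .Rprofile took effect, mirroring the stopifnot guard used in the AnVIL transcript above:

stopifnot("/vol_c/WORK_BBS/BINARIES" %in% .libPaths())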

  3. install BiocBBSpack and use it in WORK_BBS -- this seems to lead to chaos with lots of
    conflicting locks when parallel_tarballs is used on PackageSet(coreset()). This
    may be particular to the "first" run, in which considerable numbers of dependencies
    need to be installed to facilitate builds.

Indeed, it seems that BiocManager::install(BiocBBSpack::installed_r, Ncpus=8) or so
should be the first step. This will install 2000+ packages, from which an update
process can begin.
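A concrete sketch of that first step (assuming installed_r is the vector of package names referenced here and in step 7 of the from-scratch recipe below; the core count is a choice):

library(BiocBBSpack)
options(timeout=180)                                     # large downloads, as noted earlier
BiocManager::install(BiocBBSpack::installed_r, Ncpus=8)  # populate the library before attempting parallel builds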

  4. can't build without LaTeX and texi2dvi (the latter comes from texinfo)

see the dec14 Dockerfile

sudo docker run -v /vol_c/WORK_BBS:/vol_c/WORK_BBS -v /vol_c/R_TMPDIR:/vol_c/R_TMPDIR -e TMPDIR=/vol_c/R_TMPDIR -ti vjcitn/bioc_bbspack2019:v1 bash

multiple gridsvg devices causing problems?

Error: processing vignette 'arrayQualityMetrics.Rnw' failed with diagnostics:
Only one 'gridsvg' device may be used at a time
--- failed re-building ‘arrayQualityMetrics.Rnw’

SUMMARY: processing the following files failed:
‘aqm.Rnw’ ‘arrayQualityMetrics.Rnw’

with docker, still issues with X11

Bioconductor version 3.11 (BiocManager 1.30.10), ?BiocManager::install for help
Warning in rgl.init(initValue, onlyNULL) :
RGL: unable to open X11 display
Warning: 'rgl_init' failed, running with rgl.useNULL = TRUE

how to build from scratch

  1. use ubuntu 18.04
  2. use endow_ubuntu() in BiocBBSpack [NOT: obtain the mystring.txt from BiocBBSpack/inst/ubuntu] and run as sudo to set up the Linux infrastructure; this takes about 30 minutes, and two interactions are needed (a ttf EULA and a service restart)
  3. install libsbml from source
  4. svn co/configure/build/install R-devel from source
  5. in R, install.packages(c("BiocManager", "remotes"))
  6. in R: BiocManager::install(c("vjcitn/pkgbuild", "vjcitn/BiocBBSpack"))
  7. in R: BiocManager::install(BiocBBSpack::installed_r, Ncpus=12)
  8. in a writeable folder, run
library(BiocBBSpack) # ensure that the version of pkgbuild is the fork at github.com/vjcitn
jnk = sapply(bioc_software_packagelist(), getpk)  # could be parallelized? takes 1.5h serially in jetstream
  9. in a different writeable folder, run
library(BiocBBSpack) # ensure that the version of pkgbuild is the fork at github.com/vjcitn
num_cores = [your choice]
library(pkgbuild)
library(BiocBBSpack)
cands = list_packs_to_update("../bioc_sources", ".") # assume ../bioc_sources is where you did step 8
library(parallel)
options(mc.cores=num_cores)
chk = mclapply(cands, function(x) {Sys.sleep(runif(1, 2, 6)); try(build1(x, "."))})

On jetstream this seems to do about one package per core every 2 minutes ... so about 3h with 20 cores to do 1800 packages

Notes: step 7 will install over 2200 R packages as of July 7 2019, in approximately 2 hours

with that, these will fail:
‘cairoDevice’, ‘lpsymphony’, ‘gWidgetsRGtk2’, ‘IHW’, ‘rsbml’, ‘charmData’

lpsymphony fails because it checks R CMD config F77 instead of R CMD config FC which
is required for R-devel -- when the source is altered by hand to use "FC" in the configure
script, lpsymphony and IHW will install

rsbml failure may have to do with additional Linux infrastructure being needed -- but even after installing libsbml5/-dev, the following error occurs:

Error: package or namespace load failed for ‘rsbml’:
.onLoad failed in loadNamespace() for 'rsbml', details:
call: dyn.load(file, DLLpath = DLLpath, ...)
error: unable to load shared object '/usr/local/lib/R/library/00LOCK-rsbml/00new/rsbml/libs/rsbml.so':
/usr/local/lib/libsbml.so.5: undefined symbol: _ZN9Validator13addConstraintEP11VConstraint
Error: loading failed
Execution halted

additional infrastructure

Rmpi, rsbml -- the latter needs libsbml-dev and libsundials-dev

libsbml had to be built from source, so that pkg-config is available .. various binary distros lack it

lpsymphony package does not configure properly at top level, needs R CMD config FC instead of F77

sudo cpanm DBI for ensemblVEP, also sudo cpanm DBD::mysql, Archive::Zip

cwltool, ImageMagick (pRoloc)

ggobi

code to build tgz from binary folder in bucket

#!/bin/bash
gsutil -m cp -r $1 .
PACKNAME=$(basename $1)
VER=$(grep ^Version ${PACKNAME}/DESCRIPTION | sed -e "s/^Version: //")
TGZNAME=${PACKNAME}_${VER}.tgz
echo $TGZNAME
echo $VER
echo $PACKNAME
#tar zcvf $TGZNAME $PACKNAME
#gsutil -m cp $TGZNAME gs://biocbbs_2020a/zpacks/
#rm -rf $TGZNAME $PACKNAME

strategic thoughts

This package addresses a few issues that could be separated and confronted in different ways

  1. endowment of ubuntu host -- there are 1670 linux packages that need to be present for an incomplete bioc tarball set (1600/1731) ... the package list is available in inst/ubuntu

  2. endowment of R -- the R-devel image capable of building the 1600 tarballs has over 2200 packages and a 76GB footprint

ultimately we would like to be able to run tarball production in parallel on an arbitrary number of machines ... keeping an adequate R on a volume that can be mounted read-only by these builders

we are learning some of the parameters of the machine types needed to accomplish this

check as well as build?

Not sure whether one is also checking packages, but this is important -- some packages build without passing check, and check takes a considerable length of time, so it influences resource demands.
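A hedged sketch of driving check over built tarballs in parallel, using only base R plus the parallel package; the tarball location and core count are assumptions:

library(parallel)
tarballs = list.files(".", pattern = "\\.tar\\.gz$")  # assume tarballs were built in the current folder
options(mc.cores = 20)                                # assumed core count
chk = mclapply(tarballs, function(tb)
  try(system2("R", c("CMD", "check", "--no-manual", tb), stdout = TRUE, stderr = TRUE)))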

tmpdir problem when parallelized?

Error in install.packages(pkgs = doing, lib = lib, repos = repos, ...) :
unable to create temporary directory ‘/tmp/RtmpO1nWv2/downloaded_packages’

just bad luck?
