Hello,
I'm happy to propose a BoF on the following topic:
Reproducible environments for integrated computational workflows
Input/topic: A full biological analysis workflow often requires numerous software tools deployed at individual analysis steps, some of which may have conflicting version requirements or be written in different programming languages. In the spirit of an open discussion, we would like to gather experiences and suggestions on current solutions and best practices around the use of software environments (e.g., Conda, renv, Docker, Singularity) in combination with workflow managers (e.g., Snakemake, Nextflow, cgat-core), with a specific focus on workflows that integrate tasks involving multiple programming languages in addition to R (e.g., Python, Java, Shell).
For example, challenges and considerations when designing and using software environments for multi-language pipelines on institutional high-performance computing (HPC) clusters include:
- Reliable support for version control of packages in each language
- Tasks needing multiple programming languages to co-exist in the same environment
- Reuse/sharing of environments between projects vs. project-specific/redundant local copies of environments allowing independent management
- Size of environments (individual and combined total disk space)
- Number of distinct environments, maintenance, and responsibility
- Sharing of environments with collaborators and the wider scientific community
- Versioning of environments
Specifically, we are keen to discuss the pros and cons of individual software environment frameworks, in relation to the context in which they are intended to be used.
For instance, the motivation and design choices behind each software environment framework influence its capacity to support individual programming languages. In turn, individual workflow managers strive to support multiple software environment frameworks, giving users a range of options that may lead to a paradox of choice and to confusion about best practices in their respective computing environment(s).
Ideally, this could develop into a community-driven review of existing frameworks for both software management and workflow management, informed by individual experiences and combined expectations from a broad range of users.
In particular, this effort could complement the recent preprint "Streamlining Data-Intensive Biology With Workflow Systems", which focused on a broader set of best practices for the design of streamlined computational workflows.
Output: While the conversation will be kept in a very open format to enable participation by attendees from diverse backgrounds and academic levels, we would like to document and structure the output of this BoF as a collaborative manuscript reviewing existing frameworks and best practices for designing and managing reproducible software environments for use in computational workflows for scientific research.
Kevin (@kevinrue), Charlotte (@Charlie-George)