
ymp's Introduction

YMP - a Flexible Omics Pipeline


YMP is a tool that makes it easy to process large amounts of NGS read data. It comes "batteries included" with everything needed to preprocess your reads (QC, trimming, contaminant removal), assemble metagenomes, annotate assemblies, or assemble and quantify RNA-Seq transcripts, offering a choice of tools for each of those processing stages. When your needs exceed what the stock YMP processing stages provide, you can easily add your own, using YMP to drive novel tools, tools specific to your area of research, or tools you wrote yourself.

Note

Intrigued, but think YMP doesn't exactly fit your needs?

Missing processing stages for your favorite tool? Found a bug?

Open an issue, create a PR, or better yet, join the team!

The YMP documentation is available at readthedocs.

Features:

batteries included

YMP comes with a large number of Stages implementing common read processing steps. These stages cover the most common topics, including quality control, filtering and sorting of reads, assembly of metagenomes and transcripts, read mapping, community profiling, visualisation and pathway analysis.

For a complete list, check the documentation or the source.

get started quickly

Simply point YMP at a folder containing read files, at a mapping file, a list of URLs or even an SRA RunTable and YMP will configure itself. Use tab expansion to complete your desired series of stages to be applied to your data. YMP will then proceed to do your bidding, downloading raw read files and reference databases as needed, installing requisite software environments and scheduling the execution of tools either locally or on your cluster.

explore alternative workflows

Not sure which assembler works best for your data, or what the effect of more stringent quality trimming would be? YMP is made for this! By keeping the output of each stage in a folder named to match the stack of applied stages, YMP can manage many variant workflows in parallel, while minimizing the amount of duplicate computation and storage.
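As a hypothetical illustration (project and stage names invented for the example), the naming scheme amounts to dot-joining the project with the stack of applied stages:

```python
# Hypothetical sketch of YMP's stage-stack directory naming: each output
# directory encodes the project plus all stages applied so far, so
# alternative workflow variants live side by side, and shared prefixes
# (here "toy.trim_bbmap") only need to be computed and stored once.
def stack_dir(project: str, stages: list) -> str:
    """Build the directory name for a stack of applied stages."""
    return ".".join([project, *stages])

print(stack_dir("toy", ["trim_bbmap", "assemble_megahit"]))
# toy.trim_bbmap.assemble_megahit
print(stack_dir("toy", ["trim_bbmap", "assemble_metaspades"]))
# toy.trim_bbmap.assemble_metaspades
```

Both variant workflows reuse the output of the common `toy.trim_bbmap` prefix, which is how duplicate computation and storage are minimized.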

go beyond the beaten path

Built on top of Bioconda and Snakemake, YMP is easily extended with your own Snakefiles, allowing you to integrate any type of processing you desire into YMP, including your own custom-made tools. Within the YMP framework, you can also make use of the extensions to the Snakemake language provided by YMP (default values, inheritance, recursive wildcard expansion, etc.), making writing rules less error-prone and repetitive.

Background

Bioinformatic data processing workflows can easily get very complex, even convoluted. On the way from the raw read data to publishable results, a sizeable collection of tools needs to be applied, intermediate outputs verified, reference databases selected, and summary data produced. A host of data files must be managed, processed individually or aggregated by host or spatial transect along the way. And, of course, to arrive at a workflow that is just right for a particular study, many alternative workflow variants need to be evaluated. Which tools perform best? Which parameters are right? Does re-ordering steps make a difference? Should the data be assembled individually, grouped, or should a grand co-assembly be computed? Which reference database is most appropriate?

Answering these questions is a time-consuming process, justifying the plethora of published ready-made pipelines, each providing a polished workflow for a typical study type or use case. The price for the convenience of such a polished pipeline is a lack of flexibility: they are not meant to be adapted or extended to match the needs of a particular study. Workflow management systems, on the other hand, offer great flexibility by focusing on the orchestration of user-defined workflows, but typically require significant initial effort as they come without predefined workflows.

YMP strives to walk the middle ground between these. It brings everything needed for classic metagenome and RNA-Seq workflows, yet, being built on the workflow management system Snakemake, it can easily be expanded by simply adding Snakemake rules files. Designed around the needs of processing primarily multi-omic NGS read data, it provides a framework for handling read file metadata, provisioning reference databases, and organizing rules into semantic stages.

Working with the Github Development Version

Installing from GitHub

  1. Clone the repository:

    git clone --recurse-submodules https://github.com/epruesse/ymp.git

    Or, if you have GitHub SSH keys set up:

    git clone --recurse-submodules [email protected]:epruesse/ymp.git
  2. Create and activate conda environment:

    conda env create -n ymp --file environment.yaml
    source activate ymp
  3. Install YMP into conda environment:

    pip install -e .
  4. Verify that YMP works:

    source activate ymp
    ymp --help

Updating Development Version

Usually, all you need to do is a pull:

git pull
git submodule update --recursive --remote

If environments were updated, you may want to regenerate the local installations and clean out environments no longer used to save disk space:

source activate ymp
ymp env update
ymp env clean
# alternatively, you can just delete existing envs and let YMP
# reinstall as needed:
# rm -rf ~/.ymp/conda*
conda clean -a

If you see errors before jobs are executed, the core requirements may have changed. To update the YMP conda environment, enter the folder where you installed YMP and run the following:

source activate ymp
conda env update --file environment.yaml

If something changed in setup.py, a re-install may be necessary:

source activate ymp
pip install -U -e .

ymp's People

Contributors: codacy-badger, connorjacobs, dependabot-preview[bot], dependabot[bot], epruesse, thatzopoulos


ymp's Issues

Parallelize Trinity Stage 2

Trinity's stage 2 is "embarrassingly parallel" and can be distributed by providing a cluster script. The management is apparently done by an internal tool. It'd be nice to have snakemake manage those sub-jobs. Might need the dynamic keyword.

Add StageGroups and IOTypes

Stages are semantically grouped into classes:

  • quality control
  • assembly
  • mapping
  • quantification
  • ...

Groups share defined IO types:

  • reads (SE, PE, ...)
  • sequences (multi gene contigs, single gene ORFs)
  • count tables?

Implement "csv" string expansion

Array values, e.g. multiple input files, should be expandable into comma-separated strings, as often needed by tools accepting many input files.

rule bla:
  input:
    r1 = "{dir}/{:sources:}.{:pairnames[0]:}.fq.gz",
    r2 = "{dir}/{:sources:}.{:pairnames[1]:}.fq.gz"
  params:
    r1 = "{input.r1:,}",
    r2 = "{input.r2:,}"

This probably won't be possible to do robustly for expansion directly in shell:. It should be doable from within SnakemakeExpander for use in params etc., though. Since it would depend on already concretized input, it'd have to run as a function, so it is essentially limited to params.
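A plain-Python sketch of what the proposed `{input.r1:,}` expansion would amount to once input is concretized (function name and file names are hypothetical, for illustration only):

```python
# Sketch of the proposed "csv" expansion: after input wildcards are
# concretized into a list of files, a ',' format spec would join them
# into the comma-separated string many tools expect on their command line.
def expand_csv(values, sep=","):
    """Join an already-expanded list of filenames with a separator."""
    return sep.join(values)

r1 = ["sample1.R1.fq.gz", "sample2.R1.fq.gz"]
print(expand_csv(r1))  # sample1.R1.fq.gz,sample2.R1.fq.gz
```

Since Snakemake already allows callables in params, this is consistent with the note above that the feature is essentially limited to params.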

add ymp conda commands

Want

source ymp env activate <envname>

ymp env update <envname>

ymp env clean

ymp env enter <envname>

ymp env run <envname> <cmdline>

ymp env export

Track versions

Snakemake can track versions, we should use that. The version needs to be extracted from the conda environment though, and Snakemake AFAIK doesn't have any mechanism for that yet.

Implement implicit stages

Some stages could be implicit. E.g. the indexing stage needed for mapping or searches.

Prerequisites for implicit stages:

  • provides unique data type (otherwise selection would be ambiguous)
  • parameters must be subset of requesting stage(s)

Improve resource (mem) handling

mem is currently a param

  • make it a resource:
  • add mem_gb and mem_mb
  • verify that all rules have a reasonable value set
  • set available memory for ymp make
    • configuration override
    • auto detection
  • set default in init
  • check limit not exceeding mem?
  • differentiate between local and cluster? How does snakemake handle resources with cluster?

Better management of reference databases

So far, this is just capable of dealing with directories created from tarballs and files that can be gzipped. Reference databases come in more flavors though:

  • datatypes: fasta, fastp, fastq?, gtf/gff/bed, indices
  • structure: file, directory

Move data files to root

Place rules, etc. and envs in the root folder and graft via MANIFEST.in if possible. They are more easily accessible there, being a different kind of content than the Python module. Check that the docs still work.

ymp env activate bug

I can run ymp env activate --help and it runs appropriately, but when I try to run ymp env activate dada2_qiime1 I get an error which I have pasted below. This environment is present in the list of installed environments. I don't know if I'm missing some configuration or what. YMP was pulled and installed just before this was run.

Traceback (most recent call last):
File "/Users/mish0397/miniconda3/envs/ymp/bin/ymp", line 10, in <module>
sys.exit(main())
File "/Users/mish0397/miniconda3/envs/ymp/lib/python3.6/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/Users/mish0397/miniconda3/envs/ymp/lib/python3.6/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/Users/mish0397/miniconda3/envs/ymp/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/mish0397/miniconda3/envs/ymp/lib/python3.6/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/mish0397/miniconda3/envs/ymp/lib/python3.6/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/mish0397/miniconda3/envs/ymp/lib/python3.6/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/Users/mish0397/git_sw/ymp/ymp/cli/env.py", line 156, in activate
if envname not in ymp.env.by_name:
AttributeError: module 'ymp' has no attribute 'env'

Complete Tab Expansion

  • cache objects needed for speed
  • expand stage
  • expand ref_
  • expand group_
  • allow only valid stacks
  • [ ] complete files within stage

Have ymp init detect available memory

Should set `limits: max_mem` according to the current machine.

Perhaps also account for the difference between local and cluster rules?

Really need to use resources for RAM.
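One possible detection mechanism is POSIX sysconf; the sketch below (function name hypothetical, the `limits: max_mem` key is from the issue above) shows how `ymp init` could compute the machine's total RAM:

```python
import os

# Hedged sketch: detect total physical memory in MB via POSIX sysconf
# (works on Linux and macOS). A value like this could be written to
# the `limits: max_mem` configuration key when running `ymp init`.
def detect_max_mem_mb() -> int:
    pages = os.sysconf("SC_PHYS_PAGES")
    page_size = os.sysconf("SC_PAGE_SIZE")  # bytes per page
    return pages * page_size // (1024 * 1024)

print(detect_max_mem_mb())
```

For cluster rules, the detected local value would not apply, which is why the local/cluster distinction above matters.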

Allow "freezing" environments

There should be a way to "freeze" the current versions for environments.

  • run CircleCI jobs with most recent frozen version set
  • try to update version nightly and update most recent frozen set
  • allow user to
    • freeze current
    • use frozen
    • update to newest
    • tarball the whole set?

Auto-Generate Stage Tests

Stages are tested by verifying graph completion and by running the tool(s). Possible stage stacks could be specified within the stage definition.

Make YmpConfigError message more helpful

Current message:

    YmpConfigError in line 38 of /[removed]/src/ymp/rules/Snakefile:
    Configured column id_col=Sample_Name is not unique. Unique columns: ['Experiment', 'Library_Name', 'Run']
      File "/.../ymp/src/ymp/rules/Snakefile", line 38, in <module>
      File "/.../ymp/src/ymp/config.py", line 806, in allruns
      File "/.../ymp/src/ymp/config.py", line 853, in getRuns
      File "/.../ymp/src/ymp/config.py", line 854, in <listcomp>
      File "/.../ymp/src/ymp/config.py", line 324, in runs
      File "/.../ymp/src/ymp/config.py", line 314, in run_data
      File "/.../ymp/src/ymp/config.py", line 357, in choose_id_column
  • The traceback should disappear
  • The Snakefile reference should be replaced by a reference to the config.yml line, or removed in favor of naming the project.
  • The message needs to be more indicative of what the user is expected to do.

Perhaps:

ConfigurationError in line XY of ymp.yml: id_col=Sample_Name
  `id_col` must indicate a column unique among all rows (aka sequenced samples). 
  Possible choices given your data are `Experiment`, `Library_Name`, `Run`. 

@shafferm Thoughts?

ymp env update doesn't update pip packages

I needed to update my SCNIC package to fix a bug. I called ymp env update to update my package to the newest version available on pip, but it was not updated. I then tried going into the conda environment used by that rule and updating from in there. Calling pip install -U SCNIC also installed a newer version of numpy from pip which superseded the installed conda version. Removing this caused numpy to not be found, so I had to reinstall numpy via conda for it to be found again.

To avoid the problem again my workaround was to specify a version of SCNIC in the SCNIC.yml file and update as I uploaded new fixed versions to pypi which forced the environment to be rebuilt.
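The workaround described amounts to pinning the pip dependency in the environment file, so that changing the pin forces conda to rebuild the environment; a sketch (version number and surrounding entries purely illustrative):

```yaml
# SCNIC.yml -- pinning the pip package forces an environment rebuild
# whenever the pinned version changes (version shown is illustrative)
dependencies:
  - numpy
  - pip
  - pip:
      - SCNIC==0.6.0
```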

Unit-tests conda env too big

By now we've got 7.4 GB of gzipped conda packages to test YMP. That doesn't scale.

  • Update the conda packages nightly, don't do this on every commit.
  • Figure out where those 7.4G come from (what's the 80 in this 80/20?)
  • Try to split the tests by software needs.

Validate and document config YAML

The config has grown complex. It should be validated using a schema definition at startup to make sure nothing weird happens later on. Ideally, that schema would also be used for the sphinx documentation to make sure the two stay in sync.

Options for schema validation:

Make ymp environments findable by local rules

YMP currently requires rules to sit alongside their associated environment file (srcdir("environment.yml")). Environments that are packaged with YMP should be findable by local rules so that environments don't need to be duplicated for use with user generated rule files.

Also I would vote for environments being in their own directory to go along with this change. I feel like the mixture of .rules and .yml files in the rules directory is kind of confusing at first and really clutters the rules directory. By splitting and having a dedicated environments directory that could make fixing the local rules issue easier. Although it would then require something for local environments to work I'd guess.
