
Introduction

sanger-tol/treeval [1.1.0 - Ancient Aurora] is a bioinformatics best-practice analysis pipeline for the generation of data supplemental to the curation of reference quality genomes. This pipeline has been written to generate flat files compatible with JBrowse2 as well as HiC maps for use in Juicebox, PretextView and HiGlass.

The pipeline is built using Nextflow, a workflow tool to run tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers making installation trivial and results highly reproducible. The Nextflow DSL2 implementation of this pipeline uses one container per process which makes it much easier to maintain and update software dependencies. Where possible, these processes have been submitted to and installed from nf-core/modules in order to make them available to all nf-core pipelines, and to everyone within the Nextflow community!

You can also set up and attempt to run the pipeline here: https://gitpod.io/#https://github.com/BGAcademy23/treeval-curation This is a Gitpod environment set up for BGA23 with a version of TreeVal, although for now Gitpod will not run a Nextflow pipeline due to issues with using Singularity. We will be replacing this with an AWS instance soon.

The treeval pipeline has a sister pipeline, currently named curationpretext, which regenerates the pretext maps and accessory files during genomic curation in order to confirm interventions. This pipeline is sufficiently different from the treeval implementation that it is written as its own pipeline.

  1. Parse input yaml ( YAML_INPUT )
  2. Generate my.genome file ( GENERATE_GENOME )
  3. Generate insilico digests of the input assembly ( INSILICO_DIGEST )
  4. Generate gene alignments with high quality data against the input assembly ( GENE_ALIGNMENT )
  5. Generate a repeat density graph ( REPEAT_DENSITY )
  6. Generate a gap track ( GAP_FINDER )
  7. Generate a map of self complementary sequence ( SELFCOMP )
  8. Generate syntenic alignments with a closely related high quality assembly ( SYNTENY )
  9. Generate a coverage track using PacBio data ( LONGREAD_COVERAGE )
  10. Generate HiC maps, pretext and higlass using HiC cram files ( HIC_MAPPING )
  11. Generate a telomere track based on input motif ( TELO_FINDER )
  12. Run Busco and convert results into bed format ( BUSCO_ANNOTATION )
  13. Ancestral Busco linkage if available for clade ( BUSCO_ANNOTATION:ANCESTRAL_GENE )
  14. Count KMERs with FastK and plot the spectra using MerquryFK ( KMER )
  15. Generate a coverage track using KMER data ( KMER_READ_COVERAGE )

Usage

Note If you are new to Nextflow and nf-core, please refer to this page on how to set up Nextflow. Make sure to test your setup with -profile test before running the workflow on actual data.

Currently, it is advised to run the pipeline with docker or singularity as a small number of major modules do not currently have a conda env associated with them.

Now, you can run the pipeline using:

```shell
# For the FULL pipeline
nextflow run main.nf -profile singularity --input treeval.yaml --outdir {OUTDIR}

# For the RAPID subset
nextflow run main.nf -profile singularity --input treeval.yaml -entry RAPID --outdir {OUTDIR}
```

An example treeval.yaml can be found here.

Further documentation about the pipeline can be found in the following files: usage, parameters and output.

Warning: Please provide pipeline parameters via the CLI or Nextflow -params-file option. Custom config files including those provided by the -c Nextflow option can be used to provide any configuration except for parameters; see docs.

Credits

sanger-tol/treeval has been written by Damon-Lee Pointon (@DLBPointon), Yumi Sims (@yumisims) and William Eagles (@weaglesBio).

We thank the following people for their extensive assistance in the development of this pipeline:

  • @gq1 - For building the infrastructure around TreeVal and helping with code review
  • @ksenia-krasheninnikova - For help with C code implementation and YAML parsing
  • @mcshane - For guidance on algorithms
  • @muffato - For code reviews and code support
  • @priyanka-surana - For help with the majority of code reviews and code support

Contributions and Support

If you would like to contribute to this pipeline, please see the contributing guidelines.

Citations

If you use sanger-tol/treeval for your analysis, please cite it using the following doi: 10.5281/zenodo.10047653.

Tools

An extensive list of references for the tools used by the pipeline can be found in the CITATIONS.md file.

You can cite the nf-core publication as follows:

The nf-core framework for community-curated bioinformatics pipelines.

Philip Ewels, Alexander Peltzer, Sven Fillinger, Harshil Patel, Johannes Alneberg, Andreas Wilm, Maxime Ulysse Garcia, Paolo Di Tommaso & Sven Nahnsen.

Nat Biotechnol. 2020 Feb 13. doi: 10.1038/s41587-020-0439-x.


treeval's Issues

Subworkflow MYGENOME

Add a workflow for MYGENOME generation.

As this file is required by multiple subworkflows, it should be packaged into its own.

Containing SAMTOOLS FAIDX and the BASH found in #3 .

Use nf-core bamToBed

Description of feature

There is already a combined bamToBed + bedtools sort module in nf-core. Please use that instead of bedtools_bed_sort. You also have a redundant module under sanger-tol. Only keep one copy under nf-core.

From Matthieu:

motivated by reducing the amount of code you / we need to maintain

I would also recommend to move away from bedtools sort if you can. It's less efficient than a regular sort, as the bedtools authors say themselves (see the disclaimer at the bottom of https://bedtools.readthedocs.io/en/latest/content/tools/sort.html) and it's indeed caused us some problems in the read-mapping pipeline.

More details on the issue with bedtools sort: sanger-tol/genomenote#51

If you do separate the sort, please make sure it is a different module. You can borrow this gnu_sort module.
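As a concrete illustration of the recommendation above, a plain GNU sort produces the chromosome-then-start ordering that downstream tools such as bedToBigBed expect. The filenames and data here are placeholders:

```shell
# Create a tiny unsorted BED file to demonstrate (placeholder data)
printf 'chr2\t100\t200\nchr1\t500\t600\nchr1\t50\t150\n' > in.bed

# Sort by chromosome (lexicographic) then start position (numeric) -
# the ordering the bedtools documentation itself recommends over `bedtools sort`
sort -k1,1 -k2,2n in.bed > sorted.bed

cat sorted.bed
```

The `-k1,1 -k2,2n` keys restrict each comparison to a single column, which is what makes this both faster and more memory-friendly than `bedtools sort` on large files.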

CSV_PULL creating unsorted list of input data

Testing has revealed that since the directory change, the CSV pull module has begun mixing its input lists: the first for organism names and the second for file locations.

In the function there seems to be a crossover where organism 1's and organism 2's path locations are swapped, causing a conflict with the expected file output, e.g., path/organism 1.

We are unsure why or how this is occurring since it is being passed the correct values, but we are investigating and will fix ASAP before #52 is complete.

Modules restructuring

This is a summary issue, please create PRs for the individual issues referenced here.

  1. #37 The structure of the modules is incorrect. Take a look at genomenote for the structure expected. You may need to delete the entire contents below modules/nf-core and repopulate using nf-core modules install.
  2. #38 The miniprot modules need to first be submitted to nf-core using released bioconda package and containers. Then install under nf-core using nf-core modules install.
  3. #39 There is already a combined minimap2 + samtools module in nf-core. Please use that instead of minimap_samtools
  4. #40 samtools in merge_bam module will not work without a container in production env. There is already a module for samtools merge, please use that instead.
  5. #41 There is already a combined bamToBed + bedtools sort module in nf-core. Please use that instead of bedtools_bed_sort. You also have a redundant module under sanger-tol. Only keep one copy under nf-core.
  6. #42 The blast/tblastn needs to be removed from sanger-tol and installed under nf-core. If not using, remove entirely.
  7. #43 For the modules under makecmap, please move the scripts with the credits and licence information intact to the bin folder. Then, use a perl conda package and containers to create local modules using nf-core modules create from within the pipeline. If you make any changes to the original scripts, please address these in the script header, so credit is given where due. Do the same for selfcomp/splitfasta.
  8. #44 For the modules selfcomp/mapids and selfcomp/mummer2bed do the same as N°7 but with a python package and container.
  9. #45 For these local modules, please add a bioconda package and container. You can use a basic one: https://github.com/nf-core/modules/blob/master/modules/nf-core/md5sum/main.nf
  10. #47 For local module filter_blast, add the script in your bin folder (which seems to be missing) and use the appropriate conda package and containers as needed. Please do not package the script in a container.
  11. #46 For the different cat local modules, it might be better to use a generic one and configure it for different purposes as needed – see the nf-core cat module. Example: concatmummer, concat_gff and cat_blast
  12. It is best to create local modules using nf-core modules create from within the pipeline directory. The idea is to keep the formatting and structure of the local modules as close as possible to the nf-core ones.
  13. Before creating any bespoke containers, please have a chat. There are multi-package containers and other options available, which would save you time and make your pipeline more reproducible.

@muffato is happy to help with tasks above so please contact him if you need help.
@priyanka-surana is happy to help manage this release, we can have regular catch ups to keep track of the work.

miniprot module to nf-core

Description of feature

The miniprot modules need to first be submitted to nf-core using released bioconda package and containers. Then install under nf-core using nf-core modules install.

From Matthieu:

motivated by reducing the amount of code you / we need to maintain

MAKE - CSV-GENERATOR

A process which copies the csv file into the nextflow directory and then allows for the data to be parsed in the main.nf.

This was decided upon with help from the Seqera team, as there is no direct way of building path objects from strings.

MAKE - BB-GENERATOR

This is a module that makes use of the bedToBigBed software to generate a BigBed file for jBrowse display.

[Documentation] output.md - GENERATE_GENOME

Description of feature

Branch from the documentation branch for adding documentation.
Include:

  • Outputted files
  • Workflow description
  • Flow diagram
  • Outline nf-core modules used
  • Outline what local modules used do.

Subworkflow SYNTENY

The addition of SYNTENY sub-workflow which uses YAML params:

  • Path of reference genomes for synteny
  • Class of organism
  • Organism sample name
  • Organism assembly fasta
  • Output directory

This sub-flow uses nf-core module:

  • MINIMAP2_ALIGN

Subworkflow INPUT_CHECK

Description of feature

The input functions must be changed to instead take the gEVAL-yaml or a new treeval-yaml.

Create local modules for `selfcomp`

Description of feature

For the modules selfcomp/mapids and selfcomp/mummer2bed please move the scripts with the credits and licence information intact to the bin folder. Then, use a python conda package and containers to create local modules using nf-core modules create from within the pipeline. If you make any changes to the original scripts, please address these in the script header, so credit is given where due.

From Matthieu:

motivated by making the pipeline inspectable / modifiable by non-Sanger people (currently not possible)

Make the CI tests pass on GitHub

Description of feature

(as per #51 (comment) )

Currently, the integration tests only pass on the farm, because some config paths refer to /nfs/team135 (at least the one I've found, perhaps some are on /lustre too). It would be useful to update the S3 test profile to include all data on the S3 server. This way, reviewers could rely on GitHub to test pull-requests and wouldn't need to run the pipeline themselves.

This would also tell us / confirm what the pipeline needs from the Sanger infrastructure, and allow to plan the next steps for the pipeline being usable by external collaborators, for when we feel ready to support that.

Outdir setting needs to be updated

Description of the bug

The outdir has been set to save all output whilst in development. We have realised that, with the selfcomp module generating tens of thousands of files, it is prudent to fix the outdir so that only required files and some pipeline information files are saved. We will also be adding automated clean-up, although at a later date.

Command used and terminal output

No response

Relevant files

No response

System information

No response

COMMIT - GENERATE_GENOME to main

As the GENERATE_GENOME subworkflow (SW) is required by multiple other SWs, this SW needs to be merged into main so that colleagues can use it.

MAKE - FILTER-BLAST

This will be a Python 3 script that takes a concatenated blast output file and parses it into the format required by bedToBigBed.

This script will need to be dockerised and will possibly be amalgamated with multiple other scripts before the final release.
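The filtering logic the script above will implement can be roughed out in one line, assuming BLAST tabular output (outfmt 6, where column 3 is percent identity) and the 90% threshold mentioned in the GENE_ALIGNMENT outline. This is an illustrative sketch, not the actual filter_blast script:

```shell
# Mock two lines of BLAST tabular (outfmt 6) output; column 3 is percent identity
printf 'q1\ts1\t95.2\t100\t5\t0\t1\t100\t1\t100\t1e-50\t180\nq2\ts1\t72.0\t100\t28\t0\t1\t100\t1\t100\t1e-10\t90\n' > concat.blast.tsv

# Keep only hits at or above 90% identity; only the q1 hit survives
awk -F'\t' '$3 >= 90' concat.blast.tsv > filtered.tsv

cat filtered.tsv
```

The real script will additionally reshape the surviving hits into the column layout bedToBigBed requires.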

Restructure modules folder

Description of the bug

The structure of the modules is incorrect. Take a look at genomenote for the structure expected. You may need to delete the entire contents below modules/nf-core and repopulate using nf-core modules install.

From Matthieu:

motivated by allowing to exchange modules and code with others, incl. from ToLA (currently not possible – nf-core pushed that breaking change, not us)
motivated by reducing the barrier to entry for other people (incl. from ToLA ) to contribute and debug, by following the same structure

[Documentation] output.md - INPUT_READ

Branch from the documentation branch for adding documentation.
Include:

  • Outputted files
  • Workflow description
  • Flow diagram
  • Outline nf-core modules used
  • Outline what local modules used do.

MODIFY - filter_blast

Currently, this module will not run unless using the local script and python installations.

@yumisims is currently dockerising and submitting as a module for pipeline integration.

Create local modules for `makecmap` and `splitfasta`

Description of feature

For the modules under makecmap, please move the scripts with the credits and licence information intact to the bin folder. Then, use a perl conda package and containers to create local modules using nf-core modules create from within the pipeline. If you make any changes to the original scripts, please address these in the script header, so credit is given where due. Do the same for selfcomp/splitfasta.

From Matthieu:

motivated by making the pipeline inspectable / modifiable by non-Sanger people (currently not possible)

Update the pipeline template

Description of feature

Advice from NF-core is to update the version of NF-core templates used by the pipeline as often as we can. This should be a priority fix.

UPDATE: Variable naming

Some variable names are less than ideal for their purpose (do not describe their usage). These should be changed.

Include - BLASTN

Include the BLASTN NF-CORE module, to be used in blasting set query data against the input genome.

Add containers and conda package to local modules

Description of the bug

For these local modules, please add a bioconda package and container. You can use a basic one: https://github.com/nf-core/modules/blob/master/modules/nf-core/md5sum/main.nf

From Matthieu:

motivated by making the pipeline usable outside of our production environment (currently at risk)

ADD - TBLASTN

@yumisims has created a TBLASTN module currently available in sanger-tol/nf-core-modules here.

This should be added to the pipeline to allow blasting of pep data.

Correct gene alignment data csv files

The input csv files need to be updated for the new directory structure. The commands to generate these files also need to be corrected to:

```shell
for file in /lustre/scratch123/tol/resources/treeval/gene_alignment_data/{clade}/{item}/{item}.{accession}/*/*.fa ; do
    var=$(echo $file | cut -f10 -d/)
    var2=$(echo $file | cut -f11 -d/)
    echo $var,$var2,$file >> /lustre/scratch123/tol/resources/treeval/gene_alignment_data/{clade}/csv_data/{item}.{accession}-data.csv
done
```

This also needs updating to work on the whole library of gene_alignment data rather than one at a time.
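A whole-library version of the command above could walk every clade under the data root rather than one at a time. This is a hypothetical sketch, untested against the real library; the function name and argument handling are illustrative, and the directory layout `<root>/<clade>/<organism>/<organism>.<accession>/<type>/*.fa` is inferred from the one-clade command:

```shell
# Sketch: emit one csv row (<organism>.<accession>,<type>,<path>) per .fa file,
# for every clade under the given root, into <root>/<clade>/csv_data/
generate_csvs() {
    local root=$1
    local clade_dir clade file org dtype
    for clade_dir in "$root"/*/ ; do
        [ -d "$clade_dir" ] || continue
        clade=$(basename "$clade_dir")
        mkdir -p "$root/$clade/csv_data"
        for file in "$root/$clade"/*/*/*/*.fa ; do
            [ -e "$file" ] || continue
            org=$(basename "$(dirname "$(dirname "$file")")")   # <organism>.<accession>
            dtype=$(basename "$(dirname "$file")")              # data type, e.g. cds
            echo "$org,$dtype,$file" >> "$root/$clade/csv_data/$org-data.csv"
        done
    done
}

# e.g. generate_csvs /lustre/scratch123/tol/resources/treeval/gene_alignment_data
```

Deriving the two fields from the file path (rather than fixed `cut -f10,11` positions) also makes the command independent of how deep the data root sits in the filesystem.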

Generate parameters.md

Description of feature

  1. nextflow_schema.json needs to be updated to reflect current parameters.
  2. Run nf-core schema docs -x markdown to generate a prettified version of this schema. Save as parameters.md in the docs folder.

Based on documentation guidance here

Sub workflow GENE_ALIGNMENT

Workflow for gene alignment, this requires:

  • Confirmation of data directory (treeval_data/insect/iy_{latin_name}_{data_type}_{chunk}.fa, e.g. iy_tiphia_femorata_cds_500.fa)
  • makeblastdb
  • blastx
  • blastn
  • concat on data organism
  • filter at 90%
  • input data.genome file generation
  • generation of assembly.as file
  • filtered bed + assembly.as to generate BigBed file

[Documentation] output.md - SYNTENY

Description of feature

Branch from the documentation branch for adding documentation.
Include:

  • Outputted files
  • Workflow description
  • Flow diagram
  • Outline nf-core modules used

File save naming scheme

We have been using a file naming scheme that has resulted in multiple files overwriting each other upon completion of the pipeline.
This will be corrected in the next commit.

MAKE - Multiple simple bash processes

GENERATE_GENOME will use cut and sort to generate the final my.genome file.

PULL_DOTAS will use cp to pull a .as file from assets.

CAT_BLAST uses cat to concatenate multiple BLAST outputs.
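The GENERATE_GENOME cut-and-sort step above can be sketched as follows. This assumes the input is a conventional samtools faidx index (whose first two columns are sequence name and length); the filenames, mock data, and the descending-length ordering are illustrative, not a statement of the pipeline's exact behaviour:

```shell
# Mock a samtools-faidx-style index: name, length, offset, linebases, linewidth
printf 'scaffold_1\t120000\t11\t60\t61\nscaffold_2\t95000\t122113\t60\t61\n' > assembly.fa.fai

# cut the first two columns (name, length) and sort by length, largest first,
# to produce the my.genome file used by downstream track generation
cut -f1,2 assembly.fa.fai | sort -k2,2 -nr > my.genome

cat my.genome
```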

Include - Makeblastdb

Add makeblastdb to the gene alignment subworkflow.
This will take the input fasta and generate a database, allowing the alignment data to be blasted against it.

[Documentation] output.md - BUSCO_ANALYSIS

Description of feature

Branch from the documentation branch for adding documentation.
Include:

  • Outputted files
  • Workflow description
  • Flow diagram
  • Outline nf-core modules used

Haplotypic Block Analysis

Description of feature

The original SelfComp (#5) was not fit for purpose and was not close enough to replicating gEVAL, which used the Ensembl database to generate the SelfComp blocks. This API cannot be easily decoded due to its age and complexity, so @yumisims is reverse engineering a standalone solution to replace the SelfComp sub-workflow.

Refactor how data is input

Using the --params-file option for our input yaml does not conform to standards. I will produce a new subworkflow so we can use the --input flag, with all params parsed from the input.

This will require refactoring of subworkflows to take the new format of inputs as they are now channels not values.

[Documentation] output.md - SELFCOMP

Description of feature

Branch from the documentation branch for adding documentation.
Include:

  • Outputted files
  • Workflow description
  • Flow diagram
  • Outline nf-core modules used

Module `filter_blast` change container

Description of feature

For local module filter_blast, add the script in your bin folder (which seems to be missing) and use the appropriate conda package and containers as needed. Please do not package the script in a container.

From Matthieu:

motivated by making the pipeline inspectable / modifiable by non-Sanger people (currently not possible)

Clean up repository

Description of feature

@DLBPointon mentioned there might be some outdated modules, can you please remove these? Remove any modules or subworkflows no longer necessary. Remove commented out pieces of code, if not part of testing.

When you create a PR for this, please reference this issue, and set dev as the base branch.

Use nf-core blastn

Description of feature

The blast/tblastn needs to be removed from sanger-tol and installed under nf-core. If not using, remove entirely.

From Matthieu:

motivated by reducing the amount of code you / we need to maintain

Samtools merge container missing

Description of the bug

Samtools in merge_bam module will not work without a container in production env. There is already a module for samtools merge, please use that instead.

From Matthieu:

motivated by making the pipeline usable in our production environment (currently not possible)

[Documentation] output.md - INSILICO_DIGEST

Description of feature

Branch from the documentation branch for adding documentation.
Include:

  • Outputted files
  • Workflow description
  • Flow diagram
  • Outline nf-core modules used
