
viralclust's Introduction

ViralClust - Find representative viruses for your dataset

License: GPL v3 · Python 3.8 · NextFlow · conda


DISCLAIMER

This pipeline is work in progress. Some bugs are known to me, others are not. Before getting desperate, please check out the Issues that are already opened and discussed. If you can't find your problem there, don't hesitate to drop me an email or open an issue yourself. I am not responsible for any results produced with ViralClust, nor for the conclusions you draw from them.


Overview: What is this about?

Have you ever been in the situation that you wanted to compare your specific virus of interest with all other viruses from its genus? Or even family? For some taxonomic clades, there are many different genomes available, which can be used for comparative genomics.

However, more often than not, viral genome datasets are redundant and thus introduce bias into your downstream analyses. Think about building a consensus genome of Flavivirus from 2,000 Dengue virus genomes and 5 Zika virus genomes; you may start to see the problem here. To remove redundancy, clustering the input sequences is a good idea. However, given the scientific question at hand, it is hard to determine whether a particular clustering algorithm is appropriate.

Thus, ViralClust was developed: a Nextflow pipeline that applies several clustering methods and implementations to your dataset at once. Combining this with meta information from the NCBI allows you to explore the resulting representative genomes for each tool and decide on the clustering that fits your question.

For example: clustering all available Filoviridae with cd-hit-est usually leads to a large cluster containing all Zaire Ebola viruses, which can be valuable if you want to compare this species as a whole. If you are interested in subtle changes within the species, you may want to use another approach that divides the "Zaire cluster" into smaller sub-clusters representing different outbreaks and epidemics.


Installation

In order to run ViralClust, I recommend creating a conda environment dedicated to NextFlow. Of course, you can install NextFlow on your system however you like, but considering that potential users may not have sudo permissions, the conda way has proven to be simple.

  • First install conda on your system: You'll find the latest installer here.
  • Next, make sure that conda is part of your $PATH variable, which is usually the case. For any issues, please refer to the respective installation guide.

Warning: Currently, ViralClust runs on Linux systems. Windows and macOS support may follow at some point, but neither has been tested so far.

  • Create a conda environment and install NextFlow within this environment:

    conda create -n nextflow -c bioconda nextflow
    conda activate nextflow

    Alternative: You can also use the provided environment.yml after cloning the repo:

    conda env create -f environment.yml
  • Clone the GitHub repository for the latest version of ViralClust, or download the latest stable release here.

    git clone https://github.com/klamkiew/ViralClust.git && cd ViralClust
  • Done!


Quickstart

You may now ask yourself how to run ViralClust. Well, first of all, you do not have to worry about any dependencies, since NextFlow will take care of them via individual conda environments for each step. You just have to make sure you have a stable internet connection when you run ViralClust for the very first time. If not done yet, now is the time to activate your conda environment:

conda activate nextflow

And we're ready to go!

nextflow run viralclust.nf --fasta "data/test_input.fasta"

This might take a little bit of time, since the individual conda environments for each step of ViralClust are created first. In the meantime, let's talk about parameters and options.

Parameters & Options

Let us briefly go over the most important parameters and options. There is a more detailed overview of all possible flags, parameters and additional stuff you can do in the help message of the pipeline - and at the end of this file.

Input sequences: --fasta <PATH>

--fasta <PATH> is the main parameter you have to set. This will tell ViralClust where your genomes are located. <PATH> refers to a multiple fasta sequence file, which stores all your genomes of interest.

Specific genomes of interest: --goi <PATH>

--goi <PATH> is similar to the --fasta parameter, but the sequences stored in this specific fasta file are your genomes of interest, or GOI for short. Using this parameter tells ViralClust to include all genomes present in goi.fasta in the final set of representative sequences. You have a secret in-house lab strain that is not published yet? Put it in your goi.fasta.
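
For instance, a run that clusters a public genome set while forcing your own strains into the representative set could look like this (both file names are just placeholders):

nextflow run viralclust.nf --fasta "data/flaviviridae.fasta" --goi "data/my_lab_strains.fasta"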

Evaluate and rate cluster: --eval and --ncbi

--eval and --ncbi are two parameters that do more for you than just clustering. Since ViralClust runs several clustering algorithms, it can be hard to decide which one produced the most appropriate results. Worry not, since --eval is here to help you. In addition to the clustering results, you'll get a brief overview of the clusters that arose from the different algorithms. With --ncbi enabled, ViralClust further scans your genome identifiers (the lines in your fasta file starting with >) for GenBank accession IDs and uses them to retrieve further information from the NCBI about the taxonomy of the sequence, as well as its accession date and country. Note that using --ncbi implicitly also sets --eval.
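
Since --ncbi implies --eval, a typical evaluation run on the bundled test data only needs the former:

nextflow run viralclust.nf --fasta "data/test_input.fasta" --ncbi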

Update the NCBI metainformation database: --update_ncbi

--update_ncbi is used whenever you need to update the database of ViralClust. As soon as you run the pipeline with --ncbi enabled for the first time, this is done automatically for you. Each viral GenBank entry currently available from the NCBI is processed, and for each entry, ViralClust stores the accession ID, taxonomy, accession date and accession country for future use.
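
As shown in the usage examples of the help message, the update can be run on its own or combined with a regular clustering run:

nextflow run viralclust.nf --update_ncbi
nextflow run viralclust.nf --update_ncbi --fasta "genomes.fasta"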

Specify the output path: --output <PATH>

--output <PATH> specifies the output directory, where all results are stored. By default, this is a folder called viralclust_results, which will be created in the directory you are currently in.
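
For example, to collect all results in a custom folder (the folder name is arbitrary):

nextflow run viralclust.nf --fasta "data/test_input.fasta" --output "flavivirus_clustering"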

Determine the numbers of cores used: --cores and --max_cores

--max_cores and --cores determine how many CPU cores ViralClust may use in total and how many cores a single process may use, respectively. With the defaults shown in the help message (--max_cores 8, --cores 1), the pipeline uses at most 8 cores overall, and each individual step uses only 1 core.
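
On a larger machine, both values can simply be raised, for example:

nextflow run viralclust.nf --fasta "data/test_input.fasta" --cores 4 --max_cores 16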

There are many more parameters, especially those directly connected to the behaviour of Nextflow, which are not explained here. The main things are covered; for the rest, I refer to the clustering section and the complete help message of ViralClust.


Cluster Tools

Since ViralClust is nothing without the great work of awesome minds, it is only fair to give credit where credit is due. Currently, five different approaches are used to cluster input genomes. CD-HIT, Sumaclust and vsearch all implement the same algorithmic idea, but with minor, subtle differences in their respective heuristics. I further utilize the clustering module of MMSeqs2. And, last but not least, ViralClust implements a k-mer based clustering method, which is realized with the help of UMAP and HDBSCAN.

For all tools, the respective manual and/or GitHub page is linked. Firstly, because I think all of those are great tools, which you are implicitly using by using ViralClust. And secondly, because ViralClust offers the possibility to set all parameters of all tools; so if you need something very specific, you can check out the respective documentation.
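
As a rough sketch of how such tool-specific parameters can be passed through (the chosen thresholds and the quoting are assumptions; the --cdhit_help, --mmseqs_help and related flags listed in the help message show the authoritative options):

nextflow run viralclust.nf --fasta "data/test_input.fasta" --cdhit_params "-c 0.95" --mmseqs_params "--min-seq-id 0.95"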

And, in case you use any of the results provided by ViralClust in a scientific publication, I would be grateful to be cited. In my eyes, it is only fair that you not only cite ViralClust, but also the clustering method you ultimately decided on, even if ViralClust was assisting you in the decision.

Citations:
  • CD-HIT:

    • Weizhong Li & Adam Godzik, "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences". Bioinformatics, (2006) 22:1658-9
    • Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li, CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics, (2012), 28 (23): 3150-3152
  • sumaclust:

    • Mercier C, Boyer F, Bonin A, Coissac E (2013) SUMATRA and SUMACLUST: fast and exact comparison and clustering of sequences. Available: http://metabarcoding.org/sumatra.
  • vsearch:

    • Rognes T, Flouri T, Nichols B, Quince C, Mahé F. (2016) VSEARCH: a versatile open source tool for metagenomics. PeerJ 4:e2584
  • MMSeqs2:

    • Steinegger, M., Söding, J. "MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets". Nat Biotechnol 35, 1026–1028 (2017)
  • UMAP & HDBscan:

    • McInnes, L, Healy, J, "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction", ArXiv e-prints 1802.03426, 2018
    • L. McInnes, J. Healy, S. Astels, "hdbscan: Hierarchical density based clustering" In: Journal of Open Source Software, The Open Journal, volume 2, number 11. 2017

Graphical Workflow

Workflow graph

Help Message

This section simply shows the complete help message of ViralClust.

____________________________________________________________________________________________

Welcome to ViralClust - your pipeline to cluster viral genome sequences once and for all!
____________________________________________________________________________________________

Usage example:
nextflow run viralclust.nf --update_ncbi

or

nextflow run viralclust.nf --fasta "genomes.fasta"

or both

nextflow run viralclust.nf --update_ncbi --fasta "genomes.fasta"

____________________________________________________________________________________________

Mandatory Input:
--fasta PATH                      Path to a multiple fasta sequence file, storing all genomes that shall be clustered.
                                  Usually, this parameter has to be set, unless the parameter --ncbi_update has been set.

Optional Input:
--goi PATH                        Path to a (multiple) fasta sequence file with genomes that have to end
                                  up in the final set of representative genomes, e.g. strains of your lab that are
                                  of special interest. This parameter is optional.
____________________________________________________________________________________________

Options:
--eval                            After clustering, calculate basic statistics of clustering results. For each
                                  tool, the minimum, maximum, average and median cluster sizes are calculated,
                                  as well as the average distance of two representative genomes.

--ncbi                            Additionally to the evaluation performed by --eval, NCBI metainformation
                                  is included for all genomes of the input set. Therefore, the identifier of fasta records are
                                  scanned for GenBank accession IDs, which are then used to retrieve information about the taxonomy,
                                  accession date and accession country of a sequence. Implicitly calls --eval.
                                  Attention: If no database is available at data, setting this flag
                                  implicitly sets --ncbi_update.

--ncbi_update                     Downloads all current GenBank entries from the NCBI FTP server and processes the data to
                                  the databank stored at data.

Cluster options:
--cdhit_params                    Additional parameters for CD-HIT-EST cluster analysis. [default -c 0.9]
                                  You can use nextflow run viralclust.nf --cdhit_help
                                  For more information and options, we refer to the CD-HIT manual.

--hdbscan_params                  Additional parameters for HDBscan cluster analysis. [default -k 7]
                                  For more information and options, please use
                                  nextflow run viralclust.nf --hdbscan_help.

--sumaclust_params                Additional parameters for sumaclust cluster analysis. [default -t 0.9]
                                  You can use nextflow run viralclust.nf --sumaclust_help.
                                  For more information and options, we refer to the sumaclust manual.

--vclust_params                   Additional parameters for vsearch cluster analysis. [default --id 0.9]
                                  You can use nextflow run viralclust.nf --vclust_help
                                  For more information and options, we refer to the vsearch manual.

--mmseqs_params                   Additional parameters for MMSeqs2 cluster analysis. [default --min-seq-id 0.9]
                                  You can use nextflow run viralclust.nf --mmseqs_help
                                  For more information and options, we refer to the MMSeqs2 manual.

Computing options:
--cores INT                       max cores per process for local use [default 1]
--max_cores INT                   max cores used on the machine for local use [default 8]
--memory INT                      max memory in GB for local use [default 16.GB]
--output PATH                     name of the result folder [default viralclust_results]
--permanentCacheDir PATH          location for auto-download data like databases [default data]
--condaCacheDir PATH              location for storing the conda environments [default conda]
--workdir PATH                    working directory for all intermediate results [default /tmp/nextflow-work-$USER]

Nextflow options:
-with-report rep.html             cpu / ram usage (may cause errors)
-with-dag chart.html              generates a flowchart for the process tree
-with-timeline time.html          timeline (may cause errors)
____________________________________________________________________________________________

viralclust's People

Contributors

klamkiew, sandratriebel


viralclust's Issues

NCBI collection year format

At the moment, the database has many different formats and resolutions of the collection date.
Normalizing to a sortable YYYY/MM/DD format and maybe removing the day field should be an improvement in terms of overview.
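
A minimal sketch of the day-field removal, assuming the dates are already in YYYY/MM/DD form and live in a hypothetical one-column text file (not the actual database layout):

# drop the day field from YYYY/MM/DD dates and sort the remaining YYYY/MM values
sed -E 's#^([0-9]{4}/[0-9]{2})/[0-9]{2}$#\1#' collection_dates.txt | sort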

sort_sequences: different default behavior

If no ORF is found at all, the "sorted" sequence file contains empty fasta records.
This might be changed to report the "original" sequence instead of an empty sequence.

invalid compressed data in ncbi update process

It seems like wget sometimes fails to download some tar-balls correctly from the FTP server.
Re-downloading those might fix the issue; however, so far I am only able to detect invalid tar.gz files in the gunzip step.
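
A possible workaround, sketched here with a placeholder file name and URL (not the actual FTP paths the pipeline uses), is to test each archive right after the download and re-fetch it once if the test fails:

ARCHIVE="gbvrl1.seq.gz"                                  # placeholder file name
URL="https://ftp.ncbi.nlm.nih.gov/genbank/${ARCHIVE}"    # placeholder URL
# gzip -t checks the archive integrity without unpacking it
gzip -t "$ARCHIVE" || { rm -f "$ARCHIVE"; wget -q "$URL"; }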

HDBSCAN error for large dataset

Got this error message while running ViralClust with SARS-CoV-2 alpha genomes (152,307 non-redundant seqs).

Traceback (most recent call last):
  File "/home/nu76fet/programs/viralclust/bin/hdbscan_virus.py", line 663, in <module>
    perform_clustering()
  File "/home/nu76fet/programs/viralclust/bin/hdbscan_virus.py", line 604, in perform_clustering
    virusClusterer.determine_profile(multiPool)
  File "/home/nu76fet/programs/viralclust/bin/hdbscan_virus.py", line 267, in determine_profile
    allProfiles = p.map(self.profile, self.d_sequences.items())
  File "/home/nu76fet/programs/viralclust/conda/hdbscan-9fec0a1dfe235db7d7c78f1a0bba3ac9/lib/python3.7/multiprocessing/pool.py", line 268, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/home/nu76fet/programs/viralclust/conda/hdbscan-9fec0a1dfe235db7d7c78f1a0bba3ac9/lib/python3.7/multiprocessing/pool.py", line 657, in get
    raise self._value
  File "/home/nu76fet/programs/viralclust/conda/hdbscan-9fec0a1dfe235db7d7c78f1a0bba3ac9/lib/python3.7/multiprocessing/pool.py", line 431, in _handle_tasks
    put(task)
  File "/home/nu76fet/programs/viralclust/conda/hdbscan-9fec0a1dfe235db7d7c78f1a0bba3ac9/lib/python3.7/multiprocessing/connection.py", line 206, in send
    self._send_bytes(_ForkingPickler.dumps(obj))
  File "/home/nu76fet/programs/viralclust/conda/hdbscan-9fec0a1dfe235db7d7c78f1a0bba3ac9/lib/python3.7/multiprocessing/connection.py", line 393, in _send_bytes
    header = struct.pack("!i", n)
struct.error: 'i' format requires -2147483648 <= number <= 2147483647

NCBI annotation via Regex

sigh
Apparently, the taxonomic lineage isn't reported consistently across viruses. Sometimes the species is the last entry, sometimes it is second to last, and everything else gets shifted as well.

A regex that looks for the family/genus suffixes (-viridae and -virus) should help to avoid mis-annotations in my results.
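
A quick sketch of that idea on the command line (the lineage string is a made-up example; inside the pipeline this would happen in the annotation step):

lineage="Viruses; Riboviria; Flaviviridae; Flavivirus; Dengue virus"
# match family (-viridae) and genus (-virus) names regardless of their position in the lineage
echo "$lineage" | grep -oE '[A-Za-z]+viridae|[A-Za-z]+virus'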

parameter/profiles for global and local cluster structures

Extend the NF pipeline with two parameter config profiles (--local and --global?); these should set the parameters of all clustering tools such that the corresponding structures are found, i.e. the resolution of the cluster structures is adjusted here.

RAM vs Diskspace trade off

Using k-mer frequencies is really demanding on memory in the HDBSCAN module.
I am thinking of re-implementing this via a (maybe pickled) data dump on the hard drive that is read into memory on demand, line by line. More I/O, less RAM...

Inconsistent use of the label revComp/revcomp

The process reverseComp is labeled with revcomp, but the config files (conda and local) use the label revComp (notice the upper case C). This leads to the process not using the python3 conda environment as intended.

NCBI Accession IDs enhancement

If several NCBI IDs are found in the header, store all of them and check later whether one of them exists in the ncbi_metainfo.pkl file.

better restrictions for yml files

Currently, the conda environments in viralclust are throwing errors due to a numpy/scipy update. This should be resolved with better version management in the yml files.
