Git Product home page Git Product logo

tada's Introduction

TADA - Taxonomic-Aware Dataset Aggregator

A Snakemake workflow to assemble balanced, representative and manageable datasets for comparative and phylogenetic analysis of bacteria and archaea. Datasets can be generated based either on the phylogenomic tree offered by GTDB , or on the taxonomy offered by GTDB or by NCBI.

Dependency

Running the TADA-workflow requires Conda.

Installing

Clone the repository from git and change into the TADA directory.

git clone https://github.com/emilhaegglund/TADA.git
cd TADA

Install and activate the conda environment from which the workflow will be run. This will install Mamba and Snakemake.

conda env create -f environment.yaml
conda activate tada

Setting up the configuration file

Before running the workflow, the first step is to set up the configuration file. This file will determine the behavior of the workflow. An example of the configuration file can be found in config/config.yaml. You can either modify this file or create a new. The location of the config-file must be specified using the --configfile option when running Snakemake.

The first option is to set the path to the output-directory:

workdir: "results"

Choice of sampling method

The workflow can be run using three different methods:

  • Sampling based on the NCBI taxonomy (sample_ncbi).
  • Sampling based on the GTDB taxonomy (sample_gtdb).
  • Sampling based on the GTDB phylogeny (prune_gtdb).

E.g.:

method: "sample_gtdb"

A random seed can be used to reproduce the output of sampling and pruning from the GTDB-database.

seed: 42

When using the sample_gtdb or sample_ncbi option a file containing a list of genome accessions to be include in the dataset can be given with the required option.

required: "../config/required-genomes.txt"

An NCBI API Key can be used in the workflow with the following option

ncbi_api_key: "NCBI-API-KEY"

Select what to download

TADA can download genomes, CDS (genes), proteomes, and/or GFF3 annotations for the sampled genomes. If all options below are set to False, the workflow will stop after the sampling procedure. TADA will annotate genomes for which no annotation is available using Prokka.

downloads:
  genomes: False
  cds: False
  proteomes: True
  gff3: False

Select what databases to create

TADA can also build different type of Blast-compatible databases, either using the NCBI Blast suite or Diamond (only for proteins).

databases:
    blast_genome: False
    blast_cds: False
    blast_protein: False
    diamond_protein: True

Options for sampling

Next follows options specific to the different sampling methods listed above.

Options for sampling from the NCBI Taxonomy

To sample from the NCBI Taxonomy we have to give the path to a sampling scheme and we also need to define if we want to sample from GenBank or RefSeq. Sampling from NCBI is restricted to taxa classified as Bacteria or Archaea. The reason for this is that the annotation software in the workflow are for prokaryotic genomes.

sample_ncbi:
    sampling_scheme: <path>
    database: <string>

sampling_scheme: Path to the sampling scheme that will be used. See Defining a sampling scheme for more details on this.

database: Sample from "GenBank" or "RefSeq".

Example

In the example below, TADA will sample one taxa from each defined phylum in the RefSeq-database.

sample_ncbi:
    sampling_scheme: "../config/sampling_scheme.ncbi_refseq.yaml"
    database: "RefSeq"

Options for sampling the GTDB Taxonomy

sample_gtdb:
    sampling_scheme: <path>
    completeness: <float>
    contamination: <float>
    gtdb_species_representative: <bool>
    version: <str>

sampling_scheme: Path to the sampling scheme that will be used. See Defining a sampling scheme for more details on this.

completeness: Exclude taxa with a completeness estimate less than this value (Default: 0).

contamination: Exclude taxa with a contamination estimate larger than this value (Default: 100).

gtdb_species_representative: False will keep all entries while True will only keep entries that are classified as GTDB species representatives.

version: Select which version of GTDB to use. E.g. 207 and 214 are supported (Default: 214).

Example

In the example below we will use the sampling scheme defined in config/sampling_scheme.basic.yaml. For this example the workflow will sample three taxa for each phylum, but only sample from representative species with an estimated completeness over 90% and an estimated contamination under 5%.

sample_gtdb:
  sampling_scheme "../config/sampling_scheme.basic.yaml"
  completeness: 90
  contamination: 5
  gtdb_species_representative: True

Options for pruning the GTDB phylogenies

GTDB includes separate phylogenies for bacteria and archaea. TADA will prune these phylogenies based on the evolutionary distance between taxa, reducing the number of taxa to the amount specified in the configuration file. Before the distance-based pruning, it is also possible to use the completeness and contamination criteria to remove taxa that do not meet these requirements from the phylogeny.

prune_gtdb:
    bac120: <int>
    ar53: <int>
    completeness: <float>
    contamination: <float>
    prune_method: <str>
    taxon: <str>
    version: <str>

bac120: Number of taxa to sample from the bacterial phylogeny.

ar53: Number of taxa to sample from the archaeal phylogeny.

completeness: Exclude taxa with a completeness estimate less than this value (Default: 0).

contamination: Exclude taxa with a contamination estimate larger than this value (Default: 100).

prune_method: Select what method to use for pruning, "shortest" will keep the taxon with the shortest branch in a leaf-pair, "longest" will keep the taxon with the longest branch in a leaf-pair, and "random" will randomly select one of the taxa to keep in a leaf-pair (Default: "shortest").

taxon: Prune only phylogeny under this taxon, other parts of the phylogeny will be discarded. The taxon must be present in the phylogeny.

version: Select which version of GTDB to use, 207 and 214 are supported (Default: 214).

Example 1

In the example below TADA will first remove all taxa with an estimated completeness under 90% and an estimated contamination over 5%. It will then continue to prune the bacterial phylogeny untill 1000 taxa remains. For the archaeal phylogeny it will prune the phylogeny until 200 taxa remains.

prune_gtdb:
  bac120: 1000
  ar53: 200
  completeness: 90
  contamination: 5
  prune_method: "shortest"

Example 2 In the example below TADA will first remove all genomes that are not of high-quality, next it will prune only the Alphaprotobacteria-clade until 100 taxa remains.

prune_gtdb:
  bac120: 100
  completeness: 90
  contamination: 5
  prune_method: "shortest"
  taxon: "Alphaprotobacteria"

Defining a sampling scheme

The taxonomic sampling (sample_gtdb or sample_ncbi) is based on a sampling scheme that is defined in a YAML-file with a structure described below. Examples of sampling schemes can be found in the config-directory.

taxonomic_name:
    sampling_level: [domain, phylum, class, order, family, genus, species]
    taxa: <int> or "all"

taxonomic_name: The name of the taxa to sample from. The name has to be defined in the GTDB or NCBI taxonomy, depending on which database to sample from. The key-word all can be used to sample from all Bacteria and Archaea.

sampling_level: The taxonomic level to perform the sampling at.

taxa: The number of taxa we want to sample from each group at that taxonomic level. The key-word all can also be used, this will keep all taxa in the groups.

Thus, if taxonomic_name is set to Bacteria, sampling_level to class, and taxa to 3, TADA till sample 3 taxa in each class of the Bacteria.

Example: Basic sampling scheme

The sampling scheme below will sample 10 taxa from each phylum in both Bacteria and Archaea.

all:
    sampling_level: phylum
    taxa: 10

Example: Complex sampling scheme

It is also possible to construct more complex sampling scheme by including multiple sampling criterias. The workflow will then perform a hierarchical sampling starting the sampling procedure at the lowest taxonomic level and then continue to sample at higher and higher levels. Taxonomic groups which have already been used to sample from is excluded from sampling at higher levels.

The example below demonstrates a complex sampling scheme. We will begin by sampling 10 taxa from the species Bartonella bacilliformis. For the remaining species within the Bartonella genus we will sample two taxa per species. Adding to this, we will sample five taxa from the Rhizobiales_A order outside of the Bartonella genus.

Bartonella bacilliformis:
  sampling_level: species
  taxa: 10

Bartonella:
  sampling_level: species
  taxa: 2

Rhizobiales_A:
  sampling_level: order
  taxa: 5

Running the workflow

To run the workflow, use the following command. If you have the config-file in a different location, replace the path after --configfile

cd workflow
snakemake --cores 4 --use-conda --conda-frontend mamba --configfile ../config/config.yaml

tada's People

Contributors

emilhaegglund avatar kdyrhage avatar lguy avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar

Forkers

lguy kdyrhage

tada's Issues

Support for GTDB v220

Support for the latest release of GTDB would be nice to have.
I've experimented with it. So far the hinder is in download_gtdb_metadata. The file name for the metadata file has changed, going from {domain}_metadata_r{version}.tar.gz to {domain}_metadata_r{version}.tsv.gz. I haven't verified that the files are then similar.

Readme not in line with code

Hi,

we talked on the SBW conference last week, thanks again for answering my questions!

I just wanted to test your workflow and noticed that the readme and the code are not concordant. So when I change the config file as suggested in the readme, I get a bunch of KeyErrors and TypeErrors; for example when I change the "method" section from

method: "prune_gtdb"

to

method:
subsample_refseq: True
subsample_gtdb: True
prune_gtdb: True

as suggested in the readme, I get a TypeError from line 53 of download_genome_data.smk that dictionaries can't be concatenated. Is there any other way to turn on several of the three options or do I have to run separate runs?

Not sure if the readme is outdated or too far ahead here :)

A second issue I had was that on my computer (Mac OS), I could not install diamond from the supplied yaml file but had to add

pip:

before the diamond specification. Not sure if this is only a Mac OS issue or not..

Now that I have adjusted the config file and managed to install diamond, it seems to run just fine.

Cheers
Andreas

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.