Git Product home page Git Product logo

findfungi's Introduction

FindFungi-v0.23.3

A pipeline for the identification of fungi in public metagenomics datasets. The FindFungi pipeline uses the metagenomics read-classifier Kraken with 32 custom fungal databases to generate 32 taxon predictions for a single read. These 32 predictions are combined to generate a consensus prediction. All reads are then BLASTed against their predicted genomes to generate read distribution skewness scores to select for the most likely true positives.

FindFungi-0.23 was built on an IBM platform load-sharing facility with 32 worker nodes. If you would like a SLURM version of this pipeline, or a version that can run on a standard Unix/Ubuntu machine, please navigate to the bottom of this README.

Quickstart

Download the pipeline, databases, associated scripts, prerequisites and other tools. Run the pipeline:

./FindFungi-0.23.3.sh /path/to/FASTQ-file.fastq Dataset-name

Getting Started

These instructions will hopefully allow you to get a copy of FindFungi up and running on your own compute-cluster/server for development or your own analyses. If using a non-IBM LSF compute cluster, change the 'bsub' commands to reflect your architecture. If using a single server, remove the 'bsub' commands.

Prerequisites

FindFungi v0.23 was built using the following:

  • gcc version 4.4.4 20100726 (Red Hat 4.4.4-13)
  • coreutils 8.27
  • python 2.7.13 (modules: sys, os, ete3, biopython (Bio), math, argparse, itertools, collections, re)
    • Don't use conda to install these modules, please use pip
  • skewer 0.2.2
  • kraken 0.10.5-beta
  • ncbi blast 2.2.30
  • Rscript 3.3.3 (packages: wordcloud)
  • graphviz 2.40.1

Installing

  • Download all of the scripts from GitHub/GiantSpaceRobot and move to a directory (/your/directory/scripts). You may need to give these scripts more permissions (e.g. chmod 755 *).
  • In the FindFungi-v0.23.3 script, change the absolute paths of skewer, kraken, blast, the shell and python scripts to reflect your environment, or add these tools and scripts to you $PATH. You will also need to edit the LowestCommonAncestor.sh script to include the path to the downloaded scripts.
  • NOTE: It may be necessary for you to include the absolute paths for all of the scripts and tools within the FindFungi-0.23.sh master script, depending on the cluster node preferences (e.g. executing 'python' actually calls the node's version of python, not yours).
  • Download the Kraken and BLAST databases from this website (http://bioinformatics.czc.hokudai.ac.jp/findfungi/).
  • Uncompress these files and put them somewhere sensible:
tar -xvfz Kraken_*.tar.gz
mv Kraken_* Kraken_DB_Directory/

Testing the pipeline

Download the dataset ERR675624 from the European Nucleotide Archive database. This dataset contains fungal reads.

wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR675/ERR675624/ERR675624_1.fastq.gz
wget ftp://ftp.sra.ebi.ac.uk/vol1/fastq/ERR675/ERR675624/ERR675624_2.fastq.gz

Gunzip these files and concatenate them, as we have no need to read pair information.

gunzip ERR675624_*.fastq.gz
cat ERR675624_*.fastq > ERR675624_both-pairs.fastq

Execute the FindFungi pipeline on this FASTQ file. We use nohup here to allow the pipeline to run in the background.

nohup ./FindFungi-0.23.sh /path/to/ERR675624_both-pairs.fastq ERR675624

The first command line argument (/path/to/ERR675624_both-pairs.fastq) points to your FASTQ file. The second (ERR675624) is the name FindFungi will use for this dataset. This name should be informative.

The .csv results should show the following:

#Taxon name,Taxid,Reads mapping to taxid,Reads mapping to children taxids,Pearson skewness score,Percent of pseudo-chromosomes with read hits
Candida sp. LDI48194,1759314,671,0,0.524623062587,100.0
Malassezia restricta,76775,378,0,0.496034792692,100.0
Candida tropicalis MYA-3404,294747,265,0,-0.265788716977,100.0

SLURM Implementation

Dr Ali Snedden has kindly created a SLURM implementation of the pipeline: https://github.com/astrophys/FindFungi_adapted_for_slurm.

Standard Unix Implementation (not a compute cluster)

Please use this version if you intend to use FindFungi on a single Unix machine: https://github.com/GiantSpaceRobot/FindFungi_SingleServerVersion

Contributors

License

This project is licensed under the MIT License.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.