
DSP_NF-METAGENOMICS

Shotgun metagenomics pipeline to process microbiome samples


About

The repository presents a comprehensive workflow for metagenomic analysis, starting from an initial assessment of data quality to an in-depth understanding of the composition and function of the examined microbiome. The analysis begins with a quality check of the sequenced data using FastQC, followed by a specific quality control for metagenomic data with Kneaddata. Subsequently, the workflow proceeds to the assembly of the reads with MegaHit and the classification of contigs into eukaryotic or prokaryotic. Anvi'o is then employed for the taxonomic and functional annotation of the contigs, as well as for mapping high-quality reads. Finally, Metaphlan 4.0 facilitates further taxonomic annotation and the estimation of the abundance of various species based on reference genomes, thus completing the comprehensive analysis of the microbiome.

Getting started

The following instructions are designed to guide users in extracting information from their FASTQ files. Originally, the pipeline was implemented using shell scripts that invoke various bioinformatics software for data analysis. Presently, it is undergoing a transition to be re-implemented as a Nextflow metagenomics workflow. This update aims to enhance the reproducibility and efficiency of the analysis process.

Prerequisites and installing

Azure setting

This workflow is configured to be executed through Azure Batch and Docker, leveraging cloud computing resources and containerized environments. It is recommended to follow these instructions to set up Azure. Remember also to change the name of the container, which is not specified in that guide. The steps are:

  1. Generating a batch account
  2. Generating a storage account
  3. Generating a container (the concept of container in Azure is different from Docker)
  4. Generating a Virtual Machine
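With the Azure CLI, the four steps above can be sketched roughly as follows. All resource names here are hypothetical placeholders, and the commands are printed rather than executed so you can review them before running them in a logged-in `az` session:

```shell
# Hypothetical names; replace with your own before running
RG=metagenomics-rg
LOCATION=westeurope
STORAGE=metagenomicsstore
BATCH=metagenomicsbatch
CONTAINER=orange   # the blob container later referenced as -w az://<container>

# Print the corresponding Azure CLI calls
echo "az group create --name $RG --location $LOCATION"
echo "az storage account create --name $STORAGE --resource-group $RG --location $LOCATION"
echo "az batch account create --name $BATCH --resource-group $RG --storage-account $STORAGE"
echo "az storage container create --name $CONTAINER --account-name $STORAGE"
echo "az vm create --resource-group $RG --name metagenomics-vm --image Ubuntu2204"
```

Note that an Azure blob container (step 3) is a storage namespace, not a Docker container.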

Connecting with GitHub

Using the SSH protocol, you can connect and authenticate to remote servers. For more details please have a look at this page. Going through steps:

  1. Generating a new SSH key and adding it to the ssh-agent
  2. Checking for existing SSH keys
  3. Adding a new SSH key to your GitHub account
  4. About commit signature verification
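A minimal sketch of steps 1 and 2 follows. The key is written to a temporary directory here so nothing existing is overwritten; in practice you would use `~/.ssh` and your real email address:

```shell
# Generate an ed25519 key pair in a scratch directory (the email is a placeholder)
KEYDIR=$(mktemp -d)
ssh-keygen -t ed25519 -C "you@example.com" -f "$KEYDIR/id_ed25519" -N ""
# Start an agent and register the key with it
eval "$(ssh-agent -s)"
ssh-add "$KEYDIR/id_ed25519"
# This is the public key to paste into GitHub -> Settings -> SSH and GPG keys
cat "$KEYDIR/id_ed25519.pub"
```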

If you get the error Permission denied (publickey):

  1. Copy your private key to the VM
  2. Modify its permissions: chmod 600 ~/.ssh/id_rsa
  3. Add the key and enter your passphrase: ssh-add ~/path/id_rsa

Run the pipeline

  1. Update your credentials.json file
  2. Run the workflow: nextflow run main.nf

Once you have cloned the repository from GitHub, configure the nextflow.config file with the paths of the container in your Azure account. Then type touch credentials.json and paste the private keys into the respective storage account and batch account fields:

{
  "storageAccountName": "ma****ge",
  "storageAccountKey": "****Mi7MWBz****==",
  "batchAccountName": "***dtu***",
  "batchAccountKey": "****wX7rHYMD****=="
}

After that, run the pipeline with nextflow run main.nf -c nextflow.config -profile <name-profile-on-config-file> -w az://<your-container-name>.

Step by step

  1. Quick QC check of the raw sequenced data (FastQC)
  2. Quality control of metagenomic data, meant for microbiome experiments (Kneaddata)
  3. Assembly of the reads (MegaHit): per sample and co-assembly (the step prior to Anvi'o)
  4. Kingdom distribution: prediction of whether a contig is eukaryotic or prokaryotic
  5. Taxonomic and functional annotation of the contigs (Anvi'o tools)
  6. Mapping high-quality reads to the contigs (within the Anvi'o framework)
  7. Taxonomic annotation and taxa abundance estimation based on reference genomes (MetaPhlAn 4.0)

Bioinformatic parameters

Below is a detailed overview of the parameters used in each bioinformatic tool within the Nextflow pipeline (file: nextflow_orange.nf), specifically outlining the commands and their functions within the context of the entire workflow.

FASTQC Tool designed for quality control analysis of high-throughput sequencing data, producing visual reports that help assess the quality and characteristics of sequencing data before downstream analysis.

Command Description
-o (--output) Specifies the output directory to store the processed data.
-q (--quiet) Suppresses progress messages on stdout, reporting only errors.
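A typical invocation combining the flags above might look like this. The sample file names are hypothetical; the command is printed rather than executed, so drop the `echo` where FastQC is installed:

```shell
# Hypothetical paired-end read files for one sample
R1=O1_1.fq.gz
R2=O1_2.fq.gz
OUTDIR=fastqc_out
mkdir -p "$OUTDIR"
# -o: write the HTML/zip reports to OUTDIR; -q: quiet mode, print only errors
CMD="fastqc -q -o $OUTDIR $R1 $R2"
echo "$CMD"
```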

KNEADDATA Tool used for QC and pre-processing of metagenomic and metatranscriptomic sequencing data; note that the input consists of paired-end sequence files.

Command Description
-i1 Specifies the path to the input file containing the forward (R1) reads.
-i2 Specifies the path to the input file containing the reverse (R2) reads.
--reference-db Specifies the reference database to be used for contaminant removal.
--output Specifies the output directory to store the processed data.
--bypass-trim Skip the trimming step during the processing of sequencing data.
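Put together, the options above give a call like the following. The sample names and the host_genome database path are hypothetical placeholders; drop the `echo` to execute where KneadData is installed:

```shell
# Hypothetical sample and reference database; adjust to your data
R1=ERR1713346_1.fastq.gz
R2=ERR1713346_2.fastq.gz
DB=host_genome   # indexed host reference used for contaminant removal
CMD="kneaddata -i1 $R1 -i2 $R2 --reference-db $DB --output kneaddata_out --bypass-trim"
echo "$CMD"
```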

MEGAHIT Metagenome assembly tool used for assembling sequencing data, particularly data obtained from high-throughput sequencing technologies.

Command Description
-1 Specifies the path to the input file containing the first pair of paired-end reads.
-2 Specifies the path to the input file containing the second pair of paired-end reads.
-o Specifies the output directory to store the assembled contigs or output files.
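For a single sample, the per-sample assembly then looks like this (input names are hypothetical; drop the `echo` to execute where MEGAHIT is installed):

```shell
# Hypothetical paired-end input; MEGAHIT creates the output directory itself
CMD="megahit -1 O1_1.fq.gz -2 O1_2.fq.gz -o megahit_out"
echo "$CMD"
```

MEGAHIT writes the assembled contigs to megahit_out/final.contigs.fa, which is the input for the kingdom-distribution and Anvi'o steps.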

WHOKARYOTE Tool that uses a random forest to predict whether a contig comes from a eukaryote or a prokaryote (https://github.com/LottePronk/whokaryote).

Command Description
--contigs Specifies the path with your contigs file.
--minsize Specifies a minimum contig size in bp; the default is 5000 (accuracy is lower below 5000).
--outdir Specifies the output directory to store the output file.
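An example call on a MEGAHIT assembly follows. The contigs file name assumes MEGAHIT's default output, and the executable name may vary by install (the upstream repository ships it as whokaryote.py); drop the `echo` to execute:

```shell
# Hypothetical contigs file (final.contigs.fa is MEGAHIT's default output name)
CMD="whokaryote.py --contigs final.contigs.fa --minsize 5000 --outdir whokaryote_out"
echo "$CMD"
```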

METAPHLAN Tool used for taxonomic profiling of metagenomic sequencing data; it identifies and quantifies the microbial species present in a sample based on unique clade-specific marker genes.

Command Description
-t Specifies the taxonomic level for the output; it allows users to choose the level of taxonomic resolution for the results.
--bowtie2out Specifies the output file for Bowtie2 alignments generated; it is used internally by MetaPhlAn for read alignments against the marker gene database.
--input_type Specifies the input data type for MetaPhlAn. It allows users to inform MetaPhlAn about the type of input data being provided (fastq, sam, bam).
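A profiling run over one sample's quality-controlled reads might look like this (file names are hypothetical; drop the `echo` to execute where MetaPhlAn and its database are installed):

```shell
# Hypothetical sample; MetaPhlAn accepts comma-separated paired read files
# -t rel_ab requests relative abundances; --bowtie2out caches the alignments
CMD="metaphlan O1_1.fq.gz,O1_2.fq.gz --input_type fastq --bowtie2out O1.bowtie2.bz2 -t rel_ab -o O1_profile.txt"
echo "$CMD"
```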

Repository structure

The table below provides an overview of the key files and directories in this repository, along with a brief description of each.

File Description
bin/ Folder with Python scripts adapted to the workflow
map/ Folder with PDF and PNG files to better represent the workflow
old_scripts/ Folder with all the scripts used for creating the workflow (QC, assembly, predictions, taxonomic annotation, mapping, etc.)
nextflow.config Configuration file containing the Nextflow configuration for running the bioinformatics workflow, including parameters for processing genomic data on the Azure cloud service
nextflow_config_full_draft.txt Text file containing a draft Nextflow configuration specifying the resource requirements of each program used


References

Authors

Contact me at [email protected] if you are interested in running it before it is done.

Acknowledgements

We would like to extend our heartfelt gratitude to DTU Biosustain and the Novo Nordisk Foundation Center for Biosustainability for providing the essential resources and support that have been fundamental in the development and success of the DSP (Data Science Platform) and MoNA (Multi-omics Network Analysis) projects.

Contributors

apalleja, marcoreverenna

Known issues

Removing host_genome using FASTQ sample from MGnify

The following command line has been used to run the pipeline:
nextflow run main.nf -profile az_test -w az://orange -ansi-log false -resume -with-dag dag.png


This error occurs after removing the host_genome path in the QC process and index_ch in the workflow.

The same error occurs after removing --reference-db host_genome in kneaddata.
The pipeline seemed to be running correctly at first; it did not break right at the beginning.

Command executed:
kneaddata -i1 ERR1713346_1.fastq.gz -i2 ERR1713346_2.fastq.gz --threads 8 --output . --bypass-trim
  
mkdir -p kneaddata_logs
mv ERR1713346_1_kneaddata.log kneaddata_logs/

Command error:
ERROR: Unable to write file: /mnt/batch/tasks/workitems/job-101f51bdea810a457fef-QC/job-1/nf-02b3c0b0a2d436eb29a216b10ec57dd0/wd/reformatted_identifierskxgtnfmc_decompressed_7533av6e_ERR1713346_1

Testing the pipeline

Considering these parameters:

host_genome  : az://metanfnewsample/orange_genome/host_genome/GCF_022201045.2_DVS_A1.0_genomic.fna
reads        : az://metanfnewsample/data/*_{1,2}.fq.gz
genomedir    : az://metanfnewsample/orange_genome/host_genome
metaphlan_db : az://metanfnewsample/databases/metaphlan_db
outdir       : az://metanfnewsample/results

The pipeline breaks in the kneaddata process.

Uploading local `bin` scripts folder to az://results/tmp/a8/bc9f8851d939e7c80287c7b03f381f/bin
[O1, [/metanfnewsample/data/O1_1.fq.gz, /metanfnewsample/data/O1_2.fq.gz]]
[bc/7762c0] Submitted process > QC (Kneaddata on O1)
[e0/d873b2] Submitted process > FASTQC (FASTQC on O1)
ERROR ~ Error executing process > 'QC (Kneaddata on O1)'

kneaddata -i1 O1_1.fq.gz -i2 O1_2.fq.gz --threads 8 --output . --bypass-trim

mkdir -p kneaddata_logs
mv O1_1_kneaddata.log kneaddata_logs/
