exome-sequencing's Introduction

Whole Exome Sequencing (WXS/WES) data analysis pipeline

I wrote this SLURM bash script to automate a Whole Exome Sequencing (WXS) data analysis workflow on a high-performance computing (HPC) cluster. It activates a conda environment that I created with the required bioinformatics tools, then performs several tasks:

  • Prefetching and Downloading: Downloads raw sequence data in FASTQ format from NCBI's Sequence Read Archive (SRA) database given your desired run accession.
  • Quality Control (QC):
    • Runs FastQC to assess the quality of the raw FASTQ files.
    • Trims the raw FASTQ files using TrimGalore and generates post-trim FastQC reports.
  • Read Alignment:
    • Aligns trimmed reads to a human reference genome (GRCh38) using BWA-MEM.
  • Mark Duplicates: Uses GATK's MarkDuplicatesSpark to mark PCR duplicates in the aligned BAM files.
  • Variant Calling:
    • Uses bcftools mpileup and bcftools call to call variants from the aligned, duplicate-marked BAM files. I chose variant calling parameters that favor sensitivity, accepting that this may increase false positives; the intent is to filter out those false positives afterwards with machine learning models.
  • Variant Subsetting:
    • Intersects the called variants with the vendor exome regions BED file and the Genome in a Bottle (GIAB) high-confidence BED file using bedtools intersect, subsetting the variants to those regions. Restricting calls to the GIAB high-confidence regions makes it possible to identify false positive and false negative variant calls, which is needed to build machine learning models that mitigate these artifacts.
  • Output: Generates intermediate files for each step and places them in step-specific folders.
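
The steps above can be sketched as a dry run that only prints the commands it would execute. This is a minimal sketch, not the author's script. The function name `print_plan`, the folder names, the thread count, the read-group fields, paired-end input, and the use of SRA Toolkit's `prefetch` and `fasterq-dump` for the download step are all assumptions. The sensitivity-oriented bcftools parameters mentioned above are not reproduced, because the source does not list them.

```shell
#!/bin/sh
# Dry-run sketch: prints one command line per pipeline step and runs nothing.
# Folder names (raw/, qc_raw/, trimmed/, ...) are hypothetical.
# Paired-end reads are assumed, hence the _1/_2 FASTQ files.
print_plan() {
    acc=$1          # SRA run accession
    ref=$2          # GRCh38 FASTA, already indexed with `bwa index`
    exome_bed=$3    # vendor exome regions BED
    giab_bed=$4     # GIAB high-confidence BED
    threads=${THREADS:-4}
    cat <<EOF
mkdir -p raw qc_raw trimmed aligned dedup vcf subset
prefetch ${acc}
fasterq-dump --split-files -O raw ${acc}
fastqc -o qc_raw raw/${acc}_1.fastq raw/${acc}_2.fastq
trim_galore --paired --fastqc -o trimmed raw/${acc}_1.fastq raw/${acc}_2.fastq
bwa mem -t ${threads} -R '@RG\tID:${acc}\tSM:${acc}\tPL:ILLUMINA' ${ref} trimmed/${acc}_1_val_1.fq trimmed/${acc}_2_val_2.fq | samtools sort -o aligned/${acc}.bam -
gatk MarkDuplicatesSpark -I aligned/${acc}.bam -O dedup/${acc}.bam
bcftools mpileup -f ${ref} dedup/${acc}.bam | bcftools call -mv -Oz -o vcf/${acc}.vcf.gz
bedtools intersect -header -u -a vcf/${acc}.vcf.gz -b ${exome_bed} > subset/${acc}.exome.vcf
bedtools intersect -header -u -a subset/${acc}.exome.vcf -b ${giab_bed} > subset/${acc}.exome.giab.vcf
EOF
}
```

Calling `print_plan <accession> <reference.fa> <exome.bed> <giab.bed>` prints the plan for review. Printing rather than executing keeps the sketch safe to try on a login node before submitting a real job.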

Input Files Required:

  • Human reference genome (GRCh38)
  • Vendor exome regions BED file
  • GIAB high-confidence BED file
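
Because a missing input file would otherwise only surface partway through a long job, it can help to check the three inputs up front. The following is a minimal sketch under that assumption; the function name `check_inputs` and the variable names in the usage comment are hypothetical and do not come from the original script. Note that BWA also needs the reference to have been indexed with `bwa index` beforehand.

```shell
#!/bin/sh
# Returns 0 only if every path given is an existing, non-empty file.
# Reports each problem on stderr so all missing inputs are listed at once.
check_inputs() {
    status=0
    for f in "$@"; do
        if [ ! -s "$f" ]; then
            echo "missing or empty input: $f" >&2
            status=1
        fi
    done
    return "$status"
}

# Hypothetical usage inside a job script:
#   check_inputs "$REF_FASTA" "$EXOME_BED" "$GIAB_BED" || exit 1
```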

Required Bioinformatics Tools:

  • BWA
  • FastQC
  • TrimGalore
  • Samtools
  • GATK
  • bcftools
  • bedtools
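
After activating the conda environment, a quick check that each tool is on the PATH can catch an incomplete environment before any data is downloaded. This is a sketch rather than part of the original script: `missing_tools` is a hypothetical helper, and the executable names in the usage comment are the usual command names for the tools listed above.

```shell
#!/bin/sh
# Prints the name of every requested executable that is not on the PATH.
# Prints nothing when all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Hypothetical usage inside a job script:
#   missing=$(missing_tools bwa fastqc trim_galore samtools gatk bcftools bedtools)
#   [ -z "$missing" ] || { echo "not on PATH: $missing" >&2; exit 1; }
```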

Outputs:

  • FASTQ quality reports (FastQC)
  • Trimmed FASTQ files
  • Aligned BAM files
  • Marked duplicate BAM files
  • Variant calling results in VCF format
  • VCF files subsetted to specified regions

The script logs system information, details of each processing step, the versions of the tools used, and the execution time for the full run. I used the output VCF files to build machine learning models of sequencing artifacts with Bioconductor and R packages (VariantAnnotation, caret, etc.).
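
The logging described above could look roughly like the following. This is a sketch under assumptions: the helper names `format_duration` and `log_version` are hypothetical, and the exact system information and log format used by the original script are not given in the source.

```shell
#!/bin/sh
# Formats a number of seconds as HH:MM:SS.
format_duration() {
    printf '%02d:%02d:%02d\n' "$(($1 / 3600))" "$(($1 % 3600 / 60))" "$(($1 % 60))"
}

# Prints the first line of a tool's version output,
# or a note if the tool is not installed.
log_version() {
    tool=$1
    shift
    if command -v "$tool" >/dev/null 2>&1; then
        echo "$tool: $("$tool" "$@" 2>&1 | head -n 1)"
    else
        echo "$tool: not found on PATH"
    fi
}

start=$(date +%s)
echo "host: $(uname -n), started: $(date)"
log_version samtools --version
log_version bcftools --version
log_version bedtools --version
# ... the pipeline steps would run here ...
end=$(date +%s)
echo "elapsed: $(format_duration $((end - start)))"
```

Recording versions next to the elapsed time makes a run easier to reproduce later, since variant calls can differ between tool releases.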

exome-sequencing's People

Contributors

felixm3
