flu-ngs

Influenza virus next generation sequence analysis pipeline.

Highlights:

  • Uses IRMA to iteratively refine reference sequences.
  • Analyses variants for standard influenza A and B segments, and influenza A splice variants: PA-X, PB1-F2, M1, M2.
  • Reports coding effects of nucleotide changes, taking account of multiple SNPs in a codon, and their phase.
  • Support for MinION and MiSeq data.

Requirements

snakemake

The 'pipeline' is really two snakemake workflows: preprocess.smk, which trims reads and runs quality control measures, and irma.smk which runs IRMA and generates summary output.

Bioinformatics

These workflows call various other bioinformatics programs:

  • fastqc is used for generating quality control reports of the raw reads.
  • Trimmomatic is used to trim adapters off reads.
  • IRMA is used to match reads to flu reference sequences.
    • Important. This workflow uses a custom configuration script. Copy the .sh files in workflow/config to the <IRMA install path>/IRMA_RES/modules/FLU/config/ directory.
  • VEP is used to identify the effects of nucleotide changes at the protein level.
    • VEP is written in Perl and requires a module called Bundle::DBI. Install it with perl -MCPAN -e 'install Bundle::DBI'
  • tabix is required to preprocess files for VEP. Install via perl -MCPAN -e 'install Bio::DB::HTS::Tabix'.
  • bgzip & gunzip are used for (de)compression.
  • gffread is used to write transcripts from fasta and GFF files.
  • transeq is used to translate transcripts.
  • chopper is used for filtering and trimming MinION data.
  • MinIONQC is used for generating quality control reports of MinION data.

Versions are listed in workflow/envs/*.yaml.
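
The IRMA configuration step above amounts to something like the following (a sketch; `IRMA_PATH` is illustrative and should point at your own IRMA installation):

```shell
# Copy the pipeline's custom IRMA config scripts into the IRMA install.
# Adjust IRMA_PATH for your system; this default path is only an example.
IRMA_PATH="${IRMA_PATH:-$HOME/.local/lib/flu-amd}"
mkdir -p "$IRMA_PATH/IRMA_RES/modules/FLU/config"
cp workflow/config/*.sh "$IRMA_PATH/IRMA_RES/modules/FLU/config/"
```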

Python

There are several Python scripts in workflow/scripts which have a couple of dependencies.

Make a python virtual environment and install the requirements with:

pip install -r requirements.txt

Read more about what python virtual environments are, why they are useful, and how to set them up here.

You could use the same virtual environment for each analysis. If you have one set up, activate it with:

source ~/.virtualenvs/flu-ngs-env/bin/activate
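
If you don't have one yet, creating it and installing the requirements looks like this (the path mirrors the activation command above and is only a convention):

```shell
# Create a virtual environment, activate it, and install the pipeline's
# Python dependencies into it.
python3 -m venv ~/.virtualenvs/flu-ngs-env
source ~/.virtualenvs/flu-ngs-env/bin/activate
pip install -r requirements.txt
```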

Running the workflow

Each time you have samples to run, I would suggest cloning this repository:

git clone git@github.com:IRI-UW-Bioinformatics/flu-ngs.git <name>

where <name> is the name of the directory that you want, then cd <name>.

Data

The next step is to put the read files in a structure expected by the workflow.

MiSeq

Put reads in a directory called raw with the following structure:

raw/
├── trimlog.fas
├── YK_2832
│   ├── YK_2832_1.fastq
│   └── YK_2832_2.fastq
├── YK_2833
│   ├── YK_2833_1.fastq
│   └── YK_2833_2.fastq
├── YK_2834
│   ├── YK_2834_1.fastq
│   └── YK_2834_2.fastq
...

It is fine if the fastq files are gzipped (i.e. have a .gz suffix).

Forward and reverse reads should be in {sample}_1.fastq and {sample}_2.fastq respectively.

trimlog.fas should contain the adapters.
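
If your fastq files arrive in a flat directory, arranging them into the layout above can be sketched like this (assumes files named like `YK_2832_1.fastq` in the current directory; sample names here are illustrative):

```shell
# Move each pair of fastq files into raw/<sample>/, deriving the sample
# name from the filename prefix.
mkdir -p raw
for fq in *_1.fastq; do
    [ -e "$fq" ] || continue          # skip if no files match the glob
    sample="${fq%_1.fastq}"
    mkdir -p "raw/$sample"
    mv "$sample"_1.fastq "$sample"_2.fastq "raw/$sample/"
done
```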

MinION

Put reads in a directory called raw. They must be gzipped (end with fastq.gz).

raw/
├── barcode05
│   ├── FAW31148_pass_barcode05_485f6488_e81f1340_820.fastq.gz
│   ├── FAW31148_pass_barcode05_485f6488_e81f1340_821.fastq.gz
...
├── barcode06
│   ├── FAW31148_pass_barcode06_11f103b9_993a7465_30.fastq.gz
│   ├── FAW31148_pass_barcode06_11f103b9_993a7465_31.fastq.gz
...

The names of subdirectories (barcode05 and barcode06) define the names of samples. All .fastq.gz files in each subdirectory are assigned to that sample name.

Configuration

Run parameters are passed to the workflow by a file called config.json that should have these keys:

  • platform. Either minion or miseq.
  • samples. A list containing the sample IDs to analyse.
  • pair. A list containing one or more of paired, unpaired and combined. This controls whether to analyse paired reads, unpaired reads, and/or both combined. For MinION data pair must be longread: the concept of paired and unpaired reads does not apply to MinION data, so for ease of implementation longread is used as a placeholder wherever paired/unpaired/combined would appear in a MiSeq analysis.
  • order. A list containing one or both of primary and secondary. See the IRMA documentation for the distinction between primary and secondary data and residual and secondary assemblies. If both primary and secondary are supplied the workflow will do both, and runtime will approximately double.
  • errors. Either warn or raise. This controls error handling for some steps in the workflow. warn issues warnings if something goes wrong but attempts to carry on; raise stops the workflow instead.

MiSeq example:

{
  "platform": "miseq",
  "samples": [
    "YK_2837",
    "YK_2970"
  ],
  "pair": [
    "combined"
  ],
  "order": [
    "primary",
    "secondary"
  ],
  "errors": "warn"
}

MinION example:

{
  "platform": "minion",
  "samples": [
    "barcode05",
    "barcode06"
  ],
  "pair": [
    "longread"
  ],
  "order": [
    "primary",
  ],
  "errors": "warn"
}
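
A quick sanity check of config.json before launching can be sketched in Python (this helper is not part of the pipeline; the checks simply restate the rules listed above):

```python
import json

# Keys the workflow expects, per the configuration section above.
REQUIRED_KEYS = {"platform", "samples", "pair", "order", "errors"}

def validate_config(path="config.json"):
    """Load config.json and check it against the documented constraints."""
    with open(path) as f:
        config = json.load(f)
    missing = REQUIRED_KEYS - config.keys()
    if missing:
        raise ValueError(f"config.json is missing keys: {sorted(missing)}")
    if config["platform"] not in {"minion", "miseq"}:
        raise ValueError("platform must be 'minion' or 'miseq'")
    if config["platform"] == "minion" and config["pair"] != ["longread"]:
        raise ValueError("for MinION data, pair must be ['longread']")
    if config["errors"] not in {"warn", "raise"}:
        raise ValueError("errors must be 'warn' or 'raise'")
    return config
```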

Preprocessing

Reads are first preprocessed. For MiSeq data, adapters are trimmed and quality control reports are generated. For MinION data, reads are filtered by minimum and maximum length. See workflow/rules/preprocess-{minion,miseq}.smk for details.

snakemake -s workflow/preprocess.smk -c all

-c all tells snakemake to use all available cores; scale this back if need be. HTML QC reports for MinION data are saved in results/qc-raw and results/qc-trimmed.

Variant calling

Run IRMA and make summary reports of the output:

snakemake -s workflow/irma.smk -c all

Variant summary reports

Three summary files are generated:

  • results/<order>/xlsx/variants-mcc-by-sample-ordered.xlsx. In this version, each sample has its own sheet, and each sheet contains all influenza segments found in that sample.
  • results/<order>/xlsx/variants-mcc-by-segment-ordered.xlsx. Like above, but each segment gets its own sheet.
  • results/<order>/xlsx/variants-mcc-flat-ordered.xlsx. This contains all variants in one flat sheet.

Sequences

IRMA consensus sequences and amino acid translations are put in results/<order>/seq.

Splice variants

Splice variants of MP, PA and PB1 are all based on the assumption that IRMA finds canonical length consensus sequences for these segments (see here for more details).

If IRMA finds a consensus sequence for one of these segments that is not the expected length, then the behaviour is determined by the config file:

  • If "errors": "warn" is used, a logfile is produced at the end of the run (logs/incorrect-splice-vars-<order>.log) detailing any segments where the consensus was not the expected length. For these, it is highly likely that the variant analysis will be incorrect.
  • Instead, if "errors": "raise" is used in the config, the workflow stops immediately if a length mismatch occurs.

(For NS, we know to expect more variability in segment length, and the locations of exons are flexibly determined.)
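
The warn/raise behaviour described above can be sketched as follows (an illustrative helper, not the pipeline's actual code; the expected canonical lengths would come from the splice-variant definitions):

```python
import warnings

def check_segment_length(segment, observed, expected, errors="warn"):
    """Compare an IRMA consensus length to the expected canonical length.

    Mirrors the config behaviour above: "warn" issues a warning and
    carries on; "raise" stops immediately on a length mismatch.
    """
    if observed == expected:
        return True
    msg = f"{segment}: consensus length {observed} != expected {expected}"
    if errors == "raise":
        raise ValueError(msg)
    warnings.warn(msg)
    return False
```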

Bonus: sorted BAM files

Most software for viewing read alignments requires sorted BAM files and/or BAM index files. I've written a small workflow that generates these for all BAM files IRMA produces. It requires samtools. .sorted.bam and .sorted.bam.bai files are saved in the same directory as the original .bam files. Do:

snakemake -s workflow/sort-bam.smk -c all

flu-ngs's Issues

Gabi (minor) changes

  • protein numbering sort as int
  • amino acid change in separate columns
  • mark transition / transversion
  • 3 d.p. rounding
  • outputs
    • one excel per sample. 1 sheet per protein
    • one excel per protein. 1 sheet per sample
    • one excel all data. all samples and proteins on one sheet.

Add 'phasing' information

IRMA outputs 'phasing' information. Alleles that are found together get assigned the same 'phase' - an integer.

Should add this to the VEP output.

Check splice variant lengths

We currently assume that the unspliced versions of proteins run to the end of the IRMA reference transcripts.

Wasn't the case for NS1.

Should check the other spliced proteins too.

Same codon

Report nucleotide mutations that are in the same codon.

If also in the same phase, then could / should take this into account in determining amino acid change.
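
Grouping 1-based nucleotide positions within a CDS into codons can be sketched like this (hypothetical helpers, not pipeline code):

```python
from collections import defaultdict

def codon_index(nt_pos, cds_start=1):
    """0-based codon index of a 1-based nucleotide position within a CDS."""
    return (nt_pos - cds_start) // 3

def shared_codon_groups(positions):
    """Return codon indices hit by more than one variant position."""
    groups = defaultdict(list)
    for pos in positions:
        groups[codon_index(pos)].append(pos)
    return {codon: hits for codon, hits in groups.items() if len(hits) > 1}
```

For example, variants at positions 1 and 2 both fall in codon 0 and should be considered together when determining the amino acid change.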

Add allele frequencies to VEP output

Didn't realise VEP dropped AF data!

IRMA has separate txt files in the tables output directory for insertions, deletions and single mutations

Use FLU-secondary.sh?

There is a ready-made config file in IRMA called FLU-secondary.sh:

$ cat ~/.local/lib/flu-amd/IRMA_RES/modules/FLU/config/FLU-secondary.sh 
# HEADER
PARAM_FILE_NAME="FLU-residual"
PARAM_FILE_AUTHOR="S. Shepard"
PARAM_FILE_VERSION="1.0"
PARAM_FILE_DATE="2019-02-19"

# CONSENSUS REFINEMENT & READ SELECTION
INS_T=0.25				# threshold for insertion refinement
DEL_T=0.60				# threshold for deletion refinement
SKIP_E=0				# skip reference elongation
MIN_LEN=125
GRID_ON=1

# STAGES
MATCH_PROG="BLAT"
SORT_PROG="LABEL BLAT"
ALIGN_PROG="SAM BLAT"
ASSEM_PROG="SSW"

# DJP: These are the parameters that control the secondary assembly
RESIDUAL_ASSEMBLY_FACTOR=400
MIN_RP_RESIDUAL=150
MIN_RC_RESIDUAL=150
DO_SECONDARY=1

From what I can tell, SORT_PROG is the only value that differs from the default used in FLU, which is BLAT (set in ~/.local/lib/flu-amd/IRMA_RES/modules/FLU/init.sh).

I quickly tested using FLU-secondary (i.e. using SORT_PROG="LABEL BLAT"). It took much longer than just setting the secondary assembly parameters. I wonder if there is a particular reason to prefer "LABEL BLAT" for SORT_PROG when doing a secondary assembly...
