fastlin

Overview

Fastlin is an ultra-fast program for lineage typing of Mycobacterium tuberculosis complex (MTBC) FASTQ read data and FASTA assemblies. Using the split-kmer approach, it can accurately predict MTBC lineages and strain mixtures in seconds.

Reference: fastlin: an ultra-fast program for Mycobacterium tuberculosis complex lineage typing.

Installation

To install fastlin via cargo, you must have the Rust toolchain installed.

cargo install fastlin

Or clone this repository and install fastlin from its root directory:

cargo install --path .

Alternatively, you can install precompiled binaries using Conda (Linux and macOS Intel processors):

conda install -c bioconda fastlin

You will also need a barcode file (see Input files below).

Running fastlin

The default command line is:

fastlin -d /path/directory_fastq_files -b barcodes_file.txt

If your dataset does not contain any BAM-derived FASTQ files, we recommend applying a maximum kmer coverage threshold to reduce runtimes:

fastlin -d /path/directory_fastq_files -b barcodes_file.txt -x 80
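When fastlin is called from a pipeline, the two command lines above can be assembled programmatically. The following is only a minimal sketch: build_fastlin_command is a hypothetical helper, not part of fastlin, and it uses only the -d, -b and -x options shown above.

```python
import shlex
from typing import List, Optional


def build_fastlin_command(fastq_dir: str, barcodes: str,
                          max_kmer_cov: Optional[int] = None) -> List[str]:
    """Assemble a fastlin argument list (hypothetical helper, not part of fastlin)."""
    cmd = ["fastlin", "-d", fastq_dir, "-b", barcodes]
    if max_kmer_cov is not None:
        # -x is the maximum kmer coverage threshold used in the command above
        cmd += ["-x", str(max_kmer_cov)]
    return cmd


cmd = build_fastlin_command("/path/directory_fastq_files", "barcodes_file.txt", 80)
print(shlex.join(cmd))
# To execute it (requires fastlin on the PATH):
#   import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when directory names contain spaces.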

Input files

Fastlin takes as input the path to the directory containing the FASTQ and/or FASTA files. The directory can contain a mix of FASTA genome assemblies and paired-end and single-end FASTQ files. These data files should be gzipped, with the following extensions:

  • .fastq.gz or .fq.gz for FASTQ read data. The names of paired-end files should be in the form name_1.fq.gz and name_2.fq.gz (or the equivalent with .fastq.gz).
  • .fas.gz, .fasta.gz or .fna.gz for FASTA genome assemblies. For FASTA files, (i) the minimum occurrence parameter is automatically set to 1 and (ii) the maximum kmer coverage is ignored.
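Before launching a run, it can help to check how the files in a directory map onto these naming rules. The sketch below is an illustration only: group_inputs is a hypothetical helper, and its pairing logic (two FASTQ files sharing a name before _1/_2 are treated as paired) is an assumption that may differ from what fastlin does internally.

```python
import re
from collections import defaultdict
from typing import Dict, List

FASTQ_EXT = (".fastq.gz", ".fq.gz")
FASTA_EXT = (".fas.gz", ".fasta.gz", ".fna.gz")


def group_inputs(filenames: List[str]) -> Dict[str, dict]:
    """Group file names by sample and guess the data type from the naming rules."""
    samples: Dict[str, dict] = defaultdict(lambda: {"type": None, "files": []})
    for name in sorted(filenames):
        for ext in FASTQ_EXT + FASTA_EXT:
            if name.endswith(ext):
                stem = name[: -len(ext)]
                break
        else:
            continue  # unrecognised extension: skip the file
        if ext in FASTA_EXT:
            samples[stem]["type"] = "assembly"
            samples[stem]["files"].append(name)
            continue
        # Paired-end mates are expected as name_1 / name_2
        match = re.fullmatch(r"(.+)_([12])", stem)
        sample = match.group(1) if match else stem
        samples[sample]["files"].append(name)
    for info in samples.values():
        if info["type"] is None:
            info["type"] = "paired" if len(info["files"]) == 2 else "single"
    return dict(samples)


groups = group_inputs(
    ["a_1.fq.gz", "a_2.fq.gz", "b.fastq.gz", "c.fna.gz", "notes.txt"]
)
for sample, info in groups.items():
    print(sample, info["type"], info["files"])
```

In practice the file list would come from os.listdir() on the directory passed to -d.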

The MTBC barcode file can be downloaded from https://www.github.com/rderelle/barcodes-fastlin. Alternatively, you can build and test your own kmer barcodes using the Python scripts available in that repository.

Manual

A full description of fastlin parameters can be found here.

Output file

Fastlin output consists of a tab-delimited file with the following fields:

  • sample: sample name
  • data type: 'assembly', 'single' (single-end reads) or 'paired' (paired-end reads)
  • k_cov: theoretical kmer coverage of the fastq file(s) based on the number of extracted kmers
  • mixture: pure ('no') or mixed ('yes') sample
  • lineages: detected lineages (median kmer occurrences in parentheses)
  • log_barcodes: kmer barcodes passing the minimum occurrence threshold, indicated by their kmer occurrences and grouped by lineage
  • log_errors: error message reported when an input file could not be read (see Error handling below)
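Because the output is a plain tab-delimited file, it can be loaded with a few lines of Python. This is a minimal sketch, assuming the header line starts with '#' and that trailing empty columns may be missing from a row; read_fastlin_output is a hypothetical helper, not part of fastlin, and the in-memory text stands in for a real output file.

```python
import io
from typing import Dict, Iterable, List


def read_fastlin_output(lines: Iterable[str]) -> List[Dict[str, str]]:
    """Parse fastlin's tab-delimited output into one dict per sample."""
    rows = []
    header = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")
        if header is None:
            # First non-empty line is the header; drop the leading '#'
            fields[0] = fields[0].lstrip("#")
            header = fields
            continue
        # Pad short rows so that every column is present
        fields += [""] * (len(header) - len(fields))
        rows.append(dict(zip(header, fields)))
    return rows


example = io.StringIO(
    "#sample\tdata type\tk_cov\tmixture\tlineages\tlog_barcodes\tlog_errors\n"
    "ERRxxxxx\tpaired\t118\tno\t2 (45)\t2 (42, 48, 39, 43, 54, 47, 45), 4.1 (4)\n"
)
rows = read_fastlin_output(example)
print(rows[0]["sample"], rows[0]["data type"], int(rows[0]["k_cov"]))
```

With a real file, pass an open file handle instead of the io.StringIO object.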

Here is a simple example:

#sample    data type    k_cov    mixture    lineages    log_barcodes    log_errors
ERRxxxxx    paired    118    no    2 (45)    2 (42, 48, 39, 43, 54, 47, 45), 4.1 (4)

The sample ERRxxxxx contains a single strain belonging to lineage 2. This typing is supported by 7 kmer barcodes, with a median of 45 occurrences. Since the abundance of the strain is far below the theoretical kmer coverage (118 here), the sample is likely to contain a high level of contamination or sequencing errors.
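The numbers in this example can be checked by hand or with a short script. The sketch below parses the log_barcodes value of the example row and recomputes the median for lineage 2; parse_log_barcodes is a hypothetical helper, and the regular expression assumes the "lineage (count, count, ...)" layout seen in the example.

```python
import re
from statistics import median
from typing import Dict, List


def parse_log_barcodes(log_barcodes: str) -> Dict[str, List[int]]:
    """Map each lineage to the list of occurrence counts of its barcodes."""
    result = {}
    for lineage, counts in re.findall(r"([^\s,(]+)\s*\(([^)]*)\)", log_barcodes):
        result[lineage] = [int(c) for c in counts.split(",")]
    return result


barcodes = parse_log_barcodes("2 (42, 48, 39, 43, 54, 47, 45), 4.1 (4)")
k_cov = 118

lineage2 = barcodes["2"]
print(len(lineage2))                       # 7 barcodes support lineage 2
print(median(lineage2))                    # 45, the value shown in 'lineages'
print(round(median(lineage2) / k_cov, 2))  # 0.38: abundance relative to k_cov
```

The ratio of the median barcode occurrence to k_cov is a convenient way to flag samples whose strain abundance falls well below the theoretical coverage.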

Error handling

When fastlin cannot read a fastq file (e.g., a faulty record within the file or a corrupt gzip file), it stops scanning that file, resets all values to 0 and reports the error message in the last column of the output file. Here is an example of output with 3 different errors:

#sample    data type    k_cov    mixture    lineages    log_barcodes    log_errors
dummy1   single   0   no       Error in file "reads/dummy1.fastq.gz": FASTQ parse error: sequence length is 150, but quality length is 50 (record 'ERR551806.5' at line 17).
dummy2   single   0   no       Error in file "reads/dummy2.fastq.gz": invalid gzip header
dummy3   single   0   no       Error in file "reads/dummy3.fastq.gz": corrupt deflate stream
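Since failed samples keep a row in the output, downstream scripts should separate them before interpreting lineages. The following sketch assumes rows have already been loaded as dictionaries keyed by column name; split_by_error is a hypothetical helper, and the first row is made-up placeholder data for illustration.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def split_by_error(rows: List[Row]) -> Tuple[List[Row], List[Row]]:
    """Separate rows with an empty log_errors column from rows reporting an error."""
    ok, failed = [], []
    for row in rows:
        if row.get("log_errors", "").strip():
            failed.append(row)
        else:
            ok.append(row)
    return ok, failed


rows = [
    # Placeholder row without error (illustrative values only)
    {"sample": "sampleA", "data type": "paired", "k_cov": "100",
     "mixture": "no", "lineages": "2 (45)", "log_errors": ""},
    {"sample": "dummy2", "data type": "single", "k_cov": "0",
     "mixture": "no", "lineages": "",
     "log_errors": 'Error in file "reads/dummy2.fastq.gz": invalid gzip header'},
    {"sample": "dummy3", "data type": "single", "k_cov": "0",
     "mixture": "no", "lineages": "",
     "log_errors": 'Error in file "reads/dummy3.fastq.gz": corrupt deflate stream'},
]

ok, failed = split_by_error(rows)
for row in failed:
    print(row["sample"], "->", row["log_errors"])
```

Checking log_errors rather than k_cov == 0 avoids confusing unreadable files with samples that simply yielded no kmers.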

Contributors

rderelle, jeff-k
