pairsnp

A set of scripts for quickly computing pairwise SNP distance matrices from multiple sequence alignments, using sparse matrix libraries to improve performance.

For larger alignments, such as the Maela pneumococcal data set (3e5 x 3e3), the c++ version is approximately an order of magnitude faster than approaches that compare every site for every pair of sequences, such as snp-dists, from which the skeleton code for the c++ version was taken.

To be as widely useful as possible, implementations are available in R, python and c++.
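The sketch below illustrates why a sparse representation helps. It is a pure-Python illustration under stated assumptions, not the code used by any of the three implementations: the function names are hypothetical, gaps and Ns get no special treatment, and per-sequence dictionaries stand in for the rows of a sparse matrix. Each sequence stores only the sites where it differs from the column consensus, and the distance between two sequences is recovered from those entries alone. The counts combined in the inner loop are the quantities that products of sparse indicator matrices would yield for all pairs at once.

```python
from collections import Counter


def consensus(seqs):
    """Most common base in each alignment column (ties go to the first seen)."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*seqs)]


def to_sparse(seqs, cons):
    """Keep only the sites where a sequence differs from the consensus."""
    return [
        {site: base for site, (base, ref) in enumerate(zip(seq, cons)) if base != ref}
        for seq in seqs
    ]


def pairwise_snp_distances(sparse):
    """SNP distances computed from the non-consensus entries only."""
    n = len(sparse)
    dist = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            a, b = sparse[i], sparse[j]
            shared = a.keys() & b.keys()  # sites where both differ from consensus
            same = sum(1 for s in shared if a[s] == b[s])  # ...with the same base
            # differences = only-in-a + only-in-b + shared sites with different bases
            dist[i][j] = dist[j][i] = len(a) + len(b) - len(shared) - same
    return dist


seqs = ["ACGT", "ACGA", "TCGA"]
cons = consensus(seqs)
print(pairwise_snp_distances(to_sparse(seqs, cons)))
# [[0, 1, 2], [1, 0, 1], [2, 1, 0]]
```

Because most sites in a typical alignment match the consensus, the stored entries are few, and the work scales with the number of variant sites rather than with the full alignment length.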


Installation

R

The R version can be installed using devtools or downloaded from its repository

#install.packages("devtools")
devtools::install_github("gtonkinhill/pairsnp-r")

python

The python version can be installed using pip or by downloading the repository and running setup.py.

python -m pip install pairsnp

Alternatively, download the repository and run

cd ./pairsnp-python/
python ./setup.py install

c++

The c++ version can be installed manually, by downloading the binaries in this repository, or with conda as

conda install -c gtonkinhill pairsnp

The c++ code relies on a recent version of Armadillo (currently tested on v8.6). After downloading the repository, it can be built by running

cd ./pairsnp-cpp/
./configure
make
make install

The majority of the run time is spent on sparse matrix multiplications, so linking to a parallelised library for these operations is likely to improve performance further.

At the moment you may need to run touch ./* before compiling to avoid some issues with time stamps.

Quick Start

R

library(pairsnp)
fasta.file.name <- system.file("extdata", "seqs.fa", package = "pairsnp")
sparse.data <- import_fasta_sparse(fasta.file.name)
d <- snp_dist(sparse.data)

python

The python version can be run from the python interpreter as

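from pairsnp import calculate_snp_matrix, calculate_distance_matrix

# Path to a multiple sequence alignment in FASTA format
fasta_file_name = "/path/to/msa.fasta"

sparse_matrix, consensus, seq_names = calculate_snp_matrix(fasta_file_name)
# "dist" requests a distance matrix rather than a similarity matrix
d = calculate_distance_matrix(sparse_matrix, consensus, "dist", False)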

Alternatively, if installed using pip, it can be run from the command line as

pairsnp -f /path/to/msa.fasta -o /path/to/output.csv

Additional options include

Program to calculate pairwise SNP distance and similarity matrices.

optional arguments:
  -h, --help            show this help message and exit
  -t {sim,dist}, --type {sim,dist}
                        either sim (similarity) or dist (distance) (default).
  -n, --inc_n           flag to indicate differences to gaps should be
                        counted.
  -f FILENAME, --file FILENAME
                        location of a multiple sequence alignment. Currently
                        only DNA alignments are supported.
  -z, --zipped          Alignment is gzipped.
  -c, --csv             Output csv-delimited table (default tsv).
  -o OUTPUT, --out OUTPUT
                        location of output file.

c++

The c++ version can be run from the command line as

pairsnp -c msa.fasta > output.csv

Additional options include

SYNOPSIS
  Pairwise SNP similarity and distance matrices using fast matrix algerbra libraries
USAGE
  pairsnp [options] alignment.fasta[.gz] > matrix.csv
OPTIONS
  -h	Show this help
  -v	Print version and exit
  -s	Find the similarity matrix
  -c	Output CSV instead of TSV
  -n	Count comparisons with Ns (off by default)
  -b	Blank top left corner cell instead of 'pairsnp 0.0.1'
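As a rough sketch, the matrix written by the command line tools could be read back into python as shown here. The layout is an assumption inferred from the options above: a header row whose first cell is the corner cell (for example 'pairsnp 0.0.1', or blank with -b) followed by the sequence names, then one row per sequence starting with its name. The function name read_matrix and the example data are hypothetical; pass delimiter="\t" for TSV output.

```python
import csv
import io


def read_matrix(handle, delimiter=","):
    """Parse a labelled square matrix into a list of names and a nested dict."""
    rows = [row for row in csv.reader(handle, delimiter=delimiter) if row]
    names = rows[0][1:]  # skip the top-left corner cell
    matrix = {}
    for row in rows[1:]:
        # row[0] is the sequence name; the remaining cells are integer counts
        matrix[row[0]] = {name: int(value) for name, value in zip(names, row[1:])}
    return names, matrix


# Made-up example in the assumed layout (not real pairsnp output)
example = "pairsnp 0.0.1,seq1,seq2\nseq1,0,3\nseq2,3,0\n"
names, matrix = read_matrix(io.StringIO(example))
print(names)
print(matrix["seq1"]["seq2"])
```

For a file on disk, open it with open(path, newline="") and pass the handle in place of the StringIO object.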
