PangtreeBuild

This repository contains a tool for multiple sequence alignment analysis. It implements the idea of a pan-genome (Ref. 1) by representing the multialignment as a PO-MSA structure (Partial Order Alignment Graph, Ref. 2). The main purpose of this software is to construct a Consensus Tree: a phylogenetic-like tree with an agreed sequence (consensus sequence) assigned to each node.

Getting Started

Prerequisites

Running:

Testing:

Installing

python3 setup.py install

Quick installation check

This line builds a pan-genome model for an example alignment of 160 Ebola virus sequences and saves it to a JSON file.

python3 -m pangtreebuild --multialignment data/Ebola/input/multialignment.maf

Usage

Import package pangtreebuild to your Python program and use it according to the documentation.

Use pangtreebuild via the command line with the following arguments:

python3 -m pangtreebuild [args]

Name	CLI	Required	Description
Arguments affecting PO-MSA construction:
MULTIALIGNMENT	--multialignment	Yes	Path to the multialignment file (.maf or .po)
METADATA	--metadata	No	Optional information about sequences in CSV format. The only required column is 'seqid'; its values must match the sequence identifiers in the multialignment file, as described in Sequence Naming Convention (below). Example: data/Ebola/input/metadata.csv
RAW_MAF	--raw_maf	No, default=False	Build PO-MSA without transforming multialignment (MAF file) to DAG. PO-MSA built in this way does not reflect real life sequences.
FASTA_PROVIDER	--fasta_provider	No	Nucleotide source used if any residues are missing from the multialignment. Possible values: 'ncbi', 'file'. If not specified, MISSING_SYMBOL is used.
MISSING_SYMBOL	--missing_symbol	No, default='?'	Symbol for missing nucleotides used if no FASTA_PROVIDER is given.
CACHE	--cache	No, default='Yes'	If True, sequences downloaded from NCBI are stored on the local disk and reused between program calls. Used only if FASTA_PROVIDER is 'ncbi'.
FASTA_FILE	-fasta_source_file	Yes if FASTA_PROVIDER='FILE'	Path to fasta file or zipped fasta files with whole sequences present in multialignment, used if FASTA_PROVIDER is 'FILE'.
Arguments affecting Consensus Tree construction:
CONSENSUS	-consensus	No	Possible values: 'TREE' (default algorithm, described in Documentation.md), 'POA' (simplified version, based solely on Ref. 2)
BLOSUM	--blosum	No, default=bin\blosum80.mat	Path to the BLOSUM file. The BLOSUM file must include MISSING_SYMBOL.
HBMIN	--hbmin	No, default=0.9	'POA' parameter. The minimum value of sequence compatibility to generated consensus.
STOP	--stop	No, default=0.99	'TREE' parameter. Minimum value of compatibility in tree leaves.
P	-p	No, default=1	'TREE' parameter. Compatibilities are raised to the power of P during cutoff finding, which makes their scale nonlinear. For P in (0, 1), this increases distances between small compatibilities and decreases distances between larger ones. For P > 1, it decreases distances between small compatibilities and increases distances between larger ones.
Arguments affecting output generation:
OUTPUT_DIR	--output_dir, -o	No, default=timestamped folder in current working directory	Output directory path.
VERBOSE	--verbose, -v	No, default=False	Set if detailed log files must be produced.
QUIET	--quiet, -q	No, default=False	Set to turn off console logging.
FASTA	--output_fasta	No, default=False	Set to create fasta files with consensuses.
PO	-output_po	No, default=False	Set to create po file with multialignment (without consensuses).
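
To see how a given P reshapes the gaps between compatibilities, the short sketch below raises a few sample values to different powers and prints the differences between neighbours. It is only an illustration of the arithmetic: the function names and sample values are made up here and are not part of pangtreebuild.

```python
def rescale(compatibilities, p):
    """Raise each compatibility (a value in [0, 1]) to the power p."""
    return [c ** p for c in compatibilities]


def gaps(values):
    """Differences between consecutive values, rounded for readability."""
    return [round(b - a, 4) for a, b in zip(values, values[1:])]


# Hypothetical compatibilities: two small values and two large ones.
compatibilities = [0.1, 0.2, 0.8, 0.9]

for p in (0.5, 1, 2):
    print(f"p={p}: gaps={gaps(rescale(compatibilities, p))}")
```

With these inputs, P = 0.5 widens the gap between 0.1 and 0.2 and narrows the gap between 0.8 and 0.9, while P = 2 does the opposite.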

Sequence Naming Convention

[anything before first dot is ignored].[everything after first dot (also other dots) is interpreted as seqid]
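
The convention above can be sketched in a few lines of Python, together with a check that every seqid appears in the 'seqid' column of a metadata CSV. This is a minimal sketch, not code from pangtreebuild: the helper names, the sample sequence names, and the choice to raise an error when a name contains no dot are all assumptions made for illustration.

```python
import csv
import io


def seqid_from_name(name):
    """Return everything after the first dot of a sequence name."""
    _prefix, dot, seqid = name.partition(".")
    if not dot or not seqid:
        # Assumption: a name without a dot has no usable seqid.
        raise ValueError(f"no seqid found in {name!r}")
    return seqid


def missing_from_metadata(names, metadata_file):
    """List seqids derived from names that are absent from the metadata CSV."""
    reader = csv.DictReader(metadata_file)
    if "seqid" not in (reader.fieldnames or []):
        raise ValueError("metadata must contain a 'seqid' column")
    known = {row["seqid"] for row in reader}
    return sorted({seqid_from_name(n) for n in names} - known)


# Hypothetical sequence names and metadata content.
names = ["ebola.SEQ001.1", "ebola.SEQ002.1"]
metadata = io.StringIO("seqid,group\nSEQ001.1,A\n")
print(missing_from_metadata(names, metadata))
```

Running a check like this before building the model can surface naming mismatches early.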

Example use cases

Build PO-MSA using default settings (transform to DAG, download missing nucleotides from NCBI) and save to a .po file:

python3 -m pangtreebuild --multialignment data/Ebola/input/multialignment.maf -output_po

will produce:

pangenome.json
poagraph.po

Generate a Consensus Tree using metadata, detailed logging, and default algorithm settings:

python3 -m pangtreebuild --multialignment data/Ebola/input/multialignment.maf --metadata data/Ebola/input/metadata.csv -consensus tree -v

will produce:

pangenome.json
details.log
consensus/
- tresholds.csv
- .po files from internal calls to poa software
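
When these runs need to be scripted, the command line can be assembled in Python and passed to subprocess. The sketch below uses only flags listed in the argument table; the helper name build_command and the output directory "out" are hypothetical, and the actual call is left commented out because it requires pangtreebuild to be installed.

```python
import subprocess
import sys


def build_command(multialignment, metadata=None, output_dir=None, verbose=False):
    """Assemble a pangtreebuild command line from documented flags."""
    cmd = [sys.executable, "-m", "pangtreebuild",
           "--multialignment", multialignment]
    if metadata:
        cmd += ["--metadata", metadata]
    if output_dir:
        cmd += ["--output_dir", output_dir]
    if verbose:
        cmd.append("--verbose")
    return cmd


cmd = build_command(
    "data/Ebola/input/multialignment.maf",
    metadata="data/Ebola/input/metadata.csv",
    output_dir="out",  # hypothetical output directory
    verbose=True,
)
print(" ".join(cmd))

# Uncomment to run the tool (requires pangtreebuild to be installed):
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with paths that contain spaces.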

Tests

python3 -m unittest discover -s tests -t . -p tests_*

Authors

This software is developed with the support of the OPUS 11 scientific project of the National Science Centre: Incorporating genomic variation information into DNA sequencing data analysis.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Bibliography

1. The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, Volume 19, Issue 1, January 2018, Pages 118–135.
2. C. Lee. Generating consensus sequences from partial order multiple sequence alignment graphs. Bioinformatics, Volume 19, Issue 8, 22 May 2003, Pages 999–1008.
