PangtreeBuild

This repository contains a tool for multiple sequence alignment analysis. It implements the idea of a pan-genome (Ref. 1) by representing the multialignment as a PO-MSA structure (Partial Order Alignment Graph, Ref. 2). The main purpose of this software is to construct a Consensus Tree: a phylogenetic-like tree with an agreed sequence (consensus sequence) assigned to each node.

Getting Started

Prerequisites

Running:

Testing:

Installing

python3 setup.py install

Quick installation check

The command below builds a pan-genome model for an example alignment of 160 Ebola virus sequences and saves it to a JSON file.

python3 -m pangtreebuild --multialignment data/Ebola/input/multialignment.maf

Usage

  1. Import the pangtreebuild package into your Python program and use it as described in the documentation.

or

  2. Use pangtreebuild from the command line with the following arguments:

python3 -m pangtreebuild [args]

| Name | CLI | Required | Description |
| --- | --- | --- | --- |
| Arguments affecting PO-MSA construction: | | | |
| MULTIALIGNMENT | --multialignment | Yes | Path to the multialignment file (.maf or .po). |
| METADATA | --metadata | No | Optional information about sequences in csv format. The only required column is 'seqid'; its values must match the multialignment file identifiers as described in Sequence Naming Convention (below). Example: data/Ebola/input/metadata.csv |
| RAW_MAF | --raw_maf | No, default=False | Build PO-MSA without transforming the multialignment (MAF file) to a DAG. A PO-MSA built in this way does not reflect real-life sequences. |
| FASTA_PROVIDER | --fasta_provider | No | Source of nucleotides if any residues are missing from the multialignment. Possible values: 'ncbi', 'file'. If not specified, MISSING_SYMBOL is used. |
| MISSING_SYMBOL | --missing_symbol | No, default='?' | Symbol for missing nucleotides, used if no FASTA_PROVIDER is given. |
| CACHE | --cache | No, default='Yes' | If True, sequences downloaded from NCBI are stored on the local disk and reused between program calls. Used only if FASTA_PROVIDER is 'ncbi'. |
| FASTA_FILE | -fasta_source_file | Yes if FASTA_PROVIDER='FILE' | Path to a fasta file or zipped fasta files with the whole sequences present in the multialignment. Used only if FASTA_PROVIDER is 'FILE'. |
| Arguments affecting Consensus Tree construction: | | | |
| CONSENSUS | -consensus | No | Possible values: 'TREE' (default algorithm, described in Documentation.md), 'POA' (simplified version, based solely on Ref. 2). |
| BLOSUM | --blosum | No, default=bin\blosum80.mat | Path to the BLOSUM file. The BLOSUM file must include MISSING_SYMBOL. |
| HBMIN | --hbmin | No, default=0.9 | 'POA' parameter. The minimum compatibility of a sequence with the generated consensus. |
| STOP | --stop | No, default=0.99 | 'TREE' parameter. Minimum value of compatibility in tree leaves. |
| P | -p | No, default=1 | 'TREE' parameter. It changes the linear meaning of compatibilities during cutoff finding, because the compatibilities are raised to the power of P. Compatibilities lie in [0, 1], so for P between 0 and 1 this increases the distances between small compatibilities and decreases the distances between large ones. For P > 1 it decreases the distances between small compatibilities and increases the distances between large ones. |
| Arguments affecting output generation: | | | |
| OUTPUT_DIR | --output_dir, -o | No, default=timestamped folder in the current working directory | Output directory path. |
| VERBOSE | --verbose, -v | No, default=False | Set if detailed log files must be produced. |
| QUIET | --quiet, -q | No, default=False | Set to turn off console logging. |
| FASTA | --output_fasta | No, default=False | Set to create fasta files with consensuses. |
| PO | -output_po | No, default=False | Set to create a po file with the multialignment (without consensuses). |
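
As a quick numerical illustration of the P parameter, the sketch below raises a few made-up compatibility values from [0, 1] to different powers and prints how the gap between neighbouring values changes. The function name and the sample values are illustrative assumptions, not part of pangtreebuild, and the snippet does not reproduce the cutoff search itself.

```python
def gaps_after_power(compatibilities, p):
    """Raise each compatibility to the power p and return the gaps
    between consecutive values (input is expected to be sorted)."""
    powered = [c ** p for c in compatibilities]
    return [b - a for a, b in zip(powered, powered[1:])]


# Hypothetical compatibilities: one pair near 0, one pair near 1.
small_pair = [0.1, 0.2]
large_pair = [0.8, 0.9]

for p in (0.5, 1, 2):
    small_gap = gaps_after_power(small_pair, p)[0]
    large_gap = gaps_after_power(large_pair, p)[0]
    print(f"P={p}: gap between small values={small_gap:.4f}, "
          f"gap between large values={large_gap:.4f}")
```

Comparing the printed gaps for P=0.5 and P=2 against the P=1 row shows which end of the compatibility range gets stretched by a given P.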

Sequence Naming Convention

[anything before first dot is ignored].[everything after first dot (also other dots) is interpreted as seqid]
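
The convention above can be sketched in a few lines of Python. This is a minimal illustration, not code from pangtreebuild: the function names, the sample sequence names and the decision to raise an error when a name has no dot are all assumptions made for the example. The second helper shows how the extracted seqids could be compared against the 'seqid' column of a metadata csv.

```python
import csv
import io


def seqid_from_name(name):
    """Return everything after the first dot of a sequence name.

    Raising ValueError for names without a dot is an assumption;
    the convention itself does not say what happens in that case.
    """
    _prefix, dot, seqid = name.partition(".")
    if not dot or not seqid:
        raise ValueError(f"no seqid found in sequence name: {name!r}")
    return seqid


def seqids_missing_from_metadata(names, metadata_csv_text):
    """List seqids derived from names that have no row in the metadata csv."""
    reader = csv.DictReader(io.StringIO(metadata_csv_text))
    known = {row["seqid"] for row in reader}
    seqids = [seqid_from_name(name) for name in names]
    return [seqid for seqid in seqids if seqid not in known]


# Made-up names and metadata, for illustration only.
names = ["ebola.KM034562.1", "ebola.KJ660346.2"]
metadata = "seqid,country\nKM034562.1,Sierra Leone\n"

print(seqid_from_name(names[0]))
print(seqids_missing_from_metadata(names, metadata))
```

Note that later dots are kept, so a versioned accession such as the hypothetical "KM034562.1" survives intact as the seqid.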

Example use cases

  1. Build PO-MSA using default settings (transform to DAG, download missing nucleotides from NCBI) and save it to a .po file:
python3 -m pangtreebuild --multialignment data/Ebola/input/multialignment.maf -output_po

will produce:

  • pangenome.json
  • poagraph.po
  2. Generate a Consensus Tree using metadata, detailed logging and default algorithm settings:
python3 -m pangtreebuild --multialignment data/Ebola/input/multialignment.maf --metadata data/Ebola/input/metadata.csv -consensus tree -v

will produce:

  • pangenome.json
  • details.log
  • consensus/
    • tresholds.csv
    • .po files from internal calls to poa software
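
Command-line calls like the ones above can also be scripted from Python. The following is a rough sketch that only uses flags listed in the arguments table (--multialignment, --metadata, --output_dir, -v). The helper names are hypothetical, and it assumes that pangenome.json is written directly inside the chosen output directory, which should be checked against an actual run.

```python
import json
import subprocess
import sys
from pathlib import Path


def build_command(maf_path, metadata_path, output_dir):
    """Assemble the argument list for one pangtreebuild run."""
    return [
        sys.executable, "-m", "pangtreebuild",
        "--multialignment", str(maf_path),
        "--metadata", str(metadata_path),
        "--output_dir", str(output_dir),
        "-v",
    ]


def run_and_load(maf_path, metadata_path, output_dir):
    """Run pangtreebuild and load the resulting pan-genome model.

    Assumption: pangenome.json lands directly in output_dir.
    """
    command = build_command(maf_path, metadata_path, output_dir)
    subprocess.run(command, check=True)  # raises if the tool exits with an error
    with open(Path(output_dir) / "pangenome.json") as handle:
        return json.load(handle)


if __name__ == "__main__":
    model = run_and_load(
        "data/Ebola/input/multialignment.maf",
        "data/Ebola/input/metadata.csv",
        "output",
    )
    print(type(model))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems when paths contain spaces.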

Tests

python3 -m unittest discover -s tests -t . -p tests_*

Authors

This software is developed with the support of the OPUS 11 scientific project of the National Science Centre: Incorporating genomic variation information into DNA sequencing data analysis.

License

This project is licensed under the MIT License - see the LICENSE.md file for details

Bibliography

  1. The Computational Pan-Genomics Consortium. Computational pan-genomics: status, promises and challenges. Briefings in Bioinformatics, Volume 19, Issue 1, January 2018, Pages 118–135.

  2. C. Lee. Generating consensus sequences from partial order multiple sequence alignment graphs. Bioinformatics, Volume 19, Issue 8, 22 May 2003, Pages 999–1008.
