This repository contains a tool for multiple sequence alignment analysis. It implements the idea of a pan-genome (Ref. 1) by representing the multialignment as a PO-MSA structure (Partial Order Alignment Graph, Ref. 2). The main purpose of this software is to construct a Consensus Tree: a phylogenetic-like tree with an agreed sequence (consensus sequence) assigned to each node.
Running:
Testing:
python3 setup.py install
The following command builds a pan-genome model for an example alignment of 160 Ebola virus sequences and saves it to a JSON file.
python3 -m pangtreebuild --multialignment data/Ebola/input/multialignment.maf
- Import the pangtreebuild package into your Python program and use it according to the documentation.
or
- Use pangtreebuild from the command line with the following arguments:
python3 -m pangtreebuild [args]
Name | CLI | Required | Description |
---|---|---|---|
Arguments affecting PO-MSA construction: | |||
MULTIALIGNMENT | --multialignment | Yes | Path to the multialignment file (.maf or .po) |
METADATA | --metadata | No | Optional information about sequences in csv format. The only required column is 'seqid'; its values must match the identifiers in the multialignment file, as described in Sequence Naming Convention (below). Example: data/Ebola/input/metadata.csv |
RAW_MAF | --raw_maf | No, default=False | Build PO-MSA without transforming multialignment (MAF file) to DAG. PO-MSA built in this way does not reflect real life sequences. |
FASTA_PROVIDER | --fasta_provider | No | Source of nucleotides if any residues are missing from the multialignment. Possible values: 'ncbi', 'file'. If not specified, MISSING_SYMBOL is used. |
MISSING_SYMBOL | --missing_symbol | No, default='?' | Symbol for missing nucleotides used if no FASTA_PROVIDER is given. |
CACHE | --cache | No, default='Yes' | If True, sequences downloaded from NCBI are stored on the local disk and reused between program calls. Used only if FASTA_PROVIDER is 'ncbi'. |
FASTA_FILE | -fasta_source_file | Yes if FASTA_PROVIDER='file' | Path to a fasta file or zipped fasta files containing the whole sequences present in the multialignment. Used only if FASTA_PROVIDER is 'file'. |
Arguments affecting Consensuses Tree construction: | |||
CONSENSUS | -consensus | No | Possible values: 'TREE' (default algorithm, described in Documentation.md), 'POA' (simplified version, based solely on Ref. 2) |
BLOSUM | --blosum | No, default=bin\blosum80.mat | Path to the blosum file. The blosum file must include MISSING_SYMBOL. |
HBMIN | --hbmin | No, default=0.9 | 'POA' parameter. The minimum value of sequence compatibility to generated consensus. |
STOP | --stop | No, default=0.99 | 'TREE' parameter. Minimum value of compatibility in tree leaves. |
P | -p | No, default=1 | 'TREE' parameter. It changes the linear meaning of compatibilities during cutoff finding because the compatibilities are raised to the power of P. Because compatibilities lie in [0, 1], P between 0 and 1 increases distances between small compatibilities and decreases distances between the bigger ones, while P > 1 decreases distances between small compatibilities and increases distances between the bigger ones. |
Arguments affecting output generation: | |||
OUTPUT_DIR | --output_dir, -o | No, default=timestamped folder in current working directory | Output directory path. |
VERBOSE | --verbose, -v | No, default=False | Set if detailed log files must be produced. |
QUIET | --quiet, -q | No, default=False | Set to turn off console logging. |
FASTA | --output_fasta | No, default=False | Set to create fasta files with consensuses. |
PO | -output_po | No, default=False | Set to create po file with multialignment (without consensuses). |
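To see how the P parameter from the table above reshapes compatibilities, the following sketch raises a few sample values to different powers and prints the gaps between neighbours. The sample values and the helper name `gaps_after_power` are illustrative assumptions. This is not the cutoff search implemented in pangtreebuild, only the power transform that the table describes.

```python
def gaps_after_power(compatibilities, p):
    """Raise each compatibility to the power p and return gaps between sorted neighbours."""
    transformed = sorted(c ** p for c in compatibilities)
    return [b - a for a, b in zip(transformed, transformed[1:])]


if __name__ == "__main__":
    # Hypothetical compatibilities in [0, 1]: two small values and two large ones.
    sample = [0.1, 0.2, 0.8, 0.9]
    for p in (0.5, 1, 2):
        gaps = gaps_after_power(sample, p)
        print(f"P={p}: " + ", ".join(f"{g:.3f}" for g in gaps))
```

Comparing the first and last gap across the three printed rows shows which end of the [0, 1] range each value of P stretches or compresses.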
[anything before first dot is ignored].[everything after first dot (also other dots) is interpreted as seqid]
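The naming convention above amounts to splitting a sequence name at its first dot. A minimal sketch of that rule, together with a check that every resulting seqid appears in the 'seqid' column of a metadata CSV, could look like this. The function names and the sample sequence names are hypothetical and are not part of the pangtreebuild API.

```python
import csv
import io


def extract_seqid(sequence_name):
    """Return everything after the first dot; later dots stay part of the seqid."""
    _prefix, dot, seqid = sequence_name.partition(".")
    if not dot or not seqid:
        raise ValueError(f"no seqid found in {sequence_name!r}")
    return seqid


def missing_from_metadata(sequence_names, metadata_file):
    """List seqids from the multialignment that have no row in the metadata CSV."""
    reader = csv.DictReader(metadata_file)
    if reader.fieldnames is None or "seqid" not in reader.fieldnames:
        raise ValueError("metadata must contain a 'seqid' column")
    known = {row["seqid"] for row in reader}
    return [s for s in map(extract_seqid, sequence_names) if s not in known]


if __name__ == "__main__":
    # Hypothetical names and metadata, for illustration only.
    names = ["ebola.KM034562v1", "ebola.AF086833.2"]
    metadata = io.StringIO("seqid,country\nKM034562v1,Sierra Leone\n")
    print(missing_from_metadata(names, metadata))
```

Running such a check before building the model is one way to catch identifiers that do not line up between the two input files.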
- Build PO-MSA using default settings (transform to DAG, download missing nucleotides from NCBI) and save to a .po file:
python3 -m pangtreebuild --multialignment data/Ebola/input/multialignment.maf -output_po
will produce:
- pangenome.json
- poagraph.po
- Generate Consensuses Tree using metadata, detailed logging and default algorithm settings:
python3 -m pangtreebuild --multialignment data/Ebola/input/multialignment.maf --metadata data/Ebola/input/metadata.csv -consensus tree -v
will produce:
- pangenome.json
- details.log
- consensus/
- tresholds.csv
- .po files from internal calls to poa software
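The command-line examples above can also be driven from a Python script. The sketch below assembles the argument list using only flags listed in the arguments table (`--multialignment`, `--metadata`, `-o`), runs the module with `subprocess`, and loads the resulting pangenome.json. The helper names and the assumption that pangenome.json is written directly inside the output directory are illustrative and should be checked against the documentation.

```python
import json
import subprocess
import sys
from pathlib import Path


def build_command(multialignment, output_dir, metadata=None):
    """Assemble the pangtreebuild command line from flags in the arguments table."""
    cmd = [sys.executable, "-m", "pangtreebuild",
           "--multialignment", str(multialignment),
           "-o", str(output_dir)]
    if metadata is not None:
        cmd += ["--metadata", str(metadata)]
    return cmd


def load_pangenome(output_dir):
    """Parse pangenome.json; assumes it sits directly inside output_dir."""
    with open(Path(output_dir) / "pangenome.json", encoding="utf-8") as handle:
        return json.load(handle)


def run_pangtreebuild(multialignment, output_dir, metadata=None):
    """Run the tool (it must be installed) and return the parsed JSON output."""
    subprocess.run(build_command(multialignment, output_dir, metadata), check=True)
    return load_pangenome(output_dir)
```

With the package installed, a call such as `run_pangtreebuild("data/Ebola/input/multialignment.maf", "out")` would correspond to the quick installation check, with an explicit output directory instead of the default timestamped folder.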
python3 -m unittest discover -s tests -t . -p tests_*
This software is developed with the support of the OPUS 11 scientific project of the National Science Centre: Incorporating genomic variation information into DNA sequencing data analysis.
This project is licensed under the MIT License - see the LICENSE.md file for details.
1. Computational pan-genomics: status, promises and challenges. The Computational Pan-Genomics Consortium. Briefings in Bioinformatics, Volume 19, Issue 1, January 2018, Pages 118–135.
2. Generating consensus sequences from partial order multiple sequence alignment graphs. C. Lee. Bioinformatics, Volume 19, Issue 8, 22 May 2003, Pages 999–1008.