CyBayes

Features

  • Performs MCMC-based phylogenetic inference.
  • Can handle multistate characters.
  • Can handle polymorphism (or synonymous states of a concept).
  • Supports the Jukes-Cantor (JC) and Felsenstein-81 (F81) models.
  • Handles only rooted trees, as is usual in linguistics, so every tree has a root and branch lengths.
  • Represents trees in the style of the phylo class in the ape package of R.

Parameter space moves

  • Performs tree moves using external SPR and NNI.
  • Branch lengths are sampled according to an exponential distribution.
  • Substitution parameters are sampled using Dirichlet and DualSampler moves.

Usage

Compile the code with:

python setup.py build_ext --inplace

 usage: mat_mcmc_gamma.py [-h] [-i INPUT_FILE] [-m MODEL] [-n N_GEN] [-t THIN]
                         [-d DATA_TYPE] [-o OUTPUT_FILE]

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_FILE, --input_file INPUT_FILE
                        Input a file in Phylip format with taxa and characters
                        separated by a TAB character
  -m MODEL, --model MODEL
                        JC/F81/GTR
  -n N_GEN, --n_gen N_GEN
                        Number of generations
  -t THIN, --thin THIN  Number of generations after to print to file
  -d DATA_TYPE, --data_type DATA_TYPE
                        Type of data if it is binary/multistate. Multistate
                        characters should be separated by a space whereas
                        binary need not be. Specify bin for binary and multi
                        for multistate characters or phonetic alignments
  -o OUTPUT_FILE, --output_file OUTPUT_FILE
                        Name of the out file prefix

  

Linguistic datasets are quite different from biological morphological datasets. A Jukes-Cantor model, in which all transition rates are equal, is useful for estimating trees with branch lengths.
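To make the equal-rates idea concrete, here is a minimal sketch of the Jukes-Cantor transition probabilities for k states. The function name jc_transition_matrix is hypothetical and not taken from the cybayes code; the sketch assumes the rate matrix is scaled so that the branch length t is the expected number of changes per character.

```python
import numpy as np

def jc_transition_matrix(t, k):
    """Jukes-Cantor transition probabilities for k states on a branch of length t.

    Assumes t is measured in expected changes per character (hypothetical helper).
    """
    decay = np.exp(-k * t / (k - 1.0))
    same = 1.0 / k + (k - 1.0) / k * decay   # probability of ending in the start state
    diff = (1.0 - decay) / k                 # probability of each other state
    p = np.full((k, k), diff)
    np.fill_diagonal(p, same)
    return p

# Binary characters (k = 2) on a short branch
P = jc_transition_matrix(0.1, 2)
```

Every row of P sums to 1, the matrix is the identity at t = 0, and all entries approach 1/k as t grows.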

Example Usage:

python3 mat_mcmc_gamma.py -i data/narrow.phy -m F81 -n 100000 -t 1000 -d bin -o narrow

Output

Tested with

  • Python3
  • Scipy (0.18.1)
  • Numpy (1.12.0)

cybayes's Issues

Clock trees implementation

Clock trees implementation is something to think about.

Drummond et al. 2002 (http://www.genetics.org/content/161/3/1307.long) is the best paper that has MCMC moves close to what I implemented in this package.

  • Birth-death trees with incomplete sampling and extant languages: Yang and Rannala is the reference for using the tree probability as the prior.
  • Uniform tree prior: Ronquist et al. (2012) is the reference.
  • Coalescent prior: Yang's book chapter on the coalescent process presents the theory. The prior is easy to simulate using exponential waiting times. I will add some code to simulate it.
  • Strict clock: involves adding a single parameter that links the linguistic branch lengths to geographical timescale.

The question is how to generate a time tree. As of now the trees are rooted. However, a time tree would have all the tips at the same level, which will introduce extra complications that might need to be handled. Any ideas on how to sample a starting tree are welcome.
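One possible way to sample a starting time tree is to simulate it from the coalescent with exponential waiting times, as mentioned in the list above. The sketch below is only an illustration and not existing cybayes code: simulate_coalescent is a hypothetical name, trees are plain nested tuples (left, right, height), and time is in coalescent units where each pair of lineages merges at rate 1.

```python
import numpy as np

def simulate_coalescent(tips, rng):
    """Simulate a coalescent tree; all tips sit at height 0 (hypothetical helper)."""
    lineages = list(tips)
    height = 0.0
    heights = []
    while len(lineages) > 1:
        k = len(lineages)
        rate = k * (k - 1) / 2.0                 # number of lineage pairs
        height += rng.exponential(1.0 / rate)    # waiting time to the next merger
        i, j = sorted(int(x) for x in rng.choice(k, size=2, replace=False))
        right = lineages.pop(j)                  # pop the larger index first
        left = lineages.pop(i)
        lineages.append((left, right, height))   # new internal node at this height
        heights.append(height)
    return lineages[0], heights

tree, heights = simulate_coalescent(["a", "b", "c", "d"], np.random.default_rng(0))
```

Because node heights are measured from the tips, the resulting tree is ultrametric by construction. A branch length is the parent's height minus the child's height (0 for tips).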

Add uniform prior

Add uniform prior as in Ronquist et al. (2012). Does not seem difficult to implement.

Node slider

The Node Slider move has to be implemented. Select a node, its parent, and its parent's parent; change the branch length using a multiplier move; and reinsert the parent randomly.
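The multiplier step on its own can be sketched as follows. This is a generic multiplier proposal, not the node slider itself; the function name and the tuning parameter are hypothetical. For a single positive parameter, the Hastings ratio of this proposal is the multiplier m.

```python
import math
import random

def multiplier_proposal(value, tuning=0.5, rng=random):
    """Propose value * m with m = exp(tuning * (u - 0.5)), u ~ Uniform(0, 1).

    Returns the proposed value and the log Hastings ratio log(m).
    """
    u = rng.random()
    m = math.exp(tuning * (u - 0.5))
    return value * m, math.log(m)

new_length, log_hastings = multiplier_proposal(0.2, rng=random.Random(1))
```

The proposed value stays positive, which is why the move suits branch lengths. The random reinsertion of the parent would contribute its own term to the full proposal ratio and is not covered here.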

Multiple aligned datasets

Multiple aligned datasets, preferably with SCA/ASJP encoding, might be useful for testing the phylogenies. Of course, we are interested in checking how the trees emerge from the runs. @Anaphory has some datasets with a small alphabet size to test.

@LinguList Shall I assign this to you?

Node slider proposal ratio

Node slider proposal ratio needs to be fixed. As of now I added a slider edge to keep things working.

Guided SPR

A guided SPR based on cluster similarity. Choose a branch randomly, and also select its parent and child edges. Select the edge whose subtree has the highest similarity to the branch to be pruned and regrafted. This is more guided than the current version, where the parent is attached and tested.

Gamma Dirichlet Prior

A Gamma Dirichlet prior from Rannala et al. (2012) for better tree lengths than the exponential prior being used now.

Clean up repo

Many files are repetitions that need to be thrown out or moved.

Cython Blas vs numpy dot

@robertostling I am using numpy's dot multiplication to multiply a KxK matrix with a KxS matrix. K is in the range of 2 to 40 whereas S is in the range of 500 to 10000. The dot product is the main bottleneck since most of the calls are here:

LL_mat[parent] = p_t[parent,child].dot(ll_mats[child])

I am calling numpy's dot directly from within Cython. Do you think calling the BLAS matrix multiplication function directly would be faster? It would be great if that brought the cost down.
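A quick way to check this outside Cython is to compare np.dot with SciPy's dgemm wrapper on matrices of the sizes mentioned above. The array names and sizes below are made up for illustration. For 2-D float64 arrays, np.dot already dispatches to the BLAS gemm routine that NumPy was built against, so a large difference is not guaranteed.

```python
import numpy as np
from scipy.linalg.blas import dgemm

rng = np.random.default_rng(0)
K, S = 20, 1000                      # illustrative sizes within the ranges above
p_t = rng.random((K, K))             # stand-in for one transition matrix
ll_child = rng.random((K, S))        # stand-in for a child's partial likelihoods

via_dot = p_t.dot(ll_child)
via_blas = dgemm(1.0, p_t, ll_child)  # computes 1.0 * p_t @ ll_child
```

Both calls give the same result, so either can be timed with timeit on real data. Note that dgemm expects Fortran-ordered input and copies C-ordered arrays, which can cost time for large S.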

Tests

Run tests using the binary data of Rama et al. (2018) for five language families.

Metropolis coupled MCMC (MCMCMC)

The next step is to implement MCMCMC for sampling trees. The process essentially consists of a cold chain (n = 1) and hot chains indexed 2 <= n <= N, whose exponents are set to 1/(1 + alpha*(n-1)) with alpha = 0.1.

Briefly, there will be an exchange of states between two chains once in a while following an acceptance ratio given here:
http://bamm-project.org/mc3.html

The question is how to exchange states. Essentially, this is parallel simulated annealing. @robertostling, any thoughts about this? I don't have much experience with MPI programming. Maybe there is a simpler way to achieve this.
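One simpler option, sketched here under assumptions, is to keep all chains in a single process and swap their states in a list, with no MPI. The function names are hypothetical. The sketch assumes chain i targets the posterior raised to the power betas[i], which gives the swap log-ratio (beta_i - beta_j) * (logpost_j - logpost_i).

```python
import math
import random

def chain_exponents(n_chains, alpha=0.1):
    """Exponent 1/(1 + alpha*(n-1)) for chains n = 1..N; n = 1 is the cold chain."""
    return [1.0 / (1.0 + alpha * (n - 1)) for n in range(1, n_chains + 1)]

def log_swap_ratio(beta_i, beta_j, logpost_i, logpost_j):
    """Log acceptance ratio for exchanging the states of chains i and j."""
    return (beta_i - beta_j) * (logpost_j - logpost_i)

def maybe_swap(states, logposts, betas, i, j, rng=random):
    """Swap the states of chains i and j in place with the Metropolis probability."""
    log_r = log_swap_ratio(betas[i], betas[j], logposts[i], logposts[j])
    if log_r >= 0 or rng.random() < math.exp(log_r):
        states[i], states[j] = states[j], states[i]
        logposts[i], logposts[j] = logposts[j], logposts[i]
        return True
    return False

betas = chain_exponents(4)
states, logposts = ["tree_A", "tree_B"], [-10.0, -5.0]
swapped = maybe_swap(states, logposts, betas, 0, 1)
```

In the usage lines the hot chain holds the better state, so the swap is always accepted and the cold chain ends up with tree_B. Here logposts holds unheated log posteriors; only samples from the cold chain (index 0) would be written to the output.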

Change README

Change README to accommodate the change of input file format.

Faster matrix multiplications

Faster matrix multiplications are needed, since profiling suggests that most of the time is spent in matrix multiplications inside numpy's dot function. Is there any way to optimize this, @Anaphory?

I tried Cython's BLAS wrapper with the dgemm function, but the timings are no better than with np.dot. Any feedback is welcome.

Initialize gamma site rates

Always initialize the Gamma site rates' alpha to 1 to prevent underflow errors. This is already done in the cogbayes branch. Fix it for all the other branches.
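For reference, discrete Gamma site rates can be sketched with the median-based discretization below. This is an assumption about the general technique, not a copy of the cybayes code, and discrete_gamma_rates is a hypothetical name. It shows why the starting alpha matters: the lowest category rate shrinks toward zero as alpha gets small.

```python
import numpy as np
from scipy.stats import gamma

def discrete_gamma_rates(alpha, n_cats=4):
    """Median-based discrete Gamma(alpha, alpha) rates, rescaled to mean 1."""
    quantiles = (2 * np.arange(n_cats) + 1) / (2.0 * n_cats)  # category midpoints
    rates = gamma.ppf(quantiles, a=alpha, scale=1.0 / alpha)
    return rates / rates.mean()

rates_at_one = discrete_gamma_rates(1.0)
rates_small_alpha = discrete_gamma_rates(0.1)
```

Comparing the first entries of the two arrays shows how much smaller the slowest rate becomes when alpha drops from 1 to 0.1.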

Rescaling with caching

Implementing rescaling with caching will improve the speed of the simulated tempering chains.
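The rescaling part could look like the sketch below, which only covers the scaling and not the caching. The function name and arguments are hypothetical. Each site column of the partial likelihoods is divided by its maximum and the log of that maximum is accumulated, assuming every column has at least one positive entry.

```python
import numpy as np

def prune_step_rescaled(p_t_edge, child_partials, child_log_scalers):
    """Propagate partials along one edge and rescale each site column.

    Returns the scaled parent partials and the accumulated log scalers.
    """
    parent = p_t_edge.dot(child_partials)   # K x K times K x S
    scalers = parent.max(axis=0)            # one scaler per site
    return parent / scalers, child_log_scalers + np.log(scalers)

child = np.array([[1e-200, 2e-150],
                  [3e-200, 1e-150]])
scaled, log_scalers = prune_step_rescaled(np.eye(2), child, np.zeros(2))
```

The true log values are recovered as np.log(scaled) + log_scalers, so the site log-likelihood at the root only needs the accumulated scalers added back. A node with two children would multiply the two propagated matrices elementwise before rescaling.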

Tests?

Need to write tests to find where cybayes fails. More cleaning of the code to increase efficiency might be required. Maybe @xrotwang can suggest some ways to test the code.

Autotuning

Add autotuning of proposal distribution hyperparameters

Add Extended SPR

Add Extended SPR for a controlled version of random SPR, following Yang's description.

Simulated Annealing in MH algorithm

@robertostling Does it make sense to anneal the likelihood during burn-in? I googled around a bit but did not find much literature. My idea is to anneal the likelihood from low temperature to high temperature during burn-in and then sample at high temperature for the rest of the chain.

Leaning chain

A Leaning chain option should be implemented and tested.
