SOWHAT

DESCRIPTION

sowhat automates the SOWH test, a statistical test of phylogenetic topologies using a parametric bootstrap. It works on amino acid, nucleotide, and binary character state datasets.

A peer-reviewed manuscript describing sowhat is available at Systematic Biology: http://sysbio.oxfordjournals.org/content/early/2015/07/30/sysbio.syv055.abstract

sowhat includes several features that provide flexibility and aid in the interpretation and assessment of SOWH test results, including:

  • The test is performed with the adjustment suggested by Susko 2014 (http://dx.doi.org/10.1093/molbev/msu039).
  • Partitions, including partitions by codon position, can be used.
  • Missing data (gaps in alignment) are propagated from the original dataset to the simulated dataset.
  • Confidence intervals are estimated for the p-value, which helps the investigator assess if a sufficient number of bootstrap replicates have been sampled.

sowhat is in active development. Please use with caution. We appreciate hearing about your experience with the program via the issue tracker.

AVAILABILITY

https://github.com/josephryan/sowhat (click the "Download ZIP" button at the bottom of the right column).

INSTALLATION

DOCKER

sowhat is available in a Docker container (thanks to @xqua for troubleshooting). To pull an image with sowhat and the required dependencies, use the following:

docker pull shchurch/sowhat

QUICK

To install sowhat and documentation, use the following:

perl Makefile.PL
make
make test
sudo make install

To install without root privileges, try:

perl Makefile.PL PREFIX=/home/myuser/scripts
make
make test
make install

INSTALL WITH DEPENDENCIES

You can install sowhat and all the required dependencies listed below on a clean Ubuntu 15.04 machine with the following commands (executables will be placed in /usr/local/bin):

sudo apt-get update
sudo apt-get install -y r-base-core cpanminus unzip gcc git
sudo cpanm Statistics::R
sudo cpanm JSON
sudo Rscript -e "install.packages('ape', dependencies = T, repos='http://cran.rstudio.com/')"
cd ~
git clone https://github.com/josephryan/sowhat.git
cd sowhat/
# To work on the development branch (not recommended) execute: git checkout -b Development origin/Development
sudo ./build_3rd_party.sh
perl Makefile.PL
make
make test
sudo make install

Note that build_3rd_party.sh installs some dependencies from versions that are cached in this repository. They may be out of date.

Additional information on system requirements and dependencies is provided below.

EXAMPLE ANALYSES

Several test datasets are provided in the examples/ directory. To run example analyses on these datasets, execute:

./examples.sh

See examples.sh and the resulting test.output/ directory for more on the specifics of sowhat use.

Warning: Some of the examples take a long time (especially those that use Garli). For a quick example, run make test and see the output in the test.output/ directory.

GETTING STARTED

Preparation

1. Alignment (DNA, AA, or binary characters)

Format: non-interleaved PHYLIP format

This can be DNA, amino acid, or binary characters. Often, you would have performed phylogenetic analyses on this alignment and recovered a result that was in conflict with an a priori hypothesis.
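As an illustration of the expected layout, the following sketch writes a tiny non-interleaved PHYLIP file and checks that the header agrees with the rows. The taxon names, sequences, file name, and the check_phylip helper are made up for this example and are not part of sowhat.

```shell
# Write a tiny non-interleaved PHYLIP alignment (hypothetical taxa and sequences).
# Header: number of taxa, then number of sites. Each following line holds one
# taxon name and its complete sequence on a single line.
workdir=$(mktemp -d)
cat > "$workdir/example.phy" <<'EOF'
4 12
A ACGTACGTACGT
B ACGTACGTACGA
C ACGTTCGTACGA
D ACG-ACGTACGT
EOF

# Sanity check (hypothetical helper): the header must match the number of
# rows, and every sequence must have the declared length.
check_phylip() {
    awk 'NR == 1 { ntax = $1; nsites = $2; next }
         NF == 2 { rows++; if (length($2) != nsites) bad++ }
         END { if (rows == ntax && bad == 0) print "ok"
               else { print "mismatch"; exit 1 } }' "$1"
}

check_phylip "$workdir/example.phy"
```

The gap character in taxon D counts as a site, so all four sequences have the 12 sites declared in the header.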

2. Constraint tree

Format: Newick format

The constraint tree represents a hypothesis that you would like to compare to the ML tree or some alternative hypothesis. In most cases you will want a tree that is mostly unresolved except for the clade being tested.

For example, if your ML tree showed a sister relationship between two taxa 'A' and 'B', and you want to compare this result to a topology with a sister relationship between 'A' and 'C', you would create the following constraint tree:

((A,C),B,D,E,F);

Note that the relationship among B, D, E, and F is unresolved.
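The constraint tree is passed to sowhat as a file. A minimal sketch of saving the example tree and running a basic sanity check on it is shown here. The file name and the check_newick helper are assumptions for illustration, and the check only catches gross errors such as a missing parenthesis or semicolon.

```shell
# Save the example constraint tree to a file (hypothetical file name).
workdir=$(mktemp -d)
printf '%s\n' '((A,C),B,D,E,F);' > "$workdir/constraint.tre"

# Basic Newick sanity check (hypothetical helper): the counts of '(' and ')'
# match, there is at least one clade, and the tree ends with ';'.
check_newick() {
    tree=$(tr -d '[:space:]' < "$1")
    opens=$(printf '%s' "$tree" | tr -cd '(' | wc -c | tr -d ' ')
    closes=$(printf '%s' "$tree" | tr -cd ')' | wc -c | tr -d ' ')
    case "$tree" in
        *';') [ "$opens" -eq "$closes" ] && [ "$opens" -gt 0 ] ;;
        *) return 1 ;;
    esac
}

if check_newick "$workdir/constraint.tre"; then
    echo "constraint tree looks well formed"
fi
```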

3. RAxML model

The only other required parameter when using RAxML is

--raxml_model

This option can specify any of the models that are available to RAxML. Running sowhat with the option --raxml_model=available will provide a list of all possible models.

Other RAxML parameters (including number of threads) can be specified with the option:

--rax

for example:

--rax='/usr/local/bin/raxmlHPC-PTHREADS -T 20'

Running sowhat

See examples.sh for examples of sowhat commands.

By default sowhat samples 1000 bootstrap replicates. This can be adjusted with --reps=[sample size]. A sufficient sample size can be assessed by checking the reported confidence interval around the p-value.
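Putting the pieces together, a basic RAxML-based run might be assembled as follows. This is a sketch that only prints the command. The file names, report name, and output directory are hypothetical, and GTRGAMMA is used simply as one example of a RAxML nucleotide model.

```shell
# Hypothetical input files and output directory; replace with your own.
aln="alignment.phy"
constraint="constraint.tre"
outdir="sowhat_output"

# Build the command from the options described above. Run sowhat with
# --raxml_model=available to list the models you can use in place of GTRGAMMA.
set -- sowhat \
    --constraint="$constraint" \
    --aln="$aln" \
    --raxml_model=GTRGAMMA \
    --reps=500 \
    --name=example_test \
    --dir="$outdir"

# Dry run: print the command for review.
# To start the analysis, use "$@" on its own line instead of echo "$@".
echo "$@"
```

After the run, check the confidence interval reported around the p-value to judge whether the chosen --reps value was large enough.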

Examining the results

The results of the SOWH test are included in a file called sowhat.results.txt, which can be found in the directory specified with the --dir option.

At the bottom of sowhat.results.txt is a p-value: the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one calculated from the empirical data.

A run that has been cut short can be restarted using the --restart option. In this case the null distribution is recalculated iteratively from the previously simulated samples. Only the two most recent generations of sequence simulation and tree estimation are repeated, to guard against errors from an unfinished tree estimation.

Additional outputs include:

  • detailed information on the model used for simulating new alignments, in sowhat.model.txt
  • information on the null distribution, in sowhat.distribution.txt
  • the trace file for the run, in sowhat.trace.txt
  • program files, in the directory sowhat_scratch. Within this directory, the files ending in ...i.0.0 represent the initial search of the empirical alignment file.

Results can be printed to a file sowhat.results.json using the option

--json

Running large analyses

The SOWH test can take a lot of time, especially on datasets where a single tree search can take many hours. Threads can be passed to RAxML as described above with the --rax option, which can speed up the tree searches considerably.

In some cases, though, the user may want to further parallelize the SOWH test. The following option allows a user to run the tree searches on simulated datasets simultaneously, for example on a cluster.

To use this option, you must specify the following options:

--print_tree_scripts --reps=[sample size, default=1000]

The initial two tree searches on the observed data will be performed. Subsequently sowhat will generate simulated alignments and print a series of scripts to execute the tree searches to the folder [--dir]/sowhat_scratch/tree_scripts/.

Each of these scripts must be executed externally, and can be run simultaneously. After they have all been completed, the user reruns sowhat with the following options:

--print_tree_scripts --reps=[same number of reps] --restart

One note: if the initial sample size is too low (the confidence interval around the p-value indicates that the results are not definitive), the user can generate additional tree scripts by rerunning the sowhat command with the following options:

--print_tree_scripts --reps=[some higher number of reps] --restart

sowhat will not calculate the statistics until the number of tree scripts given by --reps has been executed successfully.
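The external execution step can be sketched as a small shell function that starts every generated script in the background and waits for all of them. The function name is hypothetical, and the sketch assumes that each file in tree_scripts/ is a shell script that can be run with sh from the current directory. On a cluster you would normally submit each script to your scheduler instead.

```shell
# Run every script in [--dir]/sowhat_scratch/tree_scripts/ at the same time.
# Assumptions: the scripts are plain shell scripts, and the machine has enough
# cores and memory to run all of them at once.
run_tree_scripts() {
    scripts_dir="$1/sowhat_scratch/tree_scripts"
    for script in "$scripts_dir"/*; do
        [ -f "$script" ] || continue   # skip if the folder is empty
        sh "$script" &                 # start this tree search in the background
    done
    wait                               # block until every search has finished
}

# Usage, with a hypothetical directory that was passed to --dir:
# run_tree_scripts sowhat_output
```

Once the function returns, rerun sowhat with --print_tree_scripts, the same --reps value, and --restart as described above.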

Additional options

See this page for descriptions of additional options and how to use more complex models.

RUN

   sowhat 
   --constraint=NEWICK_CONSTRAINT_TREE 
   --aln=PHYLIP_ALIGNMENT 
   --name=NAME_FOR_REPORT 
   --dir=DIR 
   [--debug] 
   [--garli=GARLI_BINARY_OR_PATH_PLUS_OPTIONS] 
   [--garli_conf=PATH_TO_GARLI_CONF_FILE] 
   [--help] 
   [--initial] 
   [--json] 
   [--max] 
   [--raxml_model=MODEL_FOR_RAXML] 
   [--nogaps] 
   [--partition=PARTITION_FILE] 
   [--pb=PB_BINARY_OR_PATH_PLUS_OPTIONS] 
   [--pb_burn=BURNIN_TO_USE_FOR_PB_TREE_SIMULATIONS] 
   [--plot] 
   [--ppred=PPRED_BINARY_OR_PATH_PLUS_OPTIONS] 
   [--print_tree_scripts]
   [--rax=RAXML_BINARY_OR_PATH_PLUS_OPTIONS] 
   [--reps=NUMBER_OF_REPLICATES] 
   [--resolved] 
   [--rerun] 
   [--restart]
   [--runs=NUMBER_OF_TESTS_TO_RUN] 
   [--seqgen=SEQGEN_BINARY_OR_PATH_PLUS_OPTIONS] 
   [--treetwo=NEWICK_ALTERNATIVE_TO_CONST_TREE] 
   [--usepb] 
   [--usegarli] 
   [--usegentree=NEWICK_TREE_FOR_SIMULATING_DATA] 
   [--version] 

DOCUMENTATION

Extensive documentation is embedded inside of sowhat in POD format and can be viewed by running any of the following:

    sowhat --help
    perldoc sowhat
    man sowhat  # available after installation

CITING

A peer-reviewed manuscript describing sowhat is available at Systematic Biology:

Church, Samuel H., Joseph F. Ryan, and Casey W. Dunn. "Automation and Evaluation of the SOWH Test with SOWHAT" Systematic Biology 2015 Nov;64(6):1048-58. doi: 10.1093/sysbio/syv055

Also see the file sowhat.bibtex

FURTHER READING

Goldman, Nick, Jon P. Anderson, and Allen G. Rodrigo. "Likelihood-based tests of topologies in phylogenetics." Systematic Biology 49.4 (2000): 652-670. doi:10.1080/106351500750049752

Swofford, David L., Gary J. Olsen, Peter J. Waddell, and David M. Hillis. Phylogenetic inference. (1996): 407-514. http://www.sinauer.com/molecular-systematics.html

SYSTEM REQUIREMENTS

We have tested sowhat on OS X 10.9, OS X 10.10, Ubuntu Server 10.04 (Amazon ami-d05e75b8), and Ubuntu Desktop 15.04. It will likely work on a variety of other Unix-like operating systems.

DEPENDENCIES

The dependencies listed below are required by sowhat. They must be installed and available in the appropriate PATH. If they are not installed already, follow the installation instructions in the links provided for each tool. We have tested sowhat with the indicated dependency versions. Other versions may be incompatible and should be used with caution. These external tools are the result of a considerable amount of work by other investigators. Please also cite them when you cite sowhat.

Phylogenetic programs:

General system tools:

To use more alternative models, you will need to install the following optional dependency:

To print results to a json file, you will need to install the following optional dependency:

  • The JSON Perl module.

COPYRIGHT AND LICENSE

Copyright (C) 2015 Samuel H. Church, Joseph F. Ryan, Casey W. Dunn

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program in the file LICENSE. If not, see http://www.gnu.org/licenses/.

sowhat's Issues

Add more version information to sowhat.results

Currently only the version of RAxML is recorded. It would be helpful to have the following versions recorded in the results:

  1. sowhat version
  2. seq-gen version
  3. PhyloBayes version (if used)
  4. Garli version (if used)

Description of sample datasets

Need to add a description of sample datasets to README.md. These should explain the origins of the datasets, as well as list some basic attributes (number of taxa, number of genes, number of sites). There should also be an indication of which files pertain to which datasets.

sowhat should throw error if reps < 10

"The problem with these arguments is that you have set the maximum number of reps to 2 (--reps=2) - this means that sowhat will generate only two simulated datasets and this is not enough to calculate any statistics, such as a p-value. sowhat won't start printing a sowhat.results.txt file until after the 10th bootstrap replicate to avoid choking on any mathematical errors near the start of the analysis. We strongly recommend each analysis should use 100+ bootstrap replicates, and that the number of bootstraps (the sample size) should be justified by reporting the confidence interval surrounding the p-value"

Possibly include stopping criterion

Stopping criterion would be met after a minimum number of samples (say 100) and when the p-value and the confidence interval fall entirely on the same side of the significance level.
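A minimal sketch of this proposed check, written as a standalone shell function, could look like the following. It is not part of sowhat. The function name, argument order, and example values are assumptions for illustration.

```shell
# should_stop N LOWER UPPER ALPHA [MIN_REPS]
# Succeeds (exit status 0) when at least MIN_REPS replicates (default 100)
# have been sampled and the whole confidence interval [LOWER, UPPER] lies on
# one side of the significance level ALPHA. Because the p-value lies inside
# its own confidence interval, checking the interval bounds is sufficient.
should_stop() {
    awk -v n="$1" -v lo="$2" -v hi="$3" -v alpha="$4" -v min="${5:-100}" '
        BEGIN { exit !(n >= min && (hi < alpha || lo > alpha)) }'
}

# Hypothetical values: 250 replicates, interval 0.001 to 0.012, alpha 0.05.
if should_stop 250 0.001 0.012 0.05; then
    echo "stop sampling"
else
    echo "keep sampling"
fi
```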

json output

At present the output is simple text. This is hard to parse by machine, since the text format may change in unspecified ways from version to version, which in turn makes it difficult to wrap SOWHAT into larger workflows.

We should generate structured output as a json file. This should then be read and formatted as a text output, similar to what we have now. It could also be parsed into an easy-to-read html output.

Failing travis ci due to old R

The TravisCI test is failing: https://travis-ci.org/josephryan/sowhat/builds/221453597

The problem seems to be:

  • The test image is based on Ubuntu 14.04.5 LTS
  • sudo apt-get install -y r-base installs R 3.0.2
  • The ape package is unavailable for this older R, throwing the warning package 'ape' is not available (for R version 3.0.2)
  • Results in the error library(ape) : there is no package called 'ape'

We should figure out how to install a more recent version of R on this container or use a different test image that allows installation of newer R.

Add true support for multistate

Will require running seq-gen in aminoacid mode and up to 20 substitutions. multistate currently fails with: "expecting 2 frequencies. Multi-State only works w/binary matrix".

instructions for monitoring a job - and cutting a job short

SOWH tests can take a long time on large datasets. Here are some ideas on how to monitor a job and how to cut a job short. These should be considered (and tested) for being added to the documentation:

  1. Monitoring a job
    The following command can be used to monitor a job (only if --runs=1, the default) when run from within the directory that was specified with the --dir option:

    ls -1 sowhat_scratch/ | cut -f 4 -d '.' | sort -n | tail -n1
  2. Cutting a job short
    If the job is currently running, make a copy of the directory that was specified with the --dir option (and its contents) :

    cp -R myoutputdir rerunoutputdir

    run the exact same command as before, but with this new directory as the --dir option, and

    --reps=SMALLER_NUMBER_OF_REPS --restart

treetwo is better

This message "Constraint_X is more likely than Y, X will be used as the constraint tree instead" is only printed to STDERR. This information should also go in the results file. In general there should be more details about which tree is which in the results file.

Improve Partition Feature Functionality

Only simple partitioning schemes are currently allowed.
Improve feature to allow partitioning schemes that cover multiple non-contiguous regions of an alignment, that are listed out of order, or that divide by codon position.

use IQtree for ML searches

IQ-TREE may offer advantages over RAxML and Garli. SOWHAT will need to read the required output from the results files (base frequencies, transition rates, likelihood).
