Program to run the SOWH test (likelihood-based test used to compare tree topologies which are not specified a priori)

Home Page: http://sysbio.oxfordjournals.org/content/64/6/1048

License: GNU General Public License v3.0

Perl 78.34% Shell 19.78% Dockerfile 0.42% TeX 1.47%

SOWHAT

DESCRIPTION
AVAILABILITY
INSTALLATION
EXAMPLE ANALYSES
GETTING STARTED
RUN
DOCUMENTATION
CITING
FURTHER READING
COPYRIGHT AND LICENSE

DESCRIPTION

sowhat automates the SOWH test, a statistical test of phylogenetic topologies using a parametric bootstrap. It works on amino acid, nucleotide, and binary character state datasets.

A peer-reviewed manuscript describing sowhat is available at Systematic Biology: http://sysbio.oxfordjournals.org/content/early/2015/07/30/sysbio.syv055.abstract

sowhat includes several features that provide flexibility and aid in the interpretation and assessment of SOWH test results, including:

The test is performed with the adjustment suggested by Susko 2014 (http://dx.doi.org/10.1093/molbev/msu039).
Partitions, including partitions by codon position, can be used.
Missing data (gaps in alignment) are propagated from the original dataset to the simulated dataset.
Confidence intervals are estimated for the p-value, which helps the investigator assess if a sufficient number of bootstrap replicates have been sampled.

sowhat is in active development. Please use with caution. We appreciate hearing about your experience with the program via the issue tracker.

AVAILABILITY

https://github.com/josephryan/sowhat (click the "Download ZIP" button at the bottom of the right column).

INSTALLATION

DOCKER

sowhat is available in a Docker container (thanks to @xqua for troubleshooting). To pull an image with sowhat and the required dependencies, use the following:

docker pull shchurch/sowhat

CONDA

Create a conda environment called sowhat

conda create -n sowhat
conda activate sowhat
conda install -c conda-forge perl-app-cpanminus
conda install -c bioconda seq-gen
cpanm Statistics::R

Download a fresh distribution, cd to the directory (e.g., sowhat-1.0), and install sowhat.

cd sowhat-1.0/
perl Makefile.PL
make
make test
make install

You will need to activate this environment (conda activate sowhat) whenever you want to run sowhat.

QUICK

To install sowhat and documentation, use the following:

perl Makefile.PL
make
make test
sudo make install

To install without root privileges, try:

perl Makefile.PL PREFIX=/home/myuser/scripts
make
make test
make install

INSTALL WITH DEPENDENCIES

You can install sowhat and all of its required dependencies on a clean Ubuntu 15.04 machine with the following commands (executables will be placed in /usr/local/bin):

sudo apt-get update
sudo apt-get install -y r-base-core cpanminus unzip gcc git
sudo cpanm Statistics::R
sudo cpanm JSON
sudo Rscript -e "install.packages('ape', dependencies = T, repos='http://cran.rstudio.com/')"
cd ~
git clone https://github.com/josephryan/sowhat.git
cd sowhat/
# To work on the development branch (not recommended) execute: git checkout -b Development origin/Development
sudo ./build_3rd_party.sh
perl Makefile.PL
make
make test
sudo make install

Note that build_3rd_party.sh installs some dependencies from versions that are cached in this repository. They may be out of date.

Additional information on system requirements and dependencies is listed below.

EXAMPLE ANALYSES

Several test datasets are provided in the examples/ directory. To run example analyses on these datasets, execute:

./examples.sh

See examples.sh and the resulting test.output/ directory for more on the specifics of sowhat use.

Warning: Some of the examples take a long time (especially those that use Garli). For a quick example, run make test and see the output in the test.output directory.

GETTING STARTED

Preparation

1. Alignment (DNA, AA, or binary characters)

Format: non-interleaved PHYLIP format

This can be DNA, amino acid, or binary characters. Often, you would have performed phylogenetic analyses on this alignment and recovered a result that was in conflict with an a priori hypothesis.

2. Constraint tree

Format: Newick format

The constraint tree represents a hypothesis that you would like to compare to the ML tree or some alternative hypothesis. In most cases you will want a tree that is mostly unresolved except for the clade being tested.

For example, if your ML tree showed a sister relationship between two taxa 'A' and 'B' and you want to compare this result to a topology with a sister relationship between 'A' and 'C', you would create the following constraint tree:

((A,C),B,D,E,F);

Note that the relationships among B, D, E, and F are left unresolved.
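The constraint tree above can also be assembled programmatically. The following is a minimal sketch, not part of sowhat: the helper name constraint_newick is hypothetical, and it assumes taxon names that need no Newick quoting.

```python
def constraint_newick(clade, other_taxa):
    """Build a Newick constraint tree: `clade` is resolved as a group,
    every taxon in `other_taxa` is attached to an unresolved root."""
    if len(clade) < 2:
        raise ValueError("the tested clade needs at least two taxa")
    if not other_taxa:
        raise ValueError("at least one taxon must remain outside the clade")
    overlap = set(clade) & set(other_taxa)
    if overlap:
        raise ValueError(f"taxa listed twice: {sorted(overlap)}")
    return "(({}),{});".format(",".join(clade), ",".join(other_taxa))


# Reproduces the example tree from the text.
print(constraint_newick(["A", "C"], ["B", "D", "E", "F"]))
```

The result can be written to a file and passed to sowhat with --constraint.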

3. RAxML model

The only other required parameter when using RAxML is

--raxml_model

This option can specify any of the models that are available to RAxML. Running sowhat with the option --raxml_model=available will provide a list of all possible models.

Other RAxML parameters (including number of threads) can be specified with the option:

--rax

for example:

--rax='/usr/local/bin/raxmlHPC-PTHREADS -T 20'
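When launching sowhat from a script, the options described so far can be collected into an argument list. This is only a sketch: the function name sowhat_command, the file names, and the choice of GTRGAMMA as the model are illustrative assumptions, while the option names are the ones documented in this README.

```python
import shlex


def sowhat_command(aln, constraint, name, out_dir, raxml_model, rax=None, reps=None):
    """Assemble a sowhat argument list from the documented options."""
    argv = [
        "sowhat",
        f"--constraint={constraint}",
        f"--aln={aln}",
        f"--name={name}",
        f"--dir={out_dir}",
        f"--raxml_model={raxml_model}",
    ]
    if rax is not None:
        # The RAxML binary plus its options travel as one argument.
        argv.append(f"--rax={rax}")
    if reps is not None:
        argv.append(f"--reps={reps}")
    return argv


# Hypothetical input files and output directory.
argv = sowhat_command(
    aln="alignment.phy",
    constraint="constraint.tre",
    name="test_AC",
    out_dir="sowhat_out",
    raxml_model="GTRGAMMA",
    rax="/usr/local/bin/raxmlHPC-PTHREADS -T 20",
)
# Print a shell-quoted version that can be pasted into a terminal.
print(shlex.join(argv))
```

Passing the list to subprocess.run(argv) avoids shell quoting problems with the space inside the --rax value.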

Running sowhat

See examples.sh for examples of sowhat commands.

By default sowhat samples 1000 bootstrap replicates. This can be adjusted with --reps=[sample size]. A sufficient sample size can be assessed by checking the reported confidence interval around the p-value.
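To illustrate why the confidence interval indicates whether the sample size is sufficient, here is a minimal sketch of a p-value with a binomial confidence interval. It assumes the test statistic is a difference in log-likelihood and uses a Wilson score interval; the interval method sowhat itself uses is not stated here, so treat this as an illustration rather than a reimplementation.

```python
import math


def p_value_with_ci(null_deltas, observed_delta, z=1.96):
    """Return (p, lower, upper): the fraction of simulated test statistics
    at least as large as the observed one, with a Wilson score interval."""
    n = len(null_deltas)
    if n == 0:
        raise ValueError("the null distribution is empty")
    k = sum(1 for d in null_deltas if d >= observed_delta)
    p = k / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return p, max(0.0, centre - half), min(1.0, centre + half)


# Toy null distributions (made-up numbers): same proportion, different sample size.
small = [0.5] * 95 + [10.0] * 5
large = [0.5] * 950 + [10.0] * 50
print(p_value_with_ci(small, observed_delta=5.0))
print(p_value_with_ci(large, observed_delta=5.0))
```

With the same proportion of extreme replicates, the larger sample gives a narrower interval. If the interval still straddles the chosen significance level, more replicates are needed.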

Examining the results

The results of the SOWH test are included in a file called sowhat.results.txt, which can be found in the directory specified with the --dir option.

At the bottom of sowhat.results.txt is a p-value: the probability, under the null hypothesis, of observing a test statistic at least as extreme as the one calculated from the original data.

A run that has been cut short can be restarted using the --restart option. In this case the null distribution will be recalculated using the previously simulated samples. Only the two most recent rounds of sequence simulation and tree estimation will be repeated, to guard against errors from an unfinished tree estimation.

Additional outputs include

detailed information on the model used for simulating new alignments in the file sowhat.model.txt
information on the null distribution in sowhat.distribution.txt
the trace file for the run is printed to sowhat.trace.txt
program files printed to a directory sowhat_scratch. Within this directory, the files ending in ...i.0.0 represent the initial search of the empirical alignment file.

Results can be printed to a file sowhat.results.json using the option

--json

Running large analyses

The SOWH test can take a long time, especially on datasets where a single tree search takes many hours. RAxML can be run with multiple threads via the --rax option, as described above, which can speed up the tree searches considerably.

In some cases, though, the user may want to further parallelize the SOWH test. The following option allows a user to run the tree searches on simulated datasets simultaneously, for example on a cluster.

To use this option, you must specify the following options:

--print_tree_scripts --reps=[sample size, default=1000]

The initial two tree searches on the observed data will be performed. Subsequently sowhat will generate simulated alignments and print a series of scripts to execute the tree searches to the folder [--dir]/sowhat_scratch/tree_scripts/.

Each of these scripts must be executed externally, and can be run simultaneously. After they have all been completed, the user reruns sowhat with the following options:

--print_tree_scripts --reps=[same number of reps] --restart

One note: if the initial sample size is too low (the confidence interval around the p-value indicates that the results are not definitive), the user can generate additional tree scripts by rerunning the sowhat command with the following options:

--print_tree_scripts --reps=[some higher number of reps] --restart

sowhat will not calculate the statistics until as many tree scripts as specified by --reps have been executed successfully.
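On a single multi-core machine, the generated scripts could be executed concurrently with a small driver such as the one below. This is a sketch under assumptions: it treats every file in the tree_scripts folder as a plain shell script runnable with sh, and the function name run_tree_scripts is hypothetical. On a cluster you would submit each script to your scheduler instead.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_tree_scripts(script_dir, max_workers=4):
    """Run every script in `script_dir` with sh, `max_workers` at a time.
    Returns a dict of {script name: exit code} for scripts that failed."""
    scripts = sorted(p for p in Path(script_dir).iterdir() if p.is_file())

    def run(script):
        done = subprocess.run(["sh", str(script)], capture_output=True)
        return script.name, done.returncode

    # Threads are sufficient here: each worker only waits on a child process.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = dict(pool.map(run, scripts))
    return {name: code for name, code in results.items() if code != 0}


# Example call (the path is hypothetical and depends on your --dir value):
# failed = run_tree_scripts("sowhat_out/sowhat_scratch/tree_scripts", max_workers=8)
```

An empty return value means every script exited successfully, at which point sowhat can be rerun with --restart as described above.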

Additional options

See this page for descriptions of additional options and how to use more complex models.

RUN

   sowhat 
   --constraint=NEWICK_CONSTRAINT_TREE 
   --aln=PHYLIP_ALIGNMENT 
   --name=NAME_FOR_REPORT 
   --dir=DIR 
   [--debug] 
   [--garli=GARLI_BINARY_OR_PATH_PLUS_OPTIONS] 
   [--garli_conf=PATH_TO_GARLI_CONF_FILE] 
   [--help] 
   [--initial] 
   [--json] 
   [--max] 
   [--raxml_model=MODEL_FOR_RAXML] 
   [--nogaps] 
   [--partition=PARTITION_FILE] 
   [--pb=PB_BINARY_OR_PATH_PLUS_OPTIONS] 
   [--pb_burn=BURNIN_TO_USE_FOR_PB_TREE_SIMULATIONS] 
   [--plot] 
   [--ppred=PPRED_BINARY_OR_PATH_PLUS_OPTIONS] 
   [--print_tree_scripts]
   [--rax=RAXML_BINARY_OR_PATH_PLUS_OPTIONS] 
   [--reps=NUMBER_OF_REPLICATES] 
   [--resolved] 
   [--rerun] 
   [--restart]
   [--runs=NUMBER_OF_TESTS_TO_RUN] 
   [--seqgen=SEQGEN_BINARY_OR_PATH_PLUS_OPTIONS] 
   [--treetwo=NEWICK_ALTERNATIVE_TO_CONST_TREE] 
   [--usepb] 
   [--usegarli] 
   [--usegentree=NEWICK_TREE_FOR_SIMULATING_DATA] 
   [--version]

DOCUMENTATION

Extensive documentation is embedded inside of sowhat in POD format and can be viewed by running any of the following:

    sowhat --help
    perldoc sowhat
    man sowhat  # available after installation

CITING

A peer-reviewed manuscript describing sowhat is available at Systematic Biology:

Church, Samuel H., Joseph F. Ryan, and Casey W. Dunn. "Automation and Evaluation of the SOWH Test with SOWHAT" Systematic Biology 2015 Nov;64(6):1048-58. doi: 10.1093/sysbio/syv055

Also see the file sowhat.bibtex

COPYRIGHT AND LICENSE

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program in the file LICENSE. If not, see http://www.gnu.org/licenses/.

sowhat's Issues

Come up with an alternative to IPC::Run for the test suite

This does not seem to be installed by default with Perl.

Failing travis ci due to old R

The TravisCI test is failing: https://travis-ci.org/josephryan/sowhat/builds/221453597

The problem seems to be:

The test image is based on Ubuntu 14.04.5 LTS
sudo apt-get install -y r-base installs R 3.0.2
The ape package is unavailable for this older R, throwing the warning package 'ape' is not available (for R version 3.0.2)
Results in the error library(ape) : there is no package called 'ape'

We should figure out how to install a more recent version of R on this container or use a different test image that allows installation of newer R.

add travis-ci integration

Add testing via travis-ci. Dependencies can be installed as described at https://docs.travis-ci.com/user/installing-dependencies/

include sowhat version in results.txt

perhaps also seq-gen version.

escape spaces in pathnames returned from getcwd

this needs to be a system independent method using Perl module

raxml returns higher likelihood for constrained tree than unconstrained tree for some simulated matrices

This is a known issue, and is addressed here:

https://groups.google.com/forum/#!topic/raxml/qn7_ZXoJTHg

We should redo some of the problematic searches with the addition of the --no-bfgs raxml option. If that fixes the problem for those matrices, we should rerun the tests in the manuscript that were impacted by this. If they work, we should revise the manuscript to reflect the fix.

treetwo is better

This message "Constraint_X is more likely than Y, X will be used as the constraint tree instead" is only printed to STDERR. This information should also go in the results file. In general there should be more details about which tree is which in the results file.

Character dataset fails with newest RAxML

RAxML v 8.1.20

sowhat should throw error if reps < 10

"The problem with these arguments is that you have set the maximum number of reps to 2 (--reps=2) - this means that sowhat will generate only two simulated datasets and this is not enough to calculate any statistics, such as a p-value. sowhat won't start printing a sowhat.results.txt file until after the 10th bootstrap replicate to avoid choking on any mathematical errors near the start of the analysis. We strongly recommend each analysis should use 100+ bootstrap replicates, and that the number of bootstraps (the sample size) should be justified by reporting the confidence interval surrounding the p-value"

print $DIR/sowh_trace_$name.txt with output similar to phylobayes .trace files

#cycle #treegen time loglik length alpha nmode stat
1 940 2836 -889000 21.9549 0.585788 4885 0.933128

Possibly include stopping criterion

Stopping criterion would be met after a minimum number of samples (say 100) and when the p-value and the confidence interval fall entirely on the same side of the significance level.
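The proposed criterion can be expressed as a short predicate. This is a sketch of the idea in this issue only, not existing sowhat behaviour; the function name should_stop and the default significance level of 0.05 are assumptions.

```python
def should_stop(n_samples, ci_lower, ci_upper, alpha=0.05, min_samples=100):
    """True once enough replicates are sampled and the whole confidence
    interval of the p-value lies on one side of the significance level."""
    if n_samples < min_samples:
        return False
    # The p-value lies inside its own interval, so testing the interval
    # bounds is enough to cover the p-value as well.
    return ci_upper < alpha or ci_lower > alpha


print(should_stop(50, 0.00, 0.01))   # too few samples so far
print(should_stop(500, 0.03, 0.08))  # interval straddles alpha
print(should_stop(500, 0.20, 0.30))  # interval entirely above alpha
```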

add subroutine tests

see: http://search.cpan.org/~oliver/Test-Subroutines-1.113350/lib/Test/Subroutines.pm

Check that constraint tree, dataset have consistent taxa names

The error that currently is output is a confusing message from RAxML.
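A pre-flight check along these lines could report mismatched names before RAxML is ever called. The sketch below is an illustration, not the sowhat implementation: it assumes unquoted Newick labels and a non-interleaved PHYLIP file in which each taxon name is the first whitespace-delimited token on its line, and all function names are hypothetical.

```python
import re


def newick_taxa(newick):
    """Tip labels of an unquoted Newick string (branch lengths are ignored)."""
    return set(re.findall(r"[(,]\s*([^(),:;\s]+)", newick))


def phylip_taxa(phylip_text):
    """Taxon names from a non-interleaved PHYLIP alignment, one sequence per line."""
    lines = [ln for ln in phylip_text.splitlines() if ln.strip()]
    ntax = int(lines[0].split()[0])
    return {ln.split()[0] for ln in lines[1:1 + ntax]}


def taxa_mismatch(newick, phylip_text):
    """Return (names only in the tree, names only in the alignment)."""
    tree, aln = newick_taxa(newick), phylip_taxa(phylip_text)
    return sorted(tree - aln), sorted(aln - tree)


# Toy data: the tree mentions D, the alignment has C instead.
alignment = "3 4\nA ACGT\nB ACGT\nC ACGT\n"
only_tree, only_aln = taxa_mismatch("((A,B),D);", alignment)
print("only in tree:", only_tree, "only in alignment:", only_aln)
```

Two empty lists mean the constraint tree and the alignment use the same set of names.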

Eat MOAR Twizzlers

Self evident.

instructions for monitoring a job - and cutting a job short

SOWH tests can take a long time on large datasets. Here are some ideas on how to monitor a job and how to cut a job short. These should be considered (and tested) for being added to the documentation:

Monitoring a job
The following command can be used to monitor a job (only if --runs=1, the default) when run from within the directory that was specified with the --dir option:
```
ls -1 sowhat_scratch/ | cut -f 4 -d '.' | sort -n | tail -n1
```
Cutting a job short
If the job is currently running, make a copy of the directory that was specified with the --dir option (and its contents) :
```
cp -R myoutputdir rerunoutputdir
```
run the exact same command as before, but with this new directory as the --dir option, and
```
--reps=SMALLER_NUMBER_OF_REPS --restart
```

Add support for threads (multiCPU)

error in Statistics::R related to a flag sent to R (--gui=none).

My fix was to remove the command-line switch from the system call to R in 'Statistics/R/Bridge/Linux.pm'. We need to at least warn users. Probably should write the author(s) of Statistics::R. Not sure if this affects other versions.

Add tests with more intuitive messages for missing prerequisites

Add more version information to sowhat.results

Currently only the version of RAxML is recorded. It would be helpful to have the following versions recorded in the results:

sowhat version
seq-gen version
PhyloBayes version (if used)
Garli version (if used)

use IQtree for ML searches

IQtree may offer advantages to RAxML, GARLi. SOWHAT will need to read the required output from the results files (base freqs, transition rates, likelihood).

add documentation about using local::lib

Bootstrapping the package in and using that to cpan R

update documentation and usage subroutine to reflect RAxML default

update and clean github documentation

clean up and reprioritize documentation.

mislabeled constrained / unconstrained values in report

the _structure_subroutine needs to be edited so that the report is corrected, by switching the ml and t1 values.

Description of sample datasets

Need to add a description of sample datasets to README.md. These should explain the origins of the datasets, as well as list some basic attributes (number of taxa, number of genes, number of sites). There should also be an indication of which files pertain to which datasets.

Any help is highly appreciated. Here is the error.log.

Cheers Bastian

seq-gen has limited model set. sowhat needs to check models before any processes are run.

more obvious warning when not in sequential phylip

The current warning says only that the taxa don't match up, which makes the real cause difficult to identify.
