The majority of the steps described below make use of the HPC infrastructure at sciCore of the University of Basel. Jobs were submitted to a slurm based compute cluster and the corresponding scripts are given here for reference purposes only (see below).
Artifical test data are generated by
* generating a site specific GTR model by drawing rates for each site from a Gamma distribution and equilibrium frequencies for each site from a Dirichlet distribution.
* generating a Yule tree with the specified number of leaves
* simulating sequence evolution along the tree using the GTR model
* saving the model, alignment, tree, and ancestral sequences for future analysis
These steps are executed using the script generate_toy_data.py
in the directory src
.
The simulated data are analyzed using the script reconstruct_toy_data.py
.
This script infers site-specific models using a number of different approaches of varying complexity.
The results are saved in a pickle file for further analysis.
To compare the ability to correctly infer branch lengths using inferred models, we jointly infer models and branch length.
This step is performed in the script src/calculate_branch_length.py
.
Generating the results below from scratch is computationally expensive and requires 1000s of CPU hours (in 2019).
The output necessary to reproduce the graphs of the paper are therefore provided as a tarball and sorted on github's large file storage.
The file is simulation_results.tar.gz
and the corresponding hash can be found in .gitattributes
.
The following scripts submit the jobs on the sciCore compute cluster used to generate and analyse the toy data.
python submit_gendata.py --nvals 100 --submit python submit_gendata.py --nvals 300 --submit python submit_gendata.py --nvals 1000 --submit python submit_gendata.py --nvals 3000 --submit
python submit_reconstruction.py --date 2019-12-25 --nvals 100 --submit python submit_reconstruction.py --date 2019-12-25 --nvals 300 --submit python submit_reconstruction.py --date 2019-12-25 --nvals 1000 --submit python submit_reconstruction.py --date 2019-12-25 --nvals 3000 --submit
python submit_branch_length.py --nvals 100 --date 2019-12-25 --submit python submit_branch_length.py --nvals 300 --date 2019-12-25 --submit python submit_branch_length.py --nvals 1000 --date 2019-12-25 --submit python submit_branch_length.py --nvals 3000 --date 2019-12-25 --submit
python submit_collect_tree_lengths.py --nval 100 --submit --date 2019-12-25 python submit_collect_tree_lengths.py --nval 300 --submit --date 2019-12-25 python submit_collect_tree_lengths.py --nval 1000 --submit --date 2019-12-25
run src/plot_toy_data.py --prefix 2019-12-25_simulatedData_L1000_ratealphaXXX_freqalpha1.0_nuc_results --nvals 1000
run src/plot_toy_data.py --prefix 2019-12-25_simulatedData_L1000_ratealphaXXX_freqalpha0.2_aa_results --nvals 1000
run src/compare_tree_length.py --nval 300 --treelengths 2019-12-25_simulatedData_L1000_ratealpha1.5_freqalpha1.0_nuc_results/collected_tree_lengths_n300.tsv
run src/compare_tree_length.py --nval 300 --treelengths 2019-12-25_simulatedData_L1000_ratealpha1.5_freqalpha0.2_aa_results/collected_tree_lengths_n300.tsv --aa
run src/model_deviation.py --files 2019-12-25_simulatedData_L1000_ratealpha1.5_freqalpha1.0_nuc_results/ModelDeviation_L1000_n300_m0.* --output figures/model_deviation_n300.pdf
run src/analyze_HIV_tree.py --prefix HIV_data/results/HIV_B_pol --pc 0.001 --gene pol --redo
run src/analyze_HIV_tree.py --prefix HIV_data/results/HIV_B_pol --pc 0.001 --gene pol --redo --aa
run src/analyze_HIV_tree.py --prefix HIV_data/results/HIV_B_nef --pc 0.001 --gene pol --redo run src/analyze_HIV_tree.py --prefix HIV_data/results/HIV_B_gag --pc 0.001 --gene pol --redo
run src/HIV_branch_length.py --prefix HIV_data/results/HIV_B_pol --pc 0.01
run src/HIV_branch_length.py --prefix HIV_data/results/HIV_B_pol --pc 0.01 --aa