
Asaph

Software for population genetics analysis of SNPs from insect species

Getting Asaph

To download Asaph, all you need to do is clone the git repository:

$ git clone https://github.com/rnowling/asaph.git

Asaph requires that numpy, scipy, matplotlib, scikit-learn, and seaborn are installed.

Importing data

To create an Asaph project, we first need to import the data. A minimal command for importing biallelic SNPs from a VCF file would look like so:

bin/import --vcf <path/to/vcf> \
           --populations <path/to/populations_file> \
           --workdir <path/to/workdir> \
           --features-type categories

Asaph currently supports encoding SNPs as features in two ways: categories and counts. In the categorical scheme, each genotype (e.g., A/T, A/A, T/T) is represented as a binary variable. In the counts scheme, each allele (e.g., A, T) is represented as an integer giving the number of copies (0, 1, or 2) the sample has. We recommend the categories encoding for Random Forests and the counts encoding for Logistic Regression ensembles.
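The two encodings can be illustrated with a small sketch (this is not Asaph's internal code, just an example for a biallelic site with alleles A and T):

```python
# Illustrative sketch of the two SNP feature encodings for a
# biallelic site with alleles A and T.

def categories_encoding(genotype):
    """One binary variable per possible genotype (A/A, A/T, T/T)."""
    genotypes = ["A/A", "A/T", "T/T"]
    return [1 if genotype == g else 0 for g in genotypes]

def counts_encoding(genotype):
    """One integer per allele: the number of copies (0, 1, or 2)."""
    return [genotype.count("A"), genotype.count("T")]

# A heterozygous sample A/T:
print(categories_encoding("A/T"))  # [0, 1, 0]
print(counts_encoding("A/T"))      # [1, 1]
```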

A file listing the sample ids for each population must also be specified. The populations file contains one population per line, with the population name in the first column and the sample ids following, all separated by commas like so:

Population 1,Sample1,Sample2
Population 2,Sample3,Sample4

The sample ids have to match the ids in the VCF file.
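The format is simple enough that a parser can be sketched in a few lines (hypothetical helper, not part of Asaph):

```python
# Sketch of parsing the populations file format shown above:
# one population per line, name first, then comma-separated sample ids.

def parse_populations(lines):
    populations = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        name, *samples = line.split(",")
        populations[name] = samples
    return populations

text = ["Population 1,Sample1,Sample2",
        "Population 2,Sample3,Sample4"]
print(parse_populations(text))
# {'Population 1': ['Sample1', 'Sample2'], 'Population 2': ['Sample3', 'Sample4']}
```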

The work directory will be created and contain the resulting Asaph data structures such as a feature matrix, feature labels, and sample labels.

To improve run times, reduce memory usage, and stabilize rankings, Asaph supports dictionary encoding of the feature matrix. With dictionary encoding, all SNPs with the same genotypes across all samples are replaced by a single instance. When the rankings are generated, each original SNP is given the variable importance score of the retained instance. In practice, we've found that the compression can reduce run times from weeks to hours. Compression can be enabled with the --compress flag:

bin/import --vcf <path/to/vcf> \
           --populations <path/to/populations_file> \
           --workdir <path/to/workdir> \
           --features-type categories \
           --compress

Likelihood-Ratio Tests

Asaph can perform single SNP association tests using Likelihood-Ratio Tests on Logistic Regression models. Once data is imported:

bin/likelihood_ratio_test --workdir <path/to/workdir>

The p-value for each SNP is written to a tab-separated value file under

<workdir>/statistics/snp_likelihood_ratio_tests.tsv
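The test itself can be sketched for a single binary SNP feature and a binary population label (a simplified illustration, not Asaph's code; with one binary predictor the logistic-regression MLE fits the per-group phenotype means, so both log-likelihoods have closed forms):

```python
import math
from scipy.stats import chi2

def bernoulli_loglik(y, p):
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    return sum(math.log(p) if yi == 1 else math.log(1 - p) for yi in y)

def lrt_pvalue(snp, phenotype):
    # Null model: intercept only -> fitted probability is the overall mean.
    p_null = sum(phenotype) / len(phenotype)
    ll_null = bernoulli_loglik(phenotype, p_null)
    # Alternative model: intercept + SNP -> fits each genotype group's mean.
    ll_alt = 0.0
    for g in set(snp):
        ys = [y for x, y in zip(snp, phenotype) if x == g]
        ll_alt += bernoulli_loglik(ys, sum(ys) / len(ys))
    # 2 * log-likelihood ratio is chi-squared; df = 1 for a binary SNP feature.
    stat = 2.0 * (ll_alt - ll_null)
    return chi2.sf(stat, df=1)

# Perfect association between SNP and population -> very small p-value.
print(lrt_pvalue([0, 0, 0, 0, 1, 1, 1, 1],
                 [0, 0, 0, 0, 1, 1, 1, 1]))
```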

A Bonferroni correction is often used for multiple hypothesis testing. This can be overly conservative here, however, since the SNPs are often correlated. Gao et al. propose using PCA to estimate the effective number of independent variables after correlation is accounted for, and using that number in the correction. A search function is available in the PCA module:

bin/pca --workdir <path/to/workdir> \
        min-components-explained-variance \
        --init-n-components <min-components> \
        --explained-variance-thresholds 0.999
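The idea behind the search can be sketched as follows (an illustration of the Gao et al. approach under the 0.999 explained-variance threshold used above, not Asaph's implementation):

```python
import numpy as np

def effective_num_tests(genotype_matrix, threshold=0.999):
    """Smallest number of principal components explaining `threshold`
    of the variance of the (samples x SNPs) genotype matrix."""
    centered = genotype_matrix - genotype_matrix.mean(axis=0)
    # Squared singular values give the per-component explained variance.
    s = np.linalg.svd(centered, compute_uv=False)
    var = s ** 2
    cumulative = np.cumsum(var) / var.sum()
    return int(np.searchsorted(cumulative, threshold) + 1)

rng = np.random.default_rng(0)
base = rng.integers(0, 3, size=(20, 1))
# 10 perfectly correlated SNP columns -> one effective test, not ten.
X = np.repeat(base, 10, axis=1).astype(float)
m_eff = effective_num_tests(X)
alpha = 0.05 / m_eff   # corrected significance threshold
print(m_eff, alpha)    # 1 0.05
```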

SNP Rankings with Random Forests Variable Importance Scores

Asaph's original purpose, which it has since outgrown, was to support calculation of variable importance scores and ranking of SNPs using Random Forests. Once data is imported, Random Forest models can be trained with the command:

bin/random_forests --workdir <path/to/workdir> \
                   train \
                   --trees <number of trees>

Generally, you will want to sweep over the trees parameter, running the above command with a range of values. Asaph actually trains two Random Forests each time so that convergence can be checked. You can check the convergence of the SNP rankings using the analyze-rankings mode:

bin/random_forests --workdir <path/to/workdir> \
                   analyze-rankings

The analyze-rankings mode will generate two plots: a comparison of the number of SNPs used by each pair of models, and the agreement of the top 0.01%, 0.1%, 1%, and 10% of ranked SNPs between each pair of models. The plots are written in PDF and PNG formats and stored in <workdir>/figures. If the rankings do not demonstrate convergence, run the training command with a larger number of trees. Once the rankings have converged, you can output the rankings to a text file:
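The agreement statistic can be sketched like so (a simplified illustration of the convergence check, not Asaph's code): take the top k% of each model's ranking and compute the fraction of SNPs the two sets share.

```python
# Sketch of ranking agreement: fraction of SNPs shared between the
# top k% of two independently trained models' rankings.

def top_fraction_agreement(ranking_a, ranking_b, fraction):
    """ranking_a/b: SNP names sorted from most to least important."""
    n = max(1, int(len(ranking_a) * fraction))
    top_a, top_b = set(ranking_a[:n]), set(ranking_b[:n])
    return len(top_a & top_b) / n

rank1 = ["snp%d" % i for i in range(1000)]
rank2 = ["snp%d" % i for i in [1, 0, 2, 3, 5, 4] + list(range(6, 1000))]
# Top 1% (10 SNPs) identical as sets, despite swapped orders:
print(top_fraction_agreement(rank1, rank2, 0.01))  # 1.0
```

High agreement across the tested fractions indicates the rankings have stabilized; low agreement suggests training with more trees.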

bin/random_forests --workdir <path/to/workdir> \
                   output-rankings \
                   --trees <select model with this many trees>

The rankings will be output to a text file in the <workdir>/rankings directory.

SNP Ranking From Logistic Regression Weights (Ridge and Lasso)

Asaph can also be used for training ensembles of Logistic Regression models. By training an ensemble and averaging over the feature weights, we can ensure that the rankings of the SNPs are consistent. The LR workflow follows the RF workflow. Once data is imported, you can train a LR model like so:

bin/logistic_regression --workdir <path/to/workdir> \
                        train \
                        --n-models <number of models>
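The ensemble-averaging idea can be sketched as follows (an illustration with simulated weight vectors standing in for trained models, not Asaph's code): averaging the absolute weights across models damps the run-to-run noise of any single stochastically trained model.

```python
import numpy as np

def rank_snps(weight_vectors, snp_names):
    """Rank SNPs by mean absolute weight across an ensemble of models."""
    mean_magnitude = np.mean(np.abs(weight_vectors), axis=0)
    order = np.argsort(mean_magnitude)[::-1]
    return [snp_names[i] for i in order]

rng = np.random.default_rng(42)
true_weights = np.array([3.0, 0.1, 1.5, 0.05])
# Four models = true weights + independent training noise.
ensemble = true_weights + rng.normal(scale=0.2, size=(4, 4))
print(rank_snps(ensemble, ["snpA", "snpB", "snpC", "snpD"]))
```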

Convergence of the SNP rankings can be evaluated with the command:

bin/logistic_regression --workdir <path/to/workdir> \
                        analyze-rankings

The analyze-rankings mode will generate two plots: a comparison of the number of SNPs used by each pair of models, and the agreement of the top 0.01%, 0.1%, 1%, and 10% of ranked SNPs between each pair of models. The plots are written in PDF and PNG formats and stored in <workdir>/figures. If the rankings do not demonstrate convergence, run the training command with a larger number of models. Once the rankings have converged, you can output the rankings to a text file:

bin/logistic_regression --workdir <path/to/workdir> \
                      output-rankings \
                      --n-models <select ensemble with this many models>

The rankings will be output to a text file in the <workdir>/rankings directory.

By default, LR models are trained with Stochastic Gradient Descent (SGD) and an L2 penalty. Asaph also supports the Averaged Stochastic Gradient Descent (ASGD) and Stochastic Average Gradient (SAG) optimization algorithms. SGD additionally supports the elastic-net penalty. You can select different methods with the --methods flag. If you do so, you'll need to pass the same --methods flag each time you invoke train, analyze-rankings, and output-rankings.

You can also enable bagging, where the dataset is bootstrapped before each model is trained, with the --bagging flag for the train function. Bagging is disabled by default since we have found little impact from its usage.
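Bagging, as described above, can be sketched in a few lines (a conceptual illustration, not Asaph's implementation): each ensemble member trains on a bootstrap sample drawn with replacement from the original samples.

```python
import random

def bootstrap_sample(samples, seed):
    """Draw len(samples) samples with replacement (one bagging resample)."""
    rng = random.Random(seed)
    return [rng.choice(samples) for _ in samples]

data = ["s1", "s2", "s3", "s4", "s5"]
resampled = bootstrap_sample(data, seed=0)
print(len(resampled))               # 5 -- same size as the original
print(set(resampled) <= set(data))  # True -- only original samples appear
```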

Cramer's V

Cramer's V is a measure of association between nominal variables. Asaph can use Cramer's V to calculate associations between the SNPs and population structure or pairwise among the SNPs themselves. The command for calculating the associations between the SNPs and population structure is:

bin/cramers_v --workdir <path/to/workdir> \
              populations

The associations will then be written to a text file located at <workdir>/statistics/snp_population_associations.txt.
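The statistic itself can be sketched from a genotype-by-population contingency table (a textbook illustration, not Asaph's code): V = sqrt(chi2 / (n * (min(rows, cols) - 1))), ranging from 0 (no association) to 1 (perfect association).

```python
import math
import numpy as np

def cramers_v(table):
    """Cramer's V for a contingency table of observed counts."""
    table = np.asarray(table, dtype=float)
    n = table.sum()
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, c = table.shape
    return math.sqrt(chi2 / (n * (min(r, c) - 1)))

# Rows: genotypes; columns: populations.
print(cramers_v([[10, 0], [0, 10]]))  # 1.0 -- perfect association
print(cramers_v([[5, 5], [5, 5]]))    # 0.0 -- independence
```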

You can also calculate the associations between pairs of SNPs:

bin/cramers_v --workdir <path/to/workdir> \
              pairwise

The associations will then be written to a text file located at <workdir>/statistics/snp_pairwise_associations.txt. Since the number of pairs is often very large, you can evaluate the pairwise associations on a random sample of SNP pairs using the --samples N flag.

Lastly, you can calculate the associations between all SNPs and a single SNP:

bin/cramers_v --workdir <path/to/workdir> \
              pairwise-single \
              --chrom 1 \
              --pos 200

The associations will then be written to a text file located at <workdir>/statistics/snp_pairwise_associations_1_200.txt.

Running Tests

Asaph has a test suite implemented using the Bats framework. You can run the test suite like so:

bats tests/*.bats
