Git Product home page Git Product logo

sticcs's Introduction

sticcs

sticcs is a method for inferring the series of genealogies along the genome, also called a tree sequence. Unlike some other methods, sticcs does not require phased haplotypes, and it can work on any ploidy level.

The input for sticcs is polarised genotype data. This means you either need to know the ancestral allele at each site, or you need an appropriate outgroup(s) to allow inference of the derived allele.

A publication for is in preparation, but you can check out the How it works section below to learn about the apporach.

Installation

sticcs requires cyvcf2 and numpy. If these are not already installed, they will be downloaded and installed wehn you run the install command.

If you would like to export tree sequence objects from tskit, you will also need to install tskit yourself before running sticcs.

git clone https://github.com/simonhmartin/sticcs.git

cd sticcs

pip install -e .

Command line tool

The command line tool takes as input a modified vcf file that contains a DC field, giving the count of derived alleles for each individual at each site.

You can make this by running:

sticcs prep -i <input vcf> -o <output vcf>  --outgroup <outgroup sample ID>

If the vcf file already has the ancestral allele (provided in the AA field in the INFO section), then you do not need to specifiy outrgoups for polarising.

Now you can run the main command to make the treesequence:

sticcs ts -i <input vcf> -o <output prefix> --output_format tskit

This will make a treesequence file that can be loaded and analysed using tskit. The default for --output_format is newick, which makes a file of newick trees and a separate file giving data for each tree window.

Python API

Classes and functions from sticcs can be used by importing sticcs in your python script. Full documention is not yet available, but some example scripts will be provided.

How it works

The approach will be described in a publication, but for now here is an overview.

sticcs stands for Sequential Tree Inference from Clusters of Compatible Sites. It uses a simple heuristic approach and works in two steps:

  1. Finding clusters of compatible variant sites
  2. Inferring the tree for each cluster

To understand the approach, it makes sense to first look at the tree inference step.

Tree inference

Assuming an infinite-sites mutation model (no recurrent mutation), each variant site tells us about a 'node' in the tree. All the individuals that carry the derived allele must share a common ancestor at that node. Given a set of sites that are all compatible with the same tree, we can infer a tree as follow:

Haploid example

Say we have six haploid individuals and 4 SNPs:

A B C D E F
SNP1 1 1 0 0 0 0
SNP2 0 0 0 1 1 0
SNP3 1 1 1 0 0 0
SNP4 1 1 1 1 1 0

The SNPs are polarised, so that 0 indicates the ancestral state and 1 indicates the derived state.

SNP1 tells us that the tree includes a clade containing individuals A and B. SNP2 telss us that the tree includes a clade containing individuals D and E, and so on. Using this logic, the inferred tree for this cluster of sites is:

(((A,B),C),(D,E)),F);

If we were lacking SNP1, for example, we would still be able to infer the tree, but it would contain a polytomy for individuals A, B and C, but there is still only ever one tree compatible with our set of SNPs:

((A,B,C),(D,E)),F);

Diploid example

Now, if instead of six phased haplotypes we have three unphased diploid genotypes.:

A B C
SNP1 1/1 0/0 0/0
SNP2 0/0 0/1 0/1
SNP3 1/1 0/1 0/0
SNP4 1/1 1/1 0/1

We can write these out in terms of the number of derived alleles they each carry:

A B C
SNP1 2 0 0
SNP2 0 1 1
SNP3 2 1 0
SNP4 2 2 1

Once again we can use the same logic to infer the tree.

SNP1 tells us that the tree includes a clade containing both tips from individual A. SNP2 tells us that the tree includes a clade containing one tip from individual B and one from C. SNP3 tells us that the other tip from B groups with the two tips from A. SNP4 tells us that one of the tips from C is basal to all the rest.

So our inferred tree is:

(((A1,A2),B2),(B1,C1)),C2);

You might have noticed that some SNPs in this diploid example are compatible with more than one possible clade. For example, considering SNP2 in isolation, it might tell us about a clade of (B1,C1), or of (B2, C1) etc. In this case, we arbitrarily decided on (B1,C1). This means that when we got to SNP3, we only have B2 left to create the clade ((A1,A2),B2).

The sticcs algorithm works through SNP patterns in order from the smallest to the largest clades (i.e. working from leaves to root). Once all SNPs have been considered, it runs through all SNP patterns again, poping up any remaining available tips into clades, until there are no more loose tips that can be mopped up.

Importantly, when building a tree sequence (from right-to-left along the chromosme), sticcs retains any compatible clades that were present in the previous tree. Thus, although the decisions to cluster say tips B1 and C1 was arbitrary the first time SNP2 was encountered, this information is retained if SNP2 is also present next tree. In this way, sticcs is actually phasing tips while building the trees.

But how exactly does sticcs decide which SNPs are included in which tree? Read on friend!

Finding clusters of compatible sites

The key feature of the four SNPs included in the examples above is that they are all compatible with each other. This means that they are all compatible with the same genealogy. sticcs works by finding such compatible clusters along the genome, and then inferring the trees for each cluster, as above. To find these clusters, we need an algorithm to determine whether sites are compatible with each other.

Defining compatible pairs of variants

Whether any pair of variants is compatible with the same genealogy can be tested using the Four-genete test. If the dataset contains all four pairs of alleles at the two sites: --0--0-- --0--1-- --1--0-- --1--1--

These sites are not compartible with the same genealogy. For polarised data, in which 1 represents the derived state, we know that the ancestral haplotype --0--0-- does (or did) exist, so we only need to observe the latter three derived haplotypes to know that the sites are incompatible.

Unphased diploids and polyploids

For unphased diploid or polyploid genotypes, we have less information, but we can still test for compatibility under the most parsimonius case. For example:

If we saw these three diploid genotypes in three individuals:

--0/1--0/0-- --0/0--0/1-- --0/1--0/1--

We can infer that individual 1 carries haplotypes --0--0-- and --1--0--, while individual 2 carries --0--0-- and --0--1--. Thus, the most parsimonius haplotypes for individual 3 are --1--0-- and --0--1--. Hence, we conclude here that only two of the three possible derived haplotypes are present: --0--1-- --1--0--

So these two sites are deemed compatible.

As a general rule that applies to any ploidy level, we can say that all three derived haplotypes must be present if:

  • There is at least one individual at which the number of derived alleles at the first site exceeds that at the second (implies --1--0-- exists)
  • There is at least one individual at which the number of derived alleles at the second site exceeds that at the first (implies --0--1-- exists)
  • There is at least one individual at which the sum of derived alleles across the two sites exceeds the ploidy (implies --1--1-- exists)
Finding clusters

Our goal is to infer genealogies along the chromosome, and the recombination points at which they change. It is not as simple as just dividing the genome into blocks of compatible sites, for two reasons. First, compatibility is assesed in a pairwise manner. If for example we have three adjacent SNPs, and SNP2 is compatible with both SNP1 and SNP3, but SNP1 and SNP3 are not compatible with each other, we need to decide whether the recombination point is between SNP1 and SNP2 or between SNP2 and SNP3. Secondly, adjacent genealogies tend to be very similar - usually only differing by one recombinatyion event and therefore one branch on the tree. This means that, for a focal genealogy with a given interval, there will usually be variant sites outside of that interval that are informative about the focal genealogy, and we want to include those in our inference.

sticcs addresses these challenges by using a heuristic clustering algorithm with the following steps:

  • First each variant is labeled in terms of its 'pattern': a list giving the number of derived alleles observed in each individual
  • For each variant site i on the chromosome, we compile two sets of compatible patterns. The left-compatible set is the set of patterns for all varaint sites from 1 to i that are compatible with i, and where a site carrying that pattern is not separated from i by a variant with an incompatible pattern. The right-compatible set is similar, but for all variants between i and the end of the chromosome.
  • Cluster i is initialised as the set of patterns shared by both the left- and right-compatible sets for site i
  • Additional patterns from both the left- and right-compatible sets are then added to cluster i, starting from the pattern associated with the nearest variant site to site i on the chromosome and moving outwards in both directions, provided this pattern is compatible with all existing patterns in cluster i.
  • Any adjacent groups of sites that have identical clusters are then merged into a single cluster, and each cluster has an interval defined by the first and last of the sites that were merged.
  • Chromosome intervals are then extended to the midpoints between the clusters (and to chromosome start and end for terminal clusters) such that there are no gaps.

sticcs's People

Contributors

simonhmartin avatar

Stargazers

Yudong Cai avatar Max Brown avatar  avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.