Git Product home page Git Product logo

hpo_similarity's Introduction

[Build Status] (https://travis-ci.org/jeremymcrae/hpo_similarity)

Similarity of Human-Phenotype-Ontology (HPO) terms in patient groups.

This code estimates the probability of a group of individuals sharing their HPO terms. This is useful when we have a set of probands all of whom have similar genetic variation in a single gene (for example they all have de novo mutations in the same gene). We have standardised phenotype terms, some of which may be shared amongst the individuals. We want to know how likely the group is to share those terms.

To estimate this probability, we need three things:

  • a way to quantify similarity of HPO terms for a pair of probands.
  • a way to quantify similarity across more than two probands.
  • a null distribution of similarity scores for those probands
Proband pair HPO similarity

For a pair of probands, we have two lists of HPO terms. The code calculates a matrix of information content scores, for each term in proband A versus each term in proband B. The information content is from the most informative common ancestor of the two terms. The IC scores are collapsed into a single score as the maximum of the IC scores.

Proband group similarity

For a group of probands, I get a matrix of proband pair HPO similarity scores, one score for each pairing. The scores are summed are across all the pairs to get an overall similarity estimate for the group of probands.

Null distribution of similarity scores

A null distribution of similarity scores is simulated by randomly sampling groups of probands. The similarity scores are calculated as above. The P value is calculated as the proportion of simulated scores greater than the observed probands' score.

Initially I thought we'd need to match the number of sampled HPO terms to the numbers in the probands for the gene. For that, I sampled terms at the rate at which they were observed in all the probands. However, the QQ plots from this had an excess of extremely significant P values. I suspect this is due to underlying relationships between HPO terms. Sampling probands seems better and simpler.

Install

Install the package with:

git clone https://github.com/jeremymcrae/hpo_similarity.git
cd hpo_similarity
python setup.py install --user

This code incorporates the following code and datasets:

Running the code

The data directory includes two example input datafiles, one listing proband IDs for each gene (data/example_genes.json), and the other listing HPO terms used for each proband (data/example_phenotypes.json). Analysis results are by default written to standard out. Run the code using the example datasets as:

python3 scripts/proband_similarity.py \
  --genes data/example_genes.json \
  --phenotypes data/example_phenotypes.json

This will not give any significant P-values, it merely demonstrates how to run the code and confirms that the code runs without errors. A real analysis requires a large population of probands with HPO terms for reliable information content scores, and to give a good population to sample from.

You can optionally use:

  • --output OUTPUT_PATH to redirect the table of gene symbols and P values to a file.
  • --ontology ONTOLOGY_OBO_PATH to specify a HPO ontology file other than the one included in the data directory.
  • --permute to permute probands between genes, to assess whether the results match a null distribution in the absence of any true relationships.
  • --iterations INTEGER to specify the number of iterations of simulation you wish to run. By default the code runs 100000 iterations, and with that number, analysis of a single gene can take from 30-1500 seconds, depending on how many probands are in each gene.

hpo_similarity's People

Contributors

jeremymcrae avatar

Watchers

James Cloos avatar Xiaowei Mao avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.