snplentiful

Highly connected genes contain more SNPs

Introduction

SNP-to-gene translation is a hallmark of modern bioinformatics. Genomic technologies often produce data at the nucleotide level. Downstream analyses, however, often operate at the gene level. Therefore, condensing nucleotide-level measurements to a gene-based value is a common and essential practice.

Many technologies and applications focus on single nucleotides that vary between individuals, known as single nucleotide polymorphisms (SNPs). Here, we investigate whether the number of SNPs contained in a gene is correlated with other types of gene-centric information. Specifically, we evaluate the relationship between SNP abundance and network connectivity for a variety of network types.

When translating measurements from SNPs to genes, a skilled bioinformatician will appreciate the correlations uncovered herein. Why? Gene scores from SNP-based experimentation are often analyzed in the context of other gene-based information sources. Frequently, such analyses assume independence of the two datasets. However, if the SNP-to-gene conversion is biased by SNP abundance (which generally occurs absent painstaking consideration and adjustment), independence ceases to exist.

Method

SNP abundance: We calculated the number of SNPs per gene for 3 genotyping arrays (Affymetrix 500K Set, Illumina HumanHap550, Illumina HumanOmni1), exome sequencing (ExAC), and whole genome sequencing (1000 Genomes Phase 3). We limited analyses to genes that were consistent between databases and extended each gene boundary by 10,000 basepairs in both directions. The 10,000 basepair window is frequently adopted to capture unmeasured but highly linked SNPs underlying the association and to cover nearby regulatory variants.
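
As a rough illustration of the counting step, the sketch below tallies SNPs that fall within a gene's boundaries after extending them by 10,000 base pairs on each side. The function name, the input layout, and the gene and SNP coordinates are all hypothetical and are not taken from the project notebooks.

```python
from bisect import bisect_left, bisect_right

WINDOW = 10_000  # base pairs added on each side of a gene, as described above


def count_snps_per_gene(genes, snp_positions, window=WINDOW):
    """Count SNPs inside each gene's extended boundaries.

    genes: dict mapping gene symbol -> (chromosome, start, end); hypothetical layout.
    snp_positions: dict mapping chromosome -> list of SNP positions.
    """
    sorted_positions = {chrom: sorted(pos) for chrom, pos in snp_positions.items()}
    counts = {}
    for gene, (chrom, start, end) in genes.items():
        positions = sorted_positions.get(chrom, [])
        lower = max(start - window, 0)
        upper = end + window
        # Number of positions p with lower <= p <= upper
        counts[gene] = bisect_right(positions, upper) - bisect_left(positions, lower)
    return counts


# Made-up coordinates, for illustration only
genes = {"GENE_A": ("1", 50_000, 60_000), "GENE_B": ("1", 200_000, 205_000)}
snps = {"1": [39_999, 40_000, 55_000, 70_000, 70_001, 210_000]}
snp_counts = count_snps_per_gene(genes, snps)
```

Sorting positions once per chromosome and using binary search keeps each gene lookup logarithmic in the number of SNPs, which matters when the SNP list comes from whole genome sequencing.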

Network degree: Hetnets are networks with multiple types of nodes and edges. We extracted gene degrees from hetio-ind, a hetnet developed for drug repurposing. The network contained 26 types of edges (metaedges) that originate with a gene. Thus, for each gene we calculated 26 metaedge-specific degrees.

Transformation: Both SNP abundances and network degrees were transformed by adding 1 and taking the base-10 logarithm. Figure axes report untransformed values, but model fitting occurs on the transformed data.
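
A minimal sketch of this transformation and its inverse (useful for labeling axes with untransformed values) is shown below. The function names are illustrative and not taken from the notebooks.

```python
import math


def log_transform(value):
    """Add 1, then take the base-10 logarithm, so that a count of 0 maps to 0."""
    return math.log10(value + 1)


def inverse_log_transform(transformed):
    """Recover the untransformed count from a transformed value."""
    return 10 ** transformed - 1


transformed = [log_transform(v) for v in (0, 9, 99)]
recovered = [inverse_log_transform(t) for t in transformed]
```

Adding 1 before taking the logarithm keeps genes with zero SNPs or zero degree in the analysis, since the logarithm of 0 is undefined.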

Results

Correlations between SNP abundance and network degree are commonplace. These correlations affect genotyping arrays as well as sequencing, indicating that the effects are not solely due to biased coverage of genotyping arrays. Physical protein interactions, a popular input for GWAS prioritization techniques, show less correlation than other edge types. However, GO annotations, a community favorite for gene set enrichment techniques, increase sharply with SNP abundance. Expression datasets also preferentially report genes with high SNP abundance.

Conclusions

Beware! The potential for erroneous conclusions when gene scores are biased by SNP abundance is high. Ideally, permutation testing should be applied at the SNP level to ensure that SNP-to-gene conversion biases are not the cause of any positive results. Since access to the raw SNP-level data needed for permutation is often impractical or unavailable, care should be taken to use unbiased SNP-to-gene conversion methods.
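
To make the bias concrete, consider one conversion that is assumed here purely for illustration and is not a method evaluated in this analysis: scoring each gene by the smallest p-value among its SNPs. Under the null, the minimum of n independent uniform p-values shrinks as n grows, so genes with more SNPs look more significant. The sketch below applies a Šidák-style adjustment, which assumes independent SNPs. Linkage disequilibrium violates that assumption, so the adjusted scores are conservative rather than exact.

```python
def min_p_gene_score(p_values):
    """Naive gene score: the smallest SNP p-value (biased by SNP count)."""
    return min(p_values)


def sidak_adjusted_score(p_values):
    """Adjust the minimum p-value for the number of SNPs tested.

    Assumes the SNP-level tests are independent, which is an
    illustrative simplification.
    """
    n = len(p_values)
    return 1 - (1 - min(p_values)) ** n


# Made-up p-values: both genes share the same smallest p-value
few_snps = [0.01]
many_snps = [0.01] + [0.5] * 99

naive_few = min_p_gene_score(few_snps)
naive_many = min_p_gene_score(many_snps)
adjusted_few = sidak_adjusted_score(few_snps)
adjusted_many = sidak_adjusted_score(many_snps)
```

The naive scores are identical for both genes, while the adjusted score for the 100-SNP gene is much larger, reflecting how easily a small minimum arises by chance when many SNPs are tested.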

Publication

A summary of this analysis, which includes Figure 1, is published in:

Genetic Association–Guided Analysis of Gene Networks for the Study of Complex Traits
Casey S. Greene, Daniel S. Himmelstein
Circulation: Cardiovascular Genetics (2016-04) https://doi.org/bffr
DOI: 10.1161/circgenetics.115.001181 · PMID: 27094199

Figures

Figure 1: Network degree versus SNP abundance for common data types. Platform-specific and metaedge-specific models were fit. Models are drawn as their 95% confidence band. Genes with extreme SNP abundances (bottom and top two percentiles) are omitted for visual clarity.

Figure 2: Mean-adjusted network degree versus SNP abundance for common data types. The same as Figure 1, except that degree was divided by the mean degree for each metaedge.

Figure 3: Network degree versus SNP abundance for all metaedges. The same as Figure 1, except that all metaedges with over 1,000 edges are shown. Additionally, genes with extreme SNP abundances were not removed. See download/network-summary.tsv for abbreviation lookup and additional metaedge information.
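
The mean adjustment described for Figure 2 can be sketched as follows. The data layout, metaedge labels, and gene names are hypothetical, and the notebooks may implement this step differently.

```python
from collections import defaultdict


def mean_adjust_degrees(degrees):
    """Divide each gene's degree by the mean degree of its metaedge.

    degrees: dict mapping (metaedge, gene) -> degree; hypothetical layout.
    """
    by_metaedge = defaultdict(list)
    for (metaedge, _gene), degree in degrees.items():
        by_metaedge[metaedge].append(degree)
    means = {m: sum(values) / len(values) for m, values in by_metaedge.items()}
    return {
        (metaedge, gene): degree / means[metaedge]
        for (metaedge, gene), degree in degrees.items()
        if means[metaedge] > 0  # skip metaedges where every degree is 0
    }


# Made-up degrees, for illustration only
degrees = {
    ("interaction", "GENE_A"): 2,
    ("interaction", "GENE_B"): 6,
    ("expression", "GENE_A"): 10,
    ("expression", "GENE_B"): 30,
}
adjusted = mean_adjust_degrees(degrees)
```

After adjustment, both metaedges are on the same scale even though their raw degrees differ by a factor of five, which is what allows them to share an axis.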

Execution

This analysis can be reproduced by running the Jupyter notebooks in the following order:

  1. network-degrees.ipynb to extract gene degrees from hetio-ind, a hetnet that includes many gene metaedges.
  2. SNP-to-Gene.ipynb to download data and perform processing for SNP chips.
  3. exac.ipynb to process the ExAC sequencing variants.
  4. combine-SNPs-and-degrees.ipynb to combine SNPs per Gene measurements from all platforms and hetnet degrees.
  5. visualization.ipynb to create visualizations. This is an R notebook.

Acknowledgements

This material is based upon work supported by the National Science Foundation Graduate Research Fellowship under Grant Number 1144247 to @dhimmel. This material is funded in part by the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative through Grant GBMF4552 to @cgreene.

snplentiful's Issues

Gene characteristics

I think you should try to account for gene characteristics. For example, longer genes might be expected to harbor more SNPs. These genes might also harbor more motifs for post-transcriptional regulation and might thus be studied more often. Also, check whether the number of isoforms per gene helps explain away any of the information associated with SNPs.

Add points to the graph

Can you add points to Figure 1?

Also, can you state which method you used for the regression?
