genomicsupersignature's Introduction

GenomicSuperSignature

Interpretation of RNA-seq experiments through robust, efficient comparison to public databases

Purpose

Thousands of RNA sequencing profiles have been deposited in public archives, yet remain unused for the interpretation of most newly performed experiments. Methods for leveraging these public resources have focused on the interpretation of existing data, or analysis of new datasets independently, but do not facilitate direct comparison of new to existing experiments. The interpretability of common unsupervised analysis methods such as Principal Component Analysis would be enhanced by efficient comparison of the results to previously published datasets.

Methods

To help identify replicable and interpretable axes of variation in any given gene expression dataset, we performed principal component analysis (PCA) on 536 studies comprising 44,890 RNA sequencing profiles. Sufficiently similar loading vectors, when compared across studies, were combined through simple averaging. We annotated the collection of resulting average loading vectors, which we call Replicable Axes of Variation (RAV), with details from the originating studies and gene set enrichment analysis. Functions to match PCA of new datasets to RAVs from existing studies, extract interpretable annotations, and provide intuitive visualization are implemented as the GenomicSuperSignature R package, available from Bioconductor.
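As a rough illustration of the averaging idea described above, the following base R sketch runs PCA on a few simulated studies and averages their leading loading vectors. The simulated data, the correlation cutoff of 0.7, and the use of only the first principal component are assumptions made for this example; this is not the procedure used to build the actual RAVmodel.

```r
# Simulated stand-ins for public studies: genes x samples matrices (hypothetical data).
set.seed(1)
n_genes <- 50
genes <- paste0("gene", seq_len(n_genes))
shared_axis <- rnorm(n_genes)  # a signal present in every simulated study

simulate_study <- function(n_samples) {
  activity <- rnorm(n_samples, sd = 5)  # per-sample strength of the shared signal
  noise <- matrix(rnorm(n_genes * n_samples), n_genes, n_samples)
  mat <- outer(shared_axis, activity) + noise
  rownames(mat) <- genes
  mat
}
studies <- lapply(c(20, 30, 25), simulate_study)

# PC1 loading vector of each study (prcomp expects samples in rows).
top_loading <- function(mat) stats::prcomp(t(mat))$rotation[, 1]
loadings <- sapply(studies, top_loading)  # genes x studies

# PCA signs are arbitrary, so align every vector to the first one before averaging.
signs <- sign(as.vector(stats::cor(loadings[, 1], loadings)))
aligned <- sweep(loadings, 2, signs, `*`)

# Treat vectors as "sufficiently similar" when they correlate above a hypothetical cutoff.
similar <- abs(stats::cor(aligned)[1, ]) > 0.7
avg_loading <- rowMeans(aligned[, similar, drop = FALSE])
```

Here `avg_loading` plays the role of a single averaged loading vector; in this toy setup it should track `shared_axis` closely, up to sign.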

Results

Use cases and benchmark examples are documented on the GenomicSuperSignaturePaper page. All figures and tables can be reproduced using the code and instructions on that page.

Citation

If you use GenomicSuperSignature in published research, please cite:

Oh S, Geistlinger L, Ramos M, Blankenberg D, van den Beek M, Taroni JN, Carey VJ, Waldron L, Davis S. GenomicSuperSignature facilitates interpretation of RNA-seq experiments through robust, efficient comparison to public databases. Nature Communications 2022;13: 3695. doi: 10.1038/s41467-022-31411-3

Other relevant code

The workflow to build the RAVmodel is available from https://github.com/shbrief/model_building, which is archived in Zenodo with the identifier https://doi.org/10.5281/zenodo.6496552. All analyses presented in the GenomicSuperSignature manuscript are reproducible using code accessible from https://github.com/shbrief/GenomicSuperSignaturePaper/ and archived in Zenodo with the identifier https://doi.org/10.5281/zenodo.6496612.

Installation

You can install GenomicSuperSignature from Bioconductor using BiocManager:

if (!require("BiocManager"))
    install.packages("BiocManager")

library(BiocManager)
install("GenomicSuperSignature")

RAVmodels can be downloaded directly from a Google Cloud Storage bucket at no cost. The sizes of RAVmodel_C2.rds and RAVmodel_PLIERpriors.rds are 476.1 MB and 475.1 MB, respectively. You can use wget or the GenomicSuperSignature::getModel function.

## Download RAVmodel with wget
wget https://storage.googleapis.com/genomic_super_signature/RAVmodel_C2.rds
wget https://storage.googleapis.com/genomic_super_signature/RAVmodel_PLIERpriors.rds

## Download RAVmodel with getModel function
getModel("C2")
getModel("PLIERpriors")

Schematic

Overview of GenomicSuperSignature

Schematic illustration of RAVmodel construction and GenomicSuperSignature application. Building the RAVmodel (components in grey) is performed once on a time scale of hours on a high-memory, high-storage server. Users can apply RAVmodel on their data (component in red) using the GenomicSuperSignature R/Bioconductor package (components in blue), which operates on a time scale of seconds for exploratory data analyses (components in orange) on a typical laptop computer.


User's perspective

The GenomicSuperSignature package allows users to access a RAVmodel (Z matrix, blue) and annotation information on each RAV. From a gene expression matrix (Y matrix, grey), users can calculate a dataset-level validation score or a sample score matrix (B matrix, red). For any RAV of interest, additional information such as related studies, GSEA results, and MeSH terms can be easily extracted.
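To make the relationship between the Y, Z, and B matrices concrete, here is a minimal base R sketch that projects samples onto a set of loading vectors. The matrices are random placeholders, and the gene-wise scaling step is an assumption; the package's own scoring function may normalize differently.

```r
set.seed(2)
genes <- paste0("gene", 1:40)

# Y: gene expression matrix (genes x samples), hypothetical values.
Y <- matrix(rnorm(40 * 6), nrow = 40,
            dimnames = list(genes, paste0("sample", 1:6)))

# Z: average loadings (genes x RAVs), hypothetical values standing in for a RAVmodel.
Z <- matrix(rnorm(40 * 3), nrow = 40,
            dimnames = list(genes, paste0("RAV", 1:3)))

# Center and scale each gene across samples, then project samples onto the RAVs.
Y_scaled <- t(scale(t(Y)))
common <- intersect(rownames(Y_scaled), rownames(Z))
B <- t(Y_scaled[common, ]) %*% Z[common, ]  # sample score matrix: samples x RAVs
```

Each row of `B` holds one sample's scores across the RAVs, which is the shape the sample score matrix takes in the schematic.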

Information assembled by GenomicSuperSignature

GenomicSuperSignature connects different public databases and prior information through RAVindex, creating the knowledge graph illustrated here. Users can instantly access data and metadata resources from multiple entry points, such as gene expression profiles, MeSH terms, gene sets, and keywords.

genomicsupersignature's People

Contributors

blankenberg, jwokaty, lwaldron, nturaga, seandavi, shbrief

genomicsupersignature's Issues

Allow configurable number of PCs in `validate`

.loadingCor <- function(dataset, avgLoading, method = "pearson", scale = FALSE) {
    if (any(class(dataset) == "ExpressionSet")) {
        dat <- Biobase::exprs(dataset)
    } else if (any(class(dataset) %in% c("SummarizedExperiment", "RangedSummarizedExperiment"))) {
        dat <- SummarizedExperiment::assay(dataset)
    } else if (any(class(dataset) == "matrix")) {
        dat <- dataset
    } else {
        stop("'dataset' should be one of the following objects: ExpressionSet, ",
             "SummarizedExperiment, RangedSummarizedExperiment, and matrix.")
    }
    if (isTRUE(scale)) {dat <- rowNorm(dat)} # row normalization
    dat <- dat[apply(dat, 1, function (x) {!any(is.na(x) | (x==Inf) | (x==-Inf))}),]
    gene_common <- intersect(rownames(avgLoading), rownames(dat))
    prcomRes <- stats::prcomp(t(dat[gene_common,])) # centered, but not scaled by default
    loadings <- prcomRes$rotation[, 1:8] # number of PCs is hard-coded to 8
    loading_cor <- abs(stats::cor(avgLoading[gene_common,], loadings[gene_common,],
                                  use = "pairwise.complete.obs",
                                  method = method))
    return(loading_cor)
}
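One possible shape for this change is sketched below. The function name `loading_cor_n` and the `n_pcs` argument are hypothetical, and the sketch handles plain matrices only, so it is an illustration of the idea rather than a patch for the package.

```r
# Hypothetical variant of the loading-correlation step with a configurable number of PCs.
# 'n_pcs' is an assumed argument name; only plain matrices are handled in this sketch.
loading_cor_n <- function(dat, avgLoading, n_pcs = 8, method = "pearson") {
  stopifnot(is.matrix(dat), is.matrix(avgLoading))
  # Drop genes with missing or infinite values, as in the original function.
  dat <- dat[apply(dat, 1, function(x) all(is.finite(x))), , drop = FALSE]
  gene_common <- intersect(rownames(avgLoading), rownames(dat))
  prcomRes <- stats::prcomp(t(dat[gene_common, , drop = FALSE]))
  # Cap the request at the number of PCs that prcomp actually returned.
  n_pcs <- min(n_pcs, ncol(prcomRes$rotation))
  loadings <- prcomRes$rotation[, seq_len(n_pcs), drop = FALSE]
  abs(stats::cor(avgLoading[gene_common, , drop = FALSE],
                 loadings[gene_common, , drop = FALSE],
                 use = "pairwise.complete.obs", method = method))
}

# Usage with random placeholder data: 30 genes, 10 samples, 4 loading vectors.
set.seed(3)
genes <- paste0("gene", 1:30)
dat <- matrix(rnorm(30 * 10), nrow = 30, dimnames = list(genes, NULL))
avgLoading <- matrix(rnorm(30 * 4), nrow = 30,
                     dimnames = list(genes, paste0("RAV", 1:4)))
res3 <- loading_cor_n(dat, avgLoading, n_pcs = 3)         # 4 x 3 matrix
res_capped <- loading_cor_n(dat, avgLoading, n_pcs = 50)  # capped at available PCs
```

Capping `n_pcs` avoids a subscript error when a small dataset yields fewer principal components than requested.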

'droplist' for `drawWordcloud()` function

Add a 'droplist' argument to the drawWordcloud() function that removes the most and least common MeSH terms in the universe, so that these 'outliers' do not skew the word cloud.

e.g. droplist = c("Human", "RNA sequencing", ..., "Publication", "Utah")
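The filtering step itself could look roughly like this. The term counts and the helper name `drop_terms` are made up for illustration, and how the result would be wired into drawWordcloud() is left open.

```r
# Hypothetical MeSH term frequencies for one RAV (term -> count).
term_freq <- c(Human = 120, "RNA sequencing" = 95, Neoplasms = 40,
               "Breast Neoplasms" = 33, Utah = 1, Publication = 1)

# Remove the terms named in 'droplist' before they reach the plotting step.
drop_terms <- function(freq, droplist = character(0)) {
  freq[!names(freq) %in% droplist]
}

filtered <- drop_terms(term_freq,
                       droplist = c("Human", "RNA sequencing", "Publication", "Utah"))
```

With an empty `droplist` (the default in this sketch), the frequencies pass through unchanged, so existing behavior would be preserved.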

Allow pluggable similarity measure to `validate`.

  • Spearman
  • other methods from dist
  • straight function

.loadingCor <- function(dataset, avgLoading, method = "pearson", scale = FALSE) {
    if (any(class(dataset) == "ExpressionSet")) {
        dat <- Biobase::exprs(dataset)
    } else if (any(class(dataset) %in% c("SummarizedExperiment", "RangedSummarizedExperiment"))) {
        dat <- SummarizedExperiment::assay(dataset)
    } else if (any(class(dataset) == "matrix")) {
        dat <- dataset
    } else {
        stop("'dataset' should be one of the following objects: ExpressionSet, ",
             "SummarizedExperiment, RangedSummarizedExperiment, and matrix.")
    }
    if (isTRUE(scale)) {dat <- rowNorm(dat)} # row normalization
    dat <- dat[apply(dat, 1, function (x) {!any(is.na(x) | (x==Inf) | (x==-Inf))}),]
    gene_common <- intersect(rownames(avgLoading), rownames(dat))
    prcomRes <- stats::prcomp(t(dat[gene_common,])) # centered, but not scaled by default
    loadings <- prcomRes$rotation[, 1:8]
    loading_cor <- abs(stats::cor(avgLoading[gene_common,], loadings[gene_common,],
                                  use = "pairwise.complete.obs",
                                  method = method)) # similarity measure is fixed to stats::cor
    return(loading_cor)
}
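A sketch of how the similarity step might accept either a correlation method name or a user-supplied function follows. The function name `loading_similarity`, the `similarity` argument, and the convention that a custom function returns a columns-by-columns matrix are all assumptions for this example; distance measures from dist are not covered.

```r
# 'similarity' may be a method name understood by stats::cor() or a function(x, y)
# returning a (columns of x) by (columns of y) matrix. Both conventions are assumptions.
loading_similarity <- function(avgLoading, loadings, similarity = "pearson") {
  if (is.function(similarity)) {
    return(similarity(avgLoading, loadings))
  }
  method <- match.arg(similarity, c("pearson", "spearman", "kendall"))
  abs(stats::cor(avgLoading, loadings,
                 use = "pairwise.complete.obs", method = method))
}

# Example of a "straight function": absolute cosine similarity between columns.
cosine_sim <- function(x, y) {
  num <- crossprod(x, y)
  den <- outer(sqrt(colSums(x^2)), sqrt(colSums(y^2)))
  abs(num / den)
}

# Usage with random placeholder matrices sharing the same 20 rows (genes).
set.seed(4)
a <- matrix(rnorm(20 * 2), nrow = 20)
b <- matrix(rnorm(20 * 3), nrow = 20)
sp <- loading_similarity(a, b, similarity = "spearman")
cs <- loading_similarity(a, b, similarity = cosine_sim)
```

Dispatching on `is.function()` keeps the string interface working while leaving room for arbitrary measures.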

Questions about R version in this research

Hi, I tried to run the code for RAV model building and ran into errors during package installation. My R version is 4.1.3.
[image]
I cannot install the GF package with my version of R. Could you please share the versions of the software you used? Thanks a lot.

Documentation on normalization is unclear

It's not clear from ?validate whether users should normalize their data in advance. As I understand from the code, no normalization is done except for z-score if scale = TRUE, so users should do a log(x+1) transformation?
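For reference, the log(x+1) transformation mentioned here would look like this in base R. The count matrix is a made-up example, and whether this step is actually required before calling validate is exactly the open question, so this sketch does not answer it.

```r
# Hypothetical raw count matrix (genes x samples).
counts <- matrix(c(0, 10, 100, 5, 50, 500), nrow = 3,
                 dimnames = list(paste0("gene", 1:3), c("s1", "s2")))

# log1p(x) computes log(x + 1) and keeps zero counts at zero.
log_counts <- log1p(counts)
```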

How to create new models?

Hi,

Is it possible to create new models based on GenomicSuperSignature? For example, using GTEx as input together with different prior knowledge.

Simplify browsing of studies

Currently, looking up which studies contributed to a RAV requires something like the following:

> ravc2 <- GenomicSuperSignature::getModel(prior = "C2")
> colData(ravc2)[colData(ravc2)$RAV == "RAV272", "studies"] 
$Cl4764_272
[1] "ERP020977" "SRP039361" "SRP045352"

And then searching for the accession numbers, e.g. in the European Nucleotide Archive (ENA) Browser. It's even more difficult to find the PMIDs of the studies contributing to the RAV.

It would be very convenient to have a function that outputs these directly, either with a message saying to search for them in the ENA Browser or giving a direct link (e.g. https://www.ebi.ac.uk/ena/browser/view/ERP020977 for the first ERP above, although the ENA Browser annoyingly takes you down to the "reads" section of the page). Being able to browse the PMIDs of the studies would also be very useful.
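A helper along these lines could build the links from the accessions. The function name `ena_links` is hypothetical, and the URL pattern is simply the one from the example link above; PMID lookup is not covered.

```r
# Build ENA Browser links from study accessions; the helper name is hypothetical.
ena_links <- function(accessions) {
  stats::setNames(
    paste0("https://www.ebi.ac.uk/ena/browser/view/", accessions),
    accessions
  )
}

# Accessions taken from the RAV272 example output above.
studies <- c("ERP020977", "SRP039361", "SRP045352")
links <- ena_links(studies)
```

The result is a named character vector, so `links[["ERP020977"]]` returns the link for a single study.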

Questions about heatmap table function

Hi, I found that this step took a very long time to run:
[image]
I waited for more than one hour using one 30 GB GPU on an HPC cluster. Is this normal? Thanks a lot.
