Git Product home page Git Product logo

pacnet's Introduction

Platform-Agnostic CellNet

Introduction

Platform-Agnostic CellNet (PACNet) is our newest version of CellNet that is agnostic to transcriptome profiling method, computational preprocessing method, and gene availability arising from preprocessing method, thus allowing cross-study comparisons of cell fate engineering protocol performance.

We also provide reference panels of human engineered cell types from state-of-the-field protocols for HSPC, heart, intestine/colon, liver, lung, neuron, and skeletal_muscle samples against which you can compare your own samples. Engineered reference panels for other human cell types and all mouse cell types are not yet available. However, we provide the resources and instructions for training and running both human and mouse PACNet here.

Below is a walk-through tutorial on

  1. how to train a PACNet classifier using a preprocessed training expression matrix and
  2. how to apply the classifier to a preprocessed query study (and reference engineered panel, if applicable).

All the code below is also available in a single R file, agnosticCellNet_example.R.

See also

PACNet is also available as a web application, which takes as input an expression matrix (counts, TPM, or FPKM) and sample meta-data. The application performs CellNet analysis. Additionally, this tool includes analysis of many state-of-the-art differentiation protocols so that you can benchmark your results against those commonly used methods.

View our PACNet publication here.

For previous versions of CellNet, please visit the original CellNet repository.

Table of contents

Data

Installation

Training

Query


Data

Training Data

Species Metadata table Expression matrix Grn Object Training Parameters
Human Hs_stTrain_Jun-20-2017.rda Hs_expTrain_Jun-20-2017.rda Hs_grnAll_curatedTFs_Apr-22-2020.rda Hs_trainingNormalization_Apr-22-2020.rda
Mouse Mm_stTrain_Oct-24-2016.rda Mm_expTrain_Oct-24-2016.rda Mm_grnAll_curatedTFs_Apr-22-2020.rda Mm_trainingNormalization_Apr-22-2020.rda

Human Engineered Reference Panels

Cell type Metadata table Expression matrix Trained Classifier Grn Object Subset Training Parameter Subset
Heart heart_engineeredRef_sampTab_all.rda heart_engineeredRef_normalized_expDat_all.rda heart_broadClassifier100.rda heart_grnAll.rda heart_trainNormParam.rda
HSPC hspc_engineeredRef_sampTab_all.rda hspc_engineeredRef_normalized_expDat_all.rda hspc_broadClassifier100.rda hspc_grnAll.rda hspc_trainNormParam.rda
Intestine/colon intestine_colon_engineeredRef_sampTab_all.rda intestine_colon_engineeredRef_normalized_expDat_all.rda intestine_colon_broadClassifier100.rda intestine_colon_grnAll.rda intestine_colon_trainNormParam.rda
Liver liver_engineeredRef_sampTab_all.rda liver_engineeredRef_normalized_expDat_all.rda broadClassifier100.rda liver_grnAll.rda liver_trainNormParam.rda
Lung lung_engineeredRef_sampTab_all.rda lung_engineeredRef_normalized_expDat_all.rda lung_broadClassifier100.rda lung_grnAll.rda lung_trainNormParam.rda
Neuron neuron_engineeredRef_sampTab_all.rda neuron_engineeredRef_normalized_expDat_all.rda neuron_broadClassifier100.rda neuron_grnAll.rda neuron_trainNormParam.rda
Skeletal muscle skeletal_muscle_engineeredRef_sampTab_all.rda skeletal_muscle_engineeredRef_normalized_expDat_all.rda skeletal_muscle_broadClassifier100.rda skeletal_muscle_grnAll.rda skeletal_muscle_trainNormParam.rda

Installation

install.packages("devtools")
library(devtools)
install_github("pcahan1/CellNet", ref="master")
install_github("pcahan1/[email protected]", ref="master")
source("pacnet_utils.R")

Other required packages: plyr, ggplot2, RColorBrewer, pheatmap, plotly

Prerequisites

Load required R packages.

library(CellNet)
library(cancerCellNet)
library(plyr)
library(ggplot2)
library(RColorBrewer)
library(pheatmap)
library(plotly)
library(igraph)

Training

Confirm correct format of training data expression matrix and training sample metadata table

Expression matrix should have gene symbols as row names and sample names as column names. Sample metadata table should have sample names as row names and sample features as column names. Column names of expression matrix must match row names of metadata table. (See the example_data folder for a small example of an expression matrix and metadata table.)

For classifier training to be robust, there should be at least 60 independent replicates per training type.

Begin Training

Load training data:

expTrain <- utils_loadObject("Hs_expTrain_Jun-20-2017.rda") 
stTrain <- utils_loadObject("Hs_stTrain_Jun-20-2017.rda")

Load engineered reference data and query data. We need to load these at this point to identify genes found in across all datasets.

liverRefExpDat <- utils_loadObject("liver_engineeredRef_normalized_expDat_all.rda")
liverRefSampTab <- utils_loadObject("liver_engineeredRef_sampTab_all.rda")
queryExpDat <- read.csv("example_data/example_counts_matrix.csv", row.names=1)
querySampTab <- read.csv("example_data/example_sample_metadata_table.csv")
rownames(querySampTab) <- querySampTab$sample_name
study_name <- "liver_example"

Identify intersecting genes:

iGenes <- intersect(rownames(expTrain), rownames(liverRefExpDat))
iGenes <- intersect(iGenes, rownames(queryExpDat))
# Subset training expression matrix based on iGenes
expTrain <- expTrain[iGenes,]

Split the training data into a training subset and a validation subset:

set.seed(99) # Setting a seed for the random number generator allows us to reproduce the same split in the future
stList <- splitCommon_proportion(sampTab = stTrain, proportion = 0.66, dLevel = "description1") # Use 2/3 of training data for training and 1/3 for validation
stTrainSubset <- stList$trainingSet
expTrainSubset <- expTrain[,rownames(stTrainSubset)]

#See number of samples of each unique type in description1 in training subset
table(stTrainSubset$description1)

stValSubset <- stList$validationSet
expValSubset <- expTrain[,rownames(stValSubset)]
#See number of samples of each unique type in description1 in validation subset
table(stValSubset$description1)

Train the random forest classifier, takes 3-10 minutes depending on memory availability:

system.time(my_classifier <- broadClass_train(stTrain = stTrainSubset, 
                                expTrain = expTrainSubset, 
                                colName_cat = "description1", 
                                colName_samp = "sra_id", 
                                nRand = 70, 
                                nTopGenes = 100, 
                                nTopGenePairs = 100, 
                                nTrees = 2000, 
                                stratify=TRUE, 
                                sampsize=25, # Must be less than the smallest number in table(stTrainSubset$description1)
                                quickPairs=TRUE)) # Increasing the number of top genes and top gene pairs increases the resolution of the classifier but increases the computing time
save(my_classifier, file="example_outputs/cellnet_classifier_100topGenes_100genePairs.rda")

Classifier Validation

Plot validation heatmap:

stValSubsetOrdered <- stValSubset[order(stValSubset$description1), ] #order samples by classification name
expValSubset <- expValSubset[,rownames(stValSubsetOrdered)]
cnProc <- my_classifier$cnProc #select the cnProc from the earlier class training

classMatrix <- broadClass_predict(cnProc, expValSubset, nrand = 60) 
stValRand <- addRandToSampTab(classMatrix, stValSubsetOrdered, desc="description1", id="sra_id")

grps <- as.vector(stValRand$description1)
names(grps)<-rownames(stValRand)

# Create validation heatmap
png(file="classification_validation_hm.png", height=6, width=10, units="in", res=300)
ccn_hmClass(classMatrix, grps=grps, fontsize_row=10)
dev.off()


Plot validation precision-recall curves:

assessmentDat <- ccn_classAssess(classMatrix, stValRand, classLevels="description1", dLevelSID="sra_id")
png(file="example_outputs/classifier_assessment_PR.png", height=8, width=10, units="in", res=300)
plot_class_PRs(assessmentDat)
dev.off()


Gene pair validation

genePairs <- cnProc$xpairs
# Get gene to gene comparison of each gene pair in the expression table
expTransform <- query_transform(expTrainSubset, genePairs)
avgGenePair_train <- avgGeneCat(expDat = expTransform, sampTab = stTrainSubset, 
                                dLevel = "description1", sampID = "sra_id")

genePairs_val <- query_transform(expValSubset, genePairs)
geneCompareMatrix <- makeGeneCompareTab(queryExpTab = genePairs_val,
                                        avgGeneTab = avgGenePair_train, geneSamples = genePairs)
val_grps <- stValSubset[,"description1"]
val_grps <- c(val_grps, colnames(avgGenePair_train))
names(val_grps) <- c(rownames(stValSubset), colnames(avgGenePair_train))

png(file="example_outputs/validation_gene-pair_comparison.png", width=10, height=80, units="in", res=300)
plotGeneComparison(geneCompareMatrix, grps = val_grps, fontsize_row = 6)
dev.off()

Example xpairs plot

Create and save xpairs_list object for grn reconstruction and training normalization parameters:

xpairs_list <- vector("list", 14) 
for (pair in rownames(avgGenePair_train)) {
   for (j in 1:ncol(avgGenePair_train)) {
      if (avgGenePair_train[pair,j] >= 0.5) {
         if (is.null(xpairs_list[[j]])) {
            xpairs_list[[j]] <- c(pair)
         } else { 
            xpairs_list[[j]] <- c(xpairs_list[[j]], pair)
         }
      }  
   }
}
xpair_names <- colnames(avgGenePair_train)
xpair_names <- sub(pattern="_Avg", replacement="", x=xpair_names)
names(xpairs_list) <- xpair_names

for (type in names(xpairs_list)) {
   names(xpairs_list[[type]]) <- xpairs_list[[type]]
}
save(xpairs_list, file="example_outputs/Hs_xpairs_list.rda")

Querying the classifier

Classify engineered reference panel samples

classMatrixLiverRef <- broadClass_predict(cnProc = cnProc, expDat = liverRefExpDat, nrand = 10) 
grp_names1 <- c(as.character(liverRefSampTab$description1), rep("random", 10))
names(grp_names1) <- c(as.character(rownames(liverRefSampTab)), paste0("rand_", c(1:10)))

# Re-order classMatrixQuery to match order of rows in querySampTab
classMatrixLiverRef <- classMatrixLiverRef[,names(grp_names1)]

Plot classification heatmap:

png(file="example_outputs/heatmapLiverRef.png", height=12, width=9, units="in", res=300)
heatmapRef(classMatrixLiverRef, liverRefSampTab) # This function can be found in pacnet_utils.R
dev.off()

# Alternatively, for an interactive plotly version:
heatmapPlotlyRef(classMatrixLiverRef, liverRefSampTab)


Classify query samples

Perform log transform:

queryExpDat <- log(1+queryExpDat)

Classify query samples:

classMatrixQuery <- broadClass_predict(cnProc = cnProc, expDat = queryExpDat, nrand = 3) 
grp_names <- c(as.character(querySampTab$description1), rep("random", 3))
names(grp_names) <- c(as.character(rownames(querySampTab)), paste0("rand_", c(1:3)))

# Re-order classMatrixQuery to match order of rows in querySampTab
classMatrixQuery <- classMatrixQuery[,names(grp_names)]

save(classMatrixQuery, file="example_outputs/example_classificationMatrix.rda")

Plot classification heatmap:

png(file="example_outputs/query_classification_heatmap.png", height=4, width=8, units="in", res=300)
# This function can be found in pacnet_utils.R
acn_queryClassHm(classMatrixQuery, main = paste0("Classification Heatmap, ", study_name), 
                 grps = grp_names, 
                 fontsize_row=10, fontsize_col = 10, isBig = FALSE)
dev.off()

Prepare GRN and expression data for Network Influence Score calculation

Subset grnAll and trainNormParam objects based on intersecting genes.

grnAll <- utils_loadObject("liver_grnAll.rda")
trainNormParam <- utils_loadObject("liver_trainNormParam.rda")


# These two functions can be found in pacnet_utils.R
grnAll <- subsetGRNall(grnAll, iGenes)
trainNormParam <- subsetTrainNormParam(trainNormParam, grnAll, iGenes)


queryExpDat_ranked <- logRank(queryExpDat, base = 0)
queryExpDat_ranked <- as.data.frame(queryExpDat_ranked)

Compute Network Influence Score (NIS) for transcriptional regulators

Compute and save TF scores:

network_cell_type <- "liver"
target_cell_type <- "liver"
system.time(TF_scores <- pacnet_nis(expDat = queryExpDat_ranked, stQuery=querySampTab, iGenes=iGenes,
                                    grnAll = grnAll, trainNorm = trainNormParam,
                                    subnet = network_cell_type, ctt=target_cell_type,
                                    colname_sid="sample_name", relaWeight=0))

save(TF_scores, file="example_outputs/my_study_TF_scores.rda")

Choose top-scoring 25 TFs for plotting:

TFsums <- rowSums(abs(TF_scores))
ordered_TFsums <- TFsums[order(TFsums, decreasing = TRUE)]
if(length(TFsums) > 25) {
    top_display_TFs <- names(ordered_TFsums)[1:25]    
} else {
    top_display_TFs <- names(ordered_TFsums)
}
TF_scores <- TF_scores[top_display_TFs,]

Plot TF scores:

sample_names <- rownames(querySampTab)

pdf(file="example_outputs/my_study_TF_scores_my_cell_type.pdf", height=6, width=8)
for(sample in sample_names) {
   descript <- querySampTab$description1[which(rownames(querySampTab) == sample)]
   plot_df <- data.frame("TFs" = rownames(TF_scores),
                         "Scores" = as.vector(TF_scores[,sample]))
   sample_TFplot <- ggplot(plot_df, aes(x = reorder(TFs,Scores,mean) , y = Scores)) + 
     geom_bar(stat="identity") + #aes(fill = medVal)) +
      theme_bw() + 
      ggtitle(paste0(sample, ", ", descript, ", ", target_cell_type, " transcription factor scores")) +
      ylab("Network influence score") + xlab("Transcriptional regulator") + 
      theme(legend.position = "none", axis.text = element_text(size = 8)) +
      theme(text = element_text(size=10), 
            legend.position="none",
            axis.text.x = element_text(angle = 45, vjust=0.5))
   print(sample_TFplot)
}
dev.off()

Negative TF scores indicate that a given TF should be upregulated to achieve an identity more similar to the target cell type. Positive TF scores indicate that a given TF should be downregulated to achieve an identity more similar to the target cell type.

Example NIS plots

Fin.


pacnet's People

Contributors

pcahan1 avatar

Stargazers

SeoheeK avatar Wang Ziyi avatar  avatar

Watchers

 avatar

Forkers

libertatem

pacnet's Issues

The data in CellNet cannot be downloaded.

1
2
The training dataset on the website needs to be downloaded through Amazon Cloud. However, it shows that there is no permission to download on Amazon Cloud. Could you please upload the dataset directly to GitHub? Thank you very much.

The online site cannot be connected

Dear author,
Hello, thank you very much for providing such a good software, here I have a few questions:

First of all, when I was using an online site, "Disconnected form the server" was displayed after I submitted the data, and I didn't know how to solve this problem.
51225a412ca58f24979e89979ebf9b8

Then, I tried to train the model myself, and I was able to run through the whole process, but there were two problems. One was that I was studying the endothelial cells of the kidney, and I found that I needed a GRN file in the step of predicting transcription factors. You provided the GRN files of seven tissues. I temporarily used "heart_grnAll.rda" and "heart_trainNormParam.rda", but I don't know whether such prediction results are valuable for me. Secondly, I found that I can only predict positive scores according to the code you provided. But your site can get both positive and negative scores, and I don't know if I need to adjust a certain parameter.
62533b6b5ad0316e78683e3ab54ecb6

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.