cri-iatlas / immunesubtypeclassifier

An R package for classification of immune subtypes, in cancer, using gene expression data.

License: Other

R 100.00%

immunesubtypeclassifier's Introduction

ImmuneSubtypeClassifier

This is an R package for classification of PanCancer immune subtypes. Five gene signatures were used in the initial clustering of tumor samples as part of the Immune Landscape of Cancer manuscript, providing 485 genes that were used to create quartile and binary-valued gene-pair features. With these features, an ensemble of XGBoost classifiers was trained to predict subtype membership, where each member of the ensemble was trained on 70% of 9,129 samples.
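The two feature types mentioned above can be illustrated on a toy matrix. This is a minimal sketch, not the package's implementation: the gene names, the expression values, the choice to enumerate every gene pair, and the per-sample quartile binning are all assumptions made for illustration.

```r
# Toy expression matrix: genes in rows, samples in columns (values are made up).
X <- matrix(
  c(5, 1,
    2, 8,
    9, 3,
    4, 6),
  nrow = 4, byrow = TRUE,
  dimnames = list(c("geneA", "geneB", "geneC", "geneD"), c("S1", "S2"))
)

# Binary gene-pair features: 1 if the first gene is expressed below the second
# gene within the same sample, otherwise 0. Using all pairs is an assumption.
pairs <- combn(rownames(X), 2)
pair_features <- t(apply(pairs, 2, function(p) as.integer(X[p[1], ] < X[p[2], ])))
dimnames(pair_features) <- list(paste(pairs[1, ], pairs[2, ], sep = "<"), colnames(X))

# Quartile features: bin each gene into quartiles 1..4 within each sample.
# How the package defines its quartile features is not stated here; this is one guess.
quartile_features <- apply(X, 2, function(x) {
  breaks <- quantile(x, probs = seq(0, 1, 0.25))
  as.integer(cut(x, breaks = breaks, include.lowest = TRUE))
})
rownames(quartile_features) <- rownames(X)

pair_features
quartile_features
```

Both feature types depend only on the ordering of values within a sample, not on their absolute scale.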

library(devtools)
install_github("CRI-iAtlas/ImmuneSubtypeClassifier")

# Right now, the newest version of xgboost is incompatible with this package.
# Please install an earlier version; 1.0.0.1 or 1.0.0.2 should work.
devtools::install_version("xgboost", version = "1.0.0.1")

library(ImmuneSubtypeClassifier)
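Because the package depends on specific xgboost versions, it may help to check the installed version before making calls. The helper below is hypothetical and not part of the package; it only encodes the two versions that this README says should work.

```r
# Versions this README reports as working; anything else is treated as untested.
known_good <- c("1.0.0.1", "1.0.0.2")

# Hypothetical helper: TRUE if the given version string is one of the known-good ones.
xgboost_version_ok <- function(version) {
  as.character(version) %in% known_good
}

xgboost_version_ok("1.0.0.2")
xgboost_version_ok("1.5.0.2")

# With xgboost installed, the check could be applied like this:
# xgboost_version_ok(utils::packageVersion("xgboost"))
```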

To get a list of the genes needed:

data(ebpp_gene)

head(ebpp_genes_sig)  ### 485 genes are needed

To make calls on new data,

Xtest <- as.matrix(X) # has gene IDs in rownames and sample IDs in column names

calls <- callEnsemble(X=Xtest, geneids='symbol')

# or in parallel (currently not working)
calls <- parCallEnsemble(X=Xtest, geneids='symbol', numCores=4)

Where gene IDs are 'symbol', 'entrez', or 'ensembl'.

If you want to be safe, map your own gene IDs to symbols, using your favorite method.
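One way to do that mapping with base R is sketched below. The lookup table is hypothetical: its three rows reuse IDs that appear in the geneMatchErrorReport output in this README, and "ENSG00000000000" is a made-up ID standing in for a gene with no match.

```r
# Hypothetical ID lookup table (in practice this would come from your annotation source).
id_map <- data.frame(
  Symbol  = c("C12orf24", "C13orf18", "C13orf27"),
  Entrez  = c("29902", "80183", "93081"),
  Ensembl = c("ENSG00000204856", "ENSG00000102445", "ENSG00000151287"),
  stringsAsFactors = FALSE
)

# Toy expression matrix keyed by Ensembl IDs (values are made up).
X <- matrix(
  c(1.2, 3.4,
    5.6, 7.8,
    9.0, 2.1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("ENSG00000102445", "ENSG00000204856", "ENSG00000000000"),
                  c("S1", "S2"))
)

# Translate row names to symbols; drop unmapped rows and repeated symbols.
symbols <- id_map$Symbol[match(rownames(X), id_map$Ensembl)]
keep <- !is.na(symbols) & !duplicated(symbols)
X_sym <- X[keep, , drop = FALSE]
rownames(X_sym) <- symbols[keep]

X_sym
```

The resulting matrix can then be passed with geneids='symbol'.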

To see whether and where gene ID matching has failed:

report <- geneMatchErrorReport(X=Xtest, geneid='symbol')

This returns the proportion of missing genes (from 485 total) and a data.frame of missing gene IDs.

$matchError
[1] 0.03505155

$missingGenes
        Symbol Entrez         Ensembl
1794  C12orf24  29902 ENSG00000204856
1841  C13orf18  80183 ENSG00000102445
1844  C13orf27  93081 ENSG00000151287
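The proportion reported above can be understood with a small base R sketch. This is an illustration with made-up gene names, not the package's code. As a consistency check, 17 unmatched genes out of 485 would give 17/485, which agrees with the 0.03505155 shown above.

```r
# Genes a classifier would need (toy list) and a toy input matrix.
required <- c("geneA", "geneB", "geneC", "geneD", "geneE")
X <- matrix(0, nrow = 3, ncol = 2,
            dimnames = list(c("geneA", "geneC", "geneX"), c("S1", "S2")))

# Required genes that are absent from the input, and their share of the total.
missing_genes <- setdiff(required, rownames(X))
match_error <- length(missing_genes) / length(required)

missing_genes
match_error

# The value in the report above is consistent with 17 of 485 genes missing.
17 / 485
```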

The resulting 'calls' data frame has sample IDs in the first column, the best call ('BestCall') in the second, and the probability of belonging to each subtype in the remaining columns.

inst/how_the_model_was_fit.Rmd and inst/algorithm_details.txt have information on how the model was built.
inst/data/five_signature_mclust_ensemble_results.tsv.gz contains TCGA subtype membership from the manuscript; using column 'ClusterModel1' is suggested.
inst/important_features_in_the_ensemble_model.tsv lists the important features in each subtype/ensemble member.

Also see scripts in the test directory for more detailed instructions on fitting one subtype model, a model per subtype and ensembles of models.

The following script should work.

library(readr)
library(ImmuneSubtypeClassifier)

download.file(url = 'https://raw.githubusercontent.com/CRI-iAtlas/shiny-iatlas/develop/data/ebpp_test1_1to20.tsv', destfile = 'ebpp_test.tsv')
dat <- read_tsv('ebpp_test.tsv')

dat2 <- as.data.frame(dat[!duplicated(dat$GeneID),])
Xmat <- dat2[,-1]
rownames(Xmat) <- dat2[,1]

res0 <- ImmuneSubtypeClassifier::callEnsemble(X = Xmat, geneids = 'symbol')
res0

   SampleIDs BestCall            1            2            3            4            5            6
1        XY1        4 1.121173e-07 1.873900e-06 6.311234e-02 0.9027497470 5.132385e-02 5.121503e-05
2        XY2        3 4.243225e-02 2.309184e-06 5.495564e-01 0.0084770960 1.000665e-04 2.261875e-04
3        XY3        4 3.095071e-04 1.029907e-06 1.635270e-01 0.9063920975 5.731729e-03 1.741630e-04
4        XY4        4 1.154656e-04 3.823049e-07 2.888787e-02 0.8831390142 1.236814e-03 7.847964e-05
5        XY5        4 1.193054e-07 8.741189e-06 5.060996e-01 0.9260828793 1.232723e-02 5.358944e-04
6        XY6        4 2.479082e-04 5.528710e-05 1.183165e-04 0.9923758209 1.019272e-03 2.854635e-04
7        XY7        6 9.421768e-03 1.079501e-03 1.000559e-01 0.0084094275 3.649302e-05 6.113418e-01
8        XY8        3 6.143126e-04 2.594000e-06 8.872822e-01 0.1306741871 8.199657e-04 1.142911e-04
9        XY9        4 1.001003e-04 4.479314e-06 3.989228e-03 0.9793768525 2.182218e-03 1.554600e-04
10      XY10        2 4.624279e-06 9.888145e-01 7.359737e-06 0.0053910189 3.058250e-05 6.004089e-04
11      XY11        4 5.837736e-05 7.877929e-06 1.970483e-03 0.9895575941 4.063185e-03 6.252101e-04
12      XY12        3 1.944198e-06 2.005060e-06 3.616153e-01 0.5258071870 1.320550e-02 2.124778e-04
13      XY13        4 9.002434e-07 9.563100e-04 2.673559e-06 0.9931332767 1.996231e-04 1.273039e-04
14      XY14        4 1.235332e-05 3.907399e-05 2.785671e-03 0.9972651005 2.466791e-03 5.618507e-04
15      XY15        4 4.831095e-05 1.127145e-04 1.685999e-04 0.9935480356 4.465038e-03 2.503012e-04
16      XY16        3 2.731656e-03 1.722274e-05 9.742068e-01 0.0007705171 4.555504e-05 6.263918e-03
17      XY17        4 3.431090e-06 1.327718e-03 1.570120e-05 0.9971910715 3.616062e-05 1.134532e-04
18      XY18        3 3.816116e-07 5.000733e-03 8.685111e-01 0.0030849737 2.343850e-04 2.590704e-03
19      XY19        4 4.787456e-05 2.856209e-06 5.303099e-01 0.9858489633 2.194258e-02 9.931145e-05
20      XY20        4 9.208488e-04 1.361713e-04 3.989896e-04 0.9504880905 1.666186e-03 1.220569e-02

These results match what's found on cri-iatlas.org / tools.
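In a result table like the one above, BestCall does not always coincide with the column holding the largest probability (compare XY12). The source does not explain how BestCall is derived, so the sketch below only flags such rows; it copies three rows from the table above and uses base R's max.col.

```r
# Three rows copied from the example output above.
res <- data.frame(
  SampleIDs = c("XY2", "XY10", "XY12"),
  BestCall  = c(3L, 2L, 3L),
  `1` = c(4.243225e-02, 4.624279e-06, 1.944198e-06),
  `2` = c(2.309184e-06, 9.888145e-01, 2.005060e-06),
  `3` = c(5.495564e-01, 7.359737e-06, 3.616153e-01),
  `4` = c(0.0084770960, 0.0053910189, 0.5258071870),
  `5` = c(1.000665e-04, 3.058250e-05, 1.320550e-02),
  `6` = c(2.261875e-04, 6.004089e-04, 2.124778e-04),
  check.names = FALSE
)

# Index of the subtype column with the highest probability in each row.
prob_cols <- as.character(1:6)
res$TopProbability <- max.col(as.matrix(res[, prob_cols]), ties.method = "first")

# Samples where BestCall and the highest-probability subtype disagree.
discordant <- res$SampleIDs[res$BestCall != res$TopProbability]
discordant
```

Rows flagged this way are worth inspecting by hand before choosing which label to report.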

Looking at feature importance, you will see that the most important features for classification come from the binary gene-gene comparison applied at the signature level. Such a feature summarizes the question "are the genes in signature 1 (s1) expressed at a lower level than the genes in signature 2 (s2)?", abbreviated "s1s2".

label  signature_name
s1     LIexpression_score
s2     CSF1_response
s3     Module3_IFN_score
s4     TGFB_score_21050467
s5     CHANG_CORE_SERUM_RESPONSE_UP
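A signature-level comparison such as "s1s2" could be computed along the following lines. This is only a sketch: the gene names and values are made up, and summarizing each signature by its per-sample median is an assumption; the package may aggregate the gene-level comparisons differently.

```r
# Toy signatures (made-up gene names) and a toy expression matrix.
signatures <- list(s1 = c("g1", "g2", "g3"), s2 = c("g4", "g5"))
X <- matrix(
  c(1, 2, 3, 10, 20,
    9, 8, 7, 1, 2),
  ncol = 2,
  dimnames = list(paste0("g", 1:5), c("S1", "S2"))
)

# Per-sample summary of each signature (median is an assumption).
# Result: samples in rows, signatures in columns.
sig_level <- sapply(signatures, function(genes) {
  apply(X[genes, , drop = FALSE], 2, median)
})

# "s1s2": 1 if signature 1 is expressed at a lower level than signature 2.
s1s2 <- as.integer(sig_level[, "s1"] < sig_level[, "s2"])
names(s1s2) <- colnames(X)
s1s2
```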

immunesubtypeclassifier's Issues

The problem of authorization of the source code in ImmuneSubtypeClassifier

Recently, I needed to classify samples into subtypes based on several gene signatures.

After reading and testing the source code of ImmuneSubtypeClassifier, I think its basis, Top Scoring Pairs, could be generalized to any new gene signatures, which is what I need.

I have built a new R package (not yet pushed to GitHub) that can classify samples into subtypes based on any gene signatures the user provides (including the gene signatures used in the ImmuneSubtypeClassifier package).

I plan to publish a paper in the future, and the new package would be part of it. But I'm not sure whether it is proper to publish it directly, because it draws heavily on your source code for the Top Scoring Pairs algorithm.

Could you give me some suggestions? Thanks a lot!

discordance between BestCall and probabilities

Hi,
this is a very good tool that helps us classify tumors into different immune subtypes!
I ran the function and got the results. In the results table, I found some discordance between BestCall and the probabilities of belonging to each subtype. For instance, for one sample BestCall indicates the C3 subtype, but C4 has the higher probability. Which result should I keep?

Here is the summary information for the series: 18 samples have this inconsistency.

Thanks in advance!

numCores not working

Hi Dave, I'm using this commit: 30e6215

and getting this error:

calls <- ImmuneSubtypeClassifier::callEnsemble(
  input_matrix,
  numCores = args$num_cores
)

Error in ImmuneSubtypeClassifier::callEnsemble(input_matrix, numCores = args$num_cores) :
unused argument (numCores = args$num_cores)
Execution halted

installation error: ggplot2

Here is the error message.

Installing package into ‘/home/xxxxxxx/Rlib/R_4.0’
(as ‘lib’ is unspecified)

installing source package ‘ImmuneSubtypeClassifier’ ...
** using staged installation
** R
** data
*** moving datasets to lazyload DB
** inst
** byte-compile and prepare package for lazy loading
Error: (converted from warning) package ‘ggplot2’ was built under R version 4.0.3
Execution halted
ERROR: lazy loading failed for package ‘ImmuneSubtypeClassifier’
removing ‘/home/xxxxxxx/Rlib/R_4.0/ImmuneSubtypeClassifier’
Error: Failed to install 'ImmuneSubtypeClassifier' from GitHub:
(converted from warning) installation of package ‘/tmp/Rtmp3GLvxl/file18c167b252f2/ImmuneSubtypeClassifier_0.1.0.tar.gz’ had non-zero exit status

GEO dataset get C2 subtype only

Hi,

This is a great package that can help our research a lot, even though it is a little difficult to install. However, that is not my main point.

I have used this package to classify several datasets, but I found that it performs well only on TCGA datasets. For several GEO datasets (I have tried 7 sets, all microarray), I get only the C2 subtype.

Then I tried a TARGET dataset, which was generated by RNA-seq. I get C1, C4 and only a few C6/C2 subtypes. I don't know why. Maybe this dataset really does have only a few C6/C2 subtypes.

So I wonder whether this package cannot be used for microarray data. But since it is built on Pearson correlation analysis, I would not expect microarray data to be inapplicable.

sincerely yours!

Model prediction accuracy

The ImmuneSubtypeClassifier is very useful, but I would like to know the accuracy of the model on test data. I could not find it; could you share it with me?

ningwei
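The thread does not report an accuracy figure. For anyone who wants to estimate agreement on their own data, a confusion table against reference labels (for example the 'ClusterModel1' column mentioned in the README) can be built in base R. The two label vectors below are made up for illustration.

```r
# Made-up predicted and reference subtype labels for six samples.
predicted <- c(4, 3, 4, 2, 6, 3)
reference <- c(4, 3, 3, 2, 6, 3)

# Use a shared set of levels so the confusion table is square (subtypes 1 to 6).
subtype_levels <- 1:6
confusion <- table(
  Predicted = factor(predicted, levels = subtype_levels),
  Reference = factor(reference, levels = subtype_levels)
)

# Overall accuracy: share of samples on the diagonal.
accuracy <- sum(diag(confusion)) / sum(confusion)

confusion
accuracy
```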

'ebpp_genes_sig' not found

Hi Dave, I'm using this commit: 30e6215

and getting this error:

calls <- ImmuneSubtypeClassifier::callEnsemble(input_matrix)

Error in geneMatch(X, geneids) : object 'ebpp_genes_sig' not found
Calls: -> geneMatch
In addition: Warning messages:
1: In data("subtype_caller_model") :
data set ‘subtype_caller_model’ not found
2: In data("ensemble_model") : data set ‘ensemble_model’ not found
3: In data(ebpp_gene) : data set ‘ebpp_gene’ not found
Execution halted

Error with callEnsemble

With the demo data Xmat, the callEnsemble function works fine, but with COAD data (a subset of 20 samples) I get the error below. It does not matter whether symbols or Entrez IDs are used. Histograms of the values for Xmat and the COAD data are almost identical (attached). I've dug through the code, and it's failing in the dataProc function inside the callOneSubtype function. I'll continue down the chain of functions, but I'm already 3 levels deep. I also tried installing the older xgboost as recommended, but that didn't resolve the issue.

Error in `rownames<-`(`*tmp*`, value = gs) :
attempt to set 'rownames' on an object with no dimensions

Below is my sessionInfo (with xgboost_1.5):

sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Big Sur 11.6.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.1/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] parallel stats graphics grDevices utils datasets methods base

other attached packages:
[1] RColorBrewer_1.1-2 ImmuneSubtypeClassifier_0.1.0 plotROC_2.2.1
[4] xgboost_1.5.0.2 gridExtra_2.3 forcats_0.5.1
[7] stringr_1.4.0 purrr_0.3.4 readr_2.1.1
[10] tidyr_1.1.4 tibble_3.1.6 tidyverse_1.3.1
[13] dplyr_1.0.7 ggplot2_3.3.5

loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 lubridate_1.8.0 lattice_0.20-45 assertthat_0.2.1 digest_0.6.29 utf8_1.2.2
[7] R6_2.5.1 cellranger_1.1.0 backports_1.4.0 reprex_2.0.1 evaluate_0.14 httr_1.4.2
[13] pillar_1.6.4 rlang_0.4.12 readxl_1.3.1 rstudioapi_0.13 data.table_1.14.2 jquerylib_0.1.4
[19] Matrix_1.3-4 rmarkdown_2.11 bit_4.0.4 munsell_0.5.0 broom_0.7.10 compiler_4.1.1
[25] modelr_0.1.8 xfun_0.28 pkgconfig_2.0.3 htmltools_0.5.2 tidyselect_1.1.1 fansi_0.5.0
[31] crayon_1.4.2 tzdb_0.2.0 dbplyr_2.1.1 withr_2.4.3 grid_4.1.1 jsonlite_1.7.2
[37] gtable_0.3.0 lifecycle_1.0.1 DBI_1.1.1 magrittr_2.0.1 scales_1.1.1 vroom_1.5.7
[43] cli_3.1.0 stringi_1.7.6 fs_1.5.2 bslib_0.3.1 xml2_1.3.3 ellipsis_0.3.2
[49] generics_0.1.1 vctrs_0.3.8 tools_4.1.1 bit64_4.0.5 glue_1.5.1 hms_1.1.1
[55] fastmap_1.1.0 yaml_2.2.1 colorspace_2.0-2 rvest_1.0.2 knitr_1.36 haven_2.4.3
[61] sass_0.4.0

issue with last version of xgboost

Hi,

The package seems to be incompatible with the latest version of xgboost, probably because of the issue outlined here:
dmlc/xgboost#5794

Using version 1.0.0.2 of xgboost works OK (although it warns that the models were built with xgboost < 1.0.0; it might be good to upgrade them anyway).

The error message I got was -
Error in predict.xgb.Booster(mi$bst, Xbin) :
[12:28:19] amalgamation/../src/learner.cc:506: Check failed: mparam_.num_feature != 0 (0 vs. 0) : 0 feature is supplied. Are you using raw Booster interface?

TCGA subtype problem

Hi,
This is a very useful package! I downloaded FPKM gene expression data from TCGA with the TCGAbiolinks package and tried to get the immune subtypes of the LUAD samples using this package. When I compare the resulting subtypes with the subtypes from the paper (five_signature_mclust_ensemble_results.tsv.gz), they are different.
But when I use the gene expression data from the paper (ebppSubset.tsv.bz2) to classify these patients, the subtypes match the paper's results (five_signature_mclust_ensemble_results.tsv.gz). So I would like to know whether there is a problem with classifying immune subtypes from FPKM gene expression data downloaded from TCGA with TCGAbiolinks. Do I need to process the downloaded expression data in some way first?

Thanks!
Here are some code and results:
code and result .pptx

Run with one sample

When I run the classifier with one sample,
I get error below:

Error in `rownames<-`(`*tmp*`, value = ebpp_genes_sig$Symbol) : 
  attempt to set 'rownames' on an object with no dimensions
Calls: capture.output ... withCallingHandlers -> <Anonymous> -> geneMatch -> rownames<-
Execution halted

With two samples, it works smoothly.
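The same error message can be reproduced in base R when a one-column matrix is subset without drop = FALSE, because the result collapses to a plain vector. Whether this is what happens inside geneMatch is an assumption; the sketch below only shows the general mechanism with made-up data.

```r
# One-sample expression matrix (values are made up).
X <- matrix(c(1, 2, 3), ncol = 1,
            dimnames = list(c("g1", "g2", "g3"), "S1"))

# Default subsetting drops the matrix structure when only one column remains.
dropped <- X[c("g1", "g2"), ]
is.null(dim(dropped))

# Assigning row names to the resulting vector fails; capture the message.
err <- tryCatch({
  rownames(dropped) <- c("a", "b")
  NULL
}, error = function(e) conditionMessage(e))
err

# With drop = FALSE the result stays a matrix and row names can be set.
kept <- X[c("g1", "g2"), , drop = FALSE]
rownames(kept) <- c("a", "b")
dim(kept)
```

Since the report says two samples work, passing the single sample together with a second column and keeping only its row of the result may serve as a workaround; this has not been tested here.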

issue with xgboost

Hi,

I used a subset of TCGA data as a test. However, I got this error:

Error in predict.xgb.Booster(mi$bst, Xbin) : 
  [22:12:42] amalgamation/../src/learner.cc:506: Check failed: mparam_.num_feature != 0 (0 vs. 0) : 0 feature is supplied.  Are you using raw Booster interface?

Could you help me solve this?
I think it would be good to add a test dataset so that others can try the tool!

Data preparation of ImmuneSubtypeClassifier Package

Hello~
After some tests, I find this package easy to use, and its concordance appears robust according to the reported results. Here are my questions:

1. Are there any limitations on the form of the gene expression data? For example, can I use gene expression data from an Affymetrix microarray, or is it suited only to RPKM data?

2. Before running the pipeline, should the gene matrix be normalized or scaled?
Thanks~

Most ideal normalization space for expression data?

Hello,

Firstly, thank you for compiling this resource for the public. The whole work is an incredible inspiration to me and others. I have been using gene expression data and exploring these immune subtypes. I noticed that there is a significant difference in best immune subtype calls depending on whether I run RSEM expected counts vs TPM gene expression data. The TPM data results in a much higher proportion of C4 in my data. I was wondering if in theory, one or the other would be more "valid"?

Thanks for your time
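Binary gene-pair features of the kind described in the README depend only on the order of genes within a sample. A monotonic transformation such as log2 keeps that order, whereas dividing by gene length (as TPM does) can change it. The sketch below shows this with made-up counts and gene lengths. Whether this explains the difference reported here is a hypothesis, not something the source confirms.

```r
# Made-up expected counts and gene lengths (in kb) for one sample.
counts <- c(geneA = 100, geneB = 300, geneC = 200)
gene_length_kb <- c(geneA = 1, geneB = 10, geneC = 4)

# Within-sample pairwise comparisons for the three gene pairs.
pair_bits <- function(x) {
  c(A_lt_B = x[["geneA"]] < x[["geneB"]],
    A_lt_C = x[["geneA"]] < x[["geneC"]],
    B_lt_C = x[["geneB"]] < x[["geneC"]])
}

# Monotonic transform: the order of genes is unchanged.
log_counts <- log2(counts + 1)

# TPM-like values: length-normalize, then scale to one million.
rate <- counts / gene_length_kb
tpm_like <- rate / sum(rate) * 1e6

identical(pair_bits(counts), pair_bits(log_counts))
identical(pair_bits(counts), pair_bits(tpm_like))
pair_bits(counts)
pair_bits(tpm_like)
```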

input data

Many thanks for providing this helpful package. Can you tell me what the input values for the matrix should be: raw counts, TPM, etc.? I see from the previous version that you say it is sensitive to this. I would be grateful for advice.
