SurvBenchmark: comprehensive benchmarking study of survival analysis methods using both omics data and clinical data
This repository contains the code and data for SurvBenchmark (updated May 2022). The associated paper is:
Zhang, Yunwei & Wong, Germaine & Mann, Graham & Muller, Samuel & Yang, Jean. (2021). SurvBenchmark: comprehensive benchmarking study of survival analysis methods using both omics data and clinical data. 10.1101/2021.07.11.451967.
Please cite this paper if you would like to use the curated data.
We develop a benchmarking design, SurvBenchmark, that evaluates a diverse collection of survival models on both clinical and omics datasets. SurvBenchmark covers classical approaches such as the Cox model as well as state-of-the-art machine learning survival models. There are 16 datasets (https://github.com/SydneyBioX/SurvBenchmark/blob/main/tables/table1.docx)
Table 1. Summary of datasets
Dataset (name used in this paper) | Number of observations | Number of variables | Type of data | Censoring rate (rounded to 4 decimal places) | Reference |
---|---|---|---|---|---|
Melanoma_itraq | 41 | 643 | Omics | 0.4146 | Mactier, Swetlana et al. “Protein signatures correspond to survival outcomes of AJCC stage III melanoma patients.” Pigment cell & melanoma research vol. 27,6 (2014): 1106-16. doi:10.1111/pcmr.12290 |
Melanoma_nano | 45 | 207 | Omics | 0.4222 | Wang, K.Y.X. et al. Cross-Platform Omics Prediction procedure: a game changer for implementing precision medicine in patients with stage-III melanoma. bioRxiv 2020.12.09.415927; doi: https://doi.org/10.1101/2020.12.09.415927 |
Ovarian_2 | 58 | 19818 | Omics | 0.3793 | Ganzfried, B.F. et al. (2013) curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome. Database, 2013. |
GE_5 | 78 | 4753 | Omics | 0.5641 | van 't Veer, L.J. et al. (2002) Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536. |
GE_3 | 86 | 6288 | Omics | 0.7209 | Bullinger, L. et al. (2004) Use of Gene-Expression Profiling to Identify Prognostic Subclasses in Adult Acute Myeloid Leukemia. New England Journal of Medicine, 350, 1605–1616. |
Melanoma_clinical | 77 | 16 | Clinical | 0.3939 | Wang, K.Y.X. et al. Cross-Platform Omics Prediction procedure: a game changer for implementing precision medicine in patients with stage-III melanoma. bioRxiv 2020.12.09.415927; doi: https://doi.org/10.1101/2020.12.09.415927 |
GE_1 | 115 | 551 | Omics | 0.6670 | Sorlie, T. et al. (2003) Repeated observation of breast tumor subtypes in independent gene expression data sets. Proc. Natl. Acad. Sci. U.S.A., 100, 8418–8423. |
GE_4 | 116 | 6285 | Omics | 0.5641 | van de Vijver, M.J. et al. (2002) A gene-expression signature as a predictor of survival in breast cancer. N. Engl. J. Med., 347, 1999–2009. |
Veteran | 137 | 8 | Clinical | 0.0657 | Kalbfleisch, J.D. and Prentice, R.L. (2002) The Statistical Analysis of Failure Time Data. Wiley Series in Probability and Statistics. |
Ovarian_1 | 194 | 16050 | Omics | 0.7062 | Ganzfried, B.F. et al. (2013) curatedOvarianData: clinically annotated data for the ovarian cancer transcriptome. Database, 2013. |
Lung | 228 | 9 | Clinical | 0.2763 | Loprinzi, C.L. et al. (1994) Prospective evaluation of prognostic variables from patient-completed questionnaires. North Central Cancer Treatment Group. J. Clin. Oncol., 12, 601–607. |
GE_6 | 240 | 7401 | Omics | 0.4250 | Van Houwelingen, H.C. (2004) The Elements of Statistical Learning, Data Mining, Inference, and Prediction. Trevor Hastie, Robert Tibshirani and Jerome Friedman, Springer, New York, 2001. No. of pages: xvi 533. ISBN 0-387-95284-5. Statistics in Medicine, 23, 528–529. |
GE_2 | 295 | 4921 | Omics | 0.7322 | Beer, D.G. et al. (2002) Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nat. Med., 8, 816–824. |
PBC | 312 | 7 | Clinical | 0.5994 | Fleming, T.R. and Harrington, D.P. (2005) Counting Processes and Survival Analysis. Wiley Series in Probability and Statistics. |
UNOS_Kidney | 3000 | 101 | Clinical | 0.7350 | OPTN data (https://optn.transplant.hrsa.gov/) |
ANZ | 3323 | 40 | Clinical | 0.8739 | ANZDATA (https://www.anzdata.org.au/) |
and 20 survival methods (https://github.com/SydneyBioX/SurvBenchmark/blob/main/tables/table2.docx)
Table 2. Summary of methods used in this study
Method name | Method name in this paper | R function name | R package name | Parameters (default) |
---|---|---|---|---|
Cox | Cox | coxph | survival | NA |
Cox with backward elimination using AIC | Cox_bw_AIC | cph, fastbw | rms | rule="aic", sls=.05, k.aic=2 |
Cox with backward elimination using p-value | Cox_bw_p | cph, fastbw | rms | rule="p", sls=.05 |
Cox with backward elimination using BIC | Cox_bw_BIC | cph, fastbw | rms | rule="aic", sls=.05, k.aic=log(as.numeric(table(train$status)[2])) |
Lasso Cox (for clinical datasets) | Lasso_Cox | penalized | penalized | lambda1=1, lambda2=0 |
Ridge Cox (for clinical datasets) | Ridge_Cox | penalized | penalized | lambda1=0, lambda2=1 |
Elastic net Cox (for clinical datasets) | EN_Cox | penalized | penalized | lambda1=1, lambda2=1 |
Lasso Cox (for omics datasets) | Lasso_Cox | glmnet | glmnet | alpha=1, nfolds=5, type.measure="C" |
Ridge Cox (for omics datasets) | Ridge_Cox | glmnet | glmnet | alpha=0, nfolds=5, type.measure="C" |
Elastic net Cox (for omics datasets) | EN_Cox | glmnet | glmnet | alpha=0.5, nfolds=5, type.measure="C" |
Random survival forest | RSF | rfsrc | randomForestSRC | Default: ntree=1000, mtry=10 |
Multitask logistic regression method | MTLR | mtlr | MTLR | C1=1 |
DNNSurv (deep learning survival model) | DNNSurv | multiple functions, as in the GitHub code | DNNSurv | Default: no parameter arguments to be changed by users |
Boosting Cox model | CoxBoost | CoxBoost | CoxBoost | stepno=10, penalty=100 |
Cox model with genetic algorithm as feature selection method | Cox (GA) | GenAlg | GenAlgo | n.features=10 (for omics), n.features=4 (for clinical), generation_num=20 |
Multitask logistic regression model with genetic algorithm as feature selection method | MTLR (GA) | GenAlg | GenAlgo | n.features=10 (for omics), n.features=4 (for clinical), generation_num=20 |
Boosting Cox model with genetic algorithm as feature selection method | CoxBoost (GA) | GenAlg | GenAlgo | n.features=10 (for omics), n.features=4 (for clinical), generation_num=20 |
Multitask logistic regression model with ranking-based method as feature selection method | MTLR (DE) | lmFit, eBayes | limma | n.features=10 (for omics), n.features=4 (for clinical) |
Boosting Cox model with ranking-based method as feature selection method | CoxBoost (DE) | lmFit, eBayes | limma | n.features=10 (for omics), n.features=4 (for clinical) |
Survival support vector machine | SurvivalSVM | survivalsvm | survivalsvm | Default: sgf.sv = 5, sigf = 7, maxiter = 20, margin = 0.05, bound = 10, eig.tol = 1e-06, conv.tol = 1e-07, posd.tol = 1e-08 |
DeepSurv (deep learning survival model) | DeepSurv | deepsurv | survivalmodels | Default: frac=0.3, activation="relu", num_nodes=c(4L,8L,4L,2L), dropout=0.1, early_stopping=TRUE, epochs=100L, batch_size=32L |
DeepHit (deep learning survival model) | DeepHit | deephit | survivalmodels | Default: frac=0.3, activation="relu", num_nodes=c(4L,8L,4L,2L), dropout=0.1, early_stopping=TRUE, epochs=100L, batch_size=32L |
benchmarked in this study.
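As a quick sanity check on Table 1, the censoring rates of the two datasets that also ship with the `survival` package can be recomputed directly. The sketch below assumes that `survival::veteran` and `survival::lung` correspond to the Veteran and Lung rows of Table 1; if that holds, the rounded rates should match the table entries. The baseline Cox fit and its in-sample C-index are illustrative only and are not the cross-validated evaluation used in the paper; the object names are hypothetical.

```r
library(survival)

# Veteran data: status is coded 1 = death, 0 = censored
vet <- survival::veteran
vet_censoring <- mean(vet$status == 0)

# Lung data: status is coded 2 = death, 1 = censored
lng <- survival::lung
lung_censoring <- mean(lng$status == 1)

# Censoring rates rounded to 4 decimal places, as in Table 1
print(round(c(Veteran = vet_censoring, Lung = lung_censoring), 4))

# Baseline Cox model (the "Cox" row of Table 2) on the Veteran data
cox_fit <- coxph(
  Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior,
  data = vet
)

# In-sample Harrell's C-index (optimistic; the paper uses cross-validation)
cox_cindex <- unname(concordance(cox_fit)$concordance)
print(cox_cindex)
```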
##############################################################################
In this repo, all high-resolution figures from the paper can be found in the folder "figures".
The folder "functions" contains the functions needed to run all methods.
The folder "datasets" contains all datasets benchmarked in our paper.
The folder "figures_data" contains the data used to generate the figures in our paper.
The github_example.R file shows how to obtain results with the methods in "functions" on the Ovarian dataset.
The datasets we used are listed in Table 1 of our paper, available as table1 in the folder "tables".
The survival methods we benchmarked are listed in Table 2 of our paper, available as table2 in the folder "tables".
The R package is available at https://github.com/SydneyBioX/SurvBenchmark_package; it is under ongoing development and will be updated continuously.
###############################################################################
library(devtools)
devtools::install_github("SydneyBioX/SurvBenchmark_package")
library(SurvBenchmark)
You may need to install the following dependencies first:
library(dplyr)
library(survival)
library(glmnet)
library(rms)
library(tidyverse)
library(caret)
library(pec)
library(coefplot)
library("survAUC")
library(gridExtra)
library(ggplot2)
library(survminer)
library(randomForestSRC)
library(ggRandomForests)
library(penalized)
library(DMwR)
library(randomForest)
library(riskRegression)
library(pROC)
library(ROCR)
library(cvTools)
library(parallel)
library(pbmcapply)
library(MTLR)
library(profmem)
library(keras)
library(pseudo)
library(survivalROC)
library(survcomp)
library(CoxBoost)
library(limma)
library(partykit)
library(coin)
library(compound.Cox)
library(GenAlgo)
library(survivalsvm)
library(rmatio)
library(survivalmodels)
library(reticulate)
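Before loading everything, it can help to see which dependencies are not yet installed. The following is a minimal sketch using base R only; the vector `deps` is a hypothetical subset of the list above and should be extended as needed. It only reports missing packages and does not install anything, since some of them (for example `limma` and `survcomp`) are distributed through Bioconductor rather than CRAN.

```r
# Hypothetical subset of the dependency list above; extend as needed
deps <- c(
  "survival", "glmnet", "rms", "penalized", "randomForestSRC",
  "MTLR", "CoxBoost", "limma", "GenAlgo", "survivalsvm", "survivalmodels"
)

# requireNamespace() returns TRUE when a package can be loaded, FALSE otherwise
installed <- vapply(deps, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- deps[!installed]

if (length(missing_pkgs) > 0) {
  message("Packages still to install: ", paste(missing_pkgs, collapse = ", "))
} else {
  message("All listed dependencies are available.")
}
```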
The comparison of survival models across datasets can be visualized as a heatmap.
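A heatmap of this kind can be sketched with `ggplot2` from a long table of method, dataset and score. The example below is an illustration under several assumptions, not the plotting code used for the paper: it uses the built-in `veteran` and `lung` data as stand-ins for the benchmark datasets, scores models by in-sample C-index instead of cross-validation, and uses `stats::step` for backward elimination by AIC in place of `rms::fastbw`. All object names (`results`, `heatmap_plot`, `cindex`) are hypothetical.

```r
library(survival)
library(ggplot2)

# Stand-in datasets; complete cases only so that step() sees a fixed sample
vet <- survival::veteran
lng <- na.omit(survival::lung)

# In-sample Harrell's C-index of a fitted Cox model
cindex <- function(fit) unname(concordance(fit)$concordance)

# Full Cox models
full_vet <- coxph(
  Surv(time, status) ~ trt + celltype + karno + diagtime + age + prior,
  data = vet
)
full_lng <- coxph(
  Surv(time, status) ~ age + sex + ph.ecog + ph.karno + pat.karno +
    meal.cal + wt.loss,
  data = lng
)

# Backward elimination by AIC (stats::step swapped in for rms::fastbw)
bw_vet <- step(full_vet, direction = "backward", trace = 0)
bw_lng <- step(full_lng, direction = "backward", trace = 0)

# Long-format results: one row per (method, dataset) pair
results <- data.frame(
  method  = rep(c("Cox", "Cox_bw_AIC"), times = 2),
  dataset = rep(c("Veteran", "Lung"), each = 2),
  cindex  = c(cindex(full_vet), cindex(bw_vet),
              cindex(full_lng), cindex(bw_lng))
)

# One tile per (method, dataset) pair, coloured by C-index
heatmap_plot <- ggplot(results, aes(x = dataset, y = method, fill = cindex)) +
  geom_tile(colour = "white") +
  geom_text(aes(label = round(cindex, 3))) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  labs(x = "Dataset", y = "Method", fill = "C-index")

print(heatmap_plot)
```

The same `results` layout extends to more methods and datasets by appending rows, so the plotting code does not need to change as the comparison grows.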
Copyright 2022 Yunwei Zhang
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.