Detecting biomarkers and predictive diagnostics from CBC and RNA expression data with Machine Learning
The project is an continuing development of automated statistical analyses for CBC (Complete Blood Count) test results and RNA expression profiles. The research leg of this project develops and applies machine learning methods to generate predictive diagnostics, and the educational leg of this project provides a framework for undergraduate students learning beginner-level R programming. The project was developed and is run by Dr. Jeffrey Robinson at University of Maryland, Baltimore County (UMBC), and used for the courses BTEC330 (Software Applications in Translational Science) and BTEC395 (Bioinformatics). Work is supported by NSF Extreme Science and Engineering Discovery Environment (XSEDE) through an educational allocation awarded to Robinson: “Bioinformatics Training for Applications in Translational and Molecular Biosciences”, under NSF grant number ACI-1548562. BTEC495 research interns include Daniel Gidron (Summer 2021).
Current functionality: (AnalyzeBloodwork.R; IBSclassification.R)
- Loads required packages and sample data,
- Generates histograms for all variables,
- Generates linear regression models and diagnostic plots for single and multiple regressions for BMI, Complete Blood Count (CBC), and inflammation markers,
- Imputes data for columns with missing (NA) values,
- Balances unequally-sized sample groups,
- Generates box-and-whisker and scatterplot-matrix plots for WBC distributions,
- Tests the performance of machine learning classification algorithms for IBS diagnosis using CBC-WBC count data.
Data from the sample dataset was collected during an NIH natural history clinical study of the relationship between obesity, inflammation, stress, and gastrointestinal disorders and is fully open-sourced (citations and links below).
The standard CBC parameters provide point-of-care physicians with powerful diagnostic capabilities using:
- White Blood Cell counts (absolute and relative counts for Monocytes, Lymphocytes, Neutrophils, Basophils, and Eosinophils),
- Red blood cell and hemoglobin parameters (RBC count, Hematocrit (HCT), Mean Corpuscular Hemoglobin (MCH), Erythrocyte Sedimentation Rate (ESR)),
- Platelet parameters (Platelet Counts, Mean Platelet Volume (MPV))
Additional sample data in the context of obesity, inflammation, and gastrointestinal disorders includes:
- Body Mass Index (BMI),
- Stress hormones: Cortisol and ACTH,
- Inflammation markers: C-Reactive Protein (CRP), sCD14, Lipopolysaccharide Binding Protein (LBP),
- Clinical diagnoses of subtypes of Irritable Bowel Syndrome (IBS)
- Nanostring White Blood Cell RNA expression data: an associated 250-gene panel of Nanostring RNA expression data (links in citations below).
This dataset and analyses are used by Robinson in teaching for UMBC's BTEC330 (Software Applications) and BTEC350 (Biostatistics) classes, students are trained in R programming language, GitHub practices, and analysis of clinical and molecular expression data by forking this repository and modifying it according to their skill level and assignment specifics.
The long-term research goal of this project is the automation of Machine Learning and regression models for biomarker discovery from CBC and gene-expression data. The repository is under continuous development, versions will be incremented with each new functional feature.
a. Linear model with Body Mass Index (BMI), Serum Cortisol and C-Reactive Protein (CRP) (stress & inflammation biomarkers).
b. Linear model with BMI and White Blood Cells (absolute counts WBCs, Neutrophils, Monocytes, Lymphocytes, Eosinophils, Basophils)
Regressions follow the canonical form:
In R language this is expressed as: lm(y ~ x1 + x2 + xn, data=dataframe)
> BMI.Cortisol <- lm(BMI ~ SerumCortisol, data=IBS1)
> summary(BMI.Cortisol)
Call:
lm(formula = BMI ~ SerumCortisol, data = IBS1)
Residuals:
Min 1Q Median 3Q Max
-10.043 -4.043 -1.440 3.008 19.658
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 31.9454 1.4736 21.679 < 2e-16 ***
SerumCortisol -0.5004 0.1312 -3.814 0.000229 ***
---
ggplot(IBS1, aes(x=BMI, y=SerumCortisol)) +
geom_point() +
geom_smooth(method=lm)
> fit1 <- lm(BMI ~ SerumCortisol + CRP, data=IBS1)
> summary(fit1)
Call:
lm(formula = BMI ~ SerumCortisol + CRP, data = IBS1)
Residuals:
Min 1Q Median 3Q Max
-9.1378 -3.4448 -0.9904 2.3330 20.6056
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 30.7936 1.4134 21.787 < 2e-16 ***
SerumCortisol -0.5231 0.1233 -4.244 4.72e-05 ***
CRP 0.6042 0.1534 3.938 0.000147 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.354 on 106 degrees of freedom
(2 observations deleted due to missingness)
Multiple R-squared: 0.232, Adjusted R-squared: 0.2175
F-statistic: 16.01 on 2 and 106 DF, p-value: 8.388e-07
> s3d <- scatterplot3d(IBS1$BMI, IBS1$SerumCortisol, IBS$CRP, pch=16, color="steelblue", box=TRUE, highlight.3d=FALSE, type="h", main="BMI, Cortisol, CRP")
> fit <- lm(SerumCortisol ~ BMI + CRP, data=IBS1)
> s3d$plane3d(fit)
> fit <- lm(BMI ~ Monocytes + Lymphocytes + Neutrophils + Basophils + Eosinophils, data=CBC)
> summary(fit) # show results
Call:
lm(formula = BMI ~ Monocytes + Lymphocytes + Neutrophils + Basophils +
Eosinophils, data = CBC)
Residuals:
Min 1Q Median 3Q Max
-8.780 -4.037 -1.732 4.062 18.030
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 22.4000 2.7445 8.162 1.02e-12 ***
Monocytes -3.0891 4.1140 -0.751 0.4545
Lymphocytes 2.7465 1.2745 2.155 0.0336 *
Neutrophils 0.1459 0.4141 0.352 0.7254
Basophils -21.1991 41.1140 -0.516 0.6073
Eosinophils 1.8739 5.4544 0.344 0.7319
> layout(matrix(c(1,2,3,4),2,2))
> plot(fit)
> ## Generate list of columns with NA values
> list_na <- colnames(IBSblood)[ apply(IBSblood, 2, anyNA) ]
> list_na
[1] "Monocytes"
> # determine the mean of columns with missing NAs
> average_missing <- apply(as.matrix(IBSblood[,colnames(IBSblood) %in% list_na]),
+ 2,
+ mean,
+ na.rm=TRUE)
> average_missing
[1] 0.4462963
> # Create a new imputed dataframe by replacing NAs with column means
> IBSblood.replacement <- IBSblood %>%
+ mutate(Monocytes = ifelse(is.na(Monocytes), average_missing[1], Monocytes))
> sum(is.na(IBSblood.replacement$Monocytes))
[1] 0
Pre-balancing, groups "Normal" "IBSC" and "IBSD" are unequal numbers:
> IBSblood %>%
+ count(IBS.subtype) %>%
+ kable(align = 'c')
| IBS.subtype | n |
|:-----------:|:--:|
| IBSC | 19 |
| IBSD | 14 |
| NORM | 77 |
Post-balancing with up-sampling to 100 rows for each group:
> df_balanced %>%
+ count(IBS.subtype) %>%
+ kable(align = 'c')
| IBS.subtype | n |
|:-----------:|:---:|
| IBSC | 100 |
| IBSD | 100 |
| NORM | 100 |
> # Run algorithms using 10-fold cross validation
> control <- trainControl(method="cv", number=10)
> metric <- "Accuracy"
> set.seed(7) ## Seeds are set to 7 for all subsequent algorithms
> fit.lda <- train(IBS.subtype~., data=df_balanced, method="lda", metric=metric, trControl=control)
> fit.cart <- train(IBS.subtype~., data=df_balanced, method="rpart", metric=metric, trControl=control)
> fit.knn <- train(IBS.subtype~., data=df_balanced, method="knn", metric=metric, trControl=control)
> fit.svm <- train(IBS.subtype~., data=df_balanced, method="svmRadial", metric=metric, trControl=control)
> fit.rf <- train(IBS.subtype~., data=df_balanced, method="rf", metric=metric, trControl=control)
> # summarize accuracy of models
> results <- resamples(list(lda=fit.lda, cart=fit.cart, knn=fit.knn, svm=fit.svm, rf=fit.rf))
> # compare accuracy of models
> dotplot(results)
Robinson, J. 2021. Predictive Classification of IBS-subtype: Performance of a 250-gene RNA expression panel vs. Complete Blood Count (CBC) profiles under a Random Forest model. medRxiv. doi: https://doi.org/10.1101/2021.08.31.21262766.
Robinson, JM. et al. 2019. Complete blood count with differential: An effective diagnostic for IBS subtype in the context of BMI? BioRxiv. doi: https://doi.org/10.1101/608208.
Robinson, J. Differential Gene Expression Associated with BMI, Gender, and IBS-subtype in Human White Blood Cells: Results from a Custom 250-plex Nanostring Probe Panel. Preprints 2019, 2019120180 (doi: 10.20944/preprints201912.0180.v1).
Staser eta al. 2018. OMIP-042: 21-color flow cytometry to comprehensively immunophenotype major lymphocyte and myeloid subsets in human peripheral blood. Cytometry A. 93(2):186-189. DOI:10.1002/cyto.a.23303.
ImmunoGC custom Nanostring probe panel. 2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GPL25996.
Human buffy coat gene expression, custom 250-plex Nanostring panel. GSE124549. 2019. https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE124549.
This project was partially funded from NSF XSEDE Educational Allocation to Jeffrey Robinson: “Bioinformatics Training for Applications in Translational and Molecular Biosciences”. Extreme Science and Engineering Discovery Environment (XSEDE), supported by National Science Foundation grant number ACI-1548562.
STHDA: ggplot2 histogram: easy histogram with ggplot2 R package
STHDA: Scatterplot3d: 3D graphics - R software and data visualization
Quick R by DataCamp (StatMethods.net): Scatterplots
MachineLearningMastery.com: Machine Learning in R Step-by-Step
R-Bloggers.com: Regression analysis essentials for machine learning