
Cancer Data

Mburu January 26, 2019

Data Set

In this tutorial I predict whether a breast cancer tumor is benign or malignant, using the Wisconsin breast cancer data set available on Kaggle. The 30 predictors are divided into three groups: means (variables 3-12), standard errors (variables 13-22) and worst values (variables 23-32), and each group contains the same 10 measurements of the tumor (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension). When predicting breast tumor types there are two types of cost:

  • Telling someone with a malignant tumor that it is benign. These are the false negatives; such a person might not seek medical help, which can lead to death.
  • Telling someone that they have a malignant tumor when they do not. These are the false positives; here you subject someone to unnecessary stress.

So it is highly desirable that our model has good accuracy, a high $F_1$ score and high recall.

library(tidyverse)
library(data.table)
library(xgboost)
library(knitr)
library(broom)
library(caret)
library(e1071)
library(kableExtra)
library(ggthemes)

library(glmnet)

cancer <- setDT(read_csv("data.csv"))

cancer[, X33 := NULL]
head(cancer)  %>%
  kable() 

First six observations (shown transposed, one variable per row):

| variable | obs 1 | obs 2 | obs 3 | obs 4 | obs 5 | obs 6 |
|---|---|---|---|---|---|---|
| id | 842302 | 842517 | 84300903 | 84348301 | 84358402 | 843786 |
| diagnosis | M | M | M | M | M | M |
| radius_mean | 17.99 | 20.57 | 19.69 | 11.42 | 20.29 | 12.45 |
| texture_mean | 10.38 | 17.77 | 21.25 | 20.38 | 14.34 | 15.70 |
| perimeter_mean | 122.80 | 132.90 | 130.00 | 77.58 | 135.10 | 82.57 |
| area_mean | 1001.0 | 1326.0 | 1203.0 | 386.1 | 1297.0 | 477.1 |
| smoothness_mean | 0.11840 | 0.08474 | 0.10960 | 0.14250 | 0.10030 | 0.12780 |
| compactness_mean | 0.27760 | 0.07864 | 0.15990 | 0.28390 | 0.13280 | 0.17000 |
| concavity_mean | 0.3001 | 0.0869 | 0.1974 | 0.2414 | 0.1980 | 0.1578 |
| concave points_mean | 0.14710 | 0.07017 | 0.12790 | 0.10520 | 0.10430 | 0.08089 |
| symmetry_mean | 0.2419 | 0.1812 | 0.2069 | 0.2597 | 0.1809 | 0.2087 |
| fractal_dimension_mean | 0.07871 | 0.05667 | 0.05999 | 0.09744 | 0.05883 | 0.07613 |
| radius_se | 1.0950 | 0.5435 | 0.7456 | 0.4956 | 0.7572 | 0.3345 |
| texture_se | 0.9053 | 0.7339 | 0.7869 | 1.1560 | 0.7813 | 0.8902 |
| perimeter_se | 8.589 | 3.398 | 4.585 | 3.445 | 5.438 | 2.217 |
| area_se | 153.40 | 74.08 | 94.03 | 27.23 | 94.44 | 27.19 |
| smoothness_se | 0.006399 | 0.005225 | 0.006150 | 0.009110 | 0.011490 | 0.007510 |
| compactness_se | 0.04904 | 0.01308 | 0.04006 | 0.07458 | 0.02461 | 0.03345 |
| concavity_se | 0.05373 | 0.01860 | 0.03832 | 0.05661 | 0.05688 | 0.03672 |
| concave points_se | 0.01587 | 0.01340 | 0.02058 | 0.01867 | 0.01885 | 0.01137 |
| symmetry_se | 0.03003 | 0.01389 | 0.02250 | 0.05963 | 0.01756 | 0.02165 |
| fractal_dimension_se | 0.006193 | 0.003532 | 0.004571 | 0.009208 | 0.005115 | 0.005082 |
| radius_worst | 25.38 | 24.99 | 23.57 | 14.91 | 22.54 | 15.47 |
| texture_worst | 17.33 | 23.41 | 25.53 | 26.50 | 16.67 | 23.75 |
| perimeter_worst | 184.60 | 158.80 | 152.50 | 98.87 | 152.20 | 103.40 |
| area_worst | 2019.0 | 1956.0 | 1709.0 | 567.7 | 1575.0 | 741.6 |
| smoothness_worst | 0.1622 | 0.1238 | 0.1444 | 0.2098 | 0.1374 | 0.1791 |
| compactness_worst | 0.6656 | 0.1866 | 0.4245 | 0.8663 | 0.2050 | 0.5249 |
| concavity_worst | 0.7119 | 0.2416 | 0.4504 | 0.6869 | 0.4000 | 0.5355 |
| concave points_worst | 0.2654 | 0.1860 | 0.2430 | 0.2575 | 0.1625 | 0.1741 |
| symmetry_worst | 0.4601 | 0.2750 | 0.3613 | 0.6638 | 0.2364 | 0.3985 |
| fractal_dimension_worst | 0.11890 | 0.08902 | 0.08758 | 0.17300 | 0.07678 | 0.12440 |

Visualization and Feature selection

Percentage of women with malignant tumor

The percentage of women with a malignant tumor is 37.26% (212 out of 569), while the remaining 62.74% (357) had benign tumors.

cancer[, .(freq = .N),
       by = diagnosis] %>% 
    .[, perc := round(100 * freq/sum(freq), 2)] %>%
  
ggplot(aes(x=diagnosis, y=perc, fill = diagnosis)) + 
    geom_bar(stat = "identity", width  = 0.5)+ theme_hc() +
    geom_text(aes(x=diagnosis, y=perc, label = paste(perc, "%")),
              position =  position_dodge(width = 0.5),
              vjust = 0.05, hjust = 0.5, size = 5)+
    scale_fill_hc(name = "")+
    labs(x = "Cancer Type",
         y = "Percentage",
         title = "Percentage of women with benign or malignant breast cancer")+
    theme(legend.position = "none",
          axis.title = element_text(size =12))

Boxplots

From the boxplots we can identify variables where we expect a significant difference between the two groups of tumors. As a rule of thumb, if two boxplots do not overlap, or more than about 75% of the two boxes do not overlap, we expect a significant difference in the mean/median between the two groups. Some of the variables where the distributions of the two tumor types differ clearly are radius_mean, texture_mean and so on. The clearest separation between malignant and benign tumors is visible in the mean features and the worst-value features (the "worst" value of a measurement is the mean of its three largest cell values); for these variables the malignant tumors score higher than the benign ones.

cancerm <- melt(cancer[, -1, with = F], id.vars = "diagnosis")

ggplot(cancerm, aes(x = diagnosis, y = value))+
    geom_boxplot() + facet_wrap(~variable, scales = "free_y")
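As a rough cross-check of what the boxplots suggest, the melted data can be used to test each variable for a difference between the two diagnosis groups. This is only a sketch (it is not part of the original analysis) and reuses the cancerm table created above:

# For every variable, compare the group means and run a two-sample t-test
# between malignant (M) and benign (B) tumors.
group_tests <- cancerm[, .(
    mean_M  = mean(value[diagnosis == "M"]),
    mean_B  = mean(value[diagnosis == "B"]),
    p_value = t.test(value[diagnosis == "M"], value[diagnosis == "B"])$p.value
), by = variable]

# Variables with the strongest evidence of a difference between tumor types
group_tests[order(p_value)][1:10]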

Feature Scaling

We find that some variables are highly correlated, so we can use principal component analysis for dimensionality reduction. Since the variables are correlated, a smaller set of derived features should be enough to build our models. Before that, the predictors are centred and scaled.

cancer[, id := NULL]
predictors <- names(cancer)[3:31]
cancer[, (predictors) := lapply(.SD, function(x) scale(x)), .SDcols = predictors ]
cancer[, diagnosis := as.factor(diagnosis)]
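A quick sanity check (a sketch, assuming the chunk above has been run) confirms that the scaled predictors now have mean ≈ 0 and standard deviation ≈ 1:

# Means and standard deviations of the first few scaled predictors
# (they should be approximately 0 and 1 after scaling)
cancer[, lapply(.SD, function(x) c(mean = round(mean(x), 3), sd = round(sd(x), 3))),
       .SDcols = predictors[1:3]]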

Correlation matrix

cor(cancer[, -(1:2), with = F]) %>% kable(format = "html") 

(The output is a 29 × 29 correlation matrix of the predictors, texture_mean through fractal_dimension_worst; only a summary is shown here. The size-related features are very strongly correlated, e.g. perimeter_mean and area_mean (0.99), radius_worst and perimeter_worst (0.99), perimeter_mean and radius_worst (0.97), radius_se and perimeter_se (0.97), and texture_mean and texture_worst (0.91); the compactness, concavity and concave-points features are also strongly inter-correlated, mostly around 0.8-0.9.)

Principal Component Analysis

Using the elbow rule we could keep just the first 5 principal components; using 15 principal components captures almost 100% of the variance in the original data set.

pca <- prcomp(cancer[, predictors, with = F], scale. = F)

Variance Explained

Since PCA forms new features, the variance-explained plot shows how much of the variation in the original features is captured by each principal component. The new features are simply linear combinations of the old ones.

stdpca <- pca$sdev

varpca <- stdpca^2

prop_var <- varpca/sum(varpca)
prop_var * 100
##  [1] 43.706363461 18.472236711  9.716239461  6.816735644  5.676223001
##  [6]  4.161722924  2.292352407  1.643433612  1.363238017  1.191515264
## [11]  1.011031531  0.897368380  0.832104816  0.539193136  0.323823486
## [16]  0.269517389  0.198317442  0.178851469  0.153573276  0.107095306
## [21]  0.102579308  0.093821366  0.082603220  0.058724633  0.053331079
## [26]  0.027514319  0.022985324  0.005110230  0.002393786
sum(prop_var[1:15])
## [1] 0.9864358
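Since each principal component is just a linear combination of the original (scaled) predictors, the weights of those combinations can be read off the rotation matrix. A small illustrative sketch (not part of the original analysis):

# The rotation (loadings) matrix: column PC1 holds the weights of the linear
# combination of scaled predictors that forms the first principal component.
round(pca$rotation[1:6, 1:3], 3)

# The PC1 scores can be reproduced by hand as X %*% w ...
head(as.matrix(cancer[, predictors, with = FALSE]) %*% pca$rotation[, 1])
# ... and they match the scores computed by prcomp()
head(pca$x[, 1])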

Scree plot

The scree plot shows the variance explained by each principal component, which decreases as the number of components increases.

plot(prop_var, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     type = "b", xlim = c(0, 30))

Cumulative Variance Explained

The cumulative variance plot helps to choose the number of components based on how much of the variation in the original data you want to capture. In this case I wanted to use the number of principal components that captures almost 100% of the variation; after trying different numbers of components I found that model accuracy did not improve beyond the 15th principal component.

cum_var <- cumsum(prop_var)
plot(cum_var, xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b", xlim = c(0, 30))

Construct new data set

We use the first 15 principal components as our new predictors, then randomly split the data into training and test sets in a 7:3 ratio.

set.seed(100)
train_sample <- sample(1:nrow(cancer), round(0.7*nrow(cancer)))

pcadat <- data.table( label = cancer$diagnosis, pca$x[,1:15]) 
pcadat[, label := factor(label, levels = c("M", "B"))]
train <- pcadat[train_sample,]
test <- pcadat[-train_sample,]

Model Fitting

Some of the important terms to understand (a small helper that computes them follows this list):

  • Accuracy: the fraction of all patients that are classified correctly.

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{N}$$

  • Precision: of all the patients we predicted to have a malignant tumor, how many actually do.

$$\text{Precision} = \frac{\text{True Positives}}{\text{Predicted Positives}}$$

  • Recall (sensitivity): of all the patients with a malignant tumor, how many we detected. This is the true positive rate.

$$\text{Recall} = \frac{\text{True Positives}}{\text{Actual Positives}}$$

  • Specificity: the true negative rate. Of all the patients without a malignant tumor, how many we correctly identified.

$$\text{Specificity} = \frac{\text{True Negatives}}{\text{Actual Negatives}}$$
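To make these definitions concrete, here is a small helper (a sketch, not part of the original code) that computes the metrics from the cells of a 2 × 2 confusion matrix, with the malignant class treated as positive:

# tp, fp, tn, fn: cells of a confusion matrix with "M" as the positive class
classification_metrics <- function(tp, fp, tn, fn) {
    precision   <- tp / (tp + fp)
    recall      <- tp / (tp + fn)              # sensitivity / true positive rate
    specificity <- tn / (tn + fp)              # true negative rate
    accuracy    <- (tp + tn) / (tp + fp + tn + fn)
    f1          <- 2 * precision * recall / (precision + recall)
    c(accuracy = accuracy, precision = precision, recall = recall,
      specificity = specificity, F1 = f1)
}

# Example with made-up counts
classification_metrics(tp = 50, fp = 5, tn = 100, fn = 10)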

Logistic regression

This is a generalized linear model for binary outcomes. (A generalization of this model, multinomial regression, handles multi-class data.) The equation for the logistic regression model is:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$$

Using maximum likelihood estimation the cost function can be derived as

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[\, y^i \log(h_\theta(x^i)) + (1-y^i) \log(1 - h_\theta(x^i)) \,\right]$$

where $y^i \in \{0, 1\}$. The $\beta$'s are found by minimizing this cost function.
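A direct translation of this cost function into R (a minimal sketch for illustration only; the models below are fitted with glm() and glmnet, not with this function):

# Average cross-entropy cost for logistic regression.
# X: design matrix including an intercept column, y: 0/1 labels, beta: coefficients
logistic_cost <- function(beta, X, y) {
    p <- 1 / (1 + exp(-X %*% beta))            # h_theta(x), predicted probabilities
    -mean(y * log(p) + (1 - y) * log(1 - p))   # negative average log-likelihood
}

# The betas are found by minimising this cost, e.g. with optim():
# optim(rep(0, ncol(X)), logistic_cost, X = X, y = y, method = "BFGS")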

fit_glm <- glm(label ~., data = train, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Regularization in logistic regression

The warning "glm.fit: fitted probabilities numerically 0 or 1 occurred" indicates perfect separation/overfitting. In this case you can load the glmnet library and fit a regularized logistic regression. This is achieved by adding a regularization term to the cost function: L1 regularization (lasso) adds a penalty equal to the sum of the absolute values of the coefficients.

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[\, y^i \log(h_\theta(x^i)) + (1-y^i) \log(1 - h_\theta(x^i)) \,\right] + \frac{\lambda}{2m}\sum_{j=1}^{n} |\theta_j|$$

trainx <- train[,-1]

y_train <- factor(train$label, levels = c("B", "M"), labels = 0:1)
#y <- as.numeric(as.character(y))

y_test <- factor(test$label, levels = c("B", "M"), labels = 0:1) %>% as.character() %>% as.numeric()
#ytest <- as.numeric(as.character(ytest))

testx <- data.matrix(test[, -1]) 

To find the optimal value of $\lambda$ we use cross-validation: we choose the $\lambda$ that gives the best cross-validation performance.
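This can also be done directly with glmnet's built-in cross-validation, cv.glmnet(). The sketch below uses the objects prepared above and is for reference only, since the tutorial tunes both alpha and lambda through caret:

# Cross-validated lasso (alpha = 1) logistic regression on the principal components
cv_glm <- cv.glmnet(data.matrix(trainx), y_train,
                    family = "binomial", alpha = 1, nfolds = 10)

plot(cv_glm)                  # cross-validation curve along the lambda path
cv_glm$lambda.min             # lambda with the lowest cross-validation error

# Predicted probability of the malignant class ("1") on the test set
prob_lasso <- predict(cv_glm, newx = testx, s = "lambda.min", type = "response")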

cv_fold <- createFolds(train$label, k = 10)

myControl <- trainControl(
  method = "cv", 
  number = 10,
  summaryFunction = twoClassSummary,
  savePredictions = "all",
  classProbs = TRUE,
  verboseIter = FALSE,
  index = cv_fold,
  allowParallel = TRUE
  
)

tuneGrid <-  expand.grid(
    alpha = 0:1,
    lambda = seq(0.001, 1, length.out = 10))
    
glmnet_model <- train(
  label ~.,
  data = train,
  method = "glmnet",
  metric = "ROC",
  trControl = myControl,
  tuneGrid = tuneGrid
)


plot(glmnet_model) 

#lamda_min <- cv_glm$lambda.min
resample_glmnet <- thresholder(glmnet_model, 
                              threshold = seq(.2, 1, by = 0.05), 
                              final = TRUE, 
                              statistics = "all")

ggplot(resample_glmnet , aes(x = prob_threshold, y = F1)) + 
  geom_point() + 
  geom_point(aes(y = Sensitivity), col = "blue")

library(caTools)

pred_glm <- predict(glmnet_model, test, type = "prob")

colAUC(pred_glm , test$label, plotROC = TRUE)

##                 M         B
## M vs. B 0.9683183 0.9683183
pred_glm1 <- ifelse(pred_glm[, "M"] > 0.4, "M", "B")
#pred_glm1 <- predict(glmnet_model, test, type = "raw")
pred_glm1 <- factor(pred_glm1, levels = levels(test$label))
confusionMatrix(pred_glm1, test$label,positive = "M") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   M   B
##          M  53   7
##          B   7 104
##                                           
##                Accuracy : 0.9181          
##                  95% CI : (0.8664, 0.9545)
##     No Information Rate : 0.6491          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8203          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8833          
##             Specificity : 0.9369          
##          Pos Pred Value : 0.8833          
##          Neg Pred Value : 0.9369          
##              Prevalence : 0.3509          
##          Detection Rate : 0.3099          
##    Detection Prevalence : 0.3509          
##       Balanced Accuracy : 0.9101          
##                                           
##        'Positive' Class : M               
## 

SVM

Support vector machines (SVMs) are a type of supervised learning algorithm used for both classification and regression, although most of the time they are used for classification.

To understand how an SVM works, consider an example of linearly separable data. It is clear that we can separate the two classes with a straight line (decision boundary), which is normally referred to as a separating hyperplane.

The question is: since there exist many lines that can separate the red and the black classes, which one is best? This introduces maximal margin classification. In short, the SVM finds the hyperplane/line that gives the biggest margin (gap) between the two classes; it chooses the solid line as the hyperplane, while the margins are the dotted lines. The points that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors. The SVM uses only these points to construct the decision boundary; the other points are not used. Since this is a two-dimensional space, the separating line is $\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$. When the left-hand side evaluates to more than zero we predict one class, $\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0 \Rightarrow y = 1$, and when it evaluates to less than zero we predict the other, $\beta_0 + \beta_1 X_1 + \beta_2 X_2 < 0 \Rightarrow y = -1$. Finding the widest margin $M$ becomes a maximisation problem, subject to $\sum_{j=1}^{n}\beta_j^2 = 1$ and

$$y_i(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}) \ge M.$$

This is the best-case scenario, but in most cases the classes are noisy: no matter which line you choose, some points are bound to fall on the wrong side of the decision boundary, so maximal margin classification would not work.

The SVM therefore introduces what is called a soft margin. Naively, you can think of this as a margin that allows some points to be on the wrong side; by introducing error terms we allow for some slack. In the two-predictor case the constraint becomes

$$y_i(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}) \ge M(1 - \epsilon_i), \qquad \epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C.$$

Here $C$ is a tuning parameter that controls how much slack is tolerated (and hence the width of the margin), while the $\epsilon_i$'s are slack variables that allow individual observations to fall on the wrong side of the margin. In some cases the decision boundary may be non-linear. With logistic regression you would be forced to introduce polynomial terms, which might result in a very large feature space; the SVM instead introduces what are called kernels.
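Before tuning with caret, a bare-bones radial-kernel SVM can be fitted directly with the e1071 package loaded earlier (a sketch for illustration; the tuned caret model below is the one actually evaluated):

# Radial-basis SVM on the principal-component features, with probability estimates
svm_basic <- svm(label ~ ., data = train, kernel = "radial",
                 cost = 1, probability = TRUE)

pred_basic <- predict(svm_basic, newdata = test)
mean(pred_basic == test$label)    # plain test-set accuracy of the untuned model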

Tuning SVM

svm_tune <-  expand.grid(
    C =c(1 ,5 ,  10, 100, 150),
    sigma = seq(0, .01, length.out = 5))
    
svm_model <- train(
  label ~.,
  data = train,
   metric="ROC",
  method = "svmRadial",
  trControl = myControl,
  tuneGrid = svm_tune,
  verbose = FALSE
)
resample_svm <- thresholder(svm_model, 
                              threshold = seq(.0, 1, by = 0.05), 
                              final = TRUE, 
                              statistics = "all")

ggplot(resample_svm , aes(x = prob_threshold, y = F1, col = "F1")) + 
  geom_point() + 
  geom_point(aes(y = Sensitivity,  col = "Sensitivity"))+
  scale_x_continuous(breaks = seq(0, 1, by = 0.1))

#mean(pred_svm == ytest)
pred_svm <-predict(svm_model, newdata = test, type = "prob")

pred_svm <- ifelse(pred_svm[, "M"] > 0.40, "M", "B")

pred_svm <- factor(pred_svm, levels = levels(test$label))

confusionMatrix(pred_svm, test$label, positive = "M") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   M   B
##          M  58   2
##          B   2 109
##                                           
##                Accuracy : 0.9766          
##                  95% CI : (0.9412, 0.9936)
##     No Information Rate : 0.6491          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9486          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9667          
##             Specificity : 0.9820          
##          Pos Pred Value : 0.9667          
##          Neg Pred Value : 0.9820          
##              Prevalence : 0.3509          
##          Detection Rate : 0.3392          
##    Detection Prevalence : 0.3509          
##       Balanced Accuracy : 0.9743          
##                                           
##        'Positive' Class : M               
## 

XGBoost

XGBoost is an ensemble learner. Ensemble learning combines multiple models to produce a single prediction; a good example is the random forest, where many decision trees are used together. There are two main types of ensemble learners: bagging and boosting. Random forests use the bagging approach: trees are built from random subsets (rows and columns) of the training set, and the final prediction is a weighted combination of all the individual trees. Boosting is similar, but the samples are selected sequentially. A first sample is drawn and a decision tree is fitted; the model then emphasises the examples that were hard to learn and, together with a few newly drawn examples, uses them to fit a second model. Predictions are made with the models fitted so far, the errors are examined, the hard examples are emphasised again, and another model is trained. This process continues for a specified number of rounds.

In gradient boosting the first model is fitted to the original training set. Say, for ease of explanation, you are fitting a simple regression model; the first model is $y = f(x) + \epsilon$. If the error is too large, you might try adding more features, using another algorithm, tuning the algorithm, looking for more training data, and so on. But what if the error is not white noise and has some relationship with the output $y$? Then we can fit a second model to the residuals, $\epsilon = f_1(x) + \epsilon_1$, and the process can continue, say, $n$ times, with the last stage being

$$\epsilon_{n-1} = f_n(x) + \epsilon_n.$$

The final step is to add these models together with some weights (the $\alpha$'s), which gives the final function used for prediction:

$$y = \alpha f(x) + \alpha_1 f_1(x) + \alpha_2 f_2(x) + \dots + \alpha_n f_n(x) + \epsilon$$
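The residual-fitting idea can be illustrated with a tiny toy example (a sketch on simulated data, unrelated to the cancer data set): each stage fits a small regression tree to the previous stage's residuals, and the staged predictions are added up with a shrinkage weight playing the role of the $\alpha$'s above.

library(rpart)  # small regression trees used as the base learners

set.seed(1)
toy   <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

eta  <- 0.3                          # shrinkage weight
pred <- rep(mean(toy$y), nrow(toy))  # stage 0: a constant model

for (m in 1:50) {
    toy$resid <- toy$y - pred                          # current residuals (the epsilons)
    tree_m    <- rpart(resid ~ x, data = toy,          # fit a small tree to the residuals
                       control = rpart.control(maxdepth = 2))
    pred      <- pred + eta * predict(tree_m, toy)     # add the weighted correction
}

mean((toy$y - pred)^2)               # training error shrinks as stages are added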

# "subsample" is the fraction of the training samples (randomly selected) that will be used to train each tree.
# "colsample_by_tree" is the fraction of features (randomly selected) that will be used to train each tree.
# "colsample_bylevel" is the fraction of features (randomly selected) that will be used in each node to train each tree.
#eta learning rate



xgb_ctrl <- trainControl(method = "cv",
                        number = 5,
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        allowParallel=T,
                        index = cv_fold,
                        verboseIter = FALSE,
                        savePredictions = TRUE,
                        search = "grid")

xgb_grid <- expand.grid(nrounds = c(10, 50, 100),
                        eta = seq(0.06, .2, length.out = 3),
                        max_depth = c(50, 80),
                        gamma = c(0,.01, 0.1),
                        colsample_bytree = c(0.6, 0.7,0.8),
                        min_child_weight = 1,
                        subsample =  .7
                        
    )

    
xgb_model <-train(label~.,
                 data=train,
                 method="xgbTree",
                 trControl= xgb_ctrl,
                 tuneGrid=xgb_grid,
                 verbose=T,
                 metric="ROC",
                 nthread =3
                     
    )

Increasing the probability cut-off increases the precision: a greater fraction of those predicted to have cancer will actually have it, but the model is then likely to have lower recall. Here we want to avoid telling people who have cancer that they do not, which would be very bad. If we lower the cut-off to, say, 0.3, we are saying that even a 30% chance of cancer should be flagged.

resample_xgb <- thresholder(xgb_model, 
                              threshold = seq(.0, 1, by = 0.01), 
                              final = TRUE, 
                              statistics = "all")

ggplot(resample_xgb , aes(x = prob_threshold, y = F1, col = "F1")) + 
  geom_point() + 
  geom_point(aes(y = Sensitivity, col = "Sensitivity"))+
  scale_x_continuous(breaks = seq(0, 1, by =.1))

pred_xgb <-predict(xgb_model, newdata = test, type = "prob")
pred_xgb1 <- ifelse(pred_xgb[, "M"] > 0.4, "M", "B")
pred_xgb1 <- factor(pred_xgb1, levels = levels(test$label))

confusionMatrix(pred_xgb1,test$label,  positive = "M") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   M   B
##          M  59   1
##          B   1 110
##                                           
##                Accuracy : 0.9883          
##                  95% CI : (0.9584, 0.9986)
##     No Information Rate : 0.6491          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9743          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9833          
##             Specificity : 0.9910          
##          Pos Pred Value : 0.9833          
##          Neg Pred Value : 0.9910          
##              Prevalence : 0.3509          
##          Detection Rate : 0.3450          
##    Detection Prevalence : 0.3509          
##       Balanced Accuracy : 0.9872          
##                                           
##        'Positive' Class : M               
## 

Learning Curves

sets <- seq(from = 50, to = nrow(train), by = 50)
sets[length(sets)] <-nrow(train) 
train.err <- c()
test.err<- c()
tune_grid <- expand.grid( nrounds = 50, max_depth = 50, eta = 0.06, gamma = 0.01, 
                         colsample_bytree = 0.6, min_child_weight = 1, subsample = 0.7)
for (i in 1:length(sets)) {
    
    traini = train[1:sets[i],]
    fit_svm <- train(label ~., data = traini, metric = "Accuracy", method = "svmRadial",
                 trControl = trainControl(method = "none", summaryFunction = twoClassSummary,
                                          classProbs = TRUE),
                 tuneGrid = expand.grid(sigma = 0.0075, C = 5))
    
    # fit_svm <-train(label~.,
    #              data=traini,
    #              method="xgbTree",
    #              trControl= xgb_ctrl,
    #              tuneGrid= tune_grid ,
    #              verbose=T,
    #              metric="ROC",
    #              nthread =3
    #                  
    # )
    pred_train = predict(fit_svm, newdata = traini, type = "prob")
    pred_train = ifelse(pred_train[["M"]] > 0.4, "M", "B")
    train.err[i] =1 -  mean(pred_train == traini$label)
    pred_test = predict(fit_svm, newdata = test, type = 'prob')
    pred_test = ifelse(pred_test[, "M"] > 0.4, "M", "B")
    test.err[i] = 1 - mean(test$label == pred_test)
    
    cat(i," ")
    
}
## 1  2  3  4  5  6  7
train.err
## [1] 0.00000000 0.03000000 0.03333333 0.00500000 0.02000000 0.01333333
## [7] 0.02261307
matplot(sets, cbind(test.err, train.err), pch = 19, col = c("red", "blue"),
        type = "b", ylab = "Error", xlab = "Train sample size", main = "SVM Learning Curves")
legend("topright", legend = c("Test", "Train"), pch = 19, col = c("red", "blue"))

Error Analysis

Look at the examples that the algorithm misclassified to see if there is a trend; generally you are trying to find the weak points of your algorithm and understand why it makes those errors. For instance, from the boxplots below, the malignant tumors that were misclassified had a lower mean radius than the misclassified benign tumors, which is contrary to what we saw in the first set of boxplots.

df <- data.frame(cancer[-train_sample,], pred_svm) %>%
    setDT()


test_mis_svm <- df[(diagnosis == "M" & pred_svm == "B") | (diagnosis == "B" & pred_svm == "M")]
# test_mis_svm_m <- melt(test_mis_svm, 
#                 id.vars = c("diagnosis", "pred_svm"))
# 
# ggplot(test_mis_svm_m , aes(x = pred_svm, y = value))+
#     geom_boxplot() + facet_wrap(~variable, scales = "free_y")
