
Cancer Data

Mburu January 26, 2019

Data Set

In this tutorial I predict whether a breast cancer tumor is benign or malignant, using the Wisconsin breast cancer data set available on Kaggle. The 30 predictors are divided into three groups: means (variables 3-12), standard errors (variables 13-22) and worst values (variables 23-32), and each group contains the same 10 measurements of the tumor (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry and fractal dimension). When predicting breast tumor types there are two types of cost:

  • Telling someone with a malignant tumor that it is benign. These are the false negatives; such a person might not seek medical help, which can lead to death.
  • Telling someone that they have a malignant tumor when they do not. These are the false positives; here you subject someone to unnecessary stress.

So it is highly desirable that our model has good accuracy, a high $F_1$ score and high recall.

library(tidyverse)
library(data.table)
library(xgboost)
library(knitr)
library(broom)
library(caret)
library(e1071)
library(kableExtra)
library(ggthemes)

library(glmnet)

cancer <- setDT(read_csv("data.csv"))

cancer[, X33 := NULL]
head(cancer)  %>%
  kable() 

First six observations (shown transposed, one variable per row):

| variable | obs 1 | obs 2 | obs 3 | obs 4 | obs 5 | obs 6 |
|---|---|---|---|---|---|---|
| id | 842302 | 842517 | 84300903 | 84348301 | 84358402 | 843786 |
| diagnosis | M | M | M | M | M | M |
| radius_mean | 17.99 | 20.57 | 19.69 | 11.42 | 20.29 | 12.45 |
| texture_mean | 10.38 | 17.77 | 21.25 | 20.38 | 14.34 | 15.70 |
| perimeter_mean | 122.80 | 132.90 | 130.00 | 77.58 | 135.10 | 82.57 |
| area_mean | 1001.0 | 1326.0 | 1203.0 | 386.1 | 1297.0 | 477.1 |
| smoothness_mean | 0.11840 | 0.08474 | 0.10960 | 0.14250 | 0.10030 | 0.12780 |
| compactness_mean | 0.27760 | 0.07864 | 0.15990 | 0.28390 | 0.13280 | 0.17000 |
| concavity_mean | 0.3001 | 0.0869 | 0.1974 | 0.2414 | 0.1980 | 0.1578 |
| concave points_mean | 0.14710 | 0.07017 | 0.12790 | 0.10520 | 0.10430 | 0.08089 |
| symmetry_mean | 0.2419 | 0.1812 | 0.2069 | 0.2597 | 0.1809 | 0.2087 |
| fractal_dimension_mean | 0.07871 | 0.05667 | 0.05999 | 0.09744 | 0.05883 | 0.07613 |
| radius_se | 1.0950 | 0.5435 | 0.7456 | 0.4956 | 0.7572 | 0.3345 |
| texture_se | 0.9053 | 0.7339 | 0.7869 | 1.1560 | 0.7813 | 0.8902 |
| perimeter_se | 8.589 | 3.398 | 4.585 | 3.445 | 5.438 | 2.217 |
| area_se | 153.40 | 74.08 | 94.03 | 27.23 | 94.44 | 27.19 |
| smoothness_se | 0.006399 | 0.005225 | 0.006150 | 0.009110 | 0.011490 | 0.007510 |
| compactness_se | 0.04904 | 0.01308 | 0.04006 | 0.07458 | 0.02461 | 0.03345 |
| concavity_se | 0.05373 | 0.01860 | 0.03832 | 0.05661 | 0.05688 | 0.03672 |
| concave points_se | 0.01587 | 0.01340 | 0.02058 | 0.01867 | 0.01885 | 0.01137 |
| symmetry_se | 0.03003 | 0.01389 | 0.02250 | 0.05963 | 0.01756 | 0.02165 |
| fractal_dimension_se | 0.006193 | 0.003532 | 0.004571 | 0.009208 | 0.005115 | 0.005082 |
| radius_worst | 25.38 | 24.99 | 23.57 | 14.91 | 22.54 | 15.47 |
| texture_worst | 17.33 | 23.41 | 25.53 | 26.50 | 16.67 | 23.75 |
| perimeter_worst | 184.60 | 158.80 | 152.50 | 98.87 | 152.20 | 103.40 |
| area_worst | 2019.0 | 1956.0 | 1709.0 | 567.7 | 1575.0 | 741.6 |
| smoothness_worst | 0.1622 | 0.1238 | 0.1444 | 0.2098 | 0.1374 | 0.1791 |
| compactness_worst | 0.6656 | 0.1866 | 0.4245 | 0.8663 | 0.2050 | 0.5249 |
| concavity_worst | 0.7119 | 0.2416 | 0.4504 | 0.6869 | 0.4000 | 0.5355 |
| concave points_worst | 0.2654 | 0.1860 | 0.2430 | 0.2575 | 0.1625 | 0.1741 |
| symmetry_worst | 0.4601 | 0.2750 | 0.3613 | 0.6638 | 0.2364 | 0.3985 |
| fractal_dimension_worst | 0.11890 | 0.08902 | 0.08758 | 0.17300 | 0.07678 | 0.12440 |

Visualization and Feature selection

Percentage of women with malignant tumor

The percentage of women with a malignant tumor is 37.26% (212 out of 569), while the remaining 62.74% (357) had benign tumors.

cancer[, .(freq = .N),
       by = diagnosis] %>% 
    .[, perc := round(100 * freq/sum(freq), 2)] %>%
  
ggplot(aes(x=diagnosis, y=perc, fill = diagnosis)) + 
    geom_bar(stat = "identity", width  = 0.5)+ theme_hc() +
    geom_text(aes(x=diagnosis, y=perc, label = paste(perc, "%")),
              position =  position_dodge(width = 0.5),
              vjust = 0.05, hjust = 0.5, size = 5)+
    scale_fill_hc(name = "")+
    labs(x = "Cancer Type",
         y = "Percentage",
         title = "Percentage of women with benign or malignant breast cancer")+
    theme(legend.position = "none",
          axis.title = element_text(size =12))

Boxplots

From the boxplots we can identify variables where we expect a significant difference between the two groups of tumors. As a rule of thumb, if two boxplots do not overlap, or more than about 75% of the two boxes do not overlap, we expect a significant difference in the mean/median between the two groups. Some of the variables where the distributions of the two tumor types differ clearly are radius_mean, texture_mean and so on. The clearest separation between malignant and benign tumors is visible in the mean features and the worst-value features (the "worst" value of a measurement is the mean of its three largest cell values); for these variables the malignant tumors score higher than the benign ones.

cancerm <- melt(cancer[, -1, with = F], id.vars = "diagnosis")

ggplot(cancerm, aes(x = diagnosis, y = value))+
    geom_boxplot() + facet_wrap(~variable, scales = "free_y")
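As a rough cross-check of what the boxplots suggest, the melted data can be used to test each variable for a difference between the two diagnosis groups. This is only a sketch (it is not part of the original analysis) and reuses the cancerm table created above:

# For every variable, compare the group means and run a two-sample t-test
# between malignant (M) and benign (B) tumors.
group_tests <- cancerm[, .(
    mean_M  = mean(value[diagnosis == "M"]),
    mean_B  = mean(value[diagnosis == "B"]),
    p_value = t.test(value[diagnosis == "M"], value[diagnosis == "B"])$p.value
), by = variable]

# Variables with the strongest evidence of a difference between tumor types
group_tests[order(p_value)][1:10]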

Feature Scaling

We find that some variables are highly correlated, so we can use principal component analysis for dimensionality reduction. Since the variables are correlated, a smaller set of derived features should be enough to build our models. Before that, the predictors are centred and scaled.

cancer[, id := NULL]
predictors <- names(cancer)[3:31]
cancer[, (predictors) := lapply(.SD, function(x) scale(x)), .SDcols = predictors ]
cancer[, diagnosis := as.factor(diagnosis)]
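A quick sanity check (a sketch, assuming the chunk above has been run) confirms that the scaled predictors now have mean ≈ 0 and standard deviation ≈ 1:

# Means and standard deviations of the first few scaled predictors
# (they should be approximately 0 and 1 after scaling)
cancer[, lapply(.SD, function(x) c(mean = round(mean(x), 3), sd = round(sd(x), 3))),
       .SDcols = predictors[1:3]]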

Correlation matrix

cor(cancer[, -(1:2), with = F]) %>% kable(format = "html") 

(The output is a 29 × 29 correlation matrix of the predictors, texture_mean through fractal_dimension_worst; only a summary is shown here. The size-related features are very strongly correlated, e.g. perimeter_mean and area_mean (0.99), radius_worst and perimeter_worst (0.99), perimeter_mean and radius_worst (0.97), radius_se and perimeter_se (0.97), and texture_mean and texture_worst (0.91); the compactness, concavity and concave-points features are also strongly inter-correlated, mostly around 0.8-0.9.)

Principal Component Analysis

Using the elbow rule we could keep just the first 5 principal components; using 15 principal components captures almost 100% of the variance in the original data set.

pca <- prcomp(cancer[, predictors, with = F], scale. = F)

Variance Explained

Since PCA forms new features, the variance-explained plot shows how much of the variation in the original features is captured by each principal component. The new features are simply linear combinations of the old ones.

stdpca <- pca$sdev

varpca <- stdpca^2

prop_var <- varpca/sum(varpca)
prop_var * 100
##  [1] 43.706363461 18.472236711  9.716239461  6.816735644  5.676223001
##  [6]  4.161722924  2.292352407  1.643433612  1.363238017  1.191515264
## [11]  1.011031531  0.897368380  0.832104816  0.539193136  0.323823486
## [16]  0.269517389  0.198317442  0.178851469  0.153573276  0.107095306
## [21]  0.102579308  0.093821366  0.082603220  0.058724633  0.053331079
## [26]  0.027514319  0.022985324  0.005110230  0.002393786
sum(prop_var[1:15])
## [1] 0.9864358
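Since each principal component is just a linear combination of the original (scaled) predictors, the weights of those combinations can be read off the rotation matrix. A small illustrative sketch (not part of the original analysis):

# The rotation (loadings) matrix: column PC1 holds the weights of the linear
# combination of scaled predictors that forms the first principal component.
round(pca$rotation[1:6, 1:3], 3)

# The PC1 scores can be reproduced by hand as X %*% w ...
head(as.matrix(cancer[, predictors, with = FALSE]) %*% pca$rotation[, 1])
# ... and they match the scores computed by prcomp()
head(pca$x[, 1])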

Scree plot

The scree plot shows the variance explained by each principal component, which decreases as the number of components increases.

plot(prop_var, xlab = "Principal Component",
     ylab = "Proportion of Variance Explained",
     type = "b", xlim = c(0, 30))

Cumulative Variance Explained

The cumulative variance plot helps to choose the number of components based on how much of the variation in the original data you want to capture. In this case I wanted to use the number of principal components that captures almost 100% of the variation; after trying different numbers of components I found that model accuracy did not improve beyond the 15th principal component.

cum_var <- cumsum(prop_var)
plot(cum_var, xlab = "Principal Component",
     ylab = "Cumulative Proportion of Variance Explained",
     type = "b", xlim = c(0, 30))

Construct new data set

We use the first 15 principal components as our new predictors, then randomly split the data into training and test sets in a 7:3 ratio.

set.seed(100)
train_sample <- sample(1:nrow(cancer), round(0.7*nrow(cancer)))

pcadat <- data.table( label = cancer$diagnosis, pca$x[,1:15]) 
pcadat[, label := factor(label, levels = c("M", "B"))]
train <- pcadat[train_sample,]
test <- pcadat[-train_sample,]

Model Fitting

Some of the important terms to understand (a small helper that computes them follows this list):

  • Accuracy: the fraction of all patients that are classified correctly.

$$\text{Accuracy} = \frac{\text{True Positives} + \text{True Negatives}}{N}$$

  • Precision: of all the patients we predicted to have a malignant tumor, how many actually do.

$$\text{Precision} = \frac{\text{True Positives}}{\text{Predicted Positives}}$$

  • Recall (sensitivity): of all the patients with a malignant tumor, how many we detected. This is the true positive rate.

$$\text{Recall} = \frac{\text{True Positives}}{\text{Actual Positives}}$$

  • Specificity: the true negative rate. Of all the patients without a malignant tumor, how many we correctly identified.

$$\text{Specificity} = \frac{\text{True Negatives}}{\text{Actual Negatives}}$$
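To make these definitions concrete, here is a small helper (a sketch, not part of the original code) that computes the metrics from the cells of a 2 × 2 confusion matrix, with the malignant class treated as positive:

# tp, fp, tn, fn: cells of a confusion matrix with "M" as the positive class
classification_metrics <- function(tp, fp, tn, fn) {
    precision   <- tp / (tp + fp)
    recall      <- tp / (tp + fn)              # sensitivity / true positive rate
    specificity <- tn / (tn + fp)              # true negative rate
    accuracy    <- (tp + tn) / (tp + fp + tn + fn)
    f1          <- 2 * precision * recall / (precision + recall)
    c(accuracy = accuracy, precision = precision, recall = recall,
      specificity = specificity, F1 = f1)
}

# Example with made-up counts
classification_metrics(tp = 50, fp = 5, tn = 100, fn = 10)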

Logistic regression

This is a generalized linear model for binary outcomes. (A generalization of this model, multinomial regression, handles multi-class data.) The equation for the logistic regression model is:

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_n X_n$$

Using maximum likelihood estimation the cost function can be derived as

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[\, y^i \log(h_\theta(x^i)) + (1-y^i) \log(1 - h_\theta(x^i)) \,\right]$$

where $y^i \in \{0, 1\}$. The $\beta$'s are found by minimizing this cost function.
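A direct translation of this cost function into R (a minimal sketch for illustration only; the models below are fitted with glm() and glmnet, not with this function):

# Average cross-entropy cost for logistic regression.
# X: design matrix including an intercept column, y: 0/1 labels, beta: coefficients
logistic_cost <- function(beta, X, y) {
    p <- 1 / (1 + exp(-X %*% beta))            # h_theta(x), predicted probabilities
    -mean(y * log(p) + (1 - y) * log(1 - p))   # negative average log-likelihood
}

# The betas are found by minimising this cost, e.g. with optim():
# optim(rep(0, ncol(X)), logistic_cost, X = X, y = y, method = "BFGS")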

fit_glm <- glm(label ~., data = train, family = binomial)
## Warning: glm.fit: fitted probabilities numerically 0 or 1 occurred

Regularization in logistic regression

The warning "glm.fit: fitted probabilities numerically 0 or 1 occurred" indicates perfect separation/overfitting. In this case you can load the glmnet library and fit a regularized logistic regression. This is achieved by adding a regularization term to the cost function: L1 regularization (lasso) adds a penalty equal to the sum of the absolute values of the coefficients.

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m} \left[\, y^i \log(h_\theta(x^i)) + (1-y^i) \log(1 - h_\theta(x^i)) \,\right] + \frac{\lambda}{2m}\sum_{j=1}^{n} |\theta_j|$$

trainx <- train[,-1]

y_train <- factor(train$label, levels = c("B", "M"), labels = 0:1)
#y <- as.numeric(as.character(y))

y_test <- factor(test$label, levels = c("B", "M"), labels = 0:1) %>% as.character() %>% as.numeric()
#ytest <- as.numeric(as.character(ytest))

testx <- data.matrix(test[, -1]) 

To find the optimal value of $\lambda$ we use cross-validation: we choose the $\lambda$ that gives the best cross-validation performance.
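This can also be done directly with glmnet's built-in cross-validation, cv.glmnet(). The sketch below uses the objects prepared above and is for reference only, since the tutorial tunes both alpha and lambda through caret:

# Cross-validated lasso (alpha = 1) logistic regression on the principal components
cv_glm <- cv.glmnet(data.matrix(trainx), y_train,
                    family = "binomial", alpha = 1, nfolds = 10)

plot(cv_glm)                  # cross-validation curve along the lambda path
cv_glm$lambda.min             # lambda with the lowest cross-validation error

# Predicted probability of the malignant class ("1") on the test set
prob_lasso <- predict(cv_glm, newx = testx, s = "lambda.min", type = "response")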

cv_fold <- createFolds(train$label, k = 10)

myControl <- trainControl(
  method = "cv", 
  number = 10,
  summaryFunction = twoClassSummary,
  savePredictions = "all",
  classProbs = TRUE,
  verboseIter = FALSE,
  index = cv_fold,
  allowParallel = TRUE
  
)

tuneGrid <-  expand.grid(
    alpha = 0:1,
    lambda = seq(0.001, 1, length.out = 10))
    
glmnet_model <- train(
  label ~.,
  data = train,
  method = "glmnet",
  metric = "ROC",
  trControl = myControl,
  tuneGrid = tuneGrid
)


plot(glmnet_model) 

#lamda_min <- cv_glm$lambda.min
resample_glmnet <- thresholder(glmnet_model, 
                              threshold = seq(.2, 1, by = 0.05), 
                              final = TRUE, 
                              statistics = "all")

ggplot(resample_glmnet , aes(x = prob_threshold, y = F1)) + 
  geom_point() + 
  geom_point(aes(y = Sensitivity), col = "blue")

library(caTools)

pred_glm <- predict(glmnet_model, test, type = "prob")

colAUC(pred_glm , test$label, plotROC = TRUE)

##                 M         B
## M vs. B 0.9683183 0.9683183
pred_glm1 <- ifelse(pred_glm[, "M"] > 0.4, "M", "B")
#pred_glm1 <- predict(glmnet_model, test, type = "raw")
pred_glm1 <- factor(pred_glm1, levels = levels(test$label))
confusionMatrix(pred_glm1, test$label,positive = "M") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   M   B
##          M  53   7
##          B   7 104
##                                           
##                Accuracy : 0.9181          
##                  95% CI : (0.8664, 0.9545)
##     No Information Rate : 0.6491          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.8203          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8833          
##             Specificity : 0.9369          
##          Pos Pred Value : 0.8833          
##          Neg Pred Value : 0.9369          
##              Prevalence : 0.3509          
##          Detection Rate : 0.3099          
##    Detection Prevalence : 0.3509          
##       Balanced Accuracy : 0.9101          
##                                           
##        'Positive' Class : M               
## 

SVM

Support vector machines (SVMs) are a type of supervised learning algorithm used for both classification and regression, although most of the time they are used for classification.

To understand how an SVM works, consider an example of linearly separable data. It is clear that we can separate the two classes with a straight line (decision boundary), which is normally referred to as a separating hyperplane.

The question is: since there exist many lines that can separate the red and the black classes, which one is best? This introduces maximal margin classification. In short, the SVM finds the hyperplane/line that gives the biggest margin (gap) between the two classes; it chooses the solid line as the hyperplane, while the margins are the dotted lines. The points that lie directly on the margin, or on the wrong side of the margin for their class, are known as support vectors. The SVM uses only these points to construct the decision boundary; the other points are not used. Since this is a two-dimensional space, the separating line is $\beta_0 + \beta_1 X_1 + \beta_2 X_2 = 0$. When the left-hand side evaluates to more than zero we predict one class, $\beta_0 + \beta_1 X_1 + \beta_2 X_2 > 0 \Rightarrow y = 1$, and when it evaluates to less than zero we predict the other, $\beta_0 + \beta_1 X_1 + \beta_2 X_2 < 0 \Rightarrow y = -1$. Finding the widest margin $M$ becomes a maximisation problem, subject to $\sum_{j=1}^{n}\beta_j^2 = 1$ and

$$y_i(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}) \ge M.$$

This is the best-case scenario, but in most cases the classes are noisy: no matter which line you choose, some points are bound to fall on the wrong side of the decision boundary, so maximal margin classification would not work.

The SVM therefore introduces what is called a soft margin. Naively, you can think of this as a margin that allows some points to be on the wrong side; by introducing error terms we allow for some slack. In the two-predictor case the constraint becomes

$$y_i(\beta_0 + \beta_1 X_{i1} + \beta_2 X_{i2}) \ge M(1 - \epsilon_i), \qquad \epsilon_i \ge 0, \qquad \sum_{i=1}^{n} \epsilon_i \le C.$$

Here $C$ is a tuning parameter that controls how much slack is tolerated (and hence the width of the margin), while the $\epsilon_i$'s are slack variables that allow individual observations to fall on the wrong side of the margin. In some cases the decision boundary may be non-linear. With logistic regression you would be forced to introduce polynomial terms, which might result in a very large feature space; the SVM instead introduces what are called kernels.
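Before tuning with caret, a bare-bones radial-kernel SVM can be fitted directly with the e1071 package loaded earlier (a sketch for illustration; the tuned caret model below is the one actually evaluated):

# Radial-basis SVM on the principal-component features, with probability estimates
svm_basic <- svm(label ~ ., data = train, kernel = "radial",
                 cost = 1, probability = TRUE)

pred_basic <- predict(svm_basic, newdata = test)
mean(pred_basic == test$label)    # plain test-set accuracy of the untuned model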

Tuning SVM

svm_tune <-  expand.grid(
    C =c(1 ,5 ,  10, 100, 150),
    sigma = seq(0, .01, length.out = 5))
    
svm_model <- train(
  label ~.,
  data = train,
   metric="ROC",
  method = "svmRadial",
  trControl = myControl,
  tuneGrid = svm_tune,
  verbose = FALSE
)
resample_svm <- thresholder(svm_model, 
                              threshold = seq(.0, 1, by = 0.05), 
                              final = TRUE, 
                              statistics = "all")

ggplot(resample_svm , aes(x = prob_threshold, y = F1, col = "F1")) + 
  geom_point() + 
  geom_point(aes(y = Sensitivity,  col = "Sensitivity"))+
  scale_x_continuous(breaks = seq(0, 1, by = 0.1))

#mean(pred_svm == ytest)
pred_svm <-predict(svm_model, newdata = test, type = "prob")

pred_svm <- ifelse(pred_svm[, "M"] > 0.40, "M", "B")

pred_svm <- factor(pred_svm, levels = levels(test$label))

confusionMatrix(pred_svm, test$label, positive = "M") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   M   B
##          M  58   2
##          B   2 109
##                                           
##                Accuracy : 0.9766          
##                  95% CI : (0.9412, 0.9936)
##     No Information Rate : 0.6491          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9486          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9667          
##             Specificity : 0.9820          
##          Pos Pred Value : 0.9667          
##          Neg Pred Value : 0.9820          
##              Prevalence : 0.3509          
##          Detection Rate : 0.3392          
##    Detection Prevalence : 0.3509          
##       Balanced Accuracy : 0.9743          
##                                           
##        'Positive' Class : M               
## 

XGBoost

XGBoost is an ensemble learner. Ensemble learning combines multiple models to produce a single prediction; a good example is the random forest, where many decision trees are used together. There are two main types of ensemble learners: bagging and boosting. Random forests use the bagging approach: trees are built from random subsets (rows and columns) of the training set, and the final prediction is a weighted combination of all the individual trees. Boosting is similar, but the samples are selected sequentially. A first sample is drawn and a decision tree is fitted; the model then emphasises the examples that were hard to learn and, together with a few newly drawn examples, uses them to fit a second model. Predictions are made with the models fitted so far, the errors are examined, the hard examples are emphasised again, and another model is trained. This process continues for a specified number of rounds.

In gradient boosting the first model is fitted to the original training set. Say, for ease of explanation, you are fitting a simple regression model; the first model is $y = f(x) + \epsilon$. If the error is too large, you might try adding more features, using another algorithm, tuning the algorithm, looking for more training data, and so on. But what if the error is not white noise and has some relationship with the output $y$? Then we can fit a second model to the residuals, $\epsilon = f_1(x) + \epsilon_1$, and the process can continue, say, $n$ times, with the last stage being

$$\epsilon_{n-1} = f_n(x) + \epsilon_n.$$

The final step is to add these models together with some weights (the $\alpha$'s), which gives the final function used for prediction:

$$y = \alpha f(x) + \alpha_1 f_1(x) + \alpha_2 f_2(x) + \dots + \alpha_n f_n(x) + \epsilon$$
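The residual-fitting idea can be illustrated with a tiny toy example (a sketch on simulated data, unrelated to the cancer data set): each stage fits a small regression tree to the previous stage's residuals, and the staged predictions are added up with a shrinkage weight playing the role of the $\alpha$'s above.

library(rpart)  # small regression trees used as the base learners

set.seed(1)
toy   <- data.frame(x = runif(200, 0, 10))
toy$y <- sin(toy$x) + rnorm(200, sd = 0.3)

eta  <- 0.3                          # shrinkage weight
pred <- rep(mean(toy$y), nrow(toy))  # stage 0: a constant model

for (m in 1:50) {
    toy$resid <- toy$y - pred                          # current residuals (the epsilons)
    tree_m    <- rpart(resid ~ x, data = toy,          # fit a small tree to the residuals
                       control = rpart.control(maxdepth = 2))
    pred      <- pred + eta * predict(tree_m, toy)     # add the weighted correction
}

mean((toy$y - pred)^2)               # training error shrinks as stages are added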

# "subsample" is the fraction of the training samples (randomly selected) that will be used to train each tree.
# "colsample_by_tree" is the fraction of features (randomly selected) that will be used to train each tree.
# "colsample_bylevel" is the fraction of features (randomly selected) that will be used in each node to train each tree.
#eta learning rate



xgb_ctrl <- trainControl(method = "cv",
                        number = 5,
                        summaryFunction = twoClassSummary,
                        classProbs = TRUE,
                        allowParallel=T,
                        index = cv_fold,
                        verboseIter = FALSE,
                        savePredictions = TRUE,
                        search = "grid")

xgb_grid <- expand.grid(nrounds = c(10, 50, 100),
                        eta = seq(0.06, .2, length.out = 3),
                        max_depth = c(50, 80),
                        gamma = c(0,.01, 0.1),
                        colsample_bytree = c(0.6, 0.7,0.8),
                        min_child_weight = 1,
                        subsample =  .7
                        
    )

    
xgb_model <-train(label~.,
                 data=train,
                 method="xgbTree",
                 trControl= xgb_ctrl,
                 tuneGrid=xgb_grid,
                 verbose=T,
                 metric="ROC",
                 nthread =3
                     
    )

Increasing the probability cut-off increases the precision: a greater fraction of those predicted to have cancer will actually have it, but the model is then likely to have lower recall. Here we want to avoid telling people who have cancer that they do not, which would be very bad. If we lower the cut-off to, say, 0.3, we are saying that even a 30% chance of cancer should be flagged.

resample_xgb <- thresholder(xgb_model, 
                              threshold = seq(.0, 1, by = 0.01), 
                              final = TRUE, 
                              statistics = "all")

ggplot(resample_xgb , aes(x = prob_threshold, y = F1, col = "F1")) + 
  geom_point() + 
  geom_point(aes(y = Sensitivity, col = "Sensitivity"))+
  scale_x_continuous(breaks = seq(0, 1, by =.1))

pred_xgb <-predict(xgb_model, newdata = test, type = "prob")
pred_xgb1 <- ifelse(pred_xgb[, "M"] > 0.4, "M", "B")
pred_xgb1 <- factor(pred_xgb1, levels = levels(test$label))

confusionMatrix(pred_xgb1,test$label,  positive = "M") 
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   M   B
##          M  59   1
##          B   1 110
##                                           
##                Accuracy : 0.9883          
##                  95% CI : (0.9584, 0.9986)
##     No Information Rate : 0.6491          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9743          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9833          
##             Specificity : 0.9910          
##          Pos Pred Value : 0.9833          
##          Neg Pred Value : 0.9910          
##              Prevalence : 0.3509          
##          Detection Rate : 0.3450          
##    Detection Prevalence : 0.3509          
##       Balanced Accuracy : 0.9872          
##                                           
##        'Positive' Class : M               
## 

Learning Curves

sets <- seq(from = 50, to = nrow(train), by = 50)
sets[length(sets)] <-nrow(train) 
train.err <- c()
test.err<- c()
tune_grid <- expand.grid( nrounds = 50, max_depth = 50, eta = 0.06, gamma = 0.01, 
                         colsample_bytree = 0.6, min_child_weight = 1, subsample = 0.7)
for (i in 1:length(sets)) {
    
    traini = train[1:sets[i],]
    fit_svm <- train(label ~., data = traini, metric = "Accuracy", method = "svmRadial",
                 trControl = trainControl(method = "none", summaryFunction = twoClassSummary,
                                          classProbs = TRUE),
                 tuneGrid = expand.grid(sigma = 0.0075, C = 5))
    
    # fit_svm <-train(label~.,
    #              data=traini,
    #              method="xgbTree",
    #              trControl= xgb_ctrl,
    #              tuneGrid= tune_grid ,
    #              verbose=T,
    #              metric="ROC",
    #              nthread =3
    #                  
    # )
    pred_train = predict(fit_svm, newdata = traini, type = "prob")
    pred_train = ifelse(pred_train[["M"]] > 0.4, "M", "B")
    train.err[i] =1 -  mean(pred_train == traini$label)
    pred_test = predict(fit_svm, newdata = test, type = 'prob')
    pred_test = ifelse(pred_test[, "M"] > 0.4, "M", "B")
    test.err[i] = 1 - mean(test$label == pred_test)
    
    cat(i," ")
    
}
## 1  2  3  4  5  6  7
train.err
## [1] 0.00000000 0.03000000 0.03333333 0.00500000 0.02000000 0.01333333
## [7] 0.02261307
matplot(sets, cbind(test.err, train.err), pch = 19, col = c("red", "blue"),
        type = "b", ylab = "Error", xlab = "Train sample size", main = "SVM Learning Curves")
legend("topright", legend = c("Test", "Train"), pch = 19, col = c("red", "blue"))

Error Analysis

Look at the examples that the algorithm misclassified to see if there is a trend; generally you are trying to find the weak points of your algorithm and understand why it makes those errors. For instance, from the boxplots below, the malignant tumors that were misclassified had a lower mean radius than the misclassified benign tumors, which is contrary to what we saw in the first set of boxplots.

df <- data.frame(cancer[-train_sample,], pred_svm) %>%
    setDT()


test_mis_svm <- df[(diagnosis == "M" & pred_svm == "B") | (diagnosis == "B" & pred_svm == "M")]
# test_mis_svm_m <- melt(test_mis_svm, 
#                 id.vars = c("diagnosis", "pred_svm"))
# 
# ggplot(test_mis_svm_m , aes(x = pred_svm, y = value))+
#     geom_boxplot() + facet_wrap(~variable, scales = "free_y")
