Friday, 28 December 2018

Caesarean or No Caesarean: Dr. SVM Knows It Well

Photo by Hu Chen on Unsplash


INTRODUCTION

These days there is a lot of controversy regarding caesarean deliveries in India. Some blame the medical fraternity and hospitals for deliberately performing more and more caesarean deliveries. Let us explore how data science can help with this issue and with the decision of caesarean or no caesarean, using an objective method, i.e. machine learning.

DATASET

A sample of 80 women who went through either a caesarean or a normal delivery was collected (M. Zain Amin and Amir Ali, 'Performance Evaluation of Supervised Machine Learning Classifiers for Predicting Healthcare Operational Decisions', Machine Learning for Operational Decision Making, Wavy Artificial Intelligence Research Foundation, Pakistan, 2018). The dataset is available at the following link---

https://archive.ics.uci.edu/ml/datasets/Caesarian+Section+Classification+Dataset

SUPPORT VECTOR MACHINE

We will build a classification model with a Support Vector Machine (SVM). SVM is an algorithm that can be used for both regression and classification, and it supports both linear and non-linear modelling. In this case study, we will try linear and non-linear models for classifying the caesarean dataset.

Figure 1: Support Vector Machine


In Figure 1, the theoretical framework of the Support Vector Machine is graphically represented. There are two classes of labels, represented by green stars (Class 2) and orange stars (Class 1). The Y-axis shows variable X1 and the X-axis shows variable X2; both are independent variables. The graph shows how each observation is distributed in the two dimensions X1 and X2.

The support vectors are the observations that lie closest to the decision boundary; the margin boundaries are the imaginary lines drawn through them at the maximum possible distance from each class. The red line is the hyperplane obtained after rounds of optimization; it is the plane that best separates the two classes, and the SVM algorithm's job is to find it. To understand SVM more intuitively, click on the link below---

https://www.youtube.com/watch?v=efR1C6CvhmE
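
To make Figure 1 concrete, here is a small self-contained sketch (using made-up toy data, not the caesarean dataset) that fits a linear SVM in two dimensions with e1071 and plots the separating boundary along with the support vectors---

>library(e1071)
>set.seed(1)
># two artificial classes of 20 points each in the (X1, X2) plane
>x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
>toy <- data.frame(X1 = x[, 1], X2 = x[, 2],
                  y = factor(rep(c("Class1", "Class2"), each = 20)))
>fit <- svm(y ~ X1 + X2, data = toy, kernel = "linear")
>plot(fit, toy)   # shades the two class regions and marks the support vectors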

Machine learning model building with a Support Vector Machine comprises the following steps---

A) Data Preparation & Exploratory Data Analysis
B) SVM Model Building & Prediction

Let's begin the appointment with Dr. SVM.

A) Data Preparation & Exploratory Data Analysis

In RStudio, we begin by setting the working directory and loading the required packages.

>setwd("PATH_to_FILE")   # replace with the folder containing the dataset
>getwd()
>library(foreign)   # read.arff() to read the ARFF file
>library(e1071)     # svm()
>library(caret)     # train(), trainControl(), confusionMatrix()
>library(ROCR)      # loaded for ROC analysis; not used below

The dataset contains 80 observations and 6 variables. Out of the six variables, five are attributes (independent variables) and one is the dependent variable. We give the name "ceas" to the data frame---

>ceas=read.arff("caesarian.csv.arff")
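
Before transforming anything, a quick sanity check (output omitted here) confirms what read.arff returned---

>dim(ceas)    # expect 80 rows and 6 columns
>str(ceas)    # all six variables come in as factors
>head(ceas)   # first few records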

The attributes/independent variables are "age", "deliveryNumber", "BP", "deliveryTime" and "heartProblem" (as renamed below). When we execute the str(ceas) command, we find that all the variables are of factor type. We need to convert them into numeric format for model building, removing each parent factor variable after conversion---

>ceas$age=as.numeric(substr(ceas$Age,1,2))
>ceas$Age=NULL

>ceas$deliveryNumber=as.numeric(substr(ceas$`Delivery number`,1,1))
>ceas$`Delivery number`=NULL

>ceas$BP=as.numeric(substr(ceas$`Blood of Pressure`,1,1))
>ceas$`Blood of Pressure`=NULL

>ceas$deliveryTime=as.numeric(substr(ceas$`Delivery time`,1,1))
>ceas$`Delivery time`=NULL

>ceas$heartProblem=as.numeric(substr(ceas$`Heart Problem`,1,1))
>ceas$`Heart Problem`=NULL
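
A side note on the conversions above: they rely on substr() and on the factor labels having a fixed width. An equivalent, width-independent pattern (illustrated for the Age column, before it is removed) goes through as.character(), because as.numeric() applied directly to a factor returns the internal level codes rather than the labels---

>ceas$age <- as.numeric(as.character(ceas$Age))   # alternative to substr()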

The following summary statistics show measures of central tendency and dispersion---

>summary(ceas)



Let us understand the distribution of each variable through the following tables---

> table(ceas$Caesarian)

 0  1
34 46

The Caesarian variable has two values, 0 and 1. The value 0 represents a non-caesarean delivery and the value 1 represents a caesarean delivery.

> table(ceas$age)

17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 35 36 37 38 40 
 1  2  2  3  2  4  1  2  7 10  7  6  6  3  3  8  5  2  3  1  1  1 

The age variable ranges from 17 to 40 years.

> table(ceas$deliveryNumber)

 1  2  3  4 
41 27 10  2 

The deliveryNumber variable records how many deliveries the mother has had, i.e. whether this is her first, second, third or fourth delivery.

> table(ceas$BP)

 0  1  2 
20 40 20 

The variable BP records the blood pressure level. The values 0, 1 and 2 represent low, normal and high, respectively.

> table(ceas$deliveryTime)

 0  1  2 
46 17 17 

The deliveryTime variable represents the timing of the delivery. The values 0, 1 and 2 represent timely, premature and latecomer (overdue), respectively.

> table(ceas$heartProblem)

 0  1 
50 30 

The heartProblem variable records whether the pregnant woman has a heart problem. The values 0 and 1 represent "apt" (no heart problem) and "inept" (heart problem), respectively, following the labels in the original dataset.
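
Before modelling, a quick visual check of how each attribute differs across the two outcome classes can be helpful; a minimal sketch with base R graphics---

>par(mfrow = c(2, 3))   # 2 x 3 grid of panels
>for (v in c("age", "deliveryNumber", "BP", "deliveryTime", "heartProblem")) {
   boxplot(ceas[[v]] ~ ceas$Caesarian, main = v,
           xlab = "Caesarian (0 = no, 1 = yes)")
 }
>par(mfrow = c(1, 1))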


Exploring the dependence of the Caesarian variable on the independent variables is important for model building. Hence, we analyze the impact of the independent variables on the Caesarian variable with the chi-squared test. The results of the chi-squared tests are as follows---


> chisq.test(ceas$deliveryNumber, ceas$Caesarian)

Pearson's Chi-squared test

data:  ceas$deliveryNumber and ceas$Caesarian
X-squared = 2.407, df = 3, p-value = 0.4923


> chisq.test(ceas$BP, ceas$Caesarian)

Pearson's Chi-squared test

data:  ceas$BP and ceas$Caesarian
X-squared = 7.468, df = 2, p-value = 0.0239


> chisq.test(ceas$deliveryTime, ceas$Caesarian)

Pearson's Chi-squared test

data:  ceas$deliveryTime and ceas$Caesarian
X-squared = 2.6379, df = 2, p-value = 0.2674


> chisq.test(ceas$heartProblem, ceas$Caesarian)

Pearson's Chi-squared test with Yates' continuity correction

data:  ceas$heartProblem and ceas$Caesarian
X-squared = 8.5251, df = 1, p-value = 0.003503

We can observe the p-value of each chi-squared test. If the p-value is less than the 0.05 significance level, we reject the null hypothesis that the two variables are independent. We can see that heartProblem (p = 0.0035) and BP (p = 0.0239) are significantly associated with the caesarean decision, while deliveryNumber (p = 0.4923) and deliveryTime (p = 0.2674) show no significant impact on the decision of caesarean or no caesarean.
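
The four tests above can also be run in one compact loop (a sketch; chisq.test may warn about small expected cell counts on a sample of 80)---

>for (v in c("deliveryNumber", "BP", "deliveryTime", "heartProblem")) {
   p <- chisq.test(ceas[[v]], ceas$Caesarian)$p.value
   cat(v, ": p-value =", round(p, 4), "\n")
 }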

B) SVM Model Building & Prediction

Now, we are ready for model building with the Support Vector Machine. SVM is a powerful technique with which we can build both linear and non-linear models, and we will try several SVM models for our case study.

So, we begin with simple SVM model---

> model.svm=svm(Caesarian~.,data=ceas)
> summary(model.svm)

Call:
svm(formula = Caesarian ~ ., data = ceas)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.2

Number of Support Vectors:  65

 ( 29 36 )


Number of Classes:  2

Levels:
 0 1

In the model summary, we can observe that, by default, the model has selected the non-linear SVM kernel "radial". The model uses the default cost = 1 and gamma = 0.2 and performs C-classification. The number of support vectors is 65.
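
For reference, these defaults can be written out explicitly; the call below should be equivalent to the one we just ran (in e1071, gamma defaults to 1 divided by the number of predictors, here 1/5 = 0.2)---

>model.svm <- svm(Caesarian ~ ., data = ceas,
                 type = "C-classification", kernel = "radial",
                 cost = 1, gamma = 0.2)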

Let us predict the caesarean type for the data frame "ceas" with the SVM model and then evaluate the predictions against the actual values. Note that we are predicting on the same data the model was trained on, so this measures in-sample fit---

> pred=predict(model.svm, ceas)
> table(ceas$Caesarian, pred)
   pred
     0  1
  0 26  8
  1  9 37

The following confusion matrix shows the accuracy and other statistics of the prediction by the SVM model---

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 26  9
         1  8 37

               Accuracy : 0.7875
                 95% CI : (0.6817, 0.8711)
    No Information Rate : 0.575
    P-Value [Acc > NIR] : 5.404e-05

                  Kappa : 0.5669
 Mcnemar's Test P-Value : 1

            Sensitivity : 0.7647
            Specificity : 0.8043
         Pos Pred Value : 0.7429
         Neg Pred Value : 0.8222
             Prevalence : 0.4250
         Detection Rate : 0.3250
   Detection Prevalence : 0.4375
      Balanced Accuracy : 0.7845

       'Positive' Class : 0

The overall accuracy of the model is 78.75%.
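
This figure can be verified by hand: the diagonal of the confusion matrix holds the correct predictions---

>(26 + 37) / 80   # correct predictions / total observations
[1] 0.7875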

Now, we apply cross-validation together with the SVM model to see whether prediction improves. First, we try a linear SVM model---


>trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)   # 10-fold CV, repeated 3 times
>set.seed(3233)

>svm_Linear <- train(Caesarian ~ ., data = ceas, method = "svmLinear",
                    trControl = trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)


The model summary is as follows---

>svm_Linear

Support Vector Machines with Linear Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results:

  Accuracy  Kappa 
  0.651918  0.3198123

Tuning parameter 'C' was held constant at a value of 1

Let us see how it predicts the Caesarian variable---

> pred=predict(svm_Linear, newdata=ceas)
> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 28 20
         1  6 26
                                       
               Accuracy : 0.675         
                 95% CI : (0.5611, 0.7755)
    No Information Rate : 0.575         
    P-Value [Acc > NIR] : 0.04356       
                                       
                  Kappa : 0.3689       
 Mcnemar's Test P-Value : 0.01079       
                                       
            Sensitivity : 0.8235       
            Specificity : 0.5652       
         Pos Pred Value : 0.5833       
         Neg Pred Value : 0.8125       
             Prevalence : 0.4250       
         Detection Rate : 0.3500       
   Detection Prevalence : 0.6000       
      Balanced Accuracy : 0.6944       
                                       
       'Positive' Class : 0

We can see that the accuracy of this model (67.5%) is lower than that of the simple model (78.75%).

We can fine-tune the model through the hyperparameter C of the linear SVM and see whether accuracy improves. We take C values ranging from 0 to 5.

>grid <- expand.grid(C = c(0, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2, 5))   # note: C = 0 is invalid and yields NaN below

>set.seed(3233)

>svm_Linear_Grid <- train(Caesarian ~ ., data = ceas, method = "svmLinear",
                           trControl = trctrl,
                           preProcess = c("center", "scale"),
                           tuneGrid = grid,
                           tuneLength = 10)

> svm_Linear_Grid
Support Vector Machines with Linear Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results across tuning parameters:

  C     Accuracy   Kappa 
  0.00        NaN        NaN
  0.01  0.5755291  0.0000000
  0.05  0.5667328  0.1259469
  0.10  0.6266534  0.2777531
  0.25  0.6212963  0.2586232
  0.50  0.6382275  0.2938355
  0.75  0.6513228  0.3195901
  1.00  0.6519180  0.3198123
  1.25  0.6566799  0.3255291
  1.50  0.6560847  0.3242667
  1.75  0.6566799  0.3234376
  2.00  0.6662037  0.3421381
  5.00  0.6657407  0.3330852

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 2.
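
As an aside, caret's plot method draws the accuracy profile over the C grid, which makes the mostly flat behaviour above easy to see at a glance---

>plot(svm_Linear_Grid)   # resampled accuracy versus cost C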

> pred=predict(svm_Linear_Grid, newdata=ceas)

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 24 12
         1 10 34
                                       
               Accuracy : 0.725       
                 95% CI : (0.6138, 0.819)
    No Information Rate : 0.575       
    P-Value [Acc > NIR] : 0.003992     
                                       
                  Kappa : 0.4416       
 Mcnemar's Test P-Value : 0.831170     
                                       
            Sensitivity : 0.7059       
            Specificity : 0.7391       
         Pos Pred Value : 0.6667       
         Neg Pred Value : 0.7727       
             Prevalence : 0.4250       
         Detection Rate : 0.3000       
   Detection Prevalence : 0.4500       
      Balanced Accuracy : 0.7225       
                                       
       'Positive' Class : 0           
                               
We can see that the in-sample accuracy of the model has improved from 67.5% to 72.5% at C = 2.

Now, let us apply the non-linear "radial" model again, this time with cross-validation, and see whether the model prediction improves.

> set.seed(3233)
> svm_Radial <- train(Caesarian ~., data = ceas, method = "svmRadial",
                        trControl=trctrl,
                        preProcess = c("center", "scale"),
                        tuneLength = 10)

> svm_Radial

Support Vector Machines with Radial Basis Function Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results across tuning parameters:

  C       Accuracy   Kappa   
    0.25  0.6093915   0.14372559
    0.50  0.6707011   0.32367981
    1.00  0.6381614   0.25688123
    2.00  0.6212302   0.22192343
    4.00  0.5928571   0.16922655
    8.00  0.5871693   0.16469149
   16.00  0.5913360   0.17885126
   32.00  0.5556217   0.10723957
   64.00  0.5172619   0.03346007
  128.00  0.5010582  -0.00360852

Tuning parameter 'sigma' was held constant at a value of 0.2772785
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.2772785 and C = 0.5.
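
The chosen tuning values can also be extracted programmatically from the train object---

>svm_Radial$bestTune   # should show sigma = 0.2772785 and C = 0.5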

> pred=predict(svm_Radial, newdata=ceas)

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 25  8
         1  9 38
                                          
               Accuracy : 0.7875          
                 95% CI : (0.6817, 0.8711)
    No Information Rate : 0.575           
    P-Value [Acc > NIR] : 5.404e-05       
                                          
                  Kappa : 0.5635          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.7353          
            Specificity : 0.8261          
         Pos Pred Value : 0.7576          
         Neg Pred Value : 0.8085          
             Prevalence : 0.4250          
         Detection Rate : 0.3125          
   Detection Prevalence : 0.4125          
      Balanced Accuracy : 0.7807          
                                          
       'Positive' Class : 0


We can see that the in-sample accuracy of the "radial" model is 78.75%, which is higher than the linear models. Let us see whether fine-tuning the "radial" model by varying the values of C and sigma improves it further---

>grid_radial <- expand.grid(sigma = c(0,0.01, 0.02, 0.025, 0.03, 0.04,
                                     0.05, 0.06, 0.07,0.08, 0.09, 0.1, 0.25, 0.5, 0.75,0.9),
                           C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75,
                                 1, 1.5, 2,5))

>set.seed(3233)

>svm_Radial_Grid <- train(Caesarian ~., data = ceas, method = "svmRadial",
                           trControl=trctrl,
                           preProcess = c("center", "scale"),
                           tuneGrid = grid_radial,
                           tuneLength = 10)

> pred=predict(svm_Radial_Grid, newdata=ceas)

> confusionMatrix(pred, ceas$Caesarian)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 27 16
         1  7 30
                                       
               Accuracy : 0.7125       
                 95% CI : (0.6005, 0.8082)
    No Information Rate : 0.575         
    P-Value [Acc > NIR] : 0.007869     
                                       
                  Kappa : 0.4314       
 Mcnemar's Test P-Value : 0.095293     
                                       
            Sensitivity : 0.7941       
            Specificity : 0.6522       
         Pos Pred Value : 0.6279       
         Neg Pred Value : 0.8108       
             Prevalence : 0.4250       
         Detection Rate : 0.3375       
   Detection Prevalence : 0.5375       
      Balanced Accuracy : 0.7231       
                                       
       'Positive' Class : 0

We can observe that cross-validation with these values of C and sigma degraded the in-sample accuracy. We could try many other fine-tuning alternatives and other algorithms to improve the prediction. For now, we select the SVM model with the highest classification accuracy: the radial-kernel model.
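
As a closing sketch (an assumed 70/30 split, not part of the analysis above): since all of the confusion matrices above were computed on the full training data, a held-out test set would give a more honest estimate of the chosen radial model---

>set.seed(3233)
>idx <- createDataPartition(ceas$Caesarian, p = 0.7, list = FALSE)   # from caret
>train_set <- ceas[idx, ]
>test_set <- ceas[-idx, ]
>fit <- svm(Caesarian ~ ., data = train_set)    # default radial kernel
>pred <- predict(fit, newdata = test_set)
>confusionMatrix(pred, test_set$Caesarian)      # performance on unseen data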
