Business Excellence: December 2018

Hu Chen

INTRODUCTION

These days there is a lot of controversy regarding Caesarean deliveries in India. Some blame that medical fraternity and hospitals are deliberately doing more and more Caesarean deliveries. Let us explore how data science can help in this issue and making the decision about caesarean delivery or not with the help of an objective method i.e. machine learning.

DATASET

A sample of 80 women were collected who have either gone through caesarean delivery or normal delivery (M.Zain Amin, Amir Ali.'Performance Evaluation of Supervised Machine Learning Classifiers for Predicting Healthcare Operational Decisions'.Machine Learning for Operational Decision Making, Wavy Artificial Intelligence Research Foundation, Pakistan, 2018). This dataset is available on the following link---

https://archive.ics.uci.edu/ml/datasets/Caesarian+Section+Classification+Dataset

SUPPORT VECTOR MACHINE

We will do a classification model building with Support Vector Machine (SVM). Support Vector Machine is an algorithm which can be used for regression and classification. SVM can be used for linear and non-linear modelling. In this case study, we will try linear and nonlinear models for classification of Caesarean dataset.

Figure 1 Support Vector Machine

In Figure 2, the theoretical framework of Support Vector is graphically represented. In Figure 2, there are two types of labels represented by green (Class2) and orange stars (Class1). On Y-axis, there is X1 variable and on X-axis, X2 variable and both are independent variables. The graph shows how each observation is distributed in 2 dimensions X1 and X2.

The Support Vectors are the imaginary lines that are drawn on the basis of the maximum distance possible from each class of data. The redline is the hyperplane which is obtained after rounds of optimizations. The redline hyperplane is the best plane in classifying the two classes. SVM algorithm finds the best hyperplane for the best classification. To know more SVM intuitively, click on the link below---

https://www.youtube.com/watch?v=efR1C6CvhmE

The machine learning model building with Support Vector Machine comprises of following steps---

A) Data Preparation & Exploratory Data Analysis
B) SVM Model Building & Prediction

Let's begin the appointment with Dr. SVM.

A) Data Preparation & Exploratory Data Analysis

In RStudio, we begin with setting the working directory and packages required.

>setwd(PATH_to_FILE)
>getwd()
>library(foreign)
>library(e1071)
>library(caret)

>library(ROCR)

The dataset contains 80 observations and 6 variables. Out of six variables, five variables are attributes and one is a dependent variable. We give the name "ceas" to the file---

>ceas=read.arff("caesarian.csv.arff")

The attributes/independent variables are "age", "deliveryNumber", "BP", "deliveryTime" and "heartProblem". When we execute "str(ceas)" command, we find that all the variables are factor variable type. We need to convert them into the numeric format for model building and we will reove the parent variable---

>ceas$age=as.numeric(substr(ceas$Age,1,2))
>ceas$Age=NULL

>ceas$deliveryNumber=as.numeric(substr(ceas$`Delivery number`,1,1))
>ceas$`Delivery number`=NULL

>ceas$BP=as.numeric(substr(ceas$`Blood of Pressure`,1,1))
>ceas$`Blood of Pressure`=NULL

>ceas$deliveryTime=as.numeric(substr(ceas$`Delivery time`,1,1))
>ceas$`Delivery time`=NULL

>ceas$heartProblem=as.numeric(substr(ceas$`Heart Problem`,1,1))
>ceas$`Heart Problem`=NULL

Following summary statistics shows measures of central tendency and dispersion---

>summary(ceas)

let us understand distribution of each variable through the following tables---

> table(ceas$Caesarian)

0 1
34 46

Caesarian variable has two values 0 and 1. Value "0" represents that the delivery was non-caesarian and value "1" represents that the delivery was caesarian.

> table(ceas$age)

17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 35 36 37 38 40
1 2 2 3 2 4 1 2 7 10 7 6 6 3 3 8 5 2 3 1 1 1

In age variable, the range is 17 to 40.

> table(ceas$deliveryNumber)

1 2 3 4

41 27 10 2

The deliveryNumber tells about the number of times deliver incidents with the mother.

> table(ceas$BP)

0 1 2

20 40 20

The variable BP tells about the level of Blood Pressure. The 0,1 and 2 values represent low, normal and high respectively.

> table(ceas$deliveryTime)

0 1 2

46 17 17

The deliveryTime variable represents the type of delivery. The 0,1 and 2 values represent timely, premature and latecomer respectively.

> table(ceas$heartProblem)

0 1

50 30

The heartProblem variable tells about if there is heart problem to the pregnant woman or not. The 0 and 1 values represent apt and inept.

Exploring the dependencies of the Caesarian variable with other independent variables is important for model building. Hence, we analyze the impact of independent variables on the" caesarian" variable with the Chi-Squared test. The results of chi-squared tests are as follows---

> chisq.test(ceas$deliveryNumber, ceas$Caesarian)

Pearson's Chi-squared test

data: ceas$deliveryNumber and ceas$Caesarian
X-squared = 2.407, df = 3, p-value = 0.4923

> chisq.test(ceas$BP, ceas$Caesarian)

Pearson's Chi-squared test

data: ceas$BP and ceas$Caesarian
X-squared = 7.468, df = 2, p-value = 0.0239

> chisq.test(ceas$deliveryTime, ceas$Caesarian)

Pearson's Chi-squared test

data: ceas$deliveryTime and ceas$Caesarian
X-squared = 2.6379, df = 2, p-value = 0.2674

> chisq.test(ceas$heartProblem, ceas$Caesarian)

Pearson's Chi-squared test with Yates' continuity correction

data: ceas$heartProblem and ceas$Caesarian

X-squared = 8.5251, df = 1, p-value = 0.003503

We can observe the p-values of each chi.squared test. If p-value is less than 0.05 significance level, there is no evidence against the null hypothesis that both variables are independent. We can see that variables hearProblem and BP while deliveryNumber and deliveryTIme have no impact of on taking of decision of caesarean or no caesarean.

B) SVM Model Building & Prediction

Now, we are ready for model building. We are, here, going to use Support Vector Machine (SVM). The SVM technique is a very powerful technique in which we can build linear and nonlinear models. We will try different models in SVM for our case study.

So, we begin with simple SVM model---

> model.svm=svm(Caesarian~.,data=ceas)
> summary(model.svm)

Call:
svm(formula = Caesarian ~ ., data = ceas)

Parameters:
SVM-Type: C-classification
SVM-Kernel: radial
cost: 1
gamma: 0.2

Number of Support Vectors: 65

( 29 36 )

Number of Classes: 2

Levels:
0 1

In the model summary, we can observe that, by default, model has selected nonlinear model i.e. SVM Kernal "radial". Model also optimized at Cost=1 and gamma=0.2. Model has done C-Classification. Number of Support Vectors are 65.

Let us predict the Caesarean type for data frame "ceas" with SVM model and then we evaluate the prediction with actual values---

> pred=predict(model.svm, ceas)
> table(ceas$Caesarian, pred)
pred
0 1
0 26 8
1 9 37

Following confusion matrix shows accuracy and other statistics of prediction by SVM model---

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 26 9
1 8 37

Accuracy : 0.7875
95% CI : (0.6817, 0.8711)
No Information Rate : 0.575
P-Value [Acc > NIR] : 5.404e-05

Kappa : 0.5669
Mcnemar's Test P-Value : 1

Sensitivity : 0.7647
Specificity : 0.8043
Pos Pred Value : 0.7429
Neg Pred Value : 0.8222
Prevalence : 0.4250
Detection Rate : 0.3250
Detection Prevalence : 0.4375
Balanced Accuracy : 0.7845

'Positive' Class : 0

Overall accuracy of the model is 78%.

Now, we apply cross-validation and SVM model together to see if there is any improvement in prediction or not. First, we try with linear SVM model---

>trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
>set.seed(3233)

>svm_Linear <- train(caes ~., data = ceas, method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)

Model summary is as follows---

>svm_Linear

Support Vector Machines with Linear Kernel

80 samples
5 predictor
2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results:

Accuracy Kappa
0.651918 0.3198123

Tuning parameter 'C' was held constant at a value of 1

Let us see, how it predicts the Caesarean variable---

> pred=predict(svm_Linear, newdata=ceas)
> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 28 20
1 6 26

Accuracy : 0.675
95% CI : (0.5611, 0.7755)
No Information Rate : 0.575
P-Value [Acc > NIR] : 0.04356

Kappa : 0.3689
Mcnemar's Test P-Value : 0.01079

Sensitivity : 0.8235
Specificity : 0.5652
Pos Pred Value : 0.5833
Neg Pred Value : 0.8125
Prevalence : 0.4250
Detection Rate : 0.3500
Detection Prevalence : 0.6000
Balanced Accuracy : 0.6944

'Positive' Class : 0

We can see that the model accuracy of this model is lower than the simple model.

We can fine-tune model with hyperparameters C and gamma of the SVM model and let us see if model accuracy improves. We take the range of C values from 0.01 to 2.5.

>grid <- expand.grid(C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2,5))

>set.seed(3233)

>svm_Linear_Grid <- train(caes ~., data = ceas, method = "svmLinear",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneGrid = grid,
tuneLength = 10)

>svm_Linear_Grid

> svm_Linear_Grid
Support Vector Machines with Linear Kernel

80 samples
5 predictor
2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results across tuning parameters:

C Accuracy Kappa
0.00 NaN NaN
0.01 0.5755291 0.0000000
0.05 0.5667328 0.1259469
0.10 0.6266534 0.2777531
0.25 0.6212963 0.2586232
0.50 0.6382275 0.2938355
0.75 0.6513228 0.3195901
1.00 0.6519180 0.3198123
1.25 0.6566799 0.3255291
1.50 0.6560847 0.3242667
1.75 0.6566799 0.3234376
2.00 0.6662037 0.3421381
5.00 0.6657407 0.3330852

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 2.

> pred=predict(svm_Linear_Grid, newdata=ceas)

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 24 12
1 10 34

Accuracy : 0.725
95% CI : (0.6138, 0.819)
No Information Rate : 0.575
P-Value [Acc > NIR] : 0.003992

Kappa : 0.4416
Mcnemar's Test P-Value : 0.831170

Sensitivity : 0.7059
Specificity : 0.7391
Pos Pred Value : 0.6667
Neg Pred Value : 0.7727
Prevalence : 0.4250
Detection Rate : 0.3000
Detection Prevalence : 0.4500
Balanced Accuracy : 0.7225

'Positive' Class : 0

We can see that accuracy of the model has improved from 67% to 72% at C=2.

Now, let us apply nonlinear "radial" model again and see if model prediction improves or not.

> set.seed(3233)
> svm_Radial <- train(Caesarian ~., data = ceas, method = "svmRadial",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneLength = 10)

> svm_Radial

Support Vector Machines with Radial Basis Function Kernel

80 samples
5 predictor
2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results across tuning parameters:

C Accuracy Kappa
0.25 0.6093915 0.14372559
0.50 0.6707011 0.32367981
1.00 0.6381614 0.25688123
2.00 0.6212302 0.22192343
4.00 0.5928571 0.16922655
8.00 0.5871693 0.16469149
16.00 0.5913360 0.17885126
32.00 0.5556217 0.10723957
64.00 0.5172619 0.03346007
128.00 0.5010582 -0.00360852

Tuning parameter 'sigma' was held constant at a value of 0.2772785
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.2772785 and C = 0.5.

> pred=predict(svm_Radial, newdata=ceas)

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))

Confusion Matrix and Statistics

Reference

Prediction 0 1

0 25 8

1 9 38

Accuracy : 0.7875

95% CI : (0.6817, 0.8711)

No Information Rate : 0.575

P-Value [Acc > NIR] : 5.404e-05

Kappa : 0.5635

Mcnemar's Test P-Value : 1

Sensitivity : 0.7353

Specificity : 0.8261

Pos Pred Value : 0.7576

Neg Pred Value : 0.8085

Prevalence : 0.4250

Detection Rate : 0.3125

Detection Prevalence : 0.4125

Balanced Accuracy : 0.7807

'Positive' Class : 0

We can see that accuracy of "radial" model is 79% which is higher than linear models. Let us see if finetuning of "radial" model improves the model or not by changing values of C and Sigma---

>grid_radial <- expand.grid(sigma = c(0,0.01, 0.02, 0.025, 0.03, 0.04,
0.05, 0.06, 0.07,0.08, 0.09, 0.1, 0.25, 0.5, 0.75,0.9),
C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75,
1, 1.5, 2,5))

>set.seed(3233)

>svm_Radial_Grid <- train(Caesarian ~., data = ceas, method = "svmRadial",
trControl=trctrl,
preProcess = c("center", "scale"),
tuneGrid = grid_radial,
tuneLength = 10)

> pred=predict(svm_Radial_Grid, newdata=ceas)

> confusionMatrix(pred, ceas$Caesarian)
Confusion Matrix and Statistics

Reference
Prediction 0 1
0 27 16
1 7 30

Accuracy : 0.7125
95% CI : (0.6005, 0.8082)
No Information Rate : 0.575
P-Value [Acc > NIR] : 0.007869

Kappa : 0.4314
Mcnemar's Test P-Value : 0.095293

Sensitivity : 0.7941
Specificity : 0.6522
Pos Pred Value : 0.6279
Neg Pred Value : 0.8108
Prevalence : 0.4250
Detection Rate : 0.3375
Detection Prevalence : 0.5375
Balanced Accuracy: 0.7231

'Positive' Class : 0

We can observe that cross-validation with different values of C and sigma degraded the model accuracy. We can try many other finetuning alternatives and other algorithms to improve the prediction. For now, we can select svm model with highest classification accuracy.

In this case study, we will see how to use forecasting with R for currency trading. If we know the forecasting for the next few days, we can build our trading system around the trend line of forecasts and can make buying or selling decisions.

DISCLAIMER: This case study is for educational purpose. It is neither a recommendation of trading , any sort of tips and not to be copied in real world currency trading. This article should not be taken as recommendation to trade in USDINR. Currency trading is highly risky and hence, a good financial advisor should be consulted for making investment decisions in currency trading.

The entire process of how to use forecasting with R for currency trading is divided into the following sections---

(A) DATA PREPARATION

(B) MODEL BUILDING

(C) PREDICTION

(A) DATA PREPARATION STAGE

First and foremost, we need to load required libraries as shown in the following chunk of commands---

>library(quantmod)
>library(timeSeries)
>library(forecast)
>library(bizdays)
>library(plotly)

After loading the required libraries, we need data of USDINR from Yahoo Finance (https://in.finance.yahoo.com/). The "getSymbols" command downloads data from the source since 1st January 2007 and we save the data in the file "exr"---

>exr=getSymbols("INR=X", src="yahoo", env=NULL)

NOTE: USDINR is coded as INR=X on Yahoo Finance.

When we download data from Yahoo Finance, it loads open, close, high, low, volume and adjusted price of USDINR. We need only closing price/quote (4th Column) of USDINR---

>exr=exr[,4]

We need to remove any missing values from the data to do time series operations hence we, first, calculate the number of missing values and then remove it by using "na.omit" command---

>sum(is.na(exr))

There are 30 missing values in the data, we shall remove the missing values from data.

>exr=na.omit(exr)

>sum(is.na(exr))

Now, let us observe how the trend line looks in plot---

>plot(exr)

After data preparation, the data is used for developing the forecasting model and followed by prediction.

(B) MODEL BUILDING

If we look at the trendline, it is clearly visible that over the period of time, it changes mean and variance hence this time series is non-stationary. The "auto.arima" command of "forecast" package computes the parameters of the time series model. As we have USDINR daily price quotes, we can introduce frequency 365 to address seasonality in the model building. The major components P, D, and Q are calculated by "auto.arima" function automatically. We first convert dataframe "exr" into time series "exr.ts.365"---

>exr.ts.365=ts(exr, frequency = 365)

Now, we build time series model with "auto.arima" command with seasonality denoted by D=1 ---

>model.365=auto.arima(exr.ts.365,D=1)

Model summary and accuracy of "model.365" is as follows---

> summary(model.365)

Series: exr.ts.365
ARIMA(1,1,2)(0,1,0)[365]

Coefficients:
ar1 ma1 ma2
0.8414 -0.9773 0.1442
s.e. 0.0728 0.0741 0.0197

sigma^2 estimated as 0.194: log likelihood=-1638.48

AIC=3284.95 AICc=3284.97 BIC=3308.61

Training set error measures:
ME RMSE MAE MPE MAPE MASE ACF1
Training set 0.002054559 0.4133998 0.2762706 0.002884244 0.491483 0.05617031 0.001935873

The ARIMA model extracted is (1,1,2)(0,1,0)[365] and it's MAPE is quite low i.e.0.491483.

(C) PREDICTION

Now, we predict the next 200 day prices based on "model.365" by using forecast command---

>predict.365=forecast(model.365, h=200)

Let us see how output of "predict.365" looks like---

>predict.365

Point Forecast Lo 80 Hi 80 Lo 95 Hi 95
9.487671 70.17357 69.60910 70.73805 69.31028 71.03687
9.490411 70.02434 69.27832 70.77036 68.88340 71.16528
9.493151 69.59606 68.69539 70.49674 68.21860 70.97353
9.495890 69.64146 68.60203 70.68090 68.05179 71.23114
9.498630 69.60031 68.43319 70.76743 67.81535 71.38527
9.501370 69.75752 68.47118 71.04387 67.79023 71.72482
9.504110 69.62377 68.22505 71.02250 67.48461 71.76294
9.506849 69.80306 68.29772 71.30840 67.50084 72.10527

Ignore the first column and observe the "Point Forecast" column and we find that there is a decline in prices of USDINR.

Let us observe visually when the forecast is attached to the time series---

>plot(predict.365)

In the graph, where the extended line in blue cover is the point forecast line and grey and light grey envelopes are basically levels at 80% and 90% confidence.

After getting the forecast, we need to add series of 200 dates on USDINR trading days only. We need to first create a calendar as per India by "create.calendar" command---

>create.calendar("India/ANBIMA", holidaysANBIMA, weekdays=c("saturday", "sunday"))

Now, we calculate how many trading days between 26/12/2018 and 10/10/2019 and we find it is 200 day by running the "bizdays" command---

>bizdays("2018-12-26", "2019-10-10", "India/ANBIMA")

Now, we shall get a data frame of dates and combine it with "predict.365" to make a new data frame "predict200" containing two columns ( Dates and predict.365)---

>Dates=bizseq("2018-12-26", "2019-10-10", "India/ANBIMA")

>Datess=as.data.frame(Dates)

>dim(Datess)

>View(Datess)

>Dates=Datess[1:200,]

>predict.365=as.data.frame(predict.365)

>dim(predict.365)

>predict200=cbind.data.frame(Dates,predict.365)

Now, we can see how "predict200" data frame looks like---

>View(predict200)

We can download the csv file of "predict200" by following command---

>write.csv(predict200, "predict200.csv")

The "predict200" contains columns of Dates, Point Forecast, Lo80, Hi80, Lo90 and Hi90. For this case study, we will use only point forecast to suggest a trading system for USDINR.

Let us see the predict.200 visualization with plot_ly command---

>plot_ly(predict200, x=predict200$Dates, y=predict200$`Point Forecast`,mode="marker")

Now, we can see that until Oct 2019, the USDINR exchange rate shows range bound movement between 71.25 and 68.5. It is basically mean reversion behavior of exchange rate or in simple words, USDINR rates will be sideways between these two points.

Here, trend following strategies will be less successful and sideways market strategies will be more successful. The USDINR market in this period is more likely going to favor option writers.

Business Excellence

MENU

Friday, 28 December 2018

Caesarean or No Caesarean : Dr. SVM Knows it Well

Wednesday, 26 December 2018

How to Use Forecasting with R for Currency Trading