MENU

Drop Down MenusCSS Drop Down MenuPure CSS Dropdown Menu

Friday, 28 December 2018

Caesarean or No Caesarean : Dr. SVM Knows it Well

unsplash-logoHu Chen


INTRODUCTION

These days there is a lot of controversy regarding Caesarean deliveries in India.  Some blame that medical fraternity and hospitals are deliberately doing more and more Caesarean deliveries. Let us explore how data science can help in this issue and making the decision about caesarean delivery or not with the help of an objective method i.e. machine learning.

DATASET

A sample of 80 women were collected who have either gone through caesarean delivery or normal delivery (M.Zain Amin, Amir Ali.'Performance Evaluation of Supervised Machine Learning Classifiers for Predicting Healthcare Operational Decisions'.Machine Learning for Operational Decision Making, Wavy Artificial Intelligence Research Foundation, Pakistan, 2018). This dataset is available on the following link---

https://archive.ics.uci.edu/ml/datasets/Caesarian+Section+Classification+Dataset

SUPPORT VECTOR MACHINE

We will do a classification model building with Support Vector Machine (SVM). Support Vector Machine is an algorithm which can be used for regression and classification. SVM can be used for linear and non-linear modelling. In this case study, we will try linear and nonlinear models for classification of Caesarean dataset.

Figure 1 Support Vector Machine


In Figure 2, the theoretical framework of Support Vector is graphically represented. In Figure 2, there are two types of labels represented by green (Class2) and orange stars (Class1). On Y-axis, there is X1 variable and on X-axis, X2 variable and both are independent variables. The graph shows how each observation is distributed in 2 dimensions X1 and X2.

The Support Vectors are the imaginary lines that are drawn on the basis of the maximum distance possible from each class of data. The redline is the hyperplane which is obtained after rounds of optimizations. The redline hyperplane is the best plane in classifying the two classes. SVM algorithm finds the best hyperplane for the best classification. To know more SVM intuitively, click on the link below---

https://www.youtube.com/watch?v=efR1C6CvhmE

The machine learning model building with Support Vector Machine comprises of following steps---

A) Data Preparation & Exploratory Data Analysis
B) SVM Model Building & Prediction

Let's begin the appointment with Dr. SVM.

A) Data Preparation & Exploratory Data Analysis

In RStudio, we begin with setting the working directory and packages required.

>setwd(PATH_to_FILE)
>getwd()
>library(foreign)
>library(e1071)
>library(caret)
>library(ROCR)

The dataset contains 80 observations and 6 variables. Out of six variables, five variables are attributes and one is a dependent variable. We give the name "ceas" to the file---

>ceas=read.arff("caesarian.csv.arff")

The attributes/independent variables are  "age", "deliveryNumber", "BP", "deliveryTime" and  "heartProblem". When we execute "str(ceas)" command, we find that all the variables are factor variable type. We need to convert them into the numeric format for model building and we will reove the parent variable---

>ceas$age=as.numeric(substr(ceas$Age,1,2))
>ceas$Age=NULL

>ceas$deliveryNumber=as.numeric(substr(ceas$`Delivery number`,1,1))
>ceas$`Delivery number`=NULL

>ceas$BP=as.numeric(substr(ceas$`Blood of Pressure`,1,1))
>ceas$`Blood of Pressure`=NULL

>ceas$deliveryTime=as.numeric(substr(ceas$`Delivery time`,1,1))
>ceas$`Delivery time`=NULL

>ceas$heartProblem=as.numeric(substr(ceas$`Heart Problem`,1,1))
>ceas$`Heart Problem`=NULL

Following summary statistics shows measures of central tendency and dispersion---

>summary(ceas)



let us understand distribution of each variable through the following tables---

> table(ceas$Caesarian)

 0  1
34 46

Caesarian variable has two values 0 and 1. Value "0" represents that the delivery was non-caesarian and value "1" represents that the delivery was caesarian.

> table(ceas$age)

17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 35 36 37 38 40
 1   2  2   3   2   4   1   2   7   10  7   6   6   3   3   8   5   2   3   1   1   1

In age variable, the range is 17 to 40.

> table(ceas$deliveryNumber)

 1  2    3   4 
41 27 10  2 

The deliveryNumber tells about the number of times deliver incidents with the mother. 

> table(ceas$BP)

 0  1  2 
20 40 20 

The variable BP tells about the level of Blood Pressure. The 0,1 and 2 values represent low, normal and high respectively. 

> table(ceas$deliveryTime)

 0  1    2 
46 17 17 

The deliveryTime variable represents the type of delivery. The 0,1 and 2 values represent timely, premature and latecomer respectively.

> table(ceas$heartProblem)

 0  1 
50 30 

The heartProblem variable tells about if there is heart problem to the pregnant woman or not. The 0 and 1 values represent apt and inept.


Exploring the dependencies of the Caesarian variable with other independent variables is important for model building. Hence, we analyze the impact of independent variables on the" caesarian" variable with the Chi-Squared test. The results of chi-squared tests are as follows---


> chisq.test(ceas$deliveryNumber, ceas$Caesarian)

Pearson's Chi-squared test

data:  ceas$deliveryNumber and ceas$Caesarian
X-squared = 2.407, df = 3, p-value = 0.4923


> chisq.test(ceas$BP, ceas$Caesarian)

Pearson's Chi-squared test

data:  ceas$BP and ceas$Caesarian
X-squared = 7.468, df = 2, p-value = 0.0239


> chisq.test(ceas$deliveryTime, ceas$Caesarian)


Pearson's Chi-squared test

data:  ceas$deliveryTime and ceas$Caesarian
X-squared = 2.6379, df = 2, p-value = 0.2674


> chisq.test(ceas$heartProblem, ceas$Caesarian)

Pearson's Chi-squared test with Yates' continuity correction

data:  ceas$heartProblem and ceas$Caesarian

X-squared = 8.5251, df = 1, p-value = 0.003503

We can observe the p-values of each chi.squared test. If p-value is less than 0.05 significance level, there is no evidence against the null hypothesis that both variables are independent. We can see that variables hearProblem and BP while deliveryNumber and deliveryTIme have no impact of on taking of decision of caesarean or no caesarean. 

B) SVM Model Building & Prediction

Now, we are ready for model building. We are, here, going to use Support Vector Machine (SVM). The SVM technique is a very powerful technique in which we can build linear and nonlinear models. We will try different models in SVM for our case study.

So, we begin with simple SVM model---

> model.svm=svm(Caesarian~.,data=ceas)
> summary(model.svm)

Call:
svm(formula = Caesarian ~ ., data = ceas)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.2

Number of Support Vectors:  65

 ( 29 36 )


Number of Classes:  2

Levels:
 0 1

In the model summary, we can observe that, by default, model has selected nonlinear model i.e. SVM Kernal "radial". Model also optimized at Cost=1 and gamma=0.2. Model has done C-Classification. Number of Support Vectors are 65.

Let us predict the Caesarean type for data frame "ceas" with SVM model and then we evaluate the prediction with actual values---

> pred=predict(model.svm, ceas)
> table(ceas$Caesarian, pred)
       pred
       0    1
  0   26  8
  1   9    37

Following confusion matrix shows accuracy and other statistics of prediction by SVM model---

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

                     Reference
Prediction         0   1
         0              26  9
         1              8    37
                                       
               Accuracy         : 0.7875       
                 95% CI          : (0.6817, 0.8711)
    No Information Rate  : 0.575         
    P-Value [Acc > NIR] : 5.404e-05     
                                       
                  Kappa           : 0.5669       
 Mcnemar's Test P-Value : 1             
                                       
            Sensitivity : 0.7647       
            Specificity : 0.8043       
         Pos Pred Value : 0.7429       
         Neg Pred Value : 0.8222       
             Prevalence : 0.4250       
         Detection Rate : 0.3250       
   Detection Prevalence : 0.4375       
      Balanced Accuracy : 0.7845       
                                       
       'Positive' Class : 0

Overall accuracy of the model is 78%.

Now, we apply cross-validation and SVM model together to see if there is any improvement in prediction or not. First, we try with linear SVM model---


>trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
>set.seed(3233)

>svm_Linear <- train(caes ~., data = ceas, method = "svmLinear",
                    trControl=trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)


Model summary is as follows---

>svm_Linear

Support Vector Machines with Linear Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results:

  Accuracy  Kappa 
  0.651918  0.3198123

Tuning parameter 'C' was held constant at a value of 1

Let us see, how it predicts the Caesarean variable---

> pred=predict(svm_Linear, newdata=ceas)
> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

                      Reference
Prediction         0    1
         0               28  20
         1               6    26
                                       
               Accuracy : 0.675         
                 95% CI : (0.5611, 0.7755)
    No Information Rate : 0.575         
    P-Value [Acc > NIR] : 0.04356       
                                       
                  Kappa : 0.3689       
 Mcnemar's Test P-Value : 0.01079       
                                       
            Sensitivity : 0.8235       
            Specificity : 0.5652       
         Pos Pred Value : 0.5833       
         Neg Pred Value : 0.8125       
             Prevalence : 0.4250       
         Detection Rate : 0.3500       
   Detection Prevalence : 0.6000       
      Balanced Accuracy : 0.6944       
                                       
       'Positive' Class : 0

We can see that the model accuracy of this model is lower than the simple model.

We can fine-tune model with hyperparameters C and gamma of the SVM model and let us see if model accuracy improves. We take the range of C values from 0.01 to 2.5.

>grid <- expand.grid(C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2,5))

>set.seed(3233)

>svm_Linear_Grid <- train(caes ~., data = ceas, method = "svmLinear",
                           trControl=trctrl,
                           preProcess = c("center", "scale"),
                           tuneGrid = grid,
                           tuneLength = 10)

>svm_Linear_Grid

> svm_Linear_Grid
Support Vector Machines with Linear Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results across tuning parameters:

  C     Accuracy   Kappa 
  0.00        NaN        NaN
  0.01  0.5755291  0.0000000
  0.05  0.5667328  0.1259469
  0.10  0.6266534  0.2777531
  0.25  0.6212963  0.2586232
  0.50  0.6382275  0.2938355
  0.75  0.6513228  0.3195901
  1.00  0.6519180  0.3198123
  1.25  0.6566799  0.3255291
  1.50  0.6560847  0.3242667
  1.75  0.6566799  0.3234376
  2.00  0.6662037  0.3421381
  5.00  0.6657407  0.3330852

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 2.

> pred=predict(svm_Linear_Grid, newdata=ceas)

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

                      Reference
Prediction         0      1
         0               24    12
         1               10    34
                                       
               Accuracy : 0.725       
                 95% CI : (0.6138, 0.819)
    No Information Rate : 0.575       
    P-Value [Acc > NIR] : 0.003992     
                                       
                  Kappa : 0.4416       
 Mcnemar's Test P-Value : 0.831170     
                                       
            Sensitivity : 0.7059       
            Specificity : 0.7391       
         Pos Pred Value : 0.6667       
         Neg Pred Value : 0.7727       
             Prevalence : 0.4250       
         Detection Rate : 0.3000       
   Detection Prevalence : 0.4500       
      Balanced Accuracy : 0.7225       
                                       
       'Positive' Class : 0           
                               
We can see that accuracy of the model has improved from 67% to 72% at C=2.

Now, let us apply nonlinear "radial" model again and see if model prediction improves or not.

> set.seed(3233)
> svm_Radial <- train(Caesarian ~., data = ceas, method = "svmRadial",
                        trControl=trctrl,
                        preProcess = c("center", "scale"),
                        tuneLength = 10)

> svm_Radial

Support Vector Machines with Radial Basis Function Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results across tuning parameters:

  C       Accuracy   Kappa   
    0.25  0.6093915   0.14372559
    0.50  0.6707011   0.32367981
    1.00  0.6381614   0.25688123
    2.00  0.6212302   0.22192343
    4.00  0.5928571   0.16922655
    8.00  0.5871693   0.16469149
   16.00  0.5913360   0.17885126
   32.00  0.5556217   0.10723957
   64.00  0.5172619   0.03346007
  128.00  0.5010582  -0.00360852

Tuning parameter 'sigma' was held constant at a value of 0.2772785
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.2772785 and C = 0.5.

> pred=predict(svm_Radial, newdata=ceas)

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
   Confusion Matrix and Statistics

                      Reference
Prediction        0       1
         0              25    8
         1              9    38
                                          
               Accuracy : 0.7875          
                 95% CI : (0.6817, 0.8711)
    No Information Rate : 0.575           
    P-Value [Acc > NIR] : 5.404e-05       
                                          
                  Kappa : 0.5635          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.7353          
            Specificity : 0.8261          
         Pos Pred Value : 0.7576          
         Neg Pred Value : 0.8085          
             Prevalence : 0.4250          
         Detection Rate : 0.3125          
   Detection Prevalence : 0.4125          
      Balanced Accuracy : 0.7807          
                                          
       'Positive' Class : 0


We can see that accuracy of "radial"  model is 79% which is higher than linear models. Let us see if finetuning of "radial" model improves the model or not by changing values of C and Sigma---

>grid_radial <- expand.grid(sigma = c(0,0.01, 0.02, 0.025, 0.03, 0.04,
                                     0.05, 0.06, 0.07,0.08, 0.09, 0.1, 0.25, 0.5, 0.75,0.9),
                           C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75,
                                 1, 1.5, 2,5))

>set.seed(3233)

>svm_Radial_Grid <- train(Caesarian ~., data = ceas, method = "svmRadial",
                           trControl=trctrl,
                           preProcess = c("center", "scale"),
                           tuneGrid = grid_radial,
                           tuneLength = 10)

> pred=predict(svm_Radial_Grid, newdata=ceas)

> confusionMatrix(pred, ceas$Caesarian)
Confusion Matrix and Statistics

                     Reference
Prediction       0      1
         0             27    16
         1             7      30
                                       
               Accuracy : 0.7125       
                 95% CI : (0.6005, 0.8082)
    No Information Rate : 0.575         
    P-Value [Acc > NIR] : 0.007869     
                                       
                  Kappa : 0.4314       
 Mcnemar's Test P-Value : 0.095293     
                                       
            Sensitivity : 0.7941       
            Specificity : 0.6522       
         Pos Pred Value : 0.6279       
         Neg Pred Value : 0.8108       
             Prevalence : 0.4250       
         Detection Rate : 0.3375       
   Detection Prevalence : 0.5375       
      Balanced Accuracy: 0.7231       
                                       
       'Positive' Class : 0

We can observe that cross-validation with different values of C and sigma degraded the model accuracy. We can try many other finetuning alternatives and other algorithms to improve the prediction. For now, we can select svm model with highest classification accuracy. 

Wednesday, 26 December 2018

How to Use Forecasting with R for Currency Trading



In this case study, we will see how to use forecasting with R for currency trading. If we know the forecasting for the next few days, we can build our trading system around the trend line of forecasts and can make buying or selling decisions.

DISCLAIMER: This case study is for educational purpose. It is neither a recommendation of trading , any sort of tips and not to be copied in real world currency trading. This article should not be taken as recommendation to trade in USDINR. Currency trading is highly risky and hence, a good financial advisor should be consulted for making investment decisions in currency trading. 

The entire process of how to use forecasting with R for currency trading is divided into the following sections---

(A) DATA PREPARATION

(B) MODEL BUILDING

(C) PREDICTION

(A) DATA PREPARATION STAGE

First and foremost, we need to load required libraries as shown in the following chunk of commands---

>library(quantmod)
>library(timeSeries)
>library(forecast)
>library(bizdays)
>library(plotly)


After loading the required libraries, we need data of USDINR from  Yahoo Finance (https://in.finance.yahoo.com/). The "getSymbols" command downloads data from the source since 1st January 2007 and we save the data in the file "exr"---

>exr=getSymbols("INR=X", src="yahoo", env=NULL)

NOTE: USDINR is coded as INR=X on Yahoo Finance.



When we download data from Yahoo Finance, it loads open, close, high, low, volume and adjusted price of USDINR. We need only closing price/quote (4th Column) of USDINR---

>exr=exr[,4]

We need to remove any missing values from the data to do time series operations hence we, first, calculate the number of missing values and then remove it by using "na.omit" command---

>sum(is.na(exr))

There are 30 missing values in the data, we shall remove the missing values from data.

>exr=na.omit(exr)

>sum(is.na(exr))

Now, let us observe how the trend line looks in plot---

>plot(exr)




After data preparation, the data is used for developing the forecasting model and followed by prediction.

(B) MODEL BUILDING

If we look at the trendline, it is clearly visible that over the period of time, it changes mean and variance hence this time series is non-stationary. The "auto.arima" command of "forecast" package computes the parameters of the time series model. As we have USDINR daily price quotes, we can introduce frequency 365 to address seasonality in the model building. The major components P, D, and Q are calculated by "auto.arima" function automatically. We first convert dataframe "exr" into time series "exr.ts.365"---

>exr.ts.365=ts(exr, frequency = 365)

Now, we build time series model with "auto.arima" command with seasonality denoted by D=1 ---

>model.365=auto.arima(exr.ts.365,D=1)

Model summary and accuracy of "model.365" is as follows---

> summary(model.365)

Series: exr.ts.365
ARIMA(1,1,2)(0,1,0)[365]

Coefficients:
              ar1      ma1     ma2
         0.8414  -0.9773  0.1442
s.e.    0.0728   0.0741  0.0197

sigma^2 estimated as 0.194:  log likelihood=-1638.48

AIC=3284.95   AICc=3284.97   BIC=3308.61

Training set error measures:
                    ME                RMSE       MAE         MPE              MAPE    MASE          ACF1
Training set 0.002054559 0.4133998 0.2762706 0.002884244 0.491483 0.05617031 0.001935873

The ARIMA model extracted is (1,1,2)(0,1,0)[365] and it's MAPE is quite low i.e.0.491483.

(C) PREDICTION

Now, we predict the next 200 day prices based on "model.365" by using forecast command---

>predict.365=forecast(model.365, h=200)

Let us see how output of "predict.365" looks like---

 >predict.365
 
                       Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
 9.487671       70.17357 69.60910 70.73805 69.31028 71.03687
 9.490411       70.02434 69.27832 70.77036 68.88340 71.16528
 9.493151       69.59606 68.69539 70.49674 68.21860 70.97353
 9.495890       69.64146 68.60203 70.68090 68.05179 71.23114
 9.498630       69.60031 68.43319 70.76743 67.81535 71.38527
 9.501370       69.75752 68.47118 71.04387 67.79023 71.72482
 9.504110       69.62377 68.22505 71.02250 67.48461 71.76294
 9.506849       69.80306 68.29772 71.30840 67.50084 72.10527

Ignore the first column and observe the "Point Forecast" column and we find that there is a decline in prices of USDINR.

Let us observe visually when the forecast is attached to the time series---

>plot(predict.365)


In the graph, where the extended line in blue cover is the point forecast line and grey and light grey envelopes are basically levels at 80% and 90% confidence.

After getting the forecast, we need to add series of 200 dates on USDINR trading days only. We need to first create a calendar as per India by "create.calendar" command---

>create.calendar("India/ANBIMA", holidaysANBIMA, weekdays=c("saturday", "sunday"))

Now, we calculate how many trading days between 26/12/2018 and 10/10/2019 and we find it is 200 day by running the "bizdays" command---

>bizdays("2018-12-26", "2019-10-10", "India/ANBIMA")

Now, we shall get a data frame of dates and combine it with "predict.365" to make a new data frame "predict200" containing two columns ( Dates and predict.365)---

>Dates=bizseq("2018-12-26", "2019-10-10", "India/ANBIMA")

>Datess=as.data.frame(Dates)

>dim(Datess)

>View(Datess)

>Dates=Datess[1:200,]

>predict.365=as.data.frame(predict.365)

>dim(predict.365)

>predict200=cbind.data.frame(Dates,predict.365)

Now, we can see how "predict200" data frame looks like---

>View(predict200)

We can download the csv file of "predict200" by following command---

>write.csv(predict200, "predict200.csv")

The "predict200" contains columns of Dates, Point Forecast, Lo80, Hi80, Lo90 and Hi90. For this case study, we will use only point forecast to suggest a trading system for USDINR.

Let us see the predict.200 visualization with plot_ly command---

>plot_ly(predict200, x=predict200$Dates, y=predict200$`Point Forecast`,mode="marker")



Now, we can see that until Oct 2019, the USDINR exchange rate shows range bound movement between 71.25 and 68.5. It is basically mean reversion behavior of exchange rate or in simple words, USDINR rates will be sideways between these two points. 

Here, trend following strategies will be less successful and sideways market strategies will be more successful. The USDINR market in this period is more likely going to favor option writers.