
Friday, 28 December 2018

Caesarean or No Caesarean : Dr. SVM Knows it Well



INTRODUCTION

These days there is a lot of controversy regarding Caesarean deliveries in India. Some allege that the medical fraternity and hospitals deliberately perform more and more Caesarean deliveries. Let us explore how data science can help with this issue and support the decision of whether or not to perform a Caesarean delivery using an objective method, i.e. machine learning.

DATASET

The dataset is a sample of 80 women who went through either a caesarean or a normal delivery (M. Zain Amin, Amir Ali. 'Performance Evaluation of Supervised Machine Learning Classifiers for Predicting Healthcare Operational Decisions'. Machine Learning for Operational Decision Making, Wavy Artificial Intelligence Research Foundation, Pakistan, 2018). The dataset is available at the following link---

https://archive.ics.uci.edu/ml/datasets/Caesarian+Section+Classification+Dataset

SUPPORT VECTOR MACHINE

We will build a classification model with a Support Vector Machine (SVM). The Support Vector Machine is an algorithm that can be used for both regression and classification, and it supports linear as well as non-linear modelling. In this case study, we will try linear and non-linear models to classify the Caesarean dataset.

Figure 1 Support Vector Machine


In Figure 1, the theoretical framework of the Support Vector Machine is represented graphically. There are two classes of labels, represented by green stars (Class 2) and orange stars (Class 1). The variable X1 is plotted on the Y-axis and X2 on the X-axis; both are independent variables. The graph shows how each observation is distributed across the two dimensions X1 and X2.

The margin boundaries are imaginary lines drawn at the maximum possible distance from each class of data, and the support vectors are the observations that lie closest to them. The red line is the hyperplane obtained after rounds of optimization; it is the plane that best separates the two classes, and the SVM algorithm searches for this best separating hyperplane. To understand SVM more intuitively, click on the link below---

https://www.youtube.com/watch?v=efR1C6CvhmE
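
To make the hyperplane idea concrete before we turn to the caesarean data, here is a minimal sketch (separate from our case study) that fits a linear SVM on two features of R's built-in iris data and plots the fitted decision boundary; support vectors are drawn as crosses---

>library(e1071)
>iris2=droplevels(subset(iris, Species!="virginica", select=c(Sepal.Length, Sepal.Width, Species)))
>toy.svm=svm(Species~., data=iris2, kernel="linear", cost=1)   # two classes, two features
>plot(toy.svm, iris2)   # shaded regions show the classes; crosses mark the support vectors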

Building a machine learning model with a Support Vector Machine comprises the following steps---

A) Data Preparation & Exploratory Data Analysis
B) SVM Model Building & Prediction

Let's begin the appointment with Dr. SVM.

A) Data Preparation & Exploratory Data Analysis

In RStudio, we begin by setting the working directory and loading the required packages.

>setwd(PATH_to_FILE)
>getwd()
>library(foreign)   # read.arff() to load the .arff dataset
>library(e1071)     # svm()
>library(caret)     # train(), confusionMatrix()
>library(ROCR)      # performance/ROC curves

The dataset contains 80 observations and 6 variables: five attributes and one dependent variable. We read the file into a data frame named "ceas"---

>ceas=read.arff("caesarian.csv.arff")

The attributes/independent variables are "age", "deliveryNumber", "BP", "deliveryTime" and "heartProblem". When we execute the "str(ceas)" command, we find that all the variables are factors. We need to convert them into numeric format for model building, and we remove each parent (original) variable after conversion---

>ceas$age=as.numeric(substr(ceas$Age,1,2))
>ceas$Age=NULL

>ceas$deliveryNumber=as.numeric(substr(ceas$`Delivery number`,1,1))
>ceas$`Delivery number`=NULL

>ceas$BP=as.numeric(substr(ceas$`Blood of Pressure`,1,1))
>ceas$`Blood of Pressure`=NULL

>ceas$deliveryTime=as.numeric(substr(ceas$`Delivery time`,1,1))
>ceas$`Delivery time`=NULL

>ceas$heartProblem=as.numeric(substr(ceas$`Heart Problem`,1,1))
>ceas$`Heart Problem`=NULL

The following summary statistics show measures of central tendency and dispersion---

>summary(ceas)



Let us understand the distribution of each variable through the following tables---

> table(ceas$Caesarian)

 0  1
34 46

The Caesarian variable has two values, 0 and 1. The value "0" represents a non-caesarean delivery and the value "1" a caesarean delivery.

> table(ceas$age)

17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 35 36 37 38 40 
 1  2  2  3  2  4  1  2  7 10  7  6  6  3  3  8  5  2  3  1  1  1 

The age variable ranges from 17 to 40.

> table(ceas$deliveryNumber)

 1  2  3  4 
41 27 10  2 

The deliveryNumber variable records how many deliveries the mother has had.

> table(ceas$BP)

 0  1  2 
20 40 20 

The variable BP indicates the level of blood pressure; the values 0, 1 and 2 represent low, normal and high respectively.

> table(ceas$deliveryTime)

 0  1  2 
46 17 17 

The deliveryTime variable represents the timing of delivery; the values 0, 1 and 2 represent timely, premature and latecomer respectively.

> table(ceas$heartProblem)

 0  1 
50 30 

The heartProblem variable indicates whether the pregnant woman has a heart problem; the values 0 and 1 represent apt and inept respectively.


Exploring the dependence of the Caesarian variable on the independent variables is important for model building. Hence, we analyse the association between each independent variable and the "Caesarian" variable with a chi-squared test. The results of the chi-squared tests are as follows---


> chisq.test(ceas$deliveryNumber, ceas$Caesarian)

Pearson's Chi-squared test

data:  ceas$deliveryNumber and ceas$Caesarian
X-squared = 2.407, df = 3, p-value = 0.4923


> chisq.test(ceas$BP, ceas$Caesarian)

Pearson's Chi-squared test

data:  ceas$BP and ceas$Caesarian
X-squared = 7.468, df = 2, p-value = 0.0239


> chisq.test(ceas$deliveryTime, ceas$Caesarian)


Pearson's Chi-squared test

data:  ceas$deliveryTime and ceas$Caesarian
X-squared = 2.6379, df = 2, p-value = 0.2674


> chisq.test(ceas$heartProblem, ceas$Caesarian)

Pearson's Chi-squared test with Yates' continuity correction

data:  ceas$heartProblem and ceas$Caesarian

X-squared = 8.5251, df = 1, p-value = 0.003503

We can observe the p-value of each chi-squared test. If the p-value is below the 0.05 significance level, we reject the null hypothesis that the two variables are independent. We can see that heartProblem and BP are significantly associated with the decision of caesarean or no caesarean, while deliveryNumber and deliveryTime show no significant association.
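
As a compact check, the four p-values can also be collected in one place (a small sketch using the same "ceas" data frame)---

>predictors=c("deliveryNumber", "BP", "deliveryTime", "heartProblem")
>sapply(predictors, function(v) chisq.test(ceas[[v]], ceas$Caesarian)$p.value)   # values below 0.05 flag a significant association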

B) SVM Model Building & Prediction

Now, we are ready for model building with the Support Vector Machine (SVM). SVM is a powerful technique with which we can build both linear and non-linear models, and we will try several of them for our case study.

So, we begin with a simple SVM model---

> model.svm=svm(Caesarian~.,data=ceas)
> summary(model.svm)

Call:
svm(formula = Caesarian ~ ., data = ceas)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.2

Number of Support Vectors:  65

 ( 29 36 )


Number of Classes:  2

Levels:
 0 1

In the model summary, we can observe that, by default, the model has selected a non-linear kernel, i.e. the "radial" SVM kernel. The model optimized at cost = 1 and gamma = 0.2, performed C-classification, and uses 65 support vectors.

Let us predict the Caesarean type for the data frame "ceas" with the SVM model and then evaluate the predictions against the actual values---

> pred=predict(model.svm, ceas)
> table(ceas$Caesarian, pred)
   pred
     0  1
  0 26  8
  1  9 37

The following confusion matrix shows the accuracy and other statistics of the prediction by the SVM model---

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 26  9
         1  8 37
                                       
               Accuracy         : 0.7875       
                 95% CI          : (0.6817, 0.8711)
    No Information Rate  : 0.575         
    P-Value [Acc > NIR] : 5.404e-05     
                                       
                  Kappa           : 0.5669       
 Mcnemar's Test P-Value : 1             
                                       
            Sensitivity : 0.7647       
            Specificity : 0.8043       
         Pos Pred Value : 0.7429       
         Neg Pred Value : 0.8222       
             Prevalence : 0.4250       
         Detection Rate : 0.3250       
   Detection Prevalence : 0.4375       
      Balanced Accuracy : 0.7845       
                                       
       'Positive' Class : 0

The overall accuracy of the model is 78.75%.

Now, we apply cross-validation together with the SVM model to see whether prediction improves. First, we try a linear SVM model---


>trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
>set.seed(3233)

>svm_Linear <- train(Caesarian ~., data = ceas, method = "svmLinear",
                    trControl=trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)


Model summary is as follows---

>svm_Linear

Support Vector Machines with Linear Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results:

  Accuracy  Kappa 
  0.651918  0.3198123

Tuning parameter 'C' was held constant at a value of 1

Let us see how it predicts the Caesarian variable---

> pred=predict(svm_Linear, newdata=ceas)
> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 28 20
         1  6 26
                                       
               Accuracy : 0.675         
                 95% CI : (0.5611, 0.7755)
    No Information Rate : 0.575         
    P-Value [Acc > NIR] : 0.04356       
                                       
                  Kappa : 0.3689       
 Mcnemar's Test P-Value : 0.01079       
                                       
            Sensitivity : 0.8235       
            Specificity : 0.5652       
         Pos Pred Value : 0.5833       
         Neg Pred Value : 0.8125       
             Prevalence : 0.4250       
         Detection Rate : 0.3500       
   Detection Prevalence : 0.6000       
      Balanced Accuracy : 0.6944       
                                       
       'Positive' Class : 0

We can see that the accuracy of this model is lower than that of the simple model.

We can fine-tune the model's hyperparameters and see whether accuracy improves. For the linear SVM, only the cost parameter C applies; we take C values ranging from 0 to 5.

>grid <- expand.grid(C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2,5))

>set.seed(3233)

>svm_Linear_Grid <- train(Caesarian ~., data = ceas, method = "svmLinear",
                           trControl=trctrl,
                           preProcess = c("center", "scale"),
                           tuneGrid = grid,
                           tuneLength = 10)

> svm_Linear_Grid
Support Vector Machines with Linear Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results across tuning parameters:

  C     Accuracy   Kappa 
  0.00        NaN        NaN
  0.01  0.5755291  0.0000000
  0.05  0.5667328  0.1259469
  0.10  0.6266534  0.2777531
  0.25  0.6212963  0.2586232
  0.50  0.6382275  0.2938355
  0.75  0.6513228  0.3195901
  1.00  0.6519180  0.3198123
  1.25  0.6566799  0.3255291
  1.50  0.6560847  0.3242667
  1.75  0.6566799  0.3234376
  2.00  0.6662037  0.3421381
  5.00  0.6657407  0.3330852

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 2.

> pred=predict(svm_Linear_Grid, newdata=ceas)

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 24 12
         1 10 34
                                       
               Accuracy : 0.725       
                 95% CI : (0.6138, 0.819)
    No Information Rate : 0.575       
    P-Value [Acc > NIR] : 0.003992     
                                       
                  Kappa : 0.4416       
 Mcnemar's Test P-Value : 0.831170     
                                       
            Sensitivity : 0.7059       
            Specificity : 0.7391       
         Pos Pred Value : 0.6667       
         Neg Pred Value : 0.7727       
             Prevalence : 0.4250       
         Detection Rate : 0.3000       
   Detection Prevalence : 0.4500       
      Balanced Accuracy : 0.7225       
                                       
       'Positive' Class : 0           
                               
We can see that the accuracy of the model has improved from 67.5% to 72.5% at C = 2.

Now, let us apply the non-linear "radial" kernel again and see whether the model prediction improves.

> set.seed(3233)
> svm_Radial <- train(Caesarian ~., data = ceas, method = "svmRadial",
                        trControl=trctrl,
                        preProcess = c("center", "scale"),
                        tuneLength = 10)

> svm_Radial

Support Vector Machines with Radial Basis Function Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results across tuning parameters:

  C       Accuracy   Kappa   
    0.25  0.6093915   0.14372559
    0.50  0.6707011   0.32367981
    1.00  0.6381614   0.25688123
    2.00  0.6212302   0.22192343
    4.00  0.5928571   0.16922655
    8.00  0.5871693   0.16469149
   16.00  0.5913360   0.17885126
   32.00  0.5556217   0.10723957
   64.00  0.5172619   0.03346007
  128.00  0.5010582  -0.00360852

Tuning parameter 'sigma' was held constant at a value of 0.2772785
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.2772785 and C = 0.5.

> pred=predict(svm_Radial, newdata=ceas)

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
   Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 25  8
         1  9 38
                                          
               Accuracy : 0.7875          
                 95% CI : (0.6817, 0.8711)
    No Information Rate : 0.575           
    P-Value [Acc > NIR] : 5.404e-05       
                                          
                  Kappa : 0.5635          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.7353          
            Specificity : 0.8261          
         Pos Pred Value : 0.7576          
         Neg Pred Value : 0.8085          
             Prevalence : 0.4250          
         Detection Rate : 0.3125          
   Detection Prevalence : 0.4125          
      Balanced Accuracy : 0.7807          
                                          
       'Positive' Class : 0


We can see that accuracy of "radial"  model is 79% which is higher than linear models. Let us see if finetuning of "radial" model improves the model or not by changing values of C and Sigma---

>grid_radial <- expand.grid(sigma = c(0,0.01, 0.02, 0.025, 0.03, 0.04,
                                     0.05, 0.06, 0.07,0.08, 0.09, 0.1, 0.25, 0.5, 0.75,0.9),
                           C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75,
                                 1, 1.5, 2,5))

>set.seed(3233)

>svm_Radial_Grid <- train(Caesarian ~., data = ceas, method = "svmRadial",
                           trControl=trctrl,
                           preProcess = c("center", "scale"),
                           tuneGrid = grid_radial,
                           tuneLength = 10)

> pred=predict(svm_Radial_Grid, newdata=ceas)

> confusionMatrix(pred, ceas$Caesarian)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 27 16
         1  7 30
                                       
               Accuracy : 0.7125       
                 95% CI : (0.6005, 0.8082)
    No Information Rate : 0.575         
    P-Value [Acc > NIR] : 0.007869     
                                       
                  Kappa : 0.4314       
 Mcnemar's Test P-Value : 0.095293     
                                       
            Sensitivity : 0.7941       
            Specificity : 0.6522       
         Pos Pred Value : 0.6279       
         Neg Pred Value : 0.8108       
             Prevalence : 0.4250       
         Detection Rate : 0.3375       
   Detection Prevalence : 0.5375       
      Balanced Accuracy: 0.7231       
                                       
       'Positive' Class : 0

We can observe that cross-validation with different values of C and sigma degraded the model accuracy. We could try many other fine-tuning alternatives and other algorithms to improve the prediction. For now, we select the SVM model with the highest classification accuracy.

Wednesday, 26 December 2018

How to Use Forecasting with R for Currency Trading



In this case study, we will see how to use forecasting with R for currency trading. If we know the forecast for the next few days, we can build a trading system around the trend line of the forecasts and make buying or selling decisions.

DISCLAIMER: This case study is for educational purposes only. It is neither a trading recommendation nor any sort of tip, and it should not be copied in real-world currency trading or taken as a recommendation to trade in USDINR. Currency trading is highly risky; hence, a good financial advisor should be consulted before making investment decisions in currency trading.

The entire process of how to use forecasting with R for currency trading is divided into the following sections---

(A) DATA PREPARATION

(B) MODEL BUILDING

(C) PREDICTION

(A) DATA PREPARATION STAGE

First and foremost, we need to load required libraries as shown in the following chunk of commands---

>library(quantmod)    # getSymbols() to download price data
>library(timeSeries)  # time series utilities
>library(forecast)    # auto.arima(), forecast()
>library(bizdays)     # business-day calendars
>library(plotly)      # interactive plots


After loading the required libraries, we need USDINR data from Yahoo Finance (https://in.finance.yahoo.com/). The "getSymbols" command downloads data from the source since 1st January 2007, and we save the data in the object "exr"---

>exr=getSymbols("INR=X", src="yahoo", env=NULL)

NOTE: USDINR is coded as INR=X on Yahoo Finance.



When we download data from Yahoo Finance, it loads the open, close, high, low, volume and adjusted price of USDINR. We need only the closing price/quote (4th column) of USDINR---

>exr=exr[,4]

We need to remove any missing values from the data before doing time series operations. Hence, we first count the number of missing values and then remove them using the "na.omit" command---

>sum(is.na(exr))

There are 30 missing values in the data, so we remove them and check again.

>exr=na.omit(exr)

>sum(is.na(exr))

Now, let us observe how the trend line looks in a plot---

>plot(exr)




After data preparation, the data is used to develop the forecasting model, followed by prediction.

(B) MODEL BUILDING

If we look at the trend line, it is clearly visible that the series changes its mean and variance over time; hence this time series is non-stationary. The "auto.arima" command of the "forecast" package computes the parameters of the time series model. As we have daily USDINR price quotes, we can set the frequency to 365 to address seasonality in the model building. The orders p, d and q (and their seasonal counterparts) are chosen by the "auto.arima" function automatically. We first convert the data frame "exr" into the time series "exr.ts.365"---

>exr.ts.365=ts(exr, frequency = 365)

Now, we build the time series model with the "auto.arima" command, with seasonal differencing set by D = 1---

>model.365=auto.arima(exr.ts.365,D=1)

The model summary and accuracy of "model.365" are as follows---

> summary(model.365)

Series: exr.ts.365
ARIMA(1,1,2)(0,1,0)[365]

Coefficients:
              ar1      ma1     ma2
         0.8414  -0.9773  0.1442
s.e.    0.0728   0.0741  0.0197

sigma^2 estimated as 0.194:  log likelihood=-1638.48

AIC=3284.95   AICc=3284.97   BIC=3308.61

Training set error measures:
                      ME      RMSE       MAE         MPE     MAPE       MASE        ACF1
Training set 0.002054559 0.4133998 0.2762706 0.002884244 0.491483 0.05617031 0.001935873

The extracted ARIMA model is (1,1,2)(0,1,0)[365] and its MAPE is quite low, i.e. 0.491483.

(C) PREDICTION

Now, we predict the next 200 days' prices based on "model.365" by using the forecast command---

>predict.365=forecast(model.365, h=200)

Let us see what the output of "predict.365" looks like---

 >predict.365
 
                       Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
 9.487671       70.17357 69.60910 70.73805 69.31028 71.03687
 9.490411       70.02434 69.27832 70.77036 68.88340 71.16528
 9.493151       69.59606 68.69539 70.49674 68.21860 70.97353
 9.495890       69.64146 68.60203 70.68090 68.05179 71.23114
 9.498630       69.60031 68.43319 70.76743 67.81535 71.38527
 9.501370       69.75752 68.47118 71.04387 67.79023 71.72482
 9.504110       69.62377 68.22505 71.02250 67.48461 71.76294
 9.506849       69.80306 68.29772 71.30840 67.50084 72.10527

Ignoring the first column and observing the "Point Forecast" column, we find that there is a decline in the price of USDINR.

Let us observe visually when the forecast is attached to the time series---

>plot(predict.365)


In the graph, the extended blue line is the point forecast, and the dark grey and light grey envelopes are the 80% and 95% prediction intervals.

After getting the forecast, we need to attach a series of 200 dates covering USDINR trading days only. We first create a calendar for India with the "create.calendar" command---

>create.calendar("India/ANBIMA", holidaysANBIMA, weekdays=c("saturday", "sunday"))

Now, we calculate how many trading days there are between 26/12/2018 and 10/10/2019, and we find it is 200 days by running the "bizdays" command---

>bizdays("2018-12-26", "2019-10-10", "India/ANBIMA")

Now, we get a data frame of dates and combine it with "predict.365" to make a new data frame "predict200" that pairs each forecast row with its trading date---

>Dates=bizseq("2018-12-26", "2019-10-10", "India/ANBIMA")

>Datess=as.data.frame(Dates)

>dim(Datess)

>View(Datess)

>Dates=Datess[1:200,]

>predict.365=as.data.frame(predict.365)

>dim(predict.365)

>predict200=cbind.data.frame(Dates,predict.365)

Now, we can see what the "predict200" data frame looks like---

>View(predict200)

We can save "predict200" to a CSV file with the following command---

>write.csv(predict200, "predict200.csv")

The "predict200" contains columns of Dates, Point Forecast, Lo80, Hi80, Lo90 and Hi90. For this case study, we will use only point forecast to suggest a trading system for USDINR.

Let us see the predict200 visualization with the plot_ly command---

>plot_ly(predict200, x=predict200$Dates, y=predict200$`Point Forecast`, mode="markers")



Now, we can see that until October 2019 the USDINR exchange rate shows range-bound movement between 71.25 and 68.5. This is basically mean-reversion behaviour of the exchange rate; in simple words, USDINR will move sideways between these two levels.

Here, trend-following strategies will be less successful and sideways-market strategies will be more successful. The USDINR market in this period is more likely to favour option writers.

Sunday, 25 November 2018

How to ML-fy Your Small Business

ML is everywhere, and for those who do not know the term, it stands for Machine Learning. Machine learning is out of the labs and is being democratized by technology, so it can be used by all and sundry, and it is not costly either: it is available at a tiny fraction of what it used to cost. The objective of this article is to understand how a small business can reap the benefits of machine learning and AI without hiccups.

Before delving into machine learning and AI for small business, we must ask why a small business should even care about it. Can't it live without it? Yes, a business can certainly survive and be profitable without AI; it is a matter of whether the company wants a competitive advantage. AI will do for small businesses what online eCommerce has done for their marketing function. eCommerce has uplifted the marketing and sales performance of hundreds of thousands of small businesses through pan-country reach and delivery systems. Now, any small business can relax as far as marketing is concerned, because eCommerce companies take your products to consumers, do the marketing, convince customers and run sales promotion schemes... almost everything, isn't it? But consider an instance: what can you do if your competitors are on the same platform, selling the same stuff, at the same or a lower price than you? You can lower the price by financing the discounts with capital, delay the break-even, or move out of the market. All these options are very scary.

ML-fying your business is the logical next step to sustain, survive or grow your business amid fierce competition. ML-fying your business means you delve into the data generated in your small business and explore opportunities for maximization and minimization. With that information, a small business can foresee the future business environment and its internal performance. If that is known, you can plan better and deal with competition better. Over and above that, your customers will feel the difference, which means more customer loyalty and more referrals.

Now we have a feeling for how ML-fication can benefit a small business in terms of minimization and maximization opportunities. Following are the broad steps in implementing machine learning in a business--


Let us understand this process of machine learning implementation with an example. Suppose there is a sweet manufacturing company, Sheetal Sweets. This family-owned company manufactures the sweet "RASGULLA". The company frequently faces either stockouts or over-inventory of raw materials, especially sugar. A stockout of sugar halts the manufacturing process and delays sales order fulfilment, while over-inventory blocks working capital, which means interest has to be paid to the bank on the overdraft facility.

Let us dive deeper into the problem. Suppose the working capital employed in the business is Rs. 30 lac and the bank charges interest at 12% p.a. If the working capital is recovered in 6 months on average, the interest burden on the company is Rs. 180000 for six months, or Rs. 360000 on a yearly basis. Even if the money is not borrowed from a bank, we can treat this as the opportunity cost of employing the working capital: if Rs. 30 lac is available with Sheetal Sweets, it should earn at least a 12% return from business operations, because a bank might offer the same rate on deposits. Let us assume that the working capital is completely absorbed in RASGULLA manufacturing.
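
As a quick check of this arithmetic in R, using the figures assumed above---

>working_capital=3000000            # Rs. 30 lac
>annual_rate=0.12                   # 12% p.a.
>working_capital*annual_rate*6/12   # interest for six months: Rs. 180000
>working_capital*annual_rate        # interest for a full year: Rs. 360000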

Stockouts and over-inventory adversely impact working capital management. In case of a stockout, Sheetal Sweets may lose customers or sales, and in case of over-inventory, the rotation of working capital is delayed. Either way it is harmful. Thus, Sheetal Sweets needs predictive capability so that it can plan its inventory optimally and minimize stockout and over-inventory situations.

PREDICTABILITY CAPABILITY DEVELOPMENT

Predictive capability can be built through various measures, but our focus in this article is to use machine learning to predict the next day's, next week's, next fortnight's and next month's sales. There is a range of machine learning models that can predict sales in coming periods, but we will focus only on the time series method. In the time series method of forecasting, data for a variable such as sales is available for past periods, and with a suitable model we try to predict its values for coming periods.

                                   Picture 2 Process of Forecasting Model Development

Picture 2 shows the development process of the forecasting model. Sheetal Sweets needs to collect its sales data for past periods; it is better to have daily sales data for at least 3 years or more. After collecting the data, it should be stored in a clean format in MS Excel, with the sales figures filled in cells corresponding to the periods they belong to.

Now, a small piece of R code can be used to build a forecasting model. (The forecasting model code in R can be requested separately by dropping the requirement in the comment box.) This model should be reviewed from time to time to see how close its predicted values are to the actual sales realization; the smaller the gap, the better the model.
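
For illustration only, a minimal sketch of such a model is given below. It assumes a hypothetical file "daily_sales.csv" containing a Sales column of daily figures and uses the forecast package, much like the currency case study above; the real model would be tuned to Sheetal Sweets' own data---

>library(forecast)
>sales=read.csv("daily_sales.csv")          # hypothetical export of daily sales from the Excel sheet
>sales.ts=ts(sales$Sales, frequency=7)      # treat the data as daily with weekly seasonality
>model.sales=auto.arima(sales.ts)           # let auto.arima pick the ARIMA orders
>forecast.sales=forecast(model.sales, h=30) # forecast the next 30 days
>plot(forecast.sales)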



Thursday, 15 November 2018

Self Learning Technologies for eCommerce




An eCommerce business cannot survive without incorporating Self Learning Technologies (SLTs), because the scope of decision making has shrunk from years and months to hours and minutes. Self Learning Technologies can be adopted with budgets as small as $10,000 and as big as $1,000,000.

There are various ways to implement SLTs, from as simple as hiring a freelancer on a temporary engagement basis, crowdsourcing or renting Machine Learning-as-a-Service (MLaaS), to as complex as buying an SLT solution, using ML cloud services or developing SLTs in-house. The allocation of human resources depends on the chosen approach to SLT implementation. For a simple SLT implementation, a data scientist and a digital marketer may suffice, but for a complex SLT implementation, a dedicated team of data engineers, data scientists, business analytics professionals and digital marketers should be deployed.

SLT deployment may take three months to a year or more, depending on what kind of SLTs are being implemented. Simple and intermediate SLT implementations may take three to nine months, and complex SLT implementations may take twelve to eighteen months. The decision of SLT implementation is very critical.

For simple to intermediate SLT implementations, crowdsourcing and MLaaS are better options; for complex SLT implementations, renting MLaaS, buying SLT solutions from Microsoft, Google, IBM, etc., or building an SLT solution in-house are suggested. If a company wants to keep its SLT intellectual property confidential and undertake a full-blown business transformation, it should build its SLT solution in-house; if it wants faster deployment of a battle-tested SLT solution without actually owning it, it can buy SLT solutions from reputed vendors.

For more detail on how to implement Self Learning Technologies (SLTs) in eCommerce, please click the link below---

https://imojo.in/e9s5oc

Thursday, 20 September 2018

CASE STUDY 4: House Price Prediction

HOUSE PRICE PREDICTION


We have a dataset of house prices along with the attributes of each house sold. It is a Kaggle competition dataset.

In the univariate analysis, we will try to understand the overall structure of the data, each variable, and some univariate hypotheses.

First, let us understand the overall structure of the data set by running following chunk of commands—

setwd("E:/r/R/kaggle/houseprice")

getwd()
## [1] "E:/r/R/kaggle/houseprice"
train=read.csv("train.csv", header = T, stringsAsFactors = TRUE)

str(train)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
##  $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
##  $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
##  $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
##  $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
##  $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
dim(train)
## [1] 1460   81
So, this dataset contains 1460 rows and 81 variables. The first column is just an index and is therefore ignored. The last variable, SalePrice, is the target variable; the other 79 variables are predictors reflecting various aspects of a house. To understand each variable better, we can calculate summary statistics with the following command chunk—

summary(train)
##        Id           MSSubClass       MSZoning     LotFrontage 
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00
##                                                  NA's   :259   
##     LotArea        Street      Alley      LotShape  LandContour
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63 
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50 
##  Median :  9478               NA's:1369   IR3: 10   Low:  36 
##  Mean   : 10517                           Reg:925   Lvl:1311 
##  3rd Qu.: 11602                                             
##  Max.   :215245                                             
##                                                             
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48
##                FR3    :   4              Edwards:100   RRAn   :  26
##                Inside :1052              Somerst: 86   PosN   :  19
##                                          Gilbert: 79   RRAe   :  11
##                                          (Other):707   (Other):  15
##    Condition2     BldgType      HouseStyle   OverallQual 
##  Norm   :1445   1Fam  :1220   1Story :726   Min.   : 1.000
##  Feedr  :   6   2fmCon:  31   2Story :445   1st Qu.: 5.000
##  Artery :   2   Duplex:  52   1.5Fin :154   Median : 6.000
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Mean   : 6.099
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000
##  PosA   :   1                 1.5Unf : 14   Max.   :10.000
##  (Other):   2                 (Other): 19                 
##   OverallCond      YearBuilt     YearRemodAdd    RoofStyle 
##  Min.   :1.000   Min.   :1872   Min.   :1950   Flat   :  13
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Gable  :1141
##  Median :5.000   Median :1973   Median :1994   Gambrel:  11
##  Mean   :5.575   Mean   :1971   Mean   :1985   Hip    : 286
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004   Mansard:   7
##  Max.   :9.000   Max.   :2010   Max.   :2010   Shed   :   2
##                                                           
##     RoofMatl     Exterior1st   Exterior2nd    MasVnrType    MasVnrArea 
##  CompShg:1434   VinylSd:515   VinylSd:504   BrkCmn : 15   Min.   :   0.0
##  Tar&Grv:  11   HdBoard:222   MetalSd:214   BrkFace:445   1st Qu.:   0.0
##  WdShngl:   6   MetalSd:220   HdBoard:207   None   :864   Median :   0.0
##  WdShake:   5   Wd Sdng:206   Wd Sdng:197   Stone  :128   Mean   : 103.7
##  ClyTile:   1   Plywood:108   Plywood:142   NA's   :  8   3rd Qu.: 166.0
##  Membran:   1   CemntBd: 61   CmentBd: 60                 Max.   :1600.0
##  (Other):   2   (Other):128   (Other):136                 NA's   :8     
##  ExterQual ExterCond  Foundation  BsmtQual   BsmtCond    BsmtExposure
##  Ex: 52    Ex:   3   BrkTil:146   Ex  :121   Fa  :  45   Av  :221 
##  Fa: 14    Fa:  28   CBlock:634   Fa  : 35   Gd  :  65   Gd  :134 
##  Gd:488    Gd: 146   PConc :647   Gd  :618   Po  :   2   Mn  :114 
##  TA:906    Po:   1   Slab  : 24   TA  :649   TA  :1311   No  :953 
##            TA:1282   Stone :  6   NA's: 37   NA's:  37   NA's: 38 
##                      Wood  :  3                                   
##                                                                   
##  BsmtFinType1   BsmtFinSF1     BsmtFinType2   BsmtFinSF2   
##  ALQ :220     Min.   :   0.0   ALQ :  19    Min.   :   0.00
##  BLQ :148     1st Qu.:   0.0   BLQ :  33    1st Qu.:   0.00
##  GLQ :418     Median : 383.5   GLQ :  14    Median :   0.00
##  LwQ : 74     Mean   : 443.6   LwQ :  46    Mean   :  46.55
##  Rec :133     3rd Qu.: 712.2   Rec :  54    3rd Qu.:   0.00
##  Unf :430     Max.   :5644.0   Unf :1256    Max.   :1474.00
##  NA's: 37                      NA's:  38                   
##    BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC CentralAir
##  Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741    N:  95 
##  1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49    Y:1365 
##  Median : 477.5   Median : 991.5   GasW :  18   Gd:241           
##  Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1           
##  3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428           
##  Max.   :2336.0   Max.   :6110.0   Wall :   4                     
##                                                                   
##  Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF 
##  FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000
##  FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000
##  FuseP:   3   Median :1087   Median :   0   Median :  0.000
##  Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845
##  SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000
##  NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000
##                                                           
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath 
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000
##                                                                 
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100   
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39   
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586   
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735   
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000           
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000           
##                                                           
##   TotRmsAbvGrd    Functional    Fireplaces    FireplaceQu   GarageType
##  Min.   : 2.000   Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6
##  1st Qu.: 5.000   Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870
##  Median : 6.000   Min1:  31   Median :1.000   Gd  :380    Basment: 19
##  Mean   : 6.518   Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88
##  3rd Qu.: 7.000   Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9
##  Max.   :14.000   Sev :   1   Max.   :3.000   NA's:690    Detchd :387
##                   Typ :1360                               NA's   : 81
##   GarageYrBlt   GarageFinish   GarageCars      GarageArea     GarageQual
##  Min.   :1900   Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3
##  1st Qu.:1961   RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48
##  Median :1980   Unf :605     Median :2.000   Median : 480.0   Gd  :  14
##  Mean   :1979   NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3
##  3rd Qu.:2002                3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311
##  Max.   :2010                Max.   :4.000   Max.   :1418.0   NA's:  81
##  NA's   :81                                                           
##  GarageCond  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch 
##  Ex  :   2   N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00
##  Fa  :  35   P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00
##  Gd  :   9   Y:1340     Median :  0.00   Median : 25.00   Median :  0.00
##  Po  :   7              Mean   : 94.24   Mean   : 46.66   Mean   : 21.95
##  TA  :1326              3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00
##  NA's:  81              Max.   :857.00   Max.   :547.00   Max.   :552.00
##                                                                         
##    X3SsnPorch      ScreenPorch        PoolArea        PoolQC 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Ex  :   2
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2
##  Median :  0.00   Median :  0.00   Median :  0.000   Gd  :   3
##  Mean   :  3.41   Mean   : 15.06   Mean   :  2.759   NA's:1453
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000           
##  Max.   :508.00   Max.   :480.00   Max.   :738.000           
##                                                               
##    Fence      MiscFeature    MiscVal             MoSold   
##  GdPrv:  59   Gar2:   2   Min.   :    0.00   Min.   : 1.000
##  GdWo :  54   Othr:   2   1st Qu.:    0.00   1st Qu.: 5.000
##  MnPrv: 157   Shed:  49   Median :    0.00   Median : 6.000
##  MnWw :  11   TenC:   1   Mean   :   43.49   Mean   : 6.322
##  NA's :1179   NA's:1406   3rd Qu.:    0.00   3rd Qu.: 8.000
##                           Max.   :15500.00   Max.   :12.000
##                                                           
##      YrSold        SaleType    SaleCondition    SalePrice   
##  Min.   :2006   WD     :1267   Abnorml: 101   Min.   : 34900
##  1st Qu.:2007   New    : 122   AdjLand:   4   1st Qu.:129975
##  Median :2008   COD    :  43   Alloca :  12   Median :163000
##  Mean   :2008   ConLD  :   9   Family :  20   Mean   :180921
##  3rd Qu.:2009   ConLI  :   5   Normal :1198   3rd Qu.:214000
##  Max.   :2010   ConLw  :   5   Partial: 125   Max.   :755000
##                 (Other):   9
Let us organize the data frame in a more structured and comprehensive way—

train.numeric=train[,c(81,4,5,27,35,37,38,39,44:47,63,67:72,76)]

train.numericfactor=train[,c(2, 18,19, 48:53,55,57,62,77)]

train.numericfactor=as.data.frame(lapply(train.numericfactor, as.factor))

PeriodBuilt=2018-train$YearBuilt
PeriodRemod=2018-train$YearRemodAdd
PeriodGarage=2018-train$GarageYrBlt
PeriodSold=2018-train$YrSold

train.period=cbind.data.frame(PeriodBuilt,PeriodGarage,PeriodRemod, PeriodSold)

train.factor=train[,c(3,6:17, 22:26,28:34,36,40:43,54,56,58,59,61, 64:66,73:75, 79,80)]

train.combined=cbind(train.numeric,train.numericfactor,train.period, train.factor)

dim(train.combined)
## [1] 1460   80
Let us understand the distribution of each numeric variable by observing their histograms—

library(e1071)

for (i in colnames(train.combined[,c(1:20)])){

  skew.measure1=round(skewness(train.combined[[i]]),digits = 2)

  skew.measure2=skewness(log(train.combined[[i]]))

  xname=names(train.combined[i])

  p1=hist(scale(train.combined[[i]], center=T, scale=T), n=100, xlab = xname, main=paste("Histogram of", xname, "with skewness", skew.measure1))

  p2=hist(log(train.combined[[i]]), n=100, xlab = xname, main=paste("Histogram of","log", xname, "with skewness", skew.measure2))


}


As we can observe in the graphs, a few variables become approximately normal after a log transformation. We retain the log transformation for these numeric variables—

train.combined$SalePrice=log(train.combined$SalePrice)

train.combined$LotArea=log(train.combined$LotArea)

train.combined$X1stFlrSF=log(train.combined$X1stFlrSF)

train.combined$GrLivArea=log(train.combined$GrLivArea)
In train.combined, the first 20 columns are numeric, columns 21 to 33 are count variables recoded as factors, columns 34 to 37 are the numeric period variables, and columns 38 to 80 are factors. Correlation is a bivariate analysis: independent variables should be screened out if they have no significant correlation with the dependent variable "SalePrice". Let us test the correlation of SalePrice with each of the numeric columns using cor.test—

for (i in 2:length(train.combined[,c(2:20)])) {
  a <- cor.test(train.combined$SalePrice, train.combined[,i])
  print(paste(colnames(train.combined)[i], " est:", a$estimate, " p=value:", a$p.value))

}
## [1] "LotFrontage  est: 0.355878470385983  p=value: 3.55822388129835e-37"
## [1] "LotArea  est: 0.39991774112559  p=value: 3.47429417639357e-57"
## [1] "MasVnrArea  est: 0.43080852419511  p=value: 1.10796556420336e-66"
## [1] "BsmtFinSF1  est: 0.372023073567088  p=value: 3.84595104338152e-49"
## [1] "BsmtFinSF2  est: 0.00483241053204665  p=value: 0.853629797314542"
## [1] "BsmtUnfSF  est: 0.22198505352521  p=value: 9.31852717731222e-18"
## [1] "TotalBsmtSF  est: 0.612133975369787  p=value: 7.53455105447948e-151"
## [1] "X1stFlrSF  est: 0.608946654903631  p=value: 6.97929348105704e-149"
## [1] "X2ndFlrSF  est: 0.319299984347205  p=value: 5.86690165149455e-36"
## [1] "LowQualFinSF  est: -0.037962803137427  p=value: 0.147104011505013"
## [1] "GrLivArea  est: 0.730254851198229  p=value: 1.5984652370953e-243"
## [1] "GarageArea  est: 0.650887555902007  p=value: 1.10625537128625e-176"
## [1] "WoodDeckSF  est: 0.334135073957513  p=value: 2.05532422562011e-39"
## [1] "OpenPorchSF  est: 0.321052972019767  p=value: 2.34462590169273e-36"
## [1] "EnclosedPorch  est: -0.149050281427676  p=value: 1.05080829167697e-08"
## [1] "X3SsnPorch  est: 0.0549002264676204  p=value: 0.0359475707337999"
## [1] "ScreenPorch  est: 0.121207604896864  p=value: 3.40897065981049e-06"
## [1] "PoolArea  est: 0.0697978060096918  p=value: 0.00763178070309401"
Now, we can visualize the correlation between SalePrice and other numeric variables through scatter plots—

for (i in 2:length(train.combined[,c(2:20)])) {

  p <- plot(scale(train.combined[,i], center = T, scale = T), train.combined$SalePrice,
            xlab =colnames(train.combined)[i], ylab = "SalePrice", main=paste("Cor between Sale Price and", colnames(train.combined)[i]))
 
  p
}


We need to impute missing values in the LotFrontage variable; here we do so through regression on the other numeric variables (remaining missing values, e.g. in MasVnrArea, are imputed later)—

mode.LotFrontage=lm(train.combined$LotFrontage~.,data=train.combined[,c(1:20)])

summary(mode.LotFrontage)
##
## Call:
## lm(formula = train.combined$LotFrontage ~ ., data = train.combined[,
##     c(1:20)])
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -73.485  -7.729  -0.465   6.828 210.494
##
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|) 
## (Intercept)   -1.876e+02  2.813e+01  -6.668 3.97e-11 ***
## SalePrice     -5.547e+00  2.366e+00  -2.344 0.019238 *
## LotArea        2.750e+01  1.229e+00  22.379  < 2e-16 ***
## MasVnrArea     7.095e-03  3.222e-03   2.202 0.027850 *
## BsmtFinSF1     5.110e-03  2.169e-03   2.356 0.018644 *
## BsmtFinSF2     5.484e-03  3.782e-03   1.450 0.147305 
## BsmtUnfSF      4.420e-03  2.125e-03   2.080 0.037743 *
## TotalBsmtSF           NA         NA      NA       NA 
## X1stFlrSF      9.269e+00  5.598e+00   1.656 0.098050 .
## X2ndFlrSF      4.036e-03  4.349e-03   0.928 0.353623 
## LowQualFinSF   1.170e-02  1.048e-02   1.116 0.264562 
## GrLivArea     -1.427e-01  6.700e+00  -0.021 0.983008 
## GarageArea     9.983e-03  3.144e-03   3.175 0.001537 **
## WoodDeckSF    -1.571e-02  4.615e-03  -3.405 0.000685 ***
## OpenPorchSF    3.904e-03  8.431e-03   0.463 0.643431 
## EnclosedPorch  4.457e-03  8.592e-03   0.519 0.604010 
## X3SsnPorch     2.988e-02  1.769e-02   1.689 0.091462 .
## ScreenPorch   -2.608e-02  9.432e-03  -2.765 0.005776 **
## PoolArea       7.337e-02  1.389e-02   5.281 1.53e-07 ***
## MiscVal       -6.408e-04  2.729e-03  -0.235 0.814387 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.49 on 1176 degrees of freedom
##   (265 observations deleted due to missingness)
## Multiple R-squared:  0.4893, Adjusted R-squared:  0.4815
## F-statistic:  62.6 on 18 and 1176 DF,  p-value: < 2.2e-16
pred=predict(mode.LotFrontage, newdata=train.combined[,c(1:20)])
## Warning in predict.lm(mode.LotFrontage, newdata = train.combined[,
## c(1:20)]): prediction from a rank-deficient fit may be misleading
train.combined$LotFrontage=ifelse(is.na(train.combined$LotFrontage),pred,
                              train.combined$LotFrontage)

train.combined$LotFrontage=round(train.combined$LotFrontage, digits = 0)
We can see that there are some numeric variables whose majority value is zero. From each of these we can derive a binary indicator variable—

train.combined$MasVnrArea_yes=ifelse(train.combined$MasVnrArea>0,1,0)

train.combined$BsmtFinSF1_yes=ifelse(train.combined$BsmtFinSF1>0,1,0)

train.combined$BsmtUnfSF_yes=ifelse(train.combined$BsmtUnfSF>0,1,0)

train.combined$X2ndFlrSF_yes=ifelse(train.combined$X2ndFlrSF>0,1,0)

train.combined$GarageArea_yes=ifelse(train.combined$GarageArea>0,1,0)

train.combined$WoodDeckSF_yes=ifelse(train.combined$WoodDeckSF>0,1,0)

train.combined$OpenPorchSF_yes=ifelse(train.combined$OpenPorchSF>0,1,0)

train.combined$EnclosedPorch_yes=ifelse(train.combined$EnclosedPorch>0,1,0)

train.combined$X3SsnPorch_yes=ifelse(train.combined$X3SsnPorch>0,1,0)

train.combined$ScreenPorch_yes=ifelse(train.combined$ScreenPorch>0,1,0)

train.combined$PoolArea_yes = ifelse(train.combined$PoolArea>0,1,0)

train.combined$MiscVal_yes=ifelse(train.combined$MiscVal>0,1,0)
ANOVA is a technique for comparing means across multiple groups. As our data frame has many categorical and dummy variables, we can check whether the group means of SalePrice differ significantly across the levels of these variables. As a first look, we compare the distributions with boxplots (an explicit ANOVA test is sketched after the boxplots)—

for(i in colnames(train.combined[,c(21:33,38:92)])){

boxplots=boxplot(train.combined$SalePrice~train.combined[[i]],xlab=colnames(train.combined[i]), main=paste("Boxplot SalePrice Vs", colnames(train.combined[i])))

boxplots

}
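
As the explicit test mentioned above, here is a minimal ANOVA sketch (an illustration, assuming the train.combined data frame built earlier): it tests whether the mean of log SalePrice differs across Neighborhood levels—

# One-way ANOVA: does mean (log) SalePrice differ across neighborhoods?
fit.aov=aov(SalePrice~Neighborhood, data=train.combined)
summary(fit.aov)

A small p-value in the summary indicates that at least one neighborhood's mean SalePrice differs from the others.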


Before model building we need to handle the remaining missing values in the data frame. To impute them, we load the missForest package (which also loads randomForest) and use randomForest::rfImpute—

library(missForest)
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
set.seed(1000)

train.withoutMV=randomForest::rfImpute(x=train.combined[,2:92],
                                       y=train.combined[,1],
                                       ntree=500,
                                       iter=5)
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  500 |  0.01983    12.44 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  500 |  0.01979    12.41 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  500 |   0.0195    12.23 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  500 |  0.01956    12.27 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  500 |  0.01985    12.45 |
train.withoutMV
## First 10 of 1,460 rows, first 8 of 92 columns:
##  train.combined[, 1] LotFrontage  LotArea MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
##             12.24769    65.00000 9.041922  196.00000        706          0       150         856
##             12.10901    80.00000 9.169518    0.00000        978          0       284        1262
##             12.31717    68.00000 9.328123  162.00000        486          0       434         920
##             11.84940    60.00000 9.164296    0.00000        216          0       540         756
##             12.42922    84.00000 9.565214  350.00000        655          0       490        1145
##             11.87060    85.00000 9.554993    0.00000        732          0        64         796
##             12.63460    75.00000 9.218705  186.00000       1369          0       317        1686
##             12.20607    77.00000 9.247829  240.00000        859         32       216        1107
##             11.77452    51.00000 8.719317    0.00000          0          0       952         952
##             11.67844    50.00000 8.911934    0.00000        851          0       140         991
Let us check how many missing values remain after imputation—

sum(is.na(train.withoutMV))
## [1] 0
The output shows that no missing values are left in the dataset train.withoutMV. Before model building, we rename the target column back to SalePrice—

train.withoutMV$SalePrice=train.withoutMV$`train.combined[, 1]`

train.withoutMV$`train.combined[, 1]`=NULL
Now it is time to build the model. First, we will try a simple random forest model—

set.seed(1001)

model.rf=randomForest(SalePrice~.,
                      data=train.withoutMV,
                      ntree=500,
                      mtry=5,
                      importance=TRUE,
                      proximity=TRUE)

pred=predict(model.rf,
             newdata=train.withoutMV,
             type="response")

train.withoutMV$pred.rf=exp(pred)                          # back-transform predictions from log scale to rupees

train.withoutMV$SalePrice=exp(train.withoutMV$SalePrice)   # back-transform the target as well

library(Metrics)

rmse.rf=rmse(train.withoutMV$SalePrice, train.withoutMV$pred.rf)

rmse.rf
## [1] 16054.31
mape.rf=Metrics::mape(train.withoutMV$SalePrice, train.withoutMV$pred.rf)

mape.rf
## [1] 0.04494797
bias.rf=bias(train.withoutMV$SalePrice, train.withoutMV$pred.rf)

bias.rf
## [1] 2760.606
percent_bias.rf=percent_bias(train.withoutMV$SalePrice, train.withoutMV$pred.rf)

percent_bias.rf
## [1] -0.0028082
plot(model.rf)


varImpPlot(model.rf)
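
The RMSE, MAPE and bias above were computed on the same rows the forest was trained on, so they are optimistic. A minimal out-of-sample check (a sketch, not part of the workflow above) could look like this—

# Hold out 20% of the rows and evaluate the forest on them
set.seed(1002)

holdout=train.withoutMV
holdout$pred.rf=NULL                      # drop the in-sample predictions added above
holdout$SalePrice=log(holdout$SalePrice)  # back to the log scale used for modelling

idx=sample(nrow(holdout), size=round(0.8*nrow(holdout)))

rf.holdout=randomForest(SalePrice~., data=holdout[idx,], ntree=500, mtry=5)

pred.holdout=exp(predict(rf.holdout, newdata=holdout[-idx,]))   # back-transform to rupees

Metrics::rmse(exp(holdout$SalePrice[-idx]), pred.holdout)

The hold-out RMSE will typically be noticeably larger than the in-sample figure reported above.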