
Friday, 28 December 2018

Caesarean or No Caesarean : Dr. SVM Knows it Well



INTRODUCTION

These days there is a lot of controversy regarding Caesarean deliveries in India. Some allege that the medical fraternity and hospitals deliberately perform more and more Caesarean deliveries. Let us explore how data science can help with this issue and support the decision of whether or not to perform a Caesarean delivery using an objective method, i.e. machine learning.

DATASET

The dataset is a sample of 80 women who went through either a caesarean or a normal delivery (M. Zain Amin, Amir Ali. 'Performance Evaluation of Supervised Machine Learning Classifiers for Predicting Healthcare Operational Decisions'. Machine Learning for Operational Decision Making, Wavy Artificial Intelligence Research Foundation, Pakistan, 2018). The dataset is available at the following link---

https://archive.ics.uci.edu/ml/datasets/Caesarian+Section+Classification+Dataset

SUPPORT VECTOR MACHINE

We will build a classification model with a Support Vector Machine (SVM). The Support Vector Machine is an algorithm that can be used for both regression and classification, and it supports linear as well as non-linear modelling. In this case study, we will try linear and non-linear models to classify the Caesarean dataset.

Figure 1 Support Vector Machine


In Figure 1, the theoretical framework of the Support Vector Machine is represented graphically. There are two classes of labels, represented by green stars (Class 2) and orange stars (Class 1). The variable X1 is plotted on the Y-axis and X2 on the X-axis; both are independent variables. The graph shows how each observation is distributed across the two dimensions X1 and X2.

The margin boundaries are imaginary lines drawn at the maximum possible distance from each class of data, and the support vectors are the observations that lie closest to them. The red line is the hyperplane obtained after rounds of optimization; it is the plane that best separates the two classes, and the SVM algorithm searches for this best separating hyperplane. To understand SVM more intuitively, click on the link below---

https://www.youtube.com/watch?v=efR1C6CvhmE
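
To make the hyperplane idea concrete before we turn to the caesarean data, here is a minimal sketch (separate from our case study) that fits a linear SVM on two features of R's built-in iris data and plots the fitted decision boundary; support vectors are drawn as crosses---

>library(e1071)
>iris2=droplevels(subset(iris, Species!="virginica", select=c(Sepal.Length, Sepal.Width, Species)))
>toy.svm=svm(Species~., data=iris2, kernel="linear", cost=1)   # two classes, two features
>plot(toy.svm, iris2)   # shaded regions show the classes; crosses mark the support vectors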

Building a machine learning model with a Support Vector Machine comprises the following steps---

A) Data Preparation & Exploratory Data Analysis
B) SVM Model Building & Prediction

Let's begin the appointment with Dr. SVM.

A) Data Preparation & Exploratory Data Analysis

In RStudio, we begin by setting the working directory and loading the required packages.

>setwd(PATH_to_FILE)
>getwd()
>library(foreign)   # read.arff() to load the .arff dataset
>library(e1071)     # svm()
>library(caret)     # train(), confusionMatrix()
>library(ROCR)      # performance/ROC curves

The dataset contains 80 observations and 6 variables: five attributes and one dependent variable. We read the file into a data frame named "ceas"---

>ceas=read.arff("caesarian.csv.arff")

The attributes/independent variables are "age", "deliveryNumber", "BP", "deliveryTime" and "heartProblem". When we execute the "str(ceas)" command, we find that all the variables are factors. We need to convert them into numeric format for model building, and we remove each parent (original) variable after conversion---

>ceas$age=as.numeric(substr(ceas$Age,1,2))
>ceas$Age=NULL

>ceas$deliveryNumber=as.numeric(substr(ceas$`Delivery number`,1,1))
>ceas$`Delivery number`=NULL

>ceas$BP=as.numeric(substr(ceas$`Blood of Pressure`,1,1))
>ceas$`Blood of Pressure`=NULL

>ceas$deliveryTime=as.numeric(substr(ceas$`Delivery time`,1,1))
>ceas$`Delivery time`=NULL

>ceas$heartProblem=as.numeric(substr(ceas$`Heart Problem`,1,1))
>ceas$`Heart Problem`=NULL

The following summary statistics show measures of central tendency and dispersion---

>summary(ceas)



Let us understand the distribution of each variable through the following tables---

> table(ceas$Caesarian)

 0  1
34 46

The Caesarian variable has two values, 0 and 1. The value "0" represents a non-caesarean delivery and the value "1" a caesarean delivery.

> table(ceas$age)

17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 35 36 37 38 40 
 1  2  2  3  2  4  1  2  7 10  7  6  6  3  3  8  5  2  3  1  1  1 

The age variable ranges from 17 to 40.

> table(ceas$deliveryNumber)

 1  2  3  4 
41 27 10  2 

The deliveryNumber variable records how many deliveries the mother has had.

> table(ceas$BP)

 0  1  2 
20 40 20 

The variable BP indicates the level of blood pressure; the values 0, 1 and 2 represent low, normal and high respectively.

> table(ceas$deliveryTime)

 0  1  2 
46 17 17 

The deliveryTime variable represents the timing of delivery; the values 0, 1 and 2 represent timely, premature and latecomer respectively.

> table(ceas$heartProblem)

 0  1 
50 30 

The heartProblem variable indicates whether the pregnant woman has a heart problem; the values 0 and 1 represent apt and inept respectively.


Exploring the dependence of the Caesarian variable on the independent variables is important for model building. Hence, we analyse the association between each independent variable and the "Caesarian" variable with a chi-squared test. The results of the chi-squared tests are as follows---


> chisq.test(ceas$deliveryNumber, ceas$Caesarian)

Pearson's Chi-squared test

data:  ceas$deliveryNumber and ceas$Caesarian
X-squared = 2.407, df = 3, p-value = 0.4923


> chisq.test(ceas$BP, ceas$Caesarian)

Pearson's Chi-squared test

data:  ceas$BP and ceas$Caesarian
X-squared = 7.468, df = 2, p-value = 0.0239


> chisq.test(ceas$deliveryTime, ceas$Caesarian)


Pearson's Chi-squared test

data:  ceas$deliveryTime and ceas$Caesarian
X-squared = 2.6379, df = 2, p-value = 0.2674


> chisq.test(ceas$heartProblem, ceas$Caesarian)

Pearson's Chi-squared test with Yates' continuity correction

data:  ceas$heartProblem and ceas$Caesarian

X-squared = 8.5251, df = 1, p-value = 0.003503

We can observe the p-value of each chi-squared test. If the p-value is below the 0.05 significance level, we reject the null hypothesis that the two variables are independent. We can see that heartProblem and BP are significantly associated with the decision of caesarean or no caesarean, while deliveryNumber and deliveryTime show no significant association.
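
As a compact check, the four p-values can also be collected in one place (a small sketch using the same "ceas" data frame)---

>predictors=c("deliveryNumber", "BP", "deliveryTime", "heartProblem")
>sapply(predictors, function(v) chisq.test(ceas[[v]], ceas$Caesarian)$p.value)   # values below 0.05 flag a significant association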

B) SVM Model Building & Prediction

Now, we are ready for model building with the Support Vector Machine (SVM). SVM is a powerful technique with which we can build both linear and non-linear models, and we will try several of them for our case study.

So, we begin with a simple SVM model---

> model.svm=svm(Caesarian~.,data=ceas)
> summary(model.svm)

Call:
svm(formula = Caesarian ~ ., data = ceas)

Parameters:
   SVM-Type:  C-classification
 SVM-Kernel:  radial
       cost:  1
      gamma:  0.2

Number of Support Vectors:  65

 ( 29 36 )


Number of Classes:  2

Levels:
 0 1

In the model summary, we can observe that, by default, the model has selected a non-linear kernel, i.e. the "radial" SVM kernel. The model optimized at cost = 1 and gamma = 0.2, performed C-classification, and uses 65 support vectors.

Let us predict the Caesarean type for the data frame "ceas" with the SVM model and then evaluate the predictions against the actual values---

> pred=predict(model.svm, ceas)
> table(ceas$Caesarian, pred)
   pred
     0  1
  0 26  8
  1  9 37

The following confusion matrix shows the accuracy and other statistics of the prediction by the SVM model---

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 26  9
         1  8 37
                                       
               Accuracy         : 0.7875       
                 95% CI          : (0.6817, 0.8711)
    No Information Rate  : 0.575         
    P-Value [Acc > NIR] : 5.404e-05     
                                       
                  Kappa           : 0.5669       
 Mcnemar's Test P-Value : 1             
                                       
            Sensitivity : 0.7647       
            Specificity : 0.8043       
         Pos Pred Value : 0.7429       
         Neg Pred Value : 0.8222       
             Prevalence : 0.4250       
         Detection Rate : 0.3250       
   Detection Prevalence : 0.4375       
      Balanced Accuracy : 0.7845       
                                       
       'Positive' Class : 0

The overall accuracy of the model is 78.75%.

Now, we apply cross-validation together with the SVM model to see whether prediction improves. First, we try a linear SVM model---


>trctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 3)
>set.seed(3233)

>svm_Linear <- train(Caesarian ~., data = ceas, method = "svmLinear",
                    trControl=trctrl,
                    preProcess = c("center", "scale"),
                    tuneLength = 10)


Model summary is as follows---

>svm_Linear

Support Vector Machines with Linear Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results:

  Accuracy  Kappa 
  0.651918  0.3198123

Tuning parameter 'C' was held constant at a value of 1

Let us see how it predicts the Caesarian variable---

> pred=predict(svm_Linear, newdata=ceas)
> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 28 20
         1  6 26
                                       
               Accuracy : 0.675         
                 95% CI : (0.5611, 0.7755)
    No Information Rate : 0.575         
    P-Value [Acc > NIR] : 0.04356       
                                       
                  Kappa : 0.3689       
 Mcnemar's Test P-Value : 0.01079       
                                       
            Sensitivity : 0.8235       
            Specificity : 0.5652       
         Pos Pred Value : 0.5833       
         Neg Pred Value : 0.8125       
             Prevalence : 0.4250       
         Detection Rate : 0.3500       
   Detection Prevalence : 0.6000       
      Balanced Accuracy : 0.6944       
                                       
       'Positive' Class : 0

We can see that the accuracy of this model is lower than that of the simple model.

We can fine-tune the model's hyperparameters and see whether accuracy improves. For the linear SVM, only the cost parameter C applies; we take C values ranging from 0 to 5.

>grid <- expand.grid(C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.25, 1.5, 1.75, 2,5))

>set.seed(3233)

>svm_Linear_Grid <- train(Caesarian ~., data = ceas, method = "svmLinear",
                           trControl=trctrl,
                           preProcess = c("center", "scale"),
                           tuneGrid = grid,
                           tuneLength = 10)

> svm_Linear_Grid
Support Vector Machines with Linear Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results across tuning parameters:

  C     Accuracy   Kappa 
  0.00        NaN        NaN
  0.01  0.5755291  0.0000000
  0.05  0.5667328  0.1259469
  0.10  0.6266534  0.2777531
  0.25  0.6212963  0.2586232
  0.50  0.6382275  0.2938355
  0.75  0.6513228  0.3195901
  1.00  0.6519180  0.3198123
  1.25  0.6566799  0.3255291
  1.50  0.6560847  0.3242667
  1.75  0.6566799  0.3234376
  2.00  0.6662037  0.3421381
  5.00  0.6657407  0.3330852

Accuracy was used to select the optimal model using the largest value.
The final value used for the model was C = 2.

> pred=predict(svm_Linear_Grid, newdata=ceas)

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 24 12
         1 10 34
                                       
               Accuracy : 0.725       
                 95% CI : (0.6138, 0.819)
    No Information Rate : 0.575       
    P-Value [Acc > NIR] : 0.003992     
                                       
                  Kappa : 0.4416       
 Mcnemar's Test P-Value : 0.831170     
                                       
            Sensitivity : 0.7059       
            Specificity : 0.7391       
         Pos Pred Value : 0.6667       
         Neg Pred Value : 0.7727       
             Prevalence : 0.4250       
         Detection Rate : 0.3000       
   Detection Prevalence : 0.4500       
      Balanced Accuracy : 0.7225       
                                       
       'Positive' Class : 0           
                               
We can see that the accuracy of the model has improved from 67.5% to 72.5% at C = 2.

Now, let us apply the non-linear "radial" kernel again and see whether the model prediction improves.

> set.seed(3233)
> svm_Radial <- train(Caesarian ~., data = ceas, method = "svmRadial",
                        trControl=trctrl,
                        preProcess = c("center", "scale"),
                        tuneLength = 10)

> svm_Radial

Support Vector Machines with Radial Basis Function Kernel

80 samples
 5 predictor
 2 classes: '0', '1'

Pre-processing: centered (5), scaled (5)
Resampling: Cross-Validated (10 fold, repeated 3 times)
Summary of sample sizes: 71, 72, 72, 72, 73, 73, ...
Resampling results across tuning parameters:

  C       Accuracy   Kappa   
    0.25  0.6093915   0.14372559
    0.50  0.6707011   0.32367981
    1.00  0.6381614   0.25688123
    2.00  0.6212302   0.22192343
    4.00  0.5928571   0.16922655
    8.00  0.5871693   0.16469149
   16.00  0.5913360   0.17885126
   32.00  0.5556217   0.10723957
   64.00  0.5172619   0.03346007
  128.00  0.5010582  -0.00360852

Tuning parameter 'sigma' was held constant at a value of 0.2772785
Accuracy was used to select the optimal model using the largest value.
The final values used for the model were sigma = 0.2772785 and C = 0.5.

> pred=predict(svm_Radial, newdata=ceas)

> confusionMatrix(as.factor(pred), as.factor(ceas$Caesarian))
   Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 25  8
         1  9 38
                                          
               Accuracy : 0.7875          
                 95% CI : (0.6817, 0.8711)
    No Information Rate : 0.575           
    P-Value [Acc > NIR] : 5.404e-05       
                                          
                  Kappa : 0.5635          
 Mcnemar's Test P-Value : 1               
                                          
            Sensitivity : 0.7353          
            Specificity : 0.8261          
         Pos Pred Value : 0.7576          
         Neg Pred Value : 0.8085          
             Prevalence : 0.4250          
         Detection Rate : 0.3125          
   Detection Prevalence : 0.4125          
      Balanced Accuracy : 0.7807          
                                          
       'Positive' Class : 0


We can see that accuracy of "radial"  model is 79% which is higher than linear models. Let us see if finetuning of "radial" model improves the model or not by changing values of C and Sigma---

>grid_radial <- expand.grid(sigma = c(0,0.01, 0.02, 0.025, 0.03, 0.04,
                                     0.05, 0.06, 0.07,0.08, 0.09, 0.1, 0.25, 0.5, 0.75,0.9),
                           C = c(0,0.01, 0.05, 0.1, 0.25, 0.5, 0.75,
                                 1, 1.5, 2,5))

>set.seed(3233)

>svm_Radial_Grid <- train(Caesarian ~., data = ceas, method = "svmRadial",
                           trControl=trctrl,
                           preProcess = c("center", "scale"),
                           tuneGrid = grid_radial,
                           tuneLength = 10)

> pred=predict(svm_Radial_Grid, newdata=ceas)

> confusionMatrix(pred, ceas$Caesarian)
Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 27 16
         1  7 30
                                       
               Accuracy : 0.7125       
                 95% CI : (0.6005, 0.8082)
    No Information Rate : 0.575         
    P-Value [Acc > NIR] : 0.007869     
                                       
                  Kappa : 0.4314       
 Mcnemar's Test P-Value : 0.095293     
                                       
            Sensitivity : 0.7941       
            Specificity : 0.6522       
         Pos Pred Value : 0.6279       
         Neg Pred Value : 0.8108       
             Prevalence : 0.4250       
         Detection Rate : 0.3375       
   Detection Prevalence : 0.5375       
      Balanced Accuracy: 0.7231       
                                       
       'Positive' Class : 0

We can observe that cross-validation with different values of C and sigma degraded the model accuracy. We could try many other fine-tuning alternatives and other algorithms to improve the prediction. For now, we select the SVM model with the highest classification accuracy.

Wednesday, 26 December 2018

How to Use Forecasting with R for Currency Trading



In this case study, we will see how to use forecasting with R for currency trading. If we know the forecast for the next few days, we can build a trading system around the trend line of the forecasts and make buying or selling decisions.

DISCLAIMER: This case study is for educational purposes only. It is neither a trading recommendation nor any sort of tip, and it should not be copied in real-world currency trading or taken as a recommendation to trade in USDINR. Currency trading is highly risky; hence, a good financial advisor should be consulted before making investment decisions in currency trading.

The entire process of how to use forecasting with R for currency trading is divided into the following sections---

(A) DATA PREPARATION

(B) MODEL BUILDING

(C) PREDICTION

(A) DATA PREPARATION STAGE

First and foremost, we need to load required libraries as shown in the following chunk of commands---

>library(quantmod)    # getSymbols() to download price data
>library(timeSeries)  # time series utilities
>library(forecast)    # auto.arima(), forecast()
>library(bizdays)     # business-day calendars
>library(plotly)      # interactive plots


After loading the required libraries, we need USDINR data from Yahoo Finance (https://in.finance.yahoo.com/). The "getSymbols" command downloads data from the source since 1st January 2007, and we save the data in the object "exr"---

>exr=getSymbols("INR=X", src="yahoo", env=NULL)

NOTE: USDINR is coded as INR=X on Yahoo Finance.



When we download data from Yahoo Finance, it loads the open, close, high, low, volume and adjusted price of USDINR. We need only the closing price/quote (4th column) of USDINR---

>exr=exr[,4]

We need to remove any missing values from the data before doing time series operations. Hence, we first count the number of missing values and then remove them using the "na.omit" command---

>sum(is.na(exr))

There are 30 missing values in the data, so we remove them and check again.

>exr=na.omit(exr)

>sum(is.na(exr))

Now, let us observe how the trend line looks in a plot---

>plot(exr)




After data preparation, the data is used to develop the forecasting model, followed by prediction.

(B) MODEL BUILDING

If we look at the trend line, it is clearly visible that the series changes its mean and variance over time; hence this time series is non-stationary. The "auto.arima" command of the "forecast" package computes the parameters of the time series model. As we have daily USDINR price quotes, we can set the frequency to 365 to address seasonality in the model building. The orders p, d and q (and their seasonal counterparts) are chosen by the "auto.arima" function automatically. We first convert the data frame "exr" into the time series "exr.ts.365"---

>exr.ts.365=ts(exr, frequency = 365)

Now, we build the time series model with the "auto.arima" command, with seasonal differencing set by D = 1---

>model.365=auto.arima(exr.ts.365,D=1)

The model summary and accuracy of "model.365" are as follows---

> summary(model.365)

Series: exr.ts.365
ARIMA(1,1,2)(0,1,0)[365]

Coefficients:
              ar1      ma1     ma2
         0.8414  -0.9773  0.1442
s.e.    0.0728   0.0741  0.0197

sigma^2 estimated as 0.194:  log likelihood=-1638.48

AIC=3284.95   AICc=3284.97   BIC=3308.61

Training set error measures:
                      ME      RMSE       MAE         MPE     MAPE       MASE        ACF1
Training set 0.002054559 0.4133998 0.2762706 0.002884244 0.491483 0.05617031 0.001935873

The extracted ARIMA model is (1,1,2)(0,1,0)[365] and its MAPE is quite low, i.e. 0.491483.

(C) PREDICTION

Now, we predict the next 200 days' prices based on "model.365" by using the forecast command---

>predict.365=forecast(model.365, h=200)

Let us see what the output of "predict.365" looks like---

 >predict.365
 
                       Point Forecast    Lo 80    Hi 80    Lo 95    Hi 95
 9.487671       70.17357 69.60910 70.73805 69.31028 71.03687
 9.490411       70.02434 69.27832 70.77036 68.88340 71.16528
 9.493151       69.59606 68.69539 70.49674 68.21860 70.97353
 9.495890       69.64146 68.60203 70.68090 68.05179 71.23114
 9.498630       69.60031 68.43319 70.76743 67.81535 71.38527
 9.501370       69.75752 68.47118 71.04387 67.79023 71.72482
 9.504110       69.62377 68.22505 71.02250 67.48461 71.76294
 9.506849       69.80306 68.29772 71.30840 67.50084 72.10527

Ignoring the first column and observing the "Point Forecast" column, we find that there is a decline in the price of USDINR.

Let us observe visually when the forecast is attached to the time series---

>plot(predict.365)


In the graph, the extended blue line is the point forecast, and the dark grey and light grey envelopes are the 80% and 95% prediction intervals.

After getting the forecast, we need to attach a series of 200 dates covering USDINR trading days only. We first create a calendar for India with the "create.calendar" command---

>create.calendar("India/ANBIMA", holidaysANBIMA, weekdays=c("saturday", "sunday"))

Now, we calculate how many trading days there are between 26/12/2018 and 10/10/2019, and we find it is 200 days by running the "bizdays" command---

>bizdays("2018-12-26", "2019-10-10", "India/ANBIMA")

Now, we get a data frame of dates and combine it with "predict.365" to make a new data frame "predict200" that pairs each forecast row with its trading date---

>Dates=bizseq("2018-12-26", "2019-10-10", "India/ANBIMA")

>Datess=as.data.frame(Dates)

>dim(Datess)

>View(Datess)

>Dates=Datess[1:200,]

>predict.365=as.data.frame(predict.365)

>dim(predict.365)

>predict200=cbind.data.frame(Dates,predict.365)

Now, we can see what the "predict200" data frame looks like---

>View(predict200)

We can save "predict200" to a CSV file with the following command---

>write.csv(predict200, "predict200.csv")

The "predict200" contains columns of Dates, Point Forecast, Lo80, Hi80, Lo90 and Hi90. For this case study, we will use only point forecast to suggest a trading system for USDINR.

Let us see the predict200 visualization with the plot_ly command---

>plot_ly(predict200, x=predict200$Dates, y=predict200$`Point Forecast`, mode="markers")



Now, we can see that until October 2019 the USDINR exchange rate shows range-bound movement between 71.25 and 68.5. This is basically mean-reversion behaviour of the exchange rate; in simple words, USDINR will move sideways between these two levels.

Here, trend-following strategies will be less successful and sideways-market strategies will be more successful. The USDINR market in this period is more likely to favour option writers.

Sunday, 25 November 2018

How to ML-fy Your Small Business

ML is everywhere, and for those who do not know the term, it stands for Machine Learning. Machine learning is out of the labs and is being democratized by technology, so it can be used by all and sundry, and it is not costly either: it is available at a tiny fraction of what it used to cost. The objective of this article is to understand how a small business can reap the benefits of machine learning and AI without hiccups.

Before delving into machine learning and AI for small business, we must ask why a small business should even care about it. Can't it live without it? Yes, a business can certainly survive and be profitable without AI; it is a matter of whether the company wants a competitive advantage. AI will do for small businesses what online eCommerce has done for their marketing function. eCommerce has uplifted the marketing and sales performance of hundreds of thousands of small businesses through pan-country reach and delivery systems. Now, any small business can relax as far as marketing is concerned, because eCommerce companies take your products to consumers, do the marketing, convince customers and run sales promotion schemes... almost everything, isn't it? But consider an instance: what can you do if your competitors are on the same platform, selling the same stuff, at the same or a lower price than you? You can lower the price by financing the discounts with capital, delay the break-even, or move out of the market. All these options are very scary.

ML-fying your business is the logical next step to sustain, survive or grow your business amid fierce competition. ML-fying your business means you delve into the data generated in your small business and explore opportunities for maximization and minimization. With that information, a small business can foresee the future business environment and its internal performance. If that is known, you can plan better and deal with competition better. Over and above that, your customers will feel the difference, which means more customer loyalty and more referrals.

Now we have a feeling for how ML-fication can benefit a small business in terms of minimization and maximization opportunities. Following are the broad steps in implementing machine learning in a business--


Let us understand this process of machine learning implementation with an example. Suppose there is a sweet manufacturing company, Sheetal Sweets. This family-owned company manufactures the sweet "RASGULLA". The company frequently faces either stockouts or over-inventory of raw materials, especially sugar. A stockout of sugar halts the manufacturing process and delays sales order fulfilment, while over-inventory blocks working capital, which means interest has to be paid to the bank on the overdraft facility.

Let us dive deeper into the problem. Suppose the working capital employed in the business is Rs. 30 lac and the bank charges interest at 12% p.a. If the working capital is recovered in 6 months on average, the interest burden on the company is Rs. 180000 for six months, or Rs. 360000 on a yearly basis. Even if the money is not borrowed from a bank, we can treat this as the opportunity cost of employing the working capital: if Rs. 30 lac is available with Sheetal Sweets, it should earn at least a 12% return from business operations, because a bank might offer the same rate on deposits. Let us assume that the working capital is completely absorbed in RASGULLA manufacturing.
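
As a quick check of this arithmetic in R, using the figures assumed above---

>working_capital=3000000            # Rs. 30 lac
>annual_rate=0.12                   # 12% p.a.
>working_capital*annual_rate*6/12   # interest for six months: Rs. 180000
>working_capital*annual_rate        # interest for a full year: Rs. 360000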

Stockouts and over-inventory adversely impact working capital management. In case of a stockout, Sheetal Sweets may lose customers or sales, and in case of over-inventory, the rotation of working capital is delayed. Either way it is harmful. Thus, Sheetal Sweets needs predictive capability so that it can plan its inventory optimally and minimize stockout and over-inventory situations.

PREDICTABILITY CAPABILITY DEVELOPMENT

Predictive capability can be built through various measures, but our focus in this article is to use machine learning to predict the next day's, next week's, next fortnight's and next month's sales. There is a range of machine learning models that can predict sales in coming periods, but we will focus only on the time series method. In the time series method of forecasting, data for a variable such as sales is available for past periods, and with a suitable model we try to predict its values for coming periods.

                                   Picture 2 Process of Forecasting Model Development

Picture 2 shows the development process of the forecasting model. Sheetal Sweets needs to collect its sales data for past periods; it is better to have daily sales data for at least 3 years or more. After collecting the data, it should be stored in a clean format in MS Excel, with the sales figures filled in cells corresponding to the periods they belong to.

Now, a small piece of R code can be used to build a forecasting model. (The forecasting model code in R can be requested separately by dropping the requirement in the comment box.) This model should be reviewed from time to time to see how close its predicted values are to the actual sales realization; the smaller the gap, the better the model.
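
For illustration only, a minimal sketch of such a model is given below. It assumes a hypothetical file "daily_sales.csv" containing a Sales column of daily figures and uses the forecast package, much like the currency case study above; the real model would be tuned to Sheetal Sweets' own data---

>library(forecast)
>sales=read.csv("daily_sales.csv")          # hypothetical export of daily sales from the Excel sheet
>sales.ts=ts(sales$Sales, frequency=7)      # treat the data as daily with weekly seasonality
>model.sales=auto.arima(sales.ts)           # let auto.arima pick the ARIMA orders
>forecast.sales=forecast(model.sales, h=30) # forecast the next 30 days
>plot(forecast.sales)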



Thursday, 15 November 2018

Self Learning Technologies for eCommerce




An eCommerce business cannot survive without incorporating Self Learning Technologies (SLTs), because the scope of decision making has shrunk from years and months to hours and minutes. Self Learning Technologies can be adopted with budgets as small as $10,000 and as big as $1,000,000.

There are various ways to implement SLTs, from as simple as hiring a freelancer on a temporary engagement basis, crowdsourcing or renting Machine Learning-as-a-Service (MLaaS), to as complex as buying an SLT solution, using ML cloud services or developing SLTs in-house. The allocation of human resources depends on the chosen approach to SLT implementation. For a simple SLT implementation, a data scientist and a digital marketer may suffice, but for a complex SLT implementation, a dedicated team of data engineers, data scientists, business analytics professionals and digital marketers should be deployed.

SLT deployment may take three months to a year or more, depending on what kind of SLTs are being implemented. Simple and intermediate SLT implementations may take three to nine months, and complex SLT implementations may take twelve to eighteen months. The decision of SLT implementation is very critical.

For simple to intermediate SLT implementations, crowdsourcing and MLaaS are better options; for complex SLT implementations, renting MLaaS, buying SLT solutions from Microsoft, Google, IBM, etc., or building an SLT solution in-house are suggested. If a company wants to keep its SLT intellectual property confidential and undertake a full-blown business transformation, it should build its SLT solution in-house; if it wants faster deployment of a battle-tested SLT solution without actually owning it, it can buy SLT solutions from reputed vendors.

For more detail on how to implement Self Learning Technologies (SLTs) in eCommerce, please click the link below---

https://imojo.in/e9s5oc

Thursday, 20 September 2018

CASE STUDY 4: House Price Prediction

HOUSE PRICE PREDICTION


We have a dataset of house prices along with the attributes of each house sold. It is a Kaggle competition dataset.

In the univariate analysis, we will try to understand the overall structure of the data, each variable, and some univariate hypotheses.

First, let us understand the overall structure of the data set by running following chunk of commands—

setwd("E:/r/R/kaggle/houseprice")

getwd()
## [1] "E:/r/R/kaggle/houseprice"
train=read.csv("train.csv", header = T, stringsAsFactors = TRUE)

str(train)
## 'data.frame':    1460 obs. of  81 variables:
##  $ Id           : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ MSSubClass   : int  60 20 60 70 60 50 20 60 50 190 ...
##  $ MSZoning     : Factor w/ 5 levels "C (all)","FV",..: 4 4 4 4 4 4 4 4 5 4 ...
##  $ LotFrontage  : int  65 80 68 60 84 85 75 NA 51 50 ...
##  $ LotArea      : int  8450 9600 11250 9550 14260 14115 10084 10382 6120 7420 ...
##  $ Street       : Factor w/ 2 levels "Grvl","Pave": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Alley        : Factor w/ 2 levels "Grvl","Pave": NA NA NA NA NA NA NA NA NA NA ...
##  $ LotShape     : Factor w/ 4 levels "IR1","IR2","IR3",..: 4 4 1 1 1 1 4 1 4 4 ...
##  $ LandContour  : Factor w/ 4 levels "Bnk","HLS","Low",..: 4 4 4 4 4 4 4 4 4 4 ...
##  $ Utilities    : Factor w/ 2 levels "AllPub","NoSeWa": 1 1 1 1 1 1 1 1 1 1 ...
##  $ LotConfig    : Factor w/ 5 levels "Corner","CulDSac",..: 5 3 5 1 3 5 5 1 5 1 ...
##  $ LandSlope    : Factor w/ 3 levels "Gtl","Mod","Sev": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Neighborhood : Factor w/ 25 levels "Blmngtn","Blueste",..: 6 25 6 7 14 12 21 17 18 4 ...
##  $ Condition1   : Factor w/ 9 levels "Artery","Feedr",..: 3 2 3 3 3 3 3 5 1 1 ...
##  $ Condition2   : Factor w/ 8 levels "Artery","Feedr",..: 3 3 3 3 3 3 3 3 3 1 ...
##  $ BldgType     : Factor w/ 5 levels "1Fam","2fmCon",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ HouseStyle   : Factor w/ 8 levels "1.5Fin","1.5Unf",..: 6 3 6 6 6 1 3 6 1 2 ...
##  $ OverallQual  : int  7 6 7 7 8 5 8 7 7 5 ...
##  $ OverallCond  : int  5 8 5 5 5 5 5 6 5 6 ...
##  $ YearBuilt    : int  2003 1976 2001 1915 2000 1993 2004 1973 1931 1939 ...
##  $ YearRemodAdd : int  2003 1976 2002 1970 2000 1995 2005 1973 1950 1950 ...
##  $ RoofStyle    : Factor w/ 6 levels "Flat","Gable",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ RoofMatl     : Factor w/ 8 levels "ClyTile","CompShg",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ Exterior1st  : Factor w/ 15 levels "AsbShng","AsphShn",..: 13 9 13 14 13 13 13 7 4 9 ...
##  $ Exterior2nd  : Factor w/ 16 levels "AsbShng","AsphShn",..: 14 9 14 16 14 14 14 7 16 9 ...
##  $ MasVnrType   : Factor w/ 4 levels "BrkCmn","BrkFace",..: 2 3 2 3 2 3 4 4 3 3 ...
##  $ MasVnrArea   : int  196 0 162 0 350 0 186 240 0 0 ...
##  $ ExterQual    : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 4 3 4 3 4 4 4 ...
##  $ ExterCond    : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ Foundation   : Factor w/ 6 levels "BrkTil","CBlock",..: 3 2 3 1 3 6 3 2 1 1 ...
##  $ BsmtQual     : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 3 3 4 3 3 1 3 4 4 ...
##  $ BsmtCond     : Factor w/ 4 levels "Fa","Gd","Po",..: 4 4 4 2 4 4 4 4 4 4 ...
##  $ BsmtExposure : Factor w/ 4 levels "Av","Gd","Mn",..: 4 2 3 4 1 4 1 3 4 4 ...
##  $ BsmtFinType1 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 3 1 3 1 3 3 3 1 6 3 ...
##  $ BsmtFinSF1   : int  706 978 486 216 655 732 1369 859 0 851 ...
##  $ BsmtFinType2 : Factor w/ 6 levels "ALQ","BLQ","GLQ",..: 6 6 6 6 6 6 6 2 6 6 ...
##  $ BsmtFinSF2   : int  0 0 0 0 0 0 0 32 0 0 ...
##  $ BsmtUnfSF    : int  150 284 434 540 490 64 317 216 952 140 ...
##  $ TotalBsmtSF  : int  856 1262 920 756 1145 796 1686 1107 952 991 ...
##  $ Heating      : Factor w/ 6 levels "Floor","GasA",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ HeatingQC    : Factor w/ 5 levels "Ex","Fa","Gd",..: 1 1 1 3 1 1 1 1 3 1 ...
##  $ CentralAir   : Factor w/ 2 levels "N","Y": 2 2 2 2 2 2 2 2 2 2 ...
##  $ Electrical   : Factor w/ 5 levels "FuseA","FuseF",..: 5 5 5 5 5 5 5 5 2 5 ...
##  $ X1stFlrSF    : int  856 1262 920 961 1145 796 1694 1107 1022 1077 ...
##  $ X2ndFlrSF    : int  854 0 866 756 1053 566 0 983 752 0 ...
##  $ LowQualFinSF : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ GrLivArea    : int  1710 1262 1786 1717 2198 1362 1694 2090 1774 1077 ...
##  $ BsmtFullBath : int  1 0 1 1 1 1 1 1 0 1 ...
##  $ BsmtHalfBath : int  0 1 0 0 0 0 0 0 0 0 ...
##  $ FullBath     : int  2 2 2 1 2 1 2 2 2 1 ...
##  $ HalfBath     : int  1 0 1 0 1 1 0 1 0 0 ...
##  $ BedroomAbvGr : int  3 3 3 3 4 1 3 3 2 2 ...
##  $ KitchenAbvGr : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ KitchenQual  : Factor w/ 4 levels "Ex","Fa","Gd",..: 3 4 3 3 3 4 3 4 4 4 ...
##  $ TotRmsAbvGrd : int  8 6 6 7 9 5 7 7 8 5 ...
##  $ Functional   : Factor w/ 7 levels "Maj1","Maj2",..: 7 7 7 7 7 7 7 7 3 7 ...
##  $ Fireplaces   : int  0 1 1 1 1 0 1 2 2 2 ...
##  $ FireplaceQu  : Factor w/ 5 levels "Ex","Fa","Gd",..: NA 5 5 3 5 NA 3 5 5 5 ...
##  $ GarageType   : Factor w/ 6 levels "2Types","Attchd",..: 2 2 2 6 2 2 2 2 6 2 ...
##  $ GarageYrBlt  : int  2003 1976 2001 1998 2000 1993 2004 1973 1931 1939 ...
##  $ GarageFinish : Factor w/ 3 levels "Fin","RFn","Unf": 2 2 2 3 2 3 2 2 3 2 ...
##  $ GarageCars   : int  2 2 2 3 3 2 2 2 2 1 ...
##  $ GarageArea   : int  548 460 608 642 836 480 636 484 468 205 ...
##  $ GarageQual   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 2 3 ...
##  $ GarageCond   : Factor w/ 5 levels "Ex","Fa","Gd",..: 5 5 5 5 5 5 5 5 5 5 ...
##  $ PavedDrive   : Factor w/ 3 levels "N","P","Y": 3 3 3 3 3 3 3 3 3 3 ...
##  $ WoodDeckSF   : int  0 298 0 0 192 40 255 235 90 0 ...
##  $ OpenPorchSF  : int  61 0 42 35 84 30 57 204 0 4 ...
##  $ EnclosedPorch: int  0 0 0 272 0 0 0 228 205 0 ...
##  $ X3SsnPorch   : int  0 0 0 0 0 320 0 0 0 0 ...
##  $ ScreenPorch  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolArea     : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PoolQC       : Factor w/ 3 levels "Ex","Fa","Gd": NA NA NA NA NA NA NA NA NA NA ...
##  $ Fence        : Factor w/ 4 levels "GdPrv","GdWo",..: NA NA NA NA NA 3 NA NA NA NA ...
##  $ MiscFeature  : Factor w/ 4 levels "Gar2","Othr",..: NA NA NA NA NA 3 NA 3 NA NA ...
##  $ MiscVal      : int  0 0 0 0 0 700 0 350 0 0 ...
##  $ MoSold       : int  2 5 9 2 12 10 8 11 4 1 ...
##  $ YrSold       : int  2008 2007 2008 2006 2008 2009 2007 2009 2008 2008 ...
##  $ SaleType     : Factor w/ 9 levels "COD","Con","ConLD",..: 9 9 9 9 9 9 9 9 9 9 ...
##  $ SaleCondition: Factor w/ 6 levels "Abnorml","AdjLand",..: 5 5 5 1 5 5 5 5 1 5 ...
##  $ SalePrice    : int  208500 181500 223500 140000 250000 143000 307000 200000 129900 118000 ...
dim(train)
## [1] 1460   81
So, this dataset contains 1460 rows and 81 variables. The first column is just an index and is therefore ignored. The last variable, SalePrice, is the target variable; the other 79 variables are predictors reflecting various aspects of a house. To understand each variable better, we can calculate summary statistics with the following command chunk—

summary(train)
##        Id           MSSubClass       MSZoning     LotFrontage 
##  Min.   :   1.0   Min.   : 20.0   C (all):  10   Min.   : 21.00
##  1st Qu.: 365.8   1st Qu.: 20.0   FV     :  65   1st Qu.: 59.00
##  Median : 730.5   Median : 50.0   RH     :  16   Median : 69.00
##  Mean   : 730.5   Mean   : 56.9   RL     :1151   Mean   : 70.05
##  3rd Qu.:1095.2   3rd Qu.: 70.0   RM     : 218   3rd Qu.: 80.00
##  Max.   :1460.0   Max.   :190.0                  Max.   :313.00
##                                                  NA's   :259   
##     LotArea        Street      Alley      LotShape  LandContour
##  Min.   :  1300   Grvl:   6   Grvl:  50   IR1:484   Bnk:  63 
##  1st Qu.:  7554   Pave:1454   Pave:  41   IR2: 41   HLS:  50 
##  Median :  9478               NA's:1369   IR3: 10   Low:  36 
##  Mean   : 10517                           Reg:925   Lvl:1311 
##  3rd Qu.: 11602                                             
##  Max.   :215245                                             
##                                                             
##   Utilities      LotConfig    LandSlope   Neighborhood   Condition1
##  AllPub:1459   Corner : 263   Gtl:1382   NAmes  :225   Norm   :1260
##  NoSeWa:   1   CulDSac:  94   Mod:  65   CollgCr:150   Feedr  :  81
##                FR2    :  47   Sev:  13   OldTown:113   Artery :  48
##                FR3    :   4              Edwards:100   RRAn   :  26
##                Inside :1052              Somerst: 86   PosN   :  19
##                                          Gilbert: 79   RRAe   :  11
##                                          (Other):707   (Other):  15
##    Condition2     BldgType      HouseStyle   OverallQual 
##  Norm   :1445   1Fam  :1220   1Story :726   Min.   : 1.000
##  Feedr  :   6   2fmCon:  31   2Story :445   1st Qu.: 5.000
##  Artery :   2   Duplex:  52   1.5Fin :154   Median : 6.000
##  PosN   :   2   Twnhs :  43   SLvl   : 65   Mean   : 6.099
##  RRNn   :   2   TwnhsE: 114   SFoyer : 37   3rd Qu.: 7.000
##  PosA   :   1                 1.5Unf : 14   Max.   :10.000
##  (Other):   2                 (Other): 19                 
##   OverallCond      YearBuilt     YearRemodAdd    RoofStyle 
##  Min.   :1.000   Min.   :1872   Min.   :1950   Flat   :  13
##  1st Qu.:5.000   1st Qu.:1954   1st Qu.:1967   Gable  :1141
##  Median :5.000   Median :1973   Median :1994   Gambrel:  11
##  Mean   :5.575   Mean   :1971   Mean   :1985   Hip    : 286
##  3rd Qu.:6.000   3rd Qu.:2000   3rd Qu.:2004   Mansard:   7
##  Max.   :9.000   Max.   :2010   Max.   :2010   Shed   :   2
##                                                           
##     RoofMatl     Exterior1st   Exterior2nd    MasVnrType    MasVnrArea 
##  CompShg:1434   VinylSd:515   VinylSd:504   BrkCmn : 15   Min.   :   0.0
##  Tar&Grv:  11   HdBoard:222   MetalSd:214   BrkFace:445   1st Qu.:   0.0
##  WdShngl:   6   MetalSd:220   HdBoard:207   None   :864   Median :   0.0
##  WdShake:   5   Wd Sdng:206   Wd Sdng:197   Stone  :128   Mean   : 103.7
##  ClyTile:   1   Plywood:108   Plywood:142   NA's   :  8   3rd Qu.: 166.0
##  Membran:   1   CemntBd: 61   CmentBd: 60                 Max.   :1600.0
##  (Other):   2   (Other):128   (Other):136                 NA's   :8     
##  ExterQual ExterCond  Foundation  BsmtQual   BsmtCond    BsmtExposure
##  Ex: 52    Ex:   3   BrkTil:146   Ex  :121   Fa  :  45   Av  :221 
##  Fa: 14    Fa:  28   CBlock:634   Fa  : 35   Gd  :  65   Gd  :134 
##  Gd:488    Gd: 146   PConc :647   Gd  :618   Po  :   2   Mn  :114 
##  TA:906    Po:   1   Slab  : 24   TA  :649   TA  :1311   No  :953 
##            TA:1282   Stone :  6   NA's: 37   NA's:  37   NA's: 38 
##                      Wood  :  3                                   
##                                                                   
##  BsmtFinType1   BsmtFinSF1     BsmtFinType2   BsmtFinSF2   
##  ALQ :220     Min.   :   0.0   ALQ :  19    Min.   :   0.00
##  BLQ :148     1st Qu.:   0.0   BLQ :  33    1st Qu.:   0.00
##  GLQ :418     Median : 383.5   GLQ :  14    Median :   0.00
##  LwQ : 74     Mean   : 443.6   LwQ :  46    Mean   :  46.55
##  Rec :133     3rd Qu.: 712.2   Rec :  54    3rd Qu.:   0.00
##  Unf :430     Max.   :5644.0   Unf :1256    Max.   :1474.00
##  NA's: 37                      NA's:  38                   
##    BsmtUnfSF       TotalBsmtSF      Heating     HeatingQC CentralAir
##  Min.   :   0.0   Min.   :   0.0   Floor:   1   Ex:741    N:  95 
##  1st Qu.: 223.0   1st Qu.: 795.8   GasA :1428   Fa: 49    Y:1365 
##  Median : 477.5   Median : 991.5   GasW :  18   Gd:241           
##  Mean   : 567.2   Mean   :1057.4   Grav :   7   Po:  1           
##  3rd Qu.: 808.0   3rd Qu.:1298.2   OthW :   2   TA:428           
##  Max.   :2336.0   Max.   :6110.0   Wall :   4                     
##                                                                   
##  Electrical     X1stFlrSF      X2ndFlrSF     LowQualFinSF 
##  FuseA:  94   Min.   : 334   Min.   :   0   Min.   :  0.000
##  FuseF:  27   1st Qu.: 882   1st Qu.:   0   1st Qu.:  0.000
##  FuseP:   3   Median :1087   Median :   0   Median :  0.000
##  Mix  :   1   Mean   :1163   Mean   : 347   Mean   :  5.845
##  SBrkr:1334   3rd Qu.:1391   3rd Qu.: 728   3rd Qu.:  0.000
##  NA's :   1   Max.   :4692   Max.   :2065   Max.   :572.000
##                                                           
##    GrLivArea     BsmtFullBath     BsmtHalfBath        FullBath 
##  Min.   : 334   Min.   :0.0000   Min.   :0.00000   Min.   :0.000
##  1st Qu.:1130   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:1.000
##  Median :1464   Median :0.0000   Median :0.00000   Median :2.000
##  Mean   :1515   Mean   :0.4253   Mean   :0.05753   Mean   :1.565
##  3rd Qu.:1777   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:2.000
##  Max.   :5642   Max.   :3.0000   Max.   :2.00000   Max.   :3.000
##                                                                 
##     HalfBath       BedroomAbvGr    KitchenAbvGr   KitchenQual
##  Min.   :0.0000   Min.   :0.000   Min.   :0.000   Ex:100   
##  1st Qu.:0.0000   1st Qu.:2.000   1st Qu.:1.000   Fa: 39   
##  Median :0.0000   Median :3.000   Median :1.000   Gd:586   
##  Mean   :0.3829   Mean   :2.866   Mean   :1.047   TA:735   
##  3rd Qu.:1.0000   3rd Qu.:3.000   3rd Qu.:1.000           
##  Max.   :2.0000   Max.   :8.000   Max.   :3.000           
##                                                           
##   TotRmsAbvGrd    Functional    Fireplaces    FireplaceQu   GarageType
##  Min.   : 2.000   Maj1:  14   Min.   :0.000   Ex  : 24    2Types :  6
##  1st Qu.: 5.000   Maj2:   5   1st Qu.:0.000   Fa  : 33    Attchd :870
##  Median : 6.000   Min1:  31   Median :1.000   Gd  :380    Basment: 19
##  Mean   : 6.518   Min2:  34   Mean   :0.613   Po  : 20    BuiltIn: 88
##  3rd Qu.: 7.000   Mod :  15   3rd Qu.:1.000   TA  :313    CarPort:  9
##  Max.   :14.000   Sev :   1   Max.   :3.000   NA's:690    Detchd :387
##                   Typ :1360                               NA's   : 81
##   GarageYrBlt   GarageFinish   GarageCars      GarageArea     GarageQual
##  Min.   :1900   Fin :352     Min.   :0.000   Min.   :   0.0   Ex  :   3
##  1st Qu.:1961   RFn :422     1st Qu.:1.000   1st Qu.: 334.5   Fa  :  48
##  Median :1980   Unf :605     Median :2.000   Median : 480.0   Gd  :  14
##  Mean   :1979   NA's: 81     Mean   :1.767   Mean   : 473.0   Po  :   3
##  3rd Qu.:2002                3rd Qu.:2.000   3rd Qu.: 576.0   TA  :1311
##  Max.   :2010                Max.   :4.000   Max.   :1418.0   NA's:  81
##  NA's   :81                                                           
##  GarageCond  PavedDrive   WoodDeckSF      OpenPorchSF     EnclosedPorch 
##  Ex  :   2   N:  90     Min.   :  0.00   Min.   :  0.00   Min.   :  0.00
##  Fa  :  35   P:  30     1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.00
##  Gd  :   9   Y:1340     Median :  0.00   Median : 25.00   Median :  0.00
##  Po  :   7              Mean   : 94.24   Mean   : 46.66   Mean   : 21.95
##  TA  :1326              3rd Qu.:168.00   3rd Qu.: 68.00   3rd Qu.:  0.00
##  NA's:  81              Max.   :857.00   Max.   :547.00   Max.   :552.00
##                                                                         
##    X3SsnPorch      ScreenPorch        PoolArea        PoolQC 
##  Min.   :  0.00   Min.   :  0.00   Min.   :  0.000   Ex  :   2
##  1st Qu.:  0.00   1st Qu.:  0.00   1st Qu.:  0.000   Fa  :   2
##  Median :  0.00   Median :  0.00   Median :  0.000   Gd  :   3
##  Mean   :  3.41   Mean   : 15.06   Mean   :  2.759   NA's:1453
##  3rd Qu.:  0.00   3rd Qu.:  0.00   3rd Qu.:  0.000           
##  Max.   :508.00   Max.   :480.00   Max.   :738.000           
##                                                               
##    Fence      MiscFeature    MiscVal             MoSold   
##  GdPrv:  59   Gar2:   2   Min.   :    0.00   Min.   : 1.000
##  GdWo :  54   Othr:   2   1st Qu.:    0.00   1st Qu.: 5.000
##  MnPrv: 157   Shed:  49   Median :    0.00   Median : 6.000
##  MnWw :  11   TenC:   1   Mean   :   43.49   Mean   : 6.322
##  NA's :1179   NA's:1406   3rd Qu.:    0.00   3rd Qu.: 8.000
##                           Max.   :15500.00   Max.   :12.000
##                                                           
##      YrSold        SaleType    SaleCondition    SalePrice   
##  Min.   :2006   WD     :1267   Abnorml: 101   Min.   : 34900
##  1st Qu.:2007   New    : 122   AdjLand:   4   1st Qu.:129975
##  Median :2008   COD    :  43   Alloca :  12   Median :163000
##  Mean   :2008   ConLD  :   9   Family :  20   Mean   :180921
##  3rd Qu.:2009   ConLI  :   5   Normal :1198   3rd Qu.:214000
##  Max.   :2010   ConLw  :   5   Partial: 125   Max.   :755000
##                 (Other):   9
Let us organize the data frame in a more structured and comprehensive way—

train.numeric=train[,c(81,4,5,27,35,37,38,39,44:47,63,67:72,76)]

train.numericfactor=train[,c(2, 18,19, 48:53,55,57,62,77)]

train.numericfactor=as.data.frame(lapply(train.numericfactor, as.factor))

PeriodBuilt=2018-train$YearBuilt
PeriodRemod=2018-train$YearRemodAdd
PeriodGarage=2018-train$GarageYrBlt
PeriodSold=2018-train$YrSold

train.period=cbind.data.frame(PeriodBuilt,PeriodGarage,PeriodRemod, PeriodSold)

train.factor=train[,c(3,6:17, 22:26,28:34,36,40:43,54,56,58,59,61, 64:66,73:75, 79,80)]

train.combined=cbind(train.numeric,train.numericfactor,train.period, train.factor)

dim(train.combined)
## [1] 1460   80
Let us understand the distribution of each numeric variable by observing their histograms—

library(e1071)

for (i in colnames(train.combined[,c(1:20)])){

  skew.measure1=round(skewness(train.combined[[i]]),digits = 2)

  skew.measure2=skewness(log(train.combined[[i]]))

  xname=names(train.combined[i])

  p1=hist(scale(train.combined[[i]], center=T, scale=T), n=100, xlab = xname, main=paste("Histogram of", xname, "with skewness", skew.measure1))

  p2=hist(log(train.combined[[i]]), n=100, xlab = xname, main=paste("Histogram of","log", xname, "with skewness", skew.measure2))


}


As we can observe in the graphs, a few variables become approximately normal after a log transformation. We retain the log transformation for these numeric variables—

train.combined$SalePrice=log(train.combined$SalePrice)

train.combined$LotArea=log(train.combined$LotArea)

train.combined$X1stFlrSF=log(train.combined$X1stFlrSF)

train.combined$GrLivArea=log(train.combined$GrLivArea)
In train.combined, the first 20 columns are numeric, columns 21 to 33 are count variables recoded as factors, columns 34 to 37 are the numeric period variables, and columns 38 to 80 are factors. Correlation is a bivariate analysis: independent variables should be screened out if they have no significant correlation with the dependent variable "SalePrice". Let us test the correlation of SalePrice with each of the numeric columns using cor.test—

for (i in 2:length(train.combined[,c(2:20)])) {
  a <- cor.test(train.combined$SalePrice, train.combined[,i])
  print(paste(colnames(train.combined)[i], " est:", a$estimate, " p=value:", a$p.value))

}
## [1] "LotFrontage  est: 0.355878470385983  p=value: 3.55822388129835e-37"
## [1] "LotArea  est: 0.39991774112559  p=value: 3.47429417639357e-57"
## [1] "MasVnrArea  est: 0.43080852419511  p=value: 1.10796556420336e-66"
## [1] "BsmtFinSF1  est: 0.372023073567088  p=value: 3.84595104338152e-49"
## [1] "BsmtFinSF2  est: 0.00483241053204665  p=value: 0.853629797314542"
## [1] "BsmtUnfSF  est: 0.22198505352521  p=value: 9.31852717731222e-18"
## [1] "TotalBsmtSF  est: 0.612133975369787  p=value: 7.53455105447948e-151"
## [1] "X1stFlrSF  est: 0.608946654903631  p=value: 6.97929348105704e-149"
## [1] "X2ndFlrSF  est: 0.319299984347205  p=value: 5.86690165149455e-36"
## [1] "LowQualFinSF  est: -0.037962803137427  p=value: 0.147104011505013"
## [1] "GrLivArea  est: 0.730254851198229  p=value: 1.5984652370953e-243"
## [1] "GarageArea  est: 0.650887555902007  p=value: 1.10625537128625e-176"
## [1] "WoodDeckSF  est: 0.334135073957513  p=value: 2.05532422562011e-39"
## [1] "OpenPorchSF  est: 0.321052972019767  p=value: 2.34462590169273e-36"
## [1] "EnclosedPorch  est: -0.149050281427676  p=value: 1.05080829167697e-08"
## [1] "X3SsnPorch  est: 0.0549002264676204  p=value: 0.0359475707337999"
## [1] "ScreenPorch  est: 0.121207604896864  p=value: 3.40897065981049e-06"
## [1] "PoolArea  est: 0.0697978060096918  p=value: 0.00763178070309401"
Now, we can visualize the correlation between SalePrice and other numeric variables through scatter plots—

for (i in 2:length(train.combined[,c(2:20)])) {

  p <- plot(scale(train.combined[,i], center = T, scale = T), train.combined$SalePrice,
            xlab =colnames(train.combined)[i], ylab = "SalePrice", main=paste("Cor between Sale Price and", colnames(train.combined)[i]))
 
  p
}


We need to impute missing values in the LotFrontage variable; here we do so through regression on the other numeric variables (remaining missing values, e.g. in MasVnrArea, are imputed later)—

mode.LotFrontage=lm(train.combined$LotFrontage~.,data=train.combined[,c(1:20)])

summary(mode.LotFrontage)
##
## Call:
## lm(formula = train.combined$LotFrontage ~ ., data = train.combined[,
##     c(1:20)])
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -73.485  -7.729  -0.465   6.828 210.494
##
## Coefficients: (1 not defined because of singularities)
##                 Estimate Std. Error t value Pr(>|t|) 
## (Intercept)   -1.876e+02  2.813e+01  -6.668 3.97e-11 ***
## SalePrice     -5.547e+00  2.366e+00  -2.344 0.019238 *
## LotArea        2.750e+01  1.229e+00  22.379  < 2e-16 ***
## MasVnrArea     7.095e-03  3.222e-03   2.202 0.027850 *
## BsmtFinSF1     5.110e-03  2.169e-03   2.356 0.018644 *
## BsmtFinSF2     5.484e-03  3.782e-03   1.450 0.147305 
## BsmtUnfSF      4.420e-03  2.125e-03   2.080 0.037743 *
## TotalBsmtSF           NA         NA      NA       NA 
## X1stFlrSF      9.269e+00  5.598e+00   1.656 0.098050 .
## X2ndFlrSF      4.036e-03  4.349e-03   0.928 0.353623 
## LowQualFinSF   1.170e-02  1.048e-02   1.116 0.264562 
## GrLivArea     -1.427e-01  6.700e+00  -0.021 0.983008 
## GarageArea     9.983e-03  3.144e-03   3.175 0.001537 **
## WoodDeckSF    -1.571e-02  4.615e-03  -3.405 0.000685 ***
## OpenPorchSF    3.904e-03  8.431e-03   0.463 0.643431 
## EnclosedPorch  4.457e-03  8.592e-03   0.519 0.604010 
## X3SsnPorch     2.988e-02  1.769e-02   1.689 0.091462 .
## ScreenPorch   -2.608e-02  9.432e-03  -2.765 0.005776 **
## PoolArea       7.337e-02  1.389e-02   5.281 1.53e-07 ***
## MiscVal       -6.408e-04  2.729e-03  -0.235 0.814387 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 17.49 on 1176 degrees of freedom
##   (265 observations deleted due to missingness)
## Multiple R-squared:  0.4893, Adjusted R-squared:  0.4815
## F-statistic:  62.6 on 18 and 1176 DF,  p-value: < 2.2e-16
pred=predict(mode.LotFrontage, newdata=train.combined[,c(1:20)])
## Warning in predict.lm(mode.LotFrontage, newdata = train.combined[,
## c(1:20)]): prediction from a rank-deficient fit may be misleading
train.combined$LotFrontage=ifelse(is.na(train.combined$LotFrontage),pred,
                              train.combined$LotFrontage)

train.combined$LotFrontage=round(train.combined$LotFrontage, digits = 0)
We can see that there are some numeric variables whose majority value is zero. From each of these we can derive a binary indicator variable—

train.combined$MasVnrArea_yes=ifelse(train.combined$MasVnrArea>0,1,0)

train.combined$BsmtFinSF1_yes=ifelse(train.combined$BsmtFinSF1>0,1,0)

train.combined$BsmtUnfSF_yes=ifelse(train.combined$BsmtUnfSF>0,1,0)

train.combined$X2ndFlrSF_yes=ifelse(train.combined$X2ndFlrSF>0,1,0)

train.combined$GarageArea_yes=ifelse(train.combined$GarageArea>0,1,0)

train.combined$WoodDeckSF_yes=ifelse(train.combined$WoodDeckSF>0,1,0)

train.combined$OpenPorchSF_yes=ifelse(train.combined$OpenPorchSF>0,1,0)

train.combined$EnclosedPorch_yes=ifelse(train.combined$EnclosedPorch>0,1,0)

train.combined$X3SsnPorch_yes=ifelse(train.combined$X3SsnPorch>0,1,0)

train.combined$ScreenPorch_yes=ifelse(train.combined$ScreenPorch>0,1,0)

train.combined$PoolArea_yes = ifelse(train.combined$PoolArea>0,1,0)

train.combined$MiscVal_yes=ifelse(train.combined$MiscVal>0,1,0)
ANOVA is a technique for comparing means across multiple groups. As our data frame has many categorical and dummy variables, we can check whether the group means of SalePrice differ significantly across the levels of these variables. As a first look, we compare the distributions with boxplots (an explicit ANOVA test is sketched after the boxplots)—

for(i in colnames(train.combined[,c(21:33,38:92)])){

boxplots=boxplot(train.combined$SalePrice~train.combined[[i]],xlab=colnames(train.combined[i]), main=paste("Boxplot SalePrice Vs", colnames(train.combined[i])))

boxplots

}
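
As the explicit test mentioned above, here is a minimal ANOVA sketch (an illustration, assuming the train.combined data frame built earlier): it tests whether the mean of log SalePrice differs across Neighborhood levels—

# One-way ANOVA: does mean (log) SalePrice differ across neighborhoods?
fit.aov=aov(SalePrice~Neighborhood, data=train.combined)
summary(fit.aov)

A small p-value in the summary indicates that at least one neighborhood's mean SalePrice differs from the others.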


Before model building we need to handle the remaining missing values in the data frame. To impute them, we load the missForest package (which also loads randomForest) and use randomForest::rfImpute—

library(missForest)
## Loading required package: randomForest
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## Loading required package: foreach
## Loading required package: itertools
## Loading required package: iterators
set.seed(1000)

train.withoutMV=randomForest::rfImpute(x=train.combined[,2:92],
                                       y=train.combined[,1],
                                       ntree=500,
                                       iter=5)
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  500 |  0.01983    12.44 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  500 |  0.01979    12.41 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  500 |   0.0195    12.23 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  500 |  0.01956    12.27 |
##      |      Out-of-bag   |
## Tree |      MSE  %Var(y) |
##  500 |  0.01985    12.45 |
train.withoutMV
## First 10 of 1,460 rows, first 8 of 92 columns:
##  train.combined[, 1] LotFrontage  LotArea MasVnrArea BsmtFinSF1 BsmtFinSF2 BsmtUnfSF TotalBsmtSF
##             12.24769    65.00000 9.041922  196.00000        706          0       150         856
##             12.10901    80.00000 9.169518    0.00000        978          0       284        1262
##             12.31717    68.00000 9.328123  162.00000        486          0       434         920
##             11.84940    60.00000 9.164296    0.00000        216          0       540         756
##             12.42922    84.00000 9.565214  350.00000        655          0       490        1145
##             11.87060    85.00000 9.554993    0.00000        732          0        64         796
##             12.63460    75.00000 9.218705  186.00000       1369          0       317        1686
##             12.20607    77.00000 9.247829  240.00000        859         32       216        1107
##             11.77452    51.00000 8.719317    0.00000          0          0       952         952
##             11.67844    50.00000 8.911934    0.00000        851          0       140         991
Let us check how many missing values remain after imputation—

sum(is.na(train.withoutMV))
## [1] 0
The output shows that no missing values are left in the dataset train.withoutMV. Before model building, we rename the target column back to SalePrice—

train.withoutMV$SalePrice=train.withoutMV$`train.combined[, 1]`

train.withoutMV$`train.combined[, 1]`=NULL
Now it is time to build the model. First, we will try a simple random forest model—

set.seed(1001)

model.rf=randomForest(SalePrice~.,
                      data=train.withoutMV,
                      ntree=500,
                      mtry=5,
                      importance=TRUE,
                      proximity=TRUE)

pred=predict(model.rf,
             newdata=train.withoutMV,
             type="response")

train.withoutMV$pred.rf=exp(pred)                          # back-transform predictions from log scale to rupees

train.withoutMV$SalePrice=exp(train.withoutMV$SalePrice)   # back-transform the target as well

library(Metrics)

rmse.rf=rmse(train.withoutMV$SalePrice, train.withoutMV$pred.rf)

rmse.rf
## [1] 16054.31
mape.rf=Metrics::mape(train.withoutMV$SalePrice, train.withoutMV$pred.rf)

mape.rf
## [1] 0.04494797
bias.rf=bias(train.withoutMV$SalePrice, train.withoutMV$pred.rf)

bias.rf
## [1] 2760.606
percent_bias.rf=percent_bias(train.withoutMV$SalePrice, train.withoutMV$pred.rf)

percent_bias.rf
## [1] -0.0028082
plot(model.rf)


varImpPlot(model.rf)
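
The RMSE, MAPE and bias above were computed on the same rows the forest was trained on, so they are optimistic. A minimal out-of-sample check (a sketch, not part of the workflow above) could look like this—

# Hold out 20% of the rows and evaluate the forest on them
set.seed(1002)

holdout=train.withoutMV
holdout$pred.rf=NULL                      # drop the in-sample predictions added above
holdout$SalePrice=log(holdout$SalePrice)  # back to the log scale used for modelling

idx=sample(nrow(holdout), size=round(0.8*nrow(holdout)))

rf.holdout=randomForest(SalePrice~., data=holdout[idx,], ntree=500, mtry=5)

pred.holdout=exp(predict(rf.holdout, newdata=holdout[-idx,]))   # back-transform to rupees

Metrics::rmse(exp(holdout$SalePrice[-idx]), pred.holdout)

The hold-out RMSE will typically be noticeably larger than the in-sample figure reported above.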