MENU

Drop Down MenusCSS Drop Down MenuPure CSS Dropdown Menu

Saturday, 16 June 2018

CASE STUDY 3: Loan Prediction with Random Forest

MACHINE LEARNING FOR LOAN PREDICTION


The loan approval depends on many parameters and based on those parameters, loan application is either rejected or accepted. Loan dataset is used in this case study. This dataset is a hackathon  dataset available on website : https//:www.analyticsvidhya.com. Following is the link to access the dataset---

https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/

The dataset is divided into two sets-- train and test. The train dataset is used for model building and test dataset is used for submitting the predictions to the hackthon for evaluation. Purpose of model building is to automate the loan approval process through machine learning algorithm.

PREPARATION OF WORKING DIRECTORY, LIBRARIES AND DATASET

So, lets understand the model building for this case study.  It starts with loading the proper working directory---

# Loading the working directory

setwd(E:/r/R/files)
getwd()

To perform the analysis in R, required libraries or packages should be called in working directory---

# Loading the libraries

library(dplyr)
library(e1071)
library(Hmisc)
library(ggplot2)
library(randomForest)
library(caret)
library(pROC)

There are two datasets-- train and test in csv format. The data is loaded in working directory---

# Loading the datasets

train=read.csv("train.csv", header = T, stringsAsFactors = F)
test=read.csv("test.csv", header=T, stringsAsFactors = F)

In train dataset, there are 614 applicants and 13 variables and in test datset, there are 367 applicants an 12 variables.  Before combining, a "DataType" variable is added to both datasets with categorical values "Train" and "Test" respectively.  In test dataset, Loan_Status variable is absent therefore it is added in test dataset.

For model building, it is better to have more number of observations, hence we combine both datasets. To combine two datasets successfully , the datasets should have equal number of variables.  and than, both train and test datasets are combined  ---

train$DataType="Train"
test$DataType="Test"
test$Loan_Status=NA

df=rbind(train,test)

Now, the combined dataset "df" has following structure---

> str(df)
'data.frame': 981 obs. of  14 variables:
 $ Loan_ID                     : chr  "LP001002" "LP001003" "LP001005" "LP001006" ...
 $ Gender                        : chr  "Male" "Male" "Male" "Male" ...
 $ Married                       : chr  "No" "Yes" "Yes" "Yes" ...
 $ Dependents                 : chr  "0" "1" "0" "0" ...
 $ Education                    : chr  "Graduate" "Graduate" "Graduate" "Not Graduate" ...
 $ Self_Employed           : chr  "No" "No" "Yes" "No" ...
 $ ApplicantIncome        : int  5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
 $ CoapplicantIncome     : num  0 1508 0 2358 0 ...
 $ LoanAmount               : int  NA 128 66 120 141 267 95 158 168 349 ...
 $ Loan_Amount_Term   : int  360 360 360 360 360 360 360 360 360 360 ...
 $ Credit_History             : int  1 1 1 1 1 1 1 0 1 1 ...
 $ Property_Area              : chr  "Urban" "Rural" "Urban" "Urban" ...
 $ Loan_Status                  : chr  "Y" "N" "Y" "Y" ...
 $ DataType                       : chr  "Train" "Train" "Train" "Train" ...

MISSING VALUES TREATMENT AND VARIABLES MANIPULATION

The dateset contains few missing values which has to be treated carefully and data formats needs to be changed with appropriate transformations. The summary of missing values is as follows---

>MV_summary=sapply(df, function(x) sum(is.na(x)))
 
>MV_summary
          Loan_ID          Gender       Married      Dependents    Education     Self_Employed
                0                 0                 0                 0                    0                   0
  ApplicantIncome CoapplicantIncome        LoanAmount  Loan_Amount_Term    Credit_History   
                0               0                                   27                    20                                  79               

Property_Area     Loan_Status          DataType
               0                367                           0
                             
We can see that Loan Amount, Loan Amount Term and Credit History have 27, 20 and 79 missing values respectively. Loan Status also has 367 missing values which are deliberately kept missing for submission in Hackathon.

There are many character type variables in the dataset. These variables should be converted into numeric values so that statistical computation can be done. For example, The variable Gender has two options-- male and female. In numeric form male can be converted into "1" and female into "0". Similar transformations are carried out for Married, Dependents, Education, Credit History, Loan Status and Property Area.

In Dependent variable, there are 25 missing values but they were not visible into missing value summary (MV_summary) which may be due to some mistakes while doing data feeding---

> table(df$Dependents)

       0     1     2      3+
 25 545 160 160  91

As we can see that 0 dependent category is the mode of Dependents variable hence we can replace missing value with 0 dependent category by following command---

>df$Dependents=ifelse(df$Dependents=="", 0, df$Dependents)

We can also observe that there is 3+ Dependent category in Dependents variable. The value 3+ can not be used for statistical computation so it is converted from 3+ to 3 and categories are converted to numeric type by following commands---

>df$Dependents=substr(df$Dependents,1,1)

>df$Dependents=substr(df$Dependents,1,1)

To indicate, if applicant has dependent or not, a new variable "depend_yes" is introduced. If there is dependent(s) than depend_yes variable will take value 1 or else zero---

>df$depend_yes=ifelse(df$Dependents>0,1,0)

Variable "Education" is a dichotomous variable having value "Graduate" and "Non Graduate". By following commands, Education variable is converted into numeric form and "Education" variable is removed from dataframe and new variable "Graduate" is introduced---

>df=df%>%
  mutate(graduate=as.numeric(Education=="Graduate"))

>df$Education=NULL

Similarly, "Self_Employed" variable is converted from character type to numeric type by following commands---

>df=df%>%
  mutate(SE_yes=as.numeric(Self_Employed=="Yes"))

>df$Self_Employed=NULL

In ApplicantIncome variable, there are two applicants with zero income. We fill up these cells with median of ApplicantIncome which is more appropriate than mean of ApplicantIncome because mean is sensitive to extreme values while median is not---

>df$ApplicantIncome[df$ApplicantIncome==0]=median(df$ApplicantIncome,na.rm = TRUE)

A new variable TotalIncome is introduced by adding ApplicantIncome and CoapplicantIncome---

>df$TotalIncome=df$ApplicantIncome+df$CoapplicantIncome

To indicate whether there is CoapplicantIncome or not, a new variable CoappIncome_yes is introduced---

>df$CoappIncome_yes=ifelse(df$CoapplicantIncome>0,1,0)

In original dattaframe, LoanAmount variable in given in 1000s which should be converted into full figures which is done by following command---

>df$LoanAmount=1000*df$LoanAmount

There are missing values in LoanAmount variable. It can be assumed that LoanAmount should vary as per Loan_Amount_Term therefore we replace NAs in LoanAmount by groupwise median LoanAmount for each Loan_Amount_Term by following commands---

>df$LoanAmount[is.na(df$LoanAmount)] <- ave(df$LoanAmount,
                                 df$Loan_Amount_Term,
                                 FUN=function(x)median(x,
                                                     na.rm = T))[is.na(df$LoanAmount)]  

There are 20 missing values in  Loan_Amount_Term. The Loan_Amount_Term 360 is highest frequency term hence all missing values are replaced with 360 by following command---

>df$Loan_Amount_Term[is.na(df$Loan_Amount_Term)]=360

In Credit_History variable, there are 79 missing variables. It has to be filled carefully. Let us observe distribution of Credit_History with Loan_Status---

> table(df$Loan_Status, df$Credit_History)
   
        0    1
  N   82  97
  Y   7    378

We can see that Applicants who have Credit_History=1 , their Loan_Status is "Y". Hence we can fill Credit_History of those Applicant as 1 where Loan_Status is "Y"---

>df$Credit_History[is.na(df$Credit_History)]=ifelse(df$Loan_Status=="Y",1,df$Credit_History)

The table below shows the distribution of credit history as per Education and Loan_Status---

> table(df$Loan_Status, df$Credit_History, df$Education)
, ,  = Graduate

   
       0    1
  N  57  73
  Y   6    307

, ,  = Not Graduate

   
        0    1
  N  25  24
  Y   1   71

 We can see that Credit_History is 1 for most of the cases where Loan_status is "Y" and Education is "Graduate". So, we can replace those NAs of Credit_History variable where Loan_Status is "Y" and Education is "Graduate". by following command---

>df$Credit_History[is.na(df$Credit_History)]=ifelse(df$Loan_Status=="Y" &  
  df$Education=="Graduate",1,df$Credit_History)

Similarly, we observe the distribution of Credit_History with Loan_Status, Education and SE_yes---

> table(df$Loan_Status, df$Credit_History, df$Education, df$SE_yes)
, ,  = Graduate,  = 0

   
        0   1
  N  50  71
  Y   7   287

, ,  = Not Graduate,  = 0

   
       0    1
  N  23  22
  Y   2   70

, ,  = Graduate,  = 1

   
       0    1
  N  10   9
  Y   2    44

, ,  = Not Graduate,  = 1

   
        0   1
  N   2   5
  Y   0  10

We can observe that Applicant who is Graduate, either self employed or not, the Credit_History is most likely 1 if Loan_Status is also "Y". We replace NAs with following commands---

> df$Credit_History[is.na(df$Credit_History)]=ifelse(df$Loan_Status=="Y" & df$Education=="Graduate" & df$SE_yes==1,1,df$Credit_History)

> df$Credit_History[is.na(df$Credit_History)]=ifelse(df$Loan_Status=="Y" & df$Education=="Graduate" & df$SE_yes==0,1,df$Credit_History)

> which(is.na(df$Credit_History))
integer(0)

> table(df$Credit_History)

  0    1 
163  818 

Loan_Status variable should also be changed from character variable to numeric variable which is done by following command---

>df$Loan_Status=ifelse(df$Loan_Status=="Y",1,0)

For  approval of LoanAmount, the LoanAmount should not exceed a certain multiple of TotalIncome. According to literature survey, it is found that Loan to Income ratio should not exceed 40 times of Income in case of Self Employed  applicants and 60 times in case of non Self Employed applicants. Hereby, we have created a variable "LoantoIncomeRatio", LoanIncomeRatio_for_self and LoanIncomeRatio_for_nonself= ifelse(df$LoantoIncomeRatio<=60 & df$SE_yes==0,1,0)
---

>df$LoantoIncomeRatio=df$LoanAmount/df$TotalIncome

>df$LoanIncomeRatio_for_self= ifelse(df$LoantoIncomeRatio<=40 & df$SE_yes==1,1,0)

>df$LoanIncomeRatio_for_nonself= ifelse(df$LoantoIncomeRatio<=60 & df$SE_yes==0,1,0)

There are three types of Property_Area in the dataset--- Urban, Semiurban and Rural. This variable is mutated into two new variables-- Area_Urban and Area_Semiurban. 

Finally, the Loan_Status variable is mutated into numeric variable by following command---

>df$Loan_Status=ifelse(df$Loan_Status=="Y",1,0)

Now, let us check whether there is any missing value left by following commands---

> MV_summary=sapply(df, function(x) sum(is.na(x)))
> MV_summary
                    Loan_ID                   Dependents            ApplicantIncome 
                          0                          0                            0 
          CoapplicantIncome            LoanAmount          Loan_Amount_Term 
                          0                          0                            0 
             Credit_History                 Loan_Status          DataType 
                          0                         367                         0 
                       male                       married                  depend_yes 
                          0                          0                            0 
                     SE_yes                    TotalIncome           LoantoIncomeRatio 
                          0                          0                            0 
   LoanIncomeRatio_for_self       LoanIncomeRatio_for_nonself     area_urban 
                          0                          0                                                    0 
             area_semiurban               graduate 
                          0                          0 

Except for Loan_Status, all variables have no missing values.

The ApplicantIncome and CoapplicantIncome variables are redundant now so, we can remove them from dataset---

df=df[,-c(3,4)]

The TotalIncome and LoanAmount variables are continuous variables. Let us observe their distribution through histogram attached below---

> hist(df$LoanAmount, n=100)
> hist(df$TotalIncome, n=100)

We can see that TotalIncome and LoanAmount variables have skewed distribution towards higher side. For modeling purpose, the variables should be normally distributed like bell shaped curve. To remove skewness, we apply log transformation and observe the histograms again---



> df$LoanAmount=log(df$LoanAmount)
> df$TotalIncome=log(df$TotalIncome)

> hist(df$LoanAmount, n=100)
> hist(df$TotalIncome, n=100)

                               



Now, the dataset "df" is divided into two datasets--- "df_train" and "df_test". The first 614 rows are taken in "df_train" and last 367 rows are taken in "df_test".

>df_train=df[c(1:614),]
>df_test=df[615:981,]

MODEL BUILDING WITH TRAINING DATASET

For model building, we can start with logistic regression as the response variable i.e. Loan_Status is binary. The formula for model is as follows---

> formula=as.formula(Loan_Status~ LoanAmount
                                                            +Loan_Amount_Term
                                                            +Credit_History           
                                                            +male                     
                                                            +married                 
                                                            +Dependents               
                                                            +depend_yes               
                                                            +graduate                 
                                                            +SE_yes                   
                                                            +LoantoIncomeRatio
                                                            +area_urban               
                                                            +area_semiurban           
                                                            +TotalIncome
                                                            +LoanIncomeRatio_for_self
                                                            +LoanIncomeRatio_for_nonself)

Logistic regression is special case of linear regression where dependent variable is categorical variable. Following commands are used to generate logistic model---

>model_glm=glm(formula,data=df_train,family="binomial" )
>summary(model_glm)


Model fit index for logistic regression model is AIC which should be as minimum as possible. for model_glm, AIC is 597.9.  There are only three variables which are significant in the regression equation i.e. married, area_semiurban and Credit_History. To fine tune model, we will iteratively remove the variable with highest p-value one by one and observe the summary of the model until we are left with equation having only significant variables. To keep distinction between initial formula and final formula, there are named as "formula" and "formula.edited".

After number of iterations, we have got following results---

>formula.edited=as.formula(Loan_Status~
                  Credit_History           
                  +married                 
                  +area_semiurban           
                  +LoanIncomeRatio_for_self
                  +LoanIncomeRatio_for_nonself)


>model_glm=glm(formula.edited,data=df_train,family="binomial" )

>summary(model_glm)


































Now, we can see that there are five predictor variables found significant i.e. LoanIncomeRatio_for_nonself, LoanIncomeRatio_for_self, area_urban, married and Credit_History.

PREDICTION OF LOAN_STATUS

After modeling, we need to predict the Loan_Status variable for all cases of df_train dataset. The predict function estimates the probability for each loan applicant. Following command adds "predict_glm" column to df_train dataset which contains probabilities---

>df_train$Predict_glm=predict(model_glm, type="response", newdata = df_train)

> str(df_train$Predict_glm)
    num [1:614] 0.692 0.79 0.799 0.79 0.692 ...

Random forest technique is advance level modeling technique in which number of models are build based on certain parameters and best model is chosen algorithmically. Following command applies RandomForest technique to fine tune the model and predict the probabilities for each applicant---

> model_rf=randomForest(formula.edited, data=df_train, importance=TRUE, proximity=TRUE,eig=TRUE, ntree=10000,type="classification")

>df_train$predict_rf=predict(model_rf, type="response", newdata=df_train)

>str(df_train$predict_rf)
 num [1:614] 0.694 0.733 0.728 0.733 0.694 ...

After modeling with Random Forest, the cutoff for prediction probabilities should ascertained to classify the Applicant into either Loan_Status=Y or Loan_Status=N. Iteratively, we will change the cutoff and see how predicted classification improves. The optimal cutoff found is 0.68 where maximum accuracy, sensitivity and specificity is achieved---

> cutoff=0.68

> df_train$cf=ifelse(df_train$predict_rf<cutoff,0,1)
> df_confusion_matrix=cbind(Survived=df_train$Loan_Status,Predicted=df_train$cf)
> df_confusion_matrix=as.data.frame(df_confusion_matrix)
> df_confusion_matrix$Survived=as.factor(df_confusion_matrix$Survived)

> df_confusion_matrix$Predicted=as.factor(df_confusion_matrix$Predicted)
> confusionMatrix(df_confusion_matrix$Survived, df_confusion_matrix$Predicted)
Confusion Matrix and Statistics

                       Reference
Prediction          0     1 
         0                93  99
         1                20 402
                                          
               Accuracy              : 0.8062          
                 95% CI               : (0.7727, 0.8367)
    No Information Rate       : 0.816           
    P-Value [Acc > NIR]      : 0.7526          
                                          
                                 Kappa : 0.4922          
     Mcnemar's Test P-Value : 8.662e-13       
                                          
            Sensitivity                : 0.8230          
            Specificity                : 0.8024          
         Pos Pred Value            : 0.4844          
         Neg Pred Value           : 0.9526          
             Prevalence               : 0.1840          
         Detection Rate             : 0.1515          
   Detection Prevalence         : 0.3127          
      Balanced Accuracy         : 0.8127          
                     'Positive' Class : 0    


CONCLUSION

The loan applicant with good credit history, married, having property in semiurban area, having Loan to Income Ratio upto 40 times for self employed and upto 60 times for non self employed have higher chances of getting loan approved. The loan applicant's loan status does not depend on gender, education, number of dependents, loan amount and total income. 


















1 comment: