MACHINE LEARNING FOR LOAN PREDICTION
The loan approval depends on many parameters and based on those parameters, loan application is either rejected or accepted. Loan dataset is used in this case study. This dataset is a hackathon dataset available on website : https//:www.analyticsvidhya.com. Following is the link to access the dataset---
https://datahack.analyticsvidhya.com/contest/practice-problem-loan-prediction-iii/
The dataset is divided into two sets-- train and test. The train dataset is used for model building and test dataset is used for submitting the predictions to the hackthon for evaluation. Purpose of model building is to automate the loan approval process through machine learning algorithm.
PREPARATION OF WORKING DIRECTORY, LIBRARIES AND DATASET
So, lets understand the model building for this case study. It starts with loading the proper working directory---
setwd(E:/r/R/files)
getwd()
To perform the analysis in R, required libraries or packages should be called in working directory---
# Loading the libraries
library(dplyr)
library(e1071)
library(Hmisc)
library(ggplot2)
library(randomForest)
library(caret)
library(pROC)
There are two datasets-- train and test in csv format. The data is loaded in working directory---
# Loading the datasets
train=read.csv("train.csv", header = T, stringsAsFactors = F)
test=read.csv("test.csv", header=T, stringsAsFactors = F)
In train dataset, there are 614 applicants and 13 variables and in test datset, there are 367 applicants an 12 variables. Before combining, a "DataType" variable is added to both datasets with categorical values "Train" and "Test" respectively. In test dataset, Loan_Status variable is absent therefore it is added in test dataset.
For model building, it is better to have more number of observations, hence we combine both datasets. To combine two datasets successfully , the datasets should have equal number of variables. and than, both train and test datasets are combined ---
train$DataType="Train"
test$DataType="Test"
test$Loan_Status=NA
df=rbind(train,test)
Now, the combined dataset "df" has following structure---
> str(df)
'data.frame': 981 obs. of 14 variables:
$ Loan_ID : chr "LP001002" "LP001003" "LP001005" "LP001006" ...
$ Gender : chr "Male" "Male" "Male" "Male" ...
$ Married : chr "No" "Yes" "Yes" "Yes" ...
$ Dependents : chr "0" "1" "0" "0" ...
$ Education : chr "Graduate" "Graduate" "Graduate" "Not Graduate" ...
$ Self_Employed : chr "No" "No" "Yes" "No" ...
$ ApplicantIncome : int 5849 4583 3000 2583 6000 5417 2333 3036 4006 12841 ...
$ CoapplicantIncome : num 0 1508 0 2358 0 ...
$ LoanAmount : int NA 128 66 120 141 267 95 158 168 349 ...
$ Loan_Amount_Term : int 360 360 360 360 360 360 360 360 360 360 ...
$ Credit_History : int 1 1 1 1 1 1 1 0 1 1 ...
$ Property_Area : chr "Urban" "Rural" "Urban" "Urban" ...
$ Loan_Status : chr "Y" "N" "Y" "Y" ...
$ DataType : chr "Train" "Train" "Train" "Train" ...
MISSING VALUES TREATMENT AND VARIABLES MANIPULATION
The dateset contains few missing values which has to be treated carefully and data formats needs to be changed with appropriate transformations. The summary of missing values is as follows---
>MV_summary=sapply(df, function(x) sum(is.na(x)))
>MV_summary
Loan_ID Gender Married Dependents Education Self_Employed
0 0 0 0 0 0
ApplicantIncome CoapplicantIncome LoanAmount Loan_Amount_Term Credit_History
0 0 27 20 79
Property_Area Loan_Status DataType
0 367 0
We can see that Loan Amount, Loan Amount Term and Credit History have 27, 20 and 79 missing values respectively. Loan Status also has 367 missing values which are deliberately kept missing for submission in Hackathon.
There are many character type variables in the dataset. These variables should be converted into numeric values so that statistical computation can be done. For example, The variable Gender has two options-- male and female. In numeric form male can be converted into "1" and female into "0". Similar transformations are carried out for Married, Dependents, Education, Credit History, Loan Status and Property Area.
In Dependent variable, there are 25 missing values but they were not visible into missing value summary (MV_summary) which may be due to some mistakes while doing data feeding---
> table(df$Dependents)
0 1 2 3+
25 545 160 160 91
As we can see that 0 dependent category is the mode of Dependents variable hence we can replace missing value with 0 dependent category by following command---
>df$Dependents=ifelse(df$Dependents=="", 0, df$Dependents)
>df$Dependents=substr(df$Dependents,1,1)
>df$Dependents=substr(df$Dependents,1,1)
To indicate, if applicant has dependent or not, a new variable "depend_yes" is introduced. If there is dependent(s) than depend_yes variable will take value 1 or else zero---
>df$depend_yes=ifelse(df$Dependents>0,1,0)
>df=df%>%
mutate(graduate=as.numeric(Education=="Graduate"))
>df$Education=NULL
>df=df%>%
mutate(SE_yes=as.numeric(Self_Employed=="Yes"))
>df$Self_Employed=NULL
>df$ApplicantIncome[df$ApplicantIncome==0]=median(df$ApplicantIncome,na.rm = TRUE)
A new variable TotalIncome is introduced by adding ApplicantIncome and CoapplicantIncome---
>df$TotalIncome=df$ApplicantIncome+df$CoapplicantIncome
To indicate whether there is CoapplicantIncome or not, a new variable CoappIncome_yes is introduced---
>df$CoappIncome_yes=ifelse(df$CoapplicantIncome>0,1,0)
In original dattaframe, LoanAmount variable in given in 1000s which should be converted into full figures which is done by following command---
>df$LoanAmount=1000*df$LoanAmount
There are missing values in LoanAmount variable. It can be assumed that LoanAmount should vary as per Loan_Amount_Term therefore we replace NAs in LoanAmount by groupwise median LoanAmount for each Loan_Amount_Term by following commands---
>df$LoanAmount[is.na(df$LoanAmount)] <- ave(df$LoanAmount,
df$Loan_Amount_Term,
FUN=function(x)median(x,
na.rm = T))[is.na(df$LoanAmount)]
>df$LoanAmount[is.na(df$LoanAmount)] <- ave(df$LoanAmount,
df$Loan_Amount_Term,
FUN=function(x)median(x,
na.rm = T))[is.na(df$LoanAmount)]
There are 20 missing values in Loan_Amount_Term. The Loan_Amount_Term 360 is highest frequency term hence all missing values are replaced with 360 by following command---
>df$Loan_Amount_Term[is.na(df$Loan_Amount_Term)]=360
In Credit_History variable, there are 79 missing variables. It has to be filled carefully. Let us observe distribution of Credit_History with Loan_Status---
> table(df$Loan_Status, df$Credit_History)
0 1
N 82 97
Y 7 378
We can see that Applicants who have Credit_History=1 , their Loan_Status is "Y". Hence we can fill Credit_History of those Applicant as 1 where Loan_Status is "Y"---
>df$Credit_History[is.na(df$Credit_History)]=ifelse(df$Loan_Status=="Y",1,df$Credit_History)
The table below shows the distribution of credit history as per Education and Loan_Status---
> table(df$Loan_Status, df$Credit_History, df$Education)
, , = Graduate
0 1
N 57 73
Y 6 307
, , = Not Graduate
0 1
N 25 24
Y 1 71
We can see that Credit_History is 1 for most of the cases where Loan_status is "Y" and Education is "Graduate". So, we can replace those NAs of Credit_History variable where Loan_Status is "Y" and Education is "Graduate". by following command---
>df$Credit_History[is.na(df$Credit_History)]=ifelse(df$Loan_Status=="Y" &
df$Education=="Graduate",1,df$Credit_History)
Similarly, we observe the distribution of Credit_History with Loan_Status, Education and SE_yes---
> table(df$Loan_Status, df$Credit_History, df$Education, df$SE_yes)
, , = Graduate, = 0
0 1
N 50 71
Y 7 287
, , = Not Graduate, = 0
0 1
N 23 22
Y 2 70
, , = Graduate, = 1
0 1
N 10 9
Y 2 44
, , = Not Graduate, = 1
0 1
N 2 5
Y 0 10
We can observe that Applicant who is Graduate, either self employed or not, the Credit_History is most likely 1 if Loan_Status is also "Y". We replace NAs with following commands---
> df$Credit_History[is.na(df$Credit_History)]=ifelse(df$Loan_Status=="Y" & df$Education=="Graduate" & df$SE_yes==1,1,df$Credit_History)
> df$Credit_History[is.na(df$Credit_History)]=ifelse(df$Loan_Status=="Y" & df$Education=="Graduate" & df$SE_yes==0,1,df$Credit_History)
> which(is.na(df$Credit_History))
integer(0)
> table(df$Credit_History)
0 1
163 818
Loan_Status variable should also be changed from character variable to numeric variable which is done by following command---
>df$Loan_Status=ifelse(df$Loan_Status=="Y",1,0)
>df$Loan_Status=ifelse(df$Loan_Status=="Y",1,0)
For approval of LoanAmount, the LoanAmount should not exceed a certain multiple of TotalIncome. According to literature survey, it is found that Loan to Income ratio should not exceed 40 times of Income in case of Self Employed applicants and 60 times in case of non Self Employed applicants. Hereby, we have created a variable "LoantoIncomeRatio", LoanIncomeRatio_for_self and LoanIncomeRatio_for_nonself= ifelse(df$LoantoIncomeRatio<=60 & df$SE_yes==0,1,0)
---
>df$LoantoIncomeRatio=df$LoanAmount/df$TotalIncome
>df$LoanIncomeRatio_for_self= ifelse(df$LoantoIncomeRatio<=40 & df$SE_yes==1,1,0)
>df$LoanIncomeRatio_for_nonself= ifelse(df$LoantoIncomeRatio<=60 & df$SE_yes==0,1,0)
There are three types of Property_Area in the dataset--- Urban, Semiurban and Rural. This variable is mutated into two new variables-- Area_Urban and Area_Semiurban.
Finally, the Loan_Status variable is mutated into numeric variable by following command---
>df$Loan_Status=ifelse(df$Loan_Status=="Y",1,0)
Now, let us check whether there is any missing value left by following commands---
> MV_summary=sapply(df, function(x) sum(is.na(x)))
> MV_summary
Loan_ID Dependents ApplicantIncome
0 0 0
CoapplicantIncome LoanAmount Loan_Amount_Term
0 0 0
Credit_History Loan_Status DataType
0 367 0
male married depend_yes
0 0 0
SE_yes TotalIncome LoantoIncomeRatio
0 0 0
LoanIncomeRatio_for_self LoanIncomeRatio_for_nonself area_urban
0 0 0
area_semiurban graduate
0 0
Except for Loan_Status, all variables have no missing values.
The ApplicantIncome and CoapplicantIncome variables are redundant now so, we can remove them from dataset---
df=df[,-c(3,4)]
The ApplicantIncome and CoapplicantIncome variables are redundant now so, we can remove them from dataset---
df=df[,-c(3,4)]
The TotalIncome and LoanAmount variables are continuous variables. Let us observe their distribution through histogram attached below---
> hist(df$LoanAmount, n=100)
> hist(df$TotalIncome, n=100)
We can see that TotalIncome and LoanAmount variables have skewed distribution towards higher side. For modeling purpose, the variables should be normally distributed like bell shaped curve. To remove skewness, we apply log transformation and observe the histograms again---
> df$LoanAmount=log(df$LoanAmount)
> df$TotalIncome=log(df$TotalIncome)
> hist(df$LoanAmount, n=100)
> hist(df$TotalIncome, n=100)
Now, the dataset "df" is divided into two datasets--- "df_train" and "df_test". The first 614 rows are taken in "df_train" and last 367 rows are taken in "df_test".
>df_train=df[c(1:614),]
>df_test=df[615:981,]
MODEL BUILDING WITH TRAINING DATASET
For model building, we can start with logistic regression as the response variable i.e. Loan_Status is binary. The formula for model is as follows---
> formula=as.formula(Loan_Status~ LoanAmount
+Loan_Amount_Term
+Credit_History
+male
+married
+Dependents
+depend_yes
+graduate
+SE_yes
+LoantoIncomeRatio
+area_urban
+area_semiurban
+TotalIncome
+LoanIncomeRatio_for_self
+LoanIncomeRatio_for_nonself)
+Loan_Amount_Term
+Credit_History
+male
+married
+Dependents
+depend_yes
+graduate
+SE_yes
+LoantoIncomeRatio
+area_urban
+area_semiurban
+TotalIncome
+LoanIncomeRatio_for_self
+LoanIncomeRatio_for_nonself)
Logistic regression is special case of linear regression where dependent variable is categorical variable. Following commands are used to generate logistic model---
>model_glm=glm(formula,data=df_train,family="binomial" )
>summary(model_glm)
Model fit index for logistic regression model is AIC which should be as minimum as possible. for model_glm, AIC is 597.9. There are only three variables which are significant in the regression equation i.e. married, area_semiurban and Credit_History. To fine tune model, we will iteratively remove the variable with highest p-value one by one and observe the summary of the model until we are left with equation having only significant variables. To keep distinction between initial formula and final formula, there are named as "formula" and "formula.edited".
After number of iterations, we have got following results---
>formula.edited=as.formula(Loan_Status~
Credit_History
+married
+area_semiurban
+LoanIncomeRatio_for_self
+LoanIncomeRatio_for_nonself)
>model_glm=glm(formula.edited,data=df_train,family="binomial" )
>summary(model_glm)
After number of iterations, we have got following results---
>formula.edited=as.formula(Loan_Status~
Credit_History
+married
+area_semiurban
+LoanIncomeRatio_for_self
+LoanIncomeRatio_for_nonself)
>model_glm=glm(formula.edited,data=df_train,family="binomial" )
>summary(model_glm)
Now, we can see that there are five predictor variables found significant i.e. LoanIncomeRatio_for_nonself, LoanIncomeRatio_for_self, area_urban, married and Credit_History.
PREDICTION OF LOAN_STATUS
After modeling, we need to predict the Loan_Status variable for all cases of df_train dataset. The predict function estimates the probability for each loan applicant. Following command adds "predict_glm" column to df_train dataset which contains probabilities---
>df_train$Predict_glm=predict(model_glm, type="response", newdata = df_train)
> str(df_train$Predict_glm)
num [1:614] 0.692 0.79 0.799 0.79 0.692 ...
Random forest technique is advance level modeling technique in which number of models are build based on certain parameters and best model is chosen algorithmically. Following command applies RandomForest technique to fine tune the model and predict the probabilities for each applicant---
> model_rf=randomForest(formula.edited, data=df_train, importance=TRUE, proximity=TRUE,eig=TRUE, ntree=10000,type="classification")
>df_train$predict_rf=predict(model_rf, type="response", newdata=df_train)
>str(df_train$predict_rf)
num [1:614] 0.694 0.733 0.728 0.733 0.694 ...
After modeling with Random Forest, the cutoff for prediction probabilities should ascertained to classify the Applicant into either Loan_Status=Y or Loan_Status=N. Iteratively, we will change the cutoff and see how predicted classification improves. The optimal cutoff found is 0.68 where maximum accuracy, sensitivity and specificity is achieved---
> cutoff=0.68
> df_train$cf=ifelse(df_train$predict_rf<cutoff,0,1)
> df_confusion_matrix=cbind(Survived=df_train$Loan_Status,Predicted=df_train$cf)
> df_confusion_matrix=as.data.frame(df_confusion_matrix)
> df_confusion_matrix$Survived=as.factor(df_confusion_matrix$Survived)
> df_confusion_matrix$Predicted=as.factor(df_confusion_matrix$Predicted)
> confusionMatrix(df_confusion_matrix$Survived, df_confusion_matrix$Predicted)
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 93 99
1 20 402
Accuracy : 0.8062
95% CI : (0.7727, 0.8367)
No Information Rate : 0.816
P-Value [Acc > NIR] : 0.7526
Kappa : 0.4922
Mcnemar's Test P-Value : 8.662e-13
Sensitivity : 0.8230
Specificity : 0.8024
Pos Pred Value : 0.4844
Neg Pred Value : 0.9526
Prevalence : 0.1840
Detection Rate : 0.1515
Detection Prevalence : 0.3127
Balanced Accuracy : 0.8127
'Positive' Class : 0
CONCLUSION
The loan applicant with good credit history, married, having property in semiurban area, having Loan to Income Ratio upto 40 times for self employed and upto 60 times for non self employed have higher chances of getting loan approved. The loan applicant's loan status does not depend on gender, education, number of dependents, loan amount and total income.