MENU

Drop Down MenusCSS Drop Down MenuPure CSS Dropdown Menu

Tuesday, 10 April 2018

CASE STUDY 2: Who Did Survive in Titanic Disaster?


Titanic is the classic man made disaster. The brilliant craft of man kind, the great Titanic ship, sunk in the sea after striking an iceberg in its first voyage. The disaster had more grievous impact on passengers due to lack of life boats in the ship. 

There were many classes of passengers, males, females, variety of age groups from children to old age citizens in the ship. The dataset we are going to analyze in this case study contains the passengers details with information of whether passenger survived the disaster or not. Let us look at the data set--

















The dataset contains details of 891 passengers. Each passenger's detail is explained through 12 columns or variables. The variables and their definitions are explained in following table---



The column "survival" tells that whether passenger survived the disaster or not (1= Survived, 0=Did not survive). Out of 891 passengers, 342 passengers survived and 549 passengers died. 

Questions which should be answered to find out the survival pattern among the passengers, corresponding hypotheses, relevant tests, results and interpretation to be tested are as following---

a) Whether "Ticket class" has any influence on chances of survival?

Ho : The survival of ith passenger is independent of Ticket class of ith passenger  
H1  : The survival of ith passenger is dependent of Ticket class of ith passenger

Statistical tool: Chi Squared Test of independence

Result and Interpretation:

> table(titanic$Survived, titanic$Pclass)
 
      1      2    3
  0  80    97  372
  1  136  87  119

We can observe from the the table that passengers of class 1 has got more survivors proportion while passengers of class 3 has more deaths.

> chisq.test(table(titanic$Survived,titanic$Pclass))

Pearson's Chi-squared test

data:  table(titanic$Survived, titanic$Pclass)
X-squared = 102.89, df = 2, p-value < 2.2e-16

The Chi-squared test result also shows that passengers' class has influence on survival chances of passengers. The class 1 passengers are safer and class 3 passengers are more vulnerable.

b) Whether gender of passenger has any influence on chances of survival?

Ho : The survival of ith passenger is independent of gender of ith passenger  
H1  : The survival of ith passenger is dependent of gender of ith passenger

Statistical tool: Chi Squared Test of independence

Result and Interpretations:

> table(titanic$Survived, titanic$Sex)
 
         female  male
  0     81         468
  1     233       109

We can observe that if passenger is female, chances of survival are more than the fellow male passengers.

> chisq.test(table(titanic$Survived, titanic$Sex))

Pearson's Chi-squared test with Yates' continuity correction

data:  table(titanic$Survived, titanic$Sex)
X-squared = 260.72, df = 1, p-value < 2.2e-16

The Chi-squared test result shows that the chances of survival depends on sex of the passenger.

c) Whether age of passenger has any influence on chances of survival?

Ho : There is no significant difference among the group means of survivors and non survivors.  
H1  : There is significant difference among the group means of survivors and non survivors.

Statistical tool: Two sample t-test

Before doing two sample t test, the variance of age of survived passengers and dead passengers has to be compared to see whether variance of both dataset is equal or not. The following is the result of F test---

Result and Interpretations---

> var.test(titanic_survived$Age, titanic_dead$Age)

F test to compare two variances

data:  titanic_survived$Age and titanic_dead$Age
F = 1.1129, num df = 289, denom df = 423, p-value = 0.317
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 0.9023453 1.3785538
sample estimates:
ratio of variances 
          1.112932  

The p-value is greater than 0.05 hence age variable of both datasets have homogeneous variance.

> t.test(titanic_survived$Age, titanic_dead$Age, var.equal = TRUE, paired = FALSE)

Two Sample t-test

data:  titanic_survived$Age and titanic_dead$Age
t = -2.0667, df = 712, p-value = 0.03912
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -4.450798 -0.114181
sample estimates:
mean of x mean of y
 28.34369  30.62618

We can observe that the difference in age of titanic survivor and titanic non-survivor is significant at 0.05 significance level. 

d) The influence of having siblings or spouse and number of siblings on survival chances of passengers need to be studied thoroughly. It will be done in two parts---

i. Whether having a sibling or spouse have influence on chances of survival?

Ho : The survival of ith passenger is independent of having siblings or spouse  
H1  : The survival of ith passenger is dependent of having siblings or spouse

Statistical tool: Chi Squared Test of independence

Result and Interpretations---

> table(titanic$Survived, titanic$SibSp_yes)
 
      0      1
  0  398  151
  1  210  132

> chisq.test(table(titanic$Survived, titanic$SibSp_yes))

Pearson's Chi-squared test with Yates' continuity correction

data:  table(titanic$Survived, titanic$SibSp_yes)
X-squared = 11.456, df = 1, p-value = 0.0007128

The p-vslue is much less than 0.05 hence null hypothesis is rejected and it can be interpreted that there is dependence of passengers' survival chances on having siblings or spouse.

ii. Whether number of siblings or spouse have influence on chances of survival?

To answer this question we can observe following tables---

> table(titanic$SibSp)

  0   1      2    3    4     5   8
608 209  28  16  18   5   7

> table(titanic$Survived, titanic$SibSp)
 
      0      1       2   3    4    5   8
  0  398  97    15  12  15  5   7
  1  210  112  13   4    3   0   0

We can understand very well that having no sibling or spouse clearly reduces chances of passenger survival and having only one sibling or spouse has greater chances of survival than having no sibling or more than one sibling.

e) The influence of having parents or children and number of children on survival chances of passengers need to be studied thoroughly. It will be done in two parts---

i. Whether having parents or children have influence on chances of survival?

Ho : The survival of ith passenger is independent of having parents or children  
H1  : The survival of ith passenger is dependent of having parents or children

Statistical tool: Chi Squared Test of independence

Result and Interpretations---

> table(titanic$Survived, titanic$parch_yes)
 
      0    1
  0 445 104
  1 233 109

> chisq.test(table(titanic$Survived, titanic$parch_yes))

Pearson's Chi-squared test with Yates' continuity correction

data:  table(titanic$Survived, titanic$parch_yes)
X-squared = 18.656, df = 1, p-value = 1.565e-05

It is evident from Chi-squared test that having parents or children influence chances of passenger's survival.

ii. Whether number of children have influence on chances of survival?

The tables below shows how number of children influence the chances of passenger survival---

> table(titanic$Parch)

  0   1       2    3   4   5   6
678 118  80   5   4   5   1

> table(titanic$Survived, titanic$Parch)
 
      0     1     2    3   4   5   6
  0 445  53  40   2   4   4   1
  1 233  65  40   3   0   1   0

We can observe that having no children or parents increases chances of non survival and having only one sibling or parent increases chances of survival but having children more than one decreases chances of survival.

g Whether fare of passenger has any influence on chances of survival?

Ho : There is no significant difference among the group means of survivors and non survivors.  
H1  : There is significant difference among the group means of survivors and non survivors.

Statistical tool: Two sample t-test

Result and Interpretations---

> var.test(titanic_survived$Fare, titanic_dead$Fare)

F test to compare two variances

data:  titanic_survived$Fare and titanic_dead$Fare
F = 4.5017, num df = 341, denom df = 548, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
 3.725366 5.463382
sample estimates:
ratio of variances
          4.501697

The F test result shows that variance in fares of survived and non survived passengers is not homogeneous. Hence, two sample t test is applied with due consideration of non homogeneous variances.

> t.test(titanic_survived$Fare, titanic_dead$Fare, var.equal = FALSE, paired=FALSE)

Welch Two Sample t-test

data:  titanic_survived$Fare and titanic_dead$Fare
t = 6.8391, df = 436.7, p-value = 2.699e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 18.72592 33.82912
sample estimates:
mean of x mean of y
 48.39541  22.11789 

The p-value is less than 0.05 hence it can be concluded that the mean fares of survived passengers and non survived passengers have significant difference. We can see that mean fare for  survived passengers is $48.40 and mean fare for non survived passengers is $22.12. There is clear evidence that passengers paying lower fares had discriminatory low opportunity to survive.

LOGISTIC REGRESSION MODELING

The passenger on Titanic either survived or died after striking the iceberg disaster. In terms of modeling, the outcome of the disaster is binary. Other variables in dataset are independent variables which are influencing the probability of passenger's survival.

The variables which were considered as independent variables are Pclass, Age, SibSp, Parch, SibSp_yes, parch_yes and female_yes. Pclass variable has three categories. Age variable is a continuous variable. SibSp variable is number of siblings or spouse. Parch variable is number of children or parents. SIbSp_yes variable is dichotomous variable has two categories i.e. passenger having sibling or spouse and passenger not having sibling or spouse. Similarly, Parch_yes variable represents whether passenger is having parent or children or not along with him. Female_yes variable represents whether passenger is male or female.

Logistic regression modeling results are as follows---

> summary(glm_model)

Call:
glm(formula = Survived ~ . - Embarked - Fare, family = "binomial",
    data = titanic_sub)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.7440  -0.6168  -0.4328   0.5865   2.6719

Coefficients:
                          Estimate    Std. Error      z value      Pr(>|z|) 
(Intercept)         2.090042   0.444176       4.705        2.53e-06 ***
Pclass              -1.093903    0.121616     -8.995          < 2e-16 ***
Age                 -0.037635    0.008076     -4.660         3.16e-06 ***
SibSp              -0.779355    0.202681     -3.845          0.00012 ***
Parch              -0.454346    0.210032     -2.163           0.03052 *
SibSp_yes       0.886692     0.331693       2.673           0.00751 **
parch_yes        0.903163    0.411734        2.194          0.02827 *
female_yes      2.718656    0.199072      13.657          < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 1186.66  on 890  degrees of freedom
Residual deviance:  776.49  on 883  degrees of freedom
AIC: 792.49.

The model contains variables with significant regression coefficients. The following visualization shows how passengers are divided into survivor and non survivors---



The score axis tells  the probability of survival. Red points indicate the passengers we died in Titanic disaster. We can observe very well that red points are more concentrated towards 0 score which means chances of survival are very bleak or no chance of survival at all. 
The lift curve attached below also shows the accuracy of model classification. The line graph should be as distant as possible fro diagonal. It seems that classification model needs more fine tuning.

FINE TUNING THE MODEL

There are many ways to fine tune the model. Here, we are going to use random forest technique for fine tuning the model. In this technique, 500 decision tree models are created and than best model is selected which has minimum BIC and AIC values. Following result is obtained in random forest modeling---\

randomForest(formula = formula, data = df_for_model, importance = TRUE,      proximity = TRUE, eig = TRUE, ntree = 500, type = "classification") 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 1

          Mean of squared residuals: 0.1411216
                    % Var explained: 40.33

When prediction is estimated for chances of survival of passenger, random forest model creates a vector of probabilities assigned with each passenger. The probabilities are not deterministic hence we need to find a cutoff score. This cutoff score will be a threshold. passengers with probabilities below the cutoff value will be assigned the value "Not Survived=0" and passengers with probabilities equal to or above cutoff value will be assigned the value "Survived=1". There are two ways to find optimal cutoff where maximum accuracy in classification can be obtained--- a) Iterative method and b) ROC method. In iterative method, the cutoff options are changes and model accuracy is calculated again and again. The cutoff is selected where maximum accuracy is obtained. In ROC method, algorithm automatically calculates the cutoff where accuracy is maximum. The ROC method is less time consuming hence we select ROC method to find out optimal cutoff and following results are obtained---

   
  Survived
         |        Predicted
        \/         0       1
         0        516   33
         1        107   235

Area under the curve: 0.8945 


Threshold       Accuracy 
0.3521408      0.8271605 

The area under the curve covers 89% of plot with threshold 0.35 and accuracy 83%. This model is better than logistic regression model. Following table shows the comparison of results----



The accuracy of random forest modeling is 84.29% while accuracy of logistic regression model is 77.89%. Random forest model is definitely chosen because of higher accuracy.

========================================================================
 R Script
=======================================================================

#####$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$#####
######################################################################
######################################################################
###              WHO DID SURVIVE THE TITANIC DISASTER?             ###
######################################################################
######################################################################
#####@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@#####


######################################################################
######################################################################
###                        GETTING STARTED                         ###
######################################################################
######################################################################


########################################################
## Loading the working directory

setwd()

getwd()

#########################################################
## Loading the libraries

library(dplyr)
library(e1071)
library(dplyr)
library(Hmisc)
library(ggplot2)
library(randomForest)
library(caret)
library(pROC)

#########################################################
## Loading the datasets

train=read.csv("train.csv", header = T, stringsAsFactors = F)
test=read.csv("test.csv", header=T, stringsAsFactors = F)

str(train)
str(test)

#########################################################
## Combining the datasets

train$DataType="Train"
test$DataType="Test"
test$Survived=NA

df=rbind(train,test)


######################################################################
######################################################################
###              MISSING VALUE ANALYSIS AND INPUTATIOn             ###
######################################################################
######################################################################

#########################################################
## Calculating the missing values for each column

MV_summary=sapply(df, function(x) sum(is.na(x)))
MV_summary

#########################################################
## Regression for estimating missing value in age variable

str(df)

model_age=lm(Age~Fare, data=df)

summary(model_age)

View(df)

View(test)

df$Sex=ifelse(df$Sex=="male",1,0)


model_age_2=lm(Age~+Pclass+Sex+SibSp+Parch, data=df)

summary(model_age_2)


df$Age[is.na(df$Age)] <- predict(model_age_2, na.rm = T, newdata=df)

df$Age=round(df$Age, digits = 0)

df$Fare=round(df$Fare, digits = 0)

MV_summary=sapply(df, function(x) sum(is.na(x)))
MV_summary

df$Fare[is.na(df$Fare)] <- mean(df$Fare, na.rm = T)

summary(df)

df[925,6]=2

which(df$Age<0)

which(df$Age==0)

which(df$Fare==0)

str(df)

######################################################################
######################################################################
###      Modeling of Response Variable with Predictors             ###
######################################################################
######################################################################



df_for_model=df[c(1:891),c(2,3,5,6,7,8,10)]


formula=as.formula(Survived~Age+Sex+SibSp+Pclass)

#########################################################
## Applying linear regression model


model_lm=lm(formula, data=df_for_model)

summary(model_lm)

#########################################################
## Applying logistic regression model

model_lg=glm(formula, data=df_for_model, family="binomial")

summary(model_lg)

df_for_model$Predicted=predict(model_lg, type="response", newdata = df_for_model)

View(df_for_model)

#########################################################
## Plotting the model predicted with actual variable

ggplot(df_for_model,aes(y=Survived,x=Predicted,color=factor(Survived)))+
  geom_point()+geom_jitter()

#########################################################
## Applying Random Forest technique for modeling

model_rf=randomForest(formula, data=df_for_model, importance=TRUE, proximity=TRUE,eig=TRUE, ntree=500, type="classification")

model_rf

summary(model_rf)

plot(model_rf)

df_for_model$predict_rf=predict(model_rf, type="response", newdata=df_for_model)

#########################################################
## Plotting the random forest model

ggplot(df_for_model,aes(y=Survived,x=predict_rf,color=factor(Survived)))+
  geom_point()+geom_jitter()

#########################################################
## Optimizing the cutoff value by interative procedure

cutoff=0.3521408

df_for_model$P0.10=ifelse(df_for_model$predict_rf<cutoff,0,1)

table(df_for_model$Survived, df_for_model$P0.10)

df_confusion_matrix=cbind(Survived=df_for_model$Survived,Predicted=df_for_model$P0.10)

df_confusion_matrix=as.data.frame(df_confusion_matrix)


df_confusion_matrix$Survived=as.factor(df_confusion_matrix$Survived)
df_confusion_matrix$Predicted=as.factor(df_confusion_matrix$Predicted)

confusionMatrix(df_confusion_matrix$Survived, df_confusion_matrix$Predicted)

model_roc=roc(df_for_model$Survived,df_for_model$predict_rf)

model_roc

plot(model_roc, print.thres="best", print.thres.best.method="closest.topleft", xlim=c(0,1))

result.coords <- coords(model_roc, "best", best.method="closest.topleft", ret=c("threshold", "accuracy"))
print(result.coords)

cutoff=0.3521408

df_for_model$Predicted=ifelse(df_for_model$Predicted<cutoff,0,1)

table(df_for_model$Survived, df_for_model$Predicted)






Monday, 9 April 2018

CASE STUDY 1: Iris Species Classification






Iris is a plant species have different varieties based on it sepal and petal length and breadth. In this article, three species of Iris flowers are classified based on petal and sepal length and breadth are satosa, versicolour and virginica. The dataset is available on following link---

https://archive.ics.uci.edu/ml/datasets/Iris

In this dataset, there are five columns/variables and 150 rows/observations. The four variables are related to Sepal Length, Sepal Width, Petal Length and Petal Width. The fifth variable is categorical variable denoting the species of Iris plant. There are three species of Iris plant in the dataset. Each species contain 50 observations in the dataset. The snapshot of dataset is as follows---

The analysis of Iris data set is done in RStudio. The R Script is shared in this blog at the end. Let us understand how this dataset should be analyzed. The summary statistics of iris dataset attached below tell us about mean, median, 1st quartile, 3rd quartile, minimum and maximum values of the Sepal Length, Sepal Width, Petal Length and Petal Width. 

 Sepal.Length    Sepal.Width     Petal.Length    Petal.Width           
 
Min.   :4.300     Min.   :2.000      Min.   :1.000     Min.     :0.100   
 1st Qu.:5.100    1st Qu.:2.800     1st Qu.:1.600    1st Qu. :0.300
 Median :5.800   Median :3.000   Median :4.350  Median :1.300
 Mean   :5.843    Mean   :3.054    Mean   :3.759   Mean     :1.199                 
 3rd Qu.:6.400    3rd Qu.:3.300   3rd Qu.:5.100    3rd Qu  :1.800                 
 Max.   :7.900     Max.   :4.400    Max.   :6.900     Max.    :2.500  

But the summary statistics of entire dataset does not tell us how three species of Iris are different from each other with respect to these variables.  Let us observe following boxplots of each variable with respect to three categories of Iris species---
 


We can observe from boxplots that Sepal Length, Petal Length and Petal Width clearly discriminate among the three species of Iris flowers. As the Petal Length, Sepal Length and Petal Width increase, the species of Iris transits from setosa to versicolour and from versicolour to virginica

As we are purposely looking for classifying the species based on the first four variables of the dataset, it is important that we should understand the correlation among the four variables. Following corrplot depicts the correlations among the variables---



Correlation plot of variables tells interesting insights about Iris flowers. Sepal Length has weak negative correlation with Sepal Width. While Sepal Length has positive and strong correlation with Petal Length and Petal Width. Sepal Width has negative and moderate correlation with Petal Length and Petal WIdth. 

As there are three categories of Iris flowers, the most appropriate tool for exploring the relationship between species of Iris flower and other variables is Linear Discriminant Analysis. The linear discriminant model summary is as following---

lda(Species ~ ., data = iris, prior = c(1, 1, 1)/3)

Prior probabilities of groups:
      Iris-setosa       Iris-versicolor   Iris-virginica 
      0.3333333       0.3333333       0.3333333 

Group means:
                Sepal.Length   Sepal.Width   Petal.Length   Petal.Width
Iris-setosa            5.006       3.418             1.464             0.244
Iris-versicolor        5.936       2.770            4.260             1.326
Iris-virginica         6.588       2.974             5.552             2.026

Coefficients of linear discriminants:
                    LD1         LD2
Sepal.Length  0.8192685  0.03285975
Sepal.Width   1.5478732  2.15471106
Petal.Length -2.1849406 -0.93024679
Petal.Width  -2.8538500  2.80600460

Proportion of trace:
   LD1    LD2 
0.9915 0.0085

We can observe from LDA model that the mean sepal length for setosa, versicolor and virginica are 5.006, 5.936 and 6.588 respectively. Petal length for setosa, versicolor and virginica are 1.464, 4.26 and 5.552 respectively.  Petal width also clearly discriminates among the three species. The mean petal width of setosa, versicolor and virginica are 0.244, 1.326 and 2.026. Same findings are reflected in graph shown below---




By linear discriminant analysis, we can see that how three varieties of Iris flowers are different from each other and how sepal length and width and petal length and width play significant role. 

========================================================================

R Script

setwd()

getwd()


iris=read.csv("iris.csv", header = TRUE, stringsAsFactors = F)

summary(iris)

boxplot(iris$Sepal.Length~iris$Species,main="Iris Species Vs. Sepal Length")

boxplot(iris$Sepal.Width~iris$Species,main="Iris Species Vs. Sepal Width")

boxplot(iris$Petal.Length~iris$Species,main="Iris Species Vs. Petal Length")


boxplot(iris$Petal.Width~iris$Species,main="Iris Species Vs. Petal Width")


corrplot::corrplot(cor(iris[,-5]),method="pie", type="upper")


library(MASS)

library(car)

library(caret)

str(iris)

model_lda=lda(Species~., data=iris, prior=c(1,1,1)/3)

model_lda

summary(model_lda)

plot(model_lda, dimen=1, type="both")