The sinking of the Titanic is the classic man-made disaster. The celebrated ship, hailed as a triumph of engineering, sank after striking an iceberg on its maiden voyage. The disaster was all the more grievous for passengers because the ship did not carry enough lifeboats.
The ship carried passengers of several ticket classes, both male and female, and of all ages from children to the elderly. The dataset we are going to analyze in this case study contains each passenger's details along with whether the passenger survived the disaster. Let us look at the dataset---
The dataset contains details of 891 passengers, each described by 12 columns (variables). The variables and their definitions are given in the following table---
The column "Survived" indicates whether a passenger survived the disaster (1 = survived, 0 = did not survive). Out of 891 passengers, 342 survived and 549 died.
The questions to be answered in order to find the survival pattern among passengers, together with the corresponding hypotheses, statistical tests, results, and interpretations, are as follows---
a) Whether "Ticket class" has any influence on chances of survival?
Ho : The survival of the ith passenger is independent of the ticket class of the ith passenger
H1 : The survival of the ith passenger depends on the ticket class of the ith passenger
Statistical tool: Chi Squared Test of independence
Result and Interpretation:
> table(titanic$Survived, titanic$Pclass)
1 2 3
0 80 97 372
1 136 87 119
We can observe from the table that class 1 passengers have a higher proportion of survivors, while class 3 passengers account for the most deaths.
> chisq.test(table(titanic$Survived, titanic$Pclass))
Pearson's Chi-squared test
data: table(titanic$Survived, titanic$Pclass)
X-squared = 102.89, df = 2, p-value < 2.2e-16
The p-value is far below 0.05, so the Chi-squared test confirms that ticket class influences a passenger's chances of survival: class 1 passengers were safer, while class 3 passengers were the most vulnerable.
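As a cross-check, the printed counts alone are enough to reproduce the test and to quantify the per-class survival proportions. The following is a sketch in base R, with the table values copied from the output above:

```r
# Rebuild the 2x3 contingency table from the printed counts
tab <- matrix(c( 80,  97, 372,
                136,  87, 119),
              nrow = 2, byrow = TRUE,
              dimnames = list(Survived = c("0", "1"),
                              Pclass   = c("1", "2", "3")))

# Survival proportion within each class
prop.table(tab, margin = 2)["1", ]   # class 1: ~63%, class 3: ~24%

# The test statistic is reproduced from the counts alone
chisq.test(tab)                      # X-squared = 102.89, df = 2
```

The per-class proportions make the direction of the dependence explicit, which the omnibus chi-square statistic alone does not show.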
b) Whether gender of passenger has any influence on chances of survival?
Ho : The survival of the ith passenger is independent of the gender of the ith passenger
H1 : The survival of the ith passenger depends on the gender of the ith passenger
Statistical tool: Chi Squared Test of independence
Result and Interpretations:
> table(titanic$Survived, titanic$Sex)
female male
0 81 468
1 233 109
We can observe that female passengers had a much higher chance of survival than their male fellow passengers.
> chisq.test(table(titanic$Survived, titanic$Sex))
Pearson's Chi-squared test with Yates' continuity correction
data: table(titanic$Survived, titanic$Sex)
X-squared = 260.72, df = 1, p-value < 2.2e-16
The Chi-squared test result (p-value < 0.05) shows that the chances of survival depend on the sex of the passenger.
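The same check works for the 2x2 gender table; note that for 2x2 tables R applies Yates' continuity correction by default, which is why the output above mentions it. A sketch using the counts printed above:

```r
# Rebuild the 2x2 contingency table from the printed counts
tab <- matrix(c( 81, 468,
                233, 109),
              nrow = 2, byrow = TRUE,
              dimnames = list(Survived = c("0", "1"),
                              Sex      = c("female", "male")))

prop.table(tab, margin = 2)["1", ]   # female: ~74%, male: ~19%
chisq.test(tab)                      # Yates-corrected, X-squared = 260.72
```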
c) Whether age of passenger has any influence on chances of survival?
Ho : There is no significant difference between the mean ages of survivors and non-survivors.
H1 : There is a significant difference between the mean ages of survivors and non-survivors.
Statistical tool: Two sample t-test
Before performing the two sample t-test, the variances of age for survived and dead passengers must be compared to check whether they are equal. The F test gives the following result---
Result and Interpretations---
> var.test(titanic_survived$Age, titanic_dead$Age)
F test to compare two variances
data: titanic_survived$Age and titanic_dead$Age
F = 1.1129, num df = 289, denom df = 423, p-value = 0.317
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
0.9023453 1.3785538
sample estimates:
ratio of variances
1.112932
The p-value is greater than 0.05, hence the variances of age in the two groups can be treated as equal (homogeneous), and the pooled two sample t-test can be used.
> t.test(titanic_survived$Age, titanic_dead$Age, var.equal = TRUE, paired = FALSE)
Two Sample t-test
data: titanic_survived$Age and titanic_dead$Age
t = -2.0667, df = 712, p-value = 0.03912
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-4.450798 -0.114181
sample estimates:
mean of x mean of y
28.34369 30.62618
We can observe that the difference in mean age between survivors and non-survivors is significant at the 0.05 level: survivors were, on average, about two years younger.
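The two-step procedure above (F test, then the matching t-test) can be sketched on simulated data. The means, the common standard deviation of 14 years, and the group sizes (290 and 424, matching the degrees of freedom in the F test output) are assumptions for illustration only:

```r
set.seed(1)
age_survived <- rnorm(290, mean = 28.3, sd = 14)   # hypothetical ages
age_dead     <- rnorm(424, mean = 30.6, sd = 14)

# Step 1: F test for equality of variances
f_res <- var.test(age_survived, age_dead)

# Step 2: pooled t-test if variances look equal, Welch's t-test otherwise
t_res <- t.test(age_survived, age_dead,
                var.equal = f_res$p.value > 0.05)
t_res
```

Driving `var.equal` directly from the F test's p-value makes the decision rule described in the text explicit in code.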
d) The influence of having siblings or a spouse aboard, and of their number, on passengers' survival chances needs to be studied thoroughly. It will be done in two parts---
i. Whether having a sibling or spouse has any influence on chances of survival?
Ho : The survival of the ith passenger is independent of having siblings or a spouse
H1 : The survival of the ith passenger depends on having siblings or a spouse
Statistical tool: Chi Squared Test of independence
Result and Interpretations---
> table(titanic$Survived, titanic$SibSp_yes)
0 1
0 398 151
1 210 132
> chisq.test(table(titanic$Survived, titanic$SibSp_yes))
Pearson's Chi-squared test with Yates' continuity correction
data: table(titanic$Survived, titanic$SibSp_yes)
X-squared = 11.456, df = 1, p-value = 0.0007128
The p-value is much less than 0.05, hence the null hypothesis is rejected: a passenger's survival chances depend on whether they had siblings or a spouse aboard.
ii. Whether the number of siblings or spouses has any influence on chances of survival?
To answer this question, we can observe the following tables---
> table(titanic$SibSp)
0 1 2 3 4 5 8
608 209 28 16 18 5 7
> table(titanic$Survived, titanic$SibSp)
0 1 2 3 4 5 8
0 398 97 15 12 15 5 7
1 210 112 13 4 3 0 0
We can see that passengers travelling with no sibling or spouse had a noticeably lower survival rate, while those with exactly one sibling or spouse had the best chances of survival; survival rates drop again for larger counts.
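The survival rate for each SibSp value can be read off directly by converting the printed counts to column proportions. A base-R sketch using the counts above:

```r
# Rebuild the Survived x SibSp table from the printed counts
tab <- matrix(c(398,  97, 15, 12, 15, 5, 7,
                210, 112, 13,  4,  3, 0, 0),
              nrow = 2, byrow = TRUE,
              dimnames = list(Survived = c("0", "1"),
                              SibSp    = c(0, 1, 2, 3, 4, 5, 8)))

# Survival rate by number of siblings/spouses aboard:
# peaks at SibSp = 1 (~54%) versus ~35% for passengers travelling alone
round(prop.table(tab, margin = 2)["1", ], 2)
```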
e) The influence of having parents or children aboard, and of their number, on passengers' survival chances needs to be studied thoroughly. It will be done in two parts---
i. Whether having parents or children has any influence on chances of survival?
Ho : The survival of the ith passenger is independent of having parents or children
H1 : The survival of the ith passenger depends on having parents or children
Statistical tool: Chi Squared Test of independence
Result and Interpretations---
> table(titanic$Survived, titanic$parch_yes)
0 1
0 445 104
1 233 109
> chisq.test(table(titanic$Survived, titanic$parch_yes))
Pearson's Chi-squared test with Yates' continuity correction
data: table(titanic$Survived, titanic$parch_yes)
X-squared = 18.656, df = 1, p-value = 1.565e-05
It is evident from the Chi-squared test (p-value < 0.05) that having parents or children aboard influences a passenger's chances of survival.
ii. Whether the number of parents or children has any influence on chances of survival?
The tables below show how the number of parents or children influences the chances of survival---
> table(titanic$Parch)
0 1 2 3 4 5 6
678 118 80 5 4 5 1
> table(titanic$Survived, titanic$Parch)
0 1 2 3 4 5 6
0 445 53 40 2 4 4 1
1 233 65 40 3 0 1 0
We can observe that travelling with no parents or children lowers the chance of survival (about 34%), while having one parent or child aboard raises it (about 55%); with two the rate is about 50%, and the counts for larger families are too small to judge.
f) Whether the fare paid by a passenger has any influence on chances of survival?
Ho : There is no significant difference between the mean fares of survivors and non-survivors.
H1 : There is a significant difference between the mean fares of survivors and non-survivors.
Statistical tool: Two sample t-test
Result and Interpretations---
> var.test(titanic_survived$Fare, titanic_dead$Fare)
F test to compare two variances
data: titanic_survived$Fare and titanic_dead$Fare
F = 4.5017, num df = 341, denom df = 548, p-value < 2.2e-16
alternative hypothesis: true ratio of variances is not equal to 1
95 percent confidence interval:
3.725366 5.463382
sample estimates:
ratio of variances
4.501697
The F test result shows that the variances of fare for survived and non-survived passengers are not homogeneous. Hence, Welch's two sample t-test, which does not assume equal variances, is applied.
> t.test(titanic_survived$Fare, titanic_dead$Fare, var.equal = FALSE, paired=FALSE)
Welch Two Sample t-test
data: titanic_survived$Fare and titanic_dead$Fare
t = 6.8391, df = 436.7, p-value = 2.699e-11
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
18.72592 33.82912
sample estimates:
mean of x mean of y
48.39541 22.11789
The p-value is less than 0.05, hence the mean fares of survived and non-survived passengers differ significantly: the mean fare of survivors is $48.40 against $22.12 for non-survivors. There is clear evidence that passengers who paid lower fares had far lower chances of survival.
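The unequal-variance branch of this workflow can be sketched on simulated fares. The log-normal shapes and the parameters below are purely illustrative assumptions, chosen so that the two groups have clearly different variances:

```r
set.seed(7)
fare_survived <- rlnorm(342, meanlog = 3.2, sdlog = 1.0)  # hypothetical fares
fare_dead     <- rlnorm(549, meanlog = 2.8, sdlog = 0.6)

# The F test flags the variances as unequal...
f_res <- var.test(fare_survived, fare_dead)

# ...so Welch's t-test (var.equal = FALSE) is the appropriate comparison
w_res <- t.test(fare_survived, fare_dead, var.equal = FALSE, paired = FALSE)
w_res
```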
LOGISTIC REGRESSION MODELING
Each passenger on the Titanic either survived or died in the disaster, so in modeling terms the outcome is binary. The other variables in the dataset are the independent variables influencing the probability of a passenger's survival.
The independent variables considered are Pclass, Age, SibSp, Parch, SibSp_yes, parch_yes and female_yes. Pclass has three categories. Age is a continuous variable. SibSp is the number of siblings or spouses aboard, and Parch is the number of parents or children aboard. SibSp_yes is a dichotomous variable indicating whether the passenger had a sibling or spouse aboard; similarly, parch_yes indicates whether the passenger had a parent or child aboard. female_yes indicates whether the passenger is female.
Logistic regression modeling results are as follows---
> summary(glm_model)
Call:
glm(formula = Survived ~ . - Embarked - Fare, family = "binomial",
data = titanic_sub)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.7440 -0.6168 -0.4328 0.5865 2.6719
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 2.090042 0.444176 4.705 2.53e-06 ***
Pclass -1.093903 0.121616 -8.995 < 2e-16 ***
Age -0.037635 0.008076 -4.660 3.16e-06 ***
SibSp -0.779355 0.202681 -3.845 0.00012 ***
Parch -0.454346 0.210032 -2.163 0.03052 *
SibSp_yes 0.886692 0.331693 2.673 0.00751 **
parch_yes 0.903163 0.411734 2.194 0.02827 *
female_yes 2.718656 0.199072 13.657 < 2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1186.66 on 890 degrees of freedom
Residual deviance: 776.49 on 883 degrees of freedom
AIC: 792.49
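Because the link is logistic, each coefficient above translates into an odds ratio via exp(). The snippet below does this and also scores one hypothetical passenger (a 30-year-old woman in class 1 travelling alone; an illustrative profile, not a record from the data):

```r
# Fitted coefficients copied from the summary above
b <- c(Intercept  =  2.090042, Pclass     = -1.093903,
       Age        = -0.037635, SibSp      = -0.779355,
       Parch      = -0.454346, SibSp_yes  =  0.886692,
       parch_yes  =  0.903163, female_yes =  2.718656)

# Odds ratio: being female multiplies the odds of survival by ~15
round(exp(b["female_yes"]), 2)        # ~15.16

# Predicted survival probability for the hypothetical passenger
eta <- b["Intercept"] + b["Pclass"] * 1 + b["Age"] * 30 + b["female_yes"] * 1
p   <- 1 / (1 + exp(-eta))            # ~0.93
```

Reading coefficients on the odds scale makes the magnitude of the gender effect far more tangible than the raw log-odds value of 2.72.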
All variables retained in the model have significant regression coefficients. The following visualization shows how passengers are separated into survivors and non-survivors---
The score axis gives the predicted probability of survival. Red points indicate the passengers who died in the disaster. We can observe that the red points are concentrated towards a score of 0, i.e. towards very bleak or no chance of survival.
The lift curve attached below also shows the accuracy of the model's classification. The curve should be as distant as possible from the diagonal; it seems the classification model needs more fine tuning.
FINE TUNING THE MODEL
There are many ways to fine tune a model. Here, we use the random forest technique: 500 decision trees are grown and their predictions are aggregated, which generally classifies more accurately than any single tree. The following result is obtained from random forest modeling---
randomForest(formula = formula, data = df_for_model, importance = TRUE, proximity = TRUE, eig = TRUE, ntree = 500, type = "classification")
Type of random forest: regression
Number of trees: 500
No. of variables tried at each split: 1
Mean of squared residuals: 0.1411216
% Var explained: 40.33
When predicting a passenger's chances of survival, the random forest model produces a vector of probabilities, one per passenger. These probabilities are not class labels in themselves, hence we need a cutoff score, i.e. a threshold: passengers with probabilities below the cutoff are assigned "Not Survived = 0", and passengers with probabilities at or above the cutoff are assigned "Survived = 1".
There are two ways to find the optimal cutoff, the one that maximizes classification accuracy: a) the iterative method and b) the ROC method. In the iterative method, candidate cutoffs are tried one after another and model accuracy is recalculated each time; the cutoff giving maximum accuracy is selected. In the ROC method, the algorithm calculates the optimal cutoff automatically. The ROC method is less time consuming, hence we select it and obtain the following results---
            Survived
Predicted      0     1
        0    516    33
        1    107   235
Area under the curve: 0.8945
 Threshold  Accuracy
 0.3521408 0.8271605
The ROC curve covers about 89% of the plot area; the optimal cutoff is about 0.35, at which the reported classification accuracy is about 83%. This model performs better than the logistic regression model.
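As a cross-check, the overall accuracy implied by the confusion matrix above can be recomputed directly from its counts (correct classifications on the diagonal, divided by all 891 passengers):

```r
# Confusion matrix counts copied from the output above
cm <- matrix(c(516,  33,
               107, 235),
             nrow = 2, byrow = TRUE)

accuracy <- sum(diag(cm)) / sum(cm)   # (516 + 235) / 891
round(accuracy, 4)                    # 0.8429
```

This reproduces the 84.29% accuracy figure used in the model comparison.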
The accuracy of the random forest model is 84.29%, while the accuracy of the logistic regression model is 77.89%. The random forest model is therefore chosen for its higher accuracy.
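For completeness, the iterative cutoff search described earlier can be sketched as follows; the simulated labels and scores are hypothetical, purely to show the mechanics:

```r
set.seed(42)
actual <- rbinom(300, 1, 0.4)                 # hypothetical outcomes
score  <- ifelse(actual == 1,                 # hypothetical model scores
                 rbeta(300, 4, 2), rbeta(300, 2, 4))

# Try a grid of cutoffs and keep the one with maximum accuracy
cutoffs <- seq(0.05, 0.95, by = 0.05)
acc     <- sapply(cutoffs, function(k) mean((score >= k) == actual))
best    <- cutoffs[which.max(acc)]
c(best_cutoff = best, best_accuracy = max(acc))
```

The ROC method reaches an equivalent answer in one pass, which is why it was preferred above.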
========================================================================
R Script
=======================================================================
#####$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$#####
######################################################################
######################################################################
### WHO DID SURVIVE THE TITANIC DISASTER? ###
######################################################################
######################################################################
#####@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@#####
######################################################################
######################################################################
### GETTING STARTED ###
######################################################################
######################################################################
########################################################
## Setting the working directory (supply your own path in setwd())
setwd()
getwd()
#########################################################
## Loading the libraries
library(dplyr)
library(e1071)
library(Hmisc)
library(ggplot2)
library(randomForest)
library(caret)
library(pROC)
#########################################################
## Loading the datasets
train=read.csv("train.csv", header = T, stringsAsFactors = F)
test=read.csv("test.csv", header=T, stringsAsFactors = F)
str(train)
str(test)
#########################################################
## Combining the datasets
train$DataType="Train"
test$DataType="Test"
test$Survived=NA
df=rbind(train,test)
######################################################################
######################################################################
### MISSING VALUE ANALYSIS AND IMPUTATION ###
######################################################################
######################################################################
#########################################################
## Calculating the missing values for each column
MV_summary=sapply(df, function(x) sum(is.na(x)))
MV_summary
#########################################################
## Regression for estimating missing value in age variable
str(df)
model_age=lm(Age~Fare, data=df)
summary(model_age)
View(df)
View(test)
df$Sex=ifelse(df$Sex=="male",1,0)
model_age_2=lm(Age~+Pclass+Sex+SibSp+Parch, data=df)
summary(model_age_2)
## Impute missing ages with predictions for the missing rows only
## (predicting over the whole of df would give a length mismatch)
df$Age[is.na(df$Age)] <- predict(model_age_2, newdata = df[is.na(df$Age), ])
df$Age=round(df$Age, digits = 0)
df$Fare=round(df$Fare, digits = 0)
MV_summary=sapply(df, function(x) sum(is.na(x)))
MV_summary
df$Fare[is.na(df$Fare)] <- mean(df$Fare, na.rm = T)
summary(df)
## Manually correcting an implausible imputed value in the Age column (row 925)
df[925,6]=2
## Checking for implausible imputed values
which(df$Age<0)
which(df$Age==0)
which(df$Fare==0)
str(df)
######################################################################
######################################################################
### Modeling of Response Variable with Predictors ###
######################################################################
######################################################################
df_for_model=df[c(1:891),c(2,3,5,6,7,8,10)]
formula=as.formula(Survived~Age+Sex+SibSp+Pclass)
#########################################################
## Applying linear regression model
model_lm=lm(formula, data=df_for_model)
summary(model_lm)
#########################################################
## Applying logistic regression model
model_lg=glm(formula, data=df_for_model, family="binomial")
summary(model_lg)
df_for_model$Predicted=predict(model_lg, type="response", newdata = df_for_model)
View(df_for_model)
#########################################################
## Plotting the model predicted with actual variable
ggplot(df_for_model,aes(y=Survived,x=Predicted,color=factor(Survived)))+
geom_point()+geom_jitter()
#########################################################
## Applying the Random Forest technique for modeling
## (Survived is numeric here, so randomForest fits a regression forest;
## this is why the printed output reports "Type of random forest: regression")
model_rf=randomForest(formula, data=df_for_model, importance=TRUE, proximity=TRUE,eig=TRUE, ntree=500, type="classification")
model_rf
summary(model_rf)
plot(model_rf)
df_for_model$predict_rf=predict(model_rf, type="response", newdata=df_for_model)
#########################################################
## Plotting the random forest model
ggplot(df_for_model,aes(y=Survived,x=predict_rf,color=factor(Survived)))+
geom_point()+geom_jitter()
#########################################################
## Applying the cutoff value (the optimal threshold found by the ROC analysis below)
cutoff=0.3521408
df_for_model$P0.10=ifelse(df_for_model$predict_rf<cutoff,0,1)
table(df_for_model$Survived, df_for_model$P0.10)
df_confusion_matrix=cbind(Survived=df_for_model$Survived,Predicted=df_for_model$P0.10)
df_confusion_matrix=as.data.frame(df_confusion_matrix)
df_confusion_matrix$Survived=as.factor(df_confusion_matrix$Survived)
df_confusion_matrix$Predicted=as.factor(df_confusion_matrix$Predicted)
confusionMatrix(df_confusion_matrix$Survived, df_confusion_matrix$Predicted)
model_roc=roc(df_for_model$Survived,df_for_model$predict_rf)
model_roc
plot(model_roc, print.thres="best", print.thres.best.method="closest.topleft", xlim=c(0,1))
result.coords <- coords(model_roc, "best", best.method="closest.topleft", ret=c("threshold", "accuracy"))
print(result.coords)
cutoff=0.3521408
df_for_model$Predicted=ifelse(df_for_model$Predicted<cutoff,0,1)
table(df_for_model$Survived, df_for_model$Predicted)