
Monday, 9 April 2018

CASE STUDY 1: Iris Species Classification






Iris is a genus of flowering plants whose species differ in the length and width of their sepals and petals. In this article, three Iris species, setosa, versicolor and virginica, are classified based on petal and sepal length and width. The dataset is available at the following link:

https://archive.ics.uci.edu/ml/datasets/Iris

This dataset has five columns (variables) and 150 rows (observations). The first four variables are Sepal Length, Sepal Width, Petal Length and Petal Width. The fifth is a categorical variable denoting the species of the Iris plant. There are three species in the dataset, and each species contributes 50 observations. A snapshot of the dataset is as follows:
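A quick way to get such a snapshot in R is sketched below. It assumes R's built-in `iris` data frame as a stand-in for the author's `iris.csv`; the built-in copy labels the species `setosa`, `versicolor` and `virginica` (without the `Iris-` prefix), and its values may differ slightly from the downloaded file.

```r
# Assumption: the built-in `iris` data frame stands in for the author's iris.csv.
data(iris)

dim(iris)            # number of rows and columns
str(iris)            # variable names and types
table(iris$Species)  # number of observations per species
head(iris)           # first six rows as a quick snapshot
```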

The analysis of the Iris dataset is done in RStudio, and the R script is shared at the end of this post. Let us understand how this dataset should be analyzed. The summary statistics below give the minimum, 1st quartile, median, mean, 3rd quartile and maximum of Sepal Length, Sepal Width, Petal Length and Petal Width.

  Sepal.Length    Sepal.Width     Petal.Length    Petal.Width
 Min.   :4.300   Min.   :2.000   Min.   :1.000   Min.   :0.100
 1st Qu.:5.100   1st Qu.:2.800   1st Qu.:1.600   1st Qu.:0.300
 Median :5.800   Median :3.000   Median :4.350   Median :1.300
 Mean   :5.843   Mean   :3.054   Mean   :3.759   Mean   :1.199
 3rd Qu.:6.400   3rd Qu.:3.300   3rd Qu.:5.100   3rd Qu.:1.800
 Max.   :7.900   Max.   :4.400   Max.   :6.900   Max.   :2.500

However, the summary statistics of the entire dataset do not tell us how the three species of Iris differ from each other with respect to these variables. Let us look at the following boxplots of each variable across the three Iris species:
 


We can observe from the boxplots that Sepal Length, Petal Length and Petal Width clearly discriminate among the three species of Iris flowers. As Petal Length, Sepal Length and Petal Width increase, the species shifts from setosa to versicolor and from versicolor to virginica.
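The ordering seen in the boxplots can also be checked numerically. The sketch below is not part of the original script; it assumes R's built-in `iris` data frame (whose values may differ slightly from the author's `iris.csv`) and compares the per-species medians, which are the centre lines of the boxplots.

```r
# Assumption: the built-in `iris` data frame stands in for the author's iris.csv.
data(iris)

# Median of every measurement within each species
species_medians <- aggregate(. ~ Species, data = iris, FUN = median)
print(species_medians)

# TRUE where the median strictly increases from setosa to versicolor to virginica
increases <- sapply(species_medians[, -1], function(m) all(diff(m) > 0))
print(increases)
```

A variable flagged `FALSE` here does not follow the setosa, versicolor, virginica ordering, which is one way to see why Sepal Width is left out of the observation above.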

Since we aim to classify the species based on the first four variables of the dataset, it is important to understand the correlations among these variables. The following corrplot depicts them:



The correlation plot reveals interesting insights about Iris flowers. Sepal Length has a weak negative correlation with Sepal Width, while it has a strong positive correlation with Petal Length and Petal Width. Sepal Width has a moderate negative correlation with Petal Length and Petal Width.
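The numbers behind the plot can be printed directly with base R. This is a minimal sketch that again assumes the built-in `iris` data frame rather than the author's `iris.csv`, so the coefficients may differ slightly from those in the plot.

```r
# Assumption: the built-in `iris` data frame stands in for the author's iris.csv.
data(iris)

# Pearson correlations among the four numeric variables (Species is column 5)
cor_matrix <- cor(iris[, 1:4])
print(round(cor_matrix, 2))
```

Reading the matrix row by row gives the same picture as the pie chart: the sign gives the direction of the relationship and the absolute value gives its strength.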

As there are three categories of Iris flowers, a suitable tool for exploring the relationship between the species and the other variables is Linear Discriminant Analysis (LDA). The summary of the linear discriminant model is as follows:

lda(Species ~ ., data = iris, prior = c(1, 1, 1)/3)

Prior probabilities of groups:
    Iris-setosa Iris-versicolor  Iris-virginica
      0.3333333       0.3333333       0.3333333

Group means:
                Sepal.Length Sepal.Width Petal.Length Petal.Width
Iris-setosa            5.006       3.418        1.464       0.244
Iris-versicolor        5.936       2.770        4.260       1.326
Iris-virginica         6.588       2.974        5.552       2.026

Coefficients of linear discriminants:
                    LD1         LD2
Sepal.Length  0.8192685  0.03285975
Sepal.Width   1.5478732  2.15471106
Petal.Length -2.1849406 -0.93024679
Petal.Width  -2.8538500  2.80600460

Proportion of trace:
   LD1    LD2 
0.9915 0.0085
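To make the last two parts of this output more concrete, the sketch below recomputes the proportion of trace and the discriminant scores by hand. It assumes the built-in `iris` data frame in place of the author's `iris.csv`, so the figures may differ slightly from those printed above; `fit`, `prop_trace` and `scores` are illustrative names.

```r
library(MASS)  # provides lda()

# Assumption: the built-in `iris` data frame stands in for the author's iris.csv.
data(iris)
fit <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)

# Proportion of trace: squared singular values, normalised to sum to 1.
# It is commonly read as the share of between-group separation per discriminant.
prop_trace <- fit$svd^2 / sum(fit$svd^2)
print(round(prop_trace, 4))

# Discriminant scores: centre each observation on the prior-weighted mean of
# the group means, then project onto the coefficients of linear discriminants.
X <- as.matrix(iris[, 1:4])
center <- colSums(fit$prior * fit$means)
scores <- scale(X, center = center, scale = FALSE) %*% fit$scaling
head(scores)
```

The hand-computed `scores` should agree with `predict(fit)$x`, which is a convenient way to confirm how the LD1 and LD2 coefficients are applied.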

We can observe from the LDA model that the mean sepal lengths of setosa, versicolor and virginica are 5.006, 5.936 and 6.588 respectively. The mean petal lengths are 1.464, 4.260 and 5.552 respectively. Petal width also clearly discriminates among the three species: the mean petal widths of setosa, versicolor and virginica are 0.244, 1.326 and 2.026. The same findings are reflected in the graph shown below:




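Linear discriminant analysis thus shows how the three Iris species differ from each other and how sepal length and width and petal length and width contribute to separating them. The first discriminant (LD1) alone has a proportion of trace of 0.9915, so almost all of the separation lies along a single direction.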
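Since the goal is classification, one natural follow-up is to see how well the fitted model assigns each flower to its species. The sketch below is not part of the original script; it assumes the built-in `iris` data frame in place of the author's `iris.csv` and scores the model on the same data it was fitted to, so the resulting accuracy is optimistic compared with what a held-out test set would give.

```r
library(MASS)  # provides lda()

# Assumption: the built-in `iris` data frame stands in for the author's iris.csv.
data(iris)
fit <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)

# Predicted species for the training observations
predicted <- predict(fit)$class

# Confusion matrix: rows are actual species, columns are predicted species
confusion <- table(Actual = iris$Species, Predicted = predicted)
print(confusion)

# Share of observations on the diagonal, i.e. classified correctly
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```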

========================================================================

R Script

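# Set the working directory to the folder that contains iris.csv, e.g.
# setwd("<folder containing iris.csv>")
getwd()

library(MASS)      # provides lda()
library(corrplot)  # provides corrplot()

# Read Species as a factor so that it is treated as the grouping variable
iris <- read.csv("iris.csv", header = TRUE, stringsAsFactors = TRUE)

# Structure and summary statistics of the dataset
str(iris)
summary(iris)

# Boxplots of each variable by species
boxplot(Sepal.Length ~ Species, data = iris, main = "Iris Species Vs. Sepal Length")
boxplot(Sepal.Width ~ Species, data = iris, main = "Iris Species Vs. Sepal Width")
boxplot(Petal.Length ~ Species, data = iris, main = "Iris Species Vs. Petal Length")
boxplot(Petal.Width ~ Species, data = iris, main = "Iris Species Vs. Petal Width")

# Correlation plot of the four numeric variables (column 5 is Species)
corrplot(cor(iris[, -5]), method = "pie", type = "upper")

# Linear discriminant analysis with equal prior probabilities
model_lda <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)
model_lda

# Histograms and density curves of the first discriminant for each species
plot(model_lda, dimen = 1, type = "both")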

