Iris is a genus of flowering plants whose species differ in the length and width of their sepals and petals. In this article, three Iris species (setosa, versicolor and virginica) are classified based on sepal and petal length and width. The dataset is available at the following link:
https://archive.ics.uci.edu/ml/datasets/Iris
The dataset has five columns (variables) and 150 rows (observations). The first
four variables are Sepal Length, Sepal Width, Petal Length and Petal Width. The
fifth is a categorical variable giving the species of the Iris plant.
There are three species in the dataset, and each species has 50
observations. A snapshot of the dataset is as follows:
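For readers who want to inspect the data without downloading the file, a minimal sketch follows. It assumes R's built-in `iris` data frame rather than the CSV used in this article. The built-in copy labels the species `setosa`, `versicolor` and `virginica` (without the `Iris-` prefix), and a few of its values may differ slightly from the UCI file.

```r
# Assumption: use the built-in datasets::iris instead of the article's iris.csv
data(iris)

dim(iris)            # number of rows and columns
str(iris)            # four numeric columns plus the Species factor
head(iris)           # first six observations
table(iris$Species)  # number of observations per species
```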
The analysis of the Iris dataset was done in RStudio, and the R script is shared at the end of this post. Let us see how the dataset can be analyzed. The summary statistics
below give the minimum, 1st quartile, median, mean, 3rd quartile and maximum
of Sepal Length, Sepal Width, Petal Length and Petal Width.
  Sepal.Length     Sepal.Width      Petal.Length     Petal.Width
 Min.   :4.300    Min.   :2.000    Min.   :1.000    Min.   :0.100
 1st Qu.:5.100    1st Qu.:2.800    1st Qu.:1.600    1st Qu.:0.300
 Median :5.800    Median :3.000    Median :4.350    Median :1.300
 Mean   :5.843    Mean   :3.054    Mean   :3.759    Mean   :1.199
 3rd Qu.:6.400    3rd Qu.:3.300    3rd Qu.:5.100    3rd Qu.:1.800
 Max.   :7.900    Max.   :4.400    Max.   :6.900    Max.   :2.500
However, the summary
statistics of the entire dataset do not tell us how the three species of Iris
differ from each other with respect to these variables. Let us look at the
following boxplots of each variable, grouped by Iris
species:
The boxplots show that
Sepal Length, Petal Length and Petal Width clearly discriminate among the three
species of Iris flowers. All three measurements
increase from setosa to versicolor and
from versicolor to virginica.
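The per-species differences seen in the boxplots can also be checked numerically. The sketch below is an illustration only: it assumes the built-in `iris` data frame (so the numbers may differ slightly from the CSV used here), and the names `species_means` and `species_medians` are chosen for this example.

```r
# Assumption: built-in datasets::iris, not the article's iris.csv
data(iris)

# Mean and median of every measurement within each species
species_means   <- aggregate(. ~ Species, data = iris, FUN = mean)
species_medians <- aggregate(. ~ Species, data = iris, FUN = median)

print(species_means)
print(species_medians)
```

Each result has one row per species, which makes the ordering setosa, versicolor, virginica easy to compare across variables.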
Since the aim is
to classify the species using the first four variables of the
dataset, it is important to understand how these
four variables are correlated. The following correlation plot (corrplot) depicts the correlations among the
variables:
The correlation plot
gives interesting insights about Iris flowers. Sepal Length has a weak
negative correlation with Sepal Width, but a strong
positive correlation with Petal Length and Petal Width. Sepal Width has a moderate
negative correlation with Petal Length and Petal Width.
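The correlations described above can be read as numbers rather than from a plot. This is a small sketch that assumes the built-in `iris` data frame, so the values may differ slightly from those of the CSV used in this article. `cor_matrix` is a name chosen for the example.

```r
# Assumption: built-in datasets::iris, not the article's iris.csv
data(iris)

# Pearson correlations among the four numeric variables (Species is excluded)
cor_matrix <- cor(iris[, 1:4])
round(cor_matrix, 2)
```

Note that these correlations are pooled over all three species, so they partly reflect differences between species rather than relationships within one species.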
As there are three
categories of Iris flowers, Linear Discriminant Analysis (LDA) is an appropriate
tool for exploring the relationship between the species and the other
variables. The summary of the linear discriminant model is as follows:
lda(Species ~ ., data = iris, prior = c(1, 1, 1)/3)

Prior probabilities of groups:
    Iris-setosa Iris-versicolor  Iris-virginica
      0.3333333       0.3333333       0.3333333

Group means:
                Sepal.Length Sepal.Width Petal.Length Petal.Width
Iris-setosa            5.006       3.418        1.464       0.244
Iris-versicolor        5.936       2.770        4.260       1.326
Iris-virginica         6.588       2.974        5.552       2.026

Coefficients of linear discriminants:
                    LD1         LD2
Sepal.Length  0.8192685  0.03285975
Sepal.Width   1.5478732  2.15471106
Petal.Length -2.1849406 -0.93024679
Petal.Width  -2.8538500  2.80600460

Proportion of trace:
   LD1    LD2
0.9915 0.0085
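The proportion of trace reported in the model summary can be recomputed from the fitted object. The sketch below assumes the built-in `iris` data frame and the MASS package, so the result may differ slightly from the figures printed for the CSV. `fit` and `prop_trace` are names chosen for this example.

```r
library(MASS)  # provides lda()

# Assumption: built-in datasets::iris, not the article's iris.csv
data(iris)
fit <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)

# fit$svd holds one singular value per discriminant; the share of each
# squared value in their total is the proportion of trace
prop_trace <- fit$svd^2 / sum(fit$svd^2)
round(prop_trace, 4)
```

A value close to 1 for the first element means that LD1 carries almost all of the separation between the species.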
The group means in the
LDA model show that the mean sepal lengths of setosa, versicolor and virginica are
5.006, 5.936 and 6.588 respectively. The mean petal lengths of setosa,
versicolor and virginica are 1.464, 4.260 and 5.552
respectively. Petal width also clearly discriminates among the three
species: its means for setosa, versicolor and virginica are 0.244,
1.326 and 2.026. The same findings are reflected in the graph shown below:
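Linear
discriminant analysis thus shows how the three species of Iris flowers
differ from each other and how sepal length and width and petal length and
width contribute to separating them. The first discriminant (LD1) alone accounts for 99.15% of the between-group separation (proportion of trace).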
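Since the stated goal is classification, one way to check how well the fitted model separates the species is to compare predicted and actual labels. This is a hedged sketch, not part of the original analysis: it assumes the built-in `iris` data frame and the MASS package, and the names `fit`, `pred`, `confusion` and `accuracy` are chosen for this example.

```r
library(MASS)  # provides lda() and its predict() method

# Assumption: built-in datasets::iris, not the article's iris.csv
data(iris)
fit <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)

# Predict the species of the same observations the model was fitted on
pred <- predict(fit)

# Cross-tabulate actual against predicted species
confusion <- table(Actual = iris$Species, Predicted = pred$class)
print(confusion)

# Share of observations on the diagonal, i.e. classified correctly
accuracy <- sum(diag(confusion)) / sum(confusion)
accuracy
```

Because the model is evaluated on the data it was trained on, this accuracy is optimistic. Calling `lda()` with `CV = TRUE` returns leave-one-out cross-validated classes, which give a more honest estimate.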
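# R script used for the analysis
library(MASS)      # provides lda()
library(corrplot)  # provides corrplot()

# setwd("path/to/folder")  # hypothetical path: point it at the folder containing iris.csv
getwd()

# Read the data; Species is read as a factor so it can be used as the grouping variable
iris <- read.csv("iris.csv", header = TRUE, stringsAsFactors = TRUE)
str(iris)
summary(iris)

# Boxplots of each measurement by species
boxplot(Sepal.Length ~ Species, data = iris, main = "Iris Species Vs. Sepal Length")
boxplot(Sepal.Width ~ Species, data = iris, main = "Iris Species Vs. Sepal Width")
boxplot(Petal.Length ~ Species, data = iris, main = "Iris Species Vs. Petal Length")
boxplot(Petal.Width ~ Species, data = iris, main = "Iris Species Vs. Petal Width")

# Correlations among the four numeric variables (column 5 is Species)
corrplot(cor(iris[, -5]), method = "pie", type = "upper")

# Linear discriminant analysis with equal prior probabilities
model_lda <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)
model_lda
plot(model_lda, dimen = 1, type = "both")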