Making Sense of Data: A Practical Guide to Exploratory Data Analysis and Data Mining
Exploratory Data Analysis And DATA MINING
Presentation of E.D.A and D.M submitted in partial fulfillment of the requirements for the course of
Raj Kumar 411MA5058
Department of Mathematics, N.I.T. Rourkela
Exploratory Data Analysis and Data Mining: Data mining is a cross-disciplinary field that applies techniques from computer science, statistics, and machine learning to extract knowledge from a given dataset, which can then be used for prediction or classification. Any data mining project goes through a cyclic execution process, presented in the following diagram.
Data Integration and Selection → Data Cleaning and E.D.A. → Statistical Results and Data Transformation → Data Mining and Pattern Evaluation → Knowledge Discovery → reiteration if the results are not productive enough.
• Data Integration
• Data Selection
• Data Cleaning
• Exploratory Data Analysis
• Data Transformation
• Data Mining (linear regression, neural networks)
• Pattern Evaluation
• Deployment
• Reiteration
Task Statement: To develop a model on the iris dataset.
Iris Dataset: The iris flower dataset is a multivariate dataset used by the statistician Ronald Fisher to develop a classification model. The flower has three subspecies; the task is to classify the given observations into these subcategories using the measured attributes and to perform discriminant analysis.
First, exploratory data analysis techniques were applied to understand the data.
• We will plot basic histograms to understand the distribution of the data and to explore each attribute.
• From the following plot we can see that the different species of iris have different ratios of length to width, and by sepal length and width they can be separated into three classes.
• The "setosa" species forms a distinct cluster altogether and is separable from "versicolor" and "virginica".
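The histograms described above can be sketched in base R as follows (the 2×2 grid layout and the use of `hist` are my own illustrative choices, not from the original):

```{r}
data(iris)
# one histogram per numeric attribute, arranged on a 2x2 grid
par(mfrow = c(2, 2))
for (attr in c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")) {
  hist(iris[[attr]], main = attr, xlab = attr)
}
par(mfrow = c(1, 1))
```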
Correlation Matrix: Correlation represents the linear dependency of two random variables and is measured on a scale from -1 to 1; a value of zero means there is no linear relationship between them. The following R script produces the correlation matrix and its plot (note that the `Species` factor column must be dropped before calling `cor`):

```{r}
library(corrplot)
iris1 <- iris
iris1$Species <- NULL
iris_cor <- cor(iris1)
round(iris_cor, digits = 2)
corrplot(iris_cor, diag = FALSE)
```
Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables.

```{r}
fit <- princomp(iris1, cor = TRUE)
summary(fit)
loadings(fit)
plot(fit, type = "lines")
biplot(fit)
```
DATA TESTS:
There are different types of tests which we can apply to the dataset for preliminary analysis. They can be divided into two groups:
• Two-sample tests
• Paired two-sample tests
Two-Sample Tests:
Kolmogorov-Smirnov Test
• The Kolmogorov-Smirnov test is a non-parametric test of the similarity of two distributions. The null hypothesis is that the two samples are drawn from the same distribution. Both the two-sided and the two one-sided tests are performed.
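As a minimal illustration on the iris data (the choice of species and attribute here is mine, not part of the original), `ks.test` compares two empirical distributions:

```{r}
data(iris)
setosa_sl     <- iris$Sepal.Length[iris$Species == "setosa"]
versicolor_sl <- iris$Sepal.Length[iris$Species == "versicolor"]
# H0: both samples are drawn from the same distribution
ks_res <- ks.test(setosa_sl, versicolor_sl)
ks_res
```

A very small p-value leads us to reject the null hypothesis that the two species share the same sepal-length distribution.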
Wilcoxon Rank Sum Test
• The two-sample non-parametric Wilcoxon rank sum test (equivalent to the Mann-Whitney test) is performed on the two specified samples. The null hypothesis is that the distributions are the same (i.e., there is no shift in the location of the two distributions), with the alternative hypothesis that they differ in location (based on the median).
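For instance, on the iris data (the pairing of species is an illustrative assumption):

```{r}
data(iris)
# H0: the petal-length distributions of the two species have the same location
w_res <- wilcox.test(iris$Petal.Length[iris$Species == "versicolor"],
                     iris$Petal.Length[iris$Species == "virginica"])
w_res
```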
Two-Sample t-Test
• The two-sample t-test is performed on the two specified samples. The null hypothesis is that the difference between the two means is zero. This test assumes that the two samples are normally distributed. If not, use the Wilcoxon rank sum test.
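A minimal sketch on the iris data (the species and attribute choice is illustrative):

```{r}
data(iris)
x <- iris$Sepal.Width[iris$Species == "setosa"]
y <- iris$Sepal.Width[iris$Species == "virginica"]
# H0: the difference between the two means is zero (assumes normality)
t_res <- t.test(x, y)
t_res
```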
PAIRED TWO SAMPLE TESTS
Correlation Test
• The paired sample correlation test is performed on the two specified samples. The two samples are expected to be paired (two observations for the same entity). The null hypothesis is that the two samples have no (i.e., 0) correlation. Pearson's product-moment correlation coefficient is used.
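On the iris data, petal length and petal width are naturally paired (two measurements on the same flower); the attribute choice here is illustrative:

```{r}
data(iris)
# H0: the correlation between the paired samples is zero (Pearson)
ct <- cor.test(iris$Petal.Length, iris$Petal.Width)
ct
```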
Wilcoxon Signed Rank Test
• The paired sample non-parametric Wilcoxon signed rank test is performed on the two specified samples. The two samples are expected to be paired (two observations for the same entity). The null hypothesis is that the distributions are the same.
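A sketch on the iris data (pairing sepal and petal length of the same flower is my illustrative choice):

```{r}
data(iris)
# paired observations: sepal and petal length measured on the same flower
# H0: the two paired distributions are the same
ws_res <- wilcox.test(iris$Sepal.Length, iris$Petal.Length, paired = TRUE)
ws_res
```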
SCATTERPLOT MATRIX
A scatterplot matrix (sometimes abbreviated as SPLOM) is simply a collection of scatterplots arranged in a grid. It is used to detect patterns among three or more variables. The scatterplot matrix is not a true multidimensional visualization because only two features are examined at a time. Still, it provides a general sense of how the data may be interrelated. The following graph can be plotted in R using this command:

```{r}
library(psych)
pairs.panels(iris1[c("Sepal.Length", "Sepal.Width",
                     "Petal.Length", "Petal.Width")])
```
DISCRIMINANT ANALYSIS ON IRIS DATASET
Discriminant function analysis is a statistical analysis to predict a categorical dependent variable (called a grouping variable) by one or more continuous or binary independent variables (called predictor variables). It is different from an ANOVA or MANOVA, which is used to predict one (ANOVA) or multiple (MANOVA) continuous dependent variables by one or more independent categorical variables. Discriminant function analysis is useful in determining whether a set of variables is effective in predicting category membership.
```{r}
data(iris)
library(MASS)
linear_disc <- lda(formula = Species ~ Sepal.Length + Sepal.Width +
                     Petal.Length + Petal.Width,
                   data = iris, prior = c(1, 1, 1) / 3)
linear_disc$prior
linear_disc$counts
linear_disc$means
linear_disc$scaling
linear_disc$lev
linear_disc$svd
```
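To see how well the fitted discriminant functions separate the species, a prediction step can be added (this follow-up is not in the original script; `predict` and `table` are standard R calls):

```{r}
library(MASS)
data(iris)
# refit the model, then classify the same observations
fit_lda <- lda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
               data = iris, prior = c(1, 1, 1) / 3)
pred <- predict(fit_lda, iris)
# confusion matrix: actual species (rows) vs predicted class (columns)
table(iris$Species, pred$class)
```

Note this evaluates on the training data, so it overstates out-of-sample accuracy.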
CLASSIFICATION OF IRIS DATASET USING THE NEURAL NETWORKS
Neural networks can be used for both classification and forecasting. In a neural network each layer has nodes with assigned weights, and these weights can be trained for classification as well as forecasting. Initially a neural network is trained to optimize its weights. Here we will use the iris dataset to first train the model and then evaluate its performance. All operations will be performed in the R environment, using available packages.
```{r}
# Normalize the data to improve neural network performance
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
iris1 <- iris
iris1$Sepal.Length <- normalize(iris1$Sepal.Length)
iris1$Sepal.Width  <- normalize(iris1$Sepal.Width)
iris1$Petal.Length <- normalize(iris1$Petal.Length)
iris1$Petal.Width  <- normalize(iris1$Petal.Width)

library(neuralnet)
iris_train <- iris1[sample(1:150, 75), ]
iris_train$setosa     <- c(iris_train$Species == "setosa")
iris_train$versicolor <- c(iris_train$Species == "versicolor")
iris_train$virginica  <- c(iris_train$Species == "virginica")
iris_net <- neuralnet(setosa + virginica + versicolor ~
                        Sepal.Length + Sepal.Width + Petal.Length + Petal.Width,
                      iris_train, hidden = 2, lifesign = "full")
plot(iris_net, rep = "best")
plot(iris_net, rep = "best", intercept = FALSE)
```
The preceding R chunk produces the neural network used to classify the iris data. We can change the `hidden` parameter to obtain a different number of hidden nodes and inspect the relations between them. The following plots are for 2, 3 and 4 hidden nodes.

hidden = 2 | hidden = 3 | hidden = 4
CROSS VALIDATING THE RESULT
The neural network formed will classify the iris dataset using the weights assigned to each node. Here we cross-validate the model to see whether it is predicting accurately.
```{r}
# use the normalized attributes, matching the training data
pred <- compute(iris_net, iris1[1:4])
pred$net.result
result <- 0
for (i in 1:150) {
  result[i] <- which.max(pred$net.result[i, ])
}
# column order of net.result follows the model formula:
# setosa, virginica, versicolor
for (i in 1:150) {
  if (result[i] == 1) { result[i] <- "setosa" }
}
for (i in 1:150) {
  if (result[i] == 2) { result[i] <- "virginica" }
}
for (i in 1:150) {
  if (result[i] == 3) { result[i] <- "versicolor" }
}
comparison <- iris1
comparison$predicted <- result
```
RESULT
The neural network formed can successfully separate the "setosa" class from the "versicolor" and "virginica" classes, while reliable results are not obtained for classification between "versicolor" and "virginica".
END