Data mining project presentation

Post on 14-Jun-2015




Classification Technique KNN in Data Mining

---on dataset “Iris”

Comp722 Data Mining

Kaiwen Qi, UNC

Spring 2012


• Dataset introduction
• Data processing
• Data analysis
• KNN & Implementation
• Testing

Dataset

Raw dataset: Iris (http


150 total records

50 records Iris Setosa

50 records Iris Versicolour

50 records Iris Virginica

5 Attributes

Sepal length in cm (continuous number)

Sepal width in cm (continuous number)

Petal length in cm (continuous number)

Petal width in cm (continuous number)

Class (nominal data: Iris Setosa, Iris Versicolour, Iris Virginica)

(a) Raw data

(b) Data organization

(c) Data


Classification Goal


Data Processing

Original data

Data Processing

• Balanced distribution

Data Analysis





KNN algorithm

The unknown point, the green circle, is classified as a square when K is 5. The distance between two points is calculated as the Euclidean distance $d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$, where n is the number of attributes. In this example, squares are the majority among the 5 nearest neighbors.
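The distance formula and the majority vote can be illustrated with a small sketch. The points below are invented for illustration and are not the coordinates from the slide's figure; the names `euclidean`, `labeled`, `unknown`, and `vote` are hypothetical.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical 2-D points, made up for this example only.
labeled = [
    ((1.0, 1.0), "square"),
    ((1.5, 2.0), "square"),
    ((2.0, 0.5), "square"),
    ((2.2, 2.1), "triangle"),
    ((2.4, 1.6), "triangle"),
    ((6.0, 6.0), "triangle"),
    ((7.0, 5.0), "triangle"),
]
unknown = (2.0, 1.5)  # plays the role of the green circle


def vote(k):
    """Return the majority label among the k points closest to `unknown`."""
    nearest = sorted(labeled, key=lambda item: euclidean(unknown, item[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


print(vote(5))  # square: 3 of the 5 nearest points are squares
```

In this invented layout the two closest points are triangles, so a smaller K such as 3 would give a different answer, which is why the choice of K matters.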


Advantages

• Simplicity of implementation.
• Good at dealing with numeric attributes.
• Does not build a model; it just imports the dataset, with very low computational overhead.
• Does not need to calculate a useful attribute subset.
• Compared with naïve Bayesian, there is no need to worry about a lack of available probability data.

Implementation of KNN Algorithm

Algorithm: KNN. Assign a classification label, derived from the training data, to an unlabeled tuple.

Input: K, the number of neighbors; a dataset that includes the training data.

Output: A string that indicates the unknown tuple’s classification.

Method:
(1) Create a distance array of size K.
(2) Initialize the array with the distances between the unlabeled tuple and the first K records in the dataset.
(3) Let i = K+1.
(4) Calculate the distance between the unlabeled tuple and the i-th record in the dataset. If this distance is smaller than the largest distance in the array, replace that largest distance with the new distance; set i = i+1.
(5) Repeat step (4) until i is greater than the dataset size (150).
(6) Count the classes of the records kept in the array; the class with the largest count is the mining result.
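The method above could be sketched in Python roughly as follows. This is a minimal illustration and not the presenter's implementation: the function name `knn_classify`, the `(features, label)` record layout, and the sample rows are assumptions, and the rows are made-up values in the style of the Iris attributes rather than records from the real dataset. Each array entry keeps the class next to its distance so that the final count is possible.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def knn_classify(k, dataset, unlabeled):
    """Classify `unlabeled` from `dataset`, a list of (features, label)."""
    if not 1 <= k <= len(dataset):
        raise ValueError("k must be between 1 and the dataset size")
    # Steps 1-2: array of size K, filled from the first K records.
    nearest = [(euclidean(unlabeled, f), label) for f, label in dataset[:k]]
    # Steps 3-5: scan the remaining records one by one.
    for features, label in dataset[k:]:
        d = euclidean(unlabeled, features)
        # Index of the entry with the largest distance kept so far.
        worst = max(range(k), key=lambda j: nearest[j][0])
        if d < nearest[worst][0]:
            nearest[worst] = (d, label)
    # Step 6: the most frequent class among the K entries wins.
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical rows (sepal length, sepal width, petal length, petal width).
rows = [
    ((5.0, 3.4, 1.5, 0.2), "Iris Setosa"),
    ((4.8, 3.0, 1.4, 0.2), "Iris Setosa"),
    ((5.2, 3.5, 1.4, 0.3), "Iris Setosa"),
    ((6.0, 2.8, 4.5, 1.4), "Iris Versicolour"),
    ((5.8, 2.7, 4.1, 1.2), "Iris Versicolour"),
    ((6.2, 2.9, 4.3, 1.3), "Iris Versicolour"),
    ((6.8, 3.1, 5.6, 2.2), "Iris Virginica"),
    ((6.5, 3.0, 5.8, 2.1), "Iris Virginica"),
    ((7.0, 3.2, 5.9, 2.3), "Iris Virginica"),
]

print(knn_classify(3, rows, (5.1, 3.3, 1.5, 0.2)))  # Iris Setosa
```

Scanning for the largest kept distance on every record costs O(K) per step; for small K, as in the slides, this is simpler than maintaining a heap.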

Implementation of KNN



Testing (K=7, total 150 tuples)

Testing (K=7, 60% of data as training data)
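A holdout test of this kind might look like the sketch below. It assumes a shuffled 60/40 split and uses a synthetic three-class dataset generated on the spot, not the Iris data, so the accuracy it prints says nothing about the results reported in the presentation. The class names, cluster centers, seed, and the sort-based `classify` helper are all assumptions made for illustration.

```python
import math
import random
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two equal-length numeric sequences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def classify(k, training, point):
    """Sort-based KNN vote over `training`, a list of (features, label)."""
    nearest = sorted(training, key=lambda rec: euclidean(point, rec[0]))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Synthetic data: 3 classes x 50 points around hypothetical centers.
rng = random.Random(0)
centers = {"A": (0.0, 0.0), "B": (10.0, 10.0), "C": (20.0, 0.0)}
data = [
    ((rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)), label)
    for label, (cx, cy) in centers.items()
    for _ in range(50)
]

# Shuffle so every class appears in both parts, then split 60% / 40%.
rng.shuffle(data)
split = int(0.6 * len(data))
train_set, test_set = data[:split], data[split:]

# Accuracy = correctly classified test tuples / all test tuples.
correct = sum(
    1 for features, label in test_set
    if classify(7, train_set, features) == label
)
accuracy = correct / len(test_set)
print(f"train={len(train_set)} test={len(test_set)} accuracy={accuracy:.2f}")
```

Shuffling before the split matters here because the generated records are ordered by class; without it, the training part would miss the last class entirely.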


Input random distribution dataset

Random dataset

Accuracy test:



Comparison

Decision tree

Advantage
• Comprehensibility
• Constructs a decision tree without any domain knowledge
• Handles high-dimensional data
• By eliminating unrelated attributes and pruning the tree, it simplifies classification calculation

Disadvantage
• Requires good-quality training data
• Usually runs in memory
• Not good at handling continuous number features

Naïve Bayesian

Advantage
• Relatively simple
• Only calculates attribute frequencies from the training data, without any other operations (e.g. sort, search)

Disadvantage
• The independence assumption often does not hold
• Probability data may not be available for calculating probabilities


KNN is a simple algorithm with high classification accuracy for datasets with continuous attributes.

It performs well when the training data given as input has a balanced class distribution.
