Classification DMM


Transcript of Classification DMM

  • Slide 1

    Data Mining

    [email protected]

    Jalali.mshdiau.ac.ir

  • Slide 2

    Data Mining

    Association Rules

  • Slide 3

    What Is Classification?

    The goal of data classification is to organize and

    categorize data in distinct classes

    A model is first created based on the data distribution

    The model is then used to classify new data

    Given the model, a class can be predicted for new data

    Classification = prediction for discrete and nominal

    values

    With classification, I can predict in which bucket to put the ball, but I cannot predict the weight of the ball

  • Slide 4

    Prediction, Clustering, Classification

    What is Prediction? The goal of prediction is to forecast or deduce the value of an attribute

    based on values of other attributes

    models continuous-valued functions, i.e., predicts unknown or missing values

    A model is first created based on the data distribution

    The model is then used to predict future or unknown values

    Supervised vs. Unsupervised Classification

    Supervised Classification = Classification

    We know the class labels and the number of classes

    Unsupervised Classification = Clustering

    We do not know the class labels and may not know the number of classes

  • Slide 5

    Classification and Prediction

    Typical applications

    Credit approval

    Target marketing

    Medical diagnosis

    Fraud detection

  • Slide 6

    Classification: 3 Step Process

    1. Model construction (Learning): Each record (instance) is assumed to belong to a predefined class, as

    determined by one of the attributes, called the class label

    The set of all records used for construction of the model is called the training set

    The model is usually represented in the form of classification rules (IF-THEN statements) or decision trees

    2. Model Evaluation (Accuracy): Estimate accuracy rate of the model based on a test set

    The known label of test sample is compared with the classified result from model

    Accuracy rate: percentage of test set samples correctly classified by the model

    Test set is independent of training set; otherwise over-fitting will occur

    3. Model Use (Classification): The model is used to classify unseen instances (assigning class labels)

    Predict the value of an actual attribute
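
    The three steps can be illustrated with a small scikit-learn sketch (an illustration, not part of the slides; the tiny dataset and the choice of a decision tree are assumptions):

```python
# Sketch of the 3-step process with scikit-learn on a small made-up dataset.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X = [[25, 0], [42, 1], [33, 1], [51, 0], [28, 0], [45, 1], [39, 0], [60, 1]]
y = ["no", "yes", "yes", "no", "no", "yes", "no", "yes"]  # class labels

# 1. Model construction (learning) on the training set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# 2. Model evaluation: accuracy rate on an independent test set.
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 3. Model use: classify an unseen instance.
print("predicted class:", model.predict([[37, 1]])[0])
```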

  • Slide 7

    Process (1): Model Construction

    Training

    Data

    NAME   RANK             YEARS   TENURED
    Mike   Assistant Prof   3       no
    Mary   Assistant Prof   7       yes
    Bill   Professor        2       yes
    Jim    Associate Prof   7       yes
    Dave   Assistant Prof   6       no
    Anne   Associate Prof   3       no

    Classification

    Algorithms

    IF rank = professor

    OR years > 6

    THEN tenured = yes

    Classifier

    (Model)

  • Slide 8

    Process (2): Using the Model in Prediction

    Classifier

    Testing

    Data

    NAME      RANK             YEARS   TENURED
    Tom       Assistant Prof   2       no
    Merlisa   Associate Prof   7       no
    George    Professor        5       yes
    Joseph    Assistant Prof   7       yes

    Unseen Data

    (Jeff, Professor, 4)

    Tenured?
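
    The rule learned on the previous slide can be applied to this testing data and to the unseen instance (Jeff, Professor, 4); a minimal plain-Python sketch (not from the slides):

```python
# Apply the learned rule "IF rank = professor OR years > 6 THEN tenured = yes"
# to the testing data and to the unseen instance (Jeff, Professor, 4).
def classify(rank, years):
    return "yes" if rank == "Professor" or years > 6 else "no"

testing_data = [  # (name, rank, years, actual tenured label)
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

# The known label of each test sample is compared with the model's output.
correct = sum(classify(rank, years) == label for _, rank, years, label in testing_data)
print("accuracy:", correct / len(testing_data))

# Unseen data: (Jeff, Professor, 4) -> Tenured?
print("Jeff:", classify("Professor", 4))
```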

  • Slide 9

    Accuracy of Supervised Learning

    Holdout Approach

    A certain amount of data (usually one-third) is used for testing and the remainder (two-thirds) for training

    K-fold cross-validation

    Split the data into k subsets of equal size; each subset in turn is used for testing and the remainder for training
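
    A hedged scikit-learn sketch of both estimates (the iris data is only a stand-in dataset, an assumption for illustration):

```python
# Holdout and k-fold cross-validation with scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Holdout: one-third for testing, two-thirds for training.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("holdout accuracy:", model.score(X_test, y_test))

# K-fold cross-validation: each of the k subsets is used once for testing.
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=5)
print("5-fold accuracies:", scores, "mean:", scores.mean())
```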

  • Slide 10

    Issues: Data Preparation

    Data cleaning

    Preprocess data in order to reduce noise and handle missing values

    Relevance analysis (feature selection)

    Remove the irrelevant or redundant attributes

    Data transformation

    Normalize data

  • Slide 11

    Issues: Evaluating Classification Methods

    Accuracy

    classifier accuracy: predicting class label

    predictor accuracy: guessing value of predicted attributes

    Speed

    time to construct the model (training time)

    time to use the model (classification/prediction time)

    Robustness: handling noise and missing values

    Scalability: efficiency in disk-resident databases

    Interpretability: understanding and insight provided by the model

    Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

  • Slide 12

    Other Issues

    Function complexity and amount of training data

    If the true function is simple, then a learning algorithm will be able to learn it from a small amount of data; otherwise the function will only be learnable from a very large amount of training data.

    Dimensionality of the input space

    Noise in the output values

    Heterogeneity of the data.

    If the feature vectors include features of many different kinds (discrete, discrete ordered, counts, continuous values), some algorithms are easier to apply than others.

    Dependency of features

  • Slide 13

    Decision Tree - Review of Basics

    What exactly is a Decision Tree?

    A tree where each branching node represents a choice between two or more alternatives, with every branching node being part of a path to a leaf node (bottom of the tree). The leaf node represents a decision derived from the tree for the given input.

    How can Decision Trees be used to classify instances of data?

    Instead of representing decisions, leaf nodes represent a particular classification of a data instance, based on the given set of attributes (and their discrete values) that define the instance of data, which is rather like a relational tuple for illustration purposes.

  • Slide 14

    Review of Basics (contd)

    It is important that data instances have boolean or discrete

    data values for their attributes to help with the basic

    understanding of ID3, although there are extensions of ID3

    that deal with continuous data.

  • Slide 15

    How does ID3 relate to Decision Trees, then?

    ID3, or Iterative Dichotomiser 3 Algorithm, is a Decision Tree learning

    algorithm. The name is correct in that it creates Decision Trees for

    dichotomizing data instances, or classifying them discretely through

    branching nodes until a classification bucket is reached (leaf node).

    By using ID3 and other machine-learning algorithms from Artificial

    Intelligence, expert systems can engage in tasks usually done by human

    experts, such as doctors diagnosing diseases by examining various

    symptoms (the attributes) of patients (the data instances) in a complex

    Decision Tree.

    Of course, accurate Decision Trees are fundamental to Data Mining and

    Databases.

  • Slide 16

    Other Decision Trees

    ID3

    J48

    C4.5

    SLIQ

    SPRINT

    PUBLIC

    RainForest

    BOAT

    Data Cube-based Decision Tree

  • Slide 17

    Tree Presentation

  • Slide 18

    Decision Tree Induction: Training Dataset

    age      income   student   credit_rating   buys_computer
    >40      low      yes       fair            yes
    >40      low      yes       excellent       no
    31..40   low      yes       excellent       yes

  • Slide 19

    Output: A Decision Tree for buys_computer

    Figure: the root node tests age?; the <=30 branch leads to a student? test (no -> no, yes -> yes), the 31..40 branch predicts yes, and the >40 branch leads to a credit rating? test (excellent -> no, fair -> yes)

  • Slide 20

    Algorithm for Decision Tree Induction

    Basic algorithm (a greedy algorithm)

    Tree is constructed in a top-down recursive divide-and-conquer manner

    At start, all the training examples are at the root

    Attributes are categorical (if continuous-valued, they are discretized in advance)

    Examples are partitioned recursively based on selected attributes

    Test attributes are selected on the basis of a heuristic or statistical measure

    (e.g., information gain)

    Conditions for stopping partitioning

    All samples for a given node belong to the same class

    There are no remaining attributes for further partitioning (majority voting is employed for classifying the leaf)

    There are no samples left
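
    A self-contained sketch of this greedy top-down induction in plain Python (an ID3-style illustration with a tiny made-up categorical dataset, not the slides' exact algorithm):

```python
# Greedy top-down decision-tree induction using information gain (ID3-style sketch).
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    total = len(labels)
    remainder = 0.0
    for value in set(row[attr] for row in rows):
        subset = [lab for row, lab in zip(rows, labels) if row[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def build_tree(rows, labels, attributes):
    # Stop: all samples for this node belong to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Stop: no remaining attributes -> majority voting.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Greedy choice: attribute with the highest information gain.
    best = max(attributes, key=lambda a: info_gain(rows, labels, a))
    tree = {best: {}}
    for value in set(row[best] for row in rows):
        branch_rows = [row for row in rows if row[best] == value]
        branch_labels = [lab for row, lab in zip(rows, labels) if row[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(branch_rows, branch_labels, remaining)
    return tree

# Tiny made-up categorical dataset (hypothetical values).
rows = [{"age": "<=30", "student": "no"}, {"age": "<=30", "student": "yes"},
        {"age": "31..40", "student": "no"}, {"age": ">40", "student": "yes"}]
labels = ["no", "yes", "yes", "yes"]
print(build_tree(rows, labels, ["age", "student"]))
```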

  • Slide 21

    Entropy

    Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-)

    Example: for a sample S with 2 positive and 6 negative examples, E(S) = -(2/8) log2(2/8) - (6/8) log2(6/8) = 0.811

    S is the training sample

    p(+) is the proportion of positive examples in S

    p(-) is the proportion of negative examples in S

    Entropy measures the impurity of S

    For non-Boolean classification: Entropy(S) = -sum over classes i of p_i log2 p_i
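
    A minimal plain-Python sketch of the entropy computation, reproducing the 2-positive / 6-negative example above:

```python
# Entropy of a labelled sample; works for Boolean and non-Boolean classification.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

# 2 positive and 6 negative examples -> about 0.811 bits.
print(entropy(["+"] * 2 + ["-"] * 6))
```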

  • Slide 22

    Attribute Selection Measure: Information Gain (ID3/C4.5)

    Gain(S, A) = expected reduction in entropy due to sorting on A

  • Slide 23

    Attribute Selection: Information Gain

    Class P: buys_computer = yes

    Class N: buys_computer = no

    Entropy_A(S) = sum over v in Values(A) of (|S_v| / |S|) Entropy(S_v)

    For age: Entropy_age(S) = (|S_<=30| / |S|) Entropy(S_<=30) + (|S_31..40| / |S|) Entropy(S_31..40) + (|S_>40| / |S|) Entropy(S_>40)

    age      income   student   credit_rating   buys_computer
    >40      low      yes       fair            yes
    >40      low      yes       excellent       no
    31..40   low      yes       excellent       yes
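
    A small plain-Python sketch of Entropy_A(S) and Gain(S, A); it is applied only to the three tuples visible above, so the result differs from the gain values on the next slide, which are computed from the full training set:

```python
# Sketch of Entropy_A(S) and Gain(S, A) over a list of (attribute value, class) pairs.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def gain(pairs):
    """pairs: list of (value of attribute A, class label) for every tuple in S."""
    labels = [label for _, label in pairs]
    weighted = 0.0
    for value in set(v for v, _ in pairs):
        subset = [label for v, label in pairs if v == value]
        weighted += len(subset) / len(labels) * entropy(subset)  # Entropy_A(S)
    return entropy(labels) - weighted                            # Gain(S, A)

# Only the three tuples shown above (the full example uses the whole training set).
age_pairs = [(">40", "yes"), (">40", "no"), ("31..40", "yes")]
print("Gain(S, age) on the excerpt:", round(gain(age_pairs), 3))
```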

  • Slide 24

    Attribute Selection: Information Gain

    Gain(S, age) = 0.940 - 0.694 = 0.246

    Gain(S, income) = 0.029

    Gain(S, student) = 0.151

    Gain(S, credit_rating) = 0.048

    age      income   student   credit_rating   buys_computer
    >40      low      yes       fair            yes
    >40      low      yes       excellent       no
    31..40   low      yes       excellent       yes

  • Slide 25

    ID3 Algorithm

  • Slide 26

    Computing Information Gain for Continuous-Valued Attributes

    Let attribute A be a continuous-valued attribute

    Must determine the best split point for A

    Sort the values of A in increasing order

    Typically, the midpoint between each pair of adjacent values is considered as a possible split point

    (a_i + a_i+1) / 2 is the midpoint between the values of a_i and a_i+1

    The point with the minimum expected information requirement for A is selected as the split point for A

    Split:

    D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point
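
    A plain-Python sketch of this split-point search (the ages and class labels are hypothetical):

```python
# Choose the best split point for a continuous attribute A by testing the midpoints
# of adjacent sorted values and minimizing the expected information requirement
# (weighted entropy of D1: A <= split-point and D2: A > split-point).
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def best_split_point(values, labels):
    pairs = sorted(zip(values, labels))
    best = (float("inf"), None)
    for (a_i, _), (a_next, _) in zip(pairs, pairs[1:]):
        if a_i == a_next:
            continue
        split = (a_i + a_next) / 2                    # midpoint (a_i + a_i+1) / 2
        d1 = [lab for v, lab in pairs if v <= split]  # D1: A <= split-point
        d2 = [lab for v, lab in pairs if v > split]   # D2: A > split-point
        info = (len(d1) * entropy(d1) + len(d2) * entropy(d2)) / len(pairs)
        best = min(best, (info, split))
    return best  # (expected information requirement, split point)

# Hypothetical ages and class labels, just for illustration.
print(best_split_point([25, 32, 40, 45, 52], ["no", "yes", "yes", "no", "no"]))
```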

  • Slide 27

    Gain Ratio for Attribute Selection and Cost of Attributes

    What happens if you choose Day as the root? Every Day value is unique, so its information gain is maximal, but the resulting tree does not generalize; the gain ratio penalizes such many-valued attributes

    Attribute with Cost
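
    The slide names gain ratio only in its title; as a reminder of the standard C4.5 definition (an addition, not from the slides), GainRatio(S, A) = Gain(S, A) / SplitInfo(S, A), where SplitInfo grows for attributes such as Day that split the data into many small partitions. A minimal sketch with illustrative numbers taken from the buys_computer example:

```python
# Sketch of C4.5-style split information and gain ratio.
from math import log2

def split_info(partition_sizes):
    total = sum(partition_sizes)
    return -sum((s / total) * log2(s / total) for s in partition_sizes)

def gain_ratio(gain, partition_sizes):
    return gain / split_info(partition_sizes)

# A Day-like attribute splits 14 tuples into 14 singletons: its gain (0.940) is
# divided by log2(14), about 3.81. age splits them into groups of 5, 4 and 5:
# its gain (0.246) is divided by about 1.58.
print(round(gain_ratio(0.940, [1] * 14), 3))
print(round(gain_ratio(0.246, [5, 4, 5]), 3))
```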

  • Slide 28

    Overfitting and Tree Pruning

    Overfitting: An induced tree may overfit the training data

    Too many branches, some may reflect anomalies due to noise or outliers

    Poor accuracy for unseen samples

    Two approaches to avoid overfitting

    Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold

    Difficult to choose an appropriate threshold

    Postpruning: Remove branches from a fully grown tree to get a sequence of progressively pruned trees

    Use a set of data different from the training data to decide which is the best pruned tree
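
    A hedged scikit-learn sketch of the two approaches (prepruning via growth thresholds, postpruning via cost-complexity pruning selected on held-out data); the iris data and the specific parameters are only stand-ins:

```python
# Prepruning (halt growth early) vs postpruning (prune a fully grown tree).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Prepruning: do not split nodes beyond a threshold (here: depth and leaf size).
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)

# Postpruning: grow the full tree, then pick the best of a sequence of
# progressively pruned trees (ccp_alpha) using data held out from training.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
pruned = [DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
          for a in path.ccp_alphas]
best = max(pruned, key=lambda t: t.score(X_val, y_val))
print("prepruned:", pre.score(X_val, y_val), "postpruned:", best.score(X_val, y_val))
```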

  • Slide 29

    Enhancements to Basic Decision Tree Induction

    Allow for continuous-valued attributes

    Dynamically define new discrete-valued attributes that partition the continuous

    attribute value into a discrete set of intervals

    Handle missing attribute values

    Assign the most common value of the attribute

    Assign probability to each of the possible values

    Attribute construction

    Create new attributes based on existing ones that are sparsely represented

    This reduces fragmentation, repetition, and replication

  • Slide 30

    Instance Based Learning

    Lazy learning (e.g., instance-based learning): simply stores the training data (with only minor processing) and waits until it is given a test tuple

    Eager learning (e.g., decision trees, SVM, neural networks): given a training set, constructs a classification model before receiving new (e.g., test) data to classify

  • Slide 31

    Instance-based Learning

    It is very similar to a desktop!

  • Slide 32

    Example

    Image Scene Classification

  • Slide 33

    Instance-based Learning

    When To Consider IBL

    Instances map to points

    Less than 20 attributes per instance

    Lots of training data

    Advantages:

    Learn complex target functions (classes)

    Don't lose information

    Disadvantages:

    Slow at query time

    Easily fooled by irrelevant attributes

  • Slide 34

    K-Nearest Neighbor Learning (k-NN)

    k-NN is a type of instance-based learning, or lazy learning

    where the function is only approximated locally and all

    computation is deferred until classification.

    The k-nearest neighbor algorithm is amongst the simplest of all machine learning algorithms: an object is classified by a majority vote of its neighbors, with the object being assigned to the class most common amongst its k nearest neighbors (k is a positive integer, typically small).

    If k = 1, then the object is simply assigned to the class of its nearest neighbor.

  • Slide 35

    K-Nearest Neighbor Learning (k-NN)- Cont.

    Usually Euclidean distance is used as the distance metric;

    however this is only applicable to continuous variables. In

    cases such as text classification, another metric such as the

    overlap metric (or Hamming distance: it measures the minimum number of substitutions required to change one string into the other, or the number of errors that transformed one string into the other) can be used.

  • Slide 36

    Distance or Similarity Measures

    Common Distance Measures:

    Manhattan distance: dist(X, Y) = sum over i of |x_i - y_i|

    Euclidean distance: dist(X, Y) = sqrt( sum over i of (x_i - y_i)^2 )

    Cosine similarity: sim(X, Y) = ( sum over i of x_i y_i ) / ( sqrt(sum over i of x_i^2) * sqrt(sum over i of y_i^2) ), with dist(X, Y) = 1 - sim(X, Y)
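
    The three measures in plain Python (a sketch; the vectors are arbitrary examples):

```python
# Manhattan distance, Euclidean distance, and cosine similarity.
from math import sqrt

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    return dot / (sqrt(sum(a * a for a in x)) * sqrt(sum(b * b for b in y)))

x, y = [1.0, 2.0, 3.0], [2.0, 0.0, 4.0]
print(manhattan(x, y), euclidean(x, y), 1 - cosine_similarity(x, y))  # dist = 1 - sim
```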

  • Slide 37

    KNN - Algorithm

    Key Idea: Just store all training examples

    Nearest Neighbour :

    Given query instance Xq, first locate the nearest training example Xn, then estimate f(Xq) as f(Xn)

    K Nearest Neighbour:

    Given Xq, take a vote among its K nearest neighbours (if the class is discrete-valued)

    Take the mean of the values of its K nearest neighbours (if real-valued)
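
    A plain-Python sketch of the algorithm: store the examples, then vote (discrete classes) or average (real values) over the k nearest ones. The 2-D points are made up for illustration:

```python
# k-NN: store the training examples, then classify a query by majority vote
# (discrete classes) or by the mean of the neighbours' values (real-valued).
from collections import Counter
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def k_nearest(train, x_q, k):
    """train: list of (feature vector, target). Returns the k nearest targets."""
    return [t for _, t in sorted(train, key=lambda ex: euclidean(ex[0], x_q))[:k]]

def knn_classify(train, x_q, k):
    return Counter(k_nearest(train, x_q, k)).most_common(1)[0][0]

def knn_regress(train, x_q, k):
    values = k_nearest(train, x_q, k)
    return sum(values) / len(values)

# Hypothetical 2-D examples for illustration.
train = [([1, 1], "A"), ([1, 2], "A"), ([5, 5], "B"), ([6, 5], "B"), ([5, 6], "B")]
print(knn_classify(train, [2, 2], 3))   # expected: "A"
```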

  • Slide 38

    Figure: K Nearest Neighbors example. X is the input pattern to classify; the stored training set patterns surround it, and dashed lines mark the Euclidean distance to the nearest three patterns

  • Slide 39

    Figure: eight paintings, numbered one to eight

    Which one belongs to Mondrian?

  • Slide 40

    Training data

    Number Lines Line types Rectangles Colours Mondrian?

    1 6 1 10 4 No

    2 4 2 8 5 No

    3 5 2 7 4 Yes

    4 5 1 8 4 Yes

    5 5 1 10 5 No

    6 6 1 8 6 Yes

    7 7 1 14 5 No

    Number Lines Line types Rectangles Colours Mondrian?

    8 7 2 9 4

    Test instance

  • Slide 41

    Normalization

    Min-max normalization to [new_min_A, new_max_A]:

    v' = ((v - min_A) / (max_A - min_A)) * (new_max_A - new_min_A) + new_min_A

    Z-score normalization (μ: mean, σ: standard deviation):

    v' = (v - μ_A) / σ_A

    Normalization by decimal scaling:

    v' = v / 10^j, where j is the smallest integer such that Max(|v'|) < 1
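
    A plain-Python sketch of the three normalizations, applied to the Lines column of the training data two slides back; the z-scores should match the normalised values on the next slide if the population standard deviation is used (an assumption):

```python
# Min-max, z-score, and decimal-scaling normalization for a list of values.
def min_max(values, new_min=0.0, new_max=1.0):
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) * (new_max - new_min) + new_min for v in values]

def z_score(values):
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5  # population std
    return [(v - mean) / std for v in values]

def decimal_scaling(values):
    j = 0
    while max(abs(v) for v in values) / 10 ** j >= 1:  # smallest j with Max(|v'|) < 1
        j += 1
    return [v / 10 ** j for v in values]

lines = [6, 4, 5, 5, 5, 6, 7]  # the Lines column of the Mondrian training data
print(min_max(lines))
print([round(v, 3) for v in z_score(lines)])
print(decimal_scaling(lines))
```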

  • Slide 42

    Normalised training data

    Number   Lines   Line types   Rectangles   Colours   Mondrian?

    1 0.632 -0.632 0.327 -1.021 No

    2 -1.581 1.581 -0.588 0.408 No

    3 -0.474 1.581 -1.046 -1.021 Yes

    4 -0.474 -0.632 -0.588 -1.021 Yes

    5 -0.474 -0.632 0.327 0.408 No

    6 0.632 -0.632 -0.588 1.837 Yes

    7 1.739 -0.632 2.157 0.408 No

    Number   Lines   Line types   Rectangles   Colours   Mondrian?

    8 1.739 1.581 -0.131 -1.021

    Test instance

  • Slide 43

    Distances of test instance from training data

    Example   Distance of test from example   Mondrian?

    1 2.517 No

    2 3.644 No

    3 2.395 Yes

    4 3.164 Yes

    5 3.472 No

    6 3.808 Yes

    7 3.490 No

    Classification

    1-NN Yes

    3-NN Yes

    5-NN No

    7-NN No
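
    A plain-Python sketch that reproduces this table, assuming the z-scores on the previous slide were computed from the seven training rows with the population standard deviation and that Euclidean distance is used:

```python
# Normalize the Mondrian training data, compute distances from the normalized test
# instance, and take k-NN votes for k = 1, 3, 5, 7.
from collections import Counter
from math import sqrt

train = [  # (Lines, Line types, Rectangles, Colours), Mondrian?
    ((6, 1, 10, 4), "No"), ((4, 2, 8, 5), "No"), ((5, 2, 7, 4), "Yes"),
    ((5, 1, 8, 4), "Yes"), ((5, 1, 10, 5), "No"), ((6, 1, 8, 6), "Yes"),
    ((7, 1, 14, 5), "No"),
]
test = (7, 2, 9, 4)

# Per-attribute mean and population standard deviation over the training rows.
cols = list(zip(*[x for x, _ in train]))
means = [sum(c) / len(c) for c in cols]
stds = [sqrt(sum((v - m) ** 2 for v in c) / len(c)) for c, m in zip(cols, means)]

def normalize(x):
    return [(v - m) / s for v, m, s in zip(x, means, stds)]

def euclidean(a, b):
    return sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))

dists = sorted((euclidean(normalize(x), normalize(test)), label) for x, label in train)
for d, label in dists:
    print(round(d, 3), label)              # reproduces the distance table above

for k in (1, 3, 5, 7):
    vote = Counter(label for _, label in dists[:k]).most_common(1)[0][0]
    print(f"{k}-NN:", vote)                # Yes, Yes, No, No
```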

  • Slide 44

    Example

    Classify Theatre ?

  • Slide 45

    Nearest neighbors algorithms: illustration

    Figure: a query point q1 surrounded by positive (+) and negative (-) training examples

    1-nearest neighbor: q1 is assigned the concept represented by its nearest example e1

    5-nearest neighbors: q1 is classified as negative

  • Slide 46

    Voronoi diagram

    Figure: Voronoi regions around the stored training points, with a query point qf and its nearest neighbor qi

  • Slide 47

    Variant of kNN: Distance-Weighted kNN

    We might want to weight nearer neighbors more heavily

    Then it makes sense to use all training examples instead of just k

    f(x_q) = ( sum over i = 1..k of w_i f(x_i) ) / ( sum over i = 1..k of w_i ),  where w_i = 1 / d(x_q, x_i)^2
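
    A plain-Python sketch of the distance-weighted vote for discrete classes (the 2-D examples are made up):

```python
# Distance-weighted k-NN: every neighbour votes with weight 1 / d^2, so all
# training examples can be used instead of just k.
from collections import defaultdict
from math import sqrt

def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def weighted_knn(train, x_q):
    votes = defaultdict(float)
    for x_i, label in train:
        d = euclidean(x_i, x_q)
        if d == 0:
            return label                 # exact match: return its class directly
        votes[label] += 1.0 / d ** 2     # w_i = 1 / d(x_q, x_i)^2
    return max(votes, key=votes.get)

# Hypothetical 2-D examples for illustration.
train = [([1, 1], "A"), ([2, 1], "A"), ([6, 6], "B"), ([7, 6], "B"), ([6, 7], "B")]
print(weighted_knn(train, [3, 3]))
```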

  • Slide 48

    Difficulties with k-nearest neighbour algorithms

    Have to calculate the distance of the test case from all

    training cases

    There may be irrelevant attributes amongst the attributes

    Curse of dimensionality

    For instance, if we have 20 attributes but only 2 of them are relevant, the irrelevant attributes distort the distances and affect the results

    Solution: assign appropriate weights to the attributes, for example by using cross-validation.

  • Slide 49

    How to choose k

    Large k:

    less sensitive to noise (particularly class noise)

    better probability estimates for discrete classes

    larger training sets allow larger values of k

    Small k: captures fine structure of problem space better

    may be necessary with small training sets

    Balance must be struck between large and small k

    As the training set approaches infinity and k grows large, kNN becomes Bayes optimal

  • Slide 50

    1. Use Weka to classify and evaluate the Letter Recognition dataset through IBL

    2. Write down 3 new challenges in KNN