DATA MINING - EVALUATING CLUSTERING ALGORITHM

MINISTRY OF EDUCATION, YOUTHS, SCIENCE AND SPORTS FRANCE

Ecole Internationale des Sciences du Traitment de

I’Information

Course: DATA MINING

Course Work on CLUSTERS

DEPARTMENT: BUSINESS ANALYTICS (M2)

COURSE TUTOR: PROF. MARIA MALEK

Submitted by: UDEH TOCHUKWU LIVINUS

Cergy Pontoise - 2014

1A. EVALUATING CAR ACCESSIBILITY USING K-MEANS ALGORITHM

S/N

K – MEANS

K = 3

Seed 1 = 10

Seed 2 = 100

Within cluster sum of squared errors: 5709.0

Incorrectly clustered instances : 807.0 46.7014 %

Cluster 0 (unacc)

774 ( 45%)

Cluster 1(acc)

600 ( 35%)

Cluster 2(good)

354 ( 20%)



Cluster 0(acc)

813 ( 47%)

Cluster 1(unacc)

555 ( 32%)

Cluster 2 (vgood)

360 ( 21%)

K = 4

Seed = 10

Seed 2 = 100



Cluster 0 (unacc)

592 ( 34%)

Cluster 1(acc)

557 ( 32%)

Cluster 2(good)

327 ( 19%)

Cluster 3 (vgood)

252 ( 15%)



Cluster 0 (acc)

697 ( 40%)

Cluster 1(unacc)

496 ( 29%)

Cluster

2(vgood)

346 ( 20%)

Cluster 3 (good)

189 ( 11%)

K = 5



Seed 1 = 10

Seed 2 = 100

Cluster 0 (unacc)

543 ( 31%)

Cluster 1(acc)

430 ( 25%)

Cluster

2(good)

302 ( 17%)

Cluster 3

(No class)

227 ( 13%)

Cluster 4

(vgood)

226 ( 13%)



Cluster 0 (acc)

586 ( 34%)

Cluster 1(unacc)

424 ( 25%)

Cluster

2(vgood)

310 ( 18%)

Cluster 3

(No class)

174 ( 10%)

Cluster 4

(good)

234 ( 14%)

ANALYSIS OF THE RESULT:

These Algorithm minimizes the total squared distance from instances to their clustered

centers local, not global minimum. So we tend to get different results when we vary the

seeds. From the table above at K = 3, we discover that we had a lesser squared errors

when we assigned the value of the seed to be 100 compared to when the value of the seed

was 100. However, there is an inverse relation with the values of the incorrectly

clustered instances as we can see it from the table. While we tend to minimize the errors

of the squared distances of the instances, the incorrectly clustered instances increases

for each increase in the number of seed. Vice versa. Hence, we try to compare the

similarities of each clustered instances when we iterate and don not take into account

any values or conditions.

FIGURE 1.0

The Figure below describe the cluster and instances as illustrated in the table above.

The Y – axis represents the class value, while the X- axis represents the instance

number. The color represents the cluster, so we can see the cluster when we select each

of the instance from the menu and then compare the similarity before validating our

decisions.

1 B. ESTIMATING CLUSTERS USING EXPECTATION MAXIMIZATION (EM CLUSTERING)

K = 3; Seed = 10

Incorrectly clustered instances : 519.0

30.0347 %

K = 3; Seed = 100


31.5394 %

Cluster

Attribute 0 1 2

(0.51) (0.29) (0.2)

=======================================

buying

vhigh 211.4062 134.0433 89.5505

high 221.0605 127.5401 86.3994

med 230.866 120.6598 83.4743

low 221.0605 127.5401 86.3994

[total] 884.3931 509.7833 345.8236

maint

Cluster

Attribute 0 1 2

(0.48) (0.32) (0.2)

======================================

buying

vhigh 209.4287 139.8151 85.7562

high 198.5154 146.6941 89.7905

med 209.4287 139.8151 85.7562

low 220.8453 132.7588 81.3959

[total] 838.2182 559.083 342.6988

maint

vhigh 220.5171 127.8878 86.595

Clustered Instances

0 1725 (100%)

1 3 ( 0%)

Log likelihood: -7.45474

Class attribute: class

Classes to Clusters:

0 1 <-- assigned to cluster

1208 2 | unacc

383 1 | acc

69 0 | good

65 0 | vgood

Cluster 0 <-- unacc

Cluster 1 <-- acc

vhigh 204.4103 138.1582 92.4315

Clustered Instances

0 1699 ( 98%)

1 29 ( 2%)





1182 28 | unacc

383 1 | acc

69 0 | good

65 0 | vgood

Cluster 0 <-- unacc

Cluster 1 <-- acc

K = 4; Seed = 10


32.1759 %

K = 4; Seed = 100


30.5556 %

Cluster

Attribute 0 1 2 3

(0.41) (0.26) (0.18) (0.16)

=============================================

buying

vhigh 171.8456 115.3853 81.9122 66.8569

high 175.9113 111.256 79.4969 69.3358

med 182.9562 108.0125 77.5667 67.4646

low 173.1743 110.3401 79.4592 73.0264

[total] 703.8874 444.9938 318.435 276.6838

Cluster

Attribute 0 1 2 3

(0.42) (0.18) (0.26) (0.14)

==========================================

buying

vhigh 179.4064 76.6192 118.9071 61.0673

high 183.026 77.5366 112.1751 63.2622

med 189.4165 80.2604 106.0872 60.2359

low 183.1553 75.7134 110.2891 66.8422

[total] 735.0042 310.1296 447.4586 251.4076

Clustered Instances

0 1616 ( 94%)

1 112 ( 6%)





1140 70 | unacc

352 32 | acc

59 10 | good

65 0 | vgood

Cluster 0 <-- unacc

Cluster 1 <-- acc

Clustered Instances

0 1718 ( 99%)

2 10 ( 1%)





1200 10 | unacc

384 0 | acc

69 0 | good

65 0 | vgood

Cluster 0 <-- unacc

Cluster 2 <-- No class

ANALYSIS OF THIS ALGORITHM:

These algorithm uses probabilistic approach to classify cluster. It uses expectation

maximization to classify the instances of the cluster. From the table we see that each of

the attribute value, has a probability value assigned to it. We can get the probability

of each attribute by dividing the value of each with the total value of all the

attributes given. With this, we can calculate the probability of each cluster. It also

uses the overall quality measure known as log likelihood. The nominal attributes are the

probability of each value while the numeric attributes are the mean and standard

deviation. It is also an unsupervised learning algorithm. Whilst we compare it with that

of K-Means, we discover that the value of the incorrectly classified instances are lesser

compared to K-means.

2 CLASSIFICATION

A. K-NEAREST NEIGHBOUR (KNN)

S/N

USE TRAINIG SET

CROSS VALIDATION (10 FOLD)

PRECISION

RECALL

ERROR

PRECISION

RECALL

ERROR

Type 1

(TPR)

Type 2

(FPR)

Type

1

(TPR)

Type 2

(FPR)

K = 1

Correctly Classified Instances 1728 100%

Incorrectly Classified Instances 0 0%

Correctly Classified Instances

1616 93.5185 %

Incorrectly Classified Instances

112 6.4815 %

1

1

1

1

1

1

1

1

1

1

1

1

1

1

1

0

0

0

0

0

0.973

0.818

1

1

0.94

0.998

0.911

0.188

0.708

0.935

0.998

0.911

0.188

0.708

0.935

0.066

0.058

0

0

0.059

K = 5


1664 96.2963 %


64 3.7037 %


1616 93.5185 %


112 6.4815 %

0.988

0.883

1

1

0.965

1

0.961

0.435

0.846

0.963

1

0.961

0.435

0.846

0.963

0.029

0.036

0

0

0.028

0.973

0.818

1

1

0.94

0.998

0.911

0.188

0.708

0.935

0.998

0.911

0.188

0.708

0.935

0.066

0.058

0

0

0.059

K = 20


1337 77.3727 %


391 22.6273 %


1327 76.794 %


401 23.206 %

0.813

0.531

0

0

0.687

1

0.331

0

0

0.774

1

0.331

0

0

0.774

0.539

0.083

0

0

0.396

0.802

0.528

0

1

0.717

1

0.299

0

0.031

0.768

1

0.299

0

0.031

0.768

0.575

0.077

0

0

0.42

REMARK:

From the above table we had a correctly classified instances with K = 1. So both

precisions and Recall were the same. This means that the algorithm correctly classified

all the misclassified points or instances. We can see it from the Figure below:

IBK 1.0

When we choose a different value for K = 5,and k = 20 we had approximately 4% and 23% of

the misclassified instances. However, this explains the concept of a noisy dataset or

instances. So what we did in this regard was we choose 5 neighboring classes and the

majority of the class to classify the unknown points or instances. In the figure below

those misclassified points are represented with rectangles in colors.

IBK 1.1

Applying Cross Validation Algorithm process which divides the instances in 10 equal sized

sets whereby 90 instances will be used for training and 10 instances will be used for

testing. At the end it averages the performances of the 10 classifiers produced from the

10 equal sized instances. Similar results was obtained when k =1 and k = 5. We had

approximately 6% of the misclassified instances compared to when we use the whole

training set. The figure below shows the visual diagram of using cross validation

algorithm to evaluate the training sets of data.

IBK 1.2

In summary, there is no model in this method. We only test instances to make our

prediction. The More value of k increases the more the percentage of misclassified points

also increase. However, the accuracy of the dataset only improves when we have a more

noisy instances. The base line accuracy can only be achieved by increasing the value of

k. and from the data set giving the base line accuracy is approximately 30%. K method is

a good method although it is very slow method because it has to scan the entire training

instances or data set before it can make prediction.

J48 (USE TRAINING SET) J48 (CROSS VALIDATION)


1664 96.2963 %


64 3.7037 %

Number of Leaves : 131

Size of the tree : 182


1596 92.3611 %


132 7.6389 %

Number of Leaves : 131

Size of the tree : 182



1182 25 3 0 | a = unacc

10 370 2 2 | b = acc

0 9 57 3 | c = good

0 4 6 55 | d = vgood



1164 43 3 0 | a = unacc

33 333 11 7 | b = acc

0 17 42 10 | c = good

0 3 5 57 | d = vgood

ONE R ALGORITHM

=== Evaluation on training set ===

=== Summary ===

Correctly Classified Instances 1210 70.0231 %

Incorrectly Classified Instances 518 29.9769 %

Kappa statistic 0

Mean absolute error 0.1499

Root mean squared error 0.3871

Relative absolute error 65.4574 %

Root relative squared error 114.5023 %

Total Number of Instances 1728

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

1 1 0.7 1 0.824 0.5 unacc

0 0 0 0 0 0.5 acc

0 0 0 0 0 0.5 good

0 0 0 0 0 0.5 vgood

Weighted Avg. 0.7 0.7 0.49 0.7 0.577 0.5



1210 0 0 0 | a = unacc

384 0 0 0 | b = acc

69 0 0 0 | c = good

65 0 0 0 | d = vgood

PRISM

=== Evaluation on training set ===

=== Summary ===

Correctly Classified Instances 1728 100 %

Incorrectly Classified Instances 0 0 %

Kappa statistic 1

Mean absolute error 0

Root mean squared error 0

Relative absolute error 0 %

Root relative squared error 0 %

Total Number of Instances 1728

=== Detailed Accuracy By Class ===

TP Rate FP Rate Precision Recall F-Measure ROC Area Class

1 0 1 1 1 1 unacc

1 0 1 1 1 1 acc

1 0 1 1 1 1 good

1 0 1 1 1 1 vgood

Weighted Avg. 1 0 1 1 1 1



1210 0 0 0 | a = unacc

0 384 0 0 | b = acc

0 0 69 0 | c = good

0 0 0 65 | d = vgood

CONCLUSION

The Best algorithm in terms of precision are Prism and ID3. Both algorithm gave exact values of all of the

Instances with no instances misclassified. J48 is also a good algorithm but cannot be used for large

dataset as we see there were about 4% misclassified instances. All of these are examples of supervised

learning and can be used in making various decisions. Although the decision trees makes use of Entropy as

the Basis for making decisions.

2.3 ASSOCIATION RULES

A. Apriori

=======

Minimum support: 0.1 (173 instances)

Minimum metric <confidence>: 0.9

Number of cycles performed: 18

Generated sets of large itemsets:

Size of set of large itemsets L(1): 23



Best rules found:

1. persons=2 576 ==> class=unacc 576 conf:(1)

2. safety=low 576 ==> class=unacc 576 conf:(1)

3. persons=2 lug_boot=small 192 ==> class=unacc 192 conf:(1)

4. persons=2 lug_boot=med 192 ==> class=unacc 192 conf:(1)

5. persons=2 lug_boot=big 192 ==> class=unacc 192 conf:(1)

6. persons=2 safety=low 192 ==> class=unacc 192 conf:(1)

7. persons=2 safety=med 192 ==> class=unacc 192 conf:(1)

8. persons=2 safety=high 192 ==> class=unacc 192 conf:(1)


10. persons=more safety=low 192 ==> class=unacc 192 conf:(1)

B. Apriori

=======

Minimum support: 0.1 (173 instances)

Minimum metric <confidence>: 0.5

Number of cycles performed: 18

Generated sets of large itemsets:


Large Itemsets L(1):

buying=vhigh 432

buying=high 432

buying=med 432

buying=low 432

maint=vhigh 432

maint=high 432

maint=med 432

maint=low 432

doors=2 432

doors=3 432

conf:(1)

doors=4 432

doors=5more 432

persons=2 576

persons=4 576

persons=more 576

lug_boot=small 576

lug_boot=med 576

lug_boot=big 576

safety=low 576

safety=med 576

safety=high 576

class=unacc 1210

class=acc 384



buying=vhigh class=unacc 360

buying=high class=unacc 324

buying=med class=unacc 268

buying=low class=unacc 258

maint=vhigh class=unacc 360

maint=high class=unacc 314

maint=med class=unacc 268

maint=low class=unacc 268

doors=2 class=unacc 326



doors=5more class=unacc 292

persons=2 lug_boot=small 192

persons=2 lug_boot=med 192

persons=2 lug_boot=big 192

persons=2 safety=low 192

persons=2 safety=med 192

persons=2 safety=high 192

persons=2 class=unacc 576

persons=4 lug_boot=small 192

persons=4 lug_boot=med 192

persons=4 lug_boot=big 192

persons=4 safety=low 192

persons=4 safety=med 192

persons=4 safety=high 192

persons=4 class=unacc 312

persons=4 class=acc 198

persons=more lug_boot=small 192

persons=more lug_boot=med 192

persons=more lug_boot=big 192

persons=more safety=low 192

persons=more safety=med 192

persons=more safety=high 192

persons=more class=unacc 322

persons=more class=acc 186

lug_boot=small safety=low 192

lug_boot=small safety=med 192

lug_boot=small safety=high 192

lug_boot=small class=unacc 450

lug_boot=med safety=low 192

lug_boot=med safety=med 192

lug_boot=med safety=high 192

lug_boot=med class=unacc 392

lug_boot=big safety=low 192

lug_boot=big safety=med 192

lug_boot=big safety=high 192

lug_boot=big class=unacc 368

safety=low class=unacc 576

safety=med class=unacc 357

safety=med class=acc 180

safety=high class=unacc 277

safety=high class=acc 204



persons=2 lug_boot=small class=unacc 192

persons=2 lug_boot=med class=unacc 192

persons=2 lug_boot=big class=unacc 192

persons=2 safety=low class=unacc 192

persons=2 safety=med class=unacc 192

persons=2 safety=high class=unacc 192

persons=4 safety=low class=unacc 192

persons=more safety=low class=unacc 192

lug_boot=small safety=low class=unacc 192

lug_boot=med safety=low class=unacc 192

lug_boot=big safety=low class=unacc 192

Best rules found:

1. persons=2 576 ==> class=unacc 576 conf:(1)

2. safety=low 576 ==> class=unacc 576 conf:(1)

3. persons=2 lug_boot=small 192 ==> class=unacc 192 conf:(1)

4. persons=2 lug_boot=med 192 ==> class=unacc 192 conf:(1)

5. persons=2 lug_boot=big 192 ==> class=unacc 192 conf:(1)


7. persons=2 safety=med 192 ==> class=unacc 192 conf:(1)

8. persons=2 safety=high 192 ==> class=unacc 192 conf:(1)


10. persons=more safety=low 192 ==> class=unacc 192 conf:(1)

11. lug_boot=small safety=low 192 ==> class=unacc 192 conf:(1)

12. lug_boot=med safety=low 192 ==> class=unacc 192 conf:(1)

13. lug_boot=big safety=low 192 ==> class=unacc 192 conf:(1)

14. buying=vhigh 432 ==> class=unacc 360 conf:(0.83)

15. maint=vhigh 432 ==> class=unacc 360 conf:(0.83)

16. lug_boot=small 576 ==> class=unacc 450 conf:(0.78)

17. doors=2 432 ==> class=unacc 326 conf:(0.75)

18. buying=high 432 ==> class=unacc 324 conf:(0.75)

19. maint=high 432 ==> class=unacc 314 conf:(0.73)


21. safety=high class=unacc 277 ==> persons=2 192 conf:(0.69)

22. lug_boot=med 576 ==> class=unacc 392 conf:(0.68)


24. doors=5more 432 ==> class=unacc 292 conf:(0.68)

25. lug_boot=big 576 ==> class=unacc 368 conf:(0.64)

26. buying=med 432 ==> class=unacc 268 conf:(0.62)

27. maint=med 432 ==> class=unacc 268 conf:(0.62)

28. maint=low 432 ==> class=unacc 268 conf:(0.62)

29. safety=med 576 ==> class=unacc 357 conf:(0.62)

30. persons=4 class=unacc 312 ==> safety=low 192 conf:(0.62)

31. buying=low 432 ==> class=unacc 258 conf:(0.6)

32. persons=more class=unacc 322 ==> safety=low 192 conf:(0.6)

33. persons=more 576 ==> class=unacc 322 conf:(0.56)

34. persons=4 576 ==> class=unacc 312 conf:(0.54)

35. safety=med class=unacc 357 ==> persons=2 192 conf:(0.54)

36. class=acc 384 ==> safety=high 204 conf:(0.53)

37. lug_boot=big class=unacc 368 ==> persons=2 192 conf:(0.52)

38. lug_boot=big class=unacc 368 ==> safety=low 192 conf:(0.52)

39. class=acc 384 ==> persons=4 198 conf:(0.52)

From the comparison table, the Best Algorithm depends on the type of problem given to the

researcher. If some datasets were given with a minimum condition or less condition. Then

Association rules can be applied in order to get or generate various possible rules or outcomes. The outcomes depends on the confidence limit and the number of rules you want to

generate. So it is good when you are making unsupervised learning algorithm.

Conversely, Prism Rule is good when you are given some sets of conditions. It is best because it minimizes errors to its lowest proportion and gives out the best alternatives. It

has an artificial intel. That gives out the best possible alternatives. It is limited when it

comes to complex decisions making processes.

J48

This decision making process uses tree method. It is not the best Ideal method because of the

divide and conquer rule it uses. It can’t be used in complex decision making process. Many

errors will be generated in the classification process. It is also an example of Supervised

Learning or Classification Algorithm.

DATA MINING - EVALUATING CLUSTERING ALGORITHM

Education

Transcript of DATA MINING - EVALUATING CLUSTERING ALGORITHM