
Model Selection Criterions as Data Mining Algorithms’ Selector

The Selection of Data Mining Algorithms through Model Selection Criterions

Dost Muhammad Khan 1, Nawaz Mohamudally 2

1 Assistant Professor, Department of Computer Science & IT, The Islamia University of Bahawalpur, PAKISTAN, and PhD Student, School of Innovative Technologies & Engineering, University of Technology, Mauritius (UTM), MAURITIUS
2 Associate Professor and Manager, Consultancy & Technology Transfer Centre, University of Technology, Mauritius (UTM), MAURITIUS

Abstract 

The selection criterion plays a vital role in choosing the right model for the right dataset. It is a gauge for determining whether a dataset is under-fitted or over-fitted. Over-fitting and under-fitting are both errors in the dataset; they lead to vague or ambiguous knowledge being extracted from it and hence need to be addressed properly. The data is used either to predict future behavior or to describe patterns in an understandable form within the discovery process. The major issue is how to avoid these problems. There are different approaches to avoiding over- and under-fitting, namely model selection, jittering, weight decay, early stopping and Bayesian estimation. We discuss only model selection criteria in this paper. Furthermore, we focus on how the value of a model selection criterion is used to map the appropriate data mining algorithm to a dataset.

Keywords: AIC, BIC, Overfitting, Underfitting, Model Selection

1. Introduction

The purpose of model selection is to identify the model that best fits the available dataset, with model complexity being corrected or penalized. There are two main issues in data mining: the first is bad or wrong data, and the second is controlling the model capacity, making sure it is neither so small that we miss useful and exploitable patterns, nor so large that we confuse pattern with noise. The concepts of over-fitting and under-fitting are therefore important in data mining. Over- and under-fitting arise from missing, noisy, inconsistent and redundant values and from the number of attributes in a dataset. We can avoid these problems by using one of these techniques: apply upper or lower threshold values, remove attributes below a threshold value, and remove noise and redundant attributes. The best remedy is to use plenty of training data and to make neither too many nor too few assumptions. Other possible solutions are model selection, jittering, weight decay, early stopping and Bayesian estimation. The model selection criterion is discussed in this paper.

There exist several criteria for the selection of data mining models, such as the VC (Vapnik-Chervonenkis) dimension, AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), SRMVC (Structural Risk Minimization with VC dimension), CV (cross-validation), the Deviance Information Criterion, the Hannan-Quinn Information Criterion, the Jensen-Shannon divergence, the Kullback-Leibler divergence and many more. We select only two model selection criteria, AIC and BIC. A model is better than another model if it has a smaller AIC or BIC value. Both AIC and BIC have solid theoretical foundations: the Kullback-Leibler distance in information theory for AIC, and the integrated likelihood in Bayesian theory for BIC. If the complexity of the true model does not increase with the size of the dataset, BIC is the preferred criterion; otherwise AIC is preferred.

Since selecting the number of parameters and the number of attributes is a model selection problem, one has to take care of these important aspects of a dataset. Using too many parameters can fit the data perfectly, but this may be over-fitting. Using too few parameters may not fit the dataset at all, which is under-fitting. This shows the importance of the parameters and the observed data in a given dataset. Variable selection by AIC or BIC provides an answer to this problem. We illustrate the importance of comparing models with different numbers of parameters by using AIC and BIC. The goal of this paper is to draw a comparison between the two selected model selection criteria, AIC and BIC, and to map the appropriate data mining algorithms to a particular dataset, i.e. the right algorithm for the dataset. The idea of model selection using AIC or BIC has also been applied recently to epidemiology, microarray data analysis, and DNA sequence analysis [21][22][23][24][25].


The rest of the paper is organized as follows: Section 2 discusses the model selection criteria AIC and BIC, Section 3 describes the methodology, Section 4 discusses the results, and Section 5 draws the conclusion.

2. Model Selection Criteria

A brief introduction to AIC and BIC is given below:

1. Akaike Information Criterion: The AIC is a criterion for model selection, developed by Hirotugu Akaike in 1974 and based on information theory. Suppose that the data is generated by some unknown process $f$. We consider two candidate models to represent $f$: $g_1$ and $g_2$. If we knew $f$, then we could find the information lost by using $g_1$ to represent $f$ by calculating the Kullback-Leibler divergence $D_{KL}(f, g_1)$; similarly, the information lost by using $g_2$ to represent $f$ would be found by calculating $D_{KL}(f, g_2)$. We would then choose the candidate model that minimizes the information loss. The AIC says nothing about how well a model fits the data in an absolute sense: if all the candidate models fit poorly, the AIC gives no warning. In other words, the AIC trades off the accuracy of the model against its complexity [7][9][10][15][17][18][20]. The AIC is defined below:

$\mathrm{AIC} = -2\log(\mathrm{likelihood}) + 2k$,

where $k$ is the number of parameters (so $2k$ is the parameter term) and $\log(\mathrm{likelihood})$ is the logarithm of the likelihood. The term $-2\log(\mathrm{likelihood})$, whose value gradually approaches 0 as the number of parameters increases and the fit becomes perfect, is also called the Model Accuracy. Therefore, the AIC can be written as:

$\mathrm{AIC} = \mathrm{No.\ of\ Parameters} + \mathrm{ModelAccuracy}$

2. Bayesian Information Criterion: The BIC is a criterion for model selection among a class of models with different numbers of parameters. When estimating model parameters using maximum likelihood estimation, it is possible to increase the likelihood by adding parameters, which may result in over-fitting. The BIC resolves this problem by introducing a penalty term for the number of parameters in the model; this penalty is larger in the BIC than in the related AIC. The BIC is widely used for model identification in time series and linear regression. Its main characteristics are: it measures the efficiency of the parameterized model in terms of predicting the data; it penalizes the complexity of the model, where complexity refers to the number of parameters in the model; it is exactly equal to the Minimum Description Length criterion but with negative sign; and it is closely related to other likelihood criteria such as the AIC [4][8][12][13][16][19]. The formula for the BIC is given below:

$\mathrm{BIC} = -2\log(\mathrm{likelihood}) + k\log(n)$,

where $k$ is the number of parameters and $n$ is the sample size, i.e. the number of datapoints in the given dataset. As before, $-2\log(\mathrm{likelihood})$, whose value gradually approaches 0 as the number of parameters increases, is known as the Model Accuracy, and $k\log(n)$ is the Model Size. Therefore, the BIC can be written as:

$\mathrm{BIC} = \mathrm{ModelSize} + \mathrm{ModelAccuracy}$

3. Methodology

Suppose there is a sample $X = \{x_1, x_2, \ldots, x_n\}$ of $n$ observations coming from a distribution with an unknown probability density function $p(X \mid \theta)$, where $p(X \mid \theta)$ is called a parametric model in which all the parameters lie in a finite-dimensional parameter space. These parameters are collected together to form a single $m$-dimensional parameter vector $\theta = (\theta_1, \theta_2, \ldots, \theta_m)$. To use the method of maximum likelihood, one first specifies the joint density function for all observations. The joint density function of the given observations is:

$p(X \mid \theta) = p(x_1, x_2, \ldots, x_n \mid \theta) = p(x_1 \mid \theta)\, p(x_2 \mid \theta) \cdots p(x_n \mid \theta)$


where the observed values $x_1, x_2, \ldots, x_n$ are fixed parameters of this function, while $\theta$ is the function's variable and is allowed to vary freely. From this point of view, the distribution function is called the likelihood:

$\mathrm{likelihood}(\theta \mid x_1, x_2, \ldots, x_n) = p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$

It is more convenient to work with the logarithm of the likelihood function, called the log-likelihood, as shown below:

$\log(\mathrm{likelihood}(\theta \mid x_1, x_2, \ldots, x_n)) = \sum_{i=1}^{n} \log p(x_i \mid \theta), \qquad \hat{\ell} = \frac{1}{n}\log(\mathrm{likelihood})$

where $\hat{\ell}$ denotes the average log-likelihood.

The method of maximum likelihood was first proposed by the English statistician and population geneticist R. A. Fisher. It finds the estimate of a parameter that maximizes the probability of observing the data given a specific model for the data. The idea behind maximum likelihood parameter estimation is to determine the parameters that maximize the probability (likelihood) of the sample data. From a statistical point of view, the method of maximum likelihood is considered robust and yields estimators with good statistical properties. It is a flexible method and can be applied to most models and to different types of data. Although the methodology of maximum likelihood estimation is simple, the implementation is mathematically intense [1][2][3][5][6][11][14].

We select a dataset 'cars', which describes different models of car brands from different countries. The number of attributes is 9 and the number of datapoints (records), i.e. the sample size, is 261. The number of parameters in this dataset is the number of brands from the 3 countries: 14 from the US, 10 from Europe and 6 from Japan, i.e. 30 in total. The records from the US make up 62.45% of the whole dataset, those from Europe 18.00% and those from Japan 19.54%. The parameters (brands) from the US make up 46.67% of the total number of brands, those from Europe 33.33% and those from Japan 16.67%. We use the stepwise variable selection method, starting with one variable and then adding or removing a variable if doing so reduces the value of AIC or BIC. Stepwise selection is a locally optimal procedure and is tested with different starting sets of parameters so that the optimization is not carried to the extreme. A sketch of this procedure is given below.

The following steps explain the computation of the values of AIC and BIC.

Step 1: Calculate the maximum likelihood of the dataset.
The likelihood function is simply the joint probability of observing the data:

$L(\theta \mid x_1, x_2, \ldots, x_n) = p(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^{n} p(x_i \mid \theta)$

Taking the log of this value gives the Model Accuracy defined in Section 2:

$\mathrm{ModelAccuracy} = -2\log(\mathrm{likelihood})$

Step 2: Compute the Model Size.
The formula for the model size is $\mathrm{ModelSize} = k\log(n)$, where $k$ is the number of parameters and $n$ is the number of datapoints.

Step 3: Compute the Minimum Description Length (MDL).

$\mathrm{MDLScore} = \mathrm{ModelSize} + \mathrm{ModelAccuracy}$

The Minimum Description Length score is also referred to as the BIC. Therefore we can write the AIC and BIC as:

$\mathrm{AIC} = \mathrm{No.\ of\ Parameters} + \mathrm{ModelAccuracy}$
$\mathrm{BIC} = \mathrm{ModelSize} + \mathrm{ModelAccuracy}$

The model with the smallest criterion value performs better than the others; therefore, the smallest is the best [1].
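Under the same assumptions as the earlier sketches (natural logarithms and a log-likelihood already maximized in Step 1), the three steps could be wired together as follows:

```python
import numpy as np

def selection_scores(log_likelihood: float, k: int, n: int) -> tuple:
    model_accuracy = -2.0 * log_likelihood     # Step 1: from the maximum likelihood
    model_size = k * np.log(n)                 # Step 2: k * log(n)
    mdl_score = model_size + model_accuracy    # Step 3: the MDL score, i.e. the BIC
    aic_score = 2 * k + model_accuracy         # parameter term + Model Accuracy
    return aic_score, mdl_score

# The candidate with the smallest score wins: "the smallest is the best".
```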


4. Results and Discussion

Case 1: We compute the values of AIC and BIC with different sample sizes, i.e. small, medium and large, and with different numbers of parameters, i.e. a minimum of 2 and a maximum of 9, where the small sample size is 50, the medium is 100 and the large is 400. In case 1 the number of attributes, i.e. the observed data, is 9. Table 1 shows the values of AIC and BIC with respect to the different numbers of parameters when the sample size n is 50.

Table 1. Model selection with n = 50

No. of Parameters    BIC       AIC
2                     15.02     13.10
3                     32.30     29.44
4                     54.47     50.64
5                     76.87     72.09
6                    100.49     94.75
7                    124.95    118.26
8                    150.18    142.53
9                    176.10    167.49

Table 1 shows that as the number of parameters increases, the values of AIC and BIC increase for this small sample size. At the beginning the gap between the two values is small, but as the number of parameters increases the difference between the values of AIC and BIC also becomes large. For this sample dataset AIC is the best selection owing to its lower value for each number of parameters. Figure 1 plots the values of AIC and BIC when the sample size is 50.

Figure 1. A comparison of AIC and BIC when the sample size is 50 (line graph of the AIC and BIC values against the number of parameters).

The graph in figure 1 shows that the values of AIC and BIC increase with the number of parameters when the sample size is small, i.e. when the given dataset has 50 datapoints, but the value of AIC remains below that of BIC throughout. At the beginning the gap between the two lines is minute; as the number of parameters increases, the gap also increases, and at the end of the graph it is clearly visible. So AIC is the preferred criterion for this dataset because of its lower value.

Table 2 shows the values of AIC and BIC with respect to the different numbers of parameters when the sample size n is 100.

Table 2. Model selection with n = 100

No. of Parameters    BIC       AIC
2                     15.71     13.10
3                     33.29     29.38
4                     56.50     51.29
5                     78.83     72.32
6                    102.16     94.35
7                    127.67    118.55
8                    153.18    142.76
9                    179.07    167.35

Table 2 shows that as the number of parameters increases, the values of AIC and BIC increase for this medium sample size. We notice that as the sample size changes from small to medium, the values of AIC and BIC also change. For this sample dataset AIC is again the best selection owing to its lower value for each number of parameters. Figure 2 plots the values of AIC and BIC when the sample size is 100.

Figure 2. A comparison of AIC and BIC when the sample size is 100 (line graph of the AIC and BIC values against the number of parameters).

The graph in figure 2 shows that the values of AIC and BIC increase with the number of parameters when the sample size is medium, i.e. when the given dataset has 100 datapoints, but the AIC line remains below the BIC line throughout. At the beginning the gap between the two lines is minute; as the number of parameters increases the gap grows, and at the end of the graph it is clearly visible. Moreover, the gap between the two lines is wider than in the graph of figure 1, i.e. as the sample size increases, the difference between the two values increases. Again, AIC is the best selection for this sample dataset.

Table 3 shows the values of AIC and BIC with respect to the different numbers of parameters when the sample size n is 400.

Table 3. Model selection with n = 400

No. of Parameters    BIC       AIC
2                     17.06     13.09
3                     35.33     29.37
4                     59.06     51.12
5                     82.92     72.99
6                    107.33     95.42
7                    133.37    119.48
8                    159.05    143.16
9                    185.17    167.30

Table 3 shows that as the number of parameters increases, the values of AIC and BIC increase for this large sample size. We observe that as the sample size changes from medium to large, the values of AIC and BIC change as well. For this sample dataset AIC is the best selection owing to its lower value for each number of parameters. Figure 3 plots the values of AIC and BIC when the sample size is 400.


Figure 3. A comparison of AIC and BIC when the sample size is 400 (line graph of the AIC and BIC values against the number of parameters).

The graph in figure 3 shows that the values of AIC and BIC increase with the number of parameters when the sample size is large, i.e. when the given dataset has 400 datapoints, but the AIC line remains below the BIC line throughout. At the beginning the gap between the two lines is minute; as the number of parameters increases the gap grows, and at the end of the graph it is clearly visible. Moreover, the gap between the two lines is wider than in the graphs of figures 1 and 2, i.e. as the sample size increases, the difference between the two values increases. Again, AIC is the best selection for this sample dataset. We conclude from this case that there is no problem of over-fitting in this dataset.

Case 2: In this case the number of attributes, i.e. the observed data, is the same as in case 1, namely 9. We compute the values of AIC and BIC on the large sample size of 400 but increase the number of parameters from 10 to 24, i.e. a minimum of 10 and a maximum of 24. As the number of parameters increases, the values of both selection criteria increase. When the number of parameters reaches 24, the values of AIC and BIC become infinite, which shows that the dataset is over-fitted, even though the number of observed data in this case is not large. The dataset can produce knowledge with up to 23 parameters; this number is admittedly high, but the values of the selection criteria are not infinite. This also proves that the number of parameters is non-trivial for any dataset: if the user does not take care of the parameters, it is difficult to extract knowledge from the given dataset. Table 4 shows the values of AIC and BIC with respect to the different numbers of parameters when the sample size n is 400.

Table 4. Model selection with n = 400

No. of Parameters    BIC        AIC
10                   251.18     231.32
11                   290.03     268.19
12                   328.85     305.03
13                   359.69     333.88
14                   399.57     371.77
15                   442.01     412.23
16                   473.04     441.27
17                   517.13     483.38
18                   555.81     520.06
19                   601.18     563.46
20                   647.36     607.64
21                   692.97     651.27
22                   742.68     698.99
23                   786.41     740.74
24                   Infinity   Infinity


Table 4 shows that as the number of parameters increases, the values of AIC and BIC increase when the sample size is large, i.e. when the given dataset has 400 datapoints. At the beginning the gap between the two values is minute, but as the number of parameters increases the gap grows and reaches its maximum at the end. The values of AIC and BIC are undetermined from 24 parameters onwards. For this sample dataset AIC is the best selection. The upper threshold of this model is 23 parameters and the lower threshold is 2, so the usable range lies between 2 and 23; a value outside this range creates the problem of over- or under-fitting.

Case 3: We take another case in which the number of parameters is small, the observed data is large and the sample size is large. We observe that there is no great difference between the values of AIC and BIC; the choice for this dataset is AIC. The results are shown in table 5 below.

Table 5. Model selection with n = 600

No. of Parameters    BIC      AIC      No. of Attributes
5                    522.10   511.12   39

Table 5 shows that even with the increased number of attributes and sample size there is no over-fitting problem for this dataset, and the value of AIC remains small compared to that of BIC; therefore AIC is the right choice for this dataset.

The second example of this case is one where the number of parameters is large, the number of observed data is large and the sample size is medium. We notice that the values of AIC and BIC are infinite, i.e. undetermined, which proves that the dataset is over-fitted even though the sample size is medium. The user has to modify the dataset in order to extract knowledge. The results are shown in table 6 below.

Table 6. Model selection with n = 150

No. of Parameters    BIC        AIC        No. of Attributes
20                   Infinity   Infinity   72

Table 6 shows that with the increase in the number of attributes and the number of parameters, the dataset becomes over-fitted and the values of both AIC and BIC are infinite.

The third example of this case (the DNA dataset) is one where the number of parameters is small, the number of observed data is large and the sample size is very large. We notice that there is again no great difference between the values of AIC and BIC, but the value of AIC is still smaller than that of BIC. Although the observed data in this dataset is in binary (0, 1) format, the values of both model selection criteria are still computable. The results are shown in table 7 below.

Table 7. Model selection with n = 2000

No. of Parameters    BIC        AIC       No. of Attributes
3                    648.7343   640.333   180

The result of all the cases discussed above is that if the number of observed data or the number of parameters is too small, the dataset is under-fitted. Over- and under-fitting are noise in the data which must be removed in order to extract useful knowledge from the dataset. We conclude that the number of parameters and the number of attributes play a vital role in whether a dataset becomes over- or under-fitted; therefore, in order to avoid these problems, the user must prepare the datasets carefully.

Case 4: The model selection criterion AIC is used to map the appropriate algorithms to a particular dataset in order to extract knowledge. We select three data mining algorithms, namely K-means, C4.5 and Data Visualization, and five different datasets: 'Iris', 'Diabetes', 'Breastcancer', 'DNA' and 'Cars'. The sample size of each dataset is different. We have to select the appropriate algorithms for each dataset, i.e. the best and most suitable algorithm for each dataset. The AIC value of each dataset is computed and shown in table 8.

Table 8. The AIC values of the datasets

No. of Parameters    No. of Attributes    Value of AIC    Sample Size    Dataset
2                    9                      21.60          788           Diabetes
2                    10                     24.62          233           Breastcancer
3                    5                      26.77          150           Iris
3                    180                   922.48         2000           DNA
23                   9                    1054.75          400           Cars

Table 8 shows that the AIC values of the datasets 'Diabetes', 'Breastcancer' and 'Iris' are small compared to those of 'DNA' and 'Cars'. We also notice that although the number of attributes of the dataset 'Cars' is small, its AIC value is high, owing to its high number of parameters. Similarly, for the dataset 'DNA' the number of parameters is small but the AIC value is again high, owing to its high number of attributes. The conclusion is that if either the number of attributes or the number of parameters is large, the AIC value will be high, which shows that the dataset is not suitable for the extraction of knowledge and requires cleansing.

The computational and storage complexities of the selected data mining algorithms are shown in table 9.

Table 9. The complexities of the data mining algorithms

Data Mining Algorithm           Computational Complexity    Storage Complexity
K-means                         O(nkl)                      O(n+k)
C4.5                            O(n.m)                      O(n)
Data Visualization (2D graph)   O(d.n)                      O(n)

Table 9 lists the computational and storage complexities of the K-means, C4.5 and Data Visualization data mining algorithms, where 'n' is the sample size, 'm' is the number of attributes, 'k' is the number of clusters, 'l' is the number of iterations and 'd' is the dimension (in our case 2). We take the 'log' of the computational complexities of these algorithms; the value gradually approaches zero as the number of parameters decreases. A simple example explains the use of the 'log': if an input dataset containing 10 items takes one second to process, a dataset containing 100 items takes two seconds, and a dataset containing 1000 items takes three seconds. This makes the use of the 'log' extremely efficient when dealing with large datasets. There are other uses of taking the 'log' of a value: the 'log' is taken if the transformed data comes closer to satisfying the assumptions of the statistical model; to analyze exponential processes, because the 'log' function is the inverse of the exponential function; to measure the pH or acidity of a chemical solution; to measure the intensity of an earthquake on the Richter scale; to model many natural processes statistically; and, here, to compare the computational complexity of a data mining algorithm with the value of the model selection criterion AIC. This helps to select the right algorithm for the given dataset, as the sketch below illustrates.
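A minimal sketch of this mapping is given below. The use of the natural logarithm and the values of the constants (k clusters, l iterations) are our own assumptions; the paper does not state the exact constants it used.

```python
import numpy as np

def log_complexities(n: int, m: int, k: int, l: int, d: int = 2) -> dict:
    """Natural log of the computational complexities listed in Table 9."""
    return {
        "K-means": np.log(n * k * l),          # O(nkl)
        "C4.5": np.log(n * m),                 # O(n.m)
        "Data Visualization": np.log(d * n),   # O(d.n)
    }
```

Each returned value is then compared against the dataset's AIC value: the closer the two numbers, the better the match between algorithm and dataset.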

Table 10. The AIC value of the dataset 'Iris' and the complexities of the algorithms

Iris / Data Mining Algorithms    Log of Complexity    Value of AIC
K-means                          19.39                26.77
C4.5                             16.78                26.77
Data Visualization               15.46                26.77

Table 10 shows the relationship between the computational complexities of the data mining algorithms and the value of the model selection criterion AIC for the dataset 'Iris'. The AIC value of the dataset is 26.77; the log of the computational complexity of K-means is 19.39, the log for C4.5 is 16.78 and the log for Data Visualization is 15.46, each of which is close to the AIC value. Since the log values of the computational complexities of all the data mining algorithms are close to the AIC value, these algorithms are the right choice for the dataset 'Iris'. It is clear from the table that K-means is the first choice for the dataset 'Iris', then C4.5 and finally Data Visualization. This is further illustrated in figure 4.


Figure 4. The graph between the value of AIC and the complexities of the data mining algorithms for the dataset 'Iris' (bar graph over K-means, C4.5 and Data Visualization).

The graph in figure 4 compares the computational complexities of the data mining algorithms with the AIC value of the dataset 'Iris'. The graph shows that the complexity values are close to the AIC value; therefore, these algorithms are the best choice for the dataset 'Iris'. The match between the values is not perfect, but the difference is not huge.

Table 11. The AIC value of the dataset 'Breastcancer' and the complexities of the algorithms

Breastcancer / Data Mining Algorithms    Log of Complexity    Value of AIC
K-means                                  20.06                24.62
C4.5                                     18.41                24.62
Data Visualization                       16.73                24.62

Table 11 shows the relationship between the computational complexities of the data mining algorithms and the AIC value for the dataset 'Breastcancer'. The AIC value of the dataset is 24.62; the log of the computational complexity of K-means is 20.06, the log for C4.5 is 18.41 and the log for Data Visualization is 16.73, each of which is close to the AIC value. Since the log values of the computational complexities of all the data mining algorithms are close to the AIC value, these algorithms are the right choice for the dataset 'Breastcancer'. It is clear from the table that K-means is the first choice for the dataset 'Breastcancer', then C4.5 and finally Data Visualization. This is illustrated in figure 5.

Figure 5. The graph between the value of AIC and the complexities of the data mining algorithms for the dataset 'Breastcancer' (bar graph over K-means, C4.5 and Data Visualization).

The graph in figure 5 compares the computational complexities of the data mining algorithms with the AIC value of the dataset 'Breastcancer'. The graph shows that the complexity values are close to the AIC value; therefore, these algorithms are the best choice for the dataset 'Breastcancer'. The match between the values is not perfect, but the difference is not huge.

Table 12. The AIC value of the dataset 'Diabetes' and the complexities of the algorithms

Diabetes / Data Mining Algorithms    Log of Complexity    Value of AIC
K-means                              23.57                21.60
C4.5                                 22.41                21.60
Data Visualization                   20.24                21.60

Table 12 shows the relationship between the computational complexities of the data mining algorithms and the AIC value for the dataset 'Diabetes'. The AIC value of the dataset is 21.60. The log of the complexity of K-means is 23.57, which is greater than the AIC value but still close to it. The log of the complexity of C4.5 is 22.41, again greater than the AIC value but not by a large margin, and the log of the complexity of Data Visualization is 20.24, which is almost equal to the AIC value. Since the log values of the complexities of all the data mining algorithms are close to the AIC value, these algorithms are the right choice for the dataset 'Diabetes'. Figure 6 illustrates this comparison.

Figure 6. The graph between the value of AIC and the complexities of the data mining algorithms for the dataset 'Diabetes' (bar graph over K-means, C4.5 and Data Visualization).

The graph in figure 6 compares the computational complexities of the data mining algorithms with the AIC value of the dataset 'Diabetes'. The graph shows that the values for K-means and C4.5 are greater than the AIC value while the value for Data Visualization is less than it, but there is no great difference between the values; they are very close to each other. Therefore, these algorithms are the best choice for the dataset 'Diabetes'. In this case the differences are smaller than in figures 4 and 5, although the complexity values of K-means and C4.5 exceed the AIC value of the dataset.

Table 13. The AIC value of the dataset 'DNA' and the complexities of the algorithms

DNA / Data Mining Algorithms    Log of Complexity    Value of AIC
K-means                         26.84                922.48
C4.5                            29.42                922.48
Data Visualization              22.93                922.48

Table 13 shows the relationship between the computational complexities of the data mining algorithms and the AIC value for the dataset 'DNA'. The AIC value of the dataset is 922.48, while the log of the complexity of K-means is 26.84, the log for C4.5 is 29.42 and the log for Data Visualization is 22.93; in every case there is a huge difference between the two values. Since the log values of the complexities of all the data mining algorithms are nowhere near the AIC value, these algorithms are not suitable for the dataset 'DNA'. In other words, the selected data mining algorithms are not the right choice for this dataset. In order to make the dataset 'DNA' usable, the number of attributes of the dataset must be reduced. This is further illustrated in figure 7.


Figure 7. The graph between the value of AIC and the complexities of the data mining algorithms for the dataset 'DNA' (bar graph over K-means, C4.5 and Data Visualization).

The graph in figure 7 compares the computational complexities of the data mining algorithms with the AIC value of the dataset 'DNA'. The graph shows a huge difference between the complexity values and the AIC value; therefore, these algorithms are not suitable for the dataset 'DNA'. We can say that in this case there is no comparison between these values because the difference is so large.

Table 14. The AIC value of the dataset 'Cars' and the complexities of the algorithms

Cars / Data Mining Algorithms    Log of Complexity    Value of AIC
K-means                          25.21                1054.75
C4.5                             20.46                1054.75
Data Visualization               18.29                1054.75

Table 14 shows the relationship between the computational complexities of the data mining algorithms and the AIC value for the dataset 'Cars'. The AIC value of the dataset is 1054.75, while the log of the complexity of K-means is 25.21, far away from the AIC value; the log for C4.5 is 20.46, again a huge difference; and the log for Data Visualization is 18.29, again an enormous gap. Since the log values of the complexities of all the data mining algorithms are nowhere near the AIC value, these algorithms are not suitable for the dataset 'Cars'. In other words, the selected data mining algorithms are not the right choice for this dataset. In order to make the dataset 'Cars' usable, the number of parameters of the dataset must be reduced. Figure 8 illustrates this comparison.

Figure 8. The graph between the value of AIC and the complexities of the data mining algorithms for the dataset 'Cars' (bar graph over K-means, C4.5 and Data Visualization).

The graph in figure 8 compares the computational complexities of the data mining algorithms with the AIC value of the dataset 'Cars'. The graph shows a gigantic difference between the log values of the complexities and the AIC value; therefore, these algorithms are not suitable for the dataset 'Cars'. We can say that in this case there is no comparison between these values because the difference is so large.


5. Conclusion

This paper discusses the non-trivial and important role of the parameters and the observed data in a given dataset. If a dataset has a small number of parameters or a small number of observed data, it is difficult to produce knowledge from it; it is called under-fitted. If a dataset has a large number of parameters or a large number of observed data, it is again difficult to handle these parameters and to produce knowledge; it is called over-fitted. Over-fitting and under-fitting are errors in the dataset which show that it has not been properly cleansed. The conclusion is that a middle range, i.e. a number of parameters neither too small nor too large, is required to find knowledge in a dataset. In order to determine whether a dataset is over-fitted or under-fitted, one has to use a model selection criterion. In this paper we use two model selection criteria, AIC and BIC. These criteria are tested over small, medium and large sample sizes and a comparison is drawn. The AIC performs better than the BIC for all the sample sizes, and as the sample size increases from small to large, the difference between the values of AIC and BIC increases. Therefore, we opt for AIC as the selection criterion in our proposed model. Model selection is not hypothesis testing; it does not conclude whether a model is wrong, but explores and ranks alternative models. In this paper we map the value of the model selection criterion AIC of a dataset to the computational complexities of the data mining algorithms K-means, C4.5 and Data Visualization, which helps to select the appropriate data mining algorithm(s) for a particular dataset. We test the AIC value for algorithm selection over five datasets, namely 'Iris', 'Breastcancer', 'Diabetes', 'DNA' and 'Cars'; the number of parameters, the sample size and the number of observed data are different for each dataset. Bar graphs are plotted to compare the AIC values of these datasets with the computational complexities of the data mining algorithms. The conclusion is that the datasets 'Iris', 'Breastcancer' and 'Diabetes' are suitable for knowledge extraction and show no over- or under-fitting problem, whereas the datasets 'DNA' and 'Cars' are not suitable for knowledge extraction owing to over-fitting. These datasets require further cleansing: the number of parameters must be reduced in the dataset 'Cars' and the number of attributes in the dataset 'DNA'. We also recommend a threshold for the selection of a data mining algorithm: if the percentage difference between the value of AIC and the log of the complexity is at most 40, then the algorithm(s) is suitable for that dataset; otherwise the dataset requires cleansing, i.e. a reduction of the number of parameters or of the number of attributes (a sketch of this rule is given below). To conclude, in this paper we use the model selection criterion AIC to avoid the over- and under-fitting problems of a dataset and map its value to the complexities of data mining algorithms, which helps to select the right algorithm for the right dataset. The results are encouraging and satisfactory. Furthermore, the number of parameters of a dataset can also be used as the number of clusters (the value of 'k') for the K-means clustering data mining algorithm.
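A minimal sketch of the recommended threshold rule is given below, assuming the percentage difference is taken relative to the AIC value, which the paper does not state explicitly.

```python
def suitable_algorithms(aic: float, log_complexities: dict, threshold: float = 40.0) -> list:
    """Return the algorithms whose log-complexity lies within `threshold`
    percent of the dataset's AIC; an empty list signals that the dataset
    needs cleansing (fewer parameters or fewer attributes)."""
    return [name for name, c in log_complexities.items()
            if abs(aic - c) / aic * 100.0 <= threshold]

# Example with the 'Breastcancer' values from Table 11:
print(suitable_algorithms(24.62, {"K-means": 20.06, "C4.5": 18.41,
                                  "Data Visualization": 16.73}))
# -> ['K-means', 'C4.5', 'Data Visualization']
```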

Acknowledgement

The authors are thankful to The Islamia University of Bahawalpur, Pakistan, for providing financial assistance to carry out this research activity under HEC project 6467/F-II.

References

[1] URL: http://www.doc.ic.ac.uk/~dfg/ProbabilisticInference/IDAPILecture08.pdf, 2011.
[2] Aldrich, John, "R. A. Fisher and the making of maximum likelihood 1912-1922", Statistical Science 12 (3): 162-176, doi:10.1214/ss/1030037906, MR1617519, 1997.
[3] Andersen, Erling B., "Asymptotic Properties of Conditional Maximum Likelihood Estimators", Journal of the Royal Statistical Society B 32, 283-301, 1970.
[4] Andersen, Erling B., "Discrete Statistical Models with Social Science Applications", North Holland, 1980.
[5] Basu, Debabrata, "Statistical Information and Likelihood: A Collection of Critical Essays by Dr. D. Basu", J. K. Ghosh, editor, Lecture Notes in Statistics Volume 45, Springer-Verlag, 1988.
[6] Le Cam, Lucien, "Maximum likelihood - an introduction", ISI Review 58 (2): 153-171, 1990.
[7] Burnham, Kenneth P., Anderson, David R., "Model Selection and Multi-model Inference: A Practical Information-theoretic Approach", 2nd edition, Springer, ISBN 0-387-95364-7, 2002.
[8] Brockwell, P.J., Davis, R.A., "Time Series: Theory and Methods", 2nd edition, Springer, 2009.
[9] Akaike, Hirotugu, "A new look at the statistical model identification", IEEE Transactions on Automatic Control 19 (6): 716-723, doi:10.1109/TAC.1974.1100705, MR0423716, 1974.
[10] Weakliem, David L., "A Critique of the Bayesian Information Criterion for Model Selection", University of Connecticut, Sociological Methods & Research, vol. 27, no. 3, 359-397, 1999.
[11] Cavanaugh, Joseph E., "Statistics and Actuarial Science", The University of Iowa, URL: http://myweb.uiowa.edu/cavaaugh/ms_lec_6_ho.pdf, 2009.
[12] Liddle, A.R., "Information criteria for astrophysical model selection", http://xxx.adelaide.edu.au/PS_cache/astro-ph/pdf/0701/0701113v2.pdf
[13] Ernest S. et al., "How to be a Bayesian in SAS: Model Selection Uncertainty in PROC LOGISTIC and PROC GENMOD", Harvard Medical School, Harvard Pilgrim Health Care, Boston, MA, 2010.
[14] In Jae Myung, "Tutorial on maximum likelihood estimation", Department of Psychology, Ohio State University, USA, Journal of Mathematical Psychology 47 (2003), pp. 90-100.
[15] Isabelle Guyon, "A practical guide to model selection", ClopiNet, Berkeley, CA 94708, USA, 2010.
[16] Vladimir Cherkassky, "Comparison of Model Selection Methods for Regression", Dept. of Electrical & Computer Engineering, University of Minnesota, 2010.
[17] Schwarz, G., "Estimating the dimension of a model", Annals of Statistics 6:461-464, 1978.
[18] Burnham, K.P., Anderson, D.R., "Model Selection and Inference", Springer, 1998.
[19] Parzen, E., Tanabe, K., Kitagawa, G., "Selected Papers of Hirotugu Akaike", Springer, 1998.
[20] Li, W., "DNA segmentation as a model selection process", Proc. RECOMB'01, in press, 2001.
[21] Li, W., "New criteria for segmenting DNA sequences", submitted, 2001.
[22] Li, W., Sherriff, A., Liu, X., "Assessing risk factors of human complex diseases by Akaike and Bayesian information criteria (abstract)", Am J Hum Genet 67(Suppl):S222, 2000.
[23] Li, W., Yang, Y., "How many genes are needed for a discriminant microarray data analysis", Proc. CAMDA'00, in press, 2001.
[24] Li, W., Yang, Y., Edington, J., Haghighi, F., "Determining the number of genes needed for cancer classification using microarray data", submitted, 2001.
[25] Wentian Li, Dale R. Nyholt, "Marker Selection by AIC and BIC", Laboratory of Statistical Genetics, The Rockefeller University, New York, NY, 2010.
