
Chapter 3

A New Discretization Algorithm based on Range Coefficient of Dispersion and Skewness for Neural Networks Classifier

3.1 Introduction

Researchers have proposed many discretization algorithms (Kurgan & Cios, 2004; Tsai et al., 2008; Butterworth et al., 2004; Fayyad & Irani, 1993). These discretization algorithms can be characterized along several dimensions, such as supervised vs. unsupervised, static vs. dynamic, global vs. local, top-down vs. bottom-up, and direct vs. incremental. The Equal-W and Equal-F methods are typical examples of unsupervised discretization: in the Equal-W method the range of values is simply divided into sub-ranges of equal extent, and in the Equal-F method the range is divided into sub-ranges containing an equal number of examples. The entropy-based and Chi-square-based methods are examples of supervised procedures. The best-known supervised top-down algorithms are Information Entropy Maximization (Fayyad & Irani, 1993), CACC (Tsai et al., 2008) and CAIM (Kurgan & Cios, 2004). These algorithms generally maintain the highest interdependence between the target class and the discretized attributes, and attain the best classification accuracy. The well-known bottom-up methods are Chi-merge (Kerber, 1992), Chi2 (Liu & Setiono, 1997), modified Chi2 (Tay & Shen, 2002) and extended Chi2 (Su & Hsu, 2005). Chi-merge is the most typical bottom-up algorithm; its main drawback is that the user has to provide several parameters, such as the maximal and minimal numbers of intervals. Chi2 was proposed based on Chi-merge. Chi2 automatically calculates the value of the significance level, but still requires the user to provide an inconsistency rate to stop the merging procedure. Modified Chi2 replaces the inconsistency check of Chi2 with the quality of approximation after each step of discretization, which makes modified Chi2 a completely automated method. The extended Chi2 algorithm determines the predefined misclassification rate from the data itself and also considers the variance in two adjacent intervals. These modifications enable the algorithm to handle misclassified or uncertain data with better accuracy than the original Chi2 algorithm. Wu et al. (2006) proposed a dynamic discretization algorithm to enhance the decision accuracy of naive Bayes classifiers. However, the static approach is preferable to the dynamic approach because of its independence from other learning algorithms, i.e., a dataset discretized by a static discretization algorithm can be used by any classification algorithm that accepts discrete attributes (Tsai et al., 2008).
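To make the two unsupervised schemes concrete, the following minimal Python sketch computes Equal-W and Equal-F cut points for a toy attribute; the sample values and the interval count are illustrative assumptions, not taken from the thesis.

import numpy as np

def equal_width(values, b):
    # Equal-W: split the range [min, max] into b sub-ranges of equal extent.
    lo, hi = values.min(), values.max()
    return np.linspace(lo, hi, b + 1)            # b+1 cut points delimit b intervals

def equal_frequency(values, b):
    # Equal-F: choose cut points so each sub-range holds roughly the same
    # number of examples.
    return np.quantile(values, np.linspace(0.0, 1.0, b + 1))

values = np.array([4.3, 4.9, 5.0, 5.1, 5.8, 6.3, 6.7, 7.0, 7.7, 7.9])
print(equal_width(values, 4))        # cuts of equal length
print(equal_frequency(values, 4))    # cuts holding approximately equal counts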

The extensive recent research activity in neural classification has established that neural networks are a promising alternative to various conventional classification methods. The most well-known training method is conjugate gradient (Moller, 1990); other methods include quasi-Newton, backpropagation, Levenberg-Marquardt and genetic algorithms. Each training method has a set of parameters that control various aspects of training, such as avoiding local minima or adjusting the speed of convergence.

Conjugate gradient is one of the best neural network learning algorithms. Unlike the backpropagation algorithm, it does not require the user to specify a learning rate and momentum parameters. The traditional conjugate gradient algorithm uses the gradient to compute a search direction. It then uses a line search algorithm such as Brent's method to find the optimal step size along a line in the search direction. The line search avoids the need to compute the Hessian matrix of second derivatives, but it requires computing the error at multiple points along the line. The scaled conjugate gradient algorithm uses a numerical approximation of the second derivatives (the Hessian matrix), but it avoids instability by combining the model-trust-region approach of the Levenberg-Marquardt algorithm with the conjugate gradient approach. This allows scaled conjugate gradient to compute the optimal step size in the search direction without performing the computationally expensive line search used by the traditional conjugate gradient algorithm.

The basic structure of the neural network in this work is a standard three-layered feedforward neural network, which consists of an input layer, a hidden layer and an output layer. A scaled conjugate gradient training algorithm (Moller, 1990) performs learning on this network.
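The thesis trains the network with Moller's scaled conjugate gradient as implemented in KEEL (MLP-CG). Purely as an illustration of the underlying idea, namely gradient-based training that needs no user-supplied learning rate or momentum because the step length comes from the optimizer itself, the sketch below fits a tiny one-hidden-layer network with SciPy's general-purpose conjugate-gradient routine; the toy data, architecture and loss are assumptions, and this is not the KEEL implementation.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                        # toy inputs (assumption)
y = (X[:, 0] + X[:, 1] > 0).astype(float)            # toy binary target

n_in, n_hid = 4, 6                                   # input-hidden-output network
shapes = [(n_hid, n_in), (n_hid,), (1, n_hid), (1,)] # W1, b1, W2, b2

def unpack(theta):
    parts, k = [], 0
    for s in shapes:
        n = int(np.prod(s))
        parts.append(theta[k:k + n].reshape(s))
        k += n
    return parts

def loss(theta):
    W1, b1, W2, b2 = unpack(theta)
    h = np.tanh(X @ W1.T + b1)                       # hidden layer
    out = 1.0 / (1.0 + np.exp(-(h @ W2.T + b2)))     # sigmoid output unit
    return np.mean((out.ravel() - y) ** 2)           # mean squared training error

theta0 = rng.normal(scale=0.1, size=sum(int(np.prod(s)) for s in shapes))
# 'CG' chooses each step length by a line search along the conjugate direction,
# so no learning rate or momentum has to be supplied by the user.
res = minimize(loss, theta0, method='CG')
print("final training error:", res.fun)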

In this chapter a new static, global, incremental, supervised, bottom-up discretization algorithm is proposed. The proposed algorithm is aimed at selecting the number of discrete intervals without any user supervision and at discretizing an attribute into the smallest number of intervals in a small amount of time. It also aims to reduce the training time of the neural network and to improve the accuracy, efficiency and scalability of the classification process. The chapter is organized as follows: Section 3.2 explains the methodology of the proposed discretization, Section 3.3 describes the new discretization algorithm for preprocessing in data mining, and Section 3.4 compares the results of the proposed method with other discretization methods in terms of discretization time and classification accuracy.

3.2 DRDS Discretization Method

This discretization method discretizes continuous attributes based on the range coefficient of dispersion and the skewness of the data; hence the proposed method is called Discretization based on Range coefficient of Dispersion and Skewness of data (DRDS).

The quality of a discretization is measured by two parameters, namely classification accuracy and the number of discretization intervals (Liu & Wang, 2005). More discretization intervals generally mean fewer classification errors, whereas fewer intervals lower the cost of data discretization (Jin et al., 2009). The DRDS method has two phases. The first phase is concerned only with minimizing the classification errors, resulting in more intervals in the Initial Discretization Scheme (IDS); the second phase is concerned with minimizing the number of intervals, without affecting the classification accuracy, by merging the intervals in the IDS. Phase 2 yields the Final Discretization Scheme (FDS).

3.2.1 Initial Discretization Scheme (IDS)

Let us consider a dataset with N continuous attributes, M examples and S target classes. To classify any example in the dataset using a classification algorithm, these N continuous attributes should be discretized. Let k be an index to a target class, where k = 1, 2, ..., S. Let max_k and min_k respectively be the maximum and minimum of the attribute to be discretized in class k. To find the discrete intervals within the range [min_k, max_k], a best interval length has to be computed. A value jmin_k is taken between min_k and max_k to obtain this best interval length: the distance between jmin_k and min_k defines the best interval length. Here jmin_k is the j-th minimum value of the discretizing attribute in class k, and j is decided based on the dispersion of the data series. The degree to which numerical data tend to spread is called the dispersion, or variance, of the data (Han & Kamber, 2001). One of the most common measures of data dispersion is the range. When the dispersion is large, the values are widely scattered; when it is small, they are tightly clustered. So, for a data series with large dispersion a smaller j value is selected, and for a data series with small dispersion a larger j value is selected; i.e., j is always inversely proportional to the dispersion of the data series. Considering the wide range of the data, the range Coefficient of Dispersion (CD) is used to measure the dispersion, as it is independent of the units of measurement (Gupta & Kapoor, 2001). The range coefficient of dispersion is the relative measure of dispersion based on the value of the range (Gupta & Kapoor, 2001). The value CD_k for the data of the discretized attribute in class k is estimated by

CD_k = (|max_k| − |min_k|) / (|max_k| + |min_k|)

The value of CD_k always lies in [−1, +1]. To decide the value of j, the range [−1, +1] is divided into a set of intervals based on the magnitude of the number of distinct values in the discretizing attribute of class k. Let c_k be the count of distinct values in the discretizing attribute of class k and let n_k be the dividing parameter of the range [−1, +1]. To identify n_k, the value of c_k is first rounded to the nearest highest place value and then its order of magnitude plus one is computed. Now the range [−1, +1] is divided into 2n_k intervals with end points

−n_k/n_k, −(n_k − 1)/n_k, −(n_k − 2)/n_k, ..., −2/n_k, −1/n_k, 0/n_k, 1/n_k, 2/n_k, ..., (n_k − 2)/n_k, (n_k − 1)/n_k, n_k/n_k.

This comprises negative intervals from −n_k/n_k to 0/n_k with indices −n_k to 0, and positive intervals from 0/n_k to +n_k/n_k with indices 0 to +n_k. The value j is selected according to the interval in which CD_k lies. If CD_k belongs to one of the highest intervals, namely [(n_k − 1)/n_k, n_k/n_k] or [−1/n_k, 0/n_k], then the data series of the discretizing attribute of class k is said to be more dispersive, and the data can be discretized by selecting the (n_k + 1 − (n_k − 1))-th minimum or the (1 − (−1))-th minimum respectively as jmin_k. Similarly, if CD_k belongs to the interval [(n_k − 2)/n_k, (n_k − 1)/n_k] or [−2/n_k, −1/n_k], then the (n_k + 1 − (n_k − 2))-th minimum or the (1 − (−2))-th minimum is selected respectively as jmin_k, and so on.

Fig. 3.1 illustrates the selection process of j, and the value of j is calculated by

j = (n_k + 1) − i   if CD_k ∈ [i/n_k, (i + 1)/n_k], 0 ≤ i < n_k
j = 1 − i           if CD_k ∈ [i/n_k, (i + 1)/n_k], −n_k ≤ i < 0        (3.1)

where i represents the index of an interval.

Figure 3.1: The selection process of j

Hence jmin_k is the j-th minimum of the discretizing attribute of class k.
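A small Python sketch of this selection, for one attribute restricted to one class, is given below. The phrase "rounded to the nearest highest place value and its order of magnitude plus one" is read here as the number of decimal digits of c_k, and the j-th minimum is taken over the sorted distinct values; both readings are assumptions.

import math
import numpy as np

def select_jmin(class_values):
    # Pick jmin_k for one attribute restricted to one target class,
    # following the CD_k-based rule of Section 3.2.1.
    v = np.asarray(class_values, dtype=float)
    mn, mx = v.min(), v.max()
    cd = (abs(mx) - abs(mn)) / (abs(mx) + abs(mn))   # range coefficient of dispersion
    distinct = np.unique(v)                          # sorted distinct values
    ck = len(distinct)                               # count of distinct values
    nk = len(str(ck))                                # assumed reading: number of digits of c_k
    # index i of the width-1/nk sub-interval of [-1, +1] that contains CD_k
    if cd >= 0:
        i = min(math.floor(cd * nk), nk - 1)
    else:
        i = max(math.floor(cd * nk), -nk)
    j = (nk + 1) - i if i >= 0 else 1 - i            # equation (3.1)
    jmin = distinct[min(j, ck) - 1]                  # j-th smallest distinct value
    return cd, nk, j, jmin

print(select_jmin([5.0, 5.1, 5.4, 5.8, 6.1, 6.3, 6.6, 7.0]))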


In order to obtain a good quality of discretization, finding the best interval length of the continuous-valued attribute is the primary task. The best interval length l_k for a discretizing attribute of class k is obtained by

l_k = jmin_k − min_k        (3.2)

The distance between the j-th minimum value and the 1st minimum value of each class defines the best interval length for that class.

The range coefficient of dispersion CD is estimated from only the two extreme observations, so it is not a fully reliable measure of dispersion (Gupta & Kapoor, 2001). It may select the jmin_k value too small or too big if the data are right- or left-skewed respectively. A distribution of data is said to be skewed if the data are not symmetrical but stretched more to one side than to the other (Gupta & Kapoor, 2001). The selection of a very small jmin_k value due to right skewness makes the interval length l_k too small and the number of intervals n very high. This situation increases the discretization time and produces tiny intervals, but it is avoided by increasing the value of l_k. This is done by computing the lower quintile of the data range, LQ = (max_k − min_k) × 1/5; if the value of l_k is lower than LQ, then l_k is increased as in (3.3). Similarly, the selection of a very big jmin_k value due to left skewness makes the interval length l_k too large and fails to discretize the data in a consistent manner. To reduce the effect of left skewness, the upper quintile of the data range, UQ = (max_k − min_k) × 4/5, is computed, and the splitting length l_k is reduced as in (3.3) when l_k is higher than UQ.

This adjustment process of l_k can be formulated as

l_k = 2 l_k       if l_k < LQ
l_k = sqrt(l_k)   if l_k > UQ
l_k = l_k         otherwise        (3.3)

where LQ = (max_k − min_k) × 1/5 and UQ = (max_k − min_k) × 4/5.
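A sketch of this adjustment is shown below. Note that the first case of (3.3) is reconstructed from a garbled source as a doubling of l_k, so that branch should be treated as an assumption.

import math

def adjust_interval_length(lk, min_k, max_k):
    # Adjustment of the best interval length l_k as in (3.3): widen a length
    # that is tiny relative to the class range (right skew), shrink one that
    # is huge (left skew).
    rng = max_k - min_k
    lq = rng * 1.0 / 5.0        # lower quintile of the data range
    uq = rng * 4.0 / 5.0        # upper quintile of the data range
    if lk < lq:
        return 2.0 * lk         # assumed first case of (3.3): double l_k
    if lk > uq:
        return math.sqrt(lk)    # second case of (3.3): sqrt(l_k)
    return lk

print(adjust_interval_length(0.1, 5.0, 7.0))   # 0.1 < LQ = 0.4  -> 0.2
print(adjust_interval_length(1.9, 5.0, 7.0))   # 1.9 > UQ = 1.6  -> ~1.38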

Let t be a dynamic variable that specifies the value from which the discretization process begins for a discretizing attribute of class k. The value of t is identified by the following procedure:

1. Let P be the sequence of tuples (min_k, max_k, l_k, k) with k = 1 to S

2. Sort P with respect to min_k in ascending order and call the sorted sequence P

3. Let P be p_1, p_2, ..., p_S

4. Call the 1st coordinate of p_i min(i), the 2nd coordinate of p_i max(i), the 3rd coordinate of p_i intlen(i) and the 4th coordinate of p_i cls(i), where i = 1 to S

5. Initialize t as min(1)

6. For each class cls(i), i = 2 to S

6.1. If (min(i) ≥ max(i − 1)) then t = min(i)

6.2. Else t = max(i − 1)

7. End.


Here, for any given i, the parameters min(i), max(i), intlen(i) and cls(i) equal min_k, max_k, l_k and k respectively for some k in the range 1 to S. Initially t starts at min(1) of the sorted set P, and subsequently it is assigned either min(i) or max(i − 1), i = 2 to S, where i is an index of the sorted set P. Assigning min(i) to t when ((i > 1) and (min(i) < max(i − 1))) would lead to the generation of redundant intervals, so assigning t = max(i − 1) instead of min(i), as in Fig. 3.2, avoids the repeated discretization of data in the overlapped area.

Figure 3.2: Data of class cls(i − 1) overlapped with data of class cls(i)

Now the number of intervals n for a discretizing attribute of the target class cls(i), i = 1 to S, is calculated by dividing the total length of data to be discretized in class cls(i) by the best interval length intlen(i). As shown in Fig. 3.2, the difference between max(i) and the dynamic variable t gives the total length of data to be discretized in class cls(i). So

n = (max(i) − t) / intlen(i)        (3.4)

If intlen(i) does not divide (max(i) − t) exactly, then the value of n is incremented by 1, since an additional interval is required to cover the remaining values.


Let the variables lb and ub denote the lower bound and the upper bound of an interval. The intervals in the Initial Discretization Scheme (IDS) can be written as

IDS = {d_11, d_12, ..., d_1n_1, d_21, d_22, ..., d_2n_2, ..., d_S1, d_S2, ..., d_Sn_S}        (3.5)

where d_ij represents interval j of the discretizing attribute of class cls(i):

d_11 = [lb_11, ub_11], d_12 = [lb_12, ub_12], ..., d_1n_1 = [lb_1n_1, ub_1n_1]
d_21 = [lb_21, ub_21], d_22 = [lb_22, ub_22], ..., d_2n_2 = [lb_2n_2, ub_2n_2]
...
d_S1 = [lb_S1, ub_S1], d_S2 = [lb_S2, ub_S2], ..., d_Sn_S = [lb_Sn_S, ub_Sn_S]

Here lb_i1 = min(i), lb_ij = ub_i(j−1), ub_ij = lb_ij + l_i, and l_i = intlen(i) of the class cls(i). Note that an additional interval [ub_in_i, max(i)] is included as a final interval for each cls(i) if (ub_in_i < max(i)).
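The construction of the IDS for a single attribute can be sketched as follows, combining the t-procedure, (3.4) and (3.5). The per-class tuples are assumed to have been computed already (for example with the helpers sketched earlier), and interval generation is assumed to start at the dynamic variable t so that an overlapped area is not discretized twice; the numeric tuples are purely illustrative.

def build_ids(per_class):
    # per_class: list of tuples (min_k, max_k, l_k, k), one per target class.
    # Returns the Initial Discretization Scheme as {class: [(lb, ub), ...]}.
    p = sorted(per_class, key=lambda tup: tup[0])        # sort by min_k
    ids, t = {}, p[0][0]                                 # t initialised to min(1)
    for idx, (mn, mx, length, cls) in enumerate(p):
        if idx > 0:
            prev_max = p[idx - 1][1]
            t = mn if mn >= prev_max else prev_max       # skip the overlapped area
        n = int((mx - t) // length)                      # equation (3.4)
        intervals, lb = [], t
        for _ in range(n):
            intervals.append((lb, lb + length))          # ub_ij = lb_ij + l_i
            lb += length
        if lb < mx:                                      # extra interval covering the rest
            intervals.append((lb, mx))
        ids[cls] = intervals
    return ids

stats = [(4.3, 5.8, 0.4, 1), (4.9, 7.0, 0.5, 2)]         # illustrative (min, max, l, class) tuples
print(build_ids(stats))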

The proposed discretization procedure and the IDS have the following properties:

1. The j-th minimum makes it possible to start with a reduced list of continuous attribute values as cut-points instead of the complete list.

2. The j-th minimum generates fewer intervals in the IDS and reduces the time needed for merging.

3. The method requires minimal discretization time, since the sorting procedure is applied only to S tuples.

3.2.2 Final Discretization Scheme (FDS)

The best discretization method should find the best trade-off between classification accuracy and simplicity. The goal of the proposed discretization method is therefore to reduce the number of intervals while maximizing the classification accuracy. To achieve this, the number of intervals in the IDS is reduced by merging intervals as follows. Let b be the total number of intervals in the IDS, denote each interval in the IDS by I_i, and let q_i be the total number of examples within the interval I_i, where i = 1 to b; M is the total number of examples. The stopping criterion of the FDS is

q_i ≥ sqrt(M), where i = 2 to b − 1.

An interval containing fewer than sqrt(M) examples, called a tiny interval, is merged with its adjacent intervals until the q_i of that interval is ≥ sqrt(M). The merging of an interval I_i, where i = 2 to b − 1, is done with the adjacent interval I_(i−1) or I_(i+1) whose quantity q is the smaller of the two; i.e., first the adjacent interval with the minimum number of data is selected by comparing q_(i−1) and q_(i+1), and then the interval I_i is merged with that selected interval. This is performed only for the intervals I_2 to I_(b−1), since the first and last intervals do not have a left adjacent interval I_(i−1) and a right adjacent interval I_(i+1) respectively. This procedure helps to derive the minimum number of discrete intervals from the IDS. The discrete intervals obtained from this procedure are known as the Final Discretization Scheme.
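A sketch of this merging step is given below. The source fixes the sqrt(M) threshold and the rule of merging into the adjacent interval holding fewer examples; the iteration order and tie-breaking (prefer the left neighbour on ties, then re-check the fused interval) are assumptions.

import math

def merge_to_fds(intervals, counts, M):
    # Phase 2: merge "tiny" intervals (fewer than sqrt(M) examples) into the
    # adjacent interval that currently holds fewer examples.  intervals is a
    # list of (lb, ub) pairs, counts the number of examples q_i in each.
    threshold = math.sqrt(M)
    iv, q = list(intervals), list(counts)
    i = 1
    while i < len(iv) - 1:                           # first and last intervals are skipped
        if q[i] >= threshold:
            i += 1
            continue
        left, right = i - 1, i + 1
        j = left if q[left] <= q[right] else right   # neighbour with fewer examples
        lo, hi = min(i, j), max(i, j)
        iv[lo] = (iv[lo][0], iv[hi][1])              # fuse the two intervals
        q[lo] += q[hi]
        del iv[hi], q[hi]
        i = max(1, lo)                               # re-check the fused interval
    return iv, q

ivs = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
qs = [12, 2, 3, 15, 11]                              # M = 43, sqrt(M) ~ 6.6
print(merge_to_fds(ivs, qs, sum(qs)))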


3.3 DRDS Discretization Algorithm

The proposed algorithm consists of two phases:

1. The first phase of the algorithm obtains the Initial Discretization Scheme by searching globally.

2. The second phase refines the intervals: the intervals are merged, up to the stopping criterion, without affecting the quality of the discretization.

The DRDS algorithm

Input: A dataset with N continuous attributes, M examples and S target classes.

Begin

1. For each continuous attribute,

1.1 For each target class k, k = 1 to S

1.1.1 Find the maximum value max_k, the minimum value min_k, the count of distinct values c_k and the dividing parameter n_k.

1.1.2 Compute the value of j using (3.1) and find the j-th minimum value jmin_k.

1.1.3 Find the best interval length l_k using (3.2).

1.1.4 Adjust l_k using (3.3).

1.2 Let P be the sequence of tuples (min_k, max_k, l_k, k), where k = 1 to S.

1.3 Sort P with respect to min_k in ascending order and call the sorted sequence P.

1.4 Let P be p_1, p_2, ..., p_S.

1.5 Call the 1st coordinate of p_i min(i), the 2nd coordinate of p_i max(i), the 3rd coordinate of p_i intlen(i) and the 4th coordinate of p_i cls(i), where i = 1 to S.

1.6 Initialize the dynamic variable t as min(1).

1.7 For each class cls(i), i = 1 to S

1.7.1 Compute the number of intervals n using (3.4).

1.7.2 Generate n intervals as in (3.5).

1.7.3 If ((i > 1) and (min(i) ≥ max(i − 1))) then t = min(i), else t = max(i − 1).

2. The Initial Discretization Scheme (IDS) for the S classes is

IDS = {d_11, d_12, ..., d_1n_1, d_21, d_22, ..., d_2n_2, ..., d_S1, d_S2, ..., d_Sn_S}

3. Let b be the number of intervals in the IDS and, for each interval I_i,

3.1 Calculate the total number of examples q_i within the interval I_i.

3.2 Merge the interval I_i with the smaller adjacent interval until q_i ≥ sqrt(M), where i = 2 to b − 1.

Output: The Final Discretization Scheme D.

3.4 Experimental Evaluation

The proposed algorithm is evaluated on six well-known continuous and mixed-mode WEKA datasets, namely iris plants (iris), ionosphere (iono), Statlog project heart disease (heart), Pima Indians diabetes (pid), waveform (wav) and Wisconsin breast cancer (breastw), and compared with other discretization methods, namely Equal-W, Equal-F, Chimerge, Ex-chi2, CACC and CAIM. The training and testing examples are selected using 10-fold cross-validation: the dataset is divided into ten disjoint groups of equal size, and the training procedure for each dataset is repeated 10 times, each time with nine partitions as training data and one partition as test data. All the reported results are obtained by averaging the outcomes of the 10 separate tests.
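For reference, the 10-fold protocol can be sketched as follows with scikit-learn. The thesis uses KEEL's MLP-CG; scikit-learn's MLPClassifier has no conjugate-gradient trainer, so the 'lbfgs' solver is used here purely as a stand-in, and the random data only illustrate the mechanics.

import numpy as np
from sklearn.model_selection import KFold
from sklearn.neural_network import MLPClassifier

def cross_validated_accuracy(X, y, folds=10, seed=0):
    # Average test accuracy over a 10-fold split, as in Section 3.4.
    kf = KFold(n_splits=folds, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in kf.split(X):
        # 'lbfgs' is only a stand-in for the thesis's MLP-CG (KEEL) trainer.
        clf = MLPClassifier(hidden_layer_sizes=(10,), solver='lbfgs',
                            max_iter=500, random_state=seed)
        clf.fit(X[train_idx], y[train_idx])
        scores.append(clf.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores))

# Illustrative data; in the thesis the discretized WEKA datasets are used instead.
rng = np.random.default_rng(0)
X = rng.normal(size=(150, 4))
y = (X[:, 0] + X[:, 2] > 0).astype(int)
print(cross_validated_accuracy(X, y))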

3.4.1 Results

Experiments were performed with the DRDS algorithm on all the experimental datasets. Unsupervised algorithms such as Equal-W and Equal-F require the user to specify the number of discrete intervals, while many supervised algorithms apply their own criteria to generate an appropriate number of discrete intervals. In these experiments, the heuristic formula (3.4) is used to estimate the number of intervals. The DRDS algorithm is applied to the entire dataset, as the method is global. Some datasets may be affected by left or right skewness; Fig. 3.3 shows the skewness of the data of the 'insu' attribute of class 1 of the pid dataset. Such cases are also handled by DRDS using (3.3). The results obtained by the DRDS algorithm on the six datasets are shown in Table 3.1.

Figure 3.3: Right skewness of the data of the 'insu' attribute of class 1 of the pid dataset

Table 3.1: The results of DRDS on six datasets

A discretization scheme with very few intervals may not lead to the best quality of discretization; it may decrease the accuracy of a classifier (Kurgan & Cios, 2004). The proposed algorithm generates a small number of intervals, but not so few as to harm accuracy, and leads to the highest classification accuracy with the best discretization time. The classification accuracy of the discretized datasets is computed using a feedforward neural network with conjugate gradient training (MLP-CG) with the help of the KEEL software (Fdez et al., 2009). The results obtained for the six datasets using MLP-CG are shown in Table 3.2.

Table 3.2: The accuracy obtained by MLP-CG on six datasets

Neural networks are often not chosen for classifying large datasets, since they require long training times. Normally, data discretized with unsupervised discretization algorithms, or with some supervised algorithms, still require long training times (Han & Kamber, 2001). However, data discretized by DRDS achieve the highest accuracy with minimal learning time during the classification process using the neural network. In summary, for the datasets used in the experimental section, the proposed algorithm generates a discretization scheme with the smallest possible number of intervals, which implies low computational cost, and achieves a significant improvement in classification accuracy.


3.4.2 Comparison of Discretization Schemes

The comparison of the results on the six datasets with the other six discretization schemes is shown in Table 3.3. The discretization schemes Equal-W and Equal-F are unsupervised top-down methods, Chimerge and Extended Chi2 are bottom-up methods, and CACC and CAIM are two newer top-down methods.

Table 3.3 shows the number of discrete intervals obtained in this experiment, although this is not the main concern: the main goals of discretization are to improve the accuracy and efficiency of the learning algorithm, and the discretization process itself should be as fast as possible. From Table 3.3 it can be observed that the number of intervals generated by DRDS is comparable with Chimerge, Extended Chi2 and CACC.

Table 3.3: Comparison of the seven discretization schemes on six datasets

Regarding the discretization time, the unsupervised methods are the fastest, since they do not consider any class-related information. Fig. 3.4 compares the discretization time of DRDS with the algorithms that require no parameters. The discretization time of DRDS is smaller than that of the other bottom-up method, Ex-chi2, for all datasets, and smaller than that of the other top-down methods for three datasets.

Figure 3.4: Comparison of the discretization time of DRDS with the parameter-free discretization methods

Normally, bottom-up methods require more execution time, as they check the merged inconsistency at every step (Tsai et al., 2008), but the proposed bottom-up method DRDS requires less discretization time due to its low computational cost.

The performance of all discretization schemes is evaluated by running the MLP-CG algorithm on the discretized data of each scheme. The accuracy is computed for all six datasets and the results are tabulated in Table 3.2. The accuracies obtained by the neural network (MLP-CG) for DRDS are compared with the accuracies obtained for the other six discretization schemes on all datasets in Table 3.4. DRDS achieves the highest classification accuracy among all the discretization methods for three datasets, namely heart, pid and wav. Compared with the other bottom-up methods, Chimerge and Ex-chi2, the proposed DRDS method achieves an equal or higher accuracy for all datasets except the breastw dataset, and compared with the top-down methods CACC and CAIM, DRDS achieves a higher or very close accuracy for all datasets.

Table 3.4: Comparison of the accuracies achieved by MLP-CG using the DRDS method and using the other discretization methods

Finally, the experimental results on these datasets show that the proposed DRDS algorithm generates discretized data that result in improved performance of the subsequently used learning algorithms, compared with the data generated by other discretization algorithms.


3.5 Conclusions

In this chapter, the DRDS algorithm is proposed to handle continuous and mixed-mode attributes. The algorithm handles multi-class data and is tested using a feedforward neural network with the conjugate gradient training algorithm. The proposed algorithm does not require any user interaction in the IDS or FDS phases and performs automatic selection of the number of discrete intervals based on the coefficient of dispersion and the skewness of the data range. The proposed method discretizes an attribute into the smallest number of intervals in a small amount of time. The discretization time of DRDS is smaller than that of the other bottom-up methods for most datasets, and DRDS achieves the highest classification accuracy among the seven discretization algorithms compared. The experimental results show that, when the proposed algorithm is applied as a front-end tool, it improves the performance of supervised ML algorithms.

In a nutshell, the DRDS algorithm is a very effective and easy-to-use supervised discretization algorithm which can be applied to problems that require discretization of large datasets.
