
Breeding Decision Trees Using Evolutionary Techniques

Papagelis Athanasios - Kalles Dimitrios, Computer Technology Institute & AHEAD RM

Introduction

• We use GAs to evolve simple and accurate binary decision trees
• Simple genetic operators over tree structures
• Experiments with UCI datasets: very good size, competitive accuracy results
• Experiments with synthetic datasets: superior accuracy results

Current tree induction algorithms…

• …use greedy heuristics to guide the search during tree building and to prune the resulting trees
• Fast implementations
• Accurate results on widely used benchmark datasets (like the UCI datasets)
• Optimal results? No
• Good for real-world problems? There are not many real-world datasets available for research.

More on greedy heuristics

• They can quickly guide us to desired solutions
• On the other hand, they can substantially deviate from the optimal
• WHY? They are very strict, which means they are VERY GOOD for only a limited problem space

Why should GAs work?

GAs are not…
• Hill climbers: blind on complex search spaces
• Exhaustive searchers: extremely expensive

They are…
• Beam searchers: they balance the time needed against the space searched

• Applicable to a bigger problem space
• Good results for many more problems
• No need to tune or derive new algorithms

Another way to see it…

Biases:
• Preference bias: characteristics of the output
  • We should choose this, e.g. small trees
• Procedural bias: how will we search?
  • We should not have to choose this. Unfortunately, we have to:
    • Greedy heuristics make strong hypotheses about the search space
    • GAs make weak hypotheses about the search space

The real-world question…
Are there datasets where hill-climbing techniques are really inadequate? e.g. unnecessarily big or misguiding output

Yes, there are…
• Conditionally dependent attributes (e.g. XOR)
• Irrelevant attributes
  • Many solutions use GAs as a preprocessor to select adequate attributes
  • Direct genetic search can prove more efficient for those datasets

The proposed solution

1. Select the desired decision tree characteristics (e.g. small size)
2. Adopt a decision tree representation with appropriate genetic operators
3. Create an appropriate fitness function
4. Produce a representative initial population
5. Evolve for as long as you wish!

Initialization procedure
• Population of minimal decision trees: simple and fast
  • Choose a random attribute value as the test value
  • Choose two random classes as the leaves
• Example: the one-test tree A=2 with leaves Class=1 and Class=2 (a minimal sketch of this step follows below)
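A minimal sketch of this initialization step in Python (the Node class, the attribute/value encoding and the helper names are illustrative assumptions, not the authors' implementation):

    import random

    class Node:
        """A binary decision-tree node: either a test (attribute == value) or a leaf."""
        def __init__(self, attribute=None, value=None, left=None, right=None, label=None):
            self.attribute, self.value = attribute, value  # test, e.g. A == 2
            self.left, self.right = left, right            # "test holds" / "test fails" branches
            self.label = label                             # class label, set only on leaves

        def is_leaf(self):
            return self.label is not None

    def random_minimal_tree(attributes, values, classes):
        """One random test node with two random class leaves, e.g. A=2 with leaves Class=1 and Class=2."""
        return Node(attribute=random.choice(attributes),
                    value=random.choice(values),
                    left=Node(label=random.choice(classes)),
                    right=Node(label=random.choice(classes)))

    def init_population(size, attributes, values, classes):
        """Population of minimal decision trees: simple and fast."""
        return [random_minimal_tree(attributes, values, classes) for _ in range(size)]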

Genetic operators

[Figure 1: Mutation examples. A mutated node receives a new test value; a mutated leaf receives a new class.]

[Figure 2: Crossover examples; a chosen node is marked in each parent.]
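A sketch of these operators on the hypothetical Node representation above; Figure 2 presumably exchanges the subtrees rooted at the chosen nodes, which is what the crossover below does:

    import copy
    import random

    def collect_nodes(tree):
        """Flatten a tree into a list of its nodes so one can be chosen at random."""
        if tree.is_leaf():
            return [tree]
        return [tree] + collect_nodes(tree.left) + collect_nodes(tree.right)

    def mutate(tree, values, classes):
        """Figure 1: a mutated inner node gets a new test value, a mutated leaf gets a new class."""
        tree = copy.deepcopy(tree)
        node = random.choice(collect_nodes(tree))
        if node.is_leaf():
            node.label = random.choice(classes)
        else:
            node.value = random.choice(values)
        return tree

    def crossover(parent_a, parent_b):
        """Graft a randomly chosen subtree of parent_b onto a randomly chosen node of a copy of parent_a."""
        child = copy.deepcopy(parent_a)
        donor = copy.deepcopy(random.choice(collect_nodes(parent_b)))
        target = random.choice(collect_nodes(child))
        target.attribute, target.value = donor.attribute, donor.value
        target.left, target.right, target.label = donor.left, donor.right, donor.label
        return child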

Payoff function

Balance between accuracy and size

payoff(tree_i) = CorrectClassified_i^2 * x / (size_i^2 + x)

Set x depending on the desired output characteristics:
• Small trees? x near 1
• Emphasis on accuracy? x grows big
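A sketch of this payoff on the representation above (the training-set format, a list of (instance, label) pairs with dictionary instances, and the default value of x are illustrative assumptions):

    def tree_size(tree):
        """Total number of nodes (tests plus leaves)."""
        if tree.is_leaf():
            return 1
        return 1 + tree_size(tree.left) + tree_size(tree.right)

    def classify(tree, instance):
        """Route an instance to a leaf; the left branch is taken when the test attribute == value holds."""
        while not tree.is_leaf():
            tree = tree.left if instance[tree.attribute] == tree.value else tree.right
        return tree.label

    def payoff(tree, train_set, x=1000):
        """payoff(tree_i) = CorrectClassified_i^2 * x / (size_i^2 + x).
        x near 1 rewards small trees; a large x shifts the emphasis to accuracy."""
        correct = sum(1 for instance, label in train_set if classify(tree, instance) == label)
        return correct ** 2 * x / (tree_size(tree) ** 2 + x)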

Advanced System Characteristics

• Scaled payoff function (Goldberg, 1989)
• Alternative crossovers: evolution towards fit subtrees
  • Accurate subtrees had less chance to be used for crossover or mutation
• Limited Error Fitness (LEF) (Gathercole & Ross, 1997): significant CPU-time savings with insignificant accuracy losses
• Second-layer GA to test the effectiveness of all those components
  • Coded information about the mutation/crossover rates and different heuristics, as well as a number of other optimizing parameters
  • Most recurring results: mutation rate 0.005, crossover rate 0.93, use of a crowding-avoidance technique (a sketch of the evolution loop with these settings follows below)
  • Alternative crossover/mutation techniques did not produce better results than basic crossover/mutation
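Putting the earlier sketches together, a possible evolution loop using the recurring parameter values reported here; the selection scheme (plain tournament), the population size and the generation count are illustrative assumptions, and the scaled payoff, crowding avoidance, LEF and the second-layer GA are omitted for brevity:

    import copy
    import random

    def evolve(train_set, attributes, values, classes,
               generations=500, pop_size=200,
               crossover_rate=0.93, mutation_rate=0.005):
        population = init_population(pop_size, attributes, values, classes)
        for _ in range(generations):
            fitness = {id(tree): payoff(tree, train_set) for tree in population}
            def pick():
                a, b = random.sample(population, 2)   # tournament of two (an assumption)
                return a if fitness[id(a)] >= fitness[id(b)] else b
            next_generation = []
            while len(next_generation) < pop_size:
                if random.random() < crossover_rate:  # crossover rate 0.93
                    child = crossover(pick(), pick())
                else:
                    child = copy.deepcopy(pick())
                if random.random() < mutation_rate:   # mutation rate 0.005
                    child = mutate(child, values, classes)
                next_generation.append(child)
            population = next_generation
        return max(population, key=lambda tree: payoff(tree, train_set))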

Search space / Induction costs

• 10 leaves, 6 values, 2 classes: search space > 50,173,704,142,848 (HUGE!) (see the breakdown below)
• Greedy feature selection: O(ak), a = attributes, k = instances (Quinlan, 1986)
  • O(a^2 k^2) with one level of lookahead (Murthy and Salzberg, 1995)
  • O(a^d k^d) for d-1 levels of lookahead
• Proposed heuristic: O(gen * k^2 * a)
• Extended heuristic: O(gen * k * a)
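That search-space figure appears to decompose as (number of binary tree shapes with 10 leaves) x (one of 6 values chosen at each of the 9 internal nodes) x (one of 2 classes at each of the 10 leaves):

    Catalan(9) * 6^9 * 2^10 = 4,862 * 10,077,696 * 1,024 = 50,173,704,142,848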

How it works? An example (a)

• An artificial dataset with eight rules (26 possible values, three classes)
• The first two activation rules are shown below:
  (15.0%) c1: A=(a or b or t) & B=(a or h or q or x)
  (14.0%) c1: B=(f or l or s or w) & C=(c or e or f or k)
• Huge search space!!!

How it works? An example (b)

[Chart: Mean Fitness, Fitness and Accuracy (left axis, 0.1 to 1) and Size (right axis, 0 to 300) plotted against Generations (10 to 790).]

Illustration of greedy heuristics problem

An example dataset (XOR over A1 & A2):

A1  A2  A3  Class
T   F   T   T
T   F   F   T
F   T   F   T
F   T   T   T
F   F   F   F
F   F   F   F
T   T   T   F
T   T   F   T
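This is where gain-based greedy splitting struggles: for a perfectly balanced XOR target the class is statistically independent of each relevant attribute taken alone, so H(Class) = 1 bit, H(Class | A1) = 1 bit and Gain(A1) = H(Class) - H(Class | A1) = 0 (likewise for A2), while on a small sample an irrelevant attribute such as A3 can show a small spurious gain and be chosen first.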

C4.5 result tree

[Figure: C4.5's result tree. Besides the A1=t, A2=t and A2=f tests (with t/f leaves), it also tests the irrelevant attribute A3. Totally unacceptable!!!]

More experiments towards this direction

Name   Attributes   Class Function                              Noise             Instances   Random Attributes
Xor1   10           (A1 xor A2) or (A3 xor A4)                  No                100         6
Xor2   10           (A1 xor A2) xor (A3 xor A4)                 No                100         6
Xor3   10           (A1 xor A2) or (A3 and A4) or (A5 and A6)   10% class error   100         4
Par1   10           Three-attribute parity problem              No                100         7
Par2   10           Four-attribute parity problem               No                100         6

Results for artificial datasets

        C4.5        GATree
Xor1    67±12.04    100±0
Xor2    53±18.57    90±17.32
Xor3    79±6.52     78±8.37
Par1    70±24.49    100±0
Par2    63±6.71     85±7.91

Results for UCI datasets

Table 1: Classification accuracy

                C4.5          OneR          GATree
Colic           83.84±3.41    81.37±5.36    85.01±4.55
Heart-Statlog   74.44±3.56    76.3±3.04     77.48±3.07
Diabetes        66.27±3.71    63.27±2.59    63.97±3.71
Credit          83.77±2.93    86.81±4.45    86.81±4
Hepatitis       77.42±6.84    84.52±6.2     80.46±5.39
Iris            92±2.98       94.67±3.8     93.8±4.02
Labor           85.26±7.98    72.73±14.37   87.27±7.24
Lymph           65.52±14.63   74.14±7.18    75.24±10.69
Breast-Cancer   71.93±5.11    68.17±7.93    71.03±8.34
Zoo             90±7.91       43.8±10.47    85.4±4.02
Vote            96.09±3.86    95.63±4.33    95.63±4.33
Glass           55.24±7.49    43.19±4.33    53.48±4.33
Balance-Scale   78.24±4.4     59.68±4.4     71.15±6.47
AVERAGES        78.46         72.64         78.98

Table 2: Average tree sizes

                C4.5    GATree
Colic           27.4    5.84
Heart-Statlog   39.4    8.28
Diabetes        140.6   6.6
Credit          57.8    3
Hepatitis       19.8    5.56
Iris            9.6     7.48
Labor           8.6     8.72
Lymph           28.2    7.96
Breast-Cancer   35.4    6.68
Zoo             17      10.12
Vote            11      3
Glass           60.2    8.98
Balance-Scale   106.6   8.92
AVERAGES        43.2    7.01

C4.5 / OneR deficiencies

• Similar preference biases: accurate, small decision trees (this is acceptable)
• Procedural biases not optimized:
  • Emphasis on accuracy (C4.5): tree size not optimized
  • Emphasis on size (OneR): trivial search policy
• Pruning as a greedy heuristic has similar disadvantages

Future work
• Minimize evolution time
  • Crossover/mutation operators change the tree from a node downwards
  • So we can re-classify only the instances that belong to the changed node's subtree (a sketch follows below)
  • But we need to maintain more node statistics
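A sketch of that bookkeeping idea (the per-node instances cache and the function names are hypothetical, not the authors' design):

    def route(node, instance):
        """Descend from `node` to a leaf and return the predicted class."""
        while not node.is_leaf():
            node = node.left if instance[node.attribute] == node.value else node.right
        return node.label

    def updated_correct_count(previous_correct, old_subtree_correct, changed_node):
        """After a crossover/mutation at changed_node, only the training instances cached
        at that node (the extra per-node statistic) can change their prediction, so only
        they are re-routed; changed_node.instances holds (instance, label) pairs."""
        new_subtree_correct = sum(1 for instance, label in changed_node.instances
                                  if route(changed_node, instance) == label)
        return previous_correct - old_subtree_correct + new_subtree_correct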

[Chart: average needed re-classification. Percentage of instances that must be re-classified (0.0 to 0.7) versus tree level (1 to 49), for a complete binary decision tree and a linear binary decision tree.]

Future work (2)

• Choose the output class using a majority vote over the produced tree forest (experts voting); a sketch follows below
• Pruning is a greedy heuristic: could a GA do the pruning?
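A sketch of the experts-voting idea over the final forest, reusing the classify routine sketched earlier (a plain unweighted majority vote is an assumption):

    from collections import Counter

    def forest_vote(forest, instance):
        """Let every evolved tree vote and return the most common class."""
        votes = Counter(classify(tree, instance) for tree in forest)
        return votes.most_common(1)[0][0]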