
3-1

Decision Tree Learning

Kelby Lee

3-2

Overview

What is a Decision Tree
ID3
REP
IREP
RIPPER
Application

3-3

What is a Decision Tree

3-4

What is a Decision Tree

Select the best attribute that classifies the examples
Top Down
• Start with the concept that represents all examples
Greedy Algorithm
• Select the attribute that classifies the maximum number of examples
Does not backtrack
ID3

3-5

ID3 Algorithm

ID3(Examples, Target_attribute, Attributes)
Create a Root node for the tree
If Examples are all positive
• Return the single-node tree Root, with label = +
If Examples are all negative
• Return the single-node tree Root, with label = -
If Attributes is empty
• Return the single-node tree Root, with label = most common value of Target_attribute in Examples

3-6

ID3 Algorithm

Otherwise
• A ← Best_Attribute(Attributes, Examples)
• Root ← A
For each value vi of A
– Add a new tree branch for A = vi
– Examples_vi is the subset of Examples that have value vi for A
– If Examples_vi is empty, add a leaf node with label = most common value of Target_attribute in Examples
– Otherwise, add a new subtree: ID3(Examples_vi, Target_attribute, Attributes – {A})
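The two ID3 slides can be sketched in Python roughly as follows. This is a minimal illustration, not the presenter's implementation: examples are assumed to be dictionaries, Best_Attribute is assumed to pick the attribute with the highest entropy-based information gain, and the names `id3`, `domains`, and the attribute values "a", "b", "x", "y" are hypothetical. The "all positive" and "all negative" checks are folded into a single "only one label left" check.

```python
import math
from collections import Counter

def entropy(examples, target):
    """Shannon entropy (in bits) of the target labels in `examples`."""
    counts = Counter(ex[target] for ex in examples)
    total = len(examples)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_gain(examples, attribute, target):
    """Entropy of `examples` minus the size-weighted entropy after splitting."""
    total = len(examples)
    remainder = 0.0
    for value in {ex[attribute] for ex in examples}:
        subset = [ex for ex in examples if ex[attribute] == value]
        remainder += len(subset) / total * entropy(subset, target)
    return entropy(examples, target) - remainder

def most_common(examples, target):
    return Counter(ex[target] for ex in examples).most_common(1)[0][0]

def id3(examples, target, attributes, domains):
    """Return a leaf label, or a nested dict {attribute: {value: subtree}}."""
    labels = {ex[target] for ex in examples}
    if len(labels) == 1:                      # all positive or all negative
        return labels.pop()
    if not attributes:                        # nothing left to split on
        return most_common(examples, target)
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in domains[best]:               # one branch per value vi of A
        subset = [ex for ex in examples if ex[best] == value]
        if not subset:                        # Examples_vi is empty
            tree[best][value] = most_common(examples, target)
        else:
            remaining = [a for a in attributes if a != best]
            tree[best][value] = id3(subset, target, remaining, domains)
    return tree

# Four toy examples; attribute values are made up for illustration.
examples = [
    {"att1": "a", "att2": "x", "label": "+"},
    {"att1": "a", "att2": "y", "label": "+"},
    {"att1": "b", "att2": "x", "label": "-"},
    {"att1": "b", "att2": "y", "label": "-"},
]
domains = {"att1": ["a", "b"], "att2": ["x", "y"]}
tree = id3(examples, "label", ["att1", "att2"], domains)
print(tree)  # {'att1': {'a': '+', 'b': '-'}}
```

Because the recursion never revisits a chosen attribute, the sketch shares the greedy, no-backtracking behaviour described earlier.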

3-7

Selecting Best Attribute

New property of an attribute: Information Gain
Information Gain: measures how well a given attribute separates the training examples according to their target classification
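The slide does not spell out a formula. The definition usually paired with ID3 is the entropy-based one, written here with the assumed symbols S for the example set, A for an attribute, S_v for the examples with value v for A, and p_c for the fraction of S in class c:

```latex
\mathrm{Entropy}(S) = -\sum_{c} p_c \log_2 p_c
\qquad
\mathrm{Gain}(S, A) = \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v)
```

A split that leaves every branch pure has zero remaining entropy, so its gain equals Entropy(S).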

3-8

Information Gain

att1 splits {E1+, E2+, E3-, E4-} into {E1+, E2+} and {E3-, E4-}
att2 splits {E1+, E2+, E3-, E4-} into {E1+, E3-} and {E2+, E4-}

att1 = 1
att2 = 0.5
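As a cross-check of the two splits, the sketch below computes the standard entropy-based gain for each. Under that definition att1 scores 1 and att2 scores 0. The 0.5 shown for att2 matches a different score, the fraction of examples labelled correctly when each branch predicts its majority class; reading the slide's numbers that way is an assumption. The helper names are hypothetical.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def gain(parent, branches):
    """Entropy of `parent` minus the size-weighted entropy of its `branches`."""
    total = len(parent)
    return entropy(parent) - sum(len(b) / total * entropy(b) for b in branches)

def majority_accuracy(branches):
    """Fraction labelled correctly when each branch predicts its majority class."""
    total = sum(len(b) for b in branches)
    correct = sum(Counter(b).most_common(1)[0][1] for b in branches)
    return correct / total

parent = ["+", "+", "-", "-"]            # E1+, E2+, E3-, E4-
att1 = [["+", "+"], ["-", "-"]]          # {E1+, E2+} and {E3-, E4-}
att2 = [["+", "-"], ["+", "-"]]          # {E1+, E3-} and {E2+, E4-}

print(gain(parent, att1), gain(parent, att2))                    # 1.0 0.0
print(majority_accuracy(att1), majority_accuracy(att2))          # 1.0 0.5
```

Either way att1 is the better attribute, which is the point of the example.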

3-9

Tree Pruning

Overfit and Simplify
Simplify Tree
In most cases pruning improves accuracy on unseen data

3-10

REP

Reduced Error Pruning
Deletes single conditions or single rules
Improves accuracy on noisy data
O(n^4) on large data sets

3-11

IREP

Incremental Reduced Error Pruning Produces one rule at a time and

eliminates all examples covered by that rule

Stops when no positive examples or pruning produces unacceptable error

3-12

IREP Algorithm

PROCEDURE IREP(Pos, Neg)
BEGIN
  Ruleset := ∅
  WHILE Pos ≠ ∅ DO
    /* Grow and Prune a New Rule */
    split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
    Rule := GrowRule(GrowPos, GrowNeg)
    Rule := PruneRule(Rule, PrunePos, PruneNeg)

3-13

IREP Algorithm

    IF error rate of Rule on (PrunePos, PruneNeg) exceeds 50% THEN
      RETURN Ruleset
    ELSE
      Add Rule to Ruleset
      Remove examples covered by Rule from (Pos, Neg)
    ENDIF
  ENDWHILE
  RETURN Ruleset
END
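The IREP procedure can be sketched in Python as below. Several details are assumptions made to keep the sketch short and deterministic, not part of the slides: a rule is a list of (attribute, value) equality tests over dictionary examples with string values, the split sends every third example to the pruning set instead of splitting at random, GrowRule greedily adds the test with the best precision on the growing set, and PruneRule keeps the rule prefix that maximizes (p - n) / (p + n) on the pruning set.

```python
def covers(rule, example):
    """A rule is a list of (attribute, value) tests; the empty rule covers everything."""
    return all(example[a] == v for a, v in rule)

def split(examples):
    """Deterministic 2/3 grow, 1/3 prune split (a random split is more usual)."""
    grow = [e for i, e in enumerate(examples) if i % 3 != 2]
    prune = [e for i, e in enumerate(examples) if i % 3 == 2]
    return grow, prune

def counts(rule, pos, neg):
    """Number of positive and negative examples the rule covers."""
    p = sum(covers(rule, e) for e in pos)
    n = sum(covers(rule, e) for e in neg)
    return p, n

def grow_rule(grow_pos, grow_neg):
    """Add tests greedily until the rule covers no negative growing examples."""
    rule = []
    while any(covers(rule, e) for e in grow_neg):
        candidates = {(a, v) for e in grow_pos if covers(rule, e)
                      for a, v in e.items() if (a, v) not in rule}
        if not candidates:
            break
        def precision(test):
            p, n = counts(rule + [test], grow_pos, grow_neg)
            return p / (p + n)
        rule.append(max(sorted(candidates), key=precision))
    return rule

def prune_rule(rule, prune_pos, prune_neg):
    """Drop a final sequence of tests, keeping the best-scoring non-empty prefix."""
    def value(r):
        p, n = counts(r, prune_pos, prune_neg)
        return (p - n) / (p + n) if p + n else -1.0
    prefixes = [rule[:k] for k in range(len(rule), 0, -1)]
    return max(prefixes, key=value) if prefixes else rule

def irep(pos, neg):
    pos, neg = list(pos), list(neg)
    ruleset = []
    while pos:
        grow_pos, prune_pos = split(pos)
        grow_neg, prune_neg = split(neg)
        rule = prune_rule(grow_rule(grow_pos, grow_neg), prune_pos, prune_neg)
        p, n = counts(rule, prune_pos, prune_neg)
        if p + n == 0 or n / (p + n) > 0.5:   # error rate exceeds 50%
            return ruleset
        ruleset.append(rule)
        pos = [e for e in pos if not covers(rule, e)]
        neg = [e for e in neg if not covers(rule, e)]
    return ruleset

# Toy data: every positive example uses port 80, every negative uses port 22.
pos = [{"port": "80", "proto": "tcp"} for _ in range(6)]
neg = [{"port": "22", "proto": "tcp"} for _ in range(6)]
print(irep(pos, neg))  # [[('port', '80')]]
```

An accepted rule always covers at least one positive pruning example, so each pass through the loop shrinks Pos and the procedure terminates.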

3-14

RIPPER

Repeated grow-and-simplify produces quite different results from REP
Repeatedly prunes the rule set to minimize the error
Repeated Incremental Pruning to Produce Error Reduction (RIPPER)

3-15

RIPPER Algorithm

PROCEDURE RIPPERk(Pos, Neg)
BEGIN
  Ruleset := IREP(Pos, Neg)
  REPEAT k TIMES
    Ruleset := Optimize(Ruleset, Pos, Neg)
    UncovPos := Pos \ {data covered by Ruleset}
    UncovNeg := Neg \ {data covered by Ruleset}
    Ruleset := Ruleset ∪ IREP(UncovPos, UncovNeg)
  ENDREPEAT
  RETURN Ruleset
END
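The control flow of RIPPERk translates almost line for line into Python. In the sketch below the rule learner, the optimizer, and the coverage test are passed in as parameters; that signature is a hypothetical choice made so the loop can be shown on its own, and the toy stand-ins at the bottom exist only to exercise it.

```python
def ripper_k(pos, neg, k, irep, optimize, covers):
    """RIPPERk loop; `irep`, `optimize` and `covers` are supplied by the caller."""
    ruleset = irep(pos, neg)
    for _ in range(k):
        ruleset = optimize(ruleset, pos, neg)
        # Examples that no rule in the current rule set covers
        uncov_pos = [e for e in pos if not any(covers(r, e) for r in ruleset)]
        uncov_neg = [e for e in neg if not any(covers(r, e) for r in ruleset)]
        # Learn extra rules for whatever is still uncovered
        ruleset = ruleset + irep(uncov_pos, uncov_neg)
    return ruleset

# Toy stand-ins: a "rule" is a single value and covers the examples equal to it,
# the learner returns one rule for the first positive, the optimizer does nothing.
toy_covers = lambda rule, example: rule == example
toy_irep = lambda pos, neg: [pos[0]] if pos else []
toy_optimize = lambda ruleset, pos, neg: ruleset

print(ripper_k([1, 2, 3], [], 2, toy_irep, toy_optimize, toy_covers))  # [1, 2, 3]
```

With k = 2 the toy learner is called three times in total, once up front and once per repetition, which is why all three positives end up covered.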

3-16

Optimization Function

FUNCTION Optimize(Ruleset, Pos, Neg)
BEGIN
  FOR each rule r ∈ Ruleset DO
    split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
    /* Compute Replacement for r */
    r' := GrowRule(GrowPos, GrowNeg)
    r' := PruneRule(r', PrunePos, PruneNeg) guided by error of Ruleset \ {r} ∪ {r'}

3-17

Optimization Function

    /* Compute Revision of r */
    r'' := GrowRule(GrowPos, GrowNeg) starting from r
    r'' := PruneRule(r'', PrunePos, PruneNeg) guided by error of Ruleset \ {r} ∪ {r''}
    Replace r in Ruleset with best of r, r', r'' guided by description length of Compress(Ruleset \ {r} ∪ {x})
  ENDFOR
  RETURN Ruleset
END

3-18

RIPPER Data

3,6.0E+00,6.0E+00,4.0E+00,none,35,empl_contr,7.444444444444445E+00,14,false,9,gnr,true,full,true,full,good.

2,4.5E+00,4.0E+00,3.913333333333334E+00,none,40,empl_contr,7.444444444444445E+00,4,false,10,gnr,true,half,true,full,good.

3,5.0E+00,5.0E+00,5.0E+00,none,40,empl_contr,7.444444444444445E+00,4.870967741935484E+00,false,12,avg,true,half,true,half,good.

2,4.6E+00,4.6E+00,3.913333333333334E+00,tcf,38,empl_contr,7.444444444444445E+00,4.870967741935484E+00,false,1.109433962264151E+01,ba,true,half,true,half,good.

3-19

RIPPER Names file

good,bad.

dur: continuous.

wage1: continuous.

wage2: continuous.

wage3: continuous.

cola: none, tcf, tc.

hours: continuous.

pension: none, ret_allw, empl_contr.

stby_pay: continuous.

shift_diff: continuous.

educ_allw: false, true.

holidays: continuous.

vacation: ba, avg, gnr.

lngtrm_disabil: false, true.

dntl_ins: none, half, full.

bereavement: false, true.

empl_hplan: none, half, full.

3-20

RIPPER Output

Final hypothesis is:

bad :- wage1<=2.8 (14/3).

bad :- lngtrm_disabil=false (5/0).

default good (34/1).

=====================summary==================

Train error rate: 7.02% +/- 3.41% (57 datapoints) <<

Hypothesis size: 2 rules, 4 conditions

Learning time: 0.01 sec
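The summary figures can be reproduced from the counts in the hypothesis, reading each pair such as (14/3) as correctly and incorrectly covered examples. The three pairs sum to 57 datapoints with 4 errors. The standard-error formula with an n - 1 denominator is an assumption: it is chosen because it reproduces the printed 3.41%, not because the slides state it.

```python
import math

# (correct, incorrect) counts read off each line of the hypothesis
coverage = [(14, 3), (5, 0), (34, 1)]

n = sum(c + w for c, w in coverage)        # 57 datapoints
errors = sum(w for _, w in coverage)       # 4 misclassified examples
rate = errors / n
stderr = math.sqrt(rate * (1 - rate) / (n - 1))   # assumed formula

print(f"{100 * rate:.2f}% +/- {100 * stderr:.2f}%")   # 7.02% +/- 3.41%
```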

3-21

RIPPER Hypothesis

bad 14 3 IF wage1 <= 2.8 .

bad 5 0 IF lngtrm_disabil = false .

good 34 1 IF .

.
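Read top to bottom, the hypothesis is an ordered rule list: the first matching rule decides the class, and the last line is the default. A minimal sketch of applying it to two of the data rows shown earlier follows. The field order is taken from the names file; `parse_row` and `classify` are hypothetical helper names.

```python
NAMES = ["dur", "wage1", "wage2", "wage3", "cola", "hours", "pension",
         "stby_pay", "shift_diff", "educ_allw", "holidays", "vacation",
         "lngtrm_disabil", "dntl_ins", "bereavement", "empl_hplan"]

def parse_row(line):
    """Split one data line into an attribute dict and its class label."""
    fields = line.rstrip(".").split(",")
    return dict(zip(NAMES, fields[:-1])), fields[-1]

def classify(example):
    """Apply the learned rules in order; fall through to the default class."""
    if float(example["wage1"]) <= 2.8:
        return "bad"
    if example["lngtrm_disabil"] == "false":
        return "bad"
    return "good"

rows = [
    "3,6.0E+00,6.0E+00,4.0E+00,none,35,empl_contr,7.444444444444445E+00,14,"
    "false,9,gnr,true,full,true,full,good.",
    "2,4.5E+00,4.0E+00,3.913333333333334E+00,none,40,empl_contr,"
    "7.444444444444445E+00,4,false,10,gnr,true,half,true,full,good.",
]
for row in rows:
    example, label = parse_row(row)
    print(classify(example), label)   # prints "good good" for both rows
```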

3-22

IDS

Intrusion Detection System

3-23

IDS

Use data mining to detect anomalies
Better than pattern matching, since it may be possible to detect previously undiscovered attacks

3-24

RIPPER IDS data

86,543520084,192168000120,2698,192168000190,22,6,17,40,2096,158723779,14054,normal.

87,543520084,192168000190,22,192p168p0p120,2698,6,16,40,58387,39130843,46725,normal.

...........................

11,543520084,192168000190,80,192168000120,2703,6,16,40,58400,39162494,46738,anomaly.

12,543520084,192168000190,80,192168000120,2703,6,16,1500,58400,39162494,45277,anomaly.

3-25

RIPPER IDS names

normal,anomaly.

recID: ignore.

timestamp: symbolic.

sourceIP: set.

sourcePORT: symbolic.

destIP: set.

destPORT: symbolic.

protocol: symbolic.

flags: symbolic.

length: symbolic.

winsize: symbolic.

ack: symbolic.

checksum: symbolic.

3-26

RIPPER Output

Final hypothesis is:

anomaly :- sourcePORT='80' (33/0).

anomaly :- destPORT='80' (35/0).

anomaly :- ack='7.01238e+07' (3/0).

anomaly :- ack='7.03859e+07' (2/0).

default normal (87/0).

=================summary=====================

Train error rate: 0.00% +/- 0.00% (160 datapoints) <<

Hypothesis size: 4 rules, 8 conditions

Learning time: 0.01 sec

3-27

RIPPER Output

anomaly 33 0 IF sourcePORT = 80 .

anomaly 35 0 IF destPORT = 80 .

anomaly 3 0 IF ack = 7.01238e+07 .

anomaly 2 0 IF ack = 7.03859e+07 .

normal 87 0 IF .

.
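The rule lines above are regular enough to load into a small classifier. The sketch below assumes each line has the shape "class correct incorrect IF attribute = value ." with at most one condition, which is all these slides show; the real file format may allow more. Attributes are compared as strings because the names file declares them symbolic, so the two ack rules would only match records whose ack value is written exactly as printed in the rule. `parse` and `classify` are hypothetical names.

```python
import re

HYPOTHESIS = """\
anomaly 33 0 IF sourcePORT = 80 .
anomaly 35 0 IF destPORT = 80 .
anomaly 3 0 IF ack = 7.01238e+07 .
anomaly 2 0 IF ack = 7.03859e+07 .
normal 87 0 IF .
"""

FIELDS = ["recID", "timestamp", "sourceIP", "sourcePORT", "destIP", "destPORT",
          "protocol", "flags", "length", "winsize", "ack", "checksum"]

# class, two counts, then an optional single "attribute = value" condition
LINE = re.compile(r"^(\w+) (\d+) (\d+) IF(?: (\w+) = (\S+))? \.$")

def parse(text):
    """Return a list of (label, attribute, value); the default rule has None, None."""
    rules = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            label, _, _, attr, value = m.groups()
            rules.append((label, attr, value))
    return rules

def classify(rules, record):
    """First matching rule wins; a rule without a condition always matches."""
    for label, attr, value in rules:
        if attr is None or record[attr] == value:
            return label

rules = parse(HYPOTHESIS)
rows = [
    "86,543520084,192168000120,2698,192168000190,22,6,17,40,2096,158723779,14054,normal.",
    "11,543520084,192168000190,80,192168000120,2703,6,16,40,58400,39162494,46738,anomaly.",
]
for row in rows:
    fields = row.rstrip(".").split(",")
    record = dict(zip(FIELDS, fields[:-1]))
    print(classify(rules, record), fields[-1])
```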

3-28

IDS Output

3-29

IDS Output

3-30

Conclusion

What is a Decision Tree
ID3
REP
IREP
RIPPER
Application
