Decision Tree under MapReduce Week 14 Part II. Decision Tree.
3-1
Decision Tree Learning
Kelby Lee
3-2
Overview
• What is a Decision Tree
• ID3
• REP
• IREP
• RIPPER
• Application
3-3
What is Decision Tree
3-4
What is Decision Tree
Select best attribute that classifies examples
Top Down
• Start with concept that represents all
Greedy Algorithm
• Select attribute that classifies maximum examples
Does not backtrack
ID3
3-5
ID3 Algorithm
ID3(Examples, Target_attribute, Attributes)
Create a Root node for the tree
If Examples are all positive
• Return single-node tree Root, with label = +
If Examples are all negative
• Return single-node tree Root, with label = -
If Attributes is empty
• Return single-node tree Root, with label = most common value of Target_attribute in Examples
3-6
ID3 Algorithm
Otherwise
• A ← Best_Attribute(Attributes, Examples)
• Root ← A
For each value vi of A
– Add a new tree branch for A = vi
– Examples_vi is the subset of Examples that have value vi for A
– If Examples_vi is empty
  – Add a leaf node with label = most common value of Target_attribute in Examples
– Else add a new subtree: ID3(Examples_vi, Target_attribute, Attributes – {A})
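The two ID3 slides can be read as the following minimal Python sketch. It is an illustration under assumptions, not the author's implementation: examples are dicts, the tree is a nested dict, `domains` (the possible values of each attribute) is a hypothetical extra argument needed to detect empty branches, and Best_Attribute is assumed to be entropy-based information gain.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(examples, attribute, target):
    """Entropy reduction obtained by splitting examples on attribute."""
    before = entropy([e[target] for e in examples])
    after = 0.0
    for value in {e[attribute] for e in examples}:
        subset = [e[target] for e in examples if e[attribute] == value]
        after += len(subset) / len(examples) * entropy(subset)
    return before - after

def most_common(examples, target):
    """Most frequent value of the target attribute."""
    return Counter(e[target] for e in examples).most_common(1)[0][0]

def id3(examples, target, attributes, domains):
    """Return a leaf label or a nested dict {attribute: {value: subtree}}."""
    labels = {e[target] for e in examples}
    if len(labels) == 1:                # all positive or all negative
        return labels.pop()
    if not attributes:                  # no attribute left to test
        return most_common(examples, target)
    best = max(attributes, key=lambda a: information_gain(examples, a, target))
    tree = {best: {}}
    for value in domains[best]:
        subset = [e for e in examples if e[best] == value]
        if not subset:                  # empty branch: majority-label leaf
            tree[best][value] = most_common(examples, target)
        else:
            remaining = [a for a in attributes if a != best]
            tree[best][value] = id3(subset, target, remaining, domains)
    return tree

# Toy data (hypothetical): att1 separates the classes, att2 does not.
examples = [
    {"att1": "a", "att2": "x", "label": "+"},
    {"att1": "a", "att2": "y", "label": "+"},
    {"att1": "b", "att2": "x", "label": "-"},
    {"att1": "b", "att2": "y", "label": "-"},
]
domains = {"att1": ["a", "b"], "att2": ["x", "y"]}
tree = id3(examples, "label", ["att1", "att2"], domains)
print(tree)
```

On this toy data the sketch splits once on att1 and stops, because both branches are already pure.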
3-7
Selecting Best Attribute
New property of Attribute: Information Gain
Information Gain: measures how well a given attribute separates the training examples according to their target classification
3-8
Information Gain
att1 splits {E1+, E2+, E3-, E4-} into {E1+, E2+} and {E3-, E4-}
att2 splits {E1+, E2+, E3-, E4-} into {E1+, E3-} and {E2+, E4-}
att1 = 1
att2 = 0.5
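As a worked check of the example, here is the usual entropy-based definition of information gain applied to the four examples. The formulas are the standard ones and are an assumption here, since the slides do not state which formula produced their scores.

```latex
\begin{align*}
\mathrm{Entropy}(S) &= -p_{+}\log_2 p_{+} - p_{-}\log_2 p_{-} \\
\mathrm{Gain}(S, A) &= \mathrm{Entropy}(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\,\mathrm{Entropy}(S_v) \\[4pt]
\mathrm{Entropy}(S) &= -\tfrac{1}{2}\log_2\tfrac{1}{2} - \tfrac{1}{2}\log_2\tfrac{1}{2} = 1 \\
\mathrm{Gain}(S, \mathrm{att1}) &= 1 - \left(\tfrac{1}{2}\cdot 0 + \tfrac{1}{2}\cdot 0\right) = 1 \\
\mathrm{Gain}(S, \mathrm{att2}) &= 1 - \left(\tfrac{1}{2}\cdot 1 + \tfrac{1}{2}\cdot 1\right) = 0
\end{align*}
```

Under this definition att1 scores 1 bit, matching the slide, while att2 scores 0 rather than 0.5. The slide's 0.5 is consistent with a different score, the fraction of examples classified correctly when each branch predicts its majority label (2 of 4 for att2, 4 of 4 for att1). Either way att1 is the better attribute.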
3-9
Tree Pruning
• Overfit and Simplify
• Simplify the tree after it has been grown
• In most cases pruning improves accuracy
3-10
REP
Reduced Error Pruning
• Deletes single conditions or single rules
• Improves accuracy on noisy data
• O(n^4) on large data sets
3-11
IREP
Incremental Reduced Error Pruning
• Produces one rule at a time and eliminates all examples covered by that rule
• Stops when no positive examples remain or pruning produces a rule with unacceptable error
3-12
IREP Algorithm
PROCEDURE IREP(Pos, Neg)
BEGIN
  Ruleset := ∅
  WHILE Pos ≠ ∅ DO
    /* Grow and Prune a New Rule */
    split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
    Rule := GrowRule(GrowPos, GrowNeg)
    Rule := PruneRule(Rule, PrunePos, PruneNeg)
3-13
IREP Algorithm
    IF error rate of Rule on (PrunePos, PruneNeg) exceeds 50% THEN
      RETURN Ruleset
    ELSE
      Add Rule to Ruleset
      Remove examples covered by Rule from (Pos, Neg)
    ENDIF
  ENDWHILE
  RETURN Ruleset
END
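The IREP loop above can be sketched in Python as follows. The slides do not define GrowRule or PruneRule, so everything inside `grow_rule`, `prune_rule`, and `split` is an assumption made for illustration: rules are lists of (attribute, value) equality tests, growing and pruning both use plain precision as a stand-in metric, and the data is split roughly 2/3 for growing and 1/3 for pruning.

```python
import random

def covers(rule, example):
    """A rule is a list of (attribute, value) tests; all must hold."""
    return all(example.get(a) == v for a, v in rule)

def precision(rule, pos, neg):
    """Fraction of covered examples that are positive (0.0 if none are covered)."""
    p = sum(covers(rule, e) for e in pos)
    n = sum(covers(rule, e) for e in neg)
    return p / (p + n) if p + n else 0.0

def split(examples, rng):
    """Shuffle, then keep about 2/3 for growing and the rest for pruning."""
    shuffled = list(examples)
    rng.shuffle(shuffled)
    cut = max(1, 2 * len(shuffled) // 3) if shuffled else 0
    return shuffled[:cut], shuffled[cut:]

def grow_rule(grow_pos, grow_neg):
    """Greedily add the test that most improves precision (stand-in metric)."""
    rule = []
    candidates = sorted({(a, v) for e in grow_pos for a, v in e.items()})
    while any(covers(rule, e) for e in grow_neg):
        unused = [c for c in candidates if c not in rule]
        if not unused:
            break
        best = max(unused, key=lambda c: precision(rule + [c], grow_pos, grow_neg))
        if precision(rule + [best], grow_pos, grow_neg) <= precision(rule, grow_pos, grow_neg):
            break                       # no test improves the rule any further
        rule.append(best)
    return rule

def prune_rule(rule, prune_pos, prune_neg):
    """Keep the shortest prefix with the best precision on the pruning set."""
    prefixes = [rule[:k] for k in range(1, len(rule) + 1)]
    if not prefixes:
        return rule
    return max(prefixes, key=lambda r: precision(r, prune_pos, prune_neg))

def irep(pos, neg, seed=0):
    """Learn rules one at a time, removing the examples each rule covers."""
    rng = random.Random(seed)
    pos, neg = list(pos), list(neg)
    ruleset = []
    while pos:
        grow_pos, prune_pos = split(pos, rng)
        grow_neg, prune_neg = split(neg, rng)
        rule = grow_rule(grow_pos, grow_neg)
        rule = prune_rule(rule, prune_pos, prune_neg)
        p = sum(covers(rule, e) for e in prune_pos)
        n = sum(covers(rule, e) for e in prune_neg)
        # Stop on an empty rule, no pruning-set coverage, or error above 50%.
        if not rule or p + n == 0 or n / (p + n) > 0.5:
            return ruleset
        ruleset.append(rule)
        pos = [e for e in pos if not covers(rule, e)]
        neg = [e for e in neg if not covers(rule, e)]
    return ruleset

# Hypothetical toy data: every positive uses port 80, every negative port 22.
pos = [{"port": "80", "proto": "tcp"} for _ in range(6)]
neg = [{"port": "22", "proto": "tcp"} for _ in range(6)]
rules = irep(pos, neg)
print(rules)
```

Because each accepted rule must cover at least one positive example in the pruning set, the set of positives shrinks on every iteration and the loop terminates.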
3-14
RIPPER
• Repeated Incremental Pruning to Produce Error Reduction (RIPPER)
• Repeated grow-and-simplify produces quite different results than REP
• Repeatedly prunes the rule set to minimize the error
3-15
RIPPER Algorithm
PROCEDURE RIPPERk(Pos, Neg)
BEGIN
  Ruleset := IREP(Pos, Neg)
  REPEAT k TIMES
    Ruleset := Optimize(Ruleset, Pos, Neg)
    UncovPos := Pos \ {data covered by Ruleset}
    UncovNeg := Neg \ {data covered by Ruleset}
    Ruleset := Ruleset ∪ IREP(UncovPos, UncovNeg)
  ENDREPEAT
  RETURN Ruleset
END
3-16
Optimization Function
FUNCTION Optimize(Ruleset, Pos, Neg)
BEGIN
  FOR each rule r ∈ Ruleset DO
    split (Pos, Neg) into (GrowPos, GrowNeg) and (PrunePos, PruneNeg)
    /* Compute Replacement for r */
    r’ := GrowRule(GrowPos, GrowNeg)
    r’ := PruneRule(r’, PrunePos, PruneNeg)
         guided by error of Ruleset \ {r} ∪ {r’}
3-17
Optimization Function
    /* Compute Revision of r */
    r’’ := GrowRule(GrowPos, GrowNeg), starting from r
    r’’ := PruneRule(r’’, PrunePos, PruneNeg)
         guided by error of Ruleset \ {r} ∪ {r’’}
    Replace r in Ruleset with the best x of r, r’, r’’,
         guided by description length of Compress(Ruleset \ {r} ∪ {x})
  ENDFOR
  RETURN Ruleset
END
3-18
RIPPER Data
3,6.0E+00,6.0E+00,4.0E+00,none,35,empl_contr,7.444444444444445E+00,14,false,9,gnr,true,full,true,full,good.
2,4.5E+00,4.0E+00,3.913333333333334E+00,none,40,empl_contr,7.444444444444445E+00,4,false,10,gnr,true,half,true,full,good.
3,5.0E+00,5.0E+00,5.0E+00,none,40,empl_contr,7.444444444444445E+00,4.870967741935484E+00,false,12,avg,true,half,true,half,good.
2,4.6E+00,4.6E+00,3.913333333333334E+00,tcf,38,empl_contr,7.444444444444445E+00,4.870967741935484E+00,false,1.109433962264151E+01,ba,true,half,true,half,good.
3-19
RIPPER Names file
good,bad.
dur: continuous.
wage1: continuous.
wage2: continuous.
wage3: continuous.
cola: none, tcf, tc.
hours: continuous.
pension: none, ret_allw, empl_contr.
stby_pay: continuous.
shift_diff: continuous.
educ_allw: false, true.
holidays: continuous.
vacation: ba, avg, gnr.
lngtrm_disabil: false, true.
dntl_ins: none, half, full.
bereavement: false, true.
empl_hplan: none, half, full.
3-20
RIPPER Output
Final hypothesis is:
bad :- wage1<=2.8 (14/3).
bad :- lngtrm_disabil=false (5/0).
default good (34/1).
=====================summary==================
Train error rate: 7.02% +/- 3.41% (57 datapoints) <<
Hypothesis size: 2 rules, 4 conditions
Learning time: 0.01 sec
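The summary line can be reproduced from the counts printed after each rule, assuming each pair (a/b) means a examples classified correctly and b incorrectly by that rule. That reading is an assumption, but it is consistent with the totals: the counts sum to 57 datapoints with 4 errors. The sketch below also assumes the +/- term is a standard error computed as sqrt(p(1-p)/(n-1)), which happens to match the printed 3.41%; the function name `summarize` is hypothetical.

```python
import math

def summarize(counts):
    """Return (datapoints, errors, error rate, standard error) from (right, wrong) pairs."""
    n = sum(right + wrong for right, wrong in counts)
    errors = sum(wrong for _, wrong in counts)
    rate = errors / n
    stderr = math.sqrt(rate * (1 - rate) / (n - 1))   # assumed formula
    return n, errors, rate, stderr

# Counts taken from the three lines of the final hypothesis.
n, errors, rate, stderr = summarize([(14, 3), (5, 0), (34, 1)])
print(f"Train error rate: {rate:.2%} +/- {stderr:.2%} ({n} datapoints)")
```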
3-21
RIPPER Hypothesis
bad 14 3 IF wage1 <= 2.8 .
bad 5 0 IF lngtrm_disabil = false .
good 34 1 IF .
.
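To show how such a hypothesis is used, the sketch below applies the two rules and the default to the first record from the RIPPER Data slide. It assumes that the data columns follow the attribute order of the names file, with the class label last, and that a record gets the label of the first rule that fires, falling back to the default. The helper names (`parse_record`, `classify`) are hypothetical.

```python
# Attribute order as listed in the names file.
ATTRIBUTES = [
    "dur", "wage1", "wage2", "wage3", "cola", "hours", "pension", "stby_pay",
    "shift_diff", "educ_allw", "holidays", "vacation", "lngtrm_disabil",
    "dntl_ins", "bereavement", "empl_hplan",
]

# The learned hypothesis: (label, condition) pairs tried in order.
RULES = [
    ("bad", lambda r: float(r["wage1"]) <= 2.8),
    ("bad", lambda r: r["lngtrm_disabil"] == "false"),
]
DEFAULT = "good"

def parse_record(line):
    """Split one data line into (attribute dict, class label)."""
    *values, label = line.strip().rstrip(".").split(",")
    return dict(zip(ATTRIBUTES, values)), label

def classify(record):
    """Return the label of the first rule that fires, else the default."""
    for label, condition in RULES:
        if condition(record):
            return label
    return DEFAULT

LINE = ("3,6.0E+00,6.0E+00,4.0E+00,none,35,empl_contr,7.444444444444445E+00,"
        "14,false,9,gnr,true,full,true,full,good.")
record, label = parse_record(LINE)
print(classify(record), label)
```

For this record wage1 is 6.0 and lngtrm_disabil is true, so neither rule fires and the default label applies, which agrees with the label stored in the data.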
3-22
IDS
Intrusion Detection System
3-23
IDS
• Use Data Mining to detect anomalies
• Better than Pattern Matching, since it may be possible to detect previously undiscovered attacks
3-24
RIPPER IDS data
86,543520084,192168000120,2698,192168000190,22,6,17,40,2096,158723779,14054,normal.
87,543520084,192168000190,22,192p168p0p120,2698,6,16,40,58387,39130843,46725,normal.
...........................
11,543520084,192168000190,80,192168000120,2703,6,16,40,58400,39162494,46738,anomaly.
12,543520084,192168000190,80,192168000120,2703,6,16,1500,58400,39162494,45277,anomaly.
3-25
RIPPER IDS names
normal,anomaly.
recID: ignore.
timestamp: symbolic.
sourceIP: set.
sourcePORT: symbolic.
destIP: set.
destPORT: symbolic.
protocol: symbolic.
flags: symbolic.
length: symbolic.
winsize: symbolic.
ack: symbolic.
checksum: symbolic.
3-26
RIPPER Output
Final hypothesis is:
anomaly :- sourcePORT='80' (33/0).
anomaly :- destPORT='80' (35/0).
anomaly :- ack='7.01238e+07' (3/0).
anomaly :- ack='7.03859e+07' (2/0).
default normal (87/0).
=================summary=====================
Train error rate: 0.00% +/- 0.00% (160 datapoints) <<
Hypothesis size: 4 rules, 8 conditions
Learning time: 0.01 sec
3-27
RIPPER Output
anomaly 33 0 IF sourcePORT = 80 .
anomaly 35 0 IF destPORT = 80 .
anomaly 3 0 IF ack = 7.01238e+07 .
anomaly 2 0 IF ack = 7.03859e+07 .
normal 87 0 IF .
.
3-28
IDS Output
3-29
IDS Output
3-30
Conclusion
• What is a Decision Tree
• ID3
• REP
• IREP
• RIPPER
• Application