7/30/2019 Ch3 - Phan Lop Dua Vao Cay Quyet Dinh
Chapter 3: CLASSIFICATION BASED ON DECISION TREES
DATA MINING
CLASSIFICATION
Assign an object to one of a set of known classes. This is a supervised learning problem.
Given a database D = {t1, t2, ..., tn} and a set of classes C = {C1, C2, ..., Cm}, where each class is Ci = {t, Ci(t)}, classification is the problem of determining a mapping f: D → C such that ∀ti, ∃Cj: ti ∈ Cj.
The classification process:
- Step 1: Determine a function y = f(X) that assigns a class label to an object X. The function f can be represented by rules, mathematical formulas, or a decision tree.
- Step 2: Use f to classify (label) objects whose class is unknown.
PREDICTION
Determine missing or unknown values by means of classification functions built from the sample set.
DECISION TREE
A decision tree is a predictive model: a mapping from observations about an object or phenomenon to conclusions about its target value.
- Each internal node corresponds to a variable; each edge from an internal node to a child represents a specific value of that variable.
- Each leaf represents the predicted value of the target variable, given the values of the variables along the path from the root to that leaf.
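As a rough illustration of this structure, the sketch below stores a tree as nested Python dictionaries and classifies a sample by walking from the root to a leaf. The attributes `color` and `size`, their values, and the class labels are hypothetical and serve only to show the shape of the data.

```python
# A leaf is a plain string (the predicted class); an internal node is a dict
# holding the tested attribute and one child per attribute value.
# The attributes "color" and "size" are hypothetical, not taken from the slides.
toy_tree = {
    "attribute": "color",
    "children": {
        "red": "positive",
        "blue": {
            "attribute": "size",
            "children": {"small": "negative", "large": "positive"},
        },
    },
}


def classify(tree, sample):
    """Follow the branch matching each tested attribute until a leaf is reached."""
    node = tree
    while isinstance(node, dict):
        value = sample[node["attribute"]]
        node = node["children"][value]
    return node


print(classify(toy_tree, {"color": "blue", "size": "small"}))  # negative
```

The path taken (color = blue, then size = small) is exactly the sequence of variable values that the leaf's prediction is conditioned on.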
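BUILDING A DECISION TREE
Tree construction:
- Recursively partition the training samples until all objects/samples at each leaf belong to the same class.
- Discretize attributes into non-numeric (categorical) form.
- Initially, all training samples are at the root of the tree.
- Select an attribute (using a statistical measure or a heuristic) to split the training samples into branches.
- Repeat the construction for each branch. The recursion stops when:
  - all samples have been classified (each belongs to a leaf), or
  - no attribute remains that can be used to split the samples further.
Tree pruning: pruning merges a subtree into a single leaf.
Tree evaluation: assess the accuracy of the resulting tree. The criterion is the ratio of the number of correctly classified objects to the total number of objects to be classified.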
TREE CONSTRUCTION
- Use a greedy strategy: split the records on the test attribute that is optimal under some criterion.
Issues to address:
- How to split the records:
  - How should the attribute test condition be specified?
  - How is the best split determined?
- When to stop splitting.
How the test condition is specified depends on the attribute type:
- Nominal
- Ordinal
- Continuous
It also depends on the number of branches in the split:
- 2-way (binary) split
- Multi-way split
Splitting on nominal and ordinal attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the set of values into two subsets; the optimal partition must be found.
Multi-way split: CarType → Family | Sports | Luxury; Size → Small | Medium | Large
Binary split: CarType → {Family, Luxury} | {Sports}, or {Sports, Luxury} | {Family}; Size → {Medium, Large} | {Small}, or {Small, Medium} | {Large}
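Finding the optimal binary partition means examining the candidate ways of dividing the value set in two. The sketch below, a minimal illustration rather than a prescribed method, enumerates every such partition for a nominal attribute; the helper name `binary_partitions` is an assumption introduced here. For k distinct values it yields 2^(k-1) - 1 candidates.

```python
from itertools import combinations


def binary_partitions(values):
    """Yield each way to split a set of nominal values into two non-empty groups."""
    values = sorted(values)
    first, rest = values[0], values[1:]
    # Fixing the first value on the left side avoids listing each split twice.
    for size in range(len(rest)):
        for extra in combinations(rest, size):
            left = {first, *extra}
            right = set(values) - left
            yield left, right


splits = list(binary_partitions({"Family", "Sports", "Luxury"}))
for left, right in splits:
    print(sorted(left), "|", sorted(right))
```

Each candidate would then be scored with an impurity measure, and the one with the best score kept.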
Splitting on continuous attributes
There are different ways to handle them:
- Discretization into an ordinal categorical attribute:
  - Static: discretize once, before processing starts.
  - Dynamic: find consecutive ranges/buckets.
- Binary decision: (A < v) or (A ≥ v). Consider the possible split points to find the best partition.
(i) Binary split: Taxable Income > 80K? → Yes | No
(ii) Multi-way split: Taxable Income? → < 10K | [10K,25K) | [25K,50K) | [50K,80K) | > 80K
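For the binary decision (A < v) or (A ≥ v), one common way to search for v is to try the midpoints between consecutive distinct values and keep the one with the lowest weighted entropy. The sketch below assumes that approach; the income values, the class labels, and the function name `best_threshold` are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Try midpoints between consecutive distinct values.

    Returns (threshold, weighted entropy of the two sides) for the best cut.
    """
    pairs = list(zip(values, labels))
    distinct = sorted(set(values))
    best = None
    for low, high in zip(distinct, distinct[1:]):
        v = (low + high) / 2
        left = [lab for val, lab in pairs if val < v]
        right = [lab for val, lab in pairs if val >= v]
        score = (len(left) * entropy(left) + len(right) * entropy(right)) / len(pairs)
        if best is None or score < best[1]:
            best = (v, score)
    return best


# Hypothetical incomes (in thousands) and class labels, for illustration only.
incomes = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
classes = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_threshold(incomes, classes))
```

Scanning every midpoint costs more than a one-off static discretization, which is the trade-off between the two approaches listed above.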
The best split is determined using the following measures:
- Entropy
- Information gain (Gain)
- GINI index
Entropy of a node t:
$$\mathrm{Entropy}(t) = -\sum_j p(j \mid t)\,\log_2 p(j \mid t)$$

Node with C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1
Entropy = -0 log2 0 - 1 log2 1 = -0 - 0 = 0

Node with C1 = 1, C2 = 5: P(C1) = 1/6, P(C2) = 5/6
Entropy = -(1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65

Node with C1 = 2, C2 = 4: P(C1) = 2/6, P(C2) = 4/6
Entropy = -(2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
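The three node entropies can be checked with a few lines of Python. This is a minimal sketch: the function name `entropy` and the convention of passing per-class counts are choices made here, and the term 0 log 0 is treated as 0 by skipping empty classes.

```python
from math import log2


def entropy(counts):
    """Entropy (base 2) of a node from its per-class counts; 0*log(0) is taken as 0."""
    total = sum(counts)
    # p * log2(1/p) is the same as -p * log2(p), written to avoid a negative zero.
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(entropy(counts), 2))
```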
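Before the split, the node holds N00 records of class C0 and N01 records of class C1, with impurity M0.
- Splitting on A? (Yes/No) gives node N1 (C0: N10, C1: N11; impurity M1) and node N2 (C0: N20, C1: N21; impurity M2); their combined impurity is M12.
- Splitting on B? (Yes/No) gives node N3 (C0: N30, C1: N31; impurity M3) and node N4 (C0: N40, C1: N41; impurity M4); their combined impurity is M34.
Gain = M0 - M12 versus M0 - M34: the test with the larger gain is preferred.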
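The comparison of the two candidate tests can be sketched as follows, assuming entropy as the impurity measure and a size-weighted average as the combined impurity of the children. All class counts below are hypothetical placeholders for N00, N01, N10 and so on.

```python
from math import log2


def entropy(counts):
    """Entropy (base 2) of a node from its per-class counts."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


def gain(parent, children, impurity=entropy):
    """Impurity of the parent minus the size-weighted impurity of its children."""
    n = sum(parent)
    weighted = sum(sum(child) / n * impurity(child) for child in children)
    return impurity(parent) - weighted


# Hypothetical class counts [C0, C1]; they are not taken from the slides.
parent = [10, 10]
split_a = [[8, 2], [2, 8]]    # nodes N1, N2
split_b = [[6, 4], [4, 6]]    # nodes N3, N4
gain_a = gain(parent, split_a)   # M0 - M12
gain_b = gain(parent, split_b)   # M0 - M34
print("A" if gain_a > gain_b else "B")
```

Because `impurity` is a parameter, the same function can be reused with another measure in place of entropy.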
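GINI INDEX
The GINI index of a node t:
$$\mathrm{GINI}(t) = 1 - \sum_j \left[p(j \mid t)\right]^2$$
where p(j | t) is the relative frequency of class j in node t.
- Maximum value: 1 - 1/n_c, reached when the samples are evenly distributed over the n_c classes.
- Minimum value: 0, reached when all samples belong to a single class.
When a node p is split into k branches, the quality of the split is computed as:
$$\mathrm{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\,\mathrm{GINI}(i)$$
where n_i is the number of samples in child node i and n is the number of samples in node p.
The attribute with the smallest GINI_split is chosen for the split.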
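Example: consider a split on a binary attribute A.

Parent node p: C1 = 7, C2 = 3, Gini = 0.42

        N1   N2
C1      3    4
C2      0    3

$$\mathrm{Gini}(N_1) = 1 - (3/3)^2 - (0/3)^2 = 0$$
$$\mathrm{Gini}(N_2) = 1 - (4/7)^2 - (3/7)^2 \approx 0.490$$
$$\mathrm{Gini}_{split} = \tfrac{3}{10} \cdot 0 + \tfrac{7}{10} \cdot 0.490 \approx 0.343$$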
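The same arithmetic can be reproduced in code. The sketch below is one straightforward way to write the two GINI formulas; the function names `gini` and `gini_split` are introduced here for illustration.

```python
def gini(counts):
    """GINI index of a node from its per-class counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


def gini_split(children):
    """Size-weighted GINI of the child nodes produced by a split."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)


parent = [7, 3]            # C1, C2 before the split
n1, n2 = [3, 0], [4, 3]    # C1, C2 in nodes N1 and N2
print(round(gini(parent), 3), round(gini(n1), 3), round(gini(n2), 3))
print(round(gini_split([n1, n2]), 3))
```

The split lowers the index from the parent's value, which is what makes A a useful test in this example.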
WHEN TO STOP SPLITTING
Splitting at a node ends when either of the following holds:
- All records at the node under consideration belong to the same class.
- The measure used as the splitting criterion falls below an agreed threshold.
The ID3 algorithm:
The ID3 algorithm was formulated by Quinlan (University of Sydney, Australia) and published in the late 1970s. It was later presented in the paper "Induction of Decision Trees" (Machine Learning, 1986). ID3 is regarded as an improvement on CLS, with the ability to select the best attribute with which to continue growing the tree at each step.
Input: a set of samples, where each sample/object consists of descriptive attributes and a class value/label.
Output: a decision tree that correctly classifies the samples in the training data and predicts the class of samples/objects that have not yet been classified.
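A compact sketch of the recursive procedure is given below. It follows the input/output description above and assumes information gain as the selection measure; it is an illustrative simplification, not Quinlan's original implementation. The four-sample data set and all function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Entropy (base 2) of a list of class labels."""
    total = len(labels)
    return sum((c / total) * log2(total / c) for c in Counter(labels).values())


def info_gain(rows, attr, target):
    """Entropy of the target minus the weighted entropy after splitting on attr."""
    base = entropy([r[target] for r in rows])
    remainder = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r[target] for r in rows if r[attr] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return base - remainder


def id3(rows, attributes, target):
    """Return a leaf label or a nested dict {attribute: {value: subtree}}."""
    labels = [r[target] for r in rows]
    if len(set(labels)) == 1:          # all samples belong to one class
        return labels[0]
    if not attributes:                 # nothing left to split on: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: info_gain(rows, a, target))
    remaining = [a for a in attributes if a != best]
    return {best: {
        value: id3([r for r in rows if r[best] == value], remaining, target)
        for value in sorted({r[best] for r in rows})
    }}


# Hypothetical four-sample training set, for illustration only.
data = [
    {"wind": "weak", "humidity": "high", "play": "no"},
    {"wind": "strong", "humidity": "high", "play": "no"},
    {"wind": "weak", "humidity": "normal", "play": "yes"},
    {"wind": "strong", "humidity": "normal", "play": "yes"},
]
print(id3(data, ["wind", "humidity"], "play"))
```

The sketch handles only categorical attributes and complete data; continuous values, missing values, and pruning are left out.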
Example: consider the problem of deciding whether or not to play tennis under given weather conditions. The ID3 algorithm learns a decision tree from the following set of samples:
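Tree construction process
Compute Entropy(S): |S| = 14; m = 2; C1 = Yes, C2 = No; |C1| = s1 = 9, |C2| = s2 = 5.
$$\mathrm{Entropy}(S) = I(s_1, s_2) = -\tfrac{9}{14}\log_2\tfrac{9}{14} - \tfrac{5}{14}\log_2\tfrac{5}{14} = 0.940$$

Class counts by Outlook:

Outlook   Sunny   Overcast   Rain
Yes       2       4          3
No        3       0          2
Total     5       4          5

$$\mathrm{Entropy}(S_{Sunny}) = -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971$$
$$\mathrm{Entropy}(S_{Overcast}) = -\tfrac{4}{4}\log_2\tfrac{4}{4} - \tfrac{0}{4}\log_2\tfrac{0}{4} = 0$$
$$\mathrm{Entropy}(S_{Rain}) = -\tfrac{3}{5}\log_2\tfrac{3}{5} - \tfrac{2}{5}\log_2\tfrac{2}{5} = 0.971$$
$$\mathrm{Gain}(S, Outlook) = \mathrm{Entropy}(S) - \tfrac{5}{14}\mathrm{Entropy}(S_{Sunny}) - \tfrac{4}{14}\mathrm{Entropy}(S_{Overcast}) - \tfrac{5}{14}\mathrm{Entropy}(S_{Rain})$$
$$= 0.940 - \tfrac{5}{14}(0.971) - \tfrac{4}{14}(0) - \tfrac{5}{14}(0.971) \approx 0.247$$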
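Class counts by Temperature:

Temperature   Cool   Mild   Hot
Yes           3      4      2
No            1      2      2
Total         4      6      4

$$\mathrm{Entropy}(S_{Cool}) = -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} = 0.811$$
$$\mathrm{Entropy}(S_{Mild}) = -\tfrac{4}{6}\log_2\tfrac{4}{6} - \tfrac{2}{6}\log_2\tfrac{2}{6} = 0.918$$
$$\mathrm{Entropy}(S_{Hot}) = -\tfrac{2}{4}\log_2\tfrac{2}{4} - \tfrac{2}{4}\log_2\tfrac{2}{4} = 1$$
$$\mathrm{Gain}(S, Temperature) = \mathrm{Entropy}(S) - \tfrac{4}{14}\mathrm{Entropy}(S_{Cool}) - \tfrac{6}{14}\mathrm{Entropy}(S_{Mild}) - \tfrac{4}{14}\mathrm{Entropy}(S_{Hot})$$
$$= 0.940 - \tfrac{4}{14}(0.811) - \tfrac{6}{14}(0.918) - \tfrac{4}{14}(1) = 0.029$$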
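Class counts by Humidity:

Humidity   High   Normal
Yes        3      6
No         4      1
Total      7      7

$$\mathrm{Entropy}(S_{High}) = -\tfrac{3}{7}\log_2\tfrac{3}{7} - \tfrac{4}{7}\log_2\tfrac{4}{7} = 0.985$$
$$\mathrm{Entropy}(S_{Normal}) = -\tfrac{6}{7}\log_2\tfrac{6}{7} - \tfrac{1}{7}\log_2\tfrac{1}{7} = 0.592$$
$$\mathrm{Gain}(S, Humidity) = \mathrm{Entropy}(S) - \tfrac{7}{14}\mathrm{Entropy}(S_{High}) - \tfrac{7}{14}\mathrm{Entropy}(S_{Normal})$$
$$= 0.940 - \tfrac{7}{14}(0.985) - \tfrac{7}{14}(0.592) \approx 0.151$$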
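Class counts by Wind:

Wind    Weak   Strong
Yes     6      3
No      2      3
Total   8      6

$$\mathrm{Entropy}(S_{Weak}) = -\tfrac{6}{8}\log_2\tfrac{6}{8} - \tfrac{2}{8}\log_2\tfrac{2}{8} = 0.811$$
$$\mathrm{Entropy}(S_{Strong}) = -\tfrac{3}{6}\log_2\tfrac{3}{6} - \tfrac{3}{6}\log_2\tfrac{3}{6} = 1$$
$$\mathrm{Gain}(S, Wind) = \mathrm{Entropy}(S) - \tfrac{8}{14}\mathrm{Entropy}(S_{Weak}) - \tfrac{6}{14}\mathrm{Entropy}(S_{Strong})$$
$$= 0.940 - \tfrac{8}{14}(0.811) - \tfrac{6}{14}(1) = 0.048$$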
Homogeneity of the data and the gain for each attribute:
Entropy(S) = 0.940
Gain(S, Outlook) = 0.247
Gain(S, Temperature) = 0.029
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Outlook has the largest gain, so the tree splits on this attribute.
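The four gains can be recomputed directly from the Yes/No counts listed for each attribute in the worked example. The sketch below does this and selects the attribute with the largest gain; the English attribute names and the helper names are choices made here, and the values it computes are unrounded, so the last digit can differ slightly from the hand calculation.

```python
from math import log2


def entropy(counts):
    """Entropy (base 2) of a node from its per-class counts."""
    total = sum(counts)
    return sum((c / total) * log2(total / c) for c in counts if c > 0)


def gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of its children."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)


# [Yes, No] counts taken from the worked example.
S = [9, 5]
partitions = {
    "Outlook": [[2, 3], [4, 0], [3, 2]],       # Sunny, Overcast, Rain
    "Temperature": [[3, 1], [4, 2], [2, 2]],   # Cool, Mild, Hot
    "Humidity": [[3, 4], [6, 1]],              # High, Normal
    "Wind": [[6, 2], [3, 3]],                  # Weak, Strong
}
gains = {name: gain(S, parts) for name, parts in partitions.items()}
best = max(gains, key=gains.get)
for name, value in gains.items():
    print(f"{name}: {value:.4f}")
print("split on:", best)
```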
Splitting on the Outlook attribute
...
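Similarly, consider the Sunny branch.
Compute the entropy and gain for the attributes Humidity, Temperature and Wind:
Gain(S_Sunny, Humidity) = 0.970; Gain(S_Sunny, Temperature) = 0.570; Gain(S_Sunny, Wind) = 0.019
Humidity has the largest gain, so the Sunny branch splits on Humidity.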
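The final decision tree has the form:

Outlook
  Sunny: Humidity
    High: No
    Normal: Yes
  Overcast: Yes
  Rain: Wind
    Strong: No
    Weak: Yes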
Example: consider the following training set:
Split on the attribute Thu
RULE EXTRACTION FROM A DECISION TREE
A decision tree model and a rule-based (IF ... THEN) model can be converted into one another according to the following principles:
- Each rule is produced from one path from the root to a leaf.
- Each attribute-value pair along the path contributes one term of a conjunction (AND).
- The leaf carries the class label.
Example: rules extracted from the decision tree for the weather/tennis data:
R1: IF (Outlook = Sunny) AND (Humidity = High) THEN Play Tennis = No
R2: IF (Outlook = Sunny) AND (Humidity = Normal) THEN Play Tennis = Yes
R3: IF (Outlook = Overcast) THEN Play Tennis = Yes
R4: IF (Outlook = Rain) AND (Wind = Strong) THEN Play Tennis = No
R5: IF (Outlook = Rain) AND (Wind = Weak) THEN Play Tennis = Yes
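The path-to-rule conversion can be sketched as a short recursive walk. The tree below is written to be equivalent to rules R1 to R5, using English names for the weather attributes; the nested-dict encoding and the function name `extract_rules` are assumptions made for this illustration.

```python
# Tree equivalent to rules R1-R5; a leaf is the class label for "Play Tennis".
tree = {"Outlook": {
    "Sunny": {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain": {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}


def extract_rules(node, conditions=()):
    """Return one (conditions, label) pair per root-to-leaf path."""
    if not isinstance(node, dict):
        return [(list(conditions), node)]
    # Each internal node is a one-entry dict: {attribute: {value: subtree}}.
    (attribute, branches), = node.items()
    rules = []
    for value, child in branches.items():
        rules.extend(extract_rules(child, conditions + ((attribute, value),)))
    return rules


for number, (conds, label) in enumerate(extract_rules(tree), start=1):
    body = " AND ".join(f"({a} = {v})" for a, v in conds)
    print(f"R{number}: IF {body} THEN Play Tennis = {label}")
```

One rule is produced per leaf, so a tree with five leaves yields exactly five rules.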
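USING A DECISION TREE TO PREDICT THE CLASS OF NEW DATA
The class of an object is predicted by traversing the tree or by applying the extracted rule set.
The branch of the tree, or the rule, whose conditions cover the largest share of the object's attribute values is used as the basis for the prediction.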
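Rule-based prediction can be sketched as follows. The rule list restates the five weather/tennis rules in English; among the rules whose conditions all hold for the object, the one with the most conditions is used. The function name `predict`, the tie-breaking behaviour, and the sample day are assumptions made for this illustration.

```python
# The five weather/tennis rules, written as (conditions, label).
RULES = [
    ({"Outlook": "Sunny", "Humidity": "High"}, "No"),
    ({"Outlook": "Sunny", "Humidity": "Normal"}, "Yes"),
    ({"Outlook": "Overcast"}, "Yes"),
    ({"Outlook": "Rain", "Wind": "Strong"}, "No"),
    ({"Outlook": "Rain", "Wind": "Weak"}, "Yes"),
]


def predict(sample, rules=RULES):
    """Return the label of the matching rule with the most conditions, or None."""
    matching = [
        (len(conds), label)
        for conds, label in rules
        if all(sample.get(attr) == value for attr, value in conds.items())
    ]
    if not matching:
        return None
    return max(matching, key=lambda pair: pair[0])[1]


# Hypothetical new day to classify.
new_day = {"Outlook": "Rain", "Temperature": "Mild", "Humidity": "High", "Wind": "Weak"}
print(predict(new_day))  # Yes
```

Rules extracted from a single tree are mutually exclusive, so at most one of them matches; the "most conditions" choice matters only for rule sets that overlap.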
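CONCLUSIONS AND REMARKS
Decision trees are easy to understand and classify with high performance. Limitations:
- No backtracking
- Two-way ("Hai chiều")
The ID3 algorithm has difficulty handling continuous-valued data, missing data, and noisy data.
The Entropy, Gain and GINI measures are used to choose the splitting attribute of the decision tree.
A decision tree is usually evaluated as follows:
- Divide the sample set into two sets: a training set used to build the classifier (70-75% of the size of the original data) and an evaluation set.
- Accuracy is the ratio of the number of correctly classified objects to the total number of objects classified.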
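The evaluation procedure can be sketched in a few lines. The 70% training fraction is one value from the range quoted above; the function names, the fixed random seed, and the toy label lists are assumptions made for this illustration.

```python
import random


def holdout_split(samples, train_fraction=0.7, seed=0):
    """Shuffle a copy of the samples and cut it into training and evaluation sets."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(predicted, actual):
    """Fraction of objects whose predicted class equals the true class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


training, evaluation = holdout_split(range(20))
print(len(training), len(evaluation))  # 14 6
print(accuracy(["Yes", "No", "Yes", "No"], ["Yes", "No", "No", "No"]))  # 0.75
```

Accuracy is measured on the evaluation set only, since the tree was built to fit the training set.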
FURTHER READING
- The Top Ten Algorithms in Data Mining, Xindong Wu, Vipin Kumar
- Principles of Data Mining, Max Bramer
- Slides/Lecture Notes for Chapter 4: www.cse.msu.edu/~ptan/
Open-source data mining software:
- Weka: www.cs.waikato.ac.nz/ml/weka/
- DBMiner: http://db.cs.sfu.ca/DBMiner/index.html
- TANAGRA: http://eric.univ-lyon2.fr/~ricco/tanagra/en/tanagra.html