Data Classification
VIETNAM NATIONAL UNIVERSITY, HANOI
UNIVERSITY OF TECHNOLOGY (TRƯỜNG ĐẠI HỌC CÔNG NGHỆ)
Nguyễn Thị Thùy Linh
A STUDY OF DATA CLASSIFICATION ALGORITHMS
BASED ON DECISION TREES
UNDERGRADUATE GRADUATION THESIS, FULL-TIME PROGRAM
Major: Information Technology
HANOI - 2005
Supervisor: Dr. Nguyễn Hải Châu
ABSTRACT
Data classification is one of the main research directions in data mining. The technology has had, has, and will continue to have many applications in commerce, banking, healthcare, education and other fields. Among the classification models that have been proposed, the decision tree is regarded as a powerful and popular tool that is particularly well suited to data mining applications. The classification algorithm is the central element of a classification model.
This thesis studies data classification based on decision trees. It then focuses on analyzing, evaluating and comparing two algorithms that are representative of two different application ranges: C4.5 and SPRINT. Each has its own strategy for selecting the splitting attribute and for storing and partitioning the data, among other characteristics. C4.5 is the most popular algorithm for classifying small and medium-sized data sets, while SPRINT is a representative algorithm for extremely large data sets. The thesis ran experiments with the C4.5 classification model on a real data set, obtained a number of classification results of high practical value, and evaluated the performance of the model. Based on the theoretical study and the experiments, the thesis proposes several improvements to the C4.5 classification model and takes steps toward an implementation of SPRINT.
ACKNOWLEDGEMENTS
Throughout my studies and the completion of this thesis, I have been fortunate to receive the guidance of my teachers and the care and encouragement of my family and friends.
I would like to express my sincere gratitude to the teachers of the University of Technology (Trường Đại học Công Nghệ), who gave me invaluable knowledge as well as a way of studying and doing scientific research.
Allow me to send my deepest thanks to Dr. Nguyễn Hải Châu, who enthusiastically advised and guided me throughout the work on this thesis.
With all my heart, I would like to express my deep gratitude to Dr. Hà Quang Thụy, who created favorable conditions for me and gave me research directions. I would like to thank PhD student Đoàn Sơn (JAIST), who provided materials and valuable advice. I would also like to thank the teachers of the Department of Information Systems, Faculty of Information Technology, who helped me obtain a favorable experimental environment.
I also send my sincere thanks to the members of the seminar group "Data Mining and Parallel Computing" for their contributions and for the valuable knowledge I gained during my time in scientific research.
Finally, I thank my family, my friends and class K46CA, who were always by my side with encouragement and support.
Hanoi, June 2005
Student
Nguyễn Thị Thùy Linh
TABLE OF CONTENTS
ABSTRACT
ACKNOWLEDGEMENTS
TABLE OF CONTENTS
LIST OF FIGURES, TABLES AND CHARTS
GLOSSARY
INTRODUCTION
Chapter 1. OVERVIEW OF DECISION-TREE-BASED DATA CLASSIFICATION
1.1. Overview of data classification in data mining
1.1.1. Data classification
1.1.2. Issues related to data classification
1.1.3. Methods for evaluating the accuracy of a classification model
1.2. Decision trees applied to data classification
1.2.1. Definition
1.2.2. Issues in data mining with decision trees
1.2.3. Evaluation of decision trees in data mining
1.2.4. Building a decision tree
1.3. Decision tree construction algorithms
1.3.1. General idea
1.3.2. Current state of research on the algorithms
1.3.3. Parallelizing sequential decision-tree classification algorithms
Chapter 2. C4.5 AND SPRINT
2.1. General introduction
2.2. The C4.5 algorithm
2.2.1. C4.5 uses Gain-entropy as the measure for selecting the "best" attribute
2.2.2. C4.5 has its own way of handling missing values
2.2.3. Avoiding overfitting
2.2.4. Converting a decision tree into rules
2.2.5. C4.5 is an efficient algorithm for small and medium-sized data sets
2.3. The SPRINT algorithm
2.3.1. Data structures in SPRINT
2.3.2. SPRINT uses the Gini index as the measure for finding the "best" split point of a data set
2.3.3. Performing the split
2.3.4. SPRINT is an efficient algorithm for data sets that are too large for other algorithms
2.4. Comparison of C4.5 and SPRINT
Chapter 3. EXPERIMENTAL RESULTS
3.1. Experimental environment
3.2. Structure of the C4.5 release 8 classification model
3.2.1. The C4.5 classification model has 4 main programs
3.2.2. Data structures used in C4.5
3.3. Experimental results
3.3.1. Some typical classification results
3.3.2. Performance charts
3.4. Some proposed improvements to the C4.5 classification model
CONCLUSION
REFERENCES
LIST OF FIGURES, TABLES AND CHARTS
Figure 1 - The data classification process - (a) Building the classification model
Figure 2 - The data classification process - (b1) Estimating the accuracy of the model
Figure 3 - The data classification process - (b2) Classifying new data
Figure 4 - Estimating the accuracy of a classification model with the holdout method
Figure 5 - Example of a decision tree
Figure 6 - Pseudocode of the decision-tree-based data classification algorithm
Figure 7 - Diagram of decision tree construction with the synchronous approach
Figure 8 - Diagram of decision tree construction with the partitioned approach
Figure 9 - Diagram of decision tree construction with the hybrid approach
Figure 10 - Pseudocode of the C4.5 algorithm
Figure 11 - Pseudocode of the SPRINT algorithm
Figure 12 - Data structures in SLIQ
Figure 13 - Structure of attribute lists in SPRINT: the lists of continuous attributes are sorted as soon as they are created
Figure 14 - Evaluating split points for a continuous attribute
Figure 15 - Evaluating split points for a categorical attribute
Figure 16 - Splitting the attribute lists of a node
Figure 17 - Structure of the hash table used to split data in SPRINT (following the example of the previous figures)
Figure 18 - File defining the data structure used in the experiments
Figure 19 - File containing the data to be classified
Figure 20 - Form of the decision tree generated from the experimental data set
Figure 21 - Evaluation of the generated decision tree on the training data set and the test data set
Figure 22 - Some rules extracted from the 19-attribute data set, classified by the user's interface mode setting (WEB_SETTING_ID)
Figure 23 - Some rules extracted from the 8-attribute data set, classified by phone manufacturer ID (PRODUCTER_ID)
Figure 24 - Some rules generated from the 8-attribute data set, classified by the phone service used by the customer (MOBILE_SERVICE_ID)
Figure 25 - Evaluation of the rule set on the training data set
Table 1 - Training data table with class attribute buys_computer
Table 2 - Time to build the decision tree and the production rule set as a function of training set size, 2 attributes
Table 3 - Time to build the decision tree and the production rule set as a function of training set size, 7 attributes
Table 4 - Time to build the decision tree and the production rule set as a function of training set size, 18 attributes
Table 5 - Decision tree generation time as a function of the number of attributes
Table 6 - Decision tree construction time with categorical and continuous attributes
Table 7 - Decision tree generation time as a function of the number of class values
Chart 1 - Execution time of the SPRINT and SLIQ classification models as a function of training set size
Chart 2 - Time to build the decision tree and the production rule set as a function of training set size, 2 attributes
Chart 3 - Time to build the decision tree and the production rule set as a function of training set size, 7 attributes
Chart 4 - Time to build the decision tree and the production rule set as a function of training set size, 18 attributes
Chart 5 - Dependence of decision tree generation time on the number of attributes
Chart 6 - Decision tree construction time from a set of continuous attributes and from a set of categorical attributes
Chart 7 - Decision tree generation time as a function of the number of class values
GLOSSARY
No.  English                 Vietnamese
1    training data           dữ liệu đào tạo
2    test data               dữ liệu kiểm tra
3    pruning decision tree   cắt, tỉa cây quyết định
4    overfitting data        quá vừa dữ liệu
5    noise                   dữ liệu lỗi
6    missing value           giá trị thiếu
7    data tuple              phần tử dữ liệu
8    case                    case (understood as a data tuple: it holds one set of values of the attributes in the data set)
A Study of Data Classification Algorithms Based on Decision Trees
Graduation thesis, Nguyễn Thị Thùy Linh, K46CA
INTRODUCTION
In the course of their activities, people generate large amounts of business data. The accumulated data sets keep growing and may contain a great deal of hidden information in the form of undiscovered regularities. This creates a need to extract, from such data sets, rules for classifying data or for predicting future data trends. The resulting intelligent business rules effectively support practical activities as well as scientific research. Data classification and prediction technology emerged to meet this need.
Data classification technology has developed, is developing and will continue to develop strongly in response to the human desire for knowledge. In recent years, data classification has attracted the attention of researchers in many fields, such as machine learning, expert systems and statistics. The technology is also applied in many practical areas, including commerce, banking, marketing, market research, insurance, healthcare and education.
Many classification techniques have been proposed, such as decision tree classification, the Bayesian classifier, the K-nearest neighbor classifier, neural networks and statistical analysis. Among these techniques, the decision tree is regarded as a powerful and popular tool that is particularly well suited to data mining [5][7]. In a classification model, the classification algorithm is the key element. It is therefore necessary to build algorithms that are accurate and fast, and that scale so they can handle ever larger data sets.
This thesis surveys data classification technology in general and decision-tree-based data classification in particular. It then focuses on two algorithms that are representative of two different application ranges: C4.5 and SPRINT. Analyzing and evaluating these algorithms has both scientific value and practical significance. Studying them helps us absorb, and possibly develop further, the ideas and techniques of an advanced technology that has been and remains a challenge for researchers in data mining. From there, data classification models can be implemented and tested in practice, with a view to applying them in practical activities in Vietnam, first of all in customer market analysis and research.
The thesis also ran experiments with the C4.5 classification model on a real data set from the Vietnam Posts and Telecommunications Corporation. This provided experience with the techniques needed to deploy and apply a data classification model in practice. The experiments produced promising classification results with high reliability and considerable application potential. The performance of the classification model was also evaluated. On that basis, the thesis proposes improvements that aim to increase the performance of the C4.5 classification model and to add conveniences for the user.
The thesis consists of 3 main chapters:
Chapter 1 moves from an overview of data classification technology to the technique of decision-tree-based classification. Evaluations of the decision tree as a tool are also presented. The chapter also gives an overview of research on decision-tree-based classification algorithms: the underlying ideas, the state of research and current directions of development.
Chapter 2 focuses on two algorithms that are representative of two different application ranges, C4.5 and SPRINT. The two algorithms have their own strategies for choosing the splitting criterion and for storing and partitioning the data. Because of these characteristics, C4.5 is the most popular representative algorithm for small and medium-sized data sets, while SPRINT is the choice for extremely large data sets.
Chapter 3 presents the experiments with the C4.5 classification model on a real data set from the Vietnam Posts and Telecommunications Corporation and reports the experimental results. From these, the thesis proposes improvements to the C4.5 classification model.
Chapter 1. OVERVIEW OF DECISION-TREE-BASED DATA CLASSIFICATION
1.1. Overview of data classification in data mining
1.1.1. Data classification
Today, data classification is one of the main research directions in data mining. In practice there is a need to extract intelligent business decisions from databases that contain a great deal of hidden information. Classification and prediction are two forms of data analysis that extract a model describing important data classes or predicting future data trends. Classification predicts categorical labels or discrete values, which means that classification works with data objects whose set of possible values is known in advance. Prediction, in contrast, builds models with continuous-valued functions. For example, a weather classification model can tell whether tomorrow will be rainy or sunny based on the humidity, wind strength, temperature and other parameters of today and the preceding days. Likewise, using rules about the purchasing trends of supermarket customers, sales staff can make sound decisions about the quantity and the range of goods on display. A prediction model can estimate how much potential customers will spend based on information about their income and occupation. In recent years, data classification has attracted the attention of researchers in many fields, such as machine learning, expert systems and statistics. The technology is also applied in many areas, including commerce, banking, marketing, market research, insurance, healthcare and education. Most early algorithms keep the data resident in memory and usually handle small amounts of data. Some later algorithms use disk-resident techniques, which considerably improve scalability to large data sets of up to billions of records.
The data classification process consists of two steps [14]:
• Step 1 (learning)
The learning process builds a model that describes a predetermined set of data classes or concepts. The input of this process is a structured data set described by attributes and made up of sets of values of those attributes. Each set of values is generically called a data tuple; it may also be called a sample, example, object, record or case. This thesis uses these terms interchangeably. Each data tuple in the data set is assumed to belong to a predefined class, where the class is the value of one attribute chosen as the class label attribute. The output of this step is usually a set of classification rules in the form of if-then rules, a decision tree, logic formulas or a neural network. This process is illustrated in Figure 1.
Figure 1 - The data classification process - (a) Building the classification model
Training data:
Age  Car Type  Risk
20   Combi     High
18   Sports    High
40   Sports    High
50   Family    Low
35   Minivan   Low
30   Combi     High
32   Family    Low
40   Combi     Low
Training data -> Classification algorithm -> Classifier (model):
if Age < 31 or Car Type = Sports then Risk = High
• Step 2 (classification)
The second step uses the model built in the previous step to classify new data. First, the predictive accuracy of the newly created classification model is estimated. Holdout is a simple technique for estimating this accuracy. It uses a test data set whose samples have already been assigned class labels. These samples are chosen at random and are independent of the samples in the training data set. The accuracy of the model on a given test data set is the percentage of test samples that the model classifies correctly (compared with their actual labels). If the accuracy is estimated on the training data set, the result is overly optimistic, because the model always tends to overfit the data. Overfitting is the phenomenon in which the classification results match the training data too closely, because the process of building the model from the training data set may have incorporated characteristics that are particular to that data set. An independent test data set is therefore needed. If the accuracy of the model is acceptable, the model is used to classify future data, or data for which the value of the class attribute is unknown.
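As a minimal sketch of the holdout estimate described above, the following Python snippet applies the rule shown in Figure 1 to the four labeled test tuples shown in Figure 2 and computes the fraction classified correctly. The function and variable names are illustrative assumptions and do not come from the thesis.

```python
def classify(age, car_type):
    """Rule from Figure 1: High risk if age < 31 or the car is a sports car."""
    return "High" if age < 31 or car_type == "Sports" else "Low"

# Test tuples (age, car type, actual risk) as listed in Figure 2.
test_data = [
    (27, "Sports", "High"),
    (34, "Family", "Low"),
    (66, "Family", "High"),
    (44, "Sports", "High"),
]

def holdout_accuracy(samples):
    """Fraction of labeled test samples that the classifier predicts correctly."""
    correct = sum(1 for age, car, actual in samples if classify(age, car) == actual)
    return correct / len(samples)

predictions = [classify(age, car) for age, car, _ in test_data]
accuracy = holdout_accuracy(test_data)
print(predictions, accuracy)  # three of the four test tuples match, so 0.75
```

The third tuple (age 66, Family, actual risk High) is the only one the rule misclassifies, which is why the estimate on this small test set is 75%.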
Figure 2 - The data classification process - (b1) Estimating the accuracy of the model
Test data and classifier output:
Age  Car Type  Risk (actual)  Risk (classifier)
27   Sports    High           High
34   Family    Low            Low
66   Family    High           Low
44   Sports    High           High
Figure 3 - The data classification process - (b2) Classifying new data
New data and classifier output:
Age  Car Type  Risk (classifier)
27   Sports    High
34   Minivan   Low
55   Family    Low
34   Sports    High
In a classification model, the classification algorithm plays the central role and determines the success of the model. The key to the data classification problem is therefore to find a classification algorithm that is fast, efficient, accurate and scalable. Of these properties, scalability has received particular attention and development effort [14].
The classification techniques that have been used in recent years include:
• Decision tree classification
• Bayesian classifier
• K-nearest neighbor classifier
• Neural networks
• Statistical analysis
• Genetic algorithms
• Rough set approach
1.1.2. Issues related to data classification
1.1.2.1. Preparing data for classification
Preprocessing the data is an indispensable part of the classification process and plays an important role in determining whether a classification model can be applied at all. Preprocessing helps improve the accuracy, efficiency and scalability of the classification model.
Data preprocessing consists of the following tasks:
• Data cleaning
Data cleaning deals with noise and missing values in the original data set. Noise consists of random errors or invalid values of variables in the data set; smoothing techniques can be used to handle it. Missing values are cells in which an attribute has no value. A value may be missing because of a mistake during data entry, or because in that particular case the attribute has no value or is not important. One way to handle a missing value is to replace it with the most common value of the attribute, or with the most probable value based on statistics. Although most classification algorithms have mechanisms for handling missing values and noise, this preprocessing step can reduce confusion during learning (building the classification model).
• Relevance analysis
Many attributes in a data set may be entirely unnecessary for, or irrelevant to, a particular classification problem. For example, the day of the week is not needed by an application that analyzes the risk of bank loans, so this attribute is redundant. Relevance analysis removes unnecessary and redundant attributes from the learning process, because such attributes slow learning down, complicate it and mislead it, resulting in an unusable classification model.
• Data transformation
Generalizing the data to higher-level concepts is sometimes necessary during preprocessing. This is especially useful for continuous (numeric) attributes. For example, the numeric values of a customer's income attribute can be generalized into discrete ranges: low, medium, high. Similarly, a categorical attribute such as street address can be generalized to city. Generalization condenses the original training data, so fewer input/output operations are needed during learning.
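Two of the preprocessing tasks above can be sketched in a few lines of Python: replacing missing values with the most common value of an attribute, and generalizing a numeric income into low / medium / high. This is only an illustration; the record layout, the attribute names and the income cut points are hypothetical and are not taken from the thesis.

```python
from collections import Counter

def fill_missing_with_mode(records, attribute):
    """Replace missing (None) values of `attribute` with its most common value."""
    present = [r[attribute] for r in records if r[attribute] is not None]
    mode = Counter(present).most_common(1)[0][0]
    return [{**r, attribute: mode if r[attribute] is None else r[attribute]}
            for r in records]

def generalize_income(income, low_limit=5_000, high_limit=15_000):
    """Map a numeric income to a discrete range (the cut points are hypothetical)."""
    if income < low_limit:
        return "low"
    if income < high_limit:
        return "medium"
    return "high"

# Hypothetical records; None marks a missing value.
records = [
    {"car_type": "Sports", "income": 3_000},
    {"car_type": None, "income": 9_000},
    {"car_type": "Family", "income": 20_000},
    {"car_type": "Family", "income": 12_000},
]

cleaned = fill_missing_with_mode(records, "car_type")
generalized = [{**r, "income": generalize_income(r["income"])} for r in cleaned]
print(generalized)
```

Both functions return new records instead of modifying the input, so the original data set stays available for comparison.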
1.1.2.2. Comparing classification models
Each application requires an appropriate classification model. The choice is based on comparing classification models with one another according to the following criteria:
• Predictive accuracy
The ability of the model to correctly predict the class label of new or previously unseen data.
• Speed
The computational cost of building and using the model.
• Robustness
The ability of the model to make correct predictions from noisy data or data with missing values.
• Scalability
The ability of the learned model to run efficiently on large amounts of data.
• Interpretability
The degree to which the results produced by the learned model can be understood.
• Simplicity
This relates to the size of the decision tree or the compactness of the rules.
Among these criteria, the scalability of the classification model is emphasized and receives particular development effort, especially for decision trees [14].
1.1.3. Methods for evaluating the accuracy of a classification model
Estimating the accuracy of a classifier is important because it indicates how accurately future data will be classified. Accuracy also makes it possible to compare different classification models. This thesis covers two common evaluation methods, holdout and k-fold cross-validation. Both are based on random partitions of the original data set.
• In the holdout method, the given data is randomly divided into 2 parts: a training data set and a test data set. Typically 2/3 of the data is allocated to the training data set and the rest to the test data set [14].
Figure 4 - Estimating the accuracy of a classification model with the holdout method
[Figure: the data is split into a training set and a test set; the training set is used to derive the classifier and the test set is used to estimate its accuracy.]
• In the k-fold cross-validation method, the original data set is randomly divided into k subsets (folds) of approximately equal size S1, S2, ..., Sk. Learning and testing are performed k times. In iteration i, Si is the test data set and the remaining subsets together form the training data set. That is, training is first performed on S2, S3, ..., Sk and testing on S1; then training is performed on S1, S3, S4, ..., Sk and testing on S2; and so on. The accuracy is the total number of correct classifications over the k iterations divided by the total number of samples in the original data set.
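The k-fold procedure can be sketched as follows. This is a minimal illustration under stated assumptions: the function names, the fixed random seed and the placeholder learner `train_majority` (which always predicts the most common class of its training set) are all hypothetical and stand in for a real classification algorithm.

```python
import random
from collections import Counter

def k_fold_accuracy(samples, k, train, seed=0):
    """samples: list of (features, label) pairs.
    train(training_set) must return a function predict(features) -> label."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k folds of nearly equal size
    correct = 0
    for i in range(k):
        test_fold = folds[i]
        # All folds except fold i form the training set for this iteration.
        training = [s for j, fold in enumerate(folds) if j != i for s in fold]
        predict = train(training)
        correct += sum(1 for x, y in test_fold if predict(x) == y)
    # Total correct classifications over the k iterations / total number of samples.
    return correct / len(samples)

def train_majority(training):
    """Placeholder learner (an assumption of this sketch): predict the majority class."""
    majority = Counter(label for _, label in training).most_common(1)[0][0]
    return lambda features: majority

# Hypothetical data set: 9 samples labeled Low and 3 labeled High.
samples = [((i,), "Low") for i in range(9)] + [((i,), "High") for i in range(3)]
print(k_fold_accuracy(samples, k=4, train=train_majority))
```

Because every sample is used for testing exactly once, the final division by `len(samples)` matches the definition above: total correct classifications over the k iterations divided by the size of the original data set.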
1.2. Decision trees applied to data classification
1.2.1. Definition
In recent years, researchers in many fields have proposed numerous data classification models, such as neural networks, linear and quadratic statistical models, decision trees and genetic models. Among these, the decision tree is regarded as a powerful and popular tool that is particularly well suited to data mining in general and data classification in particular [7]. Its advantages include relatively fast construction, simplicity and ease of understanding. Moreover, trees can easily be converted into SQL statements that can be used to access databases efficiently. Finally, decision-tree-based classification achieves accuracy that is similar to, and sometimes better than, other classification methods [10].
A decision tree is a flowchart-like tree structure, as shown in the following figure:
Figure 5 - Example of a decision tree
Age
    Age ≤ 27.5: Risk = High
    Age > 27.5: Car type
        Car type ∈ {sport}: Risk = High
        Car type ∈ {family, truck}: Risk = Low
In a decision tree:
• Root: the topmost node of the tree
• Internal node: represents a test on a single attribute (rectangle)
• Branch: represents an outcome of the test at an internal node (arrow)
• Leaf node: represents a class or a class distribution (circle)
To classify an unknown data sample, the attribute values of the sample are tested against the decision tree. Each sample corresponds to one path from the root to a leaf, and the leaf gives the predicted class value for that sample.
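The root-to-leaf traversal can be sketched in Python using the tree of Figure 5. The tuple-based tree encoding and the names below are assumptions made for this illustration. The comparison at the root is read as Age ≤ 27.5, which is also an assumption, since the comparison symbol did not survive in the extracted figure.

```python
# Tree of Figure 5. A leaf is ("leaf", label); an internal node is
# (attribute, test, subtree_if_true, subtree_if_false).
FIGURE_5_TREE = (
    "age", lambda v: v <= 27.5,          # root test (symbol assumed to be <=)
    ("leaf", "High"),
    ("car_type", lambda v: v in {"sport"},
     ("leaf", "High"),
     ("leaf", "Low")),                   # car types family and truck
)

def classify(tree, sample):
    """Follow one path from the root to a leaf; return (label, path taken)."""
    path = []
    node = tree
    while node[0] != "leaf":
        attribute, test, if_true, if_false = node
        outcome = test(sample[attribute])
        path.append((attribute, outcome))
        node = if_true if outcome else if_false
    return node[1], path

label, path = classify(FIGURE_5_TREE, {"age": 40, "car_type": "family"})
print(label, path)
```

Each sample triggers at most one test per level, so classification cost is bounded by the depth of the tree rather than by the size of the training data.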
1.2.2. Issues in data mining with decision trees
Issues specific to learning or classifying data with decision trees include: determining how deep to grow the tree, handling continuous attributes, choosing an appropriate attribute selection measure, using training data with missing attribute values, using attributes with different costs, and improving computational performance. The following sections cover the main issues that have been addressed in decision-tree-based classification algorithms.
1.2.2.1. Avoiding overfitting
What is overfitting? It can be understood as the phenomenon in which the decision tree captures characteristics that are particular to the training data set. If the training data itself is used to test the classification model, the accuracy is very high, whereas the same tree does not achieve that accuracy on other, future data.
Overfitting is a significant difficulty for decision tree learning and for other learning methods, especially when the training data set contains too few examples or when the data is noisy.
There are two approaches to avoiding overfitting in decision trees:
• Stop growing the tree earlier than usual, before it reaches the point where it perfectly classifies the training data set. The challenge with this approach is to estimate precisely when to stop growing the tree.
• Allow the tree to overfit the data, then prune it.
Although the first approach seems more direct, experiments have shown that trees produced by the second approach are more successful in practice. Pruning also helps the decision tree generalize and improves the accuracy of the classification model. Whichever approach is used, the key question is which criterion is used to determine the appropriate size of the final tree.
1.2.2.2. Handling continuous attributes
Handling continuous attributes in a decision tree is not at all as simple as handling categorical attributes.
A categorical attribute has a domain that is known in advance and consists of discrete values. For example, vehicle type is a categorical attribute with the domain {truck, bus, car, taxi}. The data is split by testing whether the value of the chosen categorical attribute in a given example belongs to a subset of the attribute's domain: value(A) ∈ X with X ⊂ domain(A). This is a simple logical test that requires few computational resources. For a continuous (numeric) attribute, by contrast, the domain is not known in advance. During tree growth, a binary test of the form value(A) ≤ θ is therefore used, where θ is a threshold constant determined in turn from each distinct value, or each pair of adjacent values (in sorted order), of the continuous attribute under consideration in the training data set. This means that if the continuous attribute A has d distinct values in the training data set, d-1 tests value(A) ≤ θi, for i = 1..d-1, must be evaluated to find the best threshold θbest for that attribute. How the values θi are determined and which criterion is used to find the best θ depend on the strategy of each algorithm [13][1]. In the C4.5 algorithm, θi is chosen as the average of two adjacent values in the sorted sequence of values.
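The threshold search can be sketched in Python as follows. The candidate thresholds are the midpoints of adjacent distinct values, as described for C4.5 above. The criterion used to pick the best threshold is left to each algorithm in the text, so the information gain used in this sketch is an assumption, as are the function names. The data is the Age and Risk columns of the training data in Figure 1.

```python
from collections import Counter
from math import log2

def candidate_thresholds(values):
    """Midpoints of adjacent distinct values: d distinct values give d - 1 candidates."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum(n / total * log2(n / total) for n in Counter(labels).values())

def best_threshold(values, labels):
    """Candidate with the highest information gain (criterion assumed for this sketch)."""
    def gain(theta):
        left = [y for v, y in zip(values, labels) if v <= theta]
        right = [y for v, y in zip(values, labels) if v > theta]
        remainder = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        return entropy(labels) - remainder
    return max(candidate_thresholds(values), key=gain)

# Age and Risk columns of the training data in Figure 1.
ages = [20, 18, 40, 50, 35, 30, 32, 40]
risks = ["High", "High", "High", "Low", "Low", "High", "Low", "Low"]

print(candidate_thresholds(ages))   # 7 distinct ages, hence 6 candidates
print(best_threshold(ages, risks))
```

On this data the selected threshold is 31.0, the midpoint of the adjacent values 30 and 32, which agrees with the test "Age < 31" in the classifier of Figure 1.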
Some further issues, related to rule set generation and the handling of missing values, are presented in detail in the section on the C4.5 algorithm.
1.2.3. Evaluation of decision trees in data mining
1.2.3.1. Strengths of decision trees
Decision trees have the following 5 main strengths [5]:
• Ability to generate understandable rules
Decision trees can generate rules that can be translated into plain English or into SQL statements. This is the outstanding advantage of the technique. Even when a large data set makes the tree large and complex, following any single path through the tree is easy and clear. The explanation for any particular classification or prediction is therefore relatively transparent.
• Ability to perform in rule-oriented domains
This may sound obvious, but rule induction in general and decision trees in particular are an excellent choice for domains that really are governed by rules. Many domains, from genetics to industrial processes, do contain underlying rules that are quite complex and are obscured by noisy data. Decision trees are a natural choice when the existence of such underlying rules is suspected.
• Ease of computation at classification time
Although a decision tree can take many forms, in practice the algorithms used to build decision trees usually produce trees with a low branching factor and simple tests at each node. Typical tests are numeric comparisons, set membership tests and simple conjunctions. On a computer, these tests translate into simple Boolean and integer operations that are fast and inexpensive. This is an important advantage because, in a commercial environment, predictive models are often used to classify millions or even billions of records.
• Ability to handle both continuous and categorical attributes
Decision trees handle continuous and categorical attributes equally well, although continuous attributes require more computational resources. Categorical attributes, which have caused problems for neural networks and statistical techniques, are easy to handle with the splitting criteria of a decision tree: each branch corresponds to one part of the data set, split according to the value of the attribute chosen for growing the tree at that node. Continuous attributes are also easy to split, by choosing a number called the threshold from the sorted values of the attribute. Once the best threshold has been chosen, the data set is split according to the binary test on that threshold.
• Clear indication of the best attributes
Decision tree construction algorithms select the attribute that best splits the training data set, starting at the root node of the tree. This shows which attributes are most important for prediction or classification.
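The first strength listed above, the ability to generate understandable rules, can be illustrated by turning every root-to-leaf path of a tree into one if-then rule. The sketch below uses the tree of Figure 5; the nested tuple/dict encoding and the function name are assumptions of this illustration, and the root comparison is read as "<= 27.5", which is also an assumption because the symbol was lost in the extracted figure.

```python
# Tree of Figure 5. A leaf is a class label (str); an internal node is
# (attribute, {branch_condition: subtree}).
RISK_TREE = ("Age", {
    "<= 27.5": "High",
    "> 27.5": ("Car type", {
        "in {sport}": "High",
        "in {family, truck}": "Low",
    }),
})

def extract_rules(tree, conditions=()):
    """Return one if-then rule per root-to-leaf path."""
    if isinstance(tree, str):  # leaf: emit the rule for the path followed so far
        premise = " and ".join(conditions) if conditions else "true"
        return [f"if {premise} then Risk = {tree}"]
    attribute, branches = tree
    rules = []
    for condition, subtree in branches.items():
        rules.extend(extract_rules(subtree, conditions + (f"{attribute} {condition}",)))
    return rules

rules = extract_rules(RISK_TREE)
for rule in rules:
    print(rule)
```

The number of rules equals the number of leaves, and the premise of each rule contains one condition per internal node on its path, which is why individual classifications remain easy to explain even for large trees.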
1.2.3.2. Weaknesses of decision trees
Despite the strengths above, decision trees also have weaknesses. They are not well suited to problems whose goal is to predict the value of a continuous attribute such as income, blood pressure or bank interest rate. Decision trees also have difficulty with time-series data unless considerable effort is spent on representing the data as sequential patterns.
• Error-prone with too many classes
Some decision trees work only with binary classes such as yes/no or accept/reject. Others can assign records to an arbitrary number of classes, but they become error-prone when the number of training examples per class is small. This happens even faster in trees with many levels or many branches per node.
• Computationally expensive to train
This may seem to contradict the advantage of decision trees stated above, but growing a decision tree is computationally expensive, because the tree has many internal nodes before the final leaves are reached. At each node, a measure (the splitting criterion) must be computed for each attribute, and for a continuous attribute the data set must additionally be sorted by the value of that attribute. Only then can an attribute be chosen for growing the tree, together with the corresponding best split. Some algorithms grow the tree using weighted combinations of attributes. Pruning is also expensive, because many candidate subtrees must be generated and compared.
1.2.4. Building a decision tree
Building a decision tree consists of two phases:
Phase one: growing the decision tree
This phase starts at the root and proceeds branch by branch, growing the tree inductively in a divide-and-conquer fashion until every leaf of the decision tree has been assigned a class label.
Phase two: pruning the branches of the decision tree
This phase simplifies and generalizes the tree, and thereby increases its accuracy, by removing its dependence on statistical noise in the training data and on variations that may be peculiar to the training data. It accesses only the tree grown in the previous phase, and experiments show that it consumes few computing resources: for most algorithms it takes less than about 1% of the total time needed to build the classification model [7][1].
We therefore concentrate on the tree-growing phase. Its framework is as follows:
1) Choose the "best" attribute using a predefined measure.
2) Grow the tree by adding one branch for each value of the chosen attribute.
3) Sort the training data and distribute it to the child nodes.
4) If the examples are unambiguously classified, stop. Otherwise, repeat steps 1 to 4 for each child node.
1.3. Decision tree construction algorithms
1.3.1. General idea
Most decision-tree classification algorithms follow this pseudo-code:
Figure 6 - Pseudo-code of a decision-tree classification algorithm
MakeTree(Training Data T)
{
    Partition(T)
}
Partition(Data S)
{
    if (all points in S are in the same class) then
        return
    for each attribute A do
        evaluate splits on attribute A;
    use best split found to partition S into S1, S2, ..., Sk
    Partition(S1)
    Partition(S2)
    ...
    Partition(Sk)
}
Classification algorithms such as C4.5 (Quinlan, 1993), CDP (Agrawal et al., 1993), SLIQ (Mehta et al., 1996) and SPRINT (Shafer et al., 1996) all take Hunt's method as their guiding idea. The method was devised by Hunt and his colleagues in the late 1950s and early 1960s.
Inductive description of Hunt's method [1]:
Suppose a decision tree is to be built from T, the set of training data, and the classes are denoted C = {C1, C2, ..., Ck}.
Case 1: T contains cases that all belong to a single class Cj. The decision tree for T is a leaf identifying class Cj.
Case 2: T contains cases belonging to several classes in C. A test is chosen on a single attribute, with outcomes {O1, O2, ..., On}. In many applications n is chosen to be 2, which yields a binary decision tree. T is partitioned into subsets T1, T2, ..., Tn, where Ti contains all cases in T that have outcome Oi for the chosen test. The decision tree for T consists of a node representing the chosen test and one branch for each possible outcome of the test. The same tree-building procedure is applied recursively to each subset of the training data.
Case 3: T contains no cases. The decision tree for T is a leaf, but the class associated with the leaf must be determined from information other than T. For example, C4.5 chooses the most frequent class at the parent of this node.
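The three cases of Hunt's method map directly onto a recursive function. The following is a minimal sketch under stated assumptions, not the code of any particular algorithm: the data layout (a list of `(row, label)` pairs), the `domains` dictionary listing every possible value of each attribute, and the names `build_tree` and `majority_class` are all hypothetical. The choice of test in Case 2 is a deliberate placeholder (the first remaining attribute); a real algorithm would use a selection measure there.

```python
from collections import Counter

def majority_class(cases):
    """Most frequent class label among (row, label) training cases."""
    return Counter(label for _, label in cases).most_common(1)[0][0]

def build_tree(cases, attributes, domains, parent_majority=None):
    """Grow a tree recursively, following the three cases of Hunt's method."""
    # Case 3: T is empty -> leaf labelled from information outside T
    # (here, as the text says of C4.5, the most frequent class at the parent).
    if not cases:
        return {"leaf": parent_majority}
    labels = {label for _, label in cases}
    # Case 1: every case belongs to one class -> leaf for that class.
    if len(labels) == 1:
        return {"leaf": labels.pop()}
    # No attribute left to test: fall back to the majority class (assumption).
    if not attributes:
        return {"leaf": majority_class(cases)}
    # Case 2: choose a test and partition T by its outcomes.
    # Placeholder choice: the first remaining attribute, not a real criterion.
    attr, rest = attributes[0], attributes[1:]
    here = majority_class(cases)
    children = {}
    for outcome in domains[attr]:
        subset = [(row, label) for row, label in cases if row[attr] == outcome]
        children[outcome] = build_tree(subset, rest, domains, here)
    return {"test": attr, "children": children}

# Hypothetical toy data, invented for illustration only.
domains = {"student": ["no", "yes"], "credit": ["fair", "excellent", "poor"]}
cases = [
    ({"student": "yes", "credit": "fair"}, "buy"),
    ({"student": "yes", "credit": "excellent"}, "buy"),
    ({"student": "no", "credit": "fair"}, "skip"),
    ({"student": "no", "credit": "excellent"}, "buy"),
    ({"student": "no", "credit": "fair"}, "skip"),
]
tree = build_tree(cases, ["student", "credit"], domains)
print(tree)
```

In this toy run the branch `student = no, credit = poor` receives no training cases, so it exercises Case 3 and inherits the majority class of its parent.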
1.3.2. Current state of research on the algorithms
Decision-tree classification algorithms all follow Hunt's method described above. Two major questions must always be answered by such algorithms:
1. How is the best attribute to expand at each node determined?
2. How is the data stored, and how is it partitioned according to the corresponding tests?
Different algorithms answer these two questions differently, and this is what distinguishes one algorithm from another.
There are three kinds of criteria, or indices, for determining the best attribute to expand at each node:
Gini-index (Breiman et al., 1984 [1]): this criterion selects the attribute that minimizes the impurity of each split. Algorithms that use it include CART, SLIQ and SPRINT.
Information gain (Quinlan, 1993 [1]): unlike the Gini index, this criterion uses entropy to measure the impurity of a split and selects the attribute that maximizes the reduction in entropy. Algorithms that use it include ID3 and C4.5.
χ² contingency-table statistic: χ² measures the correlation between each attribute and the class label; the attribute with the strongest correlation is then selected. CHAID is an algorithm that uses this criterion.
Details of how the Gini index and information gain are computed are presented with the C4.5 and SPRINT algorithms in Chapter 2.
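Since the χ² criterion is not revisited in Chapter 2, a short illustration may help. The sketch below computes the Pearson χ² statistic of an attribute-by-class contingency table; the function name `chi_square` and the two small tables are hypothetical, chosen only to contrast a dependent and an independent attribute. It shows the statistic alone and does not attempt to reproduce the full CHAID procedure.

```python
def chi_square(table):
    """Pearson chi-square statistic of a contingency table.

    table[i][j] = number of cases with attribute value i and class j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if attribute and class were independent.
            expected = row_totals[i] * col_totals[j] / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical tables: rows are attribute values, columns are classes.
dependent = [[10, 20], [20, 10]]    # class mix differs between the rows
independent = [[10, 10], [20, 20]]  # same class mix in every row

print(chi_square(dependent))
print(chi_square(independent))
```

A larger statistic indicates a stronger association between the attribute and the class label, which is the sense in which the criterion prefers one attribute over another.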
Computing these measures sometimes requires scanning all or part of the training set. Earlier algorithms therefore required the whole training set to be memory-resident while the tree was being grown. This limited their scalability, because memory is finite while training sets keep growing, sometimes to millions or billions of records in commercial applications. A new way of storing and accessing data was clearly needed, and in 1996 SLIQ (Mehta) and SPRINT (Shafer) appeared and removed this limitation. Both algorithms keep the data disk-resident and pre-sort the training set once. These new features markedly improve performance and scalability compared with other algorithms. Several later algorithms build on SPRINT with further improvements: PUBLIC (1998) [11], which integrates the building and pruning phases; ScalParC (1998), which improves SPRINT's data-splitting step by using the hash table differently; and an algorithm proposed by researchers at the University of Minnesota (USA) together with IBM, which reduces I/O cost as well as global communication cost in the parallel setting compared with SPRINT [2]. Among these algorithms SPRINT is regarded as a breakthrough and is well worth studying and developing further.
1.3.3. Parallelizing sequential decision-tree classification algorithms
Parallelization is the current research trend for decision-tree classification algorithms. The need to parallelize sequential algorithms follows inevitably from practice, as demands on performance and accuracy keep rising while the data to be mined grows rapidly in size. A classification model running on a parallel computing system achieves high performance and can mine larger data sets, which increases the reliability of the classification rules. Sequential algorithms that require memory-resident data cannot cope with terabyte-scale data sets containing billions of records. Building efficient parallel algorithms on top of existing sequential ones is therefore a challenge for researchers.
There are three strategies for parallelizing sequential algorithms:
Synchronous tree construction
In this method all processors build the decision tree together by sending and receiving class-distribution information about their local data. Figure 7 illustrates how the processors work in this method.
The advantage of this method is that no training data has to be moved. However, the algorithm suffers from high communication cost and load imbalance. For each node of the decision tree, after the class-distribution information has been collected, all processors must synchronize and update it. For nodes at shallow depth the communication cost is relatively small compared with the computation, because the number of training items processed at each node is relatively large. As the tree grows deeper, however, communication accounts for most of the processing time. Another problem of this method is load imbalance, caused by the way the data is initially stored and distributed among the processors.
Figure 7 - Building a decision tree with the synchronous method
Partitioned tree construction
When a decision tree is built with the partitioned approach, different processors work on different parts of the tree. If more than one processor cooperates to expand a node, those processors are partitioned so that they expand the children of that node. The method considers the case where a group of processors Pn cooperates to expand node n. At the start, all processors work together to expand the root of the classification tree. At the end, the whole tree is obtained by combining the subtrees built by the individual processors. Figure 8 illustrates how the processors work in this method.
The advantage of this method is that once a single processor is solely responsible for a node, it can grow that node into a subtree of the global tree independently, with no communication cost at all.
The method also has drawbacks. First, data must be moved after each node expansion until each processor holds all the data it needs to grow an entire subtree, which makes communication expensive in the upper part of the tree. Second, load balance is hard to achieve. Nodes are assigned to processors according to the number of cases in the child nodes, but the number of cases at a node does not necessarily correspond to the amount of work needed to grow the subtree rooted at that node.
Figure 8 - Building a decision tree with the partitioned method
Hybrid method
The hybrid method exploits the advantages of both methods above. Synchronous tree construction incurs a high communication cost as the frontier of the tree widens, whereas partitioned tree construction incurs the cost of load balancing after every step. The hybrid method therefore keeps using the first approach as long as the communication cost it incurs is not too high. Once this cost exceeds a given threshold, the processors working on the nodes at the current frontier of the tree are split into two partitions (assuming the number of processors is a power of 2).
The method needs a criterion for triggering the partitioning of the current set of processors, namely:
Σ(Communication cost) ≥ Moving cost + Load balancing cost
The operation of the hybrid method is illustrated in Figure 9.
Figure 9 - Building a decision tree with the hybrid method
Chapter 2. C4.5 AND SPRINT
2.1. General introduction
This section gives a brief account of how the two algorithms C4.5 and SPRINT came about.
C4.5 descends from the decision-tree machine learning algorithms based on the research of Hunt and his colleagues in the late 1950s and early 1960s (Hunt, 1962). The first version was ID3 (Quinlan, 1979), a simple system that initially consisted of about 600 lines of Pascal, followed by C4 (Quinlan, 1987). In 1993 J. Ross Quinlan built on these results to develop C4.5, about 9000 lines of C distributed on a floppy disk. Although C4.5 has a successor, C5.0, a commercial system from RuleQuest Research, much discussion and research still centres on C4.5 because its source code is available [13].
In 1996 three authors at the IBM Almaden Research Center, John Shafer, Rakesh Agrawal and Manish Mehta, proposed a new algorithm named SPRINT (Scalable PaRallelizable INduction of decision Trees). SPRINT was created to remove all memory restrictions, to run fast and to scale. The algorithm was designed to be easy to parallelize, allowing many processors to work together to build a single, consistent classification model [7]. SPRINT has since been commercialized and integrated into IBM's data mining tools.
Among decision-tree classification algorithms, C4.5 and SPRINT are representative of two different application ranges. C4.5 is an effective algorithm, and the most widely used one, for classification applications with small data volumes, on the order of a few hundred thousand records. SPRINT is an excellent algorithm for applications with huge data volumes, from several million up to billions of records.
2.2. The C4.5 algorithm
C4.5 is an effective and popular decision-tree classification algorithm for mining small databases. It keeps the data memory-resident, which is precisely what restricts it to small databases, and it re-sorts the data at every node while the tree is being grown. C4.5 also includes a technique for re-expressing the decision tree as an ordered list of if-then rules (an easily understood form of classification rule). This technique reduces the size of the rule set and simplifies the rules while keeping their accuracy comparable to that of the corresponding branches of the decision tree.
C4.5 grows the decision tree following Hunt's method studied above, using a depth-first strategy.
Pseudo-code of the C4.5 algorithm:
Figure 10 - Pseudo-code of the C4.5 algorithm
FormTree(T)
(1) ComputeClassFrequency(T);
(2) if OneClass or FewCases
        return a leaf;
    create a decision node N;
(3) ForEach Attribute A
        ComputeGain(A);
(4) N.test = AttributeWithBestGain;
(5) if N.test is continuous
        find Threshold;
(6) ForEach T' in the splitting of T
(7)     if T' is Empty
            Child of N is a leaf
        else
(8)         Child of N = FormTree(T');
(9) ComputeErrors of N;
    return N
In this report we focus on the points that distinguish C4.5 from other algorithms: how the test attribute is chosen at each node, how missing values are handled, how overfitting is avoided, how accuracy is estimated, and how the tree is pruned.
2.2.1. C4.5 uses entropy-based gain as the measure for selecting the best attribute
Most machine learning systems try to build a tree that is as small as possible, because smaller trees are easier to understand and more likely to achieve high predictive accuracy.
Since the minimality of the decision tree cannot be guaranteed, C4.5 relies on a greedy search: at each node it chooses the split for which the attribute selection measure is maximal.
Two measures are used in C4.5: information gain and gain ratio. RF(Cj, S) denotes the relative frequency of the cases in S that belong to class Cj:
\[ RF(C_j, S) = \frac{|S_j|}{|S|} \]
where |Sj| is the number of cases whose class is Cj and |S| is the size of the training set S.
The information needed to identify the class of a case in S, where S is the set whose class distribution is being considered, is computed as:
\[ I(S) = -\sum_{j=1}^{k} RF(C_j, S)\,\log_2 RF(C_j, S) \]
After S has been partitioned into subsets S1, S2, ..., St by a test B, the information gain is computed as:
\[ G(S, B) = I(S) - \sum_{i=1}^{t} \frac{|S_i|}{|S|}\, I(S_i) \]
Test B is chosen if G(S, B) is the largest.
Using G(S, B) has one problem, however: it favours tests with a large number of outcomes. For example, G(S, B) is maximal for a test in which every Si contains a single case. The gain ratio criterion solves this problem by taking into account the potential information of the partition itself:
\[ P(S, B) = -\sum_{i=1}^{t} \frac{|S_i|}{|S|}\,\log_2 \frac{|S_i|}{|S|} \]
Test B is chosen if the gain ratio G(S, B) / P(S, B) is the largest.
In the C4.5 release 8 classifier, either information gain or gain ratio can be used to determine the best attribute; gain ratio is the default.
Example of computing information gain
With categorical attributes
Table 1 - Training data set with class attribute buys_computer
In this data set, s1 is the set of records whose class is yes and s2 is the set of records whose class is no. Then:
I(S) = I(s1, s2) = I(9, 5) = -(9/14) * log2(9/14) - (5/14) * log2(5/14) = 0.940
Compute G(S, A) with A being each attribute in turn:
A = age. The attribute age has been discretized into three value ranges (the last one being above 40), which partition S into S1, S2 and S3 with 5, 4 and 5 records respectively.
For the last range (age above 40): I(S3) = I(s13, s23) = 0.971
Σ |Si| / |S| * I(Si) = 5/14 * I(S1) + 4/14 * I(S2) + 5/14 * I(S3) = 0.694
Gain(S, age) = I(s1, s2) - Σ |Si| / |S| * I(Si) = 0.246
The same computation for the other attributes gives:
A = income: Gain(S, income) = 0.029
A = student: Gain(S, student) = 0.151
A = credit_rating: Gain(S, credit_rating) = 0.048
The attribute age has the largest information gain, so age is chosen as the attribute to expand at the node under consideration.
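The figures in this example can be checked with a few lines of Python. This is a sketch under an explicit assumption: Table 1 itself is not reproduced in this text, so the per-range class counts `(2, 3)`, `(4, 0)` and `(3, 2)` used below are inferred from the stated sizes (5, 4, 5), the overall distribution (9 yes, 5 no) and the quoted values 0.971 and 0.694; the assignment of the two mixed ranges could equally be swapped without changing any result. The function names are hypothetical.

```python
from math import log2

def info(counts):
    """I(S): entropy of a class distribution given as a tuple of counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent_counts, partitions):
    """G(S, B) = I(S) - sum(|Si|/|S| * I(Si)) over the subsets Si."""
    n = sum(parent_counts)
    remainder = sum(sum(p) / n * info(p) for p in partitions)
    return info(parent_counts) - remainder

def split_info(partitions):
    """P(S, B): potential information of the partition itself."""
    n = sum(sum(p) for p in partitions)
    return -sum(sum(p) / n * log2(sum(p) / n) for p in partitions if sum(p) > 0)

def gain_ratio(parent_counts, partitions):
    return info_gain(parent_counts, partitions) / split_info(partitions)

# (yes, no) counts per age range: inferred, see the assumption above.
age_partitions = [(2, 3), (4, 0), (3, 2)]
parent = (9, 5)

i_s = info(parent)                         # about 0.940
gain_age = info_gain(parent, age_partitions)
ratio_age = gain_ratio(parent, age_partitions)
print(f"I(S) = {i_s:.3f}, gain = {gain_age:.4f}, gain ratio = {ratio_age:.4f}")
```

Computed without intermediate rounding, the gain for age comes out slightly above the 0.246 quoted in the text, which is obtained by subtracting the rounded values 0.940 and 0.694.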
With continuous attributes
Handling a continuous attribute requires more computing resources than handling a categorical one. It involves the following steps:
1. Quicksort is used to sort the cases of the training set in ascending or descending order of the values of the continuous attribute V under consideration. This yields the set of values V = {v1, v2, ..., vm}.
2. The data set is split into two subsets by a threshold θi = (vi + vi+1)/2 lying between two adjacent values vi and vi+1. The splitting test is a binary test of the form V <= θi or V > θi. Applying the test yields two subsets: V1 = {v1, v2, ..., vi} and V2 = {vi+1, vi+2, ..., vm}.
3. The (m-1) possible thresholds θi for the m values of attribute V are examined by computing the information gain or gain ratio for each threshold. The threshold with the largest information gain or gain ratio is chosen as the split threshold of the attribute.
Finding the threshold (linearly, as above) and sorting the training set on the continuous attribute under consideration can sometimes become a bottleneck, because they consume considerable computing resources.
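The three steps above can be sketched as follows. This is an illustrative sketch rather than C4.5's own code: the name `best_threshold`, the use of Python's built-in sort in place of an explicit quicksort, and the toy values are assumptions. For brevity the class distribution on each side is recounted for every candidate threshold, which is simpler but slower than an incremental update.

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_threshold(values, labels):
    """Return (threshold, information gain) of the best binary split V <= theta."""
    pairs = sorted(zip(values, labels))           # step 1: sort by attribute value
    n = len(pairs)
    base = info([label for _, label in pairs])
    best_theta, best_gain = None, -1.0
    for i in range(n - 1):                        # step 3: try the candidate thresholds
        lo, hi = pairs[i][0], pairs[i + 1][0]
        if lo == hi:
            continue                              # no threshold between equal values
        theta = (lo + hi) / 2                     # step 2: midpoint of adjacent values
        left = [label for _, label in pairs[: i + 1]]
        right = [label for _, label in pairs[i + 1 :]]
        gain = base - len(left) / n * info(left) - len(right) / n * info(right)
        if gain > best_gain:
            best_theta, best_gain = theta, gain
    return best_theta, best_gain

# Hypothetical values of a continuous attribute and their class labels.
values = [45, 25, 60, 30, 50, 35]
labels = ["yes", "no", "yes", "no", "yes", "no"]
theta, gain = best_threshold(values, labels)
print(theta, gain)
```

Replacing the information gain in the loop by the gain ratio would give the other variant mentioned in step 3.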
2.2.2. C4.5 has its own mechanism for handling missing values
Missing attribute values are common in data. They may result from errors when records are entered into the database, or from the attribute being judged unnecessary for a particular case.
While the tree is being built from the training set S, let B be a test based on attribute Aa with outcomes b1, b2, ..., bt. Let S0 be the subset of cases in S whose value of Aa is unknown, and let Si denote the cases with outcome bi for test B. The information gain of test B is then reduced, because nothing is learned from the cases in S0:
\[ G(S, B) = \frac{|S - S_0|}{|S|}\, G(S - S_0, B) \]
Correspondingly, P(S, B) also changes:
\[ P(S, B) = -\frac{|S_0|}{|S|}\,\log_2 \frac{|S_0|}{|S|} - \sum_{i=1}^{t} \frac{|S_i|}{|S|}\,\log_2 \frac{|S_i|}{|S|} \]
Both changes lower the value of tests on attributes with a high proportion of missing values.
If test B is chosen, C4.5 does not create a separate branch of the decision tree for S0. Instead, the algorithm distributes the cases of S0 among the subsets Si, the subsets whose test attribute value is known, with weights |Si| / |S - S0|.
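The effect of missing values on the two measures can be illustrated with a small sketch. It follows the description above (the gain is scaled by the fraction of cases with a known value, the potential information gains one extra term for S0, and the cases of S0 are shared out with weights |Si| / |S - S0|); the function names and the toy counts are hypothetical.

```python
from math import log2

def info(counts):
    """Entropy of a class distribution given as a tuple of counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def gain_with_missing(partitions, n_missing):
    """partitions: class counts of S1..St (known values); n_missing = |S0|."""
    known_counts = [sum(col) for col in zip(*partitions)]
    n_known = sum(known_counts)
    gain_known = info(known_counts) - sum(
        sum(p) / n_known * info(p) for p in partitions
    )
    # Scale by the fraction of cases whose attribute value is known.
    return n_known / (n_known + n_missing) * gain_known

def split_info_with_missing(partitions, n_missing):
    """Potential information with S0 counted as one extra subset."""
    sizes = [sum(p) for p in partitions] + [n_missing]
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes if s > 0)

def missing_case_weights(partitions):
    """Weight with which each case of S0 is sent to subset Si."""
    n_known = sum(sum(p) for p in partitions)
    return [sum(p) / n_known for p in partitions]

# Hypothetical test with two outcomes: 10 known cases, 4 cases with unknown value.
partitions = [(4, 0), (0, 6)]
n_missing = 4
print(gain_with_missing(partitions, n_missing))
print(split_info_with_missing(partitions, n_missing))
print(missing_case_weights(partitions))
```

In this toy case the test separates the known cases perfectly, yet its gain drops from about 0.971 to about 0.694 once the four unknown cases are taken into account.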
2.2.3. Avoiding overfitting
Overfitting is a significant difficulty for decision-tree learning and for other learning methods. It arises as follows: if there are no conflicting cases (cases that agree on every attribute value but differ in class), the decision tree classifies every case in the training set correctly. The training data, however, sometimes contains peculiarities of its own, so when the tree is applied to other data sets its accuracy is no longer as high.
There are several ways to avoid overfitting in decision trees:
Stop growing the tree early, before it reaches the point where it classifies the training set perfectly. The challenge with this approach is to estimate precisely when to stop growing the tree.
Allow the tree to overfit the data, and then prune it.
Although the first approach seems more direct, trees produced by the second approach have been shown experimentally to be more successful in practice, because it allows potential interactions among attributes to be explored before deciding which results are worth keeping. C4.5 uses the second technique to avoid overfitting.
2.2.4. Converting a decision tree into rules
Converting a decision tree into if-then production rules yields classification rules that are easy to understand and easy to apply. Classification models that express concepts as production rules have proved useful in many different domains that demand both accuracy and comprehensibility of the model. Producing the output as a set of production rules is therefore a sensible choice. However, the computing resources needed to generate a rule set from a large, noisy training set are enormous [12]. This claim will be confirmed by the experimental results obtained with the C4.5 classifier.
Converting a decision tree into rules involves four steps:
Pruning
Each initial rule is a path from the root to a leaf of the decision tree, so a decision tree with l leaves gives a production rule set with l initial rules. Each condition in a rule is examined and removed if doing so does not affect the accuracy of the rule. The pruned rules are then added to an intermediate rule set, provided they do not duplicate rules already present.
Selection
The pruned rules are grouped by class, forming subsets of rules per class. If the training set has k classes there are k rule subsets. Each subset is examined in order to select a subset of its rules that optimizes the predictive accuracy for the class associated with that rule set.
Ordering
The k rule sets produced in the previous step are ordered by error frequency. The default class is determined by finding the cases in the training set that are not covered by the current rules and choosing the most frequent class among those cases.
Evaluation
The rule set is re-evaluated on the whole training set in order to determine whether any rule lowers the classification accuracy. If so, that rule is removed, and the evaluation is repeated until no further improvement is possible.
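The first step (one initial rule per root-to-leaf path, then dropping conditions that do not matter) can be sketched as follows. This is a simplified illustration, not C4.5's rule generator: the nested-dictionary tree format, the function names and the toy data are hypothetical, and the rule's accuracy on the training cases is used as the test for dropping a condition, whereas C4.5 itself relies on an error estimate.

```python
def initial_rules(tree, path=()):
    """One rule per root-to-leaf path: (tuple of (attribute, value), class)."""
    if "leaf" in tree:
        return [(path, tree["leaf"])]
    rules = []
    for outcome, child in tree["children"].items():
        rules += initial_rules(child, path + ((tree["test"], outcome),))
    return rules

def rule_accuracy(conditions, label, cases):
    """Fraction of the cases covered by the conditions that have class `label`."""
    covered = [lab for row, lab in cases if all(row[a] == v for a, v in conditions)]
    return sum(lab == label for lab in covered) / len(covered) if covered else 0.0

def prune_rule(conditions, label, cases):
    """Drop conditions one at a time while accuracy on the cases does not fall."""
    conditions = list(conditions)
    improved = True
    while improved and conditions:
        improved = False
        current = rule_accuracy(conditions, label, cases)
        for cond in conditions:
            shorter = [c for c in conditions if c != cond]
            if rule_accuracy(shorter, label, cases) >= current:
                conditions, improved = shorter, True
                break
    return tuple(conditions)

# Hypothetical tree and training cases, invented for illustration.
tree = {"test": "student", "children": {
    "yes": {"leaf": "buy"},
    "no": {"test": "credit", "children": {
        "fair": {"leaf": "skip"},
        "excellent": {"leaf": "buy"},
    }},
}}
cases = [
    ({"student": "yes", "credit": "fair"}, "buy"),
    ({"student": "yes", "credit": "excellent"}, "buy"),
    ({"student": "no", "credit": "fair"}, "skip"),
    ({"student": "no", "credit": "excellent"}, "buy"),
]
rules = initial_rules(tree)
pruned = [(prune_rule(conds, label, cases), label) for conds, label in rules]
for conds, label in pruned:
    print("if", " and ".join(f"{a} = {v}" for a, v in conds), "then", label)
```

The tree has three leaves, so three initial rules are produced; pruning shortens the third rule to the single condition on `credit`, since the condition on `student` does not change its accuracy on these cases.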
2.2.5. C4.5 is an effective algorithm for small and medium-sized data sets
C4.5 generates decision trees efficiently and rigorously by using information gain as the measure for selecting the best attribute. Its mechanisms for handling erroneous and missing values and for countering overfitting, together with its tree-pruning mechanism, account for the strength of C4.5. In addition, the C4.5 classifier can convert the decision tree into if-then rules, which improves the accuracy and the comprehensibility of the classification results. This is a very valuable facility for users.
2.3. The SPRINT algorithm
Today the data to be mined may contain millions of records and between 10 and 10000 attributes, which amounts to terabytes of data (100 million records * 2000 fields * 5 bytes). Earlier algorithms cannot meet this demand. Against this background SPRINT appeared as an improvement of the SLIQ algorithm (Mehta, 1996). Both SLIQ and SPRINT introduce improvements that increase scalability:
Both handle continuous and categorical attributes well.
Both algorithms pre-sort the data once and keep disk-resident any data that is too large to fit in main memory. Because sorting disk-resident data is expensive [3], pre-sorting means that the data used for growing the tree only has to be sorted once. After each split of the data at a node, the order of the records in each list is preserved, so the data does not have to be re-sorted as in CART and C4.5 [13][12]. This reduces the computing resources needed when data is kept on disk.
Both algorithms use data structures that make building the decision tree easier. However, the storage structures of SLIQ and SPRINT differ, which leads to different degrees of scalability and parallelizability for the two algorithms.
The pseudo-code of the SPRINT algorithm is as follows:
Figure 11 - Pseudo-code of the SPRINT algorithm
SPRINT algorithm:
Partition(Data S) {
    if (all points in S are of the same class) then
        return;
    for each attribute A do
        evaluate splits on attribute A;
    Use best split found to partition S into S1 & S2
    Partition(S1);
    Partition(S2);
}
Initial call: Partition(Training Data)
2.3.1. Data structures in SPRINT
The technique of splitting the data into separate attribute lists was first proposed in SLIQ (Supervised Learning In Quest). The data used by SLIQ consists of several disk-resident attribute lists (one list per attribute) and a single list of class values that is resident in main memory. These lists are linked to one another by the value of rid (the index of the record in the database), which is present in every list.
SLIQ divides the data into two kinds of structure: [14][9]
Figure 12 - Data structures in SLIQ
Attribute lists are disk-resident. Each consists of the attribute value field and rid (a record identifier).
The class list contains the class value of every record in the database. It consists of the fields rid, class and node (a link to the corresponding node of the decision tree). Because of this pointer field to the corresponding tree node, partitioning the data only requires changing the value of the pointer, without physically moving data between nodes. The class list is kept resident in main memory because it is accessed and modified frequently, both while the tree is being built and while it is being pruned. The size of the class list is proportional to the number of input records. When the class list does not fit in memory, SLIQ's performance degrades. This is the limitation of the SLIQ algorithm: its use of a memory-resident data structure restricts its scalability.
SPRINT uses disk-resident attribute lists
SPRINT overcomes this limitation of SLIQ by not using a memory-resident class list. It uses only one kind of list, the attribute list, structured as follows:
Figure 13 - Structure of attribute lists in SPRINT. Lists of continuous attributes are sorted as soon as they are created
Attribute lists
SPRINT creates one attribute list for each attribute in the data set. Each entry consists of the attribute value, the class label (the class attribute) and the record index rid (assigned from the original data set). The list of a continuous attribute is sorted by attribute value as soon as it is created. If the data does not fit in memory, all attribute lists are stored on disk. It is this storage scheme that lets SPRINT remove every memory restriction and makes it applicable to real databases with up to billions of records.
The initial attribute lists created from the training set are associated with the root of the decision tree. As the tree grows and nodes are split into new child nodes, the attribute lists belonging to a node are split accordingly and attached to the child nodes. When a list is split, the order of its records is preserved, so the resulting sub-lists never have to be re-sorted. This is an advantage of SPRINT over earlier algorithms.
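The construction of attribute lists can be sketched with the six-record example of Figure 13. The tuple layout `(value, class, rid)` and the function name `build_attribute_list` are assumptions made for this illustration; in SPRINT the lists would be written to disk when they do not fit in memory, whereas here they are plain in-memory Python lists.

```python
# Training records of the Figure 13 example: (rid, age, car_type, risk).
records = [
    (0, 23, "family", "high"),
    (1, 17, "sport", "high"),
    (2, 43, "sport", "high"),
    (3, 68, "family", "low"),
    (4, 32, "truck", "low"),
    (5, 20, "family", "high"),
]

def build_attribute_list(records, column, continuous):
    """Return the attribute list for one column as (value, class, rid) entries."""
    entries = [(rec[column], rec[3], rec[0]) for rec in records]
    if continuous:
        # Continuous lists are sorted once, at creation time.
        entries.sort(key=lambda entry: entry[0])
    return entries

age_list = build_attribute_list(records, column=1, continuous=True)
car_list = build_attribute_list(records, column=2, continuous=False)

for entry in age_list:
    print(entry)
```

Because every entry carries its rid, the entries of the sorted Age list can still be matched with the entries of the Car Type list that belong to the same record.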
Training set (Figure 13):
RID  Age  Car Type  Risk
0    23   family    high
1    17   sport     high
2    43   sport     high
3    68   family    low
4    32   truck     low
5    20   family    high
Attribute list for Age, sorted by value (Figure 13):
Age  RID  Risk
17   1    high
20   5    high
23   0    high
32   4    low
43   2    high
68   3    low
Attribute list for Car Type (Figure 13):
Car Type  RID  Risk
family    0    high
sport     1    high
sport     2    high
family    3    low
truck     4    low
family    5    high
Histograms
SPRINT uses histograms to tabulate the class distribution of the records in each attribute list, and uses them to evaluate split points for that list. Continuous and categorical attributes have two different kinds of histogram.
Histograms for a continuous attribute
SPRINT uses two histograms, Cbelow and Cabove. Cbelow holds the class distribution of the records that have already been processed, and Cabove holds the class distribution of the records in the attribute list that have not yet been processed. Figure 14 illustrates the use of histograms for a continuous attribute.
Histogram for a categorical attribute
A categorical attribute also has a histogram associated with each node. However, SPRINT uses only one histogram here, the count matrix, which holds the class distribution for each value of the attribute under consideration.
The attribute lists are processed one at a time, so instead of requiring the attribute lists to be in memory, SPRINT only needs memory for one set of such histograms while the tree is being grown.
2.3.2. SPRINT uses the Gini index as the measure for finding the best split of the data set
SPRINT is one of the algorithms that use the Gini index to find the best attribute to test at each node of the tree. The index was proposed by Breiman in 1984 and is computed as follows.
First define:
\[ gini(S) = 1 - \sum_{j=1}^{k} p_j^2 \]
where S is a training set containing k classes and pj is the relative frequency of class j in S (the number of records whose class is j divided by the total number of records in S).
For a binary split, in which S is divided into S1 and S2 (SPRINT uses only binary splits), the index of the split is given by:
\[ gini_{split}(S) = \frac{n_1}{n}\, gini(S_1) + \frac{n_2}{n}\, gini(S_2) \]
where n, n1 and n2 are the sizes of S, S1 and S2 respectively.
The advantage of this index is that it is computed solely from the distribution of class values in each part of the split, not from the values of the attribute under consideration.
To find the split point for a node, each attribute list of that node is scanned and the splits on each attribute associated with the node are evaluated. The attribute chosen for splitting is the one with the smallest gini_split(S).
The point to stress here is that, unlike information gain, this index is computed without reading the data contents; only the histograms describing the distribution of records over the class values are needed. This is the premise for keeping the data disk-resident. The histograms for continuous and categorical attribute lists are described below.
With continuous attributes
For a continuous attribute, the candidate split values lie between every pair of adjacent values of the attribute. To find the split point for the attribute at a given node, the histograms are initialized with Cbelow set to zero and Cabove set to the class distribution of all records at the node. Both histograms are updated each time a record is read. As the cursor advances, the Gini index is computed for the split point lying between the value just read and the next value to be read. When the whole attribute list has been read (Cabove is zero in every column), the Gini indices of all candidate split points have been computed. The split point with the lowest Gini index is then chosen as the split point of the continuous attribute under consideration at that node. The Gini index is computed entirely from the histograms. Once the best split point has been found, the result is saved and the histograms attached to the attribute list are reinitialized before the next attribute is processed.
Figure 14 - Evaluating split points for a continuous attribute
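The scan described above can be sketched in a few lines, using the sorted Age attribute list of the Figure 13 example. This is a minimal illustration under assumptions: the entry layout `(value, class, rid)`, the function names and the use of `collections.Counter` for the two histograms are choices made here, not part of SPRINT's specification.

```python
from collections import Counter

def gini(hist):
    """gini = 1 - sum of squared class frequencies of a histogram."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0

def best_continuous_split(attr_list):
    """Scan a sorted attribute list once and return (best gini_split, split value)."""
    c_below = Counter()                                  # records already read
    c_above = Counter(cls for _, cls, _ in attr_list)    # records not yet read
    n = len(attr_list)
    best = (float("inf"), None)
    for i in range(n - 1):
        value, cls, _ = attr_list[i]
        c_below[cls] += 1                                # move one record across
        c_above[cls] -= 1
        next_value = attr_list[i + 1][0]
        if next_value == value:
            continue                                     # no split between equal values
        n_below = i + 1
        score = (n_below / n) * gini(c_below) + ((n - n_below) / n) * gini(c_above)
        if score < best[0]:
            best = (score, (value + next_value) / 2)     # midpoint as split value
    return best

# Sorted attribute list for Age from the Figure 13 example.
age_list = [
    (17, "high", 1), (20, "high", 5), (23, "high", 0),
    (32, "low", 4), (43, "high", 2), (68, "low", 3),
]
score, split_value = best_continuous_split(age_list)
print(score, split_value)
```

On this list the lowest index is obtained for the split point between 23 and 32, where the three records below the split all have class high.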
With categorical attributes
For a categorical attribute, the best split is also found from the histogram of the attribute list. First the whole attribute list is scanned to obtain the class counts for each value of the categorical attribute, and the result is stored in the count matrix. Then every possible subset of the values of the attribute is considered as a split, and the corresponding Gini index is computed. All the information needed to compute the Gini index of any subset is available in the count matrix. The memory allocated to the count matrix is reclaimed once the best split of the attribute has been found.
Figure 15 - Evaluating split points for a categorical attribute
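The count matrix and the subset search can be sketched as follows, using the Car Type attribute list of the Figure 13 example. The entry layout `(value, class, rid)` and the function names are assumptions made for this illustration. The search simply enumerates every proper, non-empty subset, so each binary partition is visited twice (once as a subset and once as its complement); that is acceptable only for attributes with few distinct values.

```python
from collections import Counter, defaultdict
from itertools import combinations

def gini(hist):
    """gini = 1 - sum of squared class frequencies of a histogram."""
    n = sum(hist.values())
    return 1.0 - sum((c / n) ** 2 for c in hist.values()) if n else 0.0

def best_categorical_split(attr_list):
    """Return (best gini_split, set of values sent to the left child)."""
    count_matrix = defaultdict(Counter)        # value -> class histogram
    for value, cls, _ in attr_list:            # one scan of the attribute list
        count_matrix[value][cls] += 1
    total = Counter()
    for hist in count_matrix.values():
        total.update(hist)
    n = sum(total.values())
    values = sorted(count_matrix)
    best = (float("inf"), None)
    for size in range(1, len(values)):
        for subset in combinations(values, size):
            left = Counter()
            for v in subset:
                left.update(count_matrix[v])   # only the count matrix is consulted
            right = total - left
            n_left = sum(left.values())
            score = (n_left / n) * gini(left) + ((n - n_left) / n) * gini(right)
            if score < best[0]:
                best = (score, set(subset))
    return best

# Attribute list for Car Type from the Figure 13 example.
car_list = [
    ("family", "high", 0), ("sport", "high", 1), ("sport", "high", 2),
    ("family", "low", 3), ("truck", "low", 4), ("family", "high", 5),
]
score, left_values = best_categorical_split(car_list)
print(score, left_values)
```

Note that the attribute list is read only once; every candidate subset is then evaluated from the count matrix alone.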
Example of computing the Gini index
For the training set described in Figure 13, the Gini index is computed as follows to find the best split point:
1. For the continuous attribute Age, the split must be evaluated for each of the following comparisons in turn: Age …
The computation is analogous for the remaining tests Age …
For the attribute chosen as the splitting attribute at the node (Age in the figure), distributing its attribute list to the child nodes is fairly simple. If it is a continuous attribute, the attribute list is simply cut at the split point into two parts, which are assigned to the two corresponding child nodes. If it is a categorical attribute, the whole list is scanned and the chosen test is applied to move each record to one of the two new lists belonging to the two child nodes.
Things are not so simple for the remaining attributes at the node (Car Type, for example). There is no test on these attributes, so the records cannot be distributed by testing their attribute values. Instead, a special field of the attribute lists is used: rids. This is the field that links the records across the attribute lists. Specifically, while the list of the splitting attribute (Age) is being split, the rids value of each record is inserted into a hash table, recording the child node to which the corresponding records (those with the same rids) in the other attribute lists must be sent. The structure of the hash table is shown in Figure 17.
Figure 17 - Structure of the hash table used to split the data in SPRINT (following the example of the previous figures)
Once the list of the splitting attribute has been split, the hash table is complete. The lists of the remaining attributes are then distributed to the child nodes according to the hash table, by reading the rids field of each record and looking up the corresponding Child node field in the hash table.
If the hash table is too large for memory, the split is carried out in several steps. The hash table is divided into parts that each fit in memory, and the attribute lists are split according to each part of the hash table in turn. The process is repeated for the remaining parts of the hash table.
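The splitting procedure can be sketched as follows, again with the six records of the Figure 13 example. This is an illustration under assumptions: a Python dictionary stands in for the hash table, entries have the layout `(value, class, rid)`, the function name is hypothetical, and the split point `Age < 27.5` is assumed for the sake of the example. The multi-step variant for a hash table that does not fit in memory is not shown.

```python
def split_attribute_lists(split_list, other_lists, threshold):
    """Split the list of the splitting attribute, then the other lists via rid."""
    rid_to_child = {}                       # the "hash table": rid -> child node
    children = {"L": [], "R": []}
    for value, cls, rid in split_list:      # one pass over the splitting attribute
        child = "L" if value < threshold else "R"
        rid_to_child[rid] = child
        children[child].append((value, cls, rid))
    split_others = {}
    for name, entries in other_lists.items():
        parts = {"L": [], "R": []}
        for entry in entries:               # order inside each part is preserved
            parts[rid_to_child[entry[2]]].append(entry)
        split_others[name] = parts
    return rid_to_child, children, split_others

age_list = [
    (17, "high", 1), (20, "high", 5), (23, "high", 0),
    (32, "low", 4), (43, "high", 2), (68, "low", 3),
]
car_list = [
    ("family", "high", 0), ("sport", "high", 1), ("sport", "high", 2),
    ("family", "low", 3), ("truck", "low", 4), ("family", "high", 5),
]
rid_to_child, age_parts, others = split_attribute_lists(
    age_list, {"Car Type": car_list}, threshold=27.5  # assumed split point
)
print(rid_to_child)
print(others["Car Type"])
```

Because each list is scanned in its existing order, the sub-lists of a sorted attribute list remain sorted, which is why no re-sorting is needed after a split.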
Hash table (Figure 17):
Rids        1  2  3  4  5  6
Child node  L  R  R  R  L  L
2.3.4. SPRINT is an effective algorithm for data sets that are too large for other algorithms
SPRINT was not created to outperform SLIQ [9] on data sets whose class list fits in memory. Its goal is to handle data sets that are too large for other algorithms and still produce an effective classification model from them. In addition, SPRINT was designed to be easy to parallelize. Indeed, parallelizing SPRINT is natural and efficient thanks to its parallel data-processing scheme. SPRINT achieves uniform data placement and workload balance by distributing the attribute lists evenly over the N processors of a machine with a shared-nothing architecture [7]. The parallelization of SPRINT in particular, and of decision-tree classification models in general, on shared-memory multiprocessor (SMP) systems, also called shared-everything systems, is studied in [10].
Alongside its strengths, SPRINT also has weaknesses. The first is the hash table used for splitting the data, whose size is proportional to the number of data items associated with the current node (the number of records in an attribute list). Moreover, the hash table must be in memory while the data is being split, so when the hash table is too large the split has to be carried out in several steps. The algorithm also incurs a heavy I/O cost. Parallelizing it requires a high global communication cost as well, because the Gini-index information of each attribute list must be synchronized.
The three authors of SPRINT report experimental results comparing the SPRINT classifier with SLIQ [7], shown in the chart below.
Chart 1 - Execution time of the SPRINT and SLIQ classifiers as a function of training set size
From the chart above it can be seen that with small data sets ([...] 1 million cases) [...] SLIQ cannot operate, whereas SPRINT still easily handles data sets of more than about 2.5 million cases. The reason is that SPRINT keeps its data entirely disk-resident.
2.4. Comparison of C4.5 and SPRINT
Criterion for selecting the splitting attribute
    C4.5:    Gain-entropy. Tends to isolate the largest class from the other classes.
    SPRINT:  Gini-index. Tends to split into groups of classes with comparable amounts of data.
Data storage mechanism
    C4.5:    Memory-resident -> suited to mining small databases (hundreds of thousands of records).
    SPRINT:  Disk-resident -> suited to mining extremely large data sets that other algorithms cannot handle (hundreds of millions to billions of records).
Data sorting mechanism
    C4.5:    Re-sorts the data set at every node.
    SPRINT:  Sorts once in advance. While the tree grows, the attribute lists are partitioned but their initial order is maintained, so no re-sorting is needed.
Illustration of the two splitting tendencies, for a node holding class A 40, class B 30, class C 20, class D 10:
Isolating the largest class, test "if age < 40":
    yes: class A 40
    no:  class B 30, class C 20, class D 10
Grouping classes into comparable amounts of data, test "if age < 65":
    yes: class A 40, class D
    no:  class B 30, class C 20
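To make the two criteria concrete, the following Python sketch computes the entropy-based information gain and the weighted Gini index for the two example splits. It is a worked illustration, not the code of C4.5 or SPRINT. It assumes that class D contributes 10 cases in the "age < 65" split (the figure shows "class D" without a count), and all function and variable names are chosen here for illustration.

```python
import math
from typing import Dict, List

Counts = Dict[str, int]


def entropy(counts: Counts) -> float:
    """Shannon entropy (in bits) of a class distribution."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total)
                for n in counts.values() if n > 0)


def gini(counts: Counts) -> float:
    """Gini index of a class distribution: 1 - sum of squared frequencies."""
    total = sum(counts.values())
    return 1.0 - sum((n / total) ** 2 for n in counts.values())


def information_gain(parent: Counts, children: List[Counts]) -> float:
    """Parent entropy minus the size-weighted entropy of the children."""
    total = sum(parent.values())
    remainder = sum(sum(c.values()) / total * entropy(c) for c in children)
    return entropy(parent) - remainder


def gini_split(parent: Counts, children: List[Counts]) -> float:
    """Size-weighted Gini index of the children (lower is better)."""
    total = sum(parent.values())
    return sum(sum(c.values()) / total * gini(c) for c in children)


PARENT = {"A": 40, "B": 30, "C": 20, "D": 10}
SPLIT_AGE_40 = [{"A": 40}, {"B": 30, "C": 20, "D": 10}]
# Assumption: class D has 10 cases here, as in the parent node.
SPLIT_AGE_65 = [{"A": 40, "D": 10}, {"B": 30, "C": 20}]

for name, split in (("age < 40", SPLIT_AGE_40), ("age < 65", SPLIT_AGE_65)):
    print(name,
          "gain = %.3f" % information_gain(PARENT, split),
          "gini_split = %.3f" % gini_split(PARENT, split))
```

On these counts the information gain is about 0.971 for the "age < 40" split and 1.000 for the "age < 65" split, while the weighted Gini index is about 0.367 and 0.400 respectively (lower is better). The tendencies listed in the comparison are therefore best treated as general heuristics and checked numerically on the data at hand.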
Chapter 3. EXPERIMENTAL RESULTS
The author used the open-source C4.5 release 8 classifier written by J. Ross Quinlan, available at http://www.cse.unsw.edu.au/~quinlan/, to analyze and evaluate the C4.5 classification model in terms of its classification results and the factors that affect its performance.
3.1. Experimental environment
The C4.5 source code was installed and run on Server 10.10.0.10 of the College of Technology (Đại học Công nghệ).
The server configuration is as follows: Intel Xeon 2.4 GHz processors, with 2 physical processors that can operate as 4 logical processors using hyper-threading, cache size 512 KB, 1 GB of main memory.
The experimental data set contains information about mobile phone customers who registered to use a web portal. Its fields include personal information such as name, gender, date of birth, the region where the phone was registered, the phone model and its version, and the number and duration of visits to the web portal to use services such as sending messages, logos or ringtones. The data set has about 120000 records used for training and about 60000 records used as test data.
3.2. Structure of the C4.5 release 8 classifier
3.2.1. The C4.5 classifier has 4 main programs:
The decision tree generator (c4.5)
The production rule generator (c4.5rules)
The program that applies a decision tree to classify new data (consult)
The program that applies a production rule set to classify new data (consultr)
In addition, C4.5 comes with 2 utilities that support experimental runs:
A csh shell script for cross-validation, a technique for estimating the accuracy of the classifier ('xval.sh')
Two accompanying helper programs ('xval-prep' and 'average')
More details on the C4.5 classifier can be found at:
http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html
3.2.2. Data structures used in C4.5
Each data set used by C4.5 consists of 3 files:
3.2.2.1. filestem.names: defines the data set
Figure 18 - The file defining the structure of the data used in the experiments
Description:
The first line defines the class values of the attribute chosen as the class (in the example of Figure 18 this is the attribute MOBILE_PRODUCTER_ID).
The following lines list the attributes together with their sets of values in the data set. Continuous attributes are declared with the keyword continuous.
Comments are written after the | character.
3.2.2.2. filestem.data: contains the training data
Figure 19 - The file containing the data to be classified
filestem.data is structured as follows: each line corresponds to one record (case) in the database. Each line is a tuple of values in the order of the attributes defined in filestem.names. Values are separated by commas. A missing value is represented by the ? character.
3.2.2.3. filestem.test: contains the test data
This file contains the data used to test the classifier built from the training data, and has the same structure as filestem.data.
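A minimal Python sketch of how files in this format could be read is given below. It follows only the rules described above (class values on the first line, `continuous` keyword, `|` comments, comma-separated values, `?` for missing values). The sample attribute names `GENDER`, `BIRTH_YEAR` and `JOB` and their values are invented for illustration, since the actual files appear only as figures, and the sketch assumes that the class value is the last field of each case, which the text above does not state explicitly.

```python
from typing import Dict, List, Optional, Tuple, Union

AttributeSpec = Union[str, List[str]]


def strip_comment(line: str) -> str:
    """Drop everything after the '|' comment marker."""
    return line.split("|", 1)[0].strip()


def parse_names(text: str) -> Tuple[List[str], Dict[str, AttributeSpec]]:
    """Return (class values, {attribute: list of values or 'continuous'})."""
    lines = [strip_comment(raw) for raw in text.splitlines()]
    lines = [line.rstrip(".").strip() for line in lines if line]
    classes = [value.strip() for value in lines[0].split(",")]
    attributes: Dict[str, AttributeSpec] = {}
    for line in lines[1:]:
        name, spec = line.split(":", 1)
        spec = spec.strip()
        if spec == "continuous":
            attributes[name.strip()] = "continuous"
        else:
            attributes[name.strip()] = [v.strip() for v in spec.split(",")]
    return classes, attributes


def parse_case(line: str, attribute_names: List[str]) -> Dict[str, Optional[str]]:
    """Parse one comma-separated case; '?' becomes None (missing value)."""
    values = [v.strip() for v in line.rstrip(".").split(",")]
    fields = attribute_names + ["class"]  # assumption: class value comes last
    return {f: (None if v == "?" else v) for f, v in zip(fields, values)}


# Hypothetical content of a filestem.names file.
SAMPLE_NAMES = """\
1, 2, 3, 4.            | class values (here: MOBILE_PRODUCTER_ID)
GENDER: M, F.
BIRTH_YEAR: continuous.
JOB: Engineering, Supervisory, Other.
"""

classes, attributes = parse_names(SAMPLE_NAMES)
case = parse_case("M, ?, Engineering, 4", list(attributes))
```

The same `parse_case` function would apply to filestem.test, since that file has the same structure as filestem.data.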
3.3. Experimental results
3.3.1. Some representative classification results:
3.3.1.1. Decision tree
Command to generate the decision tree:
$ ./C4.5 -f ../Data/Classes/10-5/class -u >> ../Data/Classes/10-5/class.dt
Options:
-f: specifies the data set to be classified
-u: the generated tree is also evaluated on the test data set
-v verb: level of detail of the output [0..3], default 0
-t trials: sets iterative mode, with trials as the number of trial trees. Iterative mode builds several trial trees, each starting from a randomly chosen subset of the data. The default is batch mode, in which the whole data set is used to build a single decision tree.
The internal nodes of the decision tree are tests on the value of the attribute chosen for expansion at that node. A leaf has the form Class_value (N/E) or Class_value (N), where N is the total number of cases that reach the leaf and E is the number of those cases that belong to a different class (in the training data).
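As a small illustration of this leaf notation, the Python sketch below extracts the class value, N and E from a leaf annotation and derives the share of misclassified training cases at that leaf. The sample strings and the names `parse_leaf` and `leaf_error_rate` are hypothetical; only the `Class_value (N/E)` and `Class_value (N)` forms are taken from the description above.

```python
import re
from typing import Tuple

# Matches "label (N)" or "label (N/E)" as described above.
LEAF_PATTERN = re.compile(r"^(\S+)\s+\(([\d.]+)(?:/([\d.]+))?\)$")


def parse_leaf(text: str) -> Tuple[str, float, float]:
    """Return (class label, N, E); E is 0 when the leaf shows only (N)."""
    match = LEAF_PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"not a leaf annotation: {text!r}")
    label, n, e = match.groups()
    return label, float(n), float(e) if e is not None else 0.0


def leaf_error_rate(text: str) -> float:
    """Fraction of training cases at the leaf that belong to another class."""
    _, n, e = parse_leaf(text)
    return e / n
```

For example, a hypothetical leaf written as `1 (12.0/3.0)` would mean that 12 training cases reach the leaf and 3 of them do not belong to class 1.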
Figure 20 - Form of the decision tree generated from the experimental data set
Figure 21 - Evaluation of the newly generated decision tree on the training data set and the test data set
After the decision tree has been generated, its accuracy is evaluated on the training data it was learned from, and it can also be evaluated on a test data set independent of the training data if the user selects that option.
The evaluations are carried out on the tree both before and after pruning. C4.5 also accepts parameters that control the pruning level of the tree; the default is 25%.
3.3.1.2. Representative production rules
Command to generate production rules once the decision tree exists:
$ ./C4.5rules -f ../Data/Classes/10-5/class -u >> ../Data/Classes/10-5/class.r
The options -f, -v and -u are the same as for the decision tree command.
Each generated rule has 3 parts:
The classification condition
The class value ( -> class )
[ ]: the predicted accuracy of the rule. This value is estimated on the training set and on the test set (if the -u option is given when generating the rules)
Figure 23 - Some rules derived from the 8-attribute data set, classified by phone manufacturer number (PRODUCTER_ID)
From the actual results in Figure 23, Rule 1021 lets us conclude: if a customer works in a Supervisory job and was born between 1969 and 1973, then the phone that customer uses has number 1 (a SAMSUNG phone). The accuracy of this conclusion is 91.7%.
Rules such as this help marketing staff identify the mobile phone market for each type of customer, and from there devise sound product development strategies.
Figure 24 - Some rules generated from the 8-attribute data set, classified by the phone service the customer uses (MOBILE_SERVICE_ID)
For example, from Rule 661: if the customer is male (F), works in Engineering, uses an Ericsson phone (MOBILE_PRODUCTER_ID = 4) and registered in 2004, then the service that customer uses is sending logos (MOBILE_SERVICE_ID = 2). The accuracy of this rule is 79.4%.
From rules like these, one can compile statistics on, and predict, the trends in service usage of different types of customers, and from there build an effective customer service development strategy.
Figure 25 - Evaluation of the rule set on the training data set
After it has been generated, the rule set is evaluated again on the training data, or on the test data set (optional).
Description of some representative fields:
Rule: the number of the rule
Size: the size of the rule (the number of comparison conditions in its classification condition)
Used: the number of cases in the training set to which the rule applies. This field indicates how general the rule is.
Wrong: the number of misclassified cases -> error percentage
Conclusion
From the experiments, we found that data preprocessing plays a very important role. In this phase one must determine precisely what information needs to be extracted from the database, and choose a suitable class attribute accordingly. After that, selecting the relevant attributes is very important: it determines whether the classification model is sound, whether it is meaningful in practice, and whether it can be applied to future data.
3.3.2. Performance charts
The parameters that affect the performance of the classifier are [6]:
The number of records in the training data set (N)
The number of attributes (A)
The number of discrete values of each attribute (the branching factor) (V)
The number of classes (C)
The cost of building the decision tree is the sum of the costs of building each node:
$T = \sum_i t_{node}(i)$
The cost of node i is the sum of the separate costs of each task:
$t_{node}(i) = t_{single}(i) + t_{freq}(i) + t_{info}(i) + t_{div}(i)$
where:
$t_{single}(i)$ is the cost of checking whether all cases in the training data set at the node belong to the same class.
$t_{div}(i)$ is the cost of partitioning the data set on the chosen attribute.
The attribute with the largest Information gain in the current data set is selected by computing the Information gain of every attribute. The cost of this process consists of the time to compute the frequency distribution of the class values for each attribute ($t_{freq}(i)$) and the time to compute the Information gain from that distribution ($t_{info}(i)$).
The dependence of these costs on the performance parameters described above can be expressed as follows:
$t_{freq} = k_1 A_i N_i$
$t_{info} = k_2 C A_i V$
$t_{div} = k_3 A_i$
$t_{single} = k_4 N_i$
Here the $k_j$ are constants whose values depend on the specific application. The number of records ($N_i$) and the number of attributes ($A_i$) at each node depend on the depth of that node and on the data set itself.
Determining the exact cost of building the decision tree (T) is very difficult, because it requires knowing the exact shape of the decision tree, which cannot be determined at run time. For this reason T is simplified by using average values together with assumptions about the shape of the tree, and by solving the recurrence equations for each individual component of the model [6].
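The cost model above can be written out as a short Python sketch. The four component formulas are taken as stated in the text; the constants default to 1.0 and the list of nodes is hypothetical, chosen only to show how the per-node costs add up to T.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class CostConstants:
    """Application-dependent constants k1..k4 (placeholder values)."""
    k1: float = 1.0
    k2: float = 1.0
    k3: float = 1.0
    k4: float = 1.0


def node_cost(n_i: int, a_i: int, v: int, c: int, k: CostConstants) -> float:
    """t_node(i) = t_single(i) + t_freq(i) + t_info(i) + t_div(i)."""
    t_freq = k.k1 * a_i * n_i      # class-frequency counting per attribute
    t_info = k.k2 * c * a_i * v    # information gain from the frequencies
    t_div = k.k3 * a_i             # partitioning, as given in the text
    t_single = k.k4 * n_i          # single-class check over the node's cases
    return t_single + t_freq + t_info + t_div


def tree_cost(nodes: Iterable[Tuple[int, int]], v: int, c: int,
              k: CostConstants) -> float:
    """T = sum of t_node(i) over all nodes, given (N_i, A_i) per node."""
    return sum(node_cost(n_i, a_i, v, c, k) for n_i, a_i in nodes)


# Hypothetical tree: a root with 1000 cases and 4 attributes, two children.
EXAMPLE_NODES = [(1000, 4), (600, 3), (400, 3)]
EXAMPLE_TOTAL = tree_cost(EXAMPLE_NODES, v=3, c=2, k=CostConstants())
```

Because the list of (N_i, A_i) pairs is exactly the tree shape that is unknown in advance, the sketch also shows why T cannot be evaluated directly and has to be approximated with average values.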
The following experimental results assess how performance parameters such as training set size, number of attributes, continuous attributes and number of class values affect the C4.5 classifier:
3.3.2.1. Execution time as a function of training set size
The experiments were carried out on several data sets of different sizes, with different numbers of attributes and different class attributes. The tables and charts below show the dependence under study.
Experiment with the 2-attribute data set
Table 2 - Time to build the decision tree and the production rule set as a function of training set size, 2 attributes
Training set size (cases)   29000   60000   66000   131000   262000
Decision Tree (s)            0.15    0.46    0.47     1.17      2.2
Production Rules (s)         3.21    6.82    8.85    20.51    37.94
[Chart: build time (s), 0 to 40, against training set size (cases), 29000 to 262000; series: Decision Tree, Production Rules, trend line of Production Rules]
Chart 2 - Time to build the decision tree and the production rule set as a function of training set size, 2 attributes
Experiment with the 7-attribute data set
Table 3 - Time to build the decision tree and the production rule set as a function of training set size, 7 attributes
Training set size (cases)   1000   10000   15000   20000    25000    30000    36000
Decision Tree (s)           0.03    0.46    1.90    2.79     5.70     8.31    13.34
Production Rules (s)        0.13   107.1   276.2   709.9   1211.0   2504.8   5999.5
[Chart: vertical axis 0 to 10000, against training set size, 1000 to 36000; series: Decision Tree, Production Rules, trend line of Production Rules]
Chart 3 - Time to build the decision tree and the production rule set as a function of training set size, 7 attributes
Experiment with the 18-attribute data set
Table 4 - Time to build the decision tree and the production rule set as a function of training set size, 18 attributes
Training set size (cases)   4000    6000    8500     10000    12000    15000    17500     20000     25000
Decision Tree (s)           0.45    0.64    1.32      1.77     2.37      1.8     2.68      2.98      5.24
Production Rules (s)        43.6   90.77   304.07   531.34   838.88   968.24   1584.63   2927.56   4617.23
[Chart: build time (s), 0 to 5000, against training set size (cases), 4000 to 25000; series: Decision Tree, Production Rules, trend line of Production Rules]
Chart 4 - Time to build the decision tree and the production rule set as a function of training set size, 18 attributes
The dependence of execution time on training set size was assessed on data sets with different numbers of attributes. The following conclusions can be drawn:
The larger the data set, the longer it takes to generate both the decision tree and the production rule set. Based on the trend lines added to the charts for the rule generation time, we predict that this dependence is described by a polynomial function.
The charts show that generating production rules from an existing decision tree consumes many times more computing resources than generating the tree itself. The experiments show that for data sets of the order of hundreds of thousands of records, rule generation takes quite long (usually > 5 hours). This is also one of the reasons why C4.5 cannot be applied to large data sets. The more attributes the training data set has, the larger the gap in execution time between the two processes.
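One way to examine the polynomial dependence suggested above is to fit a straight line to the measurements on a log-log scale: if time grows roughly like size raised to some power p, the slope of that line estimates p. The Python sketch below does this with an ordinary least-squares fit on the Table 3 measurements (7 attributes). This power-law fit is a technique introduced here for illustration; it is not necessarily how the trend lines in the charts were produced.

```python
import math
from typing import Sequence


def loglog_slope(sizes: Sequence[float], times: Sequence[float]) -> float:
    """Least-squares slope of log(time) against log(size).

    If time ~ a * size**p, the returned slope is an estimate of p.
    """
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Measurements from Table 3 (7-attribute data set), times in seconds.
SIZES_7 = [1000, 10000, 15000, 20000, 25000, 30000, 36000]
TREE_7 = [0.03, 0.46, 1.90, 2.79, 5.70, 8.31, 13.34]
RULES_7 = [0.13, 107.1, 276.2, 709.9, 1211.0, 2504.8, 5999.5]

tree_exponent = loglog_slope(SIZES_7, TREE_7)
rules_exponent = loglog_slope(SIZES_7, RULES_7)
print("tree exponent: %.2f, rules exponent: %.2f"
      % (tree_exponent, rules_exponent))
```

On this data the fitted exponent for rule generation comes out clearly larger than the one for tree construction, which is consistent with the observation that rule generation is the far more expensive of the two processes.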
3.3.2.2. C4.5 performance as a function of the number of attributes
To assess this dependence, experiments were run on 3 data sets with 2, 4 and 8 discrete attributes and the same class attribute.
Table 5 - Decision tree generation time (s) as a function of the number of attributes
Training set size (cases)   3000   6000   16000   23000   32000   40500   55500   65500    96600   131000
2 attributes                0.01   0.02    0.05     0.1    0.18    0.25    0.39    0.47     0.89     1.17
4 attributes                0.12   0.18    0.82    2.18    3.32    5.58   11.83   16.79    33.49    71.52
8 attributes                0.14    0.3    3.56    9.99   23.40   33.36   47.62      80   106.61      185
[Chart: generation time (s), 0 to 200, against training set size (cases), 3000 to 131000; series: 2 attributes, 4 attributes, 8 attributes]
Chart 5 - Dependence of decision tree generation time on the number of attributes
The time C4.5 takes to build the decision tree depends on the number of attributes through the components $t_{freq}$, $t_{info}$ and $t_{div}$. The more attributes there are, the longer it takes to choose the best attribute to test at each node, so the tree generation time increases. C4.5 is therefore limited in the number of attributes the training data set can have [2]. This is one point of difference from SPRINT.
3.3.2.3. C4.5 performance with continuous attributes
Table 6 - Time (s) to build the decision tree with discrete attributes and with continuous attributes
Training set size (cases)        3000   6000   16000   22000   31000   40000   55000   65000   96000   131000
3 discrete + 1 continuous        0.12   0.18    0.92    2.18    3.32    5.74   11.83   16.79   33.47    61.52
4 continuous attributes          0.24   0.66    3.02    5.01   11.56   16.99   30.37   38.16   70.38   125.21
[Chart: build time (s), 0 to 140, against training set size (cases), 3000 to 131000; series: 3 categorical attributes + 1 continuous attribute, 4 continuous attributes]
Biu 6 - So sn