Phân lớp dữ liệu


  • 7/31/2019 Phn lp d liu


    VIETNAM NATIONAL UNIVERSITY, HANOI

    COLLEGE OF TECHNOLOGY

    Nguyen Thi Thuy Linh

    A STUDY OF DECISION-TREE-BASED DATA CLASSIFICATION ALGORITHMS

    UNDERGRADUATE THESIS (FULL-TIME PROGRAM)

    Major: Information Technology

    HANOI - 2005

    Supervisor: Dr. Nguyen Hai Chau


    ABSTRACT

    Data classification is one of the main research directions in data mining. The technology has had, has, and will continue to have many applications in commerce, banking, health care, education, and other fields. Among the classification models proposed so far, the decision tree is regarded as a powerful, popular tool that is especially well suited to data mining applications. The classification algorithm is the central element of a classification model.

    This thesis studies data classification based on decision trees. It then concentrates on analyzing, evaluating, and comparing two algorithms that are representative of two different application ranges: C4.5 and SPRINT. Each has its own strategy for selecting the attribute on which to grow the tree, for storing and partitioning data, and for several other aspects. As a result, C4.5 is the most popular algorithm for classifying small and medium-sized data sets, while SPRINT is the representative algorithm for extremely large data sets. The thesis ran the C4.5 classification model on a real data set, obtained a number of classification results of high practical value, and evaluated the performance of the C4.5 model. On the basis of the theoretical study and the experiments, the thesis proposes several improvements to the C4.5 classification model and takes first steps toward implementing SPRINT.


    ACKNOWLEDGMENTS

    Throughout my studies and the completion of this thesis I have been fortunate to receive guidance from my teachers and care and encouragement from my family and friends.

    I would like to express my sincere gratitude to the teachers of the College of Technology, who passed on to me invaluable knowledge as well as ways of studying and doing scientific research.

    I would like to send my deepest thanks to Dr. Nguyen Hai Chau, who advised and supervised me with great dedication throughout the work on this thesis.

    With all my heart, I am deeply grateful to Dr. Ha Quang Thuy, who provided favorable conditions and gave me research directions. I thank PhD student Doan Son (JAIST) for providing materials and valuable advice. I also thank the teachers of the Department of Information Systems, Faculty of Information Technology, who helped me obtain a convenient experimental environment.

    I sincerely thank my friends in the seminar group "Data Mining and Parallel Computing" for their contributions and for the valuable knowledge I gained while taking part in scientific research.

    Finally, I thank my family, my friends, and class K46CA, who have always stood by me with encouragement and support.

    Hanoi, June 2005

    Student

    Nguyen Thi Thuy Linh


    TABLE OF CONTENTS

    ABSTRACT
    ACKNOWLEDGMENTS
    TABLE OF CONTENTS
    LIST OF FIGURES, TABLES AND CHARTS
    GLOSSARY
    INTRODUCTION

    Chapter 1. OVERVIEW OF DECISION-TREE-BASED DATA CLASSIFICATION
    1.1. Overview of data classification in data mining
        1.1.1. Data classification
        1.1.2. Issues related to data classification
        1.1.3. Methods for evaluating the accuracy of a classification model
    1.2. Decision trees applied to data classification
        1.2.1. Definition
        1.2.2. Issues in data mining with decision trees
        1.2.3. Assessment of decision trees in data mining
        1.2.4. Building a decision tree
    1.3. Decision tree construction algorithms
        1.3.1. General idea
        1.3.2. Current state of research on the algorithms
        1.3.3. Parallelizing sequential decision-tree classification algorithms

    Chapter 2. C4.5 AND SPRINT
    2.1. General introduction
    2.2. The C4.5 algorithm
        2.2.1. C4.5 uses Gain-entropy as the measure for selecting the "best" attribute
        2.2.2. C4.5 has its own mechanism for handling missing values
        2.2.3. Avoiding overfitting
        2.2.4. Converting a decision tree into rules
        2.2.5. C4.5 is an efficient algorithm for small and medium-sized data sets
    2.3. The SPRINT algorithm
        2.3.1. Data structures in SPRINT
        2.3.2. SPRINT uses the Gini index as the measure for finding the "best" split point
        2.3.3. Performing the split
        2.3.4. SPRINT is efficient on very large data sets compared with other algorithms
    2.4. Comparison of C4.5 and SPRINT

    Chapter 3. EXPERIMENTAL RESULTS
    3.1. Experimental environment
    3.2. Structure of the C4.5 release 8 classification model
        3.2.1. The four main programs of the C4.5 classification model
        3.2.2. Data structures used in C4.5
    3.3. Experimental results
        3.3.1. Some representative classification results
        3.3.2. Performance charts
    3.4. Proposed improvements to the C4.5 classification model

    CONCLUSION
    REFERENCES


    LIST OF FIGURES, TABLES AND CHARTS

    Figure 1 - The data classification process - (a) Building the classification model
    Figure 2 - The data classification process - (b1) Estimating the accuracy of the model
    Figure 3 - The data classification process - (b2) Classifying new data
    Figure 4 - Estimating the accuracy of a classification model with the holdout method
    Figure 5 - Example of a decision tree
    Figure 6 - Pseudocode of the decision-tree-based data classification algorithm
    Figure 7 - Decision tree construction scheme, synchronous method
    Figure 8 - Decision tree construction scheme, partitioned method
    Figure 9 - Decision tree construction scheme, hybrid method
    Figure 10 - Pseudocode of the C4.5 algorithm
    Figure 11 - Pseudocode of the SPRINT algorithm
    Figure 12 - Data structures in SLIQ
    Figure 13 - Structure of the attribute lists in SPRINT; the list of a continuous attribute is sorted as soon as it is created
    Figure 14 - Evaluating split points for a continuous attribute
    Figure 15 - Evaluating split points for a categorical attribute
    Figure 16 - Splitting the attribute lists of a node
    Figure 17 - Structure of the hash table used to split data in SPRINT (following the example of the preceding figures)
    Figure 18 - File defining the data structure used in the experiments
    Figure 19 - File containing the data to be classified
    Figure 20 - Form of the decision tree generated from the experimental data set
    Figure 21 - Evaluation of the generated decision tree on the training set and the test set
    Figure 22 - Some rules extracted from the 19-attribute data set, classifying by the user's interface setting type (WEB_SETTING_ID)
    Figure 23 - Some rules extracted from the 8-attribute data set, classifying by phone manufacturer ID (PRODUCTER_ID)
    Figure 24 - Some rules generated from the 8-attribute data set, classifying by the mobile service the customer uses (MOBILE_SERVICE_ID)
    Figure 25 - Evaluation of the rule set on the training data set

    Table 1 - Training data table with class attribute buys_computer
    Table 2 - Time to build the decision tree and the production rule set as a function of training set size, 2 attributes
    Table 3 - Time to build the decision tree and the production rule set as a function of training set size, 7 attributes
    Table 4 - Time to build the decision tree and the production rule set as a function of training set size, 18 attributes
    Table 5 - Decision tree generation time as a function of the number of attributes
    Table 6 - Decision tree construction time with categorical attributes and with continuous attributes
    Table 7 - Decision tree generation time as a function of the number of class values

    Chart 1 - Execution time of the SPRINT and SLIQ classification models compared by training set size
    Chart 2 - Time to build the decision tree and the production rule set as a function of training set size, 2 attributes
    Chart 3 - Time to build the decision tree and the production rule set as a function of training set size, 7 attributes
    Chart 4 - Time to build the decision tree and the production rule set as a function of training set size, 18 attributes
    Chart 5 - Dependence of decision tree generation time on the number of attributes
    Chart 6 - Decision tree construction time from a set of continuous attributes compared with a set of categorical attributes
    Chart 7 - Decision tree generation time as a function of the number of class values


    GLOSSARY

    No.  English                 Vietnamese
    1    Training data           Dữ liệu đào tạo
    2    Test data               Dữ liệu kiểm tra
    3    Pruning decision tree   Cắt, tỉa cây quyết định
    4    Overfitting data        Quá vừa dữ liệu
    5    Noise                   Dữ liệu lỗi
    6    Missing value           Giá trị thiếu
    7    Data tuple              Phần tử dữ liệu
    8    Case                    Case (understood as a data tuple: one set of values of the attributes in the data set)

    A Study of Decision-Tree-Based Data Classification Algorithms

    Undergraduate thesis - Nguyen Thi Thuy Linh - K46CA

    INTRODUCTION

    In the course of their activities, people generate large amounts of business data. The accumulated data sets keep growing and may contain much hidden information in the form of regularities that have not yet been discovered. This creates the need to extract from such data sets rules for classifying data or for predicting future data trends. The resulting business rules effectively support both practical activities and scientific research. Data classification and prediction technology emerged to meet this need.

    Data classification technology has developed, is developing, and will continue to develop strongly in response to the human desire for knowledge. In recent years, data classification has attracted researchers from many fields, such as machine learning, expert systems, and statistics. It is also applied in many practical domains, such as commerce, banking, marketing, market research, insurance, health care, and education.

    Many classification techniques have been proposed, including decision tree classification, Bayesian classifiers, K-nearest neighbor classifiers, neural networks, and statistical analysis. Among them, the decision tree is regarded as a powerful, popular tool that is especially well suited to data mining [5][7]. In any classification model the classification algorithm is the dominant factor. It is therefore necessary to build algorithms that are highly accurate and fast, and that scale so they can work with ever larger data sets.

    This thesis surveys data classification technology in general and decision-tree-based classification in particular. It then concentrates on two algorithms that are representative of two different application ranges: C4.5 and SPRINT. Analyzing and evaluating these algorithms has both scientific value and practical significance. Studying them helps us absorb, and possibly develop further, the ideas and techniques of an advanced technology that remains a challenge for researchers in data mining. From there, classification models can be implemented and tested in practice, with a view to applying them in Vietnam, first of all in customer market analysis and research.


    The thesis also ran the C4.5 classification model on a real data set from the Vietnam Posts and Telecommunications Corporation. This provided experience with the techniques for deploying and applying a data classification model in practice. The experiments produced promising classification results with high reliability and considerable application potential. The performance of the classification model was evaluated as well. On that basis, the thesis proposes improvements that aim to increase the performance of the C4.5 classification model and to add conveniences for the user.

    The thesis consists of three main chapters:

    Chapter 1 moves from an overview of data classification technology to decision-tree-based classification. It also presents assessments of the decision tree as a tool, and gives an overview of research on decision-tree-based classification algorithms: the underlying ideas, the state of research, and current directions of development.

    Chapter 2 concentrates on two algorithms that are representative of two different application ranges, C4.5 and SPRINT. Each has its own strategy for choosing the data splitting criterion and for storing and partitioning data. Because of these characteristics, C4.5 is the most popular representative algorithm for small and medium-sized data sets, while SPRINT is the choice for extremely large data sets.

    Chapter 3 describes the experiments with the C4.5 classification model on the real data set from the Vietnam Posts and Telecommunications Corporation and presents the experimental results. From these, the thesis proposes improvements to the C4.5 classification model.


    Chapter 1. OVERVIEW OF DECISION-TREE-BASED DATA CLASSIFICATION

    1.1. Overview of data classification in data mining

    1.1.1. Data classification

    Data classification is today one of the main research directions in data mining. In practice there is a need to extract intelligent business decisions from databases that contain much hidden information. Classification and prediction are two forms of data analysis that extract a model describing important data classes or predicting future data trends. Classification predicts categorical labels, that is, discrete values: it works with data objects whose set of possible values is known in advance. Prediction, in contrast, builds models of continuous-valued functions. For example, a weather classification model can tell whether tomorrow will be rainy or sunny based on the humidity, wind strength, temperature, and other parameters of today and the preceding days. Likewise, from rules about customers' purchasing trends in a supermarket, sales staff can make sound decisions about the quantity and variety of goods to display. A prediction model can estimate how much potential customers will spend based on their income and occupation. In recent years, data classification has attracted researchers from many fields, such as machine learning, expert systems, and statistics. It is also applied in many domains, such as commerce, banking, marketing, market research, insurance, health care, and education. Most earlier algorithms assume memory-resident data and typically work with small amounts of data. Some later algorithms use disk-resident techniques, which considerably improve scalability to large data sets of up to billions of records.

    The data classification process consists of two steps [14]:

    Step one (learning)

    The learning process builds a model that describes a predetermined set of data classes or concepts. Its input is a structured data set described by attributes and made up of tuples of values of those attributes. Each such tuple is called a data tuple, and may also be referred to as a sample, example, object, record, or case. This thesis uses these terms interchangeably. Each data tuple in the data set is assumed to belong to a predefined class, where the class is the value of one attribute chosen as the class label attribute. The output of this step is usually a set of classification rules in the form of if-then rules, a decision tree, logic formulas, or a neural network. The process is illustrated in Figure 1.

    Figure 1 - The data classification process - (a) Building the classification model

        Training data:

        Age   Car Type   Risk
        20    Combi      High
        18    Sports     High
        40    Sports     High
        50    Family     Low
        35    Minivan    Low
        30    Combi      High
        32    Family     Low
        40    Combi      Low

        Training data -> Classification algorithm -> Classifier (model):

        if age < 31 or Car Type = Sports then Risk = High

    Step two (classification)

    The second step uses the model built in the first step to classify new data. First, the predictive accuracy of the newly built model is estimated. Holdout is a simple technique for doing so. It uses a test data set of samples that already carry class labels. These samples are chosen at random and independently of the samples in the training data set. The accuracy of the model on the test data set is the percentage of test samples that the model classifies correctly (compared with their actual labels). If accuracy were estimated on the training data set instead, the estimate would be overly optimistic, because the model always tends to overfit the data. Overfitting is the phenomenon in which classification results match the training data too closely, because the model built from the training set may have incorporated particular characteristics of that set. An independent test data set is therefore required. If the accuracy of the model is acceptable, the model is used to classify future data, or data for which the value of the class attribute is unknown.
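    The holdout estimate described above can be sketched in a few lines of Python. This is an illustration only: it hard-codes the rule shown in Figure 1 and the four labelled test records shown in Figure 2, and it assumes that every record not covered by the rule is classified as Low (the figure states only the High rule). The function names are hypothetical.

```python
# Classifier from Figure 1 (a): if age < 31 or Car Type = Sports then Risk = High.
# Assumption: anything the rule does not cover is treated as Low.
def classify(age, car_type):
    return "High" if age < 31 or car_type == "Sports" else "Low"

# Labelled test records from Figure 2 (b1): (age, car type, actual risk).
test_data = [
    (27, "Sports", "High"),
    (34, "Family", "Low"),
    (66, "Family", "High"),
    (44, "Sports", "High"),
]

def holdout_accuracy(records):
    """Fraction of test records whose predicted label equals the actual label."""
    correct = sum(1 for age, car, actual in records if classify(age, car) == actual)
    return correct / len(records)

if __name__ == "__main__":
    print(f"accuracy = {holdout_accuracy(test_data):.0%}")
```

    On these four records the rule gets three right and misclassifies the 66-year-old Family record, which matches the classifier output column of Figure 2.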

    Figure 2 - The data classification process - (b1) Estimating the accuracy of the model

        Test data                      Classifier (model) output
        Age   Car Type   Risk          Risk
        27    Sports     High          High
        34    Family     Low           Low
        66    Family     High          Low
        44    Sports     High          High

    Figure 3 - The data classification process - (b2) Classifying new data

        New data                       Classifier (model) output
        Age   Car Type   Risk          Risk
        27    Sports                   High
        34    Minivan                  Low
        55    Family                   Low
        34    Sports                   High

    In a classification model, the classification algorithm plays the central role and determines the success of the model. The key to the data classification problem is therefore to find an algorithm that is fast, efficient, highly accurate, and scalable. Of these properties, scalability has received particular attention and development [14].

    The classification techniques used in recent years include:

    - Decision tree classification
    - Bayesian classifiers
    - K-nearest neighbor classifiers
    - Neural networks
    - Statistical analysis
    - Genetic algorithms
    - The rough set approach

    1.1.2. Issues related to data classification

    1.1.2.1. Preparing data for classification

    Preprocessing the data is an indispensable part of the classification process and largely determines whether a classification model can be applied at all. Preprocessing helps improve the accuracy, efficiency, and scalability of the classification model.

    Data preprocessing consists of the following tasks:

    Data cleaning

    Data cleaning deals with noise and missing values in the original data set. Noise consists of random errors or invalid values of variables in the data set; smoothing techniques can be used to handle it. Missing values are cells in which an attribute has no value. A value may be missing because of a human error during data entry, or because in that particular case the attribute has no value or is unimportant. A missing value can be replaced with the most common value of the attribute, or with the most probable value according to statistics. Although most classification algorithms have their own mechanisms for handling missing values and noise, this preprocessing step can reduce confusion during learning (building the classification model).

    Relevance analysis

    Many attributes in a data set may be entirely unnecessary for, or unrelated to, a particular classification problem. For example, the day of the week is irrelevant to an application that analyzes the risk of bank loans, so that attribute is redundant. Relevance analysis aims to remove unnecessary and redundant attributes from the learning process, because such attributes slow learning down, complicate it, and can mislead it, resulting in an unusable classification model.

    Data transformation

    Generalizing data to a higher conceptual level is sometimes necessary during preprocessing. This is especially useful for continuous (numeric) attributes. For example, the numeric values of a customer's income attribute can be generalized into discrete ranges: low, medium, high. Similarly, a categorical attribute such as street address can be generalized to city. Generalization compresses the original training data, so the input/output operations involved in learning are reduced.
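    Two of the preprocessing steps above, replacing a missing value with the attribute's most common value and generalizing a numeric income into discrete ranges, can be sketched as follows. This is a minimal illustration: the function names, the use of None to mark a missing value, and the income cut points are assumptions and do not come from the thesis.

```python
from collections import Counter

def fill_missing_with_mode(values):
    """Replace None entries with the most common known value of the attribute."""
    known = [v for v in values if v is not None]
    mode = Counter(known).most_common(1)[0][0]
    return [mode if v is None else v for v in values]

def discretize_income(income, low_cut=1000, high_cut=3000):
    """Generalize a numeric income to low / medium / high.

    The cut points are hypothetical; in practice they depend on the data.
    """
    if income < low_cut:
        return "low"
    if income < high_cut:
        return "medium"
    return "high"

if __name__ == "__main__":
    print(fill_missing_with_mode(["Sports", "Family", None, "Family"]))
    print([discretize_income(x) for x in (500, 2000, 5000)])
```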

    1.1.2.2. Comparing classification models

    Each application requires a suitable classification model. The choice is made by comparing classification models against the following criteria:

    - Predictive accuracy: the ability of the model to correctly predict the class label of new or previously unseen data.
    - Speed: the computational cost of building and using the model.
    - Robustness: the ability of the model to make correct predictions from noisy data or data with missing values.
    - Scalability: the ability to build and run the model efficiently on large amounts of data.
    - Interpretability: how well the results produced by the learned model can be understood.
    - Simplicity: the size of the decision tree or the compactness of the rules.

    Of these criteria, the scalability of the classification model has been emphasized and developed most, especially for decision trees [14].

    1.1.3. Methods for evaluating the accuracy of a classification model

    Estimating the accuracy of a classifier is important because it indicates how accurately future data will be classified. Accuracy also makes it possible to compare different classification models. This thesis covers two common evaluation methods, holdout and k-fold cross-validation. Both are based on random partitions of the original data set.

    In the holdout method, the given data is randomly divided into two parts: a training data set and a test data set. Typically 2/3 of the data is assigned to the training set and the rest to the test set [14].

    Figure 4 - Estimating the accuracy of a classification model with the holdout method

        Data -> Training set -> Derive classifier -> Estimate accuracy
        Data -> Test set ---------------------------> Estimate accuracy

    In k-fold cross-validation, the original data set is randomly divided into k subsets (folds) of approximately equal size, S1, S2, ..., Sk. Training and testing are performed k times. In iteration i, Si is the test data set and the remaining subsets together form the training data set. That is, training is first performed on S2, S3, ..., Sk and testing on S1; then training is performed on S1, S3, S4, ..., Sk and testing on S2; and so on. The accuracy is the total number of correct classifications over the k iterations divided by the total number of samples in the original data set.
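    The k-fold procedure just described can be sketched as below. The code is only an illustration under stated assumptions: the function names are hypothetical, the learner is passed in as a pair of callables (train, predict), and the majority-class learner is a stand-in used to keep the example self-contained, not one of the algorithms studied in the thesis.

```python
import random
from collections import Counter

def k_fold_indices(n_samples, k, seed=0):
    """Randomly partition range(n_samples) into k folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def cross_validate(samples, labels, k, train, predict, seed=0):
    """Accuracy = correct classifications over all k iterations / total samples."""
    folds = k_fold_indices(len(samples), k, seed)
    correct = 0
    for i, test_idx in enumerate(folds):
        # Fold i is the test set; all other folds together form the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = train([samples[j] for j in train_idx],
                      [labels[j] for j in train_idx])
        correct += sum(predict(model, samples[j]) == labels[j] for j in test_idx)
    return correct / len(samples)

# Hypothetical stand-in learner: always predicts the majority training label.
def train_majority(samples, labels):
    return Counter(labels).most_common(1)[0][0]

def predict_majority(model, sample):
    return model

if __name__ == "__main__":
    ages = [20, 18, 40, 50, 35, 30, 32, 40]
    risk = ["High", "High", "High", "Low", "Low", "High", "Low", "Low"]
    print(cross_validate(ages, risk, 4, train_majority, predict_majority))
```

    Because every sample lands in exactly one test fold, each sample is counted once in the final ratio, which is what the definition of accuracy above requires.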


    1.2. Decision trees applied to data classification

    1.2.1. Definition

    In recent years, researchers from many fields have proposed a variety of data classification models, such as neural networks, linear and quadratic statistical models, decision trees, and genetic models. Among these, the decision tree is rated as a powerful, popular tool that is especially well suited to data mining in general and data classification in particular [7]. Its advantages include relatively fast construction and a simple, easily understood form. Moreover, trees can easily be converted into SQL statements that can be used to access databases efficiently. Finally, decision tree classification achieves accuracy similar to, and sometimes better than, other classification methods [10].

    A decision tree is a flow chart with a tree structure, as illustrated in the following figure:

    Figure 5 - Example of a decision tree

                        Age
              <= 27.5 /     \ > 27.5
           Risk = High       Car type
               in {sport} /          \ in {family, truck}
              Risk = High             Risk = Low

    In a decision tree:

    - Root: the topmost node of the tree.
    - Internal node: represents a test on a single attribute (rectangle).
    - Branch: represents an outcome of the test at an internal node (arrow).
    - Leaf node: represents a class or a class distribution (circle).

    To classify an unknown data sample, the attribute values of the sample are tested against the decision tree. Each sample follows one path from the root to a leaf, and the leaf gives the predicted class value for that sample.
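    The root-to-leaf walk can be sketched as follows, using the tree of Figure 5 as the example. This is an illustrative sketch, not code from the thesis: the Node and Leaf classes and the record field names are hypothetical, the boundary age <= 27.5 is an assumption because the comparison symbol is not legible in the source figure, and any car type other than "sport" is sent down the {family, truck} branch.

```python
class Leaf:
    """Leaf node: holds the predicted class."""
    def __init__(self, label):
        self.label = label

class Node:
    """Internal node: a test on a single attribute plus one subtree per outcome."""
    def __init__(self, test, branches):
        self.test = test          # function: record -> branch key
        self.branches = branches  # dict: branch key -> Node or Leaf

def classify(tree, record):
    """Follow the single path from the root to a leaf and return its label."""
    while isinstance(tree, Node):
        tree = tree.branches[tree.test(record)]
    return tree.label

# Tree of Figure 5 (boundary and catch-all branch are assumptions, see above).
figure5 = Node(
    lambda r: r["age"] <= 27.5,
    {
        True: Leaf("High"),
        False: Node(
            lambda r: r["car_type"] == "sport",
            {True: Leaf("High"), False: Leaf("Low")},
        ),
    },
)

if __name__ == "__main__":
    print(classify(figure5, {"age": 40, "car_type": "truck"}))
```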

    1.2.2. Issues in data mining with decision trees

    The issues specific to learning or classifying data with decision trees include: determining how deep to grow the tree, handling continuous attributes, choosing an appropriate attribute selection measure, using training data with missing attribute values, using attributes with different costs, and improving computational performance. The thesis next discusses the main issues that decision-tree-based classification algorithms have addressed.

    1.2.2.1. Avoiding overfitting

    What is overfitting? It is the phenomenon in which the decision tree captures characteristics that are particular to the training data set. If the training data itself is used to test the model, accuracy is very high, whereas the same tree does not reach that accuracy on other, future data.

    Overfitting is a significant difficulty for decision tree learning and for other learning methods, especially when the training data set contains too few examples or contains noise.

    There are two approaches to avoiding overfitting in decision trees:

    - Stop growing the tree earlier than usual, before it reaches the point where it classifies the training data perfectly. The challenge with this approach is to estimate correctly when to stop.
    - Allow the tree to overfit the data, then prune it.

    Although the first approach seems more direct, trees produced by the second approach have been shown experimentally to be more successful in practice. Pruning also helps the tree generalize and improves the accuracy of the classification model. Whichever approach is used, the key question is which criterion determines the appropriate size of the final tree.


    1.2.2.2. Handling continuous attributes

    Handling continuous attributes in a decision tree is far less simple than handling
    discrete attributes.

    A discrete attribute has a domain that is known in advance and consists of a set of
    discrete values. For example, vehicle type is a discrete attribute with the domain
    {truck, coach, car, taxi}. Splitting the data relies on testing whether the value of
    the chosen discrete attribute in a given example belongs to a subset of that domain:
    $\mathrm{value}(A) \in X$ with $X \subset \mathrm{domain}(A)$. This is a simple logical
    test that costs little computation. For a continuous (numeric) attribute, by contrast,
    the domain is not known in advance. During tree growth a binary test of the form
    $\mathrm{value}(A) \le \theta$ is therefore used, where $\theta$ is a threshold constant
    determined in turn from each distinct value, or each pair of adjacent values (in sorted
    order), of the continuous attribute in the training set. This means that if a continuous
    attribute A has d distinct values in the training set, d-1 tests
    $\mathrm{value}(A) \le \theta_i$ for i = 1..d-1 are needed to find the best threshold
    $\theta_{best}$ for that attribute. How $\theta$ is determined, and the criterion for
    finding the best $\theta$, depend on the strategy of each algorithm [13][1]. In C4.5,
    $\theta_i$ is chosen as the midpoint of two adjacent values in the sorted sequence of
    values.
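    As a minimal sketch of the midpoint rule described above (the function name
    `candidate_thresholds` and the sample values are illustrative assumptions, not part of
    C4.5's source), the d-1 candidate thresholds can be enumerated like this:

```python
def candidate_thresholds(values):
    """Return the d-1 midpoints between adjacent distinct values, in sorted order."""
    distinct = sorted(set(values))            # the d distinct values of the attribute
    return [(lo + hi) / 2 for lo, hi in zip(distinct, distinct[1:])]


# Hypothetical numeric attribute values; the duplicate 23 is counted only once.
ages = [23, 17, 43, 68, 32, 20, 23]
print(candidate_thresholds(ages))
```

    Each returned value is one $\theta_i$; a splitting criterion is then evaluated for
    every test $\mathrm{value}(A) \le \theta_i$ to pick the best one.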

    In addition, several issues related to rule-set generation and the handling of missing
    values are presented in detail in the section on the C4.5 algorithm.

    1.2.3. Evaluating decision trees in the field of data mining

    1.2.3.1. Strengths of decision trees

    Decision trees have five main strengths [5]:

    Ability to generate understandable rules

    Decision trees can generate rules that translate directly into English or into SQL
    statements. This is the outstanding advantage of the technique. Even when a large data
    set makes the tree large and complex, following any single path through the tree remains
    easy and explicit. The explanation for any particular classification or prediction is
    therefore relatively transparent.


    Ability to perform in rule-oriented domains

    This may sound obvious, but rule induction in general, and decision trees in particular,
    are an excellent choice in domains where there really are rules. Many domains, from
    genetics to industrial processes, do contain underlying rules that are hidden and
    unclear because they are quite complex and obscured by noisy data. Decision trees are a
    natural choice when we suspect that such underlying rules exist.

    Ease of computation at classification time

    Although decision trees can take many forms, in practice the algorithms used to build
    them usually produce trees with a low branching factor and simple tests at each node.
    Typical tests are numeric comparisons, set membership checks and simple conjunctions.
    When executed on a computer, these tests translate into boolean and integer operations,
    which are fast and cheap. This is an important advantage because, in commercial
    environments, predictive models are often used to classify millions or even billions of
    records.

    Ability to handle both continuous and discrete attributes

    Decision trees handle continuous and discrete attributes equally well, although
    continuous attributes require more computation. Discrete attributes, which have caused
    problems for neural networks and statistical techniques, are easy to handle with
    splitting criteria in a decision tree: each branch corresponds to one part of the data
    set, partitioned by the value of the attribute chosen for expansion at that node.
    Continuous attributes are also easy to split, by choosing a number called the threshold
    from the sorted values of the attribute. Once the best threshold has been chosen, the
    data set is partitioned by the binary test on that threshold.

    Clear indication of the best attributes

    Decision tree construction algorithms select the attribute that best splits the training
    set, starting at the root of the tree. This shows which attributes are most important
    for prediction or classification.


    1.2.3.2. Weaknesses of decision trees

    Despite these strengths, decision trees inevitably have weaknesses. They are less
    suitable for problems whose goal is to predict the value of a continuous attribute such
    as income, blood pressure or bank interest rates. Decision trees also struggle with
    time-series data unless considerable effort is spent on representing the data as
    sequential patterns.

    Error-prone when there are too many classes

    Some decision trees only work with binary classes of the form yes/no or accept/reject.
    Others can assign records to an arbitrary number of classes, but become error-prone when
    the number of training examples per class is small. This happens all the more quickly in
    trees with many levels or many branches per node.

    Computationally expensive to train

    This may seem to contradict the advantage claimed above, but growing a decision tree is
    computationally expensive, because a tree has many internal nodes before the leaves are
    finally reached. At each node a measure (or splitting criterion) must be computed for
    every attribute, and for a continuous attribute the data set must additionally be sorted
    by the values of that attribute. Only then can the attribute to expand, and the
    corresponding best split, be chosen. Some algorithms use weighted combinations of
    attributes to grow the tree. Pruning is also expensive, because many candidate subtrees
    must be generated and compared.

    1.2.4. Building a decision tree

    Building a decision tree consists of two stages:

    Stage one grows the decision tree:

    This stage starts at the root and expands branch by branch, growing the tree inductively
    in a divide-and-conquer manner until every leaf of the tree is labelled with a class.

    Stage two prunes branches of the decision tree.

    This stage aims to simplify and generalize the tree, and thereby increase its accuracy,
    by removing its dependence on the noise in the training data, which is statistical in
    nature, or on variations that may be peculiar to the training data. This stage accesses
    only the tree grown in the previous stage, and experiments show that it costs little
    computation: for most algorithms it takes less than 1% of the total time needed to build
    the classification model [7][1].

    We therefore concentrate here on the tree-growing stage. Its framework is as follows:

    1) Choose the best attribute using a predefined measure

    2) Grow the tree by adding branches corresponding to each value of the chosen attribute

    3) Sort and partition the training data among the child nodes

    4) If the examples are clearly classified, stop.

    Otherwise, repeat steps 1 to 4 for each child node

    1.3. Decision tree construction algorithms

    1.3.1. General idea

    Most decision-tree-based classification algorithms share the following pseudocode:

    Figure 6 - Pseudocode of a decision-tree-based classification algorithm

    MakeTree(Training Data T)
    {
        Partition(T)
    }

    Partition(Data S)
    {
        if (all points in S are in the same class) then
            return
        for each attribute A do
            evaluate splits on attribute A;
        use best split found to partition S into S1, S2, ..., Sk
        Partition(S1)
        Partition(S2)
        ...
        Partition(Sk)
    }


    Classification algorithms such as C4.5 (Quinlan, 1993), CDP (Agrawal et al., 1993),
    SLIQ (Mehta et al., 1996) and SPRINT (Shafer et al., 1996) all use Hunt's method as
    their guiding idea. The method was devised by Hunt and his colleagues in the late 1950s
    and early 1960s.

    Inductive description of Hunt's method [1]:

    Suppose a decision tree is to be built from T, the set of training data, and the classes
    are denoted C = {C1, C2, ..., Ck}.

    Case 1: T contains cases that all belong to a single class Cj. The decision tree for T
    is a leaf identifying class Cj.

    Case 2: T contains cases that belong to several classes in C. A test is chosen on a
    single attribute with outcomes {O1, O2, ..., On}. In many applications n is chosen to be
    2, which yields a binary decision tree. T is partitioned into subsets T1, T2, ..., Tn,
    where Ti contains all cases in T that have outcome Oi for the chosen test. The decision
    tree for T consists of a node representing the chosen test and one branch for each
    possible outcome of that test. The same tree-building procedure is applied recursively
    to each subset of the training data.

    Case 3: T contains no cases. The decision tree for T is a leaf, but the class associated
    with the leaf must be determined from information other than T. For example, C4.5 uses
    the most frequent class at the parent of this node.
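    The three cases can be sketched in a few lines of Python. This is only an illustration
    under stated assumptions: the nested-dict tree format, the name `hunt`, and above all
    the placeholder rule "test the first remaining attribute" are hypothetical; real
    algorithms choose the test with a splitting criterion such as information gain or the
    Gini index.

```python
from collections import Counter


def hunt(cases, domains, parent_majority=None):
    """cases: list of (attribute dict, class label); domains: attribute -> possible values."""
    # Case 3: T is empty, so the label comes from outside T (here: the parent's majority).
    if not cases:
        return {"leaf": parent_majority}
    labels = [label for _, label in cases]
    majority = Counter(labels).most_common(1)[0][0]
    # Case 1: every case is in the same class. (Assumption: when no attribute is left to
    # test, we also stop and emit a majority-class leaf.)
    if len(set(labels)) == 1 or not domains:
        return {"leaf": majority}
    # Case 2: choose a test and recurse on each outcome's subset.
    test = next(iter(domains))                # placeholder choice, not a real criterion
    remaining = {a: d for a, d in domains.items() if a != test}
    children = {}
    for outcome in domains[test]:
        subset = [(row, label) for row, label in cases if row[test] == outcome]
        children[outcome] = hunt(subset, remaining, majority)
    return {"test": test, "children": children}


# Hypothetical three-case training set; no case has car == "truck", which triggers Case 3.
cases = [({"car": "sport"}, "high"), ({"car": "family"}, "low"), ({"car": "sport"}, "high")]
tree = hunt(cases, {"car": ["sport", "family", "truck"]})
print(tree)
```

    The empty "truck" branch receives the parent's most frequent class, mirroring the C4.5
    behaviour mentioned in Case 3.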

    1.3.2. Current state of research on the algorithms

    Decision-tree-based classification algorithms all take Hunt's method, described above,
    as their guiding idea. Every such algorithm has to answer two major questions:

    1. How is the best attribute to expand at each node determined?

    2. How is the data stored, and how is it partitioned according to the corresponding
    tests?

    Different algorithms answer these two questions differently, and this is what
    distinguishes one algorithm from another.

    There are three kinds of criteria, or indices, for determining the best attribute to
    expand at each node:


    Gini index (Breiman et al., 1984 [1]): this criterion selects the attribute that
    minimizes the impurity of each split. Algorithms that use it include CART, SLIQ and
    SPRINT.

    Information gain (Quinlan, 1993 [1]): unlike the Gini index, this criterion uses entropy
    to measure the impurity of a split and selects the attribute that maximizes the
    reduction in entropy. Algorithms that use it include ID3 and C4.5.

    $\chi^2$ contingency table statistic: $\chi^2$ measures the correlation between each
    attribute and the class label, and the attribute with the strongest correlation is
    selected. CHAID is an algorithm that uses this criterion.

    Details of how the Gini index and information gain are computed are given in the
    descriptions of C4.5 and SPRINT in Chapter 2.

    Computing these indices sometimes requires scanning all or part of the training set.
    Earlier algorithms therefore required the whole training set to be memory-resident while
    the tree was grown. This limits their scalability, because memory is finite while
    training sets keep growing, sometimes to millions or billions of records in commercial
    settings. A new way of storing and accessing the data was clearly needed, and in 1996
    SLIQ (Mehta) and SPRINT (Shafer) appeared and removed this limitation. Both algorithms
    keep the data disk-resident and pre-sort the training set once. These new features
    significantly improve performance and scalability compared with other algorithms.
    Several later algorithms build on SPRINT with additional improvements, such as PUBLIC
    (1998) [11], which integrates the building and pruning phases; ScalParC (1998), which
    improves SPRINT's data-partitioning step with a different use of hash tables; and an
    algorithm proposed by researchers at the University of Minnesota (USA) together with
    IBM, which reduces the I/O cost and the global communication cost of parallelization
    relative to SPRINT [2]. Among these algorithms SPRINT is regarded as a breakthrough that
    deserves study and further development.


    1.3.3. Parallelizing sequential decision-tree-based classification algorithms

    Parallelization is the current research trend for decision-tree-based classification
    algorithms. The need to parallelize sequential algorithms follows inevitably from
    practice, as demands on performance and accuracy keep rising and the volume of data to
    be mined grows rapidly. A classification model running on a high-performance parallel
    computing system can mine larger data sets and so increase the reliability of the
    classification rules. Sequential algorithms that require memory-resident data cannot
    meet the demands of terabyte-sized data sets with billions of records. Building
    efficient parallel algorithms on top of existing sequential ones is therefore a
    challenge for researchers.

    There are three strategies for parallelizing sequential algorithms:

    Synchronous tree construction approach

    In this approach all processors take part in building the decision tree simultaneously,
    by sending and receiving the class distribution information of their local data. Figure
    7 shows how the processors work in this approach.

    The advantage of this approach is that it requires no movement of the training data.
    However, the algorithm incurs a high communication cost and suffers from load imbalance.
    For each node in the decision tree, after the class distribution information has been
    collected, all processors must synchronize and update it. For nodes at shallow depth the
    communication cost is relatively small, because the number of training records processed
    at each node is relatively large. As the tree deepens, communication comes to dominate
    the processing time. A further problem of this approach is load imbalance, caused by the
    way the data is initially stored and partitioned across processors.


    Figure 7 - Decision tree construction with the synchronous approach

    Partitioned tree construction approach

    When a decision tree is built with the partitioned approach, different processors work
    on different parts of the tree. If more than one processor cooperates to expand a node,
    those processors are partitioned to expand the children of that node. The approach
    considers the case where a group of processors Pn cooperates to expand node n. At the
    start, all processors cooperate to expand the root of the classification tree. At the
    end, the whole tree is obtained by combining the subtrees built by each processor.
    Figure 8 shows how the processors work in this approach.

    The advantage of this approach is that once a single processor is solely responsible for
    a node, it can develop a subtree of the global tree independently, with no communication
    cost at all.

    The approach also has some drawbacks. First, it requires data to be moved after each
    node expansion until every processor holds all the data needed to develop an entire
    subtree, which makes communication expensive in the upper part of the tree. Second, load
    balance is hard to achieve. Nodes are assigned to processors according to the number of
    cases in the child nodes, but the number of cases at a node does not necessarily
    correspond to the amount of work needed to grow the subtree rooted at that node.

    Figure 8 - Decision tree construction with the partitioned approach

    Hybrid approach

    The hybrid approach exploits the advantages of both approaches above. The synchronous
    approach incurs a high communication cost as the frontier of the tree widens, whereas
    the partitioned approach incurs the cost of load balancing after every step. The hybrid
    approach therefore continues with the first approach for as long as the communication
    cost it incurs is not too large. Once this cost exceeds a given threshold, the
    processors working on the nodes of the current frontier of the tree are split into two
    partitions (assuming the number of processors is a power of 2).

    The approach uses the following criterion to trigger partitioning of the current set of
    processors:

    \[ \sum (\text{communication cost}) \ge \text{moving cost} + \text{load balancing} \]


    The operation of the hybrid approach is shown in Figure 9.

    Figure 9 - Decision tree construction with the hybrid approach


    Chapter 2. C4.5 AND SPRINT

    2.1. General introduction

    This section gives a general introduction to the history of the two algorithms C4.5 and
    SPRINT.

    C4.5 descends from a line of decision tree learning algorithms founded on the research
    of Hunt and his colleagues in the late 1950s and early 1960s (Hunt 1962). The first
    version was ID3 (Quinlan, 1979), a simple system that initially contained about 600
    lines of Pascal, followed by C4 (Quinlan 1987). In 1993 J. Ross Quinlan built on these
    results to develop C4.5, with 9000 lines of C that fit on one floppy disk. Although C4.5
    has a successor, C5.0, a commercial system from RuleQuest Research, much discussion and
    research still centers on C4.5 because its source code is available [13].

    In 1996 three authors, John Shafer, Rakesh Agrawal and Manish Mehta of the IBM Almaden
    Research Center, proposed a new algorithm named SPRINT (Scalable PaRallelizable
    INduction of decision Trees). SPRINT removes all memory restrictions, runs fast and is
    scalable. It was designed to be easy to parallelize, allowing many processors to work
    together to build a single, consistent classification model [7]. SPRINT has since been
    commercialized and integrated into IBM's data mining tools.

    Among decision-tree-based classification algorithms, C4.5 and SPRINT are representative
    of two different application scopes. C4.5 is an efficient algorithm and the most widely
    used one in classification applications with small data volumes, on the order of a few
    hundred thousand records. SPRINT is an excellent algorithm for applications with very
    large data volumes, from a few million up to billions of records.

    2.2. The C4.5 algorithm

    C4.5 is an efficient and popular decision-tree-based classification algorithm for mining
    small databases. C4.5 keeps the data memory-resident, which is precisely why it suits
    only small databases, and it re-sorts the data at each node while the tree is grown.
    C4.5 also includes a technique for re-expressing the decision tree as an ordered list of
    if-then rules (an easily understood form of classification rule). This technique reduces
    the size of the rule set and simplifies the rules, while keeping accuracy equivalent to
    that of the corresponding branches of the decision tree.

    The tree-growing idea of C4.5 is Hunt's method, discussed above. C4.5 grows the tree
    with a depth-first strategy.

    Pseudocode of the C4.5 algorithm:

    FormTree(T)
    (1) ComputeClassFrequency(T);
    (2) if OneClass or FewCases
            return a leaf;
        Create a decision node N;
    (3) ForEach Attribute A
            ComputeGain(A);
    (4) N.test = AttributeWithBestGain;
    (5) if N.test is continuous
            find Threshold;
    (6) ForEach T' in the splitting of T
    (7)     if T' is Empty
                Child of N is a leaf
            else
    (8)         Child of N = FormTree(T');
    (9) ComputeErrors of N;
        return N

    Figure 10 - Pseudocode of the C4.5 algorithm

    In this report we concentrate on the points that distinguish C4.5 from other algorithms:
    how the test attribute is chosen at each node, how missing values are handled, how
    overfitting is avoided, and how accuracy is estimated and the tree pruned.

    2.2.1. C4.5 uses entropy-based gain as the measure for selecting the best attribute

    Most machine learning systems try to produce a tree that is as small as possible,
    because smaller trees are easier to understand and tend to achieve higher predictive
    accuracy.


    Since the minimality of the decision tree cannot be guaranteed, C4.5 relies on greedy
    search and selects the split for which the attribute selection measure is maximal.

    C4.5 uses two measures, information gain and gain ratio. RF(Cj, S) denotes the relative
    frequency of the cases in S that belong to class Cj:

    \[ RF(C_j, S) = \frac{|S_j|}{|S|} \]

    where |Sj| is the number of cases whose class is Cj and |S| is the size of the training
    set.

    The information needed for classification, I(S), where S is the set whose class
    distribution is considered, is computed as:

    \[ I(S) = -\sum_{j=1}^{k} RF(C_j, S)\,\log_2 RF(C_j, S) \]

    After S has been partitioned into subsets S1, S2, ..., St by a test B, the information
    gain is:

    \[ G(S, B) = I(S) - \sum_{i=1}^{t} \frac{|S_i|}{|S|}\, I(S_i) \]

    Test B is chosen if G(S, B) is largest.

    One problem with G(S, B) is that it favors tests with a large number of outcomes; for
    example, G(S, B) is maximal for a test in which every Si contains a single case. The
    gain ratio criterion solves this problem by taking into account the potential
    information of the partition itself:

    \[ P(S, B) = -\sum_{i=1}^{t} \frac{|S_i|}{|S|}\,\log_2 \frac{|S_i|}{|S|} \]

    Test B is chosen if its gain ratio = G(S, B) / P(S, B) is largest.

    In the C4.5 release 8 classification model either information gain or gain ratio can be
    used to determine the best attribute; gain ratio is the default.
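    The bias of information gain towards many-outcome tests, and the way gain ratio corrects
    it, can be checked numerically. The sketch below is an illustration only: the function
    names and the four-case toy data set are assumptions, not taken from the C4.5 source.

```python
from collections import Counter
from math import log2


def info(labels):
    """I(S): entropy of the class distribution of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def gain(labels, subsets):
    """G(S, B): I(S) minus the size-weighted entropy of the subsets produced by test B."""
    n = len(labels)
    return info(labels) - sum(len(s) / n * info(s) for s in subsets)


def split_info(labels, subsets):
    """P(S, B): potential information of the partition itself."""
    n = len(labels)
    return -sum(len(s) / n * log2(len(s) / n) for s in subsets if s)


def gain_ratio(labels, subsets):
    return gain(labels, subsets) / split_info(labels, subsets)


S = ["yes", "yes", "no", "no"]
by_id = [[label] for label in S]              # a test that isolates every case
by_flag = [["yes", "yes"], ["no", "no"]]      # a two-outcome test that separates the classes

print(gain(S, by_id), gain(S, by_flag))              # both tests reach the same gain
print(gain_ratio(S, by_id), gain_ratio(S, by_flag))  # gain ratio prefers the 2-way test
```

    Both tests have the same information gain, but the one-case-per-subset test pays a
    larger P(S, B) and therefore ends up with the lower gain ratio.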


    Example of computing information gain

    For a discrete attribute

    Table 1 - Training data set with class attribute buys_computer

    In this data set, s1 is the set of records whose class value is yes and s2 is the set of
    records whose class value is no. Then:

    I(S) = I(s1, s2) = I(9, 5) = -9/14 * log2(9/14) - 5/14 * log2(5/14) = 0.940

    Compute G(S, A) with A being each attribute in turn:

    A = age. The attribute age is discretized into three ranges, giving subsets S1, S2, S3
    of sizes 5, 4 and 5; the last range is age > 40.

    For age > 40: I(S3) = I(s13, s23) = 0.971

    Σ |Si| / |S| * I(Si) = 5/14 * I(S1) + 4/14 * I(S2) + 5/14 * I(S3) = 0.694

    Gain(S, age) = I(s1, s2) - Σ |Si| / |S| * I(Si) = 0.246

    Computing the other attributes in the same way gives:

    A = income: Gain(S, income) = 0.029

    A = student: Gain(S, student) = 0.151

    A = credit_rating: Gain(S, credit_rating) = 0.048

    The attribute age has the largest information gain, so age is chosen as the attribute to
    expand at the node under consideration.
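    The arithmetic of this example can be re-checked with a short script. The per-range
    class counts used below, (2, 3), (4, 0) and (3, 2), are an assumption: they do not
    appear in the text above, but they match the subset sizes 5, 4, 5 and reproduce the
    stated values 0.971 and 0.694.

```python
from math import log2


def info(counts):
    """Entropy I(c1, c2, ...) of a vector of class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)


whole = (9, 5)                                # 9 records with class yes, 5 with class no
# Assumed (yes, no) counts for the three age ranges; not given explicitly in the text.
age_partition = [(2, 3), (4, 0), (3, 2)]

n = sum(whole)
expected = sum(sum(part) / n * info(part) for part in age_partition)
gain_age = info(whole) - expected

print(round(info(whole), 3))                  # I(9, 5)
print(round(expected, 3))                     # weighted entropy after splitting on age
print(gain_age)                               # unrounded gain
```

    Under these assumed counts the unrounded gain is slightly above 0.246; the figure in the
    text comes from subtracting the already rounded values, 0.940 - 0.694.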


    For a continuous attribute

    Handling a continuous attribute requires more computation than handling a discrete one.
    It involves the following steps:

    1. Quicksort is used to sort the cases in the training set in ascending or descending
    order of the values of the continuous attribute V under consideration, giving the set
    of values V = {v1, v2, ..., vm}.

    2. The data set is split into two subsets at the threshold $\theta_i = (v_i + v_{i+1})/2$
    lying between two adjacent values $v_i$ and $v_{i+1}$. The split test is the binary test
    $V \le \theta_i$. Applying the test yields two subsets: V1 = {v1, v2, ..., vi} and
    V2 = {vi+1, vi+2, ..., vm}.

    3. The (m-1) possible thresholds $\theta_i$ for the m values of V are examined by
    computing the information gain or gain ratio for each threshold. The threshold with the
    largest information gain or gain ratio is chosen as the split threshold for the
    attribute.

    Searching for the threshold (linearly, as above) and sorting the training set by the
    continuous attribute under consideration can become a bottleneck, because both consume
    considerable computation.
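    The three steps above can be sketched as follows. This is a simplified illustration, not
    C4.5's implementation: it recomputes the entropy of both subsets for every threshold
    (quadratic work) for clarity, uses Python's built-in sort instead of quicksort, and the
    function names and sample data are assumptions.

```python
from collections import Counter
from math import log2


def info(labels):
    """Entropy of the class distribution of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())


def best_threshold(values, labels):
    """Return (threshold, gain) maximizing information gain for the test V <= threshold."""
    pairs = sorted(zip(values, labels))       # step 1: sort cases by attribute value
    n = len(pairs)
    base = info(labels)
    best = (None, -1.0)
    for i in range(1, n):                     # step 3: try every boundary between cases
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue                          # equal values admit no threshold between them
        left = [label for _, label in pairs[:i]]      # step 2: cases with V <= threshold
        right = [label for _, label in pairs[i:]]     #         cases with V >  threshold
        g = base - (len(left) / n * info(left) + len(right) / n * info(right))
        if g > best[1]:
            best = ((lo + hi) / 2, g)
    return best


# Hypothetical data: six cases with a numeric attribute and a two-valued class.
ages = [23, 17, 43, 68, 32, 20]
risk = ["high", "high", "high", "low", "low", "high"]
threshold, g = best_threshold(ages, risk)
print(threshold, round(g, 3))
```

    Replacing the gain expression with a gain ratio would implement the other criterion
    mentioned in step 3 without changing the scan itself.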

    2.2.2. C4.5 has its own mechanism for handling missing values

    Missing attribute values are common in data. They may result from errors made when
    records are entered into the database, or arise because the attribute value was judged
    unnecessary for a particular case.

    While the tree is being built from the training set S, let B be a test on attribute Aa
    with outcomes b1, b2, ..., bt. Let S0 be the subset of cases in S whose value of Aa is
    unknown, and let Si denote the cases with outcome bi in test B. The information gain of
    test B is then reduced, because nothing is learned from the cases in S0:

    \[ G(S, B) = \frac{|S - S_0|}{|S|}\, G(S - S_0, B) \]

    Correspondingly, P(S, B) also changes:

    \[ P(S, B) = -\frac{|S_0|}{|S|}\,\log_2 \frac{|S_0|}{|S|}
                 - \sum_{i=1}^{t} \frac{|S_i|}{|S|}\,\log_2 \frac{|S_i|}{|S|} \]

    Both changes reduce the value of tests on attributes with a high proportion of missing
    values.

    If test B is chosen, C4.5 does not create a separate branch of the decision tree for S0.
    Instead, the algorithm distributes the cases in S0 among the subsets Si, the subsets
    whose value of the test attribute is known, with weight |Si| / |S - S0|.
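    A small sketch of the two mechanisms just described: fractional distribution of a case
    with an unknown test outcome, and the scaling of the gain. The function names, the
    dictionary representation and the numbers are illustrative assumptions.

```python
def distribute_missing(subset_sizes, weight=1.0):
    """Split one case with unknown outcome across the subsets S_i.

    subset_sizes maps each outcome b_i to |S_i| (cases with a known value only), so their
    sum is |S - S_0| and each share is weight * |S_i| / |S - S_0|.
    """
    known = sum(subset_sizes.values())
    return {outcome: weight * size / known for outcome, size in subset_sizes.items()}


def adjusted_gain(gain_on_known, n_total, n_missing):
    """G(S, B) = |S - S_0| / |S| * G(S - S_0, B)."""
    return (n_total - n_missing) / n_total * gain_on_known


# Hypothetical test with three outcomes observed on 10 cases; 2 further cases are unknown.
shares = distribute_missing({"b1": 6, "b2": 3, "b3": 1})
print(shares)
print(adjusted_gain(0.5, n_total=12, n_missing=2))
```

    The shares of one case always sum to its original weight, so the total weight of the
    training set is preserved as cases are passed down the tree.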

    2.2.3. Avoiding overfitting

    Overfitting is a significant difficulty for decision tree learning and for other
    learning methods. It arises as follows: if there are no conflicting cases (cases with
    identical values for every attribute but different classes), the decision tree will
    classify every case in the training set correctly. Because the training data sometimes
    contains peculiarities of its own, accuracy is no longer as high when the tree is
    applied to other data sets.

    There are several approaches to avoiding overfitting in decision trees:

    Stop growing the tree early, before it reaches the point where it classifies the
    training data perfectly. The challenge with this approach is estimating accurately when
    to stop growing the tree.

    Allow the tree to overfit the data, then prune it.

    Although the first approach seems more intuitive, trees produced by the second approach
    have been shown experimentally to be more successful in practice, because it allows
    potential interactions among attributes to be explored before deciding which results are
    worth keeping. C4.5 uses the second technique to avoid overfitting.

    2.2.4. Converting a decision tree to rules

    Converting a decision tree into production rules of if-then form yields classification
    rules that are easy to understand and easy to apply. Classification models that express
    concepts as production rules have proved useful in many domains that demand both
    accuracy and comprehensibility of the model. Producing output as a set of production
    rules is therefore a wise choice. However, the computation needed to generate a rule set
    from a training set that is large and contains many noisy values is very large [12].
    This claim is demonstrated by the experimental results on the C4.5 classification model.


    The conversion from decision tree to rules consists of four steps:

    Pruning:

    The initial rules are the paths from the root to the leaves of the decision tree. A
    decision tree with l leaves yields a production rule set with l initial rules. Each
    condition in a rule is examined and dropped if removing it does not affect the accuracy
    of the rule. The pruned rule is then added to the intermediate rule set, provided it
    does not duplicate a rule already there.

    Selection:

    The pruned rules are grouped by class value, forming one subset of rules per class.
    There are k rule subsets if the training set has k class values. Each subset is examined
    in order to choose a subset of its rules that optimizes the predictive accuracy for the
    class associated with that rule set.

    Ordering:

    The k rule subsets produced in the previous step are ordered by error frequency. The
    default class is determined by finding the training cases not covered by any current
    rule and choosing the most frequent class among those cases.

    Estimation and evaluation:

    The rule set is re-evaluated on the whole training set to determine whether any rule
    reduces the accuracy of the classification. If so, that rule is removed, and the
    evaluation is repeated until no further improvement is possible.
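    The starting point of the first step, one initial rule per root-to-leaf path, can be
    sketched as below. The nested-dict tree format, the function names and the example tree
    are hypothetical; the condition-dropping, selection, ordering and evaluation steps are
    not implemented here.

```python
def tree_to_rules(node, conditions=()):
    """Return one (conditions, class) pair for every root-to-leaf path of the tree."""
    if "leaf" in node:
        return [(list(conditions), node["leaf"])]
    rules = []
    for outcome, child in node["children"].items():
        rules += tree_to_rules(child, conditions + ((node["test"], outcome),))
    return rules


def format_rule(conditions, label):
    """Render a rule in if-then form."""
    body = " and ".join(f"{attr} = {value}" for attr, value in conditions) or "true"
    return f"if {body} then class = {label}"


# Hypothetical tree with three leaves, hence three initial rules.
tree = {
    "test": "age",
    "children": {
        "young": {
            "test": "student",
            "children": {"yes": {"leaf": "buys"}, "no": {"leaf": "skips"}},
        },
        "old": {"leaf": "buys"},
    },
}
rules = tree_to_rules(tree)
for conditions, label in rules:
    print(format_rule(conditions, label))
```

    Pruning would then test each condition of each rule in turn and drop those whose removal
    does not lower the rule's estimated accuracy.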

    2.2.5. C4.5 is an efficient algorithm for small and medium-sized data sets

    C4.5 generates decision trees efficiently and rigorously by using information gain as
    the measure for selecting the best attribute. Its handling of noisy and missing values,
    its protection against overfitting and its pruning method together give C4.5 its
    strength. In addition, the C4.5 classification model can convert the decision tree into
    if-then rules, which improves both the accuracy and the comprehensibility of the
    classification results. This is a very useful facility for users.


    2.3. The SPRINT algorithm

    Today the data to be mined may contain millions of records and roughly 10 to 10,000
    attributes. Terabytes of data (100 M records × 2000 fields × 5 bytes) need to be mined.
    Earlier algorithms cannot meet this demand. Against this background SPRINT appeared as
    an improvement on the SLIQ algorithm (Mehta, 1996). Both SLIQ and SPRINT include
    improvements that increase scalability, such as:

    Good handling of both continuous and discrete attributes.

    Both algorithms pre-sort the data once and keep disk-resident the data that is too large
    to fit in main memory. Because sorting disk-resident data is expensive [3],
    pre-sorting means the data used for tree growing needs to be sorted only once. After
    each partitioning step at a node, the order of the records in each list is preserved, so
    no re-sorting is needed as it is in CART and C4.5 [13][12]. This reduces computation
    when the data is disk-resident.

    Both algorithms use data structures that make building the decision tree easier.
    However, the storage structures of SLIQ and SPRINT differ, which leads to different
    degrees of scalability and parallelizability.

    The pseudocode of the SPRINT algorithm is as follows:

    Figure 11 - Pseudocode of the SPRINT algorithm

    SPRINT algorithm:

    Partition(Data S) {
        if (all points in S are of the same class) then
            return;
        for each attribute A do
            evaluate splits on attribute A;
        Use best split found to partition S into S1 and S2
        Partition(S1);
        Partition(S2);
    }

    Initial call: Partition(Training Data)


    2.3.1. Data structures in SPRINT

    The technique of splitting the data into separate attribute lists was first proposed in
    SLIQ (Supervised Learning In Quest). The data used by SLIQ consists of several
    disk-resident attribute lists (one list per attribute) and a single list holding the
    class values, which is resident in main memory. These lists are linked to one another by
    the rid attribute (the record index, numbered in database order) present in every list.

    SLIQ divides the data into two kinds of structure [14][9]:

    Figure 12 - Data structures in SLIQ

    The attribute list is disk-resident. It consists of the attribute value field and rid (a
    record identifier).

    The class list holds the value of the class attribute for each record in the database.
    It consists of the fields rid, class attribute and node (a link to the corresponding
    node of the decision tree). Because of this pointer field, partitioning the data only
    requires changing the value of the pointer, without actually moving data between nodes.
    The class list is kept in main memory because it is accessed and modified frequently,
    both while the tree is built and while it is pruned. The size of the class list is
    proportional to the number of input records. When the class list does not fit in memory,
    the performance of SLIQ degrades. This is the limitation of SLIQ: relying on a
    memory-resident data structure limits its scalability.


    SPRINT uses disk-resident attribute lists

    SPRINT overcomes the limitation of SLIQ by not using a memory-resident class list. It
    uses only one kind of list, the attribute list, with the following structure:

    Figure 13 - Structure of the attribute lists in SPRINT. Lists for continuous attributes
    are sorted as soon as they are created

    Attribute lists

    SPRINT creates an attribute list for each attribute in the data set. The list consists
    of the attribute value, the class label (the class attribute) and the record index rid
    (numbered from the original data set). Lists for continuous attributes are sorted by
    attribute value as soon as they are created. If the whole data set does not fit in
    memory, all attribute lists are stored on disk. This storage scheme is what lets SPRINT
    remove all memory restrictions and makes it applicable to real databases whose record
    counts sometimes reach the billions.

    The initial attribute lists created from the training set are associated with the root
    of the decision tree. As the tree grows and nodes are split into new children, the
    attribute lists belonging to each node are partitioned correspondingly and associated
    with the children. When a list is partitioned, the order of its records is preserved, so
    the resulting sublists never need to be re-sorted. This is an advantage of SPRINT over
    earlier algorithms.
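    The construction and order-preserving partitioning of attribute lists can be sketched
    with the six example records of Figure 13 (RID, Age, Car Type, Risk). The in-memory
    Python lists and the set of rids used to route records are simplifying assumptions made
    for illustration; SPRINT itself keeps the lists on disk.

```python
# The six example records of Figure 13: (rid, age, car_type, risk).
records = [
    (0, 23, "family", "high"),
    (1, 17, "sport", "high"),
    (2, 43, "sport", "high"),
    (3, 68, "family", "low"),
    (4, 32, "truck", "low"),
    (5, 20, "family", "high"),
]


def build_attribute_lists(records):
    """Create one (value, rid, class) list per attribute; the continuous one is pre-sorted."""
    age_list = sorted((age, rid, risk) for rid, age, _, risk in records)
    car_list = [(car, rid, risk) for rid, _, car, risk in records]
    return age_list, car_list


def split_list(attr_list, left_rids):
    """Partition a list between two children while keeping the original record order."""
    left = [entry for entry in attr_list if entry[1] in left_rids]
    right = [entry for entry in attr_list if entry[1] not in left_rids]
    return left, right


age_list, car_list = build_attribute_lists(records)
# Suppose the winning test at the root is Age < 27.5: collect the rids going left,
# then use them to split every other attribute list consistently.
left_rids = {rid for age, rid, _ in age_list if age < 27.5}
age_left, age_right = split_list(age_list, left_rids)
car_left, car_right = split_list(car_list, left_rids)
print(age_left, car_left)
```

    Because `split_list` only filters, the Age sublists stay sorted and never need another
    sort, which is the property the paragraph above highlights.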

    Training set:

    RID  Age  Car Type  Risk
    0    23   family    high
    1    17   sport     high
    2    43   sport     high
    3    68   family    low
    4    32   truck     low
    5    20   family    high

    Attribute list for Age (sorted by value):

    Age  RID  Risk
    17   1    high
    20   5    high
    23   0    high
    32   4    low
    43   2    high
    68   3    low

    Attribute list for Car Type:

    Car Type  RID  Risk
    family    0    high
    sport     1    high
    sport     2    high
    family    3    low
    truck     4    low
    family    5    high

    Histograms


    SPRINT uses histograms to tabulate the class distribution of the records in each
    attribute list, and uses them to evaluate split points for that list. Continuous and
    discrete attributes have two different forms of histogram.

    Histograms for continuous attributes

    SPRINT uses two histograms: Cbelow and Cabove. Cbelow holds the class distribution of
    the records that have already been processed, and Cabove holds the class distribution of
    the records in the attribute list that have not yet been processed. Figure II-3
    illustrates the use of histograms for a continuous attribute.

    Histograms for discrete attributes

    A discrete attribute also has a histogram associated with each node. However, SPRINT
    uses only one histogram here, the count matrix, which holds the class distribution for
    each value of the attribute under consideration.

    The attribute lists are processed one at a time, so rather than requiring the attribute
    lists to be in memory, SPRINT needs memory only for the set of histograms described
    above while the tree is grown.

    2.3.2. SPRINT s dng Gini-index lm o tm im phn chia tp d

    liu tt nht

    SPRINT l mt trong nhng thut ton s dng o Gini-index tm thuc

    tnh tt nht lm thuc tnh test ti mi node trn cy. Ch s ny c Breiman ngh

    ra t nm 1984, cch tnh nh sau:

    Trc tin cn nh ngha:gini (S) = 1- pj2

    Trong : Sl tp d liu o to c n lp; pj

    l tn xut ca lp j

    trong S(l thng ca s bn ghi c gi tr ca thuc tnh phn lp lpjvi

    tng s bn ghi trong S)

    If the split is binary, that is, $S$ is divided into $S_1$ and $S_2$ (SPRINT uses only binary splits of this kind), the index of the split is given by:

    $$gini_{split}(S) = \frac{n_1}{n}\,gini(S_1) + \frac{n_2}{n}\,gini(S_2)$$

    where $n$, $n_1$, $n_2$ are the numbers of records in $S$, $S_1$, $S_2$ respectively.
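The two definitions above translate directly into code. The following is a minimal Python sketch; the function names `gini` and `gini_split` are chosen here for illustration, and each partition is represented simply as a list of class labels.

```python
from collections import Counter


def gini(labels):
    """gini(S) = 1 - sum_j p_j^2, where p_j is the relative frequency of class j."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


def gini_split(left_labels, right_labels):
    """Weighted Gini index of a binary split of S into S1 and S2."""
    n1, n2 = len(left_labels), len(right_labels)
    n = n1 + n2
    if n == 0:
        raise ValueError("cannot evaluate a split of an empty set")
    return n1 / n * gini(left_labels) + n2 / n * gini(right_labels)


# Class labels of the six example records: four "high", two "low".
whole_set = ["high"] * 4 + ["low"] * 2
impurity = gini(whole_set)
split_score = gini_split(["high"] * 3, ["high", "low", "low"])
```

A pure partition has a Gini value of 0, so a lower `gini_split` means the two sides are more homogeneous.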


    The advantage of this index is that it is computed only from the distribution of class values within each partition; it never operates on the values of the attribute being examined.

    To find the split point for a node, each attribute list of the node is scanned and the candidate splits on the corresponding attribute are evaluated. The attribute chosen for the split is the one with the smallest gini_split(S).

    The point to emphasize is that, unlike Information Gain, this index is computed without reading the content of the data: only the histograms describing how the records are distributed over the class values are needed. This is what makes disk-resident data storage possible.

    The histograms for continuous and categorical attribute lists are used as described below.

    Continuous attributes

    For a continuous attribute, the candidate test values are the values lying between each pair of adjacent values of the attribute. To find the split point for the attribute at a given node, the histograms are initialized with Cbelow set to 0 and Cabove set to the class distribution of all records at the node. Both histograms are updated as each record is read. Each time the cursor advances, the Gini index is computed for the split point lying between the value just read and the value about to be read. When the whole attribute list has been read (Cabove is 0 in every column), the Gini indexes of all candidate split points have been computed. The lowest Gini index can then be selected, and it identifies the split point of the continuous attribute at that node. The Gini computation relies entirely on the histograms. Once the best split point has been found, the result is saved and the histograms attached to the attribute list are re-initialized before the next attribute is processed.

    Figure 14 - Evaluating split points for a continuous attribute
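The scan described above can be sketched as a single pass over a pre-sorted attribute list. This is an illustrative Python sketch rather than SPRINT's actual implementation: the function names are hypothetical, and taking the midpoint of two adjacent values as the threshold is an assumption of this example.

```python
# Age attribute list from the example, sorted by value: (value, rid, class).
AGE_LIST = [
    (17, 1, "high"), (20, 5, "high"), (23, 0, "high"),
    (32, 4, "low"), (43, 2, "high"), (68, 3, "low"),
]


def gini_from_counts(counts):
    """Gini index computed from a class -> count histogram."""
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_continuous_split(attr_list):
    """Return (lowest gini_split, threshold) for a sorted attribute list."""
    c_below, c_above = {}, {}
    for _, _, cls in attr_list:          # Cabove starts as the full distribution
        c_above[cls] = c_above.get(cls, 0) + 1
        c_below.setdefault(cls, 0)       # Cbelow starts at zero
    n = len(attr_list)
    best = (float("inf"), None)
    for i in range(n - 1):
        value, _, cls = attr_list[i]
        c_below[cls] += 1                # move the record just read from above to below
        c_above[cls] -= 1
        next_value = attr_list[i + 1][0]
        if next_value == value:          # no split point between equal values
            continue
        n_below = i + 1
        score = (n_below / n * gini_from_counts(c_below)
                 + (n - n_below) / n * gini_from_counts(c_above))
        if score < best[0]:
            best = (score, (value + next_value) / 2)
    return best


score, threshold = best_continuous_split(AGE_LIST)
```

On the example data the lowest index is found between the values 23 and 32. Only the two histograms are consulted when scoring a candidate, which is why the attribute list itself can stay on disk.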


    Categorical attributes

    For a categorical attribute, the search for the best split point is also based on the histogram of the attribute list. First the whole attribute list is scanned to obtain the number of records of each class for each value of the attribute; the result is stored in the count matrix histogram. Next, every subset that can be formed from the values of the attribute is treated as a candidate split, and the corresponding Gini index is computed. All the information needed to compute the Gini index of any subset is available in the count matrix. The memory allocated to the count matrix is reclaimed once the best split of the attribute has been found.

    Figure 15 - Evaluating split points for a categorical attribute
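The procedure for categorical attributes can be sketched as follows. This is a hedged Python illustration with hypothetical function names. It enumerates subsets exhaustively, which is only practical when the attribute has few distinct values, since the number of subsets grows exponentially.

```python
from itertools import combinations

# Car Type attribute list from the example: (value, rid, class).
CAR_LIST = [
    ("family", 0, "high"), ("sport", 1, "high"), ("sport", 2, "high"),
    ("family", 3, "low"), ("truck", 4, "low"), ("family", 5, "high"),
]


def build_count_matrix(attr_list):
    """One scan of the list: value -> {class -> number of records}."""
    matrix = {}
    for value, _, cls in attr_list:
        row = matrix.setdefault(value, {})
        row[cls] = row.get(cls, 0) + 1
    return matrix


def gini_from_counts(counts):
    n = sum(counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_categorical_split(attr_list):
    """Return (lowest gini_split, subset of values sent to one child)."""
    matrix = build_count_matrix(attr_list)
    values = sorted(matrix)
    classes = sorted({cls for row in matrix.values() for cls in row})
    total = {c: sum(matrix[v].get(c, 0) for v in values) for c in classes}
    n = sum(total.values())
    best = (float("inf"), None)
    # A subset and its complement describe the same split, so subsets of
    # at most half the values are enough.
    for size in range(1, len(values) // 2 + 1):
        for subset in combinations(values, size):
            left = {c: sum(matrix[v].get(c, 0) for v in subset) for c in classes}
            right = {c: total[c] - left[c] for c in classes}
            n1 = sum(left.values())
            score = n1 / n * gini_from_counts(left) + (n - n1) / n * gini_from_counts(right)
            if score < best[0]:
                best = (score, set(subset))
    return best


score, subset = best_categorical_split(CAR_LIST)
```

After the single scan that fills the count matrix, every candidate subset is scored from the matrix alone, without touching the attribute list again.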

    Example of computing the Gini index

    With the training data described in Figure 13, the Gini index is computed to find the best split point as follows:

    1. For the continuous attribute Age, the split point must be evaluated on each of the following comparisons in turn: Age


    The computation is similar for the remaining tests Age


    For the attribute chosen as the splitting attribute at the node (Age, as in the figure), dividing its attribute list among the child nodes is fairly simple. If it is a continuous attribute, the list only needs to be cut at the split point into two parts, which are assigned to the two corresponding child nodes. If it is a categorical attribute, the whole list must be scanned and the test applied to each record to decide which of the two new lists, one per child node, the record moves to.

    Matters are not so simple for the remaining attributes at the node (Car Type, for example). There is no test on such an attribute, so its records cannot be partitioned by checking the attribute's values. Instead, a special field of the attribute lists, rids, is used. This field is what links the records across the attribute lists. Specifically, while the list of the splitting attribute (Age) is being partitioned, the rid of each record is inserted into a hash table that marks the child node to which the corresponding records (those with the same rid) in the other attribute lists will be sent. The hash table has the following structure:

    Figure 17 - Structure of the hash table used to partition data in SPRINT (following the example in the preceding figures)

    By the time the list of the splitting attribute has been partitioned, the hash table is complete. The lists of the remaining attributes are then divided among the child nodes according to the hash table, by reading the rids field of each record and looking up the corresponding Child node field in the hash table.

    If the hash table is too large for memory, the partitioning is carried out in several steps. The hash table is divided into parts that each fit in memory, and the attribute lists are partitioned according to each part of the hash table in turn. The process is repeated until the hash table fits in memory.
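The two-phase partitioning described above can be sketched in Python, with an ordinary `dict` standing in for the hash table. The function names are hypothetical and the test "Age < 27.5" is used only as an example; the multi-step variant for hash tables that exceed memory is not shown.

```python
# Attribute lists from the example: (value, rid, class).
AGE_LIST = [
    (17, 1, "high"), (20, 5, "high"), (23, 0, "high"),
    (32, 4, "low"), (43, 2, "high"), (68, 3, "low"),
]
CAR_LIST = [
    ("family", 0, "high"), ("sport", 1, "high"), ("sport", 2, "high"),
    ("family", 3, "low"), ("truck", 4, "low"), ("family", 5, "high"),
]


def split_on_attribute(split_list, test):
    """Partition the splitting attribute's list and record rid -> child node."""
    left, right, child_of = [], [], {}
    for entry in split_list:
        value, rid, _ = entry
        if test(value):
            left.append(entry)
            child_of[rid] = "L"
        else:
            right.append(entry)
            child_of[rid] = "R"
    return left, right, child_of


def split_other_list(attr_list, child_of):
    """Partition a non-splitting list by probing the hash table with each rid."""
    left = [entry for entry in attr_list if child_of[entry[1]] == "L"]
    right = [entry for entry in attr_list if child_of[entry[1]] == "R"]
    return left, right


age_left, age_right, child_of = split_on_attribute(AGE_LIST, lambda age: age < 27.5)
car_left, car_right = split_other_list(CAR_LIST, child_of)
```

The Car Type list is never tested on its own values; each entry follows its rid to whichever child received the matching Age entry.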

    Hash table (content of Figure 17):

        Rids        1  2  3  4  5  6
        Child node  L  R  R  R  L  L

    2.3.4. SPRINT is an efficient algorithm for data sets that are too large for other algorithms

    SPRINT was not designed to outperform SLIQ [9] on data sets whose class list fits in memory. Its goal is to handle data sets that are too large for other algorithms and still produce an efficient classification model from them. In addition, SPRINT was designed to be easy to parallelize. Parallelizing SPRINT is indeed natural and efficient under a data-parallel approach: SPRINT sorts the data and balances the workload by distributing the attribute lists evenly over the N processors of a machine with a shared-nothing architecture [7]. The parallelization of SPRINT in particular, and of decision-tree classifiers in general, on shared-memory multiprocessor (SMP) systems, also known as shared-everything systems, is studied in [10].

    Alongside its strengths, SPRINT also has weaknesses. The first is the hash table used to partition the data, whose size is proportional to the number of data objects attached to the current node (the number of records in an attribute list). The hash table must also reside in memory while the data is being partitioned; when it grows too large, the partitioning has to be broken into several steps. Furthermore, the algorithm incurs heavy I/O costs. Parallelizing it also requires costly global communication, because the Gini index information of every attribute list has to be synchronized.

    The three authors of SPRINT report experimental results comparing the SPRINT classifier with SLIQ [7], shown in the chart below.

    Chart 1 - Execution time of the SPRINT and SLIQ classifiers as a function of training-set size


    The chart shows that once the data set grows past roughly 1 million cases SLIQ can no longer operate, whereas SPRINT still handles data sets of more than 2.5 million cases with ease. The reason is that SPRINT keeps its data entirely disk-resident.

    2.4. Comparison of C4.5 and SPRINT

    Criterion: selection of the splitting attribute

        C4.5:   Gain-entropy. Tends to isolate the largest class from the other classes.
                Illustration (classes A: 40, B: 30, C: 20, D: 10): the test "age < 40" puts class A (40) in one branch and classes B (30), C (20), D (10) in the other.

        SPRINT: Gini-index. Tends to split the classes into groups holding comparable amounts of data.
                Illustration (same classes): the test "age < 65" puts classes A (40) and D (10) in one branch and classes B (30) and C (20) in the other.

    Criterion: data storage

        C4.5:   Memory-resident. Suited to mining small databases (hundreds of thousands of records).

        SPRINT: Disk-resident. Suited to mining extremely large data sets that other algorithms cannot handle (hundreds of millions to billions of records).

    Criterion: data sorting

        C4.5:   Re-sorts the data set at each node.

        SPRINT: Sorts once, up front. While the tree is grown the attribute lists are partitioned but their initial order is maintained, so no re-sorting is needed.


    Chapter 3. EXPERIMENTAL RESULTS

    The author used the open-source C4.5 release 8 classifier written by J. Ross Quinlan, available at http://www.cse.unsw.edu.au/~quinlan/, to analyze and evaluate the C4.5 classification model in terms of its classification results and the factors that affect its performance.

    3.1. Experimental environment

    The C4.5 source code was installed and run on Server 10.10.0.10 of the College of Technology.

    The server configuration is as follows: Intel Xeon 2.4GHz processors, with 2 physical processors that can operate as 4 logical processors through hyper-threading; cache size 512KB; 1GB of main memory.

    The experimental data set contains information about mobile-phone customers who registered to use a web portal. Its fields include personal information such as name, age, gender and date of birth, the region where the phone was registered, the phone model and its version, and the number and time of visits to the web portal to use services such as sending messages, logos or ringtones. The data set has about 120000 records used for training and about 60000 records used as the test set.

    3.2. Structure of the C4.5 release 8 classifier

    3.2.1. The C4.5 classifier has 4 main programs:

    The decision-tree generator (c4.5)

    The production-rule generator (c4.5rules)

    The program that applies a decision tree to classify new data (consult)

    The program that applies a rule set to classify new data (consultr)

    In addition, C4.5 comes with 2 utilities that support the experiments:

    A csh shell script for estimating classifier accuracy by cross-validation ('xval.sh')

    Two accompanying helper programs ('xval-prep' and 'average')

    Further details on the C4.5 classifier can be found at:

    http://www2.cs.uregina.ca/~hamilton/courses/831/notes/ml/dtrees/c4.5/tutorial.html

    3.2.2. Data structures used in C4.5

    Each data set used with C4.5 consists of 3 files:

    3.2.2.1. filestem.names: defines the data set

    Figure 18 - The file defining the structure of the data used in the experiments

    Description:

    The top line defines the class values, taken from the attribute chosen as the class (in the example of Figure 18, the attribute MOBILE_PRODUCTER_ID).

    The following lines list the attributes together with the set of values each takes in the data set. Continuous attributes are declared with the keyword continuous.

    Comments are introduced by the character |.

    3.2.2.2. filestem.data: contains the training data


    Figure 19 - The file containing the data to be classified

    filestem.data is structured as follows: each line corresponds to one record (case) in the database. Each line is a tuple of values given in the order of the attributes defined in filestem.names. Values are separated by commas. A missing value is represented by ?.

    3.2.2.3. filestem.test: contains the test data

    This file contains the data used to test the classifier built from the training data, and has the same structure as filestem.data.
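To make the file layout concrete, the following Python sketch parses a tiny .names and .data pair held in strings. It is a simplified reader written for illustration, not C4.5's own parser: the attribute names and values are hypothetical, only the `continuous` keyword, `|` comments and `?` missing values are handled, and it follows the C4.5 convention that each data line ends with the case's class value.

```python
NAMES_TEXT = """\
1, 2, 3.                         | class values (hypothetical)
GENDER: M, F.
JOB: Engineering, Supervisory, Other.
BIRTH_YEAR: continuous.
"""

DATA_TEXT = """\
M,Engineering,1971,1
F,?,1980,3
"""


def strip_comment(line):
    """Drop everything after '|' and surrounding whitespace."""
    return line.split("|", 1)[0].strip()


def parse_names(text):
    """Return (class values, [(attribute name, 'continuous' or value list)])."""
    lines = [s for s in (strip_comment(raw) for raw in text.splitlines()) if s]
    classes = [c.strip() for c in lines[0].rstrip(".").split(",")]
    attributes = []
    for line in lines[1:]:
        name, spec = line.split(":", 1)
        spec = spec.strip().rstrip(".")
        if spec == "continuous":
            attributes.append((name.strip(), "continuous"))
        else:
            attributes.append((name.strip(), [v.strip() for v in spec.split(",")]))
    return classes, attributes


def parse_data(text, attributes):
    """Return a list of (attribute dict, class label); '?' becomes None."""
    cases = []
    for raw in text.splitlines():
        line = strip_comment(raw)
        if not line:
            continue
        *values, label = [field.strip() for field in line.split(",")]
        row = {}
        for (name, kind), value in zip(attributes, values):
            if value == "?":
                row[name] = None
            elif kind == "continuous":
                row[name] = float(value)
            else:
                row[name] = value
        cases.append((row, label))
    return cases


classes, attributes = parse_names(NAMES_TEXT)
cases = parse_data(DATA_TEXT, attributes)
```

A filestem.test file could be read with the same `parse_data` function, since it shares the structure of filestem.data.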

    3.3. Experimental results

    3.3.1. Some representative classification results

    3.3.1.1. Decision tree

    Command to build the decision tree:

    $ ./C4.5 -f ../Data/Classes/10-5/class -u >> ../Data/Classes/10-5/class.dt

    Options:

    -f: specifies the data set to classify

    -u: the tree that is built is also evaluated on the test data set

    -v verb: verbosity level of the output [0..3], default 0

    -t trials: sets iterative mode, with trials the number of trial trees. Iterative mode builds several trial trees, starting from a randomly chosen subset of the data. The default is batch mode, in which the whole data set is used to build a single decision tree.


    The internal nodes of the decision tree are tests on the value of the attribute chosen for expansion at that node. The leaves of the tree have the form class_value (N/E) or class_value (N), where N is the number of training cases that reach the leaf and E is the number of those cases that belong to a different class.

    Figure 20 - Form of the decision tree built from the experimental data set


    Figure 21 - Evaluation of the newly built decision tree on the training data set and the test data set

    After the decision tree has been built, its accuracy is re-estimated on the very training set it was learned from and, if the user selects the option, on a test set independent of the training data.

    The estimates are made on the tree both before and after pruning. C4.5 also accepts parameters that control the degree of pruning; the default is 25%.

    3.3.1.2. Representative production rules

    Command to generate production rules once the decision tree exists:

    $ ./C4.5rules -f ../Data/Classes/10-5/class -u >> ../Data/Classes/10-5/class.r

    The options -f, -v and -u have the same meaning as in the command that builds the decision tree.

    Each generated rule has 3 parts:

    The classification condition

    The class value ( -> class )

    [ ]: the predicted accuracy of the rule. This value is estimated on the training set and on the test set (if the -u option was given when the rules were generated)


    Figure 23 - Some rules extracted from the 8-attribute data set, classified by phone manufacturer number (PRODUCTER_ID)

    From the actual results in Figure 23, Rule 1021 lets us conclude: if a customer works in a Supervisory job and was born between 1969 and 1973, then the phone the customer uses has manufacturer number 1 (a SAMSUNG phone). The accuracy of this conclusion is 91.7%.

    Rules like this help marketing staff identify the mobile-phone market for each type of customer, and from that devise sound product development strategies.


    Figure 24 - Some rules generated from the 8-attribute data set, classified by the phone service the customer uses (MOBILE_SERVICE_ID)

    For example, from Rule 661: if the customer is male (F), works in Engineering, uses an Ericsson phone (MOBILE_PRODUCTER_ID = 4) and registered in 2004, then the service the customer uses is sending logos (MOBILE_SERVICE_ID = 2). The accuracy of this rule is 79.4%.

    From such rules we can compile statistics on, and predict, the trend in the use of each service by each type of customer, and from that devise an effective customer service development strategy.


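    Figure 25 - Evaluation of the rule set on the training data set

    After it has been generated, the rule set is re-evaluated on the training data or, optionally, on the test data.

    Description of some representative fields:

    Rule: the number of the rule

    Size: the size of the rule (the number of comparisons in its classification condition)

    Used: the number of cases in the training set to which the rule was applied. This field indicates how general the rule is.

    Wrong: the number of cases classified incorrectly, from which the error percentage is derived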

    Conclusion

    From the experiments, we found that data preprocessing plays a very important role. In this step one must determine exactly what information is to be extracted from the database, and from that choose a suitable class attribute. After that, choosing the relevant attributes is equally important: it determines whether the classification model is correct, whether it is meaningful in practice, and whether it can be applied to future data.

    3.3.2. Performance charts

    The parameters that affect the performance of the classifier are [6]:

    The number of records in the training data set (N)

    The number of attributes (A)

    The number of discrete values of each attribute (the branching factor) (V)

    The number of classes (C)

    The cost of building the decision tree is the sum of the costs of building each node:

    $$T = \sum_i t_{node}(i)$$

    The cost of node $i$ is the sum of the costs of the individual tasks:

    $$t_{node}(i) = t_{single}(i) + t_{freq}(i) + t_{info}(i) + t_{div}(i)$$

    where:

    $t_{single}(i)$ is the cost of checking whether all the cases in the training data at the node belong to the same class.

    $t_{div}(i)$ is the cost of partitioning the data set on the chosen attribute.

    Selecting the attribute with the largest Information gain in the current data set requires computing the Information gain of every attribute. The cost of this step consists of the time to compute the frequency distribution over the class values for each attribute ($t_{freq}(i)$) and the time to compute the Information gain from those distributions ($t_{info}(i)$).

    The dependence of these costs on the performance parameters described above can be expressed as follows:

    $$t_{freq} = k_1 \cdot A_i N_i$$

    $$t_{info} = k_2 \cdot C A_i V$$

    $$t_{div} = k_3 \cdot A_i$$

    $$t_{single} = k_4 \cdot N_i$$


    Here the $k_j$ are constants whose values depend on the particular application. The number of records ($N_i$) and the number of attributes ($A_i$) at each node depend on the depth of the node and on the data set itself.

    Determining the exact cost of building the decision tree (T) is very difficult, because it requires knowing the exact shape of the tree, which cannot be determined at run time. For this reason T is simplified by using average values together with assumptions about the shape of the tree, and by solving the recurrence equations for each individual component of the model [6].
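The cost model above can be written out as a small Python sketch. It implements the four formulas exactly as printed in this section. The constants `k` and the example node sizes are hypothetical, chosen only to show how the per-node terms add up to T.

```python
def node_cost(n_i, a_i, num_classes, num_values, k=(1.0, 1.0, 1.0, 1.0)):
    """Cost of one node: t_single + t_freq + t_info + t_div.

    n_i: records at the node (N_i); a_i: attributes at the node (A_i);
    num_classes: C; num_values: V; k: the constants (k1, k2, k3, k4).
    """
    k1, k2, k3, k4 = k
    t_freq = k1 * a_i * n_i                      # class frequencies per attribute
    t_info = k2 * num_classes * a_i * num_values  # information gain per attribute
    t_div = k3 * a_i                             # partitioning, as printed above
    t_single = k4 * n_i                          # single-class check
    return t_single + t_freq + t_info + t_div


def tree_cost(nodes, num_classes, num_values, k=(1.0, 1.0, 1.0, 1.0)):
    """T = sum of node costs; nodes is a list of (N_i, A_i) pairs."""
    return sum(node_cost(n_i, a_i, num_classes, num_values, k) for n_i, a_i in nodes)


# Hypothetical three-node tree: a root with 1000 records and 4 attributes,
# and two children that each have one attribute fewer.
example_nodes = [(1000, 4), (600, 3), (400, 3)]
total = tree_cost(example_nodes, num_classes=3, num_values=5)
```

The sketch needs the (N_i, A_i) pair of every node, which is the difficulty noted above: those pairs are only known once the shape of the tree is known.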

    The following experimental results assess how the performance parameters, namely the size of the training data set, the number of attributes, continuous attributes, and the number of class values, affect the C4.5 classifier:


    3.3.2.1. Execution time as a function of training-set size

    The experiments were run on several data sets that differ in size, in the number of attributes and in the class attribute. The result tables and charts below show the dependence under study.

    Experiment with the 2-attribute data set

    Table 2 - Time to build the decision tree and the production rule set as a function of training-set size, 2 attributes

        Training-set size (cases)   Decision Tree (s)   Production Rules (s)
        29000                       0.15                3.21
        60000                       0.46                6.82
        66000                       0.47                8.85
        131000                      1.17                20.51
        262000                      2.2                 37.94

    Chart 2 - Time to build the decision tree and the production rule set as a function of training-set size, 2 attributes (x-axis: training-set size in cases; y-axis: time in seconds; series: Decision Tree, Production Rules, trend line of Production Rules)


    Experiment with the 7-attribute data set

    Table 3 - Time to build the decision tree and the production rule set as a function of training-set size, 7 attributes

        Training-set size (cases)   Decision Tree (s)   Production Rules (s)
        1000                        0.03                0.13
        10000                       0.46                107.1
        15000                       1.90                276.2
        20000                       2.79                709.9
        25000                       5.70                1211.0
        30000                       8.31                2504.8
        36000                       13.34               5999.5

    Chart 3 - Time to build the decision tree and the production rule set as a function of training-set size, 7 attributes (x-axis: training-set size in cases; y-axis: time in seconds; series: Decision Tree, Production Rules, trend line of Production Rules)


    Experiment with the 18-attribute data set

    Table 4 - Time to build the decision tree and the production rule set as a function of training-set size, 18 attributes

        Training-set size (cases)   Decision Tree (s)   Production Rules (s)
        4000                        0.45                43.6
        6000                        0.64                90.77
        8500                        1.32                304.07
        10000                       1.77                531.34
        12000                       2.37                838.88
        15000                       1.8                 968.24
        17500                       2.68                1584.63
        20000                       2.98                2927.56
        25000                       5.24                4617.23

    Chart 4 - Time to build the decision tree and the production rule set as a function of training-set size, 18 attributes (x-axis: training-set size in cases; y-axis: time in seconds; series: Decision Tree, Production Rules, trend line of Production Rules)

    The dependence of execution time on training-set size was assessed on data sets with different numbers of attributes. The following conclusions can be drawn:

    The larger the training data set, the longer it takes both to build the decision tree and to generate the production rule set. Based on the trend lines added to the charts for the rule-generation time, we conjecture that this dependence is described by a polynomial function.

    The charts above show that generating production rules from the decision tree consumes many times more computing resources than building the decision tree itself. The experiments show that for data sets of around a hundred thousand records, generating the production rules takes quite a long time (typically more than 5 hours). This is one of the reasons why C4.5 cannot be applied to large data sets. The more attributes the training data set has, the larger the gap in execution time between the two processes.
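One way to probe the polynomial conjecture is a least-squares fit on log-log axes, where a dependence of the form t = a * n^d appears as a straight line with slope d. The Python sketch below applies this to the 7-attribute measurements of Table 3. The fitting method is an illustration added here, not necessarily the one behind the trend lines in the charts.

```python
import math

# Measurements from Table 3 (7-attribute data set).
SIZES = [1000, 10000, 15000, 20000, 25000, 30000, 36000]
TREE_SECONDS = [0.03, 0.46, 1.90, 2.79, 5.70, 8.31, 13.34]
RULES_SECONDS = [0.13, 107.1, 276.2, 709.9, 1211.0, 2504.8, 5999.5]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x).

    If y is roughly a * x**d, the returned slope estimates the degree d.
    """
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    mean_x = sum(log_x) / len(log_x)
    mean_y = sum(log_y) / len(log_y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(log_x, log_y))
    s_xx = sum((a - mean_x) ** 2 for a in log_x)
    return s_xy / s_xx


tree_degree = loglog_slope(SIZES, TREE_SECONDS)
rules_degree = loglog_slope(SIZES, RULES_SECONDS)
```

On these figures the estimated exponent for rule generation is clearly higher than the one for tree construction, in line with the gap between the two curves in the charts. With only seven points the estimate is rough and should be read as indicative.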

    3.3.2.2. Performance of C4.5 as a function of the number of attributes

    To assess this dependence, experiments were run on 3 data sets with 2, 4 and 8 categorical attributes, all with the same class attribute.

    Table 5 - Time to build the decision tree (in seconds) as a function of the number of attributes

        Training-set size (cases)   2 attributes   4 attributes   8 attributes
        3000                        0.01           0.12           0.14
        6000                        0.02           0.18           0.3
        16000                       0.05           0.82           3.56
        23000                       0.1            2.18           9.99
        32000                       0.18           3.32           23.40
        40500                       0.25           5.58           33.36
        55500                       0.39           11.83          47.62
        65500                       0.47           16.79          80
        96600                       0.89           33.49          106.61
        131000                      1.17           71.52          185

    Chart 5 - Dependence of decision-tree building time on the number of attributes (x-axis: training-set size in cases; y-axis: time in seconds; series: 2 attributes, 4 attributes, 8 attributes)


    The time C4.5 takes to build the decision tree depends on the number of attributes through the terms tfreq, tinfo and tdiv. The more attributes there are, the longer it takes to select the best test attribute at each node, and so the time to build the decision tree increases. C4.5 is therefore limited in the number of attributes the training data set can have [2]. This is one point of difference from SPRINT.

    3.3.2.3. Performance of C4.5 when handling continuous attributes

    Table 6 - Time to build the decision tree (in seconds) with categorical and continuous attributes

        Training-set size (cases)   3 categorical + 1 continuous   4 continuous
        3000                        0.12                           0.24
        6000                        0.18                           0.66
        16000                       0.92                           3.02
        22000                       2.18                           5.01
        31000                       3.32                           11.56
        40000                       5.74                           16.99
        55000                       11.83                          30.37
        65000                       16.79                          38.16
        96000                       33.47                          70.38
        131000                      61.52                          125.21

    (Chart: x-axis: training-set size in cases; y-axis: time in seconds; series: 3 categorical attributes + 1 continuous attribute, 4 continuous attributes)

    Biu 6 - So sn