Data Mining Classification and Prediction by Dr. Tanvir Ahmed


Transcript of Data Mining Classification and Prediction by Dr. Tanvir Ahmed

  • Slide 1/124: Classification and Prediction

  • Slide 2/124: Classification and Prediction (Outline)

    What is classification? What is prediction?
    Issues regarding classification and prediction
    Classification by decision tree induction
    Bayesian classification
    Lazy learners (or learning from your neighbors)
    Other classification methods
    Prediction
    Accuracy and error measures
    Ensemble methods
    Model selection
    Summary

  • Slide 3/124: Classification vs. Prediction

    Classification
      predicts categorical class labels (discrete or nominal)
      classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses it in classifying new data
    Prediction
      models continuous-valued functions, i.e., predicts unknown or missing values
    Typical applications: credit approval, target marketing, medical diagnosis, fraud detection

  • Slide 4/124: Classification: A Two-Step Process

    Model construction: describing a set of predetermined classes
      Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
      The set of tuples used for model construction is the training set
      The model is represented as classification rules, decision trees, or mathematical formulae
    Model usage: for classifying future or unknown objects
      Estimate the accuracy of the model
        The known label of each test sample is compared with the classified result from the model
        Accuracy rate is the percentage of test set samples that are correctly classified by the model
        Test set is independent of training set, otherwise over-fitting will occur
      If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
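    For concreteness, a minimal sketch of the two-step process (model construction on a training set, then accuracy estimation on an independent test set). The use of scikit-learn and the iris data is an assumption for illustration; the slides name no particular library or dataset.

    # Step 1: build a classifier on the training set.
    # Step 2: estimate its accuracy on a held-out test set before using it on unseen tuples.
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    # Keep the test set independent of the training set to avoid an optimistic (over-fitted) estimate.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

    model = DecisionTreeClassifier().fit(X_train, y_train)    # model construction
    accuracy = accuracy_score(y_test, model.predict(X_test))  # model usage: accuracy estimate
    print(f"accuracy on held-out test set: {accuracy:.2f}")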

  • Slide 5/124: Process (1): Model Construction

    Training Data:

    NAME     RANK            YEARS  TENURED
    Mike     Assistant Prof  3      no
    Mary     Assistant Prof  7      yes
    Bill     Professor       2      yes
    Jim      Associate Prof  7      yes
    Dave     Assistant Prof  6      no
    Anne     Associate Prof  3      no

    Classification Algorithms -> Classifier (Model):

    IF rank = 'professor' OR years > 6
    THEN tenured = 'yes'

  • Slide 6/124: Process (2): Using the Model in Prediction

    Testing Data:

    NAME     RANK            YEARS  TENURED
    Tom      Assistant Prof  2      no
    Merlisa  Associate Prof  7      no
    George   Professor       5      yes
    Joseph   Assistant Prof  7      yes

    Unseen Data: (Jeff, Professor, 4). Tenured?
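    As a hedged illustration (not part of the slides), the learned rule written as a plain Python function and applied to the testing data and to the unseen tuple:

    # Classifier from the previous slide: IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
    def tenured(rank: str, years: int) -> str:
        return "yes" if rank == "Professor" or years > 6 else "no"

    testing_data = [  # (NAME, RANK, YEARS, TENURED)
        ("Tom", "Assistant Prof", 2, "no"),
        ("Merlisa", "Associate Prof", 7, "no"),
        ("George", "Professor", 5, "yes"),
        ("Joseph", "Assistant Prof", 7, "yes"),
    ]
    correct = sum(tenured(rank, yrs) == label for _, rank, yrs, label in testing_data)
    print(f"accuracy on testing data: {correct}/{len(testing_data)}")  # 3/4 (Merlisa is misclassified)
    print("(Jeff, Professor, 4) ->", tenured("Professor", 4))          # 'yes'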

  • Slide 7/124: Supervised vs. Unsupervised Learning

    Supervised learning (classification)
      Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
      New data is classified based on the training set
    Unsupervised learning (clustering)
      The class labels of the training data are unknown

  • Slide 8/124: Chapter 6. Classification and Prediction (Outline)

    What is classification? What is prediction?
    Issues regarding classification and prediction
    Classification by decision tree induction
    Bayesian classification
    Support Vector Machines (SVM)
    Lazy learners (or learning from your neighbors)
    Other classification methods
    Prediction
    Accuracy and error measures
    Ensemble methods
    Model selection
    Summary

  • Slide 9/124: Issues: Data Preparation

    Data cleaning
      Preprocess data in order to reduce noise and handle missing values
    Relevance analysis (feature selection)
      Remove the irrelevant or redundant attributes
    Data transformation

  • Slide 10/124: Issues: Evaluating Classification Methods

    Accuracy
      classifier accuracy: predicting class label
      predictor accuracy: guessing value of predicted attributes
    Speed
      time to construct the model (training time)
      time to use the model (classification/prediction time)
    Robustness: handling noise and missing values
    Scalability: efficiency in disk-resident databases
    Interpretability
      understanding and insight provided by the model
    Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

  • Slide 11/124: Chapter 6. Classification and Prediction (Outline)

    What is classification? What is prediction?
    Issues regarding classification and prediction
    Classification by decision tree induction
    Bayesian classification
    Rule-based classification
    Classification by back propagation
    Support Vector Machines (SVM)
    Associative classification
    Lazy learners (or learning from your neighbors)
    Other classification methods
    Prediction
    Accuracy and error measures
    Ensemble methods
    Model selection
    Summary

  • Slide 12/124: Decision Tree Induction: Training Dataset

    age     income  student  credit_rating  buys_computer
    <=30    high    no       fair           no
    <=30    high    no       excellent      no
    31..40  high    no       fair           yes
    >40     medium  no       fair           yes
    >40     low     yes      fair           yes
    >40     low     yes      excellent      no
    31..40  low     yes      excellent      yes
    <=30    medium  no       fair           no
    <=30    low     yes      fair           yes
    >40     medium  yes      fair           yes
    <=30    medium  yes      excellent      yes
    31..40  medium  no       excellent      yes
    31..40  high    yes      fair           yes
    >40     medium  no       excellent      no

    This follows an example of Quinlan's ID3 (Playing Tennis)

  • Slide 13/124: Output: A Decision Tree for buys_computer

    age?
      <=30:   student?
                no  -> no
                yes -> yes
      31..40: yes
      >40:    credit_rating?
                excellent -> no
                fair      -> yes

  • Slide 14/124: Algorithm for Decision Tree Induction

    Basic algorithm (a greedy algorithm; a sketch of the recursion follows below)
      Tree is constructed in a top-down recursive divide-and-conquer manner
      At start, all the training examples are at the root
      Attributes are categorical (if continuous-valued, they are discretized in advance)
      Examples are partitioned recursively based on selected attributes
      Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
    Conditions for stopping partitioning
      All samples for a given node belong to the same class
      There are no remaining attributes for further partitioning: majority voting is employed for classifying the leaf
      There are no samples left
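    A compact, assumed sketch (not from the slides) of the greedy top-down divide-and-conquer induction for categorical attributes; select_attribute stands in for any attribute selection measure such as information gain.

    from collections import Counter

    def induce_tree(rows, labels, attributes, select_attribute):
        if len(set(labels)) == 1:                    # all samples belong to the same class
            return labels[0]
        if not attributes:                           # no remaining attributes: majority voting
            return Counter(labels).most_common(1)[0][0]
        best = select_attribute(rows, labels, attributes)
        tree = {best: {}}
        remaining = [a for a in attributes if a != best]
        for value in set(r[best] for r in rows):     # partition recursively on the chosen attribute
            part = [(r, l) for r, l in zip(rows, labels) if r[best] == value]
            sub_rows, sub_labels = [r for r, _ in part], [l for _, l in part]
            tree[best][value] = induce_tree(sub_rows, sub_labels, remaining, select_attribute)
        return tree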

  • Slide 15/124: Why the Decision Tree Is Popular

    Does not require domain knowledge
    Can handle multidimensional data
    Easy to understand

  • Slide 16/124: Attribute Selection Measure: Information Gain (ID3/C4.5)

    Select the attribute with the highest information gain (least impurity)
    Let p_i be the probability that an arbitrary tuple in D belongs to class C_i, estimated by |C_i,D| / |D|
    Expected information (entropy) needed to classify a tuple in D:
      $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
    Information needed (after using A to split D into v partitions) to classify D:
      $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
    Information gained by branching on attribute A:
      $Gain(A) = Info(D) - Info_A(D)$
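    A small assumed Python sketch of these formulas (function names are mine, not the slides'): Info(D), Info_A(D), and Gain(A) for a categorical attribute.

    from collections import Counter
    from math import log2

    def info(labels):
        """Info(D) = -sum_i p_i * log2(p_i), the entropy of the class distribution."""
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def info_gain(rows, labels, attr):
        """Gain(A) = Info(D) - sum_j |D_j|/|D| * Info(D_j) for categorical attribute attr."""
        n = len(labels)
        partitions = {}
        for row, label in zip(rows, labels):
            partitions.setdefault(row[attr], []).append(label)
        info_a = sum(len(part) / n * info(part) for part in partitions.values())
        return info(labels) - info_a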

  • Slide 17/124: Attribute Selection: Information Gain

    Class P: buys_computer = "yes"; Class N: buys_computer = "no"

    $Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$

    $Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$

    Here $\frac{5}{14} I(2,3)$ means "age <= 30" has 5 out of 14 samples, with 2 yes'es and 3 no's. Hence

    $Gain(age) = Info(D) - Info_{age}(D) = 0.246$

    Similarly,
      Gain(income) = 0.029
      Gain(student) = 0.151
      Gain(credit_rating) = 0.048

  • Slide 18/124: Computing Information Gain for Continuous-Valued Attributes

    Let attribute A be a continuous-valued attribute
    Must determine the best split point for A
      Sort the values of A in increasing order
      Typically, the midpoint between each pair of adjacent values is considered as a possible split point
        (a_i + a_{i+1}) / 2 is the midpoint between the values of a_i and a_{i+1}
      The point with the minimum expected information requirement for A is selected as the split point for A
    Split:
      D1 is the set of tuples in D satisfying A <= split point, and D2 is the set of tuples in D satisfying A > split point
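    An assumed sketch of this procedure (names are mine): candidate splits are midpoints of adjacent sorted values, and the split with the minimum expected information requirement is kept. The info parameter is an entropy helper such as the one in the earlier sketch.

    def best_split_point(values, labels, info):
        pairs = sorted(zip(values, labels))
        n = len(pairs)
        best = None
        for i in range(n - 1):
            if pairs[i][0] == pairs[i + 1][0]:
                continue                                   # identical adjacent values give no split
            split = (pairs[i][0] + pairs[i + 1][0]) / 2    # midpoint (a_i + a_{i+1}) / 2
            left = [l for v, l in pairs if v <= split]
            right = [l for v, l in pairs if v > split]
            expected = (len(left) * info(left) + len(right) * info(right)) / n
            if best is None or expected < best[0]:
                best = (expected, split)
        return best   # (expected information requirement, split point), or None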

  • Slide 19/124: Gain Ratio for Attribute Selection (C4.5)

    The information gain measure is biased towards attributes with a large number of values
    C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)
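    For reference (the slide does not spell the formulas out), the standard C4.5 definitions are:

    $SplitInfo_A(D) = -\sum_{j=1}^{v} \frac{|D_j|}{|D|} \log_2\!\left(\frac{|D_j|}{|D|}\right), \qquad GainRatio(A) = \frac{Gain(A)}{SplitInfo_A(D)}$

    The attribute with the maximum gain ratio is selected as the splitting attribute.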

  • Slide 20/124: Gini Index (CART, IBM IntelligentMiner)

    If a data set D contains examples from n classes, the gini index gini(D) is defined as
      $gini(D) = 1 - \sum_{j=1}^{n} p_j^2$
    where p_j is the relative frequency of class j in D
    If a data set D is split on A into two subsets D1 and D2, the gini index gini_A(D) is defined as
      $gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$
    Reduction in impurity:
      $\Delta gini(A) = gini(D) - gini_A(D)$
    The attribute that provides the smallest gini_split(D) (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
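    An assumed Python sketch of these two quantities (function names are mine); the example value matches the computation on the next slide.

    from collections import Counter

    def gini(labels):
        """gini(D) = 1 - sum_j p_j^2."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def gini_split(labels1, labels2):
        """gini_A(D) = |D1|/|D| * gini(D1) + |D2|/|D| * gini(D2) for a binary split."""
        n = len(labels1) + len(labels2)
        return len(labels1) / n * gini(labels1) + len(labels2) / n * gini(labels2)

    print(round(gini(["yes"] * 9 + ["no"] * 5), 3))   # 0.459, as on the next slide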

  • Slide 21/124: Gini Index (CART, IBM IntelligentMiner): Example

    Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no":
      $gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$
    Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 in D2: {high}:
      $gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443$
    The other two groupings, {low, high} and {medium, high}, give higher values (about 0.458 and 0.450), so {low, medium} is chosen as the splitting subset for income since its Gini index is the lowest
    All attributes are assumed continuous-valued
    May need other tools, e.g., clustering, to get the possible split values
    Can be modified for categorical attributes

  • Slide 22/124: Comparing Attribute Selection Measures

    The three measures, in general, return good results, but:
    Information gain:
      biased towards multivalued attributes

  • Slide 23/124: Other Attribute Selection Measures

    CHAID: a popular decision tree algorithm; measure based on the chi-square test for independence
    C-SEP: performs better than info. gain and gini index in certain cases

  • Slide 24/124: Overfitting and Tree Pruning

    Overfitting: an induced tree may overfit the training data
      Too many branches, some may reflect anomalies due to noise or outliers
      Poor accuracy for unseen samples
    Two approaches to avoid overfitting
      Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
        Difficult to choose an appropriate threshold
      Postpruning: remove branches from a "fully grown" tree; get a sequence of progressively pruned trees
        Use a set of data different from the training data to decide which is the "best pruned tree"

  • Slide 25/124: Enhancements to Basic Decision Tree Induction

    Allow for continuous-valued attributes
      Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
    Handle missing attribute values
      Assign the most common value of the attribute
      Assign probability to each of the possible values
    Attribute construction
      Create new attributes based on existing ones that are sparsely represented
      This reduces fragmentation, repetition, and replication

  • Slide 26/124: Classification in Large Databases

    Classification: a classical problem extensively studied by statisticians and machine learning researchers
    Scalability: classifying data sets with millions of examples and hundreds of attributes with reasonable speed
    Why decision tree induction in data mining?
      relatively faster learning speed (than other classification methods)
      convertible to simple and easy-to-understand classification rules
      can use SQL queries for accessing databases
      comparable classification accuracy with other methods

  • Slide 27/124: Scalable Decision Tree Induction Methods

    SLIQ (EDBT'96, Mehta et al.)
      Builds an index for each attribute; only the class list and the current attribute list reside in memory
    SPRINT (VLDB'96, J. Shafer et al.)
      Constructs an attribute list data structure
    PUBLIC (VLDB'98, Rastogi & Shim)
      Integrates tree splitting and tree pruning: stop growing the tree earlier
    RainForest (VLDB'98)

  • Slide 28/124: Scalability Framework for RainForest

    Separates the scalability aspects from the criteria that determine the quality of the tree
    Builds an AVC-list: AVC (Attribute, Value, Class_label)
    AVC-set (of an attribute X)
      Projection of the training dataset onto the attribute X and the class label, where counts of individual class labels are aggregated
    AVC-group (of a node n)
      Set of AVC-sets of all predictor attributes at the node n

  • Slide 29/124: RainForest: Training Set and Its AVC Sets

    Training examples: the buys_computer dataset from Slide 12.

    AVC-set on Age                  AVC-set on income
    Age     Buy_Computer            income  Buy_Computer
            yes   no                        yes   no
    <=30    2     3                 high    2     2
    31..40  4     0                 medium  4     2
    >40     3     2                 low     3     1

    AVC-set on Student              AVC-set on credit_rating
    student Buy_Computer            credit_rating  Buy_Computer
            yes   no                               yes   no
    yes     6     1                 fair           6     2
    no      3     4                 excellent      3     3

  • Slide 30/124: Data Cube-Based Decision-Tree Induction

    Integration of generalization with decision-tree induction (Kamber et al. '97)
    Classification at primitive concept levels
      E.g., precise temperature, humidity, outlook, etc.
      Low-level concepts, scattered classes, bushy classification trees
      Semantic interpretation problems
    Cube-based multi-level classification
      Relevance analysis at multi-levels
      Information-gain analysis with dimension + level

  • Slide 31/124: BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)

    Use a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
    Each subset is used to create a tree, resulting in several trees
    These trees are examined and used to construct a new tree T'
      It turns out that T' is very close to the tree that would be generated using the whole data set together
    Adv: requires only two scans of the DB; an incremental algorithm

  • Slide 32/124: Presentation of Classification Results (figure slide)

  • Slide 33/124: Visualization of a Decision Tree (figure slide)

  • Slide 34/124: Interactive Visual Mining by Perception-Based Classification (PBC)

  • Slide 35/124: Chapter 6. Classification and Prediction (Outline)

    What is classification? What is prediction?
    Issues regarding classification and prediction
    Classification by decision tree induction
    Bayesian classification
    Rule-based classification
    Classification by back propagation
    Support Vector Machines (SVM)
    Associative classification
    Lazy learners (or learning from your neighbors)
    Other classification methods
    Prediction
    Accuracy and error measures
    Ensemble methods
    Model selection
    Summary

  • Slide 36/124: Bayesian Classification: Why?

    A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
    Foundation: based on Bayes' Theorem
    Performance: a simple Bayesian classifier, the naive Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
    Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
    Standard: even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured

  • Slide 37/124: Bayesian Theorem: Basics

    Let X be a data sample ("evidence"): class label is unknown
    Let H be a hypothesis that X belongs to class C
    Classification is to determine P(H|X), the probability that the hypothesis holds given the observed data sample X
    P(H) (prior probability): the initial probability
      E.g., X will buy a computer, regardless of age, income, ...
    P(X): probability that the sample data is observed
    P(X|H) (posteriori probability): the probability of observing the sample X, given that the hypothesis holds
      E.g., given that X will buy a computer, the prob. that X is 31..40 with medium income

  • Slide 38/124: Bayesian Theorem

    Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows Bayes' theorem:
      $P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$
    Informally, this can be written as
      posteriori = likelihood x prior / evidence
    Predicts that X belongs to C_i iff the probability P(C_i|X) is the highest among all the P(C_k|X) for all the k classes
    Practical difficulty: requires initial knowledge of many probabilities; significant computational cost

  • Slide 39/124: Towards a Naive Bayesian Classifier

    Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector X = (x1, x2, ..., xn)
    Suppose there are m classes C1, C2, ..., Cm
    Classification is to derive the maximum posteriori, i.e., the maximal P(C_i|X)
    This can be derived from Bayes' theorem:
      $P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$
    Since P(X) is constant for all classes, only $P(X|C_i)\,P(C_i)$ needs to be maximized

  • Slide 40/124: Derivation of the Naive Bayes Classifier

    A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
      $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$
    This greatly reduces the computation cost: only counts the class distribution
    If A_k is categorical, P(x_k|C_i) is the # of tuples in C_i having value x_k for A_k, divided by |C_i,D| (# of tuples of C_i in D)
    If A_k is continuous-valued, P(x_k|C_i) is usually computed based on a Gaussian (normal) distribution

  • Slide 41/124: Naive Bayesian Classifier: Training Dataset

    Class:
      C1: buys_computer = 'yes'
      C2: buys_computer = 'no'
    Data sample:
      X = (age <= 30, income = medium, student = yes, credit_rating = fair)

  • Slide 42/124: Example

    P(C_i):
      P(buys_computer = "yes") = 9/14 = 0.643
      P(buys_computer = "no") = 5/14 = 0.357
    Compute P(X|C_i) for each class:
      P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
      P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
      P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
      P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
      P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
      P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
      P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
      P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
    X = (age <= 30, income = medium, student = yes, credit_rating = fair)
    P(X|C_i):
      P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
      P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
    P(X|C_i) x P(C_i):
      P(X | buys_computer = "yes") x P(buys_computer = "yes") = 0.028
      P(X | buys_computer = "no") x P(buys_computer = "no") = 0.007
    Therefore, X belongs to class ("buys_computer = yes")
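    An assumed sketch reproducing the arithmetic above in Python: class priors and per-attribute conditional probabilities multiplied under the naive independence assumption (dictionary keys are mine).

    priors = {"yes": 9 / 14, "no": 5 / 14}
    cond = {   # P(attribute value | class), read off the training data
        "yes": {"age<=30": 2 / 9, "income=medium": 4 / 9, "student=yes": 6 / 9, "credit=fair": 6 / 9},
        "no":  {"age<=30": 3 / 5, "income=medium": 2 / 5, "student=yes": 1 / 5, "credit=fair": 2 / 5},
    }
    x = ["age<=30", "income=medium", "student=yes", "credit=fair"]
    for c in ("yes", "no"):
        p_x_given_c = 1.0
        for attr in x:
            p_x_given_c *= cond[c][attr]
        print(c, round(p_x_given_c * priors[c], 3))   # yes: ~0.028, no: ~0.007 -> predict 'yes'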

  • Slide 43/124: Avoiding the 0-Probability Problem

    Naive Bayesian prediction requires each conditional prob. to be non-zero; otherwise, the predicted prob. will be zero:
      $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$
    Ex. Suppose a dataset with 1000 tuples: income = low (0), income = medium (990), and income = high (10)
    Use the Laplacian correction (or Laplacian estimator)
      Adding 1 to each case:
        Prob(income = low) = 1/1003
        Prob(income = medium) = 991/1003
        Prob(income = high) = 11/1003
      The "corrected" prob. estimates are close to their "uncorrected" counterparts

  • Slide 44/124: Naive Bayesian Classifier: Comments

    Advantages
      Easy to implement

    45/124

    November 16, 2015Data Mining: Concepts and

    Techniques 75

    9aesian 9elie ?etwor;s

    &a%esian be!ie* netor a!!os a subseto* the

    variab!es conditiona!!% independent

    . graphica! mode! o* causa! re!ationships

    Aepresents dependenc% among the variab!es

  • Slide 46/124: Bayesian Belief Networks: Example

    (Figure: belief network with nodes FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea)

    The conditional probability table (CPT) for the variable LungCancer:

              (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
      LC      0.8      0.5       0.7       0.1
      ~LC     0.2      0.5       0.3       0.9

    The CPT shows the conditional probability for each possible combination of its parents
    Derivation of the probability of a particular combination of values of X, from the CPT:
      $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(x_i))$

  • Slide 47/124: Training Bayesian Networks

    Several scenarios:

  • Slide 48/124: Chapter 6. Classification and Prediction (Outline)

    What is classification? What is prediction?
    Issues regarding classification and prediction
    Classification by decision tree induction
    Bayesian classification
    Rule-based classification
    Classification by back propagation
    Support Vector Machines (SVM)
    Associative classification
    Lazy learners (or learning from your neighbors)
    Other classification methods
    Prediction
    Accuracy and error measures
    Ensemble methods
    Model selection
    Summary

  • Slide 49/124: Using IF-THEN Rules for Classification

    Represent the knowledge in the form of IF-THEN rules
      R: IF age = youth AND student = yes THEN buys_computer = yes
      Rule antecedent/precondition vs. rule consequent
    Assessment of a rule: coverage and accuracy
      n_covers = # of tuples covered by R
      n_correct = # of tuples correctly classified by R
      coverage(R) = n_covers / |D|   /* D: training data set */
      accuracy(R) = n_correct / n_covers
    If more than one rule is triggered, need conflict resolution
      Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
      Class-based ordering: decreasing order of prevalence or misclassification cost per class
      Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
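    An assumed Python sketch of the rule-assessment measures above (names are mine): coverage(R) and accuracy(R) of a rule over a training set of (tuple, class) pairs.

    def assess_rule(antecedent, consequent, data):
        """antecedent: predicate over a tuple; consequent: the class label the rule predicts."""
        covered = [(t, c) for t, c in data if antecedent(t)]
        n_covers = len(covered)
        n_correct = sum(1 for _, c in covered if c == consequent)
        coverage = n_covers / len(data)
        accuracy = n_correct / n_covers if n_covers else 0.0
        return coverage, accuracy

    # e.g. R: IF age = youth AND student = yes THEN buys_computer = yes
    rule_r = lambda t: t["age"] == "youth" and t["student"] == "yes"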

  • Slide 50/124: Rule Extraction from a Decision Tree

    Rules are easier to understand than large trees
    One rule is created for each path from the root to a leaf
    Each attribute-value pair along a path forms a conjunction; the leaf holds the class prediction
    Rules are mutually exclusive and exhaustive

    Example: rule extraction from our buys_computer decision tree (Slide 13):
      IF age = young AND student = no              THEN buys_computer = no
      IF age = young AND student = yes             THEN buys_computer = yes
      IF age = mid-age                             THEN buys_computer = yes
      IF age = old AND credit_rating = excellent   THEN buys_computer = no
      IF age = old AND credit_rating = fair        THEN buys_computer = yes

  • Slide 51/124: Rule Extraction from the Training Data

    Sequential covering algorithm: extracts rules directly from the training data
    Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
    Rules are learned sequentially; each rule for a given class C_i will cover many tuples of C_i but none (or few) of the tuples of other classes
    Steps:
      Rules are learned one at a time
      Each time a rule is learned, the tuples covered by the rule are removed
      The process repeats on the remaining tuples until a termination condition holds, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
    Comp. w. decision-tree induction: learning a set of rules simultaneously

  • Slide 52/124: How to Learn One Rule

    Start with the most general rule possible: condition = empty
    Add new attributes by adopting a greedy depth-first strategy
      Pick the one that most improves the rule quality
    Rule-quality measures: consider both coverage and accuracy
      Foil-gain (in FOIL & RIPPER): assesses info_gain by extending the condition
        $FOIL\_Gain = pos' \times \left(\log_2\frac{pos'}{pos' + neg'} - \log_2\frac{pos}{pos + neg}\right)$
      It favors rules that have high accuracy and cover many positive tuples
    Rule pruning based on an independent set of test tuples
      $FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$
      pos/neg are the # of positive/negative tuples covered by R
      If FOIL_Prune is higher for the pruned version of R, prune R
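    An assumed sketch of the two measures above in Python (pos_/neg_ denote pos'/neg', the counts after extending the rule):

    from math import log2

    def foil_gain(pos, neg, pos_, neg_):
        """Gain from extending rule R (covering pos/neg) to R' (covering pos_/neg_)."""
        return pos_ * (log2(pos_ / (pos_ + neg_)) - log2(pos / (pos + neg)))

    def foil_prune(pos, neg):
        """Pruning criterion for a rule covering pos positive and neg negative tuples."""
        return (pos - neg) / (pos + neg)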

  • Slide 53/124: Chapter 6. Classification and Prediction (Outline)

    What is classification? What is prediction?
    Issues regarding classification and prediction
    Classification by decision tree induction
    Bayesian classification
    Rule-based classification
    Classification by back propagation
    Support Vector Machines (SVM)
    Associative classification
    Lazy learners (or learning from your neighbors)
    Other classification methods
    Prediction
    Accuracy and error measures
    Ensemble methods
    Model selection
    Summary

  • Slide 54/124: Classification: A Mathematical Mapping

    Classification: predicts categorical class labels
    E.g., personal homepage classification
      x_i = (x1, x2, x3, ...), y_i = +1 or -1
      x1: # of occurrences of the word "homepage"
      x2: # of occurrences of the word "welcome"
    Mathematically:
      x in X = R^n, y in Y = {+1, -1}
      We want a function f: X -> Y

  • Slide 55/124: Linear Classification

    Binary classification problem
    (Figure: points of class 'x' above a red separating line, points of class 'o' below it)
    The data above the red line belongs to class 'x'
    The data below the red line belongs to class 'o'
    Examples: SVM, Perceptron, Probabilistic Classifiers

  • Slide 56/124: Discriminative Classifiers

    Advantages
      prediction accuracy is generally high
        as compared to Bayesian methods, in general
      robust, works when training examples contain errors
      fast evaluation of the learned target function
        Bayesian networks are normally slow
    Criticism
      long training time
      difficult to understand the learned function (weights)
        Bayesian networks can be used easily for pattern discovery
      not easy to incorporate domain knowledge
        easy in the form of priors on the data or distributions

  • Slide 57/124: Perceptron & Winnow

    Vectors: x, w; scalars: x, y, b
    Input: {(x1, y1), ...}
    Output: classification function f(x)
      f(x_i) > 0 for y_i = +1
      f(x_i) < 0 for y_i = -1
    f(x) = w . x + b = 0
      or w1*x1 + w2*x2 + b = 0
    (Figure: separating line in the (x1, x2) plane)
    Perceptron: update W additively
    Winnow: update W multiplicatively
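    An assumed sketch of the perceptron's additive update rule for f(x) = w.x + b: on a misclassified example, move w and b towards that example's class (function name and parameters are mine).

    import numpy as np

    def train_perceptron(X, y, lr=1.0, epochs=100):
        """X: (n_samples, n_features) array; y: labels in {+1, -1}."""
        w = np.zeros(X.shape[1])
        b = 0.0
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                    w += lr * yi * xi                # additive update (Winnow would be multiplicative)
                    b += lr * yi
        return w, b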

  • Slide 58/124: Classification by Backpropagation

    Backpropagation: a neural network learning algorithm
    Started by psychologists and neurobiologists to develop and test computational analogues of neurons
    A neural network: a set of connected input/output units where each connection has a weight associated with it
    During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
    Also referred to as connectionist learning due to the connections between units

  • Slide 59/124: Neural Network as a Classifier

    Weakness
      Long training time
      Requires a number of parameters typically best determined empirically, e.g., the network topology or "structure"
      Poor interpretability: difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
    Strength
      High tolerance to noisy data
      Ability to classify untrained patterns
      Well-suited for continuous-valued inputs and outputs
      Successful on a wide array of real-world data
      Algorithms are inherently parallel
      Techniques have recently been developed for the extraction of rules from trained neural networks

  • Slide 60/124: A Neuron (= a perceptron)

    The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
    (Figure: input vector x = (x_0, ..., x_n), weight vector w = (w_0, ..., w_n), bias mu_k, weighted sum, activation function f, output y)
    For example:
      $y = \mathrm{sign}\!\left(\sum_{i=0}^{n} w_i x_i + \mu_k\right)$
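    An assumed sketch of this neuron in Python: a weighted sum of the input vector x plus the bias mu_k, passed through the sign activation shown above (names are mine).

    import numpy as np

    def neuron_output(x, w, mu_k):
        return np.sign(np.dot(w, x) + mu_k)   # y = sign(sum_i w_i * x_i + mu_k)

    print(neuron_output(np.array([0.5, -1.0, 2.0]), np.array([0.4, 0.1, 0.3]), mu_k=-0.2))  # 1.0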

  • Slide 61/124: A Multi-Layer Feed-Forward Neural Network

    (Figure: input vector X feeds an input layer, then a hidden layer, then an output layer producing the output vector; w_ij are the connection weights)

    For a unit j (with learning rate l, output O_j, target T_j, error Err_j, bias theta_j):
      $I_j = \sum_i w_{ij} O_i + \theta_j$
      $O_j = \frac{1}{1 + e^{-I_j}}$
      $Err_j = O_j (1 - O_j)(T_j - O_j)$             (output layer)
      $Err_j = O_j (1 - O_j) \sum_k Err_k\, w_{jk}$   (hidden layer)
      $w_{ij} = w_{ij} + (l)\, Err_j\, O_i$
      $\theta_j = \theta_j + (l)\, Err_j$

  • Slide 62/124: How a Multi-Layer Neural Network Works

    The inputs to the network correspond to the attributes measured for each training tuple
    Inputs are fed simultaneously into the units making up the input layer
    They are then weighted and fed simultaneously to a hidden layer
      The number of hidden layers is arbitrary, although usually only one
    The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction
    The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
    From a statistical point of view, networks perform nonlinear regression

  • Slide 63/124: Defining a Network Topology

    First decide the network topology: # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer
    Normalize the input values for each attribute measured in the training tuples to 0.0-1.0
      One input unit per domain value, each initialized to 0
    Output: if for classification and more than two classes, one output unit per class is used
    Once a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights

  • Slide 64/124: Backpropagation

    Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
    For each training tuple, the weights are modified to minimize the error between the network's prediction and the actual target value

  • Slide 65/124: Backpropagation and Interpretability

    Efficiency of backpropagation: each epoch (one iteration through the training set) takes O(|D| x w), with |D| tuples and w weights, but the # of epochs can be exponential in n, the number of inputs, in the worst case
    Rule extraction from networks: network pruning
      Simplify the network structure by removing weighted links that have the least effect on the trained network
      Then perform link, unit, or activation value clustering
      The sets of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers
    Sensitivity analysis: assess the impact that a given input variable has on a network output; the knowledge gained from this analysis can be represented in rules

  • Slide 66/124: Chapter 6. Classification and Prediction (Outline)

    What is classification? What is prediction?
    Issues regarding classification and prediction
    Classification by decision tree induction
    Bayesian classification
    Rule-based classification
    Classification by back propagation
    Support Vector Machines (SVM)
    Associative classification
    Lazy learners (or learning from your neighbors)
    Other classification methods
    Prediction
    Accuracy and error measures
    Ensemble methods
    Model selection
    Summary

  • Slide 67/124: SVM: Support Vector Machines

    A new classification method for both linear and nonlinear data
    It uses a nonlinear mapping to transform the original training data into a higher dimension
    With the new dimension, it searches for the linear optimal separating hyperplane (i.e., "decision boundary")
    With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
    SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)

  • Slide 68/124: SVM: History and Applications

    Vapnik and colleagues (1992); groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s
    Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
    Used both for classification and prediction
    Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests

  • Slide 69/124: SVM: General Philosophy

    (Figure: support vectors defining a small margin vs. a large margin around the separating hyperplane)

  • Slide 70/124: SVM: Margins and Support Vectors (figure slide)

  • Slide 71/124: SVM: When Data Is Linearly Separable

    Let data D be (X1, y1), ..., (X_|D|, y_|D|), where X_i is the set of training tuples associated with the class labels y_i
    There are an infinite number of lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
    SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)

  • Slide 72/124: SVM: Linearly Separable

    A separating hyperplane can be written as
      W . X + b = 0
    where W = {w1, w2, ..., wn} is a weight vector and b a scalar (bias)
    For 2-D it can be written as
      w0 + w1*x1 + w2*x2 = 0
    The hyperplanes defining the sides of the margin:
      H1: w0 + w1*x1 + w2*x2 >= 1  for y_i = +1, and
      H2: w0 + w1*x1 + w2*x2 <= -1 for y_i = -1
    Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
    This becomes a constrained (convex) quadratic optimization problem

  • Slide 73/124: Why Is SVM Effective on High Dimensional Data

    The complexity of the trained classifier is characterized by the # of support vectors rather than the dimensionality of the data
    The support vectors are the essential or critical training examples; they lie closest to the decision boundary (MMH)
    If all other training examples are removed and the training is repeated, the same separating hyperplane would be found
    The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
    Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high

  • Slide 74/124: SVM: Linearly Inseparable

    Transform the original input data into a higher dimensional space
    Search for a linear separating hyperplane in the new space

  • Slide 75/124: SVM: Kernel Functions

    Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(X_i, X_j) to the original data, i.e.,
      K(X_i, X_j) = Phi(X_i) . Phi(X_j)
    Typical kernel functions (examples follow below)
    SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
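    An assumed sketch of two commonly used kernels K(X_i, X_j); the slide's own list of typical kernels did not survive extraction, so these are standard examples (polynomial and Gaussian RBF), not necessarily the ones shown.

    import numpy as np

    def polynomial_kernel(xi, xj, degree=3, c=1.0):
        return (np.dot(xi, xj) + c) ** degree            # (Xi . Xj + c)^degree

    def rbf_kernel(xi, xj, gamma=0.5):
        return np.exp(-gamma * np.sum((xi - xj) ** 2))   # exp(-gamma * ||Xi - Xj||^2)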

  • Slide 76/124: Scaling SVM by Hierarchical Micro-Clustering

    SVM is not scalable to the number of data objects in terms of training time and memory usage
    "Classifying Large Datasets Using SVMs with Hierarchical Clusters" by Hwanjo Yu, Jiong Yang, Jiawei Han, KDD'03
    CB-SVM (Clustering-Based SVM)

  • Slide 77/124: CB-SVM: Clustering-Based SVM

    Training data sets may not even fit in memory
    Read the data set once (minimizing disk access)
      Construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory
      The statistical summary maximizes the benefit of learning the SVM
    The summary plays a role in indexing SVMs
    Essence of micro-clustering (hierarchical indexing structure)
      Use a micro-cluster hierarchical indexing structure
      Provide finer samples closer to the boundary and coarser samples farther from the boundary
      Selective de-clustering to ensure high accuracy

  • Slide 78/124: CF-Tree: Hierarchical Micro-cluster (figure slide)

  • Slide 79/124: CB-SVM Algorithm: Outline

    Construct two CF-trees from the positive and negative data sets independently
      Needs one scan of the data set
    Train an SVM from the centroids of the root entries
    De-cluster the entries near the boundary into the next level
      The children entries de-clustered from the parent entries are accumulated into the training set with the non-declustered parent entries
    Train an SVM again from the centroids of the entries in the training set
    Repeat until nothing is accumulated

  • Slide 80/124: Selective Declustering

    The CF tree is a suitable base structure for selective declustering
    De-cluster only the clusters E_i such that
      D_i - R_i < D_s, where D_i is the distance from the boundary to the center point of E_i and R_i is the radius of E_i
    Decluster only the clusters whose subclusters have the possibility to be the "support cluster" of the boundary
      "Support cluster": the cluster whose centroid is a support vector

  • Slide 81/124: Experiment on Synthetic Dataset (figure slide)

  • Slide 82/124: Experiment on a Large Data Set (figure slide)

  • Slide 83/124: SVM vs. Neural Network

    SVM
      Relatively new concept
      Deterministic algorithm
      Nice generalization properties

  • Slide 84/124: SVM Related Links

    SVM website: http://www.kernel-machines.org/
    Representative implementations
      LIBSVM: an efficient implementation of SVM, multi-class classifications, nu-SVM, one-class SVM, including also various interfaces with Java, Python, etc.
      SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
      SVM-torch: another recent implementation also written in C
  • Slide 85/124: Literature

    "Statistical Learning Theory" by Vapnik: extremely hard to understand, containing many errors too
    C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Knowledge Discovery and Data Mining, 2(2), 1998. http://www.kernel-machines.org/papers/Burges98.ps.gz
      Better than Vapnik's book, but still written too hard for an introduction, and the examples are not intuitive
    The book "An Introduction to Support Vector Machines" by N. Cristianini and J. Shawe-Taylor
      Also written hard for an introduction, but the explanation of Mercer's theorem is better than the above literature
    The neural network book by Haykin
      Contains one nice chapter of SVM introduction
  • Slide 86/124: Chapter 6. Classification and Prediction (Outline)

    What is classification? What is prediction?
    Issues regarding classification and prediction
    Classification by decision tree induction
    Bayesian classification
    Rule-based classification
    Classification by back propagation
    Support Vector Machines (SVM)
    Associative classification
    Lazy learners (or learning from your neighbors)
    Other classification methods
    Prediction
    Accuracy and error measures
    Ensemble methods
    Model selection
    Summary

  • Slide 87/124: Associative Classification

    Associative classification
      Association rules are generated and analyzed for use in classification
      Search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
      Classification: based on evaluating a set of rules in the form of
        p1 ^ p2 ... ^ pl -> "A_class = C" (conf, sup)
    Why effective?
      It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
      In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5

  • Slide 88/124: Typical Associative Classification Methods

    CBA (Classification By Association: Liu, Hsu & Ma, KDD'98)
      Mine possible association rules in the form of
        Cond-set (a set of attribute-value pairs) -> class label
      Build classifier: organize rules according to decreasing precedence based on confidence and then support
    CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
      Classification: statistical analysis on multiple rules
    CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)

  • Slide 89/124: A Closer Look at CMAR

    CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
    Efficiency: uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset
    Rule pruning whenever a rule is inserted into the tree

  • Slide 90/124: (figure slide)

  • Slide 91/124: Chapter 6. Classification and Prediction (Outline)

    What is classification? What is prediction?
    Issues regarding classification and prediction
    Classification by decision tree induction
    Bayesian classification
    Rule-based classification
    Classification by back propagation
    Support Vector Machines (SVM)
    Associative classification
    Lazy learners (or learning from your neighbors)
    Other classification methods
    Prediction
    Accuracy and error measures
    Ensemble methods
    Model selection
    Summary

  • Slide 92/124: Lazy vs. Eager Learning

  • Slide 93/124: Lazy Learner: Instance-Based Methods

    Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
    Typical approaches
      k-nearest neighbor approach
        Instances represented as points in a Euclidean space
      Locally weighted regression
        Constructs a local approximation
      Case-based reasoning
        Uses symbolic representations and knowledge-based inference

  • Slide 94/124: The k-Nearest Neighbor Algorithm

    All instances correspond to points in the n-D space
    The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
    The target function could be discrete- or real-valued
    For discrete-valued targets, k-NN returns the most common value among the k training examples nearest to x_q
    Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
    (Figure: query point x_q among positive and negative training examples)
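    An assumed Python sketch of k-NN classification as described above: Euclidean distance, then a majority vote among the k nearest training examples (names are mine).

    from collections import Counter
    import numpy as np

    def knn_classify(X_train, y_train, x_q, k=3):
        dists = np.linalg.norm(X_train - x_q, axis=1)    # Euclidean distance to the query x_q
        nearest = np.argsort(dists)[:k]                   # indices of the k nearest neighbors
        return Counter(y_train[i] for i in nearest).most_common(1)[0][0]   # majority vote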

    &iscussion on t"e k-??Al it"

  • 7/24/2019 Data Mining Classification and Prediction by Dr. Tanvir Ahmed

    95/124

    November 16, 2015 Data Mining: Concepts andTechniques @5

    Al!orit"+

    3NN *or rea!3va!ued prediction *or a given unnontup!e

    Aeturns the mean va!ues o* the$nearest neighbors

    Distance3eighted nearest neighbor a!gorithm

    eight the contribution o* each o* the neighborsaccording to their distance to the quer%xq

  • Slide 96/124: Case-Based Reasoning (CBR)

    CBR: uses a database of problem solutions to solve new problems
    Stores symbolic descriptions (tuples or cases), not points in a Euclidean space
    Applications: customer service (product-related diagnosis), legal ruling
    Methodology
      Instances represented by rich symbolic descriptions (e.g., function graphs)
      Search for similar cases; multiple retrieved cases may be combined
      Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
    Challenges
      Find a good similarity metric
      Indexing based on syntactic similarity measures and, on failure, backtracking and adapting to additional cases

  • 7/24/2019 Data Mining Classification and Prediction by Dr. Tanvir Ahmed

    97/124

    November 16, 2015 Data Mining: Concepts andTechniques @

    Prediction

    hat is c!assi"cation# hat

    is prediction#

    $ssues regarding

    c!assi"cation and prediction

    C!assi"cation b% decision

    tree induction

    &a%esian c!assi"cation

    Au!e3based c!assi"cation

    C!assi"cation b% bac

    propagation

    upport ?ector Machines

    )?M+

    .ssociative c!assi"cation

    'a(% !earners )or !earning *rom

    %our neighbors+ ther c!assi"cation methods

    -rediction

    .ccurac% and error measures

    /nsemb!e methods

    Mode! se!ection

    ummar%

Genetic Algorithms (GA)


Rough Set Approach

Rough sets are used to approximately or "roughly" define equivalence classes
A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) is used to reduce the computational intensity

Fuzzy Set Approaches

Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using a fuzzy membership graph)
Attribute values are converted to fuzzy values; e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
For a given new sample, more than one fuzzy value may apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed, and these sums are combined
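As a small illustration of fuzzy membership, the sketch below maps a crisp income value into {low, medium, high}. The membership-function shapes and the breakpoints (20k, 40k, 50k, 60k, 80k) are made-up values chosen only to show the idea.

```python
def fuzzy_income(income):
    """Map a crisp income to fuzzy memberships in {low, medium, high}."""
    def tri(x, a, b, c):
        # Triangular membership: 0 at a, peak 1 at b, back to 0 at c
        if x <= a or x >= c:
            return 0.0
        return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

    return {
        "low":    max(0.0, min(1.0, (40_000 - income) / 20_000)),   # 1 below 20k, fades out by 40k
        "medium": tri(income, 20_000, 50_000, 80_000),
        "high":   max(0.0, min(1.0, (income - 60_000) / 20_000)),   # starts at 60k, 1 above 80k
    }

print(fuzzy_income(45_000))   # e.g. mostly "medium", neither "low" nor "high"
```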

    C"apter $. Classifcation andP di ti

  • 7/24/2019 Data Mining Classification and Prediction by Dr. Tanvir Ahmed

    101/124

    November 16, 2015 Data Mining: Concepts andTechniques 101

    Prediction

    hat is c!assi"cation# hat

    is prediction#

    $ssues regarding

    c!assi"cation and prediction

    C!assi"cation b% decision

    tree induction

    &a%esian c!assi"cation

    Au!e3based c!assi"cation

    C!assi"cation b% bac

    propagation

    upport ?ector Machines

    )?M+

    .ssociative c!assi"cation

    'a(% !earners )or !earning *rom

    %our neighbors+ ther c!assi"cation methods

    -rediction

    .ccurac% and error measures

    /nsemb!e methods

    Mode! se!ection

    ummar%

    ,"at %s Prediction


(Numerical) prediction is similar to classification: construct a model, then use the model to predict a continuous or ordered value for a given input
Prediction is different from classification:
Classification predicts categorical class labels
Prediction models continuous-valued functions
Major method for prediction is regression: model the relationship between one or more independent (predictor) variables and a dependent (response) variable
Regression analysis: linear and multiple regression; nonlinear regression; other regression methods such as generalized linear models, Poisson regression, log-linear models, and regression trees

Linear Regression


Linear regression: involves a response variable y and a single predictor variable x
y = w0 + w1 x
where w0 (y-intercept) and w1 (slope) are the regression coefficients
Method of least squares: estimates the best-fitting straight line
Multiple linear regression: involves more than one predictor variable
Training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)
Ex. For 2-D data, we may have: y = w0 + w1 x1 + w2 x2
Solvable by an extension of the least-squares method or using software such as SAS or S-Plus
Many nonlinear functions can be transformed into the above

Least-squares estimates of the coefficients for simple linear regression:

w1 = Σ_{i=1..|D|} (xi − x̄)(yi − ȳ) / Σ_{i=1..|D|} (xi − x̄)²
w0 = ȳ − w1 x̄

where x̄ and ȳ are the means of the predictor and response values in the training data D.
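A minimal sketch of the least-squares estimates above; NumPy is assumed and the training data are made up.

```python
import numpy as np

def simple_linear_regression(x, y):
    """Least-squares estimates (w0, w1) for the line y = w0 + w1*x."""
    x_mean, y_mean = x.mean(), y.mean()
    w1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    w0 = y_mean - w1 * x_mean
    return w0, w1

# Made-up training data: a single predictor x and response y
x = np.array([1.0, 3.0, 4.0, 6.0, 8.0, 9.0, 11.0, 14.0])
y = np.array([2.1, 4.9, 6.2, 8.8, 12.1, 13.0, 16.2, 19.9])
w0, w1 = simple_linear_regression(x, y)
print(f"y = {w0:.2f} + {w1:.2f} x")
```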

Nonlinear Regression


Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into a linear regression model. For example,
y = w0 + w1 x + w2 x² + w3 x³
is convertible to linear form with the new variables x2 = x², x3 = x³:
y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as the power function, can also be transformed to a linear model
Some models are intractably nonlinear (e.g., sums of exponential terms); least-squares estimates may still be obtainable through extensive calculation on more complex formulae
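A brief sketch of the polynomial-to-linear transformation: build the extra columns x², x³, then solve the resulting multiple linear regression by ordinary least squares. NumPy's vander/lstsq are assumed, and the data are generated from a made-up cubic.

```python
import numpy as np

def polyfit_via_linear(x, y, degree=3):
    """Fit y = w0 + w1*x + ... + wd*x^d by treating each power of x as a new variable."""
    X = np.vander(x, N=degree + 1, increasing=True)    # columns 1, x, x^2, ..., x^d
    w, *_ = np.linalg.lstsq(X, y, rcond=None)          # ordinary least squares
    return w                                           # [w0, w1, ..., wd]

# Made-up data generated from a known cubic plus a little noise
x = np.linspace(-2.0, 2.0, 50)
y = 1.0 + 2.0 * x - 0.5 * x**2 + 0.1 * x**3 + np.random.default_rng(0).normal(0.0, 0.05, x.size)
print(polyfit_via_linear(x, y))   # roughly [1.0, 2.0, -0.5, 0.1]
```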


    t"er 6e!ression-9ased Models


Regression Trees and Model Trees

Regression tree: proposed in the CART system (Breiman et al. 1984)
CART: Classification And Regression Trees
Each leaf stores a continuous-valued prediction
It is the average value of the predicted attribute for the training tuples that reach the leaf
Model tree: proposed by Quinlan (1992)
Each leaf holds a regression model, a multivariate linear equation for the predicted attribute
A more general case than the regression tree
Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model
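To make the leaf-averaging idea concrete, here is a tiny one-split ("stump") regression tree sketch; real CART trees grow recursively and use more elaborate split and pruning criteria, and the data here are made up.

```python
import numpy as np

def regression_stump(x, y):
    """One-split regression tree: each leaf predicts the mean of its training y values."""
    best = None
    for s in np.unique(x)[:-1]:                     # candidate split points
        left, right = y[x <= s], y[x > s]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, s, left.mean(), right.mean())
    _, split, left_mean, right_mean = best
    return lambda q: left_mean if q <= split else right_mean

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 5.5, 20.0, 21.0, 19.0])
predict = regression_stump(x, y)
print(predict(2.5), predict(11.0))   # roughly 5.5 and 20.0, the two leaf averages
```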

Predictive Modeling in Multidimensional Databases


Predictive modeling: predict data values or construct generalized linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
Minimal generalization
Attribute relevance analysis


Prediction: Categorical Data


    C"apter $. Classifcation andPrediction

  • 7/24/2019 Data Mining Classification and Prediction by Dr. Tanvir Ahmed

    110/124

    November 16, 2015Data Mining: Concepts and

    Techniques 110

    Prediction

    hat is c!assi"cation# hat

    is prediction#

    $ssues regarding

    c!assi"cation and prediction

    C!assi"cation b% decision

    tree induction

    &a%esian c!assi"cation

    Au!e3based c!assi"cation

    C!assi"cation b% bac

    propagation

    upport ?ector Machines

    )?M+

    .ssociative c!assi"cation

    'a(% !earners )or !earning *rom

    %our neighbors+ ther c!assi"cation methods

    -rediction

    .ccurac% and error measures

    /nsemb!e methods

    Mode! se!ection

    ummar%

Classifier Accuracy Measures

                  Predicted C1        Predicted C2
Actual C1         true positives      false negatives
Actual C2         false positives     true negatives


Accuracy of a classifier M, acc(M): the percentage of test set tuples that are correctly classified by the model M
Error rate (misclassification rate) of M = 1 − acc(M)
(Worked example table of per-class recognition rates omitted; its overall accuracy is 95.52%.)
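A short sketch computing accuracy and error rate from a two-class confusion matrix; the counts below are arbitrary placeholders.

```python
def accuracy_from_confusion(tp, fn, fp, tn):
    """Accuracy and error rate from the cells of a two-class confusion matrix."""
    total = tp + fn + fp + tn
    acc = (tp + tn) / total        # correctly classified tuples / all test tuples
    return acc, 1.0 - acc          # error rate = 1 - acc(M)

# Made-up counts: rows of the matrix are actual classes, columns are predictions
acc, err = accuracy_from_confusion(tp=90, fn=10, fp=20, tn=80)
print(f"acc(M) = {acc:.3f}, error rate = {err:.3f}")   # 0.850 and 0.150
```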

Predictor Error Measures


Measuring predictor accuracy: measure how far off the predicted value is from the actual known value
Loss function: measures the error between the actual value yi and the predicted value yi'
Absolute error: |yi − yi'|
Squared error: (yi − yi')²
Test error (generalization error): the average loss over the test set

Mean absolute error: (1/d) Σ_{i=1..d} |yi − yi'|
Mean squared error: (1/d) Σ_{i=1..d} (yi − yi')²
Relative absolute error: Σ_{i=1..d} |yi − yi'| / Σ_{i=1..d} |yi − ȳ|
Relative squared error: Σ_{i=1..d} (yi − yi')² / Σ_{i=1..d} (yi − ȳ)²
where d is the number of test tuples and ȳ is the mean of the actual values yi over the test set
The mean squared error exaggerates the presence of outliers
Popularly, the (square) root mean squared error is used, and similarly the root relative squared error
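A compact sketch of these error measures, assuming NumPy arrays of actual and predicted values; the sample numbers are made up.

```python
import numpy as np

def predictor_errors(y_true, y_pred):
    """Compute the error measures listed above for a set of test predictions."""
    diff = y_true - y_pred
    base = y_true - y_true.mean()                     # deviation from the mean actual value
    return {
        "MAE":  float(np.mean(np.abs(diff))),
        "MSE":  float(np.mean(diff ** 2)),
        "RMSE": float(np.sqrt(np.mean(diff ** 2))),   # back on the original scale of y
        "RAE":  float(np.sum(np.abs(diff)) / np.sum(np.abs(base))),
        "RSE":  float(np.sum(diff ** 2) / np.sum(base ** 2)),
    }

y_true = np.array([3.0, 5.0, 2.5, 7.0])   # made-up actual values
y_pred = np.array([2.5, 5.0, 4.0, 8.0])   # made-up predictions
print(predictor_errors(y_true, y_pred))
```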


Evaluating a Classifier or Predictor (I): Holdout Method


Evaluating a Classifier or Predictor (II)

Bootstrap

Works well with small data sets
Samples the given training tuples uniformly with replacement, i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
There are several bootstrap methods; a common one is the .632 bootstrap
Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set form the test set. About 63.2% of the original data will end up in the bootstrap sample, and the remaining 36.8% will form the test set (since (1 − 1/d)^d ≈ e^(−1) = 0.368)
Repeat the sampling procedure k times; the overall accuracy of the model is the average over the k iterations:
acc(M) = (1/k) Σ_{i=1..k} (0.632 · acc(Mi)_test_set + 0.368 · acc(Mi)_train_set)
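A rough sketch of the .632 bootstrap estimate described above. It assumes `model` is any classifier exposing scikit-learn-style fit(X, y) and predict(X) methods; the function name and the default k are illustrative.

```python
import numpy as np

def bootstrap_632_accuracy(model, X, y, k=10, seed=0):
    """Estimate accuracy with the .632 bootstrap, averaged over k resamples."""
    rng = np.random.default_rng(seed)
    d = len(X)
    scores = []
    for _ in range(k):
        idx = rng.integers(0, d, size=d)             # sample d tuples with replacement
        oob = np.setdiff1d(np.arange(d), idx)        # tuples never drawn form the test set (~36.8%)
        model.fit(X[idx], y[idx])
        acc_train = np.mean(model.predict(X[idx]) == y[idx])
        acc_test = np.mean(model.predict(X[oob]) == y[oob])
        scores.append(0.632 * acc_test + 0.368 * acc_train)
    return float(np.mean(scores))
```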

    C"apter $. Classifcation andPrediction

  • 7/24/2019 Data Mining Classification and Prediction by Dr. Tanvir Ahmed

    115/124

    November 16, 2015Data Mining: Concepts and

    Techniques 115

    Prediction

    hat is c!assi"cation# hat

    is prediction#

    $ssues regarding

    c!assi"cation and prediction

    C!assi"cation b% decision

    tree induction

    &a%esian c!assi"cation

    Au!e3based c!assi"cation

    C!assi"cation b% bac

    propagation

    upport ?ector Machines

    )?M+

    .ssociative c!assi"cation

    'a(% !earners )or !earning *rom

    %our neighbors+ ther c!assi"cation methods

    -rediction

    .ccurac% and error measures

    /nsemb!e methods

    Mode! se!ection

    ummar%

Ensemble Methods: Increasing the Accuracy


Ensemble methods
Use a combination of models to increase accuracy
Combine a series of k learned models, M1, M2, ..., Mk, with the aim of creating an improved model M*
Popular ensemble methods
Bagging: averaging the prediction over a collection of classifiers
Boosting: weighted vote with a collection of classifiers
Ensemble: combining a set of heterogeneous classifiers

Bagging: Bootstrap Aggregation


Analogy: diagnosis based on multiple doctors' majority vote
Training: each classifier Mi is learned from a bootstrap sample Di drawn with replacement from the training data D
Classification: each classifier Mi returns its class prediction; the bagged classifier M* counts the votes and assigns to X the class with the most votes
Prediction: can be applied to the prediction of continuous values by taking the average of the predictions for a given test tuple
Accuracy
Often significantly better than a single classifier derived from D
For noisy data: not considerably worse, and more robust
Proved improved accuracy in prediction
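A minimal bagging sketch under the same fit/predict-estimator assumption as the bootstrap sketch above; integer class labels are assumed for the majority vote, and the function names are illustrative.

```python
import copy
import numpy as np

def bagging_fit(base_model, X, y, k=10, seed=0):
    """Learn k classifiers M1..Mk, each on a bootstrap sample Di of the training data."""
    rng = np.random.default_rng(seed)
    d, models = len(X), []
    for _ in range(k):
        idx = rng.integers(0, d, size=d)      # bootstrap sample Di (with replacement)
        m = copy.deepcopy(base_model)
        m.fit(X[idx], y[idx])
        models.append(m)
    return models

def bagging_predict(models, X):
    """Bagged classifier M*: majority vote over the individual class predictions."""
    votes = np.array([m.predict(X) for m in models])         # shape (k, n_tuples)
    # assumes integer class labels 0..c-1 so bincount can tally the votes
    return np.array([np.bincount(col).argmax() for col in votes.T])
```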

Boosting


Analogy: consult several doctors, based on a combination of weighted diagnoses, with each weight assigned based on the doctor's previous diagnosis accuracy
How boosting works:
Weights are assigned to each training tuple
A series of k classifiers is iteratively learned
After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
The boosting algorithm can be extended for the prediction of continuous values
Compared with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data

AdaBoost (Freund and Schapire, 1997)


Given a data set of d class-labeled tuples (X1, y1), ..., (Xd, yd)
Initially, all tuple weights are set to the same value, 1/d
The error rate of classifier Mi is the sum of the weights of the tuples it misclassifies:
error(Mi) = Σ_j wj · err(Xj), where err(Xj) = 1 if Mi misclassifies Xj and 0 otherwise
The weight of classifier Mi's vote is log((1 − error(Mi)) / error(Mi))
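A rough AdaBoost-style sketch of the weight update implied above: correctly classified tuples are down-weighted by error(Mi)/(1 − error(Mi)) and the weights renormalized. A fit/predict estimator is assumed, the handling of degenerate rounds is one common convention rather than the slide's exact procedure, and prediction with the returned models would be the alpha-weighted vote.

```python
import copy
import numpy as np

def adaboost_fit(base_model, X, y, k=10, seed=0):
    """Learn k classifiers; misclassified tuples gain weight before the next round."""
    rng = np.random.default_rng(seed)
    d = len(X)
    w = np.full(d, 1.0 / d)                              # initially every tuple weighs 1/d
    models, alphas = [], []
    for _ in range(k):
        idx = rng.choice(d, size=d, replace=True, p=w)   # sample Di in proportion to the weights
        m = copy.deepcopy(base_model)
        m.fit(X[idx], y[idx])
        miss = m.predict(X) != y
        error = float(np.sum(w[miss]))                   # error(Mi): sum of weights of misclassified tuples
        if error == 0 or error >= 0.5:                   # degenerate round: skip it (common convention)
            continue
        alphas.append(np.log((1 - error) / error))       # weight of Mi's vote
        w[~miss] *= error / (1 - error)                  # shrink weights of correctly classified tuples
        w /= w.sum()                                     # renormalize so the weights sum to 1
        models.append(m)
    return models, alphas
```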

    C"apter $. Classifcation andPrediction

  • 7/24/2019 Data Mining Classification and Prediction by Dr. Tanvir Ahmed

    120/124

    November 16, 2015Data Mining: Concepts and

    Techniques 120

    Prediction

    hat is c!assi"cation# hat

    is prediction#

    $ssues regarding

    c!assi"cation and prediction

    C!assi"cation b% decision

    tree induction

    &a%esian c!assi"cation

    Au!e3based c!assi"cation

    C!assi"cation b% bac

    propagation

    upport ?ector Machines

    )?M+

    .ssociative c!assi"cation

    'a(% !earners )or !earning *rom

    %our neighbors+ ther c!assi"cation methods

    -rediction

    .ccurac% and error measures

    /nsemb!e methods

    Mode! se!ection

    ummar%

Model Selection: ROC Curves


ROC (Receiver Operating Characteristic) curves: for visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate and the false positive rate
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
The closer the curve is to the diagonal line (i.e., the closer the area is to 0.5), the less accurate the model
The vertical axis represents the true positive rate; the horizontal axis represents the false positive rate; the plot also shows a diagonal line
A model with perfect accuracy will have an area of 1.0
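A small sketch that builds ROC points by ranking test tuples by their positive-class score and computes the area under the curve with the trapezoidal rule; the labels and scores are made up, and NumPy is assumed.

```python
import numpy as np

def roc_points(y_true, scores):
    """TPR/FPR pairs from sweeping a threshold down the score-ranked test tuples."""
    order = np.argsort(-np.asarray(scores))       # most likely positive first
    y = np.asarray(y_true)[order]
    pos, neg = y.sum(), len(y) - y.sum()
    tpr = np.concatenate(([0.0], np.cumsum(y) / pos))
    fpr = np.concatenate(([0.0], np.cumsum(1 - y) / neg))
    return fpr, tpr

def auc(fpr, tpr):
    """Area under the ROC curve by the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

y_true = np.array([1, 1, 0, 1, 0, 0, 1, 0])                    # made-up labels
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.2])   # made-up P(positive)
fpr, tpr = roc_points(y_true, scores)
print(auc(fpr, tpr))   # 1.0 = perfect ranking, 0.5 ≈ the diagonal
```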

    C"apter $. Classifcation andPrediction

  • 7/24/2019 Data Mining Classification and Prediction by Dr. Tanvir Ahmed

    122/124

    November 16, 2015Data Mining: Concepts and

    Techniques 122

    Prediction

    hat is c!assi"cation# hat

    is prediction#

    $ssues regarding

    c!assi"cation and prediction

    C!assi"cation b% decision

    tree induction

    &a%esian c!assi"cation

    Au!e3based c!assi"cation

    C!assi"cation b% bac

    propagation

    upport ?ector Machines

    )?M+

    .ssociative c!assi"cation

    'a(% !earners )or !earning *rom

    %our neighbors+ ther c!assi"cation methods

    -rediction

    .ccurac% and error measures

    /nsemb!e methods

    Mode! se!ection

    ummar%

Summary (I)


    C!assi"cation andpredictionare to *orms o* data ana!%sis that

    can be used to e=tract mode!sdescribing important data c!assesor to predict *uture data trends4

    /Vective and sca!ab!e methods have been deve!oped *or decision

    trees induction, Naive &a%esian c!assi"cation, &a%esian be!ie*

    netor, ru!e3based c!assi"er, &acpropagation, upport ?ector

    Machine )?M+, associative c!assi"cation, nearest neighbor

    c!assi"ers,and case3based reasoning, and other c!assi"cation

    methods such as genetic a!gorithms, rough set and *u((% set

    approaches4

    'inear, non!inear, and genera!i(ed !inear mode!s o* regressioncanbe used *or prediction4 Man% non!inear prob!ems can be converted

    to !inear prob!ems b% per*orming trans*ormations on the predictor

    variab!es4 Aegression treesand mode! treesare a!so used *or

    prediction4

Summary (II)


    trati"ed 3*o!d cross3va!idationis a recommended method *or

    accurac% estimation4 &agging and boostingcan be used to

    increase overa!! accurac% b% !earning and combining a series o*

    individua! mode!s4

    igni"cance testsand AC curvesare use*u! *or mode! se!ection

    There have been numerous comparisons o* the diVerent

    c!assi"cation and prediction methods, and the matter remains a

    research topic

    No sing!e method has been *ound to be superior over a!! others *or

    a!! data sets $ssues such as accurac%, training time, robustness, interpretabi!it%,