Data Mining Classification and Prediction by Dr. Tanvir Ahmed
-
7/24/2019 Data Mining Classification and Prediction by Dr. Tanvir Ahmed
November 16, 2015
Data Mining: Concepts and Techniques
Classification and Prediction
-
Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary
-
Classification vs. Prediction
Classification
  predicts categorical class labels (discrete or nominal)
  classifies data (constructs a model) based on the training set and the values (class labels) in a classifying attribute, and uses the model to classify new data
Prediction
  models continuous-valued functions, i.e., predicts unknown or missing values
Typical applications
  Credit approval
  Target marketing
  Medical diagnosis
  Fraud detection
-
Classification: A Two-Step Process
Model construction: describing a set of predetermined classes
  Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  The set of tuples used for model construction is the training set
  The model is represented as classification rules, decision trees, or mathematical formulae
Model usage: for classifying future or unknown objects
  Estimate the accuracy of the model
    The known label of each test sample is compared with the classified result from the model
    Accuracy rate is the percentage of test set samples that are correctly classified by the model
    The test set must be independent of the training set; otherwise overfitting goes undetected and the accuracy estimate is too optimistic
  If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known
-
Process (1): Model Construction
Training Data

NAME  RANK            YEARS  TENURED
Mike  Assistant Prof  3      no
Mary  Assistant Prof  7      yes
Bill  Professor       2      yes
Jim   Associate Prof  7      yes
Dave  Assistant Prof  6      no
Anne  Associate Prof  3      no

Training Data -> Classification Algorithms -> Classifier (Model)

IF rank = 'professor'
OR years > 6
THEN tenured = 'yes'
-
Process (2): Using the Model in Prediction
Testing Data

NAME     RANK            YEARS  TENURED
Tom      Assistant Prof  2      no
Merlisa  Associate Prof  7      no
George   Professor       5      yes
Joseph   Assistant Prof  7      yes

Testing Data -> Classifier <- Unseen Data

Unseen Data: (Jeff, Professor, 4)
Tenured?
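The two steps can be traced in a few lines of Python. This is a minimal sketch, not part of the original slides: the function names `predict_tenured` and `accuracy` are made up for illustration, and the rule and the test tuples are the ones shown on the slides.

```python
# Rule produced by the model-construction step (taken from the slide):
# IF rank = 'professor' OR years > 6 THEN tenured = 'yes'
def predict_tenured(rank: str, years: int) -> str:
    return "yes" if rank == "Professor" or years > 6 else "no"

# Testing data from the slide: (name, rank, years, known label).
TEST_DATA = [
    ("Tom", "Assistant Prof", 2, "no"),
    ("Merlisa", "Associate Prof", 7, "no"),
    ("George", "Professor", 5, "yes"),
    ("Joseph", "Assistant Prof", 7, "yes"),
]

def accuracy(rows) -> float:
    """Fraction of test tuples whose known label matches the prediction."""
    correct = sum(predict_tenured(rank, years) == label
                  for _, rank, years, label in rows)
    return correct / len(rows)

if __name__ == "__main__":
    print("accuracy on test set:", accuracy(TEST_DATA))
    print("Jeff tenured?", predict_tenured("Professor", 4))
```

Merlisa (7 years, not tenured) is the one test tuple the rule gets wrong, which is why the estimate comes from tuples that were not used to build the rule.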
-
Supervised vs. Unsupervised Learning
Supervised learning (classification)
  Supervision: The training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  New data is classified based on the training set
Unsupervised learning (clustering)
  The class labels of the training data are unknown
-
Chapter 6. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Support Vector Machines (SVM)
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary
-
Issues: Data Preparation
Data cleaning
  Preprocess data in order to reduce noise and handle missing values
Relevance analysis (feature selection)
  Remove the irrelevant or redundant attributes
Data transformation
-
Issues: Evaluating Classification Methods
Accuracy
  classifier accuracy: predicting class label
  predictor accuracy: guessing value of predicted attributes
Speed
  time to construct the model (training time)
  time to use the model (classification/prediction time)
Robustness: handling noise and missing values
Scalability: efficiency in disk-resident databases
Interpretability
  understanding and insight provided by the model
Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules
-
Chapter 6. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by back propagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary
-
Decision Tree Induction: Training Dataset

age     income  student  credit_rating  buys_computer
<=30    high    no       fair           no
<=30    high    no       excellent      no
31..40  high    no       fair           yes
>40     medium  no       fair           yes
>40     low     yes      fair           yes
>40     low     yes      excellent      no
31..40  low     yes      excellent      yes
<=30    medium  no       fair           no
<=30    low     yes      fair           yes
>40     medium  yes      fair           yes
<=30    medium  yes      excellent      yes
31..40  medium  no       excellent      yes
31..40  high    yes      fair           yes
>40     medium  no       excellent      no

This follows an example of Quinlan's ID3 (Playing Tennis)
-
Output: A Decision Tree for "buys_computer"

age?
  <=30: student?
    no: buys_computer = no
    yes: buys_computer = yes
  31..40: buys_computer = yes
  >40: credit rating?
    excellent: buys_computer = no
    fair: buys_computer = yes
-
Algorithm for Decision Tree Induction
Basic algorithm (a greedy algorithm)
  Tree is constructed in a top-down recursive divide-and-conquer manner
  At start, all the training examples are at the root
  Attributes are categorical (if continuous-valued, they are discretized in advance)
  Examples are partitioned recursively based on selected attributes
  Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
Conditions for stopping partitioning
  All samples for a given node belong to the same class
  There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  There are no samples left
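The greedy, top-down procedure can be sketched in Python as follows. This is an illustrative sketch rather than the slides' own code: the names `id3`, `entropy` and `info_after_split` are assumptions, the data is the 14-tuple buys_computer training set, and information gain is the selection measure. Because the sketch only branches on values that actually occur in a partition, the "no samples left" stopping case never arises here.

```python
from collections import Counter
from math import log2

ATTRS = ["age", "income", "student", "credit_rating"]
RAW = """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
"""
DATA = [dict(zip(ATTRS + ["label"], line.split()))
        for line in RAW.strip().splitlines()]

def entropy(rows):
    """Expected information needed to classify a tuple in `rows`."""
    n = len(rows)
    counts = Counter(r["label"] for r in rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def info_after_split(rows, attr):
    """Weighted entropy of the partitions induced by `attr`."""
    n = len(rows)
    values = Counter(r[attr] for r in rows)
    return sum(cnt / n * entropy([r for r in rows if r[attr] == v])
               for v, cnt in values.items())

def id3(rows, attrs):
    labels = Counter(r["label"] for r in rows)
    if len(labels) == 1:                      # all samples in one class
        return next(iter(labels))
    if not attrs:                             # no attributes left: majority vote
        return labels.most_common(1)[0][0]
    # Highest gain == lowest expected information after the split.
    best = min(attrs, key=lambda a: info_after_split(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[best] == v], rest)
                   for v in sorted({r[best] for r in rows})}}

if __name__ == "__main__":
    print(id3(DATA, ATTRS))
```

On this data the sketch splits on age at the root, then on student for the "<=30" branch and on credit_rating for the ">40" branch.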
-
Why Decision Trees Are Popular
Does not require domain knowledge
Can handle multidimensional data
Easy to understand
-
Attribute Selection Measure: Information Gain (ID3/C4.5)
Select the attribute with the highest information gain (least impurity)
Let $p_i$ be the probability that an arbitrary tuple in D belongs to class $C_i$, estimated by $|C_{i,D}|/|D|$
Expected information (entropy) needed to classify a tuple in D:
  $Info(D) = -\sum_{i=1}^{m} p_i \log_2(p_i)$
Information needed (after using A to split D into v partitions) to classify D:
  $Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$
Information gained by branching on attribute A:
  $Gain(A) = Info(D) - Info_A(D)$
-
Attribute Selection: Information Gain
Class P: buys_computer = "yes"
Class N: buys_computer = "no"
  $Info(D) = I(9,5) = -\frac{9}{14}\log_2\frac{9}{14} - \frac{5}{14}\log_2\frac{5}{14} = 0.940$
  $Info_{age}(D) = \frac{5}{14} I(2,3) + \frac{4}{14} I(4,0) + \frac{5}{14} I(3,2) = 0.694$
$\frac{5}{14} I(2,3)$ means "age <= 30" has 5 out of 14 samples, with 2 yeses and 3 nos. Hence
  $Gain(age) = Info(D) - Info_{age}(D) = 0.246$
Similarly,
  $Gain(income) = 0.029$
  $Gain(student) = 0.151$
  $Gain(credit\_rating) = 0.048$
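The arithmetic can be checked with a short Python sketch that works directly on the (yes, no) class counts per attribute value. The helper names `info`, `info_split` and `gain` are illustrative assumptions, and the counts in `PARTITIONS` are read off the 14-tuple training table.

```python
from math import log2

def info(*counts):
    """I(c1, c2, ...): expected information for the given class counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

# (yes, no) counts for each value of each attribute in the training table.
PARTITIONS = {
    "age": [(2, 3), (4, 0), (3, 2)],            # <=30, 31..40, >40
    "income": [(2, 2), (4, 2), (3, 1)],         # high, medium, low
    "student": [(6, 1), (3, 4)],                # yes, no
    "credit_rating": [(6, 2), (3, 3)],          # fair, excellent
}

def info_split(parts):
    """Info_A(D): size-weighted information of the partitions."""
    total = sum(sum(p) for p in parts)
    return sum(sum(p) / total * info(*p) for p in parts)

def gain(attr):
    return info(9, 5) - info_split(PARTITIONS[attr])

if __name__ == "__main__":
    for attr in PARTITIONS:
        print(f"Gain({attr}) = {gain(attr):.3f}")
```

Age has the largest gain, so it becomes the splitting attribute at the root.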
-
Computing Information-Gain for Continuous-Value Attributes
Let attribute A be a continuous-valued attribute
Must determine the best split point for A
  Sort the values of A in increasing order
  Typically, the midpoint between each pair of adjacent values is considered as a possible split point
    $(a_i + a_{i+1})/2$ is the midpoint between the values of $a_i$ and $a_{i+1}$
  The point with the minimum expected information requirement for A is selected as the split-point for A
Split:
  $D_1$ is the set of tuples in D satisfying A <= split-point, and $D_2$ is the set of tuples in D satisfying A > split-point
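A minimal Python sketch of this split-point search is given below. The function name `best_split_point` and the five sample ages with their labels are hypothetical, chosen only to illustrate the midpoint enumeration; they do not come from the slides.

```python
from collections import Counter
from math import log2

def info(labels):
    """Entropy of a list of class labels."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def best_split_point(values, labels):
    """Return (split_point, expected_info) minimising Info_A(D).

    Candidate split points are the midpoints between adjacent distinct
    values; returns None if all values are identical.
    """
    pairs = sorted(zip(values, labels))
    distinct = sorted({v for v, _ in pairs})
    best = None
    for lo, hi in zip(distinct, distinct[1:]):
        split = (lo + hi) / 2
        left = [y for v, y in pairs if v <= split]    # D1: A <= split-point
        right = [y for v, y in pairs if v > split]    # D2: A >  split-point
        expected = (len(left) * info(left)
                    + len(right) * info(right)) / len(pairs)
        if best is None or expected < best[1]:
            best = (split, expected)
    return best

if __name__ == "__main__":
    ages = [25, 30, 35, 45, 50]                  # hypothetical values
    buys = ["no", "no", "yes", "yes", "yes"]     # hypothetical labels
    print(best_split_point(ages, buys))
```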
-
Gain Ratio for Attribute Selection (C4.5)
Information gain measure is biased towards attributes with a large number of values
C4.5 (a successor of ID3) uses gain ratio to overcome the problem (normalization to information gain)
-
Gini index (CART, IBM IntelligentMiner)
If a data set D contains examples from n classes, gini index, gini(D), is defined as
  $gini(D) = 1 - \sum_{j=1}^{n} p_j^2$
where $p_j$ is the relative frequency of class j in D
If a data set D is split on A into two subsets $D_1$ and $D_2$, the gini index $gini_A(D)$ is defined as
  $gini_A(D) = \frac{|D_1|}{|D|} gini(D_1) + \frac{|D_2|}{|D|} gini(D_2)$
Reduction in Impurity:
  $\Delta gini(A) = gini(D) - gini_A(D)$
The attribute that provides the smallest $gini_A(D)$ (or the largest reduction in impurity) is chosen to split the node (need to enumerate all the possible splitting points for each attribute)
-
Gini index (CART, IBM IntelligentMiner)
Ex. D has 9 tuples in buys_computer = "yes" and 5 in "no"
  $gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$
Suppose the attribute income partitions D into 10 in $D_1$: {low, medium} and 4 in $D_2$: {high}
  $gini_{income \in \{low,medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443$
The other binary splits give 0.458 for {low, high} vs. {medium} and 0.450 for {medium, high} vs. {low}, so the split {low, medium} vs. {high} is the best since its gini index is the lowest
All attributes are assumed continuous-valued
May need other tools, e.g., clustering, to get the possible split values
Can be modified for categorical attributes
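The gini values for the three possible binary splits on income can be recomputed with the sketch below. The names `gini`, `gini_split` and `INCOME` are illustrative, and the (yes, no) counts per income value are taken from the 14-tuple training table.

```python
def gini(counts):
    """gini(D) = 1 - sum_j p_j^2 for the given class counts."""
    n = sum(counts)
    return 1 - sum((c / n) ** 2 for c in counts)

# (yes, no) counts of buys_computer for each income value.
INCOME = {"low": (3, 1), "medium": (4, 2), "high": (2, 2)}

def gini_split(subset):
    """gini_A(D) for the binary split `subset` vs. the remaining values."""
    d1 = [sum(INCOME[v][i] for v in subset) for i in range(2)]
    d2 = [sum(INCOME[v][i] for v in INCOME if v not in subset) for i in range(2)]
    n = sum(d1) + sum(d2)
    return sum(d1) / n * gini(d1) + sum(d2) / n * gini(d2)

if __name__ == "__main__":
    print(f"gini(D) = {gini((9, 5)):.3f}")
    for subset in ({"low", "medium"}, {"low", "high"}, {"medium", "high"}):
        print(sorted(subset), f"{gini_split(subset):.3f}")
```

The split {low, medium} vs. {high} has the smallest value and therefore the largest reduction in impurity.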
-
Comparing Attribute Selection Measures
The three measures, in general, return good results, but
  Information gain: biased towards multivalued attributes
-
Other Attribute Selection Measures
CHAID: a popular decision tree algorithm, measure based on the $\chi^2$ test for independence
C-SEP: performs better than info. gain and gini index in certain cases
-
Overfitting and Tree Pruning
Overfitting: An induced tree may overfit the training data
  Too many branches, some may reflect anomalies due to noise or outliers
  Poor accuracy for unseen samples
Two approaches to avoid overfitting
  Prepruning: Halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    Difficult to choose an appropriate threshold
  Postpruning: Remove branches from a "fully grown" tree to get a sequence of progressively pruned trees
    Use a set of data different from the training data to decide which is the "best pruned tree"
-
Enhancements to Basic Decision Tree Induction
Allow for continuous-valued attributes
  Dynamically define new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals
Handle missing attribute values
  Assign the most common value of the attribute
  Assign probability to each of the possible values
Attribute construction
  Create new attributes based on existing ones that are sparsely represented
  This reduces fragmentation, repetition, and replication
-
Classification in Large Databases
Classification: a classical problem extensively studied by statisticians and machine learning researchers
Scalability: Classifying data sets with millions of examples and hundreds of attributes with reasonable speed
Why decision tree induction in data mining?
  relatively faster learning speed (than other classification methods)
  convertible to simple and easy to understand classification rules
  can use SQL queries for accessing databases
  comparable classification accuracy with other methods
-
Scalable Decision Tree Induction Methods
SLIQ (EDBT'96, Mehta et al.)
  Builds an index for each attribute; only the class list and the current attribute list reside in memory
SPRINT (VLDB'96, J. Shafer et al.)
  Constructs an attribute list data structure
PUBLIC (VLDB'98, Rastogi & Shim)
  Integrates tree splitting and tree pruning: stop growing the tree earlier
RainForest (VLDB'98)
-
Scalability Framework for RainForest
Separates the scalability aspects from the criteria that determine the quality of the tree
Builds an AVC-list: AVC (Attribute, Value, Class_label)
AVC-set (of an attribute X)
  Projection of training dataset onto the attribute X and class label where counts of individual class label are aggregated
AVC-group (of a node n)
  Set of AVC-sets of all predictor attributes at the node n
-
RainForest: Training Set and Its AVC Sets
Training examples: the 14-tuple buys_computer training set

AVC-set on age
age     buys_computer = yes  buys_computer = no
<=30    2                    3
31..40  4                    0
>40     3                    2

AVC-set on income
income  buys_computer = yes  buys_computer = no
high    2                    2
medium  4                    2
low     3                    1

AVC-set on student
student  buys_computer = yes  buys_computer = no
yes      6                    1
no       3                    4

AVC-set on credit_rating
credit_rating  buys_computer = yes  buys_computer = no
fair           6                    2
excellent      3                    3
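As a rough illustration, the AVC-group of the root node can be built in one pass over the training tuples. The sketch below is an assumption about how one might code it, not RainForest's actual implementation; `avc_group` is a made-up name and the rows are the 14-tuple buys_computer training set.

```python
from collections import Counter, defaultdict

ATTRS = ["age", "income", "student", "credit_rating"]
RAW = """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
"""
ROWS = [line.split() for line in RAW.strip().splitlines()]

def avc_group(rows):
    """Map attribute -> value -> Counter of class labels (one AVC-set each)."""
    group = {attr: defaultdict(Counter) for attr in ATTRS}
    for *values, label in rows:          # last column is the class label
        for attr, value in zip(ATTRS, values):
            group[attr][value][label] += 1
    return group

if __name__ == "__main__":
    for attr, avc_set in avc_group(ROWS).items():
        print(attr, {v: dict(c) for v, c in avc_set.items()})
```

Only these aggregated counts, not the tuples themselves, are needed to evaluate a split criterion such as information gain, which is why the AVC-group can fit in memory when the training set does not.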
-
Data Cube-Based Decision-Tree Induction
Integration of generalization with decision-tree induction (Kamber et al.'97)
Classification at primitive concept levels
  E.g., precise temperature, humidity, outlook, etc.
  Low-level concepts, scattered classes, bushy classification-trees
  Semantic interpretation problems
Cube-based multi-level classification
  Relevance analysis at multi-levels
  Information-gain analysis with dimension + level
-
BOAT (Bootstrapped Optimistic Algorithm for Tree Construction)
Use a statistical technique called bootstrapping to create several smaller samples (subsets), each of which fits in memory
Each subset is used to create a tree, resulting in several trees
These trees are examined and used to construct a new tree T
  It turns out that T is very close to the tree that would be generated using the whole data set together
Adv: requires only two scans of DB, an incremental alg.
-
Presentation of Classification Results
-
Visualization
-
Interactive Visual Mining by Perception-Based Classification (PBC)
-
Bayesian Classification: Why?
A statistical classifier: performs probabilistic prediction, i.e., predicts class membership probabilities
Foundation: Based on Bayes' Theorem.
Performance: A simple Bayesian classifier, the naïve Bayesian classifier, has comparable performance with decision tree and selected neural network classifiers
Incremental: Each training example can incrementally increase/decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
Standard: Even when Bayesian methods are computationally intractable, they can provide a standard of optimal decision making against which other methods can be measured
-
Bayesian Theorem: Basics
Let X be a data sample ("evidence"): class label is unknown
Let H be a hypothesis that X belongs to class C
Classification is to determine P(H|X) (posteriori probability), the probability that the hypothesis holds given the observed data sample X
P(H) (prior probability), the initial probability
  E.g., X will buy computer, regardless of age, income, ...
P(X): probability that sample data is observed
P(X|H) (likelihood), the probability of observing the sample X, given that the hypothesis holds
  E.g., given that X will buy computer, the prob. that X is 31..40, medium income
-
Bayesian Theorem
Given training data X, the posteriori probability of a hypothesis H, P(H|X), follows the Bayes theorem
  $P(H|X) = \frac{P(X|H)\,P(H)}{P(X)}$
Informally, this can be written as
  posteriori = likelihood x prior / evidence
Predicts X belongs to $C_i$ iff the probability $P(C_i|X)$ is the highest among all the $P(C_k|X)$ for all the k classes
Practical difficulty: requires initial knowledge of many probabilities, significant computational cost
-
Towards Naïve Bayesian Classifier
Let D be a training set of tuples and their associated class labels, and each tuple is represented by an n-D attribute vector $X = (x_1, x_2, \ldots, x_n)$
Suppose there are m classes $C_1, C_2, \ldots, C_m$.
Classification is to derive the maximum posteriori, i.e., the maximal $P(C_i|X)$
This can be derived from Bayes' theorem
  $P(C_i|X) = \frac{P(X|C_i)\,P(C_i)}{P(X)}$
Since P(X) is constant for all classes, only
  $P(X|C_i)\,P(C_i)$
needs to be maximized
-
Derivation of Naïve Bayes Classifier
A simplified assumption: attributes are conditionally independent (i.e., no dependence relation between attributes):
  $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i) = P(x_1|C_i) \times P(x_2|C_i) \times \cdots \times P(x_n|C_i)$
This greatly reduces the computation cost: only counts the class distribution
If $A_k$ is categorical, $P(x_k|C_i)$ is the # of tuples in $C_i$ having value $x_k$ for $A_k$ divided by $|C_{i,D}|$ (# of tuples of $C_i$ in D)
If $A_k$ is continuous-valued, $P(x_k|C_i)$ is usually computed based on a Gaussian distribution
-
Naïve Bayesian Classifier: Training Dataset
Class:
  C1: buys_computer = 'yes'
  C2: buys_computer = 'no'
Data sample
  X = (age <= 30, income = medium, student = yes, credit_rating = fair)
-
Example
P(Ci):
  P(buys_computer = "yes") = 9/14 = 0.643
  P(buys_computer = "no") = 5/14 = 0.357
Compute P(X|Ci) for each class
  P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
X = (age <= 30, income = medium, student = yes, credit_rating = fair)
P(X|Ci):
  P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
P(X|Ci) x P(Ci):
  P(X | buys_computer = "yes") x P(buys_computer = "yes") = 0.028
  P(X | buys_computer = "no") x P(buys_computer = "no") = 0.007
Therefore, X belongs to class ("buys_computer = yes")
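The same numbers fall out of a few lines of Python that count directly over the 14-tuple buys_computer training set. This is a sketch under the naïve independence assumption; `naive_bayes_scores` is an illustrative name, and no smoothing is applied.

```python
from collections import Counter

RAW = """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
"""
# Columns: age, income, student, credit_rating, buys_computer (class label).
ROWS = [line.split() for line in RAW.strip().splitlines()]

def naive_bayes_scores(rows, sample):
    """Return {class: P(X|Ci) * P(Ci)} for the attribute tuple `sample`."""
    class_counts = Counter(r[-1] for r in rows)
    scores = {}
    for label, n_c in class_counts.items():
        score = n_c / len(rows)                          # P(Ci)
        for k, value in enumerate(sample):
            matches = sum(1 for r in rows
                          if r[-1] == label and r[k] == value)
            score *= matches / n_c                       # P(x_k | Ci)
        scores[label] = score
    return scores

if __name__ == "__main__":
    X = ("<=30", "medium", "yes", "fair")
    scores = naive_bayes_scores(ROWS, X)
    print(scores, "->", max(scores, key=scores.get))
```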
-
Avoiding the 0-Probability Problem
Naïve Bayesian prediction requires each conditional prob. be non-zero. Otherwise, the predicted prob. will be zero
  $P(X|C_i) = \prod_{k=1}^{n} P(x_k|C_i)$
Ex. Suppose a dataset with 1000 tuples, income = low (0), income = medium (990), and income = high (10)
Use Laplacian correction (or Laplacian estimator)
  Adding 1 to each case
    Prob(income = low) = 1/1003
    Prob(income = medium) = 991/1003
    Prob(income = high) = 11/1003
  The "corrected" prob. estimates are close to their "uncorrected" counterparts
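The correction is a one-liner in practice. The sketch below uses exact fractions so the results can be compared with the slide directly; the function name `laplace_estimates` and the pseudo-count parameter `k` are illustrative assumptions.

```python
from fractions import Fraction

def laplace_estimates(counts, k=1):
    """Add k to every value count before normalising."""
    total = sum(counts.values()) + k * len(counts)
    return {value: Fraction(c + k, total) for value, c in counts.items()}

if __name__ == "__main__":
    income_counts = {"low": 0, "medium": 990, "high": 10}   # from the slide
    for value, p in laplace_estimates(income_counts).items():
        print(f"Prob(income = {value}) = {p} ~ {float(p):.4f}")
```

With the correction, income = low gets probability 1/1003 instead of 0, so a single unseen value no longer forces the whole product to zero.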
-
Naïve Bayesian Classifier: Comments
Advantages
  Easy to implement
-
Bayesian Belief Networks
A Bayesian belief network allows a subset of the variables to be conditionally independent
A graphical model of causal relationships
  Represents dependency among the variables
-
Bayesian Belief Network: An Example
Nodes: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea

The conditional probability table (CPT) for variable LungCancer:

      (FH, S)  (FH, ~S)  (~FH, S)  (~FH, ~S)
LC    0.8      0.5       0.7       0.1
~LC   0.2      0.5       0.3       0.9

CPT shows the conditional probability for each possible combination of its parents
Derivation of the probability of a particular combination of values of X, from CPT:
  $P(x_1, \ldots, x_n) = \prod_{i=1}^{n} P(x_i \mid Parents(Y_i))$
7/24/2019 Data Mining Classification and Prediction by Dr. Tanvir Ahmed
47/124
November 16, 2015Data Mining: Concepts and
Techniques 7
Trainin! 9aesian ?etwor;s
evera! scenarios:
-
Using IF-THEN Rules for Classification
Represent the knowledge in the form of IF-THEN rules
  R: IF age = youth AND student = yes THEN buys_computer = yes
  Rule antecedent/precondition vs. rule consequent
Assessment of a rule: coverage and accuracy
  $n_{covers}$ = # of tuples covered by R
  $n_{correct}$ = # of tuples correctly classified by R
  coverage(R) = $n_{covers} / |D|$   /* D: training data set */
  accuracy(R) = $n_{correct} / n_{covers}$
If more than one rule is triggered, need conflict resolution
  Size ordering: assign the highest priority to the triggering rule that has the "toughest" requirement (i.e., with the most attribute tests)
  Class-based ordering: decreasing order of prevalence or misclassification cost per class
  Rule-based ordering (decision list): rules are organized into one long priority list, according to some measure of rule quality or by experts
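Coverage and accuracy of a rule are simple counts over the training set. The sketch below evaluates rule R on the 14-tuple buys_computer data; the function name `assess` and the dictionary encoding of the rule are assumptions for illustration, and "youth" is taken to correspond to the "<=30" age value in the table.

```python
ATTRS = ["age", "income", "student", "credit_rating"]
RAW = """
<=30 high no fair no
<=30 high no excellent no
31..40 high no fair yes
>40 medium no fair yes
>40 low yes fair yes
>40 low yes excellent no
31..40 low yes excellent yes
<=30 medium no fair no
<=30 low yes fair yes
>40 medium yes fair yes
<=30 medium yes excellent yes
31..40 medium no excellent yes
31..40 high yes fair yes
>40 medium no excellent no
"""
DATA = [dict(zip(ATTRS + ["label"], line.split()))
        for line in RAW.strip().splitlines()]

def assess(conditions, predicted_class, data):
    """Return (coverage, accuracy) of IF conditions THEN predicted_class."""
    covered = [r for r in data
               if all(r[attr] == value for attr, value in conditions.items())]
    n_covers = len(covered)
    n_correct = sum(r["label"] == predicted_class for r in covered)
    coverage = n_covers / len(data)
    accuracy = n_correct / n_covers if n_covers else 0.0
    return coverage, accuracy

if __name__ == "__main__":
    # R: IF age = youth AND student = yes THEN buys_computer = yes
    print(assess({"age": "<=30", "student": "yes"}, "yes", DATA))
    # A weaker, hypothetical rule for contrast: IF income = medium THEN yes
    print(assess({"income": "medium"}, "yes", DATA))
```

R covers 2 of the 14 tuples and classifies both correctly, so it has low coverage but perfect accuracy; the contrast rule covers more tuples at lower accuracy.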
-
Rule Extraction from a Decision Tree
Rules are easier to understand than large trees
One rule is created for each path from the root to a leaf
Each attribute-value pair along a path forms a conjunction: the leaf holds the class prediction
Rules are mutually exclusive and exhaustive

age?
  <=30: student?
    no: buys_computer = no
    yes: buys_computer = yes
  31..40: buys_computer = yes
  >40: credit rating?
    excellent: buys_computer = no
    fair: buys_computer = yes

Example: Rule extraction from our buys_computer decision-tree
  IF age = young AND student = no THEN buys_computer = no
  IF age = young AND student = yes THEN buys_computer = yes
  IF age = mid-age THEN buys_computer = yes
  IF age = old AND credit_rating = excellent THEN buys_computer = no
  IF age = old AND credit_rating = fair THEN buys_computer = yes
-
Rule Extraction from the Training Data
Sequential covering algorithm: Extracts rules directly from training data
Typical sequential covering algorithms: FOIL, AQ, CN2, RIPPER
Rules are learned sequentially; each rule for a given class $C_i$ will cover many tuples of $C_i$ but none (or few) of the tuples of other classes
Steps:
  Rules are learned one at a time
  Each time a rule is learned, the tuples covered by the rule are removed
  The process repeats on the remaining tuples until a termination condition is met, e.g., when there are no more training examples or when the quality of a rule returned is below a user-specified threshold
Comp. w. decision-tree induction: learning a set of rules simultaneously
-
How to Learn-One-Rule?
Start with the most general rule possible: condition = empty
Adding new attributes by adopting a greedy depth-first strategy
  Picks the one that most improves the rule quality
Rule-Quality measures: consider both coverage and accuracy
  Foil-gain (in FOIL & RIPPER): assesses info_gain by extending condition
    $FOIL\_Gain = pos' \times \left(\log_2\frac{pos'}{pos' + neg'} - \log_2\frac{pos}{pos + neg}\right)$
  It favors rules that have high accuracy and cover many positive tuples
Rule pruning based on an independent set of test tuples
    $FOIL\_Prune(R) = \frac{pos - neg}{pos + neg}$
  Pos/neg are # of positive/negative tuples covered by R.
  If FOIL_Prune is higher for the pruned version of R, prune R
-
Classification: A Mathematical Mapping
Classification: predicts categorical class labels
  E.g., Personal homepage classification
    $x_i = (x_1, x_2, x_3, \ldots)$, $y_i = +1$ or $-1$
    $x_1$: # of a word "homepage"
    $x_2$: # of a word "welcome"
Mathematically
  $x \in X = \mathbb{R}^n$, $y \in Y = \{+1, -1\}$
  We want a function $f: X \to Y$
-
Linear Classification
Binary Classification problem
The data above the red line belongs to class 'x'
The data below the red line belongs to class 'o'
Examples: SVM, Perceptron, Probabilistic Classifiers
[Figure: 'x' and 'o' data points separated by a red line]
-
Discriminative Classifiers
Advantages
  prediction accuracy is generally high
    As compared to Bayesian methods in general
  robust, works when training examples contain errors
  fast evaluation of the learned target function
    Bayesian networks are normally slow
Criticism
  long training time
  difficult to understand the learned function (weights)
    Bayesian networks can be used easily for pattern discovery
  not easy to incorporate domain knowledge
    Easy in the form of priors on the data or distributions
-
Perceptron & Winnow
Vector: x, w
Scalar: x, y, w
Input: $\{(x_1, y_1), \ldots\}$
Output: classification function f(x)
  $f(x_i) > 0$ for $y_i = +1$
  $f(x_i) < 0$ for $y_i = -1$
Decision boundary: $f(x) = w \cdot x + b = 0$
  or $w_1 x_1 + w_2 x_2 + b = 0$
[Figure: separating line in the $(x_1, x_2)$ plane]
Perceptron: update w additively
Winnow: update w multiplicatively
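The additive perceptron update can be sketched as below. The four 2-D training points are hypothetical and chosen to be linearly separable; `train_perceptron` and `predict` are illustrative names, and the learning rate and epoch limit are assumptions rather than values from the slides. Winnow would replace the additive step with a multiplicative one and is not shown.

```python
def train_perceptron(samples, lr=1.0, max_epochs=100):
    """Learn w, b so that sign(w . x + b) matches y in {+1, -1}."""
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:                      # misclassified (or on the line)
                # Additive update: move w and b towards the correct side.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
                mistakes += 1
        if mistakes == 0:                           # converged on separable data
            break
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical, linearly separable training points.
SAMPLES = [((2.0, 2.0), 1), ((3.0, 1.0), 1), ((-1.0, -1.0), -1), ((-2.0, 0.0), -1)]

if __name__ == "__main__":
    w, b = train_perceptron(SAMPLES)
    print("w =", w, "b =", b)
    print([predict(w, b, x) for x, _ in SAMPLES])
```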
-
Classification by Backpropagation
Backpropagation: A neural network learning algorithm
Started by psychologists and neurobiologists to develop and test computational analogues of neurons
A neural network: A set of connected input/output units where each connection has a weight associated with it
During the learning phase, the network learns by adjusting the weights so as to be able to predict the correct class label of the input tuples
Also referred to as connectionist learning due to the connections between units
-
Neural Network as a Classifier
Weakness
  Long training time
  Require a number of parameters typically best determined empirically, e.g., the network topology or "structure"
  Poor interpretability: Difficult to interpret the symbolic meaning behind the learned weights and of "hidden units" in the network
Strength
  High tolerance to noisy data
  Ability to classify untrained patterns
  Well-suited for continuous-valued inputs and outputs
  Successful on a wide array of real-world data
  Algorithms are inherently parallel
  Techniques have recently been developed for the extraction of rules from trained neural networks
A Neuron (= a perceptron)
The n-dimensional input vector x is mapped into variable y by means of the scalar product and a nonlinear function mapping
[Figure: input vector x = (x0, x1, …, xn) and weight vector w = (w0, w1, …, wn) feed a weighted sum with bias μk, followed by an activation function f that produces output y]
For example: $y = \mathrm{sign}\left(\sum_{i=0}^{n} w_i x_i + \mu_k\right)$
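A Multi-Layer Feed-Forward Neural Network
[Figure: the input vector X enters the input layer, is passed through the hidden layer via weights wij, and the output layer emits the output vector]
Net input to unit j: $I_j = \sum_i w_{ij} O_i + \theta_j$
Output of unit j: $O_j = \frac{1}{1 + e^{-I_j}}$
Error of an output unit j: $Err_j = O_j (1 - O_j)(T_j - O_j)$
Error of a hidden unit j: $Err_j = O_j (1 - O_j) \sum_k Err_k w_{jk}$
Weight update: $w_{ij} = w_{ij} + (l)\, Err_j O_i$
Bias update: $\theta_j = \theta_j + (l)\, Err_j$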
How A Multi-Layer Neural Network Works
The inputs to the network correspond to the attributes measured for each training tuple
Inputs are fed simultaneously into the units making up the input layer
They are then weighted and fed simultaneously to a hidden layer
The number of hidden layers is arbitrary, although usually only one
The weighted outputs of the last hidden layer are input to units making up the output layer, which emits the network's prediction
The network is feed-forward in that none of the weights cycles back to an input unit or to an output unit of a previous layer
From a statistical point of view, networks perform nonlinear regression
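The forward flow described above, together with one backpropagation update, can be sketched for a network with a single hidden layer. This is an illustrative sketch under stated assumptions: sigmoid units, one training tuple per update, and a learning rate `lr`. The function names and the starting weights are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def forward(x, w_h, b_h, w_o, b_o):
    """Feed x through one hidden layer and one output layer.

    w_h[i][j] is the weight from input i to hidden unit j;
    w_o[j][k] is the weight from hidden unit j to output unit k.
    """
    hidden = [sigmoid(sum(w_h[i][j] * x[i] for i in range(len(x))) + b_h[j])
              for j in range(len(b_h))]
    output = [sigmoid(sum(w_o[j][k] * hidden[j] for j in range(len(hidden))) + b_o[k])
              for k in range(len(b_o))]
    return hidden, output


def backprop_step(x, target, w_h, b_h, w_o, b_o, lr=0.5):
    """Update weights and biases in place for one training tuple.

    Returns the network output computed before the update.
    """
    hidden, output = forward(x, w_h, b_h, w_o, b_o)
    # Errors are computed with the weights as they were before the update.
    err_o = [o * (1 - o) * (t - o) for o, t in zip(output, target)]
    err_h = [h * (1 - h) * sum(err_o[k] * w_o[j][k] for k in range(len(err_o)))
             for j, h in enumerate(hidden)]
    for j in range(len(hidden)):
        for k in range(len(err_o)):
            w_o[j][k] += lr * err_o[k] * hidden[j]
    for k in range(len(err_o)):
        b_o[k] += lr * err_o[k]
    for i in range(len(x)):
        for j in range(len(hidden)):
            w_h[i][j] += lr * err_h[j] * x[i]
    for j in range(len(hidden)):
        b_h[j] += lr * err_h[j]
    return output
```

Repeating `backprop_step` on a tuple moves the network output toward that tuple's target value.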
Defining a Network Topology
First decide the network topology: # of units in the input layer, # of hidden layers (if > 1), # of units in each hidden layer, and # of units in the output layer
Normalize the input values for each attribute measured in the training tuples to [0.0, 1.0]
One input unit per domain value, each initialized to 0
Output: for classification with more than two classes, one output unit per class is used
If a network has been trained and its accuracy is unacceptable, repeat the training process with a different network topology or a different set of initial weights
Backpropagation
Iteratively process a set of training tuples and compare the network's prediction with the actual known target value
For each training tuple, the weights are modified to minimize the error between the network's prediction and the actual target value
Backpropagation and Interpretability
Efficiency of backpropagation: Each epoch (one iteration through the training set) takes O(|D| * w), with |D| tuples and w weights, but the # of epochs can be exponential in n, the number of inputs, in the worst case
Rule extraction from networks: network pruning
  Simplify the network structure by removing weighted links that have the least effect on the trained network
  Then perform link, unit, or activation value clustering
  The set of input and activation values are studied to derive rules describing the relationship between the input and hidden unit layers
Sensitivity analysis: assess the impact that a given input variable has on a network output. The knowledge gained from this analysis can be represented in rules
Chapter 6. Classification and Prediction
What is classification? What is prediction?
Issues regarding classification and prediction
Classification by decision tree induction
Bayesian classification
Rule-based classification
Classification by back propagation
Support Vector Machines (SVM)
Associative classification
Lazy learners (or learning from your neighbors)
Other classification methods
Prediction
Accuracy and error measures
Ensemble methods
Model selection
Summary
SVM: Support Vector Machines
A new classification method for both linear and nonlinear data
It uses a nonlinear mapping to transform the original training data into a higher dimension
Within the new dimension, it searches for the linear optimal separating hyperplane (i.e., "decision boundary")
With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)
SVM: History and Applications
Vapnik and colleagues (1992): groundwork from Vapnik & Chervonenkis' statistical learning theory in the 1960s
Features: training can be slow but accuracy is high owing to their ability to model complex nonlinear decision boundaries (margin maximization)
Used both for classification and prediction
Applications: handwritten digit recognition, object recognition, speaker identification, benchmarking time-series prediction tests
SVM: General Philosophy
[Figure: a separating hyperplane with a small margin next to one with a large margin, with the support vectors marked]
SVM: Margins and Support Vectors
SVM: When Data Is Linearly Separable
Let data D be (X1, y1), …, (X|D|, y|D|), where Xi is the set of training tuples associated with the class labels yi
There are infinitely many lines (hyperplanes) separating the two classes, but we want to find the best one (the one that minimizes classification error on unseen data)
SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
SVM: Linearly Separable
A separating hyperplane can be written as
W · X + b = 0
where W = {w1, w2, …, wn} is a weight vector and b a scalar (bias)
For 2-D it can be written as
w0 + w1 x1 + w2 x2 = 0
The hyperplanes defining the sides of the margin:
H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
H2: w0 + w1 x1 + w2 x2 ≤ -1 for yi = -1
Any training tuples that fall on hyperplanes H1 or H2 (i.e., the sides defining the margin) are support vectors
This becomes a constrained (convex) quadratic optimization problem
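The margin conditions can be checked directly for a given hyperplane. The sketch below assumes the hyperplane (W, b) is already known; it does not solve the optimization problem, and the function names and the tolerance `tol` are hypothetical.

```python
def decision_value(w, b, x):
    """Compute W . X + b for weight vector w, bias b, and tuple x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def satisfies_margin(w, b, x, y):
    """True if (x, y) lies on or outside its side of the margin: y * (W . X + b) >= 1."""
    return y * decision_value(w, b, x) >= 1


def is_support_vector(w, b, x, y, tol=1e-9):
    """True if (x, y) falls on H1 or H2, i.e. y * (W . X + b) equals 1 up to tol."""
    return abs(y * decision_value(w, b, x) - 1.0) <= tol
```

For example, with W = (1, 1) and b = -3, the tuple (3, 1) with label +1 lies on H1 and the tuple (1, 1) with label -1 lies on H2, so both would be support vectors.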
Why Is SVM Effective on High Dimensional Data?
The complexity of the trained classifier is characterized by the # of support vectors rather than the dimensionality of the data
The support vectors are the essential or critical training examples: they lie closest to the decision boundary (MMH)
If all other training examples are removed and the training is repeated, the same separating hyperplane would be found
The number of support vectors found can be used to compute an (upper) bound on the expected error rate of the SVM classifier, which is independent of the data dimensionality
Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high
SVM: Linearly Inseparable
[Figure: tuples plotted on axes A1 and A2 that no straight line can separate]
Transform the original input data into a higher dimensional space
Search for a linear separating hyperplane in the new space
SVM: Kernel Functions
Instead of computing the dot product on the transformed data tuples, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e.,
K(Xi, Xj) = Φ(Xi) · Φ(Xj)
Typical Kernel Functions
SVM can also be used for classifying multiple (> 2) classes and for regression analysis (with additional user parameters)
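The equivalence between a kernel and a dot product in the transformed space can be verified numerically. The degree-2 homogeneous polynomial kernel and the explicit feature map `phi` below are chosen only for illustration; the slide's own list of typical kernels did not survive extraction.

```python
import math


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def phi(x):
    """Explicit degree-2 feature map for a 2-D tuple (illustrative choice)."""
    x1, x2 = x
    return [x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2]


def poly2_kernel(x, z):
    """Homogeneous polynomial kernel of degree 2: K(x, z) = (x . z)^2."""
    return dot(x, z) ** 2


x, z = [1.0, 2.0], [3.0, 4.0]
lhs = poly2_kernel(x, z)      # computed on the original 2-D data
rhs = dot(phi(x), phi(z))     # computed in the transformed 3-D space
```

Both `lhs` and `rhs` evaluate (x · z)^2, but the kernel never builds the higher-dimensional tuples.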
Scaling SVM by Hierarchical Micro-Clustering
SVM is not scalable to the number of data objects in terms of training time and memory usage
"Classifying Large Datasets Using SVMs with Hierarchical Clusters Problem" by Hwanjo Yu, Jiong Yang, Jiawei Han, KDD'03
CB-SVM (Clustering-Based SVM)
CB-SVM: Clustering-Based SVM
Training data sets may not even fit in memory
Read the data set once (minimizing disk access)
  Construct a statistical summary of the data (i.e., hierarchical clusters) given a limited amount of memory
  The statistical summary maximizes the benefit of learning SVM
The summary plays a role in indexing SVMs
Essence of micro-clustering (hierarchical indexing structure)
  Use a micro-cluster hierarchical indexing structure
    Provide finer samples closer to the boundary and coarser samples farther from the boundary
  Selective de-clustering to ensure high accuracy
CF-Tree: Hierarchical Micro-cluster
CB-SVM Algorithm: Outline
Construct two CF-trees from positive and negative data sets independently
  Needs one scan of the data set
Train an SVM from the centroids of the root entries
De-cluster the entries near the boundary into the next level
  The children entries de-clustered from the parent entries are accumulated into the training set with the non-declustered parent entries
Train an SVM again from the centroids of the entries in the training set
Repeat until nothing is accumulated
Selective Declustering
The CF-tree is a suitable base structure for selective declustering
De-cluster only the cluster Ei such that
  Di - Ri < Ds, where Di is the distance from the boundary to the center point of Ei and Ri is the radius of Ei
  Decluster only the cluster whose subclusters have possibilities to be the support cluster of the boundary
"Support cluster": the cluster whose centroid is a support vector
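The declustering test is a single comparison per cluster. In this sketch the threshold Ds is treated as a given input, since the slide does not define it here; the entry tuples and their values are hypothetical.

```python
def should_decluster(d_i, r_i, d_s):
    """Decluster entry E_i when D_i - R_i < D_s.

    d_i: distance from the boundary to the centre of E_i
    r_i: radius of E_i
    d_s: threshold distance D_s (assumed to be supplied by the caller)
    """
    return d_i - r_i < d_s


# Hypothetical entries as (name, D_i, R_i); only entries near the boundary are expanded.
entries = [("E1", 2.0, 1.5), ("E2", 5.0, 1.0), ("E3", 1.2, 0.1)]
to_expand = [name for name, d_i, r_i in entries if should_decluster(d_i, r_i, d_s=1.0)]
```

Here `E1` is expanded because its nearest edge is within the threshold, while `E2` and `E3` stay as single centroids.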
Experiment on Synthetic Dataset
Experiment on a Large Data Set
SVM vs. Neural Network
SVM
  Relatively new concept
  Deterministic algorithm
  Nice
SVM Related Links
SVM website: http://www.kernel-machines.org/
Representative implementations
  LIBSVM: an efficient implementation of SVM, multi-class classification, nu-SVM, and one-class SVM, also including various interfaces with Java, Python, etc.
  SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification and only the C language
  SVM-torch: another recent implementation, also written in C
SVM: Introduction Literature
"Statistical Learning Theory" by Vapnik: extremely hard to understand, and it also contains many errors
C. J. C. Burges. A Tutorial on Support Vector Machines for Pattern Recognition. Data Mining and Knowledge Discovery, 2(2), 1998.
  Better than Vapnik's book, but still too hard for an introduction, and the examples are not intuitive
The book "An Introduction to Support Vector Machines" by N. Cristianini and J. Shawe-Taylor
  Also hard for an introduction, but its explanation of Mercer's theorem is better than in the literature above
The neural network book by Haykin
  Contains one nice chapter introducing SVM
Burges tutorial: http://www.kernel-machines.org/papers/Burges98.ps.gz
Associative Classification
Associative classification
  Association rules are generated and analyzed for use in classification
  Search for strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
  Classification: based on evaluating a set of rules in the form of
    p1 ∧ p2 … ∧ pl → "Aclass = C" (conf, sup)
Why effective?
  It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
  In many studies, associative classification has been found to be more accurate than some traditional classification methods, such as C4.5
Typical Associative Classification Methods
CBA (Classification By Association: Liu, Hsu & Ma, KDD'98)
  Mine possible association rules in the form of
    Cond-set (a set of attribute-value pairs) → class label
  Build classifier: organize rules according to decreasing precedence based on confidence and then support
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
  Classification: statistical analysis on multiple rules
CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
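The CBA precedence ordering can be sketched as a sort on (confidence, support). The `Rule` type, the example rules, and the choice to let the first matching rule decide the class are assumptions for illustration, not details confirmed by the slide.

```python
from typing import FrozenSet, NamedTuple, Tuple


class Rule(NamedTuple):
    cond_set: FrozenSet[Tuple[str, str]]  # attribute-value pairs
    label: str
    conf: float
    sup: float


def rank_rules(rules):
    """Order rules by decreasing confidence, then by decreasing support."""
    return sorted(rules, key=lambda r: (-r.conf, -r.sup))


def classify(ranked_rules, items, default=None):
    """Return the label of the highest-precedence rule whose cond-set is contained in items."""
    for rule in ranked_rules:
        if rule.cond_set <= items:
            return rule.label
    return default


# Hypothetical rules, for illustration only.
rules = [
    Rule(frozenset({("age", "youth")}), "yes", 0.8, 0.3),
    Rule(frozenset({("age", "youth"), ("student", "no")}), "no", 0.9, 0.1),
    Rule(frozenset({("income", "high")}), "yes", 0.8, 0.4),
]
ranked = rank_rules(rules)
```

A tuple matching both the first and second rule is labeled "no", because the more confident rule is ranked ahead of the more general one.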
A Closer Look at CMAR
CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
Efficiency: uses an enhanced FP-tree that maintains the distribution of class labels among tuples satisfying each frequent itemset
Rule pruning whenever a rule is inserted into the tree
SIGMOD05)
Lazy Learner: Instance-Based Methods
Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
Typical approaches
  k-nearest neighbor approach
    Instances represented as points in a Euclidean space
  Locally weighted regression
    Constructs local approximation
  Case-based reasoning
    Uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm
All instances correspond to points in the n-D space
The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
The target function could be discrete- or real-valued
For discrete-valued target functions, k-NN returns the most common value among the k training examples nearest to xq
Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples
[Figure: query point xq surrounded by positive and negative training examples]
Discussion on the k-NN Algorithm
k-NN for real-valued prediction for a given unknown tuple
  Returns the mean values of the k nearest neighbors
Distance-weighted nearest neighbor algorithm
  Weight the contribution of each of the k neighbors according to their distance to the query xq
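The three k-NN variants above can be sketched with a brute-force neighbor search. The inverse-squared-distance weight used in `knn_predict_weighted` is an assumption (a common choice), as are the function names and the small constant `eps` that guards against division by zero.

```python
import math
from collections import Counter


def dist(x1, x2):
    """Euclidean distance between two points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x1, x2)))


def k_nearest(train, xq, k):
    """train is a list of (point, value) pairs; return the k pairs nearest to xq."""
    return sorted(train, key=lambda pv: dist(pv[0], xq))[:k]


def knn_classify(train, xq, k):
    """Discrete-valued target: most common value among the k nearest neighbors."""
    votes = Counter(v for _, v in k_nearest(train, xq, k))
    return votes.most_common(1)[0][0]


def knn_predict(train, xq, k):
    """Real-valued target: mean value of the k nearest neighbors."""
    nearest = k_nearest(train, xq, k)
    return sum(v for _, v in nearest) / len(nearest)


def knn_predict_weighted(train, xq, k, eps=1e-12):
    """Distance-weighted mean with assumed weights 1 / d(xq, xi)^2."""
    nearest = k_nearest(train, xq, k)
    weights = [1.0 / (dist(p, xq) ** 2 + eps) for p, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)
```

Sorting the whole training set per query is the simplest approach; it costs O(|D| log |D|) per query, which is the price a lazy learner pays at classification time.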
Case-Based Reasoning (CBR)
CBR: uses a database of problem solutions to solve new problems
Store symbolic descriptions (tuples or cases), not points in a Euclidean space
Applications: customer service (product-related diagnosis), legal ruling
Methodology
  Instances represented by rich symbolic descriptions (e.g., function graphs)
  Search for similar cases; multiple retrieved cases may be combined
  Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
Challenges
  Find a good similarity metric
  Indexing based on a syntactic similarity measure and, on failure, backtracking and adapting to additional cases
Genetic Algorithms (GA)
Rough Set Approach
Rough sets are used to approximately or "roughly" define equivalent classes
A rough set for a given class C is approximated by two sets: a lower approximation (certain to be in C) and an upper approximation (cannot be described as not belonging to C)
Finding the minimal subsets (reducts) of attributes for feature reduction is NP-hard, but a discernibility matrix (which stores the differences between attribute values for each pair of data tuples) is used to reduce the computation intensity
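The lower and upper approximations can be computed by grouping tuples that are indistinguishable on the chosen attributes. This is a minimal sketch; the data layout (a dict from tuple id to attribute values) and the function name are assumptions made for the example.

```python
from collections import defaultdict


def approximations(tuples, attrs, target):
    """Return the (lower, upper) approximations of class C.

    tuples: dict mapping tuple id -> dict of attribute values
    attrs:  attributes used to tell tuples apart
    target: set of tuple ids that belong to class C
    """
    # Tuples with identical values on attrs fall into the same equivalence class.
    blocks = defaultdict(set)
    for tid, row in tuples.items():
        blocks[tuple(row[a] for a in attrs)].add(tid)
    lower, upper = set(), set()
    for block in blocks.values():
        if block <= target:   # every member is in C: certainly in C
            lower |= block
        if block & target:    # some member is in C: cannot be ruled out
            upper |= block
    return lower, upper


# Hypothetical data: tuples 1 and 2 are indistinguishable, but only tuple 1 is in C.
data = {1: {"a": 0, "b": 0}, 2: {"a": 0, "b": 0}, 3: {"a": 1, "b": 0}, 4: {"a": 1, "b": 1}}
lower, upper = approximations(data, ["a", "b"], target={1, 3})
```

Tuple 3 is certainly in C, while tuples 1 and 2 land only in the upper approximation because the attributes cannot separate them.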
Fuzzy Set Approaches
Fuzzy logic uses truth values between 0.0 and 1.0 to represent the degree of membership (such as using a fuzzy membership graph)
Attribute values are converted to fuzzy values
  e.g., income is mapped into the discrete categories {low, medium, high} with fuzzy values calculated
For a given new sample, more than one fuzzy value may apply
Each applicable rule contributes a vote for membership in the categories
Typically, the truth values for each predicted category are summed, and these sums are combined
What Is Prediction?
(Numerical) prediction is similar to classification
  Construct a model
  Use the model to predict a continuous or ordered value for a given input
Prediction is different from classification
  Classification refers to predicting a categorical class label
  Prediction models continuous-valued functions
Major method for prediction: regression
  Model the relationship between one or more independent or predictor variables and a dependent or response variable
Regression analysis
  Linear and multiple regression
  Non-linear regression
  Other regression methods: generalized linear model, Poisson regression, log-linear models, regression trees
Linear Regression
Linear regression: involves a response variable y and a single predictor variable x
  y = w0 + w1 x
  where w0 (y-intercept) and w1 (slope) are regression coefficients
Method of least squares: estimates the best-fitting straight line
  $w_1 = \frac{\sum_{i=1}^{|D|} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{|D|} (x_i - \bar{x})^2}$, $w_0 = \bar{y} - w_1 \bar{x}$
Multiple linear regression: involves more than one predictor variable
  Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
  Ex. For 2-D data, we may have: y = w0 + w1 x1 + w2 x2
  Solvable by an extension of the least squares method or using SAS, S-Plus
  Many nonlinear functions can be transformed into the above
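The least squares estimates for a single predictor variable can be computed directly from the means. A minimal sketch, with a hypothetical function name and made-up data:

```python
def fit_line(xs, ys):
    """Least squares estimates (w0, w1) for the line y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx            # slope
    w0 = y_bar - w1 * x_bar   # y-intercept
    return w0, w1


# Made-up data lying exactly on y = 1 + 2x.
w0, w1 = fit_line([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
```

The function assumes at least two distinct x values; otherwise the denominator is zero.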
Nonlinear Regression
Some nonlinear models can be modeled by a polynomial function
A polynomial regression model can be transformed into a linear regression model. For example,
  y = w0 + w1 x + w2 x^2 + w3 x^3
  is convertible to linear form with new variables x2 = x^2, x3 = x^3:
  y = w0 + w1 x + w2 x2 + w3 x3
Other functions, such as the power function, can also be transformed to a linear model
Some models are intractably nonlinear (e.g., sum of exponential terms)
  It is possible to obtain least squares estimates through extensive calculation on more complex formulae
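The power function case can be sketched by taking logarithms: y = a x^b becomes ln y = ln a + b ln x, which is linear in ln x. The function names and data below are hypothetical, and the sketch assumes all x and y values are positive.

```python
import math


def fit_line(xs, ys):
    """Least squares estimates (w0, w1) for the line y = w0 + w1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    w1 = sxy / sxx
    return y_bar - w1 * x_bar, w1


def fit_power(xs, ys):
    """Fit y = a * x**b by linear regression on (ln x, ln y); needs x, y > 0."""
    w0, w1 = fit_line([math.log(x) for x in xs], [math.log(y) for y in ys])
    return math.exp(w0), w1   # a = e^w0, b = w1


# Made-up data generated from y = 3 * x^2.
a, b = fit_power([1.0, 2.0, 4.0, 8.0], [3.0, 12.0, 48.0, 192.0])
```

Note that fitting in log space minimizes squared error on ln y rather than on y, so the estimates can differ from a direct nonlinear fit on noisy data.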
Other Regression-Based Models
Regression Trees and Model Trees
Regression tree: proposed in the CART system (Breiman et al. 1984)
  CART: Classification And Regression Trees
  Each leaf stores a continuous-valued prediction
  It is the average value of the predicted attribute for the training tuples that reach the leaf
Model tree: proposed by Quinlan (1992)
  Each leaf holds a regression model: a multivariate linear equation for the predicted attribute
  A more general case than the regression tree
Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model
Predictive Modeling in Multidimensional Databases
Predictive modeling: predict data values or construct generalized linear models based on the database data
One can only predict value ranges or category distributions
Method outline:
  Minimal generalization
  Attribute relevance analysis
Prediction: Categorical Data
Classifier Accuracy Measures
Confusion matrix (rows: actual class, columns: predicted class):
        C1               C2
C1      True positive    False negative
C2      False positive   True negative
Accuracy of a classifier M, acc(M): percentage of test set tuples that are correctly classified by the model M
Error rate (misclassification rate) of M = 1 - acc(M)
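Accuracy and error rate follow directly from the four cells of the confusion matrix. The counts below are made up for illustration and are not the figures from the slide's table.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of test tuples classified correctly."""
    return (tp + tn) / (tp + fn + fp + tn)


def error_rate(tp, fn, fp, tn):
    """Misclassification rate: 1 - acc(M)."""
    return 1.0 - accuracy(tp, fn, fp, tn)


# Made-up counts: 90 true positives, 10 false negatives,
# 30 false positives, 870 true negatives.
acc = accuracy(90, 10, 30, 870)
err = error_rate(90, 10, 30, 870)
```

With these counts 960 of 1000 tuples are classified correctly, so `acc` is 0.96.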
Predictor Error Measures
Measure predictor accuracy: measure how far off the predicted value is from the actual known value
Loss function: measures the error between $y_i$ and the predicted value $y_i'$
  Absolute error: $|y_i - y_i'|$
  Squared error: $(y_i - y_i')^2$
Test error (generalization error): the average loss over the test set
  Mean absolute error: $\frac{1}{d}\sum_{i=1}^{d} |y_i - y_i'|$
  Mean squared error: $\frac{1}{d}\sum_{i=1}^{d} (y_i - y_i')^2$
  Relative absolute error: $\frac{\sum_{i=1}^{d} |y_i - y_i'|}{\sum_{i=1}^{d} |y_i - \bar{y}|}$
  Relative squared error: $\frac{\sum_{i=1}^{d} (y_i - y_i')^2}{\sum_{i=1}^{d} (y_i - \bar{y})^2}$
The mean squared error exaggerates the presence of outliers
The (square) root mean squared error is popularly used; similarly, the root relative squared error
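The error measures can be sketched as follows, using the standard definitions (the relative measures normalize by the error of always predicting the mean of the actual values). The function names and the sample values are hypothetical.

```python
import math


def mean_absolute_error(y, y_pred):
    return sum(abs(a - p) for a, p in zip(y, y_pred)) / len(y)


def mean_squared_error(y, y_pred):
    return sum((a - p) ** 2 for a, p in zip(y, y_pred)) / len(y)


def root_mean_squared_error(y, y_pred):
    return math.sqrt(mean_squared_error(y, y_pred))


def relative_absolute_error(y, y_pred):
    y_bar = sum(y) / len(y)
    return (sum(abs(a - p) for a, p in zip(y, y_pred))
            / sum(abs(a - y_bar) for a in y))


def relative_squared_error(y, y_pred):
    y_bar = sum(y) / len(y)
    return (sum((a - p) ** 2 for a, p in zip(y, y_pred))
            / sum((a - y_bar) ** 2 for a in y))


# Made-up values: three exact predictions and one outlier that is off by 4.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 8.0]
```

On this data the single outlier gives a mean absolute error of 1.0 but a mean squared error of 4.0, which illustrates how the squared measure exaggerates outliers.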
Evaluating the Accuracy of a Classifier or Predictor (I)
Holdout method
Evaluating the Accuracy of a Classifier or Predictor (II)
Bootstrap
  Works well with small data sets
  Samples the given training tuples uniformly with replacement
    i.e., each time a tuple is selected, it is equally likely to be selected again and re-added to the training set
Several bootstrap methods exist, and a common one is the .632 bootstrap
  Suppose we are given a data set of d tuples. The data set is sampled d times, with replacement, resulting in a training set of d samples. The data tuples that did not make it into the training set end up forming the test set. About 63.2% of the original data will end up in the bootstrap, and the remaining 36.8% will form the test set (since $(1 - 1/d)^d \approx e^{-1} = 0.368$)
  Repeat the sampling procedure k times; overall accuracy of the model:
  $acc(M) = \sum_{i=1}^{k} \left(0.632 \times acc(M_i)_{test\_set} + 0.368 \times acc(M_i)_{train\_set}\right)$
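The 63.2% figure can be checked empirically by drawing one bootstrap sample. This is an illustrative sketch: the function names, the seed, and the data set size are arbitrary choices, and `combine_632` covers a single round only.

```python
import math
import random


def bootstrap_split(d, rng):
    """Sample d indices with replacement; indices never drawn form the test set."""
    train = [rng.randrange(d) for _ in range(d)]
    drawn = set(train)
    test = [i for i in range(d) if i not in drawn]
    return train, test


def combine_632(acc_test, acc_train):
    """Per-round .632 combination of test-set and training-set accuracy."""
    return 0.632 * acc_test + 0.368 * acc_train


rng = random.Random(0)                      # fixed seed, chosen arbitrarily
train, test = bootstrap_split(10000, rng)
fraction_in_train = len(set(train)) / 10000
limit = (1 - 1 / 10000) ** 10000            # close to e^-1
```

For large d, `fraction_in_train` comes out near 0.632 and `limit` near 0.368, matching the slide's argument.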
Ensemble Methods: Increasing the Accuracy
Ensemble methods
  Use a combination of models to increase accuracy
  Combine a series of k learned models, M1, M2, …, Mk, with the aim of creating an improved model M*
Popular ensemble methods
  Bagging: averaging the prediction over a collection of classifiers
  Boosting: weighted vote with a collection of classifiers
  Ensemble: combining a set of heterogeneous classifiers
Bagging: Bootstrap Aggregation
Analogy: diagnosis based on multiple doctors' majority vote
Training
Each classifier Mi returns its class prediction
The bagged classifier M* counts the votes and assigns the class with the most votes to X
Prediction: can be applied to the prediction of continuous values by taking the average value of each prediction for a given test tuple
Accuracy
  Often significantly better than a single classifier derived from D
  For noisy data: not considerably worse, more robust
  Proved improved accuracy in prediction
Boosting
Analogy: consult several doctors, based on a combination of weighted diagnoses, with weight assigned based on the previous diagnosis accuracy
How does boosting work?
  Weights are assigned to each training tuple
  A series of k classifiers is iteratively learned
  After a classifier Mi is learned, the weights are updated to allow the subsequent classifier, Mi+1, to pay more attention to the training tuples that were misclassified by Mi
  The final M* combines the votes of each individual classifier, where the weight of each classifier's vote is a function of its accuracy
The boosting algorithm can be extended for the prediction of continuous values
Comparing with bagging: boosting tends to achieve greater accuracy, but it also risks overfitting the model to misclassified data
Adaboost (Freund and Schapire, 1997)
Given d class-labeled tuples (X1, y1), …, (Xd, yd)
Initially, all the weights of tuples are set the same (1/d)
Classifier Mi error rate is the sum of the weights of the misclassified tuples:
  $error(M_i) = \sum_{j}^{d} w_j \times err(X_j)$
The weight of classifier Mi's vote is
  $\log\frac{1 - error(M_i)}{error(M_i)}$
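The two quantities above can be computed as follows. The sketch covers only the weighted error and the vote weight, not the full AdaBoost training loop; the function names and the example flags are hypothetical, and the natural logarithm is assumed.

```python
import math


def weighted_error(weights, misclassified):
    """error(M_i): sum of the weights of the tuples that M_i misclassifies."""
    return sum(w for w, wrong in zip(weights, misclassified) if wrong)


def vote_weight(error):
    """Weight of a classifier's vote, log((1 - error) / error); needs 0 < error < 1."""
    return math.log((1.0 - error) / error)


d = 5
weights = [1.0 / d] * d                            # initial weights: 1/d each
misclassified = [True, False, False, False, False]  # hypothetical outcome for M_i
err = weighted_error(weights, misclassified)
vote = vote_weight(err)
```

A classifier with a lower weighted error receives a larger vote; at an error of exactly 0.5 the vote weight is zero.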
Model Selection: 6CCurves
-
7/24/2019 Data Mining Classification and Prediction by Dr. Tanvir Ahmed
121/124
November 16, 2015Data Mining: Concepts and
Techniques 121
ROC (Receiver Operating Characteristics) curves: for visual comparison of classification models
Originated from signal detection theory
Shows the trade-off between the true positive rate and the false positive rate
The area under the ROC curve is a measure of the accuracy of the model
Rank the test tuples in decreasing order: the one that is most likely to belong to the positive class appears at the top of the list
The closer to the diagonal line (i.e., the closer the area is to 0.5), the less accurate is the model
Vertical axis represents the true positive rate
Horizontal axis represents the false positive rate
The plot also shows a diagonal line
A model with perfect accuracy will have an area of 1.0
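The ranking procedure described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the function names `roc_points` and `auc`, and the scores and labels, are assumptions made up for the example. Labels are taken to be 1 for the positive class and 0 for the negative class, and the area is computed with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return the (FPR, TPR) points found by walking down the ranked test tuples."""
    pos = sum(labels)
    neg = len(labels) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one positive and one negative tuple")
    # Rank tuples in decreasing order of their score for the positive class.
    ranked = sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)
    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label:
            tp += 1  # a true positive moves the curve up
        else:
            fp += 1  # a false positive moves the curve right
        # Emit a point only after the last tuple of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


# Hypothetical classifier scores and true labels (1 = positive class).
scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
area = auc(roc_points(scores, labels))
```

For this toy ranking, one negative tuple is ranked above one positive tuple, so the area is 8/9 rather than 1.0; a ranking that places every positive tuple above every negative one gives an area of exactly 1.0.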
Summary (I)
Classification and prediction are two forms of data analysis that can be used to extract models describing important data classes or to predict future data trends.
Effective and scalable methods have been developed for decision tree induction, naive Bayesian classification, Bayesian belief networks, rule-based classifiers, backpropagation, Support Vector Machines (SVM), associative classification, nearest neighbor classifiers, and case-based reasoning, as well as other classification methods such as genetic algorithms and rough set and fuzzy set approaches.
Linear, nonlinear, and generalized linear models of regression can be used for prediction. Many nonlinear problems can be converted to linear problems by performing transformations on the predictor variables. Regression trees and model trees are also used for prediction.
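As a small illustration of converting a nonlinear problem into a linear one, the sketch below fits y = a + b·x² by first transforming the predictor to z = x² and then applying ordinary least squares to (z, y). The helper name `fit_line` and the synthetic data are assumptions made for this example only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Synthetic, noise-free data generated from y = 1 + 2 * x^2 (illustration only).
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x ** 2 for x in xs]

# Transform the predictor: the model is nonlinear in x but linear in z = x^2.
zs = [x ** 2 for x in xs]
a, b = fit_line(zs, ys)
```

Because the data are noise-free, the fit recovers the generating coefficients; with real data the same transformation applies, but the estimates would only approximate the underlying relationship.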
Summary (II)
Stratified k-fold cross-validation is a recommended method for accuracy estimation. Bagging and boosting can be used to increase overall accuracy by learning and combining a series of individual models.
Significance tests and ROC curves are useful for model selection
There have been numerous comparisons of the different classification and prediction methods, and the matter remains a research topic
No single method has been found to be superior over all others for all data sets
Issues such as accuracy, training time, robustness, interpretability,