UVA CS 6316/4501 – Fall 2016
Machine Learning

Lecture 16: Decision Tree / Random Forest / Ensemble

Dr. Yanjun Qi
University of Virginia
Department of Computer Science
Where are we? → Five major sections of this course

• Regression (supervised)
• Classification (supervised)
• Unsupervised models
• Learning theory
• Graphical models
Choosing the right estimator
http://scikit-learn.org/stable/tutorial/machine_learning_map/
Scikit-learn: Regression
• Linear model fitted by minimizing a regularized empirical loss with SGD

Scikit-learn: Classification
• Linear classifiers (SVM, logistic regression, …) with SGD training
• Approximate the explicit feature mappings that correspond to certain kernels
• Combine the predictions of several base estimators built with a given learning algorithm, in order to improve generalizability/robustness over a single estimator: (1) averaging/bagging, (2) boosting (see the sketch below)
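As a minimal sketch of two of the tools named above, an SGD-trained linear classifier and an approximate kernel feature map, in scikit-learn; the toy dataset and parameter values are illustrative assumptions, not from the slides:

```python
from sklearn.datasets import make_classification
from sklearn.kernel_approximation import RBFSampler
from sklearn.linear_model import SGDClassifier

# Toy data (an assumption for illustration only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Linear classifier (hinge loss, i.e. a linear SVM) trained with SGD.
linear_clf = SGDClassifier(loss="hinge", random_state=0).fit(X, y)

# Approximate the explicit feature map of an RBF kernel, then fit the
# same kind of linear model in the transformed feature space.
rbf = RBFSampler(gamma=1.0, random_state=0)
X_rbf = rbf.fit_transform(X)
kernel_clf = SGDClassifier(loss="hinge", random_state=0).fit(X_rbf, y)

print(linear_clf.score(X, y), kernel_clf.score(X_rbf, y))
```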
Course roadmap: Basic PCA / Bayes-Net, HMM / K-means + GMM
→ next, after classification?
Today

• Decision Tree (DT):
  – Tree representation
  – Brief information theory
  – Learning decision trees
• Bagging
• Random forests: ensemble of DT
• More about ensemble
A study comparing classifiers
(Proceedings of the 23rd International Conference on Machine Learning, ICML '06)

A study comparing classifiers → 11 binary classification problems / 8 metrics
(figure: results table of the top 8 models)
Where are we? → Three major sections for classification

• We can divide the large variety of classification approaches into roughly three major types:
1. Discriminative: directly estimate a decision rule/boundary, e.g., logistic regression, support vector machine, decision tree
2. Generative: build a generative statistical model, e.g., naïve Bayes classifier, Bayesian networks
3. Instance-based classifiers: use observations directly (no models), e.g., K nearest neighbors
A dataset for classification

• Data / points / instances / examples / samples / records: [rows]
• Features / attributes / dimensions / independent variables / covariates / predictors / regressors: [columns, except the last]
• Target / outcome / response / label / dependent variable: the special column to be predicted [last column]

Output as discrete class label: C1, C2, …, CL
Example

• Example: Play Tennis
(figure: Play Tennis training table, with the class label column C)
Anatomy of a decision tree

Outlook?
├── sunny    → Humidity? (high → No ; normal → Yes)
├── overcast → Yes
└── rain     → Windy?   (true → No ; false → Yes)

• Each node is a test on one feature/attribute
• The branches below a node are the possible values of that attribute
• Leaves are the decisions
• Sample size: your data gets smaller as you move down the tree
Apply model to test data: to 'play tennis' or not

A new test example: (Outlook == rain) and (Windy == false).
Pass it down the tree above → the decision is yes.
Apply model to test data: three cases of "YES"

(Outlook == overcast) → yes
(Outlook == rain) and (Windy == false) → yes
(Outlook == sunny) and (Humidity == normal) → yes

Decision trees
• Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances:
• (Outlook == overcast)
• OR ((Outlook == rain) and (Windy == false))
• OR ((Outlook == sunny) and (Humidity == normal))
• ⇒ yes, play tennis (see the sketch below)
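A minimal sketch of this tree as a hand-coded rule function, checking the three "YES" conjunctions above; the function name and example records are hypothetical:

```python
def play_tennis(outlook, humidity, windy):
    """Walk the play-tennis decision tree and return 'yes' or 'no'."""
    if outlook == "overcast":
        return "yes"
    if outlook == "rain":                 # decision depends on Windy
        return "no" if windy else "yes"
    # outlook == "sunny": decision depends on Humidity
    return "yes" if humidity == "normal" else "no"

# The three "YES" paths from the slide:
assert play_tennis("overcast", "high", True) == "yes"
assert play_tennis("rain", "high", False) == "yes"
assert play_tennis("sunny", "normal", True) == "yes"
```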
Representation

A?
├── 1 → B? (1 → true ; 0 → false)
└── 0 → C? (1 → true ; 0 → false)

Y = ((A and B) or ((not A) and C))
Same concept / different representation

Y = ((A and B) or ((not A) and C))

The same function can also be drawn as a tree rooted at C, e.g.:

C?
├── 0 → B? (0 → false ; 1 → A?: 1 → true, 0 → false)
└── 1 → A? (0 → true ; 1 → B?: 1 → true, 0 → false)

Different tree structures can represent the same concept (verified in the sketch below).
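A quick sketch that verifies the claim by brute force: any tree for this concept, whatever attribute it tests first, must agree with Y = (A and B) or ((not A) and C) on all 8 inputs (the helper names are hypothetical):

```python
from itertools import product

def tree_rooted_at_A(a, b, c):
    # First tree: test A; A = 1 leads to the B subtree, A = 0 to C.
    return b if a else c

def tree_rooted_at_C(a, b, c):
    # Second tree: test C first, then A or B below.
    if not c:
        return a and b          # C = 0 branch: Y reduces to (A and B)
    return (not a) or b         # C = 1 branch: Y reduces to ((not A) or B)

for a, b, c in product([False, True], repeat=3):
    y = (a and b) or ((not a) and c)
    assert tree_rooted_at_A(a, b, c) == y
    assert tree_rooted_at_C(a, b, c) == y
```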
Which attribute to select for splitting?

(figure: a node with 16+ / 16− split into children with 8+ / 8−, then 4+ / 4−, then 2+ / 2−; the counts show the distribution of each class, not of the attribute)

Every child keeps the same 50/50 class mix, so the splits separate nothing: this is bad splitting.
How do we choose which attribute to split?

Which attribute should be tested first? Intuitively, you would prefer the one that separates the training examples as much as possible.
Today

• Decision Tree (DT):
  – Tree representation
  – Brief information theory
  – Learning decision trees
• Bagging
• Random forests: ensemble of DT
• More about ensemble
Information gain is one criterion to decide which attribute to split on

• Imagine:
  – 1. Someone is about to tell you your own name
  – 2. You are about to observe the outcome of a dice roll
  – 3. You are about to observe the outcome of a coin flip
  – 4. You are about to observe the outcome of a biased coin flip
• Each situation has a different amount of uncertainty as to what outcome you will observe.
Information

• Information := reduction in uncertainty (the amount of surprise in the outcome)

  I(E) = log2(1 / p(x)) = −log2 p(x)

• Observing that the outcome of a coin flip is heads:
  I = −log2(1/2) = 1
• Observing that the outcome of a dice roll is 6:
  I = −log2(1/6) ≈ 2.58

If the probability of an event is small and it nevertheless happens, the information is large.
Entropy

• The expected amount of information when observing the output of a random variable X:

  H(X) = E(I(X)) = Σ_i p(x_i) I(x_i) = −Σ_i p(x_i) log2 p(x_i)

• If X can have 8 outcomes and all are equally likely:

  H(X) = −Σ_i (1/8) log2(1/8) = 3
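A minimal sketch of the information and entropy formulas above in Python (assuming log base 2, so the units are bits):

```python
import math

def information(p):
    """Self-information I = -log2 p of an event with probability p."""
    return -math.log2(p)

def entropy(probs):
    """H(X) = -sum_i p_i log2 p_i, with the convention 0 log 0 = 0."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(information(1 / 2))    # coin flip comes up heads: 1 bit
print(information(1 / 6))    # dice roll shows 6: ~2.58 bits
print(entropy([1 / 8] * 8))  # 8 equally likely outcomes: 3 bits
```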
Entropy

• If there are k possible outcomes: H(X) ≤ log2 k
• Equality holds when all outcomes are equally likely
• The more the probability distribution deviates from uniformity, the lower the entropy

(figure: entropy of a random binary variable as a function of p, maximized at p = 0.5)
Entropy: lower → better purity

• Entropy measures the purity of a node.
• Compare a node with 4+ / 4− to a node with 8+ / 0−: the latter distribution is less uniform, its entropy is lower, and the node is purer.
Information gain

• IG(X, Y) = H(Y) − H(Y|X): the reduction in uncertainty of Y from knowing a feature variable X

Information gain := (information before split) − (information after split)
                  = entropy(parent) − [average entropy(children)]

• H(Y) is fixed, so the lower H(Y|X), the better (the children nodes are purer)
• Equivalently, for IG: the higher, the better
Conditional entropy

H(Y) = −Σ_i p(y_i) log2 p(y_i)

H(Y|X) = Σ_j p(x_j) H(Y | X = x_j)
       = −Σ_j p(x_j) Σ_i p(y_i|x_j) log2 p(y_i|x_j)

H(Y | X = x_j) = −Σ_i p(y_i|x_j) log2 p(y_i|x_j)
Example

X1  X2  Y  Count
T   T   +  2
T   F   +  2
F   T   −  5
F   F   +  1

(X1 and X2 are attributes; Y is the label)

IG(X1, Y) = H(Y) − H(Y|X1)
H(Y) = −(5/10) log2(5/10) − (5/10) log2(5/10) = 1
H(Y|X1) = P(X1=T) H(Y|X1=T) + P(X1=F) H(Y|X1=F)
        = (4/10) · 0 + (6/10) · (−(5/6) log2(5/6) − (1/6) log2(1/6))   (using 0 log 0 = 0)
        ≈ 0.39

Information gain: IG(X1, Y) = 1 − 0.39 = 0.61

Which one do we choose, X1 or X2?
Which one do we choose?

X1  X2  Y  Count
T   T   +  2
T   F   +  2
F   T   −  5
F   F   +  1

Information gain: IG(X1, Y) = 0.61; IG(X2, Y) ≈ 0.40

Pick X1: pick the variable that provides the most information gain about Y.
→ Then recursively choose the next Xi on each branch: one branch receives the X1 = T rows, the other branch the X1 = F rows (a sketch that recomputes both gains follows).
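A short sketch that recomputes both information gains directly from the (X1, X2, Y, Count) table above; the row encoding and helper names are illustrative choices:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Rows of the slide's table: (X1, X2, Y, count).
rows = [("T", "T", "+", 2), ("T", "F", "+", 2),
        ("F", "T", "-", 5), ("F", "F", "+", 1)]
total = sum(c for *_, c in rows)                  # 10 samples

def info_gain(attr):                              # attr: 0 for X1, 1 for X2
    h_y = entropy([5 / total, 5 / total])         # H(Y): 5 "+" vs 5 "-"
    h_y_given_x = 0.0
    for value in ("T", "F"):
        group = [r for r in rows if r[attr] == value]
        n = sum(c for *_, c in group)
        pos = sum(c for _, _, y, c in group if y == "+")
        h_y_given_x += (n / total) * entropy([pos / n, (n - pos) / n])
    return h_y - h_y_given_x

print(round(info_gain(0), 2))   # IG(X1, Y) = 0.61
print(round(info_gain(1), 2))   # IG(X2, Y) ~ 0.40, so X1 wins
```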
Decision Trees

• Caveat: the number of possible values of an attribute influences the information gain. The more possible values, the higher the gain (the more likely it is to form small but pure partitions).

• Other purity (diversity) measures:
  – Information gain
  – Gini (population diversity): Σ_k p_mk (1 − p_mk), where p_mk is the proportion of class k at node m
  – Chi-square test
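A minimal sketch comparing two of the purity measures above on a single node, where the input is the vector of class proportions p_mk:

```python
import math

def cross_entropy(p):
    """Entropy/deviance: -sum_k p_mk log2 p_mk."""
    return -sum(pk * math.log2(pk) for pk in p if pk > 0)

def gini(p):
    """Gini index: sum_k p_mk * (1 - p_mk)."""
    return sum(pk * (1 - pk) for pk in p)

print(cross_entropy([0.5, 0.5]), gini([0.5, 0.5]))  # impure node: 1.0, 0.5
print(cross_entropy([1.0, 0.0]), gini([1.0, 0.0]))  # pure node:   0.0, 0.0
```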
Overfitting

• You can perfectly fit a DT to any training data.
• Instability of trees:
  – High variance (small changes in the training set will result in changes of the tree model)
  – Hierarchical structure → an error in the top split propagates down
• Two approaches (sketched below):
  – 1. Stop growing the tree when further splitting the data does not yield an improvement
  – 2. Grow a full tree, then prune it by eliminating nodes
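A sketch of the two approaches with scikit-learn's DecisionTreeClassifier; the dataset and parameter values are illustrative assumptions (ccp_alpha, the cost-complexity pruning knob, requires scikit-learn >= 0.22):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Approach 1: stop growing early via depth / leaf-size limits.
stopped = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5,
                                 random_state=0).fit(X_tr, y_tr)

# Approach 2: grow a full tree, then prune it (cost-complexity pruning).
pruned = DecisionTreeClassifier(ccp_alpha=0.01,
                                random_state=0).fit(X_tr, y_tr)

print(stopped.score(X_te, y_te), pruned.score(X_te, y_te))
```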
From ESL book Ch. 9: Classification and Regression Trees (CART)
• Partition the feature space into a set of rectangles
• Fit a simple model in each partition
Summary: decision trees

• Non-linear classifier
• Easy to use
• Easy to interpret
• Susceptible to overfitting, but this can be avoided
Decision Tree / Random Forest

Task:                 Classification
Representation:       Partition the feature space into a set of rectangles; local smoothness
Score function:       Split purity measure, e.g., IG / cross-entropy / Gini
Search/Optimization:  Greedy search to find partitions
Models, Parameters:   Tree model(s), i.e., the space partition
Today

• Decision Tree (DT):
  – Tree representation
  – Brief information theory
  – Learning decision trees
• Bagging
• Random forests: ensemble of DT
• More about ensemble
Bagging

• Bagging = bootstrap aggregation
• A technique for reducing the variance of an estimated prediction function
• For instance, for classification: a committee of trees
• Each tree casts a vote for the predicted class
Bootstrap

The basic idea: randomly draw datasets with replacement (i.e., allowing duplicates) from the training data, each sample the same size as the original training set.
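A minimal numpy sketch of the bootstrap; the toy "training set" of N = 10 points is an assumption for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
data = np.arange(10)            # pretend training set of N = 10 points

for b in range(3):
    # Draw N indices with replacement: duplicates are allowed, and each
    # bootstrap sample has the same size as the original training set.
    idx = rng.integers(0, len(data), size=len(data))
    print(f"bootstrap sample {b}: {data[idx]}")
```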
With vs. without replacement

• Bootstrap with replacement can keep the sampling size the same as the original size for every repeated sampling. The sampled data groups are independent of each other.
• Bootstrap without replacement cannot keep the sampling size the same as the original size for every repeated sampling. The sampled data groups are dependent on each other.
Bagging

Given N examples with M features, create B bootstrap samples from the training data.
Bagging of DT classifiers

Fit a decision tree to each bootstrap sample (N examples, M features), then take the majority vote over the trees.

i.e., refit the model to each bootstrap dataset, and then examine the behavior over the B replications (see the sketch below).
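A sketch of bagged decision trees with majority voting, using scikit-learn's BaggingClassifier; the dataset and B = 50 are illustrative assumptions:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)

# B = 50 trees, each fit to a bootstrap sample of the training data;
# predictions are aggregated by majority vote across the committee.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=50,
                        bootstrap=True, random_state=0).fit(X, y)
print(bag.score(X, y))
```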
Bagging for classification with 0/1 loss

• Classification with 0/1 loss:
  – Bagging a good classifier can make it better.
  – Bagging a bad classifier can make it worse.
  – The bagging effect can be understood in terms of a consensus of independent weak learners and the wisdom of crowds.
Peculiarities

• Model instability is good when bagging:
  – The more variable (unstable) the base model is, the more improvement can potentially be obtained
  – Low-variability methods (e.g., LDA) improve less than high-variability methods (e.g., decision trees)
• Load of redundancy:
  – Most predictors do roughly "the same thing"
Bagging: a simulated example (ESL book, Example 8.7.1)

• N = 30 training samples, two classes, p = 5 features
• Each feature has an N(0, 1) distribution, with pairwise correlation 0.95
• The response Y is generated from the first feature (in ESL's example, Pr(Y = 1 | x1 ≤ 0.5) = 0.2 and Pr(Y = 1 | x1 > 0.5) = 0.8)
• Test sample size of 2000
• Fit classification trees to the training set and to B = 200 bootstrap samples

Notice that the bootstrap trees differ from the original tree: the five features are highly correlated with each other → no clear winner when picking which feature to split on → small changes in the training set result in a different tree.
→ But these trees are actually quite similar for classification.

Two ways to aggregate the B votes:
• Consensus: majority vote
• Probability: average the class distributions at the terminal nodes

(figure: test error vs. number of bootstrap trees B)
→ For B > 30, more trees do not improve the bagging results, since the trees are highly correlated with each other and give similar classifications.
Bagging

• Slightly increases the model space
  – Cannot help where a greater enlargement of the space is needed
• Bagged trees are correlated
  – Use random forests to reduce the correlation between trees
Today

• Decision Tree (DT):
  – Tree representation
  – Brief information theory
  – Learning decision trees
• Bagging
• Random forests: a special ensemble of DT
• More about ensemble
Random forest classifier

• The random forest classifier is an extension of bagging that uses de-correlated trees.
Random forest classifier

Given N examples with M features:
1. Create bootstrap samples from the training data.
2. Create a decision tree from each bootstrap sample; at each node, when choosing the split feature, choose only among m < M randomly drawn features.
3. To classify a new example, take the majority vote over the trees.
Random Forests

1. For each of our B bootstrap samples:
   a. Form a tree in the following manner:
      i. Given p dimensions, pick m of them
      ii. Split only according to these m dimensions (we will NOT consider the other p − m dimensions)
      iii. Repeat steps i & ii for each split (note: we pick a different set of m dimensions for each split on a single tree)

(Pages 598–599 in the ESL book)
Random Forests

• A random forest can be viewed as a refinement of bagging, with a tweak of de-correlating the trees:
  – At each tree split, a random subset of m features out of all p features is drawn to be considered for splitting
• Some guidelines were provided by Breiman, but be careful to choose m based on the specific problem:
  – m = p amounts to bagging
  – m = p/3 or log2(p) for regression
  – m = sqrt(p) for classification
  (see the sketch after this list)
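A sketch of the m-out-of-p rule in scikit-learn's RandomForestClassifier; the dataset is an illustrative assumption:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=16, random_state=0)

# max_features plays the role of m, the number of features considered at
# each split; "sqrt" gives m = sqrt(p), the usual classification choice.
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt",
                            random_state=0).fit(X, y)
print(rf.score(X, y))

# Setting max_features=None (i.e., m = p) reduces the forest to bagging.
bagging_like = RandomForestClassifier(n_estimators=200, max_features=None,
                                      random_state=0).fit(X, y)
```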
Why are correlated trees not ideal?

• Random forests try to reduce the correlation between the trees. Why?

• Assume each tree has variance σ².
• If the trees are independently and identically distributed, then the variance of their average is σ²/B.
• If they are merely identically distributed (not independent), with pairwise correlation ρ (a positive value), then the variance of the average is

  ρσ² + ((1 − ρ)/B) σ²

• As B → ∞, the second term → 0.
• Thus, the pairwise correlation always limits how much averaging can reduce the variance.

• How to deal with this?
  – If we reduce m (the number of dimensions we actually consider at each split),
  – then we reduce the pairwise tree correlation,
  – and thus the variance will be reduced (a numeric check follows).
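A numeric sketch of the variance formula ρσ² + ((1 − ρ)/B)σ²: simulate B equicorrelated "tree predictions" and compare the empirical variance of their average against the formula (B, ρ, σ², and the trial count are illustrative assumptions):

```python
import numpy as np

B, rho, sigma2, trials = 50, 0.6, 1.0, 100_000
rng = np.random.default_rng(0)

# Equicorrelated Gaussians: a shared component of variance rho*sigma2 plus
# independent components of variance (1-rho)*sigma2 gives pairwise
# correlation rho between any two of the B "trees".
shared = rng.normal(size=(trials, 1)) * np.sqrt(rho * sigma2)
indiv = rng.normal(size=(trials, B)) * np.sqrt((1 - rho) * sigma2)
avg = (shared + indiv).mean(axis=1)

print(avg.var())                               # empirical variance
print(rho * sigma2 + (1 - rho) * sigma2 / B)   # formula: 0.6 + 0.4/50
```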
Today

• Decision Tree (DT):
  – Tree representation
  – Brief information theory
  – Learning decision trees
• Bagging
• Random forests: ensemble of DT
• More about ensemble
e.g., Ensembles in practice: the Netflix Prize (Oct 2006 – 2009)

• Each rating/sample: <user, movie, date of grade, grade>
• Training set: 100,480,507 ratings; qualifying set: 2,817,131 ratings → winner
• Team "BellKor's Pragmatic Chaos" defeated the team "The Ensemble" by submitting just 20 minutes earlier! → 1 million dollars!
• The ensemble teams were blenders of multiple different methods.
References

• Prof. Tan, Steinbach, Kumar's "Introduction to Data Mining" slides
• Hastie, Trevor, et al. The Elements of Statistical Learning. 2nd ed. New York: Springer, 2009.
• Dr. Oznur Tastan's slides about RF and DT