Ensemble learning, Lecture 12 (people.csail.mit.edu/dsontag/courses/ml12/slides/lecture12.pdf)
Transcript of the lecture slides.
Ensemble learning Lecture 12
David Sontag, New York University
Slides adapted from Luke Zettlemoyer, Vibhav Gogate, Rob Schapire, and Tommi Jaakkola
Ensemble methods

Machine learning competition with a $1 million prize
Bias/Variance Tradeoff
Hastie, Tibshirani, Friedman “Elements of Statistical Learning” 2001
Reduce Variance Without Increasing Bias
• Averaging reduces variance: $\mathrm{Var}(\bar{X}) = \mathrm{Var}(X)/N$ when predictions are independent (a quick numeric check follows the list)
• Average models to reduce model variance
• One problem: only one training set. Where do multiple models come from?
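As a quick illustration (my own, not from the slides), here is a minimal Monte Carlo check of the variance-reduction claim: averaging N independent, unbiased predictors, each with variance σ², gives an averaged prediction with variance roughly σ²/N.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma2, N, trials = 1.0, 10, 100_000

# Each row holds N independent model predictions of the same target (taken to be 0),
# each with variance sigma2.
preds = rng.normal(loc=0.0, scale=np.sqrt(sigma2), size=(trials, N))
avg = preds.mean(axis=1)

print(np.var(preds[:, 0]))  # ~1.0: variance of a single model
print(np.var(avg))          # ~0.1: variance of the average of N = 10 independent models
```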
Bagging: Bootstrap Aggregation

• Leo Breiman (1994)
• Take repeated bootstrap samples from training set D.
• Bootstrap sampling: given a set D containing N training examples, create D’ by drawing N examples at random with replacement from D.
• Bagging (sketched in code below):
  – Create k bootstrap samples D1, ..., Dk.
  – Train a distinct classifier on each Di.
  – Classify a new instance by majority vote / average.
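A minimal bagging sketch (my own illustration, not from the slides), using scikit-learn decision trees as the base classifier and assuming labels in {−1, +1}:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, k=25, seed=0):
    """Train k classifiers, each on a bootstrap sample of (X, y)."""
    rng = np.random.default_rng(seed)
    n = len(y)
    models = []
    for _ in range(k):
        idx = rng.integers(0, n, size=n)  # draw N examples with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Classify a new instance by majority vote over the bootstrap-trained classifiers."""
    votes = np.stack([m.predict(X) for m in models])  # shape (k, n_test)
    return np.sign(votes.sum(axis=0))                 # labels assumed to be in {-1, +1}
```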
Bagging
• Best case: the variance of the bagged average is 1/N times the variance of a single model (this requires the models’ predictions to be independent)
• In practice: models are correlated, so the reduction is smaller than 1/N (see the formula below)
• The variance of models trained on fewer training cases is usually somewhat larger
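A standard way to make the correlation effect precise (this formula is not on the slide, but it is the usual textbook result): if the N model predictions each have variance σ² and pairwise correlation ρ, then

```latex
\operatorname{Var}\!\left(\frac{1}{N}\sum_{i=1}^{N}\hat f_i\right)
  = \rho\,\sigma^{2} + \frac{1-\rho}{N}\,\sigma^{2}.
```

As N grows the variance approaches ρσ² rather than 0, so only uncorrelated models (ρ = 0) give the full 1/N reduction.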
(Figure: a decision tree learning algorithm; very similar to ID3.)
(Figure: shades of blue/red indicate the strength of the vote for a particular classification.)
Reduce Bias² and Decrease Variance?

• Bagging reduces variance by averaging
• Bagging has little effect on bias
• Can we average and reduce bias? Yes: boosting
Theory and Applications of Boosting
Rob Schapire
Example: “How May I Help You?” [Gorin et al.]

• goal: automatically categorize the type of call requested by a phone customer (Collect, CallingCard, PersonToPerson, etc.)
  • yes I’d like to place a collect call long distance please (Collect)
  • operator I need to make a call but I need to bill it to my office (ThirdNumber)
  • yes I’d like to place a call on my master card please (CallingCard)
  • I just called a number in sioux city and I musta rang the wrong number because I got the wrong party and I would like to have that taken off of my bill (BillingCredit)
• observation:
  • easy to find “rules of thumb” that are “often” correct
    • e.g.: “IF ‘card’ occurs in utterance THEN predict ‘CallingCard’ ”
  • hard to find a single highly accurate prediction rule (a concrete rule of thumb is sketched below)
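To make the idea concrete (my own sketch, not part of the slides), such a rule of thumb is just a crude keyword classifier that is often right but far from accurate on its own:

```python
def rule_of_thumb(utterance):
    """A weak 'rule of thumb': IF 'card' occurs in the utterance THEN predict 'CallingCard'."""
    if "card" in utterance.lower():
        return "CallingCard"
    return None  # abstain on everything else

print(rule_of_thumb("yes I'd like to place a call on my master card please"))  # CallingCard
```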
The Boosting Approach
• devise computer program for deriving rough rules of thumb
• apply procedure to subset of examples
• obtain rule of thumb
• apply to 2nd subset of examples
• obtain 2nd rule of thumb
• repeat T times
Key Details

• how to choose examples on each round?
  • concentrate on “hardest” examples (those most often misclassified by previous rules of thumb)
• how to combine rules of thumb into a single prediction rule?
  • take a (weighted) majority vote of the rules of thumb
Boosting

• boosting = general method of converting rough rules of thumb into a highly accurate prediction rule
• technically:
  • assume given a “weak” learning algorithm that can consistently find classifiers (“rules of thumb”) at least slightly better than random, say, accuracy ≥ 55% (in the two-class setting) [the “weak learning assumption”]
  • given sufficient data, a boosting algorithm can provably construct a single classifier with very high accuracy, say, 99%
Preamble: Early History
Strong and Weak Learnability

• boosting’s roots are in the “PAC” learning model [Valiant ’84]
  • get random examples from an unknown, arbitrary distribution
• strong PAC learning algorithm:
  • for any distribution, with high probability, given polynomially many examples (and polynomial time), can find a classifier with arbitrarily small generalization error
• weak PAC learning algorithm:
  • same, but generalization error only needs to be slightly better than random guessing ($\tfrac{1}{2} - \gamma$)
• [Kearns & Valiant ’88]: does weak learnability imply strong learnability?
If Boosting Possible, Then...

• can use (fairly) wild guesses to produce highly accurate predictions
• if can learn “part way” then can learn “all the way”
• should be able to improve any learning algorithm
• for any learning problem:
  • either can always learn with nearly perfect accuracy
  • or there exist cases where cannot learn even slightly better than random guessing
First Boosting Algorithms

• [Schapire ’89]: first provable boosting algorithm
• [Freund ’90]: “optimal” algorithm that “boosts by majority”
• [Drucker, Schapire & Simard ’92]:
  • first experiments using boosting
  • limited by practical drawbacks
• [Freund & Schapire ’95]:
  • introduced the “AdaBoost” algorithm
  • strong practical advantages over previous boosting algorithms
Application: Detecting Faces [Viola & Jones]
• problem: find faces in photograph or movie
• weak classifiers: detect light/dark rectangles in image
• many clever tricks to make extremely fast and accurate
Basic Algorithm and Core Theory

• introduction to AdaBoost
• analysis of training error
• analysis of test error and the margins theory
• experiments and applications
A Formal Description of Boosting

• given training set $(x_1, y_1), \ldots, (x_m, y_m)$
• $y_i \in \{-1, +1\}$ is the correct label of instance $x_i \in X$
• for $t = 1, \ldots, T$:
  • construct distribution $D_t$ on $\{1, \ldots, m\}$
  • find weak classifier (“rule of thumb”) $h_t : X \to \{-1, +1\}$ with error $\epsilon_t$ on $D_t$: $\epsilon_t = \Pr_{i \sim D_t}[\,h_t(x_i) \ne y_i\,]$
• output final/combined classifier $H_{\mathrm{final}}$
AdaBoost [with Freund]

• constructing $D_t$:
  • $D_1(i) = 1/m$
  • given $D_t$ and $h_t$:
$$D_{t+1}(i) \;=\; \frac{D_t(i)}{Z_t} \times \begin{cases} e^{-\alpha_t} & \text{if } y_i = h_t(x_i) \\ e^{\alpha_t} & \text{if } y_i \ne h_t(x_i) \end{cases} \;=\; \frac{D_t(i)}{Z_t}\,\exp(-\alpha_t\, y_i\, h_t(x_i))$$
  where $Z_t$ is a normalization factor and
$$\alpha_t = \tfrac{1}{2}\ln\!\left(\frac{1-\epsilon_t}{\epsilon_t}\right) > 0$$
• final classifier (implemented in the sketch below):
$$H_{\mathrm{final}}(x) = \operatorname{sign}\!\left(\sum_t \alpha_t\, h_t(x)\right)$$
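Putting the updates together, here is a minimal AdaBoost training sketch in Python (my own illustration; labels are assumed to be in {−1, +1} and one-level scikit-learn decision trees serve as the weak learners):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_fit(X, y, T=50):
    """Run T rounds of AdaBoost; returns the weak classifiers h_t and weights alpha_t."""
    m = len(y)
    D = np.full(m, 1.0 / m)                    # D_1(i) = 1/m
    hs, alphas = [], []
    for _ in range(T):
        h = DecisionTreeClassifier(max_depth=1).fit(X, y, sample_weight=D)
        pred = h.predict(X)
        eps = max(D[pred != y].sum(), 1e-12)   # weighted error on D_t (guard against 0)
        if eps >= 0.5:                         # weak-learning assumption violated: stop
            break
        alpha = 0.5 * np.log((1 - eps) / eps)
        D *= np.exp(-alpha * y * pred)         # exp(-alpha_t * y_i * h_t(x_i))
        D /= D.sum()                           # divide by the normalization factor Z_t
        hs.append(h)
        alphas.append(alpha)
    return hs, alphas

def adaboost_predict(hs, alphas, X):
    """H_final(x) = sign( sum_t alpha_t h_t(x) )."""
    scores = sum(a * h.predict(X) for h, a in zip(hs, alphas))
    return np.sign(scores)
```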
Toy Example

(Figure: training points with the initial, uniform distribution $D_1$.)
weak classifiers = vertical or horizontal half-planes
Round 1

(Figure: weak classifier $h_1$ and the reweighted distribution $D_2$.)
$\epsilon_1 = 0.30$, $\alpha_1 = 0.42$
Round 2

(Figure: weak classifier $h_2$ and the reweighted distribution $D_3$.)
$\epsilon_2 = 0.21$, $\alpha_2 = 0.65$
Round 3

(Figure: weak classifier $h_3$.)
$\epsilon_3 = 0.14$, $\alpha_3 = 0.92$
Final Classifier

$$H_{\mathrm{final}} = \operatorname{sign}\big(0.42\, h_1 + 0.65\, h_2 + 0.92\, h_3\big)$$
(Figure: the three weighted half-plane classifiers combine into a more complex decision boundary.)
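A quick numeric check (mine, not on the slides) that each $\alpha_t$ follows from $\alpha_t = \tfrac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$; the small differences from the slide’s 0.65 and 0.92 are just rounding of the $\epsilon_t$ values:

```python
import numpy as np

for t, eps in enumerate([0.30, 0.21, 0.14], start=1):
    alpha = 0.5 * np.log((1 - eps) / eps)
    print(f"round {t}: eps = {eps:.2f}, alpha = {alpha:.2f}")
# round 1: eps = 0.30, alpha = 0.42
# round 2: eps = 0.21, alpha = 0.66
# round 3: eps = 0.14, alpha = 0.91
```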
Voted combination of classifiers

• The general problem here is to try to combine many simple “weak” classifiers into a single “strong” classifier
• We consider voted combinations of simple binary ±1 component classifiers
$$h_m(x) = \alpha_1\, h(x; \theta_1) + \ldots + \alpha_m\, h(x; \theta_m)$$
where the (non-negative) votes $\alpha_i$ can be used to emphasize component classifiers that are more reliable than others
Components: decision stumps

• Consider the following simple family of component classifiers generating ±1 labels:
$$h(x; \theta) = \operatorname{sign}(\,w_1 x_k - w_0\,)$$
where $\theta = \{k, w_1, w_0\}$. These are called decision stumps.
• Each decision stump pays attention to only a single component of the input vector (a fitting sketch follows below)
(Figure: a decision stump’s ±1 output as a function of a single input coordinate.)
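A minimal sketch of fitting a decision stump by exhaustive search (my own illustration with hypothetical helper names; it restricts $w_1$ to ±1, which can represent the same axis-aligned decision boundaries, and minimizes a weighted classification error so it can serve as the weak learner inside boosting):

```python
import numpy as np

def fit_stump(X, y, w):
    """Return (k, thresh, s) minimizing the w-weighted error of s * sign(x_k - thresh)."""
    n, d = X.shape
    best = (0, 0.0, 1.0, np.inf)                 # (k, thresh, s, weighted error)
    for k in range(d):
        for thresh in np.unique(X[:, k]):
            for s in (+1.0, -1.0):
                pred = s * np.sign(X[:, k] - thresh)
                pred[pred == 0] = s              # break ties on the boundary consistently
                err = w[pred != y].sum()
                if err < best[3]:
                    best = (k, thresh, s, err)
    return best[:3]

def stump_predict(stump, X):
    k, thresh, s = stump
    pred = s * np.sign(X[:, k] - thresh)
    pred[pred == 0] = s
    return pred
```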
Voted combination cont’d

• We need to define a loss function for the combination so we can determine which new component $h(x; \theta)$ to add and how many votes it should receive
$$h_m(x) = \alpha_1\, h(x; \theta_1) + \ldots + \alpha_m\, h(x; \theta_m)$$
• While there are many options for the loss function, we consider here only a simple exponential loss
$$\exp\{-y\, h_m(x)\}$$
Modularity, errors, and loss

• Consider adding the mth component:
$$\sum_{i=1}^{n} \exp\{-y_i [\,h_{m-1}(x_i) + \alpha_m h(x_i; \theta_m)\,]\}
\;=\; \sum_{i=1}^{n} \exp\{-y_i h_{m-1}(x_i) - y_i \alpha_m h(x_i; \theta_m)\}$$
$$=\; \sum_{i=1}^{n} \underbrace{\exp\{-y_i h_{m-1}(x_i)\}}_{\text{fixed at stage } m}\, \exp\{-y_i \alpha_m h(x_i; \theta_m)\}
\;=\; \sum_{i=1}^{n} W_i^{(m-1)} \exp\{-y_i \alpha_m h(x_i; \theta_m)\}$$
So at the mth iteration the new component (and the votes) should optimize a weighted loss (weighted towards mistakes); the sketch below spells out these weights.
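To see why the weighting concentrates on mistakes (a sketch of my own, not from the slides): the fixed factor $W_i^{(m-1)} = \exp\{-y_i h_{m-1}(x_i)\}$ is large exactly when the current ensemble scores example $i$ incorrectly, so the stage-m objective is a mistake-weighted exponential loss.

```python
import numpy as np

def stage_m_objective(y, ensemble_scores, alpha_m, new_pred):
    """Exponential loss after adding alpha_m * h(.; theta_m) to the ensemble.

    y               : labels in {-1, +1}
    ensemble_scores : h_{m-1}(x_i) for each training example
    new_pred        : h(x_i; theta_m), each in {-1, +1}
    """
    W = np.exp(-y * ensemble_scores)   # W_i^(m-1): fixed at stage m, large on current mistakes
    return np.sum(W * np.exp(-y * alpha_m * new_pred))
```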
Empirical exponential loss cont’d

• To increase modularity we’d like to further decouple the optimization of $h(x; \theta_m)$ from the associated votes $\alpha_m$
• To this end we select $h(x; \theta_m)$ that optimizes the rate at which the loss would decrease as a function of $\alpha_m$:
$$\frac{\partial}{\partial \alpha_m}\bigg|_{\alpha_m = 0} \sum_{i=1}^{n} W_i^{(m-1)} \exp\{-y_i \alpha_m h(x_i; \theta_m)\}
\;=\; \left[\,\sum_{i=1}^{n} W_i^{(m-1)} \exp\{-y_i \alpha_m h(x_i; \theta_m)\} \cdot \big(-y_i h(x_i; \theta_m)\big)\right]_{\alpha_m = 0}
\;=\; \sum_{i=1}^{n} W_i^{(m-1)} \big(-y_i h(x_i; \theta_m)\big)$$
Empirical exponential loss cont’d

• We find $h(x; \hat\theta_m)$ that minimizes
$$-\sum_{i=1}^{n} W_i^{(m-1)} y_i h(x_i; \theta_m)$$
We can also normalize the weights:
$$-\sum_{i=1}^{n} \frac{W_i^{(m-1)}}{\sum_{j=1}^{n} W_j^{(m-1)}}\, y_i h(x_i; \theta_m)
\;=\; -\sum_{i=1}^{n} \tilde{W}_i^{(m-1)} y_i h(x_i; \theta_m)$$
so that $\sum_{i=1}^{n} \tilde{W}_i^{(m-1)} = 1$.
Selecting a new component: summary

• We find $h(x; \hat\theta_m)$ that minimizes
$$-\sum_{i=1}^{n} \tilde{W}_i^{(m-1)} y_i h(x_i; \theta_m)$$
where $\sum_{i=1}^{n} \tilde{W}_i^{(m-1)} = 1$.
• $\alpha_m$ is subsequently chosen to minimize
$$\sum_{i=1}^{n} \tilde{W}_i^{(m-1)} \exp\{-y_i \alpha_m h(x_i; \hat\theta_m)\}$$
The AdaBoost algorithm

0) Set $\tilde{W}_i^{(0)} = 1/n$ for $i = 1, \ldots, n$.
1) At the mth iteration we find (any) classifier $h(x; \hat\theta_m)$ for which the weighted classification error
$$\epsilon_m = 0.5 - \frac{1}{2}\left(\sum_{i=1}^{n} \tilde{W}_i^{(m-1)} y_i h(x_i; \hat\theta_m)\right)$$
is better than chance.
2) The new component is assigned votes based on its error:
$$\hat\alpha_m = 0.5 \log\big(\,(1 - \epsilon_m)/\epsilon_m\,\big)$$
3) The weights are updated according to ($Z_m$ is chosen so that the new weights $\tilde{W}_i^{(m)}$ sum to one):
$$\tilde{W}_i^{(m)} = \frac{1}{Z_m}\cdot \tilde{W}_i^{(m-1)} \cdot \exp\{-y_i \hat\alpha_m h(x_i; \hat\theta_m)\}$$
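A single round of steps 1)-3) in this normalized-weight notation, worked on a tiny hand-made example (my own sketch; it complements the full training loop sketched earlier):

```python
import numpy as np

y    = np.array([+1, +1, -1, -1, +1])         # toy labels
pred = np.array([+1, -1, -1, -1, +1])         # h(x_i; theta_hat_m): one mistake (example 2)
W    = np.full(len(y), 1.0 / len(y))          # W~_i^(m-1), uniform at the first iteration

eps_m   = 0.5 - 0.5 * np.sum(W * y * pred)    # weighted classification error (0.2 here)
alpha_m = 0.5 * np.log((1 - eps_m) / eps_m)   # votes assigned to the new component (~0.69)

W_new = W * np.exp(-y * alpha_m * pred)
W_new /= W_new.sum()                          # dividing by Z_m so the new weights sum to one
print(W_new)                                  # the misclassified example's weight grows to 0.5
```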