Machine learning: lecture 10
Tommi S. Jaakkola
MIT AI Lab
Topics
• Combination of classifiers: boosting
– modularity, reweighting
– AdaBoost, examples, generalization
• Complexity and model selection
– shattering, Vapnik-Chervonenkis dimension
Tommi Jaakkola, MIT AI Lab 2
Combination of classifiers
• We wish to generate a set of simple “weak” classification
methods and combine them into a single “strong” method
• The simple classifiers in our case
are decision stumps:
h(x; θ) = sign( w_1 x_k − w_0 )

where θ = {k, w_1, w_0}.
Each decision stump pays attention
to only a single component of the
input vector.

[Figure: 2D training data with the decision boundary of a single stump]
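A decision stump of this form can be sketched in a few lines (an illustrative helper; the function name is made up, but the parameters k, w_1, w_0 follow the slide):

```python
import numpy as np

def stump_predict(x, k, w1, w0):
    """Decision stump h(x; theta) = sign(w1 * x[k] - w0).

    theta = {k, w1, w0}: the stump looks only at component k of the input.
    """
    return np.sign(w1 * x[..., k] - w0)

# Example: threshold the first coordinate at 0.3
x = np.array([[0.1, 2.0], [0.5, -1.0]])
print(stump_predict(x, k=0, w1=1.0, w0=0.3))  # -> [-1.  1.]
```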
Combination of classifiers cont’d
• We’d like to combine the simple classifiers additively so that
the final classifier is the sign of
h_m(x) = α_1 h(x; θ_1) + . . . + α_m h(x; θ_m)
where the “votes” α emphasize component classifiers that
make more reliable predictions than others
• Important issues:
– what is the criterion that we are optimizing? (measure of
loss)
– we would like to estimate each new component classifier
in the same manner (modularity)
Combination of classifiers cont’d
• One possible measure of empirical loss is

∑_{i=1}^n exp{ −y_i h_m(x_i) }

= ∑_{i=1}^n exp{ −y_i h_{m−1}(x_i) − y_i α_m h(x_i; θ_m) }

= ∑_{i=1}^n exp{ −y_i h_{m−1}(x_i) } · exp{ −y_i α_m h(x_i; θ_m) }

(the first factor is fixed at stage m)

= ∑_{i=1}^n W_i^{(m−1)} exp{ −y_i α_m h(x_i; θ_m) }
The combined classifier based on m − 1 iterations defines a
weighted loss criterion for the next simple classifier to add
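The factorization above can be checked numerically: with W_i^{(m−1)} = exp{ −y_i h_{m−1}(x_i) }, the exponential loss of the combined classifier equals the weighted loss of the new component alone (a small sketch with made-up scores, not data from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
y = rng.choice([-1.0, 1.0], size=n)      # labels
h_prev = rng.normal(size=n)              # h_{m-1}(x_i), combined score so far
h_new = rng.choice([-1.0, 1.0], size=n)  # h(x_i; theta_m), new stump outputs
alpha = 0.7                              # candidate vote alpha_m

# Direct evaluation of the loss of h_m = h_{m-1} + alpha * h_new
loss_direct = np.exp(-y * (h_prev + alpha * h_new)).sum()

# Same loss via the stage-m weights W_i^{(m-1)} = exp(-y_i h_{m-1}(x_i))
W = np.exp(-y * h_prev)                  # fixed at stage m
loss_weighted = (W * np.exp(-y * alpha * h_new)).sum()

print(np.isclose(loss_direct, loss_weighted))  # -> True
```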
Combination cont’d
• We can simplify the estimation criterion for the new
component classifier a bit
When α_m ≈ 0 (low confidence votes)

exp{ −y_i α_m h(x_i; θ_m) } ≈ 1 − y_i α_m h(x_i; θ_m)

and our empirical loss criterion reduces to

≈ ∑_{i=1}^n W_i^{(m−1)} ( 1 − y_i α_m h(x_i; θ_m) )

= ∑_{i=1}^n W_i^{(m−1)} − α_m ( ∑_{i=1}^n W_i^{(m−1)} y_i h(x_i; θ_m) )

We could choose each new component classifier to optimize
a weighted agreement
Possible algorithm
• At stage m we find θ_m that maximizes (or at least gives a
sufficiently high) weighted agreement

∑_{i=1}^n W_i^{(m−1)} y_i h(x_i; θ_m)

where the weights W_i^{(m−1)} summarize the effect of the
previously combined m − 1 classifiers.
• We find the “votes” α_m associated with the new classifier
by minimizing the weighted loss

∑_{i=1}^n W_i^{(m−1)} exp{ −y_i α_m h(x_i; θ_m) }
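A brute-force search for the stump maximizing the weighted agreement might look as follows (an illustrative sketch; the slides leave the search procedure open, and the stump is parametrized here as sign( s (x_k − t) ), equivalent to sign( w_1 x_k − w_0 ) with w_1 = s, w_0 = s t):

```python
import numpy as np

def best_stump(X, y, W):
    """Exhaustively search stumps h(x) = sign(s * (x[k] - t)) for the one
    maximizing the weighted agreement sum_i W_i y_i h(x_i; theta)."""
    best = (-np.inf, None)
    n, d = X.shape
    for k in range(d):
        # Candidate thresholds: one below all values, plus midpoints
        vals = np.unique(X[:, k])
        thresholds = np.concatenate([[vals[0] - 1.0],
                                     (vals[:-1] + vals[1:]) / 2.0])
        for t in thresholds:
            for s in (-1.0, 1.0):
                h = s * np.sign(X[:, k] - t)
                agree = np.sum(W * y * h)
                if agree > best[0]:
                    best = (agree, (k, t, s))
    return best

# Toy 1D data, uniform weights: the split at t = 1.5 is perfect
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([-1.0, -1.0, 1.0, 1.0])
W = np.full(4, 0.25)
agree, (k, t, s) = best_stump(X, y, W)
print(agree)  # -> 1.0 (perfect weighted agreement)
```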
Boosting
• We have basically derived a Boosting algorithm that
sequentially adds new component classifiers by reweighting
training examples
– each component classifier is presented with a slightly
different problem
• AdaBoost preliminaries:
– we work with normalized weights W_i on the training
examples, initially uniform (W_i = 1/n)
The AdaBoost algorithm
1: At the kth iteration we find (any) classifier h(x; θ_k) for which
the weighted classification error ε_k,

ε_k = 1/2 − 1/2 ( ∑_{i=1}^n W_i^{(k−1)} y_i h(x_i; θ_k) ),

is better than chance (ε_k < 1/2).
2: Determine how many “votes” to assign to the new
component classifier:

α_k = 1/2 log( (1 − ε_k)/ε_k )  (decorrelation)

3: Update the weights on the training examples:

W_i^{(k)} = W_i^{(k−1)} · exp{ −y_i α_k h(x_i; θ_k) }

and renormalize the new weights so they sum to one.
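Steps 1–3 can be sketched end-to-end (a minimal illustrative implementation with an exhaustive stump search; the function names and the toy data are made up, not from the slides):

```python
import numpy as np

def adaboost_stumps(X, y, n_rounds):
    """AdaBoost with decision stumps, following steps 1-3 on the slide."""
    n = len(y)
    W = np.full(n, 1.0 / n)        # initially uniform weights
    stumps, alphas = [], []
    for _ in range(n_rounds):
        # Step 1: find the stump with the lowest weighted error eps_k
        best = None
        for k in range(X.shape[1]):
            for t in np.unique(X[:, k]):
                for s in (-1.0, 1.0):
                    h = np.where(s * (X[:, k] - t) >= 0, 1.0, -1.0)
                    eps = 0.5 - 0.5 * np.sum(W * y * h)
                    if best is None or eps < best[0]:
                        best = (eps, (k, t, s), h)
        eps, theta, h = best
        # Step 2: votes for the new classifier
        alpha = 0.5 * np.log((1 - eps) / eps)
        # Step 3: reweight the examples and renormalize
        W = W * np.exp(-y * alpha * h)
        W = W / W.sum()
        stumps.append(theta)
        alphas.append(alpha)
    return stumps, alphas

def predict(X, stumps, alphas):
    score = np.zeros(len(X))
    for (k, t, s), a in zip(stumps, alphas):
        score += a * np.where(s * (X[:, k] - t) >= 0, 1.0, -1.0)
    return np.sign(score)

# Toy 1D data that no single stump can fit, but a few rounds can
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, -1.0, -1.0, 1.0])
stumps, alphas = adaboost_stumps(X, y, n_rounds=5)
print((predict(X, stumps, alphas) == y).mean())  # -> 1.0
```

Note that the first selected stump gets only 3 of 4 points right; the reweighting in step 3 shifts attention to the misclassified point so later stumps can correct it.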
The AdaBoost algorithm cont’d
• The final classifier after m boosting iterations is given by the
sign of
h(x) = [ α_1 h(x; θ_1) + . . . + α_m h(x; θ_m) ] / [ α_1 + . . . + α_m ]

(the votes here are normalized for convenience)
Boosting: example
[Figure: four plots of the 2D training data showing the combined decision boundary after successive boosting iterations]
Boosting: example cont’d
[Figure: four more plots of the 2D training data at later boosting iterations]
Boosting: example cont’d
[Figure: two further plots of the 2D training data at the final boosting iterations]
Boosting performance
• Training/test errors for the combined classifier
h(x) = [ α_1 h(x; θ_1) + . . . + α_m h(x; θ_m) ] / [ α_1 + . . . + α_m ]

[Figure: training and test error curves as a function of the number of boosting iterations]
What about the error rate of the component classifiers
(decision stumps)?
• Even after the training error of the combined classifier goes to
zero, boosting iterations can still improve the generalization
error!
Boosting and margin
• Successive boosting iterations improve the majority vote or
margin for the training examples
margin for example i = y_i [ α_1 h(x_i; θ_1) + . . . + α_m h(x_i; θ_m) ] / [ α_1 + . . . + α_m ]

The margin lies in [−1, 1] and is negative for all misclassified
examples.
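Computing the normalized margins is a one-liner once the component outputs and votes are collected (toy numbers chosen for illustration, not from the slides):

```python
import numpy as np

# Outputs h(x_i; theta_j) in {-1, +1} for n = 4 examples (rows)
# and m = 3 component classifiers (columns), with votes alphas.
H = np.array([[ 1,  1, -1],
              [-1, -1, -1],
              [ 1, -1,  1],
              [-1,  1, -1]], dtype=float)
alphas = np.array([0.5, 0.3, 0.2])
y = np.array([1.0, -1.0, 1.0, 1.0])

margins = y * (H @ alphas) / alphas.sum()   # y_i times the normalized vote
print(margins)  # example 4 has a negative margin: it is misclassified
```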
Topics
• Complexity and model selection
– shattering, Vapnik-Chervonenkis dimension
– structural risk minimization (next lecture)
Measures of complexity
• “Complexity” is a measure of a set of classifiers, not any
specific (fixed) classifier
• Many possible measures
– degrees of freedom
– description length
– Vapnik-Chervonenkis dimension
etc.
• There are many reasons for introducing a measure of
complexity
– generalization error guarantees
– selection among competing families of classifiers
VC-dimension: preliminaries
• A set of classifiers F: for example, this could be the set of all
possible linear separators, where h ∈ F means that

h(x) = sign( w_0 + w^T x )

for some values of the parameters w, w_0.
VC-dimension: preliminaries
• Complexity: how many different ways can we label n
training points {x_1, . . . , x_n} with classifiers h ∈ F?
In other words, how many distinct binary vectors

[h(x_1) h(x_2) . . . h(x_n)]

do we get by trying each h ∈ F in turn?

[ -1  1 . . .  1 ]  h_1
[  1 -1 . . .  1 ]  h_2
. . .
VC-dimension: shattering
• A set of classifiers F shatters n points {x_1, . . . , x_n} if

[h(x_1) h(x_2) . . . h(x_n)], h ∈ F

generates all 2^n distinct labelings.
• Example: linear decision boundaries shatter (any) 3 points
in 2D
[Figure: all 2^3 = 8 labelings of three points in 2D, each realized by some linear decision boundary]
VC-dimension: shattering cont’d
• We cannot shatter 4 points in 2D with linear separators
For example, the following labeling
[Figure: four points in 2D with alternating +/− labels (the XOR pattern)]

cannot be produced with any linear separator
• More generally: the set of all d-dimensional linear separators
can shatter exactly d + 1 points
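These counting claims can be checked by brute force: sweep a grid of linear separators and collect the distinct labelings they produce (an illustrative check; the grid resolution and the particular point sets are arbitrary choices):

```python
import numpy as np

def achievable_labelings(points):
    """Sweep separators h(x) = sign(w . x - b) over a grid of directions w
    and offsets b, collecting the distinct labelings of `points`."""
    labelings = set()
    for deg in range(360):
        w = np.array([np.cos(np.radians(deg)), np.sin(np.radians(deg))])
        for b in np.linspace(-2.0, 2.0, 81):
            h = np.where(points @ w - b >= 0, 1, -1)
            labelings.add(tuple(h))
    return labelings

three = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
four = np.array([[0.0, 0.0], [1.0, 1.0], [1.0, 0.0], [0.0, 1.0]])

lab3 = achievable_labelings(three)
lab4 = achievable_labelings(four)
print(len(lab3))               # -> 8: all 2^3 labelings, so 3 points are shattered
print((1, 1, -1, -1) in lab4)  # -> False: the XOR labeling is never produced
print(len(lab4) < 16)          # -> True: 4 points are not shattered
```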
VC-dimension
• The VC-dimension d_VC of a set of classifiers F is the largest
number of points that F can shatter
• This is a combinatorial concept and doesn’t depend on what
type of classifier we use, only how “flexible” the set of
classifiers is
Example: Let F be a set of classifiers defined in terms of
linear combinations of m fixed basis functions

h(x) = sign( w_0 + w_1 φ_1(x) + . . . + w_m φ_m(x) )

d_VC is at most m + 1 regardless of the form of the fixed
basis functions.