9.520/spring08/Classes/recht_040708.pdf
Approximation Theory
Ben Recht
Center for the Mathematics of Information
Caltech
April 7, 2008
References
• The majority of this material is adapted from F. Girosi's 9.520 lecture from 2003.
– Available on OCW
– Very readable with an extensive bibliography
• Random Features
– Ali Rahimi and Benjamin Recht. "Random Features for Large-Scale Kernel Machines." NIPS 2007.
– Ali Rahimi and Benjamin Recht. "On the power of randomized shallow belief networks." In preparation, 2008.
Outline
• Decomposition of the generalization error
• Approximation and rates of convergence
• "Dimension Independent" convergence rates
• Maurey-Barron-Jones approximations
• Random Features
Notation
how well we can do
how well we can do in H
how well we can do in H with our L observations
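The three formulas on this slide did not survive extraction; in the usual 9.520 notation (an assumption here), with I[f] the expected risk, they are presumably

```latex
\inf_{f} I[f], \qquad \inf_{f \in \mathcal{H}} I[f], \qquad I[\hat f_{\mathcal{H},L}],
```

i.e. the best possible risk, the best risk achievable within H, and the risk of the function actually learned from L examples.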
Generalization Error Estimation Error Approximation Error
For least squares cost
Estimation Error Approximation Error
Independent of target space (statistics)
Independent of examples (analysis)
Judiciously select H to balance the tradeoff
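The decomposition on this slide is presumably the standard one (a reconstruction):

```latex
I[\hat f] - \inf_f I[f]
\;=\; \underbrace{\Big(I[\hat f] - \inf_{f \in \mathcal{H}} I[f]\Big)}_{\text{estimation error}}
\;+\; \underbrace{\Big(\inf_{f \in \mathcal{H}} I[f] - \inf_f I[f]\Big)}_{\text{approximation error}},
```

and for the least-squares cost, I[f] − I[f*] equals the squared L² distance between f and the regression function f*, so both terms are squared distances.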
• Nested hypothesis spaces
• Error
For most families of hypothesis spaces we encounter
• How fast does this error go to zero? We are interested in bounds of the form
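The lost formulas here are presumably the nested family and an approximation-error sequence (a reconstruction):

```latex
\mathcal{H}_1 \subset \mathcal{H}_2 \subset \cdots, \qquad
\varepsilon(n) := \inf_{f \in \mathcal{H}_n} \|f - f_0\|, \qquad
\varepsilon(n) \;\le\; C\, n^{-r},
```

where f₀ is the target function, ε(n) → 0 for the dense families we typically encounter, and the exponent r quantifies the rate.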
Example Hypothesis Spaces
• Polynomials on [0,1]. Hn is the set of all polynomials with degree at most n
We can approximate any smooth function with a polynomial (Taylor series).
• Sines and cosines on [-π,π].
We can approximate any square integrable function with a Fourier series.
Calculating approximation rates
• Functions in this class can be represented by
• Parseval:
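The representation and the Parseval identity were lost in extraction; for the Fourier setting of the previous slide they presumably read (a reconstruction, writing c_k for the Fourier coefficients):

```latex
f(x) \;=\; \sum_{k=-\infty}^{\infty} c_k\, e^{ikx},
\qquad
\frac{1}{2\pi}\int_{-\pi}^{\pi} |f(x)|^2\, dx \;=\; \sum_{k=-\infty}^{\infty} |c_k|^2 .
```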
Target Space
• Sobolev space of smooth functions
• Using Parseval:
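A standard way to write the target space, consistent with the rate computed two slides later (an assumed reconstruction): functions whose s-th derivative has bounded energy, which Parseval turns into a decay condition on the coefficients,

```latex
\|f^{(s)}\|_{L^2}^2 \;=\; 2\pi \sum_{k} k^{2s}\, |c_k|^2 \;\le\; C^2 .
```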
Hypothesis Space
• Hn is the set of trigonometric polynomials of degree at most n
• If f is of the form
Best approximation in L2 norm by Hn is given by
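The missing formulas are presumably the Fourier expansion and its truncation (a reconstruction):

```latex
f(x) = \sum_{k} c_k\, e^{ikx}
\quad\Longrightarrow\quad
f_n(x) = \sum_{|k| \le n} c_k\, e^{ikx},
\qquad
\|f - f_n\|_{L^2}^2 \;=\; 2\pi \sum_{|k| > n} |c_k|^2 .
```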
Approximation Rate
• Note that Hn has n parameters. How fast does the approximation error go to zero?
• More smoothness, faster convergence
• What happens in higher dimension?
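Before moving to higher dimensions, the "more smoothness, faster convergence" claim is easy to check numerically. A sketch (not from the slides): take hypothetical coefficients c_k = k^{-(s+1)}, which satisfy the Sobolev condition sum of k^{2s}|c_k|² < ∞, and measure the squared truncation error. For these particular coefficients the tail behaves like n^{-(2s+1)}, comfortably inside the worst-case n^{-2s} bound.

```python
import numpy as np

def tail_energy(s, n, K=200_000):
    # Squared L2 truncation error sum_{k>n} |c_k|^2 for the hypothetical
    # coefficients c_k = k^{-(s+1)}: a function whose s-th derivative is
    # square integrable.
    k = np.arange(n + 1, K + 1, dtype=float)
    return float(np.sum(k ** (-2.0 * (s + 1))))

for s in (1, 2):
    e10, e20 = tail_energy(s, 10), tail_energy(s, 20)
    # Doubling n shrinks the error by roughly 2^{-(2s+1)}
    print(f"s={s}: err(10)={e10:.2e}, err(20)={e20:.2e}, ratio={e20/e10:.3f}")
```

Raising s from 1 to 2 makes the ratio drop from about 1/8 to about 1/32: the blessing of smoothness in one dimension.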
• Functions can be written
• Target space
• Again by Parseval
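The missing multivariate formulas presumably mirror the one-dimensional case (a reconstruction):

```latex
f(x) \;=\; \sum_{k \in \mathbb{Z}^d} c_k\, e^{i k \cdot x},
\qquad
\sum_{k \in \mathbb{Z}^d} \|k\|^{2s}\, |c_k|^2 \;\le\; C^2,
\qquad
\|f\|_{L^2}^2 \;\propto\; \sum_{k \in \mathbb{Z}^d} |c_k|^2 .
```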
• Hypothesis space Ht
• Number of parameters in Ht is n = t^d. Best approximation to f is given by
• How fast does the approximation error go to zero? We do the calculation for d = 2:
• Now the approximation scales as (as a function of n):
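The computation presumably concludes as follows (a reconstruction): truncating to the frequencies with ‖k‖∞ ≤ t leaves tail energy bounded by t^{-2s} times the Sobolev norm, and with n = t^d parameters,

```latex
\|f - f_t\|_{L^2}^2 \;\le\; C^2\, t^{-2s} \;=\; C^2\, n^{-2s/d}.
```

The exponent 2s/d is the whole story: the rate degrades linearly in the dimension d.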
Curse of dimensionality, blessing of smoothness
• The rate provides an estimate of the number of parameters needed to reach a given accuracy
• Is this upper bound very loose?
Hard Limits
• Tommi Poggio: just remember Nyquist…
Sample rate = 2 × max freq
Num samples = 2 × T × max freq
In dimension d: Num samples = (2 × T × max freq)^d
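Plugging hypothetical numbers into the count above (a 1-second window and a 10 Hz bandwidth, chosen only for illustration) makes the exponential blow-up concrete:

```python
T, fmax = 1.0, 10.0   # hypothetical: 1 s window, 10 Hz max frequency
for d in (1, 2, 5, 10):
    n = (2 * T * fmax) ** d   # the slide's count: (2 x T x max freq)^d
    print(f"d={d}: {n:.3g} samples")
```

Twenty samples suffice in one dimension; by d = 10 the same resolution needs on the order of 10^13 samples.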
N-widths
• Let X be a normed space of functions and let A be a subset of X. We want to approximate elements of A with linear combinations of a finite set of "basis functions" from X.
• Kolmogorov N-widths let us quantify how well we could do over all choices of finite sets of basis functions.
The n-width of A in X
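The slide's formula is presumably the standard Kolmogorov definition (a reconstruction):

```latex
d_n(A; X) \;=\; \inf_{g_1,\dots,g_n \in X}\; \sup_{f \in A}\; \inf_{c_1,\dots,c_n}\; \Big\| f - \sum_{k=1}^{n} c_k\, g_k \Big\|_X ,
```

the worst-case error of the best n-dimensional linear subspace.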
Multivariate Example
• Theorem (from Pinkus 1980):
This rate is achieved by splines
s times differentiable, sth derivative in L2
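The statement presumably reads (reconstructed; the transcript lost the constants): for A the unit ball of the Sobolev space W^{s,2}([0,1]^d) inside X = L²,

```latex
d_n(A; L^2) \;\asymp\; n^{-s/d}.
```

So no choice of n basis functions can beat the n^{-s/d} rate: the curse of dimensionality is intrinsic to linear approximation, not an artifact of using sines and cosines.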
“Dimension Free” convergence
• Consider networks of the form
• “Shallow” networks with parametric basis functions
• Characterize when we can get good approximations
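The network form lost from this slide is presumably (a reconstruction)

```latex
f_n(x) \;=\; \sum_{k=1}^{n} c_k\, \varphi(x; \theta_k),
```

a single hidden layer of parametric basis functions φ with outer weights c_k and internal parameters θ_k.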
Maurey-Barron-Jones Lemma
• Theorem: If f is in the convex hull of a set G in a Hilbert space with ||g|| ≤ b for all g ∈ G, then for every n ≥ 1 and every c′ > b² − ||f||², there is an fn in the convex hull of n points of G such that ||f − fn||² ≤ c′/n
• Also known as Maurey's "empirical method"
• Many uses in computing covering numbers (see, e.g., generalization bounds, random matrices, compressive sampling, etc.)
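The lemma's 1/n rate follows from sampling, and that proof suggests a quick numerical check. A sketch (not from the slides), with G a hypothetical set of unit vectors (so b = 1) and f a random convex combination of them:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 50, 500
G = rng.standard_normal((m, d))
G /= np.linalg.norm(G, axis=1, keepdims=True)   # each ||g|| = 1, so b = 1
w = rng.random(m)
w /= w.sum()                                     # convex weights
f = w @ G                                        # f lies in conv(G)

def avg_sq_err(n, trials=200):
    # Maurey's empirical method: draw n points g_i with probabilities w
    # (so E[g] = f) and average; E||f - f_n||^2 = (E||g||^2 - ||f||^2)/n.
    errs = []
    for _ in range(trials):
        fn = G[rng.choice(m, size=n, p=w)].mean(axis=0)
        errs.append(np.sum((f - fn) ** 2))
    return float(np.mean(errs))

for n in (10, 40, 160):
    print(n, avg_sq_err(n))   # roughly (1 - ||f||^2)/n
```

Quadrupling n cuts the squared error by about four, matching the c′/n bound, and the constant does not involve the ambient dimension.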
Maurey-type Approximation Schemes
• Jones (1992)
• Barron (1993)
• Girosi & Anzellotti (1995)
• Using nearly identical analysis, all of these schemes achieve an approximation error of order 1/√n, independent of the dimension
Define
Hidden Smoothness
• Barron hides the smoothness via the functional
• Girosi and Anzellotti show that this means
• Note: functions get smoother as d increases
for some
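Barron's functional and the resulting bound are presumably (reconstructed from Barron 1993, with f̂ the Fourier transform of f)

```latex
C_f \;=\; \int_{\mathbb{R}^d} \|\omega\|\, |\hat f(\omega)|\, d\omega,
\qquad
\|f - f_n\|_{L^2(B)}^2 \;\le\; \frac{(2 C_f)^2}{n}.
```

Finiteness of C_f is a smoothness condition, and because it integrates over all of R^d it becomes more restrictive as d grows, which is the sense in which the admissible functions "get smoother as d increases."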
Algorithmic difficulty
• Training these networks is hard
• But for fixed θk, the following is almost always trivial:
• How to avoid optimizing the θk?
Random Features
• What happens if we pick θk at random and then optimize the weights?
• It turns out, with some a priori information about the frequency content of f, we can do just as well as the classical approximation results of Maurey and co.
• Fix parameterized basis functions
• Fix a probability distribution
• Our target space will be:
• With the convention that
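In the Rahimi-Recht setup, the missing definitions presumably are: with parameterized basis functions φ(x; ω) and a density p(ω),

```latex
F_p \;=\; \Big\{\, f(x) = \int \alpha(\omega)\, \varphi(x;\omega)\, d\omega \;:\; \|f\|_p := \sup_\omega \frac{|\alpha(\omega)|}{p(\omega)} < \infty \,\Big\},
```

with the convention that 0/0 = 0. Functions in F_p are mixtures of basis functions whose mixing weights are dominated by p.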
Random Features: Example
• Fourier basis functions:
• Gaussian parameters
• If p is the Gaussian density above, then f ∈ Fp means that the frequency distribution of f has subgaussian tails.
• Theorem: Let f be in Fp with
Let ω1,…, ωn be sampled iid from p. Then
with probability at least 1 - δ.
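The bound in the theorem presumably reads (reconstructed from Rahimi and Recht's approximation result; the exact constant may differ):

```latex
\min_{c_1,\dots,c_n} \Big\| f - \sum_{k=1}^{n} c_k\, \varphi(\cdot;\omega_k) \Big\|_{L^2}
\;\le\; \frac{\|f\|_p}{\sqrt{n}} \left( 1 + \sqrt{2 \log \tfrac{1}{\delta}} \right),
```

the stated high-probability event over the random draw of ω₁,…,ωₙ. This matches the 1/√n Maurey rate, but with basis parameters chosen blindly at random rather than optimized.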
Generalization Error
Estimation Error Approximation Error
• It's a finite-sized basis set!
• Choosing n to balance estimation against approximation gives the overall convergence rate
Kernels
• Note that under the mapping
we have
• Ridge regression with random features approximates Tikhonov regularized least-squares on an RKHS
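The mapping/kernel identity can be sanity-checked numerically. A sketch (not from the slides), using random Fourier features for the Gaussian kernel k(x, y) = exp(−γ‖x − y‖²/2); the feature dimension D and the test points are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
d, D, gamma = 5, 20_000, 1.0
x, y = rng.standard_normal(d), rng.standard_normal(d)

# z(x) = sqrt(2/D) * cos(W x + b), rows of W ~ N(0, gamma I),
# entries of b ~ Uniform[0, 2*pi]
W = np.sqrt(gamma) * rng.standard_normal((D, d))
b = 2 * np.pi * rng.random(D)
zx = np.sqrt(2.0 / D) * np.cos(W @ x + b)
zy = np.sqrt(2.0 / D) * np.cos(W @ y + b)

approx = zx @ zy                                    # <z(x), z(y)>
exact = np.exp(-gamma * np.sum((x - y) ** 2) / 2)   # Gaussian kernel
print(approx, exact)
```

The inner product of the random feature maps concentrates around the kernel value at rate 1/√D, which is why linear methods on these features mimic kernel methods.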
Random Features for Classification
Gaussian RKHS vs Random Features
• Random features are good when L is sufficiently large and the function is sufficiently smooth
• Tikhonov regularization (TR) on an RKHS is good when L is small or the function is not so smooth
% Approximates Gaussian process regression
% with a Gaussian kernel of variance gamma
% lambda: regularization parameter
% dataset: X is d x N, y is 1 x N
% test: xtest is d x 1
% D: dimensionality of random feature

% training
w = randn(D, size(X,1));
b = 2*pi*rand(D,1);
Z = cos(sqrt(gamma)*w*X + repmat(b,1,size(X,2)));

% Equivalent to
% alpha = (lambda*eye(D) + Z*Z')\(Z*y(:));
alpha = symmlq(@(v)(lambda*v(:) + Z*(Z'*v(:))), ...
    Z*y(:), 1e-6, 2000);

% testing
ztest = alpha(:)'*cos(sqrt(gamma)*w*xtest(:) + b);