Faster least squares approximation
Tamas Sarlos
Yahoo! Research
IPAM 23 Oct 2007
http://www.ilab.sztaki.hu/~stamas
Joint work with P. Drineas, M. Mahoney, and S. Muthukrishnan.
Outline
Randomized methods for
• Least squares approximation
• Lp regression
• Regularized regression
• Low rank matrix approximation
Models, curve fitting, and data analysis
In MANY applications (in statistical data analysis and scientific computation), one has n observations (values of a dependent variable y measured at values of an independent variable t): (t_1, y_1), …, (t_n, y_n).
Model y(t) by a linear combination of d basis functions: y(t) ≈ x_1·φ_1(t) + … + x_d·φ_d(t).
A is an n x d "design matrix" with elements A_ij = φ_j(t_i).
In matrix-vector notation: b ≈ Ax, where b is the vector of the n observed values.
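A minimal numerical sketch of this setup (not from the talk): fit y(t) with d polynomial basis functions φ_j(t) = t^j; the data, basis choice, and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 4                      # n observations, d basis functions
t = np.linspace(0, 1, n)
y = 2.0 - t + 0.5 * t**3 + 0.01 * rng.standard_normal(n)  # noisy "measurements"

A = np.vander(t, d, increasing=True)  # design matrix, A[i, j] = phi_j(t_i) = t_i**j
x, *_ = np.linalg.lstsq(A, y, rcond=None)  # least-squares fit of the coefficients
print(x)                              # roughly recovers [2, -1, 0, 0.5]
```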
Least Squares (LS) Approximation
We are interested in over-constrained L2 regression problems, n >> d.
Typically, there is no x such that Ax = b.
Want to find the “best” x such that Ax ≈ b.
Ubiquitous in applications & central to theory:
Statistical interpretation: best linear unbiased estimator.
Geometric interpretation: orthogonally project b onto span(A).
Many applications of this!
• Astronomy: Predicting the orbit of the asteroid Ceres (in 1801!).
Gauss (1809) -- see also Legendre (1805) and Adrain (1808).
First application of “least squares optimization” and runs in O(nd²) time!
• Bioinformatics: Dimension reduction for classification of gene expression microarray data.
• Medicine: Inverse treatment planning and fast intensity-modulated radiation therapy.
• Engineering: Finite element methods for solving the Poisson equation, etc.
• Control theory: Optimal design and control theory problems.
• Economics: Restricted maximum-likelihood estimation in econometrics.
• Image Analysis: Array signal and image processing.
• Computer Science: Computer vision, document and information retrieval.
• Internet Analysis: Filtering and de-noising of noisy internet data.
• Data analysis: Fit parameters of a biological, chemical, economic, social, internet, etc. model to experimental data.
Exact solution to LS Approximation
Cholesky Decomposition: If A is full rank and well-conditioned,
decompose A^T A = R^T R, where R is upper triangular, and
solve the normal equations: R^T Rx = A^T b.
QR Decomposition: Slower but numerically stable, esp. if A is rank-deficient.
Write A = QR, and solve Rx = Q^T b.
Singular Value Decomposition: Most expensive, but best if A is very ill-conditioned.
Write A = UΣV^T, in which case x_OPT = A^+ b = VΣ^{-1}U^T b, where A^+ is the pseudoinverse of A and Ax_OPT is the projection of b onto the subspace spanned by the columns of A.
Complexity is O(nd²) for all of these, but constant factors differ.
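A sketch of the three classical O(nd²) solvers named above (Cholesky on the normal equations, QR, SVD), using standard NumPy/SciPy routines; the matrices are random placeholders, and in practice one would simply call np.linalg.lstsq.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

rng = np.random.default_rng(1)
n, d = 500, 10
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

# 1) Normal equations via Cholesky: A^T A x = A^T b
c, low = cho_factor(A.T @ A)
x_chol = cho_solve((c, low), A.T @ b)

# 2) QR: A = QR, solve R x = Q^T b
Q, R = np.linalg.qr(A)
x_qr = np.linalg.solve(R, Q.T @ b)

# 3) SVD: x = V Sigma^{-1} U^T b (pseudoinverse)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((U.T @ b) / s)

print(np.allclose(x_chol, x_qr), np.allclose(x_qr, x_svd))
```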
Reduction of LS Approximation to fast matrix multiplication
Theorem: For any n-by-d matrix A and n-dimensional vector b, where n > d, we can compute x* = A^+ b in O((n/d)·M(d)) time. Here M(d) denotes the time needed to multiply two d-by-d matrices.
• Current best: M(d) = d^ω, ω = 2.376…, Coppersmith–Winograd (1990).
Proof: Ibarra et al. (1982) generalize the LUP decomposition to rectangular matrices.
Direct proof:
• Recall the normal equations A^T Ax = A^T b.
• Group the rows of A into blocks of d and compute A^T A as the sum of n/d square products, each d-by-d.
• Solve the normal equations
• LS in O(nd^{1.376}) time, but impractical. We assume the O(nd²) bound in what follows.
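A sketch of the blocking idea in the direct proof: form A^T A as a sum of n/d products of d-by-d blocks, so a fast d-by-d multiplication routine could be plugged in for each product. Here NumPy's ordinary matmul stands in for the fast routine; sizes are illustrative.

```python
import numpy as np

def gram_blocked(A):
    n, d = A.shape
    G = np.zeros((d, d))
    for start in range(0, n, d):        # n/d blocks of d rows each
        B = A[start:start + d, :]        # d-by-d block (last one may be smaller)
        G += B.T @ B                     # one d-by-d times d-by-d product
    return G

A = np.random.default_rng(2).standard_normal((1000, 8))
print(np.allclose(gram_blocked(A), A.T @ A))
```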
Original (expensive) sampling algorithm
Algorithm
1. Fix a set of probabilities p_i, i = 1,…,n.
2. Pick the i-th row of A and the i-th element of b with probability min{1, r·p_i}, and rescale by (1/min{1, r·p_i})^{1/2}.
3. Solve the induced problem.
Drineas, Mahoney, and Muthukrishnan (SODA, 2006)
• These probabilities p_i are statistical leverage scores!
• These p_i are “expensive” to compute: O(nd²) time!
Theorem: Let x_OPT minimize ||Ax-b||_2 exactly, and let x̃ solve the sampled problem. If the p_i satisfy p_i ≥ β·||U_(i)||_2²/d for some β ∈ (0,1], where U_(i) is the i-th row of the matrix of left singular vectors of A, then w.p. ≥ 1-δ,
||Ax̃ - b||_2 ≤ (1+ε)·||Ax_OPT - b||_2.
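A sketch of this "expensive" scheme (parameters illustrative, not the theorem's constants): compute leverage scores from a thin SVD at O(nd²) cost, keep each row with probability min{1, r·p_i}, and rescale the kept rows.

```python
import numpy as np

def leverage_score_sample(A, b, r, rng):
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    lev = np.sum(U**2, axis=1)               # leverage scores, sum to d
    p = lev / lev.sum()                      # sampling probabilities p_i
    keep_prob = np.minimum(1.0, r * p)
    keep = rng.random(A.shape[0]) < keep_prob
    scale = 1.0 / np.sqrt(keep_prob[keep])   # rescale by (1/min{1, r p_i})^{1/2}
    return A[keep] * scale[:, None], b[keep] * scale

rng = np.random.default_rng(3)
A = rng.standard_normal((5000, 10)); b = rng.standard_normal(5000)
As, bs = leverage_score_sample(A, b, r=400, rng=rng)
x_approx, *_ = np.linalg.lstsq(As, bs, rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(A @ x_approx - b) / np.linalg.norm(A @ x_exact - b))
```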
Fast LS via uniform sampling
Algorithm
1. Pre-process A and b with a “randomized Hadamard transform”.
2. Uniformly sample r = O(d·log(n)·log(d·log(n))/ε) constraints.
3. Solve the induced problem: x̃ = (SHA)^+ SHb.
Drineas, Mahoney, Muthukrishnan, and Sarlos (2007)
• Non-uniform sampling will work, but it is “slow.”
• Uniform sampling is “fast,” but will it work?
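A sketch of the algorithm above: apply a randomized Hadamard transform H_n D_n to (A, b), sample r constraints uniformly, and solve the small problem. For clarity this builds the dense Hadamard matrix; a real implementation would use an O(n log n) fast Walsh–Hadamard transform. The sizes and the choice r below are illustrative.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(4)
n, d, r = 1024, 10, 200                      # n must be a power of 2 here
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)

D = rng.choice([-1.0, 1.0], size=n)          # random signs on the diagonal of D_n
H = hadamard(n) / np.sqrt(n)                 # orthogonal Hadamard matrix H_n
HDA, HDb = H @ (D[:, None] * A), H @ (D * b)

rows = rng.choice(n, size=r, replace=False)  # uniform sampling of constraints
x_fast, *_ = np.linalg.lstsq(HDA[rows], HDb[rows], rcond=None)
x_exact, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.linalg.norm(A @ x_fast - b) / np.linalg.norm(A @ x_exact - b))
```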
A structural lemma
Approximate min_x ||Ax-b||_2 by solving min_x ||X(Ax-b)||_2, where X is any matrix, and call the solution x̃.
Let U_A be the matrix of left singular vectors of A,
and assume that a γ-fraction of the mass of b lies in span(A).
Lemma: Assume that σ_min²(XU_A) ≥ 1/√2 and ||(XU_A)^T Xb^⊥||_2² ≤ ε·OPT²/2, where b^⊥ is the part of b outside span(A) and OPT = min_x ||Ax-b||_2.
Then, we get relative-error approximation: ||Ax̃ - b||_2 ≤ (1+ε)·OPT.
Randomized Hadamard preprocessing
Let Hn be an n-by-n deterministic Hadamard matrix, and let Dn be an n-by-n random diagonal matrix with +1/-1 chosen u.a.r. on the diagonal.
Fact 1: Multiplication by HnDn doesn’t change the solution (since Hn and Dn are orthogonal matrices): min_x ||HnDn(Ax-b)||_2 = min_x ||Ax-b||_2.
Fact 2: Multiplication by HnDn is fast - only O(n·log(r)) time, where r is the number of elements of the output vector we need to “touch”.
Fact 3: Multiplication by HnDn approximately uniformizes all leverage scores: every row of HnDnU_A has squared norm O(d·log(n)/n), w.h.p.
Facts implicit or explicit in Ailon & Chazelle (2006) and Ailon & Liberty (2008).
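A small numerical check of Fact 3 (not from the talk): rows of A with very non-uniform leverage become nearly uniform after multiplication by H_n D_n. The dense Hadamard matrix and the sizes are purely illustrative.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(5)
n, d = 1024, 5
A = rng.standard_normal((n, d))
A[:3] *= 100.0                               # a few very high-leverage rows

def leverage(M):
    U, _, _ = np.linalg.svd(M, full_matrices=False)
    return np.sum(U**2, axis=1)

D = rng.choice([-1.0, 1.0], size=n)
HDA = (hadamard(n) / np.sqrt(n)) @ (D[:, None] * A)

# max leverage score: before, after, and the perfectly uniform value d/n
print(leverage(A).max(), leverage(HDA).max(), d / n)
```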
A matrix perturbation theorem
Establishing structural conditions
Lemma: |1 - σ_i²(SHU_A)| ≤ 1/10 for all i, with constant probability,
i.e., the singular values of SHU_A are close to 1.
Proof: Apply a spectral-norm matrix perturbation bound.
Lemma: ||(SHU_A)^T SHb^⊥||_2² ≤ ε·OPT²/2, with constant probability,
i.e., SHU_A is nearly orthogonal to b^⊥.
Proof: Apply a Frobenius-norm matrix perturbation bound.
Bottom line:
• The fast transform approximately uniformizes the leverage of each data point.
• Uniform sampling probabilities are within an O(log(n)) factor of the optimal ones, so over-sample a bit.
• Overall running time: O(nd·log(…) + d³·log(…)/ε).
Fast LS via sparse projection
Algorithm
1. Pre-process A and b with a randomized Hadamard transform.
2. Multiply the preprocessed input by a sparse random k x n matrix T, whose entries are 0 with probability 1-q and ±1/√(qk) otherwise, where k = O(d/ε) and q = O(d·log²(n)/n + d²·log(n)/n).
3. Solve the induced problem.
Drineas, Mahoney, Muthukrishnan, and Sarlos (2007) - sparse projection matrix from Matousek’s version of Ailon-Chazelle 2006
• Dense projections will work, but they are “slow.”
• Sparse projection is “fast,” but will it work?
-> YES! Sparsity parameter q of T related to non-uniformity of leverage scores!
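A sketch of the sparse projection matrix T from step 2 above, in the style of Matousek's variant of Ailon–Chazelle: each entry is 0 with probability 1-q and ±1/√(qk) otherwise. The parameter values below are illustrative, not the theorem's settings.

```python
import numpy as np

def sparse_projection(k, n, q, rng):
    signs = rng.choice([-1.0, 1.0], size=(k, n))
    mask = rng.random((k, n)) < q            # nonzero with probability q
    return signs * mask / np.sqrt(q * k)     # entries in {0, +-1/sqrt(qk)}

rng = np.random.default_rng(6)
k, n, q = 200, 4000, 0.05
T = sparse_projection(k, n, q, rng)
y = rng.standard_normal(n)
print(np.linalg.norm(T @ y) / np.linalg.norm(y))   # close to 1: norms are preserved in expectation
```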
Lp regression problems
We are interested in over-constrained Lp regression problems, n >> d.
Lp regression problems are convex programs (or better!).
There exist poly-time algorithms.
We want to solve them faster!
Dasgupta, Drineas, Harb, Kumar, and Mahoney (2008)
Lp norms and their unit balls
Recall the Lp norm: for p ∈ [1,∞), ||x||_p = (Σ_i |x_i|^p)^{1/p}, and ||x||_∞ = max_i |x_i|.
Some inequality relationships include: for 1 ≤ p ≤ q ≤ ∞ and x ∈ R^n, ||x||_q ≤ ||x||_p ≤ n^{1/p - 1/q}·||x||_q.
Solution to Lp regression
For p=2, Least-squares approximation (minimize ||Ax-b||_2).
For p=∞, Chebyshev or minimax approximation (minimize ||Ax-b||_∞).
For p=1, Sum of absolute residuals approximation (minimize ||Ax-b||_1).
Lp regression can be cast as a convex program for all p ∈ [1,∞].
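A sketch of the convex-programming view for p=1 (a standard reformulation, not code from the talk): minimize ||Ax-b||_1 as a linear program with auxiliary residual-bound variables t.

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(7)
n, d = 200, 3
A = rng.standard_normal((n, d))
b = A @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

# Variables z = [x (d), t (n)]; minimize sum(t) subject to -t <= Ax - b <= t.
c = np.concatenate([np.zeros(d), np.ones(n)])
A_ub = np.block([[A, -np.eye(n)], [-A, -np.eye(n)]])
b_ub = np.concatenate([b, -b])
bounds = [(None, None)] * d + [(0, None)] * n
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
x_l1 = res.x[:d]                  # the L1-regression coefficients
print(x_l1)
```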
What made the L2 result work?
The L2 sampling algorithm worked because:
• For p=2, an orthogonal basis (from SVD, QR, etc.) is a “good” or “well-conditioned” basis.
(This came for free, since orthogonal bases are the obvious choice.)
• Sampling w.r.t. the “good” basis allowed us to perform “subspace-preserving sampling.”
(This allowed us to preserve the rank of the matrix and distances in its span.)
Can we generalize these two ideas to p ≠ 2?
Sufficient condition
Easy: Approximate min_x ||b-Ax||_p by solving min_x ||S(b-Ax)||_p, giving x'.
Assume that the matrix S is an isometry over the span of A and b: for any vector y in span([A b]) we have
(1-ε)·||y||_p ≤ ||Sy||_p ≤ (1+ε)·||y||_p.
Then, we get relative-error approximation:
||b-Ax'||_p ≤ (1+O(ε))·min_x ||b-Ax||_p.
(This requires O(d/ε²) for L2 regression.)
How to construct an Lp isometry?
p-well-conditioned basis (definition)
Let A be an n x m matrix of rank d << n, let p ∈ [1,∞), and let q be its dual (1/p + 1/q = 1).
Definition: An n x d matrix U is an (α,β,p)-well-conditioned basis for span(A) if:
(1) |||U|||_p ≤ α, where |||U|||_p = (Σ_ij |U_ij|^p)^{1/p}, and
(2) for all z in R^d, ||z||_q ≤ β·||Uz||_p.
U is a p-well-conditioned basis if α, β = d^{O(1)}, independent of m and n.
p-well-conditioned basis (existence)
Let A be an n x m matrix of rank d << n, let p ∈ [1,∞), and let q be its dual.
Theorem: There exists an (α,β,p)-well-conditioned basis U for span(A) s.t.:
if p < 2, then α = d^{1/p+1/2} and β = 1,
if p = 2, then α = d^{1/2} and β = 1,
if p > 2, then α = d^{1/p+1/2} and β = d^{1/q-1/2}.
U can be computed in O(nmd + nd⁵·log n) time (or just O(nmd) if p = 2).
p-well-conditioned basis (construction)
Algorithm:
• Let A=QR be any QR decomposition of A.
(Stop if p=2.)
• Define the norm on R^d by ||z||_{Q,p} ≡ ||Qz||_p.
• Let C be the unit ball of the norm ||·||_{Q,p}.
• Let the d x d matrix F define the Löwner-John ellipsoid of C.
• Decompose F = G^T G,
where G is full rank and upper triangular.
• Return U = QG^{-1}
as the p-well-conditioned basis.
Subspace-preserving sampling
Let A be an n x m matrix of rank d << n, and let p ∈ [1,∞).
Let U be an (α,β,p)-well-conditioned basis for span(A).
Theorem: Randomly sample rows of A, keeping row i with probability
p_i ≥ min{1, r·||U_(i)||_p^p / |||U|||_p^p},
where the number of sampled rows r is poly(d, 1/ε, log(1/δ)) (depending on α, β, and p). Then, with probability 1-δ, the following holds for all x in R^m:
(1-ε)·||Ax||_p ≤ ||SAx||_p ≤ (1+ε)·||Ax||_p,
where S denotes the sampling-and-rescaling operation.
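A sketch of the sampling step only (not the full algorithm): given a basis U for span(A), keep row i with probability proportional to ||U_(i)||_p^p and rescale so the Lp objective stays unbiased. For simplicity the QR factor stands in for a genuinely p-well-conditioned basis, and the parameters are illustrative.

```python
import numpy as np

def lp_row_sample(A, b, U, p, r, rng):
    row_norms = np.sum(np.abs(U)**p, axis=1)         # ||U_(i)||_p^p
    prob = np.minimum(1.0, r * row_norms / row_norms.sum())
    keep = rng.random(A.shape[0]) < prob
    scale = (1.0 / prob[keep])**(1.0 / p)            # rescaling for the Lp objective
    return A[keep] * scale[:, None], b[keep] * scale

rng = np.random.default_rng(8)
A = rng.standard_normal((5000, 5)); b = rng.standard_normal(5000)
Q, _ = np.linalg.qr(A)                               # stand-in basis for span(A)
As, bs = lp_row_sample(A, b, Q, p=1, r=500, rng=rng)
print(As.shape)                                      # much fewer rows than A
```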
Regularized regression
• If you don’t trust your data (entirely).
• Minimize ||b-Ax||_p + λ·||x||_r, where r is arbitrary.
• Lasso is especially popular: min_x ||b-Ax||_2 + λ·||x||_1.
• Sampled Lp regression works for the regularized version too
• We want more: imagine
Ridge regression
• Tikhonov regularization, back to L2.
• Minimize ||b-Ax||_2² + λ²·||x||_2².
• Equivalent to augmenting the n-by-d matrix A with λ·I_d and b with d zeros.
• Write the SVD of A as U·Diag(√λ_i)·V^T.
• Let D = Σ_i √λ_i << d.
• For (1+ε) relative error:
• precondition with a randomized FFT,
• sample r = O(D·log(n)·log(D·log(n))/ε) rows uniformly.
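A sketch of the augmentation trick above (a standard identity, with an illustrative λ): ridge regression min ||b-Ax||_2² + λ²||x||_2² equals ordinary least squares on A stacked with λ·I_d and b padded with d zeros.

```python
import numpy as np

rng = np.random.default_rng(9)
n, d, lam = 2000, 10, 0.5
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)

A_aug = np.vstack([A, lam * np.eye(d)])      # append lam * I_d below A
b_aug = np.concatenate([b, np.zeros(d)])     # append d zeros below b
x_ridge, *_ = np.linalg.lstsq(A_aug, b_aug, rcond=None)

# Check against the closed form (A^T A + lam^2 I)^{-1} A^T b.
x_closed = np.linalg.solve(A.T @ A + lam**2 * np.eye(d), A.T @ b)
print(np.allclose(x_ridge, x_closed))
```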
SVD of a matrix
U (V): orthogonal matrix containing the left (right) singular vectors of A.
Σ: diagonal matrix containing the singular values of A, ordered non-increasingly.
ρ: rank of A, the number of non-zero singular values.
Exact computation of the SVD takes O(min{mn², m²n}) time.
Any m x n matrix A can be decomposed as A = UΣV^T.
SVD and low-rank approximations
U_k (V_k): orthogonal matrix containing the top k left (right) singular vectors of A.
Σ_k: diagonal matrix containing the top k singular values of A.
• A_k is the “best” matrix among all rank-k matrices with respect to the spectral and Frobenius norms.
• This optimality property is very useful in, e.g., Principal Component Analysis (PCA), LSI, etc.
Truncate the SVD by keeping the top k ≤ ρ terms: A_k = U_k·Σ_k·V_k^T.
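A sketch of the truncation above (sizes illustrative): keep the top k singular triplets to form A_k; its spectral-norm error is exactly the (k+1)-th singular value.

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((300, 80))

k = 10
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

print(np.linalg.norm(A - A_k, 2), s[k])       # spectral-norm error equals sigma_{k+1}
```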
Approximate SVD
• Starting with the seminal work of Frieze et al. (1998) and Papadimitriou et al. (1998), until 2005 several results of the form
||A-A_k'||_F ≤ ||A-A_k||_F + ε·||A||_F.
• ||A||_F might be significantly larger than ||A-A_k||_F.
• Four independent results [H-P, DV, MRT, S 06] on
||A-A_k'||_F ≤ (1+ε)·||A-A_k||_F.
• This year: Shyamalkumar and Deshpande …
Relative error SVD via random projections
Theorem: Let S be a random matrix with ±1 entries and r = O(k/ε) rows. Project A onto the subspace spanned by the rows of SA and then compute A_k' as the rank-k SVD of the projection. Then ||A-A_k'||_F ≤ (1+ε)·||A-A_k||_F with probability at least 1/3.
• By keeping an orthonormal basis of SA, the total time is O(M·r + (n+m)·r²), where M is the number of non-zeros in A.
• We need to make only 2 passes over the data.
Can use fast random projections and avoid the projection onto SA; spectral-norm bounds as well (talks by R and T).
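A sketch of the theorem's two-pass algorithm: project A onto the row space of SA, with S a random ±1 sign matrix with r rows, and take the rank-k SVD of the projection. The value of r below is an illustrative choice, not a tuned constant.

```python
import numpy as np

def randomized_low_rank(A, k, r, rng):
    S = rng.choice([-1.0, 1.0], size=(r, A.shape[0]))   # random +-1 matrix, pass 1 computes SA
    Q, _ = np.linalg.qr((S @ A).T)                       # orthonormal basis of rowspace(SA)
    P = A @ Q                                            # pass 2: project A onto that subspace
    U, s, Wt = np.linalg.svd(P, full_matrices=False)
    return U[:, :k] @ np.diag(s[:k]) @ (Wt[:k, :] @ Q.T) # rank-k SVD of the projection

rng = np.random.default_rng(11)
A = rng.standard_normal((1000, 200)) @ rng.standard_normal((200, 300))
k, r = 10, 50
A_k_approx = randomized_low_rank(A, k, r, rng)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
print(np.linalg.norm(A - A_k_approx, 'fro') / np.linalg.norm(A - A_k, 'fro'))
```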
SVD proof sketch
Can show: ||A-A_k'||_F² ≤ ||A-A_k||_F² + ||A_k - A_k·(SA_k)^+·(SA)||_F².
For each column j of A, consider the regression A^(j) ≈ A_k·x:
• Exact solution: A_k^(j), with error ||A^(j) - A_k^(j)||_2.
• Approximate solution: A_k·(SA_k)^+·S·A^(j).
Conclusion
Fast methods based on randomized preprocessing and sampling or sparse projections for
• Least squares approximation
• Regularized regression
• Low rank matrix approximation
Thank you!