Making Large-Scale Nyström Approximation Possible

Mu Li [email protected]

Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China

James T. Kwok [email protected]

Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong

Bao-Liang Lu [email protected]

Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
MOE-MS Key Lab. for Intel. Comp. and Intel. Sys., Shanghai Jiao Tong University, Shanghai 200240, China

Abstract

The Nyström method is an efficient technique for the eigenvalue decomposition of large kernel matrices. However, in order to ensure an accurate approximation, a sufficiently large number of columns has to be sampled. On very large data sets, the SVD step on the resultant data submatrix will soon dominate the computations and become prohibitive. In this paper, we propose an accurate and scalable Nyström scheme that first samples a large column subset from the input matrix, but then only performs an approximate SVD on the inner submatrix by using the recent randomized low-rank matrix approximation algorithms. Theoretical analysis shows that the proposed algorithm is as accurate as the standard Nyström method that directly performs a large SVD on the inner submatrix. On the other hand, its time complexity is only as low as performing a small SVD. Experiments are performed on a number of large-scale data sets for low-rank approximation and spectral embedding. In particular, spectral embedding of an MNIST data set with 3.3 million examples takes less than an hour on a standard PC with 4G memory.

1. Introduction

Eigenvalue decomposition is of central importance in science and engineering, and has numerous applications in diverse areas such as physics, statistics, signal processing, machine learning and data mining. In machine learning, for example, eigenvalue decomposition is used in kernel principal component analysis and kernel Fisher discriminant analysis for the extraction of nonlinear structures and decision boundaries from the kernel matrix. The eigenvectors of the kernel or affinity matrix are also used in many spectral clustering (von Luxburg, 2007) and manifold learning algorithms (Belkin & Niyogi, 2002; Tenenbaum et al., 2000) for the discovery of the intrinsic clustering structure or low-dimensional manifolds.

Appearing in Proceedings of the 27th International Conference on Machine Learning, Haifa, Israel, 2010. Copyright 2010 by the author(s)/owner(s).

However, standard algorithms for computing the eigenvalue decomposition of a dense n×n matrix take O(n³) time, which can be prohibitive for large data sets. Alternatively, when only a few leading (or trailing) eigenvalues/eigenvectors are needed, one may perform a partial singular value decomposition (SVD) using the Arnoldi method (Lehoucq et al., 1998). However, empirically, the time reduction is significant only when the matrix is sparse or very few eigenvectors are extracted (Williams & Seeger, 2001).

A more general approach to alleviate this problem is by using low-rank matrix approximations, of which the Nyström method (Drineas & Mahoney, 2005; Fowlkes et al., 2004; Williams & Seeger, 2001) is the most popular. It selects a subset of m ≪ n columns from the kernel matrix, and then uses the correlations between the sampled columns and the remaining columns to form a low-rank approximation of the full matrix. Computationally, it only has to decompose the much smaller m×m matrix (denoted W). Obviously, the more columns are sampled, the more accurate is the resultant approximation. However, there is a trade-off between accuracy and efficiency. In particular, on very large data sets, even decomposing the small W matrix can be expensive. For example, when the data set has several million examples, sampling only 1% of the columns will lead to a W that is larger than 10,000 × 10,000.

To avoid this explosion of m, Kumar et al. (2010) recently proposed the use of an ensemble of n_e Nyström approximators. Each approximator, or expert, performs a standard Nyström approximation with a manageable column subset. Since the sampling of columns is stochastic, a number of such experts can be run and the resultant approximations are then linearly combined together. Empirically, the resultant approximation is more accurate than that of a single expert as in standard Nyström. Moreover, its computational cost is (roughly) only n_e times the cost of standard Nyström. However, as will be shown in Section 3, it is essentially using a block diagonal matrix to approximate the inverse of a very large W. Since the inverse of a block diagonal matrix is another block diagonal matrix, this approximation can be poor unless W is close to block diagonal. However, this is highly unlikely in typical applications of the Nyström method.

Recently, a new class of randomized algorithms has been proposed for constructing approximate, low-rank matrix decompositions (Halko et al., 2009). It also extends the Monte Carlo algorithms in (Drineas et al., 2006), on which the analysis of the Nyström method in (Drineas & Mahoney, 2005) is based. Unlike the standard Nyström method, which simply samples a column subset for approximation, it first constructs a low-dimensional subspace that captures the action of the input matrix. Then, a standard factorization is performed on the matrix restricted to that subspace. Though being a randomized algorithm, it is shown that this can yield an accurate approximation with very high probability. On the other hand, the algorithm needs to have at least one pass over the whole input matrix. This is thus more expensive than the Nyström method (and its ensemble variant), which only accesses a column subset. On very large data sets, this performance difference can be significant.

In this paper, we combine the merits of the standard Nyström method and the randomized SVD algorithm. The standard Nyström method is highly efficient but requires a large enough number of columns to be sampled, while the randomized SVD algorithm is highly accurate but less efficient. Motivated by the observation that the ensemble Nyström algorithm is essentially using a block diagonal matrix approximation for W^+, we will adopt a large column subset and then speed up the inner SVD step by randomized SVD. Both theoretical analysis and experimental results confirm that the error in the randomized SVD step is more than compensated for by the ability to use a large column subset, leading to an efficient and accurate eigenvalue decomposition even for very large input matrices. Moreover, unlike the ensemble Nyström method, which resorts to a learner and needs to attend to the consequent model selection issues, the proposed method is very easy to implement and can be used to obtain approximate eigenvectors.

The rest of this paper is organized as follows. Section 2 gives a short introduction to the standard/ensemble Nyström method and the randomized SVD algorithm. Section 3 then describes the proposed algorithm. Experimental results are presented in Section 4, and the last section gives some concluding remarks.

Notations. The transpose of a vector/matrix is denoted by the superscript T. Moreover, Tr(A) denotes the trace of matrix A = [A_{ij}], A^+ is its pseudo-inverse, ran(A) is the range of A, ‖A‖₂ = max{√λ : λ is an eigenvalue of AᵀA} is its spectral norm, ‖A‖_F = √Tr(AᵀA) is its Frobenius norm, and σ_i(A) denotes the i-th largest singular value of A.

2. Related Works

2.1. Nyström Method

The Nyström method approximates a symmetric positive semidefinite (psd) matrix G ∈ R^{n×n} by a sample C of m ≪ n columns from G. Typically, this subset of columns is randomly selected by uniform sampling without replacement (Williams & Seeger, 2001; Kumar et al., 2009). Recently, more sophisticated non-uniform sampling schemes have also been pursued (Drineas & Mahoney, 2005; Zhang et al., 2008).

After selecting C, the rows and columns of G can be rearranged such that C and G are written as:

$$C = \begin{bmatrix} W \\ S \end{bmatrix} \quad \text{and} \quad G = \begin{bmatrix} W & S^T \\ S & B \end{bmatrix}, \qquad (1)$$

where W ∈ R^{m×m}, S ∈ R^{(n−m)×m} and B ∈ R^{(n−m)×(n−m)}. Assume that the SVD of W is UΛUᵀ, where U is an orthonormal matrix and Λ = diag(σ₁, ..., σ_m) is the diagonal matrix containing the singular values of W in non-increasing order. For k ≤ m, the rank-k Nyström approximation is

$$\tilde G_k = C W_k^+ C^T, \qquad (2)$$

where $W_k^+ = \sum_{i=1}^{k} \sigma_i^{-1} U^{(i)} U^{(i)T}$, and U^{(i)} is the i-th column of U. The time complexity is O(nmk + m³). Since m ≪ n, this is much lower than the O(n³) complexity required by a direct SVD on G.
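As a concrete illustration of (1)-(2), here is a minimal NumPy sketch. The function name and the dense G argument are illustrative assumptions; on genuinely large problems one would evaluate only the m sampled kernel columns instead of forming G.

```python
import numpy as np

def nystrom_rank_k(G, m, k, rng=None):
    """Sketch of the rank-k Nystrom approximation (1)-(2).

    G : n x n psd kernel matrix (dense here only for illustration).
    m : number of sampled columns (k <= m << n).
    Returns (C, Wk_pinv) so that G is approximated by C @ Wk_pinv @ C.T.
    """
    rng = np.random.default_rng(rng)
    n = G.shape[0]
    idx = rng.choice(n, size=m, replace=False)   # uniform sampling without replacement
    C = G[:, idx]                                # n x m sampled columns
    W = C[idx, :]                                # m x m inner submatrix
    U, sigma, _ = np.linalg.svd(W)               # SVD of the symmetric psd W
    Uk, sk = U[:, :k], sigma[:k]
    sk = np.maximum(sk, 1e-12)                   # guard against (near-)zero singular values
    Wk_pinv = (Uk / sk) @ Uk.T                   # W_k^+ = sum_{i<=k} sigma_i^{-1} u_i u_i^T
    return C, Wk_pinv

# Usage: the n x n approximation C @ Wk_pinv @ C.T is kept in factored form for large n.
```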


2.2. Ensemble Nyström Algorithm

Since the Nyström method relies on random sampling of columns, it is stochastic in nature. The ensemble Nyström method (Kumar et al., 2010) employs an ensemble of n_e ≥ 1 Nyström approximators for improved performance. It first samples m·n_e columns from G, which can be written as C = [C₁, ..., C_{n_e}] ∈ R^{n×m n_e} with each C_i ∈ R^{n×m}. The standard Nyström method is then performed on C_i, obtaining a rank-k approximation \tilde G_{i,k} (i = 1, ..., n_e). Finally, these are weighted to form the ensemble approximation

$$\tilde G^{ens} = \sum_{i=1}^{n_e} \mu_i \tilde G_{i,k}, \qquad (3)$$

where the μ_i's are the mixture weights. A number of choices have been used in setting these weights, including uniform weights, exponential weights, and weights obtained by ridge regression. Empirically, the best method is ridge regression. This, however, needs to sample an additional s columns from G as the training set, and another s′ columns as the hold-out set for model selection. The total time complexity is O(n_e nmk + n_e m³ + C_μ), where C_μ is the cost of computing the mixture weights.
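For concreteness, the following NumPy sketch builds the ensemble approximation (3) with uniform weights (the ridge-regression weighting described above is omitted; function name and dense G are illustrative assumptions).

```python
import numpy as np

def ensemble_nystrom_uniform(G, m, k, n_e, rng=None):
    """Sketch of the ensemble Nystrom approximation (3) with uniform weights.

    Samples m*n_e columns, gives m of them to each of the n_e experts, and
    averages the resulting rank-k Nystrom approximations.
    """
    rng = np.random.default_rng(rng)
    n = G.shape[0]
    idx = rng.choice(n, size=m * n_e, replace=False)
    factors = []
    for i in range(n_e):
        cols = idx[i * m:(i + 1) * m]
        C_i = G[:, cols]
        W_i = C_i[cols, :]
        U, s, _ = np.linalg.svd(W_i)
        s = np.maximum(s[:k], 1e-12)
        Wk_pinv = (U[:, :k] / s) @ U[:, :k].T
        factors.append((C_i, Wk_pinv))
    mu = np.full(n_e, 1.0 / n_e)   # uniform mixture weights
    # G ~ sum_i mu_i * C_i @ Wk_pinv_i @ C_i.T; the factors are returned instead of the n x n matrix.
    return mu, factors
```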

Another disadvantage of the ensemble Nyström method is that, unlike the standard Nyström method, approximate eigenvectors of G cannot be easily obtained. As can be seen from (3), the eigenvectors of each of the \tilde G_{i,k}'s are in general different and so cannot be easily combined together. Hence, the ensemble Nyström method cannot be used with spectral clustering and manifold learning algorithms.

2.3. Randomized Low-Rank Approximation

Recently, a class of simple but highly efficient randomized algorithms has been proposed for constructing approximate, low-rank matrix decompositions (Halko et al., 2009). In general, they can be used on complex-valued rectangular matrices. In the following, we focus on obtaining a rank-k SVD from a symmetric matrix W ∈ R^{m×m} (Algorithm 1).

In general, there are two computational stages in this class of algorithms. In the first stage (steps 1 to 3), an orthonormal matrix Q ∈ R^{m×(k+p)} is constructed which serves as an approximate, low-dimensional basis for the range of W (i.e., W ≈ QQᵀW). Here, p is an over-sampling parameter (typically set to 5 or 10) such that the rank of Q is slightly larger than the desired rank k, and q is the number of steps of a power iteration (typically set to 1 or 2) which is used to speed up the decay of the singular values of W. In the second stage (steps 4 to 6), the input matrix is restricted to the above subspace and a standard SVD is then computed on the reduced matrix

$$B = Q^T W Q \qquad (4)$$

to obtain B = VΛVᵀ. Finally, the SVD of W can be approximated as W ≈ UΛUᵀ, where U = QV.

Algorithm 1 Randomized SVD (Halko et al., 2009).
Input: m×m symmetric matrix W, scalars k, p, q.
Output: U, Λ.
1: Ω ← an m × (k + p) standard Gaussian random matrix.
2: Z ← WΩ, Y ← W^{q−1}Z.
3: Find an orthonormal matrix Q (e.g., by QR decomposition) such that Y = QQᵀY.
4: Solve B(QᵀΩ) = QᵀZ.
5: Perform SVD on B to obtain VΛVᵀ = B.
6: U ← QV.

Computationally, it takes O(m²k) time¹ to compute Z and Y, O(mk) time for the QR decomposition, O(mk²) time to obtain B, and O(k³) time for the SVD. Hence, the total time complexity is O(m²k + k³), which is quadratic in m. Moreover, it needs to have at least one pass over the whole input matrix.

¹ Here, we compute Y by multiplying W with a sequence of m×(k+p) matrices, as WZ, W(WZ), ..., W(W^{q−2}Z).
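A minimal NumPy sketch of Algorithm 1 is given below. Since B is symmetric, an eigendecomposition is used in place of an explicit SVD in step 5; the function name and default parameters are illustrative.

```python
import numpy as np

def randomized_svd_sym(W, k, p=5, q=2, rng=None):
    """Sketch of Algorithm 1: randomized SVD of a symmetric m x m matrix W."""
    rng = np.random.default_rng(rng)
    m = W.shape[0]
    Omega = rng.standard_normal((m, k + p))          # step 1: Gaussian test matrix
    Z = W @ Omega                                    # step 2: Z = W * Omega
    Y = Z
    for _ in range(q - 1):                           #         Y = W^{q-1} Z (power iteration)
        Y = W @ Y
    Q, _ = np.linalg.qr(Y)                           # step 3: orthonormal basis with Y = Q Q^T Y
    # Step 4: solve B (Q^T Omega) = Q^T Z, so that B ~ Q^T W Q without touching W again.
    B = np.linalg.solve((Q.T @ Omega).T, (Q.T @ Z).T).T
    B = (B + B.T) / 2                                # symmetrize against round-off
    lam, V = np.linalg.eigh(B)                       # step 5: B = V Lambda V^T
    order = np.argsort(lam)[::-1][:k]                # keep the k leading eigenpairs
    U = Q @ V[:, order]                              # step 6: U = Q V
    return U, lam[order]
```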

3. Algorithm

3.1. Combining Nyström and Randomized SVD

Obviously, the more columns are sampled, the more accurate is the Nyström approximation. Hence, the ensemble Nyström method samples m·n_e columns instead of m columns. In the following, we abuse notation and denote the corresponding W matrix by W_{(n_e m)} ∈ R^{m n_e × m n_e}. However, there is a trade-off between accuracy and efficiency. If the standard Nyström method were used, this would have taken O(n_e³m³) time for the SVD of W_{(n_e m)}. The ensemble Nyström method alleviates this problem by replacing this expensive SVD by n_e SVDs on n_e smaller m×m matrices. Our key observation is that, by using (2), the ensemble Nyström approximation in (3) can be rewritten as

$$\tilde G^{ens} = C\,\mathrm{diag}\big(\mu_1 W_{1,k}^{+},\ldots,\mu_{n_e} W_{n_e,k}^{+}\big)\,C^T, \qquad (5)$$

where W_{i,k} ∈ R^{m×m} is the W matrix in (1) corresponding to \tilde G_{i,k}, and diag(μ₁W^+_{1,k}, ..., μ_{n_e}W^+_{n_e,k}) is the block diagonal matrix with the blocks μ₁W^+_{1,k}, ..., μ_{n_e}W^+_{n_e,k} along its diagonal.
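Spelling out this rewriting: since each \tilde G_{i,k} = C_i W_{i,k}^+ C_i^T by (2) and C = [C₁, ..., C_{n_e}],

$$\tilde G^{ens} = \sum_{i=1}^{n_e}\mu_i\, C_i W_{i,k}^{+} C_i^T = [C_1,\ldots,C_{n_e}]\,\mathrm{diag}\big(\mu_1 W_{1,k}^{+},\ldots,\mu_{n_e} W_{n_e,k}^{+}\big)\,[C_1,\ldots,C_{n_e}]^T = C\,\mathrm{diag}\big(\mu_1 W_{1,k}^{+},\ldots,\mu_{n_e} W_{n_e,k}^{+}\big)\,C^T,$$

which is exactly (5).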

In other words, the ensemble Nyström algorithm can be equivalently viewed as approximating W^+_{(n_e m)} by the block diagonal matrix diag(μ₁W^+_{1,k}, ..., μ_{n_e}W^+_{n_e,k}). Despite the resultant computational simplicity, the inverse of a block diagonal matrix is another block diagonal matrix. Hence, no matter how sophisticated the estimation of the mixture weights μ_i is, this block diagonal approximation is rarely valid unless W_{(n_e m)} is block diagonal. This, however, is highly unlikely in typical applications of the Nyström method.

Since the ensemble Nyström method attains better performance by sampling more columns, our method will also sample more columns, or, equivalently, use an m larger than is typically used in the standard Nyström method. However, instead of using a block diagonal matrix approximation for solving the subsequent large-SVD problem, we will use a more accurate procedure. In particular, we will adopt the randomized low-rank matrix approximation technique introduced in Section 2.3.

Algorithm 2 The proposed algorithm.
Input: psd matrix G ∈ R^{n×n}, number of columns m, rank k, over-sampling parameter p, power parameter q.
Output: \tilde G, an approximation of G.
1: C ← m columns of G sampled uniformly at random without replacement.
2: W ← the m×m matrix defined in (1).
3: [U, Λ] ← randsvd(W, k, p, q) using Algorithm 1.
4: \tilde U ← CUΛ^+.
5: \tilde G ← (√(m/n) \tilde U)((n/m)Λ)(√(m/n) \tilde U)ᵀ.

The proposed algorithm is shown in Algorithm 2. Essentially, it combines the high efficiency of the Nyström method, which however requires a large enough column subset for accurate approximation, with the ability of the randomized algorithm to produce a very accurate yet still relatively efficient SVD approximation. Note from step 5 that \tilde G = CUΛ^+UᵀCᵀ. In turn, from Algorithm 1 and (4), UΛUᵀ = QBQᵀ = Q(QᵀWQ)Qᵀ. Hence, instead of relying on the block diagonal matrix approximation in (5), G is now more accurately approximated as

$$\tilde G = CQ(Q^TWQ)^+Q^TC^T. \qquad (6)$$
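Assuming the randomized_svd_sym sketch from Section 2.3, a minimal NumPy rendering of Algorithm 2 might look as follows. The function name and the dense G argument are illustrative; in practice only the m sampled columns of G need to be formed.

```python
import numpy as np

def nystrom_randsvd(G, m, k, p=5, q=2, rng=None):
    """Sketch of Algorithm 2: large-column-subset Nystrom with an inner randomized SVD.

    Returns (U_scaled, lam_scaled) so that G ~ U_scaled @ diag(lam_scaled) @ U_scaled.T,
    which equals C U Lambda^+ U^T C^T after the scalings in step 5 cancel.
    Relies on randomized_svd_sym from the earlier sketch.
    """
    rng = np.random.default_rng(rng)
    n = G.shape[0]
    idx = rng.choice(n, size=m, replace=False)        # step 1: sample m columns
    C = G[:, idx]                                     # n x m
    W = C[idx, :]                                     # step 2: inner m x m submatrix
    U, lam = randomized_svd_sym(W, k, p, q, rng)      # step 3: approximate rank-k SVD of W
    U_tilde = C @ (U / lam)                           # step 4: U_tilde = C U Lambda^+
    # Step 5: G ~ (sqrt(m/n) U_tilde) ((n/m) Lambda) (sqrt(m/n) U_tilde)^T.
    U_scaled = np.sqrt(m / n) * U_tilde
    lam_scaled = (n / m) * lam
    return U_scaled, lam_scaled
```

The returned pair plays the role of approximate eigenvectors/eigenvalues of G, which is what makes the scheme directly usable for spectral embedding later in the paper.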

Besides, instead of using the randomized SVD algorithm for the inner SVD, one might want to apply other approximations, such as using the standard Nyström method again. However, the Nyström method is not good at approximating the trailing eigenvalues, which are important in computing the inverse of W. Preliminary experiments show that in order for the resultant approximation of G to be accurate, the inner Nyström needs to sample close to m columns, which, however, will lead to little speedup over a standard SVD. Moreover, recall from Section 2.1 that there are different column sampling strategies. Here, we will focus on uniform sampling without replacement (Kumar et al., 2009). Extension to other sampling schemes will be studied in the future.

Table 1. Time complexities for the various methods to obtain a rank-k Nyström approximation of an n×n matrix. Here, m is the number of columns sampled.

method               time complexity
Nyström              O(nmk + m³)
ensemble Nyström     O(nmk + n_e k³ + C_μ)
randomized SVD       O(n²k + k³)
proposed method      O(nmk + k³)

The time complexity required is O(nmk + k³). A summary of the time complexities² of the various methods is shown in Table 1. Recall that typically n ≫ m ≥ k. As can be seen, all the methods except randomized SVD scale linearly with n. Moreover, the proposed method has a complexity comparable to the ensemble Nyström method, as both scale cubically only with k, but not with m.

² In order to be consistent with the other methods, the total number of columns sampled in the ensemble Nyström method is now m (not n_e m as in Section 2.2).

3.2. Error Analysis

Let the column sampling matrix be S ∈ {0,1}^{n×m}, where S_{ij} = 1 if the i-th column of G is chosen in the j-th random trial, and S_{ij} = 0 otherwise. Then, C = GS and W = SᵀGS. Moreover, since G is psd, we can write it as

$$G = X^T X, \qquad (7)$$

for some X ∈ R^{d×n}. In the proof, we will also need the column-sampled and rescaled version of X:

$$H = \kappa X S, \qquad (8)$$

where κ = √(n/m) is the scaling factor. Then,

$$C = \kappa^{-1} X^T H, \qquad W = \kappa^{-2} H^T H. \qquad (9)$$
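For completeness, (9) follows directly from (7) and (8):

$$C = GS = X^TXS = X^T(\kappa^{-1}H) = \kappa^{-1}X^TH, \qquad W = S^TGS = (XS)^T(XS) = (\kappa^{-1}H)^T(\kappa^{-1}H) = \kappa^{-2}H^TH.$$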

The error analysis will depend on a number of results in (Halko et al., 2009; Kumar et al., 2009; Stewart, 1990). For the readers' convenience, these are listed in the appendix.

3.2.1. Spectral Norm

For the matrix W in step 3, we will first compute its (expected) approximation error E‖W − QQᵀW‖₂. Since the input matrix G is psd, W is also psd. The following proposition can then be obtained by using a more general result in (Halko et al., 2009).

Proposition 1. Given a psd matrix W, the Q obtained in Algorithm 1 satisfies

$$\mathrm{E}\,\|W - QQ^TW\|_2 \le \zeta^{1/q}\,\sigma_{k+1}(W), \qquad (10)$$

where $\zeta = 1 + \sqrt{\frac{k}{p-1}} + \frac{e\sqrt{k+p}}{p}\sqrt{m-k}$.

The main theorem is stated below.

Theorem 1. For the \tilde G obtained in Algorithm 2,

$$\mathrm{E}\,\|G - \tilde G\|_2 \le \zeta^{1/q}\|G - G_k\|_2 + \big(1 + \zeta^{1/q}\big)\frac{n}{\sqrt{m}}\,G^*_{ii}, \qquad (11)$$

where G_k is the best rank-k approximation of G, G^*_{ii} = max_i G_{ii}, and $\zeta = 1 + \sqrt{\frac{k}{p-1}} + \frac{e\sqrt{k+p}}{p}\sqrt{m-k}$.

As in (Halko et al., 2009), the power iteration drives ζ^{1/q} towards 1 exponentially fast as q increases, and so the error in (11) decreases with the number of sampled columns m. In particular, if we replace ζ^{1/q} by 1, then (11) becomes ‖G − G_k‖₂ + (2n/√m) G^*_{ii}, which is the same³ as that for the standard Nyström method using m columns. In other words, Algorithm 2 is as accurate as performing a large SVD in the standard Nyström method.

³ This bound can be obtained in (Kumar et al., 2010) by combining their (6) and (10).

3.2.2. Frobenius Norm

A similar bound can be obtained for the approximation error in terms of the Frobenius norm. Since there is no analogous theory for power iteration w.r.t. the Frobenius norm (cf. Remark 10.1 of (Halko et al., 2009)), the analysis here is restricted to q = 1 and thus the resultant bound is quite loose. However, as will be seen in Section 4, empirically the approximation with just q = 2 is already very good.

Theorem 2. For the \tilde G obtained in Algorithm 2,

$$\mathrm{E}\,\|G - \tilde G\|_F \le \frac{2(k+p)}{\sqrt{p-1}}\,\|G - G_k\|_F + \left(1 + \frac{4(k+p)}{\sqrt{m(p-1)}}\right) n\,G^*_{ii}.$$

Table 2. Data sets used.

                     data        #samples     dim
low-rank approx.     Satimage    4,435        36
                     RCV1        23,149       47,236
                     MNIST       60,000       784
                     Covtype     581,012      54
embedding            MNIST-8M    3,276,294    784

4. Experiments

In this section, we study the efficiency of the proposed method in solving large dense eigen-systems. Experiments are performed on low-rank approximation (Section 4.1) and spectral embedding (Section 4.2). All the implementations are in Matlab. Experiments are run on a PC with a 2.4GHz Core2 Duo CPU and 4G memory.

4.1. Low-rank Approximation

We use a number of data sets from the LIBSVM archive⁴ (Table 2). The linear kernel is used on the RCV1 text data set, while the Gaussian kernel is used on all the others. The following methods are compared in the experiments:

⁴ http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/

1. Standard Nyström method (denoted nys);

2. Ensemble Nyström method (denoted ens): As in (Kumar et al., 2010), an additional s = 20 columns are used for training the mixture weights by ridge regression, and another s′ = 20 columns are used for choosing the regularization parameters. For the covtype data set, s and s′ are reduced to 2 so as to speed up computation. Moreover, we set n_e = m/k.

3. The proposed method (denoted our): We fix the over-sampling parameter p to 5, and the power parameter q to 2.

4. Randomized SVD (denoted r-svd): Similar to the proposed method, we also use p = 5 and q = 2.

The first three methods are based on the Nyström method, and m columns are uniformly sampled without replacement. Due to randomness in the sampling process, we perform 10 repetitions and report the averaged result. On the other hand, the randomized SVD algorithm does not perform sampling and the whole input matrix is always used. Besides, the best rank-k approximation could have been obtained by a direct SVD on the whole input matrix. However, this is computationally expensive even on medium-sized data sets and so is not compared here.

Figure 1. Performance of the various methods on satimage, rcv1, mnist and covtype (x-axis: number of sampled columns m, ×10³; curves: our, nys, ens, r-svd). Top: low-rank approximation error; bottom: CPU time (sec.). (The randomized SVD algorithm cannot be run on the covtype data set because it is too large.)

4.1.1. Different Numbers of Columns

In the first experiment, we fix k = 600 and gradually increase the number of sampled columns (m). Figure 1 shows the relative approximation error⁵ ‖G − \tilde G‖_F / ‖G‖_F and the CPU time. As can be seen, the randomized SVD algorithm is often the most accurate, albeit also the most expensive. Standard Nyström can be as accurate as randomized SVD when m is large enough. However, since Nyström takes O(m³) time for the SVD step, it also quickly becomes computationally infeasible. As for the ensemble Nyström method, it degenerates to the standard Nyström when n_e = 1 (the left endpoint of the curve). Its approximation error decreases when the ensemble has more experts⁶, which is consistent with the results in (Kumar et al., 2010). However, as discussed in Section 3.1, the ensemble Nyström method approximates the large SVD problem with a crude block diagonal matrix approximation. Hence, its accuracy is much inferior to that of the standard Nyström (which performs the large SVD directly). On the other hand, the proposed method is almost as accurate as standard Nyström, while its CPU time is comparable to or even smaller than that of the ensemble Nyström method.

⁵ Results are only reported for the Frobenius norm because the approximation error w.r.t. the spectral norm is computationally difficult to compute, especially on large data sets. Nevertheless, this is still a good indication of the approximation performance as the spectral norm is upper-bounded by the Frobenius norm (Lütkepohl, 1996).

⁶ Recall that we use n_e = m/k, and so the number of experts increases with m.

The accuracy of the ensemble Nyström method can be improved, at the expense of more computation. Recall that in our setup, all Nyström-based methods have access to m columns. For the ensemble Nyström, these columns are divided among the m/k experts, each receiving a size-k subset. In general, let r be the number of columns used by each expert. Obviously, the larger the r, the better the ensemble approximation. Indeed, in the extreme case where r = m, the ensemble Nyström method degenerates to the standard Nyström. Hence, the accuracy of the ensemble Nyström method can be improved by using fewer experts, with each expert using a larger column subset. However, the time for performing m/r size-r SVDs is O((m/r)·r³) = O(mr²). Figure 2 shows the resultant tradeoff between approximation error and CPU time on the satimage data set. As can be seen, in order for the ensemble Nyström method to have speed comparable to the proposed algorithm, it has to use a small r and thus settle for a larger approximation error; this justifies our choice of r = k.

Figure 2. Low-rank approximation performance for the ensemble Nyström method, with varying number of columns r used by each expert (curves: our, nys, and ens with r = 600, 900, 1200): (a) approximation error; (b) CPU time (sec.).

4.1.2. Different Ranks

In the second experiment, we study the approximation performance when the rank k varies. Because of the lack of space, results are only reported on the MNIST data set. As can be seen from Figure 3, when k increases, the approximation error decreases while the CPU time increases across all methods. Hence, there is a tradeoff between accuracy and efficiency. Nevertheless, the relative performance comparison among the various methods is still the same as in Section 4.1.1.

Figure 3. Performance at different k's (k = 200, 400, 600, 800) on the MNIST data set (x-axis: m, ×10³; curves: our, nys, ens, r-svd). Top: low-rank approximation error; bottom: CPU time (sec.).

4.1.3. Input Matrices of Different Sizes

In this experiment, we examine how the performance scales with the size of the input matrix. The various methods are run on subsets of the covtype data set. Here, k is fixed to 600 and m to 0.03n. Results are shown in Figure 4. Note that the slopes of the curves in Figure 4(b) determine their scalings with n. As can be seen, the standard Nyström method scales cubically with n, while all the other methods scale quadratically (because m also scales linearly with n here). Moreover, similar to the results in the previous sections, the ensemble Nyström and the proposed method are the most scalable, while the proposed method is as accurate as the standard Nyström that performs a large SVD.

Figure 4. Low-rank approximation performance at different sizes of the input matrix on the covtype data set (curves: our, nys, ens, r-svd): (a) approximation error vs. data size; (b) CPU time (sec.) vs. data size (log-log).

4.2. Spectral Embedding

In this section, we perform spectral embedding using the Laplacian eigenmap (Belkin & Niyogi, 2002). The Gaussian kernel is used to construct the affinity matrix. For easy visualization, the data are projected onto the two singular vectors of the normalized Laplacian with the second and third smallest singular values.
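The paper does not spell out the implementation of this step. The following NumPy sketch shows one plausible way to combine Algorithm 2 with the Laplacian eigenmap, where the degrees of the (never materialized) affinity matrix are estimated from the Nyström factors, in the spirit of (Fowlkes et al., 2004). The helper randomized_svd_sym is the sketch from Section 2.3; the function name, parameters, and the degree-estimation step are illustrative assumptions, not the authors' code.

```python
import numpy as np

def nystrom_laplacian_embedding(X, m, k, sigma, p=5, q=2, rng=None):
    """Hedged sketch: Laplacian-eigenmap embedding via the proposed Nystrom scheme."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=False)
    # m sampled columns of the Gaussian affinity K_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
    d2 = ((X ** 2).sum(1)[:, None] + (X[idx] ** 2).sum(1)[None, :] - 2.0 * X @ X[idx].T)
    C = np.exp(-np.maximum(d2, 0.0) / (2 * sigma ** 2))
    W = C[idx, :]
    # Assumed degree estimate from the Nystrom approximation: d ~ K 1 ~ C W^+ (C^T 1).
    deg = C @ (np.linalg.pinv(W) @ (C.T @ np.ones(n)))
    deg = np.maximum(deg, 1e-12)
    Dn, Dm = 1.0 / np.sqrt(deg), 1.0 / np.sqrt(deg[idx])
    C_norm = Dn[:, None] * C * Dm[None, :]       # sampled columns of D^{-1/2} K D^{-1/2}
    W_norm = C_norm[idx, :]
    U, lam = randomized_svd_sym(W_norm, k, p, q, rng)   # inner randomized SVD (Algorithm 2, step 3)
    U_tilde = C_norm @ (U / lam)                        # approximate eigenvectors (step 4)
    # Leading eigenvectors of the normalized affinity = trailing ones of the normalized Laplacian;
    # the 2nd and 3rd give the 2-D coordinates used for visualization.
    return U_tilde[:, 1:3]
```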

Experiments are performed on the MNIST-8M data set⁷, which contains 8.1M samples constructed by elastic deformation of the original MNIST training set. To avoid clutter in the embedding results, we only use digits 0, 1, 2 and 9, which results in a data set with about 3.3M samples. Because of this sheer size, neither standard SVD nor Nyström can be run on the whole set. Moreover, neither can the ensemble Nyström method be used, as it cannot produce approximate eigenvectors (Section 2.2). Hence, the full set can only be run with the proposed method, with m = 4000, k = 400 and the Gaussian kernel. For comparison, we also run standard SVD on a random subset of 8,000 samples.

⁷ http://leon.bottou.org/papers/loosli-canu-bottou-2006

Results are shown in Figure 5. As can be seen, the two embedding results are very similar. Besides, for the proposed method, this embedding of 3.3M samples is obtained within an hour on our PC.

Figure 5. Embedding results for the digits 0, 1, 2, 9 in the MNIST-8M data set: (a) the proposed method; (b) SVD on a data subset.

5. Conclusion

In this paper, we proposed an accurate and scalable Nyström approximation scheme for very large data sets. It first samples a large column subset from the input matrix, and then performs an approximate SVD on the inner submatrix by using the recent randomized low-rank matrix approximation algorithms. Both theory and experiments demonstrate that the proposed algorithm is as accurate as the standard Nyström method that directly performs a large SVD on the inner submatrix. On the other hand, its time complexity is only as low as that of the ensemble Nyström method. In particular, spectral embedding of an MNIST data set with 3.3 million examples takes less than an hour on a standard PC with 4G memory.

Acknowledgments

This research was supported in part by the Research Grants Council of the Hong Kong Special Administrative Region (Grant 614508), the National Natural Science Foundation of China (Grant No. 60773090 and Grant No. 90820018), the National Basic Research Program of China (Grant No. 2009CB320901), and the National High-Tech Research Program of China (Grant No. 2008AA02Z315).

References

Belkin, M. and Niyogi, P. Laplacian eigenmaps and spectral techniques for embedding and clustering. In NIPS 14, 2002.

Drineas, P. and Mahoney, M.W. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.

Drineas, P., Kannan, R., and Mahoney, M.W. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36(1):158–183, 2006.

Fowlkes, C., Belongie, S., Chung, F., and Malik, J. Spectral grouping using the Nyström method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214–225, February 2004.

Halko, N., Martinsson, P.-G., and Tropp, J.A. Finding structure with randomness: Stochastic algorithms for constructing approximate matrix decompositions. Technical report, 2009.

Kumar, S., Mohri, M., and Talwalkar, A. Sampling techniques for the Nyström method. In AISTATS, 2009.

Kumar, S., Mohri, M., and Talwalkar, A. Ensemble Nyström method. In NIPS 22, 2010.

Lehoucq, R.B., Sorensen, D.C., and Yang, C. ARPACK Users' Guide: Solution of Large-Scale Eigenvalue Problems with Implicitly Restarted Arnoldi Methods. SIAM, 1998.

Lütkepohl, H. Handbook of Matrices. John Wiley and Sons, 1996.

Stewart, G.W. Matrix perturbation theory. SIAM Review, 1990.

Tenenbaum, J.B., de Silva, V., and Langford, J.C. A global geometric framework for nonlinear dimensionality reduction. Science, 290:2319–2323, 2000.

von Luxburg, U. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, December 2007.

Williams, C.K.I. and Seeger, M. Using the Nyström method to speed up kernel machines. In NIPS 13, 2001.

Zhang, K., Tsang, I.W., and Kwok, J.T. Improved Nyström low-rank approximation and error analysis. In ICML, 2008.


A. Existing Results

The following proposition is used in the proof of Proposition 8.6 in (Halko et al., 2009).

Proposition 2. Suppose that R is an orthogonal projector, D is a nonnegative diagonal matrix, and q ≥ 1 is an integer. Then, ‖RDR‖₂^q ≤ ‖RD^qR‖₂.

Theorem 3. [Theorem 10.6, (Halko et al., 2009)] Suppose that A ∈ R^{m×n} with singular values σ₁ ≥ σ₂ ≥ .... Choose a target rank k and an oversampling parameter p ≥ 2, where k + p ≤ min{m, n}. Draw an n×(k+p) standard Gaussian matrix Ω, and construct the sample matrix Y = AΩ. Then,

$$\mathrm{E}\,\|(I - P_Y)A\|_2 \le \left(1 + \sqrt{\frac{k}{p-1}}\right)\sigma_{k+1} + \frac{e\sqrt{k+p}}{p}\left(\sum_{j>k}\sigma_j^2\right)^{1/2},$$

and

$$\mathrm{E}\,\|(I - P_Y)A\|_F \le \left(1 + \frac{k}{p-1}\right)^{1/2}\left(\sum_{i>k}\sigma_i^2\right)^{1/2}.$$

Proposition 3. [Corollary 2, (Kumar et al., 2009)] Suppose A ∈ R^{m×n}. Choose a set S of size m uniformly at random without replacement from {1, . . . , n}, and let C equal the columns of A corresponding to the indices in S, scaled by √(n/m). Then,

$$\mathrm{E}\,\|AA^T - CC^T\|_F \le \frac{n}{\sqrt{m}}\left(\max_i \|A_i\|\right)^2,$$

where A_i is the i-th column of A.

Proposition 4. (Stewart, 1990) Given matrices A ∈ R^{n×n} and E ∈ R^{n×n},

$$\max_i|\sigma_i(A+E) - \sigma_i(A)| \le \|E\|_2, \qquad \sum_i\big(\sigma_i(A+E) - \sigma_i(A)\big)^2 \le \|E\|_F^2.$$

B. Preliminaries

In this section, we introduce some useful properties of the spectral and Frobenius norms that will be heavily used in the analysis.

Lemma 1. (Lütkepohl, 1996)

1. For any square A, ‖AAᵀ‖₂ = ‖AᵀA‖₂ = ‖A‖₂².

2. For any orthogonal U ∈ R^{m×m}, orthogonal V ∈ R^{n×n}, and matrix A ∈ R^{m×n}, ‖UAV‖₂ = ‖A‖₂.

3. For any A, ‖A‖₂ = ‖Aᵀ‖₂ and ‖A‖₂ ≤ ‖A‖_F.

4. For any A and B, ‖AB‖_F ≤ ‖A‖_F ‖B‖₂.

Definition. A matrix P is an orthogonal projector if P = Pᵀ = P².

Given a matrix A,

$$P_A = A(A^TA)^+A^T = U_A U_A^T, \qquad (12)$$

where U_A is an orthonormal basis of ran(A), is an orthogonal projector. For an orthogonal projector P, I − P is also an orthogonal projector. Moreover, since ‖P‖₂² = ‖PᵀP‖₂ = ‖P²‖₂ = ‖P‖₂, we have ‖P‖₂ = 0 or 1.

Lemma 2. For A ∈ R^{m×n} and B ∈ R^{n×m}, ‖AB‖₂ = ‖BA‖₂.

Proof. Assume, without loss of generality, that n ≥ m. Recall that if λ₁, . . . , λ_m are the eigenvalues of AB, then λ₁, . . . , λ_m, 0, . . . , 0 are the eigenvalues of BA (Lütkepohl, 1996). Hence,

$$\|AB\|_2 = \max\{\sqrt{\lambda} : \lambda \text{ is an eigenvalue of } B^TA^TAB\} = \max\{\sqrt{\lambda} : \lambda \text{ is an eigenvalue of } ABB^TA^T\} = \|BA\|_2.$$

Lemma 3. For any psd A ∈ R^{n×n} and any v ∈ R^n with ‖v‖ = 1, vᵀAv ≤ ‖A‖₂.

Proof. ‖A‖₂ = ‖A^{1/2}A^{1/2}‖₂ = ‖A^{1/2}‖₂² = max_{‖v‖=1} ‖A^{1/2}v‖² = max_{‖v‖=1} vᵀAv.

Lemma 4. For any positive semidefinite (psd) A ∈ R^{n×n}, ‖A‖_F ≤ Tr(A).

Proof. Since A is psd, σ_i(A) ≥ 0. Thus,

$$\|A\|_F^2 = \mathrm{Tr}(A^TA) = \sum_{i=1}^n \sigma_i(A^TA) = \sum_{i=1}^n \sigma_i^2(A) \le \left(\sum_{i=1}^n \sigma_i(A)\right)^2 = (\mathrm{Tr}(A))^2.$$

Lemma 5. For any matrix A and orthogonal projector P, ‖AP‖_F ≤ ‖A‖_F.

Proof. Since ‖P‖₂ ≤ 1, then, on using property 4 of Lemma 1, ‖AP‖_F ≤ ‖A‖_F ‖P‖₂ ≤ ‖A‖_F.


C. Proofs of the Results in Section 3.2.1

Lemma 6. Let P be an orthogonal projector and S be a psd matrix. For integer q ≥ 1, ‖PS‖₂ ≤ ‖PS^q‖₂^{1/q}.

Proof. Let the SVD of S be S = UΣUᵀ. On using properties 1 and 2 of Lemma 1, we have

$$\|PS\|_2^{2q} = \|PSSP\|_2^q = \|(U^TPU)\Sigma^2(U^TPU)\|_2^q. \qquad (13)$$

Note that UᵀPU = (UᵀPU)ᵀ = (UᵀPU)(UᵀPU). Hence, UᵀPU is also an orthogonal projector. Using Proposition 2 and the fact that Σ² is diagonal, (13) becomes

$$\|PS\|_2^{2q} \le \|(U^TPU)\Sigma^{2q}(U^TPU)\|_2 = \|PS^{2q}P\|_2 = \|PS^q\|_2^2.$$

Proof of Proposition 1:

Proof. Recall that Q is an orthonormal basis of the range of Y; thus, P_Y = QQᵀ. First, consider q = 1. Using Theorem 3 and that $\sum_{i=k+1}^m \sigma_i^2(W) \le (m-k)\sigma_{k+1}^2(W)$, we obtain (10). Now, for integer q > 1,

$$\mathrm{E}\,\|(I - P_Y)W\|_2 \le \big(\mathrm{E}\,\|(I - P_Y)W\|_2^q\big)^{1/q},$$

on using Hölder's inequality. Note that I − P_Y is also an orthogonal projector. Hence, on using Lemma 6, we have

$$\mathrm{E}\,\|(I - P_Y)W\|_2 \le \big(\mathrm{E}\,\|(I - P_Y)B\|_2\big)^{1/q},$$

where B = W^q. Using (10) with q = 1 (which has just been proved), we obtain

$$\mathrm{E}\,\|(I - P_Y)W\|_2 \le \big(\zeta\,\sigma_{k+1}(B)\big)^{1/q} = \zeta^{1/q}\sigma_{k+1}(W).$$

Lemma 7. For G in (7) and H in (8), E‖XXᵀ − HHᵀ‖₂ ≤ (n/√m) G^*_{ii}, where G^*_{ii} = max_i G_{ii}.

Proof. Using property 3 of Lemma 1, ‖XXᵀ − HHᵀ‖₂ ≤ ‖XXᵀ − HHᵀ‖_F. Since G = XᵀX, G_{ii} = ‖X^{(i)}‖², where X^{(i)} is the i-th column of X. Thus, (max_i ‖X^{(i)}‖)² = max_i ‖X^{(i)}‖² = max_i G_{ii} = G^*_{ii}. The result follows on applying Proposition 3.

Lemma 8. Let U be an orthonormal basis of the range of matrix R ∈ R^{n×k}. Then for any X ∈ R^{n×n},

$$\|XX^T - XUU^TX^T\|_2 \le \|XX^T - RR^T\|_2.$$

Proof. Let P_R = UUᵀ. On using property 1 of Lemma 1,

$$\|X^TX - X^TUU^TX\|_2 = \|X - P_RX\|_2^2 = \max_{\|v\|=1}\|v^T(X - P_RX)\|^2. \qquad (14)$$

Decompose v as v = αy + βz, where y ∈ ran(R), z ∈ ran(R)^⊥, ‖y‖ = ‖z‖ = 1 and α² + β² = 1. Obviously, yᵀP_R = yᵀ and zᵀP_R = 0. Thus,

$$\|X - P_RX\|_2 \le \max_{y\in\mathrm{ran}(R),\|y\|=1}\|y^T(X - P_RX)\| + \max_{z\in\mathrm{ran}(R)^\perp,\|z\|=1}\|z^T(X - P_RX)\| \le \max_{z\in\mathrm{ran}(R)^\perp,\|z\|=1}\|z^TX\|. \qquad (15)$$

Since z ∈ ran(R)^⊥,

$$\|z^TX\|^2 = z^TXX^Tz = z^TRR^Tz + z^T(XX^T - RR^T)z = z^T(XX^T - RR^T)z \le \|XX^T - RR^T\|_2,$$

on using Lemma 3. The result follows on combining this with (14) and (15).

Proof of Theorem 1:

Proof. Let R = HQ and U_R be an orthonormal basis of ran(R). From (6) and (12),

$$\tilde G = CQ(Q^TWQ)^+Q^TC^T = X^THQ(Q^TH^THQ)^+Q^TH^TX = X^TP_{HQ}X = X^TU_RU_R^TX.$$

Using Lemmas 2 and 8, we have

$$\|G - \tilde G\|_2 = \|X^TX - X^TU_RU_R^TX\|_2 = \|X - U_RU_R^TX\|_2^2 = \|X - XU_RU_R^T\|_2^2 = \|XX^T - XU_RU_R^TX^T\|_2 \le \|XX^T - RR^T\|_2 \le \|XX^T - HH^T\|_2 + \|HH^T - RR^T\|_2.$$

Again on using Lemma 2 and (9),

$$\|HH^T - RR^T\|_2 = \|H(I - QQ^T)H^T\|_2 = \|(I - QQ^T)H^TH\|_2 = \kappa^2\|W - QQ^TW\|_2.$$

Then by Proposition 1,

$$\mathrm{E}\,\|HH^T - RR^T\|_2 = \kappa^2\,\mathrm{E}\,\|W - QQ^TW\|_2 \le \kappa^2\zeta^{1/q}\sigma_{k+1}(W) = \zeta^{1/q}\sigma_{k+1}(HH^T) \le \zeta^{1/q}\sigma_{k+1}(XX^T) + \zeta^{1/q}\|XX^T - HH^T\|_2,$$

where the last step is due to Proposition 4. Moreover, note from Proposition 1 that H is assumed to be fixed and the expectation above is only taken over the random variable Q (i.e., the randomness due to the Gaussian random matrix). Putting all these together, and taking expectation over both Q and H (i.e., also including the randomness in selecting the columns), we have

$$\mathrm{E}\,\|G - \tilde G\|_2 \le \mathrm{E}\,\|XX^T - HH^T\|_2 + \mathrm{E}\big(\mathrm{E}_{|H}\,\|HH^T - RR^T\|_2\big) \le \zeta^{1/q}\sigma_{k+1}(XX^T) + (1 + \zeta^{1/q})\,\mathrm{E}\,\|XX^T - HH^T\|_2 \le \zeta^{1/q}\|G - G_k\|_2 + (1 + \zeta^{1/q})\frac{n}{\sqrt{m}}\,G^*_{ii}.$$

The last step uses the fact that ‖G − G_k‖₂ = σ_{k+1}(G) = σ_{k+1}(XXᵀ) and Lemma 7.

D. Error Analysis for the Frobenius Norm

In this section, we obtain a similar bound for the approximation error in terms of the Frobenius norm. Since there is no analogous theory for power iteration w.r.t. the Frobenius norm (Remark 10.1 of (Halko et al., 2009)), the analysis is restricted to q = 1 and the resultant bound is quite loose. Nevertheless, as will be seen in the experiments, the approximation error (with q > 1) is very small.

As in Section 3.2.1, we first consider the error in approximating W from Algorithm 1.

Corollary 1. For the W and Q obtained in Algorithm 1,

$$\mathrm{E}\,\|W - QQ^TW\|_F \le \left(1 + \frac{k}{p-1}\right)^{1/2}\left(\sum_{i>k}\sigma_i^2(W)\right)^{1/2}.$$

Proof. This is a direct application of Theorem 3.

Lemma 9. Given matrices A ∈ R^{n×t} and B ∈ R^{n×s}, with n ≥ max{s, t}. Then, for any k ≤ min{s, t},

$$\sum_{i=1}^k\big(\sigma_i^2(A) - \sigma_i^2(B)\big) \le \sqrt{k}\,\|AA^T - BB^T\|_F.$$

Proof. Using the Cauchy-Schwarz inequality,

$$\sum_{i=1}^k\big(\sigma_i^2(A) - \sigma_i^2(B)\big) \le \sqrt{k}\left[\sum_{i=1}^k\big(\sigma_i^2(A) - \sigma_i^2(B)\big)^2\right]^{1/2} = \sqrt{k}\left[\sum_{i=1}^k\big(\sigma_i(AA^T) - \sigma_i(BB^T)\big)^2\right]^{1/2} \le \sqrt{k}\left[\sum_{i=1}^n\big(\sigma_i(AA^T) - \sigma_i(BB^T)\big)^2\right]^{1/2} \le \sqrt{k}\,\|AA^T - BB^T\|_F,$$

on using Proposition 4.

Lemma 10. For matrices A ∈ R^{n×k} and B ∈ R^{n×n}, with n ≥ k. Let U be an orthonormal basis of ran(A). Then,

$$\sum_{i=1}^k\sigma_i^2(A) - \|U^TB\|_F^2 \le \sqrt{k}\,\|AA^T - BB^T\|_F.$$

Proof. First, note that $\sum_{i=1}^k\sigma_i^2(A) = \sum_{i=1}^k\sigma_i(AA^T) = \mathrm{Tr}(U^TAA^TU)$. Let U^{(i)} be the i-th column of U. Then

$$\sum_{i=1}^k\sigma_i^2(A) - \|U^TB\|_F^2 = \mathrm{Tr}(U^TAA^TU) - \mathrm{Tr}(U^TBB^TU) = \sum_{i=1}^k U^{(i)T}(AA^T - BB^T)U^{(i)} \le \sqrt{k}\left[\sum_{i=1}^k\big(U^{(i)T}(AA^T - BB^T)U^{(i)}\big)^2\right]^{1/2} \le \sqrt{k}\left[\sum_{i=1}^k\sigma_i^2(AA^T - BB^T)\right]^{1/2} \le \sqrt{k}\,\|AA^T - BB^T\|_F,$$

where the first inequality follows from the Cauchy-Schwarz inequality.

Lemma 11. Given A ∈ R^{n×n}, B ∈ R^{n×n}, and k ≤ n,

$$\left|\sqrt{\sum_{i>k}\sigma_i^2(A)} - \sqrt{\sum_{i>k}\sigma_i^2(B)}\right| \le \|A - B\|_F. \qquad (16)$$

Proof. Using the triangle inequality,

$$\left|\sqrt{\sum_{i>k}\sigma_i^2(A)} - \sqrt{\sum_{i>k}\sigma_i^2(B)}\right| \le \sqrt{\sum_{i>k}\big(\sigma_i(A) - \sigma_i(B)\big)^2} \le \sqrt{\sum_{i=1}^n\big(\sigma_i(A) - \sigma_i(B)\big)^2}.$$

On using Proposition 4, we obtain (16).

Proof of Theorem 2:

Proof. Let R = HQ and U_R be an orthonormal basis of ran(R). Then, similar to Theorem 1,

$$\|G - \tilde G\|_F = \|X^TX - X^TU_RU_R^TX\|_F.$$

Since I − U_RU_Rᵀ is an orthogonal projector, it is psd. Thus, for any vector u, uᵀXᵀ(I − U_RU_Rᵀ)Xu = (Xu)ᵀ(I − U_RU_Rᵀ)(Xu) ≥ 0, and so XᵀX − XᵀU_RU_RᵀX is also psd. Using Lemma 4,

$$\|G - \tilde G\|_F \le \mathrm{Tr}(X^TX - X^TU_RU_R^TX) = \|X\|_F^2 - \|U_R^TX\|_F^2. \qquad (17)$$

Using Lemmas 9 and 10,

$$\|X\|_F^2 - \|U_R^TX\|_F^2 = \|X\|_F^2 - \sum_{i=1}^{k+p}\sigma_i^2(X) + \sum_{i=1}^{k+p}\sigma_i^2(X) - \sum_{i=1}^{k+p}\sigma_i^2(H) + \sum_{i=1}^{k+p}\sigma_i^2(H) - \sum_{i=1}^{k+p}\sigma_i^2(R) + \sum_{i=1}^{k+p}\sigma_i^2(R) - \|U_R^TX\|_F^2$$
$$\le \sum_{i>k+p}\sigma_i^2(X) + \sqrt{k+p}\,\|XX^T - HH^T\|_F + \sqrt{k+p}\,\|HH^T - RR^T\|_F + \sqrt{k+p}\,\|XX^T - RR^T\|_F$$
$$\le \sum_{i>k+p}\sigma_i^2(X) + 2\sqrt{k+p}\,\|XX^T - HH^T\|_F + 2\sqrt{k+p}\,\|HH^T - RR^T\|_F. \qquad (18)$$

Note that $\sum_{i>k+p}\sigma_i^2(X)$ can be bounded as

$$\sum_{i>k+p}\sigma_i^2(X) \le \sum_{i=1}^{n}\sigma_i^2(X) = \mathrm{Tr}(X^TX) \le n\,G^*_{ii}. \qquad (19)$$

Moreover, let P = I − QQᵀ, which is an orthogonal projector. Then

$$\|HH^T - RR^T\|_F^2 = \|HH^T - HQQ^TH^T\|_F^2 = \mathrm{Tr}\big(HPH^THPH^T\big) = \mathrm{Tr}\big((HP)(PH^THPPH^T)\big) = \mathrm{Tr}(PH^THPPH^THP) = \|PH^THP\|_F^2 \le \|PH^TH\|_F^2 = \|H^TH - QQ^TH^TH\|_F^2 = \kappa^4\|W - QQ^TW\|_F^2,$$

where the inequality is due to Lemma 5. Using Corollary 1, and letting $\zeta_F = \big(1 + \frac{k}{p-1}\big)^{1/2}$, we then have

$$\mathrm{E}\,\|HH^T - RR^T\|_F = \kappa^2\,\mathrm{E}\,\|W - QQ^TW\|_F \le \kappa^2\zeta_F\Big[\sum_{i>k}\sigma_i^2(W)\Big]^{1/2} = \zeta_F\Big[\sum_{i>k}\sigma_i^2(HH^T)\Big]^{1/2} \le \zeta_F\Big[\sum_{i>k}\sigma_i^2(XX^T)\Big]^{1/2} + \zeta_F\|XX^T - HH^T\|_F = \zeta_F\|G - G_k\|_F + \zeta_F\|XX^T - HH^T\|_F, \qquad (20)$$

on using Lemma 11. Combining (17) with (18), (19), and (20), we have

$$\mathrm{E}\,\|G - \tilde G\|_F \le n\,G^*_{ii} + 2\sqrt{k+p}\,\mathrm{E}\,\|XX^T - HH^T\|_F + 2\sqrt{k+p}\,\mathrm{E}\big(\mathrm{E}_{|H}\,\|HH^T - RR^T\|_F\big) \le n\,G^*_{ii} + 2\zeta_F\sqrt{k+p}\,\|G - G_k\|_F + 2(1 + \zeta_F)\sqrt{k+p}\,\mathrm{E}\,\|XX^T - HH^T\|_F.$$

Finally, using Proposition 3 and property 3 of Lemma 1, we obtain

$$\mathrm{E}\,\|G - \tilde G\|_F \le n\,G^*_{ii} + 2\zeta_F\sqrt{k+p}\,\|G - G_k\|_F + 2(1 + \zeta_F)\sqrt{k+p}\,\frac{n}{\sqrt{m}}\,G^*_{ii} \le \frac{2(k+p)}{\sqrt{p-1}}\,\|G - G_k\|_F + \left(1 + \frac{4(k+p)}{\sqrt{m(p-1)}}\right) n\,G^*_{ii}.$$