
Lanczos Vectors versus Singular Vectors for Effective Dimension Reduction

    Jie Chen and Yousef Saad

Abstract—This paper takes an in-depth look at a technique for computing filtered matrix-vector (mat-vec) products which are required in many data analysis applications. In these applications the data matrix is multiplied by a vector and we wish to perform this product accurately in the space spanned by a few of the major singular vectors of the matrix. We examine the use of the Lanczos algorithm for this purpose. The goal of the method is identical with that of the truncated singular value decomposition (SVD), namely to preserve the quality of the resulting mat-vec product in the major singular directions of the matrix. The Lanczos-based approach achieves this goal by using a small number of Lanczos vectors, but it does not explicitly compute singular values/vectors of the matrix. The main advantage of the Lanczos-based technique is its low cost when compared with that of the truncated SVD. This advantage comes without sacrificing accuracy. The effectiveness of this approach is demonstrated on a few sample applications requiring dimension reduction, including information retrieval and face recognition. The proposed technique can be applied as a replacement for the truncated SVD technique whenever the problem can be formulated as a filtered mat-vec multiplication.

Index Terms—Dimension reduction, SVD, Lanczos algorithm, information retrieval, latent semantic indexing, face recognition, PCA, eigenfaces.

    I. INTRODUCTION

IN many data analysis problems, such as information retrieval and face recognition, it is desired to compute a matrix-vector (mat-vec) product between the data matrix A and a certain "query" vector b. However, in order to remove noise and/or extract latent structures of the data, the optimal rank-k approximation of A, denoted A_k, is commonly used in place of A itself. Specifically, if we let A have the SVD

A = U Σ V^T,

and define

A_k = U_k Σ_k V_k^T,

where U_k (resp. V_k) consists of the first k columns of U (resp. V) and Σ_k is the k-th principal submatrix of Σ, then the matrix A_k is the best rank-k approximation of A in the 2-norm or Frobenius norm sense [1], [2]. Then the product

A_k b                                                        (1)

is a filtered version of the mat-vec product Ab. The use of this filtered mat-vec product is illustrated in the following two sample applications.

This work was supported by NSF grants DMS 0510131 and DMS 0528492 and by the Minnesota Supercomputing Institute.

The authors are with the Department of Computer Science and Engineering, University of Minnesota at Twin Cities, MN 55455. Email: {jchen, saad}@cs.umn.edu

a) Latent Semantic Indexing (LSI): LSI [3], [4] is an effective information retrieval technique which computes the relevance scores for all the documents in a collection in response to a user query. In LSI, a collection of documents is represented as a term-document matrix X = [x_ij], where each column vector represents a document and x_ij is the weight of term i in document j. A query is represented as a pseudo-document in a similar form: a column vector q. By performing dimension reduction, each document x_j (the j-th column of X) becomes P_k^T x_j and the query becomes P_k^T q in the reduced-rank representation, where P_k is the matrix whose column vectors are the k major left singular vectors of X. The relevance score between the j-th document and the query is then computed as the cosine of the angle between P_k^T x_j and P_k^T q:

cos(P_k^T x_j, P_k^T q) = (P_k^T x_j)^T (P_k^T q) / (‖P_k^T x_j‖ ‖P_k^T q‖) = (X_k e_j)^T q / (‖X_k e_j‖ ‖P_k^T q‖),

where X_k is the best rank-k approximation of X, and e_j is the j-th column of the identity matrix. Hence, relevance scores for all the documents in the collection are equivalently represented by the column vector

X_k^T q                                                      (2)

modified by scaling each entry with the norm of the corresponding row of X_k^T. Thus, LSI reduces to computing the filtered mat-vec product X_k^T q.

b) Eigenfaces: The eigenfaces methodology [5] applies principal component analysis to face recognition. Similar to LSI, face images in the training set are represented as column vectors to form a matrix F. A difference with LSI is that the columns of F are centered by concurrently subtracting their mean. An image p in the test set is also represented as a column vector. Let Q_k be the matrix constituting the k major left singular vectors of F. The similarity between the j-th training image and the test image is then computed as the cosine of the angle between Q_k^T f_j and Q_k^T p, where f_j is the j-th column of F. Similar to the above derivation in LSI, the similarity scores for all the training images are represented as the column vector

F_k^T p                                                      (3)

modified by scaling each entry with the norm of the corresponding row of F_k^T, where F_k is the best rank-k approximation of F. The main computation in the eigenfaces technique is the filtered mat-vec product F_k^T p.

A notorious difficulty in computing A_k b is that it requires the truncated SVD, which is computationally expensive for large A. (See Table I for a summary of the costs.) In addition, frequent changes in the data set require an update of the SVD, and this is not an easy task. Much research has been devoted to the problem of updating the (truncated) SVD [6]-[9]. A drawback of these approaches is that the resulting (truncated) SVD loses accuracy after frequent updates.

To bypass the truncated SVD computation, Kokiopoulou and Saad [10] introduced polynomial filtering techniques, and Erhel et al. [11] and Saad [12] proposed algorithms for building good polynomials to use in such techniques. These methods all efficiently compute a sequence of vectors φ_i(A)b, where φ_i is a polynomial of degree at most i such that φ_i(A)b is progressively a better approximation to A_k b as i increases.

In this paper, we investigate the Lanczos algorithm as a means to compute a sequence of vectors {s_i} (or {t_i}) which progressively approximate Ab. This technique does not use explicit polynomials, but it shares the advantages of polynomial filtering techniques: easy-to-update recurrence formulas and low cost relative to the direct SVD. Both approximating sequences ({s_i} and {t_i}) rapidly converge to Ab in the major singular directions of A, achieving the same goal as that of the truncated SVD which preserves the major singular components. Hence when i = k, the vector s_i or t_i is a good alternative to the filtered mat-vec product A_k b.

More appealingly, the proposed technique can be split into a preprocessing phase and a query response phase, where a query vector b is involved only in the second phase. In contrast, polynomial filtering techniques cannot be naturally separated into two phases. Hence, the proposed technique will certainly outperform polynomial filtering techniques in common practical situations where many queries are to be processed.

Note that the Lanczos algorithm is a commonly used method for large symmetric eigenvalue problems and sparse SVD problems. As it turns out, this approach puts too much emphasis on individual singular vectors and is relatively expensive (see, e.g., [13], and the discussions in Section IV-B). We exploit the Lanczos algorithm to efficiently compute the filtered mat-vec products without explicitly computing the singular components of the matrix. Therefore, when computational efficiency is critical, this technique may be a favorable alternative to the best rank-k approximation.

    A. Related work

The proposed usage of the Lanczos algorithm belongs to a broad class of methods, the so-called Krylov subspace methods. Blom and Ruhe [14] suggested the use of the Golub-Kahan bidiagonalization technique [15], which also belongs to this class, for information retrieval. While it is well known that the Golub-Kahan bidiagonalization and the Lanczos algorithm are closely related, we highlight the differences between these two techniques and note the contributions of this paper in the following:

1) The framework described in this paper could be considered an extension of that of [14]. The method in [14] is equivalent to a special case of one of our approximation schemes (the left projection approximation). In particular, the method in [14] used the normalized query as the initial vector. This means that the construction of the Krylov subspace in their method would normally be repeated anew for each different query, which is not economical. In contrast, our method constructs the Krylov subspace only once, in the preprocessing phase.

2) The method in [14] generated two sets of vectors that essentially replaced the left and right singular vectors. In contrast, we generate only one set of (Lanczos) vectors. This saving of storage can be critical for large scale problems.

3) We strengthen the observation made in [14] that the difference between Ab and the approximation vector (s_i) monotonically decreases, by proving rigorously that this difference decays very rapidly in the major singular directions of A. Illustrations are provided in Section VII that relate the convergence behavior of the method to the good separation of the largest singular values (eigenvalues).

4) Several practical issues which were not discussed in [14] are explored in detail. These include the choice of the approximation scheme depending on the shape of the matrix, and other practical choices related to memory versus computational cost trade-offs.

5) The technique described in this paper has a broad variety of applications and is not limited to information retrieval. Indeed, it can be applied whenever the problem can be formulated as a filtered mat-vec multiplication. This paper presents two such applications.

    B. Organization of the paper

The symmetric Lanczos algorithm is reviewed in Section II, and the two proposed Lanczos approximation schemes are introduced in Section III. Section IV provides a detailed analysis of the convergence of the proposed techniques and their computational complexities. Section V discusses how to incorporate entry scaling into the approximation vectors. Applications of the proposed approximation schemes to information retrieval and face recognition are illustrated in Section VI. Finally, extensive experiments are shown in Section VII, followed by concluding remarks in Section VIII.

II. THE SYMMETRIC LANCZOS ALGORITHM

Given a symmetric matrix M ∈ R^{n×n} and an initial unit vector q_1, the Lanczos algorithm builds an orthonormal basis of the Krylov subspace

K_k(M, q_1) = range{q_1, M q_1, M^2 q_1, . . . , M^{k-1} q_1}.

The Lanczos vectors q_i, i = 1, . . . , k computed by the algorithm satisfy the 3-term recurrence

β_{i+1} q_{i+1} = M q_i - α_i q_i - β_i q_{i-1}

with β_1 q_0 ≡ 0. The coefficients α_i and β_{i+1} are computed so as to ensure that ⟨q_{i+1}, q_i⟩ = 0 and ‖q_{i+1}‖ = 1. In exact arithmetic, it turns out that q_{i+1} is orthogonal to q_1, . . . , q_i, so that the vectors q_i, i = 1, . . . , k form an orthonormal basis of the Krylov subspace K_k(M, q_1). An outline of the procedure is illustrated in Algorithm 1. The time cost of the procedure is O(kn^2). If the matrix is sparse, then the cost reads O(k(nnz + n)), where nnz is the number of nonzeros of M.

If Q_k = [q_1, . . . , q_k] ∈ R^{n×k}, then an important equality which results from the algorithm is

Q_k^T M Q_k = T_k =

    [ α_1   β_2                              ]
    [ β_2   α_2   β_3                        ]
    [        .     .     .                   ]
    [          β_{k-1}   α_{k-1}   β_k       ]
    [                     β_k      α_k       ]        (4)

An eigenvalue of T_k is called a Ritz value, and if y is an associated eigenvector, Q_k y is the associated Ritz vector. As k increases, more and more Ritz values and vectors will converge towards eigenvalues and vectors of M [2], [16].

Algorithm 1 The Lanczos Procedure
Input: M, q_1, k
Output: q_1, . . . , q_{k+1}, α_i's, β_i's
1: Set β_1 ← 0, q_0 ← 0.
2: for i ← 1, . . . , k do
3:    w_i ← M q_i - β_i q_{i-1}
4:    α_i ← ⟨w_i, q_i⟩
5:    w_i ← w_i - α_i q_i
6:    β_{i+1} ← ‖w_i‖_2
7:    if β_{i+1} = 0 then
8:       stop
9:    end if
10:   q_{i+1} ← w_i / β_{i+1}
11: end for
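For concreteness, the following is a minimal NumPy sketch of Algorithm 1. The function name lanczos and the matvec argument (a routine applying M to a vector, so that M itself never needs to be formed) are our own choices for illustration, not notation from the paper; re-orthogonalization (Section II-A below) is omitted here.

import numpy as np

def lanczos(matvec, q1, k):
    """Algorithm 1: build up to k+1 Lanczos vectors for the symmetric operator
    given by matvec(v) = M v.  Returns Q (columns q_1, ..., q_{k+1}),
    alpha (alpha_1..alpha_k) and beta (beta_1..beta_{k+1}, with beta_1 = 0)."""
    n = q1.shape[0]
    Q = np.zeros((n, k + 1))
    alpha = np.zeros(k)
    beta = np.zeros(k + 1)                    # beta[0] plays the role of beta_1 = 0
    Q[:, 0] = q1 / np.linalg.norm(q1)
    for i in range(k):
        w = matvec(Q[:, i])
        if i > 0:
            w = w - beta[i] * Q[:, i - 1]     # line 3
        alpha[i] = np.dot(w, Q[:, i])         # line 4
        w = w - alpha[i] * Q[:, i]            # line 5
        beta[i + 1] = np.linalg.norm(w)       # line 6
        if beta[i + 1] == 0.0:                # lines 7-9
            Q = Q[:, : i + 1]
            break
        Q[:, i + 1] = w / beta[i + 1]         # line 10
    return Q, alpha, beta

For the left projection scheme introduced later, line 1 of Algorithm 2 would correspond to a call such as lanczos(lambda v: A @ (A.T @ v), q1, k), which realizes M = AA^T without forming it.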

    A. Re-orthogonalization

It is known that in practice the theoretical orthogonality of the computed Lanczos vectors q_i is quickly lost. This loss of orthogonality is triggered by the convergence of one or more eigenvectors [17]. There has been much research devoted to re-instating orthogonality. The simplest technique is to re-orthogonalize each new vector against each of the previous basis vectors. This amounts to adding the following line of pseudocode immediately after line 5 of Algorithm 1:

w_i ← w_i - Σ_{j=1}^{i-1} ⟨w_i, q_j⟩ q_j.

This additional full re-orthogonalization step increases the computational cost of the Lanczos procedure (by roughly k^2 n / 2 computations), but the advantages are that all the q_i's are guaranteed to be numerically orthogonal, and any subsequent process relying on the orthogonality is more rigorous.
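In the NumPy sketch above, this full re-orthogonalization would be one extra statement inside the loop (a sketch only, under the column layout assumed there):

# Insert immediately after  w = w - alpha[i] * Q[:, i]  (Algorithm 1, line 5):
# re-orthogonalize w against the previous basis vectors q_1, ..., q_i
# (the paper's sum runs to i-1; including q_i as well is harmless and common).
w = w - Q[:, : i + 1] @ (Q[:, : i + 1].T @ w)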

It is important to add here that there are other, more cost-effective re-orthogonalization procedures known. These procedures are not considered in this paper for simplicity, but they should be considered in any realistic implementation of the Lanczos-based technique. The best-known practical alternative to full re-orthogonalization is partial re-orthogonalization, which consists of taking a re-orthogonalization step only when needed [18]-[20]. Another technique is the selective re-orthogonalization approach, which exploits the fact that loss of orthogonality is seen only in the direction of the converged eigenvectors [17]. Using one of these approaches should lead to a considerable reduction of the cost of the Lanczos procedure, especially for situations requiring a large number of steps.

    III. LANCZOS APPROXIMATION

The Lanczos algorithm which we briefly discussed above can be exploited in two ways to efficiently approximate the filtered mat-vec A_k b without resorting to the expensive SVD.

    A. Left projection approximation

The first approach is to apply the Lanczos procedure to the matrix M = AA^T, resulting in the equality

Q_k^T A A^T Q_k = T_k.                                       (5)

We define the sequence

s_i := Q_i Q_i^T A b,                                        (6)

where s_i is the orthogonal projection of Ab onto the subspace range(Q_i). By this definition, the vector s_i can be easily updated from the previous vector s_{i-1}:

s_i = s_{i-1} + q_i q_i^T A b.                               (7)

The sequence {s_i} is a progressive approximation to Ab as i increases. We call this the left projection approximation, as the projection operator Q_i Q_i^T operates on the left-hand side of A. (This is to be distinguished from the later definition of the right projection approximation.) In the course of the process to compute the sequence, the matrix M need not be explicitly formed. The only computation involving M is the mat-vec on line 3 of Algorithm 1. This computation can be performed as

w_i ← A(A^T q_i) - β_i q_{i-1}.

An analysis, to be discussed shortly, will show that when i = k, the vector s_k is a good approximation to A_k b. Algorithm 2 summarizes the steps to compute s_k.

Algorithm 2 Left projection approximation
Input: A, b, q_1, k
Output: s_k
1: Apply Algorithm 1 on M = AA^T to obtain Lanczos vectors q_1, . . . , q_k, with line 3 modified to w_i ← A(A^T q_i) - β_i q_{i-1}.
2: s ← Ab
3: s_0 ← 0
4: for i ← 1, . . . , k do
5:    s_i ← s_{i-1} + ⟨s, q_i⟩ q_i
6: end for

Note that the computation of s_k in Algorithm 2 can equivalently be carried out as a sequence of mat-vec products, i.e., as Q_k (Q_k^T (Ab)). The algorithm presents the computation of s_k in the update form as this reveals more information on the sequence {s_i}.
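As a small sketch of the two phases of Algorithm 2, reusing the hypothetical lanczos() helper above (again, the function names are ours, not the paper's):

def left_projection_preprocess(A, q1, k):
    # Line 1 of Algorithm 2: Lanczos on M = A A^T, applied without forming M.
    Q, alpha, beta = lanczos(lambda v: A @ (A.T @ v), q1, k)
    return Q[:, :k]                        # keep q_1, ..., q_k

def left_projection_query(A, Qk, b):
    # Lines 2-6 of Algorithm 2, in the equivalent form s_k = Q_k (Q_k^T (A b)).
    return Qk @ (Qk.T @ (A @ b))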

    B. Right projection approximation

A second method consists of applying the Lanczos algorithm to the matrix M̂ = A^T A, resulting in the relation

Q̂_k^T A^T A Q̂_k = T̂_k.                                     (8)

We now define the sequence

t_i := A Q̂_i Q̂_i^T b,                                       (9)

and call this the right projection approximation to Ab. The vector t_i is related to t_{i-1} by

t_i = t_{i-1} + A q̂_i q̂_i^T b.                              (10)

Similar to the left projection approximation, the matrix M̂ does not need to be explicitly formed, and the vector t_k is also a good approximation to A_k b (shown later). Algorithm 3 summarizes the steps to compute t_k.


Algorithm 3 Right projection approximation
Input: A, b, q̂_1, k
Output: t_k
1: Apply Algorithm 1 on M̂ = A^T A to obtain Lanczos vectors q̂_1, . . . , q̂_k, with line 3 modified to w_i ← A^T (A q̂_i) - β_i q̂_{i-1}.
2: t_0 ← 0
3: for i ← 1, . . . , k do
4:    t_i ← t_{i-1} + ⟨b, q̂_i⟩ q̂_i
5: end for
6: t_k ← A t_k

Again, note that lines 2-6 of the algorithm are equivalent to computing t_k directly by t_k = A (Q̂_k (Q̂_k^T b)).
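The right projection scheme is symmetric to the left one; a sketch under the same assumptions as the helpers above:

def right_projection_preprocess(A, q1, k):
    # Line 1 of Algorithm 3: Lanczos on A^T A, applied without forming it.
    Q, alpha, beta = lanczos(lambda v: A.T @ (A @ v), q1, k)
    return Q[:, :k]

def right_projection_query(A, Qk, b):
    # Lines 2-6 of Algorithm 3, in the equivalent form t_k = A (Q_k (Q_k^T b)).
    return A @ (Qk @ (Qk.T @ b))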

    IV. ANALYSIS

    A. Convergence

The reason why s_k and t_k are good approximations to A_k b is the fact that the sequences {s_i} and {t_i} both rapidly converge to Ab in the major left singular directions of A. This fact implies that s_k and t_k, when projected onto the subspace spanned by the major left singular vectors of A, are very close to the projection of Ab if k is sufficiently large. Since the goal of performing a filtered mat-vec A_k b in data analysis applications is to preserve the accuracy of Ab in these directions, the vectors s_k and t_k are good alternatives to A_k b. The rapid convergence is analyzed next.

Let u_j be the j-th left singular vector of A. The rapid convergence of the sequence {s_i} can be shown by establishing the inequality

|⟨Ab - s_i, u_j⟩| ≤ c_j ‖Ab‖ T_{i-j}^{-1}(γ_j),              (11)

for i ≥ j, where c_j > 0 and γ_j > 1 are constants independent of i, and T_l(·) is the Chebyshev polynomial of the first kind of degree l. In other words, the difference between Ab and s_i, when projected on the direction u_j, decays at least with the same order as T_{i-j}^{-1}(γ_j). The detailed derivation of this inequality is shown in the Appendix. It can be similarly shown that

|⟨Ab - t_i, u_j⟩| ≤ σ_j c_j ‖b‖ T_{i-j}^{-1}(γ_j)            (12)

for constants σ_j > 0, c_j > 0 and γ_j > 1. Again, this inequality means that t_i converges to Ab in the major left singular directions u_j of A as i increases.

The Chebyshev polynomial T_l(x) for x > 1 is defined as

T_l(x) := cosh(l · arccosh(x)) = (e^{lθ} + e^{-lθ}) / 2,

where θ = arccosh(x). When l is large the magnitude of e^{-lθ} is negligible, hence

T_l(x) ≈ (1/2) e^{lθ} = (1/2) (exp(arccosh(x)))^l.

Thus, given x, the Chebyshev polynomial T_l(x) grows in a manner that is close to an exponential in l. The base exp(arccosh(x)) is strictly larger than one when x > 1. In other words, the previous approximation vectors s_i (or t_i) converge geometrically, i.e., as τ^l with τ < 1. The convergence factor τ can be close to one, but it is away from one in situations when there is a good separation of the largest singular values of A. In such situations, which are often satisfied by practical applications, s_i and t_i can converge very fast to Ab in the major left singular directions of A.
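A quick numerical illustration of this growth, using only the definition above (our own check, not from the paper):

import numpy as np

x = 1.05                                   # any x > 1
base = np.exp(np.arccosh(x))               # the growth base; here about 1.37
for l in (5, 10, 20, 40):
    exact = np.cosh(l * np.arccosh(x))     # T_l(x)
    approx = 0.5 * base ** l               # (1/2) (exp(arccosh(x)))^l
    print(l, exact, approx)                # the two values agree closely as l grows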

    B. Computational complexities

Compared with the truncated SVD, the two Lanczos approximation schemes are far more economical both in memory usage and in time consumption (for the preprocessing phase). This section analyzes the computational complexities of these algorithms and the results are summarized in Table I. We assume that the matrix A is of size m × n with nnz non-zero elements. The situation when A is non-sparse is not typical, thus we omit the analysis for this case here.^1 Section IV-D has further discussions on the trade-offs between space and time in practical implementations.

In practice, both Algorithms 2 and 3 are split into two phases: preprocessing and query response. The preprocessing phase consists of the computation of the Lanczos vectors, which is exactly line 1 of both algorithms. The rest of the algorithms computes s_k (or t_k) by taking in a vector b. Multiple query vectors b can be input and different query responses s_k (or t_k) can be easily produced, provided that the Lanczos vectors have been computed and saved. Hence the rest of the algorithms, excluding line 1, constitutes the query response phase.

Since Algorithms 2 and 3 are very similar, it suffices to show the analysis of Algorithm 2. In the preprocessing phase, this algorithm computes two sparse mat-vec products (once for multiplying q_i by A^T, and once for multiplying this result by A) and performs a few length-m vector operations during each iteration. Hence with k iterations, the time cost is O(k(nnz + m)). The space cost consists of the storage of the matrix A and all the Lanczos vectors, thus it is O(nnz + km).

The bulk of the preprocessing phase for computing A_k b lies in the computation of the truncated SVD of A. A typical way of computing this decomposition is to go through the Lanczos process as in (5) with k′ > k iterations, and then compute the eigen-elements of T_{k′}. Usually the number of Lanczos steps k′ is far greater than k in order to guarantee the convergence of the first k Ritz values and vectors. Hence the costs for the truncated SVD, both in time and space, involve a value of k′ that is much larger than k. Furthermore, the costs have an additional factor for performing frequent eigen-decompositions and convergence tests.

For Algorithm 2, the query response phase involves one sparse mat-vec (Ab) and k vector updates. Hence the total storage cost is O(nnz + km), and to be more exact, the time cost is nnz + 2km. On the other hand, the filtered mat-vec product A_k b can be computed in three ways: (a) U_k(U_k^T (Ab)); (b) A(V_k(V_k^T b)); or (c) U_k(Σ_k(V_k^T b)). Approach (a) requires O(nnz + km) space and nnz + 2km time, approach (b) requires O(nnz + kn) space and nnz + 2kn time, while approach (c) requires O(k(m + n)) space and k(m + n) time. Depending on the relative size of k and nnz/n (or nnz/m), the best way can be chosen in actual implementations.
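In code, with Uk and Vk holding the k major singular vectors and sk the corresponding singular values (a length-k vector), the three options read as follows (our sketch; all three are mathematically equal to A_k b):

y_a = Uk @ (Uk.T @ (A @ b))      # (a): needs A and U_k
y_b = A @ (Vk @ (Vk.T @ b))      # (b): needs A and V_k
y_c = Uk @ (sk * (Vk.T @ b))     # (c): needs U_k, Sigma_k and V_k, but not A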

To conclude, the two Lanczos approximation schemes consume significantly fewer computing resources than the truncated SVD in the preprocessing phase, and they can be as economical in the query response phase. As a remark, due to the split of the algorithms into these two phases, the storage of the k Lanczos vectors {q_i} (or {q̂_i}) is an issue to consider. These vectors have to be stored permanently after preprocessing, since they are needed to produce the query responses. As will be discussed later in Section IV-D, computational time cost can be traded for the storage of these k vectors.

1 Roughly speaking, if nnz is replaced by m · n in the big-O notations, we will obtain the complexities for the non-sparse case. This is true for the algorithms in this paper.

TABLE I
COMPARISONS OF COMPUTATIONAL COSTS.

                      Left approx.     Right approx.    Trunc. SVD^a
preprocessing space   O(nnz + km)      O(nnz + kn)      O(nnz + k′m + T_s) or O(nnz + k′n + T_s)
preprocessing time    O(k(nnz + m))    O(k(nnz + n))    O(k′(nnz + m) + T_t) or O(k′(nnz + n) + T_t)
query space           O(nnz + km)      O(nnz + kn)      O(nnz + km) or O(nnz + kn) or O(k(m + n))
query time            nnz + 2km        nnz + 2kn        nnz + 2km or nnz + 2kn or k(m + n)

^a T_s (resp. T_t) is the space (resp. time) cost for eigen-decompositions of tridiagonal matrices and convergence tests. These two costs are both complicated and cannot be neatly expressed using factors such as nnz, k, m and n. Also, note that k′ > k and empirically k′ is a few times larger than k.


    C. Which approximation to use?

As the above analysis indicates, the choice of which approximation scheme to use largely depends on the relative size of m and n, i.e., on the shape of the matrix A. It is clear that s_k is a better choice when m < n, while t_k is more appropriate when m ≥ n. The approximation qualities for both schemes are very close; see the experimental results in Section VII-A.

    D. Other implementations

Trade-offs exist between computational efficiency and storage requirement which will lead to different implementations of Algorithms 2 and 3. These choices depend on the practical situation at hand.

a) Avoiding the storage of the Lanczos vectors: When m or n is large, the storage requirement of the k Lanczos vectors cannot be overlooked. This storage can be avoided by simply regenerating the Lanczos vectors on the fly, as efficiently as possible, when they are needed. A common practice is that, after preprocessing, only the initial vector q_1 and the tridiagonal matrix T_k are stored. At the query response phase, the q_i's are regenerated by applying the Lanczos procedure again, without performing lines 4 and 6 (of Algorithm 1). This way of proceeding requires additional sparse mat-vecs and saxpy operations but no inner products. The storage is significantly reduced from k vectors to only 3 vectors (q_1, and the main and off-diagonals of T_k). Note that since re-orthogonalization is usually performed, the storage of the q_i's is needed at least during the preprocessing phase, as it is not practical to recompute the q_i's for the purpose of re-orthogonalization.

b) Avoiding the storage of the matrix: The matrix A occurs throughout the algorithms. The presence of A in the query response phase may be undesirable in some situations for one or more of the following reasons: (a) A is not sparse, so storing a full matrix is very costly; (b) mat-vec operations with a full matrix are expensive; (c) one mat-vec operation may not be as fast as a few vector operations (even in the sparse mode). It is possible to avoid storing A after the preprocessing phase. Specifically, in the left projection approximation, the update formula (7) can be written in the following form:

s_i = s_{i-1} + ⟨b, A^T q_i⟩ q_i.                            (13)

Instead of storing A, we can store the k intermediate vectors {A^T q_i}, which are byproducts of the preprocessing phase. This will result in a query response without any mat-vec operations. Similarly, in the right projection approximation, update formula (10) can be rewritten as:

t_i = t_{i-1} + ⟨b, q̂_i⟩ (A q̂_i).                           (14)

The vectors {A q̂_i}, byproducts of the preprocessing phase, are stored in place of A.
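For instance, for the left projection the query response of (13) can be carried out entirely with stored vectors; in the sketch below, Z is assumed to hold the saved columns z_i = A^T q_i (our own helper, for illustration):

def left_projection_query_no_matrix(Qk, Z, b):
    # Formula (13): s_k = sum_i <b, A^T q_i> q_i = Q_k (Z^T b), with Z = A^T Q_k,
    # so no mat-vec with A is needed at query time.
    return Qk @ (Z.T @ b)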

In summary, depending on the relative significance of space and time in practical situations, the two approximation schemes can be flexibly implemented in three different ways. From low storage requirement to high storage requirement (or from high tolerance in time to low tolerance in time), they are: (a) storing only the initial vector q_1 (or q̂_1) and regenerating the Lanczos vectors in the query response phase; (b) the standard Algorithm 2 (or Algorithm 3); (c) storing two sets of vectors {q_i} and {A^T q_i} (or {q̂_i} and {A q̂_i}) and discarding the matrix A.^2 In all situations these alternatives are far more efficient than those based on the truncated SVD.

2 Depending on the relative size of k and nnz/n (or nnz/m), method (c) may or may not need more storage than (b). In any case, method (c) should run faster than (b).

    V. ENTRY SCALING

As indicated in the discussions of the two sample applications in Section I, it is usually necessary to scale the entries of the filtered mat-vec product. This amounts to dividing each entry of the vector A_k b by the norm of the corresponding row of A_k. In parallel, in the proposed Lanczos approximations, we may need to divide each entry of the approximation vector s_k = Q_k Q_k^T A b (resp. t_k = A Q̂_k Q̂_k^T b) by the norm of the corresponding row of Q_k Q_k^T A (resp. A Q̂_k Q̂_k^T). This section illustrates a way to compute these row norms at minimal cost. Similar to the computation of s_k and t_k, the row norms are computed in an iterative update fashion.

For the left projection approximation, define φ_j^{(i)} to be the norm of the j-th row of Q_i Q_i^T A, i.e.,

φ_j^{(i)} = ‖e_j^T Q_i Q_i^T A‖.                             (15)

It is easy to see that φ_j^{(i)} is related to φ_j^{(i-1)} by

(φ_j^{(i)})^2 = e_j^T Q_i Q_i^T A A^T Q_i Q_i^T e_j
             = e_j^T Q_i T_i Q_i^T e_j
             = e_j^T (Q_{i-1} T_{i-1} Q_{i-1}^T + α_i q_i q_i^T + β_i q_i q_{i-1}^T + β_i q_{i-1} q_i^T) e_j
             = (φ_j^{(i-1)})^2 + α_i (q_i)_j^2 + 2 β_i (q_i)_j (q_{i-1})_j,        (16)


where (v)_j means the j-th element of vector v. Hence the row norms of Q_k Q_k^T A, that is, φ_j^{(k)}, j = 1, . . . , m, can be computed using the update formula (16) in the preprocessing phase. Specifically, the following line of pseudocode should be added immediately after line 4 of Algorithm 1:

(φ_j^{(i)})^2 ← (φ_j^{(i-1)})^2 + α_i (q_i)_j^2 + 2 β_i (q_i)_j (q_{i-1})_j.

For the right projection approximation, define ψ_j^{(i)} to be the norm of the j-th row of A Q̂_i Q̂_i^T, i.e.,

ψ_j^{(i)} = ‖e_j^T A Q̂_i Q̂_i^T‖.                            (17)

It is easy to see that ψ_j^{(i)} is related to ψ_j^{(i-1)} by

(ψ_j^{(i)})^2 = e_j^T A Q̂_i Q̂_i^T A^T e_j
             = e_j^T (A Q̂_{i-1} Q̂_{i-1}^T A^T + A q̂_i q̂_i^T A^T) e_j
             = (ψ_j^{(i-1)})^2 + (A q̂_i)_j^2.                (18)

This update formula helps efficiently compute the row norms of A Q̂_k Q̂_k^T, that is, ψ_j^{(k)}, j = 1, . . . , m, in the preprocessing phase. The specific modification to Algorithm 1 is the insertion of the following line of pseudocode immediately after line 3:

(ψ_j^{(i)})^2 ← (ψ_j^{(i-1)})^2 + (A q̂_i)_j^2.

The computation of the row norms φ_j^{(k)} (resp. ψ_j^{(k)}) using update formula (16) (resp. (18)) requires minimal resources and will not increase the asymptotic space/time complexities of the preprocessing phase in Algorithm 2 (resp. Algorithm 3).
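A small self-contained check of update formula (18) against a direct computation, using the hypothetical lanczos() helper from Section II (our own test code; it assumes the Lanczos vectors stay numerically orthonormal):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((60, 40))                 # m > n, the right projection case
k = 8
Q, alpha, beta = lanczos(lambda v: A.T @ (A @ v), rng.standard_normal(40), k)

psi_sq = np.zeros(A.shape[0])                     # running squared row norms of A Q_i Q_i^T
for i in range(k):
    psi_sq += (A @ Q[:, i]) ** 2                  # update formula (18)

direct = np.linalg.norm(A @ Q[:, :k] @ Q[:, :k].T, axis=1) ** 2
print(np.allclose(psi_sq, direct))                # True (up to rounding)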

    VI. APPLICATIONS

We have briefly discussed two sample applications, LSI and eigenfaces, in Section I. For the sake of completeness, we summarize in this section how the Lanczos approximation with entry scaling can be applied in these two applications as an effective alternative to the traditional approach based on the truncated SVD.

The traditional SVD approach computes the filtered mat-vec product A_k b and divides each entry of this vector by the norm of the corresponding row of A_k to obtain the ranking scores. In LSI, A_k is the transpose of the best rank-k approximation of the term-document matrix, i.e., X_k^T, and b is a query q. In eigenfaces, A_k is the transpose of the best rank-k approximation of the training set matrix, i.e., F_k^T, and b is a test image p.

In the proposed Lanczos approximation, we compute s_k or t_k in place of A_k b, depending on the relative size of the two dimensions of the matrix A. If m < n, we compute s_k as an alternative to A_k b using Algorithm 2. Meanwhile we compute the row norms φ_j^{(k)} for j = 1, . . . , m. Then we divide the j-th entry of s_k by φ_j^{(k)} for all j to obtain the ranking scores. On the other hand, if m ≥ n, we compute t_k as an alternative to A_k b using Algorithm 3 and also compute the scalars ψ_j^{(k)} for j = 1, . . . , m. Then we divide the j-th entry of t_k by ψ_j^{(k)} for all j, and the similarity scores are obtained.
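Putting the pieces together for the LSI case with m < n, a hypothetical end-to-end sketch (function and variable names are ours; X is the term-document matrix, q the query, and lanczos() is the helper sketched in Section II):

import numpy as np

def lanczos_lsi_scores(X, q, q1, k, eps=1e-12):
    A = X.T                                        # A = X^T, so rows correspond to documents
    Q, alpha, beta = lanczos(lambda v: A @ (A.T @ v), q1, k)
    s = Q[:, :k] @ (Q[:, :k].T @ (A @ q))          # s_k, the alternative to A_k q = X_k^T q
    phi_sq = np.zeros(A.shape[0])                  # squared row norms of Q_k Q_k^T A, via (16)
    for i in range(k):
        phi_sq += alpha[i] * Q[:, i] ** 2
        if i > 0:
            phi_sq += 2.0 * beta[i] * Q[:, i] * Q[:, i - 1]
    return s / np.sqrt(np.maximum(phi_sq, eps))    # entry scaling; eps guards near-empty rows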

    VII. EXPERIMENTAL RESULTS

The goal of the extensive experiments to be presented in this section is to show the effectiveness of the Lanczos approximation schemes in comparison to the truncated SVD-based approach. Most of the experiments were performed in Matlab 7 on a Linux workstation with two P4 3.00GHz CPUs and 4GB memory. The only exception was the experiments on the large data set TREC, for which a machine with 16GB of memory was used.

    A. Approximation quality

This section shows the convergence behavior of the sequences {s_i} and {t_i}, and the qualities of s_k and t_k in approximating the filtered mat-vec product A_k b. The matrices used for the experiments were from the data sets MED, CRAN, and NPL, which will be introduced in the next section. We respectively label the three matrices as A_MED, A_CRAN, and A_NPL.

Figure 1 shows the convergence behavior of {s_i} and {t_i} in the major left singular directions u_1, u_2 and u_3 of the matrix A_MED. The vertical axis corresponds to the residuals |⟨Ab - s_i, u_j⟩| (or |⟨Ab - t_i, u_j⟩|). As shown by (a) and (c), if no re-orthogonalization was performed, the {s_i} and {t_i} sequences diverged starting at around the 10-th iteration, which indicated that loss of orthogonality in the Lanczos procedure appeared fairly soon. Hence it is necessary that re-orthogonalization be performed in order to yield accurate results. Plots (b) and (d) show the expected convergence pattern: the residuals rapidly fell towards the level of machine epsilon (≈ 10^{-16}).

To show convergence results on more matrices, we plot in Figure 2 the residual curves for A_CRAN and A_NPL. For all cases re-orthogonalization was performed. Note that the shapes of the two matrices are different: A_CRAN has more columns than rows while for A_NPL the opposite is true. Figure 2 implies two facts: (a) the approximation sequences converged in all three directions, with u_1 being the fastest direction; (b) the approximations did not differ too much in terms of rate of convergence or approximation quality. The first fact can be further investigated by plotting the eigenvalue distribution of the matrices M = AA^T as in Figure 3. The large gap between the first two eigenvalues explains the quick convergence in the u_1 direction. The second fact is guaranteed by the similar residual bounds (cf. inequalities (11) and (12)). It further shows that approximation quality is not a factor in the choice of left or right projection.

For k = 300, we plot in Figure 4 the residuals |⟨A_300 b - s_300, u_j⟩| (or |⟨A_300 b - t_300, u_j⟩|) over the j-th left singular directions for j = 1, . . . , 300. All the plots exhibit a similar pattern: A_300 b and s_300 (or t_300) were almost identical when projected onto the first 100 singular directions. This indicates that s_k and t_k can be good alternatives to A_k b, since they preserve the quality of A_k b in a large number of the major singular directions.

    B. LSI for information retrieval

1) Data sets: Four data sets were used in the experiments. Statistics are shown in Table II.

a) MEDLINE and CRANFIELD^3: These are two early and well-known benchmark data sets for information retrieval. A typical characteristic is that the number of distinct terms is larger than the number of documents, i.e., for the matrix A (or equivalently X^T), m < n.

3 ftp://ftp.cs.cornell.edu/pub/smart/

[Figure 1 about here; panels: (a) Left proj., w/o re-orth.; (b) Left proj., with re-orth.; (c) Right proj., w/o re-orth.; (d) Right proj., with re-orth. Each panel plots the residual against the iteration i for u_1, u_2, u_3.]

Fig. 1. Convergence of the {s_i} (resp. {t_i}) sequence to the vector Ab in the major singular directions u_1, u_2 and u_3. Matrix: A_MED.

[Figure 2 about here; panels: (a) A_CRAN: left proj.; (b) A_CRAN: right proj.; (c) A_NPL: left proj.; (d) A_NPL: right proj. Each panel plots the residual against the iteration i for u_1, u_2, u_3.]

Fig. 2. Convergence of the {s_i} (resp. {t_i}) sequence to the vector Ab in the major singular directions u_1, u_2 and u_3. Matrices: A_CRAN, A_NPL. All with re-orthogonalization.

[Figure 3 about here; panels: (a) A_MED; (b) A_CRAN; (c) A_NPL.]

Fig. 3. Eigenvalue distribution of M = AA^T on the real axis, where A^T is the term-document matrix.

[Figure 4 about here; panels: (a) A_MED: left proj.; (b) A_CRAN: left proj.; (c) A_NPL: right proj. Each panel plots the residual against the major left singular direction j.]

Fig. 4. Differences between A_300 b and s_300 (or t_300) in the major left singular directions. For the first one hundred directions, the differences have decreased to almost zero, while for other less significant directions the differences have not dropped to this level.

b) NPL^3: This data set is larger than the previous two, with the property that the number of documents is larger than the number of distinct terms, i.e., m > n. Including the above two, these three data sets have their term-document matrices readily available from the provided links, so we did not perform additional processing on the data.

c) TREC^4: This is a large data set which is popularly used for experiments in serious text mining applications. Similar to NPL, the term-document matrix extracted from this data set has more documents than distinct terms, i.e., m > n. Specifically, the whole data set consists of four document collections (Financial Times, Federal Register, Foreign Broadcast Information Service, and Los Angeles Times) from the TREC CDs 4 & 5 (copyrighted). The queries are from the TREC-8 ad hoc task.^{5,6} We used the software TMG [21] to construct the term-document matrix. The parsing process included stemming, deleting common words according to the stop-list provided by the software, and removing words with no more than 5 occurrences or with appearances in more than 100,000 documents. Also, 125 empty documents were ignored. This resulted in a term-document matrix of size 138,232 × 528,030. For the queries, only the title and description parts were extracted to construct query vectors.

4 http://trec.nist.gov/data/docs_eng.html
5 Queries: http://trec.nist.gov/data/topics_eng/
6 Relevance: http://trec.nist.gov/data/qrels_eng/

TABLE II
INFORMATION RETRIEVAL DATA SETS.

                  MED      CRAN     NPL      TREC
# terms           7,014    3,763    7,491    138,232
# docs            1,033    1,398    11,429   528,030
# queries         30       225      93       50
ave terms/doc     52       53       20       129

2) Implementation specs: The weighting scheme of all the term-document matrices was term frequency-inverse document frequency (tf-idf). Due to the shapes of the matrices, for MEDLINE and CRANFIELD we used the left projection approximation, and for NPL and TREC we used the right projection approximation. Full re-orthogonalization in the Lanczos process was performed. The filtered mat-vec product A_k b was computed as U_k(U_k^T (Ab)) or A(V_k(V_k^T b)), depending on the relative sizes of m and n. We implemented the partial SVD based on the Lanczos algorithm and Ritz values/vectors [22, Section 3.1]. We did not use the Matlab command svds for two reasons. (a) The Matlab function svds uses a different algorithm. In order to conform to the computational complexity analysis in Section IV-B, we used the algorithm mentioned in the analysis. (b) While the algorithm used by the Matlab function svds is able to accurately compute the smallest singular components, it consumes too many resources in both storage and time. We chose to implement one that was both lightweight and accurate for the largest singular components, since the smallest components were not our concern.

3) Results: Figure 5 plots the performance of the Lanczos approximation applied to the MEDLINE, CRANFIELD and NPL data sets, compared with that of the standard LSI (using the truncated SVD approach). These plots show that the accuracy obtained from the Lanczos approximation is comparable to that of LSI,^7 while in preprocessing the former runs an order of magnitude faster than the latter. The accuracy is measured using the 11-point interpolated average precision, as shown in (a), (b) and (c). More information is provided by plotting the precision-recall curves. These plots are shown using a specific k for each data set in (d), (e) and (f). They suggest that the retrieval accuracy for the Lanczos approximation is very close to that of LSI, sometimes even better. As shown in (g), (h) and (i), the Lanczos approximation is far more economical than the SVD-based LSI. It can be an order of magnitude faster than the truncated SVD which, it should be recalled, was re-implemented for better efficiency. The average query times for the two approaches, as shown in plots (j), (k) and (l), are (almost) identical, which conforms to the previous theoretical analysis. Note the efficiency of the query responses: several milliseconds to several tens of milliseconds. Plot (l) does show small discrepancies in query times between the two approaches. We believe that the differences were caused by the internal Matlab memory management schemes that were unknown to us. If in the experiments the Lanczos vectors (Q_k) and the singular vectors (V_k) had been precomputed and reloaded rather than being computed on the fly, the query times would have been identical for both approaches.

7 Note that by comparable, we mean that the accuracies obtained from the two methods do not greatly differ. It is not necessary that the difference between the accuracies is not statistically significant.

We also performed tests on the large data set TREC. Since we had no access to a fair environment (which requires as much as 16GB working memory and no interruptions from other users) for time comparisons, we tested only the accuracy. Figure 6 shows the precision-recall curves at different k's (400, 600 and 800) for the two approaches. As illustrated by the figure, the Lanczos approximation yielded much higher accuracy than the truncated SVD approach. Meanwhile, according to the rigorous analysis in Section IV-B, the former approach should consume much fewer computing resources and be more efficient than the latter.

    C. Eigenfaces for face recognition

1) Data sets: Three data sets were used in the experiments. Statistics are shown in Table III.

a) ORL^8: The ORL data set [23] is small but classic. A dark homogeneous background was used during the image acquisition process. We did not perform any image processing task on this data set.

b) PIE^9: The PIE data set [24] consists of a large set of images taken under different poses, illumination conditions and face expressions. We downloaded the cropped version from Deng Cai's homepage.^10 This version of the data set contains only five near-frontal poses for each subject. The images had been non-uniformly scaled to a size of 32 × 32.

c) ExtYaleB^11: Extended from a previous database, the extended Yale Face Database B (ExtYaleB) [25] contains images for 38 human subjects. We used the cropped version [26] that can be downloaded from the database homepage. We further uniformly resized the cropped images to half of their sizes. The resizing can be conveniently performed in Matlab by using the command

img_small = imresize(img, .5, 'bicubic')

without compromising the quality of the images.

TABLE III
FACE RECOGNITION DATA SETS.

              ORL        PIE        ExtYaleB
# subjects    40         68         38
# imgs        400        11,554     2,414
img size      92 × 112   32 × 32    84 × 96

2) Implementation specs: In the eigenfaces technique, each image is vectorized to a column, whose elements are pixel values. All the images have a single gray channel, whose values range from 0 to 255. We divided them by 255 such that all the pixel values were no greater than 1. We split a data set into a training subset and a test subset by randomly choosing a specific number of images for each subject as the training images. Different numbers were experimented with and this will be discussed shortly. As in the previous information retrieval experiments, the choice of left or right projection depended on the shape of the training set matrix, and full re-orthogonalization was performed for the Lanczos approximation approach. In the truncated SVD approach we used the same SVD function as the one implemented for information retrieval.

8 http://www.cl.cam.ac.uk/research/dtg/attarchive/facedatabase.html
9 http://www.ri.cmu.edu/projects/project_418.html
10 http://www.cs.uiuc.edu/homes/dengcai2/Data/FaceData.html
11 http://vision.ucsd.edu/~leekc/ExtYaleDatabase/ExtYaleB.html

[Figure 5 about here; panels: (a) MED: average precision; (b) CRAN: average precision; (c) NPL: average precision; (d) MED: precision-recall (k = 240); (e) CRAN: precision-recall (k = 320); (f) NPL: precision-recall (k = 800); (g) MED: preprocessing time (seconds); (h) CRAN: preprocessing time (seconds); (i) NPL: preprocessing time (seconds); (j) MED: average query time (seconds); (k) CRAN: average query time (seconds); (l) NPL: average query time (seconds). Curves compare lanczos, tsvd and (for the precision plots) term matching.]

Fig. 5. (Information retrieval) Performance tests on MEDLINE, CRANFIELD and NPL.

[Figure 6 about here; panels: (a) k = 400; (b) k = 600; (c) k = 800. Each panel shows precision-recall curves for lanczos and tsvd.]

Fig. 6. (Information retrieval) Performance tests on TREC: precision-recall curves at different k's.


3) Results: Error rates of the two approaches are plotted in Figure 7. For each data set, we used different numbers of images for training. The training sets are roughly 40%, 60% and 80% of the whole data set. As shown by the figure, for all the data sets and all the training sizes we experimented with, the Lanczos approximation yielded accuracy that is very close to that of the truncated SVD. These results confirm that the Lanczos approach can be a good alternative to the traditional eigenfaces method.

The advantage of the Lanczos approximation lies in its efficiency. Figure 8 plots the preprocessing times and query times of the two approaches. As can be seen in the figure, in the preprocessing phase the Lanczos approach is faster than the truncated SVD approach by a factor of 3 or more. However, this gain in efficiency is not as large as that seen in the information retrieval experiments. This is due to the fact that the matrix is not sparse in this case. The dense mat-vec multiplication occupies a larger fraction of the computing time. Nevertheless, the Lanczos approximation approach always outperforms the traditional truncated SVD approach in time.

    VIII. CONCLUSIONS

We have explored the use of the Lanczos algorithm as a systematic means of computing filtered mat-vec products. The goal of the method is to obtain a sequence of vectors which converges rapidly towards the exact product in the major left singular directions of the matrix. The main attraction of the algorithm is its low cost. Extensive experiments show that the proposed method can be an order of magnitude faster than the standard truncated SVD approach. In addition, the quality of the approximation in practical applications, such as information retrieval and face recognition, is comparable to that obtained from the traditional approach. Thus, the proposed technique can be applied as a replacement for the SVD technique in any application where a major computational task is to compute a filtered mat-vec, i.e., the product of a low-rank approximation of a matrix by an arbitrary vector.

An avenue of future research following this work is to study, in a statistical manner, how the proposed Lanczos approximation, compared with the truncated SVD approach, affects the final quality for a certain application. For information retrieval, the quality of a technique could be based on the precision, i.e., how close the actual relevant documents are to the top of the computed rank list. Although we have proposed a technique to effectively approximate the ranking scores, how the final precisions deviate from those computed using the standard LSI approach remains inconclusive. Similarly, for face recognition, the accuracy depends only on the top image in the rank list. It is possible to consider under what conditions the top scores obtained from the Lanczos technique and the truncated SVD technique appear on the same image, in which case the two approaches produce exactly the same result.

APPENDIX
PROOF OF THE CONVERGENCE OF THE APPROXIMATION VECTORS

Before the proof, we introduce a known result on the rate of convergence of the Lanczos algorithm. Saad [27] provided a bound on the angle between the j-th eigenvector φ_j of M and the Krylov subspace range(Q_k):

‖(I - Q_k Q_k^T) φ_j‖ / ‖Q_k Q_k^T φ_j‖ ≤ (K_j / T_{k-j}(γ_j)) · ‖(I - Q_1 Q_1^T) φ_j‖ / ‖Q_1 Q_1^T φ_j‖,        (19)

where

γ_j = 1 + 2 (λ_j - λ_{j+1}) / (λ_{j+1} - λ_n),
K_j = 1 if j = 1, and K_j = Π_{i=1}^{j-1} (λ_i - λ_n) / (λ_i - λ_j) if j ≠ 1,

λ_j is the j-th eigenvalue of M, and T_l(x) is the Chebyshev polynomial of the first kind of degree l. Assuming that φ_j has unit norm, we simplify the inequality into the following form:

‖(I - Q_k Q_k^T) φ_j‖ ≤ c_j T_{k-j}^{-1}(γ_j),               (20)

where c_j is some constant independent of k. Inequality (20) indicates that the difference between any unit eigenvector φ_j and the subspace span(Q_i) decays at least with the same order as T_{i-j}^{-1}(γ_j).

We now analyze the difference between Ab and s_i in the left singular directions of A. Recall that A has the SVD A = U Σ V^T, hence M = AA^T = U Σ Σ^T U^T, which means that the u_j's are eigenvectors of M.

[Figure 7 about here; panels: (a) ORL: error rate (4 train); (b) ORL: error rate (6 train); (c) ORL: error rate (8 train); (d) PIE: error rate (70 train); (e) PIE: error rate (100 train); (f) PIE: error rate (130 train); (g) ExtYaleB: error rate (30 train); (h) ExtYaleB: error rate (40 train); (i) ExtYaleB: error rate (50 train). Each panel plots error rate against k for lanczos and tsvd.]

Fig. 7. (Face recognition) Performance tests on ORL, PIE and ExtYaleB: error rate.

    We now analyze the difference between $Ab$ and $s_i$ in the left singular directions of $A$. Recall that $A$ has the SVD $A = U\Sigma V^T$, hence $M = AA^T = U\Sigma\Sigma^T U^T$, which means that the $u_j$'s are eigenvectors of $M$. Then applying (20),
$$
\left|\langle Ab - s_i,\, u_j\rangle\right|
= \left|\langle (I - Q_i Q_i^T)Ab,\, u_j\rangle\right|
= \left|\langle (I - Q_i Q_i^T)u_j,\, Ab\rangle\right|
\le \|(I - Q_i Q_i^T)u_j\|\,\|Ab\|
\le c_j\,\|Ab\|\,T_{i-j}^{-1}(\gamma_j).
$$
    This gives the convergence rate of the approximation vector $s_i$ to the vector $Ab$ along the direction $u_j$.

    We can similarly give a bound on the difference between $Ab$ and $t_i$ in the direction $u_j$:
$$
\left|\langle Ab - t_i,\, u_j\rangle\right|
= \left|\langle A(I - \tilde{Q}_i \tilde{Q}_i^T)b,\, u_j\rangle\right|
= \left|\langle (I - \tilde{Q}_i \tilde{Q}_i^T)b,\, A^T u_j\rangle\right|
= \sigma_j \left|v_j^T (I - \tilde{Q}_i \tilde{Q}_i^T) b\right|
\le \sigma_j\, \|(I - \tilde{Q}_i \tilde{Q}_i^T) v_j\|\, \|b\| .
$$
    Note that $\tilde{M} = A^T A = V\Sigma^T\Sigma V^T$, hence $v_j$ is the $j$-th eigenvector of $\tilde{M}$. So we can also bound the term $\|(I - \tilde{Q}_i \tilde{Q}_i^T) v_j\|$ by a Chebyshev polynomial, yielding the result
$$
\left|\langle Ab - t_i,\, u_j\rangle\right| \le \sigma_j\, c_j\, \|b\|\, T_{i-j}^{-1}(\gamma_j).
$$
    This provides the convergence rate of the approximation vector $t_i$ to the vector $Ab$ in the direction $u_j$.
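    The two bounds can be checked qualitatively with a few lines of NumPy. The snippet below again reuses the `lanczos` helper and the synthetic A, b, and U from the sketch in the conclusions section (all assumed for illustration); it forms s_i and t_i for increasing i and prints the error components along the dominant direction u_1, which should shrink rapidly, as the bounds predict.

```python
import numpy as np

# Reuses lanczos, A, b, and U from the earlier sketch (illustrative assumptions).
Ab = A @ b
j = 0                                                      # track the dominant direction u_1
for i in (5, 10, 20, 40):
    Q  = lanczos(lambda x: A @ (A.T @ x), A.shape[0], i)   # Krylov basis for A A^T
    Qt = lanczos(lambda x: A.T @ (A @ x), A.shape[1], i)   # Krylov basis for A^T A
    s_i = Q @ (Q.T @ Ab)                                   # left-projection approximation
    t_i = A @ (Qt @ (Qt.T @ b))                            # right-projection approximation
    # Both error components should decrease quickly as i grows.
    print(i, abs(U[:, j] @ (Ab - s_i)), abs(U[:, j] @ (Ab - t_i)))
```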

    Fig. 8. (Face recognition) Performance tests on ORL, PIE and ExtYaleB: time (in seconds) versus k for the Lanczos and truncated SVD approaches. Panels: (a) ORL preprocessing time (6 train); (b) PIE preprocessing time (100 train); (c) ExtYaleB preprocessing time (40 train); (d) ORL average query time (6 train); (e) PIE average query time (100 train); (f) ExtYaleB average query time (40 train).

    REFERENCES

    [1] C. Eckart and G. Young, "The approximation of one matrix by another of lower rank," Psychometrika, vol. 1, no. 3, pp. 211–218, 1936.

    [2] G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins, 1996.

    [3] S. C. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. A. Harshman, "Indexing by latent semantic analysis," J. Am. Soc. Inf. Sci., vol. 41, no. 6, pp. 391–407, 1990.

    [4] M. W. Berry and M. Browne, Understanding Search Engines: Mathematical Modeling and Text Retrieval. SIAM, June 1999.

    [5] M. Turk and A. Pentland, "Face recognition using eigenfaces," in Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, 1991, pp. 586–591.

    [6] D. I. Witter and M. W. Berry, "Downdating the latent semantic indexing model for conceptual information retrieval," The Computer J., vol. 41, no. 8, pp. 589–601, 1998.

    [7] H. Zha and H. D. Simon, "On updating problems in latent semantic indexing," SIAM J. Sci. Comput., vol. 21, no. 2, pp. 782–791, 1999.

    [8] M. Brand, "Fast low-rank modifications of the thin singular value decomposition," Linear Algebra Appl., vol. 415, no. 1, pp. 20–30, 2006.

    [9] J. E. Tougas and R. J. Spiteri, "Updating the partial singular value decomposition in latent semantic indexing," Comput. Statist. Data Anal., vol. 52, no. 1, pp. 174–183, 2007.

    [10] E. Kokiopoulou and Y. Saad, "Polynomial filtering in latent semantic indexing for information retrieval," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2004, pp. 104–111.

    [11] J. Erhel, F. Guyomarc'h, and Y. Saad, "Least-squares polynomial filters for ill-conditioned linear systems," University of Minnesota Supercomputing Institute, Tech. Rep., 2001.

    [12] Y. Saad, "Filtered conjugate residual-type algorithms with applications," SIAM J. Matrix Anal. Appl., vol. 28, no. 3, pp. 845–870, August 2006.

    [13] M. W. Berry, "Large scale sparse singular value computations," International Journal of Supercomputer Applications, vol. 6, no. 1, pp. 13–49, 1992.

    [14] K. Blom and A. Ruhe, "A Krylov subspace method for information retrieval," SIAM J. Matrix Anal. Appl., vol. 26, no. 2, pp. 566–582, 2005.

    [15] G. Golub and W. Kahan, "Calculating the singular values and pseudo-inverse of a matrix," SIAM J. Numer. Anal., vol. 2, no. 2, pp. 205–224, 1965.

    [16] Y. Saad, Numerical Methods for Large Eigenvalue Problems. Halstead Press, New York, 1992.

    [17] B. N. Parlett, The Symmetric Eigenvalue Problem. Prentice-Hall, 1998.

    [18] R. M. Larsen, "Efficient algorithms for helioseismic inversion," Ph.D. dissertation, Dept. Computer Science, University of Aarhus, Denmark, October 1998.

    [19] H. D. Simon, "Analysis of the symmetric Lanczos algorithm with reorthogonalization methods," Linear Algebra Appl., vol. 61, pp. 101–131, 1984.

    [20] H. D. Simon, "The Lanczos algorithm with partial reorthogonalization," Mathematics of Computation, vol. 42, no. 165, pp. 115–142, 1984.

    [21] D. Zeimpekis and E. Gallopoulos, "TMG: A MATLAB toolbox for generating term-document matrices from text collections." Springer, Berlin, 2006, pp. 187–210.

    [22] M. W. Berry, "Large-scale sparse singular value computations," International Journal of Supercomputer Applications, vol. 6, no. 1, pp. 13–49, 1992.

    [23] F. Samaria and A. Harter, "Parameterisation of a stochastic model for human face identification," in 2nd IEEE Workshop on Applications of Computer Vision, 1994.

    [24] T. Sim, S. Baker, and M. Bsat, "The CMU pose, illumination, and expression database," IEEE Trans. Pattern Anal. Machine Intell., vol. 25, no. 12, pp. 1615–1618, 2003.

    [25] A. Georghiades, P. Belhumeur, and D. Kriegman, "From few to many: Illumination cone models for face recognition under variable lighting and pose," IEEE Trans. Pattern Anal. Machine Intell., vol. 23, no. 6, pp. 643–660, 2001.

    [26] K. Lee, J. Ho, and D. Kriegman, "Acquiring linear subspaces for face recognition under variable lighting," IEEE Trans. Pattern Anal. Machine Intell., vol. 27, no. 5, pp. 684–698, 2005.

    [27] Y. Saad, "On the rates of convergence of the Lanczos and the block-Lanczos methods," SIAM J. Numer. Anal., vol. 17, no. 5, pp. 687–706, October 1980.