NUMERICAL METHODS FOR SIMULTANEOUS DIAGONALIZATION*

ANGELIKA BUNSE-GERSTNER†, RALPH BYERS‡ AND VOLKER MEHRMANN§

Abstract. We present a Jacobi-like algorithm for simultaneous diagonalization of commuting pairs of complex normal matrices by unitary similarity transformations. The algorithm uses a sequence of similarity transformations by elementary complex rotations to drive the off-diagonal entries to zero. We show that its asymptotic convergence rate is quadratic and that it is numerically stable. It preserves the special structure of real matrices, quaternion matrices and real symmetric matrices.

Key words. Simultaneous Diagonalization, Jacobi Iteration, Eigenvalues, Eigenvectors, Structured Eigenvalue Problem

1. Introduction. Many of the algorithms outlined in [6] require the simultaneous diagonalization of commuting pairs of normal matrices by unitary similarity transformations. Often there are other structures in addition to normality. Examples include commuting pairs of real symmetric matrices, pairs of Hermitian matrices, real symmetric -- real skew symmetric pairs, and quaternion pairs. In this paper we point out some of the difficulties associated with simultaneous diagonalization and propose a family of Jacobi-like algorithms for simultaneously diagonalizing commuting normal matrices.

The term "simultaneous diagonalization" is sometimes used in the literature to denote the diagonalization of a definite matrix pencil by congruence, e.g., [43]. Here, we use the term in the classical sense of simultaneous similarity transformations.

To be viable for finite precision computation, a simultaneous diagonalization algorithm must work with both A and B simultaneously. To see how algorithms that violate this principle can fail, consider the family of diagonalize-one-then-diagonalize-the-other (DODO) methods suggested by the classic proof that commuting pairs of diagonalizable matrices can be simultaneously diagonalized [30, p. 404].
* SIAM Journal on Matrix Analysis and Applications, Volume 14, Number 4, 927-949, 1993.
† Fakultät für Mathematik, Universität Bielefeld, Postfach 8640, D-4800 Bielefeld 1. Partial support received from the Forschungsschwerpunkt Mathematisierung, Universität Bielefeld, research project: Steuerungsprobleme.
‡ Department of Mathematics, University of Kansas, Lawrence, Kansas 66045. Partial support received from the Forschungsschwerpunkt Mathematisierung, Universität Bielefeld, research project: Steuerungsprobleme, National Science Foundation grant CCR-8820882, KU International Travel Fund 560478, and KU GRF allocations #3758-20-0038 and #3692-20-0038.
§ Fakultät für Mathematik, Universität Bielefeld, Postfach 8640, D-4800 Bielefeld 1. Partial support received from the Forschungsschwerpunkt Mathematisierung, Universität Bielefeld, research project: Steuerungsprobleme.

If $A \in \mathbf{C}^{n \times n}$ and $B \in \mathbf{C}^{n \times n}$ form a pair of commuting normal matrices, the DODO approach uses a conventional algorithm to diagonalize A alone, then performs the same similarity transformation to B. So, for example, if A and B are Hermitian, then one might use CH from [39] to find a unitary matrix $U \in \mathbf{C}^{n \times n}$ and a diagonal matrix $D \in \mathbf{R}^{n \times n}$ such that $A = U^H D U$. (The superscript H denotes the Hermitian transpose.) Although CH does not produce the diagonal entries of D in any particular order, it is easy to order the eigenvalues (say) in decreasing algebraic order along the diagonal of D. Then $E := U B U^H$ is block diagonal with the order of the j-th diagonal block equal to the multiplicity of the j-th distinct eigenvalue of A. In particular, if A has distinct eigenvalues, then E is diagonal and the simultaneous diagonalization is complete. In
any case, a subsequent block diagonal similarity transformation can diagonalize E without disturbing D.

Rounding errors destroy this elegant approach. Suppose, for example, rounding errors perturb the commuting pair (A, B) to the nearly commuting pair

    \tilde{A} = \begin{bmatrix} 1-\epsilon & 0 & 0 & 0 \\ 0 & 1+\epsilon & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}
    \quad and \quad
    \tilde{B} = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 1-\epsilon & 0 \\ 0 & 0 & 0 & 1+\epsilon \end{bmatrix},

where $\epsilon$ is a small quantity that might be caused by rounding error. If $\epsilon \ne 0$, then $\tilde{A}$ and $\tilde{B}$ do not commute. However, if

    Q = \frac{\sqrt{2}}{2} \begin{bmatrix} 1 & 1 & 0 & 0 \\ -1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 0 & 0 & -1 & 1 \end{bmatrix},

then the off-diagonal elements of $Q^T \tilde{A} Q$ and $Q^T \tilde{B} Q$ are bounded by $\epsilon$.

For $\epsilon \ne 0, \pm 2$, $\tilde{A}$ has distinct eigenvalues, and its modal matrix of eigenvectors

    U = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \\ 0 & 0 & -\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} \end{bmatrix}

is independent of $\epsilon$ and unique up to column scaling. Similarly, for $\epsilon \ne 0, \pm 2$, $\tilde{B}$ has distinct eigenvalues and its modal matrix of eigenvectors

    V = \begin{bmatrix} \frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0 & 0 \\ -\frac{\sqrt{2}}{2} & \frac{\sqrt{2}}{2} & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}

is independent of $\epsilon$ and unique up to column scaling. Unfortunately, the off-diagonal entries of $U^T \tilde{B} U$ and $V^T \tilde{A} V$ are of the order of 1. The DODO method creates a cyclic sequence with period two.

The example suggests that a viable simultaneous diagonalization algorithm must "do something reasonable" when it is applied to a nearly commuting pair of matrices. Perhaps the most natural approach is to require the algorithm to choose a unitary similarity transformation $U \in \mathbf{C}^{n \times n}$ that minimizes some measure of how much $U^H A U$ and $U^H B U$ differ from being diagonal. This is the approach used in [21]. There, the algebraic eigenvalue problem of a single normal matrix is solved by simultaneously diagonalizing the Hermitian and skew-Hermitian parts. It is known to converge locally quadratically under the serial pivot sequence [36]. When applied to a normal
matrix, the norm-reducing method of [14] is also an algorithm that simultaneously diagonalizes the Hermitian and skew-Hermitian parts of a normal matrix. A variation of this method using nonunitary norm-reducing transformations is shown to be globally quadratically convergent for complex normal matrices in [17]. The simultaneous diagonalization algorithms presented below are adaptations and generalizations of [21] and are influenced by [14].

Interest in Jacobi algorithms declined when it was observed that classic and serial Jacobi algorithms for the symmetric eigenvalue problem do more arithmetic operations than more modern techniques [22, 33, 47]. However, with the advent of parallel and vector computers, interest in Jacobi's method has revived. Jacobi methods have high inherent parallelism which allows efficient implementations on certain parallel architectures [4, 5, 8, 9, 13, 15, 16, 28, 27, 37]. This has been demonstrated on parallel and vector machines in [2, 3, 15]. Parallel orderings, block Jacobi methods and other techniques create parallel versions of Jacobi's method. It is easy to see that these techniques also apply to the simultaneous diagonalization algorithms presented here. A short discussion of some special techniques for parallel computation along with a more extensive bibliography can be found in [22, §8.5].

Another virtue of the Jacobi method is its favorable rounding error properties. Improved error bounds for perturbed scaled diagonally dominant matrices [1, 10, 25] show that for this class of matrices small relative perturbations in the matrix entries cause only small relative perturbations in the eigenvalues.
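The fragility of the DODO strategy on the 4-by-4 example above is easy to reproduce numerically. The following sketch is ours, not part of the paper; it uses numpy's `eigh` in place of the CH routine of [39], and `off2` and `dodo` are hypothetical helper names:

```python
import numpy as np

def off2(A, B):
    """The measure off^2(A, B) of (2): total squared off-diagonal mass."""
    return (np.linalg.norm(A - np.diag(np.diag(A)), 'fro') ** 2
            + np.linalg.norm(B - np.diag(np.diag(B)), 'fro') ** 2)

def dodo(A, B):
    """Diagonalize-one-then-diagonalize-the-other for a Hermitian pair:
    diagonalize A alone, then apply the same unitary to B."""
    w, U = np.linalg.eigh(A)          # A = U diag(w) U^H
    return np.diag(w), U.conj().T @ B @ U

eps = 1e-8                            # plays the role of a rounding error
At = np.diag([1 - eps, 1 + eps, 0.0, 0.0]); At[2, 3] = At[3, 2] = 1.0
Bt = np.diag([0.0, 0.0, 1 - eps, 1 + eps]); Bt[0, 1] = Bt[1, 0] = 1.0

D, E = dodo(At, Bt)
# D is exactly diagonal, but E keeps off-diagonal entries of order 1:
print(off2(D, E))                     # approximately 2, not O(eps^2)
```

Swapping the roles of the two matrices gives the companion failure, so alternating DODO passes cycle with period two, as the example predicts.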
In some cases, Jacobi's method has better rounding error properties than the QR algorithm [11]. We conjecture that this very favorable error analysis carries over to the simultaneous diagonalization process as well.

We shall use the following notation.
- A superscript T denotes the transpose of a matrix.
- A superscript H denotes the Hermitian or complex conjugate transpose of a matrix.
- The vector 2-norm and its subordinate matrix norm, i.e., the spectral norm, is denoted by $||\cdot||_2$.
- The Frobenius norm or Euclidean norm is $||M||_F = \sqrt{\mathrm{trace}(M^H M)}$.
- The k-th column of the identity matrix is denoted by $e_k$.
- The smallest singular value of a matrix M is denoted by $\sigma_{\min}(M)$.

2. A Jacobi-like Algorithm. Following the approach of Jacobi [24], the simultaneous diagonalization algorithm consists of a sequence of similarity transformations by plane rotations. A plane rotation in the (i, j) plane is a unitary matrix $R = R(i, j, c, s)$ of the form

    R = R(i, j, c, s) = I + (c - 1) e_i e_i^T - \bar{s} e_i e_j^T + s e_j e_i^T + (\bar{c} - 1) e_j e_j^T    (1)

where $c, s \in \mathbf{C}$ satisfy $|c|^2 + |s|^2 = 1$. For computational convenience, we will often restrict c to be real and non-negative.

A natural measure of the distance of the pair $(A, B) \in \mathbf{C}^{n \times n} \times \mathbf{C}^{n \times n}$ from diagonality is

    off^2(A, B) = \sum_{i \ne j} |a_{ij}|^2 + \sum_{i \ne j} |b_{ij}|^2.    (2)

If $p, q, r, s \in \mathbf{C}$ and $ps - rq \ne 0$, then A and B are diagonal if and only if $(pA + qB)$ and $(rA + sB)$ are diagonal. Hence, for each ordered quadruple $(p, q, r, s) \in \mathbf{C}^4$ such
that $ps - rq \ne 0$, (2) gives rise to an alternate measure

    off^2_{p,q,r,s}(A, B) = off^2(pA + qB, rA + sB).    (3)

Scaling A and B as suggested in [21] is the special case q = r = 0.

In broad outline, the algorithm we present consists of a sequence of similarity transformations by plane rotations, each of which is chosen to minimize one of the measures (2) or (3). This is done as follows. Let $R = R(i, j, c, s) \in \mathbf{C}^{n \times n}$ be a plane rotation with $c \in \mathbf{R}$. The restriction that c be real does not change the amount by which off^2(A, B) can be reduced. A simple calculation shows that

    off^2(R A R^H, R B R^H) = off^2(A, B) - |a_{ij}|^2 - |b_{ij}|^2 - |a_{ji}|^2 - |b_{ji}|^2    (4)
        + |sc(\bar{a}_{ii} - \bar{a}_{jj}) + c^2 \bar{a}_{ij} - s^2 \bar{a}_{ji}|^2
        + |sc(a_{ii} - a_{jj}) - s^2 a_{ij} + c^2 a_{ji}|^2
        + |sc(\bar{b}_{ii} - \bar{b}_{jj}) + c^2 \bar{b}_{ij} - s^2 \bar{b}_{ji}|^2
        + |sc(b_{ii} - b_{jj}) - s^2 b_{ij} + c^2 b_{ji}|^2.

The choice of c and s that minimizes $off^2(R A R^H, R B R^H)$ is the choice that minimizes

    f_{ij}(c, s) = ||M_{ij} z||^2 :=
        \left\| \begin{bmatrix}
            \bar{a}_{ij} & (\bar{a}_{ii} - \bar{a}_{jj})/\sqrt{2} & -\bar{a}_{ji} \\
            a_{ji} & (a_{ii} - a_{jj})/\sqrt{2} & -a_{ij} \\
            \bar{b}_{ij} & (\bar{b}_{ii} - \bar{b}_{jj})/\sqrt{2} & -\bar{b}_{ji} \\
            b_{ji} & (b_{ii} - b_{jj})/\sqrt{2} & -b_{ij}
        \end{bmatrix}
        \begin{bmatrix} c^2 \\ \sqrt{2}\, c s \\ s^2 \end{bmatrix} \right\|^2.    (5)

This is a constrained minimization problem. The constraint $|c|^2 + |s|^2 = 1$ implies that $||z||_2 = 1$. So, for all such z,

    ||M_{ij} z|| \ge \min_{||x||_2 = 1} ||M_{ij} x|| = \sigma_{\min}(M_{ij}).

Parameterizing c and s as $c = \cos(\theta)$, $s = e^{i\phi}\sin(\theta)$ for $\theta, \phi \in \mathbf{R}$ makes the minimization of (5) into a two-real-variable optimization problem. The following lemma shows that only the values $\theta \in [-\pi/4, \pi/4]$ and $\phi \in [-\pi, \pi]$ need to be searched to minimize (5).

Lemma 2.1. All values of (5) occur with $c = \cos(\theta)$, $s = e^{i\phi}\sin(\theta)$ for some value of $(\theta, \phi) \in [-\pi/4, \pi/4] \times [-\pi, \pi]$.

Proof. Define $g_{ij}(\theta, \phi)$ by $g_{ij}(\theta, \phi) = f_{ij}(\cos(\theta), e^{i\phi}\sin(\theta))$. The functions $\cos^2(\theta)$, $\cos(\theta)\sin(\theta)$ and $\sin^2(\theta)$ are $\pi$-periodic. Thus, (5) implies that $g_{ij}(\theta, \phi)$ is $\pi$-periodic in $\theta$. The special structure of $M_{ij}$ in (5) along with the trivial observation $||e^{-2i\phi} M_{ij} z|| = ||M_{ij} z||$ shows that $g(\theta, \phi) = g(\theta + \pi/2, \phi) = g(-\theta - \pi/2, \phi)$.
Hence, for each fixed value of $\phi$, $g_{ij}(\theta, \phi)$ assumes all its values on $\theta \in [-\pi/4, \pi/4]$.

In the terminology of [40], an "inner rotation" is one for which $|c| \ge |s|$. An "outer rotation" is one for which $|c| < |s|$. Corresponding to each inner rotation that minimizes (5), there is an "outer" rotation that also minimizes (5). By choosing $\theta \in [-\pi/4, \pi/4]$, we have made the choice of using only inner rotations. Our proof of quadratic convergence depends upon them.

We have found no "simple" explicit formulae for the minimizers of (5). However, explicit formulae are known in some special cases. Goldstein and Horwitz [21] and Eberlein [13] give explicit formulae for the special case $A = A^H$ and $B = -B^H$. We give explicit formulae for other special cases in Section 6. General optimization
algorithms like those described in [19] provide effective ways to find minimizers $(\theta, \phi) \in [-\pi/4, \pi/4] \times [-\pi, \pi]$. In our MATLAB [29] implementation we have used the exact minimizer if an explicit formula for the exact minimum was available. Otherwise, we used a heuristic approximate minimizer described below.

The work needed to find a minimizer of (5) does not grow with n. The work required to perform a similarity transformation by a rotation is proportional to n. For large enough problems, the work of finding minimizers of (5) is negligible. However, general constrained optimization algorithms are relatively complicated. A computer implementation of Algorithm 1 (described below) would devote the majority of the code to the optimization problem. Fortunately, it is sufficient to approximate the minimizer. Consider the following strategy.

Let $c_1 \in \mathbf{R}$ and $s_1 \in \mathbf{C}$ be minimizers of

    g_{ij}(c_1, s_1) = |s_1 c_1 (\bar{a}_{ii} - \bar{a}_{jj}) + c_1^2 \bar{a}_{ij} - s_1^2 \bar{a}_{ji}|^2 + |s_1 c_1 (a_{ii} - a_{jj}) - s_1^2 a_{ij} + c_1^2 a_{ji}|^2,    (6)

and let $c_2 \in \mathbf{R}$ and $s_2 \in \mathbf{C}$ be minimizers of

    h_{ij}(c_2, s_2) = |s_2 c_2 (\bar{b}_{ii} - \bar{b}_{jj}) + c_2^2 \bar{b}_{ij} - s_2^2 \bar{b}_{ji}|^2 + |s_2 c_2 (b_{ii} - b_{jj}) - s_2^2 b_{ij} + c_2^2 b_{ji}|^2.    (7)

Explicit formulae for $(c_1, s_1)$ and $(c_2, s_2)$ appear in [13, 21]. Use as approximate minimizer the pair $(c, s) = (c_i, s_i)$, $i = 1, 2$, that yields the smaller value of (5). This strategy has worked well in practice, and the proof of local quadratic convergence presented in Section 3 goes through with little modification for this approximate minimizer.

The following algorithm summarizes the procedure for simultaneous diagonalization of a commuting pair of normal matrices.

Algorithm 1.
INPUT: $\epsilon > 0$; $A, B \in \mathbf{C}^{n \times n}$ such that $AB = BA$, $AA^H = A^H A$ and $BB^H = B^H B$.
OUTPUT: $Q \in \mathbf{C}^{n \times n}$ such that $off^2(Q^H A Q, Q^H B Q) \le \epsilon(||A||_F + ||B||_F)$ and $QQ^H = I$.
1. $Q \leftarrow I$
2. WHILE $off^2(A, B) > \epsilon(||A||_F + ||B||_F)$
3.     FOR $i = 1, 2, 3, \ldots, n$
4.         FOR $j = i+1, i+2, i+3, \ldots, n$
5.             Select $\theta \in [-\pi/4, \pi/4]$ and $\phi \in [-\pi, \pi]$ such that $c = \cos(\theta)$ and $s = e^{i\phi}\sin(\theta)$ minimizes (5). (Or approximately minimizes (5) as described in (6) and (7).)
6.
            $R \leftarrow R(i, j, c, s)$
7.             $Q \leftarrow QR$; $A \leftarrow R^H A R$; $B \leftarrow R^H B R$

If rotations are stored and applied in an efficient manner similar to that outlined for real rotations in [22, §5.1] or [33, §6.4], then each sweep (Step 2) uses approximately $8n^3$ complex flops to update A and B and approximately $2n^3$ complex flops to accumulate Q. A complex flop is the computational effort required to execute the Fortran statement

    A(I,J) = A(I,J) + S * A(K,J)    (8)

where A and S are of type COMPLEX.

Algorithm 1 needs storage for approximately $3n^2$ complex numbers. This can be shaved to $2n^2$ complex numbers if Q is not required.

As with the serial Jacobi algorithm, to promote rapid convergence in the case of multiple eigenvalues, it is a good idea to use a similarity transformation by a
permutation matrix to put the diagonal entries of A and B in lexicographic order [20]. Such an eigenvalue ordering is required by our proof of local quadratic convergence in Section 3.

In our experience, for randomly chosen examples with $n \le 80$, rarely does $\epsilon = 10^{-14}$ make Algorithm 1 require more than six sweeps. In Section 3 we show that Algorithm 1 has local quadratic convergence. There are, however, examples for which Algorithm 1 does not converge. For example, if $A \in \mathbf{R}^{n \times n}$ is given by

    a_{kj} = \begin{cases} \cos((j + k)\pi/n) & \text{if } j \ne k \\ \frac{2-n}{2}\cos(2k\pi/n) & \text{if } j = k \end{cases}    (9)

and $B \in \mathbf{R}^{n \times n}$ is given by

    b_{kj} = \begin{cases} \sin((j + k)\pi/n) & \text{if } j \ne k \\ \frac{2-n}{2}\sin(2k\pi/n) & \text{if } j = k \end{cases}    (10)

and $n > 6$, then it is easy to verify through Theorem 6.1 that no rotation reduces off^2(A, B), so Algorithm 1 with exact minimization in Step 5 leaves A and B invariant. This example is essentially due to Voevodin [45].

Usually in practice, rounding errors perturb (9) and (10) sufficiently to allow Algorithm 1 (with exact minimization in Step 5) to reduce off^2(A, B) and ultimately to converge to a simultaneously diagonal form, albeit rather slowly. For example, an early version of our experimental implementation on a Sun 4/60 (with unit round approximately $10^{-16}$) applied to the $n = 20$ case needed 47 sweeps to reduce off^2(A, B) to approximately $10^{-14}(||A|| + ||B||)$. Rounding errors did not perturb the $n = 30$ case enough to allow convergence.

Example (9) and (10) is not held fixed by the heuristic minimization strategy, but it does sometimes require a few more sweeps than a random example. Algorithm 1 with the approximate minimization heuristic in Step 5 needed from 9 to 10 sweeps to reduce off^2(A, B) to less than $10^{-14}(||A|| + ||B||)$ for $n = 10$, $n = 15$, $n = 20$, $n = 25$, and $n = 30$.

Scaling one of the two commuting matrices, by (say) replacing A by A/2 as suggested in [21], is a more reliable way to "break away" from a fixed point like (9) and (10). This is equivalent to using $off^2_{1,0,0,2}$ in place of $off^2$.
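The scaled measures (3) that drive this "break away" trick are one-liners to prototype. The sketch below is ours, with `off2` and `off2_pqrs` as hypothetical helper names:

```python
import numpy as np

def off2(A, B):
    """off^2(A, B) of (2)."""
    return (np.linalg.norm(A - np.diag(np.diag(A)), 'fro') ** 2
            + np.linalg.norm(B - np.diag(np.diag(B)), 'fro') ** 2)

def off2_pqrs(A, B, p, q, r, s):
    """The alternate measure (3): off^2(pA + qB, rA + sB).  The condition
    ps - rq != 0 keeps diagonality of the combinations equivalent to
    diagonality of A and B."""
    assert p * s - r * q != 0
    return off2(p * A + q * B, r * A + s * B)

# The stagnation fix corresponds to off^2_{1,0,0,2}, i.e. off^2(A, 2B):
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
B = rng.standard_normal((4, 4))
print(off2_pqrs(A, B, 1, 0, 0, 2) == off2(A, 2 * B))   # True
```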
Motivated in part by the "exceptional shift" strategy often used with the QR algorithm, we changed our experimental code to use $off^2_{1,0,0,2}$ for one sweep whenever off^2(A, B) declines by less than 1% across a sweep. With this modification, only 6 to 7 sweeps were required to reduce off^2(A, B) to less than $10^{-14}(||A|| + ||B||)$ for the $n = 10$, $n = 15$, $n = 20$, $n = 25$, and $n = 30$ cases of (9) and (10). (The choice of 1% is ad hoc. A more cautious approach would require a greater per-sweep reduction in off^2(A, B) and would try other scaling factors if 1/2 doesn't work.)

We have not been able to show that Algorithm 1 with the above modification is globally convergent.

3. Convergence Properties. Algorithm 1 shares many of the desirable properties of algorithms related to the serial Jacobi algorithm for the real symmetric eigenvalue problem [20, 23, 36, 38, 41, 46]. In our experience, Algorithm 1, with the above strategy for avoiding stagnation, converges globally and ultimately quadratically. In this section we establish the local quadratic convergence and numerical stability of Algorithm 1. In parenthetical remarks we give a rough sketch of how the argument can be modified to show the local convergence of the heuristic variation (6), (7).
Unfortunately, we do not have a proof of global convergence. Global convergence for special cases is shown in [21] and, using a modified norm-reducing algorithm, in [17]. The latter, however, uses nonunitary transformations. In our case the use of unitary similarity transformations confers satisfactory numerical stability.

3.1. Local Quadratic Convergence. Here we follow [20, 36, 46] to show that ultimately Algorithm 1 converges quadratically. In particular, the resemblance to [36] is unmistakable.

Let the set of eigenvalues of A be $\{\lambda_1, \lambda_2, \lambda_3, \ldots, \lambda_n\}$ and let the set of eigenvalues of B be $\{\mu_1, \mu_2, \ldots, \mu_n\}$. Set

    \delta = \frac{1}{2} \min\left( \min_{\lambda_i \ne \lambda_j} |\lambda_i - \lambda_j|, \; \min_{\mu_i \ne \mu_j} |\mu_i - \mu_j| \right).    (11)

Denote the k-th rotation of Algorithm 1 by $R^{(k)} = R(i, j, c^{(k)}, s^{(k)})$, and let $A^{(k)}$ and $B^{(k)}$ be the values of A and B immediately before the k-th rotation. Following common practice, we call $(a^{(k)}_{ij}, a^{(k)}_{ji}, b^{(k)}_{ij}, b^{(k)}_{ji})$ the "k-th pivots", because the rotation $R^{(k)} = R(i, j, c^{(k)}, s^{(k)})$ is chosen to minimize $|a^{(k+1)}_{ij}|^2 + |a^{(k+1)}_{ji}|^2 + |b^{(k+1)}_{ij}|^2 + |b^{(k+1)}_{ji}|^2$.

Define $\epsilon_k$ by $\epsilon_k = \sqrt{off^2(A^{(k)}, B^{(k)})}$. Partition $A^{(k)}$ and $B^{(k)}$ as

    A^{(k)} = D_A^{(k)} + E_A^{(k)}  \quad and \quad  B^{(k)} = D_B^{(k)} + E_B^{(k)}

where $D_A^{(k)}, D_B^{(k)} \in \mathbf{C}^{n \times n}$ are diagonal and $E_A^{(k)}, E_B^{(k)} \in \mathbf{C}^{n \times n}$ have zero diagonal. Note that $\epsilon_k = \sqrt{off^2(A^{(k)}, B^{(k)})} = ||[E_A^{(k)}, E_B^{(k)}]||_F$.

Suppose Algorithm 1 has converged to the point that

    \epsilon_k = \sqrt{off^2(A^{(k)}, B^{(k)})} < \frac{1}{2}\delta.    (12)

Using a permutation similarity, we may order the eigenvalues $\lambda_i$ and $\mu_i$ so that

    |a^{(k)}_{ii} - \lambda_i| \le ||E_A^{(k)}||_F \le \epsilon_k < \frac{1}{2}\delta    (13)

and

    |b^{(k)}_{ii} - \mu_i| \le ||E_B^{(k)}||_F \le \epsilon_k < \frac{1}{2}\delta.    (14)

Because Algorithm 1 uses inner rotations and $off^2(A^{(k)}, B^{(k)})$ is monotonically decreasing, (13) and (14) hold throughout the remainder of Algorithm 1 [20, 36, 46]. In particular, the order in which the eigenvalues will eventually appear along the diagonals of A and B is fixed.

(The heuristic (6) and (7) does not necessarily make $off^2(A^{(k)}, B^{(k)})$ decrease monotonically. However, it can be shown that $off^2(A^{(k)}, B^{(k)})$ does not increase by
more than $O(\epsilon^2)$. Thus, under a slightly stronger hypothesis, (13) and (14) continue to hold.)

Each diagonal entry $a^{(k)}_{ii}$ is said to be "affiliated" with $\lambda_i$, and each diagonal entry $b^{(k)}_{ii}$ is "affiliated" with $\mu_i$. As Algorithm 1 drives $off^2(A^{(k)}, B^{(k)})$ to zero, $\lim_{k \to \infty} a^{(k)}_{ii} = \lambda_i$ and $\lim_{k \to \infty} b^{(k)}_{ii} = \mu_i$. The affiliation of a diagonal entry does not change during subsequent steps of the algorithm [20, 36, 46].

Inequality (13) implies that if $\lambda_i \ne \lambda_j$, then

    |a^{(k)}_{ii} - a^{(k)}_{jj}| > \delta.    (15)

Inequality (14) implies that if $\mu_i \ne \mu_j$, then

    |b^{(k)}_{ii} - b^{(k)}_{jj}| > \delta.    (16)

An eigenvalue pair $(\lambda, \mu)$ has multiplicity m if there are m distinct integers $i_j$, $j = 1, 2, 3, \ldots, m$, for which $(\lambda, \mu) = (\lambda_{i_j}, \mu_{i_j})$. Using a permutation similarity, we may order the diagonal entries so that affiliates of an m-fold multiple eigenvalue pair appear in m adjacent diagonal entries.

We will show the following theorem.

Theorem 3.1. Suppose that the h-th rotation starts a sweep of Algorithm 1, i.e., $a^{(h)}_{12}$, $a^{(h)}_{21}$, $b^{(h)}_{12}$, $b^{(h)}_{21}$ are the h-th pivots. If (12) holds and affiliates of m-fold multiple eigenvalue pairs appear in adjacent diagonal entries, then at the end of the sweep

    \epsilon_{h + n(n-1)/2} \le \frac{2n(9n - 13)\epsilon_h^2}{\delta}.

We will need several lemmas to prove Theorem 3.1. The proof has two parts. The first part establishes that

    \sqrt{|a^{(k+1)}_{ij}|^2 + |a^{(k+1)}_{ji}|^2 + |b^{(k+1)}_{ij}|^2 + |b^{(k+1)}_{ji}|^2} \le O(\epsilon_k^2).    (17)

In the second part, we will show that subsequent rotations "preserve" (17) for subsequent values of k.

Lemmas similar to the following are well-known tools in the study of quadratic convergence of cyclic Jacobi algorithms [36, 41, 48].

Lemma 3.2. If (12) holds, $(\lambda_i, \mu_i) = (\lambda_j, \mu_j)$ and $i \ne j$, then

    \sqrt{|a^{(k+1)}_{ij}|^2 + |a^{(k+1)}_{ji}|^2 + |b^{(k+1)}_{ij}|^2 + |b^{(k+1)}_{ji}|^2} \le \frac{\sqrt{2}\,\epsilon_k^2}{2\delta}.

Proof. Regardless of the particular choice of $R^{(k)} = R(i, j, c^{(k)}, s^{(k)})$, the partitioning lemma [36, §2] implies that

    \sqrt{|a^{(k+1)}_{ij}|^2 + |a^{(k+1)}_{ji}|^2} \le \frac{\epsilon_k^2}{2\delta}    (18)

and

    \sqrt{|b^{(k+1)}_{ij}|^2 + |b^{(k+1)}_{ji}|^2} \le \frac{\epsilon_k^2}{2\delta}.    (19)
The next lemma covers the case $(\lambda_i, \mu_i) \ne (\lambda_j, \mu_j)$.

Lemma 3.3. Suppose that $a^{(k)}_{ij}$, $a^{(k)}_{ji}$, $b^{(k)}_{ij}$, $b^{(k)}_{ji}$ are the k-th pivots. If (12) holds and $(\lambda_i, \mu_i) \ne (\lambda_j, \mu_j)$, then

    \sqrt{|a^{(k+1)}_{ij}|^2 + |a^{(k+1)}_{ji}|^2 + |b^{(k+1)}_{ij}|^2 + |b^{(k+1)}_{ji}|^2} \le \frac{5\epsilon_k^2}{\delta}.

Proof. Without loss of generality, assume that

    |\lambda_i - \lambda_j| \ge |\mu_i - \mu_j|.    (20)

In particular, (20) implies $\lambda_i \ne \lambda_j$. In [36, §3], it is shown that there is a rotation $R = R(i, j, c_A, s_A)$ such that for $\hat{A} = R^H A^{(k)} R$,

    \sqrt{|\hat{a}_{ij}|^2 + |\hat{a}_{ji}|^2} \le \frac{\epsilon_k^2}{2\delta}.    (21)

In fact, this rotation minimizes the left-hand side of (21) (and (6)), so

    ||E_{\hat{A}}||_F \le ||E_A^{(k+1)}||_F.    (22)

Set $\hat{B} = R^H B^{(k)} R$. Recall that $\hat{A}$ and $\hat{B}$ commute. The (i, j)-th entry of the equation $\hat{A}\hat{B} - \hat{B}\hat{A} = 0$ is

    \hat{b}_{ij}(\hat{a}_{ii} - \hat{a}_{jj}) - \hat{a}_{ij}(\hat{b}_{ii} - \hat{b}_{jj}) + \sum_{k \ne i,\, k \ne j} \left( \hat{a}_{ik}\hat{b}_{kj} - \hat{b}_{ik}\hat{a}_{kj} \right) = 0.    (23)

Applying Hölder's inequality [35] to the sum and using (21) to bound the second term on the left-hand side of (23) gives

    |\hat{b}_{ij}(\hat{a}_{ii} - \hat{a}_{jj})| \le \epsilon_k^2 + \frac{\epsilon_k^2}{2\delta}|\hat{b}_{ii} - \hat{b}_{jj}|.    (24)

Inequality (22) implies that (13) and (15) hold for $\hat{A}$, and

    |\hat{b}_{ii} - \hat{b}_{jj}| \le |b^{(k)}_{ii} - b^{(k)}_{jj}| + 2|b^{(k)}_{ij}| + 2|b^{(k)}_{ji}| \le |b^{(k)}_{ii} - b^{(k)}_{jj}| + 4\epsilon_k \le |b^{(k)}_{ii} - b^{(k)}_{jj}| + 2\delta.    (25)

Inequalities (15) and (24) imply

    |\hat{b}_{ij}| \le \frac{\epsilon_k^2}{2\delta}\left( 2 + \left|\frac{\hat{b}_{ii} - \hat{b}_{jj}}{\hat{a}_{ii} - \hat{a}_{jj}}\right| \right).

To bound the right-hand side, apply (25) to get

    \left|\frac{\hat{b}_{ii} - \hat{b}_{jj}}{\hat{a}_{ii} - \hat{a}_{jj}}\right| \le \frac{|b^{(k)}_{ii} - b^{(k)}_{jj}| + 2\delta}{|\hat{a}_{ii} - \hat{a}_{jj}|}.
The triangle inequality applied to $|(b^{(k)}_{ii} - \mu_i) + (\mu_i - \mu_j) + (\mu_j - b^{(k)}_{jj})|$ and $|(\lambda_i - \lambda_j) + (\lambda_i - \hat{a}_{ii}) + (\lambda_j - \hat{a}_{jj})|$ gives

    \left|\frac{\hat{b}_{ii} - \hat{b}_{jj}}{\hat{a}_{ii} - \hat{a}_{jj}}\right| \le \frac{|b^{(k)}_{ii} - \mu_i| + |\mu_i - \mu_j| + |\mu_j - b^{(k)}_{jj}| + 2\delta}{|\lambda_i - \lambda_j| - |\lambda_i - \hat{a}_{ii}| - |\lambda_j - \hat{a}_{jj}|}.

Inequalities (13) and (14) imply

    \left|\frac{\hat{b}_{ii} - \hat{b}_{jj}}{\hat{a}_{ii} - \hat{a}_{jj}}\right| \le \frac{|\mu_i - \mu_j| + 3\delta}{|\lambda_i - \lambda_j| - \delta}.

Multiply and divide the right-hand side by $|\lambda_i - \lambda_j|^{-1}$ and apply (20), (13) and (14) to get

    \left|\frac{\hat{b}_{ii} - \hat{b}_{jj}}{\hat{a}_{ii} - \hat{a}_{jj}}\right| \le \frac{\frac{|\mu_i - \mu_j|}{|\lambda_i - \lambda_j|} + \frac{3\delta}{|\lambda_i - \lambda_j|}}{1 - \frac{\delta}{|\lambda_i - \lambda_j|}} \le \frac{1 + 3/2}{1 - 1/2} \le 5.

It now follows that $|\hat{b}_{ij}| \le \frac{7\epsilon_k^2}{2\delta}$. Similarly, it can be shown that $|\hat{b}_{ji}| \le \frac{7\epsilon_k^2}{2\delta}$. Therefore,

    \sqrt{|\hat{a}_{ij}|^2 + |\hat{a}_{ji}|^2 + |\hat{b}_{ij}|^2 + |\hat{b}_{ji}|^2} \le \frac{\sqrt{99}\,\epsilon_k^2}{2\delta} \le \frac{5\epsilon_k^2}{\delta}.    (26)

Now, $R^{(k)}$ is chosen to minimize $|a^{(k+1)}_{ij}|^2 + |a^{(k+1)}_{ji}|^2 + |b^{(k+1)}_{ij}|^2 + |b^{(k+1)}_{ji}|^2$, so inequality (26) gives an upper bound on the minimum value. (If $R^{(k)}$ is chosen using the heuristic (6), (7), then (26) bounds one of the two choices of $off^2(A^{(k+1)}, B^{(k+1)})$ over which the heuristic minimizes. Hence, inequality (26) also gives an upper bound on the minimum value for this case.)

Lemmas 3.2 and 3.3 imply (17). It remains to show that subsequent rotations preserve (17). The following lemma gives a bound on the angles of rotation that occur in Algorithm 1.

Lemma 3.4. If (12) holds and $(\lambda_i, \mu_i) \ne (\lambda_j, \mu_j)$, then

    |s^{(k)}| \le \sqrt{2}\left(1 + \frac{5\epsilon_k}{\delta}\right)\frac{\epsilon_k}{\delta} \le \frac{9\epsilon_k}{\delta}.    (27)

Proof. For ease of notation, set $(c, s) = (c^{(k)}, s^{(k)})$. Without loss of generality, we may assume that $\lambda_i \ne \lambda_j$. Lemma 3.3 implies that

    |a^{(k+1)}_{ij}| = |sc(a^{(k)}_{ii} - a^{(k)}_{jj}) - s^2 a^{(k)}_{ji} + c^2 a^{(k)}_{ij}| \le \frac{5\epsilon_k^2}{\delta}.

The triangle inequality gives

    |sc(a^{(k)}_{ii} - a^{(k)}_{jj})| \le \frac{5\epsilon_k^2}{\delta} + |s^2 a^{(k)}_{ji} + c^2 a^{(k)}_{ij}|    (28)
        \le \frac{5\epsilon_k^2}{\delta} + |s^2 a^{(k)}_{ji}| + |c^2 a^{(k)}_{ij}|
        \le \frac{5\epsilon_k^2}{\delta} + \epsilon_k.

Algorithm 1 chooses $c = \cos(\theta)$ for some $\theta \in [-\pi/4, \pi/4]$, so $|c| \ge \frac{\sqrt{2}}{2}$. Thus, (15) and (28) together imply the left-hand inequality of (27). The right-hand inequality follows from the left-hand inequality and (12).
Proof of Theorem 3.1. Fix a particular choice of (i, j) and consider $a^{(k)}_{ij}$ during a sweep of Algorithm 1 as k varies from $k = h$ through $k = h + n(n-1)/2$.

Note that $\epsilon_k$ is monotonically decreasing in k, so if $(\lambda_i, \mu_i) = (\lambda_j, \mu_j)$, then Lemma 3.2 implies that, for $k = h, h+1, h+2, \ldots, h + n(n-1)/2$, $|a^{(k)}_{ij}| \le \frac{\sqrt{2}\,\epsilon_k^2}{2\delta} \le \frac{\sqrt{2}\,\epsilon_h^2}{2\delta}$. (As noted in the proof of Lemma 3.2, a slightly weaker inequality holds for the heuristic (6), (7).)

Suppose $(\lambda_i, \mu_i) \ne (\lambda_j, \mu_j)$ and the k-th pivots are $a^{(k)}_{ij}$, $a^{(k)}_{ji}$, $b^{(k)}_{ij}$ and $b^{(k)}_{ji}$. Fix i and j. Lemma 3.3 implies that $|a^{(k+1)}_{ij}| \le \frac{5\epsilon_k^2}{\delta}$. Affiliates of multiple eigenvalue pairs appear in adjacent diagonal entries, so subsequent pivots in row i satisfy the hypothesis of Lemma 3.4. If $j < n$ and (for ease of notation) $(c, s) = (c^{(k+1)}, s^{(k+1)})$, then

    |a^{(k+2)}_{ij}| = |c\, a^{(k+1)}_{ij} + \bar{s}\, a^{(k+1)}_{j+1,j}|
        \le |a^{(k+1)}_{ij}| + |\bar{s}|\,|a^{(k+1)}_{j+1,j}|
        \le \frac{5\epsilon_k^2}{\delta} + \frac{9\epsilon_{k+1}^2}{\delta}
        \le \frac{14\epsilon_k^2}{\delta} \le \frac{14\epsilon_h^2}{\delta}.

Similarly, each subsequent pivot in row i may increase $|a^{(k)}_{ij}|$ by at most $\frac{9\epsilon_h^2}{\delta}$.

Suppose that the last pivot in row i is the q-th pivot. There are at most $n - i - 1 \le n - 2$ pivots in row i subsequent to the k-th pivot. Hence, for $j \ne i$, $|a^{(q)}_{ij}| \le (5 + 9(n-2))\frac{\epsilon_h^2}{\delta}$. We may bound the q-th off-diagonal row sum by

    r_i^q := \sum_{j = 1,\, j \ne i}^{n} |a^{(q)}_{ij}|^2 \le (n-1)(5 + 9(n-2))^2 \frac{\epsilon_h^4}{\delta^2}.

Pivoting in rows other than row i leaves the off-diagonal row sum invariant, so for $q < k < h + n(n-1)/2$, $r_i^k = r_i^q \le (n-1)(9n-13)^2 \frac{\epsilon_h^4}{\delta^2}$.

The same argument applied to B yields the identical bound for the off-diagonal row sums of B. Adding the bounds of the off-diagonal row sums of both A and B at the end of the sweep, we get

    \epsilon_{h + n(n-1)/2} \le \sqrt{2n(n-1)}\,(9n - 13)\frac{\epsilon_h^2}{\delta} \le 2n(9n - 13)\frac{\epsilon_h^2}{\delta}.

It is clear from the derivation that the bound of Theorem 3.1 is not tight.

The theorem assumes that affiliates of multiple eigenvalue pairs appear in adjacent positions along the diagonal. This assumption guarantees that large angles of rotation do not cause "fill-in" of small off-diagonal entries.
As suggested in [36], a threshold strategy would have much the same effect.

3.2. Rounding Errors. Algorithm 1 uses only unitary similarity transformations. A classical rounding error analysis of the construction and application of real
rotations appears in [47, p. 131ff]. It is easily extended to the complex case. Applying it to Algorithm 1 results in the following theorem.

Theorem 3.5. Suppose the rotations in Algorithm 1 are constructed using methods similar to those described in [22, §5.1] or [33, §6.4]. If $A, B \in \mathbf{C}^{n \times n}$ are a commuting normal pair input to Algorithm 1 and $Q^{(k)}, A^{(k)}, B^{(k)} \in \mathbf{C}^{n \times n}$ are the computed versions of Q, A and B after k sweeps, then there are matrices $F_A, F_B \in \mathbf{C}^{n \times n}$ such that

    (A + F_A) Q^{(k)} = Q^{(k)} A^{(k)},
    (B + F_B) Q^{(k)} = Q^{(k)} B^{(k)}

and

    ||F_A||_F + ||F_B||_F + ||Q^{(k)H} Q^{(k)} - I||_F \le p(n)\, u

where u is the unit round (machine precision) and p(n) is a modest polynomial that depends on the details of the arithmetic. Thus, the effects of rounding errors are equivalent to making small normwise perturbations in the original data matrices A and B. Algorithm 1 is backward numerically stable.

4. Real Matrices. If Algorithm 1 is applied to a pair of real, commuting normal matrices A and B, then it will use nontrivial complex arithmetic on what is essentially a real problem. Thus, storage must be set aside for two complex arrays instead of two real arrays, and complex arithmetic must be used instead of real arithmetic. Complex rounding errors may perturb the eigenvalues of A and B so that they do not appear in complex conjugate pairs. This section presents a modification of Algorithm 1 that avoids complex arithmetic.

Of course, there may be no real similarity transformation that simultaneously diagonalizes A and B. However, the following theorem shows that there is a real similarity transformation that block diagonalizes A and B with 1-by-1 and 2-by-2 blocks.

Theorem 4.1. If $A, B \in \mathbf{R}^{n \times n}$, $A^T A = A A^T$, $B^T B = B B^T$, and $AB = BA$, then there exists $Q \in \mathbf{R}^{n \times n}$ such that $Q^T Q = I$, $D_A = Q^T A Q$ is block diagonal and $D_B = Q^T B Q$ is block diagonal with 2-by-2 blocks and at most one trailing 1-by-1 block.

Proof.
If $n = 1$ or $n = 2$, then the theorem holds with $Q = [1] \in \mathbf{R}^{1 \times 1}$ or $Q = I \in \mathbf{R}^{2 \times 2}$.

Assume the induction hypothesis that the theorem holds for all real, normal, k-by-k commuting pairs A and B for $k < n$ and $3 \le n$. Let $A \in \mathbf{R}^{n \times n}$ and $B \in \mathbf{R}^{n \times n}$ be commuting normal matrices, and let $x \in \mathbf{C}^n$ be a simultaneous eigenvector of A and B with eigenvalues $\lambda_A$ and $\lambda_B$ respectively. There are two cases to consider: (i) x is linearly independent of $\bar{x}$ and (ii) x and $\bar{x}$ are linearly dependent.

If x and $\bar{x}$ are linearly independent, then $A\bar{x} = \bar{\lambda}_A \bar{x}$ and $B\bar{x} = \bar{\lambda}_B \bar{x}$. Thus, x and $\bar{x}$ are simultaneous, linearly independent eigenvectors of A and B. The vectors $y = x + \bar{x}$ and $z = i(x - \bar{x})$ span a two-dimensional, real, invariant subspace of A and B. Let $U \in \mathbf{R}^{n \times n}$ be an orthogonal matrix whose first two columns form an orthonormal basis of span(y, z) and set $\tilde{A} = U^T A U$ and $\tilde{B} = U^T B U$. Thus, $\tilde{A} = \mathrm{diag}(A_{11}, A_{22})$, $\tilde{B} = \mathrm{diag}(B_{11}, B_{22})$ where $A_{11}, B_{11} \in \mathbf{R}^{2 \times 2}$ and $A_{22}, B_{22} \in \mathbf{R}^{(n-2) \times (n-2)}$. Observe that $\tilde{A}$ and $\tilde{B}$ are real, normal, commuting matrices and hence $A_{22}$ and $B_{22}$ are
commuting, normal (n-2)-by-(n-2) matrices. By the induction hypothesis, there exists an orthogonal matrix $V_{22} \in \mathbf{R}^{(n-2) \times (n-2)}$ such that $V_{22}^T A_{22} V_{22}$ and $V_{22}^T B_{22} V_{22}$ are block diagonal with 2-by-2 blocks and at most one trailing 1-by-1 block. If $V_{11}$ is the 2-by-2 identity matrix and $V = \mathrm{diag}(V_{11}, V_{22})$, then the matrix $Q = UV$ satisfies the conclusion of the theorem.

If x and $\bar{x}$ are linearly dependent, then $\lambda_A \in \mathbf{R}$, $\lambda_B \in \mathbf{R}$ and x is a scalar multiple of some real, simultaneous eigenvector $y \in \mathbf{R}^n$. Without loss of generality we may choose y such that $||y||_2 = 1$. Let $U \in \mathbf{R}^{n \times n}$ be an orthogonal matrix whose last column is y, and set $\tilde{A} = U^T A U$ and $\tilde{B} = U^T B U$. Observe that $\tilde{A}$ and $\tilde{B}$ are real, normal, commuting matrices and that $\tilde{A} = \mathrm{diag}(A_{11}, a_{nn})$, $\tilde{B} = \mathrm{diag}(B_{11}, b_{nn})$ where $a_{nn} = \lambda_A$, $b_{nn} = \lambda_B$ and $A_{11}, B_{11} \in \mathbf{R}^{(n-1) \times (n-1)}$. By the induction hypothesis, there exists an orthogonal matrix $V_{11} \in \mathbf{R}^{(n-1) \times (n-1)}$ such that $V_{11}^T A_{11} V_{11}$ and $V_{11}^T B_{11} V_{11}$ are block diagonal with 2-by-2 blocks and at most one trailing 1-by-1 block. If $V = \mathrm{diag}(V_{11}, [1])$, then the matrix $Q = UV$ satisfies the conclusion of the theorem. Note at this point that if n is even, it is necessary to logically combine the two trailing 1-by-1 blocks into a single trailing 2-by-2 block.

Partition each matrix $M \in \mathbf{R}^{n \times n}$ into a k-by-k block matrix as

    M = \begin{bmatrix}
        M_{11} & M_{12} & M_{13} & \cdots & M_{1k} \\
        M_{21} & M_{22} & M_{23} & \cdots & M_{2k} \\
        M_{31} & M_{32} & M_{33} & \cdots & M_{3k} \\
        \vdots & \vdots & \vdots &        & \vdots \\
        M_{k1} & M_{k2} & M_{k3} & \cdots & M_{kk}
    \end{bmatrix}    (29)

where $k = \lfloor (n+1)/2 \rfloor$ and, for $i, j \le \lfloor n/2 \rfloor$, $M_{ij} \in \mathbf{R}^{2 \times 2}$. If n is odd, then for $j \le k-1$, $M_{kj} \in \mathbf{R}^{1 \times 2}$; for $i \le k-1$, $M_{ik} \in \mathbf{R}^{2 \times 1}$; and $M_{kk} \in \mathbf{R}^{1 \times 1}$. Theorem 4.1 states that there is a real orthogonal similarity transformation that simultaneously block diagonalizes A and B conformally with (29).

The next algorithm makes extensive use of elementary orthogonal matrices $Z = Z(i, j, U) \in \mathbf{R}^{n \times n}$ that are partitioned conformally with (29). Define the block rotation $Z(i, j, U)$ to agree with the identity except in the blocks $Z_{ii}$, $Z_{ij}$, $Z_{ji}$ and $Z_{jj}$:

    Z(i, j, U) = \begin{bmatrix}
        I &        &   &        &   \\
          & Z_{ii} &   & Z_{ij} &   \\
          &        & I &        &   \\
          & Z_{ji} &   & Z_{jj} &   \\
          &        &   &        & I
    \end{bmatrix}.    (30)

Here,

    \begin{bmatrix} Z_{ii} & Z_{ij} \\ Z_{ji} & Z_{jj} \end{bmatrix} =: U = \begin{bmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{bmatrix}

is an orthogonal matrix. If $j < k$ or if $n$ is even, then $U \in \mathbb{R}^{4\times 4}$. Otherwise, $U \in \mathbb{R}^{3\times 3}$. A similar definition of "blockwise rotation" appears in [18]. The Jacobi annihilators of [31] and the orthogonal matrices proposed in [42] are block rotations.

We will use block rotations $Z(i,j,U)$ where $U$ has the form
\[
U = R(2,4,c_4,s_4)\,R(1,4,c_3,s_3)\,R(2,3,c_2,s_2)\,R(1,3,c_1,s_1) \tag{31}
\]
\[
= \begin{bmatrix} 1&0&0&0 \\ 0&c_4&0&-s_4 \\ 0&0&1&0 \\ 0&s_4&0&c_4 \end{bmatrix}
\begin{bmatrix} c_3&0&0&-s_3 \\ 0&1&0&0 \\ 0&0&1&0 \\ s_3&0&0&c_3 \end{bmatrix}
\begin{bmatrix} 1&0&0&0 \\ 0&c_2&-s_2&0 \\ 0&s_2&c_2&0 \\ 0&0&0&1 \end{bmatrix}
\begin{bmatrix} c_1&0&-s_1&0 \\ 0&1&0&0 \\ s_1&0&c_1&0 \\ 0&0&0&1 \end{bmatrix}
\]
or
\[
U = R(2,3,c_2,s_2)\,R(1,3,c_1,s_1)
= \begin{bmatrix} 1&0&0 \\ 0&c_2&-s_2 \\ 0&s_2&c_2 \end{bmatrix}
\begin{bmatrix} c_1&0&-s_1 \\ 0&1&0 \\ s_1&0&c_1 \end{bmatrix} \tag{32}
\]
where for $l = 1, 2, 3, 4$, $c_l, s_l \in \mathbb{R}$ and $c_l^2 + s_l^2 = 1$. It is convenient to parameterize $c_l$ and $s_l$ in (31) and (32) as $c_l = \cos(\theta_l)$ and $s_l = \sin(\theta_l)$ where $\theta_l \in \mathbb{R}$. Note that the Jacobi annihilators of [31] use a different order of the factors in (31) and (32).

In broad outline, Algorithm 2 (presented below) is a block version of Algorithm 1 using the partition (29). It measures progress with a block version of (2) defined by
\[
\mathrm{off}_B(A,B) = \sum_{i \ne j} \|A_{ij}\|_F^2 + \|B_{ij}\|_F^2
\]
where $A$ and $B$ are partitioned as in (29). At the $(i,j)$-th step of a sweep, it chooses a block rotation $Z = Z(i,j,U)$ to minimize $\mathrm{off}_B(Z^T A Z, Z^T B Z)$. It is easy to verify that for $i < j$,
\begin{align*}
\mathrm{off}_B(Z^T A Z, Z^T B Z) = {} & \mathrm{off}_B(A,B) - \|A_{ij}\|_F^2 - \|A_{ji}\|_F^2 - \|B_{ij}\|_F^2 - \|B_{ji}\|_F^2 \tag{33} \\
& + \|U_{12}^T A_{ii} U_{11} + U_{22}^T A_{ji} U_{11} + U_{12}^T A_{ij} U_{21} + U_{22}^T A_{jj} U_{21}\|_F^2 \\
& + \|U_{11}^T A_{ii} U_{12} + U_{21}^T A_{ji} U_{12} + U_{11}^T A_{ij} U_{22} + U_{21}^T A_{jj} U_{22}\|_F^2 \\
& + \|U_{12}^T B_{ii} U_{11} + U_{22}^T B_{ji} U_{11} + U_{12}^T B_{ij} U_{21} + U_{22}^T B_{jj} U_{21}\|_F^2 \\
& + \|U_{11}^T B_{ii} U_{12} + U_{21}^T B_{ji} U_{12} + U_{11}^T B_{ij} U_{22} + U_{21}^T B_{jj} U_{22}\|_F^2.
\end{align*}
Minimizing (33) is equivalent to minimizing the sum of the last four terms. General optimization algorithms like those described in [19] provide effective ways to find minimizers $(c_l, s_l) = (\cos(\theta_l), \sin(\theta_l))$ for the parameters in (31) or (32).

For $U \in \mathbb{R}^{4\times 4}$ as above define $\tilde{U}$ by
\[
\tilde{U} = \begin{bmatrix} 0 & -I \\ I & 0 \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{bmatrix}.
\]
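To make the construction concrete, the following NumPy sketch builds $U$ of (31) from four plane rotations, embeds it as $Z(i,j,U)$ of (30), and selects a block rotation approximately minimizing (33). All helper names are ours, not the paper's; a crude random search over the four angles stands in for the optimization methods of [19], and even $n$ is assumed:

```python
import numpy as np

def R4(p, q, c, s):
    """4x4 plane rotation R(p, q, c, s) as in (31), with 1-based indices."""
    G = np.eye(4)
    G[p-1, p-1] = c; G[q-1, q-1] = c
    G[p-1, q-1] = -s; G[q-1, p-1] = s
    return G

def U_of(theta):
    """Orthogonal U = R(2,4)R(1,4)R(2,3)R(1,3) of (31) from angles theta[0..3]."""
    c, s = np.cos(theta), np.sin(theta)
    return (R4(2, 4, c[3], s[3]) @ R4(1, 4, c[2], s[2])
            @ R4(2, 3, c[1], s[1]) @ R4(1, 3, c[0], s[0]))

def Z_of(n, i, j, U):
    """Embed U as the block rotation Z(i, j, U) of (30); 1-based block indices,
    block p occupying rows/columns 2p-2, 2p-1 (0-based)."""
    idx = [2*i-2, 2*i-1, 2*j-2, 2*j-1]
    Z = np.eye(n)
    Z[np.ix_(idx, idx)] = U
    return Z

def off_B(A, B):
    """Block measure: sum of squared Frobenius norms of off-diagonal 2x2 blocks."""
    t = 0.0
    for p in range(0, A.shape[0], 2):
        for q in range(0, A.shape[0], 2):
            if p != q:
                t += np.linalg.norm(A[p:p+2, q:q+2])**2
                t += np.linalg.norm(B[p:p+2, q:q+2])**2
    return t

def best_rotation(A, B, i, j, samples=300, seed=0):
    """Pick Z(i, j, U) approximately minimizing (33) by random angle sampling;
    theta = 0 is always a candidate, so off_B cannot increase."""
    rng = np.random.default_rng(seed)
    n = A.shape[0]
    cand = np.vstack([np.zeros(4), rng.uniform(-np.pi, np.pi, (samples, 4))])
    Zs = [Z_of(n, i, j, U_of(t)) for t in cand]
    return min(Zs, key=lambda Z: off_B(Z.T @ A @ Z, Z.T @ B @ Z))
```

A sweep of Algorithm 2 (below) would apply `best_rotation` over all block pairs $(i,j)$, accumulating $Q \leftarrow QZ$ and updating $A \leftarrow Z^T A Z$, $B \leftarrow Z^T B Z$.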


Observe that $Z = Z(i,j,U)$ and $S = Z(i,j,\tilde{U})$ are both block rotations and $\mathrm{off}_B(Z^T A Z, Z^T B Z) = \mathrm{off}_B(S^T A S, S^T B S)$. Thus, we may choose a block rotation so that $\|U_{11}\|_F \ge \|U_{12}\|_F$. Such block rotations are sometimes called "inner" block rotations. The use of inner block rotations promotes convergence [18]. The 3-by-3 case is more complicated. However, the 3-by-3 case only arises in case $n$ is odd. As in [44], it can be eliminated simply by adding an extra row and column of zeros to $A$ and $B$.

The following theorem implies that (33) is minimized by a block rotation of the form of (31) or (32). There is no need to use more general block rotations. A similar theorem about a different set of block rotations was proved in [42] using different methods.

Theorem 4.2.
1. If $Q \in \mathbb{R}^{4\times 4}$ is orthogonal, then there is a block rotation $Z \in \mathbb{R}^{4\times 4}$ of the form of (31) and an orthogonal, block diagonal matrix $D \in \mathbb{R}^{4\times 4}$ with 2-by-2 blocks such that $ZQ = D$.
2. If $Q \in \mathbb{R}^{3\times 3}$ is orthogonal, then there is a block rotation $Z \in \mathbb{R}^{3\times 3}$ of the form of (32) and an orthogonal, block diagonal matrix $D \in \mathbb{R}^{3\times 3}$ with 2-by-2 and 1-by-1 blocks such that $ZQ = D$.

Proof. We will prove Statement 1. The proof of Statement 2 is similar. Select $c_1 = \cos(\theta_1)$ and $s_1 = \sin(\theta_1)$ such that
\[
c_1 \det\begin{bmatrix} q_{21} & q_{22} \\ q_{31} & q_{32} \end{bmatrix} + s_1 \det\begin{bmatrix} q_{21} & q_{22} \\ q_{11} & q_{12} \end{bmatrix} = 0.
\]
Set $Q^{(1)} = R(1,3,c_1,s_1)Q$. With this choice of $c_1$ and $s_1$,
\[
\det\begin{bmatrix} q^{(1)}_{21} & q^{(1)}_{22} \\ q^{(1)}_{31} & q^{(1)}_{32} \end{bmatrix} = 0,
\]
so there is a choice of $c_2 = \cos(\theta_2)$ and $s_2 = \sin(\theta_2)$ such that
\[
s_2 q^{(1)}_{21} + c_2 q^{(1)}_{31} = s_2 q^{(1)}_{22} + c_2 q^{(1)}_{32} = 0.
\]
Thus, $Q^{(2)} = R(2,3,c_2,s_2)Q^{(1)}$ has zeros in the $(3,1)$ and $(3,2)$ entries. Select $c_3 = \cos(\theta_3)$ and $s_3 = \sin(\theta_3)$ such that
\[
c_3 \det\begin{bmatrix} q^{(2)}_{21} & q^{(2)}_{22} \\ q^{(2)}_{41} & q^{(2)}_{42} \end{bmatrix} + s_3 \det\begin{bmatrix} q^{(2)}_{21} & q^{(2)}_{22} \\ q^{(2)}_{11} & q^{(2)}_{12} \end{bmatrix} = 0.
\]
Set $Q^{(3)} = R(1,4,c_3,s_3)Q^{(2)}$. With this choice of $c_3$ and $s_3$ we get
\[
\det\begin{bmatrix} q^{(3)}_{21} & q^{(3)}_{22} \\ q^{(3)}_{41} & q^{(3)}_{42} \end{bmatrix} = 0.
\]
Hence there is a choice of $c_4 = \cos(\theta_4)$ and $s_4 = \sin(\theta_4)$ such that
\[
s_4 q^{(3)}_{21} + c_4 q^{(3)}_{41} = s_4 q^{(3)}_{22} + c_4 q^{(3)}_{42} = 0.
\]
Thus, $Q^{(4)} = R(2,4,c_4,s_4)Q^{(3)}$ has zeros in the $(3,1)$, $(3,2)$, $(4,1)$ and $(4,2)$ entries.
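The four rotation choices in this construction can be exercised numerically. The NumPy sketch below (helper names are ours; a generic orthogonal $Q$ is assumed, so the degenerate zero-vector branch of `unit` is not exercised) reproduces the proof's steps and checks the resulting block structure:

```python
import numpy as np

def R4(p, q, c, s):
    """4x4 rotation R(p, q, c, s), 1-based: left-multiplication sends
    row p to c*row_p - s*row_q and row q to s*row_p + c*row_q."""
    G = np.eye(4)
    G[p-1, p-1] = c; G[q-1, q-1] = c
    G[p-1, q-1] = -s; G[q-1, p-1] = s
    return G

def unit(a, b):
    """(c, s) proportional to (a, b) with c^2 + s^2 = 1 (identity if a = b = 0)."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0 else (a / r, b / r)

def block_diagonalize(Q):
    """Follow the proof of Theorem 4.2(1): Z = R(2,4)R(1,4)R(2,3)R(1,3)
    with Z @ Q block diagonal (2x2 blocks).  Returns (Z, D = Z @ Q)."""
    D, Z = Q.copy(), np.eye(4)
    det2 = lambda M, r1, r2: M[r1-1, 0]*M[r2-1, 1] - M[r1-1, 1]*M[r2-1, 0]
    for low in (3, 4):
        # R(1, low): make rows 2 and `low` of the leading two columns proportional
        c, s = unit(det2(D, 2, 1), -det2(D, 2, low))
        G = R4(1, low, c, s); D, Z = G @ D, G @ Z
        # R(2, low): annihilate the (low, 1) and (low, 2) entries
        c, s = unit(D[1, 0], -D[low-1, 0])
        G = R4(2, low, c, s); D, Z = G @ D, G @ Z
    return Z, D
```

Since $D$ is orthogonal and block upper triangular, its strictly upper 2-by-2 block vanishes as well, matching the conclusion of the proof.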


Set
\[
Z = R(2,4,c_4,s_4)\,R(1,4,c_3,s_3)\,R(2,3,c_2,s_2)\,R(1,3,c_1,s_1)
\]
and set $D = Q^{(4)} = ZQ$. Observe that $D$ is an orthogonal, block upper triangular matrix with 2-by-2 blocks. Hence, $D$ is block diagonal with 2-by-2 blocks.

The following algorithm summarizes the procedure for simultaneous diagonalization of a commuting pair of real normal matrices.

Algorithm 2.
INPUT: $\epsilon > 0$; $A, B \in \mathbb{R}^{n\times n}$ such that $AB = BA$, $AA^T = A^T A$ and $BB^T = B^T B$
OUTPUT: $Q \in \mathbb{R}^{n\times n}$ such that $\mathrm{off}_B(Q^T A Q, Q^T B Q) \le \epsilon(\|A\|_F + \|B\|_F)$ and $QQ^T = I$.
1. $Q \leftarrow I$
2. WHILE $\mathrm{off}_B(A,B) > \epsilon(\|A\|_F + \|B\|_F)$
   FOR $i = 1, 2, 3, \ldots, \lfloor (n+1)/2 \rfloor$
      FOR $j = i+1, i+2, i+3, \ldots, \lfloor (n+1)/2 \rfloor$
         Select an inner block rotation $Z \leftarrow Z(i,j,U)$ that minimizes $\mathrm{off}_B(Z^T A Z, Z^T B Z)$
         $Q \leftarrow QZ$; $A \leftarrow Z^T A Z$; $B \leftarrow Z^T B Z$

If rotations are stored and applied in the efficient manner described in [22, §5.1] or [33, §6.4], then each sweep (Step 2) uses approximately $8n^3$ real flops to update $A$ and $B$ and approximately $2n^3$ real flops to accumulate $Q$. A real flop is the computational effort required to execute the Fortran statement (8) where A and S are of type REAL. The rough estimate that one complex flop is equivalent to four real flops implies that Algorithm 2 does about half the work of Algorithm 1. Algorithm 2 also needs storage for approximately $3n^2$ real numbers. This can be shaved to $2n^2$ real numbers if $Q$ is not required.

5. Quaternions. A matrix $H \in \mathbb{C}^{2n\times 2n}$ has quaternion structure if it is of the form
\[
JH = \bar{H}J \tag{34}
\]
where $J \in \mathbb{R}^{2n\times 2n}$ is defined by
\[
J = \mathrm{diag}(E, E, E, \ldots, E) \tag{35}
\]
and
\[
E = \begin{bmatrix} 0 & 1 \\ -1 & 0 \end{bmatrix}.
\]
If a quaternion matrix $H$ is partitioned into an $n$-by-$n$ block matrix with 2-by-2 blocks, then each block is of the form
\[
H_{ij} = \begin{bmatrix} u_{ij} & -\bar{v}_{ij} \\ v_{ij} & \bar{u}_{ij} \end{bmatrix} \tag{36}
\]
where $u_{ij}, v_{ij} \in \mathbb{C}$. It is easy to show that (34) and (36) are equivalent.

If $P \in \mathbb{R}^{2n\times 2n}$ is the permutation matrix
\[
P = [e_1, e_3, e_5, \ldots, e_{2n-1}, e_2, e_4, e_6, \ldots, e_{2n}],
\]


then
\[
PHP^T = \begin{bmatrix} U & -\bar{V} \\ V & \bar{U} \end{bmatrix}. \tag{37}
\]
In other treatments, e.g., [7], quaternion matrices are defined to have the form of (37), but in this context, (36) makes the explication somewhat simpler. Quaternion matrices arise naturally from quantum mechanical problems that have time reversal symmetry [12, 26, 34], and in eigenvalue problems that have two special structures [6].

The eigenvalues and eigenvectors of quaternion matrices appear in pairs. If $(\lambda, x)$ is an eigenvalue-eigenvector pair of a quaternion matrix, then $(\bar{\lambda}, J\bar{x})$ is also an eigenvalue-eigenvector pair. A quaternion matrix $H$ has a quaternion Schur decomposition $HQ = QT$ where $Q \in \mathbb{C}^{2n\times 2n}$ is a unitary, quaternion matrix and $T \in \mathbb{C}^{2n\times 2n}$ is an upper triangular, quaternion matrix [7]. If $H \in \mathbb{C}^{2n\times 2n}$ is also normal, then $T$ is a diagonal, quaternion matrix. From this it is easy to show the following theorem.

Theorem 5.1. If $A, B \in \mathbb{C}^{2n\times 2n}$ are normal, commuting, quaternion matrices, then there is a unitary, quaternion matrix $Q \in \mathbb{C}^{2n\times 2n}$ and quaternion, diagonal matrices $D_A, D_B \in \mathbb{C}^{2n\times 2n}$ such that $AQ = QD_A$ and $BQ = QD_B$.

Algorithm 1 applied to a commuting pair of normal, quaternion matrices will destroy the quaternion structure. General 2n-by-2n matrices must be stored and updated. Rounding errors may destroy the pairing of eigenvalues and eigenvectors. Algorithm 1 takes no advantage of the pairing of eigenvalues. Theorem 5.1 suggests that if quaternion structure were preserved throughout Algorithm 1, then work and storage requirements would be cut in half and the special pairing of eigenvalues and eigenvectors would be preserved despite rounding errors.

The key to preserving quaternion structure is the observation that products and inverses of quaternion matrices are quaternion matrices. In particular, there is a rich class of quaternion, unitary matrices to use in a modified version of Algorithm 1. A quaternion rotation is a quaternion unitary matrix $W \in \mathbb{C}^{2n\times 2n}$ of the form $W = W_1 W_2 W_3$ where
\[
W_1 = R(2i, 2j, \bar{c}_4, \bar{s}_4)\,R(2i-1, 2j-1, c_4, s_4), \tag{38}
\]
\[
W_2 = R(2j-1, 2j, c_3, s_3)\,R(2i-1, 2i, c_2, s_2) \tag{39}
\]
and
\[
W_3 = R(2i, 2j, \bar{c}_1, \bar{s}_1)\,R(2i-1, 2j-1, c_1, s_1). \tag{40}
\]
The matrices $W_1$, $W_2$, $W_3$ are quaternion, unitary matrices. (The two rotations in (39) are quaternion, but the individual rotations in (38) and (40) are not.) Hence, $W$ is quaternion and unitary. In terms of the block 2-by-2 partitioning (29), for


$1 \le i < j \le k = n$, the quaternion rotation $W = W(i,j,U)$ is of the form
\[
W = \begin{bmatrix}
I & & & & \\
& W_{ii} & & W_{ij} & \\
& & I & & \\
& W_{ji} & & W_{jj} & \\
& & & & I
\end{bmatrix}, \tag{41}
\]
where
\[
\begin{bmatrix} W_{ii} & W_{ij} \\ W_{ji} & W_{jj} \end{bmatrix} = \begin{bmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{bmatrix} \tag{42}
\]
\[
= R(2,4,\bar{c}_4,\bar{s}_4)\,R(1,3,c_4,s_4)\,R(3,4,c_3,s_3)\,R(1,2,c_2,s_2)\,R(2,4,\bar{c}_1,\bar{s}_1)\,R(1,3,c_1,s_1).
\]
Up to column scaling, quaternion unitary matrices are products of quaternion rotations. This was proved for symplectic matrices in the permuted form (37) in [32], but the methods carry over to the quaternion case with few modifications.

Suppose $A, B \in \mathbb{C}^{2n\times 2n}$ form a commuting pair of normal quaternion matrices. In broad outline, Algorithm 3 (presented below) is a block version of Algorithm 1 partitioned into 2-by-2 blocks. It measures progress with a block version of (2) defined by
\[
\mathrm{off}_Q(A,B) = \sum_{i \ne j} \|A_{ij}\|_F^2 + \|B_{ij}\|_F^2.
\]
At the $(i,j)$-th step of a sweep, it uses a quaternion rotation $W = W(i,j,U)$ to minimize $\mathrm{off}_Q(W^H A W, W^H B W)$. Partitioning $U$ as in (42), it is easy to verify that for $i < j$,
\begin{align*}
\mathrm{off}_Q(W^H A W, W^H B W) = {} & \mathrm{off}_Q(A,B) - \|A_{ij}\|_F^2 - \|A_{ji}\|_F^2 - \|B_{ij}\|_F^2 - \|B_{ji}\|_F^2 \tag{43} \\
& + \|U_{12}^H A_{ii} U_{11} + U_{22}^H A_{ji} U_{11} + U_{12}^H A_{ij} U_{21} + U_{22}^H A_{jj} U_{21}\|_F^2 \\
& + \|U_{11}^H A_{ii} U_{12} + U_{21}^H A_{ji} U_{12} + U_{11}^H A_{ij} U_{22} + U_{21}^H A_{jj} U_{22}\|_F^2 \\
& + \|U_{12}^H B_{ii} U_{11} + U_{22}^H B_{ji} U_{11} + U_{12}^H B_{ij} U_{21} + U_{22}^H B_{jj} U_{21}\|_F^2 \\
& + \|U_{11}^H B_{ii} U_{12} + U_{21}^H B_{ji} U_{12} + U_{11}^H B_{ij} U_{22} + U_{21}^H B_{jj} U_{22}\|_F^2.
\end{align*}
Minimizing (43) is equivalent to minimizing the sum of the last four terms. General optimization algorithms like those described in [19] provide effective ways to find minimizers $(c_l, s_l) = (\cos(\theta_l), \sin(\theta_l))$ for the parameters in (38), (39) and (40).
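The equivalence of (34) and (36), and the permuted form (37), are easy to check numerically. This small NumPy sketch (illustrative, not code from the paper) assembles a random quaternion matrix from its 2-by-2 blocks and verifies both characterizations:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3

# Assemble a quaternion matrix from 2x2 blocks (36): [[u, -conj(v)], [v, conj(u)]]
H = np.zeros((2*n, 2*n), dtype=complex)
for i in range(n):
    for j in range(n):
        u = rng.standard_normal() + 1j * rng.standard_normal()
        v = rng.standard_normal() + 1j * rng.standard_normal()
        H[2*i:2*i+2, 2*j:2*j+2] = [[u, -np.conj(v)], [v, np.conj(u)]]

# (34): J H = conj(H) J, with J = diag(E, ..., E) as in (35)
E = np.array([[0.0, 1.0], [-1.0, 0.0]])
J = np.kron(np.eye(n), E)
assert np.allclose(J @ H, np.conj(H) @ J)

# Permuting odd-indexed rows/columns to the front exhibits the form (37):
# P H P^T = [[U, -conj(V)], [V, conj(U)]]
perm = list(range(0, 2*n, 2)) + list(range(1, 2*n, 2))
PHPt = H[np.ix_(perm, perm)]
assert np.allclose(PHPt[:n, n:], -np.conj(PHPt[n:, :n]))
assert np.allclose(PHPt[n:, n:], np.conj(PHPt[:n, :n]))
```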


For $U \in \mathbb{C}^{4\times 4}$ partitioned as in (42) define $\tilde{U}$ by
\[
\tilde{U} = \begin{bmatrix} 0 & -I \\ I & 0 \end{bmatrix} \begin{bmatrix} U_{11} & U_{12} \\ U_{21} & U_{22} \end{bmatrix}.
\]
Observe that $W = W(i,j,U)$ and $S = W(i,j,\tilde{U})$ are both quaternion rotations and $\mathrm{off}_Q(W^H A W, W^H B W) = \mathrm{off}_Q(S^H A S, S^H B S)$. Thus, we may choose the quaternion rotation so that $\|U_{11}\|_F \ge \|U_{21}\|_F$. Such block rotations are sometimes called "inner" block rotations. The use of inner block rotations promotes convergence [18].

The following modification of Algorithm 1 summarizes the procedure for simultaneous diagonalization of a commuting pair of quaternion normal matrices. Note that the output of Algorithm 3 includes a pair of block diagonal matrices with 2-by-2 blocks. The trivial step of simultaneously diagonalizing the commuting 2-by-2 blocks is omitted.

Algorithm 3.
INPUT: $\epsilon > 0$; $A, B \in \mathbb{C}^{2n\times 2n}$ such that $AB = BA$, $JA = \bar{A}J$, $JB = \bar{B}J$, $AA^H = A^H A$ and $BB^H = B^H B$
OUTPUT: $Q \in \mathbb{C}^{2n\times 2n}$ such that $\mathrm{off}_Q(Q^H A Q, Q^H B Q) \le \epsilon(\|A\|_F + \|B\|_F)$, $JQ = \bar{Q}J$ and $QQ^H = I$.
1. $Q \leftarrow I$
2. WHILE $\mathrm{off}_Q(A,B) > \epsilon(\|A\|_F + \|B\|_F)$
   FOR $i = 1, 2, 3, \ldots, n$
      FOR $j = i+1, i+2, i+3, \ldots, n$
         Select an inner quaternion rotation $W \leftarrow W(i,j,U)$ that minimizes $\mathrm{off}_Q(W^H A W, W^H B W)$
         $Q \leftarrow QW$; $A \leftarrow W^H A W$; $B \leftarrow W^H B W$

Note that $Q$ is a product of quaternion rotations, so $Q$ is a quaternion unitary matrix. The quaternion structure of $A$ and $B$ is preserved throughout. Equations (36) and (37) show that only $2n^2$ complex numbers are needed to represent a quaternion matrix. With this economy, Algorithm 3 needs approximately $6n^2$ storage. If rotations are stored and applied in an efficient manner similar to [22, §5.1] or [33, §6.4], then each sweep (Step 2) uses approximately $24n^3$ complex flops to update $A$ and $B$ and approximately $6n^3$ complex flops to accumulate $Q$.

6. Additional Symmetry Structure. In [6], simultaneous diagonalization problems usually have a special structure in addition to normality. At times, $A$ and $B$ are Hermitian, real symmetric, real skew symmetric, or quaternion. In this section, we outline how to modify Algorithm 1 to take advantage of additional special structure.

6.1. Hermitian Case. If $(A,B) \in \mathbb{C}^{n\times n} \times \mathbb{C}^{n\times n}$ is a pair of commuting Hermitian matrices, then $A$ is the Hermitian part and $iB$ is the skew Hermitian part of the normal matrix $H = A + iB$. All normal matrices are of this form. Algorithm 1 reduces to the Jacobi method for normal matrices, for which explicit formulae for minimizers of (5) are known [21].

6.2. Real Symmetric Case. The following theorem shows that if $A$ and $B$ are real symmetric, then no complex arithmetic is required by Algorithm 1. Moreover, there is an explicit formula for the minimizer of (5).

Theorem 6.1. Suppose $A, B \in \mathbb{R}^{n\times n}$, $A = A^T$, $B = B^T$ and $AB = BA$. If $c, s \in \mathbb{R}$, $c^2 + s^2 = 1$, and $w = \begin{bmatrix} c^2 - s^2 \\ 2cs \end{bmatrix}$ is a right singular vector of $L_{ij} =$


$\begin{bmatrix} a_{ij} & \frac{a_{ii}-a_{jj}}{2} \\ b_{ij} & \frac{b_{ii}-b_{jj}}{2} \end{bmatrix}$ corresponding to its smallest singular value, then $(c,s)$ is a minimizer of (5). Furthermore, if $L_{ij}$ has distinct singular values and $(\tilde{c}, \tilde{s}) \in \mathbb{C} \times \mathbb{C}$ minimizes (5), then $\tilde{w} = \begin{bmatrix} \tilde{c}^2 - \tilde{s}^2 \\ 2\tilde{c}\tilde{s} \end{bmatrix}$ is a scalar multiple of $w$.

Proof. In this case, (5) simplifies to
\[
f_{ij}(c,s) = \sqrt{2}\,\|L_{ij} w\|_2 \stackrel{\mathrm{def}}{=} \sqrt{2}\,\left\| \begin{bmatrix} a_{ij} & \frac{a_{ii}-a_{jj}}{2} \\ b_{ij} & \frac{b_{ii}-b_{jj}}{2} \end{bmatrix} \begin{bmatrix} c^2 - s^2 \\ 2cs \end{bmatrix} \right\|_2. \tag{44}
\]
Using the double angle formulae for $s = \sin(\theta)$ and $c = \cos(\theta)$, any two-dimensional real vector of Euclidean length 1 may be written in the form of $w$ for some choice of $c, s \in \mathbb{R}$, $c^2 + s^2 = 1$ and $c \ge s$. In particular, there is a choice of $c$ and $s$ such that $w$ is a right singular vector of $L_{ij}$ corresponding to the smallest singular value of $L_{ij}$. Note that $L_{ij}$ is real, so $w$ may be chosen to be real. Clearly, (44) is bounded below by $\sqrt{2}\,\sigma_{\min}(L_{ij})$, so this choice of $c$ and $s$ minimizes $f_{ij}(c,s)$. If $L_{ij}$ has distinct singular values then its singular vectors are unique up to scalar multiples.

In addition to simplifying the calculation of the minimizing $c$ and $s$, the theorem shows that only real rotations are needed. The real symmetric structure of $A$ and $B$ is preserved throughout. Consequently, only the real part of the upper triangle of $A$ and $B$ and the real part of $Q$ need to be stored and updated. These modifications cut the storage requirements of Algorithm 1 by a factor of three. The work requirements are cut to $6n^3$ real flops per sweep.

6.3. Real Symmetric -- Real Skew-symmetric. If $A \in \mathbb{R}^{n\times n}$ is symmetric, $B \in \mathbb{R}^{n\times n}$ is skew-symmetric and $AB = BA$, then $H = A + B \in \mathbb{R}^{n\times n}$ is normal. All real, normal matrices are of this form. In this case, Algorithm 2 applies with $A := A + B$ and $B := 0$. Note that $B := 0$ is invariant throughout Algorithm 2, so no work need be expended updating $B$ and no storage need be allocated to store $B$. This cuts the cost of Algorithm 2 down to approximately $4n^3$ real flops per sweep to update $A := A + B$ and approximately $2n^3$ flops per sweep to update $Q$.
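The explicit minimizer of Theorem 6.1 is easy to implement. The NumPy sketch below (helper names are ours, and the sign convention of the rotation is one concrete choice, not necessarily the paper's) applies one such Jacobi step and confirms that the new $(a_{ij}, b_{ij})$ pair has norm exactly $\sigma_{\min}(L_{ij})$:

```python
import numpy as np

def jacobi_step(A, B, i, j):
    """One rotation of the real symmetric Jacobi method, choosing (c, s)
    via Theorem 6.1: w = (c^2 - s^2, 2cs) is a right singular vector of
    L_ij for the smallest singular value."""
    L = np.array([[A[i, j], (A[i, i] - A[j, j]) / 2],
                  [B[i, j], (B[i, i] - B[j, j]) / 2]])
    w = np.linalg.svd(L)[2][-1]            # right singular vector, sigma_min
    theta = 0.5 * np.arctan2(-w[1], w[0])  # undo the double-angle formulae
    c, s = np.cos(theta), np.sin(theta)
    R = np.eye(A.shape[0])
    R[i, i] = c; R[j, j] = c; R[i, j] = -s; R[j, i] = s
    return R.T @ A @ R, R.T @ B @ R

# Commuting symmetric pair sharing an eigenbasis
rng = np.random.default_rng(0)
Q0 = np.linalg.qr(rng.standard_normal((4, 4)))[0]
A = Q0 @ np.diag([1.0, 2.0, 3.0, 4.0]) @ Q0.T
B = Q0 @ np.diag([5.0, 6.0, 7.0, 8.0]) @ Q0.T

A1, B1 = jacobi_step(A, B, 0, 1)
L = np.array([[A[0, 1], (A[0, 0] - A[1, 1]) / 2],
              [B[0, 1], (B[0, 0] - B[1, 1]) / 2]])
sigma_min = np.linalg.svd(L, compute_uv=False)[-1]
assert np.isclose(np.hypot(A1[0, 1], B1[0, 1]), sigma_min, atol=1e-12)
```

With this rotation convention the transformed pair satisfies $(a'_{ij}, b'_{ij})^T = L_{ij} w$ exactly, so the residual in the rotated $(i,j)$ entries equals the smallest singular value, as the proof of Theorem 6.1 predicts.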
A similar algorithm using a different set of block rotations was suggested in [42].

6.4. Real Skew-symmetric -- Real Skew-symmetric. The skew-symmetric structure of $A$ and $B$ is preserved throughout Algorithm 2. It is necessary to store and update only the upper triangular part of $A$ and $B$. This cuts the work required by Algorithm 2 to the same level as the symmetric -- skew-symmetric case.

6.5. Hermitian, Quaternion. If $(A,B)$ are Hermitian, quaternion matrices, then $A$ is the Hermitian part and $iB$ is the skew-Hermitian part of the normal matrix $H = A + iB$. However, $H$ is not quaternion. (The matrix $iB$ is antiquaternion [7]. See Subsection 6.6.) Nevertheless, Algorithm 3 preserves both the quaternion and Hermitian structure, so only the upper triangles of $A$ and $B$ need be stored and updated. Thus, both the storage requirement for $A$ and $B$ and the work to update $A$ and $B$ are cut in half.

6.6. Hermitian Quaternion -- Hermitian Anti-quaternion. A matrix $H \in \mathbb{C}^{2n\times 2n}$ is said to be antiquaternion if and only if $iH$ is quaternion. If $A \in \mathbb{C}^{2n\times 2n}$ is Hermitian and quaternion and $B \in \mathbb{C}^{2n\times 2n}$ is Hermitian and antiquaternion, then $A$ is the Hermitian part and $iB$ is the skew-Hermitian part of the quaternion, normal matrix $H = A + iB$. Moreover, all quaternion, normal matrices are of this form. Algorithm 3 applies with $A := A + iB$ and $B := 0$. Note that $B := 0$ is invariant, so it


need not be stored or updated. This economy cuts the work and storage requirements of Algorithm 3 down to approximately $18n^3$ complex flops per sweep. If $Q$ is not required, then the work requirement drops to $12n^3$ complex flops per sweep.

6.7. Commuting m-tuples. Algorithm 1 extends to the problem of simultaneously diagonalizing $m$ commuting, normal matrices. In this case the minimization problem (5) uses a $2m$-by-$3$ coefficient matrix. Algorithm 1 must apply similarity transformations to all $m$ commuting matrices.

7. Conclusions. We have presented a Jacobi-like algorithm for simultaneous diagonalization of commuting pairs of complex normal matrices. The algorithm uses a sequence of similarity transformations by elementary complex rotations to drive the off-diagonal entries to zero. Convergence and rounding error properties are similar to those of the serial Jacobi algorithm [20, 38, 46]. We have shown that its asymptotic convergence rate is quadratic, and that it is numerically stable in the sense that the computed eigenvalues and eigenvectors are correct for a perturbation of the data. Empirically, it appears to converge globally, but we have not been able to give a proof. The algorithm can easily be modified to preserve and exploit the additional special structure of real matrices, quaternion matrices and real symmetric matrices.

Acknowledgment. We would like to thank Gabriele Steinmeyer for help testing some of the algorithms and Prof. Vjeran Hari for comments on an early draft of this paper. We also thank an anonymous referee for bringing [45] to our attention, and Prof. K. Veselic for providing us with a copy of the paper and a translation.

REFERENCES

[1] J. Barlow and J. Demmel, Computing accurate eigensystems of scaled diagonally dominant matrices, Tech. Rep. 421, Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, 251 Mercer Street, New York, N.Y. 10012, December 1988.
[2] M. Berry and A. H. Sameh, Multiprocessor Jacobi algorithms for dense symmetric eigenvalue and singular value decompositions, in Proceedings of the '86 International Conference on Parallel Processing, August 1986.
[3] ------, A parallel algorithm for the singular value and dense symmetric eigenvalue problem, tech. rep., University of Illinois at Urbana-Champaign, Center for Supercomputing Research and Development, Urbana-Champaign, Illinois, March 1988.
[4] C. H. Bischof, The two-sided block Jacobi method on a hypercube architecture, in Hypercube Multiprocessors, M. T. Heath, ed., Philadelphia, PA, 1987, SIAM, pp. 612-618.
[5] R. P. Brent and F. T. Luk, The solution of singular-value and symmetric eigenvalue problems on multiprocessor arrays, SIAM Journal of Scientific and Statistical Computing, 6 (1985), pp. 69-84.
[6] A. Bunse-Gerstner, R. Byers, and V. Mehrmann, A chart of numerical methods for structured eigenvalue problems, tech. rep., FSP Mathematisierung, Universität Bielefeld, Postfach 8640, D-4800 Bielefeld 1, FRG, 1989. To appear in SIAM Journal of Matrix Analysis.
[7] ------, A quaternion QR algorithm, Numerische Mathematik, 55 (1989), pp. 83-95.
[8] K. Chen and K. Irani, A Jacobi algorithm and its implementation on parallel computers, in Proceedings of the 18-th Annual Allerton Conference on Communication Control and Computing, 1980.


[9] R. O. Davies and J. J. Modi, A direct method for computing eigenproblem solutions on a parallel computer, Linear Algebra and Its Applications, 77 (1986), pp. 61-74.
[10] J. W. Demmel and W. Kahan, Computing small singular values of bidiagonal matrices with guaranteed high relative accuracy, SIAM Journal of Scientific and Statistical Computing, (to appear).
[11] J. W. Demmel and K. Veselić, Jacobi's method is more accurate than QR, Tech. Rep. 468, Courant Institute of Mathematical Sciences, Department of Computer Science, New York University, 251 Mercer Street, New York, N.Y. 10012, October 1989.
[12] J. J. Dongarra, J. R. Gabriel, D. D. Kölling, and J. H. Wilkinson, The eigenvalue problem for Hermitian matrices with time reversal symmetry, Linear Algebra and Its Applications, 60 (1984), pp. 27-42.
[13] P. J. Eberlein, A Jacobi-like method for the automatic computation of eigenvalues and eigenvectors of an arbitrary matrix, Journal of the Society for Industrial and Applied Mathematics, 10 (1962), pp. 74-88.
[14] P. J. Eberlein, Solution to the complex eigenproblem by a norm reducing Jacobi-type method, Numerische Mathematik, 14 (1970), pp. 232-245.
[15] ------, On one-sided Jacobi methods for parallel computation, SIAM Journal on Algebraic and Discrete Methods, 8 (1987), pp. 790-796.
[16] ------, On using the Jacobi method on the hypercube, in Second Conference on Hypercube Multiprocessors, M. T. Heath, ed., Philadelphia, PA, 1987, SIAM, pp. 605-611.
[17] A. Erdmann, Über Jacobi-ähnliche Verfahren zur Lösung des Eigenwertproblems nichtnormaler komplexer Matrizen, Dissertation, Fernuniversität Hagen, Hagen, FRG, 1984.
[18] K. V. Fernando and S. J. Hammarling, On block Kogbetliantz methods for computation of the SVD, in SVD and Signal Processing: Algorithms, Applications and Architectures, F. Deprettere, ed., North-Holland, 1988.
[19] R. Fletcher, Practical Methods of Optimization, Volumes 1 and 2, John Wiley and Sons, New York, N.Y., 1981.
[20] G. E. Forsythe and P. Henrici, The cyclic Jacobi method for computing the principal values of a complex matrix, Transactions of the American Mathematical Society, 94 (1960), pp. 1-23.
[21] H. H. Goldstein and L. P. Horwitz, A procedure for the diagonalization of normal matrices, Journal of the Association for Computing Machinery, 6 (1959), pp. 176-195.
[22] G. Golub and C. Van Loan, Matrix Computations, The Johns Hopkins Press, Baltimore, Maryland, second ed., 1989.
[23] V. Hari, On sharp quadratic convergence bounds for the serial Jacobi methods, tech. rep., University of Zagreb, Department of Mathematics, YU-41001 Zagreb, P.O. Box 187, Yugoslavia, 1989.
[24] C. G. Jacobi, Über ein leichtes Verfahren, die in der Theorie der Säkulärstörungen vorkommenden Gleichungen numerisch aufzulösen, Journal für die Reine und Angewandte Mathematik, (1846), pp. 51-95.
[25] W. Kahan, Accurate eigenvalues of a symmetric tridiagonal matrix, Tech. Rep. CS 41, Stanford University, Computer Science Department, Stanford, California, 1966. Revised 1968.
[26] M. Lax, Symmetry Principles in Solid State and Molecular Physics, John Wiley and Sons, New York, N.Y., first ed., 1974.
[27] F. T. Luk, A triangular processor array for computing singular values, Linear Algebra and Its Applications, 77 (1986), pp. 259-273.
[28] F. T. Luk and H. Park, A proof of convergence for two parallel Jacobi SVD algorithms, IEEE Transactions on Computers, 38 (1989), pp. 806-811.
[29] C. Moler, MATLAB user's guide, Tech. Rep. CS81-1, Computer Science, University of New Mexico, Albuquerque, NM, 1980.
[30] B. Noble and W. Daniel, Applied Linear Algebra, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1977.
[31] M. H. C. Paardekooper, An eigenvalue algorithm for skew-symmetric matrices, Numerische Mathematik, 17 (1971), pp. 189-202.
[32] C. Paige and C. F. Van Loan, A Schur decomposition for Hamiltonian matrices, Linear Algebra and Its Applications, 41 (1981), pp. 11-32.
[33] B. N. Parlett, The Symmetric Eigenvalue Problem, Prentice-Hall Series in Computational Mathematics, Prentice-Hall, Inc., Englewood Cliffs, N.J., 1980.
[34] N. Rösch, Time-reversal symmetry, Kramers' degeneracy and the algebraic eigenvalue problem, Chemical Physics, 80 (1983), pp. 1-5.
[35] W. Rudin, Real and Complex Analysis, Higher Mathematics, McGraw-Hill Book Company, New York, N.Y., 1966.


[36] A. Ruhe, On the quadratic convergence of the Jacobi method for normal matrices, BIT, 7 (1967), pp. 305-313.
[37] A. H. Sameh, On Jacobi and Jacobi-like algorithms for a parallel computer, Mathematics of Computation, 25 (1971), pp. 579-590.
[38] A. Schönhage, Zur Konvergenz des Jacobi-Verfahrens, Numerische Mathematik, 3 (1961), pp. 374-380.
[39] B. T. Smith, J. M. Boyle, J. J. Dongarra, B. S. Garbow, Y. Ikebe, V. C. Klema, and C. B. Moler, Matrix Eigensystem Routines -- EISPACK Guide, Lecture Notes in Computer Science, Springer Verlag, New York, N.Y., 1976.
[40] G. W. Stewart, A Jacobi-like algorithm for computing the Schur decomposition of a non-Hermitian matrix, SIAM Journal of Scientific and Statistical Computing, 6 (1985), pp. 853-864.
[41] H. P. M. van Kempen, On the convergence of the classical Jacobi method for real symmetric matrices with non-distinct eigenvalues, Numerische Mathematik, 9 (1966), pp. 11-18.
[42] K. Veselić, On a new class of elementary matrices, Numerische Mathematik, 33 (1979), pp. 173-180.
[43] ------, An eigenreduction algorithm for definite matrix pairs and its applications to overdamped linear systems, tech. rep., FernUniversität, Fakultät für Mathematik, Postfach 940, D-5800 Hagen, FRG, 1988.
[44] K. Veselić and H. J. Wenzel, A quadratically convergent Jacobi-like method for real matrices with complex eigenvalues, Numerische Mathematik, 33 (1979), pp. 425-435.
[45] V. Voevodin, An extension of the method of Jacobi (in Russian), Comp. Methods and Programming VIII, Comp. Center of the Moscow University, (1967), pp. 216-228.
[46] J. H. Wilkinson, Note on the quadratic convergence of the cyclic Jacobi process, Numerische Mathematik, 4 (1962), pp. 296-300.
[47] ------, The Algebraic Eigenvalue Problem, Clarendon Press, Walton Street, Oxford, U.K., 1965.
[48] ------, Almost diagonal matrices with multiple or close eigenvalues, Tech. Rep. CS 59, Stanford University, Department of Computer Science, Stanford, California, 1967.
