PDCS 2007
November 20, 2007
Accelerating the Complex Hessenberg QR Algorithm with the CSX600 Floating-Point
Coprocessor
Yusaku Yamamoto (1), Takafumi Miyata (1), Yoshimasa Nakamura (2)
(1) Nagoya University, (2) Kyoto University
Introduction
• Background
– Advent of many-core floating-point accelerators as a means to speed up scientific computations
• Objective of our study
– Apply these accelerators to the eigenvalue problem for nonsymmetric matrices.
– Identify potential problems.
– Modify existing algorithms, or develop new ones, where necessary.
Outline of the talk
• Introduction
• Many-core floating-point accelerators and their performance characteristics
• The nonsymmetric eigenvalue problem
• Proposed algorithm
– Modification of the small-bulge multishift QR algorithm for floating-point accelerators
• Performance evaluation
• Conclusion
Many-core floating-point accelerators
• ClearSpeed CSX600: 1 + 96 processor cores, 48 GFLOPS (double precision)
• Intel Larrabee (under development): 80 processor cores, 1 TFLOPS (single precision)
• GRAPE-DR (Tokyo Univ.): 512 processor cores, 512 GFLOPS (single precision), 256 GFLOPS (double precision)
• Common traits: hundreds of integrated floating-point cores, and a very high GFLOPS value (peak performance)
Architecture of the CSX600 accelerator
• The CSX600 chip
– 1 main processor
– 96 floating-point processors: 64-bit, 2 flops/cycle, 128-byte register file, 6 KB SRAM each
– Operates at 250 MHz
– Peak performance: 48 GFLOPS
• ClearSpeed Advance board
– Two CSX600 processors
– 1 GB DRAM
– Connected to the PC via the PCI-X bus
– Peak performance: 96 GFLOPS
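As a quick sanity check, the quoted peak figures follow directly from the core count, the per-core throughput, and the clock rate given above; a minimal sketch:

```python
# Sanity check of the quoted peak figures (all numbers from the slide).
cores = 96            # floating-point processors per CSX600 chip
flops_per_cycle = 2   # per core, double precision
clock_hz = 250e6      # 250 MHz

chip_peak = cores * flops_per_cycle * clock_hz   # flops/s per chip
board_peak = 2 * chip_peak                       # two chips per Advance board

print(chip_peak / 1e9)   # 48.0 (GFLOPS)
print(board_peak / 1e9)  # 96.0 (GFLOPS)
```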
Problem with the data transfer speed
• Peak floating-point performance: very high
– 48 GFLOPS / chip
– 96 GFLOPS / board
• Data transfer speed: relatively low
– 3.2 GB/s between the chip and on-board memory
– 1.066 GB/s between the board and main memory
• Byte/flop
– 0.066 byte/flop between the chip and on-board memory
– 0.011 byte/flop between the board and main memory
[Figure: block diagram of the PC and the ClearSpeed Advance board; the two CSX600 chips reach on-board DRAM at 3.2 GB/s, and the board connects to the host CPU and main memory over the PCI-X bus at 1.066 GB/s]
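The byte/flop figures above are simply the bandwidths divided by the relevant peak rates (48 GFLOPS per chip, 96 GFLOPS per board); a minimal check:

```python
# Byte/flop ratios implied by the slide's figures.
chip_peak = 48e9    # flops/s per chip
board_peak = 96e9   # flops/s per board
bw_onboard = 3.2e9  # bytes/s, chip <-> on-board memory
bw_host = 1.066e9   # bytes/s, board <-> main memory

print(round(bw_onboard / chip_peak, 3))  # 0.067 byte/flop (the slide rounds to 0.066)
print(round(bw_host / board_peak, 3))    # 0.011 byte/flop
```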
Byte/flop of typical linear algebraic operations

Function                          Operation         Data transferred   Flop count   Byte/flop
Dot product                       a := x^T y        2n words           2n           8
AXPY                              y := a x + y      3n words           2n           12
Matrix-vector multiplication      y := A x          n^2 + 2n words     2n^2         ~4
Rank-1 update                     A := A + x y^T    2n^2 + 2n words    2n^2         ~8
Matrix multiplication (MatMult)   C := C + A B      4n^2 words         2n^3         16/n

• Operations other than matrix multiplication cannot exploit the performance of the CSX600 due to the limited data transfer speed.
• Matrix multiplication (MatMult) can be executed efficiently, but only if the size is very large (n > 1500).
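The table's ratios follow from 8 bytes per double-precision word; the sketch below reproduces them, and shows why only MatMult can fit under the chip's 0.066 byte/flop budget once n is large enough:

```python
# Byte/flop of the table's operations (8 bytes per double-precision word).
def byte_per_flop(words_moved, flops):
    return 8.0 * words_moved / flops

n = 2000.0
print(byte_per_flop(2*n, 2*n))            # dot product: 8
print(byte_per_flop(3*n, 2*n))            # AXPY: 12
print(byte_per_flop(n*n + 2*n, 2*n*n))    # matrix-vector: ~4
print(byte_per_flop(2*n*n + 2*n, 2*n*n))  # rank-1 update: ~8
print(byte_per_flop(4*n*n, 2*n**3))       # MatMult: 16/n, i.e. 0.008 here
```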
Performance of MatMult on the ClearSpeed board

[Figure: measured GFLOPS of matrix multiplication with dimensions M, N, K on the ClearSpeed board. Left panel: GFLOPS vs. N for M = K = 500, 1000, 1500, 2000. Right panel: GFLOPS vs. M for N = K = 500, 1000, 1500, 2000. Performance rises from near 0 to about 30 GFLOPS as the dimensions grow from 1000 to 6000.]

• The library transfers the input data from the main memory to the board, performs the computation, and returns the result to the main memory.
• M, N, and K must be larger than 1500 to get a substantial performance gain.
Problems to be solved
• Is it possible to reorganize the algorithm so that most of the computation is done with matrix multiplications?
• What is the overhead of using very large matrix multiplications?
• How can we reduce the overhead?
We consider these problems for the nonsymmetric eigenvalue problem.
The nonsymmetric eigenvalue problem
• The problem
– A: dense complex nonsymmetric matrix
– Compute all the eigenvalues / eigenvectors
• Applications
– Magnetohydrodynamics
– Structural dynamics
– Quantum chemistry
– Fluid dynamics
– Cf. Z. Bai and J. Demmel: A test matrix collection for non-Hermitian eigenvalue problems.
Algorithm for the nonsymmetric eigenproblem
• The standard algorithm: similarity transformation to an upper triangular matrix
– Dense matrix to Hessenberg matrix: Householder's method (finite number of steps; work (10/3)n^3)
– Hessenberg matrix to upper triangular matrix: QR algorithm (iterative; work about 10n^3, empirically)
– The diagonal elements of the triangular matrix are the eigenvalues.
• We focus on speeding up the QR algorithm (the target of speedup).
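The two-stage pipeline above (Householder reduction, then the iterative QR algorithm on the Hessenberg form) can be exercised with off-the-shelf SciPy routines; this is LAPACK's implementation rather than the authors' code, and the matrix size here is only illustrative:

```python
import numpy as np
from scipy.linalg import hessenberg, eig

rng = np.random.default_rng(0)
n = 200
A = rng.random((n, n)) + 1j * rng.random((n, n))  # dense complex nonsymmetric

# Stage 1: Householder reduction (finite number of steps, ~(10/3)n^3 flops).
H, Q = hessenberg(A, calc_q=True)
assert np.allclose(Q @ H @ Q.conj().T, A)  # similarity: A = Q H Q^H
assert np.allclose(np.tril(H, -2), 0)      # H is upper Hessenberg

# Stage 2: the (iterative) QR algorithm, here inside LAPACK's eigensolver.
w = eig(H, right=False)

# Similarity preserves the spectrum: eigenvalues of H match those of A.
w_ref = eig(A, right=False)
err = max(np.min(np.abs(w_ref - wi)) for wi in w)
print(err < 1e-8)  # True
```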
The small-bulge multishift QR algorithm
• Algorithm (the case of m = 4 shifts)
– (A_l - s_1 I)(A_l - s_2 I) ... (A_l - s_m I) = Q R,  A_{l+1} = Q^H A_l Q
  (A_l Hessenberg, Q unitary, R upper triangular)
– Shifts s_1, ..., s_m: eigenvalues of the trailing m x m submatrix of A_l
– This performs m steps of the QR algorithm at once.
• Computational procedure for one iteration
– Introduce (m / 2) bulges.
– Transform the matrix back to Hessenberg form by chasing the (m / 2) bulges.
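For reference, one multishift QR iteration can be written in its explicit textbook form: build the shift polynomial, take its QR factorization, and apply the similarity. A practical implementation, including the one in this talk, does this implicitly by bulge chasing; the dense sketch below only illustrates the mathematics, with sizes chosen arbitrarily:

```python
import numpy as np
from scipy.linalg import qr, eigvals

rng = np.random.default_rng(1)
n, m = 60, 4
# A random complex upper Hessenberg matrix.
A = np.triu(rng.random((n, n)) + 1j * rng.random((n, n)), -1)

def multishift_qr_step(H, m):
    """One explicit multishift QR step: (H - s_1 I)...(H - s_m I) = Q R,
    then H <- Q^H H Q, with shifts from the trailing m x m submatrix."""
    s = eigvals(H[-m:, -m:])
    P = np.eye(len(H), dtype=complex)
    for si in s:
        P = (H - si * np.eye(len(H))) @ P   # shift polynomial p(H)
    Q, _ = qr(P)
    return Q.conj().T @ H @ Q               # unitary similarity

H1 = multishift_qr_step(A, m)

# The unitary similarity preserves the spectrum.
w0, w1 = eigvals(A), eigvals(H1)
err = max(np.min(np.abs(w1 - wi)) for wi in w0)
print(err < 1e-6)  # True
```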
Use of the level-3 BLAS: blocking of the bulge-chasing operations
• Chase the bulges by only k rows at a time.
• Divide the update operations into two parts:
– First, update the diagonal block sequentially, accumulating the Householder transformations used in the update into a unitary matrix.
– Next, update the off-diagonal blocks by multiplying them by this unitary matrix (MatMult, i.e. level-3 BLAS).
[Figure: Hessenberg matrix with 3x3 bulges inside a k x k diagonal window; the diagonal update is sequential, the off-diagonal update is a matrix multiplication]
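The accumulate-then-multiply idea behind this blocking can be demonstrated in isolation: applying a sequence of small Householder reflectors to an off-diagonal block one by one gives the same result as accumulating them into a unitary matrix U and applying a single matrix multiplication. The block sizes and the random reflectors below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
b, ncols = 40, 500            # diagonal-window size, off-diagonal block width

def householder(x):
    """Unit vector v of a reflector I - 2 v v^T that maps x onto e1."""
    v = x.copy()
    v[0] += np.sign(x[0]) * np.linalg.norm(x)
    return v / np.linalg.norm(v)

B_direct = rng.random((b, ncols))     # off-diagonal block
B = B_direct.copy()
U = np.eye(b)                         # accumulated unitary transformation

for j in range(b - 1):
    v = householder(rng.random(b - j))                         # acts on rows j:
    B_direct[j:, :] -= 2.0 * np.outer(v, v @ B_direct[j:, :])  # apply one by one
    U[:, j:] -= 2.0 * np.outer(U[:, j:] @ v, v)                # U <- U H_j

B_gemm = U.T @ B                      # the same update as a single MatMult
print(np.allclose(B_direct, B_gemm))  # True
```

The direct loop touches the block b - 1 times (level-2 work); the accumulated form replaces all of that traffic with one large GEMM, which is exactly the operation the accelerator runs well.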
Performance on the CSX600
• Setup
– Random matrix (n = 6000); compute all the eigenvalues / eigenvectors with the small-bulge multishift QR algorithm
– Computational environment: Xeon 3.2 GHz, 8 GB memory, ClearSpeed Advance board (CSX600 x 2)
• As the number of shifts increases,
– the MatMult part decreases, but
– the other parts increase and become the bottleneck.
[Figure: execution time (sec, 0 to 12000), broken down into "MatMult" and "others", on the Xeon alone and on Xeon + CSX600, as the number of shifts grows from 100 to 240 (MatMult size 600 to 1440)]
• Parts other than MatMult need to be sped up!
Modification of the algorithm (1): reformulation as a recursive algorithm
• Inside the k x k diagonal block, apply the same diagonal / off-diagonal splitting again: chase (m / 2) / q bulges by k / q rows at a time.
• The inner diagonal update remains sequential; the inner off-diagonal update becomes a MatMult.
[Figure: the k x k diagonal window subdivided into sub-windows of k / q rows, each with its own sequential diagonal update and MatMult off-diagonal update (example: recursion level d = 1)]
Modification of the algorithm (2): eigensolution of the isolated submatrix
• Deflation: the trailing submatrix becomes isolated.
• Eigensolution of the isolated submatrix
– Apply the double-shift QR algorithm.
– The size of the submatrix increases with m, so this part becomes a bottleneck.
• Division of the update operations (to reduce the computational work)
– Update the diagonal block until convergence, accumulating the Householder transformations used in the update into a unitary matrix (sequential).
– Update the off-diagonal blocks only once, at the end, by multiplying them by the unitary matrix (MatMult).
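Deflation itself is detected from negligible subdiagonal entries of the Hessenberg matrix; the slide does not state the exact criterion, so the sketch below assumes the standard relative test:

```python
import numpy as np

def deflation_points(H, tol=1e-12):
    """Indices j where the subdiagonal entry H[j+1, j] is negligible,
    so the eigenproblem splits into independent diagonal blocks.
    (Standard relative criterion; the talk's exact test is not shown.)"""
    sub = np.abs(np.diag(H, -1))
    d = np.abs(np.diag(H))
    thresh = tol * (d[:-1] + d[1:])
    return [j for j in range(len(sub)) if sub[j] <= thresh[j]]

# A small Hessenberg matrix with one exactly zero subdiagonal entry:
rng = np.random.default_rng(3)
H = np.triu(rng.random((6, 6)), -1)   # upper Hessenberg
H[4, 3] = 0.0                         # isolates the trailing 2 x 2 block
print(deflation_points(H))            # [3]
```

Once such a point is found, the trailing block can be handed to the double-shift QR solver on its own, as the slide describes.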
Numerical experiments
• Test problem
– Random matrices with elements in [0, 1], reduced to Hessenberg form by Householder's method
– Compute all the eigenvalues / eigenvectors
• Computational environment
– Xeon 3.2 GHz, 8 GB memory; Fortran 77, double precision
– ClearSpeed Advance board (CSX600 x 2)
– Matrix multiplication: ClearSpeed's library (for large MatMult), Intel Math Kernel Library (for small MatMult)
Numerical experiments (continued)
• Comparison
– Existing algorithm (small-bulge multishift QR method): MatMult is used for the off-diagonal update.
– Our algorithm (multishift QR + recursion): MatMult is used for the off-diagonal update, the diagonal update, and the eigensolution of the isolated submatrix.
• Parameter values
– Number of shifts: m is chosen optimally for each case.
– Row length of bulge chasing: k = (3/2) m
– Level of recursion: d = 1
– Number of subdivisions: q = m / 40
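With these choices (k = (3/2) m and q = m / 40 at recursion level d = 1), each level-1 sub-step chases (m / 2) / q bulges over k / q rows. Plugging in the m values used in the experiments shows that the inner sequential sub-problem stays at a fixed size (20 bulges over 60 rows) while the outer MatMult size grows with m:

```python
# Sub-step sizes implied by the parameter choices (k = 3m/2, q = m/40).
for m in (160, 200, 240):
    k = 3 * m // 2          # rows chased per outer block step
    q = m // 40             # number of subdivisions at recursion level d = 1
    bulges = (m // 2) // q  # bulges chased per level-1 sub-step
    rows = k // q           # rows chased per level-1 sub-step
    print(m, k, q, bulges, rows)  # bulges = 20, rows = 60 in every case
```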
Effect of our modifications
• Our algorithm is 1.4 times faster than the original.
– Diagonal update: 1.5 times faster
– Eigensolution of the isolated submatrix: 10 times faster
[Figure: execution time (sec, 0 to 5000), broken down into off-diagonal update, diagonal update, isolated eigenproblem, and others, comparing the original and the proposed algorithm for (n = 3000, m = 160, q = 4) and (n = 6000, m = 200, q = 5); the CSX600 is used in all cases]
Effect of using the CSX600
• By combining the CSX600 with our algorithm, we obtain
– 3.5 times speedup when n = 6000
– 3.8 times speedup when n = 12000
[Figure: execution time (sec, 0 to 100000), broken down into off-diagonal update, diagonal update, isolated eigenproblem, and others, for n = 6000 and n = 12000, comparing the Xeon alone (m = 100, q = 5) with Xeon + CSX600 (m = 200, q = 5 for n = 6000; m = 240, q = 6 for n = 12000)]
Conclusion
• We proposed an approach to accelerating the solution of the nonsymmetric eigenproblem with a floating-point accelerator.
• As a basis, we used the small-bulge multishift QR algorithm, which can exploit matrix multiplications efficiently.
• By reformulating part of the algorithm as a recursive algorithm, we reduced the computational time spent in the non-blocked parts. This lets us use a large block size (number of shifts) with small overhead and exploit the performance of the floating-point accelerator.
• When solving an eigenproblem of order 12,000, our algorithm is 1.4 times faster than the original small-bulge multishift QR algorithm, and we obtained a 3.8 times speedup with the CSX600 over the 3.2 GHz Xeon.