ScicomP 10, Aug 9-13, 2004
Parallel Out-of-Core LU and QR Factorization
Brian Gunter
Center for Space Research
The University of Texas at Austin

Enrique Quintana-Ortí
Depto. de Ingenieria y Ciencia de Computadores
Universidad Jaume I, Castellón

Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin

Thierry Joffrain
Department of Computer Sciences
The University of Texas at Austin
Motivation
Traditional methods use a slab approach, where entire columns of the out-of-core matrix are brought into memory.
[Figure: an m×n out-of-core matrix; the in-core slab spans entire columns.]
[Figure: the same matrix with m >> n.]
While this is effective for many applications, it is inherently unscalable.
As m grows (m >> n), fewer columns can fit into memory.
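To make the scaling argument concrete, the short Python sketch below compares how many full-height columns a slab can hold with the side length of a square tile under the same memory budget. The 1 GiB budget, the 8-byte entries, and the assumption that three tiles are held in core at once are illustrative choices, not figures from the talk; the function names are hypothetical.

```python
import math

def slab_columns(memory_bytes, m, bytes_per_entry=8):
    """Number of full-height columns (length m) that fit in memory."""
    return memory_bytes // (m * bytes_per_entry)

def tile_size(memory_bytes, tiles_in_core=3, bytes_per_entry=8):
    """Side length t of the largest square tile when `tiles_in_core`
    tiles must fit in memory together (assumed count, for illustration)."""
    return math.isqrt(memory_bytes // (tiles_in_core * bytes_per_entry))

memory = 2**30  # 1 GiB, an illustrative budget
for m in (10**5, 10**6, 10**7):
    # The slab narrows as m grows; the tile size does not depend on m.
    print(m, slab_columns(memory, m), tile_size(memory))
# prints:
# 100000 1342 6688
# 1000000 134 6688
# 10000000 13 6688
```

The slab width shrinks in proportion to 1/m, while the tile size depends only on the memory budget.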
Out-of-Core QR Factorization
Given the m×n matrix A, we wish to compute the factorization
A = QR
using the compact WY representation
Q = I + Y T Y^T
where
• Q is an orthogonal matrix
• R is upper triangular
• Y is an m×r collection of Householder vectors, normalized to be unit lower triangular (trapezoidal)
• T is r×r upper triangular
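The compact WY form can be checked numerically. The following NumPy sketch builds Y and T one Householder vector at a time and confirms that Q = I + Y T Y^T is orthogonal and reproduces A. It is a small in-core illustration under the sign convention of the slide, not the PLAPACK routine; the function name householder_wy is hypothetical.

```python
import numpy as np

def householder_wy(A):
    """Return Y, T, R with A = Q R and Q = I + Y T Y^T (compact WY form)."""
    R = np.array(A, dtype=float)
    m, n = R.shape
    r = min(m, n)
    Y = np.zeros((m, r))
    T = np.zeros((r, r))
    for j in range(r):
        x = R[j:, j]
        normx = np.linalg.norm(x)
        v = np.zeros(m)
        v[j] = 1.0                      # unit diagonal of Y
        if normx == 0.0:
            tau = 0.0                   # nothing to annihilate
        else:
            alpha = -np.copysign(normx, x[0])
            v[j + 1:] = x[1:] / (x[0] - alpha)
            tau = 2.0 / (v @ v)
        # Apply H_j = I - tau v v^T to the working matrix.
        R -= tau * np.outer(v, v @ R)
        # Append v to Y and grow T so that H_1 ... H_j = I + Y T Y^T.
        # With the plus sign, the diagonal entries of T are -tau.
        T[:j, j] = -tau * (T[:j, :j] @ (Y[:, :j].T @ v))
        T[j, j] = -tau
        Y[:, j] = v
    return Y, T, np.triu(R)

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Y, T, R = householder_wy(A)
Q = np.eye(6) + Y @ T @ Y.T
print(np.allclose(Q @ R, A), np.allclose(Q.T @ Q, np.eye(6)))
```

Only Y (m×r) and T (r×r) need to be stored to apply Q later, which is what makes the representation attractive when tiles must be written back to disk.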
QR Factorization: Out-of-Core Implementation
Step 1:
Begin with an unfactored matrix that resides on disk.
Step 2:
Divide the matrix into a mesh of t×t tiles, where each tile is stored as a separate file.
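The tiling in Step 2 might look like the following NumPy sketch, which writes each tile to its own file and reads one back. The file naming scheme and the .npy format are assumptions made for illustration; the talk does not describe the on-disk layout used by POOCLAPACK, and write_tiles and read_tile are hypothetical helpers.

```python
import os
import tempfile
import numpy as np

def write_tiles(A, t, directory):
    """Split A into t-by-t tiles (edge tiles may be smaller) and save
    each tile as its own .npy file. Returns (tile rows, tile columns)."""
    m, n = A.shape
    for i in range(0, m, t):
        for j in range(0, n, t):
            name = f"tile_{i // t}_{j // t}.npy"   # hypothetical naming scheme
            np.save(os.path.join(directory, name), A[i:i + t, j:j + t])
    return -(-m // t), -(-n // t)                  # ceiling division

def read_tile(directory, i, j):
    """Load tile (i, j) back into memory."""
    return np.load(os.path.join(directory, f"tile_{i}_{j}.npy"))

A = np.arange(35, dtype=float).reshape(7, 5)
with tempfile.TemporaryDirectory() as d:
    grid = write_tiles(A, 3, d)
    tile = read_tile(d, 2, 1)
print(grid, tile.shape)
```

Because every tile is an independent file, the factorization can bring in only the tiles a given step touches and write them back when done.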
Step 3:
Read in the first tiles and factor them, saving the T matrices and overwriting the lower tile with Y.
Step 4:
Read in the remaining tiles in the row and apply Q = I + Y_i T_i Y_i^T, reading Y_i in one panel at a time.
Step 5:
Factor the next tile in the first column using the QR update algorithm.
Step 6:
Apply the transformations to the remaining tiles in the row.
Step 7:
Repeat Steps 5 and 6 for any remaining rows of tiles.
Step 8:
Repeat Steps 1-7 on the lower quadrant.
Continue until the entire matrix has been factored.
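The order of operations in Steps 3 through 8 can be sketched in NumPy as follows. This is an in-memory stand-in under stated assumptions: array slices play the role of tiles read from disk, the tile size t divides both dimensions, and each transformation is applied as an explicit Q from numpy.linalg.qr rather than in the stored Y and T form the slides describe. The function name tiled_qr_r is hypothetical.

```python
import numpy as np

def tiled_qr_r(A, t):
    """Return the R factor of A, computed one t-by-t tile at a time."""
    A = np.array(A, dtype=float)
    m, n = A.shape
    assert m % t == 0 and n % t == 0 and m >= n
    for k in range(n // t):
        kk = slice(k * t, (k + 1) * t)
        rest = slice((k + 1) * t, n)
        # Steps 3-4: factor the diagonal tile, update the tiles to its right.
        Q, R = np.linalg.qr(A[kk, kk], mode="complete")
        A[kk, kk] = R
        A[kk, rest] = Q.T @ A[kk, rest]
        # Steps 5-7: fold each tile below the diagonal into R (QR update),
        # then apply the same transformation to the two affected tile rows.
        for i in range(k + 1, m // t):
            ii = slice(i * t, (i + 1) * t)
            Q, R = np.linalg.qr(np.vstack([A[kk, kk], A[ii, kk]]),
                                mode="complete")
            A[kk, kk] = R[:t]
            A[ii, kk] = 0.0
            right = Q.T @ np.vstack([A[kk, rest], A[ii, rest]])
            A[kk, rest] = right[:t]
            A[ii, rest] = right[t:]
        # Step 8: the next pass of the loop works on the lower quadrant.
    return np.triu(A[:n])

rng = np.random.default_rng(1)
A = rng.standard_normal((8, 4))
R = tiled_qr_r(A, 2)
print(np.allclose(R.T @ R, A.T @ A))
```

At any moment the sketch touches only a diagonal tile, one tile below it, and the corresponding tiles to the right, which is the property that lets the working set stay fixed as the matrix grows.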
Out-of-Core LU Factorization
Given the m×n matrix A, we wish to compute the factorization
PA = LU
where
• P is a permutation matrix
• L is lower trapezoidal
• U is n×n upper triangular
The implementation is analogous to the out-of-core QR factorization.
LU Factorization: Out-of-Core Implementation
Step 1:
Factor the first tile, saving the permutation matrix.
Step 2:
Update the remaining tiles in the row using panels of L and the saved permutation matrices.
Step 3:
Factor the next tile in the first column using the LU update algorithm.
Step 4:
Update the remaining tiles in the row using panels of L and the stored permutation matrices.
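The four LU steps can be illustrated with the NumPy sketch below, which solves a small square system tile by tile. It is a simplified stand-in, not the POOCLAPACK code: pivots are chosen only among the rows of the tiles currently "in memory" (tile-wise partial pivoting), and the row swaps and eliminations are applied immediately to the trailing columns and the right-hand side instead of being saved as L and P for later use. The names eliminate and tiled_lu_solve are hypothetical.

```python
import numpy as np

def eliminate(M, ncols):
    """Gaussian elimination with partial pivoting on the first `ncols`
    columns of M, in place. Row swaps and row updates touch every column
    of M, so trailing columns receive the same transformations."""
    for c in range(ncols):
        p = c + np.argmax(np.abs(M[c:, c]))
        M[[c, p]] = M[[p, c]]
        if M[c, c] != 0.0:   # a zero pivot is left for a later LU update
            M[c + 1:, c + 1:] -= np.outer(M[c + 1:, c] / M[c, c],
                                          M[c, c + 1:])
            M[c + 1:, c] = 0.0
    return M

def tiled_lu_solve(A, b, t):
    """Solve A x = b for square A, visiting A one t-by-t tile at a time."""
    m = A.shape[0]
    assert A.shape == (m, m) and m % t == 0
    M = np.hstack([A, b.reshape(-1, 1)]).astype(float)
    for k in range(m // t):
        kk = slice(k * t, (k + 1) * t)
        # Steps 1-2: factor the diagonal tile and update its tile row.
        eliminate(M[kk, k * t:], t)
        for i in range(k + 1, m // t):
            ii = slice(i * t, (i + 1) * t)
            # Steps 3-4: LU update of the stacked pair [U_kk; A_ik],
            # carried across the remaining tiles of both tile rows.
            stacked = np.vstack([M[kk, k * t:], M[ii, k * t:]])
            eliminate(stacked, t)
            M[kk, k * t:] = stacked[:t]
            M[ii, k * t:] = stacked[t:]
    # M now holds [U | y]; numpy's general solver stands in for
    # triangular back substitution.
    return np.linalg.solve(np.triu(M[:, :m]), M[:, m])

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 6))
b = rng.standard_normal(6)
x = tiled_lu_solve(A, b, 2)
print(np.allclose(A @ x, b))
```

Restricting the pivot search to the tiles in memory is what keeps the working set fixed; it is a different pivoting strategy from partial pivoting over a full column.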
Development Environment
Parallel Linear Algebra Package (PLAPACK)
• Optimized parallel routines (FORTRAN and C interfaces)
• 'View-based' infrastructure
• Uses standard MPI and BLAS libraries
Parallel Out-of-Core Linear Algebra Package (POOCLAPACK)
• Out-of-core extension to PLAPACK
• Handles the complexity of the I/O operations (i.e., hidden from the user)
• Uses standard read/write functions for portability
Performance of Parallel OOC QR
IBM P690: 32 GB of memory, theoretical peak (T.P.) of 5.2 Gflops, DGEMM of 3.723 Gflops
Performance for Sequential OOC LU
Earth Science Application
Gravity Recovery And Climate Experiment (GRACE)
A collaborative effort between
• The University of Texas Center for Space Research (CSR)
• The Jet Propulsion Laboratory (JPL)
• GeoForschungsZentrum (GFZ)
• Deutschen Zentrum für Luft- und Raumfahrt (DLR)
• National Aeronautics and Space Administration (NASA)
Goal was to compute a rigorous 360×360 gravity model
• No approximation techniques
• Translates to roughly 100 km² resolution
• Involves the least squares estimation of ~130,000 parameters
Requires the combination of hundreds of millions of observations
• surface gravity data (land), ½ TB
• altimetry-based mean sea surface data (ocean)
• GRACE data (satellite)
Using the new parallel OOC QR algorithm
• A 360×360 field was generated, complete with full covariance
• Largest rigorous gravity field model ever created
• Used a single IBM P690 node
• OOC QR required only 32 GB; to do this in-core would require 165 GB of memory
• Required ~6 days of wall clock time to compute (2326 CPU hours); a single processor machine with sufficient memory would require 3.2 months
Conclusion
Tile-based out-of-core algorithms provide scalability
• Size of the tile is based on the memory of the machine (i.e., fixed) and is independent of the problem size
Algorithms achieve excellent performance
• The large tile sizes mean the algorithm spends nearly all of its time in large, highly efficient matrix-matrix operations
• This helps to offset the I/O cost associated with moving the tiles to and from disk
Use of PLAPACK and POOCLAPACK greatly simplified the implementation
• Reduces complexity of code
• Makes code portable
Has already proven valuable to Earth science applications
Broad spectrum of applications
• Large scale problems
• Small clusters
• Embedded systems
• Other small memory machines
Tile-based OOC approach can be extended to other dense linear algebra operations
• Cholesky, matrix inverse, BLAS-3, etc.
Goal is to provide a full suite of OOC utilities
For More Information
Visit the PLAPACK website: www.cs.utexas.edu/users/plapack
Visit the GRACE website: www.csr.utexas.edu/grace