ScicomP 10, Aug 9-13, 2004

Parallel Out-of-Core LU and QR Factorization

Brian Gunter
Center for Space Research
The University of Texas at Austin, Austin, TX

Enrique Quintana-Ortí
Depto. de Ingenieria y Ciencia de Computadores
Universidad Jaume I, Castellón, Spain

Robert van de Geijn
Department of Computer Sciences
The University of Texas at Austin, Austin, TX

Thierry Joffrain
Department of Computer Sciences
The University of Texas at Austin, Austin, TX


Motivation

Traditional methods use a slab approach, where entire columns of the out-of-core matrix are brought into memory.

[Figure: an m×n out-of-core matrix, with a slab of full-length columns held in-core.]


While this is effective for many applications, it is inherently unscalable.

As m grows much larger than n (m >> n), fewer columns can fit into memory.
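To make the scaling argument concrete, the short sketch below counts how many full-length columns fit into a fixed memory budget as m grows. The 32 GB budget and 8-byte double-precision entries are illustrative assumptions, and `slab_columns` is a hypothetical helper, not part of any library named in this talk.

```python
def slab_columns(m, mem_bytes, bytes_per_entry=8):
    """Number of full-length columns of an m-row matrix that fit in memory."""
    return mem_bytes // (m * bytes_per_entry)

mem = 32 * 10**9  # assumed memory budget of 32 GB
for m in (10**5, 10**6, 10**7, 10**8):
    # The slab gets narrower in direct proportion to the growth of m.
    print(f"m = {m:>11,d}: slab width = {slab_columns(m, mem):,d} columns")
```

A square tile, by contrast, occupies the same amount of memory no matter how large m becomes.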


Out-of-Core QR Factorization

Given the m×n matrix A, we wish to compute the factorization

A = QR

Compact WY Representation

Q = I + Y T Y^T

• Q is an orthogonal matrix
• R is upper triangular
• Y is an m×r collection of Householder vectors, normalized to be unit lower triangular (trapezoidal)
• T is r×r upper triangular
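As a small in-core illustration of the compact WY form, the sketch below builds Y and T column by column with Householder reflections and then checks that Q = I + Y T Y^T reproduces A = QR. The function name `householder_wy` is hypothetical, the input is assumed to have full column rank with at least as many rows as columns, and the signs of T are chosen to match the "+" convention used on this slide (LAPACK stores the negated T and writes Q = I - V T V^T).

```python
import numpy as np

def householder_wy(A):
    """Return (Y, T, R) such that A = Q R with Q = I + Y @ T @ Y.T."""
    A = np.array(A, dtype=float)        # work on a copy
    m, r = A.shape
    Y = np.zeros((m, r))
    T = np.zeros((r, r))
    for j in range(r):
        x = A[j:, j]
        beta = -np.copysign(np.linalg.norm(x), x[0])
        v = np.zeros(m)
        v[j] = 1.0                      # unit diagonal entry of Y
        v[j + 1:] = x[1:] / (x[0] - beta)
        tau = (beta - x[0]) / beta
        A -= tau * np.outer(v, v @ A)   # apply H_j = I - tau v v^T
        # Append v to Y and grow T so that Q = I + Y T Y^T still holds.
        T[:j, j] = -tau * (T[:j, :j] @ (Y[:, :j].T @ v))
        T[j, j] = -tau
        Y[:, j] = v
    return Y, T, np.triu(A[:r, :])

rng = np.random.default_rng(0)
A0 = rng.standard_normal((6, 3))
Y, T, R = householder_wy(A0)
Q = np.eye(6) + Y @ T @ Y.T
print(np.allclose(Q[:, :3] @ R, A0))    # Q R reproduces A
```

Only Y and the small matrix T need to be stored; Q itself is never required explicitly and is formed here only to check the result.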


QR Factorization: Out-of-Core Implementation

Step 1:

Begin with an unfactored matrix which resides on disk.

[Figure legend: tiles are marked as either stored on disk or in memory.]


Step 2:

Divide the matrix into a mesh of t×t tiles, where each tile is stored as a separate file.
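The tiling step can be sketched with NumPy as follows. The file naming scheme, the `.npy` format, and the helpers `write_tiles` and `read_tile` are all assumptions made for illustration; the slides do not describe how POOCLAPACK actually lays out its files.

```python
import tempfile
from pathlib import Path

import numpy as np

def write_tiles(A, t, directory):
    """Split A into t-by-t tiles (edge tiles may be smaller), one file per tile."""
    directory = Path(directory)
    m, n = A.shape
    paths = {}
    for i in range(0, m, t):
        for j in range(0, n, t):
            path = directory / f"tile_{i // t}_{j // t}.npy"
            np.save(path, A[i:i + t, j:j + t])
            paths[(i // t, j // t)] = path
    return paths

def read_tile(paths, i, j):
    """Bring a single tile into memory."""
    return np.load(paths[(i, j)])

A = np.arange(30.0).reshape(6, 5)
with tempfile.TemporaryDirectory() as tmp:
    paths = write_tiles(A, t=2, directory=tmp)
    tile = read_tile(paths, 1, 2)   # rows 2-3 of the last column
print(tile)
```

Because every tile lives in its own file, a step of the algorithm only has to read and write the tiles it touches.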


Step 3:

Read in the first tiles and factor them, saving the T matrices (Ti) and overwriting the lower tile with Y (Yi).


Step 4:

Read in the remaining tiles in the row and apply Q = I + Yi Ti Yi^T, reading Yi in one panel at a time.
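One way to read "one panel at a time" is that Yi is split into column panels, each with its own small T factor, and the panels are applied to a tile in sequence. That reading is an assumption, as is the helper name `apply_panels`. Since A = QR implies R = Q^T A, the sketch applies the transpose of each panel's block reflector and never forms Q.

```python
import numpy as np

def apply_panels(panels, B):
    """Overwrite B with Q^T B, where Q = Q_1 Q_2 ... and Q_k = I + Y_k T_k Y_k^T."""
    for Y, T in panels:          # only one (Y_k, T_k) pair is needed at a time
        W = T.T @ (Y.T @ B)      # small workspace with as many rows as Y_k has columns
        B += Y @ W               # B <- (I + Y_k T_k^T Y_k^T) B = Q_k^T B
    return B

# Build two single-vector panels so the result can be checked explicitly.
rng = np.random.default_rng(0)
m = 6
panels = []
for k in range(2):
    v = np.zeros(m)
    v[k] = 1.0
    v[k + 1:] = rng.standard_normal(m - k - 1)
    panels.append((v[:, None], np.array([[-2.0 / (v @ v)]])))

B = rng.standard_normal((m, 3))
Q = np.eye(m)
for Y, T in panels:
    Q = Q @ (np.eye(m) + Y @ T @ Y.T)
expected = Q.T @ B
result = apply_panels(panels, B.copy())
print(np.allclose(result, expected))
```

The work is done by matrix-matrix products, and memory is needed only for the tile being updated plus the current panel.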


Step 5:

Factor the next tile in the first column using the QR update algorithm.
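The idea behind a QR update can be sketched in-core: stack the triangular factor already computed on top of the next tile and factor the stacked matrix. This dense version uses `numpy.linalg.qr` and ignores the triangular structure of R, which a real implementation would exploit; `qr_update` is a hypothetical name, not a routine from the talk.

```python
import numpy as np

def qr_update(R, B):
    """Return the triangular factor of the stacked matrix [R; B]."""
    return np.linalg.qr(np.vstack([R, B]), mode="r")

rng = np.random.default_rng(0)
t = 4
A1 = rng.standard_normal((t, t))     # tile that has already been factored
A2 = rng.standard_normal((t, t))     # next tile in the first column

R1 = np.linalg.qr(A1, mode="r")
R2 = qr_update(R1, A2)

# Factoring both tiles at once gives the same R up to the signs of its rows.
R_full = np.linalg.qr(np.vstack([A1, A2]), mode="r")
print(np.allclose(np.abs(R2), np.abs(R_full)))
```

The update never revisits the tile above: everything it needs from that tile is carried by R.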


Step 6:

Apply the transformations to the remaining tiles in the row.


Step 7:

Repeat Steps 5 and 6 for any remaining rows of tiles.


Step 8:

Repeat Steps 1-7 on the lower quadrant.

Continue until the entire matrix has been factored.


Out-of-Core LU Factorization

Given the m×n matrix A, we wish to compute the factorization

PA = LU

• P is a permutation matrix
• U is n×n upper triangular
• L is lower trapezoidal
• Implementation analogous to out-of-core QR factorization
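For reference, the in-core factorization PA = LU with partial pivoting can be sketched in a few lines of NumPy. This is a textbook version, not the out-of-core algorithm from the talk; `lu_partial_pivoting` is a hypothetical name, and the input is assumed to have m >= n with nonzero pivots.

```python
import numpy as np

def lu_partial_pivoting(A):
    """Return (perm, L, U) with A[perm] = L @ U for an m-by-n matrix, m >= n."""
    A = np.array(A, dtype=float)                  # work on a copy
    m, n = A.shape
    perm = np.arange(m)
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))       # row of the largest pivot candidate
        if p != k:
            A[[k, p]] = A[[p, k]]                 # swap rows k and p
            perm[[k, p]] = perm[[p, k]]
        A[k + 1:, k] /= A[k, k]                   # multipliers form column k of L
        A[k + 1:, k + 1:] -= np.outer(A[k + 1:, k], A[k, k + 1:])
    L = np.tril(A, -1) + np.eye(m, n)             # unit lower trapezoidal, m-by-n
    U = np.triu(A[:n, :])                         # upper triangular, n-by-n
    return perm, L, U

rng = np.random.default_rng(0)
A0 = rng.standard_normal((6, 3))
perm, L, U = lu_partial_pivoting(A0)
P = np.eye(6)[perm]                               # permutation matrix built from perm
print(np.allclose(P @ A0, L @ U))
```

Storing the row permutation as an index vector rather than a full matrix is what makes it cheap to save and reapply later.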


LU Factorization: Out-of-Core Implementation

Step 1:

Factor the first tile into Li and Ui, saving the permutation matrix Pi.


Step 2:

Update the remaining tiles in the row using panels of L and the saved permutation matrices.


Step 3:

Factor the next tile in the first column using the LU update algorithm.


Step 4:

Update the remaining tiles in the row using panels of L and the stored permutation matrices.


Development Environment

Parallel Linear Algebra Package (PLAPACK)
• Optimized parallel routines (FORTRAN and C interfaces)
• ‘View-based’ infrastructure
• Uses standard MPI and BLAS libraries

Parallel Out-of-Core Linear Algebra Package (POOCLAPACK)
• Out-of-core extension to PLAPACK
• Handles the complexity of the I/O operations (i.e., hidden from the user)
• Uses standard read/write functions for portability


Performance of Parallel OOC QR

IBM P690: 32 GB of memory, theoretical peak (T.P.) of 5.2 Gflops, DGEMM of 3.723 Gflops


Performance for Sequential OOC LU


Earth Science Application

Gravity Recovery And Climate Experiment (GRACE)

A collaborative effort between
• The University of Texas Center for Space Research (CSR)
• The Jet Propulsion Laboratory (JPL)
• GeoForschungsZentrum (GFZ)
• Deutschen Zentrum für Luft- und Raumfahrt (DLR)
• National Aeronautics and Space Administration (NASA)


Goal was to compute a rigorous 360x360 gravity model
• No approximation techniques
• Translates to roughly 100 km² resolution
• Involves the least squares estimation of ~130,000 parameters

Requires the combination of hundreds of millions of observations
• surface gravity data (land) – ½ TB
• altimetry-based mean sea surface data (ocean)
• GRACE data (satellite)

Using new parallel OOC QR algorithm
• A 360x360 field was generated, complete with full covariance
• Largest rigorous gravity field model ever created
• Used a single IBM P690 node
• OOC QR required only 32 GB
  • To do in-core would require 165 GB of memory
• Required ~6 days of wall clock time to compute (2326 CPU hours)
  • A single processor machine with sufficient memory would require 3.2 months
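The figures above can be cross-checked with a few lines of arithmetic. The sketch assumes that "360x360" means spherical harmonic degree and order 360, that all coefficients from degree 0 upward are counted, and that a month averages 30.4 days; the slides state none of these explicitly.

```python
n_max = 360                                   # assumed maximum degree and order
n_coeffs = (n_max + 1) ** 2                   # coefficient count through degree n_max
print(n_coeffs)                               # compare with "~130,000 parameters"

cpu_hours = 2326
days_single_cpu = cpu_hours / 24              # one processor running continuously
months_single_cpu = days_single_cpu / 30.4    # assumed average month length in days
print(round(months_single_cpu, 1))            # compare with "3.2 months"

cpus_implied = cpu_hours / (6 * 24)           # CPU hours spread over ~6 wall-clock days
print(round(cpus_implied, 1))
```

The last number suggests that roughly 16 processors were kept busy over the run, although the slides do not give the processor count of the node.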


Conclusion

Tile-based out-of-core algorithms provide scalability
• Size of the tile is based on the memory of the machine (i.e., fixed) and is independent of the problem size

Algorithms achieve excellent performance
• The large tile sizes mean the algorithm spends nearly all of its time in large, highly efficient matrix-matrix operations
• This helps to offset the I/O cost associated with moving the tiles to and from disk

Use of PLAPACK & POOCLAPACK greatly simplified the implementation
• Reduces complexity of code
• Makes code portable

Has already proven valuable to Earth science applications


Broad spectrum of applications
• Large scale problems
• Small clusters
• Embedded systems
• Other small memory machines

Tile-based OOC approach can be extended to other dense linear algebra operations
• Cholesky, matrix inverse, BLAS-3, etc.

Goal is to provide a full suite of OOC utilities


For More Information

Visit the PLAPACK website: www.cs.utexas.edu/users/plapack

Visit the GRACE website: www.csr.utexas.edu/grace