Transcript of the talk:
The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL
LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware
Nico Galoppo, Naga K. Govindaraju, Michael Henson, Dinesh Manocha
http://gamma.cs.unc.edu/LU-GPU
Goals
Demonstrate advantages of mapping linear algebra routines to graphics hardware:
Performance
Growth rate
LAPACK-compliant set of linear algebra algorithms on graphics hardware
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
LU decomposition
Sequence of row eliminations (scale and add): A(i,j) = A(i,j) - A(i,k) * A(k,j)
Input data mapping: 2 distinct memory regions
No data dependencies within a row elimination
Pivoting: pointer swap vs. data copy
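The update above is the entire inner loop of the algorithm. As a reference, here is a minimal CPU-side sketch of LU by row eliminations without pivoting; NumPy stands in for the GPU, and the function name is ours:

```python
import numpy as np

def lu_no_pivot(A):
    """In-place LU decomposition by row eliminations (no pivoting).

    On return, U is the upper triangle and the unit-lower-triangular
    L is stored below the diagonal.
    """
    A = A.astype(float).copy()
    n = A.shape[0]
    for k in range(n):
        # Scale: store the multipliers L(i,k) in column k below the diagonal.
        A[k+1:, k] /= A[k, k]
        # Add: one "row elimination" pass. Every element of the trailing
        # submatrix is updated independently: A(i,j) -= A(i,k) * A(k,j).
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A
```

The independence of the trailing-submatrix updates is exactly the "no data dependencies within a row elimination" property that makes the pass parallelizable.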
LU decomposition
Theoretical complexity (partial pivoting): (2/3)n^3 + O(n^2)
Performance depends on the architecture:
Order of operations
Memory access (latency)
Memory bandwidth
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
Commodity CPUs
LINPACK Benchmark:
Intel Pentium 4, 3.06 GHz: 2.88 GFLOPs/s
(Jack Dongarra, Oct 2005)
Streaming architectures
Specialized hardware: high bandwidth-to-compute ratio
Merrimac [Erez04]: molecular modeling, 38 GFLOPs vs. 2.7 GFLOPs (P4); $1,000/node
Imagine [Ahn04]: 10.46 GFLOPs/s on QR decomposition
Research
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
CPU vs. GPU
Pentium EE 840: 3.2 GHz dual core, 230M transistors, 90 nm process, 206 mm^2, 2 x 1 MB cache, 25.6 GFLOPs, price: $1,040
GeForce 7800 GTX: 430 MHz, 302M transistors, 110 nm process, 326 mm^2, 512 MB onboard memory, 313 GFLOPs (shader), 1.3 TFLOPs (total), price: $450
CPU vs. GPU (Henry Moreton, NVIDIA, Aug. 2005)

                        PEE 840   7800 GTX   GPU/CPU
Graphics GFLOPs           25.6      1300       50.8
Shader GFLOPs             25.6       313       12.2
Die area (mm^2)            206       326        1.6
Die area (normalized)      206       218        1.1
Transistors (M)            230       302        1.3
Power (W)                  130        65        0.5
GFLOPS/mm^2                0.1       6.0       47.9
GFLOPS/transistor          0.1       4.3       38.7
GFLOPS/W                   0.2      20.0      101.6
CPU vs. GPU: Bandwidth
[Diagram]
CPU (3 GHz, 2 x 1 MB cache) to system memory (2+ GB, including 512 MB AGP memory): 6.4 GB/s bandwidth
GPU (500 MHz) to video memory (512 MB): 35.2 GB/s bandwidth
CPU to GPU: PCI-E bus, 4 GB/s
Bandwidth
Large, high-bandwidth memory: 512 MB video memory vs. 2 MB L2 cache on CPUs
High memory-to-compute clock ratio reduces memory stalls
Graphics pipeline
[Pipeline diagram: vertex, polygon, pixel, texture, image]
programmable vertex processing (fp32)
polygon setup, culling, rasterization
programmable per-pixel math (fp32)
per-pixel texture, fp16 blending
Z-buffer, fp16 blending, anti-aliasing (MRT)
memory
Stream processor (non-graphics) (David Kirk, NVIDIA, May '05)
[Same pipeline viewed as a stream processor, data flowing between stages]
programmable MIMD processing (fp32)
SIMD "rasterization" over lists (setup/rasterizer)
programmable SIMD processing (fp32)
data fetch, fp16 blending
predicated write, fp16 blend, multiple outputs
memory
Potential of graphics processors
Commodity horsepower:
Parallel computation
Bandwidth
Programmable graphics pipeline
Stream processor
Exploit large growth rate
Exploiting technology moving faster than Moore's law (source: Anselmo Lastra)
[Chart: CPU growth rate vs. GPU growth rate over time]
General purpose computing on GPUs
Physical simulation:
Fluid flow [Fan et al. 2004]
FEM [Rumpf and Strzodka 2001]
Cloud dynamics [Harris et al. 2003]
Sparse linear algebra:
Operators [Krüger and Westermann 2003]
Iterative solvers [Bolz et al. 2003]
General purpose computing on GPUs
Matrix-matrix multiplication:
Fixed graphics pipeline, fixed-point arithmetic [Larsen and McAllister 2001]
Floating point (SP) [Fatahalian et al. 2004]
High-level APIs:
BrookGPU [Buck et al. 2004]
Sh [McCool et al. 2004]
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
Motivation for LU-GPU
LU decomposition maps well:
Stream program
Few data dependencies
Pivoting: parallel pivot search
Exploit large memory bandwidth
GPU based algorithms
Data representation
Algorithm mapping
Data representation
Texture mapping hardware: Input data mapping
Data representation
Matrix elements stored in 2D texture memory
One-to-one mapping
Texture memory = on-board memory:
Exploit bandwidth
Avoid CPU-GPU data transfer
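The one-to-one mapping can be made concrete with a small sketch. Assuming the standard GPU convention of normalized texture coordinates with texel centers at half-integer offsets (the function name is ours):

```python
def texel_coords(i, j, n):
    """Map matrix element (i, j) of an n x n matrix to normalized 2D
    texture coordinates, one texel per element. Texel centers sit at
    half-integer offsets in an n x n texture."""
    u = (j + 0.5) / n   # column index -> horizontal texture axis
    v = (i + 0.5) / n   # row index    -> vertical texture axis
    return u, v
```

Because the mapping is one-to-one, an element update on the GPU reads and writes exactly one texel, with no address arithmetic in the shader.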
GPU based algorithms
Data representation
Algorithm mapping:
Stream computation
Input data mapping
Fast row swaps
Algorithm mapping
Stream computation
Rasterize quadrilaterals:
Generates computation stream
Invokes SIMD units
Rasterization simulates blocking
Rasterization pass = row elimination
Alternating memory regions
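The pass structure above can be sketched in plain Python, with two buffers standing in for the alternating texture memory regions (ping-pong rendering); the "rasterization pass" becomes a loop over the trailing submatrix, and the helper name is ours:

```python
import numpy as np

def lu_pingpong(A):
    """LU elimination (no pivoting) structured like the GPU version:
    each row elimination is one 'pass' that reads from the current
    buffer and writes the updated values to the other buffer,
    mimicking the alternating texture memory regions."""
    n = A.shape[0]
    src = A.astype(float).copy()
    dst = src.copy()
    for k in range(n):
        # Pass k: every output element is computed from the source buffer.
        dst[:] = src
        dst[k+1:, k] = src[k+1:, k] / src[k, k]            # multipliers
        dst[k+1:, k+1:] = (src[k+1:, k+1:]
                           - np.outer(dst[k+1:, k], src[k, k+1:]))
        src, dst = dst, src                                 # ping-pong swap
    return src
```

Reading from one region and writing to the other is what removes read-after-write hazards within a pass, at the cost of one extra copy of the matrix.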
Input data mapping
Input data mapping
Dedicated texture mapping hardware
Traditionally for color interpolation
Map input matrix elements to output elements
Eliminates computation of memory locations
25% performance improvement
Pivoting
Main issues:
Pivot search
Row/column swapping
Pivoting
Partial pivoting
Fast row swap
Data copy: mapped rasterization
Texture mapping hardware
High memory bandwidth
Improvement over pointer swapping
[Diagram: rows copied from input to output through the texture mapping hardware]
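A minimal CPU-side sketch of the partial-pivoting variant follows; on the GPU the explicit row copy below is performed as a mapped-rasterization pass through the texture hardware, and the function name is ours:

```python
import numpy as np

def eliminate_with_partial_pivot(A):
    """LU with partial pivoting: before each elimination, find the
    largest |A(i,k)| in column k and bring that row to row k by an
    explicit data copy (the CPU analogue of the rasterized row swap)."""
    A = A.astype(float).copy()
    n = A.shape[0]
    piv = np.arange(n)                         # row permutation record
    for k in range(n):
        p = k + np.argmax(np.abs(A[k:, k]))    # pivot search in column k
        if p != k:
            A[[k, p], :] = A[[p, k], :]        # row swap by data copy
            piv[[k, p]] = piv[[p, k]]
        A[k+1:, k] /= A[k, k]
        A[k+1:, k+1:] -= np.outer(A[k+1:, k], A[k, k+1:])
    return A, piv
```

The swap-by-copy looks wasteful on a CPU, where a pointer swap is cheaper, but the slide's point is that the GPU's memory bandwidth makes the copy faster than maintaining an indirection table.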
Full pivoting
Fast column/row swap
Parallel pivot search: divide-and-conquer approach
[Diagram: search regions, one column for partial pivoting vs. the whole trailing submatrix for full pivoting]
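The divide-and-conquer pivot search can be sketched as a tree reduction: each round halves the candidate set, and on the GPU every comparison within a round runs in parallel. A sequential sketch (the function name is ours):

```python
def parallel_max_search(values):
    """Divide-and-conquer search for the index of the largest-magnitude
    element. Each round compares disjoint pairs of surviving candidates,
    so a list of n elements needs about log2(n) rounds; on the GPU each
    round is one parallel pass."""
    idx = list(range(len(values)))
    while len(idx) > 1:
        nxt = []
        for a, b in zip(idx[0::2], idx[1::2]):
            nxt.append(a if abs(values[a]) >= abs(values[b]) else b)
        if len(idx) % 2:           # odd leftover advances unchallenged
            nxt.append(idx[-1])
        idx = nxt
    return idx[0]
```

For full pivoting the same reduction runs over the entire trailing submatrix instead of a single column.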
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
Benchmarks
GPU          SIMD units   Core clock   Memory   Memory clock
6800 GT          12        350 MHz     256 MB     900 MHz
6800 Ultra       16        425 MHz     256 MB    1100 MHz
7800 GTX         24        430 MHz     256 MB    1200 MHz
Commodity CPU: 3.4 GHz Pentium IV with Hyper-Threading, 1 MB L2 cache
LAPACK sgetrf() (blocked algorithm, ATLAS library)
LAPACK sgetc2() (SSE-optimized IMKL library)
Results: No pivoting
[Graph: time (s) vs. matrix size N (1000 to 3500) for ATLAS GETRF (partial pivot) on the CPU and LU without pivoting on the 6800 GT, 6800 Ultra, and 7800 GTX]
Results: Partial pivoting
[Graph: time (s) vs. matrix size N (500 to 3500) for ATLAS GETRF (partial pivot) on the CPU and partial-pivoting LU on the 6800 GT, 6800 Ultra, and 7800 GTX]
Results: Full Pivoting
[Graph: time (s) vs. matrix size N (500 to 3500) for LAPACK sgetc2 (IMKL) on the CPU and full-pivoting LU on the 6800 Ultra and 7800 GTX]
Results: Number of computational units
[Graph: time (s) vs. matrix size N (500 to 4000) on the 6800 Ultra (no pivoting) with 4, 8, 12, and 16 SIMD units enabled; reference points marked Jun 2003 and Mar 2004]
GPU-CPU data transfer overhead
[Graph: time (s) vs. matrix size N (500 to 3500) for partial-pivoting LU on the 6800 GT, 6800 Ultra, and 7800 GTX, with the CPU-GPU transfer time plotted separately]
Bandwidth efficiency
[Graph: bandwidth usage (GB/s) vs. matrix size N (500 to 4000) on the 6800 Ultra and 6800 GT; peak bandwidth 35.2 GB/s (6800 Ultra) and 28.8 GB/s (6800 GT)]
Faster than Moore’s law
[Graph: time (s) vs. matrix size N (500 to 3500) for ATLAS GETRF (partial pivot) and partial-pivoting LU on the 6800 GT, 6800 Ultra (Mar 2004), and 7800 GTX (Jun 2005)]
Application: Fluid Simulation
Solve parallel sub-problems (N = 2048)
Diagonally dominant
No pivoting required
15% faster than ATLAS on a Pentium IV 3.06 GHz
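Diagonal dominance is what makes pivoting unnecessary here: such systems can be LU-factored stably without row exchanges. A quick check, as a sketch (the function name is ours):

```python
import numpy as np

def is_diagonally_dominant(A):
    """True if |A(i,i)| >= sum over j != i of |A(i,j)| for every row.
    Diagonally dominant systems can be factored without pivoting,
    which is why the fluid solver can skip the pivot step entirely."""
    d = np.abs(np.diag(A))
    off = np.abs(A).sum(axis=1) - d
    return bool(np.all(d >= off))
```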
Limitations
Maximum matrix size: 4096 x 4096
Block-partitioned LU decomposition
Precision:
Single-precision floating point
Not 100% IEEE floating-point compliant
CPU-GPU data transfer overhead:
Small matrices
Graphics hardware advancements
Improved floating-point bandwidth: 4-component vs. single-component
Floating-point blending: use of non-programmable TFLOPs
Outline
LU Decomposition & Related Work
The potential of GPUs
LU-GPU algorithm
Results
Conclusions & Ongoing Work
Conclusions
Algorithm mapped to graphics pipeline
Novel mapping of row operations to rasterization
Stream computation
Blocking
Conclusions
Optimized with GPU architecture
Input data mapping
Fast pivoting
Conclusions
Performance comparable to industry-standard libraries
Relatively small development effort
GPUs are useful co-processors:
Scientific computations
Many applications
Conclusions
LU-GPU Open Source library available:
http://gamma.cs.unc.edu/LUGPULIB/
Ongoing work
Sorting on GPUs
Linear algebra: GPU-LAPACK / QR decomposition
Sorting on GPUs
Goal: utilize the high parallelism and memory bandwidth of GPUs for fast sorting
[Govindaraju et al., SIGMOD 2005]
GPUSort: open-source library
http://gamma.cs.unc.edu/GPUSORT/
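GPU sorting of this kind maps a data-parallel sorting network onto rendering passes. As an illustration (not the GPUSort implementation itself), here is a bitonic sorting network, whose compare-and-swap stages are fully independent within each round:

```python
def bitonic_sort(a):
    """In-place bitonic sort for a list whose length is a power of two.
    Every compare-and-swap within one (k, j) stage touches a disjoint
    pair of elements, so a GPU can run each stage as one parallel pass."""
    n = len(a)
    assert n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:
        j = k // 2
        while j >= 1:
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap if the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The fixed, data-independent comparison pattern is what makes sorting networks a natural fit for the GPU's rasterization-driven execution model.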
Sorting on GPUs
6 times faster than Quicksort on a 3.4 GHz Pentium IV PC!
Linear algebra
LAPACK-compliant library for GPUs
QR-decomposition in development (LAPACK SGEQRF)
Acknowledgements
Army Research Office, DARPA, National Science Foundation, Office of Naval Research, RDECOM, Intel Corporation, NVIDIA Corporation, UNC GAMMA Group
Thank you
For questions or comments:
[email protected]@[email protected]
http://gamma.cs.unc.edu/
http://gamma.cs.unc.edu/LUGPULIB/