LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Transcript of LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Page 1: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

The UNIVERSITY of NORTH CAROLINA at CHAPEL HILL

LU-GPU: Efficient Algorithms for Solving Dense Linear Systems on Graphics Hardware

Nico Galoppo, Naga K. Govindaraju, Michael Henson, Dinesh Manocha

http://gamma.cs.unc.edu/LU-GPU

Page 2: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Goals

Demonstrate advantages of mapping linear algebra routines to graphics hardware:

Performance

Growth rate

LAPACK-compliant set of linear algebra algorithms on graphics hardware

Page 3: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Outline

LU Decomposition & Related Work

The potential of GPUs

LU-GPU algorithm

Results

Conclusions & Ongoing Work

Page 4: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

LU decomposition

Sequence of row eliminations: Scale and add: A(i,j) = A(i,j) – A(i,k) A(k,j)

Input data mapping: 2 distinct memory regions

No data dependencies within a row elimination

Pivoting

Pointer-swap vs. data copy
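
The scale-and-add update on this slide can be sketched in plain Python. This is a minimal illustration only, assuming no pivoting and a nonzero pivot at every step; the function name lu_no_pivot is hypothetical and is not part of the LU-GPU library. The scale step A(i,k) = A(i,k) / A(k,k) is the usual multiplier computation, written out here because the slide folds it into the update.

```python
def lu_no_pivot(a):
    """Return L and U packed into one matrix (unit diagonal of L implied).

    Sketch only: assumes a square matrix and a nonzero pivot at each step.
    """
    n = len(a)
    lu = [list(map(float, row)) for row in a]
    for k in range(n - 1):
        pivot = lu[k][k]
        for i in range(k + 1, n):
            lu[i][k] /= pivot                  # scale: store the multiplier
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]       # add: A(i,j) = A(i,j) - A(i,k) A(k,j)
    return lu


example = lu_no_pivot([[4.0, 3.0], [6.0, 3.0]])
```

For a fixed k, every (i, j) update reads only row k and column k, which the step does not modify. That is the "no data dependencies within a row elimination" property the slide relies on.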

Page 5: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

LU decomposition

Theoretical complexity (partial pivoting): (2/3) n^3 + O(n^2)

Performance <-> Architecture

Order of operations

Memory access (latency)

Memory bandwidth
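
To make the leading (2/3) n^3 term concrete, here is a small sketch that converts a matrix size into an operation count and into a run time at a given sustained rate. The function names are hypothetical, the lower-order O(n^2) term is ignored, and a constant sustained rate is an assumption that real runs rarely satisfy.

```python
def lu_flops(n):
    """Leading-order floating point operation count for an n x n LU."""
    return 2.0 * n ** 3 / 3.0


def seconds_at_rate(n, gflops):
    """Run time in seconds if the factorization sustained `gflops` GFLOPS."""
    return lu_flops(n) / (gflops * 1e9)


# Hypothetical usage: time for N = 3500 at a sustained 2.88 GFLOPS.
estimate = seconds_at_rate(3500, 2.88)
```

Doubling n multiplies the work by eight, so the factors listed on this slide (operation order, memory latency, bandwidth) matter more as matrices grow.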

Page 6: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Outline

LU Decomposition & Related Work

The potential of GPUs

LU-GPU algorithm

Results

Conclusions & Ongoing Work

Page 7: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Commodity CPUs

LINPACK Benchmark:

Intel Pentium 4, 3.06 GHz: 2.88 GFLOPS

(Jack Dongarra, Oct 2005)

Page 8: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Streaming architectures

Specialized hardware

High bandwidth/compute ratio

Merrimac [Erez04]

Molecular modeling: 38 GFLOPS vs. 2.7 GFLOPS (P4)

$1,000/node

Imagine [Ahn04]

10.46 GFLOPS on QR-decomposition

Research

Page 9: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Outline

LU Decomposition & Related Work

The potential of GPUs

LU-GPU algorithm

Results

Conclusions & Ongoing Work

Page 10: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

CPU vs. GPU

Pentium EE 840
3.2 GHz Dual Core
230M Transistors
90 nm process
206 mm^2
2 x 1 MB Cache
25.6 GFLOPs
Price: $1,040

GeForce 7800 GTX
430 MHz
302M Transistors
110 nm process
326 mm^2
512 MB onboard memory
313 GFLOPs (shader)
1.3 TFLOPs (total)
Price: $450

Page 11: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

CPU vs. GPU (Henry Moreton, NVIDIA, Aug. 2005)

Metric                 PEE 840   7800GTX   GPU/CPU
Graphics GFLOPs           25.6      1300      50.8
Shader GFLOPs             25.6       313      12.2
Die area (mm^2)            206       326       1.6
Die area normalized        206       218       1.1
Transistors (M)            230       302       1.3
Power (W)                  130        65       0.5
GFLOPS/mm                  0.1       6.0      47.9
GFLOPS/tr                  0.1       4.3      38.7
GFLOPS/W                   0.2      20.0     101.6
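
The GPU/CPU column can be recomputed from the two device columns. The sketch below does that arithmetic; the dictionary keys and helper name are made up for illustration. It assumes the efficiency rows use the graphics GFLOPs figure and the normalized die area, which is what reproduces the GPU values printed on the slide.

```python
# Values copied from the slide; key names are illustrative only.
cpu = {"graphics_gflops": 25.6, "shader_gflops": 25.6, "die_mm2": 206.0,
       "die_norm_mm2": 206.0, "transistors_m": 230.0, "power_w": 130.0}
gpu = {"graphics_gflops": 1300.0, "shader_gflops": 313.0, "die_mm2": 326.0,
       "die_norm_mm2": 218.0, "transistors_m": 302.0, "power_w": 65.0}

# Plain GPU/CPU ratios for the first six rows.
ratios = {key: gpu[key] / cpu[key] for key in cpu}


def efficiency_ratio(resource):
    """GPU/CPU ratio of graphics GFLOPs per unit of `resource`."""
    gpu_eff = gpu["graphics_gflops"] / gpu[resource]
    cpu_eff = cpu["graphics_gflops"] / cpu[resource]
    return gpu_eff / cpu_eff


per_mm = efficiency_ratio("die_norm_mm2")
per_transistor = efficiency_ratio("transistors_m")
per_watt = efficiency_ratio("power_w")
```

With these inputs the per-area ratio comes out near 48.0 rather than the 47.9 on the slide, presumably a rounding difference in the original figures.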

Page 12: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

CPU vs. GPU: Bandwidth

[Diagram: memory hierarchy and bandwidth]

CPU (3 GHz), 2 x 1 MB Cache

System Memory (2+ GB): 6.4 GB/s bandwidth

AGP Memory (512 MB)

PCI-E Bus (4 GB/s)

Video Memory (512 MB): 35.2 GB/s bandwidth

GPU (500 MHz)

Page 13: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Bandwidth

Large, high-bandwidth memory: 512 MB video memory vs. 2 MB L2 cache on CPUs

High memory to compute clock ratio – reduces memory stalls

Page 14: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Graphics pipeline

[Diagram: pipeline stages, all connected to memory]

vertex: programmable vertex processing (fp32)

polygon: polygon setup, culling, rasterization

pixel: programmable per-pixel math (fp32)

texture: per-pixel texture, fp16 blending

image: Z-buf, fp16 blending, anti-alias (MRT)

memory

Page 15: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Stream processor (non-graphics) (David Kirk, NVIDIA, May '05)

[Diagram: the same pipeline relabeled as a stream processor, all stages connected to memory]

data: programmable MIMD processing (fp32)

setup / rasterizer: lists, SIMD "rasterization"

data: programmable SIMD processing (fp32)

data: data fetch, fp16 blending

data: predicated write, fp16 blend, multiple output

memory

Page 16: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Potential of graphics processors

Commodity horsepower

Parallel computation

Bandwidth

Programmable graphics pipeline

Stream processor

Exploit large growth rate

Page 17: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Exploiting technology moving faster than Moore's law

Source: Anselmo Lastra

[Chart: CPU growth rate vs. GPU growth rate]

Page 18: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

General purpose computing on GPUs

Physical Simulation

Fluid Flow [Fan et al. 2004]

FEM [Rumpf and Strzodka 2001]

Cloud Dynamics [Harris et al. 2003]

Sparse Linear Algebra

Operators [Krüger and Westermann 2003]

Iterative Solvers [Bolz et al. 2003]

Page 19: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

General purpose computing on GPUs

Matrix-Matrix Multiplication

Fixed graphics pipeline, fixed-point arithmetic [Larsen and McAllister 2001]

Floating-point (SP) [Fatahalian et al. 2004]

High-level API

BrookGPU [Buck et al. 2004]

Sh [McCool et al. 2004]

Page 20: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Outline

LU Decomposition & Related Work

The potential of GPUs

LU-GPU algorithm

Results

Conclusions & Ongoing Work

Page 21: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Motivation for LU-GPU

LU decomposition maps well:

Stream program

Few data dependencies

Pivoting

Parallel pivot search

Exploit large memory bandwidth

Page 22: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

GPU based algorithms

Data representation

Algorithm mapping

Page 23: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Data representation

Texture mapping hardware: Input data mapping

Page 24: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Data representation

Matrix elements stored in 2D texture memory

One-to-one mapping

Texture memory = on-board memory

Exploit bandwidth

Avoid CPU-GPU data transfer

Page 25: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

GPU based algorithms

Data representation

Algorithm mapping

Stream computation

Input data mapping

Fast row swaps

Page 26: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Algorithm mapping

Texture mapping hardware: Input data mapping

Page 27: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Stream computation

Rasterize quadrilaterals

Generates computation stream

Invokes SIMD units

Rasterization simulates blocking

Rasterization pass = row elimination

Alternating memory regions
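
The idea of one rasterization pass per row elimination, with reads and writes going to alternating memory regions, can be imitated on the CPU. The sketch below is an assumption-laden illustration and not the authors' implementation: it uses two plain Python buffers, treats each step k as one pass that reads only from the source buffer and writes the rectangle of rows and columns beyond k into the other buffer, and gathers the finished rows and columns from whichever buffer last wrote them. The name lu_ping_pong is hypothetical, and no pivoting is done.

```python
def lu_ping_pong(a):
    """LU without pivoting using two alternating buffers (sketch only)."""
    n = len(a)
    bufs = [[list(map(float, row)) for row in a],
            [[0.0] * n for _ in range(n)]]
    for k in range(n - 1):
        src, dst = bufs[k % 2], bufs[(k + 1) % 2]
        # One "pass": each output element depends only on the source buffer,
        # so all (i, j) in the rectangle could be computed in parallel.
        for i in range(k + 1, n):
            m = src[i][k] / src[k][k]
            dst[i][k] = m                          # multiplier (entry of L)
            for j in range(k + 1, n):
                dst[i][j] = src[i][j] - m * src[k][j]
    # Row k of U was last written into bufs[k % 2]; column k of L was
    # written into bufs[(k + 1) % 2]. Gather both into one packed matrix.
    lu = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(k, n):
            lu[k][j] = bufs[k % 2][k][j]
        for i in range(k + 1, n):
            lu[i][k] = bufs[(k + 1) % 2][i][k]
    return lu


packed = lu_ping_pong([[4.0, 3.0], [6.0, 3.0]])
```

Because a pass never reads the buffer it writes, the per-element work has no ordering constraints, which is what lets the rasterizer hand it to the SIMD units in any order.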

Page 28: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Input data mapping

Texture mapping hardware: Input data mapping

Page 29: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Input data mapping

Dedicated texture mapping hardware

Traditionally for color interpolation

Map input matrix elements to output elements

Eliminates computation of memory locations

25% performance improvement

Page 30: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Pivoting

Main issues:

Pivot search

Row/column swapping

Page 31: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Pivoting

Texture mapping hardware: Input data mapping

Page 32: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Partial pivoting

Fast row swap

Data copy: mapped rasterization

Texture mapping hardware

High memory bandwidth

Improvement over pointer swapping

[Diagram: Input -> TEXTURE MAPPING HARDWARE -> Output]
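
A CPU-side sketch of partial pivoting with the row swap done as a data copy, as opposed to exchanging row pointers. It only illustrates the two options named on the slide; the function name lu_partial_pivot and the returned permutation list are assumptions for this example, and the matrix is assumed nonsingular.

```python
def lu_partial_pivot(a):
    """Return (packed LU, perm) with rows of `a` taken in order `perm`."""
    n = len(a)
    lu = [list(map(float, row)) for row in a]
    perm = list(range(n))
    for k in range(n - 1):
        # Pivot search: largest magnitude in column k, on or below the diagonal.
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if p != k:
            # Data copy: overwrite the contents of both rows in place.
            # A pointer swap would instead be: lu[k], lu[p] = lu[p], lu[k]
            lu[k][:], lu[p][:] = lu[p][:], lu[k][:]
            perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] /= lu[k][k]
            m = lu[i][k]
            for j in range(k + 1, n):
                lu[i][j] -= m * lu[k][j]
    return lu, perm


packed, order = lu_partial_pivot([[0.0, 2.0, 1.0],
                                  [1.0, 1.0, 1.0],
                                  [2.0, 1.0, 0.0]])
```

On a CPU the pointer swap is the cheaper choice. The slide's claim is that on the GPU the copy, routed through the texture mapping hardware at high memory bandwidth, is the better option.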

Page 33: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Full pivoting

Fast column/row swap

Parallel pivot search

Divide and conquer approach

[Diagram: partial pivoting vs. full pivoting]
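
One way to picture a divide-and-conquer pivot search is a pairwise reduction: each pass halves the number of candidates by comparing every surviving slot with a partner. The sketch below is a sequential stand-in written under that assumption; the slides do not give the exact reduction scheme, and the name parallel_pivot_search is hypothetical. The input is assumed to be a non-empty rectangular block (the remaining submatrix in full pivoting).

```python
def parallel_pivot_search(sub):
    """Return (row, col) of the largest-magnitude entry of `sub`."""
    ncols = len(sub[0])
    # Each candidate carries its magnitude and its flattened position.
    cand = [(abs(v), r * ncols + c)
            for r, row in enumerate(sub)
            for c, v in enumerate(row)]
    while len(cand) > 1:
        half = (len(cand) + 1) // 2
        nxt = []
        # One reduction pass: the comparisons are independent of each other,
        # so on parallel hardware they could all run at once.
        for i in range(half):
            j = i + half
            if j < len(cand) and cand[j][0] > cand[i][0]:
                nxt.append(cand[j])
            else:
                nxt.append(cand[i])
        cand = nxt
    return divmod(cand[0][1], ncols)


location = parallel_pivot_search([[1.0, -7.0, 3.0],
                                  [4.0, 2.0, -5.0]])
```

The number of passes grows with the logarithm of the candidate count, which is what makes a full-pivot search over the whole remaining submatrix affordable.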

Page 34: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Outline

LU Decomposition & Related Work

The potential of GPUs

LU-GPU algorithm

Results

Conclusions & Ongoing Work

Page 35: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Benchmarks

GPU          SIMD units   Core clock   Memory   Memory clock
6800 GT      12           350 MHz      256 MB   900 MHz
6800 Ultra   16           425 MHz      256 MB   1100 MHz
7800 GTX     24           430 MHz      256 MB   1200 MHz

Commodity CPU

3.4 GHz Pentium IV with Hyper-Threading, 1 MB L2 cache

LAPACK sgetrf() (blocked algorithm, ATLAS library)

LAPACK sgetc2() (SSE-optimized IMKL library)

Page 36: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Results: No pivoting

[Chart: time (s, 0 to 9) vs. matrix size N (1000 to 3500). Series: ATLAS GETRF (Partial Pivot); Ultra 6800 LU (no pivoting); GT 6800 LU (no pivoting); 7800 LU (no pivoting)]

Page 37: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Results: Partial pivoting

[Chart: time (s, 0 to 12) vs. matrix size N (500 to 3500). Series: ATLAS GETRF (Partial Pivot); GT 6800 Partial Pivot; Ultra 6800 Partial Pivot; 7800 Partial Pivot]

Page 38: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Results: Full Pivoting

[Chart: time (s, 0 to 250) vs. matrix size N (500 to 3500). Series: Ultra 6800 Full Pivot; LAPACK sgetc2 (IMKL); 7800 Full Pivot]

Page 39: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Results: Number of computational units

[Chart: time (s, 0 to 45) vs. matrix size N (500 to 4000) on the 6800 Ultra (no pivoting), one curve per number of computational units: 4, 8, 12, 16. Annotations: (Jun 2003), (Mar 2004)]

Page 40: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

GPU-CPU data transfer overhead

[Chart: time (s, 0 to 12) vs. matrix size N (500 to 3500). Series: GT 6800 Partial Pivot; Ultra 6800 Partial Pivot; 7800 Partial Pivot; CPU-GPU transfer]

Page 41: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Bandwidth efficiency

[Chart: bandwidth usage (GB/s, 0 to 35) vs. matrix size N (500 to 4000). Series: 6800 Ultra; 6800 GT]

6800 Ultra Peak Bandwidth: 35.2 GB/s

6800 GT Peak Bandwidth: 28.8 GB/s
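
The two peak figures are consistent with memory clock times bus width. The sketch below assumes a 256-bit memory bus, which is not stated in the slides, and treats the memory clocks from the Benchmarks table (900 MHz and 1100 MHz) as effective transfer rates. The function name is hypothetical.

```python
BUS_WIDTH_BITS = 256  # assumption: not given anywhere in the slides


def peak_bandwidth_gb_s(memory_clock_mhz, bus_width_bits=BUS_WIDTH_BITS):
    """Peak memory bandwidth in GB/s for an effective clock in MHz."""
    bytes_per_transfer = bus_width_bits / 8
    return memory_clock_mhz * 1e6 * bytes_per_transfer / 1e9


gt_peak = peak_bandwidth_gb_s(900)       # 6800 GT memory clock
ultra_peak = peak_bandwidth_gb_s(1100)   # 6800 Ultra memory clock
```

Under the same assumption, dividing a measured bandwidth from the chart by one of these peaks gives the efficiency the slide title refers to.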

Page 42: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Faster than Moore’s law

[Chart: time (s, 0 to 12) vs. matrix size N (500 to 3500). Series: ATLAS GETRF (Partial Pivot); GT 6800 Partial Pivot; Ultra 6800 Partial Pivot; 7800 Partial Pivot. Annotations: (Mar 2004), (Jun 2005)]

Page 43: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Application: Fluid Simulation

Solve parallel sub-problems

N=2048

Diagonally-dominant

No pivoting required

15% faster than ATLAS on Pentium IV 3.06 GHz
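
Whether a system qualifies for the no-pivoting path can be checked before factoring. The sketch below tests strict row diagonal dominance, a common sufficient condition for elimination without pivoting to be safe. The slide only says "diagonally-dominant", so the strict, row-wise reading and the function name are assumptions.

```python
def is_strictly_diagonally_dominant(a):
    """True if each |a[i][i]| exceeds the sum of the other |a[i][j]| in row i."""
    for i, row in enumerate(a):
        off_diagonal = sum(abs(v) for j, v in enumerate(row) if j != i)
        if abs(row[i]) <= off_diagonal:
            return False
    return True


dominant = is_strictly_diagonally_dominant([[4.0, 1.0, 2.0],
                                            [2.0, 5.0, 1.0],
                                            [1.0, 2.0, 6.0]])
not_dominant = is_strictly_diagonally_dominant([[1.0, 2.0],
                                                [3.0, 4.0]])
```

The check costs O(n^2), which is negligible next to the (2/3) n^3 factorization it guards.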

Page 44: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Limitations

Maximum matrix size: 4096x4096

Block-partitioned LU decomposition

Precision

Single precision floating point

Not 100% IEEE floating point compliant

CPU-GPU data transfer overhead

Small matrices
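
A rough feel for the transfer overhead comes from dividing the matrix size in bytes by the bus rate. The sketch below uses the 4 GB/s PCI-E figure from the bandwidth slide as a peak rate and 4 bytes per single-precision element. It is a best-case, one-way estimate under those assumptions, not a measurement, and the function name is hypothetical.

```python
def transfer_seconds(n, bus_gb_s=4.0, bytes_per_element=4):
    """Best-case one-way transfer time for an n x n matrix, in seconds."""
    return n * n * bytes_per_element / (bus_gb_s * 1e9)


largest = transfer_seconds(4096)   # maximum matrix size from this slide
small = transfer_seconds(500)
```

The transfer grows as n^2 while the factorization grows as n^3, so the overhead is a larger share of the total for small matrices, which is the case the slide singles out.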

Page 45: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Graphics hardware advancements

Improved floating point bandwidth

4 component vs. single component

Floating point blending

Use of non-programmable TFLOPs

Page 46: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Outline

LU Decomposition & Related Work

The potential of GPUs

LU-GPU algorithm

Results

Conclusions & Ongoing Work

Page 47: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Conclusions

Algorithm mapped to graphics pipeline

Novel mapping of row operations to rasterization

Stream computation

Blocking

Page 48: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Conclusions

Optimized with GPU architecture

Input data mapping

Fast pivoting

Page 49: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Conclusions

Performance

Comparable to industry-standard libraries

Relatively small development effort

GPUs are useful co-processors

Scientific computations

Many applications

Page 50: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Conclusions

LU-GPU Open Source library available:

http://gamma.cs.unc.edu/LUGPULIB/

Page 51: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Ongoing work

Sorting on GPUs

Linear algebra: GPU-LAPACK / QR decomposition

Page 52: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Sorting on GPUs

Goal: Utilize the high parallelism and memory bandwidth of GPUs for fast sorting

[Govindaraju et al., SIGMOD05]

GPUSort: Open Source library [http://gamma.cs.unc.edu/GPUSORT/]

Page 53: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Sorting on GPUs

6 times faster than Quicksort on 3.4 GHz Pentium IV PC!

Page 54: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Linear algebra

LAPACK-compliant library for GPUs

QR-decomposition in development (LAPACK SGEQRF)

Page 55: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Acknowledgements

Army Research Office

DARPA

National Science Foundation

Office of Naval Research

RDECOM

Intel Corporation

NVIDIA Corporation

UNC GAMMA Group

Page 56: LU-GPU: Efficient Algorithms for Solving Dense Linear ...

Thank you

For questions or comments:

http://gamma.cs.unc.edu/

http://gamma.cs.unc.edu/LUGPULIB/