Lithe: Enabling Efficient Composition of Parallel Libraries

42
1 BERKELEY PAR LAB Lithe: Enabling Efficient Composition of Parallel Libraries Heidi Pan, Benjamin Hindman, Krste Asanović HotPar Berkeley, CA March 31, 2009 [email protected] {benh, krste}@eecs.berkeley.edu Massachusetts Institute of Technology UC Berkeley

description

Lithe: Enabling Efficient Composition of Parallel Libraries. Heidi Pan , Benjamin Hindman, Krste Asanovi ć. [email protected]  {benh, krste}@eecs.berkeley.edu Massachusetts Institute of Technology  UC Berkeley. HotPar  Berkeley, CA  March 31, 2009. Functionality :. or. or. or. - PowerPoint PPT Presentation

Transcript of Lithe: Enabling Efficient Composition of Parallel Libraries

Page 1: Lithe: Enabling Efficient Composition of Parallel Libraries

1

BERKELEY PAR LAB

Lithe: Enabling Efficient Composition of Parallel Libraries

Heidi Pan, Benjamin Hindman, Krste Asanović

HotPar Berkeley, CA March 31, 2009

[email protected] {benh, krste}@eecs.berkeley.edu

Massachusetts Institute of Technology UC Berkeley

Page 2: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

2

How to Build Parallel Apps?

Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7Hardware

OS

App

Resource Management:

Need both programmer productivity and performance!

Functionality: or or or

Page 3: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

3

Composability is Key to Productivity

Functional Composability

sort

App 1 App 2

code reuse

sort

same library implementation, different appsmodularity

App

same app, different library implementations

bubblesort

quicksort

Page 4: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

4

Composability is Key to Productivity

Performance Composability

fast

fast

fast+

faster

fast

fast(er)+

Page 5: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

5

Talk Roadmap

Problem: Efficient parallel composability is hard! Solution:

Harts Lithe

Evaluation

Page 6: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

6

Motivational Example

Sparse QR Factorization(Tim Davis, Univ of Florida)

OS

MKL

OpenMP

System Stack

Hardware

TBB

SPQRFrontal MatrixFactorization

ColumnElimination

Tree

Software Architecture

Page 7: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

7

Out-of-the-Box PerformanceT

ime

(sec

)

Performance of SPQR on 16-core Machine

Out-of-the-Box

Input Matrix

0

5

10

15

20

25

deltaX0

0.5

1

1.5

2

2.5

3

3.5

landmark60

65

70

75

80

85

ESOC0

200

400

600

800

1000

1200

Rucci0

0.5

1

1.5

2

2.5

3

3.5

landmark

sequential

Page 8: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

8

Out-of-the-Box Libraries Oversubscribe the Resources

OS

TBB OpenMP

Hardware

Core

0Core

1Core

2Core

3

virtualized kernel threads

A[0]A[1]A[2]

A[10]A[11]A[12] Prefetch

+Y

+Z

+X(unit stride)NY

NZ

NX CY

CZ

CX

TYTX

Cache Blocking

A[0]A[1]A[2]

A[10]A[11]A[12] Prefetch

+Y

+Z

+X(unit stride)NY

NZ

NX CY

CZ

CX

TYTX

Cache Blocking

A[0]A[1]A[2]

A[10]A[11]A[12] Prefetch

+Y

+Z

+X(unit stride)NY

NZ

NX CY

CZ

CX

TYTX

Cache Blocking

A[0]A[1]A[2]

A[10]A[11]A[12] Prefetch

+Y

+Z

+X(unit stride)NY

NZ

NX CY

CZ

CX

TYTX

Cache Blocking

A[0]A[1]A[2]

A[10]A[11]A[12] Prefetch

+Y

+Z

+X(unit stride)NY

NZ

NX CY

CZ

CX

TYTX

Cache Blocking

A[0]A[1]A[2]

A[10]A[11]A[12] Prefetch

+Y

+Z

+X(unit stride)NY

NZ

NX CY

CZ

CX

TYTX

Cache Blocking

A[0]A[1]A[2]

A[10]A[11]A[12] Prefetch

+Y

+Z

+X(unit stride)NY

NZ

NX CY

CZ

CX

TYTX

Cache Blocking

A[0]A[1]A[2]

A[10]A[11]A[12] Prefetch

+Y

+Z

+X(unit stride)NY

NZ

NX CY

CZ

CX

TYTX

Cache Blocking

Page 9: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

9

MKL Quick Fix

Using Intel MKL with Threaded Applicationshttp://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm

If more than one thread calls Intel MKL and thefunction being called is threaded, it is importantthat threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment.

Page 10: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

10

Sequential MKL in SPQR

OS

TBB OpenMP

Hardware

Core

0Core

1Core

2Core

3

Page 11: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

11

0

5

10

15

20

25

deltaX0

0.5

1

1.5

2

2.5

3

3.5

landmark60

65

70

75

80

85

ESOC0

200

400

600

800

1000

1200

Rucci

Sequential MKL PerformanceT

ime

(sec

)

Performance of SPQR on 16-core Machine

Out-of-the-Box

0

5

10

15

20

25

deltaX0

0.5

1

1.5

2

2.5

3

3.5

landmark60

65

70

75

80

85

ESOC0

200

400

600

800

1000

1200

Rucci

Sequential MKL

Input Matrix

Page 12: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

12

SPQR Wants to Use Parallel MKL

No task-level parallelism!

Want to exploit matrix-level parallelism.

Page 13: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

13

Share Resources Cooperatively

OS

TBB OpenMP

Hardware

Tim Davis manually tunes libraries to effectively partition the resources.

Core

0Core

1

TBB_NUM_THREADS = 2

Core

2Core

3

OMP_NUM_THREADS = 2

Page 14: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

14

0

5

10

15

20

25

deltaX0

0.5

1

1.5

2

2.5

3

3.5

landmark60

65

70

75

80

85

ESOC0

200

400

600

800

1000

1200

Rucci

Manually Tuned PerformanceT

ime

(sec

)

Performance of SPQR on 16-core Machine

Out-of-the-Box Sequential MKL

0

5

10

15

20

25

deltaX0

0.5

1

1.5

2

2.5

3

3.5

landmark60

65

70

75

80

85

ESOC0

200

400

600

800

1000

1200

Rucci

Manually Tuned

Input Matrix

Page 15: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

15

Manual Tuning Cannot Share Resources Effectively

Give resources to OpenMP

Give resources to TBB

Page 16: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

16

Manual Tuning Destroys Functional Composability

Tim Davis

LAPACKAx=bMKL

OpenMP

OMP_NUM_THREADS = 4

Page 17: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

17

Manual Tuning Destroys Performance Composability

SPQR

MKLv1

MKLv2

MKLv3

App

SPQR

0 01 2 3

Page 18: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

18

Talk Roadmap

Problem: Efficient parallel composability is hard! Solution:

Harts: better resource abstraction Lithe: framework for sharing resources

Evaluation

Page 19: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

19

Virtualized Threads are Bad

OS

Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7

App 1 (TBB) App 1 (OpenMP) App 2

Different codes compete unproductively for resources.

Page 20: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

20

Space-Time Partitions aren’t Enough

OS

Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7

Space-time partitions isolate diff apps.

MKL

OpenMPTBB

SPQR App 2

Partition 1 Partition 2

What to do within an app?

Page 21: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

21

Harts: Hardware Thread Contexts

Represent real hw resources. Requested, not created. OS doesn’t manage harts for app.

OS

Core 0 Core 1 Core 2 Core 3 Core 4 Core 5 Core 6 Core 7

MKL

OpenMPTBB

SPQR

Harts

Page 22: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

22

Sharing Harts

OS

TBB OpenMP

Hardware

time

Hart 0 Hart 1 Hart 2 Hart 3

Partition

Page 23: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

23

Cooperative Hierarchical Schedulers

OMP

TBB

Cilk

Ct

application call graph

Cilk

library (scheduler) hierarchy

TBB Ct

OpenMP

Modular: Each piece of the app scheduled independently.

Hierarchical: Caller gives resources to callee to execute on its behalf.

Cooperative: Callee gives resources back to caller when done.

Page 24: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

24

A Day in the Life of a Hart

Cilk

TBB Ct

OpenMP

TBB Sched: next?

time

TBB Tasks

executeTBB task

TBB Sched: next?

execute TBB task

TBB Sched: next?nothing left to do, give hart back to parent

Cilk Sched: next?don‘t start new task, finish existing one first

Ct Sched: next?

Page 25: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

25

Child Scheduler

Parent Scheduler

Standard Lithe ABI

CilkLithe Scheduler

interface for sharing harts

TBBLithe Scheduler

Caller

Callee

returncall

returncall

interface for exchanging values

Analogous to function call ABI for enabling interoperable codes.

TBBLithe Scheduler

OpenMPLithe Scheduler

unregisterenter yield request register

unregisterenter yield request register

Mechanism for sharing harts, not policy.

Page 26: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

26

OS

TBB OpenMP

Hardware

Lithe Runtime

harts

current scheduler

OS

TBBLithe OpenMPLithe

Hardware

Lithe

TBBLithe

scheduler hierarchy

enter yield request register unregister

OpenMPLithe

enter yield request register unregister

yield

current scheduler

Page 27: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

27

}

:

:

register(OpenMPLithe);

Register / Unregister

TBBLithe Scheduler

OpenMPLithe Scheduler

unregisterenter yield request register

matmult(){

time

Register dynamically adds the new scheduler to the hierarchy.

unregisterenter yield request registerregister unregister

unregister(OpenMPLithe);

Page 28: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

28

}

:

register(OpenMPLithe);

Request

TBBLithe Scheduler

OpenMPLithe Scheduler

unregisterenter yield request register

matmult(){

time

unregisterenter yield request registerrequest

unregister(OpenMPLithe);

Request asks for more harts from the parent scheduler.

request(n);

Page 29: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

29

:

:

Enter / Yield

TBBLithe Scheduler

OpenMPLithe Scheduler

unregisterenter yield request register

time

unregisterenter yield request register

enter yield();

yield enter(OpenMPLithe);

Enter/Yield transfers additional harts between the parent and child.

Page 30: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

30

SPQR with Lithe

time

reg

enterenter

enter

yieldyield

MKL

OpenMPLithe

TBBLithe

SPQR

unreg

yield

matmult

req

Page 31: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

31

SPQR with Lithe

time

MKL

OpenMPLithe

TBBLithe

SPQR

unreg unreg unreg unreg

reg reg reg reg matmult matmult matmult matmult

req req req req

Page 32: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

32

Talk Roadmap

Problem: Efficient parallel composability is hard! Solution:

Harts Lithe

Evaluation

Page 33: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

33

Implementation

Harts

TBBLithe OpenMPLithe

Lithe

Harts: simulated using pinned Pthreads on x86-Linux

Lithe: user-level library (register, unregister, request, enter, yield, ...)

TBBLithe

OpenMPLithe (GCC4.4)

~600 lines of C & assembly

~2000 lines of C, C++, assembly

~1500 / ~8000 relevant lines added/removed/modified

~1000 / ~6000 relevant lines added/removed/modified

Page 34: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

34

No Lithe Overhead w/o Composing

TBBLithe Performance (µbench included with release)

OpenMPLithe Performance (NAS parallel benchmarks)

tree sum preorder fibonacci

TBBLithe 54.80ms 228.20ms 8.42ms

TBB 54.80ms 242.51ms 8.72ms

conjugate gradient (cg) LU solver (lu) multigrid (mg)

OpenMPLithe 57.06s 122.15s 9.23s

OpenMP 57.00s 123.68s 9.54s

All results on Linux 2.6.18, 8-core Intel Clovertown.

Page 35: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

35

Performance Characteristics of SPQR (Input = ESOC)

1 2 3 4 5 6 7 88

64

2

50

100

150

200

250

300

350

300-350

250-300

200-250

150-200

100-150

50-100

NUM_OMP_THREADS NU

M_T

BB

_TH

RE

AD

S

Tim

e (s

ec)

Page 36: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

36

Performance Characteristics of SPQR (Input = ESOC)

1 2 3 4 5 6 7 88

64

2

50

100

150

200

250

300

350

300-350

250-300

200-250

150-200

100-150

50-100

NUM_OMP_THREADS NU

M_T

BB

_TH

RE

AD

S

Tim

e (s

ec)

SequentialTBB=1, OMP=1

172.1 sec

Page 37: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

37

Performance Characteristics of SPQR (Input = ESOC)

1 2 3 4 5 6 7 88

64

2

50

100

150

200

250

300

350

300-350

250-300

200-250

150-200

100-150

50-100

NUM_OMP_THREADS NU

M_T

BB

_TH

RE

AD

S

Tim

e (s

ec)

Out-of-the-BoxTBB=8, OMP=8

111.8 sec

Page 38: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

38

Performance Characteristics of SPQR (Input = ESOC)

1 2 3 4 5 6 7 88

64

2

50

100

150

200

250

300

350

300-350

250-300

200-250

150-200

100-150

50-100

NUM_OMP_THREADS NU

M_T

BB

_TH

RE

AD

S

Tim

e (s

ec)

Out-of-the-Box

Manually Tuned70.8 sec

Page 39: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

39

Performance of SPQR with LitheT

ime

(sec

)

Out-of-the-Box Lithe

Input Matrix

0

20

40

60

80

100

120

ESOC

Manually Tuned

0

5

10

15

20

25

30

deltaX0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

landmark0

100

200

300

400

500

600

Rucci

Page 40: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

40

Future Work

SPQR

TBBLitheenter yield req reg unreg

OpenMPLitheenter yield req reg unreg

CtLitheenter yield req reg unreg

CilkLitheenter yield req reg unreg

Page 41: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

41

Conclusion

Composability essential for parallel programming to become widely adopted.

Lithe project contributions Harts: better resource model for parallel programming Lithe: enables parallel codes to interoperate by

standardizing the sharing of harts

MKL

OpenMPTBB

SPQR

resource management

functionality

0 1 2 3

Parallel libraries need to share resources cooperatively.

Page 42: Lithe: Enabling Efficient Composition of Parallel Libraries

BERKELEY PAR LAB

42

Acknowledgements

We would like to thank George Necula and the rest of BerkeleyPar Lab for their feedback on this work.

Research supported by Microsoft (Award #024263 ) and Intel(Award #024894) funding and by matching funding by U.C.Discovery (Award #DIG07-10227). This work has also beenin part supported by a National Science Foundation GraduateResearch Fellowship. Any opinions, findings, conclusions, orrecommendations expressed in this publication are those of theauthors and do not necessarily reflect the views of the NationalScience Foundation. The authors also acknowledge the support of the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program.