BERKELEY PAR LAB. Lithe: Composing Parallel Software Efficiently. Heidi Pan, PhD Thesis Defense, April 9, 2010, Massachusetts Institute of Technology. PhD Thesis Committee: Saman Amarasinghe, Arvind, Krste Asanović.

Page 1: Lithe: Composing Parallel Software Efficiently

BERKELEY PAR LAB

Heidi Pan

PhD Thesis Defense, April 9, 2010

Massachusetts Institute of Technology

PhD Thesis Committee: Saman Amarasinghe, Arvind, Krste Asanović

Page 2: Composition of Libraries

Libraries: AI, Audio, Graphics, Physics

game() {
  forall frames:
    AI.compute();
    Audio.play();
    Graphics.render();
    Physics.calc();
    :
}

Efficiency: Libraries implemented differently to suit their own needs.

Performance: Leverage optimized library performance.

Productivity: Don't want to implement and understand everything.

Page 3: Talk Roadmap

Problem: Efficient parallel composition is hard!
Solution
Implementation
Evaluation
Synchronization
Future Work

Page 4: Real-World Parallel Composition Example

Sparse QR Factorization (Tim Davis, Univ. of Florida)

[Figure: software architecture of SPQR (column elimination tree; frontal matrix factorization) and its system stack (SPQR, TBB, MKL, OpenMP, OS, hardware)]

Page 5: Out-of-the-Box Performance

[Figure: performance of SPQR on a 16-core machine. x-axis: input matrix; y-axis: time (sec). Series: sequential, out-of-the-box]

Page 6: Libraries Want to Manage Parallelism

[Figure: the SPQR application spawns tbb::task()s; the TBB library runtime schedules them onto Core 0 through Core 3]

Page 7: Multiple Libraries Oversubscribe the Resources

tbb::task() {
  matmult();
  :
}

matmult() {
  #pragma omp parallel
  :
}

[Figure: TBB and OpenMP each create virtualized OS threads, which the OS multiplexes onto Core 0 through Core 3; matmult() runs in several TBB tasks at once, each opening its own OpenMP parallel region]
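The oversubscription on this slide comes down to simple arithmetic. A minimal sketch, assuming each library sizes its thread pool to the core count on its own; the function names and the sizing policy are illustrative assumptions, not taken from TBB or OpenMP:

```cpp
#include <cstddef>

// Hypothetical helper: OS threads alive at once when each of `outer_workers`
// task workers opens a parallel region with a team of `inner_team` threads.
// The calling worker counts as one member of its own team.
constexpr std::size_t total_threads(std::size_t outer_workers,
                                    std::size_t inner_team) {
    return outer_workers * inner_team;
}

// Average number of threads the OS must time-multiplex onto each core.
constexpr double threads_per_core(std::size_t threads, std::size_t cores) {
    return static_cast<double>(threads) / static_cast<double>(cores);
}
```

On the four-core machine drawn here, four task workers that each open a four-thread parallel region leave 16 threads competing for 4 cores.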

Page 8: MKL Quick Fix

Using Intel MKL with Threaded Applications
http://www.intel.com/support/performancetools/libraries/mkl/sb/CS-017177.htm

"If more than one thread calls Intel MKL and the function being called is threaded, it is important that threading in Intel MKL be turned off. Set OMP_NUM_THREADS=1 in the environment."

Page 9: Sequential MKL in SPQR

[Figure: TBB and OpenMP over the OS and hardware (Core 0 through Core 3), with MKL run sequentially]

Page 10: Sequential MKL Performance

[Figure: performance of SPQR on a 16-core machine. x-axis: input matrix; y-axis: time (sec). Series: out-of-the-box, sequential MKL]

Page 11: SPQR Wants to Use Parallel MKL

No task-level parallelism! Want to exploit matrix-level parallelism.

Page 12: Share Resources Cooperatively

Tim Davis manually tunes libraries to effectively partition the resources.

[Figure: TBB runs on Core 0 and Core 1 (TBB_NUM_THREADS = 2); OpenMP runs on Core 2 and Core 3 (OMP_NUM_THREADS = 2)]

Page 13: Manually Tuned Performance

[Figure: performance of SPQR on a 16-core machine. x-axis: input matrix; y-axis: time (sec). Series: out-of-the-box, sequential MKL, manually tuned]

Page 14: Manual Tuning Destroys Black Box Abstractions

[Figure: Tim Davis; LAPACK (Ax=b); MKL; OpenMP; OMP_NUM_THREADS = 4]

Page 15: Manual Tuning Destroys Code Reuse and Modular Updates

[Figure: SPQR with MKL v1, MKL v2, and MKL v3; an App that uses SPQR; resources 0 to 3]

Page 16: Talk Roadmap

Problem: Efficient parallel composability is hard!
Solution: Lithe
  Primitives
  Resource Sharing Model
  Standard Interface
  Runtime
Implementation
Evaluation
Synchronization
Future Work

Page 17: Hart Primitive

OS Threads:
  Create as many threads as wanted.
  Libraries implicitly share cores.
  Threads = Resource + Programming Abstraction

Harts (Hardware Thread Contexts):
  Allocated a finite number of harts.
  Libraries explicitly share cores.
  Harts = Resource Abstraction

[Figure: in both cases, an application with Library A, Library B, and Library C over hardware Core 0 through Core 3]
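The contrast between unlimited threads and a finite allocation of harts can be sketched as a fixed-size pool. This is a hypothetical model written for illustration (the class and method names are assumptions, not Lithe's API): a request is granted only while harts remain, and harts come back only when a holder yields them explicitly.

```cpp
#include <algorithm>
#include <cstddef>

// Hypothetical illustration of the hart model: a pool sized to the machine.
class HartPool {
public:
    explicit HartPool(std::size_t num_harts) : free_(num_harts) {}

    // Grant up to `wanted` harts, never more than are currently free.
    std::size_t request(std::size_t wanted) {
        std::size_t granted = std::min(wanted, free_);
        free_ -= granted;
        return granted;
    }

    // Hand harts back explicitly so another library can use them.
    void yield(std::size_t count) { free_ += count; }

    std::size_t free_harts() const { return free_; }

private:
    std::size_t free_;  // harts not currently held by any library
};
```

Unlike thread creation, a request here can be partially granted or refused, which is what forces libraries to share cores explicitly.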

Page 18: Using Harts

[Figure: timeline of one hart. The SPQR application spawns tbb::task()s into the TBB scheduler's queue inside the TBB library runtime; the hart alternates between the TBB scheduler (Schedule) and Execute Task]

Page 19: Sharing Harts Cooperatively

[Figure: timeline of harts being transferred back and forth between TBB and OpenMP, over the OS and hardware]

Page 20: Sharing Harts Hierarchically

Transfer of control coupled with transfer of resources.

tbb::task() {
  matmult() {
    #pragma omp parallel
    :
  }
  :
}

[Figure: application call graph hierarchy. task (parent, caller) calls matmult (child, callee). On the call, control passes from the TBB runtime scheduler to the OpenMP runtime scheduler; on the return, it passes back]

Page 21: Cooperative Hierarchical Resource Sharing

[Figure: Lithe sits at the intersection of hierarchical scheduling and cooperative scheduling. Two parent/child diagrams contrast tasks (threads) with unstructured transfer of control against resources (harts) with structured transfer of control]

Related work shown: Lottery Scheduling (Waldspurger 94), CPU Inheritance (Ford 96), HLS (Regehr 01), Converse (Kale 96), GHC (Li 07), Manticore (Fluet 08), Continuation-Based Multiprocessing (Wand 80).

Page 22: Standard Scheduler Callback Interface

[Figure: a parent scheduler (TBBLithe) and child schedulers (OpenMPLithe, CilkLithe) each expose the same callbacks: enter, yield, request, register, unregister. The tbb:: task code runs under TBBLithe, matmult's #pragma omp parallel region under OpenMPLithe, and cilk code under CilkLithe]
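One way to picture a standard callback interface is as an abstract class that every scheduler implements. The sketch below is a hypothetical C++ rendering of the five callbacks named on the slide; the method names and signatures are assumptions made for illustration, not the actual Lithe declarations.

```cpp
#include <string>
#include <vector>

// Hypothetical interface: each parallel library's scheduler implements these.
struct Scheduler {
    virtual ~Scheduler() = default;
    virtual void on_register(Scheduler* child) = 0;             // a child scheduler joined
    virtual void on_unregister(Scheduler* child) = 0;           // a child scheduler left
    virtual void on_request(Scheduler* child, int nharts) = 0;  // a child wants more harts
    virtual void on_enter() = 0;                                // a hart was handed to this scheduler
    virtual void on_yield(Scheduler* child) = 0;                // a child handed a hart back
};

// Toy implementation that only records which callbacks fired, in order.
struct LoggingScheduler : Scheduler {
    std::vector<std::string> log;
    void on_register(Scheduler*) override { log.push_back("register"); }
    void on_unregister(Scheduler*) override { log.push_back("unregister"); }
    void on_request(Scheduler*, int) override { log.push_back("request"); }
    void on_enter() override { log.push_back("enter"); }
    void on_yield(Scheduler*) override { log.push_back("yield"); }
};
```

Because every scheduler presents the same five entry points, a parent does not need to know which library its child belongs to.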

Page 23: Lithe Runtime

[Figure: without Lithe, TBB and OpenMP run directly over the OS and hardware. With Lithe, TBBLithe and OpenMPLithe run over the Lithe runtime, which tracks the harts, the scheduler hierarchy, and the current scheduler, and invokes each scheduler's callbacks (enter, yield, request, register, unregister). Example shown: a yield changes the current scheduler]

Page 24: Register / Unregister

matmult() {
  register(OpenMPLithe);
  :
  :
  unregister(OpenMPLithe);
}

Register dynamically adds the new scheduler to the hierarchy.

[Figure: timeline showing the register and unregister callbacks between the TBBLithe scheduler and the OpenMPLithe scheduler]
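The dynamic growth and shrinkage of the hierarchy can be modeled, for a single hart, as a stack of schedulers. This is a hypothetical sketch (class and method names are assumptions; `register` is a reserved word in C++, hence `register_sched`):

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: registering pushes a child under the current scheduler;
// unregistering pops it and makes the parent current again.
class SchedulerChain {
public:
    explicit SchedulerChain(std::string root) { chain_.push_back(std::move(root)); }

    void register_sched(std::string child) { chain_.push_back(std::move(child)); }

    // Returns false if only the root remains, so nothing can be unregistered.
    bool unregister_sched() {
        if (chain_.size() <= 1) return false;
        chain_.pop_back();
        return true;
    }

    const std::string& current() const { return chain_.back(); }
    std::size_t depth() const { return chain_.size(); }

private:
    std::vector<std::string> chain_;  // root first, current scheduler last
};
```

In the slide's example, TBBLithe is current until matmult() registers OpenMPLithe, and becomes current again once matmult() unregisters it.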

Page 25: Request

matmult() {
  register(OpenMPLithe);
  request(n);
  :

Request asks for more harts from the parent scheduler.

[Figure: timeline showing the request callback invoked on the TBBLithe scheduler on behalf of the OpenMPLithe scheduler]

Page 26: Enter / Yield

enter(OpenMPLithe);   called by the parent; invokes the child's enter callback
:
yield();              called by the child; invokes the parent's yield callback

Enter/Yield transfers additional harts between the parent and child.
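The request, enter, and yield steps can be followed with a small accounting model that tracks how many harts each scheduler holds. This is a hypothetical sketch (the struct and function names are assumptions for illustration, not Lithe's API); a real runtime transfers the executing hart itself rather than a counter.

```cpp
// Hypothetical accounting model of hart ownership.
struct Sched {
    int harts;      // harts currently executing in this scheduler
    int requested;  // harts a child has asked this scheduler for
};

// The child asks its parent for n more harts; nothing moves yet.
inline void request(Sched& parent, int n) { parent.requested += n; }

// The parent hands one hart to the child, if one was requested and the
// parent has one to give. Returns true if a hart moved.
inline bool enter(Sched& parent, Sched& child) {
    if (parent.requested == 0 || parent.harts == 0) return false;
    --parent.requested;
    --parent.harts;
    ++child.harts;
    return true;
}

// The child returns one hart to the parent. Returns true if a hart moved.
inline bool yield(Sched& child, Sched& parent) {
    if (child.harts == 0) return false;
    --child.harts;
    ++parent.harts;
    return true;
}
```

The design point the model captures is that a request is only a hint: harts move when the parent chooses to enter the child, and they come back only when the child yields.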

Page 27: SPQR with Lithe

[Figure: timeline of one matmult call from SPQR through TBBLithe, OpenMPLithe, and MKL: reg, req, three enters, three yields, unreg]

Page 28: SPQR with Lithe

[Figure: timeline of four matmult calls from SPQR through TBBLithe, OpenMPLithe, and MKL, each with its own reg, req, and unreg]

Page 29: Lithe Enables Separation of Concerns

High-level app developer doesn't know about Lithe. Just link with Lithe-compliant libraries.

[Figure: SPQR and MKL over TBBLithe and OpenMPLithe (in place of TBB and OpenMP, labeled "same interfaces"), over the Lithe runtime, OS, and hardware. Layers are labeled by role: functionality; resource management]

Page 30: Foundation for Composing Software

[Figure: analogy between sequential and parallel composition, with rows Model, Interoperability, and Transitioning. Sequential: functions, caller and callee, call and return in place of goto. Parallel: schedulers, parent and child, enter and yield (plus reg, unreg, req) in place of unstructured yield]

Page 31: Talk Roadmap

Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
  Lithe Runtime
  Porting Intel Threading Building Blocks (TBB)
  Porting GNU OpenMP (libgomp)
Evaluation
Synchronization
Future Work

Page 32: Lithe Runtime Implementation

Harts = Pinned Pthreads

Lithe Runtime: ~2000 lines of C, C++, assembly

[Figure: harts over hardware Core 0 through Core 3]
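Pinning a thread to a core is an ordinary operating-system facility. The sketch below shows one way to do it on Linux with glibc, using `sched_getaffinity` and `sched_setaffinity` on the calling thread (`pthread_setaffinity_np` is the pthread-level counterpart). It is an assumed illustration, not code from the Lithe runtime, and the function name is hypothetical.

```cpp
#include <sched.h>  // sched_getaffinity, sched_setaffinity, CPU_* macros (Linux, glibc)

// Pin the calling thread to the lowest-numbered CPU it is allowed to run on.
// Returns that CPU's index, or -1 on failure.
int pin_to_first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    // A pid of 0 means "the calling thread".
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;

    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &allowed)) {
            cpu_set_t one;
            CPU_ZERO(&one);
            CPU_SET(cpu, &one);
            // Restrict the thread to exactly this one CPU.
            return sched_setaffinity(0, sizeof(one), &one) == 0 ? cpu : -1;
        }
    }
    return -1;
}
```

A runtime that wanted one hart per core would run a routine like this once in each worker thread, with a different target CPU per thread.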

Page 33: Libraries Use Harts Instead of Threads

[Figure: OpenMP obtains threads from the OS with pthread_create. OpenMPLithe instead obtains harts from the Lithe runtime through request and enter]

Page 34: Porting TBB to Lithe

Original TBB: pthreads with work-stealing task queues
Ported to Lithe: harts with work-stealing task queues, lazily created as harts enter

Lines of Code:
Total    Relevant  Added  Removed  Modified
8,000    1,500     180    5        70

Page 35: Porting OpenMP to Lithe

Original OpenMP: worker threads = pthreads, time-multiplexed by the OS onto Core 0 through Core 3
Ported to Lithe: worker threads on harts, run to completion by OpenMPLithe

Lines of Code:
Total    Relevant  Added  Removed  Modified
6,000    1,000     220    35       150

Page 36: Talk Roadmap

Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
Evaluation
  Ported Libraries Baseline Performance
  Sparse QR Factorization
  Real-Time Audio Processing
Synchronization
Future Work

Page 37: Experimental Setup

16-Core AMD Barcelona: 4 x Quad-Core Opterons

Each quad-core Opteron: Cores 1 to 4, 512 KB L2 per core, 2 MB L3

Linux 2.6.26 (64-bit, Default CFS Scheduler)

Page 38: No Lithe Overhead w/o Composing

[Figure: TBB performance on the µbenchmarks included with the release; y-axis: time (sec)]

[Figure: OpenMP performance on the NAS Parallel Benchmarks; y-axis: time (sec)]

Page 39: Sparse QR Factorization (SPQR)

[Figure: software architecture of SPQR (column elimination tree; frontal matrix factorization) and its system stack (SPQR, TBB, MKL, OpenMP, OS, hardware)]

Page 40: Performance of SPQR with Lithe

[Figure: x-axis: input matrix; y-axis: time (sec). Series: out-of-the-box, manually tuned, Lithe]

Page 41: Lithe Enables Flexible Sharing of Resources

[Figure: annotations "Give resources to OpenMP" and "Give resources to TBB"]

Manual tuning is stuck with one TBB/OMP configuration throughout the run.

Page 42: Detailed Performance Metrics

[Figure: context switches and L2 cache misses for out-of-the-box, manually tuned, and Lithe]

Page 43: Real-Time Audio Processing

[Figure: plugin DAG containing FFT filters, labeled with the number of channels]

Page 44: Experimental Setup

8-Core Intel Nehalem: 2 x Quad-Core x 2-way Multithreading

Each quad-core socket: Cores 1 to 4, 256 KB L2 per core, 8 MB L3

Linux 2.6.31 (64-bit, Default CFS Scheduler)

Page 45: Baseline FFT Filter (FFTW)

[Figure: two plots of normalized time against # threads, one for FFT size = 131072 and one for FFT size = 32768. Series: Original, Lithe. Annotation: "long latency . . ."]

Page 46: Audio Processing Performance

[Figure: Original vs. Lithe, with 2-way parallel FFT per channel; a region is labeled "Oversubscribed"]

Page 47: Talk Roadmap

Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
Evaluation
Synchronization
Future Work

Page 48: Synchronization Overview

task() {
  barrier_wait();   waiting to synchronize with other tasks!

[Figure: timeline of a hart alternating between the scheduler's task queue (Schedule) and Execute Task]

Page 49: Interaction between Sync & Scheduling

With threads, the app's barrier library relies on the OS scheduler (cwait / csignal):

barrier_wait() {
  if (/* not ready */) {
    pthread_cond_wait();
    :

With harts, the app's barrier library relies on the Lithe schedulers, TBBLithe and OpenMPLithe (block / unblock):

barrier_wait() {
  if (/* not ready */) {
    block();
    :

Page 50: Barrier Synchronization with Lithe

[Figure: timeline of two harts under TBBLithe. A task calls barrier_wait; the barrier library blocks it, and the hart goes back to Schedule and executes another task. When a later barrier_wait completes the barrier, the barrier library unblocks the waiting tasks]
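The cooperation between a barrier library and a scheduler can be sketched as a barrier that never stalls the hart itself. In this hypothetical model (the class name, the integer context ids, and the `arrive` method are assumptions for illustration), an early arrival is told to block so its hart can return to the scheduler, and the last arrival receives the list of contexts to unblock.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical cooperative barrier. Not thread-safe: a real version would
// protect `waiting_` with a lock or atomics.
class CoopBarrier {
public:
    explicit CoopBarrier(std::size_t parties) : parties_(parties) {}

    // Called by a task context that reaches the barrier.
    // Returns true if the caller must block (hand its hart back to the
    // scheduler), false if it was the last arrival and may continue.
    // On the last arrival, `to_unblock` receives every waiting context.
    bool arrive(int ctx, std::vector<int>& to_unblock) {
        if (waiting_.size() + 1 < parties_) {
            waiting_.push_back(ctx);
            return true;
        }
        to_unblock.swap(waiting_);  // release everyone who was waiting
        waiting_.clear();           // the barrier is ready for reuse
        return false;
    }

private:
    std::size_t parties_;
    std::vector<int> waiting_;  // contexts blocked at the barrier
};
```

The scheduler stays in charge throughout: a blocked context costs no hart, so the hart is free to run the very tasks the barrier is waiting for.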

Page 51: Spin-Wait Synchronization

barrier_wait() {
  while (!ready)
    ;

Spin-waiting tries to avoid scheduling overhead.

Spin-waiting with OS threads performs badly when resources are oversubscribed.

Spin-waiting with harts may deadlock the app.

[Figure: OS threads on Core 0 and Core 1, several of them spinning in barrier_wait()]

Page 52: Experimental Setup

16-Core AMD Barcelona: 4 x Quad-Core Opterons

Each quad-core Opteron: Cores 1 to 4, 512 KB L2 per core, 2 MB L3

Linux 2.6.26 (64-bit, Default CFS Scheduler)

Page 53: Barrier Microbenchmark Evaluation

[Figure: microbenchmark setup. Labels: 1000x; # barriers in parallel; # tasks = # cores]

Page 54: Barrier Microbenchmark Evaluation

[Figure: results plotted against # barriers in parallel]

Page 55: Barrier Microbenchmark Evaluation

[Figure: results plotted against # barriers in parallel]

Page 56: Talk Roadmap

Problem: Efficient parallel composability is hard!
Solution: Lithe
Implementation
Evaluation
Synchronization
Future Work

Page 57: Building Custom Schedulers

[Figure: TBBLithe, OpenMPLithe, FFTWLithe, and CustomSortLithe each implement enter, yield, req, reg, unreg. Labels: load balancing, portable, domain-specific. Timeline for CustomSortLithe: a call followed by enters on additional harts, with stages labeled quick sort, merge sort, and insertion sort]

Page 58: Composing Auxiliary Parallel Codes

[Figure: SPQR, MKL, TBBLithe, and OpenMPLithe over the Lithe runtime, OS, and hardware, extended with GarbageCollectorLithe and PinLithe]

Page 59: OS Support for Lithe

[Figure: two applications, SPQR1 and SPQR2, each with MKL, TBBLithe, OpenMPLithe, and its own Lithe runtime, over the OS and hardware. Two configurations are shown: a user-level hart implementation and an OS-level hart implementation]

Page 60: Conclusion

Composability is essential for parallel programming to become widely adopted.

Parallel libraries need to share resources cooperatively.

Main thesis contributions:
Harts: a better resource model for parallel programming
Lithe: a framework for using and sharing harts (primitives; resource sharing model; standard interface; runtime)

[Figure: SPQR, MKL, TBB, and OpenMP over resources 0 to 3; labels: functionality, resource management]

Page 61: Lithe: Composing Parallel Software Efficiently

A big thanks to: Benjamin Hindman (Lithe, TBB Port, OpenMP Port)

Rimas Avizienis (Audio Processing App Port)

Tim Davis (SPQR), Arch Robison (TBB), Greg Henry (MKL)

Research supported by Microsoft (Award #024263), Intel (Award #024894), matching funding by U.C. Discovery (Award #DIG07-10227), and the Gigascale Systems Research Focus Center, one of five research centers funded under the Focus Center Research Program, a Semiconductor Research Corporation program. Additional support comes from Par Lab affiliates National Instruments, NEC, Nokia, NVIDIA, Samsung, and Sun Microsystems.

Code release at http://parlab.eecs.berkeley.edu/lithe


Page 62: Backup Slides

Page 63: I/O

[Figure: a Lithe scheduler (enter, yield, req, reg, unreg, block, unblock) above an I/O library that uses the OS's asynchronous I/O interface. A read returns immediately and the caller is blocked; when the data is ready (polled or signalled), the caller is unblocked]

Page 64: Flickr-Like App Server

Tradeoff between throughput saturation point and latency.

[Figure: app server built on Libprocess and GraphicsMagick, with OpenMPLithe; results labeled (Lithe)]

Page 65: Code-Specific Scheduling for Correctness

Introduce complexity as needed. Easier to reason about. Easier to verify.

Generic scheduler: start with general, add syncs to restrict (nondeterministic).
App-specific scheduler: app-specific ordering (deterministic).

[Figure: tasks T0 to T3 under a generic scheduler and under an app-specific scheduler, each placed between the app and the OS]

Page 66: Code-Specific Scheduling for Performance

[Figure: Cholesky matrix factorization task graph with its critical path marked]

Kurzak et al., Scheduling Linear Algebra Operations on Multicore Processors, LAPACK Working Note 213, Feb 2009

Page 67: Select Runtime Functions + Callbacks

Lithe Runtime Function             Scheduler Callback
lithe_sched_register(callbacks)    register
lithe_sched_unregister()           unregister
lithe_sched_request(nharts)        request
lithe_sched_enter(child)           enter
lithe_sched_yield()                yield
lithe_ctx_block(ctx)               block
lithe_ctx_unblock(ctx)             unblock

Page 68: SPMD Scheduler Pseudocode Example

void spmd_spawn(int N, void (*func)(void*), void *arg) {
  SpmdSched *sched = new SpmdSched(N, func, arg);
  lithe_sched_register(sched);
  lithe_sched_request(N-1);
  sched->compute();
  lithe_sched_unregister();
  delete sched;
}

void SpmdSched::compute() {
  while (/* unstarted tasks */)
    func(arg);
}

void SpmdSched::enter() {
  if (/* unblocked paused contexts */)
    lithe_ctx_resume(/* next unblocked context */);
  else if (/* requests from children */)
    lithe_sched_enter(/* next child scheduler */);
  else if (/* unstarted tasks */) {
    ctx = new SpmdCtx();
    lithe_ctx_run(ctx, start);
  }
  else
    lithe_sched_yield();
}

void SpmdSched::start() {
  compute();
  lithe_ctx_pause(cleanup);
}

void SpmdSched::cleanup(ctx) {
  delete ctx;
  if (/* not all completed */)
    enter();
  else
    lithe_sched_yield();
}