Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in...

39
1 Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs Michael Gordon, William Thies, and Saman Amarasinghe Massachusetts Institute of Technology ASPLOS October 2006 San Jose, CA http://cag.csail.mit.edu/streamit

Transcript of Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in...

Page 1: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

1

Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in

Stream Programs

Michael Gordon, William Thies, andSaman Amarasinghe

Massachusetts Institute of Technology

ASPLOS October 2006San Jose, CA

http://cag.csail.mit.edu/streamit

Page 2: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

2

Multicores Are Here!

1985 199019801970 1975 1995 2000

4004

8008

80868080 286 386 486 Pentium P2 P3P4Itanium

Itanium 2

2005 20??

# ofcores

1

2

4

8

16

32

64

128256

512

Athlon

Raw

Power4 Opteron

Power6

Niagara

YonahPExtreme

Tanglewood

Cell

IntelTflops

Xbox360

CaviumOcteon

RazaXLR

PA-8800

CiscoCSR-1

PicochipPC102

Broadcom 1480 Opteron 4PXeon MP

AmbricAM2045

Page 3: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

3

Multicores Are Here!

For uniprocessors,C was:•Portable•High Performance•Composable•Malleable•Maintainable

Uniprocessors:C is the commonmachine language

1985 199019801970 1975 1995 2000

4004

8008

80868080 286 386 486 Pentium P2 P3P4Itanium

Itanium 2

2005

Raw

Power4 Opteron

Power6

Niagara

YonahPExtreme

Tanglewood

Cell

IntelTflops

Xbox360

CaviumOcteon

RazaXLR

PA-8800

CiscoCSR-1

PicochipPC102

Broadcom 1480

20??

# ofcores

1

2

4

8

16

32

64

128256

512

Opteron 4PXeon MP

Athlon

AmbricAM2045

Page 4: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

4

Multicores Are Here!

What is the commonmachine languagefor multicores?

1985 199019801970 1975 1995 2000

4004

8008

80868080 286 386 486 Pentium P2 P3P4Itanium

Itanium 2

2005

Raw

Power4 Opteron

Power6

Niagara

YonahPExtreme

Tanglewood

Cell

IntelTflops

Xbox360

CaviumOcteon

RazaXLR

PA-8800

CiscoCSR-1

PicochipPC102

Broadcom 1480

20??

# ofcores

1

2

4

8

16

32

64

128256

512

Opteron 4PXeon MP

Athlon

AmbricAM2045

Page 5: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

5

Common Machine Languages

Single memory imageSingle flow of controlCommon PropertiesUniprocessors:

ISA

Functional Units

Register FileDifferences:

Register AllocationInstruction Selection

Instruction Scheduling

Multiple local memoriesMultiple flows of controlCommon PropertiesMulticores:

Communication Model

Synchronization Model

Number and capabilities of coresDifferences:

von-Neumann languages represent the common properties and abstract away the differences

Need common machine language(s) for multicores

Page 6: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

6

Streaming as a Common Machine Language

• Regular and repeating computation

• Independent filters with explicit communication– Segregated address spaces and

multiple program counters

• Natural expression of Parallelism:– Producer / Consumer dependencies– Enables powerful, whole-program

transformations Adder

Speaker

AtoD

FMDemod

LPF1

Scatter

Gather

LPF2 LPF3

HPF1 HPF2 HPF3

Page 7: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

7

Types of Parallelism

Task Parallelism– Parallelism explicit in algorithm– Between filters without

producer/consumer relationship

Data Parallelism– Peel iterations of filter, place within

scatter/gather pair (fission)– parallelize filters with state

Pipeline Parallelism– Between producers and consumers– Stateful filters can be parallelized

Scatter

Gather

Task

Page 8: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

8

Types of Parallelism

Task Parallelism– Parallelism explicit in algorithm– Between filters without

producer/consumer relationship

Data Parallelism– Between iterations of a stateless filter – Place within scatter/gather pair (fission)– Can’t parallelize filters with state

Pipeline Parallelism– Between producers and consumers– Stateful filters can be parallelized

Scatter

Gather

Scatter

Gather

Task

Pip

elin

e

Data

Data Parallel

Page 9: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

9

Types of Parallelism

Traditionally:

Task Parallelism– Thread (fork/join) parallelism

Data Parallelism– Data parallel loop (forall)

Pipeline Parallelism– Usually exploited in hardware

Scatter

Gather

Scatter

Gather

Task

Pip

elin

e

Data

Page 10: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

10

Problem Statement

Given: – Stream graph with compute and communication

estimate for each filter– Computation and communication resources of

the target machine

Find:– Schedule of execution for the filters that best

utilizes the available parallelism to fit the machine resources

Page 11: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

11

Our 3-Phase Solution

1. Coarsen: Fuse stateless sections of the graph2. Data Parallelize: parallelize stateless filters3. Software Pipeline: parallelize stateful filters

Compile to a 16 core architecture– 11.2x mean throughput speedup over single core

Coarsen Granularity

Data Parallelize

Software Pipeline

Page 12: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

12

Outline

• StreamIt language overview• Mapping to multicores

– Baseline techniques– Our 3-phase solution

Page 13: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

13

• Applications– DES and Serpent [PLDI 05]– MPEG-2 [IPDPS 06]– SAR, DSP benchmarks, JPEG, …

• Programmability– StreamIt Language (CC 02) – Teleport Messaging (PPOPP 05)– Programming Environment in Eclipse (P-PHEC 05)

• Domain Specific Optimizations– Linear Analysis and Optimization (PLDI 03)– Optimizations for bit streaming (PLDI 05)– Linear State Space Analysis (CASES 05)

• Architecture Specific Optimizations– Compiling for Communication-Exposed

Architectures (ASPLOS 02)– Phased Scheduling (LCTES 03)– Cache Aware Optimization (LCTES 05)– Load-Balanced Rendering

(Graphics Hardware 05)

StreamIt Program

Front-end

Stream-AwareOptimizations

Uniprocessorbackend

Clusterbackend

Rawbackend

IBM X10backend

C/C++ C per tile +msg code

StreamingX10 runtime

Annotated Java

MPI-likeC/C++

Simulator(Java Library)

The StreamIt Project

Page 14: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

14

Model of Computation

• Synchronous Dataflow [Lee ‘92]– Graph of autonomous filters– Communicate via FIFO channels

• Static I/O rates– Compiler decides on an order

of execution (schedule)– Static estimation of

computation

A/D

Duplicate

LED

Detect

Band Pass

LED

Detect

LED

Detect

LED

Detect

Page 15: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

15

Example StreamIt Filter0 1 2 3 4 5 6 7 8 9 10 11 input

output

FIR0 1

float→float filter FIR (int N, float[N] weights) {

work push 1 pop 1 peek N {float result = 0;

for (int i = 0; i < N; i++) {result += weights[i] ∗ peek(i);

}pop();push(result);

}}

Stateless

Page 16: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

16

Example StreamIt Filter

float→float filter FIR (int N, ) {

work push 1 pop 1 peek N {float result = 0;

for (int i = 0; i < N; i++) {result += weights[i] ∗ peek(i);

}pop();push(result);

}}

0 1 2 3 4 5 6 7 8 9 10 11 input

output

FIR0 1

weights = adaptChannel(weights);

float[N] weights;

(int N) {

Stateful

Page 17: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

17

parallel computation

StreamIt Language Overview• StreamIt is a novel

language for streaming– Exposes parallelism and

communication– Architecture independent– Modular and composable

– Simple structures composed to creates complex graphs

– Malleable– Change program behavior

with small modifications

may be any StreamIt language construct

joinersplitter

pipeline

feedback loop

joiner splitter

splitjoin

filter

Page 18: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

18

Outline

• StreamIt language overview• Mapping to multicores

– Baseline techniques– Our 3-phase solution

Page 19: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

19

Baseline 1: Task Parallelism

Adder

Splitter

Joiner

Compress

BandPass

Expand

Process

BandStop

Compress

BandPass

Expand

Process

BandStop

• Inherent task parallelism between two processing pipelines

• Task Parallel Model:– Only parallelize explicit task

parallelism – Fork/join parallelism

• Execute this on a 2 core machine ~2x speedup over single core

• What about 4, 16, 1024, … cores?

Page 20: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

20

0

1

2

3

4

5

6

7

8

9

10

11

12

13

14

15

16

17

18

19

Bitonic

Sort

Chann

elVoc

oder

DCT

DES

FFT

Filterba

nk

FMRadio

Serpe

nt

TDEMPEG2D

ecod

er

Vocod

er

Radar

Geometr

ic Mea

nTh

roug

hput

Nor

mal

ized

to S

ingl

e C

ore

Stre

amIt

Evaluation: Task ParallelismRaw Microprocessor

16 inorder, single-issue cores with D$ and I$16 memory banks, each bank with DMA

Cycle accurate simulator

Parallelism: Not matched to target!Synchronization: Not matched to target!

Page 21: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

21

Baseline 2: Fine-Grained Data Parallelism

Adder

Splitter

Joiner

• Each of the filters in the example are stateless

• Fine-grained Data Parallel Model:– Fiss each stateless filter N

ways (N is number of cores)– Remove scatter/gather if

possible

• We can introduce data parallelism– Example: 4 cores

• Each fission group occupies entire machineBandStopBandStopBandStopAdder

Splitter

Joiner

ExpandExpandExpand

ProcessProcessProcess

Joiner

BandPassBandPassBandPass

CompressCompressCompress

BandStopBandStopBandStop

Expand

BandStop

Splitter

Joiner

Splitter

Process

BandPass

Compress

Splitter

Joiner

Splitter

Joiner

Splitter

Joiner

ExpandExpandExpand

ProcessProcessProcess

Joiner

BandPassBandPassBandPass

CompressCompressCompress

BandStopBandStopBandStop

Expand

BandStop

Splitter

Joiner

Splitter

Process

BandPass

Compress

Splitter

Joiner

Splitter

Joiner

Splitter

Joiner

Page 22: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

22

Evaluation:Fine-Grained Data Parallelism

0

1

2

34

5

6

7

8

910

11

12

13

14

1516

17

18

19

Bitonic

SortCha

nnelVoc

oder

DCT

DES

FFT

Filterban

k

FMRadio

Serpen

t

TDEMPEG2Deco

der

Vocod

er

Radar

Geometr

ic Mea

nTh

roug

hput

Nor

mal

ized

to S

ingl

e C

ore

Stre

amIt

TaskFine-Grained Data

Good Parallelism!Too Much Synchronization!

Page 23: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

23

Outline

• StreamIt language overview• Mapping to multicores

– Baseline techniques– Our 3-phase solution

Page 24: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

24

Phase 1: Coarsen the Stream GraphSplitter

Joiner

Expand

BandStop

Process

BandPass

Compress

Expand

BandStop

Process

BandPass

Compress

• Before data-parallelism is exploited

• Fuse stateless pipelines as much as possible without introducing state– Don’t fuse stateless with

stateful– Don’t fuse a peeking filter with

anything upstreamPeek Peek

PeekPeek

Adder

Page 25: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

25

Splitter

Joiner

BandPassCompressProcessExpand

BandPassCompressProcessExpand

BandStop BandStop

Adder

• Before data-parallelism is exploited

• Fuse stateless pipelines as much as possible without introducing state– Don’t fuse stateless with

stateful– Don’t fuse a peeking filter with

anything upstream

• Benefits:– Reduces global communication

and synchronization– Exposes inter-node

optimization opportunities

Phase 1: Coarsen the Stream Graph

Page 26: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

26

Phase 2: Data Parallelize

AdderAdderAdder

Splitter

Joiner

Adder

BandPassCompressProcessExpand

BandPassCompressProcessExpand

BandStop BandStop

Splitter

Joiner

Fiss 4 ways, to occupy entire chip

Data Parallelize for 4 cores

Page 27: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

27

Phase 2: Data Parallelize

AdderAdderAdder

Splitter

Joiner

Adder

BandPassCompressProcessExpand

Splitter

Joiner

BandPassCompressProcessExpand

BandPassCompressProcessExpand

Splitter

Joiner

BandPassCompressProcessExpand

BandStop BandStop

Splitter

Joiner

Task parallelism!Each fused filter does equal workFiss each filter 2 times to occupy entire chip

Data Parallelize for 4 cores

Page 28: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

28

BandStop BandStop

Phase 2: Data Parallelize

AdderAdderAdder

Splitter

Joiner

Adder

BandPassCompressProcessExpand

Splitter

Joiner

BandPassCompressProcessExpand

BandPassCompressProcessExpand

Splitter

Joiner

BandPassCompressProcessExpand

Splitter

Joiner

BandStop

Splitter

Joiner

BandStop

Splitter

Joiner

Task parallelism, each filter does equal workFiss each filter 2 times to occupy entire chip

• Task-conscious data parallelization– Preserve task parallelism

• Benefits:– Reduces global communication

and synchronization

Data Parallelize for 4 cores

Page 29: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

29

Evaluation: Coarse-Grained Data Parallelism

012

34567

89

1011

1213141516

171819

Bitonic

SortCha

nnelVocoder

DCT

DES

FFT

Filterba

nk

FMRadio

Serpent

TDEMPEG2Dec

oder

Vocoder

Radar

Geometric

Mean

Thro

ughp

ut N

orm

aliz

ed to

Sin

gle

Cor

e St

ream

It

TaskFine-Grained DataCoarse-Grained Task + Data

Good Parallelism!Low Synchronization!

Page 30: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

30

Simplified Vocoder

RectPolar

Splitter

Joiner

AdaptDFT AdaptDFT

Splitter

Splitter

Amplify

Diff

UnWrap

Accum

Amplify

Diff

Unwrap

Accum

Joiner

Joiner

PolarRect

66

20

2

1

1

1

2

1

1

1

20 Data Parallel

Data Parallel

Target a 4 core machine

Data Parallel, but too little work!

Page 31: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

31

Data Parallelize

RectPolarRectPolarRectPolar

Splitter

Joiner

AdaptDFT AdaptDFT

Splitter

Splitter

Amplify

Diff

UnWrap

Accum

Amplify

Diff

Unwrap

Accum

Joiner

RectPolar

Splitter

Joiner

RectPolarRectPolarRectPolarPolarRect

Splitter

Joiner

Joiner

66

20

2

1

1

1

2

1

1

1

20

5

5

Target a 4 core machine

Page 32: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

32

Data + Task Parallel Execution

Time

Cores

21

Target 4 core machine

Splitter

Joiner

Splitter

Splitter

Joiner

Splitter

Joiner

RectPolarSplitter

Joiner

Joiner

66

2

1

1

1

2

1

1

1

5

5

Page 33: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

33

We Can Do Better!

Time

Cores

Target 4 core machine

Splitter

Joiner

Splitter

Splitter

Joiner

Splitter

Joiner

RectPolarSplitter

Joiner

Joiner

66

2

1

1

1

2

1

1

1

5

5

16

Page 34: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

34

Phase 3: Coarse-Grained Software Pipelining

RectPolar

RectPolar

RectPolar

RectPolar

Prologue

New Steady

State

• New steady-state is free of dependencies

• Schedule new steady-state using a greedy partitioning

Page 35: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

35

Greedy Partitioning

Target 4 core machine

Time 16

CoresTo Schedule:

Page 36: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

36

0123456789

10111213141516171819

Bitonic

SortCha

nnelVocoder

DCT

DES

FFT

Filterba

nk

FMRadio

Serpent

TDEMPEG2Dec

oder

Vocoder

Radar

Geometric

MeanTh

roug

hput

Nor

mal

ized

to S

ingl

e C

ore

Stre

amIt

Task Fine-Grained DataCoarse-Grained Task + Data Coarse-Grained Task + Data + Software Pipeline

Evaluation: Coarse-Grained Task + Data + Software Pipelining

Best Parallelism!Lowest Synchronization!

Page 37: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

37

Generalizing to Other Multicores

• Architectural requirements:– Compiler controlled local memories with DMA– Efficient implementation of scatter/gather

• To port to other architectures, consider:– Local memory capacities – Communication to computation tradeoff

• Did not use processor-to-processor communication on Raw

Page 38: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

38

Related Work

• Streaming languages:– Brook [Buck et al. ’04]– StreamC/KernelC [Kapasi ’03, Das et al. ’06]– Cg [Mark et al. ‘03]– SPUR [Zhang et al. ‘05]

• Streaming for Multicores:– Brook [Liao et al., ’06]

• Ptolemy [Lee ’95]• Explicit parallelism:

– OpenMP, MPI, & HPF

Page 39: Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in …groups.csail.mit.edu/commit/papers/06/gordon-asplos06... · 2009-01-29 · Single flow of control Common Properties

39

Conclusions

• Good speedups across varied benchmark suite• Algorithms should be applicable across multicores

Low

Good

Coarse-Grained Task + Data

High

Good

Fine-Grained Data

LowestNot matched Synchronization

Best Not matchedParallelism

Coarse-Grained Task + Data +

Software PipelineTask

• Streaming model naturally exposes task, data, and pipeline parallelism

• This parallelism must be exploited at the correct granularity and combined correctly