Center for Domain-Specific Computing - CSweb.cs.ucla.edu/~pouchet/cnc2013/talks/cnc13-CDSC-GR.pdf2)...




Center for Domain-Specific Computing Supported by NSF “Expedition in Computing” Program

www.cdsc.ucla.edu.

CDSC-GR: a CnC-inspired Graph Representation

CnC 2013 -- September 24, 2013

Alina Sbirlea¹, Zoran Budimlic¹, Jason Cong², Zhuo Li², Louis-Noel Pouchet², Vivek Sarkar¹ and Mo Xu²

¹ Rice University, ² University of California, Los Angeles


CDSC-GR: Motivations for a CnC-inspired Graph Representation

• Provide a separate, well defined input language to the programmer

Easy to read/write for domain experts, “DSL” for dataflow

• Be an intermediate representation for parallel programs

Translation to/from CDSC-GR, analysis & optimizations, mapping

• Perform graph-level optimizations & checking

Race detection, static & static+dynamic scheduling, data locality analysis, …

• CnC-HC is not enough, extensions are desired such as:

Support for non-graph data, step-to-step synchronization

Support for regions (sets of integer points) for collections

• CDSC-GR can be mapped on heterogeneous hardware

Distributed-CnC, CDSC-GR-to-HC, FPGA synthesis


Motivation Through Example: Heart Wall Tracking

• Heart Wall Tracking:

medical imaging application from the Rodinia Benchmark Suite

detects and analyzes the movement of the heart walls


Motivation Through Example: Heart Wall Tracking

• Heart Wall Tracking – one step further:

create a finer grained version, by splitting each step into 10 additional steps

data remains structured in the original C code => use “fake” items to achieve step-to-step synchronization



Motivation Through Example: Heart Wall Tracking

• Heart Wall Tracking – Fine grained:

Accesses C code, oblivious of data present there

Uses items for step-to-step synchronization in CnC

17 lines of code in CDSC-GR vs 38 lines of code in the CnC graph file

• CDSC-GR can operate on both graph data and non-graph data

CnC dynamic single assignment not needed for non-graph data

• CDSC-GR can be mapped to heterogeneous targets (e.g., FPGA) with synchronized access to non-graph data

FPGA mapper prototype requires explicit (static) regions

Regions simplify communication generation and management of collections

Non-graph data support requires explicit step-to-step dependence support


Some differences between CnC and CDSC-GR

[Figure: graph diagrams comparing standard CnC with CDSC-GR]

Standard CnC

Controller – controllee: (step1) -> <t2>; <t2> :: (step2);

Producer – consumer: (step1) -> [item] -> (step2);

CDSC-GR

Controller – controllee: (step1) :: (step2);

Producer – consumer: (step1) -> [item] -> (step2); (step1) -> (step3);
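
The two controller-controllee forms can be contrasted with a small Python model. This is only a sketch: the `run` loop, the `TagCollection` class and the step function names are hypothetical and belong to neither CnC nor CDSC-GR. It shows that both forms end up scheduling the same step instance; the CDSC-GR form simply skips the intermediate tag collection.

```python
from collections import deque

def run(ready):
    """Drain the ready queue and return the (step name, tag) execution order."""
    trace = []
    while ready:
        step, tag = ready.popleft()
        trace.append((step.__name__, tag))
        step(tag, ready)
    return trace

def step2(tag, ready):
    pass  # leaf step: prescribes nothing further

class TagCollection:
    """Hypothetical model of a CnC control (tag) collection."""
    def __init__(self, prescribed):
        self.prescribed = prescribed

    def put(self, tag, ready):
        # Putting a tag prescribes one instance of the attached step.
        ready.append((self.prescribed, tag))

t2 = TagCollection(step2)

def step1_cnc(tag, ready):
    t2.put(tag, ready)            # (step1) -> <t2>; <t2> :: (step2);

def step1_gr(tag, ready):
    ready.append((step2, tag))    # (step1) :: (step2);

cnc_trace = run(deque([(step1_cnc, 0)]))
gr_trace = run(deque([(step1_gr, 0)]))
```

In both traces `step2` runs once with the tag that `step1` supplied.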


CDSC-GR Key Features and Their Purpose [1/3]

Each property below lists: Intel CnC / CDSC-GR / Role

1) Item, Step, Control Collections
   Intel CnC: has all three
   CDSC-GR: only has item and step collections (no loss of generality in omitting control collections)
   Role: Modeling

2) Item put-get semantics
   Intel CnC: dynamic single assignment
   CDSC-GR: dynamic single assignment for graph data (arbitrary access to non-graph data)
   Role: Modeling

3) Step prescription
   Intel CnC: achieved via put on tag collection, and prescription of step
   CDSC-GR: achieved directly by one step prescribing (spawning) another step. Steps still have control tags as in CnC.
   Role: Modeling

4) Item allocation
   Intel CnC: dynamic
   CDSC-GR: dynamic or pre-allocated
   Role: Modeling

5) Nondeterminism
   Intel CnC: design in progress
   CDSC-GR: support for putIfAbsent()
   Role: Modeling

6) User-defined reductions
   Intel CnC: designed but not included in release
   CDSC-GR: in plan for CDSC-GR
   Role: Modeling
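
Dynamic single assignment and putIfAbsent() can be illustrated with a minimal Python sketch. The `ItemCollection` class and its method names are hypothetical stand-ins, not the CDSC-GR or Intel CnC API. A real runtime would also block or defer a get until the item exists, which is not modeled here.

```python
class ItemCollection:
    """Hypothetical item collection enforcing dynamic single assignment."""

    def __init__(self):
        self._items = {}

    def put(self, key, value):
        # A second put to the same key violates single assignment.
        if key in self._items:
            raise ValueError(f"single-assignment violation for key {key!r}")
        self._items[key] = value

    def put_if_absent(self, key, value):
        """Store the value only if the key is unset; report whether it was stored."""
        if key in self._items:
            return False
        self._items[key] = value
        return True

    def get(self, key):
        # Simplification: raises KeyError instead of waiting for the producer.
        return self._items[key]

A = ItemCollection()
A.put((0, 1), 3.5)
first = A.put_if_absent((0, 2), "x")    # stored
second = A.put_if_absent((0, 2), "y")   # ignored, key already set
try:
    A.put((0, 1), 4.0)
    violated = False
except ValueError:
    violated = True
```

With put_if_absent, several steps may race to produce the same key and the first one wins, which is where the nondeterminism comes from.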


CDSC-GR Key Features and Their Purpose [2/3]

Each property below lists: Intel CnC / CDSC-GR / Role

7) Inter-step synchronization
   Intel CnC: inter-step synchronization must be indirect via items
   CDSC-GR: direct inter-step synchronization permitted (used for coordinating accesses to non-graph data)
   Role: Modeling, Performance

8) Put/get granularity
   Intel CnC: single items
   CDSC-GR: single items + regions
   Role: Modeling, Performance

9) Step parameters
   Intel CnC: no parameters other than control tag
   CDSC-GR: parent step can pass parameters to child step separate from items
   Role: Performance

10) Tuning language
   Intel CnC: designed but not included in release
   CDSC-GR: in plan for CDSC-GR
   Role: Modeling, Performance

11) Get counts and folding functions
   Intel CnC: designed but not included in release
   CDSC-GR: in plan for CDSC-GR
   Role: Performance
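
The idea behind get counts can be sketched as follows, assuming the usual meaning of the term: the producer declares how many gets an item will receive, and the storage is released after the last one. The `CountedItems` class is a hypothetical illustration, not a planned CDSC-GR interface.

```python
class CountedItems:
    """Hypothetical item store that frees an item after its declared number of gets."""

    def __init__(self):
        self._store = {}  # key -> [value, remaining gets]

    def put(self, key, value, get_count):
        if get_count < 1:
            raise ValueError("get_count must be at least 1")
        if key in self._store:
            raise ValueError(f"single-assignment violation for key {key!r}")
        self._store[key] = [value, get_count]

    def get(self, key):
        entry = self._store[key]
        entry[1] -= 1
        if entry[1] == 0:
            del self._store[key]  # last consumer: release the storage
        return entry[0]

    def live(self):
        return len(self._store)

items = CountedItems()
items.put("x", 42, get_count=2)
v1 = items.get("x")
live_after_first = items.live()
v2 = items.get("x")
live_after_second = items.live()
```

The sketch does not remember freed keys, so it cannot detect a later put to the same key; a real runtime would need to decide how that interacts with single assignment.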


CDSC-GR Key Features and Their Purpose [3/3]

Each property below lists: Intel CnC / CDSC-GR / Role

12) Dependent gets
   Intel CnC: permitted
   CDSC-GR: prohibited: the tag/key for a get operation should only depend on the step's control tag, and not on the value of another get operation in the same step
   Role: Analysis and Generation

13) Tag functions
   Intel CnC: optional
   CDSC-GR: required for all input items; can be optionally provided for output items
   Role: Modeling, Analysis and Generation

14) Static analysis and transformation of graph programs
   Intel CnC: hard to do with API calls
   CDSC-GR: in plan for CDSC-GR
   Role: Analysis and Generation

15) Standardized textual representation of graphs
   Intel CnC: present in earlier versions; current version uses API calls to specify graph
   CDSC-GR: foundation for CDSC-GR
   Role: Modeling, Analysis and Generation
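
Why prohibiting dependent gets helps analysis can be seen in a small Python sketch. It assumes a step declared as `[A: i, j], [B: i, j+1] -> (mystep: i, j)`; the function names are hypothetical. Because every input key is a function of the control tag alone, a runtime or compiler can list all inputs of a step instance before running it.

```python
def mystep_inputs(i, j):
    """Tag function: keys read by (mystep: i, j), computed from the control tag only."""
    return [("A", (i, j)), ("B", (i, j + 1))]

def is_ready(tag, available):
    """A step instance can start once every key named by its tag function is available."""
    return all(key in available for key in mystep_inputs(*tag))

available = {("A", (1, 2))}
ready_before = is_ready((1, 2), available)   # B[1, 3] is still missing
available.add(("B", (1, 3)))
ready_after = is_ready((1, 2), available)
```

With a dependent get, the second key would depend on the value returned by the first get, so `mystep_inputs` could not be written as a pure function of the tag.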


Translating Legacy Codes to CDSC-GR

Main motivation 1: reuse existing software/compilers

• CDSC-GR-to-HC translation enables HC optimizations

OpenCL code generation

• Distributed-CnC, CnC on OCR

A step toward distributed memory code generation for CDSC-GR

• Static/dynamic analysis at the graph level

Static scheduling / partitioning, static+dynamic scheduling, placement

I/O complexity analysis

Main motivation 2: ease the user job when translating legacy programs

• Full automation out of reach for arbitrary codes, feedback needed from user


In-Depth Look: Regions

• In CnC-HC: tags are integer tuples, with simple bounds

[ A: i, j ], [B: i, j+1] -> (mystep: i, j);

env -> (mystepColl: { 1 .. 42 }, { 3 .. 51 });

• In practice, programs use more complex “shapes” for tags

for (i = 0; i < N; ++i)

for (j = i + 1; j < N; ++j)

A[i][j] = A[j][i];

[ A: j,i ] -> (mystep: i, j) -> [A: i,j ];

env -> <mystepColl: { 0 .. N-1 }, { ?? .. ?? }>;

• In practice, data regions need not be a single entry in the item collection

These can be solved with program modifications (loop merging, careful management of item collections)

Or with a language extension!


Syntax Proposal for Regions

• Proposal by example:

for (i = 0; i < N; ++i)

for (j = i + 1; j < N; ++j)

A[i][j] = A[j][i];

[ A: j,i ] -> (mystep: i,j) -> [A: i,j];

def region1 : i, j := { 0 <= i < N, i + 1 <= j < N };

env -> <mystepColl: region1>;

• General template for a basic region:

def <region_name> : <list of names for each dim> := { inequalities }

• General template for a union of regions:

def <reg_name>[param list] : <list of names for each dim> := { inequalities1 }[, { inequalities2 }]
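
As an illustration, the region `{ 0 <= i < N, i + 1 <= j < N }` from the example can be enumerated in Python by testing its inequalities on every candidate point. The function names are hypothetical, and the brute-force scan over an N x N box is chosen for clarity, not efficiency.

```python
from itertools import product

def in_region1(i, j, N):
    """Membership test written directly from the region's inequalities."""
    return 0 <= i < N and i + 1 <= j < N

def enumerate_region1(N):
    """All integer points of region1, in lexicographic (i, j) order."""
    return [(i, j) for i, j in product(range(N), repeat=2) if in_region1(i, j, N)]

points = enumerate_region1(4)
```

For N = 4 this yields the six (i, j) pairs visited by the triangular C loop nest, one prescribed step instance per pair.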


Inequalities to Describe a Region

Inequalities are used to describe a convex set of points

Ex: { i <= 42, i >= 12 + j }

Several key questions:

• Must the set really be convex?

No, but easier mapping if it is

• Must the set be exact?

No, but over-approximation must not affect correctness

• Must the set be computable at compile-time?

No, but better static/dynamic optimizations could be triggered if it is; some mappings (FPGA) require it
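
One way to read "over-approximation must not affect correctness" is sketched below in Python. It assumes, purely for illustration, that a mapper replaces the triangular region `{ 0 <= i < N, i + 1 <= j < N }` by its rectangular bounding box; the helper names are hypothetical. The box contains every exact point plus some extras, so it is a safe footprint, but work must still be restricted to points that satisfy the original inequalities.

```python
def bounding_box(points):
    """Smallest axis-aligned box (lo, hi), inclusive, containing all points."""
    dims = range(len(next(iter(points))))
    lo = tuple(min(p[d] for p in points) for d in dims)
    hi = tuple(max(p[d] for p in points) for d in dims)
    return lo, hi

def box_points(lo, hi):
    """All integer points of a two-dimensional inclusive box."""
    return {(i, j)
            for i in range(lo[0], hi[0] + 1)
            for j in range(lo[1], hi[1] + 1)}

N = 4
exact = {(i, j) for i in range(N) for j in range(i + 1, N)}
lo, hi = bounding_box(exact)
approx = box_points(lo, hi)
extra = approx - exact                       # points added by the approximation
guarded = {p for p in approx if p[0] + 1 <= p[1]}  # re-apply the inequality
```

Here the box adds three points that the original region does not contain, and the guard recovers the exact set.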


Regions for Graph Data

• Syntax allows for parameterized regions (maps) by arguments:

def region2(p,q) : i,j := { i = p, q - 1 <= j <= q + 1 };

• Use inside the graph:

[A: region2(i,j)] -> (step1: i,j)   // equivalent to [A: i,j-1], [A: i,j], [A: i,j+1] -> (step1: i,j)

def region1 : i, j := { 0 <= i < N, i + 1 <= j < N };

env -> <mystepColl: region1>;

Key questions have the same answer:

• Must the map be affine? Invertible? Likely no, but easier mapping if it is

• Must the map be exact? Likely no, but over-approximation must not affect correctness

• Must the map be computable at compile-time? Likely no, but better static/dynamic optimizations could be triggered if it is; some mappings (FPGA) require it
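
The parameterized region `region2(p,q)` acts as a map from a step's tag to the set of item keys it reads. A minimal Python sketch of that expansion follows; the function names are hypothetical.

```python
def region2(p, q):
    """Points of { i = p, q - 1 <= j <= q + 1 } for the given arguments."""
    return [(p, j) for j in range(q - 1, q + 2)]

def step1_gets(i, j):
    """Item keys read by (step1: i, j) when declared as [A: region2(i, j)]."""
    return [("A", point) for point in region2(i, j)]

gets = step1_gets(3, 5)
```

For the tag (3, 5) the region expands to the three neighbouring entries A[3,4], A[3,5] and A[3,6].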


A Step Further: Functions

• Not all programs have affine iteration domains!

• Union of regions can be too tedious for non-convex sets

• We do not want to support arbitrary expressions in CDSC-GR

That would require a grammar for arbitrary expressions using C math functions

• Our proposal: allow the definition of functions that map CDSC-GR symbols to C expressions

for (int i = min(P, Q); i < sqrt(P*Q); ++i)

S(i, j)

def mymin(x,y) := “x < y ? x : y”;

def mysqrt(x) := “sqrt(x)”;

def region1 : i, j := { i >= mymin(P, Q), i < mysqrt(P*Q) };

env -> <S : region1>;
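
How such function definitions bound a region can be sketched in Python. This is an illustration only: `mysqrt` uses `math.isqrt` as an integer stand-in for the C `sqrt` call, which is an assumption and matches the C loop bound only when P*Q is a perfect square. `region1_i` is a hypothetical helper listing the values of `i` in the region.

```python
import math

def mymin(x, y):
    # Mirrors the host expression "x < y ? x : y".
    return x if x < y else y

def mysqrt(x):
    # Assumption: integer square root stands in for the host expression "sqrt(x)".
    return math.isqrt(x)

def region1_i(P, Q):
    """Values of i with i >= mymin(P, Q) and i < mysqrt(P*Q)."""
    return list(range(mymin(P, Q), mysqrt(P * Q)))

values = region1_i(2, 8)
empty = region1_i(4, 4)
```

With P = 2 and Q = 8 the bounds are 2 and 4. When P equals Q the two bounds coincide and the region is empty, which static analysis cannot see without knowing what the functions compute.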


More on Functions

• Functions associate an identifier, and possibly some arguments, with code in the target language that evaluates the expression

def <func_name>[arguments] := “<code in host language>”;

• Functions are assumed to return the same value when called with the same arguments, and to have no side effects

• Functions are implicitly typed: they take integer arguments and return an integer value

• Functions can lead to non-convex regions

• Functions can inhibit static analysis at the graph level

• Functions make translation from C for loops to CDSC-GR easier!


Easing Static Analysis: Properties on Functions

• Functions are black boxes, but some information may be known

Example: range for the input values, range for the output values

• Ditto for “parameters” (unknown constants)

• We introduce properties on functions by associating each function with a region

def myfunc(i,j) := “i*i + j*j + 1”;

def regprop1 : x := { x >= 1 };

prop myfunc : regprop1

def region1 : i, j := { 0 <= i <= 42, 0 <= j <= myfunc(i,i+1)};

Here we now know region1 is never empty

• Note: a one-dimensional region is associated with a function, as a function returns an integer value
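
The reasoning enabled by the property can be sketched in Python. The helper `min_points_per_row` is hypothetical: given only the declared lower bound on the function's result, it computes how many values of `j` are guaranteed in `0 <= j <= myfunc(...)`, without evaluating the function. The final check merely confirms on sample inputs that the declared property holds for this particular `myfunc`.

```python
def myfunc(i, j):
    return i * i + j * j + 1      # host expression "i*i + j*j + 1"

FUNC_MIN = 1                      # from regprop1: { x >= 1 }

def min_points_per_row(lower_j, func_min):
    """Guaranteed number of j values in lower_j <= j <= f(...), given f(...) >= func_min."""
    return max(0, func_min - lower_j + 1)

guaranteed = min_points_per_row(0, FUNC_MIN)
property_holds = all(myfunc(i, i + 1) >= FUNC_MIN for i in range(0, 43))
```

Every value of `i` in 0..42 therefore has at least two valid `j` values, so the region cannot be empty.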


Easing Static Analysis: Properties on Parameters

• The same mechanism can be used to bound program parameters and map them to symbols in the host language

def myParam1 := “P”;

def regprop1 : x := { 1 <= x <= 10000 };

prop myParam1 : regprop1

def region1 : i := { 0 <= i <= myParam1};

Here we now know region1 is never empty, and will never exceed 10001 points (0 <= i <= myParam1, with myParam1 <= 10000)
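
The size bounds follow from simple interval arithmetic, sketched here in Python with a hypothetical helper. Since the region `{ 0 <= i <= myParam1 }` includes both endpoints, it holds myParam1 + 1 points.

```python
def region_size_bounds(p_min, p_max):
    """Smallest and largest point count of { 0 <= i <= P } for P in [p_min, p_max]."""
    # Both endpoints are included, so the region holds P + 1 points.
    return p_min + 1, p_max + 1

smallest, largest = region_size_bounds(1, 10000)
```

A mapper that needs static sizes, for example to pre-allocate an item collection, could use the upper figure.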


The Next Step: A Compiler Framework for CDSC-GR

Our objective: provide translators to/from CDSC-GR

CDSC-GR to HC prototype already available

Current work: translating a subset of C to CDSC-GR. Key challenges include:

• Analysis of C program to detect parallelism / step partitioning

Conservative analysis, feedback from the user expected

• Analysis of C program to capture dataflow

Conservative analysis, feedback expected

• Pre-transformations to ease translation to CDSC-GR


The Role of Compiler Analysis & Transformations

• Key challenge 1: what to put inside step(s)?

Current approach: try to put non-affine code segments inside steps

Not all outer loops have an easy-to-analyze dependence pattern -> need transformation for better efficiency?

Tiling, when applicable, allows for dynamic management of step granularity

• Key challenge 2: extract enough parallelism at the CDSC-GR level

Conservative analysis is correct but may report dependences that never occur; it can provide the list of may-dependences to the user for feedback (interactive process)

Current plan: use basic dependence distance vector computation

• Key challenge 3: compute data flow and item collection initialization

Translate program to DSA, graph vs. non-graph data selection
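
A basic dependence distance vector computation can be sketched in Python for the simplest case: uniform accesses in which the write and the read index the same array at constant offsets from the loop indices. The function names are hypothetical and the sketch is not the authors' analysis.

```python
def distance_vector(write_offset, read_offset):
    """Distance from the writing iteration to the reading iteration.

    The write touches A[i + w0][j + w1] and the read touches A[i + r0][j + r1].
    Both hit the same element when I_read - I_write = w - r.
    """
    return tuple(w - r for w, r in zip(write_offset, read_offset))

def carried_by(distance):
    """Outermost loop level with a non-zero distance, or None if all are zero."""
    for level, d in enumerate(distance):
        if d != 0:
            return level
    return None

# Loop body: A[i][j] = f(A[i-1][j+1])
d = distance_vector((0, 0), (-1, 1))
level = carried_by(d)
```

The dependence is carried by the outer `i` loop, so for a fixed `i` the `j` iterations are independent and could become parallel step instances.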


Where to Put the Parallelism?

• Parallelism must be exploited at multiple levels for good performance

SIMD-like parallelism for CPU and GPU

Sync-free multi-core parallelism (OpenMP)

Pipeline parallelism / sync-free parallelism for FPGAs

• Key issue: how much parallelism at the CDSC-GR level?

• Approach 1: fine-grain

  • Small step granularity at the CDSC-GR level

  • Group step instances into “macro-steps” and scan this set in the step code; regions are useful here: they define a multi-dimensional space

  • Issue: atomicity of a “macro step” (analogous to tiling requirements)

• Approach 2: medium-grain

  • Ensure enough workload inside a step (i.e., a step is “a tile”)

  • Issue: less flexibility for the runtime, and CDSC-GR static optimizations
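
Grouping fine-grain step instances into macro-steps can be sketched in Python as a tiling of the tag space. The helper is hypothetical and the square tile shape is an assumption; it ignores the atomicity question, that is, whether dependences allow each group to run as one unit.

```python
from collections import defaultdict

def group_into_macro_steps(points, tile):
    """Map each (i, j) step instance to the macro-step (tile) that contains it."""
    groups = defaultdict(list)
    for i, j in points:
        groups[(i // tile, j // tile)].append((i, j))
    return dict(groups)

N = 4
instances = [(i, j) for i in range(N) for j in range(i + 1, N)]
macro_steps = group_into_macro_steps(instances, tile=2)
```

Each macro-step then scans its own list of instances inside the step code. Tiles on the diagonal of the triangular region are only partially filled, so the workload per macro-step is uneven.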


Sum-up: Motivation for CDSC-GR

• Provide a separate, well defined input language to the programmer

Easy to read/write for domain experts, “DSL” for dataflow

• Be an intermediate representation for parallel programs

Translation to/from CDSC-GR, analysis & optimizations, mapping

• Perform graph-level optimizations & checking

Race detection, static & static+dynamic scheduling, data locality analysis, …

• CnC-HC is not enough, extensions are desired such as:

Support for non-graph data, step-to-step synchronization

Support for regions (sets of integer points) for collections

• CDSC-GR can be mapped on heterogeneous hardware

Distributed-CnC, CDSC-GR-to-HC, FPGA synthesis


Open Topics

• Dynamic single assignment data vs non-DSA data

Data-race detection on non-DSA data is needed to ensure correctness

Determinism --- a step waits for some vs. all of its inputs (some inputs may not be defined as part of item collections)

DSA with data folding annotations --- memory space reuse

• Other discussion topics

Co-optimization of step code and program graph, e.g., tiled steps

Memory management --- folding, get-counts, both?

• Questions?