Dataflow programming for heterogeneous computing systems · Dataflow programming for heterogeneous...

29
Dataflow programming for heterogeneous computing systems Jeronimo Castrillon Cfaed Chair for Compiler Construction TU Dresden [email protected] Tutorial: Algorithmic specification, tools and algorithms for programming heterogeneous platforms. PACT Conference. San Francisco, October 19 th 2015

Transcript of Dataflow programming for heterogeneous computing systems · Dataflow programming for heterogeneous...

Dataflow programming for heterogeneous computing systems

Jeronimo CastrillonCfaed Chair for Compiler ConstructionTU [email protected]

Tutorial: Algorithmic specification, tools and algorithms for programming heterogeneous platforms. PACT Conference. San Francisco, October 19th 2015

Outline

q Heterogeneous systems

q Dataflow models

q Analysis and synthesis

q Summary

2 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Outline

q Heterogeneous systems

q Dataflow models

q Analysis and synthesis

q Summary

3 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Heterogeneous computing systems

q Today’s heterogeneityq Desktop/HPC: GPGPU, GP + ACCq Embedded: RISCs, DSPs, ASIPs, …

è Resources: different performance/energy characteristics

q Tomorrow’s heterogeneity: Emerging technologiesq Heterogeneity beyond performance

and energy§ Reliability/error tolerance§ Computation model

4 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

5

Heterogeneity: Cfaed Vision

q German Excellence Cluster: Goal – “to explore new technologies for electronic information processing which overcome the limits of CMOS technology”

Materials &Functions

Devices &Circuits

Information Processing

CMOS

(industry  focus)

A    Silicon  Nanowires

B    Carbon

C    Organic

D    BiomolecularAssembly

E    Chemical

F Orchestration

G Resilience

HHAEC:

Highly Adaptive

ComputingEnergy-­Efficient

IBiologicalSystems

Material research for post-CMOS technologies

Systems research to handle heterogeneity

Biology to inspire solutions

© J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Heterogeneity: Example SiNW

q SiNW: Silicon Nanowiresq Multi-gate devices with less performance penaltyq Reconfigurable P/N functionality

q Possibilitiesq New micro architectures

q New pipeline structuresq New field programmable devices

6 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

[Trommer15]

Heterogeneity: Example chemical processing

q Lab-on-Chips: for sensing, analysis and testq Also for computing?

q Different kinds of transistorsq Oscillators and other components

7 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Fundamentally different way of computation

[Voigt14]

Heterogeneity: Example DNA origami

q DNA origami: Self-assembled 2D/3D structures made of DNA strands

q Use structures to build advance electronic devicesq Example: Plasmonic waveguides

8 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Courtesy: Thorsten Lars-Schmidthttps://cfaed.tu-dresden.de/schmidt-home

9

Consequence of heterogeneity

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

Already difficult to program them today, what about tomorrow?Need for models and abstraction

© J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Outline

q Heterogeneous systems

q Dataflow models

q Analysis and synthesis

q Summary

10 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Models: Introduction

q Von Neumann model makes things complicatedq Sharing stateq Data races

q Task graphs: A simple parallel programming modelq Intel TBB, .NET Task parallel library

(TPL), OpenMP Tasks, …q Runtime and data management, e.g.,

StarPU [Augonnet10]11 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Central Processing Unit (CPU)

Memory & I/O

Control unit Processing unitInst. reg

Prog. counterReg. file

ALU

Directed acyclic task graphs

q Very well studied, see for example [Kwok99]q Difficult problem for heterogeneous systems

12 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

RISC DSP

SharedMemory

Interconnect

Time

Processor

13

24

Time

Processor

14

32

?4

2

1

3

Dataflow models

q Also a graph representation: Nodes & edges are called actors & channelsq Implicit repetition, common in streaming, signal processing applicationsq Communication: only through channelsq Multiple flavors of models: Rules that determine when an actor firesq A graph models multiple possible executions:

13 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

4

2

1

3Time

Processor

1

432

111

1

22

2

3

3

3

cf. LabVIEW models

Dataflow models (2)

q Also a graph representation: Nodes & edges are called actors & channelsq Implicit repetition, common in streaming, signal processing applicationsq Communication: only through channelsq Multiple flavors of models: Rules that determine when an actor firesq A graph models multiple possible executions:

14 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Time

Processor

4

2

1

3 1

32

1122

3 3…

4 4

What now?

Dataflow models (3)

q Synchronous Dataflow (SDF): every actor has a fixed behavior

q Cyclo-Static Dataflow (CSDF): every actor has a set of fixed behaviors

15 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

a3 always writes 1 token to e4

a1: writes 1 token, then 0, then 0 to e2

a1 a2 a33 1 2 3

6 2

12

e1 e3

e4

e2

a1 a2 a31{0,2}

1

{1,0,0}

e1

e3

e21

{0,1,0}

Dynamic models and Kahn Process Networks

q Dynamic dataflow: set of firing rules per actor

q Kahn process networks (KPN): nodes are now called processes

16 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

a1 a2 a3i1

i2

p1 p2 p3e1 e3

e4

e2

p1: writes any amount of tokens to e2 at any time

Outline

q Heterogeneous systems

q Dataflow models

q Analysis and synthesis

q Summary

17 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

18

Analysis and synthesis

© J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

DMAs, sema-

phoresPMU

Peripherals

Communicationsupport

HW queues

Network Processor

Packet DMA

MEMsubsystem

NoC

A15L1

A15L1

L2A15L1

A15L1

VLIW DSP

L1,L2

Application

Architecture model

Non-functional specification

Analysis

Synthesis

Code generation

Property models (timing, energy, error, …)

PNargs_ifft_r.ID = 6U;PNargs_ifft_r.PNchannel_freq_coef = filtered_coef_rightPNargs_ifft_r.PNnum_freq_coef = 0U;PNargs_ifft_r.PNchannel_time_coef = sink_rightPNargs_ifft_r.channel = 1;sink_left = IPCllmrf_open(3, 1, 1);sink_right = IPCllmrf_open(7, 1, 1);PNargs_sink.ID = 7U;PNargs_sink.PNchannel_in_left = sink_leftPNargs_sink.PNnum_in_left = 0U;PNargs_sink.PNchannel_in_right = sink_rightPNargs_sink.PNnum_in_right = 0U;taskParams.arg0 = (xdc_UArg)&PNargs_srctaskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtr&taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_fft_ltaskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrft_Templ, &taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_ifft_rtaskParams.priority = 1;

ti_sysbios_knl_Task_create((ti_sysbios_knl_Task_FuncPtrfft_Templ, &taskParams, &eb);

glob_proc_cnt++;hasProcess = 1;taskParams.arg0 = (xdc_UArg)&PNargs_sinktaskParams.priority = 1;

cf. Silexica’s tool flows

Example for SDFs

q Compute topology matrix, and solve system of equations

q Solution: repetition vector serve to unroll the graph [1 3 2]

q Perform mapping and scheduling on the resulting directed acyclic graph (DAG)

19 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Example for KPNs: Static & dynamic analysis

q Need to understand process interactions

20 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

f1(...);write(&c2);

read(&c1);f3(...);write(&c3);f4(...);write(&c3);

read(&c2);f5(...);write(&c4);

read(&c0);

f7(...);read(&c4);

read(&c3);f6(...);read(&c3);

Unrolled CFG for

t

R2R3R4

tf1R1R2R3R4

f1 f1f2

f6 f7 f7 f7 f7

f6 f7 f7 f7 f7f5

f4f3f1R1 f1 f1 f2

f4f3

Unrolled CFG for

...

...

...

...

t

R2R3R4 f6 f7 f7 f7 f7

f5f4f3

f1R1 f1 f1 f2

...

t1

t2t3t2 t1

f5 f5

f2(...);write(&c1);

f5 f5

f5 f5 f5

f1(...);write(&c2);

read(&c1);f3(...);write(&c3);f4(...);write(&c3);

read(&c2);f5(...);write(&c4);

read(&c0);

f7(...);read(&c4);

read(&c3);f6(...);read(&c3);

Unrolled CFG for

t

R2R3R4

tf1R1R2R3R4

f1 f1f2

f6 f7 f7 f7 f7

f6 f7 f7 f7 f7f5

f4f3f1R1 f1 f1 f2

f4f3

Unrolled CFG for

...

...

...

...

t

R2R3R4 f6 f7 f7 f7 f7

f5f4f3

f1R1 f1 f1 f2

...

t1

t2t3t2 t1

f5 f5

f2(...);write(&c1);

f5 f5

f5 f5 f5

[Castrillon14]

KPN & DDF: Tracing

q Dynamic analysis based on execution traces [Castrillon10/13, Brunet15, Singh15]

21 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

DAG representation for further analysis and synthesis

Models from functional specification

q Inspect functional specification of actors/processes (cf. 2nd talk)q Instrumentation, emulation, cost models, …q For timing, energy, errors, …

q Example

22 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Profiling: Execution counts, branch stats

[SAMOS14]

Models based on algorithmic descriptions

q When functional specification is not meant for synthesisq Common/required in heterogeneous systems (special components)

q Need to match algorithms to hardware components [Castrillon10b, Castrillon11]

23 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Plain code

Plain code

Platform model + characterization of special components

... ... ...Algorithm library

N: Algorithmic actorsF: Existing implementation in target platform

Multiple-applications

q Use traces and mappings to reason about platform sharing

24 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Multiple-applications (2)

q Quickly discard “bad” multi-application configurations by observing the platform utilization profiles

25 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Outline

q Heterogeneous systems

q Dataflow models

q Analysis and synthesis

q Summary

26 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Summary

q Need programming models (and HW/SW stacks) to handle heterogeneityq Even more dramatic in the post-CMOS era

q Dataflow modelsq Natural way to describe some applicationsq Amenable to analysis and synthesis for parallel execution

q Discuss different kinds of models and required analysisq Need models of hardware for synthesis

27 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

Acknowledgements

q German Cluster of Excellence: Center for Advancing Electronics Dresden (www.cfaed.tu-dresden.de)

q Collaborative research center (CRC): Highly Adaptive Energy-Efficient Computing (HAEC)

q Silexica Software Solutions GmbH

28 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015

CRC  912:  Highly  Adaptive  Energy-­‐Efficient   Computing

References

[Trommer15] Trommer, et al. "Functionality-Enhanced Logic Gate Design Enabled by Symmetrical Reconfigurable Silicon Nanowire Transistors”, IEEE Trans on in Nanotechnology, pp.689-698, July 2015

[Voigt14] Andreas Voigt, et al. “Towards Computation with Microchemomechanical Systems”, In International Journal of Foundations of Computer Science, World Scientific, volume 25, 2014

[Augonnet10] C. Augonnet, et al. “StarPU: a unified platform for task scheduling on heterogeneous multicore architectures“, Concurrency and Computation: Practice and Experience 23.2 (2011), pp. 187–198.

[Kwok99] Y.-K. Kwok, et al, “Static Scheduling Algorithms for Allocating Directed Task Graphs to Multiprocessors”, ACM Comput. Surv., ACM Press, 1999, 31, 406 - 471

[Castrillon14] J. Castrillon, et al, “Programming Heterogeneous MPSoCs: Tool Flows to Close the Software Productivity Gap”, Springer, 2014

[Castrillon10] J. Castrillon, et al, “Trace-based KPN composability analysis for mapping simultaneous applications to MPSoC platforms”, DATE’10, pp. 753-758, 2010

[Castrillon13] J. Castrillon, et al, “MAPS: Mapping concurrent dataflow applications to heterogeneous MPSoCs,” IEEE Trans on Industrial Informatics, vol. 9, no. 1, pp. 527–545, 2013

[Brunet15] Brunet, Simone C. “Analysis and optimization of dynamic dataflow programs”. Diss. ÉCOLE POLYTECHNIQUE FÉDÉRALE DE LAUSANNE, 2015

[Singh15] Singh, Amit, et al. "Resource and Throughput Aware Execution Trace Analysis for Efficient Run-time Mapping on MPSoCs.”, TCAD 2015

[SAMOS14] J.F. Eusse, et al, "Pre-architectural performance estimation for ASIP design based on abstract processor models," Embedded Computer Systems: Architectures, Modeling, and Simulation (SAMOS XIV), 2014 pp.133-140, 2014

[Castrillon10b] J. Castrillon, et al, “Component-based waveform development: The nucleus tool flow for efficient and portable SDR,” Wireless Innovation Conference and Product Exposition (SDR), 2010

[Castrillon11] J. Castrillon, et al, “Component-based waveform development: The nucleus tool flow for efficient and portable software defined radio”, Analog Integrated Circuits and Signal Processing, vol. 69, no. 2–3, pp. 173–190, 2011

29 © J. Castrillon. Dataflow programming. PACT Tutorial. Oct 2015