DATAFLOW FOR HIGH PERFORMANCE COMPUTING
Leandro Marzulo
Universidade do Estado do Rio de Janeiro

No more free lunch…
• Can’t buy a new processor and expect to improve performance automatically.
• Parallel programming is a must!
  • Average programmers don’t know how to do it
  • Parallel implementation may not scale
  • Synchronization
• Heterogeneous Systems
  • So many devices – CPU, GPU, Xeon Phi, FPGA …
  • So many libraries/languages – CUDA, OpenCL, TBB, OpenMP, MPI, Pthreads, VHDL…
• TOO MUCH TO LEARN!

Sweet times ahead..

• Time to think out of the box

• To experiment with different stuff

• To revisit old concepts

• To rethink the way we teach programming

• To connect to different fields and research groups

The industry is investing!!!

Why Dataflow?

Just because it feels natural!

Dataflow vs. Von Neumann

Characteristic | Dataflow | Von Neumann
Register File | ✖ | ✔
Program Counter | ✖ | ✔
Control Flow | Steer (one per operand) | Branches and Jumps
Parallelism | Natural (Parallelism Explosion) | Pipeline, Branch Prediction, Tomasulo, ROB, …
Language requirements | Functional (no side effects) * | Nonrestrictive
Compilation difficulties | Control Flow (especially loops and functions) | Several architecture-specific optimizations

* Wavescalar and its wave-ordering annotation scheme
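
The two dataflow traits in the comparison above, no program counter and control flow through steer instructions, can be illustrated with a toy interpreter. This is a minimal sketch and not code from any of the systems in this talk: the names `Instr`, `run`, and `evaluate` are hypothetical, and the steer is assumed to forward its data operand to a "T" or "F" output depending on a boolean operand. An instruction fires as soon as all of its input ports hold an operand.

```python
from collections import deque

class Instr:
    """Hypothetical dataflow instruction with one operand slot per input port."""
    def __init__(self, name, n_inputs, op):
        self.name = name
        self.op = op                      # maps operands -> {out_port: value}
        self.slots = [None] * n_inputs
        self.full = [False] * n_inputs
        self.edges = {}                   # out_port -> [(dst_instr, dst_port)]

    def connect(self, out_port, dst, dst_port):
        self.edges.setdefault(out_port, []).append((dst, dst_port))

def run(tokens):
    """Deliver operands until none remain; there is no program counter."""
    queue = deque(tokens)                 # (instr, port, value) starter operands
    fired = []
    while queue:
        instr, port, value = queue.popleft()
        instr.slots[port], instr.full[port] = value, True
        if all(instr.full):               # firing rule: every operand is present
            outputs = instr.op(*instr.slots)
            instr.full = [False] * len(instr.full)
            fired.append(instr.name)
            for out_port, out_value in outputs.items():
                for dst, dst_port in instr.edges.get(out_port, []):
                    queue.append((dst, dst_port, out_value))
    return fired

def evaluate(x):
    """Graph for: 2 * x if x > 0 else -x, with a steer replacing the branch."""
    results = []
    is_pos = Instr("is_pos", 1, lambda v: {0: v > 0})
    steer = Instr("steer", 2, lambda v, c: {"T": v} if c else {"F": v})
    double = Instr("double", 1, lambda v: {0: 2 * v})
    negate = Instr("negate", 1, lambda v: {0: -v})
    sink = Instr("sink", 1, lambda v: results.append(v) or {})
    is_pos.connect(0, steer, 1)           # boolean operand of the steer
    steer.connect("T", double, 0)
    steer.connect("F", negate, 0)
    double.connect(0, sink, 0)
    negate.connect(0, sink, 0)
    fired = run([(is_pos, 0, x), (steer, 0, x)])
    return results[0], fired
```

Only the instructions on the taken side of the steer ever receive operands, so the untaken side never fires.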

Dataflow Revives!
• TERAFLUX (Unisi, BSC, Microsoft, HP, …)
  • Language
  • Compiler
  • Simulator (no actual HW yet)
• OmpSS (BSC)
  • Heterogeneous
• TBB Flowgraph (Intel)
  • Create and connect nodes
  • Associate them to Lambda Functions
  • Inject starter operands

Maxeler
• Static Dataflow – DAGs (mostly)
• FPGA based – DFE (DataFlow Engine)
• Michael Flynn – MPP / SBAC-PAD 2014 Keynote
• More performance requires more effort (Flynn’s words)
• Compiler – Dataflow Graph in FPGA
• Galava DFE – Academic version (USD 4999)
  • 500 multipliers
  • 12 GB RAM
  • PCI-E

Maxeler - Products

• CPUs plus DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
• DFEs shared over Infiniband: Up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
• Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
• MaxWorkstation: Desktop development system
• MaxCloud: On-demand scalable accelerated compute resource, hosted in London

Maxeler - RTM
• 3U System
  • 1U traditional CPU node
  • 2 x MPC-X 2000 (16 DFEs)
  • Less than 2.5 kW power usage
• Performance = 80 x 16 core Intel nodes!
  • 27x space reduction
  • 15x power consumption reduction
  • 5x improvement on total cost of ownership
• There are other similar examples

TALM

• TALM is an Architecture and Language for Multithreading

• Hybrid Dataflow/Von Neumann (coarse-grained)

• Trebuchet Virtual Machine

• THLL (Annotations – C)

• Couillard Compiler

[Figure: TALM toolchain]
• Files: C Source (.c), Annotated Source (.df.c), Super-instructions Source (.lib.c), Dataflow ASM Code (.fl), Super-instruction Library (.so), Dataflow Binary (.flb), Placement File (.pla)
• Steps: Blocks Definition (THLL); Couillard (Super-Instruction Code Extraction, Dataflow Compilation); Assembler (Placement File Creation, Dataflow Binary Code Generation); Library Compilation (gcc)
• Trebuchet: Loader; PE 1 … PE N, each holding a set of instructions (e.g. Inst 3, 50, 52 on PE 1 and Inst 19, 39, 43 on PE N), connected by a Network

TALM – NW Code

TALM – Results - Blackscholes

TALM – Results - NW

TALM Extra Features
• Static Scheduler – Can use profiler information
• Selective Workstealing – Custom heuristic
• Memory Speculation
  • Transactional Memories
  • Distributed Control – Commit Graph
  • Avoid manual synchronization (dummy edges)
  • No Compiler Support yet
• Error Detection and Recovery
  • Redundant execution
  • Distributed Control – in the graph
• Can have super-instructions in CUDA
  • Compiler support needed (data movements)

Sucuri
• A minimalistic Dataflow Programming Library for Python
• Transparent Execution on Clusters
  • Mpi_enable = TRUE
  • Need to obey DF principles – All data treated as operands
  • Python serializes objects – easy implementation
• Main Classes
  • Scheduler – Pool of tasks
  • Graph – Container
  • Nodes – Related to functions
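
The point about serialization can be made concrete with Python's standard `pickle` module. The sketch below is only an illustration of why treating all data as operands makes cluster execution easy: the message layout (destination node id, destination port, value) and the function names are assumptions, not Sucuri's actual wire format.

```python
import pickle

def pack_operand(dst_node, dst_port, value):
    """Serialize one operand message (hypothetical layout) into bytes."""
    return pickle.dumps((dst_node, dst_port, value))

def unpack_operand(payload):
    """Rebuild the operand message on the receiving process."""
    return pickle.loads(payload)

# Any picklable Python object can travel as an operand.
payload = pack_operand(7, 0, {"row": [1, 2, 3], "label": "block-0"})
node, port, value = unpack_operand(payload)
```

Because the operand carries everything the destination node needs, the receiving process does not depend on any shared state.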

Sucuri - Architecture

Sucuri - Pipeline

1. Create a Graph
2. Create a Scheduler
3. Create Nodes
4. Add nodes to Graph
5. Connect Nodes
6. Start Scheduler
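
One plausible ordering of the pipeline steps can be sketched with a tiny stand-in library. The class names `Graph`, `Scheduler`, and `Node` come from the Main Classes slide, but everything else here (the method names `add`, `add_edge`, `start`, the constructor arguments, and the sequential scheduling loop) is a hypothetical simplification and should not be read as Sucuri's actual API.

```python
from collections import deque

class Node:
    """Stand-in node: wraps a function and waits for n_inputs operands."""
    def __init__(self, func, n_inputs):
        self.func = func
        self.n_inputs = n_inputs
        self.operands = {}                # input port -> value
        self.edges = []                   # (dst_node, dst_port)

    def add_edge(self, dst, dst_port):
        self.edges.append((dst, dst_port))

class Graph:
    """Stand-in graph: a plain container of nodes."""
    def __init__(self):
        self.nodes = []

    def add(self, node):
        self.nodes.append(node)

class Scheduler:
    """Stand-in scheduler: runs ready nodes one at a time (no parallelism)."""
    def __init__(self, graph):
        self.graph = graph

    def start(self):
        ready = deque(n for n in self.graph.nodes if n.n_inputs == 0)
        while ready:
            node = ready.popleft()
            args = [node.operands[p] for p in range(node.n_inputs)]
            result = node.func(*args)
            for dst, port in node.edges:
                dst.operands[port] = result
                if len(dst.operands) == dst.n_inputs:
                    ready.append(dst)     # all operands arrived

output = []
graph = Graph()                           # create a graph
sched = Scheduler(graph)                  # create a scheduler
a = Node(lambda: 3, 0)                    # create nodes
b = Node(lambda: 4, 0)
total = Node(lambda x, y: x + y, 2)
show = Node(lambda s: output.append(s), 1)
for n in (a, b, total, show):             # add nodes to the graph
    graph.add(n)
a.add_edge(total, 0)                      # connect nodes
b.add_edge(total, 1)
total.add_edge(show, 0)
sched.start()                             # start the scheduler
```

A real scheduler would hand ready nodes to a pool of workers; the sequential loop is kept only to make the firing order easy to follow.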

Sucuri – Results - LCS

Ongoing Work
• TALM
  • Compiler Improvements
  • Cluster Version
  • Placement Improvements
• Sucuri
  • Node Gallery
  • Graph Templates
  • Better scheduler
• Both
  • Full GPU Support
  • FPGA Support
  • Multiple implementations for the same task!
  • Applications and users!

ImageFilterNode

Fork/Join Graph

WavefrontGraph

Our Dataflow Research Group
• Leandro Marzulo (UERJ)
• Tiago Alves
• Felipe França (UFRJ)
• Sandip Kundu (UMASS)
• Vítor Santos Costa (UPorto)
• Master Students (6 ongoing, 1 finished):
  • Brunno Goldstein – UFRJ
  • Leandro Santiago – UFRJ
  • Marcos Paulo Rocha – UFRJ
  • Leandro Rouberte – UFRJ
  • Alexandre Machado – UERJ
  • Julio Ho – UERJ
  • Alexandre Sardinha – Finished his Master – Petrobras
• Undergrad students (UERJ)
  • 6 finished – 3 are Master students now
  • 11 ongoing

Questions?

TALM – Results - RT

Sucuri – Hierarchical reduction

Sucuri - Wavefront
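
The figure for this slide is not preserved in the transcript. As a rough illustration, assuming the wavefront pattern refers to the usual anti-diagonal dependency structure used for LCS, the sketch below computes the LCS length one wavefront at a time. The function name `lcs_wavefront` is hypothetical. In a dataflow version, each cell (or block of cells) would be a node with edges from its upper, left, and upper-left neighbours.

```python
def lcs_wavefront(a, b):
    """Return (LCS length, list of wavefronts) for sequences a and b.

    Cell (i, j) depends on (i-1, j), (i, j-1) and (i-1, j-1), so all cells
    with the same i + j are independent and form one wavefront.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    waves = []
    for d in range(2, n + m + 1):         # d = i + j, one anti-diagonal
        wave = [(i, d - i) for i in range(max(1, d - m), min(n, d - 1) + 1)]
        for i, j in wave:                 # independent: could run in parallel
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
        waves.append(wave)
    return table[n][m], waves

length, waves = lcs_wavefront("ABCBDAB", "BDCABA")
```

The number of wavefronts is `len(a) + len(b) - 1`, and the available parallelism grows toward the middle diagonals and shrinks again toward the end.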
