DATAFLOW FOR HIGH PERFORMANCE COMPUTING
Leandro Marzulo
Universidade do Estado do Rio de Janeiro
No more free lunch…
• Can’t buy a new processor and expect to improve performance automatically.
• Parallel programming is a must!
  • Average programmers don’t know how to do it
  • Parallel implementation may not scale
  • Synchronization
• Heterogeneous Systems
  • So many devices – CPU, GPU, Xeon Phi, FPGA …
  • So many libraries/languages – CUDA, OpenCL, TBB, OpenMP, MPI, Pthreads, VHDL…
• TOO MUCH TO LEARN!
Sweet times ahead…
• Time to think out of the box
• To experiment with different stuff
• To revisit old concepts
• To rethink the way we teach programming
• To connect to different fields and research groups
The industry is investing!!!
Why Dataflow?
Just because it feels natural!
Dataflow x Von Neumann

Characteristic           | Dataflow                                   | Von Neumann
-------------------------|--------------------------------------------|-------------------------------------------
Register File            | ✖                                          | ✔
Program Counter          | ✖                                          | ✔
Control Flow             | Steer (one per operand)                    | Branches and Jumps
Parallelism              | Natural (Parallelism Explosion)            | Pipeline, Branch Prediction, Tomasulo, ROB, …
Language requirements    | Functional (no side effects) *             | Nonrestrictive
Compilation difficulties | Control Flow (especially loops and functions) | Several architecture-specific optimizations

* WaveScalar and its wave-ordering annotation scheme
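The two dataflow entries in the comparison above, firing on operand availability instead of a program counter and steering operands instead of branching, can be sketched in a few lines of Python. This is a minimal illustration, not any real dataflow machine: the names `Node`, `steer`, and `run` are made up for this sketch, each node fires at most once, and `None` stands for "no operand sent", which a real system would not do.

```python
from collections import deque

class Node:
    """A dataflow instruction: fires once every input port holds an operand."""
    def __init__(self, name, func, n_inputs):
        self.name = name
        self.func = func
        self.inputs = [None] * n_inputs
        self.filled = [False] * n_inputs
        self.edges = []  # (output port, destination node, destination port)

    def connect(self, dst, dst_port, out_port=0):
        self.edges.append((out_port, dst, dst_port))

    def receive(self, port, value):
        self.inputs[port] = value
        self.filled[port] = True
        return all(self.filled)  # True means the node is ready to fire

def steer(value, condition):
    """Route value to output port 0 (condition true) or port 1 (condition false)."""
    return (value, None) if condition else (None, value)

def run(initial_operands):
    """Inject (node, port, value) operands and fire nodes as they become ready."""
    ready = deque()
    results = {}
    for node, port, value in initial_operands:
        if node.receive(port, value):
            ready.append(node)
    while ready:  # no program counter: order is driven by operand arrival
        node = ready.popleft()
        outputs = node.func(*node.inputs)
        if not isinstance(outputs, tuple):
            outputs = (outputs,)
        results[node.name] = outputs
        for out_port, dst, dst_port in node.edges:
            value = outputs[out_port]
            if value is not None:  # the untaken steer side sends nothing
                if dst.receive(dst_port, value):
                    ready.append(dst)
    return results

# |x - y| as a graph: sub -> is_neg -> steer -> (neg | keep)
sub = Node("sub", lambda a, b: a - b, 2)
is_neg = Node("is_neg", lambda d: d < 0, 1)
st = Node("steer", steer, 2)
neg = Node("neg", lambda d: -d, 1)
keep = Node("keep", lambda d: d, 1)
sub.connect(is_neg, 0)
sub.connect(st, 0)
is_neg.connect(st, 1)
st.connect(neg, 0, out_port=0)
st.connect(keep, 0, out_port=1)

results = run([(sub, 0, 3), (sub, 1, 5)])
print(results["neg"])
```

With inputs 3 and 5 the difference is negative, so the steer forwards it to `neg` and `keep` never fires. The node that is not selected simply receives no operand, which is how the graph expresses a branch.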
Dataflow Revives!
• TERAFLUX (Unisi, BSC, Microsoft, HP, …)
  • Language
  • Compiler
  • Simulator (no actual HW yet)
• OmpSs (BSC)
  • Heterogeneous
• TBB Flow Graph (Intel)
  • Create and connect nodes
  • Associate them with lambda functions
  • Inject starter operands
Maxeler
• Static Dataflow – DAGs (mostly)
• FPGA based – DFE (DataFlow Engine)
• Michael Flynn – MPP / SBAC-PAD 2014 Keynote
• More performance requires more effort (Flynn’s words)
• Compiler – Dataflow Graph in FPGA
• Galava DFE – Academic version (USD 4999)
  • 500 multipliers
  • 12 GB RAM
  • PCI-E
Maxeler - Products
• CPUs plus DFEs – Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
• DFEs shared over Infiniband – Up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
• Low latency connectivity – Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
• MaxWorkstation – Desktop development system
• MaxCloud – On-demand scalable accelerated compute resource, hosted in London
Maxeler - RTM
• 3U System
  • 1U traditional CPU node
  • 2 x MPC-X 2000 (16 DFEs)
  • Less than 2.5KW power usage
• Performance = 80 x 16 core Intel nodes!
  • 27x space reduction
  • 15x power consumption reduction
  • 5x improvement on total cost of ownership
• There are other similar examples
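As a rough sanity check, the space figure above can be reproduced from the other numbers on the slide. The sketch assumes that each of the 80 Intel nodes is a 1U server, which the slide does not state. The power figures it derives are only what the 15x claim implies, not measured values.

```python
# Figures quoted on the slide.
system_height_u = 3      # 3U Maxeler system
system_power_kw = 2.5    # "less than 2.5KW", used here as an upper bound
equivalent_nodes = 80    # 16-core Intel nodes with the same performance
power_reduction = 15     # claimed 15x lower power consumption

# Assumption (not on the slide): each Intel node is a 1U server.
node_height_u = 1

cpu_cluster_height_u = equivalent_nodes * node_height_u
space_reduction = cpu_cluster_height_u / system_height_u

# Upper bound on the CPU cluster power implied by the 15x claim.
implied_cluster_power_kw = system_power_kw * power_reduction
implied_power_per_node_w = implied_cluster_power_kw * 1000 / equivalent_nodes

print(f"space reduction: {space_reduction:.1f}x")
print(f"implied CPU cluster power: up to {implied_cluster_power_kw} kW")
print(f"implied power per node: up to {implied_power_per_node_w:.0f} W")
```

Under the 1U assumption, 80U against 3U gives about 26.7x, which matches the rounded 27x on the slide.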
TALM
• TALM is an Architecture and Language for Multithreading
• Hybrid Dataflow/Von Neumann (coarse-grained)
• Trebuchet Virtual Machine
• THLL (Annotations – C)
• Couillard Compiler
[Figure: TALM toolchain. C source (.c) → Blocks Definition (THLL) → annotated source (.df.c) → Couillard (super-instruction code extraction, dataflow compilation) → super-instructions source (.lib.c) and dataflow ASM code (.fl). Library compilation (gcc) turns .lib.c into the super-instruction library (.so). The Assembler (placement file creation, dataflow binary code generation) turns .fl into the dataflow binary (.flb) and the placement file (.pla). The Trebuchet loader reads these and places instructions (e.g. Inst 3, 50, 52 on PE 1; Inst 19, 39, 43 on PE N) on PEs 1…N connected by a network.]
TALM – NW Code
TALM – Results - Blackscholes
TALM – Results - NW
TALM Extra Features
• Static Scheduler – Can use profiler information
• Selective Work Stealing – Custom heuristic
• Memory Speculation
  • Transactional Memories
  • Distributed Control – Commit Graph
  • Avoid manual synchronization (dummy edges)
  • No Compiler Support yet
• Error Detection and Recovery
  • Redundant execution
  • Distributed Control – in the graph
• Can have super-instructions in CUDA
  • Compiler support needed (data movements)
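The redundant execution idea listed above can be illustrated with a small sketch: run a task twice on the same operands, commit only if both results agree, and re-execute otherwise. This is not TALM's mechanism. The slides say control is distributed in the graph, while this sketch uses a central loop for brevity. `run_redundant` and `FlakyAdder` are hypothetical names, and the fault is injected deterministically.

```python
def run_redundant(task, operands, max_retries=3):
    """Execute task twice per attempt; commit when both results agree."""
    for attempt in range(1, max_retries + 1):
        primary = task(*operands)
        replica = task(*operands)
        if primary == replica:      # agreement: treat the result as correct
            return primary, attempt
        # mismatch: an error was detected, so recover by re-executing
    raise RuntimeError("results never agreed; giving up")

class FlakyAdder:
    """Hypothetical task that returns a corrupted result on its first call only."""
    def __init__(self):
        self.calls = 0

    def __call__(self, a, b):
        self.calls += 1
        result = a + b
        return result + 1 if self.calls == 1 else result

task = FlakyAdder()
value, attempts = run_redundant(task, (2, 3))
print(value, attempts)
```

The first attempt detects the injected fault because the two copies disagree, and the second attempt commits. Comparing results only works for tasks without side effects, which fits the dataflow requirement that all data travels as operands.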
Sucuri
• A minimalistic Dataflow Programming Library for Python
• Transparent Execution on Clusters
  • Mpi_enable = TRUE
  • Need to obey DF principles – All data treated as operands
  • Python serializes objects – easy implementation
• Main Classes
  • Scheduler – Pool of tasks
  • Graph – Container
  • Nodes – Related to functions
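The point about Python serializing objects can be made concrete with the standard `pickle` module. If every piece of data travels as an operand, an operand message can be turned into bytes and rebuilt on another machine without any per-type code. The message layout and the function names below are assumptions made for illustration, not Sucuri's actual wire format.

```python
import pickle

def pack_operand(dst_node_id, dst_port, value):
    """Serialize an operand message into bytes that can cross the network."""
    return pickle.dumps((dst_node_id, dst_port, value))

def unpack_operand(payload):
    """Rebuild the operand message on the receiving side."""
    return pickle.loads(payload)

# Any picklable Python object can be an operand.
operand = {"block": [1, 2, 3], "offset": (0, 4)}
payload = pack_operand(dst_node_id=7, dst_port=1, value=operand)

node_id, port, value = unpack_operand(payload)
print(type(payload).__name__, node_id, port, value)
```

This also shows why the dataflow principle matters on a cluster: state kept in a shared variable would not be part of the message, so it would not reach the remote worker.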
Sucuri - Architecture
Sucuri - Pipeline
Create a Graph
Create a Scheduler
Create Nodes
Add nodes to Graph
Connect Nodes
Start Scheduler
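The steps listed above can be walked through with a toy implementation of the three main classes. This is a self-contained sketch, not the Sucuri API: the `Graph`, `Node`, and `Scheduler` classes below, their methods, and the `n_workers` parameter are hypothetical stand-ins, and a thread pool plays the role of the pool of tasks.

```python
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

class Node:
    """Wraps a function; becomes ready when all of its input ports are filled."""
    def __init__(self, func, n_inputs):
        self.func = func
        self.n_inputs = n_inputs
        self.operands = {}      # input port -> value
        self.successors = []    # (destination node, destination port)

    def add_edge(self, dst, dst_port):
        self.successors.append((dst, dst_port))

class Graph:
    """Container for nodes."""
    def __init__(self):
        self.nodes = []

    def add(self, node):
        self.nodes.append(node)

class Scheduler:
    """Dispatches ready nodes to a pool of workers and forwards their results."""
    def __init__(self, graph, n_workers=2):
        self.graph = graph
        self.n_workers = n_workers

    def start(self):
        results = {}
        with ThreadPoolExecutor(max_workers=self.n_workers) as pool:
            # Nodes without inputs are ready immediately.
            pending = {pool.submit(n.func): n
                       for n in self.graph.nodes if n.n_inputs == 0}
            while pending:
                done, _ = wait(pending, return_when=FIRST_COMPLETED)
                for future in done:
                    node = pending.pop(future)
                    value = future.result()
                    results[node] = value
                    for dst, port in node.successors:
                        dst.operands[port] = value
                        if len(dst.operands) == dst.n_inputs:
                            args = [dst.operands[p] for p in range(dst.n_inputs)]
                            pending[pool.submit(dst.func, *args)] = dst
        return results

graph = Graph()                          # create a graph
sched = Scheduler(graph, n_workers=2)    # create a scheduler
a = Node(lambda: 20, 0)                  # create nodes
b = Node(lambda: 22, 0)
total = Node(lambda x, y: x + y, 2)
for node in (a, b, total):               # add nodes to the graph
    graph.add(node)
a.add_edge(total, 0)                     # connect nodes
b.add_edge(total, 1)
results = sched.start()                  # start the scheduler
print(results[total])
```

Operand matching happens only in the scheduler's main loop, so the workers never touch shared state. The two source nodes can run in parallel, and `total` is submitted only after both of its operands have arrived.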
Sucuri – Results - LCS
Ongoing Work
• TALM
  • Compiler Improvements
  • Cluster Version
  • Placement Improvements
• Sucuri
  • Node Gallery
  • Graph Templates
  • Better scheduler
• Both
  • Full GPU Support
  • FPGA Support
  • Multiple implementations for the same task!
  • Applications and users!
ImageFilterNode
Fork/Join Graph
WavefrontGraph
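The wavefront pattern named above can be illustrated with LCS, one of the benchmarks in these slides. Each cell (i, j) of the dynamic programming table depends only on (i-1, j), (i, j-1), and (i-1, j-1), so all cells on the same anti-diagonal are independent and could be separate dataflow nodes. The sketch below runs sequentially and only records the waves. It shows the dependency structure under that assumption and is not Sucuri's WavefrontGraph template.

```python
def lcs_wavefront(a, b):
    """LCS length computed one anti-diagonal (wave) at a time."""
    rows, cols = len(a), len(b)
    table = [[0] * (cols + 1) for _ in range(rows + 1)]
    waves = []
    for d in range(2, rows + cols + 1):       # cells with i + j == d
        wave = [(i, d - i)
                for i in range(max(1, d - cols), min(rows, d - 1) + 1)]
        # Every cell in this wave reads only cells from earlier waves,
        # so the loop body below could run in parallel for the whole wave.
        for i, j in wave:
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
        waves.append(wave)
    return table[rows][cols], waves

length, waves = lcs_wavefront("AGGTAB", "GXTXAYB")
print(length, len(waves), max(len(w) for w in waves))
```

The available parallelism grows and then shrinks along the diagonals: a table of 6 by 7 cells needs 12 waves, and the widest wave holds 6 independent cells.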
Our Dataflow Research Group
• Leandro Marzulo (UERJ)
• Tiago Alves
• Felipe França (UFRJ)
• Sandip Kundu (UMASS)
• Vítor Santos Costa (UPorto)
• Master Students (6 ongoing, 1 finished):
  • Brunno Goldstein – UFRJ
  • Leandro Santiago – UFRJ
  • Marcos Paulo Rocha – UFRJ
  • Leandro Rouberte – UFRJ
  • Alexandre Machado – UERJ
  • Julio Ho – UERJ
  • Alexandre Sardinha – Finished his Master – Petrobras
• Undergrad students (UERJ)
  • 6 finished – 3 are Master students now
  • 11 ongoing
Questions?
TALM – Results - RT
Sucuri – Hierarchical reduction
Sucuri - Wavefront