DATAFLOW FOR HIGH PERFORMANCE COMPUTING Leandro Marzulo Universidade do Estado do Rio de Janeiro...
DATAFLOW FOR HIGH PERFORMANCE COMPUTING
Leandro Marzulo
Universidade do Estado do Rio de Janeiro
No more free lunch…
• Can’t buy a new processor and expect to improve performance automatically.
• Parallel programming is a must!
  • Average programmers don’t know how to do it
  • Parallel implementation may not scale
  • Synchronization
• Heterogeneous Systems
  • So many devices – CPU, GPU, Xeon Phi, FPGA …
  • So many libraries/languages – CUDA, OpenCL, TBB, OpenMP, MPI, Pthreads, VHDL…
• TOO MUCH TO LEARN!
Sweet times ahead...
• Time to think out of the box
• To experiment with different stuff
• To revisit old concepts
• To rethink the way we teach programming
• To connect to different fields and research groups
The industry is investing!!!
Why Dataflow?
Just because it feels natural!
Dataflow vs. Von Neumann

Characteristic            | Dataflow                                     | Von Neumann
Register File             | ✖                                            | ✔
Program Counter           | ✖                                            | ✔
Control Flow              | Steer (one per operand)                      | Branches and Jumps
Parallelism               | Natural (Parallelism Explosion)              | Pipeline, Branch Prediction, Tomasulo, ROB, …
Language requirements     | Functional (no side effects) *               | Nonrestrictive
Compilation difficulties  | Control Flow (especially loops and functions) | Several architecture-specific optimizations

* Wavescalar and its wave-ordering annotation scheme
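The "Dataflow" column can be made concrete with a toy interpreter: there is no program counter, an instruction simply fires once all of its operands have arrived, and a steer forwards a value to one of two outputs depending on a boolean operand. The sketch below is only an illustration under those assumptions; the names `Instr`, `steer`, `run`, and `abs_diff` are hypothetical and do not come from TALM, WaveScalar, or any real dataflow machine.

```python
from collections import deque

NO_TOKEN = object()  # marks a steer output that carries no operand

class Instr:
    """Toy dataflow instruction: fires when every input port holds an operand."""
    def __init__(self, name, fn, n_inputs):
        self.name, self.fn, self.n_inputs = name, fn, n_inputs
        self.operands = {}   # input port -> value
        self.targets = []    # (output index, destination Instr, destination port)

    def connect(self, dst, dst_port, out=0):
        self.targets.append((out, dst, dst_port))

def steer(value, condition):
    """Forward `value` on output 0 when `condition` is true, else on output 1."""
    return (value, NO_TOKEN) if condition else (NO_TOKEN, value)

def run(initial_tokens):
    """Run until no instruction is ready; returns (firing order, results)."""
    ready, fired, results = deque(), [], {}

    def deliver(instr, port, value):
        instr.operands[port] = value
        if len(instr.operands) == instr.n_inputs:   # the firing rule
            ready.append(instr)

    for instr, port, value in initial_tokens:
        deliver(instr, port, value)
    while ready:
        instr = ready.popleft()
        args = [instr.operands[p] for p in range(instr.n_inputs)]
        instr.operands = {}
        outputs = instr.fn(*args)
        if not isinstance(outputs, tuple):
            outputs = (outputs,)
        fired.append(instr.name)
        results[instr.name] = outputs
        for out, dst, dst_port in instr.targets:
            if outputs[out] is not NO_TOKEN:
                deliver(dst, dst_port, outputs[out])
    return fired, results

def abs_diff(a, b):
    """|a - b| as a dataflow graph; the branch is replaced by a steer."""
    sub = Instr("sub", lambda x, y: x - y, 2)
    lt0 = Instr("lt0", lambda d: d < 0, 1)
    st = Instr("steer", steer, 2)
    neg = Instr("neg", lambda d: -d, 1)
    keep = Instr("keep", lambda d: d, 1)
    sub.connect(lt0, 0)
    sub.connect(st, 0)
    lt0.connect(st, 1)
    st.connect(neg, 0, out=0)    # taken when the difference is negative
    st.connect(keep, 0, out=1)   # taken otherwise
    fired, results = run([(sub, 0, a), (sub, 1, b)])
    value = (results.get("neg") or results.get("keep"))[0]
    return value, fired
```

Calling `abs_diff(3, 10)` fires `sub`, `lt0`, `steer`, and `neg`; the `keep` instruction never receives an operand, so it never runs. That is the dataflow counterpart of a branch not taken.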
Dataflow Revives!
• TERAFLUX (Unisi, BSC, Microsoft, HP, …)
  • Language
  • Compiler
  • Simulator (no actual HW yet)
• OmpSS (BSC)
  • Heterogeneous
• TBB Flowgraph (Intel)
  • Create and connect nodes
  • Associate them to Lambda Functions
  • Inject starter operands
Maxeler
• Static Dataflow – DAGs (mostly)
• FPGA based – DFE (DataFlow Engine)
• Michael Flynn – MPP / SBAC-PAD 2014 Keynote
• More performance requires more effort (Flynn’s words)
• Compiler – Dataflow Graph in FPGA
• Galava DFE – Academic version (USD 4999)
  • 500 multipliers
  • 12 GB RAM
  • PCI-E
Maxeler - Products
• CPUs plus DFEs: Intel Xeon CPU cores and up to 6 DFEs with 288GB of RAM
• DFEs shared over Infiniband: up to 8 DFEs with 384GB of RAM and dynamic allocation of DFEs to CPU servers
• Low latency connectivity: Intel Xeon CPUs and 1-2 DFEs with up to six 10Gbit Ethernet connections
• MaxWorkstation: desktop development system
• MaxCloud: on-demand scalable accelerated compute resource, hosted in London
Maxeler - RTM
• 3U System
  • 1U traditional CPU node
  • 2 x MPC-X 2000 (16 DFEs)
  • Less than 2.5 kW power usage
• Performance = 80 x 16 core Intel nodes!
  • 27x space reduction
  • 15x power consumption reduction
  • 5x improvement on total cost of ownership
• There are other similar examples
TALM• Talm is an Architecture and Language for Multithreading
• Hybrid Dataflow/Von Neumann (coarse-grained)
• Trebuchet Virtual Machine
• THLL (Annotations – C)
• Couillard Compiler
[Figure: TALM toolchain and the Trebuchet virtual machine. Files: C Source (.c), Annotated Source (.df.c), Super-instructions Source (.lib.c), Dataflow ASM Code (.fl), Super-instruction Library (.so), Dataflow Binary (.flb), Placement File (.pla). Stages: Blocks Definition (THLL); Couillard (Super-Instruction Code Extraction, Dataflow Compilation); Assembler (Placement File Creation, Dataflow Binary Code Generation); Library Compilation (gcc); Loader. Trebuchet: PE 1 (Inst 3, Inst 50, Inst 52) … PE N (Inst 19, Inst 39, Inst 43), connected by a Network.]
TALM – NW Code
TALM – Results - Blackscholes
TALM – Results - NW
TALM Extra Features
• Static Scheduler – Can use profiler information
• Selective Workstealing – Custom heuristic
• Memory Speculation
  • Transactional Memories
  • Distributed Control – Commit Graph
  • Avoid manual synchronization (dummy edges)
  • No Compiler Support yet
• Error Detection and Recovery
  • Redundant execution
  • Distributed Control – in the graph
• Can have super-instructions in CUDA
  • Compiler support needed (data movements)
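The "Redundant execution" item can be sketched as follows: run the same task twice on the same operands, commit the result only when both copies agree, and re-execute otherwise. This is a centralized illustration of the idea only; the slide says TALM distributes this control in the graph itself, which the sketch does not attempt to reproduce. The names `run_redundant` and `make_faulty` are hypothetical, and the fault injector exists purely to make the example deterministic.

```python
def run_redundant(task, operands, max_retries=3):
    """Execute `task` twice per attempt; commit only when both copies agree.

    Returns (result, attempts_used). Raises if no agreement is reached.
    """
    for attempt in range(1, max_retries + 1):
        primary = task(*operands)
        replica = task(*operands)
        if primary == replica:          # detection: mismatch means an error
            return primary, attempt     # commit
        # recovery: fall through and re-execute both copies
    raise RuntimeError("no agreement after %d attempts" % max_retries)

def make_faulty(task, corrupt_calls):
    """Wrap `task` so that the listed call numbers (1-based) return a wrong value."""
    state = {"calls": 0}
    def wrapped(*args):
        state["calls"] += 1
        result = task(*args)
        return result + 1 if state["calls"] in corrupt_calls else result
    return wrapped

def demo():
    add = lambda x, y: x + y
    flaky_add = make_faulty(add, corrupt_calls={2})   # second call is corrupted
    return run_redundant(flaky_add, (2, 3))
```

In `demo()`, the first attempt sees 5 and 6 and is discarded; the second attempt sees 5 twice and commits. The approach only works for tasks without side effects, which matches the dataflow requirement that all data travel as operands.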
Sucuri
• A minimalistic Dataflow Programming Library for Python
• Transparent Execution on Clusters
  • Mpi_enable = TRUE
  • Need to obey DF principles – All data treated as operands
  • Python serializes objects – easy implementation
• Main Classes
  • Scheduler – Pool of tasks
  • Graph – Container
  • Nodes – Related to functions
Sucuri - Architecture
Sucuri - Pipeline
Create a Graph
Create a Scheduler
Create Nodes
Connect Nodes
Start Scheduler
Add nodes to Graph
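The steps listed for this slide can be illustrated with a self-contained miniature. The classes below are hypothetical stand-ins that only mirror the three roles named earlier (Scheduler as a pool of tasks, Graph as a container, Nodes tied to functions); they are not the Sucuri API, the scheduler is sequential, and the step order used here (nodes are added to the graph before the scheduler starts) is an assumption.

```python
from collections import deque

class Node:
    """Tied to a function; becomes ready when every input port has an operand."""
    def __init__(self, fn, n_inputs):
        self.fn = fn
        self.n_inputs = n_inputs
        self.inputs = {}   # input port -> operand
        self.edges = []    # (destination node, destination port)

    def add_edge(self, dst, dst_port):
        self.edges.append((dst, dst_port))

class Graph:
    """Plain container of nodes."""
    def __init__(self):
        self.nodes = []

    def add(self, node):
        self.nodes.append(node)
        return node

class Scheduler:
    """Holds a pool of ready tasks and runs them one at a time."""
    def __init__(self, graph):
        self.graph = graph
        self.pool = deque()

    def start(self):
        # Source nodes take no operands, so they are ready immediately.
        self.pool.extend(n for n in self.graph.nodes if n.n_inputs == 0)
        while self.pool:
            node = self.pool.popleft()
            args = [node.inputs[p] for p in range(node.n_inputs)]
            result = node.fn(*args)
            for dst, port in node.edges:
                dst.inputs[port] = result          # results travel as operands
                if len(dst.inputs) == dst.n_inputs:
                    self.pool.append(dst)

def build_and_run():
    collected = []
    graph = Graph()                        # create a graph
    sched = Scheduler(graph)               # create a scheduler
    src_a = Node(lambda: 2, 0)             # create nodes
    src_b = Node(lambda: 3, 0)
    mul = Node(lambda x, y: x * y, 2)
    sink = Node(collected.append, 1)
    for node in (src_a, src_b, mul, sink):
        graph.add(node)                    # add nodes to the graph
    src_a.add_edge(mul, 0)                 # connect nodes
    src_b.add_edge(mul, 1)
    mul.add_edge(sink, 0)
    sched.start()                          # start the scheduler
    return collected
```

A real scheduler would hand ready nodes to worker processes instead of running them inline; because every node communicates only through operands, that change would not affect how the graph is built.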
Sucuri – Results - LCS
Ongoing Work
• TALM
  • Compiler Improvements
  • Cluster Version
  • Placement Improvements
• Sucuri
  • Node Gallery
  • Graph Templates
  • Better scheduler
• Both
  • Full GPU Support
  • FPGA Support
  • Multiple implementations for the same task!
  • Applications and users!
ImageFilterNode
Fork/Join Graph
WavefrontGraph
Our Dataflow Research Group
• Leandro Marzulo (UERJ)
• Tiago Alves
• Felipe França (UFRJ)
• Sandip Kundu (UMASS)
• Vítor Santos Costa (UPorto)
• Master Students (6 ongoing, 1 finished):
  • Brunno Goldstein – UFRJ
  • Leandro Santiago – UFRJ
  • Marcos Paulo Rocha – UFRJ
  • Leandro Rouberte – UFRJ
  • Alexandre Machado – UERJ
  • Julio Ho – UERJ
  • Alexandre Sardinha – Finished his Master – Petrobras
• Undergrad students (UERJ)
  • 6 finished – 3 are Master students now
  • 11 ongoing
Questions?
TALM – Results - RT
Sucuri – Hierarchical reduction
Sucuri - Wavefront
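The wavefront pattern named in this slide title can be illustrated with the LCS (longest common subsequence) table: cell (i, j) depends only on its left, upper, and upper-left neighbours, so all cells on the same anti-diagonal are independent and could be separate dataflow tasks. The sketch below is a sequential illustration of that dependency structure, not Sucuri code; `lcs_wavefront` is a hypothetical name, and the per-wave widths are returned only to show how much parallelism each step exposes.

```python
def lcs_wavefront(a, b):
    """LCS length computed one anti-diagonal (wave) at a time.

    Returns (lcs_length, widths), where widths[k] is the number of
    mutually independent cells in the k-th wave.
    """
    n, m = len(a), len(b)
    table = [[0] * (m + 1) for _ in range(n + 1)]
    widths = []
    for d in range(2, n + m + 1):            # d = i + j identifies the wave
        lo = max(1, d - m)                   # keep j = d - i within 1..m
        hi = min(n, d - 1)
        wave = [(i, d - i) for i in range(lo, hi + 1)]
        widths.append(len(wave))
        # Every cell below reads only cells from earlier waves, so the
        # iterations of this inner loop could run in parallel.
        for i, j in wave:
            if a[i - 1] == b[j - 1]:
                table[i][j] = table[i - 1][j - 1] + 1
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table[n][m], widths
```

The available parallelism grows from one cell in the first wave up to min(len(a), len(b)) cells in the middle waves and then shrinks again, which is why wavefront graphs are usually built from blocks of cells rather than single cells.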