A Profiler for a Multi-Core Multi-FPGA System
by
Daniel Nunes
Supervisor:
Professor Paul Chow
September 30th, 2008
University of Toronto
Electrical and Computer Engineering Department
Overview
- Background
- Profiling Model
- The Profiler
- Case Studies
- Conclusions
- Future Work
How Do We Program This System?
Let's look at what traditional clusters use and try to port it to this type of machine.
[Diagram: four User FPGAs and one Control FPGA]
Traditional Clusters
- MPI is a de facto standard for parallel HPC
- MPI can also be used to program a cluster of FPGAs
The TMD
- A heterogeneous multi-core multi-FPGA system developed at UofT
- Uses message passing (TMD-MPI)
TMD-MPI
- A subset of the MPI standard
- Decouples the application from the underlying hardware
- TMD-MPI functionality is also implemented in hardware (TMD-MPE); a minimal example in this style follows below
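Because TMD-MPI preserves the standard MPI interface, the same source can be compiled for a workstation cluster with MPICH or for the TMD's embedded processors. A minimal sketch in that style; whether every call shown is part of the implemented subset is an assumption here:

/* Minimal send/receive program in the MPI subset style used by TMD-MPI. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 sends one word to rank 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}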
TMD-MPI – Rendezvous Protocol
This implementation uses the Rendezvous protocol, a synchronous communication mode:
1. Request to Send
2. Acknowledge
3. Data
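In standard MPI terms, Rendezvous corresponds to the synchronous mode: the send cannot complete until the matching receive has started. A minimal sketch using MPI_Ssend to make the handshake visible (TMD-MPI's internal exchange is the three-phase sequence above):

/* Rendezvous semantics: MPI_Ssend blocks until the receiver has
 * posted its matching MPI_Recv, mirroring the Req./Ack./Data phases. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, data[64] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* 1. "Request to Send" goes out; 2. the call waits for the
         * receiver's "Acknowledge"; 3. only then does "Data" flow.  */
        MPI_Ssend(data, 64, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Posting the receive is what releases the sender. */
        MPI_Recv(data, 64, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}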
The TMD Implementation on BEE2 Boards
[Diagram: a BEE2 board with four User FPGAs and one Control FPGA, each holding PowerPC (PPC) and MicroBlaze (MB) processors connected by an on-chip network (NoC)]
How Do We Profile This System?
Let's look at how it is done in traditional clusters and try to adapt it to hardware.
MPICH - MPE
- Collects information from MPI calls and user-defined states through embedded calls (a sketch follows below)
- Includes a tool to view all log files (Jumpshot)
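A sketch of the embedded instrumentation on a traditional cluster, following the MPE logging API that ships with MPICH (the state name, color, and log-file name are arbitrary):

/* Paired MPE events delimit a user-defined state that Jumpshot can draw. */
#include <mpi.h>
#include <mpe.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPE_Init_log();

    int ev_begin = MPE_Log_get_event_number();
    int ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_begin, ev_end, "compute", "red");

    MPE_Log_event(ev_begin, 0, NULL);  /* state begins */
    /* ... computation to be profiled ... */
    MPE_Log_event(ev_end, 0, NULL);    /* state ends   */

    MPE_Finish_log("app_log");         /* writes the log for Jumpshot */
    MPI_Finalize();
    return 0;
}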
Goals Of This Work
- Implement a hardware profiler capable of extracting the same data as the MPE
- Make it less intrusive
- Make it compatible with the API used by MPE
- Make it compatible with Jumpshot
Tracers
The Profiler interacts with the computation elements through tracers that register important events.
The TMD-MPE requires two tracers (send and receive) due to its parallel nature.
[Diagram: a computation tracer on each processor (PPC) and each hardware engine, and separate send and receive tracers on each TMD-MPE]
Tracers - Hardware Engine Computation
[Diagram: the hardware-engine tracer: one event register (R0) and the cycle counter feeding a MUX over 32-bit datapaths]
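Whatever element is being traced, each captured event reduces to a few 32-bit words. A host-side view of one trace record might look like this; the exact field layout is an assumption, only the 32-bit widths follow the diagram:

/* Hypothetical host-side layout of one tracer record. */
#include <stdint.h>

struct trace_event {
    uint32_t cycle;    /* timestamp latched from the cycle counter    */
    uint32_t event_id; /* which event or state transition occurred    */
    uint32_t data;     /* optional payload, e.g. message size or rank */
};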
Tracers - TMD-MPE
[Diagram: the TMD-MPE tracer: five event registers (R0-R4) and an MPE data register, multiplexed with the cycle counter over 32-bit datapaths]
Tracers – Processors' Computation
[Diagram: the PowerPC/MicroBlaze tracer: a 9 x 32-bit register bank with a stack for MPI call states and a 5 x 32-bit register bank with a stack for user-defined states, multiplexed with the cycle counter over 32-bit datapaths]
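The stacks are there because MPI calls and user-defined states can nest, so an end event must be paired with the most recent unmatched begin. A minimal host-side model of that bookkeeping (the names and the logging hook are illustrative, not the hardware interface):

#include <stdint.h>

#define MAX_DEPTH 16

static uint32_t state_stack[MAX_DEPTH];
static int top;

/* Entering a state pushes its id; the begin event is logged here. */
void state_begin(uint32_t id, uint32_t cycle)
{
    state_stack[top++] = id;
    /* log (cycle, id, BEGIN) to the tracer */
    (void)cycle;
}

/* Leaving a state pops the innermost id and logs the matching end. */
void state_end(uint32_t cycle)
{
    uint32_t id = state_stack[--top];
    /* log (cycle, id, END) */
    (void)id;
    (void)cycle;
}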
Profiler’s Network
[Diagram: the tracers on each User FPGA feed a Gather block; the Gather blocks forward events to the Collector on the Control FPGA, which stores them in DDR]
Synchronization
- Within the same board: release the reset of the cycle counters simultaneously
- Between boards: periodic exchange of messages between the root board and all other boards (see the sketch below)
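A standard way to relate two free-running counters from one message round trip is to assume a symmetric link. A sketch of that arithmetic; the exact exchange used by the profiler may differ:

#include <stdint.h>

/* t0: root's counter when the probe leaves
 * t1: remote board's counter when the probe arrives
 * t2: root's counter when the reply returns          */
int64_t estimate_offset(uint64_t t0, uint64_t t1, uint64_t t2)
{
    uint64_t rtt = t2 - t0;  /* round trip in cycles */
    /* With a symmetric link the probe arrived at t0 + rtt/2 on the
     * root's clock, when the remote counter read t1.              */
    return (int64_t)t1 - (int64_t)(t0 + rtt / 2);
}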
Profiler's Flow
Collect Data → Dump to Host → Convert to CLOG2 → Convert to SLOG2 → Visualize with Jumpshot
The conversion and visualization steps run after execution: the back end produces the log data, and the front end converts and displays it.
Case Studies
- Barrier: sequential vs binary tree
- TMD-MPE: making the unexpected message queue addressable by rank
- The Heat Equation: blocking calls vs non-blocking calls
- LINPACK Benchmark: a 16-node system calculating an LU decomposition of a matrix
Barrier
Synchronization call: no node will advance until all nodes have reached the barrier.
[Diagram: eight nodes (0-7) arranged as a binary tree rooted at node 0 vs a flat sequence 0-7]
[Jumpshot timelines: Send and Receive states for the barrier implemented sequentially vs as a binary tree]
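The binary tree replaces the root's O(n) sequential exchange with O(log n) rounds. A generic sketch of a tree barrier built from point-to-point calls (not necessarily the TMD's exact implementation):

#include <mpi.h>

/* Gather "arrived" messages up the tree, then release back down it. */
void tree_barrier(MPI_Comm comm)
{
    int rank, size, token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int parent = (rank - 1) / 2;
    int left   = 2 * rank + 1;
    int right  = 2 * rank + 2;

    /* Gather phase: wait for both children, then notify the parent. */
    if (left < size)
        MPI_Recv(&token, 1, MPI_INT, left, 0, comm, MPI_STATUS_IGNORE);
    if (right < size)
        MPI_Recv(&token, 1, MPI_INT, right, 0, comm, MPI_STATUS_IGNORE);
    if (rank != 0)
        MPI_Send(&token, 1, MPI_INT, parent, 0, comm);

    /* Release phase: wait for the parent, then release the children. */
    if (rank != 0)
        MPI_Recv(&token, 1, MPI_INT, parent, 1, comm, MPI_STATUS_IGNORE);
    if (left < size)
        MPI_Send(&token, 1, MPI_INT, left, 1, comm);
    if (right < size)
        MPI_Send(&token, 1, MPI_INT, right, 1, comm);
}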
TMD-MPE – Unexpected Messages Queue
All requests to send that arrive at a node before it issues an MPI_Recv are kept in this queue.
[Jumpshot timelines: Send, Receive, and Queue Search and Reorganization states, before and after making the queue addressable by rank]
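Making the queue addressable by rank means a receive searches only the list for its source rank instead of walking one global queue, which is what shortens the Queue Search and Reorganization state. A sketch of the lookup (types, sizes, and names are illustrative):

#include <stddef.h>
#include <stdint.h>

#define NUM_RANKS 16

struct pending_req {
    uint32_t tag;              /* tag carried by the request to send   */
    uint32_t size;             /* message size announced by the sender */
    struct pending_req *next;
};

/* One queue head per source rank instead of a single global queue. */
static struct pending_req *unexpected[NUM_RANKS];

struct pending_req *find_unexpected(int src_rank, uint32_t tag)
{
    struct pending_req *p;
    for (p = unexpected[src_rank]; p != NULL; p = p->next)
        if (p->tag == tag)
            return p;  /* a matching request already arrived */
    return NULL;       /* none yet: the receive must wait    */
}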
The Heat Equation Application
Partial differential equation that describes the temperature change over time
$$v_{i,j} = \frac{u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1}}{4}, \qquad \sum_{i,j} \left( u_{i,j} - v_{i,j} \right)^2$$
[Jumpshot timelines: Send, Receive, and Computation states for the blocking and non-blocking versions of the heat equation application]
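With non-blocking calls the boundary exchange can overlap the interior update, which is the difference the two timelines show. A sketch of the non-blocking variant for a 1-D row decomposition (buffer names are illustrative; edge ranks would pass MPI_PROC_NULL as a neighbor):

#include <mpi.h>

/* Exchange halo rows with the neighbors above and below while the
 * interior points, which need no remote data, are being updated. */
void exchange_halos(double *top_halo, double *bot_halo,
                    double *top_row, double *bot_row,
                    int ncols, int up, int down, MPI_Comm comm)
{
    MPI_Request req[4];

    MPI_Irecv(top_halo, ncols, MPI_DOUBLE, up,   0, comm, &req[0]);
    MPI_Irecv(bot_halo, ncols, MPI_DOUBLE, down, 0, comm, &req[1]);
    MPI_Isend(top_row,  ncols, MPI_DOUBLE, up,   0, comm, &req[2]);
    MPI_Isend(bot_row,  ncols, MPI_DOUBLE, down, 0, comm, &req[3]);

    /* ... update interior points here, overlapping the transfers ... */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    /* ... then update the boundary rows using the received halos ... */
}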
The LINPACK Benchmark
Solves a system of linear equations
LU factorization with partial pivoting: PA = LU
[Diagram: matrix columns 0, 1, 2, ..., n-1 assigned cyclically to Rank 0, Rank 1, Rank 2, ...]
[Jumpshot timelines: Send, Receive, and Computation states for the LINPACK benchmark]
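With a column-cyclic layout, the pivot column of elimination step k is owned by rank k mod size and must reach every rank before the trailing update. A simplified sketch of one step (the benchmark's actual communication pattern may differ):

#include <mpi.h>

/* One elimination step: the owner shares its pivot column, then every
 * rank updates the trailing columns it owns.                          */
void lu_step(double *pivot_col, int nrows, int step, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int owner = step % size;  /* column-cyclic distribution */

    MPI_Bcast(pivot_col, nrows, MPI_DOUBLE, owner, comm);

    /* ... scale and update the locally owned trailing columns ... */
    (void)rank;
}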
Profiler’s Overhead
Block LUTs Flip-Flops BRAMsCollector 3856 (5%) 1279 (1%) 0 (0%)
Gather 187 (0%) 53 (0%) 0 (0%)
Engine Computation Tracer
396 (0%) 701 (1%) 0 (0%)
TMD-MPE Tracer 526 (0%) 1000 (1%) 0 (0%)
Processors Computation Tracer
without MPE1196 (1%) 1521 (2%) 0 (0%)
Processors Computation Tracer
with MPE
855 (1%) 1200 (1%) 0 (0%)
Conclusions
- All major features of the MPE were implemented
- The profiler was successfully used to study the behavior of the applications
- Less intrusive
- More events available to profile
- Can profile network components
- Compatible with existing profiling software environments
Future Work
- Reduce the footprint of the profiler's hardware blocks
- Profile the MicroBlaze and PowerPC in a non-intrusive way
- Allow real-time profiling
Thank You (Questions?)
The TMD (2)
[Diagram: an off-chip communications node (a PPC with TMD-MPE, connected through FSLs to an InterChip block and a XAUI interface) and computation nodes (hardware engines with TMD-MPE and a network interface), all attached to the on-chip network]
Profiler (2)
[Engine Profiler Architecture: the TMD-MPE's RX and TX tracers and the engine's computation tracer, each timestamped by the cycle counter, send their events to the Gather block]
[Processor Profiler Architecture: the PPC logs over the PLB and DCR through a DCR2FSL bridge and GPIO; together with the TMD-MPE's RX and TX tracers, the events go to the Gather block]
Profiler (1)
[Diagram: on each User FPGA (1-4) the PPC and MicroBlaze (μB) tracers feed a Gather block; events travel over inter-chip links (IC) and a switch to the Collector on the Control FPGA, which writes them to DDR; boards 0 through N connect through XAUI, and cycle counters on each FPGA timestamp the events over the on-chip network]
Hardware Profiling Benefits
- Less intrusive
- More events available to profile
- Can profile network components
- Compatible with existing profiling software environments
MPE Protocol
[Packet format: a header word with Ctrl bit = 1 carrying the Message Size (NDW), Opcode, and Src/Dest Rank; followed by a Tag word with Ctrl bit = 0 and NDW data words, Data-word (0) through Data-word (NDW-1), each with Ctrl bit = 0]
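A host-side sketch of assembling the header word. Only the field set (Ctrl bit, Message Size in data words, Opcode, Src/Dest Rank) follows the protocol; the bit positions and widths below are assumptions, since they are not fully recoverable from the slide:

#include <stdint.h>

/* Hypothetical packing of the MPE header word (field widths assumed). */
static uint32_t mpe_header(uint32_t ndw, uint32_t opcode, uint32_t rank)
{
    uint32_t word = 0;
    word |= 1u << 31;              /* Ctrl bit = 1 marks the header     */
    word |= (ndw & 0x1FFu) << 22;  /* Message Size (NDW), width assumed */
    word |= (opcode & 0x3u) << 20; /* Opcode, width assumed             */
    word |= (rank & 0xFFFFFu);     /* Src/Dest Rank, width assumed      */
    return word;
}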