A Profiler for a Multi-Core Multi-FPGA System
by
Daniel Nunes
Supervisor:
Professor Paul Chow
September 30th, 2008
University of Toronto
Electrical and Computer Engineering Department
Overview
- Background
- Profiling Model
- The Profiler
- Case Studies
- Conclusions
- Future Work
How Do We Program This System?
Let's look at what traditional clusters use and try to port it to this type of machine.
[Diagram: four User FPGAs and one Control FPGA]
Traditional Clusters
- MPI is a de facto standard for parallel HPC
- MPI can also be used to program a cluster of FPGAs
The TMD
- A heterogeneous multi-core multi-FPGA system developed at UofT
- Uses message passing (TMD-MPI)
TMD-MPI
- A subset of the MPI standard
- Decouples the application from the underlying hardware
- TMD-MPI functionality is also implemented in hardware (TMD-MPE); a minimal example in this style follows below
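Because TMD-MPI preserves the standard MPI interface, the same source can be compiled for a workstation cluster with MPICH or for the TMD's embedded processors. A minimal sketch in that style; whether every call shown is part of the implemented subset is an assumption here:

/* Minimal send/receive program in the MPI subset style used by TMD-MPI. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value = 42;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Rank 0 sends one word to rank 1. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
    }

    MPI_Finalize();
    return 0;
}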
TMD-MPI – Rendezvous Protocol
This implementation uses the Rendezvous protocol, a synchronous communication mode:
1. Request to Send
2. Acknowledge
3. Data
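In standard MPI terms, Rendezvous corresponds to the synchronous mode: the send cannot complete until the matching receive has started. A minimal sketch using MPI_Ssend to make the handshake visible (TMD-MPI's internal exchange is the three-phase sequence above):

/* Rendezvous semantics: MPI_Ssend blocks until the receiver has
 * posted its matching MPI_Recv, mirroring the Req./Ack./Data phases. */
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, data[64] = {0};

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* 1. "Request to Send" goes out; 2. the call waits for the
         * receiver's "Acknowledge"; 3. only then does "Data" flow.  */
        MPI_Ssend(data, 64, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Posting the receive is what releases the sender. */
        MPI_Recv(data, 64, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}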
The TMD Implementation on BEE2 Boards
[Diagram: a BEE2 board with four User FPGAs and one Control FPGA, each holding PowerPC (PPC) and MicroBlaze (MB) processors connected by an on-chip network (NoC)]
How Do We Profile This System?
Let's look at how it is done in traditional clusters and try to adapt it to hardware.
MPICH - MPE
- Collects information from MPI calls and user-defined states through embedded calls (a sketch follows below)
- Includes a tool to view all log files (Jumpshot)
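A sketch of the embedded instrumentation on a traditional cluster, following the MPE logging API that ships with MPICH (the state name, color, and log-file name are arbitrary):

/* Paired MPE events delimit a user-defined state that Jumpshot can draw. */
#include <mpi.h>
#include <mpe.h>

int main(int argc, char *argv[])
{
    MPI_Init(&argc, &argv);
    MPE_Init_log();

    int ev_begin = MPE_Log_get_event_number();
    int ev_end   = MPE_Log_get_event_number();
    MPE_Describe_state(ev_begin, ev_end, "compute", "red");

    MPE_Log_event(ev_begin, 0, NULL);  /* state begins */
    /* ... computation to be profiled ... */
    MPE_Log_event(ev_end, 0, NULL);    /* state ends   */

    MPE_Finish_log("app_log");         /* writes the log for Jumpshot */
    MPI_Finalize();
    return 0;
}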
Goals Of This Work
- Implement a hardware profiler capable of extracting the same data as the MPE
- Make it less intrusive
- Make it compatible with the API used by MPE
- Make it compatible with Jumpshot
Tracers
The Profiler interacts with the computation elements through tracers that register important events.
The TMD-MPE requires two tracers (send and receive) due to its parallel nature.
[Diagram: a computation tracer on each processor (PPC) and each hardware engine, and separate send and receive tracers on each TMD-MPE]
Tracers - Hardware Engine Computation
[Diagram: the hardware-engine tracer: one event register (R0) and the cycle counter feeding a MUX over 32-bit datapaths]
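Whatever element is being traced, each captured event reduces to a few 32-bit words. A host-side view of one trace record might look like this; the exact field layout is an assumption, only the 32-bit widths follow the diagram:

/* Hypothetical host-side layout of one tracer record. */
#include <stdint.h>

struct trace_event {
    uint32_t cycle;    /* timestamp latched from the cycle counter    */
    uint32_t event_id; /* which event or state transition occurred    */
    uint32_t data;     /* optional payload, e.g. message size or rank */
};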
Tracers - TMD-MPE
[Diagram: the TMD-MPE tracer: five event registers (R0-R4) and an MPE data register, multiplexed with the cycle counter over 32-bit datapaths]
Tracers – Processors' Computation
[Diagram: the PowerPC/MicroBlaze tracer: a 9 x 32-bit register bank with a stack for MPI call states and a 5 x 32-bit register bank with a stack for user-defined states, multiplexed with the cycle counter over 32-bit datapaths]
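The stacks are there because MPI calls and user-defined states can nest, so an end event must be paired with the most recent unmatched begin. A minimal host-side model of that bookkeeping (the names and the logging hook are illustrative, not the hardware interface):

#include <stdint.h>

#define MAX_DEPTH 16

static uint32_t state_stack[MAX_DEPTH];
static int top;

/* Entering a state pushes its id; the begin event is logged here. */
void state_begin(uint32_t id, uint32_t cycle)
{
    state_stack[top++] = id;
    /* log (cycle, id, BEGIN) to the tracer */
    (void)cycle;
}

/* Leaving a state pops the innermost id and logs the matching end. */
void state_end(uint32_t cycle)
{
    uint32_t id = state_stack[--top];
    /* log (cycle, id, END) */
    (void)id;
    (void)cycle;
}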
Profiler’s Network
[Diagram: the tracers on each User FPGA feed a Gather block; the Gather blocks forward events to the Collector on the Control FPGA, which stores them in DDR]
Synchronization
- Within the same board: release the reset of the cycle counters simultaneously
- Between boards: periodic exchange of messages between the root board and all other boards (see the sketch below)
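A standard way to relate two free-running counters from one message round trip is to assume a symmetric link. A sketch of that arithmetic; the exact exchange used by the profiler may differ:

#include <stdint.h>

/* t0: root's counter when the probe leaves
 * t1: remote board's counter when the probe arrives
 * t2: root's counter when the reply returns          */
int64_t estimate_offset(uint64_t t0, uint64_t t1, uint64_t t2)
{
    uint64_t rtt = t2 - t0;  /* round trip in cycles */
    /* With a symmetric link the probe arrived at t0 + rtt/2 on the
     * root's clock, when the remote counter read t1.              */
    return (int64_t)t1 - (int64_t)(t0 + rtt / 2);
}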
Profiler's Flow
Collect Data → Dump to Host → Convert to CLOG2 → Convert to SLOG2 → Visualize with Jumpshot
The conversion and visualization steps run after execution: the back end produces the log data, and the front end converts and displays it.
Case Studies
- Barrier: sequential vs binary tree
- TMD-MPE: making the unexpected message queue addressable by rank
- The Heat Equation: blocking calls vs non-blocking calls
- LINPACK Benchmark: a 16-node system calculating an LU decomposition of a matrix
Barrier
Synchronization call: no node will advance until all nodes have reached the barrier.
[Diagram: eight nodes (0-7) arranged as a binary tree rooted at node 0 vs a flat sequence 0-7]
[Jumpshot timelines: Send and Receive states for the barrier implemented sequentially vs as a binary tree]
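The binary tree replaces the root's O(n) sequential exchange with O(log n) rounds. A generic sketch of a tree barrier built from point-to-point calls (not necessarily the TMD's exact implementation):

#include <mpi.h>

/* Gather "arrived" messages up the tree, then release back down it. */
void tree_barrier(MPI_Comm comm)
{
    int rank, size, token = 0;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int parent = (rank - 1) / 2;
    int left   = 2 * rank + 1;
    int right  = 2 * rank + 2;

    /* Gather phase: wait for both children, then notify the parent. */
    if (left < size)
        MPI_Recv(&token, 1, MPI_INT, left, 0, comm, MPI_STATUS_IGNORE);
    if (right < size)
        MPI_Recv(&token, 1, MPI_INT, right, 0, comm, MPI_STATUS_IGNORE);
    if (rank != 0)
        MPI_Send(&token, 1, MPI_INT, parent, 0, comm);

    /* Release phase: wait for the parent, then release the children. */
    if (rank != 0)
        MPI_Recv(&token, 1, MPI_INT, parent, 1, comm, MPI_STATUS_IGNORE);
    if (left < size)
        MPI_Send(&token, 1, MPI_INT, left, 1, comm);
    if (right < size)
        MPI_Send(&token, 1, MPI_INT, right, 1, comm);
}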
TMD-MPE – Unexpected Messages Queue
All requests to send that arrive at a node before it issues an MPI_Recv are kept in this queue.
[Jumpshot timelines: Send, Receive, and Queue Search and Reorganization states, before and after making the queue addressable by rank]
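Making the queue addressable by rank means a receive searches only the list for its source rank instead of walking one global queue, which is what shortens the Queue Search and Reorganization state. A sketch of the lookup (types, sizes, and names are illustrative):

#include <stddef.h>
#include <stdint.h>

#define NUM_RANKS 16

struct pending_req {
    uint32_t tag;              /* tag carried by the request to send   */
    uint32_t size;             /* message size announced by the sender */
    struct pending_req *next;
};

/* One queue head per source rank instead of a single global queue. */
static struct pending_req *unexpected[NUM_RANKS];

struct pending_req *find_unexpected(int src_rank, uint32_t tag)
{
    struct pending_req *p;
    for (p = unexpected[src_rank]; p != NULL; p = p->next)
        if (p->tag == tag)
            return p;  /* a matching request already arrived */
    return NULL;       /* none yet: the receive must wait    */
}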
The Heat Equation Application
Partial differential equation that describes the temperature change over time
$$v_{i,j} = \frac{u_{i-1,j} + u_{i+1,j} + u_{i,j-1} + u_{i,j+1}}{4}, \qquad \sum_{i,j} \left( u_{i,j} - v_{i,j} \right)^2$$
[Jumpshot timelines: Send, Receive, and Computation states for the blocking and non-blocking versions of the heat equation application]
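With non-blocking calls the boundary exchange can overlap the interior update, which is the difference the two timelines show. A sketch of the non-blocking variant for a 1-D row decomposition (buffer names are illustrative; edge ranks would pass MPI_PROC_NULL as a neighbor):

#include <mpi.h>

/* Exchange halo rows with the neighbors above and below while the
 * interior points, which need no remote data, are being updated. */
void exchange_halos(double *top_halo, double *bot_halo,
                    double *top_row, double *bot_row,
                    int ncols, int up, int down, MPI_Comm comm)
{
    MPI_Request req[4];

    MPI_Irecv(top_halo, ncols, MPI_DOUBLE, up,   0, comm, &req[0]);
    MPI_Irecv(bot_halo, ncols, MPI_DOUBLE, down, 0, comm, &req[1]);
    MPI_Isend(top_row,  ncols, MPI_DOUBLE, up,   0, comm, &req[2]);
    MPI_Isend(bot_row,  ncols, MPI_DOUBLE, down, 0, comm, &req[3]);

    /* ... update interior points here, overlapping the transfers ... */

    MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
    /* ... then update the boundary rows using the received halos ... */
}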
The LINPACK Benchmark
Solves a system of linear equations
LU factorization with partial pivoting: PA = LU
[Diagram: matrix columns 0, 1, 2, ..., n-1 assigned cyclically to Rank 0, Rank 1, Rank 2, ...]
[Jumpshot timelines: Send, Receive, and Computation states for the LINPACK benchmark]
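With a column-cyclic layout, the pivot column of elimination step k is owned by rank k mod size and must reach every rank before the trailing update. A simplified sketch of one step (the benchmark's actual communication pattern may differ):

#include <mpi.h>

/* One elimination step: the owner shares its pivot column, then every
 * rank updates the trailing columns it owns.                          */
void lu_step(double *pivot_col, int nrows, int step, MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    int owner = step % size;  /* column-cyclic distribution */

    MPI_Bcast(pivot_col, nrows, MPI_DOUBLE, owner, comm);

    /* ... scale and update the locally owned trailing columns ... */
    (void)rank;
}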
Profiler’s Overhead
Block LUTs Flip-Flops BRAMsCollector 3856 (5%) 1279 (1%) 0 (0%)
Gather 187 (0%) 53 (0%) 0 (0%)
Engine Computation Tracer
396 (0%) 701 (1%) 0 (0%)
TMD-MPE Tracer 526 (0%) 1000 (1%) 0 (0%)
Processors Computation Tracer
without MPE1196 (1%) 1521 (2%) 0 (0%)
Processors Computation Tracer
with MPE
855 (1%) 1200 (1%) 0 (0%)
Conclusions
- All major features of the MPE were implemented
- The profiler was successfully used to study the behavior of the applications
- Less intrusive
- More events available to profile
- Can profile network components
- Compatible with existing profiling software environments
Future Work
- Reduce the footprint of the profiler's hardware blocks
- Profile the MicroBlaze and PowerPC in a non-intrusive way
- Allow real-time profiling
Thank You (Questions?)
The TMD (2)
[Diagram: an off-chip communications node (a PPC with TMD-MPE, connected through FSLs to an InterChip block and a XAUI interface) and computation nodes (hardware engines with TMD-MPE and a network interface), all attached to the on-chip network]
Profiler (2)
[Engine Profiler Architecture: the TMD-MPE's RX and TX tracers and the engine's computation tracer, each timestamped by the cycle counter, send their events to the Gather block]
[Processor Profiler Architecture: the PPC logs over the PLB and DCR through a DCR2FSL bridge and GPIO; together with the TMD-MPE's RX and TX tracers, the events go to the Gather block]
Profiler (1)
[Diagram: on each User FPGA (1-4) the PPC and MicroBlaze (μB) tracers feed a Gather block; events travel over inter-chip links (IC) and a switch to the Collector on the Control FPGA, which writes them to DDR; boards 0 through N connect through XAUI, and cycle counters on each FPGA timestamp the events over the on-chip network]
Hardware Profiling Benefits
- Less intrusive
- More events available to profile
- Can profile network components
- Compatible with existing profiling software environments
MPE Protocol
[Packet format: a header word with Ctrl bit = 1 carrying the Message Size (NDW), Opcode, and Src/Dest Rank; followed by a Tag word with Ctrl bit = 0 and NDW data words, Data-word (0) through Data-word (NDW-1), each with Ctrl bit = 0]
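A host-side sketch of assembling the header word. Only the field set (Ctrl bit, Message Size in data words, Opcode, Src/Dest Rank) follows the protocol; the bit positions and widths below are assumptions, since they are not fully recoverable from the slide:

#include <stdint.h>

/* Hypothetical packing of the MPE header word (field widths assumed). */
static uint32_t mpe_header(uint32_t ndw, uint32_t opcode, uint32_t rank)
{
    uint32_t word = 0;
    word |= 1u << 31;              /* Ctrl bit = 1 marks the header     */
    word |= (ndw & 0x1FFu) << 22;  /* Message Size (NDW), width assumed */
    word |= (opcode & 0x3u) << 20; /* Opcode, width assumed             */
    word |= (rank & 0xFFFFFu);     /* Src/Dest Rank, width assumed      */
    return word;
}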