12/4/03 1
Research Capabilities of the HCS Lab at Univ. of Florida
Reconfigurable Computing, High-Performance Interconnects, and Simulation Activities
Honeywell Briefing, December 4, 2003
12/4/03 2
Outline
• Lab Overview
• Reconfigurable Computing (RC): Motivation; CARMA Framework
• High-Performance Interconnects: Overview; Experimentation; Simulation
• Simulation: Overview; Fast and Accurate Simulation Environment (FASE); Library for Integrated Optical Networking (LION); RC Modeling and Simulation
• Summary
• Potential Collaboration
12/4/03 3
Primary Research Areas
• high-performance computer networks
• high-performance computer architectures
• reconfigurable and fault-tolerant computing
• parallel and distributed computing
Research Methods
• modeling and simulation
• experimental testbed research
• software design & development
• hardware design & development
[Figure: Research Areas and Methods — architectures, networks, system software, and algorithms studied via modeling and simulation and experimental testbeds]
12/4/03 4
CARRIER: A High-Performance Computing System for Cluster/Grid Architecture, Network, and System Research
Cluster-based laboratory grid
• PCs organized in 11 clusters
• 480 Pentium-compatible CPUs
• 308 nodes (Opteron, Xeon, PIII, etc.)
• 102 GB main memory
• 5.2 TB storage
• PCI-X and PCI64/66
• Five reconfig. computing nodes
High-speed networking testbeds
• 5.3 Gb/s Scalable Coherent Interface
• 10 Gb/s InfiniBand (4X)
• 1.28 Gb/s Myrinet, 3.2 Gb/s QsNet
• 1.25 Gb/s Cluster LAN (cLAN)
• 1.0 Gb/s Gigabit Ethernet
Key Facilities – CARRIER
12/4/03 5
Facilities – Distributed Infrastructure
Laboratory grid spanning two sites
• 9 PC clusters in Gainesville, 2 in Tallahassee
• 1000BASE-SX GigE backbone at each lab site
• Abilene/POS between sites (OC-12 @ UF, OC-3 @ FAMU-FSU)
Coming soon:
• Florida Lambda Rail (FLR)
• Part of National Lambda Rail (NLR)
• WDM with 10 Gb/s per wavelength
12/4/03 6
Key Facilities – Workstation Cluster
AlphaServer cluster
• 16 processors (all 64-bit Alphas)
• 10 GB main memory
• PCI-X support (i.e. 6 PCI)
• Featured platform is 4p Marvel ES80
Sun workstation cluster
• 76 processors (mostly UltraSPARCs)
• 9.5 GB main memory
• PCI64/66 support (i.e. 4 PCI)
Networking testbeds
• 1.6 Gb/s Scalable Coherent Interface
• 1.28 Gb/s Myrinet
• 1.0 Gb/s Scalable Coherent Interface
• 1.0 Gb/s Fibre Channel
• 155 Mb/s ATM
12/4/03 7
Reconfigurable Computing
Overview of Research Activities
12/4/03 8
Current RC Motivation
Key missing pieces in RC clusters for HPC
• Dynamic RC fabric discovery and management
• Coherent multitasking, multi-user environment
• Robust job scheduling and management
• Fault tolerance and scalability
• Performance monitoring down into the RC fabric
• Automated application mapping into the management tool
The HCS Lab has proposed concepts and a framework to unify existing technologies and fill in the missing pieces: the Cluster-based Approach to Reconfigurable Management Architecture (CARMA)
12/4/03 9
CARMA Framework Overview
CARMA seeks to integrate:
• Graphical user interface
• COTS application mapper — candidates include Handel-C, Viva, Streams-C, CHAMPION, CoreFire, etc.
• Graph-based job description — Condensed Graphs, DAGMan, etc.
• Robust management tool — distributed, scalable job scheduling; checkpointing, rollback, and recovery for both host and RC units
• Multilevel monitoring service (GEMS) — clusters, networks, hosts, RC fabric; tradeoff issues down to the RC level
• Middleware API (adapt USC-ISI API) — multiple types of RC boards; multiple high-performance networks (SCI, Myrinet, GigE, InfiniBand, etc.)
[Figure: CARMA node architecture — user interface, applications, algorithm mapping, RC cluster management, performance monitoring, and middleware API layered above the COTS processor and RC fabric API on each RC node, with control and data networks to other nodes]
12/4/03 10
Applications
Test applications developed
• Block ciphers: DES, Blowfish
• Sonar Beamforming
• Hyperspectral Imaging (c/o LANL)
Future development
• Stream ciphers
• RSA
• RC/HPC benchmarks (c/o Honeywell TC and UCSD)
• Cryptanalysis benchmarks (c/o DoD)
12/4/03 11
Application Mapper
Evaluating three application mappers on the basis of: ease of use, performance, hardware device independence, programming model, parallelization support, resource targeting, network support, and stand-alone mapping
• Celoxica SDK (Handel-C) — provides access to in-house boards: ADM-XRC (x1) and Tarari (x4)
• Star-Bridge Systems Viva — provides the best option for hardware independence
• Annapolis Micro Systems CoreFire — provides access to the AFRL-IFTC 48-node cluster
Xilinx ISE compulsory; evaluating JBits for partial RTR
12/4/03 12
Job Scheduler (JS)
Prototyping effort underway (forecasting)
• Completed first version of JS (coded Q3 of 2003, now under test): single node; task-based execution using Directed Acyclic Graphs (DAGs); separate processes and message queues for fault tolerance
• Second version of JS (Q2 of ’04 completion): multi-node; distributed job migration; checkpoint and rollback; links to Configuration Manager and GEMS
• External extensions to traditional tools (interoperability): expand upon GWU/GMU work (Dr. El-Ghazawi’s group); other COTS job schedulers under consideration
Striving for a “plug and play” approach to JS within CARMA
c/o GWU/GMU
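Task-based execution over a DAG, as in the first JS version, can be illustrated with a minimal topological-order executor (Kahn's algorithm). This is only a sketch of the general technique; the function and parameter names are invented here and it does not reflect the CARMA scheduler's actual code, processes, or message queues.

```python
from collections import deque

def run_dag(tasks, deps):
    """Execute tasks in dependency order (Kahn's algorithm).

    tasks: dict mapping task name -> callable
    deps:  dict mapping task name -> list of prerequisite task names
    Returns the order in which tasks ran; raises on a cyclic graph.
    """
    indeg = {t: 0 for t in tasks}
    children = {t: [] for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            indeg[t] += 1
            children[p].append(t)
    ready = deque(t for t, d in indeg.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                      # run the task itself
        order.append(t)
        for c in children[t]:           # unlock dependents
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    if len(order) != len(tasks):
        raise ValueError("cycle detected in task graph")
    return order
```

A real scheduler would dispatch ready tasks to worker processes instead of calling them inline, which is where the fault-tolerance message queues come in.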
12/4/03 13
Configuration Manager
Configuration Manager
CM (Configuration Manager)
• Application interface to the RC board
• Handles configuration caching and defragmentation
• Passes configuration information to other CMs via TCP or SCI
CMUI (CM User Interface)
• Allows user input to configure the CM
• Used for testing purposes
Communication Module (Com)
• Used to transfer configuration files between remote CMs
c/o UW
[Figure: Configuration Manager block diagram — CMUI, CM, and Com modules with a network compiler message queue, network node registry, and config file registry; the execution manager and a CM stub reach the RC fabric & processor on a remote node over the networks and boards; FPGA defragmentation and configuration transpose shown]
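The CM's configuration caching can be sketched as a size-bounded LRU cache of bitstreams, with a miss signalling that the configuration must be fetched from a remote CM. This is a toy illustration under assumed names and an assumed LRU policy, not the actual CM design.

```python
from collections import OrderedDict

class ConfigCache:
    """Toy LRU cache for FPGA configuration bitstreams (illustrative only;
    the class name, byte-capacity bound, and LRU policy are assumptions)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()    # config name -> bitstream bytes

    def get(self, name):
        if name not in self.entries:
            return None                 # miss: caller fetches from a remote CM
        self.entries.move_to_end(name)  # mark most recently used
        return self.entries[name]

    def put(self, name, bitstream):
        if name in self.entries:
            self.used -= len(self.entries.pop(name))
        # evict least recently used configs until the new one fits
        while self.entries and self.used + len(bitstream) > self.capacity:
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)
        self.entries[name] = bitstream
        self.used += len(bitstream)
```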
12/4/03 14
Management Schemes
[Figure: four management schemes. Master-Worker (MW): jobs submitted “centrally”; an application mapper and global job scheduler (GJS) feed a global resource manager (GRMAN) that has a global view of the system at all times, dispatching tasks/states over the network to local resource monitors (LRMON) on each local system and collecting results/statistics. Client-Server (CS): jobs submitted locally to local app mappers, job schedulers (LJS), and resource managers (LRMAN); the server houses configurations, exchanging tasks/configurations and requests/statistics. Client-Broker (CB): jobs submitted locally; the server brokers configuration pointers. Peer-to-Peer (PP): jobs submitted locally; configurations exchanged directly between nodes via requests.]
MW has been built, CS is underway, CB and PP by Q2 of 2004
A multilevel approach is anticipated for large numbers of nodes, with different schemes at different levels
12/4/03 15
Monitoring Service Options
Custom agent per Functional Unit (FU)
• Provides customized information per FU
• Heavily burdens user (unless automated)
• Requires additional FPGA area
Centralized agent per FPGA
• Possibly reduces area overhead
• Reduces data storage and communication
• Limits scalability
Information storage and response
• Store on chip or on board
• Periodically send, or respond when queried
Key parameters to monitor – further study
• Custom parameters per algorithm
• Requires an all-encompassing interface
• Automation needed for usability
• Monitoring is overhead, so use sparingly!
[Figure: CARMA node architecture, as on slide 9]
* GEMS is the gossip-enabled monitoring service developed by the HCS Lab for robust, scalable, multilevel monitoring of resource health and performance; for more info. see http://www.hcs.ufl.edu/prj/ftgroup/teamHome.php.
12/4/03 16
Monitoring Service Parameters
Processor
• CPU utilization
• Memory utilization
Network
• Bandwidth
• Latency
• Congestion, loss rate
RC fabric performance
• Area utilization
• On-chip memory utilization
• On-chip processor utilization
• On-board memory utilization
• On-board bus throughput
Functional unit performance
• Pipeline utilization
• Active pipeline stage
• Most recent output value
• User-defined status
RC fabric health
• Configuration scrubbing
• Device input current
• Temperature
• Fan speed
Cross-device monitoring
• Automation in design is key
• GEMS* to be extended
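GEMS is gossip-enabled, so the flavor of its dissemination can be conveyed with a generic push-gossip round over per-node liveness views. This is a textbook sketch of push gossip under invented names, not the GEMS protocol itself (GEMS adds multilevel grouping, consensus on failures, and performance data).

```python
import random

def gossip_round(views, fanout=2):
    """One synchronous round of push gossip over node liveness views.

    views: list where views[i] is dict {node_id: newest heartbeat seen}.
    Each node bumps its own heartbeat, then pushes its view to `fanout`
    random peers, which keep the freshest heartbeat per node.
    """
    n = len(views)
    for i in range(n):
        views[i][i] = views[i].get(i, 0) + 1            # local heartbeat
    for i in range(n):
        peers = random.sample([j for j in range(n) if j != i], fanout)
        for peer in peers:
            for node, hb in views[i].items():
                if hb > views[peer].get(node, -1):
                    views[peer][node] = hb              # merge freshest info
```

A node whose heartbeat stops advancing in everyone's view is suspected failed; gossip keeps the per-round traffic constant per node, which is why it scales.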
12/4/03 17
Node Architecture Tradeoffs
Intra-FPGA
• On-chip processor option
• Multitasking of resources
• Configuration caching and reuse
• Resource monitoring
Inter-FPGA
• Arbitration and control
• On-chip processor controlling other FPGAs’ functional units
Node
• System main CPU in or out of the loop
• Process to functional unit interaction
• Memory hierarchy
• 1–5 boards per node
• 1–3 FPGAs per board
• FPGA as NIC accelerator
[Figure: node with main CPU(s), memory hierarchy, PCI bridge, NIC, and RC board(s)]
12/4/03 18
Future RC Work
Continue development and evaluation of CARMA
• Expanded features
• Support for multiple boards, nodes, networks, etc.
• Functionality and performance evaluation and optimization
Collect and expand RC/HPC performance benchmarks
• Honeywell Technology Center, UCSD, MIT
Investigate/develop RC system programming model
Continue RC cluster simulation
• Extend previous analytic modeling work (arch. and software)
• Forecast architecture limitations
• Forecast software/management limitations
• Develop interfaces for network-attached RC devices
• Determine key design tradeoffs
Consider RC security challenges (e.g. FPGA viruses)
12/4/03 19
High-Performance Interconnects
Overview of Research Activities
12/4/03 20
Interconnect Overview
Numerous high-performance network testbeds
• SCI, InfiniBand, Myrinet, QsNet, and Gigabit Ethernet
• Mid- and low-level performance analysis
• Cluster computing production environment
• Analytical modeling to determine scalability
Numerous simulation testbeds developed
• TCP, Ethernet, InfiniBand, SCI, RapidIO, and optical
• System-level performance issues
• Simulation of cluster and grid computing environments
• Network enhancement prototyping
12/4/03 21
Network Performance Tests
Detailed understanding of high-performance cluster interconnects
• Identifies suitable networks for UPC over clusters
• Aids in smooth integration of interconnects with upper-layer UPC components
• Enables optimization of network communication, unicast and collective
Various levels of network performance analysis
Low-level tests
• InfiniBand based on the Virtual Interface Provider Library (VIPL)
• SCI based on Dolphin SISCI and SCALI SCI
• Myrinet based on Myricom GM
• QsNet based on the Quadrics Elan Communication Library
• Host architecture issues (e.g. CPU, I/O, etc.)
Mid-level tests
• Sockets: Dolphin SCI Sockets on SCI; BSD Sockets on Gigabit and 10 Gigabit Ethernet; GM Sockets on Myrinet; SOVIA on InfiniBand
• MPI: InfiniBand and Myrinet based on MPI/PRO; SCI based on ScaMPI and SCI-MPICH
[Figure: test stack — intermediate layers above a network layer of SCI, Myrinet, InfiniBand, QsNet, PCI Express, and 1/10 Gigabit Ethernet]
12/4/03 22
Gigabit Ethernet Performance Tests
Low-cost and widely deployed
Standardized TCP communication protocol
• High-overhead, low-performance
One-way latency tests
• TCP: Netpipe
• MPI over TCP: MPICH and MPI/Pro
• GASNet over MPI/TCP: MPICH
Results
• ~60 µsec one-way latency for TCP communication
• Commercial MPI implementation (MPI/Pro) performs better for large messages
• ~100 µsec GASNet overhead for small messages; each send/receive includes an RPC
• For large messages, GASNet overhead is negligible
Promising light-weight communication protocol
• M-VIA (National Energy Research Center): Virtual Interface Architecture for Linux
Mesh topology for Gigabit Ethernet cluster (Jefferson National Lab)
• Direct or indirect, scalable system architecture
• High performance/cost ratio
• Suitable platform for large-scale UPC applications
[Figure: one-way latency (µsec, 0–1200) vs. message size (0 B–32 KB) for TCP, MPI/Pro over TCP, MPICH over TCP, and GASNet/MPICH/TCP; GASNet overhead includes the RPC associated with AM]
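The one-way latency numbers above come from ping-pong style tests; a minimal client-side sketch is below, assuming an echo server at the given address (the function name and parameters are illustrative, not Netpipe's actual harness). Half the measured round-trip time approximates the one-way latency.

```python
import socket, time

def pingpong_latency(host, port, msg_size=4, iters=1000):
    """Estimate one-way TCP latency in µsec via a ping-pong loop.

    Assumes an echo server is listening at (host, port); the round-trip
    time is divided by two for the one-way figure.
    """
    payload = b"x" * msg_size
    with socket.create_connection((host, port)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle batching
        start = time.perf_counter()
        for _ in range(iters):
            s.sendall(payload)
            got = 0
            while got < msg_size:                 # reassemble the echo
                got += len(s.recv(msg_size - got))
        elapsed = time.perf_counter() - start
    return elapsed / iters / 2 * 1e6              # one-way latency in µsec
```

Disabling Nagle's algorithm matters here: otherwise small sends are batched and the small-message latency is wildly overstated.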
12/4/03 23
On-going SAN Performance Tests
Quadrics
• Features: fat-tree topology; work offloading from the host CPU for communication events; low latency and high bandwidth; performance increase with NICs
• Potential: existing GASNet implementation; existing MPI 1.2 implementation
• Initial results: bandwidth from Elan to main memory ~350 MB/sec; latency of 2.4 µs
• Future plans: low-level network performance tests
InfiniBand
• Features: low latency and high bandwidth; based on VIPL; industry standard; reliability
• Potential: existing GASNet conduit
• Future plans: low-level network performance tests; performance evaluation of GASNet on InfiniBand; UPC benchmarking on InfiniBand-based clusters
[Figures: Quadrics MPI ping test — latency (µsec, 0–1000) and bandwidth (MB/sec, up to ~350) vs. message size in bytes]
12/4/03 24
InfiniBand Architecture Layers
Introduction to InfiniBand
• High-speed networking technology for interconnecting processor and I/O nodes in a system-area network
• The architecture is independent of host operating system and processor platform
Host Channel Adapter (HCA)
• Resides in a host processor node and connects the host to the IBA fabric
• Functions as the interface to consumer processes (the OS’s message and data service)
• The interface between an HCA and consumer processes is a set of semantics, called IBA verbs, that describes the functions necessary for configuring, managing, and operating an HCA
• Verbs are a description of API parameters; examples: OpenHCA, ModifyQP, ResizeCQ, etc.
Target Channel Adapter (TCA)
• Resides in an I/O node and connects it to the IBA fabric
• Functions as the interface to an I/O device
• Includes an I/O controller specific to its particular I/O device protocol, such as SCSI, FC, or Ethernet
• TCAs are tailor-made HCAs for specific I/O media (without the verbs layer)
12/4/03 25
InfiniBand Model — Modeling & Experiments
• Modeling HCA, switches/routers, links, traffic sources, and queue analyzers
• Functional testing of IBA components
• Test and analyze performance of various queueing algorithms
• Validation testing using a 10 Gb/s InfiniBand experimental testbed based on the Mellanox chipset
12/4/03 26
TCP Model
MLD TCP Model
• Increases accuracy of network models (e.g. implement over IP/Ethernet network)
• Enables comparison of TCP with other network models (e.g. Myrinet, InfiniBand)
• Provides structure to analyze the efficiency of performance-enhancement strategies over TCP networks
• Can be reconfigured and extended to simulate new TCP algorithms and implementations
• High-level model allows time-efficient simulations while maintaining necessary detail
Status
• Generic TCP building blocks provided by MLD
• TCP over Ethernet and the MPI interface have been implemented
• Configuration of TCP library parameters and creation of additional TCP components, to increase the accuracy and breadth of the library, are ongoing
• New FASE models can immediately incorporate and take advantage of TCP
• Experimentation and examination of results to come
12/4/03 27
SCI Model
• 1D, 2D, 3D topologies supported
• Each link controller is assigned a unique ID number
• A data structure assigns each SCI packet a link controller ID for routing purposes
Speed enhancements
• Removed idle packet (used for flow control)
• Removed unnecessary components
Next phase
• Collective communication support
• Comparison tests between the FASE SCI model and real SCI hardware
[Figure: 16-node 2D SCI torus; each node has 2 link controllers with a B-Link/PCI/host interface to the SCI/PCI adapter]
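Routing on a 2D torus like the one modeled here is commonly done dimension by dimension, taking the shorter way around each ring. The sketch below illustrates that general technique only; it is not the HCS model's actual routing data structure, and the row-major node numbering is an assumption.

```python
def torus_route(src, dst, dim):
    """Dimension-order route between nodes of a dim x dim 2D torus.

    Nodes are numbered row-major (0 .. dim*dim-1). Returns the list of
    node IDs visited, correcting X first and then Y, taking the shorter
    direction around each ring.
    """
    def step(a, b):
        # shortest signed step from coordinate a to b on a ring of size dim
        d = (b - a) % dim
        return 1 if 0 < d <= dim // 2 else (-1 if d else 0)

    x, y = src % dim, src // dim
    dx, dy = dst % dim, dst // dim
    path = [src]
    while x != dx:                       # fix the X ring first
        x = (x + step(x, dx)) % dim
        path.append(y * dim + x)
    while y != dy:                       # then the Y ring
        y = (y + step(y, dy)) % dim
        path.append(y * dim + x)
    return path
```

For example, on a 4×4 torus the route from node 0 to node 3 wraps backwards in one hop rather than crossing three links forward.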
12/4/03 28
RapidIO Model
• Initially developed for simulation of interconnects for processing-in-memory (PIM) architectures
• Can be easily integrated with models of different types of embedded systems
• Supports the RapidIO Message Passing Logical Layer, Common Transport Layer, and Parallel Physical Layer
• Undergoing simulation speed enhancements: more efficient arbitration for resources
Next phase
• Addition of the RapidIO I/O Logical Layer (shared memory)
• Addition of the RapidIO Serial Physical Layer (if desired)
• Validation with real RapidIO hardware
12/4/03 29
RapidIO Model
[Figure: star-switched and array-switched PIM topologies built from DIMM chips]
12/4/03 30
Optical-Layer Component Models
Optical Modules (device — functional variations)
• Optical Fiber
• Optical Filter — band-pass, band-stop
• Power Amplifier
• Transmitter — fixed, tunable, discrete tunable
• Receiver — fixed, tunable, discrete tunable
• WDM DEMUX — 1×2, 1×4, 1×8
• WDM MUX — 2×1, 4×1, 8×1
• OADM — 1-, 2-, 4-channel
• Connector
• Splitter — 1×2, 1×4, 1×8
• Coupler — 2×1, 2×4, 2×8, 4×1, 4×4, 8×1, 8×8
• Optical Switch — 1×2, 1×4, 2×1, 2×2, 2×4, 4×2, 4×4
• Optical Interface Module — 80% pass / 20% drop
• TDM Components — 4-channel MUX/DEMUX
System Effects
• Cost
• Electrical power use
• Latency / throughput
• Power threshold
• Power budget
• BW limits
• Higher-order physical effects (future)
Networking Layers
• IEEE 802.3 switch (ARINC 664?)
• IP with QoS
• ARP cache
• Multicast / virtual links
• TCP
• UDP
• Numerous LRMs
• Application layer (DVI, etc.)
Optical components are easily replicated with different parameter settings to produce commercial product models (e.g. Genoa GT111 amplifier)
12/4/03 31
Simulation
Overview of Research Activities
12/4/03 32
Simulation Overview
Cluster and Grid Modeling
• Application profiling, tracing, and emulation
• Network modeling
• MPI implementation
• Speed vs. fidelity tradeoff
Optical Network Modeling
• Optical network prototyping
• Design trade study
RC Modeling
• RC-enhanced NIC
• Hardware monitoring trade study
12/4/03 33
Current Simulation Environment
[Figure: parallel program → Trace Generator (MPE) on a heterogeneous cluster → Script Generator → MLDesigner on a single computer → simulation results]
1. Input parallel program into the Trace Generator
2. Logfile from the Trace Generator is converted to readable text and fed to the Script Generator
3. Script files are read in by MLDesigner and each end-node performs its specified tasks
4. When all script files have been completed, the simulation is complete and statistics are reported
12/4/03 34
FASE Overview
FASE = Fast and Accurate Simulation Environment for Clusters and Grids
Trace tools
• Evaluated TAU, MPE, Paradyn, SvPablo, and VampirTrace for parallel applications
• Chose MPE for communication events since it provides sufficient detail and is free
• Records MPI commands and computation blocks in the logfile
• Static = a database of scripts can be made instead of traces
Script Generator
• Sifts through the logfile to categorize program execution into communication and computation
Performance statistics generator
• Generates statistics that characterize the behavior of a device while running a particular program
• Example statistics: cache misses, CPI, percentages of instruction types executed, instruction count, disk I/O
Models
• Processors — single and multiprocessors, processors in memory, reconfigurable
• Networks — Ethernet, Myrinet, SCI, InfiniBand, RapidIO, HyperTransport, SONET and other optical protocols, TCP/IP
MPI Interface
• Currently supports the most basic MPI functions: MPI_Send, MPI_Recv
• Future functions: MPI_Barrier, MPI_Bcast, MPI_Alltoall, MPI_Reduce
• Created for each network model in the library
Implementation — speed vs. fidelity tradeoff
• High-level modeling of end nodes and packet-level modeling of network components
• Rationale: the network is often the bottleneck of a system, so use high-fidelity network models where practical to account for characteristics such as congestion and flow control
End-nodes (lower fidelity)
• Simulate computation as a “pause” in the model for a variable amount of time, dictated by the trace tool or calculated from statistics gathered by the performance statistics generator
• Begin and end network communications
[Figure: main components in FASE — a script file and user-supplied parameters feed each end node (script reader/processor, MPI/UPC interface, computational unit); end nodes attach to the network model through MPI/UPC and device interfaces; the performance statistics generator supplies computation times]
12/4/03 35
Profiling/Tracing Background
• Understanding the execution and analyzing the performance of a parallel program is difficult
• Profiling and tracing are two useful methods of gathering data from execution
• The behavior of a parallel program can be seen by making events observable and then recording the events
• Events can be function calls (e.g. MPI routines), computation, or user-defined
• Visualization tools can later be used to view logged events graphically
Profiling
• Records statistics of program execution
• Uses events to keep track of performance metrics
Tracing
• Records a detailed, time-stamped log of events, including event attributes
• Log formats: ALOG, BLOG, CLOG, SLOG-1, SLOG-2, SDDF, many others
[Figure: visualization of a matrix multiply using Jumpshot-4]
12/4/03 36
Trace Tools and Script Generator
Evaluated trace tools
• Tuning and Analysis Utilities (University of Oregon, LANL, and Research Centre Jülich, ZAM, Germany), SvPablo (University of Illinois), and Paradyn (University of Wisconsin–Madison): average execution time and count of procedure calls and loops
• Multi-Processing Environment (distributed with MPICH): traces all MPI communications
• VampirTrace (Pallas): commercial tool that traces MPI and procedure calls
MPE
• Selected initially because of direct MPI support
• CLOG output, but the CLOG format is space-inefficient and not well documented
• SLOG-2 is well documented and scalable
• Conversion process: CLOG → SLOG-2 binary → SLOG-2 ASCII
Script Generator
• Sifts through the logfile to categorize program execution into communication and computation
• Input: ASCII version of the logfile; output: script files used in the simulation environment
• Different degrees of filtering to balance simulation fidelity vs. speed: fine-grained = higher fidelity, slow; coarse-grained = lower fidelity, fast
Implementation
• Records MPI commands and computation blocks in the logfile
• Static = a database of scripts can be made instead of traces
[Figure: MPE produces a CLOG logfile; slog2sdk converts it to binary and then ASCII SLOG-2; the Script Generator emits per-process scripts, e.g. for PID 0: MPI_Send(0,2,4), MPI_Send(0,1,4), MPI_Recv(1,0,8), MPI_Recv(2,0,8), MPI_Recv(1,0,512), MPI_Recv(2,0,512); similar scripts for PIDs 1–3]
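A script reader of the kind the end-node models use can be sketched as a small parser over lines like MPI_Send(0,2,4). The three-argument layout follows the example scripts on the slide, but the exact script grammar and these function names are assumptions, not FASE's implementation.

```python
import re

# Matches script lines of the form MPI_Send(a,b,n) / MPI_Recv(a,b,n);
# the three-argument layout mirrors the slide's example scripts.
LINE = re.compile(r"(MPI_Send|MPI_Recv)\((\d+),(\d+),(\d+)\)")

def parse_script(text):
    """Turn a per-process script into a list of (op, a, b, nbytes) events."""
    events = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            events.append((m.group(1), int(m.group(2)),
                           int(m.group(3)), int(m.group(4))))
    return events

def bytes_sent(events):
    """Total payload sent by this process — one simple derived statistic."""
    return sum(n for op, _, _, n in events if op == "MPI_Send")
```

In the simulator, each parsed event would either pause the end node (computation) or hand a packet to the network model (communication).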
12/4/03 37
MPI Collective Communication
• FASE currently supports only unicast communication (MPI_Send, MPI_Recv)
• Must support collective communications (MPI_Barrier, MPI_Alltoall, MPI_Reduce, MPI_Bcast, etc.) in order to simulate the execution of complex applications
• The algorithms shown in the figures are specific to FASE, since the MPI standard does not lay out exact algorithms
• The current implementation assumes that the entire MPI_COMM_WORLD is used as the group in the collective function calls
• These algorithms will be leveraged for UPC collective communications as well
MPI_Barrier (binary tree over nodes 0–5, two steps)
• The transactions can occur in any order; however, a parent node must receive a Barrier packet from both child nodes before sending a Barrier packet to its own parent
• After the root node receives Barrier packets from both its children, it sends Barrier packets back to its children, which then relay these packets to their children
MPI_Alltoall (nodes 0–2, three steps)
• The sending node passes an MPI_Alltoall packet to its parent node first and then to its children; the parent node then sends this packet to its parent and its other child; children send only to their children
MPI_Reduce (nodes 0–5, destination node shown)
• Parents receive MPI_Reduce packets from both children and perform the operation specified in the function call before sending a new MPI_Reduce packet to the parent node. If node 0 is not the destination, it collects a single MPI_Reduce packet from a child and sends its new packet to the other child node. If a node receives a packet from its parent and is not the destination, it waits to retrieve a child node’s MPI_Reduce packet and then sends its own MPI_Reduce packet to the other child node.
MPI_Bcast (nodes 0–5, source node shown)
• The source sends an MPI_Bcast packet first to its parent node and then to its children. Nodes that receive an MPI_Bcast packet from a child node must send to their parent first and then to the child that did not send the initial packet. Nodes receiving an MPI_Bcast packet from their parent node simply send the packet to both children.
[Transaction legend: 1st, 2nd, 3rd transaction]
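The two-phase tree barrier described above (Barrier packets flow up to the root, then a release flows back down) can be captured in a few lines. This sketch uses an assumed heap-style layout where the parent of node i is (i-1)//2; it mirrors the scheme in the figure rather than reproducing the FASE model code.

```python
def tree_barrier_messages(n):
    """Count messages in a two-phase tree barrier over nodes 0..n-1.

    Phase 1: every non-root node sends one Barrier packet up its tree
    edge (after hearing from its own children). Phase 2: the root's
    release is relayed back down, one message per edge. A tree on n
    nodes has n-1 edges, so the total is 2*(n-1) messages.
    """
    return 2 * (n - 1)

def barrier_order(n):
    """One legal transaction order: deepest nodes report up first,
    then releases propagate down from the root (parent of i is (i-1)//2)."""
    up = sorted(range(1, n), reverse=True)
    down = list(range(1, n))
    return ([("up", i, (i - 1) // 2) for i in up] +
            [("down", (i - 1) // 2, i) for i in down])
```

Counting messages this way is exactly what a low-fidelity end-node model needs: the collective's cost is a function of the tree shape, not of per-packet detail.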
12/4/03 38
Switched Aircraft LAN
[Figure: baseline switched LAN — a time-triggered (TT) backbone linking file servers (srv1, srv2), displays on an optical pixel bus (dp1–dp3), navigation (nav1–nav4), mission processing (mp1–mp4), sensors (sen1, sen2, sen_ir, sen_eo), communications (com1, com2, com_r1 to com_r4), and platform monitoring & protection (pm1–pm4, pm_gw1), each subsystem hosted on LRMs]
Key tradeoffs
• QoS thresholds and algorithms
• Latency / bandwidth
• Power analysis
• Cost analysis (baseline cost)
Key features
• Baseline system
• Application and traffic study
• Virtual links with QoS
• TCP/IP/Ethernet systems
• Fiber and copper links
12/4/03 39
Switchless Optical Pixel Bus — TDM System
Key tradeoffs
• Latency / bandwidth
• Optical power budget
• Electrical power analysis
• Cost analysis (baseline cost)
Key features
• 4 channels @ 2.5 Gbps (10 Gbps aggregate)
• Fixed optical components (cheaper)
12/4/03 40
Switchless Optical Pixel Bus — WDM System
Key tradeoffs
• Latency / bandwidth
• Optical power budget
• Electrical power analysis
• Cost analysis (baseline cost)
Key features
• 4 channels @ 10 Gbps (40 Gbps aggregate)
• 4 independent wavelengths (better security)
• More optical components (increased cost)
The TDM system provides a cost-effective solution if its bandwidth limitation is acceptable; the WDM system provides better bandwidth and security at increased cost in both dollars and power
12/4/03 41
Switchless Optical Aircraft LAN
Key tradeoffs
• WDM / TDM
• Compare to baseline architecture
• Latency / bandwidth
• Power analysis (mostly passive)
• Cost analysis
Key features
• Candidate system
• Unified bus
• Switchless
• Tunable wavelengths
• Optical switching
• Increased reliability
• Supports bandwidth growth
[Figure: candidate switchless LAN — the same subsystems as the baseline (file servers, displays on the optical pixel bus, navigation, mission processing, sensors, communications, platform monitoring & protection, each on LRMs) attached to a unified optical bus through gateways G1–G3, FE, and TT]
Gateway legend: G# = Gigabit Ethernet, FE = Fast Ethernet, TT = Time Triggered
12/4/03 42
RC Simulation
RC-enhanced NIC
• Based on the Intel IXP1200
• Explored dynamic RC trades
• Gigabit Ethernet packet processing
• Theatre Missile Defense System with Cooperative Engagement Capability as case study
RC hardware monitoring trade study
• Explored monitoring design space
• Examined performance impacts
• Trade study included processor, bus, and FPGA
12/4/03 43
Summary of HCS Lab Capabilities
• Basic and applied research in advanced computer architectures, networks, services, and systems for high-performance and reconfigurable computing and communications
• Focus on “high performance” in terms of execution time, throughput, latency, quality of service, dependability, etc.
• Simulative and experimental research to achieve both distinct and interdependent goals
• For simulation, leverage expertise in high-fidelity, CAD-based architecture and network performance modeling and analysis
• For experimentation, facilities include a cluster-based lab grid with 480 CPUs, multi-gigabit networking testbeds, reconfigurable and embedded computing testbeds, etc.
12/4/03 44
Potential Collaboration
New Millennium ST-8 System Model (Jeremy Ramos)
• System model development
• Fault insertion / recovery assessment
• System behavior analysis
Integrated Payload Middleware (Jeff Wolfe)
• Middleware alternatives trade study
• Validate integrated software designs
• Focus on fault detection and management, IPC, hardware abstraction, and OS abstraction for embedded systems
HRSC Development Trade Studies (Chris Butera)
• Architecture and interconnect trades for flight design
• Independent test and evaluation
Suggestions c/o Jeremy Ramos
12/4/03 45
Potential Collaboration
DARPA RWI (Michael Elias)
• Network protocol development
• Architecture trades
Payload Prototype Environment
• Emulation / simulation of components and systems
• Quick-turnaround application studies for proposal efforts
• Propose enhancements and perform trade studies
• Forecast future system needs and explore impacts
Other joint proposals?
• Star Bridge Systems, Inc. and other vendor collaboration?
• Other possibilities?