12/4/03 1
Research Capabilities of the HCS Lab at Univ. of Florida
Reconfigurable Computing, High-Performance Interconnects, and Simulation Activities
Honeywell Briefing, December 4, 2003
12/4/03 2
Outline
• Lab Overview
• Reconfigurable Computing (RC): Motivation; CARMA Framework
• High-Performance Interconnects: Overview; Experimentation; Simulation
• Simulation: Overview; Fast and Accurate Simulation Environment (FASE); Library for Integrated Optical Networking (LION); RC Modeling and Simulation
• Summary
• Potential Collaboration
12/4/03 3
Primary Research Areas
• high-performance computer networks
• high-performance computer architectures
• reconfigurable and fault-tolerant computing
• parallel and distributed computing
Research Methods
• modeling and simulation
• experimental testbed research
• software design & development
• hardware design & development
[Figure: Research Areas and Methods — architectures, networks, system software, and algorithms studied via modeling and simulation and experimental testbeds]
12/4/03 4
CARRIER: A High-Performance Computing System for Cluster/Grid Architecture, Network, and System Research
Cluster-based laboratory grid
• PCs organized in 11 clusters
• 480 Pentium-compatible CPUs
• 308 nodes (Opteron, Xeon, PIII, etc.)
• 102 GB main memory
• 5.2 TB storage
• PCI-X and PCI64/66
• Five reconfig. computing nodes
High-speed networking testbeds
• 5.3 Gb/s Scalable Coherent Interface
• 10 Gb/s InfiniBand (4X)
• 1.28 Gb/s Myrinet, 3.2 Gb/s QsNet
• 1.25 Gb/s Cluster LAN (cLAN)
• 1.0 Gb/s Gigabit Ethernet
Key Facilities – CARRIER
12/4/03 5
Facilities – Distributed Infrastructure
Laboratory grid spanning two sites
• 9 PC clusters in Gainesville, 2 in Tallahassee
• 1000BASE-SX GigE backbone at each lab site
• Abilene/POS between sites (OC-12 @ UF, OC-3 @ FAMU-FSU)
Coming soon:
• Florida Lambda Rail (FLR)
• Part of National Lambda Rail (NLR)
• WDM with 10 Gb/s per wavelength
12/4/03 6
Key Facilities – Workstation Cluster
AlphaServer cluster
• 16 processors (all 64-bit Alphas)
• 10 GB main memory
• PCI-X support (i.e. 6 PCI)
• Featured platform is 4p Marvel ES80
Sun workstation cluster
• 76 processors (mostly UltraSPARCs)
• 9.5 GB main memory
• PCI64/66 support (i.e. 4 PCI)
Networking testbeds
• 1.6 Gb/s Scalable Coherent Interface
• 1.28 Gb/s Myrinet
• 1.0 Gb/s Scalable Coherent Interface
• 1.0 Gb/s Fibre Channel
• 155 Mb/s ATM
12/4/03 7
Reconfigurable Computing
Overview of Research Activities
12/4/03 8
Current RC Motivation
Key missing pieces in RC clusters for HPC
• Dynamic RC fabric discovery and management
• Coherent multitasking, multi-user environment
• Robust job scheduling and management
• Fault tolerance and scalability
• Performance monitoring down into the RC fabric
• Automated application mapping into the management tool
The HCS Lab has proposed concepts and a framework to unify existing technologies and fill in the missing pieces: the Cluster-based Approach to Reconfigurable Management Architecture (CARMA)
12/4/03 9
CARMA Framework Overview
CARMA seeks to integrate:
• Graphical user interface
• COTS application mapper — candidates include Handel-C, Viva, Streams-C, CHAMPION, CoreFire, etc.
• Graph-based job description — Condensed Graphs, DAGMan, etc.
• Robust management tool — distributed, scalable job scheduling; checkpointing, rollback, and recovery for both host and RC units
• Multilevel monitoring service (GEMS) — clusters, networks, hosts, RC fabric; tradeoff issues down to the RC level
• Middleware API (adapt USC-ISI API) — multiple types of RC boards; multiple high-performance networks (SCI, Myrinet, GigE, InfiniBand, etc.)
[Figure: CARMA node architecture — user interface, applications, algorithm mapping, RC cluster management, performance monitoring, and middleware API layered above the COTS processor and RC fabric API on each RC node, with control and data networks to other nodes]
12/4/03 10
Applications
Test applications developed
• Block ciphers: DES, Blowfish
• Sonar Beamforming
• Hyperspectral Imaging (c/o LANL)
Future development
• Stream ciphers
• RSA
• RC/HPC benchmarks (c/o Honeywell TC and UCSD)
• Cryptanalysis benchmarks (c/o DoD)
12/4/03 11
Application Mapper
Evaluating three application mappers on the basis of: ease of use, performance, hardware device independence, programming model, parallelization support, resource targeting, network support, and stand-alone mapping
• Celoxica SDK (Handel-C) — provides access to in-house boards: ADM-XRC (x1) and Tarari (x4)
• Star-Bridge Systems Viva — provides the best option for hardware independence
• Annapolis Micro Systems CoreFire — provides access to the AFRL-IFTC 48-node cluster
Xilinx ISE compulsory; evaluating JBits for partial RTR
12/4/03 12
Job Scheduler (JS)
Prototyping effort underway (forecasting)
• Completed first version of JS (coded Q3 of 2003, now under test): single node; task-based execution using Directed Acyclic Graphs (DAGs); separate processes and message queues for fault tolerance
• Second version of JS (Q2 of ’04 completion): multi-node; distributed job migration; checkpoint and rollback; links to Configuration Manager and GEMS
• External extensions to traditional tools (interoperability): expand upon GWU/GMU work (Dr. El-Ghazawi’s group); other COTS job schedulers under consideration
Striving for a “plug and play” approach to JS within CARMA
c/o GWU/GMU
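Task-based execution over a DAG, as in the first JS version, can be illustrated with a minimal topological-order executor (Kahn's algorithm). This is only a sketch of the general technique; the function and parameter names are invented here and it does not reflect the CARMA scheduler's actual code, processes, or message queues.

```python
from collections import deque

def run_dag(tasks, deps):
    """Execute tasks in dependency order (Kahn's algorithm).

    tasks: dict mapping task name -> callable
    deps:  dict mapping task name -> list of prerequisite task names
    Returns the order in which tasks ran; raises on a cyclic graph.
    """
    indeg = {t: 0 for t in tasks}
    children = {t: [] for t in tasks}
    for t, prereqs in deps.items():
        for p in prereqs:
            indeg[t] += 1
            children[p].append(t)
    ready = deque(t for t, d in indeg.items() if d == 0)
    order = []
    while ready:
        t = ready.popleft()
        tasks[t]()                      # run the task itself
        order.append(t)
        for c in children[t]:           # unlock dependents
            indeg[c] -= 1
            if indeg[c] == 0:
                ready.append(c)
    if len(order) != len(tasks):
        raise ValueError("cycle detected in task graph")
    return order
```

A real scheduler would dispatch ready tasks to worker processes instead of calling them inline, which is where the fault-tolerance message queues come in.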
12/4/03 13
Configuration Manager
Configuration Manager
CM (Configuration Manager)
• Application interface to the RC board
• Handles configuration caching and defragmentation
• Passes configuration information to other CMs via TCP or SCI
CMUI (CM User Interface)
• Allows user input to configure the CM
• Used for testing purposes
Communication Module (Com)
• Used to transfer configuration files between remote CMs
c/o UW
[Figure: Configuration Manager block diagram — CMUI, CM, and Com modules with a network compiler message queue, network node registry, and config file registry; the execution manager and a CM stub reach the RC fabric & processor on a remote node over the networks and boards; FPGA defragmentation and configuration transpose shown]
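The CM's configuration caching can be sketched as a size-bounded LRU cache of bitstreams, with a miss signalling that the configuration must be fetched from a remote CM. This is a toy illustration under assumed names and an assumed LRU policy, not the actual CM design.

```python
from collections import OrderedDict

class ConfigCache:
    """Toy LRU cache for FPGA configuration bitstreams (illustrative only;
    the class name, byte-capacity bound, and LRU policy are assumptions)."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.used = 0
        self.entries = OrderedDict()    # config name -> bitstream bytes

    def get(self, name):
        if name not in self.entries:
            return None                 # miss: caller fetches from a remote CM
        self.entries.move_to_end(name)  # mark most recently used
        return self.entries[name]

    def put(self, name, bitstream):
        if name in self.entries:
            self.used -= len(self.entries.pop(name))
        # evict least recently used configs until the new one fits
        while self.entries and self.used + len(bitstream) > self.capacity:
            _, evicted = self.entries.popitem(last=False)
            self.used -= len(evicted)
        self.entries[name] = bitstream
        self.used += len(bitstream)
```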
12/4/03 14
Management Schemes
[Figure: four management schemes. Master-Worker (MW): jobs submitted “centrally”; an application mapper and global job scheduler (GJS) feed a global resource manager (GRMAN) that has a global view of the system at all times, dispatching tasks/states over the network to local resource monitors (LRMON) on each local system and collecting results/statistics. Client-Server (CS): jobs submitted locally to local app mappers, job schedulers (LJS), and resource managers (LRMAN); the server houses configurations, exchanging tasks/configurations and requests/statistics. Client-Broker (CB): jobs submitted locally; the server brokers configuration pointers. Peer-to-Peer (PP): jobs submitted locally; configurations exchanged directly between nodes via requests.]
MW has been built, CS is underway, CB and PP by Q2 of 2004
A multilevel approach is anticipated for large numbers of nodes, with different schemes at different levels
12/4/03 15
Monitoring Service Options
Custom agent per Functional Unit (FU)
• Provides customized information per FU
• Heavily burdens user (unless automated)
• Requires additional FPGA area
Centralized agent per FPGA
• Possibly reduces area overhead
• Reduces data storage and communication
• Limits scalability
Information storage and response
• Store on chip or on board
• Periodically send, or respond when queried
Key parameters to monitor – further study
• Custom parameters per algorithm
• Requires an all-encompassing interface
• Automation needed for usability
• Monitoring is overhead, so use sparingly!
[Figure: CARMA node architecture, as on slide 9]
* GEMS is the gossip-enabled monitoring service developed by the HCS Lab for robust, scalable, multilevel monitoring of resource health and performance; for more info. see http://www.hcs.ufl.edu/prj/ftgroup/teamHome.php.
12/4/03 16
Monitoring Service Parameters
Processor
• CPU utilization
• Memory utilization
Network
• Bandwidth
• Latency
• Congestion, loss rate
RC fabric performance
• Area utilization
• On-chip memory utilization
• On-chip processor utilization
• On-board memory utilization
• On-board bus throughput
Functional unit performance
• Pipeline utilization
• Active pipeline stage
• Most recent output value
• User-defined status
RC fabric health
• Configuration scrubbing
• Device input current
• Temperature
• Fan speed
Cross-device monitoring
• Automation in design is key
• GEMS* to be extended
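GEMS is gossip-enabled, so the flavor of its dissemination can be conveyed with a generic push-gossip round over per-node liveness views. This is a textbook sketch of push gossip under invented names, not the GEMS protocol itself (GEMS adds multilevel grouping, consensus on failures, and performance data).

```python
import random

def gossip_round(views, fanout=2):
    """One synchronous round of push gossip over node liveness views.

    views: list where views[i] is dict {node_id: newest heartbeat seen}.
    Each node bumps its own heartbeat, then pushes its view to `fanout`
    random peers, which keep the freshest heartbeat per node.
    """
    n = len(views)
    for i in range(n):
        views[i][i] = views[i].get(i, 0) + 1            # local heartbeat
    for i in range(n):
        peers = random.sample([j for j in range(n) if j != i], fanout)
        for peer in peers:
            for node, hb in views[i].items():
                if hb > views[peer].get(node, -1):
                    views[peer][node] = hb              # merge freshest info
```

A node whose heartbeat stops advancing in everyone's view is suspected failed; gossip keeps the per-round traffic constant per node, which is why it scales.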
12/4/03 17
Node Architecture Tradeoffs
Intra-FPGA
• On-chip processor option
• Multitasking of resources
• Configuration caching and reuse
• Resource monitoring
Inter-FPGA
• Arbitration and control
• On-chip processor controlling other FPGAs’ functional units
Node
• System main CPU in or out of the loop
• Process to functional unit interaction
• Memory hierarchy
• 1–5 boards per node
• 1–3 FPGAs per board
• FPGA as NIC accelerator
[Figure: node with main CPU(s), memory hierarchy, PCI bridge, NIC, and RC board(s)]
12/4/03 18
Future RC Work
Continue development and evaluation of CARMA
• Expanded features
• Support for multiple boards, nodes, networks, etc.
• Functionality and performance evaluation and optimization
Collect and expand RC/HPC performance benchmarks
• Honeywell Technology Center, UCSD, MIT
Investigate/develop RC system programming model
Continue RC cluster simulation
• Extend previous analytic modeling work (arch. and software)
• Forecast architecture limitations
• Forecast software/management limitations
• Develop interfaces for network-attached RC devices
• Determine key design tradeoffs
Consider RC security challenges (e.g. FPGA viruses)
12/4/03 19
High-Performance Interconnects
Overview of Research Activities
12/4/03 20
Interconnect Overview
Numerous high-performance network testbeds
• SCI, InfiniBand, Myrinet, QsNet, and Gigabit Ethernet
• Mid- and low-level performance analysis
• Cluster computing production environment
• Analytical modeling to determine scalability
Numerous simulation testbeds developed
• TCP, Ethernet, InfiniBand, SCI, RapidIO, and optical
• System-level performance issues
• Simulation of cluster and grid computing environments
• Network enhancement prototyping
12/4/03 21
Network Performance Tests
Detailed understanding of high-performance cluster interconnects
• Identifies suitable networks for UPC over clusters
• Aids in smooth integration of interconnects with upper-layer UPC components
• Enables optimization of network communication, unicast and collective
Various levels of network performance analysis
Low-level tests
• InfiniBand based on the Virtual Interface Provider Library (VIPL)
• SCI based on Dolphin SISCI and SCALI SCI
• Myrinet based on Myricom GM
• QsNet based on the Quadrics Elan Communication Library
• Host architecture issues (e.g. CPU, I/O, etc.)
Mid-level tests
• Sockets: Dolphin SCI Sockets on SCI; BSD Sockets on Gigabit and 10 Gigabit Ethernet; GM Sockets on Myrinet; SOVIA on InfiniBand
• MPI: InfiniBand and Myrinet based on MPI/PRO; SCI based on ScaMPI and SCI-MPICH
[Figure: test stack — intermediate layers above a network layer of SCI, Myrinet, InfiniBand, QsNet, PCI Express, and 1/10 Gigabit Ethernet]
12/4/03 22
Gigabit Ethernet Performance Tests
Low-cost and widely deployed
Standardized TCP communication protocol
• High-overhead, low-performance
One-way latency tests
• TCP: Netpipe
• MPI over TCP: MPICH and MPI/Pro
• GASNet over MPI/TCP: MPICH
Results
• ~60 µsec one-way latency for TCP communication
• Commercial MPI implementation (MPI/Pro) performs better for large messages
• ~100 µsec GASNet overhead for small messages; each send/receive includes an RPC
• For large messages, GASNet overhead is negligible
Promising light-weight communication protocol
• M-VIA (National Energy Research Center): Virtual Interface Architecture for Linux
Mesh topology for Gigabit Ethernet cluster (Jefferson National Lab)
• Direct or indirect, scalable system architecture
• High performance/cost ratio
• Suitable platform for large-scale UPC applications
[Figure: one-way latency (µsec, 0–1200) vs. message size (0 B–32 KB) for TCP, MPI/Pro over TCP, MPICH over TCP, and GASNet/MPICH/TCP; GASNet overhead includes the RPC associated with AM]
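The one-way latency numbers above come from ping-pong style tests; a minimal client-side sketch is below, assuming an echo server at the given address (the function name and parameters are illustrative, not Netpipe's actual harness). Half the measured round-trip time approximates the one-way latency.

```python
import socket, time

def pingpong_latency(host, port, msg_size=4, iters=1000):
    """Estimate one-way TCP latency in µsec via a ping-pong loop.

    Assumes an echo server is listening at (host, port); the round-trip
    time is divided by two for the one-way figure.
    """
    payload = b"x" * msg_size
    with socket.create_connection((host, port)) as s:
        s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)  # disable Nagle batching
        start = time.perf_counter()
        for _ in range(iters):
            s.sendall(payload)
            got = 0
            while got < msg_size:                 # reassemble the echo
                got += len(s.recv(msg_size - got))
        elapsed = time.perf_counter() - start
    return elapsed / iters / 2 * 1e6              # one-way latency in µsec
```

Disabling Nagle's algorithm matters here: otherwise small sends are batched and the small-message latency is wildly overstated.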
12/4/03 23
On-going SAN Performance Tests
Quadrics
• Features: fat-tree topology; work offloading from the host CPU for communication events; low latency and high bandwidth; performance increase with NICs
• Potential: existing GASNet implementation; existing MPI 1.2 implementation
• Initial results: bandwidth from Elan to main memory ~350 MB/sec; latency of 2.4 µs
• Future plans: low-level network performance tests
InfiniBand
• Features: low latency and high bandwidth; based on VIPL; industry standard; reliability
• Potential: existing GASNet conduit
• Future plans: low-level network performance tests; performance evaluation of GASNet on InfiniBand; UPC benchmarking on InfiniBand-based clusters
[Figures: Quadrics MPI ping test — latency (µsec, 0–1000) and bandwidth (MB/sec, up to ~350) vs. message size in bytes]
12/4/03 24
InfiniBand Architecture Layers
Introduction to InfiniBand
• High-speed networking technology for interconnecting processor and I/O nodes in a system-area network
• The architecture is independent of host operating system and processor platform
Host Channel Adapter (HCA)
• Resides in a host processor node and connects the host to the IBA fabric
• Functions as the interface to consumer processes (the OS’s message and data service)
• The interface between an HCA and consumer processes is a set of semantics, called IBA verbs, that describes the functions necessary for configuring, managing, and operating an HCA
• Verbs are a description of API parameters; examples: OpenHCA, ModifyQP, ResizeCQ, etc.
Target Channel Adapter (TCA)
• Resides in an I/O node and connects it to the IBA fabric
• Functions as the interface to an I/O device
• Includes an I/O controller specific to its particular I/O device protocol, such as SCSI, FC, or Ethernet
• TCAs are tailor-made HCAs for specific I/O media (without the verbs layer)
12/4/03 25
InfiniBand Model — Modeling & Experiments
• Modeling HCA, switches/routers, links, traffic sources, and queue analyzers
• Functional testing of IBA components
• Test and analyze performance of various queueing algorithms
• Validation testing using a 10 Gb/s InfiniBand experimental testbed based on the Mellanox chipset
12/4/03 26
TCP Model
MLD TCP Model
• Increases accuracy of network models (e.g. implement over IP/Ethernet network)
• Enables comparison of TCP with other network models (e.g. Myrinet, InfiniBand)
• Provides structure to analyze the efficiency of performance-enhancement strategies over TCP networks
• Can be reconfigured and extended to simulate new TCP algorithms and implementations
• High-level model allows time-efficient simulations while maintaining necessary detail
Status
• Generic TCP building blocks provided by MLD
• TCP over Ethernet and the MPI interface have been implemented
• Configuration of TCP library parameters and creation of additional TCP components, to increase the accuracy and breadth of the library, are ongoing
• New FASE models can immediately incorporate and take advantage of TCP
• Experimentation and examination of results to come
12/4/03 27
SCI Model
• 1D, 2D, 3D topologies supported
• Each link controller is assigned a unique ID number
• A data structure assigns each SCI packet a link controller ID for routing purposes
Speed enhancements
• Removed idle packet (used for flow control)
• Removed unnecessary components
Next phase
• Collective communication support
• Comparison tests between the FASE SCI model and real SCI hardware
[Figure: 16-node 2D SCI torus; each node has 2 link controllers with a B-Link/PCI/host interface to the SCI/PCI adapter]
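Routing on a 2D torus like the one modeled here is commonly done dimension by dimension, taking the shorter way around each ring. The sketch below illustrates that general technique only; it is not the HCS model's actual routing data structure, and the row-major node numbering is an assumption.

```python
def torus_route(src, dst, dim):
    """Dimension-order route between nodes of a dim x dim 2D torus.

    Nodes are numbered row-major (0 .. dim*dim-1). Returns the list of
    node IDs visited, correcting X first and then Y, taking the shorter
    direction around each ring.
    """
    def step(a, b):
        # shortest signed step from coordinate a to b on a ring of size dim
        d = (b - a) % dim
        return 1 if 0 < d <= dim // 2 else (-1 if d else 0)

    x, y = src % dim, src // dim
    dx, dy = dst % dim, dst // dim
    path = [src]
    while x != dx:                       # fix the X ring first
        x = (x + step(x, dx)) % dim
        path.append(y * dim + x)
    while y != dy:                       # then the Y ring
        y = (y + step(y, dy)) % dim
        path.append(y * dim + x)
    return path
```

For example, on a 4×4 torus the route from node 0 to node 3 wraps backwards in one hop rather than crossing three links forward.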
12/4/03 28
RapidIO Model
• Initially developed for simulation of interconnects for processing-in-memory (PIM) architectures
• Can be easily integrated with models of different types of embedded systems
• Supports the RapidIO Message Passing Logical Layer, Common Transport Layer, and Parallel Physical Layer
• Undergoing simulation speed enhancements: more efficient arbitration for resources
Next phase
• Addition of the RapidIO I/O Logical Layer (shared memory)
• Addition of the RapidIO Serial Physical Layer (if desired)
• Validation with real RapidIO hardware
12/4/03 29
RapidIO Model
[Figure: star-switched and array-switched PIM topologies built from DIMM chips]
12/4/03 30
Optical-Layer Component Models
Optical Modules (device — functional variations)
• Optical Fiber
• Optical Filter — band-pass, band-stop
• Power Amplifier
• Transmitter — fixed, tunable, discrete tunable
• Receiver — fixed, tunable, discrete tunable
• WDM DEMUX — 1×2, 1×4, 1×8
• WDM MUX — 2×1, 4×1, 8×1
• OADM — 1-, 2-, 4-channel
• Connector
• Splitter — 1×2, 1×4, 1×8
• Coupler — 2×1, 2×4, 2×8, 4×1, 4×4, 8×1, 8×8
• Optical Switch — 1×2, 1×4, 2×1, 2×2, 2×4, 4×2, 4×4
• Optical Interface Module — 80% pass / 20% drop
• TDM Components — 4-channel MUX/DEMUX
System Effects
• Cost
• Electrical power use
• Latency / throughput
• Power threshold
• Power budget
• BW limits
• Higher-order physical effects (future)
Networking Layers
• IEEE 802.3 switch (ARINC 664?)
• IP with QoS
• ARP cache
• Multicast / virtual links
• TCP
• UDP
• Numerous LRMs
• Application layer (DVI, etc.)
Optical components are easily replicated with different parameter settings to produce commercial product models (e.g. Genoa GT111 amplifier)
12/4/03 31
Simulation
Overview of Research Activities
12/4/03 32
Simulation Overview
Cluster and Grid Modeling
• Application profiling, tracing, and emulation
• Network modeling
• MPI implementation
• Speed vs. fidelity tradeoff
Optical Network Modeling
• Optical network prototyping
• Design trade study
RC Modeling
• RC-enhanced NIC
• Hardware monitoring trade study
12/4/03 33
Current Simulation Environment
[Figure: parallel program → Trace Generator (MPE) on a heterogeneous cluster → Script Generator → MLDesigner on a single computer → simulation results]
1. Input parallel program into the Trace Generator
2. Logfile from the Trace Generator is converted to readable text and fed to the Script Generator
3. Script files are read in by MLDesigner and each end-node performs its specified tasks
4. When all script files have been completed, the simulation is complete and statistics are reported
12/4/03 34
FASE Overview
FASE = Fast and Accurate Simulation Environment for Clusters and Grids
Trace tools
• Evaluated TAU, MPE, Paradyn, SvPablo, and VampirTrace for parallel applications
• Chose MPE for communication events since it provides sufficient detail and is free
• Records MPI commands and computation blocks in the logfile
• Static = a database of scripts can be made instead of traces
Script Generator
• Sifts through the logfile to categorize program execution into communication and computation
Performance statistics generator
• Generates statistics that characterize the behavior of a device while running a particular program
• Example statistics: cache misses, CPI, percentages of instruction types executed, instruction count, disk I/O
Models
• Processors — single and multiprocessors, processors in memory, reconfigurable
• Networks — Ethernet, Myrinet, SCI, InfiniBand, RapidIO, HyperTransport, SONET and other optical protocols, TCP/IP
MPI Interface
• Currently supports the most basic MPI functions: MPI_Send, MPI_Recv
• Future functions: MPI_Barrier, MPI_Bcast, MPI_Alltoall, MPI_Reduce
• Created for each network model in the library
Implementation — speed vs. fidelity tradeoff
• High-level modeling of end nodes and packet-level modeling of network components
• Rationale: the network is often the bottleneck of a system, so use high-fidelity network models where practical to account for characteristics such as congestion and flow control
End-nodes (lower fidelity)
• Simulate computation as a “pause” in the model for a variable amount of time, dictated by the trace tool or calculated from statistics gathered by the performance statistics generator
• Begin and end network communications
[Figure: main components in FASE — a script file and user-supplied parameters feed each end node (script reader/processor, MPI/UPC interface, computational unit); end nodes attach to the network model through MPI/UPC and device interfaces; the performance statistics generator supplies computation times]
12/4/03 35
Profiling/Tracing Background
• Understanding the execution and analyzing the performance of a parallel program is difficult
• Profiling and tracing are two useful methods of gathering data from execution
• The behavior of a parallel program can be seen by making events observable and then recording the events
• Events can be function calls (e.g. MPI routines), computation, or user-defined
• Visualization tools can later be used to view logged events graphically
Profiling
• Records statistics of program execution
• Uses events to keep track of performance metrics
Tracing
• Records a detailed, time-stamped log of events, including event attributes
• Log formats: ALOG, BLOG, CLOG, SLOG-1, SLOG-2, SDDF, many others
[Figure: visualization of a matrix multiply using Jumpshot-4]
12/4/03 36
Trace Tools and Script Generator
Evaluated trace tools
• Tuning and Analysis Utilities (University of Oregon, LANL, and Research Centre Jülich, ZAM, Germany), SvPablo (University of Illinois), and Paradyn (University of Wisconsin–Madison): average execution time and count of procedure calls and loops
• Multi-Processing Environment (distributed with MPICH): traces all MPI communications
• VampirTrace (Pallas): commercial tool that traces MPI and procedure calls
MPE
• Selected initially because of direct MPI support
• CLOG output, but the CLOG format is space-inefficient and not well documented
• SLOG-2 is well documented and scalable
• Conversion process: CLOG → SLOG-2 binary → SLOG-2 ASCII
Script Generator
• Sifts through the logfile to categorize program execution into communication and computation
• Input: ASCII version of the logfile; output: script files used in the simulation environment
• Different degrees of filtering to balance simulation fidelity vs. speed: fine-grained = higher fidelity, slow; coarse-grained = lower fidelity, fast
Implementation
• Records MPI commands and computation blocks in the logfile
• Static = a database of scripts can be made instead of traces
[Figure: MPE produces a CLOG logfile; slog2sdk converts it to binary and then ASCII SLOG-2; the Script Generator emits per-process scripts, e.g. for PID 0: MPI_Send(0,2,4), MPI_Send(0,1,4), MPI_Recv(1,0,8), MPI_Recv(2,0,8), MPI_Recv(1,0,512), MPI_Recv(2,0,512); similar scripts for PIDs 1–3]
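A script reader of the kind the end-node models use can be sketched as a small parser over lines like MPI_Send(0,2,4). The three-argument layout follows the example scripts on the slide, but the exact script grammar and these function names are assumptions, not FASE's implementation.

```python
import re

# Matches script lines of the form MPI_Send(a,b,n) / MPI_Recv(a,b,n);
# the three-argument layout mirrors the slide's example scripts.
LINE = re.compile(r"(MPI_Send|MPI_Recv)\((\d+),(\d+),(\d+)\)")

def parse_script(text):
    """Turn a per-process script into a list of (op, a, b, nbytes) events."""
    events = []
    for line in text.splitlines():
        m = LINE.match(line.strip())
        if m:
            events.append((m.group(1), int(m.group(2)),
                           int(m.group(3)), int(m.group(4))))
    return events

def bytes_sent(events):
    """Total payload sent by this process — one simple derived statistic."""
    return sum(n for op, _, _, n in events if op == "MPI_Send")
```

In the simulator, each parsed event would either pause the end node (computation) or hand a packet to the network model (communication).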
12/4/03 37
MPI Collective Communication
• FASE currently supports only unicast communication (MPI_Send, MPI_Recv)
• Must support collective communications (MPI_Barrier, MPI_Alltoall, MPI_Reduce, MPI_Bcast, etc.) in order to simulate the execution of complex applications
• The algorithms shown in the figures are specific to FASE, since the MPI standard does not lay out exact algorithms
• The current implementation assumes that the entire MPI_COMM_WORLD is used as the group in the collective function calls
• These algorithms will be leveraged for UPC collective communications as well
MPI_Barrier (binary tree over nodes 0–5, two steps)
• The transactions can occur in any order; however, a parent node must receive a Barrier packet from both child nodes before sending a Barrier packet to its own parent
• After the root node receives Barrier packets from both its children, it sends Barrier packets back to its children, which then relay these packets to their children
MPI_Alltoall (nodes 0–2, three steps)
• The sending node passes an MPI_Alltoall packet to its parent node first and then to its children; the parent node then sends this packet to its parent and its other child; children send only to their children
MPI_Reduce (nodes 0–5, destination node shown)
• Parents receive MPI_Reduce packets from both children and perform the operation specified in the function call before sending a new MPI_Reduce packet to the parent node. If node 0 is not the destination, it collects a single MPI_Reduce packet from a child and sends its new packet to the other child node. If a node receives a packet from its parent and is not the destination, it waits to retrieve a child node’s MPI_Reduce packet and then sends its own MPI_Reduce packet to the other child node.
MPI_Bcast (nodes 0–5, source node shown)
• The source sends an MPI_Bcast packet first to its parent node and then to its children. Nodes that receive an MPI_Bcast packet from a child node must send to their parent first and then to the child that did not send the initial packet. Nodes receiving an MPI_Bcast packet from their parent node simply send the packet to both children.
[Transaction legend: 1st, 2nd, 3rd transaction]
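The two-phase tree barrier described above (Barrier packets flow up to the root, then a release flows back down) can be captured in a few lines. This sketch uses an assumed heap-style layout where the parent of node i is (i-1)//2; it mirrors the scheme in the figure rather than reproducing the FASE model code.

```python
def tree_barrier_messages(n):
    """Count messages in a two-phase tree barrier over nodes 0..n-1.

    Phase 1: every non-root node sends one Barrier packet up its tree
    edge (after hearing from its own children). Phase 2: the root's
    release is relayed back down, one message per edge. A tree on n
    nodes has n-1 edges, so the total is 2*(n-1) messages.
    """
    return 2 * (n - 1)

def barrier_order(n):
    """One legal transaction order: deepest nodes report up first,
    then releases propagate down from the root (parent of i is (i-1)//2)."""
    up = sorted(range(1, n), reverse=True)
    down = list(range(1, n))
    return ([("up", i, (i - 1) // 2) for i in up] +
            [("down", (i - 1) // 2, i) for i in down])
```

Counting messages this way is exactly what a low-fidelity end-node model needs: the collective's cost is a function of the tree shape, not of per-packet detail.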
12/4/03 38
Switched Aircraft LAN
[Figure: baseline switched LAN — a time-triggered (TT) backbone linking file servers (srv1, srv2), displays on an optical pixel bus (dp1–dp3), navigation (nav1–nav4), mission processing (mp1–mp4), sensors (sen1, sen2, sen_ir, sen_eo), communications (com1, com2, com_r1 to com_r4), and platform monitoring & protection (pm1–pm4, pm_gw1), each subsystem hosted on LRMs]
Key tradeoffs
• QoS thresholds and algorithms
• Latency / bandwidth
• Power analysis
• Cost analysis (baseline cost)
Key features
• Baseline system
• Application and traffic study
• Virtual links with QoS
• TCP/IP/Ethernet systems
• Fiber and copper links
12/4/03 39
Switchless Optical Pixel Bus — TDM System
Key tradeoffs
• Latency / bandwidth
• Optical power budget
• Electrical power analysis
• Cost analysis (baseline cost)
Key features
• 4 channels @ 2.5 Gbps (10 Gbps aggregate)
• Fixed optical components (cheaper)
12/4/03 40
Switchless Optical Pixel Bus — WDM System
Key tradeoffs
• Latency / bandwidth
• Optical power budget
• Electrical power analysis
• Cost analysis (baseline cost)
Key features
• 4 channels @ 10 Gbps (40 Gbps aggregate)
• 4 independent wavelengths (better security)
• More optical components (increased cost)
The TDM system provides a cost-effective solution if its bandwidth limitation is acceptable; the WDM system provides better bandwidth and security at increased cost in both dollars and power
12/4/03 41
Switchless Optical Aircraft LAN
Key tradeoffs
• WDM / TDM
• Compare to baseline architecture
• Latency / bandwidth
• Power analysis (mostly passive)
• Cost analysis
Key features
• Candidate system
• Unified bus
• Switchless
• Tunable wavelengths
• Optical switching
• Increased reliability
• Supports bandwidth growth
[Figure: candidate switchless LAN — the same subsystems as the baseline (file servers, displays on the optical pixel bus, navigation, mission processing, sensors, communications, platform monitoring & protection, each on LRMs) attached to a unified optical bus through gateways G1–G3, FE, and TT]
Gateway legend: G# = Gigabit Ethernet, FE = Fast Ethernet, TT = Time Triggered
12/4/03 42
RC Simulation
RC-enhanced NIC
• Based on the Intel IXP1200
• Explored dynamic RC trades
• Gigabit Ethernet packet processing
• Theatre Missile Defense System with Cooperative Engagement Capability as case study
RC hardware monitoring trade study
• Explored monitoring design space
• Examined performance impacts
• Trade study included processor, bus, and FPGA
12/4/03 43
Summary of HCS Lab Capabilities
• Basic and applied research in advanced computer architectures, networks, services, and systems for high-performance and reconfigurable computing and communications
• Focus on “high performance” in terms of execution time, throughput, latency, quality of service, dependability, etc.
• Simulative and experimental research to achieve both distinct and interdependent goals
• For simulation, leverage expertise in high-fidelity, CAD-based architecture and network performance modeling and analysis
• For experimentation, facilities include a cluster-based lab grid with 480 CPUs, multi-gigabit networking testbeds, reconfigurable and embedded computing testbeds, etc.
12/4/03 44
Potential Collaboration
New Millennium ST-8 System Model (Jeremy Ramos)
• System model development
• Fault insertion / recovery assessment
• System behavior analysis
Integrated Payload Middleware (Jeff Wolfe)
• Middleware alternatives trade study
• Validate integrated software designs
• Focus on fault detection and management, IPC, hardware abstraction, and OS abstraction for embedded systems
HRSC Development Trade Studies (Chris Butera)
• Architecture and interconnect trades for flight design
• Independent test and evaluation
Suggestions c/o Jeremy Ramos
12/4/03 45
Potential Collaboration
DARPA RWI (Michael Elias)
• Network protocol development
• Architecture trades
Payload Prototype Environment
• Emulation / simulation of components and systems
• Quick-turnaround application studies for proposal efforts
• Propose enhancements and perform trade studies
• Forecast future system needs and explore impacts
Other joint proposals?
• Star Bridge Systems, Inc. and other vendor collaboration?
• Other possibilities?