Communication Requirements and Interconnect Optimization for High-End Scientific Applications


Shoaib Kamil, Leonid Oliker, Ali Pinar, Member, IEEE Computer Society, and John Shalf, Member, IEEE Computer Society

Abstract—The path toward realizing next-generation petascale and exascale computing is increasingly dependent on building supercomputers with unprecedented numbers of processors. To prevent the interconnect from dominating the overall cost of these ultrascale systems, there is a critical need for scalable interconnects that capture the communication requirements of ultrascale applications. It is, therefore, essential to understand high-end application communication characteristics across a broad spectrum of computational methods, and utilize that insight to tailor interconnect designs to the specific requirements of the underlying codes. This work makes several unique contributions toward attaining that goal. First, we conduct one of the broadest studies to date of high-end application communication requirements, whose computational methods include: finite difference, lattice Boltzmann, particle-in-cell, sparse linear algebra, particle-mesh Ewald, and FFT-based solvers. Using derived communication characteristics, we next present the fit-tree approach for designing network infrastructure that is tailored to application requirements. The fit-tree minimizes the component count of an interconnect without impacting application performance compared to a fully connected network. Finally, we propose a methodology for reconfigurable networks to implement fit-tree solutions. Our Hybrid Flexibly Assignable Switch Topology (HFAST) infrastructure uses both passive (circuit) and active (packet) commodity switch components to dynamically reconfigure interconnects to suit the topological requirements of scientific applications. Overall, our exploration points to several promising directions for practically addressing the interconnect requirements of future ultrascale systems.

Index Terms—Interconnections, topology, performance analysis.


1 INTRODUCTION

As scientific computing matures, the demands for computational resources are growing at a rapid rate. It is estimated that by the end of this decade, numerous grand-challenge applications will have computational requirements that are at least two orders of magnitude larger than current levels [1], [2], [23]. However, as the pace of processor clock rate improvements continues to slow [3], the path toward realizing ultrascale computing is increasingly dependent on scaling up the number of processors to unprecedented levels. To prevent the interconnect architecture from dominating the overall cost of such systems, there is a critical need to effectively build and utilize network topology solutions with costs that scale linearly with system size.

High-performance computing (HPC) systems implementing fully connected networks (FCNs) such as fat-trees and crossbars have proven popular due to their excellent bisection bandwidth and ease of application mapping for arbitrary communication topologies. However, as supercomputing systems move toward tens or even hundreds of thousands of processors, FCNs quickly become infeasibly expensive. These trends have renewed interest in networks with a lower topological degree, such as mesh and torus interconnects (like those used in the IBM BlueGene and Cray XT series), whose costs rise linearly with system scale. Indeed, the number of systems using lower degree interconnects such as the BG/L and Cray torus interconnects has increased from six systems in the November 2004 list to 28 systems in the more recent Top500 list of June 2007 [24]. However, it is unclear what portion of scientific computations have communication patterns that can be efficiently embedded onto these types of networks.

Our work proposes a science-driven approach to interconnect design. We believe the quality of an interconnect should be measured by how well it captures the communication requirements of target applications, as opposed to theoretical metrics such as diameter and bisection bandwidth, since such metrics depend only on the interconnect topology, ignoring the communication topologies of target applications. For this proposed approach, it is essential to understand scientific application communication requirements across a broad spectrum of computational methods. Once this information is derived, we can then investigate how to tailor interconnect designs for the specific communication requirements of the underlying applications in terms of cost and performance effectiveness, while exploring how new technologies can be adopted for breakthroughs in interconnect solutions.


. S. Kamil is with the CRD/NERSC, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, MS 50A1148, Berkeley, CA 94720, and the Computer Science Department, University of California, Berkeley, CA 94720. E-mail: [email protected].

. L. Oliker and J. Shalf are with the CRD/NERSC, Lawrence Berkeley National Laboratory, 1 Cyclotron Road, MS 50A1148, Berkeley, CA 94720. E-mail: {loliker, jshalf}@lbl.gov.

. A. Pinar is with Sandia National Laboratories, 7011 East Avenue, MS 9159, Livermore, CA 94550. E-mail: [email protected].



This work demonstrates our overall approach to interconnect design and presents several unique contributions. In Section 2, we conduct one of the broadest studies to date of high-end application communication requirements, whose computational methods include: finite difference, lattice Boltzmann, particle-in-cell (PIC), sparse linear algebra, particle-mesh Ewald, and FFT-based solvers. To efficiently collect this data, we use the Integrated Performance Monitoring (IPM) profiling layer to gather detailed messaging statistics with minimal impact to code performance. The data collected in this phase set the stage for Section 3, which explores interconnect topologies that efficiently support the underlying applications' communication characteristics. To achieve this goal, we present a novel fit-tree approach for optimizing interconnect wiring topologies that exactly match application communication requirements using only a fraction of the resources required by conventional fat-tree or Clos interconnects.

Our studies further show that the varying communication requirements of different applications force a conservative approach that overprovisions resources to avoid congestion across all possible application classes. This limitation can be overcome within a dynamically reconfigurable interconnect infrastructure, which leads us to Section 4, where we propose a methodology for implementing reconfigurable networks using the Hybrid Flexibly Assignable Switch Topology (HFAST) infrastructure. HFAST allows the implementation of interconnect topologies that are specifically tailored to application requirements, via the proposed fit-tree approach or other mapping strategies. The HFAST approach uses both passive (layer-1/circuit switch) and active (layer-2/packet switch) commodity components to deliver all of the flexibility and fault tolerance of a fat-tree interconnect, while preserving the nearly linear cost scaling associated with traditional low-degree interconnect networks.

Moreover, we believe that the HFAST approach can match the performance of approaches that rely on more complex adaptive routing (AR) strategies using a far simpler and more cost-effective implementation. With adaptive routing, decisions on message direction cannot be made instantaneously upon arrival of the first bit of the packet. Rather, the packet header must be buffered long enough to examine enough header bits to determine the packet's destination and to consult the appropriate routing lookup tables—incurring additional latency and requiring additional buffer space. Circuit switches, on the other hand, do not require buffering, allowing instant data traversal without additional buffers or latency delays [21]. This, in turn, reduces complexity, area, and power requirements, thus motivating the use of bypass paths that skip over intermediate packet switches for persistent application communication topologies where dynamic routing decisions prove unnecessary [4].

Overall, our results lead to a promising approach for practically addressing the interconnect requirements of future ultrascale systems. Although our three research thrusts—HPC communication characterization, fit-tree design, and HFAST networks—work closely in concert, each of these components can also be considered an independent contribution that advances the state-of-the-art in its respective area.

2 SCIENTIFIC APPLICATION COMMUNICATION REQUIREMENTS

In order to quantify HPC interconnect requirements and study efficient implementation approaches, we must first develop an understanding of the communication characteristics for realistic scientific applications. Several studies have observed that many applications display communication topology requirements that are far less than the total connectivity provided by fully connected networks. For instance, the application study by Vetter and Mueller [25], [26] indicates that the applications that scale most efficiently to large numbers of processors tend to depend on point-to-point communication patterns where each processor's average topological degree of communication (TDC) is three to seven distinct destinations, or neighbors. This provides strong evidence that many application communication topologies exercise a small fraction of the resources provided by fully connected networks.

In this section, we expand on previous studies by exploring detailed communication profiles across a broad set of representative parallel algorithms. We use the IPM profiling layer to quantify the type and frequency of application-issued MPI calls, as well as identify the buffer sizes utilized for both point-to-point and collective communications. Finally, we study the communication topology of each application, determining the average and maximum TDC for bandwidth-limited messaging.

2.1 IPM: Low-Overhead MPI Profiling

To profile the communication characteristics of the scientific applications in our study, we employ the IPM [14] tool—an application profiling layer that allows us to noninvasively gather the communication characteristics of these codes as they run in a production environment. IPM brings together multiple sources of performance metrics into a single profile that characterizes the overall performance and resource usage of the application. It maintains low overhead by using a unique hashing approach that allows a fixed memory footprint and minimal CPU usage. IPM is open source, relies on portable software technologies, and is scalable to thousands of tasks.

IPM collects a wide variety of communication information using a very low-overhead hashing technique, which allows us to noninvasively instrument full-scale application codes without dramatically affecting their performance. In this work, we principally utilize information that encodes the number and timing of each MPI call. We gather communication information on each task about each MPI call with a unique set of arguments. Arguments to MPI calls contain message buffer size, as well as source and destination information. In some cases, we also track information from the MPI_Status structure. For instance, in the case of MPI_Send, IPM keeps track of each unique buffer size and destination, the number of such calls, as well as the total, minimum, and maximum runtimes to complete the call. IPM also allows code regions to be defined, enabling us to separate application initialization from steady-state computation and communication patterns, as we are interested primarily in the communication topology for the application in its postinitialization steady state.
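To make the hashing idea concrete, the following is a minimal Python sketch of such a profiling table (hypothetical wrapper code for illustration, not IPM's actual implementation or API): each unique (call, buffer size, destination) tuple maps to a single fixed-size record, so the memory footprint stays bounded no matter how many calls are issued.

```python
import time
from collections import defaultdict

class CommProfile:
    """Sketch of an IPM-style hashed profile (illustrative, not IPM itself)."""
    def __init__(self):
        # (call name, buffer size, destination) -> [count, total, min, max] times
        self.table = defaultdict(lambda: [0, 0.0, float("inf"), 0.0])

    def record(self, call, nbytes, dest, elapsed):
        rec = self.table[(call, nbytes, dest)]
        rec[0] += 1                      # number of such calls
        rec[1] += elapsed                # total runtime
        rec[2] = min(rec[2], elapsed)    # minimum runtime
        rec[3] = max(rec[3], elapsed)    # maximum runtime

    def timed(self, call, nbytes, dest, fn, *args):
        """Time an arbitrary function call and log it under the given key."""
        t0 = time.perf_counter()
        result = fn(*args)
        self.record(call, nbytes, dest, time.perf_counter() - t0)
        return result
```

In practice, such a wrapper would intercept the MPI calls themselves (e.g., via the PMPI profiling interface); the fixed-size-record design is what keeps overhead low regardless of message volume.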


Experiments were run on a variety of Department of Energy supercomputing systems; the data we collected depend on the concurrency, application code, and input—no machine-dependent characteristics are collected or analyzed in this study.

2.2 Message Size Thresholding

Before we explore the TDC for our application suite, we must first quantify the thresholding size for messages that are bandwidth limited. Otherwise, we may mistakenly presume that a given application has a high TDC even if trivially small (latency-bound) messages are sent to the majority of its neighbors.

To derive an appropriate threshold, we examine the product of the message bandwidth and the delay (latency) for a given point-to-point connection. The bandwidth-delay product describes precisely how many bytes must be "in-flight" to fully utilize available link bandwidth. This can also be thought of as the minimum size required for a nonpipelined message to fully utilize available link bandwidth. Vendors commonly refer to an N_1/2 metric, which describes the message size below which you will get only 1/2 of the peak link performance; the N_1/2 metric is typically half the bandwidth-delay product. In this paper, we focus on the minimum message size that can theoretically saturate the link, i.e., those messages that are larger than the bandwidth-delay product.

Table 1 shows the bandwidth-delay products for a number of leading-edge interconnect implementations, where the best performance hovers close to 2 KB. We, therefore, choose 2 KB as our target bandwidth-limiting messaging threshold. This reflects the state-of-the-art in current switch technology and an aggressive goal for future leading-edge switch technologies. We assume that below this threshold, the latency-bound messages would not benefit from a dedicated point-to-point circuit. Such messages are only affected by topology when it comes to the number of links traversed, and cannot be sped up by increasing available bandwidth.
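As a worked example of this threshold, the sketch below computes bandwidth-delay products and the corresponding N_1/2 values for two hypothetical links (the parameters are illustrative placeholders, not the measured values behind Table 1):

```python
def bandwidth_delay_product(bandwidth_Bps, latency_s):
    """Bytes that must be in flight to saturate the link."""
    return bandwidth_Bps * latency_s

# Illustrative link parameters (assumed, not taken from Table 1).
links = {
    "example-link-A": (1.25e9, 1.6e-6),  # 10 Gb/s at 1.6 us -> 2000 B product
    "example-link-B": (0.50e9, 5.0e-6),  # 4 Gb/s at 5.0 us  -> 2500 B product
}
for name, (bw, lat) in links.items():
    bdp = bandwidth_delay_product(bw, lat)
    # N_1/2 (message size reaching half of peak bandwidth) is roughly BDP / 2.
    print(f"{name}: BDP = {bdp:.0f} B, N_1/2 = {bdp / 2:.0f} B")
```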

2.3 Evaluated Scientific Applications

We now highlight the salient features of the eight applications studied in this work. The high-level overview of the codes and methods is presented in Table 2. Each of these applications is actively run at multiple supercomputing centers, consuming a sizable amount of computational resources. Descriptions of the algorithms and scientific impacts of these codes have been extensively detailed elsewhere [5], [6], [7], [13], [17], [18], [20], [22]; here, we present a brief overview of each application.

BBeam3D [22] models the collision process of two counterrotating charged particle beams moving at close to the speed of light. The application is a 3D particle-in-cell computation that contains multiple models (weak-strong, strong-strong) and multiple collision geometries (head-on, long range, crossing angle), with collisions calculated self-consistently by solving the Vlasov-Poisson equation using Hockney's FFT method. Thus, the code exhibits communication characteristics that reflect the combined requirements of the PIC method and the 3D FFT for the Poisson solver.

Cactus [6] is an astrophysics computational toolkit designed to solve the challenging coupled nonlinear hyperbolic and elliptic equations that arise from Einstein's Theory of General Relativity. Consisting of thousands of terms when fully expanded, these partial differential equations (PDEs) are solved using finite differences on a block domain-decomposed regular grid distributed over the processors. The Cactus communication characteristics reflect the requirements of a broad variety of PDE solvers on nonadaptive block-structured grids.

The Gyrokinetic Toroidal Code (GTC) is a 3D PIC application developed to study turbulent transport in magnetic confinement fusion [18]. GTC solves the nonlinear gyrophase-averaged Vlasov-Poisson equations [16] in a geometry characteristic of toroidal fusion devices. By using the PIC method, the nonlinear PDE describing particle motion becomes a simple set of ordinary differential equations (ODEs) that can be easily solved in the Lagrangian


TABLE 1: Bandwidth-Delay Products for Several High-Performance Interconnect Technologies

*This is the effective peak unidirectional bandwidth delivered per CPU (not per link).

TABLE 2: Overview of Scientific Applications Evaluated


coordinates. Unlike BB3D, GTC's Poisson solver is localized to individual processors, so the communication requirements only reflect the needs of the PIC core.

LBCFD [20] utilizes an explicit Lattice-Boltzmann method to simulate fluid flows and to model fluid dynamics. The basic idea is to develop a simplified kinetic model that incorporates the essential physics and reproduces correct macroscopic averaged properties. LBCFD models 3D simulations under periodic boundary conditions, with the spatial grid and phase-space velocity lattice overlaying each other, distributed with a 3D domain decomposition.

Based on the MADspec cosmology code that calculates the maximum likelihood angular power spectrum of the cosmic microwave background (CMB), MADbench [5] is a simplified benchmark that inherits the characteristics of the application without requiring massive input data files. MADbench tests the overall performance of the subsystems of real massively parallel architectures by retaining the communication and computational complexity of MADspec and integrating a data set generator that ensures realistic input data. Much of the computational load of this application is due to its use of dense linear algebra, which is reflective of the requirements of a broader array of dense linear algebra codes in the scientific workload.

The PARAllel Total Energy Code (PARATEC) [7] performs ab initio quantum-mechanical total energy calculations using pseudopotentials and a plane-wave basis set. In solving the Kohn-Sham equations using a plane-wave basis, part of the calculation is carried out in real space and the remainder in Fourier space using specialized parallel 3D FFTs to transform the wave functions. The communication involved in these FFTs is the most demanding portion of PARATEC's communication characteristics. A workload analysis at the National Energy Research Scientific Computing Center (NERSC) [27] has shown that Density Functional Theory (DFT) codes, which include PARATEC, QBox, and VASP, account for more than 3/4 of the materials science workload.

Particle-Mesh Ewald Molecular Dynamics (PMEMD) [13] is an application that performs molecular dynamics simulations and minimizations. The force evaluation is performed in an efficiently parallel manner using state-of-the-art numerical and communication methodologies. PMEMD uses a highly asynchronous approach to communication for the purposes of achieving a high degree of parallelism. PMEMD represents the requirements of a broader variety of molecular dynamics codes employed in chemistry and bioinformatics applications.

SuperLU [17] is a general-purpose library for the direct solution of large, sparse, nonsymmetric systems of linear equations on high-performance machines. The library routines perform an LU decomposition with partial pivoting as well as a triangular system solve through forward and back substitution. This application relies on sparse linear algebra of various kinds for its main computational kernels, ranging from a simple vector scale to a large triangular solve. Sparse methods are becoming increasingly common in the scientific workload because they apply work only to nonzero entries of the matrix in order to improve time-to-solution for large-scale problems.

Together, this collection of numerical methods spans the characteristics of a great many more applications, especially with respect to communication patterns. For example, the core algorithm of the PARATEC code studied here has the communication characteristics of many other important plane-wave DFT calculations. Likewise, a large number of finite difference and particle-mesh codes exhibit similar communication patterns to Cactus and PMEMD. Note that certain quantities relevant to the present study, such as communication degree, are largely dictated by the scientific problem solved and the algorithmic methodology. For instance, in the case of Cactus, where finite differencing is performed using a regular grid, the number of neighbors is determined by the dimensionality of the problem and the stencil size. Profiling a greater number of applications would of course improve the coverage of this study; however, the eight applications detailed here broadly represent a wide range of scientific disciplines and modern parallel algorithms under realistic computational demands.

2.4 Communication Characteristics

We now explore the communication characteristics of our studied applications by quantifying the MPI call count distributions, collective and point-to-point buffer sizes, and topological connectivity.

2.4.1 Call Counts

The breakdown of MPI communication call types is shown in Table 3 for each of our studied applications. Here, we only consider calls dealing with communication and synchronization, and do not analyze other types of MPI functions which do not initiate or complete message traffic. Notice that, overall, these applications utilize only a small subset of the entire MPI library. Most codes use a small variety of MPI calls, and utilize mostly point-to-point communication functions (over 90 percent of all MPI calls), except GTC, which relies heavily on MPI_Gather. Observe also that nonblocking communication is the predominant point-to-point communication model for these codes.

2.4.2 Buffer Sizes for Collectives

Fig. 1 presents a cumulative histogram of buffer sizes for collective communication (that is, communication that involves all of the processors) across all eight applications. Observe that relatively small buffer sizes are predominantly used; in fact, about 90 percent of the collective messages are 2 KB or less (the bandwidth-delay product, shown by the vertical line), while almost half of all collective calls use buffers less than 100 bytes. These results are consistent with previous studies [25], [26] and validate IBM's architectural decision to dedicate a separate lower bandwidth network on their BlueGene machines for collective operations. For this broad class of applications, collective messages are mostly constrained by the latency of the interconnect, regardless of the topological interconnectivity.
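The cumulative histogram itself is a simple reduction over the profiled buffer sizes; a minimal sketch (with synthetic sizes for illustration, not the measured application traces):

```python
def cumulative_fraction(buffer_sizes, thresholds):
    """Fraction of calls whose buffer size is at or below each threshold."""
    total = len(buffer_sizes)
    return {t: sum(s <= t for s in buffer_sizes) / total for t in thresholds}

sample = [8, 64, 64, 256, 1024, 2048, 4096, 65536]  # synthetic sizes in bytes
for t, frac in cumulative_fraction(sample, [100, 2048]).items():
    print(f"<= {t:>5} B: {100 * frac:.0f}% of collective calls")
```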

2.4.3 Point-to-Point Buffer Sizes

A cumulative histogram of buffer sizes for point-to-point communication is shown in Fig. 3 for each of the applications; once again, the 2 KB bandwidth-delay product is shown by the vertical lines. Here, we see a wide range of


communication characteristics across the applications. Cactus, LBCFD, and BBeam3D use a relatively small number of distinct buffer sizes, but each of these buffers is relatively large. GTC employs some small communication buffers, but over 80 percent of the messaging occurs with 1 MB or larger data transfers. In addition, it can be seen that SuperLU, PMEMD, MADbench, and PARATEC use many different buffer sizes, ranging from a few bytes to over a megabyte in some cases. Overall, Fig. 3 demonstrates that unlike collectives (Fig. 1), point-to-point messaging in these applications uses a wide range of buffers, as well as large message sizes. In fact, for all but two of the codes, buffer sizes larger than the 2 KB bandwidth-delay product account for >75 percent of the overall point-to-point message sizes.

2.4.4 Topological Connectivity

We now explore the topological connectivity for each application by representing the volume and pattern of message exchanges between all tasks. By recording statistics on these message exchanges, we form an undirected graph that describes the topological connectivity required by each application. Note that this graph is undirected as we assume that most modern switch links are bidirectional; as a result, the topologies shown are always symmetric about the diagonal. From this topology graph, we then calculate the quantities that describe communication patterns at a coarse level. Such reduced metrics are important in allowing us to make direct comparisons between applications. In particular, Fig. 2 examines the maximum and average TDC (connectivity) of each code, a key metric for evaluating the potential of lower degree and nontraditional interconnects. We show the max and average connectivity using a thresholding heuristic based on the bandwidth-delay product (see Section 2.2) that disregards smaller latency-bound messages. In many cases, this thresholding lowers the average and maximum TDC substantially. An analysis of these results in the context of topological network designs is presented in Section 2.5.
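A minimal sketch of the TDC computation (synthetic input for illustration; in the study, the underlying data come from IPM traces): given a P x P matrix whose (i, j) entry is the largest message size task i sends to task j, symmetrize it for bidirectional links and count, per task, the partners above the 2 KB threshold.

```python
import numpy as np

def tdc(msg_size, threshold=2048):
    """Max and average topological degree of communication above a threshold."""
    sym = np.maximum(msg_size, msg_size.T)   # undirected: links are bidirectional
    degrees = (sym > threshold).sum(axis=1)  # partners per task above threshold
    return degrees.max(), degrees.mean()

P = 256
rng = np.random.default_rng(0)               # synthetic pattern for illustration
msg = rng.choice([0, 1024, 65536], size=(P, P), p=[0.90, 0.05, 0.05])
np.fill_diagonal(msg, 0)                     # tasks do not message themselves
max_tdc, avg_tdc = tdc(msg)
print(f"max TDC = {max_tdc}, avg TDC = {avg_tdc:.1f}")
```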

Fig. 4a shows the topological connectivity of BBeam3D for P = 256 as well as the effect of eliminating smaller (latency-bound) messages on the number of partners. Observe the high TDC for this charge density calculation due to its reliance on data transposes during the 3D FFTs. For this code, the maximum and average TDC is 66 neighbors; both of these are insensitive to thresholding lower than 64 KB. BBeam3D, thus, represents an application class which exhibits a TDC smaller than the full connectivity of a fat-tree, with little sensitivity to bandwidth-limited message thresholding.

In Fig. 4b, we see that the ghost-zone exchanges of Cactus result in communications with "neighboring"

Fig. 1. Buffer sizes distribution for collective communication for all codes. The vertical line demarcates the 2 KB bandwidth-delay product.

Fig. 2. Average and maximum communicating partners for the studied applications at P = 256, thresholded by the 2 KB bandwidth-delay product. Communications smaller than the threshold are not considered in calculating the communicating partners.

TABLE 3: Breakdown of MPI Communication Calls, Percentage of Point-to-Point (PTP) Messaging, Maximum and Average TDC Thresholded by 2 KB, and FCN Utilization (Thresholded by 2 KB) for Evaluated Applications on 256 Processors

nodes, represented by diagonal bands. In fact, each node communicates with at most six neighbors due to the regular computational structure of this 3D stencil code. On average, the TDC is 5, because some nodes are on the boundary and, therefore, have fewer communication partners. The maximum TDC is independent of run size (as can be seen by the similarity of the P = 64 and P = 256 lines) and is insensitive to thresholding, which suggests that no pattern of latency-bound messages can be excluded. Note, however, that the low TDC indicates limited utilization of an FCN architecture.

As shown in Fig. 4c, GTC exhibits a regular communication structure typical of a particle-in-cell calculation that uses a one-dimensional domain decomposition. Each processor exchanges data with its two neighbors as particles cross the left and right boundaries. Additionally, there is a particle decomposition within each toroidal partition, resulting in an average TDC of 4 with a maximum of 17 for the P = 256 test case. This maximum TDC is further reduced to 10 when using our 2 KB bandwidth-delay product message size threshold. These small TDC requirements clearly indicate that most links on an FCN are not being utilized by the GTC simulation.

The connectivity of LBCFD is shown in Fig. 4d. Structurally, we see that the communication, like Cactus, occurs in several diagonal bands. Note that although LBCFD streams the data in 27 directions (due to the 3D decomposition), the code is optimized to reduce the number of communicating neighbors to 6, as seen in Fig. 4d. This degree of connectivity is insensitive to the concurrency level. The maximum TDC is insensitive to thresholding, showing that the communications of this application use larger message sizes.

MADbench's communication topology characteristics are shown in Fig. 4e. Each processor communicates with 38 neighbors on average, dropping to 36 if we eliminate messages smaller than 2 KB. The communication is relatively regular due to the underlying dense linear algebra calculation, with an average and maximum TDC that are almost identical. MADbench is another example of a code whose overall TDC is greater than the connectivity of a mesh/torus interconnect, but still significantly less than the number of links provided by a fat-tree.

Fig. 4f shows the complex structure of communication of the PMEMD particle-mesh Ewald calculation. Here, the maximum and average TDC is equal to P and the degree of connectivity is a function of concurrency. For the spatial decomposition used in this algorithm, the communication intensity between two tasks drops as their spatial regions become more distant. The rate of this drop-off depends strongly on the molecule(s) in the simulation. Observe that for P = 256, thresholding at 2 KB reduces the average connectivity to 55, even though the maximum TDC remains

Fig. 3. Buffer sizes distribution for point-to-point communication. The vertical lines demarcate the 2 KB bandwidth-delay product.

at 256. This application class exhibits a large disparity between the maximum and average TDC.

Fig. 4g shows the communication requirements of PARATEC. This communication-intensive code relies on global data transposes during its 3D FFT calculations, resulting in large, global message traffic [7]. Here, the maximum and average TDC is equal to P, and the connectivity is insensitive to thresholding. Thus, PARATEC represents the class of codes that make use of the bisection bandwidth that an FCN configuration provides.

Finally, Fig. 4h shows the connectivity and TDC for SuperLU. The complex communication structure of this computation results in many point-to-point message transmissions: in fact, without thresholding, the connectivity is equal to P. However, by removing the latency-bound messages by thresholding at 2 KB, the average and maximum TDC is reduced to 30 for the 256-processor test case. Also, note that the connectivity of SuperLU is a function of concurrency, scaling proportionally to √P (see [17]).

Fig. 4. Topological connectivity of each of the studied applications, showing volume of communication at P = 256.

In the following section, we analyze the measured topological connectivities in the context of interconnect requirements.

2.5 Communication Connectivity Analysis

Based on the topological connectivities of our applications, we now categorize our codes as follows: Applications with communication patterns such that the maximum TDC is less than the connectivity of the interconnection network (case i) can be perfectly embedded into the network, albeit at the cost of having some connections be wasted/idle. If the TDC is equal to that of the underlying interconnect and the communication is isomorphic to the network architecture, then the communication can also be embedded (case ii). However, if the TDC is equal but the communication is nonisomorphic to the interconnect (case iii), or if the TDC is higher than that of the underlying network (case iv), there is no embedding without sharing some links for messaging, which can lead to message contention.
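These four cases reduce to a simple decision rule; a small sketch follows (the isomorphism test is taken as an input judgment rather than computed, and the function name is illustrative):

```python
def embedding_case(app_tdc, network_degree, isomorphic):
    """Classify an application/network pair into the four cases above."""
    if app_tdc < network_degree:
        return "i: embeds, some links idle"
    if app_tdc == network_degree:
        return "ii: embeds" if isomorphic else "iii: links shared, possible contention"
    return "iv: links shared, possible contention"

print(embedding_case(5, 6, False))    # TDC below network degree -> case i
print(embedding_case(256, 6, False))  # TDC far above network degree -> case iv
```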

2.6 Point-to-Point Traffic

We now discuss each of the applications and consider the class of network best suited for its communication requirements. First, we examine the four codes exhibiting the most regularity in their communication exchanges: Cactus, GTC, LBCFD, and MADbench. Cactus displays a bounded TDC independent of run size, with a communication topology that isomorphically maps to a regular mesh; thus, a fixed 3D mesh/torus would be sufficient to accommodate these types of stencil codes, although an adaptive network (see Section 4) would also fulfill Cactus's requirements (i.e., consistent with case i). LBCFD and MADbench also display a low degree of connectivity; however, while their communication pattern is isotropic, their respective structures are not isomorphic to a regular mesh, thereby corresponding to the case iii classification. Although GTC's primary communication pattern is isomorphic to a regular mesh, it has a maximum TDC that is higher than the average due to important connections that are not isomorphic to a mesh (case iv). Thus, a fixed mesh/torus topology would not be well suited for this class of computation.

BBeam3D, SuperLU, and PMEMD all exhibit anisotropic communication patterns with a TDC that scales with the number of processors. Additionally, PMEMD has widely differing maximum and average TDC. However, with thresholding, the proportion of processors with messages that would benefit from dedicated links is large but stays bounded well below the number of processors involved in the calculation (consistent with case iii). Thus, a regular mesh or torus would be inappropriate for this class of computation, while an FCN remains underutilized.

Finally, PARATEC represents the communications requirements for a large class of important chemistry and fluids problems where part of the problem is solved in Fourier space. It requires large global communications involving large messages that fully utilize the FCN and are, therefore, consistent with case iv. PARATEC's large global communications are a result of the 3D FFTs used in the calculation, which require two stages of global 3D transposes. The first transpose is nonlocal and involves communications of messages of similar sizes between all the processors, resulting in the uniform background of 32 KB messages. In the second transpose, processors only communicate with neighboring processors, resulting in additional message traffic along the diagonal of the graph. PARATEC's large global communication requirements can only be effectively provisioned with an FCN network.

In summary, only one of the eight codes studied (Cactus) offered a communication pattern that maps isomorphically to a 3D mesh network topology (case i). This indicates that mesh/torus interconnects may be insufficient for diverse scientific workloads. Additionally, only PARATEC fully utilizes the FCN at large scales (case iv), thereby undercutting the motivation for using FCNs across a broad range of computational domains. The underutilization of FCNs for our codes can be clearly seen in the last row of Table 3. Thus, for a wide range of applications (cases ii and iii), we believe there is space to explore alternative interconnect architectures that contain fewer switch ports than a fat-tree but greater connectivity than mesh/torus networks; such interconnects are explored further in Section 3.

3 FIT-TREE INTERCONNECT DESIGNS

Our analysis in the previous section showed that the communication patterns of most applications are irregular, evincing the limitations of 3D mesh interconnects. At the same time, most communication patterns are sparse, revealing that the large bandwidth of an FCN is not necessary, and exhibit good locality, showing that an intelligent task-to-processor assignment can significantly decrease the load on the network. In this section, we demonstrate how statistics about the communication patterns of target applications can be adopted to build interconnects that are more effective and cost efficient. Specifically, we start with a fat-tree topology and then develop the concept of a fit-tree, which allows comparable performance on target applications at a fraction of the interconnect resources of a fat-tree. While we examine our science-driven approach in the context of fat-trees in this paper, the same analysis may be applied to other popular topologies; our choice of fat-trees is motivated by their high popularity, as evidenced by their strong presence in the Top500 list.

We start with a review of the fat-tree topology and its resource requirements in Section 3.1, and then examine how well this topology corresponds to the application communication requirements in Section 3.2. Establishing the underutilization of fat-tree network resources motivates our novel fit-tree methodology, described in Section 3.3.

3.1 Fat-Tree Resource Requirements

Conceptually, a fat-tree is a k-ary tree with processors on the bottom-most level, where the thicknesses (capacities) of the edges increase at higher levels of the tree. Here, k is defined by the k×k switch block size used to implement the network. That is, 2×2 switches will yield a binary tree, 4×4 switches


yield a 4-ary tree, etc. In a conventional fat-tree, the total bandwidth is constant for each level of the tree; thus, the thickness of an edge at level i+1 is k times the thickness at level i. Messages can travel up the tree and back down to traverse from a processor to any other processor without being constrained by bandwidth limitations; this structure can be thought of as a "folded" Benes network [9].

We now quantify the relation between the number of levels, the number of processors, and the number of switch boxes. A fat-tree with L levels built with k×k switches can have up to 2k^L processors, since the number of nodes is multiplied by k at each level from the root down to the tree's bottom. Conversely, the depth of a fat-tree for P processors built with k×k switches is log_k P − log_k 2. The corrective term of 2 is due to the root level of the fat-tree, where all switch ports are available for the lower level, unlike intermediate levels, where half of the ports are used for connections with the higher level. Since the total bandwidth at each level is constant, so is the number of switch ports per level. As a result, the bottom level of the fat-tree, which connects the processors to the network, requires ⌈P/k⌉ switches; thus, a fat-tree with L levels built with k×k switches requires LP/k = 2Lk^(L−1) switches. Conversely, building a fat-tree for P processors requires (log_k P − log_k 2)⌈P/k⌉ k×k switches.
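These relations are easy to check numerically; a short sketch of the fat-tree sizing formulas above:

```python
import math

def fat_tree_processors(k, L):
    """Maximum processors for an L-level fat-tree of k x k switches: 2k^L."""
    return 2 * k ** L

def fat_tree_depth(k, P):
    """Levels needed for P processors: log_k P - log_k 2."""
    return math.log(P, k) - math.log(2, k)

def fat_tree_switches(k, L):
    """Total k x k switch blocks: L * P / k = 2 L k^(L-1)."""
    return L * fat_tree_processors(k, L) // k

k, L = 2, 7
P = fat_tree_processors(k, L)
print(P, fat_tree_depth(k, P), fat_tree_switches(k, L))  # 256, 7.0, 896
```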

Constructing fat-trees (illustrated in Fig. 7) where the network bandwidth is preserved at all levels is extremely challenging for thousands of processors, and simply infeasible for the next generation of ultrascale computing systems with tens or hundreds of thousands of processors. Besides the construction complexity, the performance of a fat-tree network degrades while the cost inflates sharply with increasing processor count. From the performance perspective, as the depth of the tree increases with larger concurrencies, the number of hops per message increases, corresponding to larger message latencies. While latency due to the interconnection network may not be significant for small to medium numbers of processors, it can dominate the message transmission cost at very high concurrencies. Additionally, the cost of a fat-tree grows superlinearly with larger parallel systems, since fat-tree construction depends on the number of switching blocks as well as the number of cables employed. These factors eliminate fat-tree topologies as a practical interconnection paradigm for next-generation supercomputers.

3.2 Fat-Tree Utilization

In this section, we analyze what fraction of the available fat-tree bandwidth is utilized by our studied applications. In previous work [15], we employed two methods to assign tasks to processors: one that assigns processors based on the natural ordering of the tasks, and a second method that aims to minimize the average number of hops for each message using a heuristic based on graph partitioning. For the analysis here, we assign tasks to processors using the heuristic methodology.

To create an instance of communication, we use the application communication patterns presented in Section 2. For a given instance, a processor sends a message to one of its communicating partners chosen at random. As an approximation of the communication overhead, we create 10P instances of communication for each application, and route the messages on the interconnect, recording how many messages reach each level of the fat-tree. Using this estimation strategy, we simulate the behavior of each application to determine the communication load on the network.

Fig. 5a displays the results for bandwidth utilization of a fat-tree built with 2×2 switches. In this figure, the horizontal axis corresponds to the fat-tree level, starting with the leaf nodes (i.e., the processors). The vertical axis corresponds to bandwidth utilization, which we compute by counting the number of messages that reach a given level and comparing this number with the level's total available bandwidth (P for a fat-tree). The results show that the bandwidth utilization drops sharply as the tree level increases. For GTC, this number drops to 0 at level seven, indicating that the highest six levels of the fat-tree are not used at all. A similar trend is seen in all examined applications. Even for PARATEC, which uses all-to-all communication in its FFT, bandwidth utilization goes down to 74 percent at the top level even though P is only 256 processors. These results clearly show that fat-tree bandwidth is underutilized for most applications, especially for those that can scale up to thousands of processors.
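A condensed sketch of this estimation strategy (assuming, for illustration, a binary fat-tree with processors as leaves and each message climbing to the lowest common ancestor of its source and destination):

```python
import random

def levels_traversed(a, b):
    """Tree levels a message climbs between leaves a and b (binary fat-tree)."""
    level = 0
    while a != b:     # climb until the two subtrees merge
        a //= 2
        b //= 2
        level += 1
    return level

def per_level_traffic(partners, P, instances_per_task=10):
    """Count messages reaching each level over 10P random communication instances."""
    counts = [0] * P.bit_length()
    for src in range(P):
        for _ in range(instances_per_task):
            dst = random.choice(partners[src])           # random partner per instance
            for lvl in range(levels_traversed(src, dst)):
                counts[lvl] += 1
    return counts

P = 8
partners = [[(i + 1) % P, (i - 1) % P] for i in range(P)]  # synthetic ring pattern
print(per_level_traffic(partners, P))
```

Dividing each level's count by the level's total available bandwidth (P for a fat-tree) yields utilization curves of the kind shown in Fig. 5a.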

Fig. 5. (a) Underutilization of fat-tree bandwidth for the examined application suite. Level 1 refers to the bottom of the tree closest to the processors. The vertical axis represents the percentage of the bandwidth utilized at each level. (b) The potential savings in the number of required ports (and thus cost) for an ideal fit-tree compared with the fat-tree approach.

In the next section, we will use this observation to propose an alternative interconnection topology.

3.3 Fit-Tree Approach

The motivation for the fit-tree topology comes from our observation that, for many scientific computing applications, the available bandwidth of a fat-tree is not utilized at all levels—especially the higher ones. Exploiting this observation, we propose the fit-tree topology, an improvement on fat-trees that provides better scalability in terms of both performance and cost.

Consider an intermediate node in a fat-tree that is the root of a subtree of P′ processors. In a conventional fat-tree, this node corresponds to a P′×P′ switch, whose ports are assigned so that P′ of them are connected to the lower level and P′ are connected to the higher level. This provides P′ different communication channels for P′ processors. Since some of this bandwidth is redundant (Section 3.2), eliminating a portion of the connections within the higher level will not degrade performance. Note that although this network design optimization decreases cabling requirements, it does not improve switch costs or overall performance.

In the proposed fit-tree design, the number of ports used for connections to the higher levels of the tree is less than the number of ports used for connections to the lower levels. This approach leverages otherwise unutilized switch ports to increase the number of connected nodes at lower tree levels, allowing an increase in the number of processors rooted at a node (at the same level). Thus, our fit-tree design has a c : 1 ratio between the number of ports that go down and up (respectively) for each intermediate level, where we call c > 1 the fitness ratio. Conversely, a conventional fat-tree has a 1 : 1 (c = 1) ratio between bandwidth down and up at each intermediate level.

The fit-tree methodology enables building larger systems for a fixed number of levels in the interconnect tree. A direct comparison with fat-trees can be made in two ways. If we assume that total bandwidth is preserved at each level of the tree, a fat-tree is built using k children per node. However, a fit-tree's node has ck children, where c is the fitness ratio. This translates to an exponential advantage in the number of processors the interconnect can support, as a fit-tree of L levels built with k×k switches and a c : 1 ratio will contain 2(ck)^L processors, as opposed to 2k^L for a fat-tree. Conversely, the depth of a fit-tree for a fixed number of processors P built with k×k switches and a c : 1 ratio is log_ck P − log_ck 2. This reduced number of levels for a fixed number of processors translates to a reduction in switch count. Overall, this results in a substantial reduction in the port counts and consequent wiring complexity required to implement a fit-tree that offers the same performance as a fat-tree. Fig. 6a shows the advantage of the fit-tree approach for potential system concurrency given a fixed number of levels.
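A small sketch of these scaling relations, comparing a fat-tree (c = 1) with a fit-tree of fitness ratio c = 2:

```python
import math

def max_processors(k, L, c=1.0):
    """Processors supported by L levels of k x k switches: 2(ck)^L."""
    return 2 * (c * k) ** L

def depth(k, P, c=1.0):
    """Levels needed for P processors: log_{ck} P - log_{ck} 2."""
    return math.log(P / 2, c * k)

k, L = 2, 7
print(max_processors(k, L))        # fat-tree:           256 processors
print(max_processors(k, L, 2.0))   # fit-tree (c = 2): 32768 processors
print(depth(k, 32768), depth(k, 32768, 2.0))  # 14 levels vs. 7 levels
```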

Fig. 6. Comparison of fat-tree and fit-tree scalabilities in terms of (a) potential system concurrency for a fixed number of tree levels and (b) the required number of switches per processor.

Fig. 7. A four-level fat-tree built from 2×2 switches. The fit-tree approach "trims" links at the upper levels if the extra bandwidth is unneeded and packs the resulting necessary links into as few switch blocks as possible.

Alternatively, we can consider fixing the number of tree levels while decreasing the total fit-tree bandwidth at the higher levels. For a fat-tree, the total bandwidth provisioned is LP/k switch blocks (Section 3.1). In a fit-tree, however, the bandwidth can be reduced by a factor of c at each level of the tree, so the total number of k×k switch blocks provisioned is

Σ_{i=1}^{L−1} P/(k c^(i−1)) = P(1 − 1/c^(L−1)) / (k(1 − 1/c)) < Pc/(k(c−1)).

It is worth noting that the total bandwidth, and thus the number of switches, scales linearly with P, which provides perfect scalability for fit-trees. For example, given c = 2, the total number of required switches will be no more than two times the number of switches at the first level. Fig. 6b highlights the fit-tree advantage by showing a comparison of the number of switch components required for fat-trees and fit-trees using varying fitness ratios.
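The geometric series and its linear bound can be verified directly; a brief sketch:

```python
def fit_tree_switches(P, k, L, c):
    """Sum of P/(k c^(i-1)) over levels i = 1..L-1 (k x k switch blocks)."""
    return sum(P / (k * c ** (i - 1)) for i in range(1, L))

P, k, L, c = 256, 2, 8, 2.0
total = fit_tree_switches(P, k, L, c)
bound = P * c / (k * (c - 1))
print(total, bound)  # 254.0 < 256.0: switch count scales linearly with P
```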

In practice, it is possible to build hybrid topologies, where each node has more than k children, thus allowing the bandwidth to be reduced gradually. This allows fit-tree designers to trade off between cost savings and performance. In Section 4, we propose a hybrid optical/electrical interconnect solution that would allow fit-tree designs to be dynamically reconfigured to the requirements of the underlying application. We now examine the potential advantage of a fit-tree architecture for our evaluated set of scientific codes.

3.4 Fit-Tree Evaluation

In the previous section, we showed how fit-trees can significantly improve the cost and scalability of fat-trees, while preserving performance. The critical question is, therefore, determining the appropriate fitness ratio for a given computation. In this section, we investigate the fitness ratio requirements of our studied applications. For these experiments, we use the same experimental setup as in Section 3.2, and compute the fitness ratio of level i+1 as the ratio between the bandwidth utilization at level i+1 and i, respectively. To reduce the effect of outliers, we consider 4 to be the highest allowable fitness ratio.
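A minimal sketch of this measurement (assuming, as an interpretation, that the ratio is taken so that lower utilization at higher levels yields c > 1, clamped to the stated cap of 4):

```python
def fitness_ratios(utilization, cap=4.0):
    """Per-level fitness ratios from bandwidth utilization at levels 1..L."""
    ratios = []
    for lower, upper in zip(utilization, utilization[1:]):
        c = lower / upper if upper > 0 else cap  # sparser upper level -> c > 1
        ratios.append(min(max(c, 1.0), cap))     # clamp outliers into [1, 4]
    return ratios

print(fitness_ratios([1.0, 0.8, 0.5, 0.2, 0.05]))  # approx. [1.25, 1.6, 2.5, 4.0]
```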

Table 4a presents the fitness ratios of each examined application. For clarity of presentation, we only show the minimum, average, maximum, and median across all fit-tree levels. Results show that (as expected) fitness ratios are higher for applications with sparse communication: GTC, BB3D, LBCFD, MADbench, Cactus, and SuperLU. Note that these applications are known to exhibit better scalability compared to communication-intensive computations such as PARATEC and PMEMD. However, it is remarkable that even PARATEC, which requires global 3D FFTs, has a fitness ratio of 1.18 at its top level.

Table 4b presents fitness ratios for each level of the fit-tree across all studied applications. Again, for clarity of presentation, we only display the minimum, average, maximum, and median values. Results show that, while the fitness ratios are low at the lowest levels, they increase with increasing fit-tree levels. This is expected, as the number of nodes rooted at a node is doubled at each level of the fat-tree, creating room for locality where the percentage of local communication increases.

Based on Table 4, it is difficult to decide on a single "ideal" fitness ratio, but the data show strong quantitative support for the fit-tree concept. After all, even the minimum fitness ratio at level six is 1.15. It is worth repeating that our main motivation is interconnect designs for next-generation petascale and exascale systems, which are expected to have hundreds of thousands of processors. Therefore, even a fitness ratio of 1.15 will translate to enormous savings in costs and improvements in performance, as displayed in Fig. 6. The potential savings in switch ports versus a fat-tree for our examined applications are shown in Fig. 5b. Even for the moderate concurrency levels explored here, the hybrid fit-tree approach can reduce the port count requirements by up to 44 percent (on average).

Given the potential gains of the fit-tree methodology, we now propose a hardware solution for dynamically constructing fit-trees appropriate for each application, thus building the best-performing interconnect at the lowest cost in terms of switch ports.

TABLE 4: Fitness Ratios for (a) Each Application across All Levels and (b) Each Level across All Applications

4 RECONFIGURABLE INTERCONNECT ARCHITECTURE

In the previous section, we demonstrated the tremendous savings that can be achieved by designing interconnects with application requirements taken into account. In particular, we showed that fit-trees can be as effective as fat-trees, but with linearly scaling costs. The caveat, however, is that the communication requirements of applications can vary significantly. Thus, guaranteeing high performance for all code classes requires designing each part of the interconnect for the maximum demand among all target applications, resulting in overprovisioning and degraded efficiency. The remedy is reconfigurability, which would allow us to build an interconnect dynamically for each application, achieving maximum performance with minimum resources.

In this section, we present a methodology for dynamically creating interconnects. This approach would allow us to build fit-trees with variable fitness ratios, as well as arbitrary network configurations. Our proposed technology, called HFAST, uses passive/circuit switches to dynamically provision active/packet switch blocks, allowing the customization of interconnect resources for application-specific requirements. Before describing the details of our infrastructure, we first motivate the discussion by looking at recent trends in the high-speed wide-area networking community, which has developed cost-effective solutions to similar challenges.

4.1 Circuit Switch Technology

Packet switches, such as Ethernet, Infiniband, and Myrinet, are the most commonly used interconnect technology for large-scale parallel computing platforms. A packet switch must read the header of each incoming packet in order to determine on which port to send the outgoing message. As bit rates increase, it becomes increasingly difficult and expensive to make switching decisions at line rate. Most modern switches depend on ASICs or some other form of semicustom logic to keep up with cutting-edge data rates. Fiber-optic links have become increasingly popular for cluster interconnects because they can achieve higher data rates and lower bit-error rates over long cables than is possible using low-voltage differential signaling over copper wire. However, optical links require a transceiver that converts the optical signal to an electrical one so the silicon circuits can perform their switching decisions. These Optical-Electrical-Optical (OEO) conversions further add to the cost and power consumption of switches. Fully optical switches that do not require an OEO conversion can eliminate the costly transceivers, but per-port costs will likely be higher than for an OEO switch due to the need for exotic optical materials in the implementation [19].

Circuit switches, in contrast, create hard circuits between end points in response to an external control plane, much like an old telephone operator's patch panel, obviating the need to make switching decisions at line speed. As such, they have considerably lower complexity and, consequently, lower cost per port. For optical interconnects, microelectromechanical-mirror (MEMS) based optical circuit switches offer considerable power and cost savings, as they do not require the expensive (and power-hungry) optical/electrical transceivers required by active packet switches. Also, because nonregenerative circuit switches create hard circuits instead of dynamically routed virtual circuits, they contribute almost no latency to the switching path aside from propagation delay. MEMS-based optical switches, such as those produced by Lucent, Calient, and Glimmerglass, are common in the telecommunications industry, and their prices are dropping rapidly as the market for the technology grows larger and more competitive.

4.2 Related Work

Circuit switches have long been recognized as a cost-effective alternative to packet switches, but it has proven difficult to exploit the technology for cluster interconnects because the switches do not understand message or packet boundaries. It takes on the order of milliseconds to reconfigure an optical path through the switch, and one must be certain that no message traffic is propagating through the light path when the reconfiguration occurs. In comparison, a packet-switched network can trivially multiplex and demultiplex messages destined for multiple hosts without requiring any configuration changes.

The most straightforward approach is to completely eliminate the packet switch and rely entirely on a circuit switch. A number of projects, including the OptIPuter [10] transcontinental optically interconnected cluster, use this approach for at least one of their switch planes. The OptIPuter nodes use Glimmerglass MEMS-based optical circuit switches to interconnect components of the local cluster, as well as to form the transcontinental light paths that connect the University of Illinois half of the cluster to the UC San Diego half. One problem that arises with this approach is multiplexing messages that arrive simultaneously from different sources. Given that the circuit switch does not respect packet boundaries and that switch reconfiguration latencies are on the order of milliseconds, either the message traffic must be carefully coordinated with the switch state, or multiple communication cards must be employed per node so that the node's backplane effectively becomes the message multiplexor; the OptIPuter cluster uses a combination of these two techniques. The single-adapter approach leads to impractical message-coordination requirements in order to avoid switch reconfiguration latency penalties, whereas the multiadapter approach suffers from increased component costs due to the larger number of network adapters per host and the larger number of ports required in the circuit switch.

One proposed solution, the Interconnection Cached Network (ICN) [12], recognizes the essential role that packet switches play in multiplexing messages from multiple sources at line rate. The ICN consists of processing elements that are organized into blocks of size k, which are interconnected with small crossbars capable of switching individual messages at line rate (much like a packet switch). These k-blocks are then organized into a larger system via a circuit switch with k x N_blocks ports. The ICN can embed communication graphs that have a consistently bounded TDC (topological degree of communication) of less than k. The jobs must be scheduled in such a way that the bounded contraction of the communication topology (that is, the topological degree of every subset of vertices) is less than k. This is an NP-complete problem for general graphs when k > 2, although such contractions can be found algorithmically for regular topologies like meshes, hypercubes, and trees. If the communication topology has nodes with degree greater than k, some of the messages will need to take more than one path over the circuit switch and, therefore, share a path with other message traffic. Consequently, the bandwidth along that path is reduced if more than one message must contend for the same link on the network. Job placement also plays a role in finding an optimal graph embedding. Runtime reconfiguration of the communication topology on an ICN may require task migration in order to maintain an optimal embedding for the communication graph. The HFAST approach detailed in this work has no such restriction to regular topologies and needs no task migration.
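To make the degree bound concrete, the following sketch (our own illustration; the full bounded-contraction test is NP-complete for k > 2, so this checks only the necessary per-node condition) computes each process's TDC from a communication matrix and tests whether crossbar blocks of size k could host the job without multipath routes:

def tdc(comm):
    """comm[i][j] is True if process i communicates with process j.
    Returns the topological degree of communication of each process."""
    return [sum(1 for j, sends in enumerate(row) if sends and j != i)
            for i, row in enumerate(comm)]

def fits_in_blocks(comm, k):
    """Necessary (not sufficient) condition for an ICN-style
    embedding: every node's TDC must be below the block size k."""
    return all(d < k for d in tdc(comm))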

Adaptive routing (AR) offers an alternative approach to reducing link contention in low-degree interconnects. However, the additional logic required for AR greatly increases hardware complexity to achieve the same goal as the HFAST approach. HFAST reduces interconnect link contention by reconfiguring the wiring using simpler circuit switches, whereas adaptive routing makes contention-avoiding decisions on a packet-by-packet basis. In our study, we have used a broad array of HPC applications to demonstrate that routing decisions made on a longer timescale, which is amenable to circuit switch reconfiguration times, offer an efficient approach to reducing hot spots in a lower degree interconnect. Overall, HFAST offers lower design complexity and, hence, a more cost-effective approach to achieving the same hot-spot avoidance capabilities as AR.

Finally, there are a number of hybrid approaches that use combined packet/circuit switch blocks. Here, each switching unit consists of a low-bandwidth dynamically routed network that is used to carry smaller messages and coordinate the switch states for a high-bandwidth circuit-switched network that follows the same physical path. Examples include Gemini [8] and Sun Microsystems' Clint [11]. Each of these uses the low-bandwidth packet-switched network to set up a path for large-payload bulk traffic through the circuit switch hierarchy. While the circuit switch path is unaware of packet boundaries, the lower speed packet network is fast enough to mediate potential conflicts along the circuit path. This overcomes the problems with coordinating message traffic for switch reconfiguration exhibited by the purely circuit-switched approach. While promising, this architecture suffers from the need to use custom-designed switch components for a very special-purpose use. In the short term, such a specialized switch architecture will have difficulty reaching a production volume that can amortize the initial development and manufacturing costs. Our target is to make use of readily available commodity components in the design of our interconnect in order to keep costs under control.

4.3 Hybrid Flexibly Assignable Switch Topology

We propose HFAST as a solution for overcoming the obstacles outlined in the previous subsection, using (Layer-1) passive/circuit switches to dynamically provision (Layer-2) active/packet switch blocks at runtime. This arrangement leverages the less-expensive circuit switches to connect processing elements together into optimal communication topologies using far fewer packet switches than would be required for an equivalent fat-tree network composed of packet switches. For instance, packet switch blocks can be arranged in a single-level hierarchy when provisioned by the circuit switches to implement a simpler topology like a 3D torus, whereas a fat-tree implementation would require traversal of many layers of packet switches for larger systems, contributing latency at each layer of the switching hierarchy. Therefore, this hybrid interconnection fabric can reduce fabric latency by reducing the number of packet switch blocks that must be traversed by a worst-case message route.
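The provisioning step can be sketched as a simple greedy heuristic (purely illustrative; the paper does not prescribe this algorithm, and the names are ours): pack heavily communicating processes into the same packet switch block so that the circuit switch only has to wire up the residual inter-block traffic:

def greedy_blocks(traffic, k):
    """traffic[(i, j)] = bytes exchanged between processes i and j.
    Returns a list of blocks (lists of process ids), each of size <= k."""
    procs = sorted({p for pair in traffic for p in pair})
    placed, blocks = set(), []
    for p in procs:
        if p in placed:
            continue
        # Seed a block with p, then fill it with p's heaviest
        # still-unplaced communication partners.
        block = [p]
        placed.add(p)
        partners = sorted(
            (q for q in procs if q not in placed),
            key=lambda q: -(traffic.get((p, q), 0) + traffic.get((q, p), 0)))
        for q in partners[:k - 1]:
            block.append(q)
            placed.add(q)
        blocks.append(block)
    return blocks

# Example: two heavy pairs end up in separate blocks of size 2.
print(greedy_blocks({(0, 1): 100, (2, 3): 80, (0, 2): 1}, k=2))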

Using less-expensive circuit switches, one can emulate many different interconnect topologies that would otherwise require fat-tree networks. The topology can be incrementally adjusted to match the communication topology requirements of a code at runtime. Initially, the circuit switches can be used to provision densely packed 3D mesh communication topologies for processes. However, as data about messaging patterns are accumulated, the topology can be adjusted at discrete synchronization points to better match the measured communication requirements and, thereby, dynamically optimize code performance. MPI topology directives can be used to speed the runtime topology optimization process. There are also considerable research opportunities available for studying compile-time instrumentation of codes to infer communication topology requirements at compile time. In particular, languages like UPC offer a high-level approach for exposing communication requirements at compile time. Similarly, the compiler can automatically insert the necessary synchronization points that allow the circuit switches time to reconfigure, since the Layer-1 switches do not otherwise respect packet boundaries for in-flight messages.
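For example, an application could expose its communication graph to the runtime through MPI's distributed graph topology interface; the sketch below uses mpi4py with a placeholder ring pattern (the neighbor lists and the reorder choice are illustrative assumptions, not a prescription from the paper):

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Placeholder pattern: each rank exchanges with its two ring neighbors.
neighbors = [(rank - 1) % size, (rank + 1) % size]

# reorder=True permits the MPI layer (or an HFAST-style layer beneath
# it) to remap ranks onto the topology it actually provisions.
graph_comm = comm.Create_dist_graph_adjacent(
    sources=neighbors, destinations=neighbors, reorder=True)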

HFAST differs from the bounded-degree ICN approach in that the fully connected passive circuit switch is placed between the nodes and the active (packet) switches. This supports a more flexible formation of communication topologies without any job placement requirements. Codes that exhibit a nonuniform degree of communication (e.g., only one or a few processes must communicate with a large number of neighbors) can be supported by assigning additional packet switching resources to the processes with greater communication demands. Unlike the ICN and OptIPuter, HFAST is able to treat the packet switches as a flexibly assignable pool of resources. In a sense, our approach is precisely the inverse of the ICN: the processors are connected to the packet switch via the circuit switch, whereas the ICN connects processors to the circuit switch via an intervening packet switch.

Fig. 8 shows the general HFAST interconnection between the nodes, the circuit switch, and the active switch blocks. The diagram in Fig. 8b shows an example with six nodes and active switch blocks of size 4. In this example, node 1 can communicate with node 2 by sending a message through the circuit switch (red) into switch block 1 (SB1), and back again through the circuit switch (green) to node 2. This shows that the minimum message overhead requires crossing the circuit switch two times. If the TDC of node 1 is greater than the available degree of the active SB, multiple SBs can be connected together (via a myriad of interconnection options). For the example in Fig. 8, if node 1 were to communicate with node 6, the message would first arrive at SB1 (red), then be transferred to SB2 (blue), and finally be sent to node 6 (orange), thus requiring three traversals of the circuit switch crossbar and two active SB hops.

Fig. 8. (a) General layout of HFAST and (b) example configuration for six nodes and active switch blocks of size 4.
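The overhead in this example follows a simple pattern, sketched below (a toy model of Fig. 8, under the assumption that every route enters and leaves each switch block through the circuit switch):

def circuit_crossings(sb_hops):
    """sb_hops: number of active switch blocks on the route. Each
    block is entered and left via the circuit switch, so a route over
    h blocks crosses the circuit switch h + 1 times."""
    return sb_hops + 1

assert circuit_crossings(1) == 2  # node 1 -> SB1 -> node 2
assert circuit_crossings(2) == 3  # node 1 -> SB1 -> SB2 -> node 6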

The HFAST approach holds a clear advantage over statically built interconnects, since additional packet switch resources can be dynamically assigned to the subset of nodes with higher communication requirements. HFAST allows the effective utilization of interconnect resources for the specific requirements of the underlying scientific applications. This methodology can, therefore, satisfy the topological connectivity of applications categorized in cases i-iii (defined in Section 2.5). Additionally, HFAST could be used to dynamically create fit-trees with static or variable fitness ratios. Furthermore, because the circuit switches have allocated a network that matches the application, the network can avoid elaborate dynamic routing approaches that result in greater router complexity and slower routing speed. This approach avoids job fragmentation, since "migration" is essentially a circuit switch configuration that can be performed at a barrier in milliseconds. Finally, the HFAST strategy could even iteratively reconfigure the interconnect between communication phases of a dynamically adapting application [15]. Future work will continue to explore the potential of HFAST in the context of demanding scientific applications.

5 SUMMARY AND CONCLUSIONS

There is a crisis looming in parallel computing, driven by rapidly increasing concurrency and the nonlinear scaling of switch costs. It is, therefore, imperative to investigate interconnect alternatives to ensure that future HPC systems can cost-effectively support the communication requirements of ultrascale applications across a broad range of scientific disciplines. Before such an analysis can be undertaken, one must first understand the communication requirements of large-scale HPC applications, which are the ultimate drivers for future interconnect technologies.

To this end, we first presented one of the broadest studies to date of high-end communication requirements, across a broad spectrum of important scientific disciplines, whose computational methods include: finite difference, lattice Boltzmann, particle-in-cell, sparse linear algebra, particle-mesh Ewald, and FFT-based solvers. Analysis of these data shows that most applications do not utilize the full connectivity of traditional fully connected network implementations. Based on these observations, we proposed a novel network called the fit-tree architecture. Our work reveals that fit-trees can significantly improve the cost and scalability of fat-trees, while preserving performance, through reduced component count and lower wiring complexity. Finally, we proposed the HFAST infrastructure, which combines passive and active switch technology to create dynamically reconfigurable network topologies, and could be used to create custom-tailored fit-tree configurations for specific application requirements. This approach achieves the performance benefits of adaptive routing without the design complexity. Overall, our results point to a promising approach for ultrascale system interconnect design and analysis.

In future work, we plan to pursue two major thrusts. First, we will expand the scope of both the applications profiled and the data collected through IPM profiling. The low overhead of IPM profiling opens up the possibility of characterizing large and diverse application workloads. We will also pursue more detailed performance data collection, including the analysis of full chronological communication traces. Studying the time dependence of communication topologies could expose opportunities to reconfigure an HFAST interconnect within a dynamically evolving computation. Our studies will also apply to interconnect topologies and circuit provisioning for emerging chip multiprocessors (CMPs) that contain hundreds or thousands of cores per socket. The second thrust will continue our exploration of fit-tree solutions in the context of ultrascale scientific computations. This portion of the investigation will require comparisons with alternative approaches such as high-radix routers, as well as examination of the physical aspects of constructing reconfigurable fit-tree interconnects, including issues of packaging and cable layout.

REFERENCES

[1] Proc. US Dept. of Energy Office of Science Workshop Science Case for Large-Scale Simulation, D. Keyes, ed., June 2003.

[2] Proc. Workshop Roadmap for the Revitalization of High-End Computing, D.A. Reed, ed., June 2003.

[3] K. Asanovic et al., "The Landscape of Parallel Computing Research: A View from Berkeley," Technical Report UCB/EECS-2006-183, EECS Dept., Univ. of California, Berkeley, Dec. 2006.

[4] J. Balfour and W.J. Dally, "Design Tradeoffs for Tiled CMP On-Chip Networks," Proc. 20th Ann. Int'l Conf. Supercomputing (ICS '06), pp. 187-198, May 2006.

[5] J. Borrill, J. Carter, L. Oliker, D. Skinner, and R. Biswas, "Integrated Performance Monitoring of a Cosmology Application on Leading HEC Platforms," Proc. Int'l Conf. Parallel Processing (ICPP), 2005.

[6] Cactus Homepage, http://www.cactuscode.org, 2004.

[7] A. Canning, L.W. Wang, A. Williamson, and A. Zunger, "Parallel Empirical Pseudopotential Electronic Structure Calculations for Million Atom Systems," J. Computational Physics, vol. 160, pp. 29-41, 2000.

[8] R.D. Chamberlain, C.S. Baw, and M.A. Franklin, "Gemini: An Optical Interconnection Network for Parallel Processing," IEEE Trans. Parallel and Distributed Systems, vol. 13, no. 10, pp. 1038-1055, Oct. 2002.

[9] W.J. Dally and B. Towles, Principles and Practices of Interconnection Networks. Morgan Kaufmann, 2004.

[10] T. DeFanti, M. Brown, J. Leigh, O. Yu, E. He, J. Mambretti, D. Lillethun, and J. Weinberger, "Optical Switching Middleware for the OptIPuter," IEICE Trans. Fundamentals/Comm./Electronics/Information and System, Feb. 2003.

[11] H. Eberle and N. Gura, "Separated High-Bandwidth and Low-Latency Communication in the Cluster Interconnect Clint," Proc. IEEE Conf. Supercomputing, 2002.

[12] V. Gupta and E. Schenfeld, "Performance Analysis of a Synchronous, Circuit-Switched Interconnection Cached Network," Proc. Eighth Int'l Conf. Supercomputing (ICS '94), pp. 246-255, 1994.

[13] PMEMD Homepage, http://amber.scripps.edu/pmemd-get.html, 2007.

[14] IPM Homepage, http://www.nersc.gov/projects/ipm, 2005.

[15] S. Kamil, A. Pinar, D. Gunter, M. Lijewski, L. Oliker, and J. Shalf, "Reconfigurable Hybrid Interconnection for Static and Dynamic Scientific Applications," Proc. ACM Int'l Conf. Computing Frontiers, 2007.

[16] W.W. Lee, "Gyrokinetic Particle Simulation Model," J. Computational Physics, vol. 72, pp. 243-269, 1987.

[17] X.S. Li and J.W. Demmel, "SuperLU_DIST: A Scalable Distributed-Memory Sparse Direct Solver for Unsymmetric Linear Systems," ACM Trans. Math. Software, vol. 29, no. 2, pp. 110-140, June 2003.

[18] Z. Lin, S. Ethier, T.S. Hahm, and W.M. Tang, "Size Scaling of Turbulent Transport in Magnetically Confined Plasmas," Physical Rev. Letters, vol. 88, 2002.

[19] R. Luijten, C. Minkenberg, R. Hemenway, M. Sauer, and R. Grzybowski, "Viable Opto-Electronic HPC Interconnect Fabrics," Proc. High Performance Computing, Networking, and Storage Conf. (SC), 2005.

[20] L. Oliker et al., "Scientific Application Performance on Candidate Petascale Platforms," Proc. IEEE Int'l Parallel and Distributed Processing Symp. (IPDPS), Mar. 2007.

[21] L.-S. Peh and W.J. Dally, "A Delay Model for Router Microarchitectures," IEEE Micro, vol. 21, no. 1, pp. 26-34, Jan./Feb. 2001.

[22] J. Qiang, M. Furman, and R. Ryne, "A Parallel Particle-in-Cell Model for Beam-Beam Interactions in High Energy Ring Colliders," J. Computational Physics, vol. 198, pp. 278-294, 2004.

[23] H.D. Simon, W.T. Kramer, W. Saphir, J. Shalf, D.H. Bailey, L. Oliker, M. Banda, C.W. McCurdy, J. Hules, A. Canning, M. Day, P. Colella, D. Serafini, M.F. Wehner, and P. Nugent, "Science-Driven System Architecture: A New Process for Leadership Class Computing," J. Earth Simulator, vol. 2, pp. 1-9, Jan. 2005.

[24] Top 500 Supercomputer Sites, http://www.top500.org, 2005.

[25] J.S. Vetter and F. Mueller, "Communication Characteristics of Large-Scale Scientific Applications for Contemporary Cluster Architectures," Proc. 16th Int'l Parallel and Distributed Processing Symp. (IPDPS), 2002.

[26] J.S. Vetter and A. Yoo, "An Empirical Performance Evaluation of Scalable Scientific Applications," Proc. IEEE Conf. Supercomputing, 2002.

[27] L.W. Wang, "A Survey of Codes and Algorithms Used in NERSC Materials Science Allocations," LBNL Technical Report LBNL-16691, 2006.

Shoaib Kamil is a PhD candidate in computer science at the University of California, Berkeley, and works with the Future Technologies Group at Lawrence Berkeley National Laboratory, advised by Dr. Katherine Yelick. His research interests focus on auto-tuning of structured grid applications for multicore systems, interconnect research and optimization, as well as power-efficient computing research in both software and hardware.

Leonid Oliker received bachelor's degrees in computer engineering and finance from the University of Pennsylvania, and performed both his doctoral and postdoctoral work at NASA Ames Research Center. He is a computer scientist in the Future Technologies Group at Lawrence Berkeley National Laboratory. Lenny has coauthored more than 60 technical articles, four of which received best paper awards. His research interests include HPC evaluation, multicore auto-tuning, and power-efficient computing.

Ali Pinar received the BS and MS degrees in computer engineering from Bilkent University, Turkey, in 1994 and 1996, respectively, and the PhD degree in computer science from the University of Illinois at Urbana-Champaign in 2001, with the option in computational science and engineering. He is a member of the Information and Decision Sciences Department at Sandia National Laboratories. Before joining Sandia, he was with the High-Performance Computing Research Department at Lawrence Berkeley National Laboratory. His research is on combinatorial problems arising in algorithms and applications of scientific computing, with emphasis on parallel computing, sparse matrix computations, computational problems in electric power systems, graph data mining, and interconnection network design. He is a member of SIAM and its activity groups in supercomputing, computational science and engineering, and optimization; the IEEE Computer Society; and the ACM. He was also elected to serve as the secretary of the SIAM activity group in supercomputing for the 2008-2009 term.

John Shalf's background is in electrical engineering: he spent time in graduate school at Virginia Tech working on a C compiler for the SPLASH-2 FPGA-based computing system, and at Spatial Positioning Systems, Inc. (now ArcSecond) he worked on embedded computer systems. He first got started in HPC at the National Center for Supercomputing Applications (NCSA) in 1994, where he worked on a number of HPC application codes. While working for the General Relativity Group at the Albert Einstein Institute in Potsdam, Germany, he helped develop the first implementation of the Cactus Computational Toolkit, which is used for numerical solutions to Einstein's equations for General Relativity and which enables modeling of black holes, neutron stars, and boson stars. He won an R&D 100 award in 2001 for the RAGE robot and is currently the group leader for the Advanced Technologies Group at the National Energy Research Supercomputing Center (NERSC). He is a member of the IEEE Computer Society.

