
Transcript of: Topology Mapping for Blue Gene/L Supercomputer (sc06.supercomputing.org/schedule/pdf/pap273.pdf)


Topology Mapping for Blue Gene/L Supercomputer

Hao Yu I-Hsin Chung Jose Moreira

IBM Thomas J. Watson Research Center
Yorktown Heights, NY 10598-0218
{yuh,ihchung,jmoreira}@us.ibm.com

Abstract

Mapping virtual processes onto physical processors is one of the most important issues in parallel computing. The problem of mapping processes/tasks onto processors is equivalent to the graph embedding problem, which has been studied extensively. Although many techniques have been proposed for embeddings of two-dimensional grids, hypercubes, etc., there have been few efforts on embeddings of three-dimensional grids and tori. Motivated by the need for better task-mapping support on the Blue Gene/L supercomputer, in this paper we present embedding and integration techniques for three-dimensional grids and tori. The topology mapping library based on these techniques generates high-quality embeddings of two- and three-dimensional grids and tori. In addition, the library is used in the BG/L MPI library for scalable support of the MPI topology functions. Through extensive empirical studies on large-scale systems with popular benchmarks and real applications, we demonstrate that the library can significantly improve the communication performance and the scalability of applications.

1 Introduction

Mapping the tasks of a parallel application onto the physical processors of a parallel system is one of the most essential issues in parallel computing. It is critical for today's supercomputing systems to deliver sustainable and scalable performance.

The problem of mapping an application's task topology onto the underlying hardware's physical topology can be formalized as a graph embedding problem [3], which has been studied extensively in the past. For the sake of intuition, in this paper we use the terms embedding and mapping interchangeably. In general, an embedding of a guest graph G = (V_G, E_G) into a host graph H = (V_H, E_H) is a one-to-one mapping φ from V_G to V_H. The quality of an embedding is usually measured by two cost functions (parameters): dilation and expansion. The dilation of an edge (u,v) ∈ E_G is the length of a shortest path in H that connects φ(u) and φ(v). The dilation of an embedding is the maximum dilation over all edges in E_G. The expansion of an embedding is simply |V_H|/|V_G|. Intuitively, the dilation of an embedding measures the worst-case stretching of edges, and the expansion measures the relative size of the guest graph.
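For concreteness, the following is a minimal sketch (not code from the paper) of how these two measures can be computed for a given mapping φ, assuming the host is a 3D torus as in the BG/L setting discussed later, so that the shortest-path length is the wrap-aware Manhattan distance:

```c
/* Sketch: dilation = max over guest edges of the shortest-path length in the
 * host; expansion = |V_H| / |V_G|.  The host is assumed to be a 3D torus. */
#include <stdlib.h>

typedef struct { int x, y, z; } coord3;

static int ring_dist(int a, int b, int n) {      /* shortest distance on a ring of size n */
    int d = abs(a - b);
    return d < n - d ? d : n - d;
}

static int torus_dist(coord3 p, coord3 q, coord3 dims) {
    return ring_dist(p.x, q.x, dims.x) + ring_dist(p.y, q.y, dims.y)
         + ring_dist(p.z, q.z, dims.z);
}

/* phi[v] gives the host coordinates of guest node v; edges[i][0..1] list E_G. */
int dilation(const coord3 *phi, const int (*edges)[2], int n_edges, coord3 dims)
{
    int max = 0;
    for (int i = 0; i < n_edges; i++) {
        int d = torus_dist(phi[edges[i][0]], phi[edges[i][1]], dims);
        if (d > max) max = d;
    }
    return max;
}

double expansion(int n_host, int n_guest) { return (double)n_host / n_guest; }
```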

Because graph embedding problems are NP-hard, researchers have focused on developing heuristics. In the past two decades, a large number of graph embedding techniques have been developed to solve problems arising from VLSI circuit layout and from process mapping in parallel computing. Specifically, for optimizing the VLSI design of highly eccentric circuits, techniques that embed rectangular two-dimensional grids into square two-dimensional grids [3; 17; 10] were proposed and significant results were obtained.

In the parallel computing domain, techniques were developed to map processes onto various high-performance interconnects such as hierarchical networks, crossbar-based networks, hypercube networks, switch-based interconnects, and multi-dimensional grids [24; 14; 8; 18; 16; 15; 12; 20]. In this paper, we concentrate on the problem of embedding for three-dimensional grids and tori. Our research is motivated by providing better support for task mapping on the Blue Gene/L (BG/L) supercomputer.

The BG/L supercomputer is a massively parallel system developed by IBM in partnership with Lawrence Livermore National Laboratory (LLNL). BG/L uses system-on-a-chip integration [4] and a highly scalable architecture [2] to assemble machines with up to 65,536 dual-processor compute nodes. The primary communication network of BG/L is a three-dimensional torus. To provide system-level topology mapping support for BG/L, we need effective and scalable embedding techniques that map an application's virtual topologies onto three-dimensional tori or their two-dimensional sub-topologies.

Although there is a large amount of work on embeddings for various topologies, we found relatively few techniques for efficient and scalable embeddings among three-dimensional grids/tori. In particular, although techniques based on graph partitioning and/or searching can find fairly good embeddings for different topologies, they are hard to parallelize and do not scale well [12; 20; 6]. Moreover, except for the work on hypercube embeddings [8], most techniques proposed for task mapping on parallel systems have neglected the progress in the area of embeddings of two-dimensional grids. Overall, existing techniques are either not suitable for BG/L or cover only very limited cases.

In this paper, we describe existing and newly developed grid embedding techniques, together with integration techniques, to generate efficient mappings of parallel processes/tasks onto up-to-three-dimensional physical topologies, which indirectly optimizes the nearest-neighbor communications of parallel programs. The embedding techniques we used and explored take constant time in parallel and are therefore scalable. Our integration techniques cover rather general cases (e.g., not limited to grids/tori with dimension sizes that are powers of 2) with small dilation costs. The topology mapping library based on these techniques has been integrated into the BG/L MPI library to support the MPI Cartesian topology [13]. Finally, we present comprehensive experiments with popular parallel benchmarks and real applications. With the help of MPI tracing tools, our empirical results demonstrate significant performance improvements for point-to-point communication when using the process-to-processor mappings generated by our library.

This paper makes the following contributions:

- We present the design and integration of a comprehensive topology mapping library for scalable three-dimensional grid/torus embeddings. Besides an intensive exploration of the latest developments in the area of grid embedding, we describe extensions for embeddings of three-dimensional grids/tori.

- Our topology mapping library provides efficient support for the MPI virtual topology functions. The computation of the MPI virtual topology on each processor takes constant time, and therefore the mapping process is scalable.

- Quantitatively, we demonstrate that our topology mapping library is effective at improving the communication performance of parallel applications running on a large number of processors (our experiments ran on up to 4096 processors).

The rest of the paper is organized as follows. Sec. 2 gives a brief explanation of embedding techniques for up-to-two-dimensional grids; based on the existing techniques, we present a set of embedding operations and corresponding predicates for their selection. Sec. 3 presents embedding operations for 3D grids/tori; we further present the procedure for integrating the various embedding techniques into a general and powerful library. Sec. 4 describes some issues in the support of topology mapping for BG/L systems. Sec. 5 presents an extensive empirical study to demonstrate the effectiveness of our topology mapping library. Finally, Sec. 6 discusses related work and Sec. 7 concludes the paper.

2 Basic Grid Embeddings

In this section, we describe techniques for embedding 1D or 2D guest Cartesian topologies into 2D host Cartesian topologies (grids or tori). We describe the operations and predicates we defined for their integration. In the next section, we describe how we utilize these operations for embeddings of 3D Cartesian topologies.

2.1 1D Embeddings of Rings

For the case of embedding a ring into a 1D mesh (a.k.a. a line), [22] described a method that embeds the first half of the nodes of the guest ring into the host line in the same numbering direction, with each edge dilated to 2. It then maps the second half of the nodes of the guest ring onto the host line in the reverse numbering direction, still with edges dilated to 2. In this paper, we refer to this method as ring-wrapping. Figure 1(a) shows an example of embedding a ring of size 7 into a line of size 8 using ring-wrapping.
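A small sketch of this idea (an assumed construction consistent with the description above, not the library's code) lays the first half of the ring on the even line positions and the returning half on the odd positions:

```c
/* Ring-wrapping sketch: ring node i (0..n-1) -> position on a line of >= n
 * nodes.  Every ring edge spans at most 2 line links.  For n = 7 this
 * reproduces the layout of Figure 1(a). */
int ring_wrap(int i, int n)
{
    int half = (n + 1) / 2;
    return (i < half) ? 2 * i                 /* outbound pass: 0, 2, 4, ... */
                      : 2 * (n - 1 - i) + 1;  /* return pass:  ..., 5, 3, 1  */
}
```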

When the given host graph is a ring of slightly greater size than the guest ring, a method referred to as ring-scattering in this paper can be used [15]. It simply stretches some edges of the guest ring and maps the nodes onto the host ring so that the ring links of the host graph can be utilized. Figure 1(b) shows an example of embedding a ring of size 5 into a ring of size 8 using ring-scattering. However, when the size of the host ring is more than twice that of the guest ring, the dilation of ring-scattering exceeds 2; in that case, ring-wrapping gives a better embedding.
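A correspondingly simple sketch of ring-scattering (again an assumed construction) spreads the guest nodes evenly around the host ring:

```c
/* Ring-scattering sketch: guest ring node i (0..n-1) -> host ring node,
 * n <= m <= 2n.  Every guest edge, including the wrap-around edge, then
 * spans at most 2 host links.  For n=5, m=8 this gives 0,1,3,4,6 (Fig. 1(b)). */
int ring_scatter(int i, int n, int m)
{
    return (int)((long)i * m / n);
}
```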

In our later development of 3D embeddings, we often switch between these two basic ring embedding methods. In addition, because ring-scattering introduces an expansion greater than 1 while the expansion of ring-wrapping is always 1, we found it convenient to use ring-wrapping most of the time.

2.2 Embeddings of Lines and Rings into Higher-Dimensional Topologies

The methods for embedding lines (a.k.a. pipes in [22]) and rings into 2D or 3D topologies are fairly intuitive. They are similar to performing a naive space filling, i.e., folding (a.k.a. wrapping) the line in a 2D or 3D space without stretching any of its edges. Figure 1(c) and (d) show two examples of folding a line onto 2D and 3D grids. Because the dilation factor of these embeddings is 1, the methods for embedding lines into 2D grids and into 2D tori are essentially the same.
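One plausible form of such a dilation-1 fold is boustrophedon ("snake") order, sketched below for the 2D case; the 3D case stacks such planes the same way. This is an illustration of the folding idea, not the paper's exact projection function:

```c
/* Fold a line into an nx-wide 2D grid with dilation 1: consecutive line
 * nodes stay adjacent because every other row is traversed in reverse. */
void line_to_grid2d(int i, int nx, int *x, int *y)
{
    *y = i / nx;                                   /* which row             */
    *x = (*y % 2 == 0) ? i % nx : nx - 1 - i % nx; /* reverse every odd row */
}
```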

While the embedding of a line into a 2D or 3D grid can always achieve dilation factor 1, the embedding of a ring into a 2D or 3D grid may not. Specifically, when the size of the guest ring is an odd number, exactly one guest edge is dilated to 2 hops [22]; Figure 1(e) shows an example of this case.


[Figure 1: Existing Methods for Grid Embedding. (a) Ring-wrapping; (b) Ring-scattering; (c) Embed a line into a 2D grid; (d) Embed a line into a 3D grid (axis order X, Y, Z); (e) Embed a ring into a 2D grid; (f) Fold a 12x3 grid into a 6x6 grid; (g) 90-degree turn.]

We find that the basic technique of folding a ring is useful because 2D rectangular grids with very small aspect ratios (aspect ratio = length of the shorter side / length of the longer side) and 3D grids with one dimension much greater than the others can be treated as a 1D line or ring. We discuss this further in the next section. In addition, when embedding a ring into 2D/3D tori, the torus edges of the host grid are not needed.

2.3 Fold Embeddings among 2D Topologies

[Figure 2: Embed 3×36t grid to 12×12 grid]

A well-known method of embedding 2D grids with a small aspect ratio into grids with a large aspect ratio is folding, first introduced in [3]. Figure 1(f) shows an example of embedding a 12×3 rectangular grid into a 6×6 square grid. Under the folding process, a row of the guest grid (e.g. row 2) goes along row 2 until it reaches (2,5). Then row 2 follows a diagonal-vertical-diagonal folding maneuver. Note that the diagonal edges in the figure do not exist in the host grid and need to be replaced with a pair of edges in the X and Y directions. As a result, the maneuver keeps the dilation cost at most 2.

The key of the method is the embedding scheme for performing the 180-degree turn (dark nodes in the figure) with dilation cost at most 2. Recently, a method for performing a 90-degree turn with dilation cost at most 2 was introduced in [11] (depicted in Figure 1(g)). By integrating the 180-degree turn and the 90-degree turn, a 2D torus with a small aspect ratio can be folded onto a 2D grid with a large aspect ratio in a similar fashion to the ring folding introduced above. Figure 2 gives an intuitive example of such an integration.

Note that, to facilitate a massively parallel implementation of the embeddings, we have inverted the projection functions of the 90-degree turn and the 180-degree turn. Due to space limitations, we do not include the projection functions in this paper.

2.4 Matrix-Based Embedding

Matrix-based embeddings [21] are widely used to embed among 2D grids with relatively close aspect ratios. The example given in Figure 3 shows an embedding and the corresponding embedding matrix for mapping a 2x32 grid into a 5x13 grid. In the example, the dilation of the embedding (3) is very close to the average dilation across all the edges of the guest grid (2.5), which implies that most of the guest edges are stretched. On the other hand, the average dilation cost of a folding-based embedding reaches 2 only in the areas where the turn maneuvers are applied; the rest of the embedding has an average dilation cost of 1. For this reason, our embedding selection process presented in the later sections gives higher priority to folding-based methods than to matrix-based methods. That is, we first try folding-based methods when certain criteria are satisfied; if the folding-based method fails to obtain an embedding, we fall back to the matrix-based method.

[Figure 3: Embed 2×31t grid to 5×13 grid. Embedding matrix:
  3 2 3 2 2 3 2 3 2 2 3 2 2
  2 3 2 3 2 2 3 2 3 2 2 3 2 ]

Nevertheless, the matrix-based method is very general and can efficiently embed a 2D guest grid (with dimensions A×B) into its ideal sub-grid of the host grid. Here the ideal grid is defined as X×Y′ for a given host grid with dimensions X×Y, where Y′ = ⌈A×B/X⌉. Therefore, for embeddings among 2D grids, we use it as the default method. In addition, when the guest and host grids are both tori, and the relative orientation of the guest graph is maintained, the matrix-based embedding can utilize the torus links of the host grid.

The core of the matrix-based embedding is the generation of the matrix, which is largely sequential because of the dependences between contiguous cells of the matrix. Although the methods described in [21] enable parallel generation of the matrix, the generated matrix computes the coordinates in the host grid from the coordinates in the guest grid. What we need is the inverse function, i.e., each physical processor computes its logical coordinates. Following a procedure similar to that described in [21], we have derived the inverse functions to enable scalable computation of the embedding in the MPI virtual topology library.

2.5 Summary on Embeddings into 2D Topologies

We have derived a number of embedding operators for embedding 1D or 2D Cartesian topologies into 2D grids or tori. The operations we defined, and later use in 3D embeddings, are summarized along with brief descriptions in Table 1. Many of these operations are directly adopted from existing techniques, while others are extended or newly developed from existing ones to support scalable parallel embeddings.

First, in the context of this paper, we use the following terminology:

- G represents the guest grid/torus.

- H represents the host grid/torus.

- |G| and |H| represent the sizes of G and H.

- A, B, and C are the sorted dimensions of the guest grid/torus, which satisfy A ≤ B ≤ C.

- A′, B′, or C′ represent one dimension of the guest grid/torus. When used together, they represent distinct dimensions of the guest grid/torus.

- X, Y, and Z are the sorted dimensions of the host grid/torus, which satisfy X ≤ Y ≤ Z.

- X′, Y′, or Z′ represent one dimension of the host grid/torus. When used together, they represent distinct dimensions of the host grid/torus.

- AB, A′B′, XY, or X′Y′ are occasionally used to represent a 2D grid/torus or a 2D sub-topology of a higher-dimensional grid/torus, where the meaning associated with ′ in A′, B′, X′, and Y′ is consistent with the above definitions.

- A×B, A′×B′, X×Y, or X′×Y′ represent the sizes of the corresponding grids/tori or sub-topologies.

Given that each of the above embedding methods for 2D grids works well for certain cases, an important step in integrating them into an effective library is to define the conditions and the procedure for their application. Figure 4 specifies the procedure we use. We specify six steps, executed in sequential order, for embedding into 2D grids/tori. Due to space restrictions, sub-steps such as matrix-based embeddings into tori are omitted. Note that because the procedure is executed in sequential order, the conditions of later steps imply that the previous steps were not performed or did not return a valid embedding. While most conditions or predicates are straightforward, some deserve explanation.

In step 3, because RFold2D needs at least four turns so that the start and end points along the B dimension (the larger dimension of the guest grid) meet, the condition A < B/4 is necessary to embed at least the four 90-degree turns. On the other hand, for MFold2D, since at least one 180-degree turn is needed, the condition A < B/2 is specified. Steps 4 and 5 essentially apply the Compress() operation to cover the general cases of 2D grid embedding.
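To make the selection logic concrete, the following C sketch transcribes the conditions of Figure 4 (reproduced later in this transcript). The enum names and the torus-link flag are illustrative rather than the library's API, and the ratio test B/Y ≤ X/A is rewritten as A·B ≤ X·Y to avoid integer-division ambiguity:

```c
/* Sketch of the 2D method-selection procedure of Figure 4; A <= B and
 * X <= Y are assumed to be pre-sorted.  "try" methods may still fail and
 * fall through to later steps in the real library. */
typedef enum { SIT2D_BX, SIT2D_AX, RFOLD2D, MFOLD2D,
               COMPRESS_BY, STRETCH_AX, COMPRESS_AX_OR_STRETCH_BY,
               NO_METHOD } method2d_t;

method2d_t select_2d_method(int A, int B, int X, int Y, int b_has_torus_link)
{
    if (A <= B && B <= X && X <= Y) return SIT2D_BX;                /* step 1 */
    if (A <= X && X <  B && B <= Y) return SIT2D_AX;                /* step 2 */
    if (A <= X && X <= Y && Y <= B &&                               /* step 3 */
        (2*A <= X || 2*A <= Y) && 2*Y <= B) {
        if (b_has_torus_link  && 4*A < B) return RFOLD2D;
        if (!b_has_torus_link && 2*A < B) return MFOLD2D;
    }
    if (A <= X && X <= Y && Y <= B)                                 /* step 4 */
        return (A*B <= X*Y) ? COMPRESS_BY : STRETCH_AX;  /* B/Y <= X/A <=> A*B <= X*Y */
    if (X < A && A <= B && B < Y)                                   /* step 5 */
        return COMPRESS_AX_OR_STRETCH_BY;
    return NO_METHOD;
}
```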

3 Embeddings of 3D Cartesian Topologies

In this section, we concentrate on embedding guest grids/tori with one, two, or three dimensions into 3D grids or tori. In addition to the basic strategy, we also present how we guard the use of the basic strategy to increase the chance of a successful mapping.

The simplest embedding, called Sit3D(A,B,X,Y) in this paper, simply applies Sit1D(A,X), Sit1D(B,Y), and Sit1D(C,Z) to embed the three dimensions of the guest grid/torus into the three dimensions of the host grid/torus independently.


Table 1: Methods/Operators for Embedding into 2D Grids/Tori

  Guest topology is 1D
    Operator        Description                                                                   Reference
    Sit1D(A,X)      map the nodes of a line of size A to the first A nodes of X                   naive
    Wrap(A)         embed a ring onto a line with A nodes                                         [22]
    Scat(A,X)       scatter/stretch a ring onto a ring with X nodes                               [15]
    RFold1D(A,X)    fold a ring onto a 2D grid/torus, starting parallel to Y                      [22]
    LFold1D(B,X)    fold a line onto a 2D grid/torus, starting parallel to Y                      [22]

  Guest topology is 2D
    Operator        Description                                                                   Reference
    Sit2D(A,X)      Sit1D(A,X) and Sit1D(B,Y)                                                     naive
    Compress(B,Y)   matrix embedding of A×B onto an ideal grid X′×Y with B along Y (when B < Y)   [21; 22]
    Stretch(B,Y)    matrix embedding of A×B onto an ideal grid X′×Y with B along Y (when B > Y)   extended from [21; 22]
    MFold2D(B,Y)    fold the rectangular mesh B×A onto the mesh X×Y with B along Y                [3]
    RFold2D(B,Y)    fold the rectangular torus B×A onto the mesh X×Y with B along Y               extended from [22; 3; 11]

  Step  Conditions                                                Embedding Method
  1     IF( A ≤ B ≤ X ≤ Y )                                       Sit2D(B,X)
  2     IF( A ≤ X < B ≤ Y )                                       Sit2D(A,X)
  3     IF( (A ≤ X ≤ Y ≤ B) ∧ (A ≤ X/2 ∨ A ≤ Y/2) ∧ Y ≤ B/2 )
          IF( B has torus link ∧ A < B/4 )                        try RFold2D(B,*)
          IF( B has NO torus link ∧ A < B/2 )                     try MFold2D(B,*)
  4     IF( A ≤ X ≤ Y ≤ B )
          IF( B/Y ≤ X/A )                                         Compress(B,Y)
          IF( B/Y > X/A )                                         Stretch(A,X)
  5     IF( X < A ≤ B < Y )                                       Compress(A,X) or Stretch(B,Y)

Figure 4: The procedure for selecting 2D embedding methods

In the following sub-sections, we describe the more complicated embedding methods we developed. Later in this section, we summarize the described techniques and specify conditions/predicates for their application.

3.1 3D Embeddings for Pipe and Ring

The embeddings described in this sub-section are extended from embeddings of 1D rings into 3D topologies. Specifically, for a given 2D or 3D guest grid/torus shaped like a rectangle with a low aspect ratio or a thin pipe, we treat it as a 1D line/ring and fold/wrap it in the 3D host grid/torus. The projection function is extended from that for embeddings of 1D lines/rings into 3D grids. Figure 5 (a-c) shows some representative candidate guest grids that are considered for such embeddings.

Because the method embeds guest grids/tori whose shapes are close to lines or rings, the technique is very effective. Figure 5(d) shows an example of embedding a 24x3x2 grid into a 6x6x4 host grid. In the example, three 2D-fold operations are performed; the process is similar to folding or wrapping a ring of size 8 into a 2x2x2 grid. The worst-stretched edges come from the 90-degree-turn and 180-degree-turn maneuvers. Each such maneuver introduces dilation within a 2D sub-grid, with dilation factor 2. The folding maneuvers applied in different 2D sub-grids do not re-embed each other's dilated edges. Therefore, the dilation cost of the whole embedding is 2.

In terms of the detailed embedding process, similar to folding a 1D line/ring into a 3D graph, the various turning points are determined first. Then a turning method is assigned to each turning point. For instance, for the case of embedding a 3D torus (shaped like a thin 3D circle) into a 3D cube, four 90-degree-turn maneuvers may be needed to bring the end face of the 3D ring close to its starting face. In addition, we found that the dilation costs and the average dilation costs are the same for embeddings of 3D rings into 3D grids and into 3D tori. However, utilizing the torus links during embedding helps the parallel tasks use the torus links in a better organized manner and may therefore yield slightly better performance.

3.2 Paper Folding

An intuitive way to map a two-dimensional grid into a three-dimensional grid is to follow the process of folding a sheet of paper into multiple layers. The idea can be used for embedding 2D grids, and 3D grids shaped like a thin panel, into near-cube 3D host grids/tori. Figure 6 (a) and (b) give an example of embedding a 10x9 2D grid into a 5x3x6 3D grid with a dilation factor of 5.

This naive 3D paper folding method can be improved by utilizing the 2D grid folding techniques described in the previous section. We call this method PaperFold. Figure 6(c) illustrates the folding scheme and shows that the dilation factor drops to 2. To explore the folding idea further, when the guest grid is a 2D/3D torus, the torus edges can be maintained by applying ring folding in the 2D plane defined by the dimensions excluding the dimension whose shape and size do not change (the X dimension in Figure 6(a)).

[Figure 5: 3D Ring Folding. (a) 2D candidate for 3D ring folding; (b) 3D pipe; (c) 3D ring; (d) 3D ring folding.]

[Figure 6: PaperFold. (a) Fold the Y dimension of a 2D grid onto the Z dimension; (b) Naive 3D-to-3D folding; (c) Dilation-2 3D-to-3D folding.]

The 3D fold embedding of a 2D guest grid, or of a 3D guest grid with a panel shape, can always be decomposed into two steps. To embed a 3D guest grid/torus A×B×C with A < B < C into a 3D host grid X×Y×Z with X < Y < Z, the general condition for paper folding is A′ ≥ X′ ∧ B′ > Y′ ∧ C′ < Z′/2. The procedure is to first fold dimension B′ onto C′ and then fold A′ onto C′, as shown in Figure 6(c).

One observation about applying folding to 3D embeddings is that each folding introduces a dilation cost of 2, and the dilation cost is cumulative. For most 3D grid embedding problems (embedding a thin 3D grid into a near-square 3D grid), we can simply apply folding twice, and the mapping dilation cost is bounded by 4.

Often, applications require 2D square grids whose two dimension sizes are not divisible by any of the sizes of the three dimensions of the host grid. In these cases, if we simply perform PaperFold, the expansion factor of the embedding is fairly large. For instance, in our study of running NAS BT on a partition with 512 nodes organized as an 8x8x8 torus, the guest topology is a 22x22 2D square grid, because 484 is the largest square number smaller than 512. To embed such 2D grids into a 3D grid, we first find an intermediate 2D grid I. The requirements on I are: (a) the 2D guest grid can be embedded into I with minimal expansion, and (b) the size of one of the dimensions of I is a multiple of one of the dimensions of the 3D host grid. Because the dilation cost of embedding a 2D grid into a 3D grid using PaperFold is 2, the final dilation of the above two-step procedure is twice the dilation of the embedding from the guest grid to the intermediate grid, which is usually bounded.

3.3 General 3D to 3D Embedding

Given the relatively non-trivial search space for a reasonable application of the many basic embedding techniques to 3D embedding, we developed an approach to deal with the general cases. Our approach is composed of two consecutive 2D-to-2D embeddings:

1. Map a 2D sub-topology of the guest grid to its ideal 2D grid (a 2D sub-grid of a 2D sub-topology of the host grid). Assume this embeds A×B into X×Y′, where Y′ ≤ Y.

2. Map Y′×C onto Y×Z.

In our embedding process for 3D grids, we always try to find embeddings with 3D folding and PaperFold first; when an embedding cannot be found by applying these two foldings, we apply the general approach. Therefore, the grids that end up using the general technique are likely to have shapes and aspect ratios similar to those of the host grids. In addition, both steps are likely to end up using the matrix-based embedding (the Compress() operation described in the previous section), and we expect the compression ratio for both Compress() steps to be about three or smaller. Note that the two steps may dilate the same guest edges, and the upper bound on the final dilation of the two steps is the product of the dilation factors of the two steps.


Table 2: Methods/Operators for Embedding into 3D Grids/Tori

  Guest topology is 3D
    Operator          Description                                                                 Reference
    Sit3D(A,B,X,Y)    Sit1D(A,X), Sit1D(B,Y), and Sit1D(C,Z)                                      naive
    MFold3D(C,Z)      treat a thin 3D structure as a 1D line and fold it in 3D;                   3D extension of [22]
                      also used for embedding a line or a grid with a small aspect ratio
    RFold3D(C,Z)      treat a thin circular 3D structure as a 1D ring and fold it in 3D;          3D extension of [22; 11]
                      also used for embedding a ring or a torus with a small aspect ratio
    PaperFold(A,X)    treat a thin 3D structure as a 2D sheet of paper and fold it into 3D;       combination of [5; 3; 21]
                      the process is the same when the guest grid is 2D
    General3D         embedding for the common cases to which the above methods cannot apply      developed

  Step  Conditions                                                         Embedding Method
  1     IF( A′ ≤ X′ ∧ B′ ≤ Y′ ∧ C′ ≤ Z′ )                                  Sit3D
  2     IF( A′ ≡ X′ ∧ B′ ≡ Y′ )                                            reduce to embedding into a 1D grid
  3     IF( A′ ≡ X′ )                                                      reduce to embedding into a 2D grid
  4     IF( A′ ≥ X′ ∧ B′ ≥ Y′ ∧ C′ ≤ Z′/2 ∧ (A′/2 > C′ ∨ B′/2 > C′) )      try PaperFold(C′,Z′)
  5     IF( A′×B′ ≤ X′×Y′/4 ∨ A′ ≤ X′/2 ∧ B′ ≤ Y′ ∧ C ≥ Z )                try Fold3D(C′,Z′)
  6     IF( A′×B′ ≤ X′×Y′ )                                                General3D(A′,B′,X′,Y′)

Figure 7: The procedure for selecting 3D embedding methods


3.4 Summary of Embeddings into 3D Grids/Tori

As a summary of our effort on embeddings into 3D grids/tori, we list the operations and the corresponding techniques in Table 2.

Similar to Sec. 2.5, Figure 7 specifies our procedure for selecting and applying the different embedding operations for embedding 3D or lower-dimensional grids/tori into 3D host grids/tori. The steps are executed in sequential order, and therefore the conditions of later steps imply that the previous steps were not performed.

In Figure 7, steps 2 and 3 simply deal with special cases that can be reduced to embeddings among 1D and 2D grids/tori. The condition for PaperFold (step 4) is rather loose; it makes sure that either A′ or B′ can be folded into C′. It requires permuting both A, B, C and X, Y, Z to find out whether the condition is ever met. In step 5, RFold3D is applied when the longer dimension of the guest grid has a torus link.
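As an illustration of the permutation search mentioned above, a small sketch (assumed helper names, not the library's code) that tests whether any assignment of guest and host dimensions satisfies the step-4 condition of Figure 7 could look like this:

```c
/* Try every assignment of guest dims (A',B',C') and host dims (X',Y',Z')
 * against the step-4 condition of Figure 7:
 *   A' >= X'  and  B' >= Y'  and  C' <= Z'/2  and  (A'/2 > C' or B'/2 > C'). */
#include <stdbool.h>

static bool step4_condition(int A, int B, int C, int X, int Y, int Z)
{
    return A >= X && B >= Y && 2*C <= Z && (A > 2*C || B > 2*C);
}

bool paperfold_applicable(const int g[3], const int h[3])
{
    static const int perm[6][3] = { {0,1,2}, {0,2,1}, {1,0,2},
                                    {1,2,0}, {2,0,1}, {2,1,0} };
    for (int i = 0; i < 6; i++)             /* permute guest dimensions */
        for (int j = 0; j < 6; j++)         /* permute host dimensions  */
            if (step4_condition(g[perm[i][0]], g[perm[i][1]], g[perm[i][2]],
                                h[perm[j][0]], h[perm[j][1]], h[perm[j][2]]))
                return true;
    return false;
}
```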

4 Implementation Issues

Currently, besides supporting the linearization-based process numbering/mapping, BG/L MPI allows a user to provide a mapping file that explicitly specifies a list of torus coordinates for all MPI tasks [5]. This simple approach allows users to control the task placement of an application at launch time. While having users dictate mappings simplifies the BG/L control system and the MPI implementation, it is not portable and adds an additional task for the user, i.e., the generation of different mapping files for different BG/L partitions. In the end, an application-specific program for mapping generation is needed to run on different BG/L partitions.

We have implemented two interfaces for our topology mapping library on BG/L. First, we provide a standalone interface that lets a user generate a BG/L MPI mapping file for a given pair of guest and host grids/tori. Second, we have integrated most of the functionality into the BG/L MPI library to support the MPI Cartesian topology functions. In the rest of this section, we briefly discuss implementation issues in the support of the MPI Cartesian topology functions and of the BG/L virtual node operation mode.

The MPI standard defines topology functions that provide a way to specify task layout at run time. They essentially provide a portable way of adapting MPI applications to the communication architecture of the target hardware. MPI specifies two types of virtual topologies: the graph topology, describing a graph with irregular connectivity, and the Cartesian topology, describing a multi-dimensional grid. Because the communication network of BG/L is a 3D torus, we have concentrated on the support for the Cartesian topology.

An MPI virtual Cartesian topology is created by MPI_Cart_create(), whose inputs describe a preferred Cartesian topology. Additional functions are defined for a process to query a communicator for information related to the communicator's Cartesian topology, e.g., the ranks of its neighbors in the virtual grid, the dimensionality of the virtual grid, etc. With this set of functions, an MPI application can map its tasks dynamically and transparently, and it is the MPI system's responsibility to realize an efficient mapping from the application's requested grid onto the underlying physical interconnect.
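For illustration, a minimal fragment using this interface is shown below; the 2D layout and neighbor queries are generic MPI usage, not code taken from the paper or the benchmarks:

```c
/* Minimal example of the MPI Cartesian topology interface that the
 * mapping library supports behind the scenes. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int nprocs, dims[2] = {0, 0}, periods[2] = {1, 1}, coords[2];
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);
    MPI_Dims_create(nprocs, 2, dims);            /* pick a near-square 2D grid */

    MPI_Comm cart;                               /* reorder = 1 lets the MPI   */
    MPI_Cart_create(MPI_COMM_WORLD, 2, dims,     /* system (here: the mapping  */
                    periods, 1, &cart);          /* library) place the tasks   */

    int rank, up, down, left, right;
    MPI_Comm_rank(cart, &rank);
    MPI_Cart_coords(cart, rank, 2, coords);
    MPI_Cart_shift(cart, 0, 1, &up, &down);      /* neighbors along dimension 0 */
    MPI_Cart_shift(cart, 1, 1, &left, &right);   /* neighbors along dimension 1 */

    printf("rank %d -> (%d,%d), neighbors %d %d %d %d\n",
           rank, coords[0], coords[1], up, down, left, right);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```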

We have plugged our topology mapping library into the system-dependent layer of the BG/L MPI implementation (essentially an optimized port of MPICH2 [19]). Specifically, when the application calls MPI_Cart_create() or MPI_Cart_map(), the mapping functions in our library are invoked first. If the library cannot compute a valid mapping/embedding, the default (linearization-based) implementation is called.

BG/L supports two operation modes: co-processor mode and virtual node mode, the latter using both processors of a compute node [5]. In co-processor mode, a single process uses both processors. In virtual node mode, two processes, each using half of the memory of the node, run on one compute node, with each process bound to one processor. Virtual node mode doubles the number of tasks in the BG/L message layer; it also introduces a refinement in the addressing of tasks. Instead of being addressed with a triplet (x,y,z) denoting the 3D physical torus coordinates, tasks are addressed with quadruplets (x,y,z,t), where t is the processor ID (0 or 1) within a compute node. We call the additional dimension, consisting of the two CPUs of one compute node, the T dimension. In our implementation, to avoid complicating the problem, we did not treat this as embedding into a 4D torus, because the T dimension is too small compared with the other three regular torus dimensions. Instead, we always disregard the T dimension and solve the embedding for the lower-dimensional topologies. Then the T dimension is placed in the inner-most (fastest-changing) dimension of the embedded virtual topologies.
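One plausible realization of this T-dimension handling, sketched under the assumption that two ranks adjacent along the fastest-changing guest dimension share one compute node (this is an illustration, not the library's actual code):

```c
/* Sketch: the node-level 3D embedding is computed first, then T is folded in. */
typedef struct { int x, y, z, t; } bgl_coord_t;
typedef void (*embed3d_fn)(int a, int b, int *x, int *y, int *z);

/* 'phi' stands for the already-computed node-level embedding of a guest
 * point; it is passed in here rather than defined. */
bgl_coord_t place_virtual_node(embed3d_fn phi, int a, int b /* fastest-changing coord */)
{
    bgl_coord_t c;
    phi(a, b / 2, &c.x, &c.y, &c.z);   /* consecutive b values share one node */
    c.t = b % 2;                       /* T selects the processor on the node */
    return c;
}
```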

5 Empirical Evaluation

In this section, we present the evaluation of our topology mapping library with a collection of widely used benchmark programs and real applications on a large-scale Blue Gene/L system. We show that, for a large number of realistic cases, the mappings generated by our library not only have much lower dilation factors, but also achieve close-to-constant hop counts for the messages exchanged among processes. As a result, the communication costs are significantly reduced and scalability is largely improved.

5.1 Experiment Setup

We use the MPI tracing/profiling component of the IBM High Performance Computing Toolkit [1] in our study. We use the following two performance metrics, which are dynamically measured by the tracing library:

- Communication Time is the total time a processor spends in MPI communication routines.

- Average Hops is based on the Manhattan distance between two processors. The Manhattan distance of a pair of processors p, q with physical coordinates (x_p, y_p, z_p) and (x_q, y_q, z_q) is defined as Hops(p,q) = |x_p − x_q| + |y_p − y_q| + |z_p − z_q|. We define the average hops over all messages sent from a given processor as:

      average_hops = ( Σ_i Hops_i × Bytes_i ) / ( Σ_i Bytes_i )

  where Hops_i is the Manhattan distance between the sending and receiving processors of the i-th message, and Bytes_i is the message size.

The logical concept behind the performance metric average-hops is to measure, for any given MPI message, the number of hops each byte has to travel. While the metric reflects the hops of the actually exchanged messages between pairs of MPI communication partners, it differs from the dilation factor of an embedding because it reflects the real communication scenario.
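A small sketch of the metric above, computed from a per-message trace (the record layout is illustrative, not the toolkit's data format):

```c
/* average_hops = sum(hops_i * bytes_i) / sum(bytes_i) over the traced messages. */
typedef struct { int hops; long bytes; } msg_record_t;

double average_hops(const msg_record_t *msgs, int n)
{
    double weighted = 0.0, total = 0.0;
    for (int i = 0; i < n; i++) {
        weighted += (double)msgs[i].hops * msgs[i].bytes;
        total    += msgs[i].bytes;
    }
    return total > 0.0 ? weighted / total : 0.0;
}
```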

For both performance metrics, we record the average values and the maximum values. The maximum values represent the "worst case": the maximal communication time is the communication time of the processor with the longest communication time; the maximal average-hops is that of the MPI message with the largest average-hops.

We used benchmark programs from the NAS Parallel Benchmark suite (NPB 2.4) and two scientific applications. The NAS NPB benchmark suite has been widely used for studies of parallel performance; descriptions of the benchmarks can be found in [9]. In this paper, we include detailed results for four NAS NPB programs (BT, SP, LU, and CG). Among the remaining NAS NPB programs, FT and EP do not use point-to-point communication primitives, and IS performs a single point-to-point communication in which each node sends one integer value to its right-hand-side neighbor. The virtual topology of MG is a 3D near-cube, which maps exactly to the topology of BG/L partitions with a permutation of the three dimensions ([5] describes how a user can specify such a mapping on BG/L). For all the results obtained on the NAS NPB benchmarks, we used the class D problem sizes, the largest of NPB 2.4.

The applications are SOR and SWEEP3D [23] from the ASCI benchmark programs. SOR is a program for solving the Poisson equation using an iterative red-black SOR method. This code uses a two-dimensional process mesh, where communication is mainly boundary exchange on a static grid with east-west and north-south neighbors. This results in a simple repetitive communication pattern, typical of grid-point codes from a number of fields. SWEEP3D [23] is a simplified benchmark program that solves a neutron transport problem using a pipelined wave-front method and a two-dimensional process mesh. Input parameters determine the problem sizes and blocking factors, allowing for a wide range of message sizes and parallel efficiencies.

Table 3: Topology Scenarios of Test Programs
(Dilation / Avg. Dilation is given for the Default Mapping and the Optimized Mapping)

  NAS BT, SP : 2D square mesh
    Guest Topo.     Host Topo.       Default         Optimized
    256:  16x16     256:  8x8x4      3 / 1.633       2 / 1.038
    484:  22x22     512:  8x8x8      6 / 3.108       4 / 1.620
    1024: 32x32     1024: 8x16x8     5 / 2.661       2 / 1.051
    2025: 45x45     2048: 8x16x16    10 / 5.049      4 / 1.587
    4096: 64x64     4096: 8x16x32    9 / 4.802       2 / 1.025

  NAS LU, CG; SOR : 2D near-square mesh
    256:  16x16     256:  8x8x4      3 / 1.633       2 / 1.038
    512:  32x16     512:  8x8x8      3 / 1.656       2 / 1.055
    1024: 32x32     1024: 8x16x8     5 / 2.661       2 / 1.051
    2048: 64x32     2048: 8x16x16    5 / 2.680       2 / 1.026
    4096: 64x64     4096: 8x16x32    9 / 4.802       2 / 1.025

  SWEEP3D : 2D near-square mesh
    256:  16x16     256:  8x8x4      3 / 1.633       2 / 1.038
    512:  16x32     512:  8x8x8      5 / 2.754       2 / 1.055
    1024: 32x32     1024: 8x16x8     5 / 2.661       2 / 1.051
    2048: 32x64     2048: 8x16x16    9 / 4.768       2 / 1.026
    4096: 64x64     4096: 8x16x32    9 / 4.802       2 / 1.025

Table 3 lists the programs' information related to topology mapping, the specific mapping scenarios used in our study, and the computed dilation factors under the BG/L default mapping and under the optimized mapping generated by our library. In the guest/host topology columns, we list the total number of processes and the corresponding Cartesian topology. The computed dilation factors show that the edge dilations associated with our optimized mapping are very small (many equal to 2) and much smaller than the dilation factors associated with BG/L's default mapping.

In the following sub-sections, we investigate the performance impact of optimized mappings with low dilation factors on realistic programs. For the experiments presented in this section, we compiled the programs using the IBM XL compiler on BG/L. For each scenario we studied, we generated a BG/L mapping file using our mapping library. We then ran the specific program and input case on a BG/L partition matching the host topology described in the scenario and collected the performance metrics.

Note that an evaluation of the support for the MPI topology functions is not presented in this paper. This is primarily because we did not find an application or benchmark program written with the MPI topology routines.

5.2 Results of NAS benchmarks

Results of the four NAS programs are given in Figure 8. The left graph for each program shows the measurements of Average Hops and the right graph shows the measurements of communication time. The bars are the averages across all messages and the lines are the "worst cases". The horizontal axis represents the number of processors, which corresponds to the tested mapping scenarios given in Table 3.

The results of BT, SP, and LU are all consistent in the sense that the dilation is significantly reduced (in many cases bounded by 2 hops) by using the mapping produced by our mapping library. As a result, BT and SP benefited from the high-quality mapping and show a significant reduction of their communication costs. This is primarily because the two programs have a nearest-neighbor communication pattern. In particular, for the cases using BG/L partitions with 512 and 2048 compute nodes, because the programs require a square number of processes, the mapping problems are to map 22x22 and 45x45 onto the corresponding partitions. The default, linearization-based mapping has no special handling for such cases. One of the worst dilated edges when mapping 22x22 onto 512 nodes with the default mapping is the edge between guest nodes (1,20) and (2,20): the nodes are mapped to physical nodes (2,5,0) and (0,0,1), and the edge is dilated to 6. In our optimized mapping, the perfect square meshes 22x22 and 45x45 are compressed onto 32x16 and 64x32, which are then folded into 3D grids (Section 3.2), and dilation 4 is realized.
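As a concrete check of this example, the following sketch assumes the default mapping is a row-major guest numbering followed by an x-fastest linearization onto the torus (an assumption that reproduces the coordinates quoted above):

```c
/* Guest nodes (1,20) and (2,20) of the 22x22 mesh under an assumed default
 * linearization onto an 8x8x8 torus, and their torus (wrap-aware) distance. */
#include <stdio.h>
#include <stdlib.h>

static int torus_dist(int a, int b, int size)   /* shortest hop count on a ring */
{
    int d = abs(a - b);
    return d < size - d ? d : size - d;
}

int main(void)
{
    const int GX = 22;                 /* 22x22 guest mesh, row-major ranks  */
    const int X = 8, Y = 8, Z = 8;     /* host partition is an 8x8x8 torus   */
    int guest[2][2] = { {1, 20}, {2, 20} };
    int phys[2][3];

    for (int i = 0; i < 2; i++) {
        int rank = guest[i][0] * GX + guest[i][1];   /* row-major guest rank  */
        phys[i][0] = rank % X;                       /* x varies fastest      */
        phys[i][1] = (rank / X) % Y;
        phys[i][2] = rank / (X * Y);
    }
    int hops = torus_dist(phys[0][0], phys[1][0], X)
             + torus_dist(phys[0][1], phys[1][1], Y)
             + torus_dist(phys[0][2], phys[1][2], Z);
    printf("(1,20)->(%d,%d,%d)  (2,20)->(%d,%d,%d)  dilation = %d hops\n",
           phys[0][0], phys[0][1], phys[0][2],
           phys[1][0], phys[1][1], phys[1][2], hops);   /* prints dilation = 6 */
    return 0;
}
```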

On the other hand, although the average hops measurements of LU are very good when using the mapping files we generated, there is little or no improvement in its communication cost. This is because LU involves communication between processes at varying distances.

The results of CG are rather counter-intuitive. Specifically, although the estimated dilation of the optimized mapping is only 2, the average communication hops are much higher. This is because in CG each process P exchanges messages with all processes of the same row that are exactly 2^(k-1) hops away from P, where the hops are in terms of the guest 2D grid. For example, assuming each row has 32 nodes, node 2 of the row communicates with nodes 3, 4, 6, 10, and 18 of the same row. When mapping this dimension to the host grid using the default linearization-based mapping with a power-of-two radix, the physical distances of far-apart communication partners may be small. On the other hand, the optimized mapping, which tries to optimize the connectivity of the nearest neighbors, may not optimize the physical distances of the long-distance communication partners. For instance, Figure 9 shows how a row is mapped onto a 2D space with inner dimension size 8, using the default mapping and the optimized mapping. As highlighted in Figure 9(a), with the default mapping, node 2 ends up 1 hop away from node 10. In Figure 9(b), with the optimized mapping, the physical distance between node 2 and node 10 is 4.

[Figure 8: NAS NPB 2.4 results (with class D inputs, co-processor mode). (a) BT; (b) SP; (c) LU; (d) CG.]

[Figure 9: CG communication partners of node 2 (nodes 3, 4, 6, 10, and 18 highlighted). (a) Default mapping; (b) Optimized mapping.]

5.3 Application Results

Figure 10 gives the detailed performance results for SOR and SWEEP3D. The results for SOR confirm that for applications with a nearest-neighbor communication pattern, optimized process mapping improves communication performance significantly. Specifically, the SOR results show that not only the Average Hops but also the communication time improve significantly with our mapping library. This is because the communication pattern of SOR is "ping-pong" messages among nearest neighbors, which is the representative communication pattern that benefits from high-quality topology mapping. Note that the measured average hops for the cases using our mapping library stay flat, and the benefit of using it becomes more significant as the number of processors increases. For 4096 processors, the average communication time is improved by 42%. For the worst case (i.e., the processor that spends the most time in communication), the communication cost is improved by 25%.

Similar to the results of NAS-LU, the results of SWEEP3D in Figure 10(b) show that although the Average Hops improve significantly in all cases, the communication cost is not affected. This is because for the SWEEP3D application the communication pattern is composed of pipelined wavefronts. Although its communications are all among nearest neighbors, the pipeline hides the relatively high latencies between certain neighboring wavefronts containing extended mesh edges. Nevertheless, because the communications of SWEEP3D are among neighbors, the optimized mapping does not introduce the performance degradation seen in the NAS-CG case.

6 Related Work

The problem of mapping parallel programs onto parallel systems has been studied extensively since the beginning of parallel computing. The problem is essentially equivalent to the graph embedding problem. Nevertheless, for different applications the problem has different constraints. For instance, there are many-to-one and one-to-one mappings. Similarly, some methods concentrate on developing techniques for mapping data structures onto processors, while others map parallel processes onto processors. In this paper, we concentrated on exploring one-to-one mappings from parallel processes to processors.

In terms of solution approaches, a large number of methods exploring graph partitioning and search-based optimization have been developed (a few examples are [12; 20; 24]). In this category, [6] described a simulated-annealing-based method to explore an application's communication pattern and in turn discover the most beneficial mapping of the application's tasks onto BG/L.


[Figure 10: Application results (co-processor mode). (a) SOR; (b) SWEEP3D.]

The off-line approach introduced in that paper is effective for performance tuning and knowledge discovery for complicated applications. Our work is orthogonal and complementary to their approach: when the logical topology of an application's communications is well defined, our topology mapping library can be used to port the application easily to various BG/L partitions; when the communication pattern of an application is irregular or dynamic (like that of NAS-CG), their tool should be used to uncover a proper mapping onto BG/L topologies.

Another approach is to embed guest graphs into host graphs via projection functions. These methods usually develop solutions for special cases with low complexity. As discussed in Section 2, most of the effective results on grid embeddings were obtained for embedding into two-dimensional grids. In the rest of this section we discuss related efforts on embeddings for three-dimensional grids/tori, embeddings for tori, support for MPI topology functions, and topology-mapping work related to BG/L.

A Gray code is an ordering of the 2^n n-bit binary numbers such that only one bit changes from one entry to the next. Applying Gray code numbering to grid embedding, for the case of 8 nodes, the nodes can be numbered as 000, 001, 011, 010, 110, 111, 101, 100; this sequence is a Gray code. Thanks to its simplicity, the Gray code has been successfully applied to embeddings into hypercube topologies, and researchers have explored its application to embeddings among k-ary n-cube topologies [7; 16]. When embedding a one-dimensional ring into a multi-dimensional grid, Gray-code-based embedding is exactly the same as the ring folding method mentioned in Section 2. Nevertheless, it is difficult to extend the approach effectively to guest grids with two or more dimensions. When embedding among 2D and/or 3D grids, the associated dilation factors are usually on the order of the sizes of the dimensions.
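For reference, the standard reflected Gray code reproduces exactly the 3-bit sequence quoted above:

```c
#include <stdio.h>

/* g(i) = i XOR (i >> 1); for 3 bits: 000 001 011 010 110 111 101 100. */
int main(void)
{
    for (unsigned i = 0; i < 8; i++) {
        unsigned g = i ^ (i >> 1);
        printf("%u%u%u ", (g >> 2) & 1, (g >> 1) & 1, g & 1);
    }
    printf("\n");
    return 0;
}
```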

Tao's mapping [16] is composed of ring-wrapping, simple reduction, and general reduction techniques for embedding among high-dimensional meshes and tori. It is based on a ring-wrapping method, similar to Gray-code methods, for embedding a one-dimensional pipe/ring onto two-dimensional grids. Based on re-factoring the sizes of the dimensions of the guest grid and host grid, they introduced a simple reduction method and a general reduction method to embed among multi-dimensional grids. Their factorization-based techniques place fairly strong constraints on the relative shapes of the guest and host grids, and additional techniques are needed to complement the solution scope. In addition, their most general technique (general reduction) can introduce large dilation: for the case of mapping to a 3D grid/torus, in the worst case its dilation factor equals the size of the smallest dimension.

PARIX mapping [22] is a comprehensive topology mapping library developed with a similar approach to ours, i.e., exploring effective techniques for 2D grid embedding. In this sense, our work is similar to theirs. Nevertheless, their integration for embeddings of 3D grids/tori is not as complete. Directly applying their steps for embedding 3D grids into 2D grids, the embedding of a 3D grid into a 3D grid would involve three steps: first unfold the 3D guest grid into a 2D grid along a single dimension; then perform a 2D embedding of that 2D intermediate grid into another 2D intermediate grid that can be folded into the 3D host grid by pipe/ring foldings; finally, fold the second intermediate grid into 3D. The drawback of this procedure is that the first step (unfolding from a 3D grid to a 2D grid) introduces a large dilation, equal to the size of the smallest dimension.

Kim and Hur [15] proposed an approach for many-to-one embedding of a multi-dimensional torus onto a host torus with the same number of dimensions. Their approach is based on ring stretching/scattering, which we also use for our torus Sit operations in this paper. Because they concentrated on many-to-one embedding, their approach does not work well for general one-to-one grid/torus embeddings.

In terms of supporting MPI topology functions, [24] proposed a graph-partitioning-based technique for embedding into a hierarchical communication architecture (the NEC SX series). [18] described techniques for embedding into switch-based networks. These techniques are designed for specific systems, which have different networks from BG/L.

Topology mapping on BG/L has been studied in [7; 6]. [7] studied the performance impact of process mapping on BG/L at rather small scales (using up to 128 BG/L compute nodes). The study shows that Gray codes and a one-step paper folding are fairly effective for mapping two-dimensional guest topologies onto BG/L partitions with sizes up to 128 compute nodes (i.e., an 8x8x2 torus). Based on more sophisticated grid embedding techniques, our approach covers many hard cases. In addition, our integration of existing and novel techniques provides a rather complete solution for systematically mapping applications onto BG/L.

7 Conclusions

To run scalable applications on today's scalable parallel systems with minimal communication overhead, effective and efficient support for mapping parallel tasks onto physical processors is required. This paper describes the design and integration of a comprehensive topology mapping library for mapping MPI processes (parallel tasks) onto physical processors with a three-dimensional grid/torus topology. In developing the topology mapping library presented in this paper, we not only conducted an extensive study of existing practical techniques for grid/graph embeddings, but also explored the design space and integration techniques for embeddings of three-dimensional grids/tori. By providing scalable support for the MPI virtual topology interface, portable MPI applications can benefit from our comprehensive library. The results of our empirical study, using popular benchmarks and real applications with topologies scaling up to 4096 nodes, further show the impact of our topology mapping techniques and library on improving the communication performance of parallel applications.

For future work, we would like to find or co-develop applications that use the MPI virtual topology functions for a realistic evaluation of our support for the MPI topology interface. In addition, we would like to look into embeddings of topologies other than grids/tori into BG/L topologies.

Acknowledgement

We would like to acknowledge and thank George Almasi, Jose G. Castanos, and Manish Gupta from IBM T. J. Watson Research Center for their support; Brian Smith, Charles Archer, and Joseph Ratterman from IBM Systems Group for discussions and their effort in integrating our library into the BG/L MPI library; and William Gropp for valuable discussion on the support of MPI topology functions.

References

[1] IBM Advanced Computing Technology Center MPI tracer/profiler. URL: http://www.research.ibm.com/actc/projects/mpitracer.shtml.

[2] Adiga, N. R., et al. 2002. An overview of the BlueGene/L supercomputer. In SC2002 – High Performance Networking and Computing.

[3] Aleliunas, R., and Rosenberg, A. L. 1982. On embedding rectangular grids in square grids. IEEE Transactions on Computers 31, 9 (September), 907–913.

[4] Almasi, G., et al. 2001. Cellular supercomputing with system-on-a-chip. In IEEE International Solid-State Circuits Conference (ISSCC).

[5] Almasi, G., Archer, C., Castanos, J. G., Erway, C. C., Heidelberger, P., Martorell, X., Moreira, J. E., Pinnow, K., Ratterman, J., Smeds, N., Steinmacher-Burow, B., Gropp, W., and Toonen, B. 2004. Implementing MPI on the BlueGene/L supercomputer. In Proc. of the Euro-Par Conference.

[6] Bhanot, G., Gara, A., Heidelberger, P., Lawless, E., Sexton, J. C., and Walkup, R. 2005. Optimizing task layout on the Blue Gene/L supercomputer. IBM Journal of Research and Development 49, 2 (March), 489–500.

[7] Smith, Brian E., and B. B. 2005. Performance effects of node mappings on the IBM Blue Gene/L machine. In Euro-Par.

[8] Chan, M. J. 1996. Dilation-5 embedding of 3-dimensional grids into hypercubes. Journal of Parallel and Distributed Computing 33, 1 (February).

[9] Van der Wijngaart, R. F. 2002. NAS Parallel Benchmarks version 2.4. Tech. Rep. NAS-02-007, NASA Ames Research Center, October.

[10] Ellis, J. A. 1991. Embedding rectangular grids into square grids. IEEE Transactions on Computers 40, 1 (January), 46–52.

[11] Ellis, J. A. 1996. Embedding grids into grids: Techniques for large compression ratios. Networks 27, 1–17.

[12] Ercal, F., Ramanujam, J., and Sadayappan, P. 1990. Task allocation onto a hypercube by recursive mincut bipartitioning. J. Parallel Distrib. Comput. 10, 1, 35–44.

[13] MPI Forum, 1997. MPI: A message-passing interface standard. URL: http://www.mpi-forum.org/docs/mpi-11-html/mpi-report.html, August.

[14] Hatazaki, T. 1998. Rank reordering strategy for MPI topology creation functions. In Proceedings of the 5th EuroPVM/MPI Conference, Springer-Verlag, Lecture Notes in Computer Science.

[15] Kim, S.-Y., and Hur, J. 1999. An approach for torus embedding. In Proceedings of the 1999 International Workshop on Parallel Processing, 301–306.

[16] Ma, E., and Tao, L. 1993. Embeddings among meshes and tori. Journal of Parallel and Distributed Computing 18, 44–55.

[17] Melhem, R. G., and Hwang, G.-Y. 1990. Embedding rectangular grids into square grids with dilation two. IEEE Transactions on Computers 39, 12 (December), 1446–1455.

[18] Moh, S., Yu, C., Han, D., Youn, H. Y., and Lee, B. 2001. Mapping strategies for switch-based cluster systems of irregular topology. In 8th IEEE International Conference on Parallel and Distributed Systems.

[19] The MPICH and MPICH2 homepage. URL: http://www-unix.mcs.anl.gov/mpi/mpich.

[20] Ou, C.-W., Ranka, S., and Fox, G. 1996. Fast and parallel mapping algorithms for irregular problems. J. Supercomput. 10, 2, 119–140.

[21] Rottger, M., and Schroeder, U. 1998. Efficient embeddings of grids into grids. In The 24th International Workshop on Graph-Theoretic Concepts in Computer Science, 257–271.

[22] Rottger, M., Schroeder, U., and Simon, J. 1993. Virtual topology library for PARIX. Tech. Rep. TR-005-93, Paderborn Center for Parallel Computing, University of Paderborn, Germany, November.

[23] The ASCI SWEEP3D benchmark code. URL: http://www.llnl.gov/ascibenchmarks/scsi/limited/sweep3d/asci sweep3d.html.

[24] Traff, J. L. 2002. Implementing the MPI process topology mechanism. In Supercomputing, 1–14.