Implementation and performance evaluation of a parallel ocean model


Parallel Computing 24 (1998) 181–203

Practical aspects

Implementation and performance evaluation of a parallel ocean model

Manu Konchady a,1, Arun Sood b,*, Paul S. Schopf c,2

a Mitre Corporation, McLean, VA 22102, USA
b Department of Computer Science, George Mason University, Fairfax, VA 22030, USA
c Dept. of Computational Science, George Mason University, Fairfax, VA 22030, USA

Received 18 June 1996; revised 21 May 1997; accepted 13 July 1997

Abstract

Ocean models have been used to study global phenomena such as the El Nino Southern Oscillation. The effects of El Nino are felt worldwide and are of importance to all segments of society. High resolution ocean models are being developed to run more sophisticated simulations. The demand for computing power is expected to increase drastically with the use of high resolution models. Running extended simulations with high resolution models on current vector supercomputers will require computing power which is not available or prohibitively expensive. We have developed a parallel implementation suitable for running high resolution simulations. We studied performance issues such as the communication overhead and cache hit rate. Our experiments with grid partitioning and inter-processor communication techniques enabled us to obtain an efficient parallel implementation. Three grid partitioning techniques are evaluated. Two schemes to gather and scatter global grid data among processors are compared. A cache model to analyze performance is described. © 1998 Published by Elsevier Science B.V. All rights reserved.

Keywords: Ocean model; High resolution simulation; Communication overhead; Cache misses; Grid partitioning

* Corresponding author. E-mail: [email protected]
1 E-mail: [email protected]
2 E-mail: [email protected]

0167-8191/98/$19.00 © 1998 Published by Elsevier Science B.V. All rights reserved. PII S0167-8191(97)00092-6

1. Introduction

The use of high resolution models for studying ocean circulation was proposed by Semtner and Chervin in 1988 [17]. The prohibitive cost of running high resolution ocean


models on vector supercomputers has limited the use of such models. To lower the cost of simulation, several parallel ocean models have been successfully implemented on multi-processor distributed memory machines [2,11,13]. The use of parallel distributed memory computers for climate modelling has gained popularity because of the development of high performance processors and increased inter-processor communication bandwidth. In this research, we have used a Cray T3D as our experimental platform, and have focussed on developing techniques that will increase performance. In particular, we have explored grid partitioning approaches, techniques to collect and distribute data, and limitations of the cache on a Cray T3D processor. We show that the proper choices lead to enhanced performance. Regular 3D grids have been traditionally used in ocean models, and thus the implementation of an ocean model on a parallel distributed memory machine requires the partitioning of a grid with a periodic exchange of shared data. We have analyzed the performance of three grid partitioning techniques for regular grids.

We describe a parallel implementation and evaluate the performance of Poseidon [15], a coupled ocean model, on the Cray T3D. Our implementation is similar to prior implementations of parallel ocean models [2,11] with respect to grid partitioning. The distinguishing feature of our model is the coupling interface. We have developed a mechanism for coupling the ocean model with land, ice, lake and atmospheric models. The ocean model we used does not include the treatment of ice and was coupled to a simplified atmospheric model. A coupled system consisting of multiple models including ocean, ice, atmospheric and land models can be used to address some of the complex climate simulation problems. The exchange of forcing data between models is a significant overhead since data must be distributed and collected for all processors. In our implementation, we have studied two techniques to distribute and collect data efficiently. The communication overhead problem also occurs during I/O for restart and history files. We have not used any parallel I/O facilities, which can reduce some of the communication overhead. Our results show that global data can be distributed and collected with less overhead using a recursive scheme. A new algorithm to find an optimized processor grid for a given topography using an irregular blocking technique has been developed. This algorithm is used to obtain a high resolution processor grid for a fixed number of physical processors. Our irregular blocking scheme is similar to the one developed for the MICOM ocean model [2]. We have evaluated performance using optimized processor grids for several configurations.

One of the limitations of running on the Cray T3D is the size of the cache on a single processor (8K). While the 64 Mbyte per processor memory limit is sufficient to run most applications, the small cache can degrade performance. For a simple application, it is possible to rearrange data to obtain a high cache hit rate. It is more complex for a large application to determine the optimum data layout. Cache performance tools [4] can be used to determine bottlenecks. We have developed a technique to evaluate cache performance for multiple processor configurations and cache sizes without the use of a large multi-processor simulator. It can be extended to handle different cache models. We have used the technique to study the performance of a primary direct mapped cache.


Table 1
Parallel ocean models

Model       Platform               Grid partitioning   Coupled
SWEM [1]    SGI Onyx               regular 1D          no
MICOM [2]   Cray T3D, SGI Onyx     irregular 2D        no
POP [19]    CM-5 and others        irregular 2D        yes
POCM [18]   Cray YMP               irregular 2D        no

1.1. Parallel ocean models

Several organizations have developed parallel versions of ocean models. Table 1 shows a list of some parallel ocean models. The platform, grid partitioning technique and coupling mode are described for each model in Table 1. All the models in Table 1 obtain parallelism by running the same code on multiple processors for separate regions of a grid. The partitioning techniques differ among models and we have evaluated three techniques for the Poseidon ocean model [15]. Coupled parallel ocean models have the added complexity of exchanging forcing data with other climate models. We describe a scheme which can be used to reduce the communication overhead of exchanging forcing data in Section 3.

The Poseidon ocean model was developed to study seasonal to decadal variability of the global ocean. Coupled with a land–atmosphere model, the ocean model has been used to study the El Nino phenomenon. Poseidon is a three dimensional general ocean circulation model (Fig. 1) extended from a two layer upper ocean model [14] with the following key features: generalized isopycnic vertical coordinates [2], reduced gravity approximation [21], a staggered grid [11] and the provision for a turbulent surface mixed layer. The Poseidon Ocean Model User's Guide, Version 4 [15], includes a detailed description of the model.

1.2. Coupled model

The ocean model includes provisions for coupling with other climate models. Current model runs can be coupled with a simplified or a complete atmospheric general

Fig. 1. Ocean model layers.


circulation model (AGCM). The coupling interface between the ocean and atmospheric models can be extended to other models such as ice, lake, or land models. Obviously, an ocean model run coupled with a complete AGCM takes much longer than a run with a simplified AGCM. The decision to run with a simplified or complete AGCM is made based on the experiment. In our experiments to evaluate performance, we have used the ocean model coupled with a simplified AGCM.

The provision for this type of coupling is important to test theories regarding El Nino, which is based on a complex interaction between the ocean and atmosphere. The coupling process consists of the exchange of forcing data between the two models. The atmospheric model provides wind stresses and other forcing data and the ocean model provides the sea surface temperature. The coupling interface between any two models is defined by the forcing data exchanged.

Since each model requires forcing data to begin a simulation, there are two ways to initiate a coupled model run. The first technique is to run the two models alternately on a single machine. The forcing data for one of the models is read from a restart file and it generates forcing data for the other model. The second technique is to run both models in parallel (Fig. 2). The forcing data for both models is read from restart files and both models run concurrently on separate machines or a single machine. A physical coupling exists between the ocean and atmospheric models, i.e. the forcing data generated by the ocean model affects the results of the atmospheric model and vice versa. A distributed coupled model has been implemented using a Cray T3D, YMP and C90 [9]. The ocean model ran on the T3D while the atmospheric model ran on the C90 and a driver program ran on the YMP. The driver program handled communication between the models and synchronization, and performed interpolation of forcing data. One of the limitations of such a system is the bandwidth between machines. For a coupled model, high volume intermittent exchange of data is required. A low bandwidth connection will lower performance, since both models are idle till forcing data has been received.

Fig. 2. Coupled ocean–atmosphere model.


Running on a distributed system introduces a load balancing problem. The atmospheric and ocean models do not run for the same amount of time between coupling intervals. The model completing a simulation sooner must wait till the other model is ready for coupling. The use of other climate models such as an ice or land model together with the ocean and atmospheric models increases the complexity of running the coupled models on a distributed system. This is due to the differences between the times to run each model on a particular machine. Models taking less time for computation will be idle until forcing data is received from a more computation intensive model. Parallelizing and distributing multiple coupled climate models for a complete global climate simulation is a complex and challenging task. We have successfully executed simulations with the atmospheric and ocean models on a network of heterogeneous machines. The results from the distributed simulation were similar to the results from a simulation on a single machine. We did not study issues related to running distributed computations such as load balancing and a rigorous verification of results.

1.3. Serial implementation

The fundamental purpose of Poseidon is to advance the state of the ocean. It can be called repeatedly and executes functions based on alarms. A global timekeeper maintains simulation time and alarms. Alarms are set for individual functions to run at specific time intervals during the simulation.
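As a minimal sketch of how such an alarm-driven driver can work (the routine names and intervals below are hypothetical and not Poseidon's actual settings), each routine fires only when its alarm interval divides the current timestep:

   program alarm_sketch
     implicit none
     integer :: nstep
     ! hypothetical alarm intervals, expressed in timesteps
     integer, parameter :: hydro_interval       = 1   ! hydrodynamics every step
     integer, parameter :: mixed_layer_interval = 8
     integer, parameter :: filter_interval      = 16

     do nstep = 1, 744                                ! one month of timesteps
        if (mod(nstep, hydro_interval)       == 0) call advance_hydrodynamics()
        if (mod(nstep, mixed_layer_interval) == 0) call mixed_layer()
        if (mod(nstep, filter_interval)      == 0) call shapiro_filter()
     end do

   contains

     subroutine advance_hydrodynamics()
     end subroutine advance_hydrodynamics

     subroutine mixed_layer()
     end subroutine mixed_layer

     subroutine shapiro_filter()
     end subroutine shapiro_filter

   end program alarm_sketch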

The overall program flow is shown in Fig. 3. The initial ocean state is read from a

Fig. 3. Ocean model algorithm.


restart file. The clock and alarms are set for the current run. External forcing data is exchanged with the atmospheric model. The coupling interface is built into the main driver routine. Coupling can be performed with ice, land, lake and atmospheric models. The ocean state is advanced by calling the hydrodynamics and other routines. The vertical diffusion, mixed layer and filtering [16] routines are executed less frequently than the hydrodynamics. A new ocean state is determined at the end of each hydrodynamics computation. Each of the hydrodynamic routines contributes to a partial derivative which is then used by the update routine to compute a new state. The ocean state variables are layer thickness, temperature, salinity, and zonal and meridional advection.

Filtering, vertical diffusion and the mixed layer computation consume less than 1/3 of the total computation time. The hydrodynamics computations use about 2/3 of the total computation time. A large number of grid points implies loops with a higher iteration count and a proportionally longer computation time. An explicit finite difference scheme is used by the update routine to compute a new state.
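As an illustration of this explicit update (a minimal forward-Euler sketch; the array names, dimensions and time step below are hypothetical, not Poseidon's actual variables):

   subroutine update_state(h, dhdt, dt, nx, ny, nl)
     ! Explicit (forward Euler) update: the tendency accumulated by the
     ! hydrodynamics routines advances the state by one timestep.
     implicit none
     integer, intent(in)    :: nx, ny, nl
     real,    intent(in)    :: dt
     real,    intent(in)    :: dhdt(nx, ny, nl)   ! accumulated partial derivative
     real,    intent(inout) :: h(nx, ny, nl)      ! layer thickness (state variable)
     integer :: i, j, k

     do k = 1, nl
        do j = 1, ny
           do i = 1, nx
              h(i, j, k) = h(i, j, k) + dt * dhdt(i, j, k)
           end do
        end do
     end do
   end subroutine update_state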

A Shapiro filtering scheme [16] has been used to maintain numerical stability near the poles. Since the Arctic ocean was not included in the simulation, a stripe of land points is assumed near the North pole. Stripes of land points are used near the South pole. In our parallel implementation we have not used a compressed grid due to the problems of handling inter-processor communication.

1.4. Parallel implementation

About 15,000 lines of Fortran code have been ported to the Cray T3D from a Cray vector supercomputer. Our implementation includes the capability to use three communication protocols: PVM, MPI and shmem [6,8,12]. We have successfully run the model using each of the three protocols on the Cray T3D.

The task of parallelizing the code involved three tasks: (1) developing code for inter-processor and global communication, (2) modifying the number of iterations executed for a do loop based on the number of processors used, and (3) optimizing code to run efficiently with a small cache. We accomplished all three tasks while still maintaining the capability to run the sequential and distributed versions from a single source. We created a preprocessor file with a number of options for the number of processors, the type of grid partitioning, the communication protocol (PVM, MPI, or shmem), shadow border width and the tracing of communication calls. The C preprocessor was used to generate code from the source based on the options defined in the preprocessor file.

The first task of developing code for inter-processor communication and global communication involved implementing a grid partitioning technique and efficiently performing global computations. We analyze three grid partitioning techniques in Section 2. On the Cray T3D, the number of processors allocated for a job is always a power of two [3]. We have coded global computations assuming a complete binary tree. When running on a machine such as the Intel Paragon, additional code must be included to handle processor configurations which are not powers of 2. The second task is a tedious job of changing the iteration counts of every do loop. The number of iterations is


based on the options in the preprocessor file. The start and end iterations are computed depending on the number of processors, shadow width and grid partitioning technique. The third task involves analysis of performance for a direct mapped cache. We have discussed cache performance in Section 4. A number of the performance recommendations for a T3D processor may not be appropriate for a vector processor. Therefore, optimizing code to run on multiple machines involves iterative changes to verify that performance on any one of the machines has not degraded significantly.

The performance of the parallel algorithm depends on the communication overhead. Minimizing communication overhead while performing the required computation is a challenging task. A single program multiple data (SPMD) coding model is chosen. The choice is made for easier maintenance and simplicity. The SPMD model is easier to develop from uni-processor code. The implementation uses a single program on each processor working on a section of the 3D grid with message passing to share common grid data. Verification of results from the sequential and parallel algorithms must take into account the differences between processors and computations such as global sums. Since filtering is used in both algorithms, values will not exceed bounds. We have verified the results using the sequential and parallel algorithms to validate the parallel implementation. Successful extended runs were executed to generate a higher confidence in the use of a parallel implementation.

2. Grid partitioning techniques

In this section, we briefly review three grid partitioning techniques for the ocean model. Each grid partitioning technique has pros and cons in terms of the volume of data communicated and the number of neighbor processors. The performance of each technique is assessed assuming computation for a grid point requires data from four (north, south, east, and west) neighbors. The striping and blocking grid partitioning techniques have been compared in earlier research [5,10].

2.1. Striping

Striping can be performed along any one of the three axes. A stripe is assigned to a single processor and consists of a sub-grid. The size of the sub-grid is determined by the number of stripes (Fig. 4). The advantages of striping are simpler coding and the fact that the number of neighbor processors is always 2. The number of bytes exchanged is equal to the size of the dimension of the grid (a shadow row). The process of dividing the grid is simple since the entire grid can be viewed as a one dimensional array which is divided into sections. During an exchange, the shadow row is not copied into a temporary message buffer. The shadow data can be exchanged by simply providing an address and length. A one dimensional distribution of the grid makes the task of gather/scatter of global grid data simple (see Section 3). For striping, the maximum number of processors that can be used is limited to the size of the y dimension of the grid.
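To illustrate why the striped exchange stays simple, the sketch below performs a 1D shadow-row exchange using MPI rather than the Cray shmem calls used in the model; the array layout and variable names are assumptions made for the example, not the model's actual code.

   subroutine exchange_shadow_rows(field, nx, nyloc, rank, nprocs)
     ! 1D (striping) shadow exchange: each processor holds rows 0..nyloc+1,
     ! where rows 1..nyloc are owned and rows 0 and nyloc+1 are shadow rows.
     ! Because a row is contiguous in memory, no gather/scatter is needed:
     ! an address and a length are enough.
     use mpi
     implicit none
     integer, intent(in)    :: nx, nyloc, rank, nprocs
     real,    intent(inout) :: field(nx, 0:nyloc+1)
     integer :: north, south, ierr
     integer :: status(MPI_STATUS_SIZE)

     north = rank + 1
     south = rank - 1
     if (north == nprocs) north = MPI_PROC_NULL   ! no exchange past the poles
     if (south < 0)       south = MPI_PROC_NULL

     ! send the top owned row north, receive the south shadow row
     call MPI_SENDRECV(field(1, nyloc), nx, MPI_REAL, north, 1, &
                       field(1, 0),     nx, MPI_REAL, south, 1, &
                       MPI_COMM_WORLD, status, ierr)
     ! send the bottom owned row south, receive the north shadow row
     call MPI_SENDRECV(field(1, 1),       nx, MPI_REAL, south, 2, &
                       field(1, nyloc+1), nx, MPI_REAL, north, 2, &
                       MPI_COMM_WORLD, status, ierr)
   end subroutine exchange_shadow_rows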


Fig. 4. Striping and blocking.

2.2. Regular blocking

Regular blocking is a two dimensional partitioning technique which reduces the volume of inter-processor data communication compared to striping (Figs. 4 and 5). The grid is divided across the latitude and longitude dimensions into blocks of equal areas (or grid points). Each processor must exchange data with 4 processors. The implementation is more complex compared to striping. The grid cannot be viewed as a 1-dimensional array. Additional code must be written to handle the 'gather and scatter' of shadow column data, which is non-contiguous; a sketch of this packing step is shown below. This additional code causes a degradation in performance compared to striping for large grids on small processor configurations. The number of processors that can be used in regular blocking exceeds that of striping.
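A sketch of the extra packing step implied here (illustrative only; the names are hypothetical): a shadow column is strided in Fortran's column-major layout, so it must be copied into a contiguous buffer before transmission and copied back out on receipt.

   subroutine pack_shadow_column(field, nxloc, nyloc, icol, buffer)
     ! Gather one grid column (fixed longitude index icol) into a contiguous
     ! buffer.  Successive elements of the column are nxloc words apart in
     ! memory, which is why blocking needs this copy while striping does not.
     implicit none
     integer, intent(in)  :: nxloc, nyloc, icol
     real,    intent(in)  :: field(nxloc, nyloc)
     real,    intent(out) :: buffer(nyloc)
     integer :: j

     do j = 1, nyloc
        buffer(j) = field(icol, j)
     end do
   end subroutine pack_shadow_column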

2.3. Irregular blocking

For a global ocean model grid, approximately 35% of the grid points are land points and the remainder are water points (Fig. 6). No useful computation is performed on land points. A lower execution time can be obtained through the use of an irregular blocking technique [2]. We will show how an optimal processor grid can be obtained for a given data grid and processor configuration. For a topography with over 50% land points, the benefits of irregular blocking are significant.


Fig. 5. Regular blocking for oceans.

Using the algorithm below for a given processor configuration and topography, we can calculate an optimized processor grid. The optimized processor grid would have the maximum number of processors that can be used while still including all water points. The best solution results in a small sub-grid for a processor. This results in fewer iterations per do loop and a corresponding lower execution time.

With a lower communication overhead and computation demand, irregular blocking will give better performance results than regular blocking. The algorithm can be used for any given topography. It should be noted that by altering the topography, such as adding extra columns or rows of land along the edges of the grid, optimal solutions can be obtained which are not possible for the original topography. An altered topography can have more factors for processor grid configurations. For example, in Fig. 7 we have used a 12×13 processor grid for a 288×260 data grid. The dimensions of each sub-grid are 24×20, or 480 grid points per processor. For regular blocking, a processor grid of 8×16 is used for a 288×256 data grid. The dimensions of each sub-grid are 36×16, or 576 grid points per processor. The size of the sub-grid for irregular blocking is smaller by 96 points despite the use of a larger data grid.

Fig. 6. Irregular blocking for oceans.

Algorithm for optimized processor grid
– Read topography mask
– Set procnum to maximum # of processors
– Do while procnum >= # of real processors
  – Find all factors for procnum and the data grid, i.e. compute numbers x and y such that x*y = procnum and the dimensions of the data grid are divisible by x and y
  – Do for each factor
    – Compute number of land procs
    – If (# of land procs + # of real procs >= procnum)
      – then save factors and exit from do loops
    – Endif
  – Enddo
  – Decrement procnum
– Enddo

The purpose of the above algorithm is to obtain two numbers x and y such that the number of x·y blocks containing water points does not exceed the number of physical processors, so that all water points are allocated to a processor. The algorithm begins by reading a binary topography mask. Procnum is assigned a large number based on the number of physical processors. For example, if 64 physical processors are used, then procnum can be set to 2000. The number 2000 is calculated based on estimated maximum values for x and y. In each iteration of the loop, factors x and y are computed for a procnum value. For each set of factors, the allocation of water points to at least one processor is verified.


Fig. 7. Optimal processor grid for 128 processors.

This is accomplished by comparing the sum of the number of processors with only land points and the number of physical processors with procnum. If procnum is less than the sum, optimized factors have been obtained. Otherwise, procnum is decremented and iterations continue till either optimized factors are obtained or procnum is equal to the number of physical processors.
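A compact, runnable sketch of this search is given below. It assumes a logical mask water(nx, ny) that is .true. at water points, and it uses 4*nreal as the upper bound on the virtual processor count; both are illustrative assumptions rather than the authors' actual code (the paper's own example starts the search from 2000 for 64 processors).

   subroutine find_processor_grid(water, nx, ny, nreal, px, py)
     ! Search downward from an upper bound on the virtual processor count for
     ! factors px*py = procnum such that, after discarding all-land blocks,
     ! no more than nreal physical processors are needed.
     implicit none
     integer, intent(in)  :: nx, ny, nreal
     logical, intent(in)  :: water(nx, ny)          ! .true. at water points
     integer, intent(out) :: px, py
     integer :: procnum, x, y, nland, i, j, i0, i1, j0, j1

     do procnum = 4 * nreal, nreal, -1              ! assumed upper bound
        do x = 1, procnum
           if (mod(procnum, x) /= 0) cycle
           y = procnum / x
           if (mod(nx, x) /= 0 .or. mod(ny, y) /= 0) cycle
           nland = 0
           do j = 1, y                              ! count all-land blocks
              do i = 1, x
                 i0 = (i-1)*(nx/x) + 1;  i1 = i*(nx/x)
                 j0 = (j-1)*(ny/y) + 1;  j1 = j*(ny/y)
                 if (.not. any(water(i0:i1, j0:j1))) nland = nland + 1
              end do
           end do
           if (procnum - nland <= nreal) then       ! water blocks fit on real PEs
              px = x;  py = y
              return
           end if
        end do
     end do
     px = 0;  py = 0                                ! no feasible factorization found
   end subroutine find_processor_grid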

The differences between the two blocking schemes are in inter-processor communication and collection/distribution of global data. Shadow data is not exchanged with a neighbour land processor, since it does not exist. Processor 0 does not transmit or receive global data from a land processor. Table 2 below shows the improvement of optimal solutions for larger processor grids for a 288×260 data grid.

With larger processor configurations, the number of land processors increases and a high resolution processor grid can be used. The % increase represents the percentage of processors in excess of the real number of processors. For the 512 processor configuration, we use a 36×20 grid, or 720 processors. The number of processors in excess of 512 processors is 208. The % increase will reach a stable value with more processors and should be approximately the same as the percentage of land for the data grid.

Table 2
Optimized processor grids for ocean grid

No. of processors   Processor grid dimensions   No. of land processors   % increase in processors
64                  8×9                         8                        12.5
128                 12×13                       33                       21.9
256                 32×10                       67                       25.0
512                 36×20                       220                      40.6
1024                72×20                       466                      40.6


The irregular decomposition technique yields significant benefits when high resolution processor and data grids are used.

2.4. Execution timing analysis

We have analyzed the time for execution (excluding I/O) of the model. The time for I/O is approximately 30% of the total time to complete a one month simulation. I/O time has not been modelled here and the focus is on execution time alone. We ran experiments with the ocean model coupled to a simplified AGCM. The processing time without overlapping computation and communication, t_p, for a timestep is

t_p = t_comm + t_comp

where t_comm and t_comp are the communication and computation times per timestep, respectively. A perfectly balanced load is assumed for calculating t_p. This is possible when an equal number of grid points can be assigned to every processor. Due to small differences in the computation or communication times for processors, the processing time t_p will not be identical for every processor. The computation time is an approximately linear function of the number of grid points, n_g. This was observed by measuring the computation time for several grid sizes on a single processor of the Cray T3D. The computation time by component is described in Section 1.3. n_g is based on the number of processors used and the dimensions of the data grid. For a balanced load, n_g must be the same in all processors. This is true when the same computations are performed for every grid point and the values at grid points do not affect the type of computation performed. The communication time, t_comm, is based on inter-processor bandwidth and latency time.

The latency time for the Cray T3D is approximately 2.5 μs [12]. The time to exchange data is a function of the bandwidth, latency and the number of bytes transmitted and received. Data can be exchanged using PVM [8] send and receive calls. On the Cray T3D, shared memory calls provide a higher bandwidth than PVM send and receive calls. We have used shared memory calls in the model and have included code for communication using PVM (version 3.3.7) or MPI (version 1.0) [6]. In our experiments, we found the Cray shmem protocol to be the fastest, followed by MPI and PVM. Table 3 lists the execution times for a 10 day run using 32 and 64 processors for a 288×256×8 grid.

Table 3
Execution time (s) using PVM, MPI, and shmem

Number of processors   Cray shmem   MPI   PVM
32                     235          280   477
64                     121          161   552

2.4.1. Striping vs. regular blocking
The three components of communication time are latency time, transmission time and gather/scatter time.


The latency time cannot be changed by a program; however, the transmission time can be reduced by sending data packets which are near the peak of the bandwidth curve. For example, on the Cray T3D the latency time is in μs, while the transmission time is in tens of μs for packets of 40 Kbytes or less and in hundreds of μs for packets up to 100 Kbytes. We note that in [12], Numrich has used a similar model to assess computation cost.

In the striping grid partitioning technique, data can be sent by providing an address and the number of bytes to be sent, since shadow data are stored in contiguous locations. In the regular blocking grid partitioning technique, pre-processing must be performed prior to sending data and after receiving data (the gather/scatter operation) [7]. This additional overhead for exchanging data must be added to the latency time and transmission time. The communication time t_comm to exchange shadow data for the striping and regular blocking techniques can be defined as

t_comm = t_lat + t_tran + t_gs

where t_lat, t_tran and t_gs are the latency time, the transmission time and the gather/scatter time, respectively. In striping,

t_lat = 2 t_l,   t_tran = 2 t_x,   t_gs = 0

where t_l is the latency time to transfer data from the application to a system message buffer and t_x is the time to transmit a row of x bytes. In regular blocking,

t_lat = 4 t_l,   t_tran = 2 (t_xp + t_yq),   t_gs ≠ 0

where t_xp and t_yq are the times to transmit xp and yq bytes, respectively. If p and q are the dimensions of the processor configuration in the x and y axes, then for a grid with dimensions x × y, xp = x/p and yq = y/q. t_gs, t_xp and t_yq vary depending on the processor configuration and grid dimensions. In general, striping is better than regular blocking when the following is true:

2 t_x < 2 (t_l + t_xp + t_yq) + t_gs
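To see where this condition comes from, substitute the striping and regular blocking expressions into t_comm = t_lat + t_tran + t_gs: striping costs 2 t_l + 2 t_x, regular blocking costs 4 t_l + 2 (t_xp + t_yq) + t_gs, and requiring the former to be smaller gives

2 t_l + 2 t_x < 4 t_l + 2 (t_xp + t_yq) + t_gs,   i.e.   2 t_x < 2 (t_l + t_xp + t_yq) + t_gs.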

2.4.2. Regular blocking vs. irregular blocking
Irregular blocking is similar to regular blocking and includes preprocessing to determine land and water processors (Figs. 5 and 6). Processor 0 reads the topography mask and builds a table consisting of real processor numbers, virtual processor numbers and a flag indicating a water or land processor.

This table is broadcast to all processors. A virtual processor number is based on the location of a processor on the processor grid. A real processor number is the T3D PE number associated with the virtual processor number. Shadow data is exchanged with four neighbor processors to the north, south, east and west. An exchange is not performed when the neighbor processor is a land processor. The benefits of irregular blocking are smaller sub-grid sizes for a fixed number of processors. With a smaller sub-grid, t_comp and t_comm will both be smaller, giving better performance. The experimental results below demonstrate the improvement in performance that can be expected with irregular blocking for the global topography.


3. Distribution and collection of data

For coupled model runs, there is a frequent exchange of forcing data between the ocean and the atmospheric models. In a sequential implementation, all forcing data is available on a single processor. In a parallel implementation, forcing data must be collected and distributed at each coupling interval. An all-to-one and one-to-all approach is used during coupling. This approach is used since our interpolation algorithm requires both complete ocean and atmospheric grids. A parallel interpolation scheme would eliminate the need to distribute and collect forcing data at a single node. The distribution and collection of data also occurs during I/O. When a run is initiated, data from a restart file is distributed to all processors. At the end of a run, data is collected for a restart file. During the run, data is collected for history files. For small processor configurations, the overhead is not excessive. With a large processor configuration (>64 processors), the overhead for collecting and distributing data increases rapidly. In our 256 processor implementation, approximately 10% of the execution time was spent in distributing and collecting data. We have used two approaches for data distribution and collection, a hub scheme and a recursive scheme.

These two schemes have been used on other platforms such as the Intel iPSC/860 and Touchstone Delta [20]. In the hub scheme, a single processor collects and distributes global data. This scheme is sequential since only one sub-grid can be sent or received from other processors at a time. An alternative scheme is the recursive scheme. The idea is to recursively halve the size of the sub-grid and transmit data in parallel when global data is distributed (Fig. 8). A global grid can be built in log(n) steps (where n is the number of processors) instead of n-1 steps for the hub scheme. The recursive scheme can be used for 1D and 2D grid partitionings (Fig. 9). A grid is recursively divided along the x and y dimensions for a 2D grid partitioning.

Fig. 8. Gather/scatter of global data for 1D decomposition.

Fig. 9. Gather/scatter of global data for 2D decomposition.
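A minimal sketch of such a recursive (tree) gather for the 1D case is shown below. It uses MPI point-to-point calls and assumes equal-sized contiguous sub-grids and a power-of-two processor count; it illustrates the idea rather than reproducing the model's shmem-based implementation.

   subroutine recursive_gather(local, nloc, global, rank, nprocs)
     ! Recursive gather of equal-sized contiguous sub-grids onto PE 0.
     ! At step 2**k, every PE whose k-th rank bit is set sends everything it
     ! holds to the PE 2**k below it, so PE 0 assembles the full grid after
     ! log2(nprocs) steps instead of nprocs-1 sequential receives.
     use mpi
     implicit none
     integer, intent(in)  :: nloc, rank, nprocs      ! nprocs assumed a power of 2
     real,    intent(in)  :: local(nloc)
     real,    intent(out) :: global(nloc * nprocs)   ! significant on PE 0 only
     integer :: step, held, partner, ierr
     integer :: status(MPI_STATUS_SIZE)

     global(1:nloc) = local                          ! own block goes first
     held = nloc                                     ! words currently held
     step = 1
     do while (step < nprocs)
        if (iand(rank, step) /= 0) then              ! sender: k-th bit set
           partner = rank - step
           call MPI_SEND(global, held, MPI_REAL, partner, 0, MPI_COMM_WORLD, ierr)
           exit                                      ! this PE is done
        else if (rank + step < nprocs) then          ! receiver: append partner's blocks
           partner = rank + step
           call MPI_RECV(global(held+1), held, MPI_REAL, partner, 0, &
                         MPI_COMM_WORLD, status, ierr)
           held = 2 * held
        end if
        step = 2 * step
     end do
   end subroutine recursive_gather

The scatter used to distribute global data works in the reverse order, with processor 0 recursively splitting the grid and sending the upper half at each step.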

4. A cache performance model

A parallel implementation of the ocean model on the Cray T3D with 256 processors achieved a throughput equivalent to a gigaflop, or about 3.8 Mflops per processor, before optimization. The peak performance of the Alpha processor of the Cray T3D is 150 Mflops. The reason for the lower megaflop rate of the ocean model was assumed to be the small cache size. In order to determine the effect of cache size on performance, a cache model was built to calculate the hit percentage for a subroutine.

The primary cache consists of 256 lines [12]. Each line is allocated a 32 byte segment, or 4 words, of the cache. The cache is direct mapped. The cache address of any element is the address modulo 1024. An element is assumed to be an 8 byte field. The goal was to maximize the cache hit percentage. The example below shows the cache hit percentage for array elements. For every cache miss, a cache line is loaded with 4 words from main memory. This operation can take 21 or 30 clock periods depending on whether there is a page hit or miss in main memory.

Example:

      real a(37), b(37), c(37)
      integer i, j

      do j = 1, 5
         do i = 1, 37
            c(i) = a(i)*b(i)
         enddo
      enddo

Cache table

Name   Hits   Misses
a      148    37
b      148    37

In the first iteration of the outer loop in the example above, all references to elements of a and b result in cache misses. For all other iterations of the outer loop, the references to elements of a and b are in cache. Four write buffers, each one cache line long, are used for memory writes.

The cache hit rate cannot be estimated for a complex program by observation alone. Several cache performance tools have been developed [4]. Some of the tools have a graphical user interface to examine cache performance and profile the cache hit percentage by variable. The performance tool we developed was used to simulate cache performance for multiple processor configurations and various cache sizes. Instead of using a multiprocessor simulator, we replaced the subroutine to be profiled with generated code. A function call statement was made to a cache simulator with an address and a string as arguments. The cache simulator used the arguments to update a cache table. The generated code was included with the rest of the model code intact. A single timestep of the simulation was executed to generate the cache table.

Fig. 10. Execution time for 744 timesteps.

Input Subroutine → Code Generator → Output Profile Code
Output Profile Code → Cache Table

Using the above technique, a single subroutine of a complex application can be studied in isolation without profiling the entire application. A multiprocessor simulator is not required since generated code is used with the application to build the cache table. The cache table is an array of 1024 unsigned long integers. Each element of the cache table contains a memory address. The cache table is maintained by updating table entries as needed based on the memory references generated. A cache line is the unit of replacement for the table. Replacements are made starting from a cache line boundary.
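A minimal sketch of the table update such a simulator performs is shown below. It mirrors the cache described above (direct mapped, 1024 word slots, 4-word lines) but uses word addresses for simplicity; it is an illustration, not the authors' actual tool.

   subroutine cache_access(addr, table, hits, misses)
     ! Direct-mapped cache bookkeeping: 1024 one-word slots grouped into
     ! 256 four-word lines.  addr is a word address; a reference hits if the
     ! word currently cached in its slot has the same address, otherwise the
     ! whole 4-word line containing addr is (re)loaded.
     implicit none
     integer(kind=8), intent(in)    :: addr
     integer(kind=8), intent(inout) :: table(0:1023)
     integer,         intent(inout) :: hits, misses
     integer(kind=8) :: slot, line_base, w

     slot = mod(addr, 1024_8)                 ! direct mapping: address modulo 1024
     if (table(slot) == addr) then
        hits = hits + 1
     else
        misses = misses + 1
        line_base = addr - mod(addr, 4_8)     ! replacement starts at a line boundary
        do w = 0, 3
           table(mod(line_base + w, 1024_8)) = line_base + w
        end do
     end if
   end subroutine cache_access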

After running a Cray performance tool to analyze the performance of the ocean model, one of the subroutines, update, was found to be the most time consuming subroutine of the ocean model. It consumed approximately 15% of the total execution time excluding I/O. All the remaining subroutines consumed less time. The cache performance tool was used to analyze the update subroutine of the model alone. The function of the update subroutine is to calculate a new state of the ocean after the computation of partial derivatives from hydrodynamic calculations. It consists of a single outer loop over each layer of the ocean. The grid point values for layer thickness, horizontal velocities and tracers are revised. Two inner loop divisions are performed. Since update uses the most CPU time in our analysis, we decided to focus on this subroutine. Similar approaches can be used to analyze other program components.

Fig. 11. Interprocessor communication time for 744 timesteps.

5. Experimental results

Our results from experiments with grid partitioning techniques, distribution and collection of data, and a cache performance model are described in this section. In all experiments, we have used the ocean model coupled to a simplified AGCM. The first set of results are timings for various combinations of processor configurations and grid partitioning techniques. We have plotted results for three parameters: execution time excluding I/O, communication time and speed up. The second set of results compares the hub and recursive schemes to distribute and collect data for 1D and 2D partitioning. The final results are from the cache performance model for multiple cache sizes and processor configurations.

Fig. 12. Speed up for 744 timesteps using a 288×256×8 grid.


5.1. Grid partitioning

The results for grid partitioning are based on a 1 month simulation (744 timesteps). The number of processors used ranges from 64 to 512. A 288×256×8 grid was used for all experiments. No results are plotted for striping with 512 processors since the maximum number of processors that can be used with striping for the 288×256 grid is 256.

Irregular blocking gives the best results and striping gives the worst results of the three grid partitioning techniques (Figs. 10–12). The reason for the poor performance of striping is the high communication overhead for inter-processor communication. For example, striping uses 121 seconds and regular blocking uses 62 seconds for inter-processor communication when a 64 processor configuration is used. Clearly, the volume of communication affects performance, as demonstrated by the difference in timing results. The benefit of communicating with only two neighbours is not sufficient to overcome the lower communication overhead associated with regular blocking. This is especially true for large grids. The superior performance of irregular blocking can be explained by the smaller communication overhead and sub-grid size. With higher processor configurations, the difference between sub-grid sizes for regular blocking and irregular blocking decreases and therefore the benefits of irregular blocking are correspondingly lower. The highest execution time difference between the two techniques is for the 64 processor configuration. A larger problem size will give better results for irregular blocking.

Fig. 13. Execution time for 744 timesteps using 256 processors.


Fig. 14. Recursive versus hub scheme for gather/scatter in 1D.

In our next experiment (Fig. 13), we evaluated processor grid dimensions for a particular processor configuration. A processor grid dimension must be selected for a fixed number of processors. We ran seven tests with a 256 processor configuration. The processor grid dimensions ranged from 128×2 to 2×128. The 128×2 and 8×32 configurations took 185 and 113 seconds, respectively, for a 1 month simulation. There is a difference of 72 seconds between the two configurations. The poor performance of the 128×2 configuration can be explained by the long columns of the sub-grids. The time to gather/scatter data from long columns during inter-processor communication is responsible for the high execution time.

5.2. Recursive vs. hub

In Figs. 14 and 15, the results from the recursive and hub schemes are compared for a 288×256×8 grid. The recursive scheme scales well and gives better performance than the hub scheme at higher processor configurations. The disadvantage of the recursive scheme is the added communication of data through multiple processors.

Fig. 15. Recursive versus hub scheme for gather/scatter in 2D.

If a grid of dimension x × y is to be distributed over a processor grid p × q, then the time to distribute the grid using the hub scheme is

( t_l + t_tran(xy/pq) + t_g ) · pq

where t_tran is the transmission time for xy/pq bytes and t_g is the time to build a block of size xy/pq. The time to distribute the grid using the recursive scheme is

Σ_{i=1}^{dim} [ t_tran(xy/2^i) + t_l + t_gi ]

where dim is the dimension of the processor grid and t_gi is the time to build a block of size xy/2^i bytes. The results for both grids are better when the recursive scheme is used. The rate of increase of communication time is higher with the hub scheme than with the recursive scheme for regular blocking.

5.3. Cache performance model

In this experiment, the cache size was varied from 8 Kbytes to 1 Mbyte and the number of processors was varied from 64 to 256. In Fig. 16, the hit rate versus cache size is plotted for three processor configurations for a subroutine (update) of the model. The cache size was varied by changing the size of the cache table in the code generator. The highest cache hit percentage of 61.6% was obtained for the 256 processor configuration with a 1 Mbyte cache. The lowest cache hit percentage of 8.22% was obtained for the 128 processor configuration with an 8 Kbyte cache. From the plot in Fig. 16, we can conclude that for a Cray T3D processor, the cache hit percentage for the update subroutine is a function of grid size, processor configuration and cache size.


Fig. 16. Cache hit rate percentage for a 288×256×3 grid.

Using the information provided by the cache performance model, the data layout was altered to group memory references and shorter loops were used. With these changes, the performance of the update routine improved to about 8.2 Mflops per processor from the earlier 5.4 Mflops per processor for a 64 processor run. Some dependencies were eliminated and an inner loop division operation was pre-computed for an iteration of the loop. These changes were made for the parallel version of the code alone. By using pre-processor directives, a single source was maintained with different sections of code for the parallel and vector versions. With these modifications, the execution time for the update subroutine fell from 15.7 seconds to 10.5 seconds, i.e. an improvement of approximately 33%. There is an obvious benefit to having a higher cache hit rate. However, a higher cache hit rate does not always result in lower execution time. There are other factors mentioned earlier, such as dependencies and loop optimization, which also affect performance.
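As a generic illustration of the inner-loop division precomputation mentioned above (the variable names and the transport/velocity relationship are hypothetical, not the actual update code):

   subroutine momentum_from_transport(uh, vh, h, u, v, nx)
     ! Computing the reciprocal of h once per point replaces two divides
     ! per iteration with one divide and two multiplies.
     implicit none
     integer, intent(in)  :: nx
     real,    intent(in)  :: uh(nx), vh(nx), h(nx)
     real,    intent(out) :: u(nx), v(nx)
     integer :: i
     real    :: rh

     do i = 1, nx
        rh   = 1.0 / h(i)          ! single divide, precomputed
        u(i) = uh(i) * rh
        v(i) = vh(i) * rh
     end do
   end subroutine momentum_from_transport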

6. Conclusions

The performance of a parallel coupled ocean model on the Cray T3D is based on the grid partitioning technique, the global data collection/distribution algorithm and the optimization of memory access.


The irregular blocking technique is the most effective grid partitioning technique for the ocean grid, with lower execution and communication times than regular blocking and striping. An efficient algorithm to distribute and collect the global data lowers the communication overhead. Optimizing memory accesses by restructuring code to obtain a higher cache hit rate improves computation performance. Future developments include a parallel ocean and atmosphere model with parallel communication between the models.

References

[1] J.L. Bickham, Parallel ocean modelling using Glenda, Master's thesis, University of Southern Mississippi, Hattiesburg, MS, 1995.
[2] R. Bleck et al., A comparison of data-parallel and message passing versions of the Miami Isopycnic Coordinate Ocean Model (MICOM), Parallel Comput. 21 (1995) 1695–1720.
[3] Cray Research, Cray MPP Fortran Reference Manual, Mendota Heights, MN, 1993.
[4] J. Dongarra, O. Brewer, J.A. Kohl, S. Fineberg, A tool to aid in the design, implementation and understanding of matrix algorithms for parallel processors, J. Parallel Distrib. Comput. 9 (1990) 185–202.
[5] G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, D. Walker, Solving Problems on Concurrent Processors, vol. 1, Prentice-Hall, Englewood Cliffs, NJ, 1988.
[6] W. Gropp, E. Lusk, A. Skjellum, Using MPI: Portable Parallel Programming with the Message Passing Interface, MIT Press, Cambridge, MA, 1994.
[7] J.L. Gustafson, G.R. Montry, R.E. Benner, Development of parallel methods for a 1024 processor hypercube, SIAM J. Sci. Stat. Comput. 9 (1988) 609–638.
[8] D. Morton, K. Wang, O.D. Ogbe, Lessons learned in porting Fortran/PVM code to the Cray T3D, IEEE Parallel Distrib. Technol. (1995) 4–11.
[9] M. Konchady, Implementation of a parallel coupled climate model, Proceedings of the Second International Conference on High Performance Computing, New Delhi, India, 1995.
[10] M. Konchady, A. Sood, Analysis of grid partitioning techniques for an ocean model, Proceedings of the Seventh High Performance Computing Symposium, Montreal, Canada, 1995, pp. 407–419.
[11] C.R. Mechoso et al., Parallelization and distribution of a coupled atmosphere–ocean general climate model, Mon. Wea. Rev. 121 (1991) 2062–2076.
[12] R. Numrich, P.L. Springer, J.C. Peterson, Measurement of the communication rates on the Cray T3D interprocessor network, Proceedings of the High Performance Computing Networks Symposium, Munich, Germany, 1994.
[13] L. DeRose, K. Gallivan, E. Gallopoulos, A. Navara, Parallel ocean circulation modelling on Cedar, in: Proceedings of the 5th SIAM Conference on Parallel Processing for Scientific Computing, 1992, pp. 401–405.
[14] P. Schopf, M. Cane, On equatorial dynamics, mixed layer physics and SST, J. Phys. Oceanogr. (1983) 917–935.
[15] P. Schopf, Poseidon Ocean Model User's Guide, Version 4, Goddard Space Flight Center, Greenbelt, MD, 1994.
[16] R. Shapiro, Smoothing, filtering and boundary effects, Rev. Geophys. Space Phys. 8 (1970) 359–387.
[17] J.A. Semtner, R.M. Chervin, A simulation of the global ocean circulation with resolved eddies, J. Geophys. Res. 93 (C12) (1988) 15502–15522.
[18] J.A. Semtner, R.M. Chervin, Ocean general circulation from a global eddy resolving model, J. Geophys. Res. 97 (1992) 5493–5550.
[19] R.D. Smith, J.K. Dukowicz, New numerical methods for ocean modelling on parallel computers, Los Alamos Science, Los Alamos, NM, 1993.
[20] S. Takkella, S. Seidel, Broadcast and complete exchange algorithms for mesh topologies, Technical Report CS-TR93-04, Michigan Technological University, Houghton, MI, 1996.
[21] K.E. Trenberth, Climate System Modelling, Cambridge University Press, Cambridge, UK, 1992.