
Algorithmic Optimizations of a Conjugate Gradient

Solver on Shared Memory Architectures

Henrik Löf and Jarmo Rantakokko

October 28, 2004

Abstract

OpenMP is an architecture-independent language for programming in the shared memory model. OpenMP is designed to be simple and powerful in terms of programming abstractions. Unfortunately, the architecture-independent abstractions sometimes come at the price of low parallel performance. This is especially true for applications with unstructured data access patterns running on distributed shared memory (DSM) systems, where proper data distribution and algorithmic optimizations play a vital role for performance. In this article we have investigated ways of improving the performance of an industrial-class conjugate gradient (CG) solver, implemented in OpenMP and running on two types of shared memory systems.

We have evaluated bandwidth minimization, graph partitioning and reformulations of the original algorithm that reduce the number of global barriers. Through a detailed analysis of barrier time and memory system performance we found that bandwidth minimization is the most important optimization, reducing both L2 misses and remote memory accesses. On a uniform memory system we obtain perfect scaling. On a NUMA system the performance is significantly improved by the algorithmic optimizations, leaving the system-dependent global reduction operations as the bottleneck.

1 Introduction

Many parallel applications exhibit unstructured and/or highly irregular patterns of communication. Implementing such applications often requires substantial software engineering effort regardless of the parallel programming model used. This effort can, however, be lowered by using a shared memory model such as OpenMP, since in this model the orchestration of the communication is handled implicitly, hidden from the programmer. Unfortunately, this very attractive feature of OpenMP often comes at the price of lower parallel performance compared to other, more explicit programming models such as message passing.


OpenMP is designed to be a simple and architecture-independent tool, i.e., it uses a uniform shared memory model. However, most large-scale shared memory systems are of cc-NUMA type. The OpenMP model still works well for applications with structured and local data dependencies [14], but for applications with unstructured data dependencies special care has to be taken with the algorithm.

An example of an important scientific application exhibiting an unstructured pattern of communication is an iterative solver for large sparse systems of equations. Many algorithmic optimizations have been proposed to increase the performance of iterative solvers, most of which have been evaluated in a message passing programming model. In this study, we want to evaluate these algorithmic optimizations on a conjugate gradient (CG) solver implemented in OpenMP. We want to investigate whether these optimizations can be used to increase performance on both uniform (SMP) and non-uniform (cc-NUMA) shared memory architectures. Can we get high performance using algorithmic optimizations and still keep the software engineering effort low using the shared memory model?

We use a real industrial data set from a finite element discretization of the Maxwell equations in three dimensions on a fighter jet geometry. The code was run on a Sun Fire 15000 (SF15K) cc-NUMA server [7] and on a Sun Enterprise 10000 (E10K) SMP, a uniform memory access system. The algorithmic optimizations we have studied are reformulation of the algorithm for reducing global barriers [8], bandwidth minimization of the iteration matrix [12], and graph partitioning using the MeTiS tool [15]. We have also used a page migration feature of Solaris 9 [21] to improve the data distribution of the application.

The effects of the different optimizations were quantified. The parallel overhead is dominated by the poor locality properties this application exhibits. Hence, bandwidth minimization turned out to give the highest performance improvements for both uniform and non-uniform architectures. Load balance turned out to be of less importance, and a simple linear partitioner in combination with bandwidth minimization gave the best results. We saw no need for more advanced partitioning such as graph partitioning. Reducing global barriers had no significant effect on these small system sizes (28 threads) but gave some improvement in combination with bandwidth minimization (when the load imbalance was minimized).

2 Parallelizing the CG-algorithm using OpenMP

When parallelizing iterative solvers like the CG algorithm [1] (see Algorithm 1) there are three basic types of operations to consider:

Algorithm 1 Method of Conjugate Gradients

Given an initial guess x0, compute r0 = b − Ax0 and set p0 = r0.
For k = 0, 1, . . .
  (1) Compute and store Apk
  (2) Compute <pk, Apk>
  (3) αk = <rk, rk> / <pk, Apk>
  (4) xk+1 = xk + αk pk
  (5) Compute rk+1 = rk − αk Apk
  (6) Compute <rk+1, rk+1>
  (7) βk = <rk+1, rk+1> / <rk, rk>
  (8) Compute pk+1 = rk+1 + βk pk

Vector operations (Lines (4), (5) and (8) of Algorithm 1) These operations are trivially parallelized, and they require very large vectors to amortize the parallel overhead. In OpenMP such an operation translates into a simple parallelized loop (see the sketch after this list).

Inner products (Lines (2) and (6) of Algorithm 1) These operations map well onto OpenMP reductions, and they are inherently serial in the sense that they introduce global barriers.

Sparse matrix-vector product (SpMxV) (Line (1) of Algorithm 1) This operation stands for the significant part of the execution time. A straightforward parallelization consists of assigning rows or columns of the matrix to different threads using a loop directive on the outermost loop.
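As a minimal sketch of the first item above (our illustration, not code taken from the solver), a CG vector update such as x = x + αp maps onto a single work-shared loop in C:

/* Sketch only: a CG vector update (x = x + alpha*p) as a simple
   parallelized loop. The routine is assumed to be called from inside
   an existing parallel region, as in Algorithms 4-6 in Appendix A. */
void axpy(double alpha, const double *p, double *x, int n)
{
    int i;
    #pragma omp for
    for (i = 0; i < n; i++)
        x[i] += alpha * p[i];
    /* implicit barrier here, unless the loop carries a nowait clause */
}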

The main sources of parallel overhead in any shared memory implementation are global barriers, true/false sharing effects and load imbalance. On a NUMA system additional overheads come from non-optimal data placement, which will trigger remote memory accesses. In the case of the CG algorithm and OpenMP, there are no major false sharing effects¹, as most stores are primarily stride-1 accesses which map very well onto the static partitions generated by OpenMP loop directives.

When it comes to global barriers, we can see from Algorithm 4 (Appendix A) that we have several flow dependencies that must be protected by global barriers. In OpenMP we have implicit barriers at the end of the workshare directives. In some cases these can be removed using the NOWAIT clause.

¹There might still be false sharing effects in the OpenMP runtime system. As an example, the final stage of a reduction, which must be performed serially, should use padded arrays to minimize false sharing.


Also, in OpenMP an inner product requires two barriers: one to clear the reduction variable and one at the end of the reduction. In total, we found that a correct OpenMP implementation of the basic CG algorithm needs 7 barriers per iteration, see Algorithm 4.
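The following self-contained C sketch (our illustration, not the solver's code) shows where the two barriers of an OpenMP inner product appear: one when the shared accumulator is cleared and one at the end of the reduction loop.

#include <stdio.h>

int main(void)
{
    enum { N = 1000 };
    static double p[N], q[N];
    double dot = 0.0;
    int i;

    for (i = 0; i < N; i++) { p[i] = 1.0; q[i] = 2.0; }

    #pragma omp parallel shared(p, q, dot) private(i)
    {
        #pragma omp single
        dot = 0.0;                /* barrier 1: implicit barrier at END SINGLE */

        #pragma omp for reduction(+:dot)
        for (i = 0; i < N; i++)
            dot += p[i] * q[i];   /* barrier 2: implicit barrier ending the loop */
    }

    printf("<p,q> = %f\n", dot);  /* 2000.0 for this data */
    return 0;
}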

Finally, load balancing is good for the vector operations and the reductions if they are statically partitioned. The load balance in the SpMxV is, however, dependent on the structure of the sparse matrix and the partitioning of the data, i.e., how the OpenMP threads are scheduled in the outer loop of the SpMxV, see Algorithm 2. Even though we assign an equal number of rows to each thread, load balance may still be poor since the number of non-zero elements per row varies.

3 Algorithmic optimizations

We can improve the performance of a standard implementation using some well known optimizations. To address the parallel overheads we have studied bandwidth minimization (true sharing), reformulation of the algorithm (global barriers), and graph partitioning (load balance).

3.1 Cache-aware optimizations to the SpMxV

The SpMxV has several performance problems on modern microprocessors. First of all, the operation is memory-bound, since each element in the result vector v requires more memory operations than floating-point operations, see Algorithm 2. The inner loop can be very short and varies in size between the rows. Also, the spatial locality in the right-hand side (RHS) vector x is very bad due to the indirect addressing through the column index vector col. The spatial locality for the row and column vectors is however good, since they exhibit perfect stride-1 access patterns. The distribution of elements within each row or column of the matrix determines the amount of reuse present in the cache blocks of the RHS x. In the case of FEM, this pattern is determined by the type and numbering of the elements and also by the geometry of the problem. Furthermore, the number of non-zero elements in each row or column is bounded by how many elements geometrically couple in physical space. For 3-dimensional problems, a common number of non-zeros per row or column is about 20. This small number of elements and the irregularity in loop length make it hard for an optimizing compiler to use standard optimizations such as blocking, unrolling and tiling on the inner loop.

Using the fact that any sparse matrix can be interpreted as an adjacency matrix of a corresponding graph, we can apply graph theoretical methods to improve the locality of the SpMxV. Remember that the spatial locality of the RHS accesses in the SpMxV will probably increase if the matrix contains large dense blocks, since we can then reuse more elements of the cached block.


Algorithm 2 OpenMP implementation in C of a sparse matrix-vector product for the compressed sparse row (CSR) format.

void spmxv(const double *val, const int *col,
           const int *row, const double *x,
           double *v, int nrows)
{
    register int i, j;
    register double d0;

    /* Work-shared over the rows; nowait removes the implicit barrier
       at the end of the loop. Called from inside a parallel region. */
    #pragma omp for private(i,j,d0) nowait
    for (i = 0; i < nrows; i++)
    {
        d0 = 0.0;
        /* Row i occupies val[row[i] .. row[i+1]-1]; col[j] gives the
           column index used to gather from the RHS vector x. */
        for (j = row[i]; j < row[i+1]; j++)
            d0 += val[j] * x[col[j]];
        v[i] = d0;
    }
}

Unfortunately, many applications do not exhibit large dense blocks, but the matrix can be re-ordered to maximize the number of such blocks.

One way of finding dense blocks is to minimize the bandwidth of the matrix using graph theoretical methods. This has been successfully employed in several previous studies for uniform memory systems. In a DSM or NUMA setting, the price of a cache miss can be much higher due to the possibility of the miss being served by remote memory. Hence, we would expect bandwidth minimization to be efficient on non-uniform architectures as well. In this study we use an optimization of the standard Reverse Cuthill-McKee (RCM) method [9] developed by Gibbs, Poole and Stockmeyer (GPS). The GPS method gives minimizations of equal quality in less time, as shown in [12].
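To make the reordering step concrete, the following C sketch shows how a symmetric permutation, such as one produced by RCM or GPS, could be applied to a CSR matrix. The routine and its argument conventions (perm[new] = old row index, iperm its inverse) are our illustration, not code from the solver.

/* Apply a symmetric permutation to a CSR matrix (illustrative sketch).
   perm[new] gives the old index of the row placed at position new;
   iperm[old] gives the new index of old row/column old. */
void csr_permute(int n, const int *row, const int *col, const double *val,
                 const int *perm, const int *iperm,
                 int *prow, int *pcol, double *pval)
{
    int i, j, k = 0;

    prow[0] = 0;
    for (i = 0; i < n; i++) {
        int old = perm[i];                   /* new row i is old row perm[i] */
        for (j = row[old]; j < row[old + 1]; j++) {
            pcol[k] = iperm[col[j]];         /* relabel the column index */
            pval[k] = val[j];
            k++;
        }
        prow[i + 1] = k;
        /* note: column indices within a row are no longer sorted here */
    }
}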

3.2 Reducing the number of global barriers

The original algorithm requires 7 barriers in total to be correct. Two barriers can immediately be removed by making private copies of the global variables α and β. Moreover, we can remove two barriers by interleaving the zeroing of one reduction variable with the reduction of the other variable and vice versa, exploiting the reduction barriers. This gives an optimized algorithm with only three barriers per iteration, see Algorithm 5 (Appendix A). In this algorithm we assume that the runtime system will produce identical partitions for all STATIC schedules.

Furthermore, we can use the s-step formulation of the CG algorithm to remove one more barrier. Originally developed in a message passing environment, this optimization tries to reduce the number of inner products by computing several CG iterations in parallel. Using the identity

<Apk, pk> = <Ark, rk> − (βk−1/αk−1) <rk, rk>


Algorithm 3 S-step formulation of CG

Given an initial guess x0, compute r0 = b − Ax0 and set p0 = r0.
Compute Ar0, set β−1 = 0 and compute α0 = <r0, r0> / <r0, Ar0>.
For k = 1, 2, . . .
  (1) pk = rk + βk−1 pk−1
  (2) Apk = Ark + βk−1 Apk−1
  (3) xk+1 = xk + αk pk
  (4) rk+1 = rk − αk Apk
  (5) Compute and store Ark+1
  (6) Compute <Ark+1, rk+1> and <rk+1, rk+1>
  (7) βk = <rk+1, rk+1> / <rk, rk>
  (8) αk+1 = <rk+1, rk+1> / (<Ark+1, rk+1> − (βk/αk) <rk, rk>)

Chronopoulos et al. formed this s-step (s = 1) version of the CG algorithm, see Algorithm 3. Here, an extra vector operation is introduced to remove one inner product, hence also removing one barrier and giving a total of two barriers per iteration (see Algorithm 6, Appendix A). However, the s-step formulation contains more arithmetical work than the standard formulation.

3.3 Improving load-balance

Graph partitioning is a standard way of minimizing interprocessor communication and producing load balanced computations for mesh-based applications², see [10]. It is typically used in implementations on systems with distributed memory, where the computation needs to be explicitly partitioned in some way. In a shared memory setting we can permute the matrix using a partition vector from a graph partitioner to create load balanced partitions. An outer loop is added to the SpMxV, indexed by the OpenMP thread ID, which fetches the partition boundaries each thread will use. A graph partitioner can potentially also affect the locality of the implementation, something we want to study further here.

²To only achieve load balance, one can use a sequence partitioning method for the rows.


As the partitioner minimizes the edge-cut, which is an approximation of the communication volume, the partitions could also decrease the number of remote accesses, although some communication will be local. We used the k-way multilevel recursive bisection routine METIS_PartGraphKway of the MeTiS graph partitioner, using heavy-edge matching and random matching with maximized connectivity, throughout this study.
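Our reading of the partitioned SpMxV described above is sketched below in C: each thread looks up its row range from a boundary array derived from the partition vector. The bound array and the routine name are illustrative assumptions, not the paper's code.

#include <omp.h>

/* SpMxV with explicit partition boundaries (sketch). Thread t computes
   rows bound[t] .. bound[t+1]-1; like Algorithm 2, the routine is assumed
   to be called from inside an existing parallel region. */
void spmxv_part(const double *val, const int *col, const int *row,
                const double *x, double *v, const int *bound)
{
    int tid = omp_get_thread_num();
    int i, j;

    for (i = bound[tid]; i < bound[tid + 1]; i++) {
        double d0 = 0.0;
        for (j = row[i]; j < row[i + 1]; j++)
            d0 += val[j] * x[col[j]];
        v[i] = d0;
    }
    /* no implicit barrier here; the caller synchronizes when needed */
}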

3.4 Related Work

The CG algorithm has been extensively used and studied in the numerical community [13, 1]. Several attempts have been made to increase the performance of the SpMxV [20, 5]. Apart from reordering, which we also use, other optimizations are register blocking to lower the number of load instructions [20, 24, 23], address precomputation [23] and cache blocking [24].

Most of the work on optimizing the parallel performance of the algorithm has been conducted in the distributed memory setting. Early work in the shared memory setting includes Brehm et al. [3], who studied the performance of the CG algorithm on earlier types of shared memory machines from vendors like Alliant, Sequent and Encore. In the DSM setting, Oliker et al. [19] come closest to our work. There, the authors showed that for a similar implementation of the CG algorithm, running on an SGI Origin 2000 [16] and using SGI's native shared memory pragmas, the performance was greatly improved when proper data distribution and bandwidth minimization were used. In fact, the performance of the shared memory implementation was as good as that of a pure message passing implementation. In this article we use data from a real industrial application and we also perform a more detailed analysis of the different overheads associated with the parallelization and the computer system. Our goal is to understand how to produce the most efficient implementation of a particular solver, rather than comparing different programming models for a given problem, which was the case in Oliker et al.

In the OpenMP/DSM setting there has been a lot of activity on deciding how to achieve data distribution on non-uniform architectures [18, 2, 4, 14]. These articles all discuss the CG benchmark of the NAS NPB3.0 OpenMP suite. Unfortunately, the data in this benchmark is not very realistic, as it is randomly generated and hence differs from real world data, which often has some kind of structure.

4 Methodology

In this section we describe additional details about the input data and the actual source code. We also explain the experimental setups and our experimental methodology to quantify the effects of the different optimizations.


4.1 Execution environment

The results were measured on a dedicated 32-CPU domain of an SF15K [7] and on a 32-processor Sun E10K server [6]. The SF15K is a cc-NUMA system consisting of 4-CPU nodes (CPU boards) connected together using a crossbar switch. On the system used, each UltraSPARC-IIICu [22] is clocked at 900 MHz and has a 64 KB L1 data cache and an 8 MB L2 cache, and a single CPU is capable of addressing up to 16 GB of memory at 2.4 GB/s. The domain had a total DRAM size of 32 GB, where each 4-CPU node had 4 GB of local memory on the board. The Sun E10K server is a uniform memory access machine of SMP type consisting of 32 UltraSPARC-II CPUs clocked at 400 MHz. The UltraSPARC-II has an 8 MB L2 cache and a 16 KB L1 cache. The total DRAM size is 32 GB. All source codes were compiled using the Sun ONE Studio 8 compiler.

4.2 The FEM solver

The studied solver solves the Maxwell equations in electro-magnetics around a fighter jet configuration [11] in 3D using the finite element method (FEM). The system of equations has 1794058 unknowns, the non-zero density is only 0.0009% (see Figure 1), and the solver converges in 48 iterations without any preconditioning. Figure 1(a) shows a histogram of the number of non-zeros per row. The maximum number of non-zeros over all rows was 36, and on average each row has about 15 non-zero elements. After initialization the resident working set of the application was about 500 MB.

The main part of the code was implemented in Fortran 90. The SpMxV, however, was implemented in C to allow more aggressive optimizations by the compiler. We do not run the entire FEM solver framework. Instead we have saved the iteration matrix to file and isolated the iterative solver. Hence, the matrix is read from file by the master thread which, by the first-touch principle, allocates all memory in the node that the master thread is running in. A similar situation might occur if we had used the entire FEM framework, as it is hard to do the FEM assembly process in a way that distributes data according to the first-touch principle. A more elegant way is to use whatever code you have for the assembly process and let the iterative solver worry about data distribution. As the sparse matrix is stored in the compressed sparse row (CSR) format, we cannot use some kind of pre-iteration to do data distribution. This is because the access pattern is determined indirectly from the CSR arrays, which need to be loaded into memory. Hence, page migration seems like a solution, given that the overhead of migration can be amortized. This is discussed further in Löf et al. [14, 17]. In the following experiments, the page migration directives (calls to madvise(3C)) were inserted at the beginning of the first iteration, just before the SpMxV, when running on the SF15K machine.


(a) Histogram of the non-zero structure of the input matrix (frequency vs. number of non-zeros per row). (b) Base. (c) Bandwidth minimization.

Figure 1: Structure of the iteration matrix used for the different cases of reorderings. The system has 1794058 unknowns.


By using migration, data is migrated, by redoing the first touch, according to the access pattern of the solver.
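As an illustration of the migration directives mentioned above, the sketch below shows how a memory range can be marked for migrate-on-next-touch on Solaris 9 using madvise(3C) with the MADV_ACCESS_LWP advice, as we understand that interface. Applying it to the CSR arrays just before the first SpMxV is our assumption of how such calls would be placed, not a listing from the actual code.

#include <sys/types.h>
#include <sys/mman.h>

/* Sketch: ask Solaris to treat the next thread touching this range as its
   main user, so the pages can be migrated to that thread's home node. */
static void advise_next_touch(void *addr, size_t len)
{
    (void) madvise((caddr_t) addr, len, MADV_ACCESS_LWP);
}

/* Example placement, just before the first parallel SpMxV:
     advise_next_touch(val, nnz * sizeof(double));
     advise_next_touch(col, nnz * sizeof(int));
     advise_next_touch(row, (nrows + 1) * sizeof(int));  */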

To control thread affinity, the Solaris scheduler was told to try to keep threads in their "home" node. This is especially important when we do data placement (migrate-on-next-touch mode). In practice, the Solaris 9 scheduler assigns the home node to spawned threads in a way such that the node is not overloaded. As an example, if the node has 4 CPUs, the first three threads created will be assigned to this node (this will be their home node) and the fourth thread will be assigned to another unloaded node if one is available. In our experiments we did not use any explicit CPU bindings. Hence, the scheduler can move threads away from their home node to create a better load balance from a system perspective.

4.3 Experimental setups

We constructed the following combinations of reorderings and algorithms:

Base This is the standard OpenMP implementation using the original input matrix, see Figure 1(b) and Algorithm 4.

S-Step This is an implementation of the S-Step algorithm, see Section 3.2 and Algorithm 6, using the original data, see Figure 1(b).

GPS Here, we use the standard CG algorithm but we have reordered the original matrix using the Gibbs-Poole-Stockmeyer (GPS) algorithm, see Section 3.1. The structure of the resulting reordered matrix can be found in Figure 1(c).

OPT Here, we use the optimized CG algorithm, see Algorithm 5.

MeTiS Here, we use the optimized algorithm, see Algorithm 5, and the matrix is reordered according to the partitions produced by MeTiS as described in Section 3.3.

GPS+OPT Here, we use the optimized CG algorithm, see Algorithm 5, on the matrix reordered by the GPS method.

GPS+MeTiS Here, the matrix reordered using the GPS method is reordered again according to the partitions produced by MeTiS as described in Section 3.3.

GPS+S-Step In this setting we use the S-Step algorithm on the matrix reordered by the GPS algorithm.


4.4 Evaluating the effect of the algorithmic optimizations

We have performed the experiments using hardware counters to quantify the effects of the optimizations. To quantify the load balance we measured the total CPU time spent in the mt_EndOfTask_Barrier and mt_Explicit_Barrier functions of the Sun OpenMP runtime system, called from within the main parallel region. This corresponds to the CPU time spent in global barriers in the implemented algorithm. The mt_Explicit_Barrier function is not due to any BARRIER directives; it is called by the Sun runtime system at the end of the SINGLE directive. We also measured the executed cycles and instructions to calculate the number of instructions per cycle (IPC), and we measured the impact on the memory subsystem by counting L2 cache references, misses and remote misses using the UltraSPARC-IIICu hardware counters [22]. This allows us to calculate L2 miss ratios as well as the distribution of the L2 misses into the remote or local categories. We also counted the number of completed floating-point instructions on the CPU. Using these counters we could calculate FLOP/s rates.

All detailed measurements were made using the Sun collect/analyzer profiling tool set, using 16 threads. In the case of the SF15K we also used page migration (migrate-on-next-touch mode) and 64 KB pages, see [17]. For the performance graphs, the execution times were measured using the Solaris high resolution timer gethrtime(3C). Each binary was run ten times and we chose the minimum to represent the execution time of the binary. The minimum time is used to reduce the variations caused by thread scheduling and intrusion of system daemons. We measure the arithmetical load imbalance γ, defined as the maximum number of non-zeros per partition divided by the mean number of non-zeros per partition. As the number of operations needed to produce an entry in the solution depends on the number of non-zeros in the partition, the arithmetical load balance can be described using the total number of non-zeros per partition. True load balance is a combination of the arithmetical load balance and high throughput in the computing hardware. Even though the partitions have a balanced number of non-zeros, the actual time to compute is very dependent on how much the CPU stalls on cache misses. This is especially true for non-uniform architectures. Also, in the case of OpenMP, non-uniform memory systems might affect the time spent in global barriers.
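For concreteness, the γ parameter can be computed directly from the CSR row pointer and a set of partition boundaries, as in the following C sketch. The bound array follows the same illustrative convention as the earlier sketches and is not code from the solver.

/* Arithmetical load imbalance: max non-zeros per partition divided by the
   mean non-zeros per partition (sketch). Partition p owns the rows
   bound[p] .. bound[p+1]-1 of the CSR matrix. */
double load_imbalance(const int *row, const int *bound, int nparts)
{
    long total = 0;
    int p, max_nnz = 0;

    for (p = 0; p < nparts; p++) {
        int nnz = row[bound[p + 1]] - row[bound[p]];
        if (nnz > max_nnz)
            max_nnz = nnz;
        total += nnz;
    }
    return (double) max_nnz / ((double) total / (double) nparts);
}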

5 Experimental Results

From Figures 2 and 3 we can see the effect of the different algorithmic optimizations. The bars are normalized to the corresponding Base case (the left-most bar in each group), which corresponds to the original code with no optimizations.


We can clearly see that bandwidth minimization gives the highest performance improvement on both the NUMA and the uniform SMP machine. Reordering with MeTiS has some effect on the SMP machine, but this declines with an increasing number of threads. Reducing the number of barriers has no significant effect alone, but in combination with bandwidth minimization the performance is further improved on both machines. Looking at Figures 4 and 5, we see that we can achieve perfect scalability on the SMP machine using the different algorithmic optimization techniques, while on the NUMA machine we only reach about 50% parallel efficiency on 28 threads. To understand the effects of the different optimization techniques we have made a detailed analysis of a 16-thread run (Figure 6, Table 1, Table 2). Figure 6 visualizes the arithmetical load balance. Here the "Reference" ring corresponds to a static equal-size partitioning of the non-zeros, i.e., perfect arithmetical load balance. This can however not be achieved, since it does not correspond to any valid reordering. Comparing the sectors of the innermost ring, the Base ring, at 12 and 6 o'clock in Figure 6, we can see that the static partitioning does not produce very balanced partitions for this realistic data set. This is also quantified by the γ parameter. A simple linear partitioning of the rows in the bandwidth minimized matrix gives a very good load balance, better than using graph partitioning. A primary goal of MeTiS is to minimize the edge-cut. Experimenting with different algorithms for refinement and edge-matching in MeTiS did not have any significant effect on runtime performance.

                    BASE      S-Step    GPS       MeTiS     OPT
Load balance (γ)    1.24      1.24      1.01      1.15      1.24
User CPU time       336.66s   328.44s   234.57s   299.71s   323.71s
System time         0.4s      0.45s     0.5s      0.07s     0.33s
SpMxV               213.79s   211.19s   190.29s   224.74s   212.43s
Barrier             85.02s    79.72s    10.84s    46.66s    83.7s

Table 1: Detailed analysis using the Sun analyzer profiling tool. All results are for 16 threads on the E10K. On this system, the analyzer could not use hardware counters.

Looking at the results from the more detailed measurements in Table 1 and Table 2, we can confirm that the algorithm is memory-bound, with a low instructions per cycle (IPC) rate. Hence, bandwidth minimization, which reduces both the L2 misses and the remote memory accesses the most, is expected to give the best performance improvements. MeTiS reduces the total edge-cut and the load imbalance but gives a high number of remote memory accesses, increasing the time spent in the SpMxV. MeTiS does not consider that the processors are clustered four by four in the nodes. The different optimizations to reduce the number of barriers do not give any significant improvements. For this small number of threads, the time spent in barriers is mostly due to load imbalance in the SpMxV operation and not due to the synchronization of the threads in the algorithm. Thus, reducing the load imbalance gives the best effect also on barrier time.


For larger numbers of threads we would expect to gain more from reducing the number of barriers. On the NUMA system the time spent in barriers (20-40%) is still very high for all optimizations. This is also causing the poor scaling. We would expect that GPS+OPT or GPS+S-step, with good load balance and few barriers, would give low barrier times, but this is not the case on the NUMA system. This suggests that the reductions may not be optimal here. For the same optimizations we get a barrier time below 5% of the total time on the SMP system.

                      BASE      S-Step    GPS       MeTiS     OPT
Load balance (γ)      1.24      1.24      1.01      1.15      1.24
User CPU time         131.29s   127.7s    74.85s    130.11s   123.01s
System time           4.77s     6.51s     6.78s     6.23s     4.3s
SpMxV                 59.45s    57.4s     36.23s    70.02s    55.67s
Barrier               45.15s    48.29s    21.67s    33.25s    47.5s
IPC                   0.63      0.57      0.65      0.47      0.55
L2 misses (10^6)      427       421       376       438       418
Remote misses (10^6)  125       115       70        144       104
MFLOP/s               450.2     463.4     710.7     451.9     450.8

Table 2: Detailed analysis using the Sun analyzer profiling tool and hardware counters. All results are for 16 threads on the SF15K.

6 Conclusions

We have investigated ways of improving the performance of an industrial-class conjugate gradient (CG) solver, implemented in OpenMP and running on two types of shared memory systems, a uniform SMP and a cc-NUMA machine. We have evaluated bandwidth minimization, graph partitioning and reformulations of the original algorithm that reduce global barriers. Bandwidth minimization has the greatest impact on performance, reducing both the L2 misses and the remote memory accesses. Graph partitioning improves the load balance and edge-cut of the original matrix, but does not give the same load balance and data locality as bandwidth minimization. Reducing the number of barriers alone has no significant effect on these small systems, but gives some performance improvement in combination with the other optimizations when the load imbalance is reduced. On the uniform SMP system we can achieve perfect scaling using the different optimization techniques, but on the cc-NUMA system we have less good scalability due to high barrier times coming from the reduction operations.

Our conclusion is that it is possible to achieve high performance and still keep the software engineering effort low by using an architecture-independent shared memory model like OpenMP. However, special care has to be taken with algorithmic optimizations, data distribution and the implementation of global operations. The bottleneck in our application on the NUMA system is the global reduction operations.


References

[1] Richard Barrett, Michael Berry, Tony F. Chan, James Demmel, June Donato, Jack Dongarra, Victor Eijkhout, Roldan Pozo, Charles Romine, and Henk van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods. SIAM, 1994.

[2] John Bircsak, Peter Craig, RaeLyn Crowell, Zarka Cvetanovic, Jonathan Harris, C. Alexander Nelson, and Carl D. Offner. Extending OpenMP for NUMA machines. Scientific Programming, 8:163–181, 2000.

[3] Jürgen Brehm and Harry F. Jordan. Parallelizing algorithms for MIMD architectures with shared memory. In Proceedings of the 3rd International Conference on Supercomputing, pages 244–253. ACM Press, 1989.

[4] J. Mark Bull and Chris Johnson. Data Distribution, Migration and Replication on a cc-NUMA Architecture. In Proceedings of the Fourth European Workshop on OpenMP. http://www.caspur.it/ewomp2002/, 2002.

[5] D. A. Burgess and M. B. Giles. Renumbering unstructured grids to improve the performance of codes on hierarchical memory machines. Technical report, Oxford University Computing Laboratory, Numerical Analysis Group, May 1995.

[6] A. Charlesworth. Starfire: extending the SMP envelope. IEEE Micro, 18(1):39–49, Jan/Feb 1998.

[7] Alan Charlesworth. The Sun Fireplane system interconnect. In Proceedings of the 2001 ACM/IEEE Conference on Supercomputing (CDROM), pages 7–7. ACM Press, 2001.

[8] A. T. Chronopoulos and C. W. Gear. s-step iterative methods for symmetric linear systems. Journal of Computational and Applied Mathematics, 25:153–168, 1989.

[9] E. Cuthill and J. McKee. Reducing the bandwidth of sparse symmetric matrices. In Proceedings of the 1969 24th National Conference, pages 157–172. ACM Press, 1969.

[10] Jack Dongarra, Ian Foster, Geoffrey Fox, William Gropp, Ken Kennedy, Linda Torczon, and Andy White. Sourcebook of Parallel Computing. Morgan Kaufmann, 2003.

[11] Fredrik Edelvik. Hybrid Solvers for the Maxwell Equations in Time-Domain. Doctoral thesis, Mathematics and Computer Science, Department of Information Technology, Uppsala University, May 2002. http://urn.kb.se/resolve?urn=urn:nbn:se:uu:diva-2156.


[12] Norman E. Gibbs, William G. Poole, Jr., and Paul K. Stockmeyer. An Algorithm for Reducing the Bandwidth and Profile of a Sparse Matrix. SIAM Journal on Numerical Analysis, 13(2):236–250, April 1976.

[13] Gene Golub and D. O'Leary. Some history of the conjugate gradient and Lanczos methods. SIAM Review, 31:50–102, 1989.

[14] Henrik Löf, Markus Nordén, and Sverker Holmgren. Improving Geographical Locality of Data for Shared Memory Implementations of PDE Solvers. In Computational Science - ICCS 2004: 4th International Conference, Krakow, Poland, June 6-9, 2004, Proceedings, Part II, volume 3037 of LNCS, pages 9–16, 2004. http://www.springerlink.com/openurl.asp?genre=article&issn=0302-9743&volume=3037&spage=9.

[15] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1999.

[16] James Laudon and Daniel Lenoski. The SGI Origin: a ccNUMA highly scalable server. In Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 241–251. ACM Press, 1997.

[17] Henrik Löf and Sverker Holmgren. affinity-on-next-touch: Increasing the Performance of an Industrial PDE Solver on a cc-NUMA System. In Mats Brorsson, editor, Proceedings of the 6th European Workshop on OpenMP, pages 3–8. Royal Institute of Technology, 2004.

[18] Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta, and Eduard Ayguadé. A transparent runtime data distribution engine for OpenMP. Scientific Programming, 8:143–162, 2000.

[19] Leonid Oliker, Xiaoye Li, Parry Husbands, and Rupak Biswas. Effects of ordering strategies and programming paradigms on sparse matrix computations. SIAM Review, 44(3):373–393, 2002.

[20] Ali Pinar and Michael T. Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE Conference on Supercomputing (CDROM), page 30. ACM Press, 1999.

[21] Sun Microsystems. Solaris Memory Placement Optimization and Sun Fire servers, January 2003. http://www.sun.com/servers/wp/docs/mpo_v7_CUSTOMER.pdf.

[22] Sun Microsystems. UltraSPARC III Cu User's Manual, 2003. http://www.sun.com/processors/manuals.


[23] Sivan Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, 41(6):711–725, 1997.

[24] Richard Vuduc, James W. Demmel, Katherine A. Yelick, Shoaib Kamil, Rajesh Nishtala, and Benjamin Lee. Performance optimizations and bounds for sparse matrix-vector multiply. In Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, pages 1–35. IEEE Computer Society Press, 2002.

A Implementations in OpenMP

Here we present our implementations of the CG algorithm in OpenMP. The standard implementation, Algorithm 4, represents a direct implementation of the CG algorithm. In Algorithm 5 we have used features of OpenMP to reorder some calculations and save global barriers. Finally, Algorithm 6 is an implementation of the S-step version of CG. Here only 2 global barriers are needed. To achieve this, an extra vector is introduced, resulting in more arithmetical work. The separator lines in the listings mark the points where global barriers occur.


Algorithm 4 Standard OpenMP implementation of CG

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(cg_iter_count)
      cg_iter_count = 1
      do
        call SpMxV(A,p,temp)              ! No barrier here
!$OMP SINGLE
        pAp_norm = 0.0_rfp
!$OMP END SINGLE
! ==================================== (global barrier)
!$OMP DO REDUCTION(+:pAp_norm)
        do i = 1, matrix_size
          pAp_norm = pAp_norm + p(i)*temp(i)
        end do
!$OMP END DO
! ==================================== (global barrier)
!$OMP SINGLE
        alpha = r_old_norm/pAp_norm
!$OMP END SINGLE
! ==================================== (global barrier)
!$OMP WORKSHARE
        x = x + alpha * p
        r_new = r_old - alpha * temp
!$OMP END WORKSHARE NOWAIT
!$OMP SINGLE
        r_new_norm = 0.0_rfp
!$OMP END SINGLE
! ==================================== (global barrier)
!$OMP DO REDUCTION(+:r_new_norm)
        do i = 1, matrix_size
          r_new_norm = r_new_norm + r_new(i)*r_new(i)
        end do
!$OMP END DO
! ==================================== (global barrier)
!$OMP SINGLE
        beta = r_new_norm/r_old_norm
!$OMP END SINGLE
! ==================================== (global barrier)
!$OMP WORKSHARE
        p = r_new + beta*p
!$OMP END WORKSHARE
! ==================================== (global barrier)
        ! Check convergence
!$OMP SINGLE
        call Swap_pointers(r_old, r_new)
        r_old_norm = r_new_norm
!$OMP END SINGLE NOWAIT
        cg_iter_count = cg_iter_count + 1
      end do
!$OMP END PARALLEL


Algorithm 5 Optimized OpenMP implementation of CG

      pAp_norm = 0.0_rfp
!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(cg_iter_count,alpha,beta)
      cg_iter_count = 1
      do
        call SpMxV(A,p,temp)              ! No barrier here
!$OMP SINGLE
        r_old_norm = r_new_norm
        r_new_norm = 0.0_rfp
!$OMP END SINGLE NOWAIT
!$OMP DO REDUCTION(+:pAp_norm)
        do i = 1, matrix_size
          pAp_norm = pAp_norm + p(i)*temp(i)
        end do
!$OMP END DO
! ==================================== (global barrier)
        alpha = r_old_norm/pAp_norm
!$OMP WORKSHARE
        x = x + alpha * p
        r_new = r_old - alpha * temp
!$OMP END WORKSHARE NOWAIT
!$OMP SINGLE
        pAp_norm = 0.0_rfp
!$OMP END SINGLE NOWAIT
!$OMP DO REDUCTION(+:r_new_norm)
        do i = 1, matrix_size
          r_new_norm = r_new_norm + r_new(i)*r_new(i)
        end do
!$OMP END DO
! ==================================== (global barrier)
        beta = r_new_norm/r_old_norm
!$OMP WORKSHARE
        p = r_new + beta*p
!$OMP END WORKSHARE
! ==================================== (global barrier)
        ! Check convergence
!$OMP SINGLE
        call Swap_pointers(r_old, r_new)
!$OMP END SINGLE NOWAIT
        cg_iter_count = cg_iter_count + 1
      end do
!$OMP END PARALLEL


Algorithm 6 OpenMP implementation of S-step CG

!$OMP PARALLEL DEFAULT(SHARED) PRIVATE(idx, cg_iter_count, alpha, beta)
      cg_iter_count = 1
      alpha = r_old_norm / my
      beta = 0.0_rfp
      do
!$OMP WORKSHARE
        p = r + beta * p
        q = temp + beta * q
!$OMP END WORKSHARE NOWAIT
!$OMP SINGLE
        r_old_norm = r_new_norm
        r_new_norm = 0.0_rfp
        my = 0.0_rfp
!$OMP END SINGLE NOWAIT
!$OMP WORKSHARE
        x = x + alpha * p
        r = r - alpha * q
!$OMP END WORKSHARE
! ==================================== (global barrier)
        call SpMxV(A,r,temp)              ! No barrier here
!$OMP DO REDUCTION(+:r_new_norm,my)
        do idx = 1, matrix_size
          r_new_norm = r_new_norm + r(idx) * r(idx)
          my = my + r(idx) * temp(idx)
        end do
!$OMP END DO
! ==================================== (global barrier)
        ! Check convergence
        beta = r_new_norm/r_old_norm
        alpha = r_new_norm/(my - ((beta*r_new_norm)/alpha))
        cg_iter_count = cg_iter_count + 1
      end do
!$OMP END PARALLEL


(a) Speedup of basic optimizations on the E10K (Base, S-step, GPS, OPT, MeTiS; speedup vs. number of threads, 2-28). (b) Speedup of combinations of optimizations on the E10K (Base, GPS+S-step, GPS+OPT, MeTiS+GPS).

Figure 2: Speedup of algorithmic optimizations on the E10K. The execution times are normalized to the Base case for each set of threads.


(a) Speedup of basic optimizations on the SF15K (Base, S-step, GPS, OPT, MeTiS; speedup vs. number of threads, 2-28). (b) Speedup of combinations of optimizations on the SF15K (Base, GPS+S-step, GPS+OPT, MeTiS+GPS).

Figure 3: Speedup of algorithmic optimizations on the SF15K. The execution times are normalized to the Base case for each set of threads.


(a) Speedup of basic optimizations on the E10K (Ideal, Base, S-step, GPS, OPT; speedup vs. number of threads, up to 30). (b) Speedup of combinations of optimizations on the E10K (Ideal, GPS+S-step, GPS+OPT, MeTiS, MeTiS+GPS).

Figure 4: Speedup of algorithmic optimizations on the E10K. The execution times are normalized to a serial code using bandwidth minimization.


(a) Speedup of basic optimizations on the SF15K (Ideal, Base, S-step, GPS, OPT; speedup vs. number of threads, up to 30). (b) Speedup of combinations of optimizations on the SF15K.

Figure 5: Speedup of algorithmic optimizations on the SF15K. The execution times are normalized to a serial code using bandwidth minimization.


Figure 6: Visualization of the arithmetical load balance for the case of 16 partitions (threads): Base (1.24), GPS (1.01), MeTiS (1.15), Reference (1.00). The reference ring corresponds to perfect balance, which is impossible to generate using reorderings. The number in parentheses in the legend is the γ parameter defined in Section 4.4.

(Stacked bars from 0% to 100% for the Base, S-Step, GPS, MeTiS and OPT configurations on the E10K and the SF15K; categories: SpMxV, Total Barrier, Other.)

Figure 7: Distribution of CPU time.
