
Pergamon
Computing Systems in Engineering, Vol. 5, No. 4-6, pp. 479-487, 1994
Copyright © 1995 Elsevier Science Ltd. 0956-0521(95)00003-8. Printed in Great Britain. All rights reserved.

0956-0521/94 $7.00 + 0.00

PERFORMANCE EVALUATION OF A PROCESSOR FARM MODEL FOR POWER SYSTEM STATE-ESTIMATORS ON DISTRIBUTED-MEMORY COMPUTERS

EFTHIMIOS TAMBOURIS,† PETER VAN SANTEN† and JOHN F. MARSH‡

†Electrical Engineering and Electronics Department, Brunel University, Uxbridge, Middlesex UB8 3PH, U.K.

‡Control Engineering Centre, Brunel University, Uxbridge, Middlesex UB8 3PH, U.K.

Abstract--In this work we present a preliminary performance and scalability analysis of a distributed state estimator for power systems when run on a massively parallel machine with a 2D mesh interconnection topology. The parallel algorithm is based on the processor farm, or master-slave, programming model, which results in additional modelling complexity from process-to-processor mapping, non-trivial communication overhead and extra computation not present in the serial case. The effect of applying sparsity programming techniques further complicates the computational procedure, and hence the performance analysis. Our objective is to investigate the feasibility of the architecture for the algorithm in hand. To achieve it, we derive an accurate performance model, mainly for the communication overhead, and examine the efficiency of the architecture. Furthermore, we use this model to derive the scaling behaviour of the system, i.e. information on how the system performs if we allow the input size to grow as the number of processors increases. The convergence between theoretical estimates and the simulation results obtained is also investigated.

NOMENCLATURE

In this paper the following notation is used:

p        number of processors
n        input size, i.e. number of nodes in the network, global case
l        number of links in the network, global case
n_ri     input size for slave i, i.e. number of nodes in area i, decomposed case
n_r      maximum of n_ri over all areas, decomposed case
n_b      input size for the master, i.e. number of boundary nodes, decomposed case
M_CSE    data memory requirements, global case
M_DSEr   data memory requirements for each slave processor, decomposed case
M_DSEb   data memory requirements for the master, decomposed case
W        problem size, i.e. work in the global case
W_r      problem size for each slave, i.e. work in each slave in the decomposed case
W_b      problem size for the master, i.e. work in the master in the decomposed case
t_s      start-up cost for next-neighbour messages
t_b      per-byte cost for next-neighbour messages
m        message size in bytes
t_bc     time for one single-node broadcast communication operation
t_g      time for one single-node gather communication operation
t_sb     start-up cost for broadcast (p > 2)
t_pb     per-processor cost for broadcast (p > 2)
t_sg     start-up cost for gather (p > 2)
t_pg     per-processor cost for gather (p > 2)
t_cg     extra computation cost for gather (p > 2)

1. INTRODUCTION

The development of massively parallel MIMD architectures consisting of hundreds or thousands of processors is encouraging research into their potential for a wide range of applications. Part of this research has culminated in a new performance analysis field termed scalability analysis [1, 2]. Scalability itself is a property which, whilst desirable, is difficult to define formally. Nevertheless, researchers agree that scalability is a property of a parallel algorithm along with a parallel architecture, often together referred to as a parallel system. Scalability analysis incorporates a number of paradigms and corresponding metrics for scalability, each one providing useful information for the system under investigation [3, 4]. Scalability analysis paradigms are based on a parallel system's performance model and, among other things, examine how performance changes when the number of processors increases if we allow the input size to grow. The results of all paradigms determine the overall scaling behaviour of the system.

In this paper we outline and use an experimental methodology to construct a performance model and derive the scaling behaviour of a Decomposed State Estimator (DSE) parallel algorithm [5-7] when run on a Parsytec GCel 3/512. The DSE algorithm is based on the processor farm, or master-slave, programming model; the Parsytec GCel 3/512 is a massively parallel machine with 512 processors and a 2D mesh interconnection topology.

The field of scalability analysis is rather new and only partial results on the scaling behaviour of systems have been published. The parallel systems examined so far allow a straightforward problem decomposition and a natural process-to-processor mapping, which results in perfect load balancing and trivial sources of overhead [1, 2, 4]. Unfortunately, the structure of the DSE algorithm is complex and hence none of these holds in our case. The programming model used cannot be naturally mapped onto a 2D mesh, resulting in a complex communication operation. Furthermore, the parallel version of the algorithm contains extra computation not present in the serial version. Finally, the programming model itself implies extra overhead since keeping the master busy results in inefficient use of the machine.

Our objective is to construct a preliminary performance model for the system and derive its scaling behaviour. Emphasis is placed on modelling the communication overhead since this is the main cause of performance loss. The results are used to examine the feasibility of the specific target architecture for the algorithm in hand. Simulation test results are also compared with previous research in which the algorithm was run on a small shared-memory system. In this study we use the FORTRAN 77 code developed for a shared-memory architecture [7], with some modifications to run under a message-passing environment.

The main motivation for the current investigation comes from collaborative work arising from our participation (funded under SERC contract GR/F4002) in the BRIM parallelizing compiler project [8]. The experience gained provides us with a solid framework for determining the performance and scalability estimates for a number of parallel systems. The study of distributed state-estimators for power systems was funded under SERC contract GR/G33110.

This paper is organised as follows. Section 2 contains a brief presentation of the large scale application. The techniques which are used and the assumptions made are outlined in Section 3. Sections 4 and 5 contain results from the asymptotic performance and scalability analysis respectively. A preliminary exact performance analysis is presented in Section 6. Conclusions and future work are given in Section 7. Finally, the Appendix contains a brief description of the CSE and DSE algorithms.

2. THE LARGE SCALE APPLICATION

The algorithm addresses the problem of reducing run times and improving the robustness of state-estimator (SE) programs. More specifically, a method has been developed based on partitioning a global network into a specific inter-connection of sub-networks, and distributing the resulting decomposed state estimator (DSE) between two computation levels, with a single (master) processor at the upper level and parallel (slave) processors at the lower level [5-7]. Communication is required between the master and each of the slaves but not between the slaves. The DSE is an exact decomposition of a conventional least-squares estimator (CSE) for a non-linear measurement model in the steady state. For 'clearly decomposable' networks, in which the total number of inter-area tie-lines is low, the computational task at the upper level is light and the data communication between levels is of low volume. In this sense, the decomposition is asymptotically efficient. The employment of sparsity programming significantly affects the performance of the CSE and may be used in each of the lower level DSE problems.

In summary, the algorithm has been proven to exhibit a number of properties making it attractive as a solution for reducing run times and improving the robustness of SE programs for power transmission system applications. The algorithm has been implemented in FORTRAN 77 and satisfactory results have been presented when a centralised architecture is used (a 4-processor VAX 6440 computer) [7]. In the Appendix we outline both the CSE and the DSE algorithms, emphasizing their computation and communication structure rather than their engineering interpretation. For more information on the engineering side of the algorithms the interested reader is referred to Refs 5 and 6.

The target machine is a Parsytec GCel 3/512, a massively parallel MIMD machine with 512 INMOS T805 transputers [9]. The processors are organized in a 16 × 32 2D mesh and communication is handled through explicit message passing. The GCel has its own operating system called PARIX (PARallel extensions to UNIX). PARIX allows the exchange of messages between next-neighbour processors using the send and receive primitives. Furthermore, a virtual topology may be defined by the user as a set of virtual links. Processors connected via a virtual link are viewed as next neighbours by the user and, thus, routing of messages between them is transparent to the user. PARIX provides a library of virtual topologies including hypercube, ring, tree, torus, etc. However, when using a virtual topology the physical interconnection network does not change and, hence, virtual topologies other than the 2D mesh imply extra communication overhead. This overhead comes from the fact that the user views as next neighbours processors that may be far apart in the physical topology. Nevertheless, performance losses are offset by a significant reduction in programming effort. The version of PARIX used provides a number of virtual topologies but does not explicitly support processor farms, and thus the implementation of the farm is the responsibility of the programmer.

When programming the GCel using FORTRAN 77 the same code is loaded onto all processors. In order to distinguish whether a processor is the master or a slave we need to use the processor's id. This is a unique number for each processor, distinguishing it from other processors in the machine. To implement the processor farm we use the following structure:

IF my_id = master_id THEN
    { code for the master follows }
ELSE
    { code for the slave follows }
END IF
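A minimal, self-contained FORTRAN 77 sketch of this dispatch is given below. It is our own illustration of the structure above, not code from the original implementation; in particular, MYID is hard-wired here, whereas on the GCel it would be obtained from the PARIX environment.

      PROGRAM FARM
C     Sketch of the processor-farm dispatch described above.
C     MYID would come from the PARIX environment; here it is a
C     hard-wired stand-in so that the sketch is self-contained.
      INTEGER MYID, MASTER
      PARAMETER (MASTER = 0)
      MYID = 0
      IF (MYID .EQ. MASTER) THEN
C        code for the master follows
         WRITE (*,*) 'master: gather, solve boundary system, broadcast'
      ELSE
C        code for the slave follows
         WRITE (*,*) 'slave: local computation on one sub-area'
      END IF
      END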

3. TECHNIQUES AND ASSUMPTIONS USED

We now outline the techniques that are used for load balancing, performance estimation and scalability estimation. Issues of memory use, optimization, and the employment of vector techniques are also addressed. Furthermore, the assumptions used in the study are presented.

Load balancing is actually determined by the method employed for partitioning the global network into an inter-connection of sub-networks. Currently, the method is simple but during a future optimization phase alternative heuristics will be exploited. Recall that the problem of the partitioning is equivalent to the problem of decomposing a sparse matrix. This problem is computationally difficult and, hence, the use of heuristics may only guarantee near-optimal solutions.

Performance and scalability analysis are carried out according to a methodology we have constructed. This methodology may also be used to investigate different scalability analysis paradigms in a similar context. The four steps of the methodology are now outlined:

Step 1: construct an analytical performance model;
Step 2: derive the asymptotic scaling behaviour;
Step 3: construct the exact performance model; and
Step 4: derive the exact scaling behaviour.

For the analysis we need to derive performance models for both the CSE (global case) and the DSE (decomposed case). Performance analysis faces the problem of identifying and estimating the overhead in a processor farm programming model mapped onto a 2D mesh topology. Major effort is put into deriving performance models for the single-node broadcast and the single-node gather communication operations on mesh topologies. Scalability analysis includes overhead estimation as well as the impact of memory limitations and the effect of algorithm concurrency.

The database for the DSE is derived from the global database, and is distributed amongst the lower level processors. The resulting sparse Jacobian sub-matrices are held in Zollenkopf form [10] in the lower level processors. The algorithm has already been implemented for VAX machines using FORTRAN 77 and, thus, we only need to develop the necessary communication structure under PARIX. To reduce the programming effort and assist the modelling procedure we chose to work with synchronous communication. A future optimization phase includes using asynchronous communication and also re-coding the algorithm in Occam, although recent research on the same machine showed that in some cases Occam performs worse than FORTRAN [11]. Finally, although the algorithm itself favours the employment of vector architectures, the Parsytec GCel does not support vector operations.

Furthermore, a number of assumptions are introduced to decrease the complexity of the problem. The algorithms in the Appendix contain two variables that can be used as the system's input size, namely n and me, with me > n. To keep the analysis simple we assume that me increases linearly with respect to (w.r.t.) n. Hence the system's input size is adequately represented by the single variable n. In the decomposed case n = n_b + Σn_ri, where n_ri characterises the computation in slave i and n_b characterises the computation in the master and the communication load. Since only synchronous communication is used we define n_r as the maximum of all n_ri and use it to characterise the computation load in each slave. For clearly decomposable networks n_b << n_r and thus in scalability analysis we concentrate on two cases: one where n_b is constant w.r.t. p and one where n_b grows linearly with the number of processors. In step 3* of the DSE algorithm in the Appendix, each processor receives n_b elements from the master. Hence the first case is equivalent to asking for a decreasing number of inter-area tie-lines per processor as the number of processors grows, while the second case is equivalent to asking for a constant number of inter-area tie-lines per processor as the number of processors grows. In asymptotic scalability analysis the first case should be discarded since, by increasing p, at some point we would have p > n_b, meaning that each processor contains less than one boundary node. However, assuming n_b is infinitely large we overcome this limitation. Finally, we assume the number of links l in the network grows linearly w.r.t. the number of nodes n. In most real-life networks the ratio of l to n is a constant having a value within the range 1 to 4.

In this study we call problem size W the number of basic operations required by the fastest known sequential algorithm to solve the problem on a single processor. Note that the problem size W is a function of the input size n. Furthermore, we call efficiency E the ratio of the speedup S to the number of processors p, where the speedup S is the ratio of the time needed by the fastest known sequential algorithm to solve the problem to the time needed by the parallel algorithm to solve the same problem.

4. ASYMPTOTIC PERFORMANCE ANALYSIS

At this first step of the methodology we derive the general performance models for the CSE and the DSE algorithms when run on a 2D mesh. Here we are interested in generic results and, therefore, benchmarking of the target architecture is not essential. It is sufficient for the performance models to contain enough coefficients to be determined later. Since the decomposed solution is derived algebraically from the global solution we examine each case explicitly.

4.1. Operation count

We first identify the computation operations involved in both the CSE and DSE algorithms. These are matrix transposition, matrix multiplication, matrix-vector multiplication and solving a system of linear equations. If we assume n is the input size and there is no sparsity programming involved then the complexity expressions for these computation operations are O(n^2), O(n^3), O(n^2) and O(n^3) respectively [1]. From the Appendix we may easily derive that for the CSE case the complexity of the work is W = O(n^3). For the DSE case the complexity of the work W_b done by the master is W_b = O(n_b^3), while the complexity of the work W_r done by each of the slaves is W_r = O(n_r^3).

In the case where the implementation is based on sparsity programming techniques the performance models for matrix computation operations are expected to differ from those reported in the previous paragraph. More specifically, we assume that (i) the non-zero elements of sparse matrices are stored in a rectangular part of the memory and (ii) to a first approximation, an increase of the input size of the sparse matrix results in a proportional increase of one dimension only of the stored matrix while the other remains constant at a small value. Under these assumptions we may easily show that the operations between matrices have a cost of only O(n), where n is the input size of the sparse matrix. Hence, for our system we get that the problem size is W = Θ(n) for the CSE case, W_r = Θ(n_r) for each slave in the DSE case and W_b = Θ(n_b) for the master in the DSE case.

4.2. Communications estimate

In this study we assume the one-port communication model, where a processor can send a message on only one of its links at a time. Similarly, it can receive a message on only one link at a time. However, a processor can receive a message while sending another message at the same time on the same or a different link. The DSE algorithm contains two communication operations. The first (step 2* in the Appendix) is the single-node gather communication operation, in which the master collects a unique message from each slave. The second (step 3* in the Appendix) is the single-node broadcast operation, in which the master sends identical data to all other processors.

The routing algorithm we use for the broadcast operation on a 2D mesh topology suggests the master is the central processor of the grid and sends messages first at one dimension (say the y dimension) and then at the other (the x dimension). Processors having the same y dimension as the master receive the message and forward it both to the next processor at the y dimension (if any) and to the next processor(s) at the x dimension. All other processors receive the message and forward it to the next processor(s) at the x dimension (if any).

In Fig. 1 we may see how the algorithm runs for a 5 × 5 grid where the master is the processor with id 12. In this figure each message transfer step is shown by a numbered, dotted arrow from the source of the message to its destination, and the number on the arrow indicates the time step during which the message is transferred. For example, the message reaches the processor with id 18 in 3 time steps and the total number of time steps for the broadcast operation is 6.

Fig. 1. Global communication on a 5 × 5 GCel partition.

The routing algorithm for the gather operation is similar to that for the broadcast. The difference now is that the algorithm applies backwards, from the edges of the grid to the centre point where the master rests. Furthermore, each processor has to concatenate its own data to the message(s) it receives before forwarding the new message. Hence, messages grow as they get closer to the master.

In both algorithms, communication is carried out concurrently for many pairs of processors. In estimating performance we need to identify the maximum number of hops one message has to traverse. For both algorithms this number equals the distance between the master processor, which rests in the centre of the mesh, and a processor which rests at one of the edges of the mesh. For example, in square meshes with p processors, if √p is even this number equals √p, while if √p is odd it is √p - 1.

To estimate the time required for the completion of the communication operations we assume that m is the message size (in bytes) each processor receives and/or sends and that it is constant for all processors. Furthermore, next-neighbour communication is modelled as t_s + m·t_b, where t_s is the message start-up cost and t_b is the per-byte cost. For a square mesh with p processors where √p is even, the communication time for the broadcast operation t_bc is:

t_bc = t_sb + √p·t_pb + √p·t_s + m·√p·t_b;   m = n_b    (1)

where t_sb is a start-up operation cost and t_pb is a per-processor cost from the necessary computation at each processor. For the gather operation and the same mesh the time t_g is:

t_g = t_sg + √p·t_pg + (p√p + 3p + 2√p)·(m/8)·t_cg
      + √p·t_s + (p√p + 3p + 2√p)·(m/p)·t_b    (2)

where t_sg is a start-up operation cost, t_pg is a per-processor cost and t_cg is the computation cost, since the gather algorithm implies computation for decomposing and composing messages in each processor. These expressions are easily modified to apply to square grids where √p is odd and to rectangular grids.

4.3. The effect of serial code

The processor farm model by its nature implies an extra overhead arising from the slaves being idle while the master performs useful computation. The work done by the master may be viewed as the serial part of the code which is not parallelized and, hence, is always executed by only one processor. In both cases the overhead depends on the work done by the master (or the serial work) W_b and the total amount of work as presented in the global case, W. In particular, assuming no communication overheads, the running time of the decomposed problem is W/p + W_b·(p - 1)/p and hence the overhead is W_b·(p - 1)/p.
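Writing this out (our own restatement, using the definitions of Section 3), the serial part alone caps the achievable efficiency:

E = W / (p·T_p),   T_p = W/p + W_b·(p - 1)/p   ⇒   E = W / (W + (p - 1)·W_b),

which is consistent with the pW_b overhead term that appears in the efficiency expression of Section 5.2.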

5. ASYMPTOTIC SCALABILITY ANALYSIS

In this second step of the methodology we investigate the asymptotic scaling behaviour of the system. For the system in hand we present an analysis of the memory limitations and of the fixed-efficiency scalability. Scalability is sometimes restricted by limitations on the concurrency of the algorithm and hence the algorithm's degree of concurrency is also examined.

5.1. Memory limitations

In our target architecture there is a 4 Mbyte DRAM local memory on the chip of the T805 transputer. As a first step we examine the memory required by the code produced for a shared-memory architecture. Without any memory optimization the memory required in the global solution is (in bytes):

M_CSE = 3696n^2 + 688n + 442l + 100nl + 10923.    (3)

In Eq. (3) M_CSE is the memory required by the code, n is the number of nodes and l the number of links in the network. In general, as mentioned in Section 3, the number of links is linear in the number of nodes. Applying Eq. (3) to the IEEE standard 30-node, 41-link network we get M_CSE = 3,499,085 bytes, and hence this network may be solved on a single GCel transputer. Using Eq. (3) we derive that a network of 32 nodes is the largest that may be solved on the GCel before the machine runs out of memory.
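As a quick check of Eq. (3), the short program below (an illustrative sketch, not part of the original code) evaluates M_CSE for the 30-node, 41-link network and reproduces the 3,499,085 bytes quoted above.

      PROGRAM MEMCHK
C     Evaluate Eq. (3) for the IEEE 30-node, 41-link network.
      INTEGER N, L, MCSE
      N = 30
      L = 41
      MCSE = 3696*N*N + 688*N + 442*L + 100*N*L + 10923
C     Expected output: 3499085 bytes
      WRITE (*,*) 'MCSE (bytes) = ', MCSE
      END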

For the decomposed case we start from the memory requirements of the code produced for the shared-memory machine. After optimizing memory allocation the memory requirements for each slave are (in bytes):

M_DSEr ≈ 6024n_r^2 + 32n_b.    (4)

The memory requirements for the master are (in bytes):

M_DSEb ≈ 80n_b^2 + 152n_b.    (5)

Note that although the same code is loaded to all processors, the memory requirements of the master and of each slave differ since the code actually running in each case is different. Both M_DSEr and M_DSEb should remain below approximately 3.9 Mbytes. From Eq. (5) we get that a maximum of 220 boundary nodes may be used. From Eq. (4), after substituting this value for n_b, we get a maximum of 26 inter-area nodes for each slave. Assuming that n_b ∝ p, their ratio determines the number of boundary nodes per processor. The most significant case is when this ratio is constant, which is equivalent to asking for a linear increase in the number of boundary nodes as the number of processors grows. We now examine the largest networks that may be solved by our system assuming ideal load balance with 26 inter-area nodes per slave. If each processor contains 5 boundary nodes then the largest network may be solved on 44 processors and has 1364 nodes. In the case where each processor contains 7 boundary nodes the largest network may be solved on 31 processors and contains 1023 nodes. Finally, in the rather pessimistic case of 10 boundary nodes per processor the largest network may be solved on 22 processors and contains 792 nodes.
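The sizing figures above can be reproduced with the following sketch (our own illustration; it simply applies the limits of 220 boundary nodes at the master and 26 inner nodes per slave, with B boundary nodes per processor):

      PROGRAM SIZING
C     Largest network under the memory limits of Section 5.1:
C     at most 220 boundary nodes at the master and 26 inner
C     nodes per slave, assuming B boundary nodes per processor.
      INTEGER CASES(3), I, B, P, NB, N
      DATA CASES /5, 7, 10/
      DO 10 I = 1, 3
         B = CASES(I)
C        number of processors limited by the master's memory
         P = 220 / B
C        boundary nodes actually used and total network size
         NB = B * P
         N = 26 * P + NB
         WRITE (*,*) B, ' boundary nodes/processor: p =', P,
     &               ', network nodes =', N
   10 CONTINUE
      END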

In summary, the available memory on the transputer chip significantly reduces the potential of using the GCel since only a fraction of its available processing power may be utilized. The analysis shows that even after performing a number of optimizations no more than 50 processors may be used before the system runs out of memory. This represents only about one tenth of the 512 processors available in the GCel. From the point of view of the input size, we expect the system to be able to solve a 1400-node network at the maximum.

5.2. Fixed-efficiency scaling

We use the fixed-efficiency scaling paradigm to estimate the system's scalability. We want to estimate the growth rate of the problem size that is necessary to keep efficiency constant as the number of processors increases (also called the isoefficiency function) [1]. Keeping dominant terms only, the definition of efficiency becomes:

E = S/p = t_CSE/(p·t_DSE) = W / (W + p·W_b + p·n_b·√p + p·n_b^2·√p).    (6)

In the denominator the first term is the useful computation performed while all other terms represent overhead. In particular, the second term is the overhead from the processors being idle while the master is working, the third comes from the broadcast communication operation and the fourth from the gather communication operation. We may now derive the growth rate of the input size n w.r.t. p needed to keep efficiency constant in each of the three overhead cases:

W = Θ(p·W_b);   W = Ω(p^1.5·n_b);   W = Ω(p^1.5·n_b^2).    (7)

We first examine the case where sparsity programming techniques are used. As shown in Section 4.1, in this case we have W = Θ(n), W_b = Θ(n_b) and W_r = Θ(n_r) and, hence, Eq. (7) becomes:

n = Θ(p·n_b);   n = Ω(p^1.5·n_b);   n = Ω(p^1.5·n_b^2).    (8)

We observe that when n_b is constant w.r.t. p then from the second expression we get that W = Ω(p^1.5), which determines the overall system's scalability. However, when n_b is linear w.r.t. p the third expression gives a much worse scalability, W = Ω(p^3.5).
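Spelling out the substitution behind these two rates (our own working, with W = Θ(n)):

n_b constant w.r.t. p:   n = Ω(p^1.5·n_b) = Ω(p^1.5);
n_b = Θ(p):              n = Ω(p^1.5·n_b^2) = Ω(p^1.5·p^2) = Ω(p^3.5).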

In the case where no sparsity programming techniques are used we have W = Θ(n^3), W_r = Θ(n_r^3) and W_b = Θ(n_b^3). Now Eq. (7) becomes:

n^3 = Θ(p·n_b^3);   n^3 = Ω(p^1.5·n_b);   n^3 = Ω(p^1.5·n_b^2).    (9)

We observe that when n_b is constant w.r.t. p then again from the second expression we get W = Ω(p^1.5), which determines the overall system's scalability. However, when n_b is linear w.r.t. p the first expression gives a much worse scalability, W = Ω(p^4).

In the analysis performed so far it is assumed that n and n_b may change independently. We now examine the case where n and n_b are linearly related, i.e. n_b = cn where c is a constant between 0 and 1. Even if we assume no communication overhead, and when sparsity programming techniques are employed, we have W = n and an overhead of (p - 1)·n_b. This means that the isoefficiency function is given by:

W = Ω(p·n_b)   ⇒   W = Ω(p·c·n).    (10)

Equation (10) suggests that the system has no isoefficiency function and hence it is unscalable, since W would have to grow at least in proportion to p·W itself. Note that this holds no matter how small c may be, and the same result holds for dense matrices.

We now summarize the results for the most significant case, which is when sparsity programming techniques are used and the number of boundary nodes is linear in the number of processors. In this case the analysis showed that efficiency may be kept at a fixed value as the number of processors grows if we increase the input size at a rate Ω(p^3.5).

5.3. The degree of concurrency

The degree of concurrency of a parallel algorithm is the maximum number of tasks that can be executed simultaneously at any time [1]. It is a measure of the number of operations that an algorithm can perform in parallel for a problem of size W and is independent of the parallel architecture.

Theoretically, in our parallel algorithm the number of processors that may be efficiently utilized may scale up to the number of nodes in the global solution, i.e. Θ(n). Assuming sparsity programming techniques are used we have W = Θ(n), the degree of concurrency is Θ(W) and at most Θ(W) processors can be used efficiently. Therefore, given p processors, the problem size should be at least Ω(p) to use them all. Thus, the isoefficiency function due to concurrency is Θ(p), which is the best-case value. Assuming no sparsity programming techniques are used we have W = Θ(n^3), the degree of concurrency is Θ(W^(1/3)) and at most Θ(W^(1/3)) processors can be used efficiently. Thus, the isoefficiency function due to concurrency is Θ(p^3).

6. EXACT PERFORMANCE ANALYSIS

In this third step of the methodology we verify the communication models proposed at the first step and determine the coefficients they contain. After that, we present timing results from running the CSE algorithm and compare them with previously reported results.

6.1. Communication performance models

The model t_s + m·t_b proposed for the communication between next-neighbour processors is verified by the measurements, giving t_s = 74 µsec (1 µsec = 10^-6 sec) and t_b = 0.9 µsec. The measurements were obtained using the PING-PONG benchmark [12] and take into account the time required by the timing statements used. Timing values obtained for the broadcast algorithm verify the model proposed in Section 4.2, giving t_sb = 260 µsec and t_pb = 155 µsec. In Fig. 2 we see the convergence between results derived using the model and timing values when the number of processors grows from 16 to 512, for two cases: one for small messages (4 bytes or one integer) and one for medium messages (100 bytes or 25 integers).

Timing values obtained for the gather algorithm also verify the model proposed. For our area of interest, which is small message sizes, we get the following values for the coefficients: t_sg = 819 µsec, t_pg = 365 µsec and t_cg = 7.5 µsec. In Fig. 3 we see the results from the model and the corresponding measures for some square meshes where the number of processors ranges from 16 to 256 and the message size is 4 bytes.
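As an illustration of how Eq. (1) is used with these coefficients, the sketch below (our own code, not part of the original test programs) tabulates the predicted broadcast time for square meshes with an even √p and the two message sizes of Fig. 2:

      PROGRAM BCAST
C     Predicted broadcast time from Eq. (1) for square meshes
C     with an even square root, using the measured coefficients.
      REAL TS, TB, TSB, TPB, TBC
      INTEGER SQP, P, M, J
      TS  = 74.0
      TB  = 0.9
      TSB = 260.0
      TPB = 155.0
      DO 20 SQP = 4, 16, 4
         P = SQP * SQP
         DO 10 J = 1, 2
            M = 4
            IF (J .EQ. 2) M = 100
C           Eq. (1): tbc = tsb + sqrt(p)*(tpb + ts + m*tb)
            TBC = TSB + SQP*TPB + SQP*TS + M*SQP*TB
            WRITE (*,*) 'p =', P, ' m =', M, ' tbc (usec) =', TBC
   10    CONTINUE
   20 CONTINUE
      END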

Fig. 2. Single-node broadcast communication operation (model and timing measurements for 4-byte and 100-byte messages, 16 to 512 processors).

Fig. 3. Single-node gather communication operation (least-squares approximation and timing measures for 4-byte messages, 16 to 256 processors).

6.2. CSE simulation test results

Simulation test results of the CSE algorithm for six different networks have been obtained. The networks used are: 8-node, 10-link; 12-node, 15-link; 16-node, 20-link; 20-node, 27-link; 24-node, 32-link; and 30-node, 41-link. The 30-node network is an IEEE standard network and the others are all subsets of it. The timing results are shown in Table 1 and Fig. 4 and suggest that the time required by the algorithm increases almost linearly with the input size. More specifically, the least-squares line approximation to the experimental data gives a 14% error in the worst case (the 16-node network). Note that in Fig. 4 the time on the y-axis is the ratio of the measured time to the number of iterations performed to convergence.

Table 1. Timing measures for CSE running on GCel

No. of nodes   No. of links   No. of iterations   Time (µsec)
      8             10                3               251,940
     12             15                4               405,610
     16             20                4               466,735
     20             26                4               634,865
     24             32                4               808,454
     30             41                5             1,246,955

Fig. 4. Running time per iteration for the global problem (timing measures and least-squares approximation against the number of nodes).
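The least-squares check can be reproduced from Table 1 with the sketch below (again our own illustration). It fits a straight line to time-per-iteration versus the number of nodes and reports the relative error at each point; the largest error occurs for the 16-node network, in line with the 14% figure quoted above.

      PROGRAM LSQFIT
C     Least-squares line through time-per-iteration vs. nodes,
C     using the data of Table 1.
      INTEGER NN(6), NIT(6), I
      REAL T(6), X, Y, SX, SY, SXX, SXY, A, B, PRED, ERR
      DATA NN  /8, 12, 16, 20, 24, 30/
      DATA NIT /3, 4, 4, 4, 4, 5/
      DATA T   /251940.0, 405610.0, 466735.0, 634865.0,
     &          808454.0, 1246955.0/
      SX  = 0.0
      SY  = 0.0
      SXX = 0.0
      SXY = 0.0
      DO 10 I = 1, 6
         X = REAL(NN(I))
         Y = T(I) / REAL(NIT(I))
         SX  = SX + X
         SY  = SY + Y
         SXX = SXX + X*X
         SXY = SXY + X*Y
   10 CONTINUE
      B = (6.0*SXY - SX*SY) / (6.0*SXX - SX*SX)
      A = (SY - B*SX) / 6.0
      DO 20 I = 1, 6
         PRED = A + B*REAL(NN(I))
         ERR  = ABS(PRED - T(I)/REAL(NIT(I))) / (T(I)/REAL(NIT(I)))
         WRITE (*,*) NN(I), ' nodes: relative error =', ERR
   20 CONTINUE
      END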

7. CONCLUSIONS AND FUTURE WORK

To estimate the performance and scaling behaviour of a parallel algorithm-machine combination (system) we proposed a four-step methodology. We also showed how this methodology applies in practice even in the case of systems that are hard to model. In particular, we performed a preliminary performance and scalability analysis for a distributed state estimator for power systems when run on a massively parallel machine with a 2D mesh interconnection topology. The communication overhead was modelled precisely, resulting in an accurate performance model. On the other hand, the computation load was only asymptotically modelled. The scaling behaviour of the system is only partially presented here since work is still in progress.

Scalability analysis suggested the system scales well when the number of boundary nodes n_b is kept constant or increases slowly w.r.t. the number of processors p. In fact, the faster n_b increases w.r.t. p, the worse the system's scalability becomes. In particular, if we assume that sparsity programming techniques are used and n_b is constant w.r.t. p, then in order to keep efficiency at a fixed value we need to increase the input size at a rate Ω(p^1.5). However, if n_b is linear w.r.t. p then this rate becomes Ω(p^3.5).

Theoretical analysis and test runs suggested each processor may only be assigned a small network due to memory limitations. More precisely, no more than 26 inter-area nodes may be assigned to each of the slave processors, while a total of 220 boundary nodes may be assigned to the master. In order to estimate the largest network that may be solved under these memory limitations we need to know how the number of boundary nodes n_b increases w.r.t. p. Assuming a linear increase of n_b, the largest network is determined by the number of boundary nodes per area. For example, if each area contains 5 boundary nodes then the largest network is solved on 44 processors and contains 1364 nodes.

The analysis also revealed a conflict in the number of boundary nodes contained in the network. Analysis of the system's memory constraints suggested that a large number of boundary nodes may be used. This allows us to solve very large networks even if the number of boundary nodes is linear w.r.t. the number of processors. However, an increase in the number of boundary nodes increases the computation load of the master, which tends to decrease the system's efficiency. This is expected since while the master processor is working all other processors remain idle. The exact implications of this conflict may only be determined after the exact performance and scalability analysis has been concluded.

Future work starts with deriving the exact system performance model and scaling behaviour. Simulation test results from large networks are necessary to verify the theoretical analysis. Optimizations of the algorithm will then be performed to enhance its performance. These include developing algorithms for automatic load balancing, minimizing communication overhead using asynchronous communication, and exploiting absolute performance gains using a low-level programming language like Occam. Furthermore, the use of a theoretically superior algorithm for the communication operations [13] will be investigated to enhance the system's scalability.

Acknowledgements--We would like to acknowledge Prof. C. Halatsis for granting us access to the GCel 3/512; the researchers of the Athens High Performance Computing Laboratory for their assistance with using the GCel; and Dr. S. Zitouni, who developed the code for the DSE algorithm. During part of this work Efthimios Tambouris was supported by the Organization for the Employment of Labour Potential at the Hellenic Ministry of Labour.

REFERENCES

1. V. Kumar, A. Grama, A. Gupta and G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, USA, 1994.
2. X.-H. Sun and D. T. Rover, "Scalability of parallel algorithm-machine combinations," Technical Report IS-5057, Ames Laboratory, Iowa State University, Ames, IA, 1991. Also in IEEE Transactions on Parallel and Distributed Systems, 1994.
3. J. L. Gustafson, "The consequences of fixed time performance measurement," Proceedings of the 25th International Conference on System Sciences, Hawaii, Vol. 2, 1992, pp. 113-124.
4. P. H. Worley, "The effect of time constraints on scaled speedup," SIAM Journal on Scientific and Statistical Computing 11(5), 1-35 (1991).
5. J. F. Marsh, "Structure of measurement Jacobian matrices for power systems," IEE Proc., Pt. C 6, 407-413 (1989).
6. J. F. Marsh and M. Azzam, "MCHSE: a versatile framework for the design of 2-level power system state-estimators," IEE Proc., Pt. C 4, 291-298 (1988).
7. J. F. Marsh, S. Zitouni and M. R. Irving, "Computational aspects of distributed state-estimators for power systems," 12th IFAC World Congress, Sydney, Australia, 1993.
8. C. P. Wadsworth, S. K. Robinson and D. J. Johnson, "Mapping data placements in a parallelising compiler," Proceedings of PACTA '92 (edited by M. Valero), IOS Press, 1992.
9. F. Tiedt, "Parsytec GCel Supercomputer," Technical Report, July 1992.
10. J. K. Reid, Large Sparse Sets of Linear Equations, Academic Press, London, 1971.
11. P. M. A. Sloot, A. G. Hoekstra and L. O. Hertzberger, "A comparison of the Iserver-Occam, Parix, Express, and PVM programming environments on a Parsytec GCel," Lecture Notes in Computer Science, Vol. 797, 1994, pp. 253-259.
12. R. Hockney, "Performance parameters and benchmarking of supercomputers," Parallel Computing 17, 1111-1130 (1991).

13. M. Barnett, D. G. Payne, R. A. van de Geijn and J. Watts, "Broadcasting on meshes with worm-hole routing," Technical Report, Computer Science Department, University of Texas, 1993.

APPENDIX

CSE algorithm

Given the measurement model z = h(x) + u, where x ∈ R^n, (z, u) ∈ R^me and me > n, the WLS estimator x̂ for x is given by convergence of H^T(j)·R^-1·H(j)·Δx(j) = H^T(j)·R^-1·Δz(j), where Δx(j) = x(j + 1) - x(j), Δz(j) = z - h(x(j)) and j is an iteration index.

DSE algorithm

If we assume n is the input size of the CSE then in the DSE it is decomposed into the inner areas n_ri and the boundary area n_b. For simplicity we assume each inner area is assigned to one processor and all inner areas have the same load n_r, equal to the maximum of the n_ri. The algorithm implies each slave is assigned the computation from an inner area while the master is assigned the computation from the boundaries. Furthermore, r_ib is the set of the 'boundary residuals', with r_ib ∈ R^me_i. Once again we assume me_i is the same for all i; also that me is linear in n, thus increases in me result in corresponding increases in n. To conclude the notation, i is an index from 1 to the total number of processors, which is assumed to be p, and j is an iteration index. Note that W_ri is not to be confused with the W_r used in the main part of the paper.

Given the following matrix definitions: H_ri, the internal Jacobian matrix for sub-area i; Σ_ri = H_ri^T·R_i^-1·H_ri; W_ri = I - H_ri·Σ_ri^-1·H_ri^T·R_i^-1; G_i = [W_ri·H_ib]^T·R_i^-1·[W_ri·H_ib]; g_i = [W_ri·H_ib]^T·R_i^-1·Δz_i; G = ΣG_i and g = Σg_i, the following computational procedure applies.

At lower level (in p parallel processors)
1. Initialize with x_ri(0), x_b(0). Read z_i. Set j = 0.
2. Compute: h_i(x_ri(j), x_b(j)); Δz_i(j); H_ri(j); H_ib(j); Σ_ri(j); Σ_ri^-1(j); W_ri(j) and [G_i(j); g_i(j)].
2*. Send [G_i(j); g_i(j)] to the upper level, i.e. n_b(n_b + 1) elements in p-parallel.

At upper level (in a single processor)
3. Compute: [G; g] = Σ[G_i; g_i] and Δx_b(j) from G(j)·Δx_b(j) = g(j).
3*. Send Δx_b(j) in p-parallel to the lower level, i.e. n_b elements.

At lower level (in p parallel processors)
4. Compute: Δr_ib(j) = Δz_i - H_ib(j)·Δx_b(j) and Δx_ri(j) from Σ_ri(j)·Δx_ri(j) = H_ri^T(j)·R_i^-1·Δr_ib(j).
5. Set j = j + 1. Compute x_ri(j); x_b(j).
6. If convergence is not reached, go to 2; else x̂_ri = x_ri(j) and x̂_b = x_b(j).

End of state-estimation algorithm.

Following convergence, the further computations required for distributed bad-data analysis are:
At lower level (in p parallel processors)
7. Compute: r_i = W_ri·[z_i - H_ib(j)·x̂_b] and E[r_i·r_i^T] = W_ri·[R_i - H_ib·G^-1·H_ib^T]·W_ri^T.