
Annals of Operations Research 22 (1990) 161-180

VECTOR AND PARALLEL COMPUTING FOR MATRIX BALANCING*

Stavros A. ZENIOS

Decision Sciences Department, The Wharton School, University of Pennsylvania, Philadelphia, PA 19104, USA

and

Siu-Leong IU

Electrical Engineering Department, University of Pennsylvania, Philadelphia, PA 19104, USA

Abstract

Estimating the entries of a large matrix to satisfy a set of internal consistency relations is a problem with several applications in economics, urban and regional planning, transportation, statistics and other areas. It is known as the Matrix Balancing Problem. Matrix balancing applications arising from the estimation of telecommunication or transportation traffic and from multi-regional trade flows give rise to huge optimization problems. In this report, we show that the RAS algorithm can be specialized for vector and parallel computing and used for the solution of very large problems. The algorithm is specialized for vector computations on a CRAY X-MP and is parallelized on an Alliant FX/8. A variant of the algorithm - developed here for its potential parallelism - turns out to be more efficient than the original algorithm even when implemented serially. We use the algorithms to estimate disaggregated input/output tables and a multi-regional trade flow table of the U.S. The largest problem solved has approximately 12 000 constraints and over 370 000 nonlinear variables. This is the first of two papers that aim at the solution of very large matrix balancing problems. Zenios [20] uses the same algorithm for the same models on a massively parallel Connection Machine CM-2.

* Research partially supported by NSF grants ECS-8718971 and CCR-8811135, and AFOSR grant 89-0145. Computing resources were made available through the ACRF at Argonne National Laboratory and CRAY Research, Inc.

© J.C. Baltzer AG, Scientific Publishing Company


1. Introduction

The problem of adjusting the entries of a large matrix to satisfy prior consistency requirements occurs in regional planning, economics, statistics, stochastic modeling, traffic assignment and other areas. This problem is known as the Matrix Balancing Problem (abbreviated: MB) and has attracted the attention of researchers from several disciplines due to its rich mathematical foundation and the significance of its applications. For a general reference on algorithms and applications of MB, we cite Schneider and Zenios [18]. Textbook treatments of this problem are found in Bacharach [5] and Miller and Blair [17]. The problem can be stated formally as follows:

[MB] Given an $m \times n$ non-negative matrix $A$ and positive vectors $u$ and $v$ of dimensions $m$ and $n$, respectively, determine a "nearby" non-negative matrix $X$ (of the same dimensions) such that

$$\sum_{j=1}^{n} x_{ij} = u_i, \quad \text{for } i = 1, 2, \ldots, m,$$

$$\sum_{i=1}^{m} x_{ij} = v_j, \quad \text{for } j = 1, 2, \ldots, n,$$

and $x_{ij} > 0$ only if $a_{ij} > 0$.

At the current status of research in this problem area, it is considered feasible to balance matrices of dimension up to three hundred, with a few thousand non-zero elements. As an indication, we mention the work of Barker et al. [7] on the balancing of a sparse system of National Accounts for the United Kingdom with 262 accounts. Klincewicz [15] reports on the solution of some of the larger, randomly generated problems in the literature. Dense matrices of dimension up to 500 x 100 were balanced in approximately 0.5 CPU hour on an IBM 3090-200. Zenios et al. [19] report on the balancing of sparse Social Accounting Matrices for development planning with 232 accounts. Some of the larger problems were solved within a few minutes of CPU time on an IBM 3081-K. In general, problems of this size are considered solvable on large mainframes with a moderate expenditure of resources.

In this report, we introduce vector and parallel computations for the solution of very large MB problems. It is shown that the RAS algorithm is well suited for vector computations, and achieves significant performance on a CRAY X-MP even with a sparse implementation. Furthermore, the algorithm can be specialized for parallel computations. It achieved almost 90% efficiency on a shared memory Alliant FX/8 with eight processors, for an observed speedup factor of over 7. A variant of the algorithm - developed specifically for its potential parallelism - was observed to achieve superlinear speedup. As it turns out, the proposed variant of RAS can be more efficient than the original algorithm even when implemented in a sequential environment.

Test problems for our experiments are derived from real data: a multiregional input/output table of the U.S. from 1977, and Make and Use national tables for the U.S. for the same year. The largest problem solved has dimension 6000 x 6000 and over 370 000 non-zero entries. The test problems were balanced to a very high level of accuracy within a few minutes of CPU time. We point out that Harris [13] identified MB problems as one area of transportation and land-use modeling where supercomputers could prove to be of great importance. This paper presents the first, to our knowledge, research effort in this direction. It is the first of two papers that attempt to establish the size of MB problems that are solvable with parallel computing techniques. Zenios [20] discusses the solution of the same class of problems using data level parallelism with the RAS algorithm on a Connection Machine CM-2.

The rest of the manuscript is organized as follows: Section 2 introduces the RAS algorithm and a parallel modification, and points out its similarities with algorithms used in medical imaging. Section 3 introduces some elements of vector and parallel computing in order to ease the development of subsequent sections. The specialization of RAS for vector and parallel computations is addressed in section 4. Section 5 summarizes the test problems and provides results of the computational experiments conducted both on a CRAY X-MP and an Alliant FX/8. Conclusions are summarized in section 6.

2. The RAS algorithm

One of the earliest algorithms for solving MB problems is RAS, which dates back to the Russian architect Sheleikhovskii in the 1930s. An early discussion of the algorithm is found in Bacharach [5], who is often credited with the name RAS. See Schneider and Zenios [18] for a discussion of RAS-related literature. Bachem and Korte [6] suggested some numerically stable modifications to the algorithm. The algorithm is formally defined as:

THE RAS ALGORITHM

Input: An m x n non-negative matrix A and positive vectors u and v of dimensions m and n, respectively.

Step 0: (Initialization) Set $k = 0$ and $A^0 = A$.

Step 1: (Row scaling) For $i = 1, 2, \ldots, m$ define

$$\rho_i^k = \frac{u_i}{\sum_{j=1}^{n} a_{ij}^k},$$

and update $A^k$ by

$$a_{ij}^k \leftarrow \rho_i^k a_{ij}^k, \quad i = 1, 2, \ldots, m \ \text{and} \ j = 1, 2, \ldots, n.$$

Step 2: (Column scaling) For $j = 1, 2, \ldots, n$ define

$$\sigma_j^k = \frac{v_j}{\sum_{i=1}^{m} a_{ij}^k},$$

and define $A^{k+1}$ by

$$a_{ij}^{k+1} = a_{ij}^k \sigma_j^k, \quad i = 1, 2, \ldots, m \ \text{and} \ j = 1, 2, \ldots, n.$$

Step 3: Replace $k \leftarrow k + 1$, and return to step 1.

The algorithm terminates either when a maximum number of iterations has been reached or when the maximum row/column error falls below some acceptable tolerance. The error at iteration $k$ is computed by:

$$\max_{i,j} \left\{ \frac{\left| u_i - \sum_j a_{ij}^k \right|}{u_i},\; \frac{\left| v_j - \sum_i a_{ij}^k \right|}{v_j} \right\}.$$
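For concreteness, the iteration and this stopping rule can be sketched in a few lines of Python/NumPy (a dense illustration only; the function name and tolerance defaults are ours, and the paper's implementation is a sparse FORTRAN code):

    import numpy as np

    def ras(A, u, v, tol=1e-8, max_iter=100000):
        # Balance non-negative A so that rows sum to u and columns to v.
        # Assumes A has no zero rows or columns and the problem is feasible.
        X = A.astype(float).copy()
        for k in range(max_iter):
            X *= (u / X.sum(axis=1))[:, None]   # row scaling by rho_i^k
            X *= (v / X.sum(axis=0))[None, :]   # column scaling by sigma_j^k
            # maximum relative row/column error, as defined above
            err = max(np.max(np.abs(u - X.sum(axis=1)) / u),
                      np.max(np.abs(v - X.sum(axis=0)) / v))
            if err < tol:
                break
        return X, k + 1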

The convergence of the algorithm - under some mild conditions on the feasibility of the linear restrictions - has been established by Bregman [9]. It can be shown that the sequence $\{-\log \rho_i^k, -\log \sigma_j^k\}$ converges to the solution of the following entropy maximization problem:

$$\begin{array}{ll} \underset{x}{\text{maximize}} & -\displaystyle\sum_{(i,j)} x_{ij} \log\left(x_{ij}/a_{ij}\right) \\ \text{subject to} & \displaystyle\sum_j x_{ij} = u_i, \\ & \displaystyle\sum_i x_{ij} = v_j. \end{array}$$
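The link between the scaling factors and this program can be seen from a standard Lagrangian argument (sketched here under the assumption of feasibility and strictly positive iterates). With multipliers $\lambda_i$ and $\mu_j$ on the row and column constraints, setting the derivative of the Lagrangian with respect to $x_{ij}$ to zero gives

$$-\log(x_{ij}/a_{ij}) - 1 - \lambda_i - \mu_j = 0 \quad \Longrightarrow \quad x_{ij} = a_{ij}\, e^{-1-\lambda_i}\, e^{-\mu_j},$$

so the optimal matrix is a row and column rescaling of $A$ - precisely the form generated by accumulating the factors $\rho_i^k$ and $\sigma_j^k$.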


The RAS algorithm and MB have similarities to the problem of image reconstruction from projections. Such problems appear in medical imaging and non-destructive material testing. MB can be viewed as a problem of image reconstruction from two orthogonal projections. Elements of the matrix can be viewed as the intensities of discretized pixels in the image reconstruction problem. Marginal values represent projections (line sums) along two orthogonal directions. In image reconstruction, one attempts to reconstruct the intensities of the pixels of a discretized image based on marginal values obtained along different angles. In MB, we construct a matrix based on views of the total input and output. The RAS algorithm is a special case of the multiplicative algebraic reconstruction technique (MART), which is often used for the solution of fully discretized models of image reconstruction. Therefore, any progress made with the solution of very large MB problems could have an impact on the reconstruction of highly discretized images. A broad treatment of image reconstruction is found in Herman [14], and a reference on the use of iterative algorithms for this problem class is Censor [10]. Zenios and Censor [22] discuss vector and parallel computing with block versions of MART.

Fig. 1. Sparse data structure for the RAS algorithm. (The non-zeros are stored row-wise in array VAL, of length equal to the number of non-zero entries; IROW, of length No. Rows + 1, points to the start of each row in VAL; ICOL, of length No. Columns + 1, points to the start of each column in the permutation array IPOINT, which has one entry per non-zero.)

RAS was implemented using a sparse data structure to facilitate the representation of very large problems that are typically sparse. The matrix is stored both row-wise and column-wise for the efficient computation of both row and column scaling factors and the subsequent scaling of the entries. Figure 1 illustrates the data structure we use. Row entries are stored consecutively in array VAL. A pointer integer array IROW points to the starting location of every row in array VAL. A column pointer array ICOL points to the starting location of every column in a permutation array IPOINT. This permutation array points to the elements of every column in array VAL.
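As an illustration, the structure of fig. 1 can be built as follows (a Python/NumPy sketch under our naming; the actual code uses FORTRAN integer and real arrays):

    import numpy as np

    def build_ras_structure(A):
        m, n = A.shape
        rows, cols = np.nonzero(A)                  # scanned in row-major order
        VAL = A[rows, cols].astype(float)           # non-zeros, row by row
        IROW = np.searchsorted(rows, np.arange(m + 1))   # start of row i in VAL
        order = np.argsort(cols, kind="stable")     # group VAL positions by column
        IPOINT = order                              # positions of each column's
                                                    #   non-zeros within VAL
        ICOL = np.searchsorted(cols[order], np.arange(n + 1))  # start of column j
                                                    #   in IPOINT
        return VAL, IROW, ICOL, IPOINT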

The algorithm can be partitioned for parallel computations in an obvious manner: scaling of all rows can be executed simultaneously, followed by simultaneous scaling of all columns. Synchronization is required when the algorithm proceeds from row to column scaling and vice versa. This synchronization requirement - a potential bottleneck for the parallel implementation - can be eliminated with the development of asynchronous algorithms, as discussed in Zenios [20].

2.1. A PARALLEL VARIANT OF RAS

We develop here a modification of RAS that allows the simultaneous updating of both rows and columns. The scheme proposed here allows more flexibility in the parallel implementation. A hierarchical decomposition is possible whereby one cluster of processors executes row scaling while a second cluster executes column scaling. Multiple processors within each cluster can operate simultaneously on multiple rows and columns, respectively.

THE MODIFIED RAS ALGORITHM

Input: An $m \times n$ non-negative matrix $A$ and positive vectors $u$ and $v$ of dimensions $m$ and $n$, respectively.

Step 0: (Initialization) Set $k = 0$ and $A^0 = A$. Choose $w_\rho, w_\sigma \in (0, 1)$.

Step 1: (Compute row scaling factors) For $i = 1, 2, \ldots, m$ define

$$\rho_i^k = \left( \frac{u_i}{\sum_{j=1}^{n} a_{ij}^k} \right)^{w_\rho}.$$

Step 2: (Compute column scaling factors) For $j = 1, 2, \ldots, n$ define

$$\sigma_j^k = \left( \frac{v_j}{\sum_{i=1}^{m} a_{ij}^k} \right)^{w_\sigma}.$$

Step 3: (Matrix updating) Define the updated matrix $A^{k+1}$ by

$$a_{ij}^{k+1} = \rho_i^k \, a_{ij}^k \, \sigma_j^k, \quad i = 1, 2, \ldots, m \ \text{and} \ j = 1, 2, \ldots, n.$$

Step 4: Replace $k \leftarrow k + 1$, and return to step 1.

The convergence of the modified algorithm proposed here follows directly from the results of Bertsekas and Tsitsiklis [8], pp. 501-506.*

*The results of [8] are more general than what is needed for the proof of convergence of this algorithm. Indeed, it can be shown that an asynchronous version of the modified algorithm can be derived based on the cited results. Interested readers should refer to the development of asynchronous algorithms in Zenios [20] where, however, the discussion goes beyond what is necessary for an understanding of the work reported here.
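One iteration of the variant, written out in the same dense NumPy style as before (a sketch; the default weights 0.8 are the values used in the experiments of section 5.3):

    import numpy as np

    def modified_ras_step(X, u, v, w_rho=0.8, w_sigma=0.8):
        # Steps 1 and 2: both factor vectors are computed from the same
        # iterate A^k, so the two computations are independent and can be
        # assigned to different clusters of processors.
        rho = (u / X.sum(axis=1)) ** w_rho
        sigma = (v / X.sum(axis=0)) ** w_sigma
        # Step 3: simultaneous update a_ij <- rho_i * a_ij * sigma_j
        return rho[:, None] * X * sigma[None, :]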

3. Elements of vector and parallel computing

In this section, we discuss selected aspects of vector and parallel computing in order to facilitate the discussion in subsequent sections. Readers who wish to obtain additional information on the aspects discussed here should consult the references in Zenios [21].

3.1. VECTORIZATION

Vectorization is defined as the restructuring of a program in order to exploit parallelism at the innermost loop level. This is made possible through vector hardware features which allow the processing of long vectors in roughly the same amount of time required for scalar operations. The primary vehicle of vectorization is the iterative procedure (i.e. inner DO-loops).

Efficient vectorization of a software system can be achieved if operations within a loop are independent in the sense of section 3.2. A typical inefficiency arises when sparsity is present and is dealt with using indirect addressing. Indirect - and nonlinear - addressing will in general introduce data dependencies and inhibit vectorization. This obstacle can be avoided by employing the following steps (sketched in code after the list):

(1) Gather the non-zero components of the sparse vector according to the indirect addressing list.

(2) Operate on the gathered data in vector mode using multiple functional units.

(3) Scatter the results to their proper memory address using the indirect list.
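In NumPy-like notation the three steps are simply (a sketch with toy data; `val` is the value array, `idx` the indirect addressing list, and `factor` a scaling constant):

    import numpy as np

    val = np.array([5.0, 2.0, 7.0, 1.0, 3.0])   # sparse values (toy data)
    idx = np.array([0, 2, 4])                   # indirect addressing list
    factor = 0.5                                # scaling constant

    work = val[idx]        # (1) GATHER the non-zeros into a dense workspace
    work = work * factor   # (2) operate on the gathered data in vector mode
    val[idx] = work        # (3) SCATTER the results back via the indirect list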

Finally, a word on implementation. In general, vectorization can be automated by means of vectorizing compilers. Such compilers will detect loops that can be processed in vector mode and use the appropriate hardware features. Unfortunately, vectorizing compilers cannot eliminate dependencies for the kinds of sparse problems we deal with in large-scale optimization. The execution of the program will then fall short of its potential, and it is up to the user to modify his or her software. This issue is exemplified in a later section with computational results.

3.2. MULTITASKING

Multitasking is defined as the structuring of a program into two or more components that can execute concurrently. The units of computation that can be scheduled for execution are called tasks. In general, there is no certainty that more than one processor will be dedicated to the tasks of a given job, that the tasks will execute in a particular order, or that a particular task will terminate first. In this respect, a multitasked program is non-deterministic with respect to time. With the aid of appropriate synchronization mechanisms, however, the program should be deterministic with respect to results. The property of a program that allows one copy of a module to be used by multiple tasks is called reentrancy. The module is typically a subroutine or a set of program statements, and its environment is recreated every time the routine is activated as a task. This means that local variables and indicators are assigned storage on a stack, independent from the storage used by other tasks. Reentrancy is a necessary but not sufficient property for the correct execution of a multitasked program. Parallel tasks will execute correctly only if they are independent of each other in two respects:

(1) Computational independence: The work to be performed by one task should be independent of the work performed by other tasks. For example, a task cannot accept input data computed in parallel by a second task. Since there is no certainty about the sequence in which tasks are completed, it is conceivable that the data for the second task will not be available when requested. Tasks that violate this rule are data dependent. The order of execution of statements in a task should also be independent of statements in other tasks. Incorrect results may be obtained when the order of execution of a statement cannot be determined before run time; for example, when conditional statements are used. It is then possible that the condition will be true or false depending on which task executes first. Violation of this rule is called control dependence. The presence of computational dependence causes unpredictable behavior of the program. Not only are the results incorrect, but the error may seem random, depending on the sequence of execution of the tasks.

(2) Storage independence: Each parallel computational task has access to global variables. Fetching and storing of all variables in a task should not interfere with that in another task. Storage dependence is often difficult to identify and is easily overlooked. A program with this kind of dependence may execute correctly depending on the exact instant when tasks access the same storage. Exact replication of this behavior is next to impossible. Modifying the program to provide each task with separate storage locations for the original variables will frequently eliminate dependencies.
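As a small illustration of storage independence (Python threads standing in for tasks; all names and toy data are ours, with the array layout of fig. 1), each task below writes only to its own rows' segments of the shared array, so no two tasks ever touch the same storage:

    import threading
    import numpy as np

    # toy data: 100 rows with 5 non-zeros each, in the IROW/VAL layout of fig. 1
    rng = np.random.default_rng(0)
    IROW = 5 * np.arange(101)
    VAL = rng.random(IROW[-1]) + 0.1
    u = np.ones(100)

    def scale_rows(rows):
        # local variables live on this task's own stack (reentrancy);
        # the task writes only to its own rows' segments of VAL
        for i in rows:
            seg = slice(IROW[i], IROW[i + 1])
            VAL[seg] *= u[i] / VAL[seg].sum()

    t1 = threading.Thread(target=scale_rows, args=(range(0, 50),))
    t2 = threading.Thread(target=scale_rows, args=(range(50, 100),))
    t1.start(); t2.start(); t1.join(); t2.join()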

Multitasking is typically implemented at the subroutine level. Modules that can execute in parallel are coded as subroutines. Calls to the system library activate these subroutines as independent tasks. Although there are some differences between the implementations of multitasking by different vendors, most follow the general concepts defined here.

An alternative approach to parallel computing is microtasking. Microtasking can achieve a subset of the operations that are possible with multitasking. With multitasking, tasks are specified by multiple subroutines which can be executed in parallel using task-start library calls. At the microtasking level, tasks are delimited with the aid of compiler directives. They can be sections of the code (within a single program module), the iterations of an outermost loop (that could possibly contain calls to subroutines), or even the iterations of an innermost loop.

The name microtasking is chosen because multiple processing is efficient even at the DO-loop level where the task granularity can be very small. Instead of using library routines to initiate and control tasks, the user has at his disposal a set of preprocessor directives. These directives delimit segments of code that are independent and provide appropriate synchronization points and locking mechanisms for shared variables. A microtasking preprocessor translates these directives into assembly language constructs using the multiprocessing hardware.

It should be mentioned that any segment of code suitable for microtasking can also be multitasked. The converse is not necessarily true. It is preferable, when possible, to structure a program for parallel execution with microtasking rather than multitasking, due to the low overhead introduced by the former and its ease of implementation relative to the latter. Further information on the distinction between these two techniques is given in [1] or [3].

3.3. THE ALLIANT FX/8 AND THE CRAY X-MP/48

The Alliant FX/8 is a shared memory vector multiprocessor with up to eight computational elements (CEs) and a peak rate of 94.4 MFLOPS. The CEs operate with a 170 nsec clock cycle. An overview of the Alliant FX series can be found in [1].

The Alliant architecture utilizes parallel processing in two forms: vector functional units in each CE, with hardware pipelining for every functional unit, and parallel operation of the multiple CEs. Multiple CEs can concurrently execute segments of code such as, for example, the iterations of a loop or program subroutines. Communication between the processors is achieved through the shared memory. In addition, FORTRAN programs may fork processes to the operating system for parallel asynchronous execution.


Each computational element uses a five-stage pipeline and multiple functional units that can overlap to achieve high performance on scalar operations. Instruction fetch, address calculations, and floating point add and multiply are performed by separate units. Any of the units may operate in parallel with the others. In addition, each CE contains hardware to support a full vector instruction set: 8 vector registers, each with 32 64-bit elements; 8 64-bit floating point scalar registers; and 8 32-bit integer registers with the same number of address registers. Each element can execute vector operations of length 32 64-bit words in parallel with the other elements, for an effective vector length of 256 data elements. Vector operations utilize a twelve-stage vector pipeline.

Four modes of execution are possible on the FX system: scalar, when all operations are performed serially by one CE; vector, when operations are performed in groups of up to 32 elements by special vector instructions on the hardware; scalar concurrent, when scalar operations are performed by a number of CEs concurrently; and vector concurrent, when multiple CEs operate concurrently on groups of up to 32 elements with the vector hardware.

The CRAY X-MP is a general-purpose vector multiprocessor system. Detailed description of this system can be found in Chen [12]. It comes in versions with two (X-MP/2) or four (X-MP/4) processors. Special hardware allows the efficient and coordinated application of multiple processors to a single job. Software support is available, and the user can call library routines to invoke multitasking functions; see Larson [16].

Multiprocessing with tightly coupled CPUs is one of the two levels of parallelism exhibited by the X-MP family. Individual processors have a vector architecture that facilitates the exploitation of parallelism at the innermost loop level. Every processor is designed with a number of dedicated functional units, both for floating point vector and scalar operations and for integer vector and scalar operations. The functional units are supplemented by a set of eight vector registers and the same number of scalar registers. Vector registers are 64 words long. Every CPU has four parallel paths to or from memory: two for vector storage, one for vector loading, and one for independent input/output. Again, the hardware features are not very important for the applications programmer. The CRAY vectorizing compiler will generate machine code that uses the hardware efficiently. In the presence of sparsity, however, the compiler may fail to recognize programming constructs that could vectorize. The user should determine which programming constructs could be vectorized and issue appropriate compiler directives, or modify the organization of the software. Vectorization then becomes more effective.

4. Vector and parallel computing with RAS

The algorithm was implemented on a CRAY X-MP/48 supercomputer for vector computations and on a shared memory Alliant FX/8 for parallel computing.


Both implementations are described here. We point out that each processor of the Alliant FX/8 has vector features, and the modifications discussed here for the CRAY were also carried over to the Alliant. The CRAY X-MP/48 supports both multitasking and microtasking. It is possible to implement the parallel algorithm on the X-MP/48, although we have not done so in the present study. The vectorized algorithm on the CRAY could solve some of the larger test problems within a few minutes of CPU time, and the additional savings from parallel computing on a 4-processor system do not justify the effort. On the Alliant FX/8, however, significant improvement in performance is possible only when using both vector and parallel computations.

4.1. VECTOR IMPLEMENTATION

Row scaling operations would vectorize automatically by the compiler. Given the starting and ending locations of a row in array VAL, the operations of RAS are executed over contiguous segments of memory. Although both the sum operation $\sum_j a_{ij}^k$ and the scaling operation $\rho_i^k a_{ij}^k$ would vectorize, some improvement in performance can be achieved by replacing these operations with computational kernels from the CRAY library. The SUM and SCALE kernels were used in place of the iterative loops for the sum and scaling operations.

Column scaling presents some difficulties due to the indirect addressing used for the sparse representation of the matrix. Given the starting and ending locations of a column, the algorithm has to search through the array VAL - using the pointer list IPOINT - to find the sum of the non-zero values of the column. This sum is then used to compute the column scaling factor $\sigma_j^k$. A second search through VAL will scale the non-zero entries of the column. To avoid the indirect addressing, we used a GATHER/SCATTER technique as outlined in section 3.1. The non-zero elements of the column are GATHERed into a dense workspace vector. The sum and scaling operations are then performed in vector mode using kernels identical to those used for the row scaling. Finally, the results are SCATTERed from the work array back into VAL using the permutation pointers IPOINT. The workspace vector is of dimension equal to the number of non-zero entries in the densest column of the matrix. This dimension never exceeds the number of rows of the matrix, and its use does not impose a major requirement on memory utilization.

The GATHER/SCATTER procedure for the column scaling operation is illustrated in fig. 2. We point out that the CRAY X-MP has dedicated hardware for the GATHER/SCATTER operations. Although such operations are not as efficient as operations on homogeneous vectors, they are still much faster than scalar code. In addition, using the technique described here we are able to carry out the floating point operations - vector reduction and scaling - in vector mode.


Fig. 2. The GATHER/SCATTER operations for column scaling. (The non-zeros of a column are GATHERed from VAL through IPOINT into a dense work array, the scaling operations are applied in vector mode, and the results are SCATTERed back through IPOINT into VAL.)
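Putting the pieces together, the column-scaling phase reads roughly as follows (a Python sketch; VAL, ICOL and IPOINT as in the structure sketch of section 2, and v the vector of column totals):

    def scale_columns(VAL, ICOL, IPOINT, v):
        n = len(ICOL) - 1
        for j in range(n):
            idx = IPOINT[ICOL[j]:ICOL[j + 1]]   # positions of column j in VAL
            work = VAL[idx]                     # GATHER into the dense workspace
            VAL[idx] = work * (v[j] / work.sum())   # scale, then SCATTER back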

4.2. PARALLEL IMPLEMENTATION OF RAS

RAS is well suited for parallel computation. Operations on multiple rows can be carried out concurrently without any need for synchronization. Each row has read and write access to different entries of the VAL array, and the scaling operations exhibit both computational and storage independence, as defined in section 3.2. The same is also true for column operations. Once all rows are scaled, the process is repeated for the columns, with multiple processors operating on different columns concurrently. Synchronization is required when switching between row and column scaling (and vice versa), since a row and a column may have overlapping entries, which would result in memory write conflicts and incorrect results (i.e., storage dependence). Such a parallel mode of computing was implemented using microtasking directives (CONCUR) on the Alliant FX/8. A dynamic task scheduling mechanism is employed. Each row (column) operation is an independent task scheduled for execution on the next available processor. This scheme has the advantage of ensuring load balancing. Processors that are assigned dense rows (columns) will complete each task in more time and hence will process a smaller number of tasks than processors operating on sparse rows (columns). The disadvantage is the small granularity of the tasks assigned to each processor.
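A minimal sketch of this dynamic scheme (Python's thread pool standing in for the CONCUR directive; `scale_row` and `scale_col` are assumed to be the per-row and per-column operations). The completion of each `map` call provides the synchronization point between the row and column phases:

    from concurrent.futures import ThreadPoolExecutor

    def ras_sweep(pool, scale_row, scale_col, m, n):
        # each row (column) is an independent task handed to the next
        # free worker, which balances the load dynamically
        list(pool.map(scale_row, range(m)))   # barrier: all rows scaled
        list(pool.map(scale_col, range(n)))   # barrier: all columns scaled

    # usage: with ThreadPoolExecutor(max_workers=8) as pool:
    #            ras_sweep(pool, scale_row, scale_col, m, n)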


An alternative parallel scheme is to partition the matrix into blocks of rows (columns) of fixed size. The number of blocks could be equal to the number of available processors, so that each processor is assigned one task of large granularity. Recent developments with block-iterative algorithms for entropy minimization provide great flexibility in specifying the block sizes and even allow us to design blocks containing both rows and columns; see, for example, Censor and Segman [11]. Such techniques were shown to be very effective for parallel computations in the domain of medical image reconstruction from projections (Zenios and Censor [22]). They can be specialized for the MB problems addressed here if we consider the MB problem as a model of image reconstruction from two orthogonal projections, as discussed in section 2. Although such alternative parallel schemes are of interest both theoretically and computationally, they are not given any further consideration in the present study. The justification for not doing so is provided by the performance of the parallel scheme we did implement, which achieved almost 90% efficiency on the Alliant FX/8.

4.3. PARALLEL IMPLEMENTATION OF MODIFIED RAS

The modified RAS algorithm of section 2.1 provides the mechanism for a hierarchical parallel implementation. A cluster of processors is assigned to the computation of the row scaling factors, and a second cluster is assigned to the computation of the column scaling factors. Multiple processors within each cluster can operate independently on multiple rows and columns, respectively. Once the scaling factors are computed, all processors coordinate on updating the matrix. No synchronization is required between row and column scaling in this phase. The only synchronization point is between successive iterations: all the matrix entries have to be updated before the clusters can repeat the column and row scaling calculations.

5. Computational results

The ultimate objective in the design of vector and parallel computers, and the related research in algorithm development, is to solve problems whose large size would render them unsolvable on von Neumann systems. To justify the claim that parallel computing can solve very large MB problems, we conducted computational experiments using both the vector implementation on the CRAY X-MP and the parallel implementation on the Alliant FX/8. Real data are used throughout the experiments. As mentioned in the introduction, the size of the test problems is one to two orders of magnitude larger than what has been solved previously. Even the examples mentioned above are considered very large by the majority of the publications in this area; the solution of problems of dimension more than 80 x 80 is rarely reported.


5.1. TEST PROBLEMS

Test problems were derived from regional input/output accounts. The first source was the National table for the U.S. for 1977 [4]. This set of data consists of the Make and Use matrices, with a classification scheme of 537 sectors. The second source was the multiregional input/output accounts of the U.S. for 1977 [2]. This set of accounts consists of input/output tables for forty-eight States, the District of Columbia, and a rest-of-the-world account, using an industrial classification scheme of approximately 120 sectors. For our test problem, we assembled a table with Make submatrices along the diagonal and inter-regional trade flows off the diagonal that provide coupling between the regions. The resultant table has approximately 6000 (= 50 x 120) accounts and over 370 000 non-zero transactions, both inter- and intra-regional.

The purpose of this exercise was not to provide a set of reconciled tables, either for the multiregional or the regional test problems. We are interested primarily in demonstrating that problems of this size can be balanced efficiently before proceeding with the collection of the marginal values that will be needed in order to develop a consistent set of accounts. To create balancing problems based on the available data, we computed the row and column sums of each test problem and then added noise to the entries of the matrix. Thus, we obtained tables whose entries did not add up to the precomputed control totals. Several alternative ways of adding noise were tried in order to create difficult problems.

Table 1

Test problem characteristics

Problem     Rows x columns         No. of non-zero entries
USE537      504 x 473              57 247
MAKE537     523 x 533              9 586
MRIO        approx. 6000 x 6000    approx. 370 000

For example, adding noise to the entries of the matrix in proportion to their current value resulted in problems that were balanced within a few iterations. The same was true when noise was added to the row and column totals, again in proportion to their current value. The most difficult problems were obtained when noise drawn from a uniform distribution was added to all entries of the matrix. This approach was used in creating all the test problems. Noise was added incrementally to the entries until the problem became infeasible, and the largest noise level for which RAS would converge was used to set up the problems. The characteristics of the test problems are summarized in table 1.
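The setup procedure can be summarized as follows (a NumPy sketch; the noise level `eps` is an assumed parameter, increased incrementally as described above):

    import numpy as np

    def make_test_problem(A, eps, seed=0):
        u, v = A.sum(axis=1), A.sum(axis=0)     # precomputed control totals
        rng = np.random.default_rng(seed)
        noise = rng.uniform(-eps, eps, size=A.shape)
        # perturb only the non-zero entries, keeping them non-negative
        X0 = np.where(A > 0, np.maximum(A + noise, 0.0), 0.0)
        return X0, u, v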


To provide a benchmark against which to evaluate the performance of the vector and parallel algorithms, we first ran RAS on a VAX 8700 large mainframe and on a Floating Point Systems (FPS) M64/60 attached array processor. The results are summarized in table 2.

Table 2

Benchmark solution times with RAS in hrs:min:sec.

Problem     Error      VAX 8700                  FPS M64/60
                       Iterations   CPU time     Iterations   CPU time
USE537      10^-6      31           0:00:10      25           0:00:01
            10^-8      46           0:00:15      41           0:00:02
            10^-12     78           0:00:26      73           0:00:04
MAKE537     10^-4      443          0:00:30      713          0:00:11
            10^-6      66827        1:13:06      22280        0:05:54
            10^-8      -            > 20:00:00   273701       1:12:28
MRIO        -          -            -            -            -

We observe that the dense problems (USE537) are more expensive per iteration, but the algorithm converges to a very accurate solution in a small number of iterations. This situation is reversed for the larger and sparser problems. The number of iterations varies significantly between the VAX and FPS implementations. The difference can be explained by the numerical behavior of RAS when implemented using finite precision. Although 64-bit words are used throughout our experiments, different computers have different levels of accuracy for double precision arithmetic. This affects the convergence of the algorithm, especially for ill-conditioned problems like MAKE537.

We point out that the elapsed wall-clock time - the time a user has to wait for an answer - exceeded 20 hours for the longer run with MAKE537 on the VAX, and no attempt was made to solve the MRIO problem on either the VAX or the FPS. The wall-clock time for the smaller runs on the FPS is approximately equal to the CPU time. For the larger run, where the CPU time exceeds the time-slice of 5 minutes allocated to any single job, the wall-clock time is significantly higher, depending on the number of jobs competing for the system. During periods with 3-4 users, the wall-clock time would exceed several hours.

5.2. VECTOR COMPUTING ON A CRAY X-MP

The results on the CRAY X-MP/48 are summarized in table 3. Observe, first, an improvement by a factor of two in the performance of the vectorized over the scalar algorithm.


Table 3

Solution times on the CRAY X-MP/48 in hrs:min:sec.

Problem     Error      Iterations   Scalar       Compiler        User
                                                 vectorization   vectorization
USE537      10^-6      28           0:00:02.8    -               0:00:00.3
            10^-12     77           -            -               0:00:00.9
MAKE537     10^-4      956          0:00:14.2    0:00:08.9       0:00:06.9
            10^-6      44434        0:10:54.9    0:06:48.9       0:05:20.5
            10^-8      306459       -            0:46:18.5       0:36:46.9
MRIO        10^-4      50000        -            -               1:41:10.0

The most striking observation from this table is the solution time for MAKE537, the most difficult of the test problems. The solution time of over 20 hours on the VAX mainframe has been brought down to almost 0.5 CPU hour on the supercomputer. For lower accuracy, the solution time was reduced from over 1 hour to almost 5 minutes, with the elapsed wall-clock time reduced from over 6 hours to less than 10 minutes. It is also possible to update the fully disaggregated multiregional table with over 370 000 non-zero entries, although the computing resources spent in doing so are not trivial.

5.3. PARALLEL COMPUTING ON THE ALLIANT FX/8

The results on the Alliant FX/8 with the parallel implementation of (the original) RAS are summarized in table 4. Some improvement in the performance of the algorithm is realized with the vectorization of the software. Parts of the code - the row and column sum and scaling operations - were written using the FORTRAN 8X array extensions that are supported by the Alliant. The efficiency of the vectorized code is slightly lower than the improvement in performance exhibited by the vectorized code on the CRAY. Significant speedup is achieved with parallel computations using all eight processors of the system. The average speedup factor - averaged over the solution of MAKE537 to three different levels of accuracy and the solution of USE537 - is shown in fig. 3 as a function of the number of processors. The performance is indeed remarkable, approaching linear speedup. The small inefficiency (around 10%) is due to the overhead of initializing the tasks and the need to check for convergence of the algorithm, which is a sequential operation.
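For reference, the speedup and efficiency quoted here follow the usual definitions,

$$S_p = \frac{T_1}{T_p}, \qquad E_p = \frac{S_p}{p},$$

where $T_p$ is the solution time with $p$ processors and $T_1$ the one-processor (vectorized) time. For MAKE537 at error $10^{-8}$ in table 4, for example, $S_8 = 5{:}55{:}15.9 \,/\, 0{:}49{:}45.9 \approx 7.1$, and hence $E_8 \approx 0.89$, the almost 90% efficiency cited above.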

Using the modified RAS with weight factors $w_\rho = w_\sigma = 0.8$ in the multiprocessing environment, and comparing it with the sequential implementation of the unmodified RAS, we obtained the speedup curve of fig. 4. Although it appears that the speedup is superlinear, a more careful examination of the results indicated that the modified RAS is more efficient than the original RAS even in the sequential case.


Table 4

Solution time on the Alliant FX/8 in hrs:min:sec.

Problem:       MAKE537      MAKE537      MAKE537      USE537
Error:         10^-4        10^-6        10^-8        10^-12
Iterations:    802          80057        342081       70
Scalar:        00:01:19.9   02:10:20.8   09:20:51.0   00:00:31.0
Vectorized:    00:00:51.3   01:23:13.4   05:55:15.9   00:00:18.5
2-CPUs:        00:00:25.7   00:41:37.6   02:59:33.9   00:00:09.5
3-CPUs:        00:00:17.5   00:28:13.0   02:00:46.9   00:00:06.4
4-CPUs:        00:00:13.5   00:21:39.1   01:31:54.9   00:00:04.9
5-CPUs:        00:00:11.0   00:17:32.9   01:14:57.7   00:00:04.1
6-CPUs:        00:00:09.5   00:14:56.1   01:03:37.7   00:00:03.5
7-CPUs:        00:00:08.3   00:13:04.0   00:55:39.6   00:00:03.1
8-CPUs:        00:00:07.5   00:11:45.0   00:49:45.9   00:00:02.7

Fig. 3. Speedup factors versus number of processors on the Alliant FX/8. (Observed speedup plotted against linear speedup for up to 8 processors.)


Fig. 4. Speedup factors with the modified RAS on the Alliant FX/8.


Fig. 5. The performance of modified RAS. (Ratio of modified RAS iterations to RAS iterations as a function of the weight $w$, over roughly 0.3 to 0.93, for the test problems MAKE537, STONE, and a random 30 x 30 matrix.)


Were we to develop the speedup curve based on a comparison of the parallel modified RAS with the sequential modified RAS, we would obtain a curve identical to that of fig. 3. Nevertheless, we present that of fig. 4, since it is the experimentation with parallel computing that revealed the potential of the modified RAS as a more efficient sequential algorithm. Figure 5 illustrates the relative performance of the modified RAS with respect to RAS with varying weights for three test problems: STONE, MAKE537, and a randomly generated 30 x 30 matrix. These results were obtained on the VAX 8700. We observe that a fair amount of fine tuning is required in order to find the set of weights that makes the modified RAS outperform RAS. While such fine-tuning may be worthwhile for larger problems, it does not measure up to the significant savings realized with parallelism.

6. Conclusions

We have shown in this report that vector and parallel computing can be applied successfully to the solution of very large matrix balancing problems. The technical observations - parallel implementations, speedup factors and efficient vectorization of RAS - are very encouraging. More important, however, is the overall achievement of solving problems orders of magnitude larger than what was considered feasible a few years ago. Solutions are obtained within minutes or seconds of computing time. Such advances could have a significant impact on modeling activities where matrix balancing appears as a core problem. Areas that stand to benefit from the work described here are economics, regional planning, transportation and traffic assignment. Some encouraging lessons are also drawn for the related problem of image reconstruction from projections, with its own wide range of applications.

Acknowledgements

We would like to acknowledge numerous useful discussions with Professor Yair Censor on the relation between matrix balancing problems and problems in image reconstruction from projections. We would also like to acknowledge the assistance of Mr. E. Chajakis in the computational experiments with the modified algorithm.

References

[1] FX/FORTRAN Programmer's Handbook (Alliant Computer Systems Corporation, Acton, MA, May 1985).

[2] The Multiregional Input-Output Accounts for 1977, The MRPIS Project (The Social Welfare Research Institute, Boston College, Feb. 1988).

[3] Multitasking User Guide, CRAY Computer Systems Technical Note SN-0222 (CRAY Research, Inc., March 1986).

[4] Survey of Current Business (May 1984).

[5] M. Bacharach, Biproportional Matrices and Input-Output Change (Cambridge University Press, U.K., 1970).

[6] A. Bachem and B. Korte, On the RAS-algorithm, Computing 23(1979)189-198.

[7] T. Barker, F. van der Ploeg and M. Weale, A balanced system of National accounts for the United Kingdom, The Review of Income and Wealth 30, 4(1979)461-485.

[8] D.P. Bertsekas and J.N. Tsitsiklis, Parallel and Distributed Computation: Numerical Methods (Prentice-Hall, New Jersey, 1988).

[9] L.M. Bregman, Proof of the convergence of Sheleikhovskii's method for a problem with transportation constraints, USSR Computational Mathematics and Mathematical Physics 7, 1(1967)191-204.

[10] Y. Censor, Parallel application of block-iterative methods in medical imaging and radiation therapy, Math. Progr., Series B, 42, 2(1988)307-326.

[11] Y. Censor and J. Segman, On block-iterative entropy maximization, J. Information and Optimization Sciences 8(1987)275-291.

[12] S.S. Chen, Large-scale and high-speed multiprocessor system for scientific applications, in: High Speed Computation, ed. J.S. Kowalik, NATO ASI Series F, 7 (Springer-Verlag, Berlin, 1984).

[13] B. Harris, Some notes on parallel computing: with special reference to transportation and land-use modeling, Environment and Planning A 17(1985)1275-1278.

[14] G.T. Herman, Image Reconstruction from Projections: The Fundamentals of Computerized Tomography (Academic Press, New York, 1980).

[15] J.G. Klincewicz, Implementing an exact Newton method for separable convex transportation problems, Networks (1987), to appear.

[16] J.L. Larson, Multitasking on the CRAY X-MP/2 multiprocessor, Computer (July 1984).

[17] R.E. Miller and P.D. Blair, Input-Output Analysis: Foundations and Extensions (Prentice-Hall, New Jersey, 1985).

[18] M.H. Schneider and S.A. Zenios, A comparative study of algorithms for matrix estimation, Oper. Res. (1989), to appear.

[19] S.A. Zenios, A. Drud and J.M. Mulvey, Balancing large social accounting matrices with nonlinear network programming, Networks (1989), to appear.

[20] S.A. Zenios, Matrix balancing on a massively parallel Connection Machine, ORSA Journal on Computing, to appear.

[21] S.A. Zenios, Parallel numerical optimization: Current status and an annotated bibliography, ORSA Journal on Computing 1, 1(1989)20-43.

[22] S.A. Zenios and Y. Censor, Vectorization and multitasking of block iterative algorithms for image reconstruction, in: 4th Int. Symp. on Science and Engineering on CRAY Supercomputers, Minneapolis, MN (Oct. 1988), pp. 241-264.