An implementation of a portable instrumented communication library using CS tools


Future Generation Computer Systems 9 (1993) 53-61, North-Holland

An implementation of a portable instrumented communication library using CS tools

Andrew Grant (a) and Robert Dickens (b)

(a) Computer Graphics Unit, Manchester Computing Centre, University of Manchester, Manchester, M13 9PL, UK
(b) Computer Services Centre, University of Reading, Whiteknights, P.O. Box 220, Reading, Berkshire RG6 2AX, UK

Abstract

Grant, A. and R. Dickens, An implementation of a portable instrumented communication library using CS tools, Future Generation Computer Systems 9 (1993) 53-61.

The Portable Instrumented Communication Library (PICL), developed at Oak Ridge National Laboratory, is a high-level communications library which has been implemented on a range of distributed memory parallel machines such as the Intel iPSC/2 and the NCUBE. The library is implemented on top of the native message passing libraries for those machines; hence, programs written using PICL are portable at source code level between the different machines. PICL has an associated tool, called ParaGraph, which allows tracing information generated by PICL programs to be displayed graphically. This can give some insight into program efficiency, load balancing and communications overhead. This paper describes an implementation of PICL on a Meiko Computing Surface using CS Tools. The use of its associated animated graphical display system, ParaGraph, and of some other types of profiling tools is also discussed.

Keywords. PICL; ParaGraph; communication libraries; CS tools; Meiko Computing Surface; distributed memory parallel computers; performance monitoring; portability.

1. Introduction

This paper reports on an implementation of a portable instrumented communications library built on top of CS Tools [3]. In carrying out the implementation two issues are being addressed which, on moving from sequential to parallel machines, become increasingly important.

The first of these is the issue of portability. A given program would normally have to be rewritten for a different parallel machine in order to take advantage of its particular features. Even those machines which are based on a similar design, e.g. MIMD message-passing architectures, are usually programmed using different proprietary communication primitives.

Correspondence to: A. Grant, Computer Graphics Unit, Manchester Computing Centre, University of Manchester, Manchester, M13 9PL, UK. Email: [email protected]. Tel: +44 (0)61 275 6096. Fax: +44 (0)61 275 6040.

The second issue is that of performance monitoring and optimisation. A wish to achieve high performance is usually the motivation for employing parallelism in the first place. It is therefore especially important to ensure that a parallel program's execution can be profiled adequately in order to aid the process of performance optimisation.

In this work we have used the Oak Ridge National Laboratory's message-passing library, PICL (Portable Instrumented Communication Library) [1], and animated graphical display system, ParaGraph [2]. PICL is designed to be implemented on different MIMD-type parallel computers, and its routines incorporate a means of transparently collecting tracing data and routing this back to the filing system. ParaGraph is an accompanying post-processing utility for the purpose of presenting this complex information in animated graphical form.

PICL and ParaGraph allow programs to be written which will run on a range of different parallel machines, and allow the execution of those programs to be conveniently visualised with the aid of animated graphics. It should thus be possible to observe the behaviour of programs, and thereby attempt to optimise the performance.

PICL has been implemented on a number of machines, including the Intel iPSC/2, the Intel iPSC/860 and the NCUBE, as part of a project to characterise the performance of various algorithms on a range of parallel machines [1]. It has been made freely available, together with ParaGraph, in order that other workers may benefit from its advantages and also contribute to the project by carrying out implementations on other parallel machines. Thus, PICL has now been implemented on a Meiko Computing Surface, using CS Tools. To our knowledge, no other such implementation has yet been reported.

Before describing the method of implementation, the role of graphical profiling tools as part of the parallel programming process is discussed. The structure of the PICL library, and its communication model, are then contrasted with those of CS Tools. Finally, some timings from a selection of simple applications are reported, followed by a discussion of possible further developments.

2. The use of graphical tools to aid the parallel programming process

On conventional sequential computers execution profilers can reveal vital information about the complex interactions between a program and its data stored in memory. The way in which data is accessed by the program is crucial to its performance; hence, only by having a good insight into the data access patterns can the programmer successfully modify the code to achieve performance improvements.

On parallel machines, an extra dimension of complexity is added to the data access patterns when multiple processors exchange data across a network. For programs involving complex communication between the processing nodes it is extremely difficult to gain a suitable insight into the data access patterns using the kind of profiling tools mentioned above. The amount of data typically generated would be so overwhelming in magnitude and complexity that any useful insights would most likely be obscured. The obvious way to extract the key features from the profiling information is to make use of human visual perception capabilities and display the information graphically.

A considerable proportion of parallel programs are developed by adapting existing sequential code. Although it is generally desirable to develop parallel programs from scratch, this is frequently impractical. It is very rare that the first attempt at the parallel program will run efficiently. In fact in most cases, a newly parallelised program, or indeed a new parallel program written from scratch, is likely to run at only a fraction of its peak performance. The reasons why this may be the case are numerous. For example:

- There are likely to be sequential bottlenecks in the code which could be removed or minimised.

- The granularity, or size, of the processes may mean that the balance between communication and computation is poor, so that processors are under-utilised.

For these reasons the role of optimisation and tuning is much more important in the parallel programming methodology than for sequential programming, and the use of effective profiling tools is essential.

To a certain extent, a profile of a parallel program could be assembled by 'time stamping' sections of code and by using print statements to output the results. However, as mentioned above, the amount of information generated would be difficult to interpret. In addition, using print statements in programs causes message traffic and congestion amongst processors, which can alter the behaviour of the program and give misleading information.

It is therefore important that any profiling tools used for parallel programming have a minimal impact on the program being executed and allow the key features of the profile to be easily interpreted.


2.1 What kinds of tools are required?

There are two main areas where graphical tools can provide useful insights: firstly, for code analysis, when converting existing sequential programs, and secondly, for optimising parallel programs.

As a typical example, the first of these tasks may be addressed with the Express system's vtool [5]. Here, a preprocessor inserts commands at suitable points in the program to generate tracing information as the program executes. The program is then executed and the tracing information is saved to a file. Given this information, vtool is then able to produce a graphical 'playback' of the program's execution. The tool presents various views of the data and allows the user to scroll through the source code corresponding to the view at any particular time. The real insights, however, come when the playback is animated. In this case, the memory access patterns are displayed in various colours depending upon how the data is accessed.

The second class of tools, those used for optimising programs, is typified by the ParaGraph system which was developed in conjunction with PICL [2]. The remainder of this section provides a description of this tool.

ParaGraph provides an X Windows interface to a set of tools which can be used to display various aspects of a program's behaviour. The parallel program includes embedded commands which allow execution tracing to be turned on or off as required. The tracing information is dumped into a buffer at the processing node and is sent back to the host processor either when the program terminates or when the buffer becomes full. In this way the collection of the tracing statistics does not dramatically alter the behaviour of the parallel program.
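As a rough illustration, tracing of a region of interest might be embedded along the following lines; the routine names and argument lists here are assumptions based on our reading of the PICL documentation, not a definitive interface.

```c
/* Illustrative sketch only: the tracing calls below follow the general pattern
 * of PICL's tracing interface, but the exact routine names and argument lists
 * should be checked against the PICL reference manual. */
#include "picl.h"                 /* assumed PICL header name */

static void do_work(void) { /* application code whose events are to be traced */ }

void traced_region(void)
{
    tracenode(100000, 0, 0);      /* assumed: reserve a trace buffer on this node     */
    tracelevel(1, 1, 1);          /* assumed: record events, compute and comms stats  */

    do_work();                    /* PICL calls made here are logged transparently    */

    tracelevel(0, 0, 0);          /* assumed: switch tracing off again                */
    traceflush();                 /* assumed: return the buffered records to the host */
}
```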

ParaGraph provides nine different displays, each of which gives a different perspective of the same underlying tracing information. In general the displays change dynamically, with execution time in the original program simulated by time steps in the display. Figure 1 shows a snapshot of some of the ParaGraph tools executing a trace file.

Animation tool: The Animation tool is used to show interprocessor communication as execution proceeds. The display shows a series of nodes, the colours of which change depending upon their current state: busy, idle, sending or receiving a message. Arcs joining nodes appear when there is communication between those nodes, and disappear when the communication has completed. Thus the arcs are used to represent logical links between processors and not physical links.

Fig. 1. A display of some of the tools in ParaGraph.

Message-lengths: The Message-lengths tool shows a two-dimensional array representing processor nodes. The elements of the array change colour when a communication takes place between two nodes, e.g. when node 2 communicates with node 3, then array element (2,3) changes colour. The colour assigned to the array element represents the length of the message passed.

Kiviat diagram: The Kiviat tool is used to show processor utilisation and overall processor load-balancing. It is a dynamic tool in which each processor is represented as a segment of a circle. The current usage of each processor is indicated by the amount of the segment which is shaded, and a lighter shade is used to show the 'high water mark' usage for that processor.

Gantt chart: The Gantt chart shows a histogram of processor utilisation as time progresses. When a processor is busy then the part of the chart depicting that processor is coloured green. If the processor becomes idle then the chart is coloured red.

Aggregate processor utilisation: This simply gives a histogram of current processor utilisation as time progresses.

Aggregate communication: This tool displays a histogram showing total communication volume as a function of time.

Message queues: This display shows a histogram depicting the size of the input message queue on each processor. Dark shading indicates the current queue-size and lighter shading indicates the 'high water mark'.

3. Structure of the PICL library

PICL consists of three layers of libraries which sit above the underlying communications library of the native parallel machine, as depicted in Fig. 2.

The top layer consists of two sets of routines. The first set is used to perform certain high-level operations such as broadcasts and global sums, while the second set is used to generate tracing information.

The top layer references a set of low-level routines, constituting an intermediate layer of the library. This layer is concerned with the basic initialisation of communication and the transfer of individual messages between processors.

The intermediate layer in turn references a set of environment-level routines which are written in terms of the native communication library of a particular machine.

Fig. 2. Structure of the PICL libraries. From top to bottom the layers are: High-Level Routines and Tracing Routines; Low-Level Routines; Environment Routines; and the Underlying Communication Library. The top two layers are machine independent; the environment routines and the underlying communication library are machine dependent.

Thus, the two upper layers of the PICL library are entirely portable, and only the environment layer needs to be tailored when implementing the library on a new machine.
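As an illustration of what porting involves, the environment layer must supply something like the following small set of operations; the names below are hypothetical placeholders of ours, not PICL's actual internal routines.

```c
/* Hypothetical sketch of the kind of interface the environment layer has to
 * provide; the names are placeholders, intended only to show what must be
 * re-implemented in terms of the native library on each new machine. */
void   env_open(void);                              /* start up the native comms system        */
void   env_close(void);                             /* shut it down                            */
void   env_send(int dest, int type,
                const char *buf, int nbytes);       /* deliver one message to a node           */
void   env_recv(int type, char *buf, int nbytes);   /* block until a matching message arrives  */
int    env_probe(int type);                         /* non-blocking test for a queued message  */
double env_clock(void);                             /* timestamps for the tracing layer        */
```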

PICL assumes that the machine consists of a central host processor and a number of identical node processors. Separate host and node versions of each routine in each layer of the library are therefore required.

4. The communication model: PICL versus CS tools

The PICL library originated as a means of writing portable programs for the iPSC/1 and NCUBE machines [1], and as such is based on the model of communication which they share. It assumes a set of autonomous processors, each possessing a fixed amount of memory to which no other processor has access. Processors share data by passing messages to each other using blocked communications. That is, if processor i has data required by processor j, then i must explicitly send the data using the send command and j must issue a corresponding recv command. Until the message is copied from i's buffers into system buffers then processor i will be idle or blocked. Similarly, j is blocked from the time it issues the recv command until a message satisfying the request arrives and is copied into a specified user buffer.

PICL assumes the interprocessor communication is interrupt-driven. Hence, if i sends the message to j before j has issued the recv command to receive it, then j's operating system must interrupt whatever task is currently being performed in order to intercept the message on j's behalf, and store it in a system buffer until j is ready to receive it.

Hence, PICL supports the asynchronous programming style, rather than the synchronous style where each sending process blocks until the receiving process issues a corresponding recv command.

A PICL application program consists of a single host process and several (possibly different) node processes; the host has access to the filing system, and loads each node onto a separate processor.
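A minimal sketch of this blocking model, written as a node program in C, is given below; the who0, send0 and recv0 argument orders are assumptions based on our reading of the PICL manual and should be checked against the library itself.

```c
/* Sketch of a blocking exchange between two node processes in the style of
 * PICL's send0/recv0; argument orders are assumptions, not a definitive API. */
#include "picl.h"                            /* assumed PICL header name */

void exchange(void)
{
    int nprocs, me, host;
    double x = 3.14, y = 0.0;

    who0(&nprocs, &me, &host);               /* assumed: query node count and own number    */

    if (me == 0)
        send0((char *)&x, (int)sizeof x, 99, 1);  /* blocks until x has left the user buffer */
    else if (me == 1)
        recv0((char *)&y, (int)sizeof y, 99);     /* blocks until a type-99 message arrives  */
}
```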

In contrast, CS Tools allows both blocked and non-blocked, synchronous and asynchronous modes of communication. In CS Tools there is no distinction between host and node as there is above, and all processes are loaded by a separate 'parfile loader' [3] or custom CS Build [3] program. Any process may access the filing system, and several processes may be allocated to a single processor.

Since in PICL only blocking receives are allowed, in order to allow computation and communication to be overlapped at the receiver, messages must be buffered at the receiver implicitly. To distinguish between messages, each one has an associated 'type', and a message's arrival in the buffer may be determined by some form of 'probe'.

5. Implicit buffering - the message handler

On a machine supporting CS Tools, it will be most convenient to use CSN transports [3] to pass messages. As mentioned above, the CSN does not provide implicit buffering, and it was suggested to us [4] that the best way of implementing this might be to have a separate 'message-handler' process accompanying each user process. Thus, messages intended for a particular user process are instead sent to its message handler, which is always ready to accept them (subject to the availability of buffer space). A message may then be forwarded to the user process when this is ready to accept it. This, then, is the basis of the implementation described here.

The message-handler process performs the following three tasks:
- accepts messages from user processes;
- fields requests issued by the user process it accompanies;
- maintains message-buffer and empty-buffer linked lists (the number and size of buffers may be adjusted for efficiency).

The relationship of the message handler to the user program is shown in Fig. 3. The CSN 'names library' [3] is used to address transports (csn_registername(), csn_lookupname(), etc.). Three transports are opened on each processor (see Fig. 3) and given names according to node number and function. For example, the 'input', 'output' and 'user' transports of the first node are named "0i", "0o" and "0u" (those on the host are named "hi", "ho" and "hu").


Fig. 3. Relationship of user program and message handler on each processor.
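A sketch of how these transports might be opened and named is given below; the csn_* calls and the CSN_NULL_ID constant are written from memory of the CS Tools documentation, so the exact signatures should be treated as assumptions.

```c
/* Illustrative sketch of opening and naming the three per-node transports;
 * the csn_* signatures shown here may differ in detail from CS Tools. */
#include <stdio.h>
#include <csn/csn.h>                         /* assumed header location */

static Transport in_tp, out_tp, user_tp;

void open_transports(int node)
{
    char name[16];

    csn_init();                              /* attach this process to the CSN       */

    csn_open(CSN_NULL_ID, &in_tp);           /* message handler's 'input' transport  */
    sprintf(name, "%di", node);              /* e.g. "0i" on the first node          */
    csn_registername(in_tp, name);

    csn_open(CSN_NULL_ID, &out_tp);          /* message handler's 'output' transport */
    sprintf(name, "%do", node);
    csn_registername(out_tp, name);

    csn_open(CSN_NULL_ID, &user_tp);         /* the user process's transport         */
    sprintf(name, "%du", node);
    csn_registername(user_tp, name);
}
```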

6. Sending and receiving messages

The message handler associated with each user process maintains two linked lists of buffers - an empty-buffer list which is always ready to receive messages and a message-buffer list which holds the messages until the user process is ready to receive them (Fig. 4).

Each message is associated with an integer 'message type' which must be communicated along with the message itself. In order to transfer both of these together in a single operation, the message and message type must be adjacent in memory and so are copied to a separate buffer. This is then transferred synchronously to an empty buffer of the message handler accompanying the destination user process (this corresponds to asynchronous communication from the user's viewpoint).

The message handler inserts a pointer to this buffer at the end of its message-buffer list and removes it from the empty-buffer list.
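One possible buffer layout achieving this adjacency is sketched below; the structure and field names are ours, for illustration only.

```c
/* One possible layout for a handler buffer: the integer message type sits
 * immediately in front of the message body, so type and data can be moved in
 * a single transfer. Structure and field names are illustrative, not PICL's. */
#define MAX_MSG_BYTES 4096

struct msg_buffer {
    int    type;                      /* PICL message type, travels with the data        */
    int    length;                    /* number of valid bytes in data[]                 */
    char   data[MAX_MSG_BYTES];       /* message body                                    */
    struct msg_buffer *next;          /* link in the empty-buffer or message-buffer list */
};
```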

When ready, the destination user process requests the message from its message handler. User requests are translated into one of several types of lower-level request to be issued to the message handler, along with the message type of interest:

REQST_CONFIRM: check whether a message of the specified type is present in the message-buffer list;

REQST_CONFIRM_BLOCK: wait for the next message to arrive in the message-buffer list if one is not already present. Check whether this is of the specified type;

REQST_SEND: send the first message of the specified type if one is present;

REQST_SEND_IF_1ST: send the message of the specified type only if it is the first in the queue;

REQST_SEND_BLOCK: wait for a message of the specified type to arrive if one is not already present, then send it;

REQST_SEND_NOCONF: send a message of the specified type without confirmation (this is used when it has already been confirmed that a message is present);

REQST_FINISH: terminate the message-handler process.

After the message has been sent to the user process, the message handler adds its buffer to the end of the empty-buffer list and removes it from the message-buffer list.
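Putting the pieces together, the handler's request loop might be organised roughly as follows; the helper routines and the request structure are illustrative stand-ins for the real implementation, and only the REQST_* names come from the description above.

```c
/* Sketch of the request loop at the heart of the message handler. */
enum { REQST_CONFIRM, REQST_CONFIRM_BLOCK, REQST_SEND, REQST_SEND_IF_1ST,
       REQST_SEND_BLOCK, REQST_SEND_NOCONF, REQST_FINISH };

struct request { int kind; int type; };
struct msg_buffer;                                /* as sketched earlier */

struct request     get_request(void);             /* next request from the user process */
struct msg_buffer *find_message(int type);        /* search the message-buffer list     */
void               accept_incoming(void);         /* fill an empty buffer from the CSN  */
void               forward_to_user(struct msg_buffer *m);
void               recycle(struct msg_buffer *m); /* move buffer back to the empty list */
void               reply_to_user(int present);    /* answer to a confirm request        */

void message_handler(void)
{
    int running = 1;

    while (running) {
        struct request req = get_request();

        switch (req.kind) {
        case REQST_CONFIRM:                       /* is a message of this type queued?  */
            reply_to_user(find_message(req.type) != 0);
            break;
        case REQST_SEND_BLOCK: {                  /* wait for the message, then forward */
            struct msg_buffer *m;
            while ((m = find_message(req.type)) == 0)
                accept_incoming();
            forward_to_user(m);
            recycle(m);
            break;
        }
        case REQST_FINISH:                        /* shut the handler down              */
            running = 0;
            break;
        default:                                  /* remaining REQST_* cases are similar */
            break;
        }
    }
}
```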

Note that the above applies to both user messages (with user-specified message types) and tracing messages, etc.

7. Configuration and loading

Consider an application involving a master and four identical slaves. The corresponding '.par' file could be specified as follows:

    par
    processor 0 for 4 node msg handler
    ../node/slave
    processor 4 host msg handler
    ../host/master
    endpar

It is significant that the processor on which the host (master) resides appears last and that the node (slave) processors are indexed from 0, since these values are imported by each node program.

The above should also be taken into account when writing a custom loader using CS Build.

Fig. 4. The empty-buffer and message-buffer linked lists.

From Section 4, implementing PICL using CS Tools requires that the host and nodes are treated equivalently: all processes being loaded onto processors at once by a special loader. However, each node must still be 'loaded' by the host, which need now only involve the host sending the node a 'start' message (and also performing any other procedures associated with setting up tracing, etc.).

Note that it is not possible to run PICL appli- cations in which processors are reloaded.

8. Some results

The system is currently being tested in order to determine both the correctness and efficiency of the implementation. So far we have only been able to write small PICL programs and so can only report preliminary results. However, a number of programs of more significant size have been obtained from Oak Ridge National Laboratory and these have compiled and run successfully using our implementation.

8.1 The calculation of π

In this section, results are presented for a simple application which has been written using both PICL and CS Tools for comparison. The program has been run on up to 32 processors on a Meiko Computing Surface containing T800 transputers.

The application calculates the value of the constant π using numerical integration. The formula for generating π is as follows:

\pi = \int_0^1 \frac{4}{1 + x^2} \, dx.

The standard method of evaluating this is to divide the area represented by the integral into a number of evenly spaced rectangles as depicted in Fig. 5.

The value of the function at the midpoint of the strip is taken as the height of the rectangle. π is then calculated as the sum of the areas of the strips under the curve. The more strips used, the greater the accuracy of the calculation.
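Written out, with n strips of width 1/n and midpoints x_i, the sum being accumulated is (in our notation):

\pi \;\approx\; \frac{1}{n} \sum_{i=0}^{n-1} \frac{4}{1 + x_i^2},
\qquad x_i = \frac{i + \tfrac{1}{2}}{n}.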

The most straightforward way to parallelise this application is to have each processor calculate the areas of an equal number of the rectangular strips. When this is done, the partial values are passed back to a master processor which sums them to obtain the value of π.
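A node-side sketch of this scheme is given below; the who0 and send0 argument orders are assumptions about the PICL interface, and the strip count, function name and message type are our own illustrative choices.

```c
/* Node-side sketch of the parallel pi calculation; interface details are
 * assumptions and the constants are illustrative. */
#include "picl.h"                        /* assumed PICL header name */

#define NSTRIPS 10000000                 /* e.g. 10^7 strips, as in the timings reported below */

void pi_node(void)
{
    int    nprocs, me, host, i;
    double partial = 0.0, width;

    who0(&nprocs, &me, &host);
    width = 1.0 / NSTRIPS;

    for (i = me; i < NSTRIPS; i += nprocs) {     /* each node takes an equal share of strips */
        double x = (i + 0.5) * width;            /* midpoint of strip i                      */
        partial += 4.0 / (1.0 + x * x);
    }
    partial *= width;

    /* Every node returns its partial sum to the master, which adds them up;
       this is the bottleneck discussed below. */
    send0((char *)&partial, (int)sizeof partial, 20, host);
}
```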

Fig. 5. Computing π by numerical integration (plot of 4/(1+x²) over the interval [0, 1]).

Because all the values are passed back to the master simultaneously, this stage is a bottleneck in the program as it is currently written. However, the same bottleneck is present in both the PICL and CS Tools versions of the program and so it can still be used as a basis for comparison.

This method of calculating π is typical of a broad class of parallel programs. It is useful as a test program since the size of the computation can be varied by increasing or decreasing the number of strips used.

Figure 6 shows a plot of the execution times for the calculation against the number of processors when using 10^7 strips.

The plot shows that for this calculation the CS Tools and PICL versions of the program perform in an almost identical manner, with the CS Tools version being a few seconds quicker when the number of processors is increased. This is to be expected since in the PICL program each message exchanged is passed through the message handler and hence incurs an overhead.

8.2 Cost of sending a message

The extent of this overhead can be measured by calculating the time taken to pass messages of various sizes between processors using both CS Tools and PICL. The following table shows the time taken to send a 1 kilobyte message from one processor to another and back using CS Tools and PICL.

    CS Tools    PICL
    1920 ms     19008 ms
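The kind of measurement loop behind these figures might look as follows; clock0() is PICL's timer as we understand it, and the argument orders and averaging over repetitions are our own illustrative choices.

```c
/* Sketch of a round-trip ('ping-pong') timing between two nodes; interface
 * details are assumptions based on the PICL manual. */
#include <stdio.h>
#include "picl.h"                        /* assumed PICL header name */

#define MSG_BYTES 1024
#define NTRIPS    100

void ping_pong(int me, int peer)
{
    char buf[MSG_BYTES];
    int  i;

    if (me == 0) {                                   /* timing side */
        double t0 = clock0();
        for (i = 0; i < NTRIPS; i++) {
            send0(buf, MSG_BYTES, 1, peer);          /* out ...      */
            recv0(buf, MSG_BYTES, 2);                /* ... and back */
        }
        printf("mean round trip: %g ms\n",
               1000.0 * (clock0() - t0) / NTRIPS);
    } else {                                         /* echo side    */
        for (i = 0; i < NTRIPS; i++) {
            recv0(buf, MSG_BYTES, 1);
            send0(buf, MSG_BYTES, 2, 0);
        }
    }
}
```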

Fig. 6. Execution time for the calculation of π, plotted against the number of processors for the PICL and CS Tools versions.


As can be seen from the table it takes approxi- mately ten times longer to send a message of this size using PICL rather than CS Tools.

An implementation of PICL built on top of a native communication library will never be as efficient as that underlying library. The extent of the overhead incurred will vary depending primarily on the similarity of the communication model used by PICL and the underlying library. For machines with similar models, such as the iPSC/2, there is very little overhead incurred by using the PICL routines [1]. However, for our implementation on the Meiko Computing Surface the overhead incurred is quite significant.

We have shown that the system can be used to write portable parallel programs and generate tracing information for use with ParaGraph. However, at the moment, the usefulness of the implementation is limited by the heavy overhead incurred when passing messages between processors. This results from a basic difference in the underlying communication models used by PICL and CS Tools.

The implementation is currently being tested and it is hoped to use the feedback from this to improve performance in future versions of the system.

9. Future development

It may be possible to improve the efficiency of our implementation by modifying the code which deals with transferring messages from the message handler to the user program, to take advantage of the fact that both of these processes reside on the same processor. Thus, instead of using the CSN, it may be possible to allow both processes access to the same shared memory, employing semaphores to ensure mutual exclusion.
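Conceptually, local delivery through a semaphore-protected shared buffer might look as follows; POSIX primitives are used purely for illustration, since the actual facilities available under CS Tools on the Computing Surface would differ.

```c
/* Conceptual sketch of delivering a message through shared memory under a
 * semaphore, instead of over the CSN. POSIX semaphores are used only as an
 * illustration of the mutual-exclusion idea. */
#include <semaphore.h>
#include <string.h>

#define MAX_MSG_BYTES 4096

struct shared_slot {
    sem_t  empty;                 /* posted by the user process once it has drained the slot */
    sem_t  full;                  /* posted by the handler when a message is waiting         */
    int    type;
    int    length;
    char   data[MAX_MSG_BYTES];
};

/* Handler side: hand a message to the co-resident user process. */
void deliver_local(struct shared_slot *s, int type, const char *msg, int len)
{
    sem_wait(&s->empty);                          /* mutual exclusion on the slot */
    s->type = type;
    s->length = len;
    memcpy(s->data, msg, (size_t)len);
    sem_post(&s->full);                           /* wake the user process        */
}
```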

Currently, both the PICL nodes and the PICL host are placed on transputers. Further work is therefore required if the PICL host is to be situated on the Computing Surface's host processor.

Finally, a new version of the PICL library has been released, and the necessary changes will have to be carried out to our code.

10. Conclusion

This paper has described an approach to implementing a portable instrumented communications library on top of CS Tools. Our reasons for providing this implementation are twofold: firstly, to facilitate the portability of programs between the Meiko Computing Surface and other distributed memory parallel machines and secondly, to allow the generation of tracing information for use in program profiling and optimisation.

References

[1] G.A. Geist, M.T. Heath, B.W. Peyton and P.H. Worley, A Machine-independent communication library, in: Proc. Fourth Conf. on Hypercubes, Concurrent Computers, and Applications, J. Gustafson, ed., (Golden Gate Enterprises, Los Altos, CA, 1989) 565.

[2] M.T. Heath and J.A. Etheridge, Visualising the performance of parallel programs, IEEE Software 8(5) (1991) 29-39.

[3] Meiko Ltd., Computing Surface - CS Tools for SunOS, 1990.

[4] J. Cownie, private communication from Meiko Ltd., 1991.

[5] J. Flower and A. Kolawa, Parallel programming with Express, in: Surface Noise, J. Meiko Users Soc. (1992) 18-27.

Andrew Grant is currently employed as Visualisation and Parallel Systems Support Officer at Manchester Computing Centre, University of Manchester. His research interests include volume visualisation techniques for distributed and virtual shared memory parallel computers and tools for monitoring the performance of parallel machines. He received a Bachelor of Science degree in Mathematics and Computer Science from Oxford Polytechnic in 1986, and a Master of Science degree in Systems Design from the University of Manchester in 1987.

Robert Dickens received his B.Sc. in 1986 and Ph.D. in 1991 in Chemistry at the University of Manchester, having worked in the areas of pharmaceutical applications and parallel implementation of Computational Chemistry techniques. He continued working with parallel programming at the University's Computer Graphics Unit studying tools for parallel performance modelling, and is currently employed as a programmer at the Computer Services Centre, University of Reading.