
Sept 2011 1

COMP60611 Fundamentals of Parallel and Distributed

Systems

Lecture 5

Parallel Programming:

Languages – pthreads (and MPI)

John Gurd, Graham Riley

Centre for Novel Computing

School of Computer Science

University of Manchester

Sept 2011 2

Overview

– Parallel Programming Fundamentals

• Different levels of programming…

• Managing parallel work units and their interaction

– The Unix Model ( http://www.unix.org )

• Processes and threads – memory mapping

– Overview of pthreads - single address space (and MPI - multiple address spaces)

– Summary

Sept 2011 3

Extensions to C

– We need specific programming constructs to define parallel computations. We shall use (sequential) C as a starting point.

– In this lecture, we investigate extensions to C that allow the programmer to express parallel activity in the thread-based or data-sharing style and in the message passing style.

– We approach this from two main directions:

• Extensions that allow the programmer to create and manage threads explicitly and interact via shared memory.

• Extensions that allow the programmer to manage processes explicitly and exchange data via messages.

Sept 2011 4

Different Levels of Thread-based Programming

– None of these schemes is fully implicit (i.e. automatic); unfortunately, autoparallelisation of C (or any other serial) programs is beyond the present state-of-the-art. Instead, different schemes offer increasing amounts of high-level assistance for the creation and management of parallel threads.

– POSIX Parallel Threads Library

• 'Bare-metal' approach --- the programmer is responsible for everything except the POSIX call implementations.

– OpenMP API (a higher level alternative for threads)

• Much functionality is provided, e.g. at the loop level --- the programmer is presented with a simpler picture, but the scope for losing performance through naivety increases.

Sept 2011 5

How to Obtain Parallel Thread-based Activity

– The general approach to developing a parallel code is the same in each scheme. The basic idea is to create a number of parallel threads and then find (relatively) independent units of work for each of them to execute.

– Units of work come in two basic types which correspond to task- and data-parallelism:

• Functionally different subprograms, each executed once;

• Single subprogram, executed multiple times – with different data

– In general, these forms of parallelism can be nested.

– Each scheme relies on run-time support routines, provided as part of the operating system. It is important to know how memory (address space) is laid out at run-time. An example is given by the UNIX system, described on the next slide – this is similar to other operating systems.
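
As an aside, the first (task-parallel) form can be sketched directly in pthreads terms; the data-parallel form appears in the example a few slides below. This sketch is not from the original slides, and doTask1/doTask2 are hypothetical placeholder subprograms.

#include <pthread.h>
#include <stdio.h>

/* Two functionally different units of work (task parallelism). */
void *doTask1(void *arg) { printf("task 1\n"); return NULL; }
void *doTask2(void *arg) { printf("task 2\n"); return NULL; }

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, doTask1, NULL);   /* each subprogram executed once */
    pthread_create(&t2, NULL, doTask2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}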

Sept 2011 6

The UNIX Model: Processes and Threads

• There are two basic units:

– A Process, the basic unit of resource.

– A thread, the basic unit of execution.

– The simplest process is one having a single thread of execution.

• This corresponds well to our programming models. Code is shared by all threads in a process. The general situation is illustrated in the following slide.

– (Note: the terminology used in other operating systems is dangerously ambiguous.)
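
A minimal sketch (not from the slides) of the distinction: a thread created with pthread_create runs in the same address space as its creator and so sees updates to a global variable, whereas a child process created with fork gets its own copy of that variable.

#include <pthread.h>
#include <unistd.h>
#include <sys/wait.h>
#include <stdio.h>

int shared = 0;   /* lives in the process's data segment */

void *bump(void *arg) { shared = 42; return NULL; }

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, bump, NULL);
    pthread_join(t, NULL);
    printf("after thread: shared = %d\n", shared);   /* 42: same address space */

    shared = 0;
    if (fork() == 0) { shared = 42; _exit(0); }      /* child writes its own copy */
    wait(NULL);
    printf("after fork:   shared = %d\n", shared);   /* still 0 in the parent */
    return 0;
}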

Sept 2011 7

Memory Map for UNIX Processes and Threads

[Figure: memory map of a UNIX process – OS segments, code segment, process-shared and thread-shared data, a master stack, and one stack per thread, with a program counter (PC) for each thread.]

Sept 2011 8

POSIX Threads and an example…

• An IEEE standard for UNIX(-like) systems (defined for C)

– Standard ‘wrappers’ exist to support use from FORTRAN and other languages

• A set of library routines (and a run-time system) to manage the explicit creation and destruction of threads, and to manage their interaction

• Essentially, a pthread executes a user-defined function

– Scheduling of work to threads is down to the user

• Calls to pthread synchronisation routines manage the interaction of threads via shared data

• OpenMP implementations can be built on top of pthreads

– with details hidden from user

• See: https://computing.llnl.gov/tutorials/pthreads

– Good overview; starts with a description of the relationship between processes and threads

Sept 2011 9

Pthreads – simple example

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>    /* for exit() */
#define NUM_THREADS 5

void *PrintHello(void *threadid);    /* defined on the next slide */

int main (int argc, char *argv[])
{
    pthread_t threads[NUM_THREADS];
    int rc;
    long t;

    for (t=0; t<NUM_THREADS; t++) {
        printf ("In main: creating thread %ld\n", t);
        rc = pthread_create (&threads[t], NULL, PrintHello, (void *)t);
        if (rc) {
            printf ("ERROR; return code from pthread_create() is %d\n", rc);
            exit(-1);
        }
    }
    pthread_exit(NULL);
}

Sept 2011 10

void *PrintHello(void *threadid)
{
    long tid;
    tid = (long)threadid;
    printf("Hello World! It's me, thread #%ld!\n", tid);
    pthread_exit(NULL);
}

Sept 2011 11

Synchronisation mechanisms

• Locks and condition variables in pthreads

– Example calls are shown on the following slides

• Semaphores (provided by the OS)

– An early (Dijkstra, 1965) mechanism to control resource sharing in concurrent systems (providing mutual exclusion).

– Pthread synchronisation routines are frequently implemented by OS-supported semaphores. For example, a binary semaphore is a lock.

– More on these later…

Sept 2011 12

Pthread locks

pthread_mutex_t mutex1 = PTHREAD_MUTEX_INITIALIZER;
int counter = 0;              /* shared counter protected by mutex1 */

void *func()
{
    pthread_mutex_lock( &mutex1 );
    counter++;
    printf("Counter value: %d\n", counter);
    pthread_mutex_unlock( &mutex1 );
    return NULL;
}

Note: using semaphores, typical routines are:

semaphore_wait(s), equivalent to pthread_mutex_lock()

semaphore_signal(s), equivalent to pthread_mutex_unlock()

where s is a semaphore initialised to 1 (a binary semaphore)
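
For comparison, a minimal sketch (not from the slides) of the same critical section written with a POSIX binary semaphore; sem_wait and sem_post are the POSIX names corresponding to semaphore_wait(s) and semaphore_signal(s) above.

#include <semaphore.h>
#include <stdio.h>

sem_t s;            /* binary semaphore: initialise once with sem_init(&s, 0, 1) */
int counter = 0;

void *func(void *arg)
{
    sem_wait(&s);   /* P(s): plays the role of pthread_mutex_lock(&mutex1)   */
    counter++;
    printf("Counter value: %d\n", counter);
    sem_post(&s);   /* V(s): plays the role of pthread_mutex_unlock(&mutex1) */
    return NULL;
}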

Sept 2011 13

Pthread condition synchronisation

• Condition variables provide a “wait-notify” mechanism for threads.

– For example, consider a producer-consumer model where one producer thread is served by one (or more) consumer threads. A consumer must check the shared state to see if there is anything for it to do. If there is nothing, the thread must wait until there is work. When the producer has work to be done, it updates the shared state and must then notify any consumers waiting.

• Condition synchronisation involves a condition lock and a condition variable.

• The combination of some shared data and some synchronisation mechanism (locks and condition variables, for example) is called a monitor.

– More in later lectures on monitors.

Sept 2011 14

We say that a thread enters a monitor when it acquires the mutual exclusion lock associated with the monitor, and exits the monitor when it releases the lock.

wait – causes the thread to exit the monitor (release the lock), permitting other threads to enter the monitor.

notify – causes one of any waiting threads to be allowed to enter the monitor.

[Figure: Thread A calls wait and Thread B calls notify on a monitor protecting shared data.]

Sept 2011 15

• A consumer thread takes the condition lock before examining the relevant shared state.

– If there is work, the thread can modify the state (i.e. ‘get’ the work) and then release the lock.

– If there is no work, the thread calls pthread_cond_wait() with the names of the condition lock and condition variable. This routine releases the condition lock and ensures that the thread waits (inside the wait routine) until it is woken up by a notification from the producer.

• When the producer thread has work to be done, it takes the condition lock in order to modify the shared state (to make the work available to consumers).

– The producer then calls pthread_cond_broadcast, with the name of the condition variable, to wake up all waiting threads, and then releases the condition lock.

– If the producer knows that only a single thread is waiting it may call pthread_cond_signal, which wakes only a single thread.

Sept 2011 16

• On being awakened, a consumer thread (which is still inside the pthread_cond_wait routine) will immediately try to take the condition lock so that, on return from pthread_cond_wait, it owns the lock.

• The consumer thread should then re-check the shared state to confirm that there is some work to be done (why?).

– Typically, the consumer will call pthread_cond_wait from within a ‘while’ loop which checks the shared state. Thus, if there is no work, the consumer thread will simply call pthread_cond_wait again. If there is work, the consumer will modify the shared state to ‘get’ the work and then release the condition lock.

Sept 2011 17

An example

• Consider a producer thread which increments a shared counter (up to the value of 5) before passing on the work of further incrementing the shared counter (up to a value of 10) to a consumer thread.

• In the following C/pthreads example, functionCount2 is the producer and functionCount1 is the consumer.

Sept 2011 18

Pthread condition variables

#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>    /* for exit() */

#define COUNT_DONE  10
#define COUNT_HALT1 5

int count = 0;
pthread_mutex_t condition_mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  condition_cond  = PTHREAD_COND_INITIALIZER;

void *functionCount1(void *arg);   /* consumer - defined on a later slide */
void *functionCount2(void *arg);   /* producer - defined on the next slide */

int main()
{
    pthread_t thread1, thread2;

    pthread_create( &thread1, NULL, &functionCount1, NULL);
    pthread_create( &thread2, NULL, &functionCount2, NULL);
    pthread_join( thread1, NULL);
    pthread_join( thread2, NULL);
    exit(0);
}

Sept 2011 19

void *functionCount2(void *arg)    /* Producer */
{
    for (;;) {
        pthread_mutex_lock( &condition_mutex );
        if ( count < COUNT_HALT1 ) {
            count++;
            pthread_cond_signal( &condition_cond );
            /* OR pthread_cond_broadcast( &condition_cond ); */
            printf("Counter value functionCount2: %d\n", count);
        }
        pthread_mutex_unlock( &condition_mutex );

        if (count >= COUNT_DONE) return (NULL);
    }
}

Sept 2011 20

void *functionCount1(void *arg)    /* Consumer */
{
    for (;;) {
        pthread_mutex_lock( &condition_mutex );
        while ( count < COUNT_HALT1 ) {
            pthread_cond_wait( &condition_cond, &condition_mutex );
            printf("Counter value functionCount1 on waking: %d\n", count);
        }
        count++;
        printf("Counter value functionCount1: %d\n", count);
        pthread_mutex_unlock( &condition_mutex );

        if (count >= COUNT_DONE) return (NULL);
    }
}

Sept 2011 21

Behaviour

• The producer continually cycles taking the lock (entering the monitor) and increments the counter if it is less than 5.

• After each increment, the producer signals a waiting thread (the consumer) before exiting the monitor by releasing the lock.

• Either the producer or consumer will enter the monitor next.

– If it is the producer, it carries on as before.

• Note that once it has done its work, it simply releases the lock.

– If it is the consumer, and it isn’t time for it to take over (i.e. the count is not 5), it simply announces the value it saw, and goes back to waiting.

• Eventually the consumer takes over incrementing the counter until it reaches the final value, at which point both it and the producer thread terminate.

• The example code is in the directory CondVar in the teaching domain directory ~griley/COMP60611/labs/extras

Sept 2011 22

Implicit Thread Control: OpenMP

– OpenMP is the de-facto standard API for writing shared memory parallel applications in C, C++, and Fortran.

– OpenMP consists of:

• Compiler directives,

• Run time routines,

• Environment variables.

– Specification maintained by the OpenMP Architecture Review Board (http://www.openmp.org).

– Version 3.0 was released May 2008.
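
A minimal sketch (not from the slides) showing all three components: a compiler directive creates the parallel region, run-time routines report the thread number and team size, and the environment variable OMP_NUM_THREADS chooses how many threads run it.

#include <omp.h>
#include <stdio.h>

int main(void)
{
    /* Directive: fork a team of threads for the following block. */
    #pragma omp parallel
    {
        /* Run-time routines: query my id and the team size. */
        printf("Hello from thread %d of %d\n",
               omp_get_thread_num(), omp_get_num_threads());
    }
    return 0;
}

/* Compile with, e.g., gcc -fopenmp hello.c; set OMP_NUM_THREADS=4 to run with 4 threads. */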

Sept 2011 23

OpenMP Overview: How do threads interact?

– OpenMP is a multi-threading, shared address model.

• Threads communicate by sharing variables.

– Unintended sharing of data causes race conditions:

• race condition – the program’s outcome changes as the threads are scheduled differently.

– To control race conditions:

• Use synchronization to protect data conflicts.

– But synchronization can be expensive
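
For instance (a sketch, not part of the slides), unsynchronised updates to a shared counter are a race; an OpenMP critical directive protects the conflict, at the cost of serialising the update:

#include <omp.h>
#include <stdio.h>

int main(void)
{
    int count = 0;

    #pragma omp parallel for
    for (int i = 0; i < 1000000; i++) {
        /* count++;  -- unprotected: a race, the outcome depends on scheduling */
        #pragma omp critical
        count++;     /* protected: one thread at a time, correct but slower */
    }

    printf("count = %d\n", count);
    return 0;
}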

Sept 2011 24

Advantages of OpenMP

– Good performance and scalability.

• if you do it right ....

– De-facto and mature standard.

– An OpenMP program is portable.

• Supported by a large number of compilers.

– Simplifies the writing of multi-threaded programs in Fortran, C and C++.

– Allows the program to be parallelized incrementally.

– OpenMP is well suited for multicore architectures.

Sept 2011 25

Summary of Threads approach

– Programming multiple threads with explicit accesses to shared data requires attention to detail and much low level control.

– This can be alleviated by providing the programmer with a high-level data-sharing model and leaving low-level problems to the implementation. Higher level abstractions make programming increasingly easier, but they provide more opportunity for performance to be lost, due to unforeseen actions by the compiler or run-time system.

– Experience shows that it is somewhat easier to program using threads, compared to other approaches we shall study, although it is still non-trivial.

Sept 2011 26

Overview of MPI

• Processes versus threads

– Shared and distributed memory systems

• Process-based Programming Fundamentals

– Managing Processes

– Passing Messages

– The Message-Passing Interface (MPI)

• See MPI forum on web

• Message-Passing Overheads

• Summary

Sept 2011 27

Parallel Computing with Multiple Processes

• For anyone familiar with concurrent execution of processes under a conventional uni-processor operating system, such as Unix, the notion of parallel computing with multiple (single-thread) processes is quite natural.

• Each process is essentially a stand-alone sequential program, with some form of interprocess communication mechanism provided to permit controlled interaction with other processes.

• The simplest form of interprocess communication mechanism is via input and output files. However, this does not allow very 'rich' forms of interaction.

• Hence, more complex varieties of message-passing have evolved, e.g.:

– UNIX pipes, sockets, MPI, Java RMI…

Sept 2011 28

Message-Passing

• The process-based approach to parallel programming is the main alternative to the thread-based approach. Again, we use (sequential) C as a starting point.

• We will look at an extension to C which allows the programmer to express parallel activity in the message-passing style.

– Extensions that allow the programmer to send and receive messages explicitly (to exchange program data and synchronise). We illustrate this using the Message-Passing Interface (MPI) standard library.

– MPI can also be used from FORTRAN and C++ (research versions for, for example, Java are available too).

Sept 2011 29

Why MPI?

• Shared memory computers tend to be limited in size (numbers of processors) and the cost of hardware to maintain cache coherency across an interconnect grows rapidly with system size. So the ‘biggest’ computers do not support shared memory.

• Distributed memory systems are relatively cheap to build. They are essentially multiple copies of ‘independent’ processors connected together. Interconnects for these are relatively simple and cheap (e.g. based on routers). For example:

– Networks of workstations, NoWs, using Ethernet or Myrinet

– Supercomputers with specialised router-based interconnects: HECToR, a Cray XT4 using fast SeaStar routers – more than 22,000 cores (5664 quad-core Opterons). Upgraded in 2010.

• Most of the Top100 computers in the world are DM systems and MPI is the de-facto standard for writing parallel programs for them (at least in the scientific world). See: www.top500.org.

Sept 2011 30

Managing Processes

• Remember: in our UNIX view, a process is a (virtual) address space with one or more threads (program counter plus stack). Processes are independent!

• A key requirement is to be able to create a new process and know its (unique) identity. With process identities known to one another in this way, it is feasible within any process to construct a message and direct it specifically to some other process.

• MPI has the concept of process ‘groups’ through communicators, e.g. MPI_COMM_WORLD.

• Finally, there needs to be a mechanism for causing a process to ‘die’ and allow the MPI ‘group’ to ‘die’.

Sept 2011 31

Passing Messages

• The fundamental requirements for passing a message between two processes are:

– The sending process knows how to direct a message to an appropriate receiving process.

– In MPI this is achieved explicitly by naming a process id or through the use of a communicator (naming a group of processes).

– There are several models of interacting and synchronising processes in MPI. We shall keep it simple and look only at basic sending and receiving:

• Where recvs block but sends do not (implying buffering of the data)

• MPI also supports synchronous (sends block until a receive is posted) and asynchronous sends and recvs (using polling)
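
As an illustration (a sketch, not from the slides) of the asynchronous style: MPI_Isend and MPI_Irecv return immediately, and the program later waits (MPI_Wait) or polls (MPI_Test) for completion. The fragment assumes MPI has been initialised and that rnbr and lnbr hold the ranks of two neighbouring processes.

float sbuf[300], rbuf[300];
MPI_Request sreq, rreq;
MPI_Status  status;

MPI_Isend(sbuf, 300, MPI_FLOAT, rnbr, 0, MPI_COMM_WORLD, &sreq);   /* returns at once */
MPI_Irecv(rbuf, 300, MPI_FLOAT, lnbr, 0, MPI_COMM_WORLD, &rreq);

/* ... useful computation can overlap the communication here ... */

MPI_Wait(&sreq, &status);   /* or poll with MPI_Test(&sreq, &flag, &status) */
MPI_Wait(&rreq, &status);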

Sept 2011 32

The MPI C Library

• Typical MPI scientific codes use processes that are identical, thus implementing the Single-Program-Multiple-Data (SPMD) scheme. For example:

mpiexec -n 4 a.out     # runs a single 4-process SPMD job

• MPI also supports the Multiple-Program-Multiple-Data (MPMD) scheme, in which different code is executed in each process:

mpiexec -n 3 a.out : -n 4 b.out : -n 6 c.out     # an MPMD job

• Chapter 8 in Foster's book is a good source for additional MPI information. See also, LLNL tutorials.

Sept 2011 33

MPI Fundamentals

• Processes are grouped together; they are numbered within a group using contiguous integers, starting from 0. Messages are passed using the send (MPI_SEND) and receive (MPI_RECV) library subroutine calls (many other forms exist!)

• A message send has the general form:

MPI_Send(sbuf,icount,itype,idest,itag,icomm,ierr)

– A send may block or not – depends on the MPI implementation’s use of buffering. (MPI_Ssend is a guaranteed blocking send)

– Programs should not assume buffering of sends! Can lead to deadlock (see later example).

• A message receive has the general form:

MPI_Recv(rbuf,icount,itype,isrce,itag,icomm,istat,ierr)

• The receiving process blocks until a message of the appropriate kind becomes available. The buffer starting at rbuf has to be guaranteed to be large enough to hold icount elements. The istat parameter shows how many elements actually arrived, where from, etc.

Sept 2011 34

MPI Fundamentals

• There are four other core library functions, illustrated in the following slides (which use C syntax).

• The following slides show the MPI basics plus the skeleton of an n-body MPI application in the SPMD style.

Sept 2011 35

#include "mpi.h"    /* include file of compile-time constants needed for MPI library calls */
#include <stdio.h>

/* main program */
int main (int argc, char *argv[])
{
    int ierr, np, myid;

    /* call to initialise this process - called once only per process */
    ierr = MPI_Init(&argc, &argv);

    /* find the number of processes */
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* find the id (number) of this process */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* print a "Hello world" message from this process */
    printf("I am %d of %d processes!\n", myid, np);

    /* shut down this process - last thing a process should do */
    MPI_Finalize();
    return 0;
}

Sept 2011 36

#include "mpi.h"    /* include file */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])    /* main program */
{
    int myid, np, ierr, lnbr, rnbr, i;
    float x[300], buff[300], forces[300];
    MPI_Status status;

    ierr = MPI_Init(&argc, &argv);    /* initialize */

    if (ierr != MPI_SUCCESS) {        /* check return code */
        fprintf(stderr, "MPI initialisation error\n");
        exit(1);
    }

Sept 2011 37

    MPI_Comm_size(MPI_COMM_WORLD, &np);      /* nprocs */
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);    /* my process id */

    lnbr = (myid+np-1)%np;                   /* id of left neighbour */
    rnbr = (myid+1)%np;                      /* id of right nbr */

    Initialize(x, buff, forces);

    for (i=0; i<np-1; i++) {                 /* circulate messages */
        /* Note: assumes sends do not block! What if they do? */
        MPI_Send(buff, 300, MPI_FLOAT, rnbr, 0, MPI_COMM_WORLD);
        MPI_Recv(buff, 300, MPI_FLOAT, lnbr, 0, MPI_COMM_WORLD, &status);
        update_forces(x, buff, forces);
    }

    Print_forces(myid, forces);              /* print result */

    MPI_Finalize();                          /* shutdown */
}
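
One answer to the question in the comment above (a sketch, not part of the original slides): if MPI_Send blocks because the implementation does not buffer, every process can sit in its send and the ring deadlocks. MPI_Sendrecv_replace pairs the send and receive on the single buffer and lets the library schedule the two transfers safely:

    for (i=0; i<np-1; i++) {                 /* circulate messages, deadlock-free */
        MPI_Sendrecv_replace(buff, 300, MPI_FLOAT,
                             rnbr, 0,        /* destination rank and send tag   */
                             lnbr, 0,        /* source rank and receive tag     */
                             MPI_COMM_WORLD, &status);
        update_forces(x, buff, forces);
    }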

Sept 2011 38

Other MPI Facilities

• The tag parameter is used to match up an input message with a specific expected kind. If the kind of message is immaterial, MPI_ANY_TAG will match with anything.

• There are also constructs for: global 'barrier' synchronisation (MPI_Barrier); transfer of data, including one-to-many 'broadcast' (MPI_Bcast) and 'scatter' (MPI_Scatter), and many-to-one 'gather' (MPI_Gather); and 'reduction' operators (MPI_Reduce and MPI_Allreduce).

• A reduction has the general form:

MPI_Reduce(src,result,icnt,ityp,op,iroot,icomm,ierr)

where op is the operator, ityp is the element type, and iroot is the number of the process that will receive the reduced result. All processes in the group receive the same result when MPI_Allreduce is used (a sketch follows below).

• There are many other features, but these are too numerous to be studied further here.
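
A minimal C sketch of the reduction form (not from the slides); compute_local_sum is a hypothetical helper standing in for each process's local work.

float local_sum = compute_local_sum();   /* hypothetical per-process partial result */
float total = 0.0f;

/* Combine every process's local_sum with MPI_SUM; process 0 receives the total. */
MPI_Reduce(&local_sum, &total, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);

/* MPI_Allreduce delivers the same total to every process instead:           */
/* MPI_Allreduce(&local_sum, &total, 1, MPI_FLOAT, MPI_SUM, MPI_COMM_WORLD); */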

Sept 2011 39

MPI – pros and cons

• MPI is the de-facto standard for programming large supercomputers because the current trend is only to build distributed memory machines.

• The vast majority of current DM machines are built out of multicore processors

– Mixed mode programming with MPI ‘outside’ and pthreads (or OpenMP) ‘inside’ is possible…

• MPI forces the programmer to face up to the distributed nature of machines – is this a good thing?

• MPI solutions tend to be more scalable than pthread (or OpenMP) solutions

• (OpenMP is somewhat easier to use…)

Sept 2011 40

Summary of MPI

• Process-based programming using a library such as MPI for explicit passing of messages requires attention to detail and much low level description of activities.

• Ultimately, the same underlying problems of parallelism emerge, regardless of whether the shared memory (e.g. pthreads or OpenMP) or distributed memory (e.g. MPI) programming approach is used.

Sept 2011 41

Summary

• Pthreads exploit parallelism through multiple threads within a single process

– A single address space model

• MPI exploits parallelism between processes and supports the explicit exchange of messages between processes

– A multiple address space model

• Note that MPI and pthreads can be nested!

– Multiple (p)threads can execute inside each (MPI) process

– An approach that appears to match the hierarchical architecture of modern computers (i.e. multicore processors in a distributed memory machine, e.g. HECToR); a minimal mixed-mode sketch follows below
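
A minimal mixed-mode sketch (not from the slides): MPI_Init_thread requests a thread-support level and an OpenMP parallel region runs several threads inside each MPI process. The FUNNELED support level and the use of OpenMP (rather than raw pthreads) are illustrative choices.

#include "mpi.h"
#include <omp.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, myid;

    /* Ask MPI for thread support; FUNNELED = only the master thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &myid);

    /* Threads 'inside' each MPI process. */
    #pragma omp parallel
    printf("process %d, thread %d\n", myid, omp_get_thread_num());

    MPI_Finalize();
    return 0;
}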