Lecture 29: Parallel Programming Overview


Transcript of Lecture 29: Parallel Programming Overview

Page 1: Lecture 29:  Parallel Programming Overview

Lecture 29, Fall 2006

Lecture 29: Parallel Programming Overview

Page 2: Lecture 29:  Parallel Programming Overview


Parallelization in Everyday Life

Example 0: organizations consisting of many people. Each person acts sequentially, but all the people act in parallel.

Example 1: building a house (functional decomposition). Some tasks must be performed before others: dig the hole, pour the foundation, frame the walls, put on the roof, etc. Other tasks can be done in parallel: install the kitchen cabinets, lay the tile in the bathroom, etc.

Example 2: digging post holes (“data” parallel decomposition). If it takes one person an hour to dig a post hole, how long will it take 30 men to dig a post hole? How long would it take 30 men to dig 30 post holes? (A single hole cannot be divided among 30 diggers, but 30 holes can be dug simultaneously.)

Page 3: Lecture 29:  Parallel Programming Overview


Parallelization in Everyday Life

Example 3: car assembly line (pipelining)

[Pipeline diagram: successive cars move through the stations (chassis, motor, interior, body); at any given time different cars occupy different stations, so all stations work in parallel.]

Page 4: Lecture 29:  Parallel Programming Overview


Parallel Programming Paradigms: Various Methods

There are many methods of programming parallel computers. Two of the most common are message passing and data parallel.

Message Passing - the user makes calls to libraries to explicitly share information between processors.

Data Parallel - data partitioning determines the parallelism.

Shared Memory - multiple processes sharing a common memory space.

Remote Memory Operation - a set of processes in which a process can access the memory of another process without its participation.

Threads - a single process having multiple (concurrent) execution paths.

Combined Models - composed of two or more of the above.

Note: these models are machine/architecture independent; any of the models can be implemented on any hardware, given appropriate operating system support. An effective implementation is one that closely matches its target hardware and provides the user with ease of programming.

Page 5: Lecture 29:  Parallel Programming Overview


Parallel Programming Paradigms: Message Passing

The message passing model is defined as:
- a set of processes using only local memory
- processes communicate by sending and receiving messages
- data transfer requires cooperative operations to be performed by each process (a send operation must have a matching receive)

Programming with message passing is done by linking with and making calls to libraries which manage the data exchange between processors. Message passing libraries are available for most modern programming languages.
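As an illustration of the cooperative send/matching-receive requirement, here is a minimal sketch in C using MPI (the library introduced two slides below); the message contents, tag, and process count are arbitrary choices for the example. Run with at least two processes, e.g. mpirun -np 2.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank;
    double value = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Process 0 owns the data in its local memory ... */
        value = 3.14159;
        /* ... and must explicitly send it; tag 0 is an arbitrary choice. */
        MPI_Send(&value, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Process 1 must post a matching receive for the transfer to occur. */
        MPI_Recv(&value, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %f\n", value);
    }

    MPI_Finalize();
    return 0;
}
```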

Page 6: Lecture 29:  Parallel Programming Overview


Parallel Programming Paradigms: Data Parallel

The data parallel model is defined as:
- each process works on a different part of the same data structure
- commonly a Single Program Multiple Data (SPMD) approach
- data is distributed across processors
- all message passing is done invisibly to the programmer
- commonly built "on top of" one of the common message passing libraries

Programming with the data parallel model is accomplished by writing a program with data parallel constructs and compiling it with a data parallel compiler.

The compiler converts the program into standard code and calls to a message passing library to distribute the data to all the processes.
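As a rough illustration of what such generated code looks like, here is a sketch in C of the per-process loop for a whole-array addition under a block distribution. The array length, the block rule, and the function name are assumptions made for the example, not the output of any particular data parallel compiler.

```c
/* Sketch: the programmer writes a single whole-array statement such as
   C = A + B; the generated code on each process loops only over the
   contiguous block of the global index space that the process owns. */

#define N 1000000                /* global array length (example value) */

void local_array_add(const double *a, const double *b, double *c,
                     int my_rank, int num_procs)
{
    /* Block distribution: ceiling(N / num_procs) elements per process;
       the last process may own a shorter block. */
    int block = (N + num_procs - 1) / num_procs;
    int lo = my_rank * block;
    int hi = (lo + block < N) ? lo + block : N;

    /* a, b, c hold only this process's block, indexed locally from 0. */
    for (int i = 0; i < hi - lo; i++)
        c[i] = a[i] + b[i];
}
```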

Page 7: Lecture 29:  Parallel Programming Overview


Implementation of Message Passing: MPI

Message Passing Interface, often called MPI:
- A standard, portable message-passing library definition developed in 1993 by a group of parallel computer vendors, software writers, and application scientists.
- Available to both Fortran and C programs.
- Available on a wide variety of parallel machines.
- The target platform is a distributed memory system.
- All inter-task communication is by message passing.
- All parallelism is explicit: the programmer is responsible for identifying the parallelism in the program and implementing it with MPI constructs.
- The programming model is SPMD (Single Program Multiple Data).
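A minimal SPMD sketch in C: every process runs the same program, each computes a rank-dependent share of the work, and an explicit MPI call combines the results. The particular computation (a partial sum of 1..100) is just an assumption for illustration.

```c
#include <mpi.h>
#include <stdio.h>

/* Every process executes this same program (SPMD); behavior differs by rank. */
int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank computes an explicit, rank-dependent share of the work:
       here, a partial sum of the integers 1..100 (an arbitrary example). */
    long local = 0, total = 0;
    for (long i = rank + 1; i <= 100; i += size)
        local += i;

    /* Communication is explicit as well: combine the partial sums on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum of 1..100 = %ld\n", total);  /* prints 5050 */

    MPI_Finalize();
    return 0;
}
```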

Page 8: Lecture 29:  Parallel Programming Overview


Implementations: F90 / High Performance Fortran (HPF)

Fortran 90 (F90) - (ISO / ANSI standard extensions to Fortran 77).

High Performance Fortran (HPF) - extensions to F90 to support data parallel programming.

Compiler directives allow programmer specification of data distribution and alignment.

New compiler constructs and intrinsics allow the programmer to do computations and manipulations on data with different distributions.

Page 9: Lecture 29:  Parallel Programming Overview


Steps for Creating a Parallel Program

1. If you are starting with an existing serial program, debug the serial code completely.

2. Identify the parts of the program that can be executed concurrently:
- Requires a thorough understanding of the algorithm.
- Exploit any inherent parallelism which may exist.
- May require restructuring of the program and/or algorithm. May require an entirely new algorithm.

3. Decompose the program:
- Functional parallelism
- Data parallelism
- A combination of both

4. Code development:
- Code may be influenced/determined by the machine architecture.
- Choose a programming paradigm.
- Determine the communication.
- Add code to accomplish task control and communications.

5. Compile, test, debug.

6. Optimization:
- Measure performance.
- Locate problem areas.
- Improve them.

Page 10: Lecture 29:  Parallel Programming Overview


Recall Amdahl’s Law

Speedup due to enhancement E is:

Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)

Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected:

ExTime w/ E = ExTime w/o E * ((1-F) + F/S)

Speedup w/ E = 1 / ((1-F) + F/S)

Page 11: Lecture 29:  Parallel Programming Overview


Examples: Amdahl’s Law

Amdahl’s Law tells us that to achieve linear speedup with 100 processors (i.e., a speedup of 100), none of the original computation can be scalar!

To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less

What speedup could we achieve from 100 processors if 30% of the original program is scalar?

Speedup w/ E = 1 / ((1-F) + F/S)

             = 1 / (0.3 + 0.7/100)

             ≈ 3.3

The serial program/algorithm might need to be restructured to allow for efficient parallelization.
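A worked check of both numbers above, using Speedup w/ E = 1 / ((1-F) + F/S) with S = 100 and writing s = 1 - F for the scalar fraction:

```latex
% Speedup of 99 on 100 processors: solve for the scalar fraction s.
\[
  99 = \frac{1}{s + \tfrac{1-s}{100}}
  \quad\Longrightarrow\quad
  s = \frac{\tfrac{1}{99} - \tfrac{1}{100}}{1 - \tfrac{1}{100}}
    = \frac{1}{9801} \approx 0.0001 = 0.01\%
\]
% 30% scalar (s = 0.3, F = 0.7) on 100 processors:
\[
  \text{Speedup} = \frac{1}{0.3 + 0.7/100} = \frac{1}{0.307} \approx 3.3
\]
```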

Page 12: Lecture 29:  Parallel Programming Overview


Decomposing the Program

There are three methods for decomposing a problem into smaller tasks to be performed in parallel: functional decomposition, domain decomposition, or a combination of both.

Functional Decomposition (Functional Parallelism)
- Decomposing the problem into different tasks which can be distributed to multiple processors for simultaneous execution.
- Good to use when there is no static structure or fixed determination of the number of calculations to be performed.

Domain Decomposition (Data Parallelism)
- Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution.
- Good to use for problems where:
  – the data is static (factoring and solving a large matrix, or finite difference calculations)
  – a dynamic data structure is tied to a single entity, and that entity can be subsetted (large multi-body problems)
  – the domain is fixed but the computation within various regions of the domain is dynamic (fluid vortices models)

There are many ways to decompose data into partitions to be distributed (see the sketch after this list):
- One-dimensional data distribution
  – Block distribution
  – Cyclic distribution
- Two-dimensional data distribution
  – Block-block distribution
  – Block-cyclic distribution
  – Cyclic-block distribution
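As a concrete illustration of the one-dimensional distributions, here is a small sketch in C of which global indices a given process owns under a block versus a cyclic distribution. The function name and the convention that the last block may be short are assumptions made for the example, not part of any particular library.

```c
#include <stdio.h>

/* Block distribution: process p owns one contiguous chunk of 0..n-1. */
static void block_range(int n, int nprocs, int p, int *lo, int *hi)
{
    int chunk = (n + nprocs - 1) / nprocs;      /* ceiling(n / nprocs) */
    *lo = p * chunk;
    *hi = (*lo + chunk < n) ? *lo + chunk : n;  /* last block may be short */
}

int main(void)
{
    int n = 10, nprocs = 3;

    for (int p = 0; p < nprocs; p++) {
        int lo, hi;
        block_range(n, nprocs, p, &lo, &hi);
        printf("block:  process %d owns indices [%d, %d)\n", p, lo, hi);

        /* Cyclic distribution: process p owns indices p, p+nprocs, p+2*nprocs, ... */
        printf("cyclic: process %d owns indices", p);
        for (int i = p; i < n; i += nprocs)
            printf(" %d", i);
        printf("\n");
    }
    return 0;
}
```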

Page 13: Lecture 29:  Parallel Programming Overview


Functional Decomposition of a Program

Decomposing the problem into different tasks which can be distributed to multiple processors for simultaneous execution.

Good to use when there is no static structure or fixed determination of the number of calculations to be performed.

Page 14: Lecture 29:  Parallel Programming Overview


Functional Decomposition of a Program

[Diagram: the same six subtasks shown first as a sequential program (Subtask 1 through Subtask 6 executed one after another) and then as a parallel program in which independent subtasks (e.g., Subtasks 2 and 3, and Subtasks 5 and 6) execute concurrently.]

Page 15: Lecture 29:  Parallel Programming Overview


Domain Decomposition (Data Parallelism)

Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution. There are many ways to decompose data into partitions to be distributed:

Page 16: Lecture 29:  Parallel Programming Overview


Domain Decomposition (Data Parallelism)

Partitioning the problem's data domain and distributing portions to multiple processors for simultaneous execution. There are many ways to decompose data into partitions to be distributed.

For example, the matrix multiplication A x B = C requires a dot-product calculation for each c(i,j) element: the ith row of A combined with the jth column of B.

[Diagram: A x B = C, with the ith row of A and the jth column of B highlighted as the inputs that produce element c(i,j) of C.]
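For reference, the dot product each process must compute for its assigned element, where n is the shared dimension of A and B:

```latex
\[
  c_{i,j} = \sum_{k=1}^{n} a_{i,k}\, b_{k,j}
\]
```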

Page 17: Lecture 29:  Parallel Programming Overview


Cannon's Matrix Multiplication

2-D block decompose A and B and arrange the submatrices on the process mesh so that process (i, j) initially holds A(i, (i+j) mod 4) and B((i+j) mod 4, j), shown here for a 4 x 4 mesh:

A00 B00   A01 B11   A02 B22   A03 B33
A11 B10   A12 B21   A13 B32   A10 B03
A22 B20   A23 B31   A20 B02   A21 B13
A33 B30   A30 B01   A31 B12   A32 B23

Slave algorithm:
- Receive the initial submatrices from the master
- Determine the IDs (ranks) of mesh neighbors
- Multiply the local submatrices
- For sqrt(# of processors) - 1 times do
  – Shift A submatrices west
  – Shift B submatrices north
  – Multiply the local submatrices
  end for
- Send the resulting submatrices back to the master
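A hedged sketch in C with MPI of the shift-and-multiply loop above, assuming b x b blocks of doubles already delivered in the skewed arrangement shown, a periodic sqrt(P) x sqrt(P) Cartesian communicator named mesh created elsewhere (e.g., with MPI_Cart_create), and with the master send/receive steps omitted. All function and variable names are illustrative, not from the lecture.

```c
#include <mpi.h>

/* Multiply-accumulate of b x b row-major blocks: C += A * B. */
static void local_multiply(int b, const double *A, const double *B, double *C)
{
    for (int i = 0; i < b; i++)
        for (int k = 0; k < b; k++)
            for (int j = 0; j < b; j++)
                C[i*b + j] += A[i*b + k] * B[k*b + j];
}

/* One process's share of Cannon's algorithm on a q x q periodic mesh.
   A_blk and B_blk hold this process's already-aligned submatrices;
   C_blk is assumed zero-initialized by the caller. */
void cannon_slave(MPI_Comm mesh, int q, int b,
                  double *A_blk, double *B_blk, double *C_blk)
{
    int west, east, north, south;

    /* Ranks of mesh neighbors: dimension 0 = rows (north/south),
       dimension 1 = columns (west/east). */
    MPI_Cart_shift(mesh, 0, 1, &north, &south);
    MPI_Cart_shift(mesh, 1, 1, &west, &east);

    local_multiply(b, A_blk, B_blk, C_blk);       /* first of q products */

    for (int step = 0; step < q - 1; step++) {
        /* Shift A one position west and B one position north, then multiply. */
        MPI_Sendrecv_replace(A_blk, b*b, MPI_DOUBLE, west, 0, east, 0,
                             mesh, MPI_STATUS_IGNORE);
        MPI_Sendrecv_replace(B_blk, b*b, MPI_DOUBLE, north, 1, south, 1,
                             mesh, MPI_STATUS_IGNORE);
        local_multiply(b, A_blk, B_blk, C_blk);
    }
}
```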