
Introduction to Parallel Algorithm Analysis

Jeff M. Phillips

October 2, 2011


Petri Nets

C. A. Petri [1962] introduced an analysis model for concurrent systems.

- Flow chart.
- Describes data flow and dependencies.
- Very low level (we want something more high-level).
- Reachability is EXPSPACE-hard, but decidable.


Critical Regions Problem

Edsger Dijkstra [1965]

- Mutex: "mutual exclusion" on a variable.
- Semaphores: lock/unlock access to (multiple) data.
- A semaphore is more general: it keeps a count; a mutex is binary.

Important, but lower-level details.


Amdahl’s and Gustafson’s Laws

Amdahl's Law: Gene Amdahl [1967]

- A small portion (fraction α) is non-parallelizable.
- This limits the maximum speed-up to S = 1/α.

Gustafson's Law: Gustafson and Barsis [1988]

- A small portion (fraction α) is non-parallelizable.
- P processors.
- Limits the maximum speed-up to S(P) = P − α(P − 1).

Speed-up is measured as S = Tseq / Tpar; with P processors, S(P) = Tseq / Tpar(P).
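To make the two bounds concrete, here is a minimal Python sketch that evaluates both formulas; the function names and the values α = 0.05, P = 16 are illustrative, not from the slides.

```python
# Minimal sketch of the two speed-up bounds above; alpha = 0.05 and P = 16
# are illustrative values, not from the slides.

def amdahl_limit(alpha):
    """Maximum possible speed-up with unlimited processors: S = 1/alpha."""
    return 1.0 / alpha

def gustafson_speedup(P, alpha):
    """Scaled speed-up with P processors: S(P) = P - alpha*(P - 1)."""
    return P - alpha * (P - 1)

alpha, P = 0.05, 16
print(amdahl_limit(alpha))          # 20.0  : no speed-up beyond 1/alpha
print(gustafson_speedup(P, alpha))  # 15.25 : 16 - 0.05 * 15
```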


Logical Clocks

Leslie Lamport [1978]

- Posed parallel problems as a finite state machine.
- Preserved (only) a partial order: "happens before" (mutex).

Highlights nuances and difficulties in clock synchronization.
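A minimal sketch of a Lamport logical clock in Python, to show how the "happens before" order is preserved; the class and method names are illustrative, not from the paper or the slides.

```python
# Minimal sketch of a Lamport logical clock (illustrative). Each process keeps
# a counter that respects "happens before": local events increment it, and a
# receive takes max(local, sender) + 1.

class LamportClock:
    def __init__(self):
        self.time = 0

    def local_event(self):
        self.time += 1
        return self.time

    def send(self):
        # A send is a local event; the timestamp travels with the message.
        return self.local_event()

    def receive(self, sender_time):
        self.time = max(self.time, sender_time) + 1
        return self.time

# Usage: p's send "happens before" q's receive, so q's timestamp is larger.
p, q = LamportClock(), LamportClock()
t = p.send()     # p.time == 1
q.receive(t)     # q.time == 2 > 1, consistent with the partial order
```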


DAG Model

Directed Acyclic Graph:

- Each node represents a chunk of computation that is to be done on a single processor.
- A directed edge indicates that the "from" node must be completed before the "to" node.
- The longest path in the DAG gives the total parallel time of the algorithm.
- The width of the DAG indicates the number of processors that can be used at once.

[Figure: reduction-tree DAG for A1+A2+A3+A4: leaves A1, A2, A3, A4; internal nodes A1+A2 and A3+A4; root A1+A2+A3+A4.]
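The figure's reduction tree can be written as a level-by-level loop. The following Python sketch (illustrative, not from the slides) makes the two DAG quantities explicit: the number of levels is the parallel depth, and nodes on the same level are independent, so the widest level bounds how many processors can be busy at once.

```python
# Minimal sketch of the reduction DAG from the figure: summing A1..A4 by
# levels. All pairs within one level could run on separate processors.

def tree_reduce(values):
    level = list(values)
    depth = 0
    while len(level) > 1:
        nxt = [level[i] + level[i + 1] for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:        # odd element carries over to the next level
            nxt.append(level[-1])
        level = nxt
        depth += 1
    return level[0], depth

total, depth = tree_reduce([1, 2, 3, 4])  # A1..A4
print(total, depth)                       # 10, 2  (longest path has 2 levels)
```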


PRAM Model

[Figure: p CPUs (CPU1 ... CPUp) attached to a shared RAM.]

Steve Fortune and James Wyllie [1978]: the "shared memory model".

- P processors which operate on shared data.
- For each processor, a read, write, or op (e.g. +, −, ×) takes constant time.
- CREW: concurrent read, exclusive write.
- CRCW: concurrent read, concurrent write.
- EREW: exclusive read, exclusive write.
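A minimal sketch (illustrative, not from the slides; the function name and example cells are made up) of what the three variants permit: checking a single synchronous PRAM step for read and write conflicts.

```python
# Minimal sketch of one synchronous PRAM step over shared memory, checking the
# access rules of the three variants: EREW forbids any concurrent access to a
# cell, CREW forbids only concurrent writes, and CRCW allows both.

from collections import Counter

def check_step(reads, writes, variant):
    """reads/writes: lists of memory-cell indices touched this step."""
    read_conflict = any(c > 1 for c in Counter(reads).values())
    write_conflict = any(c > 1 for c in Counter(writes).values())
    if variant == "EREW":
        return not (read_conflict or write_conflict)
    if variant == "CREW":
        return not write_conflict
    if variant == "CRCW":
        return True
    raise ValueError(variant)

# Two processors read cell 0 and write distinct cells:
# legal under CREW and CRCW, illegal under EREW.
print(check_step(reads=[0, 0], writes=[1, 2], variant="EREW"))  # False
print(check_step(reads=[0, 0], writes=[1, 2], variant="CREW"))  # True
```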



Message Passing Model

Emphasizes Locality

- send(X, i): sends X to P_i.
- receive(Y, j): receives Y from P_j.
- Fixed topology; a processor can only send to / receive from a neighbor.

Common topologies:

- Array/Ring topology (Ω(p) rounds)
- Mesh topology (Ω(√p) rounds)
- Hypercube topology (Ω(log p) rounds)

[Figure: 4-dimensional hypercube topology; the 16 nodes are labeled 0000 through 1111, and nodes differing in one bit are neighbors.]
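The round bounds can be seen by flooding one message from processor 0 and counting synchronous rounds until everyone has it. This Python sketch (illustrative; the helper names are not from the slides) does so for the ring and for the 4-dimensional hypercube in the figure.

```python
# Minimal sketch: count broadcast rounds under different topologies.
# Neighbors are defined by the topology; hypercube neighbors differ in one bit.

def broadcast_rounds(p, neighbors):
    have = {0}
    rounds = 0
    while len(have) < p:
        have |= {m for u in have for m in neighbors(u)}
        rounds += 1
    return rounds

def ring(p):
    return lambda u: [(u - 1) % p, (u + 1) % p]

def hypercube(dim):
    return lambda u: [u ^ (1 << b) for b in range(dim)]

print(broadcast_rounds(16, ring(16)))      # 8 rounds  -> Θ(p) behaviour
print(broadcast_rounds(16, hypercube(4)))  # 4 rounds  -> Θ(log p) behaviour
```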


Programming in MPI

Open MPI:

- Open Source High Performance Computing.
- http://www.open-mpi.org/

When to use MPI?

- When it is critical to exploit locality (e.g. scientific simulations).
- Complication: you can only talk to a neighbor.
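A minimal send/receive sketch in the message-passing style. The slides point at Open MPI's C interface; this sketch instead assumes the mpi4py Python bindings on top of an installed MPI implementation, which is an assumption, not something the slides specify.

```python
# Minimal sketch assuming mpi4py is installed (not part of the slides, which
# reference Open MPI's C API). Run with:  mpiexec -n 2 python ring.py

from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

if rank == 0:
    X = {"payload": 42}
    comm.send(X, dest=1)        # send(X, i): send X to P_1
    print("P0 sent", X)
elif rank == 1:
    Y = comm.recv(source=0)     # receive(Y, j): receive Y from P_0
    print("P1 received", Y)
```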


Bulk Synchronous Parallel

Les Valiant [1989]: BSP creates "barriers" in a parallel algorithm.

1. Each processor computes on its data.
2. Processors send/receive data.
3. Barrier: all processors wait for communication to end globally.

Allows for easy synchronization. Easier to analyze, since the model handles many messy synchronization details when it is emulated.

[Figure: several machines, each a CPU with its own RAM, exchanging messages between barriers.]
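A minimal sketch of the three-phase superstep structure (illustrative; the function names and the ring example are not from the slides). The processors are simulated sequentially, so the barrier is simply the boundary between loop iterations: messages sent in one superstep are only readable in the next.

```python
# Minimal BSP superstep sketch: compute locally, exchange messages, barrier.

def bsp_run(states, compute, supersteps):
    p = len(states)
    inboxes = [[] for _ in range(p)]
    for _ in range(supersteps):
        outgoing = []
        for i in range(p):                      # 1. local computation
            states[i], msgs = compute(i, states[i], inboxes[i])
            outgoing.extend(msgs)               # msgs: list of (dest, value)
        inboxes = [[] for _ in range(p)]        # 2. communication
        for dest, value in outgoing:
            inboxes[dest].append(value)
        # 3. barrier: delivered data is read only in the next superstep
    return states

# Example: each of 4 processors adds what it received and passes its old
# state to the next processor on a ring.
def compute(i, state, inbox):
    return state + sum(inbox), [((i + 1) % 4, state)]

print(bsp_run([1, 2, 3, 4], compute, supersteps=2))  # [5, 3, 5, 7]
```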


Bridging Model for Multi-Core Computing

Les Valiant [2010]: Multi-BSP. Many parameters:

- P: number of processors
- M: memory/cache size
- B: block size/cost
- L: synchronization costs

Argues: any portable and efficient parallel algorithm must take all of these parameters into account.

Advantages:

- Analyzes all levels of the architecture together.
- Like cache-oblivious algorithms, but not oblivious.

At depth d it uses the parameters ⋃_i (p_i, g_i, L_i, m_i):

- p_i: number of subcomponents (processors at the leaves)
- g_i: communication bandwidth (e.g. I/O cost)
- L_i: synchronization cost
- m_i: memory/cache size

Applications: matrix multiplication, fast Fourier transform, sorting.


Two types of programmers

1. Wants to optimize the heck out of everything and tune all parameters.

2. Wants to get something working, and is not willing to work too hard.


MapReduce

Each processor has a full hard drive; data items are <key, value> pairs. Parallelism proceeds in rounds:

- Map: assigns items to processors by key.
- Reduce: processes all items using value; usually combines many items with the same key.

Repeat Map+Reduce a constant number of times, often only one round.

- Optional post-processing step.

[Figure: items flow through a MAP phase to CPU+RAM workers, then through a REDUCE phase.]

Pro: robust (via duplication) and simple; can harness locality.
Con: somewhat restrictive model.
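A minimal word-count sketch in Python (illustrative, not from the slides) of one Map+Reduce round: map emits <key, value> pairs, items are grouped by key as if routed to the processor owning that key, and reduce combines the values per key.

```python
# Minimal word-count sketch of one Map+Reduce round.

from collections import defaultdict

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield word, 1                      # emit <key, value>

def shuffle(pairs):
    groups = defaultdict(list)                 # group by key, as if routed
    for key, value in pairs:                   # to the processor owning key
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["the quick brown fox", "the lazy dog", "the fox"]
print(reduce_phase(shuffle(map_phase(docs))))
# {'the': 3, 'quick': 1, 'brown': 1, 'fox': 2, 'lazy': 1, 'dog': 1}
```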


General Purpose GPU

Massive parallelism on your desktop, using the Graphics Processing Unit, which is designed for efficient video rasterizing. Each processor corresponds to a pixel p:

- depth buffer: D(p) = min_i ||x − w_i||
- color buffer: C(p) = Σ_i α_i χ_i
- ...

Pro: fine-grained, massive parallelism; cheap.
Con: somewhat restrictive model; small memory.
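A minimal Python sketch of the depth-buffer formula (illustrative; the per-pixel points and scene points are made-up values, not from the slides): each pixel independently takes a minimum over distances ||x − w_i||, exactly the kind of fine-grained per-pixel work a GPU thread would do. Here it is a plain loop.

```python
# Minimal sketch of the per-pixel depth-buffer computation: one minimum
# distance per pixel, with every pixel independent of the others.

import math

def depth_buffer(pixel_points, scene_points):
    """pixel_points: one point x per pixel p; scene_points: the w_i."""
    return [min(math.dist(x, w) for w in scene_points) for x in pixel_points]

pixels = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]   # one x per pixel p
scene  = [(0.0, 1.0), (2.0, 2.0)]               # the w_i
print(depth_buffer(pixels, scene))              # [1.0, 1.414..., 2.0]
```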


... and Beyond

Google Sawzall?

- Computes statistics on massive distributed data.
- Separates local computation from aggregation.

Processing in memory?

- The GPU has a large cost in transferring data.
- Can the same work be done directly in memory?

Massive, unorganized, distributed computing:

- BitTorrent (distributed hash tables)
- SETI@home
- sensor networks
