
Page 1: Title

A Standard for Shared Memory Parallel Programming

© 1997-98 Silicon Graphics Inc. All rights reserved.

Page 2: Using OpenMP

Using iterative worksharing constructs
Analysis of loop level parallelism
Reducing overhead in the loop level model
Domain decomposition
  – Writing scalable programs in OpenMP
Comparisons with message passing
Performance and scalability considerations

Page 3: DO Worksharing Directive

The DO directive:
  – C$OMP DO [clause[[,] clause]…]
  – C$OMP END DO [NOWAIT]

Clauses:
  – PRIVATE(list)
  – FIRSTPRIVATE(list)
  – LASTPRIVATE(list)
  – REDUCTION({op|intrinsic}: list)
  – ORDERED
  – SCHEDULE(type[, chunk])
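
For orientation, a minimal sketch combining several of these clauses on one worksharing loop (the arrays a and b, the scalar s, and the chunk size are hypothetical, not from the slides):

      s = 0.0
C$OMP PARALLEL SHARED(a, b, s, n) PRIVATE(tmp)
C$OMP DO SCHEDULE(STATIC, 8) REDUCTION(+: s)
      do i = 1, n
c        tmp is private to each thread; the partial sums in s are
c        combined across threads at the end of the loop
         tmp = a(i) * b(i)
         s   = s + tmp
      enddo
C$OMP END DO
C$OMP END PARALLEL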

Page 4: NOWAIT Clause

Without NOWAIT there is an implied BARRIER at the end of each DO:

C$OMP PARALLEL
C$OMP DO
      do i = 1, n
         a(i) = cos(a(i))
      enddo
C$OMP END DO
C$OMP DO
      do i = 1, n
         b(i) = a(i) + b(i)
      enddo
C$OMP END DO
C$OMP END PARALLEL

With NOWAIT there is no BARRIER after the first loop:

C$OMP PARALLEL
C$OMP DO
      do i = 1, n
         a(i) = cos(a(i))
      enddo
C$OMP END DO NOWAIT
C$OMP DO
      do i = 1, n
         b(i) = a(i) + b(i)
      enddo
C$OMP END DO
C$OMP END PARALLEL

By default the loop index is PRIVATE.

Page 5: LASTPRIVATE Clause

Useful when the loop index is live out (its value is needed after the loop).
  – Recall that with PRIVATE the loop index becomes undefined after the loop.

Sequential code (after the loop, i = N):

      do i = 1, N-1
         a(i) = b(i+1)
      enddo
      a(i) = b(0)

Parallel version:

C$OMP PARALLEL
C$OMP DO LASTPRIVATE(i)
      do i = 1, N-1
         a(i) = b(i+1)
      enddo
      a(i) = b(0)
C$OMP END PARALLEL

Page 6: Reductions

Sum reduction - assume no REDUCTION clause. Sequential code:

      do i = 1, N
         X = X + a(i)
      enddo

Wrong! (unsynchronized updates of the shared X):

C$OMP PARALLEL DO SHARED(X)
      do i = 1, N
         X = X + a(i)
      enddo
C$OMP END PARALLEL DO

What's wrong with this one?

C$OMP PARALLEL DO SHARED(X)
      do i = 1, N
C$OMP CRITICAL
         X = X + a(i)
C$OMP END CRITICAL
      enddo
C$OMP END PARALLEL DO

Page 7: REDUCTION Clause

Parallel reduction operators:
  – Most operators and intrinsics are supported
  – +, *, -, .AND., .OR., MAX, MIN
Only scalar variables are allowed.

Sequential code:

      do i = 1, N
         X = X + a(i)
      enddo

With the REDUCTION clause:

C$OMP PARALLEL DO REDUCTION(+: X)
      do i = 1, N
         X = X + a(i)
      enddo
C$OMP END PARALLEL DO

Page 8: ORDERED Clause

Executes in the same order as the sequential code; parallelizes cases where ordering is needed.

Sequential code:

      do i = 1, N
         call find(i, norm)
         print *, i, norm
      enddo

Parallel version:

C$OMP PARALLEL DO ORDERED PRIVATE(norm)
      do i = 1, N
         call find(i, norm)
C$OMP ORDERED
         print *, i, norm
C$OMP END ORDERED
      enddo
C$OMP END PARALLEL DO

Output (in sequential order):
  1 0.45
  2 0.86
  3 0.65

Page 9: SCHEDULE Clause

Controls how the iterations of the loop are assigned to threads:
  – STATIC: each thread is given a "chunk" of iterations in round-robin order
      » Least overhead - determined statically
  – DYNAMIC: each thread is given "chunk" iterations at a time; more chunks are handed out as threads finish
      » Good for load balancing
  – GUIDED: similar to DYNAMIC, but the chunk size is reduced exponentially
  – RUNTIME: the user chooses at run time via an environment variable
      » setenv OMP_SCHEDULE "dynamic, 4"
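
As a sketch (not from the slides - it reuses the find routine of the ORDERED example and a hypothetical bound N), a loop whose iterations vary in cost can ask for a dynamic schedule, or defer the choice to run time:

C$OMP PARALLEL DO SCHEDULE(DYNAMIC, 4) PRIVATE(norm)
      do i = 1, N
c        iterations of varying cost are handed out 4 at a time
         call find(i, norm)
      enddo
C$OMP END PARALLEL DO

C$OMP PARALLEL DO SCHEDULE(RUNTIME) PRIVATE(norm)
      do i = 1, N
c        the schedule is taken from OMP_SCHEDULE, e.g. "dynamic, 4"
         call find(i, norm)
      enddo
C$OMP END PARALLEL DO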

Page 10: Performance Impact of SCHEDULE

Static vs. dynamic across multiple do loops:
  – With STATIC, the same iterations of the do loop are executed by the same thread in both loops
  – If the data is small enough, it may still be in cache - good performance

Effect of chunk size:
  – A chunk size of 1 may result in multiple threads writing to the same cache line
  – Cache thrashing, bad performance

Slide example - two consecutive static loops over the array a:

  a(1,1)  a(1,2)
  a(2,1)  a(2,2)
  a(3,1)  a(3,2)
  a(4,1)  a(4,2)

C$OMP DO SCHEDULE(STATIC)
      do i = 1, 4
      ...

C$OMP DO SCHEDULE(STATIC)
      do i = 1, 4
      ...

Page 11: Loop Level Paradigm

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

      alpha = xnorm/sum

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

Execute each loop in parallel
Easy to parallelize code
Similar to automatic parallelization
  – The Automatic Parallelization Option (API) may be a good start
Incremental
  – One loop at a time
  – Doesn't break the code

Page 12: Performance

(Same loop level code as the previous slide.)

Fine grain overhead
  – A parallel region is started each time
  – Frequent synchronization
The fraction of non-parallel work will dominate
  – Amdahl's law
Limited scalability

Page 13: Reducing Overhead

Convert to a coarser grain model:
  – More work per parallel region
  – Reduce synchronization across threads

Combine multiple DO directives into a single parallel region:
  – Continue to use worksharing directives
      » The compiler does the work of distributing iterations
      » Less work for the user
  – Doesn't break the code

Page 14: Coarser Grain

Fine grain - two parallel regions:

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

Coarser grain - one parallel region, two worksharing loops:

C$OMP PARALLEL
C$OMP DO
      do i = 1, n
         ...
      enddo
C$OMP DO
      do i = 1, n
         ...
      enddo
C$OMP END PARALLEL

Page 15: Using Orphaned Directives

Loop level - two parallel regions, one of them inside MatMul:

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo
      call MatMul(y)

      subroutine MatMul(y)
C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

Coarse grain with an orphaned directive - the C$OMP DO inside MatMul binds to the parallel region of its caller:

C$OMP PARALLEL
C$OMP DO
      do i = 1, n
         ...
      enddo
      call MatMul(y)
C$OMP END PARALLEL

      subroutine MatMul(y)
C$OMP DO
      do i = 1, N
         ...
      enddo

Page 16: Statements Between Loops

Loop level version - the scalar statement runs serially between the two parallel loops:

C$OMP PARALLEL DO
C$OMP& REDUCTION(+: sum)
      do i = 1, n
         sum = sum + a(i)
      enddo

      alpha = sum/scale

C$OMP PARALLEL DO
      do i = 1, n
         a(i) = alpha * a(i)
      enddo

Single parallel region - the statement goes into a SINGLE section, executed by one thread while the others wait (possible load imbalance):

C$OMP PARALLEL
C$OMP DO REDUCTION(+: sum)
      do i = 1, n
         sum = sum + a(i)
      enddo
C$OMP SINGLE
      alpha = sum/scale
C$OMP END SINGLE
C$OMP DO
      do i = 1, n
         a(i) = alpha * a(i)
      enddo
C$OMP END PARALLEL

Page 17: Statements Between Loops (contd.)

The SINGLE version from the previous slide cannot use NOWAIT on the first loop: the reduction into sum must complete before alpha is computed.

An alternative is replicated execution - make alpha PRIVATE and let every thread compute it:

C$OMP PARALLEL PRIVATE(alpha)
C$OMP DO REDUCTION(+: sum)
      do i = 1, n
         sum = sum + a(i)
      enddo
      alpha = sum/scale
C$OMP DO
      do i = 1, n
         a(i) = alpha * a(i)
      enddo
C$OMP END PARALLEL

Page 18: Coarse Grain Worksharing

Reduced overhead by increasing the work per parallel region. But…
  – Worksharing constructs still need to compute loop bounds at each construct
  – Work between loops is not always parallelizable
  – Synchronization at the end of a directive is not always avoidable

Page 19: Domain Decomposition

We could have computed the loop bounds once and used them for all loops:
  – i.e., compute the loop decomposition a priori

Enter domain decomposition:
  – A more general approach to loop decomposition
  – Decompose the work into the number of threads
  – Each thread gets to work on a sub-domain

Page 20: Domain Decomposition (contd.)

Transfers the onus of decomposition from the compiler (worksharing) to the user.

Results in a coarse grain program:
  – Typically one parallel region for the whole program
  – Reduced overhead, good scalability

Domain decomposition results in a model of programming called SPMD.

Page 21: SPMD Programming

SPMD: Single Program Multiple Data

(Figure: one program and one domain; with n threads the domain is split into n sub-domains, and the same program runs on each sub-domain.)

Page 22: Implementing SPMD

Decomposition is done manually; each thread works on its piece:

      program work
      integer omp_get_num_threads, omp_get_thread_num
C$OMP PARALLEL DEFAULT(PRIVATE)
C$OMP& SHARED(N, global)
      nthreads = omp_get_num_threads()
      iam = omp_get_thread_num()
      ichunk = N/nthreads
      istart = iam*ichunk
      iend = (iam+1)*ichunk - 1
      call my_sum(istart, iend, local)
C$OMP ATOMIC
      global = global + local
C$OMP END PARALLEL
      print *, global
      end

Page 23: Implementing SPMD (contd.)

Decomposition is done manually:
  – Implement to run on any number of threads
      » Query for the number of threads
      » Find the thread number
      » Each thread calculates its portion (sub-domain) of the work

The program is replicated on each thread, but with different extents for the sub-domain:
  – All sub-domain specific data are PRIVATE

Page 24: Handling Global Variables

Global variables span the whole domain:
  – Field variables are usually shared
Synchronization is required to update shared variables:
  – ATOMIC
  – CRITICAL

C$OMP ATOMIC
      p(i) = p(i) + plocal

is usually better than

C$OMP CRITICAL
      p(i) = p(i) + plocal
C$OMP END CRITICAL

Page 25: Global Private to Thread

Use THREADPRIVATE for all other sub-domain data that need file scope or common blocks; the common block buf is then common within each thread but private to it:

      parameter (N=1000)
      real A(N,N)
      common /buf/ lft(N), rht(N)
C$OMP THREADPRIVATE(/buf/)
C$OMP PARALLEL
      call init
      call scale
      call order
C$OMP END PARALLEL

Page 26: THREADPRIVATE (contd.)

      subroutine scale
      parameter (N=1000)
      common /buf/ lft(N), rht(N)
C$OMP THREADPRIVATE(/buf/)
      do i = 1, N
         lft(i) = const * A(i, iam)
      end do
      return
      end

      subroutine order
      parameter (N=1000)
      common /buf/ lft(N), rht(N)
C$OMP THREADPRIVATE(/buf/)
      do i = 1, N
         A(i, iam) = lft(index)
      end do
      return
      end

Page 27: Clauses on Parallel Regions

DEFAULT(SHARED | PRIVATE | NONE)
  – The default is DEFAULT(SHARED)
  – DEFAULT(PRIVATE) makes all variables in the lexical extent PRIVATE
REDUCTION - similar to the clause on the DO directive
IF(expr) - controls whether to execute the region in parallel or not
COPYIN(list) - copies threadprivate variables in from the master thread
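
A minimal sketch of these clauses on one region (the routine do_subdomain, the array a, and the size threshold are hypothetical; /buf/ is the threadprivate common block introduced on the earlier slides):

      common /buf/ lft(1000), rht(1000)
C$OMP THREADPRIVATE(/buf/)
      sum = 0.0
C$OMP PARALLEL IF(n .gt. 1000) DEFAULT(PRIVATE)
C$OMP&         SHARED(n, a) REDUCTION(+: sum) COPYIN(/buf/)
c     for small n the IF clause keeps the region serial; otherwise
c     COPYIN initializes each thread's copy of /buf/ from the master
c     thread and the per-thread contributions are summed into sum
      call do_subdomain(n, a, sum)
C$OMP END PARALLEL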

Page 28: Comparisons with Message Passing

The domain decomposition algorithm is the same, but the implementation is easier:
  – Global data and field variables are shared - read from any thread; writes may need synchronization
  – No need for passing messages, no need for ghost cells or shadow buffers
Parallelize only the computationally intensive parts of the code, not the entire code:
  – Pre-processing and post-processing may be left alone

Page 29: Field Variables

Message passing model: all data is local to the process
  – Each process updates the field variables for its subdomain
  – A field variable is a private, subdomain-sized variable
  – Message passing makes the edges consistent with the neighbors

Exactly the same scheme could be used for the shared memory model as well - but it is not common
  – DEFAULT(PRIVATE) is useful for this model

Typical shared memory model: field variables are shared among threads
  – Each thread updates the field variables for its own subdomain
  – A field variable is a shared variable with the same size as the whole domain

Page 30: Message Passing vs. OpenMP

In OpenMP, the domain remains logically intact and accessible to all threads.

(Figure: the message passing version exchanges messages into ghost cells; the OpenMP version reads and writes the shared data with synchronization.)

Page 31: Example

Typical of many engineering codes:
  – FEM, CFD
Consider a 1-D example.

(Figure: the 1-D domain is split into subdomain 1 … subdomain 4; the thread "me" owns the range istart…iend, and its stencil also touches istart-1 and iend+1.)

Page 32: Example (contd.)

Stencil computation:

      b(i) = c1*a(i-1) + c2*a(i) + c3*a(i+1)

      subroutine update
      common /domain/ a, b
      common /subdomain/ istart, iend
C$OMP THREADPRIVATE(/subdomain/)
      do i = istart, iend
         a(i) = a(i) + alocal
      enddo
      do i = istart, iend
         b(i) = c1*a(i-1) + c2*a(i) + c3*a(i+1)
      enddo

Page 33: Example (contd.)

What happens when i = iend?

      b(iend) = c1*a(iend-1) + c2*a(iend) + c3*a(iend+1)

a(iend+1) is not in my subdomain, but a is shared: just read a(iend+1).
  – Make sure that a(iend+1) is ready - wait for a to be ready on all subdomains:

      do i = istart, iend
         a(i) = a(i) + alocal
      enddo
C$OMP BARRIER
      do i = istart, iend
         b(i) = c1*a(i-1) + c2*a(i) + c3*a(i+1)
      enddo

Page 34: Message Passing Equivalent

(Figure: subdomain 2 holds local cells 1…nsub plus ghost cells 0 and nsub+1; "me" is this process.)

      subroutine update
      common /subdomain/ nsub
      do i = 1, nsub
         a(i) = a(i) + alocal
      enddo
      call mpi_send(a(1), me-1)
      call mpi_recv(a(0), me-1)
      call mpi_send(a(nsub), me+1)
      call mpi_recv(a(nsub+1), me+1)
      do i = 1, nsub
         b(i) = c1*a(i-1) + c2*a(i) + c3*a(i+1)
      enddo

Page 35: Two Types of Scaling

Scale the problem size with the number of processors:
  – Users who want to solve bigger problems
      » 1 processor: solve a 32 x 32 grid
      » 4 processors: solve a 64 x 64 grid
  – The subdomain size remains constant

Problem size constant irrespective of the number of processors:
  – Users who want to reduce the time to solution
  – The subdomain size becomes smaller and smaller
  – NAS Parallel Benchmarks
  – Weather codes

Page 36: Scaling Issues

Scale the problem size with the number of processors:
  – Subdomain size per processor is constant
  – Memory characteristics are independent of the number of processors

Problem size constant irrespective of the number of processors:
  – The subdomain size becomes smaller and smaller
  – Data access and memory overhead change

Message passing and shared memory models run into different problems as the subdomain size decreases.

Page 37: Memory Considerations

Message passing implementations need to maintain ghost cells or shadow buffers:
  – 5-point stencil: a(i-2) + a(i-1) + a(i) + a(i+1) + a(i+2)
  – 2 layers of ghost cells on each side

Fixed problem size: 64 x 64 grid
  – On 1 processor: storage (2+64+2) x (2+64+2), ghost cell overhead = 13%
  – On 16 processors: storage on each processor (2+16+2) x (2+16+2), ghost cell overhead = 56%

Ghost cells may be avoided in the shared memory model.
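
The percentages follow directly from the storage counts above:

  1 processor:   (2+64+2)^2 = 68 x 68 = 4624 cells stored for 64 x 64 = 4096 interior cells,
                 overhead = 528/4096 = ~13%
  16 processors: (2+16+2)^2 = 20 x 20 =  400 cells stored for 16 x 16 =  256 interior cells,
                 overhead = 144/256  = ~56%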

Page 38: Cache Considerations

Typical shared memory model: field variables are shared among threads
  – Each thread updates the field variables for its own subdomain
  – A field variable is a shared variable with the same size as the whole domain
For fixed problem size scaling, the subdomain size decreases.

Fortran example from the slide: field variable P(20,20), divided among four threads into subdomains s1…s4; one cache line is 128 bytes on Origin (16 double words).
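
To make the figure concrete (an illustration assuming 8-byte elements and line-aligned storage, not numbers from the slide): each column of P holds 20 consecutive elements in Fortran's column-major layout, while a 128-byte line holds only 16 of them. The boundary between the subdomains that split a column - for example rows 1-10 owned by one thread and rows 11-20 by its neighbor - therefore always falls inside a cache line, so the two threads write the same line along that edge.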

Page 39: False Sharing

Two threads need to write the same cache line, so the cache line ping-pongs between the two processors:
  – A cache line is the smallest unit of transport
  – When thread A writes and then thread B writes, the line must move from processor A to processor B
Poor performance - must be avoided at all costs.

Page 40: Fixing False Sharing

Pay attention to granularity.
Diagnose by using SGI performance tools; on Origin2000, use the R10000 hardware performance counters:
  – SpeedShop: man speedshop
      setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 31
      (31: store prefetch exclusive to shared block in scache)
      ssrun -prof_hwc a.out
Pad false-shared arrays or make them private.
SGI's NUMA extensions may help:
  – c$sgi distribute_reshape p(block, block)
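
A minimal sketch of the padding fix (the names psum and MAXTHR are illustrative; PAD = 16 double words corresponds to the 128-byte Origin cache line mentioned earlier):

      integer MAXTHR, PAD
      parameter (MAXTHR = 16, PAD = 16)
      real*8 psum(PAD, MAXTHR)
c     thread iam accumulates only into psum(1, iam+1); the remaining
c     PAD-1 words of its column are never written, so each thread's
c     running total sits on its own cache line and no line ping-pongs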

Page 41: NUMA Directives

SGI's extensions for addressing NUMA issues; OpenMP allows performance extensions:
  – Performance extensions do not affect the correctness of programs - harmless if ignored
SGI extensions use the C$SGI prefix:
  – HPF-like data distribution
  – Granularity is a page: 16 KB on Origin

C$SGI distribute A(*,block)
  – Distributes the columns of A across threads
  – With two threads and A(102,102): A(*,1:51) on thread 0, A(*,52:102) on thread 1

Page 42: Distribute with Reshape

Used when data has less than page granularity; guarantees the desired memory distribution:

      real A(20, 20)
C$SGI distribute_reshape A(block, block)

The layout of the array is changed - it is no longer equivalent to the undistributed array:
  – The compiler will transform the code to roughly A(20/sqrt(np), 20/sqrt(np), np)
  – All program units must declare the same distribution

Page 43: Handling Irregular Data

Adaptive Mesh Refinement (AMR) or multigrid methods: data is not exclusively accessed by one thread.

(Figure: nested refinement patches at several levels, each split between threads t1 and t2.)

Page 44: Forcing Page Placement (SGI Extension)

Pages can be moved at runtime to the desired memory:

!$SGI page_place(<addr>, <size>, <threadnum>)

Places the specified virtual address range in the memory local to the given thread number.
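
For illustration only, a hypothetical use of the syntax above (the array name, sizes, and 4-byte reals are assumptions, not from the slide):

      real A(1000000)
c     place the first half of A in memory local to thread 0 and the
c     second half local to thread 1 (sizes are in bytes)
!$SGI page_place(A(1),      2000000, 0)
!$SGI page_place(A(500001), 2000000, 1)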

Page 45: Reducing Barriers

A common issue when converting from the message passing model:
  – One process sends data, another receives data
      » Usually the received data is filled into the ghost cells around the subdomain
      » The field variable is "ready" for use after the receive: the ghost cells are consistent with the neighboring subdomains
  – The receive is an implicit synchronization point
This synchronization is very often done using a barrier:
  – But BARRIER synchronizes all threads in the region
  – BARRIERs have high overhead at large processor counts
  – Synchronizing all threads makes load imbalance worse
Reduce BARRIERs to the minimum.

Page 46: Point-to-Point Synchronization

Often only producer-consumer type synchronization is needed:
  » One thread computes a result, another thread reads it - synchronize only those two

The producer thread:
  – Computes the field variables on its own subdomain
  – Sets a flag when the data is ready
  – In MPI this is the point at which MPI_send happens

The consumer thread:
  – Needs the field variables in the neighboring subdomain(s) to continue the simulation
  – Waits for the neighbor's flag to be set before using the data
  – In MPI this is the MPI_recv event

Page 47: Asynchronous Messages

Watch out for isend and irecv pairs:
  – Buffers can be modified after the wait
Conversion to the shared memory model:
  – Replacing the recv with a read of the data may not work if the data was modified between the wait and the recv
  – May need to move code around

      buf(i) = A(rborder(i))
      call MPI_isend(buf, iam+1, handle)
      A(i) = A(i)*c
      call MPI_wait(handle)
      call MPI_recv(buf, iam-1)
      A(lborder(i)) = A(lborder(i)) + buf(i)

Page 48: Producer-Consumer Synchronization

Producer thread:

!$OMP SINGLE
      READY = .FALSE.
!$OMP END SINGLE
      field(subdomain) = field(subdomain) + update
      READY = .TRUE.

Consumer thread:

!$OMP SINGLE
      READY = .FALSE.
!$OMP END SINGLE
      do while (.not. READY)
c        do useful work while waiting for the producer
      enddo
      field(i) = field(i-1) + field(i) + field(i+1)

Page 49: Memory Reordering

The previous code assumes the operations are performed in the strict order in which they appear. Most modern architectures reorder memory references:
  – Register allocation of variables: the update of a memory location is delayed until necessary
  – The compiler may reorder memory references for better cache locality
  – The hardware may cause memory transactions to appear in a different order
Reordering is necessary for obtaining high performance.
It is not an issue for most codes - only for synchronization through memory references.

Page 50: FLUSH Directive

OpenMP provides a directive for users to identify synchronization points in the program. The FLUSH directive provides a memory fence and makes shared memory consistent across all threads:
  – All updates to shared variables that happened before the fence are committed to memory
  – All references to shared variables after the fence are fetched from memory
It does not apply to private variables.

Page 51: Using the FLUSH Directive

Producer thread:

!$OMP SINGLE
      READY = .FALSE.
!$OMP END SINGLE
      field(subdomain) = field(subdomain) + update
!$OMP FLUSH
      READY = .TRUE.
!$OMP FLUSH

Consumer thread:

!$OMP SINGLE
      READY = .FALSE.
!$OMP END SINGLE
      do while (.not. READY)
c        do useful work while waiting for the producer
!$OMP FLUSH
      enddo
      field(i) = field(i-1) + field(i) + field(i+1)

Page 52: Point-to-Point Synchronization

!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(done)
      done(iam) = .FALSE.
!$OMP BARRIER
      call do_my_subdomain_work()
      done(iam) = .TRUE.
!$OMP FLUSH(done)
      do while (.not. done(neighbor))
         call do_some_more_useful_work_if_you_can
!$OMP FLUSH(done)
      enddo
!$OMP END PARALLEL

Page 53: Mixing OpenMP and Message Passing

Enables using more than one shared memory machine:
  – OpenMP within the machine and MPI across machines
SGI supports mixing within a machine:
  – Multiple MPI processes within a shared memory machine, each launching multiple OpenMP threads
  – MPI for coarse grain and OpenMP for loop level
Interoperability restrictions:
  – Only one thread should communicate at a time
      » Put the MPI call in a SINGLE or CRITICAL section (see the sketch below)
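
A minimal sketch of that restriction (buf, n, dest, and tag are hypothetical; the MPI argument list is the standard one):

      include 'mpif.h'
C$OMP PARALLEL
c     ... threaded computation fills buf ...
C$OMP SINGLE
c     only one thread performs the communication
      call MPI_SEND(buf, n, MPI_REAL, dest, tag,
     &              MPI_COMM_WORLD, ierr)
C$OMP END SINGLE
C$OMP END PARALLEL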

Page 54: MPI-OpenMP

MPI across hosts, OpenMP within a host:
  – setenv OMP_NUM_THREADS for each host
  – It may differ per host; use .login or equivalent to set it differently on each host
MPI-OpenMP within a host:
  – An environment variable can be used to set the number of OpenMP threads for all MPI processes
      » All MPI processes will use the same number of threads
  – If each MPI process needs a different number of threads, call omp_set_num_threads in each MPI process (see the sketch below)
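
A sketch of the per-process alternative (the split of 8 vs. 4 threads is arbitrary, for illustration):

      include 'mpif.h'
      integer myrank, ierr
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
c     give rank 0 more OpenMP threads than the other MPI processes
      if (myrank .eq. 0) then
         call omp_set_num_threads(8)
      else
         call omp_set_num_threads(4)
      endif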

Page 55: Profiling Hierarchical Models

SpeedShop can be used:
  – setenv OMP_NUM_THREADS 16
  – mpirun -np 4 ssrun -fpcsampx a.out
This creates 4 MPI processes:
  – The SpeedShop data file will have an "f" prefix before the PID
Each MPI process will create 15 slave threads:
  – The SpeedShop data file will have a "p" prefix before the PID
Accumulating OpenMP thread data into the MPI process:
  – Need to find the ancestor for each OpenMP process
      » ssdump a.out.pXXXXXX | grep ancestor
      » prof a.out.f#### a.out.p#### a.out.p####

Page 56: Summary

The loop level paradigm is the easiest to implement but the least scalable.
Worksharing in coarse grain parallel regions provides intermediate performance.
The SPMD model provides good scalability:
  – OpenMP provides ease of implementation compared to message passing
The shared memory model is forgiving of user error:
  – Avoid the common pitfalls to obtain good performance