
Page 1: Title

A Standard for Shared Memory Parallel Programming

© 1997-98 Silicon Graphics Inc. All rights reserved.

Page 2: Using OpenMP

Using iterative worksharing constructs
Analysis of loop level parallelism
Reducing overhead in the loop level model
Domain decomposition
  – Writing scalable programs in OpenMP
Comparisons with message passing
Performance and scalability considerations

Page 3: DO Worksharing Directive

The DO directive:
  – C$OMP DO [clause[[,] clause]…]
  – C$OMP END DO [NOWAIT]

Clauses:
  – PRIVATE(list)
  – FIRSTPRIVATE(list)
  – LASTPRIVATE(list)
  – REDUCTION({op|intrinsic}: list)
  – ORDERED
  – SCHEDULE(type[, chunk])
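
For orientation, a minimal sketch combining several of these clauses on one worksharing loop (the arrays a and b, the scalar s, and the chunk size are hypothetical, not from the slides):

      s = 0.0
C$OMP PARALLEL SHARED(a, b, s, n) PRIVATE(tmp)
C$OMP DO SCHEDULE(STATIC, 8) REDUCTION(+: s)
      do i = 1, n
c        tmp is private to each thread; the partial sums in s are
c        combined across threads at the end of the loop
         tmp = a(i) * b(i)
         s   = s + tmp
      enddo
C$OMP END DO
C$OMP END PARALLEL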

Page 4: NOWAIT Clause

Without NOWAIT there is an implied BARRIER at the end of each DO:

C$OMP PARALLEL
C$OMP DO
      do i = 1, n
         a(i) = cos(a(i))
      enddo
C$OMP END DO
C$OMP DO
      do i = 1, n
         b(i) = a(i) + b(i)
      enddo
C$OMP END DO
C$OMP END PARALLEL

With NOWAIT there is no BARRIER after the first loop:

C$OMP PARALLEL
C$OMP DO
      do i = 1, n
         a(i) = cos(a(i))
      enddo
C$OMP END DO NOWAIT
C$OMP DO
      do i = 1, n
         b(i) = a(i) + b(i)
      enddo
C$OMP END DO
C$OMP END PARALLEL

By default the loop index is PRIVATE.

Page 5: LASTPRIVATE Clause

Useful when the loop index is live out (its value is needed after the loop).
  – Recall that with PRIVATE the loop index becomes undefined after the loop.

Sequential code (after the loop, i = N):

      do i = 1, N-1
         a(i) = b(i+1)
      enddo
      a(i) = b(0)

Parallel version:

C$OMP PARALLEL
C$OMP DO LASTPRIVATE(i)
      do i = 1, N-1
         a(i) = b(i+1)
      enddo
      a(i) = b(0)
C$OMP END PARALLEL

Page 6: Reductions

Sum reduction - assume no REDUCTION clause. Sequential code:

      do i = 1, N
         X = X + a(i)
      enddo

Wrong! (unsynchronized updates of the shared X):

C$OMP PARALLEL DO SHARED(X)
      do i = 1, N
         X = X + a(i)
      enddo
C$OMP END PARALLEL DO

What's wrong with this one?

C$OMP PARALLEL DO SHARED(X)
      do i = 1, N
C$OMP CRITICAL
         X = X + a(i)
C$OMP END CRITICAL
      enddo
C$OMP END PARALLEL DO

Page 7: REDUCTION Clause

Parallel reduction operators:
  – Most operators and intrinsics are supported
  – +, *, -, .AND., .OR., MAX, MIN
Only scalar variables are allowed.

Sequential code:

      do i = 1, N
         X = X + a(i)
      enddo

With the REDUCTION clause:

C$OMP PARALLEL DO REDUCTION(+: X)
      do i = 1, N
         X = X + a(i)
      enddo
C$OMP END PARALLEL DO

Page 8: ORDERED Clause

Executes in the same order as the sequential code; parallelizes cases where ordering is needed.

Sequential code:

      do i = 1, N
         call find(i, norm)
         print *, i, norm
      enddo

Parallel version:

C$OMP PARALLEL DO ORDERED PRIVATE(norm)
      do i = 1, N
         call find(i, norm)
C$OMP ORDERED
         print *, i, norm
C$OMP END ORDERED
      enddo
C$OMP END PARALLEL DO

Output (in sequential order):
  1 0.45
  2 0.86
  3 0.65

Page 9: SCHEDULE Clause

Controls how the iterations of the loop are assigned to threads:
  – STATIC: each thread is given a "chunk" of iterations in round-robin order
      » Least overhead - determined statically
  – DYNAMIC: each thread is given "chunk" iterations at a time; more chunks are handed out as threads finish
      » Good for load balancing
  – GUIDED: similar to DYNAMIC, but the chunk size is reduced exponentially
  – RUNTIME: the user chooses at run time via an environment variable
      » setenv OMP_SCHEDULE "dynamic, 4"
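
As a sketch (not from the slides - it reuses the find routine of the ORDERED example and a hypothetical bound N), a loop whose iterations vary in cost can ask for a dynamic schedule, or defer the choice to run time:

C$OMP PARALLEL DO SCHEDULE(DYNAMIC, 4) PRIVATE(norm)
      do i = 1, N
c        iterations of varying cost are handed out 4 at a time
         call find(i, norm)
      enddo
C$OMP END PARALLEL DO

C$OMP PARALLEL DO SCHEDULE(RUNTIME) PRIVATE(norm)
      do i = 1, N
c        the schedule is taken from OMP_SCHEDULE, e.g. "dynamic, 4"
         call find(i, norm)
      enddo
C$OMP END PARALLEL DO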

Page 10: Performance Impact of SCHEDULE

Static vs. dynamic across multiple do loops:
  – With STATIC, the same iterations of the do loop are executed by the same thread in both loops
  – If the data is small enough, it may still be in cache - good performance

Effect of chunk size:
  – A chunk size of 1 may result in multiple threads writing to the same cache line
  – Cache thrashing, bad performance

Slide example - two consecutive static loops over the array a:

  a(1,1)  a(1,2)
  a(2,1)  a(2,2)
  a(3,1)  a(3,2)
  a(4,1)  a(4,2)

C$OMP DO SCHEDULE(STATIC)
      do i = 1, 4
      ...

C$OMP DO SCHEDULE(STATIC)
      do i = 1, 4
      ...

Page 11: Loop Level Paradigm

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

      alpha = xnorm/sum

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

Execute each loop in parallel
Easy to parallelize code
Similar to automatic parallelization
  – The Automatic Parallelization Option (API) may be a good start
Incremental
  – One loop at a time
  – Doesn't break the code

Page 12: Performance

(Same loop level code as the previous slide.)

Fine grain overhead
  – A parallel region is started each time
  – Frequent synchronization
The fraction of non-parallel work will dominate
  – Amdahl's law
Limited scalability

Page 13: Reducing Overhead

Convert to a coarser grain model:
  – More work per parallel region
  – Reduce synchronization across threads

Combine multiple DO directives into a single parallel region:
  – Continue to use worksharing directives
      » The compiler does the work of distributing iterations
      » Less work for the user
  – Doesn't break the code

Page 14: Coarser Grain

Fine grain - two parallel regions:

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

Coarser grain - one parallel region, two worksharing loops:

C$OMP PARALLEL
C$OMP DO
      do i = 1, n
         ...
      enddo
C$OMP DO
      do i = 1, n
         ...
      enddo
C$OMP END PARALLEL

Page 15: Using Orphaned Directives

Loop level - two parallel regions, one of them inside MatMul:

C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo
      call MatMul(y)

      subroutine MatMul(y)
C$OMP PARALLEL DO
      do i = 1, n
         ...
      enddo

Coarse grain with an orphaned directive - the C$OMP DO inside MatMul binds to the parallel region of its caller:

C$OMP PARALLEL
C$OMP DO
      do i = 1, n
         ...
      enddo
      call MatMul(y)
C$OMP END PARALLEL

      subroutine MatMul(y)
C$OMP DO
      do i = 1, N
         ...
      enddo

Page 16: Statements Between Loops

Loop level version - the scalar statement runs serially between the two parallel loops:

C$OMP PARALLEL DO
C$OMP& REDUCTION(+: sum)
      do i = 1, n
         sum = sum + a(i)
      enddo

      alpha = sum/scale

C$OMP PARALLEL DO
      do i = 1, n
         a(i) = alpha * a(i)
      enddo

Single parallel region - the statement goes into a SINGLE section, executed by one thread while the others wait (possible load imbalance):

C$OMP PARALLEL
C$OMP DO REDUCTION(+: sum)
      do i = 1, n
         sum = sum + a(i)
      enddo
C$OMP SINGLE
      alpha = sum/scale
C$OMP END SINGLE
C$OMP DO
      do i = 1, n
         a(i) = alpha * a(i)
      enddo
C$OMP END PARALLEL

Page 17: Statements Between Loops (contd.)

The SINGLE version from the previous slide cannot use NOWAIT on the first loop: the reduction into sum must complete before alpha is computed.

An alternative is replicated execution - make alpha PRIVATE and let every thread compute it:

C$OMP PARALLEL PRIVATE(alpha)
C$OMP DO REDUCTION(+: sum)
      do i = 1, n
         sum = sum + a(i)
      enddo
      alpha = sum/scale
C$OMP DO
      do i = 1, n
         a(i) = alpha * a(i)
      enddo
C$OMP END PARALLEL

Page 18: Coarse Grain Worksharing

Reduced overhead by increasing the work per parallel region. But…
  – Worksharing constructs still need to compute loop bounds at each construct
  – Work between loops is not always parallelizable
  – Synchronization at the end of a directive is not always avoidable

Page 19: Domain Decomposition

We could have computed the loop bounds once and used them for all loops:
  – i.e., compute the loop decomposition a priori

Enter domain decomposition:
  – A more general approach to loop decomposition
  – Decompose the work into the number of threads
  – Each thread gets to work on a sub-domain

Page 20: Domain Decomposition (contd.)

Transfers the onus of decomposition from the compiler (worksharing) to the user.

Results in a coarse grain program:
  – Typically one parallel region for the whole program
  – Reduced overhead, good scalability

Domain decomposition results in a model of programming called SPMD.

Page 21: SPMD Programming

SPMD: Single Program Multiple Data

(Figure: one program and one domain; with n threads the domain is split into n sub-domains, and the same program runs on each sub-domain.)

Page 22: Implementing SPMD

Decomposition is done manually; each thread works on its piece:

      program work
      integer omp_get_num_threads, omp_get_thread_num
C$OMP PARALLEL DEFAULT(PRIVATE)
C$OMP& SHARED(N, global)
      nthreads = omp_get_num_threads()
      iam = omp_get_thread_num()
      ichunk = N/nthreads
      istart = iam*ichunk
      iend = (iam+1)*ichunk - 1
      call my_sum(istart, iend, local)
C$OMP ATOMIC
      global = global + local
C$OMP END PARALLEL
      print *, global
      end

Page 23: Implementing SPMD (contd.)

Decomposition is done manually:
  – Implement to run on any number of threads
      » Query for the number of threads
      » Find the thread number
      » Each thread calculates its portion (sub-domain) of the work

The program is replicated on each thread, but with different extents for the sub-domain:
  – All sub-domain specific data are PRIVATE

Page 24: Handling Global Variables

Global variables span the whole domain:
  – Field variables are usually shared
Synchronization is required to update shared variables:
  – ATOMIC
  – CRITICAL

C$OMP ATOMIC
      p(i) = p(i) + plocal

is usually better than

C$OMP CRITICAL
      p(i) = p(i) + plocal
C$OMP END CRITICAL

Page 25: Global Private to Thread

Use THREADPRIVATE for all other sub-domain data that need file scope or common blocks; the common block buf is then common within each thread but private to it:

      parameter (N=1000)
      real A(N,N)
      common /buf/ lft(N), rht(N)
C$OMP THREADPRIVATE(/buf/)
C$OMP PARALLEL
      call init
      call scale
      call order
C$OMP END PARALLEL

Page 26: THREADPRIVATE (contd.)

      subroutine scale
      parameter (N=1000)
      common /buf/ lft(N), rht(N)
C$OMP THREADPRIVATE(/buf/)
      do i = 1, N
         lft(i) = const * A(i, iam)
      end do
      return
      end

      subroutine order
      parameter (N=1000)
      common /buf/ lft(N), rht(N)
C$OMP THREADPRIVATE(/buf/)
      do i = 1, N
         A(i, iam) = lft(index)
      end do
      return
      end

Page 27: Clauses on Parallel Regions

DEFAULT(SHARED | PRIVATE | NONE)
  – The default is DEFAULT(SHARED)
  – DEFAULT(PRIVATE) makes all variables in the lexical extent PRIVATE
REDUCTION - similar to the clause on the DO directive
IF(expr) - controls whether to execute the region in parallel or not
COPYIN(list) - copies threadprivate variables in from the master thread
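
A minimal sketch of these clauses on one region (the routine do_subdomain, the array a, and the size threshold are hypothetical; /buf/ is the threadprivate common block introduced on the earlier slides):

      common /buf/ lft(1000), rht(1000)
C$OMP THREADPRIVATE(/buf/)
      sum = 0.0
C$OMP PARALLEL IF(n .gt. 1000) DEFAULT(PRIVATE)
C$OMP&         SHARED(n, a) REDUCTION(+: sum) COPYIN(/buf/)
c     for small n the IF clause keeps the region serial; otherwise
c     COPYIN initializes each thread's copy of /buf/ from the master
c     thread and the per-thread contributions are summed into sum
      call do_subdomain(n, a, sum)
C$OMP END PARALLEL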

Page 28: Comparisons with Message Passing

The domain decomposition algorithm is the same, but the implementation is easier:
  – Global data and field variables are shared - read from any thread; writes may need synchronization
  – No need for passing messages, no need for ghost cells or shadow buffers
Parallelize only the computationally intensive parts of the code, not the entire code:
  – Pre-processing and post-processing may be left alone

Page 29: Field Variables

Message passing model: all data is local to the process
  – Each process updates the field variables for its subdomain
  – A field variable is a private, subdomain-sized variable
  – Message passing makes the edges consistent with the neighbors

Exactly the same scheme could be used for the shared memory model as well - but it is not common
  – DEFAULT(PRIVATE) is useful for this model

Typical shared memory model: field variables are shared among threads
  – Each thread updates the field variables for its own subdomain
  – A field variable is a shared variable with the same size as the whole domain

Page 30: Message Passing vs. OpenMP

In OpenMP, the domain remains logically intact and accessible to all threads.

(Figure: the message passing version exchanges messages into ghost cells; the OpenMP version reads and writes the shared data with synchronization.)

Page 31: Example

Typical of many engineering codes:
  – FEM, CFD
Consider a 1-D example.

(Figure: the 1-D domain is split into subdomain 1 … subdomain 4; the thread "me" owns the range istart…iend, and its stencil also touches istart-1 and iend+1.)

Page 32: Example (contd.)

Stencil computation:

      b(i) = c1*a(i-1) + c2*a(i) + c3*a(i+1)

      subroutine update
      common /domain/ a, b
      common /subdomain/ istart, iend
C$OMP THREADPRIVATE(/subdomain/)
      do i = istart, iend
         a(i) = a(i) + alocal
      enddo
      do i = istart, iend
         b(i) = c1*a(i-1) + c2*a(i) + c3*a(i+1)
      enddo

Page 33: Example (contd.)

What happens when i = iend?

      b(iend) = c1*a(iend-1) + c2*a(iend) + c3*a(iend+1)

a(iend+1) is not in my subdomain, but a is shared: just read a(iend+1).
  – Make sure that a(iend+1) is ready - wait for a to be ready on all subdomains:

      do i = istart, iend
         a(i) = a(i) + alocal
      enddo
C$OMP BARRIER
      do i = istart, iend
         b(i) = c1*a(i-1) + c2*a(i) + c3*a(i+1)
      enddo

Page 34: Message Passing Equivalent

(Figure: subdomain 2 holds local cells 1…nsub plus ghost cells 0 and nsub+1; "me" is this process.)

      subroutine update
      common /subdomain/ nsub
      do i = 1, nsub
         a(i) = a(i) + alocal
      enddo
      call mpi_send(a(1), me-1)
      call mpi_recv(a(0), me-1)
      call mpi_send(a(nsub), me+1)
      call mpi_recv(a(nsub+1), me+1)
      do i = 1, nsub
         b(i) = c1*a(i-1) + c2*a(i) + c3*a(i+1)
      enddo

Page 35: Two Types of Scaling

Scale the problem size with the number of processors:
  – Users who want to solve bigger problems
      » 1 processor: solve a 32 x 32 grid
      » 4 processors: solve a 64 x 64 grid
  – The subdomain size remains constant

Problem size constant irrespective of the number of processors:
  – Users who want to reduce the time to solution
  – The subdomain size becomes smaller and smaller
  – NAS Parallel Benchmarks
  – Weather codes

Page 36: Scaling Issues

Scale the problem size with the number of processors:
  – Subdomain size per processor is constant
  – Memory characteristics are independent of the number of processors

Problem size constant irrespective of the number of processors:
  – The subdomain size becomes smaller and smaller
  – Data access and memory overhead change

Message passing and shared memory models run into different problems as the subdomain size decreases.

Page 37: Memory Considerations

Message passing implementations need to maintain ghost cells or shadow buffers:
  – 5-point stencil: a(i-2) + a(i-1) + a(i) + a(i+1) + a(i+2)
  – 2 layers of ghost cells on each side

Fixed problem size: 64 x 64 grid
  – On 1 processor: storage (2+64+2) x (2+64+2), ghost cell overhead = 13%
  – On 16 processors: storage on each processor (2+16+2) x (2+16+2), ghost cell overhead = 56%

Ghost cells may be avoided in the shared memory model.
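
The percentages follow directly from the storage counts above:

  1 processor:   (2+64+2)^2 = 68 x 68 = 4624 cells stored for 64 x 64 = 4096 interior cells,
                 overhead = 528/4096 = ~13%
  16 processors: (2+16+2)^2 = 20 x 20 =  400 cells stored for 16 x 16 =  256 interior cells,
                 overhead = 144/256  = ~56%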

Page 38: Cache Considerations

Typical shared memory model: field variables are shared among threads
  – Each thread updates the field variables for its own subdomain
  – A field variable is a shared variable with the same size as the whole domain
For fixed problem size scaling, the subdomain size decreases.

Fortran example from the slide: field variable P(20,20), divided among four threads into subdomains s1…s4; one cache line is 128 bytes on Origin (16 double words).
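
To make the figure concrete (an illustration assuming 8-byte elements and line-aligned storage, not numbers from the slide): each column of P holds 20 consecutive elements in Fortran's column-major layout, while a 128-byte line holds only 16 of them. The boundary between the subdomains that split a column - for example rows 1-10 owned by one thread and rows 11-20 by its neighbor - therefore always falls inside a cache line, so the two threads write the same line along that edge.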

Page 39: False Sharing

Two threads need to write the same cache line, so the cache line ping-pongs between the two processors:
  – A cache line is the smallest unit of transport
  – When thread A writes and then thread B writes, the line must move from processor A to processor B
Poor performance - must be avoided at all costs.

Page 40: Fixing False Sharing

Pay attention to granularity.
Diagnose by using SGI performance tools; on Origin2000, use the R10000 hardware performance counters:
  – SpeedShop: man speedshop
      setenv _SPEEDSHOP_HWC_COUNTER_NUMBER 31
      (31: store prefetch exclusive to shared block in scache)
      ssrun -prof_hwc a.out
Pad false-shared arrays or make them private.
SGI's NUMA extensions may help:
  – c$sgi distribute_reshape p(block, block)
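
A minimal sketch of the padding fix (the names psum and MAXTHR are illustrative; PAD = 16 double words corresponds to the 128-byte Origin cache line mentioned earlier):

      integer MAXTHR, PAD
      parameter (MAXTHR = 16, PAD = 16)
      real*8 psum(PAD, MAXTHR)
c     thread iam accumulates only into psum(1, iam+1); the remaining
c     PAD-1 words of its column are never written, so each thread's
c     running total sits on its own cache line and no line ping-pongs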

Page 41: NUMA Directives

SGI's extensions for addressing NUMA issues; OpenMP allows performance extensions:
  – Performance extensions do not affect the correctness of programs - harmless if ignored
SGI extensions use the C$SGI prefix:
  – HPF-like data distribution
  – Granularity is a page: 16 KB on Origin

C$SGI distribute A(*,block)
  – Distributes the columns of A across threads
  – With two threads and A(102,102): A(*,1:51) on thread 0, A(*,52:102) on thread 1

Page 42: Distribute with Reshape

Used when data has less than page granularity; guarantees the desired memory distribution:

      real A(20, 20)
C$SGI distribute_reshape A(block, block)

The layout of the array is changed - it is no longer equivalent to the undistributed array:
  – The compiler will transform the code to roughly A(20/sqrt(np), 20/sqrt(np), np)
  – All program units must declare the same distribution

Page 43: Handling Irregular Data

Adaptive Mesh Refinement (AMR) or multigrid methods: data is not exclusively accessed by one thread.

(Figure: nested refinement patches at several levels, each split between threads t1 and t2.)

Page 44: Forcing Page Placement (SGI Extension)

Pages can be moved at runtime to the desired memory:

!$SGI page_place(<addr>, <size>, <threadnum>)

Places the specified virtual address range in the memory local to the given thread number.
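
For illustration only, a hypothetical use of the syntax above (the array name, sizes, and 4-byte reals are assumptions, not from the slide):

      real A(1000000)
c     place the first half of A in memory local to thread 0 and the
c     second half local to thread 1 (sizes are in bytes)
!$SGI page_place(A(1),      2000000, 0)
!$SGI page_place(A(500001), 2000000, 1)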

Page 45: Reducing Barriers

A common issue when converting from the message passing model:
  – One process sends data, another receives data
      » Usually the received data is filled into the ghost cells around the subdomain
      » The field variable is "ready" for use after the receive: the ghost cells are consistent with the neighboring subdomains
  – The receive is an implicit synchronization point
This synchronization is very often done using a barrier:
  – But BARRIER synchronizes all threads in the region
  – BARRIERs have high overhead at large processor counts
  – Synchronizing all threads makes load imbalance worse
Reduce BARRIERs to the minimum.

Page 46: Point-to-Point Synchronization

Often only producer-consumer type synchronization is needed:
  » One thread computes a result, another thread reads it - synchronize only those two

The producer thread:
  – Computes the field variables on its own subdomain
  – Sets a flag when the data is ready
  – In MPI this is the point at which MPI_send happens

The consumer thread:
  – Needs the field variables in the neighboring subdomain(s) to continue the simulation
  – Waits for the neighbor's flag to be set before using the data
  – In MPI this is the MPI_recv event

Page 47: Asynchronous Messages

Watch out for isend and irecv pairs:
  – Buffers can be modified after the wait
Conversion to the shared memory model:
  – Replacing the recv with a read of the data may not work if the data was modified between the wait and the recv
  – May need to move code around

      buf(i) = A(rborder(i))
      call MPI_isend(buf, iam+1, handle)
      A(i) = A(i)*c
      call MPI_wait(handle)
      call MPI_recv(buf, iam-1)
      A(lborder(i)) = A(lborder(i)) + buf(i)

Page 48: Producer-Consumer Synchronization

Producer thread:

!$OMP SINGLE
      READY = .FALSE.
!$OMP END SINGLE
      field(subdomain) = field(subdomain) + update
      READY = .TRUE.

Consumer thread:

!$OMP SINGLE
      READY = .FALSE.
!$OMP END SINGLE
      do while (.not. READY)
c        do useful work while waiting for the producer
      enddo
      field(i) = field(i-1) + field(i) + field(i+1)

Page 49: Memory Reordering

The previous code assumes the operations are performed in the strict order in which they appear. Most modern architectures reorder memory references:
  – Register allocation of variables: the update of a memory location is delayed until necessary
  – The compiler may reorder memory references for better cache locality
  – The hardware may cause memory transactions to appear in a different order
Reordering is necessary for obtaining high performance.
It is not an issue for most codes - only for synchronization through memory references.

Page 50: FLUSH Directive

OpenMP provides a directive for users to identify synchronization points in the program. The FLUSH directive provides a memory fence and makes shared memory consistent across all threads:
  – All updates to shared variables that happened before the fence are committed to memory
  – All references to shared variables after the fence are fetched from memory
It does not apply to private variables.

Page 51: Using the FLUSH Directive

Producer thread:

!$OMP SINGLE
      READY = .FALSE.
!$OMP END SINGLE
      field(subdomain) = field(subdomain) + update
!$OMP FLUSH
      READY = .TRUE.
!$OMP FLUSH

Consumer thread:

!$OMP SINGLE
      READY = .FALSE.
!$OMP END SINGLE
      do while (.not. READY)
c        do useful work while waiting for the producer
!$OMP FLUSH
      enddo
      field(i) = field(i-1) + field(i) + field(i+1)

Page 52: Point-to-Point Synchronization

!$OMP PARALLEL DEFAULT(PRIVATE) SHARED(done)
      done(iam) = .FALSE.
!$OMP BARRIER
      call do_my_subdomain_work()
      done(iam) = .TRUE.
!$OMP FLUSH(done)
      do while (.not. done(neighbor))
         call do_some_more_useful_work_if_you_can
!$OMP FLUSH(done)
      enddo
!$OMP END PARALLEL

Page 53: Mixing OpenMP and Message Passing

Enables using more than one shared memory machine:
  – OpenMP within the machine and MPI across machines
SGI supports mixing within a machine:
  – Multiple MPI processes within a shared memory machine, each launching multiple OpenMP threads
  – MPI for coarse grain and OpenMP for loop level
Interoperability restrictions:
  – Only one thread should communicate at a time
      » Put the MPI call in a SINGLE or CRITICAL section (see the sketch below)
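
A minimal sketch of that restriction (buf, n, dest, and tag are hypothetical; the MPI argument list is the standard one):

      include 'mpif.h'
C$OMP PARALLEL
c     ... threaded computation fills buf ...
C$OMP SINGLE
c     only one thread performs the communication
      call MPI_SEND(buf, n, MPI_REAL, dest, tag,
     &              MPI_COMM_WORLD, ierr)
C$OMP END SINGLE
C$OMP END PARALLEL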

Page 54: MPI-OpenMP

MPI across hosts, OpenMP within a host:
  – setenv OMP_NUM_THREADS for each host
  – It may differ per host; use .login or equivalent to set it differently on each host
MPI-OpenMP within a host:
  – An environment variable can be used to set the number of OpenMP threads for all MPI processes
      » All MPI processes will use the same number of threads
  – If each MPI process needs a different number of threads, call omp_set_num_threads in each MPI process (see the sketch below)
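
A sketch of the per-process alternative (the split of 8 vs. 4 threads is arbitrary, for illustration):

      include 'mpif.h'
      integer myrank, ierr
      call MPI_COMM_RANK(MPI_COMM_WORLD, myrank, ierr)
c     give rank 0 more OpenMP threads than the other MPI processes
      if (myrank .eq. 0) then
         call omp_set_num_threads(8)
      else
         call omp_set_num_threads(4)
      endif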

Page 55: Profiling Hierarchical Models

SpeedShop can be used:
  – setenv OMP_NUM_THREADS 16
  – mpirun -np 4 ssrun -fpcsampx a.out
This creates 4 MPI processes:
  – The SpeedShop data file will have an "f" prefix before the PID
Each MPI process will create 15 slave threads:
  – The SpeedShop data file will have a "p" prefix before the PID
Accumulating OpenMP thread data into the MPI process:
  – Need to find the ancestor for each OpenMP process
      » ssdump a.out.pXXXXXX | grep ancestor
      » prof a.out.f#### a.out.p#### a.out.p####

Page 56: Summary

The loop level paradigm is the easiest to implement but the least scalable.
Worksharing in coarse grain parallel regions provides intermediate performance.
The SPMD model provides good scalability:
  – OpenMP provides ease of implementation compared to message passing
The shared memory model is forgiving of user error:
  – Avoid the common pitfalls to obtain good performance