High Performance Computing (CS 540)
Shared Memory Programming with OpenMP and Pthreads*
Jeremy R. Johnson
*Some of this lecture was derived from Pthreads Programming by Nichols, Buttlar, and Farrell and POSIX Threads Programming Tutorial (computing.llnl.gov/tutorials/pthreads) by Blaise Barney
Introduction
• Objective: To further study the shared memory model of parallel programming; an introduction to OpenMP and Pthreads for shared memory parallel programming
• Topics
– Concurrent programming with UNIX processes
– Introduction to shared memory parallel programming with Pthreads
• threads
• fork/join
• race conditions
• synchronization
• performance issues: synchronization overhead, contention and granularity, load balance, cache coherency and false sharing
– Introduction to parallel program design paradigms
• data parallelism (static scheduling)
• task parallelism with workers
• divide and conquer parallelism (fork/join)
Introduction
• Topics
– OpenMP vs. Pthreads
• hello_pthreads.c
• hello_openmp.c
– Parallel regions and execution model
– Data parallelism with loops
– Shared vs. private variables
– Scheduling and chunk size
– Synchronization and reduction variables
– Functional parallelism with parallel sections
– Case studies
Processes
• Processes contain information about program resources and program execution state, including:
– Process ID, process group ID, user ID, and group ID
– Environment
– Working directory
– Program instructions
– Registers
– Stack
– Heap
– File descriptors
– Signal actions
– Shared libraries
– Inter-process communication tools (such as message queues, pipes, semaphores, or shared memory)
UNIX Process
[Figure: a UNIX process, showing its virtual address space (stack, text, data, heap), registers, identity (PID, UID, GID), and resources (open files, locks, sockets).]
Threads
• An independent stream of instructions that can be scheduled to run, with its own:
– Stack pointer
– Registers (program counter)
– Scheduling properties (such as policy or priority)
– Set of pending and blocked signals
– Thread-specific data
• A “lightweight process”
– The cost of creating and managing threads is much less than for processes
– Threads live within a process and share process resources such as the address space
• Pthreads: the standard thread API (IEEE Std 1003.1)
Threads within a UNIX Process
[Figure: two threads inside one UNIX process; each thread has its own stack and registers, while the address space, identity, and resources are shared.]
Shared Memory Model
• All threads have access to the same global, shared memory
• All threads within a process share the same address space
• Threads also have their own private data
• Programmers are responsible for synchronizing access to (protecting) globally shared data
Simple Example
void do_one_thing(int *);
void do_another_thing(int *);
void do_wrap_up(int, int);
int r1 = 0, r2 = 0;
int main(void)
{
do_one_thing(&r1);
do_another_thing(&r2);
do_wrap_up(r1, r2);
return 0;
}
[Figure: the single-threaded process running the simple example. Its virtual address space holds the stack (frames and locals i, j, k of main(), do_one_thing(), do_another_thing()), the text segment with the program code, the data segment with r1 and r2, and the heap; the process also has registers (SP, PC, general-purpose), an identity (PID, UID, GID), and resources (open files, locks, sockets).]
Simple Example (Processes)
#include <stdio.h>
#include <stdlib.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>
#include <unistd.h>
int shared_mem_id, *shared_mem_ptr;
int *r1p, *r2p;
int main(void)
{
pid_t child1_pid, child2_pid;
int status;
/* initialize shared memory segment */
if ((shared_mem_id = shmget(IPC_PRIVATE, 2*sizeof(int), 0660)) == -1)
perror("shmget"), exit(1);
if ((shared_mem_ptr = (int *)shmat(shared_mem_id, (void *)0, 0)) == (void *)-1)
perror("shmat failed"), exit(1);
r1p = shared_mem_ptr;
r2p = (shared_mem_ptr + 1);
*r1p = 0;
*r2p = 0;
Simple Example (Processes)
if ((child1_pid = fork()) == 0) {
/* first child */
do_one_thing(r1p);
return 0;
} else if (child1_pid == -1) {
perror("fork"), exit(1);
}
/* parent */
if ((child2_pid = fork()) == 0) {
/* second child */
do_another_thing(r2p);
return 0;
} else if (child2_pid == -1) {
perror("fork"), exit(1);
}
/* parent */
if ((waitpid(child1_pid, &status, 0) == -1))
perror("waitpid"), exit(1);
if ((waitpid(child2_pid, &status, 0) == -1))
perror("waitpid"), exit(1);
do_wrap_up(*r1p, *r2p);
return 0;
}
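The example never detaches or removes the shared memory segment. A minimal cleanup sketch (an assumption about intent, not part of the slide) that would go just before the final return in main(), using the standard System V calls shmdt() and shmctl():

/* detach the segment and mark it for removal */
if (shmdt((void *) shared_mem_ptr) == -1)
perror("shmdt"), exit(1);
if (shmctl(shared_mem_id, IPC_RMID, NULL) == -1)
perror("shmctl"), exit(1);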
[Figure: after the two fork() calls there are two child processes, each with its own virtual address space (stack, text, data, heap), registers, identity, and resources; one runs do_one_thing(), the other do_another_thing(), and both attach the same shared memory segment holding the values pointed to by r1p and r2p.]
Simple Example (Pthreads)
int r1 = 0, r2 = 0;
int main(void)
{
pthread_t thread1, thread2;
if (pthread_create(&thread1, NULL,
(void *(*)(void *)) do_one_thing, (void *) &r1) != 0)
perror("pthread_create"), exit(1);
if (pthread_create(&thread2, NULL,
(void *(*)(void *)) do_another_thing, (void *) &r2) != 0)
perror("pthread_create"), exit(1);
if (pthread_join(thread1, NULL) != 0)
perror("pthread_join"), exit(1);
if (pthread_join(thread2, NULL) != 0)
perror("pthread_join"), exit(1);
do_wrap_up(r1, r2);
return 0;
}
[Figure: the Pthreads version is a single process with one virtual address space (stack, text, data segment holding r1 and r2, heap), one identity, and one set of resources; Thread 1 and Thread 2 each have their own stack (with the locals of do_one_thing() or do_another_thing()) and their own registers (SP, PC, general-purpose).]
Concurrency and Parallelism
[Figure: three timelines. Serial: do_one_thing(), then do_another_thing(), then do_wrap_up(). Concurrent: do_one_thing() and do_another_thing() interleaved on one processor, then do_wrap_up(). Parallel: do_one_thing() and do_another_thing() at the same time on different processors, then do_wrap_up().]
Unix Fork
• The fork() call
– Creates a child process that is identical to the parent process
– The child has its own PID
– fork() returns different values to the parent (the child’s PID) and to the child (0), as the sketch below shows
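A minimal sketch (not from the slides) of how the two return values separate the parent’s code path from the child’s:

#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
int main(void)
{
pid_t pid = fork();
if (pid == -1) { /* fork failed */
perror("fork"), exit(1);
} else if (pid == 0) { /* child: fork() returned 0 */
printf("child: PID = %d\n", getpid());
} else { /* parent: fork() returned the child's PID */
printf("parent: child PID = %d\n", pid);
waitpid(pid, NULL, 0);
}
return 0;
}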
[Figure: a parent process with PID 7274 calls fork(); afterwards both the parent (PID 7274) and the child (PID 7275) continue from the statement after the fork() call.]
Thread Creation
• pthread_create creates a new thread and makes it executable
– pthread_create(thread, attr, start_routine, arg)
• thread: unique identifier for the new thread
• attr: attribute object used to set thread attributes (NULL for defaults)
• start_routine: the routine the newly created thread will execute
• arg: a single argument passed to start_routine
Thread Creation
• Once created, threads are peers, and may create other threads
Thread Join
• "Joining" is one way to accomplish synchronization between threads.
• The pthread_join() subroutine blocks the calling thread until the specified threadid thread terminates.
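pthread_join can also retrieve a value from the terminated thread through its second argument. A minimal sketch (an illustration; the function square and its static result variable are hypothetical):

#include <pthread.h>
#include <stdio.h>
void *square(void *arg)
{
static int result; /* static so it outlives the thread */
result = (*(int *) arg) * (*(int *) arg);
return &result; /* delivered to pthread_join's second argument */
}
int main(void)
{
pthread_t tid;
int n = 7;
void *retval;
pthread_create(&tid, NULL, square, &n);
pthread_join(tid, &retval); /* blocks until square returns */
printf("square = %d\n", *(int *) retval);
return 0;
}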
Fork/Join Overhead
• Compare the overhead of a procedure call, a process fork/join, and a thread create/join
– Procedure call (no args): 1.2 × 10^-8 sec (12 ns)
– Process fork/join: 0.0012 sec (1.2 ms)
– Thread create/join: 0.000042 sec (42 µs)
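Numbers like these can be reproduced with a simple micro-benchmark. A sketch (assuming a TRIALS count of 1000 and timing with gettimeofday(); the result varies widely by machine):

#include <pthread.h>
#include <stdio.h>
#include <sys/time.h>
#define TRIALS 1000
void *noop(void *arg) { return NULL; }
int main(void)
{
struct timeval t0, t1;
pthread_t tid;
int i;
gettimeofday(&t0, NULL);
for (i = 0; i < TRIALS; i++) { /* create and immediately join an empty thread */
pthread_create(&tid, NULL, noop, NULL);
pthread_join(tid, NULL);
}
gettimeofday(&t1, NULL);
double us = (t1.tv_sec - t0.tv_sec) * 1e6 + (t1.tv_usec - t0.tv_usec);
printf("thread create/join: %.1f microseconds each\n", us / TRIALS);
return 0;
}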
Race Conditions
• When two or more threads access the same resource at the same time

Time   Thread 1              Thread 2              Balance
 |     Withdraw $50          Withdraw $50          $125
 |     Read balance: $125    Read balance: $125    $125
 v     Set balance: $75      Set balance: $75      $75

Both withdrawals complete, but the final balance is $75 instead of the correct $25.
Bad Count
/* NUMCOUNTERS and LIMIT are compile-time constants defined elsewhere */
int sum = 0;
void count(int *arg)
{
int i;
for (i = 0; i < *arg; i++) {
sum++; /* unsynchronized update of shared data: a race condition */
}
}
int main(int argc, char **argv)
{
int error, i;
int numcounters = NUMCOUNTERS;
int limit = LIMIT;
pthread_t tid[NUMCOUNTERS];
pthread_setconcurrency(numcounters);
for (i = 0; i < numcounters; i++) {
error = pthread_create(&tid[i], NULL, (void *(*)(void *)) count, &limit);
}
for (i = 0; i < numcounters; i++) {
error = pthread_join(tid[i], NULL);
}
printf("Counters finished with count = %d\n", sum);
printf("Count should be %d X %d = %d\n", numcounters, limit, numcounters*limit);
return 0;
}
Mutex
• Mutex variables protect shared data when multiple writes can occur
• A mutex variable acts like a “lock” protecting access to a shared data resource; only one thread can own (lock) a mutex at any given time
Mutex Operations
• pthread_mutex_lock(mutex)
– Used by a thread to acquire a lock on the specified mutex variable. If the mutex is already locked by another thread, this call blocks the calling thread until the mutex is unlocked.
• pthread_mutex_unlock(mutex)
– Unlocks the mutex if called by the owning thread. Calling this routine is required after a thread has completed its use of protected data if other threads are to acquire the mutex for their work with the protected data.
Good Count
int sum = 0;
pthread_mutex_t lock;
void count(int *arg)
{
int i;
for (i = 0; i < *arg; i++) {
pthread_mutex_lock(&lock);
sum++; /* protected by the mutex */
pthread_mutex_unlock(&lock);
}
}
int main(int argc, char **argv)
{
int error, i;
int numcounters = NUMCOUNTERS;
int limit = LIMIT;
pthread_t tid[MAXCOUNTERS];
pthread_setconcurrency(numcounters);
pthread_mutex_init(&lock, NULL);
for (i = 0; i < numcounters; i++) {
error = pthread_create(&tid[i], NULL, (void *(*)(void *)) count, &limit);
}
for (i = 0; i < numcounters; i++) {
error = pthread_join(tid[i], NULL);
}
printf("Counters finished with count = %d\n", sum);
printf("Count should be %d X %d = %d\n", numcounters, limit, numcounters*limit);
return 0;
}
Better Count
int sum = 0;
pthread_mutex_t lock;
void count(int *arg)
{
int i;
int localsum = 0;
for (i = 0; i < *arg; i++) {
localsum++; /* private to the thread: no locking needed */
}
pthread_mutex_lock(&lock); /* one locked update per thread instead of one per increment */
sum = sum + localsum;
pthread_mutex_unlock(&lock);
}
Threadsafe Code
• Refers to an application's ability to execute multiple threads simultaneously without "clobbering" shared data or creating "race" conditions.
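A common source of thread-unsafety is hidden static state. A hypothetical sketch (format_id_unsafe and format_id_r are invented names for illustration):

#include <stdio.h>
/* NOT thread-safe: every thread shares the single static buffer */
char *format_id_unsafe(int id)
{
static char buf[32];
snprintf(buf, sizeof(buf), "id-%d", id);
return buf;
}
/* Thread-safe (reentrant): the caller supplies the buffer */
char *format_id_r(int id, char *buf, size_t len)
{
snprintf(buf, len, "id-%d", id);
return buf;
}

This mirrors the C library convention of providing reentrant “_r” variants such as strtok_r.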
Condition Variables
• While mutexes implement synchronization by controlling thread access to data, condition variables allow threads to synchronize based upon the actual value of data.
• Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section), to check if the condition is met.
• A condition variable is a way to achieve the same goal without polling
• Always used together with a mutex
Using Condition Variables
Thread A
• Do work up to the point where a certain condition must occur (such as “count” reaching a specified value)
• Lock the associated mutex and check the value of a global variable
• Call pthread_cond_wait() to perform a blocking wait for a signal from Thread B. A call to pthread_cond_wait() automatically and atomically unlocks the associated mutex so that it can be used by Thread B.
• When signalled, wake up; the mutex is automatically and atomically locked again
• Explicitly unlock the mutex
• Continue
Thread B
• Do work
• Lock the associated mutex
• Change the value of the global variable that Thread A is waiting on
• Check the value of the variable; if it fulfills the desired condition, signal Thread A
• Unlock the mutex
• Continue
Condition Variable Example
/* count, count_lock, count_hit_threshold, and COUNT_THRES are shared globals; see the supporting sketch below */
void *watch_count(void *idp)
{
int i=0, save_state, save_type;
int *my_id = idp;
pthread_mutex_lock(&count_lock);
while (count < COUNT_THRES) {
pthread_cond_wait(&count_hit_threshold, &count_lock);
}
pthread_mutex_unlock(&count_lock);
return(NULL);
}
void *inc_count(void *idp)
{
int i=0, save_state, save_type;
int *my_id = idp;
for (i=0; i<TCOUNT; i++) {
pthread_mutex_lock(&count_lock);
count++;
if (count == COUNT_THRES) {
pthread_cond_signal(&count_hit_threshold);
}
pthread_mutex_unlock(&count_lock);
}
return(NULL);
}
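The two routines above reference shared state that the slide omits. A hypothetical supporting sketch (the constants TCOUNT = 10 and COUNT_THRES = 12 and the one-watcher/two-incrementer structure are assumptions modeled on the standard tutorial example):

#include <pthread.h>
#include <stdio.h>
#define TCOUNT 10
#define COUNT_THRES 12
int count = 0;
pthread_mutex_t count_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t count_hit_threshold = PTHREAD_COND_INITIALIZER;
void *watch_count(void *idp); /* defined above */
void *inc_count(void *idp);   /* defined above */
int main(void)
{
pthread_t w, i1, i2;
int id0 = 0, id1 = 1, id2 = 2;
pthread_create(&w, NULL, watch_count, &id0);
pthread_create(&i1, NULL, inc_count, &id1);
pthread_create(&i2, NULL, inc_count, &id2);
pthread_join(w, NULL);
pthread_join(i1, NULL);
pthread_join(i2, NULL);
printf("final count = %d\n", count); /* 2 * TCOUNT = 20 */
return 0;
}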
OpenMP
• Extension to FORTRAN and C/C++
– Uses directives (comments in FORTRAN, pragmas in C/C++)
• ignored without compiler support
• some library support required
• Shared memory model
– parallel regions
– loop-level parallelism
– implicit thread model
– communication via shared address space
– private vs. shared variables (declared in clauses)
– explicit synchronization via directives (e.g. critical)
– library routines for returning thread information (e.g. omp_get_num_threads(), omp_get_thread_num())
– environment variables used to provide system information (e.g. OMP_NUM_THREADS)
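A small sketch (not a slide example) of the shared/private distinction inside a parallel region; run with, e.g., OMP_NUM_THREADS=4:

#include <omp.h>
#include <stdio.h>
int main(void)
{
int shared_total = 0; /* one copy, visible to all threads */
#pragma omp parallel shared(shared_total)
{
int id = omp_get_thread_num(); /* declared inside the region: private */
#pragma omp critical
shared_total += id; /* synchronized update of the shared variable */
}
printf("total = %d\n", shared_total);
return 0;
}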
Benefits
• Provides incremental parallelism
• Small increase in code size
• Simpler model than message passing
• Easier to use than thread library
• With hardware and compiler support, enables smaller granularity than message passing
Further Information
• Adopted as a standard in 1997
– Initiated by SGI
• www.openmp.org
• computing.llnl.gov/tutorials/openMP
• Chandra, Dagum, Kohr, Maydan, McDonald, and Menon, “Parallel Programming in OpenMP,” Morgan Kaufmann Publishers, 2001.
• Chapman, Jost, and Van der Pas, “Using OpenMP: Portable Shared Memory Parallel Programming,” The MIT Press, 2008.
Shared vs. Distributed Memory
[Figure: shared memory, processors P0, P1, ..., Pn all connected to one Memory; distributed memory, processors P0, P1, ..., Pn each with a local memory M0, M1, ..., Mn, connected by an Interconnection Network.]
Shared Memory Programming Model
• Shared memory programming does not require physically shared memory, so long as there is support for logically shared memory (in either hardware or software)
• With logically shared memory there may be different costs for accessing memory, depending on the physical location
• UMA: uniform memory access
– SMP: symmetric multi-processor
– typically memory connected to processors via a bus
• NUMA: non-uniform memory access
– typically physically distributed memory connected via an interconnection network
hello_openmp.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv)
{
int n;
if (argc > 1) {
n = atoi(argv[1]); omp_set_num_threads(n);
}
printf("Number of threads = %d\n",omp_get_num_threads());
#pragma omp parallel
{
int id = omp_get_thread_num();
printf("Hello World from %d\n",id);
if (id == 0)
printf("Number of threads = %d\n",omp_get_num_threads());
}
exit(0);
}
Compiling & Running Hello_openmp
% gcc -fopenmp hello_openmp.c -o hello
% ./hello 4
Number of threads = 1
Hello World from 1
Hello World from 0
Hello World from 3
Number of threads = 4
Hello World from 2
The order of the print statements is nondeterministic
Execution Model
[Figure: fork/join execution model. The master thread runs alone; implicit thread creation (fork) opens a parallel region in which master and slave threads run; an implicit barrier synchronization (join) at the end of the region returns control to the master thread alone.]
Explicit Barrier
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
int main(int argc, char **argv)
{
int n;
if (argc > 1) {
n = atoi(argv[1]);
omp_set_num_threads(n);
}
printf("Number of threads = %d\n",omp_get_num_threads());
#pragma omp parallel
{
int id = omp_get_thread_num();
printf("Hello World from %d\n",id);
#pragma omp barrier
if (id == 0) printf("Number of threads = %d\n",omp_get_num_threads());
}
exit(0);
}
Output with Barrier
% ./hellob 4
Number of threads = 1
Hello World from 1
Hello World from 0
Hello World from 2
Hello World from 3
Number of threads = 4
The order of the “Hello World” print statements is nondeterministic; however, the “Number of threads” print statement always comes at the end
hello_pthreads.c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <errno.h>
#define MAXTHREADS 32
int main(int argc, char **argv)
{
int error, i, n = 1; /* default to one thread if no argument is given */
void hello(int *pid);
pthread_t tid[MAXTHREADS],mytid;
int pid[MAXTHREADS];
if (argc > 1) {
n = atoi(argv[1]);
if (n > MAXTHREADS) {
printf("Too many threads\n"); exit(1);
}
pthread_setconcurrency(n);
}
printf("Number of threads = %d\n",pthread_getconcurrency());
for (i=0;i<n;i++) {
pid[i]=i;
error = pthread_create(&tid[i], NULL,(void *(*)(void *))hello, &pid[i]);
}
for (i=0;i<n;i++) {
error = pthread_join(tid[i],NULL);
}
exit(0);
}
hello_pthreads.c
void hello(int *pid)
{
pthread_t tid;
tid = pthread_self();
printf("Hello World from %d (tid = %u)\n",*pid,(unsigned int) tid);
if (*pid == 0)
printf("Number of threads = %d\n",pthread_getconcurrency());
}
% gcc -pthread hello_pthreads.c -o hello
% ./hello 4
Number of threads = 4
Hello World from 0 (tid = 1832728912)
Hello World from 1 (tid = 1824336208)
Number of threads = 4
Hello World from 3 (tid = 1807550800)
Hello World from 2 (tid = 1815943504)
The order of the print statements is nondeterministic
Types of Parallelism
• Data parallelism
– Threads execute the same instructions, but on different data
• Functional parallelism
– Threads execute different instructions; they can read the same data but should write different data
[Figure: data parallelism shown as identical code in every thread over different slices of the data; functional parallelism shown as distinct functions F1, F2, F3, F4, each in its own thread.]
Parallel Loop
Serial version:
int a[1000], b[1000];
int main()
{
int i;
int N = 1000;
for (i = 0; i < N; i++) {
a[i] = i;
b[i] = N - i;
}
for (i = 0; i < N; i++) {
a[i] = a[i] + b[i];
}
}

OpenMP version:
int a[1000], b[1000];
int main()
{
int i;
int N = 1000;
// Serial initialization
for (i = 0; i < N; i++) {
a[i] = i;
b[i] = N - i;
}
// Parallel loop: iterations are divided among the threads
#pragma omp parallel for shared(a,b) private(i) schedule(static)
for (i = 0; i < N; i++) {
a[i] = a[i] + b[i];
}
}
Scheduling of Parallel Loop
[Figure: stripmining. The elements of arrays a and b are dealt out round-robin to threads tid = 0, 1, 2, ..., Nthreads-1, and each thread performs the + operation on its share of elements.]
Implementation of Parallel Loop
void vadd(int *id)
{
int i;
/* stripmined loop: thread *id handles i = *id, *id + numthreads, ... */
for (i = *id; i < N; i += numthreads) {
a[i] = a[i] + b[i];
}
}

for (i = 0; i < numthreads; i++) {
id[i] = i;
error = pthread_create(&tid[i], NULL, (void *(*)(void *)) vadd, &id[i]);
}
for (i = 0; i < numthreads; i++) {
error = pthread_join(tid[i], NULL);
}
Scheduling Chunks of Parallel Loop
[Figure: the iteration space of a and b is split into contiguous chunks; chunk 0 goes to thread 0, chunk 1 to thread 1, chunk 2 to thread 2, and so on through chunk Nthreads-1.]
Implementation of Chunking
#pragma omp parallel for shared(a,b) private(i) schedule(static,CHUNK)
for (i = 0; i < N; i++) {
a[i] = a[i] + b[i];
}

void vadd(int *id)
{
int i, j;
/* thread *id takes every numthreads-th chunk of CHUNK consecutive elements */
for (i = *id * CHUNK; i < N; i += numthreads * CHUNK) {
for (j = 0; j < CHUNK && i + j < N; j++) /* guard the final partial chunk */
a[i+j] = a[i+j] + b[i+j];
}
}
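OpenMP can also hand chunks out on demand rather than assigning them statically, which helps when iteration costs are uneven. A sketch reusing the same a, b, N, and CHUNK:

#pragma omp parallel for shared(a,b) private(i) schedule(dynamic,CHUNK)
for (i = 0; i < N; i++) {
a[i] = a[i] + b[i]; /* chunks are claimed by whichever thread is free */
}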
Race Condition
int x[10000000];
int main(int argc, char **argv)
{
int sum = 0;
…….
omp_set_num_threads(numcounters);
for (i = 0; i < numcounters*limit; i++)
x[i] = 1;
/* sum is shared and updated without synchronization: a race condition */
#pragma omp parallel for schedule(static) private(i) shared(sum,x)
for (i = 0; i < numcounters*limit; i++) {
sum = sum + x[i];
if (i == 0)
printf("num threads = %d\n", omp_get_num_threads());
}
Critical Sections
int x[10000000];
int main(int argc, char **argv)
{
int sum = 0;
…….
#pragma omp parallel for schedule(static) private(i) shared(sum,x)
for (i = 0; i < numcounters*limit; i++) {
#pragma omp critical(sum) /* named critical section serializes the update */
sum = sum + x[i];
}
Reduction Variables
int x[10000000];
int main(int argc, char **argv)
{
int sum = 0;
…….
#pragma omp parallel for schedule(static) private(i) shared(x) reduction(+:sum)
for (i = 0; i < numcounters*limit; i++) {
sum = sum + x[i];
}
Reduction
[Figure: each thread applies + to its portion of X[] to produce a partial sum; the partial sums are then combined with + into the total sum.]
Implementing Reduction
#pragma omp parallel shared(sum,x)
{
int i;
int localsum = 0;
int id;
id = omp_get_thread_num();
for (i = id; i < numcounters*limit; i += numcounters) {
localsum = localsum + x[i];
}
#pragma omp critical(sum)
sum = sum + localsum;
}
Functional Parallelism Example
int main()
{
int i;
double a[N], b[N], c[N], d[N];
// Parallel sections: each section may run in a different thread
#pragma omp parallel shared(a,b,c,d) private(i)
{
#pragma omp sections
{
#pragma omp section
for (i=0; i<N; i++)
c[i] = a[i] + b[i];
#pragma omp section
for (i=0; i<N; i++)
d[i] = a[i] * b[i];
}
}
}
Parallel Programming
• Task parallelism vs. data parallelism
• Fork/join parallelism (divide & conquer)
• Static scheduling
• Dynamic scheduling with workers
Sequential Count
int X[MAXSIZE];
/* iterative count of the elements X[l..u] */
int icount(int l, int u)
{
int i;
int y = 0;
for (i=l; i<=u;i++)
y = y + X[i];
return y;
}
/* recursive divide-and-conquer count of X[l..u] */
int rcount(int l, int u)
{
int m;
int y1,y2;
if ( (u-l) == 0)
return X[l];
else
{
m = (l+u)/2;
y1 = rcount(l,m);
y2 = rcount(m+1,u);
return (y1 + y2);
}
}
Counting with a Parallel Loop
int sum = 0;
int numcounters;
int size;
pthread_mutex_t lock;
void count(int *id)
{
int i,lsum;
lsum = 0;
for (i=*id;i<size;i+=numcounters)
{
lsum = lsum + X[i];
}
pthread_mutex_lock(&lock);
sum = sum + lsum;
pthread_mutex_unlock(&lock);
}
Counting with Workers
/* hand out the next chunk [*start, *stop) of X; task_index, task_chunk, n, and task_lock are shared globals */
void get_task(int *start, int *stop)
{
pthread_mutex_lock(&task_lock);
*start = task_index;
if (*start + task_chunk > n)
*stop = n;
else
*stop = *start + task_chunk;
task_index = *stop;
pthread_mutex_unlock(&task_lock);
}
void worker()
{
int start, stop, i;
int y = 0;
/* keep taking chunks until get_task() hands back an empty range */
for (;;) {
get_task(&start, &stop);
if (start >= stop)
break;
for (i = start; i < stop; i++)
y = y + X[i];
}
pthread_mutex_lock(&sum_lock);
sum = sum + y;
pthread_mutex_unlock(&sum_lock);
}
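The slide omits the driver. A hypothetical sketch (NWORKERS, SIZE, and the all-ones initialization of X are assumptions) that sets up the shared task state and launches the worker pool:

#include <pthread.h>
#include <stdio.h>
#define NWORKERS 4
#define SIZE 1000000
int X[SIZE];
int n = SIZE, task_index = 0, task_chunk = 1000, sum = 0;
pthread_mutex_t task_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_t sum_lock = PTHREAD_MUTEX_INITIALIZER;
void worker(void); /* defined above */
int main(void)
{
pthread_t tid[NWORKERS];
int i;
for (i = 0; i < SIZE; i++)
X[i] = 1;
for (i = 0; i < NWORKERS; i++)
pthread_create(&tid[i], NULL, (void *(*)(void *)) worker, NULL);
for (i = 0; i < NWORKERS; i++)
pthread_join(tid[i], NULL);
printf("sum = %d\n", sum); /* expect SIZE */
return 0;
}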
Parallel Divide & Conquer
void pcount(int *arg) /* arg holds {lower bound, upper bound, result} */
{
int error,arg1[3],arg2[3];
int l,u,m;
int y,y1,y2;
pthread_t tid1,tid2;
l = arg[0];
u = arg[1];
if ((u - l) <= cutoff)
y = icount(l, u); /* small enough: count sequentially */
else
{
m = (l+u)/2;
arg1[0] = l;
arg1[1] = m;
error = pthread_create(&tid1,NULL,(void *(*)(void *))pcount,arg1);
/* y2 = icount(m+1,u); */
arg2[0] = m+1;
arg2[1] = u;
error = pthread_create(&tid2,NULL,(void *(*)(void *))pcount,arg2);
error = pthread_join(tid1,NULL);
y1 = arg1[2];
error = pthread_join(tid2,NULL);
y2 = arg2[2];
y = y1 + y2;
}
/* thr_exit(&y); */
arg[2] = y;
}
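A hypothetical driver for pcount (the MAXSIZE and cutoff values and the all-ones data are placeholders); the bounds go in arg[0] and arg[1], and the result comes back in arg[2]:

#include <pthread.h>
#include <stdio.h>
#define MAXSIZE 1000000
int X[MAXSIZE];
int cutoff = 10000; /* recursion cutoff used by pcount */
int main(void)
{
int arg[3];
int i;
for (i = 0; i < MAXSIZE; i++)
X[i] = 1;
arg[0] = 0;           /* lower bound */
arg[1] = MAXSIZE - 1; /* upper bound */
pcount(arg);
printf("count = %d\n", arg[2]); /* expect MAXSIZE */
return 0;
}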