

Unified Parallel C

Rakhi Anand
Saber Feki

Department of Computer Science
University of Houston
[email protected]

References

• Slides in this lecture are based upon the following references:

• http://upc.lbl.gov/lang-overview.shtml

• http://upc.gwu.edu/downloads/Manual-1.2.pdf

• http://upc.lbl.gov/docs/user/index.shtml

• http://upc.gwu.edu/tutorials/UPC-SC05.pdf


Introduction

• Unified Parallel C (UPC)

• A Partitioned Global Address Space (PGAS) language

• Similar to the C language

• Common and familiar syntax

• Designed for parallel C programs

• Provides the ability to exploit data locality on different memory architectures

PGAS Languages

• UPC is based on the Partitioned Global Address Space (PGAS) model.

[Figure: the global address space consists of a shared memory space accessible to all threads, plus one private memory space per thread (Private 0, Private 1, ..., Private n for Thread 0, Thread 1, ..., Thread n).]


UPC Execution model

• A number of threads work in SPMD fashion

• MYTHREAD gives the thread index (0, 1, ..., n-1)

• THREADS gives the number of threads

UPC Execution model

• There are two compilation modes

• Static threads mode

• The number of threads is specified at compile time using the THREADS constant

• Dynamic threads mode

• The number of threads is specified at run time


UPC Hello world example

#include <upc.h> /* needed for UPC extensions */
#include <stdio.h>

int main() {
    printf("Thread %d of %d: hello UPC world\n",
           MYTHREAD, THREADS);
    return 0;
}

Compile:

upcc -T 2 -o hello hello.upc

Run:

upcrun -n 2 ./hello
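In dynamic threads mode the thread count is left unspecified at compile time and chosen at launch. With Berkeley UPC, for instance, this typically means omitting the -T flag (exact flags differ between UPC compilers):

Compile: upcc -o hello hello.upc

Run: upcrun -n 4 ./hello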

Shared and Private data

• Normal C variables are allocated in the private memory space of each thread

int mine;

• Shared scalar variables are allocated only once, with affinity to thread 0

shared int x;

• Shared arrays are distributed across threads, one element per thread in round-robin fashion

shared int x[10];


Shared and Private data

[Figure: each thread keeps its own private copy of mine in its private space (Thread 0 ... Thread n), while the shared variable ours and the array elements x[0], x[1], ..., x[n] live in the shared part of the global address space.]

Shared and Private data

• Example: vector addition

shared int a[100], b[100], c[100];
int i;

for (i = 0; i < 100; i++) {
    if (MYTHREAD == i % THREADS)
        a[i] = b[i] + c[i];
}


Data distribution

Efficient data distribution


Blocking of shared data

• The default block size is 1

• By default, data is distributed in round-robin fashion, one element per thread

• Shared arrays can also be distributed in blocks across threads:

shared [3] int A[4][THREADS];

Assume THREADS = 4

Blocking of the shared array (blocks of 3 elements, dealt out round-robin):

Thread 0: A[0][0], A[0][1], A[0][2], A[3][0], A[3][1], A[3][2]
Thread 1: A[0][3], A[1][0], A[1][1], A[3][3]
Thread 2: A[1][2], A[1][3], A[2][0]
Thread 3: A[2][1], A[2][2], A[2][3]
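A quick way to verify such a layout is upc_threadof, the standard UPC library call that returns the thread a shared object has affinity to. A minimal sketch:

#include <upc.h>
#include <stdio.h>

shared [3] int A[4][THREADS];

int main() {
    if (MYTHREAD == 0) {
        int i, j;
        for (i = 0; i < 4; i++)
            for (j = 0; j < THREADS; j++)
                printf("A[%d][%d] has affinity to thread %d\n",
                       i, j, (int)upc_threadof(&A[i][j]));
    }
    return 0;
}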


Blocking of Shared Array

• Thread affinity: the ability of a thread to refer to an object through a private pointer

• Element i of an array with block size B has affinity to thread (i / B) mod THREADS (integer division); for example, with shared [3] int A[16] and 4 threads, element 7 lives on thread (7/3) mod 4 = 2

For_all

• Provides a way to distribute iterations across the threads as you wish

upc_forall(init; test; loop; affinity)

• The affinity expression decides which iteration executes on which thread

• The affinity expression can be an integer or a pointer-to-shared


For_all

• Example 1: explicit affinity

shared int a[100], b[100], c[100];

int i;

upc_forall(i=0; i<100; i++; &a[i])

a[i] = b[i] + c[i];

• Example 2: implicit affinity

shared int a[100], b[100], c[100];

int i;

upc_forall(i=0; i<100; i++; i)

a[i] = b[i] + c[i];

For_all

• Example 3: blocked affinity

shared [100/THREADS] int a[100], b[100], c[100];

int i;

upc_forall(i=0; i<100; i++; (i*THREADS)/100)

a[i] = b[i] + c[i];
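For this blocked layout the pointer form of the affinity expression gives the same distribution of iterations (assuming 100 is divisible by THREADS) and is often easier to read:

shared [100/THREADS] int a[100], b[100], c[100];
int i;

/* &a[i] runs iteration i on the thread that owns a[i], which here
   matches the integer expression (i*THREADS)/100 */
upc_forall(i=0; i<100; i++; &a[i])
    a[i] = b[i] + c[i];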


UPC Pointers

• Pointer Declarations

• int *p1; /* local item which points locally*/

• shared int *p2;/* local pointer to shared data*/

• int *shared p3;/* shared pointer to local data*/

• shared int *shared p4; /* shared pointer to shared data */

• A shared pointer to local memory (p3) is not recommended
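When a pointer-to-shared refers to data that has affinity to the calling thread, it may be cast to an ordinary C pointer for faster local access. A small sketch of this common idiom:

shared int x[THREADS];
int *lp;

/* legal only because x[MYTHREAD] has affinity to this thread */
lp = (int *)&x[MYTHREAD];
*lp = MYTHREAD;   /* local access, no pointer-to-shared overhead */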



Pointer example

Assume THREADS = 3

shared int A[10];
shared int *dp = &A[2], *dp1;
dp1 = dp + 4;

With the default block size of 1, A is laid out round-robin:

Thread 0: A[0], A[3], A[6], A[9]
Thread 1: A[1], A[4], A[7]
Thread 2: A[2], A[5], A[8]

dp points to A[2]; dp+1, dp+2, dp+3 point to A[3], A[4], A[5]; dp1 = dp + 4 points to A[6], which has affinity to thread 0.


Pointer example 2

Assume THREADS = 3

shared [2] int A[10];
shared [3] int *dp = &A[2], *dp1;
dp1 = dp + 4;

The array is laid out according to its own block size of 2:

Thread 0: A[0], A[1], A[6], A[7]
Thread 1: A[2], A[3], A[8], A[9]
Thread 2: A[4], A[5]

Pointer arithmetic, however, follows the pointer's block size of 3: dp points to A[2]; dp+1 and dp+2 advance through thread 1's local elements A[3] and A[8]; dp+3 wraps to thread 2's first element A[4]; dp1 = dp + 4 therefore points to A[5], with affinity to thread 2.


Dynamic memory allocation (I)

shared void *upc_global_alloc(size_t nblocks, size_t nbytes);

nblocks: number of blocks
nbytes: block size

• Non-collective operation

• The calling thread allocates memory in the shared address space

Dynamic memory allocation (II)

shared void *upc_all_alloc(size_t nblocks, size_t nbytes);

nblocks: number of blocks
nbytes: block size

• Collective operation

• All threads get the same pointer


Dynamic memory allocation (III)

shared void *upc_alloc(size_t nbytes);

nbytes: block size

• Non-collective operation

• The calling thread allocates memory in its local shared address space

Dynamic memory de-allocation

void upc_free(shared void *ptr);

• The upc_free function frees the dynamically allocated shared memory pointed to by ptr

• upc_free is not collective
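Putting the calls together, a minimal sketch of a collectively allocated, block-distributed array (the block size of 10 elements per thread is an arbitrary choice for illustration):

#include <upc.h>

#define B 10   /* elements per block, one block per thread */

int main() {
    shared [B] int *data;
    int i;

    /* collective: every thread receives the same pointer */
    data = (shared [B] int *) upc_all_alloc(THREADS, B * sizeof(int));

    /* each thread initializes the block it has affinity to */
    upc_forall(i = 0; i < B * THREADS; i++; &data[i])
        data[i] = i;

    upc_barrier;

    if (MYTHREAD == 0)
        upc_free(data);   /* upc_free is not collective */
    return 0;
}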


Consistency Models(I)

• Governs the interaction of accesses to the shared memory space

• Consistency can be strict or relaxed

• Relaxed consistency

• The program executes under a relaxed (per-thread) consistency model

• The compiler analyzes only the shared memory accesses of the local thread

• Shared operations can be reordered by the compiler

• The relaxed default is selected by including <upc_relaxed.h>

Consistency Models(II)

• Strict consistency

• The program executes under a sequential consistency model

• The compiler must take into account memory accesses from all threads

• Reordering of shared operations is not allowed

• The strict default is selected by including <upc_strict.h>


Consistency Models(III)

• The default consistency model can be altered using

#pragma upc strict
#pragma upc relaxed

• Individual variables can also be declared with the strict or relaxed type qualifiers

Consistency Models Example

#include <upc_relaxed.h>

void send(int val1, int val2) {
    #pragma upc strict
    while (flag) ;      /* strict: wait until the last value was consumed */
    #pragma upc relaxed
    data1 = val1;       /* relaxed: statements can be reordered */
    data2 = val2;
    #pragma upc strict
    flag = 1;           /* strict: signal that data is ready */
}

int recv(void) {
    int tmp;
    #pragma upc strict
    while (!flag) ;     /* strict: wait until data is ready */
    #pragma upc relaxed
    tmp = data1 + data2;
    #pragma upc strict
    flag = 0;           /* strict: signal that data was consumed */
    return tmp;
}


Consistency Models Example

#include <upc_strict.h>

void send(int val1, int val2) {
    while (flag) ;      /* strict (file default): wait until the last value was consumed */
    #pragma upc relaxed
    data1 = val1;       /* relaxed: statements can be reordered */
    data2 = val2;
    #pragma upc strict
    flag = 1;           /* strict: signal that data is ready */
}

int recv(void) {
    int tmp;
    while (!flag) ;     /* strict (file default): wait until data is ready */
    #pragma upc relaxed
    tmp = data1 + data2;
    #pragma upc strict
    flag = 0;           /* strict: signal that data was consumed */
    return tmp;
}

Synchronization

• There is no implicit synchronization among threads

• Synchronization is provided using

• Barrier: blocks until all threads arrive

upc_barrier;

• Split-phase barrier: a non-blocking barrier (see the sketch below)

upc_notify;
upc_wait;
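A minimal sketch of how the split-phase barrier overlaps independent computation with synchronization (local_work() is a placeholder for work that touches no shared data):

upc_notify;      /* signal arrival without blocking */
local_work();    /* placeholder: work on private data only */
upc_wait;        /* block until all threads have notified */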


Synchronization

• UPC uses locks to achieve synchronization among multiple writers

• upc_lock_t *upc_all_lock_alloc(void);

• Collectively allocates a lock

• void upc_lock(upc_lock_t *l);

• Acquires (locks) the lock

• void upc_unlock(upc_lock_t *l);

• Releases (unlocks) the lock

• void upc_lock_free(upc_lock_t *ptr);

• Frees the lock

Synchronization Example

#include <upc_relaxed.h>
#include <stdio.h>

upc_lock_t *l;
shared int value = 0;

int main(void) {
    l = upc_all_lock_alloc();
    upc_lock(l);
    value += 2;
    upc_unlock(l);
    upc_barrier;                    /* ensure all updates are done */
    if (MYTHREAD == 0)
        printf("VAL=%d\n", value);
    if (MYTHREAD == 0)
        upc_lock_free(l);
    return 0;
}


Installation prerequisites

• A POSIX-like environment

• A version of UNIX

• 'Cygwin' toolkit (for Windows)

• GNU make

• Standard Unix tools: a Bourne-compatible shell, 'awk', 'env', 'tail', 'sed', 'basename', 'dirname', and 'tar'

• C compiler

• An MPI implementation (to run UPC over MPI)

• A C++ compiler

• http://upc.lbl.gov/download/

Debuggers

• UPC programs can be debugged using the TotalView debugger

• Requirements

• TotalView version 7.0.0 or greater

• UPC installations running on x86 architectures

• C compiler must be GNU GCC

• Configure the UPC installation with --with-multiconf=+dbg_tv


Performance Analysis tool

• GASP: Global Address Space Performance tool

• Used to plug in third-party performance tools that measure and visualize the performance of UPC programs

• Tool: Parallel Performance Wizard (PPW)

• http://ppw.hcs.ufl.edu/

• Configure the UPC installation with --with-multiconf=+opt_inst

Parallel I/O constraints

• All UPC-IO functions are collective

• Files are automatically closed on termination of the program

• If a program tries to read past the end of the file, it reads only up to the end of the file

• Writing past the end of the file increases the size of the file

• The arguments to all UPC-IO functions are single-valued

• UPC-IO supports weak consistency and atomicity semantics by default

• UPC-IO allows synchronous and asynchronous operations


UPC file operations(I)

• upc_file_t *upc_all_fopen(const char *fname);

• Open the file

• int upc_all_fclose(upc_file_t *fd);

• Closing a file

• int upc_all_fsynch(upc_file_t *fd);

• Transfers the data written with fd to storage device

• upc_off_t upc_all_fseek(upc_file_t *fd, upc_off_t offset, int origin)

• Sets the current position in the file

UPC file operations(II)

• ssize_t upc_all_fread_local(upc_file_t *fd, void *buf, size_t size, size_t nmemb, upc_flag_t sync_mode);

• ssize_t upc_all_fwrite_local(upc_file_t *fd, void *buf, size_t size, size_t nmemb, upc_flag_t sync_mode);

• Reading and writing to local memory


UPC file operations(III)

• ssize_t upc_all_fread_shared(upc_file_t *fd, shared void *buf, size_t blocksize, size_t size, size_t nmemb, upc_flag_t sync_mode);

• ssize_t upc_all_fwrite_shared(upc_file_t *fd, shared void *buf, size_t blocksize, size_t size, size_t nmemb, upc_flag_t sync_mode);

• Reading and writing to shared memory

Data/Thread Placement Affinity

• Distributed memory systems

• Each processor has a separate address space

• Shared scalars are allocated with affinity to thread zero

• Shared arrays are distributed dynamically across the system

• Shared memory systems (behavior depends on the operating system)

• Each thread can map its memory twice

• Once for local data

• Once for shared data


Questions?