Unified Parallel C (gabriel/courses/cosc6374_f10/ParCo_18...)
11/2/2010
1
Unified Parallel C
Rakhi Anand, Saber Feki
Department of Computer Science
University of Houston
References
• The slides in this lecture are based on the following
references:
• http://upc.lbl.gov/lang-overview.shtml
• http://upc.gwu.edu/downloads/Manual-1.2.pdf
• http://upc.lbl.gov/docs/user/index.shtml
• http://upc.gwu.edu/tutorials/UPC-SC05.pdf
Introduction
• Unified Parallel C (UPC)
• A Partitioned Global Address Space (PGAS) language
• Similar to the C language
• Common and familiar syntax
• Designed for writing parallel C programs
• Provides the ability to exploit data locality on different
memory architectures
PGAS Languages
• UPC is based on the Partitioned Global Address Space
(PGAS) model.
• Figure: threads 0 through n each have a private memory space
(private 0 ... private n); all threads also see a common shared
memory space, and together these form the global address space.
UPC Execution model
• A number of threads work in SPMD fashion
• MYTHREAD gives the index of the calling thread (0, 1, ..., THREADS-1)
• THREADS gives the total number of threads
UPC Execution model
• There are two compilation modes
• Static threads mode
• The number of threads is fixed at compile time
• THREADS is a compile-time constant
• Dynamic threads mode
• The number of threads is specified at run time
UPC Hello world example

#include <upc.h>   /* needed for UPC extensions */
#include <stdio.h>

int main() {
    printf("Thread %d of %d: hello UPC world\n",
           MYTHREAD, THREADS);
    return 0;
}

Compile (the -T 2 flag fixes the thread count at compile time):
upcc -T 2 -o hello hello.upc
Run:
upcrun -n 2 ./hello
Shared and Private data
• Normal C variables are allocated in the private memory
space of each thread
int mine;
• Shared scalar variables are allocated only once, with affinity
to thread 0
shared int x;
• Shared arrays are by default distributed across threads
round-robin, one element per thread at a time
shared int x[10];
Shared and Private data
• Figure: each thread 0 ... n holds its own private copy of
"mine"; the shared array elements x[0], x[1], ..., x[n] live in
the shared space ("ours"), with x[i] having affinity to thread i.
Shared and Private data
• Example: vector addition

shared int a[100], b[100], c[100];
int i;
for (i = 0; i < 100; i++) {
    if (MYTHREAD == i % THREADS)
        a[i] = b[i] + c[i];
}
Data distribution (figure)
Efficient data distribution (figure)
Blocking of shared data
• The default block size is 1, which distributes data in
round-robin fashion, one element at a time
• Shared arrays can instead be distributed in blocks across
threads by giving an explicit block size
shared [3] int A[4][THREADS];
Assume THREADS = 4
Blocking of the shared array (blocks of 3, dealt out round-robin):

Thread 0: A[0][0] A[0][1] A[0][2] | A[3][0] A[3][1] A[3][2]
Thread 1: A[0][3] A[1][0] A[1][1] | A[3][3]
Thread 2: A[1][2] A[1][3] A[2][0]
Thread 3: A[2][1] A[2][2] A[2][3]
Blocking of Shared Array
• Thread affinity: the thread a shared object logically belongs
to; that thread can refer to the object through an ordinary
private pointer
• Element i (in row-major order) of an array with block size B
has affinity to thread (i / B) mod THREADS
upc_forall
• Provides a way to distribute loop iterations across the
threads as you wish
upc_forall(init; test; loop; affinity)
• The affinity expression decides which thread executes each
iteration
• Affinity can be an integer expression (the iteration runs on
thread affinity % THREADS) or a pointer-to-shared (the iteration
runs on the thread with affinity to that address)
upc_forall
• Example 1: explicit affinity (pointer-to-shared)

shared int a[100], b[100], c[100];
int i;
upc_forall (i = 0; i < 100; i++; &a[i])
    a[i] = b[i] + c[i];

• Example 2: implicit affinity (integer)

shared int a[100], b[100], c[100];
int i;
upc_forall (i = 0; i < 100; i++; i)
    a[i] = b[i] + c[i];
upc_forall
• Example 3: blocked affinity

shared [100/THREADS] int a[100], b[100], c[100];
int i;
upc_forall (i = 0; i < 100; i++; (i*THREADS)/100)
    a[i] = b[i] + c[i];
UPC Pointers
• Pointer declarations
• int *p1;               /* private pointer to private data */
• shared int *p2;        /* private pointer to shared data  */
• int *shared p3;        /* shared pointer to private data  */
• shared int *shared p4; /* shared pointer to shared data   */
• A shared pointer to private memory (p3) is not recommended
Pointer example
Assume THREADS = 3

shared int A[10];
shared int *dp = &A[2], *dp1;
dp1 = dp + 4;

With the default block size of 1, A is distributed round-robin:
Thread 0: A[0] A[3] A[6] A[9]
Thread 1: A[1] A[4] A[7]
Thread 2: A[2] A[5] A[8]
Pointer arithmetic walks the elements in array order across the
threads: dp = &A[2], dp+1 = &A[3], dp+2 = &A[4], dp+3 = &A[5],
and dp1 = dp+4 = &A[6] (on thread 0).
Pointer example 2
Assume THREADS = 3

shared [2] int A[10];
shared [3] int *dp = &A[2], *dp1;
dp1 = dp + 4;

The array is laid out with block size 2:
Thread 0: A[0] A[1] A[6] A[7]
Thread 1: A[2] A[3] A[8] A[9]
Thread 2: A[4] A[5]
But dp carries block size 3, so (assuming dp starts at phase 0)
pointer arithmetic steps through three consecutive local elements
of a thread before moving to the next thread: dp = &A[2],
dp+1 = &A[3], dp+2 = &A[8], dp+3 = &A[4], and dp1 = dp+4 = &A[5]
(on thread 2).
Dynamic memory allocation (I)
shared void *upc_global_alloc(size_t nblocks, size_t nbytes);
nblocks: number of blocks
nbytes: block size
• Non-collective operation
• The calling thread allocates memory in the shared address space

Dynamic memory allocation (II)
shared void *upc_all_alloc(size_t nblocks, size_t nbytes);
nblocks: number of blocks
nbytes: block size
• Collective operation
• All threads get the same pointer
Dynamic memory allocation (III)
shared void *upc_alloc(size_t nbytes);
nbytes: block size
• Non-collective operation
• The calling thread allocates memory in its own part of the
shared address space

Dynamic memory de-allocation
void upc_free(shared void *ptr);
• The upc_free function frees the dynamically allocated
shared memory pointed to by ptr
• upc_free is not collective
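A minimal sketch combining the collective and non-collective allocators; it needs a UPC compiler such as upcc to build, so it is shown untested, and the names v, w, and N are illustrative:

```c
#include <upc.h>
#include <stdio.h>

#define N 100

int main() {
    /* Collective: every thread gets the same pointer to THREADS
     * blocks of N ints (block i has affinity to thread i). */
    shared int *v = (shared int *) upc_all_alloc(THREADS, N * sizeof(int));

    /* Non-collective: only thread 0 allocates here; other threads
     * would need the pointer communicated to them before using it. */
    shared int *w = NULL;
    if (MYTHREAD == 0)
        w = (shared int *) upc_global_alloc(THREADS, N * sizeof(int));

    upc_barrier;
    if (MYTHREAD == 0) {
        upc_free(v);    /* upc_free is not collective: one thread frees */
        upc_free(w);
    }
    return 0;
}
```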
Consistency Models (I)
• Govern the ordering of accesses to the shared memory space
• Consistency can be strict or relaxed
• Relaxed consistency
• The program executes under a local consistency model
• The compiler analyses shared memory accesses only within the
local thread
• Shared operations can be reordered by the compiler
• The default is selected by including
<upc_relaxed.h>
Consistency Models (II)
• Strict consistency
• The program executes under a sequential consistency model
• The compiler must take into account memory accesses by all threads
• Reordering of shared operations is not allowed
• The default is selected by including
<upc_strict.h>
Consistency Models (III)
• The default consistency model can be altered with
#pragma upc strict
#pragma upc relaxed
• Individual variables can also be declared with the strict or
relaxed type qualifiers
Consistency Models Example

#include <upc_relaxed.h>   /* relaxed by default */

void send(int val1, int val2) {
    {
        #pragma upc strict      /* flag accesses must not be reordered */
        while (flag) ;          /* wait until the receiver is done */
    }
    data1 = val1;               /* relaxed: statements can be reordered */
    data2 = val2;
    {
        #pragma upc strict
        flag = 1;               /* publish the data */
    }
}

int recv() {
    {
        #pragma upc strict
        while (!flag) ;         /* wait for the sender */
    }
    tmp = data1 + data2;
    {
        #pragma upc strict
        flag = 0;
    }
    return tmp;
}
Consistency Models Example

#include <upc_strict.h>    /* strict by default */

void send(int val1, int val2) {
    while (flag) ;              /* strict: wait until the receiver is done */
    {
        #pragma upc relaxed     /* data writes can be reordered */
        data1 = val1;
        data2 = val2;
    }
    flag = 1;                   /* strict: publish the data */
}

int recv() {
    while (!flag) ;             /* strict: wait for the sender */
    {
        #pragma upc relaxed
        tmp = data1 + data2;
    }
    flag = 0;
    return tmp;
}
Synchronization
• There is no implicit synchronization among threads
• Synchronization is provided using
• Barrier: blocks until all threads arrive
upc_barrier;
• Split-phase barrier: non-blocking barrier
upc_notify;
upc_wait;
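The point of the split-phase barrier is to overlap purely local work with synchronization. A sketch (UPC code, so it needs upcc and is shown untested; the variable names partial and local_work are illustrative):

```c
#include <upc_relaxed.h>
#include <stdio.h>

shared int partial[THREADS];   /* one slot per thread */
int local_work;

int main() {
    partial[MYTHREAD] = MYTHREAD;   /* contribute a value */

    upc_notify;                 /* signal arrival, but do not block */
    local_work = MYTHREAD * 2;  /* overlap purely local computation */
    upc_wait;                   /* block until all threads have notified */

    if (MYTHREAD == 0) {        /* all contributions are now visible */
        int sum = 0;
        for (int i = 0; i < THREADS; i++) sum += partial[i];
        printf("sum = %d\n", sum);
    }
    return 0;
}
```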
Synchronization
• UPC uses locks to synchronize multiple writers
upc_lock_t *upc_all_lock_alloc(void);
• Collectively allocates a lock
void upc_lock(upc_lock_t *l);
• Acquires the lock
void upc_unlock(upc_lock_t *l);
• Releases the lock
void upc_lock_free(upc_lock_t *ptr);
• Frees the lock
Synchronization Example
#include <upc_relaxed.h>
#include <stdio.h>

upc_lock_t *l;
shared int value = 0;

int main(void) {
    l = upc_all_lock_alloc();
    upc_lock(l);
    value += 2;
    upc_unlock(l);
    upc_barrier;             /* ensure all updates are done */
    if (MYTHREAD == 0) {
        printf("VAL=%d\n", value);
        upc_lock_free(l);
    }
    return 0;
}
Installation prerequisites
• A POSIX-like environment
• A version of UNIX, or
• The 'Cygwin' toolkit (for Windows)
• GNU make
• Standard Unix tools: a Bourne-compatible shell, 'awk', 'env', 'tail', 'sed', 'basename', 'dirname', and 'tar'
• A C compiler
• An MPI implementation (to run UPC over MPI)
• A C++ compiler
• http://upc.lbl.gov/download/
Debuggers
• UPC programs can be debugged using the TotalView debugger
• Requirements
• TotalView version 7.0.0 or greater
• A UPC installation running on an x86 architecture
• The C compiler must be GNU GCC
• Include --with-multiconf=+dbg_tv when
configuring
Performance Analysis tool
• GASP: Global Address Space Performance interface
• Used to plug in third-party performance tools to
measure and visualize the performance of UPC programs
• Tool: Parallel Performance Wizard (PPW)
• http://ppw.hcs.ufl.edu/
• Include --with-multiconf=+opt_inst when
configuring
Parallel I/O constraints
• All UPC-IO functions are collective
• Files are automatically closed on program termination
• A read past the end of file stops at the end of file
• Writing past the end of file increases the file size
• The arguments to all UPC-IO functions are single-valued
• UPC-IO by default supports weak consistency and
atomicity semantics
• UPC-IO allows synchronous and asynchronous operations
UPC file operations (I)
• upc_file_t *upc_all_fopen(const char *fname);
• Opens a file
• int upc_all_fclose(upc_file_t *fd);
• Closes a file
• int upc_all_fsync(upc_file_t *fd);
• Transfers the data written via fd to the storage device
• upc_off_t upc_all_fseek(upc_file_t *fd, upc_off_t offset, int origin);
• Sets the current position in the file
UPC file operations (II)
• ssize_t upc_all_fread_local(upc_file_t *fd, void *buf, size_t size, size_t nmemb, upc_flag_t sync_mode);
• ssize_t upc_all_fwrite_local(upc_file_t *fd, void *buf, size_t size, size_t nmemb, upc_flag_t sync_mode);
• Reading from and writing to private (local) memory
UPC file operations (III)
• ssize_t upc_all_fread_shared(upc_file_t *fd, shared void *buf, size_t blocksize, size_t size, size_t nmemb, upc_flag_t sync_mode);
• ssize_t upc_all_fwrite_shared(upc_file_t *fd, shared void *buf, size_t blocksize, size_t size, size_t nmemb, upc_flag_t sync_mode);
• Reading from and writing to shared memory

Data/Thread Placement Affinity
• Distributed memory systems
• Each processor has a separate address space
• Shared scalars are allocated on thread 0
• Shared arrays are distributed across the
system
• Shared memory systems (behavior depends on the operating system)
• Each thread can map its memory twice
• Once for local data
• Once for shared data
Questions?