Eliminating affinity tests and simplifying shared accesses in UPC
Rahul Garg*, Kit Barton*, Calin Cascaval**, Gheorghe Almasi**, Jose Nelson Amaral*
*University of Alberta, **IBM Research
UPC : Unified Parallel C
[Figure: THREADS = 6; threads 0-5 each own one partition of the shared address space]
Partitioned Global Address Space
Shared arrays
Arrays can be shared between all threads, e.g.:
shared [2] double A[9];
Assuming THREADS = 3, this gives a 1-D block-cyclic distribution, similar to HPF's cyclic(k).
[Figure: elements 0-8 of A dealt out in blocks of 2 to threads 0, 1, 2 cyclically]
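For concreteness, here is a small C model (an illustration, not the xlupc implementation; blockcyclic_owner is a hypothetical name) of which thread owns each element of shared [2] double A[9] with THREADS = 3:

#include <stdio.h>

/* Owner of element i in a 1-D block-cyclic layout: blocks of BF
   consecutive elements are dealt to threads in round-robin order. */
static int blockcyclic_owner(int i, int BF, int THREADS) {
    return (i / BF) % THREADS;
}

int main(void) {
    /* models shared [2] double A[9] with THREADS = 3 */
    for (int i = 0; i < 9; i++)
        printf("A[%d] -> thread %d\n", i, blockcyclic_owner(i, 2, 3));
    return 0;
}

This prints thread 0 for A[0..1], thread 1 for A[2..3], thread 2 for A[4..5], then wraps around to thread 0 for A[6..7] and thread 1 for A[8].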
Vector addition example
#include <upc.h>
#include <stdio.h>

shared [2] double A[10];
shared [3] double B[10], C[10];

int main() {
    int i;
    upc_forall(i = 0; i < 10; i++; &C[i])
        C[i] = A[i] + B[i];
    return 0;
}
Outline of talk
upc_forall loops: syntax and uses
Compiling upc_forall loops
Data distributions in UPC
Multiblocking distributions
Privatization of accesses
Results
upc_forall and affinity tests
upc_forall is a work-distribution construct. Form:

shared [BF] double A[M];
upc_forall(i = 0; i < N; i++; &A[i]) {
    // loop body
}

The "affinity test" expression determines which thread executes which iteration.
Affinity test expression
Affinity test elimination : naive
The upc_forall loop:

shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}

is translated naively into a loop with an explicit affinity test in every iteration:

shared [BF] double A[M];
for (i = 0; i < M; i++) {
    if (upc_threadof(&A[i]) == MYTHREAD) {
        // loop body
    }
}
Affinity test elimination : optimized
The same loop:

shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}

with the affinity test eliminated: each thread strides directly over the blocks it owns:

shared [BF] double A[M];
for (i = MYTHREAD * BF; i < M; i += BF * THREADS) {
    for (j = i; j < i + BF && j < M; j++) {  // guard j < M for a partial last block
        // loop body
    }
}
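A quick way to convince yourself the two forms are equivalent is to enumerate both iteration sets; this hypothetical C harness (not from the talk) simulates each thread and cross-checks the optimized strides against the naive owner test:

#include <stdio.h>

#define M 10
#define BF 3
#define THREADS 4

int main(void) {
    for (int t = 0; t < THREADS; t++) {   /* simulate MYTHREAD = t */
        printf("thread %d:", t);
        /* optimized form: jump directly between blocks owned by t */
        for (int i = t * BF; i < M; i += BF * THREADS)
            for (int j = i; j < i + BF && j < M; j++) {
                /* the naive affinity test the transformation removes */
                if ((j / BF) % THREADS != t) { printf(" MISMATCH"); return 1; }
                printf(" %d", j);
            }
        printf("\n");
    }
    return 0;
}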
Integer Affinity Tests
The integer affinity test:

upc_forall(i = 0; i < M; i++; i) {
    // loop body
}

becomes a simple cyclic distribution of iterations:

for (i = MYTHREAD; i < M; i += THREADS) {
    // loop body
}
Data distributions for shared arrays
The official UPC spec only supports 1-D block-cyclic distributions.
The IBM xlupc compiler supports a more general data distribution, 'multidimensional blocking', e.g.:
shared [2][3] double A[5][5];
The array is divided into multidimensional tiles, and the tiles are distributed among the threads in cyclic fashion.
This is more general than the UPC spec, but not as general as ScaLAPACK or HPF.
Multidimensional Blocking
shared [2][2] double A[5][5];
Owner thread of each element for THREADS = 4 (2x2 tiles assigned to threads cyclically, in row-major tile order):
0 0 1 1 2
0 0 1 1 2
3 3 0 0 1
3 3 0 0 1
2 2 3 3 0
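The table above can be reproduced by a small C model (assuming row-major cyclic tile assignment, which is what the layout shows; tile_owner is a hypothetical helper, not a compiler API):

#include <stdio.h>

/* Owner of element (i,j) under shared [B1][B2] blocking: the array
   is cut into B1 x B2 tiles and whole tiles are assigned to threads
   cyclically in row-major tile order. */
static int tile_owner(int i, int j, int B1, int B2,
                      int cols, int THREADS) {
    int tiles_per_row = (cols + B2 - 1) / B2;   /* ceil(cols / B2) */
    int tile_id = (i / B1) * tiles_per_row + (j / B2);
    return tile_id % THREADS;
}

int main(void) {
    /* models shared [2][2] double A[5][5] with THREADS = 4 */
    for (int i = 0; i < 5; i++) {
        for (int j = 0; j < 5; j++)
            printf("%d ", tile_owner(i, j, 2, 2, 5, 4));
        printf("\n");
    }
    return 0;
}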
Locality analysis and privatization
Consider:

shared [2][3] double A[5][6], B[5][6];
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}

What code should we generate for the references A[i][j] and B[i+1][j]?
Shared access code generation
Naive generated code, with a runtime (RTS) call for every shared access:

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        val = shared_deref(B, i+1, j);
        shared_assign(A, i, j, val);
    }
}

for the source loop:

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Shared access code generation
Do we really need the function calls?
A[i][j] is local by construction (it is the affinity expression), so it should be just a memory load/store.
What about B[i+1][j]? On an SMP it should be just a load; what happens on hybrid machines?

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
[Figure: the region of B owned by thread 0 versus the region of B referenced by thread 0 through B[i+1][j]]

for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
        A[i][j] = B[i+1][j];
Locality Analysis : Intuition
The locality of B[i+1][j] can change only when the index i+1 crosses a block boundary in some dimension.
Block boundaries lie at 0, BF, 2*BF, ..., so (i+1) % BF == 0 marks a block boundary.
Therefore we only need to test whether (i+1) % BF == 0 to find the places where locality can change.

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
Define the offset vector [k1 k2], where k1 and k2 are integer constants; here k1 = 1 and k2 = 0.
The reference crosses a block boundary when (i + k1) % BF == 0.
This splits the iterations into two cases: i % BF < (BF - k1 % BF) and i % BF >= (BF - k1 % BF).
The boundary value BF - k1 % BF is referred to as the 'cut'.

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Shared access code generation
for (i = 0; i < 4; i++) {
    if (i % 2 < 1) {  // B[i+1][j] is local: i+1 stays in the same block row
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = memory_load(B, i+1, j);
            memory_store(A, i, j, val);
        }
    } else {          // B[i+1][j] may be remote: fall back to an RTS call
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = shared_deref(B, i+1, j);
            memory_store(A, i, j, val);
        }
    }
}
Locality analysis : algorithm
For each shared reference in the loop:
    Check if the blocking factor matches
    Check if the distance vector is constant
If the reference is eligible:
    Generate its cut expressions
    Insert each cut into a sorted "cut list"
Replicate the loop body as necessary, once per region between cuts
Insert a memory load/store for local references; otherwise insert an RTS call
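As a rough C sketch of the cut-list step (names and structure are illustrative, not the actual xlupc internals), the cuts of all eligible references are computed and kept sorted; each region between consecutive cuts then gets its own replica of the loop body:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int BF = 2;
    int k[] = {0, 1};       /* row offsets of A[i][j] and B[i+1][j] */
    int nrefs = 2, cuts[2];
    for (int r = 0; r < nrefs; r++)
        cuts[r] = BF - k[r] % BF;                 /* cut of each reference */
    qsort(cuts, nrefs, sizeof cuts[0], cmp_int);  /* sorted "cut list" */
    /* each range of i % BF between consecutive cuts gets one replica of
       the loop body, using a direct load/store for the references that
       are local in that range and an RTS call for the others */
    for (int r = 0; r < nrefs; r++)
        printf("cut = %d\n", cuts[r]);
    return 0;
}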
Improvements of locality analysis in isolation
[Bar chart: % improvement of locality analysis alone, computed as 100*(base-opt)/opt, for 1, 2, 4, and 8 threads/node on 1, 2, and 3 nodes; y-axis 0-500]
Improvements of affinity test elimination in isolation
[Bar chart: percentage improvements from affinity test elimination alone, for 1, 2, 4, and 8 threads/node on 1, 2, and 3 nodes; y-axis 0-300]
Results : Vector addition
[Bar chart: percentage improvements in runtime for vector addition, for 1, 2, 4, and 8 threads/node on 1, 2, and 3 nodes; y-axis 0-10000]
Matrix-vector multiplication
[Bar chart: percentage improvements in runtime for matrix-vector multiplication, for 1, 2, 4, and 8 threads/node on 1, 2, and 3 nodes; y-axis 0-3000]
Matrix-vector scalability
[Bar chart: speedup over C of matrix-vector multiplication for 1, 2, 4, and 8 threads/node on 1, 2, and 3 nodes; y-axis 0-2.5]
Conclusions
UPC requires extensive compiler support; upc_forall is a challenging construct to compile efficiently.
An efficient shared access implementation also requires compiler support.
The optimizations working together produce good results: compiler optimizations can produce a >80x speedup over unoptimized code.
If one optimization fails to apply, the results can still be bad.