Eliminating affinity tests and simplifying shared accesses in UPC
Rahul Garg*, Kit Barton*, Calin Cascaval**, Gheorghe Almasi**, Jose Nelson Amaral*
*University of Alberta, **IBM Research
UPC : Unified Parallel C
[Figure: THREADS = 6; threads 0-5 each own one partition of the shared address space]
Partitioned Global Address Space
Shared arrays
Arrays can be shared between all threads, e.g.:
shared [2] double A[9];
Assuming THREADS = 3, this gives a 1-D block-cyclic distribution, similar to HPF's cyclic(k).
[Figure: elements 0-8 of A dealt out in blocks of 2 to threads 0, 1, 2 cyclically]
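For concreteness, here is a small C model (an illustration, not the xlupc implementation; blockcyclic_owner is a hypothetical name) of which thread owns each element of shared [2] double A[9] with THREADS = 3:

#include <stdio.h>

/* Owner of element i in a 1-D block-cyclic layout: blocks of BF
   consecutive elements are dealt to threads in round-robin order. */
static int blockcyclic_owner(int i, int BF, int THREADS) {
    return (i / BF) % THREADS;
}

int main(void) {
    /* models shared [2] double A[9] with THREADS = 3 */
    for (int i = 0; i < 9; i++)
        printf("A[%d] -> thread %d\n", i, blockcyclic_owner(i, 2, 3));
    return 0;
}

This prints thread 0 for A[0..1], thread 1 for A[2..3], thread 2 for A[4..5], then wraps around to thread 0 for A[6..7] and thread 1 for A[8].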
Vector addition example
#include <upc.h>
#include <stdio.h>

shared [2] double A[10];
shared [3] double B[10], C[10];

int main() {
    int i;
    upc_forall(i = 0; i < 10; i++; &C[i])
        C[i] = A[i] + B[i];
    return 0;
}
Outline of talk
upc_forall loops: syntax and uses
Compiling upc_forall loops
Data distributions in UPC
Multiblocking distributions
Privatization of accesses
Results
upc_forall and affinity tests
upc_forall is a work-distribution construct. Form:

shared [BF] double A[M];
upc_forall(i = 0; i < N; i++; &A[i]) {
    // loop body
}

The "affinity test" expression determines which thread executes which iteration.
Affinity test expression
Affinity test elimination : naive
The upc_forall loop:

shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}

is translated naively into a loop with an explicit affinity test in every iteration:

shared [BF] double A[M];
for (i = 0; i < M; i++) {
    if (upc_threadof(&A[i]) == MYTHREAD) {
        // loop body
    }
}
Affinity test elimination : optimized
The same loop:

shared [BF] double A[M];
upc_forall(i = 0; i < M; i++; &A[i]) {
    // loop body
}

with the affinity test eliminated: each thread strides directly over the blocks it owns:

shared [BF] double A[M];
for (i = MYTHREAD * BF; i < M; i += BF * THREADS) {
    for (j = i; j < i + BF && j < M; j++) {  // guard j < M for a partial last block
        // loop body
    }
}
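A quick way to convince yourself the two forms are equivalent is to enumerate both iteration sets; this hypothetical C harness (not from the talk) simulates each thread and cross-checks the optimized strides against the naive owner test:

#include <stdio.h>

#define M 10
#define BF 3
#define THREADS 4

int main(void) {
    for (int t = 0; t < THREADS; t++) {   /* simulate MYTHREAD = t */
        printf("thread %d:", t);
        /* optimized form: jump directly between blocks owned by t */
        for (int i = t * BF; i < M; i += BF * THREADS)
            for (int j = i; j < i + BF && j < M; j++) {
                /* the naive affinity test the transformation removes */
                if ((j / BF) % THREADS != t) { printf(" MISMATCH"); return 1; }
                printf(" %d", j);
            }
        printf("\n");
    }
    return 0;
}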
Integer Affinity Tests
The integer affinity test:

upc_forall(i = 0; i < M; i++; i) {
    // loop body
}

becomes a simple cyclic distribution of iterations:

for (i = MYTHREAD; i < M; i += THREADS) {
    // loop body
}
Data distributions for shared arrays
The official UPC spec only supports 1-D block-cyclic distributions.
The IBM xlupc compiler supports a more general data distribution, 'multidimensional blocking', e.g.:
shared [2][3] double A[5][5];
The array is divided into multidimensional tiles, and the tiles are distributed among the threads in cyclic fashion.
This is more general than the UPC spec, but not as general as ScaLAPACK or HPF.
Multidimensional Blocking
shared [2][2] double A[5][5];
Owner thread of each element for THREADS = 4 (2x2 tiles assigned to threads cyclically, in row-major tile order):
0 0 1 1 2
0 0 1 1 2
3 3 0 0 1
3 3 0 0 1
2 2 3 3 0
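The table above can be reproduced by a small C model (assuming row-major cyclic tile assignment, which is what the layout shows; tile_owner is a hypothetical helper, not a compiler API):

#include <stdio.h>

/* Owner of element (i,j) under shared [B1][B2] blocking: the array
   is cut into B1 x B2 tiles and whole tiles are assigned to threads
   cyclically in row-major tile order. */
static int tile_owner(int i, int j, int B1, int B2,
                      int cols, int THREADS) {
    int tiles_per_row = (cols + B2 - 1) / B2;   /* ceil(cols / B2) */
    int tile_id = (i / B1) * tiles_per_row + (j / B2);
    return tile_id % THREADS;
}

int main(void) {
    /* models shared [2][2] double A[5][5] with THREADS = 4 */
    for (int i = 0; i < 5; i++) {
        for (int j = 0; j < 5; j++)
            printf("%d ", tile_owner(i, j, 2, 2, 5, 4));
        printf("\n");
    }
    return 0;
}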
Locality analysis and privatization
Consider:

shared [2][3] double A[5][6], B[5][6];
for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}

What code should we generate for the references A[i][j] and B[i+1][j]?
Shared access code generation
Naive generated code, with a runtime (RTS) call for every shared access:

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        val = shared_deref(B, i+1, j);
        shared_assign(A, i, j, val);
    }
}

for the source loop:

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Shared access code generation
Do we really need the function calls?
A[i][j] is local by construction (it is the affinity expression), so it should be just a memory load/store.
What about B[i+1][j]? On an SMP it should be just a load; what happens on hybrid machines?

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
[Figure: the region of B owned by thread 0 versus the region of B referenced by thread 0 through B[i+1][j]]

for (i = 0; i < 4; i++)
    upc_forall(j = 0; j < 4; j++; &A[i][j])
        A[i][j] = B[i+1][j];
Locality Analysis : Intuition
The locality of B[i+1][j] can change only when the index i+1 crosses a block boundary in some dimension.
Block boundaries lie at 0, BF, 2*BF, ..., so (i+1) % BF == 0 marks a block boundary.
Therefore we only need to test whether (i+1) % BF == 0 to find the places where locality can change.

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Locality Analysis
Define the offset vector [k1 k2], where k1 and k2 are integer constants; here k1 = 1 and k2 = 0.
The reference crosses a block boundary when (i + k1) % BF == 0.
This splits the iterations into two cases: i % BF < (BF - k1 % BF) and i % BF >= (BF - k1 % BF).
The boundary value BF - k1 % BF is referred to as the 'cut'.

for (i = 0; i < 4; i++) {
    upc_forall(j = 0; j < 4; j++; &A[i][j]) {
        A[i][j] = B[i+1][j];
    }
}
Shared access code generation
for (i = 0; i < 4; i++) {
    if (i % 2 < 1) {  // B[i+1][j] is local: i+1 stays in the same block row
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = memory_load(B, i+1, j);
            memory_store(A, i, j, val);
        }
    } else {          // B[i+1][j] may be remote: fall back to an RTS call
        upc_forall(j = 0; j < 4; j++; &A[i][j]) {
            val = shared_deref(B, i+1, j);
            memory_store(A, i, j, val);
        }
    }
}
Locality analysis : algorithm
For each shared reference in the loop:
    Check if the blocking factor matches
    Check if the distance vector is constant
If the reference is eligible:
    Generate its cut expressions
    Insert each cut into a sorted "cut list"
Replicate the loop body as necessary, once per region between cuts
Insert a memory load/store for local references; otherwise insert an RTS call
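As a rough C sketch of the cut-list step (names and structure are illustrative, not the actual xlupc internals), the cuts of all eligible references are computed and kept sorted; each region between consecutive cuts then gets its own replica of the loop body:

#include <stdio.h>
#include <stdlib.h>

static int cmp_int(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

int main(void) {
    int BF = 2;
    int k[] = {0, 1};       /* row offsets of A[i][j] and B[i+1][j] */
    int nrefs = 2, cuts[2];
    for (int r = 0; r < nrefs; r++)
        cuts[r] = BF - k[r] % BF;                 /* cut of each reference */
    qsort(cuts, nrefs, sizeof cuts[0], cmp_int);  /* sorted "cut list" */
    /* each range of i % BF between consecutive cuts gets one replica of
       the loop body, using a direct load/store for the references that
       are local in that range and an RTS call for the others */
    for (int r = 0; r < nrefs; r++)
        printf("cut = %d\n", cuts[r]);
    return 0;
}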
Improvements of locality analysis in isolation
[Bar chart: % improvement of locality analysis alone, computed as 100*(base-opt)/opt, for 1, 2, 4, and 8 threads/node on 1, 2, and 3 nodes; y-axis 0-500]
Improvements of affinity test elimination in isolation
[Bar chart: percentage improvements from affinity test elimination alone, for 1, 2, 4, and 8 threads/node on 1, 2, and 3 nodes; y-axis 0-300]
Results : Vector addition
[Bar chart: percentage improvements in runtime for vector addition, for 1, 2, 4, and 8 threads/node on 1, 2, and 3 nodes; y-axis 0-10000]
Matrix-vector multiplication
[Bar chart: percentage improvements in runtime for matrix-vector multiplication, for 1, 2, 4, and 8 threads/node on 1, 2, and 3 nodes; y-axis 0-3000]
Matrix-vector scalability
[Bar chart: speedup over C of matrix-vector multiplication for 1, 2, 4, and 8 threads/node on 1, 2, and 3 nodes; y-axis 0-2.5]
Conclusions
UPC requires extensive compiler support; upc_forall is a challenging construct to compile efficiently.
An efficient shared access implementation also requires compiler support.
The optimizations working together produce good results: compiler optimizations can produce a >80x speedup over unoptimized code.
If one optimization fails to apply, the results can still be bad.