Automatic task generation for DAGuE
description
Transcript of Automatic task generation for DAGuE
Automatic task generation for DAGuE
George Bosilca, Aurelien Bouteiller, Anthony Danalis, Mathieu Faverge, Thomas Herault, Jack Dongarra
http://icl.utk.edu/dague
The DAGuE system
Serial Code to Dataflow Representation
DAGuE Compiler
Example: QR Factorization
Input Format – Quark (PLASMA)for (k = 0; k < MT; k++) {
Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } }}
• Sequential C code • Annotated through
QUARK-specific syntax• Insert_Task• INOUT, OUTPUT, INPUT• REGION_L, REGION_U, REGION_D, …• LOCALITY
• StarPU syntax in progress
DAGuE Compiler analysis steps • Record all USEs
• Record all DEFinitions• Formulate as Omega
Relations:• all true (flow) dependencies• all output dependencies
• Compute the differences• Formulate all anti-
dependencies• Finalize synchronization
edges
for (k = 0; k < MT; k++) { Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } }}
Traditional Compiler (Control Flow)for (k = 0; k < MT; k++) { Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } }}
for(k…
for(m…
for(n…
for(m…Co
ntro
l Flo
w Gr
aph
Data Flow imposes orderingfor (k = 0; k < MT; k++) {
Insert_Task( zgeqrt, A[k][k], INOUT, T[k][k], OUTPUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsqrt, A[k][k], INOUT | REGION_D|REGION_U, A[m][k], INOUT | LOCALITY, T[m][k], OUTPUT); } for (n = k+1; n < NT; n++) { Insert_Task( zunmqr, A[k][k], INPUT | REGION_L, T[k][k], INPUT, A[k][m], INOUT); for (m = k+1; m < MT; m++) { Insert_Task( ztsmqr, A[k][n], INOUT, A[m][n], INOUT | LOCALITY, A[m][k], INPUT, T[m][k], INPUT); } }}
Dataflow Analysis
• Example on task DGEQRT of QR
for k = 0 .. N-1 A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) for n = k+1 .. N-1 A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
MEM
k=0
Outgoing DataIncoming Data
n=k+1m=k+1
Dataflow Analysis
• Example on task DGEQRT of QR• Polyhedral Analysis through
Omega• Compute algebraic expressions
for:• Source and destination tasks• Necessary conditions for that
data flow to exist
MEM
k=0
n=k+1m=k+1
Outgoing DataIncoming Data
U
L
k=SIZE-1
for k = 0 .. N-1 A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) for n = k+1 .. N-1 A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] )
Intermediate Representation: Job Data Flow
Control flow is eliminated, therefore maximum parallelism is possible
GEQRT(k) /* Execution space */ k = 0..( MT < NT ) ? MT-1 : NT-1 ) /* Locality */ : A(k, k) RW A <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k) -> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1) [type = LOWER] -> (k < MT-1) ? A1 TSQRT(k, k+1) [type = UPPER] -> (k == MT-1) ? A(k, k) [type = UPPER] WRITE T <- T(k, k) -> T(k, k) -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1) /* Priority */ ;(NT-k)*(NT-k)*(NT-k)
BODY zgeqrt( A, T )END
Intermediate Representation: Job Data Flow
Control flow is eliminated, therefore maximum parallelism is possible
GEQRT(k) /* Execution space */ k = 0..( MT < NT ) ? MT-1 : NT-1 ) /* Locality */ : A(k, k) RW A <- (k == 0) ? A(k, k) : A1 TSMQR(k-1, k, k) -> (k < NT-1) ? A UNMQR(k, k+1 .. NT-1) [type = LOWER] -> (k < MT-1) ? A1 TSQRT(k, k+1) [type = UPPER] -> (k == MT-1) ? A(k, k) [type = UPPER] WRITE T <- T(k, k) -> T(k, k) -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1) /* Priority */ ;(NT-k)*(NT-k)*(NT-k)
BODY zgeqrt( A, T )END
Dataflow Analysis
A[m][n] :k+1<= m < N-1k+1<= n < N-10 <= k < N-1
DEF
USEA[k’][k’] :0 <= k’ <N-1
for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Dataflow Analysis
A[m][n] :k+1<= m < N-1k+1<= n < N-10 <= k < N-1
DEF
A[k’][k’] :0 <= k’ <N-1
USE
k’ > kCtrl Flow
for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Dataflow Analysis
[k,m,n] -> [k’] :k+1<= m < N-1k+1<= n < N-10 <= k < N-10 <= k’ < N-1k < k’m = k’n = k’
Flow Dependency(RAW) Relation
for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Dataflow Analysis
{[k,m,m] -> [m] : k+1, 0 <= m < N}
Omega Simplifiedfor k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Dataflow Analysis
Output Dependency (WAW)
{[k,m,m] -> [m] : k+1, 0 <= m < N}
Omega Simplifiedfor k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Dataflow Analysis
n=k+1m=k+1
{[k,k+1,k+1] -> [k+1] : 0 <= k <= N-2}
Real Edge: Flow - Outputfor k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Dataflow Analysis
n=k+1m=k+1
{[k,k+1,k+1] -> [k+1] : 0 <= k <= N-2}
Real Edge: Flow - Output
{[k-1,k, k] -> [k] : 0 < k <= N-1}
GEQRT’s incoming edge
for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Dataflow Analysis
n=k+1m=k+1
{[k,k+1,k+1] -> [k+1] : 0 <= k <= N-2}
Real Edge: Flow - Output
{[k-1,k, k] -> [k] : 0 < k <= N-1}(k>0) ? TSMQR(k-1,k,k)
GEQRT’s incoming edge
Dataflow Analysis
{[k] -> [k, n] : 0 <= k <= N-2 && k+1 <= n <=N-1 }
GEQRT - > UNMQRfor k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Dataflow Analysis
{[k] -> [k, n] : 0 <= k <= N-2 && k+1 <= n <=N-1 }
-> (k<N-1) ? UNMQR(k, k+1..N-1)
GEQRT - > UNMQRfor k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Anti-dependencies• In theory, anti-deps do not matter in distributed
memory• But real machines are distributed/shared memory
hybrids
• Anti-deps must create synchronization edges• Overestimating anti-deps is safe (albeit slow)• Output deps should be treated the same … in
theory
for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Anti-dependencies in QR?
for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Anti-dependencies in QR?
n=k+1m=k+1
for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Anti-dependencies in QR?
{[k, m, n] -> [k‘] : n=m=k‘}
TSMQR - > GEQRT
for k = 0 .. N-1 { A[k][k], T[k][k] < - GEQRT( A[k][k] ) for m = k+1 .. N-1 { A[k][k] | U, A[m][k], T[m][k] < - TSQRT( A[k][k] | U, A[m][k], T[m][k]) } for n = k+1 .. N-1 { A[k][n] < - UNMQR( A[k][k] | L, T[k][k], A[k][n] ) for m = k+1 .. N-1 { A[k][n], A[m][n] < - TSMQR( A[m][k], T[m][k], A[k][n], A[m][n] ) } }}
Anti-dependencies in QR?
{[k, m, n] -> [k‘] : n=m=k‘}
TSMQR - > GEQRT
n=k+1m=k+1
UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1
GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 )
TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1
TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1{[k]->[k,k+1] : mt >= (k+2)}
{[k]->
[k,n
] : k
< n
< n
t &&
k <
nt-1
}
{[k,n]->
[k,k+1,n]: k < m
t-1}
{[k,m
]->[k
,m,n
] : k
<nt-1
&&
k<n
< nt
{[k,m]->[k,m+1]: m<mt-1}
{[k,m,n]->[n,m
] : n==(k+1) &&
m>n}
{[k,m,n]->[n] : k+1==n && k+1==m}
{[k,m
,n]->[k+1,n] :
m==k+1 && n>m}{[k,m
,n]->[k+1,m,n]: n>1+k &
& m
>1+k}{[k,m
,n]->[k,m+1,n]: m
<mt-1}
UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1
GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 )
TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1
TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1{[k]->[k,k+1] : mt >= (k+2)}
{[k]->
[k,n
] : k
< n
< n
t &&
k <
nt-1
}
{[k,n]->
[k,k+1,n]: k < m
t-1}
{[k,m
]->[k
,m,n
] : k
<nt-1
&&
k<n
< nt
{[k,m]->[k,m+1]: m<mt-1}
{[k,m,n]->[n,m
] : n==(k+1) &&
m>n}
{[k,m,n]->[n] : k+1==n && k+1==m}
{[k,m
,n]->[k+1,n] :
m==k+1 && n>m}{[k,m
,n]->[k+1,m,n]: n>1+k &
& m
>1+k}{[k,m
,n]->[k,m+1,n]: m
<mt-1}
anti-dep:{[k,m,n] -> [k‘] : n=m=k‘}
UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1
GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 )
TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1
TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1{[k]->[k,k+1] : mt >= (k+2)}
{[k]->
[k,n
] : k
< n
< n
t &&
k <
nt-1
}
{[k,n]->
[k,k+1,n]: k < m
t-1}
{[k,m
]->[k
,m,n
] : k
<nt-1
&&
k<n
< nt
{[k,m]->[k,m+1]: m<mt-1}
{[k,m,n]->[n,m
] : n==(k+1) &&
m>n}
{[k,m,n]->[n] : k+1==n && k+1==m}
{[k,m
,n]->[k+1,n] :
m==k+1 && n>m}{[k,m
,n]->[k+1,m,n]: n>1+k &
& m
>1+k}{[k,m
,n]->[k,m+1,n]: m
<mt-1}
anti-dep:{[k,m,n] -> [k‘] : n=m=k‘}
UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1
GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 )
TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1
TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1{[k]->[k,k+1] : mt >= (k+2)}
{[k]->
[k,n
] : k
< n
< n
t &&
k <
nt-1
}
{[k,n]->
[k,k+1,n]: k < m
t-1}
{[k,m
]->[k
,m,n
] : k
<nt-1
&&
k<n
< nt
{[k,m]->[k,m+1]: m<mt-1}
{[k,m,n]->[n,m
] : n==(k+1) &&
m>n}
{[k,m,n]->[n] : k+1==n && k+1==m}
{[k,m
,n]->[k+1,n] :
m==k+1 && n>m}{[k,m
,n]->[k+1,m,n]: n>1+k &
& m
>1+k}{[k,m
,n]->[k,m+1,n]: m
<mt-1}
UNMQR(k,n) k = 0..(( mt < nt ) ? mt-1:nt-1) n = k+1..nt-1
GEQRT(k) k = 0..((mt<nt) ? mt-1:nt-1 )
TSQRT(k,m) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1
TSMQR(k,m,n) k = 0..((mt < nt) ? mt-1:nt-1 ) m = k+1..mt-1 n = k+1..nt-1{[k]->[k,k+1] : mt >= (k+2)}
{[k]->
[k,n
] : k
< n
< n
t &&
k <
nt-1
}
{[k,n]->
[k,k+1,n]: k < m
t-1}
{[k,m
]->[k
,m,n
] : k
<nt-1
&&
k<n
< nt
{[k,m]->[k,m+1]: m<mt-1}
{[k,m,n]->[n,m
] : n==(k+1) &&
m>n}
{[k,m,n]->[n] : k+1==n && k+1==m}
{[k,m
,n]->[k+1,n] :
m==k+1 && n>m}{[k,m
,n]->[k+1,m,n]: n>1+k &
& m
>1+k}{[k,m
,n]->[k,m+1,n]: m
<mt-1}
FinalizingAnti-deps
FinalizingAnti-deps
FinalizingAnti-deps
Transitive Closureis undecidable!
IsDescendantOf()
Current/Future Work
Can we address non-affine codes?
Example: Reduction Operation• Reduction: apply a user
defined operator on each data and store the result in a single location.(Suppose the operator is associative and commutative)
Example: Reduction Operation• Reduction: apply a user
defined operator on each data and store the result in a single location.(Suppose the operator is associative and commutative)
Issue: Non-affine loops lead to non-polyhedral array accessing
for(s = 1; s < N/2; s = 2*s) for(i = 0; i < N-s; i += 2*s) V[i] = op(V[i], V[i+s])
Example: Reduction Operation reduce(l, p)
: V(p)l = 1 .. depth+1p = 0 .. (MT / (1<<l))
RW A <- (1 == l) ? V(2*p) : A reduce( l-1, 2*p ) -> ((depth+1) == l) ? V(0) -> (0 == (p%2))? A reduce(l+1, p/2) : B reduce(l+1, p/2)READ B <- ((p*(1<<l) + (1<<(l-1))) > MT) ? V(0) <- (1 == l) ? V(2*p+1) <- (1 != l) ? A reduce( l-1, p*2+1 )BODY operator(A, B);END
0
1
2
3Current Solution: Hand-writing of the data dependency using the intermediate Data Flow representation
for(k=0; k<NT; k++){ for (i = 1; i < NT/2; i*=2 ) { for (j = 0; j < NT-i; j+=2*i) { Task_Rdc: A[k][j] += A[k][j+i]; } }}
Handling Reduction
for(k=0; k<NT; k++){ for (i = 1; i < NT/2; i*=2 ) { for (j = 0; j < NT-i; j+=2*i) { Task_Rdc: A[k][j] += A[k][j+i]; } }} Loop Canonicalization
for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] += A[k][j+i]; } }}
Handling Reduction
for(k=0; k<NT; k++){ for (i = 1; i < NT/2; i*=2 ) { for (j = 0; j < NT-i; j+=2*i) { Task_Rdc: A[k][j] += A[k][j+i]; } }} Loop Canonicalization
for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] += A[k][j+i]; } }}
Handling Reduction
for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } }}
Handling Reduction
for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } }}
Handling Reduction
For an edge to exist it must be:j = j'+i' => jj' = (jj * 2**ii) / (2**ii') -
1/2OR
j = j'=> jj' = (jj * 2**ii) / (2**ii')
for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } }}
Handling Reduction
For an edge to exist it must be:j = j'+i' => jj' = (jj * 2**ii) / (2**ii') -
1/2OR
j = j'=> jj' = (jj * 2**ii) / (2**ii')Hard to solve statically
for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } }}
Handling Reduction
For an edge to exist it must be:j = j'+i' => jj' = (jj * 2**ii) / (2**ii') -
1/2OR
j = j'=> jj' = (jj * 2**ii) / (2**ii')
But a given (source) taskhas fixed {ii, jj}
for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } }}
Handling Reduction
jj' = (jj * 2**ii) / (2**ii') – 1/2
jj' = (jj * 2**ii) / (2**ii')
But a given (source) taskhas fixed {ii, jj}
constant
constant
for (k=0; k<NT; k++) { for (ii = log(1); ii < log(NT/2); ii++) { i = 2**ii for (jj = 0; jj < (NT-i)/(2*i); jj++) { j = jj*2*i; Task_Rdc: A[k][j] = A[k][j] + A[k][j+i]; } }}
Handling Reduction
jj' = C / (2**ii') – 1/2
jj' = C / (2**ii')But a given (source) task
has fixed {ii, jj}
Handling Reduction
jj' = C / (2**ii') – 1/2
jj' = C / (2**ii')
Finding a destination task means finding integers that satisfy either equation.
Run-time upper bound for cost: log(NT/2)
Handling “Arbitrary” Code?for( i=a(); i<b(); i = c(i) ){ A[d(i)] <- Task( A[e(i)] );}
Handling “Arbitrary” Code?for( i=a(); i<b(); i = c(i) ){ A[d(i)] <- Task( A[e(i)] );}
for( ii=LB; ii<UB; ii++ ){ i = f(ii); A[d(i)] <- Task( A[e(i)] );}
Loop Canonicalization
Handling “Arbitrary” Code?
d(i) = e(i’) : f(LB) < i < i’ < f(UB)
for( ii=LB; ii<UB; ii++ ){ i = f(ii); A[d(i)] <- Task( A[e(i)] );}
Conclusions & Future Work• Static compiler analysis makes coding in DAGuE easier• DAGuE Compiler handles affine codes w/ Quark notation (StarPU
wip)• Anti-dependencies are also handled (60% of the time it works all
the time)• Anti-deps finalization algorithm can be used for
isDescendantOf()
• Non-affine codes can be future input for the compiler• Regular Control Flow• Loops that can be canonicalized
• Evaluating shortcut answers quickly will be the biggest challenge
Parallel machines are hard to program and we should make them even harder - to keep the riff-raff off them. -- Gary Montry
There are 3 rules to follow when parallelizing large codes. Unfortunately, no one knows what these rules are. -- W. Somerset Maugham, Gary Montry