Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min
![Page 1: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/1.jpg)
The LRPD Test: Speculative Run-Time Parallelization of Loops with Privatization and Reduction Parallelization
Lawrence Rauchwerger and David A. Padua
PLDI 1995
Presented by Seung-Jai Min
![Page 2: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/2.jpg)
Introduction
• Motivation: Current parallelizing compilers cannot handle access patterns that are complex or insufficiently defined at compile time (input-dependent or run-time-dependent conditions, subscripted subscripts, etc.).
• LRPD Test
- speculatively executes the loop as a doall
- applies a fully parallel test for cross-iteration data dependences
- if the test fails, the loop is re-executed serially
![Page 3: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/3.jpg)
Inspector-Executor Method
• Inspector/Executor
- Inspector: extract and analyze the memory access pattern
- Executor: transform the loop if necessary and execute it
• Disadvantages
- cost and side effects, when the address computation of the array under test depends on the actual data computation
- parallel execution of the inspector loop is not always possible
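The inspector/executor split can be sketched as follows. This is a minimal illustration, not the method of any particular system: the loop `A[L[i]] = A[K[i]] + C[i]`, the function names `inspector` and `executor`, and the use of reversed iteration order as a stand-in for parallel execution are all assumptions made for the example.

```python
def inspector(K, L):
    """Inspect the index arrays only; return True if the loop is a doall."""
    writer = {}                          # element -> iteration that writes it
    for i, e in enumerate(L):
        if e in writer:
            return False                 # two iterations write e (output dependence)
        writer[e] = i
    for i, e in enumerate(K):
        if e in writer and writer[e] != i:
            return False                 # e is read and written by different iterations
    return True


def executor(A, K, L, C, doall):
    """Run A[L[i]] = A[K[i]] + C[i]; reversed order stands in for parallel execution."""
    n = len(K)
    order = reversed(range(n)) if doall else range(n)
    for i in order:
        A[L[i]] = A[K[i]] + C[i]
    return A


A = [1, 2, 3, 0, 0, 0]
K, L, C = [0, 1, 2], [3, 4, 5], [10, 20, 30]
result = executor(list(A), K, L, C, inspector(K, L))
```

Here the inspector is cheap because `K` and `L` are known before the loop runs. If `K` or `L` were computed from values of `A` produced inside the loop, the inspector would have to repeat the loop's data computation, which is the cost noted above.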
![Page 4: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/4.jpg)
Speculative run-time parallelization
[Flow diagram. Compile time (Polaris): static analysis, run-time transformations. Run time: heuristic, reorder, checkpoint, speculative parallel execution, test; on pass the results are kept, on fail the state is restored and the loop runs as a sequential execution.]
![Page 5: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/5.jpg)
Hazards (during the speculative execution)
• Exceptions
- invalidate the parallel execution
- clear the exception flag, restore the values of any altered variables, and execute serially.
• Cross-iteration dependencies in the loop
- LRPD Test
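The checkpoint, speculate, test, restore sequence, including the handling of exceptions, can be sketched as a small driver. This is an illustration under stated assumptions: `run_speculatively`, `body`, and `passes_test` are hypothetical names, reversed iteration order stands in for a parallel doall, and the dependence test is passed in as a callback rather than implemented here.

```python
import copy


def run_speculatively(A, n, body, passes_test):
    """body(A, i) runs one iteration; passes_test() is evaluated after the speculative run."""
    checkpoint = copy.deepcopy(A)        # save the state the loop may alter
    try:
        for i in reversed(range(n)):     # stand-in for a parallel doall (non-serial order)
            body(A, i)
        ok = passes_test()
    except ArithmeticError:              # an exception invalidates the speculative run
        ok = False
    if ok:
        return "parallel"
    A[:] = checkpoint                    # restore the altered variables
    for i in range(n):                   # re-execute serially
        body(A, i)
    return "serial"


def chain(A, i):
    A[i + 1] = A[i] + 1                  # each iteration depends on the previous one


data = [1, 0, 0, 0]
mode = run_speculatively(data, 3, chain, lambda: False)
```

If the serial re-execution raises as well, the exception propagates, since it is then a genuine property of the program and not an artifact of speculation.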
![Page 6: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/6.jpg)
LPD Test (The Lazy Privatizing Doall Test)
1. Marking Phase
- For each shared array A[1:s], use read, write and not-private shadow arrays Ar[1:s], Aw[1:s], and Anp[1:s].
(a) Uses: if this array element has not been modified in this iteration, then set the corresponding elements in Ar and Anp.
(b) Defs: set the corresponding element in Aw, and clear it in Ar if set.
(c) twi(A): count the total number of write accesses to A that are marked in iteration i.
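The marking rules (a) to (c) can be sketched as a small class. This is a serial simulation, not the parallel implementation: the class name `ShadowArrays`, its method names, 0-based element indices, and the reading of "clear in Ar if set" as "if set by the current iteration" are assumptions of this sketch.

```python
class ShadowArrays:
    """Shadow arrays for one shared array A, marked one iteration at a time."""

    def __init__(self, size):
        self.Aw = [False] * size
        self.Ar = [False] * size
        self.Anp = [False] * size
        self.tw = 0                      # running sum of tw_i over all iterations
        self._written = set()            # elements written in the current iteration
        self._ar_set_here = set()        # elements whose Ar mark the current iteration set

    def begin_iteration(self):
        self._written.clear()
        self._ar_set_here.clear()

    def mark_read(self, e):
        if e not in self._written:       # (a) not modified in this iteration
            if not self.Ar[e]:
                self.Ar[e] = True
                self._ar_set_here.add(e)
            self.Anp[e] = True           # read before any write in this iteration

    def mark_write(self, e):
        if e not in self._written:       # (c) count each element once per iteration
            self._written.add(e)
            self.tw += 1
        self.Aw[e] = True                # (b) record the definition
        if e in self._ar_set_here:       # (b) undo an Ar mark made by this iteration
            self.Ar[e] = False
            self._ar_set_here.discard(e)


s = ShadowArrays(4)
s.begin_iteration(); s.mark_read(0); s.mark_write(0)                    # read, then write
s.begin_iteration(); s.mark_write(1); s.mark_read(1); s.mark_write(1)   # write first
```

Element 0 ends up in Aw and Anp (it was read before being written), while element 1 ends up only in Aw, because its read was covered by an earlier write in the same iteration.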
![Page 7: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/7.jpg)
LPD Test (The Lazy Privatizing Doall Test)
2. Analysis Phase (performed after the speculative execution)
(a) Compute
(i) tw(A) = Σ_i twi(A)
(ii) tm(A) = sum(Aw[1:s])
(iii) tm(A) != tw(A): cross-iteration output dependence
(b) If any(Aw[:] & Ar[:]), then the loop is not a doall and the phase ends: a def and a use of the same location occurred in different iterations (flow/anti dependence).
![Page 8: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/8.jpg)
LPD Test (The Lazy Privatizing Doall Test)
2. Analysis Phase (performed after the speculative execution)
(c) Else if tw(A) == tm(A), then the loop is a doall (without privatizing the array A).
(d) Else if any(Aw[:] & Anp[:]), then the array A is not privatizable (there is at least one iteration in which some element of A was used before being modified), so the loop is not a doall.
(e) Otherwise, the loop was made into a doall by privatizing the shared array A.
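Steps (a) to (e) of the analysis phase amount to a short decision chain over the shadow arrays. The sketch below assumes the shadow arrays and tw(A) have already been produced by the marking phase; the function name `lpd_analysis` and the verdict strings are illustrative.

```python
def lpd_analysis(Aw, Ar, Anp, tw):
    """Return the verdict of the LPD analysis phase for one shared array A."""
    tm = sum(Aw)                                     # (a)(ii) number of marked elements
    if any(w and r for w, r in zip(Aw, Ar)):         # (b) def and use in different iterations
        return "not doall: flow/anti dependence"
    if tw == tm:                                     # (c) no element written twice
        return "doall"
    if any(w and p for w, p in zip(Aw, Anp)):        # (d) used before modified
        return "not doall: A not privatizable"
    return "doall with A privatized"                 # (e)


# Shadow-array values as listed in the PD and LPD example tables of this deck.
pd_verdict = lpd_analysis([0, 1, 0, 1], [1, 0, 1, 0], [1, 1, 1, 1], tw=3)
lpd_verdict = lpd_analysis([0, 1, 0, 1], [1, 0, 1, 0], [1, 0, 1, 0], tw=3)
```

With those inputs the first call is rejected at step (d) and the second passes at step (e).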
![Page 9: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/9.jpg)
Dynamic dead reference elimination
• To avoid introducing false dependences, the marking of the read and not-private shadow arrays, Ar and Anp, can be postponed until the value of the shared variable is actually used.
• Definition: A dynamic dead read reference in a loop is a read access of a shared variable that does not contribute to the computation of any other shared variable which is live at loop end.
• The "lazy" marking employed by the LPD test, i.e., the dynamic dead reference elimination technique, allows it to qualify more loops than the PD test.
![Page 10: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/10.jpg)
PD Test
Source loop:
Do i=1, 5
  z = A(K(i))
  if (B1(i).eq..true.) then
    A(L(i)) = z + C(i)
  endif
enddo
Loop with PD test marking:
Do i=1, 5
  markread(K(i))
  z = A(K(i))
  if (B1(i).eq..true.) then
    markwrite(L(i))
    A(L(i)) = z + C(i)
  endif
enddo
Inputs:
B1(1:5) = (1 0 1 0 1)
K(1:5) = (1 2 3 4 1)
L(1:5) = (2 2 4 4 2)
PD test shadow arrays (table partially filled on this slide):
Element          1 2 3 4   tw  tm
Aw
Ar               1 1 1 1
Anp              1 1 1 1
Aw(:) & Ar(:)
Aw(:) & Anp(:)
![Page 11: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/11.jpg)
PD Test
Source loop:
Do i=1, 5
  z = A(K(i))
  if (B1(i).eq..true.) then
    A(L(i)) = z + C(i)
  endif
enddo
Loop with PD test marking:
Do i=1, 5
  markread(K(i))
  z = A(K(i))
  if (B1(i).eq..true.) then
    markwrite(L(i))
    A(L(i)) = z + C(i)
  endif
enddo
Inputs:
B1(1:5) = (1 0 1 0 1)
K(1:5) = (1 2 3 4 1)
L(1:5) = (2 2 4 4 2)
PD test shadow arrays:
Element          1 2 3 4   tw  tm
Aw               0 1 0 1   3   2
Ar               1 0 1 0
Anp              1 1 1 1
Aw(:) & Ar(:)    0 0 0 0
Aw(:) & Anp(:)   0 1 0 1
![Page 12: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/12.jpg)
LPD Test
Source loop:
Do i=1, 5
  z = A(K(i))
  if (B1(i).eq..true.) then
    A(L(i)) = z + C(i)
  endif
enddo
Loop with LPD test marking (the read is marked only where z is actually used):
Do i=1, 5
  z = A(K(i))
  if (B1(i).eq..true.) then
    markread(K(i))
    markwrite(L(i))
    A(L(i)) = z + C(i)
  endif
enddo
Inputs:
B1(1:5) = (1 0 1 0 1)
K(1:5) = (1 2 3 4 1)
L(1:5) = (2 2 4 4 2)
LPD test shadow arrays:
Element          1 2 3 4   tw  tm
Aw               0 1 0 1   3   2
Ar               1 0 1 0
Anp              1 0 1 0
Aw(:) & Ar(:)    0 0 0 0
Aw(:) & Anp(:)   0 0 0 0
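The example loop can be replayed in a few lines to compare eager (PD) and lazy (LPD) marking. This is a sketch under assumptions: the function names `shadow_marks` and `is_doall` are made up for the illustration, and marks are computed per iteration (an element counts toward Ar if some iteration reads it without also writing it). Under that reading the eager variant's Ar contents need not match the PD table above, although the verdict, rejection of the loop, is the same.

```python
B1 = [1, 0, 1, 0, 1]
K = [1, 2, 3, 4, 1]
L = [2, 2, 4, 4, 2]


def shadow_marks(lazy, size=4):
    """Replay the marking of the example loop; elements are 1-based as in the slides."""
    Aw, Ar, Anp = [0] * size, [0] * size, [0] * size
    tw = 0
    for k, l, b in zip(K, L, B1):
        read_marked = bool(b) if lazy else True   # lazy: mark the read only when z is used
        if read_marked:
            Anp[k - 1] = 1                        # the read precedes any write in the iteration
            if not (b and l == k):                # not also written in this iteration
                Ar[k - 1] = 1
        if b:
            Aw[l - 1] = 1
            tw += 1                               # one marked write in this iteration
    return Aw, Ar, Anp, tw


def is_doall(Aw, Ar, Anp, tw):
    """Analysis phase: True if the loop qualifies, with or without privatization."""
    if any(w & r for w, r in zip(Aw, Ar)):
        return False
    if tw == sum(Aw):
        return True
    return not any(w & p for w, p in zip(Aw, Anp))


lazy_marks = shadow_marks(lazy=True)
eager_ok = is_doall(*shadow_marks(lazy=False))
lazy_ok = is_doall(*lazy_marks)
```

The reads in iterations 2 and 4 are dynamically dead because `B1` is false there, so lazy marking never records them and the loop qualifies once A is privatized.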
![Page 13: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/13.jpg)
Run-time Reduction Parallelization
• Recognition of reduction variable + Parallelizing reduction variable
• Pattern matching identification
- The data dependence (DD) test needed to qualify a statement as a reduction statement cannot be performed statically in the presence of input-dependent access patterns.
- Syntactic pattern matching cannot identify all potential reduction variables (e.g. subscripted subscripts)
![Page 14: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/14.jpg)
The LRPD Test : Extending the LPD Test for Reduction Validation
(a) Source program
do i = 1, n
S1: A(K(i)) = ………
S2: ……… = A(L(i))
S3: A(R(i)) = A(R(i)) + exp()
enddo
(b) Transformed program
doall i = 1, n
    markwrite(K(i))
    markredux(K(i))
S1: A(K(i)) = ………
    markread(L(i))
    markredux(L(i))
S2: ……… = A(L(i))
    markwrite(R(i))
S3: A(R(i)) = A(R(i)) + exp()
enddo
The markredux operation sets the corresponding element of the shadow array Anx to true.
Anx is used only to check that the reduction variable is not accessed outside the single reduction statement.
![Page 15: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/15.jpg)
LRPD Test
• Modified Analysis Pass
- 2(d’) Else if any(Aw[:] & Anp[:] & Anx[:]), then some element of A written in the loop is neither a reduction variable nor privatizable. Thus, the loop is not a doall and the phase ends.
- 2(e’) Otherwise, the loop was made into a doall by parallelizing reductions and privatizing.
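The modified analysis differs from the LPD analysis only in its last two steps, which can be sketched as below. The function name `lrpd_analysis`, the verdict strings, and the shadow-array contents in the usage lines are hypothetical; the sketch covers the analysis only and assumes the marking (including markredux) has already been done.

```python
def lrpd_analysis(Aw, Ar, Anp, Anx, tw):
    """LPD analysis with steps (d) and (e) replaced by (d') and (e')."""
    tm = sum(Aw)
    if any(w and r for w, r in zip(Aw, Ar)):                  # (b) flow/anti dependence
        return "not doall"
    if tw == tm:                                              # (c) plain doall
        return "doall"
    if any(w and p and x for w, p, x in zip(Aw, Anp, Anx)):   # (d') neither reduction nor private
        return "not doall"
    return "doall with reduction parallelization and privatization"   # (e')


# Hypothetical contents: element 0 is never marked in Anx, element 1 is never marked in Anp.
Aw, Ar, Anp = [1, 1, 0, 0], [0, 0, 1, 0], [1, 0, 1, 0]
passing = lrpd_analysis(Aw, Ar, Anp, Anx=[0, 1, 1, 0], tw=5)
failing = lrpd_analysis(Aw, Ar, Anp, Anx=[1, 1, 1, 0], tw=5)
```

In the second call element 0 is marked in Aw, Anp, and Anx at once, so it is rejected at step (d').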
![Page 16: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/16.jpg)
Performance (1)
![Page 17: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/17.jpg)
Performance (2)
![Page 18: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/18.jpg)
Experimental Results Summary
![Page 19: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/19.jpg)
Other Run-time Parallelization Papers
• “Techniques for Speculative Run-Time Parallelization of Loops”, Manish Gupta and Rahul Nim, SC’98.
- More efficient run-time array privatization
- No rolling back of the entire loop computation; the loop is completed by generating synchronization
- Early hazard detection
![Page 20: Lawrence Rauchwerger and David A. Padua PLDI 1995 Presented by Seung-Jai Min](https://reader036.fdocuments.net/reader036/viewer/2022062520/568158d6550346895dc61e14/html5/thumbnails/20.jpg)
Other Run-time Parallelization Papers
• “Hardware for Speculative Run-Time Parallelization in Distributed Shared-Memory Multiprocessors”, Ye Zhang, Lawrence Rauchwerger, and Josep Torrellas. HPCA 1998.
- Run-time parallelization techniques are often computationally expensive and not general enough.
- Idea : execute the code in parallel speculatively and let extended cache coherence protocol hardware detect any dependence violations.
- Performance: speedup of 7.3 on 16 processors, and 50% faster than the software-only approach.