Page 1:

Using the Iteration Space Visualizer in Loop Parallelization

Yijun YU

http://winpar.elis.rug.ac.be/ppt/isv

Page 2:

Overview

ISV, a 3D Iteration Space Visualizer: view the dependences in the iteration space
iteration: one instance of the loop body
space: the grid of all index values
Detect the parallelism
Estimate the speedup
Derive a loop transformation
Find statement-level parallelism
Future development

Page 3:

1. Dependence

DO I = 1,3

A(I) = A(I-1)

ENDDO

DOALL I = 1,3

A(I) = A(I-1)

ENDDO

Execution trace (shared memory initially A(0), A(1), A(2), A(3) = 0, 1, 2, 3):

Sequential program order:
A(1) = A(0)  ->  A(1), A(2), A(3) = 0, 2, 3
A(2) = A(1)  ->  A(1), A(2), A(3) = 0, 0, 3
A(3) = A(2)  ->  A(1), A(2), A(3) = 0, 0, 0

One possible DOALL order:
A(2) = A(1)  ->  A(1), A(2), A(3) = 1, 1, 3
A(1) = A(0)  ->  A(1), A(2), A(3) = 0, 1, 3
A(3) = A(2)  ->  A(1), A(2), A(3) = 0, 1, 1

The DOALL order violates the flow dependence and yields a different result.
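A minimal runnable sketch (not part of the original slides) that reproduces the sequential trace, assuming the array starts as A(0..3) = 0, 1, 2, 3:

      PROGRAM FLOWDEP
      INTEGER A(0:3), I
C     Initial shared-memory contents assumed in the trace above
      DATA A /0, 1, 2, 3/
C     Sequential DO loop: iteration I reads A(I-1), which iteration I-1
C     has just written, so the value of A(0) propagates to every element.
      DO I = 1, 3
        A(I) = A(I-1)
      ENDDO
C     Prints 0 0 0; the DOALL order shown above would leave A(1..3) = 0, 1, 1.
      PRINT *, A(1), A(2), A(3)
      END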

Page 4:

1.1 Example 1

ISV directive: visualize

Page 5:

1.2 Visualize the Dependence

A dependence is visualized in an iteration space dependence graph:
Node: an iteration
Edge: a dependence order between two nodes (e.g. a flow dependence)
Color: the dependence type: FLOW (write then read), ANTI (read then write), OUTPUT (write then write)
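A tiny self-contained illustration of the three dependence types (my own example, not from the slides):

      PROGRAM DEPKINDS
      INTEGER, PARAMETER :: N = 5
      INTEGER A(0:N), B(0:N), C(0:N), I
      A = 0
      B = 1
      C = 0
      DO I = 2, N
        A(I) = B(I) + 1   ! S1: write A(I), read B(I)
        C(I) = A(I-1)     ! S2: read A(I-1): FLOW dep. on S1
        B(I-1) = 2*C(I)   ! S3: write B(I-1): ANTI dep. with S1
        A(I) = C(I)       ! S4: write A(I): OUTPUT dep. with S1
      ENDDO
      PRINT *, A(1:N)
      END

S2 reads the value S1 wrote in the previous iteration (write then read), S3 writes a location S1 read one iteration earlier (read then write), and S1 and S4 write the same location within one iteration (write then write).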

Page 6:

1.3 Parallelism?

Stepwise view of the sequential execution.
No parallelism is found here. However, many programs do have parallelism…

Page 7:

2. Potential Parallelism

Time(sequential) = number of iterations
Dataflow: each iteration is executed as soon as its data are ready
Time(dataflow) = number of iterations on the longest critical path
The potential parallelism is expressed by speedup = Time(sequential) / Time(dataflow)
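As a formula, using the 200-iteration case of Example 2 (later slides) as an illustration:

\[
\text{speedup} = \frac{T_{\text{sequential}}}{T_{\text{dataflow}}}
= \frac{\#\,\text{iterations}}{\text{length of the longest dependence chain}},
\qquad \text{e.g. } \frac{200}{15} \approx 13.3 .
\]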

Page 8:

2.1 Example 2

Page 9:

Diophantine equations + loop bounds (polytope) = iteration space dependences
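An illustration of how such an equation arises (my example, not from the slides): a flow dependence between a write to A(2*I) and a read of A(3*J+1) exists exactly when the Diophantine equation

\[
2i = 3j + 1, \qquad 1 \le i, j \le N ,
\]

has an integer solution inside the loop bounds, i.e. when the dependence polytope contains an integer point.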

Page 10:

2.2 Irregular dependence

Dependences have non-uniform distances.
Parallelism analysis: 200 iterations over 15 dataflow steps.
Speedup: 13.3
Problem: how to exploit it?

Page 11:

3. Visualize parallelism

Find answers to these questions:
What is the dependence pattern?
Is there a parallel loop? (How to find it?)
What is the maximal parallelism? (How to exploit it?)
Is the load of the parallel tasks balanced?

Page 12:

3.1 Example 3

Page 13:

3.2 3D Space

Page 14:

3.3 Loop parallelizable?

The I, J, K loops span a 3D iteration space: 32 iterations.
Simulate the sequential execution.
Which loop can be parallel?

Page 15:

3.4 Loop parallelization

Interactively try the parallelization: interactively check whether loop I is parallel.
The blinking dependence edges prevent the parallelization of the given loop I.

Page 16:

3.5 Parallel execution

Let ISV find the correct parallelization: automatically check for a parallel loop.
Simulate the parallel execution: it takes 16 time steps.

Page 17:

3.6 Dataflow execution

Sequential execution takes 32 time steps.
Simulate the dataflow execution: it takes only 4 time steps.
Potential speedup = 8.

Page 18:

3.7 Graph partitioning

Dataflow speedup = 8
Iterate through the partitions: the connected components of the dependence graph.
All the partitions are load balanced.

Page 19:

4. Loop Transformation

Potential parallelism -> transformation -> real parallelism

Page 20:

4.1 Example 4

Page 21:

4.2 The iteration space

Sequentially: 25 iterations

Page 22:

4.3 Loop parallelizable?

Check loop I; check loop J.

Page 23:

4.4 Dataflow execution

In total 9 steps. Potential speedup: 25/9 = 2.78
Wavefront effect: all iterations on the same wave lie on the same line.
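A short justification of the step count (assuming the usual wavefront dependence pattern on an N x N space, here N = 5): iteration (i, j) can execute at dataflow step

\[
t(i,j) = i + j - 1, \qquad 1 \le i, j \le N ,
\]

so all iterations on the same anti-diagonal i + j = const run in the same step, and the last step is t(N,N) = 2N - 1 = 9.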

Page 24:

4.5 Zoom-in on the I-space

Page 25:

4.6 Speedup vs program size

Zoom-in previews the parallelism in part of a loop without modifying the program.
Executing the program for different sizes N estimates a speedup of N^2/(2N-1):

Loop size   # iterations   # dataflow steps   Speedup
2           4              3                  1.33
3           9              5                  1.80
4           16             7                  2.29
5           25             9                  2.78
N           N^2            2N-1               ~ N/2

Page 26:

4.7 How to obtain the potential parallelism

Here we already have these metrics:
Sequential time steps = N^2
Dataflow time steps = 2N-1
Potential speedup = N^2/(2N-1)
How to obtain the potential speedup of a loop? Transformation.
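Spelled out with the numbers from the slides:

\[
T_{\text{sequential}} = N^2, \qquad T_{\text{dataflow}} = 2N-1, \qquad
\text{speedup} = \frac{N^2}{2N-1} \approx \frac{N}{2} \text{ for large } N .
\]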

Page 27:

4.8 Unimodular transformation (UT)

A unimodular matrix is a square integer matrix with unit determinant. It is obtained from the identity matrix by three kinds of elementary transformations: reversal, interchange, and skewing.

The new loop execution order is determined by the transformed index; the iteration space keeps its unit step size.

Finding a suitable UT reorders the iterations so that the new loop nest has a parallel loop.

\[
i' = U\,i
\]

where i is the old loop index vector, i' the new loop index vector, and U the unimodular matrix. Two-dimensional examples:

\[
\text{reversal: } \begin{pmatrix} -1 & 0 \\ 0 & 1 \end{pmatrix}, \qquad
\text{interchange: } \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}, \qquad
\text{skewing: } \begin{pmatrix} 1 & 0 \\ 2 & 1 \end{pmatrix}
\]
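A worked illustration (my example, assuming Example 4 has the usual wavefront distance vectors (1,0) and (0,1), which is consistent with the 2N-1 dataflow steps above): the skewing matrix below maps both distances to vectors with a positive first component, so after the transformation the outer loop carries every dependence and the inner loop is parallel.

\[
U = \begin{pmatrix} 1 & 1 \\ 0 & 1 \end{pmatrix}, \qquad
U \begin{pmatrix} 1 \\ 0 \end{pmatrix} = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \qquad
U \begin{pmatrix} 0 \\ 1 \end{pmatrix} = \begin{pmatrix} 1 \\ 1 \end{pmatrix} .
\]

The transformed outer index i' = i + j is exactly the wavefront number, so each value of i' forms one parallel step, giving the 2N-1 steps observed in the dataflow simulation.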

Page 28:

4.9 Hyperplane transformation

Interactively define a hyper-plane.
Observe that the plane iteration matches the dataflow simulation: plane = dataflow.
Based on the plane, ISV calculates a unimodular transformation.

Page 29:

4.10 The derived UT

The transformed iteration space and the generated loop.

Page 30:

4.11 Verify the UT

ISV checks whether the transformation is valid.
Observe that the parallel loop execution in the transformed loop matches the plane execution: parallel = plane.

Page 31:

5. Statement-level parallelism

Unimodular transformations work at the iteration level.
The statement dependences within the loop body are hidden in the iteration space graph.
How to exploit parallelism at the statement level? Map statements to iterations.

Page 32:

5.1 Example 5

SSV: statement space visualization

Page 33:

5.2 Iteration-level parallelism

The iteration space is 2D.
There are N^2 = 16 iterations.
The dataflow execution has 2N-1 = 7 time steps.
The potential speedup is 16/7 = 2.29.

Page 34:

5.3 Parallelism in statements

The (statement) iteration space is 3D.
There are 2N^2 = 32 statement instances.
The dataflow execution still has 2N-1 = 7 time steps.
The potential speedup is 32/7 = 4.57.

Page 35:

5.4 Comparison

Statement-level analysis here doubles the potential speedup obtained at the iteration level.

Loop size   # iterations   # dataflow steps   # statements   # dataflow steps
N           N^2            2N-1               2N^2           2N-1

Page 36:

5.5 Define the partition planes

(Figure: the partitions and their hyper-planes.)

Page 37:

What is validity?

Show the execution order on top of the dependence arrows (for one plane, or all together, depending on the density of the slide).

Page 38:

5.6 Invalid UT

The invalid unimodular transformation derived from the hyper-plane is rejected by ISV.
Alternatively, ISV calculates the unimodular transformation from the dependence distance vectors available in the dependence graph.

Page 39:

6. Pseudo distance method

The pseudo distance method:
Extract base vectors from the dependent iterations.
Examine whether the base vectors generate all the distances.
Calculate the unimodular transformation from the base vectors.

(Figure: the base vectors and the unimodular matrix.)
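The generating condition, spelled out (my formulation, not from the slides): base vectors b_1, ..., b_k generate the observed distances if every distance d is an integer combination

\[
d = \sum_{m=1}^{k} \lambda_m\, b_m , \qquad \lambda_m \in \mathbb{Z} ,
\]

so even non-uniform distances all lie on the uniform lattice spanned by the base vectors (compare the next slide).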

Page 40:

Another way to find parallelism automatically

The iteration space is a grid; the non-uniform dependences are members of a uniform dependence grid with unknown base vectors.
Finding these base vectors allows us to extend existing parallelization to the non-uniform case.

Page 41:

6.1 Dependence distance

The dependence distance vectors: (1,0,-1) and (0,1,1).

Page 42:

6.2 The Transformation

The transforming matrix discovered by the pseudo distance method:

\[
U = \begin{pmatrix} 1 & 1 & 0 \\ -1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}
\]

The distance vectors are transformed as (1,0,-1) -> (0,1,0) and (0,1,1) -> (0,0,1).
The dependent iterations now have the same first index, which implies that the outermost loop is parallel.
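As a check (assuming ISV applies the matrix to distance vectors written as row vectors, which is the convention that reproduces the numbers on this slide):

\[
(1,0,-1)\begin{pmatrix} 1 & 1 & 0 \\ -1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} = (0,1,0),
\qquad
(0,1,1)\begin{pmatrix} 1 & 1 & 0 \\ -1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix} = (0,0,1) .
\]

Both transformed distances have a zero first component, so no dependence is carried by the outermost transformed loop.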

Page 43:

6.3 Compare the UT matrices

The transforming matrix discovered by the pseudo distance method:

\[
\begin{pmatrix} 1 & 1 & 0 \\ -1 & 0 & 1 \\ 1 & 0 & 0 \end{pmatrix}
\]

An invalid transforming matrix discovered by the hyper-plane method:

\[
\begin{pmatrix} 1 & 0 & 0 \\ -1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix}
\]

The same first column means that the transformed outermost loops have the same index.

Page 44:

6.4 The transformed space

The outermost loop is parallel

There are 8 parallel tasks

The load of tasks is not balanced

The longest task takes 7 time steps
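Assuming these are the 2N^2 = 32 statement instances from slide 34 (the 3D distance vectors in this section are consistent with that), the achieved speedup is

\[
\frac{32 \text{ sequential steps}}{7 \text{ steps of the longest task}} \approx 4.57 ,
\]

which matches the dataflow potential of slide 34 even though the 8 tasks are not load balanced.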

Page 45:

7. Non-perfectly nested loop

What is it? The unimodular transformations only work for perfectly nested loops.
For a non-perfectly nested loop, the iteration space is constructed with extended indices:
an N-fold non-perfectly nested loop becomes an (N+1)-fold perfectly nested loop.

Page 46:

7.1 Perfectly nested loop?

Non-perfectly nested loop:

DO I1 = 1,3

A(I1) = A(I1-1)

DO I2 = 1,4

B(I1,I2) = B(I1-1,I2)+B(I1,I2-1)

ENDDO

ENDDO

Perfectly nested loop:

DO I1 = 1,3

DO I2 = 1,5

DO I3 = 0,1

IF (I2.EQ.1 .AND. I3.EQ.0) THEN
  A(I1) = A(I1-1)
ELSE IF (I3.EQ.1) THEN
  B(I1-1,I2) = B(I1-2,I2)+B(I1-1,I2-1)
ENDIF

ENDDO

ENDDO

ENDDO

Page 47:

7.2 Exploit parallelism with UT

Page 48:

8. Applications

Programs                Category    Depth   Form         Pattern       Transformation
Example 1               Tutorial    1       Perfect      Uniform       N/A
Example 2               Tutorial    2       Perfect      Non-uniform   N/A
Example 3               Tutorial    3       Perfect      Uniform       Wavefront UT
Example 4               Tutorial    2       Perfect      Uniform       Wavefront UT
Example 5               Tutorial    2+1     Perfect      Uniform       Stmt Partitioning UT
Example 6               Tutorial    2+1     Non-perfect  Uniform       Wavefront UT
Matrix multiplication   Algorithm   3       Perfect      Uniform       Parallelization
Gauss-Jordan            Algorithm   3       Perfect      Non-uniform   Parallelization
FFT                     Algorithm   3       Perfect      Non-uniform   Parallelization
Cholesky                Benchmark   4       Non-perfect  Non-uniform   Partitioning UT
TOMCATV                 Benchmark   3       Non-perfect  Uniform       Parallelization
Flow3D                  CFD App.    3       Perfect      Uniform       Wavefront UT

Page 49:

9. Future considerations

Weighted dependence graph
More semantics on data locality: data space graph, data communication graph, data reuse iteration space graph
More loop transformations: affine (statement) iteration space mappings, automatic statement distribution, integration with the Omega library