Loop parallelization & pipelining

Transcript of Loop parallelization & pipelining

Page 1: Loop parallelization & pipelining

Loop Parallelization & Pipelining

AND

Trends in Parallel System & Forms of

Parallelism

By

Jagrat Gupta

M.tech[CSE] 1st Year

(Madhav Institute of Technology and Science, Gwalior-467005)

Page 2: Loop parallelization & pipelining

Loop Parallelization & Pipelining

This topic describes the theory and application of loop transformations for vectorization and parallelization purposes.

Loop Transformation Theory:-

Parallelizing loop nests is one of the most fundamental program optimizations demanded in a vectorizing and parallelizing compiler.

The main goal is to maximize the degree of parallelism or data locality in a loop nest. It also supports efficient use of the memory hierarchy on a parallel machine.

Page 3: Loop parallelization & pipelining

Elementary Transformations:-

Permutation:- Simply interchange the i and j loops (loop interchange), as in the example below.

Reversal:- Reversal of the ith loop is represented by the identity matrix with the ith element on the diagonal equal to -1.

Do i=1,N
  Do j=1,N
    A(j)=A(j)+C(i,j)
  End Do
End Do

Before Transformation

Do j=1,N
  Do i=1,N
    A(j)=A(j)+C(i,j)
  End Do
End Do

After Transformation

Page 4: Loop parallelization & pipelining

Do i=1,N
  Do j=1,N
    A(i,j)=A(i-1,j+1)
  End Do
End Do

Before Transformation

Do i=1,N
  Do j=-N,-1
    A(i,-j)=A(i-1,-j+1)
  End Do
End Do

After Transformation

The reversal of the inner loop is represented by the matrix:

| 1  0 |
| 0 -1 |

Page 5: Loop parallelization & pipelining

Skewing:- Skewing loop Ij by an integer factor f with respect to loop Ii maps iteration (i, j) to (i, j + f·i). In the following loop nest, the transformation performed is a skew of the inner loop with respect to the outer loop by a factor of 1, represented by the matrix:

| 1 0 |
| 1 1 |

Do i=1,N
  Do j=1,N
    A(i,j)=A(i,j-1)+A(i-1,j)
  End Do
End Do

Before Transformation

Do i=1,N
  Do j=i+1,i+N
    A(i,j-i)=A(i,j-i-1)+A(i-1,j-i)
  End Do
End Do

After Transformation

Page 6: Loop parallelization & pipelining
Page 7: Loop parallelization & pipelining

Transformation Matrices:-

Unimodular transformations are defined by transformation matrices.

A unimodular matrix has 3 important properties:-

1) It is square, i.e. it maps an n-dimensional iteration space into an n-dimensional iteration space.

2) It has all integer components, so it maps integer vectors to integer vectors.

3) The absolute value of its determinant is 1.

Wolf and Lam have stated the following conditions for unimodular transformations:-

1) Let D be the set of distance vectors of a loop nest. A unimodular transformation T is legal if and only if, for every d ∈ D,

T.d is lexicographically positive.

2) Loops i through j of a nested computation with dependence

Page 8: Loop parallelization & pipelining

vectors D are fully permutable if, for every d ∈ D, either (d1, ..., di-1) is lexicographically positive or (di, ..., dj) >= 0, i.e. every component of (di, ..., dj) is non-negative.

Do i=1,N
  Do j=1,N
    A(i,j)=f(A(i,j),A(i+1,j-1))
  End Do
End Do

This code has the dependence vector d = (1,-1). The loop interchange transformation is represented by the matrix:

| 0 1 |
| 1 0 |

Page 9: Loop parallelization & pipelining

Here T.d = (-1,1), which is lexicographically negative, so the interchange alone is illegal.

Now if we compound the interchange with a reversal of the outer loop, represented by the transformation matrix

| -1 0 |
|  0 1 |

the compound transformation becomes

T' = | 0 -1 |
     | 1  0 |

Now T'.d = (1,1), which is lexicographically positive, so the compound transformation is legal.
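The arithmetic above can be checked mechanically. Below is a minimal C sketch (not part of the slides) that applies the interchange matrix and the compound matrix to the distance vector d = (1,-1) and tests lexicographic positivity; the matrices and the vector come from this example, everything else is illustrative.

/* A transformation T is legal iff T.d is lexicographically positive. */
#include <stdio.h>

/* lexicographically positive: the first non-zero component is > 0 */
static int lex_positive(int v[2]) {
    if (v[0] != 0) return v[0] > 0;
    return v[1] > 0;
}

static void apply(int T[2][2], int d[2], int out[2]) {
    out[0] = T[0][0]*d[0] + T[0][1]*d[1];
    out[1] = T[1][0]*d[0] + T[1][1]*d[1];
}

int main(void) {
    int d[2] = {1, -1};                        /* dependence vector of the example      */
    int interchange[2][2] = {{0, 1}, {1, 0}};  /* loop interchange T                    */
    int compound[2][2]    = {{0, -1}, {1, 0}}; /* interchange compounded with reversal  */
    int r[2];

    apply(interchange, d, r);
    printf("T.d  = (%d,%d), legal: %s\n", r[0], r[1],
           lex_positive(r) ? "yes" : "no");    /* (-1,1): no  */

    apply(compound, d, r);
    printf("T'.d = (%d,%d), legal: %s\n", r[0], r[1],
           lex_positive(r) ? "yes" : "no");    /* (1,1):  yes */
    return 0;
}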

Parallelization and Wavefronting:-

The theory of loop transformation can be applied to execute loop iterations in parallel.


Page 10: Loop parallelization & pipelining

Parallelization Conditions:- The purpose of loop parallelization is to maximize the number of parallelizable loops. The algorithm for loop parallelization consists of two steps:-

1) It first transforms the original loop nest into a canonical form, namely a fully permutable loop nest.

2) It then transforms the fully permutable loop nest to exploit coarse-grain and/or fine-grain parallelism according to the target architecture.

Fine Grain Wavefronting:-

• A nest of n fully permutable loops can be transformed into code containing at least (n-1) degrees of parallelism. These (n-1) parallel loops can be obtained by skewing the innermost loop in the fully permutable nest by each of the other loops and moving it to the outermost position.

This transformation, called the wavefront transformation, is represented by the following matrix:-

Page 11: Loop parallelization & pipelining

• Fine-grain parallelism is exploited on vector machines, superscalar processors and systolic arrays.

• The wavefront transformation automatically places the maximum number of doall loops at the innermost positions, maximizing fine-grain parallelism. A two-dimensional instance is sketched after the matrix below.

| 1 1 1 . . . 1 1 |
| 1 0 0 . . . 0 0 |
| 0 1 0 . . . 0 0 |
| . . .       . . |
| 0 0 0 . . . 1 0 |
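To make the matrix concrete, here is a two-dimensional instance as a minimal C sketch (not from the slides). It assumes the dependence A(i,j) = A(i,j-1) + A(i-1,j) from the earlier skewing example; for n = 2 the wavefront matrix is the 2x2 matrix with rows (1,1) and (1,0), i.e. (u,v) = (i+j, i). The constants N and the boundary initialization are illustrative choices.

/* 2-D wavefront: the outer loop walks the diagonals u = i+j sequentially,
   and all iterations on one diagonal are independent (a doall loop). */
#include <stdio.h>

#define N 8

int main(void) {
    static double A[N + 1][N + 1];
    for (int k = 0; k <= N; k++) { A[0][k] = 1.0; A[k][0] = 1.0; }

    for (int u = 2; u <= 2 * N; u++) {        /* sequential loop over wavefronts */
        /* doall: every iteration of this loop could run in parallel */
        for (int v = 1; v <= N; v++) {
            int i = v, j = u - v;             /* map back to the original (i,j)  */
            if (j < 1 || j > N) continue;
            A[i][j] = A[i][j - 1] + A[i - 1][j];
        }
    }
    printf("A[N][N] = %g\n", A[N][N]);
    return 0;
}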

Page 12: Loop parallelization & pipelining

Coarse Grain Parallelism:-

• A wavefront transformation produces the maximum degree of parallelism, but it makes the outermost loop sequential, if any loop has to be sequential at all.

• A heuristic, although non-optimal, approach for making loops doall is simply to identify loops Ii such that the ith component di of every dependence vector is zero. Those loops can be made outermost doalls. The remaining loops in the tile can be wavefronted to obtain the remaining parallelism.

• The loop parallelization algorithm has a common step for fine-grain and coarse-grain parallelism: creating an n-deep fully permutable loop nest by skewing. The algorithm can be tailored for different machines based on the following guidelines:-

1) Move doall loops innermost for a fine-grain machine. Apply a wavefront transformation to create up to (n-1) doall loops.

2) Create outermost doall loops for a coarse-grain machine. Apply tiling to a fully permutable loop nest.

3) Use tiling to create loops for both fine-grain and coarse-grain machines.

Page 13: Loop parallelization & pipelining

Tiling & Localization:-

The purpose is to reduce synchronization overhead and to enhance multiprocessor efficiency when loops are distributed for parallel execution.

It is possible to reduce the synchronization cost and improve the data locality of parallelized loops via an optimization known as tiling.

In general, tiling maps an n-deep loop nest into a 2n-deep loop nest in which the inner n loops include only a small, fixed number of iterations. The outer n loops of the tiled code control the execution of the tiles.

The tiled loops also satisfy the property of full permutability.

We can reduce synchronization cost in the following way:- we first tile the loops and then apply the wavefront transformation to the controlling loops of the tiles. In this way, the synchronization cost is reduced by a factor of the tile size, as the sketch below illustrates.
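As an illustration (not taken from the slides), the C sketch below tiles the same two-dimensional dependence A(i,j) = A(i-1,j) + A(i,j-1) used earlier and then wavefronts the tile-controlling loops: all tiles on one anti-diagonal are independent, so one synchronization per tile diagonal suffices instead of one per iteration diagonal, roughly a factor of the tile size S fewer synchronizations. N, S and the array name are assumptions made for the sketch.

/* Tile the iteration space into S x S tiles and wavefront the
   tile-controlling loops (ti, tj).  Tiles with the same ti + tj are
   independent, so the loop over ti below is a doall; a single
   synchronization point per tile diagonal would follow it. */
#include <stdio.h>

#define N 64
#define S 8                              /* tile size */

static double a[N + 1][N + 1];

int main(void) {
    for (int k = 0; k <= N; k++) { a[0][k] = 1.0; a[k][0] = 1.0; }

    int nt = N / S;                      /* tiles per dimension */
    for (int w = 0; w <= 2 * (nt - 1); w++) {    /* wavefront over tiles: w = ti + tj */
        /* doall over the tiles of this wave (could carry a parallel-for) */
        for (int ti = 0; ti < nt; ti++) {
            int tj = w - ti;
            if (tj < 0 || tj >= nt) continue;
            for (int i = ti * S + 1; i <= (ti + 1) * S; i++)     /* execute one tile */
                for (int j = tj * S + 1; j <= (tj + 1) * S; j++)
                    a[i][j] = a[i - 1][j] + a[i][j - 1];
        }
        /* one synchronization (barrier) per tile diagonal goes here */
    }
    printf("a[N][N] = %g\n", a[N][N]);
    return 0;
}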

Page 14: Loop parallelization & pipelining
Page 15: Loop parallelization & pipelining

Tiling for Locality:-

• Technique to improve the data locality of numerical algorithms.

• It can be used for different levels of the memory hierarchy, such as caches and registers; multiple levels of tiling can be used to achieve locality at multiple levels of the memory hierarchy simultaneously.

Do i=1,N
  Do j=1,N
    Do k=1,N
      C(i,k)=C(i,k)+A(i,j)*B(j,k)
    End Do
  End Do
End Do

Before Tiling

Do l=1,N,s
  Do m=1,N,s
    Do i=1,N
      Do j=l,min(l+s-1,N)
        Do k=m,min(m+s-1,N)
          C(i,k)=C(i,k)+A(i,j)*B(j,k)
        End Do
      End Do
    End Do
  End Do
End Do

After Tiling

Page 16: Loop parallelization & pipelining

• In the untiled code, rows of B and C are reused in subsequent iterations of the middle and outer loops. Tiling reorders the execution sequence so that iterations from loops of the outer dimensions are executed before all the iterations of the inner loops are completed.

• Tiling reduces the number of intervening iterations, and hence the amount of data fetched, between reuses of the same data. This allows reused data to still be in the cache or register file, and hence reduces memory accesses.

Page 17: Loop parallelization & pipelining

Software Pipelining:-

Software pipelining is the pipelining of successive iterations of a loop in the source program. Its advantage is reduced execution time together with compact object code.

Pipelining of loop iterations:- (Lam's tutorial notes)

Do i=1,N
  A(i)=A(i)*B+C
End Do

• In the above code the iterations are independent. It is assumed that each memory access (Read or Write) takes 1 cycle and each arithmetic operation (Mul or Add) takes 2 cycles.

Page 18: Loop parallelization & pipelining

• Without Pipelining:-

One iteration requires 6 cycles to execute, so N iterations require 6N cycles to complete, ignoring loop control overhead.

Cycle   Instruction   Comment
1       Read          Fetch A(i)
2-3     Mul           Multiply by B
4-5     Add           Add C
6       Write         Store A(i)

• With Pipelining:-

Now the same code is executed on an 8-deep instruction pipeline.

Page 19: Loop parallelization & pipelining

Cycle   Iter 1   Iter 2   Iter 3   Iter 4
  1     R
  2     Mul
  3              R
  4              Mul
  5     Add               R
  6                       Mul
  7              Add               R
  8     W                          Mul
  9                       Add
 10              W
 11                                Add
 12                       W
 13
 14                                W

(Each entry marks the cycle in which the operation is issued; each Mul and Add also occupies the following cycle.)

Page 20: Loop parallelization & pipelining

Hence 4 iterations require 14 clock cycles.

Speedup factor = 24/14 ≈ 1.7

For N iterations, the speedup is 6N/(2N+6), as derived below.
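The following short derivation (not on the original slide) uses only numbers already stated above: a new iteration is initiated every 2 cycles and the first iteration completes at cycle 8, so

Pipelined time(N) = 8 + 2(N - 1) = 2N + 6 cycles
Unpipelined time(N) = 6N cycles
Speedup(N) = 6N / (2N + 6), which approaches 3 as N grows (for N = 4: 24/14 ≈ 1.7).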

Page 21: Loop parallelization & pipelining

Trends towards Parallel Systems

From an application point of view, the mainstream use of computers is experiencing a trend of four ascending levels of sophistication:-

• Data processing.

• Information processing.

• Knowledge processing.

• Intelligence processing.

Computer usage started with data processing, which is still a major task of today's computers. With more and more data structures developed, many users are shifting their computer roles from pure data processing to information processing.

As accumulated knowledge bases have expanded rapidly in recent years, a strong demand has grown to use computers for knowledge processing.

Page 22: Loop parallelization & pipelining

Intelligence is very difficult to create; its processing even more so.

Today's computers are very fast and obedient and have many reliable memory cells, which qualifies them for data, information and knowledge processing. However, computers are still far from satisfactory in performing theorem proving, logical inference and creative thinking.

Page 23: Loop parallelization & pipelining

Forms of Parallelism

Parallelism in Hardware (Uniprocessor)

– Pipelining

– Superscalar, VLIW etc.

Parallelism in Hardware (SIMD, Vector processors, GPUs)

Parallelism in Hardware (Multiprocessor)

– Shared-memory multiprocessors

– Distributed-memory multiprocessors

– Chip-multiprocessors a.k.a. Multi-cores

Parallelism in Hardware (Multicomputers a.k.a. clusters)

Parallelism in Software

– Task parallelism

– Data parallelism

Page 24: Loop parallelization & pipelining

Instruction Level Parallelism:-

• Multiple instructions from the same instruction stream can be executed concurrently. The potential overlap among instructions is called instruction level parallelism.

• Generated and managed by hardware (superscalar) or by compiler (VLIW).

• Limited in practice by data and control dependences.

• There are two approaches to instruction level parallelism:

-Hardware.

-Software.

• The hardware level exploits dynamic parallelism, whereas the software level works on static parallelism.

• Consider the following program:

1. e = a + b

2. f = c + d

3. m = e * f

Page 25: Loop parallelization & pipelining

• Operation 3 depends on the results of operations 1 and 2, so it cannot be calculated until both of them are completed. However, operations 1 and 2 do not depend on any other operation, so they can be calculated simultaneously. If we assume that each operation can be completed in one unit of time then these three instructions can be completed in a total of two units of time, giving an ILP of 3/2.
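A tiny C sketch (not from the slides) that just recomputes the ILP of this three-statement example as the number of operations divided by the length of the critical path through its dependence graph; the variable names mirror the example, the rest is illustrative.

/* ILP of the example = (number of operations) / (critical path length). */
#include <stdio.h>

int main(void) {
    /* dependence levels: operations 1 and 2 have no predecessors,
       operation 3 depends on both of them. */
    int level1 = 1;                     /* e = a + b */
    int level2 = 1;                     /* f = c + d */
    int level3 = 1 + (level1 > level2 ? level1 : level2);   /* m = e * f */

    int n_ops = 3;
    int critical_path = level3;         /* longest dependence chain = 2 */
    printf("ILP = %d/%d = %.2f\n", n_ops, critical_path,
           (double)n_ops / critical_path);                  /* prints 3/2 = 1.50 */
    return 0;
}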

Thread-level or task-level parallelism (TLP):-

• Multiple threads or instruction sequences from the same application can be executed concurrently.

• Generated by compiler/user and managed by compiler and hardware.

• Limited in practice by communication/synchronization overheads and by algorithm characteristics.
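As a concrete instance (illustrative, not from the slides), the C sketch below creates two POSIX threads that each execute an independent chunk of work; the join calls are the synchronization points whose overhead limits TLP in practice. The array, sizes and function names are assumptions made for the sketch.

/* Two threads run the same task on independent halves of an array
   and are joined (synchronized) at the end. Compile with -pthread. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
static double data[N];

struct range { int lo, hi; double sum; };

static void *partial_sum(void *arg) {
    struct range *r = arg;
    r->sum = 0.0;
    for (int i = r->lo; i < r->hi; i++)
        r->sum += data[i];
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    struct range a = {0, N / 2, 0.0}, b = {N / 2, N, 0.0};
    pthread_t t1, t2;
    pthread_create(&t1, NULL, partial_sum, &a);   /* task 1 */
    pthread_create(&t2, NULL, partial_sum, &b);   /* task 2 */
    pthread_join(t1, NULL);                       /* synchronization points */
    pthread_join(t2, NULL);

    printf("sum = %f\n", a.sum + b.sum);          /* prints 1000000.000000 */
    return 0;
}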

Page 26: Loop parallelization & pipelining

Data-level parallelism (DLP):-

• Instructions from a single stream operate concurrently on several data

• Limited by non-regular data manipulation patterns and by memory bandwidth.
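A minimal data-parallel C sketch (illustrative, not from the slides): the same operation is applied to every element of the arrays, so a vector unit can execute one instruction on several data items. The "#pragma omp simd" hint is optional and assumes an OpenMP-capable compiler; the array names and sizes are assumptions.

/* One instruction stream, many data elements: y[i] = a*x[i] + y[i]. */
#include <stdio.h>

#define N 1024

int main(void) {
    float x[N], y[N], a = 2.0f;
    for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 1.0f; }

    #pragma omp simd
    for (int i = 0; i < N; i++)      /* vectorizable: no loop-carried dependence */
        y[i] = a * x[i] + y[i];

    printf("y[0] = %f\n", y[0]);     /* prints 3.000000 */
    return 0;
}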

Transaction-level parallelism:-

• Multiple threads/processes from different transactions can be executed concurrently.

• Limited by access to metadata and by interconnection bandwidth.

Page 27: Loop parallelization & pipelining

Parallel Computing

• Use of multiple processors or computers working together on a common task.

– Each processor works on its section of the problem.

– Processors can exchange information.

[Figure: a grid representing the problem to be solved is divided into four areas; CPU #1 through CPU #4 each work on one area of the problem and exchange information with neighbouring CPUs.]

Page 28: Loop parallelization & pipelining

Why Do Parallel Computing?

Limits of single CPU computing:

–performance

–available memory

Parallel computing allows one to:

–solve problems that don’t fit on a single CPU

–solve problems that can’t be solved in a reasonable time

We can solve…

–larger problems

–the same problem faster

–more cases

Page 29: Loop parallelization & pipelining

Brent`s Theorem

Statement:- Given a parallel algorithm A with computation time t, if A performs m computational operations in total, then p processors can execute A in time:-

t + (m - t)/p

Proof:- Let si be the number of computational operations performed by parallel algorithm A at step i (1 <= i <= t). Then

∑_{i=1}^{t} si = m

Since we have p processors, we can simulate step i in time ceil(si/p). So the entire computation of A can be performed with p processors in time:-

Page 30: Loop parallelization & pipelining

∑_{i=1}^{t} ceil(si/p) <= ∑_{i=1}^{t} (si + p - 1)/p     (using the definition of the ceiling function)

= ∑_{i=1}^{t} (p/p) + ∑_{i=1}^{t} (si - 1)/p

= t + (m - t)/p

(Hence proved.)
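A small numerical check of this bound (illustrative, not from the slides): for one random choice of s1..st it compares the simulated time, the sum of ceil(si/p), with the bound t + (m - t)/p. The constants t, p and the random seed are arbitrary choices for the sketch.

/* Verify that sum of ceil(s_i/p) never exceeds Brent's bound t + (m-t)/p. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int t = 10, p = 4;
    int s[10], m = 0;

    srand(42);
    for (int i = 0; i < t; i++) {        /* s_i = operations at step i */
        s[i] = 1 + rand() % 8;
        m += s[i];
    }

    int simulated = 0;                   /* sum of ceil(s_i / p) */
    for (int i = 0; i < t; i++)
        simulated += (s[i] + p - 1) / p;

    double bound = t + (double)(m - t) / p;   /* Brent's bound */
    printf("m = %d, simulated time = %d, bound = %.2f\n", m, simulated, bound);
    /* simulated <= bound holds for every choice of s_i */
    return 0;
}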

Page 31: Loop parallelization & pipelining