
Page 1:

CS 267 Applications of Parallel Computers

Lecture 12: Sources of Parallelism and Locality (Part 3)

Tricks with Trees

James Demmel

http://www.cs.berkeley.edu/~demmel/cs267_Spr99

Page 2:

Recap of last lecture

° ODEs
  • Sparse matrix-vector multiplication
  • Graph partitioning to balance load and minimize communication

° PDEs
  • Heat equation and Poisson equation
  • Solving a certain special linear system T
  • Many algorithms, ranging from
    - Dense Gaussian elimination, slow but very general, to
    - Multigrid, fast but only works on matrices like T

Page 3:

Outline

° Continuation of PDEs
  • What do realistic meshes look like?

° Tricks with Trees

Page 4:

Partial Differential Equations

PDEs

Page 5:

Poisson’s equation in 1D

° Solve Tx=b where

        [  2 -1          ]
        [ -1  2 -1       ]
    T = [    -1  2 -1    ]
        [       -1  2 -1 ]
        [          -1  2 ]

Graph and “stencil”: 2 at the center point, -1 at each neighbor

Page 6:

Poisson’s equation in 2D

° Solve Tx=b where, for a 3-by-3 grid of unknowns,

        [  4 -1    -1                ]
        [ -1  4 -1    -1             ]
        [    -1  4       -1          ]
        [ -1        4 -1    -1       ]
    T = [    -1    -1  4 -1    -1    ]
        [       -1    -1  4       -1 ]
        [          -1        4 -1    ]
        [             -1    -1  4 -1 ]
        [                -1    -1  4 ]

° 3D is analogous

Graph and “stencil”: 4 at the center point, -1 at each of its four nearest neighbors

Page 7:

Algorithms for 2D Poisson Equation with N unknowns

  Algorithm        Serial      PRAM             Memory     #Procs
° Dense LU         N^3         N                N^2        N^2
° Band LU          N^2         N                N^(3/2)    N
° Jacobi           N^2         N                N          N
° Explicit Inv.    N^2         log N            N^2        N^2
° Conj. Grad.      N^(3/2)     N^(1/2) log N    N          N
° RB SOR           N^(3/2)     N^(1/2)          N          N
° Sparse LU        N^(3/2)     N^(1/2)          N log N    N
° FFT              N log N     log N            N          N
° Multigrid        N           (log N)^2        N          N
° Lower bound      N           log N            N

PRAM is an idealized parallel model with zero-cost communication

(see next slide for explanation)

Page 8:

Relation of Poisson’s equation to Gravity, Electrostatics

° Force on particle at (x,y,z) due to particle at 0 is
  -(x,y,z)/r^3, where r = sqrt(x^2 + y^2 + z^2)

° Force is also the gradient of the potential V = -1/r:
  force = -(d/dx V, d/dy V, d/dz V) = -grad V

° V satisfies Poisson’s equation (try it!)

Page 9:

Comments on practical meshes

° Regular 1D, 2D, 3D meshes
  • Important as building blocks for more complicated meshes

° Practical meshes are often irregular
  • Composite meshes, consisting of multiple “bent” regular meshes joined at edges
  • Unstructured meshes, with arbitrary mesh points and connectivities
  • Adaptive meshes, which change resolution during the solution process to put computational effort where needed

Page 10:

Composite mesh from a mechanical structure

Page 11:

Converting the mesh to a matrix

Page 12:

Effects of Ordering Rows and Columns on Gaussian Elimination

Page 13:

Irregular mesh: NASA Airfoil in 2D (direct solution)

Page 14:

Irregular mesh: Tapered Tube (multigrid)

Page 15:

Adaptive Mesh Refinement (AMR)

° Adaptive mesh around an explosion
° John Bell and Phil Colella at LBL (see class web page for URL)
° Goal of Titanium is to make these algorithms easier to implement in parallel

Page 16:

Challenges of irregular meshes (and a few solutions)

° How to generate them in the first place
  • Triangle, a 2D mesh generator by Jonathan Shewchuk
  • 3D is harder!

° How to partition them
  • ParMetis, a parallel graph partitioner

° How to design iterative solvers
  • PETSc, a Portable Extensible Toolkit for Scientific Computing
  • Prometheus, a multigrid solver for finite element problems on irregular meshes
  • Titanium, a language to implement Adaptive Mesh Refinement

° How to design direct solvers
  • SuperLU, parallel sparse Gaussian elimination

° These are challenges to do sequentially, the more so in parallel

Page 17:

Tricks with Trees

Page 18:

Outline

° A log n lower bound to compute any function in parallel

° Reduction and broadcast in O(log n) time

° Parallel prefix (scan) in O(log n) time

° Adding two n-bit integers in O(log n) time

° Multiplying n-by-n matrices in O(log n) time

° Inverting n-by-n triangular matrices in O(log^2 n) time

° Inverting n-by-n dense matrices in O(log^2 n) time

° Evaluating arbitrary expressions in O(log n) time

° Evaluating recurrences in O(log n) time

° Solving n-by-n tridiagonal matrices in O(log n) time

° Traversing linked lists

° Computing minimal spanning trees

° Computing convex hulls of point sets


Page 19:

A log n lower bound to compute any function of n variables

° Assume we can only use binary operations, one per time unit

° After 1 time unit, an output can only depend on two inputs

° Use induction to show that after k time units, an output can only depend on 2^k inputs

° A binary tree performs such a computation

Page 20:

Broadcasts and Reductions on Trees

Page 21:

Parallel Prefix, or Scan

° If “+” is an associative operator, and x[0],…,x[p-1] are input data, then the parallel prefix operation computes

  y[j] = x[0] + x[1] + … + x[j]   for j = 0,1,…,p-1

° Notation: j:k means x[j]+x[j+1]+…+x[k]; blue is the final value
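The definition above can be sketched directly. Below is a minimal Python sketch (not from the slides): `scan_serial` is the specification, and `scan_doubling` is the recursive-doubling formulation that a PRAM would execute in O(log p) parallel steps; here each step is simulated serially.

```python
# Inclusive parallel prefix (scan): y[j] = x[0] + ... + x[j] for an
# associative operator "+".

def scan_serial(x, op):
    """Reference serial scan: the specification of the operation."""
    y = [x[0]]
    for v in x[1:]:
        y.append(op(y[-1], v))
    return y

def scan_doubling(x, op):
    """Recursive-doubling scan: log(p) rounds, each a 'parallel' step."""
    y = list(x)
    step = 1
    while step < len(y):
        # on a PRAM, all combines in this round happen in one time unit
        y = [y[j] if j < step else op(y[j - step], y[j])
             for j in range(len(y))]
        step *= 2
    return y

print(scan_serial([1, 2, 3, 4], lambda a, b: a + b))    # [1, 3, 6, 10]
print(scan_doubling([1, 2, 3, 4], lambda a, b: a + b))  # [1, 3, 6, 10]
```

Because only associativity is assumed, the same code computes running sums, running maxima, or prefix products of matrices.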

Page 22:

Mapping Parallel Prefix onto a Tree - Details

° Up-the-tree phase (from leaves to root):
  1) Get values L and R from left and right children
  2) Save L in a local register M
  3) Pass sum S = L+R to parent

° Down-the-tree phase (from root to leaves):
  1) Get value S from parent (the root gets 0)
  2) Send S to the left child
  3) Send S + M to the right child

° By induction, S = sum of all leaves to the left of the subtree rooted at the parent
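The two phases can be simulated with one recursive pass each over an implicit binary tree. This is a Python sketch (not from the slides) assuming the number of leaves is a power of 2; `M` plays the role of the local register at each internal node.

```python
# Parallel prefix on a tree: up-sweep records M = left-subtree sum at
# each internal node; down-sweep pushes S = "sum of leaves to the left".

def tree_prefix(x, op, identity):
    n = len(x)
    M = {}                                 # M[(lo, hi)] = left-subtree sum

    def up(lo, hi):
        if hi - lo == 1:
            return x[lo]
        mid = (lo + hi) // 2
        L, R = up(lo, mid), up(mid, hi)    # children run in parallel on a tree
        M[(lo, hi)] = L                    # save L in local register M
        return op(L, R)                    # pass S = L + R to parent

    def down(lo, hi, S, out):
        if hi - lo == 1:
            out[lo] = op(S, x[lo])         # inclusive prefix at this leaf
            return
        mid = (lo + hi) // 2
        down(lo, mid, S, out)                    # left child gets S
        down(mid, hi, op(S, M[(lo, hi)]), out)   # right child gets S + M

    up(0, n)
    out = [None] * n
    down(0, n, identity, out)              # the root gets the identity (0 for +)
    return out

print(tree_prefix([1, 2, 3, 4], lambda a, b: a + b, 0))  # [1, 3, 6, 10]
```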

Page 23:

Adding two n-bit integers in O(log n) time

° Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers

° We want their sum s = a+b = s[n]s[n-1]…s[0]

° Challenge: compute all c[i] in O(log n) time via parallel prefix

° Used in all computers to implement addition - Carry look-ahead

  c[-1] = 0                                                        … rightmost carry bit
  for i = 0 to n-1
      c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] )   … next carry bit
      s[i] = a[i] xor b[i] xor c[i-1]

  for all (0 <= i <= n-1)  p[i] = a[i] xor b[i]   … propagate bit
  for all (0 <= i <= n-1)  g[i] = a[i] and b[i]   … generate bit

  c[i] = ( p[i] and c[i-1] ) or g[i], i.e.

  [ c[i] ]   [ p[i]  g[i] ]   [ c[i-1] ]          [ c[i-1] ]
  [      ] = [            ] * [        ]  = C[i] * [        ]
  [  1   ]   [  0      1  ]   [   1    ]          [   1    ]

  … 2-by-2 Boolean matrix multiplication (associative), so

  [ c[i] ]                               [ 0 ]
  [      ] = C[i] * C[i-1] * … * C[0] * [   ]
  [  1   ]                               [ 1 ]

  … evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix
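The carry look-ahead scheme above can be sketched in Python (not from the slides; the prefix products are computed serially here, where a tree would do it in O(log n)). Bits are stored least-significant first, and each carry is read off the prefix product P[i].

```python
# Carry look-ahead addition via a prefix of 2-by-2 Boolean matrices
# C[i] = [[p[i], g[i]], [0, 1]], with "and" for * and "or" for +.

def bool_matmul(X, Y):
    return [[(X[i][0] and Y[0][j]) or (X[i][1] and Y[1][j])
             for j in range(2)] for i in range(2)]

def add_cla(a, b, n):
    """Add two n-bit numbers given as bit lists, a[0] = least significant."""
    p = [a[i] ^ b[i] for i in range(n)]          # propagate bits
    g = [a[i] & b[i] for i in range(n)]          # generate bits
    C = [[[bool(p[i]), bool(g[i])], [False, True]] for i in range(n)]
    # prefix products P[i] = C[i] * C[i-1] * ... * C[0] (serial stand-in)
    P, acc = [], [[True, False], [False, True]]  # identity matrix
    for i in range(n):
        acc = bool_matmul(C[i], acc)
        P.append(acc)
    # (c[i], 1) = P[i] * (c[-1], 1) with c[-1] = 0, so c[i] = P[i][0][1]
    c = [int(Pi[0][1]) for Pi in P]
    s = [a[i] ^ b[i] ^ (c[i - 1] if i > 0 else 0) for i in range(n)]
    return s + [c[n - 1]]                        # n+1 result bits

# 6 + 7 = 13, bits least-significant first
print(add_cla([0, 1, 1, 0], [1, 1, 1, 0], 4))   # [1, 0, 1, 1, 0]
```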

Page 24:

Multiplying n-by-n matrices in O(log n) time

° For all (1 <= i,j,k <= n)  P(i,j,k) = A(i,k) * B(k,j)
  • cost = 1 time unit, using n^3 processors

° For all (1 <= i,j <= n)  C(i,j) = sum_{k=1..n} P(i,j,k)
  • cost = O(log n) time, using a tree with n^3 / 2 processors
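The two steps can be sketched as follows; this is a serial Python simulation (not from the slides), where forming all the products stands in for the one-time-unit step and the pairwise halving loop stands in for the O(log n) tree reduction.

```python
# Matrix multiply as: (1) all n^3 products at once, (2) a tree reduction
# over k for each (i, j), in ceil(log2 n) pairwise rounds.

def matmul_tree(A, B):
    n = len(A)
    # step 1: all products P(i,j,k); conceptually one parallel time unit
    P = [[[A[i][k] * B[k][j] for k in range(n)] for j in range(n)]
         for i in range(n)]
    # step 2: reduce each k-list by pairwise adds (one round per while pass)
    C = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            vals = P[i][j]
            while len(vals) > 1:
                vals = [vals[t] + (vals[t + 1] if t + 1 < len(vals) else 0)
                        for t in range(0, len(vals), 2)]
            C[i][j] = vals[0]
    return C

print(matmul_tree([[1, 2], [3, 4]], [[5, 6], [7, 8]]))  # [[19, 22], [43, 50]]
```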

Page 25:

Inverting triangular n-by-n matrices in O(log^2 n) time

° Fact: if T = [ A 0 ]   then   T^-1 = [       A^-1          0    ]
               [ C B ]                 [ -B^-1 * C * A^-1   B^-1  ]

° Function TriInv(T) … assume n = dim(T) = 2^m for simplicity

    if T is 1-by-1
        return 1/T
    else
        … write T = [ A 0 ]
                    [ C B ]
        in parallel do { invA = TriInv(A); invB = TriInv(B) }   … implicitly uses a tree
        newC = -invB * C * invA
        return [ invA    0   ]
               [ newC  invB  ]

° time(TriInv(n)) = time(TriInv(n/2)) + O(log n)
  • Change variables to m = log n to get time(TriInv(n)) = O(log^2 n)
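The TriInv recursion can be sketched directly. This is a pure-Python sketch (not from the slides) for lower-triangular matrices with n a power of 2; the two recursive calls would run in parallel on a real machine, and `matmul` stands in for the parallel block operations.

```python
# Divide-and-conquer inverse of a lower-triangular matrix (TriInv).
# Matrices are lists of rows.

def matmul(A, B):
    n, m, p = len(A), len(B), len(B[0])
    return [[sum(A[i][k] * B[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]

def tri_inv(T):
    """Invert lower-triangular T; assumes len(T) is a power of 2."""
    n = len(T)
    if n == 1:
        return [[1.0 / T[0][0]]]
    h = n // 2
    A = [row[:h] for row in T[:h]]        # top-left block
    B = [row[h:] for row in T[h:]]        # bottom-right block
    C = [row[:h] for row in T[h:]]        # bottom-left block
    invA = tri_inv(A)                     # these two calls are independent:
    invB = tri_inv(B)                     # run in parallel on a real machine
    newC = [[-x for x in row] for row in matmul(matmul(invB, C), invA)]
    top = [ra + [0.0] * h for ra in invA]
    bot = [rc + rb for rc, rb in zip(newC, invB)]
    return top + bot

T = [[2.0, 0.0, 0.0, 0.0],
     [1.0, 1.0, 0.0, 0.0],
     [0.0, 3.0, 4.0, 0.0],
     [5.0, 0.0, 2.0, 1.0]]
Tinv = tri_inv(T)
I = matmul(T, Tinv)    # should be (numerically) the identity
```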

Page 26:

Inverting dense n-by-n matrices in O(log^2 n) time

° Lemma 1: Cayley-Hamilton Theorem
  • expression for A^-1 via the characteristic polynomial in A

° Lemma 2: Newton’s Identities
  • triangular system of equations for the coefficients of the characteristic polynomial

° Lemma 3: trace(A^k) = sum_{i=1..n} A^k[i,i] = sum_{i=1..n} (lambda_i(A))^k

° Csanky’s Algorithm (1976)

  1) Compute the powers A^2, A^3, …, A^(n-1) by parallel prefix         cost = O(log^2 n)
  2) Compute the traces s_k = trace(A^k)                                cost = O(log n)
  3) Solve Newton’s identities for the coefficients of the characteristic polynomial   cost = O(log^2 n)
  4) Evaluate A^-1 using the Cayley-Hamilton Theorem                    cost = O(log n)

° Completely numerically unstable
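The four steps can be sketched serially in pure Python (not from the slides; the prefix and tree reductions are simulated by plain loops). This makes the instability easy to demonstrate too: it works on tiny well-behaved matrices but degrades rapidly with n.

```python
# Csanky's algorithm: powers -> traces -> Newton's identities ->
# Cayley-Hamilton evaluation of the inverse.

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def csanky_inverse(A):
    n = len(A)
    I = [[float(i == j) for j in range(n)] for i in range(n)]
    # 1) powers A^0 .. A^(n-1) (parallel prefix on a PRAM; serial here)
    powers = [I]
    for _ in range(n - 1):
        powers.append(matmul(powers[-1], A))
    An = matmul(powers[-1], A)                       # A^n, needed for s_n
    # 2) traces s_k = trace(A^k), k = 1..n
    traces = [sum(P[i][i] for i in range(n)) for P in powers[1:]]
    traces.append(sum(An[i][i] for i in range(n)))
    # 3) Newton's identities: s_k + c_1 s_{k-1} + ... + k c_k = 0
    c = []
    for k in range(1, n + 1):
        acc = traces[k - 1] + sum(c[j] * traces[k - 2 - j] for j in range(k - 1))
        c.append(-acc / k)
    # 4) Cayley-Hamilton: A^-1 = -(A^{n-1} + c_1 A^{n-2} + ... + c_{n-1} I) / c_n
    inv = [[0.0] * n for _ in range(n)]
    for j in range(n):                               # c_0 = 1 by convention
        coeff = 1.0 if j == 0 else c[j - 1]
        P = powers[n - 1 - j]
        for r in range(n):
            for s in range(n):
                inv[r][s] += coeff * P[r][s]
    cn = c[n - 1]
    return [[-x / cn for x in row] for row in inv]

print(csanky_inverse([[2.0, 1.0], [1.0, 1.0]]))   # [[1.0, -1.0], [-1.0, 2.0]]
```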

Page 27:

Evaluating arbitrary expressions

° Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately

° Can think of E as arbitrary expression tree with n leaves (the variables) and internal nodes labelled by +, -, * and /

° Theorem (Brent): E can be evaluated in O(log n) time, if we reorganize it using laws of commutativity, associativity and distributivity

° Sketch of (modern) proof: evaluate expression tree E greedily by

• collapsing all leaves into their parents at each time step

• evaluating all “chains” in E with parallel prefix

Page 28:

Evaluating recurrences

° Let x_i = f_i(x_{i-1}), with each f_i a rational function and x_0 given

° How fast can we compute x_n?

° Theorem (Kung): Suppose degree(f_i) = d for all i
  • If d = 1, x_n can be evaluated in O(log n) time using parallel prefix
  • If d > 1, evaluating x_n takes Omega(n) time, i.e. no speedup is possible

° Sketch of proof when d = 1:

  x_i = f_i(x_{i-1}) = (a_i * x_{i-1} + b_i) / (c_i * x_{i-1} + d_i) can be written as

  x_i = num_i / den_i = (a_i * num_{i-1} + b_i * den_{i-1}) / (c_i * num_{i-1} + d_i * den_{i-1}), or

  [ num_i ]   [ a_i  b_i ]   [ num_{i-1} ]         [ num_{i-1} ]                             [ num_0 ]
  [       ] = [          ] * [           ] = M_i * [           ] = M_i * M_{i-1} * … * M_1 * [       ]
  [ den_i ]   [ c_i  d_i ]   [ den_{i-1} ]         [ den_{i-1} ]                             [ den_0 ]

  Can use parallel prefix with 2-by-2 matrix multiplication

° Sketch of proof when d > 1:
  • degree(x_i) as a function of x_0 is d^i
  • after k parallel steps, degree(anything) <= 2^k
  • so computing x_i takes Omega(i) parallel steps
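The d = 1 construction can be sketched in a few lines of Python (not from the slides): carry (num_i, den_i) as a 2-vector and accumulate the product of the matrices M_i. The accumulation is done serially here; on a PRAM it is exactly a parallel prefix of 2-by-2 matrix products in O(log n).

```python
# Evaluate x_i = (a_i x_{i-1} + b_i) / (c_i x_{i-1} + d_i) by
# accumulating the 2x2 matrices M_i = [[a_i, b_i], [c_i, d_i]]
# applied to the vector (num, den).

def eval_recurrence(coeffs, x0):
    """coeffs = [(a_i, b_i, c_i, d_i)] in application order, x_0 given."""
    num, den = x0, 1.0
    for a, b, c, d in coeffs:     # serially; a scan tree does this in O(log n)
        num, den = a * num + b * den, c * num + d * den
    return num / den

# example: x_i = 1 / (1 + x_{i-1}), x_0 = 1, converges to 1/golden ratio
coeffs = [(0.0, 1.0, 1.0, 1.0)] * 30
print(round(eval_recurrence(coeffs, 1.0), 6))   # ~0.618034
```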

Page 29:

Summary of tree algorithms

° Lots of problems can be done quickly - in theory - using trees

° Some algorithms are widely used
  • broadcasts, reductions, parallel prefix
  • carry look-ahead addition

° Some are of theoretical interest only
  • Csanky’s method for matrix inversion
  • solving general tridiagonals (without pivoting)
  • both numerically unstable
  • Csanky needs too many processors

° Embedded in various systems
  • CM-5 hardware control network
  • MPI, Split-C, Titanium, NESL, other languages