Data Structures and Algorithms in Parallel Computing Lecture 10.

Numerical algorithms

Numerical algorithms use numerical approximation to solve mathematical problems. They do not seek exact solutions, because that would be nearly impossible in practice. Much work has been done on parallelizing numerical algorithms, for example:
- Matrix operations
- Particle physics
- Systems of linear equations

Note: https://courses.engr.illinois.edu/cs554/fa2015/notes/index.html

Matrix operations

- Inner product: x^T y = sum_i x_i y_i (a scalar)
- Outer product: x y^T, the n x n matrix with entries x_i y_j
- Matrix-vector product: y = A x, with y_i = sum_j a_ij x_j
- Matrix-matrix product: C = A B, with c_ij = sum_k a_ik b_kj

Inner product

Assign (n/k)/p coarse-grain tasks to each of the p processes, for a total of n/p components of x and y per process. Communication: a sum reduction over the n/k coarse-grain tasks. (A code sketch of this scheme appears after the matrix-matrix product section below.)

Isoefficiency describes how the amount of computation performed must scale with the number of processors to keep efficiency constant:
- 1D mesh: Θ(p²)
- 2D mesh: Θ(p^(3/2))
- Hypercube: Θ(p log p)

Outer product

At most n tasks store components of x and y: for some j, task (i,j) stores x_i and task (j,i) stores y_i, or task (i,i) stores both x_i and y_i, i = 1,...,n.

Communication:
- For i = 1,...,n, the task that stores x_i broadcasts it to all other tasks in the ith task row.
- For j = 1,...,n, the task that stores y_j broadcasts it to all other tasks in the jth task column.

1D mapping (column-wise or row-wise): each task holding components of x or y must broadcast them to its neighbors. Isoefficiency: Θ(p²).
2D mapping: isoefficiency Θ(p²).

Matrix-vector product

At most 2n fine-grain tasks store components of x and y, say either
- for some j, task (j,i) stores x_i and task (i,j) stores y_i, or
- task (i,i) stores both x_i and y_i, i = 1,...,n.

Communication:
- For j = 1,...,n, the task that stores x_j broadcasts it to all other tasks in the jth task column.
- For i = 1,...,n, a sum reduction over the ith task row gives y_i.

Steps:
1. Broadcast x_j to tasks (k,j), k = 1,...,n
2. Compute y_i = a_ij x_j
3. Reduce y_i across tasks (i,k), k = 1,...,n

Isoefficiency is Θ(p²) for the 2D mapping, the 1D column mapping, and the 1D row mapping.

Matrix-matrix product

The matrix-matrix product can be viewed as n² inner products, as a sum of n outer products, or as n matrix-vector products. Each viewpoint yields a different algorithm. One way to derive parallel algorithms for the matrix-matrix product is to apply the parallel algorithms already developed for the inner product, outer product, or matrix-vector product; here we investigate parallel algorithms for this problem directly.

At most 3n² fine-grain tasks store entries of A, B, or C: task (i,j,j) stores a_ij, task (i,j,i) stores b_ij, and task (i,j,k) stores c_ij for i,j = 1,...,n and some fixed k, where (i,j,k) = (row, column, layer).

Communication:
- Broadcast the entries of the jth column of A horizontally along each task row in the jth layer.
- Broadcast the entries of the ith row of B vertically along each task column in the ith layer.
- For i,j = 1,...,n, the result c_ij is given by a sum reduction over tasks (i,j,k), k = 1,...,n.

Steps:
1. Broadcast a_ik to tasks (i,q,k), q = 1,...,n
2. Broadcast b_kj to tasks (q,j,k), q = 1,...,n
3. Compute c_ij = a_ik b_kj
4. Reduce c_ij across tasks (i,j,q), q = 1,...,n

Task grouping: group fine-grain tasks together to reduce the number of processors required.
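As a concrete illustration of the schemes above, the following is a minimal sketch of the coarse-grain parallel inner product, assuming the mpi4py package is available (launched with, e.g., mpirun -n 4 python inner_product.py; the script name, data, and problem size are illustrative and not part of the lecture). Each process holds only its own block of roughly n/p components of x and y, computes a partial sum locally, and a single sum reduction combines the partial results.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

n = 1_000_000                                   # global vector length (illustrative)
counts = [n // p + (1 if r < n % p else 0) for r in range(p)]
lo = sum(counts[:rank])                         # first global index owned by this rank
hi = lo + counts[rank]                          # one past the last owned index

# Each process holds only its block of roughly n/p components of x and y.
idx = np.arange(lo, hi, dtype=np.float64)
x_local = idx                                   # x_i = i, so x.y is easy to check
y_local = np.ones_like(idx)                     # y_i = 1

partial = float(x_local @ y_local)              # local work: about n/p multiply-adds
result = comm.allreduce(partial, op=MPI.SUM)    # communication: one sum reduction

if rank == 0:
    print(result, n * (n - 1) / 2)              # the two values should agree
```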
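In the same spirit, here is a sketch of the matrix-vector product y = A x with a 1D row mapping, again assuming mpi4py (matrix contents and sizes are made up): x is broadcast to every process, each process multiplies its own block of rows of A by x, and the resulting pieces of y are gathered. With a 1D column mapping the roles of the broadcast and the sum reduction are exchanged.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

n = 8                                        # small illustrative size; assume p divides n
rows = n // p

# Each process stores only its own block of n/p rows of A.
A_local = np.fromfunction(lambda i, j: rank * rows + i + j, (rows, n))

# Communication step 1: broadcast x from the root to all processes.
x = np.arange(n, dtype=np.float64) if rank == 0 else None
x = comm.bcast(x, root=0)

# Local computation: each process produces its own block of y.
y_local = A_local @ x

# Communication step 2: collect the blocks of y on the root.
blocks = comm.gather(y_local, root=0)
if rank == 0:
    print(np.concatenate(blocks))
```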
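Finally, a sketch of the matrix-matrix product C = A B organised around the "sum of n outer products" viewpoint, again assuming mpi4py. For simplicity every process generates the full A and B from the same seed and then keeps only its block of columns of A and the matching block of rows of B; each process forms its partial product locally, and a sum reduction over all processes yields C. This is a coarse-grained illustration, not the fine-grain 3D task scheme described above.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

n = 8                                        # assume p divides n for simplicity
w = n // p                                   # width of this process's column/row block

rng = np.random.default_rng(0)               # same seed everywhere: consistent global A, B
A = rng.random((n, n))
B = rng.random((n, n))

A_cols = A[:, rank * w:(rank + 1) * w]       # this process's block of columns of A
B_rows = B[rank * w:(rank + 1) * w, :]       # the matching block of rows of B

C_partial = A_cols @ B_rows                  # local work: a sum of w outer products
C = np.empty((n, n))
comm.Allreduce(C_partial, C, op=MPI.SUM)     # sum reduction over all processes gives C

if rank == 0:
    print(np.allclose(C, A @ B))             # sanity check against the serial product
```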
Particle systems

Many physical systems can be modeled as a collection of interacting particles:
- Atoms in a molecule
- Planets in a solar system
- Stars in a galaxy
- Galaxies in clusters

The particles exert mutual forces on each other, for example gravitational or electrostatic forces.

N-body model

Newton's second law, F = m a, governs the motion of each particle. The force between two particles i and j (in the gravitational case) is

    f_ij = G m_i m_j (x_j - x_i) / |x_j - x_i|^3

and the overall force on the ith particle is the sum over all other particles:

    F_i = sum over j != i of f_ij

The complexity is O(n²) due to the particle-particle interactions. It can be reduced to O(n log n) or O(n) through hierarchical trees or multipole methods, at the price of some accuracy.

Trivial parallelism: one fine-grain task per particle-particle interaction gives high parallelism, but the total work is prohibitive and the memory requirements may be expensive. Two steps:
1. Broadcast the position of each particle along the task rows and columns.
2. Reduce the forces diagonally (to the home task of each particle) and perform the time integration.
Agglomeration can reduce communication in the rows or columns. (A sketch of a coarser, particle-wise direct computation appears at the end of this transcript.)

Reducing complexity

Forces have infinite range, but with declining strength. There are three major options:
- Perform the full computation at O(n²) cost.
- Discard forces from particles beyond a certain range, introducing an error that is bounded away from zero.
- Approximate long-range forces, exploiting the behavior of the force and/or features of the problem.

Approach: monopole representation (tree code)

Method: aggregate distant particles into cells and represent the effect of all particles in a cell by a monopole (the first term in the multipole expansion) evaluated at the center of the cell. In other words, replace the influence of far-away particles with an aggregate approximation of the force, and use larger cells at greater distances. The approximation is relatively crude. (A small sketch of the monopole approximation also appears at the end of this transcript.)

Parallel approach

Divide the domain into patches, with each patch assigned to a process. The tree code replaces communication with all processes by communication with fewer processes. To avoid the accuracy problem of the monopole expansion, use the full multipole expansion.

What's next?

Discuss some recent papers on parallel algorithms dealing with the classes of problems covered in this lecture.
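As referenced in the N-body section above, here is a minimal sketch of the direct O(n²) force computation for the gravitational model, parallelised over particles rather than over individual interactions, again assuming mpi4py. All positions and masses are broadcast to every process, each process computes the total force on its own n/p particles, and the results are gathered; the particle data and constants are invented for illustration.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, p = comm.Get_rank(), comm.Get_size()

n, G = 1024, 1.0                                 # particle count and gravitational constant (illustrative)

# The root owns the global particle state; broadcast it to every process.
if rank == 0:
    rng = np.random.default_rng(0)
    pos = rng.random((n, 3))
    mass = rng.random(n) + 0.1
else:
    pos, mass = None, None
pos = comm.bcast(pos, root=0)
mass = comm.bcast(mass, root=0)

# Each process computes the total force on its own contiguous slice of particles.
lo, hi = rank * n // p, (rank + 1) * n // p
forces = np.zeros((hi - lo, 3))
for k, i in enumerate(range(lo, hi)):
    d = pos - pos[i]                             # vectors from particle i to all others
    r2 = (d * d).sum(axis=1)
    r2[i] = np.inf                               # exclude the self-interaction
    forces[k] = ((G * mass[i] * mass / r2 ** 1.5)[:, None] * d).sum(axis=0)

# Gather the per-particle forces on the root (time integration would follow).
all_forces = comm.gather(forces, root=0)
if rank == 0:
    print(np.concatenate(all_forces).shape)      # (n, 3): one force vector per particle
```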
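And, as also referenced above, a small serial sketch of the monopole idea behind the tree code: the particles in one distant cell are replaced by their total mass placed at their center of mass, and the force from that aggregate is compared with the exact pairwise sum. The particle counts, positions, and the single hand-built cell are all invented for illustration; a real tree code would build a hierarchy of cells and use larger cells at greater distances.

```python
import numpy as np

G = 1.0
rng = np.random.default_rng(0)

# A target particle near the origin and a far-away "cell" of source particles.
target, m_target = np.zeros(3), 1.0
cell = rng.random((100, 3)) * 0.5 + np.array([10.0, 0.0, 0.0])
m_cell = rng.random(100) + 0.1

def pair_force(xi, mi, xj, mj):
    """Gravitational force on particle i due to particle j."""
    d = xj - xi
    return G * mi * mj * d / np.linalg.norm(d) ** 3

# Exact: sum the force contribution of every particle in the cell.
exact = sum(pair_force(target, m_target, cell[j], m_cell[j]) for j in range(len(m_cell)))

# Monopole: the cell's total mass placed at its center of mass.
M = m_cell.sum()
com = (m_cell[:, None] * cell).sum(axis=0) / M
approx = pair_force(target, m_target, com, M)

print("exact   :", exact)
print("monopole:", approx)
print("relative error:", np.linalg.norm(approx - exact) / np.linalg.norm(exact))
```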