Hierarchically Tiled Arrays
Transcript of Hierarchically Tiled Arrays
Presented by Kenneth Detweiler
Overview
• Design and Creation
• Background
• Parallel Operators
• Tiling
• Locality
• HTA Illustrated: Creation, Data Layout, Accessing
• Implementation: Sparse Matrix, Cannon's Algorithm, Estimating pi using the Monte Carlo method
• MATLAB: Internal Structure, What it does
• C++ Implementation: Logical Index Space, HTA Class, Machine Mapping, Operator Framework
• C++ Optimizations: Automatic Memory Management, Template Class, Specialized Methods, Inlined Hot Methods, Lazy Evaluation, Relaxation of serial evaluation
• Current State of HTA
• My Thoughts
Design and Creation
• Developed in 2006 by:
  ▫ University of A Coruña (A Coruña, Spain)
  ▫ University of Illinois at Urbana-Champaign (Urbana, Illinois)
  ▫ IBM (Yorktown Heights, NY)
Background
• A data type built around higher-level data-parallel operators
  ▫ Hierarchical tiling
• HTA makes tiles part of the programming language
  ▫ Tiling helps control locality and data distribution
  ▫ Tiles are referenced explicitly
  ▫ Operators are extended to work on tiles
• C++ library implementation
  ▫ Distributed memory using MPI
  ▫ Shared memory using TBB (Intel Threading Building Blocks)
Parallel Operators
• To harness supercomputer-class power, developers traditionally resort to low-level parallel constructs
  ▫ This is both time consuming and error prone
• The solution is parallel operators, which exist simultaneously on all processors involved in the distributed computation
  ▫ Each operator acts as a single entity capable of processing shared data in parallel
Hierarchical Tiling
• Arrays are partitioned into tiles
• Exploits parallelism and locality at all levels of the memory hierarchy
• Tiling can be expressed iteratively or recursively
• Applied during program design
• Cuts costs in parallel programs by organizing computation around tiles
Locality
• Also known as locality of reference, or the principle of locality
• Describes the tendency of programs to access the same or related storage locations frequently
• Two major types:
  ▫ Temporal: if a particular memory location is referenced, it is likely that the same location will be referenced again in the near future
  ▫ Spatial: if a particular memory location is referenced, it is likely that nearby memory locations will also be referenced soon
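As a concrete illustration of why tiling helps locality, the sketch below sums a row-major matrix by walking it one tile at a time. Both a plain loop and this one visit every element once, but the tiled loop keeps a small block hot in cache (temporal locality) and walks contiguous memory within each tile row (spatial locality). The function name and sizes are illustrative, not part of any HTA API.

```cpp
#include <vector>
#include <cstddef>

// Sum a row-major n x n matrix by visiting cache-friendly b x b tiles.
// Finishing one tile before moving on reuses its cache lines (temporal
// locality) and walks memory contiguously within each tile row (spatial
// locality).
double tiled_sum(const std::vector<double>& a, std::size_t n, std::size_t b) {
    double total = 0.0;
    for (std::size_t ti = 0; ti < n; ti += b)          // tile row
        for (std::size_t tj = 0; tj < n; tj += b)      // tile column
            for (std::size_t i = ti; i < ti + b && i < n; ++i)
                for (std::size_t j = tj; j < tj + b && j < n; ++j)
                    total += a[i * n + j];
    return total;
}
```

Because every element is visited exactly once, the tiled traversal produces the same result as a straight linear pass; only the access order (and cache behavior) changes.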
HTA, Illustrated
HTA, Creation
• Needs:
  ▫ A source array
  ▫ A series of delimiters
• Function:
  ▫ hta(MATRIX, {[delim1],[delim2],…}, [processor mesh])
• Example:
  ▫ h = hta(a, {[1,4],[1,3]}, [2,2]);
HTA, Data Layout
• The innermost tiles are the leaf tiles
• Leaf tiles contain:
  ▫ ROW
  ▫ COLUMN
  ▫ TILE: stores the elements of the tile in contiguous memory locations
• The memory mapping of an HTA is determined by:
  ▫ How tiles are allocated
  ▫ The memory layout within each tile
HTA, Accessing
• Tiles:
  ▫ C{2,2} selects a whole tile, e.g. the lower-right quadrant
• Elements:
  ▫ C(2,2) selects the scalar element at position (2,2)
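MATLAB distinguishes tile access C{2,2} from scalar access C(2,2) by bracket type; the C++ htalib uses operator() for tiles and operator[] for scalars (see the HTA Class slide below). A minimal, non-distributed sketch of that two-level addressing follows; TiledArray2D and its methods are illustrative stand-ins, not the real htalib classes.

```cpp
#include <vector>
#include <cstddef>

// Minimal two-level tiled 2-D array: a grid of tiles, each tile a dense
// row-major block. Tile access and scalar access use separate entry
// points, mirroring the MATLAB C{i,j} vs C(i,j) distinction.
class TiledArray2D {
public:
    TiledArray2D(std::size_t tileRows, std::size_t tileCols,
                 std::size_t rowsPerTile, std::size_t colsPerTile)
        : tc_(tileCols), rpt_(rowsPerTile), cpt_(colsPerTile),
          tiles_(tileRows * tileCols,
                 std::vector<double>(rowsPerTile * colsPerTile, 0.0)) {}

    // Tile access: returns the whole (i,j) tile.
    std::vector<double>& operator()(std::size_t i, std::size_t j) {
        return tiles_[i * tc_ + j];
    }

    // Scalar access: a global (r,c) index is mapped to a tile plus an
    // offset inside that tile.
    double& at(std::size_t r, std::size_t c) {
        std::vector<double>& t = tiles_[(r / rpt_) * tc_ + (c / cpt_)];
        return t[(r % rpt_) * cpt_ + (c % cpt_)];
    }

private:
    std::size_t tc_, rpt_, cpt_;
    std::vector<std::vector<double>> tiles_;
};
```

Writing through the scalar view and reading through the tile view touch the same storage, which is the essence of hierarchical indexing.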
Parallel Programming using HTA: Sparse Matrix

a = hta(MX, {dist}, [P 1]);
b = hta(P, 1, [P 1]);
b{:} = V;
r = a * b;

• Multiplies a sparse matrix MX by a dense vector V using P processors
• Communication occurs between the client and each server
• MX is distributed in chunks of rows into an HTA by calling the constructor
• The P servers holding the HTA are organized into a single column
• The dist argument distributes MX so that the computation is balanced uniformly across the servers
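On each server, the local work for its chunk of rows amounts to a sparse matrix-vector product. A sequential sketch of that kernel in compressed-sparse-row (CSR) form, not the htalib code, looks like this:

```cpp
#include <vector>
#include <cstddef>

// Compressed-sparse-row matrix-vector product y = A * x.
// rowPtr has nrows+1 entries; colIdx/vals hold the nonzeros row by row.
std::vector<double> csr_spmv(const std::vector<std::size_t>& rowPtr,
                             const std::vector<std::size_t>& colIdx,
                             const std::vector<double>& vals,
                             const std::vector<double>& x) {
    std::size_t nrows = rowPtr.size() - 1;
    std::vector<double> y(nrows, 0.0);
    for (std::size_t r = 0; r < nrows; ++r)
        for (std::size_t k = rowPtr[r]; k < rowPtr[r + 1]; ++k)
            y[r] += vals[k] * x[colIdx[k]];
    return y;
}
```

Distributing MX by rows works well here because each output element y[r] depends only on row r of the matrix plus the (replicated) vector x.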
Parallel Programming using HTA: Cannon's Algorithm

for i = 1:n
  c = c + a * b;
  a = circshift(a, [0, -1]);
  b = circshift(b, [-1, 0]);
end

• Requires communication between the client and the servers, and also among the servers themselves
• In each iteration of the loop, every server multiplies the tiles of a and b that reside on it
• The result of the multiplication is accumulated into the local tile of HTA c
• Tiles of a are shifted cyclically along the second dimension and tiles of b along the first: each tile of a moves to the processor on its left in the mesh, and each tile of b to the processor above it
• At the wraparound, the leftmost processor in each row sends its tile of a to the rightmost processor in that row, and the topmost processor in each column sends its tile of b to the bottommost processor in that column
• The end result is that HTA c = a*b
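The shift pattern can be simulated sequentially: each "processor" owns one tile of a and one of b, and every step multiplies-accumulates its local tiles, then rotates a-tiles left and b-tiles up. The loop above assumes the tiles are already aligned; the sketch below also performs Cannon's initial skew (row i of a shifted left by i, column j of b shifted up by j). This is an illustrative simulation, with rotations standing in for the MPI communication.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

using Tile = std::vector<double>;  // dense s x s block, row-major

// c += a * b for s x s tiles.
void mulAdd(Tile& c, const Tile& a, const Tile& b, std::size_t s) {
    for (std::size_t i = 0; i < s; ++i)
        for (std::size_t k = 0; k < s; ++k)
            for (std::size_t j = 0; j < s; ++j)
                c[i * s + j] += a[i * s + k] * b[k * s + j];
}

// Sequential simulation of Cannon's algorithm on a p x p mesh of tiles.
// aT/bT hold one tile per "processor", row-major over the mesh.
std::vector<Tile> cannon(std::vector<Tile> aT, std::vector<Tile> bT,
                         std::size_t p, std::size_t s) {
    // Initial alignment: row i of a rotates left by i,
    // column j of b rotates up by j.
    for (std::size_t i = 0; i < p; ++i)
        std::rotate(aT.begin() + i * p, aT.begin() + i * p + i,
                    aT.begin() + (i + 1) * p);
    for (std::size_t j = 0; j < p; ++j) {
        std::vector<Tile> col(p);
        for (std::size_t i = 0; i < p; ++i) col[i] = bT[((i + j) % p) * p + j];
        for (std::size_t i = 0; i < p; ++i) bT[i * p + j] = col[i];
    }
    std::vector<Tile> cT(p * p, Tile(s * s, 0.0));
    for (std::size_t step = 0; step < p; ++step) {
        // Local multiply-accumulate on every "processor".
        for (std::size_t t = 0; t < p * p; ++t) mulAdd(cT[t], aT[t], bT[t], s);
        // circshift(a, [0,-1]): every row of a-tiles rotates one step left.
        for (std::size_t i = 0; i < p; ++i)
            std::rotate(aT.begin() + i * p, aT.begin() + i * p + 1,
                        aT.begin() + (i + 1) * p);
        // circshift(b, [-1,0]): every column of b-tiles rotates one step up.
        for (std::size_t j = 0; j < p; ++j) {
            Tile top = bT[j];
            for (std::size_t i = 0; i + 1 < p; ++i)
                bT[i * p + j] = bT[(i + 1) * p + j];
            bT[(p - 1) * p + j] = top;
        }
    }
    return cT;
}
```

With 1x1 tiles this degenerates to plain matrix multiplication, which makes it easy to check by hand.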
Parallel Programming using HTA: Estimating pi with the Monte Carlo Method

input = hta(P, 1, [P 1]);
input{:} = eP;
output = parHTAFunc(@estimatePi, input);
myPi = mean(output(:));

function r = estimatePi(n)
  x = rand(1, n);
  y = rand(1, n);
  pos = x .* x + y .* y;
  r = sum(pos < 1) / n * 4;

• A distributed HTA input with one tile per processor is built
• Each tile is filled with eP, the number of experiments to run per processor
• parHTAFunc runs the function estimatePi() on every processor
• The result of the parallel execution is a distributed HTA output with the same mapping as input, holding one tile per processor with that processor's local estimate of pi
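The per-processor kernel draws n random points in the unit square and counts the fraction that lands inside the quarter circle. A sequential C++ version of the same kernel, using a seeded standard generator in place of MATLAB's rand (the name estimate_pi and the default seed are illustrative choices):

```cpp
#include <random>
#include <cstddef>

// Estimate pi by sampling n uniform points in the unit square and
// counting those inside the quarter circle x^2 + y^2 < 1; that fraction
// approximates pi/4.
double estimate_pi(std::size_t n, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::size_t inside = 0;
    for (std::size_t i = 0; i < n; ++i) {
        double x = u(gen), y = u(gen);
        if (x * x + y * y < 1.0) ++inside;
    }
    return 4.0 * double(inside) / double(n);
}
```

The parallel version simply runs this kernel independently on every processor (each with its own stream of random numbers) and averages the per-processor estimates, exactly what mean(output(:)) does above.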
Implementation
• MATLAB (matrix laboratory)
  ▫ A numerical computing environment and fourth-generation programming language
  ▫ Started as a wrapper around Fortran libraries
MATLAB – Internal Structure
• MATLAB is used on the client, where the code is executed
• On the servers, MATLAB is used as a computational engine for the distributed operations on HTAs
MATLAB – Internal Structure (cont)
• All communication is done through MPI
  ▫ The lower layers of the HTA toolbox take care of the communication requirements
  ▫ The higher layers implement the syntax expected by MATLAB users
• HTA programs have a single thread that is interpreted and executed by the MATLAB client
• HTAs are just objects within the environment of the interpreter
MATLAB – Internal Structure (cont)
• When an HTA is local, it is not distributed on the array servers
  ▫ The client keeps both the structure and the contents of the HTA
• When an HTA is distributed
  ▫ The client holds the structure of the HTA at all its levels
  ▫ It also keeps the mapping of the top-level tiles onto the mesh of servers
• Whether the HTA is local or distributed, the client is always able to:
  ▫ Test the legality of an operation
  ▫ Calculate the structure and mapping of the output HTA
  ▫ Send messages that encode the command and its arguments to the servers
MATLAB
• Is a linear-algebra language with a large base of users who write scientific code
  ▫ HTA lets these users harness the power of a cluster of workstations instead of a single machine
• Is polymorphic
  ▫ HTAs can substitute for regular arrays almost without changing the rest of the code, thereby adding parallelism painlessly
• Is designed to be extensible
  ▫ Third-party developers can provide so-called toolboxes of functions for specialized purposes
• Provides a native-code mechanism called MEX
  ▫ Allows functions to be implemented in languages such as C or Fortran
C++ Implementation
• HTAs can be used from C++ through the htalib library
  ▫ HTAs are represented as composite objects with methods that operate on both distributed and sequential HTAs
  ▫ Two communication layers are available: MPI and UPC
  ▫ Execution follows the SPMD model while the programming model remains single-threaded
• Core data structures of htalib:
  ▫ Logical index space
  ▫ HTA class
  ▫ Machine mapping
  ▫ Operator framework
C++ Implementation: Logical Index Space
• Classes used to define the index space and tiling of an HTA:
  ▫ Tuple<N> // an N-dimensional index value
  ▫ Triplet // a 1-D range with optional stride (low:high:step)
  ▫ Region<N> // an N-dimensional rectangular index space spanned by N triplets
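Under stated assumptions, here is a simplified sketch of what a Triplet (low:high:step) and a Region spanned by triplets might look like. These are illustrative stand-ins restricted to two dimensions, not the real htalib Triplet and Region<N> classes.

```cpp
#include <vector>
#include <utility>

// A 1-D index range low:high:step (simplified Triplet; step assumed to
// divide the range evenly).
struct Triplet {
    int low, high, step;
    int size() const { return (high - low) / step + 1; }
};

// A 2-D rectangular index space spanned by two triplets (a Region<2>
// reduced to fixed dimensionality).
struct Region2 {
    Triplet rows, cols;
    int size() const { return rows.size() * cols.size(); }
    // Enumerate every (i, j) pair covered by the region.
    std::vector<std::pair<int, int>> indices() const {
        std::vector<std::pair<int, int>> out;
        for (int i = rows.low; i <= rows.high; i += rows.step)
            for (int j = cols.low; j <= cols.high; j += cols.step)
                out.push_back({i, j});
        return out;
    }
};
```

A region like {0:4:2, 1:3:1} covers the 3 x 3 grid of indices with rows {0, 2, 4} and columns {1, 2, 3}; this is the kind of strided rectangular selection HTA operations are defined over.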
C++ Implementation: HTA Class
• Defines an HTA with scalar elements of type T and N dimensions
• Provides scalar access (operator[]) and tile access (operator())
• Built-in array operations:
  ▫ transpose
  ▫ permute
  ▫ dpermute
  ▫ reduce
C++ Implementation: Machine Mapping
• Specifies where the HTA is allocated in a distributed system
• Memory layout of the scalar data array
  ▫ Captured by instances of a distribution class that specifies the home location of the scalar data for each tile of an HTA
• Memory mapping
  ▫ Specifies the layout (e.g., row-major across tiles, row-major per tile)
  ▫ Specifies the size and stride of the flat data array underlying the HTA
C++ Implementation: Operator Framework
• htalib provides a powerful operator framework following the design of the STL function-object classes
  ▫ Consists of routines that evaluate specific operators on HTAs, plus base classes that serve as a foundation for user-defined operators
C++ Optimizations
• HTAs are represented as composite objects
  ▫ Methods are available for both distributed and local HTAs
  ▫ Two implementations: MPI and UPC
• Performance optimizations:
  ▫ Automatic memory management
  ▫ Template classes
  ▫ Specialized methods
  ▫ Inlined hot methods
  ▫ Lazy evaluation
  ▫ Relaxation of serial evaluation semantics
C++ Optimizations: Automatic Memory Management
• HTAs are allocated on the heap through factory methods, enabling automatic memory management
  ▫ The factory methods return a handle which is assigned to a stack-allocated variable
  ▫ All access occurs through this handle
  ▫ Once all handles to an HTA disappear from the stack, the HTA and its related structures are automatically deleted from memory
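This handle scheme matches reference-counted smart pointers: heap allocation through a factory, stack-held handles, and automatic deletion when the last handle goes out of scope. A sketch of the idea using std::shared_ptr (ToyHTA is an illustrative class, not htalib's actual handle type):

```cpp
#include <memory>
#include <vector>
#include <cstddef>

// Toy HTA allocated on the heap through a factory method; callers only
// ever hold reference-counted handles on the stack. When the last handle
// is destroyed, the object and its storage are freed automatically.
class ToyHTA {
public:
    static std::shared_ptr<ToyHTA> alloc(std::size_t n) {
        return std::shared_ptr<ToyHTA>(new ToyHTA(n));
    }
    std::size_t size() const { return data_.size(); }
private:
    explicit ToyHTA(std::size_t n) : data_(n, 0.0) {}  // heap-only: ctor is private
    std::vector<double> data_;
};
```

Copying a handle bumps the reference count; letting a copy go out of scope drops it, and the object disappears only when the count reaches zero.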
C++ Optimizations: Template Classes
• Used in htalib to handle data of different types and dimensionalities
• Provide flexibility and opportunities for optimization at compile time
• The element type of an HTA can be any built-in or user-defined type
C++ Optimizations: Specialized Methods
• Methods are optimized and, whenever possible, specialized for specific cases
• For example, a specialized method that avoids multiplication by the stride is invoked when the data being accessed is known to be stored in consecutive locations
C++ Optimizations: Inlined Hot Methods / Lazy Evaluation
• Inlined hot methods
  ▫ Inlining is applied to frequently used methods
  ▫ The tile-access and scalar-access functions are carefully inlined to reduce the overhead of function calls
• Lazy evaluation
  ▫ When the RHS of an HTA assignment contains more than one variable, htalib uses lazy evaluation to avoid or reduce the temporaries generated by chains of binary operations
  ▫ If the LHS and RHS have no data dependences, the RHS is then evaluated directly into the LHS
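The lazy-evaluation idea of building an expression object for the RHS and evaluating it element by element into the LHS, with no intermediate arrays, is the classic expression-template technique. A minimal sketch for vector addition follows; Vec and Sum are illustrative types, not htalib's implementation.

```cpp
#include <vector>
#include <cstddef>

// Expression node representing l + r lazily: no storage is allocated,
// elements are computed on demand.
template <typename L, typename R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> d;
    double operator[](std::size_t i) const { return d[i]; }
    // Assignment from any expression: a single loop writes directly into
    // this vector, so a + b + c creates no temporary vectors.
    template <typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < d.size(); ++i) d[i] = e[i];
        return *this;
    }
};

// Building a Sum node is cheap; evaluation is deferred until assignment.
template <typename L, typename R>
Sum<L, R> operator+(const L& l, const R& r) { return {l, r}; }
```

With an eager design, c = a + b + a would allocate a temporary for a + b; here the whole RHS stays a lightweight expression tree until operator= runs one fused loop.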
C++ Examples: Matrix Multiplication
[code figure not transcribed]
C++ Optimizations: Relaxation of Serial Evaluation Semantics
• htalib provides a mechanism to temporarily relax the serial evaluation ordering
  ▫ This helps overlap different communications, and communications with computations

htalib::async();
B(1:n)[0] = B(0:n-1)[d];
B(0:n-1)[d+1] = B(1:n)[1];
htalib::sync();

  ▫ The code above shows the boundary exchange in a 1-D Jacobi solver
  ▫ Since there is no data dependence between the assignments, both statements can proceed concurrently
  ▫ This is achieved through the runtime calls to async and sync
C++ Examples: Matrix Transposition
[code figure not transcribed]
Current State
• HTA library support is currently being extended to handle:
  ▫ Sparse data partitioning
  ▫ Hierarchical place trees
• The library is continually optimized to improve performance
• New host languages are being explored as targets, to increase productivity
My Thoughts
• By increasing the productivity of parallel computing, we can harness processing power across many machines
• Powerful library tools make for more powerful programming languages
• HTA is something I will keep in mind for my future programming projects