Hierarchically Tiled Arrays
Transcript of Hierarchically Tiled Arrays
Presented by Kenneth Detweiler
Overview
• Design and Creation
• Background
• Parallel Operators
• Tiling
• Locality
• HTA Illustrated: Creation, Data Layout, Accessing
• Implementation: Sparse Matrix, Cannon's Algorithm, Estimating pi using the Monte Carlo method
• MATLAB: Internal Structure, What it does
• C++ Implementation: Logical Index Space, HTA Class, Machine Mapping, Operator Framework
• C++ Optimizations: Automatic Memory Management, Template Class, Specialized Methods, Inlined Hot Methods, Lazy Evaluation, Relaxation of serial evaluation
• Current State of HTA
• My Thoughts
Design and Creation
• Developed in 2006 by:
  ▫ University of A Coruña (A Coruña, Spain)
  ▫ University of Illinois at Urbana-Champaign (Urbana, Illinois)
  ▫ IBM (Yorktown Heights, NY)
Background
• A data type built around higher-level data-parallel operators
  ▫ Hierarchical tiling
• HTA makes tiles part of the programming language
  ▫ Tiling helps control locality and data distribution
  ▫ Tiles are referenced explicitly
  ▫ Operators are extended to work on tiles
• C++ library implementation
  ▫ Distributed memory using MPI
  ▫ Shared memory using TBB (Intel Threading Building Blocks)
Parallel Operators
• To harness supercomputer-class power, developers traditionally resort to low-level parallel constructs
  ▫ This is both time consuming and error prone
• The solution is parallel operators, which exist simultaneously on all processors involved in the distributed computation
  ▫ Each operator acts as a single entity capable of processing shared data in parallel
Hierarchical Tiling
• Arrays are partitioned into tiles
• Exploits parallelism and locality at all levels of the memory hierarchy
• Tiling can be expressed iteratively or recursively
• Applied during program design
• Cuts costs in parallel programs by organizing computation around tiles
Locality
• Also known as locality of reference, or the principle of locality
• Describes the tendency of programs to access the same or related storage locations frequently
• Two major types:
  ▫ Temporal: if a particular memory location is referenced, it is likely that the same location will be referenced again in the near future
  ▫ Spatial: if a particular memory location is referenced, it is likely that nearby memory locations will also be referenced soon
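As a concrete illustration of why tiling helps locality, the sketch below sums a row-major matrix by walking it one tile at a time. Both a plain loop and this one visit every element once, but the tiled loop keeps a small block hot in cache (temporal locality) and walks contiguous memory within each tile row (spatial locality). The function name and sizes are illustrative, not part of any HTA API.

```cpp
#include <vector>
#include <cstddef>

// Sum a row-major n x n matrix by visiting cache-friendly b x b tiles.
// Finishing one tile before moving on reuses its cache lines (temporal
// locality) and walks memory contiguously within each tile row (spatial
// locality).
double tiled_sum(const std::vector<double>& a, std::size_t n, std::size_t b) {
    double total = 0.0;
    for (std::size_t ti = 0; ti < n; ti += b)          // tile row
        for (std::size_t tj = 0; tj < n; tj += b)      // tile column
            for (std::size_t i = ti; i < ti + b && i < n; ++i)
                for (std::size_t j = tj; j < tj + b && j < n; ++j)
                    total += a[i * n + j];
    return total;
}
```

Because every element is visited exactly once, the tiled traversal produces the same result as a straight linear pass; only the access order (and cache behavior) changes.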
HTA, Illustrated
HTA, Creation
• Needs:
  ▫ A source array
  ▫ A series of delimiters
• Function:
  ▫ hta(MATRIX, {[delim1],[delim2],…}, [processor mesh])
• Example:
  ▫ h = hta(a, {[1,4],[1,3]}, [2,2]);
HTA, Data Layout
• The innermost tiles are the leaf tiles
• Leaf tiles contain:
  ▫ ROW
  ▫ COLUMN
  ▫ TILE: stores the elements of the tile in contiguous memory locations
• The memory mapping of an HTA is determined by:
  ▫ How tiles are allocated
  ▫ The memory layout within each tile
HTA, Accessing
• Tiles:
  ▫ C{2,2} selects a whole tile, e.g. the lower-right quadrant
• Elements:
  ▫ C(2,2) selects the scalar element at position (2,2)
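MATLAB distinguishes tile access C{2,2} from scalar access C(2,2) by bracket type; the C++ htalib uses operator() for tiles and operator[] for scalars (see the HTA Class slide below). A minimal, non-distributed sketch of that two-level addressing follows; TiledArray2D and its methods are illustrative stand-ins, not the real htalib classes.

```cpp
#include <vector>
#include <cstddef>

// Minimal two-level tiled 2-D array: a grid of tiles, each tile a dense
// row-major block. Tile access and scalar access use separate entry
// points, mirroring the MATLAB C{i,j} vs C(i,j) distinction.
class TiledArray2D {
public:
    TiledArray2D(std::size_t tileRows, std::size_t tileCols,
                 std::size_t rowsPerTile, std::size_t colsPerTile)
        : tc_(tileCols), rpt_(rowsPerTile), cpt_(colsPerTile),
          tiles_(tileRows * tileCols,
                 std::vector<double>(rowsPerTile * colsPerTile, 0.0)) {}

    // Tile access: returns the whole (i,j) tile.
    std::vector<double>& operator()(std::size_t i, std::size_t j) {
        return tiles_[i * tc_ + j];
    }

    // Scalar access: a global (r,c) index is mapped to a tile plus an
    // offset inside that tile.
    double& at(std::size_t r, std::size_t c) {
        std::vector<double>& t = tiles_[(r / rpt_) * tc_ + (c / cpt_)];
        return t[(r % rpt_) * cpt_ + (c % cpt_)];
    }

private:
    std::size_t tc_, rpt_, cpt_;
    std::vector<std::vector<double>> tiles_;
};
```

Writing through the scalar view and reading through the tile view touch the same storage, which is the essence of hierarchical indexing.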
Parallel Programming using HTA: Sparse Matrix

a = hta(MX, {dist}, [P 1]);
b = hta(P, 1, [P 1]);
b{:} = V;
r = a * b;

• Multiplies a sparse matrix MX by a dense vector V using P processors
• Communication occurs between the client and each server
• MX is distributed in chunks of rows into an HTA by calling the constructor
• The P servers holding the HTA are organized into a single column
• The dist argument distributes MX so that the computation is balanced uniformly across the servers
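On each server, the local work for its chunk of rows amounts to a sparse matrix-vector product. A sequential sketch of that kernel in compressed-sparse-row (CSR) form, not the htalib code, looks like this:

```cpp
#include <vector>
#include <cstddef>

// Compressed-sparse-row matrix-vector product y = A * x.
// rowPtr has nrows+1 entries; colIdx/vals hold the nonzeros row by row.
std::vector<double> csr_spmv(const std::vector<std::size_t>& rowPtr,
                             const std::vector<std::size_t>& colIdx,
                             const std::vector<double>& vals,
                             const std::vector<double>& x) {
    std::size_t nrows = rowPtr.size() - 1;
    std::vector<double> y(nrows, 0.0);
    for (std::size_t r = 0; r < nrows; ++r)
        for (std::size_t k = rowPtr[r]; k < rowPtr[r + 1]; ++k)
            y[r] += vals[k] * x[colIdx[k]];
    return y;
}
```

Distributing MX by rows works well here because each output element y[r] depends only on row r of the matrix plus the (replicated) vector x.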
Parallel Programming using HTA: Cannon's Algorithm

for i = 1:n
  c = c + a * b;
  a = circshift(a, [0, -1]);
  b = circshift(b, [-1, 0]);
end

• Requires communication between the client and the servers, and also among the servers themselves
• In each iteration of the loop, every server multiplies the tiles of a and b that reside on it
• The result of the multiplication is accumulated into the local tile of HTA c
• Tiles of a are shifted cyclically along the second dimension and tiles of b along the first: each tile of a moves to the processor on its left in the mesh, and each tile of b to the processor above it
• At the wraparound, the leftmost processor in each row sends its tile of a to the rightmost processor in that row, and the topmost processor in each column sends its tile of b to the bottommost processor in that column
• The end result is that HTA c = a*b
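The shift pattern can be simulated sequentially: each "processor" owns one tile of a and one of b, and every step multiplies-accumulates its local tiles, then rotates a-tiles left and b-tiles up. The loop above assumes the tiles are already aligned; the sketch below also performs Cannon's initial skew (row i of a shifted left by i, column j of b shifted up by j). This is an illustrative simulation, with rotations standing in for the MPI communication.

```cpp
#include <vector>
#include <algorithm>
#include <cstddef>

using Tile = std::vector<double>;  // dense s x s block, row-major

// c += a * b for s x s tiles.
void mulAdd(Tile& c, const Tile& a, const Tile& b, std::size_t s) {
    for (std::size_t i = 0; i < s; ++i)
        for (std::size_t k = 0; k < s; ++k)
            for (std::size_t j = 0; j < s; ++j)
                c[i * s + j] += a[i * s + k] * b[k * s + j];
}

// Sequential simulation of Cannon's algorithm on a p x p mesh of tiles.
// aT/bT hold one tile per "processor", row-major over the mesh.
std::vector<Tile> cannon(std::vector<Tile> aT, std::vector<Tile> bT,
                         std::size_t p, std::size_t s) {
    // Initial alignment: row i of a rotates left by i,
    // column j of b rotates up by j.
    for (std::size_t i = 0; i < p; ++i)
        std::rotate(aT.begin() + i * p, aT.begin() + i * p + i,
                    aT.begin() + (i + 1) * p);
    for (std::size_t j = 0; j < p; ++j) {
        std::vector<Tile> col(p);
        for (std::size_t i = 0; i < p; ++i) col[i] = bT[((i + j) % p) * p + j];
        for (std::size_t i = 0; i < p; ++i) bT[i * p + j] = col[i];
    }
    std::vector<Tile> cT(p * p, Tile(s * s, 0.0));
    for (std::size_t step = 0; step < p; ++step) {
        // Local multiply-accumulate on every "processor".
        for (std::size_t t = 0; t < p * p; ++t) mulAdd(cT[t], aT[t], bT[t], s);
        // circshift(a, [0,-1]): every row of a-tiles rotates one step left.
        for (std::size_t i = 0; i < p; ++i)
            std::rotate(aT.begin() + i * p, aT.begin() + i * p + 1,
                        aT.begin() + (i + 1) * p);
        // circshift(b, [-1,0]): every column of b-tiles rotates one step up.
        for (std::size_t j = 0; j < p; ++j) {
            Tile top = bT[j];
            for (std::size_t i = 0; i + 1 < p; ++i)
                bT[i * p + j] = bT[(i + 1) * p + j];
            bT[(p - 1) * p + j] = top;
        }
    }
    return cT;
}
```

With 1x1 tiles this degenerates to plain matrix multiplication, which makes it easy to check by hand.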
Parallel Programming using HTA: Estimating pi with the Monte Carlo Method

input = hta(P, 1, [P 1]);
input{:} = eP;
output = parHTAFunc(@estimatePi, input);
myPi = mean(output(:));

function r = estimatePi(n)
  x = rand(1, n);
  y = rand(1, n);
  pos = x .* x + y .* y;
  r = sum(pos < 1) / n * 4;

• A distributed HTA input with one tile per processor is built
• Each tile is filled with eP, the number of experiments to run per processor
• parHTAFunc runs the function estimatePi() on every processor
• The result of the parallel execution is a distributed HTA output with the same mapping as input, holding one tile per processor with that processor's local estimate of pi
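The per-processor kernel draws n random points in the unit square and counts the fraction that lands inside the quarter circle. A sequential C++ version of the same kernel, using a seeded standard generator in place of MATLAB's rand (the name estimate_pi and the default seed are illustrative choices):

```cpp
#include <random>
#include <cstddef>

// Estimate pi by sampling n uniform points in the unit square and
// counting those inside the quarter circle x^2 + y^2 < 1; that fraction
// approximates pi/4.
double estimate_pi(std::size_t n, unsigned seed = 42) {
    std::mt19937 gen(seed);
    std::uniform_real_distribution<double> u(0.0, 1.0);
    std::size_t inside = 0;
    for (std::size_t i = 0; i < n; ++i) {
        double x = u(gen), y = u(gen);
        if (x * x + y * y < 1.0) ++inside;
    }
    return 4.0 * double(inside) / double(n);
}
```

The parallel version simply runs this kernel independently on every processor (each with its own stream of random numbers) and averages the per-processor estimates, exactly what mean(output(:)) does above.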
Implementation
• MATLAB (matrix laboratory)
  ▫ A numerical computing environment and fourth-generation programming language
  ▫ Started as a wrapper around Fortran libraries
MATLAB – Internal Structure
• MATLAB is used on the client, where the code is executed
• On the servers, MATLAB is used as a computational engine for the distributed operations on HTAs
MATLAB – Internal Structure (cont)
• All communication is done through MPI
  ▫ The lower layers of the HTA toolbox take care of the communication requirements
  ▫ The higher layers implement the syntax expected by MATLAB users
• HTA programs have a single thread that is interpreted and executed by the MATLAB client
• HTAs are just objects within the environment of the interpreter
MATLAB – Internal Structure (cont)
• When an HTA is local, it is not distributed on the array servers
  ▫ The client keeps both the structure and the contents of the HTA
• When an HTA is distributed
  ▫ The client holds the structure of the HTA at all its levels
  ▫ It also keeps the mapping of the top-level tiles onto the mesh of servers
• Whether the HTA is local or distributed, the client is always able to:
  ▫ Test the legality of an operation
  ▫ Calculate the structure and mapping of the output HTA
  ▫ Send messages that encode the command and its arguments to the servers
MATLAB
• Is a linear-algebra language with a large base of users who write scientific code
  ▫ HTA lets these users harness the power of a cluster of workstations instead of a single machine
• Is polymorphic
  ▫ HTAs can substitute for regular arrays almost without changing the rest of the code, thereby adding parallelism painlessly
• Is designed to be extensible
  ▫ Third-party developers can provide so-called toolboxes of functions for specialized purposes
• Provides a native-code mechanism called MEX
  ▫ Allows functions to be implemented in languages such as C or Fortran
C++ Implementation
• HTAs can be used from C++ through the htalib library
  ▫ HTAs are represented as composite objects with methods that operate on both distributed and sequential HTAs
  ▫ Two communication layers are available: MPI and UPC
  ▫ Execution follows the SPMD model while the programming model remains single-threaded
• Core data structures of htalib:
  ▫ Logical index space
  ▫ HTA class
  ▫ Machine mapping
  ▫ Operator framework
C++ Implementation: Logical Index Space
• Classes used to define the index space and tiling of an HTA:
  ▫ Tuple<N> // an N-dimensional index value
  ▫ Triplet // a 1-D range with optional stride (low:high:step)
  ▫ Region<N> // an N-dimensional rectangular index space spanned by N triplets
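Under stated assumptions, here is a simplified sketch of what a Triplet (low:high:step) and a Region spanned by triplets might look like. These are illustrative stand-ins restricted to two dimensions, not the real htalib Triplet and Region<N> classes.

```cpp
#include <vector>
#include <utility>

// A 1-D index range low:high:step (simplified Triplet; step assumed to
// divide the range evenly).
struct Triplet {
    int low, high, step;
    int size() const { return (high - low) / step + 1; }
};

// A 2-D rectangular index space spanned by two triplets (a Region<2>
// reduced to fixed dimensionality).
struct Region2 {
    Triplet rows, cols;
    int size() const { return rows.size() * cols.size(); }
    // Enumerate every (i, j) pair covered by the region.
    std::vector<std::pair<int, int>> indices() const {
        std::vector<std::pair<int, int>> out;
        for (int i = rows.low; i <= rows.high; i += rows.step)
            for (int j = cols.low; j <= cols.high; j += cols.step)
                out.push_back({i, j});
        return out;
    }
};
```

A region like {0:4:2, 1:3:1} covers the 3 x 3 grid of indices with rows {0, 2, 4} and columns {1, 2, 3}; this is the kind of strided rectangular selection HTA operations are defined over.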
C++ Implementation: HTA Class
• Defines an HTA with scalar elements of type T and N dimensions
• Provides scalar access (operator[]) and tile access (operator())
• Built-in array operations:
  ▫ transpose
  ▫ permute
  ▫ dpermute
  ▫ reduce
C++ Implementation: Machine Mapping
• Specifies where the HTA is allocated in a distributed system
• Memory layout of the scalar data array
  ▫ Captured by instances of a distribution class that specifies the home location of the scalar data for each tile of an HTA
• Memory mapping
  ▫ Specifies the layout (e.g., row-major across tiles, row-major per tile)
  ▫ Specifies the size and stride of the flat data array underlying the HTA
C++ Implementation: Operator Framework
• htalib provides a powerful operator framework following the design of the STL function-object classes
  ▫ Consists of routines that evaluate specific operators on HTAs, plus base classes that serve as a foundation for user-defined operators
C++ Optimizations
• HTAs are represented as composite objects
  ▫ Methods are available for both distributed and local HTAs
  ▫ Two implementations: MPI and UPC
• Performance optimizations:
  ▫ Automatic memory management
  ▫ Template classes
  ▫ Specialized methods
  ▫ Inlined hot methods
  ▫ Lazy evaluation
  ▫ Relaxation of serial evaluation semantics
C++ Optimizations: Automatic Memory Management
• HTAs are allocated on the heap through factory methods, enabling automatic memory management
  ▫ The factory methods return a handle which is assigned to a stack-allocated variable
  ▫ All access occurs through this handle
  ▫ Once all handles to an HTA disappear from the stack, the HTA and its related structures are automatically deleted from memory
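This handle scheme matches reference-counted smart pointers: heap allocation through a factory, stack-held handles, and automatic deletion when the last handle goes out of scope. A sketch of the idea using std::shared_ptr (ToyHTA is an illustrative class, not htalib's actual handle type):

```cpp
#include <memory>
#include <vector>
#include <cstddef>

// Toy HTA allocated on the heap through a factory method; callers only
// ever hold reference-counted handles on the stack. When the last handle
// is destroyed, the object and its storage are freed automatically.
class ToyHTA {
public:
    static std::shared_ptr<ToyHTA> alloc(std::size_t n) {
        return std::shared_ptr<ToyHTA>(new ToyHTA(n));
    }
    std::size_t size() const { return data_.size(); }
private:
    explicit ToyHTA(std::size_t n) : data_(n, 0.0) {}  // heap-only: ctor is private
    std::vector<double> data_;
};
```

Copying a handle bumps the reference count; letting a copy go out of scope drops it, and the object disappears only when the count reaches zero.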
C++ Optimizations: Template Classes
• Used in htalib to handle data of different types and dimensionalities
• Provide flexibility and opportunities for optimization at compile time
• The element type of an HTA can be any built-in or user-defined type
C++ Optimizations: Specialized Methods
• Methods are optimized and, whenever possible, specialized for specific cases
• For example, a specialized method that avoids multiplication by the stride is invoked when the data being accessed is known to be stored in consecutive locations
C++ Optimizations: Inlined Hot Methods / Lazy Evaluation
• Inlined hot methods
  ▫ Inlining is applied to frequently used methods
  ▫ The tile-access and scalar-access functions are carefully inlined to reduce the overhead of function calls
• Lazy evaluation
  ▫ When the RHS of an HTA assignment contains more than one variable, htalib uses lazy evaluation to avoid or reduce the temporaries generated by chains of binary operations
  ▫ If the LHS and RHS have no data dependences, the RHS is then evaluated directly into the LHS
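The lazy-evaluation idea of building an expression object for the RHS and evaluating it element by element into the LHS, with no intermediate arrays, is the classic expression-template technique. A minimal sketch for vector addition follows; Vec and Sum are illustrative types, not htalib's implementation.

```cpp
#include <vector>
#include <cstddef>

// Expression node representing l + r lazily: no storage is allocated,
// elements are computed on demand.
template <typename L, typename R>
struct Sum {
    const L& l;
    const R& r;
    double operator[](std::size_t i) const { return l[i] + r[i]; }
};

struct Vec {
    std::vector<double> d;
    double operator[](std::size_t i) const { return d[i]; }
    // Assignment from any expression: a single loop writes directly into
    // this vector, so a + b + c creates no temporary vectors.
    template <typename E>
    Vec& operator=(const E& e) {
        for (std::size_t i = 0; i < d.size(); ++i) d[i] = e[i];
        return *this;
    }
};

// Building a Sum node is cheap; evaluation is deferred until assignment.
template <typename L, typename R>
Sum<L, R> operator+(const L& l, const R& r) { return {l, r}; }
```

With an eager design, c = a + b + a would allocate a temporary for a + b; here the whole RHS stays a lightweight expression tree until operator= runs one fused loop.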
C++ Examples: Matrix Multiplication
[code figure not transcribed]
C++ Optimizations: Relaxation of Serial Evaluation Semantics
• htalib provides a mechanism to temporarily relax the serial evaluation ordering
  ▫ This helps overlap different communications, and communications with computations

htalib::async();
B(1:n)[0] = B(0:n-1)[d];
B(0:n-1)[d+1] = B(1:n)[1];
htalib::sync();

  ▫ The code above shows the boundary exchange in a 1-D Jacobi solver
  ▫ Since there is no data dependence between the assignments, both statements can proceed concurrently
  ▫ This is achieved through the runtime calls to async and sync
C++ Examples: Matrix Transposition
[code figure not transcribed]
Current State
• HTA library support is currently being extended to handle:
  ▫ Sparse data partitioning
  ▫ Hierarchical place trees
• The library is continually optimized to improve performance
• New host languages are being explored as targets, to increase productivity
My Thoughts
• By increasing the productivity of parallel computing, we can harness processing power across many machines
• Powerful library tools make for more powerful programming languages
• HTA is something I will keep in mind for my future programming projects