Hierarchically Tiled Arrays

Hierarchically Tiled Arrays Presented by, Kenneth Detweiler


Transcript of Hierarchically Tiled Arrays

Page 1: Hierarchically Tiled Arrays

Hierarchically Tiled Arrays 

Presented by,Kenneth Detweiler

Page 2: Hierarchically Tiled Arrays

Overview
• Design and Creation
• Background
  ▫ Parallel Operators
  ▫ Tiling
  ▫ Locality
• HTA Illustrated
  ▫ Creation
  ▫ Data Layout
  ▫ Accessing
• Implementation
  ▫ Sparse Matrix
  ▫ Cannon Algorithm
  ▫ Estimate Pi Using the Monte Carlo Method
• MATLAB
  ▫ Internal Structure
  ▫ What it does
• C++ Implementation
  ▫ Logical Index Space
  ▫ HTA Class
  ▫ Machine Mapping
  ▫ Operator Framework
• C++ Optimizations
  ▫ Automatic Memory Management
  ▫ Template Class
  ▫ Specialized Methods
  ▫ Inlined Hot Methods
  ▫ Lazy Evaluation
  ▫ Relaxation of serial evaluation
• Current State of HTA
• My Thoughts

Page 3: Hierarchically Tiled Arrays

Design and Creation
• Developed in 2006 by:
  ▫ University of A Coruña
  ▫ University of Illinois
  ▫ IBM
• IBM: Yorktown Heights, NY
• UC: A Coruña, Spain
• UIUC: Urbana-Champaign, Illinois

Page 4: Hierarchically Tiled Arrays

Background
• A data type built on higher-level data-parallel operators
  ▫ Hierarchical tiling
• HTA makes tiles a part of the programming language
  ▫ Tiling helps control locality and data distribution
  ▫ Tiles are referenced explicitly in the program
  ▫ Operators are extended to work on tiles
• C++ library implementation
  ▫ Distributed memory using MPI
  ▫ Shared memory using TBB (Intel Threading Building Blocks)

Page 5: Hierarchically Tiled Arrays

Parallel Operators
• To exploit the power of a supercomputer, developers usually have to resort to low-level parallel constructs.
  ▫ This is both time-consuming and error-prone.
• The solution is parallel operators, which exist simultaneously on all processors taking part in the distributed computation.
  ▫ Each acts as a single entity capable of processing shared data in parallel.

Page 6: Hierarchically Tiled Arrays

Hierarchical Tiling
• Arrays partitioned into tiles, which may themselves be tiled
• Exploits parallelism and locality at all levels of the memory hierarchy
• Can be represented iteratively or recursively
• Implemented during program design
• Cuts costs in parallel programs by organizing computations between tiles

Page 7: Hierarchically Tiled Arrays

Locality
• Also known as locality of reference or the principle of locality
• Describes the tendency of a program to access the same value, or related storage locations, frequently
• Two major types:
  ▫ Temporal: if a particular memory location is referenced, it is likely that the same location will be referenced again in the near future.
  ▫ Spatial: if a particular memory location is referenced, it is likely that nearby memory locations will also be referenced.

Page 8: Hierarchically Tiled Arrays

HTA, Illustrated

Page 9: Hierarchically Tiled Arrays

HTA, Creation
• Need:
  ▫ Source array
  ▫ Series of delimiters
• Function:
  ▫ hta(MATRIX, {[delim1],[delim2],…}, [processor mesh])
• Example:
  ▫ h = hta(a, {[1,4],[1,3]}, [2,2]);
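The effect of the delimiters in the example can be sketched in plain Python. This is only an illustration, not the toolbox's code: it assumes each delimiter is the 1-based first row or column of a tile, the helper name `tile` is hypothetical, and the processor mesh argument is ignored.

```python
def tile(matrix, row_starts, col_starts):
    """Split a 2D list into tiles; starts are 1-based first indices of each tile."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    # Convert the 1-based starts into 0-based half-open boundaries.
    row_bounds = [s - 1 for s in row_starts] + [n_rows]
    col_bounds = [s - 1 for s in col_starts] + [n_cols]
    tiles = []
    for r0, r1 in zip(row_bounds, row_bounds[1:]):
        tile_row = []
        for c0, c1 in zip(col_bounds, col_bounds[1:]):
            tile_row.append([row[c0:c1] for row in matrix[r0:r1]])
        tiles.append(tile_row)
    return tiles


# A 6x6 source array whose entry at row r, column c (1-based) is 10*r + c.
a = [[10 * r + c for c in range(1, 7)] for r in range(1, 7)]

# Mirrors h = hta(a, {[1,4],[1,3]}, [2,2]) without the mesh:
# rows are cut before row 4, columns before column 3.
h = tile(a, [1, 4], [1, 3])
```

Under that assumption the result is a 2x2 grid of tiles: the upper-left tile is 3x2 and the lower-right tile is 3x4.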

Page 10: Hierarchically Tiled Arrays

HTA, Data Layout
• The innermost tiles are the leaf tiles.
• Leaf tiles can use one of three layouts:
  ▫ ROW
  ▫ COLUMN
  ▫ TILE: stores the elements of a tile in contiguous memory locations
• The memory mapping of an HTA is determined by:
  ▫ How tiles are allocated
  ▫ The memory layout of the tiles

Page 11: Hierarchically Tiled Arrays

HTA, Accessing
• Tiles:
  ▫ C{2,2} refers to a tile, here the lower-right quadrant
• Elements:
  ▫ C(2,2) refers to a single element, here the one at row 2, column 2
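The difference between the two kinds of access can be sketched in plain Python. The class `Tiled2x2` and its method names are hypothetical; the sketch assumes a single level of tiling into four equal quadrants and 1-based indices, as in the MATLAB notation.

```python
class Tiled2x2:
    """A matrix cut into four equal quadrants (a one-level, 2x2 tiling)."""

    def __init__(self, matrix):
        self.matrix = matrix
        self.half_rows = len(matrix) // 2
        self.half_cols = len(matrix[0]) // 2

    def tile(self, i, j):
        """Like C{i,j}: return quadrant (i, j), 1-based."""
        r0 = (i - 1) * self.half_rows
        c0 = (j - 1) * self.half_cols
        return [row[c0:c0 + self.half_cols]
                for row in self.matrix[r0:r0 + self.half_rows]]

    def element(self, i, j):
        """Like C(i,j): return the scalar at global row i, column j, 1-based."""
        return self.matrix[i - 1][j - 1]


C = Tiled2x2([[1, 2, 3, 4],
              [5, 6, 7, 8],
              [9, 10, 11, 12],
              [13, 14, 15, 16]])
lower_right = C.tile(2, 2)   # a whole tile
scalar = C.element(2, 2)     # a single element
```

The same pair of indices therefore selects a 2x2 block in one case and one number in the other.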

Page 12: Hierarchically Tiled Arrays

Parallel Programming using HTA – Sparse Matrix
    a = hta(MX, {dist}, [P 1]);
    b = hta(P, 1, [P 1]);
    b{:} = V;
    r = a * b;
• Requires communication between the client and each server.
• Multiplies a sparse matrix MX by a dense vector V using P processors.
• Calling the constructor distributes MX into an HTA in chunks of rows.
• The P servers handling the HTA are organized into a single column.
• The dist argument distributes the array MX so that the computation is spread uniformly across the servers.
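The row-chunk scheme described above can be simulated sequentially in plain Python. This is a rough sketch with hypothetical helper names: each chunk stands in for the tile held by one server, sparse rows are stored as dictionaries of nonzeros, and the rows are simply split evenly, whereas the dist argument in the slide balances the actual computation.

```python
def split_rows(matrix, parts):
    """Cut the rows into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(matrix), parts)
    chunks, start = [], 0
    for p in range(parts):
        size = base + (1 if p < extra else 0)
        chunks.append(matrix[start:start + size])
        start += size
    return chunks


def sparse_row_times_vector(row, v):
    """row is a dict {column: value} holding only the nonzeros."""
    return sum(value * v[col] for col, value in row.items())


def distributed_matvec(mx, v, p):
    result = []
    for chunk in split_rows(mx, p):      # one chunk per simulated server
        local = [sparse_row_times_vector(row, v) for row in chunk]
        result.extend(local)             # the client gathers the local results
    return result


mx = [{0: 2.0}, {1: 1.0, 3: 4.0}, {}, {0: 1.0, 2: 3.0}, {3: 5.0}]
v = [1.0, 2.0, 3.0, 4.0]
r = distributed_matvec(mx, v, 2)
```

Every server needs the whole vector V but only its own rows of MX, which is why the slide's b{:} = V copies V into every tile of b.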

Page 13: Hierarchically Tiled Arrays

Parallel Programming using HTA – Cannon's Algorithm
    for i = 1:n
      c = c + a * b;
      a = circshift(a, [0, -1]);
      b = circshift(b, [-1, 0]);
    end
• Requires communication between the client and the servers, and also among the servers.
• In each iteration of the loop, each server multiplies the tiles of a and b that currently reside on it.
• The result of the multiplication is accumulated in the local tile of the HTA c.
• Tiles of a are then shifted along the second dimension and tiles of b along the first, so each tile of a is sent to the processor on its left in the mesh and each tile of b to the processor above it.
• The leftmost processor sends its tile of a to the rightmost processor in its row, and the topmost processor sends its tile of b to the bottommost processor in its column.
• The end result is that HTA c = a * b.
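The loop can be checked with a sequential simulation in plain Python, where a grid of nested lists stands in for the processor mesh. All function names are hypothetical. The sketch also performs the initial alignment (skew) that Cannon's algorithm requires before the loop, which the slide's fragment does not show and is assumed to have happened earlier.

```python
def matmul(x, y):
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def matadd(x, y):
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def to_tiles(m, n):
    """Cut a square matrix into an n x n grid of equal square tiles."""
    s = len(m) // n
    return [[[row[j * s:(j + 1) * s] for row in m[i * s:(i + 1) * s]]
             for j in range(n)] for i in range(n)]


def from_tiles(grid):
    rows = []
    for tile_row in grid:
        for r in range(len(tile_row[0])):
            rows.append([x for t in tile_row for x in t[r]])
    return rows


def cannon(a, b):
    """a, b: n x n grids of square tiles. Returns the grid of tiles of a * b."""
    n = len(a)
    size = len(a[0][0])
    # Initial alignment: shift row i of a left by i and column j of b up by j.
    a = [[a[i][(j + i) % n] for j in range(n)] for i in range(n)]
    b = [[b[(i + j) % n][j] for j in range(n)] for i in range(n)]
    c = [[[[0] * size for _ in range(size)] for _ in range(n)] for _ in range(n)]
    for _ in range(n):
        # Each simulated processor (i, j) multiplies the tiles it currently holds.
        c = [[matadd(c[i][j], matmul(a[i][j], b[i][j])) for j in range(n)]
             for i in range(n)]
        # circshift(a, [0, -1]): every tile of a moves one processor to the left.
        a = [[a[i][(j + 1) % n] for j in range(n)] for i in range(n)]
        # circshift(b, [-1, 0]): every tile of b moves one processor up.
        b = [[b[(i + 1) % n][j] for j in range(n)] for i in range(n)]
    return c


A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 3], [4, 0, 1, 0], [0, 5, 0, 1]]
product = from_tiles(cannon(to_tiles(A, 2), to_tiles(B, 2)))
```

After n iterations every processor has seen each matching pair of tiles exactly once, so `product` equals the ordinary matrix product of A and B.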

Page 14: Hierarchically Tiled Arrays

Parallel Programming using HTA – Estimate Pi Using the Monte Carlo Method
    input = hta(P, 1, [P 1]);
    input{:} = eP;
    output = parHTAFunc(@estimatePi, input);
    myPi = mean(output(:));

    function r = estimatePi(n)
      x = randx(1, n);
      y = randx(1, n);
      pos = x .* x + y .* y;
      r = sum(pos < 1) / n * 4;
• A distributed HTA, input, is built with one tile per processor.
• The tiles are filled with eP, the number of experiments to run per processor.
• The experiments are run on each processor by the function estimatePi().
• The result of the parallel execution is a distributed HTA, output, that has the same mapping as input and keeps a single tile per processor holding the local estimate of pi.
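The same computation can be sketched sequentially in plain Python. The names are hypothetical, a list stands in for the distributed HTA output, and each simulated processor gets its own seeded generator so the run is repeatable.

```python
import random


def estimate_pi(n, rng):
    """Fraction of n random points in the unit square that fall inside the quarter circle, times 4."""
    inside = 0
    for _ in range(n):
        x, y = rng.random(), rng.random()
        if x * x + y * y < 1:
            inside += 1
    return inside / n * 4


def parallel_estimate(num_procs, experiments_per_proc, seed=0):
    # One independent generator per simulated processor.
    outputs = [estimate_pi(experiments_per_proc, random.Random(seed + p))
               for p in range(num_procs)]
    # Corresponds to myPi = mean(output(:)).
    return sum(outputs) / len(outputs)


my_pi = parallel_estimate(4, 20000)
```

Because every processor runs the same number of experiments, the mean of the local estimates equals the estimate over all experiments combined.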

Page 15: Hierarchically Tiled Arrays

Implementation
• MATLAB (matrix laboratory)
  ▫ A numerical computing environment and fourth-generation programming language
  ▫ Started as a wrapper around Fortran libraries

Page 16: Hierarchically Tiled Arrays

MATLAB – Internal Structure
• MATLAB is used on the client, where the code is executed.
• MATLAB is also used on the servers, as the computational engine for the distributed operations on HTAs.

Page 17: Hierarchically Tiled Arrays

MATLAB – Internal Structure (cont)
• All communication is done through MPI
  ▫ The lower layers of the HTA toolbox take care of the communication requirements.
  ▫ The higher layers implement the syntax expected by MATLAB users.
• HTA programs have a single thread that is interpreted and executed by the MATLAB client
• HTAs are just objects within the environment of the interpreter

Page 18: Hierarchically Tiled Arrays

MATLAB – Internal Structure (cont)
• When an HTA is local, it is not distributed across the array servers
  ▫ The client keeps both the structure and the content of the HTA
• When an HTA is distributed
  ▫ The client holds the structure of the HTA at all its levels
  ▫ The client keeps all the information about the mapping of the top-level tiles onto the mesh of servers
• Whether the HTA is local or distributed, the client is always able to:
  ▫ Test the legality of an operation
  ▫ Calculate the structure and mapping of the output HTA
  ▫ Send messages that encode the command and its arguments to the servers

Page 19: Hierarchically Tiled Arrays

MATLAB
• Is a linear algebra language with a large base of users who write scientific code
  ▫ HTA lets these users harness the power of a cluster of workstations instead of a single machine
• Is polymorphic
  ▫ HTAs can substitute for regular arrays almost without changing the rest of the code, adding parallelism painlessly
• Is designed to be extensible
  ▫ Third-party developers can provide so-called toolboxes of functions for specialized purposes
• Provides a native method interface called MEX
  ▫ Allows functions to be implemented in languages like C, C++, or Fortran

Page 20: Hierarchically Tiled Arrays

C++ Implementations
• HTAs can be used in C++ through the htalib library
  ▫ HTAs are represented as composite objects with methods that operate on both distributed and sequential HTAs
  ▫ Two communication layers are available: MPI and UPC
  ▫ The implementation follows the SPMD execution model, while the programming model is still single-threaded
• Core data structures of htalib
  ▫ Logical index space
  ▫ HTA class
  ▫ Machine mapping
  ▫ Operator framework

Page 21: Hierarchically Tiled Arrays

C++ Implementations – Logical Index Space
• Classes used to define the index space and tiling of an HTA
  ▫ Tuple<N> // an N-dimensional index value
  ▫ Triplet // a 1D range with optional stride (low:high:step)
  ▫ Region<N> // an N-dimensional rectangular index space spanned by N triplets
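The relationship between the three classes can be sketched in Python. These are illustrative stand-ins, not htalib's C++ classes: the field and method names are hypothetical, and the upper bound of a triplet is assumed to be inclusive, as in low:high:step.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Triplet:
    low: int
    high: int      # assumed inclusive
    step: int = 1  # the optional stride

    def indices(self):
        return range(self.low, self.high + 1, self.step)


@dataclass(frozen=True)
class Region:
    triplets: tuple  # one Triplet per dimension

    def size(self):
        return tuple(len(t.indices()) for t in self.triplets)

    def points(self):
        # Each point is an N-dimensional index, the role Tuple<N> plays above.
        return product(*(t.indices() for t in self.triplets))


# Rows 0, 2, 4 and columns 1, 2, 3.
region = Region((Triplet(0, 4, 2), Triplet(1, 3)))
points = list(region.points())
```

Here `region.size()` is `(3, 3)` and `points` runs from `(0, 1)` to `(4, 3)`.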

Page 22: Hierarchically Tiled Arrays

C++ Implementations – HTA Class
• Defines an HTA with scalar elements of type T and N dimensions
• Implements the data type's access operations
  ▫ Scalar access (operator[])
  ▫ Tile access (operator())
• Built-in array operations
  ▫ transpose
  ▫ permute
  ▫ dpermute
  ▫ reduce

Page 23: Hierarchically Tiled Arrays

C++ Implementation – Machine Mapping
• Specifies where the HTA is allocated in a distributed system
  ▫ Captured by instances of class Distribution, which specify the home location of the scalar data for each of the tiles of an HTA
• Specifies the memory layout of the scalar data array
  ▫ The memory mapping gives the layout (row-major across tiles, row-major per tile)
  ▫ It also gives the size and stride of the flat data array underlying the HTA
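One layout named above, row-major across tiles and row-major within each tile, can be written as an offset calculation. This is a sketch under simplifying assumptions (2D array, uniform tile size, 0-based indices), and the function name is hypothetical rather than part of htalib.

```python
def tiled_offset(i, j, n_cols, tile_rows, tile_cols):
    """Offset of element (i, j) in the flat data array for a uniformly tiled 2D array."""
    tiles_per_row = n_cols // tile_cols
    tile_i, in_i = divmod(i, tile_rows)   # which tile row, and row inside the tile
    tile_j, in_j = divmod(j, tile_cols)   # which tile column, and column inside the tile
    tile_index = tile_i * tiles_per_row + tile_j      # row-major across tiles
    return (tile_index * tile_rows * tile_cols        # skip the tiles that come first
            + in_i * tile_cols + in_j)                # row-major inside the tile


# A 4x4 array cut into 2x2 tiles: the four elements of each tile are contiguous.
offsets = [[tiled_offset(i, j, 4, 2, 2) for j in range(4)] for i in range(4)]
```

In this example `offsets` is `[[0, 1, 4, 5], [2, 3, 6, 7], [8, 9, 12, 13], [10, 11, 14, 15]]`, so each tile occupies one contiguous run of four locations.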

Page 24: Hierarchically Tiled Arrays

C++ Implementation – Operator Framework
• htalib provides a powerful operator framework that follows the design of the STL operator classes
  ▫ Consists of routines that evaluate specific operators on HTAs
  ▫ Its base classes serve as a foundation for user-defined operators

Page 25: Hierarchically Tiled Arrays

C++ Optimizations
• HTAs are represented as composite objects
  ▫ Methods are available for both distributed and local HTAs
  ▫ Two implementations: MPI and UPC
• Performance optimizations
  ▫ Automatic memory management
  ▫ Template class
  ▫ Specialized methods
  ▫ Inlined hot methods
  ▫ Lazy evaluation
  ▫ Relaxation of serial evaluation semantics

Page 26: Hierarchically Tiled Arrays

C++ Optimizations – Automatic Memory Management
• HTAs are allocated on the heap through factory methods, which enables automatic memory management
  ▫ A factory method returns a handle, which is assigned to a stack-allocated variable
  ▫ All access occurs through this handle
  ▫ Once all handles to an HTA disappear from the stack, the HTA and its related structures are automatically deleted from memory

Page 27: Hierarchically Tiled Arrays

C++ Optimizations – Template Class
• Template classes are used in htalib to handle data of different types and dimensions
• They provide flexibility and opportunities for optimization at compile time
• The element type of an HTA can be any built-in or user-defined type

Page 28: Hierarchically Tiled Arrays

C++ Implementation – Specialized Methods
• Methods are optimized and, whenever possible, specialized for specific cases
• For example, a specialized method that avoids multiplication by the stride is invoked when the data being accessed is known to be stored in consecutive locations.
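The idea of choosing a stride-free access path can be sketched in Python. The sketch only illustrates the dispatch: `make_reader` is a hypothetical name, and in htalib the specialization is done in C++, where the saved multiplication actually matters for performance.

```python
def make_reader(data, start, stride):
    """Return a function mapping k to the k-th element of a strided view of a flat list."""
    if stride == 1:
        # Specialized case: elements are consecutive, so no multiplication by the stride.
        def read(k):
            return data[start + k]
    else:
        # General case: every access multiplies the index by the stride.
        def read(k):
            return data[start + k * stride]
    return read


flat = list(range(100, 120))
contiguous = make_reader(flat, 5, 1)
strided = make_reader(flat, 0, 4)
values = (contiguous(2), strided(2))
```

The choice is made once, when the view is created, rather than on every element access.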

Page 29: Hierarchically Tiled Arrays

C++ Implementation – Inlined Hot Methods / Lazy Evaluation
• Inlined hot methods
  ▫ Inlining is applied to methods that are used frequently
  ▫ The tile access and scalar access functions are carefully inlined to reduce function-call overhead
• Lazy evaluation
  ▫ In an HTA assignment whose RHS involves more than one variable, htalib uses lazy evaluation to avoid or reduce the temporaries generated by one or more binary operations
  ▫ If the LHS and RHS have no data dependences, the RHS is evaluated directly into the LHS
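A minimal sketch of this kind of lazy evaluation, in Python rather than in htalib's C++: each binary operation builds a deferred expression instead of a temporary vector, and the assignment walks the data once. The names `Lazy`, `vec`, and `assign_to` are hypothetical.

```python
class Lazy:
    """A deferred element-wise expression over equal-length vectors."""

    def __init__(self, length, at):
        self.length = length
        self.at = at  # at(k) computes element k on demand

    def __add__(self, other):
        return Lazy(self.length, lambda k: self.at(k) + other.at(k))

    def __mul__(self, other):
        return Lazy(self.length, lambda k: self.at(k) * other.at(k))

    def assign_to(self, target):
        # One pass over the data; no temporary vector per binary operation.
        for k in range(self.length):
            target[k] = self.at(k)


def vec(values):
    return Lazy(len(values), lambda k: values[k])


a, b, c = [1, 2, 3], [10, 20, 30], [2, 2, 2]
out = [0, 0, 0]
(vec(a) + vec(b) * vec(c)).assign_to(out)
```

Writing straight into `out` is safe here because `out` does not appear on the right-hand side. If the target were also an operand read at a shifted index, a temporary would still be needed.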

Page 30: Hierarchically Tiled Arrays

C++ Examples – Matrix Multiplication

Page 31: Hierarchically Tiled Arrays

C++ Implementation – Relaxation of Serial Evaluation Semantics
• htalib provides a mechanism to temporarily relax the serial evaluation ordering
  ▫ Helps overlap different communications with each other, and communication with computation
    htalib::async();
    B(1:n)[0] = B(0:n-1)[d];
    B(0:n-1)[d+1] = B(1:n)[1];
    htalib::sync();
  ▫ The code above shows the boundary exchange in a 1D Jacobi computation
  ▫ There is no data dependence between the two assignments, so both statements can proceed concurrently
  ▫ This is achieved through the runtime calls to async and sync
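The boundary exchange can be imitated in Python with two threads. This is an analogy, not htalib's mechanism: each tile is a list whose positions 0 and d+1 are ghost cells and whose positions 1 to d are interior cells, the function names are hypothetical, and a thread pool stands in for async() and sync().

```python
from concurrent.futures import ThreadPoolExecutor


def fill_left_ghosts(tiles, d):
    # B(1:n)[0] = B(0:n-1)[d]: the left ghost takes the left neighbour's last interior cell.
    for t in range(1, len(tiles)):
        tiles[t][0] = tiles[t - 1][d]


def fill_right_ghosts(tiles, d):
    # B(0:n-1)[d+1] = B(1:n)[1]: the right ghost takes the right neighbour's first interior cell.
    for t in range(len(tiles) - 1):
        tiles[t][d + 1] = tiles[t + 1][1]


def exchange(tiles, d):
    with ThreadPoolExecutor(max_workers=2) as pool:    # plays the role of async()
        futures = [pool.submit(fill_left_ghosts, tiles, d),
                   pool.submit(fill_right_ghosts, tiles, d)]
        for f in futures:                              # plays the role of sync()
            f.result()


d = 3
tiles = [[0, 1, 2, 3, 0], [0, 4, 5, 6, 0], [0, 7, 8, 9, 0]]
exchange(tiles, d)
```

One statement writes only position 0 and reads position d, the other writes only position d+1 and reads position 1, so the two can run in either order or at the same time with the same result.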

Page 32: Hierarchically Tiled Arrays

C++ Examples – Matrix Transposition

Page 33: Hierarchically Tiled Arrays

State
• HTA library support is currently being extended to cover
  ▫ Sparse data partitioning
  ▫ Hierarchical Place Trees
• The library is continually optimized to increase performance
• New ways to bring HTAs into other languages are being explored to increase productivity

Page 34: Hierarchically Tiled Arrays

My Thoughts
• By increasing the productivity of parallel computing, we can make better use of processor power across machines
• Powerful library tools make for more powerful programming languages
• HTA is something that I will keep in mind in my future programming projects