Hierarchically Tiled Arrays (HTAs)

Anthony Delprete, Jason Mckean, Ryan Pineres, Chris Olszewski


Page 1: Hierarchically Tiled Arrays (HTAs)

Anthony Delprete, Jason Mckean, Ryan Pineres, Chris Olszewski

Hierarchically Tiled Arrays (HTAs)

Page 2: Hierarchically Tiled Arrays (HTAs)

Overview

• History and Purposes
• Tiles, Locality and Parallelism
• Structure, Creation of and Accessing HTAs
• Operations on HTAs
o Communication and Global
• HTA Implementations
• HTA vs MPI
• Conclusion
• Resources

Page 3: Hierarchically Tiled Arrays (HTAs)

History

Developed in 2004 by:

Jia Guo, Ganesh Bikshandi, María J. Garzarán and David Padua
Dept. of Computer Science, U. of Illinois at Urbana-Champaign

Basilio B. Fraguela
Dept. de Electrónica e Sistemas, Universidade da Coruña, Spain

Gheorghe Almási and José Moreira
IBM Thomas J. Watson Research Center, Yorktown Heights, NY, USA

Page 4: Hierarchically Tiled Arrays (HTAs)

Purpose

• Hierarchically Tiled Array (HTA) is an object-oriented programming library.

• The purpose of the library is to improve the programmability of distributed-memory environments while improving performance by enhancing locality and parallelism. It does this by introducing a new data type, the HTA, which makes tiles easier to manipulate.

Page 5: Hierarchically Tiled Arrays (HTAs)

What are Tiles, Locality and Parallelism?

Tile

• A tile is a block of information.

• Tiling is widely used in scientific computing.

• For example, a tile may be a submatrix of a larger matrix.

Locality

• Locality is the tendency of a program to access the same value, or nearby storage locations, repeatedly within a short period.

• This behavior is predictable, which makes it a good candidate for performance optimization.

Page 6: Hierarchically Tiled Arrays (HTAs)

What are Tiles, Locality and Parallelism? (cont.)

Parallelism

• A form of computation in which many calculations are carried out simultaneously.

• Based on the principle of dividing a large problem into smaller ones and solving them at the same time.

Page 7: Hierarchically Tiled Arrays (HTAs)

Structure of an HTA

• HTAs are arrays partitioned into tiles.

• The tiles can be arrays or other HTAs.

• Tiling allows for easier access to a specific region of an array.

• Distributing the tiles across processors provides parallelism. Arranging the tiles in a suitable order exploits locality.
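The structure described above can be sketched in plain Python. This is only an illustration, not the HTA library: the helper names `tile` and `assign_owners`, the 2x2 tile shape, and the round-robin ownership rule are all assumptions made for the example.

```python
def tile(matrix, tile_rows, tile_cols):
    """Split a 2-D list into a grid of tiles, each tile_rows x tile_cols."""
    n_rows, n_cols = len(matrix), len(matrix[0])
    if n_rows % tile_rows or n_cols % tile_cols:
        raise ValueError("matrix shape must be a multiple of the tile shape")
    return [
        [
            [row[c:c + tile_cols] for row in matrix[r:r + tile_rows]]
            for c in range(0, n_cols, tile_cols)
        ]
        for r in range(0, n_rows, tile_rows)
    ]


def assign_owners(grid, n_procs):
    """Map each tile position to a processor id (hypothetical round-robin rule)."""
    n_tile_cols = len(grid[0])
    return {
        (i, j): (i * n_tile_cols + j) % n_procs
        for i in range(len(grid))
        for j in range(n_tile_cols)
    }


# A 4x4 matrix holding 0..15, split into a 2x2 grid of 2x2 tiles.
matrix = [[r * 4 + c for c in range(4)] for r in range(4)]
grid = tile(matrix, 2, 2)
owners = assign_owners(grid, n_procs=4)
```

Because each tile is itself a 2-D list, calling `tile` again on a tile gives a second level of tiling, which is the sense in which the tiles of an HTA can be HTAs themselves.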

Page 8: Hierarchically Tiled Arrays (HTAs)

Structure of an HTA (cont.)

Page 9: Hierarchically Tiled Arrays (HTAs)

Creating an HTA

Using Existing Array
• Matrix and delimiters
• See picture

New Empty HTA

• F = hta(3,3)

• Its tiles must be assigned data before the HTA is complete

Page 10: Hierarchically Tiled Arrays (HTAs)

Accessing the HTA's Contents

Notation
- { } used to index tiles
- ( ) used to access elements within an HTA or its tiles

Accessing Tiles
- C{2,1} refers to the lower left tile

Accessing Elements Directly
- C(5,4) refers directly to the specific element at 5,4

Accessing Elements Relatively
- C{2,1}(1,4) refers to the lower left tile, element at 1,4
- C{2,1}{1,2}(1,2) refers to the lower left tile, the upper right tile within {2,1}, element at 1,2

Page 11: Hierarchically Tiled Arrays (HTAs)

Accessing the HTA's Contents (cont.)

Regional Access (Flattening)
- Ignores tiling and returns an array
- C(1:2,3:6) returns a 2x4 matrix

Logical Indexing/Selection
- Uses a matrix of boolean values with the same dimensions as the HTA to select elements
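The relationship between direct, relative and flattened access can be sketched in Python as index arithmetic. This is a rough model rather than the HTA toolbox: it assumes a single level of uniform 4x4 tiles on an 8x8 array (the slides do not state the tile shape), and `locate` and `flatten_region` are hypothetical helper names. Indices are kept 1-based to match the MATLAB notation.

```python
def locate(global_row, global_col, tile_rows, tile_cols):
    """Map a 1-based global index to (tile index, local index), both 1-based."""
    tile_index = ((global_row - 1) // tile_rows + 1,
                  (global_col - 1) // tile_cols + 1)
    local_index = ((global_row - 1) % tile_rows + 1,
                   (global_col - 1) % tile_cols + 1)
    return tile_index, local_index


def flatten_region(grid, rows, cols, tile_rows, tile_cols):
    """Return the elements at the given 1-based global rows/cols, ignoring tiling."""
    out = []
    for r in rows:
        out_row = []
        for c in cols:
            (ti, tj), (li, lj) = locate(r, c, tile_rows, tile_cols)
            out_row.append(grid[ti - 1][tj - 1][li - 1][lj - 1])
        out.append(out_row)
    return out


# 2x2 grid of 4x4 tiles; the element at global (row, col) stores 10*row + col.
grid = [[[[10 * (ti * 4 + li + 1) + (tj * 4 + lj + 1) for lj in range(4)]
          for li in range(4)]
         for tj in range(2)]
        for ti in range(2)]

where = locate(5, 4, 4, 4)  # analogue of C(5,4)
region = flatten_region(grid, range(1, 3), range(3, 7), 4, 4)  # analogue of C(1:2,3:6)
```

Under the assumed tile shape, `where` is tile (2,1), local element (1,4), so C(5,4) and C{2,1}(1,4) name the same element, and `region` is a 2x4 matrix that spans two tiles.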

Page 12: Hierarchically Tiled Arrays (HTAs)

Communication Operations

• Communication is represented as assignments on distributed HTAs.
o V{2:3,:}(1,:) = V{1:2,:}(5,:)
• Can also be represented by overloaded HTA library methods.
• Permute Operations
o permute(h,[x,y])
o dpermute(h,[x,y])

Operates on a 3D array.

Some of the overloaded array operations provided by the HTA library.

• HTAs execute these operations at the tile level.
o When circular shift is called, whole tiles are shifted instead of individual array elements.
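The assignment V{2:3,:}(1,:) = V{1:2,:}(5,:) can be read as: for every tile in tile rows 2 and 3, overwrite its first row with the fifth row of the tile directly above it. A minimal Python sketch of that data movement follows. It runs in one process, so it only models which data moves where; on a distributed HTA each copy would cross a processor boundary. The function name `exchange_boundary_rows` and the 5x2 tile shape are assumptions.

```python
def exchange_boundary_rows(V, src_row=4, dst_row=0):
    """Copy row src_row of each tile into row dst_row of the tile below it.

    With 0-based indices this mirrors V{2:3,:}(1,:) = V{1:2,:}(5,:)
    for a grid V with three tile rows.
    """
    # Read every source row first so the update behaves like a single
    # array assignment (the right-hand side is evaluated before writing).
    sources = [[list(t[src_row]) for t in tile_row] for tile_row in V[:-1]]
    for i, tile_row in enumerate(V[1:]):
        for j, t in enumerate(tile_row):
            t[dst_row] = sources[i][j]
    return V


# 3x1 grid of 5x2 tiles; row r of tile k is filled with the value 10*k + r.
V = [[[[10 * k + r] * 2 for r in range(5)]] for k in range(3)]
exchange_boundary_rows(V)
```

After the call, the first row of tile 2 holds what was the last row of tile 1, and the first row of tile 3 holds the last row of tile 2, which is the pattern used to exchange boundary rows between neighbouring tiles.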

Page 13: Hierarchically Tiled Arrays (HTAs)

Matrix-Matrix Multiplication

• Cannon's Algorithm for matrix-matrix multiplication shows how circular shift can be used to distribute work.
• Normal Implementation
o Shift rows and columns
o Perform multiplication by element
• HTA Implementation
o Shift entire tiles
o Perform multiplication as matrix multiplication of tiles, where each processor or unit owns a tile.
• The HTA implementation increases locality because the work is done as matrix multiplications on whole tiles.
o Locality can be increased even further if more levels of tiling are used.

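function C = cannon(A, B, C)
  % m = number of tile rows (and columns) in A and B; assumed to be in scope
  % Initial skew: shift tile row i of A left by i-1, tile column i of B up by i-1
  for i = 2:m
    A{i,:} = circshift(A{i,:}, [0, -(i-1)]);
    B{:,i} = circshift(B{:,i}, [-(i-1), 0]);
  end
  % m multiply-and-shift steps, so every tile pair meets exactly once
  for k = 1:m
    C = C + A * B;               % tile-by-tile matrix multiplication
    A = circshift(A, [0, -1]);   % shift tiles of A one position left
    B = circshift(B, [-1, 0]);   % shift tiles of B one position up
  end
end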

MATLAB Code for Cannon's Algorithm using HTAs
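To make the tile-level shifts concrete, here is a sequential Python sketch of Cannon's algorithm on a grid of tiles. It is a model of the data movement only, not the HTA library: the helpers `mat_mul`, `mat_add`, `split` and `join` are hypothetical, and all tiles live in one process instead of one per processor.

```python
def mat_mul(X, Y):
    """Plain dense product of two 2-D lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(len(Y)))
             for j in range(len(Y[0]))] for i in range(len(X))]


def mat_add(X, Y):
    """Element-wise sum of two equally shaped 2-D lists."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]


def split(M, m):
    """Split a square matrix into an m x m grid of equal square tiles."""
    s = len(M) // m
    return [[[row[j * s:(j + 1) * s] for row in M[i * s:(i + 1) * s]]
             for j in range(m)] for i in range(m)]


def join(grid):
    """Flatten a grid of tiles back into one matrix."""
    return [sum((t[r] for t in tile_row), [])
            for tile_row in grid for r in range(len(tile_row[0]))]


def cannon(A, B, m):
    """Multiply two m x m grids of tiles using whole-tile circular shifts."""
    # Initial skew: tile row i of A moves left by i, tile column j of B up by j.
    A = [[A[i][(j + i) % m] for j in range(m)] for i in range(m)]
    B = [[B[(i + j) % m][j] for j in range(m)] for i in range(m)]
    s = len(A[0][0])
    C = [[[[0] * s for _ in range(s)] for _ in range(m)] for _ in range(m)]
    for _ in range(m):
        # Each grid position multiplies the two tiles it currently holds.
        C = [[mat_add(C[i][j], mat_mul(A[i][j], B[i][j]))
              for j in range(m)] for i in range(m)]
        A = [[A[i][(j + 1) % m] for j in range(m)] for i in range(m)]  # left by 1
        B = [[B[(i + 1) % m][j] for j in range(m)] for i in range(m)]  # up by 1
    return C


X = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
Y = [[(i + 2 * j) % 5 for j in range(4)] for i in range(4)]
product = join(cannon(split(X, 2), split(Y, 2), 2))
```

The loop runs m times because position (i, j) must see every pair A(i,k), B(k,j) once; only whole tiles move between steps, and all arithmetic happens inside `mat_mul` on a tile pair.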

Page 14: Hierarchically Tiled Arrays (HTAs)

Global Computations

Passing an HTA to a function/operation
• Operates in parallel on a set of tiles from an HTA distributed across a parallel machine
• parHTA(@func, H), where func is a function handle

Reduction

reduce(+, [5, 1, 3, 8]) = 17

• A reduction applies an operation along one dimension of an n-dimensional array, collapsing that dimension to a scalar and producing an (n-1)-dimensional array.
o If no dimension is given, each input tile is reduced in every dimension, so the output contains only one scalar for each corresponding input tile.
• reduceHTA(op, dim, recurLevel, replicFlag)
o op = any associative and commutative operation
o dim = dimension of the reduction
o recurLevel = termination level of recursion
o replicFlag = replication flag
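The reduction semantics can be illustrated with Python's standard `functools.reduce`. This sketch is not reduceHTA: `reduce_rows` and `reduce_tiles` are hypothetical helpers covering the "one dimension" case and the "no dimension given" case, and the recursion-level and replication parameters are not modeled.

```python
from functools import reduce
import operator


def reduce_rows(op, matrix):
    """Reduce a 2-D list along its second dimension, giving a 1-D list."""
    return [reduce(op, row) for row in matrix]


def reduce_tiles(op, grid):
    """Reduce each tile in every dimension, giving one scalar per tile."""
    return [[reduce(op, (x for row in t for x in row)) for t in tile_row]
            for tile_row in grid]


total = reduce(operator.add, [5, 1, 3, 8])             # the slide's example
row_sums = reduce_rows(operator.add, [[1, 2], [3, 4]])

# A 1x2 grid of 2x2 tiles.
grid = [[[[1, 2], [3, 4]], [[5, 6], [7, 8]]]]
per_tile = reduce_tiles(operator.add, grid)
```

The requirement that `op` be associative and commutative is what lets a parallel implementation combine partial results from different tiles in any order.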

Page 15: Hierarchically Tiled Arrays (HTAs)

Matrix-Vector Product

• The simplest global computation is achieved by operating in parallel on a set of distributed tiles from an HTA.
• Matrix-vector multiplication is one example of utilizing HTA global computation.
• A is an HTA containing the matrix MX
o Distributed across m × n processors.
• B is a two-dimensional HTA obtained by replicating the HTA V, which contains the vector VX to multiply.
• The HTA V is replicated m times, as specified by the operator repmat(V,m,1).
• Before multiplication, the row-vector B is transposed to a column.
• The matrix-vector product A * B takes place locally, and each processor multiplies its portion of the matrix A by its portion of the vector in B.

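A = hta(MX, {partition_A}, [m n]);     % matrix tiles on an m x n processor mesh
V = hta(VX, {partition_B}, [m n]);     % HTA holding the vector
B = repmat(V, m, 1);                   % replicate V m times
B = parHTA(@transpose, B);             % transpose each tile: row to column
C = reduceHTA(@sum, A * B, 2, true);   % local products, then sum along dimension 2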

MATLAB code implementing HTA Sparse Matrix-Vector Multiplication
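The same computation can be sketched sequentially in Python: every tile multiplies its own block of the matrix by the matching piece of the vector, and the partial results are then summed along the second dimension of the tile grid. The helper names `tile_matvec` and `tiled_matvec` are hypothetical, and the 2x2 grid of 2x2 tiles is an assumption chosen for the example.

```python
def tile_matvec(t, vec):
    """Local product of one tile with its piece of the vector."""
    return [sum(a * x for a, x in zip(row, vec)) for row in t]


def tiled_matvec(A_tiles, v_pieces):
    """A_tiles: m x n grid of tiles; v_pieces: n vector pieces, one per tile column."""
    result = []
    for tile_row in A_tiles:
        # Every tile in this tile row computes independently (parallel in an HTA).
        partials = [tile_matvec(t, v) for t, v in zip(tile_row, v_pieces)]
        # Reduction along the second dimension of the tile grid.
        result.extend(sum(vals) for vals in zip(*partials))
    return result


# The matrix [[1..4], [5..8], [9..12], [13..16]] as a 2x2 grid of 2x2 tiles.
A_tiles = [
    [[[1, 2], [5, 6]], [[3, 4], [7, 8]]],
    [[[9, 10], [13, 14]], [[11, 12], [15, 16]]],
]
v_pieces = [[1, 0], [2, 1]]  # the vector [1, 0, 2, 1] split per tile column
y = tiled_matvec(A_tiles, v_pieces)
```

Reusing the same `v_pieces` for every tile row plays the role of repmat(V, m, 1) in the listing: each tile row needs its own copy of the vector so that the multiplication stays local.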

Page 16: Hierarchically Tiled Arrays (HTAs)

HTA Implementations

• HTAs can be added to almost any object-based or object-oriented language.

• Most research has been done with MATLAB and C++.

Page 17: Hierarchically Tiled Arrays (HTAs)

MATLAB Implementation

Pros
• Overall, MATLAB's syntax lends itself to HTAs.
• MATLAB provides a rich set of scientific operations which can be easily incorporated in the HTA toolbox.

Cons
• MATLAB is interpreted, which adds an immense overhead.
o MATLAB creates temporary variables to hold the partial results of an expression, which greatly slows the program.
o MATLAB passes parameters by value, and assignment statements create copies of the data.

Page 18: Hierarchically Tiled Arrays (HTAs)

C++ Implementation: htalib

Why C++ over MATLAB?

• Allocation/deallocation improves performance:
a. HTAs are allocated on the heap, and a handle is returned.
b. Handles are typically small in size.
c. Once all handles are removed, the HTA is deleted.
• Function inlining
o The compiler replaces function calls with the full body of the function.
o Used for tile access

Page 19: Hierarchically Tiled Arrays (HTAs)

HTA compared to MPI

• HTAs can be naturally implemented in many different languages. MPI's unstructured style can lead to programs that are difficult to understand and maintain.
• HTA programs follow a single-threaded programming approach, which eases the programmer's transition from sequential to parallel programming.
• HTAs are partitioned using the single HTA constructor. An MPI program has to perform many more computations to achieve the same partitioning.
• The number of lines of code needed for communication is significantly lower with HTAs.

Page 20: Hierarchically Tiled Arrays (HTAs)
Page 21: Hierarchically Tiled Arrays (HTAs)

Conclusion

• Data tiling is an effective mechanism for improving performance through both locality and parallelism.

• HTA as a library gives the programmer more control.

• HTAs facilitate algorithms that use multiple independent CPUs.

Page 22: Hierarchically Tiled Arrays (HTAs)

Resources

Bikshandi, Ganesh, et al. "Programming for parallelism and locality with hierarchically tiled arrays." Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming. ACM, 2006.

Fraguela, Basilio B., Jia Guo, Ganesh Bikshandi, María J. Garzarán, Gheorghe Almási, José Moreira, and David Padua. "The Hierarchically Tiled Arrays programming approach." Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Support for Scalable Systems, pp. 1-12, October 22-23, 2004, Houston, Texas. doi:10.1145/1066650.1066657
