Matrix Multiplication on Two Interconnected Processors
Brett A. Becker and Alexey Lastovetsky
Heterogeneous Computing Laboratory
School of Computer Science and Informatics
University College Dublin
_______________________________________________________
HeteroPar’06 Barcelona Sept. 28, 2006
Outline
● Motivation and Goals
● Introduction: ‘Straight-Line’ Partitionings
● The ‘Square-Corner’ Partitioning - Minimizing the Total Volume of Communication
● MPI Experiments / Results
● Conclusion / Future Work
Motivation and Goals
● Partitioning algorithms for MMM designed for n processors result in partitionings which are not always optimal on a small number of processors
● We seek to lower the Total Volume of Communication by utilizing a new partitioning strategy.
● Our ultimate interest is to determine whether the Square-Corner Partitioning is a viable technique for deployment on two interconnected clusters.
Background: Straight-Line Partitioning
$S = \sum_{i=1}^{p} (w_i + h_i)$
Total Volume of Inter-Processor Communication (TVC) is proportional to the Sum of Half-Perimeters (S)
Lower Bound (L) of S is when all partitions are square
$L = 2 \sum_{i=1}^{p} \sqrt{a_i}$, where $a_i$ is the area of partition $i$
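The two quantities above can be sketched in a few lines of Python. This is only an illustration: the function names and the unit-square examples are assumptions chosen for the sketch, not taken from the slides. Each partition is given as a (width, height) pair.

```python
import math

def sum_half_perimeters(rects):
    """S: sum of (width + height) over all rectangular partitions."""
    return sum(w + h for w, h in rects)

def lower_bound(rects):
    """L: 2 * sum of sqrt(area); S equals L only if every partition is a square."""
    return 2 * sum(math.sqrt(w * h) for w, h in rects)

# Hypothetical example 1: a unit square split into four equal squares.
squares = [(0.5, 0.5)] * 4
# Hypothetical example 2: the same unit square split into four column strips.
strips = [(0.25, 1.0)] * 4

for name, rects in (("squares", squares), ("strips", strips)):
    S = sum_half_perimeters(rects)
    L = lower_bound(rects)
    print(f"{name}: S = {S}, L = {L}")
```

Both examples cover the same areas, so L is the same; only the all-square layout brings S down to L.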
Straight-Line Partitioning
From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol.12, No.10, pp.1033-1051.
Average and minimum values of L/S for two million randomly generated areas
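Background: Straight-Line Partitioning, 2 Processors
$S = \sum_{i=1}^{2} (w_i + h_i) = (w_1 + h_1) + (w_2 + h_2) = 3N$
$L = 2 \sum_{i=1}^{2} \sqrt{a_i}$; as the smaller partition's area $\to 0$, $L \to 2N$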
The Straight-Line Partitioning cannot meet the lower bound, L
Background: Straight-Line Partitioning, 2 Processors
As the smaller partition's area $\to 0$, $\mathrm{TVC} \to N^2$
Total Volume of Inter-Processor Communication (TVC) = $N^2$
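A small numerical sketch of the two-processor straight-line case follows. It assumes the N x N matrix is cut into two full-height column strips whose areas are in the ratio `ratio`:1, and that each processor sends its strip of one input matrix to the other; that accounting is an assumption chosen because it reproduces the stated TVC of N^2, and the function name is hypothetical.

```python
import math

def straight_line(N, ratio):
    """Two column strips of an N x N matrix with areas in proportion ratio:1."""
    w2 = N / (ratio + 1)          # width of the slower processor's strip
    w1 = N - w2                   # width of the faster processor's strip
    S = (w1 + N) + (w2 + N)       # sum of half-perimeters, always 3N
    L = 2 * (math.sqrt(w1 * N) + math.sqrt(w2 * N))  # lower bound
    tvc = N * w1 + N * w2         # assumed: each strip is sent once, total N^2
    return S, L, tvc

for ratio in (1, 5, 100, 10_000):
    S, L, tvc = straight_line(6500, ratio)
    print(f"ratio {ratio:>6}:1  S = {S:.0f}  L = {L:.0f}  TVC = {tvc:.0f}")
```

However uneven the split, S stays at 3N and TVC at N^2, while L falls toward 2N, so the gap between S and L never closes.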
Introduction: Square-Corner Partitioning
As $X \to 0$, $\mathrm{TVC} \to 0$
N2TVC
Square-Corner Partitioning
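$S = \sum_{i=1}^{2} (w_i + h_i) = (w_1 + h_1) + (w_2 + h_2)$; as the smaller partition's area $\to 0$, $S \to 2N$
$L = 2 \sum_{i=1}^{2} \sqrt{a_i}$; as the smaller partition's area $\to 0$, $L \to 2N$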
The Square-Corner Partitioning can meet the lower bound, L
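The convergence of S and L can be checked numerically. The sketch below assumes the slower processor owns a square of side X in one corner and that the faster processor's half-perimeter is that of the N x N bounding box of its remaining region; both the function name and that treatment of the non-rectangular region are assumptions of the sketch.

```python
import math

def square_corner(N, X):
    """S and L when one partition is an X x X corner square of an N x N matrix."""
    S = (N + N) + (X + X)                    # bounding box plus the square
    L = 2 * (math.sqrt(N * N - X * X) + X)   # 2 * (sqrt(a_1) + sqrt(a_2))
    return S, L

for X in (2000, 500, 50, 1):
    S, L = square_corner(6500, X)
    print(f"X = {X:5d}  S = {S:8.1f}  L = {L:8.1f}  L/S = {L / S:.4f}")
```

As X shrinks, both values approach 2N and the ratio L/S approaches 1, unlike the straight-line case where S is fixed at 3N.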
Square-Corner Partitioning
Average and minimum values of L/S for 2 million randomly generated areas
Power Ratio > 3:1
Adapted From: Olivier Beaumont, Vincent Boudet, Fabrice Rastello and Yves Robert, “Matrix-Matrix Multiplication on Heterogeneous Platforms”, IEEE Transactions on Parallel and Distributed Systems, 2001, Vol.12, No.10, pp.1033-1051.
Square-Corner Partitioning: Minimizing the TVC
Theorem: The Square-Corner Partitioning has a lower Total Volume of Communication than the Straight-Line Partitioning, provided the processor power ratio is > 3:1.
Theorem: The Total Volume of Communication is minimized when the slower processor’s partition is a square.
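The 3:1 threshold can be illustrated with a short calculation. The slides do not give the square-corner TVC formula, so the sketch assumes one obtained by counting elements: the slower processor, owning an X x X square, receives 2X(N - X) elements of A and B, and the faster processor receives 2X^2, for a total of 2NX. The straight-line TVC is taken as N^2. Function names are hypothetical.

```python
import math

def tvc_straight_line(N):
    """Straight-line TVC for two processors, as stated on the slides."""
    return N * N

def tvc_square_corner(N, ratio):
    """Assumed square-corner TVC of 2*N*X for a power ratio of ratio:1."""
    X = N / math.sqrt(ratio + 1)   # square side: area N^2 / (ratio + 1)
    return 2 * N * X

N = 6500
for ratio in (1, 2, 3, 4, 5, 10):
    sl = tvc_straight_line(N)
    sc = tvc_square_corner(N, ratio)
    verdict = "square-corner lower" if sc < sl else "not lower"
    print(f"{ratio:>2}:1  straight-line = {sl:.3e}  square-corner = {sc:.3e}  {verdict}")
```

Under this assumed formula the two volumes are equal at exactly 3:1 (X = N/2), and the square-corner volume is lower for any larger ratio.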
Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 80Mb/s
Lower TVC → Lower Communication Time → Lower Execution Time
Average Reduction in Communication Time = 45%
Average Reduction in Execution Time = 14%
Results: Square-Corner Partitioning
Matrix-Matrix Multiplication, N=6500, Bandwidth = 380Mb/s
Average Reduction in Communication Time = 44%
Lower TVC → Lower Communication Time → Lower Execution Time
Average Reduction in Execution Time = 10%
Square-Corner Partitioning: Overlapping Communication and Computation
A sub-partition of Processor 1’s C partition is immediately calculable
Square-Corner Partitioning: Overlapping Communication and Computation
Overlapping more than doubled the advantage of the Square-Corner algorithm.
● No Overlapping → 17% faster than the Straight-Line algorithm.
● Overlapping → 39% faster than the Straight-Line algorithm.
Algorithm                        Execution Time   Speedup
Straight-Line                    83s              0.94
Square-Corner (No Overlapping)   69s              1.13
Square-Corner (Overlapping)      51s              1.53
Sequential                       78s              N/A
MM Multiplication, N=4500, Bandwidth=100Mb/s, Ratio=5:1
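The speedup and "faster than" figures follow directly from the reported execution times. The sketch below recomputes them, assuming speedup is the sequential time divided by the parallel time and "faster than Straight-Line" is the relative reduction in execution time; the variable and function names are illustrative.

```python
SEQUENTIAL = 78  # seconds, sequential run on the faster processor

times = {
    "Straight-Line": 83,
    "Square-Corner (No Overlapping)": 69,
    "Square-Corner (Overlapping)": 51,
}

def speedup(seconds):
    """Speedup relative to the sequential execution time."""
    return SEQUENTIAL / seconds

def percent_faster(baseline, seconds):
    """Relative reduction in execution time versus a baseline, in percent."""
    return 100 * (baseline - seconds) / baseline

for name, t in times.items():
    gain = percent_faster(times["Straight-Line"], t)
    print(f"{name}: speedup {speedup(t):.2f}, {gain:.0f}% faster than Straight-Line")
```

A speedup below 1 for Straight-Line means that, at this bandwidth and power ratio, it was slower than running sequentially.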
Square-Corner Partitioning: Two-Cluster Architecture
Total of 20 Homogeneous Nodes in 2 Clusters
Square-Corner Partitioning: Two Clusters
Algorithm       Execution Time   Speedup
Straight-Line   123s             1.04
Square-Corner   115s             1.11
Sequential      128s             N/A
MM Multiplication, N=9000, Bandwidth=100Mb/s
All Machines are Homogeneous. One Cluster of 4, One Cluster of 16
Conclusions
● The Square-Corner Partitioning reduces the Total Volume of Communication provided the processor power ratio is > 3:1
● The possibility of Overlapping Communication and Computation can bring further reductions in Execution Time
● The Square-Corner Partitioning is viable on Two Clusters
_______________________________________________________
Current and Future Work
● We have successfully extended the Square-Corner Partitioning to Three Processors
To do:
● Experiment on more Two-Cluster architectures
● Overlap Communication and Computation on Two Clusters
● Extend the Three-Processor Algorithm to Three Clusters
_______________________________________________________
Acknowledgements
This work was supported by: