Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems
Using Functional Performance Models of Data-Parallel Applications
Published in: 2012 IEEE International Conference on Cluster Computing (CLUSTER)
2013/9/11
Outline
• Introduction
• Performance Measurement
• Column-based matrix multiplication
• FPM of multiple cores and GPUs
• Experimental results
Introduction
• Heterogeneous multiprocessor systems
– Better power efficiency
– Better performance/price ratio
• Multicore and GPU programming techniques
– OpenMP, MPI
– Brook+, CUDA, OpenCL
Introduction (cont.)
• Data-parallel scientific applications
– Linear algebra routines
– Digital signal processing
– Computational fluid dynamics
• Data partitioning algorithm
– Based on performance models of the processors
Introduction (cont.)
• Constant performance model (CPM)
– Uses a history of performance measurements
– Characterizes the absolute speed of processors/devices
• Functional performance model (FPM)
– Can be used with any data-parallel application
– GPUs and CPUs have separate memories and different programming models
Introduction (cont.)
• Load balancing algorithms
– Static algorithms
• Known as "predicting the future"
• Do not require data redistribution
• Cannot balance load on non-dedicated platforms
– Dynamic algorithms
• Do not require a priori information
• Incur communication overhead
Performance Measurement
• Hybrid multicore and multi-GPU node with NUMA architecture
– Multiple identical cores
– Hierarchical memory
– Heterogeneous GPUs attached via PCI Express
Performance Measurement
• CPU
– GEMM kernel from ACML 4.4 (AMD Core Math Library)
• GPU
– CUBLAS 4.1 (NVIDIA CUDA BLAS)
Performance Measurement (cont.)
• Approach to performance measurement
– Processes are bound to cores
– Processes are synchronized
– Measurements are repeated multiple times
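The measurement approach above can be sketched as follows. This is a minimal illustration, not the authors' benchmarking code: `measure_speed`, the toy kernel, and the default repetition count are all assumptions, with `os.sched_setaffinity` standing in for core binding on Linux.

```python
# Sketch of the measurement approach: pin the process to a core,
# time the kernel several times, and report speed = work / time.
import os
import time

def measure_speed(kernel, problem_size, repetitions=5, core=None):
    """Return speed (work units per second) for `kernel` at `problem_size`."""
    if core is not None and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})   # bind this process to one core (Linux only)
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        kernel(problem_size)
        times.append(time.perf_counter() - start)
    avg = sum(times) / len(times)         # average over repeated runs
    return problem_size / avg

# toy kernel: a sum of squares standing in for a GEMM call
speed = measure_speed(lambda n: sum(i * i for i in range(n)), 100_000)
```

In the measured setting the kernel would be the ACML GEMM (per core) or a CUBLAS call (per GPU), and synchronization across processes would happen before each repetition.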
Performance Measurement (cont.)
• CPU
– The speed of a core depends on the number of cores executing the kernel on the same socket
– It is not affected by execution on the other socket
• GPU
– One core is dedicated to the GPU while the other cores are idle
– The measurement includes sending and receiving the matrices
Column-based matrix multiplication
Column-based matrix multiplication (cont.)
• Partitioning algorithm
– Arranges the submatrices to be as square as possible
– Minimizes the total volume of communication while balancing the computations
• Blocking factor b
– A parameter of the application that adjusts the granularity of communications and computations
– Determined experimentally
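A speed-proportional column partition respecting the blocking factor can be sketched as below. The function name, device labels, and speed values are illustrative assumptions, not taken from the paper; the point is only that each device receives a multiple of b columns in proportion to its speed.

```python
# Sketch: distribute n matrix columns among devices in multiples of
# the blocking factor b, proportionally to each device's speed.
def partition_columns(n, b, speeds):
    """Return a dict of columns per device; each count is a multiple of b."""
    assert n % b == 0, "n must be a multiple of the blocking factor"
    blocks = n // b
    total = sum(speeds.values())
    # initial proportional allocation, rounded down to whole blocks
    alloc = {d: int(blocks * s / total) for d, s in speeds.items()}
    # hand leftover blocks to the fastest devices first
    leftover = blocks - sum(alloc.values())
    for d in sorted(speeds, key=speeds.get, reverse=True)[:leftover]:
        alloc[d] += 1
    return {d: k * b for d, k in alloc.items()}

# illustrative speeds (work units/s): one GPU four times faster than the CPU cores
parts = partition_columns(n=1024, b=64, speeds={"gpu0": 8.0, "cpu": 2.0})
```

With FPMs the speeds would themselves depend on the assigned problem size, so in practice the partition is refined iteratively rather than computed once from constant speeds.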
FPM of multiple cores and GPUs
• Speed functions of multiple cores
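A functional performance model represents speed as a function of problem size, built from timed runs. The following sketch assumes speed is defined as s(x) = x / t(x) and that intermediate sizes are estimated by piecewise-linear interpolation between measured points; the sample timings are invented for illustration.

```python
# Minimal sketch of a functional performance model: record speed at
# measured problem sizes and interpolate linearly in between.
def build_fpm(samples):
    """samples: list of (problem_size, time_seconds) pairs; returns speed(x)."""
    pts = sorted((x, x / t) for x, t in samples)   # (size, speed) points
    def speed(x):
        if x <= pts[0][0]:
            return pts[0][1]                       # clamp below smallest sample
        for (x0, s0), (x1, s1) in zip(pts, pts[1:]):
            if x <= x1:                            # linear interpolation segment
                return s0 + (s1 - s0) * (x - x0) / (x1 - x0)
        return pts[-1][1]                          # clamp above largest sample
    return speed

# illustrative measurements: (problem size, seconds)
fpm = build_fpm([(100, 0.01), (1000, 0.08), (10000, 1.2)])
```

One such function is built per core (and per GPU), capturing effects like cache hierarchy and memory limits that a single constant speed cannot.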
FPM of multiple cores and GPUs (cont.)
• Speed functions of GPUs
FPM of multiple cores and GPUs (cont.)
• Version 1
– The pivot column A(b), row B(b), and submatrix Ci are stored in the host memory
• Version 2
– The submatrix C is stored and accumulated on the device until the device memory is exceeded
FPM of multiple cores and GPUs (cont.)
• Version 3
– Overlapping communications and computations
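The overlap idea of Version 3 can be sketched as a double-buffered pipeline: while the device computes on block i, block i+1 is transferred concurrently. The sketch below simulates the transfer with a background thread and toy `transfer`/`compute` callables; on a real GPU this role would be played by asynchronous copies on separate CUDA streams.

```python
# Sketch of overlapping communication with computation via
# double buffering: transfer block i+1 while computing on block i.
import threading

def pipeline(blocks, transfer, compute):
    """Process blocks, overlapping each transfer with the previous compute."""
    results = []
    buf = transfer(blocks[0])                      # first transfer cannot overlap
    for nxt in blocks[1:] + [None]:
        holder, t = {}, None
        if nxt is not None:
            # start transferring the next block in the background
            t = threading.Thread(target=lambda: holder.update(v=transfer(nxt)))
            t.start()
        results.append(compute(buf))               # compute on the current block
        if t is not None:
            t.join()                               # wait for the prefetched block
            buf = holder["v"]
    return results

out = pipeline([1, 2, 3], transfer=lambda b: b * 10, compute=lambda b: b + 1)
```

When transfer and compute times are comparable, this hides most of the PCI Express communication cost behind the kernel execution.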
Experimental results
Q&A
Thank you for listening
1. Performance modelling
2. The performance of the program
3. Why FPM
4. Problem size
5. Kernel
6. NUMA
7. GEMM
8. BLAS
9. GFlops