Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems
Using Functional Performance Models of Data-Parallel Applications
Published in: 2012 IEEE International Conference on Cluster Computing (CLUSTER)
2013/9/11
Outline
• Introduction
• Performance Measurement
• Column-based matrix multiplication
• FPM of multiple cores and GPUs
• Experimental results
Introduction
• Heterogeneous multiprocessor systems
– Better power efficiency
– Better performance/price ratio
• Multicore and GPU programming techniques
– OpenMP, MPI
– Brook+, CUDA, OpenCL
Introduction (cont.)
• Data-parallel scientific applications
– Linear algebra routines
– Digital signal processing
– Computational fluid dynamics
• Data partitioning algorithm
– Based on performance models of the processors
Introduction (cont.)
• Constant performance model (CPM)
– Uses a history of performance measurements
– Characterizes the absolute speed of processors/devices
• Functional performance model (FPM)
– Can be used with any data-parallel application
– GPUs and CPUs have separate memories and different programming models
Introduction (cont.)
• Load balancing algorithms
– Static algorithms
• Known as "predicting the future"
• Do not require data redistribution
• Cannot balance load on non-dedicated platforms
– Dynamic algorithms
• Do not require a priori information
• Incur communication overhead
Performance Measurement
• Hybrid multicore and multi-GPU node with NUMA architecture
– Multiple identical cores
– Hierarchical memory
– Heterogeneous GPUs attached via PCI Express
Performance Measurement
• CPU
– GEMM kernel from ACML 4.4 (AMD Core Math Library)
• GPU
– CUBLAS 4.1 (NVIDIA CUDA BLAS)
Performance Measurement (cont.)
• Approach to performance measurement
– Processes are bound to cores
– Processes are synchronized
– Measurements are repeated multiple times
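The measurement approach above can be sketched as follows. This is a minimal illustration, not the authors' benchmarking code: `measure_speed`, the toy kernel, and the default repetition count are all assumptions, with `os.sched_setaffinity` standing in for core binding on Linux.

```python
# Sketch of the measurement approach: pin the process to a core,
# time the kernel several times, and report speed = work / time.
import os
import time

def measure_speed(kernel, problem_size, repetitions=5, core=None):
    """Return speed (work units per second) for `kernel` at `problem_size`."""
    if core is not None and hasattr(os, "sched_setaffinity"):
        os.sched_setaffinity(0, {core})   # bind this process to one core (Linux only)
    times = []
    for _ in range(repetitions):
        start = time.perf_counter()
        kernel(problem_size)
        times.append(time.perf_counter() - start)
    avg = sum(times) / len(times)         # average over repeated runs
    return problem_size / avg

# toy kernel: a sum of squares standing in for a GEMM call
speed = measure_speed(lambda n: sum(i * i for i in range(n)), 100_000)
```

In the measured setting the kernel would be the ACML GEMM (per core) or a CUBLAS call (per GPU), and synchronization across processes would happen before each repetition.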
Performance Measurement (cont.)
• CPU
– The speed of a core depends on the number of cores executing the kernel on the same socket
– It is not affected by execution on the other socket
• GPU
– One core is dedicated to the GPU while the other cores are idle
– The measurement includes sending and receiving the matrices
Column-based matrix multiplication
Column-based matrix multiplication (cont.)
• Partitioning algorithm
– Arranges the submatrices to be as square as possible
– Minimizes the total volume of communication while balancing the computations
• Blocking factor b
– A parameter of the application that adjusts the granularity of communications and computations
– Determined experimentally
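A speed-proportional column partition respecting the blocking factor can be sketched as below. The function name, device labels, and speed values are illustrative assumptions, not taken from the paper; the point is only that each device receives a multiple of b columns in proportion to its speed.

```python
# Sketch: distribute n matrix columns among devices in multiples of
# the blocking factor b, proportionally to each device's speed.
def partition_columns(n, b, speeds):
    """Return a dict of columns per device; each count is a multiple of b."""
    assert n % b == 0, "n must be a multiple of the blocking factor"
    blocks = n // b
    total = sum(speeds.values())
    # initial proportional allocation, rounded down to whole blocks
    alloc = {d: int(blocks * s / total) for d, s in speeds.items()}
    # hand leftover blocks to the fastest devices first
    leftover = blocks - sum(alloc.values())
    for d in sorted(speeds, key=speeds.get, reverse=True)[:leftover]:
        alloc[d] += 1
    return {d: k * b for d, k in alloc.items()}

# illustrative speeds (work units/s): one GPU four times faster than the CPU cores
parts = partition_columns(n=1024, b=64, speeds={"gpu0": 8.0, "cpu": 2.0})
```

With FPMs the speeds would themselves depend on the assigned problem size, so in practice the partition is refined iteratively rather than computed once from constant speeds.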
FPM of multiple cores and GPUs
• Speed functions of multiple cores
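A functional performance model represents speed as a function of problem size, built from timed runs. The following sketch assumes speed is defined as s(x) = x / t(x) and that intermediate sizes are estimated by piecewise-linear interpolation between measured points; the sample timings are invented for illustration.

```python
# Minimal sketch of a functional performance model: record speed at
# measured problem sizes and interpolate linearly in between.
def build_fpm(samples):
    """samples: list of (problem_size, time_seconds) pairs; returns speed(x)."""
    pts = sorted((x, x / t) for x, t in samples)   # (size, speed) points
    def speed(x):
        if x <= pts[0][0]:
            return pts[0][1]                       # clamp below smallest sample
        for (x0, s0), (x1, s1) in zip(pts, pts[1:]):
            if x <= x1:                            # linear interpolation segment
                return s0 + (s1 - s0) * (x - x0) / (x1 - x0)
        return pts[-1][1]                          # clamp above largest sample
    return speed

# illustrative measurements: (problem size, seconds)
fpm = build_fpm([(100, 0.01), (1000, 0.08), (10000, 1.2)])
```

One such function is built per core (and per GPU), capturing effects like cache hierarchy and memory limits that a single constant speed cannot.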
FPM of multiple cores and GPUs (cont.)
• Speed functions of GPUs
FPM of multiple cores and GPUs (cont.)
• Version 1
– The pivot column A(b), row B(b), and submatrix Ci are stored in the host memory
• Version 2
– The submatrix C is stored and accumulated on the device until the device memory is exceeded
FPM of multiple cores and GPUs (cont.)
• Version 3
– Overlapping communications and computations
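The overlap idea of Version 3 can be sketched as a double-buffered pipeline: while the device computes on block i, block i+1 is transferred concurrently. The sketch below simulates the transfer with a background thread and toy `transfer`/`compute` callables; on a real GPU this role would be played by asynchronous copies on separate CUDA streams.

```python
# Sketch of overlapping communication with computation via
# double buffering: transfer block i+1 while computing on block i.
import threading

def pipeline(blocks, transfer, compute):
    """Process blocks, overlapping each transfer with the previous compute."""
    results = []
    buf = transfer(blocks[0])                      # first transfer cannot overlap
    for nxt in blocks[1:] + [None]:
        holder, t = {}, None
        if nxt is not None:
            # start transferring the next block in the background
            t = threading.Thread(target=lambda: holder.update(v=transfer(nxt)))
            t.start()
        results.append(compute(buf))               # compute on the current block
        if t is not None:
            t.join()                               # wait for the prefetched block
            buf = holder["v"]
    return results

out = pipeline([1, 2, 3], transfer=lambda b: b * 10, compute=lambda b: b + 1)
```

When transfer and compute times are comparable, this hides most of the PCI Express communication cost behind the kernel execution.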
Experimental results
Q&A
Thank you for listening
1. Performance modelling
2. The performance of the program
3. Why FPM
4. Problem size
5. Kernel
6. NUMA
7. GEMM
8. BLAS
9. GFlops