Transcript of multicore_multigpu_0

Page 1: multicore_multigpu_0

Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications

Ziming Zhong Vladimir Rychkov Alexey Lastovetsky

Heterogeneous Computing Laboratory, University College Dublin, Ireland

Cluster 2012


Page 2: multicore_multigpu_0

Model-based Data Partitioning

Motivation

Hybrid GPU-accelerated parallel computers

Higher power efficiency, performance/price ratio, etc.

Successfully applied to bioinformatics, astrophysics, molecular dynamics, computational fluid dynamics, etc.

[Figures: Hybrid Clusters; Hybrid Multicore & Multi-GPU System]


Page 3: multicore_multigpu_0

Model-based Data Partitioning

Motivation

Hybrid multicore+GPU systems present challenges

Parallel programming is hard

Load balancing problem

Heterogeneity: processor, memory, etc.
Hierarchical levels of parallelism: node, socket, core, etc.

and others

[Figures: Hybrid Clusters; Hybrid Multicore & Multi-GPU System]


Page 4: multicore_multigpu_0

Model-based Data Partitioning

Leveraging Hybrid Multicore/GPU Systems

In this work, we target:

Data-parallel application

Divisible computational workload
Workload proportional to data size
Dependent on data locality

Dedicated hybrid system

Reuse of optimized software stack

Our approach:

Heterogeneous distributed-memory system

Performance modeling of hybrid system

Model-based data partitioning to balance load

[Figure: Processing Flow]


Page 5: multicore_multigpu_0

Model-based Data Partitioning

Data Partitioning on Heterogeneous Platform:


1 Workload is divisible and proportional to data size

2 Workload is partitioned in proportion to processor speed

3 Workload is distributed in proportion to processor speed


Page 6: multicore_multigpu_0

Model-based Data Partitioning

Data partitioning relies on accurate performance models. Traditionally, performance is defined by a single constant number:

- Constant Performance Model (CPM)
- Computed from clock speed or by performing a benchmark
- Computational units are partitioned as $d_i = N \times s_i / \sum_{j=1}^{p} s_j$ (a minimal sketch follows this list)
- Simplistic; algorithms may fail to converge to a balanced solution [1]
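For illustration, a minimal C sketch of CPM-based partitioning. The rounding rule (leftover units go to the fastest processor) is an assumption, not something the slides specify:

```c
/* CPM-based partitioning sketch: d_i = N * s_i / sum_j s_j.
 * Illustrative only; the rounding rule is an assumption. */
void cpm_partition(const double s[], int p, long N, long d[])
{
    double total = 0.0;
    for (int i = 0; i < p; i++)
        total += s[i];                           /* sum of constant speeds */

    long assigned = 0;
    int fastest = 0;
    for (int i = 0; i < p; i++) {
        d[i] = (long)((double)N * s[i] / total); /* proportional share */
        assigned += d[i];
        if (s[i] > s[fastest])
            fastest = i;
    }
    d[fastest] += N - assigned;                  /* hand leftover units to fastest */
}
```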

Functional Performance Model (FPM) [2]:

Represents speed as a function of problem size

Realistic

Application-centric

Hardware-specific

[Figure: Matrix Multiplication Benchmark on Grid5000-Lille; speed (GFLOPS) vs. problem size (matrix elements), one curve per node: chirloute-3, chimint-1, chinqchint-1, chicon-1]

[1] D. Clarke et al.: Dynamic Load Balancing of Parallel Iterative Routines on Heterogeneous HPC Platforms, 2010.

[2] A. Lastovetsky et al.: Data partitioning with a functional performance model of heterogeneous processors, 2007.
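For contrast with CPM's single number, a speed function can be kept as a table of benchmark points and interpolated. This representation is an assumption for illustration; the papers above define the model, not this data structure:

```c
/* Functional performance model as a table of benchmark points,
 * queried by piecewise-linear interpolation. Illustrative sketch. */
typedef struct {
    int     n;   /* number of benchmark points              */
    double *d;   /* problem sizes, strictly increasing      */
    double *s;   /* measured speeds at those sizes (GFLOPS) */
} fpm_t;

double fpm_speed(const fpm_t *m, double x)
{
    if (x <= m->d[0])        return m->s[0];          /* clamp left   */
    if (x >= m->d[m->n - 1]) return m->s[m->n - 1];   /* clamp right  */
    int i = 1;
    while (m->d[i] < x) i++;                          /* find bracket */
    double w = (x - m->d[i - 1]) / (m->d[i] - m->d[i - 1]);
    return m->s[i - 1] + w * (m->s[i] - m->s[i - 1]); /* interpolate  */
}
```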

Page 7: multicore_multigpu_0

Model-based Data Partitioning

Partitioning with functional performance models*

Load is balanced when:

$t_1(d_1) \approx t_2(d_2) \approx \cdots \approx t_p(d_p)$, where $t_i(d_i) = d_i / s_i(d_i)$ and $d_1 + d_2 + \cdots + d_p = N$

[Figure: absolute speed vs. problem size; speed functions $s_1(d), \dots, s_4(d)$ and allocations $d_1, d_2, d_3, d_4$ with $d_1 + d_2 + d_3 + d_4 = n$]

All processors complete their work within the same time

The solution lies on a line passing through the origin, since $d_i / s_i(d_i)$ is constant

However, this approach was designed only for clusters of heterogeneous uniprocessors

* A. Lastovetsky et al.: Data partitioning with a functional performance model of heterogeneous processors, 2007.


Page 8: multicore_multigpu_0

Data Partitioning on Hybrid System

Outline

1 Model-based Data Partitioning

2 Data Partitioning on Hybrid System

3 Experiment Results

4 Conclusions and Future Work


Page 9: multicore_multigpu_0

Data Partitioning on Hybrid System

Performance Modeling of Hybrid System

Multicore/GPUs are modeled independently

Separate memory, programming models
Represented by speed functions (FPM)
Benchmarking with computational kernels

Performance model of multicore:

Approximate the speed of multiple cores, e.g. all cores in a processor except the ones dedicated to GPUs

Performance model of GPU:

Approximate the combined speed of a GPU and its dedicated core

[Figure: Processing Flow]


Page 10: multicore_multigpu_0

Data Partitioning on Hybrid System

Performance Measurement of Hybrid System

Generic measurement techniques

Process binding: avoid process migration
Synchronization: ensure resources are shared between cores
Repeated measurements: ensure statistically reliable results
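A minimal sketch of how these three techniques might combine in a benchmark driver, assuming Linux sched_setaffinity for binding, MPI_Barrier for a synchronized start, and a fixed repetition count in place of a proper statistical stopping rule:

```c
#define _GNU_SOURCE
#include <sched.h>    /* sched_setaffinity, cpu_set_t */
#include <time.h>     /* clock_gettime */
#include <mpi.h>      /* MPI_Barrier */

/* Pin the calling process to one core so it cannot migrate. */
static void bind_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    sched_setaffinity(0, sizeof(set), &set);
}

/* Time kernel(size) several times; all processes start each trial
 * together so cores contend for shared resources as in production. */
static double measure(void (*kernel)(long), long size, int reps)
{
    double sum = 0.0;
    for (int r = 0; r < reps; r++) {
        struct timespec t0, t1;
        MPI_Barrier(MPI_COMM_WORLD);           /* synchronized start */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        kernel(size);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        sum += (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    }
    return sum / reps;   /* mean; a real driver would also check variance */
}
```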

However, how can processor performance be measured accurately on a hybrid system?

[Figure: Hybrid Multicore & Multi-GPU System]


Page 11: multicore_multigpu_0

Data Partitioning on Hybrid System

Performance Measurement of Hybrid System

Performance measurement of multiple cores:

Programming model: one process (thread) per core to achieve high performance

Cores interfere with each other due to resource contention

Performance is evaluated in groups

All cores in the group execute the same amount of workload in parallel


Page 12: multicore_multigpu_0

Data Partitioning on Hybrid System

Performance Measurement of Hybrid System

Performance measurement of GPU:

One core dedicated to the GPU, other cores being idle

Kernel computation time and data transfer time are both included

Additional issue: host NUMA affects PCIe transfer throughput in dual-IOH systems


Page 13: multicore_multigpu_0

Data Partitioning on Hybrid System

Application: Matrix Multiplication on Heterogeneous Platform*

Matrices partitioned unevenly to achieve load balancing

Processors arranged so that communication is minimized

Computational kernel: panel-panel update
Reuse vendor-optimized GEMM routine
Computation is proportional to the area of submatrix C_i

The same memory access pattern as the whole application

* Beaumont, O. et al.: Matrix Multiplication on Heterogeneous Platforms. IEEE Trans. Parallel Distrib. Syst., 2001.


Page 14: multicore_multigpu_0

Data Partitioning on Hybrid System

Development of Computational Kernel

Multicore CPU:

Use GEMM routine from the ACML library
Multiple processes running the sequential routine
Alternative: single process running the threaded routine

GPU accelerator:

Use GEMM routine from the CUBLAS library
Develop an out-of-core kernel to overcome the memory limitation
Overlap data transfers and kernel execution to hide latency

Out-of-core kernel, overlapping data transfers with kernel execution: allocate 5 buffers in device memory: A0, A1, B0, C0, C1 (see the sketch below)
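A minimal host-side sketch of this double-buffering scheme, written against the CUDA runtime and cuBLAS APIs. Everything beyond the five-buffer layout is an assumption: row-major host matrices, C = A × B computed in row panels with B resident on the device, double precision, and no error handling. Real overlap also requires the host panels to be in pinned (cudaMallocHost) memory; the authors' actual kernel may differ.

```c
/* Out-of-core GEMM sketch with transfer/compute overlap, following
 * the slide's five device buffers (A0, A1, B0, C0, C1). Row-major
 * C = A*B is obtained from column-major cuBLAS via C^T = B^T * A^T
 * (operands swapped, no transposes). */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void ooc_gemm(const double *A, const double *B, double *C, int n, int panel)
{
    cublasHandle_t h;
    cublasCreate(&h);
    cudaStream_t st[2];
    cudaStreamCreate(&st[0]);
    cudaStreamCreate(&st[1]);

    size_t pn = (size_t)panel * n * sizeof(double);
    double *dA[2], *dB, *dC[2];
    cudaMalloc(&dA[0], pn);  cudaMalloc(&dA[1], pn);   /* A0, A1 */
    cudaMalloc(&dB, (size_t)n * n * sizeof(double));   /* B0     */
    cudaMalloc(&dC[0], pn);  cudaMalloc(&dC[1], pn);   /* C0, C1 */
    cudaMemcpy(dB, B, (size_t)n * n * sizeof(double), cudaMemcpyHostToDevice);

    const double one = 1.0, zero = 0.0;
    for (int i = 0, b = 0; i < n; i += panel, b ^= 1) {
        int rows = (i + panel <= n) ? panel : n - i;
        cudaStreamSynchronize(st[b]);         /* buffer b free to reuse */
        cudaMemcpyAsync(dA[b], A + (size_t)i * n,
                        (size_t)rows * n * sizeof(double),
                        cudaMemcpyHostToDevice, st[b]);
        cublasSetStream(h, st[b]);
        cublasDgemm(h, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, rows, n, &one, dB, n, dA[b], n, &zero, dC[b], n);
        cudaMemcpyAsync(C + (size_t)i * n, dC[b],
                        (size_t)rows * n * sizeof(double),
                        cudaMemcpyDeviceToHost, st[b]);
        /* while stream b works on this panel, the next iteration
         * stages the following panel through the other buffer/stream */
    }
    cudaDeviceSynchronize();
    cudaFree(dA[0]); cudaFree(dA[1]); cudaFree(dB);
    cudaFree(dC[0]); cudaFree(dC[1]);
    cudaStreamDestroy(st[0]); cudaStreamDestroy(st[1]);
    cublasDestroy(h);
}
```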


Page 15: multicore_multigpu_0

Data Partitioning on Hybrid System

Experimental Setup

Hybrid Multicore and Multi-GPU Node

                   CPU (AMD)         GPU (NVIDIA)     GPU (NVIDIA)
Architecture       Opteron 8439SE    GeForce GTX680   Tesla C870
Core Clock         2.8 GHz           1006 MHz         600 MHz
Number of Cores    4 × 6 cores       1536 cores       128 cores
Memory Size        4 × 16 GB         2048 MB          1536 MB
Memory Bandwidth   n/a               192.3 GB/s       76.8 GB/s


Page 16: multicore_multigpu_0

Data Partitioning on Hybrid System

Building Functional Performance Models (FPMs)

s_c(x): approximates the speed of multiple cores executing the CPU kernel on c cores simultaneously, with problem size x/c on each core

g(x): approximates the combined speed of a GPU and its dedicated core

Functional Performance Models of multicore CPU:

Platform consists of four 6-core sockets

Modeling the performance of cores in each socket

s_5(x): 5 cores running the CPU kernel, 1 core being idle

s_6(x): all 6 cores running the CPU kernel

[Figure: speed (GFlops) vs. matrix blocks (b × b); curves for 5 cores and 6 cores]


Page 17: multicore_multigpu_0

Data Partitioning on Hybrid System

Building Functional Performance Models (FPMs)

s_c(x): approximates the speed of multiple cores executing the CPU kernel on c cores simultaneously, with problem size x/c on each core

g(x): approximates the combined speed of a GPU and its dedicated core

Functional Performance Model of GPU:

version 1: naive implementation

version 2: accumulates intermediate results; out-of-core kernel overcomes the memory limitation

version 3: overlaps data transfers and kernel execution

[Figure: speed (GFlops) vs. matrix blocks (b × b); curves for versions 1-3 and the device memory limit]


Page 18: multicore_multigpu_0

Data Partitioning on Hybrid System

Impact of Resource Contention on Performance Modeling

CPU and GPU kernels benchmarked simultaneously in a socket

FPM of multiple cores, s_5(x), is barely affected

FPM of GPU, g(x), retains about 85% accuracy (speed drops by 7-15%)

[Figures: s_5(x), speed of multiple cores, and g(x), speed of a GPU; speed (GFlops) vs. matrix blocks (b × b), with curves for CPU-only/GPU-only and cores:GPU ratios 1:5 and 1:10]

Note: the two figures above have different vertical scales (1:10)


Page 19: multicore_multigpu_0

Experiment Results

Outline

1 Model-based Data Partitioning

2 Data Partitioning on Hybrid System

3 Experiment Results

4 Conclusions and Future Work


Page 20: multicore_multigpu_0

Experiment Results

Execution time of the application under different configurations

Matrix size (blocks)   CPUs (sec)   GTX680 (sec)   Hybrid-FPM (sec)
40 × 40                99.5         74.2           26.6
50 × 50                195.4        162.7          77.8
60 × 60                300.1        316.8          114.4
70 × 70                491.6        554.8          226.1

Column 1: matrix size in blocks, block size 640 × 640
Column 2: 24 CPU cores, homogeneous data partitioning
Column 3: 1 CPU core + 1 GPU
Column 4: 24 CPU cores + 2 GPUs, FPM-based data partitioning



Page 24: multicore_multigpu_0

Experiment Results

Computation time of each process

[Figures: computation time (sec) per process rank (0-24) under CPM-based partitioning (left) and FPM-based partitioning (right); the Tesla C870 and GeForce GTX 680 processes are marked]

Matrix size 60 × 60: computation time reduced by 40%

Page 25: multicore_multigpu_0

Experiment Results

Execution time of the application under different partitioning algorithms

[Figure: execution time (sec) vs. matrix size n, for homogeneous, CPM-based, and FPM-based partitioning]

Execution time reduced by 23% and 45%, respectively

Page 26: multicore_multigpu_0

Conclusions and Future Work

Conclusions and Future Work

Conclusions

Defined and built functional performance models (FPMs) of a hybrid multicore and multi-GPU system, considering it as a distributed-memory system

Adapted FPM-based data partitioning to the hybrid system, achieving load balancing and delivering good performance

Future Work

Apply the approach to hybrid clusters
Partition with respect to interconnect speed


Page 27: multicore_multigpu_0

Conclusions and Future Work

Thank You!

Science Foundation Ireland, University College Dublin, Heterogeneous Computing Laboratory, China Scholarship Council


Page 28: multicore_multigpu_0

Conclusions and Future Work

Partitioning with functional performance models

Want all devices to compute their assigned workload $d_i$ within the same time.

Points $(d_i, s_i(d_i))$ lie on a line passing through the origin when $d_i / s_i(d_i)$ is constant.

The total problem size determines the slope.

The algorithm iteratively bisects the solution space to find the values $d_i$; a sketch follows the figure.

[Figure: absolute speed vs. problem size; speed functions $s_1(d), \dots, s_4(d)$ and allocations $d_1, d_2, d_3, d_4$ with $d_1 + d_2 + d_3 + d_4 = n$]
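A minimal sketch of this bisection, under stated assumptions rather than the authors' implementation: each speed function is queried through a callback, the balanced line is parameterized by the common execution time t = d_i/s_i(d_i), and the per-device allocation for a given t is found by a fixed-point iteration that assumes the speed function varies slowly (the paper instead intersects the line with the piecewise-linear speed functions):

```c
/* Bisection on the common execution time t: on the balanced line,
 * d_i = t * s_i(d_i) for every device, so the total allocation
 * sum_i d_i grows with t and can be bisected until it matches N. */
typedef double (*speed_fn)(double d);   /* s_i(d): speed at size d */

static double alloc_for_time(speed_fn s, double t)
{
    double d = t * s(1.0);              /* initial guess      */
    for (int k = 0; k < 64; k++)
        d = t * s(d);                   /* solve d = t * s(d) */
    return d;
}

void fpm_partition(speed_fn s[], int p, double N, double d[])
{
    double lo = 0.0, hi = 1.0, sum;
    do {                                /* find an upper bound U  */
        hi *= 2.0;
        sum = 0.0;
        for (int i = 0; i < p; i++) sum += alloc_for_time(s[i], hi);
    } while (sum < N);
    for (int k = 0; k < 100; k++) {     /* bisect between L and U */
        double t = 0.5 * (lo + hi);
        sum = 0.0;
        for (int i = 0; i < p; i++) sum += alloc_for_time(s[i], t);
        if (sum < N) lo = t; else hi = t;
    }
    for (int i = 0; i < p; i++)
        d[i] = alloc_for_time(s[i], hi);  /* allocations on the line */
}
```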

Page 29: multicore_multigpu_0

Conclusions and Future Work

Partitioning with functional performance models (animation frame; slide text as on Page 28)

[Figure: two lines L and U through the origin bracket the solution: $d_1^U + d_2^U + d_3^U + d_4^U < n$ and $d_1^L + d_2^L + d_3^L + d_4^L > n$]

Page 30: multicore_multigpu_0

Conclusions and Future Work

Partitioning with functional performance models (animation frame; slide text as on Page 28)

[Figure: the region between L and U is bisected; the new line yields a total allocation < n or > n]

Page 31: multicore_multigpu_0

Conclusions and Future Work

Partitioning with functional performance models (animation frame; slide text as on Page 28)

[Figure: bounds U and L after a further bisection step]

Page 32: multicore_multigpu_0

Conclusions and Future Work

Partitioning with functional performance models (animation frame; slide text as on Page 28)

[Figure: final allocations $d_1, d_2, d_3, d_4$ on the line between bounds L and U]