Data Partitioning on Heterogeneous Multicore and Multi-GPU Systems Using Functional Performance Models of Data-Parallel Applications
Ziming Zhong, Vladimir Rychkov, Alexey Lastovetsky
Heterogeneous Computing Laboratory, University College Dublin, Ireland
Cluster 2012
Ziming Zhong (HCL/UCD) Data Partitioning on Heterogeneous Multicore/GPUs Systems Cluster 2012 1 / 22
Model-based Data Partitioning
Motivation
Hybrid GPU-accelerated parallel computers
Higher power efficiency, performance/price ratio, etc.
Successfully applied to bioinformatics, astrophysics, molecular dynamics, computational fluid dynamics, etc.
[Figures: Hybrid Clusters; Hybrid Multicore & Multi-GPU System]
Model-based Data Partitioning
Motivation
Hybrid multicore + multi-GPU systems present challenges
Parallel programming is hard
Load balancing problem
Heterogeneity: processor, memory, etc.
Hierarchical levels of parallelism: node, socket, core, etc.
and others
[Figures: Hybrid Clusters; Hybrid Multicore & Multi-GPU System]
Model-based Data Partitioning
Leveraging Hybrid Multicores/GPUs System
In this work, we target:
Data parallel application
Divisible computational workload
Workload proportional to data size
Dependent on data locality
Dedicated hybrid system
Reuse of optimized software stack
Our approach:
Heterogeneous distributed-memory system
Performance modeling of hybrid system
Model-based data partitioning to balance load
[Figure: Processing Flow]
Model-based Data Partitioning
Data Partitioning on Heterogeneous Platform:
1 Workload is divisible and proportional to data size
2 Workload is partitioned in proportion to processor speed
3 Workload is distributed in proportion to processor speed
Model-based Data Partitioning
Data partitioning relies on accurate performance models
Traditionally, performance is defined by a single constant number
- Constant Performance Model (CPM)
- Computed from clock speed or by performing a benchmark
- Computational units are partitioned as di = N × si / (s1 + s2 + ... + sp)
- Simplistic; algorithms may fail to converge to a balanced solution [1]
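The CPM formula above can be sketched in code. `cpm_partition` is a hypothetical helper name (not from the talk); it also repairs integer rounding so the shares still sum to N:

```python
def cpm_partition(N, speeds):
    """Constant-performance-model partitioning (a sketch):
    d_i = N * s_i / sum(s), with rounding repaired so sum(d) == N."""
    total = sum(speeds)
    shares = [N * s / total for s in speeds]
    d = [int(x) for x in shares]  # floor of each share
    # hand the leftover units to the largest fractional parts
    leftover = N - sum(d)
    by_frac = sorted(range(len(d)), key=lambda i: shares[i] - d[i], reverse=True)
    for i in by_frac[:leftover]:
        d[i] += 1
    return d
```

Because the speeds are single constants, the partition ignores how each processor's speed changes with problem size, which is exactly the weakness the FPM addresses.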
Functional Performance Model (FPM) [2]:
Represents speed as a function of problem size
Realistic
Application centric
Hardware specific
[Figure: Matrix Multiplication Benchmark on Grid5000-Lille: speed (GFLOPS) vs. problem size (matrix elements), for nodes chirloute-3, chimint-1, chinqchint-1, chicon-1]
[1] D. Clarke et al.: Dynamic Load Balancing of Parallel Iterative Routines on Heterogeneous HPC Platforms, 2010.
[2] A. Lastovetsky et al.: Data Partitioning with a Functional Performance Model of Heterogeneous Processors, 2007.
Model-based Data Partitioning
Partitioning with functional performance models*
Load is balanced when:
t1(d1) ≈ t2(d2) ≈ ... ≈ tp(dp), where ti(di) = di / si(di)
d1 + d2 + ... + dp = N
[Figure: speed functions s1(d), s2(d), s3(d), s4(d); absolute speed vs. size of the problem; d1 + d2 + d3 + d4 = n]
All processors complete work within the same time
The solution lies on a line passing through the origin, since di / si(di) = constant
However, this approach was designed only for clusters of heterogeneous uniprocessors
* A. Lastovetsky et al: Data partitioning with a functional performance model of heterogeneous processors, 2007.
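The balancing criterion above can be sketched numerically. This is an illustrative bisection (names like `fpm_partition` and `d_for_time` are mine, not the authors'), assuming each speed function si(d) is positive and each ti(d) = d / si(d) is increasing in d:

```python
def fpm_partition(N, speed_fns, iters=60):
    """FPM-based partitioning (a sketch): find d_1..d_p with sum d_i = N
    such that all devices finish at the same time t = d_i / s_i(d_i).
    Assumes s_i(d) > 0 and t_i(d) = d / s_i(d) increasing in d."""
    def d_for_time(s, t):
        # largest d in [0, N] with d / s(d) <= t, by bisection on d
        lo, hi = 0.0, float(N)
        for _ in range(iters):
            mid = (lo + hi) / 2
            if mid / s(mid) <= t:
                lo = mid
            else:
                hi = mid
        return lo

    # bisection on the common completion time t: the total workload
    # sum_i d_i(t) grows with t and equals N at the balanced solution
    t_lo, t_hi = 0.0, max(N / s(N) for s in speed_fns)
    for _ in range(iters):
        t = (t_lo + t_hi) / 2
        if sum(d_for_time(s, t) for s in speed_fns) < N:
            t_lo = t
        else:
            t_hi = t
    return [d_for_time(s, t_hi) for s in speed_fns]
```

With constant speeds this reduces to the CPM proportional split; with measured speed functions it adapts each di to where that device actually operates on its speed curve.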
Data Partitioning on Hybrid System
Outline
1 Model-based Data Partitioning
2 Data Partitioning on Hybrid System
3 Experiment Results
4 Conclusions and Future Work
Data Partitioning on Hybrid System
Performance Modeling of Hybrid System
Multicore CPUs and GPUs are modeled independently
Separate memory, programming models
Represented by speed functions (FPM)
Benchmarking with computational kernels
Performance model of multicore:
Approximate the speed of multiple cores, e.g. all cores in a processor except the ones dedicated to GPUs
Performance model of GPU:
Approximate the combined speed of a GPU and its dedicated core
[Figure: Processing Flow]
Data Partitioning on Hybrid System
Performance Measurement of Hybrid System
Generic measurement techniques
Process binding: avoid process migration
Synchronization: ensure resources are shared between cores
Repeated measurements: ensure statistically reliable results
However, how can processor performance be measured accurately on a hybrid system?
[Figure: Hybrid Multicore & Multi-GPU System]
Data Partitioning on Hybrid System
Performance Measurement of Hybrid System
Performance measurement of multiple cores:
Programming model: one process (thread) per core to achieve high performance
Cores interfere with each other due to resource contention
Performance is evaluated in groups
All cores in the group execute the same amount of workload in parallel
Data Partitioning on Hybrid System
Performance Measurement of Hybrid System
Performance measurement of GPU:
One core is dedicated to the GPU; the other cores are idle
Kernel computation time and data transfer time are both included
Additional issue: host NUMA affects PCIe transfer throughput in dual-IOH systems
Data Partitioning on Hybrid System
Application: Matrix Multiplication on Heterogeneous Platform*
Matrices partitioned unevenly to achieve load balancing
Processors arranged so that communication is minimized
Computational kernel: panel-panel update
Reuse vendor-optimized GEMM routine
Computation is proportional to the area of submatrix Ci
The same memory access pattern as the whole application
* Beaumont, O. et al.: Matrix Multiplication on Heterogeneous Platforms. IEEE Trans. Parallel Distrib. Syst., 2001.
Data Partitioning on Hybrid System
Development of Computational Kernel
Multicore CPU:
Use GEMM routine from the ACML library
Multiple processes running the sequential routine
Alternative: a single process running the threaded routine
GPU accelerator:
Use GEMM routine from the CUBLAS library
Develop an out-of-core kernel to overcome the device memory limitation
Overlap data transfers and kernel execution to hide latency
Out-of-core kernel, overlapping data transfers and kernel execution:
- allocate 5 buffers in device memory: A0, A1, B0, C0, C1
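The out-of-core idea can be illustrated with host-side logic alone. This Python sketch (illustrative, not the authors' CUDA code) accumulates C as a sum of panel-panel updates, so only one A-panel and one B-panel must be resident at a time; on the real GPU, the A0/A1 double buffer lets the next panel's transfer overlap the current GEMM call:

```python
def matmul_out_of_core(A, B, panel):
    """Panel-panel (rank-`panel`) updates: C += A[:, k0:k1] * B[k0:k1, :].
    Only one A-panel and one B-panel are "resident" per step, which is
    what lets the kernel handle matrices larger than device memory.
    Pure-Python lists stand in for device buffers."""
    n, K, m = len(A), len(A[0]), len(B[0])
    C = [[0.0] * m for _ in range(n)]  # C stays resident (C0/C1)
    for k0 in range(0, K, panel):
        k1 = min(k0 + panel, K)
        # on a real GPU: asynchronously upload the next A-panel into the
        # spare buffer (A0/A1) while this panel's GEMM executes
        for i in range(n):
            for k in range(k0, k1):
                a = A[i][k]
                for j in range(m):
                    C[i][j] += a * B[k][j]
    return C
```

Each pass adds one rank-`panel` update to C, so the final result equals the full product regardless of how the K dimension is sliced.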
Data Partitioning on Hybrid System
Experimental Setup
Hybrid Multicore and Multi-GPU Node
                   CPU (AMD)         GPUs (NVIDIA)
Architecture       Opteron 8439 SE   GeForce GTX 680   Tesla C870
Core Clock         2.8 GHz           1006 MHz          600 MHz
Number of Cores    4 × 6 cores       1536 cores        128 cores
Memory Size        4 × 16 GB         2048 MB           1536 MB
Memory Bandwidth                     192.3 GB/s        76.8 GB/s
Data Partitioning on Hybrid System
Building Functional Performance Models (FPMs)
sc(x): approximate the speed of multiple cores when executing the CPU kernel on c cores simultaneously, with problem size x/c on each core
g(x): approximate the combined speed of a GPU and its dedicated core
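Each point of these speed functions comes from timing the kernel and dividing the operation count by the elapsed time, repeating the measurement for reliability as described earlier. A sketch, where `run_kernel` and `complexity` are hypothetical callables supplied by the benchmark harness:

```python
import statistics

def measure_speed(run_kernel, complexity, x, reps=5):
    """One FPM point: speed s(x) = operations / time, with the timing
    repeated `reps` times and the median taken for reliability.
    `run_kernel(x)` returns elapsed seconds; `complexity(x)` returns the
    operation count (e.g. 2 * n**3 for an n x n GEMM). Both are
    hypothetical stand-ins for the real benchmark."""
    times = [run_kernel(x) for _ in range(reps)]
    return complexity(x) / statistics.median(times)

def build_fpm(run_kernel, complexity, sizes, reps=5):
    """Sample the speed function at the given problem sizes."""
    return [(x, measure_speed(run_kernel, complexity, x, reps)) for x in sizes]
```

The same sampling scheme works for sc(x) (run the CPU kernel on c cores with x/c each) and for g(x) (run the GPU kernel with data transfers included in the timing).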
Functional Performance Models of multicore CPU:
Platform: consists of four 6-core sockets
Modeling the performance of cores in each socket
s5(x): 5 cores running the CPU kernel, 1 core idle
s6(x): all 6 cores running the CPU kernel
[Figure: speed (GFlops) vs. matrix blocks (b × b) for 5 cores and 6 cores]
Data Partitioning on Hybrid System
Building Functional Performance Models (FPMs)
sc(x): approximate the speed of multiple cores when executing the CPU kernel on c cores simultaneously, with problem size x/c on each core
g(x): approximate the combined speed of a GPU and its dedicated core
Functional Performance Model of GPU:
version 1: naive implementation
version 2: accumulate intermediate results; out-of-core to overcome the memory limitation
version 3: overlap data transfers and kernel execution
[Figure: speed (GFlops) vs. matrix blocks (b × b) for versions 1-3, with the device memory limit marked]
Data Partitioning on Hybrid System
Impact of Resource Contention on Performance Modeling
CPU and GPU kernels benchmarked simultaneously in a socket
FPM of multiple cores, s5(x), is barely affected
FPM of the GPU, g(x), reaches 85% accuracy (speed drops by 7-15%)
[Figures: s5(x), speed of multiple cores, and g(x), speed of a GPU, each measured in isolation (CPU-only / GPU-only) and with cores:GPU workload ratios of 1:5 and 1:10. Note: the two figures have different speed scales (1:10).]
Experiment Results
Outline
1 Model-based Data Partitioning
2 Data Partitioning on Hybrid System
3 Experiment Results
4 Conclusions and Future Work
Experiment Results
Execution time of the application under different configurations
Matrix size (blks)   CPUs (sec)   GTX680 (sec)   Hybrid-FPM (sec)
40 × 40              99.5         74.2           26.6
50 × 50              195.4        162.7          77.8
60 × 60              300.1        316.8          114.4
70 × 70              491.6        554.8          226.1

Column 1: block size is 640 × 640
Column 2: 24 CPU cores, homogeneous data partitioning
Column 3: 1 CPU core + 1 GPU
Column 4: 24 CPU cores + 2 GPUs, FPM-based data partitioning
Experiment Results
Computation time of each process
[Figures: computation time (sec) per process rank (0-24) under CPM-based partitioning and FPM-based partitioning; the Tesla C870 and GeForce GTX 680 processes are marked]
Matrix size 60 × 60; computation time reduced by 40%
Experiment Results
Execution time of the application under different partitioning algorithms
[Figure: execution time (sec) vs. matrix size n (10-80) for Homogeneous, CPM-based, and FPM-based partitioning]
Execution time reduced by 23% and 45%, respectively
Conclusions and Future Work
Conclusions and Future Work
Conclusions
Defined and built functional performance models (FPMs) of a hybrid multicore and multi-GPU system, considering it as a distributed-memory system
Adapted FPM-based data partitioning to the hybrid system, achieving load balancing and delivering good performance
Future Work
Apply the approach to hybrid clusters
Partitioning with respect to interconnect speed
Conclusions and Future Work
Thank You!
Science Foundation Ireland, University College Dublin, Heterogeneous Computing Laboratory, China Scholarship Council
Conclusions and Future Work
Partitioning with functional performance models
We want all devices to compute their assigned workload di within the same time.
Points (di, si(di)) lie on a line passing through the origin when di / si(di) = constant.
The total problem size determines the slope.
The algorithm iteratively bisects the solution space to find the values di.
[Figure (animation): speed functions s1(d), s2(d), s3(d), s4(d); absolute speed vs. size of the problem. The algorithm maintains an upper line U whose intersections give dU1 + dU2 + dU3 + dU4 < n and a lower line L giving dL1 + dL2 + dL3 + dL4 > n, bisecting between them until d1 + d2 + d3 + d4 = n.]