Porting Irregular Reductions on Heterogeneous CPU-GPU Configurations
Xin Huo, Vignesh T. Ravi, Gagan Agrawal
Department of Computer Science and Engineering
The Ohio State University
Irregular Reduction - Context
• One of the dwarfs in the Berkeley view on parallel computing
– Unstructured grid pattern
– More random and irregular accesses
– Indirect memory references
• Previous efforts in porting to different architectures
– Distributed memory machines
– Distributed shared memory machines
– Shared memory machines
– Cache performance improvement on uniprocessors
– Many-core architectures - GPGPU (our work in ICS '11)
– No systematic study on heterogeneous architectures (CPU & GPU)
Why CPU + GPU? – A Glimpse
• Dominant in different areas, yet tightly connected
– GPU: computation-intensive problems; large numbers of threads execute in SIMD
– CPU: data-intensive problems, branch processing, high-precision and complicated computation
– The GPU depends on scheduling and data from the CPU
• One of the most popular heterogeneous architectures
– 3 of the 5 fastest supercomputers are based on the CPU + GPU architecture (Top500 list, 11/2011)
– Fusion from AMD, Sandy Bridge from Intel, and Denver from NVIDIA
– Cloud compute instances in Amazon
Outline
• Background
– Irregular Reduction Structure
– Partitioning-based Locking Scheme
– Main Issues
• Contributions
• Multi-level Partitioning Framework
• Runtime Support
– Pipelining Scheme
– Task Scheduling
• Experimental Results
• Conclusions
Irregular Reduction Structure
• Robj: Reduction Object
• e: iteration of the computation loop
• IA(e,x): Indirection Array
• IA: iterates over e (Computation Space)
• Robj: accessed through the Indirection Array (Reduction Space)
/* Outer Sequence Loop */
while( ) {
    /* Reduction Loop */
    Foreach(element e) {
        (IA(e,0), val1) = Process(IA(e,0));
        (IA(e,1), val2) = Process(IA(e,1));
        Robj(IA(e,0)) = Reduce(Robj(IA(e,0)), val1);
        Robj(IA(e,1)) = Reduce(Robj(IA(e,1)), val2);
    }
    /* Global Reduction to Combine Robj */
}
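The loop structure above can be sketched in a few lines of Python. This is a minimal illustration on a hypothetical 4-node, 4-edge mesh, with Reduce() specialized to addition and `process` standing in for the per-edge computation:

```python
# Hypothetical indirection array: each edge e touches two reduction elements.
IA = [(0, 1), (1, 2), (0, 3), (2, 3)]   # edge -> pair of node indices
robj = [0.0, 0.0, 0.0, 0.0]             # reduction objects, one per node

def process(node):
    # stand-in for the per-edge computation (e.g. a force contribution)
    return float(node + 1)

for i, j in IA:
    robj[i] += process(i)   # Reduce(Robj(IA(e,0)), val1) with Reduce = +
    robj[j] += process(j)   # Reduce(Robj(IA(e,1)), val2)
```

The irregularity is visible in the subscripts: which entries of `robj` an iteration updates is known only at runtime, through `IA`.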
Application Context
Molecular Dynamics
• Indirection Array -> Edges (Interactions)
• Reduction Objects -> Molecules (Attributes)
• Computation Space -> Interactions between molecules
• Reduction Space -> Attributes of molecules
Partitioning-based Locking Strategy
• Reduction Space Partitioning
– Efficient shared memory utilization
– Eliminates intra- and inter-block combination
• Multi-Dimensional Partitioning Method
– Balances minimal edge cutting against partitioning time
Huo et al., 25th International Conference on Supercomputing
Main Issues
• Device memory limitation on the GPU
• Partitioning overhead
– Partitioning cost grows with the data volume
– The GPU is idle while waiting for the partitioning results
• Low utilization of the CPU
– The CPU only performs partitioning
– The CPU is idle while the GPU is computing
Contributions
• A novel Multi-level Partitioning Framework
– Parallelizes irregular reductions on a heterogeneous architecture (CPU + GPU)
– Eliminates the device memory limitation on the GPU
• Runtime support scheme
– Pipelining scheme
– Work-stealing-based scheduling strategy
• Significant performance improvements
– Exhaustive evaluations
– 11% and 22% improvement for Euler and Molecular Dynamics, respectively
Multi-level Partitioning Framework
Computation Space Partitioning
• Partitioning on the iterations of the computation loop
• Pros
– Load balance on computation
• Cons
– Unequal reduction size in each partition
– Replicated reduction elements (4 out of 16 nodes)
– Combination cost
• Between CPU and GPU (first level)
• Between different thread blocks (second level)
[Figure: a 16-node mesh whose edges (iterations) are divided into Partitions 1-4; reduction elements on partition boundaries are replicated across partitions]
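The replication cost of computation space partitioning can be sketched as follows, on a hypothetical 4-edge ring mesh: iterations (edges) are split evenly, and any reduction element touched from two partitions must later be combined.

```python
# Hypothetical ring mesh: 4 edges over 4 nodes.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
partitions = [edges[:2], edges[2:]]      # equal iteration counts per partition

# Reduction elements touched by each partition.
touched = [set(n for e in p for n in e) for p in partitions]

# Elements touched from both partitions are replicated and need combination.
replicated = touched[0] & touched[1]
```

Even in this tiny example, half the nodes end up replicated, which is the combination cost the deck's reduction space partitioning avoids.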
Reduction Space Partitioning
• Pros
– Balances the reduction space
• Shared memory becomes feasible on the GPU
– Partitions are mutually independent
• No communication between CPU and GPU (first level)
• No communication between thread blocks (second level)
– Avoids combination cost
• Between different thread blocks
• Between CPU and GPU
• Cons
– Imbalance in the computation space
– Replicated work caused by crossing edges
• Partitioning on Reduction Elements
[Figure: the same 16-node mesh with nodes (reduction elements) divided into Partitions 1-4; crossing edges are processed by every partition they touch]
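The trade-off can be sketched on the same hypothetical ring mesh: partitioning the nodes instead of the edges makes each partition's updates independent (no combination step), at the price of processing crossing edges once per partition they touch.

```python
# Hypothetical ring mesh and an assumed node-to-partition assignment.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
node_part = {0: 0, 1: 0, 2: 1, 3: 1}     # owning partition of each node

# Each partition processes every edge touching one of its nodes.
work = {0: [], 1: []}
for e in edges:
    for p in {node_part[n] for n in e}:
        work[p].append(e)                # crossing edges land in both lists

# Crossing edges are the replicated work.
crossing = [e for e in edges if len({node_part[n] for n in e}) == 2]
```

Each partition only ever writes its own nodes, so no inter-partition combination is needed; the only overhead is recomputing the two crossing edges.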
Task Scheduling Framework
• Pipelining Scheme
– K blocks assigned to the GPU in one global loading
– Pipelining between partitioning and computation within a global loading
• Work Stealing Scheduling
– Scheduling granularity (large for the GPU, small for the CPU)
• Too large: better pipelining effect, but worse load balance
• Too small: better load balance, but shorter pipelining length
– Work stealing achieves both maximum pipelining length and good load balance
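The granularity trade-off above can be sketched with a shared deque, under assumed chunk sizes: the GPU worker takes coarse chunks from one end while the CPU worker steals fine chunks from the other, so the GPU keeps a long pipeline and the CPU evens out the load.

```python
from collections import deque

blocks = deque(range(20))        # partition blocks awaiting computation
GPU_CHUNK, CPU_CHUNK = 5, 1      # assumed scheduling granularities

gpu_done, cpu_done = [], []
while blocks:
    # GPU worker grabs a coarse chunk from the front of the queue.
    for _ in range(min(GPU_CHUNK, len(blocks))):
        gpu_done.append(blocks.popleft())
    # CPU worker steals a fine chunk from the opposite end.
    for _ in range(min(CPU_CHUNK, len(blocks))):
        cpu_done.append(blocks.pop())
```

This is a sequential illustration of the scheduling policy, not the concurrent runtime itself; every block is processed exactly once, by whichever device reaches it first.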
Experiment Setup
• Platform
– GPU
• NVIDIA Tesla C2050 "Fermi" (14 x 32 = 448 cores)
• 2.68GB device memory
• 64KB configurable shared memory
– CPU
• Intel Xeon E5520 quad-core, 2.27 GHz
• 48GB memory
• x16 PCI Express 2.0
• Applications
– Euler (Computational Fluid Dynamics)
– MD (Molecular Dynamics)
Scalability – Molecular Dynamics
• Scalability of Molecular Dynamics on multi-core CPU and GPU across different datasets (MD)
Pipelining – Euler
• Effect of pipelining CPU partitioning and GPU computation (EU)
[Figure annotations: performance increases as pipelining increases; partitioning overhead is avoided except for the first partition]
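The pipelining scheme, in which the CPU partitions block k+1 while the GPU computes on block k, can be sketched as a two-stage producer/consumer pipeline. Block count and buffer size here are hypothetical:

```python
import queue
import threading

def cpu_partitioner(n_blocks, out_q):
    # Stage 1: the CPU "partitions" each block and hands it over.
    for k in range(n_blocks):
        out_q.put(k)
    out_q.put(None)              # end-of-stream marker

q = queue.Queue(maxsize=2)       # small buffer between the two stages
t = threading.Thread(target=cpu_partitioner, args=(4, q))
t.start()

computed = []
while (k := q.get()) is not None:
    computed.append(k)           # Stage 2: "GPU computation" on block k
t.join()
```

Because stage 1 runs ahead of stage 2, only the partitioning of the first block sits on the critical path, matching the "except the first partition" behavior above.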
Heterogeneous Performance – EU and MD
• Benefits from dividing computations between CPU and GPU for EU and MD
[Figure annotations: 11% improvement for EU, 22% for MD]
Work Stealing – Euler
• Comparison of fine-grained, coarse-grained, and work stealing strategies
– Granularity = 1: good load balance, bad pipelining effect
– Granularity = 5: good pipelining effect, bad load balance
– Work stealing: good pipelining effect, good load balance
Conclusions
• A Multi-level Partitioning Framework to port irregular reductions onto heterogeneous architectures
• The pipelining scheme overlaps partitioning on the CPU with computation on the GPU
• Work stealing scheduling achieves both the best pipelining effect and load balance
Thank you
Questions?

Contacts:
Xin Huo [email protected]
Vignesh T. Ravi [email protected]
Gagan Agrawal [email protected]