Towards Exploiting Data Locality for Irregular...
Transcript of Towards Exploiting Data Locality for Irregular...
Towards Exploiting Data Locality for IrregularApplications on Shared-Memory Multicore
Architectures
Cheng Wang
Advisor: Barbara Chapman
HPCTools Group,Department of Computer Science,
University of Houston, Houston, TX, 77004, USA
May 17, 2016
Outline
1 What are irregular applications?
2 Sparse FFT - A case study of irregular applications
3 A padding algorithm to improve the data locality
4 Conclusion & Future work
May 17, 2016 Cheng Wang ([email protected]) 2 / 22
The Reality of Parallel Computing ...
1Slide based on a post from https://highscalability.com
May 17, 2016 Cheng Wang ([email protected]) 3 / 22
Why CPU Caching Matters?
year
Perfo
rmance
1980 2004 2014
CPU
Memory
Processor-Memory
Performance Gap
Memory hierarchy Processor-memory
performance gap
• Memory has become the principle performance bottleneck
• Improve the cache utilization is the key to performance optimization
1Source: http://cs.uwec.edu/~buipj/teaching/cs.352.f12/lectures/lecture_08.html
May 17, 2016 Cheng Wang ([email protected]) 4 / 22
Shared-Memory Multicore Architectures
1 Shared memory• On-chip: (Last-level) cache shared by homo/hetero processors• Off-chip: Main memory shared by homo/hetero processors
May 17, 2016 Cheng Wang ([email protected]) 5 / 22
What are Irregular Applications?
do i = 1 , N. . . = x [ i d x [ i ] ]
end do
1 Indirect array reference pattern
2 Commonly found in linked list, tree and graph-basedapplications
3 Poor data locality
4 Challenge esp. for shared-memory multicore architecture ascores compete for memory bandwidth
May 17, 2016 Cheng Wang ([email protected]) 6 / 22
Approach: Computation/Data Reordering
1 2 3 4
1 2 3 4
Computation
Data
1 2 3 4
3 1 2 4
2 3 1 4
1 2 3 4
Computation
reordering
Data
reordering
May 17, 2016 Cheng Wang ([email protected]) 7 / 22
Challenges in Dynamic Irregularity Removal
1 Dynamic irregularity• Memory access pattern remains unknown until runtime and
may change during computations• Previous work on compile-time transformations can hardly
apply• Need for transformation at runtime
2 Runtime transformation overhead• Transformation overhead is placed on the critical path of the
application’s execution• The benefits of improved data locality must outweigh the cost
of the data layout transformation at runtime
May 17, 2016 Cheng Wang ([email protected]) 8 / 22
Outline
1 What are irregular applications?
2 Sparse FFT - A case study of irregular applications
3 A padding algorithm to improve the data locality
4 Conclusion & Future work
Sparse FFT
FFT
1 A novel compressive sensing algorithm with massive applicationdomains
2 Fourier transform is dominated by a small number of “peaks”• FFT(O(nlogn)) is inefficient
3 Compute the k-sparse Fourier transform in lower time complexity• k-sparse: no. of “large” coordinates at freq. domain
May 17, 2016 Cheng Wang ([email protected]) 10 / 22
Sparse Data is Ubiquitous ...
1Slide based on http://groups.csail.mit.edu/netmit/sFFT/
May 17, 2016 Cheng Wang ([email protected]) 11 / 22
Irregular Memory Access Pattern in Sparse FFT
n coordinates
B buckets
Irregular data
reference pattern
buckets[i % B] += signal[idx] * filter[i]
• Randomly permutes the signalspectrum and bins into a smallnumber of buckets
• Irregular memory access pattern
May 17, 2016 Cheng Wang ([email protected]) 12 / 22
Parallel Sparse FFT
1 Modern architectures are exclusively based on multicore andmanycore processors
• e.g., Multicore CPUs, GPUs, Intel Xeon Phi, etc.• Nature path to improve the performance of sFFT through
efficient parallel algorithm design and impl.
2 Standard full-size FFT has been well studied and implemented• FFTW, cuFFT, Intel MLK, etc..• Highly optimized for specific architectures
3 We are the first such effort of high-performance parallel sFFTimplementation
1cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs,C. Wang, S. Chandrasekaran, and B. Chapman, in Proceedings of 30th IEEEInternational Parallel & Distributed Processing Symposium (IPDPS 2016). [ToAppear]
May 17, 2016 Cheng Wang ([email protected]) 13 / 22
Exec. Time: cusFFT vs. sFFT vs. cuFFT (k = 1000)
0.001
0.01
0.1
1
10
19 20 21 22 23 24 25 26 27
Exe
cu
tio
n T
ime
(se
c)
Signal Size (2n)
GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)
sFFT (MIT)PsFFT (6 threads)
cusFFTcuFFT
• cuFFT: full-size FFT library on Nvidia GPUs• The MIT seq. sFFT is slower than cuFFT• cusFFT is 5x faster than PsFFT, 25x vs. the seq. sFFT• cusFFT is up to 12x faster than cuFFT
May 17, 2016 Cheng Wang ([email protected]) 14 / 22
Exec. Time cusFFT vs. sFFT vs. cuFFT (n = 225)
0.01
0.1
1
10
5000 10000 15000 20000 25000 30000 35000 40000
Execution T
ime (
sec)
Signal Sparsity k
GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)
sFFT (MIT)PsFFT (6 threads)
cusFFTcuFFT
• The seq. sFFT is slower than cuFFT
• PsFFT is faster than cuFFT until k = 3000
• cusFFT is faster than cuFFT until k = 41, 000
May 17, 2016 Cheng Wang ([email protected]) 15 / 22
Outline
1 What are irregular applications?
2 Sparse FFT - A case study of irregular applications
3 A padding algorithm to improve the data locality
4 Conclusion & Future work
Rethink the Consecutive Packing (CPACK) Algorithm
CPACK: A greedy algorithm which packs data intoconsecutive locations in the order they are first accessed bythe computation
23 67 103
1 2 3 4
Computation
Data
9 ... ... ...
CPACK
9 23 67 103
Data
6 cache miss
5 6 7 1 2 3 4
Computation
5 6 7
Data access order: 9, 23, 103, 23, 67, 23, 677 cache misses
Original Data reordered by CPACK
miss
hit
• First-touch policy packs (9,23) together• Not optimal
May 17, 2016 Cheng Wang ([email protected]) 17 / 22
Rethink the Consecutive Packing (CPACK) Algorithm
Affinity-conscious data reordering ...
23 67 103
1 2 3 4
Computation
Data
9 ... ... ...
data
reordering
23 679 103
Data
4 cache miss
5 6 7 1 2 3 4
Computation
5 6 7
Data access order: 9, 23, 103, 23, 67, 23, 677 cache misses
Original An Optimal data layout
miss
hit
• CPACK does not consider data affinity (i.e., how close the nearbydata elements are accessed together)
• Packs (23,67) rather than (9,23) should yield better locality
May 17, 2016 Cheng Wang ([email protected]) 18 / 22
Data Reordering and NP-completeness
1 Finding an optimal data layout is a NP-complete problem1
2 No “best” data reordering algorithm that works in general
3 Implicit constraint: Each data entry has only one copy inthe transformed format
4 The complexity can be significantly reduced if more space isallowed to use
1E. Petrank and D. Rawitz. 2002. The hardness of cache conscious data placement, POPL ’02)
May 17, 2016 Cheng Wang ([email protected]) 19 / 22
A Padding Algorithm that Circumvents the Complexity
CPACKE Algorithm: Extends the CPACK by creating duplicatedcopies of each repeatedly accessed data entry
23 67 103
1 2 3 4
Computation
Data
9 ... ... ...
data
reordering
23 679 103
Data
4 cache miss
5 6 7 1 2 3 4
Computation
5 6 7
Data access order: 9, 23, 103, 23, 67, 23, 677 cache misses
Original Padding algorithm
miss
hit
23 23 67
• Advantage: Better locality than CPACK
• Disadvantage: Slight space overhead
May 17, 2016 Cheng Wang ([email protected]) 20 / 22
Performance Evaluation
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
19 20 21 22 23 24 25 26 27 28
Execution tim
e (
sec)
Signal size (2n)
Intel Xeon E5-2670 (Sandy Bridge)
PsFFT (before trans)PsFFF (after trans)
• Applies the CPACKE to the perm+filter stage in sFFT• Improves the performance by 30% for the irregular kernel• Improves the overall performance of PsFFT by 20%
May 17, 2016 Cheng Wang ([email protected]) 21 / 22
Conclusion & Future Work
1 A padding-based algorithm improving the data locality ofirregular applications
2 Improves the performance of sFFT by 30%
3 Future work• Evaluate with more irregular applications• Evaluate with other data/computation reordering algorithms• Let compiler generate the transformed the code automatically
May 17, 2016 Cheng Wang ([email protected]) 22 / 22