Towards Exploiting Data Locality for Irregular...

Towards Exploiting Data Locality for IrregularApplications on Shared-Memory Multicore

Architectures

Cheng Wang

Advisor: Barbara Chapman

HPCTools Group,Department of Computer Science,

University of Houston, Houston, TX, 77004, USA

May 17, 2016

Outline

1 What are irregular applications?

2 Sparse FFT - A case study of irregular applications

3 A padding algorithm to improve the data locality

4 Conclusion & Future work

May 17, 2016 Cheng Wang ([email protected]) 2 / 22

The Reality of Parallel Computing ...

1Slide based on a post from https://highscalability.com


https://highscalability.com

Why CPU Caching Matters?

year

Perfo

rmance

1980 2004 2014

CPU

Memory

Processor-Memory

Performance Gap

Memory hierarchy Processor-memory

performance gap

• Memory has become the principle performance bottleneck

• Improve the cache utilization is the key to performance optimization

1Source: http://cs.uwec.edu/~buipj/teaching/cs.352.f12/lectures/lecture_08.html


http://cs.uwec.edu/~buipj/teaching/cs.352.f12/lectures/lecture_08.html

Shared-Memory Multicore Architectures

1 Shared memory• On-chip: (Last-level) cache shared by homo/hetero processors• Off-chip: Main memory shared by homo/hetero processors


What are Irregular Applications?

do i = 1 , N. . . = x [ i d x [ i ] ]

end do

1 Indirect array reference pattern

2 Commonly found in linked list, tree and graph-basedapplications

3 Poor data locality

4 Challenge esp. for shared-memory multicore architecture ascores compete for memory bandwidth


Approach: Computation/Data Reordering

1 2 3 4

1 2 3 4

Computation

Data

1 2 3 4

3 1 2 4

2 3 1 4

1 2 3 4

Computation

reordering

Data

reordering


Challenges in Dynamic Irregularity Removal

1 Dynamic irregularity• Memory access pattern remains unknown until runtime and

may change during computations• Previous work on compile-time transformations can hardly

apply• Need for transformation at runtime

2 Runtime transformation overhead• Transformation overhead is placed on the critical path of the

application’s execution• The benefits of improved data locality must outweigh the cost

of the data layout transformation at runtime


Outline





Sparse FFT

FFT

1 A novel compressive sensing algorithm with massive applicationdomains

2 Fourier transform is dominated by a small number of “peaks”• FFT(O(nlogn)) is inefficient

3 Compute the k-sparse Fourier transform in lower time complexity• k-sparse: no. of “large” coordinates at freq. domain


Sparse Data is Ubiquitous ...

1Slide based on http://groups.csail.mit.edu/netmit/sFFT/


http://groups.csail.mit.edu/netmit/sFFT/

Irregular Memory Access Pattern in Sparse FFT

n coordinates

B buckets

Irregular data

reference pattern

buckets[i % B] += signal[idx] * filter[i]

• Randomly permutes the signalspectrum and bins into a smallnumber of buckets

• Irregular memory access pattern


Parallel Sparse FFT

1 Modern architectures are exclusively based on multicore andmanycore processors

• e.g., Multicore CPUs, GPUs, Intel Xeon Phi, etc.• Nature path to improve the performance of sFFT through

efficient parallel algorithm design and impl.

2 Standard full-size FFT has been well studied and implemented• FFTW, cuFFT, Intel MLK, etc..• Highly optimized for specific architectures

3 We are the first such effort of high-performance parallel sFFTimplementation

1cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs,C. Wang, S. Chandrasekaran, and B. Chapman, in Proceedings of 30th IEEEInternational Parallel & Distributed Processing Symposium (IPDPS 2016). [ToAppear]


Exec. Time: cusFFT vs. sFFT vs. cuFFT (k = 1000)

0.001

0.01

0.1

1

10

19 20 21 22 23 24 25 26 27

Exe

cu

tio

n T

ime

(se

c)

Signal Size (2n)

GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)

sFFT (MIT)PsFFT (6 threads)

cusFFTcuFFT

• cuFFT: full-size FFT library on Nvidia GPUs• The MIT seq. sFFT is slower than cuFFT• cusFFT is 5x faster than PsFFT, 25x vs. the seq. sFFT• cusFFT is up to 12x faster than cuFFT


Exec. Time cusFFT vs. sFFT vs. cuFFT (n = 225)

0.01

0.1

1

10

5000 10000 15000 20000 25000 30000 35000 40000

Execution T

ime (

sec)

Signal Sparsity k

GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)

sFFT (MIT)PsFFT (6 threads)

cusFFTcuFFT

• The seq. sFFT is slower than cuFFT

• PsFFT is faster than cuFFT until k = 3000

• cusFFT is faster than cuFFT until k = 41, 000


Outline





Rethink the Consecutive Packing (CPACK) Algorithm

CPACK: A greedy algorithm which packs data intoconsecutive locations in the order they are first accessed bythe computation

23 67 103

1 2 3 4

Computation

Data

9 ... ... ...

CPACK

9 23 67 103

Data

6 cache miss

5 6 7 1 2 3 4

Computation

5 6 7

Data access order: 9, 23, 103, 23, 67, 23, 677 cache misses

Original Data reordered by CPACK

miss

hit

• First-touch policy packs (9,23) together• Not optimal


Rethink the Consecutive Packing (CPACK) Algorithm

Affinity-conscious data reordering ...

23 67 103

1 2 3 4

Computation

Data

9 ... ... ...

data

reordering

23 679 103

Data

4 cache miss

5 6 7 1 2 3 4

Computation

5 6 7


Original An Optimal data layout

miss

hit

• CPACK does not consider data affinity (i.e., how close the nearbydata elements are accessed together)

• Packs (23,67) rather than (9,23) should yield better locality


Data Reordering and NP-completeness

1 Finding an optimal data layout is a NP-complete problem1

2 No “best” data reordering algorithm that works in general

3 Implicit constraint: Each data entry has only one copy inthe transformed format

4 The complexity can be significantly reduced if more space isallowed to use

1E. Petrank and D. Rawitz. 2002. The hardness of cache conscious data placement, POPL ’02)


A Padding Algorithm that Circumvents the Complexity

CPACKE Algorithm: Extends the CPACK by creating duplicatedcopies of each repeatedly accessed data entry

23 67 103

1 2 3 4

Computation

Data

9 ... ... ...

data

reordering

23 679 103

Data

4 cache miss

5 6 7 1 2 3 4

Computation

5 6 7


Original Padding algorithm

miss

hit

23 23 67

• Advantage: Better locality than CPACK

• Disadvantage: Slight space overhead


Performance Evaluation

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

19 20 21 22 23 24 25 26 27 28

Execution tim

e (

sec)

Signal size (2n)

Intel Xeon E5-2670 (Sandy Bridge)

PsFFT (before trans)PsFFF (after trans)

• Applies the CPACKE to the perm+filter stage in sFFT• Improves the performance by 30% for the irregular kernel• Improves the overall performance of PsFFT by 20%


Conclusion & Future Work

1 A padding-based algorithm improving the data locality ofirregular applications

2 Improves the performance of sFFT by 30%

3 Future work• Evaluate with more irregular applications• Evaluate with other data/computation reordering algorithms• Let compiler generate the transformed the code automatically


Towards Exploiting Data Locality for Irregular...

Documents

Transcript of Towards Exploiting Data Locality for Irregular...