Towards Exploiting Data Locality for Irregular...

22
Towards Exploiting Data Locality for Irregular Applications on Shared-Memory Multicore Architectures Cheng Wang Advisor: Barbara Chapman HPCTools Group, Department of Computer Science, University of Houston, Houston, TX, 77004, USA May 17, 2016

Transcript of Towards Exploiting Data Locality for Irregular...

Page 1: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Towards Exploiting Data Locality for IrregularApplications on Shared-Memory Multicore

Architectures

Cheng Wang

Advisor: Barbara Chapman

HPCTools Group,Department of Computer Science,

University of Houston, Houston, TX, 77004, USA

May 17, 2016

Page 2: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Outline

1 What are irregular applications?

2 Sparse FFT - A case study of irregular applications

3 A padding algorithm to improve the data locality

4 Conclusion & Future work

May 17, 2016 Cheng Wang ([email protected]) 2 / 22

Page 3: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

The Reality of Parallel Computing ...

1Slide based on a post from https://highscalability.com

May 17, 2016 Cheng Wang ([email protected]) 3 / 22

Page 4: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Why CPU Caching Matters?

year

Perfo

rmance

1980 2004 2014

CPU

Memory

Processor-Memory

Performance Gap

Memory hierarchy Processor-memory

performance gap

• Memory has become the principle performance bottleneck

• Improve the cache utilization is the key to performance optimization

1Source: http://cs.uwec.edu/~buipj/teaching/cs.352.f12/lectures/lecture_08.html

May 17, 2016 Cheng Wang ([email protected]) 4 / 22

Page 5: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Shared-Memory Multicore Architectures

1 Shared memory• On-chip: (Last-level) cache shared by homo/hetero processors• Off-chip: Main memory shared by homo/hetero processors

May 17, 2016 Cheng Wang ([email protected]) 5 / 22

Page 6: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

What are Irregular Applications?

do i = 1 , N. . . = x [ i d x [ i ] ]

end do

1 Indirect array reference pattern

2 Commonly found in linked list, tree and graph-basedapplications

3 Poor data locality

4 Challenge esp. for shared-memory multicore architecture ascores compete for memory bandwidth

May 17, 2016 Cheng Wang ([email protected]) 6 / 22

Page 7: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Approach: Computation/Data Reordering

1 2 3 4

1 2 3 4

Computation

Data

1 2 3 4

3 1 2 4

2 3 1 4

1 2 3 4

Computation

reordering

Data

reordering

May 17, 2016 Cheng Wang ([email protected]) 7 / 22

Page 8: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Challenges in Dynamic Irregularity Removal

1 Dynamic irregularity• Memory access pattern remains unknown until runtime and

may change during computations• Previous work on compile-time transformations can hardly

apply• Need for transformation at runtime

2 Runtime transformation overhead• Transformation overhead is placed on the critical path of the

application’s execution• The benefits of improved data locality must outweigh the cost

of the data layout transformation at runtime

May 17, 2016 Cheng Wang ([email protected]) 8 / 22

Page 9: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Outline

1 What are irregular applications?

2 Sparse FFT - A case study of irregular applications

3 A padding algorithm to improve the data locality

4 Conclusion & Future work

Page 10: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Sparse FFT

FFT

1 A novel compressive sensing algorithm with massive applicationdomains

2 Fourier transform is dominated by a small number of “peaks”• FFT(O(nlogn)) is inefficient

3 Compute the k-sparse Fourier transform in lower time complexity• k-sparse: no. of “large” coordinates at freq. domain

May 17, 2016 Cheng Wang ([email protected]) 10 / 22

Page 11: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Sparse Data is Ubiquitous ...

1Slide based on http://groups.csail.mit.edu/netmit/sFFT/

May 17, 2016 Cheng Wang ([email protected]) 11 / 22

Page 12: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Irregular Memory Access Pattern in Sparse FFT

n coordinates

B buckets

Irregular data

reference pattern

buckets[i % B] += signal[idx] * filter[i]

• Randomly permutes the signalspectrum and bins into a smallnumber of buckets

• Irregular memory access pattern

May 17, 2016 Cheng Wang ([email protected]) 12 / 22

Page 13: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Parallel Sparse FFT

1 Modern architectures are exclusively based on multicore andmanycore processors

• e.g., Multicore CPUs, GPUs, Intel Xeon Phi, etc.• Nature path to improve the performance of sFFT through

efficient parallel algorithm design and impl.

2 Standard full-size FFT has been well studied and implemented• FFTW, cuFFT, Intel MLK, etc..• Highly optimized for specific architectures

3 We are the first such effort of high-performance parallel sFFTimplementation

1cusFFT: A High-Performance Sparse Fast Fourier Transform Algorithm on GPUs,C. Wang, S. Chandrasekaran, and B. Chapman, in Proceedings of 30th IEEEInternational Parallel & Distributed Processing Symposium (IPDPS 2016). [ToAppear]

May 17, 2016 Cheng Wang ([email protected]) 13 / 22

Page 14: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Exec. Time: cusFFT vs. sFFT vs. cuFFT (k = 1000)

0.001

0.01

0.1

1

10

19 20 21 22 23 24 25 26 27

Exe

cu

tio

n T

ime

(se

c)

Signal Size (2n)

GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)

sFFT (MIT)PsFFT (6 threads)

cusFFTcuFFT

• cuFFT: full-size FFT library on Nvidia GPUs• The MIT seq. sFFT is slower than cuFFT• cusFFT is 5x faster than PsFFT, 25x vs. the seq. sFFT• cusFFT is up to 12x faster than cuFFT

May 17, 2016 Cheng Wang ([email protected]) 14 / 22

Page 15: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Exec. Time cusFFT vs. sFFT vs. cuFFT (n = 225)

0.01

0.1

1

10

5000 10000 15000 20000 25000 30000 35000 40000

Execution T

ime (

sec)

Signal Sparsity k

GPU: NVIDIA Tesla K20x. CPU: Intel Xeon E5-2640 (Sandy Bridge)

sFFT (MIT)PsFFT (6 threads)

cusFFTcuFFT

• The seq. sFFT is slower than cuFFT

• PsFFT is faster than cuFFT until k = 3000

• cusFFT is faster than cuFFT until k = 41, 000

May 17, 2016 Cheng Wang ([email protected]) 15 / 22

Page 16: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Outline

1 What are irregular applications?

2 Sparse FFT - A case study of irregular applications

3 A padding algorithm to improve the data locality

4 Conclusion & Future work

Page 17: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Rethink the Consecutive Packing (CPACK) Algorithm

CPACK: A greedy algorithm which packs data intoconsecutive locations in the order they are first accessed bythe computation

23 67 103

1 2 3 4

Computation

Data

9 ... ... ...

CPACK

9 23 67 103

Data

6 cache miss

5 6 7 1 2 3 4

Computation

5 6 7

Data access order: 9, 23, 103, 23, 67, 23, 677 cache misses

Original Data reordered by CPACK

miss

hit

• First-touch policy packs (9,23) together• Not optimal

May 17, 2016 Cheng Wang ([email protected]) 17 / 22

Page 18: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Rethink the Consecutive Packing (CPACK) Algorithm

Affinity-conscious data reordering ...

23 67 103

1 2 3 4

Computation

Data

9 ... ... ...

data

reordering

23 679 103

Data

4 cache miss

5 6 7 1 2 3 4

Computation

5 6 7

Data access order: 9, 23, 103, 23, 67, 23, 677 cache misses

Original An Optimal data layout

miss

hit

• CPACK does not consider data affinity (i.e., how close the nearbydata elements are accessed together)

• Packs (23,67) rather than (9,23) should yield better locality

May 17, 2016 Cheng Wang ([email protected]) 18 / 22

Page 19: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Data Reordering and NP-completeness

1 Finding an optimal data layout is a NP-complete problem1

2 No “best” data reordering algorithm that works in general

3 Implicit constraint: Each data entry has only one copy inthe transformed format

4 The complexity can be significantly reduced if more space isallowed to use

1E. Petrank and D. Rawitz. 2002. The hardness of cache conscious data placement, POPL ’02)

May 17, 2016 Cheng Wang ([email protected]) 19 / 22

Page 20: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

A Padding Algorithm that Circumvents the Complexity

CPACKE Algorithm: Extends the CPACK by creating duplicatedcopies of each repeatedly accessed data entry

23 67 103

1 2 3 4

Computation

Data

9 ... ... ...

data

reordering

23 679 103

Data

4 cache miss

5 6 7 1 2 3 4

Computation

5 6 7

Data access order: 9, 23, 103, 23, 67, 23, 677 cache misses

Original Padding algorithm

miss

hit

23 23 67

• Advantage: Better locality than CPACK

• Disadvantage: Slight space overhead

May 17, 2016 Cheng Wang ([email protected]) 20 / 22

Page 21: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Performance Evaluation

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

19 20 21 22 23 24 25 26 27 28

Execution tim

e (

sec)

Signal size (2n)

Intel Xeon E5-2670 (Sandy Bridge)

PsFFT (before trans)PsFFF (after trans)

• Applies the CPACKE to the perm+filter stage in sFFT• Improves the performance by 30% for the irregular kernel• Improves the overall performance of PsFFT by 20%

May 17, 2016 Cheng Wang ([email protected]) 21 / 22

Page 22: Towards Exploiting Data Locality for Irregular ...credit.pvamu.edu/MCBDA2016/Slides/Day2_ChengWang.pdf · May 17, 2016 Cheng Wang (cwang35@uh.edu) 18 / 22. Data Reordering and NP-completeness

Conclusion & Future Work

1 A padding-based algorithm improving the data locality ofirregular applications

2 Improves the performance of sFFT by 30%

3 Future work• Evaluate with more irregular applications• Evaluate with other data/computation reordering algorithms• Let compiler generate the transformed the code automatically

May 17, 2016 Cheng Wang ([email protected]) 22 / 22