Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture
synergy.cs.vt.edu
Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture
Student: Carlo C. del Mundo*, Virginia Tech (Undergrad)
Advisor: Dr. Wu-chun Feng*§, Virginia Tech
* Department of Electrical and Computer Engineering, § Department of Computer Science, Virginia Tech
Forecast: Hardware-Software Co-Design
• Software: Transpose
• Hardware: NVIDIA Kepler K20c and its shuffle mechanism
Q: What is shuffle?
Cheaper data movement
• Faster than shared memory
• Only in NVIDIA Tesla Kepler GPUs
• Limited to a warp
>>> Idea: reduce data communication between threads <<<
Q: What are you solving?
• Enable efficient data communication
  – Shared Memory (the “old” way)
  – Shuffle (the “new” way)
Approach
• Evaluate shuffle using matrix transpose
  – Matrix transpose is a data communication step in the FFT
• Devised a shuffle transpose algorithm
  – Consists of horizontal (inter-thread shuffle) and vertical (intra-thread) stages
Analysis
• Bottleneck: intra-thread data movement (Stage 2: Vertical)

[Figure: register file across threads t0–t3]

Code 1 (NAIVE):
for (int k = 0; k < 4; ++k)
    dst_registers[k] = src_registers[(4 - tid + k) % 4];

[Chart annotation: 15x]
Analysis (continued)

General strategies:
• Registers are fast.
• CUDA local memory is slow.
  – The compiler is forced to place data into CUDA local memory if array indices cannot be determined at compile time.

Code 1 (NAIVE):
for (int k = 0; k < 4; ++k)
    dst_registers[k] = src_registers[(4 - tid + k) % 4];

Code 2 (DIV):
int tmp = src_registers[0];
if (tid == 1) {            // divergence
    src_registers[0] = src_registers[3];
    src_registers[3] = src_registers[2];
    src_registers[2] = src_registers[1];
    src_registers[1] = tmp;
} else if (tid == 2) {     // divergence
    src_registers[0] = src_registers[2];
    src_registers[2] = tmp;
    tmp = src_registers[1];
    src_registers[1] = src_registers[3];
    src_registers[3] = tmp;
} else if (tid == 3) {     // divergence
    src_registers[0] = src_registers[1];
    src_registers[1] = src_registers[2];
    src_registers[2] = src_registers[3];
    src_registers[3] = tmp;
}

Code 3 (SELP OOP):
dst_registers[0] = (tid == 0) ? src_registers[0] : dst_registers[0];
dst_registers[1] = (tid == 0) ? src_registers[1] : dst_registers[1];
dst_registers[2] = (tid == 0) ? src_registers[2] : dst_registers[2];
dst_registers[3] = (tid == 0) ? src_registers[3] : dst_registers[3];

dst_registers[0] = (tid == 1) ? src_registers[3] : dst_registers[0];
dst_registers[3] = (tid == 1) ? src_registers[2] : dst_registers[3];
dst_registers[2] = (tid == 1) ? src_registers[1] : dst_registers[2];
dst_registers[1] = (tid == 1) ? src_registers[0] : dst_registers[1];

dst_registers[0] = (tid == 2) ? src_registers[2] : dst_registers[0];
dst_registers[2] = (tid == 2) ? src_registers[0] : dst_registers[2];
dst_registers[1] = (tid == 2) ? src_registers[3] : dst_registers[1];
dst_registers[3] = (tid == 2) ? src_registers[1] : dst_registers[3];

dst_registers[0] = (tid == 3) ? src_registers[1] : dst_registers[0];
dst_registers[1] = (tid == 3) ? src_registers[2] : dst_registers[1];
dst_registers[2] = (tid == 3) ? src_registers[3] : dst_registers[2];
dst_registers[3] = (tid == 3) ? src_registers[0] : dst_registers[3];

[Chart annotations: NAIVE 15x; DIV 6%; SELP OOP 44%]
Results

[Figure: performance results]
Conclusion
• Overall Performance
  – Max. Speedup (Amdahl’s Law): 1.19-fold
  – Achieved Speedup: 1.17-fold
• Surprise Result
  – Goal: accelerate communication (the “gray bar”)
  – Result: also accelerated the computation (the “black bar”)
Thank You!
• Enabling Efficient Intra-Warp Communication for Fourier Transforms in a Many-Core Architecture
  – Student: Carlo del Mundo, Virginia Tech (undergrad)
  – Overall Performance
    • Theoretical Speedup: 1.19-fold
    • Achieved Speedup: 1.17-fold
Appendix
Motivation
• Goal
  – Accelerate an application using hardware-specific mechanisms (i.e., the hardware-software co-design process)
• Case Study
  – Application: matrix transpose as part of a 256-pt FFT
  – Architecture: NVIDIA Kepler K20c
  – Approach: use shuffle to accelerate communication
• Results
  – Max. Theoretical Speedup: 1.19-fold
  – Achieved Speedup: 1.17-fold
Background: The New and Old
• Shuffle
  – Idea: communicate data within a warp without shared memory
  – Pros
    • Faster (1 cycle to perform load and store)
    • Eliminates the use of shared memory → higher thread occupancy
  – Cons
    • Poorly understood
    • Only available in Kepler GPUs
    • Limited to 32 threads
• Shared Memory
  – Idea: scratchpad memory to communicate data
  – Pros
    • Easy to program
    • Scales to a block (up to 1536 threads)
  – Cons
    • Prone to bank conflicts