Software Pipelining on Multi-Core Architectures
Alban Douillet, Guang R. Gao
CISC 879 : Software Support for Multicore Architectures
Tom St. John
Dept of Electrical and Computer Engineering
University of Delaware
Outline
• Introduction
• Single-Dimension Software Pipelining
• Multi-Threaded Software Pipelining
• Experiments
Introduction
• Software pipelining is among the most successful optimizations
• Can it be applied to multi-core chips?
• What extensions are required?
Single-Dimension SWP
• Does not simply pipeline innermost loop
- Pipelines most profitable loop level
• Loop levels enclosing selected loop left to global scheduler
- Selected loop seen as outermost loop
- Inner loops executed sequentially
• Able to take advantage of ILP/data locality properties present in other loops
SSP Example
[Figures: SSP example, slides 5-7]
Multi-Threaded SWP
• Several Obstacles Exist
• Dependences/resource constraints must be respected
• Operation cannot be scheduled before all dependences are satisfied
• Memory dependences may exist between thread units
- Synchronization is required
Multi-Threaded Final Schedule
• Schedule each group of Sn iterations on a thread unit using a round-robin approach
- Workload balance is fair
• Sn is max number of iterations that can be executed in parallel without resource conflict
• Thread units may share same instruction cache
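The round-robin assignment above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `round_robin_groups` and its parameters are hypothetical, and the slides do not say how a trailing partial group is handled, so the sketch simply keeps it as a shorter group.

```python
def round_robin_groups(num_iterations, sn, num_thread_units):
    """Map each group of `sn` consecutive outermost iterations to a thread unit.

    Hypothetical helper: group g is given to thread unit g mod P.
    """
    if sn <= 0 or num_thread_units <= 0:
        raise ValueError("sn and num_thread_units must be positive")
    assignment = {tu: [] for tu in range(num_thread_units)}
    for group_index, start in enumerate(range(0, num_iterations, sn)):
        # A trailing partial group is kept as-is (assumption, not from the slides).
        group = list(range(start, min(start + sn, num_iterations)))
        assignment[group_index % num_thread_units].append(group)
    return assignment


schedule = round_robin_groups(num_iterations=10, sn=2, num_thread_units=3)
for tu, groups in schedule.items():
    print(f"thread unit {tu}: {groups}")
```

With 10 iterations, Sn = 2 and 3 thread units, thread units 0 and 1 each receive two groups and thread unit 2 receives one, so no unit carries more than one extra group.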
Final Schedule Example
[Figures: final schedule example, slides 10-11]
Data Dependencies
• Data dependencies may exist between outermost iterations
• Synchronization points are chosen to minimize code duplication during code generation
- WAIT is placed before each repeating pattern
- SIGNAL is placed after each pattern
• Synchronization delay guarantees the correctness of the schedule
Synchronization Delay Example
[Figures: synchronization delay example, slides 13-15]
Synchronization
• Each thread has two counters
- Synchronization counter counts number of synchronization signals received
- Clock counter incremented after each WAIT
• When a thread reaches a WAIT, execution continues only if the synchronization counter is greater than or equal to the clock counter
• WAIT implemented with an active loop
• SIGNAL is a non-blocking atomic add-in-memory instruction
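The two-counter protocol can be simulated with ordinary Python threads. This is a rough sketch, not the Cyclops64 implementation: the class `ThreadUnit` and the function `run_handshake` are hypothetical names, a lock stands in for the atomic add-in-memory instruction, and the clock counter is assumed to start at 1 so that the k-th WAIT needs k signals (the slides do not give the initial values).

```python
import threading
import time


class ThreadUnit:
    """Holds the two per-thread counters (hypothetical simulation class)."""

    def __init__(self):
        self.sync_counter = 0    # number of SIGNALs received so far
        self.clock_counter = 1   # assumption: the k-th WAIT needs k signals
        self._lock = threading.Lock()  # stands in for the atomic add

    def signal(self):
        # Non-blocking: bump the receiver's counter and return immediately.
        with self._lock:
            self.sync_counter += 1

    def wait(self):
        # Active loop: spin until enough signals have arrived.
        while True:
            with self._lock:
                if self.sync_counter >= self.clock_counter:
                    break
            time.sleep(0)  # yield so the simulated spin does not starve the sender
        self.clock_counter += 1  # incremented after each WAIT


def run_handshake(patterns=5):
    memory = [None] * patterns   # memory shared by the two thread units
    seen = []
    consumer_unit = ThreadUnit()

    def producer():
        for k in range(patterns):
            memory[k] = k * k        # memory write inside the pattern
            consumer_unit.signal()   # SIGNAL placed after the pattern

    def consumer():
        for k in range(patterns):
            consumer_unit.wait()     # WAIT placed before the pattern
            seen.append(memory[k])   # the producer's write is already visible

    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return seen


print(run_handshake())
```

The consumer never reads a slot before the producer has written it, because its k-th WAIT only returns once k signals have been counted. The run prints `[0, 1, 4, 9, 16]`.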
Innermost Loop Tiling
• Allows for coarser-grain synchronization
• Execution of Nn - 1 instances of the innermost loop pattern is tiled into tiles of G iterations
• WAIT and SIGNAL are issued at the entrance and exit of each tile
• Gmin, the value of G that minimizes the final schedule length, can be approximated at compile time
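To make the tiling concrete, the sketch below lists the order of events for one thread unit when the Nn - 1 instances of the innermost pattern are grouped into tiles of G. The function names `tiled_events` and `sync_pairs` are hypothetical, and the sketch does not attempt to compute Gmin, since the slides do not give its formula.

```python
def tiled_events(nn, g):
    """Event order for Nn - 1 innermost-pattern instances tiled by g.

    Hypothetical helper: WAIT at the entrance and SIGNAL at the exit of each tile.
    """
    if nn < 1 or g < 1:
        raise ValueError("nn and g must be at least 1")
    instances = nn - 1
    events = []
    for start in range(0, instances, g):
        end = min(start + g, instances)  # the last tile may be shorter
        events.append("WAIT")
        events.extend(f"pattern[{i}]" for i in range(start, end))
        events.append("SIGNAL")
    return events


def sync_pairs(nn, g):
    """Number of WAIT/SIGNAL pairs issued: ceil((nn - 1) / g)."""
    return -(-(nn - 1) // g)


print(tiled_events(nn=7, g=3))
print(sync_pairs(7, 1), sync_pairs(7, 3))
```

For Nn = 7, a tile size of 3 needs 2 WAIT/SIGNAL pairs instead of the 6 needed with G = 1, which is the coarser-grain synchronization the slide refers to.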
Cross-Iteration Register Dependences
• Assume thread units do not share registers
• Insert copy operations to copy value from one thread unit to next
• Register dependence transformed into memory dependence
• Issue memory spill instruction to copy from register to scratch-pad memory of destination thread
• Value restored using local memory load
Cross-Iteration Register Dependences
• Memory spill instructions only need to be issued by the last iteration of an iteration group
• Memory restore instructions only need to be issued by the first iteration
• If distance of dependence is greater than 1, cascading copies and memory spills/restores will bring value to target iteration
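The spill/restore scheme for a distance-1 register dependence can be illustrated with a small sequential simulation of a running sum. Everything named here is an assumption made for illustration: the function `run_with_spills`, the one-slot-per-thread-unit scratch-pad layout, and the round-robin mapping of groups to thread units. The simulation runs the groups in order, so it leaves out the WAIT/SIGNAL pair that a real schedule would need around the resulting memory dependence.

```python
def run_with_spills(values, sn, num_thread_units):
    """Sum `values` with the carried value passed between thread units.

    Hypothetical simulation: the carried value lives in a register inside a
    group and crosses thread units only through scratch-pad memory.
    """
    scratchpad = [0] * num_thread_units  # one slot per thread unit (assumed layout)
    spills = restores = 0
    num_groups = -(-len(values) // sn)   # ceil(len(values) / sn)
    result = 0
    for g in range(num_groups):
        tu = g % num_thread_units        # thread unit running this group
        group = values[g * sn:(g + 1) * sn]
        reg = 0 if g == 0 else None      # registers are not shared across units
        for pos, v in enumerate(group):
            if pos == 0 and g > 0:
                reg = scratchpad[tu]     # restore: local memory load, first iteration only
                restores += 1
            reg = reg + v                # the loop body's register update
            if pos == len(group) - 1:
                if g < num_groups - 1:
                    dest = (g + 1) % num_thread_units
                    scratchpad[dest] = reg   # spill into the destination's scratch-pad
                    spills += 1
                else:
                    result = reg
    return result, spills, restores


print(run_with_spills(list(range(1, 9)), sn=2, num_thread_units=3))
```

With 8 iterations in groups of 2, only 3 spills and 3 restores are issued, one per group boundary, rather than one per iteration.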
Cross-Iteration Dependence Example
[Figure: cross-iteration dependence example, slide 20]
Correctness & Properties
• The multi-core final schedule represented by the schedule function is correct
• The multi-threaded final schedule is deadlock-free
• The synchronization signal guarantees that the memory accesses preceding it on the same thread unit have been committed
Experimental Framework
• The MTS method has been implemented on the Open64 compiler retargeted for the IBM Cyclops64 architecture
• Loop nests from the Livermore Suite, SPEC2000 and NAS were used
Execution Time Speedup
• MTS schedules showed very good scalability, with relative speedup between 57.5 and 81 for 99 threads
Loop Tiling Factor
• Timing results using tiling factor Gmin match results using best empirical tiling factor