WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.
![Page 1: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/1.jpg)
WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS
ODES-9
![Page 2: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/2.jpg)
ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES
INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)
64-bit datapath
64-bit addressing and high precision computing
[Figure: a 64-bit adder with 64-bit operands]
![Page 3: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/3.jpg)
[Figure: the 64-bit datapath shown as four 16-bit adders, with 64-bit operands]
![Page 4: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/4.jpg)
16-bit integer datapath
64-bit addressing and high precision computing
40% of computations need only a 16-bit datapath
Caveat: a 64-bit computation becomes 8 * 16-bit computations (dynamic binary translation, DBT)
[Figure: 16-bit adders]
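As a rough illustration of the caveat, the sketch below adds two 64-bit values using only 16-bit chunk additions with carry propagation. The function name is hypothetical, and the sketch counts one chunk addition per 16-bit slice (four in total); the slide's figure of 8 is not broken down in the transcript, so this code does not try to reproduce that count.

```python
MASK16 = 0xFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF


def add64_with_16bit_adder(a: int, b: int) -> tuple[int, int]:
    """Add two 64-bit values using only 16-bit wide additions.

    Returns (result, number_of_chunk_additions). Hypothetical helper,
    not taken from the presented paper.
    """
    result = 0
    carry = 0
    chunk_adds = 0
    for chunk in range(4):  # four 16-bit slices, least significant first
        shift = 16 * chunk
        x = (a >> shift) & MASK16
        y = (b >> shift) & MASK16
        s = x + y + carry          # one pass through the 16-bit adder
        chunk_adds += 1
        carry = s >> 16            # carry into the next slice
        result |= (s & MASK16) << shift
    return result & MASK64, chunk_adds
```

For example, `add64_with_16bit_adder(0x1, 0x25)` goes through all four slices even though only the lowest one carries any information.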
![Page 5: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/5.jpg)
What does non-productive mean?
  0x0000 0000 0000 0001
+ 0x0000 0000 0000 0025
= 0x0000 0000 0000 0026
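One plausible reading of this example, stated here as an assumption since the transcript gives no definition, is that operations on the all-zero upper 16-bit slices contribute nothing to the result. The sketch below (hypothetical helper names) lists which 16-bit slices of an addition can actually affect the sum.

```python
def chunks16(value: int) -> list[int]:
    """Split a 64-bit value into four 16-bit slices, least significant first."""
    return [(value >> (16 * i)) & 0xFFFF for i in range(4)]


def productive_chunk_ops(a: int, b: int) -> list[int]:
    """Indices of the 16-bit slice additions that can change the result.

    A slice is treated as productive if either operand slice is non-zero
    or a carry arrives from the slice below (an assumption of this sketch).
    """
    needed = []
    carry = 0
    for i, (x, y) in enumerate(zip(chunks16(a), chunks16(b))):
        if x or y or carry:
            needed.append(i)
        carry = (x + y + carry) >> 16
    return needed
```

For the operands on the slide, `productive_chunk_ops(0x1, 0x25)` returns `[0]`: under this reading, three of the four slice operations do no useful work.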
![Page 8: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/8.jpg)
Contributions and conclusions
1. Narrow ISA offers more opportunities to remove non-productive memory operations
2. 50% of dynamic narrow operations are non-productive
3. Memory Productiveness Pruning: profile-guided, dynamic optimization
![Page 9: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/9.jpg)
ENERGY EFFICIENT CODE GENERATION FOR PROCESSORS WITH EXPOSED DATAPATH
DONGRUI SHE, YIFAN HE, BART MESMAN, HENK CORPORAAL (TUE)
Exposed datapath: software controls every movement in the data path
Example: transport-triggered architecture (Henk Corporaal)
Register file access reduction
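To make the idea concrete, here is a minimal sketch that counts register file accesses for `(r1 + r2) * r0` written as explicit data moves, once with the intermediate result stored in a register and once forwarded directly between functional units. The port names (`add.a`, `mul.out`, and so on) and the move format are assumptions for illustration, not the instruction set used in the paper.

```python
def rf_accesses(moves: list[tuple[str, str]]) -> tuple[int, int]:
    """Count (reads, writes) of the register file in a list of (src, dst) moves.

    Names starting with "r" are registers; everything else is a
    functional-unit port (hypothetical naming).
    """
    reads = sum(1 for src, _ in moves if src.startswith("r"))
    writes = sum(1 for _, dst in moves if dst.startswith("r"))
    return reads, writes


# Intermediate result goes through the register file (r3).
via_register_file = [
    ("r1", "add.a"), ("r2", "add.b"), ("add.out", "r3"),
    ("r3", "mul.a"), ("r0", "mul.b"), ("mul.out", "r4"),
]

# Intermediate result is moved straight from the adder to the multiplier.
with_bypass = [
    ("r1", "add.a"), ("r2", "add.b"), ("add.out", "mul.a"),
    ("r0", "mul.b"), ("mul.out", "r4"),
]
```

In this toy program the bypassed version saves one register file read and one write, which is the kind of access that code generation for an exposed datapath can try to remove.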
![Page 10: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/10.jpg)
REGISTER REUSE SCHEDULING
GERGÖ BARANY
Objective: Minimize spill code by attempting to find an instruction schedule that allows for the least expensive register allocation
Motivation: Spill code generated by the compiler has a crucial effect on program performance
Method: Implicitly enforce instruction scheduling decisions by adding extra arcs to the data dependence graph (DDG)
Results: 8.9% less spilling, 3.4% smaller static spill costs
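The method can be sketched on a toy basic block. Below, a deliberately simple scheduler (lowest instruction index first among ready instructions, an assumption of this sketch and not the scheduler from the paper) is run on a DDG with and without one extra arc from the last use of value `a` to the definition of value `b`. The extra arc forces the two live ranges apart, so both values can share one register.

```python
import heapq


def schedule(n: int, arcs: list[tuple[int, int]]) -> list[int]:
    """Topological order of instructions 0..n-1, lowest index first among ready ones."""
    succs = {i: [] for i in range(n)}
    indeg = [0] * n
    for u, v in arcs:
        succs[u].append(v)
        indeg[v] += 1
    ready = [i for i in range(n) if indeg[i] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        u = heapq.heappop(ready)
        order.append(u)
        for v in succs[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                heapq.heappush(ready, v)
    return order


def max_pressure(order: list[int], defs: dict[str, int], last_use: dict[str, int]) -> int:
    """Peak number of values live at once; a value is live from its def up to its last use."""
    pos = {ins: t for t, ins in enumerate(order)}
    return max(
        sum(1 for v in defs if pos[defs[v]] <= t < pos[last_use[v]])
        for t in range(len(order))
    )


# i0: a = load p; i1: b = load q; i2: store a; i3: store b
ddg = [(0, 2), (1, 3)]
defs = {"a": 0, "b": 1}
last_use = {"a": 2, "b": 3}

plain = schedule(4, ddg)
reuse = schedule(4, ddg + [(2, 1)])  # extra arc: last use of a before def of b
```

With the extra arc the schedule becomes `[0, 2, 1, 3]` and the register pressure drops from 2 to 1.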
![Page 11: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/11.jpg)
Register Allocation and spilling
[Figure: virtual registers, physical registers, memory]
![Page 12: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/12.jpg)
Register Allocation with reuse candidates
[Figure: a basic block with its interference graph and data dependence graph; legend: definitely overlap, definitely NO overlap, possible overlap]
![Page 14: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/14.jpg)
DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING
MOUNIRA BACHIR, SID-AHMED-ALI TOUATI, ALBERT COHEN
Objective: Minimize the unrolling factor resulting from periodic register allocation of a software-pipelined loop, without altering the initiation interval (II)
Motivation: Code size is tied to memory requirements and I-cache performance
Method: Strategically insert move operations without increasing II to split meeting graph components into smaller ones
Results: “Good” if there are enough functional units to perform the additional move operations and the execution time stays acceptable
![Page 15: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/15.jpg)
Periodic Register Allocation
• Rotating Register File
![Page 16: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/16.jpg)
Periodic Register Allocation
• Rotating Register File
• Move operations
d-1 MOVs per iteration
d: iteration span of variables
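The move-based option can be sketched as follows. A value defined in one iteration and last used d-1 iterations later is kept alive by shifting it through d registers, which costs d-1 register moves per iteration. The loop body (`i * i` plus the value from d-1 iterations earlier) is a made-up example chosen only to make the shifting visible.

```python
def run_with_moves(n: int, d: int) -> tuple[list[int], int]:
    """Run n iterations of a loop whose value spans d iterations.

    regs[k] holds the value defined k iterations ago. Returns the
    outputs and the total number of register moves executed.
    """
    regs = [0] * d
    moves = 0
    out = []
    for i in range(n):
        # Shift older values along the chain: d-1 MOVs per iteration.
        for k in range(d - 1, 0, -1):
            regs[k] = regs[k - 1]
            moves += 1
        regs[0] = i * i                      # value defined in this iteration
        out.append(regs[0] + regs[d - 1])    # last use, d-1 iterations after the definition
    return out, moves
```

The code size stays at one loop body, but every iteration pays for the extra moves, which is the trade-off against unrolling.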
![Page 17: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/17.jpg)
Periodic Register Allocation
• Rotating Register File
• Move operations
• Loop unrolling
3 * code size
![Page 18: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/18.jpg)
Periodic Register Allocation
• Rotating Register File
• Move operations
• Loop unrolling
• Modulo Variable Expansion
a[i] b[i] c[i] a[i+1] b[i+1] c[i+1] a[i+2] b[i+2] c[i+2]
using 9 registers instead of 8
MAXLIVE = 8
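The gap between 9 registers and MAXLIVE = 8 can be reproduced with a small calculation. The lifetimes and the initiation interval below (three variables of length 8, II = 3) are hypothetical numbers chosen to match the figures on the slide; the transcript does not give the actual values. The sketch uses one common formulation of modulo variable expansion: each variable gets ceil(lifetime / II) registers and the kernel is unrolled by the least common multiple of those counts.

```python
from math import ceil, lcm


def maxlive(lifetimes: dict[str, tuple[int, int]], ii: int) -> int:
    """Peak number of simultaneously live values in the steady state.

    lifetimes maps a variable to (start_cycle, length); the value is live
    during cycles start .. start + length - 1.
    """
    per_slot = [0] * ii
    for start, length in lifetimes.values():
        for t in range(start, start + length):
            per_slot[t % ii] += 1
    return max(per_slot)


def mve_plan(lifetimes: dict[str, tuple[int, int]], ii: int) -> tuple[int, int]:
    """Return (registers, unroll_factor) under modulo variable expansion."""
    per_var = [ceil(length / ii) for _, length in lifetimes.values()]
    return sum(per_var), lcm(*per_var)


# Hypothetical lifetimes: (start cycle, length in cycles), with II = 3.
example = {"a": (0, 8), "b": (1, 8), "c": (2, 8)}
```

With these numbers `maxlive(example, 3)` is 8, while `mve_plan(example, 3)` needs 9 registers at an unroll factor of 3.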
![Page 19: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/19.jpg)
Periodic Register Allocation
• Rotating Register File
• Move operations
• Loop unrolling
• Modulo Variable Expansion
• Meeting Graph
lifetime in cycles
lifetime interval of c ends when interval of b begins
![Page 20: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/20.jpg)
Meeting Graph
a[i] b[i] c[i] a[i+1] b[i+1] c[i+1] a[i+2] b[i+2] c[i+2] a[i+3] b[i+3] c[i+3] a[i+4] b[i+4] c[i+4] a[i+5] b[i+5] c[i+5] a[i+6] b[i+6] c[i+6] a[i+7] b[i+7] c[i+7]
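A minimal sketch of the meeting graph idea follows, under two loud assumptions: the lifetimes (three variables of length 8 with II = 3) are hypothetical values chosen to be consistent with the slides, and every variable is assumed to have exactly one successor whose interval begins where its own ends, so the graph is a set of disjoint circuits. In the usual formulation, a circuit needs (sum of its lifetimes) / II registers, and the unroll factor is the least common multiple of the circuits' register counts.

```python
from math import lcm


def meeting_graph_circuits(lifetimes: dict[str, tuple[int, int]], ii: int) -> list[list[str]]:
    """Circuits of the meeting graph; u -> v when u ends (mod II) where v begins."""
    nxt = {}
    for u, (su, lu) in lifetimes.items():
        for v, (sv, _) in lifetimes.items():
            if (su + lu) % ii == sv % ii:
                nxt[u] = v
    # Sketch-only restriction: successors must form a permutation.
    if len(nxt) != len(lifetimes) or set(nxt.values()) != set(lifetimes):
        raise ValueError("this sketch only handles one successor per variable")
    circuits, seen = [], set()
    for start in lifetimes:
        if start in seen:
            continue
        circuit, node = [], start
        while node not in seen:
            seen.add(node)
            circuit.append(node)
            node = nxt[node]
        circuits.append(circuit)
    return circuits


def allocation_plan(lifetimes: dict[str, tuple[int, int]], ii: int) -> tuple[int, int]:
    """Return (registers, unroll_factor) derived from the circuits."""
    weights = [
        sum(lifetimes[v][1] for v in circuit) // ii
        for circuit in meeting_graph_circuits(lifetimes, ii)
    ]
    return sum(weights), lcm(*weights)


# Hypothetical lifetimes: (start cycle, length in cycles), with II = 3.
example = {"a": (0, 8), "b": (1, 8), "c": (2, 8)}
```

Here the single circuit a, c, b yields 8 registers (equal to MAXLIVE) at an unroll factor of 8. Splitting a large circuit into smaller ones lowers the least common multiple, which is the lever the talk's title refers to.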
![Page 21: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/21.jpg)
Circuit Decomposition
![Page 22: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/22.jpg)
2011 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION
Main Conference
![Page 23: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/23.jpg)
MAO – AN EXTENSIBLE MICRO-ARCHITECTURAL OPTIMIZER
ROBERT HUNDT, EASWARAN RAMAN, MARTIN THURESSON, NEIL VACHHARAJANI (GOOGLE)
Micro-architectural: not always documented
Proprietary compilers at advantage!
[Figure: a loop in SPEC2000 int, before and after inserting a NOP]
+ 1 NOP instruction
- 7% execution time
![Page 24: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/24.jpg)
Micro-architectural: not always documented
Example: instruction decoding in Core 2 in chunks of 16 bytes
[Figure: the SPEC2000 int loop relative to 16-byte alignment boundaries, before and after the NOP]
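Why a single NOP can matter is easy to see with a little arithmetic. The sketch below counts how many 16-byte aligned chunks a piece of code touches; the addresses and the loop size in the usage note are invented for illustration, since the transcript does not give the real ones.

```python
def decode_chunks(start: int, size: int, chunk: int = 16) -> int:
    """Number of chunk-byte aligned blocks touched by code at [start, start + size)."""
    first = start // chunk
    last = (start + size - 1) // chunk
    return last - first + 1
```

A hypothetical 16-byte loop starting at address `0x0F` straddles two chunks, while the same loop pushed to `0x10` by one extra byte of padding fits in one: `decode_chunks(0x0F, 16)` is 2 and `decode_chunks(0x10, 16)` is 1.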
![Page 25: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/25.jpg)
Contributions and conclusions
1. Extensible assembly-to-assembly optimizer
2. Does not fit in the GCC flow, because not enough information is preserved after the RTL level
3. Discover micro-architectural details semi-automatically through generation of micro-benchmarks
![Page 26: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/26.jpg)
DYNAMIC REGISTER PROMOTION OF STACK VARIABLES
JIANJUN LI, CHENGGANG WU, WEI-CHUNG HSU
Use DBT to let x86 binaries use the extra registers on x86-64
• recompiling is not always an option (legacy binaries)
• compute-intensive applications gain speed when using 64-bit
Challenge: implicit stack accesses
Solved using page protection and stack switching (with shadow stack)
![Page 27: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/27.jpg)
LANGUAGE AND COMPILER SUPPORT FOR AUTO-TUNING VARIABLE-ACCURACY ALGORITHMS
JASON ANSEL, YEE LOK WONG, CY CHAN, MAREK OLSZEWSKI, ALAN EDELMAN, SAMAN AMARASINGHE (MIT)
PetaBricks: language extensions to expose trade-offs between time and accuracy to the compiler
1. New programming language, toolchain and run-time environment
2. Technique for mapping variable accuracy code to enable efficient auto-tuning
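The time versus accuracy trade-off can be illustrated in plain Python (this is not PetaBricks syntax, and the tuner below is a hypothetical stand-in, not the one from the paper). The tunable parameter is the number of Newton iterations in a square root routine; the tuner picks the cheapest setting whose worst-case relative error on some training inputs meets a target.

```python
import math


def newton_sqrt(x: float, iterations: int) -> float:
    """Approximate sqrt(x) for x > 0; more iterations cost more time and give more accuracy."""
    guess = x if x > 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def tune(samples: list[float], target_error: float, max_iterations: int = 20) -> int:
    """Smallest iteration count whose worst relative error on samples meets the target."""
    for iterations in range(1, max_iterations + 1):
        worst = max(
            abs(newton_sqrt(x, iterations) - math.sqrt(x)) / math.sqrt(x)
            for x in samples
        )
        if worst <= target_error:
            return iterations
    raise ValueError("no configuration meets the accuracy target")
```

A caller that tolerates a relative error of `1e-1` gets a cheaper configuration from `tune` than one that asks for `1e-6`, which is the kind of choice the language extensions aim to hand to the compiler.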
![Page 28: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/28.jpg)
PRACTICAL MEMORY CHECKING WITH DR. MEMORY
DEREK BRUENING (GOOGLE), QIN ZHAO (MIT)
x86
Existing memory checking tools (e.g. Valgrind)
• slow
• many false positives
![Page 29: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/29.jpg)
A TRACE-BASED JAVA JIT COMPILER RETROFITTED FROM A METHOD-BASED COMPILER
HIROSHI INOUE, HIROSHIGE HAYASHIZAKI, PENG WU, TOSHIO NAKATANI (IBM)
Extend the compilation scope from methods to traces
• Traces span multiple method invocations
• More powerful than method inlining
![Page 30: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/30.jpg)
Claim: current trace-JITs are immature
Keep the advanced optimization infrastructure by retrofitting
![Page 31: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/31.jpg)
PHASE-BASED TUNING FOR BETTER UTILIZATION OF PERFORMANCE-ASYMMETRIC MULTICORE PROCESSORS
TYLER SONDAG AND HRIDESH RAJAN
Objective: Design and apply a transparent and fully automatic process called phase-based tuning, which adapts an application to effectively utilize performance-asymmetric multicores
Motivation: Trend towards performance asymmetry among the cores of a single chip
Method: Statically partition the application into code sections that are likely to have similar runtime behavior. The runtime characteristics observed for representative sections are then used to map the whole cluster
Results: 36% average process speedup with negligible overheads
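The method can be sketched in a few lines. Code sections are grouped by a static signature, only one representative per group is measured at run time, and its result decides the core type for the whole group. The signatures (`"mem"`, `"cpu"`), the core names and the `measure` callback are all hypothetical; the paper's actual features and mapping policy are not given in the transcript.

```python
from collections import defaultdict
from typing import Callable


def assign_cores(sections: dict[str, str], measure: Callable[[str], str]) -> dict[str, str]:
    """Map every code section to a core type.

    sections maps a section name to its static signature; measure(name)
    returns the preferred core type observed for that section at run time.
    """
    clusters = defaultdict(list)
    for name, signature in sections.items():
        clusters[signature].append(name)
    assignment = {}
    for members in clusters.values():
        representative = members[0]
        core = measure(representative)  # only the representative is measured
        for name in members:
            assignment[name] = core
    return assignment


# Hypothetical usage: two memory-bound loops and one compute-bound loop.
measured = []


def fake_measure(name: str) -> str:
    measured.append(name)
    return "little" if name == "loop1" else "big"


result = assign_cores({"loop1": "mem", "loop2": "mem", "loop3": "cpu"}, fake_measure)
```

Here `loop2` inherits the decision made for `loop1` without ever being measured, which is where the low overhead would come from.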
![Page 32: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/32.jpg)
Phase-based tuning
![Page 33: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/33.jpg)
VAPOR SIMD: AUTO-VECTORIZE ONCE, RUN EVERYWHERE
DORIT NUZMAN, SERGEI DYSHEL, ERVEN ROHOU, IRA ROZEN, ALBERT COHEN, AYAL ZAKS
Objective: Design a split vectorization framework and study how it compares to a monolithic one
Motivation: JIT compiler technology offers portability while facilitating target- and context-specific specialization; SIMD hardware is ubiquitous and diverse
Method: Mix-and-match existing open compilation tools, namely GCC and MONO
Results: Comparable to specialized monolithic offline compilers
![Page 34: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/34.jpg)
Vectorizing for different platforms
![Page 35: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/35.jpg)
Split vectorization scheme
![Page 36: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/36.jpg)
Interoperable compilation flows
![Page 37: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/37.jpg)
![Page 38: WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.](https://reader035.fdocuments.net/reader035/viewer/2022070306/5518a2f2550346c31f8b495e/html5/thumbnails/38.jpg)
This is not a bullet slide.