WORKSHOP ON OPTIMIZATIONS FOR DSP AND EMBEDDED SYSTEMS ODES-9.


ELIMINATING NON-PRODUCTIVE MEMORY OPERATIONS IN NARROW-BITWIDTH ARCHITECTURES

INDU BHAGAT, ENRIC GIBERT, JESÚS SÁNCHEZ, ANTONIO GONZÁLEZ (UPC)

64-bit datapath

64-bit addressing and high precision computing

[Figure: one 64-bit adder, data paths labeled 64 bit]



64-bit datapath

64-bit addressing and high precision computing

[Figure: four 16-bit adders, data paths labeled 64 bit]



16-bit integer datapath

64-bit addressing and high precision computing

40% of computations need only a 16-bit datapath

Caveat: 64-bit computation becomes 8 * 16-bit computations (DBT)

[Figure: four 16-bit adders]
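The decomposition named in the caveat can be sketched as follows. This is a minimal illustration, assuming a 64-bit addition is split into four 16-bit chunk additions with explicit carry propagation. The names `split16` and `add64_via_16bit` are hypothetical, and the sketch does not try to reproduce the slide's count of 8 narrow computations.

```python
MASK16 = 0xFFFF


def split16(value):
    """Split a 64-bit value into four 16-bit chunks, least significant first."""
    return [(value >> (16 * i)) & MASK16 for i in range(4)]


def add64_via_16bit(a, b):
    """Add two 64-bit values using only 16-bit chunk additions plus a carry."""
    result, carry = 0, 0
    for i, (x, y) in enumerate(zip(split16(a), split16(b))):
        s = x + y + carry                    # at most 17 bits wide
        result |= (s & MASK16) << (16 * i)   # keep the low 16 bits of this chunk
        carry = s >> 16                      # carry into the next chunk
    return result                            # final carry is dropped (wraps at 64 bits)


print(hex(add64_via_16bit(0xFFFF, 0x1)))  # 0x10000
```

Each loop iteration stands for one narrow operation, so wide values cost several narrow steps even when most chunks carry no information.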



What does non-productive mean?

  0x0000 0000 0000 0001
+ 0x0000 0000 0000 0025
= 0x0000 0000 0000 0026
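One way to read this example: only the lowest 16-bit chunk of each value carries information, so on a 16-bit datapath the work spent on the three upper chunks, including the loads and stores that move them, produces nothing new. The sketch below lists the chunk additions that are actually needed. It assumes that a chunk is non-productive when both operand chunks and the incoming carry are zero, which is an interpretation of the slide rather than the authors' definition; `chunks16` and `needed_chunk_adds` are hypothetical helpers.

```python
def chunks16(value):
    """Four 16-bit chunks of a 64-bit value, least significant first."""
    return [(value >> (16 * i)) & 0xFFFF for i in range(4)]


def needed_chunk_adds(a, b):
    """Indices of the 16-bit chunk additions that can change the result."""
    needed, carry = [], 0
    for i, (x, y) in enumerate(zip(chunks16(a), chunks16(b))):
        if x or y or carry:        # assumption: all-zero inputs mean no useful work
            needed.append(i)
        carry = (x + y + carry) >> 16
    return needed


print(needed_chunk_adds(0x1, 0x25))  # [0]
```

For the values on the slide, three of the four chunk operations fall into the non-productive category under this reading.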



Contributions and conclusions

1. Narrow ISA offers more opportunities to remove non-productive memory operations

2. 50% of dynamic narrow operations are non-productive

3. Memory Productiveness Pruning: profile-guided, dynamic optimization

ENERGY EFFICIENT CODE GENERATION FOR PROCESSORS WITH EXPOSED DATAPATH

DONGRUI SHE, YIFAN HE, BART MESMAN, HENK CORPORAAL (TUE)

Exposed datapath: software controls every movement in the data path

Example: transport-triggered architecture (Henk Corporaal)

Register file access reduction

REGISTER REUSE SCHEDULING

GERGÖ BARANY

Objective: Minimize spill code by attempting to find an instruction schedule that allows for the least expensive register allocation

Motivation: Spill code generated by the compiler has a crucial effect on program performance

Method: Implicitly enforce instruction scheduling decisions by adding extra arcs to the data dependence graph (DDG)

Results: 8.9% less spilling, 3.4% smaller static spill costs
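The method of adding extra arcs to the DDG can be illustrated with a small sketch. It assumes that two values may share a register once every use of the first is scheduled before the definition of the second, and it enforces that ordering with extra arcs. The DDG, the instruction names, and the helper `add_reuse_arcs` are hypothetical; this is not the paper's algorithm for choosing which arcs to add.

```python
from graphlib import CycleError, TopologicalSorter  # Python 3.9+


def add_reuse_arcs(ddg, uses_of_first, def_of_second):
    """Return a copy of the DDG in which all uses of the first value must be
    scheduled before the definition of the second value.

    ddg maps each instruction to the set of instructions it depends on.
    """
    constrained = {node: set(preds) for node, preds in ddg.items()}
    for use in uses_of_first:
        if use != def_of_second:          # an instruction cannot precede itself
            constrained[def_of_second].add(use)
    return constrained


# Hypothetical basic block: two independent load/use pairs.
ddg = {
    "load_a": set(),
    "use_a": {"load_a"},
    "load_b": set(),
    "use_b": {"load_b"},
}

constrained = add_reuse_arcs(ddg, ["use_a"], "load_b")
try:
    schedule = list(TopologicalSorter(constrained).static_order())
    print(schedule)
except CycleError:
    print("extra arcs would make the DDG cyclic; reuse is not possible")
```

Any schedule that respects the constrained graph ends the lifetime of the first value before the second begins, so the scheduler does not need to know about register reuse explicitly.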

Register Allocation and spilling


[Figure: virtual registers, physical registers, memory]

Register Allocation with reuse candidates


[Figure: basic block, data dependence graph, and interference graph; legend: definitely overlap, possible overlap, definitely NO overlap]



DECOMPOSING MEETING GRAPH CIRCUITS TO MINIMISE KERNEL LOOP UNROLLING

MOUNIRA BACHIR, SID-AHMED-ALI TOUATI, ALBERT COHEN

Objective: Minimize the unrolling factor resulting from periodic register allocation of a software-pipelined loop, without altering the initiation interval (II)

Motivation: Code size is tied to memory requirements and I-cache performance

Method: Strategically insert move operations without increasing II to split meeting graph components into smaller ones

Results: “Good” if there are enough functional units to perform the additional move operations and the execution time is acceptable

Periodic Register Allocation


• Rotating Register File

• Move operations: d-1 MOVs/iteration (d: iteration span of variables)

• Loop unrolling: 3 * code size

• Modulo Variable Expansion: a[i] b[i] c[i] a[i+1] b[i+1] c[i+1] a[i+2] b[i+2] c[i+2], using 9 registers instead of 8 (MAXLIVE = 8)

• Meeting Graph: lifetime in cycles; lifetime interval of c ends when interval of b begins
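The gap between 9 registers and MAXLIVE = 8 can be reproduced with a small sketch. The lifetimes, start cycles, and II below are hypothetical numbers chosen to match the counts on the slide, and the sketch assumes a simplified modulo variable expansion in which every variable gets one register per unrolled copy.

```python
from math import ceil


def maxlive(intervals, ii):
    """Maximum number of simultaneously live values in the steady state.

    intervals maps a variable name to (start_cycle, lifetime_in_cycles).
    """
    live = [0] * ii
    for start, lifetime in intervals.values():
        for cycle in range(start, start + lifetime):
            live[cycle % ii] += 1
    return max(live)


def mve_registers(intervals, ii):
    """Unroll factor and register count under the simplified expansion."""
    unroll = max(ceil(lifetime / ii) for _, lifetime in intervals.values())
    return unroll, unroll * len(intervals)


# Hypothetical loop: three variables, each live for 8 cycles, II = 3.
intervals = {"a": (0, 8), "b": (1, 8), "c": (2, 8)}
II = 3

print(maxlive(intervals, II))        # 8
print(mve_registers(intervals, II))  # (3, 9)
```

With these numbers the loop is unrolled 3 times and uses one register more than the lower bound given by MAXLIVE.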

Meeting Graph


a[i] b[i] c[i] a[i+1] b[i+1] c[i+1] a[i+2] b[i+2] c[i+2] a[i+3] b[i+3] c[i+3] a[i+4] b[i+4] c[i+4] a[i+5] b[i+5] c[i+5] a[i+6] b[i+6] c[i+6] a[i+7] b[i+7] c[i+7]

Circuit Decomposition
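Why smaller circuits help can be sketched under the usual meeting graph formulation, which is assumed here rather than taken from the slides: an arc goes from u to v when the lifetime of u ends in the cycle (modulo II) where v begins, the weight of a circuit is its total lifetime divided by II, and the unroll factor is the least common multiple of the circuit weights. The lifetimes and II are hypothetical, as are the function names.

```python
from functools import reduce
from math import gcd


def meeting_graph(intervals, ii):
    """Arc u -> v whenever the lifetime of u ends (mod II) where v begins."""
    return {
        u: [v for v, (sv, _) in intervals.items() if (su + lu) % ii == sv % ii]
        for u, (su, lu) in intervals.items()
    }


def circuit_weight(circuit, intervals, ii):
    """Registers needed by one circuit: total lifetime divided by II."""
    return sum(intervals[v][1] for v in circuit) // ii


def unroll_factor(weights):
    """Least common multiple of the circuit weights."""
    return reduce(lambda x, y: x * y // gcd(x, y), weights, 1)


# Hypothetical loop: three variables, each live for 8 cycles, II = 3.
intervals = {"a": (0, 8), "b": (1, 8), "c": (2, 8)}
II = 3

print(meeting_graph(intervals, II))                     # {'a': ['c'], 'b': ['a'], 'c': ['b']}
print(circuit_weight(["a", "c", "b"], intervals, II))   # 8
print(unroll_factor([8]))                               # 8
```

A single circuit of weight 8 forces an unroll factor of 8. If circuits can be split so that their weights share a smaller least common multiple, the unroll factor drops accordingly, which is the effect the inserted move operations aim for.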


2011 INTERNATIONAL SYMPOSIUM ON CODE GENERATION AND OPTIMIZATION

Main Conference

MAO – AN EXTENSIBLE MICRO-ARCHITECTURAL OPTIMIZER

ROBERT HUNDT, EASWARAN RAMAN, MARTIN THURESSON, NEIL VACHHARAJANI (GOOGLE)

Micro-architectural: not always documented

Proprietary compilers at advantage!

[Figure: loop in SPEC2000 int, before and after inserting a NOP]

+ 1 NOP instruction: - 7% execution time



Micro-architectural: not always documented

Example: instruction decoding in Core 2 in chunks of 16 bytes

[Figure: loop in SPEC2000 int relative to the 16-byte alignment boundaries, before and after inserting a NOP]
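The alignment effect can be made concrete with a little arithmetic. The sketch below counts how many 16-byte chunks a piece of code touches and how much padding moves it to the next boundary. The addresses and the loop size are hypothetical, and the sketch says nothing about how much time a saved chunk is worth on a given processor.

```python
CHUNK = 16  # decode chunk size in bytes, taken from the Core 2 example


def chunks_spanned(start, size):
    """Number of 16-byte aligned chunks touched by code at [start, start + size)."""
    first = start // CHUNK
    last = (start + size - 1) // CHUNK
    return last - first + 1


def padding_to_align(start):
    """Bytes of padding needed to move `start` to the next chunk boundary."""
    return (-start) % CHUNK


loop_start, loop_size = 0x4F, 16           # hypothetical loop placement
print(chunks_spanned(loop_start, loop_size))    # 2
pad = padding_to_align(loop_start)
print(pad)                                      # 1
print(chunks_spanned(loop_start + pad, loop_size))  # 1
```

In this example a single one-byte NOP in front of the loop moves it onto a boundary, so the loop body fits into one chunk instead of straddling two.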



Contributions and conclusions

1. Extensible assembly-to-assembly optimizer

2. Does not fit in the GCC flow, because not enough information is preserved after the RTL level

3. Discover micro-architectural details semi-automatically through generation of micro-benchmarks

DYNAMIC REGISTER PROMOTION OF STACK VARIABLES

JIANJUN LI, CHENGGANG WU, WEI-CHUNG HSU

Use DBT to let x86 binaries use the extra registers on x86-64

Recompiling is not always an option (legacy binaries)

Compute-intensive applications gain speed when using 64-bit

Challenge: implicit stack accesses

Solved using page protection and stack switching (with shadow stack)

LANGUAGE AND COMPILER SUPPORT FOR AUTO-TUNING VARIABLE-ACCURACY ALGORITHMS

JASON ANSEL, YEE LOK WONG, CY CHAN, MAREK OLSZEWSKI, ALAN EDELMAN, SAMAN AMARASINGHE (MIT)

PetaBricks: language extensions to expose trade-offs between time and accuracy to the compiler

1. New programming language, toolchain and run-time environment

2. Technique for mapping variable-accuracy code to enable efficient auto-tuning
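The time versus accuracy trade-off can be illustrated with a toy tuner. The sketch is plain Python and does not use PetaBricks syntax or its tuning algorithm; the square-root routine, the accuracy metric, and all names are assumptions made for illustration. The tuner picks the cheapest setting (fewest iterations) that still meets an accuracy target on a set of training inputs.

```python
def newton_sqrt(x, iterations):
    """Approximate sqrt(x); more iterations cost more time and give more accuracy."""
    guess = x if x > 1 else 1.0
    for _ in range(iterations):
        guess = 0.5 * (guess + x / guess)
    return guess


def accuracy(x, approx):
    """Accuracy metric: negated absolute error of approx**2 against x (higher is better)."""
    return -abs(approx * approx - x)


def tune_iterations(training_inputs, target, max_iterations=50):
    """Smallest iteration count whose worst-case accuracy meets the target."""
    for n in range(max_iterations + 1):
        if all(accuracy(x, newton_sqrt(x, n)) >= target for x in training_inputs):
            return n
    return None  # no setting within the budget reaches the target


inputs = [2.0, 10.0, 100.0]
print(tune_iterations(inputs, target=-1e-3))
print(tune_iterations(inputs, target=-1e-9))
```

A looser target lets the tuner stop at fewer iterations, which is the kind of choice a variable-accuracy language hands to the compiler instead of the programmer.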

PRACTICAL MEMORY CHECKING WITH DR. MEMORY

DEREK BRUENING (GOOGLE), QIN ZHAO (MIT)

x86

Existing memory checking tools (e.g. Valgrind): slow, many false positives

A TRACE-BASED JAVA JIT COMPILER RETROFITTED FROM A METHOD-BASED COMPILER

HIROSHI INOUE, HIROSHIGE HAYASHIZAKI, PENG WU, TOSHIO NAKATANI (IBM)

Extend the compilation scope from methods to traces

Traces span multiple method invocations

More powerful than method inlining


Claim: current trace-JITs are immature

Keep the advanced optimization infrastructure by retrofitting

PHASE-BASED TUNING FOR BETTER UTILIZATION OF PERFORMANCE-ASYMMETRIC MULTICORE PROCESSORS

TYLER SONDAG AND HRIDESH RAJAN

Objective: Design and apply a transparent and fully automatic process called phase-based tuning, which adapts an application to effectively utilize performance-asymmetric multicores

Motivation: Trend towards performance asymmetry among cores of a single chip

Method: Statically partition the application into code sections that are likely to have similar runtime behavior. The runtime characteristics exhibited by representative sections are used to map the whole cluster

Results: 36% average process speedup with negligible overheads
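The mapping step of the method can be sketched roughly as follows. The sketch assumes that clusters of similar code sections are already given, that the first section of each cluster serves as its representative, and that the decision is based on measured IPC on two core types with a fixed threshold. The core names, the IPC metric, the threshold, and the measurement table are all hypothetical and do not come from the paper.

```python
def map_clusters(clusters, measure_ipc, threshold=1.5):
    """Assign every section of a cluster to the core type chosen for its representative.

    clusters maps a cluster id to a list of section names.
    measure_ipc(section, core) returns the measured IPC (hypothetical callback).
    """
    mapping = {}
    for sections in clusters.values():
        representative = sections[0]
        speedup = measure_ipc(representative, "big") / measure_ipc(representative, "little")
        core = "big" if speedup >= threshold else "little"
        for section in sections:          # the whole cluster follows its representative
            mapping[section] = core
    return mapping


# Hypothetical measurements for the two representatives.
ipc_table = {
    ("loop1", "big"): 2.0, ("loop1", "little"): 1.0,
    ("copy1", "big"): 0.6, ("copy1", "little"): 0.5,
}
clusters = {"compute": ["loop1", "loop2"], "memory": ["copy1"]}

print(map_clusters(clusters, lambda section, core: ipc_table[(section, core)]))
```

Only the representatives are measured, so `loop2` is placed on the same core type as `loop1` without being profiled itself.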

Phase-based tuning


VAPOR SIMD: AUTO-VECTORIZE ONCE, RUN EVERYWHERE

DORIT NUZMAN, SERGEI DYSHEL, ERVEN ROHOU, IRA ROZEN, ALBERT COHEN, AYAL ZAKS

Objective: Design a split vectorization framework and study how it compares to a monolithic one

Motivation: JIT compiler technology offers portability while facilitating target- and context-specific specialization; SIMD hardware is ubiquitous and diverse

Method: Mix-and-match existing open compilation tools, namely GCC and MONO

Results: Comparable to specialized monolithic offline compilers

Vectorizing for different platforms


Split vectorization scheme


Interoperable compilation flows


This is not a bullet slide.
