OPC: Optimizing and Parallelizing Compilers
Isabelle PUAUT
Université de Rennes I / IRISA Rennes (Pacap) 2017-2018

Performance enablers
• Micro-architecture and ISA
  - Pipelining, ILP, caching, superscalar out-of-order execution, VLIW, multicores, GPUs, etc. (ADA - A. Seznec)
• Compiler
  - Loop unrolling, splitting, fusion, inlining, etc.
• Optimize frequent, average-case scenarios

Performance predictability
• General-purpose systems
  - The "average" user wants stable performance:
    · Cannot tolerate performance variations of orders of magnitude when varying simple parameters
• Safety-critical / real-time systems: need for guaranteed performance
  - Worst-case performance (WCET - Worst-Case Execution Times)

Outline
• Performance predictability
  - What is it? For which applications?
• WCET estimation methods
  - Static WCET estimation methods
  - Measurement-based and hybrid WCET estimation methods
  - Beyond WCET estimation: schedulability analysis
  - WCET estimation for parallel codes
• Programming and compiling for worst-case performance


Performance predictability

Performance: Influencing elements
• Sequencing of actions (execution paths)
  - Input-data dependent
  - A priori unknown input data
• Duration of every action on a given processor
  - Hardware dependent
  - Depends on past execution history
[Figure: control-flow graph whose basic blocks have execution times t1..t8]

Performance: Distribution of execution times
[Figure: probability distribution P(et=t) of execution times over t, marking tmin, the mean and the worst-case time]

Performance: Application needs
• Desktop applications:
  - As fast, and as stable, as possible
  - Typical input data (average case)
  - Performance evaluation: execution/simulation with typical input data
• Real-time applications:
  - Before a deadline (real-time is not real-fast)
    · Control applications: stability of the physical process
    · Delay until system failure
  - For all input data (including the worst case)
  - Performance evaluation: the subject of this lecture

Classes of real-time systems
• Hard real-time
  - Missing a deadline can cause catastrophic consequences in the system: need for a priori guarantees in the worst-case scenario
  - Ex: control in transportation systems, nuclear applications, etc.
• Soft real-time
  - Missing a deadline decreases the quality of service of the system but does not jeopardize its safety
  - Ex: multimedia applications (VoD, etc.)

Validation of real-time systems: what is needed to demonstrate that deadlines are met?
• System-level validation
  - Worst-case knowledge of the system load (task arrivals, interrupt arrivals)
  - Ex: synchronous arrival of periodic tasks
• Task-level validation
  - Worst-case execution time of individual tasks
  - WCET (Worst-Case Execution Time)
    · Upper bound on the time for executing an isolated piece of code
    · Code considered in isolation
    · WCET ≠ response time

Different uses of WCET
• Temporal validation
  - Schedulability analysis
  - Schedulability guarantees (worst case)
• System dimensioning
  - Hardware selection
• Optimization of application code
  - Early in the application design lifecycle

WCET: Definition
• Challenges in WCET estimation
  - Safety (WCET estimate >= any possible execution time): confidence in schedulability analysis methods
  - Tightness
    · Overestimation ⇒ the schedulability test may fail, or too many resources might be used
• WCET ≠ WCET estimate
[Figure: execution-time distribution marking the mean, the actual WCET and the WCET estimate]

Static timing estimation methods
- Overview
- Flow analysis
- WCET computation
- Hardware-level analysis

Static WCET estimation methods
• Principle
  - Analysis of the program structure (no execution)
  - Computation of the WCET from the program structure
• Components
  - Flow analysis
    · Determines the possible flows in the program
  - Low-level (hardware-level) analysis
    · Determines the execution time of a sequence of instructions (basic block) on a given hardware
  - Computation
    · Computation from the results of the other components
    · All paths need to be considered: safety

Overview of components
[Diagram: the source code is compiled into object code; flow analysis (with optional annotations, at source or object level) produces a flow representation; low-level analysis works on the object code; the computation step combines both results into the WCET]

WCET computation: Assumptions
• Simple architecture
  - The execution time of an instruction only depends on the instruction type and operands
  - No overlap between instructions, no memory hierarchy

Tree-based computation
• Data structures
  - Syntax tree (control structures)
  - Basic blocks
• Principle
  - Determination of the execution time of every basic block (low-level analysis)
  - Computation based on a bottom-up traversal of the syntax tree (timing schema)
[Figure: syntax tree Seq1(BB0, Loop[4](BB1, Seq2(If(BB2, BB3, BB4), BB5)), BB6)]

Tree-based computation
• Timing schema
  - WCET(SEQ): S1;…;Sn → WCET(S1) + … + WCET(Sn)
  - WCET(IF): if (test) then else → WCET(test) + max(WCET(then), WCET(else))
  - WCET(LOOP): for (;tst;inc) {body} → maxiter * (WCET(tst) + WCET(body+inc)) + WCET(tst)
• On the example tree:
  - WCET(Seq1) = WCET_BB0 + WCET(Loop) + WCET_BB6
  - WCET(Loop) = 4 * (WCET_BB1 + WCET(Seq2)) + WCET_BB1
  - WCET(Seq2) = WCET(If) + WCET_BB5
  - WCET(If) = WCET_BB2 + max(WCET_BB3, WCET_BB4)
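The timing schema above can be executed directly. A minimal sketch in Python, assuming made-up basic-block times from low-level analysis (real tools operate on compiler data structures, not tuples):

```python
# Sketch of tree-based WCET computation (timing schema).
# Basic-block WCETs from low-level analysis (hypothetical values).
bb = {"BB0": 2, "BB1": 3, "BB2": 1, "BB3": 5, "BB4": 4, "BB5": 2, "BB6": 1}

def wcet(node):
    kind = node[0]
    if kind == "bb":                      # leaf: a basic block
        return bb[node[1]]
    if kind == "seq":                     # S1; ...; Sn
        return sum(wcet(c) for c in node[1:])
    if kind == "if":                      # if (test) then else
        _, test, then, els = node
        return wcet(test) + max(wcet(then), wcet(els))
    if kind == "loop":                    # for (; tst; inc) { body }
        _, maxiter, tst, body = node
        return maxiter * (wcet(tst) + wcet(body)) + wcet(tst)
    raise ValueError(kind)

# The syntax tree of the slide's example
tree = ("seq", ("bb", "BB0"),
        ("loop", 4, ("bb", "BB1"),
         ("seq", ("if", ("bb", "BB2"), ("bb", "BB3"), ("bb", "BB4")),
          ("bb", "BB5"))),
        ("bb", "BB6"))

print(wcet(tree))   # 2 + (4 * (3 + 1 + 5 + 2) + 3) + 1 = 50
```

With these block times, WCET(If) = 1 + max(5, 4) = 6, WCET(Seq2) = 8, WCET(Loop) = 4 * 11 + 3 = 47, hence 50 overall.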

Tree-based computation
• Advantages
  - Low computational effort
  - Good scalability with respect to program size
  - Good user feedback (source-code level)
• Drawbacks
  - Not compatible with aggressive compiler optimizations
  - Expression of complex flow facts is difficult (inter-control-structure flow facts)

IPET (Implicit Path Enumeration Technique)
• Integer linear programming
  - Objective function: max f1·t1 + f2·t2 + … + fn·tn
  - Structural constraints
    · ∀v: f_v = Σ_{a_i ∈ In(v)} a_i = Σ_{a_i ∈ Out(v)} a_i
    · f1 = 1
  - Extra flow information
    · fi ≤ k (loop bound)
    · fi + fj ≤ 1 (mutually exclusive paths)
    · fi ≤ 2·fj (relations between execution frequencies)
[Figure: control-flow graph with basic-block times t1..t7]
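The structural constraints can be checked on a toy CFG. A real tool hands the system to an ILP solver; this sketch brute-forces a tiny diamond CFG (A branches to B or C, which rejoin at D) with made-up block times:

```python
from itertools import product

# IPET on a diamond CFG A -> {B | C} -> D, solved by brute force.
t = {"A": 2, "B": 5, "C": 3, "D": 1}      # hypothetical block times

best = 0
# Enumerate candidate execution frequencies (0..1 here: no loop).
for fA, fB, fC, fD in product(range(2), repeat=4):
    if fA != 1:           continue        # entry block executes once
    if fB + fC != fA:     continue        # flow out of A = flow into {B, C}
    if fD != fB + fC:     continue        # flow into D = flow out of {B, C}
    best = max(best, fA*t["A"] + fB*t["B"] + fC*t["C"] + fD*t["D"])

print(best)   # 2 + 5 + 1 = 8: the longer branch B is implicitly selected
```

No path is ever enumerated explicitly: the maximizing frequency assignment (fB = 1, fC = 0) encodes the worst-case path, which is the point of the technique.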

IPET
• Advantages
  - Supports all unstructured flows (gotos, etc.)
  - Supports all compiler optimizations, including the most aggressive ones
• Drawbacks
  - More time-consuming than tree-based methods
  - Low-level user feedback (works at binary level)
  - Annotations are hard to provide (one needs to know the compiler optimizations)
• Mostly used in commercial/academic tools

Low-level analysis: Introduction
• Simple architecture
  - The execution time of an instruction only depends on the instruction type and operands
  - No overlap between instructions, no memory hierarchy
• Complex architecture
  - Local timing effects
    · Overlap between instructions (pipelines)
  - Global timing effects
    · Caches (data, instructions), branch predictors
    · Requires knowledge of the entire code

Low-level analysis: Introduction
• Outline
  - Pipelines, caches, branch predictors, … first considered in isolation
  - Interferences between analyses
• Restrictions
  - Simple in-order pipelines first

Low-level analysis: Pipelining
• Principle: parallelism between instructions
  - Intra basic-block
  - Inter basic-block
[Figure: pipeline diagrams showing instructions overlapping in the Fetch, Decode, Execute, Memory and Write Back (IF/ID/EX/ME/WB) stages over time]

Low-level analysis: Pipelining
• Intra basic-block: "simulation" of the flow of instructions through the pipeline stages
  - Takes data hazards (RAW, WAR, WAW) into consideration
  - The bypass mechanism (if any) has to be modeled
  - Obtained by the WCET analysis tool or by an external tool (simulator, measurements on the processor)
[Figure: pipelined execution of one basic block; ti denotes its execution time]

Low-level analysis: Pipelining
• Inter basic-block: modification of the computation step
  - Example: IPET, with extra constraints in the ILP problem (negative costs on edges)
  - Objective becomes: max Σ fi·ti + Σ ai,j·di,j, with di,j <= 0
  - Example: ti = 7, tj = 7, pipeline overlap gain di,j = -4
[Figure: two basic blocks whose IF/ID/EX/ME/WB stages overlap over time]

Low-level analysis: Out-of-order execution [Li & Mitra, 06]
• Superscalar processor model
  - Pipeline stages: IF, ID, EX, WB, CM
  - Instruction cache and fetch buffer, data cache, register files, reorder buffer (ROB)
  - Functional units: ALU1, ALU2, MULTU, FPU, LD/ST
  - In-order fetch and dispatch, out-of-order execution, in-order commit

Low-level analysis: Out-of-order execution, pipeline modeling
• Plain arrows: dependencies
  - Pipeline stages of the same instruction, buffer sizes, data dependencies among instructions, in-order fetch and ID
• Dashed arrows: contentions (shared resources)

Low-level analysis: Out-of-order execution, timing analysis
• Assumptions (simplified version)
  - Fixed-latency functional units
• Principle
  - Applied to every basic block
  - Scheduling problem: find the longest path in an acyclic graph with precedence and exclusion constraints
  - Algorithm (simplified)
    · Traverse the graph in topological order (the finish times of all predecessors are known)
    · Conservative estimation of the contention time
    · See the paper for details!

Low-level analysis: Instruction caches
• Cache
  - Takes benefit of the temporal and spatial locality of references
  - Speculative: future behaviour depends on past behaviour
  - Good average-case performance, but predictability issues

Low-level analysis: Instruction caches
[Figure: n-way set-associative cache: an address is split into a tag, l index bits selecting one of the 2^l cache sets, and k offset bits selecting one of the 2^k memory elements per cache line; each set holds n ways, backed by main memory]

Low-level analysis: Instruction caches
• How to obtain safe and tight estimates?
  - Simulation along one path is not an option
    · Explores only one path, which may not be the longest one
  - Simple solution (all miss): overly pessimistic
    · And in some cases it may even be wrong (timing anomalies, see the end of the chapter)
• Objective
  - Predict whether an instruction will (certainly) cause a hit or might (conservatively) cause a miss.

Low-level analysis: Instruction caches
• Based on abstract interpretation [Ferdinand, 2004]
• Computation of Abstract Cache States (ACS)
  - An ACS contains all possible cache contents considering "all" possible execution paths
  - Fixpoint computation
• Instruction categorization from the ACS
  - Always hit (AH): guaranteed to be a hit
  - Always miss (AM): guaranteed to be a miss
  - First miss (FM): guaranteed to be a hit after first referenced
  - Not classified (NC): don't know

Low-level analysis: Instruction caches
• Analyses
  - Must: the ACS contains all program lines guaranteed to be in the cache at a given point
  - May: the ACS contains all program lines that may be in the cache at a given point
  - Persistence: the ACS contains all program lines not evicted after being loaded
• Modification of the ACS
  - Update: at every reference
  - Join: at every path-merging point
• Following examples: for LRU caches

Low-level analysis: Instruction caches - Must analysis
• The ACS contains all program lines guaranteed to be in the cache at a given point
• Update: apply the replacement policy (the referenced line becomes the youngest, the other lines age)
• Join (path merge): intersection of the incoming ACSs, keeping the maximum age of each line
[Figure: 4-way LRU example joining the ACSs [a, b, c, d] and [e, a, b, c] by intersection + maximum age, and updating an ACS on a reference]
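The Must update and join can be sketched for a fully-associative LRU cache. This is a simplified reading of [Ferdinand, 2004], with an ACS represented as a map from cache line to its maximal age (0 = youngest; an age of 4 or more means the line is no longer guaranteed cached):

```python
# Must analysis for a fully-associative 4-way LRU cache (sketch).
WAYS = 4

def update(acs, ref):
    """LRU update: ref becomes age 0; lines younger than ref's old age age by one."""
    old = acs.get(ref, WAYS)                 # age WAYS if not guaranteed present
    out = {ref: 0}
    for line, age in acs.items():
        if line == ref:
            continue
        new_age = age + 1 if age < old else age
        if new_age < WAYS:                   # age >= WAYS: evicted from the ACS
            out[line] = new_age
    return out

def join(a, b):
    """Must join: intersection, keeping the maximal (pessimistic) age."""
    return {l: max(a[l], b[l]) for l in a.keys() & b.keys()}

acs = {}
for ref in ["a", "b", "c", "d"]:
    acs = update(acs, ref)
print(acs)                          # {'d': 0, 'c': 1, 'b': 2, 'a': 3}
print(join(acs, update(acs, "e")))  # b, c, d survive, at their maximal ages
```

After the join, "a" disappears: it was evicted on the path that referenced "e", so a hit on "a" can no longer be guaranteed, which is exactly the pessimism the Must analysis is designed to capture.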

Low-level analysis: Instruction caches - May analysis
• The ACS contains all program lines that may be in the cache at a given point
• Update: apply the replacement policy
• Join (path merge): union of the incoming ACSs, keeping the minimum age of each line
[Figure: 4-way LRU example joining the ACSs [a, b, c, d] and [e, a, b, c] by union + minimum age]

Low-level analysis: Instruction caches
• From ACS to classification
  - If in ACS_Must: Always Hit
  - Else, if not in ACS_May: Always Miss
  - Else, if in ACS_Persistence: First Miss
  - Else: Not Classified
• From classification to WCET (IPET)
  - For every basic block: t_first, t_next
  - Objective function: max Σ (f_first_i · t_first_i + f_next_i · t_next_i)
  - New constraints:
    · f_i = f_first_i + f_next_i
    · Constraints based on the cache classification
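The gain of the First Miss classification is easy to quantify. A sketch with hypothetical hit/miss times and loop bound (the real constraint system also bounds f_first per classification; here the FM split is applied by hand):

```python
# Accounting for a First Miss (FM) block in IPET (sketch, made-up numbers).
t_hit, t_miss = 2, 12     # block execution time on a cache hit / miss
f = 10                    # execution frequency of the block (loop bound)

# FM: one execution pays the miss (t_first), the rest hit (t_next).
f_first, f_next = 1, f - 1
t_first, t_next = t_miss, t_hit
wcet_fm = f_first * t_first + f_next * t_next

wcet_am = f * t_miss      # an Always Miss classification charges every execution

print(wcet_fm, wcet_am)   # 30 120
```

The FM category charges the miss penalty once instead of ten times, a 4x tighter contribution for this block.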

Low-level analysis: Data caches
• Extra issue: determination of the addresses of data
• Means: abstract interpretation / dataflow analysis
  - Value analysis
  - Pointer analysis
• Result:
  - A superset of the referenced addresses per instruction
• Example:
  - for (int i=0; i<N; i++) tab[i+j/z] = x;
  - Any address inside tab may be referenced at each loop iteration (but only one is actually referenced)
  - Hard-to-predict reference (input-dependent)

Low-level analysis: Data caches
• Solutions (with some limitations)
  - Compiler-controlled: don't cache input-dependent data structures
  - Assume every potentially referenced address is actually referenced (2*N misses in the example)
  - Cache Miss Equations (CME): for affine, data-independent loop indices

Low-level analysis: Data caches
• Tighter estimation method (pigeonhole principle)
  - Predictable vs unpredictable accesses (obtained using static analysis)
  - Replacement of other blocks
    · Cache miss counter Cm
    · Incremented only for blocks used after the current instruction (information obtained using static analysis)
  - Cache miss if the element is not in the cache
    · Comparison of the execution count and the array size
    · Persistence analysis of arrays
    · For persistent arrays fitting in the cache, number of misses <= array_range

Low-level analysis: Data caches
• Cache miss equations
  - Predictable vs unpredictable accesses (obtained using static analysis)
  - Replacement of other blocks
    · Cache miss counter Cm
    · Incremented only for blocks used after the current instruction (information obtained using static analysis)
  - Cache miss if the element is not in the cache
    · Comparison of the execution count and the array size
    · Persistence analysis of arrays
    · For persistent arrays fitting in the cache, number of misses <= array_range

Low-level analysis: Branch prediction
• Branch predictors
  - Local predictors [Colin, 00]
  - Global predictors [Mitra], [Rochange]
  - The most complex predictors are out of reach!

The bad news: Timing anomalies (out-of-order execution)
• Dispatch order (cycle: instruction):
  - A (1): LD  r4, 0(r3)
  - B (2): ADD r5, r4, r4
  - C (3): ADD r11, r10, r10
  - D (4): MUL r11, r11, r11
  - E (5): MUL r13, r12, r12
[Figure: two schedules of A..E on the LSU, IU and MCIU units; depending on whether the load A hits or misses in the cache, the sequence takes 12 or 11 cycles: here the cache hit produces the longer schedule]

The bad news: Timing anomalies
• A cache miss is not necessarily the worst case
• Pipeline analysis and cache analysis are no longer independent :-(
  - When a miss/hit cannot be predicted, exploration of both outcomes: explosion of the pipeline analysis time
  - [Li & Mitra 06]: a more complex scheduling problem

The bad news: Timing anomalies
• An empty cache is not necessarily the worst case
• Example: PLRU (4 ways)
  - A tree of bits over the ways [a, b, c, d]; 0 (resp. 1) means the left (resp. right) subtree is to be replaced next
  - Update/hit: the bits on the accessed path are flipped

The bad news: Timing anomalies (credits: AbsInt)
[Figure: AbsInt illustration of a timing anomaly]

The bad news: Timing anomalies
• Empty-cache phenomenon and replacement policies: LRU and FIFO are OK
• Latest development: predictability metrics [Reineke, 2008]
  - Evict: lower bound on the eviction time (after evict distinct accesses, a line is guaranteed to be evicted)
    · Used in the May analysis
  - Fill: after fill distinct accesses, all possible cache contents are known
    · Used in the Must analysis

Flow analysis
• Structurally feasible paths (infinite)
• Basic finiteness (bounded loops)
• Actually feasible paths (infeasible paths, mutually exclusive paths)

Flow analysis
• Infeasible paths
• On the code below, path A-B-D-E-F is infeasible (after A-B, x <= 5, so the condition at D is false)
• Identification of infeasible paths improves tightness

    int baz(int x) {
        if (x < 5)         // A
            x = x + 1;     // B
        else
            x = x * 2;     // C
        if (x > 10)        // D
            x = sqrt(x);   // E
        return x;          // F
    }

Flow analysis
• Maximum number of iterations of loops
• Tight estimation of loop bounds improves tightness
• On this specific example, polyhedral techniques help (see later)

    for i := 1 to N do
        for j := 1 to i do
        begin
            if c1 then A.long  else B.short;
            if c2 then C.short else D.long
        end

• A bound of N on each loop gives N^2 executions of the body; the actual count is N(N+1)/2

Flow analysis
• Determination of flow facts
  - Automatic (static analysis): infeasible in general (equivalent to the halting problem)
  - Manual: annotations
    · Loop bounds: constants, or symbolic annotations
    · Minimum required path information
    · Annotations for infeasible / mutually exclusive paths, relations between execution counts

Flow analysis
• Annotations in aiT
  - Simple example
    · asm("label:"); (in the source code)
    · ais2 { flow sum: point("label") == cnt; }
    · ais2 { flow sum: point("label") <= cnt; }
  - More generally, a language with a documented syntax for
    · Bounds on recursion
    · Loop bound specification (min, max, exact)
    · Linear constraints
    · Maximum execution times for loops and code snippets, address ranges, etc.
  - All this at binary level

Flow analysis: Flow analysis using abstract interpretation [Gust06]
• Abstract domain: intervals of values
• Scopes: "repeated" blocks (loops/functions)
• Fixpoint calculation:
  - Update:
    · At every instruction (source/intermediate/binary)
    · Conditional constructs: two paths may be followed
  - Join:
    · Safe merging of intervals to avoid state explosion
    · Here: at the end of a scope iteration

Flow analysis: Flow analysis using abstract interpretation
• Example:

    i = INPUT;       // i = [1..4]
    while (i < 10) {
        // point p
        ...
        i = i + 2;
        ...
    }
    // point q

• Interval of i at point p, per iteration: 1: [1..4], 2: [3..6], 3: [5..8], 4: [7..9], 5: [9..9], 6: impossible
• Min #iter: 3, Max #iter: 5
• End of iteration 3: [5..8] → i = i + 2 → [7..10] → two abstract states (exit when i >= 10, continue otherwise)
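The slide's interval table can be reproduced mechanically. A sketch of the propagation (filter by the loop condition, apply the increment), not the tool's actual implementation:

```python
# Interval analysis of "i = INPUT in [1..4]; while (i < 10) i += 2".
lo, hi = 1, 4
table = []                           # interval of i at point p, per iteration
while lo <= 9:                       # some value still satisfies i < 10
    table.append((lo, min(hi, 9)))   # filter the interval by the condition
    lo, hi = lo + 2, min(hi, 9) + 2  # effect of i = i + 2

max_iter = len(table)                # last iteration with a non-empty state
# first iteration after which some value reaches i >= 10 (the loop may exit)
min_iter = next(k for k, (l, h) in enumerate(table, 1) if h + 2 >= 10)

print(table)                  # [(1, 4), (3, 6), (5, 8), (7, 9), (9, 9)]
print(min_iter, max_iter)     # 3 5
```

The derived bounds (min 3, max 5 iterations) become linear constraints on the loop's execution frequency in the IPET formulation.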

Flow analysis: Flow analysis using abstract interpretation
• Loop bound detection
  - Loop counter:
    · Incremented at every scope iteration
    · Updated at every scope exit ({3, 4, 5} on the example)
• Infeasible node: node counter
  - NB: nodes with counters > 0 might be infeasible
• Upper bounds for node executions
• Infeasible paths
  - For the same scope iteration
  - Matrix of pairs
• IPET: new linear constraints

Flow analysis: Flow analysis using abstract interpretation
• Some numbers (analysis time and WCET estimate, without and with flow facts; #FF = number of flow facts):

    Pgm        Time   WCETorig    Time   WCETff    #FF
    Crc        4.90     834159    6.65   833730     56
    Inssort    0.16      31163    0.17    18167      7
    Ns         6.09     130733    6.81   130733      8
    Nsichneu  36.88     119707  435.70    41303  65280

Measurement-based and hybrid methods
• Principles
• Safe measurement-based methods
• Less-safe measurement-based methods

WCET estimation methods: Dynamic methods
• Principle
  - Input data
  - Execution (hardware, simulator)
  - Timing measurement
• Generation of input data
  - User-defined: reserved to experts
  - Exhaustive
    · Risk of combinatorial explosion

WCET estimation methods: Safe dynamic method
• Principles
  - Structural testing
  - "All-paths" coverage criterion
  - Avoid enumerating all input-data combinations
• Assumptions
  - A1. One feasible path in the source code gives rise to one path in the binary code
  - A2. The execution time of a feasible path is the same for all input values which cause the execution of that path
  - A3. Existence of a worst-case machine state

WCET estimation methods: Safe dynamic method
• Incremental coverage of the input domain
[Figure: the input domain is progressively partitioned; each test measures the time (t1, t2, t3, …) of one newly covered path]

WCET estimation methods: Safe dynamic method
[Diagram: the source code is instrumented and compiled; executing the instrumented object code yields an execution path, whose path predicate is obtained by substitution; the conjunction of the path predicates of previous tests is subtracted (difference) from the input domain; constraint solving over the domain not yet covered produces the input values injected into the next test]

WCET estimation methods: Safe dynamic method
• Advantages
  - No hardware model
  - All paths are covered
• Drawbacks
  - Assumption A2 restricts the architectures that can be used
  - Still an explosion of the number of paths to be explored for some programs
• Ongoing work: measuring fewer execution paths

WCET estimation methods: Less-safe dynamic methods
• Use of genetic algorithms
• Data structures
  - Population, individuals, genes (sets of input data)
• Operators
  - Cross-over: mix two genes (sets of input data)
  - Mutation: change one gene (input data)
  - Selection: individuals with the largest execution times (obtained through measurements)
• Benefits
  - Measurement-based: no need for hardware models
  - No safety guarantee
  - Used to validate static analysis methods, or when safety is not mandatory
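The three operators can be sketched in a few lines. In this toy, the "execution time" is a stand-in function over bit-vector inputs, not a real measurement on hardware; everything else (population size, generation count) is made up:

```python
import random

# Toy genetic search for WCET-stressing inputs.
random.seed(0)

def exec_time(x):                 # stand-in for a measured execution time
    return sum(b * w for b, w in zip(x, (3, 1, 4, 1, 5)))

def evolve(pop, gens=30):
    for _ in range(gens):
        # selection: keep the individuals with the largest measured times
        pop.sort(key=exec_time, reverse=True)
        parents = pop[:4]
        children = []
        while len(children) < len(pop) - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, len(a))
            child = a[:cut] + b[cut:]                       # cross-over
            i = random.randrange(len(child))
            child = child[:i] + (1 - child[i],) + child[i+1:]  # mutation
            children.append(child)
        pop = parents + children
    return max(pop, key=exec_time)

pop = [tuple(random.randint(0, 1) for _ in range(5)) for _ in range(10)]
best = evolve(pop)
print(best, exec_time(best))
```

The search only ever observes measured times, never the program structure, which is why it needs no hardware model but can also never prove it found the true worst case.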

WCET estimation methods: Hybrid method
• High-level analysis and WCET combination
  - Static analysis
• Execution times of basic blocks
  - Measurements
  - Distributions of execution times
• WCET computation (tree-based)
  - Operators defined on distributions instead of single values
    · Independent basic blocks: convolution
    · Non-independent basic blocks: biased convolution
  - Result: probabilistic WCET
• Evaluation
  - Benefits: no need for a hardware model, path analysis is not probabilistic
  - Safety is not guaranteed regarding hardware effects

WCET estimation methods: Hybrid method, symbolic simulation
• Based on a cycle-accurate simulator
• Values and registers tagged with a "type" (known/unknown)
• Extension of the semantics of instructions
  - Memory transfer from an unknown register content: target address tagged as "unknown"
  - Conditional branch on an unknown condition: exploration of both branches
  - And so on, for every instruction
• State merging to avoid exploring all paths
  - State = (memory/registers)
  - Merging = pessimistic union of the two states
  - Merging points: different granularities
• Benefits: flow analysis "for free"
• Drawbacks: hardware models, terribly slow

Open issues (low-level analysis)
• Complexity of static WCET estimation tools
• Timing anomalies, integration of sub-analyses
• Analysis tools may be released a long time after the hardware is available
• Trends:
  - Probabilistic low-level analysis
  - Software control of hardware (cache locking, compiler-directed branch prediction)
  - Multi-core architectures (still a long way to go, …)

Static WCET estimation methods: A method for every usage
• Static WCET estimation
  - Safety (+)
  - Pessimism (-)
  - Need for a hardware model (-)
  - Trade-off between estimation time and tightness (tree-based / IPET) (+)
• Measurement-based methods
  - Safety? Probabilistic methods
  - Pessimism (+)
  - No need for hardware models (+)

WCET estimation tools
• Academic
  - Cinderella [Princeton]
  - Heptane [IRISA, Rennes]
  - Otawa [IRIT, Toulouse]
  - SWEET [Sweden]
  - Chronos [Singapore]
• Industrial
  - Bound-T [SSF]
  - aiT [AbsInt, Saarbrücken]
  - RapiTime [Rapita Systems, York]

Some pointers
• Bibliography
  - Survey paper in TECS, 2008
  - Workshop on Worst-Case Execution Time Analysis (2001..2013), in conjunction with ECRTS
• Working group
  - Timing Analysis on Code-Level (TACLe), http://www.tacle.eu

Beyond predictable performance

Schedulability analysis

Temporal validation of real-time systems
• Testing
  - Input data + execution (target architecture, simulator)
  - Necessary but not sufficient (coverage of scenarios)
• Schedulability analysis (models)
  - Hard real-time: need for guarantees in all execution scenarios, including the worst-case scenario
  - Task models
  - Schedulability analysis methods (1970s → today)

Schedulability analysis: Introduction (1/2)
• Definition
  - Algorithms or mathematical formulas allowing one to prove that deadlines will be met
• Classification
  - Off-line validation
    · Target = hard real-time systems
  - On-line validation
    · Acceptance tests executed at the arrival of new tasks
    · Some tasks may be rejected → target = soft real-time systems

Schedulability analysis: Introduction (2/2)
• Input: system model
  - Task model
    · Arrival: periodic, sporadic, aperiodic
    · Inter-task synchronization: precedence constraints, resource sharing
    · Worst-case execution time (WCET)
  - Architecture
  - Known off-line for hard real-time systems
• Output
  - Schedulability verdict

Schedulability analysis: Example (1/2)
• System model
  - Periodic tasks (period Pi), deadline Di <= Pi
  - Fixed priorities (the smaller Di, the higher the priority)
  - Worst-case execution time: Ci
• Necessary condition: U = Σ_{i=1..n} Ci/Pi <= 1
• Sufficient condition: Σ_{i=1..n} Ci/Di <= n·(2^(1/n) - 1)
• Low complexity
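Both conditions are one-liners to evaluate. A sketch on a hypothetical task set (Ci, Di, Pi), mirroring the slide's formulas rather than a full analysis:

```python
# Utilization-based schedulability checks (sketch, made-up task set).
tasks = [(1, 4, 4), (2, 6, 8), (1, 10, 12)]   # (Ci, Di, Pi)
n = len(tasks)

U = sum(C / P for C, _, P in tasks)                 # total utilization
necessary = U <= 1                                  # necessary condition
sufficient = sum(C / D for C, D, _ in tasks) <= n * (2 ** (1 / n) - 1)

print(round(U, 3), necessary, sufficient)           # 0.583 True True
```

Here U = 7/12 and the density sum 0.683 is below the bound n·(2^(1/n) - 1) ≈ 0.780 for n = 3, so the set is accepted by the sufficient test; when the sufficient test fails, the set may still be schedulable, which motivates the response-time test on the next slide.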

Schedulability analysis: Example (2/2)
• Same task model as before
• Estimation of the response time as the limit of a series:
  - w_i^0 = Ci
  - w_i^(k+1) = Ci + Σ_{j ∈ hp(i)} ⌈w_i^k / Pj⌉ · Cj
• The series limit (when it converges) is the task's response time Ri
• The system is schedulable when Ri <= Di
• A more complex schedulability test
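The fixed-point iteration above runs in a few lines. A sketch assuming the tasks are sorted highest-priority first, with the same made-up (Ci, Di, Pi) values as before:

```python
import math

# Response-time analysis for fixed-priority tasks (sketch).
tasks = [(1, 4, 4), (2, 6, 8), (1, 10, 12)]   # (Ci, Di, Pi), priority order

def response_time(i):
    w = tasks[i][0]                            # w0 = Ci
    while True:
        # w^(k+1) = Ci + sum over hp(i) of ceil(w^k / Pj) * Cj
        nxt = tasks[i][0] + sum(math.ceil(w / P) * C
                                for C, _, P in tasks[:i])
        if nxt == w:                           # fixed point: Ri found
            return w
        if nxt > tasks[i][1]:                  # past the deadline: unschedulable
            return None
        w = nxt

print([response_time(i) for i in range(len(tasks))])   # [1, 3, 4]
```

Each Ri is at most the corresponding Di (4, 6, 10), so this task set is schedulable under the response-time test.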

Schedulability analysis: Networked systems
• Response-time analysis on every processor
• Response-time analysis of network transmissions (e.g. CAN: a fixed-priority network)
• End-to-end response-time analysis
  - Holistic approaches

WCET estimation for parallel codes
• Issues
• Preliminary studies

Hardware architectures: Examples of multi-core systems
• Shared memory + bus + local cache/SPM
• Shared memory + bus + shared last-level cache
• Clustered architectures (Kalray) with a NoC
[Figure: cores connected through NoC routers to a RAM controller; shared resource + memory-access scheduling on the memory side, shared resource + NoC policy on the network side]

Hardware architectures: Issues with multi-core systems
• Shared hardware resources (to reduce costs)
  - Bus
  - NoC
  - Last-level cache, memory controller
• Introduce dependencies between cores
  - Interference delays
  - Depend on
    · The arbitration policy
    · The contending activity on the other cores
• Remark: in some safety-critical systems, all cores but one are disabled

80

Hardware architectures
Software executing on multi-cores

¢ Independent tasks
l Have to consider worst-case contention

¢ Dependent tasks
l Dependencies between tasks
l Examples: Directed Acyclic Graphs (DAGs), Message Sequence Charts (MSCs)

[Figure: example task graph with a Start/Fork node, basic blocks A to G, synchronization points S0 and S1, waits W0 and W1; edges denote control flow and synchronization]


81

Shared Last Level Cache (LLC)

¢ Issues
¢ A technique to identify the worst-case number of misses in the LLC
l Analysis of cache hierarchies
l Analysis of the shared LLC
¢ Using synchronizations to tighten estimates
l Avoid pressure in the shared cache
l Account for synchronizations

Shared Last Level Cache

¢ Issues arising from cache sharing

[Figure: assumed architecture; two cores access a shared cache. The analysed task (Core 1) accesses [a], [b], [a]; the rival task (Core 2) accesses [b]]

Is the second access to [a] a hit in the cache?
Not necessarily: it depends on when the rival task accesses the shared cache.


Analysis of cache hierarchies

¢ Hierarchical analysis of caches

[Figure: references from the processor flow through cache analysis at level L-1, then at level L, towards main memory; each level produces a Hit/Miss classification]

¢ Compute access occurrences at cache level L
l Accessing level L is not necessarily the worst case!
¢ In the following examples:
l Non-inclusive caches
l LRU replacement policy
l Set-associative caches

Analysis of cache hierarchies [RTSS 08]

¢ Cache access classification at level L
l Never: never occurs at level L
l Always: always occurs at level L
l Uncertain: unknown

¢ Modification of the cache analysis
l Update function
• Never: not considered
• Always: considered
• Uncertain: local exploration, Join(Never, Always)
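As a sketch, the update rule above can be read as a filter on the access stream fed to the level-L analysis (the representation of classifications is an assumption, not the slides' implementation):

```python
# Sketch: which accesses reach the level-L analysis, given the
# level L-1 classification of each access (representation assumed).

def accesses_at_level_L(accesses, classification):
    # Never: filtered out; Always: considered;
    # Uncertain: considered conservatively (the real analysis
    # joins the Never and Always cases locally)
    return [a for a in accesses if classification[a] != "Never"]

cls = {"a": "Always", "b": "Never", "c": "Uncertain"}
print(accesses_at_level_L(["a", "b", "c", "a"], cls))   # ['a', 'c', 'a']
```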


Analysis of shared caches

¢ Estimate conflicts from rival tasks
¢ Integrate conflicts into the cache classification (Hit/Miss)

[Figure: analysis flow; each rival task undergoes conflict estimation at level L, yielding a conflicts count; the cache analysis of the analysed task at shared level L combines its references with these conflicts counts to produce the access classification]






Analysis of shared caches: conflict estimation

¢ Any task, at any time, may alter shared caches
l Computing all interleavings: too costly in practice
¢ To reduce this cost, we abstract from a task:
l Access ordering
l Occurrence count
¢ For each cache set
l Find the memory blocks used by the task
l Count the total number of blocks (CCN)

[Figure: per-task access traces, e.g. [a], [b] for the analysed task and [w], [x], [y], [d], [e] for a rival task]
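A toy illustration of the per-set CCN count, under the assumptions that blocks are identified by addresses and mapped to sets by a simple modulo (both are assumptions for the example, not the slides' method):

```python
# Illustrative CCN computation: count distinct conflicting blocks per
# cache set, abstracting away access ordering and occurrence counts.

def ccn_per_set(block_addresses, num_sets):
    per_set = {s: set() for s in range(num_sets)}
    for b in set(block_addresses):        # occurrence count is abstracted away
        per_set[b % num_sets].add(b)      # hypothetical modulo set mapping
    return {s: len(blocks) for s, blocks in per_set.items()}

print(ccn_per_set([0, 2, 4, 5, 2, 2], 2))   # {0: 3, 1: 1}
```

Note how repeated accesses to block 2 contribute only once: only the set of blocks matters, which is exactly the abstraction the slide describes.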

Analysis of shared caches: conflict integration

¢ A cache analysis is performed using modified cache states:
l CCN cache blocks may have been allocated to rival tasks in the shared cache.
l Only (cache level L associativity − CCN) ways are guaranteed to be available.

[Figure: example with two cache sets, CCN0 = 2 and CCN1 = 7; compared to the base cache configuration, the cache lines occupied by conflicting blocks reduce the space available during analysis]
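The number of guaranteed ways per set follows directly; the sketch below assumes (hypothetically) an 8-way cache for the slide's CCN0 = 2, CCN1 = 7 example:

```python
# Sketch: ways guaranteed free of rival blocks in one cache set.
# The 8-way associativity below is an assumption for the example.

def available_ways(associativity, ccn):
    # rival tasks may occupy up to ccn ways; the rest is guaranteed
    return max(0, associativity - ccn)

print(available_ways(8, 2))   # 6 ways usable in set 0
print(available_ways(8, 7))   # 1 way usable in set 1
```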


Analysis of shared caches

¢ Conflict estimation is pessimistic:
l Many blocks may clutter up the cache.
l Every block a task may store at a shared cache level is included.
¢ Little cache space is detected as available during shared cache analyses.

Need to control the impact of tasks on shared caches:
l Bypass of shared cache blocks [RTSS 09, IRISA]
l Accounting for synchronization [RTSS 09, NUS]

Analysis of shared caches: bypass

¢ If an instruction bypasses a cache level, it does not alter this cache level.

[Figure: example access sequence [a], [c], [c], [a], [b], [c] on the considered cache, without bypass and with some accesses bypassing the cache]


Analysis of shared caches: bypass

¢ Reduce tasks' impact on caches
l A bypassing access produces no conflict

¢ Selection of accesses to bypass
l Static analysis
l Accesses not statically detected as reused (single-usage)
l Never worsens tasks' WCET

¢ Extended to data caches in [RTNS 10]
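A trace-based toy version of this selection is sketched below; note that the real approach identifies single-usage accesses by static analysis, not by inspecting a trace, so this is only an illustration of the criterion:

```python
from collections import Counter

# Toy sketch: a block accessed exactly once is never reused, so
# caching it cannot help and bypassing it cannot hurt the WCET.

def bypass_candidates(access_trace):
    counts = Counter(access_trace)
    return {block for block, n in counts.items() if n == 1}

print(sorted(bypass_candidates(["a", "c", "c", "a", "b", "c"])))   # ['b']
```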

Analysis of shared caches: accounting for synchronization

¢ Synchronization points define zones of conflicting accesses

[Figure: tasks 1 and 2; without synchronization, all accesses compete for shared cache space; with a rendezvous synchronization, only accesses within the same zone compete]


WCET estimation in multi core architectures: shared busses

¢ Issues
¢ Estimation of contention delays for round-robin arbitration
l Pessimistic contention analysis for RR
l Accounting for synchronizations

WCET estimation in multi core architectures: shared busses

¢ Issues

¢ Delay to serve request r1
l Depends on pending requests from the other cores
l Depends on the bus arbitration policy

[Figure: cores issue requests to memory over a shared bus; request r1 is pending]


WCET estimation in multi core architectures: shared RR bus

¢ Round-robin arbitration
l A time units per core, in round-robin order
l If a core has no pending request, the request of the next core in round-robin order is served
¢ Worst-case delay:
l For a request from core i: WCD_RR = (Nc − 1) · A, with Nc the total number of cores
l For Ni requests from core i: WCD_RR = (Nc − 1) · A · Ni

[Figure: bus schedule with slots of A time units each (say one memory access), allocated C1, C2, C3, …, C1, C2, C3]
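As a sketch (names and values are illustrative), the pessimistic bound is a one-liner:

```python
# Sketch of the pessimistic round-robin bound above.

def wcd_rr(Nc, A, Ni=1):
    # each of the Ni requests waits for every one of the Nc - 1
    # other cores, assumed to always have a pending request
    return (Nc - 1) * A * Ni

print(wcd_rr(4, 1))         # 3: one request, 4 cores, slot A = 1
print(wcd_rr(4, 1, Ni=5))   # 15: five requests
```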

WCET estimation in multi core architectures: shared RR bus

¢ Remarks
l Assumes all concurrent cores always have a pending request: a safe over-approximation, but it introduces pessimism
l Let Nj denote the maximum number of memory accesses performed by a concurrent core Cj
l For Ni requests from core i:
• Assume cores are ordered by increasing Nj
• WCD_RR(Ci) = ∑_{j=1..i−1} Nj·A + ∑_{j=i+1..Nc} Ni·A

(cores with fewer than Ni accesses delay core i by A per access; cores with at least Ni accesses delay it by at most Ni·A)
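A sketch of this refined bound (names and the access counts are illustrative); N holds the per-core maximum access counts sorted in increasing order:

```python
# Refined round-robin bound: a concurrent core cannot interfere
# more often than it actually accesses memory.

def wcd_rr_refined(i, N, A):
    Ni = N[i]
    # cores with fewer accesses than Ni: each of their accesses delays by A
    below = sum(Nj * A for Nj in N[:i])
    # cores with at least Ni accesses: at most Ni of their accesses interfere
    above = sum(Ni * A for _ in N[i + 1:])
    return below + above

N, A = [2, 5, 9], 1                 # hypothetical per-core access counts
print(wcd_rr_refined(1, N, A))      # 7, vs the pessimistic (3 - 1) * 1 * 5 = 10
```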


Interfering Memory Accesses Delay

¢ Graphically

[Figure: cores C1 … Ci … CNc on the x-axis, ordered by increasing access counts N1 … Ni … Nq; memory accesses on the y-axis. Interfering accesses are those below the Ni line (counted individually) and those above it (capped at Ni per core); to be compared with the pessimistic bound WCD_RR = (Nc − 1) · A · Ni]

WCD_RR(Ci) = ∑_{j=1..i−1} Nj·A + ∑_{j=i+1..Nc} Ni·A

WCET estimation in multi core architectures: shared RR bus

¢ Estimating the number of concurrent memory accesses
l Knowledge of concurrent code snippets
l May-Happen-in-Parallel relation (MHP)
l Classical graph algorithms (transitive closure) for graphs without cycles
l More complex for cyclic structures
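A minimal MHP sketch on an acyclic graph, under the simplifying assumption that two nodes may happen in parallel iff neither reaches the other through the ordering edges (node names and the example graph are illustrative, not the slides' example):

```python
# Sketch: MHP via transitive closure on an acyclic ordering graph.

def transitive_closure(succ, nodes):
    reach = {n: set(succ.get(n, ())) for n in nodes}
    changed = True
    while changed:                    # iterate until no new reachable node
        changed = False
        for n in nodes:
            new = set().union(*(reach[m] for m in reach[n]))
            if not new <= reach[n]:
                reach[n] |= new
                changed = True
    return reach

def mhp(a, b, reach):
    # parallel iff neither node is ordered before the other
    return a not in reach[b] and b not in reach[a]

nodes = ["Start", "A", "B", "D", "E", "End"]
succ = {"Start": ["A", "D"], "A": ["B"], "D": ["E"], "B": ["End"], "E": ["End"]}
reach = transitive_closure(succ, nodes)
print(mhp("A", "E", reach), mhp("A", "B", reach))   # True False
```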


WCET estimation in multi core architectures: shared RR bus

[Figure: example task graph with a Start/Fork node, basic blocks A to G, synchronization points S0 and S1, waits W0 and W1; edges denote control flow and synchronization]

102

WCET estimation in multi core architectures: shared RR bus

BB   MHP
A    {E, S1}
B    {E, S1}
S0   {E, S1}
W1   {W0, F, G}
C    {W0, F, G}
D    {W0, F, G}
E    {A, B, S0}
S1   {A, B, S0}
W0   {W1, C, D}
F    {W1, C, D}
G    {W1, C, D}

Transitive Closure

[Figure: the same task graph, with Start/Fork, blocks A to G, synchronization points S0/S1 and waits W0/W1; control-flow and synchronization edges]