Optimizing and Parallelizing Compilers
1
OPC: Optimizing and Parallelizing Compilers
Isabelle PUAUT
Université de Rennes I / IRISA Rennes (Pacap) 2017-2018
2
Performance enablers
- Micro-architecture and ISA
  - Pipelining, ILP, caching, superscalar out-of-order execution, VLIW, multicores, GPUs, etc. (ADA - A. Seznec)
- Compiler
  - Loop unrolling, splitting, fusion, inlining, etc.
- Optimize frequent, average-case scenarios
2
3
Performance predictability
- General-purpose systems
  - The "average" user wants stable performance:
    - Cannot tolerate variations of performance by orders of magnitude when varying simple parameters
- Safety-critical/real-time systems: need for guaranteed performance
  - Worst-case performance (WCET - Worst-Case Execution Times)
4
Outline
- Performance predictability
  - What is it? For which applications?
- WCET estimation methods
  - Static WCET estimation methods
  - Measurement-based and hybrid WCET estimation methods
  - Beyond WCET estimation: schedulability analysis
  - WCET estimation for parallel codes
- Programming and compiling for worst-case performance
3
Performance predictability
6
Performance: Influencing elements
- Sequencing of actions (execution paths)
  - Input-data dependent
  - A priori unknown input data
- Duration of every action on a given processor
  - Hardware dependent
  - Depends on past execution history
(Figure: control-flow graph whose blocks have per-block durations t1..t8)
4
7
Performance: Distribution of execution times
(Figure: probability distribution P(et = t) of execution times, from tmin to the worst case, with the mean marked)
8
Performance: Application needs
- Desktop applications:
  - As fast, and as stable, as possible
  - Typical input data (average case)
  - Performance evaluation: execution/simulation with typical input data
- Real-time applications:
  - Before a deadline (real-time is not real-fast)
    - Control applications: stability of a physical process
    - Delay until system failure
  - For all input data (including the worst case)
  - Performance evaluation: the subject of this lecture
5
9
Classes of real-time systems
- Hard real-time
  - Missing a deadline can cause catastrophic consequences in the system: need for a priori guarantees in the worst-case scenario
  - Ex: control in transportation systems, nuclear applications, etc.
- Soft real-time
  - Missing a deadline decreases the quality of service of the system but does not jeopardize its safety
  - Ex: multimedia applications (VoD, etc.)
10
Validation of real-time systems: What is needed to demonstrate that deadlines are met?
- System-level validation
  - Worst-case knowledge of system load (task arrivals, interrupt arrivals)
  - Ex: synchronous arrival of periodic tasks
- Task-level validation
  - Worst-case execution time of individual tasks
  - WCET (Worst-Case Execution Time)
    - Upper bound for executing an isolated piece of code
    - Code considered in isolation
    - WCET ≠ response time
6
11
Different uses of WCET
- Temporal validation
  - Schedulability analysis
  - Schedulability guarantees (worst case)
- System dimensioning
  - Hardware selection
- Optimization of application code
  - Early in the application design lifecycle
12
WCET: Definition
- Challenges in WCET estimation
  - Safety (WCET estimate >= any possible execution time): confidence in schedulability analysis methods
  - Tightness
    - Overestimation ⇒ schedulability test may fail, or too many resources might be used
- WCET ≠ WCET estimate
(Figure: distribution of execution times with the mean, the actual WCET and the WCET estimate marked)
7
Static timing estimation methods
- Overview
- Flow analysis
- WCET computation
- Hardware-level analysis
14
Static WCET estimation methods
- Principle
  - Analysis of program structure (no execution)
  - Computation of WCET from program structure
- Components
  - Flow analysis
    - Determines possible flows in the program
  - Low-level (hardware-level) analysis
    - Determines the execution time of a sequence of instructions (basic block) on a given hardware
  - Computation
    - Computation from the results of the other components
    - All paths need to be considered: safety
8
15
Overview of components
(Diagram: the source code is compiled to object code; flow analysis, possibly helped by annotations, produces a flow representation; low-level analysis and the computation step then yield the WCET)
16
WCET computation: Assumptions
- Simple architecture
  - Execution time of an instruction only depends on instruction type and operands
  - No overlap between instructions, no memory hierarchy
9
17
Tree-based computation
- Data structures
  - Syntax tree (control structures)
  - Basic blocks
- Principle
  - Determination of the execution time of basic blocks (low-level analysis)
  - Computation based on a bottom-up traversal of the syntax tree (timing schema)
(Figure: syntax tree - Seq1 = BB0; Loop [4]; BB6, where the loop test is BB1 and the loop body is Seq2 = If; BB5, with If = if (BB2) then BB3 else BB4)
18
Tree-based computation
(Figure: the same syntax tree, with the If subtree highlighted)

Timing schema:
- WCET(SEQ): S1;…;Sn → WCET(S1) + … + WCET(Sn)
- WCET(IF): if (test) then else → WCET(test) + max(WCET(then), WCET(else))
- WCET(LOOP): for (;tst;inc) {body} → maxiter * (WCET(tst) + WCET(body+inc)) + WCET(tst)

Applied to the example:
WCET(Seq1) = WCET_BB0 + WCET(Loop) + WCET_BB6
WCET(Loop) = 4 * (WCET_BB1 + WCET(Seq2)) + WCET_BB1
WCET(Seq2) = WCET(If) + WCET_BB5
WCET(If) = WCET_BB2 + max(WCET_BB3, WCET_BB4)
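The timing schema is easy to execute mechanically; below is a minimal sketch in Python, where the per-block WCETs of 10 cycles are hypothetical values and the tree mirrors the slide's example:

```python
# Hypothetical per-block WCETs; the tree mirrors the slide's example:
# Seq1 = BB0 ; Loop[4]{ test BB1 ; Seq2 = If(BB2){BB3|BB4} ; BB5 } ; BB6
BB = {i: 10 for i in range(7)}

def wcet_seq(*parts):
    return sum(parts)                          # SEQ: S1;…;Sn

def wcet_if(test, then_part, else_part):
    return test + max(then_part, else_part)    # IF: test + costlier branch

def wcet_loop(maxiter, tst, body):
    return maxiter * (tst + body) + tst        # LOOP: maxiter bodies + final test

w_if   = wcet_if(BB[2], BB[3], BB[4])
w_seq2 = wcet_seq(w_if, BB[5])
w_loop = wcet_loop(4, BB[1], w_seq2)
w_seq1 = wcet_seq(BB[0], w_loop, BB[6])
print(w_seq1)
```

The bottom-up order (If, then Seq2, then Loop, then Seq1) is exactly the post-order traversal of the syntax tree.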
10
19
Tree-based computation
- Advantages
  - Low computational effort
  - Good scalability with respect to program size
  - Good user feedback (source-code level)
- Drawbacks
  - Not compatible with aggressive compiler optimizations
  - Expression of complex flow facts difficult (inter-control-structure flow facts)
20
IPET (Implicit Path Enumeration Technique)
- Integer linear programming
  - Objective function: max: f1*t1 + f2*t2 + … + fn*tn
  - Structural constraints:
    - ∀v: f_v = Σ_{ai∈In(v)} ai = Σ_{ai∈Out(v)} ai (node frequency = sum of edge frequencies)
    - f1 = 1
  - Extra flow information:
    - fi ≤ k (loop bound)
    - fi + fj ≤ 1 (mutually exclusive paths)
    - fi ≤ 2*fj (relations between execution frequencies)
(Figure: control-flow graph with block timings t1..t7)
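As a toy illustration of the objective and constraints, the tiny instance below is solved by brute force rather than an ILP solver; the CFG, block timings, and loop bound are hypothetical (a loop header h, bound k = 4, whose body runs either block a or block b each iteration):

```python
from itertools import product

# Hypothetical IPET instance: loop header h (bound k = 4), body = block a or b.
t = {'h': 2, 'a': 5, 'b': 9}
k = 4

best = 0
for fa, fb in product(range(k + 1), repeat=2):
    if fa + fb > k:                    # loop bound: f_a + f_b <= k
        continue
    fh = fa + fb + 1                   # structural constraint: header runs once more (exit test)
    best = max(best, fh * t['h'] + fa * t['a'] + fb * t['b'])

print(best)
```

The maximizer puts all k iterations on the slower block b, which is exactly the implicit worst-case path that IPET finds without ever enumerating paths.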
11
21
IPET
- Advantages
  - Supports all unstructured flows (gotos, etc.)
  - Supports all compiler optimizations, including the most aggressive ones
- Drawbacks
  - More time-consuming than tree-based methods
  - Low-level user feedback (works at binary level)
  - Annotations are hard to provide (need to know compiler optimizations)
- Mostly used in commercial/academic tools
22
Low-level analysis: Introduction
- Simple architecture
  - Execution time of an instruction only depends on instruction type and operands
  - No overlap between instructions, no memory hierarchy
- Complex architecture
  - Local timing effects
    - Overlap between instructions (pipelines)
  - Global timing effects
    - Caches (data, instructions), branch predictors
    - Requires knowledge of the entire code
12
23
Low-level analysis: Introduction
- Outline
  - Pipelines, caches, branch predictors, … considered in isolation
  - Interferences between analyses
- Restrictions
  - Simple in-order pipelines first
24
Low-level analysis: Pipelining
- Principle: parallelism between instructions
  - Intra basic-block
  - Inter basic-block
(Figure: pipeline diagrams over time with stages Fetch, Decode, Execute, Memory, Write Back (IF ID EX ME WB))
13
25
Low-level analysis: Pipelining
- Intra basic-block: "simulation" of the flow of instructions through the pipeline stages
  - Takes into consideration data hazards (RAW, WAR, WAW)
  - Bypass mechanism (if any) has to be modeled
  - Obtained by the WCET analysis tool or an external tool (simulator, measurements on the processor)
(Figure: pipeline stages Fetch, Decode, Execute, Memory, Write Back over time; ti is the block's execution time)
26
Low-level analysis: Pipelining
- Inter basic-block: modification of the computation step
  - Example: IPET, extra constraints in the ILP problem (negative costs on edges)
  - Objective becomes: max: Σ fi*ti + Σ ai,j*di,j, with di,j ≤ 0
  - Example from the figure: ti = 7, tj = 7, pipeline overlap di,j = -4 between blocks i and j
(Figure: the two blocks' pipeline diagrams (IF ID EX ME WB) overlapping over time)
14
27
Low-level analysis: Out-of-order execution [Li & Mitra, 06]
Superscalar processor model
(Figure: pipeline with stages IF, ID, EX, WB, CM; an instruction cache feeds a fetch buffer, a ROB and register files feed the functional units ALU1, ALU2, MULTU, FPU, and LD/ST with a data cache)
In-order fetch and dispatch, out-of-order execution, in-order commit
28
Low-level analysis: Out-of-order execution: pipeline modeling
- Plain arrows: dependencies
  - Pipeline stages of the same instruction, buffer sizes, data dependencies among instructions, in-order fetch and ID
- Dashed arrows: contentions (shared resources)
15
29
Low-level analysis: Out-of-order execution: timing analysis
- Assumptions (simplified version)
  - Fixed-latency functional units
- Principle
  - Applied at every basic block
  - Scheduling problem: find the longest path in an acyclic graph with precedence and exclusion constraints
  - Algorithm (simplified)
    - Traverse the graph in topological order (finish times of all predecessors known)
    - Conservative estimation of contention time
    - See the paper for details!
30
Low-level analysis: Instruction caches
- Cache
  - Takes benefit of the temporal and spatial locality of references
  - Speculative: future behaviour depends on past behaviour
  - Good average-case performance, but predictability issues
16
Low-level analysis: Instruction caches
31
(Figure: n-way set-associative cache - an address splits into l bits selecting one of 2^l cache sets and k bits selecting one of the 2^k memory elements per cache line; each set holds n ways; lines are filled from main memory)
32
Low-level analysis: Instruction caches
- How to obtain safe and tight estimates?
  - Simulation along one path is not an option
    - Explores only one path, which may not be the longest one
  - Simple solution (all miss): overly pessimistic
    - And in some cases may be wrong (timing anomalies - see end of chapter)
- Objective
  - Predict if an instruction will (certainly) cause a hit or might (conservatively) cause a miss
17
33
Low-level analysis: Instruction caches
- Based on abstract interpretation [Ferdinand, 2004]
- Computation of Abstract Cache States (ACS)
  - Contains all possible cache contents considering "all" possible execution paths
  - Fixpoint computation
- Instruction categorization from the ACS
  - Always hit (AH) - guaranteed to be a hit
  - Always miss (AM) - guaranteed to be a miss
  - First miss (FM) - guaranteed to be a hit after the first reference
  - Not classified (NC) - don't know
34
Low-level analysis: Instruction caches
- Analyses
  - Must: ACS contains all program lines guaranteed to be in the cache at a given point
  - May: ACS contains all program lines that may be in the cache at a given point
  - Persistence: ACS contains all program lines not evicted after being loaded
- Modification of the ACS
  - Update: at every reference
  - Join: at every path-merging point
- Following examples: for LRU caches
18
35
Low-level analysis: Instruction caches - Must analysis
ACS contains all program lines guaranteed to be in the cache at a given point.
- Join: intersection + maximal age
  - Example: joining [a][b][c][d] with [b][e][d][a] (youngest age first) yields [{}][b][{}][d,a]
- Update: apply the replacement policy (referenced line gets age 0, others age by one)
  - Example: accessing e in [a][b][c][d] yields [e][a][b][c]
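The Must join and update rules can be sketched for a single fully associative LRU set; this is a minimal sketch where the 4-way associativity and the line names reproduce the slide's example:

```python
# Must analysis for one fully associative LRU set (associativity 4, as on the
# slide); an ACS is a list indexed by age: acs[0] holds the youngest lines.
ASSOC = 4

def must_update(acs, line):
    # The referenced line gets age 0; lines younger than its old age age by one.
    new = [set() for _ in range(ASSOC)]
    new[0] = {line}
    old_age = next((a for a, s in enumerate(acs) if line in s), ASSOC)
    for a, s in enumerate(acs):
        for other in s - {line}:
            if a < old_age:
                if a + 1 < ASSOC:
                    new[a + 1].add(other)   # aged by one; the oldest fall out
            else:
                new[a].add(other)           # already older: age unchanged
    return new

def must_join(x, y):
    # Intersection + maximal age: keep only lines guaranteed in BOTH states.
    def age(acs, line):
        return next((a for a, s in enumerate(acs) if line in s), None)
    new = [set() for _ in range(ASSOC)]
    for line in set().union(*x):
        ax, ay = age(x, line), age(y, line)
        if ay is not None:
            new[max(ax, ay)].add(line)
    return new

acs1 = [{'a'}, {'b'}, {'c'}, {'d'}]
acs2 = [{'b'}, {'e'}, {'d'}, {'a'}]
print(must_join(acs1, acs2))       # the slide's join example
print(must_update(acs1, 'e'))      # the slide's update example
```

Taking the maximal age on joins is what makes the result safe: a line is kept only if it is cached on every incoming path, at its most-evictable age.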
36
Low-level analysis: Instruction caches - May analysis
ACS contains all program lines that may be in the cache at a given point.
- Join: union + minimal age
  - Example: joining [a][b][c][d] with [b][e][d][a] (youngest age first) yields [a,b][e][c,d][{}]
- Update: apply the replacement policy (referenced line gets age 0, others age by one)
  - Example: accessing e in [a][b][c][d] yields [e][a][b][c]
19
37
Low-level analysis: Instruction caches
- From ACS to classification
  - If in ACS_Must: Always Hit
  - Else, if not in ACS_May: Always Miss
  - Else, if in ACS_Persistence: First Miss
  - Else: Not Classified
- From classification to WCET (IPET)
  - For every BB: t_first, t_next
  - Objective function: max Σ (f_first_i * t_first_i + f_next_i * t_next_i)
  - New constraints:
    - fi = f_first_i + f_next_i
    - Constraints based on the cache classification
38
Low-level analysis: Data caches
- Extra issue: determination of the addresses of data
- Means: abstract interpretation / dataflow analysis
  - Value analysis
  - Pointer analysis
- Results:
  - Superset of referenced addresses per instruction
- Example:
  - for (int i = 0; i < N; i++) tab[i+j/z] = x;
  - Any address inside tab may be referenced at each loop iteration (but only one actually is)
  - Hard-to-predict reference (input-dependent)
20
39
Low-level analysis: Data caches
- Solutions (with some limitations)
  - Compiler-controlled: don't cache input-dependent data structures
  - Assume every potentially referenced address is actually referenced (2*N misses in the example)
  - Cache Miss Equations (CME): for affine, data-independent loop indices
40
Low-level analysis: Data caches
- Tighter estimation method (pigeonhole principle)
  - Predictable vs unpredictable accesses (obtained using static analysis)
  - Replacement of other blocks
    - Cache miss counter Cm
    - Incremented only for blocks used after the current instruction (info obtained using static analysis)
  - Cache miss if the element is not in the cache
    - Comparison of execution count and array size
    - Persistence analysis of arrays
    - For persistent arrays fitting in the cache, number of misses <= array_range
21
41
Low-level analysis: Data caches
- Cache Miss Equations
  - Predictable vs unpredictable accesses (obtained using static analysis)
  - Replacement of other blocks
    - Cache miss counter Cm
    - Incremented only for blocks used after the current instruction (info obtained using static analysis)
  - Cache miss if the element is not in the cache
    - Comparison of execution count and array size
    - Persistence analysis of arrays
    - For persistent arrays fitting in the cache, number of misses <= array_range
42
Low-level analysis: Branch prediction
- Branch predictors
  - Local predictors [Colin, 00]
  - Global predictors [Mitra], [Rochange]
  - Most complex predictors out of reach!
22
43
The bad news… Timing anomalies (out-of-order execution)

Disp. | Cycle | Instruction
A     | 1     | LD r4, 0(r3)
B     | 2     | ADD r5, r4, r4
C     | 3     | ADD r11, r10, r10
D     | 4     | MUL r11, r11, r11
E     | 5     | MUL r13, r12, r12
(Figure: two schedules on the LSU, IU and MCIU units: when the load hits, the overall schedule takes 12 cycles; when it misses, only 11)
44
The bad news… Timing anomalies
- A cache miss is not necessarily the worst case
- Pipeline analysis and cache analysis are no longer independent :-(
  - When a miss/hit cannot be predicted, exploration of both outcomes: explosion of pipeline analysis time
  - [Li & Mitra 06]: more complex scheduling problem
23
45
The bad news… Timing anomalies
- An empty cache is not necessarily the worst case
- Example: PLRU (4-way)
(Figure: PLRU tree over lines a, b, c, d with three tree bits, all set to 0)
  - 0 (resp. 1): the left (resp. right) subtree is to be replaced next
  - Update/hit: the bits on the path are flipped
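The tree-PLRU policy on the slide can be sketched for one 4-way set; in this minimal sketch, an access sets the bits on its path to point away from the accessed way (which, for the freshly touched path, matches the slide's "flipped" description):

```python
# Tree-PLRU for one 4-way set: 3 bits form a binary tree; bit value 0 means
# the LEFT subtree holds the next victim (as on the slide).
class PLRU4:
    def __init__(self):
        self.bits = [0, 0, 0]          # [root, left node, right node]
        self.ways = [None] * 4         # cached lines, way indices 0..3

    def victim(self):
        # Follow the bits: 0 -> left, 1 -> right.
        go_left = self.bits[0] == 0
        node = 1 if go_left else 2
        base = 0 if go_left else 2
        return base + (0 if self.bits[node] == 0 else 1)

    def touch(self, way):
        # Point the bits on the accessed path AWAY from the touched way.
        self.bits[0] = 0 if way >= 2 else 1
        if way < 2:
            self.bits[1] = 0 if way == 1 else 1
        else:
            self.bits[2] = 0 if way == 3 else 1

    def access(self, line):
        if line in self.ways:              # hit: update the tree bits
            self.touch(self.ways.index(line))
            return True
        way = self.victim()                # miss: evict the PLRU victim
        self.ways[way] = line
        self.touch(way)
        return False
```

Unlike LRU, the tree bits only approximate recency, which is exactly what makes PLRU hard to analyse with age-based abstract cache states.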
46
The bad news… Timing anomalies (credits: AbsInt)
24
47
The bad news… Timing anomalies
- Empty-cache phenomenon and replacement policies: LRU and FIFO are OK
- Latest development: predictability metrics [Reineke, 2008]
  - Evict: lower bound on eviction time (after evict distinct accesses, guaranteed to be evicted)
    - Used in the May analysis
  - Fill: after fill distinct accesses, all possible cache contents are known
    - Used in the Must analysis
48
Flow analysis
(Figure: nested path sets - structurally feasible paths (infinite) ⊇ basic finiteness (bounded loops) ⊇ actually feasible paths (after removing infeasible and mutually exclusive paths))
25
49
Flow analysis
- Infeasible paths
- Path A-B-D-E-F is infeasible
- Identification of infeasible paths improves tightness

int baz(int x) {
  if (x < 5)        // A
    x = x + 1;      // B
  else
    x = x * 2;      // C
  if (x > 10)       // D
    x = sqrt(x);    // E
  return x;         // F
}
50
Flow analysis
- Maximum number of iterations of loops
- Tight estimation of loop bounds improves tightness
- On this specific example, polyhedral techniques help (see later)

for i := 1 to N do
  for j := 1 to i do
  begin
    if c1 then A.long else B.short;
    if c2 then C.short else D.long;
  end

Loop bound: N (outer), loop bound: N (inner)
N(N+1)/2 executions of the body
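The triangular count can be checked directly with a throwaway sketch (N = 10 is an arbitrary choice): the naive per-loop bounds would give N*N body executions, while the true count is N(N+1)/2.

```python
# The inner body of the nested loop runs sum(i for i = 1..N) = N*(N+1)/2 times,
# even though each loop taken alone is only bounded by N.
def body_executions(N):
    count = 0
    for i in range(1, N + 1):
        for j in range(1, i + 1):
            count += 1
    return count

print(body_executions(10))   # 55 == 10*11//2, versus the naive bound 100
```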
26
51
Flow analysis
- Determination of flow facts
  - Automatic (static analysis): infeasible in general (equivalent to the halting problem)
  - Manual: annotations
    - Loop bounds: constants, or symbolic annotations
    - Minimum required path information
    - Annotations for infeasible / mutually exclusive paths, relations between execution counts
52
Flow analysis
- Annotations in aiT
  - Simple example
    - asm("label:"); (in the source code)
    - ais2 { flow sum: point("label") == cnt; }
    - ais2 { flow sum: point("label") <= cnt; }
  - More generally, a language with documented syntax for
    - Bounds on recursion
    - Loop bound specification (min, max, exact)
    - Linear constraints
    - Maximum execution times for loops and code snippets, address ranges, etc.
  - All this at binary level
27
53
Flow analysis: Flow analysis using abstract interpretation [Gust06]
- Abstract domain: intervals of values
- Scopes: "repeated" blocks (loops/functions)
- Fixpoint calculation:
  - Update:
    - At every instruction (source/intermediate/binary)
    - Conditional constructs: two paths may be followed
  - Join:
    - Safe merging of intervals to avoid state explosion
    - Here: end of scope iteration
54
Flow analysis: Flow analysis using abstract interpretation
- Example:
i = INPUT;        // i = [1..4]
while (i < 10) {
  // point p
  …
  i = i + 2;
  …
}
// point q

Interval of i at point p:
iter | i at p
1    | [1..4]
2    | [3..6]
3    | [5..8]
4    | [7..9]
5    | [9..9]
6    | impossible
Min #iter: 3, Max #iter: 5
End of iter 3: [5..8] → i = i + 2 → [7..10] → 2 abstract states (one exiting the loop, one re-entering with i in [7..9])
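This interval iteration can be sketched in a few lines, assuming the slide's loop shape (i starts in [lo..hi]; while (i < limit) i += step):

```python
# Abstract execution of: while (i < limit) i += step, with i in [lo..hi].
# Returns the (min, max) number of loop iterations, as on the slide.
def loop_bounds(lo, hi, limit=10, step=2):
    iters = 0
    min_iter = None
    while True:
        if hi >= limit and min_iter is None:
            min_iter = iters                    # some concrete i can already exit
        lo_in, hi_in = lo, min(hi, limit - 1)   # narrow to states entering the loop
        if lo_in > hi_in:
            max_iter = iters                    # no state left: upper bound found
            break
        iters += 1
        lo, hi = lo_in + step, hi_in + step     # abstract effect of i = i + 2
    return (min_iter if min_iter is not None else max_iter), max_iter

print(loop_bounds(1, 4))
```

With the slide's initial interval [1..4] this reproduces the table exactly: the interval at p is narrowed each iteration, and the analysis stops at iteration 6, when no state can re-enter.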
28
55
Flow analysis: Flow analysis using abstract interpretation
- Loop bound detection
  - Loop counter:
    - Incremented at every scope iteration
    - Updated at every scope exit ((3, 4, 5) in the example)
- Infeasible nodes: node counter
  - NB: nodes with counters > 0 might be infeasible
- Upper bounds for node executions
- Infeasible paths
  - For the same scope iteration
  - Matrix of pairs
- IPET: new linear constraints
56
Flow analysis: Flow analysis using abstract interpretation
- Some numbers (analysis time and WCET, without and with flow facts; #FF = number of flow facts)

Pgm      | Time  | WCET_orig | Time   | WCET_ff | #FF
Crc      | 4.9   | 834159    | 6.65   | 833730  | 56
Inssort  | 0.16  | 31163     | 0.17   | 18167   | 7
Ns       | 6.09  | 130733    | 6.81   | 130733  | 8
Nsichneu | 36.88 | 119707    | 435.70 | 41303   | 65280
29
Measurement-based and hybrid methods
- Principles
- Safe measurement-based methods
- Less-safe measurement-based methods
58
WCET estimation methods: Dynamic methods
- Principle
  - Input data
  - Execution (hardware, simulator)
  - Timing measurement
- Generation of input data
  - User-defined: reserved to experts
  - Exhaustive
    - Risk of combinatorial explosion
30
59
WCET estimation methods: Safe dynamic method
- Principles
  - Structural testing
  - "All-paths" coverage criterion
  - Avoid enumerating all input-data combinations
- Assumptions
  - A1. One feasible path in the source code gives rise to one path in the binary code
  - A2. The execution time of a feasible path is the same for all input values which cause the execution of that path
  - A3. Existence of a worst-case machine state
60
WCET estimation methods: Safe dynamic method
- Incremental coverage of the input domain
(Figure: measured times t1, t2, t3 as successive tests cover more and more of the input domain)
31
61
WCET estimation methods: Safe dynamic method
(Diagram: the source code is instrumented and compiled; executing the instrumented object on injected input values yields an execution path, whose path predicate is obtained by substitution; the conjunction of the previous tests' path predicates is subtracted from the input domain, and constraint solving on the domain not yet covered produces the inputs for the next test)
62
WCET estimation methods: Safe dynamic method
- Advantages
  - No hardware model
  - All paths are covered
- Drawbacks
  - Assumption A2 restricts the architectures that can be used
  - Still an explosion of the number of paths to be explored for some programs
- Ongoing work: measuring fewer execution paths
32
63
WCET estimation methods: Less safe dynamic methods
- Use of genetic algorithms
- Data structures
  - Population, individuals, genes (sets of input data)
- Operators
  - Cross-over = mix two genes (sets of input data)
  - Mutation = change one gene (input data)
  - Selection = individuals with the largest execution time (obtained through measurements)
- Benefits
  - Measurement-based: no need for hardware models
  - No safety guarantee
  - Used to validate static analysis methods, or when safety is not mandatory
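The genetic search above can be sketched end to end. This is a toy sketch: `measured_time` is a hypothetical stand-in for actually running and timing the program (here, slowest at input 37), and all GA parameters are arbitrary:

```python
import random

def measured_time(x):
    # Hypothetical stand-in for running the program on input x and timing it;
    # the slowest input in this toy model is x == 37.
    return -(x - 37) ** 2

def genetic_search(generations=200, pop_size=20, seed=0):
    rng = random.Random(seed)                     # seeded: reproducible runs
    pop = [rng.randrange(1024) for _ in range(pop_size)]
    for _ in range(generations):
        # Selection: keep the individuals with the largest measured time.
        pop.sort(key=measured_time, reverse=True)
        pop = pop[:pop_size // 2]
        while len(pop) < pop_size:
            a, b = rng.sample(pop[:5], 2)
            child = (a + b) // 2                  # cross-over: mix two inputs
            if rng.random() < 0.3:
                child ^= 1 << rng.randrange(10)   # mutation: flip one input bit
            pop.append(child)
    return max(pop, key=measured_time)

print(genetic_search())
```

Note the caveat from the slide: even if the search converges, nothing guarantees the returned input triggers the true worst case.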
64
WCET estimation methods: Hybrid method
- High-level analysis and WCET combination
  - Static analysis
- Execution times of basic blocks
  - Measurements
  - Distributions of execution times
- WCET computation (tree-based)
  - Operators defined on distributions instead of single values
    - Independent basic blocks: convolution
    - Non-independent basic blocks: biased convolution
  - Result: probabilistic WCET
- Evaluation
  - Benefits: no need for a hardware model, path analysis is not probabilistic
  - Safety is not guaranteed regarding hardware effects
33
65
WCET estimation methods: Hybrid method: symbolic simulation
- Based on a cycle-accurate simulator
- Values and registers tagged with a "type" (known/unknown)
- Extension of the semantics of instructions
  - Memory transfer from an unknown register content: target address tagged as "unknown"
  - Conditional branch on an unknown condition: exploration of both branches
  - For every instruction …
- State merging to avoid exploring all paths
  - State = (memory/registers)
  - Merging = pessimistic union of the two states
  - Merging points: different granularities
- Benefits: flow analysis "for free"
- Drawbacks: hardware models, terribly slow
66
Open issues (low-level analysis)
- Complexity of static WCET estimation tools
- Timing anomalies, integration of sub-analyses
- Analysis tools may be released a long time after the hardware is available
- Trends:
  - Probabilistic low-level analysis
  - Software control of hardware (cache locking, compiler-directed branch prediction)
  - Multi-core architectures (still a long way to go…)
34
67
Static WCET estimation methods: A method for every usage
- Static WCET estimation
  - Safety :-)
  - Pessimism :-(
  - Need for a hardware model :-(
  - Trade-off between estimation time and tightness (tree-based / IPET) :-)
- Measurement-based methods
  - Safety? Probabilistic methods
  - Pessimism :-)
  - No need for hardware models :-)
68
WCET estimation tools
- Academic
  - Cinderella [Princeton]
  - Heptane [IRISA, Rennes]
  - Otawa [IRIT, Toulouse]
  - SWEET [Mälardalen, Sweden]
  - Chronos [Singapore]
- Industrial
  - Bound-T [SSF]
  - aiT [AbsInt, Saarbrücken]
  - RapiTime [Rapita Systems, York]
35
69
Some pointers
- Bibliography
  - Survey paper in TECS, 2008
  - Workshop on Worst-Case Execution Time analysis (2001..2013), in conjunction with ECRTS
- Working group
  - Timing Analysis on Code-Level (TACLe), http://www.tacle.eu
Beyond predictable performance
Schedulability analysis
36
71
Temporal validation of real-time systems
- Testing
  - Input data + execution (target architecture, simulator)
  - Necessary but not sufficient (coverage of scenarios)
- Schedulability analysis (models)
  - Hard real-time: need for guarantees in all execution scenarios, including the worst-case scenario
  - Task models
  - Schedulability analysis methods (1970s to today)
72
Schedulability analysis: Introduction (1/2)
- Definition
  - Algorithms or mathematical formulas allowing to prove that deadlines will be met
- Classification
  - Off-line validation
    - Target = hard real-time systems
  - On-line validation
    - Acceptance tests executed at the arrival of new tasks
    - Some tasks may be rejected → target = soft real-time systems
37
73
Schedulability analysis: Introduction (2/2)
- Input: system model
  - Task model
    - Arrival: periodic, sporadic, aperiodic
    - Inter-task synchronization: precedence constraints, resource sharing
    - Worst-case execution time (WCET)
  - Architecture
  - Known off-line for hard real-time systems
- Output
  - Schedulability verdict
74
Schedulability analysis: Example (1/2)
- System model
  - Periodic tasks (periods Pi), deadlines Di <= Pi
  - Fixed priorities (the smaller Di, the higher the priority)
  - Worst-case execution times: Ci
- Necessary condition: U = Σ_{i=1..n} Ci/Pi ≤ 1
- Sufficient condition: Σ_{i=1..n} Ci/Di ≤ n(2^{1/n} - 1)
- Low complexity
38
75
Schedulability analysis: Example (2/2)
- Same task model as before
- Estimation of the response time: limit of the series
  - w_i^0 = C_i
  - w_i^{k+1} = C_i + Σ_{j∈hp(i)} ⌈w_i^k / P_j⌉ * C_j
- The series limit is the task response time (when it converges)
- The system is schedulable when Ri ≤ Di
- More complex schedulability test
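Both tests from these two slides can be sketched in a few lines; the task set of (C, P, D) triples below is a hypothetical example, listed highest priority first:

```python
from math import ceil

def utilization_ok(tasks):
    # Sufficient condition: sum(Ci/Di) <= n * (2^(1/n) - 1); tasks = [(C, P, D)].
    n = len(tasks)
    return sum(C / D for C, P, D in tasks) <= n * (2 ** (1 / n) - 1)

def response_time(tasks, i):
    # Fixed-point iteration of w_i^{k+1} = Ci + sum_{j in hp(i)} ceil(w/Pj)*Cj,
    # with tasks sorted by decreasing priority so that hp(i) = tasks[:i].
    C, P, D = tasks[i]
    w = C                                  # w_i^0 = C_i
    while True:
        nxt = C + sum(ceil(w / Pj) * Cj for Cj, Pj, _ in tasks[:i])
        if nxt == w:
            return w                       # converged: Ri = w
        if nxt > D:
            return None                    # diverged past the deadline
        w = nxt

tasks = [(1, 4, 4), (2, 6, 6), (3, 12, 12)]   # hypothetical (C, P, D) triples
print(utilization_ok(tasks))
print([response_time(tasks, i) for i in range(3)])
```

On this set the utilization bound fails even though the response-time analysis proves all deadlines are met, illustrating that the bound is only a sufficient condition while the (more complex) recurrence is exact for this model.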
76
Schedulability analysis: Networked systems
- Response-time analysis on every processor
- Response-time analysis of network transmissions (e.g. CAN: fixed-priority network)
- End-to-end response-time analysis
  - Holistic approaches
39
WCET estimation for parallel codes
- Issues
- Preliminary studies
78
Hardware architectures: Examples of multi-core systems
- Shared memory + bus + local cache/SPM
- Shared memory + bus + shared last-level cache
- Clustered architectures (Kalray) with NoC
(Figure: cores connected through NoC routers and a RAM controller; the memory controller is a shared resource with memory-access scheduling, the NoC a shared resource with its routing policy)
40
79
Hardware architectures: Issues with multi-core systems
- Shared hardware resources (to reduce costs)
  - Bus
  - NoC
  - Last-level cache, memory controller
- Introduce dependencies between cores
  - Interference delays
  - Depend on
    - Arbitration policy
    - Contending activity on the other cores
- Remark: in some safety-critical systems, all cores but one are disabled
80
Hardware architectures: Software executing on multi-cores
- Independent tasks
  - Have to consider worst-case contention
- Dependent tasks
  - Dependencies between tasks
  - Examples: Directed Acyclic Graphs (DAGs), Message Sequence Charts (MSCs)
(Figure: task graph with blocks A..G, signal/wait pairs S0/W0 and S1/W1, control-flow and synchronization edges, and a start/fork node)
41
81
Shared Last Level Cache (LLC)
- Issues
- A technique to identify the worst-case number of misses in the LLC
  - Analysis of cache hierarchies
  - Analysis of the shared LLC
- Using synchronizations to tighten
  - Avoid pressure in the shared cache
  - Account for synchronizations
Shared Last Level Cache
- Issues arising from cache sharing
(Figure: assumed architecture - the analysed task on core 1 accesses [a], [b], [a], [b] through the shared cache, while a rival task on core 2 accesses [X])
Is the second access to [a] a hit in the cache? Not necessarily: it depends on when the rival task accesses the shared cache.
42
Analysis of cache hierarchies
- Hierarchical analysis of caches
- Compute access occurrences at cache level L
  - Accessing level L is not necessarily the worst case!
- In the following examples:
  - Non-inclusive caches
  - LRU replacement policy
  - Set-associative caches
(Diagram: references flow from the processor through the level L-1 cache analysis into the level L cache analysis and on to main memory, each level producing a hit/miss classification)
Analysis of cache hierarchies [RTSS 08]
- Cache access classification at level L
  - Never: the access never occurs at level L
  - Always: the access always occurs at level L
  - Uncertain: unknown
- Modification of the cache analysis
  - Update function
    - Never: not considered
    - Always: considered
    - Uncertain: local exploration: Join(Never, Always)
43
Analysis of shared caches
- Estimate conflicts from rival tasks
- Integrate conflicts into the cache classification (Hit/Miss)
(Diagram: the analysed task's references go through the level L cache analysis to produce an access classification; for each rival task, a conflict-estimation pass over its own level L analysis produces a conflict count that feeds the shared level L cache analysis)
Analysis of shared caches: conflict estimation
- Any task, at any time, may alter shared caches
  - Computing all interleavings: too costly in practice
(Figure: the analysed task's accesses [a], [a], [b] may interleave with a rival task's accesses [w,x,y], [d,e], [b]; a loop bound of 10 appears in the rival task)
44
Analysis of shared caches: conflict estimation
- Any task, at any time, may alter shared caches
  - Computing all interleavings: too costly in practice
- To reduce this cost, we abstract from a task:
  - Access ordering
  - Occurrence count
(Figure: the same access sequences as on the previous slide)
Analysis of shared caches: conflict estimation
- Any task, at any time, may alter shared caches
  - Computing all interleavings: too costly in practice
- To reduce this cost, we abstract from a task:
  - Access ordering
  - Occurrence count
(Figure: the rival task's accesses abstracted to the set of blocks {a, b, w, x, y, d, e})
45
Analysis of shared caches: conflict estimation
- Any task, at any time, may alter shared caches
  - Computing all interleavings: too costly in practice
- To reduce this cost, we abstract from a task:
  - Access ordering
  - Occurrence count
- For each cache set
  - Find the memory blocks used by the task
  - Count the total number of blocks (CCN)
Analysis of shared caches: conflict integration
- A cache analysis is performed using modified cache states:
  - CCN cache blocks may have been allocated to rival tasks in the shared cache
  - Only (cache level L associativity - CCN) ways are for sure available
(Figure: base cache configuration vs available cache space during analysis; ways occupied by conflicting blocks are excluded - CCN0 = 2 for set 0, CCN1 = 7 for set 1)
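The way-counting rule can be sketched as follows; the 8-way associativity is a hypothetical value, and a CCN at or above the associativity leaves no guaranteed ways:

```python
# Conflict integration: per cache set, rival tasks may occupy up to CCN ways,
# so only (associativity - CCN) ways are guaranteed to the analysed task.
ASSOC = 8   # hypothetical associativity of the shared cache level

def available_ways(ccn):
    return max(ASSOC - ccn, 0)   # pessimistic: every rival block costs a way

print([available_ways(c) for c in (2, 7, 12)])   # [6, 1, 0]
```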
46
Analysis of shared caches
- Conflict estimation is pessimistic:
  - Many blocks may clutter up the cache
  - Every block a task may store at a shared cache level is included
- Little cache space detected as available during shared cache analyses
→ Need to control the impact of tasks on shared caches:
  - Bypass of shared cache blocks [RTSS 09, IRISA]
  - Accounting for synchronization [RTSS 09, NUS]
Analysis of shared caches: bypass
- If an instruction bypasses a cache level, it does not alter this cache level
(Figure: accesses [a], [c], [c], [a], [b], [c] to the considered cache; with bypass, some of them skip the cache entirely)
47
Analysis of shared caches: bypass
- Reduces tasks' impact on caches
  - A bypassing access produces no conflict
- Selection of accesses to bypass
  - Static analysis
  - Accesses not statically detected as reused (single usage)
  - Never worsens tasks' WCET :-)
- Extended to data caches in [RTNS 10]
Analysis of shared caches: accounting for synchronization
- Synchronisation points define zones of conflicting accesses
(Figure: tasks 1 and 2 - without synchronisation, all accesses compete for shared cache space; with a rendezvous, only accesses within the same zone compete)
48
WCET estimation in multi-core architectures: shared busses
- Issues
- Estimation of contention delays for round-robin arbitration
  - Pessimistic contention analysis for RR
  - Accounting for synchronizations
WCET estimation in multi-core architectures: shared busses
- Issues
- Delay to serve request r1
  - Depends on pending requests from the other cores
  - Depends on the bus arbitration policy
(Figure: cores issuing request r1 on a shared bus to memory)
49
WCET estimation in multi-core architectures: shared RR bus
- Round-robin arbitration
  - A time units per core (say, 1 memory access), round-robin ordering C1, C2, C3, …
  - If a core has no pending request, serve the request of the next core in round-robin order
- Worst-case delay:
  - For one request from core i: WCD_RR = (Nc - 1) * A, with Nc the total number of cores
  - For Ni requests from core i: WCD_RR = (Nc - 1) * A * Ni
WCET estimation in multi-core architectures: shared RR bus
- Remarks
  - Assumes all concurrent cores always have a pending request, which is a safe over-approximation but introduces pessimism
  - Let Nj denote the maximum number of memory accesses performed by a concurrent core Cj
  - For Ni requests from core i, with cores ordered by increasing Nj:
    WCD_RR(Ci) = Σ_{j=1..i-1} Nj * A + Σ_{j=i+1..Nc} Ni * A
    - Cores with fewer than Ni accesses: each of their accesses delays core i by A
    - Cores with more than Ni accesses: delay of Ni * A
50
Interfering Memory Accesses Delay
- Graphically: cores C1..CNc with their access counts N1..NNc; for core Ci, the interference contributed by each rival core is capped at Ni accesses
(Figure: bar chart of memory access counts per core, with the interfering accesses shaded)
- Simple bound: WCD_RR = (Nc - 1) * A * Ni
- Tighter bound: WCD_RR(Ci) = Σ_{j=1..i-1} Nj * A + Σ_{j=i+1..Nc} Ni * A
WCET estimation in multi-core architectures: shared RR bus
- Estimating the number of concurrent memory accesses
  - Knowledge of concurrent code snippets
  - May-Happen-in-Parallel relation (MHP)
  - Classical graph algorithms (transitive closure) for graphs without cycles
  - More complex for cyclic structures
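For acyclic graphs, the MHP relation can be sketched via transitive closure; the small fork/join graph below is a simplified, hypothetical stand-in for the slides' example (two threads A→B and C→D forked from S and joined at W):

```python
from itertools import product

def mhp(nodes, succ):
    # Happens-after relation via naive transitive closure of the successor map.
    after = {n: set(succ.get(n, ())) for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            extra = set().union(*(after[m] for m in after[n])) - after[n]
            if extra:
                after[n] |= extra
                changed = True
    # Two distinct blocks may happen in parallel iff neither precedes the other.
    return {(a, b) for a, b in product(nodes, nodes)
            if a != b and b not in after[a] and a not in after[b]}

nodes = ['S', 'A', 'B', 'C', 'D', 'W']
succ = {'S': {'A', 'C'}, 'A': {'B'}, 'B': {'W'}, 'C': {'D'}, 'D': {'W'}}
pairs = mhp(nodes, succ)
print(sorted(pairs))
```

Only blocks on different branches of the fork end up in the relation, which is what bounds the concurrent accesses each rival core can contribute.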
51
WCET estimation in multi-core architectures: shared RR bus
(Figure: task graph with blocks A..G, signal/wait pairs S0/W0 and S1/W1, control-flow and synchronization edges, and a start/fork node)
102
WCET estimation in multi-core architectures: shared RR bus
Transitive closure of the graph on the previous slide:

BB | MHP
A  | {E, S1}
B  | {E, S1}
S0 | {E, S1}
W1 | {W0, F, G}
C  | {W0, F, G}
D  | {W0, F, G}
E  | {A, B, S0}
S1 | {A, B, S0}
W0 | {W1, C, D}
F  | {W1, C, D}
G  | {W1, C, D}