Modeling shared cache and bus in multi-core platforms for timing analysis
Transcript of Modeling shared cache and bus in multi-core platforms for timing analysis
![Page 1: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/1.jpg)
Modeling shared cache and bus in multi-core platforms for timing analysis
Sudipta ChattopadhyayAbhik RoychoudhuryTulika Mitra
![Page 2: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/2.jpg)
Timing analysis (basics)
- Hard real-time systems need to meet certain deadlines.
- Two levels of analysis: system-level (schedulability) analysis, and single-task analysis (Worst-Case Execution Time analysis).
- WCET: an upper bound on the execution time of a program, over all possible inputs, for a given hardware platform; usually obtained by static analysis.
- Uses of WCET: schedulability analysis of hard real-time systems; worst-case-oriented optimization.
![Page 3: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/3.jpg)
WCET and BCET

Figure: execution-time distribution showing, from left to right, Estimated BCET, Actual BCET, Observed BCET, Observed WCET, Actual WCET, and Estimated WCET; the gaps between actual and estimated values are the over-estimation. WCET = Worst-Case Execution Time; BCET = Best-Case Execution Time.
![Page 4: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/4.jpg)
Timing analysis for multi-cores
- Modeling shared cache and shared bus: the most common forms of resource sharing in multi-cores.
- Difficulties: conflicts in the shared cache arising from other cores; contention on the shared bus introduced by other cores; interaction between the shared cache and the shared bus.
![Page 5: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/5.jpg)
Commercial multi-core

Figure: Intel Core 2 Duo — two processors, each containing cores (Core 0 … Core N) with private L1 caches and a shared L2 cache, connected through crossbars to a shared off-chip bus and off-chip memory. Both a shared cache and a shared bus are present.
![Page 6: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/6.jpg)
Modeled architecture

Shared cache is accessed through a shared bus.

Figure: Architecture A — cores (Core 0 … Core N) with private L1 caches, a shared bus, and a shared L2. Architecture B — cores with private L1 and L2 caches, a shared bus, and a shared L3.
![Page 7: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/7.jpg)
Assumptions
- Perfect data cache; currently we model only the shared instruction cache.
- The shared bus is TDMA (Time Division Multiple Access), with TDMA slots assigned in round-robin fashion; TDMA is chosen for predictability.
- Separate instruction and data buses; bus traffic arising from data memory accesses is ignored.
- No self-modifying code, so cache coherence need not be modeled.
- Non-preemptive scheduling.
![Page 8: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/8.jpg)
Overview of the framework

Figure: per-core L1 cache analysis feeds L2 cache analysis and cache access classification; an initial interference estimate drives L2 conflict analysis, followed by bus-aware analysis and WCRT computation. If the interference changes, the analysis repeats; otherwise the estimated WCRT is output. This is an iterative fix-point analysis, and its termination is guaranteed.
![Page 9: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/9.jpg)
Framework components

Figure: the framework flowchart, with the per-core L1 cache analysis stage highlighted.
![Page 10: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/10.jpg)
L1 cache analysis (Ferdinand et al., RTS '97)

Abstract cache sets map memory blocks to ages, from low (youngest) to high (oldest); blocks aged beyond the associativity are evicted. Three join functions combine abstract cache states at control-flow merge points:
- Must join: intersection of blocks, maximum age — finds All Hit (AH) cache blocks.
- May join: union of blocks, minimum age — finds All Miss (AM) cache blocks.
- Persistence join: union of blocks, maximum age — finds Persistent (PS), i.e., never-evicted, cache blocks.
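The must and may joins can be sketched as small set operations. This is an illustrative sketch only: the dict-of-ages representation, the names, and the associativity value are assumptions, not the analysis's actual code.

```python
ASSOC = 4  # cache associativity (assumed for illustration)

def must_join(a, b):
    """Must join: intersection of blocks, keeping the maximum (oldest) age."""
    out = {}
    for blk in a.keys() & b.keys():
        age = max(a[blk], b[blk])
        if age < ASSOC:          # still guaranteed to be cached
            out[blk] = age
    return out

def may_join(a, b):
    """May join: union of blocks, keeping the minimum (youngest) age."""
    out = {}
    for blk in a.keys() | b.keys():
        out[blk] = min(a.get(blk, ASSOC), b.get(blk, ASSOC))
    return {blk: age for blk, age in out.items() if age < ASSOC}

# A block present in the must state after the join is All Hit (AH).
s1 = {"a": 0, "b": 2}
s2 = {"a": 1, "c": 0}
print(must_join(s1, s2))  # only "a" survives, at its older age
print(may_join(s1, s2))   # all three blocks, at their younger ages
```

Blocks absent from the may state after all joins can never be in the cache, which is how the may analysis yields All Miss (AM) classifications.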
![Page 11: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/11.jpg)
Framework componentsL1 cache analysis
L1 cache analysis
L2 conflict analysisInitial interference
Bus awareanalysis
WCRT computation
Interference changes ?
Yes
Estimated WCRT
Cache accessclassification
L2 cache analysis
Cache accessclassification
L2 cache analysis1
![Page 12: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/12.jpg)
Per-core L2 cache analysis (Puaut et al., RTSS 2008)

Each memory reference is filtered through its L1 classification before the L2 analysis:
- All Hit in L1 → Never accessed (N) at L2: the L2 abstract cache state passes through unchanged (ACSout = ACSin).
- All Miss in L1 → Always accessed (A) at L2: the access updates the L2 abstract cache state (ACSout = U(ACSin)).
- Persistence or NC in L1 → Unknown (U) at L2: both cases are possible, so ACSout is the join of ACSin and U(ACSin).
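The L1-to-L2 filter can be written down directly. This is a hedged sketch: the classification names follow the slide (AH/AM/PS/NC at L1; N/A/U at L2), but the function itself is illustrative, not the paper's implementation.

```python
def l2_filter(l1_class):
    """Map an L1 access classification to an L2 access classification."""
    if l1_class == "AH":
        return "N"  # All Hit in L1: the reference Never reaches L2
    if l1_class == "AM":
        return "A"  # All Miss in L1: the reference Always reaches L2
    return "U"      # PS or NC in L1: Unknown whether L2 is accessed

print([l2_filter(c) for c in ("AH", "AM", "PS", "NC")])  # ['N', 'A', 'U', 'U']
```

Only A and U accesses update the L2 abstract cache state; for U the analysis joins the updated and unchanged states, mirroring the join in the figure.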
![Page 13: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/13.jpg)
Framework components

Figure: the framework flowchart, with the L2 conflict analysis stage highlighted.
![Page 14: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/14.jpg)
Shared cache conflict analysis
- Our past work (RTSS 2009): exploit task lifetimes to refine shared cache analysis.
- Task interference graph: an edge connects two task nodes if their lifetimes overlap.
- Each cache set C is analyzed individually.
![Page 15: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/15.jpg)
Task interference graph

Figure: a timeline of tasks T1, T2, and T3, and the resulting task interference graph (nodes T1, T2, T3, with edges between tasks whose lifetimes overlap).
![Page 16: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/16.jpg)
Cache conflict analysis

Figure: cache set C with associativity 4. Tasks T1, T2, and T3 map blocks m1, m2, and m3 to set C, with per-task conflict counts M(C) = 1, 2, and 1 in the example. After conflict analysis, the ages of m1, m2, and m3 shift, but because the total number of conflicting blocks still fits within the associativity, all memory blocks remain all hits (m1: AH→AH, m2: AH→AH, m3: AH→AH).
![Page 17: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/17.jpg)
Cache conflict analysis (continued)

Figure: the same setting, but T1 now maps two blocks (m0 and m1) to cache set C, raising the conflict count seen by T2 to M(C) = 3. With associativity 4, m2 may now be replaced from the cache due to conflicts from other cores, so its classification changes: m1: AH→AH, m2: AH→NC, m3: AH→AH.
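The effect of cross-core conflicts on a block's classification can be sketched as an age shift. This is an assumption-labeled simplification of the RTSS 2009 analysis, not its full algorithm: the block's must-age in set C is shifted by M(C), the number of conflicting blocks that tasks with overlapping lifetimes on other cores map to C.

```python
ASSOC = 4  # L2 associativity in the slide's example

def classify(age, conflicts):
    """AH survives only if the shifted age still fits in the associativity."""
    return "AH" if age + conflicts < ASSOC else "NC"

print(classify(1, 2))  # stays AH: 1 + 2 < 4 (first example)
print(classify(1, 3))  # becomes NC: 1 + 3 >= 4, may be evicted (second example)
```

This matches the two slides above: with M(C) = 2 from conflicting tasks the blocks remain all hits, while M(C) = 3 pushes m2 past the associativity and demotes it to NC.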
![Page 18: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/18.jpg)
Framework components

Figure: the framework flowchart, with the bus-aware analysis stage highlighted.
![Page 19: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/19.jpg)
Example: variable bus delay

Parameters: bus slot = 50 cycles, L2 hit = 10 cycles, L2 miss = 20 cycles.

Figure: code executing on Core 0 — basic blocks C1 = 20, C2 = 10, C3 = 20, C4 = 30 (left branch) / 20 (right branch), C5 = 10 on the common path, with memory access M1 = 10 (L2 hit) on the left branch and M2 = 20 (L2 miss) on the right branch. The timeline alternates Core 0 and Core 1 bus slots at t = 0, 50, 100, 150. In the first iteration there is no bus delay.
![Page 20: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/20.jpg)
Example: variable bus delay (continued)

Same parameters and code as before (bus slot = 50 cycles, L2 hit = 10 cycles, L2 miss = 20 cycles).

Figure: in the second iteration, M1 falls outside Core 0's bus slot and suffers a 20-cycle bus delay before it can be serviced in the next Core 0 slot.

Conclusion: the WCET of different iterations of the same loop can differ.
![Page 21: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/21.jpg)
Possible solutions

Source of the problem: each iteration of a loop may start at a different offset relative to its bus slot.

Possible solutions:
- Virtually unroll all loop iterations — too expensive.
- Do not model the bus, or take the maximum possible bus delay — imprecise results.

Our solution: assume each loop iteration starts at the same offset relative to its bus slot, and add the necessary alignment cost.
![Page 22: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/22.jpg)
Key observation

A round-robin TDMA schedule follows a repeating pattern. Figure: the bus schedule alternates Core 0 and Core 1 slots along the timeline; whenever a task T starts at the same offset Δ within a Core 0 slot, T must follow the same execution pattern.
![Page 23: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/23.jpg)
Revisiting the example

Same parameters: bus slot = 50 cycles, L2 hit = 10 cycles, L2 miss = 20 cycles; the same branching code (left branch, common path, right branch) executes on Core 0.

Figure: the second iteration is aligned to the start of the next Core 0 bus slot. Alignment cost = 20 cycles; with this alignment, all iterations follow the same execution pattern, so the WCET of one iteration is at most 100 cycles, and there is no need to virtually unroll the loop.
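The delay a core sees before its next TDMA slot, which underlies the alignment cost above, can be sketched as follows. This is a hedged sketch: it assumes round-robin slots of equal length, one per core, and omits the detail that a transfer must also fit in the remaining slot time.

```python
def bus_delay(t, core, n_cores, slot=50):
    """Cycles from time t until `core` next owns the TDMA bus."""
    period = n_cores * slot
    offset = (t - core * slot) % period  # position within the TDMA round
    if offset < slot:
        return 0                 # currently inside this core's own slot
    return period - offset       # wait for the slot to come around again

# With the slide's parameters (2 cores, 50-cycle slots):
print(bus_delay(0, 0, 2))   # 0: Core 0 owns [0, 50)
print(bus_delay(60, 0, 2))  # 40: Core 0 must wait until t = 100
print(bus_delay(60, 1, 2))  # 0: Core 1 owns [50, 100)
```

Aligning each loop iteration to the same offset means charging every iteration the same worst-case delay once, which is exactly the alignment cost in the example.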
![Page 24: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/24.jpg)
Partial unrolling

The alignment cost is high when the loop body is small compared with the bus slot length. Such loops are partially unrolled until one bus slot is filled.

Figure: code executing on Core 0 with a small loop body (C1 = 10, M2 = 10 as an L2 hit, C2 = 10). Without unrolling, each iteration (Iter1, Iter2, Iter3, …) pays its own alignment cost; with partial unrolling, several iterations are packed into a single Core 0 bus slot.
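One way to pick the unroll factor is to take enough iterations to fill a slot. The formula below is an assumption consistent with "unroll until one bus slot is filled", not a rule stated on the slide.

```python
import math

def unroll_factor(iter_wcet, slot=50):
    """Number of iterations needed so the unrolled body fills one bus slot."""
    return max(1, math.ceil(slot / iter_wcet))

print(unroll_factor(30))   # 2: two 30-cycle iterations fill a 50-cycle slot
print(unroll_factor(120))  # 1: the body already exceeds a slot, no unrolling
```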
![Page 25: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/25.jpg)
Extension to full programs

The analysis is applied hierarchically: the WCET of an inner loop is computed first and then used when computing the WCET of the enclosing outer loop.
![Page 26: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/26.jpg)
Framework components

Figure: the framework flowchart, with the bus-aware WCET/BCET computation and WCRT computation stages highlighted.
![Page 27: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/27.jpg)
WCRT

Figure: a task graph t1 → {t2, t3} → t4, with assigned cores t1: core 1, t2: core 2, t3: core 2, t4: core 1; t2 and t3 are peers (tasks on the same core with overlapping lifetimes).

Task lifetime: [EarliestReady, LatestFinish].

Earliest time computation:
EarliestReady(t1) = 0
EarliestReady(t4) >= EarliestFinish(t2)
EarliestReady(t4) >= EarliestFinish(t3)
EarliestFinish = EarliestReady + BCET

Latest time computation:
LatestReady(t4) >= LatestFinish(t2)
LatestReady(t4) >= LatestFinish(t3)
t2 has peers: LatestFinish(t2) = LatestReady(t2) + WCET(t2) + WCET(t3)
t4 has no peers: LatestFinish(t4) = LatestReady(t4) + WCET(t4)

Computed WCRT = LatestFinish(t4)
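The earliest/latest-time recurrences can be sketched over a task graph as below. This is a hedged sketch: the core assignments follow the slide, but the BCET/WCET numbers and the peer rule used here (same-core tasks with no dependency ordering between them) are assumptions, and the real framework iterates this computation with the cache and bus analyses.

```python
tasks = {  # name: (core, BCET, WCET, predecessors); topological order assumed
    "t1": (1, 80, 90, []),
    "t2": (2, 8, 10, ["t1"]),
    "t3": (2, 15, 20, ["t1"]),
    "t4": (1, 25, 30, ["t2", "t3"]),
}

def wcrt(tasks):
    # Transitive predecessors, so peers exclude dependency-ordered tasks.
    anc = {}
    for t, (_, _, _, preds) in tasks.items():
        anc[t] = set(preds).union(*(anc[p] for p in preds)) if preds else set()
    earliest_fin, latest_fin = {}, {}
    for t, (core, bcet, wcet, preds) in tasks.items():
        earliest_ready = max((earliest_fin[p] for p in preds), default=0)
        latest_ready = max((latest_fin[p] for p in preds), default=0)
        peers = [u for u in tasks
                 if u != t and tasks[u][0] == core
                 and u not in anc[t] and t not in anc[u]]
        earliest_fin[t] = earliest_ready + bcet
        # In the worst case a task waits for all its peers to execute.
        latest_fin[t] = latest_ready + wcet + sum(tasks[u][2] for u in peers)
    return max(latest_fin.values())

print(wcrt(tasks))  # computed WCRT = LatestFinish(t4) = 150
```

With these (assumed) numbers, t2 and t3 each charge the other's WCET since they are peers on core 2, while t4 has no peers and finishes at LatestReady(t4) + WCET(t4).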
![Page 28: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/28.jpg)
An example

Parameters: L2 hit = 10 cycles, L2 miss = 20 cycles, bus slot = 50 cycles. Assume M2.2 and M3.2 conflict in the L2, so both are classified as L2 misses; M4.2 is an L2 hit.

Figure: schedules for Core 0 (T1.1 = 90, T4.1 = 20), Core 1 (T2.1 = 10, T2.2 = 20, T3.1 = 20, T3.2 = 10, T4.2 = 10), and the bus (alternating Core 0 and Core 1 slots) with M2.2 = 20, M3.2 = 20, M4.2 = 10 and the resulting bus waits. Estimated WCRT: 170 cycles. However, T2 and T3 have disjoint lifetimes, so M2.2 and M3.2 cannot actually conflict: both are L2 hits.
![Page 29: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/29.jpg)
Example (continued)

Figure: the bus schedule recomputed with M2.2 and M3.2 classified as L2 hits (M2.2 = 10, M3.2 = 10, M4.2 = 10). The second bus wait for Core 1 is eliminated, and the estimated WCRT drops from 170 to 130 cycles.
![Page 30: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/30.jpg)
Experimental evaluation
- Tasks are compiled into SimpleScalar PISA-compliant binaries.
- CMP_SIM is used for simulation; it is extended with shared bus modeling and with support for PISA-compliant binaries.
- Two setups: independent tasks running on different cores, and task dependencies specified through a task graph.
![Page 31: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/31.jpg)
Overestimation ratio (2-core)

One core runs statemate; the other core runs the program under evaluation.

Configuration: L1 cache: direct-mapped, 1 KB; L2 cache: 4-way, 2 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles.

Average overestimation = 40%
![Page 32: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/32.jpg)
Overestimation ratio (4-core)

The four cores run either (edn, adpcm, compress, statemate) or (matmult, fir, jfdcint, statemate).

Configuration: L1 cache: direct-mapped, 1 KB; L2 cache: 4-way, 2 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles.

Average overestimation = 40%
![Page 33: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/33.jpg)
Sensitivity to bus slot length (2-core): average overestimation ratio for the program statemate.
![Page 34: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/34.jpg)
Sensitivity to bus slot length (4-core): average overestimation ratio for the program statemate.
![Page 35: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/35.jpg)
Debie is an online space-debris monitoring program developed by Space Systems Finland Ltd.

Extracted task graph (Debie-test), used for WCRT analysis; the number in parentheses is the assigned core:
main-tc(1), main-hm(1), main-tm(1), main-hit(1), main-aq(1), main-su(1)
tc-test(3), hm-test(4), tm-test(1), hit-test(2), aq-test(4), su-test(2)
![Page 36: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/36.jpg)
Experimental evaluation of Debie-test

Configuration: L1 cache: 2-way, 2 KB; L2 cache: 4-way, 8 KB; L1 block size = 32 bytes; L2 block size = 64 bytes; L1 miss latency = 6 cycles; L2 miss latency = 30 cycles; bus slot length = 80 cycles.

Overestimation ratio ~ 20%. This clearly shows that, for real-life applications, bus modeling is essential.
![Page 37: Modeling shared cache and bus in multi-core platforms for timing analysis](https://reader036.fdocuments.net/reader036/viewer/2022062323/56816271550346895dd2e075/html5/thumbnails/37.jpg)
Extension to different multi-core architectures (e.g., Intel Core 2 Duo)

Figure: the Intel Core 2 Duo organization — two processors, each with cores (Core 0 … Core N) holding private L1 caches and a shared L2, connected through crossbars to a shared off-chip bus and off-chip memory.

In this architecture, only L2 cache misses appear on the shared bus. The overall framework remains the same; only the shared-bus waiting time is computed differently, namely for L2 cache misses alone.