Scheduling and Optimization of Fault-Tolerant Embedded Systems
Viacheslav Izosimov, Embedded Systems Lab (ESLAB)
Linköping University, Sweden
Presentation of Licentiate Thesis
Motivation
Hard real-time applications:
- Time-constrained
- Cost-constrained
- Fault-tolerant
- etc.
Focus on transient faults and intermittent faults.
Motivation
Transient faults:
- Radiation
- Electromagnetic interference (EMI)
- Lightning storms
Transient faults happen for a short time, causing corruption of data or miscalculation in logic. They do not cause permanent damage to circuits; their causes are outside system boundaries.
Motivation
Intermittent faults:
- Internal EMI
- Crosstalk
- Power supply fluctuations
- Software errors (Heisenbugs)
Intermittent faults manifest similarly to transient faults, but happen repeatedly; their causes are inside system boundaries.
Motivation
Errors caused by transient faults have to be tolerated before they crash the system. However, fault tolerance against transient faults leads to significant performance overhead. Transient faults are more likely to occur as the size of transistors is shrinking and the frequency is growing.
Motivation
Hard real-time applications: time-constrained, cost-constrained, fault-tolerant, etc.
The Need for Design Optimization of Embedded Systems with Fault Tolerance
Outline
- Motivation
- Background and limitations of previous work
- Thesis contributions:
  - Scheduling with fault tolerance requirements
  - Fault tolerance policy assignment
  - Checkpoint optimization
  - Trading-off transparency for performance
  - Mapping optimization with transparency
- Conclusions and future work
General Design Flow
System Specification → Architecture Selection → Mapping & Hardware/Software Partitioning → Scheduling → Back-end Synthesis, with feedback loops between the steps. Fault tolerance techniques are integrated into this flow.
Fault Tolerance Techniques
[Timeline figures: process P1 on nodes N1/N2 under each technique]
- Re-execution: P1 is restarted after a fault, with an error-detection overhead on each execution and a recovery overhead before each re-execution.
- Rollback recovery with checkpointing: P1 is split into segments P1^1, P1^2, and only the faulty segment is re-executed; each checkpoint adds a checkpointing overhead.
- Active replication: replicas P1(1) and P1(2) run in parallel on nodes N1 and N2.
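The time cost of the three techniques can be sketched as follows. The overhead model and all numeric values are illustrative assumptions (alpha = error-detection overhead, mu = recovery overhead, chi = checkpointing overhead), not the thesis's exact formulas:

```python
# Worst-case completion time of one process under k transient faults
# for the three fault-tolerance techniques on the slide.

def reexecution(C, k, alpha, mu):
    # k+1 full executions, each with error detection; every recovery
    # additionally pays the recovery overhead.
    return (k + 1) * (C + alpha) + k * mu

def checkpointing(C, n, k, alpha, chi, mu):
    # n evenly spaced checkpoints: a fault re-executes only the
    # current segment C/n instead of the whole process.
    return C + n * (alpha + chi) + k * (C / n + mu)

def replication(C, alpha):
    # k+1 replicas run in parallel on different nodes: no extra time
    # beyond detection, but (k+1) times the resources.
    return C + alpha

C, k, alpha, chi, mu = 50, 2, 10, 5, 15
print(reexecution(C, k, alpha, mu))           # 210
print(checkpointing(C, 3, k, alpha, chi, mu)) # ~158.3
print(replication(C, alpha))                  # 60
```

Checkpointing pays per-checkpoint overheads but shortens each recovery; replication pays in resources rather than time. This is the trade-off the policy assignment later exploits.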
Limitations of Previous Work
- Design optimization with fault tolerance is limited
- Process mapping is not considered together with fault tolerance issues
- Multiple faults are not addressed in the framework of static cyclic scheduling
- Transparency, if addressed at all, is restricted to a whole computation node
Outline
- Motivation
- Background and limitations of previous work
- Thesis contributions:
  - Scheduling with fault tolerance requirements
  - Fault tolerance policy assignment
  - Checkpoint optimization
  - Trading-off transparency for performance
  - Mapping optimization with transparency
- Conclusions and future work
Fault-Tolerant Time-Triggered Systems
- Processes: re-execution, active replication, rollback recovery with checkpointing
- Messages: fault-tolerant predictable protocol
[Application graph: processes P1–P5 with messages m1, m2, subject to transient faults]
- Maximum k transient faults within each application run (system period)
Scheduling with Fault Tolerance Requirements
- Conditional Scheduling
- Shifting-based Scheduling
Conditional Scheduling
[Animated example over several slides: processes P1 and P2 connected by message m1, k = 2, schedules drawn on a time line from 0 to 200]
A conditional schedule table is built incrementally. P1 starts at time 0 under the condition "true". If a fault occurs in P1 (condition F_P1), its re-execution P1/2 is scheduled; a second fault of P1 schedules P1/3. P2's start time depends on the fault condition: it starts earliest in the no-fault scenario and is shifted later in the scenarios where P1 (and possibly P2 itself, as P2/1, P2/2, P2/3) must be re-executed, up to k = 2 faults in total.
Fault-Tolerance Conditional Process Graph
[Figure: the application (P1, P2, m1, k = 2) unrolled into a conditional process graph. Each node is one execution attempt of a process or message; edges are guarded by fault conditions such as F_P1 and their negations.]
Conditional Schedule Table
[Table for nodes N1, N2, k = 2: for each fault condition (true, F_P1, conjunctions of fault conditions of P1 and P2), the start times of P1, P2 and message m1]
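A conditional schedule table can be represented as, for each activation, a list of (guard, start time) rows; at run time the scheduler uses the row whose guard matches the observed fault conditions. A minimal sketch with hypothetical start times (the table entries here are illustrative, not the slide's exact values):

```python
# A conditional schedule table as a lookup structure. A literal
# ("P1", True) reads "a fault occurred in P1"; ("P1", False) reads
# "no fault occurred in P1". All start times below are hypothetical.

TABLE = {
    "P1":   [(frozenset(), 0)],                    # starts unconditionally
    "P1/2": [(frozenset({("P1", True)}), 45)],     # first re-execution
    "P2":   [(frozenset({("P1", False)}), 40),     # no-fault scenario
             (frozenset({("P1", True)}), 95)],     # after P1's recovery
}

def start_time(activation, observed):
    """Return the start time of the first row whose guard is
    satisfied by the observed fault literals."""
    for guard, t in TABLE[activation]:
        if guard <= set(observed):
            return t
    return None  # activation not scheduled in this scenario

print(start_time("P2", {("P1", False)}))  # 40
print(start_time("P2", {("P1", True)}))   # 95
```

The memory cost of conditional scheduling comes from storing such a row for every reachable fault condition of every activation.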
Conditional Scheduling
Conditional scheduling:
+ Generates short schedules
+ Allows trading off transparency for performance (to be discussed later)
– Requires a lot of memory to store schedule tables
– The scheduling algorithm is very slow
Alternative: shifting-based scheduling
Shifting-based Scheduling
Messages sent over the bus are scheduled at a single time, and faults on one computation node must not affect other computation nodes. It requires less memory, and schedule generation is very fast.
– Schedules are longer
– Does not allow trading off transparency for performance (to be discussed later)
Ordered FT-CPG
[Figure: application with processes P1–P4 and messages m1–m3, k = 2. A fixed execution order is imposed before building the FT-CPG: P2 after P1, P3 after P4. The resulting graph contains the re-execution attempts of each process and the ordered messages m1, m2, m3.]
Root Schedules
[Figure: root schedule on node N1 (P1, P2), node N2 (P4, P3) and the bus (m1, m2, m3). A recovery slack is reserved after P1 and P2; the slack is shared and sized for the worst-case scenario of P1.]
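In shifting-based scheduling, consecutive processes on a node share one recovery slack sized for the worst case among them. A minimal sketch of that sizing rule; the WCETs and the recovery overhead mu below are hypothetical:

```python
# Shared recovery slack for a block of consecutive processes on one
# node: the slack must absorb k recoveries of the process whose
# re-execution is most expensive (WCET C_i plus recovery overhead mu),
# so one slack serves the whole block.

def shared_slack(wcets, k, mu):
    return max(k * (c + mu) for c in wcets)

# Node N1 runs P1 (C=30) and P2 (C=20) back to back; k = 2, mu = 5.
# Individual slacks would be 70 and 50; the shared slack is just 70.
print(shared_slack([30, 20], k=2, mu=5))  # 70
```

Sharing the slack is why root schedules are compact, and why only one root schedule (rather than a table of conditional schedules) needs to be stored.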
Extracting Execution Scenarios
[Figure: a concrete execution scenario extracted from the root schedule. P4 experiences faults and is executed as P4/1, P4/2, P4/3 within the reserved slack, while the remaining processes and messages are shifted accordingly.]
Memory Required to Store Schedule Tables

Frozen | 20 proc.           | 40 proc.           | 60 proc.            | 80 proc.
       | k=1   k=2   k=3    | k=1   k=2   k=3    | k=1   k=2    k=3    | k=1   k=2    k=3
100%   | 0.13  0.28  0.54   | 0.36  0.89  1.73   | 0.71  2.09   4.35   | 1.18  4.21   8.75
75%    | 0.22  0.57  1.37   | 0.62  2.06  4.96   | 1.20  4.64   11.55  | 2.01  8.40   21.11
50%    | 0.28  0.82  1.94   | 0.82  3.11  8.09   | 1.53  7.09   18.28  | 2.59  12.21  34.46
25%    | 0.34  1.17  2.95   | 1.03  4.34  12.56  | 1.92  10.00  28.31  | 3.05  17.30  51.30
0%     | 0.39  1.42  3.74   | 1.17  5.61  16.72  | 2.16  11.72  34.62  | 3.41  19.28  61.85

Applications with more frozen nodes require less memory.
Memory Required to Store Root Schedule

       | 20 proc. | 40 proc. | 60 proc. | 80 proc.
100%   | 0.016    | 0.034    | 0.054    | 0.070

Shifting-based scheduling requires very little memory (a single root schedule, regardless of k).
Schedule Generation Time and Quality
Shifting-based scheduling is much faster than conditional scheduling: it requires 0.2 seconds to generate a root schedule for an application of 120 processes and 10 faults, whereas conditional scheduling already takes 319 seconds to generate a schedule table for an application of 40 processes and 4 faults.
Shifting-based scheduling is ~15% worse than conditional scheduling with 100% of inter-processor messages set to frozen (in terms of fault tolerance overhead).
Fault Tolerance Policy Assignment
Checkpoint Optimization
Fault Tolerance Policy Assignment
[Figures: three ways to protect P1 against k = 2 faults]
- Re-execution: P1/1, P1/2, P1/3 on node N1
- Replication: replicas P1(1), P1(2), P1(3) on nodes N1, N2, N3
- Re-executed replicas: replica P1(1) on N1 is re-executed as P1(1)/1, P1(1)/2, combined with replica P1(2) on N2
Re-execution vs. Replication
[Example: processes P1, P2, P3 (WCETs on N1/N2: P1 40/50, P2 40/60, P3 50/70) on two nodes N1, N2 connected by a bus; two applications A1 and A2 built from these processes and messages m1, m2]
- For A1, pure re-execution misses the deadline while replication (P1(1), P1(2), etc., with replicated messages) meets it: replication is better.
- For A2, re-execution meets the deadline while replication misses it: re-execution is better.
Neither technique is better in general; the choice depends on the application structure.
Fault Tolerance Policy Assignment
[Example: processes P1–P4 (WCETs on N1/N2: P1 40/50, P2 60/60, P3 80/80, P4 40/50) with messages m1, m2, m3 on two nodes and a bus]
Pure re-execution misses the deadline, and pure replication misses it as well. Optimization of the fault tolerance policy assignment, replicating P1 while re-executing the other processes, meets the deadline.
Optimization Strategy
Design optimization:
- Fault tolerance policy assignment
- Mapping of processes and messages
- Root schedules (shifting-based scheduling inside a tabu-search loop)
Three tabu-search optimization algorithms:
1. Mapping and fault tolerance policy assignment (MRX): re-execution, replication or both
2. Mapping and only re-execution (MX)
3. Mapping and only replication (MR)
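The tabu-search loop itself can be sketched generically. The move set, tabu tenure and the toy cost function below are placeholders: in the thesis, the cost of a candidate mapping and policy assignment is the length of its root schedule produced by shifting-based scheduling. This illustrates only the search structure, not the MRX algorithm itself:

```python
# Minimal tabu-search skeleton: the search structure behind the
# MRX / MX / MR algorithms, with placeholder moves and cost.

def tabu_search(initial, neighbors, cost, iters=100, tenure=7):
    best = current = initial
    tabu = []  # recently visited solutions (FIFO tabu list)
    for _ in range(iters):
        candidates = [s for s in neighbors(current) if s not in tabu]
        if not candidates:
            break
        current = min(candidates, key=cost)  # best admissible move
        tabu.append(current)
        if len(tabu) > tenure:
            tabu.pop(0)
        if cost(current) < cost(best):
            best = current
    return best

# Toy instance: map 4 processes onto nodes 0/1; the hypothetical
# "schedule length" simply penalizes unbalanced mappings.
def neighbors(mapping):
    return [mapping[:i] + (1 - mapping[i],) + mapping[i + 1:]
            for i in range(len(mapping))]

def cost(mapping):
    return abs(sum(mapping) - (len(mapping) - sum(mapping)))

print(cost(tabu_search((0, 0, 0, 0), neighbors, cost)))  # 0
```

The tabu list lets the search escape local minima by forbidding recently visited solutions, which is why it can combine policy and mapping moves in one loop.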
Experimental Results
[Chart: average % deviation from MRX (y-axis, 0–100%) versus number of processes (20–100) for mapping and replication (MR), mapping and re-execution (MX), and mapping and policy assignment (MRX)]
Schedulability improvement under resource constraints: combined policy assignment (MRX) outperforms the single-policy algorithms.
Checkpoint Optimization
[Figure: process P1 on node N1 with checkpoints, executing as segments P1^1, P1^2; only the faulty segment is re-executed]
Locally Optimal Number of Checkpoints
[Chart: worst-case length of P1 (C1 = 50 ms) with 1 to 5 checkpoints, k = 2, for overhead values of 5, 10 and 15 ms]
There is a number of checkpoints that minimizes the worst-case length of a single process: too few checkpoints mean long re-executed segments, too many mean excessive checkpointing overhead.
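A locally optimal checkpoint count follows from minimizing the worst-case length over n. With the overhead model assumed here, E(n) = C + n(chi + alpha) + k(C/n + mu) (an assumption, since the slide's overhead symbols did not survive transcription), the continuous optimum is n0 = sqrt(kC/(chi + alpha)), rounded to whichever integer neighbour gives the smaller E(n):

```python
import math

# Locally optimal number of checkpoints for a single process.
# Assumed worst-case length with n checkpoints and k faults:
#   E(n) = C + n*(chi + alpha) + k*(C/n + mu)
# dE/dn = (chi + alpha) - k*C/n**2 = 0  =>  n0 = sqrt(k*C/(chi + alpha))

def wc_length(C, n, k, chi, alpha, mu):
    return C + n * (chi + alpha) + k * (C / n + mu)

def optimal_checkpoints(C, k, chi, alpha, mu):
    n0 = math.sqrt(k * C / (chi + alpha))
    lo, hi = max(1, math.floor(n0)), max(1, math.ceil(n0))
    return min(lo, hi, key=lambda n: wc_length(C, n, k, chi, alpha, mu))

# Slide's process: C1 = 50 ms, k = 2; the overhead values here are
# hypothetical (the slide's symbol assignments were lost).
print(optimal_checkpoints(C=50, k=2, chi=5, alpha=10, mu=15))  # 3
```

Note that mu only shifts E(n) by a constant per fault, so it does not affect the optimum; only the ratio of re-execution cost to per-checkpoint overhead does.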
Globally Optimal Number of Checkpoints
[Example: P1 (C1 = 50 ms) sends m1 to P2 (C2 = 60 ms); k = 2; per-process overheads of 10, 5 and 10 ms (symbol assignments lost in transcription)]
a) Locally optimal checkpoints, three per process (P1^1–P1^3, P2^1–P2^3): schedule length 265
b) Globally optimal checkpoints, two per process (P1^1, P1^2, P2^1, P2^2): schedule length 255
The number of checkpoints that is locally optimal for each process in isolation is not globally optimal for the whole application.
Global Optimization vs. Local Optimization
Does the optimization reduce the fault tolerance overhead on the schedule length?
[Chart: % deviation from MC0, i.e. how much smaller the fault tolerance overhead is (0–40%), versus application size (40–100 tasks); 4 nodes, 3 faults]
Global optimization of checkpoint distribution (MC) reduces the fault tolerance overhead compared to local optimization of checkpoint distribution (MC0).
Trading-off Transparency for Performance
Mapping Optimization with Transparency
FT Implementations with Transparency
[Application graph: processes P1–P5 with messages m1, m2; P3 is frozen, the others are regular processes/messages]
Transparency is achieved with frozen processes and messages. It is good for debugging and testing.
No Transparency
[Example: processes P1–P4, k = 2, recovery overhead 5 ms; WCETs: P1 30 on N1, P2 20 on N1, P3 20 on N2, P4 30 on N2, with "X" marking infeasible mappings]
Without transparency, processes start at different times and messages are sent at different times in each fault scenario: the no-fault schedule and the worst-case fault scenario (re-executing P1 and P4) look entirely different.
Full Transparency / Customized Transparency
[Schedules comparing three settings on the same application against the deadline]
- No transparency: shortest schedules, but every fault scenario differs.
- Customized transparency (only P3 frozen): P3 keeps the same start time in all scenarios, with only a modest increase in schedule length.
- Full transparency (all processes and messages frozen): fixed start times across all scenarios, at the cost of the longest schedule.
Trading-Off Transparency for Performance

How much longer is the schedule with fault tolerance? (% increase, transparency increasing from 0% to 100% frozen)

Proc. | 0%          | 25%         | 50%          | 75%          | 100%
      | k=1 k=2 k=3 | k=1 k=2 k=3 | k=1 k=2 k=3  | k=1 k=2 k=3  | k=1 k=2 k=3
20    | 24  44  63  | 32  60  92  | 39  74  115  | 48  83  133  | 48  86  139
40    | 17  29  43  | 20  40  58  | 28  49  72   | 34  60  90   | 39  66  97
60    | 12  24  34  | 13  30  43  | 19  39  58   | 28  54  79   | 32  58  86
80    | 8   16  22  | 10  18  29  | 14  27  39   | 24  41  66   | 27  43  73

Four (4) computation nodes, recovery time 5 ms. Trading transparency for performance is essential.
Mapping with Transparency
[Example: processes P1–P6 with messages m1–m4 on nodes N1, N2; recovery overhead 10 ms, k = 2; WCETs (N1/N2): P1 30/30, P2 40/40, P3 50/50, P4 60/60, P5 40/40, P6 50/50]
The mapping that is optimal without transparency meets the deadline; its worst-case fault scenario re-executes P4 as P4/1, P4/2, P4/3.
Mapping with Transparency
[Same example, with transparency requirements added]
With transparency, the worst-case fault scenario of the previously "optimal" mapping (re-executing P2 as P2/1, P2/2, P2/3) no longer fits before the deadline. Re-optimizing the mapping with transparency taken into account (worst case re-executing P4 as P4/1, P4/2, P4/3) shortens the worst-case schedule.
Design Optimization
Hill-climbing mapping optimization heuristic, with two ways to evaluate a candidate mapping:
1. Conditional Scheduling (CS): exact schedule length, but slow
2. Schedule Length Estimation (SE): fast
Experimental Results
[Setup: 4 nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead = 5 ms]

How much faster is schedule length estimation (SE) compared to conditional scheduling (CS)? Run times in seconds:

Proc. | k=2: SE    CS   | k=3: SE    CS    | k=4: SE    CS
20    |      0.01  0.07 |      0.02  0.28  |      0.04  1.37
30    |      0.13  0.39 |      0.19  2.93  |      0.26  31.50
40    |      0.32  1.34 |      0.50  17.02 |      0.69  318.88

Schedule length estimation (SE) is more than 400 times faster than conditional scheduling (CS): 0.69 s versus 318.88 s for 40 processes, k = 4.
Experimental Results
[Setup: 4 computation nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead = 5 ms]

How much is the improvement when transparency is taken into account?

Proc. | k=2    | k=3    | k=4
20    | 32.89% | 32.20% | 30.56%
30    | 35.62% | 31.68% | 30.58%
40    | 28.88% | 28.11% | 28.03%

The schedule length of fault-tolerant applications is 31.68% shorter on average if transparency is considered during mapping.
Outline
- Motivation
- Background and limitations of previous work
- Thesis contributions:
  - Scheduling with fault tolerance requirements
  - Fault tolerance policy assignment
  - Checkpoint optimization
  - Trading-off transparency for performance
  - Mapping optimization with transparency
- Conclusions and future work
Conclusions
Scheduling with fault tolerance requirements:
- Two novel scheduling techniques
- Handling customized transparency requirements, trading off transparency for performance
- A fast scheduling alternative with low memory requirements for schedules
Conclusions
Design optimization with fault tolerance:
- Policy assignment optimization strategy
- Estimation-driven mapping optimization that can handle customized transparency requirements
- Optimization of the number of checkpoints
The approaches and algorithms have been evaluated on a large number of synthetic applications and a real-life example, a vehicle cruise controller.
Design Optimization of Embedded Systems with Fault Tolerance is Essential
Some More… Future Work
- Fault-Tree Analysis
- Probabilistic Fault Model
- Soft Real-Time