Scheduling and Optimization of Fault-Tolerant Embedded Systems
Viacheslav Izosimov, Embedded Systems Lab (ESLAB)
Linköping University, Sweden
Presentation of Licentiate Thesis
Motivation
Hard real-time applications:
- Time-constrained
- Cost-constrained
- Fault-tolerant
- etc.
Focus on transient faults and intermittent faults.
Motivation
Transient faults:
- Radiation
- Electromagnetic interference (EMI)
- Lightning storms
Transient faults happen for a short time, causing corruption of data or miscalculation in logic. They do not cause permanent damage to circuits; their causes are outside system boundaries.
Motivation
Intermittent faults:
- Internal EMI
- Crosstalk
- Power supply fluctuations
- Software errors (Heisenbugs)
Intermittent faults manifest similarly to transient faults, but happen repeatedly; their causes are inside system boundaries.
Motivation
Errors caused by transient faults have to be tolerated before they crash the system. However, fault tolerance against transient faults leads to significant performance overhead. Transient faults are more likely to occur as the size of transistors is shrinking and the frequency is growing.
Motivation
Hard real-time applications: time-constrained, cost-constrained, fault-tolerant, etc.
The Need for Design Optimization of Embedded Systems with Fault Tolerance
Outline
- Motivation
- Background and limitations of previous work
- Thesis contributions:
  - Scheduling with fault tolerance requirements
  - Fault tolerance policy assignment
  - Checkpoint optimization
  - Trading-off transparency for performance
  - Mapping optimization with transparency
- Conclusions and future work
General Design Flow
System Specification → Architecture Selection → Mapping & Hardware/Software Partitioning → Scheduling → Back-end Synthesis, with feedback loops between the steps. Fault tolerance techniques are integrated into this flow.
Fault Tolerance Techniques
[Timeline figures: process P1 on nodes N1/N2 under each technique]
- Re-execution: P1 is restarted after a fault, with an error-detection overhead on each execution and a recovery overhead before each re-execution.
- Rollback recovery with checkpointing: P1 is split into segments P1^1, P1^2, and only the faulty segment is re-executed; each checkpoint adds a checkpointing overhead.
- Active replication: replicas P1(1) and P1(2) run in parallel on nodes N1 and N2.
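The time cost of the three techniques can be sketched as follows. The overhead model and all numeric values are illustrative assumptions (alpha = error-detection overhead, mu = recovery overhead, chi = checkpointing overhead), not the thesis's exact formulas:

```python
# Worst-case completion time of one process under k transient faults
# for the three fault-tolerance techniques on the slide.

def reexecution(C, k, alpha, mu):
    # k+1 full executions, each with error detection; every recovery
    # additionally pays the recovery overhead.
    return (k + 1) * (C + alpha) + k * mu

def checkpointing(C, n, k, alpha, chi, mu):
    # n evenly spaced checkpoints: a fault re-executes only the
    # current segment C/n instead of the whole process.
    return C + n * (alpha + chi) + k * (C / n + mu)

def replication(C, alpha):
    # k+1 replicas run in parallel on different nodes: no extra time
    # beyond detection, but (k+1) times the resources.
    return C + alpha

C, k, alpha, chi, mu = 50, 2, 10, 5, 15
print(reexecution(C, k, alpha, mu))           # 210
print(checkpointing(C, 3, k, alpha, chi, mu)) # ~158.3
print(replication(C, alpha))                  # 60
```

Checkpointing pays per-checkpoint overheads but shortens each recovery; replication pays in resources rather than time. This is the trade-off the policy assignment later exploits.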
Limitations of Previous Work
- Design optimization with fault tolerance is limited
- Process mapping is not considered together with fault tolerance issues
- Multiple faults are not addressed in the framework of static cyclic scheduling
- Transparency, if addressed at all, is restricted to a whole computation node
Outline
- Motivation
- Background and limitations of previous work
- Thesis contributions:
  - Scheduling with fault tolerance requirements
  - Fault tolerance policy assignment
  - Checkpoint optimization
  - Trading-off transparency for performance
  - Mapping optimization with transparency
- Conclusions and future work
Fault-Tolerant Time-Triggered Systems
- Processes: re-execution, active replication, rollback recovery with checkpointing
- Messages: fault-tolerant predictable protocol
[Application graph: processes P1–P5 with messages m1, m2, subject to transient faults]
- Maximum k transient faults within each application run (system period)
Scheduling with Fault Tolerance Requirements
- Conditional Scheduling
- Shifting-based Scheduling
Conditional Scheduling
[Animated example over several slides: processes P1 and P2 connected by message m1, k = 2, schedules drawn on a time line from 0 to 200]
A conditional schedule table is built incrementally. P1 starts at time 0 under the condition "true". If a fault occurs in P1 (condition F_P1), its re-execution P1/2 is scheduled; a second fault of P1 schedules P1/3. P2's start time depends on the fault condition: it starts earliest in the no-fault scenario and is shifted later in the scenarios where P1 (and possibly P2 itself, as P2/1, P2/2, P2/3) must be re-executed, up to k = 2 faults in total.
Fault-Tolerance Conditional Process Graph
[Figure: the application (P1, P2, m1, k = 2) unrolled into a conditional process graph. Each node is one execution attempt of a process or message; edges are guarded by fault conditions such as F_P1 and their negations.]
Conditional Schedule Table
[Table for nodes N1, N2, k = 2: for each fault condition (true, F_P1, conjunctions of fault conditions of P1 and P2), the start times of P1, P2 and message m1]
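A conditional schedule table can be represented as, for each activation, a list of (guard, start time) rows; at run time the scheduler uses the row whose guard matches the observed fault conditions. A minimal sketch with hypothetical start times (the table entries here are illustrative, not the slide's exact values):

```python
# A conditional schedule table as a lookup structure. A literal
# ("P1", True) reads "a fault occurred in P1"; ("P1", False) reads
# "no fault occurred in P1". All start times below are hypothetical.

TABLE = {
    "P1":   [(frozenset(), 0)],                    # starts unconditionally
    "P1/2": [(frozenset({("P1", True)}), 45)],     # first re-execution
    "P2":   [(frozenset({("P1", False)}), 40),     # no-fault scenario
             (frozenset({("P1", True)}), 95)],     # after P1's recovery
}

def start_time(activation, observed):
    """Return the start time of the first row whose guard is
    satisfied by the observed fault literals."""
    for guard, t in TABLE[activation]:
        if guard <= set(observed):
            return t
    return None  # activation not scheduled in this scenario

print(start_time("P2", {("P1", False)}))  # 40
print(start_time("P2", {("P1", True)}))   # 95
```

The memory cost of conditional scheduling comes from storing such a row for every reachable fault condition of every activation.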
Conditional Scheduling
Conditional scheduling:
+ Generates short schedules
+ Allows trading off transparency for performance (to be discussed later)
– Requires a lot of memory to store schedule tables
– The scheduling algorithm is very slow
Alternative: shifting-based scheduling
Shifting-based Scheduling
Messages sent over the bus are scheduled at a single time, and faults on one computation node must not affect other computation nodes. It requires less memory, and schedule generation is very fast.
– Schedules are longer
– Does not allow trading off transparency for performance (to be discussed later)
Ordered FT-CPG
[Figure: application with processes P1–P4 and messages m1–m3, k = 2. A fixed execution order is imposed before building the FT-CPG: P2 after P1, P3 after P4. The resulting graph contains the re-execution attempts of each process and the ordered messages m1, m2, m3.]
Root Schedules
[Figure: root schedule on node N1 (P1, P2), node N2 (P4, P3) and the bus (m1, m2, m3). A recovery slack is reserved after P1 and P2; the slack is shared and sized for the worst-case scenario of P1.]
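In shifting-based scheduling, consecutive processes on a node share one recovery slack sized for the worst case among them. A minimal sketch of that sizing rule; the WCETs and the recovery overhead mu below are hypothetical:

```python
# Shared recovery slack for a block of consecutive processes on one
# node: the slack must absorb k recoveries of the process whose
# re-execution is most expensive (WCET C_i plus recovery overhead mu),
# so one slack serves the whole block.

def shared_slack(wcets, k, mu):
    return max(k * (c + mu) for c in wcets)

# Node N1 runs P1 (C=30) and P2 (C=20) back to back; k = 2, mu = 5.
# Individual slacks would be 70 and 50; the shared slack is just 70.
print(shared_slack([30, 20], k=2, mu=5))  # 70
```

Sharing the slack is why root schedules are compact, and why only one root schedule (rather than a table of conditional schedules) needs to be stored.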
Extracting Execution Scenarios
[Figure: a concrete execution scenario extracted from the root schedule. P4 experiences faults and is executed as P4/1, P4/2, P4/3 within the reserved slack, while the remaining processes and messages are shifted accordingly.]
Memory Required to Store Schedule Tables

Frozen | 20 proc.           | 40 proc.           | 60 proc.            | 80 proc.
       | k=1   k=2   k=3    | k=1   k=2   k=3    | k=1   k=2    k=3    | k=1   k=2    k=3
100%   | 0.13  0.28  0.54   | 0.36  0.89  1.73   | 0.71  2.09   4.35   | 1.18  4.21   8.75
75%    | 0.22  0.57  1.37   | 0.62  2.06  4.96   | 1.20  4.64   11.55  | 2.01  8.40   21.11
50%    | 0.28  0.82  1.94   | 0.82  3.11  8.09   | 1.53  7.09   18.28  | 2.59  12.21  34.46
25%    | 0.34  1.17  2.95   | 1.03  4.34  12.56  | 1.92  10.00  28.31  | 3.05  17.30  51.30
0%     | 0.39  1.42  3.74   | 1.17  5.61  16.72  | 2.16  11.72  34.62  | 3.41  19.28  61.85

Applications with more frozen nodes require less memory.
Memory Required to Store Root Schedule

       | 20 proc. | 40 proc. | 60 proc. | 80 proc.
100%   | 0.016    | 0.034    | 0.054    | 0.070

Shifting-based scheduling requires very little memory (a single root schedule, regardless of k).
Schedule Generation Time and Quality
Shifting-based scheduling is much faster than conditional scheduling: it requires 0.2 seconds to generate a root schedule for an application of 120 processes and 10 faults, whereas conditional scheduling already takes 319 seconds to generate a schedule table for an application of 40 processes and 4 faults.
Shifting-based scheduling is ~15% worse than conditional scheduling with 100% of inter-processor messages set to frozen (in terms of fault tolerance overhead).
Fault Tolerance Policy Assignment
Checkpoint Optimization
Fault Tolerance Policy Assignment
[Figures: three ways to protect P1 against k = 2 faults]
- Re-execution: P1/1, P1/2, P1/3 on node N1
- Replication: replicas P1(1), P1(2), P1(3) on nodes N1, N2, N3
- Re-executed replicas: replica P1(1) on N1 is re-executed as P1(1)/1, P1(1)/2, combined with replica P1(2) on N2
Re-execution vs. Replication
[Example: processes P1, P2, P3 (WCETs on N1/N2: P1 40/50, P2 40/60, P3 50/70) on two nodes N1, N2 connected by a bus; two applications A1 and A2 built from these processes and messages m1, m2]
- For A1, pure re-execution misses the deadline while replication (P1(1), P1(2), etc., with replicated messages) meets it: replication is better.
- For A2, re-execution meets the deadline while replication misses it: re-execution is better.
Neither technique is better in general; the choice depends on the application structure.
Fault Tolerance Policy Assignment
[Example: processes P1–P4 (WCETs on N1/N2: P1 40/50, P2 60/60, P3 80/80, P4 40/50) with messages m1, m2, m3 on two nodes and a bus]
Pure re-execution misses the deadline, and pure replication misses it as well. Optimization of the fault tolerance policy assignment, replicating P1 while re-executing the other processes, meets the deadline.
Optimization Strategy
Design optimization:
- Fault tolerance policy assignment
- Mapping of processes and messages
- Root schedules (shifting-based scheduling inside a tabu-search loop)
Three tabu-search optimization algorithms:
1. Mapping and fault tolerance policy assignment (MRX): re-execution, replication or both
2. Mapping and only re-execution (MX)
3. Mapping and only replication (MR)
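The tabu-search loop itself can be sketched generically. The move set, tabu tenure and the toy cost function below are placeholders: in the thesis, the cost of a candidate mapping and policy assignment is the length of its root schedule produced by shifting-based scheduling. This illustrates only the search structure, not the MRX algorithm itself:

```python
# Minimal tabu-search skeleton: the search structure behind the
# MRX / MX / MR algorithms, with placeholder moves and cost.

def tabu_search(initial, neighbors, cost, iters=100, tenure=7):
    best = current = initial
    tabu = []  # recently visited solutions (FIFO tabu list)
    for _ in range(iters):
        candidates = [s for s in neighbors(current) if s not in tabu]
        if not candidates:
            break
        current = min(candidates, key=cost)  # best admissible move
        tabu.append(current)
        if len(tabu) > tenure:
            tabu.pop(0)
        if cost(current) < cost(best):
            best = current
    return best

# Toy instance: map 4 processes onto nodes 0/1; the hypothetical
# "schedule length" simply penalizes unbalanced mappings.
def neighbors(mapping):
    return [mapping[:i] + (1 - mapping[i],) + mapping[i + 1:]
            for i in range(len(mapping))]

def cost(mapping):
    return abs(sum(mapping) - (len(mapping) - sum(mapping)))

print(cost(tabu_search((0, 0, 0, 0), neighbors, cost)))  # 0
```

The tabu list lets the search escape local minima by forbidding recently visited solutions, which is why it can combine policy and mapping moves in one loop.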
Experimental Results
[Chart: average % deviation from MRX (y-axis, 0–100%) versus number of processes (20–100) for mapping and replication (MR), mapping and re-execution (MX), and mapping and policy assignment (MRX)]
Schedulability improvement under resource constraints: combined policy assignment (MRX) outperforms the single-policy algorithms.
Checkpoint Optimization
[Figure: process P1 on node N1 with checkpoints, executing as segments P1^1, P1^2; only the faulty segment is re-executed]
Locally Optimal Number of Checkpoints
[Chart: worst-case length of P1 (C1 = 50 ms) with 1 to 5 checkpoints, k = 2, for overhead values of 5, 10 and 15 ms]
There is a number of checkpoints that minimizes the worst-case length of a single process: too few checkpoints mean long re-executed segments, too many mean excessive checkpointing overhead.
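A locally optimal checkpoint count follows from minimizing the worst-case length over n. With the overhead model assumed here, E(n) = C + n(chi + alpha) + k(C/n + mu) (an assumption, since the slide's overhead symbols did not survive transcription), the continuous optimum is n0 = sqrt(kC/(chi + alpha)), rounded to whichever integer neighbour gives the smaller E(n):

```python
import math

# Locally optimal number of checkpoints for a single process.
# Assumed worst-case length with n checkpoints and k faults:
#   E(n) = C + n*(chi + alpha) + k*(C/n + mu)
# dE/dn = (chi + alpha) - k*C/n**2 = 0  =>  n0 = sqrt(k*C/(chi + alpha))

def wc_length(C, n, k, chi, alpha, mu):
    return C + n * (chi + alpha) + k * (C / n + mu)

def optimal_checkpoints(C, k, chi, alpha, mu):
    n0 = math.sqrt(k * C / (chi + alpha))
    lo, hi = max(1, math.floor(n0)), max(1, math.ceil(n0))
    return min(lo, hi, key=lambda n: wc_length(C, n, k, chi, alpha, mu))

# Slide's process: C1 = 50 ms, k = 2; the overhead values here are
# hypothetical (the slide's symbol assignments were lost).
print(optimal_checkpoints(C=50, k=2, chi=5, alpha=10, mu=15))  # 3
```

Note that mu only shifts E(n) by a constant per fault, so it does not affect the optimum; only the ratio of re-execution cost to per-checkpoint overhead does.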
Globally Optimal Number of Checkpoints
[Example: P1 (C1 = 50 ms) sends m1 to P2 (C2 = 60 ms); k = 2; per-process overheads of 10, 5 and 10 ms (symbol assignments lost in transcription)]
a) Locally optimal checkpoints, three per process (P1^1–P1^3, P2^1–P2^3): schedule length 265
b) Globally optimal checkpoints, two per process (P1^1, P1^2, P2^1, P2^2): schedule length 255
The number of checkpoints that is locally optimal for each process in isolation is not globally optimal for the whole application.
Global Optimization vs. Local Optimization
Does the optimization reduce the fault tolerance overhead on the schedule length?
[Chart: % deviation from MC0, i.e. how much smaller the fault tolerance overhead is (0–40%), versus application size (40–100 tasks); 4 nodes, 3 faults]
Global optimization of checkpoint distribution (MC) reduces the fault tolerance overhead compared to local optimization of checkpoint distribution (MC0).
Trading-off Transparency for Performance
Mapping Optimization with Transparency
FT Implementations with Transparency
[Application graph: processes P1–P5 with messages m1, m2; P3 is frozen, the others are regular processes/messages]
Transparency is achieved with frozen processes and messages. It is good for debugging and testing.
No Transparency
[Example: processes P1–P4, k = 2, recovery overhead 5 ms; WCETs: P1 30 on N1, P2 20 on N1, P3 20 on N2, P4 30 on N2, with "X" marking infeasible mappings]
Without transparency, processes start at different times and messages are sent at different times in each fault scenario: the no-fault schedule and the worst-case fault scenario (re-executing P1 and P4) look entirely different.
Full Transparency / Customized Transparency
[Schedules comparing three settings on the same application against the deadline]
- No transparency: shortest schedules, but every fault scenario differs.
- Customized transparency (only P3 frozen): P3 keeps the same start time in all scenarios, with only a modest increase in schedule length.
- Full transparency (all processes and messages frozen): fixed start times across all scenarios, at the cost of the longest schedule.
Trading-Off Transparency for Performance

How much longer is the schedule with fault tolerance? (% increase, transparency increasing from 0% to 100% frozen)

Proc. | 0%          | 25%         | 50%          | 75%          | 100%
      | k=1 k=2 k=3 | k=1 k=2 k=3 | k=1 k=2 k=3  | k=1 k=2 k=3  | k=1 k=2 k=3
20    | 24  44  63  | 32  60  92  | 39  74  115  | 48  83  133  | 48  86  139
40    | 17  29  43  | 20  40  58  | 28  49  72   | 34  60  90   | 39  66  97
60    | 12  24  34  | 13  30  43  | 19  39  58   | 28  54  79   | 32  58  86
80    | 8   16  22  | 10  18  29  | 14  27  39   | 24  41  66   | 27  43  73

Four (4) computation nodes, recovery time 5 ms. Trading transparency for performance is essential.
Mapping with Transparency
[Example: processes P1–P6 with messages m1–m4 on nodes N1, N2; recovery overhead 10 ms, k = 2; WCETs (N1/N2): P1 30/30, P2 40/40, P3 50/50, P4 60/60, P5 40/40, P6 50/50]
The mapping that is optimal without transparency meets the deadline; its worst-case fault scenario re-executes P4 as P4/1, P4/2, P4/3.
Mapping with Transparency
[Same example, with transparency requirements added]
With transparency, the worst-case fault scenario of the previously "optimal" mapping (re-executing P2 as P2/1, P2/2, P2/3) no longer fits before the deadline. Re-optimizing the mapping with transparency taken into account (worst case re-executing P4 as P4/1, P4/2, P4/3) shortens the worst-case schedule.
Design Optimization
Hill-climbing mapping optimization heuristic, with two ways to evaluate a candidate mapping:
1. Conditional Scheduling (CS): exact schedule length, but slow
2. Schedule Length Estimation (SE): fast
Experimental Results
[Setup: 4 nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead = 5 ms]

How much faster is schedule length estimation (SE) compared to conditional scheduling (CS)? Run times in seconds:

Proc. | k=2: SE    CS   | k=3: SE    CS    | k=4: SE    CS
20    |      0.01  0.07 |      0.02  0.28  |      0.04  1.37
30    |      0.13  0.39 |      0.19  2.93  |      0.26  31.50
40    |      0.32  1.34 |      0.50  17.02 |      0.69  318.88

Schedule length estimation (SE) is more than 400 times faster than conditional scheduling (CS): 0.69 s versus 318.88 s for 40 processes, k = 4.
Experimental Results
[Setup: 4 computation nodes, 15 applications, 25% of processes and 50% of messages frozen, recovery overhead = 5 ms]

How much is the improvement when transparency is taken into account?

Proc. | k=2    | k=3    | k=4
20    | 32.89% | 32.20% | 30.56%
30    | 35.62% | 31.68% | 30.58%
40    | 28.88% | 28.11% | 28.03%

The schedule length of fault-tolerant applications is 31.68% shorter on average if transparency is considered during mapping.
Outline
- Motivation
- Background and limitations of previous work
- Thesis contributions:
  - Scheduling with fault tolerance requirements
  - Fault tolerance policy assignment
  - Checkpoint optimization
  - Trading-off transparency for performance
  - Mapping optimization with transparency
- Conclusions and future work
Conclusions
Scheduling with fault tolerance requirements:
- Two novel scheduling techniques
- Handling customized transparency requirements, trading off transparency for performance
- A fast scheduling alternative with low memory requirements for schedules
Conclusions
Design optimization with fault tolerance:
- Policy assignment optimization strategy
- Estimation-driven mapping optimization that can handle customized transparency requirements
- Optimization of the number of checkpoints
The approaches and algorithms have been evaluated on a large number of synthetic applications and a real-life example, a vehicle cruise controller.
Design Optimization of Embedded Systems with Fault Tolerance is Essential
Some More… Future Work
- Fault-Tree Analysis
- Probabilistic Fault Model
- Soft Real-Time