Post on 20-Jan-2016
Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms
Lin Huang, Feng Yuan and Qiang Xu
Reliable Computing Laboratory
Department of Computer Science & Engineering
The Chinese University of Hong Kong
DATE’09
Lifetime Reliability of Embedded Multiprocessor Platform
Multiprocessor system-on-a-chip (MPSoC)
• Platform-based design
• Hardware / software co-synthesis
Reliability issue
• IC product wear-out lifetime reliability threats
• Time dependent dielectric breakdown (TDDB), electromigration (EM), stress migration (SM), negative bias temperature instability (NBTI)
• Soft errors
Prior Work
Prior work in reliability-driven task allocation and scheduling
• Constant failure rate
Limitation of thermal-aware task scheduling
• Might improve the system’s lifetime reliability implicitly
• Not readily applicable, especially for heterogeneous MPSoC
Problem Motivation Example
Electromigration
•
Suppose , and all other
parameters are the same
P1 ages much faster than P2, dominating the MPSoC lifetime
[Figure: MPSoC platform with processors P1 and P2]
Problem Formulation
Task allocation and scheduling
Output
Aim: to maximize the expected service life (mean time to failure, MTTF) of the MPSoC system under the performance constraint
[Figure: a task graph (T0–T4) and an MPSoC platform (P1, P2) pass through binding & scheduling to produce a periodical schedule]
Lifetime Reliability Estimation
Electromigration
Denote by R(t) the reliability of a single processor at time t
Expected service life
Weibull distribution
• Computed by existing hard error models
• Reflect some important factors (e.g., architecture properties)
[Figure: temperature variation example]
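The standard relations behind these bullets can be sketched as follows. The symbols R(t), α (Weibull scale parameter), β (Weibull slope parameter) and Γ (gamma function) are notation assumed for this sketch, and a constant operating temperature is assumed so that α is fixed.

```latex
\begin{align}
\mathrm{MTTF} &= \int_{0}^{\infty} R(t)\,dt \\
R(t) &= \exp\!\left[-\left(\frac{t}{\alpha}\right)^{\beta}\right] \\
\mathrm{MTTF} &= \alpha\,\Gamma\!\left(1 + \frac{1}{\beta}\right)
\end{align}
```

The first line is the general definition of expected service life, the second is the Weibull reliability function, and the third follows by substituting the second into the first.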
Main Approach – Simulated Annealing
Solution representation
• (schedule order sequence; resource assignment sequence)
• For example, (0, 1, 3, 2, 4; P2, P2, P2, P1, P1)
• Schedule order sequence: partial order defined by task graph
• Every solution corresponds to a feasible schedule
Schedule Reconstruction
[Figure: periodical schedule of tasks T0–T4 on processors P1 and P2]
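A minimal sketch of how a schedule might be reconstructed from the (schedule order sequence; resource assignment sequence) pair. It assumes that the i-th entry of the resource assignment sequence is the processor of the i-th task in the schedule order, and that each task starts as soon as its predecessors have finished and its processor is free. The function name, the precedence edges and the execution times are hypothetical.

```python
def reconstruct_schedule(order, assignment, preds, exec_time):
    """Turn (schedule order; resource assignment) into start/finish times.

    order      -- task ids in scheduling order (must respect precedence)
    assignment -- assignment[i] is the processor of order[i] (assumption)
    preds      -- dict: task id -> list of predecessor task ids
    exec_time  -- dict: task id -> execution time (hypothetical values)
    """
    proc_free = {}   # processor -> time at which it becomes idle
    finish = {}      # task -> finish time
    schedule = []
    for task, proc in zip(order, assignment):
        # A task may start once all of its predecessors have finished...
        ready = max((finish[p] for p in preds.get(task, ())), default=0)
        # ...and once its processor has completed the previous task.
        start = max(ready, proc_free.get(proc, 0))
        finish[task] = start + exec_time[task]
        proc_free[proc] = finish[task]
        schedule.append((task, proc, start, finish[task]))
    return schedule


# The example solution from the slide, with a hypothetical task graph.
order = (0, 1, 3, 2, 4)
assignment = ("P2", "P2", "P2", "P1", "P1")
preds = {1: [0], 4: [2, 3]}                 # hypothetical edges
exec_time = {0: 2, 1: 3, 2: 4, 3: 1, 4: 2}  # hypothetical durations
for entry in reconstruct_schedule(order, assignment, preds, exec_time):
    print(entry)
```

Because the order sequence respects the task graph, every predecessor already has a finish time when a task is placed, which is why every solution maps to a feasible schedule.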
Main Approach – Simulated Annealing
Transforms of directed acyclic graph
• Expanded task graph
• Undirected complement graph
Lemma: Given a valid schedule order, swapping two adjacent nodes leads to another valid schedule order, provided there is an edge between these two nodes in the complement graph
[Figure: task graph, expanded task graph and complement graph over tasks T0–T4]
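The two transforms and the lemma can be illustrated with a short sketch. It assumes that the expanded task graph is the transitive closure of the task graph and that the complement graph joins every pair of tasks with no precedence relation in either direction. The function names and the edge list are hypothetical.

```python
from itertools import combinations


def transitive_closure(n, edges):
    """reach[u][v] is True if v must run after u (directly or indirectly)."""
    reach = [[False] * n for _ in range(n)]
    for u, v in edges:
        reach[u][v] = True
    for k in range(n):              # Floyd-Warshall style closure
        for i in range(n):
            if reach[i][k]:
                for j in range(n):
                    if reach[k][j]:
                        reach[i][j] = True
    return reach


def complement_graph(n, edges):
    """Undirected pairs of tasks that are NOT ordered by precedence."""
    reach = transitive_closure(n, edges)
    return {frozenset((u, v)) for u, v in combinations(range(n), 2)
            if not reach[u][v] and not reach[v][u]}


def swap_adjacent(order, i, comp):
    """Swap order[i] and order[i+1] if the lemma allows it, else None."""
    a, b = order[i], order[i + 1]
    if frozenset((a, b)) not in comp:
        return None
    new_order = list(order)
    new_order[i], new_order[i + 1] = b, a
    return new_order


edges = [(0, 1), (2, 4), (3, 4)]            # hypothetical task graph
comp = complement_graph(5, edges)
print(swap_adjacent([2, 3, 0, 4, 1], 1, comp))   # T3 and T0 are unrelated
print(swap_adjacent([0, 2, 3, 4, 1], 2, comp))   # T3 precedes T4: refused
```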
Main Approach – Simulated Annealing
Theorem: Starting from a valid schedule order, we are able to reach any other valid schedule order after a finite number of adjacent swaps
• For example: 2 3 0 4 1 → 2 0 3 4 1 → 0 2 3 4 1 → 0 2 3 1 4
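One way to see why the theorem holds is a bubble-sort argument, sketched below as an illustration rather than as the authors' proof. If two tasks are adjacent in the current valid order but appear in the opposite order in the valid target, they cannot be precedence-related, so swapping them is always allowed. The function names and the edge list used for checking are hypothetical.

```python
def swap_path(start, target):
    """Return the orders visited while bubble-sorting start into target."""
    rank = {task: i for i, task in enumerate(target)}
    current = list(start)
    path = [tuple(current)]
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(current) - 1):
            # Adjacent pair that is inverted with respect to the target.
            if rank[current[i]] > rank[current[i + 1]]:
                current[i], current[i + 1] = current[i + 1], current[i]
                path.append(tuple(current))
                swapped = True
    return path


def is_valid_order(order, edges):
    """True if every precedence edge (u, v) has u scheduled before v."""
    pos = {task: i for i, task in enumerate(order)}
    return all(pos[u] < pos[v] for u, v in edges)


edges = [(0, 1), (2, 4), (3, 4)]   # hypothetical task graph
for order in swap_path([2, 3, 0, 4, 1], [0, 2, 3, 1, 4]):
    print(order, is_valid_order(order, edges))
```

The path found this way may differ from the one on the slide, since several sequences of adjacent swaps can connect the same two orders.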
Main Approach – Simulated Annealing
Moves
• M1: Swap two adjacent nodes in both schedule order sequence and resource assignment sequence, if there is an edge between these two nodes in the complement graph
• M2: Swap two adjacent nodes in resource assignment sequence
• M3: Change the resource assignment of a task
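The three moves can be sketched as follows. The sketch assumes that the resource assignment sequence is positional (entry i is the processor of the i-th task in the schedule order), so that M1 keeps each task on its processor while M2 and M3 change the binding. Function names, the complement-graph edge set and the processor list are hypothetical.

```python
import random


def move_m1(order, assign, comp, rng):
    """Swap two adjacent tasks (and their processors) if the pair is an
    edge of the complement graph; leaves the solution unchanged otherwise."""
    candidates = [i for i in range(len(order) - 1)
                  if frozenset((order[i], order[i + 1])) in comp]
    order, assign = list(order), list(assign)
    if candidates:
        i = rng.choice(candidates)
        order[i], order[i + 1] = order[i + 1], order[i]
        assign[i], assign[i + 1] = assign[i + 1], assign[i]
    return order, assign


def move_m2(order, assign, rng):
    """Swap two adjacent entries of the resource assignment sequence only."""
    i = rng.randrange(len(assign) - 1)
    assign = list(assign)
    assign[i], assign[i + 1] = assign[i + 1], assign[i]
    return list(order), assign


def move_m3(order, assign, processors, rng):
    """Move one task to a different processor (needs >= 2 processors)."""
    i = rng.randrange(len(assign))
    assign = list(assign)
    assign[i] = rng.choice([p for p in processors if p != assign[i]])
    return list(order), assign


rng = random.Random(0)
comp = {frozenset(p) for p in [(0, 2), (0, 3), (0, 4), (1, 2),
                               (1, 3), (1, 4), (2, 3)]}  # hypothetical
order, assign = [0, 1, 3, 2, 4], ["P2", "P2", "P2", "P1", "P1"]
print(move_m1(order, assign, comp, rng))
print(move_m2(order, assign, rng))
print(move_m3(order, assign, ["P1", "P2"], rng))
```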
Main Approach – Simulated Annealing
Three moves are defined, so that
• Starting from a valid schedule order A, we are able to reach any other valid schedule order B after a finite number of adjacent swaps
Cost function
• First term guarantees that a schedule meets all tasks’ deadlines (its weight is significantly large)
• Second term indicates the system lifetime
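A minimal sketch of a two-term cost of this kind, under an assumed form: a heavily weighted deadline-violation penalty plus the negated system MTTF, so that a lower cost is better. The exact expression, the function name and the penalty value are assumptions made for illustration.

```python
def schedule_cost(finish_times, deadlines, system_mttf, penalty=1e9):
    """Hypothetical cost: large penalty for missed deadlines minus MTTF.

    finish_times -- dict: task -> finish time in the schedule
    deadlines    -- dict: task -> deadline (tasks without one are skipped)
    system_mttf  -- estimated system lifetime for this schedule
    penalty      -- assumed weight, chosen to dominate any MTTF value
    """
    violation = sum(max(0.0, finish_times[t] - d)
                    for t, d in deadlines.items())
    # First term: zero for feasible schedules, huge otherwise.
    # Second term: longer lifetime lowers the cost.
    return penalty * violation - system_mttf


feasible = schedule_cost({4: 8.0}, {4: 10.0}, system_mttf=200.0)
late = schedule_cost({4: 12.0}, {4: 10.0}, system_mttf=500.0)
print(feasible, late)
```

With a weight this large, any schedule that misses a deadline costs more than any feasible one, regardless of its lifetime.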
Main Approach – Simulated Annealing
Key problem: Computation time
Source of time overhead
• Run temperature simulator EVERY TIME we reach a new solution
• Simulator is called 3×10⁵ times
• Each call traces the temperature variation over the entire service life
• In the range of years
• Accurate calculation requires fine-grained variation trace file
• Significant / within very short time
An efficient cost computation strategy is essential!
SA parameters
• Initial temperature: 10²
• End temperature: 10⁻⁵
• Cooling rate: 0.95
• Iterations: 10³
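The link between these parameters and the number of simulator calls can be checked with a generic simulated annealing loop. This is a sketch assuming geometric cooling and 10³ iterations per temperature level. The function name, the toy cost and the toy neighbor function are hypothetical stand-ins for the schedule cost and the moves M1 to M3.

```python
import math
import random


def simulated_annealing(initial, cost, neighbor, t_start=1e2, t_end=1e-5,
                        cooling=0.95, iters=1000, seed=0):
    """Generic SA loop; returns (best, best_cost, number of cost calls)."""
    rng = random.Random(seed)
    current, current_cost = initial, cost(initial)
    best, best_cost = current, current_cost
    evaluations = 0
    t = t_start
    while t > t_end:
        for _ in range(iters):
            candidate = neighbor(current, rng)
            candidate_cost = cost(candidate)   # one simulator call per move
            evaluations += 1
            delta = candidate_cost - current_cost
            # Always accept improvements; accept worse moves with
            # probability exp(-delta / t).
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                current, current_cost = candidate, candidate_cost
                if current_cost < best_cost:
                    best, best_cost = current, current_cost
        t *= cooling
    return best, best_cost, evaluations


# Toy problem: minimise x^2 over the integers.
best, best_cost, calls = simulated_annealing(
    25, cost=lambda x: x * x,
    neighbor=lambda x, rng: x + rng.choice((-1, 1)))
print(calls)
```

Cooling from 10² to 10⁻⁵ at rate 0.95 takes 315 temperature levels, so the cost function is evaluated 315 × 10³ times, which matches the order of 3×10⁵ simulator calls.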
Revisit System Lifetime Reliability Estimation – Speedup I
It would be better if we could compute MTTF by tracing the temperature variation of only one period
Revisit System Lifetime Reliability Estimation – Speedup I
[Figure: a subdivision of time]
Revisit System Lifetime Reliability Estimation – Speedup I
Given
Aging effect in one period
Property: the aging effect in one period does not vary from period to period
This property enables us to trace the temperature variation of only ONE period
Revisit System Lifetime Reliability Estimation – Speedup I
The expected service life of one processor is
Provided there are no redundant processors in the system, the expected service life of the entire system is
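As a hedged illustration of how MTTF could follow from the aging effect of a single period, the sketch below assumes a Weibull model with a common slope parameter β for all processors, a period length t_p, and a per-period aging effect A defined as the sum of (interval length / scale parameter at that interval's temperature). Under these assumptions R(t) ≈ exp(−(A·t/t_p)^β), and with no redundancy the system behaves as a series system. All names and the model form are assumptions for this sketch.

```python
import math


def processor_mttf(aging_per_period, period, beta):
    """MTTF of one processor: (t_p / A) * Gamma(1 + 1/beta)."""
    return period / aging_per_period * math.gamma(1.0 + 1.0 / beta)


def system_mttf(agings_per_period, period, beta):
    """Series system: it fails as soon as any processor fails.

    The product of the processors' Weibull reliabilities is again Weibull,
    with rate (sum_j (A_j / t_p)^beta)^(1/beta).
    """
    total = sum((a / period) ** beta for a in agings_per_period)
    return math.gamma(1.0 + 1.0 / beta) / total ** (1.0 / beta)


# Hypothetical aging effects of two processors over a period of length 1.
print(processor_mttf(2.0, period=1.0, beta=2.0))
print(system_mttf([2.0, 3.0], period=1.0, beta=2.0))
```

With β = 1 the model reduces to exponential lifetimes whose failure rates add, which gives a quick sanity check of the series-system expression.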
Revisit System Lifetime Reliability Estimation – Speedup II
Given
Instead of computing the aging effect in every period, we propose to compute the aging effect of several periods at one time
Revisit System Lifetime Reliability Estimation – Speedup III
Accurate calculation requires setting the length of time intervals to a very small value
Use steady temperature rather than accurate temporal temperature
[Figure: temperature variation example for a task schedule on processors P1–P3; legend: Task Type 1, Task Type 2]
Revisit System Lifetime Reliability Estimation – Speedup IV
Need to run temperature simulator every time we reach a new solution
There can be at most (m + 1)^n - 1 kinds of processor usage combinations in task schedules, for m task power consumption types and n processors
• Given m = 3, n = 4, we need only 255 pre-computations, each for a steady temperature
Estimate processors’ temperature for various processor usage combinations in pre-calculation phase only
[Figure: task schedule on processors P1–P3; legend: Task Type 1, Task Type 2]
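The count of 255 can be reproduced by enumerating the usage combinations: each of 4 processors is either idle or runs one of 3 task types, and the all-idle case is excluded, giving 4⁴ − 1 = 255. The sketch below also builds a lookup table of steady temperatures. The function names and the stand-in simulator are hypothetical; a real flow would call a thermal simulator instead.

```python
from itertools import product


def usage_combinations(num_processors, num_task_types):
    """All assignments of {idle = 0, type 1..m} to processors, except
    the combination in which every processor is idle."""
    return [combo
            for combo in product(range(num_task_types + 1),
                                 repeat=num_processors)
            if any(combo)]


def precompute_steady_temperatures(num_processors, num_task_types, simulate):
    """Call the (expensive) temperature simulator once per combination."""
    return {combo: simulate(combo)
            for combo in usage_combinations(num_processors, num_task_types)}


def fake_simulator(combo):
    # Hypothetical stand-in: hotter for higher power types, in kelvin.
    return tuple(318.0 + 10.0 * task_type for task_type in combo)


table = precompute_steady_temperatures(4, 3, fake_simulator)
print(len(table))
```

During simulated annealing, a new solution then only needs table lookups, not simulator runs.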
Revisit System Lifetime Reliability Estimation – Speedup IV
Time slot
• The set of processors in use
• The power consumption of the tasks running on these processors
• Categorize the tasks into types according to power consumption
• E.g., a time slot is described by the indices of the processors in use and the power consumption type of the task running on each of them
Revisit System Lifetime Reliability Estimation – Speedup IV
Pre-calculate the steady temperature of each processor in each time slot
The aging effect in unit time in this case is therefore
The aging effect of P1 in this schedule in a period is
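A sketch of how the per-period aging effect of one processor could be accumulated from time slots and pre-calculated steady temperatures. It assumes that the aging effect per unit time is a function of the steady temperature only, and it uses an Arrhenius-style electromigration rate with hypothetical constants as an example. All names and values below are assumptions.

```python
import math

BOLTZMANN_EV = 8.617333262e-5   # Boltzmann constant in eV/K


def em_aging_rate(temp_kelvin, activation_energy=0.9, scale=1.0):
    """Hypothetical electromigration aging per unit time: it grows with
    temperature as exp(-Ea / (k T)). Ea and scale are illustrative."""
    return math.exp(-activation_energy / (BOLTZMANN_EV * temp_kelvin)) / scale


def aging_in_period(slots, steady_temp, proc, aging_rate=em_aging_rate):
    """Sum over time slots of (slot length) x (aging rate at the steady
    temperature of processor `proc` for that slot's usage combination)."""
    return sum(duration * aging_rate(steady_temp[combo][proc])
               for duration, combo in slots)


# Two time slots on a 2-processor platform; combos are (type on P1, type on P2)
# with 0 meaning idle. Temperatures are hypothetical lookup-table entries.
slots = [(2.0, (1, 0)), (3.0, (0, 2))]
steady_temp = {(1, 0): (350.0, 320.0), (0, 2): (325.0, 360.0)}
print(aging_in_period(slots, steady_temp, proc=0))
```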
Revisit System Lifetime Reliability Estimation – Summary
A summary of speedup techniques
• Rewrite MTTF expression in terms of aging effect in one period
• Compute the aging effect of several periods at one time
• Approximate aging effect in one period based on the task changes and using steady temperature
• Call temperature estimation simulator in the pre-calculation phase only
The time spent on pre-calculation can be reduced even further
Experimental Setup
Random task graphs generated by TGFF
• Task numbers range from 20 to 260
Hypothetical MPSoC platforms
• Processor core numbers range from 2 to 8
• Homogeneous / Heterogeneous
Take electromigration model in [Goel-IEEEPress07] as example
• Note that our model also applies to other failure mechanisms
Compare our method with a thermal-aware task scheduling algorithm proposed in [Xie-JVLSISP06]
Accuracy
Comparison between approximated MTTF and accurate value
Lifetime Reliability of Various Platforms with Various Task Graphs
Platform        Task          Deadline  Thermal-aware  SA, 0% DR      SA, 5% DR      SA, 10% DR
M-PE  Co-PE     Task  Edge              MTTF           MTTF   Δ(%)    MTTF   Δ(%)    MTTF   Δ(%)
2     0         22    23      535       492.5          492.5  0       582.3  18.2    582.3  18.2
4     0         49    76      1106      216.1          226.9  5.0     247.3  14.4    263.4  21.9
2     2         49    76      697       137.4          161.3  17.4    171.2  24.6    185.6  35.0
6     0         76    106     918       228.9          239.9  4.8     256.7  12.2    273.3  19.4
2     4         76    106     676       97.2           125.1  28.7    137.9  41.9    150.0  54.4
8     0         131   190     1227      227.2          235.8  3.8     250.9  10.4    265.6  16.9
2     6         131   190     984       88.0           130.4  48.2    143.7  63.3    160.0  81.8
SA: Simulated Annealing
Δ: Difference ratio between MTTF of simulated annealing and that of thermal-aware
DR: Deadline Relaxation
Lifetime Reliability of 8-Processor Platforms
                        8-Core Homogeneous Platform            8-Core Heterogeneous Platform
Task #  Edge #  DR (%)  Deadline  TA MTTF  SA MTTF  Δ(%)       Deadline  TA MTTF  SA MTTF  Δ(%)
101     142     0       1059      240.1    247.8    3.2        809       91.6     129.0    40.8
101     142     5       1059      240.1    264.3    10.1       809       91.6     146.0    59.3
101     142     10      1059      240.1    279.6    16.5       809       91.6     160.5    75.4
131     190     0       1227      227.2    235.8    3.8        984       88.0     130.4    48.2
131     190     5       1227      227.2    250.9    10.4       984       88.0     143.7    63.3
131     190     10      1227      227.2    265.6    16.9       984       88.0     160.0    81.8
251     366     0       2014      191.4    203.4    6.3        1693      85.7     124.2    44.9
251     366     5       2014      191.4    216.6    13.2       1693      85.7     137.9    60.8
251     366     10      2014      191.4    230.2    20.3       1693      85.7     151.1    76.3
TA: Thermal Aware; SA: Simulated Annealing
Efficiency
The simulated annealing process requires 50-200s of CPU time on Intel(R) Core(TM) 2 CPU 2.13GHz for each case
• 4 processors 49 tasks – 84s
• 8 processors 101 tasks – 158s
The CPU time spent on pre-calculation ranges from 3s to 160s
Conclusion
Technology advancement has brought an adverse impact on the lifetime reliability of MPSoC embedded systems
Prior work on task allocation and scheduling does not explicitly take wearout failure into account
We propose an analytical model to estimate the lifetime reliability of multiprocessor platforms under periodical tasks
We present a novel lifetime reliability-aware algorithm based on the simulated annealing technique
We propose several speedup techniques to simplify the design space exploration process with satisfactory solution quality
Experimental results demonstrate the effectiveness
Thank you for your attention!