Royal Institute of Technology
Power consumption optimization of dataflow applications on many-core systems
Emmanouil Komninos
komninos(@)kth.se
August 21, 2011
A master's thesis project
conducted at
Examiner:
Ingo Sander
Supervisors:
Alain Girault
Pascal Fradet
TRITA-ICT-EX-2011:192
Abstract
With the growing need for high-bandwidth digital communications and streaming applications requiring high-quality video and audio encoding, the transition to platforms consisting of hundreds of processors and efficient communication infrastructures is inevitable. DSP applications targeted at such highly parallel platforms are best described by concurrent MoCs, which enable the mapping and scheduling process for such architectures.
Such platforms target embedded devices, which operate under a very constrained energy budget. This project is about the energy-efficient scheduling of DSP applications, described under the dataflow MoC, on many-core platforms. The target platform is the P2012, designed by STMicroelectronics, consisting of 16 nodes interconnected through a 2D-mesh asynchronous NoC. Each node can operate at a different voltage and frequency and can accommodate up to 16 processors. The dataflow MoC considered for describing the aforementioned applications is SDF.
The main contribution of this project is the formal description of the energy minimization problem when such platforms are considered. We demonstrate the difficulties that arise from these architectures and the insufficiency of the existing energy-efficient scheduling approaches, and we propose a way to relax this very complex problem so that existing approaches can be applied.
Acknowledgements
I would like to give special thanks to...
Alain Girault and Pascal Fradet, researchers at INRIA and my Supervisors
for sharing their wisdom on dataflow programming and scheduling
Ingo Sander, Professor at KTH and my Examiner
for being always accessible and giving me regular feedback on my work
Petro Poplavko, post-doc at INRIA
for the long talks that gave clear perspective
Thomas Martin Gawlitza, post-doc at INRIA
for introducing me to complexity theory
INRIA,
for providing a welcoming working environment and the large amounts of coffee
required for this project
My family,
for always supporting me on my decisions and allowing me to accomplish my
goals
...As well as everyone else who listened to my questions and spent the time to
help me along the way of completing this project
List of Abbreviations
Actor Mobility Window AMW
Actor Overlapping Ratio AOR
Adaptive Body Biasing ABB
Available Usable Slack AUS
Digital Signal Processing DSP
Digital to Analogue Converter DAC
Dynamic Voltage Frequency Scaling DVFS
Dynamic Voltage Scaling DVS
Earliest Deadline First EDF
First In First Out FIFO
Forward Body Biasing FBB
Giga Operations Per Second GOPS
Homogeneous Data Flow Graph HDFG
Instruction Set Simulator ISS
Integer Linear Programming ILP
Local Power Management LPM
Locally Adaptive Voltage and Frequency Scaling LAVFS
Low Voltage Transistor LVT
Maximum Usable Slack MUS
Model of Computation MoC
Multi Carrier-Code Division Multiple Access MC-CDMA
Multi-Processor System on Chip MPSoC
Multiple Input Multiple Output MIMO
Multiple Threshold CMOS MTCMOS
Network on Chip NoC
Nondeterministic Polynomial NP
Orthogonal Frequency-Division Multiplexing OFDM
Power Shut down PS
Power Supply Unit PSU
Processing Element PE
Reverse Body Biasing RBB
Scenario Aware Data Flow SADF
Super Cut off CMOS SCCMOS
Synchronous Data Flow Graph SDFG
System on Chip SoC
Ultra Cut Off UCO
Voltage Frequency Domain VFD
Voltage Frequency Scaling VFS
Worst Case Execution Cycles WCEC
Worst Case Execution Time WCET
Worst Fit Decreasing WFD
List of Figures
1.1 VC-1 decoder's algorithm schematic [4]
1.2 4More functional diagram [24]
1.3 MIMO OFDM mapping [24]
2.1 (a) Consistent SDFG, (b) its topology matrix
2.2 (a) Inconsistent SDFG, (b) its topology matrix
2.3 Iteration of SDF 2.1a
2.4 Expansion of an edge in an SDFG
2.5 HDFG equivalent of 2.1a
3.1 Trade-off of generality against run-time overhead and implementation complexity
3.2 (a) Task set, (b) Task-core mapping
3.3 EDF scheduling, f = 1
3.4 SimpleVS, f = 7/12
3.5 Move slack backwards, f = 7/12
3.6 Evolution of segment 1
3.7 After migration, sg0.f = 31/72, sg1.f, sg2.f, sg3.f = 7/12
4.1 (a) HDFG, (b) Mapping and WCEC of each actor
4.2 Individually managed PEs: (a) f = 1 for all actors, (b) frequency scaling to f = 1/3 for actor 3, (c) case for clustered PEs
4.3 The P2012 fabric
4.4 NoC Unit Architecture - VFD
4.5 Power Supply Unit
4.6 Dithering principle
5.1 From Gs to G∗s (ALAPs(3) < ALAPs(4))
5.2 Adding WCET to edges
5.3 (a) Sample HDFG, (b) the corresponding binding
5.4 The schedule of HSDFG of fig. 5.3a on nominal frequency
5.5 (a) DVFS in segment [5.25, 7.525] on VFD1, (b) DVFS in segment [3, 4] on VFD2
5.6 (a) Sample HDFG and (b) the corresponding binding
5.7 The schedule of HSDFG of fig. 5.6a on nominal frequency
5.8 Scaling the frequency to fmax/4 on the invocation of actor 3 and to fmax on the invocation of actor 4
5.9 (a) Sample HDFG, (b) the corresponding binding
5.10 The schedule of HSDFG of fig. 5.9a on nominal frequency
5.11 The schedule of HSDFG of fig. 5.9a on nominal frequency; the limits of the AMW for actors 4 and 6 are noted with dashed green lines
5.12 ALAPs-based schedule of HSDFG of fig. 5.9a on nominal frequency
5.13 (a) Sample HSDFG and (b) WCEC and binding information
5.14 The clustering procedure
5.15 The clustered Gs(T′, E′) from the HSDF in figure 5.13
List of Tables
3.1 Overview of the assumptions in the related work
5.1 The OLs and AORs of actors from the HSDF of figure 5.9a
5.2 The OLs and AORs of actors from the HSDF of figure 5.13a
Contents
List of Abbreviations
1 Introduction
  1.1 Background
  1.2 Problem formulation
  1.3 Contributions
2 Data Flow Graphs
  2.1 Synchronous Data Flow Graphs
      Notation
      Consistency of SDFGs
  2.2 Constructing an Equivalent HDFG
3 Scheduling
  3.1 Scheduling Taxonomy
    3.1.1 Fully Dynamic and Fully Static
    3.1.2 Self-timed and Static assignment
    3.1.3 Quasi-static and Ordered-transactions
    3.1.4 Complexity
  3.2 Notation
    3.2.1 Elaboration on Execution Time
  3.3 Ordering Assignment
      As Late As Possible (ALAP) times
      As Soon As Possible (ASAP) times
  3.4 Scheduling Heuristics
    3.4.1 List Scheduling
    3.4.2 Low power scheduling approaches
      Multi-processor Architectures
      Multi-core processor Architectures
    3.4.3 Discussion on the related work
4 Platform Power Management
  4.1 Power Basics
    4.1.1 Charging/Discharging of Capacitive loads
    4.1.2 Short Circuit Currents
    4.1.3 Leakage Currents
    4.1.4 Total Energy Dissipation
      Idle energy dissipation
      Actor energy dissipation
      Schedule energy dissipation
  4.2 Platform Architecture
    4.2.1 Cluster Power Management
    4.2.2 Dynamic Power Management
      VDD hopping
    4.2.3 Leakage Power Management
    4.2.4 NoC Interconnect
      Energy dissipation and Latency
  4.3 Assumptions
      Architectural assumptions
      Application modeling assumptions
  4.4 Energy Dissipation Refinement
    4.4.1 Definitions Overview
      Architecture definitions
      Data-flow graph definitions
5 Energy Efficient Scheduling
  5.1 Constraint problem formulation
    5.1.1 Objective Function
    5.1.2 Deriving the constraints
      Graph transformation
      Segmentation
      Timing constraints
      Computation of favg
      Precedence constraints
    5.1.3 Discussion
      Multiple VFDs and variable P(τ)
      Complexity due to variable P(τ)
      Conclusion
  5.2 Our proposal
    5.2.1 Useful terms
      Maximum Usable Slack
      Available Usable Slack
      Actor Mobility Window
      Actor Overlapping Ratio
    5.2.2 The algorithm
      Scheduling
      Actor Shifting
      The Shifting algorithm
    5.2.3 Clustering
    5.2.4 DVFS Scheduling
    5.2.5 Extension of PathDVS
  5.3 Conclusion
6 Future Work
  6.1 Validation of the proposal
  6.2 Extension of the MoC
  6.3 Extension to other platforms
  6.4 Extension to multi-criteria scheduling heuristics
A Pseudo Algorithms
1
Introduction
1.1 Background
With the growing need for high-bandwidth digital communications and streaming applications requiring high-quality video and audio encoding, the transition to multi-/many-core platforms in embedded systems is inevitable. Although multi-core architectures featuring general-purpose CPUs have become mainstream, the need for multi-GOPS performance to satisfy such high-end, computationally intensive functionality demands the transition to highly heterogeneous SoC platforms. Such platforms often incorporate asynchronous NoCs, customizable CPUs, domain-specific accelerators and multiple voltage-frequency domains organized in so-called DVFS islands.
Sequential programming languages such as C impose limitations when it comes to mapping onto parallel hardware. Instead, such algorithms are usually represented by concurrent models of computation (MoCs). A model of computation is an abstraction of a computational system: it describes how the various computation processes interact [12]. The applications in this work are described by dataflow MoCs, where the computational blocks (actors) are ordered and operate on data queued on the intermediate edges. Using the dataflow paradigm for programming computationally intensive and usually time- and/or latency-constrained applications, the data dependencies between tasks as well as the inherent parallelism can easily be expressed and exploited. This representation of algorithms, usually depicted as a flow graph, is analogous to the operation and the structure of the underlying hardware, thus facilitating the mapping and scheduling to the platform under consideration. Figure 1.2 shows a mobile-terminal MC-CDMA chain applied to the future 4G telecommunication standard; this application is divided into 21 cores. Figure 1.1 shows a functional representation of the VC-1 video codec found in HD DVDs and Blu-ray discs. With this representation, constraints such as throughput, timing and latency can also be modeled more intuitively, enabling the application of heuristics to optimize one or more criteria such as the power consumption or the schedule timespan. Figure 1.3 shows a mapping of the MIMO-OFDM technique, used in the 4G telecommunication standard, on a multi-processor SoC.
Regarding dataflow programming, various dataflow models have been proposed over the years, each with its own advantages and drawbacks in terms of expressive power and predictability. In the synchronous dataflow (SDF) MoC [27], the number of tokens produced and consumed at the ports of the actors is known at compile time. This reduces the expressiveness of the model while increasing its predictability (verification and optimization). At the other end is the dynamic dataflow (DDF) MoC [8], which supports data-dependent behavior and unknown rates of token production and consumption at the cost of almost no predictability.
While dataflow modeling is extremely convenient for programming computationally intensive DSP applications, when it comes to mapping onto many-core SoCs one would also want to take advantage of the cutting-edge power management capabilities supported by such platforms. Scheduling of tasks with respect to the minimization of some criteria (power consumption and/or execution time) can make use of such mechanisms in order to scale the voltage efficiently. State-of-the-art power management
Figure 1.1: VC-1 decoder’s algorithm schematic [4]
Figure 1.2: 4More functional diagram [24]
engines include dynamic voltage and frequency scaling based on VDD hopping [33]. The platform under consideration for this master's thesis is a new many-core platform designed by STMicroelectronics, called P2012 [1]. Its voltage scaling scheme is based on the VDD-hopping principle: a virtual voltage point between two supply voltage points (high and low) is reached by dithering between them with a given duty cycle. The frequency can be reprogrammed in less than 200 ns.
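The dithering principle above can be illustrated with a short sketch. This is our own simplified model, not the actual P2012 controller: the average frequency over a hop period is the duty-cycle-weighted mean of the two frequency points, and hop overhead is neglected.

```python
def dithering_duty_cycle(f_target, f_low, f_high):
    """Return the fraction of time r to spend at f_high so that dithering
    between f_low and f_high yields an average frequency f_target:
        f_target = r * f_high + (1 - r) * f_low
    Illustrative model only; switching overhead (< 200 ns per hop on
    P2012) is neglected."""
    if f_high <= f_low:
        raise ValueError("need f_low < f_high")
    if not f_low <= f_target <= f_high:
        raise ValueError("target frequency outside the reachable range")
    return (f_target - f_low) / (f_high - f_low)
```

For example, with hypothetical points f_low = 400 MHz and f_high = 600 MHz, a virtual point of 550 MHz is obtained with a duty cycle of 0.75.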
1.2 Problem formulation
Given a dataflow graph consisting of a set of actors with known WCETs and a static binding to processing elements, an off-line schedule is to be found that minimizes the power consumption.
Figure 1.3: MIMO OFDM mapping [24]
To this aim, the power consumption modeling and the completion-time analysis of a static synchronous dataflow schedule are to be studied. More precisely, the factors that affect the power consumption (i.e. the static and dynamic energy dissipation of the PEs, and the communication costs introduced by the NoC), and which of them can be minimized, have to be identified in order for the power consumption model to be used for efficient scheduling. Moving one step forward from modeling the power consumption, given a set of ordered actors with a static binding, a scheduling algorithm that minimizes the power consumption is to be found. Scheduling with respect to power optimization may incorporate a permutation of an admissible schedule (based on the WCET of each actor) in order to maximize the idle intervals within a time frame and distribute the resulting slack among actors [30]. The time frame is usually chosen such that a number of constraints, depending on the application, are met. Such constraints might be the quality of a video encoding system or a hard deadline for a safety-critical system. The characteristics of P2012 will also be taken into consideration.
The applications that we are focusing on operate on long streams of data. The execution time (or even the invocation) of the actors that model different parts (functions) of these applications is heavily data dependent and consequently varying. Moreover, data rates can also change at run time, affecting the production and consumption of tokens. Examples of dataflow graphs with higher expressive power that can model such behavior are PSDF [7] and HDF [14], to name a few. Since the goal is to minimize the power consumption, starting from an SDF representation of the algorithm and a static binding to the underlying platform, we first derive the necessary formulas to describe the ordering between actors that must be preserved for the correct execution of the algorithm. These formulas will be used as constraints for the optimization function. The work continues with the derivation of the objective function, that is, the energy dissipation formula. The optimization problem is described later on and is compared against the related work. From this comparison we will show that existing approaches for energy minimization cannot be applied to our case. Finally, we propose a heuristic to overcome the difficulties that arise from the underlying platform.
1.3 Contributions
With this work, our contributions towards the energy-efficient scheduling of dataflow graphs are threefold:
• We study the energy-efficient scheduling of dataflow graphs on high-end computing platforms consisting of hundreds of cores organized in VFDs. At the time this work was documented, there was no known published work assuming similar application and platform models. To this end, we contribute by formulating the optimization problem of energy-efficient scheduling of dataflow graphs on multiple VFDs.
• After formalizing the problem at hand, we reveal the difficulties that arise when platforms like P2012 are considered for the mapping and energy-efficient scheduling of dataflow graphs. Highlighting these difficulties also allows us to argue about the inefficiency of the approaches found in the related work.
• Finally, we propose a way to relax this very complex problem to the known and well-studied one of energy-efficient scheduling of dataflow graphs on multi-processor systems. The result of this method is a new graph, in which each actor might be the result of a clustering, as we will describe in section 5.2.3. We can then abstract each VFD as an individually managed PE and apply any of the existing approaches for energy-efficient scheduling. Last but not least, we also propose an extension that can be applied to existing heuristics in the case that actor suspension is not allowed.
2
Data Flow Graphs
The applications that we are studying are modeled under the data flow paradigm. Two very common data flow models, which differ in the production and consumption rates of their actors, are synchronous data flow graphs (SDFGs) and homogeneous data flow graphs (HSDFGs). In this chapter we give the definitions and the notation needed for the application model under consideration in this work.
2.1 Synchronous Data Flow Graphs
Notation
• T is the set of actors. Actors operate on input data streams.
• E ⊆ T × T is the set of edges. Data streams between actors are exchanged through these edges. Since edges are directed, with the notation e(τ, z) ∈ E, τ, z ∈ T, we denote the edge directed from actor τ to actor z. These edges thus represent the data dependencies between actors. Every edge e ∈ E has precisely one source and one destination: ∀e(τ, z) ∈ E, ∃τ, z ∈ T such that src(e) = τ and dst(e) = z. We associate with each edge a production rate, Prate : src(e) → N+, for producing tokens and a consumption rate, Crate : dst(e) → N+, for consuming tokens.
• d : E → N represents the number of initial tokens (delays) on an edge e; these tokens represent the data dependencies across iterations of G.
• WCEC : T, E → N represents the worst-case execution cycles needed by an actor to complete its execution. The same function also returns the total cycles needed for a communication between two actors.
• We define a path p(τ, z), τ, z ∈ T, directed from τ to z, to be a finite nonempty sequence of edges: p(τ, z) ⊆ E ∧ p(τ, z) ≠ ∅.
• We define the firing time as start(τ, k) ∈ N+, with k denoting the kth invocation of each actor. Similarly, we define end(τ, k) ∈ N+ to represent the completion of the kth invocation of actor τ.
An SDFG where Prate = Crate = 1 for all e ∈ E is called a homogeneous data flow graph (HSDFG).
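The notation above maps directly onto a minimal data structure. The following sketch is our own and is only meant to fix the notation in code; all names are ours.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    """A directed SDF edge e(src, dst) with its rates and initial tokens."""
    src: str          # producing actor tau
    dst: str          # consuming actor z
    prate: int        # Prate: tokens produced per firing of src
    crate: int        # Crate: tokens consumed per firing of dst
    delay: int = 0    # d(e): initial tokens on the edge


@dataclass
class SDFG:
    """An SDF graph: actors T, edges E and WCEC information."""
    actors: set
    edges: list
    wcec: dict        # WCEC per actor (and per edge, for communications)

    def is_homogeneous(self):
        # An SDFG with all production/consumption rates equal to 1
        # is a homogeneous data flow graph (HSDFG).
        return all(e.prate == 1 and e.crate == 1 for e in self.edges)
```

For instance, an edge with Prate = 2 and Crate = 1 makes the graph non-homogeneous, while a graph whose rates are all 1 is an HSDFG regardless of its delays.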
Figure 2.1: (a) Consistent SDFG (graph diagram omitted), (b) its topology matrix:

    Γ = [  1  −2   0
           0   1  −1
           0  −1   1
          −1   0   2 ]
Figure 2.2: (a) Inconsistent SDFG (graph diagram omitted), (b) its topology matrix:

    Γ = [ −1   1   0
           0   1  −1
          −2   0   1 ]
Consistency of SDFGs
SDFGs are usually characterized by a topology matrix of dimension |E| × |T|. The entries in this matrix represent the production (consumption) rates of tokens on the corresponding edges. An example SDFG with its corresponding topology matrix is shown in figures 2.1a and 2.1b. A positive (i, j) entry in the topology matrix indicates the number of tokens produced by actor j on edge i. A negative entry, similarly, indicates the number of tokens consumed. A zero entry indicates that there is no connection between the edge and the actor.
It is proven in [26] that a sequential schedule can be constructed for an SDFG G if the rank of the topology matrix is one less than the number of actors in the graph, i.e.

rank(Γ) = |T| − 1 (2.1)

Such an SDFG is called consistent. The SDFG of figure 2.1a, with the topology matrix shown in figure 2.1b, is consistent with rank(Γ) = 2. An example of an inconsistent SDFG is shown in figure 2.2, with rank(Γ) = 3. For the topology matrix it is also proven that:

rank(Γ) ≥ |T| − 1 (2.2)
If (2.1) holds, then there is a positive integer vector q in the null space of the topology matrix, called the repetition vector. The entries of the repetition vector indicate the number of invocations of each actor in each iteration of the schedule. For the repetition vector q it holds that:

Γq = O (2.3)

with O being the zero vector. For the SDFG in figure 2.1a, the repetition vector can be found by solving the equation:

    [  1  −2   0 ]   [ q(1) ]
    [  0   1  −1 ] · [ q(2) ] = O
    [  0  −1   1 ]   [ q(3) ]
    [ −1   0   2 ]

The repetition vector can be found to be:

    q = ( 2, 1, 1 )ᵀ

The above equation implies that the buffers needed for the inter-actor communications are bounded.
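Equation (2.3) suggests a direct procedure for computing q: find the smallest positive integer vector in the null space of Γ. The sketch below is our own code (not from the thesis), using exact rational Gauss-Jordan elimination from the standard library.

```python
from fractions import Fraction
from math import gcd


def repetition_vector(gamma):
    """Smallest positive integer vector q with gamma * q = 0 (eq. 2.3),
    or None if the graph is inconsistent (rank(gamma) != |T| - 1)."""
    rows = [[Fraction(x) for x in row] for row in gamma]
    n = len(rows[0])                      # number of actors |T|
    pivots, r = [], 0
    for c in range(n):                    # Gauss-Jordan elimination
        pivot = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        rows[r] = [x / rows[r][c] for x in rows[r]]
        for i in range(len(rows)):
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append((r, c))
        r += 1
    if r != n - 1:                        # consistency check: rank = |T| - 1
        return None
    free = next(c for c in range(n) if c not in {pc for _, pc in pivots})
    q = [Fraction(0)] * n
    q[free] = Fraction(1)
    for pr, pc in pivots:                 # express pivot variables via the free one
        q[pc] = -rows[pr][free]
    lcm = 1
    for x in q:                           # scale to the smallest integers
        lcm = lcm * x.denominator // gcd(lcm, x.denominator)
    ints = [int(x * lcm) for x in q]
    if any(v < 0 for v in ints):          # entries share a sign for a
        ints = [-v for v in ints]         # connected, consistent graph
    g = 0
    for v in ints:
        g = gcd(g, v)
    return [v // g for v in ints]
```

For the topology matrix of figure 2.1b this returns q = (2, 1, 1), matching the vector above, and it returns None for the inconsistent matrix of figure 2.2b.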
Figure 2.3: Iteration of SDF 2.1a (firing sequence 1, 1, 2, 3 over time units 0–4 on one PE; diagram omitted)
An iteration of a data-flow graph is then defined as q(τ) invocations of each actor τ that appears in the graph. For the HSDF case, q(τ) = 1 for all actors τ in the HSDFG. One iteration of the SDF graph of figure 2.1a is shown in figure 2.3. A finite sequence of actor invocations that respects the precedence constraints and produces no net change in the number of tokens accumulated on the edges is called an admissible schedule.
We can now define the data precedence constraint with respect to the start and end functions for the HSDF case as:

start(z, k) ≥ end(τ, k − d(e(τ, z))), ∀k > d(e(τ, z)), τ, z ∈ T, e(τ, z) ∈ E (2.4)

Since there are already d(e(τ, z)) delay tokens on the edge directed from actor τ to actor z, the latter can be invoked at most d(e(τ, z)) times before or without any invocation of actor τ. However, since one iteration of a schedule requires one invocation of every actor, the (d(e(τ, z)) + 1)th invocation of actor z can only take place after the completion of the first invocation of actor τ. In this way the precedence constraints are met.
2.2 Constructing an Equivalent HDFG
The work presented in the following chapters focuses on HSDF graphs because of the simplicity of their communications. In this case, the worst-case latency of a communication between two actors is sufficient to derive the formulas for the earliest possible starting time, ASAPs, and the latest possible starting time, ALAPs, presented in equations 3.10 and 3.4 respectively. However, our work can also be applied to general SDF graphs: it suffices to expand the SDF graph into its equivalent HSDF graph, with a process similar to the one presented in [40]. Starting from the repetition vector q, the equivalent HSDF graph will contain q(τ) copies of each actor τ ∈ T. Each copy of τ will be the source of Prate(src(e)) edges in the equivalent HSDF graph, where e is an edge directed from actor τ to another actor in the SDFG. Similarly, each copy of τ will also be connected to Crate(dst(e)) incoming edges.
An example of an edge expansion is shown in figure 2.4. The new graph contains three copies of actor 2, since q(2) = 3, and one copy of actor 1. Actor 1 is the source of Prate(src(e12)) = 3 edges, while each copy of actor 2 is connected to Crate(dst(e12)) = 1 incoming edge.
The equivalent HDFG of figure 2.1a appears in figure 2.5.
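The expansion of a single SDF edge can be sketched as follows. This is our own formulation of the construction from [40], with 0-based copy and token indices; inter-iteration dependencies are reported as a delay count on the resulting HSDF edge.

```python
def expand_edge(p, c, d, q_src, q_dst):
    """Expand one SDF edge with production rate p, consumption rate c
    and d initial tokens into HSDF edges between actor copies.
    q_src / q_dst are the repetition counts of producer and consumer.
    Returns (producer copy, consumer copy, HSDF delay) triples."""
    edges = []
    for i in range(q_src):        # i-th copy of the producing actor
        for j in range(p):        # j-th token produced by that copy
            n = i * p + j + d     # absolute index of the token
            k = n // c            # index of the consumption using token n
            edges.append((i, k % q_dst, k // q_dst))
    return edges
```

For the edge of figure 2.4 (p = 3, c = 1, d = 0, q = (1, 3)), expand_edge(3, 1, 0, 1, 3) yields one edge from the single copy of actor 1 to each of the three copies of actor 2, matching the figure.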
Figure 2.4: Expansion of an edge in an SDFG (graph diagrams omitted)

Figure 2.5: HDFG equivalent of 2.1a (graph diagram omitted)
3
Scheduling
This chapter describes the techniques used for scheduling dataflow graphs on multi-processor systems. After an overview of the most commonly adopted techniques, we describe how to take the energy consumption into account during scheduling.
3.1 Scheduling Taxonomy
Scheduling is often divided into three main phases [40, 25]:
• Binding or Mapping: the process of assigning actors to processing elements. This step adds resource constraints between actors sharing the same processing element.
• Ordering: the process of defining the exact firing order of the actors. Apart from the dataflow graph that imposes the data precedence constraints (2.4), information on the exact mapping of the actors is also required to define the firing order.
• Timing: the process of determining when each actor should fire to satisfy all the data and resource precedence constraints.
The optimal scheduling of a dataflow graph with respect to the schedule's length on a multi-processor platform is known to be an NP-hard problem [11]. In the following we present different scheduling strategies; the classification is done according to when binding, ordering and timing take place.
3.1.1 Fully Dynamic and Fully Static
A scheduling strategy is fully dynamic if all the above steps take place at run-time. Such an
approach would be ideal when highly dynamic actor behavior is expected, and it is the most general
in terms of applicability. However, the cost of being able to exploit the run-time variability
in the execution time of actors (or the variability in a processor's workload) is high, so such
an approach is used when the timing constraints are loose. On the other end, in the fully static
scheduling strategy, all steps take place at compile time. We can further distinguish blocked and
overlapped fully static schedules. In the first case, the inter-iteration dependencies are neglected
and the dataflow graph is scheduled as if it executed for only one iteration. To take the
inter-iteration dependencies into account, unfolding and re-timing can be applied to the dataflow
graph. With unfolding, N iterations of the dataflow graph are scheduled together. While unfolding
often leads to improved blocked schedules, in terms of schedule length, it requires an increase
(by a factor of N) of the program's memory. With re-timing, the delays in the dataflow graph are
manipulated in such a way that the critical path is reduced.
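As an illustration of unfolding, the sketch below (with a made-up graph representation, not taken from any particular tool) replicates the actors of an HSDF graph over N iterations; an edge with d initial delays connects iteration k of its source to iteration k + d of its destination, and crossings of the N-iteration boundary become delays of the unfolded graph.

```python
# Hypothetical HSDF representation: actors are strings, edges are
# (src, dst, delay) triples. Unfolding by N schedules N iterations together.
def unfold(actors, edges, N):
    u_actors = [f"{a}@{k}" for k in range(N) for a in actors]
    u_edges = []
    for (src, dst, d) in edges:
        for k in range(N):
            kd = k + d
            # iteration k of src feeds iteration k + d of dst; crossings of
            # the N-iteration boundary become delays of the unfolded graph
            u_edges.append((f"{src}@{k}", f"{dst}@{kd % N}", kd // N))
    return u_actors, u_edges

# two-actor graph A -> B with one delay, unfolded by a factor of 2
acts, eds = unfold(["A", "B"], [("A", "B", 1)], 2)
```

The unfolded graph exposes both iterations for scheduling; the remaining delay on the wrapped-around edge preserves the inter-iteration dependency.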
Of course, each of these strategies has its own advantages and disadvantages. If the objective is to
reduce the run-time overhead imposed by the scheduler's computations, then the fully static
methodology is appropriate. However, fully static schedules are viable only if there are tight
bounds on the estimates of the worst-case actor execution times. Such an approach is useful for
hard real-time systems.
3.1.2 Self-timed and Static assignment
Which of the steps can be done at compile time is determined by the amount of information available
about the application. In between the two approaches mentioned above, we identify the self-timed
strategy and the static-assignment strategy.
The self-timed approach imposes mapping and ordering at compile time, while the timing of each
actor is determined at run-time based on the availability of the required input data. Such a
strategy is ideal for compensating for fluctuations in the execution times of actors. Under the
self-timed scheduling approach, a fully static schedule is obtained first, using a heuristic
algorithm. After obtaining the mapping and ordering information from the fully static schedule, we
discard the timing information. In the end, a list of actors is assigned to each processor, while
their exact invocation times are determined at run-time. Compared to the fully static strategy, a
self-timed scheduling approach will perform at least as well when the synchronization overhead is
negligible. This overhead mainly stems from scheduling the communication actors that are essential
for inter-processor synchronization. Apart from the synchronization overhead, the arbitration
overhead should also be taken into account.
Relaxing the self-timed approach by also performing the ordering at run-time results in the
static-assignment strategy. Following this approach, the ordering of actors can be decided at
run-time. Although a possible re-ordering might result in a reduced computation interval, deciding
which actor should be fired is not easy, especially when there are many possible combinations.
Compared to the fully static scheduling approach, where each actor is guaranteed to get a resource
in a given time interval, in self-timed scheduling actors sharing the same resource must arbitrate
at run-time to gain access.
3.1.3 Quasi-static and Ordered-transactions
Apart from the four approaches described above (presented in [25]), two more approaches can be
identified when the scheduling of inter-processor communication or the conditional execution of
actors is also taken into account. In the first case, we have the ordered-transactions approach,
where the inter-processor communication ordering is defined at compile time and imposed at
run-time. Intuitively, the ordered-transactions strategy is a self-timed schedule with additional
transaction-order constraints. The second case corresponds to actors containing conditional
branches, such as if-then-else constructs and while loops, which make the execution time, or even
the firing itself, data-dependent. The key idea behind the quasi-static scheduling approach is to
optimize the average execution time of the overall computation. Based on a statistical model for
the control variables (such as the average execution time), an execution profile can be defined
and selected at run-time.
3.1.4 Complexity
The goal of scheduling dataflow graphs on a multi-processor platform is, as described earlier, to
define the binding, ordering and timing of actors in such a way that an objective is optimized.
Typically, such objectives include the makespan of the schedule and/or the energy consumption (as
in our case). The makespan is the average iteration period of the schedule, and a lower bound on it
is imposed by the critical path of the graph, i.e., the longest delay-free path in the graph. It is
evident that, when the inter-processor communication cost is taken into account, this lower bound
is also affected by the binding of actors to PEs and by architecture-dependent characteristics of
the platform, i.e., the communication infrastructure between the PEs. Because optimal scheduling on
multi-processor platforms is NP-hard, heuristics have been proposed to provide near-optimal
results. Well-known heuristics are the critical path heuristic, the list scheduling method and the
graph decomposition method, summarized in [20].

Figure 3.1: Trade-off of generality against run-time overhead and implementation complexity
3.2 Notation
A fully static schedule SCH for PE processors specifies the triple:

SCH = {BT(τ), S, Tsch}    (3.1)

where BT is the binding function that associates actors with PEs, S is the function returning the
actor-specific firing moments and Tsch is the iteration period. We use the same notation as in [40]
and we will deal with HSDF graphs. In a homogeneous SDF graph, each actor is invoked only once per
iteration, so S(τ) intuitively denotes the time of this unique invocation. In a fully static
schedule, the following equation gives the start time of actor τ during iteration k:

start(τ, k) = S(τ) + k · Tsch    (3.2)

Here k · Tsch represents the start time of the kth iteration of the schedule; with the above formula
we retrieve the start time of the kth invocation of actor τ.
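Equation (3.2) can be read off directly in code; the schedule values below are a made-up example, not one from this thesis.

```python
# S maps each actor to its firing moment within one iteration; Tsch is the
# iteration period of the fully static schedule (both values hypothetical).
S = {"A": 0, "B": 3, "C": 5}
Tsch = 10

def start(actor, k):
    """Start time of the k-th invocation of `actor`, per equation (3.2)."""
    return S[actor] + k * Tsch
```

For instance, start("B", 2) evaluates to 3 + 2 · 10 = 23.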
A schedule is said to be admissible if it satisfies all the precedence constraints (2.4) imposed by
the dataflow graph. In HSDF graphs, since an actor may have more than one incoming edge, we change
the precedence constraint defined in (2.4) to:

start(τ, k) ≥ max_{e ∈ dst⁻¹(τ)} (start(e, k) + WCET(e))    (3.3)

with start(e, k) = end(src(e), k − d(src(e), τ)), for all k ≥ d(src(e), τ). In the above
inequality, the start and end functions give the exact invocation and completion times of an actor,
and dst⁻¹(τ) returns all the edges from E that are directed to τ. Intuitively, (3.3) constrains the
start time of an actor to be later than the end time (start(e, k) + WCET(e)) of all incoming edges.
Finally, src(e) returns the source actor of an edge.
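A hedged sketch of the admissibility check (3.3) follows; the graph encoding and the lookup functions `start_of` and `end_of` are stand-ins for a candidate schedule, not part of the formal notation.

```python
# `incoming` maps each actor to its incoming edges as (src, wcet_edge, delay)
# triples; `start_of` and `end_of` are stand-in lookup functions.
def admissible(incoming, start_of, end_of, iterations):
    for tau, edges in incoming.items():
        for (src, wcet_e, delay) in edges:
            for k in range(delay, iterations):
                # the token for iteration k is produced by invocation k - delay
                token_ready = end_of(src, k - delay) + wcet_e
                if start_of(tau, k) < token_ready:
                    return False
    return True

# toy fully static schedule: A fires at 0 (WCET 2), B at 3, period 10
S, W, Tsch = {"A": 0, "B": 3}, {"A": 2, "B": 1}, 10
start_of = lambda a, k: S[a] + k * Tsch
end_of = lambda a, k: S[a] + W[a] + k * Tsch
ok = admissible({"B": [("A", 1, 0)]}, start_of, end_of, iterations=3)
```

Here `ok` is true: B starts at time 3, exactly when A's token (end time 2 plus edge WCET 1) becomes available.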
3.2.1 Elaboration on Execution Time
In order for static assignment and scheduling techniques to be valid, reasonably good estimates of
the execution times of actors should be available at compile time. Most of the time, the execution
time of an actor is data-dependent, which causes variation in the execution time between different
iterations of the dataflow graph. As long as these variations are rare or small, static scheduling
techniques are viable. Indeed, it is difficult to determine a worst-case bound for the execution
times of actors, as cache misses and corner-case inputs might occur. However, it is still possible
to obtain reasonably good execution time estimates. It is usual for the programmer to derive a
mathematical model associated with some actor parameters. Such parameters may include the block
size for processing a video frame or the number of coefficients in a FIR filter. Such an approach
is used in the Ptolemy project [36]. This is feasible for actors written in low-level languages
(e.g., assembly), and the estimates can be obtained through profiling on an instruction set
simulator (ISS).
3.3 Ordering Assignment
As Late As Possible (ALAP) times
For the list scheduling approach, an initial ordering assignment has to be performed. It is shown
in [23] that ordering actors based on the latest possible start time provides comparable or better
results than most other ordering metrics. ALAPs(τ) denotes the latest possible start time of an
actor that will not cause a timing violation and can be defined as:

ALAPs(τ) = ALAPf(τ) − WCET(τ)    (3.4)

Since we assume that WCEC(τ) (the worst-case execution cycles) is known, we can obtain WCET(τ) at
operating frequency f by the following relation:

WCET(τ, f) = WCEC(τ) / f    (3.5)
To compute ALAPf of actor τ, we should take into consideration the precedence constraints from the
dataflow graph as well as the binding of actors to PEs. To form the equation for computing ALAPf,
we will use the functions src and dst, which return the source and destination actor of a directed
edge respectively. We will also denote by D the deadline for one iteration of an admissible
schedule.

ALAPf(τ) = min( D, min_{e ∈ src⁻¹(τ)} ALAPs(e), min_{z ∈ succ(τ)} ALAPs(z) )    (3.6)

ALAPs(e) = ALAPs(dst(e)) − WCET(e)    (3.7)

WCET(e) = (|BE(e)| + Prate(src(e)) · Nflits/token) · Lflit/hop    (3.8)
Here succ(τ) refers to all direct successors of actor τ, and BE(e), Nflits/token and Lflit/hop are
defined in 4.4.1, after the platform under consideration is described. Equation (3.8) describes the
worst-case delay for a token to reach its destination and is purely dependent on the communication
infrastructure of the underlying hardware. Equation (3.6) constrains the finish time of an actor.
This constraint depends on the deadline D, which is actually the timespan of the frame, as well as
on the latest start times ALAPs of all direct successors of actor τ. These successors may come from
the dataflow graph or from the binding information: actors that are bound to the same processing
element and are to be executed sequentially are assumed to have a direct dependency, even if no
direct edge in the dataflow graph connects them. Thus, to compute the latest possible start (and,
respectively, finish) times of all actors, we should traverse the graph backwards. Consequently,
actors with no direct successors will have ALAPf = D.
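The backward traversal described above can be sketched as follows; per-edge communication delays are assumed already folded into the edge WCETs, and the graph is a toy example rather than one from the thesis.

```python
# Backward ALAP pass (equations 3.4-3.6) over a DAG. succ[a] lists the direct
# successors of actor a as (successor, edge_wcet) pairs; D is the deadline.
def alap_times(succ, wcet, D):
    alap_s, alap_f = {}, {}

    def visit(a):
        if a in alap_f:
            return
        f = D  # actors with no successors may finish at the deadline
        for (b, edge_wcet) in succ[a]:
            visit(b)
            # the edge to b must complete edge_wcet before b's latest start
            f = min(f, alap_s[b] - edge_wcet)
        alap_f[a] = f
        alap_s[a] = f - wcet[a]

    for a in succ:
        visit(a)
    return alap_s, alap_f

alap_s, alap_f = alap_times({"A": [("B", 1)], "B": []},
                            wcet={"A": 2, "B": 3}, D=10)
```

With D = 10, B (a sink) gets ALAPf = 10 and ALAPs = 7, so A must finish by 7 − 1 = 6 and start by 4.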
As Soon As Possible (ASAP) times
In a similar way, we can define the earliest possible finish time of an actor as:

ASAPf(τ) = ASAPs(τ) + WCET(τ)    (3.9)

with

ASAPs(τ) = max( A(τ), max_{e ∈ dst⁻¹(τ)} (ASAPs(e) + WCET(e)), max_{z ∈ pred(τ)} ASAPf(z) )    (3.10)

ASAPs(e) = ASAPf(src(e))    (3.11)

In the above equations, A(τ) denotes the arrival time of actor τ, which is zero according to the
dataflow notion, and pred(τ) is the set containing all the predecessors of actor τ. In this case,
the graph has to be traversed forwards, and actors with no direct predecessors will have an ASAPs
time equal to 0.
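The forward pass mirrors the backward one; again, edge delays are assumed folded into per-edge WCETs and the graph is invented for illustration.

```python
# Forward ASAP pass (equations 3.9-3.11). pred[a] lists the direct
# predecessors of actor a as (predecessor, edge_wcet) pairs; arrival times
# are zero, as in the text.
def asap_times(pred, wcet):
    asap_s, asap_f = {}, {}

    def visit(a):
        if a in asap_f:
            return
        s = 0  # A(tau) = 0 under the dataflow notion
        for (p, edge_wcet) in pred[a]:
            visit(p)
            # a may start only after p finishes and its token has arrived
            s = max(s, asap_f[p] + edge_wcet)
        asap_s[a] = s
        asap_f[a] = s + wcet[a]

    for a in pred:
        visit(a)
    return asap_s, asap_f

asap_s, asap_f = asap_times({"A": [], "B": [("A", 1)]},
                            wcet={"A": 2, "B": 3})
```

Here A starts at 0 and finishes at 2, so B can start no earlier than 3 and finishes at 6.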
3.4 Scheduling Heuristics
Since the problem of scheduling dataflow graphs on multi-processor systems is NP-hard, heuristics
are used to quickly find near-optimal results.
3.4.1 List Scheduling
The basic idea behind list scheduling is the construction of an ordered list of actors. Based on
this list and the binding, each actor is associated with a time interval. We say that an actor is
ready to fire at time t as soon as all its predecessors have finished and the associated processor
is not busy. This ordering is also adopted in the self-timed approach; however, the timing
information is then discarded. To take the communication costs into account when an edge crosses
different VFDs, either communication actors should be introduced and scheduled explicitly, or this
communication latency should be embedded in the execution time of the predecessor node. The ordered
list of execution can be derived based on the earliest or the latest start times of actors, while
taking into account the execution time estimates, the precedence constraints, the timing
constraints and the binding. The formulas for computing these times are presented in section 3.3.
Dynamic level scheduling, as presented in [38], utilizes a more sophisticated scheme: the ordering
of actors is recomputed after each scheduling step, and the ordering criterion is based on the
difference between the sum over the longest path and the earliest possible start time of the actor.
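A minimal sketch of list scheduling under a fixed binding could look as follows; the priority values, graph and binding are all toy inputs (here the priority would typically be the ALAP start time of section 3.3).

```python
# The ready actor with the best (smallest) priority value fires as soon as
# all its predecessors have completed and its processor is free.
def list_schedule(preds, binding, wcet, priority):
    start, finish, proc_free = {}, {}, {}
    remaining = sorted(preds, key=lambda a: priority[a])
    while remaining:
        for a in list(remaining):
            if all(p in finish for p in preds[a]):
                data_ready = max((finish[p] for p in preds[a]), default=0)
                proc = binding[a]
                # fire when both the data and the processor are available
                t = max(data_ready, proc_free.get(proc, 0))
                start[a], finish[a] = t, t + wcet[a]
                proc_free[proc] = finish[a]
                remaining.remove(a)
                break
    return start, finish

start, finish = list_schedule(
    preds={"A": [], "B": ["A"], "C": []},
    binding={"A": 0, "B": 0, "C": 1},
    wcet={"A": 2, "B": 1, "C": 3},
    priority={"A": 0, "B": 1, "C": 2},
)
```

With this input, A occupies processor 0 during [0, 2), B follows it at time 2, and C runs in parallel on processor 1 from time 0.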
3.4.2 Low power scheduling approaches
Since the goal of this thesis is to find a schedule, under a fixed binding, such that the energy
dissipation is minimized, we assume that the heuristics used for the binding step minimize the
inter-domain communications. This is possible by avoiding placing unrelated actors on the same
path [47].
The available work on energy minimization can be divided according to the platform under
consideration, as well as the type of application. A lot of research has been conducted for
multi-processor architectures where each processor has a dedicated DVFS unit and can adjust its
voltage-frequency operating point individually, or can at least start and stop independently of the
others. On the other hand, little research has been done for multi-core processor systems, where
the granularity of voltage-frequency regulation is the core, or for architectures employing several
voltage-frequency domains, where the grain is the VFD. In such architectures, all cores of a
processor, or all processors in one VFD, run under the same voltage-frequency pair at any given
point in time. For multi-processor systems with a dedicated DVFS unit embedded in each processor,
models both with and without precedence constraints have been investigated. For multi-core
processor architectures, both application models have also been investigated; however, the work
focusing on precedence-constrained graphs restricts the architecture to a single VFD.
Multi processor Architectures
In [45], which is actually an extension of [30], the authors propose a scheduling algorithm to
reduce both dynamic and static energy dissipation through adaptive body biasing (ABB), using the
power modeling from [32]. Forward body biasing (FBB) decreases the threshold voltage Vth of
transistors, increasing both the maximum frequency and the leakage, while reverse body biasing
(RBB) has the opposite effect. Adaptive body biasing refers to designs that can set the body biases
statically or dynamically. Under a given binding, their voltage scheduling algorithm is divided
into two phases. In the first phase, an optimal point is found between the supply voltage and the
body bias voltage upon a frequency update. Then, in order for the timing constraints to be met,
their algorithm evaluates the validity of the generated schedule by re-computing the earliest start
times and latest finish times of the actors. An initial schedule for a DAG is found by means of
critical-path-based list scheduling under the maximum clock frequency; the ordering metric adopted
is the latest possible start time. After the initial schedule is found, the available idle time is
allocated to actors, ordered according to their energy gradient profile, in such a way that the
timing constraints are not violated. The higher the energy gradient, the greater the energy savings
from frequency scaling. The frequency adjustment is done iteratively by decreasing the operating
frequency in steps of df (given by the specifications of the platform) for actors with a high
energy gradient, until a timing violation occurs. If more idle time is available, it is allocated
to actors with lower energy gradients. The same scheduling technique is also used in [31]. Having
adjusted the operating frequency, they continue by finding the optimal point for the supply and
body biasing voltages.
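A rough sketch of this slack-allocation loop follows. The single-processor serial makespan model and all numbers are deliberately crude stand-ins, not the schedule evaluation of the cited work.

```python
# Starting from the maximum-frequency assignment, the frequency of the actor
# with the highest energy gradient is lowered in steps of df until a further
# step would violate the deadline or fall below fmin.
def allocate_slack(freqs, wcec, gradient, deadline, df, fmin):
    def makespan(f):
        # crude serial model: actors execute back to back on one processor
        return sum(wcec[a] / f[a] for a in f)

    for a in sorted(freqs, key=lambda a: -gradient[a]):
        while freqs[a] - df >= fmin:
            trial = dict(freqs, **{a: freqs[a] - df})
            if makespan(trial) > deadline:
                break  # a further step would violate the deadline
            freqs = trial
    return freqs

f = allocate_slack(freqs={"A": 2.0, "B": 2.0}, wcec={"A": 10, "B": 10},
                   gradient={"A": 2, "B": 1}, deadline=15.0, df=0.5, fmin=1.0)
```

Actor A, having the higher gradient, absorbs all the slack (down to frequency 1.0), leaving none for B.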
A scheduling algorithm that uses priority ordering and best-fit mapping to processors is presented
in [47] and extended in [43] to take into account the inter-processor communication cost. In [43],
the ordering is defined according to the sum of the latest finish time and the earliest start time
at which a processor is also available. The actor with the lowest metric value is assigned to a
best-fit processor. To maximize the idle time, and thus the margin for energy savings, the actor
ready to execute is assigned to the processor that was busy just before that actor was released.
To take into account how this mapping will affect the communication traffic, they introduce a
parameter K, called the communication awareness parameter. The lower K is, the more
communication-aware the algorithm is. Each mapping adds a communication cost (if there is
inter-processor communication), which is compared with the average cost per edge (calculated from
the DAG and multiplied by K). The voltage selection defined in [47] is formulated as an integer
linear programming (ILP) problem, which can lead to great computational complexity [30].
A work using a clustering approach for energy dissipation and makespan optimization is presented in
[44]. They explore voltage scaling only for non-critical jobs, i.e., jobs that are not on the
critical path of the DAG. Concerning the voltage scaling problem, they first compute the slack
available for these non-critical jobs. For a given job, they define slack as the difference between
the latest possible finish time and the earliest possible start time, based on the previously
scheduled jobs. Depending on its slack and the time the job takes to execute at the maximum
frequency, they can find an optimal frequency for its execution. In order to form a cluster, it is
necessary that the result does not lead to an increase in energy consumption. Since clustered
actors are executed on the same processor, this approach also guarantees a reduction in the
makespan of the schedule. For scheduling actors within a cluster, they use a classic order
assignment based on the longest path inside the cluster.
Multi-core processor Architectures
The first energy-efficient approach to real-time scheduling on platforms that share some of the
characteristics of P2012 was presented in [46]. In this work, the platform under consideration
consists of a single multi-core processor, and the goal is to schedule, off-line, a set of
frame-based independent tasks with a fixed number of cores, while minimizing the energy
consumption. The authors prove that energy-efficient real-time task scheduling in the multi-core
context is NP-hard. In their context, a processor consists of M homogeneous cores. Each core can be
in the dormant mode independently of the others, but all active cores must operate on the same
voltage supply. The tasks are such that they are ready at time 0 (or at a multiple of the frame
period), and all tasks share the same deadline D, which is equal to the end of the frame. As far as
the architectural assumptions are concerned, they assume that the voltage, and consequently the
speed s, can be scaled in a continuous fashion. Furthermore, the overheads for switching between
different supply voltages are negligible, and task migration is not allowed. The computation
requirements of each task, in terms of cycles, are also known. Since the tasks to be scheduled are
independent and each core can enter the dormant mode independently, it is proven that any feasible
schedule for this set of tasks can be transformed into one that satisfies a property called the
deep sleeping property, while consuming the same energy as the original one. The deep sleeping
property requires that if a core µ is in the dormant mode at some time 0 ≤ t < D, then it remains
dormant at every time t′ with t < t′ < D. Based on this property, the authors prove that, for any
given task assignment X, an optimal voltage schedule in terms of power consumption can be found.
With this property, the schedule is partitioned into voltage-frequency segments: upon the
transition of a core from the active to the dormant mode, a new segment starts. The optimal voltage
can then be found by using the Lagrange multiplier method to solve the power consumption
minimization problem. They also prove that finding the optimal task assignment is an NP-hard
problem and propose a 2.371-approximation algorithm (the result of their algorithm is not more
than 2.371 times the optimal value).
In [10], the authors present a method to reduce the leakage current, the supply voltage and the
clock frequency in an integrated way. The tasks to be scheduled are represented by a weighted
acyclic task graph, and the architecture under consideration consists of several processors running
under the same supply voltage and clock frequency. Moreover, it is assumed that the number of
processors can be equal to the number of tasks, and that the voltage and frequency can be scaled
continuously. Their leakage-aware scheduling algorithm determines the number of processors that
results in the lowest energy consumption. First, they prove that scaling the frequency below a
certain point results in a higher energy consumption due to the leakage current. They find this
optimal frequency to be 0.56 times the maximum frequency when the threshold voltage is 0.3 times
the maximum supply voltage. To reach this conclusion, they used processors in which, at the maximum
frequency, the leakage current is responsible for 50% of the total energy consumption. They then
determine the minimal number of processors needed to finish the task graph before the deadline. To
find this number, they perform a binary search in the interval [Nlwb, Nupb], with Nlwb a lower
bound on the number of processors needed to meet the deadline and Nupb the number of tasks. At each
step, they use list scheduling with an EDF priority function to determine whether a schedule can be
found that meets the deadline. According to the EDF scheduling policy, the earlier the deadline of
a task, the higher its priority. To determine the optimal number of processors, they afterwards
perform a linear search on the interval [Nlwb, Nminimal] by repeatedly applying the
schedule-and-stretch algorithm (lowering the supply voltage and clock frequency) until the schedule
finishes exactly on the deadline. This loop ends when the makespan of the schedule no longer
decreases as the number of processors increases; after this point, the energy consumption increases
with the number of processors.
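The binary-search step can be sketched generically; `meets_deadline` is a stand-in for the EDF-priority list-scheduling feasibility check described above, and the toy task set is invented.

```python
# Binary search for the smallest processor count for which a schedule meets
# the deadline, assuming feasibility is monotone in the number of processors.
def min_processors(n_tasks, meets_deadline):
    lo, hi = 1, n_tasks
    while lo < hi:
        mid = (lo + hi) // 2
        if meets_deadline(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo

# toy check: 4 equal tasks of 3 cycles each and a deadline of 6, so a greedy
# makespan of ceil(4/n) * 3 must fit within the deadline
feasible = lambda n: -(-4 // n) * 3 <= 6
n_min = min_processors(4, feasible)
```

For this toy task set, two processors are the minimum: one processor needs 12 time units, two need exactly 6.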
In [29], the authors present a methodology for lowering the energy dissipation of a multi-core
system. The scheduling of independent tasks is considered, by balancing the slack times among cores
within a voltage-frequency domain and lowering the clock frequency while meeting the real-time
constraints. In order to meet the timing requirements, the core with the maximum utilization is
chosen, and the operating frequency is decided so that the tasks mapped on that core meet the
deadline. Based on this schedule, they propose a slack reallocation algorithm to further distribute
the slack times within the voltage-frequency domain. The schedule is divided into segments, and
within these segments appropriate job migrations are performed to adjust the slack times. Since the
slack times are now balanced, a lower operating frequency can be chosen to further reduce the
energy consumption. The platform consists of a set of homogeneous cores partitioned into several
voltage-frequency domains. A core can be either in sleep or in active mode. All active cores within
a voltage-frequency domain share the same supply voltage and clock frequency; however, each core
can be put to sleep independently.
The task model considered in that work assumes a set of independent tasks whose period and WCET are
known. Figure 3.2a shows such a task set along with the properties of each task. For the task-core
mapping, the Worst Fit Decreasing (WFD) policy is assumed. The WFD policy results in a
better-balanced task partition than similar methods, such as Best Fit Decreasing and First Fit
Decreasing, and maximizes the possible energy savings [3]. The binding of tasks to cores according
to the WFD policy is shown in figure 3.2b. The tasks are scheduled by the EDF policy, in a
preemptive way, on each core. Since communication between different VF domains results in further
energy dissipation, it is assumed that job migration is permitted only within the same VFD. The EDF
schedule at the nominal frequency for the aforementioned task set is shown in figure 3.3. The next
step in their proposal is to uniformly scale down the frequency based on the worst-case
utilization. From figure 3.2b, it is evident that with the current mapping the worst-case
utilization is equal to 7/12. Scaling the frequency down to 7/12 of the nominal value yields the
schedule in figure 3.4. According to the authors, moving the idle times backwards in time allows
for better energy harnessing. Pushing all tasks towards their deadlines alters the task schedule to
the one in figure 3.5. For the slack reallocation algorithm, the whole iteration period (the least
common multiple of the periods of all tasks) is divided into consecutive non-overlapping segments.
To divide the iteration period into segments, each core's schedule is first divided into
non-overlapping consecutive time slices. The start time of a slice is either the start of a task or
the end of a task, depending on whether the slice is active or idle. Having divided each core's
schedule into time slices, the segments of the VFD are defined as follows:
b0 = 0,  b_sn = iteration period,
bi = min(ts.starttime)    (3.12)
such that
ts ∈ ∪_{n=1..dcn} c_n.Sched ∧ ts.starttime > b_{i−1} ∧ ts.state = idle,  0 < i < sn

where bi denotes the start/end time of a segment, ts.starttime and ts.state denote the start time
of a slice and its status respectively, dcn is the number of cores in the VFD, sn is the number of
segments and c_n.Sched is the set of time slices of core c_n. From equation (3.12), we see that
each segment starts at the start of an active time slice and ends at the first subsequent idle time
slice of the same core. The green lines in figure 3.5 represent the boundary points of the
segments. In this way, the domain is divided into segments, and within each segment the load of all
cores is balanced by migrating jobs. Balancing the load in each segment allows for a further
reduction in the clock frequency and thus higher energy savings. Figure 3.6 shows the different
steps of this algorithm after segmentation.
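The segment-boundary rule (3.12) can be sketched roughly as follows; each core's schedule is modeled as a list of (start, state) slices, and the example values are invented.

```python
# Within a VFD, the next segment boundary is the earliest start time of an
# idle slice lying strictly after the previous boundary (equation 3.12).
def segment_boundaries(core_slices, iteration_period):
    bounds = [0]
    while True:
        starts = [s for slices in core_slices
                  for (s, state) in slices
                  if state == "idle" and s > bounds[-1]]
        if not starts:
            break
        bounds.append(min(starts))
    bounds.append(iteration_period)
    return bounds

b = segment_boundaries(
    [[(0, "active"), (4, "idle")],
     [(0, "active"), (3, "idle"), (5, "active"), (9, "idle")]],
    iteration_period=12)
```

For these two toy cores, the idle slices starting at 3, 4 and 9 become boundaries, so the 12-unit period is cut into four segments.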
In [22] the authors address the problem of scheduling a set of real-time, independent tasks sharing
a common deadline D. The platform consists of a set of cores with non-negligible leakage power
consumption, organized in the voltage-frequency domain fashion, under given timing constraints.
Their work focuses on the problem of choosing the number of active voltage-frequency domains, the
Task ID  WCET  Period  Utilization
1        5     12      5/12
2        1     3       1/3
3        1     4       1/4
4        1     6       1/6
5        1     6       1/6
6        1     6       1/6

(a)

Core ID  Tasks  Utilization
1        1, 6   7/12
2        2, 5   1/2
3        3, 4   5/12

(b)

Figure 3.2: (a) Task set, (b) Task-core mapping
Figure 3.3: EDF scheduling, f = 1 (Gantt chart of jobs on Core1–Core3 over the 12-unit iteration period)
Figure 3.4: SimpleVS, f = 7/12
Figure 3.5: Moving slack backwards, f = 7/12
Figure 3.6: Evolution of segment 1: (a) after SimpleVS, (b) after load balancing with job migration, (c) after shifting jobs, (d) after frequency scaling
Figure 3.7: After migration, sg0.f = 31/72; sg1.f = sg2.f = sg3.f = 7/12
task mapping and the frequency assignment. Apart from a polynomial-time algorithm for energy
minimization given the task mapping, they also prove that, in the absence of timing constraints,
the operating frequencies that minimize the energy consumption of each domain depend only on the
number of cores and the static power of that domain. They assume that the voltage-frequency domains
are symmetric (they contain the same number of cores), that the frequency can be regulated in a
continuous fashion within an interval [fmin, fmax], and that a VFD can be in one of two states, on
and off. Since the task set to be scheduled contains only independent tasks, each VFD's execution
time can be divided into segments. Moreover, since there is no communication between tasks across
different domains, finding an optimal solution for each domain separately results in a globally
optimal solution. Given a fixed mapping of tasks to VF domains, the optimization problem to solve
can be written as:
written as:
minEt =
Nc∑j=1
Ej(tj)
with
Nc∑j=1
tj ≤ D, tminj ≤ tj ≤ tmaxj (3.13)
where Nc is the number of cores in the domain, Ej(tj) is the energy consumption of a segment. Since
the cores are shorted in a non-decreasing order of their workloads, they define tmin =WCj−WCj−1
fmax
and tmax =WCj−WCj−1
fmin, with WCj the workload of the segment segj . To solve the convex opti-
mization problem, they follow a two step approach, in which firstly they narrow the interval of tj
from [tmin, tmax] to the interval [tlowj , tupj ]. The intuition behind narrowing the interval is that, in the
[tlowj , tupj ], Ej(tj) decreases monotonously and thus solving (3.13) in the reduced interval, it should still
hold that∑Nc
j=1 = D when energy is minimized. In the next step, they short Ej(tupj ) and Ej(t
lowj ) of
all segments in a increasing order and they perform a binary search in the domain in order to decide
for the optimal tj for each segment. In order to find the optimal number of VF domains needed to
execute the task set, they perform a linear search in the interval [nlowb , nupb ], with nlowb =⌈∑N
i=1 wciNcD
⌉and nupb = min(
⌈NNc
⌉, Nb), where N is the number of tasks to be scheduled, Nc is the number of cores
in each domain, Nb the total number of domains and wci is the worst case execution time of task i.
In each iteration they use the Largest Task First (LTF) heuristic for mapping the tasks to cores and
check if the deadline can be met. In the LTF heuristic tasks are shorted in a non-increasing order of
Scheduling Heuristics 21
their execution cycles and then, through an iterative process each task is mapped to the least loaded
core.
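The LTF heuristic itself is short enough to sketch directly; the task set below is invented for illustration.

```python
# Largest Task First (LTF): tasks are sorted in non-increasing order of
# execution cycles, and each is mapped to the currently least loaded core.
def ltf_map(cycles, n_cores):
    load = [0] * n_cores
    mapping = {}
    for task in sorted(cycles, key=cycles.get, reverse=True):
        core = min(range(n_cores), key=lambda c: load[c])
        mapping[task] = core
        load[core] += cycles[task]
    return mapping, load

mapping, load = ltf_map({"t1": 5, "t2": 4, "t3": 3, "t4": 2}, n_cores=2)
```

For this task set, the two cores end up with perfectly balanced loads of 7 cycles each.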
In [21], the authors propose an algorithm that uses both dynamic voltage scaling (DVS) and power
shut-down (PS) techniques to minimize the energy consumption of a time-constrained, dependent task
set running on an on-chip multi-processor. Their algorithm extends the schedule-and-stretch
algorithm: task computation cycles are iteratively stretched within the slack time of a given time
interval. They propose a minimum threshold interval for shutting down the cores that amortizes the
power and time overhead induced by the shut-down. The schedule-and-stretch algorithm is extended by
incorporating a DVS efficiency metric to evaluate the energy gain of a stretched computation cycle.
The application under consideration is modeled as a directed acyclic graph; each node in this graph
represents a task with known timing and computational constraints. The underlying hardware consists
of N homogeneous processing elements that can communicate with each other through a shared cache.
All processing elements on the chip are assumed to be powered by one off-chip regulator, so the
same voltage, and consequently the same frequency, is applied to all processing elements at the
same time. However, each PE can enter the dormant mode independently of the others. As mentioned
above, the energy dissipation can increase below the critical speed; the critical speed denotes the
frequency below which the static power dissipation dominates the dynamic power, so that the energy
consumption increases. To determine whether switching all PEs to the dormant mode is efficient,
they propose the threshold interval:

Tthreshold(PS) = max( Eoverhead(PS) / Pdc(critical speed), Toverhead(PS) )    (3.14)
where Eoverhead(PS), Pdc(critical speed) and Toverhead(PS) denote the energy overhead for switching to the dormant mode, the static power consumption at the critical speed, and the time overhead for the transition, respectively. In equation (3.14), the energy and timing overheads are normalized with respect to the maximum total energy needed for one clock cycle at the maximum frequency, and the cycle time at this frequency, respectively. In order for a switch to the dormant mode to be efficient, the number of consecutive idle cycles of the PE should exceed the timing overhead required for the transition.
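The shutdown decision of equation (3.14) can be sketched as follows. The numerical overheads in the example are purely illustrative, not measured values from the cited work.

```python
def shutdown_threshold(e_overhead, p_static_critical, t_overhead):
    """Break-even interval of eq. (3.14): the dormant mode pays off only
    if the idle interval both amortizes the transition energy and is
    longer than the transition time itself."""
    return max(e_overhead / p_static_critical, t_overhead)

def worth_shutting_down(idle_time, e_overhead, p_static_critical, t_overhead):
    """Shut a core down only when its idle time exceeds the threshold."""
    return idle_time > shutdown_threshold(e_overhead, p_static_critical, t_overhead)

# Illustrative numbers: 50 uJ transition energy, 10 mW static power at
# the critical speed, 2 ms transition time -> threshold = 5 ms.
decision = worth_shutting_down(0.004, 50e-6, 10e-3, 2e-3)  # 4 ms idle -> False
```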
In order to decide which cycle should be stretched, a DVS efficiency metric is proposed, formulated as:

NA(c) · ( −(E(f1) − E(f2)) / (C1 − C2) ),   where f1 > f2    (3.15)
This metric represents the energy savings per unit of increased cycle time. In equation (3.15), NA(c) represents the number of PEs that are active during cycle c, E(f) denotes the total energy consumption of all NA(c) PEs, and C1 is the cycle time at frequency f1. The tasks are mapped and scheduled according to [47], which assigns priorities to tasks according to their latest finish time and maps the tasks to PEs according to the PE's finish time relative to the task's release. The decision of which cycle to stretch is an iterative process and is based on the DVS efficiency metric.
The cycle with the highest efficiency metric will be stretched first. However, the authors note that stretching computation cycles might not be energy efficient when the energy overhead for the transition to the dormant mode is considered. In order to evaluate whether stretching a cycle will minimize the energy consumption, they compare the solution with the partially stretched computation cycle against the previous best solution. The iteration stops when there is no available slack.
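The metric of (3.15) and the greedy choice of the cycle to stretch can be sketched as below. This is a simplified illustration of the selection step only, with hypothetical helper names and illustrative values; the full algorithm of [21] also re-evaluates each stretched solution against the previous best.

```python
def dvs_efficiency(n_active, e_f1, e_f2, c1, c2):
    """DVS efficiency metric of eq. (3.15): energy saved per unit of
    added cycle time when slowing from f1 to f2 (f1 > f2, so c1 < c2)."""
    return n_active * (-(e_f1 - e_f2) / (c1 - c2))

def pick_cycle_to_stretch(candidates):
    """Greedy step: stretch first the cycle with the highest efficiency.
    Each candidate is (cycle_id, n_active, e_f1, e_f2, c1, c2)."""
    return max(candidates, key=lambda cand: dvs_efficiency(*cand[1:]))[0]

# Illustrative candidates: cycle "a" has two active PEs, cycle "b" one.
candidates = [("a", 2, 10.0, 6.0, 1.0, 2.0), ("b", 1, 10.0, 6.0, 1.0, 2.0)]
chosen = pick_cycle_to_stretch(candidates)  # "a": more active PEs, more savings
```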
Work        Type of energy to be minimized       Application assumption       Platform assumption
[45], [30]  Dynamic and Static                   DAG                          Individually managed PEs
[47], [43]  Dynamic                              DAG                          Individually managed PEs
[44]        Dynamic (for non-critical actors)    DAG                          Individually managed PEs
[46]        Dynamic                              Independent periodic tasks   1 VFD
[10]        Dynamic and Static                   DAG                          1 VFD
[29]        Dynamic                              Independent periodic tasks   1 VFD
[22]        Dynamic                              Independent periodic tasks   multiple VFDs
[21]        Dynamic                              DAG                          1 VFD
Table 3.1: Overview of the assumptions in the related work
3.4.3 Discussion on the related work
As shown by the above discussion, only a few works address the problem of energy efficient scheduling of a data dependent task set on hardware configurations that require a number of cores to share the same voltage-frequency pair when active at the same time. An overview of the assumptions in the related work is shown in table 3.1. Although individually power managed PEs provide greater flexibility for energy minimization, adopting this strategy with a large number of PEs is very complex and expensive. It is also evident that blindly adopting sophisticated energy minimization methodologies designed for uni-processor architectures in the aforementioned configurations will result in higher energy dissipation.
Although, in the above works, the authors consider the minimization of both dynamic and static energy dissipation on clustered configurations, they mostly assume that the underlying platform consists of only one VFD. When they assume multiple domains, they also consider the scheduled tasks to be independent, which allows them to focus on one domain at a time; moreover, they consider the switching activity to be constant. However, as most DSP applications are described through very large data-flow graphs, it is reasonable to expect that such graphs are mapped on several voltage-frequency domains. Furthermore, the switching activity of actors is data and time dependent rather than constant, and plays a dominant role in frequency scaling. We have to find a more general approach which can take into account inter-domain dependencies and communications between actors, as well as the actor-dependent switching activity.
4
Platform Power Management
In this chapter, we present some background on the power consumption of digital circuits. We also describe the platform under consideration, which forms the basis of the assumptions used when energy efficient scheduling is discussed.
4.1 Power Basics
In current CMOS designs, dynamic and static power consumption are the two major factors affecting the energy dissipation. Dynamic power consumption stems from the average capacitance switched per operation. It is data dependent and exhibits an almost cubic relation with the supply voltage (assuming that frequency has a linear dependence on the supply voltage). Static power consumption is mainly attributed to leakage current. As CMOS technology is scaled down, the contribution of static power to the overall energy dissipation becomes more and more important [37]. Finally, the incorporation of NoCs as interconnection mechanisms between processing elements contributes to the overall energy dissipation as well. Understanding and modeling the above factors is essential before the scheduling of actors is considered.
The following analysis concerns the energy dissipated during one clock period. Later, we will extend it to take into account the number of actors, their execution times and the number of PEs. We denote the clock period as Tclk, defined as the inverse of the clock frequency fclk.
The energy dissipation of a digital circuit is mainly attributed to three phenomena:
• Charging/discharging of capacitive loads
• Short circuit currents
• Leakage currents
The first two contribute to the dynamic energy dissipation while the last one to the static energy.
4.1.1 Charging/Discharging of Capacitive loads
The energy consumed for a transition from 0 → 1 (or vice versa) depends on the total capacitance of the circuit (roughly proportional to the number of CMOS transistors), the supply voltage, and the activity factor α. This factor denotes the fraction of the total capacitance being charged (or discharged) and depends on the utilization of the PE's components. The power consumption is given by
Pcharge = α · C · VDD² · fclk    (4.1)
For the 65 nm technology, the frequency can be considered proportional to the supply voltage:

fclk ∝ (VDD − Vth)^a / VDD    (4.2)
and we derive a cubic relation between the clock frequency and the power consumption as:
Pcharge ≈ α · C · fclk³    (4.3)
In (4.2), the exponent a is a constant. It is experimentally derived, and for the aforementioned technology it is approximately equal to 1.3. The total energy dissipated in Tclk is then:
Echarge = α · C · VDD²    (4.4)
The switching activity factor can be minimized by adopting efficient architectures that allow instructions to complete with minimal hardware utilization. From (4.1), it is evident that minimizing the supply voltage can provide significant power savings. However, reducing the supply voltage also affects the operating frequency, through equation (4.2). In order to compensate for the performance loss, pipelined implementations can be used, as is the case in most DSP systems.
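The cubic power relation and the quadratic energy relation can be illustrated numerically. This is a toy sketch of equations (4.1)-(4.4); the α, C, VDD and fclk values are illustrative, not P2012 parameters.

```python
def dynamic_power(alpha, c_total, vdd, f_clk):
    """Eq. (4.1): P = alpha * C * VDD^2 * f_clk."""
    return alpha * c_total * vdd ** 2 * f_clk

# With f proportional to VDD (linearized eq. 4.2), halving the frequency
# roughly halves VDD: power drops ~8x (cubic, eq. 4.3) while the energy
# per cycle P/f drops ~4x (quadratic, eq. 4.4).
p_full = dynamic_power(0.3, 1e-9, 1.2, 400e6)
p_half = dynamic_power(0.3, 1e-9, 0.6, 200e6)
power_ratio = p_full / p_half                        # ~8
energy_ratio = (p_full / 400e6) / (p_half / 200e6)   # ~4
```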
4.1.2 Short Circuit Currents
When transistors switch, both the nMOS and pMOS transistors traverse a partially conducting state. This results in a direct conducting path from VDD to Vss, called a “short circuit”. The energy dissipated during this state can be approximated by:

Esc = σ · Echarge    (4.5)

with σ being 0.2 on average for VLSI circuits. The effect of short circuit currents at the gate's output on dynamic power consumption is relatively small and can be absorbed by the α factor in equation (4.1) or (4.3).
4.1.3 Leakage Currents
Among the phenomena that contribute to static power dissipation, the sub-threshold current plays the dominant role. With this in mind, the leakage current Ilk can be approximated using the following equation [9]:

Ilk ≈ Isub = K1 · W · e^(−Vth/(η·Vθ)) · (1 − e^(−VDD/Vθ))    (4.6)
with K1 and η being technology dependent parameters, Vth the threshold voltage, W the gate width
and Vθ the thermal voltage (which depends on the temperature). From equation (4.6), it is clear that there are two possible ways to reduce the leakage current: the first is to turn off the supply voltage, causing loss of state, and the second is to increase the threshold voltage, causing loss of performance.
4.1.4 Total Energy Dissipation
Based on the formulas (4.4), (4.5) and (4.6) the total energy dissipated from one PE in a DVFS island
in one clock period is:
Ep = Echarge + Esc + VDD · Ilk · (1/fclk)    (4.7)
and can be approximated by:
Ep(fclk, VDD) = α · C · VDD² + Ilk · VDD · (1/fclk)    (4.8)

In later formulas, the constants C and Ilk are abbreviated as k1 and k2 respectively. From equation (4.8) we can see that a reduction in the supply voltage VDD yields a quadratic reduction in energy dissipation. Thus, the energy savings from applying DVFS can be significant. However, voltage scaling in CMOS also affects the gate traversal delay, and consequently the global delay, as shown by (4.2). So scaling the voltage should also involve frequency scaling. More precisely, fclk should be decreased linearly with VDD, thus affecting the computation time [16].
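The critical speed mentioned earlier can be illustrated with a toy energy-per-cycle model in the spirit of (4.8), assuming VDD scales with f and a fixed static power; the constants are illustrative, not measured P2012 values.

```python
def energy_per_cycle(f, k_dyn=1e-27, p_static=5e-3):
    """Toy model in the spirit of eq. (4.8) with VDD ~ f: dynamic energy
    per cycle grows as f^2 while static energy per cycle is P_static/f."""
    return k_dyn * f ** 2 + p_static / f

# Setting dE/df = 0 gives 2*k_dyn*f = p_static/f^2, i.e. the critical
# speed f_crit = (p_static / (2*k_dyn)) ** (1/3). Below f_crit, slowing
# down further *increases* the energy per cycle.
f_crit = (5e-3 / (2 * 1e-27)) ** (1 / 3)
```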
Idle energy dissipation
When a PE does not execute any actor, it is in the IDLE state. Since in this state the switching activity is zero, only the leakage current contributes to the total energy dissipation. The energy dissipation in this state is then given by the formula:

Eidle_p = Σ_j k2 · VDD · IDEC_j · (1/fclk)    (4.9)
with IDEC denoting the idle execution cycles between the end of an actor τ and its re-execution or the start of another actor z mapped on the same PE, and j indexing these idle intervals. As will be described in subsection 4.2.3, most modern architectures employ circuits that gate the clock to the computational logic, as well as mechanisms to reduce the effect of leakage current in the idle mode.
Actor energy dissipation
Assuming a given WCEC for one execution of an actor, and under constant frequency and supply voltage, the total dynamic energy dissipated by the PE executing one invocation of this actor is:

Ep(τ) = Ep(fclk, VDD) · WCEC(τ)    (4.10)
Schedule energy dissipation
Since we are interested in the energy dissipated by executing a data-flow graph, we use (4.10) and
(4.9), along with the repetition vector defined in (2.3), to derive the formula for the schedule’s energy
dissipation. Without taking into account the energy for communication between PEs, the total energy
dissipated in one iteration of the schedule is:
Eactortot =∑τ∈T
q(τ) · Ep(τ) +∑p∈PE
Eidlep (4.11)
However, the above equation needs to be refined when DVFS is applied to a VFD containing a number of PEs, rather than to every PE independently. Consider the HDFG shown in figure 4.1a, with the mapping shown in figure 4.1b. According to the given mapping, a periodic schedule can be found, as shown in figure 4.2a, for the nominal frequency f = fmax, or f = 1 if the normalized frequency is used. Figure 4.2b shows the initial schedule after scaling the frequency of actor 3 from f = 1 to f = 1/3, when the frequency and voltage can be adjusted for each PE separately.
Figure 4.1: (a) HDFG with actors 1–4, (b) Mapping and WCEC of each actor:

Actor  WCEC  PE
1      1     1
2      1     2
3      1     1
4      1     2
Figure 4.2: Individually managed PEs, (a) f = 1 for all actors, (b) frequency scaling to f = 1/3 for actor 3, (c) case for clustered PEs
In this context, equation (4.10) can be used to calculate the energy dissipation of each actor. Since each PE has a dedicated DVFS mechanism, the idle periods can be identified for each PE separately, and consequently equation (4.9) can be used for the idle energy calculation. Equation (4.11) then returns the total energy dissipation for a given actor scheduling and frequency scaling.

In the case where the two PEs are clustered into one VFD, scaling the frequency of actor 3 as above would result in the schedule shown in figure 4.2c. It is evident that scaling the frequency of an actor directly affects the frequencies, and consequently the energy dissipation, of all actors in the VFD that are active when the frequency scaling occurs. Following the above argument, equation (4.10) needs to be refined to take into account the frequency scaling points that fall into the actor's active interval. Last but not least, this clustering of PEs into VFDs requires the recalculation of idle intervals. In contrast to the case of individually managed PEs, an idle interval in the context of VFDs is defined as an interval during which all the PEs in the VFD are idle simultaneously. Following this definition, while in figure 4.2b the idle intervals are [1,3] and [4,5] for PE1 and [1,2] for PE2, in the case of clustered PEs shown in figure 4.2c there are no idle intervals.
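The difference between per-PE and per-VFD idle intervals can be sketched as follows. This is a minimal Python sketch; the busy-cycle sets are illustrative, loosely inspired by figure 4.2, not an exact transcription of it.

```python
def pe_idle_cycles(busy, horizon):
    """Idle cycles of one PE: cycles in [0, horizon) with no actor running."""
    return set(range(horizon)) - busy

def vfd_idle_cycles(busy_per_pe, horizon):
    """For clustered PEs sharing one VFD, an idle cycle requires ALL PEs
    of the domain to be idle simultaneously (set intersection)."""
    idle = set(range(horizon))
    for busy in busy_per_pe:
        idle &= pe_idle_cycles(busy, horizon)
    return idle

# Illustrative 5-cycle schedule: PE1 busy in cycles {0, 3}, PE2 in {0, 2}.
pe1_idle = pe_idle_cycles({0, 3}, 5)             # {1, 2, 4}
vfd_idle = vfd_idle_cycles([{0, 3}, {0, 2}], 5)  # only the common idle cycles
```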
Figure 4.3: The P2012 fabric
4.2 Platform Architecture
The platform under consideration for this project, P2012, is designed by STMicroelectronics. P2012 is an area and power efficient many-core platform. The computing fabric is highly modular and is based on multiple clusters, each of which is an independent power and clock domain (VFD). Each of these domains incorporates a power management unit that can be controlled independently, enabling aggressive fine-grained power management. The communication infrastructure between the clusters is based on a high-performance asynchronous NoC architecture: the ALPIN NoC. A graphical schematic of the architecture is presented in figure 4.3.
4.2.1 Cluster Power Management
Each cluster of PEs in P2012 is an independent voltage/frequency domain. The architecture of each domain is shown in figure 4.4. Within each unit, a programmable local clock generator generates a variable frequency in a predefined and programmable tuning range. Apart from the local clock generator, each domain incorporates a local power supply unit (PSU) to generate and control its internal core voltage supply. Regarding the dynamic power consumption, the technique used is Locally Adaptive Voltage and Frequency Scaling (LAVFS) with VDD hopping [6]. As far as the static power consumption is concerned, PMOS power switches, controlled by an ultra-cut-off (UCO) mechanism, are inserted to maintain minimum leakage in standby mode. The PSU is presented in figure 4.5.
4.2.2 Dynamic Power Management
Efficient LAVFS is performed through a hardware controller that automatically switches between Vhigh and Vlow using a configurable duty ratio. This way, low-level software control is avoided as much as possible. The hopping technique used is VDD hopping with dithering, described below.
Figure 4.4: NoC Unit Architecture - VFD
Figure 4.5: Power Supply Unit
Figure 4.6: Dithering principle
The Local Power Management (LPM) unit is in charge of handling the domain's power modes. The LPM contains a set of programmable registers to define the domain power mode, configure the programmable delay line (for frequency regulation), and configure and control the PSU. More precisely, the LPM contains two dedicated registers to program the frequency and the duty ratio of the hopping unit. In addition, it contains registers to control the hopping unit signals, and a mode register that controls the mode of the unit. In this architecture, each unit (VFD) can be set to one of the following power modes:
• INIT mode: supply voltage is Vhigh and the clock is gated.
• HIGH mode: supply voltage is Vhigh and the clock is sent to the VFD.
• LOW mode: supply voltage is Vlow and the clock is sent to the VFD.
• HOPPING mode: supply voltage automatically hops between Vhigh and Vlow. The frequency and duty ratio of the hopping are configurable. The obtained performance is an average value between Vhigh and Vlow, based on the duty ratio.
• IDLE mode: the VFD clock is off and leakage power is reduced due to the Vlow supply voltage.
• OFF mode: the unit is switched off by the UCO device, to further reduce the leakage power.
VDD hopping
As mentioned above, VDD hopping with dithering between the two pairs (Vhigh, fhigh) and (Vlow, flow) is used to control the average voltage and frequency of the VFD. Dithering provides superior results compared to DFS or DVFS with discrete voltage levels [6], as shown in figure 4.6, and comparable performance to a continuous voltage converter. The average operating frequency is determined by the duty ratio between the time spent at fhigh and flow:

Favg = (flow · tlow + fhigh · thigh) / (thigh + tlow)    (4.12)
During hopping, the supply voltage is provided by a power supply selector acting as a linear regulator, with a voltage set point given by a DAC. With this precise control, changing between supply voltages can be done following a controlled ramp (Vref), limiting wide current variations and avoiding supply voltage overshoots or undershoots. Because of this smooth transition, the VFD does not need to be stopped and thus there is no latency cost at the application level.
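Equation (4.12) can also be turned around to choose the hopping duty ratio for a desired average frequency. A minimal sketch; the frequency values are illustrative, not P2012 operating points.

```python
def avg_frequency(f_low, t_low, f_high, t_high):
    """Eq. (4.12): average frequency obtained by VDD hopping."""
    return (f_low * t_low + f_high * t_high) / (t_low + t_high)

def duty_ratio_for(f_target, f_low, f_high):
    """Fraction of time to spend at f_high so that the hopping average
    equals f_target (requires f_low <= f_target <= f_high)."""
    return (f_target - f_low) / (f_high - f_low)

# Illustrative example: hop between 200 MHz and 400 MHz to average 350 MHz.
d = duty_ratio_for(350e6, 200e6, 400e6)        # 0.75 of the time at f_high
f_avg = avg_frequency(200e6, 1 - d, 400e6, d)  # back to 350 MHz
```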
4.2.3 Leakage Power Management
The use of power switch transistors to reduce leakage current in digital memory circuits is already mainstream. The first method used is known as Multiple Threshold CMOS (MTCMOS) power switches [2]. It consists of using low-VT, high-performance transistors for the logic, and high-VT, low-leakage transistors for the power switch. The power switch is inserted between the supply lines and the logic. The drawback of this approach is the poor performance of the high-VT transistors under low supply voltage. To allow low-voltage operation, the super cut-off CMOS (SCCMOS) power switch [37, 19] has been introduced. It is a low-Vth transistor whose leakage current is exponentially reduced by reverse biasing its gate. The UCO circuit is responsible for biasing the gate of the SCCMOS power switch [5].
In [6], the authors compare the leakage gain from the use of a UCO-type power switch against that of an MTCMOS switch, when the UCO-type power switch is used to drive an IP block with higher power dissipation. The leakage current in the OFF state was found to be 8 times lower with the UCO, while it is 2.5 times higher in the HIGH mode of operation.
4.2.4 NoC Interconnect
Energy dissipation and Latency
In on-chip interconnects there are two sources of energy dissipation: wires and routers. For interconnecting the VFDs, P2012 uses an innovative 2D-mesh NoC based on asynchronous logic, perfectly adapted to the GALS paradigm. The routers in the NoC are implemented in Quasi-Delay-Insensitive (QDI), clock-less logic [42]. QDI circuits are a class of delay invariant, asynchronous designs. This fully asynchronous NoC interconnect scheme provides almost 5 times less power consumption than the synchronous equivalent and reduced latency (almost a ratio of 2). The availability of only low-Vt asynchronous cells (instead of multi-Vt) results in a higher static energy consumption for the routers, compared to the synchronous equivalents. However, when the asynchronous routers are idle, there is only static power dissipation. In the synchronous implementation, the routers, while in the idle state, can consume up to 5 mW (for a high performance router) because of clock switching, even if clock gating is performed, whereas the asynchronous router implementation consumes 240 µW [42]. On a telecom application implemented to compare the synchronous and asynchronous approaches, the power budget for a 15-node NoC was reduced from 82.6 mW down to 11.9 mW. With the asynchronous NoC as a communication infrastructure, for the system of figure 1.3, the energy dissipated in the NoC corresponds to 6% of the overall energy dissipation [24]. As far as latency is concerned, for a 5-node path the ANoC latency is 17.3 ns against 29 ns for the synchronous version [42]. The provided throughput can reach 17 Gb/s for 32-bit flits.
4.3 Assumptions
Based on the above discussion on P2012 as well as the discussion on the application modeling adopted,
we proceed by presenting our assumptions necessary for refining the energy formula in (4.11).
Architectural assumptions
• The platform can be described by a directed architecture graph, GA = (I,L), where I is a set of
VF domains and L ⊂ I × I is a set of links between these domains. A link is an ordered pair
l = (i, z) with i, z ∈ I. Data can be sent from domain i to the domain z with a constant latency.
• The platform is homogeneous: the execution time of an actor is independent of the PE and
depends only on the supply voltage and frequency. From the homogeneous case, it follows that
k1 in (4.8) is constant for all PEs. This homogeneity comes from the fact that all PEs in the
VFDs of P2012 are identical.
• There is a linear dependence between the supply voltage and the operating frequency (for 0.8 V ≤ VDD ≤ 1.2 V and 0.2 V < VT < 0.5 V in 65 nm technology [17]).
• The platform supports state-of-the-art mechanisms for reducing the static power, which can therefore be neglected [6].
• Thanks to VDD hopping, the LAVFS mechanism can set the frequency to any value in the range
[fmin, fmax] [1].
• There is no overhead for issuing a frequency scaling command. This assumption comes from the fact that the voltage and frequency can be scaled smoothly, and consequently there is no need for the executing actor to stop. An overhead exists only when the island is switched off.
• The scaling moments of frequency and voltage always coincide with each other.
• Intra-domain communications between PEs are instantaneous and do not consume energy. This is a reasonable assumption for our platform, where each VFD contains PEs connected to multi-banked, shared, level-1 instruction and data memories [1].
• Inter-domain communications have a known and bounded latency (proportional to the number of nodes in the path and the number of flits transmitted). This is the case for guaranteed throughput NoCs with a constant energy consumption per flit, denoted ENoC_flit/hop. A flit is the quantum of information in bits transferred between two routers of the NoC in one clock cycle. The flit size depends on the wires connecting the two routers.
• The energy dissipated by the communication is neglected, for two reasons: the design of the NoC, and the fact that the binding has been done in such a way that inter-domain communications are minimized.
• PEs that are not bound to any actor from the data-flow graph are assumed to be turned off.
Application modeling assumptions
• The data-flow notion is used to model the application under consideration. The definition of this model was given in section 2.1.
• The dependencies and the amount of exchanged information (size and number of tokens) between actors are known.
• The switching activity α of each actor is known.
• The WCEC of each actor and of each communication is known. Since the communication between VFDs is done through an asynchronous NoC, the WCET of each edge that crosses VFDs is considered known from the mapping step.
• The deadline for one iteration of the schedule is known.
4.4 Energy Dissipation Refinement
4.4.1 Definitions Overview
Following the assumptions of section 4.3, we proceed to evaluate the energy consumption of a given data-flow graph. We first provide a short summary of the definitions of functions and sets for the architecture and the graph that will be used extensively in the following chapters.
Architecture definitions
- GA = (I,L) is the directed architecture graph
- I is the set of VFDs
- PE is the set of PEs
- BPE : PE → I is the binding function that maps each processing element to a VFD. The inverse of this function, i.e. BPE⁻¹(i), returns the set of PEs of a VFD i.
- N : I → 2^N, returns the set of cycles in an island. This set depends on the frequency schedule. The maximum of N is given as |N| = D · fmax. With frequency scaling, the total number of execution cycles can decrease. However, there is a lower bound on the number of execution cycles which must be preserved in order for all actors mapped on a domain to complete their execution. This lower bound is based on the actors' WCEC. To calculate it, we need to compute the worst case execution path within the given domain. Intuitively, the worst case path for each processing element can be calculated as the difference between min(S(τ)) and max(S(τ) + WCEC(τ)), with S(τ) the function that returns the scheduling cycle of an actor τ ∈ BT⁻¹({p}). To obtain the critical path within a domain, we then have to consider the critical paths of all its PEs. For a domain i ∈ I we define this lower bound to be:

  lb_cycles(i) = max_{p ∈ BPE⁻¹({i})} (Smax(p)) − min_{p ∈ BPE⁻¹({i})} (Smin(p))    (4.13)

  with

  Smin(p) = min(S(τ)),  Smax(p) = max(S(τ) + WCEC(τ)),  ∀τ ∈ BT⁻¹({p})

  Considering figure 4.2a, in the case where the two PEs are clustered in the same VFD, the lower bound on the computation cycles, based on the mapping, is equal to 3.
- F : I × N → f, with N a finite set of cycles and f ∈ [fmin, fmax]. This function returns the frequency schedule of a VFD.
- V : I × N → V, with V ∈ [Vmin, Vmax]. This function returns the voltage schedule of a domain.
- L ⊂ I × I is the set of links in the NoC.
- Nflits/token ∈ N+ : gives the number of flits per token.
- Lflit/hop ∈ R+ : gives the network hop delay for one flit.
Data-flow graph definitions
- Gs = (T , E) is the data-flow graph
- T is the set of actors in the data-flow graph
- E ⊂ T × T is the set of edges in the data-flow graph
- q : T → N+ is the repetition vector and returns the number of invocations of τ ∈ T in one
iteration of the schedule
- D : returns the deadline for a schedule
- BI : T → I is the actor mapping function that associates each τ ∈ T to one i ∈ I. The inverse of this function, i.e. BI⁻¹(i), returns all the actors mapped on the VFD i.
- BT : T → PE is the actor mapping that associates each τ ∈ T to one p ∈ PE. The inverse of this function, i.e. BT⁻¹(p), returns all the actors mapped on the PE p.
- WCEC : T → N+ returns the WCEC of τ ∈ T .
- α : T → R+ returns the switching activity of τ ∈ T
- src (dst) : E → T returns the source (destination) actor of an edge e ∈ E
- Prate (Crate) : E → N+ returns the tokens produced (consumed) by one invocation of the source (destination) actor of an edge e ∈ E
- BE : E → 2^L is the edge binding function that binds one e ∈ E to possibly several or no links l ∈ L, and thus |BE(e)| ∈ N
- S : T → 2^N, returns the scheduling moments (in cycles) of an actor. The reasoning behind this definition will be given in the following chapter.
- Idle(p) = N(BPE(p)) \ ⋃_{τ ∈ BT⁻¹(p)} {S(τ) + {0, 1, ..., WCEC(τ)−1}} returns the set of idle cycles of a processing element. It is obtained by excluding from the total number of cycles N(BPE(p)), within the time frame [0, D], the intervals where the PE is active. These active intervals are calculated from S(τ) and WCEC(τ) as ⋃_{τ ∈ BT⁻¹(p)} {S(τ) + {0, 1, ..., WCEC(τ)−1}}, the union of the execution cycles of the actors τ mapped on the PE p.
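The lower bound of equation (4.13) can be sketched as follows. The schedule values are illustrative, chosen to be consistent with the mapping of figure 4.1b and the bound of 3 stated above.

```python
def lb_cycles(domain_pes, actors_on_pe, sched, wcec):
    """Lower bound of eq. (4.13) on the execution cycles of a VFD:
    latest finish over all PEs minus earliest start over all PEs,
    among the actors mapped on the domain."""
    s_min = min(min(sched[t] for t in actors_on_pe[p]) for p in domain_pes)
    s_max = max(max(sched[t] + wcec[t] for t in actors_on_pe[p]) for p in domain_pes)
    return s_max - s_min

# Mapping of figure 4.1b (actors 1, 3 on PE1; actors 2, 4 on PE2, each
# with WCEC 1) and an illustrative schedule in the spirit of figure 4.2a.
sched = {1: 0, 2: 2, 3: 2, 4: 0}
wcec = {t: 1 for t in sched}
actors_on_pe = {"PE1": [1, 3], "PE2": [2, 4]}
lb = lb_cycles(["PE1", "PE2"], actors_on_pe, sched, wcec)  # -> 3
```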
Using the above definitions we can extend (4.10) to accommodate multiple frequencies within one or
multiple executions of an actor as:
Ep(τ) = k1 · α · (Vrms(τ))² · q(τ) · WCEC(τ)    (4.14)
Since the voltage is a time varying, periodic (with period equal to the deadline) function of time, in equation (4.10) we can use the mean of the different voltage levels within an actor's invocations. By definition, this mean is the root mean square of the voltage, given by equation (4.15). The inner sum in (4.15) runs over one invocation of the actor, while the outer sum runs over the number of invocations according to the repetition vector q.

Vrms(τ) = sqrt( ( Σ_{n ∈ q(τ)} Σ_{m=n}^{n+WCEC(τ)−1} V²(BI(τ), m) ) / ( q(τ) · WCEC(τ) ) )    (4.15)
In the following chapters, we restrict the VF scheduling points per domain to specific events: the invocation and completion times of the actors mapped on this particular domain. In this way, instead of exhaustively calculating Vrms over every execution cycle, we will only have to consider the intervals within the actor's execution time [S(τ), S(τ) + WCEC(τ)].
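Equation (4.15) can be sketched as follows. A minimal Python sketch; the voltage schedule and cycle counts are illustrative.

```python
import math

def v_rms(voltage_of_cycle, invocation_starts, wcec):
    """Eq. (4.15): root mean square of the supply voltage over all
    execution cycles of an actor's invocations in one iteration."""
    cycles = [n for s in invocation_starts for n in range(s, s + wcec)]
    return math.sqrt(sum(voltage_of_cycle(n) ** 2 for n in cycles) / len(cycles))

# Illustrative: one invocation of 4 cycles, supply at 1.2 V for the first
# two cycles, then scaled down to 0.8 V for the remaining two.
v = v_rms(lambda n: 1.2 if n < 2 else 0.8, [0], 4)  # sqrt(1.04) ~ 1.02 V
```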
We also extend (4.9) to accommodate multiple supply voltages within an idle interval, rather than only the minimum one:

Eidle_p = k2 · Σ_{n ∈ Idle(p)} V(BPE(p), n) · (1/F(BPE(p), n))    (4.16)
This refinement comes from the fact that each PE in a VF domain cannot go to the dormant mode independently. So, while a PE is idle, if any of the other PEs in the same VFD is active, the voltage might have a value other than the minimum one, as shown in figure 4.2c.

We also define the energy dissipated to transmit one token between two NoC routers as:

ENoC_token/hop = ENoC_flit/hop · Nflits/token    (4.17)
Using equations (4.11), (4.14), (4.16) and (4.17), we can formalize the energy dissipation of a data-flow
graph under a given schedule as:
Esch = Eactor_tot + ENoC_tot    (4.18)

with

Eactor_tot = Σ_{i ∈ I} ( Σ_{p ∈ BPE⁻¹({i})} ( Σ_{τ ∈ BT⁻¹({p})} Ep(τ) + Eidle_p ) )

Ep(τ) = α · C · (Vrms(τ))² · q(τ) · WCEC(τ)

Eidle_p = Ilk · Σ_{n ∈ Idle(p)} V(BPE(p), n) · (1/F(BPE(p), n))

which gives the active and idle energy in all domains that have actors mapped on them, and

ENoC_tot = ENoC_token/hop · Σ_{τ ∈ T} Σ_{e ∈ src⁻¹({τ})} |BE(e)| · Prate(src(e))

which gives the total energy dissipated for communication. If an edge does not cross VFDs, its energy is considered to be 0.
5
Energy Efficient Scheduling
Our work is focused on multi-processor chips where the processors are clustered to form voltage-frequency domains and communicate through an asynchronous NoC. All processing elements in a cluster share the same voltage-frequency pair and go to the dormant mode when the VFD is switched to the idle state. As far as the application model is concerned, it is described by the data-flow graph notion presented in the previous chapter. Since the problem of scheduling tasks on such hardware configurations is proven to be NP-hard, we restrict our work to acyclic graphs and we assume that static power dissipation is negligible. Before explaining the heuristic for tackling this problem, we shall first formulate the optimization problem and observe the differences from the related work.
5.1 Constraint problem formulation
5.1.1 Objective Function
As shown in chapter 4, the dynamic power dissipation is given by formula (4.1). In current technologies, voltage and frequency have an almost linear relation to each other, through formula (4.2). Because of this linearity, the dynamic power dissipation is almost cubically related to the clock frequency fclk, through equation (4.3). The worst case execution time needed by a task to complete its execution under fclk is given by equation (3.5), substituting fmax with fclk: WCET(τ) = WCEC(τ)/fclk(τ), and consequently the energy consumption of the actor is given by:

E(τ) = Pcharge(τ) · WCET(τ) ≈ α(τ) · C · fclk(τ)² · WCEC(τ)    (5.1)
Whereas in the case of individually managed PEs each task can have its own frequency/voltage, in our underlying hardware the frequency of a task depends greatly on the frequency/voltage scaling schedule. By frequency/voltage schedule we mean the time instances at which the frequency and voltage are scaled. The scaling moments of frequency and voltage always coincide with each other.

Relative to an actor's scheduling moment, some (or no) voltage/frequency scaling moments can affect its execution time. This fact drives us to incorporate an average frequency for each actor into the energy formula. Based on this average frequency, the energy dissipation can be formulated as follows:

E = Σ_{τ ∈ T} α(τ) · C · fav(τ)² · WCEC(τ)    (5.2)
In order to minimize the overhead of scaling, we assume that these moments also coincide with the invocation or completion of an actor. Of course, these moments do not necessarily invoke VF scaling. One could propose that voltage and frequency scaling be independent of the scheduling of actors. However, since our platform and assumed application model do not exhibit properties such as the deep-sleeping property assumed in the above works [46, 22], and since maintaining a metric for the efficiency of each cycle, as in [21], is too expensive, especially for applications that need thousands of cycles to complete, such an approach would require some sort of criterion to decide how to divide the execution time of a VFD into segments of constant voltage and frequency. The smaller the width of such a segment, the higher the complexity of the heuristic, but also the higher the flexibility to harness energy dissipation. Apart from this trade-off, one should also consider the timing overhead of voltage-frequency scaling when deciding on the width, since the smaller the width, the more scaling points there are.
The number of tasks mapped on the specific VFD should also be taken into account: intuitively, one would expect more scaling points as the number of tasks mapped on the VFD grows. The schedule-and-stretch approach used in [10] and the approach in [21] are the two extremes. In [10] there is only one interval, equal to the iteration period, in which the voltage and frequency are constant. In [21] we have the other extreme, where the interval is equal to the clock cycle. Although the schedule-and-stretch approach in [10] provides low complexity, we expect that taking the switching activity of each task into account separately will allow more efficient results. The relation between switching activity and energy gain has been discussed in [30]: the higher the switching activity of a task, the higher the energy gain from scaling the frequency and voltage on this task.
5.1.2 Deriving the constraints
We believe that dividing the execution into intervals with constant frequency and voltage can lead to better results. This approach is also adopted in [46, 29, 22]. However, in [46, 22] segmentation is a natural choice, as these works deal with independent tasks and the mapping is such that the schedule can satisfy the deep-sleeping property. In [29], segmentation is adopted in order to balance the slack within each segment through job migration and to apply frequency scaling over the segment's interval. In our work, job migration is not allowed. Alternatively, we propose another approach to increase the usable slack in each core, which will be described later (see section 5.2.2). As mentioned above, we choose each segment to start either at an invocation or at the completion of an actor, so as to increase the energy harnessing while maintaining a simple heuristic.
Graph transformation
Before describing the procedure of defining the VF segments, we first transform the data-flow graph Gs into G∗s. In the transformed graph G∗s, new edges are added, when necessary, to capture the dependencies between actors that share the same resource. Furthermore, each edge has an associated communication cost, which is derived from the mapping step. Assuming that the mapping has been done, we should first define the ordering of execution. The ordering is done in such a way that it respects the data and resource dependencies; it is based on the ALAP start time given by equation (3.4), with ties broken arbitrarily. We follow the approach of [39], which describes how to add precedence edges to the initial data-flow graph based on the mapping and the exact ordering of actor firings. However, this is not possible in general for multi-rate data-flow graphs. The transformed graph has the same set of actors, but a new edge has to be inserted between two actors if they are mapped on the same PE, there is no direct edge between them in the original data-flow graph, and their executions are sequential. Formalizing the above requirements, an edge is added between two actors j and z if:
• ALAPs(j) > ALAPs(z): actor j will execute after actor z
• BT(j) = BT(z): both actors are mapped on the same PE
Figure 5.1: From Gs to G∗s (ALAPs(3) < ALAPs(4))
• There is no actor τ mapped on the same PE that will be fired between z and j: ∀τ ∈ BT⁻¹(BT(j)), ALAPs(τ) > ALAPs(j) ∨ ALAPs(τ) < ALAPs(z). This implies that, on the PE, the actor to be fired after the completion of actor z is actor j
• e(z, j) ∉ E: in the original graph, there is no direct edge from actor z to actor j
The transformed graph, with the resource and data dependencies embedded, is denoted G∗ = (T, E∗).
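The edge-insertion rules above can be sketched as follows. This is a minimal illustration under our own representation, not the thesis' implementation: the graph is a set of (src, dst) pairs, and `alap` and `pe_of` are hypothetical dictionaries giving each actor's ALAP start time and PE binding:

```python
def add_resource_edges(edges, alap, pe_of):
    """Build E*: the original edges plus one resource edge between every
    pair of actors that run back-to-back on the same PE, in ALAP order."""
    new_edges = set(edges)
    for pe in set(pe_of.values()):
        # Actors bound to this PE, ordered by ALAP start time (ties arbitrary).
        order = sorted((t for t in pe_of if pe_of[t] == pe), key=lambda t: alap[t])
        # Consecutive actors in the firing order: no third actor fires between them.
        for z, j in zip(order, order[1:]):
            if (z, j) not in edges:  # no direct edge already in the original graph
                new_edges.add((z, j))
    return new_edges
```

For the situation of figure 5.1, where actors 3 and 4 share a PE with no direct edge between them, the call adds the resource edge (3, 4) while leaving the data edges untouched.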
In figure 5.1, actors 3 and 4 are mapped to the same PE, along with actors 6 and 7. Actors 1, 2 and 5 are mapped on PE 1. Because of the data dependencies between actors 1, 2 and 5, the ordering of execution on PE 1 is straightforward. On the other hand, on PE 2 there is no data dependency between actors 3 and 4, and both of them depend on the completion of actor 1. Because ALAPs(3) = ALAPs(4), we have to choose the order of execution between them arbitrarily. By choosing 3 to execute before 4 we impose a resource dependency between them. This new dependency is represented by the addition of an edge directed from actor 3 to actor 4.
The last step is to add the communication cost to edges that cross VFDs (figure 5.2). Edges with a communication cost other than zero are already present in Gs. In order to add the communication cost, we should check whether two actors connected by an edge in Gs are mapped on the same VFD. In figure 5.2, the candidate edges are thus e(1, 3), e(1, 4), e(2, 5) and e(7, 8). All other edges have an associated communication cost equal to 0.
Formalizing, an edge connecting actors τ and z has WCET ≠ 0 iff BI(τ) ≠ BI(z). Then a WCET equal to:

WCET(e(τ, z)) = (|BE(e(τ, z))| + Nflits/token · Prate(src(e(τ, z))) − 1) · Lflit/hop    (5.3)

is associated with this edge. In equation (5.3), |BE(e(τ, z))| returns the number of hops between the two domains and Lflit/hop is the worst case latency of a hop, which is considered constant and known.
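Equation (5.3) can be evaluated directly. The sketch below is our own rendering of one plausible reading of the formula (the wormhole-style latency of hops plus total flits minus one, times the per-hop latency); the argument names are ours:

```python
def edge_wcet(hops, flits_per_token, prod_rate, l_flit_hop):
    """Worst-case NoC latency of an inter-VFD edge, eq. (5.3):
    (hops + flits_per_token * production_rate - 1) * latency_per_hop."""
    return (hops + flits_per_token * prod_rate - 1) * l_flit_hop
```

For example, 2 hops, 4 flits per token, a production rate of 3 tokens and a unit hop latency give a worst-case edge latency of 13 time units.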
Figure 5.2: Adding WCET to edges
Segmentation
To divide the iteration period into segments, we follow an approach similar to the one presented in [29]. We consider the transformed data-flow graph G∗s, which describes the functionality of the application, together with the information from the mapping of actors to VF domains and PEs.
A segment is a time interval in which the voltage and frequency are constant. The frequency and voltage are considered to be scaled at the beginning of each segment. The segments are sequential and non-overlapping. The number of segments per VFD is greatly affected by the number of actors mapped to the VFD; segments that share the same voltage-frequency pair can be clustered together. Assuming that n actors are mapped on a VFD i, the maximum number of segments is 2·n + 1. This maximum is reached when the invocations of all actors mapped to the VFD i fall at different time moments. We associate with each VFD i ∈ I a set of boundary points B(i) = {b0, ..., bk} to define these voltage-frequency segments. With each boundary point b ∈ B(i) we associate a clock cycle and a frequency. We omit the voltage from the boundary point's definition, since voltage and frequency have a linear relationship.
• We define a boundary point to be a pair of clock cycle and frequency, b ∈ N × F, where N denotes the number of cycles, which is domain specific and can be calculated after mapping, and F is the set of available frequency levels.
We define the set B to contain all invocation and completion times of all actors in the data-flow graph, sorted in increasing order. We associate with each domain a subset of B, I → 2^B. This subset contains the boundary points which divide the execution interval into segments of constant voltage and frequency.
• ∀i ∈ I, B(i) = {b0, ..., bk} ⊆ B defines the VFD-specific set of boundary points.
Since the boundary points are invocations of actors or completions of their execution, each domain-specific subset should contain only the boundary points derived from tasks mapped to the specific domain. As mentioned above, the total number of boundary points associated with a domain is at most 2 · Σ_{p∈BPE⁻¹({i})} |BT⁻¹({p})| + 1. The segments defined by these points should be sequential and non-overlapping. Last but not least, since each boundary point also represents a possible frequency-voltage scaling point, if two such points coincide, then they should share the same frequency-voltage pair. The above discussion can be summarized in the following constraints:
• b = <n, f>: a boundary point is a cycle-frequency pair
• b.n = S(τ) or b.n = S(τ) + WCEC(τ): a boundary point coincides with an actor's invocation or completion.
• ∀bj, bz ∈ B(i): bj.n ≤ bz.n ⇐⇒ j < z, i.e. the set B(i) is sorted with respect to b.n
• ∀bj, bz ∈ B(i): bj.n = bz.n ⇐⇒ bj.f = bz.f, i.e. only one boundary point is defined if two or more actors are invoked at the same time.
• b0 ∈ B(i), ∀i ∈ I
• b0.n = 0: there is always a boundary point associated with the start of the frame.
• b.f ∈ [fmin, fmax]
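The construction of the domain-specific boundary set can be made concrete with a short Python sketch. This is our own illustration with hypothetical names: it collects the invocation and completion cycles of the actors of one VFD, merges coinciding events, and assigns a single nominal frequency to every point:

```python
def boundary_points(actors, f_nom):
    """actors: dict mapping actor name -> (start_cycle, wcec), all on one VFD.
    Returns the ordered boundary points as (cycle, frequency) pairs."""
    cycles = {0}  # b0: a boundary point always marks the start of the frame
    for start, wcec in actors.values():
        cycles.add(start)          # invocation
        cycles.add(start + wcec)   # completion
    # Coinciding invocations/completions collapse into a single boundary point,
    # so at most 2*n + 1 points exist for n actors.
    return [(n, f_nom) for n in sorted(cycles)]
```

Two actors whose events all fall at distinct cycles yield the maximum 2·2 + 1 = 5 boundary points; when events coincide, fewer points remain.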
The boundary points, their properties, and the boundary point subsets have now been defined. One may notice that in the above discussion and definitions we have used cycles instead of time. The intuition behind this is that with frequency and voltage scaling, the width of clock cycles changes. Frequency and voltage scaling in one domain has no effect on the boundary points of that domain, something that would not hold if boundary points were defined in terms of time. Frequency and voltage scaling might, however, affect boundary points of other domains. This is the case when there is a direct or indirect dependency between the actor whose invocation or completion (boundary point) is used as a scaling point in one VFD and an actor mapped to another VFD. Presumably, the latter would be a direct or indirect successor of the first actor, and a path would exist between them in the transformed graph G∗.
A segment, as described earlier, is defined between two sequential boundary points:
• sgj.start = bj
• sgj.end = bz
• sgj.f = bj.f
• sgj.st = active ⇐⇒ ∃τ ∈ BI⁻¹({i}) | S(τ) = sgj.start ∨ S(τ) + WCEC(τ) = sgj.end
such that bj, bz ∈ B(i), i ∈ I, and no bx ∈ B(i) falls strictly between bj and bz. Since with each boundary point we associate one VF pair, we can calculate the time span of each segment as:

timespan(sgj) = (bj+1.n − bj.n) / bj.f    (5.4)
and the timing of each boundary point as:

time(bj) = Σ_{sgi ∈ [sg0, sgj−1]} timespan(sgi),    (5.5)

with [sg0, sgj−1] ⊆ SG(i) ∧ bj ∈ B(i)
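Equations (5.4) and (5.5) translate directly into code. In this sketch (our own notation, not from the thesis) a boundary point is a (cycle, frequency) pair and the segments are implied by consecutive points:

```python
def timespans(bpoints):
    """Eq. (5.4): wall-clock span of each segment [b_j, b_j+1) at frequency b_j.f."""
    return [(b2[0] - b1[0]) / b1[1] for b1, b2 in zip(bpoints, bpoints[1:])]

def time_of(bpoints, j):
    """Eq. (5.5): wall-clock time of boundary point j = sum of earlier spans."""
    return sum(timespans(bpoints)[:j])
```

For boundary points at cycles 0, 2 and 4 with frequencies 1.0 and then 0.5, the two segments span 2.0 and 4.0 time units, so the last boundary point falls at time 6.0: halving the frequency doubles a segment's wall-clock span without changing its cycle count.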
Timing constraints
As a next step we define the timing constraints with respect to the boundary points defined earlier. In order to define this exact timing, we have to consider the VF schedule fixed. Since segments are defined over boundary points, by construction the constraint:

• Σ_{sgj ∈ SG(i)} (sgj.end.n − sgj.start.n) ≥ lb_cycles(i)

which imposes that the sum of the cycle spans of all segments in one island be at least equal to the lower bound of cycles (4.13), always holds. SG(i) is the island-dependent set of segments.
Based on equation (5.5), the invocations and completion times of all actors in the data-flow graph should be less than the deadline D:

• time(bi) ≤ D, ∀bi ∈ B
Computation of favg
We have to compute the average frequency of an actor so that we can compute its energy consumption. Frequency and voltage are associated with a boundary point, as described earlier, and boundary points are defined by the invocations and completions of actors. Thus we can find a subset of the ordered set B(i) containing all boundary points from the invocation of an actor to its completion. Formally, we define the actor-specific boundary point set P(τ) to be a subset of B(i), i.e. P(τ) ⊂ B(i), where i is the domain the actor τ is mapped to. We remind that both S(τ) and S(τ) + WCEC(τ) define two boundary points in B(i). Then, to form the set P(τ), we take from B(i) only those boundary points that fall in the interval [S(τ), S(τ) + WCEC(τ)]:

P(τ) = [S(τ), S(τ) + WCEC(τ)]    (5.6)
     = {bj ∈ B(i) | S(τ) ≤ bj.n ≤ S(τ) + WCEC(τ)}    (5.7)
Having associated a subset of boundary points with an actor, we can compute its average frequency as:

favg(τ) = Σ_{j=2}^{|P(τ)|} ((bj.n − bj−1.n) / WCEC(τ)) · bj.f    (5.8)

where bj ∈ P(τ) for all j ∈ [1, |P(τ)|]
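Equations (5.7) and (5.8) combine into the following sketch (our own helper names, not the thesis' code), which selects the boundary points covering an actor's execution and forms the cycle-weighted average frequency:

```python
def p_set(bpoints, start, wcec):
    """Eq. (5.7): boundary points falling inside [S(tau), S(tau) + WCEC(tau)]."""
    return [b for b in bpoints if start <= b[0] <= start + wcec]

def f_avg(p, wcec):
    """Eq. (5.8): each inter-point cycle gap, weighted by its share of WCEC,
    multiplied by the frequency of the gap's closing boundary point."""
    return sum((bj[0] - bi[0]) / wcec * bj[1] for bi, bj in zip(p, p[1:]))
```

For an actor with WCEC = 4 spanning boundary points at cycles 0, 2 and 4, where the second half runs at twice the frequency of the first, the weighted average lands between the two levels.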
Precedence constraints
We can now derive the constraints necessary to preserve the data dependencies between actors in the data-flow graph. If there is an edge directed from actor τ to actor z, then the time of the boundary point associated with the completion of actor τ must be no later than that of the boundary point associated with the invocation of z. These precedence constraints can be expressed as:

• time(min_{bj∈P(z)}(bj.n)) ≥ time(max_{bj∈P(τ)}(bj.n))

with z, τ ∈ T such that e(τ, z) ∈ E∗.
5.1.3 Discussion
We have given above an exhaustive formulation of the problem constraints. We can sum up the optimization problem with its constraints as:
Minimize:

E = Σ_{τ∈T} α(τ) · C · favg(τ)² · WCEC(τ)

under

bi.n = S(τ) ∨ bi.n = S(τ) + WCEC(τ),
bi.f ∈ [fmin, fmax],
∀bj, bz ∈ B(i): bj.n = bz.n ⇐⇒ bj.f = bz.f, ∀i ∈ I,
b0.n = 0 ∧ b0 ∈ B(i), ∀i ∈ I,
sgj.start = bj ∧ sgj.f = bj.f,
time(bi) ≤ D, ∀bi ∈ B,
time(min_{bj∈P(z)}(bj.n)) ≥ time(max_{bj∈P(τ)}(bj.n)) ⇐⇒ e(τ, z) ∈ E∗

where the expressions for time, favg and P(τ) are given by:

time(bj) = Σ_{sgi ∈ [sg0, sgj−1]} timespan(sgi)

favg(τ) = Σ_{j=2}^{|P(τ)|} ((bj.n − bj−1.n) / WCEC(τ)) · bj.f

P(τ) = {bj ∈ B(i) | S(τ) ≤ bj.n ≤ S(τ) + WCEC(τ)}
In all related work on energy minimization, whether for multiprocessor systems where the granularity of VF regulation is the PE, or for cluster-based approaches where the granularity is the VF domain, the workload to run at the optimal frequency is known and constant. In the first case, with individually controlled PEs, the workload considered for VF regulation is the workload of each task/actor separately. In the cluster-based case, the workload considered is the execution cycles of the tasks mapped on one PE, starting from the least loaded one, and then the difference in workload between subsequent PEs of the same VF domain, traversing them in increasing order of their mapped workload. In this way, the problem is relaxed to that of multiprocessor energy minimization with individually managed PEs, which is proven to be NP-hard. However, this method works only when scheduling and frequency scaling an independent task set and when the switching activity of all actors is considered the same. Under the VF-domain assumption, and when dealing with data-flow graphs, the workload considered for energy harnessing is the sum of the workloads of all actors mapped on a VF domain; energy is then minimized by stretching the execution of this workload until the deadline. In order for this approach to work, two major assumptions were made:
• All actors have the same switching activity
• There is no inter-island communication
The assumption of a single switching activity over all actors does not hold for real applications, and the assumption that the data-flow graph can be mapped on one VF domain might not hold for large applications or for VF domains with only a small number of PEs.
Although the formulation of the problem with the notion of segments allows us to account for different switching activities and for inter-island communication, it makes the problem even more complex. The additional complexity comes from the fact that there is no longer a constant workload to be considered for voltage and frequency scaling. If we assume that the VF domain contains enough PEs for the graph under consideration, then VF scaling will indeed not affect the overlapping of actors, since each actor fires after all its predecessors have finished their execution. This happens because the voltage-frequency pair, when scaled down, stretches the execution on all PEs of the domain simultaneously. To tackle this problem we propose a clustering technique based on the segmentation of the execution time. With this technique we create super nodes (see section 5.2.3) with switching activity inherited from the clustered actors. Since switching activity actually affects the energy gain, taking the switching activity variation between actors into account when clustering might result in more efficient energy harnessing; how to do so is presented in section 5.2.3.
Since for now we are only considering one VF domain, VF scaling does not affect the invocations of actors with respect to clock cycles: VF scaling stretches the execution time span by stretching the period of the clock cycles, but it does not change the total number of clock cycles needed by an actor. Our system is also constrained to have at least enough execution cycles to complete one iteration of a valid schedule. Looking at the invocation of actors in terms of clock cycles enables us to compute the overlapping of actors' executions across different PEs and perform a clustering, relaxing the problem to that of multiprocessor energy-efficient scheduling with individually managed PEs. How the scheduling of actors and the clustering of the interleaved execution intervals take effect will be presented in a later section.
Multiple VFDs and variable P(τ)
We want to be able to accommodate very large graphs, mapped across several VF domains. The problem arising when taking inter-domain dependencies into account is that the scheduled invocations, and thus the overlapping of execution times, can no longer be considered constant, even when expressed in terms of clock cycles. To illustrate how VF scaling in one domain can affect invocations of actors in other domains, consider the following scenario: there exist two actors τ and z, with τ, z ∈ T, that have a direct data dependency, i.e. there is an edge between them in the original graph, and that are mapped on different VF domains.
• e(τ, z) ∈ E: τ is a direct predecessor of z.
• BI(τ) ≠ BI(z): τ and z are mapped on different VFDs.
Assume now that our energy-efficient scheduling algorithm finds that stretching actor τ's execution leads to a large energy reduction. This stretch causes the start of execution of edge e(τ, z) to shift in time, by an amount equal to the difference between the completion time of τ after VF scaling and its completion time before scaling. The invocation of actor z is given by:

S(z) = max_{e∈dst⁻¹(z)} (S(e) + WCEC(e)), e ∈ E∗    (5.9)

The stretching of actor τ's execution time affects z's domain if this shift causes the completion time of edge e(τ, z) to dominate the completion times of all other edges that have actor z as dst in the transformed graph G∗s:

max_{e∈dst⁻¹(z)} (S(e) + WCEC(e)) = S(e(τ, z)) + WCEC(e(τ, z))
This chain of events is not limited to actors with a direct data dependency: the same effect can propagate to yet another domain through z's shifting, when considering the transformed graph. Following this chain of events, we see that stretching the execution time of an actor in one domain can cause shifting of actors in that or other domains. Now S(τ) cannot be considered constant with respect to clock cycles and, consequently, neither are the boundary points, which are defined by the invocations and completions of actors. Since S(τ) and bi can be shifted by a frequency scaling decision, the interleavings among actors are not constant, and the same applies to the set P(τ) defined in equation (5.6).
Complexity due to variable P(τ)
To demonstrate the increased complexity of the energy-efficient scheduling problem under our assumptions, both for the underlying hardware and the application model, we should look at the actor-dependent set P(τ). If this set were constant during the frequency scaling part of an energy-efficient scheduling algorithm, it would mean that the invocations and completions of actors with respect to clock cycles and, consequently, the frequency scaling points and the segmentation of the domain's execution interval would also be constant. This is the case when a single VFD is enough to execute the graph before the deadline. A constant set P(τ) would allow us to know exactly the workload of each segment and thus the energy gain after frequency scaling.
Consider the HSDFG of figure 5.3a, where c = WCET given by equation (5.3), along with the actor properties and binding information presented in figure 5.3b. Figure 5.4 shows the schedule at nominal frequency. In order to derive the set P(τ) we should first divide the execution interval of each domain into segments; the boundary points of the segments are the green dashed lines in figure 5.4. In order to decide where to apply DVFS, we should consider the total switching activity within a segment as well as its timespan. In this way we scale the VF first for the most power-consuming combination of actors. Assume that the segment [3.25, 5.25], where actors 3 and 4 are active simultaneously, has the maximum energy consumption. We decide thus to lower the VF, after which the schedule will look like figure 5.5a.
The decision to lower the VF in this segment was based on the combined switching activities of actors 3 and 4 as well as the timespan of the segment. However, as described above, the set P(τ) cannot be considered constant when multiple VFDs are needed. From the frequency scaling point of
Figure 5.3: (a) Sample HSDFG, (b) the corresponding binding:

Actor  WCEC  VFD  PE
1      3     2    1
2      1     2    1
3      2     1    2
4      4     1    1
5      2     2    1
6      2     1    2
7      1     1    2
Figure 5.4: The schedule of the HSDFG of fig. 5.3a at nominal frequency
view, this means that a decision that provides the best energy gain in one iteration of the heuristic can be invalidated by subsequent ones. This invalidation is caused because the frequency scaling points move as actors' invocations and completions move, thereby possibly affecting a different amount and kind (in terms of consumption) of workload than the one considered when scaling down the frequency in the previous iteration. Moving frequency scaling points to different workloads means that the energy gain is different. Not having a constant set P(τ) also means that, in the objective function (5.2), the term favg(τ), given by equation (5.8), cannot be considered constant between iterations of a frequency scaling algorithm, since it depends on the elements of P(τ).
To illustrate this invalidation we proceed from figure 5.5a. Since there is more idle time that we could take advantage of, we find the next segment with the highest energy consumption. Assume now that this is the segment [3, 5] on VFD2. We scale down the frequency in this segment as shown in figure 5.5b. It is evident now that the first decision, to scale the VF in the segment [3.25, 5.25], is no longer valid: the workload affected by this decision has changed, as the invocation of actor 6 has moved.
Figure 5.5: (a) DVFS in segment [5.25, 7.525] on VFD1, (b) DVFS in segment [3,4] on VFD2
Conclusion
With the above discussion we demonstrated the difficulty that arises when considering the energy-efficient scheduling of data-flow graphs across multiple VFDs. The difficulty stems from the fact that we cannot associate a fixed amount and kind of workload with a VF scaling point, as is done in all related works. Differently from all related approaches, we propose a clustering of workloads according to the domain's segmentation, which enables us to reduce the problem to that of energy-efficient multiprocessor scheduling with individually managed PEs and to use one of the many available and sophisticated heuristics to minimize the energy consumption.
5.2 Our proposal
5.2.1 Useful terms
Before delving into our proposal, we first introduce some basic terms fundamental to its description. The necessary condition for harnessing the energy consumption is the existence of intervals in which the PEs are idle. Idle intervals can be present at the start and end of the schedule, as well as between actors' invocations and completions. They depend on the mapping of actors to PEs and VFDs, as well as on the scheduling policy used, and are a result of the data-flow notion of invocation, i.e. for an actor to be ready to fire, all its predecessors and the relevant communications must have finished. An idle interval is called slack. Since we divide the domain's execution interval into segments, we expect that these segments will be either active or idle. We remind that for a VFD to be idle, all its PEs must be idle at the same time. Since frequency scaling stretches the cycle period, and thus the execution time, on all PEs simultaneously, we define two terms, the Maximum Usable Slack and the Available Usable Slack, in order to have a metric for the maximum possible frequency scaling allowed and the potential energy savings per island.
Maximum Usable Slack
The Maximum Usable Slack (MUS) denotes the maximum number of time units that could possibly be used for energy harnessing through DVFS. This upper bound on stretching depends heavily on the mapping of actors to PEs and can be derived as:

MUS(i) = D − max_{p∈i}(ET(p)), i ∈ I, p ∈ PE    (5.10)

where ET(p) returns the total execution time of the actors mapped on PE p. To get the maximum available slack, this sum of execution times should be computed under fmax. With this definition, the Maximum Usable Slack refers to the upper bound on execution time stretching through DVFS. However, the actual upper margin on the available slack can be lower, because of the mapping and scheduling policy of actors.
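Equation (5.10) is a one-liner in code. In this sketch (our own names), `exec_times` is a hypothetical mapping from each PE of island i to the total execution time, at fmax, of the actors bound to it:

```python
def mus(deadline, exec_times):
    """Eq. (5.10): slack left after the most heavily loaded PE of the island.

    deadline   -- iteration deadline D
    exec_times -- dict: PE -> total execution time of its actors at f_max
    """
    return deadline - max(exec_times.values())
```

With ET(1) = 4, ET(2) = 5 and D = 10, this yields 5 time units, as in the worked example that follows.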
To illustrate the MUS given a static binding of actors to PEs, consider the HSDFG of figure 5.6a. The actors' WCEC and the corresponding binding information are given in figure 5.6b. At nominal frequency, the schedule on the two PEs would be the one of figure 5.7. According to equation (5.10), and since ET(1) = 4 and ET(2) = 5, the MUS of this cluster is equal to 5 time units. Although these 5 time units represent the upper bound, the actual number of time units that a DVFS algorithm can use without violating the deadline, represented by the red line, is equal to 3 time units. Stretching
Figure 5.6: (a) Sample HSDFG and (b) the corresponding binding:

Actor  WCEC  PE
1      2     1
2      2     1
3      1     2
4      3     2
5      1     2
Figure 5.7: The schedule of the HSDFG of fig. 5.6a at nominal frequency
the execution time of any actor by more than 3 time units would cause a deadline violation. Such a violation is shown in figure 5.8, where the frequency is scaled down to fmax/4 upon the invocation of actor 3 and scaled back to the nominal value upon the invocation of actor 4. Since actor 2 is active when the frequency is scaled down, its execution is also stretched.
Available Usable Slack
The Available Usable Slack (AUS) denotes the actual upper bound on stretching the execution time of actors. As described above, the amount of time available for scaling down the voltage and frequency can differ from the Maximum Usable Slack. However, since we only care about the idle intervals of VFDs within one iteration of the schedule, and not about idle intervals at the granularity of PEs, the calculation of the Available Usable Slack is more complex: we must determine the intervals where all PEs are idle simultaneously. For each island i, the Available
Figure 5.8: Scaling the frequency to fmax/4 on the invocation of actor 3 and back to fmax on the invocation of actor 4
Figure 5.9: (a) Sample HSDFG, (b) the corresponding binding:

Actor  WCEC  VFD  PE
1      1     1    1
2      1     1    1
3      2     1    2
4      5     2    1
5      2     1    1
6      2     1    2
7      1     1    2
Usable Slack is given by:

AUS(i) = (aus1(i) + aus2(i) + aus3(i)) / fmax    (5.11)

aus1(i) = min_{τ} S(τ), τ ∈ BI⁻¹({i})

aus2(i) = Σ_{j=2}^{|BI⁻¹({i})|} (max(S(τj), S′j) − S′j), with S′j = max(S′j−1, S′(τj−1))

aus3(i) = ⌊D · fmax⌋ − max_{τ} S′(τ)

For the set BI⁻¹({i}) we have BI⁻¹({i}) = {τ1, ..., τ|BI⁻¹({i})|} with S(τj−1) ≤ S(τj) for all τj−1, τj ∈ BI⁻¹({i}). S(τ) and S′(τ) return the invocation and completion of an actor, in terms of absolute clock cycles. The first term, aus1(i), returns the idle time from the beginning of the iteration until the first invocation of an actor. The second term, aus2(i), returns the idle times between the first invocation and the last completion of an actor. Finally, the last term, aus3(i), returns the idle time from the latest completion of an actor until the deadline.
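The three terms of equation (5.11) can be computed in one pass over the actors of an island, sorted by invocation. The sketch below is our own rendering, not the thesis' implementation; `intervals` is a hypothetical list of (S, S′) invocation/completion pairs in absolute clock cycles:

```python
def aus(intervals, deadline, f_max):
    """Eq. (5.11): total time in which *all* PEs of the island are idle."""
    iv = sorted(intervals)                 # ordered by invocation S(tau_j)
    aus1 = iv[0][0]                        # idle before the first invocation
    aus2, latest = 0, iv[0][1]             # latest = S'_j, max completion so far
    for s, e in iv[1:]:
        aus2 += max(s, latest) - latest    # gap in which no actor is running
        latest = max(latest, e)
    aus3 = int(deadline * f_max) - max(e for _, e in intervals)
    return (aus1 + aus2 + aus3) / f_max
```

Assuming the invocation/completion cycles read off the schedule of figure 5.7 are (0,2), (2,3), (2,4), (3,6) and (6,7), with D = 10 and fmax = 1, this returns 3, in agreement with the worked example below.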
Returning to figure 5.7, for the calculation of the AUS we have:

aus1 = 0, aus2 = 0, aus3 = ⌊10 · 1⌋ − 7 = 3

yielding a total of 3 time units of AUS. To give a better intuition for the first two terms of equation (5.11), consider the HSDFG of figure 5.9a. Two VFDs are used to execute this graph. The binding of the actors to PEs and VFDs is shown in figure 5.9b and the corresponding schedule at nominal frequency is shown in figure 5.10. Based on this schedule, for VFD1 we calculate the terms of (5.11) to be:

aus1(1) = 0, aus2(1) = 1.5, aus3(1) = ⌊10 · 1⌋ − 7.5 = 2.5

yielding AUS(1) = 4 time units, while for the second VFD we have:

aus1(2) = 1.25, aus2(2) = 0, aus3(2) = ⌊10 · 1⌋ − 6.25 = 3.75
Figure 5.10: The schedule of the HSDFG of fig. 5.9a at nominal frequency
yielding AUS(2) = 5 time units.
Actor Mobility Window
The Actor Mobility Window (AMW) denotes the interval between the earliest and the latest possible
starting time of an actor, so that both data, resource, and timing constraints are satisfied. The earliest
possible starting time, based on the transformed graph, is given by equation equation (3.10) and the
latest possible starting time by equation (3.4). The AMW of an actor, is dependent on the invocations
of all its predecessors, since we traverse the graph downwards, for the computation of (3.10) and on
the invocations of its successors when computing the ALAP start time. We will use this window, to
explore possible actor shifting, after scheduling, in order to increase the AUS.
AMW (τ) = [ASAPs(τ), ALAPs(τ)], τ ∈ T (5.12)
We use the AMW in order to look for potential AUS increases. Thus, when computing the AMW of each actor, all other actors are considered fixed in terms of their invocation times. The AMW is calculated on fmax. Consequently, in formula (3.10) we will not use the ASAPf times of predecessors and incoming edges, but the actual invocations, given by the function S. In this way, we expect that the range of the AMW of some actors will be zero (ALAPs(τ) − ASAPs(τ) = 0). An empty AMW (ALAPs(τ) = ASAPs(τ)) means that if the actor is shifted, a violation will occur, either because the data and/or resource precedence constraints are not met or because the deadline is missed.
Referring to figure 5.11, we find the AMW of actor 6 to be the interval [3, 4.5]. The start time of actor 6 can be shifted within this interval in order to explore possible increases in the AUS of the two VFDs. On the other hand, the range of the AMW of actor 4 is zero, as ASAPs(4) = ALAPs(4). Since all other actors are considered fixed with respect to their S, shifting actor 4 would result in a precedence constraint violation.
Actor Overlapping Ratio
Finally, one additional and necessary term for actor shifting exploration is the Actor Overlapping Ratio (AOR). The AOR denotes the amount of overlapping of actor τ's execution by other actors. This amount of overlapping gives us a hint on how much we can improve the AUS by shifting the
50 Energy Efficient Scheduling
Figure 5.11: The schedule of the HSDFG of fig. 5.9a on nominal frequency (time axis 0–10). The limits of the AMW for actors 4 and 6 are noted with dashed green lines
actor’s τ start time inside its AMW. In order to compute the AOR, we should find the idle intervals
within the execution time of an actor. These idle intervals are such that all the PEs but the one that
the actor τ is mapped to are idle simultaneously. The AOR is calculated after the scheduling process
on fmax has finished. Since the execution interval of an actor is [S(τ), S(τ) +WCEC(τ)], we can find
all actors whose starting or completion times fall in this interval and form, afterwards, an ordered list
(OL) in terms of their starting times. Elements of this list are thus:
OL(τ) ={z1, z2, ...zn} such that
zj ∈ OL(τ) ⇐⇒ BI−1({BI(τ)})∧(
S(zj) ∨(S(zj) +WCEC(zj)
))∈ [S(τ), S(τ) +WCEC(τ)]∧
S(zj−1) ≤ S(zj)
Based on the schedule in figure 5.10 we can form the OL for all actors in the HSDF of figure 5.9a, as in table 5.1. In order to find the sum of the idle intervals within an actor's execution, we can use the same notion as in formula (5.11), but on the actor's execution interval [S(τ), S(τ) + WCEC(τ)] and modified to examine, in this interval, only the actors that are elements of OL(τ). The three terms that return the idle time units within this reduced interval would then be:
aor1(τ) = max(S(z1), S(τ)) − S(τ),

aor2(τ) = Σ_{j=2}^{|OL(τ)|} (max(S(zj), S′j) − S′j),  where S′j = max(S′j−1, S′(zj−1)),

aor3(τ) = S′(τ) − min(S′(τ), S′(z|OL(τ)|)),

with S′(z) = S(z) + WCEC(z) denoting the completion time of actor z.
Since aor1, aor2 and aor3 will return the amount of idle time units within the actor’s execution, to
calculate the AOR we will use the equation:
AOR(τ) = 1 − (aor1(τ) + aor2(τ) + aor3(τ)) / WCEC(τ)    (5.13)
Based on the schedule of figure 5.10 we calculate the AOR for the actors as in table 5.1. With the information from the AOR we can find the actors for which a possible shifting within their AMW could
Actor  OL       AOR
1      2, 3     0
2      1, 3     0.5
3      2, 5     0.5
4      ∅        0
5      3, 5, 6  1
6      5        1
7      ∅        0

Table 5.1: The OLs and AORs of the actors from the HSDF of figure 5.9a
lead to an increase in the AUS. These actors now constitute a candidate list. Possible actors for shifting in this case are 1, 2, 3, 4 and 7. To choose among the candidates we use the information from the AMW: from the candidate list we exclude those actors for which ASAPs = ALAPs and thus the range of their AMW is zero.
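As an illustration of the mechanics, the construction of OL(τ) and the AOR of equation (5.13) can be sketched as below. This is a hedged sketch, not the thesis code: the schedule layout (actor → (start, completion)) and the times are hypothetical, and all actors shown are assumed to belong to the same island.

```python
# Sketch of OL(tau) and of equation (5.13). `sched` maps
# actor -> (start, completion); all actors share one island here.

def ordered_list(tau, sched):
    """Actors whose start or completion falls in tau's execution interval."""
    s, e = sched[tau]
    ol = [z for z, (zs, ze) in sched.items()
          if z != tau and (s <= zs <= e or s <= ze <= e)]
    return sorted(ol, key=lambda z: sched[z][0])      # increasing start times

def aor(tau, sched):
    """1 - (idle time inside tau's execution window) / WCEC(tau)."""
    s, e = sched[tau]
    idle, frontier = 0.0, s
    for z in ordered_list(tau, sched):                # aor1 and aor2: leading/middle gaps
        zs, ze = sched[z]
        if zs > frontier:
            idle += min(zs, e) - frontier
        frontier = max(frontier, min(ze, e))
    idle += max(0.0, e - frontier)                    # aor3: tail gap
    return 1.0 - idle / (e - s)

# Hypothetical island schedule:
sched = {1: (0, 2), 2: (2, 4), 3: (1, 3), 6: (5.5, 7.5)}
print(ordered_list(1, sched))   # [3, 2]
print(aor(1, sched))            # 0.5 (half of actor 1 is overlapped by actor 3)
print(aor(6, sched))            # 0.0 (empty OL: nothing overlaps actor 6)
```

An actor with an empty OL gets AOR = 0, matching the ∅ rows of table 5.1.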
5.2.2 The algorithm
Scheduling
The scheduling of actors under fmax is done according to the latest possible start time. A pseudo
algorithm for ALAPs based list scheduling is shown in algorithm 1. For the calculation of the ALAPs
times, the transformed dataflow graph should be traversed backwards. All actors are then sorted in order of increasing ALAPs, and the invocation of each actor is set equal to its ALAPs time. As
we mentioned earlier, to calculate the ALAPs, we use equation (3.4). In this way, we move the idle
intervals backwards in time and push the schedule to finish on the deadline. Schedules based on latest possible starting times have been shown to yield comparable or better results, in terms of schedulability and schedule length, than most other scheduling approaches [30, 23]. A similar approach is adopted in [29], where the authors observe that the utilization of the idle intervals is restricted when these are
distributed close to the end of the frame. However, the scheme of frequency scaling and actor shifting
approach are completely decoupled from the scheduling of the graph, enabling thus the usage of any
scheduling algorithm. Since scheduling actors according to their latest possible starting times leads to
a minimized schedule length, we expect that the metrics MUS and AUS will be reasonably close to
each other. Based on the ALAPs list scheduling approach, the schedule of figure 5.10 is transformed to
the one in figure 5.12. Segmentation of the execution interval and clustering at this point will lead thus
to a very energy efficient solution. We described how segmentation of the execution interval is done, in
previous sections. The clustering will be described later. We expect, however, that there will be some
room for further improvement in AUS, even when the schedule length is minimum. Increasing the
available usable slack will allow for better energy savings. This increase in the AUS is done through
actor shifting, as will be described next.
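A minimal sketch of the ALAPs computation on which Algorithm 1 relies is given below, under simplifying assumptions: only precedence constraints are modelled (no resource constraints), the graph is assumed to be given as successor lists, and the chain used is hypothetical rather than the graph of figure 5.9a.

```python
# Sketch: backward (ALAP) traversal over precedence edges only. `wcec` maps
# actor -> worst-case execution cycles, `succ` maps actor -> successors.
from functools import lru_cache

def alap_starts(wcec, succ, deadline):
    @lru_cache(maxsize=None)
    def alap(tau):
        ends = [alap(z) for z in succ[tau]]
        latest_end = min(ends) if ends else deadline   # must finish before any successor starts
        return latest_end - wcec[tau]
    return {tau: alap(tau) for tau in wcec}

# Hypothetical chain 1 -> 2 -> 3 with WCECs 2, 3, 1 and deadline 10:
wcec = {1: 2, 2: 3, 3: 1}
succ = {1: [2], 2: [3], 3: []}
print(alap_starts(wcec, succ, 10))   # {1: 4, 2: 6, 3: 9}
```

Setting S(τ) ← ALAPs(τ) then pushes all idle intervals backwards in time and makes the schedule finish exactly on the deadline.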
Actor Shifting
With the shifting of an actor's start time within the AMW, we want to achieve a better overlapping and thus an increased AUS and energy saving capability. However, we expect that there will be only a few actors whose AMW has a range greater than zero. This is actually because the ASAPs times of actors in equation (5.12) are calculated within the reduced interval
Figure 5.12: ALAPs based schedule of the HSDFG of fig. 5.9a on nominal frequency (time axis 0–10). The reduced intervals for the AMW calculation in VFD1 and VFD2 are indicated
[min(S(τ)), max(S(z) + WCEC(z))], where τ, z ∈ BI⁻¹(i) and all times are taken on fmax, as shown in figure 5.12. This reduced interval is
different for each island and depends on the mapping of actors and the scheduling policy. We confine
the calculation of ASAPs times, because we want to increase the idle time as much as possible within
this reduced interval. In this way the first and third terms of equation (5.11) remain constant and the second term, which represents the idle times within the mentioned interval, will be increased.
A shifting is valid if two conditions are met. Firstly, the AUS in the current island does not decrease
and secondly, the AUS in all other islands either remains intact or increases. If a shifting results in a decrease in the slack time, the actor is removed from the candidate list. Thus, we only make positive
steps towards the increase of AUS.
Finally, we also expect that within the AMW there might exist more than one solution leading to the same increase in AUS. If this is the case, shifting the actor to the earliest possible event will lead to a better, or at least the same, potential increase in AUS than choosing any other event later in time.
The Shifting algorithm
To reduce the complexity of the algorithm, we consider only shifting a candidate actor to events
within its AMW. These events are invocations and completions of actors that fall into the AMW.
In each iteration of the algorithm, the AUS of all domains is re-evaluated. If the AUS does not decrease in any domain, the exploration for the actor proceeds. Otherwise, the actor returns to
the previous best scheduling point and afterwards it is removed from the list of candidates. The
algorithm 2 is responsible for deriving the list Cd of candidate domains to perform actor shifting.
In Algorithm 2, the MUS and AUS are calculated using equations (5.10) and (5.11) respectively.
Algorithm 3 is responsible for deriving a list of candidate actors Cτ for shifting on the reduced interval
[min(S(τ)),max(S(z) + WCEC(z))] of each domain, as described in the previous subsection. This
reduced interval is induced by changing the arrival time A(τ) of actors. Each actor that has an AMW range greater than zero is placed in a candidate list, Cτ, for shifting. Furthermore, we associate, with each
such candidate, a set of actors OL whose invocations fall into the AMW. The intuition behind sorting
the candidate list Cτ in order of increasing starting times, or actually, absolute starting cycle, is that
we do not want a later shifting to invalidate earlier ones. This invalidation might be the result of a
new AMW narrower than the one used when the actor was shifted. Exploring all candidate actors in that order allows us to have a safe AMW when re-evaluating the ASAPs time in Algorithm 4. There, for each candidate actor, we first check whether the associated AMW has changed; this can be the result of the shifting of a predecessor actor. We check this by recalculating the ASAPs on the reduced interval. If this is equal to AMW.start, then there is no change. Otherwise, we should update the
list OL. Before the shifting we store the AOR(τ), the S(τ) as well as the AUS of all the domains.
Finally we explore the consequences in the AUS, by aligning the invocation of the actor with the
invocations, S(j), of the actors j ∈ OL(τ). Upon an increase in the AOR we check the impact on the
AUS of all domains. A shifting is accepted if there is no negative impact on the AUS.
5.2.3 Clustering
In the above subsections, we described how, given a data flow graph and its mapping to the underlying
architecture, we can schedule the actors and shift them to increase the AUS of a domain. Based on
this new schedule, we can now consider the sets B(i) as constants and proceed with clustering the
interleaved execution intervals of actors.
We will refer to the new clustered actors as super nodes. Each new actor will inherit the timespan of its segment as its WCEC, as well as a switching activity equal to the sum of the switching activities of the actors active in the segment.
To formalize the above process of clustering the interleavings of actors based on the set SG(i), we have, for the worst case execution cycles of the new node:
∀sgj ∈ SG(i) | sgj.st = active : T′ ← T′ ∪ {snj},    (5.14)
WCEC(snj) = sgj.end − sgj.start
The WCEC of the new super node snj depends only on the time span of the corresponding active segment. Since the boundary points are defined over invocations or completions of actors, all actors active in the segment sgj contribute equally to the switching activity of the new node. Thus, for the
switching activity of the new node we have:
α(snj) = Σ_τ α(τ),  τ ∈ BI⁻¹(i) | S(τ) = sgj.start ∨ S(τ) + WCEC(τ) = sgj.end    (5.15)
Apart from the execution time and the switching activity of the new node, we should also define the data dependencies with nodes mapped on other VFDs, i.e., we also have to define the set E′. The data dependencies of the new super nodes will be inherited from the actors whose invocations and completions defined the active segment. Thus, the incoming edges will be inherited from the actors whose invocations are equal to the start of the segment. In the same way, the outgoing edges will be inherited from the actors whose completions are equal to the segment's end. Having fixed the data
Actor  OL             AOR  AMW
1      2, 3           0    [2, 2]
2      1, 3, 4, 6, 7  1    [4, 4]
3      1, 2, 4        1    [4, 4]
4      3, 2, 6, 7     1    [6, 6]
5      6, 7, 8        1    [9, 9]
6      4, 7, 5        1    [8, 8]
7      2, 6, 5, 8     1    [8, 8]
8      5, 7           0    [10, 10]

Table 5.2: The OLs, AORs and AMWs of the actors from the HSDF of figure 5.13a
dependencies, we add edges between subsequent nodes to define the execution ordering.
e(sni, snj) ∈ E′ ⇐⇒ ∃τ, z ∈ T | e(τ, z) ∈ E∗ ∧ (S(τ) + WCEC(τ) = sgi.end) ∧ (S(z) = sgj.start)    (5.16)

Now, a VFD can be considered as a PE and in this way we relax our problem to the one of energy
efficient scheduling of data flow graphs in individually managed PEs.
We will illustrate the process of clustering using the HSDF of figure 5.13. Based on the information
on the WCEC of the actors and the binding information on PEs, we derive the ALAPs based schedule
shown in figure 5.14a and we proceed with the segmentation of the frame. First we derive the domain
specific boundary list B(i) as: B(i) = {0, 2, 4, 4, 6, 6, 8, 8, 9, 9, 10, 10, 12}. From the set B(i) we can remove the duplicate entries and derive the segments as explained earlier. The set SG(i)
will be then: SG(i) = {[0, 2], [2, 4], [4, 6], [6, 8], [8, 9], [9, 10], [10, 12]}. The segments are indicated in
figure 5.14b with the green dashed lines. We should mention here that before the segmentation of
the frame and after the ALAPs based scheduling, one could first go through the shifting procedure in
order to increase the AUS of the domain. From the ALAPs based schedule however, we get the table
5.2 with the OL, AOR and AMW for each actor. We note also that the MUS for this domain, given by equation (5.10), is equal to 4, while the AUS, given by equation (5.11), is equal to 2. The last step is to cluster the interleaved execution intervals and form the super nodes. The clustered graph Gs(T′, E′) is shown in figure 5.15.
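The segmentation and clustering steps can be sketched as follows. The start/completion times below are not given explicitly in the text; they are reconstructed to be consistent with table 5.2 and with the boundary list B(i) above, and the frame bounds (0, 12) are added explicitly.

```python
# Sketch: boundary list B(i) -> segments SG(i) -> active actors per segment.
# `schedule` maps actor -> (start, completion) for one domain.

def segments(schedule, frame=(0, 12)):
    bounds = sorted({t for s, e in schedule.values() for t in (s, e)} | set(frame))
    return list(zip(bounds, bounds[1:]))              # duplicates removed by the set

def clusters(schedule, frame=(0, 12)):
    """Each segment with the actors executing inside it (a super node)."""
    return [(seg, sorted(a for a, (s, e) in schedule.items()
                         if s < seg[1] and e > seg[0]))
            for seg in segments(schedule, frame)]

# Times reconstructed from the ALAPs schedule of figure 5.14a:
sched = {1: (2, 4), 2: (4, 8), 7: (8, 10),            # PE1
         3: (4, 6), 4: (6, 8), 6: (8, 9),
         5: (9, 10), 8: (10, 12)}                     # PE2

print(segments(sched))
# [(0, 2), (2, 4), (4, 6), (6, 8), (8, 9), (9, 10), (10, 12)]
print([actors for _, actors in clusters(sched) if actors])
# [[1], [2, 3], [2, 4], [6, 7], [5, 7], [8]] -> the super nodes of figure 5.15
```

The initial segment [0, 2] contains no active actor (it is slack at the start of the frame) and therefore produces no super node.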
However, before adopting any of the existing methods for energy efficient scheduling, we should impose one more constraint on our system, which originates from the clustering and from the fact that we do not allow the actor's execution to be suspended. Due to clustering and inter-domain communications, it is possible that, after frequency scaling, the invocation of a super node will move in time. This actually means that the execution of all actors forming this super node will be moved in time, which might cause the suspension of execution of one or multiple actors that also participate in the clustering of the super node. In order to tackle this problem we should add one more constraint targeting super nodes sharing common actors. These super nodes are mapped on the same PE (abstraction of a VFD) and there exists an edge between them, i.e., their executions are sequential in
Actor  WCEC  PE
1      2     1
2      4     1
3      2     2
4      2     2
5      1     2
6      1     2
7      2     1
8      2     2

Figure 5.13: (a) sample HSDFG with actors 1–8 and (b) WCEC and binding information
time:
∀sni, snj ∈ T′ :  S(sni) + WCEC(sni) = S(snj)  ⇐⇒    (5.17)
BI(sni) = BI(snj) ∧ e(sni, snj) ∈ E′ ∧
∃τ ∈ T | BI(τ) = BI(sni) ∧ S(τ) ≤ sgi.start ∧ S(τ) + WCEC(τ) ≥ sgj.end
The constraint (5.17) will be taken into account when applying DVFS on the new graph G′(T′, E′). In the above, we used for the super nodes the mapping function BI that we defined for actors. The rationale behind using the same function is that a super node is actually a set of actors clustered together. Since all these actors are mapped on the same VFD, the mapping function can also return the VFD for the super node.
5.2.4 DVFS Scheduling
If we allow the actor’s execution to be suspended due to reasons described earlier, then the constraint
(5.17) need not be taken into account. Then all heuristics for energy efficient scheduling of data-flow
graphs in multiprocessor platforms can be applied to the clustered graph G’(T ′,E’).
Since energy efficient scheduling in multiprocessor systems is NP-hard [46], even for the general case of periodic independent tasks [28], we will concentrate on heuristics that allocate a unit of slack to actors in order to minimize the energy consumption. From the available approaches, PathDVS, found in [18], provides results comparable to the LPDVS algorithm presented in [47] and is extended to take into account the communication costs.
The basic notion behind the PathDVS scheduling heuristic is to find actors from the graph that can share a unit of slack. By sharing we mean that, by decreasing the operating frequency and extending
Figure 5.14: The clustering procedure: (a) ALAPs schedule of the graph of figure 5.13, (b) segmentation, (c) clustering (time axis 0–12, PE1 and PE2)

Figure 5.15: The clustered Gs(T′, E′) from the HSDF in figure 5.13 (super nodes 1, 23, 24, 76, 75 and 8)
the execution of two or more actors, the total time span of the schedule increases only by a unit of
slack. It is evident that there is no dependency (resource or data) between actors that can share a
unit of slack and for this reason the authors refer to these actors as compatible actors. Each actor has
a different amount of slack by which its execution can be extended and this is due to precedence and
resource constraints. A mapping that places unrelated actors on the same path, is expected to lead
to a reduction in the available slack of actors, when compared to a more efficient mapping. However,
the mapping heuristic is out of the scope of this work and at this phase (we operate on the G’ graph)
it has already been done. The available slack for each actor is the difference between the earliest and
latest possible starting times. These two values can be calculated through equations (3.10) and (3.4)
when considering the new graph G’.
As defined in section 2.1, p(τ, z) denotes the path in the graph from actor τ to actor z. These two actors are compatible if and only if they belong to different paths:
∀τ, z ∈ T ′ |m(τ, z) = 1 ⇐⇒ p(τ, z) = ∅ (5.18)
m(τ, z) is a boolean denoting the compatibility of actors τ and z. In this way, a compatibility list
can be created for each actor, containing all compatible actors from the graph. Consequently, we can
form the compatibility list for an actor τ as:
compatibility list(τ) = {z|m(τ, z) = 1}, τ, z ∈ T ′ (5.19)
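One way to realize definitions (5.18) and (5.19) is via reachability: two actors are compatible when neither can reach the other through precedence edges. This is a sketch under that assumption, not the PathDVS implementation, and the diamond graph used is hypothetical.

```python
# Sketch: compatibility via reachability. `succ` maps actor -> successors.

def reachable(src, succ):
    seen, stack = set(), [src]
    while stack:
        for z in succ[stack.pop()]:
            if z not in seen:
                seen.add(z)
                stack.append(z)
    return seen

def compatibility_list(tau, succ):
    """Actors z with p(tau, z) = p(z, tau) = empty, i.e. m(tau, z) = 1."""
    down = reachable(tau, succ)
    return sorted(z for z in succ
                  if z != tau and z not in down and tau not in reachable(z, succ))

# Hypothetical diamond graph 1 -> {2, 3} -> 4:
succ = {1: [2, 3], 2: [4], 3: [4], 4: []}
print(compatibility_list(2, succ))   # [3]  (2 and 3 lie on different paths)
print(compatibility_list(1, succ))   # []   (1 precedes every other actor)
```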
In order to minimize the energy consumption, the compatibility list is used to find actors that, by sharing a unit of slack, maximize the sum of energy reductions. The approach used to find the optimal
solution, is a branch and bound search over the solution space. From all compatible solutions the
branch and bound method determines the actor or the combination of actors which lead to maximum
energy reduction, after allocating a unit of slack.
The energy reduction of an actor is defined as the difference of energy consumption before and after
the slack allocation. Apart from the compatibility list, the authors associate with each actor an explorable list containing the actors that should be searched as child nodes in the solution space tree. The explorable list contains all available actors, corresponding to the intersection of the compatibility lists from the root to the particular node. The explorable list of a node without parents is thus equal
to its compatibility list. To effectively search the tree, the authors propose the Depth First Search
approach and by maintaining a lower bound on the energy reduction over the traversed paths, they
eliminate paths where a better solution cannot be found. At each node a cost function is calculated
and compared with other possible solutions. This cost function, is the sum of energy reduction, of all
nodes until the node under consideration plus the sum of energy reduction of all actors in the node’s
explorable list. At each point this cost is compared with a lower bound; if the cost is greater, the lower bound is updated. An initial value for the lower bound can be found as:
lower bound = Σ_{τ ∈ T′} energy reduc(τ) / (|T′| + |compatibility list(τ)| + 1)    (5.20)
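In the spirit of the search described above, a depth-first branch and bound over compatible actor sets can be sketched as follows. This is a simplified illustration with hypothetical energy reductions, not the authors' implementation; the pruning bound used is the current gain plus the gain of everything still explorable.

```python
# Sketch: DFS branch and bound for the compatible set of actors that
# maximizes the summed energy reduction for one unit of slack.
# `compat` maps actor -> set of compatible actors (its compatibility list).

def best_compatible_set(energy_reduc, compat):
    actors = sorted(energy_reduc)                     # fixed exploration order
    best = {"gain": 0.0, "set": []}

    def dfs(chosen, explorable, gain):
        if gain + sum(energy_reduc[z] for z in explorable) <= best["gain"]:
            return                                    # no better solution down this path
        if gain > best["gain"]:
            best["gain"], best["set"] = gain, chosen[:]
        for i, z in enumerate(explorable):
            dfs(chosen + [z],                         # child's explorable list is the
                [w for w in explorable[i + 1:] if w in compat[z]],  # intersection
                gain + energy_reduc[z])

    dfs([], actors, 0.0)
    return best["set"], best["gain"]

# Hypothetical diamond graph 1 -> {2, 3} -> 4: only 2 and 3 are compatible.
energy = {1: 3.0, 2: 2.0, 3: 1.5, 4: 0.5}
compat = {1: set(), 2: {3}, 3: {2}, 4: set()}
print(best_compatible_set(energy, compat))   # ([2, 3], 3.5)
```

Here actors 2 and 3 together beat the single best actor 1 (gain 3.0), which is exactly the situation the combination search is meant to catch.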
To reduce the search space, the authors propose the identification of fully dependent, fully independent and compressible actors. A fully independent actor is one that is present in all assignment paths; its compatibility list contains all |T′| − 1 other actors, and such actors are allocated a unit of slack irrespective of the energy reduction. The energy reduction from allocating a unit of slack to a fully dependent actor is compared with those of the other candidates, and such an actor is not included in the search. Last but not least, compressible actors are those that share the same compatibility list; these actors are clustered together and represented by the actor with the highest energy reduction. In this way, there is a substantial reduction in the runtime requirement.
5.2.5 Extension of PathDVS
PathDVS could be directly applied on the clustered graph G′ if suspension of the actors' execution were allowed. To illustrate why we do not allow the actors' execution to be suspended, we will give a brief but intuitive example.
Suppose that a VFD contains 16 PEs and that a super node snj in the graph G′ was formed by clustering parts of the executions (equal to the segment's time span) of 16 actors. After applying PathDVS, it is possible that the execution of one of these 16 actors, and consequently the invocation of snj, will move in time by a unit of slack, according to precedence constraints. If the previously invoked super node sni shares at least one actor with snj, then this shifting of snj's invocation will cause at least one PE to be idle.
To avoid this behavior, PathDVS should be extended to check whether the constraint (5.17) is satisfied before the allocation of a unit of slack. In this way both the energy reduction and the constraint (5.17) are taken into consideration when the candidate path in the solution tree is decided.
5.3 Conclusion
We demonstrated in this chapter the difficulties that arise when the problem of the energy efficient scheduling of dataflow graphs on many-core platforms is considered. Existing heuristics for energy minimization, targeted at platforms where each PE can operate at its own voltage and frequency, are expected to provide only minor gains. Moreover, the different switching activities among actors should be taken into account
when applying such techniques. Starting from a fixed binding of actors to PEs and VFDs, we continue
by scheduling the actors with the ALAPs based list scheduling technique. After examining the AUS
and the MUS of each domain, we proceed by an iterative shifting of actors within their AMW so as to
increase the AUS and consequently the potential energy reduction. Fixing the invocations of actors,
we demonstrated a clustering approach that creates super nodes with inherited switching activities
and precedence constraints. Now we are able to abstract and consider each VFD as an individually
managed PE. Finally, we can apply any of the available heuristics, that tackle the energy efficient
scheduling of precedence constrained graphs on individually managed PEs.
6
Future Work
Due to time constraints, this project focused only on the simple SDF MoC. However, we managed to bring up most of the difficulties that arise from the incorporation of state of the art computing platforms such as the P2012, and to formalize the optimization problem of energy efficient scheduling. There are a number of ways, though, in which this work can be extended in order to provide a more complete and suitable energy efficient algorithm for applications running on such platforms. The list below names just a few:
• Validate the results of the proposed approach incorporating various heuristics
• Extend the dataflow MoC considered to other more expressive ones
• Consider the characteristics of other many core platforms or interconnection infrastructures
• Extend the work to multi-criteria scheduling heuristics
6.1 Validation of the proposal
After formalizing all the constraints for the optimization problem at hand, the next step would be to
provide results on the energy efficiency of the proposed approach. Because the clustering and energy
efficient scheduling are completely decoupled, one could proceed by incorporating different heuristics
available for energy minimization and compare them in terms of efficiency and complexity. Since the
binding of actors to PEs and VFDs also affects the outcome, one could also experiment with different
binding heuristics.
6.2 Extension of the MoC
This work was only focused on the SDF MoC and more specifically to the derived HSDF. However,
other more expressive MoCs such as the PSDF and HDF are better suited for describing DSP appli-
cations. The characteristics of these MoCs and how these affect the energy efficient scheduling should
then be studied in depth. The first step towards the expansion of the approach would be to incorporate
actors with time varying production and/or consumption rates. Of course such a situation would affect
the repetition vector as well as the communication cost. In this work both the repetition vector and
the production/consumption rates were considered constant. Another extension would be to study
the possibility of different modes of operation per actor. Such modes of operation greatly affect the
resource requirements and consequently the energy dissipation. A MoC extending the SDF towards
this scenario based execution is the SADF MoC [41].
6.3 Extension to other platforms
Although P2012 is a state of the art computing platform, this work could be extended to other
platforms and interconnection infrastructures. Communication infrastructures such as the Nostrum
NoC [34] or the Æthereal NoC [15] could also be studied. Different clocking schemes that are tailored
for power and latency reduction could be taken into account and compared against the asynchronous
operation of the ALPIN NoC. Such a scheme, for 2D mesh NoCs, is the Globally Pseudochronous
Locally Synchronous approach where a clock with a constant phase difference is being distributed to
the NoC routers [35]. Incorporation of different communication interconnects will most probably affect
the communication cost between actors, as well as the energy dissipation for communication, which in this work was considered negligible.
6.4 Extend to multi-criteria scheduling heuristics
The heuristic can be extended to accommodate more than one optimization objective. Apart from the energy consumption, such objectives can be the schedule makespan, the reliability of the system [13], the buffer requirements and the software/hardware implementation cost [48]. This can be done based on the Pareto point algebra [12]. When there are two or more objectives that need to be optimized, a multidimensional solution space is created. From the possible solutions, the dominant ones are found and form the so-called Pareto front. Points on this front dominate any solution outside the Pareto front. After creating the Pareto front, a solution can be chosen according to additional criteria.
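The dominance test and the Pareto front construction can be sketched as follows, with two hypothetical objectives to minimize (for instance energy and makespan):

```python
# Sketch: Pareto front of solutions under two minimization objectives.

def pareto_front(points):
    def dominates(a, b):              # a is at least as good everywhere, better somewhere
        return all(x <= y for x, y in zip(a, b)) and a != b
    return sorted(p for p in points
                  if not any(dominates(q, p) for q in points))

# Hypothetical (energy, makespan) pairs:
solutions = [(10, 5), (8, 7), (12, 4), (9, 6), (11, 6)]
print(pareto_front(solutions))   # [(8, 7), (9, 6), (10, 5), (12, 4)]
```

Here (11, 6) is dropped because (9, 6) is at least as good in both objectives; all remaining points are mutually non-dominated.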
A
Pseudo Algorithms
Algorithm 1 ALAP scheduling
  get BI⁻¹(i)
  sort BI⁻¹(i) in increasing order of ALAPs times
  for j = 1 → |BI⁻¹(i)| do
    S(τ) ← ALAPs(τ)
  end for
Algorithm 2 Candidate domains for shifting
  for i = 1 → |I| do
    if MUS(i) ≠ AUS(i) then
      Cd ← Cd ∪ {i}
    end if
  end for
Algorithm 3 AMW calculation
  for j = 1 → |Cd| do
    for τ = 1 → |BI⁻¹(j)| do
      A(τ) ← min_{i ∈ BI⁻¹(j)} (S(i))
    end for
    for τ = 1 → |BI⁻¹(j)| do
      AMW(τ).start ← ASAPs(τ)
      AMW(τ).end ← S(τ)
      if AMW(τ).start ≠ AMW(τ).end then
        Cτ ← Cτ ∪ {τ}
        for i = 1 → |BI⁻¹(j)| do
          if S(i) ∈ [AMW(τ).start, AMW(τ).end] ∨ S(i) + WCEC(i) ∈ [AMW(τ).start, AMW(τ).end] then
            OL(τ) ← OL(τ) ∪ {i}
          end if
        end for
        sort OL(τ) in increasing order of S
      else
        A(τ) ← S(τ)
      end if
    end for
  end for
  sort Cτ in increasing order of S
Algorithm 4 Shifting
  for τ = 1 → |Cτ| do
    amwold ← AMW(τ).start
    if ASAPs(τ) ≠ amwold then
      for i = 1 → |BI⁻¹(j)| do
        if S(i) ∈ [ASAPs(τ), AMW(τ).end] ∨ S(i) + WCEC(i) ∈ [ASAPs(τ), AMW(τ).end] then
          OL(τ) ← OL(τ) ∪ {i}
        end if
      end for
    end if
    aorold ← AOR(τ)
    sold ← S(τ)
    for i = 1 → |I| do
      ausold(i) ← AUS(i)
    end for
    for j = |OL(τ)| → 1 do
      S(τ) ← max(sold, S(j))
      if AOR(τ) ≥ aorold then
        for i = 1 → |I| do
          if AUS(i) < ausold(i) then
            S(τ) ← sold
            break
          else
            sold ← S(τ)
            aorold ← AOR(τ)
          end if
        end for
      end if
    end for
  end for
Bibliography
[1] Platform 2012: A many-core programmable accelerator for ultra-efficient embedded computing in nanometer technology. 2012.
[2] M. Anis, S. Areibi, and M. Elmasry. Design and optimization of multithreshold CMOS (MTC-
MOS) circuits. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions
on, 22(10):1324–1342, October 2003.
[3] H. Aydin and Q. Yang. Energy-aware partitioning for multiprocessor real-time systems. In
Parallel and Distributed Processing Symposium, 2003. Proceedings. International, pages 9–pp.
IEEE, 2003.
[4] M Bariani, P Lambruschini, and M Raggio. VC-1 decoder on STMicroelectronics P2012 archi-
tecture. stday2010.uniud.it, pages 1–3.
[5] E Beigne, F Clermidy, S Miermont, and P Vivet. Dynamic voltage and frequency scaling ar-
chitecture for units integration within a GALS NoC. In Networks-on-Chip, 2008. NoCS 2008.
Second ACM/IEEE International Symposium on, pages 129–138. IEEE, 2008.
[6] E. Beigne, Fabien Clermidy, Hélène Lhermet, Sylvain Miermont, Yvain Thonnart, X.T. Tran,
Alexandre Valentian, Didier Varreau, Pascal Vivet, Xavier Popon, and Others. An asynchronous
power aware and adaptive NoC based circuit. Solid-State Circuits, IEEE Journal of, 44(4):1167–
1177, April 2009.
[7] B Bhattacharya and S.S. Bhattacharyya. Parameterized dataflow modeling for DSP systems.
Signal Processing, IEEE Transactions on, 49(10):2408–2421, October 2001.
[8] J.T. Buck and E.A. Lee. Scheduling dynamic dataflow graphs with bounded memory using the token flow model. In ICASSP, pages 429–432. IEEE, 1993.
[9] J.A. Butts and G.S. Sohi. A static power model for architects. In Microarchitecture, 2000.
MICRO-33. Proceedings. 33rd Annual IEEE/ACM International Symposium on, pages 191–201.
IEEE, 2000.
[10] P. De Langen and Ben Juurlink. Trade-offs between voltage scaling and processor shutdown for
low-energy embedded multiprocessors. Embedded Computer Systems: Architectures, Modeling,
and Simulation, pages 75–85, 2007.
[11] MR Garey. Computers and Intractability: A Guide to the Theory of NP-completeness. 1979.
[12] Marc Geilen and Twan Basten. A calculator for Pareto points. In Proceedings of the conference
on Design, automation and test in Europe, volume 2, pages 285–290. EDA Consortium, April
2007.
[13] A. Girault and H. Kalla. A novel bicriteria scheduling heuristics providing a guaranteed global
system failure rate. IEEE Transactions on Dependable and Secure Computing, 6(4):241–254,
October 2009.
[14] A Girault, B. Lee, and E.A. Lee. Hierarchical finite state machines with multiple concurrency
models. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on,
18(6):742–760, June 1999.
[15] K. Goossens, J. Dielissen, and A. Radulescu. Æthereal network on chip: concepts, architectures,
and implementations. Design & Test of Computers, IEEE, 22(5):414–421, May 2005.
[16] Philippe Grosse, Yves Durand, and Paul Feautrier. Power modeling of a NoC based design for
high speed telecommunication systems. Integrated Circuit and System Design. Power and Timing
Modeling, Optimization and Simulation, pages 157–168, 2006.
[17] Philippe Grosse, Yves Durand, and Paul Feautrier. Methods for power optimization in SOC-based
data flow systems. ACM Transactions on Design Automation of Electronic Systems (TODAES),
14(3):38, June 2009.
[18] Jaeyeon Kang and Sanjay Ranka. Energy-efficient dynamic scheduling on parallel machines. High
Performance Computing-HiPC 2008, pages 208–219, 2008.
[19] H. Kawaguchi, K.I. Nose, and T. Sakurai. A CMOS scheme for 0.5 V supply voltage with pico-
ampere standby current. In Solid-State Circuits Conference, 1998. Digest of Technical Papers.
1998 IEEE International, pages 192–193. IEEE, 1998.
[20] A.A. Khan, C.L. McCreary, and M.S. Jones. A comparison of multiprocessor scheduling heuristics.
1994 International Conference on Parallel Processing (ICPP’94), pages 243–250, August 1994.
[21] Hong-Sik Kim, Hyejeong Hong, H.S. Kim, J.H. Ahn, and Sungho Kang. Total energy minimization
of real-time tasks in an on-chip multiprocessor using dynamic voltage scaling efficiency metric.
Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 27(11):2088–
2092, November 2008.
[22] Fanxin Kong, Wang Yi, and Qingxu Deng. Energy-efficient scheduling of real-time tasks on
cluster-based multicores. In Design, Automation & Test in Europe Conference & Exhibition
(DATE), 2011, pages 1–6. IEEE, 2011.
[23] Y.K. Kwok and Ishfaq Ahmad. Dynamic critical-path scheduling: An effective technique for
allocating task graphs to multiprocessors. Parallel and Distributed Systems, IEEE Transactions
on, 7(5):506–521, 1996.
[24] Didier Lattard, E. Beigne, Fabien Clermidy, Yves Durand, Romain Lemaire, Pascal Vivet, and
Friedbert Berens. A reconfigurable baseband platform based on an asynchronous network-on-chip.
Solid-State Circuits, IEEE Journal of, 43(1):223–235, January 2008.
[25] E.A. Lee and S. Ha. Scheduling strategies for multiprocessor real-time DSP. In Global Telecom-
munications Conference, 1989, and Exhibition. Communications Technology for the 1990s and
Beyond. GLOBECOM’89., IEEE, pages 1279–1283. IEEE, 1989.
[26] E.A. Lee and D.G. Messerschmitt. Static scheduling of synchronous data flow programs for digital
signal processing. Computers, IEEE Transactions on, 100(1):24–35, January 1987.
[27] E.A. Lee and D.G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235–
1245, 1987.
[28] W.Y. Lee. Energy-Saving DVFS Scheduling of Multiple Periodic Real-Time Tasks on Multi-core
Processors. In Proceedings of the 2009 13th IEEE/ACM International Symposium on Distributed
Simulation and Real Time Applications, pages 216–223. IEEE Computer Society, 2009.
[29] J.J. Lin. Energy-Efficient Scheduling of Real-Time Periodic Tasks in Multicore Systems. In
Network and Parallel Computing: IFIP International Conference, NPC 2010, Zhengzhou, China,
September 13-15, 2010, Proceedings, volume 6289, page 344. Springer-Verlag New York Inc, 2010.
[30] Jiong Luo and N.K. Jha. Power-efficient scheduling for heterogeneous distributed real-time em-
bedded systems. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions
on, 26(6):1161–1170, 2007.
[31] Jiong Luo, N.K. Jha, and L.S. Peh. Simultaneous dynamic voltage scaling of processors and
communication links in real-time distributed embedded systems. Very Large Scale Integration
(VLSI) Systems, IEEE Transactions on, 15(4):427–437, April 2007.
[32] S.M. Martin, K. Flautner, T. Mudge, and D. Blaauw. Combined dynamic voltage scaling and
adaptive body biasing for lower power microprocessors under dynamic workloads. In Proceedings
of the 2002 IEEE/ACM international conference on Computer-aided design, pages 721–725. ACM,
2002.
[33] Sylvain Miermont, Pascal Vivet, and Marc Renaudin. A power supply selector for energy- and
area-efficient local dynamic voltage scaling. In Integrated Circuit and System Design. Power
and Timing Modeling, Optimization and Simulation, pages 556–565, Berlin, Heidelberg, 2007.
Springer.
[34] M. Millberg, E. Nilsson, R. Thid, S. Kumar, and A. Jantsch. The Nostrum backbone – a
communication protocol stack for networks on chip. In VLSI Design, 2004. Proceedings. 17th
International Conference on, pages 693–696. IEEE, 2004.
[35] Erland Nilsson and J. Oberg. Reducing power and latency in 2-D mesh NoCs using globally
pseudochronous locally synchronous clocking. In Proceedings of the 2nd IEEE/ACM/IFIP in-
ternational conference on Hardware/software codesign and system synthesis, pages 176–181, New
York, New York, USA, 2004. ACM.
[36] Ptolemy and N Copernicus. The Almagest; On the Revolutions of the Heavenly Spheres; and
Epitome of Copernican Astronomy: IV and V. The Classical Review, 34(02):299, October 1952.
[37] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand. Leakage current mechanisms and leakage
reduction techniques in deep-submicrometer CMOS circuits. Proceedings of the IEEE, 91(2):305–
327, February 2003.
[38] G.C. Sih and E.A. Lee. A compile-time scheduling heuristic for interconnection-constrained
heterogeneous processor architectures. Parallel and Distributed Systems, IEEE Transactions on,
4(2):175–187, 1993.
[39] S. Sriram and E.A. Lee. Determining the order of processor transactions in statically scheduled
multiprocessors. The Journal of VLSI Signal Processing, 15(3):207–220, 1997.
[40] Sundararajan Sriram and S.S. Bhattacharyya. Embedded multiprocessors: Scheduling and syn-
chronization. CRC, 2000.
[41] B.D. Theelen, M.C.W. Geilen, S. Stuijk, S.V. Gheorghita, T. Basten, J.P.M. Voeten, and
A.H. Ghamarian. Scenario-aware dataflow. Technical report, Eindhoven University of Technology,
July 2008.
[42] Yvain Thonnart, Pascal Vivet, and F. Clermidy. A fully-asynchronous low-power framework for
GALS NoC integration. In Proceedings of the Conference on Design, Automation and Test in
Europe, pages 33–38. IEEE, 2010.
[43] Girish Varatkar and R. Marculescu. Communication-aware task scheduling and voltage selection
for total systems energy minimization. In Proceedings of the 2003 IEEE/ACM international
conference on Computer-aided design, page 510. IEEE Computer Society, 2003.
[44] Lizhe Wang, Jie Tao, Gregor von Laszewski, and Dan Chen. Power Aware Scheduling for Par-
allel Tasks via Task Clustering. In 2010 IEEE 16th International Conference on Parallel and
Distributed Systems, pages 629–634. IEEE, December 2010.
[45] L. Yan, J. Luo, and N.K. Jha. Joint dynamic voltage scaling and adaptive body biasing for
heterogeneous distributed real-time embedded systems. Computer-Aided Design of Integrated
Circuits and Systems, IEEE Transactions on, 24(7):1030–1041, July 2005.
[46] C.Y. Yang, J.J. Chen, and T.W. Kuo. An approximation algorithm for energy-efficient scheduling
on a chip multiprocessor. Design, Automation and Test in Europe, pages 468–473, 2005.
[47] Yumin Zhang and X.S. Hu. Task scheduling and voltage selection for energy minimization. In
Proceedings of the 39th Annual Design Automation Conference, page 183. ACM, 2002.
[48] Jun Zhu, Ingo Sander, and A. Jantsch. Pareto efficient design for reconfigurable streaming appli-
cations on CPU/FPGAs. In Proceedings of the Conference on Design, Automation and Test in
Europe, pages 1035–1040, 2010.