Power consumption optimization of dataflow applications on many-core systems
kth.diva-portal.org/smash/get/diva2:447170/FULLTEXT01.pdf

Royal Institute of Technology

Power consumption optimization of dataflow applications on many-core systems

Emmanouil Komninos
komninos(@)kth.se

August 21, 2011

A master's thesis project

conducted at

Examiner:

Ingo Sander

Supervisors:

Alain Girault

Pascal Fradet

TRITA-ICT-EX-2011:192


Abstract

With the growing need for high-bandwidth digital communications and streaming applications requiring high-quality video and audio encoding, the transition to platforms consisting of hundreds of processors and efficient communication infrastructures is inevitable. DSP applications targeted to such highly parallel platforms are best described by concurrent MoCs, which enable the mapping and scheduling process to such architectures.

Such platforms target embedded devices, which operate under a very constrained energy budget. This project concerns the energy-efficient scheduling of DSP applications, described under the dataflow MoC, on many-core platforms. The target platform is the P2012, designed by STMicroelectronics, consisting of 16 nodes interconnected through a 2D-mesh asynchronous NoC. Each node can operate at a different voltage and frequency and can accommodate up to 16 processors. The dataflow MoC considered to describe the aforementioned applications is SDF.

The main contribution of this project is the formal description of the energy minimization problem when such platforms are considered. We demonstrate the difficulties that arise from these architectures and the insufficiency of the existing energy-efficient scheduling approaches, and we propose a way to relax this very complex problem so that existing approaches can be applied.



Acknowledgements

I would like to give special thanks to...

Alain Girault and Pascal Fradet, researchers at INRIA and my Supervisors

for sharing their wisdom on dataflow programming and scheduling

Ingo Sander, Professor at KTH and my Examiner

for being always accessible and giving me regular feedback on my work

Petro Poplavko, post-doc at INRIA

for the long talks that gave clear perspective

Thomas Martin Gawlitza, post-doc at INRIA

for introducing me to complexity theory

INRIA,

for providing a welcoming working environment and the large amounts of coffee required for this project

My family,

for always supporting my decisions and allowing me to accomplish my goals

...As well as everyone else who listened to my questions and took the time to help me along the way to completing this project


List of Abbreviations

Actor Mobility Window AMW

Actor Overlapping Ratio AOR

Adaptive Body Biasing ABB

Available Usable Slack AUS

Digital Signal Processing DSP

Digital to Analogue Converter DAC

Dynamic Voltage Frequency Scaling DVFS

Dynamic Voltage Scaling DVS

Earliest Deadline First EDF

First In First Out FIFO

Forward Body Biasing FBB

Giga Operations Per Second GOPS

Homogeneous Data Flow Graph HDFG

Instruction Set Simulator ISS

Integer Linear Programming ILP

Local Power Management LPM

Locally Adaptive Voltage and Frequency Scaling LAVFS

Low Voltage Transistor LVT

Maximum Usable Slack MUS

Model of Computation MoC

Multi Carrier-Code Division Multiple Access MC-CDMA

Multi-Processor System on Chip MPSoC


Multiple Input Multiple Output MIMO

Multiple Threshold CMOS MTCMOS

Network on Chip NoC

Nondeterministic Polynomial NP

Orthogonal Frequency-Division Multiplexing OFDM

Power Shut down PS

Power Supply Unit PSU

Processing Element PE

Reverse Body Biasing RBB

Scenario Aware Data Flow SADF

Super Cut off CMOS SCCMOS

Synchronous Data Flow Graph SDFG

System on Chip SoC

Ultra Cut Off UCO

Voltage Frequency Domain VFD

Voltage Frequency Scaling VFS

Worst Case Execution Cycles WCEC

Worst Case Execution Time WCET

Worst Fit Decreasing WFD


List of Figures

1.1 VC-1 decoder's algorithm schematic [4]
1.2 4More functional diagram [24]
1.3 MIMO OFDM mapping [24]
2.1 (a) Consistent SDFG, (b) its topology matrix
2.2 (a) Inconsistent SDFG, (b) its topology matrix
2.3 Iteration of SDF 2.1a
2.4 Expansion of an edge in an SDFG
2.5 HDFG equivalent of 2.1a
3.1 Trade-off of generality against run-time overhead and implementation complexity
3.2 (a) Task set, (b) task-core mapping
3.3 EDF scheduling, f = 1
3.4 Simple VS, f = 7/12
3.5 Move slack backwards, f = 7/12
3.6 Evolution of segment 1
3.7 After migration, sg0.f = 31/72, sg1.f, sg2.f, sg3.f = 7/12
4.1 (a) HDFG, (b) mapping and WCEC of each actor
4.2 Individually managed PEs, (a) f = 1 for all actors, (b) frequency scaling to f = 1/3 for actor 3, (c) case for clustered PEs
4.3 The P2012 fabric
4.4 NoC unit architecture - VFD
4.5 Power supply unit
4.6 Dithering principle
5.1 From Gs to G∗s (ALAPs(3) < ALAPs(4))
5.2 Adding WCET to edges
5.3 (a) Sample HDFG, (b) the corresponding binding
5.4 The schedule of the HSDFG of fig. 5.3a at nominal frequency
5.5 (a) DVFS in segment [5.25, 7.525] on VFD1, (b) DVFS in segment [3, 4] on VFD2
5.6 (a) Sample HDFG, (b) the corresponding binding
5.7 The schedule of the HSDFG of fig. 5.6a at nominal frequency
5.8 Scaling the frequency to fmax/4 on the invocation of actor 3 and to fmax on the invocation of actor 4
5.9 (a) Sample HDFG, (b) the corresponding binding
5.10 The schedule of the HSDFG of fig. 5.9a at nominal frequency
5.11 The schedule of the HSDFG of fig. 5.9a at nominal frequency; the limits of the AMW for actors 4 and 6 are noted with dashed green lines
5.12 ALAPs-based schedule of the HSDFG of fig. 5.9a at nominal frequency
5.13 (a) Sample HSDFG, (b) WCEC and binding information
5.14 The clustering procedure
5.15 The clustered Gs(T′, E′) from the HSDF in figure 5.13


List of Tables

3.1 Overview of the assumptions in the related work
5.1 The OLs and AORs of actors from the HSDF of figure 5.9a
5.2 The OLs and AORs of actors from the HSDF of figure 5.13a


Contents

List of Abbreviations

1 Introduction
  1.1 Background
  1.2 Problem formulation
  1.3 Contributions

2 Data Flow Graphs
  2.1 Synchronous Data Flow Graphs
      Notation
      Consistency of SDFGs
  2.2 Constructing an Equivalent HDFG

3 Scheduling
  3.1 Scheduling Taxonomy
    3.1.1 Fully Dynamic and Fully Static
    3.1.2 Self-timed and Static assignment
    3.1.3 Quasi-static and Ordered-transactions
    3.1.4 Complexity
  3.2 Notation
    3.2.1 Elaboration on Execution Time
  3.3 Ordering Assignment
      As Late As Possible (ALAP) times
      As Soon As Possible (ASAP) times
  3.4 Scheduling Heuristics
    3.4.1 List Scheduling
    3.4.2 Low power scheduling approaches
      Multi-processor Architectures
      Multi-core processor Architectures
    3.4.3 Discussion on the related work

4 Platform Power Management
  4.1 Power Basics
    4.1.1 Charging/Discharging of Capacitive loads
    4.1.2 Short Circuit Currents
    4.1.3 Leakage Currents
    4.1.4 Total Energy Dissipation
      Idle energy dissipation
      Actor energy dissipation
      Schedule energy dissipation
  4.2 Platform Architecture
    4.2.1 Cluster Power Management
    4.2.2 Dynamic Power Management
      VDD hopping
    4.2.3 Leakage Power Management
    4.2.4 NoC Interconnect
      Energy dissipation and Latency
  4.3 Assumptions
      Architectural assumptions
      Application modeling assumptions
  4.4 Energy Dissipation Refinement
    4.4.1 Definitions Overview
      Architecture definitions
      Data-flow graph definitions

5 Energy Efficient Scheduling
  5.1 Constraint problem formulation
    5.1.1 Objective Function
    5.1.2 Deriving the constraints
      Graph transformation
      Segmentation
      Timing constraints
      Computation of favg
      Precedence constraints
    5.1.3 Discussion
      Multiple VFDs and variable P(τ)
      Complexity due to variable P(τ)
      Conclusion
  5.2 Our proposal
    5.2.1 Useful terms
      Maximum Usable Slack
      Available Usable Slack
      Actor Mobility Window
      Actor Overlapping Ratio
    5.2.2 The algorithm
      Scheduling
      Actor Shifting
      The Shifting algorithm
    5.2.3 Clustering
    5.2.4 DVFS Scheduling
    5.2.5 Extension of PathDVS
  5.3 Conclusion

6 Future Work
  6.1 Validation of the proposal
  6.2 Extension of the MoC
  6.3 Extension to other platforms
  6.4 Extension to multi-criteria scheduling heuristics

A Pseudo Algorithms

1 Introduction

1.1 Background

With the growing need for high-bandwidth digital communications and streaming applications requiring high-quality video and audio encoding, the transition to multi/many-core platforms in embedded systems is inevitable. Although multi-core architectures featuring general-purpose CPUs have become mainstream, the need for multiple GOPS to satisfy such high-end, computationally intensive functionality demands the transition to highly heterogeneous SoC platforms. Such platforms often incorporate asynchronous NoCs, customizable CPUs, domain-specific accelerators and multiple voltage-frequency domains organized in so-called DVFS islands.

Sequential programming languages such as C impose limitations when it comes to mapping onto parallel hardware. Usually such algorithms are represented by concurrent models of computation (MoCs). A model of computation is an abstraction of a computational system; it describes how the various computation processes interact [12]. The applications in this work are described by data-flow MoCs, where the computational blocks (actors) are ordered and operate on data that are queued on the intermediate edges. Using the data-flow paradigm for programming computationally intensive and usually time- and/or latency-constrained applications, the data dependencies between tasks as well as the inherent parallelism can easily be expressed and exploited. This representation of algorithms, which is usually depicted as a flow graph, is analogous to the operation and the structure of the underlying hardware, thus facilitating the mapping and scheduling to the platform under consideration. Figure 1.2 shows a mobile terminal MC-CDMA chain applied to the future 4G telecommunication standard. This application is divided into 21 cores. Figure 1.1 shows a functional representation of the VC-1 video codec found in HD DVDs and Blu-ray discs. With this representation, constraints such as throughput, timing and latency can also be modeled more intuitively, enabling the application of heuristics to optimize one or more criteria such as the power consumption or the schedule timespan. Figure 1.3 shows a mapping of the MIMO-OFDM technique, used in the 4G telecommunication standard, on a multi-processor SoC.

Regarding data-flow programming, various data-flow models have been proposed over the years, each of them having its own advantages and drawbacks when it comes to its expressive power and predictability. In the synchronous data flow (SDF) MoC [27], the numbers of tokens produced and consumed at the ports of the actors are known at compile time. This reduces the expressiveness of the model while increasing its predictability (verification and optimization). At the other end is the dynamic data-flow (DDF) MoC [8], supporting data-dependent behavior and unknown rates of token production and consumption at the cost of almost no predictability.

Figure 1.1: VC-1 decoder's algorithm schematic [4]

Figure 1.2: 4More functional diagram [24]

While data-flow modeling is extremely convenient for programming computationally intensive DSP applications, when it comes to mapping onto many-core SoCs one would also want to take advantage of the cutting-edge power management capabilities supported by such platforms. Scheduling of tasks with respect to the minimization of some criteria (power consumption and/or execution time) can make use of such mechanisms in order to scale the voltage efficiently. State-of-the-art power management engines include dynamic voltage and frequency scaling based on VDD hopping [33]. The platform under consideration for this master's thesis is a new many-core platform designed by STMicroelectronics, called P2012 [1]. Its voltage scaling scheme is based on the VDD hopping principle. With this technique, a virtual voltage point between the two supported voltage points (high and low) can be reached by dithering between them with a given duty cycle. The frequency can be reprogrammed in less than 200 ns.
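The virtual operating point reached by dithering can be illustrated with a small sketch. Assuming, as a simplification on our part rather than a P2012 specification, that the virtual frequency is the time-weighted average of the two operating points, the duty cycle of the high point follows directly:

```python
def dithering_duty_cycle(f_target, f_low, f_high):
    """Fraction of a dithering period to spend at the high operating
    point so that the time-weighted average frequency equals f_target.
    Requires f_low <= f_target <= f_high."""
    if not (f_low <= f_target <= f_high):
        raise ValueError("target frequency outside the reachable range")
    return (f_target - f_low) / (f_high - f_low)

# Example: a virtual 700 MHz point between 400 MHz and 800 MHz
# requires 3/4 of each dithering period at the high point.
alpha = dithering_duty_cycle(700.0, 400.0, 800.0)
```

The function name and the averaging model are ours; the actual P2012 power management may compute the duty cycle differently.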

1.2 Problem formulation

Given a dataflow graph consisting of a set of actors with known WCETs and a static binding to processing elements, an off-line schedule is to be found that minimizes the power consumption.


Figure 1.3: MIMO OFDM mapping [24]

To this end, the power consumption modeling and the completion time analysis of a static synchronous data-flow schedule are to be studied. More precisely, the factors that affect the power consumption (i.e. the static and dynamic energy dissipation of the PEs, and the communication costs introduced by the NoC), and which of them can be minimized, have to be identified in order for the power consumption model to be usable for efficient scheduling. Moving one step forward from modeling the power consumption, given a set of ordered actors with a static binding, a scheduling algorithm that minimizes the power consumption is to be found. Scheduling with respect to power optimization may incorporate a permutation of an admissible schedule (based on the WCET of each actor) in order to maximize the idle intervals in a time frame and distribute them to possible actors [30]. The time frame is usually chosen such that a number of constraints, depending on the application, are met. Such constraints might be the quality of a video encoding system or a hard deadline for a safety-critical system. The characteristics of the P2012 will also be taken into consideration.

The applications that we are focusing on operate on long streams of data. The execution time (or even the invocation) of the actors that model different parts (functions) of these applications is heavily data dependent and consequently varying. Moreover, data rates can also change at run time, affecting the production and consumption of tokens. Examples of data-flow graphs with higher expressive power that can model such behavior are PSDF [7] and HDF [14], to name a few. Since the goal is to minimize the power consumption, starting from an SDF representation of the algorithm and a static binding to the underlying platform, we first derive the necessary formulas to describe the ordering between actors that must be preserved for the correct execution of the algorithm. These formulas will be used as constraints for the optimization function. The work continues with deriving the objective function, that is, the energy dissipation formula. The optimization problem is described later on and is compared against the related work. From this comparison we will show that existing approaches for energy minimization cannot be applied to our case. Finally, we propose a heuristic to overcome the difficulties that arise from the underlying platform.


1.3 Contributions

With this work, our contributions towards the energy-efficient scheduling of dataflow graphs are threefold:

• We study the energy-efficient scheduling of dataflow graphs on high-end computing platforms consisting of hundreds of cores organized in VFDs. Up to the time this work was documented, there was no known published work assuming similar application and platform models. To this end, we contribute by formulating the optimization problem of the energy-efficient scheduling of dataflow graphs on multiple VFDs.

• After formalizing the problem at hand, we reveal the difficulties that arise when platforms like the P2012 are considered for the mapping and energy-efficient scheduling of dataflow graphs. Highlighting these difficulties also allows us to argue about the inefficiency of the approaches found in the related work.

• Finally, we propose a way to relax this very complex problem to the known and well-studied one of energy-efficient scheduling of dataflow graphs on multi-processor systems. The result of this method is a new graph, in which each actor might be the result of a clustering, as we will describe in section 5.2.3. We can then abstract each VFD as an individually managed PE and apply any of the existing approaches for energy-efficient scheduling. Last but not least, we also propose an extension that can be applied to existing heuristics in the case that actor suspension is not allowed.


2 Data Flow Graphs

The applications that we are studying are modeled under the data flow paradigm. Two very common data flow models are synchronous data flow graphs (SDFGs) and homogeneous synchronous data flow graphs (HSDFGs), which differ in the production and consumption rates of data by their actors. In this chapter we give the definitions and the notation needed for the application model under consideration in this work.

2.1 Synchronous Data Flow Graphs

Notation

• T is the set of actors. Actors operate on input data streams.

• E ⊆ T × T is the set of edges. Data streams between actors are exchanged through these edges. Since edges are directed, with the notation e(τ, z) ∈ E, τ, z ∈ T, we denote the edge directed from actor τ to actor z. These edges thus represent the data dependencies between actors. Every edge e ∈ E has precisely one source and one destination: ∀e(τ, z) ∈ E, ∃τ, z ∈ T | src(e) = τ and dst(e) = z. We associate with each edge a production rate, Prate : src(e) → N+, for producing tokens, and a consumption rate, Crate : dst(e) → N+, for consuming tokens.

• d : E → N represents the number of initial tokens (delays) on an edge e; these represent the data dependencies across iterations of G.

• WCEC : T, E → N represents the worst-case execution cycles needed by an actor to complete its execution. The same function also returns the total cycles needed for a communication between two actors.

• We define a path p(τ, z), τ, z ∈ T, directed from τ to z, to be a finite nonempty sequence of edges: p(τ, z) ⊆ E ∧ p(τ, z) ≠ ∅.

• We define the firing time as start(τ, k) ∈ N+, with k denoting the kth invocation of each actor. Similarly, we define end(τ, k) ∈ N+ to represent the completion of the kth invocation of actor τ.

A subset of SDFGs where Prate = Crate = 1 for all e ∈ E is called a homogeneous data flow graph.

Figure 2.1: (a) Consistent SDFG, (b) its topology matrix:

Γ = ⎡  1  −2   0 ⎤
    ⎢  0   1  −1 ⎥
    ⎢  0  −1   1 ⎥
    ⎣ −1   0   2 ⎦


Figure 2.2: (a) Inconsistent SDFG, (b) its topology matrix:

Γ = ⎡ −1   1   0 ⎤
    ⎢  0   1  −1 ⎥
    ⎣ −2   0   1 ⎦

Consistency of SDFGs

SDFGs are usually characterized by a topology matrix of dimension |E| × |T|. The entries in this matrix represent the production (consumption) rates of tokens on the corresponding edges. An example SDFG with its corresponding topology matrix is shown in figures 2.1a and 2.1b. A positive (i, j) entry in the topology matrix indicates the number of tokens produced by actor j on edge i. A negative entry, similarly, indicates the number of tokens consumed. A zero entry indicates that there is no connection between the edge and the actor.

It is proven in [26] that a sequential schedule can be constructed for an SDFG G if the rank of the topology matrix is one less than the number of actors in the graph, i.e.

rank(Γ) = |T| − 1    (2.1)

Such an SDFG is called consistent. The SDFG of figure 2.1a, with the topology matrix shown in figure 2.1b, has rank(Γ) = 2 and is therefore consistent. An example of an inconsistent SDFG is shown in figure 2.2, with rank(Γ) = 3. For the topology matrix it is also proven that:

rank(Γ) ≥ |T| − 1    (2.2)

If (2.1) holds, then there is a positive integer vector q in the null space of the topology matrix, called the repetition vector. The entries in the repetition vector indicate the number of invocations of each actor in each iteration of the schedule. For the repetition vector q it holds that:

Γq = O    (2.3)

with O being the zero vector. For the SDFG in figure 2.1a, the repetition vector can be found by solving the equation:

⎡  1  −2   0 ⎤   ⎡ q(1) ⎤
⎢  0   1  −1 ⎥ · ⎢ q(2) ⎥ = O
⎢  0  −1   1 ⎥   ⎣ q(3) ⎦
⎣ −1   0   2 ⎦

The repetition vector is then found to be:

q = ⎡ 2 ⎤
    ⎢ 1 ⎥
    ⎣ 1 ⎦

The above implies that the buffers needed for the inter-actor communications are bounded.
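The consistency test (2.1) and the computation of the repetition vector from (2.3) are mechanical enough to sketch in code. The following Python sketch (the function name is ours, not taken from any SDF tool) row-reduces the topology matrix over the rationals, returns its rank, and, when the null space is one-dimensional, the smallest positive integer vector q with Γq = O:

```python
from fractions import Fraction
from math import gcd
from functools import reduce

def rank_and_repetition_vector(gamma):
    """Row-reduce the topology matrix over the rationals. Returns
    (rank, q) where q is the smallest positive integer vector with
    gamma * q = 0, or (rank, None) if the graph is inconsistent."""
    rows = [[Fraction(x) for x in row] for row in gamma]
    n_cols = len(gamma[0])
    pivots = []                      # pivot column of each pivot row
    r = 0                            # next pivot row
    for c in range(n_cols):
        piv = next((i for i in range(r, len(rows)) if rows[i][c] != 0), None)
        if piv is None:
            continue
        rows[r], rows[piv] = rows[piv], rows[r]
        rows[r] = [x / rows[r][c] for x in rows[r]]   # normalize pivot to 1
        for i in range(len(rows)):                    # eliminate column c elsewhere
            if i != r and rows[i][c] != 0:
                f = rows[i][c]
                rows[i] = [a - f * b for a, b in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
    rank = r
    free = [c for c in range(n_cols) if c not in pivots]
    if len(free) != 1:
        return rank, None            # null space not one-dimensional
    # back-substitute with the single free variable set to 1
    f = free[0]
    q = [Fraction(0)] * n_cols
    q[f] = Fraction(1)
    for i, c in enumerate(pivots):
        q[c] = -rows[i][f]
    # scale to the smallest positive integer vector
    lcm = reduce(lambda a, b: a * b // gcd(a, b), [x.denominator for x in q])
    ints = [int(x * lcm) for x in q]
    g = reduce(gcd, ints)
    ints = [v // g for v in ints]
    return (rank, ints) if all(v > 0 for v in ints) else (rank, None)
```

For the topology matrix of figure 2.1b this yields rank 2 and q = (2, 1, 1), matching the result above; for the inconsistent matrix of figure 2.2b it yields rank 3 and no repetition vector.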


Figure 2.3: Iteration of SDF 2.1a (a single PE fires actors 1, 1, 2, 3 over time slots 0–4)

An iteration of a data-flow graph is defined then, as q(τ) invocations of all actors that appear in

the graph. For the HSDF case, q(τ) = 1 for all actors τ in the HSDFG. One iteration of the

SDF in figure 2.1a is shown in figure 2.3. A finite sequence of actor invocations that respects the

precedence constraints and produce no net change on the number tokens, accumulated on edges, is

called admissible schedule.

We can now define the data precedence constraint, in terms of the start and end functions, for the HSDF case as:

start(z, k) ≥ end(τ, k − d(e(τ, z))), ∀k ≥ d(e(τ, z)), (2.4)
τ, z ∈ T, e(τ, z) ∈ E

Since there are already d(e(τ, z)) delay tokens on the edge directed from actor τ to actor z, the latter can be invoked at most d(e(τ, z)) times before, or without, any invocation of actor τ. However, since one iteration of a schedule requires the invocation of every actor once, the (d(e(τ, z)) + 1)-th invocation of actor z can only take place after the completion of the first invocation of actor τ. In this way the precedence constraints are met.
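To make constraint (2.4) concrete, the small check below verifies it for a single edge carrying d initial delay tokens; the invocation times are invented for illustration:

```python
def respects_precedence(start_z, end_tau, d):
    """Check (2.4) for one edge tau -> z carrying d initial delay tokens:
    invocation k of z may start only after invocation k - d of tau ends."""
    return all(start_z[k] >= end_tau[k - d] for k in range(d, len(start_z)))

end_tau = [2, 4, 6, 8]   # completion times of actor tau
start_z = [0, 2, 4, 6]   # start times of actor z
print(respects_precedence(start_z, end_tau, d=1))  # True
```

With d = 1, actor z may fire once "for free" at time 0; every later firing waits for the corresponding firing of τ.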

2.2 Constructing an Equivalent HDFG

The work presented in the following chapters focuses on HSDF graphs, because of the simplicity of their communications: in this case, the worst-case latency of a communication between two actors is sufficient to derive the formulas for the earliest possible starting times (ASAPs) and the latest possible starting times (ALAPs), presented in equations 3.10 and 3.4 respectively. However, our work can also be applied to general SDF graphs. It suffices to expand the SDF graph into its equivalent HSDF graph, with a process similar to the one presented in [40]. Starting from the repetition vector q, the equivalent HSDF graph contains q(τ) copies of each actor τ ∈ T. Each copy of τ is the source of Prate(src(e)) edges in the equivalent HSDF graph, where e is an edge directed from actor τ to another actor in the SDFG. Similarly, each copy of τ is the destination of Crate(dst(e)) incoming edges, where e is now an edge directed to τ. An example of an edge expansion is shown in figure 2.4: the new graph contains three copies of actor 2, since q(2) = 3, and one copy of actor 1. Actor 1 is the source of Prate(src(e12)) = 3 edges, while each copy of actor 2 is connected to Crate(dst(e12)) = 1 incoming edge.

The equivalent HDFG of figure 2.1a appears in figure 2.5.
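The edge-expansion rule can be sketched in a few lines. The helper below is hypothetical and only covers edges without initial tokens: token t = i·p + j, produced by the i-th firing of the source, is routed to the consumer copy that fires for the ⌊t/c⌋-th time.

```python
def expand_edge(src, dst, p, c, q_src, q_dst):
    """HSDF edges equivalent to one SDF edge with production rate p and
    consumption rate c, assuming no initial tokens (illustrative sketch)."""
    assert q_src * p == q_dst * c, "edge must satisfy the balance equation"
    edges = []
    for i in range(q_src):            # i-th firing of the source actor
        for j in range(p):            # j-th token of that firing
            t = i * p + j             # token index within one iteration
            edges.append(((src, i), (dst, t // c)))
    return edges

# The edge of figure 2.4: p = 3, c = 1, q(1) = 1, q(2) = 3.
print(expand_edge(1, 2, 3, 1, 1, 3))
# [((1, 0), (2, 0)), ((1, 0), (2, 1)), ((1, 0), (2, 2))]
```

The single copy of actor 1 becomes the source of three single-rate edges, one per copy of actor 2, matching figure 2.4b.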



[Figure omitted: (a) an SDF edge from actor 1, producing 3 tokens per firing, to actor 2, consuming 1 token per firing; (b) its expansion into three single-rate edges from actor 1 to the copies (2,1), (2,2) and (2,3).]

Figure 2.4: Expansion of an edge in an SDFG

[Figure omitted: the equivalent graph, with two copies (1,1) and (1,2) of actor 1 and single copies of actors 2 and 3, connected by single-rate edges.]

Figure 2.5: HDFG equivalent of 2.1a


3 Scheduling

This chapter describes the techniques used for scheduling dataflow graphs on multi-processor systems. After an overview of the most commonly adopted techniques, we describe how to take the energy consumption into account during scheduling.

3.1 Scheduling Taxonomy

Scheduling is often divided into three main phases [40, 25]:

• Binding or Mapping: the process of assigning actors to processing elements. This step adds resource constraints between actors sharing the same processing element.

• Ordering: the process of defining the exact firing order of actors. Apart from the dataflow graph, which imposes the data precedence constraints (2.4), information on the exact mapping of actors is also required to define the firing order.

• Timing: the process of determining when each actor should fire to satisfy all the data and resource precedence constraints.

The optimal scheduling of a dataflow graph on a multi-processor platform, with respect to the schedule's length, is known to be an NP-hard problem [11]. We present different scheduling strategies below; the classification is done according to when binding, ordering and timing take place.

3.1.1 Fully Dynamic and Fully Static

We speak of a fully dynamic scheduling strategy if all the above steps take place at run-time. Such an approach is ideal when highly dynamic actor behavior is expected, and is more general in terms of applicability. However, the cost of being able to exploit the run-time variability in the execution times of actors (or the variability in the processors' workload) is high, so such an approach would typically be used in the case of loose timing constraints. At the other end, in the fully static scheduling strategy, all steps take place at compile time. We can further distinguish blocked and overlapped fully static schedules. In the first case, the inter-iteration dependencies are neglected and the dataflow graph is scheduled as if it executed for only one iteration. To take the inter-iteration dependencies into account, unfolding and re-timing can be applied to the dataflow graph. With unfolding, N iterations of the dataflow graph are scheduled together. While unfolding often leads to improved blocked schedules in terms of schedule length, it increases the program's memory requirements by a factor of N. With re-timing, the delays in the dataflow graph are manipulated in such a way that the critical path is reduced.

Of course, each of these strategies has its own advantages and disadvantages. If the objective is to reduce the run-time overhead imposed by the scheduler's computations, then the fully static methodology is appropriate. However, fully static schedules are viable only if there are tight bounds on the estimates of the worst-case actor execution times. Such an approach is useful for hard real-time systems.

3.1.2 Self-timed and Static assignment

Which of the steps can be done at compile time is determined by the amount of information available about the application. In between the two approaches mentioned above, we identify the self-timed strategy and the static-assignment strategy.

The self-timed approach imposes that mapping and ordering be done at compile time, while the timing of each actor is determined at run-time, based on the availability of the required input data. Such a strategy is ideal to compensate for fluctuations in the execution times of actors. Under the self-timed scheduling approach, a fully static schedule is obtained first, using a heuristic algorithm. After obtaining the mapping and ordering information from the fully static schedule, we discard the timing information. In the end, a list of actors is assigned to each processor, while their exact invocation times are determined at run-time. Compared to the fully static strategy, a self-timed scheduling approach will perform at least as well when the synchronization overhead is negligible. This overhead mainly stems from scheduling the communication actors that are essential for inter-processor synchronization. Apart from the synchronization overhead, the arbitration overhead should also be taken into account.

Relaxing the self-timed approach by performing the ordering at run-time as well results in the static-assignment strategy. Following this approach, the ordering of actors can be decided at run-time. Although a possible re-ordering might result in a reduced computational interval, deciding which actor should fire is not easy, especially when there are many possible combinations.

Compared to the fully static scheduling approach, where each actor is guaranteed to get a resource in a given time interval, in self-timed scheduling actors sharing the same resource must arbitrate at run-time to gain access.

3.1.3 Quasi-static and Ordered-transactions

Apart from the four approaches described above (presented in [25]), two more approaches can be identified when the scheduling of inter-processor communications, or the conditional execution of actors, is also taken into account. In the first case, we have the ordered-transactions approach, where the ordering of inter-processor communications is defined at compile time and imposed at run-time. Intuitively, an ordered-transactions schedule is a self-timed schedule with additional transaction-order constraints. The second case corresponds to actors containing conditional branches, such as if-then-else constructs and while loops, which make the execution time, or even the firing itself, data-dependent. The key idea behind the quasi-static scheduling approach is to optimize the average execution time of the overall computation. Based on a statistical model of the control variables (such as the average execution time), an execution profile can be defined and selected at run-time.

3.1.4 Complexity

The goal of scheduling dataflow graphs on a multi-processor platform is, as described earlier, to define the binding, ordering and timing of actors in such a way that an objective is optimized. Typically, such objectives include the makespan of the schedule and/or the energy consumption (as in our case). The makespan is the average iteration period of the schedule, and a lower bound on it is imposed by the critical path of the graph, i.e. the longest delay-free path in the graph. Evidently, when the inter-processor communication cost is taken into account, this lower bound is also affected by the binding of actors to PEs and by the architecture-dependent characteristics of the platform, i.e. the communication infrastructure between the PEs. Because optimal scheduling on multi-processor platforms is NP-hard, heuristics have been proposed to provide near-optimal results. Well-known heuristics are the critical-path heuristic, the list scheduling method and the graph decomposition method, summarized in [20].

Figure 3.1: Trade-off of generality against run-time overhead and implementation complexity

3.2 Notation

A fully static schedule SCH for PE processors specifies the triple:

SCH = {BT(τ), S, Tsch} (3.1)

where BT is the binding function that associates actors with PEs, S is the function returning the actor-specific firing moments, and Tsch is the iteration period. We use the same notation as in [40] and we deal with HSDF graphs. In homogeneous SDF, each actor is invoked only once per iteration, so, intuitively, S(τ) denotes the time of that unique invocation. To construct a fully static schedule, we use the following equation to derive the start time of actor τ during iteration k:

start(τ, k) = S(τ) + k · Tsch (3.2)

Here k · Tsch represents the start time of the k-th iteration of the schedule; with the above formula we retrieve the k-th invocation of actor τ.



A schedule is said to be admissible if it satisfies all the precedence constraints (2.4) imposed by the dataflow graph. In HSDFGs, since an actor is likely to have more than one incoming edge, we change the precedence constraint defined in (2.4) to:

start(τ, k) ≥ max_{e ∈ dst⁻¹(τ)} ( start(e, k) + WCET(e) ) (3.3)
start(e, k) = end(src(e), k − d(e))

for all k ≥ d(e). In the above inequality, the start and end functions define the exact invocation and completion times of an actor, and dst⁻¹(τ) returns all the edges from E that are directed to τ. Intuitively, (3.3) constrains the start time of an actor to be later than the arrival time (start(e, k) + WCET(e)) of the tokens on all incoming edges. Finally, src(e) returns the source actor of an edge.

3.2.1 Elaboration on Execution Time

In order for the static assignment and scheduling techniques to be valid, reasonably good estimates of the execution times of actors should be available at compile time. Most of the time, the execution time of an actor is data-dependent, which causes variations in the execution time between different iterations of the dataflow graph. As long as these variations are rare or small, static scheduling techniques remain viable. Admittedly, it is difficult to determine a worst-case bound on the execution times of actors, as cache misses and corner-case inputs might occur. However, it is still possible to obtain reasonably good execution-time estimates. It is usual for the programmer to derive a mathematical model associated with some actor parameters; such parameters may include the block size for processing a video frame or the number of coefficients in an FIR filter. Such an approach is used in the Ptolemy project [36]. This is feasible for actors written in low-level languages (e.g. assembly), and the estimates can be obtained through profiling on an instruction-set simulator (ISS).

3.3 Ordering Assignment

As Late As Possible (ALAP) times

For the list scheduling approach, an initial ordering assignment has to be performed. It is shown in [23] that ordering actors based on the latest possible start time provides comparable or better results than most other ordering metrics. ALAPs(τ) denotes the latest possible start time of an actor that will not cause a timing violation, and can be defined as:

ALAPs(τ) = ALAPf(τ) − WCET(τ) (3.4)

Since we assume that WCEC(τ) is known, we can obtain WCET(τ) at clock frequency f from the relation:

WCET(τ, f) = WCEC(τ) / f (3.5)

To compute ALAPf of actor τ, we must take into consideration the precedence constraints from the dataflow graph as well as the binding of actors to PEs. To form the equation for computing ALAPf, we will use the functions src and dst, which return the source and destination actor of a directed edge respectively. We will also denote by D the deadline for one iteration of an admissible schedule.

ALAPf(τ) = min( D, min_{e ∈ src⁻¹({τ})} ALAPs(e), min_{z ∈ succ(τ)} ALAPs(z) ) (3.6)

ALAPs(e) = ALAPs(dst(e)) − WCET(e) (3.7)

WCET(e) = (|BE(e)| + Prate(src(e)) · Nflits/token) · Lflit/hop (3.8)

where succ(τ) refers to all direct successors of actor τ, and BE(e), Nflits/token and Lflit/hop are defined in section 4.4.1, after the platform under consideration is described. Equation (3.8) describes the worst-case delay for a token to reach its destination and is purely dependent on the communication infrastructure of the underlying hardware. Equation (3.6) constrains the finish time of an actor. This constraint depends on the deadline D, which is actually the timespan of the frame, as well as on the latest starting times ALAPs of all direct successors of actor τ. These successors might come from the dataflow graph or from the binding information: actors that are bound to the same processing element and are to be executed sequentially are assumed to have a direct dependency, even if there is no direct edge in the dataflow graph connecting them. Thus, to compute the latest possible starting (and, respectively, finish) times of all actors, we must traverse the graph backwards. Consequently, actors with no direct successors have ALAPf = D.

As Soon As Possible (ASAP) times

In a similar way, we can define the earliest possible finish time of an actor as:

ASAPf(τ) = ASAPs(τ) + WCET(τ) (3.9)

with

ASAPs(τ) = max( A(τ), max_{e ∈ dst⁻¹(τ)} ( ASAPs(e) + WCET(e) ), max_{z ∈ pred(τ)} ASAPf(z) ) (3.10)

ASAPs(e) = ASAPf(src(e)) (3.11)

In the above equations, A(τ) denotes the arrival time of actor τ, which is zero under the dataflow notion, and pred(τ) is the set containing all the predecessors of actor τ. In this case, the graph has to be traversed forwards, and actors with no direct predecessors have an ASAPs time equal to 0.
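The backward and forward traversals behind the ALAP and ASAP formulas can be sketched as follows. The three-actor graph, its WCETs and the deadline are invented for illustration; same-PE dependencies are assumed to be already encoded as edges, and WCET(e) is taken as given rather than computed from (3.8):

```python
WCET = {"a": 2, "b": 3, "c": 1}                            # per-actor WCET(tau)
EDGE_WCET = {("a", "b"): 1, ("a", "c"): 2, ("b", "c"): 1}  # per-edge WCET(e)
D = 10                                                     # iteration deadline

def alap_start(order, edges, wcet, deadline):
    """Latest start times: traverse the graph backwards (eqs. 3.4, 3.6, 3.7)."""
    alap_f, alap_s = {}, {}
    for t in reversed(order):                    # reverse topological order
        outgoing = [alap_s[z] - w for (x, z), w in edges.items() if x == t]
        alap_f[t] = min([deadline] + outgoing)   # no successors -> ALAPf = D
        alap_s[t] = alap_f[t] - wcet[t]
    return alap_s

def asap_start(order, edges, wcet):
    """Earliest start times: traverse the graph forwards (eqs. 3.9-3.11)."""
    asap_s, asap_f = {}, {}
    for t in order:                              # topological order
        incoming = [asap_f[x] + w for (x, z), w in edges.items() if z == t]
        asap_s[t] = max([0] + incoming)          # A(tau) = 0
        asap_f[t] = asap_s[t] + wcet[t]
    return asap_s

print(asap_start(["a", "b", "c"], EDGE_WCET, WCET))     # {'a': 0, 'b': 3, 'c': 7}
print(alap_start(["a", "b", "c"], EDGE_WCET, WCET, D))  # {'a': 2, 'b': 5, 'c': 9}
```

The gap between the two values (here 2 time units for every actor) is the slack available for frequency scaling.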

3.4 Scheduling Heuristics

Since the problem of scheduling dataflow graphs on multi-processor systems is NP-hard, heuristics are used to find near-optimal results fast.

3.4.1 List Scheduling

The basic idea behind list scheduling is the construction of an ordered list of actors. Based on this list and the binding, each actor is associated with a time interval. We say that an actor is ready to fire at time t as soon as all its predecessors have fired and the associated processor is not busy. This ordering is also adopted in the self-timed approach; however, there the timing information is discarded. To take the communication costs into account when an edge crosses different VFDs, either communication actors should be introduced and scheduled explicitly, or this communication latency should be embedded in the execution time of the predecessor node. The ordered list of execution can be derived based on the earliest or the latest start times of actors, while taking into account the execution-time estimates, the precedence constraints, the timing constraints and the binding. The formulas for computing these times were presented in section 3.3. Dynamic level scheduling, as presented in [38], utilizes a more sophisticated scheme: the ordering of actors is recomputed after each scheduling step, and the ordering criterion is based on the difference between the sum over the longest path and the earliest possible start time of the actor.
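A minimal list-scheduling sketch under a fixed binding, ignoring communication costs, is given below. The actor set, binding and the use of ALAP start times as the priority key are illustrative choices, not the exact algorithm of any cited work:

```python
def list_schedule(wcet, preds, binding, priority):
    """Fire the ready actor with the smallest priority key (e.g. ALAPs)."""
    finish = {}                                   # actor -> finish time
    pe_free = {pe: 0 for pe in binding.values()}  # PE -> earliest free time
    schedule = {}                                 # actor -> start time
    remaining = set(wcet)
    while remaining:
        ready = [a for a in remaining if all(p in finish for p in preds[a])]
        actor = min(ready, key=lambda a: priority[a])
        pe = binding[actor]
        start = max([pe_free[pe]] + [finish[p] for p in preds[actor]])
        schedule[actor] = start
        finish[actor] = start + wcet[actor]
        pe_free[pe] = finish[actor]
        remaining.remove(actor)
    return schedule

wcet = {"a": 2, "b": 3, "c": 1}
preds = {"a": [], "b": ["a"], "c": ["a", "b"]}
binding = {"a": 0, "b": 0, "c": 1}                # actors a, b share PE 0
alap = {"a": 0, "b": 2, "c": 9}                   # illustrative priority keys
print(list_schedule(wcet, preds, binding, alap))  # {'a': 0, 'b': 2, 'c': 5}
```

Each actor starts as soon as both its predecessors have finished and its PE is free, which is exactly the readiness condition stated above.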

3.4.2 Low power scheduling approaches

Since the goal of this thesis is to find a schedule, under a fixed binding, such that the energy dissipation is minimized, we assume that the heuristics used for the binding step minimize the inter-domain communications. This is possible by avoiding placing unrelated actors on the same path [47].

The available work on energy minimization can be divided according to the platform under consideration, as well as to the type of applications. A lot of research has been conducted on multi-processor architectures where each processor has a dedicated DVFS unit and can adjust its voltage-frequency operating point individually, or can at least start and stop independently from the others. On the other hand, only little research exists on multi-core processor systems, where the granularity of voltage-frequency regulation is the core, or on architectures employing several voltage-frequency domains, where the grain is the VFD. In such architectures, all cores of a processor, or all processors in one VFD, run under the same voltage-frequency pair at any given point in time. For multi-processor systems with a dedicated DVFS unit embedded in each processor, application models both with and without precedence constraints have been investigated. The same holds for multi-core processor architectures; however, work focusing on precedence-constrained graphs restricts the architecture to contain only one VFD.

Multi-processor Architectures

In [45], which is actually an extension of [30], the authors propose a scheduling algorithm to reduce both dynamic and static energy dissipation through adaptive body biasing (ABB), using the power modeling from [32]. Forward body biasing (FBB) decreases the threshold voltage Vth of transistors, increasing both the maximum frequency and the leakage, while reverse body biasing (RBB) has the opposite effect. Adaptive body biasing refers to designs that can set the body biases statically or dynamically. Under a given binding, their voltage scheduling algorithm is divided into two phases. In the first phase, an optimal point is found between the supply voltage and the body-bias voltage upon a frequency update. Then, in order for the timing constraints to be met, their algorithm evaluates the validity of the generated schedule by re-computing the earliest start times and the latest finish times of the actors. An initial schedule for a DAG is found by means of critical-path-based list scheduling under the maximum clock frequency, with the latest possible start time as the ordering metric. After the initial schedule is found, the available idle time is allocated to actors, ordered according to their energy-gradient profile, in such a way that the timing constraints are not violated. The higher the energy gradient, the larger the energy savings achievable by frequency scaling. The frequency adjustment is done iteratively by decreasing the operating frequency in steps of df (given by the specifications of the platform) for actors with a high energy gradient, until a timing violation occurs. If more idle time is available, it is then allocated to actors with lower energy gradients. The same scheduling technique is also used in [31]. Having adjusted the operating frequency, they continue by finding the optimal point for the supply and body-biasing voltages.

A scheduling algorithm that uses priority ordering and best-fit mapping to processors is presented in [47] and extended in [43] to take the inter-processor communication cost into account. In [43], the ordering is defined according to the sum of the latest finish time and the earliest start time at which a processor is also available. The actor with the lowest metric value is assigned to a best-fit processor. To maximize the idle time, and thus the margin for energy savings, the actor ready to execute is assigned to the processor that was busy just before that actor was released. To take into account how this mapping will affect the communication traffic, they introduce a parameter K, called the communication-awareness parameter. The lower K is, the more communication-aware the algorithm is. Each mapping adds a communication cost (if there is inter-processor communication), which is compared with the average cost per edge (calculated from the DAG and multiplied by K). The voltage selection defined in [47] is formulated as an integer linear programming (ILP) problem, which can lead to great computational complexity [30].

A work using a clustering approach for energy-dissipation and makespan optimization is presented in [44]. They explore voltage scaling for non-critical jobs, i.e. jobs that are not on the critical path of the DAG. Concerning the voltage-scaling problem, they first compute the slack available for these non-critical jobs. For a certain job, they define slack as the difference between the latest possible finish time and the earliest possible start time, based on the previously scheduled jobs. Depending on a job's slack and the time it takes to execute at the maximum frequency, they can find an optimal frequency for its execution. In order to form a cluster, it is necessary that the result does not lead to an increase in energy consumption. Since clustered actors are executed on the same processor, this approach also guarantees a reduction in the makespan of the schedule. For scheduling actors within a cluster, they use a classic order assignment based on the longest path inside the cluster.

Multi-core processor Architectures

The first energy-efficient approach to real-time scheduling on platforms that share some of the characteristics of P2012 was presented in [46]. In this work, the platform under consideration consists of a single multi-core processor, and the goal is to schedule, off-line, a set of frame-based independent tasks with a fixed number of cores, while minimizing the energy consumption. The authors prove that energy-efficient real-time task scheduling in the multi-core context is NP-hard. In their context, a processor consists of M homogeneous cores. Each core can be put in the dormant mode independently from the others, but all active cores must operate on the same supply voltage. The tasks are such that they are ready at time 0 (or at a multiple of the frame period), and all tasks share the same deadline D, which is equal to the end of the frame. As far as the architectural assumptions are concerned, they assume that the voltage, and consequently the speed s, can be scaled in a continuous fashion. Furthermore, the overhead of switching between different supply voltages is negligible, and task migration is not allowed. The computation requirement of each task, in terms of cycles, is also known. Since the tasks to be scheduled are independent and each core can enter the dormant mode independently, it is proven that any feasible schedule for this set of tasks can be transformed into one that satisfies the so-called deep sleeping property, while consuming the same energy as the original one. The deep sleeping property holds if, whenever a core µ is in sleep mode at some time 0 ≤ t < D, it remains in the dormant mode at any time t′ with t < t′ < D. Based on this property, the authors prove that, for any given task assignment X, an optimal voltage schedule in terms of power consumption can be found. With this property, the schedule is partitioned into voltage-frequency segments: upon the transition of a core from the active to the dormant mode, a new segment starts. The optimal voltage can then be found by using the Lagrange-multiplier method to solve the power-consumption minimization problem. They also prove that finding the optimal task assignment is an NP-hard problem, and propose a 2.371-approximation algorithm (the result of their algorithm is no more than 2.371 times the optimal value).

In [10], the authors present a method to reduce the leakage current, the supply voltage and the clock frequency in an integrated way. The tasks to be scheduled are represented by a weighted acyclic task graph, and the architecture under consideration consists of several processors running under the same supply voltage and clock frequency. Moreover, it is assumed that the number of processors can be as large as the number of tasks, and that the voltage and frequency can be scaled continuously. Their leakage-aware scheduling algorithm determines the number of processors that results in the lowest energy consumption. First, they prove that scaling the frequency below a certain point results in a higher energy consumption, due to the leakage current. They find this optimal frequency to be 0.56 times the maximum frequency, when the threshold voltage is 0.3 times the maximum supply voltage. To reach this conclusion, they consider processors for which, at the maximum frequency, the leakage current is responsible for 50% of the total energy consumption. They first determine the minimal number of processors needed to finish the task graph before the deadline. To find this number, they perform a binary search in the interval [Nlwb, Nupb], with Nlwb the minimum number of processors needed to meet the deadline and Nupb the number of tasks. At each step, they use list scheduling, with the EDF priority function, to determine whether a schedule can be found that meets the deadline. According to the EDF scheduling policy, the earlier the deadline of a task, the higher its priority. To determine the optimal number of processors, they afterwards perform a linear search on the interval [Nlwb, Nminimal], repeatedly applying the schedule-and-stretch algorithm (lowering the supply voltage and clock frequency) until the schedule finishes exactly on the deadline. This loop ends when the makespan of the schedule no longer decreases as the number of processors increases; beyond this point, the energy consumption increases with the number of processors.
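The binary-search step just described can be sketched generically. Here `feasible` stands in for the list-scheduling feasibility test and is assumed to be monotone (adding processors never makes a feasible instance infeasible); the oracle in the usage example is invented:

```python
def min_processors(n_low, n_high, feasible):
    """Smallest n in [n_low, n_high] with feasible(n) True (binary search)."""
    while n_low < n_high:
        mid = (n_low + n_high) // 2
        if feasible(mid):
            n_high = mid       # a schedule exists, try fewer processors
        else:
            n_low = mid + 1    # deadline missed, need more processors
    return n_low

# Illustrative feasibility oracle: the deadline is met from 3 processors on.
print(min_processors(1, 8, lambda n: n >= 3))  # 3
```

Each probe of `feasible` corresponds to one run of EDF-priority list scheduling followed by a deadline check.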

In [29], the authors present a methodology for lowering the energy dissipation of a multi-core system. The scheduling of independent tasks is considered: the slack times are balanced among the cores within a voltage-frequency domain, and the clock frequency is lowered while the real-time constraints are still met. To meet the timing requirements, the core with the maximum utilization is chosen, and the operating frequency is decided so that the tasks mapped on that core meet the deadline. Based on this schedule, they propose a slack-reallocation algorithm to further distribute the slack times within the voltage-frequency domain. The schedule is divided into segments, and within these segments appropriate job migrations are performed to adjust the slack times. Since the slack times are now balanced, a lower operating frequency can be chosen to further reduce the energy consumption. The platform consists of a set of homogeneous cores partitioned into several voltage-frequency domains. A core can be either in sleep or in active mode. All active cores within a voltage-frequency domain share the same supply voltage and clock frequency; however, each core can be put into sleep mode independently.



The task model considered in that work assumes a set of independent tasks whose periods and WCETs are known. Figure 3.2a shows such a task set, along with the properties of each task. For the task-core mapping, the Worst-Fit Decreasing (WFD) policy is assumed. The WFD policy results in a better-balanced task partition than similar methods, such as Best-Fit Decreasing and First-Fit Decreasing, and maximizes the possible energy savings [3]. According to the WFD policy, the binding of tasks to cores is the one shown in figure 3.2b. The tasks are scheduled by the EDF policy, in a preemptive way, on each core. Since communication between different VF domains results in further energy dissipation, it is assumed that job migration is permitted only within the same VFD. The EDF schedule at the nominal frequency for the aforementioned task set is shown in figure 3.3. The next step in their proposal is to uniformly scale down the frequency, based on the worst-case utilization. From figure 3.2b, it is evident that with the current mapping the worst-case utilization is equal to 7/12. Scaling the frequency down to 7/12 of the nominal value yields the schedule in figure 3.4. According to the authors, moving the idle times backwards in time allows for better energy harnessing; pushing all tasks towards their deadlines alters the task scheduling to the one in figure 3.5. For the slack-reallocation algorithm, the whole iteration period (the least common multiple of the periods of all tasks) is divided into consecutive, non-overlapping segments. To divide the iteration period into segments, each core's schedule is first divided into non-overlapping, consecutive time slices. The start time of a slice is either the start or the end of a task, depending on whether the slice is active or idle. Having divided each core's schedule into time slices, the segments of the VFD are defined as follows:

b0 = 0, bsn = iteration period,
bi = min(ts.starttime) (3.12)
such that
ts ∈ ∪_{n=1..dcn} cn.Sched ∧ ts.starttime > bi−1 ∧ ts.state = idle, 0 < i < sn

where bi denotes the start/end time of a segment, tsstarttime and tsstatus denote the start time of a

slice and its status respectively, dcn is the number of cores in the VFD and ci.Sched is the set of time

slices of each core. From equation (3.12), we see that each segment starts at the start of an active

time slice and ends in the first idle time slice of the same core. The green lines in 3.5 represent the

boundary points for the segments. In this way the domain is now divided into segments and within

each segment the load of all cores is balanced by migrating jobs. Balancing the load in each segment

allows for further reduction in the clock frequency and thus higher energy savings. Figure 3.6 shows

the different steps of this algorithm, after segmentation.
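The boundary computation of equation (3.12) can be sketched as follows. This is a simplified, illustrative reading (not the authors' implementation), assuming each core's schedule is given as a list of (start, end, state) time slices:

```python
def segment_boundaries(core_scheds, iteration_period):
    """Boundaries per equation (3.12): each new boundary is the earliest
    start time of an idle slice strictly after the previous boundary."""
    idle_starts = sorted({start
                          for sched in core_scheds
                          for (start, end, state) in sched
                          if state == "idle"})
    bounds = [0]
    for s in idle_starts:
        if s > bounds[-1]:
            bounds.append(s)
    if bounds[-1] != iteration_period:
        bounds.append(iteration_period)
    return bounds

# Hypothetical two-core VFD over an iteration period of 12
core1 = [(0, 4, "active"), (4, 6, "idle"), (6, 12, "active")]
core2 = [(0, 5, "active"), (5, 12, "idle")]
print(segment_boundaries([core1, core2], 12))  # [0, 4, 5, 12]
```

Within each resulting segment, the load of all cores is then balanced by job migration before the frequency is rescaled.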

In [22] the authors address the problem of scheduling a set of real time, independent tasks sharing

a common deadline D. The platform consists of a set of cores, with non-negligible leakage power

consumption, organized in the voltage-frequency domain fashion, under given timing constraints.

Their work focuses on the problem of choosing the number of active voltage frequency domains, the


Task ID   WCET   Period   Utilization
   1        5      12        5/12
   2        1       3        1/3
   3        1       4        1/4
   4        1       6        1/6
   5        1       6        1/6
   6        1       6        1/6

(a)

Core ID   Tasks   Utilization
   1       1,6       7/12
   2       2,5       1/2
   3       3,4       5/12

(b)

Figure 3.2: (a) Task set, (b) Task-core mapping
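The WFD binding of figure 3.2b can be reproduced with a short sketch. This is an illustrative reimplementation, not the authors' code; cores are 0-indexed here:

```python
from fractions import Fraction

def worst_fit_decreasing(utilizations, num_cores):
    """Bind tasks to cores: highest utilization first, each task going to
    the currently least loaded core (the one with most spare capacity)."""
    loads = [Fraction(0)] * num_cores
    mapping = {c: [] for c in range(num_cores)}
    for task in sorted(utilizations, key=utilizations.get, reverse=True):
        core = min(range(num_cores), key=lambda c: loads[c])
        mapping[core].append(task)
        loads[core] += utilizations[task]
    return mapping, loads

# Task set of figure 3.2a (utilization = WCET / period)
tasks = {1: Fraction(5, 12), 2: Fraction(1, 3), 3: Fraction(1, 4),
         4: Fraction(1, 6), 5: Fraction(1, 6), 6: Fraction(1, 6)}
mapping, loads = worst_fit_decreasing(tasks, 3)
print(mapping)     # {0: [1, 6], 1: [2, 5], 2: [3, 4]}
print(max(loads))  # 7/12, the worst case utilization used by SimpleVS
```

Scaling the frequency to max(loads) = 7/12 of the nominal value reproduces the SimpleVS step of figure 3.4.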

[Gantt chart omitted: jobs j_{i,k} of the six tasks scheduled on Cores 1-3 over the hyperperiod [0, 12]]

Figure 3.3: EDF scheduling, f = 1

[Gantt chart omitted: the same job set on Cores 1-3, uniformly slowed down over [0, 12]]

Figure 3.4: SimpleVS, f = 7/12


[Gantt chart omitted: jobs pushed towards their deadlines on Cores 1-3 over [0, 12]]

Figure 3.5: Move slack backwards, f = 7/12

[Gantt charts omitted: segment 1 over [0, 4], shown (a) after SimpleVS, (b) after load balancing with job migration, (c) after shifting jobs, (d) after the frequency scaling]

Figure 3.6: Evolution of segment 1


[Gantt chart omitted: final schedule on Cores 1-3 over [0, 12]]

Figure 3.7: After migration, sg0.f = 31/72, sg1.f = sg2.f = sg3.f = 7/12

task mapping and the frequency assignment. Apart from a polynomial time complexity algorithm for energy minimization given the task mapping, they also prove that, in the absence of timing constraints, the operating frequencies that minimize the energy consumption of each domain depend only on the number of cores and the static power of this domain. They assume that the voltage-frequency

domains are symmetric (they contain the same number of cores), the frequency can be regulated in

a continuous fashion within an interval [fmin, fmax] and that a VFD can be in one of two states, on

and off. Since the task set to be scheduled contains only independent tasks, each VFD’s execution

time can be divided into segments. Moreover, since there is no communication between tasks across

different domains, finding an optimal solution for each domain separately results in a global optimum

solution. Given a fixed mapping of tasks to VF domains, the optimization problem to solve can be

written as:

min Et = Σ_{j=1}^{Nc} Ej(tj)

with

Σ_{j=1}^{Nc} tj ≤ D,   tj^min ≤ tj ≤ tj^max    (3.13)

where Nc is the number of cores in the domain and Ej(tj) is the energy consumption of a segment. Since the cores are sorted in a non-decreasing order of their workloads, they define

tj^min = (WCj − WCj−1)/fmax   and   tj^max = (WCj − WCj−1)/fmin,

with WCj the workload of the segment segj. To solve the convex optimization problem, they follow a two-step approach, in which they first narrow the interval of tj from [tj^min, tj^max] to the interval [tj^low, tj^up]. The intuition behind narrowing the interval is that, in [tj^low, tj^up], Ej(tj) decreases monotonically, and thus, solving (3.13) in the reduced interval, it should still hold that Σ_{j=1}^{Nc} tj = D when the energy is minimized. In the next step, they sort Ej(tj^up) and Ej(tj^low) of all segments in increasing order and perform a binary search in the domain in order to decide the optimal tj for each segment. In order to find the optimal number of VF domains needed to execute the task set, they perform a linear search in the interval [nb^low, nb^up], with

nb^low = ⌈(Σ_{i=1}^{N} wci)/(Nc · D)⌉   and   nb^up = min(⌈N/Nc⌉, Nb),

where N is the number of tasks to be scheduled, Nc is the number of cores in each domain, Nb is the total number of domains and wci is the worst case execution time of task i.

In each iteration they use the Largest Task First (LTF) heuristic for mapping the tasks to cores and check if the deadline can be met. In the LTF heuristic, tasks are sorted in a non-increasing order of


their execution cycles and then, through an iterative process each task is mapped to the least loaded

core.
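The LTF mapping step can be sketched as follows (an illustrative sketch with made-up execution cycle counts, not the authors' implementation):

```python
def largest_task_first(wcecs, num_cores):
    """LTF: sort tasks by non-increasing execution cycles, then map each
    task to the currently least loaded core."""
    loads = [0] * num_cores
    mapping = {}
    for task in sorted(wcecs, key=wcecs.get, reverse=True):
        core = min(range(num_cores), key=lambda c: loads[c])
        mapping[task] = core
        loads[core] += wcecs[task]
    return mapping, loads

# Hypothetical worst case execution cycles, mapped onto two cores
ltf_map, ltf_loads = largest_task_first({"a": 9, "b": 7, "c": 4, "d": 3}, 2)
print(ltf_map, ltf_loads)  # {'a': 0, 'b': 1, 'c': 1, 'd': 0} [12, 11]
```

The deadline check then amounts to verifying that max(ltf_loads) divided by the chosen frequency does not exceed D.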

In [21], the authors propose an algorithm that uses both dynamic voltage scaling (DVS) and power

shut down (PS) techniques to minimize the energy consumption of a time constrained dependent

task set running on an on-chip multi-processor. Their algorithm provides an extended schedule and stretch algorithm, where the tasks' computation cycles are iteratively stretched within the slack time

of a given time interval. They propose a minimum threshold interval for shutting down the cores,

that amortizes the power and time overhead induced by the shut down. The schedule and stretch algorithm is extended by incorporating a DVS efficiency metric to evaluate the energy gain of a stretched computation cycle. The application under consideration is modeled as a directed acyclic

graph. Each node in this graph represents a task with known timing and computational constraints.

The underlying hardware consists of N homogeneous processing elements that can communicate with

each other through a shared cache. All processing elements on the chip are assumed to be powered by

one off-chip regulator. In this way, the same voltage and consequently the frequency is applied to all

processing elements at the same time. However, each PE can be in the dormant mode independently

of the others. As mentioned above, the energy dissipation can increase below the critical speed. The critical speed denotes the frequency below which static power dissipation dominates the dynamic power and, as a consequence, the energy increases. To determine whether switching all PEs to the

dormant mode is efficient or not, the following threshold interval is used:

Tthreshold(PS) = max( Eoverhead(PS) / Pdc(critical speed), Toverhead(PS) )    (3.14)

where Eoverhead(PS), Pdc(criticalspeed) and Toverhead(PS) denote the energy overhead for switching

to the dormant mode, the static power consumption at the critical speed and the time overhead for

the transition respectively. In equation (3.14), the energy and timing overheads are normalized with the maximum total energy needed for one clock cycle at the maximum frequency and the cycle time at this frequency respectively. In order for a switch to the dormant mode to be efficient, the number of

consecutive idle cycles of the PE should be more than the timing overhead required for the transition.
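This shutdown decision can be sketched numerically; the overhead figures below are placeholders, not values from [21]:

```python
def worth_shutting_down(idle_time, e_overhead, p_static_critical, t_overhead):
    """Equation (3.14): power down only if the idle interval exceeds both
    the energy break-even time and the raw transition time."""
    t_threshold = max(e_overhead / p_static_critical, t_overhead)
    return idle_time > t_threshold

# Placeholder overheads: 40 uJ switching energy, 2 mW static power at the
# critical speed, 5 ms transition time -> threshold = max(20 ms, 5 ms)
print(worth_shutting_down(0.030, 40e-6, 2e-3, 5e-3))  # True  (30 ms > 20 ms)
print(worth_shutting_down(0.010, 40e-6, 2e-3, 5e-3))  # False (10 ms < 20 ms)
```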

In order to decide which cycle should be stretched, a DVS efficiency metric is proposed, formulated as:

NA(c) · (E(f1) − E(f2)) / (C2 − C1),   where f1 > f2    (3.15)

This metric represents the ratio of the energy savings to the increase in cycle time. In equation (3.15), NA(c) represents the number of PEs that are active in cycle c, E(f) denotes the total energy consumption of all NA(c) PEs and C1 is the cycle time at frequency f1. The tasks are mapped

and scheduled according to [47], which assigns priorities to tasks according to their latest finish times and maps each task to a PE according to the PE's finish time relative to the task's release. The

decision of which cycle to stretch is an iterative process and is based on the DVS efficiency metric.

The cycle with the highest efficiency metric is stretched first. However, the authors note that stretching computation cycles might not be energy efficient when the energy overhead for the transition

to dormant mode is considered. In order to evaluate whether the stretching of a cycle will minimize

the energy consumption, they compare the solution with the partially stretched computation cycle

with the previous best solution. The comparison stops when there is no available slack.
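The metric driven cycle selection can be sketched as follows (illustrative, with made-up per-frequency energy and cycle-time figures):

```python
def dvs_efficiency(num_active, e_f1, e_f2, c1, c2):
    """Equation (3.15): energy saved per unit of added cycle time when
    slowing from f1 to f2 (f1 > f2 implies c2 > c1), weighted by the
    number of PEs active in the cycle."""
    return num_active * (e_f1 - e_f2) / (c2 - c1)

def pick_cycle_to_stretch(cycles):
    """cycles: cycle id -> (num_active, E(f1), E(f2), C1, C2); the cycle
    with the highest efficiency metric is stretched first."""
    return max(cycles, key=lambda c: dvs_efficiency(*cycles[c]))

# Hypothetical per-cycle figures
cycles = {"c0": (4, 10.0, 6.0, 1.0, 2.0),   # metric 16.0
          "c1": (2, 10.0, 4.0, 1.0, 2.0),   # metric 12.0
          "c2": (3, 10.0, 7.0, 1.0, 1.5)}   # metric 18.0
print(pick_cycle_to_stretch(cycles))  # c2
```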


Work        Type of energy to be minimized      Application assumption        Platform assumption
[45], [30]  Dynamic and static                  DAG                           Individually managed PEs
[47], [43]  Dynamic                             DAG                           Individually managed PEs
[44]        Dynamic (for non-critical actors)   DAG                           Individually managed PEs
[46]        Dynamic                             Independent periodic tasks    1 VFD
[10]        Dynamic and static                  DAG                           1 VFD
[29]        Dynamic                             Independent periodic tasks    1 VFD
[22]        Dynamic                             Independent periodic tasks    multiple VFDs
[21]        Dynamic                             DAG                           1 VFD

Table 3.1: Overview of the assumptions in the related work

3.4.3 Discussion on the related work

As shown by the above discussion, only a few works address the problem of energy efficient scheduling of a data dependent task set to hardware configurations that force a number of cores to share the same voltage-frequency pair when active at the same time. An overview of the assumptions in the related work is shown in table 3.1. Although individually power managed PEs provide greater flexibility for energy minimization, adopting this strategy with a large number of PEs is very complex and expensive. It is also evident that blindly adopting sophisticated energy minimization methodologies designed for uni-processor architectures in the aforementioned configurations will result in higher energy dissipation.

Although in the above works the authors consider the minimization of both dynamic and static energy dissipation on clustered configurations, they mostly assume that the underlying platform consists of only one VFD. When they do assume multiple domains, they also consider the tasks being scheduled to be independent, which allows them to focus on only one domain, and, last but not least, they consider the switching activity to be constant. However, as most DSP applications are described through very large data-flow graphs, it is reasonable to expect that such graphs are mapped on several voltage-frequency domains. Moreover, the switching activity of actors is data and time dependent rather than constant and plays a dominant role in frequency scaling. We have to find a more general approach, which can take into account inter-domain dependencies and communications between actors, as well as the actor dependent switching activity.


4 Platform Power Management

In this chapter, we present some background on the power consumption of digital circuits. We also describe the platform under consideration, which forms the basis of the assumptions used when the energy efficient scheduling is discussed.

4.1 Power Basics

In current CMOS designs, the dynamic and static power consumption are the two major factors affecting the energy dissipation. Dynamic power consumption stems from the average capacitance switched per operation type. It is data dependent and exhibits an almost cubic relation with the supply voltage (assuming that the frequency has a linear dependence on the supply voltage). Static power consumption

is mainly attributed to leakage current. As CMOS technology is scaled down, the contribution of static

power in the overall energy dissipation is more and more important [37]. Finally, the incorporation of NoCs as interconnecting mechanisms between processing elements contributes to the overall energy dissipation as well. Understanding and modeling the above factors is essential before the scheduling

of actors is considered.

The following analysis concerns the energy dissipated during one clock period. Later we will extend

the energy dissipation taking into account the number of actors, their execution time and the number

of PEs. We denote the clock period as Tclk, defined as the inverse of the clock frequency fclk.

The energy dissipation of a digital circuit is mainly attributed to three phenomena:

• Charging/discharging of capacitive loads

• Short circuit currents

• Leakage currents

The first two contribute to the dynamic energy dissipation while the last one to the static energy.

4.1.1 Charging/Discharging of Capacitive loads

The energy consumed for a transition from 0 → 1 (or vice versa) depends on the total capacitance

of the circuit (roughly proportional to the number of CMOS transistors), the supply voltage and

the activity factor α. This factor denotes the fraction of the total capacitance being charged (or

discharged) and depends on the utilization of the PE’s components. The power consumption is given

by

Pcharge = α · C · VDD² · fclk    (4.1)

For the 65 nm technology, the frequency can be considered proportional to the supply voltage:

fclk ∝ (VDD − Vth)^a / VDD    (4.2)


and we derive a cubic relation between the clock frequency and the power consumption:

Pcharge ≈ α · C · fclk³    (4.3)

In (4.2), the exponent a is constant. It is experimentally derived and for the aforementioned technology is approximately equal to 1.3. The total energy dissipated in Tclk is then:

Echarge = α · C · VDD²    (4.4)

The switching activity factor can be minimized by adopting efficient architectures, allowing instructions to complete with minimal hardware utilization. From (4.1), it is evident that minimizing the supply voltage can provide significant power savings. However, reducing the supply voltage also affects the operating frequency, through equation (4.2). In order to compensate for the performance loss, pipelined implementations can be used, as is the case in most DSP systems.
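Equations (4.1) and (4.4) can be checked numerically; the values for α, C, VDD and fclk below are placeholders, not measured parameters:

```python
def p_charge(alpha, cap, vdd, f_clk):
    # Equation (4.1): dynamic power of charging/discharging capacitive loads
    return alpha * cap * vdd ** 2 * f_clk

def e_charge(alpha, cap, vdd):
    # Equation (4.4): dynamic energy dissipated in one clock period Tclk
    return alpha * cap * vdd ** 2

# Placeholder values: alpha = 0.15, C = 1 nF, VDD = 1.0 V, fclk = 400 MHz
p = p_charge(0.15, 1e-9, 1.0, 400e6)   # 0.06 W
e = e_charge(0.15, 1e-9, 1.0)          # 0.15 nJ per cycle
assert abs(p - e * 400e6) < 1e-12      # power = energy per cycle x fclk
```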

4.1.2 Short Circuit Currents

When transistors switch, both the nMOS and pMOS transistors traverse a partially conducting state. This results in a direct conducting path from VDD to Vss called a “short circuit”. The energy dissipated during this

state can be approximated by:

Esc = σ · Echarge (4.5)

with σ being 0.2 on average for VLSI circuits. The effect of short circuit currents at the gate's output on the dynamic power consumption is relatively small. It can be absorbed by the α factor in equation (4.1) or (4.3).

4.1.3 Leakage Currents

Among the phenomena that contribute to static power dissipation, the sub-threshold current plays the dominant role. Having this in mind, the leakage current Ilk can be approximated using the following equation [9]:

Ilk ≈ Isub = K1 · W · e^(−Vth/(η·Vθ)) · (1 − e^(−VDD/Vθ))    (4.6)

with K1 and η being technology dependent parameters, Vth the threshold voltage, W the gate width

and Vθ the thermal voltage (which depends on the temperature). From equation (4.6), it is clear,

that there are two possible ways to reduce the leakage current. The first one would be to turn off the

supply voltage causing loss of state and the second one would be to increase the threshold voltage,

causing loss of performance.
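Equation (4.6) can be evaluated directly; the constants below are placeholders rather than measured 65 nm parameters:

```python
import math

def leakage_current(k1, width, v_th, eta, v_theta, vdd):
    # Equation (4.6): sub-threshold leakage approximation
    return k1 * width * math.exp(-v_th / (eta * v_theta)) \
              * (1.0 - math.exp(-vdd / v_theta))

# Placeholder parameters; v_theta is the thermal voltage (about 26 mV at 300 K)
base = leakage_current(k1=1e-6, width=1.0, v_th=0.35, eta=1.5,
                       v_theta=0.026, vdd=1.0)
raised_vth = leakage_current(k1=1e-6, width=1.0, v_th=0.45, eta=1.5,
                             v_theta=0.026, vdd=1.0)
assert raised_vth < base  # raising Vth reduces the leakage exponentially
```

The exponential dependence on Vth is exactly why the two knobs named above (cutting VDD, raising Vth) are effective.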

4.1.4 Total Energy Dissipation

Based on the formulas (4.4), (4.5) and (4.6), the total energy dissipated by one PE in a DVFS island in one clock period is:

Ep = Echarge + Esc + VDD · Ilk · (1/fclk)    (4.7)


and can be approximated by:

Ep(fclk, VDD) = k1 · α · C · VDD² + k2 · Ilk · VDD · (1/fclk)    (4.8)

with k1 and k2 being constants. From equation (4.8) we can see that a reduction in the supply voltage VDD trades for a quadratic reduction in energy dissipation. Thus, the energy savings by applying

DVFS can be significant. However, voltage scaling in CMOS also affects the gate traversal delay and

consequently the global delay, as shown by (4.2). So voltage scaling should also involve frequency scaling. More precisely, fclk should be decreased linearly with VDD, thus affecting the computation time [16].
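This trade-off has a minimum: with VDD scaled linearly with fclk, the dynamic energy per cycle falls at lower frequencies while the static energy per cycle grows as 1/fclk. A numerical sketch locating the critical speed, using placeholder constants and assuming, for simplicity, a constant leakage power:

```python
def energy_per_cycle(f, a=1.0, p_static=0.05):
    """Per-cycle energy: dynamic term a * f**2 (VDD scaled linearly with f)
    plus static term p_static / f (leakage power taken as constant)."""
    return a * f ** 2 + p_static / f

# Sweep normalized frequencies; the minimum is the critical speed, below
# which the static term makes slowing down counterproductive
freqs = [i / 100 for i in range(10, 101)]
f_crit = min(freqs, key=energy_per_cycle)
print(f_crit)  # 0.29 with these placeholder constants
```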

Idle energy dissipation

When a PE does not execute any actor, it is in the IDLE state. Since in this state the switching

activity is zero, only the leakage current contributes to the total energy dissipation. The energy

dissipation in this state is then given by the formula:

Ep^idle = Σ_j k2 · VDD · IDECj · (1/fclk)    (4.9)

with IDEC denoting the idle execution cycles between the end of an actor τ and its re-execution or

the start of another actor z mapped on the same PE and j denoting these idle intervals. As will be

described in subsection 4.2.3, most modern architectures employ circuits that gate the clock to the

computational logic as well as mechanisms to reduce the effect of leakage current in the idle mode.

Actor energy dissipation

Assuming a given WCEC for one execution of an actor and under constant frequency and supply

voltage, the total energy dissipated by the PE executing one invocation of this actor is:

Ep(τ) = Ep(fclk, VDD) · WCEC(τ)    (4.10)

Schedule energy dissipation

Since we are interested in the energy dissipated by executing a data-flow graph, we use (4.10) and

(4.9), along with the repetition vector defined in (2.3), to derive the formula for the schedule’s energy

dissipation. Without taking into account the energy for communication between PEs, the total energy

dissipated in one iteration of the schedule is:

Etot^actor = Σ_{τ∈T} q(τ) · Ep(τ) + Σ_{p∈PE} Ep^idle    (4.11)

However, the above equation needs to be refined when DVFS is applied on a VFD containing a number of PEs, rather than on every PE independently. Consider the HDFG shown in figure 4.1a with the mapping shown in figure 4.1b. According to the given mapping, a periodic schedule can be found, as shown in figure 4.2a, for the nominal frequency f = fmax (or f = 1, if the normalized frequency is used). Figure 4.2b shows the initial schedule after scaling the frequency of actor 3 from f = 1 to f = 1/3, when the frequency and voltage can be adjusted for each PE separately.


[Graph omitted: HDFG with actors 1-4]

(a)

Actor   WCEC   PE
  1       1     1
  2       1     2
  3       1     1
  4       1     2

(b)

Figure 4.1: (a) HDFG, (b) Mapping and WCEC of each actor

[Gantt charts omitted: actors 1-4 on PE 1 and PE 2]

Figure 4.2: Individually managed PEs, (a) f = 1 for all actors, (b) frequency scaling to f = 1/3 for actor 3, (c) case for clustered PEs

In this context, equation 4.10 can be used to calculate the energy dissipation of each actor. Since

each PE has a dedicated DVFS mechanism, the idle periods can be identified for each PE separately

and consequently equation 4.9 can be used for the idle energy calculation. Equation 4.11 will return

the total energy dissipation for a given actor scheduling and frequency scaling.

In the case when the two PEs are clustered into a VF domain, scaling the frequency of actor 3 as above would result in the schedule shown in figure 4.2c. It is evident that scaling the frequency of an actor directly affects the frequencies, and consequently the energy dissipation, of all actors in the VFD that are active when the frequency scaling occurs. Following the above argument, equation 4.10 needs to be refined to take into account the frequency scaling points that fall into the actor's active interval. Last but not least, this clustering of PEs into VFDs requires the recalculation of idle intervals. In contrast to the case of individually managed PEs, an idle interval in the context of VFDs is defined as the interval when all the PEs in the VFD are idle simultaneously. Following this, while in figure 4.2b the idle intervals are [1,3] and [4,5] for PE1 and [1,2] for PE2, in the case of clustered PEs in figure 4.2c there are no idle intervals.
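Recomputing idle intervals for a VFD amounts to intersecting the per-PE idle intervals; a small sketch (with toy intervals, not the schedule of figure 4.2):

```python
def vfd_idle_intervals(per_pe_idle):
    """A VFD is idle only when every one of its PEs is idle: intersect
    the per-PE lists of idle intervals."""
    common = per_pe_idle[0]
    for intervals in per_pe_idle[1:]:
        merged = []
        for (a, b) in common:
            for (c, d) in intervals:
                lo, hi = max(a, c), min(b, d)
                if lo < hi:
                    merged.append((lo, hi))
        common = merged
    return common

# Toy intervals: PE1 idle on [2,5] and [8,10], PE2 idle on [4,9]
print(vfd_idle_intervals([[(2, 5), (8, 10)], [(4, 9)]]))  # [(4, 5), (8, 9)]
```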


Figure 4.3: The P2012 fabric

4.2 Platform Architecture

The P2012 platform under consideration in this project is designed by STMicroelectronics. P2012 is an area and power efficient many-core platform. The computing fabric is highly modular and is

based on multiple clusters, each of which is an independent power and clock domain (VFDs). Each of

these domains incorporates a power management unit, that can be controlled independently, enabling

aggressive fine-grained power management. The communication infrastructure between the clusters is

based on a high-performance asynchronous NoC architecture: the ALPIN NoC. A graphical schematic

of the architecture is presented in figure 4.3.

4.2.1 Cluster Power Management

Each cluster of PEs in P2012 is an independent voltage/frequency domain. The architecture of each domain is shown in figure 4.4. Within each unit, a programmable local clock generator generates a

variable frequency in a predefined and programmable tuning range. Apart from the local clock gener-

ator, each domain incorporates a local power supply unit (PSU) to generate and control its internal

core voltage supply. Regarding the dynamic power consumption, the technique used is a Locally

Adaptive Voltage and Frequency Scaling (LAVFS) with VDD hopping [6]. As far as the static power

consumption is concerned, PMOS power switches, controlled by an ultra-cut-off (UCO) mechanism,

are inserted to maintain minimum leakage in standby mode. The PSU is presented in figure 4.5.

4.2.2 Dynamic Power Management

Efficient LAVFS is performed through a hardware controller that automatically switches between Vhigh

and Vlow, using a configurable duty-ratio. This way, low-level software control is avoided as much as

possible. The hopping technique used is VDD hopping with dithering described later.


Figure 4.4: NoC Unit Architecture - VFD

Figure 4.5: Power Supply Unit


Figure 4.6: Dithering principle

The Local Power Management (LPM) unit is in charge of handling the domain's power modes. The LPM contains a set of programmable registers to define the domain power mode, configure the programmable delay line (for frequency regulation), and configure and control the PSU. More precisely,

the LPM contains two dedicated registers to program the frequency and the duty-ratio for the hopping

unit. In addition, it contains registers to control the hopping unit signals. A mode register controls

the mode of the unit. In this architecture, each unit (VFD) can be set in one of the following power

modes:

• INIT mode: the supply voltage is Vhigh and the clock is gated.

• HIGH mode: the supply voltage is Vhigh and the clock is sent to the VFD.

• LOW mode: the supply voltage is Vlow and the clock is sent to the VFD.

• HOPPING mode: the supply voltage automatically hops between Vhigh and Vlow. The frequency and duty ratio of the hopping are configurable. The obtained performance is an average between the Vhigh and Vlow operating points, based on the duty ratio.

• IDLE mode: the VFD clock is off and the leakage power is reduced due to the Vlow supply voltage.

• OFF mode: the unit is switched off by the UCO device, to further reduce the leakage power.

VDD hopping

As mentioned above, VDD hopping with dithering between two pairs such as (Vhigh, fhigh) and (Vlow, flow) is used to control the average voltage and frequency of the VFD. Dithering provides superior results

against DFS or DVFS with discrete voltage levels [6], as shown in figure 4.6, and comparable performance to a continuous voltage converter. The average operating frequency is determined by the duty ratio between the time spent at fhigh and at flow:

Favg = (flow · tlow + fhigh · thigh) / (thigh + tlow)    (4.12)

During hopping, the supply voltage is provided by a power supply selector acting as a linear regulator

with a voltage set point given by a DAC. With this precise control, changing between supply voltages


can be done following a controlled ramp (Vref), limiting wide current variations and avoiding supply voltage overshoots or undershoots. Because of this smooth transition, the VFD does not need to be stopped and thus there is no latency cost at the application level.
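Inverting equation (4.12) gives the duty ratio required for a target average frequency; a small sketch with normalized, hypothetical operating points:

```python
def hopping_duty_ratio(f_avg, f_low, f_high):
    """Fraction of time to spend at f_high so that the time-weighted
    average of equation (4.12) equals f_avg."""
    assert f_low <= f_avg <= f_high
    return (f_avg - f_low) / (f_high - f_low)

def average_frequency(duty_high, f_low, f_high):
    # Equation (4.12), with duty_high = thigh / (thigh + tlow)
    return duty_high * f_high + (1.0 - duty_high) * f_low

# Hypothetical normalized set points: flow = 0.5, fhigh = 1.0
d = hopping_duty_ratio(f_avg=0.75, f_low=0.5, f_high=1.0)
print(d)  # 0.5: spend half the time at each operating point
assert abs(average_frequency(d, 0.5, 1.0) - 0.75) < 1e-12
```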

4.2.3 Leakage Power Management

The use of power switch transistors to reduce the leakage current in digital memory circuits is already mainstream. The first method used is known as Multiple Threshold CMOS (MTCMOS) power switches [2]. It consists of using low-VT, high performance transistors for the logic and high-VT, low leakage transistors for the power switch. In this way, the power switch is inserted between the supply lines and the logic. The drawback of this approach is the poor performance of the high-VT transistors under a low supply voltage. To allow low-voltage operation, the super cut-off CMOS (SCCMOS) switch [37, 19] has been introduced. It is a low-Vth transistor whose leakage current is exponentially reduced by reverse biasing its gate. The UCO circuit is responsible for biasing the gate of the SCCMOS power switch [5].

In [6], the authors compare the leakage gain from the usage of a UCO-type power switch against the gain from the usage of an MTCMOS switch, when the UCO-type power switch is used to drive an IP with higher power dissipation. The leakage current in the OFF state was found to be 8 times lower due to the UCO, while it is 2.5 times higher in the HIGH mode of operation.

4.2.4 NoC Interconnect

Energy dissipation and Latency

In on-chip interconnects there are two sources of energy dissipation: wires and routers. P2012 uses,

for interconnecting the VFDs, an innovative 2D-mesh NoC based on asynchronous logic perfectly

adapted to the GALS paradigm. The routers in the NoC are implemented in a Quasi-Delay-Insensitive

(QDI), closk-less logic [42]. QDI circuits are a class of delay invariant, asynchronous designs. This

fully asynchronous NoC interconnect scheme, provides almost 5 times less power consumption than

the synchronous equivalent and reduced latency (almost a ratio of 2). The availability of only low-Vt asynchronous cells (instead of multi-Vt) results in a higher static energy consumption from the routers,

compared to the synchronous equivalents. However, when the asynchronous routers are idle, there

is only static power dissipation. In the synchronous implementation the routers, while in the idle

state, can consume up to 5mW (for a high performance router) because of clock switching, even if

clock gating is performed, while the asynchronous router implementation consumes 240 µW [42]. On

a telecom application implemented to compare the synchronous and asynchronous approaches, the

power budget for a 15 node NoC was reduced from 82.6 mW down to 11.9 mW. With

the asynchronous NoC as a communication infrastructure, for the system of figure 1.3, the energy

dissipated in the NoC corresponds to 6% of the overall energy dissipation [24]. As far as the latency

is concerned, for a 5 node path the ANoC latency is 17.3 ns against 29 ns for the synchronous

version [42]. The provided throughput can reach 17 Gb/s for 32-bit flits.

4.3 Assumptions

Based on the above discussion on P2012 as well as the discussion on the application modeling adopted,

we proceed by presenting our assumptions necessary for refining the energy formula in (4.11).


Architectural assumptions

• The platform can be described by a directed architecture graph, GA = (I,L), where I is a set of

VF domains and L ⊂ I × I is a set of links between these domains. A link is an ordered pair

l = (i, z) with i, z ∈ I. Data can be sent from domain i to the domain z with a constant latency.

• The platform is homogeneous: the execution time of an actor is independent of the PE and

depends only on the supply voltage and frequency. From the homogeneous case, it follows that

k1 in (4.8) is constant for all PEs. This homogeneity comes from the fact that all PEs in the

VFDs of P2012 are identical.

• There is a linear dependence between the supply voltage and the operating frequency (for 0.8 V ≤ VDD ≤ 1.2 V and 0.2 V < VT < 0.5 V in 65 nm technology [17]).

• The platform supports state of the art mechanisms for reducing the static power, which can therefore be neglected [6].

• Thanks to VDD hopping, the LAVFS mechanism can set the frequency to any value in the range

[fmin, fmax] [1].

• There is no overhead for issuing a frequency scaling command. This assumption comes from the

fact that the voltage and frequency can be scaled smoothly and consequently there is no need for the executing actor to stop. An overhead exists only when it is decided to switch the island off.

• The scaling moments of frequency and voltage always coincide with each other.

• Intra-domain communications between PEs are instantaneous and do not consume energy. This is a reasonable assumption for our platform, where each VFD contains PEs connected to multi-banked, shared, level 1 instruction and data memories [1].

• Inter-domain communications have a known and bounded latency (proportional to the number

of nodes in the path and the amount of flits transmitted). This is the case for guaranteed throughput NoCs, with a constant energy consumption per flit denoted as ENoC_flit/hop. A flit is the quantum of information, in bits, transferred between two routers in the NoC in one clock cycle. The flit size depends on the wires connecting the two routers together.

• The energy dissipated by communication is neglected, because the design of the NoC and the binding have been done in such a way that inter-domain communications are minimized.

• PEs that are not bound to any actor from the data-flow graph are assumed to be turned off.

Application modeling assumptions

• The data-flow notion is used to model the application under consideration. The definition of this model was given in section 2.1.

• The dependencies and the amount of exchanged information (size and number of tokens) between

actors are known.

• The switching activity α of each actor is known.


• The WCEC of each actor and of each communication is known. Since the communication between VFDs goes through an asynchronous NoC, the WCET of each edge that crosses VFDs is considered known from the mapping step.

• The deadline for one iteration of the schedule is known.

4.4 Energy Dissipation Refinement

4.4.1 Definitions Overview

Following the assumptions of section 4.3, we proceed to evaluate the energy consumption of a given data-flow graph. We provide a short summary of the function and set definitions for the architecture and the graph that will be used extensively in the following chapters.

Architecture definitions

- GA = (I,L) is the directed architecture graph

- I is the set of VFDs

- PE is the set of PEs

- BPE : PE → I is the binding function that maps processing elements to a VFD. The inverse of this function, i.e. BPE−1(i), returns the set of PEs on a VFD i.

- N : I → 2^N returns the set of cycles of an island; this set depends on the frequency schedule. The maximum of N is given as |N| = D · fmax. With frequency scaling, the total number of execution cycles can decrease. However, there is a lower bound on the number of execution cycles which must be preserved in order for all actors mapped on a domain to complete their execution. This lower bound is based on the actors' WCEC. To calculate it, we need the worst-case execution path within the given domain. Intuitively, the worst-case path for each processing element is the difference between min(S(τ)) and max(S(τ) + WCEC(τ)), with S(τ) the function that returns the scheduling cycle of an actor τ ∈ BT−1({p}). To obtain the critical path within a domain, we then consider the critical paths of all its PEs. For a domain i ∈ I we define this lower bound as:

lb_cycles(i) = max_{p ∈ BPE−1({i})} Smax(p) − min_{p ∈ BPE−1({i})} Smin(p)    (4.13)

with

Smin(p) = min_{τ ∈ BT−1({p})} S(τ),  Smax(p) = max_{τ ∈ BT−1({p})} (S(τ) + WCEC(τ))

Considering figure 4.2a, in the case where the two PEs are clustered in the same VFD, the lower bound on the computation cycles, based on the mapping, is equal to 3.
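As a concrete illustration, the lower bound (4.13) can be sketched in Python. All identifiers (domain_pes, actors_on_pe, and so on) are hypothetical and not part of the thesis; this is a minimal sketch, assuming the schedule S and the WCECs are given as plain dictionaries:

```python
def lb_cycles(domain_pes, actors_on_pe, S, wcec):
    """Lower bound (4.13) on execution cycles for one VF domain.
    domain_pes: PEs bound to the domain, BPE^-1({i});
    actors_on_pe: maps a PE p to its actors, BT^-1({p});
    S: scheduling cycle of each actor; wcec: WCEC of each actor."""
    s_min = min(S[t] for p in domain_pes for t in actors_on_pe[p])
    s_max = max(S[t] + wcec[t] for p in domain_pes for t in actors_on_pe[p])
    return s_max - s_min
```

Taking the maximum (minimum) over all actors of all PEs in the domain is equivalent to taking max over p of Smax(p) (min over p of Smin(p)) in (4.13).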

- F : I × N → f , with N a finite set of cycles and f ∈ [fmin, fmax]. This function returns the

frequency schedule of a VFD.


- V : I × N → V, V ∈ [Vmin, Vmax]. This function returns the voltage schedule of a domain.

- L ⊂ I × I is the set of links in the NoC.

- Nflits/token ∈ N+ : gives the number of flits per token.

- Lflit/hop ∈ R+ : gives the network hop delay for one flit

Data-flow graph definitions

- Gs = (T , E) is the data-flow graph

- T is the set of actors in the data-flow graph

- E ⊂ T × T is the set of edges in the data-flow graph

- q : T → N+ is the repetition vector and returns the number of invocations of τ ∈ T in one

iteration of the schedule

- D : returns the deadline for a schedule

- BI : T → I is the actor mapping function that associates one τ ∈ T with one i ∈ I. The inverse of this function, i.e. BI−1(i), returns all the actors mapped on the VFD i.

- BT : T → PE is the actor mapping that associates one τ ∈ T to one p ∈ PE . The inverse of this

function, i.e. BT −1(p), returns all the actors mapped on the PE p.

- WCEC : T → N+ returns the WCEC of τ ∈ T .

- α : T → R+ returns the switching activity of τ ∈ T

- src (dst) : E → T returns the source (destination) actor of an edge e ∈ E

- Prate (Crate) : E → N+ returns the tokens produced (consumed) on an edge e ∈ E by one invocation of its source (destination) actor

- BE : E → 2^L is the edge binding function that binds one e ∈ E to possibly several or no l ∈ L, and thus |BE(e)| ∈ N

- S : T → 2^N returns the scheduling moments (in cycles) of an actor. The reasoning behind this definition will be given in the following chapter.

- Idle(p) = N(BPE(p)) \ ⋃_{τ ∈ BT−1(p)} {S(τ) + {0, 1, ..., WCEC(τ)−1}} returns the set of idle intervals of a processing element. It is obtained if, from the total number of cycles N(BPE(p)) within the time frame [0, D], we exclude the intervals where the PE is active. These active intervals are calculated from S(τ) and WCEC(τ) as ⋃_{τ ∈ BT−1(p)} {S(τ) + {0, 1, ..., WCEC(τ)−1}}, the union of the execution cycles of the actors τ mapped on the PE p.

Using the above definitions we can extend (4.10) to accommodate multiple frequencies within one or

multiple executions of an actor as:

Ep(τ) = k1 · α · (Vrms(τ))2 · q(τ) ·WCEC(τ) (4.14)


Since the voltage is a time-varying, periodic (with period equal to the deadline) function of time, in equation (4.10) we can use the mean of the different voltage levels within an actor's invocation. By definition, this mean is equal to the root mean square of the voltage, given by equation (4.15). The inner sum in (4.15) covers one invocation of an actor, while the outer sum ranges over the invocations according to the repetition vector q.

Vrms(τ) = √( (Σ_{n ∈ q(τ)} Σ_{m=n}^{n+WCEC(τ)−1} V²(BI(τ), m)) / (q(τ) · WCEC(τ)) )    (4.15)

In the following chapters, we restrict the VF scaling points of each domain to specific events. These domain-specific events are the invocation and completion times of the actors mapped on the particular domain. In this way, instead of exhaustively calculating Vrms over every execution cycle, we only have to consider the intervals within an actor's execution time [S(τ), S(τ) + WCEC(τ)].
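The RMS voltage (4.15) can be sketched as follows; the names voltage_of_cycle and invocation_starts are illustrative assumptions, not thesis notation:

```python
import math

def v_rms(voltage_of_cycle, invocation_starts, wcec):
    """Root-mean-square supply voltage over all invocations of an actor (4.15).
    voltage_of_cycle(n) plays the role of V(BI(tau), n); invocation_starts holds
    the start cycle of each of the q(tau) invocations; wcec is WCEC(tau)."""
    total = sum(voltage_of_cycle(n) ** 2
                for s in invocation_starts
                for n in range(s, s + wcec))
    return math.sqrt(total / (len(invocation_starts) * wcec))
```

With a constant supply voltage the RMS reduces to that voltage, as expected from (4.15).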

We also extend (4.9) to accommodate multiple supply voltages within an idle interval, rather than only the minimum one:

Eidle_p = k2 · Σ_{n ∈ Idle(p)} V(BPE(p), n) · 1/F(BPE(p), n)    (4.16)

This refinement comes from the fact that a PE in a VF domain cannot enter the dormant mode independently. So, while a PE is idle, if any other PE in the same VFD is active, the voltage might have a value other than the minimum one, as shown in figure 4.2c.

We also define the energy dissipated to transmit one token between two NoC routers as:

ENoC_token/hop = ENoC_flit/hop · Nflits/token    (4.17)

Using equations (4.11), (4.14), (4.16) and (4.17), we can formalize the energy dissipation of a data-flow graph under a given schedule as:

Esch = Eactor_tot + ENoC_tot    (4.18)

with

Eactor_tot = Σ_{i ∈ I} Σ_{p ∈ BPE−1({i})} ( Σ_{τ ∈ BT−1({p})} Ep(τ) + Eidle_p )

Ep(τ) = α · C · (Vrms(τ))² · q(τ) · WCEC(τ)

Eidle_p = Ilk · Σ_{n ∈ Idle(p)} V(BPE(p), n) · 1/F(BPE(p), n)

which gives the active and idle energy in all domains that have actors mapped, and

ENoC_tot = ENoC_token/hop · Σ_{τ ∈ T} Σ_{e ∈ src−1({τ})} |BE(e)| · Prate(src(e))

which gives the total energy dissipated for communication. If an edge does not cross VFDs, its energy is considered 0 (then |BE(e)| = 0).
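The total (4.18) can be sketched once the per-actor and per-PE terms are precomputed. All parameter names here are illustrative; this is a minimal sketch, not the thesis implementation:

```python
def total_energy(actor_energy, idle_energy, hops, prod_rate, e_token_hop):
    """E_sch (4.18): active plus idle energy and NoC energy.
    actor_energy: maps each actor to its Ep(tau) from (4.14);
    idle_energy: maps each PE to its Eidle_p from (4.16);
    hops: maps each edge e to |BE(e)|; prod_rate: maps e to Prate(src(e));
    e_token_hop: the per-token, per-hop NoC energy (4.17)."""
    e_actor = sum(actor_energy.values()) + sum(idle_energy.values())
    e_noc = e_token_hop * sum(hops[e] * prod_rate[e] for e in hops)
    return e_actor + e_noc
```

An intra-domain edge simply appears with hops[e] == 0 and contributes nothing, matching the convention above.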


5 Energy Efficient Scheduling

Our work focuses on multi-processor chips where the processors are clustered into voltage-frequency domains and communicate through an asynchronous NoC. All processing elements in a cluster share the same voltage-frequency pair and go to the dormant mode when the VFD is switched to the idle state. The application is described by the data-flow graph notion presented in the previous chapter. Since the problem of scheduling tasks on such hardware configurations is proven to be NP-hard, we restrict our work to acyclic graphs and we assume that static power dissipation is negligible. Before explaining the heuristic for tackling this problem, we first formulate the optimization problem and observe the differences from the related work.

5.1 Constraint problem formulation

5.1.1 Objective Function

As shown in chapter 4, the dynamic power dissipation is given by formula (4.1). In current technologies, voltage and frequency have an almost linear relation to each other through formula (4.2). Because of this linearity, dynamic power dissipation is almost cubically related to the clock frequency fclk, through equation (4.3). The worst-case execution time needed by a task to complete under fclk is given by equation (3.5), substituting fmax with fclk: WCET(τ) = WCEC(τ)/fclk(τ). Consequently, the energy consumption of the actor is given by:

E(τ) = Pcharge(τ) · WCET(τ) ≈ α(τ) · C · fclk(τ)² · WCEC(τ)    (5.1)

Whereas in the case of individually managed PEs each task can have its own frequency/voltage, in our underlying hardware the frequency of a task depends greatly on the frequency/voltage scaling schedule. By frequency/voltage schedule we mean the time instances when the frequency and voltage are scaled. The scaling moments of frequency and voltage always coincide with each other.

Relative to an actor's scheduling moment, some (or no) voltage/frequency scaling moments can affect its execution time. This fact drives us to incorporate an average frequency for each actor into the energy formula. Based on this average frequency, the energy dissipation can be formulated as follows:

E = Σ_{τ ∈ T} α(τ) · C · fav(τ)² · WCEC(τ)    (5.2)

In order to minimize the overhead of scaling, we assume that these moments also coincide with the invocation or completion of an actor; of course, they do not necessarily invoke VF scaling. One could propose voltage and frequency scaling that is independent of the scheduling of actors. However, our platform and application model do not exhibit properties such as the deep-sleeping property assumed in the above works [46, 22], and keeping a metric for the efficiency of each cycle as in [21] is too expensive, especially for applications that need thousands of cycles to complete. Such an approach would therefore require some sort of criterion to decide how to divide the execution time of a VFD


into segments of constant voltage and frequency. The smaller the width of such an interval, the higher both the complexity of the heuristic and the flexibility to harness energy dissipation. Apart from this trade-off, one should also consider the timing overhead of voltage-frequency scaling when deciding on the width, since the smaller the width is, the more scaling points there are.

The number of tasks mapped on the specific VFD should also be taken into account. Intuitively, one would expect more scaling points as the number of tasks mapped on the VFD grows. The schedule-and-stretch approach used in [10] and the approach in [21] are the two extremes. In [10], there is only one interval, equal to the iteration period, where the voltage and frequency are constant. In [21] we have the other extreme, where the interval is equal to one clock cycle. Although the schedule-and-stretch approach of [10] provides low complexity, we expect that taking into account the switching activity of each task separately will allow more efficient results. The relation between switching activity and energy gain has been discussed in [30]: the higher the switching activity of a task, the higher the energy gain from scaling the frequency and the voltage for this task.

5.1.2 Deriving the constraints

We believe that dividing the execution into intervals with constant frequency and voltage can lead to better results. This approach is also adopted in [46, 29, 22]. However, in [46, 22], segmentation is a natural choice, as these works deal with independent tasks and the mapping is such that the schedule can satisfy the deep-sleeping property. In [29], segmentation is adopted in order to balance the slack within each segment through job migration and apply frequency scaling over the segment's interval. In our work, job migration is not allowed. Instead, we propose another approach to increase the usable slack in each core, which will be described later (see section 5.2.2). As mentioned above, we choose each segment to start either at the invocation or at the completion of a task, so as to increase the energy harnessing while maintaining a simple heuristic.

Graph transformation

Before describing the procedure of defining the VF segments, we first transform the data-flow graph Gs into G∗s. In the transformed graph G∗s, new edges are added, when necessary, to capture the dependencies between actors that share the same resource. Furthermore, each edge has an associated communication cost, which is derived from the mapping step. Assuming that the mapping has been done, we first define the ordering of execution. The ordering is done in such a way that it respects the data and resource dependencies, and is based on the ALAP start time given by equation (3.4). Ties are broken arbitrarily. We follow the approach of [39], which describes how to add precedence edges to the initial data-flow graph based on the mapping and the exact ordering of actor firings. However, this is not possible in general for multi-rate data-flow graphs. The transformed graph has the same set of actors, but a new edge has to be inserted between two actors if they are mapped on the same PE, there is no direct edge between them in the original data-flow graph, and their executions are sequential. Formalizing the above requirements, an edge is added between two actors j and z if:

• ALAPs(j) > ALAPs(z): actor j will execute after actor z

• BT (j) = BT (z): both actors are mapped on the same PE


Figure 5.1: From Gs to G∗s (ALAPs(3) < ALAPs(4))

• There is no actor τ mapped on the same PE which fires between z and j: ∀τ ∈ BT−1(BT(j)) \ {j, z} : ALAPs(τ) > ALAPs(j) ∨ ALAPs(τ) < ALAPs(z). This implies that, on this PE, the actor fired after the completion of actor z is actor j.

• e(z, j) ∉ E: in the original graph, there is no direct edge from actor z to actor j.

The transformed graph, with the resource and data dependencies embedded, is denoted G∗ = (T, E∗).

In figure 5.1, actors 3 and 4 are mapped on the same PE, as are actors 6 and 7. Actors 1, 2 and 5 are mapped on PE 1. Because of the data dependencies between actors 1, 2 and 5, the ordering of execution on PE 1 is straightforward. On the other hand, on PE 2 there is no data dependency between actors 3 and 4, and both depend on the completion of actor 1. Because ALAPs(3) = ALAPs(4), we have to choose the order of execution between them arbitrarily. By choosing 3 to execute before 4, we impose a resource dependency between them. This new dependency is represented by the addition of an edge directed from actor 3 to actor 4.
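The edge-insertion rules above can be sketched as follows. This is an illustrative simplification, not the thesis implementation: sorting the actors of each PE by ALAP start time and linking consecutive firings encodes the "no actor fired between z and j" condition, and ties are assumed already broken. All names (add_resource_edges, pe_of, alap) are hypothetical:

```python
def add_resource_edges(edges, pe_of, alap):
    """Sketch of the transformation Gs -> Gs*: insert precedence edges between
    actors serialized on the same PE. edges: set of (src, dst) pairs;
    pe_of: the binding BT (actor -> PE); alap: ALAP start times (firing order)."""
    new_edges = set(edges)
    by_pe = {}
    for actor, pe in pe_of.items():
        by_pe.setdefault(pe, []).append(actor)
    for actors in by_pe.values():
        actors.sort(key=lambda t: alap[t])     # execution order on this PE
        for z, j in zip(actors, actors[1:]):   # z fires immediately before j
            if (z, j) not in new_edges:        # no direct edge in Gs already
                new_edges.add((z, j))
    return new_edges
```

On the figure 5.1 example, actors 3 and 4 share a PE with no edge between them, so the edge (3, 4) is added.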

The last step is to add the communication cost to edges that cross VFDs (figure 5.2). Edges with a communication cost other than zero are already present in Gs. In order to add the communication cost, we should then check whether two actors, connected by an edge in Gs, are mapped on the same VFD. In figure 5.2, the candidate edges are thus e(1, 3), e(1, 4), e(2, 5) and e(7, 8). All other edges will have an associated communication cost equal to 0.

Formally, an edge connecting actors τ and z will have a WCET ≠ 0 iff BI(τ) ≠ BI(z). Then, a WCET equal to:

WCET(e(τ, z)) = (|BE(e(τ, z))| + Nflits/token · Prate(src(e(τ, z))) − 1) · Lflit/hop    (5.3)

is associated with this edge. In equation (5.3), |BE(e(τ, z))| returns the number of hops between the two domains and Lflit/hop returns the worst-case latency of a hop, which is considered constant and known.
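Equation (5.3) has the shape of a pipelined (wormhole-style) latency: the first flit traverses all hops, and every further flit adds one hop delay. A hedged sketch, with all parameter names illustrative:

```python
def edge_wcet(hops, n_flits_per_token, prod_rate, l_flit_hop):
    """Worst-case NoC latency of an edge per (5.3), read as a pipelined
    transfer: (hops + total_flits - 1) hop delays. hops is |BE(e)|;
    hops == 0 means the edge stays inside one VFD and costs nothing."""
    if hops == 0:
        return 0.0
    total_flits = n_flits_per_token * prod_rate
    return (hops + total_flits - 1) * l_flit_hop
```

For example, 4 flits over 2 hops at unit hop latency take 5 hop delays: 2 for the head flit plus one per remaining flit.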


Figure 5.2: Adding WCET to edges

Segmentation

To divide the iteration period into segments, we follow an approach similar to the one presented in [29]. We consider the transformed data-flow graph G∗s, which describes the functionality of the application, together with the information from the mapping of actors to VF domains and PEs.

A segment is a time interval where the voltage and frequency are constant. The frequency and voltage are considered to be scaled at the beginning of each segment. The segments are sequential and non-overlapping. The number of segments per VFD is greatly affected by the number of actors mapped to the VFD. Segments that share the same pair of voltage and frequency can be clustered together. Assuming that n actors are mapped on a VFD i, the maximum number of segments is 2 · n + 1; this maximum is reached when the invocations and completions of all actors mapped on the VFD i fall at different time moments. We associate with each VFD i ∈ I a set of boundary points B(i) = {b0, ..., bk} to define these voltage-frequency segments. With each boundary point b ∈ B(i) we associate a clock cycle and a frequency. We omit the voltage from the boundary point's definition, since voltage and frequency have a linear relationship.

For each boundary point b ∈ B(i), we associate a clock cycle and a frequency. We omit from the

boundary’s point definition the voltage, since voltage and frequency have a linear relationship.

• We define a boundary point to be a pair of clock cycle and frequency as b → N × F , where Ndenotes the number of cycles, which is domain specific and can be calculated after mapping and

F is set of available frequency levels.

We define the set B to contain all scheduling and completion times of all actors in the data-flow graph, sorted in increasing order. We associate with each domain a subset of B, I → 2^B. This subset contains the boundary points which divide the execution interval into segments of constant voltage and frequency.

• ∀i ∈ I, B(i) = {b0, ..., bk} ⊆ B defines the VFD-specific set of boundary points.

Since the boundary points are invocations of actors or completions of their execution, each domain-specific subset should contain only the boundary points derived from the tasks mapped to the specific domain. As mentioned above, the total number of boundary points associated with a domain i is 2 · Σ_{p ∈ BPE−1({i})} |BT−1({p})| + 1. The segments defined by these points should be sequential and not


overlapping. Last but not least, since each boundary point also represents a possible frequency-voltage scaling point, if two such points coincide, they should share the same frequency-voltage pair. The above discussion can be summarized in the following constraints:

• b = ⟨n, f⟩: a boundary point is a cycle-frequency pair

• b.n = S(τ) or b.n = S(τ) + WCEC(τ): a boundary point coincides with an actor's invocation or completion

• ∀bj, bz ∈ B(i): bj.n ≤ bz.n ⟺ j ≤ z: the set B(i) is sorted with respect to b.n

• ∀bj, bz ∈ B(i): bj.n = bz.n ⟹ bj.f = bz.f: only one segment is defined if two or more actors are invoked at the same time

• b0 ∈ B(i), ∀i ∈ I

• b0.n = 0: there is always a boundary point associated with the start of the frame

• b.f ∈ [fmin, fmax]
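The construction of a domain's boundary-point set can be sketched as follows; boundary points are plain (cycle, frequency) pairs and both function names are illustrative. Deduplicating the event cycles enforces "coinciding points share one VF pair", and adding cycle 0 enforces the b0 constraint:

```python
def make_boundaries(events, freq_of):
    """Build the sorted set B(i) of boundary points for one domain.
    events: invocation/completion cycles of the actors mapped on the domain;
    freq_of(n): the single frequency chosen for a scaling point at cycle n."""
    cycles = sorted(set(events) | {0})        # b0.n = 0 always present
    return [(n, freq_of(n)) for n in cycles]  # sorted by b.n by construction
```

Two actors invoked at the same cycle thus produce one boundary point, not two.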

The boundary points, their properties, and the boundary point subsets have now been defined. One will notice that in the above discussion and definitions we have used cycles instead of time. The intuition behind this is that, with frequency and voltage scaling, the width of clock cycles changes. Frequency and voltage scaling in one domain will have no effect on the boundary points of that domain, something that would not hold if boundary points were defined in terms of time. Frequency and voltage scaling might, however, affect boundary points of other domains. This is the case when there is a direct or indirect dependency between the actor whose invocation or completion (boundary point) is used as a scaling point in one VFD and an actor mapped to another VFD. Presumably, the latter would be a direct or indirect successor of the first actor, and a path would exist between them in the transformed graph G∗.

A segment, as described earlier, is defined between two sequential boundary points:

• sgj.start = bj

• sgj.end = bz

• sgj.f = bj.f

• sgj.st = active ⟺ ∃τ ∈ BI−1(i) | S(τ) = sgj.start.n ∨ S(τ) + WCEC(τ) = sgj.end.n

such that bj, bz ∈ B(i) and no bx ∈ B(i) lies strictly between bj and bz, for i ∈ I. Since with each boundary point we associate one VF pair, we can calculate the time span of each segment as:

timespan(sgj) = (bj+1.n − bj.n) / bj.f    (5.4)

and the timing of each boundary point as:

time(bj) = Σ_{sgi ∈ [sg0, sgj−1]} timespan(sgi),  [sg0, sgj−1] ⊆ SG(i) ∧ bj ∈ B(i)    (5.5)
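Equations (5.4) and (5.5) translate directly into code; boundary points are again (cycle, frequency) pairs and the names are illustrative:

```python
def timespan(b, b_next):
    """Physical duration (5.4) of the segment between boundary points b and
    b_next, executed at the frequency set at the segment's start, b.f."""
    (n0, f0), (n1, _f1) = b, b_next
    return (n1 - n0) / f0

def time_of(boundaries, j):
    """Elapsed time (5.5) of boundary point b_j: the sum of the spans of all
    preceding segments sg_0 .. sg_{j-1} of the sorted set B(i)."""
    return sum(timespan(boundaries[k], boundaries[k + 1]) for k in range(j))
```

For example, 10 cycles at 1 Hz followed by 10 cycles at 2 Hz place the third boundary point at time 10 + 5 = 15.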


Timing constraints

As a next step, we define the timing constraints with respect to the boundary points defined earlier. In order to define the exact timing, we have to consider the VF schedule fixed. Since segments are defined over boundary points, by construction the constraint:

• Σ_{sgj ∈ SG(i)} (sgj.end.n − sgj.start.n) ≥ lb_cycles(i)

which imposes that the sum of the cycle spans of all segments in one island is at least equal to the lower bound on cycles (4.13), always holds. SG(i) is the island-dependent set of segments.

Based on equation (5.5), the invocation and completion times of all actors of the data-flow graph should not exceed the deadline D:

• time(bi) ≤ D, ∀bi ∈ B

Computation of favg

We have to compute the average frequency of an actor, so that we can compute its energy consumption. A frequency and a voltage are associated with each boundary point, as described earlier, and boundary points are defined by the invocations and completions of actors. Thus, we can find a subset of the ordered set B(i) containing all boundary points from the invocation of an actor to its completion. Formally, we define the actor-specific boundary point set P(τ) to be a subset of B(i), i.e. P(τ) ⊆ B(i), where i is the domain the actor τ is mapped to. We recall that both S(τ) and S(τ) + WCEC(τ) define boundary points in B(i). To form the set P(τ), we take from B(i) only those boundary points that fall in the interval [S(τ), S(τ) + WCEC(τ)]:

P(τ) = [S(τ), S(τ) + WCEC(τ)]    (5.6)

     = {bj ∈ B(i) | S(τ) ≤ bj.n ≤ S(τ) + WCEC(τ)}    (5.7)

Having associated a subset of boundary points with an actor, we can compute its average frequency as:

favg(τ) = Σ_{j=2}^{|P(τ)|} ((bj.n − bj−1.n) / WCEC(τ)) · bj.f    (5.8)

where bj ∈ P(τ) for all j ∈ [1, |P(τ)|].
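The cycle-weighted mean (5.8) can be sketched as below. The weights follow the equation as written in the thesis, i.e. each inter-boundary span bj.n − bj−1.n is weighted with bj.f; boundary points are (cycle, frequency) pairs and f_avg is an illustrative name:

```python
def f_avg(p_tau, wcec):
    """Average frequency of an actor per (5.8). p_tau is the sorted list of
    boundary points P(tau) falling in [S(tau), S(tau) + WCEC(tau)];
    wcec is WCEC(tau), the total cycles between the first and last point."""
    return sum((p_tau[j][0] - p_tau[j - 1][0]) / wcec * p_tau[j][1]
               for j in range(1, len(p_tau)))
```

With 4 of 8 cycles at 2.0 and 4 at 1.0 the weighted average is 1.5, as expected.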

Precedence constraints

We can now derive the constraints necessary to preserve the data dependencies between actors in the data-flow graph. If there is an edge directed from actor τ to actor z, then the time of the boundary point associated with the completion of actor τ (denoted for simplicity as bτ) must not exceed the time of the boundary point bz associated with the invocation of z. These precedence constraints can be expressed as:

• time(min_{bj ∈ P(z)} bj.n) ≥ time(max_{bj ∈ P(τ)} bj.n)

with z, τ ∈ T such that e(τ, z) ∈ E∗.

5.1.3 Discussion

We have given above an exhaustive formulation of the problem constraints. We can sum up the optimization problem with its constraints as:

Minimize:

E = Σ_{τ ∈ T} α(τ) · C · favg(τ)² · WCEC(τ)

under:

bi.n = S(τ) ∨ bi.n = S(τ) + WCEC(τ),

bi.f ∈ [fmin, fmax],

∀bj, bz ∈ B(i): bj.n = bz.n ⟹ bj.f = bz.f, ∀i ∈ I,

b0.n = 0 ∧ b0 ∈ B(i), ∀i ∈ I,

sgj.start = bj ∧ sgj.f = bj.f,

time(bi) ≤ D, ∀bi ∈ B,

e(τ, z) ∈ E∗ ⟹ time(min_{bj ∈ P(z)} bj.n) ≥ time(max_{bj ∈ P(τ)} bj.n)

where the expressions for time, favg and P(τ) are given by:

time(bj) = Σ_{sgi ∈ [sg0, sgj−1]} timespan(sgi)

favg(τ) = Σ_{j=2}^{|P(τ)|} ((bj.n − bj−1.n) / WCEC(τ)) · bj.f

P(τ) = {bj ∈ B(i) | S(τ) ≤ bj.n ≤ S(τ) + WCEC(τ)}

In all the related work, either in energy minimization for multi-processor systems where the granularity of VF regulation is the PE, or in cluster-based approaches where the granularity is the VF domain, the workload to run under the optimal frequency is known and constant. In the first case, with individually controlled PEs, the workload considered for VF regulation is the workload of each task/actor separately. In the latter case, the workload considered is: first, the execution cycles of the tasks mapped on the least loaded PE; then, the workload differences between subsequent PEs of the same VF domain, traversed in increasing order of their mapped workload. In this way, the problem is relaxed to multiprocessor energy minimization with individually managed PEs, which is proven to be NP-hard. However, this method only works when scheduling and frequency-scaling an independent task set, and when the switching activity of all actors is considered the same. Under the VF-domain assumption and when dealing with data-flow graphs, the workload considered for energy harnessing is the sum of the workloads of all actors mapped on a VF domain. For energy minimization, the execution of this workload is then stretched until the deadline. In order for this approach to work, two major assumptions were made:


• All actors have the same switching activity

• There is no inter-island communication

The assumption of a single switching activity over all actors does not hold for real applications. The assumption that the data-flow graph can be mapped on one VF domain might not hold for large applications or for VF domains with only a small number of PEs.

Although the formulation of the problem with the notion of segments allows accounting for different switching activities and for inter-island communication, it makes the problem even more complex. The additional complexity comes from the fact that there is no constant workload to be considered for voltage and frequency scaling. If we assume that the VF domain contains enough PEs for the graph under consideration, then VF scaling will indeed not affect the overlapping of actors, since each actor fires after all its predecessors have finished their execution. This happens because the voltage-frequency pair, when scaled down, stretches the execution on all PEs of the domain simultaneously. To tackle this problem we propose a clustering technique, based on the segmentation of the execution time, which creates super nodes whose switching activity is inherited from the clustered actors. Since the switching activity affects the energy gain, taking the switching-activity variation between actors into account when clustering might result in more efficient energy harnessing. This technique, and how to account for the different switching activities while clustering, is presented in section 5.2.3.

Since for now we are considering only one VF domain, VF scaling does not affect the invocations of actors with respect to clock cycles. VF scaling stretches the execution time span by stretching the period of the clock cycles; it changes neither the total number of clock cycles needed by an actor nor, consequently, the invocations of actors expressed in clock cycles. Our system is also constrained to have at least enough execution cycles to complete one iteration of a valid schedule. Looking at the invocations of actors in terms of clock cycles enables us to compute the overlapping of the actors' executions across different PEs and to perform a clustering, relaxing the problem to that of multi-processor energy-efficient scheduling with individually managed PEs. How the scheduling of actors and the clustering of the interleaved execution intervals take effect will be presented in a later section.

Multiple VFDs and variable P (τ)

We want to be able to accommodate very large graphs, mapped across several VF domains. The problem arising when taking inter-domain dependencies into account is that the scheduled invocations, and thus the overlapping of execution times, cannot be considered constant any more, even when they are expressed in clock cycles. To illustrate how VF scaling in one domain can affect invocations of actors in other domains, consider the following scenario. There exist two actors τ, z ∈ T with a direct data dependency, i.e. there is an edge between them in the original graph, and these two actors are mapped on different VF domains:

• e(τ, z) ∈ E: τ is a direct predecessor of z.

• BI(τ) ≠ BI(z): τ and z are mapped on different VFDs.

Assume now that our energy-efficient scheduling algorithm finds that stretching actor τ's execution will lead to a great energy reduction. This stretch will cause the start of execution of edge e(τ, z) to


shift in time. This shifting is equal to the difference between the completion time of τ after VF scaling

and the completion time before scaling. Since the invocation of actor z is given by:

S(z) = max_{e ∈ dst⁻¹(z)} ( S(e) + WCEC(e) ),   e ∈ E*   (5.9)

This stretching of actor τ's execution time can affect z's domain if the shift causes the

completion time of edge e(τ, z) to dominate the completion times of all other edges that have actor z

as dst in the transformed graph G*_s:

max_{e ∈ dst⁻¹(z)} ( S(e) + WCEC(e) ) = S(e(τ, z)) + WCEC(e(τ, z))

This chain of events is not limited to actors that have a direct data dependency. The same

chain of events could be propagated to yet another domain by z's shifting, when considering the transformed

graph.

Following this chain of events, we see that stretching the execution time of an actor in one domain

can cause shifting of actors in that or other domains. Now, S(τ) cannot be considered constant

with respect to clock cycles and, consequently, neither can the boundary points, which are defined

by the invocations and completions of actors. Since S(τ) and bi can be shifted by a frequency scaling

decision, the interleavings among actors are not constant, and the same applies to the set P(τ) defined

in equation (5.6).

Complexity due to variable P(τ)

To demonstrate the increased complexity of the energy efficient scheduling problem under our assumptions,

both for the underlying hardware and for the application modeling, we should look at the actor

dependent set P(τ). If this set were constant during the frequency scaling part of an energy efficient

scheduling algorithm, it would mean that the invocation and completion of actors with respect to

clock cycles and, consequently, the frequency scaling points and the segmentation of the domain's

execution interval would also be constant. This is the case when a single VFD is enough to execute

the graph before the deadline. A constant set P(τ) would allow us to know exactly the workload of

each segment and thus the energy gain after frequency scaling.

Consider the HSDFG of figure 5.3a, where c = WCET is given by equation (5.3), along with the actor

properties and binding information presented in figure 5.3b. Figure 5.4 shows the schedule on nominal

frequency. In order to derive the set P(τ), we should first divide the execution interval of each domain

into segments. The boundary points of the segments are the green dashed lines in figure 5.4. In

order to decide where we should apply DVFS, we should consider the total switching activity within

a segment as well as its timespan. In this way we scale the VF first for the most power consuming

combination of actors. Assume that the segment [3.25, 5.25], where actors 3 and 4 are active simultaneously,

has the maximum energy consumption. We thus decide to lower the VF. After the DVFS,

the schedule will look like the one in figure 5.5a.

The decision made to lower the VF in this segment was based on the combined switching activities of

actors 3 and 4 as well as the timespan of the segment. However, as we described above, the set P(τ)

cannot be considered constant when multiple VFDs are needed.

[Figure 5.3: (a) Sample HSDFG, (b) the corresponding binding:]

Actor  WCEC  VFD  PE
  1      3     2    1
  2      1     2    1
  3      2     1    2
  4      4     1    1
  5      2     2    1
  6      2     1    2
  7      1     1    2

[Figure 5.4: The schedule of the HSDFG of fig. 5.3a on nominal frequency]

From the frequency scaling point of view, this means that a decision that provides the best energy gain in one iteration of the heuristic

can be invalidated by subsequent ones. This invalidation is caused because frequency scaling points

move as actors' invocations and completions move, thus possibly affecting a different amount and kind

(in terms of consumption) of workload than the one considered for scaling down the frequency in

the previous iteration. Moving frequency scaling points to different workloads means that the energy

gain is different. Not having a constant set P(τ) also means that in the objective function (5.2), the

term favg(τ), given by equation (5.8), cannot be considered constant among iterations of a frequency

scaling algorithm, since it depends on the elements of P(τ).

To illustrate this invalidation we will proceed from figure 5.5a. Since there is more idle time

that we could take advantage of, we proceed by finding the next segment with the highest energy

consumption. Assume now that this segment is [3, 5] on VFD2. We proceed by scaling down the

frequency on this segment, as shown in figure 5.5b. It is evident now that the first decision, to scale

the VF on the segment [3.25, 5.25], is no longer valid. This is because the workload affected by this

decision has changed, as the invocation of actor 6 has moved.


[Figure 5.5: (a) DVFS in segment [5.25, 7.525] on VFD1, (b) DVFS in segment [3, 4] on VFD2]


Conclusion

With the above discussion, we demonstrated the difficulty that arises when considering the energy

efficient scheduling of data flow graphs across multiple VFDs. The difficulty stems from the fact that

we cannot associate a fixed amount and kind of workload with a VF scaling point, as is done in all

related works. Differently from related approaches, we propose a clustering of workloads according

to the domain's segmentation, which enables us to reduce the problem to that of energy efficient

multiprocessor scheduling with individually managed PEs and to use one of the many available

and sophisticated heuristics to minimize the energy consumption.

5.2 Our proposal

5.2.1 Useful terms

Before delving into our proposal, we should first introduce some basic terms that are fundamental for its

description. The necessary condition for harnessing the energy consumption is the existence

of intervals where the PEs are idle. Idle intervals can be present at the start and end of the schedule,

as well as between actors' invocations and completions. Idle intervals depend on the mapping of actors

to PEs and VFDs, as well as on the scheduling policy used, and are the result of the dataflow concept

of invocation, i.e., for an actor to be ready to fire, all its predecessors and the relevant communications

should have finished. An idle interval is called slack. Since we divide the domain's execution interval

into segments, we expect that these segments will be either active or idle. Recall that for a

VFD to be idle, all its PEs should be idle at the same time. Since frequency scaling stretches the cycle

period, and thus the execution time, on all PEs simultaneously, we define two terms, namely the

Maximum Usable Slack and the Available Usable Slack, in order to have a metric for the maximum

possible frequency scaling allowable and the potential energy savings per island.

Maximum Usable Slack

The Maximum Usable Slack (MUS) denotes the maximum number of time units that could possibly

be used for energy harnessing through DVFS. This upper bound on stretching depends heavily on

the mapping of actors to PEs and can be derived as:

MUS(i) = D − max_{p ∈ i} ( ET(p) ),   i ∈ I, p ∈ PE   (5.10)

where ET(p) returns the total execution time of the actors mapped on PE p. To get the maximum

available slack, this sum of execution times should be computed under fmax. With this definition,

the Maximum Usable Slack refers to the upper bound on execution time stretching through DVFS.

However, the upper margin on the available slack can actually be lower, because of the mapping and

scheduling policy of actors.
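As a concrete illustration of equation (5.10), the following Python sketch computes the MUS of one island from a static binding. The names `maximum_usable_slack`, `binding` and `wcec` are illustrative assumptions, not from the thesis, and execution time is taken as WCEC/fmax.

```python
# Sketch of equation (5.10): MUS(i) = D - max_{p in i} ET(p).
# The data below reproduces the figure 5.6/5.7 example, for which the
# text states ET(1) = 4, ET(2) = 5 and D = 10.

def maximum_usable_slack(deadline, binding, wcec, f_max=1.0):
    """binding: actor -> PE of the island; wcec: actor -> worst-case
    execution cycles. Returns MUS in time units."""
    et = {}  # total execution time per PE under f_max
    for actor, pe in binding.items():
        et[pe] = et.get(pe, 0.0) + wcec[actor] / f_max
    return deadline - max(et.values())

binding = {1: 1, 2: 1, 3: 2, 4: 2, 5: 2}
wcec = {1: 2, 2: 2, 3: 1, 4: 3, 5: 1}
print(maximum_usable_slack(10, binding, wcec))  # -> 5.0
```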

To illustrate the MUS given a static binding of actors to PEs, we can consider the HSDFG of figure 5.6a.

The actors' WCEC and the corresponding binding information are given in figure 5.6b. On nominal

frequency, the schedule on the two PEs would be the one of figure 5.7. Now, according to equation

(5.10), and since ET(1) = 4 and ET(2) = 5, the MUS of this cluster is equal to 5 time units. Although

these 5 time units represent the upper bound, the actual number of time units that a DVFS algorithm

can use without violating the deadline, represented by the red line in the figure, is equal to 3 time units.

[Figure 5.6: (a) Sample HSDFG and (b) the corresponding binding:]

Actor  WCEC  PE
  1      2    1
  2      2    1
  3      1    2
  4      3    2
  5      1    2

[Figure 5.7: The schedule of the HSDFG of fig. 5.6a on nominal frequency]

Stretching the execution time of any actor by more than 3 time units would cause a deadline violation. Such a

violation is shown in figure 5.8, where the frequency is scaled down to fmax/4 upon the invocation

of actor 3 and scaled back to the nominal value upon the invocation of actor 4. Since actor 2 is active

when the frequency is scaled down, its execution is also stretched.

Available Usable Slack

The Available Usable Slack (AUS) denotes the actual upper bound on stretching the execution time

of actors. As we described above, the available amount of time that can be used for scaling down the

voltage and frequency can be different from the Maximum Usable Slack. However, since we only care

about the idle intervals of VFDs in one iteration of the schedule, and not about the idle intervals at the

granularity of PEs, the calculation of the Available Usable Slack is more complex. For its calculation,

we must determine the intervals where all PEs are idle simultaneously.

[Figure 5.8: Scaling the frequency to fmax/4 on the invocation of actor 3 and back to fmax on the invocation of actor 4]

[Figure 5.9: (a) Sample HSDFG, (b) the corresponding binding:]

Actor  WCEC  VFD  PE
  1      1     1    1
  2      1     1    1
  3      2     1    2
  4      5     2    1
  5      2     1    1
  6      2     1    2
  7      1     1    2

For each island i, the Available Usable Slack is given by:

AUS(i) = ( aus1(i) + aus2(i) + aus3(i) ) / fmax   (5.11)

aus1(i) = min_τ ( S(τ) ),   τ ∈ BI⁻¹({i})

aus2(i) = Σ_{j=2}^{|BI⁻¹({i})|} ( max( S(τ_j), S′_j ) − S′_j ),   S′_j = max( S′_{j−1}, S′(τ_{j−1}) )

aus3(i) = ⌊D · fmax⌋ − max_τ ( S′(τ) )

For the set BI⁻¹({i}) we have BI⁻¹({i}) = {τ_1, ..., τ_{|BI⁻¹({i})|}} with S(τ_{j−1}) ≤ S(τ_j), ∀τ_{j−1}, τ_j ∈ BI⁻¹({i}),

i.e., the actors are ordered by invocation. S(τ) and S′(τ) return the invocation and completion of an actor in terms of absolute

clock cycles. The first term of the equation, aus1(i), returns the idle time from the beginning of

the iteration until the first invocation of an actor. The second term, aus2(i), returns the idle time

between the first invocation and the last completion of an actor. Finally, the last term, aus3(i),

returns the idle time from the latest completion of an actor until the deadline.
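The three terms of equation (5.11) can be sketched in Python as follows. The `(start, completion)` pairs and the helper names are assumptions; the example intervals are one plausible schedule consistent with the totals the text states for figure 5.7 (aus1 = 0, aus2 = 0, aus3 = 3), not the exact per-actor times.

```python
# Sketch of equation (5.11): AUS(i) = (aus1 + aus2 + aus3) / fmax,
# where the actors of island i are given as (S(tau), S'(tau)) pairs
# in absolute clock cycles.
import math

def available_usable_slack(intervals, deadline, f_max=1.0):
    """intervals: list of (invocation, completion) pairs of one island.
    Returns AUS(i) in time units."""
    intervals = sorted(intervals)            # order by invocation S(tau)
    aus1 = intervals[0][0]                   # idle time before first invocation
    aus2 = 0.0
    running_completion = intervals[0][1]     # S'_j, running max completion
    for start, end in intervals[1:]:
        aus2 += max(start, running_completion) - running_completion
        running_completion = max(running_completion, end)
    aus3 = math.floor(deadline * f_max) - max(e for _, e in intervals)
    return (aus1 + aus2 + aus3) / f_max

# Assumed intervals matching figure 5.7's totals (D = 10, fmax = 1):
print(available_usable_slack([(0, 2), (2, 4), (0, 1), (1, 4), (4, 7)], 10))
# -> 3.0
```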

Returning to figure 5.7, for the calculation of the AUS we have:

aus1 = 0,   aus2 = 0,   aus3 = ⌊10 · 1⌋ − 7 = 3

yielding a total of 3 time units as AUS. To give a better intuition for the first two terms of

equation (5.11), we will consider the HSDFG of figure 5.9a. Two VFDs are used to execute this graph.

The binding of the actors to PEs and VFDs is shown in figure 5.9b and the corresponding schedule

on nominal frequency is shown in figure 5.10. Based on this schedule, for VFD1 we calculate the

terms of (5.11) to be:

aus1(1) = 0,   aus2(1) = 1.5,   aus3(1) = ⌊10 · 1⌋ − 7.5 = 2.5

yielding AUS(1) = 4 time units, while for the second VFD we have:

aus1(2) = 1.25,   aus2(2) = 0,   aus3(2) = ⌊10 · 1⌋ − 6.25 = 3.75


[Figure 5.10: The schedule of the HSDFG of fig. 5.9a on nominal frequency]

yielding AUS(2) = 5 time units.

Actor Mobility Window

The Actor Mobility Window (AMW) denotes the interval between the earliest and the latest possible

starting time of an actor, such that data, resource, and timing constraints are all satisfied. The earliest

possible starting time, based on the transformed graph, is given by equation (3.10) and the

latest possible starting time by equation (3.4). The AMW of an actor depends on the invocations

of all its predecessors, since we traverse the graph downwards for the computation of (3.10), and on

the invocations of its successors when computing the ALAP start time. We will use this window to

explore possible actor shifting after scheduling, in order to increase the AUS.

AMW(τ) = [ASAPs(τ), ALAPs(τ)],   τ ∈ T   (5.12)

We use the AMW in order to look for potential AUS increases. Thus, when computing the AMW of

each actor, all other actors are considered fixed in terms of their invocation times. The AMW

is calculated on fmax. Consequently, in formula (3.10), we will not use the ASAPf times

of predecessors and incoming edges, but the actual invocations, given by the function S. In this way,

we expect that the range of the AMW (ALAPs(τ) − ASAPs(τ)) of some actors will be zero. An

empty AMW (ALAPs(τ) = ASAPs(τ)) means that if the actor is shifted, a violation will occur, either

because the data and/or resource precedence constraints are not met or because the deadline is missed.

Referring to figure 5.11, we find the AMW of actor 6 to be the interval [3, 4.5]. The start time

of actor 6 can be shifted within this interval in order to explore possible increases in the AUS of

the two VFDs. On the other hand, the range of the AMW of actor 4 is equal to 0, as ASAPs(4) = ALAPs(4).

Since all other actors are considered fixed with respect to their S, if actor 4 were shifted, this

would result in a precedence constraint violation.

Actor Overlapping Ratio

Finally, one additional term necessary for actor shifting exploration is the Actor Overlapping

Ratio (AOR). The AOR denotes the amount of overlap of actor τ's execution with other actors.

This amount of overlap gives us a hint on how much we can improve the AUS by shifting

actor τ's start time inside its AMW.

[Figure 5.11: The schedule of the HSDFG of fig. 5.9a on nominal frequency. The limits of the AMW for actors 4 and 6 are noted with dashed green lines]

In order to compute the AOR, we should find the idle intervals

within the execution time of an actor. These idle intervals are such that all the PEs, except the one that

actor τ is mapped to, are idle simultaneously. The AOR is calculated after the scheduling process

on fmax has finished. Since the execution interval of an actor is [S(τ), S(τ) + WCEC(τ)], we can find

all actors whose starting or completion times fall in this interval and afterwards form an ordered list

(OL) in terms of their starting times. The elements of this list are thus:

OL(τ) = {z_1, z_2, ..., z_n} such that

z_j ∈ OL(τ) ⟺ z_j ∈ BI⁻¹({BI(τ)}) ∧ ( S(z_j) ∈ [S(τ), S(τ) + WCEC(τ)] ∨ S(z_j) + WCEC(z_j) ∈ [S(τ), S(τ) + WCEC(τ)] ) ∧ S(z_{j−1}) ≤ S(z_j)

Based on the schedule in figure 5.10, we can form the OL for all actors in the HSDFG of figure 5.9a as

in table 5.1. In order to find the sum of the idle intervals within an actor's execution, we can use the

same notion as in formula (5.11), but applied to the actor's execution interval [S(τ), S(τ) + WCEC(τ)]

and modified to examine, in this interval, only the actors that are elements of OL(τ). The three terms

that return the idle time units within this reduced interval are then:

aor1(τ) = max( S(z_1), S(τ) ) − S(τ)

aor2(τ) = Σ_{j=2}^{|OL(τ)|} ( max( S(z_j), S′_j ) − S′_j ),   S′_j = max( S′_{j−1}, S′(z_{j−1}) )

aor3(τ) = S′(τ) − min( S′(τ), S′(z_{|OL(τ)|}) )

Since aor1, aor2 and aor3 return the amount of idle time units within the actor's execution, to

calculate the AOR we use the equation:

AOR(τ) = 1 − ( aor1(τ) + aor2(τ) + aor3(τ) ) / WCEC(τ)   (5.13)
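The three aor terms and equation (5.13) can be sketched in Python as follows; the representation of actors as `(start, completion)` pairs and the helper names are assumptions for illustration.

```python
# Sketch of equation (5.13): AOR(tau) = 1 - (aor1 + aor2 + aor3) / WCEC(tau).

def actor_overlapping_ratio(actor, overlap_list):
    """actor: (S(tau), S'(tau)); overlap_list: OL(tau) as (S(z), S'(z))
    pairs sorted by invocation. Returns AOR(tau)."""
    start, end = actor
    wcec = end - start
    if not overlap_list:
        return 0.0  # no overlapping actors at all
    # aor1: idle time from tau's start until the first OL actor starts.
    aor1 = max(overlap_list[0][0], start) - start
    # aor2: idle gaps between consecutive OL actors.
    aor2 = 0.0
    running = overlap_list[0][1]
    for s, e in overlap_list[1:]:
        aor2 += max(s, running) - running
        running = max(running, e)
    # aor3: idle time from the last OL completion until tau's completion.
    aor3 = end - min(end, overlap_list[-1][1])
    return 1.0 - (aor1 + aor2 + aor3) / wcec
```

For a synthetic actor executing over [2, 6] fully overlapped by actors (2, 4) and (4, 6), the ratio is 1.0; if the second actor instead runs over (5, 6), one idle unit remains inside the interval and the ratio drops to 0.75.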

Based on the schedule of figure 5.10, we calculate the AOR for the actors as in table 5.1.

Actor  OL        AOR
  1    2, 3      0
  2    1, 3      0.5
  3    2, 5      0.5
  4    ∅         0
  5    3, 5, 6   1
  6    5         1
  7    ∅         0

Table 5.1: The OLs and AORs of the actors of the HSDFG of figure 5.9a

With the information from the AOR, we can find the actors for which a possible shifting within their AMW could lead to an increase in the AUS. These actors now constitute a candidate list. Possible actors for shifting

in this case are 1, 2, 3, 4, and 7. To choose among the possible candidates, we use the information

from the AMW. From the candidate list we exclude those actors for which ASAPs = ALAPs and

thus the range of their AMW is equal to zero.

5.2.2 The algorithm

Scheduling

The scheduling of actors under fmax is done according to the latest possible start times. A pseudo-algorithm

for ALAPs-based list scheduling is shown in algorithm 1. For the calculation of the ALAPs

times, the transformed dataflow graph is traversed backwards. Afterwards, all the actors are sorted

in order of increasing ALAPs, and the invocation of each actor is set equal to its ALAPs time. As

we mentioned earlier, to calculate the ALAPs we use equation (3.4). In this way, we move the idle

intervals backwards in time and push the schedule to finish on the deadline. Schedules based on latest

possible starting times are proven to yield comparable or better results, in terms of schedulability

and schedule length, than most other scheduling approaches [30, 23]. A similar approach is adopted

in [29]. The authors observe the restriction to the utilization of the idle intervals when these are

distributed close to the end of the frame. However, our frequency scaling scheme and actor shifting

approach are completely decoupled from the scheduling of the graph, enabling the usage of any

scheduling algorithm. Since scheduling actors according to their latest possible starting times leads to

a minimized schedule length, we expect that the metrics MUS and AUS will be reasonably close to

each other. Based on the ALAPs list scheduling approach, the schedule of figure 5.10 is transformed into

the one in figure 5.12. Segmentation of the execution interval and clustering at this point will thus lead

to a very energy efficient solution. We described how segmentation of the execution interval is done in

previous sections. The clustering will be described later. We expect, however, that there will be some

room for further improvement in the AUS, even when the schedule length is minimal. Increasing the

available usable slack will allow for better energy savings. This increase in the AUS is achieved through

actor shifting, as will be described next.
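The backward traversal and sorting step can be sketched in Python. This is an illustrative reconstruction, not the thesis' algorithm 1: it handles data dependencies only and ignores per-PE resource ordering, and all names (`alap_list_schedule`, `successors`, `wcec`) are assumptions.

```python
# Sketch of ALAPs-based list scheduling: traverse the graph backwards
# so that each actor finishes before the earliest ALAP start of its
# successors (or the deadline), then set invocations to the ALAP times.

def alap_list_schedule(successors, wcec, deadline):
    """successors: actor -> list of data-dependent successors.
    Returns actor -> ALAP start time (invocation)."""
    alap = {}

    def latest_start(actor):
        if actor not in alap:
            succ = successors.get(actor, [])
            latest_finish = min((latest_start(z) for z in succ),
                                default=deadline)
            alap[actor] = latest_finish - wcec[actor]
        return alap[actor]

    for a in wcec:
        latest_start(a)
    # Invocations are the ALAP times, sorted increasingly.
    return dict(sorted(alap.items(), key=lambda kv: kv[1]))

# A small chain 1 -> 2 -> 3 with WCECs 2, 3, 1 and deadline 10:
print(alap_list_schedule({1: [2], 2: [3]}, {1: 2, 2: 3, 3: 1}, 10))
# -> {1: 4, 2: 6, 3: 9}
```

Note how the idle interval [0, 4] is pushed to the front of the frame, so the schedule finishes exactly on the deadline.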

Actor Shifting

With the shifting of an actor's start time within the AMW, we want to achieve a better overlapping

and thus increased AUS and energy saving capabilities. However, we expect that only a few

actors will have an AMW with range greater than zero. This is

because the ASAPs times of actors in equation (5.12) are calculated within the reduced interval


[Figure 5.12: ALAPs-based schedule of the HSDFG of fig. 5.9a on nominal frequency; the reduced intervals for AMW calculation in VFD1 and VFD2 are marked]

[ min_τ( S(τ)/fmax ), max_z( (S(z) + WCEC(z))/fmax ) ], where τ, z ∈ BI⁻¹({i}), as shown in figure 5.12. This reduced interval is

different for each island and depends on the mapping of actors and the scheduling policy. We confine

the calculation of the ASAPs times because we want to increase the idle time as much as possible within

this reduced interval. In this way, the first and third terms of equation (5.11) remain constant and the

second term, which represents the idle times within the mentioned interval, is increased.

A shifting is valid if two conditions are met: firstly, the AUS in the current island does not decrease,

and secondly, the AUS in all other islands either remains intact or increases. If a shifting results in a

decrease in the slack time, the actor is removed from the candidate list. Thus, we only make positive

steps towards increasing the AUS.

Finally, we also expect that within the AMW there might exist more than one solution that leads to

the same increase in AUS. If this is the case, then shifting the actor to the earliest possible event will

lead to better, or at least the same, results in potential AUS increase than choosing any other possible

event later in time.

The Shifting algorithm

To reduce the complexity of the algorithm, we only consider shifting a candidate actor to events

within its AMW. These events are invocations and completions of actors that fall into the AMW.

In each iteration of the algorithm, the AUS of all domains is re-evaluated. If there is no decrease

in AUS in any domain, the exploration for the actor proceeds. Otherwise, the actor returns to

the previous best scheduling point and is afterwards removed from the list of candidates.

Algorithm 2 is responsible for deriving the list Cd of candidate domains on which to perform actor shifting.

In algorithm 2, the MUS and AUS are calculated using equations (5.10) and (5.11) respectively.

Algorithm 3 is responsible for deriving a list of candidate actors Cτ for shifting on the reduced interval

[min(S(τ)), max(S(z) + WCEC(z))] of each domain, as described in the previous subsection. This

reduced interval is induced by changing the arrival time A(τ) of actors. Each actor that has an AMW

with range greater than zero is placed in a candidate list Cτ for shifting. Furthermore, we associate with each

such candidate a set of actors OL whose invocations fall into the AMW. The intuition behind sorting


the candidate list Cτ in order of increasing starting times, or rather absolute starting cycles, is that

we do not want a later shifting to invalidate earlier ones. This invalidation might be the result of a

new AMW narrower than the one used when the actor was shifted. Exploring all possible actors in

that order allows us to have a safe AMW when re-evaluating the ASAPs time in algorithm 4. There,

for each candidate actor, we should first check whether the associated AMW has changed. This can be the

result of the shifting of a predecessor actor. We check this by recalculating the ASAPs on the reduced

interval. If it is equal to AMW.start, there is no change; otherwise, we should update the

list OL. Before the shifting, we store AOR(τ) and S(τ), as well as the AUS of all the domains.

Finally, we explore the consequences on the AUS by aligning the invocation of the actor with the

invocations S(j) of the actors j ∈ OL(τ). Upon an increase in the AOR, we check the impact on the

AUS of all domains. A shifting is accepted if there is no negative impact on the AUS.
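The accept/rollback core of this exploration can be condensed into the following sketch. It is an illustrative reconstruction, not the thesis' algorithms 2-4: the three helper callables (`aus_of_all_domains`, `amw_events`, `apply_shift`) are hypothetical stand-ins for the AUS evaluation of (5.11), the AMW event list, and the schedule update, and the removal of rejected actors from the candidate list is omitted.

```python
# Greedy shifting exploration: try each AMW event for each candidate
# (in order of increasing start cycle) and keep a trial only if no
# domain's AUS decreases; otherwise stay at the previous best point.

def explore_shifts(candidates, schedule, aus_of_all_domains,
                   amw_events, apply_shift):
    """candidates: actors sorted by increasing start cycle;
    schedule: actor -> invocation. Returns the improved schedule."""
    for actor in candidates:
        best = dict(schedule)
        best_aus = aus_of_all_domains(best)
        for event in amw_events(actor, best):
            trial = apply_shift(best, actor, event)
            trial_aus = aus_of_all_domains(trial)
            # Accept only if the AUS of every domain is preserved or
            # increased ("positive steps" only).
            if all(t >= b for t, b in zip(trial_aus, best_aus)):
                best, best_aus = trial, trial_aus
        schedule = best
    return schedule
```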

5.2.3 Clustering

In the above subsections, we described how, given a data flow graph and its mapping to the underlying

architecture, we can schedule the actors and shift them to increase the AUS of a domain. Based on

this new schedule, we can now consider the sets B(i) as constant and proceed with clustering the

interleaved execution intervals of actors.

We will refer to the new clustered actors as super nodes. Each of these new actors inherits the timespan

of its segment as its WCEC, as well as a switching activity that is equal to the sum of the switching

activities of the actors active in the segment.

Formalizing the above process of clustering the interleavings of actors based on the set SG(i), for

the worst case execution cycles of the new node we have:

∀sg_j ∈ SG(i) with sg_j.st = active:   T′ := T′ ∪ {sn_j}   (5.14)

WCEC(sn_j) = sg_j.end − sg_j.start

The WCEC of the new super node sn_j depends only on the timespan of the related active

segment. Since the boundary points are defined over invocations or completions of actors, all actors

active in the segment sg_j contribute equally to the switching activity of the new node. Thus, for the

switching activity of the new node we have:

α(sn_j) = Σ_τ α(τ),   τ ∈ BI⁻¹({i}) | S(τ) = sg_j.start ∨ S(τ) + WCEC(τ) = sg_j.end   (5.15)

Apart from the execution time and the switching activity of the new node, we should also define the

data dependencies with nodes mapped on other VFDs, i.e., we also have to define the set E′. The

data dependencies of the new super nodes are inherited from the actors whose invocations and

completions defined the active segment. Thus, the incoming edges are inherited from the actors

whose invocations are equal to the start of the segment. In the same way, the outgoing edges are

inherited from the actors whose completions are equal to the segment's end. Having fixed the data


Actor  OL               AOR  AMW
  1    2, 3             0    [2, 2]
  2    1, 3, 4, 6, 7    1    [4, 4]
  3    1, 2, 4          1    [4, 4]
  4    3, 2, 6, 7       1    [6, 6]
  5    6, 7, 8          1    [9, 9]
  6    4, 7, 5          1    [8, 8]
  7    2, 6, 5, 8       1    [8, 8]
  8    5, 7             0    [10, 10]

Table 5.2: The OLs, AORs and AMWs of the actors from the HSDFG of figure 5.13a

dependencies, we add edges between subsequent nodes to define the execution ordering:

e(sn_i, sn_j) ∈ E′ ⟺ ∀τ, z ∈ T | ∃e(τ, z) ∈ E* ∧ ( S(τ) + WCEC(τ) = sg_i.end ) ∧ ( S(z) = sg_j.start )   (5.16)

Now, a VFD can be considered as a PE and, in this way, we relax our problem to that of energy

efficient scheduling of data flow graphs on individually managed PEs.

We will illustrate the process of clustering using the HSDFG of figure 5.13. Based on the information

on the WCEC of the actors and the binding information on PEs, we derive the ALAPs-based schedule

shown in figure 5.14a and we proceed with the segmentation of the frame. First we derive the domain

specific boundary list B(i) as B(i) = {0, 2, 4, 4, 6, 6, 8, 8, 9, 9, 10, 10, 12}. From the set B(i) we can

remove the duplicate entries and derive the segments as explained previously. The set SG(i)

will then be SG(i) = {[0, 2], [2, 4], [4, 6], [6, 8], [8, 9], [9, 10], [10, 12]}. The segments are indicated in

figure 5.14b with the green dashed lines. We should mention here that, before the segmentation of

the frame and after the ALAPs-based scheduling, one could first go through the shifting procedure in

order to increase the AUS of the domain. From the ALAPs-based schedule, however, we get table

5.2 with the OL, AOR and AMW for each actor. We note also that the MUS for this domain, given

by equation (5.10), is equal to 4, while the AUS, given by equation (5.11), is equal to 2. The last step is to

cluster the interleaved execution intervals and form the super nodes. The clustered graph Gs(T′, E′) is shown in figure 5.15.
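The segmentation and super-node formation can be sketched in Python as follows. The container shapes and names are illustrative assumptions, and the activity summation uses a simple overlap test for "active in the segment" rather than the boundary condition written in (5.15).

```python
# Sketch of segmentation plus super-node formation (eqs. 5.14-5.15):
# deduplicate the boundary list B(i), pair consecutive points into
# segments, and emit one (WCEC, activity) super node per active segment.

def derive_segments(boundaries):
    """Deduplicate B(i) and pair consecutive boundary points."""
    pts = sorted(set(boundaries))
    return [(pts[k], pts[k + 1]) for k in range(len(pts) - 1)]

def cluster_super_nodes(actors, boundaries):
    """actors: list of (S(tau), S(tau)+WCEC(tau), alpha(tau)) of one
    island. Returns (wcec, activity) per active segment."""
    super_nodes = []
    for seg_start, seg_end in derive_segments(boundaries):
        # A segment is active if some actor executes during it.
        active = [a for a in actors if a[0] < seg_end and a[1] > seg_start]
        if active:
            wcec = seg_end - seg_start            # timespan of the segment
            activity = sum(a[2] for a in active)  # summed switching activity
            super_nodes.append((wcec, activity))
    return super_nodes

# Boundary list of the figure 5.13 example (duplicates removed inside):
B = [0, 2, 4, 4, 6, 6, 8, 8, 9, 9, 10, 10, 12]
print(derive_segments(B))
# -> [(0, 2), (2, 4), (4, 6), (6, 8), (8, 9), (9, 10), (10, 12)]
```

The printed segment list matches the set SG(i) derived in the text for figure 5.14.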

However, before adopting any of the existing methods for energy efficient scheduling, we should impose

one more constraint on our system, which originates from the clustering and from the fact that we do

not allow an actor's execution to be suspended. Due to clustering and inter-domain communications,

it is possible that, after frequency scaling, the invocation of a super node moves in time. This

actually means that the execution of all actors forming this super node moves in time, which

might cause the suspension of the execution of one or multiple actors that also participate in the

clustering of another super node. In order to tackle this problem, we should add one more constraint

targeting super nodes sharing common actors. These super nodes are mapped on the same PE

(the abstraction of a VFD) and there exists an edge between them, i.e., their executions are sequential in


[Figure 5.13: (a) sample HSDFG and (b) WCEC and binding information:]

Actor  WCEC  PE
  1      2    1
  2      4    1
  3      2    2
  4      2    2
  5      1    2
  6      1    2
  7      2    1
  8      2    2

time:

S(sn_i) + WCEC(sn_i) = S(sn_j) ⟺ ∀sn_i, sn_j ∈ T′ | BI(sn_i) = BI(sn_j) ∧ e(sn_i, sn_j) ∈ E′ ∧ ∃τ ∈ T | BI(τ) = BI(sn_i) ∧ S(τ) ≤ sg_i.start ∧ S(τ) + WCEC(τ) ≥ sg_j.end   (5.17)

The constraint (5.17) will be taken into account when applying DVFS to the new graph G′(T′, E′).

In the above, we used for the super nodes the mapping function BI that we defined for actors. The

rationale behind using the same function is that a super node is actually a set of actors clustered

together. Since all these actors are mapped on the same VFD, the mapping function can also return

the VFD of a super node.

5.2.4 DVFS Scheduling

If we allow an actor's execution to be suspended, for the reasons described earlier, then constraint

(5.17) need not be taken into account. Then, all heuristics for energy efficient scheduling of data-flow

graphs on multiprocessor platforms can be applied to the clustered graph G′(T′, E′).

Since energy efficient scheduling in multiprocessor systems is NP-hard [46], even for the general

cases (that of periodic independent tasks [28]), we will concentrate on heuristics that allocate a unit

of slack to actors in order to minimize the energy consumption. Among the available approaches, the

PathDVS heuristic found in [18] provides results comparable to the LPDVS algorithm presented in [47] and is

extended to take the communication costs into account.

The basic notion behind the PathDVS scheduling heuristic is to find actors in the graph that can

share a unit of slack. By sharing we mean that, by decreasing the operating frequency and extending


[Figure 5.14: The clustering procedure: (a) ALAPs schedule of the graph of figure 5.13, (b) segmentation, (c) clustering]

[Figure: the clustered graph, with super nodes labelled 1, 23, 24, 76, 75, 8]

Figure 5.15: The clustered Gs(T′, E′) from the HSDF in figure 5.13


the execution of two or more actors, the total time span of the schedule increases by only one unit of slack. Evidently, there can be no dependency (resource or data) between actors that share a unit of slack, and for this reason the authors refer to these actors as compatible actors. Each actor has a different amount of slack by which its execution can be extended, due to precedence and resource constraints. A mapping that places unrelated actors on the same path is expected to reduce the available slack of actors, compared to a more efficient mapping. However, the mapping heuristic is outside the scope of this work and, at this phase (we operate on the graph G′), it has already been performed. The available slack of each actor is the difference between its earliest and latest possible starting times; these two values can be calculated through equations (3.10) and (3.4) when considering the new graph G′.

As defined in section 2.1, p(τ, z) denotes the path in the graph from actor τ to actor z. Two actors are compatible if and only if they belong to different paths:

∀τ, z ∈ T ′ |m(τ, z) = 1 ⇐⇒ p(τ, z) = ∅ (5.18)

m(τ, z) is a boolean denoting the compatibility of actors τ and z. In this way, a compatibility list containing all compatible actors from the graph can be created for each actor. Consequently, we can form the compatibility list of an actor τ as:

compatibility list(τ) = {z|m(τ, z) = 1}, τ, z ∈ T ′ (5.19)
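The tests (5.18)–(5.19) can be sketched as follows, interpreting "τ and z belong to different paths" as neither actor being reachable from the other along the precedence edges; the adjacency-set representation of the graph is our assumption:

```python
# Sketch of eqs. (5.18)/(5.19): two actors are compatible iff p(tau, z)
# is empty in both directions, i.e. neither can reach the other.

def compatibility_lists(succ):
    """succ maps each actor to the set of its direct successors in G'.
    Returns, for every actor, the set of actors compatible with it."""
    actors = list(succ)

    def reachable(a, b, seen=None):
        # Depth-first test for the existence of a path p(a, b).
        seen = set() if seen is None else seen
        if a == b:
            return True
        seen.add(a)
        return any(reachable(n, b, seen) for n in succ[a] if n not in seen)

    return {t: {z for z in actors
                if z != t and not reachable(t, z) and not reachable(z, t)}
            for t in actors}
```

For the diamond-shaped graph 1 → {2, 3} → 4, only actors 2 and 3 end up in each other's compatibility lists.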

In order to minimize the energy consumption, the compatibility list is used to find actors that, by sharing a unit of slack, maximize the total energy reduction. The approach used to find the optimal solution is a branch and bound search over the solution space. From all compatible combinations, the branch and bound method determines the actor, or combination of actors, whose allocation of a unit of slack leads to the maximum energy reduction.

The energy reduction of an actor is defined as the difference in energy consumption before and after the slack allocation. Apart from the compatibility list, the authors associate with each actor an explorable list containing the actors that should be searched as child nodes in the solution space tree. The explorable list contains all available actors, corresponding to the intersection of the compatibility lists from the root to the particular node; the explorable list of a node without parents is thus equal to its compatibility list. To search the tree effectively, the authors propose a Depth First Search and, by maintaining a lower bound on the energy reduction over the traversed paths, eliminate paths where a better solution cannot be found. At each node a cost function is calculated and compared with other possible solutions. This cost function is the sum of the energy reductions of all nodes up to the node under consideration, plus the sum of the energy reductions of all actors in the node's explorable list. At each point this cost is compared with the lower bound; if the cost is greater, the lower bound is updated. An initial value for the lower bound can be found as:

lower_bound = Σ_{τ ∈ T′} energy_reduc(τ) / (|T′| + |compatibility_list(τ)| + 1)        (5.20)
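Under the same assumed data layout, the initial lower bound of equation (5.20) is a short computation; the names energy_reduc and comp_lists are ours:

```python
# Sketch of the initial lower bound of eq. (5.20). energy_reduc maps each
# actor to the energy reduction obtained by giving it one unit of slack;
# comp_lists is the compatibility list of eq. (5.19).

def initial_lower_bound(energy_reduc, comp_lists):
    n = len(energy_reduc)  # |T'|
    # Each actor's reduction is damped by |T'| + |compatibility_list| + 1.
    return sum(energy_reduc[t] / (n + len(comp_lists[t]) + 1)
               for t in energy_reduc)
```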

To reduce the search space, the authors propose the identification of fully dependent, fully independent and compressible actors. A fully independent actor is one that is present in all assignment paths; the compatibility list of such an actor contains all the other |T′| − 1 actors. The energy reduction from allocating a unit of slack to a fully dependent actor is compared with those of the other candidates, and such an actor is not included in the search. Fully independent actors, on the other hand, are allocated a unit of slack irrespective of the energy reduction. Last but not least, compressible actors are those that share the same compatibility list. These actors are clustered together and represented by the member with the highest energy reduction. In this way, there is a substantial reduction in the runtime requirements.
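The compressible-actor reduction can be sketched as a grouping by identical compatibility lists, keeping per group the member with the highest energy reduction; the data layout and tie-breaking are our assumptions:

```python
# Sketch: actors sharing the same compatibility list are grouped, and
# each group is represented by its member with the highest energy
# reduction, shrinking the branch-and-bound search space.

def compress(comp_lists, energy_reduc):
    groups = {}
    for actor, comp in comp_lists.items():
        # frozenset makes the compatibility list usable as a grouping key
        groups.setdefault(frozenset(comp), []).append(actor)
    return {key: max(members, key=energy_reduc.get)
            for key, members in groups.items()}
```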

5.2.5 Extension of PathDVS

PathDVS could be directly applied to the clustered graph G′ if suspension of an actor's execution were allowed. To illustrate why we do not allow actor executions to be suspended, we give a brief but intuitive example.

Suppose that a VFD contains 16 PEs and that a super node sn_j in the graph G′ was formed by clustering parts of the executions (equal to the segment's time span) of 16 actors. After applying PathDVS, it is possible that the execution of one of these 16 actors, and consequently the invocation of sn_j, will move in time by a unit of slack, according to the precedence constraints. If the previously invoked super node sn_i shares at least one actor with sn_j, then this shifting of sn_j's invocation will leave at least one PE idle.

To avoid this behavior, PathDVS should be extended to check whether constraint (5.17) is satisfied before the allocation of a unit of slack. In this way, both the energy reduction and constraint (5.17) are taken into consideration when the candidate path in the solution tree is decided.

5.3 Conclusion

We demonstrated in this chapter the difficulties that arise when the energy efficient scheduling of dataflow graphs on many-core platforms is considered. Existing heuristics for energy minimization, targeted at platforms where each PE can operate at its own voltage and frequency, are expected to yield only minor gains. Moreover, the different switching activities among actors should be taken into account when applying such techniques. Starting from a fixed binding of actors to PEs and VFDs, we schedule the actors with the ALAPs based list scheduling technique. After examining the AUS and the MUS of each domain, we proceed with an iterative shifting of actors within their AMW so as to increase the AUS and consequently the potential energy reduction. Once the invocations of actors are fixed, we apply a clustering approach that creates super nodes with inherited switching activities and precedence constraints. We are then able to abstract each VFD as an individually managed PE. Finally, we can apply any of the available heuristics that tackle the energy efficient scheduling of precedence constrained graphs on individually managed PEs.


6 Future Work

Due to timing constraints, this project focused only on the simple SDF MoC. However, we managed to bring up most of the difficulties that arise from the incorporation of state of the art computing platforms, such as the P2012, and to formalize the optimization problem of energy efficient scheduling. There are a number of ways, though, in which this work can be extended in order to provide a more complete and suitable energy efficient algorithm for applications running on such platforms. The list below names just a few:

• Validate the results of the proposed approach incorporating various heuristics

• Extend the dataflow MoC considered to other more expressive ones

• Consider the characteristics of other many-core platforms or interconnection infrastructures

• Extend the work to multi-criteria scheduling heuristics

6.1 Validation of the proposal

After formalizing all the constraints for the optimization problem at hand, the next step would be to

provide results on the energy efficiency of the proposed approach. Because the clustering and energy

efficient scheduling are completely decoupled, one could proceed by incorporating different heuristics

available for energy minimization and compare them in terms of efficiency and complexity. Since the

binding of actors to PEs and VFDs also affects the outcome, one could also experiment with different

binding heuristics.

6.2 Extension of the MoC

This work focused only on the SDF MoC and, more specifically, on the derived HSDF. However, other more expressive MoCs, such as the PSDF and HDF, are better suited for describing DSP applications. The characteristics of these MoCs, and how they affect the energy efficient scheduling, should then be studied in depth. The first step towards the expansion of the approach would be to incorporate

actors with time varying production and/or consumption rates. Of course such a situation would affect

the repetition vector as well as the communication cost. In this work both the repetition vector and

the production/consumption rates were considered constant. Another extension would be to study

the possibility of different modes of operation per actor. Such modes of operation greatly affect the resource requirements and consequently the energy dissipation. A MoC that extends SDF towards this scenario-based execution is the SADF MoC [41].

6.3 Extension to other platforms

Although P2012 is a state of the art computing platform, this work could be extended to other platforms and interconnection infrastructures. Communication infrastructures such as the Nostrum NoC [34] or the Æthereal NoC [15] could also be studied. Different clocking schemes that are tailored for power and latency reduction could be taken into account and compared against the asynchronous operation of the ALPIN NoC. Such a scheme, for 2D mesh NoCs, is the Globally Pseudochronous Locally Synchronous approach, where a clock with a constant phase difference is distributed to the NoC routers [35]. The incorporation of different communication interconnects will most probably affect the communication cost between actors, as well as the energy dissipation for communication, which in this work was considered negligible.

6.4 Extend to multi-criteria scheduling heuristics

The heuristic can be extended to accommodate more than one optimization objective. Apart from the energy consumption, such objectives can be the schedule makespan, the reliability of the system [13], the buffer requirements, and the software/hardware implementation cost [48]. This can be done based on the Pareto point algebra [12]. When two or more objectives need to be optimized, a multidimensional solution space is created. From the possible solutions, the dominant ones are found and form the so-called Pareto front. Points on this front dominate any solution outside it. After creating the Pareto front, a solution can be chosen according to additional criteria.
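The Pareto-front filtering described above can be sketched as follows, for two or more objectives that are all minimized (e.g. energy and makespan); the tuple representation of solutions is our assumption:

```python
# Sketch of Pareto-front filtering: each candidate solution is a tuple of
# objective values, and only the non-dominated points are kept.

def pareto_front(points):
    def dominates(p, q):
        # p dominates q: no worse in every objective, strictly better
        # in at least one (minimization in all dimensions).
        return (all(a <= b for a, b in zip(p, q))
                and any(a < b for a, b in zip(p, q)))
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]
```

A final solution can then be picked from the returned front according to additional criteria, e.g. a weighted sum of the objectives.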


A Pseudo Algorithms

Algorithm 1 ALAP scheduling

get BI⁻¹(i)
sort BI⁻¹(i) in increasing order of ALAPs times
for j = 1 → |BI⁻¹(i)| do
    S(τ) ← ALAPs(τ)
end for

Algorithm 2 Candidate actors for shifting

for i = 1 → |I| do
    if MUS(i) ≠ AUS(i) then
        Cd ← i
    end if
end for

Algorithm 3 AMW calc

for j = 1 → |Cd| do
    for τ = 1 → |BI⁻¹(j)| do
        A(τ) ← min_{i ∈ BI(j)} (S(i))
    end for
    for τ = 1 → |BI⁻¹(j)| do
        AMW(τ).start ← ASAPs(τ)
        AMW(τ).end ← S(τ)
        if AMW(τ).start ≠ AMW(τ).end then
            Cτ ← τ
            for i = 1 → |BI⁻¹(j)| do
                if S(i) ∨ S(i) + WCEC(i) ∈ [AMW(i).start, AMW(i).end] then
                    OL(τ) ← i
                end if
            end for
            sort OL(τ) in increasing order of S(τ)
        else
            A(τ) ← S(τ)
        end if
    end for
end for
sort Cτ in increasing order of S(τ)


Algorithm 4 Shifting

for τ = 1 → |Cτ| do
    amw_old ← AMW(τ).start
    if ASAPs(τ) ≠ amw_old then
        for i = 1 → |BI⁻¹(j)| do
            if S(i) ∨ S(i) + WCEC(i) ∈ [ASAPs(τ), AMW(i).end] then
                OL(τ) ← i
            end if
        end for
    end if
    aor_old ← AOR(τ)
    s_old ← S(τ)
    for i = 1 → |I| do
        aus_old(i) ← AUS(i)
    end for
    for j = |OL(τ)| → 1 do
        S(τ) ← max(s_old, S(j))
        if AOR(τ) ≥ aor_old then
            for i = 1 → |I| do
                if AUS(i) < aus_old then
                    S(τ) ← s_old
                    break
                else
                    s_old ← S(τ)
                    aor_old ← AOR(τ)
                end if
            end for
        end if
    end for
end for


Bibliography

[1] Platform 2012: A Many-core programmable accelerator for Ultra-Efficient Embedded Computing in Nanometer Technology. 2012.

[2] M. Anis, S. Areibi, and M. Elmasry. Design and optimization of multithreshold CMOS (MTC-

MOS) circuits. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions

on, 22(10):1324–1342, October 2003.

[3] H. Aydin and Q. Yang. Energy-aware partitioning for multiprocessor real-time systems. In

Parallel and Distributed Processing Symposium, 2003. Proceedings. International, pages 9–pp.

IEEE, 2003.

[4] M Bariani, P Lambruschini, and M Raggio. VC-1 decoder on STMicroelectronics P2012 archi-

tecture. stday2010.uniud.it, pages 1–3.

[5] E Beigne, F Clermidy, S Miermont, and P Vivet. Dynamic voltage and frequency scaling ar-

chitecture for units integration within a GALS NoC. In Networks-on-Chip, 2008. NoCS 2008.

Second ACM/IEEE International Symposium on, pages 129–138. IEEE, 2008.

[6] E. Beigne, Fabien Clermidy, Hélène Lhermet, Sylvain Miermont, Yvain Thonnart, X.T. Tran, Alexandre Valentian, Didier Varreau, Pascal Vivet, Xavier Popon, et al. An asynchronous power aware and adaptive NoC based circuit. Solid-State Circuits, IEEE Journal of, 44(4):1167–1177, April 2009.

[7] B Bhattacharya and S.S. Bhattacharyya. Parameterized dataflow modeling for DSP systems.

Signal Processing, IEEE Transactions on, 49(10):2408–2421, October 2001.

[8] J.T. Buck and E.A. Lee. Scheduling dynamic dataflow graphs with bounded memory using the token flow model. In ICASSP, pages 429–432. IEEE, September 1993.

[9] J.A. Butts and G.S. Sohi. A static power model for architects. In Microarchitecture, 2000.

MICRO-33. Proceedings. 33rd Annual IEEE/ACM International Symposium on, pages 191–201.

IEEE, 2000.

[10] P. De Langen and Ben Juurlink. Trade-offs between voltage scaling and processor shutdown for

low-energy embedded multiprocessors. Embedded Computer Systems: Architectures, Modeling,

and Simulation, pages 75–85, 2007.

[11] MR Garey. Computers and Intractability: A Guide to the Theory of NP-completeness. 1979.

[12] Marc Geilen and Twan Basten. A calculator for Pareto points. In Proceedings of the conference

on Design, automation and test in Europe, volume 2, pages 285–290. EDA Consortium, April

2007.


[13] A. Girault and H. Kalla. A novel bicriteria scheduling heuristics providing a guaranteed global

system failure rate. IEEE Transactions on Dependable and Secure Computing, 6(4):241–254,

October 2009.

[14] A Girault, B. Lee, and E.A. Lee. Hierarchical finite state machines with multiple concurrency

models. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on,

18(6):742–760, June 1999.

[15] K. Goossens, J. Dielissen, and A. Radulescu. Æthereal network on chip: concepts, architectures,

and implementations. Design & Test of Computers, IEEE, 22(5):414–421, May 2005.

[16] Philippe Grosse, Yves Durand, and Paul Feautrier. Power modeling of a NoC based design for

high speed telecommunication systems. Integrated Circuit and System Design. Power and Timing

Modeling, Optimization and Simulation, pages 157–168, 2006.

[17] Philippe Grosse, Yves Durand, and Paul Feautrier. Methods for power optimization in SOC-based

data flow systems. ACM Transactions on Design Automation of Electronic Systems (TODAES),

14(3):38, June 2009.

[18] Jaeyeon Kang and Sanjay Ranka. Energy-efficient dynamic scheduling on parallel machines. High

Performance Computing-HiPC 2008, pages 208–219, 2008.

[19] H. Kawaguchi, K.I. Nose, and T. Sakurai. A CMOS scheme for 0.5 V supply voltage with pico-

ampere standby current. In Solid-State Circuits Conference, 1998. Digest of Technical Papers.

1998 IEEE International, pages 192–193. IEEE, 1998.

[20] A.a. Khan, C.L. McCreary, and MS Jones. A comparison of multiprocessor scheduling heuristics.

1994 International Conference on Parallel Processing (ICPP’94), pages 243–250, August 1994.

[21] Hong-Sik Kim, Hyejeong Hong, H.S. Kim, J.H. Ahn, and Sungho Kang. Total energy minimization

of real-time tasks in an on-chip multiprocessor using dynamic voltage scaling efficiency metric.

Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions on, 27(11):2088–

2092, November 2008.

[22] Fanxin Kong, Wang Yi, and Qingxu Deng. Energy-efficient scheduling of real-time tasks on

cluster-based multicores. In Design, Automation & Test in Europe Conference & Exhibition

(DATE), 2011, pages 1–6. IEEE, 2011.

[23] Y.K. Kwok and Ishfaq Ahmad. Dynamic critical-path scheduling: An effective technique for

allocating task graphs to multiprocessors. Parallel and Distributed Systems, IEEE Transactions

on, 7(5):506–521, 1996.

[24] Didier Lattard, E. Beigne, Fabien Clermidy, Yves Durand, Romain Lemaire, Pascal Vivet, and

Friedbert Berens. A reconfigurable baseband platform based on an asynchronous network-on-chip.

Solid-State Circuits, IEEE Journal of, 43(1):223–235, January 2008.

[25] E.A. Lee and S. Ha. Scheduling strategies for multiprocessor real-time DSP. In Global Telecom-

munications Conference, 1989, and Exhibition. Communications Technology for the 1990s and

Beyond. GLOBECOM’89., IEEE, pages 1279–1283. IEEE, 1989.

[26] E.A. Lee and D.G. Messerschmitt. Static scheduling of synchronous data flow programs for digital

signal processing. Computers, IEEE Transactions on, 100(1):24–35, January 1987.


[27] E.A. Lee and D.G. Messerschmitt. Synchronous data flow. Proceedings of the IEEE, 75(9):1235–

1245, 1987.

[28] W.Y. Lee. Energy-Saving DVFS Scheduling of Multiple Periodic Real-Time Tasks on Multi-core

Processors. In Proceedings of the 2009 13th IEEE/ACM International Symposium on Distributed

Simulation and Real Time Applications, pages 216–223. IEEE Computer Society, 2009.

[29] J.J.H. Lin. Energy-Efficient Scheduling of Real-Time Periodic Tasks in Multicore Systems. In

Network and Parallel Computing: IFIP International Conference, NPC 2010, Zhengzhou, China,

September 13-15, 2010, Proceedings, volume 6289, page 344. Springer-Verlag New York Inc, 2010.

[30] Jiong Luo and N.K. Jha. Power-efficient scheduling for heterogeneous distributed real-time em-

bedded systems. Computer-Aided Design of Integrated Circuits and Systems, IEEE Transactions

on, 26(6):1161–1170, 2007.

[31] Jiong Luo, N.K. Jha, and L.S. Peh. Simultaneous dynamic voltage scaling of processors and

communication links in real-time distributed embedded systems. Very Large Scale Integration

(VLSI) Systems, IEEE Transactions on, 15(4):427–437, April 2007.

[32] S.M. Martin, K. Flautner, T. Mudge, and D. Blaauw. Combined dynamic voltage scaling and

adaptive body biasing for lower power microprocessors under dynamic workloads. In Proceedings

of the 2002 IEEE/ACM international conference on Computer-aided design, pages 721–725. ACM,

2002.

[33] Sylvain Miermont, Pascal Vivet, and Marc Renaudin. A power supply selector for energy-and

area-efficient local dynamic voltage scaling. In Integrated Circuit and System Design. Power

and Timing Modeling, Optimization and Simulation, pages 556–565, Berlin, Heidelberg, 2007.

Springer.

[34] M. Millberg, E. Nilsson, R. Thid, S. Kumar, and A. Jantsch. The Nostrum backbone-a communi-

cation protocol stack for networks on chip. In VLSI Design, 2004. Proceedings. 17th International

Conference on, pages 693–696. IEEE, 2004.

[35] Erland Nilsson and J. Oberg. Reducing power and latency in 2-D mesh NoCs using globally

pseudochronous locally synchronous clocking. In Proceedings of the 2nd IEEE/ACM/IFIP in-

ternational conference on Hardware/software codesign and system synthesis, pages 176–181, New

York, New York, USA, 2004. ACM.

[36] Ptolemy and N Copernicus. The Almagest; On the Revolutions of the Heavenly Spheres; and

Epitome of Copernican Astronomy: IV and V. The Classical Review, 34(02):299, October 1952.

[37] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand. Leakage current mechanisms and leakage

reduction techniques in deep-submicrometer CMOS circuits. Proceedings of the IEEE, 91(2):305–

327, December 2003.

[38] G.C. Sih and E.A. Lee. A compile-time scheduling heuristic for interconnection-constrained

heterogeneous processor architectures. Parallel and Distributed Systems, IEEE Transactions on,

4(2):175–187, 1993.

[39] S Sriram and E.A. Lee. Determining the order of processor transactions in statically scheduled

multiprocessors. The Journal of VLSI Signal Processing, 15(3):207–220, 1997.


[40] Sundararajan Sriram and S.S. Bhattacharyya. Embedded multiprocessors: Scheduling and syn-

chronization. CRC, 2000.

[41] B.D. Theelen, M.C.W. Geilen, S. Stuijk, S.V. Gheorghita, T. Basten, J.P.M. Voeten, and

AH Ghamarian. Scenario-aware dataflow. Technical Report July, Citeseer, 2008.

[42] Yvain Thonnart, Pascal Vivet, and F. Clermidy. A fully-asynchronous low-power framework for

GALS NoC integration. In Proceedings of the Conference on Design, Automation and Test in

Europe, pages 33–38. IEEE, 2010.

[43] Girish Varatkar and R. Marculescu. Communication-aware task scheduling and voltage selection

for total systems energy minimization. In Proceedings of the 2003 IEEE/ACM international

conference on Computer-aided design, page 510. IEEE Computer Society, 2003.

[44] Lizhe Wang, Jie Tao, Gregor von Laszewski, and Dan Chen. Power Aware Scheduling for Par-

allel Tasks via Task Clustering. In 2010 IEEE 16th International Conference on Parallel and

Distributed Systems, pages 629–634. IEEE, December 2010.

[45] L. Yan, J. Luo, and N.K. Jha. Joint dynamic voltage scaling and adaptive body biasing for

heterogeneous distributed real-time embedded systems. Computer-Aided Design of Integrated

Circuits and Systems, IEEE Transactions on, 24(7):1030–1041, July 2005.

[46] CY Yang, JJ Chen, and T.W. Kuo. An approximation algorithm for energy-efficient scheduling

on a chip multiprocessor. Design, Automation and Test in Europe, pages 468–473, 2005.

[47] Yumin Zhang and XS Hu. Task scheduling and voltage selection for energy minimization. Pro-

ceedings of the 39th annual Design, page 183, 2002.

[48] Jun Zhu, Ingo Sander, and A. Jantsch. Pareto efficient design for reconfigurable streaming appli-

cations on CPU/FPGAs. In Proceedings of the Conference on Design, Automation and Test in

Europe, pages 1035–1040, 2010.