
  • DISSERTATION

    An Energy Aware Framework for Mobile Computing

    submitted in fulfilment of the requirements for the academic degree of Doctor of Technical Sciences

    to the Technische Universität Wien, Fakultät für Elektrotechnik und Informationstechnik (Faculty of Electrical Engineering and Information Technology)

    by

    Dipl.-Ing. Naeem Zafar Azeemi
    Brigittenauer Lande 224/6643, 1200 Wien
    born in Karachi, Pakistan, on 14 August 1968
    Matriculation number: 0327346

    October 6, 2007

  • Advisor

    Univ.Prof. Dipl.-Ing. Dr.techn. Markus Rupp
    Technische Universität Wien
    Institut für Nachrichtentechnik und Hochfrequenztechnik

    Examiner

    Univ.Prof. Dr.phil.nat. Christoph Grimm
    Technische Universität Wien
    Institut für Computertechnik

  • To Amra, Mukashfa and Kunza

  • ABSTRACT

    Energy dissipation has been a critical issue for mobile computing systems since their inception. Although a large research investment in low-energy circuit design and hardware-level energy management has led to more energy-efficient architectures, there is a growing realization that energy conservation must be considered more rigorously at higher levels of the system, such as the operating system and the applications.

    This dissertation puts forth the claim that energy-aware compilation, which improves application quality in terms of both execution time and energy consumption, is essential for high-performance mobile embedded system design. Our work is a design paradigm shift from the logic gate as the basic silicon computation unit to an instruction running on an embedded processor. Multimedia DSP processors are the most attractive choice for mobile computing system design because they deliver high data throughput at low energy. They exploit instruction-level parallelism (ILP) in programs to execute more than one primitive instruction at a time. In this work, we exploit the parallelism slack left by the native multimedia DSP compilers. We propose an iterative compilation environment to optimize a given 'C' source code. The contribution of our framework is the combination of an application profile monitor (APM) with an optimization engine in native multimedia DSP Software Development Environments (SDE). We propose to monitor application behavior at all levels (static, compilation, scheduling, linking and execution). The APMs are then used in an optimization engine to speculate optimal code transformation schemes. These schemes are applied successively across the basic code blocks. We propose two methods for selecting optimization schemes: Gradient Mode Iterative Compilation (GMIC) and Multicriteria Stochastic Iterative Compilation (MSIC). Both are tested on multimedia applications from diverse domains such as video transcodecs (MPEG2, H-264L), audio transcodecs (G-723, MP3) and bioinformatics (Glimmer, Fgene), to name a few.

    Finally, we propose a characterization of application-architecture correlations that supports our claim that ideal performance of a mobile computing system demands a perfect match between hardware capability and program behavior. We present results for 20 multimedia applications run on the TriMedia DSP 1300, the Blackfin DSP ADSP533, and the PIII-850 embedded processor.

    Keywords: Energy Aware, Source-to-Source, Multimedia Processor, Workload Characterization.


  • ZUSAMMENFASSUNG

    Since the advent of mobile computing systems, energy consumption has been a decisive factor. Although numerous research results have already led to hardware solutions with low energy consumption, it has become clear that energy savings at higher levels, such as operating systems and applications, should increasingly be taken into account.

    This dissertation demonstrates that energy-aware compilation reduces execution time and is therefore an essential criterion for ensuring an efficient embedded system for mobile data processing. Our work deals with a new design paradigm that no longer concentrates on individual logic gates as the basic design elements, but is devoted to individual instructions on an embedded processor. Digital signal processors for multimedia applications are the most cost-effective solution for a mobile data processing system to guarantee optimal data throughput time at low energy demand. To this end, they use the instruction-level parallelism (ILP) of programs in order to execute several primitive instructions at the same time. In this dissertation, program parallelization is captured with a special monitor. Furthermore, we propose a stepwise compilation to optimize the given program code in "C". A further contribution is a programming environment for analyzing applications and optimizing them. Here, program behavior is monitored at several levels (static level, compilation, scheduling, linking, and during execution). These analyses are subsequently used by an optimization program to determine an optimal compiler configuration. Two different methods for selecting the optimization options are presented in this work, namely a gradient method and a stochastic method. Both methods are tested with various multimedia applications from different domains such as video coding (MPEG2, H-264L), audio coding (G-723, MP3) and bioinformatics (Glimmer, Fgene).

    Finally, we propose metrics for capturing the correlation between application and hardware, which support our claim that ideal performance of the mobile data processing system can only be achieved when hardware capacity and program behavior match perfectly. The effectiveness of these metrics is shown using the TriMedia DSP 1300, Blackfin DSP ADSP533 and PIII-850 processors.

    Keywords: Energy-aware, source code transformation, embedded systems, multimedia processors, mobile computing, workload characterization

  • ACKNOWLEDGEMENTS

    I would like to thank my teacher Khwaja Shamsuddin Azeemi and my parents, who have had a positive effect on me personally and to whom I owe a debt of gratitude for helping, in one way or another, to influence the person I am today.

    First and foremost, I thank my supervisor Dr. Markus Rupp for his consistent efforts to bring out my inherent skills so that I could accomplish this task successfully. I appreciate his bottomless patience in technical review and his substantive comments, which improved the readability of the dissertation.

    Thanks to my sister Farhi, and brothers Waseem and Nadeem, who provide encouragement whenever I face a seemingly impossible task.

    Thanks to Afsar, Sobia, Shams Sahib, Ana Eliza and Liana for their love, support and great understanding, especially during vulnerable moments.

    Thanks to my friends, colleagues and acquaintances: Bastian and Martin at the Christian Doppler Laboratory; Sabine from Vienna; Naveed and Saima from Boston; Nadeem and family from San Francisco; Amir Malik and family from Korea, for their kind assistance and facilitation during the last 45 months.

    I would like to acknowledge valuable technical support from Dr. Arpad Scholtz at the Institute of Communications and Radio Frequency Engineering, Dr. Stefan Mahlknecht at the Institute of Computer Technology and Aneesa Sultan at the Vienna Bio Center.

    I am also grateful to Dr. Christoph Grimm for his time and patience in reviewing this manuscript.

  • CONTENTS

    1 Introduction
      1.1 Motivation
        1.1.1 Mobile Embedded System Constraints
        1.1.2 IC Fabrication Technology Constraints
        1.1.3 Battery Technology Constraints
        1.1.4 Architecture-Application Correlation Slacks
      1.2 Design Space Exploration
      1.3 Thesis Outline

    2 Energy-Cycle Aware Compilation Framework (ECACF)
      2.1 Energy Saving Techniques - A Review
        2.1.1 Fabrication level power reduction
        2.1.2 Processor level power reduction
        2.1.3 EDA tools level power reduction
        2.1.4 Compiler level power reduction
        2.1.5 Low power data structures
        2.1.6 Idle mode power reduction
        2.1.7 Power reduction in distributed computing systems
        2.1.8 Power reduction in communication systems
        2.1.9 Battery aware power reduction
      2.2 Multimedia DSPCPU Architecture
        2.2.1 Multimedia Processor Execution Model
        2.2.2 Multimedia Processor Operations Overview
      2.3 Workload Description
        2.3.1 Multimedia Applications
        2.3.2 Bioinformatics Workload
      2.4 Energy Cycle Aware Compilation Framework Methodology
        2.4.1 Application Expression Profile
      2.5 Experimental Setup
        2.5.1 Related Work for Energy Measurement
        2.5.2 Proposed Methodology
      2.6 Conclusions

    3 Gradient Mode Iterative Compilation (GMIC)
      3.1 GMIC Architecture
        3.1.1 Performance Qualifier Measurement
        3.1.2 Code Block Queuing
        3.1.3 Code Block Expression Profile
        3.1.4 Transformation Scheme
      3.2 Implementation
      3.3 Example: Optimization of an MPEG-1 encoder
      3.4 Discussion
      3.5 Conclusions

    4 Multicriteria Stochastic Iterative Compilation (MSIC)
      4.1 Introduction
      4.2 Model Development
        4.2.1 Objects and Constraints
        4.2.2 Case Study I - Arbitrary Application
        4.2.3 Case Study II - Nonlinear Interpolative Vector Quantization (NLIVQ)
      4.3 Performance Comparison with GMIC
      4.4 Conclusions

    5 Application-Architecture Characterization
      5.1 Terminologies
        5.1.1 Principal Component Analysis (PCA)
        5.1.2 Scree Plot
        5.1.3 Box Plot
        5.1.4 Scatter Plot
        5.1.5 Differential Application Expression Profile (dAEP)
      5.2 Application Characterization
        5.2.1 Case Study 1
        5.2.2 Case Study 2
        5.2.3 Case Study 3
      5.3 Architecture-Centric Application Characterization
      5.4 Conclusions

    6 Conclusions

    Appendices

    A List of Application Expression Profile (AEP) Monitors

    B VLIW Descriptor File (VDF) Format

    C User Constraints Files (UCF) Format
      C.1 UCF for MPEG-1 encoder example in Section 3.3
      C.2 UCF for NLIVQ example in Section 4.2.3

    D Application Attributes

    E List of Acronyms

  • LIST OF FIGURES

    1.1 Power consumption for Intel CPUs [1].
    1.2 Thermal and power delivery cost in a desktop PC [2].
    1.3 Battery technologies and their capacities [3].
    1.4 Thesis Structure.

    2.1 TriMedia VLIW instruction [4].
    2.2 TriMedia functional unit assignment [4].
    2.3 Transformation methodology.
    2.4 Vertical application profile layers.
    2.5 Experimental setup for instruction/program current measurement [5].
    2.6 Proposed experimental setup for application current measurement at processor and memory.
    2.7 Current consumption for vector quantization (VQ) application execution life cycle.
    2.8 CPU core current consumption versus address range for VQ application.
    2.9 Memory current consumption versus address range for G-728 audio transcodec.
    2.10 CPU core current consumption versus address range for G-728 audio transcodec.
    2.11 CPU peripheral current consumption versus address range for G-728 audio transcodec.

    3.1 Gradient mode Iterative Compilation Methodology (GMIC).
    3.2 Fraction of JPMO CB in an MPEG-1 application; the code blocks are numbered from fb01 to fb34.
    3.3 Fraction of JPMO contributed by code blocks in an MPEG-1 application (a window view for seven blocks).
    3.4 GMIC algorithm.
    3.5 Heuristic track of CT-Tuple for an MPEG-1 encoder application.
    3.6 Heuristic track of CTxy tuple for FFT application.
    3.7 Heuristic track of CTxy tuple for IDCT application.
    3.8 Heuristic track of CTxy tuple for T64 application.
    3.9 Heuristic track of CTxy tuple for M100 application.
    3.10 Heuristic track of CTxy tuple for H-264L application.

    4.1 A simplified view of framework with multicriteria methodology extension.
    4.2 Simplified Genetic Algorithm Model [6].
    4.3 Development of fitness function for Case Study 1 in TS1 and TS2.
    4.4 Fraction of IPC for Case Study 1 in TS1 and TS2.
    4.5 Fraction of IPC and Energy overlapping for Case Study 1 in TS1 and TS2.
    4.6 Fraction of CPU cycles for CB life time (CBLT) in NLIVQ application (25 CB are numbered from F01 to F25).
    4.7 Development of the fitness function for NLIVQ.
    4.8 Fraction of IPC for NLIVQ.
    4.9 Fraction of energy saving for NLIVQ.
    4.10 Fraction of functional unit utilization for NLIVQ.

    5.1 Scatter plot for 20 applications at the TriMedia processor.
    5.2 PCA Scree plot for 20 applications at the TriMedia processor.
    5.3 PCA box plot for 20 applications at the TriMedia processor.
    5.4 PCA biplot for 20 applications at the TriMedia processor.
    5.5 Scatter plot for 20 applications at the Blackfin processor.
    5.6 PCA biplot for 20 applications at the Blackfin processor.
    5.7 Scatter plot for 20 applications at the PIII 850 processor.
    5.8 PCA biplot for 20 applications at the PIII 850 processor.
    5.9 Differential AEP across three hardware platforms.
    5.10 PCA biplot for 20 applications across the TriMedia processor and the Blackfin processor.
    5.11 PCA biplot for 20 applications across the Blackfin processor and the PIII 850 processor.
    5.12 PCA biplot for 20 applications across the TriMedia processor and the PIII 850 processor.

  • LIST OF TABLES

    2.1 Energy reduction techniques for embedded system design.
    2.2 Multimedia Benchmarks (Speech Transcodecs).
    2.3 Multimedia Benchmarks (Video Transcodecs).
    2.4 Multimedia Benchmarks (Audio Transcodecs).
    2.5 Generic DSP application Benchmarks [7].
    2.6 Test Vectors Characterization.
    2.7 Bio-Computation Applications Benchmark.

    3.1 Transformation Schemes.
    3.2 Gradient Table.

    4.1 CBLT in CPU cycles for NLIVQ.
    4.2 Achieved CPU cycles (%) in ECHCB of NLIVQ application for TS04, TS07, TS09.
    4.3 Sum of absolute difference for TS04, TS07, TS09.
    4.4 Performance comparison between GMIC and MSIC.

    5.1 MPEGdec profile for successive transformations [8].

    D.1 Pseudonyms for 20 applications.
    D.2 AEP for optimized 20 applications at the TriMedia processor.
    D.3 AEP for optimized 20 applications at the Blackfin processor.
    D.4 AEP for optimized 20 applications at the PIII 850 processor.
    D.5 dAEP for optimized 20 applications across the TriMedia and the Blackfin processors.
    D.6 dAEP for optimized 20 applications across the Blackfin and the PIII 850 processors.
    D.7 dAEP for optimized 20 applications across the TriMedia and the PIII 850 processors.

  • 1 INTRODUCTION

    1.1 Motivation

    The growing trend towards untethered ubiquitous computing entails many performance-related issues. The ideal performance of a mobile computing system demands a perfect match between architecture capability and program behavior. Architecture performance can be enhanced with better hardware technology, innovative small-geometry integrated circuit (IC) features, and efficient resource management [9]. In the same vein, the demand for multimedia functions on handheld devices requires enormous computation power to handle large data and program sizes. Efficient architecture utilization, in terms of both energy dissipation and execution time, and optimal application firmware are two important performance metrics for these embedded systems. Optimal architecture utilization is hindered by several design limitations, such as high-level system design constraints, fabrication-level constraints and battery technology constraints. They are discussed next in more detail.

    1.1.1 Mobile Embedded System Constraints

    Mobile embedded systems (MES) present unique challenges and opportunities for system-level low-energy design, e.g.,

    • MES are usually severely energy constrained. In particular, handheld, airborne and spaceborne systems are typically battery-operated and therefore have a limited energy budget [10]. MES are also typically more time-constrained than portable embedded or general-purpose systems. The challenge is therefore to save energy while guaranteeing temporal constraints.

    • Some MES applications, such as avionics, robotics and deep space missions, require systems with small form factors, which in turn mandates low heat dissipation. Since heat is a byproduct of energy dissipation, low-energy system design ensures a more reliable system by limiting the heat produced.

    • MES are typically over-designed to ensure that temporal deadline guarantees are still met even if all tasks take their Worst-Case Execution Time (WCET). Since tasks do not require their WCET in the average case, this redundancy in hardware design makes MES energy inefficient.

    In short, system-level techniques can decrease this energy dissipation through energy-aware task scheduling algorithms that preserve the temporal constraints.
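
    The slack argument in the last bullet can be made concrete with a small sketch. It rests on a simplified model that is an assumption here and not a measurement from this work: energy for a fixed number of cycles scales with the square of the supply voltage, and the voltage scales linearly with the clock frequency. The task parameters and function names are hypothetical.

```python
def scaled_frequency(f_max_hz, actual_cycles, deadline_s):
    """Lowest clock frequency that still finishes the task by its deadline."""
    needed = actual_cycles / deadline_s
    return min(f_max_hz, needed)


def relative_energy(f_hz, f_max_hz):
    """Energy relative to running at f_max, assuming E ~ cycles * V^2 and V ~ f."""
    return (f_hz / f_max_hz) ** 2


# Hypothetical task: the system is sized for the WCET, but the
# average run needs only 60% of the worst-case cycles.
f_max = 200e6                    # Hz, assumed maximum clock
wcet_cycles = 2_000_000          # cycles, assumed worst case
deadline = wcet_cycles / f_max   # the deadline is just met at f_max in the worst case
actual_cycles = 0.6 * wcet_cycles

f_low = scaled_frequency(f_max, actual_cycles, deadline)
saving = 1.0 - relative_energy(f_low, f_max)
print(f"run at {f_low / 1e6:.0f} MHz, energy saving about {saving:.0%}")
```

    Under these assumptions, the unused part of the WCET budget is converted into a lower clock frequency instead of idle time, which is the kind of trade an energy-aware scheduler makes while still meeting the deadline.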

    1.1.2 IC Fabrication Technology Constraints

    Integrated circuits in their various incarnations consume electric power. This power is dissipated both by the switching devices in the IC (such as transistors) and as heat due to the resistivity of the electrical circuits. Power dissipation is a major consideration in the design of microprocessors and the embedded systems they are used in [11]. Figure 1.1 shows the power consumption of the Intel series of processors produced over the last two decades [1]. The horizontal axis shows the advancement in IC fabrication technology in terms of chip geometry (in nanometers), while power dissipation is plotted in Watts. Each point is marked with two numbers, giving chip geometry and power consumption, respectively. Points lying on the same vertical line, such as (350, 43) and (350, 34.8), are processors in the same technology but with different performance; e.g., (350, 43) and (350, 34.8) correspond to the PII 300 MHz and the PII 233 MHz, respectively. Similarly, the P4 3 GHz was fabricated at 130 nm and consumes 81.9 W, while the later P4 EE 3.40 GHz was fabricated at the smaller 90 nm geometry and consumes a comparable 83.9 W; raising the operating frequency further (P4 EE 3.73 GHz) at the same geometry comes at the penalty of an increase in power consumption to 115 W. The increasing trend towards special-purpose core processors has further reduced the geometry to 65 nm, at a power consumption of 130 W (Intel Core 2 Extreme QX6700). See [1] [12] [13] for a detailed view of the power versus technology trends realized by various CPU manufacturers.

    Attempts to shape the power-geometry envelope (shown as a shoe in Figure 1.1) reach their limit at a fabrication technology of 50 nm, where leakage current starts to dominate the power consumption (discussed further in Chapter 2). Although special-purpose core processors have been implemented at 50 nm [14] [12] with a power consumption of 14.5 W (shown at the bottom of the heel in Figure 1.1), their operating frequency is limited to 130 MHz, which is not sufficient to meet the current demand for multimedia processing. The designer's goal of achieving a low-leakage 'heel' in the power-geometry shoe is associated with a high power cost. This cost has two components. The first is the thermal cost of keeping the devices below their specified operating temperature limits; maintaining the integrity of packaging at higher temperatures also requires expensive solutions. The second is the on-board power delivery cost, which is related to the on-board decoupling capacitances and interconnects of the power distribution network. Moreover, the trend towards driving the CPU at lower operating voltage and higher frequency increases the magnitude of the current drawn by the CPU. This exacerbates resistive and inductive noise problems and leads to a significant increase in system cost.

    Fig. 1.1: Power consumption for Intel CPUs [1].

    Figure 1.2 indicates the range of dollar amounts associated with the above costs for different system components [2]. As can be seen, when the system power is in the 35-40 W range, the cost of each additional Watt tends to grow above $1/W per chip. Designers have already pushed the fabrication limits to achieve low-energy design goals [15]. E.g., shrinking the integrated circuit geometry below 50 nm doubles the leakage current compared to 65 nm. Such issues heighten the need to consider low-energy design more rigorously at higher levels of the system hierarchy [5].

    1.1.3 Battery Technology Constraints

    The energy constraints on mobile devices are becoming increasingly tight as user demand continues to push complexity and performance requirements [16]. Processor speeds have doubled approximately every 18 months, as predicted by Moore's law [17]. While processor speed and energy consumption have increased rapidly, the corresponding improvement in battery technology has been slow. In fact, battery capacity has increased by a factor of less than four in the last three decades [3] [18].

    Fig. 1.2: Thermal and power delivery cost in a desktop PC [2].

    Figure 1.3 shows the current state of the art in battery technology. The increase in battery capacity is held back by the limits of ionization chemistry [3] [19]. The design target of batteries with a long life span and small size is hard to achieve. E.g., although Ni-MH is lighter than Ni-Cd, it requires a longer recharging time. In the same vein, Li-Ion batteries are more promising, offering higher energy density, a large number of charging cycles, little memory effect and a longer shelf life, but their higher cost and the additional external protection they need against discharging inhibit their wide, low-cost use. In short, the technological constraints on realizing high-capacity, small-size batteries highlight the importance of low-energy design.
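
    The size of the gap between processor speed and battery capacity follows from simple arithmetic. The sketch below takes the two figures quoted in this section at face value and extrapolates the 18-month doubling over the full thirty years, which is an idealization made here for illustration and not a result from the cited sources.

```python
years = 30                     # "the last three decades"
doubling_period_years = 1.5    # doubling roughly every 18 months
doublings = years / doubling_period_years
processor_growth = 2 ** doublings

battery_growth_bound = 4       # "a factor of less than four"
gap = processor_growth / battery_growth_bound

print(f"{doublings:.0f} doublings, processor speed growth about {processor_growth:.2e}x")
print(f"gap relative to battery capacity: at least about {gap:.2e}x")
```

    Even if the doubling rate held for only part of the period, the mismatch would remain several orders of magnitude, which is why energy has to be saved on the consumption side.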

    Fig. 1.3: Battery technologies and their capacities [3].

    1.1.4 Architecture-Application Correlation Slacks

    Traditionally, optimal MES performance is pursued by focusing on the underlying hardware architecture. This ignores the fact that it is the software executing on a CPU that determines its energy consumption. The execution time and energy consumption of a program on any parallel processor depend not only on the composition of operations contained within the program, but also on the ability of users to express the parallelism at the correct granularity level for the processor. Therefore, to compare the cycle-energy performance of two applications on a given processor fairly, two different mappings are required, one for each application. An integrated approach that considers energy-cycle performance at the architecture level as well as the application level is essential for energy-efficient application development.

    1.2 Design Space Exploration

    Program behavior is difficult to predict because it depends heavily on the application and on run-time conditions [20] [21]. For mobile computing, application performance can be optimized by using parallel hardware architectures, such as Very-Long Instruction Word (VLIW) architectures [22] [23]. VLIW architectures are a suitable alternative for exploiting instruction-level parallelism (ILP) in programs, that is, for executing more than one basic (primitive) instruction at a time. These processors contain multiple functional units. They fetch from the instruction cache a Very-Long Instruction Word containing several primitive instructions, and dispatch the entire VLIW for parallel execution. These capabilities are exploited by compilers, which generate code in which independent primitive instructions that can execute in parallel are grouped together. The processors have relatively simple control logic because they perform neither dynamic scheduling nor reordering of operations (unlike most contemporary superscalar processors). The instruction set of a VLIW architecture tends to consist of simple (RISC-like) instructions. The compiler must assemble many primitive operations into a single "instruction word" such that the multiple functional units are kept busy, which requires enough instruction-level parallelism (ILP) in a code sequence to fill the available operation slots.
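
    The packing task described above can be sketched as a greedy scheduler. This is a minimal illustration and not the algorithm of any particular VLIW compiler: it assumes unit latency for every operation, identical issue slots, and straight-line code in which every operation appears after the operations it depends on. The operation names and the function `pack_vliw` are hypothetical.

```python
def pack_vliw(ops, deps, slots):
    """Greedily pack operations into VLIW instruction words.

    ops   : list of operation names in program order
    deps  : dict mapping an operation to the set of operations it depends on
    slots : number of issue slots per instruction word
    An operation is placed in a word strictly later than the words that
    hold its predecessors (unit latency is assumed).
    """
    word_of = {}   # operation -> index of the word it was placed in
    words = []     # list of instruction words, each a list of operations
    for op in ops:
        # Earliest word allowed by the data dependences.
        earliest = 0
        for pred in deps.get(op, ()):
            earliest = max(earliest, word_of[pred] + 1)
        # First word from there on that still has a free slot.
        w = earliest
        while w < len(words) and len(words[w]) >= slots:
            w += 1
        if w == len(words):
            words.append([])
        words[w].append(op)
        word_of[op] = w
    return words


# Hypothetical code sequence: st = (ld_a * ld_b) + ld_c
ops = ["ld_a", "ld_b", "mul", "ld_c", "add", "st"]
deps = {"mul": {"ld_a", "ld_b"}, "add": {"mul", "ld_c"}, "st": {"add"}}

for i, word in enumerate(pack_vliw(ops, deps, slots=2)):
    print(i, word)
```

    With two slots, the dependence chain leaves slots empty in the later words: the sequence lacks the ILP needed to fill them. Exposing more independent operations at the source level is one way to reduce such empty slots.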

    In mobile computing software design, the conventional software development environment (for compilation and machine code generation) cannot be used. Conventional methods primarily consider execution time and code size, while the energy dissipation issue is deferred to the final design; this inevitably leads to an expensive cooling mechanism, which increases the overall system cost and reduces reliability.

    The software perspective on power consumption has been the subject of the work in [24], where a detailed instruction-level power model of the Intel 486DX2 was built. The impact of software on CPU power and energy consumption, and software optimizations to reduce them, were studied. It is well known that the number of instructions actually executed generally differs from the number of instructions in the static code: the execution flow determines the number of useful instructions according to the input data. Therefore, computing the total energy consumed merely by adding the energy consumption of individual instructions does not provide the actual energy consumption of the program as claimed in [24].

    In this thesis we propose a framework in which software applications optimally utilize
    the hardware architecture to deliver energy-cycle performance within user-defined
    constraints. Our energy-aware framework in [25] meets this demand by incorporating the
    following features in a native multimedia DSP compilation environment:

    1) The framework transforms the legacy application source code into optimal ’C’ source
    code, taking advantage of different slacks appearing in the application-to-binary
    development hierarchy.

    2) Unlike conventional techniques, ’C’ source code is iteratively compiled for different

    performance goals both in terms of execution time as well as energy dissipation.

    3) We developed post-profiling techniques, published in [26], to evaluate the application
    performance not only at the compilation layer (as a conventional compiler does) but also
    at the scheduling, linker, machine code generation, and loader layers.

    4) We measure the real-time performance of applications running on actual hardware.

    These measured parameters are further used to tune the transformation scheme of the

    legacy software application.

    5) We tested our framework on applications from diverse industrial domains such as
    audio transcodecs [27], video transcodecs [8], speech codecs, and bioinformatics
    applications [28] [29].

    6) The work is further extended in [30] [27] to characterize the application-architecture
    correlation, which is well suited for the pre-design assessment of an embedded system
    design. It answers the question of whether a given hardware architecture is an appropriate
    choice for a given multimedia software application.

    Note that the terms power consumption and energy consumption are often used
    interchangeably. It is important to distinguish between the two in the context of programs
    running on mobile systems. Mobile systems run on the limited energy available in a battery.
    Therefore, the energy consumed by the system, or by the software running on it, determines
    the battery life.

    This thesis is based on the following publications.

    • N. Zafar Azeemi, A. Sultan, "Characterization of Bioinformatics Applications on
      Multimedia Processor", in Proc. IEEE Cairo International Biomedical Engineering
      Conference (CIBEC ’06), pages BI06-BI09, 195 - 200, Cairo, Egypt, December, 2006.

    • N. Zafar Azeemi, "Handling Architecture-Application Dynamic Behavior in Set-top Box
      Applications", in Proc. IEEE International Conference on Information and Automation
      (ICIA ’06), pages 195 - 200, Colombo, Sri Lanka, December, 2006.

    • N. Zafar Azeemi, A. Sultan, A. Muhammad, "Parameterized Characterization of
      Bioinformatics Workload on SIMD Architecture", in Proc. IEEE International Conference
      on Information and Automation (ICIA ’06), pages 189 - 194, Colombo, Sri Lanka,
      December, 2006.

    • N. Zafar Azeemi, "Multicriteria Energy Efficient Source Code Compilation for Dependable
      Embedded Applications", in Proc. IEEE International Conference on Information
      Technology (IIT ’06), Dubai, UAE, November, 2006.

    • N. Zafar Azeemi, "Compiler Directed Battery-Aware Implementation of Mobile
      Applications", in Proc. IEEE 2nd International Conference on Emerging Technologies
      (ICET ’06), pages 151 - 156, Peshawar, Pakistan, November, 2006.

    • N. Zafar Azeemi, "A Multiobjective Evolutionary Approach for Constrained Joint Source
      Code Optimization", in Proc. ISCA 19th International Conference on Computer
      Application in Industry (CAINE ’06), pages 175 - 180, Las Vegas, Nevada, USA,
      November, 2006.

    • N. Zafar Azeemi, "Probabilistic Iterative Compilation for Source Optimization of
      Embedded Programs", in Proc. 2006 IEEE International SoC Design Conference
      (ISOCC ’06), pages 323 - 328, Seoul, Korea, October, 2006.

    • N. Zafar Azeemi, M. Rupp, "Multicriteria Low Energy Source Level Optimization of
      Embedded Programs", in Proc. Tagungsband zur Informationstagung Mikroelektronik
      (ME ’06) IEEE Austria, pages 150 - 158, Vienna, Austria, October, 2006.

    • N. Zafar Azeemi, "Architecture-Aware Hierarchical Probabilistic Source Optimization",
      in Proc. ISCA 19th International Conference on Parallel and Distributed Computing
      Systems (PDCS ’06), pages 90 - 95, San Francisco, USA, September, 2006.

    • N. Zafar Azeemi, "Power Aware Framework for Dense Matrix Operations in Multimedia
      Processors", in Proc. IEEE 9th International Multi-topic Conference (INMIC ’05),
      Karachi, Pakistan, December, 2005.

    • N. Zafar Azeemi, M. Rupp, "Energy-Aware Source-to-Source Transformations for a VLIW
      DSP Processor", in Proc. IEEE 17th International Conference on Microelectronics
      (ICM ’05), pages 133 - 138, Islamabad, Pakistan, December, 2005.

    • N. Zafar Azeemi, "A Framework for Architecture Based Energy-Aware Code Transformations
      in VLIW Processors", in Proc. International Symposium on Telecommunication
      (IST ’05), pages 393 - 398, Shiraz, Iran, September, 2005.

    1.3 Thesis Outline

    This thesis is organized in five chapters, as shown in Figure 1.4. A brief description of

    each chapter is given below.

    Chapter 1: We discuss the different design limitations, such as high level system design

    constraints, fabrication level constraints, battery technology constraints etc. We explore

    the design slacks that exist in contemporary work [31] [24] [5] for energy aware code

    optimization. We explain the thesis structure and provide a detailed list of contributions.

    Fig. 1.4: Thesis Structure.

    Chapter 2: This chapter lays the foundation for the development of our energy-cycle
    aware iterative compilation framework. Our methodology optimizes a software application
    for energy consumption, execution time, and efficient utilization of the hardware
    architecture. In contrast to [5] [32] [33] [34], we elaborate our method for generic
    multimedia processors. Unlike [35] [36] [36], we define a software application in terms of
    its architectural behavior. We provide a simplified overview of typical multimedia
    processors. Although various multimedia operation models are presented in
    [37] [31] [38] [39] [40], their complexity prevents them from being readily usable in a
    real-time optimization environment. We use a simplified multimedia operation model
    developed in [4], which views the instruction set in terms of load/store operations,
    compute operations, special register operations, and control flow operations. Measuring
    the energy consumed by an application on a real-time platform is the first step in the
    design of any energy-constrained embedded system, and the measurement can be used to
    estimate the battery lifetime of the system. The experimental setup proposed in
    [5] [32] [41], which covers instruction/program current measurement, addressing modes,
    immediate operands, and exhaustive characterization, is very time consuming. We present
    a measurement platform that is generic and applicable to most off-the-shelf multimedia
    processors. It is based on current measurement at both the processor and the memory input
    lines. Unlike the instruction-based energy model presented in [42] [24], we propose a
    simplified energy consumption model based on code blocks. We give a step-by-step
    procedure for measuring the energy consumption of a software application on a target
    hardware architecture. In contrast to [24] [32] [41], we apply our framework to two
    major application domains, multimedia and bioinformatics. The multimedia application
    set consists of encoders and decoders (transcodecs) encompassing three media types:
    speech, video, and audio (music). We categorize the basic functionality offered by
    bioinformatics tools into four groups: pattern recognition algorithms, rule-based
    analysis, biological databases, and biological taxonomy. The results published in
    [28] [29] show the usefulness of our framework across diverse application domains.
    Several energy reduction opportunities at the design level are also presented.

    Chapter 3: Our energy-cycle aware compilation framework is powered by a source
    code transformation engine. Unlike [43] [42] [24], we implement our scheme by first
    investigating the ’C’ source code of the application for blocks that are costly in cycles
    and energy, based on trace data collected while profiling the application as described in
    Chapter 2. We then present a novel heuristic that searches the solution space for an
    optimal source code transformation scheme. The algorithm executes a solution and
    evaluates the energy-time tradeoff based on a user-defined metric. Based on this
    evaluation, it selects the next solution to be evaluated. The heuristic terminates when
    the desired objectives are achieved. Our gradient mode iterative compilation scheme has
    two salient features. First, it queues code blocks such that blocks with similar
    expression profiles, which are most likely to benefit from the same transformation
    scheme, are adjacent. Second, it completes in a finite number of steps determined by the
    number of code blocks, whereas the schemes of Sinha et al. [33] and Tiwari et al. [5]
    require searches that grow exponentially with the number of code blocks. We illustrate
    our scheme by analyzing a video encoding application (MPEG-1 encoder). Further merits
    and demerits of the scheme are explained for different application scenarios.

    Chapter 4: The gradient mode iterative compilation proposed in the previous chapter
    belongs to a class of compilation termed feedback-directed compilation. It brings a
    relatively small improvement, as it effectively restricts itself to trying different
    back-end optimizations. The major impediment to such an approach is the heuristic search
    technique itself. Unlike [32] [41], in this chapter we consider the optimization problem
    as a single task in which all desired aims have to be taken into account simultaneously.
    We present a new method based on the optimization of a multicriteria objective function,
    where the desired aims of architecture-based energy-cycle optimization are formulated as
    penalty terms of the objective function. Further, we describe how the maximization of
    the objective function can be achieved by using a Genetic Algorithm (GA). The interface
    of the proposed methodology to our energy-cycle aware compilation framework is also
    explained. We also present the details of our methodology, e.g., selection of
    constraints, development of the fitness function, and formation of the Hertz matrix. We
    discuss two multimedia applications in depth to illustrate the advantages of the
    algorithm.

    Chapter 5: In this chapter we introduce the concept of application-architecture
    characterization with the help of our ECACF and multivariate statistics techniques. To
    our knowledge, this is the first attempt to obtain such a characterization from
    application expression profiles.

    The application-architecture correlation is a bidirectional process matching algorithmic
    structure with hardware architecture and vice versa. The programmer benefits from
    this efficient mapping and can produce better source code. Applications of similar
    functionality may yield similar Application Expression Profiles (AEP), and hence can be
    suitable for similar hardware platforms. We show that, despite the simplicity of our
    methodology, the analysis of the large matrices provided by an application expression
    profile under different levels of transformation on different architectures is not
    trivial and requires advanced knowledge discovery processes. To this end, we introduce a
    new methodology to evaluate application portability using multivariate statistics. We
    demonstrate how box plots, scree plots, and PCA biplots can be used to characterize an
    application on a given hardware architecture. We present the details of the methodology
    by exploring the AEP across three different hardware platforms for diverse applications.
    Finally, we demonstrate how dAEP can be used to assess legacy code portability
    across platforms.

    2. ENERGY-CYCLE AWARE COMPILATION FRAMEWORK (ECACF)

    Miniaturized computing systems are finding applications in special areas such as
    hand-held computation, tiny robots, and guidance systems in automated vehicles, to name
    just a few. These systems or their users move from place to place. Because of
    their small size and their mobility requirement, they are powered by batteries of low
    rating. In order to avoid frequent recharging and/or replacement of the batteries, there
    is significant interest in low-energy system design. Energy consumption is an area of
    growing concern in system design. It leads to a variety of system-related issues, such as
    battery life, thermal limits, packaging constraints, and cooling options [44]. Although
    energy is actually consumed by the hardware, energy consumption can be reduced not only
    by using low-energy electronics but also by suitably manipulating the software, because
    the hardware activities are controlled through the software.

    Let a program X run for T seconds to achieve its goal, let $V_{CC}$ be the supply voltage
    of the system, and let I be the average current in amperes drawn from the power source
    during the T seconds. We can rewrite T as $T = N \times \tau$, where N is the number of
    clock cycles and $\tau$ is the clock period. The amount of energy consumed by X to achieve
    its goal is then given by $E = V_{CC} \times I \times N \times \tau$ joules. Since both
    $V_{CC}$ and $\tau$ are fixed for a given hardware, $E \propto I \times N$. However, at the
    application level it is more meaningful to talk about T than N, and we therefore express
    energy as $E \propto I \times T$. This expression is the foundation of our ECACF. It shows
    the main idea in the design of energy-efficient software, which is to reduce both T and I.
    The (average case) running time of an algorithm gives a measure of T. To compute I,
    however, one must consider the current drawn during each clock cycle. This is illustrated
    in Section 2.5.
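The energy expression above can be evaluated with a minimal worked sketch. All numerical values (a 3.3 V supply, a 100 MHz clock, the currents, and the cycle counts) are hypothetical and chosen only for illustration; the function name is likewise an assumption.

```python
def program_energy_joules(v_cc, avg_current, cycles, clock_period):
    """Energy of one program run: E = V_CC * I * N * tau."""
    return v_cc * avg_current * cycles * clock_period


# Hypothetical platform: 3.3 V supply, 100 MHz clock (tau = 10 ns).
V_CC = 3.3
TAU = 10e-9

# Hypothetical baseline code: 200 mA average current, 50 million cycles.
baseline = program_energy_joules(V_CC, 0.200, 50_000_000, TAU)

# Hypothetical transformed code: denser parallel execution draws more
# current (220 mA) but finishes in fewer cycles (40 million).
optimized = program_energy_joules(V_CC, 0.220, 40_000_000, TAU)

print(f"baseline  = {baseline:.4f} J")
print(f"optimized = {optimized:.4f} J")
```

With these assumed numbers the transformed code consumes less energy although its average current is higher, because the product of I and N decreases.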

    Given that power is the rate of energy consumption, we refer to power and energy
    interchangeably in this thesis. Low power design is a complex endeavor requiring
    a broad range of strategies, from floor planning on the silicon substrate to the design
    of application software. Table 2.1 lists several strategies for achieving energy
    efficiency in an energy-conscious system design. In the following section, we review some
    of these strategies.

    Power Reduction Strategies                   MES Design Levels
    Fabrication Level Power Reduction            Low level
    Processor Level Power Reduction              Intermediate level
    EDA Tools Level Power Reduction              High level
    Compiler Level Power Reduction               High level
    Low Power Data Structures                    High level
    Idle Mode Power Reduction                    Intermediate level
    Power Reduction in Distributed Computing     High level
    Power Reduction in Communication Systems     High level
    Battery Aware Power Reduction                High level

    Tab. 2.1: Energy reduction techniques for embedded system design.

    2.1 Energy Saving Techniques - A Review

    We review a wide spectrum of strategies, shown in Table 2.1, ranging from the hardware
    fabrication process to energy-efficient communication systems. Energy savings due to
    different approaches are, in the best case, multiplicative. E.g., in an IDCT application
    implemented in [44] [45] [46] [47], a 30% energy saving from low-energy electronics
    together with a 23% saving from compiler techniques yields a total energy saving of

    (1 - (1 - 0.30)(1 - 0.23)) × 100% = 46.1%.

    In general, however, the total energy saving is smaller (34% in this example), because
    the various energy saving strategies may adversely affect each other.
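The best-case multiplicative combination can be reproduced with a short sketch. The function name `combined_saving` is hypothetical, and the sketch assumes the individual savings are independent, which, as noted above, is usually optimistic.

```python
def combined_saving(savings):
    """Best-case total saving when independent savings multiply.

    Each entry is a fractional saving (0.30 means 30%). The remaining
    energy fractions are multiplied and subtracted from one.
    """
    remaining = 1.0
    for saving in savings:
        remaining *= 1.0 - saving
    return 1.0 - remaining


# The IDCT example from the text: 30% from low-energy electronics and
# 23% from compiler techniques.
total = combined_saving([0.30, 0.23])
print(f"best-case total saving = {total * 100:.1f}%")
```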

    2.1.1 Fabrication level power reduction

    The power consumption in a CMOS digital circuit is expressed as [48]

    $$P = C_L V_{DD}^2 f_p + I_{SC} V_{DD} + I_{leakage} V_{DD} \qquad (2.1)$$

    where $V_{DD}$ is the supply voltage, $f_p$ is the output switching frequency, $C_L$ is the
    output capacitance load, $I_{SC}$ is the short circuit current pulse generated when both
    n- and p-transistors are briefly turned on during output switching, and $I_{leakage}$ is
    the leakage current. The first term on the right-hand side of Equation (2.1) is the
    dominant factor [48]. It is expected that power savings of two orders of magnitude can be
    achieved using low-power electronics. About half of the power reduction will come from
    architecture changes and management of switching activity ($f_p$). The other half will
    come from using advanced materials technology to allow reduction of $V_{DD}$ to 1 V or
    below, from 5 or 3.5 V, while also reducing $C_L$ [48] [49].
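The quadratic dependence of the dominant term on the supply voltage can be made concrete with a small sketch of the three terms of the CMOS power expression. The component values (10 pF load, 100 MHz switching frequency, and the short-circuit and leakage currents) are hypothetical and serve only to illustrate the scaling.

```python
def cmos_power(c_load, v_dd, f_p, i_sc, i_leakage):
    """Return the dynamic, short-circuit, and leakage power terms in watts."""
    dynamic = c_load * v_dd ** 2 * f_p
    short_circuit = i_sc * v_dd
    leakage = i_leakage * v_dd
    return dynamic, short_circuit, leakage


# Hypothetical circuit parameters, identical for both supply voltages.
C_LOAD = 10e-12   # farads
F_P = 100e6       # hertz
I_SC = 1e-6       # amperes
I_LEAK = 1e-9     # amperes

dyn_5v, sc_5v, leak_5v = cmos_power(C_LOAD, 5.0, F_P, I_SC, I_LEAK)
dyn_1v, sc_1v, leak_1v = cmos_power(C_LOAD, 1.0, F_P, I_SC, I_LEAK)

# Lowering the supply from 5 V to 1 V scales the dynamic term by 1/25.
ratio = dyn_5v / dyn_1v
print(f"dynamic power at 5 V: {dyn_5v * 1e3:.1f} mW")
print(f"dynamic power at 1 V: {dyn_1v * 1e3:.1f} mW")
print(f"reduction factor: {ratio:.0f}")
```

Under these assumptions, voltage scaling alone accounts for more than one of the two orders of magnitude mentioned above; the remainder would have to come from reduced switching activity and load capacitance.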

    2.1.2 Processor level power reduction

    Mobile embedded systems require small form factors, and hence processors designed for

    high-end desktops are not suitable for such applications. Havinga et al. in [50] show that

    microprocessors can account for up to 33% of a typical notebook power budget, which

    is around 15W. Therefore, processor designers include a number of features to reduce

    power consumption. E.g., in TriMedia processor TM130x [4] and Blackfin processor

    ADSP533S some of the power reduction features are dynamic idle-time shutdown of

    separate execution units, low-power cache design, and power considerations for standard

    cells, data-path elements, and clocking. The processor also supports three static power

    management modes: doze, nap, and sleep [51]. These modes reduce power at a global

    level when the processor is idle for an extended period of time. Since CMOS circuits

    consume power during the charging and discharging of capacitances, reducing switching

    activity saves power. At the architecture level, two strategies to reduce switching

    activity are Gray code addressing and cold scheduling of instructions [52] [53]. Experimental

    results show that cold scheduling reduces switching by 20% to 30%. The advantage of Gray

    code over binary code is that consecutive addresses differ in only one bit. Thus, a

    significant number of bit switches can be eliminated for sequential memory accesses using

    Gray code addressing. Also, by decomposing a finite-state machine into several submachines, [54]

    suggest that it is possible to selectively turn off portions of a circuit, thereby reducing

    the switching activities. Tiwari et al. [31] have studied the idea of shutting off parts of

    a logic circuit that are not needed in a particular computation on a per-clock-cycle basis.

    This saves the power used in all the useless transitions in those parts of the circuit. Burd

    et al. in [55] and Govilak et al. in [56] have suggested that power consumption in a

    CPU can be reduced by dynamically changing its operating frequency and voltage. The

    role of prediction and smoothing in dynamic speed-setting policies is discussed further

    in [57]. Havinga and Smit [50] propose energy saving by exploiting

    locality of reference with dedicated, optimized modules. The idea of locality of reference

    is to offload as much work as possible from the CPU to programmable modules that are

    placed in the data streams.
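The benefit of Gray code addressing mentioned above can be checked with a short sketch that counts address-line bit transitions for a sequential sweep over 16 addresses. The sweep length and the helper names are assumptions made for illustration; real access patterns are not purely sequential, so the actual reduction depends on the workload.

```python
def to_gray(n):
    """Convert a non-negative integer to its reflected binary Gray code."""
    return n ^ (n >> 1)


def bit_flips(values):
    """Count bit transitions between consecutive values in a sequence."""
    return sum(bin(a ^ b).count("1") for a, b in zip(values, values[1:]))


# Hypothetical access pattern: a sequential sweep over 16 addresses.
addresses = list(range(16))
binary_flips = bit_flips(addresses)
gray_flips = bit_flips([to_gray(a) for a in addresses])

print(f"binary addressing: {binary_flips} bit flips")
print(f"Gray addressing:   {gray_flips} bit flips")
```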

    2.1.3 EDA tools level power reduction

    The design of low-power systems cannot be achieved without good power-conscious

    EDA tools. EDA tools are used at all levels of hardware design: behavioral, architectural,

    logic and physical. For a detailed exposition of power-conscious EDA tools, the reader

    is referred to tutorials by [58] [59] [14].

    2.1.4 Compiler level power reduction

    Compiler design techniques contribute to energy saving in several ways [60] [61]. Kolson

    and Nicolau [62] [40] [63] address the problem of allocating memory to variables in

    embedded DSP (digital signal processing) software. The goal is to maximize simultaneous

    data transfers from different memory banks to registers [64] [65] [66]. In several DSP

    applications mentioned in [67] [68], two registers are loaded with the required data and

    an arithmetic operation is performed. Loading two registers with a single double-transfer

    instruction draws slightly more current than a single move instruction, and both instructions

    take one clock cycle each. Energy is nevertheless saved by using the double transfer,

    because it loads both registers in one clock cycle, whereas two clock cycles are needed

    to load the registers sequentially. Experimental results for a

    few applications on a Blackfin DSP processor in [30] show that up to 47% of energy

    can be saved by this approach. Instructions with memory operands have much higher

    energy costs than instructions with register operands [30]. This suggests that energy

    can be saved by suitably assigning the live variables of a program to registers. But, a

    processor has only a small number of registers. When the number of simultaneous live

    variables is larger than the number of available registers, some of the variables must be

    spilled to memory. Register assignment for loop variables is important because loops

    are typically executed many times. Algorithms for optimal register assignment to loop

    variables are presented in [69] [70] [71] [62]. These algorithms can be included in the

    code generation part of a compiler.
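The double-transfer argument made earlier in this section follows directly from the proportionality of energy to current times cycle count. The sketch below uses normalized, hypothetical currents (a double transfer is assumed to draw 10% more current than a single move); these numbers are not taken from [30].

```python
def relative_energy(current, cycles):
    """Energy up to the constant factor V_CC * tau, i.e. E ~ I * N."""
    return current * cycles


# Hypothetical normalized currents, chosen only for illustration.
I_MOVE = 1.00    # one register move per cycle
I_DOUBLE = 1.10  # one double transfer per cycle, slightly more current

two_moves = relative_energy(I_MOVE, cycles=2)     # load registers one by one
one_double = relative_energy(I_DOUBLE, cycles=1)  # load both in one cycle

saving = 1.0 - one_double / two_moves
print(f"relative energy, two moves:       {two_moves:.2f}")
print(f"relative energy, double transfer: {one_double:.2f}")
print(f"saving: {saving * 100:.0f}%")
```

The saving approaches 50% as the current of the double transfer approaches that of a single move, which is the upper bound for this particular substitution.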

    2.1.5 Low power data structures

    Kondo et al. [72] propose a method of implementing set data types with minimum power

    consumption. In a programming language, one can implement the set data type using a

    variety of concrete data structures such as arrays, pointer arrays, linked lists, and binary

    trees [73]. Thus, to implement the set operations, such as locate, insert, and remove

    a record from a set, one has to manipulate the memory elements in a concrete data

    structure as proposed in [74] [75] [33] [42]. It is the memory accesses in the process

    of set operations that actually consume power. Thus, the power consumption in set

    operations is a function of the number of memory elements used in implementing a set

    data type, the number of read and write operations performed in the implementation,

    and some circuit-level details such as capacitance of memory elements, voltage level, and

    frequency of operation. The concrete data structures are compared on the basis of a

    filling factor, which is the fraction of the locations that would be filled if implementation

    is in arrays [76] [77] [78]. It has been shown that for different levels of filling factor,

    different concrete data structures lead to low values of the power cost function. E.g.,

    for filling factors greater than 60%, arrays are better in implementing energy efficient

    set data types [72].

    2.1.6 Idle mode power reduction

    The doze mode is an innovative approach to conserving energy [79] [80] [81] [60]. It is

    very attractive in a communication environment where a mobile system may occasionally

    send or receive messages. In the doze mode, the clock speed is reduced and no user

    process is executed. Rather, a mobile host simply waits for any incoming message. Upon

    receiving a message, the host resumes its normal mode of operation. The energy saving

    due to this mode depends on the local computations on a mobile and the pattern of

    communication between a mobile and a support station [82]. Simulation studies in [41]

    show that the energy saving due to this mode varies widely, from 2% to 98%.

    2.1.7 Power reduction in distributed computing systems

    Agent-based computation is a relatively new idea in distributed computing [83] [81]
    [84]. General agent-based distributed computing systems have been designed using the
    concept of Linda's tuple space [85]. Wei et al. [86] discuss how energy-efficient
    distributed algorithms in a mobile computing environment can be designed using a tuple
    space managed on the fixed network of a mobile system. Lin et al. [22] propose a
    power-efficient commit protocol which supports conventional two-phase commit services. A
    distributed autonomous system called Noah (Network oriented application harmony),
    built in the Mitsubishi laboratory, has been proposed in [87]. Though the purpose of
    Noah is not to save energy, it demonstrates how agent-based systems can be built using
    a tuple space as the medium for process communication. By shifting most of the workload
    to peer fixed hosts, the load, the power consumption, and the number of messages
    exchanged via expensive wireless links in a mobile host are greatly reduced.

    2.1.8 Power reduction in communication systems

    The receiver subsystem of a mobile station need not be active all the time [88]. Most

    digital cellular and cordless systems provide power cycling at the mobile units. Mobile

    stations can periodically relax (power cycle) their receivers as a means of conserving

    energy. Since the receiver of a mobile unit is not continuously ready to receive messages

    from the local support station (base station), some kind of coordination between a base

    station and a mobile unit is necessary. Salkintzis et al. [89] propose a page-and-answer

    protocol. Intuitively, the protocol works as follows:

    When a base station has a message for a mobile unit, the base station sends a small

    paging packet to the mobile unit. If the mobile unit receives the paging packet, that

    is if the mobile receiver is up, the mobile sends an answer packet to the base station.

    Obviously, if the paging message is sent at a time when the receiver is powered off, no

    answer packet is generated by the mobile and the base station will once again page the

    mobile after some time. Upon receiving an answer packet, the base station sends the

    desired message to the mobile unit.
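The page-and-answer exchange described above can be sketched as a small slotted-time simulation. The model is an assumption made for illustration and is not taken from [89]: time is divided into slots, the receiver is powered during the first `on_time` slots of every `period`, and the base station repeats its page every `retry_interval` slots until it receives an answer.

```python
def receiver_is_up(t, period, on_time):
    """Receiver is powered during the first on_time slots of every period."""
    return (t % period) < on_time


def page_and_answer(start, period, on_time, retry_interval, max_pages=100):
    """Simulate paging until the mobile answers.

    Returns (delivery_slot, pages_sent), or (None, max_pages) if the
    mobile never answers within max_pages paging attempts.
    """
    t = start
    for pages in range(1, max_pages + 1):
        if receiver_is_up(t, period, on_time):
            # The mobile receives the page and answers; the base station
            # then sends the desired message.
            return t, pages
        # Receiver powered off: no answer, so page again after some time.
        t += retry_interval
    return None, max_pages


# Hypothetical parameters: 20% receiver duty cycle, paging every 3 slots.
slot, pages = page_and_answer(start=3, period=10, on_time=2, retry_interval=3)
print(f"message delivered in slot {slot} after {pages} pages")
```

The sketch makes the tradeoff visible: a lower receiver duty cycle saves energy at the mobile unit but increases the number of paging packets and the delivery delay.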

    Kravets and Krishnan [90] propose power saving by selectively choosing short periods

    of time to suspend communications and shut down the communication device. Applying

    this method to a transport protocol and using three simulated communication patterns,

    they have achieved up to an 83% saving in the energy consumed by the communication

    system. Chlamtac et al. [91] address the problem of wireless access protocols which

    include an energy constraint and develop three energy conserving protocols for various

    loads: grouped-tag TDMA, directory, and pseudorandom. Singh et al. [92] argue that

    there is a need for using power-aware metrics, such as minimize energy consumed per

    packet, minimize variance in node power levels, maximize time to network partition, etc.,

    in the design of power-efficient routing protocols. They show that using these metrics in a

    shortest-cost routing algorithm reduces the cost per packet of routing by 5% to 30%

    compared with shortest-hop routing.

    2.1.9 Battery aware power reduction

    Chiasserini and Rao [18] have shown how battery behavior can be exploited to prolong

    battery life. In particular, they identify the phenomenon of charge recovery that takes

    place under pulsed discharge conditions as a mechanism that can be exploited to enhance

    the capacity of an energy cell. The bursty nature of many data traffic sources suggests

    that there might be a natural fit between the two. Bai and Lai [93] implement several

    methods that let a low-power CPU efficiently perform computation-intensive tasks,

    such as graphic image processing and display. Their methods include reducing

    the computational complexity of bitmap file processing, using fixed-point math instead

    of floating-point math, prestoring a table of trigonometric functions, and using a few

    lines of assembly language code in the inner loop of the graphic image processing program

    to improve its performance. These methods speed up the programs by a

    factor of three to six.
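Two of the techniques listed above, fixed-point arithmetic and a prestored trigonometric table, can be sketched together as follows. The Q15 format, the table size of 256 entries per full turn, and all names are assumptions made for illustration; they are not taken from [93].

```python
import math

Q = 15            # Q15 fixed point: real value = integer / 2**15 (assumed)
TABLE_SIZE = 256  # table entries per full turn (assumed)
Q15_MAX = (1 << Q) - 1

# The table is computed once, ahead of time; at run time only integer
# operations are needed. Values are clamped to the largest Q15 number.
SINE_TABLE = [
    min(round(math.sin(2 * math.pi * i / TABLE_SIZE) * (1 << Q)), Q15_MAX)
    for i in range(TABLE_SIZE)
]


def fixed_sin(index):
    """Look up sin(2*pi*index/TABLE_SIZE) as a Q15 integer."""
    return SINE_TABLE[index % TABLE_SIZE]


def scale(sample, coeff_q15):
    """Multiply an integer sample by a Q15 coefficient using integers only."""
    return (sample * coeff_q15) >> Q


# Scale an integer sample by sin(90 degrees), i.e. table index 64.
scaled = scale(1000, fixed_sin(64))
print(scaled)
```

The result is 999 rather than 1000 because the largest Q15 value is slightly below one; this small loss of precision is the price of avoiding floating-point operations.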

    In [44], we argue that mobile application development requires us to rethink the concept
    of an algorithm from the viewpoint of battery life. Instead of asking for the best result,
    a user may say:

    ’Give me the best result you can find, using no more than X units of resource R.’

    Or, one can let the system make the tradeoff between fidelity and resource consumption

    by saying:

    ’Give me the best result you can obtain economically.’

    2.2 Multimedia DSPCPU Architecture

    A multimedia processor is a media processor for high-performance multimedia
    applications that deal with high-quality video and audio. Typically, an extended
    general-purpose CPU (called the DSPCPU) makes it capable of implementing a variety of
    multimedia algorithms from popular multimedia standards such as MPEG-1 and MPEG-2.
    The key features behind this powerful processor are as follows:

    • A general-purpose VLIW processor core coordinates all the on-chip activities.
      In addition to implementing the non-trivial parts of multimedia algorithms, this
      processor runs a small real-time operating system that is driven by interrupts from
      the other units.

    • DMA-driven multimedia input/output units that operate independently and that
      properly format data to make software media processing efficient.

    • DMA-driven multimedia coprocessors that operate independently and in parallel
      with the DSPCPU to perform operations specific to important multimedia algorithms.

    • A high-performance bus and memory system that provides communication between
      the processing units.

    • A flexible external bus interface.

    A typical multimedia processor is based on a three-level hierarchy of operators:

    • Instructions

    • Operations

    • RISC operations

    One instruction may contain five operations as depicted in Figure 2.1. Each operation

    may execute multiple arithmetic operations. E.g., for TriMedia DSP processor TM130x,

    one such operation is the command IFIR(a, b). This command contains a total of threearithmetic operations: Two multiplications and one addition (aHI × bHI + aLO × bLO).

    Up to five operations including two IFIR commands can be issued in each machine

    cycle. The ability of TriMedia’s VLIW architecture to execute multiple operations in

    parallel gives it a big advantage over traditional RISC and CISC architectures found in

    current mass-market microprocessors.
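
    The arithmetic performed by one IFIR(a, b) operation can be illustrated in plain C. The following is only a minimal sketch: it assumes that a and b are 32-bit words each holding two signed 16-bit halves, the function name ifir16_emul is hypothetical, and the overflow behavior of the real device is not modeled.

```c
#include <stdint.h>

/* Hypothetical software emulation of IFIR(a, b):
 * aHI * bHI + aLO * bLO, where each operand is assumed to hold
 * two signed 16-bit halves packed into one 32-bit word. */
int32_t ifir16_emul(uint32_t a, uint32_t b)
{
    int16_t a_hi = (int16_t)(a >> 16);
    int16_t a_lo = (int16_t)(a & 0xFFFFu);
    int16_t b_hi = (int16_t)(b >> 16);
    int16_t b_lo = (int16_t)(b & 0xFFFFu);

    /* Two multiplications and one addition, as described in the text.
     * The sum is formed in 64 bits so the sketch itself cannot overflow;
     * how the hardware treats the extreme case is not modeled here. */
    int64_t sum = (int64_t)a_hi * b_hi + (int64_t)a_lo * b_lo;
    return (int32_t)sum;
}
```

    On a general-purpose CPU this costs several instructions, whereas the text above states that up to two such operations can be issued in a single VLIW machine cycle.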


    Fig. 2.1: TriMedia VLIW instruction [4].

    2.2.1 Multimedia Processor Execution Model

    The multimedia processor provides a large set of general purpose registers,

    generally named as r0, r1, and so on. In addition to the hardware program counter PC,

    there are a few user-accessible special purpose registers to hold CPU branch addresses.

    The CPU issues one long instruction every clock cycle. Each instruction consists of

    several operations (five operations for the TM1300 microprocessor) [4]. Each operation

    is comparable to a RISC machine instruction, except that the execution of an operation

    is conditional upon the content of a general purpose register. Examples of operations

    are:

    IF r10 iadd r11 r12 → r13 (if r10 true, add r11 and r12 and write sum in r13)

    IF r10 ld32d(4) r15 → r16 (if r10 true, load 32 bits from mem[r15+4] into r16)

    IF r20 jmpf r21 r22 (if r20 true and r21 false, jump to address in r22)

    Each operation has a specific, known execution latency (in clock cycles). For example,

    in case of TM1300, iadd takes 1 cycle. This means that the result of an iadd operation

    started in clock cycle ’i’ is available for use as an argument to operations issued in cycle

    ’i+1’ or later. The other operations issued in cycle ’i’ cannot use the result of iadd.

    Similarly the ld32d operation has a latency of 3 cycles. The result of an ld32d operation

    started in cycle ’j’ is available for use by other operations issued in cycle ’j+3’ or later.

    Branches, such as the jmpf example above have three delay slots. This means that if a

    branch operation in cycle ’k’ is taken, all operations in the instructions in cycle k+1, k+2

    and k+3 are still executed. In the above examples, r10 and r20 control the conditional

    execution of the operations. This is also referred to as guarding, where r10 and r20

    contain the guard of the operation.
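
    The timing rules above reduce to simple cycle arithmetic. The sketch below restates them in C using the latencies quoted for the TM1300; the function and constant names are illustrative assumptions and are not part of any TriMedia tool.

```c
/* Latencies quoted in the text for the TM1300 (in clock cycles). */
enum { LAT_IADD = 1, LAT_LD32D = 3, BRANCH_DELAY_SLOTS = 3 };

/* Earliest cycle in which a consumer operation may be issued when the
 * producer was issued in issue_cycle and has the given latency. */
int earliest_use_cycle(int issue_cycle, int latency)
{
    return issue_cycle + latency;
}

/* First cycle whose instruction is fetched from the branch target when
 * a taken branch is issued in branch_cycle. The instructions in the
 * delay slots (branch_cycle + 1 .. branch_cycle + 3) still execute. */
int first_target_cycle(int branch_cycle)
{
    return branch_cycle + BRANCH_DELAY_SLOTS + 1;
}
```

    For instance, an ld32d issued in cycle 7 can feed an operation issued in cycle 10 at the earliest, which is the kind of slack an instruction scheduler tries to fill with independent operations.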

    The architecture implementation restricts the choice of operations that can be performed in parallel or packed into one instruction. For example, the DSPCPU in the TM1300 allows no more than two load/store class operations to be packed into a single instruction, as shown in Figure 2.2. Also, no more than five results (of previously started operations) can be written during any one cycle. The packing of operations is not normally performed by the programmer. Instead, the instruction scheduler takes care of converting the parallel intermediate format code into packed instructions ready for the assembler. The rules are formally described in the VLIW Description File (VDF) used by the instruction scheduler and other tools.
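
    Two of the packing rules mentioned above (at most five operations per instruction, at most two of them load/store class) can be written as a small legality check. This is a simplified sketch with hypothetical names; the actual rules in the VDF also cover slot assignment and the write-back limit, which are not modeled here.

```c
#include <stdbool.h>
#include <stddef.h>

/* Coarse operation classes, assumed for illustration only. */
enum op_class { OP_ALU, OP_LOAD_STORE, OP_BRANCH, OP_OTHER };

enum { MAX_OPS_PER_INSTR = 5, MAX_LOAD_STORE_PER_INSTR = 2 };

/* Returns true if the n operations in ops may share one long instruction
 * under the two constraints quoted in the text. */
bool can_pack(const enum op_class *ops, size_t n)
{
    size_t load_store = 0;

    if (n > MAX_OPS_PER_INSTR)
        return false;

    for (size_t i = 0; i < n; i++) {
        if (ops[i] == OP_LOAD_STORE)
            load_store++;
    }
    return load_store <= MAX_LOAD_STORE_PER_INSTR;
}
```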

    Fig. 2.2: TriMedia functional unit assignment [4].

    2.2.2 Multimedia Processor Operations Overview

    In this section we present a brief overview of the multimedia processor instruction set.

    Readers are encouraged to refer to [4] for details.

    Conditional Execution: In multimedia processor architectures, all operations are op-

    tionally ’guarded’. A guarded operation executes conditionally, depending on the value

    in the ’guard’ register. For example, a guarded add is written as:

    IF R23 iadd R14 R10 → R13.

    This should be taken to mean: if R23 then R13 ← R14 + R10. The ’if R23’ clause controls the execution of the operation based on the LSB of R23. Hence, depending

    on the LSB of R23, R13 is either unchanged or set to contain the integer sum of R14

    and R10. Guarding applies to all TM1300 operations, except the iimm and uimm (load-

    immediate) operations. Guarding controls the effect on all programmer visible state of

    the system, i.e. register values, memory content, exception raising and device state.
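
    The effect of a guard on a single operation can be mimicked in C. The sketch below models only the register file as a plain array; the function name guarded_iadd and the array representation are assumptions made for illustration.

```c
#include <stdint.h>

/* Hypothetical model of: IF guard iadd src1 src2 -> dest.
 * Only the least significant bit of the guard register is examined. */
void guarded_iadd(int32_t *regs, int guard, int src1, int src2, int dest)
{
    if (regs[guard] & 1) {
        /* Add in unsigned arithmetic so the sketch has no signed overflow. */
        regs[dest] = (int32_t)((uint32_t)regs[src1] + (uint32_t)regs[src2]);
    }
    /* Otherwise regs[dest] is left unchanged. */
}
```

    Note that a guard value of 2 leaves the destination untouched, because only the LSB counts.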

    Load and Store Operations: Memory is byte addressable. Loads and stores have to be naturally aligned, i.e. a 16-bit load or store must target an address that is a multiple of two, and a 32-bit load or store must target an address that is a multiple of four. For the TM1300, the BSX bit in the PCSW (program control status word) register determines the byte order of loads and stores. Only 32-bit load and store operations (e.g., ld32 and st32; see Appendix A of [4]) are allowed to access MMIO registers in the MMIO address aperture; the results are undefined for other loads and stores. A load from a non-existent MMIO register returns an undefined result. A store to a non-existent MMIO register times out and does not take effect. There are no other side effects of an access to a non-existent MMIO register. The state of the BSX bit has no effect on the result of MMIO accesses. Loads are allowed to be issued speculatively. Loads that are outside the range of valid data memory addresses for the active process return an implementation-dependent value and do not generate an exception. Misaligned loads also return an implementation-dependent value and do not generate an exception.
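
    The natural-alignment rule can be checked with one modulo operation. The helper below is a minimal sketch with a hypothetical name; it only tests the address and does not model the implementation-dependent value returned by a misaligned load.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if an access of size_bytes (1, 2 or 4) at addr is naturally
 * aligned, i.e. addr is a multiple of the access size. */
bool is_naturally_aligned(uint32_t addr, uint32_t size_bytes)
{
    assert(size_bytes == 1 || size_bytes == 2 || size_bytes == 4);
    return (addr % size_bytes) == 0;
}
```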

    Compute Operations: Compute operations are register-to-register operations. The

    specified operation is performed on one or two source registers and the result is written

    to the destination register.

    Immediate Operations load an immediate constant (specified in the opcode) and produce

    a result in the destination register.

    Floating-Point Compute Operations are register-to-register operations. The specified

    operation is performed on one or two source registers and the result is written to the

    destination register. Unless otherwise mentioned all floating point operations observe

    the rounding mode bits defined in the PCSW register. All floating-point operations

    not ending in flags update the PCSW exception flags. All operations ending in flags

    compute the exception flags as if the operation were executed and return the flag values

    (in the same format as in the PCSW); the exception flags in the PCSW itself remain

    unchanged.

    Multimedia Operations are special compute operations. They are like normal compute

    operations, but the specified operations are not usually found in general purpose CPUs.

    These operations provide special support for multimedia applications.

    Special-Register Operations: Special-register operations operate on special registers,

    such as the program control status word and the branch address holding registers.

    Control-Flow Operations: Control-flow operations change the value of the program

    counter. Conditional jumps test the value in a register, and based on this value, change

    the program counter to the address contained in a second register or continue execution

    with the next instruction. Unconditional jumps always change the program counter

    to the specified immediate address. Control-flow operations can be interruptible or

    non-interruptible. The execution of an interruptible jump is the only occasion where a

    multimedia processor allows special event handling to take place.


    2.3 Workload Description

    Our workload consists of two major application domains, multimedia and bioinformatics.

    Both use compute- and data-intensive algorithms. In this section we present in detail the

    diversity found in these application domains, which we selected for the rigorous testing of

    our ECACF. The variability in the input data streams is also discussed.

    2.3.1 Multimedia Applications

    The multimedia application set consists of encoders and decoders (transcodecs) encom-

    passing three media types - speech, video, and audio (music) - and is summarized in

    Table 2.2 to Table 2.5. We obtained codes for these applications from various public

    domain sources [94] [95] [96] [21]. The applications were chosen for their importance

    in real systems and because we believe they are representative enough to support the

    inferences in this study. We evaluated all our applications with four inputs, summarized in Table 2.6.

    Here, we only report results from a single input for each application. We chose the input

    that gave the highest (normalized) standard deviation in per frame execution time on

    our base system. We call these inputs the default inputs, and list them in the second

    column of Table 2.6. Results with the other inputs are similar, both quantitatively and

    qualitatively. The G.728, H.263, and MPEG codecs statically distinguish multiple frame

    types. G.728 uses an adaptive algorithm, where certain parameters are updated every

    four frames. The processing of each frame in a single four-frame cycle is different due

    to the calculation of these parameters. Thus, we treat these as different types of frames

    (numbered one through four). The H.263 and MPEG codecs use almost the same video

    compression scheme. A key difference is that MPEG uses three different types of frames

    - I frames do not exploit inter-frame redundancy, P frames exploit inter-frame redun-

    dancy using a previous frame, and B frames exploit such redundancy using a previous

    and a later frame. Our H.263 codecs do not use B frames. They use a single I frame at

    the beginning of the video and P frames for the rest. We do not include the I frame in

    our analysis. It takes excessively long to simulate a frame with the MPEG codecs using

    the frame sizes specified by the MPEG-2 standard (about 4 to 16 hours per frame for

    MPEGenc). We scaled down the frame size to 176x144 pixels so that we could simulate

    a reasonable number of frames to assess execution time variability. We ensured that

    the scaling did not affect the cache behavior by performing a working set analysis and

    running representative experiments with larger frame sizes and different cache sizes. As

    the chosen frame size conforms to the H.263 standard, we used the same size for the

    H.263 codecs for consistency. Also for consistency, we used the same set of four inputs

    for both MPEG and H.263 codecs. These inputs contain a great deal of motion to

    stress the applications. H.263 was designed for low bit-rate applications such as video

    conferencing (which typically has less motion); therefore, our results from these inputs

    represent an upper bound on the expected variability for H.263.
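
    The selection of the default input described above (the input with the highest normalized standard deviation in per-frame execution time) can be sketched as follows. This is an illustration under stated assumptions: the function names are hypothetical, the population variance is used, and the squared ratio is compared instead of the ratio itself, which gives the same ranking and avoids a square root.

```c
#include <stddef.h>

/* Squared normalized standard deviation (variance / mean^2) of the
 * per-frame execution times t[0..n-1]. Returns 0 for empty input or
 * zero mean. Population variance is an assumption of this sketch. */
double normalized_variance(const double *t, size_t n)
{
    if (n == 0)
        return 0.0;

    double mean = 0.0;
    for (size_t i = 0; i < n; i++)
        mean += t[i];
    mean /= (double)n;
    if (mean == 0.0)
        return 0.0;

    double var = 0.0;
    for (size_t i = 0; i < n; i++)
        var += (t[i] - mean) * (t[i] - mean);
    var /= (double)n;

    return var / (mean * mean);
}

/* Index of the input whose frame times vary most relative to their mean. */
size_t pick_default_input(const double *const times[], const size_t counts[],
                          size_t n_inputs)
{
    size_t best = 0;
    double best_score = -1.0;

    for (size_t k = 0; k < n_inputs; k++) {
        double score = normalized_variance(times[k], counts[k]);
        if (score > best_score) {
            best_score = score;
            best = k;
        }
    }
    return best;
}
```

    Normalizing by the mean makes inputs with different average frame costs comparable, so a long-running but steady input is not preferred over a shorter, more variable one.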


    Application | Description | Input Vector | Sample Rate/Throughput

    GSMenc | Low bit-rate speech coding based on the European GSM 6.10 provisional standard. Uses RPE/LTP (residual pulse excitation/long term prediction) coding at 13 kb/s. Compresses frames of 160 16-bit samples into 264 bits. | orignova | 20 ms (160 samples), 8 kHz

    GSMdec | " | homemsg | "

    G728enc | High bit-rate speech coding based on the G.728 standard. Uses low-delay CELP (code excited linear prediction) coding at 16 kb/s. Compresses frames of five 16-bit samples into 10 bits. | lpcqutfe | 625 µs (5 samples), 8 kHz

    G728dec | " | homemsg | "

    G723enc | High bit-rate speech coding based on the G.728 standard. Uses low-delay CELP (code excited linear prediction) coding at 16 kb/s. Compresses frames of five 16-bit samples into 10 bits. | lpcqutfe | 625 µs (5 samples), 8 kHz

    G723dec | " | homemsg | "

    G729enc | High bit-rate speech coding based on the G.728 standard. Uses low-delay CELP (code excited linear prediction) coding at 16 kb/s. Compresses frames of five 16-bit samples into 10 bits. | lpcqutfe | 625 µs (5 samples), 8 kHz

    G729dec | " | homemsg | "

    Tab. 2.2: Multimedia Benchmarks (Speech Transcodecs).

    Application | Description | Input Vector | Sample Rate/Throughput

    H263enc | Low bit-rate video coding based on the H.263 standard. Primarily uses inter-frame coding (P frames). Widely used for bit rates less than 64 kb/s. | orignova | 40 ms, 25 frames/s

    H263dec | " | buggy | "

    H264Lenc | Low bit-rate video coding based on the H.264 standard. Primarily uses inter-frame coding (P frames). Widely used for bit rates less than 64 kb/s. | orignova | 40 ms, 25 frames/s

    H264Ldec | " | buggy | "

    MPEGenc | High bit-rate video coding based on the MPEG-2 video coding standard. Uses intra-frame (I) and inter-frame (P, B) coding. Typical bit rate is 1.5-6 Mb/s. | Buggy | 33 ms, 30 frames/s

    MPEGdec | " | flwr | "

    MPEG-1 encoder | High bit-rate video coding based on the MPEG-1 video coding standard. | Buggy | 33 ms, 30 frames/s

    MPEG-1 encoder | " | flwr | "

    NLIVQ | Non linear interpolative vector quantization, image processing codec | cameraman.tif | 512x512 resolution, gray scale

    Tab. 2.3: Multimedia Benchmarks (Video Transcodecs).

    Application | Description | Input Vector | Sample Rate/Throughput

    MP3enc | Audio decoding based on the MPEG Audio Layer-3 standard. Synthesizes an audio signal out of coded spectral components. Typical bit rate is 16-256 kb/s. | filter | 26 ms (1151 samples), 44.1 kHz

    MP3dec | " | filter | "

    Tab. 2.4: Multimedia Benchmarks (Audio Transcodecs).

    Application | Description

    FFT | Fast Fourier Transform

    IDCT | Inverse Discrete Cosine Transform

    T64 | Matrix Transpose 64x64

    M100 | Matrix Multiplication 100x100

    Tab. 2.5: Generic DSP application Benchmarks [7].

    Domain | Test Vector | Description | Features

    Audio | CatSteven | Soft rock song | 2500 frames, average length 65.25 seconds

    " | Sting | Pop song | "

    " | Beethoven | 2500 classical piece | "

    Video | Flwr | Drive-by of houses | 450 frames, each 18 seconds for H.263 and 15 seconds for MPEG

    " | Cact | Panoramic view | "

    " | Buggy | Buggy race | "

    " | Tens | Table tennis match | "

    Speech | Homemsg | An answering message | Average frame size for GSM codecs is 500, for G.72x is 19000, length: 20 seconds

    " | Orignova | Sentences read by different adults | "

    " | lpcqutefe | Sentence read by a boy | "

    Tab. 2.6: Test Vectors Characterization.

    2.3.2 Bioinformatics Workload

    Owing to a significant increase in biological threats against humans, plants and other species during the last two decades, there is a growing realization that bioinformatics and molecular biology equipment should be available in small form factors that can be readily deployed in the field [97]. This has led to the development of handheld devices for bioinformatics applications that are efficient in both battery usage and execution time. Bioinformatics is an interdisciplinary research area that helps to produce ’sensible’ and ’useful’ information from the wealth of data produced by the genome sequencing projects. We categorize the basic functionality offered by bioinformatics tools into four groups:

    1. Algorithms for pattern recognition: probability formulae are used to determine the statistical similarity between two or more given sequences.

    2. Rule-based analysis defines how a mathematical or statistical technique can be applied. Different sets are defined with a membership, and a set of rules is created to elaborate associativity. Basic set theory is used to fire a rule.

    3. Biological databases are uniformly and efficiently maintained archives of consistent data that contain information and annotation of DNA and protein sequences, DNA and protein structures, as well as DNA and protein expression profiles [98] [99]. An important feature of these databases is their simplicity in access and query management. In addition, some websites [100] [101] [102] provide visualization tools to aid biological interpretation.

    4. Biological taxonomy records the differences in sequences across different classes, helping further to reduce the similarity errors.

    We chose the applications for their importance in real systems and because they are representative enough to support the inferences in this study. They are summarized in Table 2.7. We obtained codes for these applications from various public domain sources. For lack of space, we only report their underlying algorithms; details may be found in [99] [97] [102]. The input databases are obtained from the NIH genetic sequence database ’GenBank’, the NCBI assembly archive ’Genome Assembly Archive’, the homologous structure alignment database ’HOMSTRAD’, the NIMH-NCI protein-disease database ’PDD’ and ’The Lens’ [100] [102].

    Application | Pseudonym | Features | Algorithms

    GENESPLICER | A01 | Detect splice sites in the genomic DNA | High accuracy and computationally efficient

    TIGRSCAN | A02 | DNA modeling | Generalized Hidden Markov Model (GHMM), HMM

    TRANSTERMIS | A03 | Rho-independent transcriptional terminators | Statistical estimation techniques

    GENSCAN | A04 | Predict complete gene structure | Search algorithms

    MUMMER | A05 | Genome sequence alignment | Tree algorithms

    GLIMMERHMM | A06 | Find gene sequence in eukaryotes | IMM, splice site models, maximal dependence decomposition techniques

    GENIE | A07 | Gene finder in vertebrate and human DNA | GHMM, neural networks

    FGENE | A08 | Find splice sites, genes, promoters | Linear discriminant analysis

    GRAIL | A09 | Analysis of DNA sequence | Automated computation

    GENEMARK | A10 | Find genes in bacterial DNA sequence | Markov chains

    NetPlaneGene | A11 | Sequence analysis | Neural network

    GLIMMER | A12 | Coding regions in microbial DNA | Interpolated Markov Models (IMM)

    Tab. 2.7: Bio-Computation Applications Benchmark.


    2.4 Energy Cycle Aware Compilation Framework Methodology

    The ECACF is shown in Figure 2.3. The source code is processed successively by static code analysis, post-compiler analysis and finally scheduling analysis. A VLIW processor descriptor file (VDF) provides architecture information to the compiler, the scheduler and finally the machine code generator. The VDF contains a list of pseudo and machine operations, the latency of the operations, opcodes, slot assignment schemes, the processor operating frequency, instruction cache features (associativity, block size, number of sets) and main memory features (size, order, read/write latencies). The file format is compatible with those described in [103] [4] [81] [104]. Here, we follow the same VLIW naming convention as used in [104]. This feature makes our scheme architecture independent. A list of parameters is generated in each step of the methodology flow. Intermediate trace files are generated during the code processing flow to produce the Application Expression Profile (AEP), such as code size, execution time, number of cache misses (for both instruction and data caches), data cache conflicts, data bank alignment, highway usage, scheduling factor and slot utilization. After the simulation, these parameters are used to compute transformation control factors such as the unrolling factor, grafting depth and blocking metrics. These control factors are further explained in [25]. After each iteration, all these parameters are recorded again and compared to the preset user constraints given in a User Constraint File (UCF). This file contains the desired values for code size, execution time, energy and allowed percentage of cache misses. Energy is measured on the target platform (the setup is explained in Section 2.5). All these parameters are fed back to the transformation cost analyzer. After each successive transformation, it is decided whether the energy-cycle performance has been optimized or not. The source code is optimized by applying code restructuring schemes known as loop unrolling, decision tree grafting and loop tiling. Additional benefits are gained by combining them with traditional compiler optimization algorithms, such as constant and variable propagation, dead code elimination and strength reduction.
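
    The comparison of recorded parameters against the UCF after each iteration can be pictured with a short C sketch. Everything named here (struct profile, its fields, meets_constraints, first_satisfying) is a hypothetical simplification for illustration; the actual UCF format and the transformation cost analyzer are not reproduced.

```c
#include <stdbool.h>

/* Hypothetical subset of the parameters recorded after each iteration.
 * The same layout is reused to hold the limits read from the UCF. */
struct profile {
    unsigned long code_size;      /* bytes */
    unsigned long exec_cycles;    /* execution time in cycles */
    double energy;                /* joules, measured on the target */
    double cache_miss_pct;        /* percentage of cache misses */
};

/* True if every measured value is within the corresponding UCF limit. */
bool meets_constraints(const struct profile *m, const struct profile *ucf)
{
    return m->code_size <= ucf->code_size &&
           m->exec_cycles <= ucf->exec_cycles &&
           m->energy <= ucf->energy &&
           m->cache_miss_pct <= ucf->cache_miss_pct;
}

/* Index of the first iteration whose recorded profile satisfies the UCF,
 * or -1 if none of the n recorded iterations does. */
int first_satisfying(const struct profile *history, int n,
                     const struct profile *ucf)
{
    for (int i = 0; i < n; i++) {
        if (meets_constraints(&history[i], ucf))
            return i;
    }
    return -1;
}
```

    In an iterative flow, a result of -1 would correspond to continuing with another transformation step (for example a different unrolling factor) and measuring again.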


    Fig. 2.3: Transformation methodology.


    2.4.1 Application Expression Profile

    From a ’C’ source code to an executable binary, an embedded application has to go through many tools: the text editor, compiler, scheduler, linker, and loader. The urge ’how can I?’ is transformed into a conscious, biased perception entailed by embedded systems emerging from software-hardware co-design. The software leads and the hardware follows the technological limitations. The behavior that a software implementation can express on hardware is limited by the liberty offered by the hardware architecture and by the ability of programmers to code the ’how can I?’. These issues indicate that, for a ’good’ energy-cycle performance, there is a need to gather more detailed profiles containing information about system behavior at various levels, as shown in Figure 2.4. The main goal of such vertical profiling is to further improve the understanding of system behavior through correlation of profile information at different levels.

    Fig. 2.4: Vertical application profile layers.

    An executable application development hierarchy is composed of compilation, scheduling, linking, and binary code generation. Finally, the code is downloaded to the SDRAM attached to the multimedia processor. Our Application Profile Monitor (APM) extracts the application behavioral parameters mentioned above. This information is extracted from the vertical profile layer block shown in Figure 2.4. An application is profiled in terms of both its static and its run-time (dynamic) behavior. We call the way an application expresses itself on a given hardware architecture its Application Expression Profile (AEP). We characterize an application expression profile using the following conventions:

    1) Name: The name of the profile monitor.

    2) Definition: The definition of the profile monitor as used in our ECACF.

    3) Location: The location of the monitor in the application development hierarchy, such as compilation, scheduling or linking.

    4) Type: There are two possible types: static or dynamic.

    5) Range: The possible range of values a monitor can take.

    6) Level: If a parameter is measured directly from the code, it is called a primary monitor; if it is computed from one or more other parameters, we call it a secondary monitor.

    E.g., a primary monitor can be written as:

    Name: Processor Frequency

    Definition: The operating frequency of the microprocessor

    Location: VDF

    Type: static

    Range: Typical 100MHz - 233MHz (depends on given hardware architecture)

    Level: Primary

    Similarly, a secondary monitor can be written as:

    Name: Scheduling Factor

    Definition: Computed by dividing the infinite machine cycle time by the finite

    machine cycle time

    Location: Transformation Engine and Scheduler

    Type: Dynamic

    Range: 0 to 1

    Level: Secondary

    A complete list of profile monitors is provided in Appendix A.
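
    The six-field convention above maps naturally onto a record type. The following C sketch is only one possible encoding: the struct layout, the enum names and the helper scheduling_factor are assumptions, and the two example records simply restate the monitors given in the text.

```c
/* Possible encoding of the monitor convention (names are hypothetical). */
enum monitor_type { MONITOR_STATIC, MONITOR_DYNAMIC };
enum monitor_level { MONITOR_PRIMARY, MONITOR_SECONDARY };

struct profile_monitor {
    const char *name;
    const char *definition;
    const char *location;
    enum monitor_type type;
    double range_min;
    double range_max;
    enum monitor_level level;
};

/* Primary monitor from the text; the range is expressed in MHz. */
const struct profile_monitor processor_frequency = {
    "Processor Frequency",
    "The operating frequency of the microprocessor",
    "VDF",
    MONITOR_STATIC, 100.0, 233.0, MONITOR_PRIMARY
};

/* Secondary monitor from the text. */
const struct profile_monitor scheduling_factor_monitor = {
    "Scheduling Factor",
    "Infinite machine cycle time divided by finite machine cycle time",
    "Transformation Engine and Scheduler",
    MONITOR_DYNAMIC, 0.0, 1.0, MONITOR_SECONDARY
};

/* Value of the secondary monitor: infinite-machine cycles divided by
 * finite-machine cycles. Returns 0 if the divisor is zero. */
double scheduling_factor(unsigned long infinite_cycles,
                         unsigned long finite_cycles)
{
    if (finite_cycles == 0)
        return 0.0;
    return (double)infinite_cycles / (double)finite_cycles;
}
```

    A schedule on the real (finite) machine cannot need fewer cycles than on an idealized machine with unlimited resources, which is why the stated range of this monitor is 0 to 1.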

    2.5 Experimental Setup

    Knowing the energy consumed by an application on a real-time platform is a first step in the design of any energy-constrained embedded system, and it can be used to estimate the battery lifetime of the system. In this section, we describe an energy measurement method for a software application running on a real-time multimedia VLIW processor. The method is described for the Philips TM1300 DSP processor, but it is applicable to other multimedia processors, e.g., the Blackfin ADSP533S. The measurement framework has been incorporated into our ECACF, which allows a software application programmer to measure the real-time energy consumption by running the candidate ’C’ source code.


    2.5.1 Related Work for Energy Measurement

    The energy consumption of a software applicati