
The Pennsylvania State University

The Graduate School

Department of Computer Science and Engineering

ORCHESTRATING THE COMPILER AND

MICROARCHITECTURE FOR REDUCING CACHE ENERGY

A Thesis in

Computer Science and Engineering

by

Jie Hu

© 2004 Jie Hu

Submitted in Partial Fulfillment of the Requirements

for the Degree of

Doctor of Philosophy

August 2004

The thesis of Jie Hu was reviewed and approved∗ by the following:

Vijaykrishnan Narayanan
Associate Professor of Computer Science and Engineering
Thesis Adviser
Chair of Committee

Mary Jane Irwin
A. Robert Noll Chair of Engineering
Professor of Computer Science and Engineering

Mahmut Kandemir
Associate Professor of Computer Science and Engineering

Yuan Xie
Assistant Professor of Computer Science and Engineering

Richard R. Brooks
Research Associate / Department Head, Applied Research Laboratory
Industrial and Manufacturing Engineering

Raj Acharya
Professor of Computer Science and Engineering
Head of the Department of Computer Science and Engineering

∗Signatures are on file in the Graduate School.

Abstract

Cache memories are widely employed in modern microprocessor designs to bridge

the increasing speed gap between the processor and the off-chip main memory, which

constitutes the major performance bottleneck in computer systems. Consequently, caches

consume a significant amount of the transistor budget and chip die area in microproces-

sors employed in both low-end embedded systems and high-end server systems. Being

a major consumer of on-chip transistors, and thus of the power budget, cache memory deserves a new and complete study of its performance and energy behavior and new techniques for designing cache memories for next-generation microprocessors.

This thesis focuses on developing compiler and microarchitecture techniques for

designing energy-efficient caches, targeting both dynamic and leakage energy. This thesis

has made four major contributions towards energy efficient cache architectures. First, a

detailed cache behavior characterization for both array-based embedded applications and

general-purpose applications was performed. The insights obtained from this study sug-

gest that (1) different applications or different code segments within a single application

have very different cache demands in the context of performance and energy concerns,

(2) program execution footprints (instruction addresses) can be highly predictable and

usually have a narrow scope during a particular execution phase, especially for embed-

ded applications, and (3) accesses to the instruction cache exhibit high sequentiality.

Second, a technique called compiler-directed cache polymorphism (CDCP) was proposed.

CDCP is used to analyze the data reuse exhibited by loop nests, and thus to extract the

cache demands and determine the best data cache configurations for different code seg-

ments to achieve the best performance and energy behavior. Third, this thesis presents

a redesigned processor datapath to capture and utilize the predictable execution foot-

print for reducing energy consumption in instruction caches. Finally, this thesis work

addresses the increasing leakage concern in the instruction cache by exploiting cache

hotspots during phase execution and the sequentiality exhibited in the execution footprint.

To my Mom and Dad

and

To Kai Chen

Table of Contents

List of Tables . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xi

List of Figures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xii

Acknowledgments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . xvi

Chapter 1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

1.1 Caches as the Bridge between Processor and Memory . . . . . . . . . 1

1.2 Basics on Cache Energy Consumption . . . . . . . . . . . . . . . . . 3

1.3 Thesis Statement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6

1.4 Challenges in This Work . . . . . . . . . . . . . . . . . . . . . . . . . 7

1.4.1 Determine Cache Resource Demands . . . . . . . . . . . . . . 8

1.4.2 Redesign Instruction-Supply Mechanism . . . . . . . . . . . . 9

1.4.3 Application Sensitive Leakage Control . . . . . . . . . . . . . 10

1.5 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11

1.6 Thesis Roadmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13

Chapter 2. Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.1 Compiler Optimizations . . . . . . . . . . . . . . . . . . . . . . . . . 14

2.1.1 Addressing Dynamic Energy Consumption . . . . . . . . . . . 14

2.1.2 Managing Cache Leakage . . . . . . . . . . . . . . . . . . . . 15

2.2 Architectural and Microarchitectural Schemes . . . . . . . . . . . . . 16

2.2.1 Using Additional Smaller Caches . . . . . . . . . . . . . . . . 16

2.2.2 Changing Load Capacitance of Cache Access . . . . . . . . . 16

2.2.3 Improving the Fetch Mechanism . . . . . . . . . . . . . . . . 17

2.2.4 Reducing Leakage in Caches . . . . . . . . . . . . . . . . . . . 18

2.3 Circuit and Device Techniques . . . . . . . . . . . . . . . . . . . . . 18

Chapter 3. Experimental Models . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.1 Simulation Frameworks . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.1.1 Evaluating Compiler schemes . . . . . . . . . . . . . . . . . . 20

3.1.2 Evaluating Microarchitectural Schemes . . . . . . . . . . . . . 21

3.2 Benchmarks and Input Sets . . . . . . . . . . . . . . . . . . . . . . . 21

Chapter 4. Characterizing Application and Cache Behavior . . . . . . . . . . . . 24

4.1 Data Cache Demands for Performance . . . . . . . . . . . . . . . . . 24

4.2 Instruction Execution Footprint . . . . . . . . . . . . . . . . . . . . . 27

4.3 Accessing Behavior in Instruction Cache . . . . . . . . . . . . . . . . 28

Chapter 5. Analyzing Data Reuse for Cache Energy Reduction . . . . . . . . . . 34

5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

5.2 Array-Based Codes . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

5.2.1 Representation for Programs . . . . . . . . . . . . . . . . . . 36

5.2.2 Representation for Loop Nests . . . . . . . . . . . . . . . . . 38

5.2.3 Representation for Array References . . . . . . . . . . . . . . 39

5.3 Cache Behavior . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.3.1 Cache Misses . . . . . . . . . . . . . . . . . . . . . . . . . . . 40

5.3.2 Data Reuse and Data Locality . . . . . . . . . . . . . . . . . 41

5.4 Algorithms for Cache Polymorphism . . . . . . . . . . . . . . . . . . 42

5.4.1 Compiler-Directed Cache Polymorphism . . . . . . . . . . . . 43

5.4.2 Formal Description of Program Hierarchies . . . . . . . . . . 45

5.4.3 Array References and Uniform Reference Sets . . . . . . . . . 46

5.4.4 Algorithm for Reuse Analysis . . . . . . . . . . . . . . . . . . 47

5.4.4.1 Self-Reuse Analysis . . . . . . . . . . . . . . . . . . 48

5.4.4.2 Group-Reuse Analysis . . . . . . . . . . . . . . . . . 50

5.4.5 Simulating the Footprints of Reuse Spaces . . . . . . . . . . . 52

5.4.6 Computation and Optimization of Cache Configurations for

Loop Nests . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56

5.4.7 Global Level Cache Polymorphism . . . . . . . . . . . . . . . 58

5.4.8 An Example . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60

5.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.5.1 Simulation Framework . . . . . . . . . . . . . . . . . . . . . . 64

5.5.2 Selected Cache Configurations . . . . . . . . . . . . . . . . . . 65

5.5.3 Simulation Results . . . . . . . . . . . . . . . . . . . . . . . . 68

5.6 Discussions and Summary . . . . . . . . . . . . . . . . . . . . . . . . 76

Chapter 6. Reusing Instructions for Energy Efficiency . . . . . . . . . . . . . . . 77

6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.2 Modified Issue Queue Design . . . . . . . . . . . . . . . . . . . . . . 80

6.2.1 Detecting Reusable Loop Structures . . . . . . . . . . . . . . 81

6.2.2 Buffering Reusable Instructions . . . . . . . . . . . . . . . . . 82

6.2.2.1 Buffering Strategy . . . . . . . . . . . . . . . . . . . 84

6.2.2.2 Handling Procedure Calls . . . . . . . . . . . . . . . 85

6.2.3 Optimizing Loop Buffering Strategy . . . . . . . . . . . . . . 85

6.2.4 Reusing Instructions in the Issue Queue . . . . . . . . . . . . 87

6.2.5 Restoring Normal State . . . . . . . . . . . . . . . . . . . . . 89

6.3 Distribution of Dynamic Loop Code . . . . . . . . . . . . . . . . . . 89

6.4 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6.5 Impact of Compiler Optimizations . . . . . . . . . . . . . . . . . . . 96

6.6 Discussions and Summary . . . . . . . . . . . . . . . . . . . . . . . . 99

Chapter 7. Managing Instruction Cache Leakage . . . . . . . . . . . . . . . . . . 101

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101

7.2 Existing Approaches: Where they stumble? . . . . . . . . . . . . . . 105

7.3 Using Hotspots and Sequentiality in Managing Leakage . . . . . . . 109

7.3.1 HSLM: HotSpot Based Leakage Management . . . . . . . . . 110

7.3.1.1 Protecting Program Hotspots . . . . . . . . . . . . . 110

7.3.1.2 Detecting New Program Hotspots . . . . . . . . . . 114

7.3.2 JITA: Just-In-Time Activation . . . . . . . . . . . . . . . . . 115

7.4 Design Space Exploration . . . . . . . . . . . . . . . . . . . . . . . . 116

7.5 Experimental Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . 119

7.5.1 Experiment Setup . . . . . . . . . . . . . . . . . . . . . . . . 119

7.5.2 Experimental Results . . . . . . . . . . . . . . . . . . . . . . 122

7.5.3 Sensitivity Analysis . . . . . . . . . . . . . . . . . . . . . . . . 129

7.6 Discussions and Summary . . . . . . . . . . . . . . . . . . . . . . . . 133

Chapter 8. Conclusions and Future Work . . . . . . . . . . . . . . . . . . . . . . 136

8.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 136

8.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 138

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142

List of Tables

3.1 Base configurations of simulated processor and memory hierarchy for

evaluating microarchitectural schemes. . . . . . . . . . . . . . . . . . . . 22

3.2 Array-based benchmarks used in the experiments. . . . . . . . . . . . . 22

3.3 Benchmarks from SPEC2000 used in the experiments. . . . . . . . . . . 23

5.1 Cache configurations generated by algorithm 4 for the example nest. . . 64

5.2 Running time of algorithm 4 for each benchmark. . . . . . . . . . . . . . 65

5.3 Cache configurations for each loop nest in benchmarks: Shade Vs CDCP. 67

5.4 Energy consumption (microjoules) of L1 data cache for each loop nest

in benchmarks with configurations in Table 5.3: Shade Vs CDCP. . . . . 75

7.1 Leakage control schemes evaluated: turn-off mechanisms. . . . . . . . . 116

7.2 Leakage control schemes evaluated: turn-on mechanisms. . . . . . . . . . 117

7.3 Technology and energy parameters for the simulated processor given in

Table 3.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121

List of Figures

1.1 Typical memory hierarchy in modern computer systems. . . . . . . . . . 2

1.2 Leakage current paths in an SRAM cell. The bitline leakage flows through

the access transistor Nt2, while the cell leakage flows through transistors

N1 and P2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5

4.1 Cache performance behavior as data cache size increases from 1KB to

1024KB. All cache configurations use fixed block size of 32 bytes and

fixed associativity of 4 ways. . . . . . . . . . . . . . . . . . . . . . . . . . 26

4.2 Dynamic instruction address distribution at runtime for a set of array-

intensive code. A sampling rate (/500) means only one instruction among

500 dynamic instructions is sampled for its address (PC). . . . . . . . . 29

4.3 The distribution of accesses (at cache line granularity) to L1 instruction

cache with respect to the length of consecutively accessing cache lines

(sequential length), for SPEC2000 benchmarks. The rightmost bar (<)

in each plot corresponds to those with sequential length larger than 32

cache lines. Continued in Figure 4.4. . . . . . . . . . . . . . . . . . . 31

4.4 (Continuation of Figure 4.3.) The distribution of accesses (at cache line

granularity) to L1 instruction cache with respect to the length of consec-

utively accessing cache lines (sequential length), for SPEC2000 bench-

marks. Last two plots show the average distribution for integer bench-

marks and floating-point benchmarks used, respectively. . . . . . . . . . 32

5.1 Format for a program. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

5.2 Format for a loop nest. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38

5.3 Overview of compiler-directed cache polymorphism (CDCP). . . . . . . 44

5.4 Intermediate format of source codes produced by the generator. . . . . . 45

5.5 Example code – a loop nest. . . . . . . . . . . . . . . . . . . . . . . . . . 46

5.6 An Example: Array-based code. . . . . . . . . . . . . . . . . . . . . . . . 60

5.7 Cache performance comparison for configurations at block size of 16:

Shade Vs CDCP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.8 Cache performance comparison for configurations at block size of 32:

Shade Vs CDCP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5.9 Cache performance comparison for configurations at block size of 64:

Shade Vs CDCP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

5.10 A breakdown of cache performance comparison at the granularity of each

loop for benchmarks adi, aps, bmcm, and tsf. Configurations for all

three cache block sizes, 16 byte, 32 byte and 64 byte are compared:

Shade Vs CDCP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

5.11 A breakdown of cache performance comparison at the granularity of each

loop for benchmarks eflux, tomcat, vpenta, and wss. Configurations for

all three cache block sizes, 16 byte, 32 byte and 64 byte are compared:

Shade Vs CDCP. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

6.1 (a). The datapath diagram, and (b). pipeline stages of the modeled

baseline superscalar microprocessor. Parts in dotted lines are augmented

for the new design. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.2 State machine for the issue queue. . . . . . . . . . . . . . . . . . . . . . 83

6.3 The new issue queue with augmented components supporting instruction

reuse. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86

6.4 An example of a non-bufferable loop that is an outer loop in this code

piece . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87

6.5 Dynamic instruction distribution w.r.t. loop sizes. . . . . . . . . . . . . 91

6.6 Percentages of the total execution cycles that the pipeline front-end has

been gated with different issue queue sizes: 32, 64, 128, 256 entries. . . . 92

6.7 Access Reduction and energy reduction in instruction cache, branch pre-

dictor, instruction decoder, and issue queue. . . . . . . . . . . . . . . . . 94

6.8 The overall power reduction compared to a baseline microprocessor using

the conventional issue queue at different issue queue sizes. . . . . . . . . 95

6.9 Performance impact of reusing instructions at different issue queue sizes. 96

6.10 Impact of compiler optimizations on instruction cache accesses. . . . . . 97

6.11 Impact of compiler optimizations on overall energy saving. . . . . . . . . 98

6.12 Impact of compiler optimizations on performance degradation. . . . . . 98

7.1 (a). A simple loop with two portions, (b). Bank mapping for the loop

given in (a). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106

7.2 Leakage control circuitry supporting Just-in-Time Activation (JITA). . 111

7.3 Microarchitecture for Hotspot based Leakage Management (HSLM) scheme.

Note that O/P from AND gates go to the set I/P of the mask latches. . 113

7.4 The ratio of cycles that cache lines are in active mode over the entire

execution time (Active ratio). . . . . . . . . . . . . . . . . . . . . . . . . 121

7.5 Breakdown of turn offs in scheme DHS-Bk-PA. . . . . . . . . . . . . . . . 123

7.6 Leakage energy reduction w.r.t the Base scheme. . . . . . . . . . . . . . 123

7.7 The leakage energy breakdown (an average for fourteen SPEC2000 bench-

marks). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125

7.8 Ratio of activations on instruction cache hits. . . . . . . . . . . . . . . . 126

7.9 The ratio of effective preactivations performed by JITA over total acti-

vations incurred during the entire simulation. . . . . . . . . . . . . . . . 126

7.10 Performance degradation w.r.t the Base scheme. . . . . . . . . . . . . . 128

7.11 Energy delay (J*s) product (EDP). . . . . . . . . . . . . . . . . . . . . . 128

7.12 Impact of sampling window size on leakage control scheme DHS-Bk-PA. . 130

7.13 Impact of hotness threshold on leakage control scheme DHS-Bk-PA. . . . 131

7.14 Impact of subbank size on leakage control scheme DHS-Bk-PA. . . . . . . 132

7.15 Impact of cache associativity. IPC degradation (left), Leakage energy

reduction (right). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

Acknowledgments

I would like to take this special moment to thank the many people who guided and helped me through these four invaluable years of Ph.D. life at Penn State. First and foremost, my heartfelt gratitude goes to my thesis advisor, Dr. Vijaykrishnan Narayanan. He introduced me to this most exciting research area and supervised my research with great enthusiasm. He was always ready to help whenever I ran into difficulty or found myself in a quandary with my research. I'll never forget our sincere and open discussions on a wide variety of topics. Working with him has brought me lifelong benefits.

I am also grateful and indebted to Dr. Mary Jane Irwin and Dr. Mahmut Kan-

demir for their great suggestions and advice, enlightening discussions, and the happy time

we worked together. I thank my other committee members, Dr. Richard Brooks and

Dr. Yuan Xie, for their insightful commentary on my work.

I feel very lucky to have worked with many of our wonderful MDL members. My work at MDL wouldn't have been such a great joy without those friends, who never hesitated to stop by for brainstorming or chit-chat. I'd like to thank you all, especially Wei Zhang, Vijay Degalahal, Avanti Nadgir, Feihui Li, Yuh-Fang Tsai, Jooheung Lee, and Soontae Kim.

Finally and most importantly, I owe my deepest gratitude and thanks to my

family. Mom, Dad, and Jiaorong, your unconditional support made all this possible. My dear wife, Kai Chen, you have always been by my side, supporting me, encouraging me, and sharing every joy and every anguish; your love has made these Ph.D. years fantastic.

Chapter 1

Introduction

Cache memories are widely employed in modern microprocessor designs to bridge

the increasing speed gap between the processor and the off-chip main memory, which

constitutes the major performance bottleneck in computer systems. Consequently, caches

consume a significant amount of the transistor budget and chip die area in microproces-

sors employed in both low-end embedded systems and high-end server systems. Being

a major consumer of on-chip transistors, and thus of the power budget, cache memory deserves a new and complete study of its performance and energy behavior and new techniques for designing cache memories for next-generation microprocessors.

1.1 Caches as the Bridge between Processor and Memory

The computer memory system adopts a hierarchical design that takes advantage of reference locality (in both code and data) and exploits the cost/performance trade-offs of different memory technologies. The growing speed gap between the fast processor and the

slow memory has resulted in the increasing importance of the memory hierarchy. The

speed of microprocessors has improved by around 60% per year since 1987, while memory performance has improved by less than 10% per year since 1980. As pointed out in [28],

microprocessors designed in 1980 were often without caches, while in 1995 microproces-

sors were often integrated with two levels of caches. Figure 1.1 shows the typical memory

hierarchy present in modern computer systems. The closer a memory structure is to the CPU, the faster, smaller, and more expensive it is. Level one caches usually operate at the same

clock frequency as the processor core.

[Figure 1.1 depicts the levels of the hierarchy: CPU registers, L1 caches, L2/L3 caches, main memory, and disk storage.]

Fig. 1.1. Typical memory hierarchy in modern computer systems.

In today’s microprocessor designs, for both low-end embedded microprocessors

and high-performance general purpose microprocessors, larger on-chip caches are always

preferred for the sake of performance. However, embedded microprocessors in general have a much simpler memory hierarchy with only one level of on-chip data and instruction caches; e.g., the StrongArm SA-110 [56] processor has only level one caches, which account for

94% of the total transistor budget. Continuous technology scaling enables large and

multilevel cache structures to be integrated with the processor core on a single die. For

the latest processors, a large fraction of the transistor budget and die area is dedicated

to the cache structures, e.g., 90% in Alpha 21364 processors [9], 86% (93% in the second

generation) in Itanium 2 processors [57][70].

Meanwhile, excessive power/energy consumption has become one of the major

impediments in designing future microprocessors as semiconductor technology continues

scaling down. Low power design is not only important in battery-powered embedded

systems, but also very important in desktop PCs, workstations, and even servers due to

the increasing packaging and cooling cost. As on-chip caches dominate the transistor

budget, they present a major contribution to the dynamic and leakage power consump-

tion in processors. For example, the data cache and instruction cache consume 43% of the total dynamic power in the DEC StrongArm SA-110 [56], and 22% in the IBM PowerPC micro-

processor [10]. Leakage power is projected to account for 70% of the cache power budget

in 70nm technology [20]. Thus, optimizing power/energy consumption in caches is of

first-class importance in microprocessor designs.

1.2 Basics on Cache Energy Consumption

Cache energy consumption consists of two parts: dynamic energy Edyn and leakage

energy Eleak, as shown in Equation 1.1.

E = Edyn + Eleak (1.1)

Dynamic energy consumption of a device can be modeled as Equation 1.2,

Edyn = CL × VDD^2 × P0→1 (1.2)

where CL is the capacitance of the device, VDD is the supply voltage, and P0→1 is the

probability that the device switches. Clearly, optimizing dynamic energy consumption

can attack one or more of these three parameters by reducing the number of switching

devices, lowering the supply voltage, or reducing the switching probability. This equation

also shows that reducing supply voltage has a quadratic effect on decreasing the dynamic

energy consumption. However, a lower supply voltage leads to a slower circuit.
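
To make Equation 1.2 concrete, the following small C sketch evaluates the per-access dynamic energy at two supply voltages; the capacitance and switching-probability values are illustrative placeholders, not measured cache parameters, and the sketch is only meant to show the quadratic effect of VDD.

    #include <stdio.h>

    /* Dynamic energy of a switching device per Equation 1.2:
     * Edyn = CL * VDD^2 * P(0->1).  All numbers below are
     * hypothetical, chosen purely for illustration. */
    static double dynamic_energy(double cl_farads, double vdd_volts,
                                 double switch_prob)
    {
        return cl_farads * vdd_volts * vdd_volts * switch_prob;
    }

    int main(void)
    {
        double cl = 1.0e-12;   /* assumed load capacitance: 1 pF */
        double p01 = 0.5;      /* assumed switching probability  */

        /* Halving VDD cuts dynamic energy by 4x (quadratic effect). */
        printf("Edyn at 1.8 V: %e J\n", dynamic_energy(cl, 1.8, p01));
        printf("Edyn at 0.9 V: %e J\n", dynamic_energy(cl, 0.9, p01));
        return 0;
    }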

At the microarchitectural level, the Cacti cache model [63] is used in this thesis to derive the dynamic energy consumption of an access to a given cache configuration. The energy consumption comes from two parts: the tag portion and the data array portion. In the tag portion, energy is consumed in the address decoder, wordlines, bitlines, sense amplifiers, comparators, mux drivers, and output drivers. Similarly, in the data portion, energy is consumed in the address decoder, wordlines, bitlines, sense amplifiers, and output drivers. The data bitlines and sense amplifiers are responsible for the majority of the energy consumption in caches of low associativity [63].

On-chip caches constitute the major portion of the processor’s transistor budget

and account for a significant share of leakage energy, which can be derived from Equation 1.3.

Leakage current is a combination of subthreshold leakage current and gate-oxide leakage

current: Ileak = Isub + Iox.

Eleak = VDD × Ileak × t (1.3)

Fig. 1.2. Leakage current paths in an SRAM cell. The bitline leakage flows through the access transistor Nt2, while the cell leakage flows through transistors N1 and P2.

Figure 1.2 illustrates the various leakage current paths in a typical SRAM cell.

The current through the access transistor Nt2 from the bitline is referred to as bitline

leakage, while the current flowing through transistors N1 and P2 is cell leakage. Both

bitline and cell leakage result from subthreshold conduction: current flowing from the source to the drain even when the gate-source voltage is below the threshold voltage Vth.

Page 22: ORCHESTRATING THE COMPILER AND MICROARCHITECTURE …

6

The following equation developed in [16] shows how subthreshold leakage current

depends on threshold voltage and supply voltage.

Isub = K1 × W × e^(−Vth/(nVθ)) × (1 − e^(−VDD/Vθ)) (1.4)

K1 and n are experimentally derived, W is the gate width, and Vθ in the exponents is

the thermal voltage. At room temperature, Vθ is about 25 mV; it increases linearly as

temperature increases. Equation 1.4 suggests two ways to reduce Isub: reducing supply

voltage VDD or increasing the threshold voltage Vth. However, for an SRAM cell, lowering

VDD may destroy the state stored; using high Vth transistors will increase the access

latency of SRAM cells.
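
As a minimal numerical sketch of Equations 1.3 and 1.4, the C fragment below computes the subthreshold leakage current and the resulting leakage energy over an interval. K1, n, W, the supply voltage, and the time interval are hypothetical values chosen only to show how Isub falls as Vth rises; they are not parameters from the experiments in this thesis.

    #include <math.h>
    #include <stdio.h>

    /* Subthreshold leakage current per Equation 1.4 and leakage
     * energy per Equation 1.3, with assumed fitting constants. */
    static double i_sub(double k1, double w, double vth, double n,
                        double vtheta, double vdd)
    {
        return k1 * w * exp(-vth / (n * vtheta))
                      * (1.0 - exp(-vdd / vtheta));
    }

    int main(void)
    {
        double k1 = 1.0e-6, n = 1.5, w = 1.0; /* assumed constants     */
        double vtheta = 0.025;                /* thermal voltage ~25 mV */
        double vdd = 1.0, t = 1.0e-3;         /* supply and interval    */

        for (double vth = 0.2; vth <= 0.41; vth += 0.1) {
            double isub = i_sub(k1, w, vth, n, vtheta, vdd);
            double eleak = vdd * isub * t;    /* Equation 1.3           */
            printf("Vth=%.1f V  Isub=%e A  Eleak=%e J\n", vth, isub, eleak);
        }
        return 0;
    }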

On the other hand, gate-oxide leakage Iox is projected to be dramatically reduced

if high-k dielectric gate insulators reach mainstream production [1]. Thus, this thesis

will not discuss gate leakage further.

1.3 Thesis Statement

Caches were first introduced to form the memory hierarchy in computer systems

for the sake of performance. Technology advances have turned the increasing power/energy con-

sumption into a major constraint in designing future microprocessors. Due to their large

share of transistor budget, die area, and energy budget of the processor, caches are the

ideal target for optimizing microprocessor energy behavior. With the principal role of caches in mind, any energy optimization based on caches should be carefully weighed so as not to jeopardize performance noticeably.

This thesis explores the implicit connection between an application's characteristics and its cache behavior, and tries to answer how this information can be used

to reduce energy consumption (both dynamic and leakage energy) in caches and how

compiler and microarchitectural schemes can be orchestrated for this purpose. More

specifically, this work focuses on four main problems. First, given an application, what information should be extracted and what characteristics should be studied for cache energy optimization? Second, how should this information be extracted: at compile time or at run time? Third, how can the compiler and microarchitecture utilize the analytical results from the second step? Finally, how can one justify that the information identified in the first step is the right information for the purpose of this thesis work?

1.4 Challenges in This Work

There are several major challenges in achieving the objective of this thesis work.

As discussed in Section 1.2, dynamic energy can be reduced by attacking one or more

of the three factors CL, VDD, and P0→1 in Equation 1.2. Lowering VDD is suggested as

the most effective way due to its quadratic effect on energy consumption. The major

problem associated with this approach is that lowering VDD also slows the circuit. Notice

that instruction fetch and data loading are on the critical path of the processor datapath

pipeline. Increasing access latency to the level one instruction cache or data cache is not

preferable even for energy optimization, which leaves two other options: reducing CL

or reducing P0→1. Now, the question is what we can do with these two factors with

respect to the principal role of caches.

There is a similar problem when optimizing leakage (subthreshold leakage) energy

in caches. Equation 1.4 tells us that increasing the threshold voltage Vth reduces the

subthreshold leakage current. However, high Vth results in a longer access latency to

the SRAM cell. Reducing supply voltage VDD to zero can eliminate the subthreshold

leakage current (Isub = 0). Meanwhile, the content or state stored in the SRAM cell will also be destroyed; a later access to this cell will then require voltage recovery and an access to the lower level caches. Lowering the supply voltage VDD to a point that yields a significant leakage reduction while still retaining the content is more attractive in terms of performance: a later access to this cell only needs to restore the supply voltage to the regular level before performing the access. This voltage restoration incurs both performance and energy penalties. Now, the tricky question is how to maximize the benefit from leakage reduction at the least cost in performance.

This thesis proposes a new design methodology for energy-efficient caches by

first understanding the applications, including their resource demands, execution footprints, and cache behavior. These application-specific characteristics can then be utilized to guide energy optimization strategies at compile time, at run time, or both.

1.4.1 Determine Cache Resource Demands

Different applications may have different resource requirements. For example,

some applications are computation intensive while others are I/O intensive. For the

same reason, the data cache size demanded by an application also differs from one application to another. On the other hand, most modern microprocessors implement caches in fixed sizes that are normally quite large for the sake of performance. However, large

caches are actually wasted for applications that cannot fully utilize them as they are

implemented in a rigid manner. For example, not all the loops in a given array-based

application can take advantage of a large on-chip cache. Also, working with a fixed

cache configuration can increase energy consumption in loops where the best required

configuration (from the performance angle) is smaller than the default (fixed) one. This

is because a larger cache results in a larger per-access energy.

This research proposes a novel approach [33][32] where an optimizing compiler

analyzes the application code and decides the best cache configuration demanded (from a

given objective viewpoint) for different parts of the application code. The caches are then dynamically reconfigured according to these compiler-determined configurations during the course of execution. These configurations match the dynamic characteristics of the running application. This is called compiler-directed cache polymorphism (CDCP) in

this work. This approach differs from previous research on reconfigurable caches such as

[4] and [62] in that it does not depend on dynamic feedback information.
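
The C sketch below illustrates, in a deliberately simplified form, the kind of decision CDCP makes: given a compile-time estimate of a loop nest's data footprint, pick the smallest candidate data cache configuration that covers the footprint, since a smaller cache has a smaller per-access energy. The footprint estimates, the candidate list, and the per-access energies are all hypothetical; the actual CDCP analysis and algorithms are developed in Chapter 5.

    #include <stdio.h>

    /* A hypothetical candidate data cache configuration. */
    struct cache_cfg {
        unsigned size_kb;          /* capacity                       */
        double energy_per_access;  /* illustrative per-access energy */
    };

    /* Pick the smallest configuration (list sorted by size) whose
     * capacity covers the estimated footprint of a loop nest.  This
     * mimics the spirit of CDCP's per-nest selection, not its real
     * reuse-analysis-based algorithm. */
    static const struct cache_cfg *select_cfg(const struct cache_cfg *cfgs,
                                              int ncfgs,
                                              unsigned footprint_kb)
    {
        for (int i = 0; i < ncfgs; i++)
            if (cfgs[i].size_kb >= footprint_kb)
                return &cfgs[i];
        return &cfgs[ncfgs - 1];   /* fall back to the largest cache  */
    }

    int main(void)
    {
        const struct cache_cfg cfgs[] = {
            { 1, 0.10 }, { 4, 0.14 }, { 16, 0.22 }, { 64, 0.35 },
        };
        unsigned footprints[] = { 3, 20, 70 };  /* assumed per-nest KB */

        for (int i = 0; i < 3; i++) {
            const struct cache_cfg *c = select_cfg(cfgs, 4, footprints[i]);
            printf("nest %d: footprint %u KB -> %u KB cache\n",
                   i, footprints[i], c->size_kb);
        }
        return 0;
    }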

1.4.2 Redesign Instruction-Supply Mechanism

Notice that CDCP actually targets the parameter CL for dynamic energy reduction in the data cache in the embedded application domain. This idea is also applicable to the instruction cache. However, this thesis explores a more aggressive approach that redesigns the instruction-supply mechanism, optimizing the parameter P0→1 for dynamic energy reduction in the instruction cache.

The proposed issue queue [38] has a mechanism to dynamically detect and identify

reusable instructions, particularly instructions belonging to tight loops. This code typi-

cally dominates the execution time of array-based embedded applications. Once reusable

instructions are detected, the issue queue buffers these instructions and reschedules these

buffered reusable instructions for the following execution. Special care must be taken to guarantee that the reused instructions are register-renamed in the original program

order. Thus, the instructions are supplied by the issue queue itself rather than the fetch

unit. There is no need to perform instruction cache access, branch prediction, or in-

struction decoding. Consequently, the front-end of the datapath pipeline, i.e., the pipeline stages before register renaming, can be gated to save energy during the instruction-reuse mode.
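
The short C sketch below caricatures the detection step described above: a backward branch whose body fits entirely in the issue queue marks a candidate loop whose instructions can be buffered and reissued from the queue while the front-end is gated. The structure, the instruction-size assumption, and the threshold are hypothetical simplifications of the design detailed in Chapter 6.

    #include <stdbool.h>
    #include <stdio.h>

    #define IQ_SIZE 64   /* assumed issue queue capacity (entries) */

    /* Decide whether a dynamic branch looks like a bufferable tight
     * loop: it must branch backward and its body must fit in the
     * issue queue.  A simplified stand-in for the detection logic
     * of the proposed issue queue design. */
    static bool bufferable_loop(unsigned branch_pc, unsigned target_pc,
                                unsigned insn_bytes)
    {
        if (target_pc >= branch_pc)
            return false;                    /* not a backward branch */
        unsigned body_insns = (branch_pc - target_pc) / insn_bytes + 1;
        return body_insns <= IQ_SIZE;        /* body fits in the queue */
    }

    int main(void)
    {
        /* Hypothetical backward branch at 0x1200 targeting 0x11c0. */
        if (bufferable_loop(0x1200, 0x11c0, 4))
            printf("loop fits: buffer it and gate the front-end\n");
        else
            printf("loop too large: keep fetching from the I-cache\n");
        return 0;
    }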

1.4.3 Application Sensitive Leakage Control

A good leakage management scheme needs to balance appropriately the energy

penalty of leakage incurred in keeping a cache line turned on after its current use with

the overhead associated with the transition energy (for turning on a cache line) and

performance loss that will be incurred if and when that cache line is accessed again. In

order to strike this balance, it is important that the management approach tracks both

the spatial and temporal locality of instruction cache accesses. Existing leakage control

approaches track and exploit one or the other of these forms of locality.

The leakage management proposed in this work exploits both forms of locality by relying on two main characteristics of instruction access patterns: program execution is mainly confined to program hotspots, and instructions

exhibit a sequential access pattern [34][44]. In order to exploit this behavior, this the-

sis work proposes a HotSpot based Leakage Management (HSLM) approach that protects cache lines containing program hotspots from inadvertent turn-off, detects shifts in the program hotspot, and turns off cache lines closer to their last use instead of waiting for a period to expire. This scheme is specifically

oriented to detect new loop-based hotspots. Next, this work presents a Just-in-Time

Activation (JITA) scheme that exploits the sequential access pattern of instruction caches

by predictively activating the next cache line when the current cache line is accessed.
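
As a rough illustration of how these two ideas interact, the C fragment below keeps a per-line access counter to mark hotspot lines (which are protected from turn-off) and, on every access, pre-activates the next sequential line in JITA fashion. The line count, hotness threshold, and power-state encoding are hypothetical; the real HSLM/JITA microarchitecture is presented in Chapter 7.

    #include <stdio.h>

    #define NLINES     64  /* assumed number of I-cache lines      */
    #define HOT_THRESH  8  /* assumed hotness threshold (accesses) */

    static unsigned access_cnt[NLINES]; /* per-line hotspot counters */
    static int active[NLINES];          /* 1 = full VDD, 0 = drowsy  */

    /* On an access: activate the line, count it toward the hotness
     * threshold, and pre-activate the next sequential line (JITA). */
    static void access_line(int line)
    {
        active[line] = 1;
        access_cnt[line]++;
        active[(line + 1) % NLINES] = 1;   /* JITA pre-activation */
    }

    /* Periodic turn-off sweep: put lines back into drowsy mode
     * unless they have proven hot (HSLM protection). */
    static void periodic_turnoff(void)
    {
        for (int i = 0; i < NLINES; i++)
            if (access_cnt[i] < HOT_THRESH)
                active[i] = 0;
    }

    int main(void)
    {
        for (int n = 0; n < 10; n++)   /* a tight loop over lines 4..6 */
            for (int l = 4; l <= 6; l++)
                access_line(l);
        periodic_turnoff();
        for (int i = 0; i < 10; i++)
            printf("line %d: %s\n", i, active[i] ? "active" : "drowsy");
        return 0;
    }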

1.5 Contributions

By understanding, capturing, and utilizing the static and dynamic characteristics

of application code, this thesis provides a comprehensive solution for optimizing

the energy efficiency in caches, including both dynamic and static energy. Specifically,

four major contributions have been made in this thesis research.

• A detailed cache behavior characterization for both array-based embedded applica-

tions and general-purpose applications was performed. The insights obtained from

this study suggest that (1) different applications or different code segments within a

single application have very different cache demands in the context of performance

and energy concerns, (2) program execution footprints (instruction addresses) can

be highly predictable and usually have a narrow scope during a particular execution

phase, especially for embedded applications, and (3) accesses to the instruction cache exhibit high sequentiality.

• A technique called compiler-directed cache polymorphism (CDCP) was proposed.

CDCP is used to analyze the data reuse exhibited by loop nests, and thus to extract

the cache demands and determine the best data cache configurations for different

code segments to achieve the best performance and optimized energy behavior, as

well as reconfigure the data cache with these determined cache configurations at

runtime.

• This thesis presents a redesigned processor datapath to capture and utilize the pre-

dictable execution footprint for reducing energy consumption in the instruction cache

as well as other processor components as a side benefit. The issue queue proposed

here is capable of rescheduling buffered instructions within the issue queue itself, thus avoiding instruction streaming from the pipeline front-end and significantly reducing energy consumption in the instruction cache.

• This thesis proposes hotspot-based leakage management (HSLM) and just-in-time

activation (JITA) strategies to manage the instruction cache leakage in an appli-

cation-sensitive fashion. The scheme, employing these two strategies in addition to periodic and spatially based (bank-switch) turn-off, provides a significant improvement in leakage energy savings in the instruction cache (while considering

overheads incurred in the rest of the processor as well) over previously proposed

schemes [45][78].

1.6 Thesis Roadmap

The rest of this thesis is organized as follows. An overview of related work

is presented in Chapter 2. The experimental framework for this thesis is detailed in

Chapter 3. Chapter 4 studies the data cache and instruction cache behavior for a set

of array-based embedded applications and a set of general purpose applications that are

used throughout this thesis work. Chapter 5 proposes the compiler-directed cache poly-

morphism technique to optimize dynamic energy consumption in the data cache while guaranteeing near-optimal performance in the embedded application domain. A more aggressive scheme, which schedules reusable instructions within the issue queue rather than reconfiguring the cache, is proposed in Chapter 6 to achieve significant energy reduction in the instruction cache as well as in other components in the front-end of the datapath pipeline. Chapter 7

develops two new strategies, namely hotspot-based leakage management and just-in-time

activation, to attack leakage in the instruction cache in a more effective and application-aware manner. Finally, Chapter 8 concludes this thesis work and outlines the directions

for future research.

Chapter 2

Related Work

There has been a lot of prior research on energy (both dynamic and leakage)

optimizations in caches (both data cache and instruction cache). This research spans

multiple levels of the microprocessor design flow, from high-level compiler optimizations, through architectural and microarchitectural schemes, to low-level circuit optimizations

and physical device designs.

2.1 Compiler Optimizations

2.1.1 Addressing Dynamic Energy Consumption

In the domain of embedded systems design, most of the work focuses on the be-

havior of array references in loop nests as loop nests are the most important part of

array-intensive media and signal processing application programs. In most cases, the

computation performed in loop nests dominates the execution time of these programs.

Thus, the behavior of the loop nests determines both performance and energy behavior

of applications. Previous research (e.g., [53]) shows that the performance of loop nests is

directly influenced by the cache behavior of array references. Also, energy consumption

is a major design constraint in embedded systems [73][55] [25]. Consequently, determin-

ing a suitable combination of cache memory configuration and optimized software is a

challenging problem in the embedded design world.

The conventional approach to address this problem is to employ compiler opti-

mization techniques [27][7][51][53][64][68][76][3] and try to modify the program be-

havior such that the new behavior becomes more compatible with the underlying cache

configuration. Current locality-oriented compiler techniques generally work under the

assumption of a fixed cache memory architecture, and there are several problems with

this method. First, these compiler-directed modifications sometimes are not effective

when data/control dependences prevent necessary program transformations. Second,

the available cache space sometimes cannot be utilized efficiently, because the static cache configuration does not match the different requirements of different programs and/or of

different portions of the same program. Third, most of the current compiler techniques

(adapted from the scientific compilation domain) do not take energy issues into account in

general.

2.1.2 Managing Cache Leakage

In [78][79], an optimizing compiler is used to analyze the program to insert explicit

cache line turn-off instructions. This scheme demands sophisticated program analysis

and needs modifications in the ISA to implement cache line turn-on/off instructions.

In addition, this approach is only applicable when the source code of the application

being optimized is available. In [78], instructions are inserted only at the end of loop

constructs and, hence, this technique does not work well if a lot of time is spent within

the same loop. In these cases, periodic schemes may be able to transition portions of

the loop that are already executed into a drowsy mode. Further, when only selected

portions of a loop are used, the entire loop is kept in an active state. Finally, inserting

the turn-off instructions after a fast-executing loop placed inside an outer loop can cause

performance and energy problems due to premature turn-offs.

2.2 Architectural and Microarchitectural Schemes

2.2.1 Using Additional Smaller Caches

Generally, smaller caches consume less power due to their lower capacitance. To

reduce the power consumption in the pipeline front-end, stage-skip pipeline [31][8] in-

troduces a small decoded instruction buffer (DIB) to temporarily store decoded loop

instructions; reusing them allows instruction fetching and decoding to be stopped for power re-

duction. The DIB is controlled by a special loop-evoking instruction and requires ISA

modification. Loop caches [47][5] dynamically detect loop structures and buffer loop

instructions or decoded loop instructions in an additional loop cache for later reuse. A

preloaded loop cache is proposed in [26] using profiling information. Loops dominating

the execution time are preloaded into the loop cache during system reset based on static

profiling. More generally, filter caches [46][71] use smaller level zero caches (between the

level one cache and datapath) to capture tight spatial/temporal locality in cache access

thus reducing the power consumption in larger level one caches. However, filter caches

usually incur big performance loss due to the low hit rate in the smaller level zero caches.

2.2.2 Changing Load Capacitance of Cache Access

An alternative approach for addressing the cache behavior problem is to use re-

configurable cache structures and dynamically modify the cache configuration (at specific

program points) to meet the execution profile of the application at hand. This approach

has the potential to address the problem in cases where optimizing the application code

alone fails. However, previous research in this area, such as [4], [62], and [77], is mainly

focused on the implementation and the employment mechanisms of such designs, and

lacks software-based techniques to direct dynamic cache reconfigurations.

Way-prediction and selective direct-mapping were used for dynamic energy re-

duction in set-associative caches in [61]. In this scheme, only the predicted cache way is

accessed without probing all cache ways simultaneously. Other ways are accessed only

after a way misprediction.
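
A minimal C sketch of the way-prediction idea described above follows: the predicted way is probed first, and the remaining ways are probed only on a way misprediction, so a correct prediction costs roughly the energy of a direct-mapped access. The 4-way structure, the predictor update policy, and the probe-count bookkeeping are hypothetical simplifications, not the design of [61].

    #include <stdio.h>

    #define WAYS 4
    #define SETS 128

    static unsigned tags[SETS][WAYS];  /* stored tags               */
    static int predicted_way[SETS];    /* e.g., most recently used  */
    static int probes;                 /* way probes ~ rough energy */

    /* Probe the predicted way first; only on a way misprediction
     * probe the remaining ways.  Returns the hit way, or -1 on a
     * cache miss (the fill path is omitted in this sketch). */
    static int lookup(unsigned set, unsigned tag)
    {
        int p = predicted_way[set];
        probes++;
        if (tags[set][p] == tag)
            return p;                       /* predicted way hit    */
        for (int w = 0; w < WAYS; w++) {
            if (w == p) continue;
            probes++;
            if (tags[set][w] == tag) {
                predicted_way[set] = w;     /* update the predictor */
                return w;
            }
        }
        return -1;
    }

    int main(void)
    {
        tags[5][2] = 0xabc;
        predicted_way[5] = 2;
        lookup(5, 0xabc);        /* 1 probe: correct prediction      */
        predicted_way[5] = 0;
        lookup(5, 0xabc);        /* extra probes after misprediction */
        printf("total way probes: %d\n", probes);
        return 0;
    }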

2.2.3 Improving the Fetch Mechanism

Monitoring instruction fetch has a significant impact on the energy consumption

in the instruction cache, e.g., speculation control for pipeline gating [52]. Trace cache

was first studied for its energy efficiency in [37] [35][36]. Sequential trace cache achieves

superior power behavior at the cost of a large performance degradation compared to the conventional trace cache architecture [37]. A compiler-based selective trace cache (SLTC) [35] utilizes profiling information to statically determine the fetch direction, either to the trace cache or to the instruction cache. Direction prediction based trace cache

(DPTC) proposed in [36] is independent of compiler optimizations or code layout op-

timizations. DPTC provides a pure hardware scheme to implement this selective fetch

rather than profile-based software schemes. It also avoids any impact on the existing ISA, which makes it independent of the underlying platform. Both SLTC and DPTC achieve the best energy behavior from the sequential access mechanism and performance very close to that of a conventional trace cache due to their selective access.

2.2.4 Reducing Leakage in Caches

The leakage current is a function of the supply voltage and the threshold voltage.

It can be controlled by either reducing the supply voltage or by increasing the threshold

voltage. However, this has an impact on the cache access times. Thus, a common

approach is to use these mechanisms dynamically when a cache line is not currently in

use. DRI-icache [60][59] uses performance feedback to dynamically resize the cache, utilizing the Gated-Vdd technique. However, this resizing is at a very coarse granularity. Using the same circuit technique, cache decay [41] exploits the generational behavior of cache lines and turns off a cache line after its decay period expires; cache leakage is thus controlled at the fine granularity of individual cache lines. Drowsy cache [20] periodically transitions all cache lines to a drowsy mode based on a multiplexed supply voltage. Drowsy

cache lines retain the cache contents. However, they must be restored to the active supply

voltage before any access to them is carried out, which incurs both performance and

energy overhead. This technique was adapted to the instruction cache in [45] and augmented with a bank-based strategy for cache line turn-off and predictive cache line activation. Oriented toward reducing the performance penalty, this bank-based scheme suffers from the

dynamic energy overhead due to turning on a whole cache subbank.
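
To make the decay-style policy described above concrete, the following C sketch implements, in the spirit of [41], a per-line idle counter that is reset on access and incremented every sampling interval; a line whose counter reaches the decay threshold is turned off. The counter width, interval, and threshold are hypothetical and not taken from [41] or from this thesis's experiments.

    #include <stdio.h>

    #define NLINES       8
    #define DECAY_THRESH 4   /* assumed idle intervals before turn-off */

    static unsigned idle[NLINES];  /* per-line idle-interval counters */
    static int on[NLINES] = {1, 1, 1, 1, 1, 1, 1, 1};

    /* Reset a line's idle counter whenever it is accessed. */
    static void touch(int line) { idle[line] = 0; on[line] = 1; }

    /* At the end of every sampling interval, age all lines and turn
     * off those idle for DECAY_THRESH consecutive intervals. */
    static void decay_tick(void)
    {
        for (int i = 0; i < NLINES; i++)
            if (on[i] && ++idle[i] >= DECAY_THRESH)
                on[i] = 0;
    }

    int main(void)
    {
        for (int t = 0; t < 6; t++) {  /* six intervals; only line 0 hot */
            touch(0);
            decay_tick();
        }
        for (int i = 0; i < NLINES; i++)
            printf("line %d: %s\n", i, on[i] ? "on" : "off (decayed)");
        return 0;
    }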

2.3 Circuit and Device Techniques

Existing techniques that control cache leakage use three main styles of circuit

primitives for reducing cache leakage energy. The first approach involves the Gated-

Vdd [60] techniques employed in [59][41] that use an additional NMOS sleep transistor

connected between the memory storage cell and the ground of the power supply. This

sleep transistor is turned off to reduce leakage but results in data stored in the memory

array being lost. A modification of this GND-gating scheme that can still retain data

was used in [49][2]. The second type of circuit primitive is based on a multiplexed supply

voltage for the cache lines [20]. When a reduced supply voltage is selected, the leakage

can be controlled and the cache line is said to be in a drowsy state (retaining its value).

However, cache lines in a drowsy state cannot be accessed and need to be brought back

to the active state (operating at the normal supply voltage). This transition from drowsy

voltage to normal voltage requires a single-cycle or multi-cycle wakeup time. The third

approach to reducing leakage energy while minimizing performance penalties relies on

selectively decreasing the threshold voltage of the cache lines that are accessed while

maintaining a higher threshold voltage for all other cache lines [43]. There have also

been approaches to designing memory cells using dual threshold voltages to minimize

leakage when storing the preferred data value (zero or one) [6].

Chapter 3

Experimental Models

This chapter presents the experimental models for this thesis work. Two different

simulation frameworks are used for evaluating the compiler schemes and microarchitec-

tural schemes proposed for reducing cache energy (including both dynamic and leakage

energy) in this work, respectively. Two sets of benchmarks, from the array-intensive embedded application domain and the general purpose application domain, are used for this

purpose.

3.1 Simulation Frameworks

3.1.1 Evaluating Compiler schemes

SUIF compiler version 1.0 [69] is used as the framework to implement the compiler

algorithms proposed in this work. It has two major parts: the kernel and the toolkit. The SUIF

compiler kernel defines the intermediate representation of programs, provides methods to

perform operations on the intermediate representation, and interfaces between different

compiler passes. The SUIF toolkit consists of compiler passes built on top of the

kernel, including Fortran and ANSI C front ends for Fortran and C to SUIF translation,

a SUIF to Fortran and C translator, a data dependence analyzer, a basic parallelizer, a

loop-level locality optimizer, and a visual SUIF code browser. The proposed compiler

algorithms in Chapter 5 are implemented as several independent SUIF passes.

The Shade cache simulator [18] is augmented to cooperate with the SUIF compiler to perform cache simulation and runtime reconfiguration. The SUIF compiler takes the program source code and converts it to the SUIF intermediate representation. It then executes the SUIF passes that implement the proposed compiler algorithms and generates the profile required by Shade. The Shade cache simulator then uses this profile to carry out the corresponding actions during its simulation.

3.1.2 Evaluating Microarchitectural Schemes

For evaluating the proposed microarchitectural schemes, a superscalar processor simulator, SimpleScalar 3.0 [12], is used as the base to develop the microarchitectural simulators

required in this work. SimpleScalar performs cycle-accurate simulation of modern proces-

sors and implements a six-stage processor pipeline: Fetch, Dispatch, Schedule, Execute,

Writeback, and Commit.

In the experiments conducted in this work, a contemporary microprocessor similar

to the Alpha 21264 microprocessor is modeled. The base configurations of the processor

and memory hierarchy are given in Table 3.1.

3.2 Benchmarks and Input Sets

Two sets of benchmarks, from the array-intensive application domain and the general purpose application domain, are used in the experiments. Table 3.2 lists a set of nine

array-based benchmarks. The second column and third column give the number of arrays

and loop nests manipulated by each benchmark, respectively. The fourth column

Processor Core
Issue Queue         64 entries
Load/Store Queue    32 entries
Reorder Buffer      64 entries
Fetch Width         4 instructions per cycle
Decode Width        4 instructions per cycle
Issue Width         4 instructions per cycle, out of order
Commit Width        4 instructions per cycle
Function Units      4 IALU, 1 IMULT/IDIV, 4 FALU, 1 FMULT/FDIV, 2 Memports
Branch Predictor    Bimodal, 2048 entries, 512-set 4-way BTB, 8-entry RAS

Memory Hierarchy
L1 ICache           32KB, 1 way, 32B blocks, 1 cycle latency
L1 DCache           32KB, 4 ways, 32B blocks, 1 cycle latency
L2 UCache           256KB, 4 ways, 64B blocks, 8 cycle latency
Memory              80 cycles first chunk, 8 cycles rest, 8B bus width
TLB                 4 way, ITLB 64 entry, DTLB 128 entry, 30 cycle miss penalty

Table 3.1. Base configurations of simulated processor and memory hierarchy for evaluating microarchitectural schemes.

Benchmark   Arrays   Nests   Brief Description                    Source
adi         6        2       Alternate Direction Integral         Livermore
aps         17       3       Mesoscale Hydro Model                Perfect Club
bmcm        11       3       Molecular Dynamics of Water          Spec92/NASA
btrix       29       7       Block tridiagonal matrix solution    Spec92/NASA
eflux       5        6       Mesh Computation                     Perfect Club
tomcat      9        8       Mesh Generation                      Spec95
tsf         1        4       Array-based Computation              Perfect Club
vpenta      9        8       Inverts 3 matrix pentadiagonals      Spec92/NASA
wss         10       7       Molecular Dynamics of Water          Perfect Club

Table 3.2. Array-based benchmarks used in the experiments.

describes the main function of each benchmark. The sources of the benchmarks are given

in the last column.

Benchmark   Input Set                    Description
gzip        input.source 60              Compression
vpr         net.in arch.in place.in      FPGA Circuit Placement and Routing
gcc         scilab.i                     C Programming Language Compiler
mcf         inp.in                       Combinatorial Optimization
parser      2.1.dict -batch ref.in       Word Processing
perlbmk     splitmail.pl                 PERL Programming Language
gap         -q -m 192M ref.in            Group Theory, Interpreter
vortex      bendian1.raw                 Object-oriented Database
bzip2       input.source                 Compression
twolf       ref                          Place and Route Simulator
wupwise     wupwise.in                   Physics / Quantum Chromodynamics
mesa        -frames 1000 mesa.in         3-D Graphics Library
art         c756hel.in a10.img hc.img    Image Recognition / Neural Networks
equake      inp.in                       Seismic Wave Propagation Simulation

Table 3.3. Benchmarks from SPEC2000 used in the experiments.

In addition to the array-based benchmarks, a set of ten integer and four floating-

point applications from the SPEC2000 benchmark suite are used for evaluating the leak-

age control strategies proposed in Chapter 7. Their PISA version binaries and reference

inputs for execution are used in the experiments. During the simulation, each of these

SPEC2000 benchmarks is first fast-forwarded by half a billion instructions, and then simulated in detail for the next half billion committed instructions. Table 3.3 gives the names,

input sets, and function descriptions of these fourteen SPEC2000 benchmarks used in

this work.

Chapter 4

Characterizing Application and Cache Behavior

This thesis proposes to develop compiler and microarchitectural techniques for

cache energy reduction based on an understanding of the application characteristics

and its cache behavior. In this chapter, three critical properties of a given application

and its cache behavior, namely cache resource demands for performance, program exe-

cution footprint, and instruction cache access behavior, have been identified, highlighted,

extracted, and analyzed in the context of cache energy optimization.

4.1 Data Cache Demands for Performance

General-purpose high-performance microprocessor designs, such as the Alpha 21364 [9] and the HP PA-8800 microprocessor, incorporate large fixed-size level one caches to accommodate the different workloads of different applications. However, such a design is inherently conservative and only achieves average performance. Notice that the cache resource demands of the workloads from different applications differ significantly from each other. Even within a single application, different code segments may also have very different cache requirements due to the different functions performed and the different data manipulated. A fixed-size cache design leads to very inefficient utilization of the cache resources when the demands of applications are much


smaller than the configured caches. On the other hand, performance suffers seriously if the application demands a much larger cache than the existing one.

Such a design in embedded microprocessors can be a disaster in terms of cost, energy consumption, and performance. Larger caches result in higher energy consumption, which is a major constraint in embedded systems design. Smaller caches may cause performance problems (i.e., missed deadlines) and also incur additional energy overhead due to accesses to the lower levels of the memory hierarchy. Note that embedded applications usually manipulate massive amounts of data and are very complex in the way the data are accessed and processed. Thus, it is paramount to study the cache resource demands of embedded applications and use these characteristics to direct energy optimizations in caches, especially the level-one data cache.

In this section, this application characterization is performed on the set of array-based embedded applications listed in Table 3.2 in Chapter 3. Since loop nests constitute the major portion of the code in these benchmarks, this study analyzes the cache performance behavior as the cache configuration varies for each loop nest within a given benchmark. Figure 4.1 presents a set of analysis results, where the data cache size varies from 1 KB to 1024 KB while the cache block size is fixed at 32 bytes and the set-associativity is fixed at 4 ways for all cache configurations. In each plot, the x-axis represents the data cache size and the y-axis is the data cache miss rate.

Several important observations can be made from Figure 4.1. First, for a given cache configuration, most loops within a single application have very different cache performance behavior. For example, in benchmark aps, the miss rates in an 8 KB data cache for its three loops are 0.99%, 2.07%, and 16.11%, respectively.


[Figure 4.1: eight plots, (a) adi, (b) aps, (c) bmcm, (d) eflux, (e) tomcat, (f) tsf, (g) vpenta, and (h) wss; each plots the data cache miss rate (y-axis) against the data cache size from 1K to 1M (x-axis) for the individual loops of the benchmark.]

Fig. 4.1. Cache performance behavior as the data cache size increases from 1 KB to 1024 KB. All cache configurations use a fixed block size of 32 bytes and a fixed associativity of 4 ways.


In a few cases, where two loops are very similar in code structure and data access pattern, they may have nearly the same behavior, e.g., loop 1 and loop 2 in bmcm and loop 2 and loop 5 in eflux. Second, the variation of cache behavior with increasing cache size differs among loops. For some loops, the miss rate does not improve even when the cache size increases from 1 KB to 1024 KB, e.g., loops 2, 3, 5, 6, and 7 in benchmark vpenta. However, for most loops, the cache performance improves significantly as the cache size increases, e.g., all loops but loop 4 in eflux. Finally and most importantly, every loop has a performance saturating point, either at some sharp-turning point or at the very first point (the smallest cache size). For loops having sharp-turning points, increasing the cache size before those points may or may not improve performance, e.g., the performance of loop 1, loop 4, and loop 8 in vpenta shows no improvement as the cache size increases from 1 KB to 32 KB. However, the sharp-turning point itself brings a significant performance improvement, e.g., the 128 KB cache size for loop 1 and loop 8 in vpenta. Further, increasing the cache size beyond this sharp-turning point yields only minor or no performance benefit. It is very important to understand these findings. This saturating point is proposed as the optimal cache configuration for a particular loop in terms of performance and energy consumption. Schemes trying to optimize cache energy for these embedded applications should develop approaches to capture these optimal points of their loops.

4.2 Instruction Execution Footprint

This thesis proposes to redesign the instruction supply mechanism for dynamic energy optimization in instruction caches oriented toward embedded systems.


Such a proposal presumes a thorough understanding of the dynamic behavior of instruction footprints at runtime for typical array-based embedded applications.

This section analyzes the distribution of dynamic instructions with respect to their PC addresses for the set of array-based embedded applications used in this work. Figure 4.2 gives the PC address profiling results for a group of eight benchmarks. In each plot, the x-axis represents the sampling point during the execution, and the y-axis is the instruction address space. The sampling rate for each benchmark is given with the label of the x-axis, e.g., "/500" means one sampling point is taken for every 500 dynamic instructions.

For these embedded applications, as shown in Figure 4.2, the dynamic instruction footprint has a very regular pattern. Some patterns show very uniform behavior, as in aps, btrix, and wss, while others consist of several phases, as in eflux and vpenta. Each phase executes for a certain amount of time and within a narrow address space. These dynamic characteristics of the applications can certainly be utilized by a reconfigurable instruction cache for energy reduction. This thesis research explores a more aggressive microarchitectural design to exploit this predictable dynamic behavior for energy optimization in the instruction cache as well as in other components such as the branch predictor and the decoder in the datapath front-end.

4.3 Accessing Behavior in Instruction Cache

This section studies the distribution of accesses to the instruction cache for a set of benchmarks from the SPEC2000 benchmark suite. The accesses are characterized at the granularity of individual cache lines.


[Figure 4.2: eight plots, (a) adi, (b) aps, (c) btrix, (d) eflux, (e) tomcat, (f) tsf, (g) vpenta, and (h) wss; each plots the instruction address space (y-axis) against the instruction fetch sampling point (x-axis), with per-benchmark sampling rates of /300, /50, /3000, /500, /5000, /500, /5000, and /2500, respectively.]

Fig. 4.2. Dynamic instruction address distribution at runtime for a set of array-intensive codes. A sampling rate of (/500) means only one instruction among 500 dynamic instructions is sampled for its address (PC).


Sequential instruction accesses within a single cache line are counted as only one access to the instruction cache in this study. The instruction cache configuration is given in Table 3.1. The goal of this study is to provide a quantitative analysis of the sequentiality of cache accesses (at cache line granularity) for general-purpose applications. These quantitative characteristics of applications can later be utilized to guide energy optimizations in instruction caches.

The characterization of this cache-line access distribution is performed with respect to the sequential length, which is defined as the number of consecutively accessed cache lines with increasing set index in the same cache way. For example, a consecutive access to five cache lines (in the same cache way) with set indices 5, 6, 7, 8, 10 has a sequential length of four for the first four cache-line accesses, and the sequentiality is broken after the fourth cache-line access; cache line 10 then starts a new sequential-length count. Note that a sequence of accesses with sequential length N realizes only N-1 sequential accesses, i.e., cache lines 5, 6, and 7 have a sequential (follow-up) access in the previous example. The access to cache line 5 is said to achieve sequential access since the next consecutive access is to cache line 6. A preset series of sequential lengths, from 1 to 32 and larger than 32, is used for this distribution characterization. Figure 4.3 and Figure 4.4 present this cache-line access distribution with respect to sequential length for each benchmark, together with averages for the integer and floating-point benchmarks used in this study. The rightmost bar (<) in each plot corresponds to accesses with a sequential length larger than 32 cache lines.
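To make this definition concrete, the following is a minimal C sketch of how such a sequential-length distribution can be tallied from a trace of instruction cache-line accesses; the trace format, the function and variable names, and the 33-bucket histogram are illustrative assumptions, not part of the simulator used in this thesis.

#define MAX_BUCKET 33            /* buckets 1..32 plus one ">32" bucket */

/* One cache-line access: the cache way used and the set index of the line. */
struct line_access { int way; int set; };

/* Tally the sequential-length distribution of a trace of n cache-line accesses.
 * A run continues while the next access hits the same way at set index + 1. */
void sequential_length_histogram(const struct line_access *trace, int n,
                                 long hist[MAX_BUCKET + 1])
{
    int run = 1;                                 /* current sequential length */
    for (int t = 1; t <= n; t++) {
        int continues = (t < n) &&
                        trace[t].way == trace[t - 1].way &&
                        trace[t].set == trace[t - 1].set + 1;
        if (continues) {
            run++;
        } else {
            int bucket = (run <= 32) ? run : MAX_BUCKET;
            hist[bucket] += run;                 /* attribute all accesses of this run */
            run = 1;                             /* the next access starts a new run */
        }
    }
}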

From these two figures, one can observe that most cache-line accesses happen in sequential mode. Accesses with a sequential length of one (i.e., without any sequential access) account for only a very small portion of the overall accesses.


[Figure 4.3: eight bar plots, (a) gzip, (b) vpr, (c) gcc, (d) mcf, (e) parser, (f) perlbmk, (g) gap, and (h) vortex; each shows the L1 ICache access distribution (y-axis, 0-100%) with respect to sequential length (x-axis, from 4 to 32 and "<" for longer runs).]

Fig. 4.3. The distribution of accesses (at cache line granularity) to the L1 instruction cache with respect to the length of consecutively accessed cache lines (sequential length), for SPEC2000 benchmarks. The rightmost bar (<) in each plot corresponds to accesses with a sequential length larger than 32 cache lines. Continued in Figure 4.4.


[Figure 4.4: eight bar plots in the same format as Figure 4.3 for bzip2, twolf, wupwise, mesa, art, and equake, plus the averages for the integer and floating-point benchmarks.]

Fig. 4.4. (Continued from Figure 4.3.) The distribution of accesses (at cache line granularity) to the L1 instruction cache with respect to the length of consecutively accessed cache lines (sequential length), for SPEC2000 benchmarks. The last two plots show the average distributions for the integer and floating-point benchmarks used, respectively.


Benchmark gcc has the largest such fraction, at 6.5%. Overall, for integer benchmarks, most instruction cache-line accesses occur in sequences with sequential lengths from 2 to 16, and certain lengths have the highest percentage, e.g., 6 in gzip and 14 in gap. For floating-point benchmarks, this distribution appears more irregular and tends toward much larger sequential lengths in instruction cache accesses, e.g., more than 75% of cache accesses in wupwise have a sequential length of more than 32 cache lines. The last two plots in Figure 4.4 give the average distributions for the integer and floating-point benchmarks. In general, the percentage decreases as the sequential length increases for integer benchmarks, and floating-point benchmarks achieve longer sequential lengths than integer benchmarks. Notice that the last access of a sequence with sequential length N does not have a sequential (follow-up) access. On average, 78% of instruction cache-line accesses achieve sequential access in the SPEC2000 integer benchmarks used in this study, and 87% in the SPEC2000 floating-point benchmarks. This quantitative result is used in Chapter 7 to explore performance-aware leakage management.


Chapter 5

Analyzing Data Reuse for Cache Energy Reduction

5.1 Introduction

Most of today’s microprocessor systems include several special architectural fea-

tures (e.g., large on-chip caches) that use a significant fraction of the on-chip transistors.

These complex and energy-hungry features are meant to be applicable across different application domains. However, because they are implemented in a rigid manner, they are wasted on applications that cannot fully utilize them. For example, not all the loops in a

given array-based embedded application can take advantage of a large on-chip cache.

Also, working with a fixed cache configuration can increase energy consumption in loops

where the best required configuration (from the performance angle) is smaller than the

default (fixed) one. This is because a larger cache can result in a larger per access energy.

This thesis work proposes a strategy in which an optimizing compiler decides the best cache configuration (from a given objective viewpoint) for each nest in the application code. The approach focuses on array-based applications and reconfigures the cache dynamically between nests, since loop nests are the most important part of array-intensive media and signal processing programs. In most cases, the computation performed in loop nests dominates the execution time of these programs. Thus, the behavior of the loop nests determines both the performance and the energy behavior of the applications. Previous research (e.g., [53]) shows that the performance of loop nests is directly influenced by the cache behavior of array references.


Also, energy consumption is an important design constraint in embedded systems [73][55][25]. Consequently, determining a suitable combination of cache memory configuration and optimized software is a challenging problem in the embedded design world.

Classical compiler optimizations assume a fixed cache architecture and modify

the program to take best advantage of it. In some cases, this may not be the best

strategy because each nest might work best with a different cache configuration and

transforming a nest for a given fixed cache configuration may not be possible due to

data and control dependences. Working with a fixed cache configuration can also increase

energy consumption in loops where the best required configuration is smaller than the

default (fixed) one. This thesis work takes an alternate approach and modifies the cache configuration for each nest depending on the access pattern exhibited by the nest. This technique is called compiler-directed cache polymorphism (CDCP). More specifically, this chapter makes the following contributions. First, it presents an approach for analyzing the data reuse properties of loop nests. Second, it gives algorithms to simulate the footprints of array references in their reuse space; this simulation approach is much more efficient than classical cycle-based simulation techniques as it simulates only the data reuse space. Third, based on the reuse analysis, it presents an optimization algorithm to compute the best cache configuration for each loop nest. The experimental results

show that CDCP is very effective in finding the near-optimal data cache configurations

for different nests in array-intensive applications.

Section 5.2 reviews the basic concepts, notions, and representations for array-based codes. In Section 5.3, concepts related to cache behavior, such as cache misses,


interferences, data reuse, and data locality, are analyzed. Section 5.4 introduces the compiler-directed cache polymorphism technique and presents a complete set of algorithms to implement it. Experimental results are presented in Section 5.5 to show the effectiveness of this technique. Finally, Section 5.6 concludes the chapter with a summary.

5.2 Array-Based Codes

This work is particularly targeted at array-based codes. Since the performance of loop nests dominates the overall performance of array-based codes, optimizing loop nests is particularly important for achieving the best performance in many embedded signal and video processing applications. Optimizing data locality (so that the majority of data references are satisfied from the cache instead of main memory) can improve the performance and energy efficiency of loop nests in the following ways. First, it can significantly reduce the number of misses in the data cache, thus avoiding frequent accesses to the lower levels of the memory hierarchy. Second, by reducing the number of accesses to the lower memory levels, the increased cache hit rate helps improve the energy efficiency of the entire memory system. In this section, some basic notions about array-based codes, namely loop nests and array references, are discussed, along with some assumptions made in this work.

5.2.1 Representation for Programs

It is assumed that the application code to be optimized has the format which is

shown in Figure 5.1.


#include <header.h>
...
Global Declaration Section of Arrays;
...
main(int argc, char *argv[])
{
    ...
    Loop Nest No. 0;
    ...
    Loop Nest No. 1;
    ...
    Loop Nest No. l;
    ...
}

Fig. 5.1. Format for a program.

Assumption 1. Each array in the application code being optimized is declared in the

global declaration section of the program. The arrays declared in the global section can

be referenced by any loop in the code.

This assumption is necessary for the algorithms discussed in the following sections. In the optimization stage, when computing the cache configurations for the loop nests, Assumption 1 ensures that each array involved has an exploitable relative base address.

Since loop nests are the main structures in array-based programs, the program code between loop nests can be neglected. It is also assumed that each nest is independent of the others. That is, as shown in Figure 5.1, the application contains a number of independent nests, and no inter-nest data reuse is accounted for. This assumption can be relaxed to achieve potentially more effective utilization of reconfigurable caches; this will be addressed in future research. Note that several compiler optimizations such as loop fusion, fission, and code sinking can be used to bring a given application code into the format shown in Figure 5.1.

Assumption 2. All loop nests are at the same program lexical level, the global level. There is no inter-nesting between any pair of nests.

Assumption 3. All nests in the code are perfectly nested, i.e., all array operations and array references occur only in the innermost loop.

These assumptions, while not vital for this analysis, make the implementation easier.

5.2.2 Representation for Loop Nests

In this work, loop nests form the boundaries at which dynamic cache reconfigu-

rations occur. Figure 5.2 shows the format for a loop nest.

for (i_1 = l_1; i_1 <= u_1; i_1 += s_1)
  for (i_2 = l_2; i_2 <= u_2; i_2 += s_2)
    ...
      for (i_n = l_n; i_n <= u_n; i_n += s_n)
      {
        ... AR_1[f_{1,1}(~i)][f_{1,2}(~i)] ... [f_{1,d_1}(~i)] ...;
        ... AR_2[f_{2,1}(~i)][f_{2,2}(~i)] ... [f_{2,d_2}(~i)] ...;
        ...
        ... AR_r[f_{r,1}(~i)][f_{r,2}(~i)] ... [f_{r,d_r}(~i)] ...;
      }

Fig. 5.2. Format for a loop nest.


In this format, ~i stands for the loop index vector, ~i = (i_1, i_2, ..., i_n)^T. The notations l_j, u_j, and s_j are the corresponding lower bound, upper bound, and stride for loop index i_j, where j = 1, 2, ..., n. AR_1, AR_2, ..., AR_r correspond to different instances of array references in the nest. Note that these may be the same or different references to the same array, or references to different arrays. The function f_{j,k}(~i) is the subscript (expression) function of ~i for the kth subscript of the jth array reference, where j = 1, 2, ..., r, k = 1, 2, ..., d_j, and d_j is the number of dimensions of the corresponding array.

5.2.3 Representation for Array References

In a loop nest with the loop index vector ~i, a reference AR_j to an array with m dimensions is expressed as:

AR_j[f_{j,1}(~i)][f_{j,2}(~i)] ... [f_{j,m}(~i)].

It is assumed that the subscript expression functions f_{j,k}(~i) are affine functions of the enclosing loop indices and loop-invariant constants. A row-major storage layout is assumed for all arrays, as in the C language. Assuming that the loop index vector has depth n, that is, ~i = (i_1, i_2, ..., i_n)^T, where n is the number of loops in the nest, an array


reference can be represented as:

\[
\begin{bmatrix} f_{j,1} \\ f_{j,2} \\ \vdots \\ f_{j,m} \end{bmatrix}
=
\begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1n} \\
a_{21} & a_{22} & \cdots & a_{2n} \\
\vdots & \vdots & \ddots & \vdots \\
a_{m1} & a_{m2} & \cdots & a_{mn}
\end{bmatrix}
\begin{bmatrix} i_1 \\ i_2 \\ \vdots \\ i_n \end{bmatrix}
+
\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_m \end{bmatrix}
\tag{5.1}
\]

The vector on the left side of the above equation is called the array reference subscript vector and is denoted ~f. The matrix shown above is defined as the access matrix and is denoted A. The rightmost vector is known as the constant offset vector ~c. Thus, the above equation can also be written as [74]:

\[
\vec{f} = A\,\vec{i} + \vec{c}. \tag{5.2}
\]
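As a small illustration, a C sketch of this representation is given below; it simply evaluates ~f = A~i + ~c for one reference, with the access matrix stored in row-major order (the function and variable names are illustrative, not part of the thesis implementation).

/* Evaluate the subscript vector f = A*i + c for one array reference.
 * A is the m-by-n access matrix stored row-major, i the loop index vector,
 * c the constant offset vector, and f receives the m resulting subscripts. */
void subscript_vector(int m, int n, const int *A,
                      const int *i, const int *c, int *f)
{
    for (int d = 0; d < m; d++) {          /* one subscript per array dimension */
        int val = c[d];
        for (int j = 0; j < n; j++)
            val += A[d * n + j] * i[j];
        f[d] = val;
    }
}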

5.3 Cache Behavior

This section reviews a few fundamental concepts about cache behavior. As noted

earlier, in array-intensive applications, cache behavior is largely determined by the foot-

prints of the data manipulated by loop nests.

5.3.1 Cache Misses

There are three types of cache misses: compulsory (cold) misses, capacity misses,

and conflict (interference) misses. Different types of misses influence the performance of a program in different ways. Compulsory misses cannot be avoided (using software techniques alone) and usually form only a small fraction of total cache misses. Capacity


misses can be reduced by increasing the cache size or by optimizing the application code.

Note that, in fully-associative caches, only capacity misses and cold misses can exist.

However, most data caches used in current embedded systems are implemented as set-associative or direct-mapped caches in order to achieve high speed, low power, and low implementation cost. Thus, for these caches, interference misses (also known as conflict misses) can dominate the cache behavior, particularly for array-based codes. Previous research [72] has identified different kinds of cache interferences in numerical (array-based) codes: self-interferences and cross-interferences. Interference misses can also be grouped into temporal interferences and spatial interferences. It should be stressed that since cache interferences occur in a highly irregular manner, they are very difficult to capture accurately and costly to estimate [15]. Ghosh et al. proposed cache miss equations [23][24] as an analytical framework to compute potential cache misses and direct code optimizations for achieving good cache behavior.

5.3.2 Data Reuse and Data Locality

Data reuse and data locality concepts for scientific array-based applications are discussed in detail in [74], among others. Basically, there are two types of data reuse: temporal reuse and spatial reuse. In a given loop nest, if a reference accesses the same memory location across different loop iterations, this is termed temporal reuse; if the reference accesses the same cache block (not necessarily the same memory location), this is called spatial reuse. In fact, temporal reuse can be considered a special case of spatial reuse. If different references access the same memory location, a group-temporal reuse is said to exist; whereas if different references access


the same cache block, this is termed group-spatial reuse. Note that group reuse occurs only among different references to the same array in a loop nest. When the reused data item is found in the cache, the reference is said to exhibit locality. This means that data reuse does not guarantee data locality; a data reuse can be converted into locality only by catching the reused item in the cache. Classical loop-oriented compiler techniques try to achieve this by modifying the loop access patterns and/or the array layout in memory.
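As a small illustration (not taken from the thesis benchmarks), the loop below exhibits the reuse types just defined; the array names and bounds are arbitrary.

#define N 64
double x[N], y[N][N];

void reuse_example(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) {
            /* x[i]: the same location is read for every j, so this reference
             * has self-temporal reuse in the j loop. */
            /* y[i][j]: consecutive j iterations touch adjacent elements of the
             * same cache block, giving self-spatial reuse in the j loop. */
            /* y[i][j] and y[i][j+1] share the same access matrix (they form one
             * uniform reference set) and exhibit group reuse: the element loaded
             * as y[i][j+1] is reused as y[i][j] in the next iteration. */
            double t = x[i] + y[i][j];
            if (j + 1 < N)
                t += y[i][j + 1];
            x[i] = t;
        }
}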

5.4 Algorithms for Cache Polymorphism

In a cache-based embedded system, the performance and energy behavior of loop nests are largely determined by their cache behavior. Thus, optimizing the cache behavior of loop nests is of utmost importance for satisfying the high-performance and energy-efficiency demands of array-based codes.

There are at least two kinds of approaches for achieving acceptable cache behav-

ior. The conventional way is to employ compiler algorithms that optimize loops using

transformations such as interchange, reversal, skewing, and tiling, or transform the data

layouts (i.e., array layout in memory) to match the array access pattern. As mentioned

earlier, an alternative approach to optimize the cache behavior is to modify the underly-

ing cache architecture depending on the program access pattern. Recent research work

by [40] explores the potential benefits from the second approach. The strategy presented

in [40] is based on exhaustive simulation. It simulates each loop nest of an array-based

program separately with all possible cache configurations, and then determines the best

cache configuration from the viewpoint of a given objective (e.g., optimizing memory

energy or reducing cache misses). These configurations can then be dynamically applied


at run time. The main drawback of this simulation-based strategy is that it is time-consuming and can consider only a fixed set of cache configurations. Typically, simulating each nest with all possible cache configurations makes this approach unsuitable for very large embedded applications. In this section, an alternative way of determining suitable cache configurations for the different sections (nests) of a given code is presented.

5.4.1 Compiler-Directed Cache Polymorphism

The existence of cache interferences is the main factor that degrades the perfor-

mance of a loop nest. Cache interferences disrupt the cache behavior of a loop nest by

preventing data reuse from being converted into locality. Note that both self-interferences and cross-interferences can prevent a data item from being reused from the cache. The objective is then to determine cache configurations that help reduce interferences.

The basic idea behind the compiler-directed cache polymorphism (CDCP) is to analyze

the source code of an array-based program and determine data reuse characteristics of

its loop nests at compile time, and then to compute a suitable (near-optimal) cache

configuration for each loop nest to exploit the data locality implied by its reuse. The

near-optimal cache configuration determined for each nest is expected to eliminate most

of the interference misses while keeping the cache size and associativity under control.

In this way, CDCP optimizes execution time and energy at the same time. In fact, increasing either cache capacity or associativity further (i.e., expanding the configuration determined by CDCP) only increases energy consumption. In this approach, the source codes are not modified (obviously, they can be optimized before the algorithms are run; what is meant here is that no further code modifications are made for the sake of cache polymorphism).

[Figure 5.3 is a block diagram of the CDCP tool flow. Its components include a SUIF/SCC/GCC front end operating on the source codes, an intermediate code generator, an array sorter, a uniform-set constructor, a reuse analyser, a reuse-space simulator, and a loop nest optimizer, which produce uniform sets, reuse vectors, array bitmaps, and cache configurations for each block size; these drive a cache reconfiguration mechanism in a modified Shade cache simulator coupled with a Cacti-based energy model to produce performance/energy results.]

Fig. 5.3. Overview of compiler-directed cache polymorphism (CDCP).

At a very high level, this approach can be described as follows. First, a compiler is used to transform the source code into an intermediate format (IF), which represents the array-based program in a regular hierarchical format. In the second step, each loop nest is processed as a basic element for cache configuration. In each loop nest, the references of each array are assigned to different uniform reference sets; references belonging to the same uniform set have the same access matrix. Each uniform set is then analyzed to determine the reuse it exhibits over different loop levels. Then, for each array, an algorithm is used to simulate the footprints of the reuse space within the layout space of that array. Following this, a loop-nest level algorithm optimizes the cache configurations while ensuring data locality. Finally, code is generated such that these dynamic cache configurations are activated at runtime (at appropriate points in the application code). Figure 5.3 shows an overview of this approach.

5.4.2 Formal Description of Program Hierarchies

This subsection shows how an array-based program is represented in its IF. The intermediate code generator follows the hierarchy of the source program to generate an explicit hierarchical tree structure representing the original code. The hierarchy goes from the root (which represents the main program) down to loop nests, arrays, and array references at the leaves. Each node and leaf contains all the information needed by this approach that can be extracted from the original code. Thus, such a tree structure functionally represents the original code in full scope. This intermediate program format is shown in Figure 5.4.

[Figure 5.4 is a tree diagram: the Program node at the root; loop nest 0, loop nest 1, loop nest 2, ..., loop nest n as its children; array 0, array 1, array 2, ..., array m under each loop nest; and array ref 0, array ref 1, array ref 2, ..., array ref l as the leaves.]

Fig. 5.4. Intermediate format of source codes produced by the generator.
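A minimal C sketch of such a hierarchical IF is shown below; the struct names, fields, and fixed bounds are illustrative assumptions rather than the actual SUIF-based data structures used in this work.

#define MAX_DIMS  8
#define MAX_LOOPS 8

/* One array reference: f = A*i + c (Equation 5.2). */
struct array_ref {
    int access_matrix[MAX_DIMS][MAX_LOOPS];   /* A */
    int const_offset[MAX_DIMS];               /* c */
    struct array_ref *next;
};

/* One array: declaration info plus its references within a nest. */
struct array_node {
    const char *name;
    int ndims;
    int dim_lower[MAX_DIMS], dim_upper[MAX_DIMS];
    long rel_base_addr;                        /* relative base address */
    struct array_ref *refs;
    struct array_node *next;
};

/* One loop nest: bounds and strides for each loop plus the arrays it touches. */
struct loop_nest {
    int depth;                                 /* n */
    int lower[MAX_LOOPS], upper[MAX_LOOPS], stride[MAX_LOOPS];
    struct array_node *arrays;
    struct loop_nest *next;
};

/* Root of the IF: the program is a list of independent loop nests. */
struct program_node {
    struct loop_nest *nests;
};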


5.4.3 Array References and Uniform Reference Sets

An array reference is at the lowest level of the IF. As explained earlier, each array

reference can be expressed using ~f = A~i + ~c, where ~f is the subscript vector, A is the

access matrix,~i is the loop index vector, and ~c is the constant vector. All the information

about a reference is stored in the array-reference leaf, the array node, and its parent loop-nest node of the intermediate code. Consider the piece of code shown in Figure 5.5.

for (i = 0; i <= N1; i++)
  for (j = 0; j <= N2; j++)
    for (k = 0; k <= N3; k++)
      for (l = 0; l <= N4; l++)
      {
        a[i + 2*k][2*j + 2][l] = a[i + 2*k][2*j][l];
        b[j][k + l][i] = a[2*i][k][l];
      }

Fig. 5.5. Example code - a loop nest.

In the intermediate code format, the first reference to array a is represented by the following access matrix A_a and constant offset vector ~c_a:

\[
A_a = \begin{bmatrix} 1 & 0 & 2 & 0 \\ 0 & 2 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix},
\qquad
\vec{c}_a = \begin{bmatrix} 0 \\ 2 \\ 0 \end{bmatrix}.
\]


The reference to array b in this code fragment is likewise represented by its access matrix A_b and constant offset vector ~c_b:

\[
A_b = \begin{bmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 \end{bmatrix},
\qquad
\vec{c}_b = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}.
\]

The definition of a uniform reference set is very similar to that of a uniformly generated set [22]. If two references to an array have the same access matrix and differ only in their constant offset vectors, these two references are said to belong to the same uniform reference set. Constructing uniform reference sets for an array provides an efficient way of analyzing the data reuse of that array, because all references in a uniform reference set have the same data access pattern and data reuse characteristics. Also, identifying uniform reference sets allows group reuse to be captured easily.
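For example, a straightforward way to group references into uniform reference sets is to compare access matrices entry by entry; the following self-contained C sketch is illustrative only and makes no claim about the grouping code actually used in this work.

#include <string.h>
#include <stdbool.h>

/* Two references belong to the same uniform reference set when their access
 * matrices are identical (only the constant offset vectors may differ).
 * A1 and A2 are m-by-n matrices stored row-major. */
bool same_uniform_set(const int *A1, const int *A2, int m, int n)
{
    return memcmp(A1, A2, (size_t)m * n * sizeof(int)) == 0;
}

/* Assign each of r references a set id so that references with identical
 * access matrices share an id; returns the number of uniform sets. */
int build_uniform_sets(const int *A /* r blocks of m*n ints, row-major */,
                       int r, int m, int n, int *set_id)
{
    int nsets = 0;
    for (int i = 0; i < r; i++) {
        set_id[i] = -1;
        for (int s = 0; s < i; s++) {
            if (same_uniform_set(&A[i * m * n], &A[s * m * n], m, n)) {
                set_id[i] = set_id[s];     /* join an existing set */
                break;
            }
        }
        if (set_id[i] < 0)
            set_id[i] = nsets++;           /* start a new set */
    }
    return nsets;
}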

5.4.4 Algorithm for Reuse Analysis

The following sections use a bottom-up approach to introduce the algorithms that implement CDCP. First, algorithms analyzing the data reuses, including self-reuses and group-reuses, are provided for each uniform reference set. After that, an array-level algorithm to obtain the footprints of all the reuses in a given nest is given; this algorithm is called by a loop-nest level algorithm. The loop-nest level algorithm simulates the reuse behavior of arrays in a memory space and computes near-optimal cache configurations in order to exploit data reuses with the minimum cache capacity/associativity. Finally, a


global or program level algorithm integrates the results from each loop nest, and makes

all necessary changes to activate the selected cache reconfigurations at runtime.

5.4.4.1 Self-Reuse Analysis

Before the reuse analysis, all references to an array in a given nest are first clas-

sified into several uniform reference sets. Self-reuses (both temporal and spatial) are

analyzed at a uniform set granularity. This algorithm works on access matrices and is

given in Algorithm 1.

The algorithm checks each loop index variable from the innermost loop to the

outermost loop to see whether it occurs in the subscript expressions of the references.

If the jth loop index variable ij does not occur in any subscript expression, the impact

of this on the corresponding access matrix is that all elements in the jth column are

0. This means that the iterations of the jth loop do not have a say in the memory

location accessed, i.e., the array reference has self-temporal reuse in the jth loop. If the index variable occurs only in the lowest (fastest-changing) dimension (i.e., the mth dimension), the distance between contiguous loop iterations is checked. In the algorithm, s[CLP] is the stride of the CLPth loop, BK_SZ is the cache block size, and ELMT_SZ is the size of an array element. If the distance (A[CDN][CLP] * s[CLP]) between two contiguous iterations of the reference being analyzed is within a cache block, the reference has spatial reuse at this loop level.


Algorithm 1 Self-Reuse Analysis
INPUT: access matrix A (m-by-n) of a uniform reference set, array node, loop-nest node, a given cache block size BK_SZ
OUTPUT: self-reuse pattern vector SRP_n of this uniform set

Initialize the self-reuse pattern vector: SRP_n = 0;
Set the current loop level CLP to the innermost loop: CLP = n;
repeat
    Set the current dimension level CDN to the highest dimension: CDN = 0;
    Set the index-occurring flag IOF = FALSE;
    repeat
        if element A[CDN][CLP] != 0 then
            /* the CLPth index variable appears in the expression of the CDNth subscript */
            Set IOF = TRUE;
            Break;
        end if
        Move to the next lower dimension level;
    until CDN > the lowest dimension
    if IOF == FALSE then
        /* the CLPth index variable does not occur in any subscript expression */
        Set SRP[CLP] = TEMP-REUSE;  /* temporal reuse at this level */
    else if CDN == m then
        /* the index variable occurs only in the lowest dimension */
        if A[CDN][CLP] * s[CLP] < BK_SZ / ELMT_SZ then
            Set SRP[CLP] = SPAT-REUSE;  /* spatial reuse at this level */
        end if
    end if
    Go up to the next higher loop level;
until CLP < the outermost loop level
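To make the check concrete, the following is a minimal C sketch of the self-reuse test described above, under the assumption that the access matrix and loop strides are available as plain 0-indexed arrays; the names and encodings are illustrative, not taken from the thesis implementation.

enum { NO_REUSE = 0, TEMP_REUSE = 1, SPAT_REUSE = 2 };

/* Sketch of the self-reuse test (Algorithm 1).
 * A is the m-by-n access matrix stored row-major (A[d*n + j]),
 * s[j] is the stride of loop j, and SRP[j] receives the reuse type of loop j. */
void self_reuse_analysis(int m, int n, const int *A, const int *s,
                         int blk_sz, int elmt_sz, int *SRP)
{
    for (int j = n - 1; j >= 0; j--) {            /* innermost to outermost loop */
        int dim = -1;
        for (int d = 0; d < m; d++)               /* highest to lowest dimension */
            if (A[d * n + j] != 0) { dim = d; break; }

        if (dim < 0)
            SRP[j] = TEMP_REUSE;                  /* loop j never changes the address */
        else if (dim == m - 1 && A[dim * n + j] * s[j] < blk_sz / elmt_sz)
            SRP[j] = SPAT_REUSE;                  /* consecutive iterations stay in one block */
        else
            SRP[j] = NO_REUSE;
    }
}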


5.4.4.2 Group-Reuse Analysis

Group reuses only exist among references in the same uniform reference set.

Group-temporal reuse occurs when different references access the same data location

across loop iterations, while group-spatial reuse occurs when different references access

the same cache block in the same or different loop iterations. Algorithm 2 exploits a

simplified version of group reuse which only exists in a single loop level.

When a group-spatial reuse is found at a particular loop level, Algorithm 2 first checks whether this level has group-temporal reuse for other pairs of references. If it does not have such reuse, this level is marked to indicate that a group-spatial reuse exists; otherwise, the current reuse found is simply omitted. The algorithm defines a vector GRP_n as the group-reuse vector; each element of GRP_n records the type of group reuse at its corresponding loop level. For a group-temporal reuse found at some loop level, the element corresponding to that level in GRP_n is directly set to group-temporal reuse.

Now, for each array and each of its uniform reference sets in a particular loop nest, the reuse information at each loop level can be collected using Algorithm 1 and Algorithm 2. For the example code in Subsection 5.4.3, the references to array a have self-spatial reuse at loop level l, self-temporal reuse at loop level j, and group reuse at loop level j. The reference to array b has self-spatial reuse at loop level i.

Note that, in contrast to most of the previous work on reuse analysis (e.g., [74]), this approach is simpler and computes the required reuse information without solving a system of equations.


Algorithm 2 Group-Reuse Analysis
INPUT: a uniform reference set with access matrix A and constant vectors ~c, array node, loop-nest node, a given cache block size BK_SZ
OUTPUT: group-reuse pattern vector GRP_n of this uniform set

Initialize the group-reuse pattern vector: GRP_n = 0;
for all pairs of constant vectors c1 and c2 do
    if c1 and c2 differ only at the jth element then
        /* set the initial reuse distance at the jth dimension */
        Set init_dist = |c1[j] - c2[j]|;
        Check the jth row of access matrix A;
        Find the first occurring loop index variable (non-zero element) starting from the innermost loop, say i_k;
        if k < 1 then
            /* no index variable occurs in this dimension */
            if j == m and init_dist < BK_SZ / ELMT_SZ then
                /* these two references have group-spatial reuse at each loop iteration */
                Continue;
            end if
        else
            Check the kth column of access matrix A;
            if i_k occurs only in the jth dimension then
                if j == m then   /* m is the lowest dimension of the array */
                    if init_dist % A[m][k] == 0 then
                        Set GRP[k] = TEMP-REUSE;
                    else if GRP[k] == 0 then
                        Set GRP[k] = SPAT-REUSE;
                    end if
                else             /* j < m */
                    if init_dist % A[j][k] == 0 then
                        Set GRP[k] = TEMP-REUSE;
                    end if
                end if
            end if
        end if
    end if
end for


5.4.5 Simulating the Footprints of Reuse Spaces

The next step in this approach is to transform the data reuses detected above into data locality. The idea is to make the data cache capacity large enough to hold all the data in the (detected) reuse spaces of the arrays. Note that data outside the reuse space need not be kept in the cache after the first reference, since such data do not exhibit reuse. As discussed earlier, cache interferences can significantly affect the overall performance of a nest. Thus, the objective of this technique is to find a near-optimal cache configuration that can reduce or eliminate the majority of the cache interferences within a given nest. An informal definition of a near-optimal cache configuration is as follows.

Definition 1. A ’near-optimal cache configuration’ is the one with the smallest capacity

and associativity that approximates the locality that can be obtained using a very large

and fully-associative cache. Intuitively, any increase in either cache size or associativity

over this configuration would not produce any significant improvement.

In order to determine such a near-optimal cache configuration, the cache behavior within the reuse space must be known for potential optimizations. This section provides an algorithm that simulates the exact footprints (memory addresses) of array references in their reuse spaces.

Suppose, for a given loop index vector ~i, an array reference at a particular value of ~i = (i_1, i_2, ..., i_n)^T can be expressed as follows:

\[
f(\vec{i}) = SA + Cof_1 \cdot i_1 + Cof_2 \cdot i_2 + \cdots + Cof_n \cdot i_n. \tag{5.3}
\]


Here, SA is the starting address of the array reference, which is different from the base address (the memory address of the first array element) of the array; it is the constant part of the above equation. Suppose that the size of each array element is elmt_sz, the depth of dimension is m, the dimensional bound vectors (defining the scope of each array dimension) are ~dl_m = (dl_1, dl_2, ..., dl_m)^T and ~du_m = (du_1, du_2, ..., du_m)^T, and the constant offset vector is ~c = (c_1, c_2, ..., c_m)^T. Then SA can be derived from the following equation:

\[
SA = \mathit{elmt\_sz} \cdot \sum_{j=1}^{m} \left( \prod_{k=j+1}^{m+1} dd_k \right) c_j,
\qquad
dd_k = \begin{cases} 1, & k = m + 1 \\ du_k - dl_k, & k \le m \end{cases}
\tag{5.4}
\]

Cof_j (j = 1, 2, ..., n) is used to denote the integrated coefficient of each loop index variable. Suppose that the access matrix is m by n. In this case, Cof_j can be derived as follows:

\[
Cof_j = \mathit{elmt\_sz} \cdot \sum_{l=1}^{m} \left( \prod_{k=l+1}^{m+1} dd_k \right) a_{lj},
\qquad
dd_k = \begin{cases} 1, & k = m + 1 \\ du_k - dl_k, & k \le m \end{cases}
\tag{5.5}
\]
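For concreteness, a minimal C sketch of Equations 5.3-5.5 is given below; it computes SA, the coefficients Cof_j, and the resulting reference address, assuming 0-based indexing, row-major layout, and plain-array inputs (the function and variable names are illustrative).

/* Compute the constant part SA and the integrated coefficients Cof[j] of one
 * array reference (Equations 5.4 and 5.5), then evaluate its address for a
 * given iteration (Equation 5.3). dl[], du[] are the dimension bounds,
 * c[] the constant offsets, A[d*n + j] the m-by-n access matrix. */
long reference_address(int m, int n, const int *A, const int *c,
                       const int *dl, const int *du,
                       int elmt_sz, const int *i, long *Cof)
{
    long SA = 0;
    for (int j = 0; j < m; j++) {
        long dd = 1;                       /* product of extents of lower dimensions */
        for (int k = j + 1; k < m; k++)
            dd *= (du[k] - dl[k]);
        SA += dd * c[j];
    }
    SA *= elmt_sz;

    for (int j = 0; j < n; j++) {          /* Equation 5.5: one coefficient per loop */
        long sum = 0;
        for (int l = 0; l < m; l++) {
            long dd = 1;
            for (int k = l + 1; k < m; k++)
                dd *= (du[k] - dl[k]);
            sum += dd * A[l * n + j];
        }
        Cof[j] = (long)elmt_sz * sum;
    }

    long addr = SA;                        /* Equation 5.3: f(i) = SA + sum_j Cof_j * i_j */
    for (int j = 0; j < n; j++)
        addr += Cof[j] * i[j];
    return addr;
}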

Note that, with Equation 5.3, the address of an array reference at a particular loop iteration can be calculated as the offset in the layout space of this array. The algorithm provided in this section uses these formulations to simulate the footprints of array references at each loop iteration within their reuse spaces. The following two observations give some basis as to how to simulate the reuse spaces.


Observation 1. In order to realize the reuse carried by the innermost loop, only one

cache block is needed for this array reference.

Observation 2. In order to realize the reuse carried by a non-innermost loop, the minimum number of cache blocks needed for this array reference is the number of cache blocks that are visited by the loops nested inside it.

Since it has been assumed that all subscript functions are affine, for any array reference the patterns of the reuse space during different iterations of the loop level that carries reuse are exactly the same. Thus, it is only necessary to simulate the first iteration of the loop carrying the reuse currently being exploited. For example, if loop level j in the loop vector ~i carries the reuse being exploited, the simulation space is defined as SMS_j = (i_1 = l_1, i_2 = l_2, ..., i_j = l_j, i_{j+1}, ..., i_n), in which each i_k with k > j varies from its lower bound l_k to its upper bound u_k.

Algorithm 3 first calls Algorithm 1 and Algorithm 2, given above. After that, it simulates the footprints of the most significant reuse space for an array in a particular loop nest. The most significant reuse space is defined by the highest reuse (self-reuse and group-reuse) level among those of all uniform sets of the given array. If an array has reuse only at the innermost loop level, Algorithm 3 needs to calculate the footprints only once, by setting all loop index variables to their lower bounds. Otherwise, Algorithm 3 computes the footprints of all references to this array within the iteration space defined by the reuse space SMS. These footprints are marked using an array bitmap.


Algorithm 3 Simulating Footprints in Reuse Spaces
INPUT: an array node, a loop-nest node, a given cache block size BK_SZ
OUTPUT: an array-level bitmap of footprints

Initialize the array size AR_SZ in number of cache blocks;
Allocate an array-level bitmap ABM of size AR_SZ and initialize ABM to zeros;
Initialize the highest reuse level RS_LEV = n; /* n is the depth of the loop nest */
for each uniform reference set do
    Call Algorithm 1 for self-reuse analysis;
    Call Algorithm 2 for group-reuse analysis;
    Set URS_LEV = highest reuse level of this set;
    if RS_LEV > URS_LEV then
        /* the current set has the highest reuse level */
        Set RS_LEV = URS_LEV;
    end if
end for
if RS_LEV == n then
    /* only the innermost loop has reuse, or no reuse exists */
    /* simulate one iteration of the innermost loop (loop n) */
    for all references of this array do
        Set ~i = ~l; /* use only the lower bounds */
        Apply Equation 5.3 to get the reference address f(~i);
        Convert it to a block id: bk_id = f(~i) / BK_SZ;
        Set the array bitmap: ABM[bk_id] = VISITED;
    end for
else
    /* simulate all iterations of the loops inner than RS_LEV */
    for all loop indices i_j, j > RS_LEV do
        Vary the value of i_j from its lower bound to its upper bound;
        for all references of this array do
            Apply Equation 5.3 to get the reference address f(~i);
            Convert it to a block id: bk_id = f(~i) / BK_SZ;
            Set the array bitmap: ABM[bk_id] = VISITED;
        end for
    end for
end if


5.4.6 Computation and Optimization of Cache Configurations for Loop Nests

In previous subsections, the reuse spaces of each array in a particular loop nest

have been determined and their footprints have been simulated in the layout space of

each array. After executing Algorithm 3, each array has a bitmap indicating the cache

blocks that have been visited by the iterations in the reuse spaces. As discussed earlier, cache interferences can disturb these reuses and prevent the array references from realizing data locality across loop iterations. Thus, an algorithm that can reduce these cache interferences and achieve better data locality within the reuse spaces is crucial.

This subsection provides a loop-nest level algorithm to capture cache interfer-

ences among different arrays accessed within a loop nest. This algorithm tries to map

the reuse space of each array into a linear memory space. At the same time, the degree

of conflicts (number of interferences among different arrays) at each cache block is stored

in a loop-nest level bitmap. Since the self-interference of each array is already solved

by Algorithm 3 using an array bitmap, this algorithm mainly focuses on reducing the

group-interference that might occur among different arrays. As is well known, one of the most effective ways to avoid interferences is to increase the associativity of the data cache, which is the approach used in this algorithm. Based on the definition of a near-optimal cache configuration, this algorithm tries to find the smallest data cache with the smallest associativity that reduces cache conflicts significantly. The detailed procedure is given as Algorithm 4, which computes and optimizes the cache configuration.


Algorithm 4 Compute and Optimize Cache Configurations for Loop Nests
INPUT: loop-nest node, global list of arrays declared, lower bound of block size BK_SZ_LB, upper bound of block size BK_SZ_UB
OUTPUT: optimal cache configurations at different BK_SZ

Set BK_SZ = BK_SZ_LB; /* lower bound */
repeat
    for each array in this loop nest do
        Call Algorithm 3 to get the array bitmap ABM;
    end for
    Create and initialize a loop-nest level bitmap LBM, whose size LBM_size is the smallest power of two that is >= the size of the largest array (in blocks);
    for each array bitmap ABM do
        Map ABM into the loop-nest bitmap LBM using the relative base address of the array, base_addr, to indicate the degree of conflict at each block:
        for block_id < array_size do
            LBM[(block_id + base_addr) % LBM_size] += ABM[block_id];
        end for
    end for
    Set assoc = the largest degree of conflict in LBM;
    Set cache_sz = assoc * LBM_size;
    Set the optimal cache configuration to the current one;
    for assoc < assoc_upper_bound do
        Halve the number of sets of the current cache: LBM_size /= 2;
        for i <= LBM_size do
            LBM[i] += LBM[i + LBM_size];
        end for
        Set assoc = the highest value of LBM[i], i <= LBM_size;
        Set cache_size = assoc * LBM_size;
        if assoc < assoc_upper_bound and cache_size < optimal_cache_size then
            Set the optimal cache configuration to the current one;
        end if
    end for
    Output the optimal cache configuration at BK_SZ;
    Double the block size: BK_SZ *= 2;
until BK_SZ > BK_SZ_UB /* upper bound */


For a given loop nest, Algorithm 4 starts with the cache block size (BK_SZ) at its lower bound, e.g., 16 bytes, and goes up to its upper bound, e.g., 64 bytes. For a given BK_SZ, it first applies Algorithm 3 to obtain the array bitmap ABM of each array. After that, it allocates a loop-nest level bitmap LBM for all arrays within this nest, whose size is the smallest power of two that is greater than or equal to the largest array size. All ABMs are remapped to this LBM using their relative array base addresses. The value of each bit in the LBM indicates the conflict at a particular cache block. Following this, the optimization is carried out by halving the size of the LBM and remapping it. Note that the largest bit value in the LBM gives the smallest cache associativity needed to avoid the interference in the corresponding cache block. This process stops when the upper bound of associativity is met. The near-optimal cache configuration at block size BK_SZ is computed as the one that has the smallest cache size as well as the smallest associativity.
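The following is a minimal C sketch of the LBM construction and folding step just described, assuming each array bitmap holds one conflict count per cache block and that the loop-nest bitmap size is a power of two; the function names and the assoc_limit parameter are illustrative assumptions, not the thesis implementation.

/* Accumulate one array bitmap into the loop-nest bitmap. base_blk is the
 * array's relative base address expressed in cache blocks. */
void map_abm(int *lbm, int lbm_size, const int *abm, int abm_size, int base_blk)
{
    for (int b = 0; b < abm_size; b++)
        lbm[(b + base_blk) % lbm_size] += abm[b];
}

static int max_entry(const int *v, int n)
{
    int m = 0;
    for (int i = 0; i < n; i++)
        if (v[i] > m) m = v[i];
    return m;
}

/* Fold the loop-nest bitmap until the required associativity would exceed the
 * limit; report the smallest (sets, associativity) pair seen. */
void optimize_config(int *lbm, int lbm_size, int assoc_limit,
                     int *best_sets, int *best_assoc)
{
    int assoc = max_entry(lbm, lbm_size);          /* conflicts per block */
    *best_sets = lbm_size;
    *best_assoc = assoc;

    while (assoc < assoc_limit && lbm_size > 1) {
        lbm_size /= 2;                             /* halve the number of sets */
        for (int i = 0; i < lbm_size; i++)
            lbm[i] += lbm[i + lbm_size];           /* folded blocks now alias in one set */
        assoc = max_entry(lbm, lbm_size);
        if (assoc < assoc_limit &&
            assoc * lbm_size < (*best_assoc) * (*best_sets)) {
            *best_sets = lbm_size;                 /* smaller total capacity found */
            *best_assoc = assoc;
        }
    }
}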

5.4.7 Global Level Cache Polymorphism

The compiler-directed cache polymorphism technique does not make changes to the source code. Instead, it uses the compiler only for source-code parsing and generates internal code in the intermediate format, which is local to the algorithms. A global (program-level) algorithm, Algorithm 5, is presented in this section to implement compiler-directed cache polymorphism.

Algorithm 5 first generates the intermediate format of the original code and collects the global information about the arrays in the source code. After that, it applies Algorithm 4 to each of the loop nests and obtains a near-optimal cache configuration for each of them.


Algorithm 5 Global Level Cache Polymorphism
INPUT: source code (.spd)
OUTPUT: performance data and the cache configurations of each loop nest

Initialize the cache-configuration list CCL;
Use one SUIF pass to generate the intermediate code format;
Construct a global list of the declared arrays with their relative base addresses;
for each loop nest do
    for each array in this loop nest do
        Construct uniform reference sets for all its references;
    end for
    Call Algorithm 4 to optimize the cache configurations for this loop nest;
    Store the configurations in the CCL;
end for
for each block size do
    Activate the reconfiguration mechanism for each loop nest using its configuration from the CCL;
    Output the performance data as well as the cache configuration of each loop nest;
end for

These configurations are stored in the cache-configuration list (CCL). Each loop nest has a corresponding node in the CCL, which holds its near-optimal cache configurations at different block sizes. After the nest-level optimization has been performed,

Algorithm 5 activates the cache reconfiguration, where a modified version of the Shade

simulator is used. During the simulation, Shade is directed to use the near-optimal cache

configurations in CCL for each loop nest before its execution. The performance data of

each loop nest under different cache configurations is generated as output.

Since current cache reconfiguration mechanisms can vary only the cache size and associativity with a fixed cache block size, the cache optimization is performed separately for different (fixed) cache block sizes. This means that the algorithms in this work suggest a near-optimal cache configuration for each loop nest for a given block size. In Section 5.5, experimental results verifying the effectiveness of this technique are presented.


#define N 8
int a[N][N][N], b[N][N][N];
int N1 = 4, N2 = 4, N3 = 4, N4 = 4;

main()
{
    int i, j, k, l;
    for (i = 0; i <= N1; i++)
        for (j = 0; j <= N2; j++)
            for (k = 0; k <= N3; k++)
                for (l = 0; l <= N4; l++)
                {
                    a[i + k][j + 2][l] = a[i + k][j][l];
                    b[j][k + l][i] = a[2*i][k][l];
                }
}

Fig. 5.6. An Example: Array-based code.

5.4.8 An Example

This subsection focuses on the example code in Figure 5.6 to demonstrate how

CDCP works. For simplicity, this code only contains a single nest (with four loops and

four references).

Algorithm 5 starts with a SUIF pass that converts the source code shown above into the IF, in which the program node has only one loop-nest node. The loop-nest node is represented by its index vector ~i = (i, j, k, l)^T, with an index lower bound vector ~il = (0, 0, 0, 0)^T, an upper bound vector ~iu = (N1, N2, N3, N4)^T, and a stride vector ~is = (1, 1, 1, 1)^T. Within the nest, arrays a and b have references AR_a1, AR_a2, AR_a3, and AR_b, which are represented using access matrices and constant vectors as follows:


\[
A_{a1} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad
\vec{c}_{a1} = \begin{pmatrix} 0 \\ 2 \\ 0 \end{pmatrix}, \qquad
A_{a2} = \begin{pmatrix} 1 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad
\vec{c}_{a2} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix},
\]
\[
A_{a3} = \begin{pmatrix} 2 & 0 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}, \quad
\vec{c}_{a3} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}, \qquad
A_{b} = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \quad
\vec{c}_{b} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}.
\]

Also, a global array list is generated as $\langle a, b \rangle$. Then, for array $a$, references $AR_{a1}$ and $AR_{a2}$ are grouped into a uniform reference set, and $AR_{a3}$ is put into another set. Array $b$, on the other hand, has only a single uniform reference set.

After that, Algorithm 4 is invoked. It starts processing from the smallest cache

block size, BK SZ, say 16 bytes and uses Algorithm 3 to obtain the array bitmap ABMa

for array a and ABMb for array b using this BK SZ. Within Algorithm 3, it first calls

Algorithm 1 and Algorithm 2 to analyze the reuse characteristics of a given array. In

this example, these algorithms compute the self-reuse pattern $\vec{SRP} = (0, 0, 0, 1)^T$ and the group-reuse pattern $\vec{GRP} = (0, 1, 0, 0)^T$ for the first uniform set of array $a$. These two patterns indicate that this set has self-spatial reuse at level $l$ and group-temporal reuse at level $j$. For the second uniform set of array $a$, the algorithm returns $\vec{SRP} = (0, 2, 0, 1)^T$, which indicates that the array has self-spatial reuse at level $l$ and self-temporal reuse at level $j$. The reference to array $b$ has self-spatial reuse at level $i$, corresponding to its self-reuse pattern $\vec{SRP} = (1, 0, 0, 0)^T$. The highest level of reuse is then used for each array by

Algorithm 3 to generate the ABM for its footprints in the reuse space. It is assumed


that an integer has 4 bytes in size. In this case, both ABMa and ABMb have 128 bits, as shown below.

ABMa:
  bits   0-31: 1 0 1 0 1 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  bits  32-63: 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
  bits  64-95: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  bits 96-127: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

ABMb:
  bits   0-31: 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
  bits  32-63: 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
  bits  64-95: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  bits 96-127: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
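A minimal sketch of how such an array bitmap might be filled in is given below; it assumes, purely for illustration and not as the thesis implementation, that the footprint of an array's reuse space is available as a list of byte offsets from the array's base.

    #include <stdint.h>
    #include <string.h>

    #define ABM_BITS 128                     /* bitmap length used in this example */

    /* Set one bit per cache block touched by the array's reuse-space footprint.
     * `offsets' are byte offsets (from the array base) of the footprint elements,
     * and `bk_sz' is the cache block size under consideration.                    */
    static void mark_abm(uint8_t abm[ABM_BITS], const long *offsets, int n, int bk_sz)
    {
        memset(abm, 0, ABM_BITS);
        for (int i = 0; i < n; i++) {
            long bit = offsets[i] / bk_sz;   /* block index within the array       */
            if (bit < ABM_BITS)
                abm[bit] = 1;                /* this block is touched               */
        }
    }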

These two ABMs are then passed by Algorithm 3 to Algorithm 4. In turn,

Algorithm 4 creates a loop-nest bitmap LBM , its size being equal to the largest array

size, MAX( ABMs), and re-maps ABMa and ABMb to LBM . Since array a has

relative base address at 0 (byte), and array b at 2048, it determines the LBM as follows:

LBM:
  bits   0-31: 2 0 2 0 2 0 2 0 1 0 1 0 1 0 1 0 2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0
  bits  32-63: 2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0 2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0
  bits  64-95: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
  bits 96-127: 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

The maximum value among the bits of the LBM indicates the degree of interference among the different arrays in the nest; it is thus the minimum associativity that is required to

avoid this interference. In this example, Algorithm 4 starts from a cache associativity

of 2 to compute the near-optimal cache configuration. At each iteration, the size of

LBM is halved and the LBM is re-mapped until the resulting associativity reaches

the upper bound, e.g., 16. Then, the algorithm outputs the smallest cache size with

the smallest associativity as the near-optimal configuration at this block size BK SZ. For this example, the near-optimal cache configuration is a 2KB 2-way set-associative cache with a 16-byte block size. The LBM after optimization looks as follows.

  bits  0-31: 2 0 2 0 2 0 2 0 1 0 1 0 1 0 1 0 2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0
  bits 32-63: 2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0 2 0 1 0 2 0 1 0 1 0 1 0 1 0 1 0
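The LBM manipulation described above can be sketched as follows. This is an illustrative model only: the function names, the modulo remapping, and the in-place folding are assumptions consistent with the description, not code taken from the thesis.

    #include <stdint.h>

    /* Overlay one array bitmap onto the loop-nest bitmap (LBM).  Where an array's
     * blocks land is decided by its relative base address; the mapping wraps
     * modulo the current number of LBM entries (cache sets).                      */
    static void remap_abm(int *lbm, int lbm_len, const uint8_t *abm, int abm_len,
                          long rel_base, int bk_sz)
    {
        for (int i = 0; i < abm_len; i++)
            if (abm[i])
                lbm[(rel_base / bk_sz + i) % lbm_len] += 1;
    }

    /* The largest LBM entry counts how many array footprints collide on one
     * block, i.e. the minimum associativity needed to avoid those conflicts.      */
    static int min_assoc(const int *lbm, int lbm_len)
    {
        int max = 0;
        for (int i = 0; i < lbm_len; i++)
            if (lbm[i] > max)
                max = lbm[i];
        return max;
    }

    /* Halving the LBM models halving the number of sets: entries that now share
     * a set are merged, and the required associativity is recomputed afterwards.  */
    static int fold_lbm(int *lbm, int lbm_len)
    {
        int half = lbm_len / 2;
        for (int i = 0; i < half; i++)
            lbm[i] += lbm[i + half];
        return half;                       /* new LBM length (number of sets)      */
    }

In the example above, one halving step keeps the maximum interference at 2, giving 64 sets of 16-byte blocks with 2 ways, which matches the 2KB 2-way configuration reported for the 16-byte block size.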

Algorithm 4 then proceeds to compute the near-optimal cache configurations for

larger cache block sizes by doubling the previous block size. When the block size reaches

its upper bound (e.g., 64 bytes), the algorithm stops, and passes all the near-optimal

configurations at different block sizes to Algorithm 5. The cache configurations (in

this example) computed by Algorithm 4 at different block sizes are given in Table 5.1.

On receiving these configurations, Algorithm 5 activates Shade (a fast instruction-set


simulator) to simulate the example code (executable) with these cache configurations.

Then the performance data is generated as the output of Algorithm 5.

Block Size (B)   Number of Sets   Associativity   Cache Size (B)
      16               64               2              2048
      32               32               2              2048
      64               16               2              2048

Table 5.1. Cache configurations generated by Algorithm 4 for the example nest.

5.5 Experiments

5.5.1 Simulation Framework

This section presents the simulation results to verify the effectiveness of the CDCP

technique. The technique has been implemented using the SUIF compiler [69] and Shade

[18]. Eight array-based benchmarks from Table 3.2 are used in this simulation work. In

each benchmark, loop nests dominate the overall execution time.

The main goal here is to compare the cache configurations returned by CDCP

scheme and those obtained through a scheme based on exhaustive simulation (using

Shade). Three different block (line) sizes are considered here: 16, 32 and 64 bytes. Note

that this part of the work is particularly targeted at on-chip L1 caches.


5.5.2 Selected Cache Configurations

This subsection first applies an exhaustive simulation method using the Shade

simulator. For this method, the original program codes are divided into a set of small

programs, each program having a single nest. Shade simulates these loop nests individ-

ually with all possible L1 data cache configurations within the following ranges: cache

sizes from 1K to 128K, set-associativity from 1 way to 16 ways, and block size at 16, 32

and 64 bytes. The number of data cache misses is used as the metric for comparing per-

formance. The optimal cache configuration at a certain cache block size is the smallest one, in terms of both cache size and set associativity, that achieves a performance (number of misses) which cannot be improved significantly (i.e., the number of misses cannot be reduced by more than 1%) by increasing the cache size and/or set associativity. The left portion

of Table 5.3 shows the optimal cache configurations (as selected by Shade) for each loop

nest in different benchmarks as well as at different cache block sizes.
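For concreteness, the exhaustive selection rule can be sketched as below. The data structure, helper name, and the reading of the 1% threshold as "within 1% of the best achievable miss count" are assumptions made for illustration, not code from the thesis.

    #include <stddef.h>

    /* One simulated data point: a cache configuration and its miss count.
     * The thesis drives Shade to produce these numbers.                           */
    struct cfg { long size; int assoc; long misses; };

    /* Pick the "optimal" configuration at a fixed block size: the smallest cache
     * (by size, then associativity) whose miss count is within 1% of the best
     * miss count achievable by any configuration in the explored range.           */
    static const struct cfg *pick_optimal(const struct cfg *c, int n)
    {
        long best = c[0].misses;
        for (int i = 1; i < n; i++)              /* best achievable miss count     */
            if (c[i].misses < best)
                best = c[i].misses;

        const struct cfg *opt = NULL;
        for (int i = 0; i < n; i++) {
            if ((double)c[i].misses <= best * 1.01) {    /* within 1% of the best  */
                if (opt == NULL || c[i].size < opt->size ||
                    (c[i].size == opt->size && c[i].assoc < opt->assoc))
                    opt = &c[i];
            }
        }
        return opt;
    }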

Benchmark   Running Time (s)   Benchmark   Running Time (s)
adi.c            10.491          aps.c           1.638
bmcm.c            2.609          eflux.c         0.296
tomcat.c          1.544          tsf.c           0.809
vpenta.c          4.009          wss.c           0.148

Table 5.2. Running time of Algorithm 4 for each benchmark.


The CDCP technique takes the original source code in the SUIF .spd format and

applies Algorithm 5 to generate the near-optimal cache configurations for each loop nest

in the source code. It does not perform any instruction level simulation for configuration

optimization. Thus, it is expected to be very fast in finding the near-optimal cache

configuration. In fact, Table 5.2 gives the real running time of Algorithm 4 for each

benchmark. The execution engine (a modified version of Shade) of CDCP directly applies

these cache configurations to activate the reconfiguration mechanisms dynamically. The

cache configurations determined by CDCP are shown on the right part of Table 5.3.

To sum up, in Table 5.3, for each loop nest in a given benchmark, the optimal cache

configurations from Shade and near-optimal cache configurations from CDCP technique

at block sizes of 16, 32, and 64 bytes are given. A notation such as 8k4w is used to indicate an 8K-byte 4-way set-associative cache (here with a block size of 32 bytes). In this table, B

means bytes, K denotes kilobytes, and M indicates megabytes.

From Table 5.3, it can be observed that CDCP has the ability to determine cache

capacities at byte granularity. In most cases, the cache configuration determined by

CDCP is less than or equal to the one determined by the exhaustive simulation. This is

because the exhaustive simulation strategy searches for an optimal cache configuration

reducing the cache conflicts as much as possible. CDCP, on the other hand, tries to determine a cache configuration that avoids the majority of cache conflicts within the reuse space of the different arrays rather than over the whole memory space of all arrays. The next section presents simulation results showing that the CDCP technique is effective in determining near-optimal cache configurations by focusing on these dominant cache conflicts within the reuse space.


Benchmark Shade CDCP

adi 16 32 64 16 32 64

1 1k4w 1k4w 1k4w 64B4w 128B4w 256B4w

2 16k16w 16k16w 16k16w 16k16w 16k16w 16k16w

aps 16 32 64 16 32 64

1 2k4w 4k8w 64k4w 2k8w 4k4w 8k8w

2 16k8w 16k16w 32k16w 16k4w 16k8w 32k8w

3 4k2w 4k8w 8k8w 2k16w 4k8w 8k8w

bmcm 16 32 64 16 32 64

1 1k8w 2k8w 4k8w 64B1w 128B1w 256B1w

2 1k8w 2k8w 4k8w 64B2w 128B4w 256B1w

3 32k4w 64k4w 128k4w 32k4w 64k4w 128k4w

eflux 16 32 64 16 32 64

1 16k4w 32k4w 64k4w 2k8w 4k4w 8k8w

2 16k8w 32k4w 64k4w 8k4w 16k2w 32k4w

3 128k16w 256k2w 256k2w 128k8w 256k2w 256k2w

4 2k8w 2k8w 4k8w 128B4w 256B2w 256B4w

5 16k16w 32k4w 64k4w 8k16w 16k8w 32k4w

6 128k16w 256k1w 256k2w 128k8w 256k2w 256k2w

tomcat 16 32 64 16 32 64

1 1k2w 1k1w 1k1w 32B2w 64B2w 128B1w

2 1k1w 1k1w 1k1w 32B1w 64B1w 128B2w

3 128k4w 256k4w 256k16w 64k1w 128k2w 256k2w

4 1k2w 1k4w 2k8w 32B2w 64B2w 128B1w

5 64k8w 128k8w 256k16w 64k1w 128k2w 256k2w

6 1k2w 1k4w 2k4w 64B4w 128B4w 256B2w

7 64k4w 128k8w 128k8w 32k4w 64k8w 128k16w

8 32k1w 128k2w 128k4w 32k1w 64k2w 128k4w

tsf 16 32 64 16 32 64

1 4k4w 8k1w 8k1w 4k1w 4k1w 4k1w

2 1M1w 1M1w 1M1w 1M1w 1M1w 1M1w

3 4k4w 4k16w 8k4w 4k1w 4k1w 4k1w

4 1M1w 1M1w 1M1w 1M1w 1M1w 1M1w

vpenta 16 32 64 16 32 64

1 64k1w 128k1w 256k1w 64k1w 128k1w 256k8w

2 1k8w 2k4w 2k8w 128B8w 256B8w 512B8w

3 1k4w 2k2w 2k8w 256B4w 512B2w 1k2w

4 128k8w 256k8w 512k2w 128k2w 256k8w 512k2w

5 1k4w 2k4w 4k2w 256B4w 512B2w 1k2w

6 1k2w 2k2w 2k8w 128B8w 256B4w 512B8w

7 1k2w 1k2w 1k16w 64B1w 128B2w 256B4w

8 64k8w 128k2w 256k2w 64k1w 128k1w 256k1w

wss 16 32 64 16 32 64

1 4k4w 8k4w 8k16w 2k2w 4k4w 8k8w

2 1k8w 2k8w 4k4w 64B4w 128B4w 256B4w

3 1k2w 1k2w 1k2w 64B2w 128B4w 256B4w

4 64k4w 64k4w 64k4w 64k2w 64k2w 64k2w

5 4k4w 8k8w 16k8w 2k4w 4k4w 8k8w

6 1k2w 1k2w 1k2w 32B2w 64B1w 128B2w

7 2k8w 4k4w 4k4w 64B4w 128B1w 256B2w

Table 5.3. Cache configurations for each loop nest in benchmarks: Shade Vs CDCP.


5.5.3 Simulation Results

Notice that an underlying reconfigurable cache is assumed for this research. Since

loop nests dominate the performance and energy consumption in array-based appli-

cations, the cache reconfigurations, which take place at loop nest boundaries, incur

negligible performance/energy overhead. Reconfiguration is performed at a coarse gran-

ularity of changing cache associativity and cache sizes. This reconfiguration involves

enabling/disabling cache sub-banks that are normally present in a cache architecture.

The impact on cache access time will be negligible in designs that exploit this prop-

erty. Note that the reconfiguration required in this work disables the unused portions of

the cache as opposed to the more complex reconfigurable caches that divide the cache

memory into multiple partitions used for different purposes [62]. A simple cache flushing

scheme is applied during the cache reconfiguration.

In this part of the experiments, the two sets of cache configurations for each loop nest given in Table 5.3 are both simulated. All configurations from CDCP with a cache size of less than 1K are simulated at a 1K cache size with the other parameters unmodified. For ease of comparison, the performance is shown as the cache hit rate instead of the miss rate.

Figure 5.7 gives the performance comparison between Shade (exhaustive simulation) and

CDCP using a block size of 16 bytes.

The observation from Figure 5.7 is that, for benchmarks adi.c, aps.c, bmcm.c,

tsf.c, and wss.c, the results obtained from Shade and CDCP are very close. On the

other hand, Shade outperforms CDCP in benchmarks eflux.c, tomcat.c and vpenta.c.

On the average, CDCP achieves 98% of the performance of the optimal cache


[Figure 5.7: bar chart of data cache hit rates (0-100%) for benchmarks adi, aps, bmcm, eflux, tomcat, tsf, vpenta, and wss, comparing Shade and CDCP configurations.]

Fig. 5.7. Cache performance comparison for configurations at block size of 16: Shade vs. CDCP.

[Figure 5.8: bar chart of data cache hit rates for the same benchmarks, comparing Shade and CDCP configurations at a 32-byte block size.]

Fig. 5.8. Cache performance comparison for configurations at block size of 32: Shade vs. CDCP.


configurations. Figures 5.8 and 5.9, on the other hand, show the results obtained for

block sizes of 32 and 64 bytes.

[Figure 5.9: bar chart of data cache hit rates for the same benchmarks, comparing Shade and CDCP configurations at a 64-byte block size.]

Fig. 5.9. Cache performance comparison for configurations at block size of 64: Shade vs. CDCP.

It should be noted that, for most benchmarks, the performance difference between

Shade and CDCP decreases as the block size is increased to 32 and 64 bytes. Especially

for benchmarks adi.c, aps.c, bmcm.c, tsf.c, vpenta.c, and wss.c, the performance of

the configurations determined by the two approaches is almost the same. For other

benchmarks such as eflux.c and tomcat.c, Shade consistently outperforms CDCP when

block size is 32 or 64 bytes. On the average, the performance difference is reduced to

1.1% and 0.7% at 32 byte and 64 byte block sizes, respectively.


[Figure 5.10: four panels of per-loop data cache hit rates for (a) adi, (b) aps, (c) bmcm, and (d) tsf, each comparing Shade and CDCP configurations at 16-, 32-, and 64-byte block sizes.]

Fig. 5.10. A breakdown of cache performance comparison at the granularity of each loop for benchmarks adi, aps, bmcm, and tsf. Configurations for all three cache block sizes, 16 byte, 32 byte and 64 byte, are compared: Shade vs. CDCP.


[Figure 5.11: four panels of per-loop data cache hit rates for (a) eflux, (b) tomcat, (c) vpenta, and (d) wss, each comparing Shade and CDCP configurations at 16-, 32-, and 64-byte block sizes.]

Fig. 5.11. A breakdown of cache performance comparison at the granularity of each loop for benchmarks eflux, tomcat, vpenta, and wss. Configurations for all three cache block sizes, 16 byte, 32 byte and 64 byte, are compared: Shade vs. CDCP.


For more detailed study, a breakdown of the performance comparison at loop

nest level for benchmarks adi, aps, bmcm, and tsf is given in Figure 5.10, and the

comparison breakdown for benchmarks eflux, tomcat, vpenta, and wss is presented in

Figure 5.11. For each loop of a given benchmark, optimal cache configurations from

Shade exhaustive simulation and the near-optimal cache configurations from CDCP at

all three cache block sizes (16, 32, and 64 bytes) are compared. In each group of six bars

for a loop, the left two bars are cache hit rates of configurations at 16-byte block size

from Shade and CDCP, the middle two bars are for configurations at 32-byte block size,

and the last two are for configurations at 64-byte block size. From Figure 5.10, the cache

configurations computed by CDCP for each loop at different block sizes achieve the very

same cache performance as the optimal ones from Shade. One exception is loop 3 of

aps at block size of 16-byte. However, this performance difference disappears for cache

configurations at a 32-byte or 64-byte block size. The same behavior appears for benchmarks vpenta and wss in Figure 5.11. Figure 5.11 also shows a noticeable performance gap between the configurations from Shade and CDCP for benchmarks eflux and tomcat. This gap diminishes for configurations with larger block sizes. The results from the loop nest level

comparison show that the CDCP technique is very effective in finding the near-optimal

cache configurations for loop nests in these benchmarks, especially at block sizes of 32 and

64 bytes (the most common block sizes used in embedded processors). Since CDCP is

analysis-based not simulation-based, it is expected that it will be even more desirable in

codes with large input sizes.


Data cache hit rate is a reliable metric for comparing the performances of Shade-

based and CDCP-based configurations. However, the impact of degraded cache perfor-

mance (i.e., increased miss rate) on the overall processor performance can sometimes be

amortized by other factors such as control dependences, data dependences, and resource

conflicts. Thus, the overall performance degradation can be smaller than cache hit rate

degradation when using CDCP (in comparison to Shade). That is, the estimate given here of the performance impact of the CDCP-selected cache configurations is a pessimistic one.

From the energy perspective, the Cacti power model [63] is used to compute the

energy consumption in L1 data cache for each loop nest of the benchmarks at different

cache configurations listed in Table 5.3. A 0.18 micron technology is used for all the

cache configurations. Since cache reconfiguration is performed at the granularity of loop

nest, the energy consumed during reconfiguration is negligible compared to the energy

consumed during the execution of loop nests (Experimental results show that the energy

impact is less than 0.1% even if the cost of reconfiguration energy is assumed to be 1000

times that of a single cache access). The detailed energy consumption figures are given in

Table 5.4.

From these experimental results, it can be concluded that (i) the CDCP strategy

generates competitive performance results with exhaustive simulation, and (ii) in general

it results in a much lower energy consumption than a configuration selected by exhaustive

simulation. Consequently, this approach strikes a balance between performance and

power consumption.

1. Energy estimation is not available from Cacti due to the very small cache configuration.


Benchmark Shade CDCP

adi 16 32 64 16 32 64

1 318.6 287.4 -1 318.6 287.4 -

2 12154.4 13164.5 16753.6 12154.4 13164.5 16753.6

aps 16 32 64 16 32 64

1 322.3 771.7 540.1 661.2 335.4 822.0

2 125599.5 279985.9 368764.9 65461.7 122847.2 145962.2

3 7907.7 33273.5 34697.7 64275.4 33273.5 34697.7

bmcm 16 32 64 16 32 64

1 314.6 342.9 393.4 31.7 30.5 31.1

2 314.6 342.9 393.4 83.0 155.2 31.1

3 26826.7 32203.8 36989.1 26826.7 32203.8 36989.1

eflux 16 32 64 16 32 64

1 366.7 386.4 433.3 648.4 320.1 776.6

2 1068.8 610.3 700.1 534.8 301.7 598.5

3 2366.1 727.5 749.6 1220.7 727.5 749.6

4 310.2 321.7 370.7 146.0 77.0 -

5 2326.5 636.5 731.2 2399.6 1121.7 624.5

6 2573.0 682.0 821.3 1323.3 795.5 821.3

tomcat 16 32 64 16 32 64

1 895.0 280.4 260.0 895.0 748.4 260.0

2 28.4 27.5 28.1 28.4 27.5 74.3

3 66507.5 86086.0 350675.9 26846.9 40767.0 83199.2

4 78.1 147.5 - 78.1 77.1 29.5

5 25678.1 27508.1 79570.4 9448.6 14978.6 25989.1

6 80.8 152.7 167.6 152.8 152.7 86.5

7 9461.3 18865.2 25190.9 9647.7 21984.0 57050.0

8 2051.1 5050.0 8406.6 2051.1 4046.2 8406.6

tsf 16 32 64 16 32 64

1 160.9 38.5 41.4 34.7 34.7 35.9

2 6263.6 9501.5 14293.2 6263.6 9501.5 14293.2

3 163.5 787.9 173.9 35.2 35.2 42.5

4 6234.3 9452.6 14226.7 6234.3 9452.6 14226.7

vpenta 16 32 64 16 32 64

1 4111.6 5130.1 9029.6 4111.6 5130.1 22364.9

2 350.7 184.6 - 350.7 - -

3 189.4 102.3 - 189.4 97.7 98.6

4 77075.1 90080.5 100849.2 27835.4 90080.5 100849.2

5 188.4 216.9 108.7 188.4 97.4 98.3

6 99.1 101.7 - - 185.8 -

7 90.2 89.0 - 32.7 89.0 -

8 36158.0 13557.1 26934.3 8994.2 12456.5 21512.0

wss 16 32 64 16 32 64

1 268.8 279.9 1610.6 138.6 261.1 624.0

2 288.5 317.1 168.1 143.9 143.6 -

3 75.1 74.1 74.9 75.1 141.8 -

4 22641.6 23665.1 22935.2 13274.4 13051.5 14560.2

5 326.7 672.8 775.3 325.7 319.6 756.3

6 74.8 73.8 74.6 74.8 27.6 74.6

7 302.8 155.6 166.6 142.4 27.9 75.1

Table 5.4. Energy consumption (microjoules) of the L1 data cache for each loop nest in benchmarks with configurations in Table 5.3: Shade vs. CDCP.


5.6 Discussions and Summary

In this chapter, a new technique, compiler-directed cache polymorphism (CDCP),

is proposed for optimizing data locality of array-based embedded applications while keep-

ing the energy consumption under control. In contrast to many previous techniques that

modify a given code for a fixed cache architecture, this technique is based on modifying

(reconfiguring) the cache architecture dynamically between loop nests. A set of algo-

rithms are presented in this chapter that (collectively) allow the compiler to select a

near-optimal cache configuration for each nest of a given application. The experimental

results obtained using a set of array-intensive applications reveal that this approach gen-

erates competitive performance results and consumes much less energy (when compared

to an exhaustive simulation based framework).

CDCP can be further extended in several directions. First, high-level transformation algorithms [21] can be incorporated with CDCP to convert pointer-based code into array-based code before applying the CDCP optimization. Second, cache polymorphism could be applied at granularities smaller than loop nests. Finally, combining CDCP with loop- and data-based compiler optimizations, so that both hardware and software are optimized in a coordinated manner, is a very interesting topic.


Chapter 6

Reusing Instructions for Energy Efficiency

6.1 Introduction

Advancing technology has increased the speed gap between on-chip caches and the

datapath. Even in current technology, the access latency of the level one instruction cache

can hardly be maintained within one cycle (e.g., two cycles for accessing the trace cache

in the Pentium 4 [30]). In this case, a pipelined instruction cache must be implemented in order to supply instructions every cycle. As a result, the pipeline depth of the front

end of the datapath will increase (e.g., 6 stages in Pentium 4 [30]). Sophisticated branch

predictors employed in the latest microprocessors are also very power consuming [58].

This again will increase the power contribution of the pipeline front-end.

Previous research utilized small instruction buffers, such as the decoded instruction buffer (DIB) [31][8], the decoded loop cache [5], and the decoded filter cache [71], to capture tight loop code and thereby reduce energy consumption in the instruction cache and decoder. The loop cache [47] and the filter cache [46] are more general schemes for reducing energy consumption in level-one caches.

The dynamic instruction footprint analysis performed in Chapter 4 for a set of

array-based embedded applications shows that these applications have very regular be-

havior patterns. Their execution happens in one or more phases. Within a particular

phase, the instruction footprint of execution only spans a very limited range in address


space. These dynamic characteristics of the applications can be used to design either

reconfigurable instruction caches or smaller instruction buffers to capture the phase exe-

cution for energy optimization. This thesis work explores a more aggressive approach to

utilize this dynamic application behavior to optimize the energy consumption of the instruction cache as well as of other components in the datapath front-end.

This chapter proposes a new issue queue design that is capable of instruction

reuse. The proposed issue queue has a mechanism to dynamically detect and identify

reusable instructions, particularly instructions belonging to tight loops. Once reusable

instructions are detected, the issue queue switches its operation mode to buffer these

instructions. In contrast to conventional issue logic, buffered instructions are not removed

from the issue queue after they are issued. After the buffering is finished, the issue queue

is then switched to an operation mode to reuse those buffered reusable instructions.

During this mode, issued (buffered) instructions keep occupying their entries in the issue

queue and are reused in later cycles. A special mechanism employed by the issue queue

guarantees that the reused instructions are register-renamed in the original program

order. Thus, the instructions are supplied by the issue queue itself rather than the

fetch unit. There is no need to perform instruction cache access, branch prediction, or

instruction decoding. Consequently, the front-end of the datapath pipeline, i.e., pipelines

stages before register renaming, can be gated during this instruction reusing mode. This

thesis proposes this design as a solution to effectively address the power/energy problem

in the front-end of the pipeline. Since no instruction is entering or leaving the issue

queue in this mode, the power consumption in the issue queue is also reduced due to the

reduced activities.


As embedded microprocessor designs move to superscalar architectures for high performance, such as the SandCraft MIPS64 embedded processor [66], this work targets an out-of-order multi-issue superscalar processor rather than the simple in-order single-issue processors that have been the focus of previous research on loop caches.

Different from previous research [31][47][5][46][71], the scheme proposed here eliminates

the need for an additional instruction buffer for loop caching and utilizes the existing

issue queue resources. It automatically unrolls the loops in the issue queue to reduce the

inter-loop dependences instead of buffering only one iteration of the loop in the small

DIB or loop cache. Further, there is no need for ISA modification as in [31]. Note

that the concept and the purpose of instruction reuse in this chapter are also different from those proposed in [67]. The proposed scheme speculatively reuses the decoded instructions buffered in the issue queue to avoid instruction streaming from the instruction cache, rather than speculatively reusing the result of a previous instance of an instruction for

performance as in [67].

Results using array-intensive codes show that the pipeline front-end can be gated for up to 82% of the total execution cycles, providing energy reductions of 70% in the instruction cache, 30% in the branch predictor, and 17.5% in the issue queue, respectively,

at a small performance cost. Further, the impact of compiler optimizations on this new

issue queue is investigated. The results indicate that using optimized code can further

improve the gated rate (the percentage of gated cycles in the total execution cycles) of

the pipeline front-end, and thus the overall power savings.

The detailed issue queue design is presented in Section 6.2. Section 6.3 studies the

dynamic instruction distribution of a set of array-intensive code. Section 6.4 describes


the experimental framework and provides the evaluation results. A study of the impact

of compiler optimizations on the proposed scheme is conducted in Section 6.5. Finally,

Section 6.6 summarizes this chapter.

6.2 Modified Issue Queue Design

In this section, the detailed design of the proposed new issue queue is elaborated.

This design is based on a superscalar architecture with a separate issue queue and reorder buffer (ROB), and the datapath model is similar to that of the MIPS R10000 [75] except that it uses a unified issue queue instead of separate integer and floating-point queues. The baseline datapath pipeline is given in Figure 6.1.

[Figure 6.1: (a) block diagram of the baseline superscalar datapath, including the instruction cache, decoder, register map/rename logic, issue queue, register file, integer and floating-point function units, load/store queue, data cache, and reorder buffer (ROB); the loop-detection logic, logical register list (LRL), and gating/control signals are drawn in dotted lines. (b) The pipeline stages, from Fetch and Decode through Rename, Issue, Register Read, Execute, Data-Cache Access, Writeback, and Commit.]

Fig. 6.1. (a). The datapath diagram, and (b). pipeline stages of the modeled baseline superscalar microprocessor. Parts in dotted lines are augmented for the new design.

The fetch unit fetches instructions from the instruction cache and performs branch

prediction and next PC generation. Fetched instructions are then sent to the decoder


for decoding. Decoded instructions are register-renamed and dispatched into the issue

queue. At the same time, each instruction is allocated an entry in the ROB in program

order. Instructions with all source operands ready are woken up and selected for issue to

the appropriate available function units for execution, and removed from the issue queue.

The status of the corresponding ROB entry will be updated as the instruction proceeds.

The results coming either from the function units or the data cache are written back to

the register file. Instructions in ROB are committed in order.

Reusable instructions are mainly those belonging to loop structures that are re-

peatedly executed. The proposed new issue queue is thus designed to be able to reuse

these instructions in the loop structures. The new issue queue design consists of the

following four parts: a loop structure detector, a mechanism to buffer the reusable

instructions within the issue queue, a scheduling mechanism to reuse those buffered in-

structions in their program order, and a recovery scheme from the reuse state to the

normal state. The dotted parts in Figure 6.1 shows the augmented logic for this new

design.

6.2.1 Detecting Reusable Loop Structures

To enable loop detection, additional logic is added to check for conditional branch

instructions and direct jump instructions that may form the last instruction of a loop

iteration. The loop detector performs two checks for these instructions: (1) whether it

is a backward branch/jump; (2) whether the static distance from the current instruction

to the target instruction is no larger than the issue queue size.
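A minimal sketch of this decode-stage check is shown below. The issue-queue size, the fixed instruction width, and the function name are illustrative assumptions; the target address is the predicted or directly computed one available at decode.

    #include <stdbool.h>
    #include <stdint.h>

    #define IQ_SIZE    64       /* issue queue entries (assumed size)              */
    #define INSN_BYTES  4       /* fixed-length, MIPS-like ISA assumed             */

    /* A conditional branch or direct jump is treated as a potential loop-ending
     * instruction when it jumps backwards and the static distance back to its
     * target fits in the issue queue (i.e., one loop iteration is capturable).     */
    static bool is_capturable_loop_end(uint64_t pc, uint64_t target)
    {
        if (target >= pc)                       /* not a backward branch/jump      */
            return false;
        uint64_t dist = (pc - target) / INSN_BYTES + 1;   /* static loop length    */
        return dist <= IQ_SIZE;
    }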


Loop detection can be performed either at the decode stage or at stages after the execution stage. If detection takes place at post-execution stages, the detector can be

100% sure whether it is a loop or not by comparing the computed target address and

the current instruction address. However, it has several drawbacks. First, the detection

may come too late for small tight loops. Second, deciding when to start buffering the

detected loop can be complex. Third, the ROB has to keep the address information

for each instruction in flight in order to perform this detection. On the other hand,

performing loop detection at decode stage by using predicted target address has many

advantages. First, loop buffering can be started immediately after a loop is detected.

Second, since the instruction fetch buffer is very small (e.g., 4 or 8 entries), adding ad-

dress information will not incur much hardware overhead. Further, the target address of

direct jump will be available at decode stage and can be directly used for this purpose.

With these tradeoffs in consideration, loop detection is performed at the decode stage in this design rather than at a later stage.

6.2.2 Buffering Reusable Instructions

After a loop is detected and determined to be capturable (loop size less than or

equal to the issue queue size) by the issue queue, two dedicated registers Rloophead and

Rlooptail are used to record the addresses of the starting and ending instructions of the

loop iteration. A two-bit register Riqstate is utilized to indicate the current state of the

issue queue (00-Normal, 01-Loop Buffering, 11-Code Reuse, 10-not used). A complete

state transition diagram of the issue queue is given in Figure 6.2. The issue queue state

is then changed from Normal to Loop Buffering state. In the following cycle, the issue


[Figure 6.2: state machine of the issue queue with states Normal, Loop_Buffering, and Code_Reuse; a detected capturable loop starts buffering, finishing the buffering enters the reuse state, and a buffering revoke or a misprediction (with its recovery) returns the queue to Normal.]

Fig. 6.2. State machine for the issue queue.

queue starts to buffer instructions as the second iteration begins. The new issue queue

is augmented as illustrated in Figure 6.3.

Specifically, each entry is augmented with a classification bit indicating whether

this instruction belongs to a loop being buffered, and an issue state bit indicating whether

a buffered instruction has been issued or not. The logical register numbers for each

buffered instruction are stored in the logical register list (LRL). For an issue queue size

of 64 entries, the additional hardware cost of these augmented components is around 136 bytes (= (1 bit + 1 bit + 15 bits for three logical register numbers) × 64 entries / 8) of cache-like storage.

After the issue queue enters Loop Buffering state, buffering a reusable instruction

requires several operations as the instruction is renamed and queued into the issue queue:

the classification bit is set, the issue state bit is reset to zero, the logical register numbers

of all the operands are recorded in the logical register list. With the classification bit set,

the instruction will not be removed from the issue queue even after it has been issued.

Note that a collapsing design is used for the issue queue.
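The per-entry additions and the buffering step might look roughly as follows; the field names and the packing of the logical register list are assumptions made for illustration, not the actual design.

    /* One issue-queue entry, augmented for loop buffering.                        */
    struct iq_entry {
        unsigned classification : 1;   /* 1 = belongs to the loop being buffered   */
        unsigned issued         : 1;   /* 1 = already issued, kept for reuse       */
        unsigned lrl            : 15;  /* three 5-bit logical register numbers     */
        /* ... remaining conventional issue-queue fields ...                       */
    };

    /* Performed as a reusable instruction is renamed and queued while the issue
     * queue is in the Loop Buffering state.                                       */
    static void buffer_instruction(struct iq_entry *e,
                                   unsigned rd, unsigned rs, unsigned rt)
    {
        e->classification = 1;                    /* keep this entry after issue   */
        e->issued = 0;
        e->lrl = (rd << 10) | (rs << 5) | rt;     /* record the logical registers  */
    }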


The following two subsections are going to address two important issues con-

cerning the buffering: when to terminate the instruction buffering and how to handle

procedure calls within a loop.

6.2.2.1 Buffering Strategy

There are at least two strategies for deciding when to stop buffering and promote

to Code Reuse state. The first strategy is to buffer only one iteration of the loop. This

scheme is simple to implement and enables more instructions to be reused from the issue

queue. This is because it stops instruction fetch from the instruction cache and enters

Code Reuse state much earlier (at the beginning of the third iteration). In contrast,

the second strategy tries to buffer multiple iterations of the loop according to available

free entries in the issue queue. The buffering logic uses an additional counter to record

the size of the current buffering iteration and to predict the size of the next iteration.

After buffering one iteration of the loop, a decision is made whether the remaining issue

queue can hold another iteration by comparing the counter value with the number of

free entries in the issue queue. If yes, the buffering continues. Otherwise, the state of

the issue queue is switched from Loop Buffering to Code Reuse, and the front end of the

pipeline is then gated. It automatically unrolls the loop to exploit more instruction level

parallelism, which is basically the way that the original issue queue works. Also, the issue

queue resource is used more effectively here than in the first strategy, especially for small

loops. Although the second strategy does not gate the pipeline front-end as fast as the

first strategy, it is still chosen in this work for the sake of performance. If the execution exits


the loop (check with Rloophead and Rlooptail) during the buffering state, the buffering

is revoked and the issue queue switches back to the Normal state.
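The decision made at the end of each buffered iteration can be summarized by the following sketch; the state encodings follow the two-bit Riqstate register described earlier, while the helper function itself is an illustrative assumption.

    enum iq_state { IQ_NORMAL = 0, IQ_LOOP_BUFFERING = 1, IQ_CODE_REUSE = 3 };

    /* Keep unrolling another loop iteration into the issue queue while it still
     * fits; otherwise switch to Code Reuse and gate the pipeline front-end.       */
    static enum iq_state after_one_iteration(int iteration_size, int free_entries)
    {
        if (iteration_size <= free_entries)
            return IQ_LOOP_BUFFERING;    /* room for (at least) one more copy      */
        return IQ_CODE_REUSE;            /* stop buffering, start reusing          */
    }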

6.2.2.2 Handling Procedure Calls

Note that the loop detector has no knowledge about either the existence or the

sizes of procedure calls within a detected loop. This is because the detection only uses

one iteration and happens at the end of the first iteration of the loop. If the procedure

is small, the issue queue should be managed so as to capture both the loop and the

procedure. Otherwise, it may not be possible to buffer the loop. The strategy to deal

with procedure calls works as follows. During the Loop Buffering state, if a procedure

call instruction is decoded, the buffering simply continues. If the issue queue is used up before the

loop-ending instruction is met, which means the procedure is too large to be captured

by the issue queue, the buffering is revoked and the issue queue state is changed back to

Normal. Otherwise, the counter value (the size of current iteration including procedure

calls) is checked with the number of free entries in the issue queue to make the decision

whether to promote to Code Reuse state or to continue buffering.

6.2.3 Optimizing Loop Buffering Strategy

Since the innermost loop dominates the execution of a loop nest, buffering outer loops does not make sense, and the attempt to buffer them should be avoided. Loops containing large procedure calls may not be bufferable either, and the loop detector has no information about this. Any buffering started for such loops will soon be revoked once the corresponding condition is met. This will incur


[Figure 6.3: the original issue queue (entries 0 through 63) augmented with a per-entry classification bit, an issue state bit, and a 15-bit logical register list (rs, rt, rd); the registers Rloophead and Rlooptail delimit the buffered loop, and a reuse pointer scans the buffered entries in one direction.]

Fig. 6.3. The new issue queue with augmented components supporting instruction reuse.

state thrashing between Loop Buffering and Normal. Thus, an optimization scheme for

loop detection and buffering is proposed in this section.

To avoid the state thrashing between Loop Buffering and Normal, a small non-

bufferable loop table (NBLT) holding the most recent non-bufferable loops (e.g., 8 loops)

is introduced for optimizing the buffer strategy. The NBLT is implemented in CAM and

maintained as a FIFO queue. Each entry in NBLT has a valid bit and the address of

the loop-ending instruction. If a detected loop appears in NBLT, it is identified as non-

bufferable. In this case, no buffering is attempted for this loop. Otherwise, the issue

queue is switched to Loop Buffering state. During the Loop Buffering state, if an inner

loop is detected, or the execution exits the current loop, or a procedure call within the

loop causes the issue queue to become full before the loop end is met, the current loop

is identified as a non-bufferable loop and registered with the NBLT table. Figure 6.4

shows an example of a non-bufferable loop. With this optimization, the issue queue

can eliminate most of the buffering of non-bufferable loops.
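A software model of the NBLT might look as follows; the hardware is a small CAM maintained as a FIFO, and the structure and function names here are illustrative assumptions.

    #include <stdbool.h>
    #include <stdint.h>

    #define NBLT_ENTRIES 8      /* most recent non-bufferable loops                */

    /* Small FIFO of loop-ending instruction addresses.                            */
    struct nblt {
        uint64_t end_pc[NBLT_ENTRIES];
        bool     valid[NBLT_ENTRIES];
        int      head;                  /* next FIFO slot to overwrite             */
    };

    static bool nblt_lookup(const struct nblt *t, uint64_t end_pc)
    {
        for (int i = 0; i < NBLT_ENTRIES; i++)
            if (t->valid[i] && t->end_pc[i] == end_pc)
                return true;            /* known non-bufferable: skip buffering    */
        return false;
    }

    static void nblt_insert(struct nblt *t, uint64_t end_pc)
    {
        t->end_pc[t->head] = end_pc;    /* register a newly revoked loop           */
        t->valid[t->head]  = true;
        t->head = (t->head + 1) % NBLT_ENTRIES;
    }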


[Figure 6.4: a MIPS assembly fragment of a doubly nested loop; the innermost loop (bufferable) is enclosed in an outer loop (non-bufferable). The fragment consists of slti, addiu, addu, subu, sw, beq, and bne instructions, with the two backward branches (bne) closing the inner and outer loops.]

Fig. 6.4. An example of a non-bufferable loop that is an outer loop in this code piece.

6.2.4 Reusing Instructions in the Issue Queue

After the reusable instructions of a loop have been successfully buffered, the state

of the issue queue is switched to Code Reuse. A gating signal is then sent to the fetch

unit and the instruction decoder. In the following cycles, the issue queue starts to supply

instructions itself by reusing the buffered instructions already in the issue queue. Thus,

the instruction streaming from the instruction cache is no longer needed and the pipeline

front-end is then completely gated.

During instruction scheduling, the classification bit of a ready-to-issue instruction

is checked at the issue time. If this bit is not set (i.e., its value is zero meaning not a

reusable instruction), the instruction is removed from the issue queue after being issued.

Otherwise, the instruction still occupies its entry in the issue queue after its issue. And

its corresponding issue state bit is set to indicate that this buffered instruction has been


issued. The issue queue collapses each cycle if any hole is generated due to the removal

of an issued instruction.

The issue queue utilizes a reuse pointer to scan the buffered instructions in one direction for instructions to be reused in the next cycle. The pointer is initialized to point

to the first buffered instruction. In each cycle, the issue state bits of the first n (equal

to the issue width) instructions starting from the entry pointed by the reuse pointer are

checked. If the first m (m ≤ n) bits are set, which means these m instructions have

been issued and can be reused, the logical register numbers of these instructions are

fetched from the logical register list and sent to renaming logic. The reuse pointer then

advances by m and scans instructions for the next cycle. Renamed instructions update

their corresponding entries in the issue queue. Note that only register information and

ROB pointer of each instruction are updated in this case. Register renaming is needed

anyway in both this scheme and conventional issue queues and hence is not an overhead.

After the last buffered instruction is reused, the reused pointer is automatically reset to

the position of the first buffered instruction. This process repeats until a branch mispre-

diction is detected due to either the execution exiting the loop or the execution taking

a different path within the loop. The state of the issue queue is then switched back the

Normal state.

Note that the dynamic branch prediction is avoided during the Code Reuse state.

Branch instructions are statically predicted using the previous dynamic prediction out-

come from Loop Buffering state. The static prediction scheme works very well for loops

since the branches within loops are normally highly-biased for one direction. In this


scheme, the static prediction is still verified after the branch instruction completes exe-

cution. The issue queue exits Code Reuse state if the static prediction is detected to be

incorrect during this verification.

6.2.5 Restoring Normal State

When an ongoing buffering is revoked, if an instruction is buffered (classification

bit = 1) and issued (issue state = 1), it is immediately removed from the issue queue.

All classification bits are then cleared. The issue queue state is switched back to Nor-

mal. If a misprediction is detected at the writeback stage and the issue queue is in the

Loop Buffering state, a conventional recovery is carried out by removing instructions

newer than this branch from the issue queue, ROB and restoring registers, followed by

the recovery process of revoking the current buffering state. If a misprediction is detected

in the Code Reuse state, this may be due to an early branch outside the current loop,

or a branch within the loop taking a different path, or the execution exiting the current

loop. In this case, a conventional branch misprediction recovery is initiated followed by

the revoking process. The gating signal is also reset when restoring the Normal state. It

should be noted that the new issue queue has no impact on exception handling.
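Restoring the Normal state after a revoked buffering can be modeled as below; the bit arrays stand in for the per-entry flags sketched earlier, the removal of already-issued buffered entries is expressed as a separate mask, and all names are illustrative.

    #include <stdbool.h>

    /* Buffered entries that have already issued are dropped immediately, all
     * classification bits are cleared, the queue state returns to Normal (00),
     * and the front-end gating signal is implicitly reset.                        */
    static void revoke_buffering(bool *classification, const bool *issued,
                                 bool *remove_entry, int n, unsigned *iq_state)
    {
        for (int i = 0; i < n; i++) {
            remove_entry[i] = classification[i] && issued[i];  /* drop right away  */
            classification[i] = false;           /* revert to conventional entry   */
        }
        *iq_state = 0;                            /* 00 = Normal                    */
    }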

6.3 Distribution of Dynamic Loop Code

Although the phase pattern of the execution footprint has been extracted and analyzed in Chapter 4, and motivates the proposed new instruction supply mechanism, further information still needs to be extracted for each phase in order to guide the actual implementation of this new issue queue, such as choosing the size of the issue queue. This section


studies the dynamic instruction distribution with respect to the size of loop code that

an instruction resides in.

Three types of loop structure are profiled for this study: loops (without any

constraint), innermost loops, and innermost loops without any procedure call in them. Recall from the discussion in previous sections that outer loops are not considered bufferable loops, and innermost loops without procedure calls are the most likely bufferable loops.

If instructions from the last type of loops dominate the overall dynamic instructions, the

new issue queue can maximize the opportunity for loop buffering and instruction reusing.

Figure 6.5 gives the dynamic instruction distribution for a set of array-based embedded

applications. This figure shows that the majority of dynamic instructions, more than 90%, come from loop code. The loop code size varies from fewer than 16 instructions to between 128 and 256 instructions, so issue queues of different sizes are required to capture the loop code. The figure also clearly shows that dynamic instructions from innermost loops without procedure calls are the dominant part of the execution, and that instructions located between the innermost loop and outer loops account for only a negligible portion. This confirms that the proposed new issue queue will be very effective in capturing a significant number of reusable instructions.

6.4 Experiments

The proposed issue queue was modeled upon SimpleScalar 3.0 [12] and the power

model is derived from Wattch [11]. The baseline configuration for the simulated processor

is given in Table 3.1. A set of array-intensive applications listed in Table 3.2 are used to

evaluate the new issue queue.


[Figure 6.5: eight panels, (a) adi through (h) wss, each plotting the percentage of dynamic instructions (0-100%) against loop code size (16 to 2048 instructions) for three curves: all loops, innermost loops, and innermost loops without procedure calls.]

Fig. 6.5. Dynamic instruction distribution w.r.t. loop sizes.


[Figure 6.6: bar chart of the pipeline front-end gated rate (percentage of total execution cycles) for each benchmark and the average, at issue queue sizes of 32, 64, 128, and 256 entries.]

Fig. 6.6. Percentages of the total execution cycles that the pipeline front-end has been gated with different issue queue sizes: 32, 64, 128, 256 entries.

It is found that two factors, the loop structure and the issue queue size, affect

the effectiveness of the proposed issue queue design. A large loop structure cannot be

completely buffered in a small issue queue. This section conducts a set of experiments

to evaluate the impact of issue queue size by varying it from 32 to 256 entries, suggested

by the analysis results presented in previous section. In these experiments, the ROB size

is set equal to the issue queue size, and the load/store queue size is half that of the issue

queue. An eight-entry NBLT is used to optimize the loop detection, which helps reduce

the buffering revoke rate from around 40% to 1% below.

Once the issue queue enters Code Reuse state, the pipeline front-end is gated.

Figure 6.6 shows the percentages of the total execution cycles that the front-end of the

pipeline has been gated due to the instruction reuse for issue queues with different sizes.

Benchmarks aps, tsf , and wss achieve very high gated percentage even with small issue


queues due to their small loop structures. Some benchmarks work well only with large

issue queues, such as adi, btrix, eflux, tomcat, and vpenta. An interesting observation

from this figure is that increasing issue queue size does not always improve the ability

to perform pipeline gating (e.g., see tsf and wss). The main reason for this case is

that a larger issue queue will unroll and buffer more iterations of the loop, delaying the

instruction reuse and pipeline gating. On the average, the ability to gate the pipeline

front-end increases from 42% to 82% as the issue queue size increases.

Gating the pipeline front-end leads to an activity reduction in the instruction cache,

branch predictor, and instruction decoder. As shown in Figure 6.7 (a), on the aver-

age, the instruction cache access is reduced by 42% to 82%, branch prediction or update

is reduced by 50% to 76% as shown in Figure 6.7 (c), and instruction decoding is re-

duced by 46% to 84% as illustrated in Figure 6.7 (e), as the issue queue size is increased

from 32 to 256 entries. Figure 6.7 (b)(d)(f) show the corresponding energy reduction

in the instruction cache ranging from 35% to 70%, branch predictor from 19% to 30%,

and issue queue from 12% to 17.5%, as the issue queue size increases from 32 entries to

256 entries. The energy reduction in the issue queue is due to the partial update (only

register information and ROB pointer are updated) during the instruction reuse state in

contrast to removing and inserting the instructions in a conventional issue queue.

The energy reduction of the entire processor for each benchmark at different issue

queue sizes is shown in Figure 6.8. The overall energy saving is up to 20.5%. For

benchmarks adi and btrix, the overall energy increases for some configurations. On

the average, the energy reduction is improved from 6.7% to 7.8% as the issue queue size

increases. The performance impact of this new issue queue is illustrated in Figure 6.9.


[Figure 6.7: six panels showing, for each benchmark and the average at issue queue sizes of 32, 64, 128, and 256 entries: (a) the access reduction in the instruction cache, (b) the energy reduction in the instruction cache, (c) the access reduction in the branch predictor, (d) the energy reduction in the branch predictor, (e) the reduction in instruction decoding, and (f) the energy reduction in the issue queue.]

Fig. 6.7. Access reduction and energy reduction in instruction cache, branch predictor, instruction decoder, and issue queue.


[Figure 6.8: bar chart of the overall energy savings for each benchmark and the average, at issue queue sizes of 32, 64, 128, and 256 entries.]

Fig. 6.8. The overall power reduction compared to a baseline microprocessor using the conventional issue queue at different issue queue sizes.

The average performance loss ranges from 0.2% (32 entry issue queue) to 4% (256 entry

issue queue). Notice that the performance of the new issue queue is compared to the

conventional issue queue with the same number of entries. This performance degradation

is mainly due to the issue queue not being fully utilized (only an integer number of loop iterations can be buffered). In benchmark btrix, the execution is dominated by a loop of 90 instructions, which results in a low utilization of a 128-entry or 256-entry issue queue in the Code Reuse state and, consequently, a noticeable performance loss (around 12%), as seen in Figure 6.9. From Figure 6.8 and Figure 6.9, one can see that the 32-entry issue queue (IQ-32) takes the best advantage of this instruction reuse in terms of performance and overall energy saving.


[Figure 6.9: bar chart of the performance (IPC) degradation for each benchmark and the average, at issue queue sizes of 32, 64, 128, and 256 entries.]

Fig. 6.9. Performance impact of reusing instructions at different issue queue sizes.

6.5 Impact of Compiler Optimizations

Notice that some benchmarks such as adi, btrix, eflux, tomcat, and vpenta have

large loop structures, and these loops can hardly be captured with a small issue queue

(e.g., one with 32 or 64 entries). Compiler optimizations, especially loop transformations,

can play an important role in optimizing these loop structures. This section specifically

focuses on loop distribution [42] to reduce the size of the loop body.
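As an illustration of the transformation (with hypothetical arrays, not taken from the benchmarks), loop distribution splits one loop whose statements are independent into two loops with smaller bodies, each of which is more likely to fit within a small issue queue:

    #define N 1024
    int a[N], b[N], c[N], d[N], e[N], f[N];

    void before(void)
    {
        for (int i = 0; i < N; i++) {
            a[i] = b[i] + c[i];       /* statement S1                              */
            d[i] = e[i] * f[i];       /* statement S2, independent of S1           */
        }
    }

    void after(void)                  /* after loop distribution                   */
    {
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];       /* loop 1 carries only S1                    */
        for (int i = 0; i < N; i++)
            d[i] = e[i] * f[i];       /* loop 2 carries only S2                    */
    }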

After applying loop distribution, as shown in Figure 6.10, the new issue queue starts to schedule reusable instructions for benchmarks adi and btrix, and buffers more

reusable loop code in benchmarks eflux, tomcat, and vpenta. However, loop distribution

has a minor effect on benchmarks aps, tsf, and wss since their reuse rate is already very

high. On the average, the reduction of instruction cache accesses improves from 51%


[Figure 6.10: bar chart of the reduction in instruction cache accesses for each benchmark and the average, comparing the original and the loop-distributed (optimized) code.]

Fig. 6.10. Impact of compiler optimizations on instruction cache accesses.

to 88% after this compiler optimization, which results in correspondingly more energy reduction in the instruction cache.

Figure 6.11 shows the overall energy comparison between the optimized code (with loop distribution performed) and the non-optimized code, both simulated with the baseline configuration (64-entry issue queue). The average energy reduction of the entire processor is increased from 6.7% to 11.1% by using the optimized code, at the cost of a slightly increased average performance loss, from 1% to 2.4%, as shown in Figure 6.12. This improvement in power reduction results from the increased percentage of gated cycles (from 48% to 86% on average; not shown for brevity) when executing the optimized code.


[Figure] Fig. 6.11. Impact of compiler optimizations on overall energy saving. Y-axis: overall energy savings (-2% to 18%); bars: Original vs. Optimized; benchmarks: adi, aps, btrix, eflux, tomcat, tsf, vpenta, wss, avg.

[Figure] Fig. 6.12. Impact of compiler optimizations on performance degradation. Y-axis: performance (IPC) degradation (0 to 8%); bars: Original vs. Optimized; benchmarks: adi, aps, btrix, eflux, tomcat, tsf, vpenta, wss, avg.


6.6 Discussions and Summary

In many recent embedded microprocessors, code compression is used in the instruction cache to minimize the required memory size due to cost and space constraints [48]. Such a compression scheme can also be used with this new instruction supply mechanism without any negative impact, since the instructions under reuse are decoded (and thus also decompressed) and buffered in the instruction issue queue. On the other hand, this instruction reusing scheme can significantly improve the energy and performance behavior of code compression, in that instruction reusing also avoids code decompression since no (compressed) instruction is fetched from the instruction cache during the code reusing state.

This redesigned processor datapath also implies opportunities to optimize the energy consumption in the clock distribution network and in the bus between the instruction cache and the datapath, since the datapath front-end is gated during instruction reusing. The instruction cache can also be turned off or transitioned to drowsy mode for leakage energy reduction when the issue queue is reusing buffered instructions.

To summarize, a new issue queue design is proposed in this chapter that is capable of buffering dynamically detected reusable instructions and reusing these buffered instructions in the issue queue. The front-end of the pipeline is then completely gated when the issue queue enters the instruction reusing state, thus invoking no activity in the instruction cache, branch predictor, and instruction decoder. Consequently, this leads to a significant energy reduction in these components, and a considerable overall energy reduction. The experimental evaluation also shows that compiler optimizations


(loop transformations) can further gear the code towards a given issue queue size and

improve these energy savings.


Chapter 7

Managing Instruction Cache Leakage

7.1 Introduction

Static energy consumption due to leakage current is an important concern in

future technologies [13]. As the threshold voltage continues to scale and the number

of transistors on the chip continues to increase, managing leakage current will become

more and more important. As on-chip caches are the major portion of the processor’s

transistor budget, they account for a significant share of the leakage power consumption.

In fact, leakage is projected to account for 70% of the cache power budget in 70nm

technology [45].

The leakage current is a function of the supply voltage and the threshold voltage. It can be controlled either by reducing the supply voltage or by increasing the threshold voltage. However, this has an impact on the cache access times. Thus, a common approach is to use these mechanisms dynamically when a cache line is not currently in use. Existing techniques that control cache leakage utilize three main styles of circuit primitives for reducing cache leakage energy, namely, Gated-Vdd [60], a multiplexed supply voltage per cache line [20], and dynamic-Vt SRAM [43]. The approach in [29] targets bitline leakage and hence does not utilize any of these three primitives.

This chapter focuses on reducing the leakage energy in the instruction cache. A good leakage management scheme needs to balance appropriately the energy penalty of


leakage incurred in keeping a cache line turned on after its current use with the overhead associated with the transition energy (for turning on a cache line) and the performance loss that will be incurred if and when that cache line is accessed again. In order to strike this balance, it is important that the management approach tracks both the spatial and temporal locality of instruction cache accesses. Existing leakage control approaches track and exploit one or the other of these forms of locality. For example, the drowsy cache scheme [20] (designed originally for data caches) periodically transitions all cache lines to a drowsy mode, assuming that accesses to cache lines are confined to a specific time period. Hence, it tends to focus mainly on temporal locality. Due to the use of fixed periods, it also does not adapt well to changes in temporal locality. This can be important to capture, as straight-line code has very little temporal locality, while instructions in loops have significant temporal locality. Further, this scheme does not support the sequential nature of instruction accesses well and will incur wakeup latencies when a new instruction is accessed in sequential code.

The approach proposed for drowsy instruction caches in [45] focuses on spatial locality. Here, turn-off1 is applied when execution shifts from a specified spatial region. Specifically, this scheme turns off a bank of cache lines when execution shifts to a new bank. This scheme is well suited for capturing the sequential and repetitive behavior of program execution confined to a small instruction address space. However, this scheme is agnostic as to the extent of spatial locality in that it turns cache lines on only at the fixed granularity of a bank. If execution is in a small, long-running loop, the extent of spatial locality

1 Turn-off is used here to refer to a transition to the drowsy state, and turn-on to refer to waking up to the normal (active) state.


is small and execution may never access most of the cache lines in a bank. A finer granularity of leakage control (at the cache line instead of the bank level) would provide more adaptability to the different extents of spatial locality. Further, this scheme frequently turns banks on and off when the instructions accessed in a given phase are not tightly clustered together in one portion of the address space. Previous research shows that program hotspots can be scattered all over the address space [54]. A common example would be a method call made within a loop, with the method being located in a separate bank.

The leakage management scheme proposed in this chapter focuses on being able to exploit both forms of locality and exploits two main characteristics of instruction access patterns: program execution is mainly confined to program hotspots, and instructions exhibit a sequential access pattern. It is observed that a significant part of the execution is spent in specific program hotspots (the identification of hotspots is explained in Section 7.3). This percentage is found to be 82% on the average for the SPEC2000 benchmark suite. In order to exploit this behavior, this work proposes a HotSpot based Leakage Management (HSLM) approach that is used in two different ways. First, it is used for detecting cache lines containing program hotspots and protecting them from inadvertent turn-off. HSLM is particularly useful in reducing the performance and energy penalties associated with unnecessarily turning off actively used cache lines. It can provide some adaptability to simple periodic or spatial schemes that are program-behavior agnostic. Second, HSLM can be used to detect a shift in the program hotspot and to turn off cache lines closer to their last use instead of waiting for a period to expire. This scheme is specifically oriented to detect new loop-based hotspots. Next, a Just-in-Time Activation (JITA) scheme is


presented that exploits the sequential access pattern of instruction caches by predictively activating the next cache line when the current cache line is accessed.

The experiments show that the combination of the HSLM and JITA strategies can make periodic schemes quite effective in terms of both performance and energy reduction. This work further proposes a scheme that combines both periodic and spatial-based turn-off (to capture both temporal and spatial locality) in an application-sensitive fashion by using the hotspot information. This scheme, when combined with the JITA scheme, is shown to provide the best energy savings as compared to existing approaches.

Specifically, the evaluation of this scheme using SPEC2000 benchmarks shows that it

provides 22% and 49% more leakage energy savings in the instruction cache (while con-

sidering overheads incurred in the rest of the processor as well) as compared to pure

periodic and spatial schemes. Further, it also provides 29% more leakage energy savings

in the instruction cache as compared to a recently proposed instruction cache leakage

scheme based on compiler analysis [78].

Section 7.2 provides a more detailed view of the factors influencing leakage reduction and how the new approach proposed in this work relates to existing schemes. Section 7.3 details the implementation of the HSLM and JITA strategies. Section 7.4 explores different leakage management approaches that combine HSLM and JITA. An evaluation of the different schemes is performed in Section 7.5. Finally, Section 7.6 summarizes this chapter.


7.2 Existing Approaches: Where Do They Stumble?

Previous approaches that target reducing cache leakage energy consumption can

be broadly categorized into three groups: (i) those that base their leakage management

decisions on some form of performance feedback (e.g., cache miss rate) [59], (ii) those that

manage cache leakage in an application insensitive manner (e.g., periodically turning off

cache lines) [20, 41, 45], and (iii) those that use feedback from the program behavior [41,

80, 78].

The approach in category (i) is inherently coarse-grained in managing leakage as it turns off large portions of the cache depending on a performance feedback that does not specifically capture cache line usage patterns. For example, the approach in (i) may indicate that 25% of the cache can be turned off because of a very good hit rate, but it does not provide guidance on which 75% of the cache lines are going to be used in the near future.

The major drawback of the approaches in category (ii) is that they turn off cache

lines independent of the instruction access pattern. An example of such a scheme is the

periodic cache line turn-off proposed in [20]. The success of this strategy depends on how

well the selected period reflects the rate at which the instruction working set changes.

Specifically, the optimum period may change not only across applications but also within

the different phases of the application itself. In such cases, one can either keep cache lines

in the active state longer than necessary, or turn off cache lines that hold the current

instruction working set, thereby impacting performance and wasting energy. Note that

trying to address the first problem by decreasing the period will exacerbate the second


problem. On the plus side, this approach is simple and has very little implementation

overhead.

[Figure] Fig. 7.1. (a) A simple loop with two portions, (A) and (B); (b) bank mapping for the loop given in (a), with the two portions mapped to Bank I and Bank II.

Another example of a fixed scheme in category (ii) is the technique proposed in [45]. This technique adopts a bank-based strategy, where, when execution moves from one bank to another, the hardware turns off the former and turns on the latter. To illustrate some of the potential drawbacks of this bank-based strategy, a simple loop structure is shown in Figure 7.1(a). Let us assume that this loop structure is mapped onto a two-bank cache architecture as shown in Figure 7.1(b). The first problem is that while the execution is in part (A) of the loop, the entire bank I is kept in the active state. Consequently, all cache lines in this bank, save for the ones that hold part (A) of the loop, waste leakage energy. While this energy wastage can be reduced with very small bank sizes, increasing the number of banks beyond a point incurs latency penalties


(due to decoding overheads); 4KB banks are typical. The second problem with this approach becomes clear when one considers the execution of the entire loop in Figure 7.1(a). Assuming that this loop contains no other loop, one can expect frequent transitions from part (A) to part (B) and vice versa. Note that this leads to frequent bank turn-offs/ons, thereby increasing the energy overhead of the execution. By using compiler directives it might be possible to align some loops with bank boundaries (this also assumes that the compiler knows the bank structure). However, in a typical large application, it is likely that there are some loops that are divided across bank boundaries. Note that reducing the bank size (to eliminate the first problem) aggravates the second problem. Note also that, while this simple example uses a loop to illustrate the idea, frequent bank transitions can also occur due to procedure calls (which might be quite numerous in applications written in languages such as Java). A typical scenario would be a small procedure located in one bank that is frequently invoked by procedures residing in different banks.

Another technique in category (ii) is the cache decay-based approach (its adaptive variant falls in category (iii)) proposed by Kaxiras et al. [41]. In this technique, a small counter is attached to each cache line to track its access frequency. If a cache line is not accessed for a certain number of cycles, it is placed into the leakage saving mode. While this technique tries to capture the usage frequency of cache lines, it does not directly predict the cache line access pattern. Consequently, a cache line whose counter saturates is turned off even if it is going to be accessed in the next cycle. Since it is also a periodic approach, choosing a suitable decay interval is crucial if it is to be successful. In fact, the problems associated with selecting a good decay interval are similar to those


associated with selecting a suitable turn-off period in [20]. Consequently, this scheme

can also keep a cache line in the active state until the next decay interval arrives even if

the cache line is not going to be used in the near future. Finally, since each cache line is

tracked individually, this scheme has more overhead.

The approaches in category (iii) attempt to manage cache lines in an application-sensitive manner. The adaptive version of the cache-decay scheme [41] tailors the decay interval for the cache lines based on cache line access patterns. It starts out with the smallest decay interval for each cache line to aggressively turn off cache lines, and increases the decay interval when it learns that cache lines were turned off prematurely. This scheme learns about premature turn-offs by leaving the tags on at all times. The approach in [80] also uses tag information to adapt leakage management.

In [78], an optimizing compiler is used to analyze the program to insert explicit

cache line turn-off instructions. This scheme demands sophisticated program analysis

and modification support, and needs modifications in the ISA to implement cache line

turn-on/off instructions. In addition, this approach is only applicable when the source

code of the application being optimized is available. In [78], instructions are inserted

only at the end of loop constructs and, hence, this technique does not work well if a lot

of time is spent within the same loop. In these cases, periodic schemes may be able to

transition portions of the loop that are already executed into a drowsy mode. Further,

when only select portions of a loop are used, the entire loop is kept in an active state.

Finally, inserting the turn-off instructions after a fast executing loop placed inside an

outer loop can cause performance and energy problems due to premature turn-offs.


Another important limitation of existing leakage control schemes is that most of

the techniques only focus on a turn-off mechanism and activate turned-off cache lines

(or banks) only when accessed. Due to the sequential nature of instruction cache access

patterns, this is a significant shortcoming of the existing techniques. A notable exception

to this is the predictive bank turn-on scheme employed in [45]. Also, almost all exist-

ing schemes focus either on temporal locality (using counters) or spatial locality (using

address space).

7.3 Using Hotspots and Sequentiality in Managing Leakage

Having analyzed the shortcomings of directly applying existing approaches to instruction cache leakage management, the goal of this work is to support a turn-off scheme that is sensitive to program behavior changes and that captures both temporal and spatial locality changes. Further, a predictive turn-on mechanism is needed to support the sequentiality of instruction cache accesses. However, the granularity of predictive turn-on should be kept as small as possible so that cache lines are turned on if and only if they are needed.

In this work, two mechanisms are proposed to support leakage management of instruction caches. First, it proposes a HotSpot based Leakage Management (HSLM) scheme that tracks program behavior. Second, it proposes a Just-in-Time Activation (JITA) scheme for the next cache line, exploiting the sequentiality of code accesses.


7.3.1 HSLM: HotSpot Based Leakage Management

Previous research shows that program execution typically occurs in phases. Each phase can be identified by a set of instructions that exhibit high temporal locality during the course of execution of the phase. Two important observations made by previous research are that phases can share instructions and that the instructions in a given phase do not need to be tightly clustered together in one portion of the address space. In fact, they can be scattered all over the address space, as mentioned in [54]. Typically, when execution enters a new phase, it spends a certain number of cycles in it. When this number is high, one can refer to that phase as a hotspot. Since branch behavior is an important factor in shaping the instruction access behavior, the hotspot is tracked using a branch predictor in this work. While branch predictors have been used for optimizing programs in the past (e.g., see [54]), to our knowledge, this is the first study that employs branch predictors for reducing leakage energy consumption.

Detecting program hotspots brings two advantages. First, it identifies which cache lines are going to be the most active ones, so they can be prevented from being turned off. Second, cache lines can be turned off if they hold instructions that do not belong to a newly detected hotspot.

7.3.1.1 Protecting Program Hotspots

The proposed leakage management approach builds on the drowsy cache technique [20] that periodically transitions all cache lines to drowsy mode by issuing a global turn-off signal, which sets register Q of the leakage control circuitry in Figure 7.2. A global (modulo-N) counter is used to control the periodic turn-off. In order to protect the cache


lines containing the program hotspots from inadvertent turn-off, each drowsy cache circuit is augmented with a local voltage control mask bit (VCM). If this mask bit is set, the corresponding cache line masks the influence of the global turn-off signal, preventing turn-off. In order to identify execution within hotspots, the information from the branch target buffer (BTB) is augmented and utilized, as explained in detail in the next paragraph. Once the program is identified to be within a program hotspot (or not), the global mask bit (GM) (Figure 7.3) is set (reset). When this global mask bit is set, the voltage control mask bit of every accessed cache line is set to one to indicate that these cache lines form the program hotspot. In a set-associative cache, the voltage control mask bit is set based on the tag match results of the cache access, and this is performed only for the way that actually services the request. The voltage control mask bits are reset on cache line replacements.

[Figure] Fig. 7.2. Leakage control circuitry supporting Just-in-Time Activation (JITA). Shown per cache line: the voltage control latch (Q/!Q, set by the global turn-off signal and masked by the VCM bit), the 1V and 0.3V power-line supplies for active and drowsy modes, the word-line gate driven by the row decoder, and the preactivate signal.


The hotspot detection mechanism tracks the branch behavior information using the BTB. The BTB entries are augmented to collect the execution frequencies of basic blocks. Compared to the conventional BTB entry, the augmented structure includes three additional fields: a valid bit (vbit) for the target address, an access counter for the target basic block (tgt_cnt), and an access counter for the fall-through basic block (fth_cnt). This new structure is shown in Figure 7.3. The valid bit indicates whether the current value of the target address is valid or not. The valid bit is needed as a new entry can be added to the augmented BTB by both taken and non-taken branches. If the new entry is introduced when the branch is taken (not taken), the valid bit is set to one (zero). The access counter for the target (fall-through) basic block records how many times the branch is predicted as taken (not-taken). These counters are accessed and updated during each branch prediction according to the outcome of the prediction. The value of the target/fall-through counter reflects how frequently the target/fall-through basic block is fetched within a given sampling window and is compared with a predefined threshold Tacc to determine the hotness of the corresponding basic block. Each counter in the BTB has log(Tacc) + 1 bits. The counters are initially set to zero when a new BTB entry is created. During a branch prediction, if the BTB hits, the corresponding counter is read out according to the outcome of the prediction and then incremented. Next, the most significant bit of the corresponding counter is checked to determine the hotness of the basic block starting at the target/fall-through address. If this bit is set, it means that the next (target or fall-through) basic block has exceeded the threshold of Tacc accesses and subsequent fetches are part of a program hotspot. The global mask bit is set to capture this detection of a program hotspot and to set the


[Figure] Fig. 7.3. Microarchitecture for the Hotspot based Leakage Management (HSLM) scheme. The augmented branch target buffer (indexed by the PC) holds vbit, target_addr, tgt_cnt, and fth_cnt per entry; the BTB-hit and branch-taken signals select the counter whose most significant bit drives the global mask bit (GM), which, together with the way-select signal and the global reset, controls the per-line voltage control mask (VCM) bits of the instruction cache leakage control circuitry. Note that the outputs of the AND gates go to the set inputs of the mask latches.


voltage control mask bit of all accessed cache lines as long as the global masking signal is set. The mask bit is set based on the tag match results of the cache access, and only for the way that actually services the request in a set-associative cache. Also, the mask bits are reset on cache line replacement. The global mask bit is reset when the most significant bit of the access counter for a subsequent BTB lookup is not set or when a BTB miss happens.
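As an informal illustration, the following C sketch shows how this hotness check on a BTB hit could look in a simulator; the structure layout, counter width, and function names are illustrative assumptions, not the exact implementation used in this work.

#include <stdint.h>
#include <stdbool.h>

#define TACC     16                       /* hotness threshold (Table 7.3) */
#define CNT_BITS 5                        /* log(Tacc) + 1 bits per counter */
#define CNT_MSB  (1u << (CNT_BITS - 1))   /* set once TACC accesses are reached */
#define CNT_MAX  (CNT_MSB | (CNT_MSB - 1))

/* Augmented BTB entry (illustrative). */
struct btb_entry {
    uint32_t tag;
    uint32_t target_addr;
    bool     vbit;       /* target address valid */
    uint8_t  tgt_cnt;    /* predicted-taken (target block) accesses */
    uint8_t  fth_cnt;    /* predicted-not-taken (fall-through block) accesses */
};

bool global_mask;        /* GM: set while fetching inside a hotspot */

/* Called on every branch prediction that hits in the BTB. */
void btb_hit_update(struct btb_entry *e, bool predicted_taken)
{
    uint8_t *cnt = predicted_taken ? &e->tgt_cnt : &e->fth_cnt;

    if (*cnt < CNT_MAX)          /* saturate rather than wrap around */
        (*cnt)++;

    /* MSB set => the next basic block has exceeded Tacc accesses, so
     * subsequent fetches are treated as part of a program hotspot;
     * MSB clear => the global mask bit is reset. */
    global_mask = (*cnt & CNT_MSB) != 0;
}

/* A BTB miss also resets the global mask bit. */
void btb_miss(void) { global_mask = false; }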

When a sampling window expires (determined by the global counter reaching zero), several initialization operations take place. First, a global turn-off signal is issued to turn off all cache lines except those with their voltage control mask bit set (the mask bits of cache lines in hotspots are set to disable voltage scaling). Second, a global reset signal resets all voltage control mask bits. This is performed to track variations in program hotness. Third, all the access counters in the BTB are shifted right by one bit to reduce their access counts by half. This is performed to reduce the weight of accesses performed in an earlier period when determining hotness. Subsequently, a new sampling window begins and the operations repeat.
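A minimal sketch of these end-of-window operations is given below; the cache and BTB sizes, structure names, and the cycle-driven interface are illustrative assumptions.

#include <stdint.h>
#include <stdbool.h>

#define WINDOW_SIZE 2048   /* sampling window in cycles (Table 7.3) */
#define NUM_LINES   512    /* illustrative number of I-cache lines */
#define BTB_ENTRIES 512    /* illustrative BTB size */

struct icache_line {
    bool drowsy;           /* true => low-voltage (drowsy) mode */
    bool vcm;              /* voltage control mask bit */
};

static struct icache_line icache[NUM_LINES];
static uint8_t tgt_cnt[BTB_ENTRIES], fth_cnt[BTB_ENTRIES];
static uint32_t global_counter;   /* modulo-WINDOW_SIZE cycle counter */

/* Called once per cycle; performs the end-of-window bookkeeping. */
void on_cycle(void)
{
    if (++global_counter < WINDOW_SIZE)
        return;
    global_counter = 0;

    for (int i = 0; i < NUM_LINES; i++) {
        /* 1. Global turn-off: drowse every line whose VCM bit is not set. */
        if (!icache[i].vcm)
            icache[i].drowsy = true;
        /* 2. Global reset: clear all VCM bits to re-learn the hotspot. */
        icache[i].vcm = false;
    }

    /* 3. Halve the BTB access counters to age out older accesses. */
    for (int i = 0; i < BTB_ENTRIES; i++) {
        tgt_cnt[i] >>= 1;
        fth_cnt[i] >>= 1;
    }
}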

7.3.1.2 Detecting New Program Hotspots

One of the drawbacks of periodic approaches is that cache lines can be turned off only when a preset period expires. It would be more beneficial if older cache lines could be turned off immediately when a shift in the hotspot is detected. The approach proposed in this work is specifically targeted at identifying a shift of the program hotspot to a new loop. Specifically, in this dynamic turn-off scheme, if the current target counter in the BTB entry of a predicted taken branch indicates that the target basic block is in a hotspot (the most significant bit of the counter is "1") and if the target address is smaller than the current program counter value, it is assumed that the program is in a hotspot executing a loop. At this point, a global turn-off signal is set and all cache lines except those in hotspots are switched to drowsy mode. In the schemes evaluated in this work, a periodic turn-off is always used in addition to the dynamic loop-based turn-off to account for cases where the execution remains within the same loop for a long time or where there are few loop constructs.
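The following sketch illustrates this dynamic loop-based trigger under the same conventions as the previous sketches (the counter MSB marks a hot basic block); global_turnoff() is a hypothetical helper that drowses all cache lines whose VCM bit is not set, exactly as in the periodic case.

#include <stdint.h>
#include <stdbool.h>

#define CNT_MSB 0x10u    /* MSB of a 5-bit counter; set once Tacc = 16 is reached */

/* Hypothetical helper: switch all cache lines without a VCM mask bit
 * to drowsy mode (the same operation as the periodic global turn-off). */
void global_turnoff(void);

/* Called for each predicted-taken branch that hits in the BTB. */
void check_loop_hotspot(uint32_t pc, uint32_t target_addr, uint8_t tgt_cnt)
{
    bool target_is_hot   = (tgt_cnt & CNT_MSB) != 0;
    bool backward_branch = target_addr < pc;

    /* A hot backward branch suggests execution has settled into a new
     * loop-based hotspot; turn off everything outside the hotspot now
     * instead of waiting for the sampling window to expire. */
    if (target_is_hot && backward_branch)
        global_turnoff();
}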

7.3.2 JITA: Just-In-Time Activation

In many applications, sequentiality is the norm in code execution. For example, optimizations such as loop unrolling and superblock and hyperblock formation increase the sequentiality of the code [14, 17, 50]. The sequential nature of code can be used to predict the next cache line that will be accessed and to mask the penalty for transitioning a cache line from drowsy to active mode just in time for its access. Specifically, this work proposes a scheme, JITA, that preactivates the next cache line.

The leakage control circuitry that also supports preactivation for a direct-mapped cache is illustrated in Figure 7.2. When the current cache line is being accessed, the voltage control bit for the next cache line (next index) is reset, thereby transitioning it to the active state. Thus, when the next fetch cycle occurs and there is code sequentiality, the next required cache line is already in the active mode and ready for access. However, this preactivation scheme is not successful when a taken branch occurs or when the next address falls in a different memory bank. While the same circuit can be employed for a set-associative cache, it would lead to activating the cache lines in all the ways of the


same set (Approach 1). In order to avoid this, way prediction information associated with the next cache line is used to activate only the cache line of a selected way. In this scheme (Approach 2), each cache set has n bits, one for each of the n ways, where the set bit corresponds to the way that provided the data when the cache set was accessed previously. This scheme is found to work well as programs spend a major part of their time in program hotspots.
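A rough C sketch of the JITA preactivation decision for a set-associative cache under Approach 2 follows; the cache geometry is illustrative, and the per-set one-hot way bits described above are simplified here to a stored way index.

#include <stdint.h>
#include <stdbool.h>

#define NUM_SETS  256   /* illustrative cache geometry */
#define NUM_WAYS  2

struct cache_line {
    bool drowsy;        /* true => low-voltage (drowsy) mode */
};

static struct cache_line cache[NUM_SETS][NUM_WAYS];
/* One predicted way per set: the way that serviced the previous access. */
static uint8_t predicted_way[NUM_SETS];

/* Called on an access to set 'set' that is serviced by way 'hit_way'. */
void jita_preactivate(uint32_t set, uint32_t hit_way)
{
    predicted_way[set] = (uint8_t)hit_way;   /* remember the last-used way */

    uint32_t next_set = (set + 1) % NUM_SETS;

    /* Approach 2: wake only the predicted way of the next set.
     * (Approach 1 would instead wake all NUM_WAYS lines of next_set.)
     * The prediction fails on taken branches or bank crossings, in which
     * case the demanded line is woken on access with a wakeup penalty. */
    cache[next_set][predicted_way[next_set]].drowsy = false;
}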

7.4 Design Space Exploration

Schemes    | Turn-off Mechanism                                       | Granularity of Turn-off
Base       | -                                                        | -
Drowsy-Bk  | Switch banks                                             | Bank
Loop       | Instruction                                              | Entire cache
FHS        | Periodic + Not Hot                                       | Entire cache
FHS-PA     | Periodic + Not Hot                                       | Entire cache
DHS-PA     | Periodic + Hot backward branch + Not Hot                 | Entire cache
DHS-Bk-PA  | Periodic + Hot backward branch + Switch banks + Not Hot  | Entire cache

Table 7.1. Leakage control schemes evaluated: turn-off mechanisms.

Table 7.1 shows the turn-off mechanisms and granularity of the different ap-

proaches evaluated. Table 7.2 shows the turn-on mechanisms and granularity of these

approaches. All the approaches considered, except the Drowsy-Bk scheme, turn on at the

cache line granularity and turn off using a global signal to all cache lines. By contrast,

the Drowsy-Bk approach turns on and turns off at the bank granularity.


Schemes    | Turn-on Mechanism               | Granularity of Turn-on
Base       | When accessed                   | Cache line
Drowsy-Bk  | Bank prediction                 | Bank
Loop       | When accessed                   | Cache line
FHS        | When accessed                   | Cache line
FHS-PA     | When previous line is accessed  | Cache line
DHS-PA     | When previous line is accessed  | Cache line
DHS-Bk-PA  | When previous line is accessed  | Cache line

Table 7.2. Leakage control schemes evaluated: turn-on mechanisms.

In all cases, including Base, a cache line is assumed to be in drowsy mode before its first access. The Loop and Drowsy-Bk schemes are used here for comparative purposes. The FHS (Fixed HotSpot) scheme is a variant of the drowsy scheme [20] augmented with the hotspot protection scheme described in Section 7.3.1.1 to avoid turning off hot cache lines. If the span of execution in a hotspot is longer than the fixed turn-off period, the FHS scheme will gain because of the masking. Shorter turn-off periods are useful when executing straight-line code, while longer turn-off periods are desirable for long-running loops. The FHS scheme helps to balance these. However, this scheme does have a shortcoming (as compared to Drowsy) in that it can delay the turn-off of cache lines that belonged to an older hotspot because of the masking. The FHS-PA scheme is similar to FHS but uses the JITA scheme to predictively turn on the next cache line.

The Drowsy-Bk scheme employs a turn-off policy that is based on the assumption that bank access changes indicate a shift in locality. The reactivation energy may involve both the transition energy for changing the supply voltage of the cache lines and additional energy expended in the rest of the system due to the performance penalties associated with wakeup.


The DHS (Dynamic HotSpot) scheme is built on top of the FHS scheme. In addition to the periodic global cache line turn-offs, global cache line turn-off signals are also issued when a new loop-based hotspot is detected. This scheme also employs hotspot detection for protecting cache lines containing program hotspots. This scheme can turn off unused cache lines before the fixed period is reached by detecting that execution will remain in the new loop-based hotspot. This approach is specifically useful when there are straight-line code segments sandwiched between loops. The DHS scheme also incurs a penalty due to the masking, which can delay the turn-off of cache lines that belonged to an older hotspot until the identification of a new hotspot or the expiration of an additional period, as compared to a periodic scheme that employs no masking. The DHS-PA scheme employs the JITA strategy on top of the DHS scheme.

All schemes considered so far are oriented towards identifying either spatial or temporal locality changes. The final approach, DHS-Bk-PA, attempts to identify both. Specifically, it issues a global turn-off at fixed periods, when execution shifts to a new bank, or when a new loop hotspot is detected. Further, it employs the mask bits set using hotspot detection to protect active cache lines and the JITA scheme for predictive cache line turn-on.
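As a compact summary of this policy, the following sketch (function and parameter names are illustrative) combines the three turn-off triggers used by DHS-Bk-PA; whenever any of them fires, the same VCM-masked global turn-off used in the earlier sketches is issued.

#include <stdbool.h>

/* DHS-Bk-PA: a global turn-off of all non-masked (non-hotspot) cache
 * lines is issued when any of the three triggers fires. */
bool dhs_bk_pa_should_turnoff(bool window_expired,       /* periodic trigger  */
                              bool hot_backward_branch,  /* new loop hotspot  */
                              bool bank_switched)        /* spatial shift     */
{
    return window_expired || hot_backward_branch || bank_switched;
}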

The turn-on mechanisms of the proposed schemes can be classified broadly as

those that are activated on access (incurring transition latency) and those that are pre-

dictively activated. The schemes denoted with a PA suffix employ the JITA strategy.

Predictive turn-on strategies are not without their drawbacks. When a wrong prediction

is made, they not only incur the performance penalty (also associated with techniques

that have no prediction) but also the energy cost for activating the wrong cache line(s).


7.5 Experimental Evaluation

This section evaluates the leakage control schemes described in the previous section. First, it describes the simulation parameters. Next, it compares the energy, performance, and energy-delay results of the different schemes. Finally, a sensitivity analysis is performed for the DHS-Bk-PA scheme.

7.5.1 Experiment Setup

The experimental environment is described in Chapter 3. The experiments are conducted using our simulator developed based on SimpleScalar 3.0 [12]. A set of ten integer and four floating-point applications from the SPEC2000 benchmark suite, compiled to PISA binaries and run with their reference inputs, is used in this experiment. Table 7.3 gives the technology and energy parameters used in this work. The energy parameters are based on drowsy control for individual cache lines, using the circuit in [20].

The energy model is as follows:

E_{energy} = E_{drowsy} + E_{active} + E_{datapath+dcache} + E_{overhead}    (7.1)

E_{overhead} = E_{turnon} + E_{extraturnon} + E_{btbcounters} + E_{misc}    (7.2)

E_{misc} = E_{controlbits} + E_{waypredictor}    (7.3)

The total leakage energy Eenergy of the instruction cache with leakage management schemes is composed of four parts: the leakage energy Edrowsy consumed by cache lines in drowsy mode, the leakage energy Eactive consumed by cache lines in active mode, the


increased leakage energy consumption in the datapath and data cache, Edatapath+dcache, due to the additional cycles incurred by leakage control, and the overhead energy Eoverhead for implementing the leakage control schemes. Here, the instruction cache is assumed to consume one-third of the leakage energy of the whole processor, with the remainder expended in the datapath and data cache. The overhead energy Eoverhead includes the transition energy Eturnon for activating a drowsy cache line to active mode, the extra transition energy Eextraturnon due to unnecessary turn-ons resulting from predictive cache line turn-on schemes, the dynamic energy Ebtbcounters consumed in the BTB counters introduced for HSLM, and the miscellaneous energy consumption Emisc due to the voltage control mask bits and a way predictor, if used, in a set-associative cache. The transition delay for activating an entire bank in all schemes was assumed to be one cycle, based on the use of a separate voltage controller associated with each cache line. A bank activation causes all cache lines in the bank to switch from the reduced voltage to the normal voltage. Hence, the energy for transitioning a bank is proportional to the number of cache lines in the bank. The dynamic energy for Ebtbcounters is calculated using Cacti 3.0 [65] at 70nm technology. Since the BTB counters have a very high percentage of zero bits (an average of 95%, due to saturation or not being touched), these counters are implemented using asymmetric-Vt SRAM cells [6]. The optimized cells consume only 1/10th of the original leakage when storing zeros. Thus, the additional leakage due to the BTB counters is very small.
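To make the accounting concrete, the sketch below tallies Equations 7.1-7.2 from per-event counts using the per-line parameters of Table 7.3; the function signature and the way the datapath/data-cache term is passed in are illustrative assumptions.

/* Per-line energy parameters from Table 7.3. */
#define E_LEAK_ACTIVE   0.417e-12   /* J per cache line per cycle (active) */
#define E_LEAK_DROWSY   0.0663e-12  /* J per cache line per cycle (drowsy) */
#define E_TURNON        25.6e-12    /* J per drowsy-to-active transition */
#define E_BTB_COUNTER   0.96e-12    /* J per BTB counter transaction */

/* Illustrative tally of Eq. 7.1-7.2.  The event counts and the
 * datapath/D-cache overhead term would come from the simulator. */
double leakage_energy(unsigned long long active_line_cycles,
                      unsigned long long drowsy_line_cycles,
                      unsigned long long turnons,
                      unsigned long long extra_turnons,
                      unsigned long long btb_counter_accesses,
                      double e_datapath_dcache,   /* extra leakage from slowdown */
                      double e_misc)              /* mask bits, way predictor (Eq. 7.3) */
{
    double e_active   = active_line_cycles * E_LEAK_ACTIVE;
    double e_drowsy   = drowsy_line_cycles * E_LEAK_DROWSY;
    double e_overhead = turnons * E_TURNON              /* Eturnon */
                      + extra_turnons * E_TURNON        /* Eextraturnon */
                      + btb_counter_accesses * E_BTB_COUNTER
                      + e_misc;                         /* Emisc */

    return e_drowsy + e_active + e_datapath_dcache + e_overhead;  /* Eq. 7.1 */
}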


Technology and Energy Parameters
Feature Size                              | 70nm
Supply Voltage                            | 1.0V
Clock Speed                               | 1.0GHz
L1 cache line Leakage in Active           | 0.417pJ/cycle
L1 cache line Leakage in Drowsy           | 0.0663pJ/cycle
Transition (drowsy to active) Energy      | 25.6pJ
Transition (drowsy to active) Latency     | 1 cycle
Dynamic Energy per BTB counter (5 bits)   | 0.96pJ/transaction

Simulation Parameters
Window Size                               | 2048 cycles
Hotness Threshold (Tacc)                  | 16
Subbank Size                              | 4K Bytes

Table 7.3. Technology and energy parameters for the simulated processor given in Table 3.1.

[Figure] Fig. 7.4. The ratio of cycles that cache lines are in active mode over the entire execution time (active ratio). Y-axis: 0 to 100%; bars: Base, Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, DHS-Bk-PA; benchmarks: gzip, vpr, gcc, mcf, parser, perlbmk, gap, vortex, bzip2, twolf, wupwise, mesa, art, equake, Avg.


7.5.2 Experimental Results

The effectiveness of a leakage control scheme depends critically on how many cache lines it can place in the drowsy mode. In order to evaluate this, the active ratio (see Figure 7.4), defined as the average percentage of cache lines that are active throughout the program execution, is measured. A smaller active ratio indicates the potential for larger savings; however, overheads or performance penalties can reduce this potential. In measuring this ratio, the instruction cache is assumed to initially be in drowsy mode and each cache line is activated only when first accessed. On the average, the DHS-Bk-PA scheme achieves the lowest active ratio (around 4.5%), while the active ratio for the Base scheme is 66.2%. Observe that DHS-Bk-PA employs the most aggressive turn-off scheme. Among the leakage optimization schemes, FHS-PA and Drowsy-Bk have the largest active ratios (12.7% and 12.5%). In the Drowsy-Bk scheme this happens because all lines in a bank are turned on regardless of whether they will be accessed; in FHS-PA, masking can delay turn-off. In vortex, the active ratio of Loop is much higher (around 59%) due to the absence of many loop constructs.
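Stated as a formula (a restatement of the definition above, where N is the number of cache lines, T is the number of execution cycles, and active_i(t) indicates that line i is in active mode during cycle t):

\text{active ratio} = \frac{1}{T} \sum_{t=1}^{T} \frac{1}{N} \sum_{i=1}^{N} \mathrm{active}_i(t)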

Figure 7.5 breaks down the turn-offs in scheme DHS-Bk-PA into three categories: periodic turn-off, dynamic turn-off, and bank-switch turn-off. The observation is that each category constitutes a significant portion. On average, periodic turn-off accounts for 35.8%, dynamic turn-off accounts for 25.3%, and the rest (38.9%) is contributed by bank-switch turn-off. This confirms that all three turn-off mechanisms are important for leakage control.


[Figure] Fig. 7.5. Breakdown of turn-offs in scheme DHS-Bk-PA. Y-axis: cache line turn-off breakdown (0 to 100%); categories: Periodic, Dynamic, Bankswitch; benchmarks: gzip, vpr, gcc, mcf, parser, perlbmk, gap, vortex, bzip2, twolf, wupwise, mesa, art, equake, Avg.

[Figure] Fig. 7.6. Leakage energy reduction w.r.t. the Base scheme. Y-axis: leakage energy (with overhead) reduction (-20% to 80%); bars: Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, DHS-Bk-PA; benchmarks: gzip, vpr, gcc, mcf, parser, perlbmk, gap, vortex, bzip2, twolf, wupwise, mesa, art, equake, Avg.


The next question is whether the active ratio really translates into energy savings. Figure 7.6 presents the total leakage energy reduction of all leakage control schemes compared to the Base scheme. This evaluation depends on the overhead leakage incurred in the rest of the chip excluding the instruction cache. In order to capture different processor configurations and underlying circuit styles, the contribution of the instruction cache leakage is varied from 10-30% of overall on-chip leakage. DHS-Bk-PA, which has the smallest active ratio, also has the best energy behavior. Further, HSLM and JITA help to reduce the additional overhead energy for this scheme. Hence, it achieves an average energy reduction of 63% over Base, 49% over Drowsy-Bk, and 29% over Loop. When this percentage is 10%, these energy reductions are 59% over Base, 44% over Drowsy-Bk, and 50% over Loop (not shown in the figure for brevity). Focusing on an anomalous trend in Figure 7.6, benchmark wupwise exhibits very different energy behavior. Except for scheme FHS (0.3% reduction) and Loop (0% reduction), all other schemes increase the energy consumption, with the Drowsy-Bk scheme increasing energy consumption by 19%. This results from the small footprint of this benchmark, which touches only 77 cache lines of the same bank (out of 128 lines for the configuration given in Table 7.3).

In order to have a closer look at the energy behavior of the different schemes, Figure 7.7 provides a more detailed breakdown (averaged over all benchmarks). For Base, the leakage energy is due to the leakage consumed by drowsy cache lines (before a cache line is accessed) and the leakage consumed by active cache lines. Loop and FHS have a noticeable portion of energy from additional datapath and data cache leakage due to performance degradation. In contrast, Drowsy-Bk has very little energy overhead from performance loss because it predictively turns on banks. However, the turn-on


[Figure] Fig. 7.7. The leakage energy breakdown (an average over the fourteen SPEC2000 benchmarks). Y-axis: leakage energy (with overhead) breakdown in J (0 to 0.16); components: Drowsy Leakage, Active Leakage, Datapath+DCache, Turn On, Extra Turn On, BTB Counter; schemes: Base, Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, DHS-Bk-PA.

energy for the Drowsy-Bk scheme is significant, as it turns on all lines in a bank in one cycle. Further, the extra turn-on energy (from activating the wrong subbank) is around 20% of the turn-on overhead energy, which in turn accounts for 40.7% of the total leakage energy. Even without accounting for the significant turn-on penalty of the Drowsy-Bk scheme, DHS-Bk-PA (considering all overheads except turn-on) achieves 23% more energy savings than Drowsy-Bk. This is because of the high active ratio of the Drowsy-Bk scheme. The FHS and FHS-PA schemes also have a major portion of their energy budget consumed in active cache lines due to their high active ratios. Further, the BTB counter overhead is minimal since the saturated counters do not incur any additional dynamic activity: their clocks are gated once the most significant bit turns to a one (until it is reset), and only the most significant bit of these saturated counters is read to identify hotspots.


[Figure] Fig. 7.8. Ratio of activations on instruction cache hits. Y-axis: 0 to 100%; bars: Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, DHS-Bk-PA; benchmarks: gzip, vpr, gcc, mcf, parser, perlbmk, gap, vortex, bzip2, twolf, wupwise, mesa, art, equake, Avg.

[Figure] Fig. 7.9. The ratio of effective preactivations performed by JITA over total activations incurred during the entire simulation. Y-axis: 0 to 100%; bars: FHS-PA, DHS-PA, DHS-Bk-PA; benchmarks: gzip, vpr, gcc, mcf, parser, perlbmk, gap, vortex, bzip2, twolf, wupwise, mesa, art, equake, Avg.


Next, a metric defined as the activation ratio is measured to highlight the performance penalty for accessing drowsy cache lines. This ratio gives the percentage of cache line activations made on a cache hit relative to the total number of activations performed. A larger number indicates a larger performance penalty; activations on cache misses do not incur any additional penalty. Figure 7.8 shows the results. For Loop and FHS this value is 79.5% and 83% on average, respectively. The use of JITA reduces this number to 7.6%, 11%, and 12.4% for FHS-PA, DHS-PA, and DHS-Bk-PA, respectively. While JITA is successful in reducing the penalty of activation, it still incurs penalties when it fails due to taken branches or jumps to drowsy cache lines. For the Drowsy-Bk scheme, which activates many unnecessary cache lines when turning on an entire bank, this metric is not very useful. To provide more insight into why JITA works so well in FHS-PA, DHS-PA, and DHS-Bk-PA, Figure 7.9 shows the percentage of effective preactivations performed by JITA over the total activations incurred during the execution for these three schemes.

Figure 7.10 shows how this activation penalty translates into actual performance values. The Base scheme (not shown) performs the best as it incurs no performance penalties except for the initial activation of untouched cache lines. The Drowsy-Bk scheme performs the best among the remaining schemes and incurs a degradation of only 0.56%. The Loop scheme incurs the highest degradation, 15.4% on the average, because this scheme has the highest number of accesses to drowsy cache lines. In contrast, providing the hotspot protection in FHS reduces this penalty to 5.2%. Further, using JITA reduces this penalty to 0.7% for the FHS-PA scheme. The best-performing energy scheme, DHS-Bk-PA, suffers a degradation of 2.3% on the average. Note, however,


[Figure] Fig. 7.10. Performance degradation w.r.t. the Base scheme. Y-axis: performance (IPC) degradation (0 to 35%); bars: Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, DHS-Bk-PA; benchmarks: gzip, vpr, gcc, mcf, parser, perlbmk, gap, vortex, bzip2, twolf, wupwise, mesa, art, equake, Avg.

[Figure] Fig. 7.11. Energy-delay (J*s) product (EDP). Y-axis: energy (J) * delay (s) product (0 to 0.18); bars: Base, Drowsy-Bk, Loop, FHS, FHS-PA, DHS-PA, DHS-Bk-PA; benchmarks: gzip, vpr, gcc, mcf, parser, perlbmk, gap, vortex, bzip2, twolf, wupwise, mesa, art, equake, Avg.


that DHS-Bk-PA is still the best scheme in terms of energy in spite of the additional energy overhead incurred due to the performance penalty.

Finally, the energy-delay products (EDP) of each scheme are presented in Figure 7.11. Note that the overhead energy (see Figure 7.7) has been included. The results show that scheme DHS-Bk-PA performs best: it achieves the smallest EDP value, with an average reduction of 62.63% over Base, and additional reductions of 48.3% and 37.7% over Drowsy-Bk and Loop, respectively.

7.5.3 Sensitivity Analysis

This section investigates the impact of different parameters and reports results only for the best-performing scheme, DHS-Bk-PA. It highlights only some of the key aspects influencing the other schemes and uses a representative set of benchmarks for clarity and due to space limitations. Three benchmarks, parser, bzip2, and equake, are selected for this experiment. The baseline configuration for DHS-Bk-PA is the same as in the above experiments: a 2K-cycle window for hotness sampling and the global turn-off interval, a hotness threshold Tacc of 16 accesses used in HSLM, and a 4KB subbank size. In each of the following sets of experiments, only one parameter is varied while the other two are kept unchanged.

The sampling window plays an important role in leakage control schemes with periodic turn-off. If the window decreases, cache lines can be placed in drowsy mode faster, potentially implying more leakage reduction. However, it may incur more performance loss due to more activations on cache hits, and consequently more transition energy and more energy in the datapath and data cache. On the other hand,


[Figure] Fig. 7.12. Impact of sampling window size on leakage control scheme DHS-Bk-PA. (a) Performance (IPC); (b) leakage energy (J). X-axis: window size (0.5K, 1K, 2K, 4K, 8K cycles); benchmarks: parser, bzip2, equake.

increasing the window puts cache lines into drowsy mode less frequently, which reduces the opportunity for leakage savings, but the performance loss and overhead energy will be much smaller. The masking bits used in hotspot protection schemes can, however, mask the negative impact of unnecessary turn-offs; a very small sampling window close to Tacc can nevertheless prevent cache lines from entering hotspots and nullify the benefits of masking. Figure 7.12 shows the impact of the sampling window size on the performance and leakage energy of scheme DHS-Bk-PA. The performance impact is very slight (IPC degradation is just 1.4%) when the period decreases from 8K cycles to 0.5K cycles, compared to the Drowsy scheme, for which IPC degrades by 9.9% for the same period decrease. The leakage energy also reduces as the window size shrinks when using DHS-Bk-PA, but the reduction drops after the window size reaches 1K cycles as there is less potential to turn off cache lines. In contrast, the energy consumption of the Drowsy scheme increases due to higher overheads when the period reduces from 8K to 0.5K cycles; e.g., benchmark parser


increases its energy by 10.5%. The result shows that DHS-Bk-PA scales very well with window size.

[Figure] Fig. 7.13. Impact of hotness threshold on leakage control scheme DHS-Bk-PA. (a) Performance (IPC); (b) leakage energy (J). X-axis: hotness threshold (4, 8, 16, 32, 64); benchmarks: parser, bzip2, equake.

The hotness threshold controls how readily a cache line can be established as part of a hotspot within a given sampling window. A smaller threshold puts more cache lines into hotspots and prevents them from being turned off. This helps maintain high performance but hurts leakage energy savings. On the other hand, a larger threshold favors energy savings but might degrade performance. Figure 7.13 shows this impact when the threshold varies from 4 to 64. Note that reducing the threshold from 64 to 4 improved IPC by 1.2% but increased the leakage energy consumption by 5.1%. Also note that increasing the threshold beyond 64 makes it approach the sampling window interval and does not help the schemes much.


[Figure] Fig. 7.14. Impact of subbank size on leakage control scheme DHS-Bk-PA. (a) Performance (IPC); (b) leakage energy (J). X-axis: subbank size (0.5K, 1K, 2K, 4K, 8K bytes); benchmarks: parser, bzip2, equake.

The subbank size affects the turn-offs caused by bank switches. A smaller bank size might present more opportunity for energy optimization, but it also introduces false phase changes detected by monitoring bank switches (for example, more loops can split across bank boundaries), which incurs more energy and performance overhead. Figure 7.14 shows the impact of subbank size on scheme DHS-Bk-PA. It shows that the energy consumption increases and the IPC degrades when banks smaller than 2KB are used. Note that very small banks are in any case not desirable as they increase decoding overheads.

Next, the impact of cache associativity is studied using Approaches 1 and 2 described in Section 7.3.2. Approach 1 gives preference to performance and eliminates performance penalties due to way prediction at the expense of energy. Approach 2 uses way prediction to reduce energy consumption and can potentially incur performance penalties due to additional activations on access caused by way mispredictions. Figure 7.15 presents the performance degradation and leakage energy reduction (compared to the Base


scheme of the corresponding cache configuration) for five instruction cache configurations: a direct-mapped cache (DM), a 2-way associative cache using Approach 1 (2-Way), a 2-way associative cache using Approach 2 (2-Way + WP), a 4-way associative cache using Approach 1 (4-Way), and a 4-way associative cache using Approach 2 (4-Way + WP). It is observed that, when using Approach 2, the performance degradation increases with associativity as the way-prediction accuracy drops. Consequently, the leakage energy reduction decreases as well. In contrast, Approach 1 achieves much better performance at the cost of higher leakage energy, especially for caches with more ways.

[Figure] Fig. 7.15. Impact of cache associativity: (a) performance (IPC) loss; (b) leakage energy reduction. X-axis: DM, 2-Way, 2-Way+WP, 4-Way, 4-Way+WP; benchmarks: parser, bzip2, equake.

7.6 Discussions and Summary

This work focused on the leakage management of instruction caches. The leakage management premise focuses on being able to identify changes in spatial and temporal locality, and it exploits two main characteristics of instruction access patterns: that program execution is mainly confined to program hotspots and that instructions exhibit a sequential access pattern. Two strategies have been devised to exploit these two main characteristics: HotSpot based Leakage Management (HSLM) and Just-in-Time Activation (JITA).

Specifically, HSLM is used to protect cache lines containing program hotspots from being turned off and to dynamically identify shifts in the program hotspot. JITA is used to predictively activate the next cache line to mitigate the performance penalty incurred in waking up drowsy cache lines. These schemes were combined with existing approaches that exploit either the spatial or the temporal locality of instruction cache accesses. The evaluation shows that it is important to consider shifts in both spatial and temporal locality in order to optimize the leakage energy consumed by instruction caches. Further, using the program behavior captured by HSLM helps avoid some of the overheads of managing leakage in an application-agnostic fashion and also helps to detect shifts in program hotspots dynamically. Finally, JITA is a simple and effective scheme for masking the performance penalties associated with waking up drowsy cache lines, and it permits fine-grained leakage management at the cache line level.

DHS-Bk-PA, one of the leakage management schemes explored in this work, is the most effective in terms of both energy reduction and the energy-delay metric among all the schemes explored (including recently proposed instruction cache leakage management techniques). It aggressively combines HSLM, JITA, and both spatial- and temporal-based cache line turn-off. In DHS-Bk-PA, the spatial, temporal, and HSLM hotspot detection mechanisms aggressively reduce the leakage of the caches, while HSLM hotspot protection


and JITA mitigate the performance and energy overheads associated with aggressive cache line turn-off. With the increasing focus on reducing leakage energy as technology scales, and with larger and larger caches being incorporated on-chip, such cache leakage control schemes will be vital in future processor generations.


Chapter 8

Conclusions and Future Work

8.1 Conclusions

Energy consumption has become an increasing concern and one of the major constraints in
microprocessor design. Due to their large share of the transistor budget, on-chip caches
contribute significantly to processor energy consumption in terms of both dynamic and
static energy. This thesis work started by exploring the relationship between an
application's characteristics and its cache behavior, and how the properties of this
relationship can be exploited by compiler or microarchitectural schemes to reduce the
energy consumption (both dynamic and leakage) in caches. Building on that, this thesis
proposed several techniques that orchestrate compiler and microarchitectural support to
attack cache energy consumption in an application-sensitive way.

More specifically, this thesis research made the following four major contributions towards
a new design methodology for highly energy-efficient on-chip memory hierarchies:

• A detailed cache behavior characterization for both array-intensive embedded applications
and general-purpose applications was performed in this work. Three critical properties of
an application and its cache behavior, namely cache resource demands for performance,
program execution footprint, and instruction cache access behavior, have been identified,
extracted, and analyzed in the context of cache energy optimization. The insights obtained
from this study suggest that (1) different applications, or different code segments within
a single application, have very different cache demands with respect to performance and
energy, (2) program execution footprints (instruction addresses) can be highly predictable
and usually have a narrow scope during a particular execution phase, especially for
array-intensive applications, and (3) accesses to the instruction cache exhibit high
sequentiality.

• Inspired by the findings of the above study, the compiler-directed cache polymorphism
(CDCP) technique proposed in this thesis implements an optimizing compiler that analyzes
the cache behavior (i.e., data reuse) of the application code and determines the cache
configuration that best matches this behavior, achieving good performance with optimized
energy behavior. The cache is then directed to reconfigure dynamically at runtime to the
configurations determined by CDCP. This technique mainly explores the new role of the
compiler in interacting with reconfigurable cache architectures. Experimental results show
that CDCP provides competitive performance and lower data cache energy consumption compared
to an oracle scheme that uses optimal cache configurations obtained from exhaustive
simulation. (An illustrative sketch of such compiler-inserted reconfiguration directives is
given after this list.)

• Building on the dynamic behavior of the runtime instruction footprint observed for a set
of array-based embedded applications, this thesis proposed a new issue queue design that
restructures the instruction supply mechanism of conventional microprocessors. The scheme
captures and exploits the predictable execution footprint to reduce energy consumption in
the instruction cache and, as a side benefit, in other processor components. The proposed
issue queue can reschedule instructions already buffered in the queue itself, avoiding
repeated instruction streaming from the pipeline front-end and thereby significantly
reducing energy consumption in the instruction cache.

• Further, two techniques, hotspot-based leakage management (HSLM) and just-in-time
activation (JITA), are proposed in this work to manage the leakage in the instruction cache
in an application-sensitive fashion. HSLM not only protects program hotspots from being
inadvertently turned off, but also switches old hotspots into drowsy mode as soon as a
phase change is detected. JITA exploits the sequential nature of instruction cache accesses
and preactivates the cache line following the one currently being accessed, thus overlapping
the one-cycle drowsy wakeup penalty. The scheme employing these two strategies, in addition
to periodic and spatially based (bank-switch) turn-off, provides a significant improvement
in instruction cache leakage energy savings (while also accounting for the overheads
incurred in the rest of the processor) over previously proposed schemes [45][78].
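To illustrate the CDCP contribution above, the following sketch shows how compiler-directed reconfiguration might look at the source level. It is an illustration only: cache_reconfigure() is a hypothetical primitive standing in for whatever reconfiguration interface the target cache exposes, and the chosen parameters are invented for the example rather than taken from the experiments in this thesis.

/* Hypothetical example of CDCP-style output: the compiler analyzes the data
   reuse of a loop nest and inserts a reconfiguration directive before it.
   cache_reconfigure() and its parameters are assumptions for illustration. */
#include <stddef.h>

/* Stub standing in for a real reconfigurable-cache interface. */
static void cache_reconfigure(size_t size_kb, int assoc, int line_bytes)
{
    (void)size_kb; (void)assoc; (void)line_bytes;   /* no-op placeholder */
}

#define N 1024

void scale_rows(float a[N][N], const float b[N][N])
{
    /* Reuse analysis (illustrative): b[i][j] has only spatial reuse along j
       and a[i][j] is streamed once, so a small, low-associativity data cache
       configuration is predicted to be sufficient for this nest. */
    cache_reconfigure(8, 1, 32);

    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 2.0f * b[i][j];
}

At runtime, such a directive would trigger the dynamic reconfiguration described above just before the loop nest begins executing.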

8.2 Future Work

This thesis research has raised a number of new ideas and topics for future research in the
areas of low-power system design, high-performance computer architecture, and reliable
power-efficient systems.


Power consumption has become one of the critical limiters to integration in modern
microprocessors and, if left unaddressed, could stall current technology advancement.
Excessive power consumption not only increases the packaging and cooling cost of
microprocessors dramatically, but also demands extremely costly facility design for data
centers, which spend millions of dollars every year on power delivery and heat removal.
In my future work, I would like to extend my research theme from circuits, architecture,
and compilers to operating systems and applications, and from the processor core to on-chip
systems, main memory, disks, and disk arrays in data centers. My goal is to build an
infrastructure that seamlessly embodies power optimizations at different system levels for
different system components, with the intent of making a significant impact on both the
low-power industry and the academic research community.

The traditional design paradigm for microprocessor architectures may become an impediment
to sustaining the performance improvements delivered by advancing VLSI technologies. I
believe that new-generation technology necessitates new design methodologies and
philosophies. Wakeup-free instruction scheduling [19][39] is a good example of a new
architectural direction for future microprocessors. An insight from this research is that
the complexity and timing of large centralized processor components are becoming obstacles
to the performance improvement driven by faster clock speeds. A promising research topic is
to reconsider the datapath and design new architectures that partition these centralized
components. By avoiding centralized design, each distributed part can be self-managed,
self-adaptive, and self-activated, so that the structure scales well with technology
scaling in terms of both complexity and performance.


This research has indicated that behavior characterization is crucial for guiding effective
optimizations of a particular system component. Given the rapidly widening speed gap
between the memory hierarchy and the processor core, an everlasting question is how to
improve the performance of the memory hierarchy. This question is not new, but it continues
to challenge architects. Specifically, my interest in this area is to use sophisticated
behavior characterization to direct performance optimization. The questions then become:
what memory behavior should be characterized beyond generational behavior? How can the
characterization be performed efficiently, and at what level, the compiler or the
architecture? And how can the resulting behavior characteristics be exploited for
performance improvement? These questions raise many interesting research topics for
high-performance memory systems.

Another issue that is capturing increasing attention in industry and academia is system
reliability under further technology scaling and aggressive power optimization strategies.
The transistor noise margin is shrinking because the supply voltage scales down much faster
than the transistor threshold voltage Vth, which makes circuits more vulnerable to noise.
Supply noise worsens with runtime dynamic/leakage power optimizations. The drop in Qcritical
caused by lower supply voltage and smaller node capacitance makes the chip more susceptible
to soft errors. All these trends reduce the reliability of future products, yet commercial
servers cannot afford downtime caused by reliability problems. Since we are now conducting
fundamental research on understanding the basics of soft errors, my specific interest for
future work is to model the impact of soft errors on different processor components as well
as the error propagation path and distance. This research is intended to provide a
fundamental understanding of how, when, and where to detect possible errors and to perform
the follow-up correction/recovery effectively. Another interesting research direction is
designing reliable systems from unreliable components with schemes such as self-adaptive
reconfiguration and time/space redundancy. Reliability in conjunction with low-power system
design opens a broad research area for future work.
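The first-order relation behind the soft-error trend mentioned above (a standard approximation, not a result derived in this thesis) is that the critical charge of a storage node scales roughly with its node capacitance and supply voltage,

\[
  Q_{\mathrm{crit}} \;\approx\; C_{\mathrm{node}} \cdot V_{dd},
\]

so shrinking the node capacitance through device scaling and lowering the supply voltage through power optimizations both reduce the charge a particle strike must deposit to upset the node.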

Looking toward high-performance computer architectures, I plan to investigate power and
reliability issues in the context of multithreaded architectures, chip multiprocessors
(CMP), and networks-on-chip (NoC). In the networking domain, I am particularly interested
in power management during the different operating modes of sensor networks. The power
optimizations I plan to explore span sensor hardware, communication protocols, and specific
applications.


References

[1] International Technology Roadmap for Semiconductors, Semiconductor Industry Association. http://public.itrs.net, 2001.

[2] A. Agarwal, H. Li, and K. Roy. DRG-Cache: A data retention gated-ground cache for low power. In Proc. ACM/IEEE Design Automation Conference, pages 473–478, June 2002.

[3] Nawaaz Ahmed, Nikolay Mateev, and Keshav Pingali. Synthesizing transformations

for locality enhancement of imperfectly-nested loop nests. In Proceedings of the 2000

International Conference on Supercomputing, pages 141–152, Santa Fe, New Mexico,

May 2000.

[4] D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. In

Proc. of the 32nd Annual International Conference on Microarchitecture, 1999.

[5] T. Anderson and S. Agarwala. Effective hardware-based two-way loop cache for

high performance low power processors. In IEEE Int’l Conf. on Computer Design,

2000.

[6] N. Azizi, A. Moshovos, and F. N. Najm. Low-leakage asymmetric-cell SRAM. In Proc. the 2002 International Symposium on Low Power Electronics and Design, Monterey, CA, 2002.


[7] D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler transformations for high-

performance computing. ACM Computing Surveys, 26(4):345–420, 1994.

[8] Raminder S. Bajwa et al. Instruction buffering to reduce power in processors for sig-

nal processing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems,

5(4):417–424, December 1997.

[9] P. Bannon. Alpha 21364: A scalable single-chip SMP. Microprocessor Forum, October 1998.

[10] R. Bechade et al. A 32b 66MHz 1.8W microprocessor. In Proc. of International Solid-State Circuits Conference, 1994.

[11] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for architectural-

level power analysis and optimizations. In Proc. International Symposium on High-

Performance Computer Architecture, 2000.

[12] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. Technical report, University of Wisconsin-Madison, June 1997.

[13] J. A. Butts and G. Sohi. A static power model for architects. In Proc. the 33rd Annual International Symposium on Microarchitecture, December 2000.

[14] S. Carr, C. Ding, and P. Sweany. Improving software pipelining with unroll-and-

jam. In Proc. the 29th Annual Hawaii International Conference on System Sciences,

pages 183–192, Maui, HI, January 1996.


[15] Jacqueline Chame. Compiler Analysis of Cache Interference and its Applications to Compiler Optimizations. PhD thesis, Dept. of Computer Engineering, University of Southern California, 1997.

[16] Anantha Chandrakasan, William J. Bowhill, and Frank Fox, editors. Design of

High-Performance Microprocessor Circuits. IEEE Press, 2001.

[17] P. P. Chang, N. J. Warter, S. Mahlke, W. Y. Chen, and W-M. W. Hwu. Three

superblock scheduling models for superscalar and superpipelined processors. Tech-

nical Report CRHC-91-29, Center for Reliable and High-Performance Computing,

University of Illinois, Urbana, IL, 1991.

[18] B. Cmelik and D. Keppel. Shade: A fast instruction-set simulator for execution profiling. In Proc. of the 1994 ACM SIGMETRICS Conf. on the Measurement and Modeling of Computer Systems, pages 128–137, May 1994.

[19] D. Ernst, A. Hamel, and T. Austin. Cyclone: A broadcast-free dynamic instruction scheduler with selective replay. In Proceedings of the 30th Annual International Symposium on Computer Architecture, June 2003.

[20] K. Flautner, N. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: Simple

techniques for reducing leakage power. In Proc. the 29th International Symposium

on Computer Architecture, Anchorage, AK, May 2002.

[21] B. Franke and M. F. P. O'Boyle. Array recovery and high-level transformations for DSP applications. ACM Transactions on Embedded Computing Systems (TECS), 2(2):132–162, May 2003.


[22] D. Gannon, W. Jalby, and K. Gallivan. Strategies for cache and local memory

management by global program transformation. Journal of Parallel and Distributed

Computing, 5(5):587–616, October 1988.

[23] S. Ghosh, M. Martonosi, and S. Malik. Cache miss equations: An analytical repre-

sentation of cache misses. In Proc. of the 11th International Conference on Super-

computing (ICS-97), July 1997.

[24] S. Ghosh, M. Martonosi, and S. Malik. Precise miss analysis for program transforma-

tions with caches of arbitrary associativity. In Proceedings of the 8th International

Conference on Architectural Support for Programming Languages and Operating

Systems, pages 228–239, San Jose, CA, October 1998.

[25] T. Givargis, J. Henkel, and F. Vahid. Interface and cache power exploration for core-

based embedded systems. In Proceedings of International Conference on Computer

Aided Design (ICCAD), pages 270–273, November 1999.

[26] A. Gordon-Ross, S. Cotterell, and F. Vahid. Exploiting fixed programs in embedded

systems: A loop cache example. IEEE Computer Architecture Letters, 2002.

[27] D. Grunwald, B. G. Zorn, and R. Henderson. Improving the cache locality of memory

allocation. In Proceedings of the ACM SIGPLAN’93 Conference on Programming

Language Design and Implementation (PLDI), pages 177–186, Albuquerque, New

Mexico, 1993.

[28] J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, 3rd edition, 2002.


[29] S. Heo, K. Barr, M. Hampton, and K. Asanovic. Dynamic fine-grain leakage reduction using leakage-biased bitlines. In Proc. the 29th International Symposium on Computer Architecture, Anchorage, AK, May 2002.

[30] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, Q1 2001 issue, February 2001.

[31] M. Hiraki et al. Stage-skip pipeline: A low power processor architecture using

a decoded instruction buffer. In Proc. International Symposium on Low Power

Electronics and Design, 1996.

[32] J. S. Hu, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin. Analyzing data reuse

for cache reconfiguration. Accepted to publish in ACM Transactions on Embedded

Computing Systems.

[33] J. S. Hu, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, H. Saputra, and W. Zhang.

Compiler-directed cache polymorphism. In Proc. of ACM SIGPLAN Joint Con-

ference on Languages, Compilers, and Tools for Embedded Systems (LCTES’02)

and Software and Compilers for Embedded Systems (SCOPES’02), pages 165 – 174,

Berlin , Germany, June 19-21 2002.

[34] J. S. Hu, A. Nadgir, N. Vijaykrishnan, M. J. Irwin, and M. Kandemir. Exploiting

program hotspots and code sequentiality for instruction cache leakage management.

In Proc. of the International Symposium on Low Power Electronics and Design

(ISLPED’03), pages 402 – 407, Seoul, Korea, August 25-27 2003.


[35] J. S. Hu, N. Vijaykrishnan, M. J. Irwin, and M. Kandemir. Selective trace cache:

A low power and high performance fetch mechanism. Technical Report CSE-02-

016, Department of Computer Science and Engineering, The Pennsylvania State

University, 2002.

[36] J. S. Hu, N. Vijaykrishnan, M. J. Irwin, and M. Kandemir. Using dynamic branch

behavior for power-efficient instruction fetch. In Proc. of IEEE Computer Society

Annual Symposium on VLSI (ISVLSI 2003), pages 127 – 132, Tampa, Florida,

February 20-21 2003.

[37] J. S. Hu, N. Vijaykrishnan, M. Kandemir, and M. J. Irwin. Power-efficient trace

caches. In Proc. of the 5th Design Automation and Test in Europe Conference

(DATE’02), March 2002.

[38] J. S. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M. J. Irwin. Scheduling

reusable instructions for power reduction. In Proc. of the Conference on Design,

Automation and Test in Europe Conference (DATE’04), Paris, France, February

16-20 2004.

[39] Jie S. Hu, N. Vijaykrishnan, and Mary Jane Irwin. Exploring wakeup-free instruc-

tion scheduling. In Proc. of the International Symposium on High Performance

Computer Architecture (HPCA-10), pages 232 – 241, Madrid, Spain, February 14-

18 2004.


[40] I. Kadayif, M. Kandemir, N. Vijaykrishnan, M. J. Irwin, and J. Ramanujam. Mor-

phable cache architectures: potential benefits. In ACM Workshop on Languages,

Compilers, and Tools for Embedded Systems (LCTES’01), June 2001.

[41] S. Kaxiras, Z. Hu, and M. Martonosi. Cache decay: exploiting generational behav-

ior to reduce cache leakage power. In Proc. the 28th International Symposium on

Computer Architecture, Sweden, June 2001.

[42] K. Kennedy and K. S. McKinley. Optimizing for parallelism and data locality. In Proc. the 6th ACM International Conference on Supercomputing (ICS'92), Washington, DC, 1992.

[43] H. Kim and K. Roy. Dynamic Vt SRAMs for low leakage. In Proc. ACM International Symposium on Low Power Design, pages 251–254, August 2002.

[44] N. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. S. Hu, M. J. Irwin,

M. Kandemir, and N. Vijaykrishnan. Leakage current: Moore’s law meets static

power. IEEE Computer Special Issue on Power- and Temperature-Aware Comput-

ing, pages 68 – 75, December 2003.

[45] N. Kim, K. Flautner, D. Blaauw, and T. Mudge. Drowsy instruction caches: Leakage

power reduction using dynamic voltage scaling and cache sub-bank prediction. In

Proc. the 35th Annual International Symposium on Microarchitecture, November

2002.

[46] J. Kin et al. The filter cache: An energy efficient memory structure. In Proc.

International Symposium on Microarchitecture, 1997.


[47] L. H. Lee, B. Moyer, and J. Arends. Instruction fetch energy reduction using loop

caches for embedded applications with small tight loops. In Proc. International

Symposium on Low Power Electronics and Design, 1999.

[48] Haris Lekatsas and Wayne Wolf. SAMC: A code compression algorithm for embedded processors. IEEE Transactions on CAD, 18(12):1689–1701, December 1999.

[49] L. Li et al. Leakage energy management in cache hierarchies. In Proc. the 11th Inter-

national Conference on Parallel Architectures and Compilation Techniques, Septem-

ber 2002.

[50] S. A. Mahlke et al. Effective compiler support for predicate execution using the hy-

perblock. In Proc. the 25th Annual International Symposium on Microarchitecture,

1992.

[51] N. Manjikian and T. S. Abdelrahman. Fusion of loops for parallelism and locality. In

Proceedings of the 24th International Conference on Parallel Processing (ICPP’95),

pages II:19–28, Oconomowoc, Wisconsin, August 1995.

[52] S. Manne, A. Klauser, and D. Grunwald. Pipeline gating: Speculation control for

energy reduction. In Proc. the 25th Annual International Symposium on Computer

Architecture, pages 132–141, June 1998.

[53] Kathryn S. McKinley, Steve Carr, and Chau-Wen Tseng. Improving data locality

with loop transformations. ACM Transactions on Programming Lanaguages and

Systems, 18(4):424–453, July 1996.


[54] M. C. Merten et al. An architectural framework for runtime optimization. IEEE

Transactions on Computers, 50(6):567–589, June 2001.

[55] T. Simunic, G. De Micheli, and L. Benini. Energy-efficient design of battery-powered embedded systems. In Proceedings of International Symposium on Low Power Electronics and Design, pages 212–217, August 1999.

[56] J. Montanaro et al. A 160-MHz, 32-b, 0.5-W CMOS RISC microprocessor. Digital Technical Journal, Digital Equipment Corporation, 9, 1997.

[57] Samuel D. Naffziger and Gary Hammond. The implementation of the next-generation 64b Itanium microprocessor. In Proceedings of ISSCC, February 2002.

[58] Dharmesh Parikh, Kevin Skadron, Yan Zhang, Marco Barcella, and Mircea R. Stan.

Power issues related to branch prediction. In Proc. the 8th International Symposium

on High-Performance Computer Architecture (HPCA’02), February 2002.

[59] M. D. Powell, S. Yang, B. Falsafi, K. Roy, and T. N. Vijaykumar. Reducing leakage

in a high-performance deep-submicron instruction cache. IEEE Transactions on

VLSI, 9(1), February 2001.

[60] Michael Powell, Se-Hyun Yang, Babak Falsafi, Kaushik Roy, and T. N. Vijaykumar.

Gated-vdd: A circuit technique to reduce leakage in deep-submicron cache memo-

ries. In Proc. the International Symposium on Low Power Electronics and Design

(ISLPED ’00), pages 90–95, July 2000.


[61] Michael D. Powell, Amit Agarwal, T. N. Vijaykumar, Babak Falsafi, and Kaushik

Roy. Reducing set-associative cache energy via way-prediction and selective direct-

mapping. In Proceedings of the 34th annual ACM/IEEE international symposium

on Microarchitecture, pages 54–65, 2001.

[62] P. Ranganathan, S. Adve, and N. P. Jouppi. Reconfigurable caches and their appli-

cation to media processing. In Proc. of the 27th Annual International Symposium

on Computer Architecture, pages 214–224, June 2000.

[63] G. Reinman and N. Jouppi. An integrated cache timing and power model. CACTI 2.0 technical report, Compaq Western Research Lab, 1999.

[64] G. Rivera and C.-W. Tseng. Eliminating conflict misses for high performance archi-

tectures. In Proceedings of the 1998 International Conference on Supercomputing,

pages 353–360, Melbourne, Australia, July 1998.

[65] P. Shivakumar and N. P. Jouppi. CACTI 3.0: An integrated cache timing, power and area model. Technical report, Compaq Computer Corporation, August 2001.

[66] Silicon Strategies. SandCraft MIPS64 embedded processor hits 800 MHz. http://www.siliconstrategies.com, 2002.

[67] Avinash Sodani and Gurindar S. Sohi. Dynamic instruction reuse. In Proc. the 24th

Annual International Symposium on Computer Architecture (ISCA-97), June 1997.


[68] Y. Song and Z. Li. New tiling techniques to improve cache temporal locality. In

Proceedings of the SIGPLAN ’99 Conference on Programming Language Design and

Implementation, Atlanta, GA, May 1999.

[69] Stanford Compiler Group. The SUIF Library, version 1.0 edition, 1994.

[70] Jason Stinson and Stefan Rusu. A 1.5GHz third generation Itanium 2 processor. In Proc. of the 40th Conference on Design Automation, pages 706–709, 2003.

[71] W. Tang, R. Gupta, and A. Nicolau. Power savings in embedded processors through

decode filter cache. In Proc. Design and Test in Europe Conference, 2002.

[72] O. Temam, C. Fricker, and W. Jalby. Cache interference phenomena. In Proc. of

ACM SIGMETRICS Conference on Measurement & Modeling Computer Systems,

1994.

[73] V. Tiwari, S. Malik, A. Wolfe, and M.T.C. Lee. Instruction level power analysis

and optimization of software. Journal of VLSI Signal Processing, 13(2):1–18, 1996.

[74] M. Wolf and M. Lam. A data locality optimizing algorithm. In Proc. of SIGPLAN’91

conf. Programming Language Design and Implementation, pages 30–44, 1991.

[75] K. C. Yeager. The MIPS R10000 superscalar microprocessor. IEEE Micro, 16(2):28–40, April 1996.

[76] Q. Yi, V. Adve, and K. Kennedy. Transforming loops to recursion for multi-level memory hierarchies. In Proceedings of the SIGPLAN '00 Conference on Programming Language Design and Implementation, Vancouver, Canada, June 2000.


[77] Chuanjun Zhang, Frank Vahid, and Walid Najjar. A highly configurable cache

architecture for embedded systems. In Proceedings of the 30th annual international

symposium on Computer architecture, pages 136–146, 2003.

[78] W. Zhang, J. S. Hu, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin.

Compiler-directed instruction cache leakage optimization. In Proc. the 35th Annual

International Symposium on Microarchitecture, November 2002.

[79] W. Zhang, J. S. Hu, V. Degalahal, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin.

Reducing instruction cache energy consumption using a compiler-based strategy.

ACM Transactions on Architecture and Code Optimization (TACO), 1(1):3 – 33,

2004.

[80] H. Zhou, M. C. Toburen, E. Rotenberg, and T. M. Conte. Adaptive mode control:

a static power-efficient cache design. In Proc. the 2001 International Conference on

Parallel Architectures and Compilation Techniques, September 2001.


Vita

Jie Hu was born in Ninghai, Zhejiang, China on July 8, 1975. He graduated from Ninghai High
School of Zhejiang Province in 1993. He received his B.E. degree in computer science and
engineering from Beijing University of Aeronautics and Astronautics in 1997. He ranked
first in his class and was recommended by his department to the graduate school at Peking
University with a waiver of the graduate admission exams. In 2000, he married Ms. Kai Chen.
In the same year, he received his M.E. degree in signal and information processing from
Peking University. Immediately after that, he enrolled in the Ph.D. program in computer
science and engineering at The Pennsylvania State University. Since August 2000, he has
been a graduate assistant in the same department.

Jie Hu is a member of IEEE, ACM, and ACM SIGARCH.