IMPROVING MEMORY HIERARCHY PERFORMANCE WITH
DRAM CACHE, RUNAHEAD CACHE MISSES, AND
INTELLIGENT ROW-BUFFER PREFETCHES
By
XI TAO
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2016
© 2016 Xi Tao
To my parents
ACKNOWLEDGMENTS
It has been a long journey since I first arrived in Gainesville. I never dreamt of
studying at a place so distant from my hometown, yet I have spent five and a half wonderful years
here.
Obtaining a Ph.D. degree is never easy. You constantly feel stressed, at a loss, and
sometimes wonder how to continue. Throughout those years, I have been truly grateful for all the help
and guidance from my advisor, Dr. Jih-Kwon Peir, who is always so patient and kind. His
brilliant suggestions helped me overcome many obstacles. He has also spent numerous hours
helping me review my papers and make revisions. Without his help, I really cannot
imagine sitting here and writing this dissertation now.
I also want to thank my Ph.D. committee members: Dr. Shigang Chen, Dr. Prabhat
Mishra, Dr. Beverly Sanders, and Dr. Tan Wong. Thank you for your advice and support during
my study at the University of Florida. I would also like to thank my lab mate Qi Zeng, who has
provided great suggestions and advice on our collaborative work.
Lastly, I want to give my greatest thanks to my friends here in Gainesville. You guys
really made my life colorful here. I also want to thank my parents, who have always been there
encouraging me and believing in me. I could not have achieved all this without your support!
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS ...............................................................................................................4
LIST OF TABLES ...........................................................................................................................7
LIST OF FIGURES .........................................................................................................................8
ABSTRACT ...................................................................................................................................10
CHAPTER
1 INTRODUCTION ..................................................................................................................12
1.1 DRAM Caches ...............................................................................................................17
1.2 Runahead Cache Misses ................................................................................................18
1.3 Hashing Fundamentals and Bloom Filter ......................................................................19
1.4 Intelligent Row Buffer ...................................................................................................21
2 PERFORMANCE METHODOLOGY AND WORKLOAD SELECTION ..........................23
2.1 Evaluation Methodology ...............................................................................................23
2.2 Workload Selection .......................................................................................................25
3 CACHE LOOKASIDE TABLE .............................................................................................26
3.1 Background and Related Work ......................................................................................26
3.2 CLT Overview ...............................................................................................................29
3.2.1 Stacked Off-die DRAM Cache with On-Die CLT ............................................29
3.2.2 CLT Coverage ...................................................................................................31
3.2.3 Comparison of DRAM Cache Methods ............................................................32
3.3 CLT Design ...................................................................................................................37
3.4 Performance Evaluation .................................................................................................41
3.4.1 Difference between Related Proposals..............................................................41
3.4.2 Performance Results..........................................................................................43
3.4.3 Sensitivity Study and Future Projection ............................................................47
3.4.4 Summary ...........................................................................................................49
4 RUNAHEAD CACHE MISSES USING BLOOM FILTER .................................................50
4.1 Background and Related work .......................................................................................50
4.2 Memory Hierarchy and Timing analysis .......................................................................51
4.3 Performance Results ......................................................................................................57
4.3.1 IPC Comparison ................................................................................................58
4.3.2 Sensitivity Study ...............................................................................................60
4.4 Summary ........................................................................................................................62
5 GUIDED MULTIPLE HASHING .........................................................................................64
5.1 Background ....................................................................................................................64
5.2 Hashing ..........................................................................................................................66
5.3 Proposed Algorithm .......................................................................................................67
5.3.1 The Setup Algorithm .........................................................................................67
5.3.2 The Lookup Algorithm .....................................................................................70
5.3.3 The Update Algorithm ......................................................................................71
5.4 Performance Results ......................................................................................................72
5.5 Summary ........................................................................................................................82
6 INTELLIGENT ROW BUFFER PREFETCHES ..................................................................83
6.1 Background and Motivation ..........................................................................................83
6.2 Hot Row Buffer Design and Results .............................................................................86
6.3 Performance Evaluation .................................................................................................93
6.4 Conclusion .....................................................................................................................95
7 SUMMARY ............................................................................................................................97
LIST OF REFERENCES .............................................................................................................100
BIOGRAPHICAL SKETCH .......................................................................................................106
LIST OF TABLES
Table page
2-1 Architecture parameters of processor and memories ........................................................ 24
2-2 MPKI and footprint of the selected benchmarks .............................................................. 25
3-1 Comparison of different DRAM cache designs ................................................................ 33
3-2 Difference between three designs ..................................................................................... 42
3-3 Comparison of L4 MPKR, L4 occupancy and predictor accuracy ................................... 46
4-1 False-positive rates of 12 benchmarks .............................................................................. 59
4-2 Future Conventional DRAM parameters .......................................................................... 62
5-1 Notation and Definition .................................................................................................... 68
5-2 Routing table updates for enhanced 4-ghash .................................................................... 80
6-1 Hit ratio for hybrid scheme of 10 workloads using 64 entries .......................................... 89
6-2 Prefetch usage for 10 workloads using a simple stream prefetcher .................................. 95
6-3 Sensitivity study on prefetch granularity .......................................................................... 95
LIST OF FIGURES
Figure page
1-1 The structure of a memory hierarchy. ............................................................................... 13
1-2 Memory hierarchy organization with 4-level caches ........................................................ 14
1-3 DRAM Internal Organization .............................................................. 15
3-1 Memory hierarchy with stacked DRAM cache ................................................................ 30
3-2 Reuse distance curves normalized to the percentage of the maximum distance .............. 32
3-3 Coefficient of variation (CV) of hashing 64K cache-set using different indices ............. 35
3-4 DRAM cache MPKI using sector indexing ...................................................................... 36
3-5 CLT design schematics ..................................................................................................... 38
3-6 CLT operations in handling memory requests .................................................................. 39
3-7 CLT speedup with respect to Alloy, TagTables_64, and TagTables_16 .......................... 45
3-8 Memory access latency (CPU cycles)............................................................................... 45
3-9 IPC change for different CLT coverage............................................................................ 48
3-10 Execution cycle change for different sector size in CLT design ...................................... 49
4-1 Memory latency with / without BFL3 .............................................................................. 52
4-2 Cache indexing and hashing for BF .................................................................................. 55
4-3 False-positive rates for 6 hashing mechanisms ................................................................. 56
4-4 False-positive rates with m:n = 2:1, 4:1, 8:1, 16:1, and k = 1, 2 ...................................... 56
4-5 IPC comparisons with/without BF .................................................................................... 59
4-6 Average IPC for m:n ratios and hashing functions ........................................................... 61
4-7 Average IPC for different L4 sizes ................................................................................... 61
4-8 Average IPC over different DRAM latency ..................................................................... 62
5-1 Distribution of keys in buckets of four hashing algorithms. ............................................. 66
5-2 A simple d-ghash table with 5 keys, 8 buckets and 2 hash functions. .............................. 68
5-3 Bucket loads for the five hashing schemes. ...................................................................... 73
5-4 Number of bucket accesses per lookup for d-ghash. ........................................................ 74
5-5 Average number of keys per lookup based on memory usage ratio. ................................ 75
5-6 The average number of non-empty buckets for looking up a key. ................................... 77
5-7 Sensitivity of the number of bucket accesses per lookup. ................................................ 78
5-8 Changes in the number of bucket accesses per lookup and rehash percentage. ............... 79
5-9 Number of bucket accesses per lookup for experiments with five routing tables. ........... 80
5-10 Experiment with the update trace using enhanced 4-ghash. ............................................. 80
6-1 Hot Row pattern of 10 workloads ..................................................................................... 85
6-2 Hot Row Identification and Update .................................................................................. 88
6-3 Results of proposed hybrid scheme .................................................................................. 89
6-4 Block column difference within a row for 10 workloads. ................................................ 91
6-5 IPC/Row buffer hit Ratio speedup of 10 workloads ......................................................... 93
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
IMPROVING MEMORY HIERARCHY PERFORMANCE WITH
DRAM CACHE, RUNAHEAD CACHE MISSES, AND
INTELLIGENT ROW-BUFFER PREFETCHES
By
Xi Tao
December 2016
Chair: Jih-Kwon Peir
Major: Computer Engineering
Large off-die stacked DRAM caches have been proposed to provide higher effective
bandwidth and lower average latency to main memory. Designing a large off-die DRAM cache
with conventional block size (64 bytes) requires a large tag array which is impractical to fit on-
die. We investigate a novel design called Cache Lookaside Table (CLT) to reduce the average
access latency and to lessen off-die tag array accesses. The proposed CLT exploits memory
reference locality and provides a fast alternative tag path to capture most of the DRAM cache
requests.
To hide long memory latency and to alleviate memory bandwidth requirements, a fourth
level of cache (L4) has been introduced in modern high-performance computing systems. However,
adding cache levels worsens the cache miss penalty, since memory requests go through the levels
of the cache hierarchy sequentially. We investigate a new way of using a Bloom Filter (BF) to
predict cache misses early at a particular cache level. These misses can run ahead to access
lower levels of cache and memory to shorten the miss penalty.
Inspired by the usefulness of the Bloom filter in cache accesses, we conduct a fundamental
study to find a way to balance the hash buckets while maintaining a low false-positive rate for
the Bloom filter. To broaden the applications, our study is based on the routing and packet
forwarding function at the core of the IP network-layer protocols. We propose a guided multi-
hashing approach which achieves near-perfect load balance among hash buckets, while limiting
the number of buckets to be probed for each key (address) lookup, where each bucket holds one
or a few routing entries.
A key challenge in effectively improving system performance lies in maximizing both
row-buffer hits and bank-level parallelism while simultaneously providing fairness among
different requests. We observe that accesses to each bank in DRAM are not equally distributed
among the rows for most of the workloads we study. We propose a simple scheme to
capture the hot-row pattern and prefetch data in these hot rows. Results demonstrate the
effectiveness of our proposed scheme.
CHAPTER 1
INTRODUCTION
The memory hierarchy plays a critical role in designing high-performance processors. It
becomes increasingly difficult to advance processor performance further due to the memory wall
problem: despite aggressive out-of-order, speculative execution, the processor stalls waiting for
data from memory. Lately, microprocessor manufacturers have been putting a growing number of cores
on a chip to satisfy the increased demand of large workloads such as data mining and analytics.
As the number of cores grows, the pressure on the memory subsystem in terms of capacity and
bandwidth increases as well.
Memory hierarchy design takes advantage of memory reference locality and of the trade-offs
between capacity and access speed among memory technologies to hide memory latency and to alleviate
memory bandwidth requirements. Between the CPU and main memory there are multiple levels of
cache memory. Close to the CPU are small, fast caches with higher bandwidth that
temporarily store the most frequently used data. At increasing levels, the cache capacity
becomes larger, but the access speed becomes slower with reduced bandwidth. Based on
reference locality, the most recently referenced data can be accessed in the highest level of
cache, while the large capacity of the lower levels captures recently referenced data that cannot fit into the
highest level. Figure 1-1 depicts this memory hierarchy organization.
A conventional cache access goes through a tag path to determine a cache hit or miss and
a data path to access the data in case of a hit. The cache tag and data arrays maintain topological
equivalence, such that matching the address tag in the tag array determines the location of the
block in the data array. These two paths may overlap, permitting the data array access to start before
the hit position is determined, which shortens the cache access time.
Figure 1-1. The structure of a memory hierarchy: as the distance from the processor increases, so
does the size and access time, but with decreasing bandwidth.
Modern high-performance multi-core systems generally adopt a 3-level on-die cache
architecture, referred to as the L1, L2, and L3 caches, which are placed on the processor die using
SRAM technology. The L1 and L2 caches are usually private, meaning each core has its own
L1 and L2 caches. The L3 cache is normally shared by all the cores and serves as a
connection point to the main memory, which is located off the processor chip with long access
latency. Intel Haswell [1], the 4th-generation Core, adopts a 4th-level cache built on embedded
DRAM technology to hide main memory latency and to deliver substantial performance
improvements for media, graphics, and other high-performance computing applications. A
general organization of a multicore system with 4 levels of caches is illustrated in Figure 1-2.
Figure 1-2. Memory hierarchy organization with 4-level caches.
In order to quantitatively measure the memory performance of a cache hierarchy, we use
the average memory access time (AMAT) to describe the average time it takes for the entire
hierarchy to return data. For a three-level cache architecture with L1, L2, and L3
caches, the AMAT is calculated as follows:
AMAT = HitTime_L1 + MissRate_L1 × MissPenalty_L1
     = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2
       × (HitTime_L3 + MissRate_L3 × MissPenalty_L3))
where HitTime is the access time of the cache at a particular level, MissRate is the
fraction of accesses that miss at that level, and MissPenalty is the time to fetch the block from the next
level of the memory hierarchy. Fetching data from the next level may itself encounter cache hits or cache
misses, so the miss penalty of the current level is equivalent to the AMAT starting from the next
level. To achieve high performance, we want to design a cache hierarchy with fast hit times,
small miss ratios, and small miss penalties.
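The recursive form of the AMAT equation can be evaluated from the bottom level up, since each level's miss penalty is simply the AMAT of the levels below it. The sketch below illustrates this with assumed latencies and miss rates (the specific numbers are illustrative only, not measurements from this study):

```python
def amat(levels, mem_penalty):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_time_cycles, miss_rate) ordered from L1 down.
    mem_penalty: main-memory access time in cycles.
    """
    penalty = mem_penalty
    # Work upward from the last cache level: each level's miss
    # penalty is the AMAT of everything below it.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty

# Assumed numbers: L1 = 4 cycles / 5% misses, L2 = 12 cycles / 20%,
# L3 = 40 cycles / 50%, main memory = 200 cycles.
latency = amat([(4, 0.05), (12, 0.20), (40, 0.50)], 200)
print(round(latency, 3))
```

With these assumed parameters the hierarchy returns data in about 6 cycles on average, even though a full miss to memory costs 200 cycles, which illustrates how strongly the upper-level hit rates dominate AMAT.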
Beyond the cache hierarchy, main memory is where the application's instructions and data
are stored. During program execution, requested instructions and data are moved from disk to
main memory on demand, initiating an I/O activity called a page fault. Main memory is built using
dynamic random access memory (DRAM) technology. When the requested instructions and
data are not located in the caches, they are accessed from main memory with substantially longer
latency. Multiple levels of caches hold the working set of recently referenced instructions and
data in the hope of limiting the need to access them from main memory.
Figure 1-3. DRAM internal organization.
DRAM-based main memory is a multi-level hierarchy of structures. At the highest level,
each processor die is connected to one or more DRAM channels. Each channel has a dedicated
command, address and data bus. One or more memory modules can be connected to each DRAM
channel. Each memory module contains a number of DRAM chips. As the data output width of
each DRAM chip is low (typically 8 bits for commodity DRAM), multiple chips are grouped
together to form a rank. In other words, a rank is a collection of DRAM chips that together feed
the standard 64-bit data bus.
Internally, each chip consists of multiple banks. Each bank consists of many rows of
DRAM cells and a row buffer that caches the last accessed row of the bank. Each DRAM cell in
the row is identified by its column address. Reading or writing data in DRAM
requires that the entire row first be read into the row buffer; reads and writes then operate
directly on the row buffer. After the operation, the row is closed and the data in the row buffer is
written back into the DRAM array. Figure 1-3 shows this topology.
When the memory controller receives an access to a 64-byte cache line, it first decodes
the address into channel, rank, bank, row, and column numbers. As the data of each 64-byte
cache line is split across the chips within a rank, the memory controller maintains a
mapping scheme to determine which parts of the cache line are mapped to which chips. Upon
receiving the command, each chip accesses the corresponding column of data from its row
buffer and transfers it on the data bus. Once the data is transferred, the memory controller
assembles the requested cache line and sends it back to the processor.
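The address decoding step can be sketched as a sequence of bit-field extractions. The field widths below (1 channel bit, 1 rank bit, 3 bank bits, 16 row bits, 7 column bits above a 64-byte line offset) are illustrative assumptions, not a fixed standard; real controllers choose their bit mapping to spread traffic across channels and banks.

```python
# Decode a physical address into DRAM coordinates.
# Field widths are assumed for illustration only.
def decode(addr):
    offset = addr & 0x3F            # byte within the 64-byte line
    addr >>= 6
    channel = addr & 0x1;  addr >>= 1
    column  = addr & 0x7F; addr >>= 7   # 128 line-sized columns per row
    bank    = addr & 0x7;  addr >>= 3   # 8 banks per rank
    rank    = addr & 0x1;  addr >>= 1
    row     = addr & 0xFFFF             # 64K rows per bank
    return dict(channel=channel, rank=rank, bank=bank,
                row=row, column=column, offset=offset)

print(decode(0x12345678))
```

Two addresses that differ only in their column bits map to the same row of the same bank, which is exactly the case where back-to-back accesses become row-buffer hits.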
Although all banks on a channel share a common set of command and data buses, operations on
multiple banks may occur in parallel (e.g., opening a row in one bank while reading data from
another bank's row buffer) so long as the commands are properly scheduled and all other
DRAM timing constraints are obeyed. A memory controller can improve memory system
throughput by scheduling requests such that they proceed in parallel across banks. Meanwhile,
DRAM can operate under different page policies. Leaving a row buffer open after every access is called
the open-page policy; closing the row buffer after every access is called the close-page policy. Accessing
data already loaded in the row buffer, also called a row-buffer hit, incurs a shorter latency than
when the corresponding row must first be "opened" from the DRAM array. Therefore, the open-
page policy enables more efficient access to the same open row, at the expense of increased
access delay to other rows in the same DRAM array. A row-buffer conflict occurs when a request
targets a row other than the currently open row, which incurs substantial delay.
Close-page policies, on the other hand, can serve row-buffer conflict requests faster.
Our proposed research focuses on various techniques, from caches to memory, to
improve the performance of the memory hierarchy on modern multicore systems. The outline of the
research topics is given in the following subsections. The performance evaluation methodology
and workload selection are given in Chapter 2. This is followed by a detailed description of
each research topic in Chapters 3, 4, 5, and 6. Finally, a summary of the proposed research is
given in Chapter 7.
1.1 DRAM Caches
Cache capacity is limited by the number of transistors on the processor die. With newer
packaging technologies such as silicon interposers (2.5D) [2] or 3D integrated circuit stacking [3],
processor and DRAM can be placed in close proximity, which gives processors high-bandwidth
and low-latency access to dense memory. At the current time, however, stacked
DRAM capacity is still insufficient for it to be used as the system main memory [4] [5]. There have
been two approaches to integrating stacked DRAM: either as the last-level cache [6] [7] [4] [8]
[9] [10], or as a part of the main memory [11] [12] [13] [14]. Using stacked DRAM as a part of
memory requires extra address mapping and data block swapping between fast and slow DRAMs
[15] [16] [17] [18] [19]. This approach utilizes the entire DRAM capacity, which is essential
when the capacities of stacked and off-chip DRAM are close. However, with tens of GB of off-
chip DRAM in today's personal systems, it is more viable to use an order-of-magnitude smaller
stacked DRAM as the last-level cache (L4) to provide fast memory access and to alleviate off-
chip memory bandwidth requirements.
Our first research topic is to investigate fundamental issues and to assess the performance
advantage of a large stacked DRAM cache included in the memory hierarchy as the last-level
cache. Large off-die stacked DRAM caches have been proposed to provide higher effective
bandwidth and lower average latency to main memory. Designing a large off-die DRAM cache
with a conventional block size (e.g., 64 bytes) requires a large tag array which is impractical to fit
on-die. Placing the large directory in off-die memory prolongs the latency, since a tag access is
necessary before the data can be accessed; this additional trip also generates extra off-die traffic.
We investigate a novel design called the Cache Lookaside Table (CLT) to reduce the average
access latency and to lessen off-die tag array accesses. The basic approach is to cache a small
number of recently referenced tags on-die. An off-die tag access is avoided when a requested
block's tag hits a cached tag. To save on-die space, cached tags are recorded at a large sector
granularity so that a tag entry is shared by multiple blocks. However, due to the loss
of one-to-one physical mapping between the cached tags and the data array, a way pointer is
added for each block to indicate its way location. The proposed CLT exploits memory reference
locality and provides a fast alternative tag path that captures most of the DRAM cache requests.
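The core idea can be sketched in a few lines. The structures and sizes below are illustrative only, not the CLT design evaluated in Chapter 3: an on-die table maps sector tags to per-block way pointers, a CLT hit goes straight to the DRAM cache data array, and a CLT miss falls back to the off-die tag array.

```python
BLOCKS_PER_SECTOR = 8   # assumed sector size (in 64-byte blocks)

class CLT:
    """Toy model: on-die cache of sector tags with per-block way pointers."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = {}   # sector tag -> list of way pointers (None = unknown)

    def lookup(self, block_addr):
        sector, block = divmod(block_addr, BLOCKS_PER_SECTOR)
        ways = self.table.get(sector)
        if ways is None or ways[block] is None:
            return None           # CLT miss: consult the off-die tag array
        return ways[block]        # CLT hit: access the data array directly

    def fill(self, block_addr, way):
        # Record the way location learned from an off-die tag lookup.
        sector, block = divmod(block_addr, BLOCKS_PER_SECTOR)
        ways = self.table.setdefault(sector, [None] * BLOCKS_PER_SECTOR)
        ways[block] = way
        if len(self.table) > self.entries:    # crude capacity bound (FIFO)
            self.table.pop(next(iter(self.table)))
```

Because neighboring blocks share one sector entry, a single off-die tag fetch can seed way pointers for an entire sector, which is how spatial locality keeps the on-die table small.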
Experimental results show that, in comparison with other proposed DRAM caching
mechanisms, a small on-die CLT delivers average performance improvements in the range of
4-15%.
1.2 Runahead Cache Misses
To hide long memory latency and to alleviate memory bandwidth requirement, a fourth-
level of cache (L4) is introduced in modern high-performance computing systems as illustrated
in Figure 1-2. However, increasing cache levels worsens the cache miss penalty since memory
19
requests go through levels of cache hierarchy sequentially. We investigate a new way of using a
Bloom Filter (BF) to predict cache misses earlier at a particular cache level. These misses can
runahead to access lower level of caches and memory to shorten the miss penalty. One inherent
difficulty in using a BF to predict cache misses is due to the fact that cache contents are
dynamically updated through insertions and deletions. We propose a new BF hashing scheme
that extends the cache index for the target set to access the BF array. Since the BF index is a
superset of the cache index, all blocks hashed to the same BF location are allocated in the same
cache set to simplify updates to the BF array. When a block is evicted from the cache, the
corresponding BF bit is reset only when no block hashed to this location exists in the cache set.
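The update rule can be sketched as follows. The sizes are illustrative assumptions (1024 sets and 2 extra index bits, loosely corresponding to the m:n ratios studied in Chapter 4), not the evaluated configuration; the point is that a clear BF bit guarantees a miss, so the request can run ahead immediately.

```python
SETS, EXTRA_BITS = 1024, 2   # assumed sizes: BF has 4 bits per cache set

def bf_index(set_idx, tag):
    # Extend the cache set index with low-order tag bits: every block
    # that maps to one BF bit resides in the same cache set.
    return (set_idx << EXTRA_BITS) | (tag & ((1 << EXTRA_BITS) - 1))

bf = [False] * (SETS << EXTRA_BITS)
cache = {s: set() for s in range(SETS)}   # set index -> resident tags

def insert(set_idx, tag):
    cache[set_idx].add(tag)
    bf[bf_index(set_idx, tag)] = True

def evict(set_idx, tag):
    cache[set_idx].discard(tag)
    # Reset the bit only if no remaining block in this set hashes there;
    # only this one set ever needs to be checked.
    if not any(bf_index(set_idx, t) == bf_index(set_idx, tag)
               for t in cache[set_idx]):
        bf[bf_index(set_idx, tag)] = False

def predict_miss(set_idx, tag):
    # No false negatives: a clear bit means the block cannot be cached.
    return not bf[bf_index(set_idx, tag)]
```

A set bit, by contrast, may be a false positive (another block sharing the bit), in which case the normal tag lookup still resolves the access correctly; the BF only accelerates the guaranteed-miss case.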
Performance evaluation using a set of SPEC2006 benchmarks shows that using a BF at
the third-level (L3) cache of a 4-level cache hierarchy to filter and run ahead of L3 misses
improves IPC by 3-21%, with an average improvement of 9.5%.
1.3 Hashing Fundamentals and Bloom Filter
Inspired by the usefulness of Bloom filter in cache accesses, we conduct a fundamental
study to find a way to balance the hashing buckets while maintaining lower false-positive rate for
Bloom filter. To broaden the applications, our study is based on the routing and packet
forwarding function at the core of the IP network-layer protocols. The throughput of a router is
constrained by the speed at which the routing table lookup can be performed. Hash-based lookup
has been a research focus in this area due to its O(1) average lookup time.
It is well known that hash collisions are an inherent problem when a single random hash
function is used, causing an uneven distribution of keys among the hash buckets in a
nondeterministic fashion. The multiple-hashing technique, on the other hand, uses d independent
hash functions to place a key into one of d possible buckets. The criterion for selecting the target
bucket for placement is flexible and can be controlled to accomplish a specific objective. One
well-known objective of using multiple hash functions is load balancing, i.e., balancing the keys
across the buckets [20] [21] [22] [23] [24] [25].
Another known objective of multiple hashing sets an opposite criterion: reducing the fill
factor of the hash buckets [22] [26], where the fill factor is measured by the ratio of nonempty buckets.
Instead of placing a key in the bucket with the smallest number of keys for load balancing, this
approach places the key in a bucket that already holds keys. The objective of this
placement is to maximize the number of empty buckets. One potential application is to apply the
low fill-factor hashing method to a Bloom filter [22] [26]: with more zeros remaining in the
Bloom filter, the critical false-positive rate can be reduced. To create more zeros when establishing
the Bloom filter, however, multiple sets of hash functions are needed for different keys, since all
k hashed bits for each key must be set during the setup of the Bloom filter. Therefore, the
multiple-hashing concept is actually applied to choosing a set of hash functions out of multiple
groups so as to maximize the number of zeros in the Bloom filter after recording k '1's for every key.
Building on a series of prior multi-hashing developments, including d-random, 2-left, and d-left,
we find that a new guided multi-hashing approach holds the promise of further pushing the
envelope of this line of research, achieving significant performance improvement beyond what
today's best techniques can deliver. Our guided multi-hashing approach achieves near-perfect
load balance among hash buckets, while limiting the number of buckets to be probed for each
key (address) lookup, where each bucket holds one or a few routing entries. Unlike the localized
optimization of prior approaches, we utilize the full information of the multi-hash mapping from
keys to hash buckets for a global key-to-bucket assignment. We have dual objectives of lowering
the bucket size while increasing the number of empty buckets, which helps to reduce the number of buckets
brought from off-chip memory to the network processor for each lookup. We introduce
mechanisms to ensure that most lookups require only one bucket to be fetched.
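As a point of reference, the baseline d-random scheme that the guided approach improves upon can be sketched as follows. The hash construction and the routing-prefix keys are illustrative assumptions; setup places each key in the least-loaded of its d candidate buckets, and a lookup must probe all d candidates.

```python
import hashlib

D, BUCKETS = 2, 8   # assumed: d = 2 hash functions, 8 buckets

def candidates(key):
    # Derive d independent bucket indices from one cryptographic hash
    # (an illustrative stand-in for d independent hash functions).
    return [int.from_bytes(hashlib.sha256(f"{i}:{key}".encode())
                           .digest()[:4], "big") % BUCKETS
            for i in range(D)]

table = [[] for _ in range(BUCKETS)]

def insert(key):
    # d-random placement: choose the least-loaded candidate bucket.
    best = min(candidates(key), key=lambda b: len(table[b]))
    table[best].append(key)

def lookup(key):
    # Every candidate bucket must be probed; the guided scheme of
    # Chapter 5 instead assigns keys globally so most lookups fetch one.
    for b in candidates(key):
        if key in table[b]:
            return b
    return None

for k in ["10.0.0.0/8", "192.168.1.0/24", "172.16.0.0/12"]:
    insert(k)
```

The local, per-key placement decision is the weakness this chapter targets: each insertion only sees its own d candidates, so the global assignment can remain unbalanced even when a better overall key-to-bucket matching exists.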
Our simulation results show that, with the same number of hash functions, the guided
multiple-hashing schemes are better balanced than d-left and others, while the average number of
buckets accessed for each lookup is reduced by 20-50%.
1.4 Intelligent Row Buffer
Accessing off-chip memory is a major performance bottleneck in microprocessors. As all
of the cores must share the limited off-chip memory bandwidth, a large number of outstanding
requests greatly increases contention for the memory data and command buses. Because a bank
can only process one command at a time, a large number of requests also increases bank
contention, where requests must wait for busy banks to finish servicing other requests.
A key challenge in effectively improving system performance lies in maximizing both
row-buffer hits and bank level parallelism while simultaneously providing fairness among
different requests. We observed that accesses to each bank in DRAM are not equally distributed among rows for most of the workloads we study in Chapter 2. Some rows tend to receive more frequent accesses than others over a given interval due to spatial locality; we call these rows "hot rows". However, if requests from different hot rows in the same bank interleave with each other, there is little chance that those requests result in row-buffer hits.
DRAM banks will frequently close opened rows and issue commands to open other rows, causing large queuing delays (time spent waiting for the memory controller to start servicing a request) and DRAM device access delays (due to decreased row-buffer hit rates and bus contention).
We propose a simple scheme to capture the hot-row pattern and prefetch data in these hot rows. Prefetched data will be row-buffer hits, saving access time later. Results show that our design consistently performs better than simple LRU and LFU hot-row schemes.
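The hot-row idea can be illustrated with a minimal sketch: count accesses per (bank, row) over a window and flag rows whose counts cross a threshold as prefetch candidates. The threshold and aging window below are hypothetical parameters chosen for illustration, not the tuned values evaluated later in this dissertation.

```python
from collections import Counter

class HotRowDetector:
    """Sketch of hot-row detection: per-(bank, row) access counters that are
    periodically aged, with rows above a threshold flagged as 'hot'."""

    def __init__(self, threshold: int = 4, window: int = 256):
        self.threshold = threshold   # accesses needed to call a row hot
        self.window = window         # accesses between counter resets (aging)
        self.counts = Counter()
        self.accesses = 0

    def access(self, bank: int, row: int) -> bool:
        """Record an access; return True if the row now counts as hot."""
        self.counts[(bank, row)] += 1
        self.accesses += 1
        if self.accesses >= self.window:   # age all counters periodically
            self.counts.clear()
            self.accesses = 0
            return False
        return self.counts[(bank, row)] >= self.threshold
```

When a row is flagged hot, the memory controller could prefetch further blocks from that open row so later requests become row-buffer hits.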
CHAPTER 2
PERFORMANCE METHODOLOGY AND WORKLOAD SELECTION
2.1 Evaluation Methodology
In order to evaluate the performance advantages of the proposed works in memory
hierarchy designs, we adopt two cycle-accurate simulation methodologies. The first method is to
establish and run applications on MARSSx86 [27], an x86-based whole-system simulation
environment. MARSSx86 is built on QEMU, a full-system emulation environment, in which selected multi-threaded and multi-programmed workloads are compiled and run. The executed
instructions and memory requests drive a cycle-accurate multi-core model, which is extended
from PTLsim [28]. Memory requests are simulated through multiple levels of cache hierarchy. In
case of a last level cache miss, the request is issued to the memory, which is modelled using
DRAMsim2 [29], a cycle-accurate DDR-based DRAM model.
We develop a memory interface controller, called MICsim, to handle requests from the processors to the off-die DRAM cache and memory, along with a callback interface between MICsim and the multicore processor model in MARSSx86. When a memory request misses the last-level on-die caches, it is inserted into a memory request queue; MICsim processes requests from the head of this queue, one per cycle. We model partial hits to both the stacked and the conventional DRAMs. Outstanding requests are saved in a pending queue for detecting and holding subsequent requests to pending blocks. This first evaluation methodology is used to study the run-ahead cache miss proposal, since detailed L1, L2, and L3 caches are simulated to assess the effectiveness of bypassing certain cache levels.
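The pending-queue behavior described above, where subsequent requests to a block with an outstanding miss are held and released together, can be sketched as follows. The class and method names are ours, for illustration only, and are not MICsim's actual interface.

```python
class PendingQueue:
    """Sketch of pending-miss bookkeeping: the first miss to a block goes to
    DRAM; later requests to the same block wait on the outstanding miss."""

    def __init__(self):
        self.pending = {}   # block address -> list of waiting request ids

    def issue(self, block_addr: int, req_id: int) -> bool:
        """Return True if this request must go to DRAM (primary miss),
        False if it piggybacks on an outstanding miss (secondary miss)."""
        if block_addr in self.pending:
            self.pending[block_addr].append(req_id)
            return False
        self.pending[block_addr] = [req_id]
        return True

    def complete(self, block_addr: int) -> list:
        """Data returned: release every request waiting on this block."""
        return self.pending.pop(block_addr, [])
```

This coalescing keeps a burst of requests to one block from generating redundant DRAM traffic.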
Table 2-1. Architecture parameters of processor and memories.
Processor, DRAM Cache and DRAM Memory Parameters
Processor: 3.2GHz, 8 cores, out-of-order
L1 Caches: I/D split, 32KB MESI, 64B line, 4-way, 2 read ports, 1 write port, latency 4 cycles
L2 Cache: Private, 256KB, 64B line, 8-way, 2 read ports, 1 write port, latency 11 cycles, snooping-bus MESI protocol, 2 cycles per request, split transaction
L3 Cache: Shared, 8MB, 64B line, 16-way, 2 read ports, 2 write ports, latency 24 cycles
L4 DRAM Cache: 128MB-256MB, 1.6GHz, 16-byte bus, Channels/Ranks/Banks: 4/1/16, tCAS-tRCD-tRP: 9-9-9, tRAS-tRC: 36-33
Conventional DRAM: 16GB, 800MHz, 8-byte bus, Channels/Ranks/Banks: 2/1/8, tCAS-tRCD-tRP: 9-9-9, tRAS-tRC: 36-33, 2KB row buffer, close page
Although MARSSx86 precisely simulates multicore processors, the simulation time is unbearably long when simulating a large stacked DRAM cache and memory: billions of instructions are required to produce meaningful results. The virtual machine infrastructure in MARSSx86 also limits the physical address space, as pointed out in [30]. In the second method, we adopt the Epoch model [31] for estimating the execution time of different applications. It uses traces generated by the Pin-based Sniper simulator [32], in which the representative regions of applications are simulated and the requests sent to L3 are collected based on an Intel Gainestown configuration with private L1 and L2 caches and a shared L3 that interfaces with main memory. Per-core memory traces generated from Sniper are annotated with Epoch marks to ensure correct dependence tracking for issuing a cadence of memory requests. Each core exploits memory-level parallelism by issuing memory requests up to the Epoch mark. Each memory request is simulated through a cycle-accurate memory hierarchy model, the same as in the first method. The processor waits until all requests return from the memory controller before moving to the next Epoch. In this method, we model a precise on-die shared L3 cache with correct timing and bandwidth considerations. This Epoch simulation model is used in evaluating the
Cache Lookaside Table proposal, which provides an alternative tag path for a large stacked DRAM cache, as well as in studying intelligent hot-row prefetching for DRAM row buffers. Table 2-1 summarizes the architecture parameters used in our simulation.
2.2 Workload Selection
Table 2-2. MPKI and footprint of the selected benchmarks.
Benchmarks FootPrint (MB) L3MPKI L4MPKI
mcf 9310 74.0 19.2
gcc 477 39.5 1.4
lbm 3222 30.2 15.1
soplex 1693 28.1 21.9
milc 4084 27.5 22.1
libquantum 256 24.1 13.2
omnetpp 259 19.1 0.2
sphinx 78 12.1 0.2
bt 240 11.2 0.4
bwaves 3794 9.2 5.0
leslie3d 599 8.3 4.0
gems 1663 7.7 5.2
zeusmp 3488 6.3 4.5
We evaluated all workloads from SPEC CPU2006 and selected 12 applications with high L3 MPKI and large footprints. All workloads run in multithreaded mode, where each application is replicated 8 times, with 8 threads running on 8 cores. Therefore, the total memory footprint is roughly 8 times as large as the footprint reported in [33]. Table 2-2 gives basic information for these workloads. In general, we use the first 5 billion instructions to warm up caches, tables, and other data structures, and simulate the next billion instructions to collect performance statistics.
CHAPTER 3
CACHE LOOKASIDE TABLE
In this chapter, we propose our CLT work, which provides an alternative tag path for a large off-chip stacked DRAM cache. We begin by reviewing the current state-of-the-art designs that address the tag problem and the motivation for using a small on-die structure to capture the majority of DRAM cache tags. We then present the detailed design of how the CLT adopts the decoupled sector cache idea to exploit spatial locality as well as save tag space. Finally, we present results supporting our design.
3.1 Background and Related Work
Future high-end servers with tens or hundreds of cores demand high memory bandwidth. Recent advances in 3D stacking technology provide a viable solution to the memory wall problem [13] [34] [11] [35]. Stacking offers a promising venue for low-latency, high-bandwidth interconnect between processor and DRAM dies using through-silicon vias [3]. However, due to physical space constraints, the capacity of this nearby memory is limited and unsuitable to serve as the system main memory [4] [5] [36]. One viable approach is to use this nearby DRAM as the last-level cache for fast access and reduced bandwidth to main memory. Intel's Haswell [1], a fourth-generation core, is an example; it has a 128MB L4 cache built on embedded-DRAM technology.
Designs of large off-die DRAM caches have gained much interest recently [5] [37] [38] [39] [40] [6] [14] [7] [41]. Researchers have noted the large space requirement as well as the access time and power overheads of implementing a tag array for a large DRAM cache [4] [9]. Considering a common cache block size of 64 bytes, if each tag is 6 bytes, the tag array consumes 24MB and 96MB for 256MB and 1GB DRAM caches, respectively. Such a large tag
array is impractical to fit on the processor die. If this tag array is part of the off-die DRAM cache, each access requires an extra trip to fetch the tags.
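The tag-storage arithmetic can be verified directly: tag space equals the number of blocks times the bytes per tag entry.

```python
def tag_array_bytes(cache_bytes: int, block_bytes: int = 64, tag_bytes: int = 6) -> int:
    """Tag storage = (number of blocks) x (bytes per tag entry)."""
    return (cache_bytes // block_bytes) * tag_bytes

MB = 1 << 20
assert tag_array_bytes(256 * MB) == 24 * MB    # 256MB cache -> 24MB of tags
assert tag_array_bytes(1024 * MB) == 96 * MB   # 1GB cache  -> 96MB of tags
```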
There are two general approaches to handle the large tag array of a DRAM cache placed off-die. The first approach is to record large block tags in order to fit the tag array on-die. Zhao et al. [10] explore DRAM caches for CMP platforms. They recognize the tag space overhead and suggest storing all tags of a set in a contiguous DRAM block for fast access. They show that using on-die partial tags and a sector directory achieves the best space-performance tradeoff. However, partial tags are expensive and encounter false-positive situations. CHOP [6] advocates a large block size to alleviate the tag space overhead. It uses a separate filter cache to detect and cache only hot blocks, reducing the fragmentation problem.
The Footprint cache [39] uses large blocks and the sector cache idea [42] [43] [44] [45] to reduce the tag space and the memory bandwidth requirement. In addition, it predicts the footprint within a 2KB sector and prefetches those 64-byte blocks based on the footprint history [46]. Data prefetching is orthogonal to the proposed caching methods and is beyond the scope of this proposal. Nevertheless, it is worth pointing out that the Footprint cache loses cache space for blocks that are not part of the footprint. The Unison cache [40] extends the Footprint idea to even bigger stacked DRAM caches. It moves the sector tag off-chip and uses way-prediction to fetch the tag and the predicted data block in parallel.
The second approach is to keep the conventional block size and use other techniques to alleviate the impact of the extra off-die tag array access. Loh and Hill [4] [47] propose allocating all tags and data blocks of a cache set in a single contiguous DRAM location to improve row-buffer hits. To reduce the miss latency, they use a miss-map directory to identify DRAM cache misses and issue accesses directly to the next lower-level memory. To save space, they record the miss-map at the granularity of a large memory segment. However, when a segment is evicted from the miss-map directory, all blocks in the segment must be evicted from the DRAM cache. Sim et al. [5] suggest speculatively issuing requests directly to main memory when a block is predicted not to be in the DRAM cache. However, this requires handling complicated miss predictions.
To reduce extra off-die tag accesses, we can cache a small number of recently referenced tags on-die. An off-die tag array access is avoided when the tag is found on-die. This simple approach faces two problems. First, a tag is cached on-die only after a request misses the on-die cached tags; it does not take advantage of spatial locality in data accesses. Second, caching tags of individual blocks does not save tag space if we want to maintain the same capacity. Even worse, without a one-to-one mapping between the cached tags and the DRAM data array, a location pointer into the data array is needed for each cached block tag. Meza et al. [48] propose caching the block tags of an entire DRAM cache set as a unit in an on-die directory called Timber. Caching the tags of an entire set as a unit avoids way pointers, and blocks in a set need not be invalidated when the set's tags are evicted from Timber. However, caching all tags in a set does not save any tag space. Moreover, it follows neither the spatial nor the temporal locality principle in applications. The ATCache [38] applies the same idea as Timber with additional prefetching of sets of tags. Other works [49] [50] have proposed caching tags to improve cache access latency in the generic setting of a multilevel SRAM cache hierarchy.
The Alloy cache trades the high hit ratio of a set-associative cache for the fast access time of a direct-mapped cache [9]. Besides sacrificing hit ratio, the Alloy cache relies on a cache-miss predictor [51] to speculatively issue parallel accesses to both cache and memory when a cache miss is predicted, avoiding sequential accesses to the off-die cache and memory. The TagTables [37] applies the page-walk technique to identify large blocks (pages) located in cache.
It allocates the entire page into the same cache set in order to combine adjacent blocks into a
chunk to save space for the location pointers in the page-walk tables.
Our proposed CLT maintains the conventional block size for the off-die DRAM cache. It differs from other proposed approaches in how it avoids off-die tag array accesses. In contrast to TagTables, the Alloy cache, and the Footprint and Unison caches, the role of the CLT is to provide a fast, alternative on-die tag path that covers a majority of cache requests. With off-die full block tags as backup, the CLT can be decoupled from the off-die cache without the inherent complexity of the decoupled sector cache [45]. Unlike Timber and ATCache, the CLT caches on-die tags for bigger sectors, allowing it to capture spatial locality and to save tag space by sharing the sector tags. In addition, unlike the miss-map approach, the CLT can identify both cache hits and cache misses without block invalidation when a sector is replaced from the CLT. Furthermore, different from using partial tags or hit/miss speculation, the CLT maintains precise hit/miss information for all recorded blocks and is non-speculative, bypassing the off-die tag array for a majority of memory requests.
3.2 CLT Overview
3.2.1 Stacked Off-die DRAM Cache with On-Die CLT
Figure 3-1 depicts the block diagram of a CMP system with a nearby off-die stacked DRAM cache and an on-die Cache Lookaside Table (CLT). All requests to the DRAM cache are first filtered through the CLT, which records recently referenced memory sectors. When the sector of a requested block is found in the CLT (a CLT hit, which is distinct from a DRAM cache hit), the off-die tag directory access is removed from the critical cache access path. Either the stacked data array or the next-level DRAM is accessed, depending on the block hit/miss information. The proposed sector-based CLT records large sector tags to save space and to exploit spatial locality for better coverage. If a small CLT can cover a high percentage of the total requests, we can reduce the average memory access latency as well as the off-die bandwidth requirement without putting the entire set of cache tags on-die.
Figure 3-1. Memory hierarchy with stacked DRAM cache.
It is well known that a conventional cache access goes through a tag path to determine a cache hit or miss and a data path to access the data in case of a hit. The cache tag and data arrays maintain topological equivalency, such that matching the address tag in the tag array determines the location of the block in the data array. The original sector cache [42] records large sector tags to save tag space, but maintains physical equivalency with the block-based data array, where all blocks of a sector are allocated to fixed locations in the data array. This wastes cache space on unused blocks.
The decoupled sector cache [45] allocates requested blocks of a sector in any location within the target set of the data array for better cache performance. However, it faces a few inherent difficulties. First, without physical equivalency, it requires a location pointer into the data array for each block in a sector in order to locate the block. Second, due to the topological mismatch between the sector-based tag array and the block-based data array, the location pointers are also used to invalidate the remaining valid blocks when a sector is replaced from the sector tag array. To minimize such invalidations, the number of sector tags recorded in the sector tag array needs to be larger than the number of sectors matching the cache size. Third, the decoupled sector cache also requires a backward pointer from each block in the data array to its parent sector in the sector tag array for updating the validity information when the block is replaced from the data array. This double-pointer requirement, along with the enlarged number of sector tags, defeats the purpose of saving tag space and further complicates the sector-cache design.
The CLT only captures a portion of recently referenced sectors on-die and relies on off-die full block tags to handle the rest. It avoids two critical issues that the decoupled sector cache
encounters. First, the backward pointer from the blocks in the data array to the parent sector in
the CLT can be eliminated. This is due to the fact that with only a portion of the sectors recorded
in the CLT, the index bits to large cache sets can be a superset of the index bits to the CLT.
Although the missed block and the replaced block in a cache set can be from different sectors,
they must be located in the same CLT set. A search in the CLT set can identify the sector where
the evicted block belongs to for updating the validity information.
Second, when a sector is replaced from the CLT, the valid blocks in the sector can remain
in cache as long as all blocks in a sector are allocated in the same cache set. This allocation can
be accomplished by using the low-order bits of the sector address as the cache index bits. When
the sector is referenced later, a search of the block tags in the target cache set can recover all the
valid blocks in the sector. This search is possible because the block tags are maintained in the
cache tag array. Without the block tags, the decoupled sector cache must invalidate remaining
blocks when a sector is evicted. A detailed CLT design is given in Section 3.3.
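The index relationship just described, where the cache set is chosen by low-order sector-address bits (so all blocks of a sector share a set) and the CLT index bits are a subset of the cache index bits, can be sketched as follows. The field widths here are illustrative, not the configuration evaluated later.

```python
SECT_BITS = 4          # 16 blocks per sector (illustrative)
CACHE_SET_BITS = 16    # 64K cache sets (illustrative)
CLT_SET_BITS = 10      # 1K CLT sets, fewer than cache sets (illustrative)

def sector_addr(block_addr: int) -> int:
    """Drop the in-sector block ID to get the sector address."""
    return block_addr >> SECT_BITS

def cache_set(block_addr: int) -> int:
    # Index with low-order *sector* address bits so that all blocks of a
    # sector land in the same cache set.
    return sector_addr(block_addr) & ((1 << CACHE_SET_BITS) - 1)

def clt_set(block_addr: int) -> int:
    # The CLT index bits are the low-order subset of the cache index bits,
    # so an evicted block's sector is always found in the same CLT set.
    return sector_addr(block_addr) & ((1 << CLT_SET_BITS) - 1)
```

Because `clt_set` uses a strict subset of the bits used by `cache_set`, a missed block and the block it replaces, even from different sectors, always map to the same CLT set, which is what makes the backward pointer unnecessary.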
3.2.2 CLT Coverage
We first validate the potential CLT coverage using 12 SPEC2006 CPU workloads. In
Figure 3-2, we plot the accumulated reuse distance curves of the 12 workloads with large (2KB)
block size. We want to show that a small portion of recently used large blocks (sectors) can
indeed cover the majority of cache accesses. The horizontal axis (logarithmic scale) represents the
percentage of the reuse distance with respect to the full stack distance that covers the entire block
references, and the vertical axis is the accumulated percentage of the total blocks that can be
covered. We can observe that by recording 10% of the most recent referenced blocks, over 90%
of the requests can be covered for all workloads except for gcc and milc whose coverages are
82% and 88% respectively. These results support the CLT approach by recording a small portion
of recently reference sectors on-die to provide a fast path for a majority of the DRAM cache
requests.
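The coverage in Figure 3-2 is essentially an LRU stack-distance measurement. A minimal sketch of how such coverage can be computed from a sector-reference trace (illustrative code, not the actual measurement tool used here):

```python
from collections import OrderedDict

def coverage(trace: list, capacity: int) -> float:
    """Fraction of references whose LRU stack distance is below `capacity`,
    i.e. the hits an LRU-managed table of `capacity` sectors would see."""
    stack = OrderedDict()   # keys ordered oldest -> most recently used
    hits = 0
    for sector in trace:
        if sector in stack:
            # Stack distance = number of distinct sectors touched since the
            # previous reference to this one.
            dist = len(stack) - list(stack).index(sector) - 1
            if dist < capacity:
                hits += 1
            stack.move_to_end(sector)
        else:
            stack[sector] = True
    return hits / len(trace)
```

The linear scan makes this O(n) per reference; a production tool would use a tree-based stack-distance algorithm, but the semantics are the same.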
Figure 3-2. Reuse distance curves normalized to the percentage of the maximum distance.
3.2.3 Comparison of DRAM Cache Methods
There are several key aspects in comparing DRAM cache designs, including the Alloy cache, ATcache, TagTables, and the proposed CLT, as summarized in Table 3-1. For comparison purposes, we also include an impractical method that keeps the entire set of cache block tags on-die. We assume 64-way set-associativity for all caches. Note that the Footprint cache follows the original sector cache design plus prefetching of the blocks in the predicted footprint of each sector. Since prefetching techniques (e.g., a streaming prefetcher) benefit all proposed methods and are orthogonal to the cache design, we omit the prefetch aspect in our comparison.
The goal of all of these proposals is to avoid off-die tag accesses, so the storage requirements, including on-die SRAM, stacked DRAM, and regular memory, must be considered. First, different methods require different amounts of on-die SRAM. Alloy uses a small on-die table to predict DRAM cache misses for issuing parallel accesses to memory. ATcache caches tags of recently referenced cache sets. TagTables caches the page-walk tables in the on-die L3 on demand. CLT records tags of recently referenced sectors, along with valid, modify, and way-pointer bits for each block, to cover tag accesses for a majority of requests. For a fair comparison, we keep the on-die SRAM size fixed by adjusting the L3 size for all methods; for example, we deduct the CLT size from the L3 cache size in our evaluation.
Table 3-1. Comparison of different DRAM cache designs.
Method | On-die SRAM (constant) | Stacked DRAM (constant) | Tag-data mapping | Entropy on cache set | Cache placement
Block tag on-die | block tag | data | 1-1 map | block indices | 64-way LRU
Alloy cache | predict table | tag + data | 1-1 map | block indices | direct-map
TagTables | cached page-walk tables | data | 1-many decoupled | sector indices | chunk placement
CLT | CLT (tag+pointer) | tag + data | 1-many decoupled | sector indices | 64-way LRU
Next, for the stacked DRAM requirement, Alloy, ATcache and CLT must maintain tags along with the data blocks in the stacked DRAM, while TagTables does not keep separate block tags. Therefore, we reduce the data array size proportionally for Alloy, ATcache and CLT in our evaluation. Third, it is important to note that TagTables creates page-walk tables in main memory. Since we do not evaluate I/O activity, we do not impose any penalty for this extra memory requirement.
Physical mapping of the on-die tags and their data blocks is another key aspect. Alloy has a simple direct-mapped topology without separate on-die tags. ATcache does not alter the mapping of the on-die tags and their data. TagTables and CLT share a sector tag among multiple data blocks to save tag space. CLT records only a portion of the sector tags; as a result, a location pointer for each block in a sector is needed to locate the block in the data array. TagTables also requires the location pointers. It further limits itself to four pointers per sector (page) by combining adjacent blocks into physical chunks using more restricted block placement and replacement in the cache data array.
With respect to the fetch bandwidth requirement, all methods fetch 64-byte blocks from the off-die stacked DRAM. However, Alloy needs to fetch the 64-byte data block along with its tag and miscellaneous bits.
Last but not least, the effort of avoiding off-die tag accesses can affect cache performance. The first impact is on the entropy of indexing the cache sets. It is well known that using the low-order block address bits to hash to the cache sets provides good entropy, distributing blocks across all cache sets. However, due to the restricted mapping in CLT, as well as the need to combine adjacent blocks into chunks in TagTables, these approaches use the low-order bits of the sector address to determine the cache sets. Depending on the sector size, the cache indices are taken from higher-order address bits than the block indices. Using higher-order bits for indexing the cache may adversely impact the entropy of hashing blocks to the cache sets and create more conflicts.
These methods also differ in cache placement and replacement policies. ATcache and CLT maintain 64-way set-associativity in each set with a pseudo-LRU replacement policy decoupled from the topology of the sector tags in the CLT. Alloy uses a direct-mapped design, which may suffer lower hit ratios. TagTables relies on special placement and replacement mechanisms to create big chunks, since each sector can record only up to four chunks.
Figure 3-3. Coefficient of variation (CV) of hashing 64K cache sets using different indices.
In order to understand the impact on the entropy of hashing memory requests across cache sets, we show the coefficient of variation (CV) for five different sector-based cache indices in Figure 3-3. The CV is the ratio of the standard deviation to the mean of the number of requests hashed to each cache set. In this simulation, we assume 64K sets, hence 16 index bits, and a block size of 64 bytes. In the figure, sector_n indicates that the least-significant index bit starts log2(n) bits to the left of the least-significant bit of the block address. For a 64-byte block size, the least-significant 6 bits are the block offset. The
index bits start from the 7th bit for sector_1, the 10th bit for sector_8, and so on. In other words, n consecutive blocks are allocated to the same set. We can observe that the CV increases significantly as n grows for all workloads except milc. The workloads are sorted from left to right by the significance of their variation. With large n, this uneven distribution of memory requests across cache sets increases the chance of set conflicts and degrades cache performance. Milc exhibits special indexing behavior: allocating consecutive blocks to the same set actually reduces its CV. However, since milc has the smallest CV to begin with, the impact should be minimal.
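The CV metric used in Figure 3-3 can be reproduced with a short sketch. The address stream and parameters below are illustrative, not the SPEC traces used for the figure.

```python
import statistics
from collections import Counter

def set_index_cv(block_addrs: list, sector_blocks: int, num_sets: int) -> float:
    """Coefficient of variation (stdev / mean) of requests per cache set when
    the set index is taken from the sector address, i.e. sector_blocks
    consecutive blocks map to the same set (sector_1 = plain block indexing)."""
    shift = (sector_blocks - 1).bit_length()   # log2(sector_blocks)
    counts = Counter((a >> shift) % num_sets for a in block_addrs)
    per_set = [counts.get(s, 0) for s in range(num_sets)]
    mean = statistics.mean(per_set)
    return statistics.pstdev(per_set) / mean if mean else 0.0
```

A perfectly uniform stream yields a CV of 0; the more requests pile onto a few sets, the larger the CV.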
Figure 3-4. DRAM cache MPKI using sector indexing.
Figure 3-4 shows the impact of the sector-based indexing schemes on MPKI. We simulate a 256MB cache with 64-way set-associativity and 64-byte blocks. The results demonstrate that with a large sector size, allocating all blocks of a sector in the same cache set degrades cache performance significantly, especially when the sector size is 64. However, with a moderate sector size (e.g., 16 blocks), the impact is manageable. Among the workloads, mcf,
sphinx3, omnetpp, gcc, and leslie3d are impacted most, consistent with the CV results in Figure 3-3. These results suggest that a moderate sector size enables a decoupled CLT without compromising much cache performance.
3.3 CLT Design
An example of a 3-way set-associative CLT with 64-byte blocks and 16 blocks per sector is depicted in Figure 3-5. According to the size and set-associativity of the CLT, a few low-order bits of the sector address determine the CLT set. The remaining high-order tag bits are matched against the sector tags recorded in the set. In this example, the address field labeled sect represents the block ID within a sector and is used to look up the recorded block information. Each sector has a valid bit and 16 groups of valid (v), modify (m), and location pointer (way) bits for its 16 blocks. Given that the cache set index is part of the address, only the cache way is needed in the location pointer; for example, a 64-way DRAM cache requires 6 bits to record the way. Based on whether the sector is valid in the CLT and whether the requested block is located in the DRAM cache, the cache access works as follows.
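Before stepping through these cases, the address decomposition just described can be sketched as follows; the index width is illustrative, not the evaluated configuration.

```python
OFFSET_BITS = 6    # 64-byte blocks
SECT_BITS = 4      # 16 blocks per sector
INDEX_BITS = 10    # illustrative CLT set count (1K sets)

def split_address(addr: int):
    """Split a physical address into (tag, index, sect, offset) fields,
    mirroring the CLT lookup described above. Widths are illustrative."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    sect = (addr >> OFFSET_BITS) & ((1 << SECT_BITS) - 1)
    index = (addr >> (OFFSET_BITS + SECT_BITS)) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SECT_BITS + INDEX_BITS)
    return tag, index, sect, offset
```

The `index` field selects the CLT set, `tag` is matched against the recorded sector tags, and `sect` picks the block's valid/modify/way group within the matching sector entry.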
First, when a sector tag match is found and the target sector is valid (a CLT hit), the 4-bit
block ID (sect) selects the corresponding hit/miss (same as a valid bit) and the way pointer for
the block. In case of a cache hit, the way pointer is used to access the data block from off-die
stacked DRAM data array. The critical off-die tag path is bypassed.
Second, on a CLT hit where the hit/miss bit indicates that the block is not located in the DRAM cache, the request is issued to the conventional DRAM (main memory); the off-die tag path is bypassed as well. When the missed block returns from the conventional DRAM, the block data and tag are stored into the DRAM cache at the LRU position given by the on-die replacement logic. A writeback is necessary if the evicted block is dirty. (For simplicity, we omit the dirty bits in the drawing.) Meanwhile, the CLT is updated by turning off the hit/miss bit for
the evicted block and recording a hit and the way location for the new block. Note that the
evicted block may not be in the same sector as the missing block. However, they must be located
in the same CLT set since the CLT index bits are a subset of the DRAM cache index bits. By
matching the cache tag and the remaining cache index bits with proper sector tag bits, the LRU
block in the CLT can be identified.
Figure 3-5. CLT design schematics.
Third, when a request misses the CLT, an off-die tag access is necessary to bring in all
the tags in the target cache set for determining hit/miss status and the way location of the
requested block. Depending on whether the requested block is located in cache, the remaining
cache and memory accesses are the same as that when the target sector is valid in the CLT. In
order to update the CLT for the new sector, cache tag comparison logic is extended to allow
matching of the tags in the target cache set with all other block tags in the sector. For those
blocks in the set, hit/miss status bits are set and their way pointers are recorded. For other blocks
that are missing, their corresponding hit/miss indicator is recorded as a miss. The new sector tag
and its associated hit/miss and location information replace the LRU sector in the CLT. Note that there is no cache invalidation for the evicted sector.
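The three lookup outcomes described above can be summarized in a short sketch. The data layout and names are ours, for illustration only.

```python
from dataclasses import dataclass

@dataclass
class SectorEntry:
    tag: int     # sector tag
    hit: list    # per-block: is the block present in the DRAM cache?
    way: list    # per-block: way pointer into the target cache set

def clt_lookup(clt_set: list, sector_tag: int, block_id: int):
    """Return ('hit', way) when the block is in the DRAM cache,
    ('miss', None) when the sector is tracked but the block is absent,
    and ('clt_miss', None) when the sector is not tracked at all;
    only the last case pays the off-die tag-array access."""
    for entry in clt_set:
        if entry.tag == sector_tag:
            if entry.hit[block_id]:
                return "hit", entry.way[block_id]
            return "miss", None
    return "clt_miss", None
```

Both the 'hit' and 'miss' outcomes bypass the off-die tag array entirely: one goes straight to the stacked data array, the other straight to main memory.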
Figure 3-6. CLT operations in handling memory requests. (A, B, and C are three sectors, each with four blocks; a few blocks are located in cache initially, and the circled ones are moved in due to cache misses. All three sectors are recorded in the same 2-way set-associative CLT set. 'Condition' indicates the CLT and cache hit/miss status for each request.)
In Figure 3-6, we illustrate the CLT operations in handling a sequence of DRAM cache
requests, A0, A2, A3, B1, B2, B3, C1, and C2, where A, B, C represent three different sectors
and each sector has 4 64-byte blocks as indicated by the subscript. The least-significant 6 address
bits are the block offset and the next 2 bits define the block IDs within a sector. The target sets of both the CLT and the cache are determined by the low-order bits of the sector address, as illustrated in the figure, so that all blocks of a sector are allocated to the same cache set. Since the number of cache
blocks is several times larger than the number of sectors in the CLT, the cache index bits are a
superset of the sector index bits. In this example, we assume sectors A, B, and C are hashed to
the same CLT set, but allocated to different cache sets. Several blocks in A, B, and C are located
in the 8-way DRAM cache initially. Note, the blocks marked by a circle are moved into the
cache after a miss occurs. For simplicity, we assume the CLT has 2-way set-associativity. We
also assume sector B is already recorded in the CLT with its sector tag, two valid blocks B1 and
B3 with location pointers ‘001’, ‘101’, and two invalid blocks B0 and B2.
When A0 is issued, it misses the CLT. All tags in the target DRAM cache set where A0 is
located are fetched and compared with all the block tags in A. A match of requested A0 is found
and so is A2, while A1 and A3 are missing. The request to the data array to fetch A0 is then
issued. The CLT is updated by recording sector tag for A and setting valid and way bits for A0,
A2 and invalid for A1, A3. When A2 is processed, it hits the CLT with a valid block indicator
and the location pointer is ‘100’. Therefore, the data block can be fetched from the correct data
array location directly. Next, A3 is also a CLT hit, but the block is invalid. A request is issued to
the conventional DRAM to bring in the missing block. According to the on-die pseudo-LRU
cache replacement logic, A3 is placed in way 6 to replace the LRU block. The DRAM cache tag
array and data array are updated accordingly. Meanwhile, the valid bit and location pointer are
updated for A3, and the valid bit for the replaced block is turned off if that block was valid.
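The tree-based pseudo-LRU replacement used to pick the victim way can be sketched as follows. This is a generic software illustration of the technique (the class name and interface are ours, not the dissertation's hardware logic); for a 64-way set it uses 63 tree bits, matching the 6-level tree sized later in this chapter.

```python
# Tree-based pseudo-LRU for one N-way cache set (N a power of two).
# bits[] holds the internal nodes of an implicit binary tree; each bit
# points toward the "colder" half of the set.

class TreePLRU:
    def __init__(self, ways):
        assert ways > 1 and ways & (ways - 1) == 0
        self.ways = ways
        self.bits = [0] * (ways - 1)      # 63 bits for a 64-way set

    def victim(self):
        """Follow the tree bits down to the pseudo-LRU way."""
        node = 0
        while node < self.ways - 1:       # internal nodes are 0..ways-2
            node = 2 * node + 1 + self.bits[node]
        return node - (self.ways - 1)     # leaf index -> way number

    def touch(self, way):
        """On an access, flip each ancestor to point away from `way`."""
        node = way + self.ways - 1        # leaf index in the implicit heap
        while node > 0:
            parent = (node - 1) // 2
            self.bits[parent] = 1 if node == 2 * parent + 1 else 0
            node = parent
```

For example, after touching ways 0 through 3 of a 4-way set in order, the victim is way 0; pseudo-LRU approximates, but does not exactly track, true LRU order.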
When B1 comes, it hits the CLT as a valid block, hence B1 can be fetched from the data
array directly with the location pointer ‘001’. The MRU/LRU position in the CLT is updated.
Next, B2 is a CLT hit, but a cache miss, which is handled the same way as A3. B2 is moved to
the DRAM cache set in way 3 afterwards. B3 hits both the CLT and the cache and can be treated
the same as B1. Next, C1 misses the CLT, but hits the cache. It can be handled the same as A0, where an off-die fetch to bring in all tags in the target DRAM set is necessary. In addition, to
record the new sector C in the CLT, sector A must be evicted to make room for sector C. Blocks
A0, A2 and A3 remain valid in cache. The update of C in the CLT is the same as the update of A
when A entered the CLT. Finally, C2 hits both the CLT and the cache and can be handled
accordingly.
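The walkthrough above can be condensed into a small behavioral sketch. This is an illustrative model only (the function and field names are ours); it omits the data-array replacement and the MRU/LRU update, and `fetch_tags_off_die` stands in for the single off-die trip that reads all tags of the target cache set.

```python
# Behavioral sketch of CLT lookup. clt_set is a list of CLTEntry objects
# ordered LRU -> MRU (at most 2 entries, matching the 2-way CLT example).
# fetch_tags_off_die(sector_tag) returns {block_id: way} for every block
# of the sector currently resident in the DRAM cache.

BLOCKS_PER_SECTOR = 4          # the example sectors A, B, C have 4 blocks

class CLTEntry:
    def __init__(self, sector_tag):
        self.sector_tag = sector_tag
        self.valid = [False] * BLOCKS_PER_SECTOR
        self.way = [None] * BLOCKS_PER_SECTOR

def clt_lookup(clt_set, sector_tag, block_id, fetch_tags_off_die):
    """Return (condition, way) for one request and update the CLT."""
    for entry in clt_set:
        if entry.sector_tag == sector_tag:         # CLT hit
            if entry.valid[block_id]:              # fetch data directly
                return "CLT hit, cache hit", entry.way[block_id]
            return "CLT hit, cache miss", None     # fill from memory, then update
    # CLT miss: fetch all tags of the target set in one off-die trip and
    # record every resident block of the sector at once.
    entry = CLTEntry(sector_tag)
    for blk, way in fetch_tags_off_die(sector_tag).items():
        entry.valid[blk], entry.way[blk] = True, way
    if len(clt_set) == 2:                          # evict LRU sector entry;
        clt_set.pop(0)                             # its blocks stay in cache
    clt_set.append(entry)
    if entry.valid[block_id]:
        return "CLT miss, cache hit", entry.way[block_id]
    return "CLT miss, cache miss", None
```

Replaying the start of the example: A0 is a CLT miss but a cache hit (one off-die tag fetch installs sector A), and the subsequent A2 then hits both the CLT and the cache.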
3.4 Performance Evaluation
3.4.1 Differences among Related Proposals
Table 3-2 summarizes the on-die SRAM space and latency for Alloy, ATcache,
TagTables, and CLT, as well as L3 cache size and DRAM cache data array size. The MAP-I
cache miss predictor is used in Alloy with one cycle access latency and 768-byte SRAM space.
TagTables takes up L3 space for caching the page-walk tables. By properly partitioning the TagTables by address, the metadata is allocated on the same L3 cache bank that triggers the tag access. As a result, the interconnect latency can be avoided. We use Cacti 6.5 [52] to estimate a
latency of 8 cycles for accessing the tag table, which is the same as that used in the TagTables
paper [37].
For CLT, the sector tag plus 16 groups of valid, modify, and way pointers account for 20 bytes per sector. With 4K CLT sets, each with 20 ways, the total number of sectors is 80K. Therefore, the CLT space is 80K × 20 bytes = 1.6 MB. In addition, we use a 6-level binary tree (63 bits) to implement a pseudo-LRU policy for the 64-way cache. The space requirement is 64K sets × 63 bits = 504 KB. Therefore, the total on-die SRAM for CLT is close to 2MB. We
use the same policy to allocate CLT partitions on the same L3 cache bank which triggers the
CLT access to avoid interconnect latency. With smaller 20 bytes of data, the estimated CLT
latency is 6 cycles. ATcache requires on-die Timber, pseudo-LRU logic, and a tag prefetcher.
Since each cache set consists of 64 4-byte tags, Timber is 12-way with 512 sets for a total of 1.54MB of tag space. In addition, each entry needs one bit for the prefetching logic, which costs 12 × 512 / 8 = 768 B.
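The SRAM sizing arithmetic above can be checked directly; a quick sketch (using KB = 1024 bytes):

```python
# Verifying the on-die SRAM budget arithmetic quoted above.
KB = 1024

# CLT: 4K sets x 20 ways = 80K sectors, 20 bytes per sector entry.
clt_bytes = (4 * KB * 20) * 20
assert clt_bytes == 1600 * KB                # the 1600KB tag figure

# Pseudo-LRU: one 63-bit tree per 64-way set, 64K sets.
plru_bytes = 64 * KB * 63 // 8
assert plru_bytes == 504 * KB                # 504KB

# ATcache prefetch logic: one bit per Timber entry (12 ways x 512 sets).
prefetch_bytes = 12 * 512 // 8
assert prefetch_bytes == 768                 # 768 B

print((clt_bytes + plru_bytes) // KB, "KB total for CLT")   # 2104 KB, close to 2MB
```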
CLT only records recently referenced sectors and must fetch all cache tags from the
stacked DRAM when a CLT miss occurs. For a 256MB cache with 64-byte blocks and 64-way set-associativity, each set has 64 blocks. Each block has a 30-bit address tag, a valid bit, and a modify bit, for a total of 4 bytes. Therefore, a CLT miss requires fetching 4 tag blocks of 256 bytes in total. Note that a 30-bit address tag can accommodate a 52-bit physical address. To overlap tag accesses,
four tag blocks are allocated in different banks in the stacked DRAM. ATcache requires fetching
4 tag blocks on a miss too. It also includes a tag prefetcher as described in [38].
Table 3-2. Differences among the four designs.

Design      SRAM size                                        On-die latency (cycles)   L3 cache size                  Stacked DRAM cache data array
Alloy       768 bytes                                        1                         8MB, 16-way                    240MB, direct-mapped
ATcache     2MB (1.54MB + 768B prefetch + 504KB pseudo-LRU)  6                         6MB, 12-way                    240MB, 64-way, pseudo-LRU
TagTables   2MB metadata in L3                               8                         8MB, 16-way (metadata in L3)   256MB, 64-way, chunk placement
CLT         2MB (1600KB tag + 504KB pseudo-LRU)              6                         6MB, 12-way                    240MB, 64-way, pseudo-LRU
For Alloy, we follow the operations described in [9]. The MAP-I cache-miss predictor is implemented to predict cache misses for parallel accesses to the DRAM cache and memory. Each core has a 256-entry table of 3-bit hit/miss counters. The address of the L3-miss-causing instruction is hashed using folded-xor [53] to index the table. In Alloy, the extra tag fetched along with the data block from the stacked DRAM is charged one burst cycle.
For TagTables, the page-walk tables are dynamically created in main memory during the
simulation. We do not charge any penalty for creating the page-walk tables. The tables are cached in L3 on demand. Extra latency occurs when the needed entry in a table is not located in L3. A
fetch to the main memory is issued to bring back the block with the needed information. We
follow the same procedure in [37] in managing the shared L3 cache for caching the page-walk
tables. The page entries recorded in the leaf level are saved in the intermediate level whenever
possible to shorten the level of the page walk. We implement the same algorithm for allocating
and combining blocks into chunks with the special cache placement and replacement mechanisms.
TagTables allocates 64 blocks in a page into the same set, which hurts the entropy of
hashing blocks across the cache sets. In addition, the limit of four chunks for each page may
create holes (empty frames) in a cache set and underutilize the DRAM cache. Therefore, we also
evaluate a TagTables scheme with 16 blocks per page. We reduce the page offset from 6 bits to 4
bits and shift the remaining higher-order bits to the right. As a result, it may encounter a level-4
table with 2-bit index in 48-bit physical address format. We keep 4 chunks per entry at the leaf
level. The rest of the design and operations stay the same.
3.4.2 Performance Results
In this section, we first compare the speedup of five DRAM cache designs, Alloy,
ATcache, TagTables_64, TagTables_16, and CLT. We show the average memory access times
for the tag and the data, which contribute to the overall execution time. Multiple factors that impact the memory access time, such as the number of DRAM cache misses, the on-die tag hit/miss ratios for ATcache, TagTables and CLT, and Alloy's miss-predictor accuracy, are also discussed.
In Figure 3-7, we plot the speedups of CLT with respect to Alloy, ATcache,
TagTables_64, and TagTables_16. CLT demonstrates significant performance advantage over
the other four methods. On average, CLT improves performance by 4.3%, 12.8%, 12.9% and 14.9% respectively over Alloy, ATcache, TagTables_64 and TagTables_16. The improvement of CLT
over Alloy is rather moderate. CLT performs worse than Alloy for omnetpp due to its sector-indexing. In comparison with ATcache, CLT is able to gain 12-24% speedup for all workloads
except mcf and omnetpp. This is because CLT can capture most DRAM cache accesses and the
sector-indexing does not hurt DRAM cache performance much as shown in Figure 3-4. Both
TagTables variants perform especially poorly for mcf, lbm, and milc. For TagTables_64, the CLT improvements are 44.8%, 70.6%, and 36.4% for these three workloads, while TagTables_64 shows a slight edge over CLT for omnetpp, leslie3d and zeusmp.
The diversified performance impacts on individual workloads are caused by multiple
factors. An overall speedup analysis is further complicated by the fact of exploiting MLP
(memory-level parallelism) in the Epoch model. During the timing simulation, a cadence of
memory requests is issued in each Epoch. The latency is dominated by the DRAM cache misses
in each Epoch. Therefore, the DRAM cache hit latency plays a small role. On the other hand, the
hit latency becomes the decisive factor when there is no cache miss in an Epoch. This mix of performance factors exists even with a precise processor model. In the following, we analyze the important parameters without detailing the MLP factor.
The most decisive performance factor is the average memory access time of the L3
misses. In Figure 3-8, we plot the average access latencies separated by the tag and the data
segments where the total access time is dominated by the data latency. In general, these average
latencies are consistent with the speedups shown in Figure 3-7. CLT has the shortest average
latency followed by Alloy, ATcache and both TagTables. As expected, Alloy has the shortest tag
latency since it only pays one-cycle predictor delay. However, in case of a false-positive miss
prediction, the tag latency includes a sequential DRAM cache access for fetching the tag.
ATcache has the longest tag latency for two reasons: (1) recording individual block tags does not save space, which lowers the Timber hit ratio; and (2) sequential prefetching of set tags generates high traffic, since each set of tags occupies 4 blocks. TagTables_16 has longer tag latency than
TagTables_64 in accessing the tags through the page-walk tables. When the page size reduces to
16 blocks, more active pages are requested, causing more L3 misses.
Figure 3-7. CLT speedup with respect to Alloy, ATcache, TagTables_64, and TagTables_16.
Figure 3-8. Memory access latency (CPU cycles).
Multiple factors contribute to the data latency. In Table 3-3, we analyze three performance parameters. The first and most important parameter is the DRAM cache performance. Based on the trace-driven Epoch model, we measure DRAM cache performance using misses per thousand requests (MPKR), where each request is an L2 miss. Similar to MPKI, MPKR is closely associated with the execution speedup estimation, with higher
MPKR causing longer average access time. In general, ATcache has the lowest MPKR due to its
64-way set-associativity. CLT is close to ATcache for all workloads except mcf and omnetpp. Its
moderate sector size (i.e. 16 blocks per sector) does not degrade cache performance much. Alloy
suffers higher MPKR, hence longer data latency, due to its direct-mapped design. The TagTables variants show much higher MPKR than CLT for mcf, gcc, lbm, and milc, hence lower speedups as shown in Figure 3-7. Although omnetpp and sphinx3 also show a large MPKR gap between CLT and TagTables, their much smaller absolute MPKRs lessen the impact.
The high MPKR of TagTables is due to two reasons: the negative impact of the sector-indexing scheme, and the restricted chunk-based placement and replacement. As observed in the second parameter, the DRAM cache occupancy, the restriction of 4 chunks per page creates empty space in the cache sets. For example, the average occupancy for gcc is only 75% for TagTables_64. In other words, 25% of the cache space is wasted, causing more misses. By
reducing the page size from 64 to 16, we can alleviate both the negative sector-indexing effect
and the empty space in the cache data array. But, TagTables_16 encounters higher L3 misses for
accessing the page-walk tables. Note that Alloy, ATcache, and CLT have 100% cache
occupancy.
Table 3-3. Comparison of L4 MPKR, L4 occupancy, and predictor accuracy.

                 DRAM cache MPKR                  Occupancy (%)   Predictor accuracy (%)
           Alloy  ATcache  TT-64  TT-16  CLT      TT-64  TT-16    Alloy  ATcache  TT-64  TT-16  CLT
mcf        198    71       228    113    109      91     95       83     64       88     73     8
gcc        484    410      520    441    413      75     84       81     63       95     84     8
lbm        307    275      504    382    278      66     81       99     69       98     93     9
soplex     693    655      668    668    664      98     99       98     61       97     91     9
milc       747    658      722    714    665      99     87       89     55       90     79     7
libquantum 515    456      470    448    465      100    90       99     62       98     94     9
omnetpp    161    71       181    123    100      73     82       79     67       83     69     7
sphinx3    85     7        95     7      7        66     99       96     70       97     92     9
bwaves     429    411      422    411    412      100    100      99     58       98     93     9
leslie3d   438    394      459    402    408      99     99       95     59       98     93     9
gems       447    427      484    442    429      90     92       99     65       98     93     9
zeusmp     564    545      565    563    559      99     100      99     66       97     88     9
The third performance factor is the predictor accuracy, which covers the accuracy of the Alloy miss predictor, the CLT hit ratio, the TagTables tag hit ratio in L3, and the cached-tag hit ratio for ATcache. In general, all schemes show high hit ratios except ATcache. TagTables_16 has a lower tag hit ratio, causing higher average tag access latency.
In summary, mcf, lbm, and milc have the highest MPKRs for the TagTables variants. Together with the wasted cache space and the L3 misses for tags, CLT outperforms TagTables by a large margin. For milc, the large MPKR gap between CLT and Alloy helps CLT outperform Alloy. Alloy outperforms CLT by 11.8% for omnetpp, which has a small MPKR, so the DRAM cache misses play an insignificant role; the difference in memory access latency is due to high CLT misses and longer tag latency. ATcache suffers the most due to its low Timber hit ratios.
3.4.3 Sensitivity Study and Future Projection
In this section, we show the results of two sensitivity studies, CLT coverage and sector
size. For CLT coverage, we change the total SRAM space from 2MB to 1MB (low coverage)
and 3MB (high coverage) and adjust the L3 size to 7MB and 5MB accordingly. With 1MB
SRAM, the CLT size is reduced to (2K × 13) × 20 bytes = 520 KB with 2K sets, 13 ways, and 20 bytes per entry. The pseudo-LRU still costs 504 KB. On the other hand, with 3MB SRAM, the CLT increases to (4K × 30) × 20 bytes = 2.4 MB.
In Figure 3-9, we show the change of the total execution cycles for the new coverages
with respect to the original CLT, which utilizes 2MB SRAM space. On average, the low coverage shows about a 9% increase in total cycles, while the high coverage shows about a 1% increase. The low-coverage option provokes more CLT misses and degrades performance; a bigger L3 helps little in this case. Mcf, omnetpp and sphinx3 show 20-26% increases in execution cycles with low CLT coverage, due mainly to the significant increase in CLT misses.
On the other hand, the high coverage relinquishes more L3 space to build a bigger CLT. Since the CLT hit ratio is already very high for most workloads, further improvement is limited, and the increase in L3 misses hurts the overall performance for most workloads. Omnetpp shows about a 3% cycle reduction due to the improvement of its low CLT hit ratio (79%).
We also study the cycle change for smaller (8 blocks) and bigger (32 blocks) sector sizes.
As shown in Figure 3-10, the small and big sector sizes increase the total cycles by about 7% and 3%, respectively. In this study, we keep the 2MB SRAM space for the CLT. We need to adjust the number of sectors recorded in the CLT to utilize the available space, since the number of way pointers per sector changes from 16 to 8 or 32 for the respective sector sizes. Although the 8-block sector alleviates the sector-indexing effect, the CLT coverage is reduced since each sector can only record 8 blocks. It also lowers the advantage of spatial locality, since each CLT miss can only record 8 adjacent blocks in the CLT. The impact of the bigger sector is the opposite: although the CLT coverage is better with more spatial locality exploited, allocating 32 blocks into the same set hurts the cache performance. Among the workloads, mcf and gcc degrade the most with the small sector size, while mcf and omnetpp suffer the most with the large sector size.
Figure 3-9. IPC change for different CLT coverage.
Figure 3-10. Execution cycle change for different sector size in CLT design.
3.4.4 Summary
We present a new caching technique to cache a portion of the large tag array for an off-
die stacked DRAM cache. Due to its large size, the tag array is impractical to fit on-die, hence
caching a portion of the tags can reduce the need to go off-die twice for each DRAM cache
access. In order to reduce the space requirement for cached tags and to obtain high coverage for
DRAM cache accesses, we proposed and evaluated a sector-based Cache Lookaside Table (CLT)
to record cache tags on-die. CLT reduces space requirement by sharing a sector tag for a number
of consecutive cache blocks and uses a location (way) pointer to locate the blocks in off-die
cache data array. The large sector can also take advantage of spatial locality for
better coverage. In comparison with the Alloy cache, ATcache and the TagTables approaches,
the average improvements are in the range of 4-15%.
CHAPTER 4
RUNAHEAD CACHE MISSES USING BLOOM FILTER
In this chapter, we present the work of using a Bloom Filter to filter out L3 cache misses and issue requests off-die early. We first introduce related work on Bloom Filter applications as well as cache miss identification. We then present the timing analysis of using the Bloom Filter, followed by the proposed indexing scheme that handles dynamic updates of the L3 cache contents when using a Bloom Filter. In the end, we present results that demonstrate our idea.
4.1 Background and Related work
Membership queries using a Bloom Filter have been explored in many architecture, database, and network applications [54][55][56][57][58][59][14][60]. In [60], a cache-miss BF based on partial or partitioned block addresses is proposed to filter cache misses early in the processor pipeline. The early cache-miss filtering helps schedule load-dependent instructions to avoid execution pipeline bubbles. To reduce cache coherence traffic, RegionScout [59] was
used to dynamically detect most non-shared regions. A node with a RegionScout filter can
determine in advance that a request will miss in all remote nodes, hence the coherence request
can be avoided. A vector Bloom Filter [14] was introduced to satisfy quick search of large
MSHRs in the critical execution path without the need for expensive CAM implementation. A
counting Bloom Filter, called a summary cache, that handles dynamic membership updates is presented in [56]. In this approach, each proxy keeps a summary of the internet cache directories and uses the summary to check for potential hits, avoiding sending useless queries to other proxies. In
[57], a counting Bloom Filter is used as a conflict detector in virtualized Transactional Memory
to detect conflicts among all the active transactions. Multiprocessor deterministic replay was introduced in [58], in which a replayer creates an equivalent execution despite inherent sources of nondeterminism that exist in modern multicore computer systems. They use write and read Bloom Filters to track the current episode's write and read sets. A good early survey paper on
network applications using Bloom Filters and the mathematical basis behind them is reported in
[54]. A Bloom-filter-based semijoin algorithm for distributed database systems is in [55]. This
algorithm reduces the communication costs of processing a distributed natural join as much as possible with a filter approach.
A closely related work by Loh and Hill [4] suggested that the block residency for DRAM
cache can be recorded in an on-die structure called MissMap. As described in Section 3.1, off-die
trips to access the DRAM cache can be avoided if the block recorded in the MissMap indicates a
miss. To save space, the MissMap records the block residency information for a large
consecutive segment. However, when the segment is evicted from the miss-map directory, all
blocks in the segment must be invalidated from the DRAM cache in order to maintain the precise
residency information. To avoid such invalidations, Sim et al. [19] suggested speculatively issuing requests directly to main memory if the block is predicted not to be in the DRAM cache. However, significant complexity must be handled for mispredictions.
Xun et al. [61] observed the need for counters to filter cache misses. To avoid the counters, they proposed delaying updates to the BF array for evicted blocks. Instead, they trigger a periodic recalibration of the BF array to reconcile it with the correct cache contents. This delayed-recalibration method increases the chance of false positives and incurs time and power overheads for the recalibration.
4.2 Memory Hierarchy and Timing analysis
Considering a BFk for cache level k, we can adjust the Average Memory Access Time
(AMAT) as follows:
AMAT = (1 − BFrate_Lk) × (HitTime_L1 + Σ_{i=1}^{k−1} (MissRate_Li × HitTime_L(i+1)))
       + BFrate_Lk × ((BFtime + HitTime_Lk) + Σ_{j=k}^{n} (MissRate_Lj × HitTime_L(j+1)))
where 𝐵𝐹𝑟𝑎𝑡𝑒Lk is the ratio of cache misses filtered by the BFk at level k. When a cache
miss is identified, the extra delays of hit times through levels 1 to k are avoided. Only the delay
(BFtime) of accessing the BFk is added to access cache k+1 up to the DRAM memory. This
formula also shows that using BFs at multiple levels overlaps the benefits of bypassing higher levels of caches. For example, if both BFi and BFi+1 are implemented and used when the memory
address becomes available, BFi+1 can only save the hit time at cache level i+1. In the base
memory hierarchy design shown in Figure 1-2, we will focus on a new BFL3 since the sector-
based L4 tags are on the processor die with small latency. Furthermore, the large L4 size requires
a large BFL4 to filter L4 misses for achieving a small false-positive rate.
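The adjusted AMAT formula above can be written as a small helper. This is an illustrative sketch (the function name and example numbers are ours): `hit[i]` is HitTime of level i with `hit[n+1]` the memory access time, and `miss[i]` is MissRate of level i, both 1-indexed with a dummy entry at index 0.

```python
# Adjusted average memory access time with a Bloom filter BF_k at level k,
# following the formula above. n cache levels; level n+1 is main memory.

def amat_with_bf(hit, miss, k, bf_rate, bf_time):
    n = len(miss) - 1                      # miss[1..n], hit[1..n+1]
    # Requests not filtered traverse the hierarchy normally.
    normal = hit[1] + sum(miss[i] * hit[i + 1] for i in range(1, k))
    # Filtered requests skip the hit times of levels 1..k-1,
    # paying only the BF delay.
    filtered = (bf_time + hit[k]) + sum(miss[j] * hit[j + 1]
                                        for j in range(k, n + 1))
    return (1 - bf_rate) * normal + bf_rate * filtered

# Hypothetical example: 3 cache levels, BF at L3, half the requests filtered.
hit = [None, 4, 12, 30, 150]               # L1, L2, L3, memory (cycles)
miss = [None, 0.1, 0.3, 0.5]               # per-level miss rates
print(amat_with_bf(hit, miss, k=3, bf_rate=0.5, bf_time=5))
```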
Figure 4-1. Memory latency with/without BFL3.
Figure 4-1 illustrates the memory latency for runahead L3 misses. After the memory address is available, the BFL3 is checked. Once a miss is filtered, the on-die L4 tag array is looked up, followed by either an eDRAM or a regular DRAM access depending on an L4 hit or miss for
fetching the requested data block. Regardless of the filtering result, a memory request always goes through the regular L1, L2 and L3 path. This is necessary for handling cache hits at these levels as well as for identifying any false-positive L3 misses from the BFL3. Even when the request is identified as an L3 miss, it is still issued to the normal L1, L2 and L3 path; this avoids major changes to the microarchitecture of the cache hierarchy. If the filtered request goes through the normal cache levels and arrives at the memory controller, the early runahead miss blocks this request at the controller until the block comes back from L4 or memory. On the other hand, if the request has not yet arrived at the controller when the block for the runahead miss comes back, the block is inserted into L3 and treated as a prefetch to the L3 cache. Eventually, the request through the normal path will be an L3 hit, shortening the latency.
Formally, a Bloom filter (BF) for representing a set of n elements (cache blocks) from a
large universe (memory blocks) consists of an array of m bits, initially all set to 0. The filter uses
k independent hash functions h1, ... , hk with range 1, ... , m, where these hash functions map
each element x (block address) in memory to a random number uniformly over the range. When
a block enters cache, the bits hi(x) are set to 1 for 1 ≤ i ≤ k. To check if a block y is in cache, we
check whether all hi(y) are set to 1. If not, then clearly y is not in cache, hence a cache miss. In
practice, however, a BF for cache misses faces two major difficulties. First, it is hard to implement multiple independent, randomized hash functions in hardware. Second, cache contents change dynamically with insertions and deletions; the BF array must be updated accordingly to reflect the content changes for maintaining the correct BF function.
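The description above is the textbook Bloom filter, which can be sketched as follows. This is a generic software illustration only (hashlib stands in for the hardware hash functions; the class and method names are ours), not the address-bit scheme proposed next.

```python
import hashlib

# Generic Bloom filter: m-bit array, k hash functions. A "no" answer is
# always correct (definite cache miss); a "yes" may be a false positive.

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)            # one byte per bit, for clarity

    def _indices(self, x):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, block_addr):           # block enters the cache
        for idx in self._indices(block_addr):
            self.bits[idx] = 1

    def maybe_present(self, block_addr):    # False => definitely a miss
        return all(self.bits[idx] for idx in self._indices(block_addr))
```

Note that this simple form supports insertions only; handling deletions when blocks are evicted is exactly the difficulty the cache-index-based scheme addresses.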
In Figure 4-2, we illustrate a solution for simplified BF hashing functions and for
handling dynamic cache updates. Let us first describe the conventional cache indexing scheme.
In a cache access, the target set is determined by decoding the index, which is located in the low-order bits (a0) of the block address as shown in Figure 4-2. Instead of constructing uniformly distributed hash functions, we can simply expand the cache indexing scheme to include a few more adjacent tag bits (a1) for indexing the BF array. As in a conventional cache access, a simple address decoder for the BF index determines the hashed BF location.
Based on the study in [54], the false-positive rate is minimized when k = ln2 × (m/n), giving a false-positive rate ≈ (0.6185)^(m/n). Hence, increasing m/n can reduce the false-positive rate. For a cache with 2^p-way set associativity, the total number of cache blocks is n = 2^p × 2^a0 = 2^(a0+p). Furthermore, the BF array size m must be bigger than n to reduce the false-positive rate. Assuming m/n = 2^q, where q is a small positive integer, we have m = 2^(a0+p+q). Therefore, the BF index is a1||a0, where a1 has p + q bits.
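The bound above can be evaluated numerically; a small sketch (the formula is from [54] as quoted in the text, the helper names are ours):

```python
import math

# Minimum false-positive rate of a Bloom filter with the optimal number
# of hash functions k = ln2 * (m/n), approximately (0.6185)^(m/n).

def optimal_k(m_over_n):
    return math.log(2) * m_over_n

def min_false_positive(m_over_n):
    return 0.6185 ** m_over_n

for ratio in (4, 8, 16):
    print(f"m/n={ratio}: k≈{optimal_k(ratio):.1f}, "
          f"min false-positive rate≈{min_false_positive(ratio):.2%}")
```

For m/n = 8 the bound is about 2.1% with k ≈ 5.5 hash functions; the rates measured later in this chapter are higher because only one or two simple address-bit hash functions are used.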
There is a unique advantage to including the cache index (a0) as part of the BF index. Due to collisions, multiple blocks can be hashed to the same location in the BF array. Since the cache index is part of the BF index, all blocks hashed to the same BF array location must be located in the same cache set. By comparing the a1 bits of all cache tags in the set with the a1 bits of the replaced block, the BF array location is reset only if the replaced a1 is not found in any other block of the set.
Note that due to spatial locality in memory references, using low-order block address bits may reduce collisions in the BF array, hence lowering the false-positive rate. Moreover, we apply a simple index randomization technique by exclusive-ORing the a1 bits with the adjacent higher-order a2 bits to further reduce collisions. Consider a BFL3 design for an 8MB L3 cache with 64-byte blocks and 16-way set associativity. The target set is determined by the low-order 13 bits (a0) of the block address, which hash to the 8K sets. The total number of blocks is n = 2^17. Assume that the BF array size m is 8 times larger than n. As a result, a1 has 7 additional index bits and the BF index has 20 bits. For randomizing a1, the higher-order 7 bits (a2) are used. The total number of required address bits is 33, including the 6 offset bits. With a limited physical address, we can have several hashing combinations for BFL3 using a0, a1 and a2. Note that in this work our simulated memory is 8GB; with a bigger memory and more physical address bits, more hashing options could be explored.
(a) k=1: three BF indices: a1||a0, a2||a0, and (a1 XOR a2)||a0.
(b) k=2: three BF index groups: (a1||a0 and a2||a0), (a1||a0 and (a1 XOR a2)||a0), and
(a2||a0 and (a1 XOR a2)||a0).
(c) k=3: one BF index group: (a1||a0, a2||a0, and (a1 XOR a2)||a0).
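The three candidate indices in (a) can be computed with simple bit slicing; a hypothetical sketch using the example's field widths (6 offset bits, 13-bit a0, 7-bit a1 and a2; the function name is ours):

```python
# Extract the BF index candidates a1||a0, a2||a0, and (a1 XOR a2)||a0
# from a physical block address, per the layout in Figure 4-2.

OFFSET_BITS, A0_BITS, A1_BITS, A2_BITS = 6, 13, 7, 7

def bf_indices(addr):
    blk = addr >> OFFSET_BITS                      # drop the block offset
    a0 = blk & ((1 << A0_BITS) - 1)                # cache index
    a1 = (blk >> A0_BITS) & ((1 << A1_BITS) - 1)   # adjacent tag bits
    a2 = (blk >> (A0_BITS + A1_BITS)) & ((1 << A2_BITS) - 1)
    return ((a1 << A0_BITS) | a0,
            (a2 << A0_BITS) | a0,
            ((a1 ^ a2) << A0_BITS) | a0)
```

Because every index keeps a0 in its low-order bits, all blocks that collide in one BF entry reside in the same cache set, which is what makes the eviction-time reset check described above possible.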
Figure 4-2. Cache indexing and hashing for BF.
Figure 4-3 shows the false-positive rates for the different hashing schemes. Note that we only show the results of 6 hashing schemes, since the false-positive rate using the single hashing index a2||a0 is very high. We simulate 3 m:n ratios: 4:1, 8:1, and 16:1, using memory traces
generated from SPEC2006 benchmarks. We first ran the workloads on a whole-system
simulation infrastructure based on a cycle-accurate 8-core model along with a memory hierarchy
model [29] to collect memory reference traces from misses to the L2 caches. 5 billion
instructions from 8 cores are collected for each workload. The simulation environment and
parameters will be given in Section 3.3.
(Figure 4-2 divides the block address into tag, cache index (a0), and block offset fields; a1 and a2 are taken from the tag bits adjacent to a0, giving BF index 1 = a1||a0 and BF index 2 = (a1 XOR a2)||a0.)
The false-positive rate is calculated as the number of requests that hit in the filter but actually miss in the cache, divided by the total misses. Each false-positive point in the figure is the geometric mean of the 12 SPEC2006 benchmarks. As can be observed, when k=1, randomization of a1 helps very little in improving the false-positive rate. Two hashing functions with indices a1||a0 and (a1 XOR a2)||a0, as illustrated in Figure 4-2, show the lowest false-positive rates, about 2.3%, 4.8%, and 16.5% respectively for m:n ratios 16:1, 8:1, and 4:1. Three hashing functions cannot further improve the false-positive rate because of insufficient address bits: the third hashing index is highly correlated with the first two.
Figure 4-3. False-positive rates for 6 hashing mechanisms.
Figure 4-4. False-positive rates with m:n = 2:1, 4:1, 8:1, 16:1, and k = 1, 2.
In Figure 4-4, we show the false positive rates for individual SPEC2006 benchmarks.
Based on the results in Figure 4-3, we pick two hashing schemes, a1||a0 for k=1, and a1||a0 and
(a1 XOR a2)||a0 for k=2. The results show that the m:n ratio plays an important role as bigger
BF arrays reduce the false-positive rate significantly for all benchmarks. The false-positive rates
are very high for small BF array with ratio m:n=2:1. The benefit of multiple hashing functions
becomes more evident when m/n is 4 or greater. The false-positive rate behavior is very
consistent across all benchmarks. For k=1, the average false-positive rates are 8.7% and 4.3% using BF arrays with 8 and 16 times more bits than the total number of cache blocks. When k=2, the false-positive rates are reduced to 4.8% and 2.3%, respectively. These results are used to
guide our IPC timing simulation.
4.3 Performance Results
The IPC improvement using a BF for runahead L3 misses is presented in this section. We
also compare the improvement with a perfect BF without any false-positive misses. In addition,
the sensitivity studies of BF design parameters, the size of the L4 caches, and the latency and
bandwidth of the regular DRAM are also presented.
For an 8MB L3 cache with 64-byte blocks, the space overhead for the new BFL3 is 64KB, 128KB and 256KB respectively for m:n = 4:1, 8:1, and 16:1. We use Cacti [21] to estimate the BF latency and get 2, 3, and 3 cycles for the three BF arrays. In addition, we add two more cycles for the wiring delay. For delayed recalibrations, since we need to read out the last 14 bits (a1 and a2) and perform hierarchical OR operations, we measured using Cacti 6.5 [21] that it takes 3 cycles to recalibrate one set. 4 sets can be recalibrated in parallel, and a total of 6K cycles are charged for each recalibration.
4.3.1 IPC Comparison
Figure 4-5 displays the IPCs of the twelve benchmarks under six caching mechanisms:
a regular 4-level cache without a BF; a BFL3 that filters and runs ahead L3 misses; three
delay-recalibration designs, d1-BFL3, d2-BFL3, and d3-BFL3, with recalibration periods of 0.5M, 1M, and 2M
memory references; and a perfect BFL3 that incurs no false-positives. Note that we use two
hashing functions, a1||a0 and (a1 XOR a2)||a0, and m:n = 8:1 for the BFL3. The results show
an average IPC improvement of about 10.5% using the BFL3, only 1.3% less than the perfect
BFL3, which averages 11.8%. The three delay-BF designs improve the IPC by 4.3%, 4.8%, and
3.5%, respectively; a shorter recalibration period produces fewer false-positives but pays more
recalibration overhead. In general, all benchmarks show good IPC improvement with the BFL3.
Mcf and sphinx benefit the most, with close to 20% improvement. All other workloads improve
by at least 6% over the design without a BF, except bwaves, which improves by about 4%. The
BFL3 design also shows a 1.3-8.9% improvement over the delay-BF designs.
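The two index functions a1||a0 and (a1 XOR a2)||a0 can be sketched as follows. The bit-field widths here are illustrative assumptions (a 17-bit a0 covering the block-address low-order bits and 3-bit a1, a2 fields), not the dissertation's exact layout:

```python
def bf_indices(block_addr, a0_bits=17, field_bits=3):
    """Compute the two BFL3 indices a1||a0 and (a1 XOR a2)||a0.

    a0 is the low-order block-address field; a1 and a2 are the adjacent
    higher-order fields (field widths here are hypothetical)."""
    a0 = block_addr & ((1 << a0_bits) - 1)
    a1 = (block_addr >> a0_bits) & ((1 << field_bits) - 1)
    a2 = (block_addr >> (a0_bits + field_bits)) & ((1 << field_bits) - 1)
    idx1 = (a1 << a0_bits) | a0            # a1 || a0
    idx2 = ((a1 ^ a2) << a0_bits) | a0     # (a1 XOR a2) || a0
    return idx1, idx2
```

Because both indices keep a0, which contains the cache index bits, in the low-order position, any two blocks that collide in the BF array necessarily share a cache set; this property is what later enables counter-free BF updates.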
The major impact on the IPC comes from the average memory access latency. In Table 4-1,
we list the average memory latency of L3 misses with and without BF-based runahead. The
L3 miss latency is measured from the generation of the memory address until the return of the
requested data. Note that the measurement excludes the hit latencies of the L1, L2, and L3
caches, since these are essentially the same with or without the BF.
Using the BFL3 to run ahead the misses yields a significant saving in the L3 miss latency
for all benchmarks. On average, the L3 miss latency is reduced from 154 cycles to
120 cycles, which closely matches the sum of the L1, L2, and L3 hit times minus a 5-cycle
penalty for accessing the BF array. In general, the latency result is consistent with the IPC
results. The BF designs with delay recalibrations (not shown) have longer latency due to more
false-positives; in addition, they are charged the recalibration overheads.
Figure 4-5. IPC comparisons with/without BF.
Table 4-1. Average L3 miss latency (cycles) and false-positive rates of 12 benchmarks.
            Latency            False-Positive Rate (%)
            BFL3    w/o BF     BFL3  d1-BFL3  d2-BFL3  d3-BFL3
mcf         105.9   138.2      7     9        12       14
soplex      106.2   141.1      5     10       12       13
lbm         153.3   186        5     8        9        11
leslie3d    158.8   190.4      6     7        9        11
gems        216.9   248.8      4     6        8        9
libquantum  137.6   162        9     12       13       17
milc        142.3   173.9      4     6        9        10
bwaves      128.9   161.5      3     6        8        9
sphinx      69.8    106.4      4     7        11       12
bt          134.6   168.3      4     5        7        9
omnetpp     77.3    112.2      6     10       13       16
gcc         80.6    116.3      3     6        8        9
In Table 4-1, we also show the false-positive rates of the twelve benchmarks measured
in the timing simulation. The rates range from 3% to 9%, consistent with the rates obtained
from simulating long memory traces (Figure 4-4). The small false-positive rate ensures a
small impact on the IPC improvement. As shown in Figure 4-5, the IPC improvement of a
perfect BF surpasses that of a realistic BF only modestly, 11.8% versus 10.5%.
We also provide a rough estimate of the power consumption. Using Cacti
[52], we measure that each BF access takes around 0.013 nJ, close to a single L1 cache dynamic
access energy of 0.014 nJ. Since the only dynamic-power difference is the Bloom Filter access energy,
and the number of Bloom Filter accesses equals the total number of L1 cache accesses, the extra
power consumption is essentially the same as the L1 cache's total dynamic power. On the other
hand, the Bloom Filter speeds up execution by 10.5%, which translates into a 10.5%
static energy saving. To save even more energy, the Bloom Filter can be accessed only after L1
cache misses.
4.3.2 Sensitivity Study
The sensitivity of the IPC to the m:n ratio and the number of
hashing functions k is shown in Figure 4-6, in which each IPC point is the geometric mean of the 12
benchmarks. Again, the results show that bigger BF arrays reduce the false-positive rate and
improve the IPC. The improvement rate is much larger for k=3 than for k=2 and k=1: without
sufficient entries in the BF array, more hashing functions actually increase the chance of
collisions, whereas with bigger BF arrays, more hashing functions spread each block more
randomly and reduce collisions. When the m:n ratio is 2:1, there is insufficient room in the BF
array even for k=2, resulting in a slightly lower IPC than that of k=1; k=3 is much worse.
However, the IPC of k=3 nearly catches up with that of k=2 when m:n=8:1, and at m:n=16:1
the IPCs of the different hashing functions are very close: given a sufficient BF array size, the
false-positive rates are small regardless of the number of hashing functions. Nevertheless, at
m:n=16:1 one would expect k=3 to outperform k=2. Due to the limited address bits, however, the third BF index is
highly correlated with the first two BF indices, resulting in limited improvement in the false-positive rate.
In Figure 4-7, we show the IPC for four L4 sizes ranging from 64MB to 512MB.
In these simulations, we maintain m:n=8:1 and k=2. Regardless of the L4 size, the BFL3
always improves the IPC significantly. As expected, a bigger L4 reduces the L4 MPKI and
lets the BFL3 improve the IPC even more. For the four L4 sizes, the IPC improvements are 9.0%,
10.5%, 11.5%, and 12.0%, respectively.
Figure 4-6. Average IPC for m:n ratios and hashing functions.
Figure 4-7. Average IPC for different L4 sizes.
Figure 4-8. Average IPC over different DRAM latency.
Table 4-2. Future conventional DRAM parameters.
Faster DRAM latency: tCAS-tRCD-tRP: 6-6-6; tRAS-tRC: 33-30
Slower DRAM latency: tCAS-tRCD-tRP: 11-15-15; tRAS-tRC: 38-50
Next, we simulate the impact of the DRAM latency. In comparison with the original DRAM
latency in Table 2-1, we simulate a fast and a slow DRAM latency as shown in Table 4-2. We
also test two DRAM bandwidth configurations, one with 2 channels and the other with 4
channels. The L3 and L4 sizes are 8MB and 128MB, and the BFL3 remains at m:n=8:1, k=2. The
results are shown in Figure 4-8. Interestingly, higher DRAM bandwidth and faster DRAM
latency help the IPC more with runahead L3 misses than without. For the fast latency with 4
DRAM channels, the average IPC improvement reaches 12%; for the slow latency with 2
DRAM channels, it is about 7%.
4.4 Summary
A new Bloom Filter is introduced to filter L3 cache misses for bypassing L1, L2 and L3
caches to shorten the L3 miss penalty in a 4-level cache hierarchy system. The proposed Bloom
Filter applies a simple indexing scheme, decoding the low-order block address to determine
the hashed location in the BF array. For better hashing randomization, a part of the index
bits is XORed with the adjacent higher-order address bits. In addition, with certain
combinations of the limited block address bits, multiple index functions can be selected to
further reduce the false-positive rate. Results show that the proposed simple hashing scheme
lowers the average false-positive rate below 5% for filtering L3 misses and improves the
average IPC by 10.5% by running ahead these misses.
Furthermore, the proposed BF indexing scheme resolves an inherently difficult problem in
using a Bloom Filter to identify L3 cache misses. Because the cache content changes
dynamically, a counting Bloom Filter is normally necessary to keep the BF array up to date.
A unique advantage of the proposed BF index is that it is a superset of the cache
index. As a result, blocks that are hashed to the same BF array location are
allocated in the same cache set. By searching the tags in the set when a block is replaced, the
corresponding BF bit can be reset correctly. This restricted hashing scheme demonstrates a low
false-positive rate and simplifies the BF array updates without using expensive counters.
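As a sketch of this counter-free update (with hypothetical helper names of our own), the BF bit of a replaced block is cleared only when no surviving tag in the same cache set maps to the same BF location:

```python
def on_block_replaced(bf_bits, set_tags, bf_index_of, victim_addr):
    """Clear the victim's BF bit unless another block still resident in the
    same cache set hashes to the same BF array location.

    bf_bits: mutable bit array; set_tags: addresses of the blocks remaining
    in the victim's set; bf_index_of: the BF hash. All names are ours,
    chosen for illustration."""
    idx = bf_index_of(victim_addr)
    if all(bf_index_of(t) != idx for t in set_tags):
        bf_bits[idx] = 0                 # no surviving block shares the slot
```

Because every block mapping to `idx` must live in the victim's set, scanning only that set's tags is sufficient, which is exactly what removes the need for counters.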
CHAPTER 5
GUIDED MULTIPLE HASHING
In this chapter, we present our guided multiple hashing work. We begin by introducing
the problems of single hashing and multiple hashing. We then use a simple example to illustrate
the proposed idea, followed by a detailed algorithm that maximizes the number of empty buckets
while balancing the keys among the non-empty buckets. Finally, we present results that show
our improvement over other traditional hashing methods.
5.1 Background
Hash-based lookup has been an important research direction in routing and packet
forwarding, which are among the core functions of the IP network-layer protocols. While there are
alternative approaches to routing table lookup, such as trie-based solutions, we focus on
hash-based solutions, which have the advantages of simplicity and O(1) average lookup time,
whereas trie-based lookup tends to make many more memory accesses.
Single hashing suffers from the collision problem: multiple keys are hashed to the
same bucket, causing an uneven distribution of keys among the buckets and variable delays
when looking up keys located in different buckets. For hash-based network routing tables [62] [63]
[64] [65], fast lookup of the next-hop routing information is critical. In today's
backbone routers, routing tables are often too big to fit into the on-chip memory of a network
processor. As a result, off-chip routing table access becomes the bottleneck for meeting the
increasing throughput requirement of high-speed Internet [66] [67]. Unbalanced hash
buckets further worsen the off-chip access. Today's memory technology fetches a contiguous
block (such as a cache block) from off-chip memory more efficiently than individual data
elements. A heavily loaded hash bucket may require two or more memory accesses
to fetch all its keys. However, in order to accommodate the most-loaded bucket for a constant
lookup delay, fetching a memory block large enough to hold the highest number of keys in any
bucket increases the critical memory bandwidth requirement, wastes memory space, and
lowers the network throughput [63] [65] [68] [69].
Methods have been proposed to handle the hash collision problem by balancing the bucket
load, i.e., reducing the maximum number of keys in any bucket. One approach is
multiple hashing, such as d-random [70], which hashes each key to d buckets using d
independent hash functions and stores the key in the least-loaded bucket. The 2-left scheme
[62] [68] is a special case of d-random in which the buckets are partitioned into left and right
regions. When inserting a key, a random hash function is applied in each region and the key is
allocated to the least-loaded bucket (the left one in case of a tie). The multiple-hashing approach
balances the buckets and reduces the fetched bucket size for each key lookup. However, without
knowing in which bucket a key is located, d-random (d-left) requires probing all d
buckets. As the bottleneck lies in the off-chip memory access, accessing multiple buckets
slows down the hash table access and degrades the network performance [65] [67].
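A minimal sketch of 2-left insertion (our own simplification of the cited scheme, with buckets modeled as Python lists):

```python
def two_left_insert(left, right, key, h_left, h_right):
    """Insert `key` into the less-loaded of its two hashed buckets,
    preferring the left region on a tie (a sketch of 2-left hashing)."""
    i, j = h_left(key), h_right(key)
    if len(left[i]) <= len(right[j]):    # left region wins ties
        left[i].append(key)
    else:
        right[j].append(key)
```

Note that lookup must still probe both hashed buckets, which is exactly the cost the guided scheme in this chapter tries to avoid.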
To avoid probing d buckets, the extended Bloom Filter [64] uses counters and extra
pointers to link the keys of multiple hashed buckets. However,
it requires key replication and must handle complex key updates. The recently proposed
Deterministic Hashing [65] applies multiple hash functions to an on-chip intermediate index
table in which the hashed bucket addresses are saved. By properly setting up the bucket addresses
in the index table, the hashed buckets can be balanced. This approach incurs space overhead and
delays due to the indirect access through the index table. In [69], an improved approach uses an
intermediate table that records hash function IDs instead of bucket addresses to alleviate the
space overhead. In addition, it applies a single hash function to the index table to ease the update
complexity. However, with a limited index table and hashing functions, the achievable balance is
also limited. In another effort to avoid collisions, perfect hashing sets the more rigid goal of a
one-to-one mapping between keys and buckets. It accomplishes this with complex
hash functions encoded on-chip at significant space cost and additional delay [71] [20]; it also
requires changing the encoded hash function upon a hash table update.
5.2 Hashing
We first describe the challenges of a hash-based information table using a single hash
function. We also bring up the motivation for and applications of a multiple-hashing approach
for organizing and accessing a hash table.
Figure 5-1. Distribution of keys in buckets for four hashing algorithms.
To demonstrate the power of multiple hashing in accomplishing different objectives for
the hash table, we compare the simulation results of four hashing schemes: single hashing
(single-hash), 2-hash with load balancing (2-left), 4-hash with load balancing (4-left), and 2-hash
with maximum zero buckets (2-max-0). We simulate 200,000 randomly generated keys to be
hashed to 100,000 buckets. The distribution of keys in buckets is plotted in Figure 5-1. We can
observe substantial differences in the key distribution among the four hashing schemes. The
maximum number of keys in a bucket reaches ten for single-hash and 2-max-0. Meanwhile,
2-max-0 produces 2.5 times as many empty buckets as single-hash does. 2-left and 4-left are more
balanced, with maximum bucket loads of four and three keys, respectively. It is
easy to see that increasing the number of hash functions from two to four improves the
balance.
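The Figure 5-1 experiment is easy to reproduce in miniature. The sketch below is our own code, comparing single-hash and 2-left on the two metrics just discussed, maximum bucket load and empty-bucket count:

```python
import random

def load_stats(n_keys, n_buckets, seed=1):
    """Return (max load, empty buckets) for single-hash and for 2-left."""
    rnd = random.Random(seed)
    single = [0] * n_buckets
    half = n_buckets // 2                # 2-left splits buckets into halves
    left, right = [0] * half, [0] * half
    for _ in range(n_keys):
        single[rnd.randrange(n_buckets)] += 1
        i, j = rnd.randrange(half), rnd.randrange(half)
        if left[i] <= right[j]:          # less-loaded bucket wins, left on ties
            left[i] += 1
        else:
            right[j] += 1
    both = left + right
    return (max(single), single.count(0)), (max(both), both.count(0))
```

Run at the text's scale (200,000 keys, 100,000 buckets), this reproduces the Figure 5-1 trend: a 2-left maximum load of about four versus roughly ten for single hashing.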
5.3 Proposed Algorithm
In this section, we describe the detailed algorithms of the guided multiple-hashing
scheme that consists of a setup algorithm, a lookup algorithm, and an update algorithm. Assume
we have m buckets 𝐵1, . . . , 𝐵𝑚 and d independent hash functions 𝐻1, . . . , 𝐻𝑑. Each key x is hashed
and placed into all d buckets, 𝐵𝐻𝑖(𝑥), 1 ≤ 𝑖 ≤ 𝑑. The set of keys in bucket 𝐵𝑖 is denoted by
B[i], and the number of keys in bucket 𝐵𝑖 is 𝑣(𝐵[𝑖]), 1 ≤ 𝑖 ≤ 𝑚. The bucket load Ω𝑎 is
defined as the maximum number of keys in any bucket. We define the memory usage ratio as:
𝜃 = (Ω𝑎 × 𝑚)/𝑛 to indicate the memory requirement of the hash table. Other terminologies
are self-explanatory and are listed in Table 5-1. For better illustration of d-ghash, we use a
simple hashing table with 5 keys and 8 buckets. All keys are hashed to the buckets using two
hashing functions, where buckets 𝐵0, . . . , 𝐵7 have 1, 0, 1, 2, 3, 0, 1, 2 keys as indicated by the
arrows in Figure 5-2.
5.3.1 The Setup Algorithm
Since the objective is to minimize the bucket load while approaching to a single bucket
access per lookup, the setup algorithm needs to satisfy two criteria: (1) achieving near perfect
balance, and (2) maximizing the number of c-empty buckets. Recall that a c-empty bucket serves
as a multiple hashing target of one or more keys, but the key(s) is placed into other alternative
buckets that make c-empty bucket access unnecessary.
Table 5-1. Notation and definitions.
Symbol    Meaning
n         Total number of keys
m         Total number of buckets
𝐵[𝑖]      Set of keys in the i-th bucket
v(B[i])   Number of keys in the i-th bucket
s         Indices of the buckets in B, sorted in ascending order of v(B[i])
𝐻𝑖        i-th hash function
Ω𝑝        Optimal bucket load, ⌈𝑛/𝑚⌉
Ω𝑎        Achievable bucket load
𝑛𝑢        Total number of keys in under-loaded buckets (bucket load less than Ω𝑎)
𝑏𝑢        Number of under-loaded buckets
𝜃         Memory usage ratio
Figure 5-2. A simple d-ghash table with 5 keys, 8 buckets, and 2 hash functions. (The shaded
bucket is a c-empty bucket. The final key assignment is as illustrated.)
Given n keys and m buckets, a perfect-balance hashing scheme achieves the optimal bucket
load Ω𝑝 = ⌈𝑛/𝑚⌉. However, perfect balance may not be achieved under our or other multi-hashing
schemes: even with multiple hashing, some buckets may still probabilistically be
under-loaded, i.e., zero or fewer than Ω𝑝 keys are hashed to the bucket, which translates to some
other buckets being squeezed with more keys. Increasing the number of hash functions reduces
under-loaded buckets and helps approach perfect balance. Our simulation shows that
with 4 hash functions, the achievable balance is the same as or very close to Ω𝑝.
The first step of the setup algorithm is to estimate Ω𝑎, the achievable balance. The idea is
to count the number of under-loaded buckets and the number of keys inside them. If the remaining
buckets cannot hold the rest of the keys with Ω𝑎 keys in each bucket, we increase Ω𝑎 by one.
We then use Ω𝑎 as the benchmark bucket load for key assignment. We sort all buckets in B,
producing a sorted index array s such that 𝑣(𝐵[𝑠(𝑖)]) ≤ 𝑣(𝐵[𝑠(𝑖 + 1)]), 1 ≤ 𝑖 ≤ 𝑚 − 1. In
the simple example of Figure 5-2, Ω𝑎 = Ω𝑝 = ⌈𝑛/𝑚⌉ = 1.
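This first step can be sketched as follows. The sketch is a simplification of ours: `loads` holds each bucket's initial hashed-key count (duplicates included), and the function name is hypothetical:

```python
import math

def achievable_load(loads, n_keys):
    """Estimate the achievable bucket load: start from the optimal load and
    raise it until the non-under-loaded buckets can absorb the remaining keys."""
    m = len(loads)
    omega = math.ceil(n_keys / m)        # optimal load, the best case
    while True:
        under = [v for v in loads if v < omega]
        # remaining buckets must hold the rest of the keys at omega each
        if (m - len(under)) * omega >= n_keys - sum(under):
            return omega
        omega += 1
```

On the Figure 5-2 example, with initial loads 1, 0, 1, 2, 3, 0, 1, 2 for 5 keys, the sketch returns 1, matching Ω𝑎 = Ω𝑝 = 1.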
The next step is key assignment, which consists of two procedures: creating c-empty
buckets and balancing the key assignment. To create c-empty buckets, the procedure removes
duplicate keys starting from the most-loaded buckets, maximizing their service as companion
buckets that reduce bucket accesses. A key can be safely removed from a bucket if it exists in
other bucket(s). The procedure goes through all buckets whose initial load is greater than Ω𝑎
and tries to remove keys from them. In the example of Figure 5-2, all 3 keys in 𝐵4 are
successfully removed and 𝐵4 becomes empty. Next, we check 𝐵3 and 𝐵7, each of which has 2
keys. Note that both 𝐾2 and 𝐾4 in 𝐵3 can be removed, so 𝐵3 is emptied first. As a result, 𝐾2 and
𝐾3 cannot be removed from 𝐵7, whose load still exceeds Ω𝑎. All buckets whose load exceeds Ω𝑎
become targets for the reallocation described next.
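A hedged sketch of the c-empty creation pass (our own code; it vacates an over-loaded bucket only when every one of its keys also appears in some other hashed bucket):

```python
def create_c_empty(buckets, omega_a):
    """Visit buckets from most- to least-loaded and empty those whose keys
    all have copies elsewhere, turning them into c-empty buckets."""
    order = sorted(range(len(buckets)), key=lambda i: -len(buckets[i]))
    for b in order:
        if len(buckets[b]) <= omega_a:
            break                        # remaining buckets are not over-loaded
        vacatable = all(
            any(k in buckets[o] for o in range(len(buckets)) if o != b)
            for k in buckets[b]
        )
        if vacatable:
            buckets[b].clear()           # bucket becomes c-empty
```

The greedy most-loaded-first order matches the text's intent of maximizing the emptied buckets' service as companions, though the dissertation's actual pass may differ in its tie-breaking details.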
After emptying the buckets, the key assignment procedure assigns each key to a bucket,
starting from the least-loaded bucket. Once a key is assigned, its duplicates are removed from the
remaining buckets. During the assignment, buckets with more than Ω𝑎 keys are skipped in order
to maintain the achievable balance. A re-assignment of the buckets with load greater than Ω𝑎 is
necessary after all the buckets are assigned. During re-assignment, the keys in the overflow
buckets are relocated to other buckets when possible. In our experiment, we use Cuckoo
Hashing [72] to relocate keys from an overflow bucket to an alternative bucket using the multiple
hashing functions. If all alternative buckets are full, an attempt is made to make room in the
alternative buckets. For simplicity, such attempts stop after r tries, where r is a
heuristic number: a larger r brings better balance at the expense of longer setup time. In the
illustrated example, 𝐾2 in 𝐵7 is relocated to 𝐵3 to reduce the bucket load of 𝐵7; hence, the
optimal load is achieved.
If perfect balance is not achievable, Ω𝑎 is incremented by one and the key
assignment procedure repeats. It is important to note that the priority of the key assignment is to
achieve perfect balance. Therefore, keys previously removed from an empty bucket
may be reassigned back in order to accomplish the perfect balance in which every bucket holds
at most Ω𝑎 keys. Note also that the bucket load can be reduced by
decreasing the ratio 𝑛/𝑚, i.e., increasing the number of buckets for a fixed
number of keys. However, with a constant bucket size for efficient fetch of a bucket from
off-chip memory, increasing the number of buckets inflates the memory space requirement,
since the memory usage ratio is 𝜃 = (Ω𝑎 × 𝑚)/𝑛.
5.3.2 The Lookup Algorithm
To speed up the lookup of keys, we introduce a data structure called the empty
array, a bit array of size m indicating whether each bucket is empty: if a bit in the
empty array is '0', the corresponding bucket is empty; otherwise it is not.
Upon looking up a key x, the bits at indices 𝐻1(𝑥), . . . , 𝐻𝑑(𝑥) in the empty array are checked. If
only one of the hashed buckets is non-empty, we simply fetch that bucket and thus
complete the lookup. If two or more buckets are non-empty, we access them one by one until
we find the key. In the worst case, all d bits are ones and d buckets are examined before the key
is found. As discussed above, creating c-empty buckets reduces the bucket accesses per lookup
and thus alleviates the lookup cost.
To further enhance our algorithm, we introduce another data structure, the target array, which
records the hash function ID when a key is hashed to two or more non-empty buckets. To
distinguish it from the algorithm described above, we call this the enhanced d-ghash algorithm;
the algorithm using only the empty array is the base d-ghash algorithm. The recorded ID
indicates the bucket in which the key is most likely located. The empty array has m bits, while
the size of the target array varies with the number of keys. Suppose m = 200K and we use a
200K-entry target array; then the empty array takes 25KB, and the target array takes 25KB for
enhanced 2-ghash and 50KB for enhanced 4-ghash. These two small arrays can be placed on
chip for fast access. Multiple keys may collide in the target array. When a collision occurs, the
priority of recording the target hashing function is given to the key that hashes to more
non-empty buckets. Given a fixed number of keys, we can adjust the number of buckets (m)
and hash functions (d) to achieve a specific goal for the bucket size and the number of buckets
fetched per key lookup. More discussion follows in Section 5.4.
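Putting the empty and target arrays together, an enhanced d-ghash lookup can be sketched as below. All helper names and the target-array indexing are our assumptions; buckets are modeled as dicts reached through `fetch_bucket`, standing in for off-chip reads:

```python
def ghash_lookup(key, hash_funcs, empty_bits, target, fetch_bucket):
    """Return the record for `key`, probing as few buckets as possible:
    the empty array rules out empty buckets, and the target array hints at
    the most likely hash function when several buckets survive."""
    idxs = [h(key) for h in hash_funcs]
    live = [i for i in idxs if empty_bits[i]]
    if len(live) == 1:                      # the common, single-fetch case
        return fetch_bucket(live[0]).get(key)
    hinted = idxs[target[key % len(target)]]
    order = ([hinted] if hinted in live else []) + \
            [i for i in live if i != hinted]
    for i in order:                         # worst case: all d hashed buckets
        rec = fetch_bucket(i).get(key)
        if rec is not None:
            return rec
    return None
```

When the hint is correct, the multi-bucket case still costs a single off-chip fetch, which is how the enhanced scheme pushes the average bucket accesses per lookup toward one.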
5.3.3 The Update Algorithm
There are three common types of hash table updates: insertion, deletion, and
modification. Deleting or modifying a key in the hash table is straightforward. For deletion, the
key is probed first by fetching the bucket from off-chip memory. The key is then removed from
the bucket before the bucket is written back to memory. If the key is the last one in the bucket,
the corresponding bit in the empty array is set to zero. For modification of a key's associated
record, the key and its record are fetched, and the new record replaces the old one
before the bucket is written back to memory. Those two types of updates do not involve the
modification of the target array.
Key insertion is slightly more complicated. All hashed buckets are probed and the key is
inserted into the least-loaded non-empty bucket with fewer than Ω𝑎 keys. If all non-empty
hashed buckets are full, the key is inserted into one of its hashed buckets that is empty, and the
empty array is updated accordingly. If all hashed buckets are full, Cuckoo Hashing is applied to
make room for the new key, i.e., "rehashing" a key in one of the hashed buckets to another
alternative bucket. During key relocations, both the empty and the target arrays are updated
accordingly. There are two options when a key cannot be inserted without breaking the
property 𝑣(𝐵[𝑖]) ≤ Ω𝑎, i.e., when all its hashed/rehashed bucket loads are at least
Ω𝑎: first, set Ω𝑎 = Ω𝑎 + 1 and insert the key normally; second, initiate an off-line process to
re-set up the table. Normally the chance that a key cannot be inserted is small, and we should
use the second option to prevent the bucket size from growing quickly. However, if this situation
arises very frequently, it implies that most of the buckets are "full", i.e., the average number of
keys per bucket approaches Ω𝑎. In this case we should use the first option: by increasing
the maximum load by one, every bucket gains one extra slot for another key.
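The insertion policy above can be sketched as follows (our own code; `rehash` stands in for the Cuckoo relocation and is a hypothetical helper):

```python
def ghash_insert(key, hash_funcs, buckets, empty_bits, omega_a, rehash):
    """Insert per Section 5.3.3: least-loaded non-empty hashed bucket with
    room, else an empty hashed bucket, else Cuckoo-style relocation."""
    idxs = [h(key) for h in hash_funcs]
    open_live = [i for i in idxs
                 if empty_bits[i] and len(buckets[i]) < omega_a]
    if open_live:                        # a non-empty bucket still has room
        buckets[min(open_live, key=lambda i: len(buckets[i]))].append(key)
        return True
    for i in idxs:                       # all non-empty hashed buckets full
        if not empty_bits[i]:
            buckets[i].append(key)
            empty_bits[i] = 1            # bucket is no longer empty
            return True
    return rehash(key)                   # every hashed bucket is full
```

Preferring non-empty buckets keeps c-empty buckets empty for as long as possible, preserving their lookup-saving role.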
5.4 Performance Results
The performance evaluation is based on simulations of seven hashing schemes: single-hash,
2-left, 4-left, base 2-ghash, enhanced 2-ghash, base 4-ghash, and enhanced 4-ghash. We do not
include d-random in the evaluation because it is outperformed by d-left in both
the bucket load and the number of bucket accesses per lookup. We simulate 200,000
randomly generated keys hashed into 100,000 to 500,000 buckets. To test the new multiple
hashing idea, we adopt the random hash function in [25], which uses a few shift, OR, and addition
operations on the key to produce multiple hashing results. For relocation, we relocate keys
in no more than ten buckets to other alternative buckets in the setup algorithm and no more than
two in the update algorithm. We first compare the bucket load and the average number of
bucket accesses per lookup while varying 𝑛/𝑚. Then we normalize the number of keys per lookup
by the memory usage ratio to understand the memory overhead of the different hashing
schemes. In addition, we demonstrate the effectiveness of creating c-empty buckets in reducing
bucket accesses. We also give a sensitivity study of the number of bucket accesses per lookup with
respect to the size of the target array. Lastly, we evaluate the robustness of the d-ghash scheme
using two simple probabilistic models.
Figure 5-3. Bucket loads for the five hashing schemes. The enhanced and base d-ghash schemes
have the same bucket load.
Figure 5-3 displays the bucket loads of the hashing schemes. Note that enhanced d-ghash
and base d-ghash have the same bucket load; the only difference between the two is that enhanced
d-ghash uses a target array to reduce the number of bucket accesses per lookup. The results
show that d-ghash has the lowest bucket load, and hence achieves the best balance among the
buckets, followed by d-left. More hash functions improve the balance for both d-ghash and d-left.
With 275,000 buckets, 4-ghash accomplishes perfect balance with a bucket load of a single
key; no other simulated scheme achieves such balance with up to 500,000 buckets. 2-ghash
performs slightly better than 4-left: the former needs 150,000 buckets to reduce the bucket load
to two keys, while the latter requires 175,000 buckets. This result demonstrates the power of
d-ghash in balancing the keys over that of d-left. The single-hash scheme is the worst, with a
bucket load of six even at 500,000 buckets. Note that the bucket load is an integer, but we
slightly adjust the integer values to separate the curves of the different schemes for easy reading.
Figure 5-4. Number of bucket accesses per lookup for d-ghash.
In Figure 5-4, we evaluate the lookup efficiency of the seven hashing schemes. Single-hash
accesses only one bucket per lookup. The d-left scheme looks up a key starting from the left-most
hashed bucket; if the key is not found, the next bucket to the right is accessed, until the key is
located. Since the key is always placed in the left-most bucket to break a tie, the number of
bucket accesses per lookup is quite low, 1.68 ∼ 2.36 for 4-left and 1.27 ∼ 1.44 for 2-left. The
base 4-ghash and base 2-ghash reduce the number of bucket accesses per lookup to 1.25 ∼ 2.18
and 1.11 ∼ 1.44 respectively with a 5–34% and 0–14% reduction. With a target array of 1.5n
entries, the enhanced 4-ghash and the enhanced 2-ghash can further reduce the number of bucket
accesses per lookup to as low as 1.03 ∼ 1.23 and 1.01 ∼ 1.11 respectively with a 38–51% and
21–24% reduction.
Interestingly, the number of bucket accesses per lookup for d-ghash does not
decrease continuously as the number of buckets increases. We can observe a sudden jump at
m = 125,000 and m = 275,000 for 4-ghash, because the optimal bucket load
drops from three to two at m = 125,000 and from two to one at m = 275,000. As the
average number of keys per bucket is then very close to the optimal bucket load, it is hard to
create c-empty buckets. Therefore, the number of c-empty buckets drops suddenly at those
two points, and 4-ghash experiences more bucket accesses per key lookup. The same
reasoning applies to 2-ghash.
Figure 5-5. Average number of keys per lookup based on memory usage ratio.
To reduce the bucket load for a fixed number of keys, we can increase the number
of buckets; however, this inflates the memory space requirement. In Figure 5-5,
we plot the average number of keys per lookup against the memory usage ratio, where the
average number of keys is the product of the bucket load and the average number of buckets per
lookup. The results clearly show the advantage of the d-ghash scheme. Enhanced 4-ghash
accomplishes a single key per bucket with 275,000 buckets, only 37% more than the
number of keys. With slightly more than one key per lookup, enhanced 4-ghash requires the
least memory to achieve close to one key access per lookup.
Besides the near-perfect balance, d-ghash creates c-empty buckets to maximize the number of
keys hashing to empty buckets. Figure 5-6 shows the effectiveness of the c-empty buckets in
reducing bucket accesses. In this figure, the y-axis indicates the average number of non-empty
buckets into which each key is hashed. In comparison with d-left, d-ghash reduces non-empty
buckets more significantly, resulting in a smaller number of bucket accesses. For d-left, the number
of non-empty buckets decreases as the number of buckets increases, because d-left assigns each
key to the least-loaded bucket. d-ghash, on the other hand, creates c-empty buckets by removing
keys from the buckets with more keys hashed into them; as a result, there are fewer non-empty
buckets to examine when looking up each key. It is interesting to observe that the ability to
create c-empty buckets depends heavily on the optimal bucket load and the ratio of keys to
buckets. For example, when the number of buckets is 250,000, the optimal bucket load is 2
for 200,000 keys, which leaves plenty of room to create many c-empty buckets. However, when the
number of buckets increases to 275,000, the optimal bucket load drops to 1, which leaves little
room for c-empty buckets. Hence, the average number of non-empty buckets per hashed key
increases.
Moreover, we show a sensitivity study of the bucket accesses per lookup with respect to the
size of the target array. We vary the size of the target array from n to 2n entries using enhanced
4-ghash; the result is shown in Figure 5-7. As expected, a larger target array reduces
collisions, resulting in fewer bucket accesses. We picked 1.5n entries as the target
array size in the earlier simulations, which gives the best tradeoff between space overhead and
bucket accesses per lookup.
Figure 5-6. The average number of non-empty buckets for looking up a key. This parameter is
the same for the enhanced and base d-ghash schemes.
Finally, we evaluate the robustness of our scheme. We first set up a table using 200,001
keys, 200,000 buckets, and 300,000 target array entries; the achievable bucket load Ω𝑎 is 2 in
this setting. We simulate two update models: (1) Balanced Update: 33% insertion, 33% deletion,
and 33% modification; and (2) Heavy Insertion: 40% insertion, 30% deletion, and 30%
modification. We run 600K updates and record the rehash percentage over all update
operations and the number of bucket accesses per lookup. The results are presented in Figure
5-8. The top two lines show the number of bucket accesses per lookup under the Heavy Insertion
model and the Balanced Update model, respectively. Both lines increase. The number
of bucket accesses per lookup rises continuously to 1.37 for Heavy Insertion, an increase of
25% over the original number; for Balanced Update, it first rises to 1.25
and then drops to 1.21, a 10% increase in the end. The bottom two lines are the rehash
percentages over all update operations. They show clearly that heavier insertion
causes more rehashes. For Balanced Update, the rehash percentage stays
almost constant at 0.5%, while there is a slight increase for Heavy Insertion. Since
the rehash percentages of both models are below 2% and a rehash operation involves keys
in no more than two buckets, we believe d-ghash can handle these rehashes without
incurring much delay.
Figure 5-7. Sensitivity of the number of bucket accesses per lookup for enhanced 4-ghash with
respect to the target array size.
Finally, we apply our algorithm to a real routing table application. We use five routing
tables downloaded from the Internet backbone routers: as286 (KPN Internet Backbone), as513
(CERN, European Organization for Nuclear Research), as1103 (SURFnet, the Netherlands),
as4608 (Asia Pacific Network Information Center, Pty. Ltd.), and as4777 (Asia Pacific Network
Information Center) [66], with 276K, 291K, 279K, 283K, 281K prefixes respectively after
removing the redundant prefixes.
Figure 5-8. Changes in the number of bucket accesses per lookup and rehash percentage for two update models using enhanced 4-ghash. The bucket-accesses-per-lookup lines correspond to the left Y axis; the rehash-percentage lines correspond to the right Y axis.
To handle the longest prefix matching problem, hash-based lookup adopts the controlled
prefix expansion [26] along with other techniques [73], [74], [75]; it is observed that there are
small numbers of prefixes for most lengths, and they can be dealt with separately, for example,
using TCAM, while other prefixes are expanded to a limited number of fixed lengths. Lookup
will then be performed for those lengths. In this experiment, we expand the majority of prefixes
(with lengths in the range of [26], [76]) to two lengths: 22 bits and 24 bits. Assuming the small
number of prefixes outside [26], [76] are handled by TCAM, we perform lookups against lengths
22 and 24. Because there are more 24-bit prefixes after expansion, we present the results for 24-bit prefix lookup.
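Controlled prefix expansion works by replicating each shorter prefix into every covering prefix of the target length. A minimal sketch of that step, with illustrative function names and values:

```python
def expand_prefix(prefix_bits, length, target_length):
    """Expand one prefix of `length` bits into all prefixes of
    `target_length` bits that it covers (target_length >= length).
    Each expanded prefix would inherit the original next-hop."""
    shift = target_length - length
    base = prefix_bits << shift
    return [base + i for i in range(1 << shift)]

# Tiny example: a /4 prefix expanded to length 6 yields 2**2 = 4 prefixes;
# a 22-bit prefix expanded to 24 bits yields 4 prefixes the same way.
expanded = expand_prefix(0b1011, 4, 6)
```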
Figure 5-9. Number of bucket accesses per lookup for experiments with five routing tables.
Figure 5-10. Experiment with the update trace using enhanced 4-ghash. The bucket-accesses-per-lookup line corresponds to the left Y axis; the rehash-percentage line corresponds to the right Y axis.
Table 5-2. Routing table updates for enhanced 4-ghash.
Number of buckets     110K      120K    130K    140K    150K
Rehash percentage     re-setup  0.23%   0.13%   0.08%   0.05%
There are 159,444, 159,813, 159,395, 159,173 and 159,376 24-bit prefixes in the five routing tables, respectively. We use these prefixes to set up five tables separately and vary the number of buckets from 100K to 250K with a target array of 150K entries. We measure the number of bucket accesses per lookup for the d-ghash and d-left schemes, averaged over the five tables. As shown in Figure 5-9, both base 4-ghash and base 2-ghash perform better than the respective d-left scheme. The maximum reduction is about 36% for base 4-ghash over 4-left when m = 210,000, and 12% for base 2-ghash over 2-left when m = 250,000. The average number of bucket accesses per lookup for the enhanced 4-ghash scheme is almost one bucket less than for 4-left, a reduction of up to 50%. For enhanced 2-ghash, there is an average 20% reduction over 2-left. We also notice a jump for 4-ghash at m = 220,000 and another for 2-ghash at m = 130,000; this is due to the change in Ω𝑎, as mentioned before.
In the second experiment, we set up our hash tables with the routing table as286 downloaded on January 1st, 2010 from [66] and use the collected update trace of the whole month of January 2010 to simulate the update process. To keep the experiment simple, we again choose prefixes of length 24. There are 159,444 24-bit prefixes in the table. The update trace contains 1,460,540 insertions and 1,458,675 deletions for those 24-bit prefixes. We vary the number of buckets from 110K to 150K. For all these settings, the achievable bucket load Ω𝑎 is 2 for enhanced 4-ghash. We also use a fixed 150K-entry target array.
As shown in Table 5-2, with 110K buckets a re-setup of the whole table is needed. With 120K buckets, no re-setup is needed, but 0.23% of the update operations must be rehashed, which is about 0.5% of the 1.4 million insertions. Increasing the number of buckets further reduces rehashing: with 150K buckets, the rehash percentage is close to 0.05%. We also show the change in lookup efficiency in Figure 5-10 with m = 150K. The update trace used has nearly the same number of insertions and deletions, similar to the Balanced Update model used in Section 5.4. The rehash percentage grows continually to 0.05%. The number of bucket accesses per lookup rises and falls through the update process, with a 7% increase in the end.
5.5 Summary
A new guided multiple-hashing method, d-ghash, is introduced in this chapter. Unlike previous approaches, which progressively place each key into the least-loaded bucket, d-ghash achieves global balance by allocating keys to buckets after all keys have been placed into buckets d times using d independent hash functions. D-ghash calculates the achievable perfect balance and removes duplicate keys to reach it. Meanwhile, d-ghash reduces the number of bucket accesses for looking up a key by creating as many empty buckets as possible without disturbing the balance. Furthermore, d-ghash uses a table to encode the hash function ID for the bucket where a key is located, guiding the lookup and avoiding extra bucket accesses. Simulation results show that d-ghash achieves better balance than existing approaches and significantly reduces the number of bucket accesses.
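The guided-lookup idea above can be sketched as follows. The class and the per-key guide structure are purely illustrative: the actual d-ghash encodes function IDs in a compact table built during setup, not a per-key dictionary, but the sketch shows why recording the placing function lets a lookup probe exactly one bucket.

```python
import random

class GHashSketch:
    """Illustrative sketch of guided lookup: once the global balancing
    step has chosen, for each key, which of the d hash functions places
    it, recording that function ID lets a lookup probe one bucket."""
    def __init__(self, d, num_buckets, seed=0):
        rng = random.Random(seed)
        self.seeds = [rng.randrange(1 << 30) for _ in range(d)]
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]
        self.guide = {}  # key -> hash function ID chosen at setup

    def _h(self, i, key):
        # i-th hash function, derived from a per-function seed.
        return hash((self.seeds[i], key)) % self.num_buckets

    def insert(self, key, func_id):
        # func_id is assumed to come from the global balancing step.
        self.guide[key] = func_id
        self.buckets[self._h(func_id, key)].append(key)

    def lookup(self, key):
        func_id = self.guide.get(key)
        if func_id is None:
            return False          # no guide entry: key was never inserted
        return key in self.buckets[self._h(func_id, key)]  # one bucket access
```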
CHAPTER 6
INTELLIGENT ROW BUFFER PREFETCHES
6.1 Background and Motivation
As discussed in the introduction, accesses to DRAM are distributed across channels, ranks, and eventually individual banks. Inside each bank, the DRAM arrays are organized into rows. Before a memory location can be read, the entire row containing that location is opened and read into the row buffer. Leaving a row buffer open after every access (the open-page policy) enables more efficient access to the same open row, at the expense of increased access delay to other rows in the same DRAM array. A request to the open row is called a row buffer hit. A row buffer miss happens when the next request accesses a different row, which can cause a long delay.
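The hit/miss behavior under an open-page policy can be illustrated with a single-bank sketch:

```python
def classify_row_accesses(row_sequence):
    """Classify each access to one bank as a row-buffer hit or miss
    under an open-page policy (the first access is always a miss)."""
    open_row = None
    results = []
    for row in row_sequence:
        results.append("hit" if row == open_row else "miss")
        open_row = row  # the accessed row is left open
    return results

# Two back-to-back accesses to row 5 yield one hit; switching rows misses.
outcome = classify_row_accesses([5, 5, 7, 5])  # -> ['miss', 'hit', 'miss', 'miss']
```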
Row buffer locality has long been observed and utilized in previous proposals, which generally take one of two approaches. The first is to change DRAM mapping and scheduling to improve row buffer hits. [77] proposes a permutation-based page interleaving scheme to reduce row-buffer conflicts. [78] introduces a Minimalist Open-page memory scheduling policy that captures open-page gains with a relatively small number of page accesses per page activation. They observe that the commonly used open-page address mapping schemes map each memory page to a sequential region of real memory, so linear access sequences hit in the row buffer, but this can cause interference between applications sharing the same DRAM devices [79] and cannot exploit bank-level parallelism. They argue that the number of row-buffer hits is generally small and that an adjusted DRAM mapping scheme can both preserve row-buffer hits and reduce row-buffer conflicts. However, they need a complex data prefetch engine and a complex scheduler to schedule normal and prefetch requests.
[80] proposes a three-stage memory controller that first groups requests based on row-buffer locality, then performs inter-application request scheduling, and lastly issues simple FIFO DRAM commands. They mainly target CPU-GPU systems, where the two kinds of applications can interfere heavily and behave very differently in their row-buffer accesses.
The second approach is to change DRAM page-closure management. [81] first proposes tracking history at a per-DRAM-page granularity and uses a two-level branch predictor to predict whether to close a row buffer. [82] extends that proposal with a one-level, low-cost access-based predictor (ABP) that closes a row buffer after a specified number of accesses or when a page conflict occurs. They argue that the number of accesses to a given DRAM page is a better basis for page closure than timer-based policies. [83] and [84] propose application-aware page policies that assign different page policies to different applications based on memory intensity and locality.
Of all the related works, [85] is the most related to ours. [85] proposed the row-based page policy (RBPP), which tracks row addresses and uses them as an indicator to decide whether to close the row buffer when the active memory request finishes. They use a few registers to record the most accessed rows in a bank. For each recorded row address, a counter that is dynamically updated based on the access pattern determines whether the row buffer should be closed. They use an LRU scheme to replace old entries. We will show in the results section that replacing entries only by LRU is less accurate than replacing based on the row access count. More specifically, rows that are accessed only once or twice are very common in many workloads [39]; they tend to frequently evict entries in the most-accessed-row registers (MARR), causing a poor hit ratio. In contrast, we use a general approach that adds a learning table to filter out those requests. The comparison results will be shown in the results section.
When a hot row is identified, [85] proposed modifying the DRAM page policy, which cannot reduce latency when accesses to two hot rows interleave with each other. We argue that by simply caching the hot rows without modifying the DRAM mapping, we can still harness the latency gain of row buffer hits and avoid the complexity of modifying the DRAM.
Figure 6-1. Hot row pattern of 10 workloads. Each panel plots the row number (Y axis) against HybridSim cycles (X axis) for one workload: bt, bwaves, gems, lbm, leslie3d, mg, milc, soplex, swim, and zeusmp.
Figure 6-1 shows a slice of row buffer accesses when we simulate the No-Cache Design mentioned in Chapter 3. The Y axis is the row number of a request address and the X axis is the simulation cycle. The overall simulation length is 200K cycles. Over this window, it is interesting to observe that 8 out of 10 workloads exhibit a strong pattern in which some rows are accessed much more than others. However, due to interleaved accesses to different rows, we do not observe a decent row buffer hit rate. One way to improve the row buffer hit rate is to schedule all requests to the same row together; another is to prefetch requests inside the hot rows for later use.
6.2 Hot Row Buffer Design and Results
We propose a simple but effective design that could utilize the hot row pattern we
observed in the previous subsection. Two data structures are needed: a Learning Table (LT) and
a Hot-Row Buffer (HRB). The LT captures new hot rows based on recently referenced rows, and the HRB buffers the hot rows. Each entry in the LT records the address tag of a row together with a shift register. The size of the LT is n and the width of the shift register is m. The HRB has k rows, and each entry in the HRB has a 2KB data buffer along with a reference counter that counts the number of references to the row.
When a reference hits a row in the HRB, the respective counter is incremented. When a requested row is not in the HRB, the row enters the LT if it is not already there. If the LT is full, the request is dropped. The m bits are initialized to all '0's when the row first enters the LT. A hit to a row in the LT shifts a '1' into the corresponding shift register.
When the shift-out bit is '1', a hot row is identified. The newly identified row is fetched into the HRB to replace the row with the least used count. The counter of the new row is initialized to the middle value the counter can represent; meanwhile, the other counters are decremented by one. The row is then dropped from the LT, creating an empty entry. If a request misses the HRB and hits the LT but the shift-out bit is '0', no action is taken.
The process continues until a maximum of h hot rows are identified. After the last hot row is inserted into the HRB, the entire LT is cleared and the process starts over.
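Putting the LT and HRB rules above together, a behavioral sketch is given below. The 8-bit counter range and its mid value of 128 are assumptions (the text leaves the counter width unspecified), and the periodic LT wipe after h promotions is omitted for brevity.

```python
class HotRowTracker:
    """Sketch of the Learning Table (LT) / Hot-Row Buffer (HRB) interplay.

    n: LT entries, m: shift-register width, k: HRB rows (names follow
    the text). Data buffers are omitted; only the directories are modeled.
    """
    def __init__(self, n=16, m=4, k=8):
        self.n, self.m, self.k = n, m, k
        self.lt = {}    # row -> m-bit shift register (as an int)
        self.hrb = {}   # row -> saturating reference counter

    def access(self, row):
        if row in self.hrb:                       # HRB hit: bump counter
            self.hrb[row] = min(self.hrb[row] + 1, 255)
            return "hrb_hit"
        reg = self.lt.get(row)
        if reg is None:
            if len(self.lt) < self.n:             # enter LT if space
                self.lt[row] = 0                  # shift register all '0's
            return "miss"                         # LT full: request dropped
        shifted = (reg << 1) | 1                  # LT hit: shift a '1' in
        if (shifted >> self.m) & 1:               # shift-out bit is '1'
            del self.lt[row]                      # hot row identified
            if len(self.hrb) >= self.k:           # evict least-used row
                victim = min(self.hrb, key=self.hrb.get)
                del self.hrb[victim]
            for r in self.hrb:                    # age the other counters
                self.hrb[r] = max(self.hrb[r] - 1, 0)
            self.hrb[row] = 128                   # mid-value initialization
            return "promoted"
        self.lt[row] = shifted & ((1 << self.m) - 1)
        return "lt_hit"
```

With m = 4, a row is promoted on its sixth access: one access to enter the LT, four LT hits to fill the register with ones, and a fifth LT hit whose shift-out bit is '1'.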
As for the replacement policy of the HRB, one might think LRU replacement would be a good choice. We therefore add a competing directory that implements the LRU replacement policy. Periodically, we compare the number of hits under each scheme and select the one with more hits for the next time period. Note that at the end of a period, the directory recording the losing scheme is updated to the contents of the other, and the hit/miss counters for the two schemes are reset to measure the next period. A dynamic method with a saturating counter is used to switch between the two schemes. Figure 6-2 shows the flow diagram when a new row address arrives; we omit the LT restart and request-dropping parts.
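The saturating-counter switch between the two directories might look like the sketch below; the counter width and threshold are assumptions, not values from the text.

```python
class SchemeSelector:
    """At each period end, nudge a saturating counter toward whichever
    directory (HRB count-based or LRU) collected more hits, and use the
    counter to pick the scheme applied during the next period."""
    def __init__(self, max_count=3):
        self.counter = 0                    # low values favor HRB
        self.max_count = max_count          # saturation limit (assumed)

    def end_of_period(self, hrb_hits, lru_hits):
        if lru_hits > hrb_hits:
            self.counter = min(self.counter + 1, self.max_count)
        elif hrb_hits > lru_hits:
            self.counter = max(self.counter - 1, 0)
        # Switch only after the counter crosses the midpoint, so a single
        # noisy period does not flip the scheme.
        return "LRU" if self.counter > self.max_count // 2 else "HRB"
```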
Figure 6-2. Hot row identification and update.
Figure 6-3 shows the detailed hit ratio of capturing reused blocks for the 10 workloads with different HRB configurations. The X axis is the number of rows stored in the directory for each scheme, and the Y axis is the hit ratio of the HRB/LRU directory over all row accesses at that size. We observe that the hybrid scheme provides a reasonable hit ratio across all workloads, which matches the hot row pattern observed in Figure 6-1. With more entries recorded, the hit ratio clearly increases for all workloads. Milc captures more than 90% of row accesses when using more than 16 entries. Gems and zeusmp perform badly
with only a slight increase in hit ratio when more entries are used. In general, the LRU scheme performs better than the HRB scheme when it has enough entries; however, when space is limited and only a few entries can be recorded, the HRB scheme performs better.
Table 6-1 summarizes the hit ratios of the 10 workloads when using 64 entries. 7 out of 10 workloads have a hit rate over 50%, and 4 workloads have a hit rate above 90%. The average hit rate of the 10 workloads reaches 57.7%. We believe this can be utilized to achieve a reasonable performance gain.
Table 6-1. Hit ratio for the hybrid scheme of 10 workloads using 64 entries.
workload hit ratio (%)
bt 91.2
bwaves 91.3
gems 26.3
lbm 76.7
leslie3d 49.4
mg 96.8
milc 94.9
soplex 53.1
swim 76.1
zeusmp 13.3
Geomean 57.7
Figure 6-3. Results of the proposed hybrid scheme. Each panel plots the hit ratio (Y axis) against the number of recorded hot rows (X axis) for the LRU, HRB, and hybrid schemes, for workloads bt, bwaves, gems, lbm, leslie3d, mg, milc, soplex, swim, and zeusmp.
Previous proposals have focused only on different ways of modifying row buffer management to reduce row buffer conflicts. We argue that we can simply cache these hot rows and avoid the complexity of changing the DRAM organization. Of course, caching an entire row not only wastes space when only a few blocks in the row are accessed, but also puts a heavy burden on bandwidth. We can instead cache part of the row: when the cached blocks of a row are accessed, we start prefetching the remaining blocks in the row. This can be implemented easily with a simple stream prefetcher.
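The prefetch-on-access step of such a stream prefetcher fits in a few lines. The blocks-per-row count and the prefetch degree below are illustrative parameters.

```python
def stream_prefetch(block_col, blocks_per_row=32, degree=4):
    """Return the next `degree` block columns to prefetch within the
    same row, clipped at the row boundary (a minimal sketch)."""
    return [c for c in range(block_col + 1, block_col + 1 + degree)
            if c < blocks_per_row]

# Accessing block 10 of a hot row queues prefetches for blocks 11-14.
queued = stream_prefetch(10)  # -> [11, 12, 13, 14]
```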
Figure 6-4. Block column difference within a row for 10 workloads. Each panel plots the number of requests (Y axis) against the block distance from the first access to the row (X axis).
Figure 6-4 shows the column difference inside a row buffer. Upon the first request to a certain row, we record its address as the base address. When later requests access the same row, we record the column difference between the new request and the base. We observe a strong regularity that the following blocks in the row are accessed. Taking lbm as an example, most requests in a row have increasing positive differences relative to the first access
to a row. As the difference grows, the number of requests decreases. This motivates our use of a simple stream prefetcher in the performance evaluation. Seven of the ten workloads show the same strong pattern as lbm. For milc, there is a gap between odd and even differences. For swim and zeusmp, the differences are more scattered. Soplex and zeusmp also show many requests with negative differences; for such patterns, a more advanced prefetching algorithm could be applied.
6.3 Performance Evaluation
In this section, we present the IPC improvement of hot row prefetching over the base design without it, along with the change in row buffer hit ratio.
Figure 6-5. IPC, row buffer hit ratio, and cache hit rate improvement of 10 workloads.
The total cost of the learning table is calculated as follows. Each learning table costs 16 × (0.5 + 3) = 56B per bank (3B records the row address). The HRB/LRU scheme records 1-64 rows, each with a 1B reference count, costing at most 2 × 64 × (1 + 3) = 512B per bank. The total overhead is therefore 568B per bank for recording 64 rows. Covering all 16 banks requires 8.9KB in total, which can easily fit into the last-level cache. Once we have identified
the hot rows, we implement a simple stream prefetcher that prefetches the next 4 blocks inside the row.
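The storage-overhead arithmetic above can be checked directly:

```python
# Learning table: 16 entries, each a 0.5B shift register plus a 3B row address.
lt_bytes = 16 * (0.5 + 3)          # 56 bytes per bank
# Two competing directories (HRB and LRU), 64 rows each,
# with a 1B reference count plus a 3B row address per entry.
dir_bytes = 2 * 64 * (1 + 3)       # 512 bytes per bank
per_bank = lt_bytes + dir_bytes    # 568 bytes per bank
total_kb = per_bank * 16 / 1024    # 16 banks -> about 8.9 KB in total
```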
Figure 6-5 shows the IPC and row buffer hit improvement of adding a hot row prefetcher for the 10 workloads. The overall IPC improvement is 9.1%, ranging from 3.1% to 16.4% across individual workloads. 6 out of 10 workloads have an IPC improvement over 10%. Bt, mg and milc have the most significant improvement, while zeusmp has the least.
We also observe an average of 6.3% more row buffer hits. In general, the IPC improvement is consistent with the row buffer hit increase. Prefetching the next 4 blocks upon a hot row access creates 4 row buffer hits, which improves the row buffer hit rate. If any of these prefetched blocks is later requested, its access time is reduced, lowering the average memory access time. Among the 10 workloads, milc and mg have more than 10% increased row buffer hits; gems, soplex and zeusmp have less than 5%.
We also show the last-level cache hit rate improvement. Prefetched blocks may become cache hits when they arrive at the last-level cache before being requested. We observe a slight increase in the L3 cache hit rate for all workloads; the average increase is 3.9%. Note that even cache misses can be served faster because the blocks were prefetched earlier.
Table 6-2 shows the usage of the prefetched blocks. The average usage is about 52%, which means that on average 2 out of the 4 blocks we prefetch are requested next. 7 out of 10 workloads have more than 52% accuracy. Among them, bt, milc, and mg have the highest prefetch usage, which is consistent with their IPC results. Zeusmp, soplex, and gems have the lowest. Useless prefetched blocks consume bandwidth and pollute the cache, hurting the improvement prefetching brings.
Table 6-2. Prefetch usage for 10 workloads using a simple stream prefetcher.
workload prefetch_hit_percentage
bt 80
bwaves 71
gems 32
lbm 62
leslie3d 51
mg 72
milc 83
soplex 33
swim 54
zeusmp 23
Geomean 52
Table 6-3. Sensitivity study on prefetch granularity.
Percentage(%) IPC_speedup row_buffer_hit_improve cache_hit_rate_improve
prefetch_2 5.4 3.7 2.1
prefetch_4 9.1 6.3 3.8
prefetch_6 3.1 8.3 0.4
prefetch_8 -2.8 10.1 -2.2
Table 6-3 shows a sensitivity study of varying the stream prefetch granularity. Compared to prefetching only two blocks on every row access, prefetching 4 blocks improves the row buffer hit ratio and the IPC by reducing the last-level cache miss rate. But the improvement becomes smaller, and even negative, when we prefetch 6 or more blocks: prefetching more blocks on every row access puts a heavy burden on the bandwidth and hurts LLC performance.
6.4 Conclusion
From the results, we can conclude that there exists strong locality pattern for row buffer
accesses. But because of those accesses are interleaved with each other, we tend to have more
row buffer conflicts, rather than row buffer hits. Based on the hot row pattern, we can easily
identify some frequently used rows. We evaluate the proposed scheme to capture hot rows with
96
trace collected from Marssx86 and DRAMSim2. The use of a learning table filters requests that
are accessed only a few times. The competing of LRU and HRB generates the hybrid scheme
that provides the best hit ratio. Results have shown that an average of 57.7% row accesses can be
captured by using 568B per bank. We also show that simple LRU replacement used by previous
RBPP scheme is not effective if recording limited hot rows.
We further implement a simple stream prefetcher to harness the hot row pattern captured by our learning table design. Results demonstrate that with a simple prefetch-in-row stream prefetcher, we achieve an IPC speedup of 9.1% and a row buffer hit rate improvement of 6.3%.
CHAPTER 7
SUMMARY
We propose four works in this dissertation targeting improvements in memory hierarchy performance. The proposed ideas can be easily applied to real-world systems, and our evaluations show that they improve system performance significantly. With the increasing demand for high-performance memory systems, the proposed techniques are valuable.
In the first work, we present a new caching technique that caches a portion of the large tag array for an off-die stacked DRAM cache. Due to its large size, the tag array is impractical to fit on-die, so caching a portion of the tags can reduce the need to go off-die twice for each DRAM cache access. To reduce the space requirement for cached tags and to obtain high coverage of DRAM cache accesses, we proposed and evaluated a sector-based Cache Lookaside Table (CLT) to record cache tags on-die. The CLT reduces the space requirement by sharing a sector tag among a number of consecutive cache blocks and uses location (way) pointers to locate the blocks in the off-die cache data array. The large sector also takes advantage of spatial locality for better coverage. In comparison with the Alloy cache, the ATCache, and the TagTables approaches, the average improvements of CLT are in the range of 4-15%.
In the second work, a new Bloom filter is introduced to filter L3 cache misses so that such requests can bypass the L1, L2 and L3 caches, shortening the L3 miss penalty in a 4-level cache hierarchy. The proposed Bloom filter applies a simple indexing scheme that decodes the low-order block address to determine the hashed location in the BF array. To provide better hashing randomization, partial index bits are XORed with the adjacent higher-order address bits. In addition, with certain combinations of the limited block address bits, multiple index functions can be selected to further reduce the false-positive rate. Performance evaluation using SPEC2006 benchmarks on an 8-core system with 4-level caches shows that the proposed simple hashing scheme can lower the average false-positive rate below 5% for filtering L3 misses and improve the average IPC by 10.5% over no L3 filtering and runahead. Furthermore, the proposed BF indexing scheme resolves an inherently difficult problem in using a Bloom filter to identify L3 cache misses: due to dynamic updates of the cache content, a counting Bloom filter would normally be necessary to keep the BF array current. A unique advantage of the proposed BF index is that it includes the cache index bits, so blocks hashed to the same BF array location are allocated in the same cache set. By searching the tags in the set when a block is replaced, the corresponding BF bit can be reset correctly without using expensive counters.
The third work proposes a new guided multiple-hashing method, d-ghash. Unlike previous approaches, which progressively place each key into the least-loaded bucket, d-ghash achieves global balance by allocating keys to buckets after all keys have been placed into buckets d times using d independent hash functions. D-ghash calculates the achievable perfect balance and removes duplicate keys to reach it. Meanwhile, d-ghash reduces the number of bucket accesses for looking up a key by creating as many empty buckets as possible without disturbing the balance. Furthermore, d-ghash uses a table to encode the hash function ID for the bucket where a key is located, guiding the lookup and avoiding extra bucket accesses. Simulation results show that d-ghash achieves better balance than existing approaches and significantly reduces the number of bucket accesses.
The fourth work digs into the details of DRAM row buffer accesses. By collecting memory accesses from most of the SPEC CPU workloads, we find that requests to the DRAM rows in each bank are not evenly distributed: some rows in a bank receive more requests than others. We call these more-accessed rows "hot rows". Based on the observed hot row pattern, we propose a simple design that uses a learning table to capture these hot rows. Once we identify a hot row, we sequentially prefetch blocks in that row upon a row access. We evaluate this idea using a simple stream prefetcher, and the results show a 9.1% average IPC improvement over a design without a prefetcher.
The proposed ideas have been verified by the results presented in each individual chapter.
LIST OF REFERENCES
[1] P. Hammarlund, "The Fourth-Generation Intel Core Processor," in MICRO, 2014.
[2] Y. Deng and W. P. Maly, "2.5-dimensional VLSI system integration.," in IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 2005.
[3] K. Banerjee, S. Souri, P. Kapur and K. C. Saraswat, "3-D ICs: a novel chip design for
improving deep-submicrometer interconnect performance and systems-on-chip
integration," in Proceedings of the IEEE, 2001.
[4] G. Loh and M. D. Hill, "Efficiently enabling conventional block sizes for very large die-
stacked DRAM caches," in MICRO, 2011.
[5] J. Sim, G. Loh, H. Kim , M. Connor and M. Thottehodi, "A Mostly-Clean DRAM
Cache for Effective Hit Speculation and Self-balancing Dispatch," in MICRO, 2012.
[6] X. Jiang and e. al., "CHOP: Adaptive filter-based DRAM caching for CMP server
platforms," in HPCA, 2010.
[7] G. Loh, "Extending the Effectiveness of 3D-stacked DRAM Cache with An Adaptive
Multi-queue Policy," in MICRO, 2009.
[8] G. Loh and M. Hill, " Supporting very large DRAM caches with compound access
scheduling and MissMaps," in MICRO, 2012.
[9] M. K. Qureshi and G. Loh, "Fundamental Latency Trade-offs in Architecting DRAM
Caches," in MICRO, 2012.
[10] L. Zhao, R. Iyer, R. Illikkal and D. Newell, "Exploring DRAM cache architectures for
CMP server platforms," in ICCD, 2007.
[11] D. Woo, N. Seong, D. Lewis and H. Lee, "An Optimized 3D-stacked Memory
Architecture by Exploiting Excessive High-density TSV Bandwidth," in HPCA, 2010.
[12] T. Kgil, S. D'Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Reinhardt and K.
Flautner, "PicoServer: Using 3D stacking technology to enable a compact energy
efficient chip multiprocessor," in ASPLOS, 2006.
[13] C. Liu, I. Ganusov and M. Burtscher, "Bridging the Processor-memory Performance Gap
with 3D IC Technology," in IEEE Design & Test of Computers, 2005.
[14] G. Loh, "3D-stacked Memory Architectures for Multi-core Processors," in ISCA, 2008.
[15] C. Chou, A. Jaleel and M. K. Qureshi, "A Two-Level Memory Organization with
Capacity of Main Memory and Flexibility of Hardware-Managed Cache," in MICRO,
2014.
[16] X. Dong, Y. Xie, N. Muralimanohar and N. P. Jouppi, "Simple but effective
heterogeneous main memory with on-chip memory controller support," in SC, 2010.
[17] G. Loh and et al., "Challenges in Heterogeneous Die-Stacked and Off- Chip Memory
Systems," in SHAW, 2012.
[18] J. Pawlowski, "Hybrid Memory Cube: Breakthrough DRAM Performance with a
Fundamentally Re-Architected DRAM Subsystem," in Hot Chips, 2011.
[19] J. Sim, A. Alameldeen, Z. chishti, C. Wilkerson and H. Kim, "Transparent Hardware
Management of Stacked DRAM as Part of Memory," in MICRO, 2014.
[20] F. Botelho, R. Pagh and N. Ziviani, "Simple and space-efficient minimal perfect hash
functions," in WADS, 2007.
[21] B. Vocking, "How Asymmetry Helps Load Balancing," in IEEE Symp. on Foundations of Computer Science, 1999.
[22] A. Kirsch and M. Mitzenmacher, "On the Performance of Multiple Choice Hash Tables
with Moves on Deletes and Inserts," in Communication, Control, and Computing, 2008.
[23] F. Hao, M. Kodialam and T. V. Lakshman, "Building high accuracy bloom filters using
partitioned hashing," in SIGMETRICS, 2007.
[24] B. Bloom, "Space / Time Trade-offs in Hash Coding with Allowable Errors," in Comm.
ACM, 1970.
[25] T. Wang. http://burtleburtle.net/bob/hash/integer.html.
[26] V. Srinivasan and G. Varghese, "Fast Address Lookups Using Controlled Prefix
Expansion," in ACM Transactions on Computer Systems, 1999.
[27] A. Patel, F. Afram, S. Chen and K. Ghose, "MARSSx86: A Full System Simulator for
x86 CPUs," in DAC, 2011.
[28] M. T. Yourst, "PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural
Simulator," in ISPASS, 2007.
[29] J. Stevens, P. Tschirhart, M. Chang, I. Bhati, P. Enns, J. Greensky, Z. Chishti, S. Lu and
B. Jacob, "An Integrated Simulation Infrastructure for the Entire Memory Hierarchy:
Cache, DRAM, Nonvolatile Memory, and Disk," in ITJ, 2013.
[30] "QEMU," http://wiki.qemu.org/Main_Page.
[31] Y. Chou, B. Fahs and S. Abraham, "Microarchitecture Optimizations for Exploiting
Memory-level Parallelism," in ISCA, 2004.
[32] T. E. Carlson, W. Heirman and L. Eeckhout, "Sniper: Exploring the Level of Abstraction
for Scalable and Accurate Parallel Multi-core Simulation," in SC, 2011.
[33] J. Henning, "SPEC CPU2006 memory footprint," in ACM SIGARCH Computer
Architecture News, 2007.
[34] B. Rogers, A. Krishna, G. Bell, K. Vu, X. Jiang and Y. Solihin, "Scaling the Bandwidth
Wall: Challenges in and Avenues for CMP Scaling," in ISCA, 2009.
[35] N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian, R. Iyer, S.
Makineni and D. Newell, "Optimizing Communication and Capacity in a 3D Stacked
Reconfigurable Cache Hierarchy," in HPCA, 2009.
[36] S. Lai, "Current Status of the Phase Change Memory and Its Future," in IEDM, 2003.
[37] S. Franey and M. Lipasti, "Tag Tables," in HPCA, 2015.
[38] C. Huang and V. Nagarajan, "ATCache: Reducing DRAM cache Latency via a Small
SRAM Tag Cache," in PACT, 2014.
[39] D. Jevdjic, S. Volos and B. Falsafi, "Die-stacked DRAM caches for servers: hit ratio,
latency, or bandwidth? have it all with footprint cache," in ISCA, 2013.
[40] D. Jevdjic, G. Loh, C. Kaynak and B. Falsafi, "Unison Cache : A Scalable and Effective
Die-Stacked DRAM Cache," in MICRO, 2014.
[41] N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki, "Reactive NUCA: Near-optimal
block placement and replication in distributed caches," in ISCA, 2009.
[42] J. Liptay, "Structural aspects of the System/360 Model 85, Part II: The cache," in IBM
Syst.J., 1968.
[43] S. Przybylski, "The Performance Impact of Block Sizes and Fetch Strategies," in ISCA,
1990.
[44] J. B. Rothman and A. J. Smith, "The Pool of Subsectors Cache Design," in ICS, 1999.
[45] A. Seznec, "Decoupled sectored caches: conciliating low tag implementation cost and
low miss ratio," in ISCA, 1994.
[46] S. Somogyi, T. Wenish, A. Ailamaki, B. Falsafi and A. Moshovos, "Spatial memory
streaming," in ISCA, 2006.
[47] G. Loh and M. Hill, "Addendum for “Efficiently enabling conventional block sizes for
very large die-stacked DRAM caches”," 2011.
[48] J. Meza, J. Chang, H. Yoon, O. Mutlu and P. Ranganathan, "Enabling Efficient and
Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management," in
CAL, 2012.
[49] W. Chou, Y. Nain, H. Wei and C. Ma, "Caching tag for a large scale cache computer
memory system," in US Patent 5813031, 1998.
[50] T. Wicki, M. Kasinathan and R. Hetherington, "Cache tag caching," in US Patent
6212602, 2001.
[51] M. Qureshi, "Memory access prediction," in US Patent 12700043, 2011.
[52] "CACTI 6.5," http://www.hpl.hp.com/research/cacti.
[53] A. Seznec and P. Michaud, "A Case for (Partially) Tagged Geometric History Length
Branch Prediction," in Journal of Instruction Level Parallelism, 2006.
[54] A. Broder and M. Mitzenmacher, "Network applications of Bloom filters: A survey," in
Internet Math, 2004.
[55] J. K. Mullin, "Optimal semijoins for distributed database systems," in IEEE
Transactions on Software Engineering, 1990.
[56] L. Fan, P. Cao, J. Almeida and A. Broder, "Summary Cache: A Scalable Wide-area Web
Cache Sharing Protocol," in IEEE/ACM Transactions on Networking, 2000.
[57] R. Rajwar, M. Herlihy and K. Lai, "Virtualizing Transactional Memory," in ISCA, 2005.
[58] A. Roth, "Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced
Load Optimization," in ISCA, 2005.
[59] A. Moshovos, "RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based
Coherence," in ISCA, 2005.
[60] J. Peir, S. Lai, S. Lu, J. Stark and K. Lai, "Bloom filtering cache misses for accurate
data speculation and prefetching," in ICS, 2002.
[61] X. Li, D. Franklin, R. Bianchini and F. T. Chong, "ReDHiP: Recalibrating Deep
Hierarchy Prediction for Energy Efficiency," in IPDPS, 2014.
[62] A. Broder and M. Mitzenmacher, "Using Multiple Hash Functions to Improve IP
Lookups," in INFOCOM, 2001.
[63] S. Demetriades, M. Hanna, S. Cho and R. Melhem, "An Efficient Hardware-Based Multi-
hash Scheme for High Speed IP Lookup," in HOTI, 2008.
[64] H. Song, S. Dharmapurikar, J. Turner and J. Lockwood, "Fast Hash Table Lookup
Using Extended Bloom Filter: An Aid to Network Processing," in SIGCOMM, 2005.
[65] Z. Huang, D. Lin, J.-K. Peir and S. M. I. Alam, "Fast Routing Table Lookup Based on
Deterministic Multi-hashing," in ICNP, 2010.
[66] "Routing Information Service," http://www.ripe.net/ris.
[67] C. Hermsmeyer, H. Song, R. Schlenk, R. Gemelli and S. Bunse, "Towards 100G packet
processing: Challenges and technologies," in Bell Labs Technical Journal, 2009.
[68] S. Lumetta and M. Mitzenmacher, "Using the Power of Two Choices to Improve Bloom
Filters," in Internet Mathematics, 2007.
[69] Z. Huang, J.-K. Peir and S. Chen, "Approximately-Perfect Hashing: Improving Network
Throughput through Efficient Off-chip Routing," in INFOCOM, 2011.
[70] Y. Azar, A. Broder, A. Karlin and E. Upfal, "Balanced Allocations," in ACM Symp. on
Theory of Computing (STOC), 1994.
[71] R. Sprugnoli, "Perfect hashing functions: a single probe retrieving method for static
sets," in Comm. ACM, 1977.
[72] R. Pagh and F. F. Rodler, "Cuckoo Hashing," in ESA, 2001.
[73] S. Dharmapurikar, P. Krishnamurthy and D. Taylor, "Longest Prefix Matching Using
Bloom Filters," in SIGCOMM, 2003.
[74] B. Chazelle, R. Kilian and A. Tal, "The Bloomier filter: an efficient data structure for
static support lookup tables," in ACM SIAM, 2004.
[75] J. Hasan, S. Cadambi, V. Jakkula and S. Chakradhar, "Chisel: A Storage-efficient,
Collision-free Hash-based Network Processing Architecture," in ISCA, 2006.
[76] M. L. Fredman and J. Komlos, "On the Size of Separating Systems and Families of
Perfect Hash Functions," in SIAM. J. on Algebraic and Discrete Methods, 1984.
[77] Z. Zhang et al., "A permutation-based page interleaving scheme to reduce row-
buffer conflicts and exploit data locality," in MICRO, 2000.
[78] D. Kaseridis, J. Stuecheli and L. John, "Minimalist Open-page: A DRAM Page-mode
Scheduling Policy for the Many-core Era," in MICRO, 2011.
[79] T. Moscibroda and O. Mutlu, "Memory Performance Attacks: Denial of Memory
Service in Multi-Core Systems," in USENIX, 2007.
[80] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh and O. Mutlu, "Staged memory
scheduling: achieving high performance and scalability in heterogeneous systems," in
ISCA, 2012.
[81] Y. Xu, A. Agarwal and B. Davis, "Prediction in Dynamic SDRAM Controller Policies," in
SAMOS, 2009.
[82] M. Awasthi, D. W. Nellans, R. Balasubramonian and A. Davis, "Prediction Based
DRAM Row-Buffer Management in the Many-Core Era," in PACT, 2011.
[83] M. Jeong, D. Yoon, D. Sunwoo, M. Sullivan, I. Lee and M. Erez, "Balancing DRAM
Locality and Parallelism in Shared Memory CMP Systems," in HPCA, 2012.
[84] M. Xie, D. Tong, Y. Feng, K. Huang and X. Cheng, "Page Policy Control with Memory
Partitioning for DRAM Performance and Power Efficiency," in ISLPED, 2013.
[85] X. Shen, F. Song, H. Meng et al., "RBPP: A Row Based DRAM Page Policy for the
Many-core Era," in ICPADS, 2014.
[86] A. Jaleel, "Memory Characterization of Workloads Using Instrumentation-Driven
Simulation," in VSSAD, 2007.
[87] P. Rosenfeld, E. Cooper-Balis and B. Jacob, "DRAMSim2: A Cycle Accurate Memory
System Simulator," in CAL, 2011.
[88] A. J. Smith, "Line (block) size choice for CPU cache memories," in IEEE Transactions
on Computers, 1987.
[89] "NAS Parallel Benchmarks," http://www.nas.nasa.gov/publications/npb.html.
[90] A. Brodnik and J. I. Munro, "Membership in Constant Time and Almost-Minimum
Space," in SIAM Journal on Computing, 1999.
[91] W. Starke et al., "The cache and memory subsystems of the IBM POWER8
processor," in IBM J. Res. & Dev., vol. 59, no. 1, 2015.
[92] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach,
Morgan Kaufmann, 2011.
[93] D. Kaseridis, J. Stuecheli and L. K. John, "Minimalist Open-page: A DRAM Page-mode
Scheduling Policy for the Many-core Era," in MICRO, 2011.
[94] T. Moscibroda and O. Mutlu, "Memory Performance Attacks: Denial of Memory Service
in Multi-Core Systems," in USENIX, 2007.
[95] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh and O. Mutlu, "Staged
memory scheduling: achieving high performance and scalability in heterogeneous
systems," in ISCA, 2012.
[96] M. Hill, "A case for direct-mapped caches," in IEEE Computer, 1988.
BIOGRAPHICAL SKETCH
Xi Tao received his Ph.D. in computer engineering from the University of Florida in the
fall of 2016. He received his B.S. degree in Electronic Engineering and Information Science
from the University of Science and Technology of China in 2007. His research interests
include computer architecture, cache design, and Bloom filter applications.