IMPROVING MEMORY HIERARCHY PERFORMANCE WITH
DRAM CACHE, RUNAHEAD CACHE MISSES, AND
INTELLIGENT ROW-BUFFER PREFETCHES
By
XI TAO
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT
OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
2016
© 2016 Xi Tao
To my parents
ACKNOWLEDGMENTS
It has been a long journey since I first arrived in Gainesville. I never dreamt of
studying at a place so distant from my hometown, yet I have spent five and a half wonderful years
here.
Obtaining a Ph.D. degree is never easy. You constantly feel stressed, at a loss, and
sometimes wonder how to continue. Throughout those years, I have been truly grateful for all the help
and guidance from my advisor, Dr. Jih-Kwon Peir, who is always so patient and kind. His
brilliant suggestions helped me overcome many obstacles. He has also spent numerous hours
helping me review my papers and make revisions. Without his help, I really cannot
imagine sitting here and writing this dissertation now.
I also want to thank my Ph.D. committee members: Dr. Shigang Chen, Dr. Prabhat
Mishra, Dr. Beverly Sanders, and Dr. Tan Wong. Thank you for your advice and support during
my study at the University of Florida. I would also like to thank my lab mate Qi Zeng, who has
provided great suggestions and advice on our collaborative work.
Lastly, I want to give my greatest thanks to my friends here in Gainesville. You guys
really made my life colorful here. I also want to thank my parents, who have always been there
encouraging me and believing in me. I could not have achieved all this without your support!
TABLE OF CONTENTS
page
ACKNOWLEDGMENTS ...............................................................................................................4
LIST OF TABLES ...........................................................................................................................7
LIST OF FIGURES .........................................................................................................................8
ABSTRACT ...................................................................................................................................10
CHAPTER
1 INTRODUCTION ..................................................................................................................12
1.1 DRAM Caches ...............................................................................................................17
1.2 Runahead Cache Misses ................................................................................................18
1.3 Hashing Fundamentals and Bloom Filter ......................................................................19
1.4 Intelligent Row Buffer ...................................................................................................21
2 PERFORMANCE METHODOLOGY AND WORKLOAD SELECTION ..........................23
2.1 Evaluation Methodology ...............................................................................................23
2.2 Workload Selection .......................................................................................................25
3 CACHE LOOKASIDE TABLE .............................................................................................26
3.1 Background and Related Work ......................................................................................26
3.2 CLT Overview ...............................................................................................................29
3.2.1 Stacked Off-die DRAM Cache with On-Die CLT ............................................29
3.2.2 CLT Coverage ...................................................................................................31
3.2.3 Comparison of DRAM Cache Methods ............................................................32
3.3 CLT Design ...................................................................................................................37
3.4 Performance Evaluation .................................................................................................41
3.4.1 Difference between Related Proposals..............................................................41
3.4.2 Performance Results..........................................................................................43
3.4.3 Sensitivity Study and Future Projection ............................................................47
3.4.4 Summary ...........................................................................................................49
4 RUNAHEAD CACHE MISSES USING BLOOM FILTER .................................................50
4.1 Background and Related work .......................................................................................50
4.2 Memory Hierarchy and Timing analysis .......................................................................51
4.3 Performance Results ......................................................................................................57
4.3.1 IPC Comparison ................................................................................................58
4.3.2 Sensitivity Study ...............................................................................................60
4.4 Summary ........................................................................................................................62
5 GUIDED MULTIPLE HASHING .........................................................................................64
5.1 Background ....................................................................................................................64
5.2 Hashing ..........................................................................................................................66
5.3 Proposed Algorithm .......................................................................................................67
5.3.1 The Setup Algorithm .........................................................................................67
5.3.2 The Lookup Algorithm .....................................................................................70
5.3.3 The Update Algorithm ......................................................................................71
5.4 Performance Results ......................................................................................................72
5.5 Summary ........................................................................................................................82
6 INTELLIGENT ROW BUFFER PREFETCHES ..................................................................83
6.1 Background and Motivation ..........................................................................................83
6.2 Hot Row Buffer Design and Results .............................................................................86
6.3 Performance Evaluation .................................................................................................93
6.4 Conclusion .....................................................................................................................95
7 SUMMARY ............................................................................................................................97
LIST OF REFERENCES .............................................................................................................100
BIOGRAPHICAL SKETCH .......................................................................................................106
LIST OF TABLES
Table page
2-1 Architecture parameters of processor and memories ........................................................ 24
2-2 MPKI and footprint of the selected benchmarks .............................................................. 25
3-1 Comparison of different DRAM cache designs ................................................................ 33
3-2 Difference between three designs ..................................................................................... 42
3-3 Comparison of L4 MPKR, L4 occupancy and predictor accuracy ................................... 46
4-1 False-positive rates of 12 benchmarks .............................................................................. 59
4-2 Future Conventional DRAM parameters .......................................................................... 62
5-1 Notation and Definition .................................................................................................... 68
5-2 Routing table updates for enhanced 4-ghash .................................................................... 80
6-1 Hit ratio for hybrid scheme of 10 workloads using 64 entries .......................................... 89
6-2 Prefetch usage for 10 workloads using a simple stream prefetcher .................................. 95
6-3 Sensitivity study on prefetch granularity .......................................................................... 95
LIST OF FIGURES
Figure page
1-1 The structure of a memory hierarchy. ............................................................................... 13
1-2 Memory hierarchy organization with 4-level caches ........................................................ 14
1-3 DRAM Internal Organization .............................................................. 15
3-1 Memory hierarchy with stacked DRAM cache ................................................................ 30
3-2 Reuse distance curves normalized to the percentage of the maximum distance .............. 32
3-3 Coefficient of variation (CV) of hashing 64K cache-set using different indices ............. 35
3-4 DRAM cache MPKI using sector indexing ...................................................................... 36
3-5 CLT design schematics ..................................................................................................... 38
3-6 CLT operations in handling memory requests .................................................................. 39
3-7 CLT speedup with respect to Alloy, TagTables_64, and TagTables_16 .......................... 45
3-8 Memory access latency (CPU cycles)............................................................................... 45
3-9 IPC change for different CLT coverage............................................................................ 48
3-10 Execution cycle change for different sector size in CLT design ...................................... 49
4-1 Memory latency with / without BFL3 .............................................................................. 52
4-2 Cache indexing and hashing for BF .................................................................................. 55
4-3 False-positive rates for 6 hashing mechanisms ................................................................. 56
4-4 False-positive rates with m:n = 2:1, 4:1, 8:1, 16:1, and k = 1, 2 ...................................... 56
4-5 IPC comparisons with/without BF .................................................................................... 59
4-6 Average IPC for m:n ratios and hashing functions ........................................................... 61
4-7 Average IPC for different L4 sizes ................................................................................... 61
4-8 Average IPC over different DRAM latency ..................................................................... 62
5-1 Distribution of keys in buckets of four hashing algorithms. ............................................. 66
5-2 A simple d-ghash table with 5 keys, 8 buckets and 2 hash functions. .............................. 68
5-3 Bucket loads for the five hashing schemes. ...................................................................... 73
5-4 Number of bucket accesses per lookup for d-ghash. ........................................................ 74
5-5 Average number of keys per lookup based on memory usage ratio. ................................ 75
5-6 The average number of non-empty buckets for looking up a key. ................................... 77
5-7 Sensitivity of the number of bucket accesses per lookup. ................................................ 78
5-8 Changes in the number of bucket accesses per lookup and rehash percentage. ............... 79
5-9 Number of bucket accesses per lookup for experiments with five routing tables. ........... 80
5-10 Experiment with the update trace using enhanced 4-ghash. ............................................. 80
6-1 Hot Row pattern of 10 workloads ..................................................................................... 85
6-2 Hot Row Identification and Update .................................................................................. 88
6-3 Results of proposed hybrid scheme .................................................................................. 89
6-4 Block column difference within a row for 10 workloads. ................................................ 91
6-5 IPC/Row buffer hit Ratio speedup of 10 workloads ......................................................... 93
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
IMPROVING MEMORY HIERARCHY PERFORMANCE WITH
DRAM CACHE, RUNAHEAD CACHE MISSES, AND
INTELLIGENT ROW-BUFFER PREFETCHES
By
Xi Tao
December 2016
Chair: Jih-Kwon Peir
Major: Computer Engineering
Large off-die stacked DRAM caches have been proposed to provide higher effective
bandwidth and lower average latency to main memory. Designing a large off-die DRAM cache
with conventional block size (64 bytes) requires a large tag array which is impractical to fit on-
die. We investigate a novel design called Cache Lookaside Table (CLT) to reduce the average
access latency and to lessen off-die tag array accesses. The proposed CLT exploits memory
reference locality and provides a fast alternative tag path to capture most of the DRAM cache
requests.
To hide long memory latency and to alleviate memory bandwidth requirements, a fourth
level of cache (L4) has been introduced in modern high-performance computing systems. However,
adding cache levels worsens the cache miss penalty, since memory requests go through the levels
of the cache hierarchy sequentially. We investigate a new way of using a Bloom Filter (BF) to
predict cache misses early at a particular cache level. These misses can run ahead to access
lower levels of cache and memory to shorten the miss penalty.
Inspired by the usefulness of the Bloom filter in cache accesses, we conduct a fundamental
study to find a way to balance the hash buckets while maintaining a low false-positive rate for
the Bloom filter. To broaden the applications, our study is based on the routing and packet
forwarding function at the core of the IP network-layer protocols. We propose a guided multi-
hashing approach which achieves near-perfect load balance among hash buckets, while limiting
the number of buckets to be probed for each key (address) lookup, where each bucket holds one
or a few routing entries.
A key challenge in effectively improving system performance lies in maximizing both
row-buffer hits and bank-level parallelism while simultaneously providing fairness among
different requests. We observe that accesses to each bank in DRAM are not equally distributed
among the rows for most of the workloads we study. We propose a simple scheme to
capture the hot-row pattern and prefetch data in these hot rows. Results demonstrate the
effectiveness of our proposed scheme.
CHAPTER 1
INTRODUCTION
The memory hierarchy plays a critical role in designing high-performance processors. It
becomes increasingly difficult to advance processor performance further due to the memory wall
problem: despite aggressive out-of-order, speculative execution, the processor stalls waiting for
data from memory. Lately, microprocessor manufacturers have been putting a growing number of cores
on a chip to satisfy the increased demand of large workloads such as data mining and analytics.
As the number of cores grows, the pressure on the memory subsystem in terms of capacity and
bandwidth increases as well.
Memory hierarchy design takes advantage of memory reference locality and of the trade-offs
between capacity and access speed among memory technologies to hide memory latency and to alleviate
memory bandwidth requirements. Between the CPU and main memory there are multiple levels of
cache memory. Close to the CPU are small, fast caches with higher bandwidth that
temporarily store the most frequently used data. At increasing levels, the cache capacity
becomes larger, but the access speed becomes slower with reduced bandwidth. Based on
reference locality, the most recently referenced data can be accessed in the highest level of
cache, while the large capacity of the lower levels captures recently referenced data that cannot fit into the
highest level. Figure 1-1 depicts this memory hierarchy organization.
A conventional cache access goes through a tag path to determine a cache hit or miss and
a data path to access the data in case of a hit. The cache tag and data arrays maintain topological
equivalence, such that matching the address tag in the tag array determines the location of the
block in the data array. These two paths may overlap, permitting the data array access to start before
the hit position is determined, which shortens the cache access time.
Figure 1-1. The structure of a memory hierarchy: as the distance from the processor increases, so
does the size and access time, but with decreasing bandwidth.
Modern high-performance multi-core systems generally adopt a 3-level on-die cache
architecture, referred to as the L1, L2, and L3 caches, which are placed on the processor die using
SRAM technology. The L1 and L2 caches are usually private, meaning each core has its own
L1 and L2 caches. The L3 cache is normally shared by all the cores and serves as a
connection point to the main memory, which is located off the processor chip with long access
latency. Intel Haswell [1], the 4th-generation Core, adopts a 4th-level cache built on embedded
DRAM technology to hide main memory latency and to deliver substantial performance
improvements for media, graphics, and other high-performance computing applications. A
general organization of a multicore system with 4 levels of caches is illustrated in Figure 1-2.
Figure 1-2. Memory hierarchy organization with 4-level caches.
In order to quantitatively measure the memory performance of a cache hierarchy, we use
the average memory access time (AMAT) to describe the average time it takes for the entire
hierarchy to return data. For a three-level cache architecture with L1, L2, and L3
caches, the AMAT is calculated as follows:
AMAT = HitTime_L1 + MissRate_L1 × MissPenalty_L1
     = HitTime_L1 + MissRate_L1 × (HitTime_L2 + MissRate_L2
       × (HitTime_L3 + MissRate_L3 × MissPenalty_L3))
where HitTime is the access time of the cache at a particular level, MissRate is the
fraction of accesses that miss at that level, and MissPenalty is the time to fetch the block from the next
level of the memory hierarchy. Fetching data from the next level may itself encounter cache hits or cache
misses, so the miss penalty of the current level is equivalent to the AMAT starting from the next
level. To achieve high performance, we want to design a cache hierarchy with fast hit times,
small miss ratios, and small miss penalties.
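The recursive form of the AMAT equation can be evaluated from the bottom level up, since each level's miss penalty is simply the AMAT of the levels below it. The sketch below illustrates this with assumed latencies and miss rates (the specific numbers are illustrative only, not measurements from this study):

```python
def amat(levels, mem_penalty):
    """Average memory access time for a cache hierarchy.

    levels: list of (hit_time_cycles, miss_rate) ordered from L1 down.
    mem_penalty: main-memory access time in cycles.
    """
    penalty = mem_penalty
    # Work upward from the last cache level: each level's miss
    # penalty is the AMAT of everything below it.
    for hit_time, miss_rate in reversed(levels):
        penalty = hit_time + miss_rate * penalty
    return penalty

# Assumed numbers: L1 = 4 cycles / 5% misses, L2 = 12 cycles / 20%,
# L3 = 40 cycles / 50%, main memory = 200 cycles.
latency = amat([(4, 0.05), (12, 0.20), (40, 0.50)], 200)
print(round(latency, 3))
```

With these assumed parameters the hierarchy returns data in about 6 cycles on average, even though a full miss to memory costs 200 cycles, which illustrates how strongly the upper-level hit rates dominate AMAT.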
Beyond the cache hierarchy, main memory is where the application's instructions and data
are stored. During program execution, requested instructions and data are moved from disk to
main memory on demand, initiating an I/O activity called a page fault. Main memory is built using
dynamic random access memory (DRAM) technology. When the requested instructions and
data are not located in the caches, they are accessed from main memory with substantially longer
latency. Multiple levels of caches hold the working set of recently referenced instructions and
data in the hope of limiting the need to access them from main memory.
Figure 1-3. DRAM internal organization.
DRAM-based main memory is a multi-level hierarchy of structures. At the highest level,
each processor die is connected to one or more DRAM channels. Each channel has a dedicated
command, address and data bus. One or more memory modules can be connected to each DRAM
channel. Each memory module contains a number of DRAM chips. As the data output width of
each DRAM chip is low (typically 8 bits for commodity DRAM), multiple chips are grouped
together to form a rank. In other words, a rank is a collection of DRAM chips that together feed
the standard 64-bit data bus.
Internally, each chip consists of multiple banks. Each bank consists of many rows of
DRAM cells and a row buffer that caches the last accessed row of the bank. Each DRAM cell in
the row is identified by its column address. Reading or writing data in DRAM
requires that the entire row first be read into the row buffer; reads and writes then operate
directly on the row buffer. After the operation, the row is closed and the data in the row buffer is
written back into the DRAM array. Figure 1-3 shows this topology.
When the memory controller receives an access to a 64-byte cache line, it first decodes
the address into channel, rank, bank, row, and column numbers. As the data of each 64-byte
cache line is split across the chips within a rank, the memory controller maintains a
mapping scheme to determine which parts of the cache line are mapped to which chips. Upon
receiving the command, each chip accesses the corresponding column of data from its row
buffer and transfers it on the data bus. Once the data is transferred, the memory controller
assembles the requested cache line and sends it back to the processor.
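The address decoding step can be sketched as a sequence of bit-field extractions. The field widths below (1 channel bit, 1 rank bit, 3 bank bits, 16 row bits, 7 column bits above a 64-byte line offset) are illustrative assumptions, not a fixed standard; real controllers choose their bit mapping to spread traffic across channels and banks.

```python
# Decode a physical address into DRAM coordinates.
# Field widths are assumed for illustration only.
def decode(addr):
    offset = addr & 0x3F            # byte within the 64-byte line
    addr >>= 6
    channel = addr & 0x1;  addr >>= 1
    column  = addr & 0x7F; addr >>= 7   # 128 line-sized columns per row
    bank    = addr & 0x7;  addr >>= 3   # 8 banks per rank
    rank    = addr & 0x1;  addr >>= 1
    row     = addr & 0xFFFF             # 64K rows per bank
    return dict(channel=channel, rank=rank, bank=bank,
                row=row, column=column, offset=offset)

print(decode(0x12345678))
```

Two addresses that differ only in their column bits map to the same row of the same bank, which is exactly the case where back-to-back accesses become row-buffer hits.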
Although all banks on a channel share a common set of command and data buses, operations on
multiple banks may occur in parallel (e.g., opening a row in one bank while reading data from
another bank's row buffer) so long as the commands are properly scheduled and all other
DRAM timing constraints are obeyed. A memory controller can improve memory system
throughput by scheduling requests such that they proceed in parallel across banks. Meanwhile,
DRAM can operate under different page policies. Leaving a row buffer open after every access is called
the open-page policy; closing the row buffer after every access is called the close-page policy. Accessing
data already loaded in the row buffer, also called a row-buffer hit, incurs a shorter latency than
when the corresponding row must first be "opened" from the DRAM array. Therefore, the open-
page policy enables more efficient access to the same open row, at the expense of increased
access delay to other rows in the same DRAM array. A row-buffer conflict occurs when a request
targets a row other than the currently open row, which incurs substantial delay.
Close-page policies, on the other hand, can serve row-buffer conflict requests faster.
Our proposed research focuses on various techniques, from caches to memory, to
improve the performance of the memory hierarchy on modern multicore systems. The outline of the
research topics is given in the following subsections. The performance evaluation methodology
and workload selection are given in Chapter 2. This is followed by a detailed description of
each research topic in Chapters 3, 4, 5, and 6. Finally, a summary of the proposed research is
given in Chapter 7.
1.1 DRAM Caches
Cache capacity is limited by the number of transistors on the processor die. With newer
packaging technologies such as silicon interposers (2.5D) [2] or 3D integrated circuit stacking [3],
processor and DRAM can be placed in close proximity, which gives processors high-bandwidth
and low-latency access to dense memory. At the current time, however, stacked
DRAM capacity is still insufficient for it to be used as the system main memory [4] [5]. There have
been two approaches to integrating stacked DRAM: either as the last-level cache [6] [7] [4] [8]
[9] [10], or as a part of the main memory [11] [12] [13] [14]. Using stacked DRAM as a part of
memory requires extra address mapping and data block swapping between fast and slow DRAMs
[15] [16] [17] [18] [19]. This approach utilizes the entire DRAM capacity, which is essential
when the capacities of stacked and off-chip DRAM are close. However, with tens of GB of off-
chip DRAM in today's personal systems, it is more viable to use an order-of-magnitude smaller
stacked DRAM as the last-level cache (L4) to provide fast memory access and to alleviate off-
chip memory bandwidth requirements.
Our first research topic is to investigate fundamental issues and to assess the performance
advantage of a large stacked DRAM cache included in the memory hierarchy as the last-level
cache. Large off-die stacked DRAM caches have been proposed to provide higher effective
bandwidth and lower average latency to main memory. Designing a large off-die DRAM cache
with a conventional block size (e.g., 64 bytes) requires a large tag array which is impractical to fit
on-die. Placing the large directory in off-die memory prolongs the latency, since a tag access is
necessary before the data can be accessed; this additional trip also generates extra off-die traffic.
We investigate a novel design called the Cache Lookaside Table (CLT) to reduce the average
access latency and to lessen off-die tag array accesses. The basic approach is to cache a small
number of recently referenced tags on-die. An off-die tag access is avoided when a requested
block's tag hits a cached tag. To save on-die space, cached tags are recorded at a large sector
granularity so that a tag entry is shared by multiple blocks. However, due to the loss
of one-to-one physical mapping between the cached tags and the data array, a way pointer is
added for each block to indicate its way location. The proposed CLT exploits memory reference
locality and provides a fast alternative tag path that captures most of the DRAM cache requests.
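The core idea can be sketched in a few lines. The structures and sizes below are illustrative only, not the CLT design evaluated in Chapter 3: an on-die table maps sector tags to per-block way pointers, a CLT hit goes straight to the DRAM cache data array, and a CLT miss falls back to the off-die tag array.

```python
BLOCKS_PER_SECTOR = 8   # assumed sector size (in 64-byte blocks)

class CLT:
    """Toy model: on-die cache of sector tags with per-block way pointers."""

    def __init__(self, entries=256):
        self.entries = entries
        self.table = {}   # sector tag -> list of way pointers (None = unknown)

    def lookup(self, block_addr):
        sector, block = divmod(block_addr, BLOCKS_PER_SECTOR)
        ways = self.table.get(sector)
        if ways is None or ways[block] is None:
            return None           # CLT miss: consult the off-die tag array
        return ways[block]        # CLT hit: access the data array directly

    def fill(self, block_addr, way):
        # Record the way location learned from an off-die tag lookup.
        sector, block = divmod(block_addr, BLOCKS_PER_SECTOR)
        ways = self.table.setdefault(sector, [None] * BLOCKS_PER_SECTOR)
        ways[block] = way
        if len(self.table) > self.entries:    # crude capacity bound (FIFO)
            self.table.pop(next(iter(self.table)))
```

Because neighboring blocks share one sector entry, a single off-die tag fetch can seed way pointers for an entire sector, which is how spatial locality keeps the on-die table small.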
Experimental results show that, in comparison with other proposed DRAM caching
mechanisms, a small on-die CLT delivers average performance improvements in the range of
4-15%.
1.2 Runahead Cache Misses
To hide long memory latency and to alleviate memory bandwidth requirement, a fourth-
level of cache (L4) is introduced in modern high-performance computing systems as illustrated
in Figure 1-2. However, increasing cache levels worsens the cache miss penalty since memory
19
requests go through levels of cache hierarchy sequentially. We investigate a new way of using a
Bloom Filter (BF) to predict cache misses earlier at a particular cache level. These misses can
runahead to access lower level of caches and memory to shorten the miss penalty. One inherent
difficulty in using a BF to predict cache misses is due to the fact that cache contents are
dynamically updated through insertions and deletions. We propose a new BF hashing scheme
that extends the cache index for the target set to access the BF array. Since the BF index is a
superset of the cache index, all blocks hashed to the same BF location are allocated in the same
cache set to simplify updates to the BF array. When a block is evicted from the cache, the
corresponding BF bit is reset only when no block hashed to this location exists in the cache set.
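The update rule can be sketched as follows. The sizes are illustrative assumptions (1024 sets and 2 extra index bits, loosely corresponding to the m:n ratios studied in Chapter 4), not the evaluated configuration; the point is that a clear BF bit guarantees a miss, so the request can run ahead immediately.

```python
SETS, EXTRA_BITS = 1024, 2   # assumed sizes: BF has 4 bits per cache set

def bf_index(set_idx, tag):
    # Extend the cache set index with low-order tag bits: every block
    # that maps to one BF bit resides in the same cache set.
    return (set_idx << EXTRA_BITS) | (tag & ((1 << EXTRA_BITS) - 1))

bf = [False] * (SETS << EXTRA_BITS)
cache = {s: set() for s in range(SETS)}   # set index -> resident tags

def insert(set_idx, tag):
    cache[set_idx].add(tag)
    bf[bf_index(set_idx, tag)] = True

def evict(set_idx, tag):
    cache[set_idx].discard(tag)
    # Reset the bit only if no remaining block in this set hashes there;
    # only this one set ever needs to be checked.
    if not any(bf_index(set_idx, t) == bf_index(set_idx, tag)
               for t in cache[set_idx]):
        bf[bf_index(set_idx, tag)] = False

def predict_miss(set_idx, tag):
    # No false negatives: a clear bit means the block cannot be cached.
    return not bf[bf_index(set_idx, tag)]
```

A set bit, by contrast, may be a false positive (another block sharing the bit), in which case the normal tag lookup still resolves the access correctly; the BF only accelerates the guaranteed-miss case.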
Performance evaluation using a set of SPEC2006 benchmarks shows that using a BF at
the third-level (L3) cache of a 4-level cache hierarchy to filter and run ahead of L3 misses
improves IPC by 3-21%, with an average improvement of 9.5%.
1.3 Hashing Fundamentals and Bloom Filter
Inspired by the usefulness of Bloom filter in cache accesses, we conduct a fundamental
study to find a way to balance the hashing buckets while maintaining lower false-positive rate for
Bloom filter. To broaden the applications, our study is based on the routing and packet
forwarding function at the core of the IP network-layer protocols. The throughput of a router is
constrained by the speed at which the routing table lookup can be performed. Hash-based lookup
has been a research focus in this area due to its O(1) average lookup time.
It is well known that hash collisions are an inherent problem when a single random hash
function is used, causing an uneven distribution of keys among the hash buckets in a
nondeterministic fashion. The multiple-hashing technique, on the other hand, uses d independent
hash functions to place a key into one of d possible buckets. The criterion for selecting the target
bucket for placement is flexible and can be controlled to accomplish a specific objective. One
well-known objective of using multiple hash functions is load balancing, i.e., balancing the keys
across the buckets [20] [21] [22] [23] [24] [25].
Another known objective of multiple hashing sets an opposite criterion: reducing the fill
factor of the hash buckets [22] [26], where the fill factor is measured by the ratio of nonempty buckets.
Instead of placing a key in the bucket with the smallest number of keys for load balancing, this
approach places the key in a bucket that already holds keys. The objective of this
placement is to maximize the number of empty buckets. One potential application is to apply the
low fill-factor hashing method to a Bloom filter [22] [26]: with more zeros remaining in the
Bloom filter, the critical false-positive rate can be reduced. To create more zeros when establishing
the Bloom filter, however, multiple sets of hash functions are needed for different keys, since all
k hashed bits for each key must be set during the setup of the Bloom filter. Therefore, the
multiple-hashing concept is actually applied to choosing a set of hash functions out of multiple
groups so as to maximize the number of zeros in the Bloom filter after recording k '1's for every key.
Building on a series of prior multi-hashing developments, including d-random, 2-left, and d-left,
we find that a new guided multi-hashing approach holds the promise of further pushing the
envelope of this line of research, achieving significant performance improvement beyond what
today's best techniques can deliver. Our guided multi-hashing approach achieves near-perfect
load balance among hash buckets, while limiting the number of buckets to be probed for each
key (address) lookup, where each bucket holds one or a few routing entries. Unlike the localized
optimization of prior approaches, we utilize the full information of the multi-hash mapping from
keys to hash buckets for a global key-to-bucket assignment. We have dual objectives of lowering
the bucket size while increasing the number of empty buckets, which helps to reduce the number of buckets
brought from off-chip memory to the network processor for each lookup. We introduce
mechanisms to ensure that most lookups require only one bucket to be fetched.
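As a point of reference, the baseline d-random scheme that the guided approach improves upon can be sketched as follows. The hash construction and the routing-prefix keys are illustrative assumptions; setup places each key in the least-loaded of its d candidate buckets, and a lookup must probe all d candidates.

```python
import hashlib

D, BUCKETS = 2, 8   # assumed: d = 2 hash functions, 8 buckets

def candidates(key):
    # Derive d independent bucket indices from one cryptographic hash
    # (an illustrative stand-in for d independent hash functions).
    return [int.from_bytes(hashlib.sha256(f"{i}:{key}".encode())
                           .digest()[:4], "big") % BUCKETS
            for i in range(D)]

table = [[] for _ in range(BUCKETS)]

def insert(key):
    # d-random placement: choose the least-loaded candidate bucket.
    best = min(candidates(key), key=lambda b: len(table[b]))
    table[best].append(key)

def lookup(key):
    # Every candidate bucket must be probed; the guided scheme of
    # Chapter 5 instead assigns keys globally so most lookups fetch one.
    for b in candidates(key):
        if key in table[b]:
            return b
    return None

for k in ["10.0.0.0/8", "192.168.1.0/24", "172.16.0.0/12"]:
    insert(k)
```

The local, per-key placement decision is the weakness this chapter targets: each insertion only sees its own d candidates, so the global assignment can remain unbalanced even when a better overall key-to-bucket matching exists.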
Our simulation results show that, with the same number of hash functions, the guided
multiple-hashing schemes are better balanced than d-left and others, while the average number of
buckets accessed for each lookup is reduced by 20-50%.
1.4 Intelligent Row Buffer
Accessing off-chip memory is a major performance bottleneck in microprocessors. As all
of the cores must share the limited off-chip memory bandwidth, a large number of outstanding
requests greatly increases contention for the memory data and command buses. Because a bank
can only process one command at a time, a large number of requests also increases bank
contention, where requests must wait for busy banks to finish servicing other requests.
A key challenge in effectively improving system performance lies in maximizing both
row-buffer hits and bank level parallelism while simultaneously providing fairness among
different requests. We observed that accesses to each bank in DRAM are not equally distributed among rows for most of the workloads we study in Chapter 2. Some rows tend to receive more frequent accesses than others over a given interval due to spatial locality; we call these rows "hot rows". However, if requests from different hot rows in the same bank interleave with each other, there is little chance that those requests result in row-buffer hits.
DRAM banks will frequently close opened rows and issue commands to open other rows, causing large queuing delays (time spent waiting for the memory controller to start servicing a request) and DRAM device access delays (due to decreased row-buffer hit rates and bus contention).
We propose a simple scheme to capture the hot-row pattern and prefetch data in these hot rows. Prefetched data will be row-buffer hits, saving access time later. Results show that our design consistently performs better than simple LRU and LFU hot-row schemes.
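The hot-row idea can be illustrated with a minimal sketch: count accesses per (bank, row) over a window and flag rows whose counts cross a threshold as prefetch candidates. The threshold and aging window below are hypothetical parameters chosen for illustration, not the tuned values evaluated later in this dissertation.

```python
from collections import Counter

class HotRowDetector:
    """Sketch of hot-row detection: per-(bank, row) access counters that are
    periodically aged, with rows above a threshold flagged as 'hot'."""

    def __init__(self, threshold: int = 4, window: int = 256):
        self.threshold = threshold   # accesses needed to call a row hot
        self.window = window         # accesses between counter resets (aging)
        self.counts = Counter()
        self.accesses = 0

    def access(self, bank: int, row: int) -> bool:
        """Record an access; return True if the row now counts as hot."""
        self.counts[(bank, row)] += 1
        self.accesses += 1
        if self.accesses >= self.window:   # age all counters periodically
            self.counts.clear()
            self.accesses = 0
            return False
        return self.counts[(bank, row)] >= self.threshold
```

When a row is flagged hot, the memory controller could prefetch further blocks from that open row so later requests become row-buffer hits.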
CHAPTER 2
PERFORMANCE METHODOLOGY AND WORKLOAD SELECTION
2.1 Evaluation Methodology
In order to evaluate the performance advantages of the proposed works in memory
hierarchy designs, we adopt two cycle-accurate simulation methodologies. The first method is to
establish and run applications on MARSSx86 [27], an x86-based whole-system simulation
environment. MARSSx86 is built on QEMU, a full-system emulation environment, in which selected multi-threaded and multi-programmed workloads are compiled and run. The executed
instructions and memory requests drive a cycle-accurate multi-core model, which is extended
from PTLsim [28]. Memory requests are simulated through multiple levels of cache hierarchy. In
case of a last level cache miss, the request is issued to the memory, which is modelled using
DRAMsim2 [29], a cycle-accurate DDR-based DRAM model.
We develop a memory interface controller, called MICsim, to handle requests from the processors to the off-die DRAM cache and memory, along with a callback interface between MICsim and the multicore processor model in MARSSx86. When a memory request misses the last-level on-die caches, it is inserted into a memory request queue; MICsim processes requests from the head of this queue, one per cycle. We model partial hits to both the stacked and the conventional DRAMs. Outstanding requests are saved in a pending queue for detecting and holding subsequent requests to pending blocks. This first evaluation methodology is used to study the run-ahead cache miss proposal, since detailed L1, L2, and L3 caches are simulated to assess the effectiveness of bypassing certain cache levels.
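The pending-queue behavior described above, where subsequent requests to a block with an outstanding miss are held and released together, can be sketched as follows. The class and method names are ours, for illustration only, and are not MICsim's actual interface.

```python
class PendingQueue:
    """Sketch of pending-miss bookkeeping: the first miss to a block goes to
    DRAM; later requests to the same block wait on the outstanding miss."""

    def __init__(self):
        self.pending = {}   # block address -> list of waiting request ids

    def issue(self, block_addr: int, req_id: int) -> bool:
        """Return True if this request must go to DRAM (primary miss),
        False if it piggybacks on an outstanding miss (secondary miss)."""
        if block_addr in self.pending:
            self.pending[block_addr].append(req_id)
            return False
        self.pending[block_addr] = [req_id]
        return True

    def complete(self, block_addr: int) -> list:
        """Data returned: release every request waiting on this block."""
        return self.pending.pop(block_addr, [])
```

This coalescing keeps a burst of requests to one block from generating redundant DRAM traffic.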
Table 2-1. Architecture parameters of processor and memories.
Processor, DRAM Cache and DRAM Memory Parameters
Processor: 3.2GHz, 8 cores, out-of-order
L1 Caches: I/D split, 32KB MESI, 64B line, 4-way, 2 read ports, 1 write port, latency 4 cycles
L2 Cache: Private, 256KB, 64B line, 8-way, 2 read ports, 1 write port, latency 11 cycles, snooping-bus MESI protocol, 2 cycles per request, split transaction
L3 Cache: Shared, 8MB, 64B line, 16-way, 2 read ports, 2 write ports, latency 24 cycles
L4 DRAM Cache: 128MB-256MB, 1.6GHz, 16-byte bus, Channels/Ranks/Banks: 4/1/16, tCAS-tRCD-tRP: 9-9-9, tRAS-tRC: 36-33
Conventional DRAM: 16GB, 800MHz, 8-byte bus, Channels/Ranks/Banks: 2/1/8, tCAS-tRCD-tRP: 9-9-9, tRAS-tRC: 36-33, 2KB row buffer, close page
Although MARSSx86 precisely simulates multicore processors, the simulation time is unbearably long when simulating a large stacked DRAM cache and memory: billions of instructions are required to produce meaningful results. The virtual machine infrastructure in MARSSx86 also limits the physical address space, as pointed out in [30]. In the second method, we adopt the Epoch model [31] for estimating the execution time of different applications. It uses traces generated by the Pin-based Sniper simulator [32], in which the representative regions of applications are simulated and the requests sent to L3 are collected based on an Intel Gainestown configuration with private L1 and L2 caches and a shared L3 that interfaces with main memory. Per-core memory traces generated from Sniper are annotated with Epoch marks to ensure correct dependence tracking for issuing a cadence of memory requests. Each core exploits memory-level parallelism by issuing memory requests up to the Epoch mark. Each memory request is simulated through a cycle-accurate memory hierarchy model, the same as in the first method. The processor waits until all requests return from the memory controller before moving to the next Epoch. In this method, we model a precise on-die shared L3 cache with correct timing and bandwidth considerations. This Epoch simulation model is used in evaluating the
Cache Lookaside Table proposal, which provides an alternative tag path for a large stacked DRAM cache, as well as in studying intelligent hot-row prefetching for DRAM row buffers. Table 2-1 summarizes the architecture parameters used in our simulation.
2.2 Workload Selection
Table 2-2. MPKI and footprint of the selected benchmarks.
Benchmarks FootPrint (MB) L3MPKI L4MPKI
mcf 9310 74.0 19.2
gcc 477 39.5 1.4
lbm 3222 30.2 15.1
soplex 1693 28.1 21.9
milc 4084 27.5 22.1
libquantum 256 24.1 13.2
omnetpp 259 19.1 0.2
sphinx 78 12.1 0.2
bt 240 11.2 0.4
bwaves 3794 9.2 5.0
leslie3d 599 8.3 4.0
gems 1663 7.7 5.2
zeusmp 3488 6.3 4.5
We evaluated all workloads from SPEC CPU2006 and selected 12 applications with high L3 MPKI and large footprints. All workloads run in multithreaded mode, where each application is replicated 8 times, with 8 threads running on 8 cores. Therefore, the total memory footprint is roughly 8 times as large as the footprint reported in [33]. Table 2-2 gives basic information for these workloads. In general, we use the first 5 billion instructions to warm up caches, tables, and other data structures, and simulate the next billion instructions to collect performance statistics.
CHAPTER 3
CACHE LOOKASIDE TABLE
In this chapter, we propose our CLT work, which provides an alternative tag path for a large off-chip stacked DRAM cache. We begin by reviewing the current state-of-the-art designs that address the tag problem and the motivation for using a small on-die structure to capture the majority of DRAM cache tags. We then present the detailed design of how the CLT adopts the decoupled sector cache idea to exploit spatial locality as well as save tag space. Finally, we present results supporting our design.
3.1 Background and Related Work
Future high-end servers with tens or hundreds of cores demand high memory bandwidth. Recent advances in 3D stacking technology provide a viable solution to the memory wall problem [13] [34] [11] [35]. Stacking offers a promising venue for low-latency, high-bandwidth interconnect between processor and DRAM dies using through-silicon vias [3]. However, due to physical space constraints, the capacity of this nearby memory is limited and unsuitable to serve as the system main memory [4] [5] [36]. One viable approach is to use this nearby DRAM as the last-level cache for fast access and reduced bandwidth to main memory. Intel's Haswell [1], a fourth-generation core, is an example; it has a 128MB L4 cache built on embedded-DRAM technology.
Designs of large off-die DRAM caches have gained much interest recently [5] [37] [38] [39] [40] [6] [14] [7] [41]. Researchers have noted the large space requirement as well as the access time and power overheads of implementing a tag array for a large DRAM cache [4] [9]. Considering a common cache block size of 64 bytes, if each tag is 6 bytes, the tag array consumes 24MB and 96MB for 256MB and 1GB DRAM caches, respectively. Such a large tag
array is impractical to fit on the processor die. If this tag array is part of the off-die DRAM cache, each access requires an extra trip to fetch the tags.
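The tag-storage arithmetic can be verified directly: tag space equals the number of blocks times the bytes per tag entry.

```python
def tag_array_bytes(cache_bytes: int, block_bytes: int = 64, tag_bytes: int = 6) -> int:
    """Tag storage = (number of blocks) x (bytes per tag entry)."""
    return (cache_bytes // block_bytes) * tag_bytes

MB = 1 << 20
assert tag_array_bytes(256 * MB) == 24 * MB    # 256MB cache -> 24MB of tags
assert tag_array_bytes(1024 * MB) == 96 * MB   # 1GB cache  -> 96MB of tags
```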
There are two general approaches to handle the large tag array of a DRAM cache placed off-die. The first approach is to record large block tags in order to fit the tag array on-die. Zhao et al. [10] explore DRAM caches for CMP platforms. They recognize the tag space overhead and suggest storing all tags of a set in a contiguous DRAM block for fast access. They show that using on-die partial tags and a sector directory achieves the best space-performance tradeoff. However, partial tags are expensive and encounter false-positive situations. CHOP [6] advocates a large block size to alleviate the tag space overhead. It uses a separate filter cache to detect and cache only hot blocks, reducing the fragmentation problem.
The Footprint cache [39] uses large blocks and the sector cache idea [42] [43] [44] [45] to reduce the tag space and the memory bandwidth requirement. In addition, it predicts the footprint within a 2KB sector and prefetches those 64-byte blocks based on the footprint history [46]. Data prefetching is orthogonal to the proposed caching methods and is beyond the scope of this proposal. Nevertheless, it is worth pointing out that the Footprint cache loses cache space for blocks that are not part of the footprint. The Unison cache [40] extends the Footprint idea to even bigger stacked DRAM caches. It moves the sector tag off-chip and uses way-prediction to fetch the tag and the predicted data block in parallel.
The second approach is to keep the conventional block size and use other techniques to alleviate the impact of the extra off-die tag array access. Loh and Hill [4] [47] propose allocating all tags and data blocks of a cache set in a single contiguous DRAM location to improve row-buffer hits. To reduce the miss latency, they use a miss-map directory to identify DRAM cache misses and issue accesses directly to the next lower-level memory. To save space, they record the miss-map at the granularity of a large memory segment. However, when a segment is evicted from the miss-map directory, all blocks in the segment must be evicted from the DRAM cache. Sim et al. [5] suggest speculatively issuing requests directly to main memory when a block is predicted not to be in the DRAM cache. However, this requires handling complicated miss predictions.
To reduce extra off-die tag accesses, we can cache a small number of recently referenced tags on-die. An off-die tag array access is avoided when the tag is found on-die. This simple approach faces two problems. First, a tag is cached on-die only after a request misses the on-die cached tags; it does not take advantage of spatial locality in data accesses. Second, caching tags of individual blocks does not save tag space if we want to maintain the same capacity. Even worse, without a one-to-one mapping between the cached tags and the DRAM data array, a location pointer into the data array is needed for each cached block tag. Meza et al. [48] propose caching the block tags of an entire DRAM cache set as a unit in an on-die directory called Timber. Caching the tags of an entire set as a unit avoids way pointers, and blocks in a set need not be invalidated when the set's tags are evicted from Timber. However, caching all tags in a set does not save any tag space. Moreover, it follows neither the spatial nor the temporal locality principle in applications. The ATCache [38] applies the same idea as Timber with additional prefetching of sets of tags. Other works [49] [50] have proposed caching tags to improve cache access latency in the generic setting of a multilevel SRAM cache hierarchy.
The Alloy cache trades the high hit ratio of a set-associative cache for the fast access time of a direct-mapped cache [9]. Besides sacrificing hit ratio, the Alloy cache relies on a cache-miss predictor [51] to speculatively issue parallel accesses to both cache and memory when a cache miss is predicted, avoiding sequential accesses to the off-die cache and memory. The TagTables [37] applies the page-walk technique to identify large blocks (pages) located in cache.
It allocates the entire page into the same cache set in order to combine adjacent blocks into a
chunk to save space for the location pointers in the page-walk tables.
Our proposed CLT maintains the conventional block size for the off-die DRAM cache. It differs from other proposed approaches in how it avoids off-die tag array accesses. In contrast to TagTables, the Alloy cache, and the Footprint and Unison caches, the role of the CLT is to provide a fast, alternative on-die tag path that covers a majority of cache requests. With off-die full block tags as backup, the CLT can be decoupled from the off-die cache without the inherent complexity of the decoupled sector cache [45]. Unlike Timber and ATCache, the CLT caches on-die tags for bigger sectors, allowing it to capture spatial locality and to save tag space by sharing the sector tags. In addition, unlike the miss-map approach, the CLT can identify both cache hits and cache misses without block invalidation when a sector is replaced from the CLT. Furthermore, different from using partial tags or hit/miss speculation, the CLT maintains precise hit/miss information for all recorded blocks and is non-speculative, bypassing the off-die tag array for a majority of memory requests.
3.2 CLT Overview
3.2.1 Stacked Off-die DRAM Cache with On-Die CLT
Figure 3-1 depicts the block diagram of a CMP system with a nearby off-die stacked DRAM cache and an on-die Cache Lookaside Table (CLT). All requests to the DRAM cache are first filtered through the CLT, which records recently referenced memory sectors. When the sector of a requested block is found in the CLT (a CLT hit, which is distinct from a DRAM cache hit), the off-die tag directory access is removed from the critical cache access path. Either the stacked data array or the next-level DRAM is accessed, depending on the block hit/miss information. The proposed sector-based CLT records large sector tags to save space and to exploit spatial locality for better coverage. If a small CLT can cover a high percentage of the total requests, we can reduce the average memory access latency as well as the off-die bandwidth requirement without putting the entire set of cache tags on-die.
Figure 3-1. Memory hierarchy with stacked DRAM cache.
It is well known that a conventional cache access goes through a tag path to determine a cache hit or miss and a data path to access the data in case of a hit. The cache tag and data arrays maintain topological equivalency, such that matching the address tag in the tag array determines the location of the block in the data array. The original sector cache [42] records large sector tags to save tag space, but maintains physical equivalency with the block-based data array, where all blocks of a sector are allocated to fixed locations in the data array. This wastes cache space on unused blocks.
The decoupled sector cache [45] allocates requested blocks of a sector in any location within the target set of the data array for better cache performance. However, it faces a few inherent difficulties. First, without physical equivalency, it requires a location pointer into the data array for each block in a sector in order to locate the block. Second, due to the topological mismatch between the sector-based tag array and the block-based data array, the location pointers are also used to invalidate the remaining valid blocks when a sector is replaced from the sector tag array. To minimize such invalidations, the number of sector tags recorded in the sector tag array needs to be larger than the number of sectors matching the cache size. Third, the decoupled sector cache also requires a backward pointer from each block in the data array to its parent sector in the sector tag array for updating the validity information when the block is replaced from the data array. This double-pointer requirement, along with the enlarged number of sector tags, defeats the purpose of saving tag space and further complicates the sector-cache design.
The CLT only captures a portion of recently referenced sectors on-die and relies on off-die full block tags to handle the rest. It avoids two critical issues that the decoupled sector cache
encounters. First, the backward pointer from the blocks in the data array to the parent sector in
the CLT can be eliminated. This is due to the fact that with only a portion of the sectors recorded
in the CLT, the index bits to large cache sets can be a superset of the index bits to the CLT.
Although the missed block and the replaced block in a cache set can be from different sectors,
they must be located in the same CLT set. A search in the CLT set can identify the sector where
the evicted block belongs to for updating the validity information.
Second, when a sector is replaced from the CLT, the valid blocks in the sector can remain
in cache as long as all blocks in a sector are allocated in the same cache set. This allocation can
be accomplished by using the low-order bits of the sector address as the cache index bits. When
the sector is referenced later, a search of the block tags in the target cache set can recover all the
valid blocks in the sector. This search is possible because the block tags are maintained in the
cache tag array. Without the block tags, the decoupled sector cache must invalidate remaining
blocks when a sector is evicted. A detailed CLT design is given in Section 3.3.
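The index relationship just described, where the cache set is chosen by low-order sector-address bits (so all blocks of a sector share a set) and the CLT index bits are a subset of the cache index bits, can be sketched as follows. The field widths here are illustrative, not the configuration evaluated later.

```python
SECT_BITS = 4          # 16 blocks per sector (illustrative)
CACHE_SET_BITS = 16    # 64K cache sets (illustrative)
CLT_SET_BITS = 10      # 1K CLT sets, fewer than cache sets (illustrative)

def sector_addr(block_addr: int) -> int:
    """Drop the in-sector block ID to get the sector address."""
    return block_addr >> SECT_BITS

def cache_set(block_addr: int) -> int:
    # Index with low-order *sector* address bits so that all blocks of a
    # sector land in the same cache set.
    return sector_addr(block_addr) & ((1 << CACHE_SET_BITS) - 1)

def clt_set(block_addr: int) -> int:
    # The CLT index bits are the low-order subset of the cache index bits,
    # so an evicted block's sector is always found in the same CLT set.
    return sector_addr(block_addr) & ((1 << CLT_SET_BITS) - 1)
```

Because `clt_set` uses a strict subset of the bits used by `cache_set`, a missed block and the block it replaces, even from different sectors, always map to the same CLT set, which is what makes the backward pointer unnecessary.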
3.2.2 CLT Coverage
We first validate the potential CLT coverage using 12 SPEC2006 CPU workloads. In
Figure 3-2, we plot the accumulated reuse distance curves of the 12 workloads with large (2KB)
block size. We want to show that a small portion of recently used large blocks (sectors) can
indeed cover the majority of cache accesses. The horizontal axis (logarithmic scale) represents the
percentage of the reuse distance with respect to the full stack distance that covers the entire block
references, and the vertical axis is the accumulated percentage of the total blocks that can be
covered. We can observe that by recording 10% of the most recent referenced blocks, over 90%
of the requests can be covered for all workloads except for gcc and milc whose coverages are
82% and 88% respectively. These results support the CLT approach by recording a small portion
of recently reference sectors on-die to provide a fast path for a majority of the DRAM cache
requests.
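The coverage in Figure 3-2 is essentially an LRU stack-distance measurement. A minimal sketch of how such coverage can be computed from a sector-reference trace (illustrative code, not the actual measurement tool used here):

```python
from collections import OrderedDict

def coverage(trace: list, capacity: int) -> float:
    """Fraction of references whose LRU stack distance is below `capacity`,
    i.e. the hits an LRU-managed table of `capacity` sectors would see."""
    stack = OrderedDict()   # keys ordered oldest -> most recently used
    hits = 0
    for sector in trace:
        if sector in stack:
            # Stack distance = number of distinct sectors touched since the
            # previous reference to this one.
            dist = len(stack) - list(stack).index(sector) - 1
            if dist < capacity:
                hits += 1
            stack.move_to_end(sector)
        else:
            stack[sector] = True
    return hits / len(trace)
```

The linear scan makes this O(n) per reference; a production tool would use a tree-based stack-distance algorithm, but the semantics are the same.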
Figure 3-2. Reuse distance curves normalized to the percentage of the maximum distance.
3.2.3 Comparison of DRAM Cache Methods
There are several key aspects in comparing DRAM cache designs, including the Alloy cache, ATcache, TagTables, and the proposed CLT, as summarized in Table 3-1. For comparison purposes, we also include an impractical method that keeps the entire set of cache block tags on-die. We assume 64-way set-associativity for all caches. Note that the Footprint cache follows the original sector cache design plus prefetching of the blocks in the predicted footprint of each sector. Since prefetching techniques (e.g., a streaming prefetcher) benefit all proposed methods and are orthogonal to the cache design, we omit the prefetch aspect in our comparison.
The goal of all of these proposals is to avoid off-die tag accesses, so the storage requirements, including on-die SRAM, stacked DRAM, and regular memory, must be considered. First, different methods require different amounts of on-die SRAM. Alloy uses a small on-die table to predict DRAM cache misses for issuing parallel accesses to memory. ATcache caches tags of recently referenced cache sets. TagTables caches the page-walk tables in the on-die L3 on demand. CLT records tags of recently referenced sectors, along with valid, modify, and way-pointer bits for each block, to cover tag accesses for a majority of requests. For a fair comparison, we keep the on-die SRAM size fixed by adjusting the L3 size for all methods; for example, we deduct the CLT size from the L3 cache size in our evaluation.
Table 3-1. Comparison of different DRAM cache designs.
Method | On-die SRAM (constant) | Stacked DRAM (constant) | Tag-data mapping | Entropy on cache set | Cache placement
Block tag on-die | block tag | data | 1-1 map | block indices | 64-way LRU
Alloy cache | predict table | tag + data | 1-1 map | block indices | direct-map
TagTables | cached page-walk tables | data | 1-many decoupled | sector indices | chunk placement
CLT | CLT (tag+pointer) | tag + data | 1-many decoupled | sector indices | 64-way LRU
Next, for the stacked DRAM requirement, Alloy, ATcache and CLT must maintain tags along with the data blocks in the stacked DRAM, while TagTables does not keep separate block tags. Therefore, we reduce the data array size proportionally for Alloy, ATcache and CLT in our evaluation. Third, it is important to note that TagTables creates page-walk tables in main memory. Since we do not evaluate I/O activity, we do not impose any penalty for this extra memory requirement.
Physical mapping of the on-die tags and their data blocks is another key aspect. Alloy has a simple direct-mapped topology without separate on-die tags. ATcache does not alter the mapping of the on-die tags and their data. TagTables and CLT share a sector tag among multiple data blocks to save tag space. CLT records only a portion of the sector tags; as a result, a location pointer for each block in a sector is needed to locate the block in the data array. TagTables also requires the location pointers. It further limits itself to four pointers per sector (page) by combining adjacent blocks into physical chunks using more restricted block placement and replacement in the cache data array.
With respect to the fetch bandwidth requirement, all methods fetch 64-byte blocks from the off-die stacked DRAM. However, Alloy needs to fetch the 64-byte data block along with its tag and miscellaneous bits.
Last but not least, the effort of avoiding off-die tag accesses can affect cache performance. The first impact is on the entropy of indexing the cache sets. It is well known that using the low-order block address bits to hash to the cache sets provides good entropy, distributing blocks across all cache sets. However, due to the restricted mapping in CLT, as well as the need to combine adjacent blocks into chunks in TagTables, these approaches use the low-order bits of the sector address to determine the cache sets. Depending on the sector size, the cache indices are taken from higher-order address bits than the block indices. Using higher-order bits for indexing the cache may adversely impact the entropy of hashing blocks to the cache sets and create more conflicts.
These methods also differ in cache placement and replacement policies. ATcache and CLT maintain 64-way set-associativity in each set with a pseudo-LRU replacement policy decoupled from the topology of the sector tags in the CLT. Alloy uses a direct-mapped design, which may suffer lower hit ratios. TagTables relies on special placement and replacement mechanisms to create big chunks, since each sector can record only up to four chunks.
Figure 3-3. Coefficient of variation (CV) of hashing 64K cache sets using different indices.
In order to understand the impact on the entropy of hashing memory requests across cache sets, we show the coefficient of variation (CV) for five different sector-based cache indices in Figure 3-3. The CV is the ratio of the standard deviation to the mean of the number of requests hashed to each cache set. In this simulation, we assume 64K sets, hence 16 index bits, and a block size of 64 bytes. In the figure, sector_n indicates that the least-significant index bit starts log2(n) bits to the left of the least-significant bit of the block address. For a 64-byte block size, the least-significant 6 bits are the block offset. The
index bits start from the 7th bit for sector_1, the 10th bit for sector_8, and so on. In other words, n consecutive blocks are allocated to the same set. We can observe that the CV increases significantly as n grows for all workloads except milc. The workloads are sorted from left to right by the significance of their variation. With large n, this uneven distribution of memory requests across cache sets increases the chance of set conflicts and degrades cache performance. Milc exhibits special indexing behavior: allocating consecutive blocks to the same set actually reduces its CV. However, since milc has the smallest CV to begin with, the impact should be minimal.
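The CV metric used in Figure 3-3 can be reproduced with a short sketch. The address stream and parameters below are illustrative, not the SPEC traces used for the figure.

```python
import statistics
from collections import Counter

def set_index_cv(block_addrs: list, sector_blocks: int, num_sets: int) -> float:
    """Coefficient of variation (stdev / mean) of requests per cache set when
    the set index is taken from the sector address, i.e. sector_blocks
    consecutive blocks map to the same set (sector_1 = plain block indexing)."""
    shift = (sector_blocks - 1).bit_length()   # log2(sector_blocks)
    counts = Counter((a >> shift) % num_sets for a in block_addrs)
    per_set = [counts.get(s, 0) for s in range(num_sets)]
    mean = statistics.mean(per_set)
    return statistics.pstdev(per_set) / mean if mean else 0.0
```

A perfectly uniform stream yields a CV of 0; the more requests pile onto a few sets, the larger the CV.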
Figure 3-4. DRAM cache MPKI using sector indexing.
Figure 3-4 shows the impact of the sector-based indexing schemes on MPKI. We simulate a 256MB cache with 64-way set-associativity and 64-byte blocks. The results demonstrate that with a large sector size, allocating all blocks of a sector in the same cache set degrades cache performance significantly, especially when the sector size is 64. However, with a moderate sector size (e.g., 16 blocks), the impact is manageable. Among the workloads, mcf,
sphinx3, omnetpp, gcc, and leslie3d are impacted most, consistent with the CV results in Figure 3-3. These results suggest that a moderate sector size enables a decoupled CLT without compromising much cache performance.
3.3 CLT Design
An example of a 3-way set-associative CLT with 64-byte blocks and 16 blocks per sector is depicted in Figure 3-5. According to the size and set-associativity of the CLT, a few low-order bits of the sector address determine the CLT set. The remaining high-order tag bits are matched against the sector tags recorded in the set. In this example, the address field labeled sect represents the block ID within a sector and is used to look up the recorded block information. Each sector has a valid bit and 16 groups of valid (v), modify (m), and location pointer (way) bits for its 16 blocks. Given that the cache set index is part of the address, only the cache way is needed in the location pointer; for example, a 64-way DRAM cache requires 6 bits to record the way. Based on whether the sector is valid in the CLT and whether the requested block is located in the DRAM cache, the cache access works as follows.
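Before stepping through these cases, the address decomposition just described can be sketched as follows; the index width is illustrative, not the evaluated configuration.

```python
OFFSET_BITS = 6    # 64-byte blocks
SECT_BITS = 4      # 16 blocks per sector
INDEX_BITS = 10    # illustrative CLT set count (1K sets)

def split_address(addr: int):
    """Split a physical address into (tag, index, sect, offset) fields,
    mirroring the CLT lookup described above. Widths are illustrative."""
    offset = addr & ((1 << OFFSET_BITS) - 1)
    sect = (addr >> OFFSET_BITS) & ((1 << SECT_BITS) - 1)
    index = (addr >> (OFFSET_BITS + SECT_BITS)) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (OFFSET_BITS + SECT_BITS + INDEX_BITS)
    return tag, index, sect, offset
```

The `index` field selects the CLT set, `tag` is matched against the recorded sector tags, and `sect` picks the block's valid/modify/way group within the matching sector entry.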
First, when a sector tag match is found and the target sector is valid (a CLT hit), the 4-bit
block ID (sect) selects the corresponding hit/miss (same as a valid bit) and the way pointer for
the block. In case of a cache hit, the way pointer is used to access the data block from off-die
stacked DRAM data array. The critical off-die tag path is bypassed.
Second, on a CLT hit where the hit/miss bit indicates that the block is not located in the DRAM cache, the request is issued to the conventional DRAM (main memory); the off-die tag path is bypassed as well. When the missed block returns from the conventional DRAM, the block data and tag are stored into the DRAM cache at the LRU position given by the on-die replacement logic. A writeback is necessary if the evicted block is dirty. (For simplicity, we omit the dirty bits in the drawing.) Meanwhile, the CLT is updated by turning off the hit/miss bit for
the evicted block and recording a hit and the way location for the new block. Note that the
evicted block may not be in the same sector as the missing block. However, they must be located
in the same CLT set since the CLT index bits are a subset of the DRAM cache index bits. By
matching the cache tag and the remaining cache index bits with proper sector tag bits, the LRU
block in the CLT can be identified.
Figure 3-5. CLT design schematics.
Third, when a request misses the CLT, an off-die tag access is necessary to bring in all
the tags in the target cache set for determining hit/miss status and the way location of the
requested block. Depending on whether the requested block is located in cache, the remaining
cache and memory accesses are the same as that when the target sector is valid in the CLT. In
order to update the CLT for the new sector, cache tag comparison logic is extended to allow
matching of the tags in the target cache set with all other block tags in the sector. For those
blocks in the set, hit/miss status bits are set and their way pointers are recorded. For other blocks
that are missing, their corresponding hit/miss indicator is recorded as a miss. The new sector tag
and its associated hit/miss and location information replace the LRU sector in the CLT. Note that there is no cache invalidation for the evicted sector.
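The three lookup outcomes described above can be summarized in a short sketch. The data layout and names are ours, for illustration only.

```python
from dataclasses import dataclass

@dataclass
class SectorEntry:
    tag: int     # sector tag
    hit: list    # per-block: is the block present in the DRAM cache?
    way: list    # per-block: way pointer into the target cache set

def clt_lookup(clt_set: list, sector_tag: int, block_id: int):
    """Return ('hit', way) when the block is in the DRAM cache,
    ('miss', None) when the sector is tracked but the block is absent,
    and ('clt_miss', None) when the sector is not tracked at all;
    only the last case pays the off-die tag-array access."""
    for entry in clt_set:
        if entry.tag == sector_tag:
            if entry.hit[block_id]:
                return "hit", entry.way[block_id]
            return "miss", None
    return "clt_miss", None
```

Both the 'hit' and 'miss' outcomes bypass the off-die tag array entirely: one goes straight to the stacked data array, the other straight to main memory.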
Figure 3-6. CLT operations in handling memory requests. (A, B, and C are three sectors, each with four blocks; a few blocks are located in cache initially, and the circled ones are moved in due to cache misses. All three sectors are recorded in the same 2-way set-associative CLT set. 'Condition' indicates the CLT and cache hit/miss status for each request.)
In Figure 3-6, we illustrate the CLT operations in handling a sequence of DRAM cache
requests, A0, A2, A3, B1, B2, B3, C1, and C2, where A, B, C represent three different sectors
and each sector has 4 64-byte blocks as indicated by the subscript. The least-significant 6 address
bits are the block offset and the next 2 bits define the block IDs within a sector. The target sets of both the CLT and the cache are determined by the low-order bits of the sector address, as illustrated in the figure, so that all blocks of a sector are allocated to the same cache set. Since the number of cache
blocks is several times larger than the number of sectors in the CLT, the cache index bits are a
superset of the sector index bits. In this example, we assume sectors A, B, and C are hashed to
the same CLT set, but allocated to different cache sets. Several blocks in A, B, and C are located
in the 8-way DRAM cache initially. Note, the blocks marked by a circle are moved into the
cache after a miss occurs. For simplicity, we assume the CLT has 2-way set-associativity. We
also assume sector B is already recorded in the CLT with its sector tag, two valid blocks B1 and
B3 with location pointers ‘001’, ‘101’, and two invalid blocks B0 and B2.
When A0 is issued, it misses the CLT. All tags in the target DRAM cache set where A0 is
located are fetched and compared with all the block tags in A. A match of requested A0 is found
and so is A2, while A1 and A3 are missing. The request to the data array to fetch A0 is then
issued. The CLT is updated by recording sector tag for A and setting valid and way bits for A0,
A2 and invalid for A1, A3. When A2 is processed, it hits the CLT with a valid block indicator
and the location pointer is ‘100’. Therefore, the data block can be fetched from the correct data
array location directly. Next, A3 is also a CLT hit, but the block is invalid. A request is issued to
the conventional DRAM to bring in the missing block. According to the on-die pseudo-LRU
cache replacement logic, A3 is placed in way 6 to replace the LRU block. The DRAM cache tag
array and data array are updated accordingly. Meanwhile, the valid bit and location pointer are
updated for A3, and the valid bit for the replaced block is turned off if that block was valid.
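The tree-based pseudo-LRU replacement used to pick the victim way can be sketched as follows. This is a generic software illustration of the technique (the class name and interface are ours, not the dissertation's hardware logic); for a 64-way set it uses 63 tree bits, matching the 6-level tree sized later in this chapter.

```python
# Tree-based pseudo-LRU for one N-way cache set (N a power of two).
# bits[] holds the internal nodes of an implicit binary tree; each bit
# points toward the "colder" half of the set.

class TreePLRU:
    def __init__(self, ways):
        assert ways > 1 and ways & (ways - 1) == 0
        self.ways = ways
        self.bits = [0] * (ways - 1)      # 63 bits for a 64-way set

    def victim(self):
        """Follow the tree bits down to the pseudo-LRU way."""
        node = 0
        while node < self.ways - 1:       # internal nodes are 0..ways-2
            node = 2 * node + 1 + self.bits[node]
        return node - (self.ways - 1)     # leaf index -> way number

    def touch(self, way):
        """On an access, flip each ancestor to point away from `way`."""
        node = way + self.ways - 1        # leaf index in the implicit heap
        while node > 0:
            parent = (node - 1) // 2
            self.bits[parent] = 1 if node == 2 * parent + 1 else 0
            node = parent
```

For example, after touching ways 0 through 3 of a 4-way set in order, the victim is way 0; pseudo-LRU approximates, but does not exactly track, true LRU order.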
When B1 comes, it hits the CLT as a valid block, hence B1 can be fetched from the data
array directly with the location pointer ‘001’. The MRU/LRU position in the CLT is updated.
Next, B2 is a CLT hit, but a cache miss, which is handled the same way as A3. B2 is moved to
the DRAM cache set in way 3 afterwards. B3 hits both the CLT and the cache and can be treated
the same as B1. Next, C1 misses the CLT, but hits the cache. It can be handled the same as A0, where an off-die fetch to bring in all tags in the target DRAM set is necessary. In addition, to
record the new sector C in the CLT, sector A must be evicted to make room for sector C. Blocks
A0, A2 and A3 remain valid in cache. The update of C in the CLT is the same as the update of A
when A entered the CLT. Finally, C2 hits both the CLT and the cache and can be handled
accordingly.
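The walkthrough above can be condensed into a small behavioral sketch. This is an illustrative model only (the function and field names are ours); it omits the data-array replacement and the MRU/LRU update, and `fetch_tags_off_die` stands in for the single off-die trip that reads all tags of the target cache set.

```python
# Behavioral sketch of CLT lookup. clt_set is a list of CLTEntry objects
# ordered LRU -> MRU (at most 2 entries, matching the 2-way CLT example).
# fetch_tags_off_die(sector_tag) returns {block_id: way} for every block
# of the sector currently resident in the DRAM cache.

BLOCKS_PER_SECTOR = 4          # the example sectors A, B, C have 4 blocks

class CLTEntry:
    def __init__(self, sector_tag):
        self.sector_tag = sector_tag
        self.valid = [False] * BLOCKS_PER_SECTOR
        self.way = [None] * BLOCKS_PER_SECTOR

def clt_lookup(clt_set, sector_tag, block_id, fetch_tags_off_die):
    """Return (condition, way) for one request and update the CLT."""
    for entry in clt_set:
        if entry.sector_tag == sector_tag:         # CLT hit
            if entry.valid[block_id]:              # fetch data directly
                return "CLT hit, cache hit", entry.way[block_id]
            return "CLT hit, cache miss", None     # fill from memory, then update
    # CLT miss: fetch all tags of the target set in one off-die trip and
    # record every resident block of the sector at once.
    entry = CLTEntry(sector_tag)
    for blk, way in fetch_tags_off_die(sector_tag).items():
        entry.valid[blk], entry.way[blk] = True, way
    if len(clt_set) == 2:                          # evict LRU sector entry;
        clt_set.pop(0)                             # its blocks stay in cache
    clt_set.append(entry)
    if entry.valid[block_id]:
        return "CLT miss, cache hit", entry.way[block_id]
    return "CLT miss, cache miss", None
```

Replaying the start of the example: A0 is a CLT miss but a cache hit (one off-die tag fetch installs sector A), and the subsequent A2 then hits both the CLT and the cache.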
3.4 Performance Evaluation
3.4.1 Differences among Related Proposals
Table 3-2 summarizes the on-die SRAM space and latency for Alloy, ATcache,
TagTables, and CLT, as well as L3 cache size and DRAM cache data array size. The MAP-I
cache miss predictor is used in Alloy with one cycle access latency and 768-byte SRAM space.
TagTables takes up L3 space for caching the page-walk tables. By properly partitioning the TagTables by address, the metadata is allocated on the same L3 cache bank that triggers the tag access. As a result, the interconnect latency can be avoided. We use Cacti 6.5 [52] to estimate a
latency of 8 cycles for accessing the tag table, which is the same as that used in the TagTables
paper [37].
For CLT, the sector tag plus 16 groups of valid, modify, and way pointers account for 20 bytes per sector. With 4K CLT sets, each with 20 ways, the total number of sectors is 80K. Therefore, the CLT space is 80K × 20 bytes = 1.6 MB. In addition, we use a 6-level binary tree (63 bits) to implement a pseudo-LRU policy for the 64-way cache. The space requirement is 64K sets × 63 bits = 504 KB. Therefore, the total on-die SRAM for CLT is close to 2MB. We
use the same policy to allocate CLT partitions on the same L3 cache bank which triggers the
CLT access to avoid interconnect latency. With smaller 20 bytes of data, the estimated CLT
latency is 6 cycles. ATcache requires on-die Timber, pseudo-LRU logic, and a tag prefetcher.
Since each cache set consists of 64 4-byte tags, Timber is 12-way with 512 sets for a total of 1.54MB of tag space. In addition, each entry needs one bit for the prefetching logic, which costs 12 × 512 / 8 = 768 B.
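The SRAM sizing arithmetic above can be checked directly; a quick sketch (using KB = 1024 bytes):

```python
# Verifying the on-die SRAM budget arithmetic quoted above.
KB = 1024

# CLT: 4K sets x 20 ways = 80K sectors, 20 bytes per sector entry.
clt_bytes = (4 * KB * 20) * 20
assert clt_bytes == 1600 * KB                # the 1600KB tag figure

# Pseudo-LRU: one 63-bit tree per 64-way set, 64K sets.
plru_bytes = 64 * KB * 63 // 8
assert plru_bytes == 504 * KB                # 504KB

# ATcache prefetch logic: one bit per Timber entry (12 ways x 512 sets).
prefetch_bytes = 12 * 512 // 8
assert prefetch_bytes == 768                 # 768 B

print((clt_bytes + plru_bytes) // KB, "KB total for CLT")   # 2104 KB, close to 2MB
```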
CLT only records recently referenced sectors and must fetch all cache tags from the
stacked DRAM when a CLT miss occurs. For a 256MB cache with 64-byte blocks and 64-way set-associativity, each set has 64 blocks. Each block has a 30-bit address tag, a valid bit, and a modify bit, for a total of 4 bytes. Therefore, a CLT miss requires fetching 4 tag blocks of 256 bytes in total. Note that a 30-bit address tag can accommodate a 52-bit physical address. To overlap tag accesses,
four tag blocks are allocated in different banks in the stacked DRAM. ATcache requires fetching
4 tag blocks on a miss too. It also includes a tag prefetcher as described in [38].
Table 3-2. Differences among the four designs.

Design      SRAM size                                        On-die latency (cycles)   L3 cache size                  Stacked DRAM cache data array
Alloy       768 bytes                                        1                         8MB, 16-way                    240MB, direct-mapped
ATcache     2MB (1.54MB + 768B prefetch + 504KB pseudo-LRU)  6                         6MB, 12-way                    240MB, 64-way, pseudo-LRU
TagTables   2MB metadata in L3                               8                         8MB, 16-way (metadata in L3)   256MB, 64-way, chunk placement
CLT         2MB (1600KB tag + 504KB pseudo-LRU)              6                         6MB, 12-way                    240MB, 64-way, pseudo-LRU
For Alloy, we follow the operations described in [9]. The MAP-I cache-miss predictor is implemented to predict cache misses for parallel accesses to the DRAM cache and memory. Each core has a 256-entry table of 3-bit hit/miss counters. The address of the L3-miss-causing instruction is hashed using folded-xor [53] to index the table. In Alloy, the extra tag fetched along with the data block from the stacked DRAM is charged one burst cycle.
For TagTables, the page-walk tables are dynamically created in main memory during the
simulation. We do not charge any penalty for creating the page-walk tables. The tables are cached in L3 on demand. Extra latency occurs when the needed entry in a table is not located in L3. A
fetch to the main memory is issued to bring back the block with the needed information. We
follow the same procedure in [37] in managing the shared L3 cache for caching the page-walk
tables. The page entries recorded in the leaf level are saved in the intermediate level whenever
possible to shorten the level of the page walk. We implement the same algorithm for allocating
and combining blocks into chunks with the special cache placement and replacement mechanisms.
TagTables allocates 64 blocks in a page into the same set, which hurts the entropy of
hashing blocks across the cache sets. In addition, the limit of four chunks for each page may
create holes (empty frames) in a cache set and underutilize the DRAM cache. Therefore, we also
evaluate a TagTables scheme with 16 blocks per page. We reduce the page offset from 6 bits to 4
bits and shift the remaining higher-order bits to the right. As a result, it may encounter a level-4
table with 2-bit index in 48-bit physical address format. We keep 4 chunks per entry at the leaf
level. The rest of the design and operations stay the same.
3.4.2 Performance Results
In this section, we first compare the speedup of five DRAM cache designs, Alloy,
ATcache, TagTables_64, TagTables_16, and CLT. We show the average memory access times
for the tag and the data, which contribute to the overall execution time. Multiple factors that impact the memory access time, such as the number of DRAM cache misses, the on-die tag hit/miss ratios for ATcache, TagTables and CLT, and Alloy's miss-predictor accuracy, are also discussed.
In Figure 3-7, we plot the speedups of CLT with respect to Alloy, ATcache,
TagTables_64, and TagTables_16. CLT demonstrates significant performance advantage over
the other four methods. On average, CLT improves performance by 4.3%, 12.8%, 12.9% and 14.9% respectively over Alloy, ATcache, TagTables_64 and TagTables_16. The improvement of CLT
over Alloy is rather moderate. CLT performs worse than Alloy for omnetpp due to its sector-indexing. In comparison with ATcache, CLT is able to gain 12-24% speedup for all workloads
except mcf and omnetpp. This is because CLT can capture most DRAM cache accesses and the
sector-indexing does not hurt DRAM cache performance much as shown in Figure 3-4. Both
TagTables variants perform especially poorly for mcf, lbm, and milc. For TagTables_64, the CLT improvements are 44.8%, 70.6%, and 36.4% for these three workloads, while TagTables_64 shows a slight edge over CLT for omnetpp, leslie3d and zeusmp.
The diversified performance impacts on individual workloads are caused by multiple
factors. An overall speedup analysis is further complicated by the fact of exploiting MLP
(memory-level parallelism) in the Epoch model. During the timing simulation, a cadence of
memory requests is issued in each Epoch. The latency is dominated by the DRAM cache misses
in each Epoch. Therefore, the DRAM cache hit latency plays a small role. On the other hand, the
hit latency becomes the decisive factor when there is no cache miss in an Epoch. This mix of performance factors exists even with a precise processor model. In the following, we analyze the important parameters without detailing the MLP factor.
The most decisive performance factor is the average memory access time of the L3
misses. In Figure 3-8, we plot the average access latencies separated by the tag and the data
segments where the total access time is dominated by the data latency. In general, these average
latencies are consistent with the speedups shown in Figure 3-7. CLT has the shortest average
latency followed by Alloy, ATcache and both TagTables. As expected, Alloy has the shortest tag
latency since it only pays one-cycle predictor delay. However, in case of a false-positive miss
prediction, the tag latency includes a sequential DRAM cache access for fetching the tag.
ATcache has the longest tag latency for two reasons: (1) recording individual block tags does not save space, which lowers the Timber hit ratio; and (2) sequential prefetching of set tags generates high traffic, since each set of tags occupies 4 blocks. TagTables_16 has longer tag latency than
TagTables_64 in accessing the tags through the page-walk tables. When the page size reduces to
16 blocks, more active pages are requested, causing more L3 misses.
Figure 3-7. CLT speedup with respect to Alloy, ATcache, TagTables_64, and TagTables_16.
Figure 3-8. Memory access latency (CPU cycles).
Multiple factors contribute to the data latency. In Table 3-3, we analyze three performance parameters. The first and most important parameter is the DRAM cache performance. Based on the trace-driven Epoch model, we measure DRAM cache performance using misses per thousand requests (MPKR), where each request is an L2 miss. Similar to MPKI, MPKR is closely associated with the execution speedup estimation, with higher
MPKR causing longer average access time. In general, ATcache has the lowest MPKR due to its
64-way set-associativity. CLT is close to ATcache for all workloads except mcf and omnetpp. Its
moderate sector size (i.e. 16 blocks per sector) does not degrade cache performance much. Alloy
suffers higher MPKR, hence longer data latency, due to its direct-mapped design. The TagTables variants show much higher MPKR than CLT for mcf, gcc, lbm, and milc, hence lower speedups as shown in Figure 3-7. Although omnetpp and sphinx3 also show a large MPKR gap between CLT and TagTables, their much smaller absolute MPKRs lessen the impact.
The high MPKR of TagTables is due to two reasons: the negative impact of the sector-indexing scheme, and the restricted chunk-based placement and replacement. As observed in the second parameter, the DRAM cache occupancy, the restriction of 4 chunks per page creates empty space in the cache sets. For example, the average occupancy for gcc is only 75% for TagTables_64. In other words, 25% of the cache space is wasted, causing more misses. By
reducing the page size from 64 to 16, we can alleviate both the negative sector-indexing effect
and the empty space in the cache data array. But, TagTables_16 encounters higher L3 misses for
accessing the page-walk tables. Note that Alloy, ATcache, and CLT have 100% cache
occupancy.
Table 3-3. Comparison of L4 MPKR, L4 occupancy, and predictor accuracy.

                 DRAM cache MPKR                  Occupancy (%)   Predictor accuracy (%)
           Alloy  ATcache  TT-64  TT-16  CLT      TT-64  TT-16    Alloy  ATcache  TT-64  TT-16  CLT
mcf        198    71       228    113    109      91     95       83     64       88     73     8
gcc        484    410      520    441    413      75     84       81     63       95     84     8
lbm        307    275      504    382    278      66     81       99     69       98     93     9
soplex     693    655      668    668    664      98     99       98     61       97     91     9
milc       747    658      722    714    665      99     87       89     55       90     79     7
libquantum 515    456      470    448    465      100    90       99     62       98     94     9
omnetpp    161    71       181    123    100      73     82       79     67       83     69     7
sphinx3    85     7        95     7      7        66     99       96     70       97     92     9
bwaves     429    411      422    411    412      100    100      99     58       98     93     9
leslie3d   438    394      459    402    408      99     99       95     59       98     93     9
gems       447    427      484    442    429      90     92       99     65       98     93     9
zeusmp     564    545      565    563    559      99     100      99     66       97     88     9
The third performance factor is the predictor accuracy, which covers the accuracy of the Alloy miss predictor, the CLT hit ratio, the TagTables tag hit ratio in L3, and the cached-tag hit ratio for ATcache. In general, all schemes show high hit ratios except ATcache. TagTables_16 has a lower tag hit ratio, causing higher average tag access latency.
In summary, mcf, lbm, and milc have the highest MPKRs for the TagTables variants. Together with the wasted cache space and the L3 misses for tags, CLT outperforms TagTables by a large margin. For milc, the large MPKR gap between CLT and Alloy helps CLT outperform Alloy. Alloy outperforms CLT by 11.8% for omnetpp, which has a small MPKR, so the DRAM cache misses play an insignificant role; the difference in memory access latency is due to high CLT misses and longer tag latency. ATcache suffers the most due to its low Timber hit ratios.
3.4.3 Sensitivity Study and Future Projection
In this section, we show the results of two sensitivity studies, CLT coverage and sector
size. For CLT coverage, we change the total SRAM space from 2MB to 1MB (low coverage)
and 3MB (high coverage) and adjust the L3 size to 7MB and 5MB accordingly. With 1MB
SRAM, the CLT size is reduced to (2K × 13) × 20 bytes = 520 KB with 2K sets, 13 ways, and 20 bytes per entry. The pseudo-LRU still costs 504 KB. On the other hand, with 3MB SRAM, the CLT increases to (4K × 30) × 20 bytes = 2.4 MB.
In Figure 3-9, we show the change of the total execution cycles for the new coverages
with respect to the original CLT, which utilizes 2MB SRAM space. On average, the low coverage shows about a 9% increase in total cycles, while the high coverage shows about a 1% increase. The low-coverage option provokes more CLT misses and degrades performance; a bigger L3 helps little in this case. Mcf, omnetpp and sphinx3 show 20-26% increases in execution cycles with low CLT coverage, due mainly to the significant increase in CLT misses.
On the other hand, the high coverage relinquishes more L3 space to build a bigger CLT. Since the CLT hit ratio is already very high for most workloads, further improvement is limited, and the increase in L3 misses hurts the overall performance for most workloads. Omnetpp shows about a 3% cycle reduction due to the improvement of its low CLT hit ratio (79%).
We also study the cycle change for smaller (8 blocks) and bigger (32 blocks) sector sizes.
As shown in Figure 3-10, the small and big sector sizes increase the total cycles by about 7% and 3%, respectively. In this study, we keep the 2MB SRAM space for the CLT. We need to adjust the number of sectors recorded in the CLT to utilize the available space, since the number of way pointers per sector changes from 16 to 8 or 32 for the respective sector sizes. Although the 8-block sector alleviates the sector-indexing effect, the CLT coverage is reduced since each sector can only record 8 blocks. It also lowers the advantage of spatial locality, since each CLT miss can only record 8 adjacent blocks in the CLT. The impact of the bigger sector is the opposite: although the CLT coverage is better with more spatial locality exploited, allocating 32 blocks into the same set hurts the cache performance. Among the workloads, mcf and gcc degrade the most with the small sector size, while mcf and omnetpp suffer the most with the large sector size.
Figure 3-9. IPC change for different CLT coverage.
Figure 3-10. Execution cycle change for different sector size in CLT design.
3.4.4 Summary
We present a new caching technique to cache a portion of the large tag array for an off-
die stacked DRAM cache. Due to its large size, the tag array is impractical to fit on-die, hence
caching a portion of the tags can reduce the need to go off-die twice for each DRAM cache
access. In order to reduce the space requirement for cached tags and to obtain high coverage for
DRAM cache accesses, we proposed and evaluated a sector-based Cache Lookaside Table (CLT)
to record cache tags on-die. CLT reduces space requirement by sharing a sector tag for a number
of consecutive cache blocks and uses a location (way) pointer to locate the blocks in off-die
cache data array. The large sector can also take advantage of spatial locality for
better coverage. In comparison with the Alloy cache, ATcache and the TagTables approaches,
the average improvements are in the range of 4-15%.
CHAPTER 4
RUNAHEAD CACHE MISSES USING BLOOM FILTER
In this chapter, we present the work of using a Bloom Filter to filter out L3 cache misses and issue requests off-die early. We first introduce related work on Bloom Filter applications as well as cache miss identification. We then present the timing analysis of using the Bloom Filter, followed by the proposed indexing scheme that handles dynamic updates of the L3 cache contents when using a Bloom Filter. In the end, we present results that demonstrate our idea.
4.1 Background and Related work
Membership queries using a Bloom Filter have been explored in many architecture, database, and network applications [54][55][56][57][58][59][14][60]. In [60], a cache-miss BF based on partial or partitioned block addresses is proposed to filter cache misses early in the processor pipeline. The early cache-miss filtering helps schedule load-dependent instructions to avoid execution pipeline bubbles. To reduce cache coherence traffic, RegionScout [59] was
used to dynamically detect most non-shared regions. A node with a RegionScout filter can
determine in advance that a request will miss in all remote nodes, hence the coherence request
can be avoided. A vector Bloom Filter [14] was introduced to satisfy quick search of large
MSHRs in the critical execution path without the need for expensive CAM implementation. A
counting Bloom Filter, called a summary cache, that handles dynamic membership updates is presented in [56]. In this approach, each proxy keeps a summary of the internet cache directories and uses the summary to check for potential hits, avoiding sending useless queries to other proxies. In
[57], a counting Bloom Filter is used as a conflict detector in virtualized Transactional Memory
to detect conflicts among all the active transactions. Multiprocessor deterministic replay was introduced in [58], in which a replayer creates an equivalent execution despite inherent sources of nondeterminism that exist in modern multicore computer systems. They use write and read Bloom Filters to track the current episode's write and read sets. A good early survey paper on
network applications using Bloom Filters and the mathematical basis behind them is reported in
[54]. A Bloom-filter-based semijoin algorithm for distributed database systems is in [55]. This
algorithm reduces the communication costs of processing a distributed natural join as much as possible with a filter approach.
A closely related work by Loh and Hill [4] suggested that the block residency for DRAM
cache can be recorded in an on-die structure called MissMap. As described in Section 3.1, off-die
trips to access the DRAM cache can be avoided if the block recorded in the MissMap indicates a
miss. To save space, the MissMap records the block residency information for a large
consecutive segment. However, when the segment is evicted from the miss-map directory, all
blocks in the segment must be invalidated from the DRAM cache in order to maintain the precise
residency information. To avoid such invalidations, Sim et al. [19] suggested speculatively issuing requests directly to main memory if the block is predicted not to be in the DRAM cache. However, significant complexity must be handled for mispredictions.
Xun et al. [61] observed the need for counters to filter cache misses. To avoid the counters, they proposed delaying updates to the BF array for evicted blocks. Instead, they trigger a periodic recalibration of the BF array to reconcile it with the correct cache contents. This delayed-recalibration method increases the chance of false positives and incurs time and power overheads for the recalibration.
4.2 Memory Hierarchy and Timing analysis
Considering a BFk for cache level k, we can adjust the Average Memory Access Time
(AMAT) as follows:
AMAT = (1 − BFrate_Lk) × (HitTime_L1 + Σ_{i=1}^{k−1} (MissRate_Li × HitTime_L(i+1)))
       + BFrate_Lk × ((BFtime + HitTime_Lk) + Σ_{j=k}^{n} (MissRate_Lj × HitTime_L(j+1)))
where 𝐵𝐹𝑟𝑎𝑡𝑒Lk is the ratio of cache misses filtered by the BFk at level k. When a cache
miss is identified, the extra delays of hit times through levels 1 to k are avoided. Only the delay
(BFtime) of accessing the BFk is added to access cache k+1 up to the DRAM memory. This
formula also shows that using BFs at multiple levels overlaps the benefits of bypassing higher levels of caches. For example, if both BFi and BFi+1 are implemented and used when the memory
address becomes available, BFi+1 can only save the hit time at cache level i+1. In the base
memory hierarchy design shown in Figure 1-2, we will focus on a new BFL3 since the sector-
based L4 tags are on the processor die with small latency. Furthermore, the large L4 size requires
a large BFL4 to filter L4 misses for achieving a small false-positive rate.
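The adjusted AMAT formula above can be written as a small helper. This is an illustrative sketch (the function name and example numbers are ours): `hit[i]` is HitTime of level i with `hit[n+1]` the memory access time, and `miss[i]` is MissRate of level i, both 1-indexed with a dummy entry at index 0.

```python
# Adjusted average memory access time with a Bloom filter BF_k at level k,
# following the formula above. n cache levels; level n+1 is main memory.

def amat_with_bf(hit, miss, k, bf_rate, bf_time):
    n = len(miss) - 1                      # miss[1..n], hit[1..n+1]
    # Requests not filtered traverse the hierarchy normally.
    normal = hit[1] + sum(miss[i] * hit[i + 1] for i in range(1, k))
    # Filtered requests skip the hit times of levels 1..k-1,
    # paying only the BF delay.
    filtered = (bf_time + hit[k]) + sum(miss[j] * hit[j + 1]
                                        for j in range(k, n + 1))
    return (1 - bf_rate) * normal + bf_rate * filtered

# Hypothetical example: 3 cache levels, BF at L3, half the requests filtered.
hit = [None, 4, 12, 30, 150]               # L1, L2, L3, memory (cycles)
miss = [None, 0.1, 0.3, 0.5]               # per-level miss rates
print(amat_with_bf(hit, miss, k=3, bf_rate=0.5, bf_time=5))
```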
Figure 4-1. Memory latency with/without BFL3.
Figure 4-1 illustrates the memory latency for runahead L3 misses. After the memory address is available, the BFL3 is checked. Once a miss is filtered, the on-die L4 tag array is looked up, followed by either an eDRAM or a regular DRAM access depending on an L4 hit or miss for
fetching the requested data block. Regardless of the filtering result, a memory request always goes through the regular L1, L2 and L3 path. This is necessary for handling cache hits at these levels as well as for identifying any false-positive L3 misses from the BFL3. Even when the request is identified as an L3 miss, it is still issued to the normal L1, L2 and L3 path; this avoids major changes to the microarchitecture of the cache hierarchy. If the filtered request goes through the normal cache levels and arrives at the memory controller, the early runahead miss blocks this request at the controller until the block comes back from L4 or memory. On the other hand, if the request has not yet arrived at the controller when the block for the runahead miss comes back, the block is inserted into L3 and treated as a prefetch to the L3 cache. Eventually, the request through the normal path will be an L3 hit, shortening the latency.
Formally, a Bloom filter (BF) for representing a set of n elements (cache blocks) from a
large universe (memory blocks) consists of an array of m bits, initially all set to 0. The filter uses
k independent hash functions h1, ... , hk with range 1, ... , m, where these hash functions map
each element x (block address) in memory to a random number uniformly over the range. When
a block enters cache, the bits hi(x) are set to 1 for 1 ≤ i ≤ k. To check if a block y is in cache, we
check whether all hi(y) are set to 1. If not, then clearly y is not in cache, hence a cache miss. In
practice, however, a BF for cache misses faces two major difficulties. First, it is hard to implement multiple independent, randomized hash functions in hardware. Second, cache contents change dynamically with insertions and deletions; the BF array must be updated accordingly to reflect the content changes for maintaining the correct BF function.
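The description above is the textbook Bloom filter, which can be sketched as follows. This is a generic software illustration only (hashlib stands in for the hardware hash functions; the class and method names are ours), not the address-bit scheme proposed next.

```python
import hashlib

# Generic Bloom filter: m-bit array, k hash functions. A "no" answer is
# always correct (definite cache miss); a "yes" may be a false positive.

class BloomFilter:
    def __init__(self, m, k):
        self.m, self.k = m, k
        self.bits = bytearray(m)            # one byte per bit, for clarity

    def _indices(self, x):
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{x}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, block_addr):           # block enters the cache
        for idx in self._indices(block_addr):
            self.bits[idx] = 1

    def maybe_present(self, block_addr):    # False => definitely a miss
        return all(self.bits[idx] for idx in self._indices(block_addr))
```

Note that this simple form supports insertions only; handling deletions when blocks are evicted is exactly the difficulty the cache-index-based scheme addresses.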
In Figure 4-2, we illustrate a solution for simplified BF hashing functions and for
handling dynamic cache updates. Let us first describe the conventional cache indexing scheme.
In a cache access, the target set is determined by decoding the index, which is located in the low-order bits (a0) of the block address as shown in Figure 4-2. Instead of constructing uniformly distributed hash functions, we can simply expand the cache indexing scheme to include a few more adjacent tag bits (a1) for indexing the BF array. As in a conventional cache access, a simple address decoder for the BF index determines the hashed BF location.
Based on the study in [54], the false-positive rate is minimized when k = ln2 × (m/n), giving a false-positive rate ≈ (0.6185)^(m/n). Hence, increasing m/n can reduce the false-positive rate. For a cache with 2^p-way set associativity, the total number of cache blocks is n = 2^p × 2^a0 = 2^(a0+p). Furthermore, the BF array size m must be bigger than n to reduce the false-positive rate. Assuming m/n = 2^q, where q is a small positive integer, we have m = 2^(a0+p+q). Therefore, the BF index is a1||a0, where a1 has p + q bits.
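The bound above can be evaluated numerically; a small sketch (the formula is from [54] as quoted in the text, the helper names are ours):

```python
import math

# Minimum false-positive rate of a Bloom filter with the optimal number
# of hash functions k = ln2 * (m/n), approximately (0.6185)^(m/n).

def optimal_k(m_over_n):
    return math.log(2) * m_over_n

def min_false_positive(m_over_n):
    return 0.6185 ** m_over_n

for ratio in (4, 8, 16):
    print(f"m/n={ratio}: k≈{optimal_k(ratio):.1f}, "
          f"min false-positive rate≈{min_false_positive(ratio):.2%}")
```

For m/n = 8 the bound is about 2.1% with k ≈ 5.5 hash functions; the rates measured later in this chapter are higher because only one or two simple address-bit hash functions are used.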
There is a unique advantage to including the cache index (a0) as part of the BF index. Due to collisions, multiple blocks can be hashed to the same location in the BF array. Since the cache index is part of the BF index, all blocks hashed to the same BF array location must be located in the same cache set. By comparing the a1 bits of all cache tags in the set with the a1 bits of the replaced block, the BF array location is reset only if the replaced a1 is not found in any other block of the set.
Note that due to spatial locality in memory references, using low-order block address bits may reduce collisions in the BF array, hence lowering the false-positive rate. Moreover, we apply a simple index randomization technique by exclusive-ORing the a1 bits with the adjacent higher-order a2 bits to further reduce collisions. Consider a BFL3 design for an 8MB L3 cache with 64-byte blocks and 16-way set associativity. The target set is determined by the low-order 13 bits (a0) of the block address, which hash to the 8K sets. The total number of blocks is n = 2^17. Assume that the BF array size m is 8 times larger than n. As a result, a1 has 7 additional index bits and the BF index has 20 bits. For randomizing a1, the higher-order 7 bits (a2) are used. The total number of required address bits is 33, including the 6 offset bits. With a limited physical address, we can have several hashing combinations for BFL3 using a0, a1 and a2. Note that in this work our simulated memory is 8GB; with a bigger memory and more physical address bits, more hashing options could be explored.
(a) k=1: three BF indices: a1||a0, a2||a0, and (a1 XOR a2)||a0.
(b) k=2: three BF index groups: (a1||a0 and a2||a0), (a1||a0 and (a1 XOR a2)||a0), and
(a2||a0 and (a1 XOR a2)||a0).
(c) k=3: one BF index group: (a1||a0, a2||a0, and (a1 XOR a2)||a0).
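The three candidate indices in (a) can be computed with simple bit slicing; a hypothetical sketch using the example's field widths (6 offset bits, 13-bit a0, 7-bit a1 and a2; the function name is ours):

```python
# Extract the BF index candidates a1||a0, a2||a0, and (a1 XOR a2)||a0
# from a physical block address, per the layout in Figure 4-2.

OFFSET_BITS, A0_BITS, A1_BITS, A2_BITS = 6, 13, 7, 7

def bf_indices(addr):
    blk = addr >> OFFSET_BITS                      # drop the block offset
    a0 = blk & ((1 << A0_BITS) - 1)                # cache index
    a1 = (blk >> A0_BITS) & ((1 << A1_BITS) - 1)   # adjacent tag bits
    a2 = (blk >> (A0_BITS + A1_BITS)) & ((1 << A2_BITS) - 1)
    return ((a1 << A0_BITS) | a0,
            (a2 << A0_BITS) | a0,
            ((a1 ^ a2) << A0_BITS) | a0)
```

Because every index keeps a0 in its low-order bits, all blocks that collide in one BF entry reside in the same cache set, which is what makes the eviction-time reset check described above possible.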
Figure 4-2. Cache indexing and hashing for BF.
Figure 4-3 shows the false-positive rates for the different hashing schemes. Note that we only show the results of 6 hashing schemes, since the false-positive rate using the single hashing index a2||a0 is very high. We simulate 3 m:n ratios: 4:1, 8:1, and 16:1, using memory traces
generated from SPEC2006 benchmarks. We first ran the workloads on a whole-system
simulation infrastructure based on a cycle-accurate 8-core model along with a memory hierarchy
model [29] to collect memory reference traces from misses to the L2 caches. 5 billion
instructions from 8 cores are collected for each workload. The simulation environment and
parameters will be given in Section 3.3.
(Figure 4-2 divides the block address into tag, cache index (a0), and block offset fields; a1 and a2 are taken from the tag bits adjacent to a0, giving BF index 1 = a1||a0 and BF index 2 = (a1 XOR a2)||a0.)
The false-positive rate is calculated as the number of requests that hit in the filter but actually miss in the cache, divided by the total misses. Each false-positive point in the figure is the geometric mean of the 12 SPEC2006 benchmarks. As can be observed, when k=1, randomization of a1 helps very little in improving the false-positive rate. Two hashing functions with indices a1||a0 and (a1 XOR a2)||a0, as illustrated in Figure 4-2, show the lowest false-positive rates, about 2.3%, 4.8%, and 16.5% respectively for m:n ratios 16:1, 8:1, and 4:1. Three hashing functions cannot further improve the false-positive rate because of insufficient address bits: the third hashing index is highly correlated with the first two.
Figure 4-3. False-positive rates for 6 hashing mechanisms.
Figure 4-4. False-positive rates with m:n = 2:1, 4:1, 8:1, 16:1, and k = 1, 2.
In Figure 4-4, we show the false positive rates for individual SPEC2006 benchmarks.
Based on the results in Figure 4-3, we pick two hashing schemes, a1||a0 for k=1, and a1||a0 and
(a1 XOR a2)||a0 for k=2. The results show that the m:n ratio plays an important role as bigger
BF arrays reduce the false-positive rate significantly for all benchmarks. The false-positive rates
are very high for small BF array with ratio m:n=2:1. The benefit of multiple hashing functions
becomes more evident when m/n is 4 or greater. The false-positive rate behavior is very
consistent across all benchmarks. For k=1, the average false-positive rates are 8.7% and 4.3% using BF arrays with 8 and 16 times more bits than the total number of cache blocks. When k=2, the false-positive rates are reduced to 4.8% and 2.3%, respectively. These results are used to
guide our IPC timing simulation.
4.3 Performance Results
The IPC improvement using a BF for runahead L3 misses is presented in this section. We
also compare the improvement with a perfect BF without any false-positive misses. In addition,
the sensitivity studies of BF design parameters, the size of the L4 caches, and the latency and
bandwidth of the regular DRAM are also presented.
For an 8MB L3 cache with 64-byte blocks, the space overhead for the new BFL3 is 64KB, 128KB and 256KB respectively for m:n = 4:1, 8:1, and 16:1. We use Cacti [21] to estimate the BF latency and get 2, 3, and 3 cycles for the three BF arrays. In addition, we add two more cycles for the wiring delay. For delayed recalibrations, since we need to read out the last 14 bits (a1 and a2) and perform hierarchical OR operations, we measured using Cacti 6.5 [21] that it takes 3 cycles to recalibrate one set. 4 sets can be recalibrated in parallel, and a total of 6K cycles are charged for each recalibration.
4.3.1 IPC Comparison
Figure 4-5 displays the IPCs of the twelve benchmarks under six caching mechanisms:
a regular 4-level cache without a BF; a BFL3 that filters and runs ahead L3 misses; three
delay-recalibration designs, d1-BFL3, d2-BFL3, and d3-BFL3, with recalibration periods of 0.5M, 1M, and 2M
memory references; and a perfect BFL3 that incurs no false-positives. Note that we use two
hashing functions, a1||a0 and (a1 XOR a2)||a0, and m:n = 8:1 for the BFL3. The results show
an average IPC improvement of about 10.5% using the BFL3, only 1.3% less than the perfect
BFL3, which averages 11.8%. The three delay-BF designs improve the IPC by 4.3%, 4.8%, and
3.5%, respectively; a shorter recalibration period produces fewer false-positives but pays more
recalibration overhead. In general, all benchmarks show good IPC improvement with the BFL3.
Mcf and sphinx benefit the most, with close to 20% improvement. All other workloads improve
by at least 6% over the design without a BF, except bwaves, which improves by about 4%. The
BFL3 design also shows a 1.3-8.9% improvement over the delay-BF designs.
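The two index functions a1||a0 and (a1 XOR a2)||a0 can be sketched as follows. The bit-field widths here are illustrative assumptions (a 17-bit a0 covering the block-address low-order bits and 3-bit a1, a2 fields), not the dissertation's exact layout:

```python
def bf_indices(block_addr, a0_bits=17, field_bits=3):
    """Compute the two BFL3 indices a1||a0 and (a1 XOR a2)||a0.

    a0 is the low-order block-address field; a1 and a2 are the adjacent
    higher-order fields (field widths here are hypothetical)."""
    a0 = block_addr & ((1 << a0_bits) - 1)
    a1 = (block_addr >> a0_bits) & ((1 << field_bits) - 1)
    a2 = (block_addr >> (a0_bits + field_bits)) & ((1 << field_bits) - 1)
    idx1 = (a1 << a0_bits) | a0            # a1 || a0
    idx2 = ((a1 ^ a2) << a0_bits) | a0     # (a1 XOR a2) || a0
    return idx1, idx2
```

Because both indices keep a0, which contains the cache index bits, in the low-order position, any two blocks that collide in the BF array necessarily share a cache set; this property is what later enables counter-free BF updates.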
The major impact on the IPC comes from the average memory access latency. In Table 4-1,
we list the average memory latency of L3 misses with and without BF-based runahead. The
L3 miss latency is measured from the generation of the memory address until the return of the
requested data. Note that the measurement excludes the hit latencies of the L1, L2, and L3
caches, since these are essentially the same with or without the BF.
Using the BFL3 to run ahead the misses yields a significant saving in the L3 miss latency
for all benchmarks. On average, the L3 miss latency is reduced from 154 cycles to
120 cycles, which closely matches the sum of the L1, L2, and L3 hit times minus a 5-cycle
penalty for accessing the BF array. In general, the latency result is consistent with the IPC
results. The BF designs with delay recalibrations (not shown) have longer latency due to more
false-positives; in addition, they are charged the recalibration overheads.
Figure 4-5. IPC comparisons with/without BF.
Table 4-1. Average L3 miss latency (cycles) and false-positive rates of 12 benchmarks.
            Latency            False-Positive Rate (%)
            BFL3    w/o BF     BFL3  d1-BFL3  d2-BFL3  d3-BFL3
mcf         105.9   138.2      7     9        12       14
soplex      106.2   141.1      5     10       12       13
lbm         153.3   186        5     8        9        11
leslie3d    158.8   190.4      6     7        9        11
gems        216.9   248.8      4     6        8        9
libquantum  137.6   162        9     12       13       17
milc        142.3   173.9      4     6        9        10
bwaves      128.9   161.5      3     6        8        9
sphinx      69.8    106.4      4     7        11       12
bt          134.6   168.3      4     5        7        9
omnetpp     77.3    112.2      6     10       13       16
gcc         80.6    116.3      3     6        8        9
In Table 4-1, we also show the false-positive rates of the twelve benchmarks measured
in the timing simulation. The rates range from 3% to 9%, consistent with the rates obtained
from simulating long memory traces (Figure 4-4). The small false-positive rate ensures a
small impact on the IPC improvement. As shown in Figure 4-5, the IPC improvement of a
perfect BF surpasses that of a realistic BF only modestly, 11.8% versus 10.5%.
We also provide a rough estimate of the power consumption. Using Cacti
[52], we measure that each BF access takes around 0.013 nJ, close to a single L1 cache dynamic
access energy of 0.014 nJ. Since the only dynamic-power difference is the Bloom Filter access energy,
and the number of Bloom Filter accesses equals the total number of L1 cache accesses, the extra
power consumption is essentially the same as the L1 cache's total dynamic power. On the other
hand, the Bloom Filter speeds up execution by 10.5%, which translates into a 10.5%
static energy saving. To save even more energy, the Bloom Filter can be accessed only after L1
cache misses.
4.3.2 Sensitivity Study
The sensitivity of the IPC to the m:n ratio and the number of
hashing functions k is shown in Figure 4-6, in which each IPC point is the geometric mean of the 12
benchmarks. Again, the results show that bigger BF arrays reduce the false-positive rate and
improve the IPC. The improvement rate is much larger for k=3 than for k=2 and k=1: without
sufficient entries in the BF array, more hashing functions actually increase the chance of
collisions, whereas with bigger BF arrays, more hashing functions spread each block more
randomly and reduce collisions. When the m:n ratio is 2:1, there is insufficient room in the BF
array even for k=2, resulting in a slightly lower IPC than that of k=1; k=3 is much worse.
However, the IPC of k=3 nearly catches up with that of k=2 when m:n=8:1, and at m:n=16:1
the IPCs of the different hashing functions are very close: given a sufficient BF array size, the
false-positive rates are small regardless of the number of hashing functions. Nevertheless, at
m:n=16:1 one would expect k=3 to outperform k=2. Due to the limited address bits, however, the third BF index is
highly correlated with the first two BF indices, resulting in limited improvement in the false-positive rate.
In Figure 4-7, we show the IPC for four L4 sizes ranging from 64MB to 512MB.
In these simulations, we maintain m:n=8:1 and k=2. Regardless of the L4 size, the BFL3
always improves the IPC significantly. As expected, a bigger L4 reduces the L4 MPKI and
lets the BFL3 improve the IPC even more. For the four L4 sizes, the IPC improvements are 9.0%,
10.5%, 11.5%, and 12.0%, respectively.
Figure 4-6. Average IPC for m:n ratios and hashing functions.
Figure 4-7. Average IPC for different L4 sizes.
Figure 4-8. Average IPC over different DRAM latency.
Table 4-2. Future conventional DRAM parameters.
Faster DRAM latency: tCAS-tRCD-tRP: 6-6-6; tRAS-tRC: 33-30
Slower DRAM latency: tCAS-tRCD-tRP: 11-15-15; tRAS-tRC: 38-50
Next, we simulate the impact of the DRAM latency. In comparison with the original DRAM
latency in Table 2-1, we simulate a fast and a slow DRAM latency as shown in Table 4-2. We
also test two DRAM bandwidth configurations, one with 2 channels and the other with 4
channels. The L3 and L4 sizes are 8MB and 128MB, and the BFL3 remains at m:n=8:1, k=2. The
results are shown in Figure 4-8. Interestingly, higher DRAM bandwidth and faster DRAM
latency help the IPC more with runahead L3 misses than without. For the fast latency with 4
DRAM channels, the average IPC improvement reaches 12%; for the slow latency with 2
DRAM channels, it is about 7%.
4.4 Summary
A new Bloom Filter is introduced to filter L3 cache misses for bypassing L1, L2 and L3
caches to shorten the L3 miss penalty in a 4-level cache hierarchy system. The proposed Bloom
Filter applies a simple indexing scheme, decoding the low-order block address to determine
the hashed location in the BF array. For better hashing randomization, a part of the index
bits is XORed with the adjacent higher-order address bits. In addition, with certain
combinations of the limited block address bits, multiple index functions can be selected to
further reduce the false-positive rate. Results show that the proposed simple hashing scheme
lowers the average false-positive rate below 5% for filtering L3 misses and improves the
average IPC by 10.5% by running ahead these misses.
Furthermore, the proposed BF indexing scheme resolves an inherently difficult problem in
using a Bloom Filter to identify L3 cache misses. Because the cache content changes
dynamically, a counting Bloom Filter is normally necessary to keep the BF array up to date.
A unique advantage of the proposed BF index is that it is a superset of the cache
index. As a result, blocks that are hashed to the same BF array location are
allocated in the same cache set. By searching the tags in the set when a block is replaced, the
corresponding BF bit can be reset correctly. This restricted hashing scheme demonstrates a low
false-positive rate and simplifies the BF array updates without using expensive counters.
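As a sketch of this counter-free update (with hypothetical helper names of our own), the BF bit of a replaced block is cleared only when no surviving tag in the same cache set maps to the same BF location:

```python
def on_block_replaced(bf_bits, set_tags, bf_index_of, victim_addr):
    """Clear the victim's BF bit unless another block still resident in the
    same cache set hashes to the same BF array location.

    bf_bits: mutable bit array; set_tags: addresses of the blocks remaining
    in the victim's set; bf_index_of: the BF hash. All names are ours,
    chosen for illustration."""
    idx = bf_index_of(victim_addr)
    if all(bf_index_of(t) != idx for t in set_tags):
        bf_bits[idx] = 0                 # no surviving block shares the slot
```

Because every block mapping to `idx` must live in the victim's set, scanning only that set's tags is sufficient, which is exactly what removes the need for counters.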
CHAPTER 5
GUIDED MULTIPLE HASHING
In this chapter, we present our guided multiple hashing work. We begin by introducing
the problems of single hashing and multiple hashing. We then use a simple example to illustrate
the proposed idea, followed by a detailed algorithm that maximizes the number of empty buckets
while balancing the keys among the non-empty buckets. Finally, we present results that show
our improvement over other traditional hashing methods.
5.1 Background
Hash-based lookup has been an important research direction in routing and packet
forwarding, which are among the core functions of the IP network-layer protocols. While there are
alternative approaches to routing table lookup, such as trie-based solutions, we focus on
hash-based solutions, which have the advantages of simplicity and O(1) average lookup time,
whereas trie-based lookup tends to make many more memory accesses.
Single hashing suffers from the collision problem: multiple keys are hashed to the
same bucket, causing an uneven distribution of keys among the buckets and variable delays
when looking up keys located in different buckets. For hash-based network routing tables [62] [63]
[64] [65], fast lookup of the next-hop routing information is critical. In today's
backbone routers, routing tables are often too big to fit into the on-chip memory of a network
processor. As a result, off-chip routing table access becomes the bottleneck for meeting the
increasing throughput requirement of high-speed Internet [66] [67]. Unbalanced hash
buckets further worsen the off-chip access. Today's memory technology fetches a contiguous
block (such as a cache block) from off-chip memory more efficiently than individual data
elements. A heavily loaded hash bucket may require two or more memory accesses
to fetch all its keys. However, in order to accommodate the most-loaded bucket for a constant
lookup delay, fetching a memory block large enough to hold the highest number of keys in any
bucket increases the critical memory bandwidth requirement, wastes memory space, and
lowers the network throughput [63] [65] [68] [69].
Methods have been proposed to handle the hash collision problem by balancing the bucket
load, i.e., reducing the maximum number of keys in any bucket. One approach is
multiple hashing, such as d-random [70], which hashes each key to d buckets using d
independent hash functions and stores the key in the least-loaded bucket. The 2-left scheme
[62] [68] is a special case of d-random in which the buckets are partitioned into left and right
regions. When inserting a key, a random hash function is applied in each region and the key is
allocated to the least-loaded bucket (the left one in case of a tie). The multiple-hashing approach
balances the buckets and reduces the fetched bucket size for each key lookup. However, without
knowing in which bucket a key is located, d-random (d-left) requires probing all d
buckets. As the bottleneck lies in the off-chip memory access, accessing multiple buckets
slows down the hash table access and degrades the network performance [65] [67].
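A minimal sketch of 2-left insertion (our own simplification of the cited scheme, with buckets modeled as Python lists):

```python
def two_left_insert(left, right, key, h_left, h_right):
    """Insert `key` into the less-loaded of its two hashed buckets,
    preferring the left region on a tie (a sketch of 2-left hashing)."""
    i, j = h_left(key), h_right(key)
    if len(left[i]) <= len(right[j]):    # left region wins ties
        left[i].append(key)
    else:
        right[j].append(key)
```

Note that lookup must still probe both hashed buckets, which is exactly the cost the guided scheme in this chapter tries to avoid.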
To avoid probing d buckets, the extended Bloom Filter [64] uses counters and extra
pointers to link the keys of multiple hashed buckets. However,
it requires key replication and must handle complex key updates. The recently proposed
Deterministic Hashing [65] applies multiple hash functions to an on-chip intermediate index
table in which the hashed bucket addresses are saved. By properly setting up the bucket addresses
in the index table, the hashed buckets can be balanced. This approach incurs space overhead and
delays due to the indirect access through the index table. In [69], an improved approach uses an
intermediate table that records hash function IDs instead of bucket addresses to alleviate the
space overhead. In addition, it applies a single hash function to the index table to ease the update
complexity. However, with a limited index table and hashing functions, the achievable balance is
also limited. In another effort to avoid collisions, perfect hashing sets the more rigid goal of a
one-to-one mapping between keys and buckets. It accomplishes this with complex
hash functions encoded on-chip at significant space cost and additional delay [71] [20]; it also
requires changing the encoded hash function upon a hash table update.
5.2 Hashing
We first describe the challenges of a hash-based information table using a single hash
function. We also bring up the motivation for and applications of a multiple-hashing approach
for organizing and accessing a hash table.
Figure 5-1. Distribution of keys in buckets for four hashing algorithms.
To demonstrate the power of multiple hashing in accomplishing different objectives for
the hash table, we compare the simulation results of four hashing schemes: single hashing
(single-hash), 2-hash with load balancing (2-left), 4-hash with load balancing (4-left), and 2-hash
with maximum zero buckets (2-max-0). We simulate 200,000 randomly generated keys to be
hashed to 100,000 buckets. The distribution of keys in buckets is plotted in Figure 5-1. We can
observe substantial differences in the key distribution among the four hashing schemes. The
maximum number of keys in a bucket reaches ten for single-hash and 2-max-0. Meanwhile,
2-max-0 produces 2.5 times as many empty buckets as single-hash does. 2-left and 4-left are more
balanced, with maximum bucket loads of four and three keys, respectively. It is
easy to see that increasing the number of hash functions from two to four improves the
balance.
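The Figure 5-1 experiment is easy to reproduce in miniature. The sketch below is our own code, comparing single-hash and 2-left on the two metrics just discussed, maximum bucket load and empty-bucket count:

```python
import random

def load_stats(n_keys, n_buckets, seed=1):
    """Return (max load, empty buckets) for single-hash and for 2-left."""
    rnd = random.Random(seed)
    single = [0] * n_buckets
    half = n_buckets // 2                # 2-left splits buckets into halves
    left, right = [0] * half, [0] * half
    for _ in range(n_keys):
        single[rnd.randrange(n_buckets)] += 1
        i, j = rnd.randrange(half), rnd.randrange(half)
        if left[i] <= right[j]:          # less-loaded bucket wins, left on ties
            left[i] += 1
        else:
            right[j] += 1
    both = left + right
    return (max(single), single.count(0)), (max(both), both.count(0))
```

Run at the text's scale (200,000 keys, 100,000 buckets), this reproduces the Figure 5-1 trend: a 2-left maximum load of about four versus roughly ten for single hashing.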
5.3 Proposed Algorithm
In this section, we describe the detailed algorithms of the guided multiple-hashing
scheme that consists of a setup algorithm, a lookup algorithm, and an update algorithm. Assume
we have m buckets 𝐵1, . . . , 𝐵𝑚 and d independent hash functions 𝐻1, . . . , 𝐻𝑑. Each key x is hashed
and placed into all d buckets, 𝐵𝐻𝑖(𝑥), 1 ≤ 𝑖 ≤ 𝑑. The set of keys in bucket 𝐵𝑖 is denoted by
B[i], and the number of keys in bucket 𝐵𝑖 is 𝑣(𝐵[𝑖]), 1 ≤ 𝑖 ≤ 𝑚. The bucket load Ω𝑎 is
defined as the maximum number of keys in any bucket. We define the memory usage ratio as:
𝜃 = (Ω𝑎 × 𝑚)/𝑛 to indicate the memory requirement of the hash table. Other terminologies
are self-explanatory and are listed in Table 5-1. For better illustration of d-ghash, we use a
simple hashing table with 5 keys and 8 buckets. All keys are hashed to the buckets using two
hashing functions, where buckets 𝐵0, . . . , 𝐵7 have 1, 0, 1, 2, 3, 0, 1, 2 keys as indicated by the
arrows in Figure 5-2.
5.3.1 The Setup Algorithm
Since the objective is to minimize the bucket load while approaching to a single bucket
access per lookup, the setup algorithm needs to satisfy two criteria: (1) achieving near perfect
balance, and (2) maximizing the number of c-empty buckets. Recall that a c-empty bucket serves
as a multiple hashing target of one or more keys, but the key(s) is placed into other alternative
buckets that make c-empty bucket access unnecessary.
Table 5-1. Notation and definitions.
Symbol    Meaning
n         Total number of keys
m         Total number of buckets
𝐵[𝑖]      Set of keys in the i-th bucket
v(B[i])   Number of keys in the i-th bucket
s         Indices of the buckets in B, sorted in ascending order of v(B[i])
𝐻𝑖        i-th hash function
Ω𝑝        Optimal bucket load, ⌈𝑛/𝑚⌉
Ω𝑎        Achievable bucket load
𝑛𝑢        Total number of keys in under-loaded buckets (bucket load less than Ω𝑎)
𝑏𝑢        Number of under-loaded buckets
𝜃         Memory usage ratio
Figure 5-2. A simple d-ghash table with 5 keys, 8 buckets, and 2 hash functions. (The shaded
bucket is a c-empty bucket. The final key assignment is as illustrated.)
Given n keys and m buckets, a perfect-balance hashing scheme achieves the optimal bucket
load Ω𝑝 = ⌈𝑛/𝑚⌉. However, perfect balance may not be achieved under our or other multi-hashing
schemes: even with multiple hashing, some buckets may still probabilistically be
under-loaded, i.e., zero or fewer than Ω𝑝 keys are hashed to the bucket, which translates to some
other buckets being squeezed with more keys. Increasing the number of hash functions reduces
under-loaded buckets and helps approach perfect balance. Our simulation shows that
with 4 hash functions, the achievable balance is the same as or very close to Ω𝑝.
The first step of the setup algorithm is to estimate Ω𝑎, the achievable balance. The idea is
to count the number of under-loaded buckets and the number of keys inside them. If the remaining
buckets cannot hold the rest of the keys with Ω𝑎 keys in each bucket, we increase Ω𝑎 by one.
We then use Ω𝑎 as the benchmark bucket load for key assignment. We sort all buckets in B,
producing a sorted index array s such that 𝑣(𝐵[𝑠(𝑖)]) ≤ 𝑣(𝐵[𝑠(𝑖 + 1)]), 1 ≤ 𝑖 ≤ 𝑚 − 1. In
the simple example of Figure 5-2, Ω𝑎 = Ω𝑝 = ⌈𝑛/𝑚⌉ = 1.
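This first step can be sketched as follows. The sketch is a simplification of ours: `loads` holds each bucket's initial hashed-key count (duplicates included), and the function name is hypothetical:

```python
import math

def achievable_load(loads, n_keys):
    """Estimate the achievable bucket load: start from the optimal load and
    raise it until the non-under-loaded buckets can absorb the remaining keys."""
    m = len(loads)
    omega = math.ceil(n_keys / m)        # optimal load, the best case
    while True:
        under = [v for v in loads if v < omega]
        # remaining buckets must hold the rest of the keys at omega each
        if (m - len(under)) * omega >= n_keys - sum(under):
            return omega
        omega += 1
```

On the Figure 5-2 example, with initial loads 1, 0, 1, 2, 3, 0, 1, 2 for 5 keys, the sketch returns 1, matching Ω𝑎 = Ω𝑝 = 1.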
The next step is key assignment, which consists of two procedures: creating c-empty
buckets and balancing the key assignment. To create c-empty buckets, the procedure removes
duplicate keys starting from the most-loaded buckets, maximizing their service as companion
buckets that reduce bucket accesses. A key can be safely removed from a bucket if it exists in
other bucket(s). The procedure goes through all buckets whose initial load is greater than Ω𝑎
and tries to remove keys from them. In the example of Figure 5-2, all 3 keys in 𝐵4 are
successfully removed and 𝐵4 becomes empty. Next, we check 𝐵3 and 𝐵7, each of which has 2
keys. Note that both 𝐾2 and 𝐾4 in 𝐵3 can be removed, so 𝐵3 is emptied first. As a result, 𝐾2 and
𝐾3 cannot be removed from 𝐵7, whose load still exceeds Ω𝑎. All buckets whose load exceeds Ω𝑎
become targets for the reallocation described next.
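A hedged sketch of the c-empty creation pass (our own code; it vacates an over-loaded bucket only when every one of its keys also appears in some other hashed bucket):

```python
def create_c_empty(buckets, omega_a):
    """Visit buckets from most- to least-loaded and empty those whose keys
    all have copies elsewhere, turning them into c-empty buckets."""
    order = sorted(range(len(buckets)), key=lambda i: -len(buckets[i]))
    for b in order:
        if len(buckets[b]) <= omega_a:
            break                        # remaining buckets are not over-loaded
        vacatable = all(
            any(k in buckets[o] for o in range(len(buckets)) if o != b)
            for k in buckets[b]
        )
        if vacatable:
            buckets[b].clear()           # bucket becomes c-empty
```

The greedy most-loaded-first order matches the text's intent of maximizing the emptied buckets' service as companions, though the dissertation's actual pass may differ in its tie-breaking details.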
After emptying the buckets, the key assignment procedure assigns each key to a bucket,
starting from the least-loaded bucket. Once a key is assigned, its duplicates are removed from the
remaining buckets. During the assignment, buckets with more than Ω𝑎 keys are skipped in order
to maintain the achievable balance. A re-assignment of the buckets with load greater than Ω𝑎 is
necessary after all the buckets are assigned. During re-assignment, the keys in the overflow
buckets are relocated to other buckets when possible. In our experiment, we use Cuckoo
Hashing [72] to relocate keys from an overflow bucket to an alternative bucket using the multiple
hashing functions. If all alternative buckets are full, an attempt is made to make room in the
alternative buckets. For simplicity, such attempts stop after r tries, where r is a
heuristic number: a larger r brings better balance at the expense of longer setup time. In the
illustrated example, 𝐾2 in 𝐵7 is relocated to 𝐵3 to reduce the bucket load of 𝐵7; hence, the
optimal load is achieved.
If perfect balance is not achievable, Ω𝑎 is incremented by one and the key
assignment procedure repeats. It is important to note that the priority of the key assignment is to
achieve perfect balance. Therefore, keys previously removed from an empty bucket
may be reassigned back in order to accomplish the perfect balance in which every bucket holds
at most Ω𝑎 keys. Note also that the bucket load can be reduced by
decreasing the ratio 𝑛/𝑚, i.e., increasing the number of buckets for a fixed
number of keys. However, with a constant bucket size for efficient fetch of a bucket from
off-chip memory, increasing the number of buckets inflates the memory space requirement,
since the memory usage ratio is 𝜃 = (Ω𝑎 × 𝑚)/𝑛.
5.3.2 The Lookup Algorithm
To speed up the lookup of keys, we introduce a data structure called the empty
array, a bit array of size m indicating whether each bucket is empty: if a bit in the
empty array is '0', the corresponding bucket is empty; otherwise it is not.
Upon looking up a key x, the bits at indices 𝐻1(𝑥), . . . , 𝐻𝑑(𝑥) in the empty array are checked. If
only one of the hashed buckets is non-empty, we simply fetch that bucket and thus
complete the lookup. If two or more buckets are non-empty, we access them one by one until
we find the key. In the worst case, all d bits are ones and d buckets are examined before the key
is found. As discussed above, creating c-empty buckets reduces the bucket accesses per lookup
and thus alleviates the lookup cost.
To further enhance our algorithm, we introduce another data structure, the target array, which
records the hash function ID when a key is hashed to two or more non-empty buckets. To
distinguish it from the algorithm described above, we call this the enhanced d-ghash algorithm;
the algorithm using only the empty array is the base d-ghash algorithm. The recorded ID
indicates the bucket in which the key is most likely located. The empty array has m bits, while
the size of the target array varies with the number of keys. Suppose m = 200K and we use a
200K-entry target array; then the empty array takes 25KB, and the target array takes 25KB for
enhanced 2-ghash and 50KB for enhanced 4-ghash. These two small arrays can be placed on
chip for fast access. Multiple keys may collide in the target array. When a collision occurs, the
priority of recording the target hashing function is given to the key that hashes to more
non-empty buckets. Given a fixed number of keys, we can adjust the number of buckets (m)
and hash functions (d) to achieve a specific goal for the bucket size and the number of buckets
fetched per key lookup. More discussion follows in Section 5.4.
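Putting the empty and target arrays together, an enhanced d-ghash lookup can be sketched as below. All helper names and the target-array indexing are our assumptions; buckets are modeled as dicts reached through `fetch_bucket`, standing in for off-chip reads:

```python
def ghash_lookup(key, hash_funcs, empty_bits, target, fetch_bucket):
    """Return the record for `key`, probing as few buckets as possible:
    the empty array rules out empty buckets, and the target array hints at
    the most likely hash function when several buckets survive."""
    idxs = [h(key) for h in hash_funcs]
    live = [i for i in idxs if empty_bits[i]]
    if len(live) == 1:                      # the common, single-fetch case
        return fetch_bucket(live[0]).get(key)
    hinted = idxs[target[key % len(target)]]
    order = ([hinted] if hinted in live else []) + \
            [i for i in live if i != hinted]
    for i in order:                         # worst case: all d hashed buckets
        rec = fetch_bucket(i).get(key)
        if rec is not None:
            return rec
    return None
```

When the hint is correct, the multi-bucket case still costs a single off-chip fetch, which is how the enhanced scheme pushes the average bucket accesses per lookup toward one.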
5.3.3 The Update Algorithm
There are three common types of hash table updates: insertion, deletion, and
modification. Deleting or modifying a key in the hash table is straightforward. For deletion, the
key is probed first by fetching the bucket from off-chip memory. The key is then removed from
the bucket before the bucket is written back to memory. If the key is the last one in the bucket,
the corresponding bit in the empty array is set to zero. For modification of a key's associated
record, the key and its record are fetched, and the new record replaces the old one
before the bucket is written back to memory. Those two types of updates do not involve the
modification of the target array.
Key insertion is slightly more complicated. All hashed buckets are probed and the key is
inserted into the least-loaded non-empty bucket with fewer than Ω𝑎 keys. If all non-empty
hashed buckets are full, the key is inserted into one of its hashed buckets that is empty, and the
empty array is updated accordingly. If all hashed buckets are full, Cuckoo Hashing is applied to
make room for the new key, i.e., "rehashing" a key in one of the hashed buckets to another
alternative bucket. During key relocations, both the empty and the target arrays are updated
accordingly. There are two options when a key cannot be inserted without breaking the
property 𝑣(𝐵[𝑖]) ≤ Ω𝑎, i.e., when all its hashed/rehashed bucket loads are at least
Ω𝑎: first, set Ω𝑎 = Ω𝑎 + 1 and insert the key normally; second, initiate an off-line process to
re-set up the table. Normally the chance that a key cannot be inserted is small, and we should
use the second option to prevent the bucket size from growing quickly. However, if this situation
arises very frequently, it implies that most of the buckets are "full", i.e., the average number of
keys per bucket approaches Ω𝑎. In this case we should use the first option: by increasing
the maximum load by one, every bucket gains one extra slot for another key.
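The insertion policy above can be sketched as follows (our own code; `rehash` stands in for the Cuckoo relocation and is a hypothetical helper):

```python
def ghash_insert(key, hash_funcs, buckets, empty_bits, omega_a, rehash):
    """Insert per Section 5.3.3: least-loaded non-empty hashed bucket with
    room, else an empty hashed bucket, else Cuckoo-style relocation."""
    idxs = [h(key) for h in hash_funcs]
    open_live = [i for i in idxs
                 if empty_bits[i] and len(buckets[i]) < omega_a]
    if open_live:                        # a non-empty bucket still has room
        buckets[min(open_live, key=lambda i: len(buckets[i]))].append(key)
        return True
    for i in idxs:                       # all non-empty hashed buckets full
        if not empty_bits[i]:
            buckets[i].append(key)
            empty_bits[i] = 1            # bucket is no longer empty
            return True
    return rehash(key)                   # every hashed bucket is full
```

Preferring non-empty buckets keeps c-empty buckets empty for as long as possible, preserving their lookup-saving role.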
5.4 Performance Results
The performance evaluation is based on simulations of seven hashing schemes: single-hash,
2-left, 4-left, base 2-ghash, enhanced 2-ghash, base 4-ghash, and enhanced 4-ghash. We do not
include d-random in the evaluation because it is outperformed by d-left in both
the bucket load and the number of bucket accesses per lookup. We simulate 200,000
randomly generated keys hashed into 100,000 to 500,000 buckets. To test the new multiple
hashing idea, we adopt the random hash function in [25], which uses a few shift, OR, and addition
operations on the key to produce multiple hashing results. For relocation, we relocate keys
in no more than ten buckets to other alternative buckets in the setup algorithm and no more than
two in the update algorithm. We first compare the bucket load and the average number of
bucket accesses per lookup while varying 𝑛/𝑚. Then we normalize the number of keys per lookup
by the memory usage ratio to understand the memory overhead of the different hashing
schemes. In addition, we demonstrate the effectiveness of creating c-empty buckets in reducing
bucket accesses. We also give a sensitivity study of the number of bucket accesses per lookup with
respect to the size of the target array. Lastly, we evaluate the robustness of the d-ghash scheme
using two simple probabilistic models.
Figure 5-3. Bucket loads for the five hashing schemes. The enhanced and base d-ghash schemes
have the same bucket load.
Figure 5-3 displays the bucket loads of the hashing schemes. Note that enhanced d-ghash
and base d-ghash have the same bucket load; the only difference between the two is that enhanced
d-ghash uses a target array to reduce the number of bucket accesses per lookup. The results
show that d-ghash has the lowest bucket load, and hence achieves the best balance among the
buckets, followed by d-left. More hash functions improve the balance for both d-ghash and d-left.
With 275,000 buckets, 4-ghash accomplishes perfect balance with a bucket load of a single
key; no other simulated scheme achieves such balance with up to 500,000 buckets. 2-ghash
performs slightly better than 4-left: the former needs 150,000 buckets to reduce the bucket load
to two keys, while the latter requires 175,000 buckets. This result demonstrates the power of
d-ghash in balancing the keys over that of d-left. The single-hash scheme is the worst, with a
bucket load of six even at 500,000 buckets. Note that the bucket load is an integer, but we
slightly adjust the integer values to separate the curves of the different schemes for easy reading.
Figure 5-4. Number of bucket accesses per lookup for d-ghash.
In Figure 5-4, we evaluate the lookup efficiency of the seven hashing schemes. Single-hash
accesses only one bucket per lookup. The d-left scheme looks up a key starting from the left-most
hashed bucket; if the key is not found, the next bucket to the right is accessed, until the key is
located. Since the key is always placed in the left-most bucket to break a tie, the number of
bucket accesses per lookup is quite low, 1.68 ∼ 2.36 for 4-left and 1.27 ∼ 1.44 for 2-left. The
base 4-ghash and base 2-ghash reduce the number of bucket accesses per lookup to 1.25 ∼ 2.18
and 1.11 ∼ 1.44 respectively with a 5–34% and 0–14% reduction. With a target array of 1.5n
entries, the enhanced 4-ghash and the enhanced 2-ghash can further reduce the number of bucket
accesses per lookup to as low as 1.03 ∼ 1.23 and 1.01 ∼ 1.11 respectively with a 38–51% and
21–24% reduction.
Interestingly, the number of bucket accesses per lookup for d-ghash does not
decrease continuously as the number of buckets increases. We can observe a sudden jump at
m = 125,000 and m = 275,000 for 4-ghash, because the optimal bucket load
drops from three to two at m = 125,000 and from two to one at m = 275,000. As the
average number of keys per bucket is then very close to the optimal bucket load, it is hard to
create c-empty buckets. Therefore, the number of c-empty buckets drops suddenly at those
two points, and 4-ghash experiences more bucket accesses per key lookup. The same
reasoning applies to 2-ghash.
Figure 5-5. Average number of keys per lookup based on memory usage ratio.
To reduce the bucket load for a fixed number of keys, we can increase the number
of buckets; however, this inflates the memory space requirement. In Figure 5-5,
we plot the average number of keys per lookup against the memory usage ratio, where the
average number of keys is the product of the bucket load and the average number of buckets per
lookup. The results clearly show the advantage of the d-ghash scheme. Enhanced 4-ghash
accomplishes a single key per bucket with 275,000 buckets, only 37% more than the
number of keys. With slightly more than one key per lookup, enhanced 4-ghash requires the
least memory to achieve close to one key access per lookup.
Besides the near-perfect balance, d-ghash creates c-empty buckets to maximize the number of
keys hashing to empty buckets. Figure 5-6 shows the effectiveness of the c-empty buckets in
reducing bucket accesses. In this figure, the y-axis indicates the average number of non-empty
buckets into which each key is hashed. In comparison with d-left, d-ghash reduces non-empty
buckets more significantly, resulting in a smaller number of bucket accesses. For d-left, the number
of non-empty buckets decreases as the number of buckets increases, because d-left assigns each
key to the least-loaded bucket. d-ghash, on the other hand, creates c-empty buckets by removing
keys from the buckets with more keys hashed into them; as a result, there are fewer non-empty
buckets to examine when looking up each key. It is interesting to observe that the ability to
create c-empty buckets depends heavily on the optimal bucket load and the ratio of keys to
buckets. For example, when the number of buckets is 250,000, the optimal bucket load is 2
for 200,000 keys, which leaves plenty of room to create many c-empty buckets. However, when the
number of buckets increases to 275,000, the optimal bucket load drops to 1, which leaves little
room for c-empty buckets. Hence, the average number of non-empty buckets per hashed key
increases.
Moreover, we show a sensitivity study of the bucket accesses per lookup with respect to the
size of the target array. We vary the size of the target array from n to 2n entries using enhanced
4-ghash; the result is shown in Figure 5-7. As expected, a larger target array reduces
collisions, resulting in fewer bucket accesses. We picked 1.5n entries as the target
array size in the earlier simulations, which gives the best tradeoff between space overhead and
bucket accesses per lookup.
Figure 5-6. The average number of non-empty buckets for looking up a key. This parameter is
the same for the enhanced and base d-ghash schemes.
Finally, we evaluate the robustness of our scheme. We first set up a table using 200,001
keys, 200,000 buckets, and 300,000 target array entries; the achievable bucket load Ω𝑎 is 2 in
this setting. We simulate two update models: (1) Balanced Update: 33% insertion, 33% deletion,
and 33% modification; and (2) Heavy Insertion: 40% insertion, 30% deletion, and 30%
modification. We run 600K updates and record the rehash percentage over all update
operations and the number of bucket accesses per lookup. The results are presented in Figure
5-8. The top two lines show the number of bucket accesses per lookup under the Heavy Insertion
model and the Balanced Update model, respectively. Both lines increase. The number
of bucket accesses per lookup rises continuously to 1.37 for Heavy Insertion, an increase of
25% over the original number; for Balanced Update, it first rises to 1.25
and then drops to 1.21, a 10% increase in the end. The bottom two lines are the rehash
percentages over all update operations. They show clearly that heavier insertion
causes more rehashes. For Balanced Update, the rehash percentage stays
almost constant at 0.5%, while there is a slight increase for Heavy Insertion. Since
the rehash percentages of both models are below 2% and a rehash operation involves keys
in no more than two buckets, we believe d-ghash can handle these rehashes without
incurring much delay.
Figure 5-7. Sensitivity of the number of bucket accesses per lookup for enhanced 4-ghash with
respect to the target array size.
Finally, we apply our algorithm to a real routing table application. We use five routing
tables downloaded from the Internet backbone routers: as286 (KPN Internet Backbone), as513
(CERN, European Organization for Nuclear Research), as1103 (SURFnet, the Netherlands),
as4608 (Asia Pacific Network Information Center, Pty. Ltd.), and as4777 (Asia Pacific Network
Information Center) [66], with 276K, 291K, 279K, 283K, 281K prefixes respectively after
removing the redundant prefixes.
Figure 5-8. Changes in the number of bucket accesses per lookup and rehash percentage for two update models using enhanced 4-ghash. The bucket-accesses-per-lookup lines correspond to the left Y axis; the rehash-percentage lines correspond to the right Y axis.
To handle the longest prefix matching problem, hash-based lookup adopts the controlled
prefix expansion [26] along with other techniques [73], [74], [75]; it is observed that there are
small numbers of prefixes for most lengths, and they can be dealt with separately, for example,
using TCAM, while other prefixes are expanded to a limited number of fixed lengths. Lookup
will then be performed for those lengths. In this experiment, we expand the majority of prefixes
(with lengths in the range of [26], [76]) to two lengths: 22 bits and 24 bits. Assuming the small
number of prefixes outside [26], [76] are handled by TCAM, we perform lookups against lengths
22 and 24. Because there are more 24-bit prefixes after expansion, we present the results for 24-bit prefix lookup.
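Controlled prefix expansion works by replicating each shorter prefix into every covering prefix of the target length. A minimal sketch of that step, with illustrative function names and values:

```python
def expand_prefix(prefix_bits, length, target_length):
    """Expand one prefix of `length` bits into all prefixes of
    `target_length` bits that it covers (target_length >= length).
    Each expanded prefix would inherit the original next-hop."""
    shift = target_length - length
    base = prefix_bits << shift
    return [base + i for i in range(1 << shift)]

# Tiny example: a /4 prefix expanded to length 6 yields 2**2 = 4 prefixes;
# a 22-bit prefix expanded to 24 bits yields 4 prefixes the same way.
expanded = expand_prefix(0b1011, 4, 6)
```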
Figure 5-9. Number of bucket accesses per lookup for experiments with five routing tables.
Figure 5-10. Experiment with the update trace using enhanced 4-ghash. The bucket-accesses-per-lookup line corresponds to the left Y axis; the rehash-percentage line corresponds to the right Y axis.
Table 5-2. Routing table updates for enhanced 4-ghash.
Number of buckets     110K      120K    130K    140K    150K
Rehash percentage     re-setup  0.23%   0.13%   0.08%   0.05%
There are 159,444, 159,813, 159,395, 159,173 and 159,376 24-bit prefixes in the five routing tables, respectively. We use these prefixes to set up five tables separately and vary the number of buckets from 100K to 250K with a target array of 150K entries. We measure the number of bucket accesses per lookup for the d-ghash and d-left schemes, averaged over the five tables. As shown in Figure 5-9, both base 4-ghash and base 2-ghash perform better than the respective d-left scheme. The maximum reduction is about 36% for base 4-ghash over 4-left when m = 210,000, and 12% for base 2-ghash over 2-left when m = 250,000. The average number of bucket accesses per lookup for the enhanced 4-ghash scheme is almost one bucket less than for 4-left, a reduction of up to 50%. For enhanced 2-ghash, there is an average 20% reduction over 2-left. We also notice a jump for 4-ghash at m = 220,000 and another for 2-ghash at m = 130,000; this is due to the change in Ω𝑎, as mentioned before.
In the second experiment, we set up our hash tables with the routing table as286 downloaded on January 1st, 2010 from [66] and use the collected update trace of the whole month of January 2010 to simulate the update process. To keep the experiment simple, we again choose prefixes of length 24. There are 159,444 24-bit prefixes in the table. The update trace contains 1,460,540 insertions and 1,458,675 deletions for those 24-bit prefixes. We vary the number of buckets from 110K to 150K. For all these settings, the achievable bucket load Ω𝑎 is 2 for enhanced 4-ghash. We also use a fixed 150K-entry target array.
As shown in Table 5-2, with 110K buckets a re-setup of the whole table is needed. With 120K buckets, no re-setup is needed, but 0.23% of the update operations must be rehashed, which is about 0.5% of the 1.4 million insertions. Increasing the number of buckets further reduces rehashing: with 150K buckets, the rehash percentage is close to 0.05%. We also show the change in lookup efficiency in Figure 5-10 with m = 150K. The update trace used has nearly the same number of insertions and deletions, similar to the Balanced Update model used in Section 5.4. The rehash percentage grows continually to 0.05%. The number of bucket accesses per lookup rises and falls through the update process, with a 7% increase in the end.
5.5 Summary
A new guided multiple-hashing method, d-ghash, is introduced in this chapter. Unlike previous approaches, which progressively place each key into the least-loaded bucket, d-ghash achieves global balance by allocating keys to buckets after all keys have been placed into buckets d times using d independent hash functions. D-ghash calculates the achievable perfect balance and removes duplicate keys to reach it. Meanwhile, d-ghash reduces the number of bucket accesses for looking up a key by creating as many empty buckets as possible without disturbing the balance. Furthermore, d-ghash uses a table to encode the hash function ID for the bucket where a key is located, guiding the lookup and avoiding extra bucket accesses. Simulation results show that d-ghash achieves better balance than existing approaches and significantly reduces the number of bucket accesses.
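The guided-lookup idea above can be sketched as follows. The class and the per-key guide structure are purely illustrative: the actual d-ghash encodes function IDs in a compact table built during setup, not a per-key dictionary, but the sketch shows why recording the placing function lets a lookup probe exactly one bucket.

```python
import random

class GHashSketch:
    """Illustrative sketch of guided lookup: once the global balancing
    step has chosen, for each key, which of the d hash functions places
    it, recording that function ID lets a lookup probe one bucket."""
    def __init__(self, d, num_buckets, seed=0):
        rng = random.Random(seed)
        self.seeds = [rng.randrange(1 << 30) for _ in range(d)]
        self.num_buckets = num_buckets
        self.buckets = [[] for _ in range(num_buckets)]
        self.guide = {}  # key -> hash function ID chosen at setup

    def _h(self, i, key):
        # i-th hash function, derived from a per-function seed.
        return hash((self.seeds[i], key)) % self.num_buckets

    def insert(self, key, func_id):
        # func_id is assumed to come from the global balancing step.
        self.guide[key] = func_id
        self.buckets[self._h(func_id, key)].append(key)

    def lookup(self, key):
        func_id = self.guide.get(key)
        if func_id is None:
            return False          # no guide entry: key was never inserted
        return key in self.buckets[self._h(func_id, key)]  # one bucket access
```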
CHAPTER 6
INTELLIGENT ROW BUFFER PREFETCHES
6.1 Background and Motivation
As discussed in the introduction, accesses to DRAM are distributed across channels, ranks, and eventually individual banks. Inside each bank, the DRAM arrays are organized into rows. Before a memory location can be read, the entire row containing that location is opened and read into the row buffer. Leaving a row buffer open after every access (the open-page policy) enables more efficient access to the same open row, at the expense of increased access delay to other rows in the same DRAM array. A request to the open row is called a row buffer hit. A row buffer miss happens when the next request accesses a different row, which can cause a long delay.
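The hit/miss behavior under an open-page policy can be illustrated with a single-bank sketch:

```python
def classify_row_accesses(row_sequence):
    """Classify each access to one bank as a row-buffer hit or miss
    under an open-page policy (the first access is always a miss)."""
    open_row = None
    results = []
    for row in row_sequence:
        results.append("hit" if row == open_row else "miss")
        open_row = row  # the accessed row is left open
    return results

# Two back-to-back accesses to row 5 yield one hit; switching rows misses.
outcome = classify_row_accesses([5, 5, 7, 5])  # -> ['miss', 'hit', 'miss', 'miss']
```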
Row buffer locality has long been observed and utilized in previous proposals, which generally take one of two approaches. The first is to change DRAM mapping and scheduling to improve row buffer hits. [77] proposes a permutation-based page interleaving scheme to reduce row-buffer conflicts. [78] introduces a Minimalist Open-page memory scheduling policy that captures open-page gains with a relatively small number of page accesses per page activation. They observe that the commonly used open-page address mapping schemes map each memory page to a sequential region of real memory, so linear access sequences hit in the row buffer, but this can cause interference between applications sharing the same DRAM devices [79] and cannot exploit bank-level parallelism. They argue that the number of row-buffer hits is generally small and that an adjusted DRAM mapping scheme can both preserve row-buffer hits and reduce row-buffer conflicts. However, they need a complex data prefetch engine and a complex scheduler to schedule normal and prefetch requests.
[80] proposes a three-stage memory controller that first groups requests based on row-buffer locality, then performs inter-application request scheduling, and lastly issues simple FIFO DRAM commands. They mainly target CPU-GPU systems, where the two kinds of applications can interfere heavily and behave very differently in their row-buffer accesses.
The second approach is to change DRAM page-closure management. [81] first proposes tracking history at a per-DRAM-page granularity and uses a two-level branch predictor to predict whether to close a row buffer. [82] extends that proposal with a one-level, low-cost access-based predictor (ABP) that closes a row buffer after a specified number of accesses or when a page conflict occurs. They argue that the number of accesses to a given DRAM page is a better basis for page closure than timer-based policies. [83] and [84] propose application-aware page policies that assign different page policies to different applications based on memory intensity and locality.
Of all the related works, [85] is the most related to ours. [85] proposed the row-based page policy (RBPP), which tracks row addresses and uses them as an indicator to decide whether to close the row buffer when the active memory request finishes. They use a few registers to record the most accessed rows in a bank. For each recorded row address, a counter that is dynamically updated based on the access pattern determines whether the row buffer should be closed. They use an LRU scheme to replace old entries. We will show in the results section that replacing entries only by LRU is less accurate than replacing based on the row access count. More specifically, rows that are accessed only once or twice are very common in many workloads [39]; they tend to frequently evict entries in the most-accessed-row registers (MARR), causing a poor hit ratio. In contrast, we use a general approach that adds a learning table to filter out those requests. The comparison results will be shown in the results section.
When a hot row is identified, [85] proposed modifying the DRAM page policy, which cannot reduce latency when accesses to two hot rows interleave with each other. We argue that by simply caching the hot rows without modifying the DRAM mapping, we can still harness the latency gain of row buffer hits and avoid the complexity of modifying the DRAM.
Figure 6-1. Hot row pattern of 10 workloads. Each panel plots the row number (Y axis) against HybridSim cycles (X axis) for one workload: bt, bwaves, gems, lbm, leslie3d, mg, milc, soplex, swim, and zeusmp.
Figure 6-1 shows a slice of row buffer accesses when we simulate the No-Cache Design mentioned in Chapter 3. The Y axis is the row number of a request address and the X axis is the simulation cycle. The overall simulation length is 200K cycles. Over this window, it is interesting to observe that 8 out of 10 workloads exhibit a strong pattern in which some rows are accessed much more than others. However, due to interleaved accesses to different rows, we do not observe a decent row buffer hit rate. One way to improve the row buffer hit rate is to schedule all requests to the same row together; another is to prefetch requests inside the hot rows for later use.
6.2 Hot Row Buffer Design and Results
We propose a simple but effective design that could utilize the hot row pattern we
observed in the previous subsection. Two data structures are needed: a Learning Table (LT) and
a Hot-Row Buffer (HRB). The LT captures new hot rows based on recently referenced rows, and the HRB buffers the hot rows. Each entry in the LT records the address tag of a row together with a shift register. The size of the LT is n and the width of the shift register is m. The HRB has k rows, and each entry in the HRB has a 2KB data buffer along with a reference counter that counts the number of references to the row.
When a reference hits a row in the HRB, the respective counter is incremented. When a requested row is not in the HRB, the row enters the LT if it is not already there. If the LT is full, the request is dropped. The m bits are initialized to all '0's when the row first enters the LT. A hit to a row in the LT shifts a '1' into the corresponding shift register.
When the shift-out bit is '1', a hot row is identified. The newly identified row is fetched into the HRB to replace the row with the least used count. The counter of the new row is initialized to the middle value the counter can represent; meanwhile, the other counters are decremented by one. The row is then dropped from the LT, creating an empty entry. If a request misses the HRB and hits the LT but the shift-out bit is '0', no action is taken.
The process continues until a maximum of h hot rows are identified. After the last hot row is inserted into the HRB, the entire LT is cleared and the process starts over.
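Putting the LT and HRB rules above together, a behavioral sketch is given below. The 8-bit counter range and its mid value of 128 are assumptions (the text leaves the counter width unspecified), and the periodic LT wipe after h promotions is omitted for brevity.

```python
class HotRowTracker:
    """Sketch of the Learning Table (LT) / Hot-Row Buffer (HRB) interplay.

    n: LT entries, m: shift-register width, k: HRB rows (names follow
    the text). Data buffers are omitted; only the directories are modeled.
    """
    def __init__(self, n=16, m=4, k=8):
        self.n, self.m, self.k = n, m, k
        self.lt = {}    # row -> m-bit shift register (as an int)
        self.hrb = {}   # row -> saturating reference counter

    def access(self, row):
        if row in self.hrb:                       # HRB hit: bump counter
            self.hrb[row] = min(self.hrb[row] + 1, 255)
            return "hrb_hit"
        reg = self.lt.get(row)
        if reg is None:
            if len(self.lt) < self.n:             # enter LT if space
                self.lt[row] = 0                  # shift register all '0's
            return "miss"                         # LT full: request dropped
        shifted = (reg << 1) | 1                  # LT hit: shift a '1' in
        if (shifted >> self.m) & 1:               # shift-out bit is '1'
            del self.lt[row]                      # hot row identified
            if len(self.hrb) >= self.k:           # evict least-used row
                victim = min(self.hrb, key=self.hrb.get)
                del self.hrb[victim]
            for r in self.hrb:                    # age the other counters
                self.hrb[r] = max(self.hrb[r] - 1, 0)
            self.hrb[row] = 128                   # mid-value initialization
            return "promoted"
        self.lt[row] = shifted & ((1 << self.m) - 1)
        return "lt_hit"
```

With m = 4, a row is promoted on its sixth access: one access to enter the LT, four LT hits to fill the register with ones, and a fifth LT hit whose shift-out bit is '1'.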
As for the replacement policy of the HRB, one might think LRU replacement would be a good choice. We therefore add a competing directory that implements the LRU replacement policy. Periodically, we compare the number of hits under each scheme and select the one with more hits for the next time period. Note that at the end of a period, the directory recording the losing scheme is updated to the contents of the other, and the hit/miss counters for the two schemes are reset to measure the next period. A dynamic method with a saturating counter is used to switch between the two schemes. Figure 6-2 shows the flow diagram when a new row address arrives; we omit the LT restart and request-dropping parts.
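The saturating-counter switch between the two directories might look like the sketch below; the counter width and threshold are assumptions, not values from the text.

```python
class SchemeSelector:
    """At each period end, nudge a saturating counter toward whichever
    directory (HRB count-based or LRU) collected more hits, and use the
    counter to pick the scheme applied during the next period."""
    def __init__(self, max_count=3):
        self.counter = 0                    # low values favor HRB
        self.max_count = max_count          # saturation limit (assumed)

    def end_of_period(self, hrb_hits, lru_hits):
        if lru_hits > hrb_hits:
            self.counter = min(self.counter + 1, self.max_count)
        elif hrb_hits > lru_hits:
            self.counter = max(self.counter - 1, 0)
        # Switch only after the counter crosses the midpoint, so a single
        # noisy period does not flip the scheme.
        return "LRU" if self.counter > self.max_count // 2 else "HRB"
```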
Figure 6-2. Hot row identification and update.
Figure 6-3 shows the detailed hit ratio of capturing reused blocks for the 10 workloads with different HRB configurations. The X axis is the number of rows stored in the directory for each scheme, and the Y axis is the hit ratio of the HRB/LRU directory over all row accesses at that size. We observe that the hybrid scheme provides a reasonable hit ratio across all workloads, which matches the hot row pattern observed in Figure 6-1. With more entries recorded, the hit ratio clearly increases for all workloads. Milc captures more than 90% of row accesses when using more than 16 entries. Gems and zeusmp perform badly
with only a slight increase in hit ratio when more entries are used. In general, the LRU scheme performs better than the HRB scheme when it has enough entries; however, when space is limited and only a few entries can be recorded, the HRB scheme performs better.
Table 6-1 summarizes the hit ratios of the 10 workloads when using 64 entries. 7 out of 10 workloads have a hit rate over 50%, and 4 workloads have a hit rate above 90%. The average hit rate of the 10 workloads reaches 57.7%. We believe this can be utilized to achieve a reasonable performance gain.
Table 6-1. Hit ratio for the hybrid scheme of 10 workloads using 64 entries.
workload hit ratio (%)
bt 91.2
bwaves 91.3
gems 26.3
lbm 76.7
leslie3d 49.4
mg 96.8
milc 94.9
soplex 53.1
swim 76.1
zeusmp 13.3
Geomean 57.7
Figure 6-3. Results of the proposed hybrid scheme. Each panel plots the hit ratio (Y axis) against the number of recorded hot rows (X axis) for the LRU, HRB, and hybrid schemes, for workloads bt, bwaves, gems, lbm, leslie3d, mg, milc, soplex, swim, and zeusmp.
Previous proposals have focused only on different ways of modifying row buffer management to reduce row buffer conflicts. We argue that we can simply cache these hot rows and avoid the complexity of changing the DRAM organization. Of course, caching an entire row not only wastes space when only a few blocks in the row are accessed, but also puts a heavy burden on bandwidth. We can instead cache part of the row: when the cached blocks of a row are accessed, we start prefetching the remaining blocks in the row. This can be implemented easily with a simple stream prefetcher.
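The prefetch-on-access step of such a stream prefetcher fits in a few lines. The blocks-per-row count and the prefetch degree below are illustrative parameters.

```python
def stream_prefetch(block_col, blocks_per_row=32, degree=4):
    """Return the next `degree` block columns to prefetch within the
    same row, clipped at the row boundary (a minimal sketch)."""
    return [c for c in range(block_col + 1, block_col + 1 + degree)
            if c < blocks_per_row]

# Accessing block 10 of a hot row queues prefetches for blocks 11-14.
queued = stream_prefetch(10)  # -> [11, 12, 13, 14]
```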
Figure 6-4. Block column difference within a row for 10 workloads. Each panel plots the number of requests (Y axis) against the block distance from the first access to the row (X axis).
Figure 6-4 shows the column difference inside a row buffer. Upon the first request to a certain row, we record its address as the base address. When later requests access the same row, we record the column difference between the new request and the base. We observe a strong regularity that the following blocks in the row are accessed. Taking lbm as an example, most requests in a row have increasing positive differences relative to the first access
to a row. As the difference grows, the number of requests decreases. This motivates our use of a simple stream prefetcher in the performance evaluation. Seven of the ten workloads show the same strong pattern as lbm. For milc, there is a gap between odd and even differences. For swim and zeusmp, the differences are more scattered. Soplex and zeusmp also show many requests with negative differences; for such patterns, a more advanced prefetching algorithm could be applied.
6.3 Performance Evaluation
In this section, we present the IPC improvement of hot row prefetching over the base design without it, along with the change in row buffer hit ratio.
Figure 6-5. IPC, row buffer hit ratio, and cache hit rate improvement of 10 workloads.
The total cost of the learning table is calculated as follows. Each learning table costs 16 × (0.5 + 3) = 56B per bank (3B records the row address). The HRB/LRU scheme records 1-64 rows, each with a 1B reference count, costing at most 2 × 64 × (1 + 3) = 512B per bank. The total overhead is therefore 568B per bank for recording 64 rows. Covering all 16 banks requires 8.9KB in total, which can easily fit into the last-level cache. Once we have identified
the hot rows, we implement a simple stream prefetcher that prefetches the next 4 blocks inside the row.
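The storage-overhead arithmetic above can be checked directly:

```python
# Learning table: 16 entries, each a 0.5B shift register plus a 3B row address.
lt_bytes = 16 * (0.5 + 3)          # 56 bytes per bank
# Two competing directories (HRB and LRU), 64 rows each,
# with a 1B reference count plus a 3B row address per entry.
dir_bytes = 2 * 64 * (1 + 3)       # 512 bytes per bank
per_bank = lt_bytes + dir_bytes    # 568 bytes per bank
total_kb = per_bank * 16 / 1024    # 16 banks -> about 8.9 KB in total
```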
Figure 6-5 shows the IPC and row buffer hit improvement of adding a hot row prefetcher for the 10 workloads. The overall IPC improvement is 9.1%, ranging from 3.1% to 16.4% across individual workloads. 6 out of 10 workloads have an IPC improvement over 10%. Bt, mg and milc have the most significant improvement, while zeusmp has the least.
We also observe an average of 6.3% more row buffer hits. In general, the IPC improvement is consistent with the row buffer hit increase. Prefetching the next 4 blocks upon a hot row access creates 4 row buffer hits, which improves the row buffer hit rate. If any of these prefetched blocks is later requested, its access time is reduced, lowering the average memory access time. Among the 10 workloads, milc and mg have more than 10% increased row buffer hits; gems, soplex and zeusmp have less than 5%.
We also show the last-level cache hit rate improvement. Prefetched blocks may become cache hits when they arrive at the last-level cache before being requested. We observe a slight increase in the L3 cache hit rate for all workloads; the average increase is 3.9%. Note that even cache misses can be served faster because the blocks were prefetched earlier.
Table 6-2 shows the usage of the prefetched blocks. The average usage is about 52%, which means that on average 2 out of the 4 blocks we prefetch are requested next. 7 out of 10 workloads have more than 52% accuracy. Among them, bt, milc, and mg have the highest prefetch usage, which is consistent with their IPC results. Zeusmp, soplex, and gems have the lowest. Useless prefetched blocks consume bandwidth and pollute the cache, hurting the improvement prefetching brings.
Table 6-2. Prefetch usage for 10 workloads using a simple stream prefetcher.
workload prefetch_hit_percentage
bt 80
bwaves 71
gems 32
lbm 62
leslie3d 51
mg 72
milc 83
soplex 33
swim 54
zeusmp 23
Geomean 52
Table 6-3. Sensitivity study on prefetch granularity.
Percentage(%) IPC_speedup row_buffer_hit_improve cache_hit_rate_improve
prefetch_2 5.4 3.7 2.1
prefetch_4 9.1 6.3 3.8
prefetch_6 3.1 8.3 0.4
prefetch_8 -2.8 10.1 -2.2
Table 6-3 shows a sensitivity study of varying the stream prefetch granularity. Compared to prefetching only two blocks on every row access, prefetching 4 blocks improves the row buffer hit ratio and the IPC by reducing the last-level cache miss rate. But the improvement becomes smaller, and even negative, when we prefetch 6 or more blocks: prefetching more blocks on every row access puts a heavy burden on the bandwidth and hurts LLC performance.
6.4 Conclusion
From the results, we can conclude that there exists strong locality pattern for row buffer
accesses. But because of those accesses are interleaved with each other, we tend to have more
row buffer conflicts, rather than row buffer hits. Based on the hot row pattern, we can easily
identify some frequently used rows. We evaluate the proposed scheme to capture hot rows with
96
trace collected from Marssx86 and DRAMSim2. The use of a learning table filters requests that
are accessed only a few times. The competing of LRU and HRB generates the hybrid scheme
that provides the best hit ratio. Results have shown that an average of 57.7% row accesses can be
captured by using 568B per bank. We also show that simple LRU replacement used by previous
RBPP scheme is not effective if recording limited hot rows.
We further implement a simple stream prefetcher to harness the hot row pattern captured by our learning table design. Results demonstrate that with a simple prefetch-in-row stream prefetcher, we achieve an IPC speedup of 9.1% and a row buffer hit rate improvement of 6.3%.
CHAPTER 7
SUMMARY
We propose four works in this dissertation targeting improvements in memory hierarchy performance. The proposed ideas can be easily applied to real-world systems, and our evaluations show that they improve system performance significantly. With the increasing demand for high-performance memory systems, the proposed techniques are valuable.
In the first work, we present a new caching technique that caches a portion of the large tag array for an off-die stacked DRAM cache. Due to its large size, the tag array is impractical to fit on-die, so caching a portion of the tags can reduce the need to go off-die twice for each DRAM cache access. To reduce the space requirement for cached tags and to obtain high coverage of DRAM cache accesses, we proposed and evaluated a sector-based Cache Lookaside Table (CLT) to record cache tags on-die. The CLT reduces the space requirement by sharing a sector tag among a number of consecutive cache blocks and uses location (way) pointers to locate the blocks in the off-die cache data array. The large sector also takes advantage of spatial locality for better coverage. In comparison with the Alloy cache, the ATCache, and the TagTables approaches, the average improvements of CLT are in the range of 4-15%.
In the second work, a new Bloom filter is introduced to filter L3 cache misses so that such requests can bypass the L1, L2 and L3 caches, shortening the L3 miss penalty in a 4-level cache hierarchy. The proposed Bloom filter applies a simple indexing scheme that decodes the low-order block address to determine the hashed location in the BF array. To provide better hashing randomization, partial index bits are XORed with the adjacent higher-order address bits. In addition, with certain combinations of the limited block address bits, multiple index functions can be selected to further reduce the false-positive rate. Performance evaluation using SPEC2006 benchmarks on an 8-core system with 4-level caches shows that the proposed simple hashing scheme can lower the average false-positive rate below 5% for filtering L3 misses and improve the average IPC by 10.5% over no L3 filtering and runahead. Furthermore, the proposed BF indexing scheme resolves an inherently difficult problem in using a Bloom filter to identify L3 cache misses: due to dynamic updates of the cache content, a counting Bloom filter would normally be necessary to keep the BF array current. A unique advantage of the proposed BF index is that it includes the cache index bits, so blocks hashed to the same BF array location are allocated in the same cache set. By searching the tags in the set when a block is replaced, the corresponding BF bit can be reset correctly without using expensive counters.
The third work proposes a new guided multiple-hashing method, d-ghash. Unlike previous approaches, which progressively place each key into the least-loaded bucket, d-ghash achieves global balance by allocating keys to buckets after all keys have been placed into buckets d times using d independent hash functions. D-ghash calculates the achievable perfect balance and removes duplicate keys to reach it. Meanwhile, d-ghash reduces the number of bucket accesses for looking up a key by creating as many empty buckets as possible without disturbing the balance. Furthermore, d-ghash uses a table to encode the hash function ID for the bucket where a key is located, guiding the lookup and avoiding extra bucket accesses. Simulation results show that d-ghash achieves better balance than existing approaches and significantly reduces the number of bucket accesses.
The fourth work digs into the details of DRAM row buffer accesses. By collecting memory accesses from most of the SPEC CPU workloads, we find that requests to the DRAM rows in each bank are not evenly distributed: some rows in a bank receive more requests than others. We call these more-accessed rows "hot rows". Based on the observed hot row pattern, we propose a simple design that uses a learning table to capture these hot rows. Once we identify a hot row, we sequentially prefetch blocks in that row upon a row access. We evaluate this idea using a simple stream prefetcher, and the results show a 9.1% average IPC improvement over a design without a prefetcher.
The proposed ideas have been verified by the results presented in each individual chapter.
LIST OF REFERENCES
[1] P. Hammarlund, "The Fourth-Generation Intel Core Processor," in MICRO, 2014.
[2] Y. Deng and W. P. Maly, "2.5-dimensional VLSI system integration.," in IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, 2005.
[3] K. Banerjee, S. Souri, P. Kapur and K. C. Saraswat, "3-D ICs: a novel chip design for
improving deep-submicrometer interconnect performance and systems-on-chip
integration," in Proceedings of the IEEE, 2001.
[4] G. Loh and M. D. Hill, "Efficiently enabling conventional block sizes for very large die-
stacked DRAM caches," in MICRO, 2011.
[5] J. Sim, G. Loh, H. Kim , M. Connor and M. Thottehodi, "A Mostly-Clean DRAM
Cache for Effective Hit Speculation and Self-balancing Dispatch," in MICRO, 2012.
[6] X. Jiang and e. al., "CHOP: Adaptive filter-based DRAM caching for CMP server
platforms," in HPCA, 2010.
[7] G. Loh, "Extending the Effectiveness of 3D-stacked DRAM Cache with An Adaptive
Multi-queue Policy," in MICRO, 2009.
[8] G. Loh and M. Hill, " Supporting very large DRAM caches with compound access
scheduling and MissMaps," in MICRO, 2012.
[9] M. K. Qureshi and G. Loh, "Fundamental Latency Trade-offs in Architecting DRAM
Caches," in MICRO, 2012.
[10] L. Zhao, R. Iyer, R. Illikkal and D. Newell, "Exploring DRAM cache architectures for
CMP server platforms," in ICCD, 2007.
[11] D. Woo, N. Seong, D. Lewis and H. Lee, "An Optimized 3D-stacked Memory
Architecture by Exploiting Excessive High-density TSV Bandwidth," in HPCA, 2010.
[12] T. Kgil, S. D'Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Reinhardt and K.
Flautner, "PicoServer: Using 3D stacking technology to enable a compact energy
efficient chip multiprocessor," in ASPLOS, 2006.
[13] C. Liu, I. Ganusov and M. Burtscher, "Bridging the Processor-memory Performance Gap
with 3D IC Technology," in IEEE Design & Test of Computers, 2005.
[14] G. Loh, "3D-stacked Memory Architectures for Multi-core Processors," in ISCA, 2008.
[15] C. Chou, A. Jaleel and M. K. Qureshi, "A Two-Level Memory Organization with
Capacity of Main Memory and Flexibility of Hardware-Managed Cache," in MICRO,
2014.
[16] X. Dong, Y. Xie, N. Muralimanohar and N. P. Jouppi, "Simple but effective
heterogeneous main memory with on-chip memory controller support," in SC, 2010.
[17] G. Loh and et al., "Challenges in Heterogeneous Die-Stacked and Off- Chip Memory
Systems," in SHAW, 2012.
[18] J. Pawlowski, "Hybrid Memory Cube: Breakthrough DRAM Performance with a
Fundamentally Re-Architected DRAM Subsystem," in Hot Chips, 2011.
[19] J. Sim, A. Alameldeen, Z. chishti, C. Wilkerson and H. Kim, "Transparent Hardware
Management of Stacked DRAM as Part of Memory," in MICRO, 2014.
[20] F. Botelho, R. Pagh and N. Ziviani, "Simple and space-efficient minimal perfect hash
functions," in WADS, 2007.
[21] B. Vocking, "How Asymmetry Helps Load Balancing," in IEEE Symp. on Foundations of Computer Science, 1999.
[22] A. Kirsch and M. Mitzenmacher, "On the Performance of Multiple Choice Hash Tables
with Moves on Deletes and Inserts," in Communication, Control, and Computing, 2008.
[23] F. Hao, M. Kodialam and T. V. Lakshman, "Building high accuracy bloom filters using
partitioned hashing," in SIGMETRICS, 2007.
[24] B. Bloom, "Space / Time Trade-offs in Hash Coding with Allowable Errors," in Comm.
ACM, 1970.
[25] T. Wang. http://burtleburtle.net/bob/hash/integer.html.
[26] V. Srinivasan and G. Varghese, "Fast Address Lookups Using Controlled Prefix
Expansion," in ACM Transactions on Computer Systems, 1999.
[27] A. Patel, F. Afram, S. Chen and K. Ghose, "MARSSx86: A Full System Simulator for
x86 CPUs," in DAC, 2011.
[28] M. T. Yourst, "PTLsim: A Cycle Accurate Full System x86-64 Microarchitectural
Simulator," in ISPASS, 2007.
[29] J. Stevens, P. Tschirhart, M. Chang, I. Bhati, P. Enns, J. Greensky, Z. Chishti, S. Lu and
B. Jacob, "An Integrated Simulation Infrastructure for the Entire Memory Hierarchy:
Cache, DRAM, Nonvolatile Memory, and Disk," in ITJ, 2013.
[30] "QEMU," http://wiki.qemu.org/Main_Page.
[31] Y. Chou, B. Fahs and S. Abraham, "Microarchitecture Optimizations for Exploiting
Memory-level Parallelism," in ISCA, 2004.
[32] T. E. Carlson, W. Heirman and L. Eeckhout, "Sniper: Exploring the Level of Abstraction
for Scalable and Accurate Parallel Multi-core Simulation," in SC, 2011.
[33] J. Henning, "SPEC CPU2006 memory footprint," in ACM SIGARCH Computer
Architecture News, 2007.
[34] B. Rogers, A. Krishna, G. Bell, K. Vu, X. Jiang and Y. Solihin, "Scaling the Bandwidth
Wall: Challenges in and Avenues for CMP Scaling," in ISCA, 2009.
[35] N. Madan, L. Zhao, N. Muralimanohar, A. Udipi, R. Balasubramonian, R. Iyer, S.
Makineni and D. Newell, "Optimizing Communication and Capacity in a 3D Stacked
Reconfigurable Cache Hierarchy," in HPCA, 2009.
[36] S. Lai, "Current Status of the Phase Change Memory and Its Future," in IEDM, 2003.
[37] S. Franey and M. Lipasti, "Tag Tables," in HPCA, 2015.
[38] C. Huang and V. Nagarajan, "ATCache: Reducing DRAM cache Latency via a Small
SRAM Tag Cache," in PACT, 2014.
[39] D. Jevdjic, S. Volos and B. Falsafi, "Die-stacked DRAM caches for servers: hit ratio,
latency, or bandwidth? have it all with footprint cache," in ISCA, 2013.
[40] D. Jevdjic, G. Loh, C. Kaynak and B. Falsafi, "Unison Cache : A Scalable and Effective
Die-Stacked DRAM Cache," in MICRO, 2014.
[41] N. Hardavellas, M. Ferdman, B. Falsafi and A. Ailamaki, "Reactive NUCA: Near-optimal
block placement and replication in distributed caches," in ISCA, 2009.
[42] J. Liptay, "Structural aspects of the System/360 Model 85, Part II: The cache," in IBM
Syst.J., 1968.
[43] S. Przybylski, "The Performance Impact of Block Sizes and Fetch Strategies," in ISCA,
1990.
[44] J. B. Rothman and A. J. Smith, "The Pool of Subsectors Cache Design," in ICS, 1999.
[45] A. Seznec, "Decoupled sectored caches: conciliating low tag implementation cost and
low miss ratio," in ISCA, 1994.
[46] S. Somogyi, T. Wenish, A. Ailamaki, B. Falsafi and A. Moshovos, "Spatial memory
streaming," in ISCA, 2006.
[47] G. Loh and M. Hill, "Addendum for “Efficiently enabling conventional block sizes for
very large die-stacked DRAM caches”," 2011.
[48] J. Meza, J. Chang, H. Yoon, O. Mutlu and P. Ranganathan, "Enabling Efficient and
Scalable Hybrid Memories Using Fine-Granularity DRAM Cache Management," in
CAL, 2012.
[49] W. Chou, Y. Nain, H. Wei and C. Ma, "Caching tag for a large scale cache computer
memory system," in US Patent 5813031, 1998.
[50] T. Wicki, M. Kasinathan and R. Hetherington, "Cache tag caching," in US Patent
6212602, 2001.
[51] M. Qureshi, "Memory access prediction," in US Patent 12700043, 2011.
[52] "CACTI 6.5," http://www.hpl.hp.com/research/cacti.
[53] A. Seznec and P. Michaud, "A Case for (Partially) Tagged Geometric History Length
Branch Prediction," in Journal of Instruction Level Parallelism, 2006.
[54] A. Broder and M. Mitzenmacher, "Network applications of Bloom filters: A survey," in
Internet Math, 2004.
[55] J. K. Mullin, "Optimal semijoins for distributed database systems," in IEEE
Transactions on Software Engineering, 1990.
[56] L. Fan, P. Cao, J. Almeida and A. Broder, "Summary Cache: A Scalable Wide-area Web
Cache Sharing Protocol," in IEEE/ACM Transactions on Networking, 2000.
[57] R. Rajwar, M. Herlihy and K. Lai, "Virtualizing Transactional Memory," in ISCA, 2005.
[58] A. Roth, "Store Vulnerability Window (SVW): Re-Execution Filtering for Enhanced
Load Optimization," in ISCA, 2005.
[59] A. Moshovos, "RegionScout: Exploiting Coarse Grain Sharing in Snoop-Based
Coherence," in ISCA, 2005.
[60] J. Peir, S. Lai, S. Lu, J. Stark and K. Lai, "Bloom filtering cache misses for accurate
data speculation and prefetching," in ICS, 2002.
[61] X. Li, D. Franklin, R. Bianchini and F. T. Chong, "ReDHiP: Recalibrating Deep
Hierarchy Prediction for Energy Efficiency," in IPDPS, 2014.
[62] A. Broder and M. Mitzenmacher, "Using Multiple Hash Functions to Improve IP
Lookups," in INFOCOM, 2001.
[63] S. Demetriades, M. Hanna, S. Cho and R. Melhem, "An Efficient Hardware-Based Multi-
hash Scheme for High Speed IP Lookup," in HOTI, 2008.
[64] H. Song, S. Dharmapurikar, J. Turner and J. Lockwood, "Fast Hash Table Lookup
Using Extended Bloom Filter: An Aid to Network Processing," in SIGCOMM, 2005.
[65] Z. Huang, D. Lin, J.-K. Peir and S. M. I. Alam, "Fast Routing Table Lookup Based on
Deterministic Multi-hashing," in ICNP, 2010.
[66] "Routing Information Service," http://www.ripe.net/ris.
[67] C. Hermsmeyer, H. Song, R. Schlenk, R. Gemelli and S. Bunse, "Towards 100G packet
processing: Challenges and technologies," in Bell Labs Technical Journal, 2009.
[68] S. Lumetta and M. Mitzenmacher, "Using the Power of Two Choices to Improve Bloom
Filters," in Internet Mathematics, 2007.
[69] Z. Huang, J.-K. Peir and S. Chen, "Approximately-Perfect Hashing: Improving Network
Throughput through Efficient Off-chip Routing," in INFOCOM, 2011.
[70] Y. Azar, A. Broder, A. Karlin and E. Upfal, "Balanced Allocations," in ACM Symp. on
Theory of Computing (STOC), 1994.
[71] R. Sprugnoli, "Perfect hashing functions: a single probe retrieving method for static
sets," in Comm. ACM, 1977.
[72] R. Pagh and F. F. Rodler, "Cuckoo Hashing," in ESA, 2001.
[73] S. Dharmapurikar, P. Krishnamurthy and D. Taylor, "Longest Prefix Matching Using
Bloom Filters," in SIGCOMM, 2003.
[74] B. Chazelle, R. Kilian and A. Tal, "The Bloomier filter: an efficient data structure for
static support lookup tables," in ACM SIAM, 2004.
[75] J. Hasan, S. Cadambi, V. Jakkula and S. Chakradhar, "Chisel: A Storage-efficient,
Collision-free Hash-based Network Processing Architecture," in ISCA, 2006.
[76] M. L. Fredman and J. Komlos, "On the Size of Separating Systems and Families of
Perfect Hash Functions," in SIAM. J. on Algebraic and Discrete Methods, 1984.
[77] Z. Zhang et al., "A permutation-based page interleaving scheme to reduce row-
buffer conflicts and exploit data locality," in MICRO, 2000.
[78] D. Kaseridis, J. Stuecheli and L. John, "Minimalist Open-page: A DRAM Page-mode
Scheduling Policy for the Many-core Era," in MICRO, 2011.
[79] T. Moscibroda and O. Mutlu, "Memory Performance Attacks: Denial of Memory
Service in Multi-Core Systems," in USENIX, 2007.
[80] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh and O. Mutlu, "Staged memory
scheduling: achieving high performance and scalability in heterogeneous systems," in
ISCA, 2012.
[81] Y. Xu, A. Agarwal and B. Davis, "Prediction in Dynamic SDRAM Controller Policies," in
SAMOS, 2009.
[82] M. Awasthi, D. W. Nellans, R. Balasubramonian and A. Davis, "Prediction Based
DRAM Row-Buffer Management in the Many-Core Era," in PACT, 2011.
[83] M. Jeong, D. Yoon, D. Sunwoo, M. Sullivan, I. Lee and M. Erez, "Balancing DRAM
Locality and Parallelism in Shared Memory CMP Systems," in HPCA, 2012.
[84] M. Xie, D. Tong, Y. Feng, K. Huang and X. Cheng, "Page Policy Control with Memory
Partitioning for DRAM Performance and Power Efficiency," in ISLPED, 2013.
[85] X. Shen, F. Song, H. Meng et al., "RBPP: A Row Based DRAM Page Policy for the
Many-core Era," in ICPADS, 2014.
[86] A. Jaleel, "Memory Characterization of Workloads Using Instrumentation-Driven
Simulation," in VSSAD, 2007.
[87] P. Rosenfeld, E. Cooper-Balis and B. Jacob, "DRAMSim2: A Cycle Accurate Memory
System Simulator," in CAL, 2011.
[88] A. J. Smith, "Line (block) size choice for CPU cache memories," in IEEE Transactions
on Computers, 1987.
[89] "NAS Parallel Benchmarks," http://www.nas.nasa.gov/publications/npb.html.
[90] A. Brodnik and J. I. Munro, "Membership in Constant Time and Almost-Minimum
Space," in SIAM Journal on Computing, 1999.
[91] W. Starke et al., "The cache and memory subsystems of the IBM POWER8
processor," in IBM J. Res. & Dev., vol. 59, no. 1, 2015.
[92] J. Hennessy and D. Patterson, Computer Architecture: A Quantitative Approach,
Morgan Kaufmann, 2011.
[93] D. Kaseridis, J. Stuecheli and L. K. John, "Minimalist Open-page: A DRAM Page-mode
Scheduling Policy for the Many-core Era," in MICRO, 2011.
[94] T. Moscibroda and O. Mutlu, "Memory Performance Attacks: Denial of Memory Service
in Multi-Core Systems," in USENIX, 2007.
[95] R. Ausavarungnirun, K. Chang, L. Subramanian, G. H. Loh and O. Mutlu, "Staged
memory scheduling: achieving high performance and scalability in heterogeneous
systems," in ISCA, 2012.
[96] M. Hill, "A case for direct-mapped caches," in IEEE Computer, 1988.
BIOGRAPHICAL SKETCH
Xi Tao received his Ph.D. in computer engineering from the University of Florida in the
fall of 2016. He received his B.S. degree in Electronic Engineering and Information Science
from the University of Science and Technology of China in 2007. His research interests
include computer architecture, cache design, and Bloom filter applications.