Caches Master


Transcript of Caches Master

  • 7/30/2019 Caches Master

    1/51

    An example memory hierarchy

    Smaller, faster, and costlier (per byte) storage devices sit at the
    top; larger, slower, and cheaper (per byte) storage devices sit at
    the bottom:

    L0: registers
    L1: on-chip L1 cache (SRAM)
    L2: off-chip L2 cache (SRAM)
    L3: main memory (DRAM)
    L4: local secondary storage (local disks)
    L5: remote secondary storage (distributed file systems, Web servers)

    CPU registers hold words retrieved from cache memory.
    L1 cache holds cache lines retrieved from the L2 cache.
    L2 cache holds cache lines retrieved from memory.
    Main memory holds disk blocks retrieved from local disks.
    Local disks hold files retrieved from disks on remote network servers.

  • 2/51

    Principle of Locality

    Programs tend to re-use data and instructions near
    those they have used recently.

    Two types of locality:

    1. Temporal locality: items currently being used are
    likely to be reused in the near future.

    2. Spatial locality: items with nearby addresses tend
    to be referenced close together in time.

  • 3/51

    Programming to Maximize Benefits of
    Locality

    Are there things we can do in our code to
    allow us to benefit from the principle of
    locality?

    Programs that reference the same variables
    repeatedly have good temporal locality.

    Programs with small strides through array
    locations have good spatial locality.

    Programs with small loops with many
    iterations have good locality.

  • 4/51

    Locality Example

    int sumvec(int v[N])
    {
        int i, sum = 0;
        for (i = 0; i < N; i++)
            sum += v[i];
        return sum;
    }

    Data:
    Reference array elements in succession
    (stride-1 reference pattern): spatial locality.
    Reference sum each iteration: temporal locality.

    Instructions:
    Reference instructions in sequence: spatial locality.
    Cycle through the loop repeatedly: temporal locality.

  • 5/51

    Locality Example

    How is the locality of this program?

    Note the elements are accessed in the order
    they are stored in memory. This implies good locality.

  • 6/51

    Locality Example 2

    Claim: Being able to look at code and get a qualitative sense of its

    locality is a key skill for a professional programmer.

    Question: Does this function have good locality?
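The function shown on this slide did not survive transcription. A row-wise summation consistent with the surrounding discussion is sketched below; the name sum_array_rows and the 2x3 dimensions are assumptions for illustration, not taken from the slide.

```c
#include <assert.h>

#define M 2
#define N 3

/* Row-wise traversal: the inner loop walks across one row at a time,
   so successive references touch consecutive memory locations
   (a stride-1 pattern with good spatial locality). */
int sum_array_rows(int a[M][N])
{
    int i, j, sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}
```

Swapping the two loops gives the column-wise version discussed a few slides later.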

  • 7/51

    Locality Example 2

    How is the locality of this program?

    Recall that arrays in C/C++ are stored row-wise.
    This means that all of row 1 is stored contiguously, then
    row 2, etc.

    This leads to good locality.

  • 8/51

    Locality Example 3

    Question: Does this function have good locality?
    Assume a 2x3 array for ease.

    int sum_array_cols(int a[M][N])
    {
        int i, j, sum = 0;
        for (j = 0; j < N; j++)
            for (i = 0; i < M; i++)
                sum += a[i][j];
        return sum;
    }

  • 9/51

    Locality Example 3

    How is the locality of this program?

    Recall that arrays in C/C++ are stored row-wise,
    but we are accessing the array by columns.

    This leads to very poor locality.

  • 10/51

    Locality Example 4

    Question: Can you permute the loops so that the function scans the
    3-d array a[] with a stride-1 reference pattern (and thus has good
    spatial locality)?

    int sumarray3d(int a[N][N][N])
    {
        int i, j, k, sum = 0;
        for (i = 0; i < N; i++)
            for (j = 0; j < N; j++)
                for (k = 0; k < N; k++)
                    sum += a[k][i][j];
        return sum;
    }

  • 11/51

    Locality Example 4

    Recall: the best indexing for a 2-D array is to start at a row and
    count across all the columns in a[i][j],
    i.e. increment the j counter in the inner loop.

    So: for multi-dimensional arrays, simply run the loops
    in the order of the indices.

    int sumarray3d(int a[N][N][N])
    {
        int i, j, k, sum = 0;
        for (k = 0; k < N; k++)
            for (i = 0; i < N; i++)
                for (j = 0; j < N; j++)
                    sum += a[k][i][j];
        return sum;
    }

  • 12/51

    An example memory hierarchy

    Smaller, faster, and costlier (per byte) storage devices sit at the
    top; larger, slower, and cheaper (per byte) storage devices sit at
    the bottom:

    L0: registers
    L1: on-chip L1 cache (SRAM)
    L2: off-chip L2 cache (SRAM)
    L3: main memory (DRAM)
    L4: local secondary storage (local disks)
    L5: remote secondary storage (distributed file systems, Web servers)

    CPU registers hold words retrieved from cache memory.
    L1 cache holds cache lines retrieved from the L2 cache.
    L2 cache holds cache lines retrieved from memory.
    Main memory holds disk blocks retrieved from local disks.
    Local disks hold files retrieved from disks on remote network servers.

  • 13/51

    Definition of a Cache

    The word cache is used throughout the memory
    hierarchy diagram.

    What is a cache?
    A smaller, faster storage device that acts as a
    staging area for a subset of the data in a larger,
    slower device.

    Why would a cache help us at all?
    It relies on the Principle of Locality.

  • 14/51

    Caching in the Memory Hierarchy

    The larger, slower, cheaper storage device at level k+1 is
    partitioned into blocks (blocks 0 through 15 in the figure).

    The smaller, faster, more expensive memory at level k caches a
    subset of those blocks (blocks 4, 9, 14, and 3 in the figure).

    Data is copied between levels in block-sized transfer units.

    It works because programs tend to access data at level k more often
    than at level k+1.

  • 15/51

    Caching in the Memory
    Hierarchy II

    Goal: the memory hierarchy provides a large pool
    of memory that:

    Costs as much as the cheap storage near the
    bottom (well, only a little bit more), but

    serves data to programs at the rate of the fast
    storage near the top.

  • 16/51

    General caching concepts I

    The program needs object d, which is
    stored in block 14.

    Block 14 is located in the level k cache.

    This is called a cache hit.

  • 17/51

    General caching concepts II

    The program needs object b, which is stored in
    block 12.

    b is not at level k, so the level k cache must
    fetch it from level k+1. This is called a cache
    miss.

    The simplest case is if there is a spot for block
    12 in level k.

    What happens if level k is full?

  • 18/51

    General caching concepts III

    The program needs object b, which
    is stored in block 12.

    b is not at level k, so the level k
    cache must fetch it from level
    k+1.

    If the level k cache is full, some current
    block must be replaced (evicted).
    Which one is the victim?

    Placement policy:
    where can the new block
    go? E.g., b mod 4.

    Replacement policy:
    which block should be
    evicted? E.g., LRU.

  • 19/51

    General Caching Concepts IV

    Block
    A block is the smallest amount of memory that is
    stored in a cache.
    The size of the block can be different for each stage
    in the memory hierarchy.
    We rely on what kind of locality to ensure the
    additional data in the block is needed?

    Transfer Unit
    This is the amount of data that we transfer into
    and out of a cache when we need new data.
    A transfer unit is the same size as the block for that
    stage.

  • 20/51

    General Caching Concepts V

    Cache hit
    The program finds b in the cache at level k.

    Cache miss
    b is not at level k, so the level k cache must fetch it from
    level k+1, e.g. the next lower level in the hierarchy.

    Placement policy: where can the new block go?
    If the level k cache is full, then some current block must be
    replaced.
    Replacement policy: which block should be evicted?

  • 21/51

    General Caching Concepts -
    Cache Misses

    Recall a cache miss is when the data was
    not in the cache.

    What can cause a block not to be in the cache
    at a given time? There are three kinds of misses:

    1. Cold (compulsory) miss
    2. Conflict miss
    3. Capacity miss

  • 22/51

    General Caching Concepts -
    Cache Misses II

    Cold (compulsory) miss
    Cold misses occur because the cache is empty.
    This happens when the machine starts up.

    Capacity miss
    Occurs when the set of active cache blocks (the working
    set) is larger than the cache.
    There is no longer enough room in the cache for all the data,
    so some block needs to be discarded.

  • 23/51

    General Caching Concepts -
    Cache Misses III

    Conflict miss
    Most caches limit blocks at level k+1 to a small subset
    (sometimes a singleton) of positions at level k.
    E.g. block i at level k+1 must be placed in block (i mod 4) at
    level k.

    Conflict misses occur when multiple data objects
    all map to the same level k block.
    E.g. referencing blocks 0, 8, 0, 8, 0, 8, ... would miss
    every time in a cache with 4 blocks where we store at
    location i mod 4.
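The 0, 8, 0, 8, ... pattern can be checked with a short simulation. This is an illustrative sketch, not code from the slides: the cache is modeled as 4 single-line slots indexed by block number mod 4, and the function name and parameters are assumptions.

```c
#include <assert.h>

#define NBLOCKS 4   /* cache holds 4 blocks; block i goes to slot i mod 4 */

/* Count misses for a sequence of block references in a tiny
   direct-mapped cache. Returns the miss count. */
int count_misses(const int *refs, int n)
{
    int slot_tag[NBLOCKS];
    int valid[NBLOCKS] = {0};
    int misses = 0;
    for (int i = 0; i < n; i++) {
        int slot = refs[i] % NBLOCKS;
        if (!valid[slot] || slot_tag[slot] != refs[i]) {
            misses++;               /* cold or conflict miss */
            valid[slot] = 1;
            slot_tag[slot] = refs[i];
        }
    }
    return misses;
}
```

Blocks 0 and 8 both map to slot 0, so they evict each other on every reference, while a pattern touching blocks that map to different slots misses only once per block.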

  • 24/51

    General Org of a Cache Memory

    The cache contains sets of memory words called blocks.
    These blocks usually contain a few words of memory.

    Each block is part of a line in the cache.
    The line contains identifying information on what is actually in the
    block and a status bit to say whether the data in the block is still
    valid.

    These lines are then grouped into sets.
    There may be 1 or more lines of cache data per set.
    The exact number of lines per set is determined by the type of
    cache (more on that in a few slides).

  • 25/51

    General Org of a Cache Memory

  • 26/51

    Four Key Questions for
    Designing a Cache

    1. Block Identification - How is a block found if it
    is in the upper level?

    2. Block Placement - Where can a block be placed
    in the upper level?

    3. Block Replacement - Which block should be
    replaced when there is a cache miss?

    4. Write Strategy - What happens on a memory
    write?

  • 27/51

    Q1: How is a block found if it is in the
    cache?

    Blocks are maintained based on their block address
    (assume it is a physical address in memory).

    The address is divided into two parts:
    the block address, and
    the block offset (the LSBs that correspond to the
    location of the desired data within the block).

    The block address is further divided into two:
    1. the set index tells which set
    2. the tag provides the remaining block address bits

  • 28/51

    Decoding the Address for a
    Cached Word

    The following stages are needed to find a word in a cache.
    Different parts of the actual word address provide these values.

    1. Read the set ID to see which set the word may be in.
    Called set selection.

    2. Read the tag to see if the word we want is actually in
    this line of the cache.
    The tag is a type of hash that provides the word address (at
    least part of it).
    Called line matching.

  • 29/51

    Decoding the Address for a
    Cached Word II

    3. Read the valid bit.
    - Only want to read the desired word if the cache line is valid.

    4. Read the block offset value to know which word in the line is
    the word we want.
    - Called word extraction.
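Steps 1-4 amount to slicing bit fields out of the address. A minimal sketch of that decomposition follows; the function name and the particular field widths are illustrative choices, not from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Split an address into (tag, set index, block offset), assuming a
   block size of 2^b bytes and 2^s sets:
   offset = low b bits, set = next s bits, tag = all remaining bits. */
typedef struct { uint32_t tag, set, offset; } addr_parts;

addr_parts decode(uint32_t addr, unsigned b, unsigned s)
{
    addr_parts p;
    p.offset = addr & ((1u << b) - 1);        /* word within the block */
    p.set    = (addr >> b) & ((1u << s) - 1); /* which set to look in  */
    p.tag    = addr >> (b + s);               /* compared during line matching */
    return p;
}
```

For example, with 16-byte blocks (b = 4) and 8 sets (s = 3), the address 0x1234 decodes to offset 0x4, set 3, tag 0x24.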

  • 30/51

    Decoding the Address for a

    Cached Word

  • 31/51

    Order of the Bits Used in the
    Address

    Remember that the bit fields in the actual word
    address are used to access the cache.
    A simple hash function uses these fields to access
    the lines and individual words in the cache.

    Generally: a hash (hash value) is a non-unique
    address used to quickly access a table.
    It is not unique because hash tables can only store
    small subsets of a larger range of data values.
    The pigeonhole principle is in effect.

    The order of the bit field selection is critical to
    effective performance.

  • 32/51

    Order of the Bits Used in the
    Address II

    First, the middle bits are used for set selection.

    Why does this make sense?
    It's undesirable for nearby words in memory to map to the same line
    in the cache.
    It will cause the cache to replace possibly useful data too often.

  • 33/51

    Order of the Bits Used in the
    Address III

    It would go against our knowledge of spatial
    locality.

    Here we want consecutive words in different lines of
    the cache.

    This way, when the program accesses the next
    data word in memory, the cache won't need to
    discard the last one. Instead, the next cache
    line is accessed, so the cache keeps them both.

  • 34/51

    Why Use the Middle Bits as an
    Index?

    High-Order Bit Indexing
    Adjacent memory lines would map to the same cache
    entry: poor use of spatial locality.

    Middle-Order Bit Indexing
    Consecutive memory lines map to
    different cache lines.

    (Least significant bits are not
    shown here.)
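The contrast between the two indexing schemes can be made concrete. In this sketch (function names and parameter choices are illustrative), four consecutive 16-byte blocks land in four different sets under middle-bit indexing, but all in the same set under high-order-bit indexing:

```c
#include <assert.h>
#include <stdint.h>

/* Set index for 2^s sets with 2^b-byte blocks.
   Middle-bit indexing takes the s bits just above the block offset;
   high-bit indexing takes the top s bits of a w-bit address. */
uint32_t set_middle(uint32_t addr, unsigned b, unsigned s)
{
    return (addr >> b) & ((1u << s) - 1);
}

uint32_t set_high(uint32_t addr, unsigned w, unsigned s)
{
    return addr >> (w - s);
}
```

With b = 4, s = 2, and 16-bit addresses, the blocks at 0x0000, 0x0010, 0x0020, 0x0030 get middle-bit set indices 0, 1, 2, 3 (no conflicts), but high-bit set index 0 for all four.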

  • 35/51

    Order of the Bits Used in the
    Address IV

    Second, the top (most significant) bits are used for line
    matching (tag matching).

    Why does this make sense?
    If the process got this far, then the middle order bits
    already match.
    Again, this is using knowledge of spatial locality.

  • 36/51

    Order of the Bits Used in the
    Address V

    If the set index bits match, then the higher order bits need to
    be checked.

    These are only the same if the address is from the same
    higher order bit region of memory.

    With locality and the limited cache size, we know that the
    tag will indicate whether we are dealing with the same
    words (the same block).

  • 37/51

    Order of the Bits Used in the
    Address VI

    Third (finally), the low order bits are used for
    word extraction.

    Why does this make sense?
    These are the most unique identifiers for a word in memory.
    Another word with the same low order bits will be located far
    away in memory.
    Therefore, according to the principle of locality, it is unlikely
    that another word in memory with the same LSBs will be
    needed near the same time.

  • 38/51

    Four Key Questions for
    Designing a Cache

    1. Block Identification - How is a block found if it
    is in the upper level?

    2. Block Placement - Where can a block be placed
    in the upper level?

    3. Block Replacement - Which block should be
    replaced when there is a cache miss?

    4. Write Strategy - What happens on a memory
    write?

  • 39/51

    Q2: Where Can a Block be
    Placed in a Cache?

    There are three approaches to block placement:

    1. Direct Mapped
    The block can only be placed in a specific location in the
    cache.

    2. Set-Associative
    A set is a group of blocks in the cache.
    The incoming block can only be placed in a specific set of
    blocks.
    The block can be placed anywhere within this set.

    3. Fully Associative
    The block can be placed anywhere in the cache.

  • 40/51

    Mapping Memory Blocks into the
    Cache

    This is really a continuum of associative
    properties:

    Direct mapped can be considered 1-way set
    associative (the block has only 1 allowed
    destination).

    For N-way set associative, the block has N
    places it can be placed.

    Fully associative can be considered m-way
    set associative (where m is the number of
    blocks), since the block can go anywhere.

  • 41/51

    Mapping Memory Blocks into the

    Cache II

  • 42/51

    Direct-Mapped Cache

    The simplest kind of cache.
    Characterized by exactly one line per set.

    Therefore, the tag and set index are
    somewhat redundant.
    The tag is insurance that the wrong word is not
    used.

  • 43/51

    Accessing Direct-Mapped
    Caches I

    Step 1: Set selection
    The set index bits in the address identify the set.

  • 44/51

    Accessing Direct-Mapped
    Caches II

    Step 2: Line matching
    Trivial, since there is only one line in this cache set.

  • 45/51

    Accessing Direct-Mapped
    Caches III

    Step 3: Word selection
    Use the block offset to grab the right word.

  • 46/51

    Some Words About Book
    Conventions

    First - most direct mapped caches do not even talk
    about a set index.
    There is only 1 line per set, so it doesn't make much sense.
    They may do it to make the later caches easier to understand,
    but most systems don't use the set index with direct mapped
    organizations.

    Second - the last figure talked about w0 being the lower
    order byte of word w, w1 is the next byte, etc.
    It is easy in this convention to confuse words and bytes.
    It should have used w0 and w1 to talk about the words in the
    cache and then used b00 and b10 to talk about the low order
    bytes of words 0 and 1 in the cache line.

  • 47/51

    Direct-Mapped Cache Simulation
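The simulation figure on this slide did not survive transcription. A minimal sketch in its spirit follows, assuming 4 sets, one line per set, and 2-byte blocks, so a 4-bit address splits into 1 tag bit, 2 set bits, and 1 offset bit (all parameters chosen here for illustration).

```c
#include <stdint.h>
#include <string.h>

#define SETS 4

static uint32_t tags[SETS];   /* tag stored in each set's single line */
static int      valid[SETS];  /* valid bit per line */

void cache_reset(void)
{
    memset(valid, 0, sizeof valid);
}

/* Simulate one access; returns 1 on a hit, 0 on a miss
   (a miss loads the block into its set). */
int access_addr(uint32_t addr)
{
    uint32_t set = (addr >> 1) & (SETS - 1);  /* middle bits select the set */
    uint32_t tag = addr >> 3;                 /* remaining high bit(s)      */
    if (valid[set] && tags[set] == tag)
        return 1;                             /* hit */
    valid[set] = 1;                           /* miss: load the block */
    tags[set] = tag;
    return 0;
}
```

Tracing addresses 0, 1, 8, 0 gives a cold miss, a hit within the same block, and then two conflict misses as blocks 0 and 8 fight over set 0.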

  • 48/51

    Recall: Cache Misses

    A cache miss is when the data was not in
    the cache.

    What can cause a block not to be in the cache
    at a given time? There are three kinds of misses:

    Cold (compulsory) miss
    Conflict miss
    Capacity miss

    What kinds of cache misses did we just
    see?

  • 49/51

    Direct-Mapped Cache Simulation
    Results

  • 50/51

    Direct Mapped Cache
    Observations

    Direct mapped caches are easy to implement.
    Finding the line and the set are one step, since there is only one
    line per set.

    What are the problems?
    The simple example showed that it is already experiencing
    misses due to conflicts.
    The conflicts occurred because there is only one line per set.
    What was needed: a second line for each set index to allow both
    desired values to be placed.

    This leads to set associative caches.
    It will also introduce the need for replacement strategies.
    Up to now, the replacement was only dictated by the address.

  • 51/51

    Set Associative Caches

    Characterized by more than one line perset