ECE200 – Computer Organization
Transcript of ECE200 – Computer Organization
Chapter 7 – Large and Fast: Exploiting
Memory Hierarchy
Homework 7
7.2, 7.3, 7.5, 7.6, 7.7, 7.11, 7.15, 7.20, 7.21, 7.27, 7.32
Outline for Chapter 7 lectures
Motivation for, and concept of, memory hierarchies
Caches
Main memory
Characterizing memory hierarchy performance
Virtual memory
Real memory hierarchies
The memory dilemma
Ch 6 assumption: on-chip instruction and data memories hold the entire program and its data and can be accessed in one cycle
Reality check
In high performance machines, programs may require 100's of megabytes or even gigabytes of memory to run
Embedded processors have smaller needs, but there is also less room for on-chip memory
Basic problem
We need much more memory than can fit on the microprocessor chip
But we do not want to incur stall cycles every time the pipeline accesses instructions or data
At the same time, we need the memory to be economical for the machine to be competitive in the market
Solution: a hierarchy of memories
Another view
Typical characteristics of each level
First level (L1) is separate on-chip instruction and data caches placed where our instruction and data memories reside
16-64KB for each cache (desktop/server machine)
Fast, power-hungry, not-so-dense static RAM (SRAM)
Second level (L2) consists of another, larger unified cache that holds both instructions and data
256KB-4MB
On- or off-chip
SRAM
Third level is main memory
64-512MB
Slower, lower-power, denser dynamic RAM (DRAM)
Final level is I/O (e.g., disk)
Caches and the pipeline
L1 instruction and data caches and L2 cache
Memory hierarchy operation
(1) Search L1 for the instruction or data. If found (cache hit), done
(2) Else (cache miss), search the L2 cache. If found, place it in L1 and repeat (1)
(3) Else, search main memory. If found, place it in L2 and repeat (2)
(4) Else, get it from I/O (Chapter 8)
Steps (1)-(3) are performed in hardware
1-3 cycles to get from the L1 caches
5-20 cycles to get from the L2 cache
50-200 cycles to get from main memory
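The lookup order in steps (1)-(4) can be sketched in a few lines. This is a minimal model, not a hardware description: each level is a plain dict, and the latencies are illustrative values taken from the middle of the cycle ranges quoted above.

```python
# Illustrative latencies (cycles) from the ranges above: L1 1-3, L2 5-20, MM 50-200
LATENCY = {"L1": 2, "L2": 10, "MM": 100}

def access(addr, l1, l2, mm):
    """Return (data, cycles). On a miss, the block is copied into the
    faster levels on the way back up, as in steps (1)-(3)."""
    cycles = LATENCY["L1"]
    if addr in l1:                      # (1) hit in L1: done
        return l1[addr], cycles
    cycles += LATENCY["L2"]
    if addr in l2:                      # (2) hit in L2: place in L1
        l1[addr] = l2[addr]
        return l1[addr], cycles
    cycles += LATENCY["MM"]
    if addr in mm:                      # (3) hit in MM: place in L2, then L1
        l2[addr] = mm[addr]
        l1[addr] = mm[addr]
        return l1[addr], cycles
    raise LookupError("get it from I/O (step 4, Chapter 8)")

l1, l2, mm = {}, {}, {0x40: "lw $t0,0($s1)"}
data, t1 = access(0x40, l1, l2, mm)   # misses all the way to main memory
data, t2 = access(0x40, l1, l2, mm)   # now an L1 hit
```

After the first access, the block is resident in both L1 and L2, so the second access costs only the L1 latency.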
Principle of locality
Programs access a small portion of memory within a short time period
Temporal locality: recently accessed memory locations will likely be accessed soon
Spatial locality: memory locations near recently accessed locations will likely be accessed soon
POL makes memory hierarchies work
A large percentage of the time (typically >90%) the instruction or data is found in L1, the fastest memory
Cheap, abundant main memory is accessed more rarely
Memory hierarchy operates at nearly the speed of expensive on-chip SRAM with about the cost of main memory (DRAMs)
Caches
Caches are small, fast, memories that hold recently accessed instructions and/or data
Separate L1 instruction and L1 data caches
Need simultaneous access of instructions and data in pipelines
L2 cache holds both instructions and data
Simultaneous access not as critical since >90% of the instructions and data will be found in L1
The PC or effective address from L1 is sent to L2 to search for the instruction or data
How caches exploit the POL
On a cache miss, a block of several instructions or data, including the requested item, are returned
The entire block is placed into the cache so that future searches for items in the block will be successful
[Figure: a block of four instructions, instruction_i through instruction_i+3; the requested instruction is returned together with its neighbors in the block]
How caches exploit the POL
Consider sequence of instructions and data accesses in this loop with a block size of 4 words
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
The cache is much smaller than main memory
Multiple memory blocks must share the same cache location
Searching the cache
Need a way to determine whether the desired instruction or data is held in the cache
Need a scheme for replacing blocks when a new block needs to be brought in on a miss
Searching the cache
Cache organization alternatives
Direct mapped: each block can be placed in only one cache location
Set associative: each block can be placed in any of n cache locations
Fully associative: each block can be placed in any cache location
Cache organization alternatives
Searching for block 12 in caches of size 8 blocks
Searching a direct mapped cache
Need log2(number of sets) of the address bits (the index) to select the block location
The block offset bits select the desired byte, half-word, or word within the block
The remaining bits (the tag) are used to determine if this is the desired block or another that shares the same cache location
Memory address fields: tag | index | block offset
Example: assume a data cache with 16 byte blocks and 8 sets
4 block offset bits, 3 index bits, 25 tag bits
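The field widths in the example can be checked with a short sketch. This assumes a 32-bit address; the constants mirror the 16-byte-block, 8-set cache above.

```python
# Direct mapped example: 16B blocks -> 4 offset bits, 8 sets -> 3 index bits,
# leaving 32 - 4 - 3 = 25 tag bits.
BLOCK_BITS = 4   # log2(16B block)
INDEX_BITS = 3   # log2(8 sets)

def split_address(addr):
    """Split a 32-bit address into (tag, index, block offset)."""
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
```

The index selects one of the 8 sets, and the 25-bit tag stored there is compared with the tag field of the address to decide hit or miss.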
Searching a direct mapped cache
The block is placed in the set selected by the index field
Number of sets = cache size / block size
Direct mapped cache organization
64KB instruction cache with 16 byte (4 word) blocks
4K sets (64KB/16B), so 12 index bits are needed to select the set
Direct mapped cache organization
The data section of the cache holds the instructions
Direct mapped cache organization
The tag section holds the part of the memory address used to distinguish different blocks
Direct mapped cache organization
A valid bit associated with each set indicates if the instructions are valid or not
Direct mapped cache access
The index bits are used to select one of the sets
Direct mapped cache access
The data, tag, and Valid bit from the selected set are simultaneously accessed
Direct mapped cache access
The tag from the selected entry is compared with the tag field of the address
Direct mapped cache access
A match between the tags and a Valid bit that is set indicates a cache hit
Direct mapped cache access
The block offset selects the desired instruction
Set associative cache
The block is placed in one way of the set selected by the index
Number of sets = cache size / (block size x ways)
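The set count formula and the parallel tag search over the ways can be sketched as follows. The geometry numbers are illustrative (an assumed 8KB two-way cache), and the sequential loop stands in for what hardware does in parallel.

```python
def num_sets(cache_size, block_size, ways):
    """Number of sets = cache size / (block size x ways)."""
    return cache_size // (block_size * ways)

def lookup(sets, set_index, tag):
    """Each set is a list of (valid, tag) pairs, one per way. Hardware
    compares all ways of the selected set at once; we loop instead."""
    for way, (valid, stored_tag) in enumerate(sets[set_index]):
        if valid and stored_tag == tag:
            return way          # cache hit in this way
    return None                 # cache miss

# Direct mapped example from earlier: 64KB cache, 16B blocks, 1 way
assert num_sets(64 * 1024, 16, 1) == 4096

# Hypothetical 8KB two-way cache with 16B blocks -> 256 sets
sets = [[(1, 0x12), (0, 0x00)] for _ in range(num_sets(8 * 1024, 16, 2))]
hit_way = lookup(sets, 5, 0x12)     # hit in way 0
miss = lookup(sets, 5, 0x99)        # miss: no way matches
```

Note how doubling the ways halves the number of sets for the same cache size, trading index bits for extra tag comparisons.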
Set associative cache operation
The index bits are used to select one of the sets
Set associative cache operation
The data, tag, and Valid bit from all ways of the selected entry are simultaneously accessed
Set associative cache operation
The tags from all ways of the selected entry are compared with the tag field of the address
Set associative cache operation
A match between the tags and a Valid bit that is set indicates a cache hit (hit in way1 shown)
Set associative cache operation
The data from the way that had a hit is returned through the MUX
Fully associative cache
A block can be placed in any cache location
[Figure: fully associative cache; the tag of every entry is compared in parallel against the address, and the data from the matching valid entry is returned through a MUX]
Different degrees of associativity
Four different caches of size 8 blocks
Cache misses
A cache miss occurs when the block is not found in the cache
The block is requested from the next level of the hierarchy
When the block returns, it is loaded into the cache and provided to the requester
A copy of the block remains in the lower levels of the hierarchy
The cache miss rate is found by dividing the total number of misses by the total number of accesses (misses/accesses)
The hit rate is 1 - miss rate
Classifying cache misses
Compulsory misses
Caused by the first access to a block that has never been in the cache
Capacity misses
Due to the cache not being big enough to hold all the blocks that are needed
Conflict misses
Due to multiple blocks competing for the same set
A fully associative cache with a "perfect" replacement policy has no conflict misses
Cache miss classification examples
Direct mapped cache of size two blocks
Blocks A and B map to set 0, C and D to set 1
Access pattern 1: A, B, C, D, A, B, C, D
Access pattern 2: A, A, B, A
Pattern 1: all eight accesses miss. The first four (A, B, C, D) are compulsory misses; on the repeats, A and B keep evicting each other from set 0 and C and D from set 1. A fully associative cache of the same size would also miss on these repeats, so they are capacity misses
Pattern 2: the first A and B are compulsory misses and the second A is a hit. The final A is a conflict miss: B evicted A from set 0 even though set 1 was unused
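The two access patterns can be replayed on a tiny simulator of this cache. This is a sketch of the example only: two sets, one block each, with A and B mapping to set 0 and C and D to set 1 as stated above.

```python
# Which set each block maps to in the two-block direct mapped cache
SET_OF = {"A": 0, "B": 0, "C": 1, "D": 1}

def run(pattern):
    """Replay an access pattern and count hits and misses."""
    cache = [None, None]          # one block per set
    hits = misses = 0
    for block in pattern:
        s = SET_OF[block]
        if cache[s] == block:
            hits += 1
        else:
            misses += 1
            cache[s] = block      # evict whatever shared the set
    return hits, misses

p1 = run(list("ABCDABCD"))   # (0 hits, 8 misses)
p2 = run(list("AABA"))       # (1 hit, 3 misses)
```

Running pattern 1 confirms that every access misses, and pattern 2 shows the single hit on the second A.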
Reducing capacity misses
Increase the cache size
More cache blocks can be simultaneously held in the cache
Drawback: increased access time
Reducing compulsory misses
Increase the block size
Each miss results in more words being loaded into the cache
Block size should only be increased to a certain point! As block size is increased:
Fewer cache sets (increased contention)
A larger percentage of each block may never be referenced
Reducing conflict misses
Increase the associativity
More locations in which a block can be held
Drawback: increased access time
Cache miss rates for SPEC92
Block replacement policy
Determines what block to replace on a cache miss to make room for the new block
Least recently used (LRU)
Pick the block that has been unused for the longest time
Based on temporal locality
Requires ordering bits to be kept with each set
Too expensive beyond 4-way
Random
Pseudo-randomly pick a block
Generally not as effective as LRU (higher miss rates)
Simple even for highly associative organizations
Most recently used (MRU)
Keep track of which block was accessed last
Randomly pick a block from among the others
Compromise between LRU and random
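LRU bookkeeping for one set can be sketched with a list ordered from least to most recently used, standing in for the hardware ordering bits mentioned above. The block names and two-way geometry are illustrative.

```python
def access_set(lru_order, block, ways):
    """Return 'hit' or 'miss' and update the recency order for this set.
    lru_order runs from least recently used (front) to most (back)."""
    if block in lru_order:
        lru_order.remove(block)       # hit: move to most-recently-used
        lru_order.append(block)
        return "hit"
    if len(lru_order) == ways:        # miss with a full set: evict the LRU block
        lru_order.pop(0)
    lru_order.append(block)
    return "miss"

# Two-way set: A miss, B miss, A hit, C miss (evicts B), B miss (evicts A)
order = []
results = [access_set(order, b, 2) for b in ["A", "B", "A", "C", "B"]]
```

The final state of `order` shows why ordering bits are needed: the eviction choice depends on the full access history of the set, which is why true LRU gets expensive beyond 4-way.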
Cache writes
The L1 data cache needs to handle writes (stores) in addition to reads (loads)
Need to check for a hit before writing
Don't want to write over another block on a miss
Requires a two cycle operation (tag check followed by write)
Write back cache
Check for a hit
If a hit, write the byte, halfword, or word to the correct location in the block
If a miss, request the block from the next level in the hierarchy, load the block into the cache, and then perform the write
Cache writes and block replacement
With a write back cache, when a block is written to, copies of the block in the lower levels are not updated
If this block is chosen for replacement on a miss, we need to save it to the next level
Solution:
A dirty bit is associated with each cache block
The dirty bit is set if the block is written to
A block with a set dirty bit that is chosen for replacement is written to the next level before being overwritten with the new block
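The dirty-bit protocol can be sketched as follows. The block representation (a dict with tag, data, and dirty fields) and the `next_level` dict are stand-ins, not a hardware description.

```python
def write(block, data):
    """A store hits this block: update the data and set the dirty bit."""
    block["data"] = data
    block["dirty"] = True

def replace_block(old, new_tag, new_data, next_level):
    """Evict `old` to make room. Only a dirty block must be written to
    the next level first; a clean one can simply be overwritten."""
    if old is not None and old["dirty"]:
        next_level[old["tag"]] = old["data"]   # write back the dirty victim
    return {"tag": new_tag, "data": new_data, "dirty": False}

l2 = {}
entry = {"tag": 0xA, "data": 1, "dirty": False}
write(entry, 2)                              # dirty bit now set
entry = replace_block(entry, 0xB, 9, l2)     # victim written to L2 first
```

After the replacement, L2 holds the up-to-date copy of block 0xA and the new block 0xB enters the cache clean.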
Cache writes and block replacement
[Figure: (1) a cache miss occurs in L1D; (2) the dirty victim block is written to L2; (3) a read request is sent to L2; (4) the requested block is loaded into the L1D cache]
Main memory systems
The main memory (MM) lies in the memory hierarchy between the cache hierarchy and I/O
MM is too large to fit on the microprocessor die
The MM controller is on the microprocessor die or a separate chip
DRAMs lie on Dual In-line Memory Modules (DIMMs) on the printed circuit board
~10-20 DRAMs per DIMM
Interconnect width is typically ½ to ¼ of the cache block
The cache block is returned to L2 in multiple transfers
[Figure: the microprocessor contains the CPU, L1 instruction and data caches, and L2 cache; the main memory control connects over the interconnect to the DIMMs and to input/output devices such as disks]
Main memory systems
DRAM characteristics
Storage cells lie in a grid of rows and columns
The row decoder selects the row
Column latches latch the data from the selected row
The column MUX selects a subset of the data to be output
Access time ~40ns from row address to data out
Main memory systems
Synchronous DRAMs (SDRAMs)
Exploit the fact that the column latch holds a full row of data although only a subset is read out
The address is accompanied by a burst length input
The first data item is accessed from the RAM array; successive data in the column latch can be read out at the DRAM clock speed
Example: 100MHz, 50ns SDRAM: the first access takes 50ns, and each successive item in the burst takes 10ns (one clock)
Improving MM performance
The latency to get the first data out of the MM is largely determined by
The speed of the DRAMs
The speed of the interconnect between L2 and MM
Focus is on reducing the time to return the entire cache block to L2
Baseline MM design assumptions
1 cycle to transfer the address
15 cycles for each DRAM access
1 cycle to transfer each data chunk
Memory and interconnect (bus) widths are ¼ that of the block
Cache miss penalty = 1 + 4 x (15 + 1) = 65 cycles
Wider MMs
Increase MM and bus widths
Downsides
Hardware cost of a wider bus
2X the number of DRAMs is needed (unless wider DRAMs are used)
Cache miss penalty = 1 + 2 x (15 + 1) = 33 cycles for 2X width
Interleaved memory
Divide the memory into banks (may be a DIMM)
Each bank handles a fraction of the block on each access
Bank accesses are started a cycle apart
Cache miss penalty = 1 + 15 + 4 x 1 = 20 cycles
Using the burst capability of SDRAMs
Assume first access takes 15 cycles, and 2 cycles thereafter to get the rest of the block
Cache miss penalty = 1 + 15 + 3 x 2 + 4 x 1 = 26 cycles
Very typical in commercial MM systems to interleave banks of DIMMs containing SDRAMs
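The four miss penalty calculations above can be collected in one sketch. All parameters come from the baseline assumptions (1 cycle for the address, 15 cycles per DRAM access, 1 cycle per data transfer, 4 chunks per block).

```python
# Baseline main memory design assumptions
ADDR, ACCESS, XFER, CHUNKS = 1, 15, 1, 4

# Each chunk is accessed and transferred serially
baseline    = ADDR + CHUNKS * (ACCESS + XFER)          # 65 cycles

# 2X wider memory and bus: half as many chunks
wider_2x    = ADDR + (CHUNKS // 2) * (ACCESS + XFER)   # 33 cycles

# Interleaved banks: accesses overlap, only transfers serialize
interleaved = ADDR + ACCESS + CHUNKS * XFER            # 20 cycles

# SDRAM burst: 15 cycles for the first access, 2 per remaining chunk
sdram_burst = ADDR + ACCESS + 3 * 2 + CHUNKS * XFER    # 26 cycles
```

Laying the formulas side by side makes the trend clear: each technique removes serialized DRAM accesses from the critical path, which is where most of the 65-cycle baseline penalty comes from.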
Memory hierarchy performance
CPU time = (CPU execution cycles + memory stall cycles) x CT = INST x (CPIexecution + CPImisses) x CT
CPU execution cycles excludes cache miss cycles
Memory stall cycles are due to L1 cache misses
Cycle time may be impacted by the memory hierarchy if the L1 cache is a critical timing path
Example
Machine characteristics
CPI_execution = 2.0
L1 Icache miss rate = 0.02
L1 Dcache miss rate = 0.05
L2 access time = 10 cycles
L2 miss rate = 0.2
MM access time = 30 cycles
Loads and stores are 30% of all instructions
CPI_misses = CPI_L1Imisses + CPI_L1Dmisses
CPI_L1Imisses = 0.02 x (10 + 0.2 x 30) = 0.32
CPI_L1Dmisses = 0.3 x 0.05 x (10 + 0.2 x 30) = 0.24
CPU time = INST x (2.0 + 0.32 + 0.24) x cycle time = INST x 2.56 x cycle time
28% increase in CPU time because of cache misses
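The same calculation in executable form; every number is taken from the stated machine characteristics.

```python
# Machine characteristics from the example
cpi_exec = 2.0
i_miss_rate, d_miss_rate = 0.02, 0.05
l2_time, l2_miss_rate, mm_time = 10, 0.2, 30
load_store_frac = 0.30

# Average cost of an L1 miss: L2 access plus the fraction that misses to MM
avg_l1_miss_cost = l2_time + l2_miss_rate * mm_time    # 16 cycles

cpi_imiss = i_miss_rate * avg_l1_miss_cost                    # 0.32
cpi_dmiss = load_store_frac * d_miss_rate * avg_l1_miss_cost  # 0.24
cpi_total = cpi_exec + cpi_imiss + cpi_dmiss                  # 2.56

slowdown = cpi_total / cpi_exec - 1    # 28% increase from cache misses
```

Note that every instruction fetch can miss in the Icache, but only the 30% of instructions that are loads or stores can miss in the Dcache, which is why the 0.3 factor appears in the Dcache term only.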
Virtual memory
A single program may require more main memory than is present in the machine
Multiple programs must share main memory without interfering with each other
With virtual memory, portions of different programs are loaded from I/O to memory on demand
When memory gets full, portions of programs are swapped out to I/O
Implementing virtual memory
Separate different programs in memory by assigning different memory addresses to each
Identify when the desired instructions or data are in memory or not
Generate an interrupt when they are not in memory and must be retrieved from I/O
Provide support in the OS to retrieve the desired instructions or data, replacing others if necessary
Prevent users from accessing instructions or data they do not own
Memory pages
Transfers between disk and memory occur in pages whose size is defined in the ISA
Page size is large to amortize the high cost of disk access (~5-10ms)
Tradeoffs in increasing page size are similar as for cache block size
[Figure: pages (4-64KB) are transferred between disk and main memory; cache blocks (16-128B) are transferred between main memory and the L2 cache]
Virtual and physical addresses
The virtual addresses in your program are translated during program execution into the physical addresses used to address memory
Some virtual addresses may refer to data that is not in memory (not memory resident)
Address translation
The virtual page number (vpn) part of the virtual address is translated into a physical page number (ppn) that points to the desired page
The low-order page offset bits point to which byte is being accessed within a page
The ppn + page offset form the physical address
Example: 1GB of main memory in the machine, 4GB of virtual address space, 2^12 = 4KB pages
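The vpn/ppn split and the translation itself can be sketched directly from these numbers. The `page_table` dict is a stand-in for the real page table, and the resident page chosen is hypothetical.

```python
# 4KB pages -> 12 page offset bits, as in the example above
PAGE_OFFSET_BITS = 12

def translate(vaddr, page_table):
    """Translate a virtual address to a physical address:
    physical address = ppn concatenated with the unchanged page offset."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn not in page_table:
        raise LookupError("page fault: page is not memory resident")
    ppn = page_table[vpn]
    return (ppn << PAGE_OFFSET_BITS) | offset

page_table = {0x12345: 0x00042}          # hypothetical resident page
paddr = translate(0x12345ABC, page_table)
```

Only the page number changes in translation; the low 12 offset bits pass through untouched, which is what later makes it possible to start a cache access in parallel with the TLB lookup.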
Address translation
Another view of the physical address: the ppn gives the location of the first byte of the page in memory, each page holds 2^(page offset bits) bytes, and the page offset gives the location of the addressed byte within the page
Address translation
Address translation is performed by the hardware and the operating system (OS)
The OS maintains for each program
What pages are associated with it
Where on disk each page resides
What pages are memory resident
The ppn associated with the vpn of each memory resident page
Address translation
For each program, the OS sets up a page table in memory that holds the ppn corresponding to the vpn of each memory resident page
The page table register in hardware provides the base address of the page table for the currently running process
Each program has a unique page table and page table register value that is loaded by the OS
Address translation
Page table access: the vpn is the offset into the page table, so the entry's address is the page table register + vpn. Because the page table is located in memory, accessing it requires loads/stores!
The TLB: faster address translation
Major problem: for each instruction or data access we have to first access the page table in memory to get the physical address
Solution: cache the address translations in a Translation Lookaside Buffer (TLB) in hardware
The TLB holds the ppns of the most recently accessed pages
Hardware first checks the TLB for the vpn/ppn pair; if not found (TLB miss), then the page table is accessed to get the ppn
The ppn is loaded into the TLB for later access
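The TLB's role as a translation cache can be sketched as follows; `tlb` and `page_table` are plain dicts standing in for the hardware TLB and the in-memory page table.

```python
def lookup_ppn(vpn, tlb, page_table):
    """Return (ppn, 'TLB hit' or 'TLB miss'). On a miss, the translation
    is fetched from the page table (slow: a load from memory) and
    loaded into the TLB for later accesses."""
    if vpn in tlb:
        return tlb[vpn], "TLB hit"
    ppn = page_table[vpn]          # page table walk in memory
    tlb[vpn] = ppn                 # cache the vpn/ppn pair
    return ppn, "TLB miss"

tlb, page_table = {}, {7: 0x55}
ppn, r1 = lookup_ppn(7, tlb, page_table)   # first access: TLB miss
ppn, r2 = lookup_ppn(7, tlb, page_table)   # second access: TLB hit
```

Because of temporal locality in page accesses, nearly all translations after the first hit in the TLB and never touch the in-memory page table.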
Accessing the TLB and the cache
MIPS R2000 fully associative TLB and Dcache
Accessing the TLB and the cache
TLB and Dcache operation
Accessing the TLB and the cache
Drawback: have to access the TLB first before accessing the cache (increases cache hit time)
Virtually indexed, physically tagged cache
Organize the cache so that only some of the page offset bits are needed to index the cache
Access the cache and TLB in parallel
Compare the ppn from the TLB with the tag read out of the cache to determine cache hit/miss
Virtually indexed, physically tagged cache
Example: 16KB page size, 128KB direct-mapped Dcache with 8B blocks, 1GB of main memory
[Figure: the 18-bit vpn is translated by the TLB into a 16-bit ppn, while the 14-bit page offset passes through unchanged; the cache index and 3-bit block offset come from the low-order address bits, so the cache is accessed in parallel with the TLB, and the ppn is compared with the cache tag to determine a hit]
Page faults
If the Valid bit for the page table entry is 0 (page is not in memory), a page fault exception invokes the OS
The OS saves the state of the running process, changes its status to idle, and starts the access of the page from disk
The OS selects another process to run on the CPU while the disk is accessed
When the access completes, an I/O exception invokes the OS, which updates the page table and changes the status of the first process to runnable
Page replacement
If main memory is fully used, the OS replaces a page using a pseudo-LRU procedure:
A page's reference bit is set in the page table whenever the page is accessed
Reference bits are periodically cleared by the OS
The OS chooses a page with a reference bit of 0 for replacement
Dirty pages are written back to disk before replacement
Written-to pages have their dirty bit set in the page table
After page replacement, the desired page is loaded from disk into memory and the page table entry is updated
The TLB entry also has reference and dirty bits
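The reference-bit procedure above can be sketched in a few lines; the page-table representation (a dict of reference and dirty bits per page) is a stand-in for the real structure.

```python
def pick_victim(pages):
    """pages maps page number -> {'ref': bit, 'dirty': bit}. Prefer a
    page whose reference bit is 0 (not recently used); if every page was
    recently referenced, fall back to an arbitrary one."""
    for pnum, bits in pages.items():
        if bits["ref"] == 0:
            return pnum
    return next(iter(pages))

def clear_reference_bits(pages):
    """Done periodically by the OS so old references age out."""
    for bits in pages.values():
        bits["ref"] = 0

pages = {3: {"ref": 1, "dirty": 0}, 8: {"ref": 0, "dirty": 1}}
victim = pick_victim(pages)                    # page 8: ref bit is 0
needs_writeback = pages[victim]["dirty"] == 1  # dirty victim goes to disk
```

Here page 8 is chosen because its reference bit is clear, and its set dirty bit means it must be written back to disk before its frame is reused.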
TLB misses
TLB misses are either handled by microcode or a fast privileged software routine
TLB miss operation
Select an entry for replacement
Write back any bits of the replaced entry that may have been modified (e.g., the dirty bit) to the page table
Access the page table entry corresponding to the vpn
If Valid = 0, generate a page fault exception
Otherwise, load the information from the page table into the TLB, and retry the access
Virtual memory and protection
Because only the OS can manipulate page tables, the TLB, and the page table register, it controls which pages different processes can access
TLB and page table entries also have a write access bit that is set only if the process has permission to write the page
Example: you have read-only access to a text editor page, but both read and write access to the data page
Attempts to circumvent these mechanisms cause a memory protection violation exception
High performance memory hierarchies
Recall that pipelined superscalar processors attempt to commit 4-6 instructions each cycle
The memory hierarchy must be enhanced in three ways to support such processors
Increase the number of load/store accesses each cycle
Reduce the miss penalty
Allow other cache accesses during a cache miss (increase the processor's tolerance of cache misses)
Even with these enhancements, the performance of many programs is limited by the performance of the memory hierarchy
Increasing load/store accesses
Multi-ported cache
Expensive in area, power, and access time
Banked cache: one bank holds data for even addresses and the other for odd
Fast, with reasonable area and power
Only a single access per cycle if the two accesses are not one even and one odd
Duplicated cache: multiple identical copies
Fast access time, but high area and power cost
Stores use both copies (single access)
Reducing the miss penalty
Early restart
Send the desired word to the processor as soon as it is returned to the L1 cache and restart the pipeline; the rest of the block continues to load into the cache
Critical word first
Return the desired word at the head of the cache block, bypassing it to the CPU so the pipeline restarts immediately
Reducing the miss penalty
Prefetching
Bring the block into the cache before it is requested
Software-based approach
A special instruction (pref in MIPS) is inserted into the program by the compiler to bring the desired block into the L1 Dcache
The compiler needs to determine what data to prefetch and guess how far in advance to request it
Incorrect guesses can degrade performance (why??)
Example:
pref 0, 20($3)
...
lw $5, 20($3)
Reducing the miss penalty
Hardware-based approach
A hardware stream buffer requests blocks (instruction or data) based on past cache access patterns
Data is placed into the stream buffer and loaded into the cache only when needed, in order to avoid displacing useful data
Sophisticated stream buffers can detect the stride of array accesses, i.e., every nth array element being accessed
On a stream buffer "hit", the block is supplied from the stream buffer; on a stream buffer "miss", the request goes to the next level of the hierarchy
Nonblocking caches – miss tolerance
Basic idea: allow other data cache accesses after a miss occurs
Today’s designs allow several data cache misses to be handled in parallel
Requires a higher bandwidth memory hierarchy to handle multiple misses at the same time
L2 cache may be nonblocking as well
Nonblocking caches
Miss Status Holding Registers (MSHRs) hold information needed to complete misses
Each MSHR holds a valid bit, the miss address, the destination register, and the access type (byte/halfword/word)
Example: a miss on lb $8, 700($0) allocates an MSHR with valid = 1, address = 700, register = 8, type = byte; the MSHR is looked up to complete the load when the data returns
Nonblocking caches
Nonblocking caches work well with out-of-order issue
[Figure: in the issue queue, an instruction waiting for an outstanding load stalls, while independent instructions and additional loads continue to issue around it]
The MIPS R12000 memory hierarchy
32KB, two-way set associative L1 Icache
Fetches up to four instructions each cycle (16B block size)
Virtually-indexed, physically-tagged
LRU replacement
32KB, two-way set associative L1 Dcache
32B block size
Virtually-indexed, physically-tagged
LRU replacement
Two-way banked
Nonblocking, with support for four cache block misses
Writeback with a 5-entry victim buffer
Two cycle access time
The MIPS R12000 memory hierarchy
Up to 4MB L2 cache using off-chip SRAMs
Pseudo two-way set associative using way-prediction
Predict which way the access will hit in and access it first; if a miss, access the other way
Reduces bus width compared to set associative to save package pins
128 bit L2-L1 bus
Virtual memory support
44-bit virtual addresses translated into 40-bit physical addresses
Page size any power of 4 between 4KB and 16MB
8 entry fully associative ITLB for instruction translations (a subset of the main TLB)
64 entry fully associative main TLB for data translations
Questions?