ECE200 – Computer Organization
Transcript of ECE200 – Computer Organization
Chapter 7 – Large and Fast: Exploiting
Memory Hierarchy
Homework 7
7.2, 7.3, 7.5, 7.6, 7.7, 7.11, 7.15, 7.20, 7.21, 7.27, 7.32
Outline for Chapter 7 lectures
Motivation for, and concept of, memory hierarchies
Caches
Main memory
Characterizing memory hierarchy performance
Virtual memory
Real memory hierarchies
The memory dilemma
Ch 6 assumption: on-chip instruction and data memories hold the entire program and its data and can be accessed in one cycle
Reality check
In high performance machines, programs may require 100's of megabytes or even gigabytes of memory to run
Embedded processors have smaller needs, but there is also less room for on-chip memory
Basic problem
We need much more memory than can fit on the microprocessor chip
But we do not want to incur stall cycles every time the pipeline accesses instructions or data
At the same time, we need the memory to be economical for the machine to be competitive in the market
Solution: a hierarchy of memories
Another view
Typical characteristics of each level
First level (L1) is separate on-chip instruction and data caches placed where our instruction and data memories reside
16-64KB for each cache (desktop/server machine)
Fast, power-hungry, not-so-dense static RAM (SRAM)
Second level (L2) consists of another, larger unified cache that holds both instructions and data
256KB-4MB
On- or off-chip
SRAM
Third level is main memory
64-512MB
Slower, lower-power, denser dynamic RAM (DRAM)
Final level is I/O (e.g., disk)
Caches and the pipeline
L1 instruction and data caches and L2 cache
Memory hierarchy operation
(1) Search L1 for the instruction or data. If found (cache hit), done
(2) Else (cache miss), search the L2 cache. If found, place it in L1 and repeat (1)
(3) Else, search main memory. If found, place it in L2 and repeat (2)
(4) Else, get it from I/O (Chapter 8)
Steps (1)-(3) are performed in hardware
1-3 cycles to get from the L1 caches
5-20 cycles to get from the L2 cache
50-200 cycles to get from main memory
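The lookup order in steps (1)-(4) can be sketched in a few lines. This is a minimal model, not a hardware description: each level is a plain dict, and the latencies are illustrative values taken from the middle of the cycle ranges quoted above.

```python
# Illustrative latencies (cycles) from the ranges above: L1 1-3, L2 5-20, MM 50-200
LATENCY = {"L1": 2, "L2": 10, "MM": 100}

def access(addr, l1, l2, mm):
    """Return (data, cycles). On a miss, the block is copied into the
    faster levels on the way back up, as in steps (1)-(3)."""
    cycles = LATENCY["L1"]
    if addr in l1:                      # (1) hit in L1: done
        return l1[addr], cycles
    cycles += LATENCY["L2"]
    if addr in l2:                      # (2) hit in L2: place in L1
        l1[addr] = l2[addr]
        return l1[addr], cycles
    cycles += LATENCY["MM"]
    if addr in mm:                      # (3) hit in MM: place in L2, then L1
        l2[addr] = mm[addr]
        l1[addr] = mm[addr]
        return l1[addr], cycles
    raise LookupError("get it from I/O (step 4, Chapter 8)")

l1, l2, mm = {}, {}, {0x40: "lw $t0,0($s1)"}
data, t1 = access(0x40, l1, l2, mm)   # misses all the way to main memory
data, t2 = access(0x40, l1, l2, mm)   # now an L1 hit
```

After the first access, the block is resident in both L1 and L2, so the second access costs only the L1 latency.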
Principle of locality
Programs access a small portion of memory within a short time period
Temporal locality: recently accessed memory locations will likely be accessed soon
Spatial locality: memory locations near recently accessed locations will likely be accessed soon
POL makes memory hierarchies work
A large percentage of the time (typically >90%) the instruction or data is found in L1, the fastest memory
Cheap, abundant main memory is accessed more rarely
Memory hierarchy operates at nearly the speed of expensive on-chip SRAM with about the cost of main memory (DRAMs)
Caches
Caches are small, fast, memories that hold recently accessed instructions and/or data
Separate L1 instruction and L1 data caches
Need simultaneous access of instructions and data in pipelines
L2 cache holds both instructions and data
Simultaneous access not as critical since >90% of the instructions and data will be found in L1
The PC or effective address from L1 is sent to L2 to search for the instruction or data
How caches exploit the POL
On a cache miss, a block of several instructions or data, including the requested item, are returned
The entire block is placed into the cache so that future searches for items in the block will be successful
[Figure: a block of four instructions, instruction_i through instruction_i+3; the requested instruction is returned together with its neighbors in the block]
How caches exploit the POL
Consider sequence of instructions and data accesses in this loop with a block size of 4 words
Loop: lw   $t0, 0($s1)
      addu $t0, $t0, $s2
      sw   $t0, 0($s1)
      addi $s1, $s1, -4
      bne  $s1, $zero, Loop
The cache is much smaller than main memory
Multiple memory blocks must share the same cache location
Searching the cache
Need a way to determine whether the desired instruction or data is held in the cache
Need a scheme for replacing blocks when a new block needs to be brought in on a miss
Searching the cache
Cache organization alternatives
Direct mapped: each block can be placed in only one cache location
Set associative: each block can be placed in any of n cache locations
Fully associative: each block can be placed in any cache location
Cache organization alternatives
Searching for block 12 in caches of size 8 blocks
Searching a direct mapped cache
Need log2(number of sets) of the address bits (the index) to select the block location
The block offset bits select the desired byte, half-word, or word within the block
The remaining bits (the tag) are used to determine if this is the desired block or another that shares the same cache location
Memory address fields: tag | index | block offset
Example: assume a data cache with 16 byte blocks and 8 sets
4 block offset bits, 3 index bits, 25 tag bits
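The field widths in the example can be checked with a short sketch. This assumes a 32-bit address; the constants mirror the 16-byte-block, 8-set cache above.

```python
# Direct mapped example: 16B blocks -> 4 offset bits, 8 sets -> 3 index bits,
# leaving 32 - 4 - 3 = 25 tag bits.
BLOCK_BITS = 4   # log2(16B block)
INDEX_BITS = 3   # log2(8 sets)

def split_address(addr):
    """Split a 32-bit address into (tag, index, block offset)."""
    offset = addr & ((1 << BLOCK_BITS) - 1)
    index = (addr >> BLOCK_BITS) & ((1 << INDEX_BITS) - 1)
    tag = addr >> (BLOCK_BITS + INDEX_BITS)
    return tag, index, offset

tag, index, offset = split_address(0x12345678)
```

The index selects one of the 8 sets, and the 25-bit tag stored there is compared with the tag field of the address to decide hit or miss.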
Searching a direct mapped cache
The block is placed in the set selected by the index field
Number of sets = cache size / block size
Direct mapped cache organization
64KB instruction cache with 16 byte (4 word) blocks
4K sets (64KB/16B), so 12 index bits are needed to select the set
Direct mapped cache organization
The data section of the cache holds the instructions
Direct mapped cache organization
The tag section holds the part of the memory address used to distinguish different blocks
Direct mapped cache organization
A valid bit associated with each set indicates if the instructions are valid or not
Direct mapped cache access
The index bits are used to select one of the sets
Direct mapped cache access
The data, tag, and Valid bit from the selected set are simultaneously accessed
Direct mapped cache access
The tag from the selected entry is compared with the tag field of the address
Direct mapped cache access
A match between the tags and a Valid bit that is set indicates a cache hit
Direct mapped cache access
The block offset selects the desired instruction
Set associative cache
The block is placed in one way of the set selected by the index
Number of sets = cache size / (block size x ways)
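The set count formula and the parallel tag search over the ways can be sketched as follows. The geometry numbers are illustrative (an assumed 8KB two-way cache), and the sequential loop stands in for what hardware does in parallel.

```python
def num_sets(cache_size, block_size, ways):
    """Number of sets = cache size / (block size x ways)."""
    return cache_size // (block_size * ways)

def lookup(sets, set_index, tag):
    """Each set is a list of (valid, tag) pairs, one per way. Hardware
    compares all ways of the selected set at once; we loop instead."""
    for way, (valid, stored_tag) in enumerate(sets[set_index]):
        if valid and stored_tag == tag:
            return way          # cache hit in this way
    return None                 # cache miss

# Direct mapped example from earlier: 64KB cache, 16B blocks, 1 way
assert num_sets(64 * 1024, 16, 1) == 4096

# Hypothetical 8KB two-way cache with 16B blocks -> 256 sets
sets = [[(1, 0x12), (0, 0x00)] for _ in range(num_sets(8 * 1024, 16, 2))]
hit_way = lookup(sets, 5, 0x12)     # hit in way 0
miss = lookup(sets, 5, 0x99)        # miss: no way matches
```

Note how doubling the ways halves the number of sets for the same cache size, trading index bits for extra tag comparisons.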
Set associative cache operation
The index bits are used to select one of the sets
Set associative cache operation
The data, tag, and Valid bit from all ways of the selected entry are simultaneously accessed
Set associative cache operation
The tags from all ways of the selected entry are compared with the tag field of the address
Set associative cache operation
A match between the tags and a Valid bit that is set indicates a cache hit (hit in way1 shown)
Set associative cache operation
The data from the way that had a hit is returned through the MUX
Fully associative cache
A block can be placed in any cache location
[Figure: fully associative cache; the tag of every entry is compared in parallel against the address, and the data from the matching valid entry is returned through a MUX]
Different degrees of associativity
Four different caches of size 8 blocks
Cache misses
A cache miss occurs when the block is not found in the cache
The block is requested from the next level of the hierarchy
When the block returns, it is loaded into the cache and provided to the requester
A copy of the block remains in the lower levels of the hierarchy
The cache miss rate is found by dividing the total number of misses by the total number of accesses (misses/accesses)
The hit rate is 1 - miss rate
Classifying cache misses
Compulsory misses
Caused by the first access to a block that has never been in the cache
Capacity misses
Due to the cache not being big enough to hold all the blocks that are needed
Conflict misses
Due to multiple blocks competing for the same set
A fully associative cache with a "perfect" replacement policy has no conflict misses
Cache miss classification examples
Direct mapped cache of size two blocks
Blocks A and B map to set 0, C and D to set 1
Access pattern 1: A, B, C, D, A, B, C, D
Access pattern 2: A, A, B, A
Pattern 1: all eight accesses miss. The first four (A, B, C, D) are compulsory misses; on the repeats, A and B keep evicting each other from set 0 and C and D from set 1. A fully associative cache of the same size would also miss on these repeats, so they are capacity misses
Pattern 2: the first A and B are compulsory misses and the second A is a hit. The final A is a conflict miss: B evicted A from set 0 even though set 1 was unused
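The two access patterns can be replayed on a tiny simulator of this cache. This is a sketch of the example only: two sets, one block each, with A and B mapping to set 0 and C and D to set 1 as stated above.

```python
# Which set each block maps to in the two-block direct mapped cache
SET_OF = {"A": 0, "B": 0, "C": 1, "D": 1}

def run(pattern):
    """Replay an access pattern and count hits and misses."""
    cache = [None, None]          # one block per set
    hits = misses = 0
    for block in pattern:
        s = SET_OF[block]
        if cache[s] == block:
            hits += 1
        else:
            misses += 1
            cache[s] = block      # evict whatever shared the set
    return hits, misses

p1 = run(list("ABCDABCD"))   # (0 hits, 8 misses)
p2 = run(list("AABA"))       # (1 hit, 3 misses)
```

Running pattern 1 confirms that every access misses, and pattern 2 shows the single hit on the second A.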
Reducing capacity misses
Increase the cache size
More cache blocks can be simultaneously held in the cache
Drawback: increased access time
Reducing compulsory misses
Increase the block size
Each miss results in more words being loaded into the cache
Block size should only be increased to a certain point! As block size is increased:
Fewer cache sets (increased contention)
A larger percentage of each block may never be referenced
Reducing conflict misses
Increase the associativity
More locations in which a block can be held
Drawback: increased access time
Cache miss rates for SPEC92
Block replacement policy
Determines what block to replace on a cache miss to make room for the new block
Least recently used (LRU)
Pick the block that has been unused for the longest time
Based on temporal locality
Requires ordering bits to be kept with each set
Too expensive beyond 4-way
Random
Pseudo-randomly pick a block
Generally not as effective as LRU (higher miss rates)
Simple even for highly associative organizations
Most recently used (MRU)
Keep track of which block was accessed last
Randomly pick a block from among the others
Compromise between LRU and random
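LRU bookkeeping for one set can be sketched with a list ordered from least to most recently used, standing in for the hardware ordering bits mentioned above. The block names and two-way geometry are illustrative.

```python
def access_set(lru_order, block, ways):
    """Return 'hit' or 'miss' and update the recency order for this set.
    lru_order runs from least recently used (front) to most (back)."""
    if block in lru_order:
        lru_order.remove(block)       # hit: move to most-recently-used
        lru_order.append(block)
        return "hit"
    if len(lru_order) == ways:        # miss with a full set: evict the LRU block
        lru_order.pop(0)
    lru_order.append(block)
    return "miss"

# Two-way set: A miss, B miss, A hit, C miss (evicts B), B miss (evicts A)
order = []
results = [access_set(order, b, 2) for b in ["A", "B", "A", "C", "B"]]
```

The final state of `order` shows why ordering bits are needed: the eviction choice depends on the full access history of the set, which is why true LRU gets expensive beyond 4-way.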
Cache writes
The L1 data cache needs to handle writes (stores) in addition to reads (loads)
Need to check for a hit before writing
Don't want to write over another block on a miss
Requires a two cycle operation (tag check followed by write)
Write back cache
Check for a hit
If a hit, write the byte, halfword, or word to the correct location in the block
If a miss, request the block from the next level in the hierarchy, load the block into the cache, and then perform the write
Cache writes and block replacement
With a write back cache, when a block is written to, copies of the block in the lower levels are not updated
If this block is chosen for replacement on a miss, we need to save it to the next level
Solution:
A dirty bit is associated with each cache block
The dirty bit is set if the block is written to
A block with a set dirty bit that is chosen for replacement is written to the next level before being overwritten with the new block
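The dirty-bit protocol can be sketched as follows. The block representation (a dict with tag, data, and dirty fields) and the `next_level` dict are stand-ins, not a hardware description.

```python
def write(block, data):
    """A store hits this block: update the data and set the dirty bit."""
    block["data"] = data
    block["dirty"] = True

def replace_block(old, new_tag, new_data, next_level):
    """Evict `old` to make room. Only a dirty block must be written to
    the next level first; a clean one can simply be overwritten."""
    if old is not None and old["dirty"]:
        next_level[old["tag"]] = old["data"]   # write back the dirty victim
    return {"tag": new_tag, "data": new_data, "dirty": False}

l2 = {}
entry = {"tag": 0xA, "data": 1, "dirty": False}
write(entry, 2)                              # dirty bit now set
entry = replace_block(entry, 0xB, 9, l2)     # victim written to L2 first
```

After the replacement, L2 holds the up-to-date copy of block 0xA and the new block 0xB enters the cache clean.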
Cache writes and block replacement
[Figure: (1) a cache miss occurs in L1D; (2) the dirty victim block is written to L2; (3) a read request is sent to L2; (4) the requested block is loaded into the L1D cache]
Main memory systems
The main memory (MM) lies in the memory hierarchy between the cache hierarchy and I/O
MM is too large to fit on the microprocessor die
The MM controller is on the microprocessor die or a separate chip
DRAMs lie on Dual In-line Memory Modules (DIMMs) on the printed circuit board
~10-20 DRAMs per DIMM
Interconnect width is typically ½ to ¼ of the cache block
The cache block is returned to L2 in multiple transfers
[Figure: the microprocessor contains the CPU, L1 instruction and data caches, and L2 cache; the main memory control connects over the interconnect to the DIMMs and to input/output devices such as disks]
Main memory systems
DRAM characteristics
Storage cells lie in a grid of rows and columns
The row decoder selects the row
Column latches latch the data from the selected row
The column MUX selects a subset of the data to be output
Access time ~40ns from row address to data out
Main memory systems
Synchronous DRAMs (SDRAMs)
Exploit the fact that the column latch holds a full row of data although only a subset is read out
The address is accompanied by a burst length input
The first data item is accessed from the RAM array; successive data in the column latch can be read out at the DRAM clock speed
Example: 100MHz, 50ns SDRAM: the first access takes 50ns, and each successive item in the burst takes 10ns (one clock)
Improving MM performance
The latency to get the first data out of the MM is largely determined by
The speed of the DRAMs
The speed of the interconnect between L2 and MM
Focus is on reducing the time to return the entire cache block to L2
Baseline MM design assumptions
1 cycle to transfer the address
15 cycles for each DRAM access
1 cycle to transfer each data chunk
Memory and interconnect (bus) widths are ¼ that of the block
Cache miss penalty = 1 + 4 x (15 + 1) = 65 cycles
Wider MMs
Increase MM and bus widths
Downsides
Hardware cost of a wider bus
2X the number of DRAMs is needed (unless wider DRAMs are used)
Cache miss penalty = 1 + 2 x (15 + 1) = 33 cycles for 2X width
Interleaved memory
Divide the memory into banks (may be a DIMM)
Each bank handles a fraction of the block on each access
Bank accesses are started a cycle apart
Cache miss penalty = 1 + 15 + 4 x 1 = 20 cycles
Using the burst capability of SDRAMs
Assume first access takes 15 cycles, and 2 cycles thereafter to get the rest of the block
Cache miss penalty = 1 + 15 + 3 x 2 + 4 x 1 = 26 cycles
Very typical in commercial MM systems to interleave banks of DIMMs containing SDRAMs
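The four miss penalty calculations above can be collected in one sketch. All parameters come from the baseline assumptions (1 cycle for the address, 15 cycles per DRAM access, 1 cycle per data transfer, 4 chunks per block).

```python
# Baseline main memory design assumptions
ADDR, ACCESS, XFER, CHUNKS = 1, 15, 1, 4

# Each chunk is accessed and transferred serially
baseline    = ADDR + CHUNKS * (ACCESS + XFER)          # 65 cycles

# 2X wider memory and bus: half as many chunks
wider_2x    = ADDR + (CHUNKS // 2) * (ACCESS + XFER)   # 33 cycles

# Interleaved banks: accesses overlap, only transfers serialize
interleaved = ADDR + ACCESS + CHUNKS * XFER            # 20 cycles

# SDRAM burst: 15 cycles for the first access, 2 per remaining chunk
sdram_burst = ADDR + ACCESS + 3 * 2 + CHUNKS * XFER    # 26 cycles
```

Laying the formulas side by side makes the trend clear: each technique removes serialized DRAM accesses from the critical path, which is where most of the 65-cycle baseline penalty comes from.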
Memory hierarchy performance
CPU time = (CPU execution cycles + memory stall cycles) x CT = INST x (CPIexecution + CPImisses) x CT
CPU execution cycles excludes cache miss cycles
Memory stall cycles are due to L1 cache misses
Cycle time may be impacted by the memory hierarchy if the L1 cache is a critical timing path
Example
Machine characteristics
CPI_execution = 2.0
L1 Icache miss rate = 0.02
L1 Dcache miss rate = 0.05
L2 access time = 10 cycles
L2 miss rate = 0.2
MM access time = 30 cycles
Loads and stores are 30% of all instructions
CPI_misses = CPI_L1Imisses + CPI_L1Dmisses
CPI_L1Imisses = 0.02 x (10 + 0.2 x 30) = 0.32
CPI_L1Dmisses = 0.3 x 0.05 x (10 + 0.2 x 30) = 0.24
CPU time = INST x (2.0 + 0.32 + 0.24) x cycle time = INST x 2.56 x cycle time
28% increase in CPU time because of cache misses
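The same calculation in executable form; every number is taken from the stated machine characteristics.

```python
# Machine characteristics from the example
cpi_exec = 2.0
i_miss_rate, d_miss_rate = 0.02, 0.05
l2_time, l2_miss_rate, mm_time = 10, 0.2, 30
load_store_frac = 0.30

# Average cost of an L1 miss: L2 access plus the fraction that misses to MM
avg_l1_miss_cost = l2_time + l2_miss_rate * mm_time    # 16 cycles

cpi_imiss = i_miss_rate * avg_l1_miss_cost                    # 0.32
cpi_dmiss = load_store_frac * d_miss_rate * avg_l1_miss_cost  # 0.24
cpi_total = cpi_exec + cpi_imiss + cpi_dmiss                  # 2.56

slowdown = cpi_total / cpi_exec - 1    # 28% increase from cache misses
```

Note that every instruction fetch can miss in the Icache, but only the 30% of instructions that are loads or stores can miss in the Dcache, which is why the 0.3 factor appears in the Dcache term only.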
Virtual memory
A single program may require more main memory than is present in the machine
Multiple programs must share main memory without interfering with each other
With virtual memory, portions of different programs are loaded from I/O to memory on demand
When memory gets full, portions of programs are swapped out to I/O
Implementing virtual memory
Separate different programs in memory by assigning different memory addresses to each
Identify when the desired instructions or data are in memory or not
Generate an interrupt when they are not in memory and must be retrieved from I/O
Provide support in the OS to retrieve the desired instructions or data, replacing others if necessary
Prevent users from accessing instructions or data they do not own
Memory pages
Transfers between disk and memory occur in pages whose size is defined in the ISA
Page size is large to amortize the high cost of disk access (~5-10ms)
Tradeoffs in increasing page size are similar as for cache block size
[Figure: pages (4-64KB) are transferred between disk and main memory; cache blocks (16-128B) are transferred between main memory and the L2 cache]
Virtual and physical addresses
The virtual addresses in your program are translated during program execution into the physical addresses used to address memory
Some virtual addresses may refer to data that is not in memory (not memory resident)
Address translation
The virtual page number (vpn) part of the virtual address is translated into a physical page number (ppn) that points to the desired page
The low-order page offset bits point to which byte is being accessed within a page
The ppn + page offset form the physical address
Example: 1GB of main memory in the machine, 4GB of virtual address space, 2^12 = 4KB pages
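The vpn/ppn split and the translation itself can be sketched directly from these numbers. The `page_table` dict is a stand-in for the real page table, and the resident page chosen is hypothetical.

```python
# 4KB pages -> 12 page offset bits, as in the example above
PAGE_OFFSET_BITS = 12

def translate(vaddr, page_table):
    """Translate a virtual address to a physical address:
    physical address = ppn concatenated with the unchanged page offset."""
    vpn = vaddr >> PAGE_OFFSET_BITS
    offset = vaddr & ((1 << PAGE_OFFSET_BITS) - 1)
    if vpn not in page_table:
        raise LookupError("page fault: page is not memory resident")
    ppn = page_table[vpn]
    return (ppn << PAGE_OFFSET_BITS) | offset

page_table = {0x12345: 0x00042}          # hypothetical resident page
paddr = translate(0x12345ABC, page_table)
```

Only the page number changes in translation; the low 12 offset bits pass through untouched, which is what later makes it possible to start a cache access in parallel with the TLB lookup.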
Address translation
Another view of the physical address: the ppn gives the location of the first byte of the page in memory, each page holds 2^(page offset bits) bytes, and the page offset gives the location of the addressed byte within the page
Address translation
Address translation is performed by the hardware and the operating system (OS)
The OS maintains for each program
What pages are associated with it
Where on disk each page resides
What pages are memory resident
The ppn associated with the vpn of each memory resident page
Address translation
For each program, the OS sets up a page table in memory that holds the ppn corresponding to the vpn of each memory resident page
The page table register in hardware provides the base address of the page table for the currently running process
Each program has a unique page table and page table register value that is loaded by the OS
Address translation
Page table access: the vpn is the offset into the page table, so the entry's address is the page table register + vpn. Because the page table is located in memory, accessing it requires loads/stores!
The TLB: faster address translation
Major problem: for each instruction or data access we have to first access the page table in memory to get the physical address
Solution: cache the address translations in a Translation Lookaside Buffer (TLB) in hardware
The TLB holds the ppns of the most recently accessed pages
Hardware first checks the TLB for the vpn/ppn pair; if not found (TLB miss), then the page table is accessed to get the ppn
The ppn is loaded into the TLB for later access
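The TLB's role as a translation cache can be sketched as follows; `tlb` and `page_table` are plain dicts standing in for the hardware TLB and the in-memory page table.

```python
def lookup_ppn(vpn, tlb, page_table):
    """Return (ppn, 'TLB hit' or 'TLB miss'). On a miss, the translation
    is fetched from the page table (slow: a load from memory) and
    loaded into the TLB for later accesses."""
    if vpn in tlb:
        return tlb[vpn], "TLB hit"
    ppn = page_table[vpn]          # page table walk in memory
    tlb[vpn] = ppn                 # cache the vpn/ppn pair
    return ppn, "TLB miss"

tlb, page_table = {}, {7: 0x55}
ppn, r1 = lookup_ppn(7, tlb, page_table)   # first access: TLB miss
ppn, r2 = lookup_ppn(7, tlb, page_table)   # second access: TLB hit
```

Because of temporal locality in page accesses, nearly all translations after the first hit in the TLB and never touch the in-memory page table.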
Accessing the TLB and the cache
MIPS R2000 fully associative TLB and Dcache
Accessing the TLB and the cache
TLB and Dcache operation
Accessing the TLB and the cache
Drawback: have to access the TLB first before accessing the cache (increases cache hit time)
Virtually indexed, physically tagged cache
Organize the cache so that only some of the page offset bits are needed to index the cache
Access the cache and TLB in parallel
Compare the ppn from the TLB with the tag read out of the cache to determine cache hit/miss
Virtually indexed, physically tagged cache
Example: 16KB page size, 128KB direct-mapped Dcache with 8B blocks, 1GB of main memory
[Figure: the 18-bit vpn is translated by the TLB into a 16-bit ppn, while the 14-bit page offset passes through unchanged; the cache index and 3-bit block offset come from the low-order address bits, so the cache is accessed in parallel with the TLB, and the ppn is compared with the cache tag to determine a hit]
Page faults
If the Valid bit for the page table entry is 0 (page is not in memory), a page fault exception invokes the OS
The OS saves the state of the running process, changes its status to idle, and starts the access of the page from disk
The OS selects another process to run on the CPU while the disk is accessed
When the access completes, an I/O exception invokes the OS, which updates the page table and changes the status of the first process to runnable
Page replacement
If main memory is fully used, the OS replaces a page using a pseudo-LRU procedure:
A page's reference bit is set in the page table whenever the page is accessed
Reference bits are periodically cleared by the OS
The OS chooses a page with a reference bit of 0 for replacement
Dirty pages are written back to disk before replacement
Written-to pages have their dirty bit set in the page table
After page replacement, the desired page is loaded from disk into memory and the page table entry is updated
The TLB entry also has reference and dirty bits
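The reference-bit procedure above can be sketched in a few lines; the page-table representation (a dict of reference and dirty bits per page) is a stand-in for the real structure.

```python
def pick_victim(pages):
    """pages maps page number -> {'ref': bit, 'dirty': bit}. Prefer a
    page whose reference bit is 0 (not recently used); if every page was
    recently referenced, fall back to an arbitrary one."""
    for pnum, bits in pages.items():
        if bits["ref"] == 0:
            return pnum
    return next(iter(pages))

def clear_reference_bits(pages):
    """Done periodically by the OS so old references age out."""
    for bits in pages.values():
        bits["ref"] = 0

pages = {3: {"ref": 1, "dirty": 0}, 8: {"ref": 0, "dirty": 1}}
victim = pick_victim(pages)                    # page 8: ref bit is 0
needs_writeback = pages[victim]["dirty"] == 1  # dirty victim goes to disk
```

Here page 8 is chosen because its reference bit is clear, and its set dirty bit means it must be written back to disk before its frame is reused.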
TLB misses
TLB misses are either handled by microcode or a fast privileged software routine
TLB miss operation
Select an entry for replacement
Write back any bits of the replaced entry that may have been modified (e.g., the dirty bit) to the page table
Access the page table entry corresponding to the vpn
If Valid = 0, generate a page fault exception
Otherwise, load the information from the page table into the TLB, and retry the access
Virtual memory and protection
Because only the OS can manipulate page tables, the TLB, and the page table register, it controls which pages different processes can access
TLB and page table entries also have a write access bit that is set only if the process has permission to write the page
Example: you have read-only access to a text editor page, but both read and write access to the data page
Attempts to circumvent these mechanisms cause a memory protection violation exception
High performance memory hierarchies
Recall that pipelined superscalar processors attempt to commit 4-6 instructions each cycle
The memory hierarchy must be enhanced in three ways to support such processors
Increase the number of load/store accesses each cycle
Reduce the miss penalty
Allow other cache accesses during a cache miss (increase the processor's tolerance of cache misses)
Even with these enhancements, the performance of many programs is limited by the performance of the memory hierarchy
Increasing load/store accesses
Multi-ported cache
Expensive in area, power, and access time
Banked cache: one bank holds data for even addresses and the other for odd
Fast, with reasonable area and power
Only a single access per cycle if the two accesses are not one even and one odd
Duplicated cache: multiple identical copies
Fast access time, but high area and power cost
Stores use both copies (single access)
Reducing the miss penalty
Early restart
Send the desired word to the processor as soon as it is returned to the L1 cache and restart the pipeline; the rest of the block continues to load into the cache
Critical word first
Return the desired word at the head of the cache block, bypassing it to the CPU so the pipeline restarts immediately
Reducing the miss penalty
Prefetching
Bring the block into the cache before it is requested
Software-based approach
A special instruction (pref in MIPS) is inserted into the program by the compiler to bring the desired block into the L1 Dcache
The compiler needs to determine what data to prefetch and guess how far in advance to request it
Incorrect guesses can degrade performance (why??)
Example:
pref 0, 20($3)
...
lw $5, 20($3)
Reducing the miss penalty
Hardware-based approach
A hardware stream buffer requests blocks (instruction or data) based on past cache access patterns
Data is placed into the stream buffer and loaded into the cache only when needed, in order to avoid displacing useful data
Sophisticated stream buffers can detect the stride of array accesses, i.e., every nth array element being accessed
On a stream buffer "hit", the block is supplied from the stream buffer; on a stream buffer "miss", the request goes to the next level of the hierarchy
Nonblocking caches – miss tolerance
Basic idea: allow other data cache accesses after a miss occurs
Today’s designs allow several data cache misses to be handled in parallel
Requires a higher bandwidth memory hierarchy to handle multiple misses at the same time
L2 cache may be nonblocking as well
Nonblocking caches
Miss Status Holding Registers (MSHRs) hold information needed to complete misses
Each MSHR holds a valid bit, the miss address, the destination register, and the access type (byte/halfword/word)
Example: a miss on lb $8, 700($0) allocates an MSHR with valid = 1, address = 700, register = 8, type = byte; the MSHR is looked up to complete the load when the data returns
Nonblocking caches
Nonblocking caches work well with out-of-order issue
[Figure: in the issue queue, an instruction waiting for an outstanding load stalls, while independent instructions and additional loads continue to issue around it]
The MIPS R12000 memory hierarchy
32KB, two-way set associative L1 Icache
Fetches up to four instructions each cycle (16B block size)
Virtually-indexed, physically-tagged
LRU replacement
32KB, two-way set associative L1 Dcache
32B block size
Virtually-indexed, physically-tagged
LRU replacement
Two-way banked
Nonblocking, with support for four cache block misses
Writeback with a 5-entry victim buffer
Two cycle access time
The MIPS R12000 memory hierarchy
Up to 4MB L2 cache using off-chip SRAMs
Pseudo two-way set associative using way-prediction
Predict which way the access will hit in and access it first; if a miss, access the other way
Reduces bus width compared to set associative to save package pins
128 bit L2-L1 bus
Virtual memory support
44-bit virtual addresses translated into 40-bit physical addresses
Page size any power of 4 between 4KB and 16MB
8 entry fully associative ITLB for instruction translations (a subset of the main TLB)
64 entry fully associative main TLB for data translations
Questions?