CS.305 Computer Architecture
<local.cis.strath.ac.uk/teaching/ug/classes/CS.305>
Memory: Structures
Adapted from Computer Organization and Design, Patterson & Hennessy, © 2005, and from slides kindly made available by Dr Mary Jane Irwin, Penn State University.
Review: Major Components of a Computer
- Processor: Control, Datapath
- Memory
- Devices: Input, Output
Processor-Memory Performance Gap
[Chart: relative performance vs. year, 1980-2004, log scale from 1 to 10,000. "Moore's Law": µProc improves 55%/year (2x per 1.5 years); DRAM improves 7%/year (2x per 10 years). The processor-memory performance gap grows about 50%/year.]
The "Memory Wall"
Logic vs DRAM speed gap continues to grow
[Chart: clocks per instruction (core) vs clocks per DRAM access, log scale from 0.01 to 1,000, for VAX/1980, PPro/1996, and 2010+.]
Impact of Memory Performance
Suppose a processor executes with an ideal CPI of 1.1 (50% arith/logic, 30% ld/st, 20% control) and that 10% of data memory operations miss with a 50-cycle miss penalty.
CPI = ideal CPI + average stalls per instruction
    = 1.1 + (0.30 datamem ops/instr x 0.10 misses/datamem op x 50 cycles/miss)
    = 1.1 + 1.5 = 2.6
So 58% of the time the processor is stalled waiting for memory!
A 1% instruction miss rate would add an additional 0.5 to the CPI!
[Pie chart: ideal CPI 1.1, data-miss stalls 1.5, instruction-miss stalls 0.5.]
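The stall arithmetic on this slide can be sketched directly; the function and parameter names below are mine, not from the slides.

```python
# CPI = ideal CPI + average memory-stall cycles per instruction,
# using the slide's numbers: 30% ld/st, 10% data miss rate, 50-cycle penalty.
def cpi_with_stalls(ideal_cpi, mem_ops_per_instr, miss_rate, miss_penalty):
    return ideal_cpi + mem_ops_per_instr * miss_rate * miss_penalty

cpi = cpi_with_stalls(1.1, 0.30, 0.10, 50)   # 1.1 + 1.5 = 2.6
stall_fraction = (cpi - 1.1) / cpi           # fraction of time stalled waiting
instr_miss_stalls = 1.0 * 0.01 * 50          # a 1% instruction miss rate adds 0.5
print(round(cpi, 2), round(stall_fraction, 2), instr_miss_stalls)  # -> 2.6 0.58 0.5
```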
The Memory Hierarchy Goal
Fact: large memories are slow (and low cost per bit), and fast memories are small (and high cost per bit).
How do we create a memory that gives the illusion of being large, cheap and fast (most of the time)?
- With hierarchy
- With parallelism
A Typical Memory Hierarchy
[Diagram: on-chip components (Control, Datapath, RegFile, ITLB, DTLB, Instruction Cache, Data Cache), a Second Level Cache (SRAM) or eDRAM, Main Memory (DRAM), and Secondary Memory (Disk).]
Speed (cycles):  ½'s      1's   10's    100's   1,000's
Size (bytes):    100's    K's   10K's   M's     G's to T's
Cost:            highest  ->                    lowest
By taking advantage of the principle of locality, we can present the user with as much memory as is available in the cheapest technology, at the speed offered by the fastest technology.
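The "large, cheap and fast" illusion is usually quantified with average memory access time (AMAT), the standard Patterson & Hennessy model; the hit times and miss rates below are illustrative assumptions, not figures from these slides.

```python
# AMAT = hit time + miss rate x miss penalty, applied level by level.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

dram_access = 100                 # assumed main-memory access time, cycles
l2 = amat(10, 0.20, dram_access)  # L2: 10-cycle hit, 20% miss -> 30 cycles
l1 = amat(1, 0.05, l2)            # L1: 1-cycle hit, 5% miss -> 2.5 cycles
print(round(l1, 2))               # -> 2.5
```

Most accesses hit in the small fast level, so the effective access time stays close to the fastest technology's.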
Characteristics of the Memory Hierarchy
[Diagram: Processor at the top, then L1$, L2$, Main Memory, and Secondary Memory, with increasing distance from the processor in access time and increasing (relative) size of the memory at each level. Transfer units: 4-8 bytes (word) between processor and L1$; 8-32 bytes (block) between L1$ and L2$; 1 to 4 blocks between L2$ and Main Memory; 1,024+ bytes (disk sector = page) between Main Memory and Secondary Memory.]
Inclusive: what is in L1$ is a subset of what is in L2$, which is a subset of what is in Main Memory, which is a subset of what is in Secondary Memory.
Memory Hierarchy Technologies
Caches use SRAM for speed and technology compatibility
- Low density (6-transistor cells), high power, expensive, fast
- Static: content will last "forever" (until power is turned off)
Main Memory uses DRAM for size (density)
- High density (1-transistor cells), low power, cheap, slow
- Dynamic: needs to be "refreshed" regularly (~ every 8 ms)
  - 1% to 2% of the active cycles of the DRAM
- Addresses divided into 2 halves (row and column)
  - RAS or Row Access Strobe triggering the row decoder
  - CAS or Column Access Strobe triggering the column selector
[Diagram: a 2M x 16 SRAM with a 21-bit Address input, 16-bit Din[15-0] and Dout[15-0] data lines, and Chip select, Output enable, and Write enable controls.]
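The "1% to 2% of the active cycles" refresh overhead can be sanity-checked with a back-of-the-envelope model; the row count and per-row refresh time below are assumed, vintage-DRAM-style numbers, not values from the slides.

```python
# Every row must be refreshed once per refresh interval (~8 ms on the slide).
rows = 1024             # rows in the cell array (assumption)
t_row_refresh = 100e-9  # time to refresh one row, ~100 ns (assumption)
interval = 8e-3         # refresh interval from the slide, 8 ms

overhead = rows * t_row_refresh / interval
print(f"{overhead:.1%}")  # -> 1.3%
```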
Memory Performance Metrics
Latency: time to access one word
- Access time: time between the request and when the data is available (or written)
- Cycle time: time between requests
- Usually cycle time > access time
- Typical read access times for SRAMs in 2004 are 2 to 4 ns for the fastest parts to 8 to 20 ns for the typical largest parts
Bandwidth: how much data from the memory can be supplied to the processor per unit time
- width of the data channel x the rate at which it can be used
Size: DRAM/SRAM ratio of 4 to 8
Cost & cycle time: SRAM/DRAM ratio of 8 to 16
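The bandwidth rule above is just width times rate; the channel width and transfer rate in this sketch are assumed example numbers, not from the slides.

```python
width_bytes = 8             # 64-bit data channel (assumption)
transfers_per_sec = 200e6   # 200 M transfers per second (assumption)

bandwidth = width_bytes * transfers_per_sec  # bytes per second
print(bandwidth / 1e9)      # -> 1.6 (GB/s)
```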
Classical RAM Organization (~Square)
[Diagram: the row address drives a Row Decoder that selects one word (row) line of the RAM Cell Array; the bit (data) lines feed a Column Selector & I/O Circuits block driven by the column address, producing the data bit or word.]
Each intersection represents a 6-T SRAM cell or a 1-T DRAM cell.
One memory row holds a block of data, so the column address selects the requested bit or word from that block.
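The row/column split is plain bit manipulation on the address; the 10/10 split for a 1024 x 1024 cell array below is an assumed example.

```python
ROW_BITS, COL_BITS = 10, 10  # ~square array: 1024 rows x 1024 columns

def split_address(addr):
    row = addr >> COL_BITS               # upper bits drive the row decoder
    col = addr & ((1 << COL_BITS) - 1)   # lower bits drive the column selector
    return row, col

addr = (819 << COL_BITS) | 341
print(split_address(addr))  # -> (819, 341)
```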
Classical DRAM Organization (~Square Planes)
[Diagram: M stacked RAM Cell Array planes, each with a Row Decoder driven by the row address and a Column Selector & I/O Circuits block driven by the column address; each plane contributes one data bit of the M-bit data word.]
Each intersection represents a 1-T DRAM cell.
The column address selects the requested bit from the row in each plane.
Classical DRAM Operation
DRAM organization: N rows x N columns x M-bit planes
- Read or write M bits at a time
- Each M-bit access requires a RAS / CAS cycle
[Timing diagram: RAS is asserted with the Row Address, then CAS with the Col Address, producing the M-bit output; the full Row Address / Col Address sequence repeats for the 2nd M-bit access, so the cycle time covers one complete RAS/CAS pair per access.]
Page Mode DRAM Operation
Page Mode DRAM: adds an N x M SRAM to save a row
After a row is read into the SRAM "register"
- only CAS is needed to access other M-bit words on that row
- RAS remains asserted while CAS is toggled
[Timing diagram: RAS is asserted once with the Row Address; CAS then toggles through four successive Col Addresses, delivering the 1st through 4th M-bit accesses within a single RAS cycle.]
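The payoff shows up directly in cycle counts. Using the access times from the worked examples later in this deck (25 cycles for a full RAS/CAS access, 8 cycles for a CAS-only access on an open row):

```python
words = 4        # words read from one open row
full_cycle = 25  # full RAS/CAS access, cycles
cas_only = 8     # CAS-only access on an open row, cycles

classical = words * full_cycle                   # a RAS/CAS pair for every word
page_mode = full_cycle + (words - 1) * cas_only  # one RAS, then CAS toggles
print(classical, page_mode)  # -> 100 49
```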
Synchronous DRAM (SDRAM) Operation
After a row is read into the SRAM register
- input CAS as the starting "burst" address, along with a burst length
- transfers a burst of data from a series of sequential addresses within that row
  - a clock controls the transfer of successive words in the burst (300 MHz in 2004)
[Timing diagram: RAS is asserted with the Row Address; a single Col Address starts the burst, and the 1st through 4th M-bit words are transferred on successive clocks while the column address is incremented internally (+1).]
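A toy model of the burst: after the row is latched, one starting column address plus a burst length yields consecutive words, one per clock. The row contents here are invented for illustration.

```python
def burst_read(row_buffer, start_col, burst_length):
    # successive words come from sequential column addresses within the row;
    # the column address is incremented internally on each clock
    return [row_buffer[col] for col in range(start_col, start_col + burst_length)]

row_buffer = list(range(100, 116))   # a 16-word row latched into the SRAM register
print(burst_read(row_buffer, 4, 4))  # -> [104, 105, 106, 107]
```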
Other DRAM Architectures
Double Data Rate SDRAMs - DDR-SDRAMs (and DDR-SRAMs)
- Double data rate because they transfer data on both the rising and falling edges of the clock
- Are the most widely used form of SDRAMs
DDR2-SDRAMs
<http://tinyurl.com/yyw9k9> ->
http://www.corsairmemory.com/corsair/products/tech/memory_basics/153707/main.swf
DRAM Memory Latency & Bandwidth
In the time that the memory-to-processor bandwidth doubles, the memory latency improves by a factor of only 1.2 to 1.4.
To deliver such high bandwidth, the internal DRAM has to be organized as interleaved memory banks.

                  DRAM   Page   FastPage  FastPage  Synch   DDR
                         DRAM   DRAM      DRAM      DRAM    SDRAM
Module width      16b    16b    32b       64b       64b     64b
Year              1980   1983   1986      1993      1997    2000
Mb/chip           0.06   0.25   1         16        64      256
Die size (mm2)    35     45     70        130       170     204
Pins/chip         16     16     18        20        54      66
Bandwidth (MB/s)  13     40     160       267       640     1600
Latency (ns)      225    170    125       75        62      52

Patterson, CACM Vol 47, #10, 2004
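Reading the improvement factors straight off the table confirms the bandwidth/latency imbalance:

```python
# Data reproduced from the table (Patterson, CACM Vol 47, #10, 2004).
bandwidth_mb_s = [13, 40, 160, 267, 640, 1600]  # 1980 .. 2000
latency_ns = [225, 170, 125, 75, 62, 52]

bw_gain = bandwidth_mb_s[-1] / bandwidth_mb_s[0]   # ~123x over 20 years
lat_gain = latency_ns[0] / latency_ns[-1]          # ~4.3x over the same period
print(round(bw_gain), round(lat_gain, 1))  # -> 123 4.3
```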
Memory Systems that Support Caches
The off-chip interconnect and memory architecture can affect overall system performance in dramatic ways.
[Diagram: on-chip CPU and Cache connected over a bus (32-bit data and 32-bit address per cycle) to Memory.]
One-word-wide organization (one-word-wide bus and one-word-wide memory). Assume:
- 1 clock cycle to send the address
- 25 clock cycles for DRAM cycle time, 8 clock cycles access time
- 1 clock cycle to return a word of data
Memory-bus to cache bandwidth = number of bytes accessed from memory and transferred to cache/CPU per clock cycle.
One Word Wide Memory Organization
If the block size is one word, then for a memory access due to a cache miss, the pipeline will have to stall for the number of cycles required to return one data word from memory:
- 1 cycle to send the address
- 25 cycles to read DRAM
- 1 cycle to return the data
- 27 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: 4/27 = 0.148 bytes per clock
One Word Wide Memory Organization, con't
What if the block size is four words?
- 1 cycle to send the 1st address
- 4 x 25 = 100 cycles to read DRAM (a full 25-cycle access per word)
- 1 cycle to return the last data word
- 102 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/102 = 0.157 bytes per clock
One Word Wide Memory Organization, con't
What if the block size is four words and a fast page mode DRAM is used?
- 1 cycle to send the 1st address
- 25 + 3 x 8 = 49 cycles to read DRAM (25 cycles for the first word, then 8 cycles for each CAS-only access)
- 1 cycle to return the last data word
- 51 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/51 = 0.314 bytes per clock
Interleaved Memory Organization
[Diagram: on-chip CPU and Cache connected over the bus to four memory banks, Memory bank 0 through Memory bank 3, each with a 25-cycle access time.]
For a block size of four words:
- 1 cycle to send the 1st address
- 25 + 3 = 28 cycles to read DRAM (the banks' 25-cycle accesses overlap; after the first completes, the remaining three deliver one word per cycle)
- 1 cycle to return the last data word
- 30 total clock cycles miss penalty
Number of bytes transferred per clock cycle (bandwidth) for a single miss: (4 x 4)/30 = 0.533 bytes per clock
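The three four-word-block organizations worked through above can be compared in one place, using the cycle counts from the slides (1 cycle to send the address, 1 cycle to return the last word, 25-cycle DRAM access, 8-cycle fast-page access):

```python
def miss_penalty(dram_cycles):
    return 1 + dram_cycles + 1   # send address + DRAM time + return last word

block_bytes = 4 * 4              # four 4-byte words
organizations = {
    "one word wide":        miss_penalty(4 * 25),      # full access per word
    "fast page mode":       miss_penalty(25 + 3 * 8),  # one RAS, then CAS-only
    "4 interleaved banks":  miss_penalty(25 + 3),      # overlapped bank accesses
}
for name, cycles in organizations.items():
    print(name, cycles, round(block_bytes / cycles, 3))
# -> one word wide 102 0.157
#    fast page mode 51 0.314
#    4 interleaved banks 30 0.533
```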
DRAM Memory System Summary
It's important to match the cache characteristics
- caches access one block at a time (usually more than one word)
with the DRAM characteristics
- use DRAMs that support fast multiple-word accesses, preferably ones that match the block size of the cache
and with the memory-bus characteristics
- make sure the memory bus can support the DRAM access rates and patterns
- with the goal of increasing the memory-bus to cache bandwidth