Memory Technology
Computer Science 104 Alvin R. Lebeck
2 © Alvin R. Lebeck
Administrivia
• Midterm II next Monday
• HW 5 due Wednesday
• Three more homeworks
• Simulator (y86): a longer assignment that overlaps with two other homeworks; groups of 2
Today's Lecture
• Memory outline: review, big picture of memory, memory technology (SRAM, DRAM), caches
• Reading: 6.1-6.4
The Five Stages of mrmovl
• Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
• Reg/Dec: Register Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• WrB: Write the data back to the register file

[Figure: mrmovl moving through the five pipeline stages, one per cycle (Cycles 1-5: Ifetch, Reg/Dec, Exec, Mem, WrB).]
Pipeline Stages
• Fetch: select current PC, read instruction, compute incremented PC
• Decode: read program registers
• Execute: operate ALU
• Memory: read or write data memory
• Write Back: update register file
Controlling a Pipeline: Finite State Diagram

[Figure: finite state diagram. From Ifetch control moves to Rfetch/Decode, then branches by instruction type: Rtype through RExec and Rfinish; Ori through OriExec and OriFinish; mrmovl and rmmovl through AdrCal and then Mr_mem or Rm_mem (with MR_wr for mrmovl); je through BrComplete; Wait states fill in as needed.]
Where Are We?

[Figure: the system stack. Software (Application, Compiler, Operating System) sits above the interface between HW and SW: the Instruction Set Architecture, Memory, and I/O. Hardware (Firmware, CPU, I/O system, Memory, Digital Design, Circuit Design) sits below. You are here: Memory.]
Review: Program's View of Memory
• Memory is a large linear array of bytes. Each byte has a unique address (location), e.g., a byte of data at address 0x100 and another at 0x101.
• Most computers have instructions with byte (8-bit) addressing.
• Data may have to be aligned on a word (4 byte) or double word (8 byte) boundary: an int is 4 bytes; a double precision floating point is 8 bytes.
• 32-bit vs. 64-bit addresses: we will assume 32-bit for the rest of the course, unless otherwise stated.

[Figure: memory drawn as an array of byte addresses from 0 to 2^n - 1, each holding one byte of data; word addresses run 0, 4, ..., 2^n - 4.]
SEQ+ Hardware

[Figure: the SEQ+ datapath with a naïve view of memory: both the instruction memory and the data memory are labeled IDEAL.]
Question
• What issues do we need to worry about in implementing the memory system?
System Organization

[Figure: the CPU and its Cache connect over the Memory Bus to Memory and, through an I/O Bridge (core chip set), to the I/O Bus, which hosts a Disk Controller (with Disks), a Graphics Controller, and a Network Interface; interrupts flow back to the CPU. This is the setting for the memory hierarchy.]
Processor and Caches

[Figure: the Processor / Core (Datapath, Registers, Control) sits next to the Level One Cache, backed by the Level Two and Level Three Caches, which connect to main memory. We will talk more about caches later.]
Main Memory

Why is it called DRAM?

[Figure: the Memory Controller drives the Memory Bus, which connects to DIMM slots 0 through 7; each slot can hold a DRAM DIMM carrying multiple DRAM chips. Data returns over the bus to the processor.]
Memory Technology
• Random Access: "Random" is good: access time is the same for all locations.
  DRAM (Dynamic Random Access Memory):
  » High density, low power, cheap, slow
  » Dynamic: needs to be "refreshed" regularly
  » Used for main memory
  SRAM (Static Random Access Memory):
  » Low density, high power, expensive, fast
  » Static: content will last "forever" (until power loss)
  » Used for caches
• "Not-so-random" Access Technology: access time varies from location to location and from time to time. Examples: Disk, DVD/CD.
• Sequential Access Technology: access time linear in location (e.g., Tape).
Random Access Memory (RAM) Technology
• Why do computer professionals need to know about RAM technology? Processor performance is usually limited by memory latency and bandwidth.
  Latency: the time it takes to access a single word in memory.
  Bandwidth: the average speed of access to memory (words/sec).
  As integrated circuit (IC) densities increase, lots of memory will fit on the processor chip.
  » Tailor on-chip memory to specific needs: instruction cache, data cache, write buffer.
• What makes RAM different from a bunch of flip-flops?
  Density: RAM is much more dense.
  Speed: RAM access is slower than flip-flop (register) access.
Technology Historical Trends

DRAM capacity and cycle time by year:
Year  Size     Cycle Time
1980  64 Kb    250 ns
1983  256 Kb   220 ns
1986  1 Mb     190 ns
1989  4 Mb     165 ns
1992  16 Mb    145 ns
1995  64 Mb    120 ns
1999  128 Mb   100 ns
2003  256 Mb   100 ns
2007  2 Gb     55 ns
2010  2 Gb     20 ns

        Capacity         Speed
Logic:  2x in 3 years    2x in 3 years
DRAM:   4x in 3 years    1.4x in 10 years
Disk:   2x in 3 years    1.4x in 10 years

The capacity improvement compounds to 1000:1 while speed improves only 2:1.
Static RAM Cell
6-Transistor SRAM Cell: two cross-coupled inverters hold the stored value; the word line (row select) connects the cell to the bit and bit-bar lines.
• Write: 1. Drive the bit lines (bit = 1, bit-bar = 0). 2. Select the row.
• Read: 1. Precharge bit and bit-bar to Vdd (set to 1). 2. Select the row. 3. The cell pulls one line low (to 0). 4. A sense amp on the column detects the difference between bit and bit-bar.
Typical SRAM Organization: 16-word x 4-bit

[Figure: a 16 x 4 array of SRAM cells. An address decoder on A0-A3 drives word lines Word 0 through Word 15; each of the 4 columns has a write driver & precharger (inputs Din 0-3, controls WrEn and Precharge) and a sense amp producing Dout 0-3.]
Logic Diagram of a Typical SRAM
A 2^N words x M bit SRAM has an N-bit address input A, an M-bit data pin D, and control inputs WE_L and OE_L.
• Write Enable is usually active low (WE_L).
• Din and Dout are combined into D to save pins, so a new control signal, output enable (OE_L), is needed:
  WE_L asserted (low), OE_L deasserted (high): » D serves as the data input pin.
  WE_L deasserted (high), OE_L asserted (low): » D is the data output pin.
  Both WE_L and OE_L asserted: » the result is unknown. Don't do that!!!
Introduction to DRAM
• Dynamic RAM (DRAM):
  Refresh required
  Very high density
  Low power (0.1 - 0.5 W active, 0.25 - 10 mW standby)
  Low cost per bit
  Pin sensitive (few pins):
  » Output Enable (OE_L)
  » Write Enable (WE_L)
  » Row address strobe (RAS)
  » Column address strobe (CAS)

[Figure: an N x N cell array addressed by log2(N) multiplexed row/column address bits, with sense amps and column addressing feeding the data pin D; controls WE_L and OE_L.]
1-Transistor Memory Cell (DRAM)
• Write: 1. Drive the bit line. 2. Select the row.
• Read: 1. Precharge the bit line to Vdd (1). 2. Select the row. 3. The cell and bit line share charges. » Very small voltage changes on the bit line. 4. Sense (fancy sense amp). » Can detect changes of ~1 million electrons. 5. Write: restore the value.
• Refresh: 1. Just do a dummy read to every cell.

[Figure: one transistor per cell, gated by the row select line onto the bit line.]
Classical DRAM Organization (square)

[Figure: a square RAM cell array; a row decoder driven by the row address selects one word (row) line, and Sense-Amps, Column Selector & I/O Circuits driven by the column address select among the bit (data) lines.]

• Row and column address together select 1 bit at a time.
• Each intersection represents a 1-T DRAM cell.
Typical DRAM Organization
• Typical DRAMs access multiple bits in parallel.
  Example: 2 Mb DRAM = 256K x 8 = 512 rows x 512 cols x 8 bits.
  Row and column addresses are applied to all 8 planes in parallel.

[Figure: one "plane" of a 256 Kb DRAM is 512 rows x 512 cols; planes 0 through 7 drive data bits D<0> through D<7>.]
Increasing Bandwidth - Interleaving

• Access pattern without interleaving: the CPU starts the access for D1, waits one full memory cycle time until D1 is available, and only then starts the access for D2.
• Access pattern with 4-way interleaving: consecutive words are spread across Memory Banks 0-3, so the CPU starts accesses to Bank 0, Bank 1, Bank 2, and Bank 3 back to back; by then Bank 0's cycle time has elapsed and we can access Bank 0 again.
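The overlap above can be sketched with a rough timing model (an assumption for illustration; the 100 ns access time and 10 ns bus time are made-up numbers, not from the slides):

```python
def no_interleave(n_words, access=100, bus=10):
    # Without interleaving, each access waits for the previous one
    # to completely finish (access time + bus transfer).
    return n_words * (access + bus)

def interleave(n_words, n_banks, access=100, bus=10):
    # Consecutive words live in consecutive banks, so accesses overlap.
    # A new bank can be started every access/n_banks ns (if the bus keeps
    # up), and only the first access pays the full latency.
    stagger = max(bus, access / n_banks)
    return access + (n_words - 1) * stagger

no_interleave(4)   # 440 ns: four fully serialized accesses
interleave(4, 4)   # 175.0 ns: the four banks overlap their work
```

The point of the sketch is that interleaving improves bandwidth, not latency: the first word still takes the full access time.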
MICRON 2Gb DRAM (512Mx4, circa 2010)
Fast Memory Systems: DRAM Specific
• Modern DRAMs:
  Synchronous DRAM (SDRAM): provides a clock signal to the DRAM; transfers are synchronous to the system clock.
  Double Data Rate DRAM (DDR, DDR2, DDR3):
  » Transfers data on both clock edges
  Also RAMBUS:
  » Each chip is a module vs. a slice of memory
  » Short bus between CPU and chips
  » Does its own refresh
  » Variable amount of data returned
  » 1 byte / 2 ns (500 MB/s per chip)
Summary of Memory Technology
• DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system. Uses one transistor per cell; must be refreshed.
• SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time. Uses six transistors per cell; holds state as long as power is supplied.
• GOAL: present the user with large amounts of memory using the cheapest technology, while providing access at the speed offered by the fastest technology.
• Next up: caches
Issues for Memory Systems
• Capacity/Size
• Cost: what technology is cheap?
• Performance: what technology is fast?
• Ease of Use: how much do programmers have to worry about it?
Cache
• What is a cache?
• What is the motivation for a cache?
• Why do caches work?
• How do caches work?
The Motivation for Caches
• Motivation: large memories (DRAM) are slow; small memories (SRAM) are fast.
• Make the average access time small by servicing most accesses from a small, fast memory.
• Reduce the bandwidth required of the large memory.

[Figure: the Processor talks to a Memory System consisting of a Cache backed by DRAM.]
Levels of the Memory Hierarchy

Level              Capacity       Access Time     Cost                   Staging Xfer Unit
Registers          100s of bytes  ~1 ns                                  instr. operands, 1-8 bytes (prog./compiler)
Cache              KB - MB        1-100 ns        ~$.0005/bit            blocks, 8-128 bytes (cache controller)
Main Memory        GB             100 ns - 1 us   ~$.00001/bit           pages, 512 B - 4 KB (OS)
Disk/Network       GB - TB        ms              10^-3 - 10^-4 cents    files, MB (user/operator)
Tape/DVD/Network   TB             sec - min       10^-6 cents

Upper levels are faster; lower levels are larger.
The Principle of Locality
• The Principle of Locality: programs access a relatively small portion of the address space at any instant of time. Example: 90% of time in 10% of the code.
• Two different types of locality:
  Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.

[Figure: probability of reference plotted over the address space from 0 to 2^n, concentrated in a few sharp peaks.]
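Spatial locality can be made concrete by counting how many distinct cache blocks an access pattern touches (a small sketch; the 32-byte block size and the access patterns are assumptions for illustration):

```python
BLOCK = 32  # bytes per cache block (assumed; matches the 32-byte blocks used later)

def blocks_touched(addresses, block=BLOCK):
    # Fewer distinct blocks means more accesses are absorbed by
    # spatial locality: a block fetched once serves many accesses.
    return len({addr // block for addr in addresses})

seq     = [4 * i for i in range(256)]    # 256 sequential 4-byte ints
strided = [128 * i for i in range(256)]  # 256 ints spaced 128 bytes apart

blocks_touched(seq)      # 32 blocks: eight ints share each block
blocks_touched(strided)  # 256 blocks: every access lands in a new block
```

Both patterns make 256 accesses, but the sequential one pulls in 8x fewer blocks, which is exactly what a cache with multi-byte blocks exploits.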
Memory Hierarchy: Principles of Operation
• At any given time, data is copied between only 2 adjacent levels:
  Upper Level (Cache): the one closer to the processor. » Smaller, faster, and uses more expensive technology.
  Lower Level (Memory): the one further away from the processor. » Bigger, slower, and uses less expensive technology.
• Block: the minimum unit of information that can either be present or not present in the two-level hierarchy.

[Figure: blocks Blk X and Blk Y move between the Upper Level (Cache) and the Lower Level (Memory), to and from the processor.]
Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X).
  Hit Rate: the fraction of memory accesses found in the upper level.
  Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
• Miss: data needs to be retrieved from a lower level (Block Y).
  Miss Rate = 1 - (Hit Rate)
  Miss Penalty = time to replace a block in the upper level + time to deliver the block to the processor.
• Hit Time << Miss Penalty
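These terms combine into the standard average memory access time formula (the formula is standard; the example numbers below are assumptions, not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
    return hit_time + miss_rate * miss_penalty

# Assumed numbers: 1 ns hit time, 5% miss rate, 100 ns miss penalty.
amat(1, 0.05, 100)  # 6.0 ns on average, even though 95% of accesses take 1 ns
```

This is why Hit Time << Miss Penalty matters: a few percent of misses can dominate the average.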
Direct Mapped Cache
• Direct mapped cache: an array of fixed-size frames.
• Each frame holds consecutive bytes of main memory data (a block).
• The tag array holds the block memory address.
• A valid bit associated with each cache block tells if the data is valid.

Cache-Index = (<Address> mod Cache_Size) / Block_Size
Block-Offset = <Address> mod Block_Size
Tag = <Address> / Cache_Size

• Cache Index: the location of a block (and its tag) in the cache.
• Block Offset: the byte location in the cache block.
The Simplest Cache: Direct Mapped Cache

[Figure: a 4-byte direct mapped cache with 1-byte blocks (cache indices 0-3) in front of a 16-byte memory (addresses 0-F).]

• Location 0 can be occupied by data from memory location 0, 4, 8, ... etc. In general: any memory location whose 2 LSBs of the address are 0s. Address<1:0> => cache index.
• Which one should we place in the cache?
• How can we tell which one is in the cache?
Direct Mapped Cache (Cont.)
For a cache of 2^M bytes with a block size of 2^L bytes:
  There are 2^(M-L) cache blocks.
  The lowest L bits of the address are the Block-Offset bits.
  The next (M - L) bits are the Cache-Index.
  The remaining (32 - M) bits are the Tag bits.

Data Address: | 32-M bits: Tag | M-L bits: Cache Index | L bits: Block Offset |
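The bit slicing above can be sketched directly (a minimal sketch; the example address 0x14025 is made up to land on the tag 0x50 and index 0x01 used in the later slides):

```python
def split_address(addr, m, l):
    # For a 2**m-byte direct-mapped cache with 2**l-byte blocks:
    offset = addr & ((1 << l) - 1)               # lowest l bits
    index  = (addr >> l) & ((1 << (m - l)) - 1)  # next m-l bits
    tag    = addr >> m                           # remaining 32-m bits
    return tag, index, offset

# 1 KB cache (m=10) with 32 B blocks (l=5):
split_address(0x14025, m=10, l=5)  # (0x50, 0x01, 0x05)
```

Note the divisions and mods in the slide's formulas become shifts and masks because every quantity is a power of two.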
Example: 1-KB Cache with 32B blocks
Cache Index = (<Address> mod 1024) / 32
Block-Offset = <Address> mod 32
Tag = <Address> / 1024
(1 K = 2^10 = 1024; 2^5 = 32)

[Figure: the direct mapped cache data array holds 32 cache blocks of 32 bytes each (Byte 0 ... Byte 31), with a 22-bit cache tag and a valid bit per block. Address: | 22 bits: Tag | 5 bits: Cache Index | 5 bits: Block Offset |]
Example: 1KB Direct Mapped Cache with 32B Blocks
• For a 1024 (2^10) byte cache with 32-byte blocks:
  The uppermost 22 = (32 - 10) address bits are the Cache Tag.
  The lowest 5 address bits are the Byte Select (Block Size = 2^5).
  The next 5 address bits (bit 5 - bit 9) are the Cache Index.

[Figure: address bits 31-10 form the Cache Tag (example: 0x50), bits 9-5 the Cache Index (example: 0x01), bits 4-0 the Byte Select (example: 0x00). The valid bit and cache tag are stored as part of the cache "state", alongside the cache data (Byte 0 ... Byte 1023 across 32 blocks).]
Example: 1K Direct Mapped Cache

[Figure: an access with Cache Tag 0x0002fe, Cache Index 0x00, Byte Select 0x00. The entry at index 0 is invalid (valid bit 0), so the tag comparison fails: Cache Miss.]
Example: 1K Direct Mapped Cache

[Figure: after the miss, the new block of data is filled into index 0, its valid bit is set to 1, and its cache tag is set to 0x0002fe.]
Example: 1K Direct Mapped Cache

[Figure: an access with Cache Tag 0x000050, Cache Index 0x01, Byte Select 0x08. The entry at index 1 is valid and its stored tag 0x000050 matches: Cache Hit.]
Example: 1K Direct Mapped Cache

[Figure: an access with Cache Tag 0x002450, Cache Index 0x02, Byte Select 0x04. The entry at index 2 is valid but its stored tag 0x004440 does not match: Cache Miss.]
Example: 1K Direct Mapped Cache

[Figure: after the miss, the new block of data replaces the old block at index 2, and that entry's cache tag is updated to 0x002450.]
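The miss/fill/hit behavior walked through in these examples can be replayed with a minimal direct-mapped cache model (a sketch; the concrete addresses are assumptions chosen to land on the tag and index values the slides use):

```python
class DirectMappedCache:
    # 1 KB cache as in the examples: 32 frames of 32-byte blocks,
    # each frame holding a valid bit and a tag.
    def __init__(self, n_frames=32, block=32):
        self.block, self.n = block, n_frames
        self.valid = [False] * n_frames
        self.tag = [0] * n_frames

    def access(self, addr):
        blk = addr // self.block        # which memory block this byte is in
        index = blk % self.n            # which cache frame it maps to
        tag = blk // self.n             # what must match for a hit
        hit = self.valid[index] and self.tag[index] == tag
        if not hit:                     # miss: fetch the block, update state
            self.valid[index] = True
            self.tag[index] = tag
        return hit

c = DirectMappedCache()
c.access(0x14020)   # cold miss: fills index 0x01 with tag 0x50
c.access(0x14028)   # same block, different byte: hit
c.access(0x914020)  # same index, different tag (0x2450): conflict miss, evicts
```

The third access shows the direct-mapped conflict the slides illustrate: two blocks that map to the same index cannot coexist, so the new one evicts the old.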
Intel Core i7 Cache Hierarchy