Topic: Sequential Circuits, Latches and Flip-Flops (UNIT IV)
Memory Subsystems, SRAM, DRAM, Memory Compiler, Clock, Latency, Flip-Flops, Latches
Transcript retrieved 7/27/2019
CS250 VLSI Systems Design
Lecture 8: Memory
John Wawrzynek, Krste Asanovic, with John Lazzaro and Yunsup Lee (TA)
UC Berkeley, Fall 2010
CMOS Bistable
Cross-coupled inverters are used to hold state in CMOS.
Static storage in a powered cell; no refresh needed.
If a storage node leaks or is pushed slightly away from its correct value, the non-linear transfer function of the high-gain inverter removes the noise and recirculates the correct value.
To write new state, the nodes have to be forced to the opposite state.
[Figure: cross-coupled inverter pair with nodes D and D-bar; forcing the nodes from (1, 0) to (0, 1) flips the stored state]
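The regenerative behavior above can be sketched numerically. This is a toy model (not from the slides): two high-gain inverters in a loop, iterated to a fixed point, with voltages normalized to [0, 1] and a tanh-shaped transfer function as an assumption.

```python
import math

def inverter(v, gain=10.0):
    """High-gain inverting transfer function, output in [0, 1]."""
    return 0.5 * (1.0 - math.tanh(gain * (v - 0.5)))

def settle(d, d_bar, steps=50):
    """Let the cross-coupled pair recirculate until the nodes are restored."""
    for _ in range(steps):
        d, d_bar = inverter(d_bar), inverter(d)
    return d, d_bar

# A stored '1' disturbed by leakage/noise (0.8 instead of 1.0) is regenerated.
d, d_bar = settle(0.8, 0.2)
print(round(d), round(d_bar))  # the pair snaps back to full-rail 1 / 0
```

The high gain is what makes the fixed points at the rails stable: any disturbance smaller than the switching threshold is pushed back out.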
CMOS Transparent Latch
The latch is transparent (output follows input) when the clock is high, and holds the last value when the clock is low.
[Figure: transmission-gate latch with optional input and output buffers; schematic symbols shown for latches transparent on clock high and transparent on clock low]
A transmission-gate switch, with both pMOS and nMOS devices, passes both ones and zeros well.
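A minimal behavioral sketch of the transparent-high latch described above (an assumption at the behavioral level, not the slide's transistor netlist):

```python
class TransparentLatch:
    """Transparent-high latch: follows D while clk is high, holds while low."""
    def __init__(self):
        self.q = 0
    def update(self, clk, d):
        if clk:          # transparent: output follows input
            self.q = d
        return self.q    # opaque: holds last sampled value

latch = TransparentLatch()
print(latch.update(1, 1))  # clk high -> follows D: 1
print(latch.update(0, 0))  # clk low  -> holds: 1
print(latch.update(1, 0))  # clk high -> follows D: 0
```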
Latch Operation
[Figure: with the clock high the latch is transparent and Q follows D; with the clock low the latch holds its last value]
Flip-Flop as Two Latches
[Figure: master-slave flip-flop built from a sample latch followed by a hold latch, with its schematic symbol]
This is how standard-cell flip-flops are built (usually with extra input/output buffers).
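The two-latch structure can be sketched behaviorally. This is an assumption at the behavioral level: a master latch that samples while the clock is low feeding a slave latch that passes while the clock is high, which together behave as a positive-edge-triggered flip-flop.

```python
class Latch:
    """Transparent latch; passes D when clk equals transparent_when."""
    def __init__(self, transparent_when):
        self.q = 0
        self.transparent_when = transparent_when
    def update(self, clk, d):
        if clk == self.transparent_when:
            self.q = d
        return self.q

class FlipFlop:
    """Positive-edge-triggered flip-flop built from two back-to-back latches."""
    def __init__(self):
        self.master = Latch(transparent_when=0)  # samples D while clk is low
        self.slave = Latch(transparent_when=1)   # launches at the rising edge
    def update(self, clk, d):
        return self.slave.update(clk, self.master.update(clk, d))

ff = FlipFlop()
ff.update(0, 1)           # clk low: master samples 1, slave still holds 0
print(ff.update(1, 0))    # rising edge: slave outputs the sampled 1, not D=0
```

Because the master is opaque while the slave is transparent (and vice versa), D can only influence Q at the rising edge, which is exactly the edge-triggered behavior.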
Small Memories from Stdcell Latches
Add additional ports by replicating the read and write port logic (multiple write ports need a mux in front of each latch). It is expensive to add many ports.
[Figure: latch-based memory with a write address decoder and a read address decoder; data is held in transparent-low latches and written by clocking the selected latch; the read port is synthesized combinational logic with an optional read output latch]
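A behavioral sketch (an assumption, abstracting away the latch-level timing): the write decoder gates the clock to one row of latches, and the read port is a combinational mux over the stored rows.

```python
class LatchMemory:
    """Small memory: one latch row per word, combinational read mux."""
    def __init__(self, rows, width):
        self.rows = [[0] * width for _ in range(rows)]
    def write(self, clk, addr, data):
        if clk:                      # write by clocking the selected latch row
            self.rows[addr] = list(data)
    def read(self, addr):            # combinational read mux over all rows
        return list(self.rows[addr])

mem = LatchMemory(rows=4, width=8)
mem.write(1, 2, [1, 0, 1, 0, 1, 0, 1, 0])
print(mem.read(2))  # [1, 0, 1, 0, 1, 0, 1, 0]
```

Adding a read port here just means another independent `read` mux; adding a write port would require muxing between write data sources in front of each row, which is why many ports get expensive.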
6-Transistor SRAM (Static RAM)
Large on-chip memories are built from arrays of static RAM bit cells, where each bit cell holds a bistable (cross-coupled inverters) and two access transistors. Other clocking and access logic is factored out into the periphery.
[Figure: 6T SRAM cell with a wordline gating the access transistors onto the complementary Bit and Bit-bar lines]
Intel's 22nm SRAM Cell
[Die photos of the two SRAM cells]
0.092 um² SRAM cell for high-density applications; 0.108 um² SRAM cell for low-voltage applications.
[Bohr, Intel, Sept 2009]
General SRAM Structure
[Figure: SRAM array with address decode and wordline drivers on one side; bitline prechargers, differential write drivers, and differential read sense amplifiers at the array edge; inputs are Address, Write Data, Clk, and Write Enable; output is Read Data]
Usually a maximum of 128-256 bits per row or column.
Address Decoder Structure
[Figure: address bits A1A0 and A3A2 feed two 2:4 predecoders, each producing a unary 1-of-4 encoding; word lines 0-15 are formed by ANDing one line from each predecoder with the clocked wordline enable]
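The predecode structure above can be sketched in a few lines (a behavioral assumption, not gate-level logic): each 2-bit address slice becomes a unary 1-of-4 code, and each of the 16 wordlines is the AND of one line from each predecoder with the clocked enable.

```python
def predecode2(bits):
    """2:4 predecoder: a 2-bit value becomes a one-hot list of 4 lines."""
    return [1 if bits == i else 0 for i in range(4)]

def wordlines(addr, clk_enable=1):
    """AND one line from each predecoder with the clocked wordline enable."""
    lo = predecode2(addr & 0b11)         # A1A0
    hi = predecode2((addr >> 2) & 0b11)  # A3A2
    return [clk_enable & hi[i >> 2] & lo[i & 0b11] for i in range(16)]

wl = wordlines(9)
print(wl.index(1), sum(wl))  # only wordline 9 is asserted: 9 1
```

Predecoding reduces the fan-in of the final wordline gates: each of the 16 gates is a 3-input AND instead of a 5-input AND over raw address bits and enable.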
Read Cycle
1) Precharge bitlines and sense amp.
2) Pulse wordlines; develop a bitline differential voltage.
3) Disconnect bitlines from the sense amp; activate the sense pulldown to develop full-rail data signals.
The pulses are generated by internal self-timed signals, often using replica circuits that represent the critical paths.
[Figure: read path from the decoder through wordline clock, prechargers, storage cells, and sense amp to an output set-reset latch; waveforms show Clk, the Bit/Bit-bar differential, the Wordline pulse, Sense, and the full-rail Data/Data-bar outputs]
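The three phases can be sketched as a toy model (an assumption with normalized voltages, ignoring the self-timed pulse generation): precharge both bitlines high, let the accessed cell pull one side down slightly, then let the sense amp resolve the small differential to full rail.

```python
def read_bit(cell_value, wordline_pulse=0.2):
    """Toy three-phase SRAM read: precharge, develop differential, sense."""
    bit = bit_b = 1.0                      # 1) precharge both bitlines high
    if cell_value:                         # 2) wordline pulse: the cell pulls
        bit_b -= wordline_pulse            #    one bitline down slightly
    else:
        bit -= wordline_pulse
    return 1 if bit > bit_b else 0         # 3) sense amp resolves to full rail

print(read_bit(1), read_bit(0))  # 1 0
```

The point of sensing the small differential rather than waiting for a full-rail bitline swing is speed and energy: the bitlines are heavily loaded, so only a few hundred millivolts of swing are developed before the sense amp amplifies.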
Write Cycle
1) Precharge bitlines.
2) Open the wordline and pull one bitline down full rail.
Write-enable can be controlled at a per-bit level. If the bit lines are not driven during the write, the cell retains its value (the cycle looks like a read to the cell).
[Figure: write path from the decoder through wordline clock, prechargers, and write drivers (gated by Write Enable, driven by Write Data) to the storage cells; waveforms show Clk, Bit/Bit-bar, and the Wordline pulse]
Column-Muxing at Sense Amps
[Figure: two bitline-pair columns share one sense amp through select devices Sel0 and Sel1]
It is difficult to pitch-match a sense amp to the tight SRAM bit-cell spacing, so often 2-8 columns share one sense amp. This impacts power dissipation, as multiple bitline pairs swing for each bit read.
Building Larger Memories
14
Bit cellsDec
I/O
Bit cells
I/O
Bit cellsDec
Bit cells
Bit cellsDec
I/O
Bit cells
I/O
Bit cellsDec
Bit cells
Bit cellsDec
I/O
Bit cells
I/O
Bit cellsDec
Bit cells
Bit cellsDec
I/O
Bit cells
I/O
Bit cellsDec
Bit cells
Large arrays constructed bytiling multiple leaf arrays, sharingdecoders and I/O circuitry
e.g., sense amp attached toarrays above and below
Leaf array limited in size to128-256 bits in row/column dueto RC delay of wordlines andbitlines
Also to reduce power by onlyactivating selected sub-bank
In larger memories, delay andenergy dominated by I/O wiring
Adding More Ports
[Figure: a bit cell with two differential read-or-write ports (WordlineA with BitA/BitA-bar, WordlineB with BitB/BitB-bar) and an optional single-ended read port (Wordline with Read Bitline)]
Memory Compilers
In an ASIC flow, memory compilers are used to generate the layout for the SRAM blocks in a design. A modern SoC often contains hundreds of memory instances.
Memory generators can also produce built-in self-test (BIST) logic, to speed manufacturing testing, and redundant rows/columns to improve yield.
The compiler can be parameterized by number of words, number of bits per word, desired aspect ratio, number of sub-banks, degree of column muxing, etc.
Area, delay, and energy consumption are a complex function of the design parameters and the generation algorithm, so it is worth experimenting with the design space.
Usually only single-port (one read or write) and dual-port (one read and one write) SRAM generators are available in an ASIC library.
Small Memories
Compiled SRAM arrays usually have a high overhead due to peripheral circuits, BIST, and redundancy.
Small memories are usually built from latches and/or flip-flops in a stdcell flow.
The cross-over point is usually around 1K bits of storage, so you should try the design both ways.
Memory Design Patterns
Multiport Memory Design Patterns
Often we require multiple access ports to a common memory.
True Multiport Memory: as described earlier in the lecture, completely independent read and write port circuitry.
Banked Multiport Memory: interleave lesser-ported banks to provide higher bandwidth.
Stream-Buffered Multiport Memory: use a single wider access port to provide multiple narrower streaming ports.
Cached Multiport Memory: use a large single-port main memory, but add a cache to service each access port.
True Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a shared common memory.
Solution: Provide separate read and write ports to each bit cell for each requester.
Applicability: Where unpredictable access latency to the shared memory cannot be tolerated.
Consequences: High area, energy, and delay cost for a large number of ports. Must define the behavior when multiple writes arrive on the same cycle to the same word (e.g., prohibit, provide priority, or combine writes).
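A cycle-based sketch of the pattern (an assumption, modeling behavior rather than circuitry), using priority to resolve same-cycle writes to the same word, one of the policies listed above:

```python
class MultiportMemory:
    """True multiport memory: all reads and writes complete every cycle."""
    def __init__(self, words):
        self.mem = [0] * words
    def cycle(self, reads, writes):
        """reads: list of addresses; writes: list of (addr, data) pairs,
        ordered highest-priority first."""
        results = [self.mem[a] for a in reads]   # all reads happen in parallel
        seen = set()
        for addr, data in writes:                # priority: first write wins
            if addr not in seen:
                self.mem[addr] = data
                seen.add(addr)
        return results

m = MultiportMemory(8)
m.cycle(reads=[], writes=[(3, 111), (3, 222)])  # conflicting writes to word 3
print(m.cycle(reads=[3], writes=[]))            # [111] -- priority port won
```

Note there is no arbitration or stalling here: every port is serviced every cycle, which is exactly what the extra per-cell ports buy at the cost of area, energy, and delay.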
True Multiport Example: Itanium-2 Regfile
Intel Itanium-2 [Fetzer et al., IEEE JSSC 2002]
Itanium-2 Regfile Timing
[Figure: regfile timing diagram]
Banked Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory.
Solution: Divide the memory capacity into smaller banks, each of which has fewer ports. Requests are distributed across banks using a fixed hashing scheme. Multiple requesters arbitrate for access to the same bank/port.
Applicability: Requesters can tolerate variable latency for accesses. Accesses are distributed across the address space so as to avoid hotspots.
Consequences: Requesters must wait an arbitration delay to determine if a request will complete. Have to provide interconnect between each requester and each bank/port. Can have a greater, equal, or lesser number of banks*ports/bank compared to the total number of external access ports.
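A sketch of the pattern (an assumption; the fixed hash here is simply the low address bits selecting the bank, and a losing requester must retry):

```python
class BankedMemory:
    """Single-ported banks; at most one request per bank is granted per cycle."""
    def __init__(self, n_banks, words_per_bank):
        self.n_banks = n_banks
        self.banks = [[0] * words_per_bank for _ in range(n_banks)]
    def cycle(self, requests):
        """requests: list of (addr, data-or-None for a read). Returns a
        (granted, value) pair per request; conflicting requests are denied."""
        busy, out = set(), []
        for addr, data in requests:
            bank, index = addr % self.n_banks, addr // self.n_banks
            if bank in busy:
                out.append((False, None))   # lost arbitration: retry later
                continue
            busy.add(bank)
            if data is None:
                out.append((True, self.banks[bank][index]))
            else:
                self.banks[bank][index] = data
                out.append((True, None))
        return out

m = BankedMemory(4, 16)
m.cycle([(5, 42)])                          # write 42 to address 5
print(m.cycle([(5, None), (6, None)]))      # different banks: both granted
```

Addresses 5 and 6 hash to different banks, so both reads are granted in the same cycle; two requests to the same bank would make the second one wait, which is the variable latency the pattern's applicability clause warns about.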
Banked Multiport Memory
[Figure: Port A and Port B connect through arbitration and a crossbar to Bank 0 - Bank 3]
Banked Multiport Memory Example
[Figure: Pentium (P5) dual-access data cache — a dual-ported TLB and dual-ported cache tags with bank-conflict detection feed single-ported, 8-way interleaved cache data, providing two access ports per cycle]
[Alpert et al., IEEE Micro, May 1993]
Stream-Buffered Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory, where each requester usually makes multiple sequential accesses.
Solution: Organize the memory to have a single wide port. Provide each requester with an internal stream buffer that holds the width of data returned/consumed by each access. Each requester can access its stream buffer without contention, but arbitrates with the others to read/write the stream buffer from/to the wide memory.
Applicability: Requesters make mostly sequential requests and can tolerate variable latency for accesses.
Consequences: Requesters must wait an arbitration delay to determine if a request will complete. Have to provide a stream buffer for each requester. Need sufficient access width to serve the aggregate bandwidth demands of all requesters, but a wide data access can be wasted if not all of it is used by the requester. Have to specify the memory consistency model between ports (e.g., provide stream flush operations).
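A read-only sketch of the pattern (an assumption; writes and the flush operations mentioned above are omitted): a narrow port that misses in its stream buffer fetches one full wide row, and subsequent sequential reads hit the buffer without arbitration.

```python
class StreamPort:
    """Narrow read port backed by a one-row stream buffer over a wide memory."""
    def __init__(self, wide_mem, width):
        self.wide_mem = wide_mem     # list of rows, each `width` words wide
        self.width = width
        self.row, self.buf = None, []
    def read(self, addr):
        row = addr // self.width
        if row != self.row:          # miss: arbitrate for one wide access
            self.row = row
            self.buf = list(self.wide_mem[row])
        return self.buf[addr % self.width]

wide = [[10, 11, 12, 13], [20, 21, 22, 23]]
port = StreamPort(wide, width=4)
print([port.read(a) for a in range(4, 8)])  # one wide fetch serves 4 reads
```

Four sequential narrow reads cost a single wide-port access, which is where the bandwidth multiplication comes from; a non-sequential access pattern would thrash the buffer and waste most of each wide fetch.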
Stream-Buffered Multiport Memory
[Figure: Port A and Port B each have their own stream buffer and arbitrate for access to a single wide memory]
Stream-Buffered Multiport Examples
IBM Cell microprocessor local store [Chen et al., IBM, 2005]
Cached Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory.
Solution: Provide each access port with a local cache of recently touched addresses from the common memory, and use a cache coherence protocol to keep the cache contents in sync.
Applicability: Request streams have significant temporal locality and limited communication between different ports.
Consequences: Requesters will experience variable delay depending on the access pattern and the operation of the cache coherence protocol. Tag overhead in area, delay, and energy per access. Complexity of the cache coherence protocol.
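A deliberately tiny sketch of the pattern (an assumption: each port caches only its single most recent address, and a write through one port invalidates the other's copy, standing in for a real coherence protocol):

```python
class CachedPort:
    """Port with a one-entry cache over a shared memory; writes invalidate peers."""
    def __init__(self, mem, peers):
        self.mem, self.peers = mem, peers
        self.tag, self.data = None, None
    def read(self, addr):
        if self.tag != addr:                 # miss: variable-latency access
            self.tag, self.data = addr, self.mem[addr]
        return self.data                     # hit: served locally
    def write(self, addr, data):
        self.mem[addr] = data
        self.tag, self.data = addr, data
        for p in self.peers:                 # coherence: invalidate peer copies
            if p.tag == addr:
                p.tag = None

mem = [0] * 8
a, b = CachedPort(mem, []), CachedPort(mem, [])
a.peers, b.peers = [b], [a]
b.read(3)            # port B caches word 3
a.write(3, 99)       # port A's write invalidates B's stale copy
print(b.read(3))     # 99 -- B re-fetches the updated value
```

Even this toy shows the pattern's consequences: port B's second read is a miss purely because of port A's activity, so latency depends on the other ports' access patterns.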
Cached Multiport Memory
[Figure: Port A and Port B each have a local cache (Cache A, Cache B) connected through arbitration and interconnect to the common memory]
Replicated State Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a small shared common memory, and cannot tolerate variable access latency.
Solution: Replicate the storage and divide the read ports among the replicas. Each replica has enough write ports to keep all replicas in sync.
Applicability: Many read ports are required, and variable latency cannot be tolerated.
Consequences: Potential increase in latency between some writers and some readers.
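A behavioral sketch of the pattern (an assumption, ignoring the cross-copy write latency noted in the Consequences): every write goes to all copies, while the read ports are split between them, as in the Alpha 21264 regfile clusters.

```python
class ReplicatedRegfile:
    """Replicated-state register file: N copies, reads divided among copies."""
    def __init__(self, words, n_copies=2):
        self.copies = [[0] * words for _ in range(n_copies)]
    def write(self, addr, data):
        for copy in self.copies:     # every replica sees every write
            copy[addr] = data
    def read(self, port, addr):      # read ports divided among the replicas
        return self.copies[port % len(self.copies)][addr]

rf = ReplicatedRegfile(32)
rf.write(7, 123)
print(rf.read(0, 7), rf.read(1, 7))  # both copies agree: 123 123
```

Each copy only needs half the read ports, which is what makes the individual arrays smaller and faster; the price is duplicated storage and write bandwidth to every replica.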
Replicated State Multiport Memory
[Figure: Write Port 0 and Write Port 1 drive both Copy 0 and Copy 1; the read ports are divided between the two copies]
Example: Alpha 21264 regfile clusters.
Memory Hierarchy Design Patterns
Use a small fast memory together with a large slow memory to provide the illusion of a large fast memory.
Explicitly managed local stores
Automatically managed cache hierarchies