EE-382M VLSI–II Early Planning for Memory Array Design

Foil # 1 / 58 The University of Texas at Austin EE 382M Class Notes

Early Planning for Memory Array Design

EE-382M VLSI–II

Steven C. Sullivan
Gian Gerosa

Foil # 2 / 58 The University of Texas at Austin EE 382M Class Notes

Class Agenda

• Memory Hierarchy (6 foils)

• Memory Cell Types (9 foils)

• Basic Array Structure (5 foils)

• Bitline Segmentation (3 foils)

• Area Estimation (7 foils)

• Access Time & Power Estimation (4 foils)

• Clock & Power Distribution (4 foils)

Foil # 3 / 58 The University of Texas at Austin EE 382M Class Notes

Memory Hierarchy

Level            Access Time   Capacity
Register File    0.25-1 ns     0.5-1 KB
Level 1 Cache    1-4 ns        8-64 KB
Level 2 Cache    5-20 ns       256 KB-2 MB
Main Memory      35-50 ns      128-256 MB
Hard Drive       5-10 ms       10-50 GB

Memory hierarchy gives the appearance of large capacity and fast access time.

Foil # 4 / 58 The University of Texas at Austin EE 382M Class Notes

Processor-Memory Performance Gap

[Figure: relative performance vs. year, 1980-2007, on a log scale from 1 to 10000. Processor performance (µProc) improves ~60%/yr while DRAM improves ~7%/yr, so the performance gap grows about 50% per year; later segments of the processor curve are annotated 1.55X/yr and 1.35X/yr.]

The need for memory hierarchy is steadily increasing.

Foil # 5 / 58 The University of Texas at Austin EE 382M Class Notes

Memory Hierarchy Evolution

• 386: No on-die cache; the L1 cache sits on the motherboard, reached through the chipset along with DRAM.

• 486: Level 1 cache on-die; Level 2 on the motherboard behind the chipset.

• Pentium: Separate on-die instruction (I) and data (D) caches; L2 on the motherboard behind the chipset.

Foil # 6 / 58 The University of Texas at Austin EE 382M Class Notes

Memory Hierarchy Evolution

• Pentium II: Separate bus to the L2 cache in the same package.

• Pentium III: L2 cache on-die.

• Pentium 4 (Foster): L3 cache on-die, alongside the on-die L2.

Recent development: 3-D packaging allows more integration

Foil # 7 / 58 The University of Texas at Austin EE 382M Class Notes

P4

Foil # 8 / 58 The University of Texas at Austin EE 382M Class Notes

Functional Block Diagram

[Block diagram: an N-bit row address feeds the row decoder, which asserts one of the 2^N word lines into the 2^N x 2^M cell array. The 2^M bitlines feed the multiplexors and sense amplifiers; the (M-K)-bit column address feeds the column decoder, whose 2^(M-K) "1-hot" select lines choose 2^K columns that connect through the read/write buffer to the 2^K-bit data bus.]
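As a quick check on this address breakdown, here is a minimal Python sketch (the function name and the example N/M/K values are illustrative, not taken from the foil) that computes the block sizes for a 2^N x 2^M array:

```python
def array_dimensions(n_row_bits, m_col_bits, k_data_bits):
    """Block sizes for the generic 2^N x 2^M memory array diagram."""
    wordlines = 2 ** n_row_bits                       # 1-hot outputs of the row decoder
    bitlines = 2 ** m_col_bits                        # columns into the muxes / sense amps
    column_selects = 2 ** (m_col_bits - k_data_bits)  # "1-hot" outputs of the column decoder
    data_width = 2 ** k_data_bits                     # bits through the read/write buffer
    return wordlines, bitlines, column_selects, data_width

# Example: N = 7, M = 7, K = 5 -> 128 wordlines, 128 bitlines, 4:1 column muxing, 32-bit data
print(array_dimensions(7, 7, 5))
```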

Foil # 9 / 58 The University of Texas at Austin EE 382M Class Notes

Class Agenda

• Memory Hierarchy (6 foils)

• Memory Cell Types (9 foils)

• Basic Array Structure (4 foils)

• Bitline Segmentation (3 foils)

• Area Estimation (7 foils)

• Access Time & Power Estimation (4 foils)

• Clock & Power Distribution (4 foils)

Foil # 10 / 58 The University of Texas at Austin EE 382M Class Notes

Memory Cell Overview

• A memory cell array has the following capabilities:
  – A means of storing bits of information (storage elements)
  – A means of selecting the stored information (wordlines)
  – A means of transferring data to/from the storage elements (bitlines)

• The 1T/1C memory cell is the simplest implementation
  – Only requires 1 W/L and 1 B/L metalization

• The 6T SRAM cell consumes more area and requires true & complement bitlines, but is more stable and develops a sensing voltage faster than a DRAM cell

• Register file cells allow multiple entries to be accessed or written simultaneously
  – However, this requires multiple wordlines and bitlines and becomes metal-limited
  – Used for integer/floating-point registers, single- & multiple-cycle queues and buffers

Foil # 11 / 58 The University of Texas at Austin EE 382M Class Notes

Memory Cell Types

• Schematic of 1-T DRAM cell, 6T dual ended SRAM cell

[Schematics: a 1-transistor DRAM cell (access transistor gated by WL, connecting BL to the storage cap) and a 6-transistor SRAM cell (cross-coupled inverters with passgates gated by WL onto BL and #BL).]

1-transistor DRAM
• Industry standard DRAM cell
• Smallest area per bit
• Explicit storage capacitor
• Destructive READ

6-transistor SRAM
• Industry standard SRAM cell
• Used for FAST static arrays
• Cross-coupled inverters
• Non-destructive READ with proper stability analysis

Foil # 12 / 58 The University of Texas at Austin EE 382M Class Notes

6-transistor SRAM cell

[Schematic and layout: cross-coupled PFET/NFET inverters between VDD and GND, with NFET passgates from the storage nodes to BL and #BL, gated by WL. The 65 nm layout measures 1.0 μm by 0.68 μm.]

In 65 nm CMOS, a typical 6T bitcell area = 0.68 μm².

Foil # 13 / 58 The University of Texas at Austin EE 382M Class Notes

Multi-Port Memory Cell Types

[Schematic: storage nodes D/#D with a differential write port (WWL selecting WBL/#WBL) and a differential read port (RWL selecting RBL/#RBL).]

1 Read (DE), 1 Write (DE)

Foil # 14 / 58 The University of Texas at Austin EE 382M Class Notes

Multi-Port Memory Cell Types

[Schematic: storage nodes D/#D with a differential write port (WWL selecting WBL/#WBL) and a single-ended read port (RWL selecting RBL).]

1 Write (DE), 1 Read (SE)

Foil # 15 / 58 The University of Texas at Austin EE 382M Class Notes

Register File Multi-Ported Bitcell

[Schematic and layout: storage nodes D/#D with two differential write ports (WL0 with BL0/BL0B, WL1 with BL1/BL1B) and one single-ended read port (RWL driving RBL). Layout tracks include VDD, GND, wl0, wl1, rwl, bl0, bl0b, bl1, bl1b, and rbl.]

2 Write (DE), 1 Read (SE)

Foil # 16 / 58 The University of Texas at Austin EE 382M Class Notes

Multi-Port Memory Cell Types

[Schematic: storage nodes D/#D with a single-ended write port (WWL selecting WBL) and a differential read port (RWL selecting RBL/#RBL).]

1 Write (SE), 1 Read (DE)

Foil # 17 / 58 The University of Texas at Austin EE 382M Class Notes

Multi-Port Memory Cell Types

[Schematic: the same cell as on the previous foil, with a slight modification: a single-ended write port (WWL, WBL) and a differential read port (RWL, RBL/#RBL).]

1 Write (SE), 1 Read (DE); slight modification

Foil # 18 / 58 The University of Texas at Austin EE 382M Class Notes

Multi-Port Memory Cell Types

[Schematic: storage nodes D/#D with a single-ended write port (WWL selecting WBL) and two single-ended read ports (RWL0 driving RBL0, RWL1 driving RBL1).]

1 Write (SE), 2 Read (SE)

Foil # 19 / 58 The University of Texas at Austin EE 382M Class Notes

Relative Memory Cell Sizes
Dimensions in M1 pitches (assume the same M1 pitch for all cells).

Cell     WL Dir   BL Dir   Area
1T       1        1.5      1.5
4T       3        4        12
6T       4        6        24
4R/2W    9        9        81
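The Area column is simply the product of the two dimensions; a minimal Python sketch (values copied from the table above) that recomputes it:

```python
# Relative cell dimensions in M1 pitches: (wordline direction, bitline direction)
cells = {"1T": (1, 1.5), "4T": (3, 4), "6T": (4, 6), "4R/2W": (9, 9)}

for name, (wl_dir, bl_dir) in cells.items():
    area = wl_dir * bl_dir  # relative area in M1-pitch^2
    print(f"{name:6s} {wl_dir} x {bl_dir} = {area:g}")
```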

Foil # 20 / 58 The University of Texas at Austin EE 382M Class Notes

Class Agenda

• Memory Hierarchy (6 foils)

• Memory Cell Types (9 foils)

• Basic Array Structure (4 foils)

• Bitline Segmentation (3 foils)

• Area Estimation (7 foils)

• Access Time & Power Estimation (4 foils)

• Clock & Power Distribution (4 foils)

Foil # 21 / 58 The University of Texas at Austin EE 382M Class Notes

Array Design Choices

• Decoders
  – Predecoder & banked WL drivers - for a large number of rows
  – Hierarchical WL & WL repeaters - for a large number of columns

• Cells
  – Differential - for few ports and large array size
  – Single-ended - for many ports or small array size

• Bitlines
  – Hierarchical - for many rows & available higher metal
  – Serial - for a large number of rows & no higher metal

• Column Muxing
  – Differential - group by bit
  – Single-ended - group by entry

Foil # 22 / 58 The University of Texas at Austin EE 382M Class Notes

Basic Array Characteristics

• Array Size
  – Number of entries
  – Bits per entry

• Number of Ports
  – Number of simultaneous reads
  – Number of simultaneous writes

• Latency
  – Cycles from address to read data
  – Cycles from address to write completed

Foil # 23 / 58 The University of Texas at Austin EE 382M Class Notes

Basic Array Layout

[Figure: a grid of cells arranged in rows and columns. The address enters a pre-decoder and decoder that drive the wordlines (one per row); precharge circuits sit at the top of the bitlines (one per column); bitline receivers and write buffers at the bottom connect the columns to the read data and write data.]

Foil # 24 / 58 The University of Texas at Austin EE 382M Class Notes

Large Signal vs Small Signal Arrays

Small Signal Arrays
• Differential bitlines (Bit/Bit#)
• Dual-ended sense amplifier driving the data output

Large Signal Arrays
• Single-ended bitline (Bit#)
• Inverter threshold sense driving the data output

Foil # 25 / 58 The University of Texas at Austin EE 382M Class Notes

Large Signal vs Small Signal Arrays

• Small Signal Arrays:
  – DRAM and SRAM chips
  – Processor D-cache and I-cache

• Large Signal Arrays:
  – Processor register files
  – Multi-ported data structures

• Small Signal Arrays are less common because:
  – Sense amps require special characterization
  – More sensitive to noise
  – Area and timing overhead of the differential sense amp
  – May not scale well to low supply voltage

Foil # 26 / 58 The University of Texas at Austin EE 382M Class Notes

Class Agenda

• Memory Hierarchy (6 foils)

• Memory Cell Types (9 foils)

• Basic Array Structure (5 foils)

• Bitline Segmentation (3 foils)

• Area Estimation (7 foils)

• Access Time & Power Estimation (4 foils)

• Clock & Power Distribution (4 foils)

Foil # 27 / 58 The University of Texas at Austin EE 382M Class Notes

Register File Bitline Segmentation

• Problem: In general, long bitlines cause very slow edge rates
  – May consider converting to an SSA design approach

• However, very short bitlines cause the overall area to increase
  – Array efficiency goes down; wastes valuable silicon area

• Solution: Break up the bitline depth to determine the optimal design point
  – Divide it into smaller sections & recombine with a "wire-OR"

• Example #1 shows 16 memory cells on a bitline which drives a dynamic "wire-OR" global bitline

• Example #2 shows a "serial" global bitline structure
  – The lower global bitline is in series with the upper global bitline, with a receiver and NMOS pulldown device in the center (acts like a "repeater")

Foil # 28 / 58 The University of Texas at Austin EE 382M Class Notes

Register File Segmentation Example #1

[Figure: 16 memory cells share a local bitline; the local-bitline receiver drives a dynamic "wire-OR" global bitline, precharged by #pc and captured in a dynamic latch at the global bitline receiver.]

The global bitline acts as a dynamic "wire-OR" of the 16-cell local bitline segments.

Foil # 29 / 58 The University of Texas at Austin EE 382M Class Notes

Register File Segmentation Example #2

• Serial global bitline

[Figure: two local-bitline segments, each with its own dynamic "wire-OR" and #pc precharge. The lower global bitline feeds a receiver and NMOS pulldown in the center, which drives the upper global bitline in series; the upper global bitline ends in a dynamic latch at the global bitline receiver.]

Foil # 30 / 58 The University of Texas at Austin EE 382M Class Notes

Class Agenda

• Memory Hierarchy (6 foils)

• Memory Cell Types (9 foils)

• Basic Array Structure (5 foils)

• Bitline Segmentation (3 foils)

• Area Estimation (7 foils)

• Access Time & Power Estimation (4 foils)

• Clock & Power Distribution (4 foils)

Foil # 31 / 58 The University of Texas at Austin EE 382M Class Notes

Array Area Estimation

• Cell Area
  – 6T bitcell dimensions are strongly dependent on technology
    • Need an actual layout study to determine the area
  – Multiported cells are wire-limited and can be easily calculated
    • Cell height is a function of {MH_Pitch*(Wordlines + Shields)}
    • Cell width is a function of {MV_Pitch*(Bitlines + Datalines + Shields)}

• Local Bitline Receivers and Dataline Drivers
  – Height of array is increased by local bitline receivers
    • NumReadPorts*NumEntries/CellPerLBL
  – Height of array is increased by local dataline drivers
    • NumWritePorts*NumEntries/CellPerLBL

Foil # 32 / 58 The University of Texas at Austin EE 382M Class Notes

Array Area Estimation

• Decoder & Wordline Repeaters

– Width of array is increased by the decoder

• Decoder width is a function of number of ports

• 20% of total array width is a reasonable estimate

– Width of array is increased by wordline repeaters

• Typically no more than 32 to 64 bitcells on a single wordline (limits rise/fall time of selected row)

Foil # 33 / 58 The University of Texas at Austin EE 382M Class Notes

Array Area Estimation
Cell Height & Width Calculation

Recall:

Cell Height = MH_Pitch*(Wordlines + Shields)
            = MH_Pitch*[(#R + #W) + WL_shield*(#R + #W + 1)]

Cell Width  = MV_Pitch*(Bitlines + Datalines + Shields)
            = MV_Pitch*[(#R + Rd_shield*#R + 1) + (#W + Wr_shield*#W + 1)]

Where:
  #R          Number of read ports
  #W          Number of write ports
  WL_shield   Read wordline shield factor
  Rd_shield   Read bitline shield factor
  Wr_shield   Write dataline shield factor
  MH_Pitch    Wordline pitch
  MV_Pitch    Bitline pitch

Foil # 34 / 58 The University of Texas at Austin EE 382M Class Notes

Array Area Estimation

Consider: 3 read ports & 2 write ports, 16 bits, 64 entries

Cell Height = MH_Pitch*(Wordlines + Shields)
            = MH_Pitch*[(#R + #W) + WL_shield*(#R + #W + 1)]
            = 0.2um * [(3 + 2) + (5 shields + 1)] = 2.20um

Cell Width  = MV_Pitch*(Bitlines + Datalines + Shields)
            = MV_Pitch*[(#R + Rd_shield*#R + 1) + (#W + Wr_shield*#W + 1)]
            = 0.2um * [(3 + 0.5*3 + 1) + (2 + 0.5*2 + 1)] = 1.90um

• Sub-array dimensions are:

X = 16 * Cell_Width  = 16 * 1.90um = 30.4um
Y = 64 * Cell_Height = 64 * 2.20um = 140.8um
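A minimal Python sketch of the same calculation (the function and argument names are mine; the pitch, shield, and port values are taken from the example above):

```python
def rf_cell_dims(n_read, n_write, mh_pitch, mv_pitch,
                 wl_shield=1.0, rd_shield=0.5, wr_shield=0.5):
    """Multiported register-file cell height/width (um) from wire pitches."""
    # Height: one wordline track per port, plus shields between/around them.
    height = mh_pitch * ((n_read + n_write) + wl_shield * (n_read + n_write + 1))
    # Width: read bitlines + write datalines, each with their shield factors.
    width = mv_pitch * ((n_read + rd_shield * n_read + 1) +
                        (n_write + wr_shield * n_write + 1))
    return height, width

# 3 read ports, 2 write ports, 0.2 um metal pitches, 16 bits x 64 entries
h, w = rf_cell_dims(3, 2, mh_pitch=0.2, mv_pitch=0.2)
print(f"cell: {w:.2f} um wide x {h:.2f} um tall")        # 1.90 um x 2.20 um
print(f"sub-array: {16 * w:.1f} um x {64 * h:.1f} um")   # 30.4 um x 140.8 um
```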

Foil # 35 / 58 The University of Texas at Austin EE 382M Class Notes

SRAM Array Area Estimation

Estimate the subarray first:
1. # of 6T bitcells * bitcell area + wordline & column decoders + sense amps + read/write sequentials.
2. The decoders + sense amps + sequentials are typically 15% of the subarray bitcell area.
3. Use an "array efficiency" factor to calculate the total SRAM array area; this includes clock buffers, address decoders, control logic, repeaters, routing, etc.; typical numbers are in the range of ~60%.

EXAMPLE:

• A 16KB L1 cache with four 4KB subarrays; each subarray is comprised of 128 bitcells/column and 256 bitcells/wordline; the 6T bitcell area in this 65 nm CMOS technology is 0.68 μm².

Bitcell subarray = 0.68 μm² * 128 * 256 = 22,282 μm²

Subarray = 1.15 * 22,282 = 25,624 μm²

4 subarrays = 4 * 25,624 = ~102,500 μm²

16KB L1 cache = 102,500 / 0.60 = 170,833 μm² or ~0.17 mm²
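A minimal Python sketch of this estimation flow (the function name and keyword arguments are mine; the 15% periphery overhead and ~60% array efficiency are the rules of thumb from this foil):

```python
def sram_area_um2(bitcell_area_um2, rows, cols, n_subarrays,
                  periphery_overhead=0.15, array_efficiency=0.60):
    """Early SRAM macro area estimate from bitcell area and organization."""
    bitcell_array = bitcell_area_um2 * rows * cols       # raw bitcell area
    subarray = bitcell_array * (1 + periphery_overhead)  # + decoders, sense amps, sequentials
    all_subarrays = subarray * n_subarrays
    return all_subarrays / array_efficiency              # + clocks, control, routing, etc.

# 16KB L1: four 4KB subarrays, 128 rows x 256 columns, 0.68 um^2 bitcell (65 nm)
area = sram_area_um2(0.68, rows=128, cols=256, n_subarrays=4)
print(f"{area:,.0f} um^2  (~{area / 1e6:.2f} mm^2)")  # ~0.17 mm^2, matching the estimate above
```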

Foil # 36 / 58 The University of Texas at Austin EE 382M Class Notes

Floorplan Options

[Figure: possible large-signal array floorplans assembled from DECODE, precharge (Pchg), sub-array, read block, write driver, and CTL blocks in different arrangements.]

Possible Large-Signal Array Floorplans
• The Array Area Calculator provides dimensions for these blocks

Foil # 37 / 58 The University of Texas at Austin EE 382M Class Notes

Floorplanning Tool: Structured Datapath

Foil # 38 / 58 The University of Texas at Austin EE 382M Class Notes

Sample Floorplan
Generated from a floorplanning CAD tool

[Figure: floorplan showing bitslices, rwldrv, wwldrv, decode, and merge logic blocks.]

Foil # 39 / 58 The University of Texas at Austin EE 382M Class Notes

Class Agenda

• Memory Hierarchy (6 foils)

• Memory Cell Types (9 foils)

• Basic Array Structure (5 foils)

• Bitline Segmentation (3 foils)

• Area Estimation (7 foils)

• Access Time & Power Estimation (4 foils)

• Clock & Power Distribution (4 foils)

Foil # 40 / 58 The University of Texas at Austin EE 382M Class Notes

Access Time Estimation

Break into components:
  access time = wordline driver + wordline RC delay + column fall time + column mux + setup

Wordline RC delay (example): 128 bitcells in a row

[Figure: distributed RC line driven by clk, with per-cell segments R1..R128 and C1..C128, observed at node V128.]

• RT = Σ Ri = 140 mΩ/sq * 348μm / 0.1μm = 487.2 Ω

• CT = Σ Ci = CM1 + Cgate
     = 348μm * 0.23fF/μm + 128*(2*0.5μm)*2.0fF/μm
     = 80fF + 256fF = 336fF

• trow = 0.38 * RT * CT = 62ps (50% point of the rising waveform)
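A minimal Python sketch of this lumped wordline RC estimate (names are mine; the sheet resistance, wire and gate capacitance, and geometry values are those used above):

```python
def wordline_delay_ps(n_cells=128, length_um=348.0, wire_width_um=0.1,
                      rsheet_ohm_sq=0.140, cwire_ff_per_um=0.23,
                      gate_width_um=2 * 0.5, cgate_ff_per_um=2.0):
    """50% delay of a distributed RC wordline: t ~= 0.38 * R_total * C_total."""
    r_total_ohm = rsheet_ohm_sq * length_um / wire_width_um      # 487.2 ohm
    c_total_ff = (length_um * cwire_ff_per_um                    # metal-1 wire cap
                  + n_cells * gate_width_um * cgate_ff_per_um)   # 128 passgate loads
    return 0.38 * r_total_ohm * c_total_ff * 1e-3                # ohm*fF = fs; /1000 -> ps

print(f"t_row ~= {wordline_delay_ps():.0f} ps")   # ~62 ps
```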

Foil # 41 / 58 The University of Texas at Austin EE 382M Class Notes

Access Time Estimation
Column Fall Time

• Assume the bitline is discharged linearly; then we can use dV/dt = Iread / CBL

• With WL = VDD, Iread = 0.5μm * 600μA/μm and CBL = 68fF (CJ = 1.25fF/μm²):

  dV/dt = (0.5μm * 600μA/μm) / 68fF = 4.41 V/ns

• The bitline falls to VDD/2 = 1.0V/2 in 113ps

[Figure: bitline voltage vs. time, falling from 1.0V at 4.41 V/ns and crossing the VDD/2 (50%) point at 113ps.]
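The matching sketch for the column fall time, using the linear-discharge assumption above (names are mine; the device width, drive current, bitline capacitance, and VDD are the foil's values):

```python
def bitline_fall_time_ps(vdd_v=1.0, width_um=0.5, idsat_ua_per_um=600.0,
                         c_bitline_ff=68.0):
    """Time for the bitline to fall to VDD/2, assuming a constant read current."""
    i_read_a = width_um * idsat_ua_per_um * 1e-6        # 300 uA
    dv_dt_v_per_s = i_read_a / (c_bitline_ff * 1e-15)   # ~4.41e9 V/s (4.41 V/ns)
    return (vdd_v / 2) / dv_dt_v_per_s * 1e12           # seconds -> ps

print(f"column fall time ~= {bitline_fall_time_ps():.0f} ps")   # ~113 ps
```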

Foil # 42 / 58 The University of Texas at Austin EE 382M Class Notes

Access Time Estimation

Sum up the components of delay; assume an inverter delay of 40ps, a nand2 delay of about 60ps, and a setup into the latch of 30ps:

Taccess = wordline driver + wordline delay + column delay + column mux + setup

        = (60ps + 40ps) + 62ps + 113ps + 60ps + 30ps

        = 365ps

This should easily meet the machine cycle time since the frequency is low. However, the calculated 365ps is only the READ-ACCESS time; wire routing and data-capture budgets have not been factored in yet. It may be possible to use a "high Vt" device if it is available from the fab.
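The same sum as a short Python sketch (component values as assumed above):

```python
# Read-access components from the foils above, in picoseconds
components_ps = {"nand2": 60, "inverter": 40, "wordline_rc": 62,
                 "column_fall": 113, "column_mux": 60, "latch_setup": 30}
print(f"T_access = {sum(components_ps.values())} ps")  # 365 ps
```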

Foil # 43 / 58 The University of Texas at Austin EE 382M Class Notes

Preliminary Power Estimation

• Most power dissipation for an array occurs in the bitlines and sense amplifiers

• Calculate the total bitline capacitance
  – {Metal2 bitline cap} + {junction cap} x {number of bitcells}

• Calculate the sense-node capacitive load to include in the power dissipation

• For power dissipation, use the approximation:

  Pdiss = a * Ctotal * (Vsupply)² * frequency

  where a is the "Activity Factor", 0 < a < 1

• Memory cells can contribute significant D.C. power due to leakage from many cells in standby; be sure to take that into account

Pstatic = Ileakage * VDD
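A minimal sketch of these two approximations (the function name and the example capacitance, activity factor, frequency, and leakage values are mine, for illustration only):

```python
def array_power_w(c_total_f, vdd, freq_hz, activity, i_leak_a):
    """P_dyn = a*C*V^2*f plus P_static = I_leak*VDD."""
    p_dyn = activity * c_total_f * vdd ** 2 * freq_hz
    p_static = i_leak_a * vdd
    return p_dyn, p_static

# Illustrative numbers: 50 pF switched, 1.0 V, 2 GHz, a = 0.2, 5 mA standby leakage
p_dyn, p_static = array_power_w(50e-12, 1.0, 2e9, 0.2, 5e-3)
print(f"P_dyn = {p_dyn * 1e3:.0f} mW, P_static = {p_static * 1e3:.0f} mW")  # 20 mW, 5 mW
```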

Foil # 44 / 58 The University of Texas at Austin EE 382M Class Notes

Class Agenda

• Memory Hierarchy (6 foils)

• Memory Cell Types (9 foils)

• Basic Array Structure (5 foils)

• Bitline Segmentation (3 foils)

• Area Estimation (7 foils)

• Access Time & Power Estimation (4 foils)

• Clock & Power Distribution (4 foils)

Foil # 45 / 58 The University of Texas at Austin EE 382M Class Notes

Local Clock Distribution

• At high frequencies, clock uncertainties become a significant portion of the cycle time (10-15% of cycle time or more)

• Important to define the overall clocking scheme and distribution before implementation begins

• Clock inaccuracy is composed of 2 major sources:
  – Clock jitter: due to the PLL, DLL, etc.
  – Clock skew: mismatches in the clock buffer tree, load, inductance, or variances due to process (Leff is not constant), VDD (it is not constant), and local temperature

• A global clock grid that distributes to local clock buffers requires large overhead but helps minimize clock skew
  – LCBs are evenly distributed within the array block and tap off of the global clock grid with a minimum route

Foil # 46 / 58 The University of Texas at Austin EE 382M Class Notes

LCB Placement

[Figure: two array floorplans with local clock buffers (LCBs) distributed throughout: next to the bitcell arrays, the port 0 and port 1 decoders, the port 0/1 input data latches, the port 0/1 read/write circuits, and the port 0/1 output latches.]

A large number of LCBs minimizes the wire load from each LCB to its sequentials, thus reducing skew variance.

Foil # 47 / 58 The University of Texas at Austin EE 382M Class Notes

SAMPLE Power/Ground GRID

Shielding takes up significant routing resources. Global M6 routes over the array should have minimal coupling noise to the array bitlines.

[Figure: cross-section of a 48λ-wide power/ground grid with signal tracks interleaved between VSS and VDD rails; each signal is flanked by Vss shields (full shielding, MCF = 1.0), with track widths/spaces of λ and 2λ.]

* Where λ is the minimum critical dimension for width/space

Foil # 48 / 58 The University of Texas at Austin EE 382M Class Notes

Power/Clock Grid

• The clock grid is interleaved between VDD and VSS on metal 6

[Figure: the same LCB-placement floorplan as the earlier foil, shown with the power/clock grid routed over the bitcell arrays, decoders, latches, and read/write circuits.]

Foil # 49 / 58 The University of Texas at Austin EE 382M Class Notes

BACKUP

Foil # 50 / 58 The University of Texas at Austin EE 382M Class Notes

Memory Array Performance

• Optimization of memory arrays and caches requires careful analysis of:
  – Size and speed of the array, which impacts:
    • Power: static and dynamic
    • Latency: number of clocks to access the memory cell
    • Area and aspect ratios
    • Redundancy
  – Hit rate (caches): requires additional logic and tag arrays
  – Architecture: How many levels of caching?

• In addition need to account for array BIST. This requires additional logic and impacts performance.

Foil # 51 / 58 The University of Texas at Austin EE 382M Class Notes

Memory Array Performance

Foil # 52 / 58 The University of Texas at Austin EE 382M Class Notes

Array Redundant Elements

[Figure: the basic array layout (pre-decoder, decoder, wordlines, rows and columns of cells, precharge, bitline receivers and write buffers, read/write data) augmented with a redundant address & enable, a redundant wordline & driver, and a redundant column & bitslice.]

Account for the area overhead if redundancy is used for repair.

Foil # 53 / 58 The University of Texas at Austin EE 382M Class Notes

Trade-offs

Large Signal Arrays:
• Simplest sense scheme (single-ended bitlines)
• Good noise margin (Vdd/2 threshold)
• Lower bitcell density (used for small queues & register files; 8 ~ 32 cells on a bitline)
• Static timing analysis works
• Multi-ported: usually single-ended, many READ/WRITE ports

Small Signal Arrays:
• Need a sense amplifier (dual-ended bitlines)
• Noise-sensitive (few hundred millivolts ΔV)
• Highest bitcell density (used for large 1st & 2nd level cache arrays; 64, 128, 256 or more cells on a bitline)
• Static timing analysis difficult
• Single-ported: usually dual-ended, 1 ~ 3 ports

Foil # 54 / 58 The University of Texas at Austin EE 382M Class Notes

Dual-Ended Cell Column Muxing

[Figure: two organizations of a 128-entry x 2-bit array. Left: 128 rows x 2 columns, with read and write decoders on Addr[6:0] and Data[1:0] at the bottom of the columns. Right: 32 rows x 8 columns, with read and write decoders on Addr[6:2], cells grouped by bit (Bit 0, Bit 1), and 4:1 column muxes selected by Addr[1:0] producing Data[0] and Data[1].]

For minimum delay, the cell array should be roughly square.

Foil # 55 / 58 The University of Texas at Austin EE 382M Class Notes

Single Ended Cell Column Muxing

Single-ended arrays must group bits of the same entry together, to drive write wordlines only on the cells of one entry.

[Figure: left, 128 rows x 2 columns with read and write decoders on Addr[6:0] and Data[1:0]. Right, 32 rows x 8 columns with the read decoder on Addr[6:2], cells grouped by entry (Entry A, B, C, D), per-entry write decoders, and 4:1 column muxes selected by Addr[1:0] producing Data[0] and Data[1].]

Foil # 56 / 58 The University of Texas at Austin EE 382M Class Notes

Dual Ended vs Single Ended Column Muxing

Dual-Ended Cells
• Same bit of different entries grouped together (A0 B0 C0 D0 | A1 B1 C1 D1)
• Write data driven only on some columns
• Write wordline "on" for the entire row

Single-Ended Cells
• Different bits of the same entry grouped together (A0 A1 | B0 B1 | C0 C1 | D0 D1)
• Write data can be driven on every column
• Write wordline "on" for only one entry

[Figure: both organizations share a read WL and use 4:1 column muxes to produce Data[0] and Data[1]; the dual-ended case has a single write WL per row, while the single-ended case has separate write WLs per entry.]

Foil # 57 / 58 The University of Texas at Austin EE 382M Class Notes

Segmentation Guidelines

• Design considerations for segmenting the bitlines are based on variables such as:
  – Number of entries
  – Number of ports
  – Number of bits

• Processor architecture and manufacturing technology also contribute to design decisions
  – For example, a high-leakage process may limit the number of cells on a bitline before losing state

• The following table is a guideline to help determine how to divide up the bitlines for optimum performance
  – The final decision will be based on careful HSPICE simulations of the different options over PVT variations

Foil # 58 / 58 The University of Texas at Austin EE 382M Class Notes

Table of Guidelines
(rows: number of ports; columns: number of entries)

Ports 1-7:
• <=64 entries: Single array; split LBL with a maximum of 8 bits per LBL in M2; each to a NAND2 receiver followed by a latch; GBL to the input of a latch at the bottom in M4; 1-cycle latency is assumed.
• <=128 entries: Split into 2 sub-arrays with 64 entries each; LBL and GBL should follow the guidelines for similar ports; output of GBL to a NAND2 between sub-arrays; single-cycle latency is assumed.
• <=256 entries: LBL and GBL guidelines are the same as <=64 entries with similar ports; stacked twice for 256 entries; 2:1 mux between the two 128-entry sub-arrays; at least two-cycle latency is required.

Ports 8-16:
• <=64 entries: Single array; split LBL with a maximum of 8 bits per LBL in M2; each to a NAND2 receiver followed by a latch; split GBLs are routed in M4 to a NAND2 with a dynamic latch in the middle; latched outputs to destination drivers in M4 (or M3).
• <=128 entries: Split into 2 sub-arrays with up to 64 entries each; LBL and GBL should follow the guidelines for entries with similar ports; output of GBL to a dynamic latch followed by latches; two-cycle latency is assumed.
• <=256 entries: LBL and GBL guidelines are the same as <=64 entries with similar ports; stacked twice for 256 entries; 2:1 mux between the two 128-entry sub-arrays; more than 2-cycle latency is required.

Ports 17-21:
• <=64 entries: Single array; split LBL with a maximum of 8 bits per LBL in M2; each to a NAND2 receiver followed by a dynamic wire-OR; split GBLs are routed in M4 (or M2) to a NAND2 with a latch in the middle; latched outputs to destination drivers in M4 (or M3); a maximum of 48 entries can be supported for this many ports.
• <=128 entries: Split into 2 sub-arrays with up to 48 entries each; LBL and GBL should follow the guidelines for similar ports; output of GBL to a dynamic latch followed by latches; at least 2-cycle latency is assumed.
• <=256 entries: LBL and GBL guidelines are the same as <=64 entries with similar ports; stacked twice for 256 entries; 2:1 mux between the two 128-entry sub-arrays; more than 2-cycle latency is required.