Memory Technology
Computer Science 104 Alvin R. Lebeck
2 © Alvin R. Lebeck
Administrivia
• Midterm II next Monday
• HW 5 due Wednesday
• Three more homeworks
• Simulator (y86): a longer assignment that overlaps with two other homeworks; groups of 2
Today's Lecture
• Memory outline: review, big picture of memory, memory technology (SRAM, DRAM), caches
• Reading: 6.1-6.4
The Five Stages of mrmovl
• Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
• Reg/Dec: Register Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• WrB: Write the data back to the register file

[Figure: mrmovl moving through the five pipeline stages, one per cycle (Cycles 1-5: Ifetch, Reg/Dec, Exec, Mem, WrB).]
Pipeline Stages
• Fetch: select current PC, read instruction, compute incremented PC
• Decode: read program registers
• Execute: operate ALU
• Memory: read or write data memory
• Write Back: update register file
Controlling a Pipeline: Finite State Diagram

[Figure: finite state diagram. From Ifetch control moves to Rfetch/Decode, then branches by instruction type: Rtype through RExec and Rfinish; Ori through OriExec and OriFinish; mrmovl and rmmovl through AdrCal and then Mr_mem or Rm_mem (with MR_wr for mrmovl); je through BrComplete; Wait states fill in as needed.]
Where Are We?

[Figure: the system stack. Software (Application, Compiler, Operating System) sits above the interface between HW and SW: the Instruction Set Architecture, Memory, and I/O. Hardware (Firmware, CPU, I/O system, Memory, Digital Design, Circuit Design) sits below. You are here: Memory.]
Review: Program's View of Memory
• Memory is a large linear array of bytes. Each byte has a unique address (location), e.g., a byte of data at address 0x100 and another at 0x101.
• Most computers have instructions with byte (8-bit) addressing.
• Data may have to be aligned on a word (4 byte) or double word (8 byte) boundary: an int is 4 bytes; a double precision floating point is 8 bytes.
• 32-bit vs. 64-bit addresses: we will assume 32-bit for the rest of the course, unless otherwise stated.

[Figure: memory drawn as an array of byte addresses from 0 to 2^n - 1, each holding one byte of data; word addresses run 0, 4, ..., 2^n - 4.]
SEQ+ Hardware

[Figure: the SEQ+ datapath with a naïve view of memory: both the instruction memory and the data memory are labeled IDEAL.]
Question
• What issues do we need to worry about in implementing the memory system?
System Organization

[Figure: the CPU and its Cache connect over the Memory Bus to Memory and, through an I/O Bridge (core chip set), to the I/O Bus, which hosts a Disk Controller (with Disks), a Graphics Controller, and a Network Interface; interrupts flow back to the CPU. This is the setting for the memory hierarchy.]
Processor and Caches

[Figure: the Processor / Core (Datapath, Registers, Control) sits next to the Level One Cache, backed by the Level Two and Level Three Caches, which connect to main memory. We will talk more about caches later.]
Main Memory

Why is it called DRAM?

[Figure: the Memory Controller drives the Memory Bus, which connects to DIMM slots 0 through 7; each slot can hold a DRAM DIMM carrying multiple DRAM chips. Data returns over the bus to the processor.]
Memory Technology
• Random Access: "Random" is good: access time is the same for all locations.
  DRAM (Dynamic Random Access Memory):
  » High density, low power, cheap, slow
  » Dynamic: needs to be "refreshed" regularly
  » Used for main memory
  SRAM (Static Random Access Memory):
  » Low density, high power, expensive, fast
  » Static: content will last "forever" (until power loss)
  » Used for caches
• "Not-so-random" Access Technology: access time varies from location to location and from time to time. Examples: Disk, DVD/CD.
• Sequential Access Technology: access time linear in location (e.g., Tape).
Random Access Memory (RAM) Technology
• Why do computer professionals need to know about RAM technology? Processor performance is usually limited by memory latency and bandwidth.
  Latency: the time it takes to access a single word in memory.
  Bandwidth: the average speed of access to memory (words/sec).
  As integrated circuit (IC) densities increase, lots of memory will fit on the processor chip.
  » Tailor on-chip memory to specific needs: instruction cache, data cache, write buffer.
• What makes RAM different from a bunch of flip-flops?
  Density: RAM is much more dense.
  Speed: RAM access is slower than flip-flop (register) access.
Technology Historical Trends

DRAM capacity and cycle time by year:
Year  Size     Cycle Time
1980  64 Kb    250 ns
1983  256 Kb   220 ns
1986  1 Mb     190 ns
1989  4 Mb     165 ns
1992  16 Mb    145 ns
1995  64 Mb    120 ns
1999  128 Mb   100 ns
2003  256 Mb   100 ns
2007  2 Gb     55 ns
2010  2 Gb     20 ns

        Capacity         Speed
Logic:  2x in 3 years    2x in 3 years
DRAM:   4x in 3 years    1.4x in 10 years
Disk:   2x in 3 years    1.4x in 10 years

The capacity improvement compounds to 1000:1 while speed improves only 2:1.
Static RAM Cell
6-Transistor SRAM Cell: two cross-coupled inverters hold the stored value; the word line (row select) connects the cell to the bit and bit-bar lines.
• Write: 1. Drive the bit lines (bit = 1, bit-bar = 0). 2. Select the row.
• Read: 1. Precharge bit and bit-bar to Vdd (set to 1). 2. Select the row. 3. The cell pulls one line low (to 0). 4. A sense amp on the column detects the difference between bit and bit-bar.
Typical SRAM Organization: 16-word x 4-bit

[Figure: a 16 x 4 array of SRAM cells. An address decoder on A0-A3 drives word lines Word 0 through Word 15; each of the 4 columns has a write driver & precharger (inputs Din 0-3, controls WrEn and Precharge) and a sense amp producing Dout 0-3.]
Logic Diagram of a Typical SRAM
A 2^N words x M bit SRAM has an N-bit address input A, an M-bit data pin D, and control inputs WE_L and OE_L.
• Write Enable is usually active low (WE_L).
• Din and Dout are combined into D to save pins, so a new control signal, output enable (OE_L), is needed:
  WE_L asserted (low), OE_L deasserted (high): » D serves as the data input pin.
  WE_L deasserted (high), OE_L asserted (low): » D is the data output pin.
  Both WE_L and OE_L asserted: » the result is unknown. Don't do that!!!
Introduction to DRAM
• Dynamic RAM (DRAM):
  Refresh required
  Very high density
  Low power (0.1 - 0.5 W active, 0.25 - 10 mW standby)
  Low cost per bit
  Pin sensitive (few pins):
  » Output Enable (OE_L)
  » Write Enable (WE_L)
  » Row address strobe (RAS)
  » Column address strobe (CAS)

[Figure: an N x N cell array addressed by log2(N) multiplexed row/column address bits, with sense amps and column addressing feeding the data pin D; controls WE_L and OE_L.]
1-Transistor Memory Cell (DRAM)
• Write: 1. Drive the bit line. 2. Select the row.
• Read: 1. Precharge the bit line to Vdd (1). 2. Select the row. 3. The cell and bit line share charges. » Very small voltage changes on the bit line. 4. Sense (fancy sense amp). » Can detect changes of ~1 million electrons. 5. Write: restore the value.
• Refresh: 1. Just do a dummy read to every cell.

[Figure: one transistor per cell, gated by the row select line onto the bit line.]
Classical DRAM Organization (square)

[Figure: a square RAM cell array; a row decoder driven by the row address selects one word (row) line, and Sense-Amps, Column Selector & I/O Circuits driven by the column address select among the bit (data) lines.]

• Row and column address together select 1 bit at a time.
• Each intersection represents a 1-T DRAM cell.
Typical DRAM Organization
• Typical DRAMs access multiple bits in parallel.
  Example: 2 Mb DRAM = 256K x 8 = 512 rows x 512 cols x 8 bits.
  Row and column addresses are applied to all 8 planes in parallel.

[Figure: one "plane" of a 256 Kb DRAM is 512 rows x 512 cols; planes 0 through 7 drive data bits D<0> through D<7>.]
Increasing Bandwidth - Interleaving

• Access pattern without interleaving: the CPU starts the access for D1, waits one full memory cycle time until D1 is available, and only then starts the access for D2.
• Access pattern with 4-way interleaving: consecutive words are spread across Memory Banks 0-3, so the CPU starts accesses to Bank 0, Bank 1, Bank 2, and Bank 3 back to back; by then Bank 0's cycle time has elapsed and we can access Bank 0 again.
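The overlap above can be sketched with a rough timing model (an assumption for illustration; the 100 ns access time and 10 ns bus time are made-up numbers, not from the slides):

```python
def no_interleave(n_words, access=100, bus=10):
    # Without interleaving, each access waits for the previous one
    # to completely finish (access time + bus transfer).
    return n_words * (access + bus)

def interleave(n_words, n_banks, access=100, bus=10):
    # Consecutive words live in consecutive banks, so accesses overlap.
    # A new bank can be started every access/n_banks ns (if the bus keeps
    # up), and only the first access pays the full latency.
    stagger = max(bus, access / n_banks)
    return access + (n_words - 1) * stagger

no_interleave(4)   # 440 ns: four fully serialized accesses
interleave(4, 4)   # 175.0 ns: the four banks overlap their work
```

The point of the sketch is that interleaving improves bandwidth, not latency: the first word still takes the full access time.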
MICRON 2Gb DRAM (512Mx4, circa 2010)
Fast Memory Systems: DRAM Specific
• Modern DRAMs:
  Synchronous DRAM (SDRAM): provides a clock signal to the DRAM; transfers are synchronous to the system clock.
  Double Data Rate DRAM (DDR, DDR2, DDR3):
  » Transfers data on both clock edges
  Also RAMBUS:
  » Each chip is a module vs. a slice of memory
  » Short bus between CPU and chips
  » Does its own refresh
  » Variable amount of data returned
  » 1 byte / 2 ns (500 MB/s per chip)
Summary of Memory Technology
• DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system. Uses one transistor per cell; must be refreshed.
• SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time. Uses six transistors per cell; holds state as long as power is supplied.
• GOAL: present the user with large amounts of memory using the cheapest technology, while providing access at the speed offered by the fastest technology.
• Next up: caches
Issues for Memory Systems
• Capacity/Size
• Cost: what technology is cheap?
• Performance: what technology is fast?
• Ease of Use: how much do programmers have to worry about it?
Cache
• What is a cache?
• What is the motivation for a cache?
• Why do caches work?
• How do caches work?
The Motivation for Caches
• Motivation: large memories (DRAM) are slow; small memories (SRAM) are fast.
• Make the average access time small by servicing most accesses from a small, fast memory.
• Reduce the bandwidth required of the large memory.

[Figure: the Processor talks to a Memory System consisting of a Cache backed by DRAM.]
Levels of the Memory Hierarchy

Level              Capacity       Access Time     Cost                   Staging Xfer Unit
Registers          100s of bytes  ~1 ns                                  instr. operands, 1-8 bytes (prog./compiler)
Cache              KB - MB        1-100 ns        ~$.0005/bit            blocks, 8-128 bytes (cache controller)
Main Memory        GB             100 ns - 1 us   ~$.00001/bit           pages, 512 B - 4 KB (OS)
Disk/Network       GB - TB        ms              10^-3 - 10^-4 cents    files, MB (user/operator)
Tape/DVD/Network   TB             sec - min       10^-6 cents

Upper levels are faster; lower levels are larger.
The Principle of Locality
• The Principle of Locality: programs access a relatively small portion of the address space at any instant of time. Example: 90% of time in 10% of the code.
• Two different types of locality:
  Temporal Locality (locality in time): if an item is referenced, it will tend to be referenced again soon.
  Spatial Locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon.

[Figure: probability of reference plotted over the address space from 0 to 2^n, concentrated in a few sharp peaks.]
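Spatial locality can be made concrete by counting how many distinct cache blocks an access pattern touches (a small sketch; the 32-byte block size and the access patterns are assumptions for illustration):

```python
BLOCK = 32  # bytes per cache block (assumed; matches the 32-byte blocks used later)

def blocks_touched(addresses, block=BLOCK):
    # Fewer distinct blocks means more accesses are absorbed by
    # spatial locality: a block fetched once serves many accesses.
    return len({addr // block for addr in addresses})

seq     = [4 * i for i in range(256)]    # 256 sequential 4-byte ints
strided = [128 * i for i in range(256)]  # 256 ints spaced 128 bytes apart

blocks_touched(seq)      # 32 blocks: eight ints share each block
blocks_touched(strided)  # 256 blocks: every access lands in a new block
```

Both patterns make 256 accesses, but the sequential one pulls in 8x fewer blocks, which is exactly what a cache with multi-byte blocks exploits.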
Memory Hierarchy: Principles of Operation
• At any given time, data is copied between only 2 adjacent levels:
  Upper Level (Cache): the one closer to the processor. » Smaller, faster, and uses more expensive technology.
  Lower Level (Memory): the one further away from the processor. » Bigger, slower, and uses less expensive technology.
• Block: the minimum unit of information that can either be present or not present in the two-level hierarchy.

[Figure: blocks Blk X and Blk Y move between the Upper Level (Cache) and the Lower Level (Memory), to and from the processor.]
Memory Hierarchy: Terminology
• Hit: data appears in some block in the upper level (example: Block X).
  Hit Rate: the fraction of memory accesses found in the upper level.
  Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss.
• Miss: data needs to be retrieved from a lower level (Block Y).
  Miss Rate = 1 - (Hit Rate)
  Miss Penalty = time to replace a block in the upper level + time to deliver the block to the processor.
• Hit Time << Miss Penalty
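These terms combine into the standard average memory access time formula (the formula is standard; the example numbers below are assumptions, not from the slides):

```python
def amat(hit_time, miss_rate, miss_penalty):
    # Average Memory Access Time = Hit Time + Miss Rate * Miss Penalty
    return hit_time + miss_rate * miss_penalty

# Assumed numbers: 1 ns hit time, 5% miss rate, 100 ns miss penalty.
amat(1, 0.05, 100)  # 6.0 ns on average, even though 95% of accesses take 1 ns
```

This is why Hit Time << Miss Penalty matters: a few percent of misses can dominate the average.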
Direct Mapped Cache
• Direct mapped cache: an array of fixed-size frames.
• Each frame holds consecutive bytes of main memory data (a block).
• The tag array holds the block memory address.
• A valid bit associated with each cache block tells if the data is valid.

Cache-Index = (<Address> mod Cache_Size) / Block_Size
Block-Offset = <Address> mod Block_Size
Tag = <Address> / Cache_Size

• Cache Index: the location of a block (and its tag) in the cache.
• Block Offset: the byte location in the cache block.
The Simplest Cache: Direct Mapped Cache

[Figure: a 4-byte direct mapped cache with 1-byte blocks (cache indices 0-3) in front of a 16-byte memory (addresses 0-F).]

• Location 0 can be occupied by data from memory location 0, 4, 8, ... etc. In general: any memory location whose 2 LSBs of the address are 0s. Address<1:0> => cache index.
• Which one should we place in the cache?
• How can we tell which one is in the cache?
Direct Mapped Cache (Cont.)
For a cache of 2^M bytes with a block size of 2^L bytes:
  There are 2^(M-L) cache blocks.
  The lowest L bits of the address are the Block-Offset bits.
  The next (M - L) bits are the Cache-Index.
  The remaining (32 - M) bits are the Tag bits.

Data Address: | 32-M bits: Tag | M-L bits: Cache Index | L bits: Block Offset |
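The bit slicing above can be sketched directly (a minimal sketch; the example address 0x14025 is made up to land on the tag 0x50 and index 0x01 used in the later slides):

```python
def split_address(addr, m, l):
    # For a 2**m-byte direct-mapped cache with 2**l-byte blocks:
    offset = addr & ((1 << l) - 1)               # lowest l bits
    index  = (addr >> l) & ((1 << (m - l)) - 1)  # next m-l bits
    tag    = addr >> m                           # remaining 32-m bits
    return tag, index, offset

# 1 KB cache (m=10) with 32 B blocks (l=5):
split_address(0x14025, m=10, l=5)  # (0x50, 0x01, 0x05)
```

Note the divisions and mods in the slide's formulas become shifts and masks because every quantity is a power of two.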
Example: 1-KB Cache with 32B blocks
Cache Index = (<Address> mod 1024) / 32
Block-Offset = <Address> mod 32
Tag = <Address> / 1024
(1 K = 2^10 = 1024; 2^5 = 32)

[Figure: the direct mapped cache data array holds 32 cache blocks of 32 bytes each (Byte 0 ... Byte 31), with a 22-bit cache tag and a valid bit per block. Address: | 22 bits: Tag | 5 bits: Cache Index | 5 bits: Block Offset |]
Example: 1KB Direct Mapped Cache with 32B Blocks
• For a 1024 (2^10) byte cache with 32-byte blocks:
  The uppermost 22 = (32 - 10) address bits are the Cache Tag.
  The lowest 5 address bits are the Byte Select (Block Size = 2^5).
  The next 5 address bits (bit 5 - bit 9) are the Cache Index.

[Figure: address bits 31-10 form the Cache Tag (example: 0x50), bits 9-5 the Cache Index (example: 0x01), bits 4-0 the Byte Select (example: 0x00). The valid bit and cache tag are stored as part of the cache "state", alongside the cache data (Byte 0 ... Byte 1023 across 32 blocks).]
Example: 1K Direct Mapped Cache

[Figure: an access with Cache Tag 0x0002fe, Cache Index 0x00, Byte Select 0x00. The entry at index 0 is invalid (valid bit 0), so the tag comparison fails: Cache Miss.]
Example: 1K Direct Mapped Cache

[Figure: after the miss, the new block of data is filled into index 0, its valid bit is set to 1, and its cache tag is set to 0x0002fe.]
Example: 1K Direct Mapped Cache

[Figure: an access with Cache Tag 0x000050, Cache Index 0x01, Byte Select 0x08. The entry at index 1 is valid and its stored tag 0x000050 matches: Cache Hit.]
Example: 1K Direct Mapped Cache

[Figure: an access with Cache Tag 0x002450, Cache Index 0x02, Byte Select 0x04. The entry at index 2 is valid but its stored tag 0x004440 does not match: Cache Miss.]
Example: 1K Direct Mapped Cache

[Figure: after the miss, the new block of data replaces the old block at index 2, and that entry's cache tag is updated to 0x002450.]
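The miss/fill/hit behavior walked through in these examples can be replayed with a minimal direct-mapped cache model (a sketch; the concrete addresses are assumptions chosen to land on the tag and index values the slides use):

```python
class DirectMappedCache:
    # 1 KB cache as in the examples: 32 frames of 32-byte blocks,
    # each frame holding a valid bit and a tag.
    def __init__(self, n_frames=32, block=32):
        self.block, self.n = block, n_frames
        self.valid = [False] * n_frames
        self.tag = [0] * n_frames

    def access(self, addr):
        blk = addr // self.block        # which memory block this byte is in
        index = blk % self.n            # which cache frame it maps to
        tag = blk // self.n             # what must match for a hit
        hit = self.valid[index] and self.tag[index] == tag
        if not hit:                     # miss: fetch the block, update state
            self.valid[index] = True
            self.tag[index] = tag
        return hit

c = DirectMappedCache()
c.access(0x14020)   # cold miss: fills index 0x01 with tag 0x50
c.access(0x14028)   # same block, different byte: hit
c.access(0x914020)  # same index, different tag (0x2450): conflict miss, evicts
```

The third access shows the direct-mapped conflict the slides illustrate: two blocks that map to the same index cannot coexist, so the new one evicts the old.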
Intel Core i7 Cache Hierarchy