10/27/99 ©UCB Fall 1999 CS152 / Kubiatowicz
Lec17.1
CS152 Computer Architecture and Engineering
Lecture 17
Finish Speculation; Locality and Memory Technology
October 27, 1999
John Kubiatowicz (http.cs.berkeley.edu/~kubitron)
lecture slides: http://www-inst.eecs.berkeley.edu/~cs152/
Review: Tomasulo Organization
[Figure: Tomasulo organization — an FP Op Queue, load buffers (Load1-Load6), and store buffers feed reservation stations (Add1-Add3, Mult1-Mult2) in front of the FP adders and FP multipliers; operands come from memory and the FP registers, and all results broadcast on the Common Data Bus (CDB)]
Review: Tomasulo Architecture
° Reservation stations: renaming to a larger set of registers + buffering of source operands
• Prevents registers from becoming the bottleneck
• Avoids the WAR and WAW hazards of the Scoreboard
• Allows loop unrolling in HW
° Not limited to basic blocks (integer unit gets ahead, beyond branches)
° Dynamic scheduling: Scoreboarding/Tomasulo
• In-order issue, out-of-order execution, out-of-order commit
° Branch prediction/speculation
• Regularities in program execution permit prediction of branch directions and data values
• Necessary for wide superscalar issue
Review: Independent “Fetch” unit
[Figure: an instruction fetch unit with branch prediction feeds a stream of instructions to an out-of-order execution unit, which returns correctness feedback on branch results]
° Instruction fetch decoupled from execution
° Need a mechanism to “undo results” when the prediction is wrong: this is called “speculation”
° Address of branch used as index to get prediction AND branch target address (if taken)
• Must check for a branch match now, since we can’t use the wrong branch’s address
• Grab predicted PC from the table, since the target may take several cycles to compute
° Update predicted PC when branch is actually resolved
° Return instruction addresses predicted with stack
[Figure: BTB structure — the PC of the instruction in fetch indexes a table of (Branch PC, Predicted PC) pairs; an equality check against the stored branch PC confirms the match, and a bit predicts taken or untaken]
Review: Branch Target Buffer (BTB)
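The lookup-and-check flow above can be sketched as a toy model in Python (class and field names are made up for illustration, not from a real design):

```python
# Illustrative direct-mapped Branch Target Buffer.
class BTB:
    def __init__(self, entries=1024):
        self.entries = entries
        self.table = [None] * entries  # each slot: (branch_pc, predicted_target)

    def predict(self, pc):
        """Return the predicted next PC, or None on a BTB miss."""
        slot = self.table[pc % self.entries]
        # Must compare the full PC: another branch may alias into this slot,
        # and using its target would steer fetch to the wrong address.
        if slot is not None and slot[0] == pc:
            return slot[1]
        return None

    def update(self, pc, target):
        """Install/refresh an entry once the branch actually resolves."""
        self.table[pc % self.entries] = (pc, target)
```

Note how the tag check makes aliasing a miss rather than a wrong-address fetch, which is exactly the “must check for branch match” point above.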
° Solution: 2-bit scheme where change prediction only if get misprediction twice: (Figure 4.13, p. 264)
° Red: stop, not taken
° Green: go, taken
° Adds hysteresis to decision making process
Review: Better Dynamic Branch Prediction
[Figure: 2-bit predictor state machine — two “predict taken” states and two “predict not taken” states; each T/NT outcome moves one state, so the prediction flips only after two consecutive mispredictions]
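A minimal sketch of the 2-bit hysteresis counter described above (Python, purely illustrative):

```python
# 2-bit saturating counter: states 0-1 predict not-taken, 2-3 predict taken.
# One misprediction moves one step, so it takes two misses in a row to flip
# the prediction -- the hysteresis the slide describes.
class TwoBitCounter:
    def __init__(self, state=2):
        self.state = state  # start weakly taken

    def predict(self):
        return self.state >= 2  # True = predict taken

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

A loop branch that is taken many times and falls through once is thus mispredicted only once, not twice.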
BHT Accuracy
° BHT: like the branch target buffer
• Table indexed by branch PC, holding a 2-bit counter value
° Mispredict because either:
• Wrong guess for that branch
• Got the branch history of the wrong branch when indexing the table
° With a 4096-entry table, programs vary from 1% misprediction (nasa7, tomcatv) to 18% (eqntott), with spice at 9% and gcc at 12%
° 4096 entries about as good as an infinite table (in Alpha 21164)
Correlating Branches
° Hypothesis: recent branches are correlated; that is, the behavior of recently executed branches affects the prediction of the current branch
° Two possibilities; the current branch depends on:
• Last m most recently executed branches anywhere in the program: produces a “GA” (global adaptive) scheme in the Yeh and Patt classification (e.g. GAg)
• Last m most recent outcomes of the same branch: produces a “PA” (per-address adaptive) scheme in the same classification (e.g. PAg)
° Idea: record the m most recently executed branches as taken or not taken, and use that pattern to select the proper branch history table entry
• A single history table shared by all branches (appends a “g” at the end), indexed by history value
• Address used along with history to select the table entry (appends a “p” at the end of the classification)
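The idea can be sketched as a small two-level predictor in Python (a toy model: sizes and names are illustrative, with one row of 2-bit counters per branch-address slot, selected by a global history register):

```python
# Sketch of a (2,2)-style correlating predictor: a 2-bit global history
# register picks one of four 2-bit counters in the row selected by the
# branch address.
class CorrelatingPredictor:
    def __init__(self, slots=1024, hist_bits=2):
        self.hist_bits = hist_bits
        self.history = 0  # outcomes of the last `hist_bits` branches
        # slots x 2^hist_bits table of 2-bit saturating counters
        self.table = [[1] * (1 << hist_bits) for _ in range(slots)]

    def _row(self, pc):
        return self.table[pc % len(self.table)]

    def predict(self, pc):
        return self._row(pc)[self.history] >= 2  # True = predict taken

    def update(self, pc, taken):
        ctrs = self._row(pc)
        if taken:
            ctrs[self.history] = min(3, ctrs[self.history] + 1)
        else:
            ctrs[self.history] = max(0, ctrs[self.history] - 1)
        # shift the outcome into the global history register
        mask = (1 << self.hist_bits) - 1
        self.history = ((self.history << 1) | int(taken)) & mask
```

A strictly alternating branch, which defeats a single 2-bit counter, is learned quickly here because each history pattern gets its own counter.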
Correlating Branches
° For instance, consider a global-history, set-indexed BHT. That gives us a GAs history table.
° (2,2) GAs predictor:
• First 2 means that we keep two bits of history
• Second 2 means that we have 2-bit counters in each slot
• The behavior of recent branches then selects between, say, four predictions of the next branch, updating just that prediction
• Note that the original two-bit counter solution would be a (0,2) GAs predictor
• Note also that aliasing is possible here...
[Figure: the branch address selects a row of four 2-bit counters; a 2-bit global branch history register selects which counter in the row supplies the prediction]
Accuracy of Different Schemes
[Figure: frequency of mispredictions (0%-18%) for nasa7, matrix300, tomcatv, doducd, spice, fpppp, gcc, espresso, eqntott, and li under three schemes — 4096-entry 2-bit BHT, unlimited-entry 2-bit BHT, and 1024-entry (2,2) BHT; the (2,2) predictor stays in roughly the 0%-6% range on most programs (11% worst case), while the plain 2-bit BHTs reach up to 18%]
HW support for More ILP
° Avoid branch prediction by turning branches into conditionally executed instructions:
if (x) then A = B op C else NOP• If false, then neither store result nor cause exception
• Expanded ISAs of Alpha, MIPS, PowerPC, SPARC have conditional move; PA-RISC can annul any following instr.
• EPIC: 64 1-bit predicate fields allow conditional execution of any instruction
° Drawbacks to conditional instructions• Still takes a clock even if “annulled”
• Stall if condition evaluated late
• Complex conditions reduce effectiveness; condition becomes known late in pipeline
Now what about exceptions???
° Out-of-order commit really messes up our chance to get precise exceptions!
• When committing results out-of-order, register file contains results from later instructions while earlier ones have not completed yet.
• What if need to cause exception on one of those early instructions??
° Need to be able to “rollback” register file to consistent state
• Remember that “precise” means that there is some PC such that: all instructions before have committed results, and none after have committed results.
° Big problem for branch prediction as well: what if the prediction is wrong??
° Speculation is a form of guessing.
° Important for branch prediction:• Need to “take our best shot” at predicting branch direction.
• If we issue multiple instructions per cycle, lose lots of potential instructions otherwise:
- Consider 4 instructions per cycle
- If take single cycle to decide on branch, waste from 4 - 7 instruction slots!
° If we speculate and are wrong, need to back up and restart execution to point at which we predicted incorrectly:
• This is exactly same as precise exceptions!
° Technique for both precise interrupts/exceptions and speculation: in-order completion or commit
Relationship Between Precise Interrupts and Speculation:
HW support for precise interrupts
° Need HW buffer for results of uncommitted instructions: reorder buffer
• 3 fields: instr, destination, value
• Reorder buffer can be an operand source => more registers, like RS
• Use reorder buffer number instead of reservation station number when execution completes
• Supplies operands between execution complete & commit
• Once an instruction commits, its result is put into the register file
• Instructions commit in order
• As a result, it’s easy to undo speculated instructions on mispredicted branches or on exceptions
[Figure: the FP Op Queue issues to reservation stations in front of the FP adders; results go to the reorder buffer, which updates the FP registers at commit]
1. Issue—get instruction from FP Op Queue• If reservation station and reorder buffer slot free, issue instr & send operands
& reorder buffer no. for destination (this stage sometimes called “dispatch”)
2. Execution—operate on operands (EX)• When both operands ready then execute; if not ready, watch CDB for result;
when both in reservation station, execute; checks RAW (sometimes called “issue”)
3. Write result—finish execution (WB)• Write on Common Data Bus to all awaiting FUs & reorder buffer; mark
reservation station available.
4. Commit—update register with reorder result• When instr. at head of reorder buffer & result present, update register with
result (or store to memory) and remove instr from reorder buffer.
• Mispredicted branch or interrupt flushes reorder buffer (sometimes called “graduation”)
Four Steps of Speculative Tomasulo Algorithm
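Step 4 above can be sketched with a toy reorder buffer in Python (field names and data structures are illustrative, not the actual hardware): results sit in the ROB until the entry is both oldest and done, so squashing on a mispredicted branch is just discarding the ROB tail.

```python
from collections import deque

def commit(rob, regfile):
    """Retire completed entries from the head of the ROB, strictly in order.

    Each ROB entry is a dict with 'dest', 'value', and 'done' fields
    (hypothetical names). Architectural state (regfile) is updated only
    here, never at execute or write-back.
    """
    committed = []
    while rob and rob[0]["done"]:
        entry = rob.popleft()
        regfile[entry["dest"]] = entry["value"]  # architectural update
        committed.append(entry["dest"])
    return committed
```

Note that one incomplete instruction at the head blocks everything younger, which is exactly what makes the committed state precise.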
Tomasulo With Reorder Buffer:
[Figure: snapshot of the machine — the reorder buffer holds, newest to oldest:
  ROB7  ST 0(R3),F0      value <val2>            Done? Y
  ROB6  ADDD F0,F4,F6    dest F0, value <val2>   Done? Ex (executing)
  ROB5  LD F4,0(R3)      dest F4, value M[10]    Done? Y
  ROB4  BNE F2,<…>       dest --                 Done? N
  ROB3  DIVD F2,F10,F6   dest F2                 Done? N
  ROB2  ADDD F10,F4,F0   dest F10                Done? N
  ROB1  LD F0,10(R2)     dest F0                 Done? N
Reservation stations hold “2 ADDD R(F4),ROB1” and “3 DIVD ROB2,R(F6)”; a load buffer entry “1 10+R2” awaits memory; the FP Op Queue, FP adders/multipliers, registers, and memory connect as in the base Tomasulo organization]
Dynamic Scheduling in PowerPC 604 and Pentium Pro
° Both In-order Issue, Out-of-order execution, In-order Commit
• PPro: central reservation station shared by all functional units, with one bus shared by the branch and integer units
Dynamic Scheduling in PowerPC 604 and Pentium Pro
Parameter                            PPC 604      Pentium Pro
Max. instructions issued/clock          4              3
Max. instr. complete exec./clock        6              5
Max. instr. committed/clock             6              3
Instructions in reorder buffer         16             40
Number of rename buffers          12 Int / 8 FP       40
Number of reservation stations         12             20
No. integer functional units (FUs)      2              2
No. floating point FUs                  1              1
No. branch FUs                          1              1
No. complex integer FUs                 1              0
No. memory FUs                          1        1 load + 1 store
Dynamic Scheduling in Pentium Pro
° PPro doesn’t pipeline 80x86 instructions
° PPro decode unit translates the Intel instructions into 72-bit micro-operations (similar to MIPS instructions)
° Sends micro-operations to reorder buffer & reservation stations
° Takes 1 clock cycle to determine length of 80x86 instructions + 2 more to create the micro-operations
° Most instructions translate to 1 to 4 micro-operations
° Complex 80x86 instructions are executed by a conventional microprogram (8K x 72 bits) that issues long sequences of micro-operations
Limits to Multi-Issue Machines
° Inherent limitations of ILP• 1 branch in 5: How to keep a 5-way superscalar busy?
• Latencies of units: many operations must be scheduled
• Need about Pipeline Depth x No. Functional Units of independent instructions to keep fully busy
• Increase ports to Register File
- VLIW example needs 7 read and 3 write for Int. Reg. & 5 read and 3 write for FP reg
• Increase ports to memory
• Current state of the art: many hardware structures (such as issue/rename logic) have delay proportional to the square of the number of instructions issued/cycle
° Conflicting studies of amount• Benchmarks (vectorized Fortran FP vs. integer C programs)
• Hardware sophistication
• Compiler sophistication
° How much ILP is available using existing mechanisms with increasing HW budgets?
° Do we need to invent new HW/SW mechanisms to keep on the processor performance curve?
• Intel MMX
• Motorola AltiVec
• SuperSPARC multimedia ops, etc.
Limits to ILP
Initial HW Model here; MIPS compilers.
Assumptions for ideal/perfect machine to start:
1. Register renaming–infinite virtual registers and all WAW & WAR hazards are avoided
2. Branch prediction–perfect; no mispredictions
3. Instruction Window–machine with an unbounded buffer of instructions available
4. Memory-address alias analysis–addresses are known & a store can be moved before a load provided addresses not equal
1 cycle latency for all instructions; unlimited number of instructions issued per clock cycle
Limits to ILP
Upper Limit to ILP: Ideal Machine
[Figure: instruction issues per cycle (IPC) for the ideal machine — gcc 54.8, espresso 62.6, li 17.9, fpppp 75.2, doducd 118.7, tomcatv 150.1; Integer: 18-60, FP: 75-150]
More Realistic HW: Branch Impact
[Figure: IPC for gcc, espresso, li, fpppp, doducd, tomcatv after changing from an infinite window to a 2000-entry window and a maximum issue of 64 instructions per clock, under five predictors — Perfect, Selective (pick correlating or 512-entry BHT), Standard 2-bit, Static (profile), None; Integer: 6-12, FP: 15-45]
More Realistic HW: Register Impact (rename registers)
[Figure: IPC for gcc, espresso, li, fpppp, doducd, tomcatv with a 2000-instr window, 64-instr issue, and 8K two-level prediction, as the number of rename registers varies — Infinite, 256, 128, 64, 32, None; Integer: 5-15, FP: 11-45]
More Realistic HW: Alias Impact
[Figure: IPC for gcc, espresso, li, fpppp, doducd, tomcatv with a 2000-instr window, 64-instr issue, 8K two-level prediction, and 256 renaming registers, under four alias-analysis models — Perfect, Global/stack perfect (heap conflicts), Inspection (by assembler), None; Integer: 4-9, FP: 4-45 (Fortran, no heap)]
Realistic HW for ‘9X: Window Impact
[Figure: IPC for gcc, espresso, li, fpppp, doducd, tomcatv with perfect disambiguation (HW), 1K selective prediction, a 16-entry return stack, 64 registers, and issue of as many instructions as the window allows, as window size varies — Infinite, 256, 128, 64, 32, 16, 8, 4; Integer: 6-12, FP: 8-45]
Brainiac vs. Speed Demon (1993)
° 8-scalar IBM Power-2 @ 71.5 MHz (5-stage pipe) vs. 2-scalar Alpha @ 200 MHz (7-stage pipe)
[Figure: SPECmarks (0-900) per benchmark — espresso, li, eqntott, compress, sc, gcc, spice, doduc, mdljdp2, wave5, tomcatv, ora, alvinn, ear, mdljsp2, swm256, su2cor, hydro2d, nasa, fpppp]
° Start reading Chapter 7 of your book (Memory Hierarchy)
° Second midterm in 3 weeks (Wed, November 17th)
• Pipelining
- Hazards, branches, forwarding, CPI calculations
- (may include something on dynamic scheduling)
• Memory Hierarchy
• Possibly something on I/O (see where we get in lectures)
• Possibly something on power (Broderson Lecture)
° Solutions for midterm 1 up today (promise!)
Administrative Issues
° The Five Classic Components of a Computer
° Today’s Topics: • Recap last lecture
• Locality and Memory Hierarchy
• Administrivia
• SRAM Memory Technology
• DRAM Memory Technology
• Memory Organization
The Big Picture: Where are We Now?
[Figure: the five classic components — Processor (Control + Datapath), Memory, Input, Output]
Technology Trends (from 1st lecture)
DRAM capacity and cycle time:
Year    Size     Cycle Time
1980    64 Kb    250 ns
1983    256 Kb   220 ns
1986    1 Mb     190 ns
1989    4 Mb     165 ns
1992    16 Mb    145 ns
1995    64 Mb    120 ns

        Capacity        Speed (latency)
Logic:  2x in 3 years   2x in 3 years
DRAM:   4x in 3 years   2x in 10 years
Disk:   4x in 3 years   2x in 10 years
(DRAM capacity grew 1000:1 over this span, but speed only 2:1!)
Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
[Figure: performance (log scale, 1 to 1000) vs. year, 1980-2000 — CPU performance (“Moore’s Law”) grows 60%/yr (2X/1.5yr) while DRAM grows 9%/yr (2X/10 yrs, “Less’ Law?”); the processor-memory performance gap grows 50%/year]
Today’s Situation: Microprocessor
° Rely on caches to bridge gap
° Microprocessor-DRAM performance gap• time of a full cache miss in instructions executed
1st Alpha (7000): 340 ns/5.0 ns = 68 clks x 2 or 136 instructions
2nd Alpha (8400): 266 ns/3.3 ns = 80 clks x 4 or 320 instructions
3rd Alpha (t.b.d.): 180 ns/1.7 ns =108 clks x 6 or 648 instructions
• 1/2X latency x 3X clock rate x 3X instr/clock => ~5X the miss cost
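The per-generation miss costs above are just (miss latency / cycle time) x issue width; a quick check in Python using the first Alpha’s numbers:

```python
# Full cache miss cost in instruction slots: clocks of miss latency times
# the number of instructions that could have issued in each of those clocks.
def miss_cost_instrs(miss_ns, cycle_ns, issue_width):
    clocks = miss_ns / cycle_ns
    return round(clocks * issue_width)
```

The later generations round the clock counts slightly differently on the slide, so only the first case is checked exactly here.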
Impact on Performance
° Suppose a processor executes at • Clock Rate = 200 MHz (5 ns per cycle)
• CPI = 1.1
• 50% arith/logic, 30% ld/st, 20% control
° Suppose that 10% of memory operations get 50 cycle miss penalty
° CPI = ideal CPI + average stalls per instruction
  = 1.1 (cycles/instr) + (0.30 (data mops/instr) x 0.10 (miss/data mop) x 50 (cycles/miss))
  = 1.1 cycles + 1.5 cycles = 2.6
° 58% of the time the processor is stalled waiting for memory!
° A 1% instruction miss rate would add an additional 0.5 cycles to the CPI!
[Figure: CPI breakdown pie chart — data miss 1.6 (49%), ideal CPI 1.1 (35%), instruction miss 0.5 (16%)]
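The CPI arithmetic above can be reproduced directly (numbers are the slide’s):

```python
# CPI with memory stalls, per the slide's assumptions.
base_cpi = 1.1                 # ideal CPI
ld_st_frac = 0.30              # fraction of instructions that are loads/stores
data_miss_rate = 0.10          # misses per data memory operation
miss_penalty = 50              # cycles per miss

data_stalls = ld_st_frac * data_miss_rate * miss_penalty  # stall cycles/instr
cpi = base_cpi + data_stalls
stall_frac = data_stalls / cpi           # fraction of time stalled on memory
inst_miss_stalls = 0.01 * miss_penalty   # what a 1% I-miss rate would add
```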
The Goal: illusion of large, fast, cheap memory
° Fact: Large memories are slow, fast memories are small
° How do we create a memory that is large, cheap and fast (most of the time)?
• Hierarchy
• Parallelism
An Expanded View of the Memory System
[Figure: the processor (control + datapath) connects to a chain of memory levels; moving away from the processor, speed goes from fastest to slowest, size from smallest to biggest, and cost/bit from highest to lowest]
Why hierarchy works
° The Principle of Locality:
• Programs access a relatively small portion of the address space at any instant of time.
[Figure: probability of reference vs. address (0 to 2^n - 1), sharply peaked around a few regions of the address space]
Memory Hierarchy: How Does it Work?
° Temporal Locality (Locality in Time): keep the most recently accessed data items closer to the processor
° Spatial Locality (Locality in Space): move blocks consisting of contiguous words to the upper levels
[Figure: blocks Blk X and Blk Y moving between the upper-level memory (to/from the processor) and the lower-level memory]
Memory Hierarchy: Terminology
° Hit: data appears in some block in the upper level (example: Block X)
• Hit Rate: the fraction of memory accesses found in the upper level
• Hit Time: time to access the upper level, which consists of RAM access time + time to determine hit/miss
° Miss: data needs to be retrieved from a block in the lower level (Block Y)
• Miss Rate = 1 - (Hit Rate)
• Miss Penalty: time to replace a block in the upper level + time to deliver the block to the processor
° Hit Time << Miss Penalty
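The terms above combine into the standard average memory access time formula, AMAT = hit time + miss rate x miss penalty; a one-line sketch (the example values are made up):

```python
# Average memory access time from the hierarchy terminology above.
def amat(hit_time, miss_rate, miss_penalty):
    return hit_time + miss_rate * miss_penalty

# e.g. a 1-cycle hit, 5% miss rate, 50-cycle miss penalty
```

Because Hit Time << Miss Penalty, even a small miss rate dominates the average.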
Memory Hierarchy of a Modern Computer System
° By taking advantage of the principle of locality:• Present the user with as much memory as is available in the
cheapest technology.
• Provide access at the speed offered by the fastest technology.
[Figure: control and datapath with registers (100s bytes, ~1 ns) feed an on-chip cache (Ks, ~10s ns), a second-level cache (SRAM, Ms, ~100s ns), main memory (DRAM, Gs), secondary storage (disk, 10,000,000s ns = 10s ms), and tertiary storage (tape, Ts, 10,000,000,000s ns = 10s sec)]
How is the hierarchy managed?
° Registers <-> Memory• by compiler (programmer?)
° cache <-> memory• by the hardware
° memory <-> disks• by the hardware and operating system (virtual memory)
• by the programmer (files)
Memory Hierarchy Technology° Random Access:
• “Random” is good: access time is the same for all locations
• DRAM: Dynamic Random Access Memory
- High density, low power, cheap, slow
- Dynamic: need to be “refreshed” regularly
• SRAM: Static Random Access Memory
- Low density, high power, expensive, fast
- Static: content will last “forever”(until lose power)
° “Not-so-random” Access Technology:
• Access time varies from location to location and from time to time
• Examples: Disk, CDROM
° Sequential Access Technology: access time linear in location (e.g., Tape)
° The next two lectures will concentrate on random access technology
• Main Memory: DRAMs; Caches: SRAMs
Main Memory Background
° Performance of Main Memory: • Latency: Cache Miss Penalty
- Access Time: time between request and word arrives
- Cycle Time: time between requests
• Bandwidth: I/O & Large Block Miss Penalty (L2)
° Main Memory is DRAM : Dynamic Random Access Memory• Dynamic since needs to be refreshed periodically (8 ms)
• Addresses divided into 2 halves (Memory as a 2D matrix):
- RAS or Row Access Strobe
- CAS or Column Access Strobe
° Cache uses SRAM : Static Random Access Memory• No refresh (6 transistors/bit vs. 1 transistor)
Size: DRAM/SRAM = 4-8x; Cost & cycle time: SRAM/DRAM = 8-16x
Random Access Memory (RAM) Technology
° Why do computer designers need to know about RAM technology?
• Processor performance is usually limited by memory bandwidth
• As IC densities increase, lots of memory will fit on processor chip
- Tailor on-chip memory to specific needs
- Instruction cache
- Data cache
- Write buffer
° What makes RAM different from a bunch of flip-flops?• Density: RAM is much denser
Static RAM Cell
[Figure: 6-transistor SRAM cell — cross-coupled inverters gated by the word (row select) line onto complementary bit and bit-bar lines; the PMOS pullups can be replaced with resistive pullups to save area]
° Write:
1. Drive the bit lines (bit = 1, bit-bar = 0)
2. Select row
° Read:
1. Precharge bit and bit-bar to Vdd or Vdd/2 => make sure they are equal!
2. Select row
3. Cell pulls one line low
4. Sense amp on the column detects the difference between bit and bit-bar
Typical SRAM Organization: 16-word x 4-bit
[Figure: an address decoder on A0-A3 drives word lines Word 0 through Word 15 across the SRAM cell array; each of the four columns has a sense amp producing Dout 0-3 and a write driver/precharger taking Din 0-3, controlled by WrEn and Precharge]
Q: Which is longer: the word line or the bit line?
Logic Diagram of a Typical SRAM
[Figure: 2^N words x M bit SRAM with an N-bit address A, an M-bit data pin D, WE_L, and OE_L]
° Write Enable is usually active low (WE_L)
° Din and Dout are combined to save pins:
• A new control signal, output enable (OE_L), is needed
• WE_L asserted (Low), OE_L deasserted (High): D serves as the data input pin
• WE_L deasserted (High), OE_L asserted (Low): D is the data output pin
• Both WE_L and OE_L asserted: the result is unknown. Don’t do that!!!
° Although we could change the VHDL to do what we desire, we must do the best with what we’ve got (vs. what we need)
Typical SRAM Timing
[Figure: write timing — A holds the write address while WE_L pulses low and Data In is driven on D, with write setup time before the pulse and write hold time after; read timing — with OE_L low, A changes to a read address and D goes from high-Z/junk to Data Out after the read access time, repeated for a second read address]
Problems with SRAM
° Six transistors use up a lot of area
° Consider a “zero” stored in the cell:
• Transistor N1 will try to pull “bit” to 0
• Transistor P2 will try to pull “bit bar” to 1
° But the bit lines are precharged high: are P1 and P2 necessary?
[Figure: cell storing a zero with Select = 1 — N1 on, N2 off, P1 off, P2 on; bit = 1, bit-bar = 0, with both bit-line precharge devices on]
1-Transistor Memory Cell (DRAM)
[Figure: one access transistor, gated by the row select line, connects the bit line to a storage capacitor]
° Write:
1. Drive the bit line
2. Select row
° Read:
1. Precharge the bit line to Vdd
2. Select row
3. Cell and bit line share charge
- Very small voltage change on the bit line
4. Sense (fancy sense amp)
- Can detect changes of ~1 million electrons
5. Write: restore the value
° Refresh:
1. Just do a dummy read of every cell
Classical DRAM Organization (square)
[Figure: a RAM cell array with a row decoder on the row address driving the word (row) select lines, and a column selector & I/O circuits on the column address serving the bit (data) lines; each intersection represents a 1-T DRAM cell]
° Row and column address together select 1 bit at a time
DRAM logical organization (4 Mbit)
° Square root of bits per RAS/CAS
[Figure: 11 address lines A0…A10 feed a 2,048 x 2,048 memory array through the row path and a column decoder; sense amps & I/O connect the array (word lines and storage cells) to the D/Q data pins]
DRAM physical organization (4 Mbit)
[Figure: the array is split into blocks (Block 0 … Block 3), each with its own 9:512 block row decoder; a shared column address and I/O blocks route 8 I/Os per side to the D/Q pins]
Memory Systems
[Figure: a memory timing controller and a DRAM controller sit between the processor’s n-bit address (multiplexed down to n/2) and an array of 2^n x 1 DRAM chips, with bus drivers on the w-bit data path]
Tc = Tcycle + Tcontroller + Tdriver
Logic Diagram of a Typical DRAM
[Figure: 256K x 8 DRAM with 9 address pins (A), 8 data pins (D), and control pins WE_L, OE_L, RAS_L, CAS_L]
° Control signals (RAS_L, CAS_L, WE_L, OE_L) are all active low
° Din and Dout are combined (D):
• WE_L asserted (Low), OE_L deasserted (High): D serves as the data input pin
• WE_L deasserted (High), OE_L asserted (Low): D is the data output pin
° Row and column addresses share the same pins (A):
• RAS_L goes low: pins A are latched in as the row address
• CAS_L goes low: pins A are latched in as the column address
• RAS/CAS are edge-sensitive
° tRAC: minimum time from RAS line falling to the valid data output.
• Quoted as the speed of a DRAM
• A fast 4Mb DRAM tRAC = 60 ns
° tRC: minimum time from the start of one row access to the start of the next.
• tRC = 110 ns for a 4Mbit DRAM with a tRAC of 60 ns
° tCAC: minimum time from CAS line falling to valid data output.
• 15 ns for a 4Mbit DRAM with a tRAC of 60 ns
° tPC: minimum time from the start of one column access to the start of the next.
• 35 ns for a 4Mbit DRAM with a tRAC of 60 ns
Key DRAM Timing Parameters
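One implication of the parameters above, sketched in Python: for a x1 (1-bit-wide) part, streaming columns within an open row (tPC) sustains roughly 3x the access rate of full row cycles (tRC). Numbers are the slide’s 4 Mbit DRAM.

```python
# Access-rate implications of the slide's 4 Mb DRAM timing parameters.
t_rc = 110e-9   # row cycle time, seconds
t_pc = 35e-9    # column (page) cycle time, seconds

random_row_rate = 1 / t_rc   # accesses/s if every access opens a new row
same_row_rate = 1 / t_pc     # accesses/s streaming within one open row
page_mode_gain = same_row_rate / random_row_rate
```

This gap is the motivation for the page-mode DRAMs discussed later in the lecture.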
° A 60 ns (tRAC) DRAM can • perform a row access only every 110 ns (tRC)
• perform column access (tCAC) in 15 ns, but time between column accesses is at least 35 ns (tPC).
- In practice, external address delays and turning around buses make it 40 to 50 ns
° These times do not include the time to drive the addresses off the microprocessor nor the memory controller overhead.
• Drive parallel DRAMs, external memory controller, bus to turn around, SIMM module, pins…
• 180 ns to 250 ns latency from processor to memory is good for a “60 ns” (tRAC) DRAM
DRAM Performance
DRAM Write Timing
° Every DRAM access begins with the assertion of RAS_L
° 2 ways to write: early or late relative to CAS
[Figure: 256K x 8 DRAM write cycles — RAS_L falls with the row address on A, then CAS_L falls with the column address; in an early write cycle WE_L is asserted before CAS_L, in a late write cycle after; Data In must be valid on D around the CAS/WE edges, and the pattern repeats once per DRAM WR cycle time]
DRAM Read Timing
° Every DRAM access begins with the assertion of RAS_L
° 2 ways to read: early or late relative to CAS
[Figure: 256K x 8 DRAM read cycles — RAS_L falls with the row address, then CAS_L falls with the column address; in an early read cycle OE_L is asserted before CAS_L, in a late read cycle after; D goes from high-Z to Data Out after the read access time (plus an output-enable delay for late reads), once per DRAM read cycle time]
Main Memory Performance
° Simple: CPU, cache, bus, and memory all the same width (32 bits)
° Interleaved: CPU, cache, and bus 1 word wide; memory is N modules (4 modules in the example; word interleaved)
° Wide: CPU/Mux 1 word; Mux/cache, bus, and memory N words (Alpha: 64 bits & 256 bits)
° DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time
• 2:1; why?
° DRAM (Read/Write) Cycle Time :• How frequent can you initiate an access?
• Analogy: A little kid can only ask his father for money on Saturday
° DRAM (Read/Write) Access Time:• How quickly will you get what you want once you initiate an access?
• Analogy: As soon as he asks, his father will give him the money
° DRAM Bandwidth Limitation analogy:• What happens if he runs out of money on Wednesday?
[Figure: timeline showing the access time as the initial portion of the longer cycle time]
Main Memory Performance
Increasing Bandwidth - Interleaving
Access pattern without interleaving: [Figure: CPU and one memory — start access for D1, wait until D1 is available, and only then start the access for D2]
Access pattern with 4-way interleaving: [Figure: CPU and memory banks 0-3 — accesses to banks 0, 1, 2, and 3 start in successive cycles, so bank 0 is ready to be accessed again by the time its turn comes around]
Main Memory Performance
° Timing model:
• 1 cycle to send the address
• 4 cycles access time, 10 cycle time, 1 cycle to send data
• Cache block is 4 words
° Simple M.P. = 4 x (1 + 10 + 1) = 48
° Wide M.P. = 1 + 10 + 1 = 12
° Interleaved M.P. = 1 + 10 + 1 + 3 = 15
[Figure: four word-interleaved banks — Bank 0 holds words 0, 4, 8, 12; Bank 1 holds 1, 5, 9, 13; Bank 2 holds 2, 6, 10, 14; Bank 3 holds 3, 7, 11, 15]
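The three miss penalties above can be reproduced directly from the timing model (1 cycle to send the address, the 10-cycle DRAM cycle time covering the 4-cycle access, 1 cycle to send data, 4-word block):

```python
# Miss penalty under the slide's timing model for each memory organization.
send_addr, cycle_time, send_data, block_words = 1, 10, 1, 4

simple = block_words * (send_addr + cycle_time + send_data)  # one word at a time
wide = send_addr + cycle_time + send_data                    # whole block at once
interleaved = send_addr + cycle_time + send_data + (block_words - 1)  # banks overlap
```

Interleaving gets most of the wide organization’s benefit while keeping a 1-word bus: the extra 3 cycles are just draining one word per cycle from the staggered banks.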
° How many banks? Number of banks >= number of clocks to access a word in a bank
• For sequential accesses; otherwise we return to the original bank before it has the next word ready
° Increasing DRAM capacity => fewer chips => harder to have banks
• Growth in bits/chip of DRAM: 50%-60%/yr
• Nathan Myhrvold (Microsoft): mature software growth (33%/yr for NT) ~ growth in MB/$ of DRAM (25%-30%/yr)
Independent Memory Banks
Fewer DRAMs/System over Time
[Figure: minimum PC memory size vs. DRAM generation (from Pete MacWilliams, Intel) — as generations advance (‘86 1 Mb, ‘89 4 Mb, ‘92 16 Mb, ‘96 64 Mb, ‘99 256 Mb, ‘02 1 Gb) and minimum memory grows from 4 MB to 256 MB, the number of DRAMs per system falls along the diagonal (e.g. 32, 16, 8, 4, 2, 1 chips); memory per system grows at 25%-30%/year while memory per DRAM grows at 60%/year]
Page Mode DRAM: Motivation
° Regular DRAM organization:
• N rows x N columns x M bits
• Read & write M bits at a time
• Each M-bit access requires a RAS/CAS cycle
° Fast Page Mode DRAM:
• N x M “register” to save a row
[Figure: N rows x N cols DRAM with row address, column address, and M-bit output; the timing shows RAS_L and CAS_L both cycling, with the row address re-sent for the 2nd M-bit access]
Fast Page Mode Operation
° Fast Page Mode DRAM: N x M “SRAM” to save a row
° After a row is read into the register:
• Only CAS is needed to access other M-bit blocks on that row
• RAS_L remains asserted while CAS_L is toggled
[Figure: the N x M “SRAM” row buffer sits between the N rows x N cols array and the M-bit output; the timing shows RAS_L held low with one row address while CAS_L toggles through column addresses for the 1st, 2nd, 3rd, and 4th M-bit accesses]
DRAM v. Desktop Microprocessors Cultures
                 DRAM                            Microprocessor
Standards        pinout, package,                binary compatibility,
                 refresh rate, capacity, ...     IEEE 754, I/O bus
Sources          Multiple                        Single
Figures of       1) capacity, 1a) $/bit          1) SPEC speed
Merit            2) BW, 3) latency               2) cost
Improvement      1) 60%, 1a) 25%,                1) 60%,
Rate/year        2) 20%, 3) 7%                   2) little change
° Reduce cell size 2.5x, increase die size 1.5x
° Sell 10% of a single DRAM generation
• 6.25 billion DRAMs sold in 1996
° 3 phases: engineering samples, first customer ship (FCS), mass production
• Fastest to FCS and mass production wins share
° Die size, testing time, yield => profit
• Yield >> 60% (redundant rows/columns to repair flaws)
DRAM Design Goals
° DRAMs: capacity +60%/yr, cost –30%/yr• 2.5X cells/area, 1.5X die size in 3 years
° ‘97 DRAM fab line costs $1B to $2B• DRAM only: density, leakage v. speed
° Rely on increasing no. of computers & memory per computer (60% market)
• SIMM or DIMM is replaceable unit => computers use any generation DRAM
° Commodity, second source industry => high volume, low profit, conservative
• Little organization innovation in 20 years: page mode, EDO, Synch DRAM
° Order of importance: 1) cost/bit, 1a) capacity
• RAMBUS: 10X BW, +30% cost => little impact
DRAM History
° Commodity, second-source industry => high volume, low profit, conservative
• Little organization innovation (vs. processors) in 20 years: page mode, EDO, Synch DRAM
° DRAM industry at a crossroads:
• Fewer DRAMs per computer over time
- Growth in bits/chip of DRAM: 50%-60%/yr
- Nathan Myhrvold (Microsoft): mature software growth (33%/yr for NT) ~ growth in MB/$ of DRAM (25%-30%/yr)
• Starting to question buying larger DRAMs?
Today’s Situation: DRAM
Today’s Situation: DRAM
[Figure: DRAM revenue per quarter (millions of dollars), 1Q94 through 1Q97 — rising to a $16B peak before falling back toward $7B]
• Intel: 30%/year growth since 1987; 1/3 of income is profit
Summary:
° Two different types of locality:
• Temporal Locality (Locality in Time): if an item is referenced, it will tend to be referenced again soon.
• Spatial Locality (Locality in Space): if an item is referenced, items whose addresses are close by tend to be referenced soon.
° By taking advantage of the principle of locality:
• Present the user with as much memory as is available in the cheapest technology.
• Provide access at the speed offered by the fastest technology.
° DRAM is slow but cheap and dense: a good choice for presenting the user with a BIG memory system
° SRAM is fast but expensive and not very dense: a good choice for providing the user FAST access time
Summary: Processor-Memory Performance Gap “Tax”
Processor            % Area (~cost)   % Transistors (~power)
Alpha 21164               37%                77%
StrongARM SA110           61%                94%
Pentium Pro               64%                88%
  (2 dies per package: Proc/I$/D$ + L2$)
° Caches have no inherent value; they only try to close the performance gap