The Memory System (Chapter 5)


Transcript of "The Memory System (Chapter 5)" (iosup/Courses/2011_ti1400_9.ppt)

  • Slide 1
  • 1 The Memory System (Chapter 5) http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_ti1400_9.ppt
  • Slide 2
  • Agenda 1.Basic Concepts 2.Performance Considerations: Interleaving, Hit ratio/rate, etc. 3.Caches 4.Virtual Memory 1.1. Organization 1.2. Pinning
  • Slide 3
  • TU-Delft TI1400/11-PDS 3 1.1. Organization [figure: byte-addressable memory organized in 4-byte words at word addresses 0, 4, 8, ...; the word at address 0 holds bytes 0-3, the word at address 4 holds bytes 4-7, and so on]
  • Slide 4
  • TU-Delft TI1400/11-PDS 4 1.1. Connection Memory-CPU [figure: the CPU's MAR and MDR registers connect to Memory over Address and Data lines, plus Read/Write control and the MFC (Memory Function Completed) signal]
  • Slide 5
  • TU-Delft TI1400/11-PDS 5 1.1. Memory: contents Addressable number of bits Different orderings Speed-up techniques -Memory interleaving -Cache memories Enlargement -Virtual memory
  • Slide 6
  • TU-Delft TI1400/11-PDS 6 1.1. Organisation (1) [figure: 16x8 memory cell array: address inputs A0-A3 feed an address decoder that drives word lines W0-W15; the flip-flop (FF) cells are read and written through sense/write circuits on bit lines b0-b7, which connect to the input/output lines; R/W and CS are the control inputs]
  • Slide 7
  • TU-Delft TI1400/11-PDS 7 1.2. Pinning Total number of pins required for 16x8 memory: 16 -4 address lines -8 data lines -2 control lines -2 power lines
  • Slide 8
  • TU-Delft TI1400/11-PDS 8 1.2. A 1K by 1 Memory [figure: 32 by 32 memory cell array with word lines W0-W31; 5 of the 10 address lines go to a 5-bit decoder that selects a row, the other 5 drive two 32-to-1 multiplexors that select the data in/out bit]
  • Slide 9
  • TU-Delft TI1400/11-PDS 9 1.2. Pinning Total number of pins required for 1024x1 memory: 16 -10 address lines -2 data lines (in/out) -2 control lines -2 power lines For 128 by 8 memory: 19 pins (7+8+2+2) Conclusion: the smaller the addressable unit, the fewer pins needed
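    The pin counts on slides 7 and 9 are simple sums; here is a minimal sketch of the arithmetic in C, written for this transcript (not part of the original slides):

    #include <stdio.h>

    /* Pins = address + data + control (R/W, CS) + power.
     * Note the 1K x 1 chip of slide 8 has separate data-in and data-out pins,
     * and that the 1024x1 and 128x8 chips both store 1 Kbit. */
    int main(void) {
        int pins_16x8  =  4 + 8 + 2 + 2;   /* 16 x 8 memory   -> 16 pins */
        int pins_1kx1  = 10 + 2 + 2 + 2;   /* 1024 x 1 memory -> 16 pins */
        int pins_128x8 =  7 + 8 + 2 + 2;   /* 128 x 8 memory  -> 19 pins */
        printf("%d %d %d\n", pins_16x8, pins_1kx1, pins_128x8);
        return 0;
    }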
  • Slide 10
  • TU-Delft TI1400/11-PDS Agenda 1.Basic Concepts 2.Performance Considerations 3.Caches 4.Virtual Memory 2.1. Interleaving 2.2. Performance Gap Processor-Memory 2.3. Caching 2.4. A Performance Model: Hit ratio, Performance Penalty, etc.
  • Slide 11
  • TU-Delft TI1400/11-PDS 11 2.1. Interleaving Multiple Modules (1) [figure: the MM address is split into a k-bit module field (high-order bits) and an m-bit address-in-module field; the module field selects (CS) one of modules 0 .. n-1] Block-wise organization (consecutive words in a single module) CS = Chip Select
  • Slide 12
  • TU-Delft TI1400/11-PDS 12 2.1. Interleaving Multiple Modules (2) [figure: the MM address is split into an m-bit address-in-module field and a k-bit module field (low-order bits); the module field selects (CS) one of modules 0 .. 2^k-1] Interleaving organization (consecutive words in consecutive modules) CS = Chip Select
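    The difference between the two organizations is only which address bits select the module. A small sketch with illustrative sizes (word-addressed memory; the constants and names are assumptions made for this transcript):

    #include <stdio.h>

    #define K 2          /* module-number bits: 2^K = 4 modules         */
    #define M 4          /* address-in-module bits: 16 words per module */

    /* Block-wise: the high-order K bits select the module. */
    static unsigned module_blockwise(unsigned addr)   { return addr >> M; }

    /* Interleaved: the low-order K bits select the module. */
    static unsigned module_interleaved(unsigned addr) { return addr & ((1u << K) - 1); }

    int main(void) {
        for (unsigned a = 0; a < 4; a++)   /* four consecutive word addresses */
            printf("addr %u: block-wise -> module %u, interleaved -> module %u\n",
                   a, module_blockwise(a), module_interleaved(a));
        return 0;
    }

    Consecutive addresses stay in one module in the block-wise case but spread over different modules when interleaved, which is why interleaving allows simultaneous transfers.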
  • Slide 13
  • TU-Delft TI1400/11-PDS 13 Questions What is the advantage of the interleaved organization? What is the disadvantage? Advantage: higher CPU-memory bandwidth, because data can be transferred to/from multiple modules simultaneously. Disadvantage: when a module breaks down, memory has many small holes
  • Slide 14
  • TU-Delft TI1400/11-PDS 14 2.2. Problem: The Performance Gap Processor-Memory Processor: CPU speeds 2X every 2 years (~Moore's Law; limit ~2010) Memory: DRAM speeds 2X every 7 years Gap: 2X every 2 years Gap still growing?
  • Slide 15
  • TU-Delft TI1400/11-PDS 15 2.2. Idea: Memory Hierarchy increasing size increasing speed increasing cost Disks Main Memory Secondary cache: L2 Primary cache: L1 CPU
  • Slide 16
  • TU-Delft TI1400/11-PDS 16 2.3. Caches (1) Problem: Main memory is slower than CPU registers (factor of 5-10) Solution: Fast and small memory between CPU and main memory Contains: recent references to memory [figure: CPU - Cache - Main memory]
  • Slide 17
  • TU-Delft TI1400/11-PDS 17 2.3. Caches (2)/2.4. A Performance Model Works because of locality principle Profit: -cache hit ratio (rate): h -access time cache: c -cache miss ratio (rate): 1-h -access time main memory: m -mean access time: h·c + (1-h)·m Cache is transparent to programmer
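    To make the model concrete, a short sketch that evaluates the mean access time for a few hit ratios (the 1 ns / 10 ns access times are illustrative numbers chosen for this transcript, not from the slides):

    #include <stdio.h>

    /* Mean access time of a single-level cache: h*c + (1-h)*m (slide 17). */
    static double mean_access_time(double h, double c, double m) {
        return h * c + (1.0 - h) * m;
    }

    int main(void) {
        const double c = 1.0, m = 10.0;           /* cache 1 ns, main memory 10 ns */
        for (double h = 0.80; h <= 1.0001; h += 0.05)
            printf("h = %.2f -> %.2f ns\n", h, mean_access_time(h, c, m));
        return 0;
    }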
  • Slide 18
  • TU-Delft TI1400/11-PDS 18 2.3. Caches (3) READ operation: -if not in cache, copy block into cache and read out of cache (possibly read-through) -if in cache, read out of cache WRITE operation: -if not in cache, write in main memory -if in cache, write in cache, and either: write in main memory (store through), or set modified (dirty) bit and write to main memory later
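    The two write-hit policies on this slide can be contrasted in a few lines of code. The cache-line structure and function names below are assumptions made for illustration, not definitions from the slides:

    #include <stdbool.h>
    #include <stdint.h>

    struct line { bool valid, dirty; uint32_t tag; uint8_t data[16]; };

    /* Store-through (write-through): on a hit, update the cache AND main memory. */
    void write_store_through(struct line *l, uint32_t tag, int off,
                             uint8_t v, uint8_t *mem, uint32_t addr) {
        if (l->valid && l->tag == tag)
            l->data[off] = v;       /* keep the cached copy consistent */
        mem[addr] = v;              /* main memory is always updated   */
    }

    /* Write-back: on a hit, update only the cache and set the dirty bit;
     * the block is written to main memory later, when it is replaced. */
    void write_back(struct line *l, uint32_t tag, int off,
                    uint8_t v, uint8_t *mem, uint32_t addr) {
        if (l->valid && l->tag == tag) {
            l->data[off] = v;
            l->dirty = true;
        } else {
            mem[addr] = v;          /* write miss: write directly to main memory */
        }
    }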
  • Slide 19
  • TU-Delft TI1400/11-PDS 19 2.3. Caches (4) The Library Analogy Real-world analogue: -borrow books from a library -store these books according to the first letter of the name of the first author in 26 locations Direct mapped: separate location for a single book for each letter of the alphabet Associative: any book can go to any of the 26 locations Set-associative: two locations for letters A-B, two for C-D, etc. [figure: 26 locations numbered 1..26, labeled A to Z]
  • Slide 20
  • TU-Delft TI1400/11-PDS 20 2.3. Caches (5) Suppose -size of main memory in bytes: N = 2^n -block size in bytes: b = 2^k -number of blocks in cache: 128 -e.g., n=16, k=4, b=16 Every block in cache has a valid bit (is reset when memory is modified) At context switch: invalidate cache
  • Slide 21
  • TU-Delft TI1400/11-PDS Agenda 1.Basic Concepts 2.Performance Considerations 3.Caches 4.Virtual Memory 3.1. Mapping Function 3.2. Replacement Algorithm 3.3. Examples of Mapping 3.4. Examples of Caches in Commercial Processors 3.5. Write Policy 3.6. Number of Blocks/Caches/
  • Slide 22
  • TU-Delft TI1400/11-PDS 22 3.1. Mapping Function 1. Direct Mapped Cache (1) A block in main memory can be at only one place in the cache This place is determined by its block number j: -place = j modulo number of blocks in the cache Main memory address: tag (5 bits) | block (7 bits) | word (4 bits)
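    For the running example of slide 20 (n=16, 16-byte blocks, 128 cache blocks), extracting the three fields from an address looks like this (a sketch; the sample address is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    /* Direct-mapped example: 64 KB memory (n=16), 16-byte blocks (k=4),
     * 128 cache blocks -> 4-bit word, 7-bit block, 5-bit tag fields. */
    int main(void) {
        uint16_t addr  = 0xABCD;               /* arbitrary 16-bit address */
        unsigned word  =  addr       & 0xF;    /* low 4 bits: byte within the block */
        unsigned block = (addr >> 4) & 0x7F;   /* next 7 bits: cache place = block number mod 128 */
        unsigned tag   =  addr >> 11;          /* top 5 bits: tag */
        printf("tag=%u block=%u word=%u\n", tag, block, word);
        return 0;
    }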
  • Slide 23
  • TU-Delft TI1400/11-PDS 23 3.1. Direct Mapped Cache (2) [figure: main memory blocks 0 .. 127, 128, 129 .. 255, 256, ... mapped onto a 128-block cache; each cache block stores a 5-bit tag to identify which memory block it currently holds]
  • Slide 24
  • TU-Delft TI1400/11-PDS 24 3.1. Direct Mapped Cache (3) [figure: same organization, showing that memory blocks 0, 128, 256, ... all map onto cache block 0 and are distinguished only by their 5-bit tag]
  • Slide 25
  • TU-Delft TI1400/11-PDS 25 3.1. Mapping Function 2. Associative Cache (1) Each block can be at any place in cache Cache access: parallel (associative) match of tag in address with tags in all cache entries Associative: slower, more expensive, higher hit ratio Main memory address: tag (12 bits) | word (4 bits)
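    In hardware the tag match happens in parallel over all entries; in software it can only be approximated by a loop. A minimal sketch for the 16-bit address / 128-block example (the structure and names are illustrative, not from the slides):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define NBLOCKS 128

    struct entry { bool valid; uint16_t tag; uint8_t data[16]; };

    /* Fully associative lookup: the 12-bit tag of the address is compared
     * against every cache entry (the hardware does all comparisons at once). */
    struct entry *lookup(struct entry cache[NBLOCKS], uint16_t addr) {
        uint16_t tag = addr >> 4;              /* 16-bit address, 4-bit word field */
        for (size_t i = 0; i < NBLOCKS; i++)
            if (cache[i].valid && cache[i].tag == tag)
                return &cache[i];              /* hit  */
        return NULL;                           /* miss */
    }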
  • Slide 26
  • TU-Delft TI1400/11-PDS 26 3.1.2. Associative Cache (2) [figure: any of the main memory blocks 0 .. 256 ... can be placed in any of the 128 cache blocks; each cache block stores a 12-bit tag]
  • Slide 27
  • TU-Delft TI1400/11-PDS 27 3.1. Mapping Function 3. Set-Associative Cache (1) Combination of direct mapped and associative Cache consists of sets Mapping of block to set is direct, determined by set number Each set is associative Main memory address: tag (6 bits) | set (6 bits) | word (4 bits)
  • Slide 28
  • TU-Delft TI1400/11-PDS 28 3.1.3. Set-Associative Cache (2) [figure: 128-block cache organized as 64 two-way sets (set 0, set 1, ...); each block has a 6-bit tag; main memory blocks are shown mapping to the sets] Q: What is wrong in this picture? Answer: there are 64 sets, so block 64 also goes to set 0
  • Slide 29
  • TU-Delft TI1400/11-PDS 29 3.1.3. Set-Associative Cache (3) [figure: the corrected picture: with 64 two-way sets, memory block j maps to set j mod 64, so blocks 0, 64, 128, ... share set 0]
  • Slide 30
  • TU-Delft TI1400/11-PDS 30 Question Main memory: 4 GByte Cache: 512 blocks of 64 bytes Cache: 8-way set-associative (set size is 8) All memories are byte addressable Q: How many bits are used for the: -byte address within a block -set number -tag
  • Slide 31
  • TU-Delft TI1400/11-PDS 31 Answer Main memory is 4 GByte, so a 32-bit address A block is 64 bytes, so a 6-bit byte address within a block 8-way set-associative cache with 512 blocks, so 512/8=64 sets, so a 6-bit set number So, 32-6-6=20-bit tag Main memory address: tag (20 bits) | set (6 bits) | word (6 bits)
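    The same split, written out for this configuration (a sketch; the sample address is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    /* 4 GB byte-addressable memory, 64-byte blocks, 512-block 8-way cache:
     * 6-bit byte offset, 6-bit set number (64 sets), 20-bit tag. */
    int main(void) {
        uint32_t addr   = 0xDEADBEEF;            /* arbitrary 32-bit byte address */
        unsigned offset =  addr       & 0x3F;    /* low 6 bits: byte within block */
        unsigned set    = (addr >> 6) & 0x3F;    /* next 6 bits: one of 64 sets   */
        unsigned tag    =  addr >> 12;           /* remaining 20 bits: tag        */
        printf("tag=0x%05X set=%u offset=%u\n", tag, set, offset);
        return 0;
    }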
  • Slide 32
  • TU-Delft TI1400/11-PDS 32 3.2. Replacement Algorithm Replacement (1) (Set-)associative replacement algorithms: Least Recently Used (LRU) -if 2^k blocks per set, implement with k-bit counters per block -hit: increase by 1 all counters lower than that of the referenced block, then set the referenced block's counter to 0 -miss and set not full: put the new block in an empty location, set its counter to 0, increase the rest -miss and set full: replace the block with the highest value (2^k - 1), set the counter of the new block to 0, increase the rest
  • Slide 33
  • TU-Delft TI1400/11-PDS 33 3.2.1. LRU: Example 1 [figure: k=2, 4 blocks per set; a HIT on the block with counter 10: counters 01 00 10 11 become 10 01 00 11 (lower counters increased, the referenced block now at the top with counter 00, the higher counter unchanged)]
  • Slide 34
  • TU-Delft TI1400/11-PDS 34 3.2.2. LRU: Example 2 [figure: k=2; miss and set not full: the new block takes the EMPTY location and gets counter 00 (now at the top), the other counters are increased]
  • Slide 35
  • TU-Delft TI1400/11-PDS 35 3.2.3. LRU: Example 3 [figure: k=2; miss and set full: the block with the highest counter (11) is replaced, the new block gets counter 00 (now at the top), the remaining counters are increased]
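    The counter scheme of slide 32 and the three cases above can be sketched in a few lines of C. The data structure and function names are assumptions made for this transcript, not part of the slides:

    #include <stdbool.h>

    #define WAYS 4                          /* 2^k blocks per set, with k = 2 */

    struct way { bool valid; unsigned tag; unsigned ctr; };   /* ctr 0 = most recent */

    /* Reference a block with the given tag in one set, maintaining the LRU counters. */
    int reference(struct way set[WAYS], unsigned tag) {
        for (int i = 0; i < WAYS; i++)
            if (set[i].valid && set[i].tag == tag) {           /* HIT (Example 1)    */
                for (int j = 0; j < WAYS; j++)
                    if (set[j].valid && set[j].ctr < set[i].ctr)
                        set[j].ctr++;                          /* increase lower ones */
                set[i].ctr = 0;                                /* now at the top      */
                return i;
            }
        int victim = 0;                     /* MISS: empty way (Example 2) or LRU way (Example 3) */
        for (int i = 0; i < WAYS; i++) {
            if (!set[i].valid) { victim = i; break; }
            if (set[i].ctr > set[victim].ctr) victim = i;
        }
        for (int i = 0; i < WAYS; i++)
            if (set[i].valid && i != victim) set[i].ctr++;     /* increase the rest   */
        set[victim].valid = true;
        set[victim].tag   = tag;
        set[victim].ctr   = 0;                                 /* new block at the top */
        return victim;
    }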
  • Slide 36
  • TU-Delft TI1400/11-PDS 36 3.2. Replacement Algorithm Replacement (2) Alternatives for LRU: -Replace oldest block, First-In-First-Out (FIFO) -Least-Frequently Used (LFU) -Random replacement
  • Slide 37
  • TU-Delft TI1400/11-PDS 37 3.3. Example (1): program int SUM = 0; for (j = 0; j < N; j++) SUM = SUM + A[0][j]; AVE = SUM / N; for (i = N-1; i >= 0; i--) A[0][i] = A[0][i] / AVE; Normalize the elements of row 0 of array A First pass: from start to end Second pass: from end to start
  • Slide 38
  • TU-Delft TI1400/11-PDS 38 3.3. Exam