The Memory System (Chapter 5)


Transcript of "The Memory System (Chapter 5)" (iosup/Courses/2011_ti1400_9.ppt)

  • Slide 1
  • 1 The Memory System (Chapter 5) http://www.pds.ewi.tudelft.nl/~iosup/Courses/2011_ti1400_9.ppt
  • Slide 2
  • Agenda 1.Basic Concepts 2.Performance Considerations: Interleaving, Hit ratio/rate, etc. 3.Caches 4.Virtual Memory 1.1. Organization 1.2. Pinning
  • Slide 3
  • TU-Delft TI1400/11-PDS 3 1.1. Organization [figure: byte-addressable memory organized in 4-byte words at word addresses 0, 4, 8, ...; the word at address 0 holds bytes 0-3, the word at address 4 holds bytes 4-7, and so on]
  • Slide 4
  • TU-Delft TI1400/11-PDS 4 1.1. Connection Memory-CPU [figure: the CPU's MAR and MDR registers connect to Memory over Address and Data lines, plus Read/Write control and the MFC (Memory Function Completed) signal]
  • Slide 5
  • TU-Delft TI1400/11-PDS 5 1.1. Memory: contents Addressable number of bits Different orderings Speed-up techniques -Memory interleaving -Cache memories Enlargement -Virtual memory
  • Slide 6
  • TU-Delft TI1400/11-PDS 6 1.1. Organisation (1) [figure: 16x8 memory cell array: address inputs A0-A3 feed an address decoder that drives word lines W0-W15; the flip-flop (FF) cells are read and written through sense/write circuits on bit lines b0-b7, which connect to the input/output lines; R/W and CS are the control inputs]
  • Slide 7
  • TU-Delft TI1400/11-PDS 7 1.2. Pinning Total number of pins required for 16x8 memory: 16 -4 address lines -8 data lines -2 control lines -2 power lines
  • Slide 8
  • TU-Delft TI1400/11-PDS 8 1.2. A 1K by 1 Memory [figure: 32 by 32 memory cell array with word lines W0-W31; 5 of the 10 address lines go to a 5-bit decoder that selects a row, the other 5 drive two 32-to-1 multiplexors that select the data in/out bit]
  • Slide 9
  • TU-Delft TI1400/11-PDS 9 1.2. Pinning Total number of pins required for 1024x1 memory: 16 -10 address lines -2 data lines (in/out) -2 control lines -2 power lines For 128 by 8 memory: 19 pins (7+8+2+2) Conclusion: the smaller the addressable unit, the fewer pins needed
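    The pin counts on slides 7 and 9 are simple sums; here is a minimal sketch of the arithmetic in C, written for this transcript (not part of the original slides):

    #include <stdio.h>

    /* Pins = address + data + control (R/W, CS) + power.
     * Note the 1K x 1 chip of slide 8 has separate data-in and data-out pins,
     * and that the 1024x1 and 128x8 chips both store 1 Kbit. */
    int main(void) {
        int pins_16x8  =  4 + 8 + 2 + 2;   /* 16 x 8 memory   -> 16 pins */
        int pins_1kx1  = 10 + 2 + 2 + 2;   /* 1024 x 1 memory -> 16 pins */
        int pins_128x8 =  7 + 8 + 2 + 2;   /* 128 x 8 memory  -> 19 pins */
        printf("%d %d %d\n", pins_16x8, pins_1kx1, pins_128x8);
        return 0;
    }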
  • Slide 10
  • TU-Delft TI1400/11-PDS Agenda 1.Basic Concepts 2.Performance Considerations 3.Caches 4.Virtual Memory 2.1. Interleaving 2.2. Performance Gap Processor-Memory 2.3. Caching 2.4. A Performance Model: Hit ratio, Performance Penalty, etc.
  • Slide 11
  • TU-Delft TI1400/11-PDS 11 2.1. Interleaving Multiple Modules (1) [figure: the MM address is split into a k-bit module field (high-order bits) and an m-bit address-in-module field; the module field selects (CS) one of modules 0 .. n-1] Block-wise organization (consecutive words in a single module) CS = Chip Select
  • Slide 12
  • TU-Delft TI1400/11-PDS 12 2.1. Interleaving Multiple Modules (2) [figure: the MM address is split into an m-bit address-in-module field and a k-bit module field (low-order bits); the module field selects (CS) one of modules 0 .. 2^k-1] Interleaving organization (consecutive words in consecutive modules) CS = Chip Select
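    The difference between the two organizations is only which address bits select the module. A small sketch with illustrative sizes (word-addressed memory; the constants and names are assumptions made for this transcript):

    #include <stdio.h>

    #define K 2          /* module-number bits: 2^K = 4 modules         */
    #define M 4          /* address-in-module bits: 16 words per module */

    /* Block-wise: the high-order K bits select the module. */
    static unsigned module_blockwise(unsigned addr)   { return addr >> M; }

    /* Interleaved: the low-order K bits select the module. */
    static unsigned module_interleaved(unsigned addr) { return addr & ((1u << K) - 1); }

    int main(void) {
        for (unsigned a = 0; a < 4; a++)   /* four consecutive word addresses */
            printf("addr %u: block-wise -> module %u, interleaved -> module %u\n",
                   a, module_blockwise(a), module_interleaved(a));
        return 0;
    }

    Consecutive addresses stay in one module in the block-wise case but spread over different modules when interleaved, which is why interleaving allows simultaneous transfers.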
  • Slide 13
  • TU-Delft TI1400/11-PDS 13 Questions What is the advantage of the interleaved organization? What is the disadvantage? Advantage: higher CPU-memory bandwidth, because data can be transferred to/from multiple modules simultaneously. Disadvantage: when a module breaks down, memory has many small holes
  • Slide 14
  • TU-Delft TI1400/11-PDS 14 2.2. Problem: The Performance Gap Processor-Memory Processor: CPU speeds 2X every 2 years (~Moore's Law; limit ~2010) Memory: DRAM speeds 2X every 7 years Gap: 2X every 2 years Gap still growing?
  • Slide 15
  • TU-Delft TI1400/11-PDS 15 2.2. Idea: Memory Hierarchy increasing size increasing speed increasing cost Disks Main Memory Secondary cache: L2 Primary cache: L1 CPU
  • Slide 16
  • TU-Delft TI1400/11-PDS 16 2.3. Caches (1) Problem: Main memory is slower than CPU registers (factor of 5-10) Solution: Fast and small memory between CPU and main memory Contains: recent references to memory [figure: CPU - Cache - Main memory]
  • Slide 17
  • TU-Delft TI1400/11-PDS 17 2.3. Caches (2)/2.4. A Performance Model Works because of locality principle Profit: -cache hit ratio (rate): h -access time cache: c -cache miss ratio (rate): 1-h -access time main memory: m -mean access time: h·c + (1-h)·m Cache is transparent to programmer
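    To make the model concrete, a short sketch that evaluates the mean access time for a few hit ratios (the 1 ns / 10 ns access times are illustrative numbers chosen for this transcript, not from the slides):

    #include <stdio.h>

    /* Mean access time of a single-level cache: h*c + (1-h)*m (slide 17). */
    static double mean_access_time(double h, double c, double m) {
        return h * c + (1.0 - h) * m;
    }

    int main(void) {
        const double c = 1.0, m = 10.0;           /* cache 1 ns, main memory 10 ns */
        for (double h = 0.80; h <= 1.0001; h += 0.05)
            printf("h = %.2f -> %.2f ns\n", h, mean_access_time(h, c, m));
        return 0;
    }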
  • Slide 18
  • TU-Delft TI1400/11-PDS 18 2.3. Caches (3) READ operation: -if not in cache, copy block into cache and read out of cache (possibly read-through) -if in cache, read out of cache WRITE operation: -if not in cache, write in main memory -if in cache, write in cache, and either: write in main memory (store through), or set modified (dirty) bit and write to main memory later
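    The two write-hit policies on this slide can be contrasted in a few lines of code. The cache-line structure and function names below are assumptions made for illustration, not definitions from the slides:

    #include <stdbool.h>
    #include <stdint.h>

    struct line { bool valid, dirty; uint32_t tag; uint8_t data[16]; };

    /* Store-through (write-through): on a hit, update the cache AND main memory. */
    void write_store_through(struct line *l, uint32_t tag, int off,
                             uint8_t v, uint8_t *mem, uint32_t addr) {
        if (l->valid && l->tag == tag)
            l->data[off] = v;       /* keep the cached copy consistent */
        mem[addr] = v;              /* main memory is always updated   */
    }

    /* Write-back: on a hit, update only the cache and set the dirty bit;
     * the block is written to main memory later, when it is replaced. */
    void write_back(struct line *l, uint32_t tag, int off,
                    uint8_t v, uint8_t *mem, uint32_t addr) {
        if (l->valid && l->tag == tag) {
            l->data[off] = v;
            l->dirty = true;
        } else {
            mem[addr] = v;          /* write miss: write directly to main memory */
        }
    }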
  • Slide 19
  • TU-Delft TI1400/11-PDS 19 2.3. Caches (4) The Library Analogy Real-world analogue: -borrow books from a library -store these books according to the first letter of the name of the first author in 26 locations Direct mapped: separate location for a single book for each letter of the alphabet Associative: any book can go to any of the 26 locations Set-associative: two locations for letters A-B, two for C-D, etc. [figure: 26 locations numbered 1..26, labeled A to Z]
  • Slide 20
  • TU-Delft TI1400/11-PDS 20 2.3. Caches (5) Suppose -size of main memory in bytes: N = 2^n -block size in bytes: b = 2^k -number of blocks in cache: 128 -e.g., n=16, k=4, b=16 Every block in cache has a valid bit (is reset when memory is modified) At context switch: invalidate cache
  • Slide 21
  • TU-Delft TI1400/11-PDS Agenda 1.Basic Concepts 2.Performance Considerations 3.Caches 4.Virtual Memory 3.1. Mapping Function 3.2. Replacement Algorithm 3.3. Examples of Mapping 3.4. Examples of Caches in Commercial Processors 3.5. Write Policy 3.6. Number of Blocks/Caches/
  • Slide 22
  • TU-Delft TI1400/11-PDS 22 3.1. Mapping Function 1. Direct Mapped Cache (1) A block in main memory can be at only one place in the cache This place is determined by its block number j: -place = j modulo number of blocks in the cache Main memory address: tag (5 bits) | block (7 bits) | word (4 bits)
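    For the running example of slide 20 (n=16, 16-byte blocks, 128 cache blocks), extracting the three fields from an address looks like this (a sketch; the sample address is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    /* Direct-mapped example: 64 KB memory (n=16), 16-byte blocks (k=4),
     * 128 cache blocks -> 4-bit word, 7-bit block, 5-bit tag fields. */
    int main(void) {
        uint16_t addr  = 0xABCD;               /* arbitrary 16-bit address */
        unsigned word  =  addr       & 0xF;    /* low 4 bits: byte within the block */
        unsigned block = (addr >> 4) & 0x7F;   /* next 7 bits: cache place = block number mod 128 */
        unsigned tag   =  addr >> 11;          /* top 5 bits: tag */
        printf("tag=%u block=%u word=%u\n", tag, block, word);
        return 0;
    }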
  • Slide 23
  • TU-Delft TI1400/11-PDS 23 3.1. Direct Mapped Cache (2) [figure: main memory blocks 0 .. 127, 128, 129 .. 255, 256, ... mapped onto a 128-block cache; each cache block stores a 5-bit tag to identify which memory block it currently holds]
  • Slide 24
  • TU-Delft TI1400/11-PDS 24 3.1. Direct Mapped Cache (3) [figure: same organization, showing that memory blocks 0, 128, 256, ... all map onto cache block 0 and are distinguished only by their 5-bit tag]
  • Slide 25
  • TU-Delft TI1400/11-PDS 25 3.1. Mapping Function 2. Associative Cache (1) Each block can be at any place in cache Cache access: parallel (associative) match of tag in address with tags in all cache entries Associative: slower, more expensive, higher hit ratio Main memory address: tag (12 bits) | word (4 bits)
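    In hardware the tag match happens in parallel over all entries; in software it can only be approximated by a loop. A minimal sketch for the 16-bit address / 128-block example (the structure and names are illustrative, not from the slides):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    #define NBLOCKS 128

    struct entry { bool valid; uint16_t tag; uint8_t data[16]; };

    /* Fully associative lookup: the 12-bit tag of the address is compared
     * against every cache entry (the hardware does all comparisons at once). */
    struct entry *lookup(struct entry cache[NBLOCKS], uint16_t addr) {
        uint16_t tag = addr >> 4;              /* 16-bit address, 4-bit word field */
        for (size_t i = 0; i < NBLOCKS; i++)
            if (cache[i].valid && cache[i].tag == tag)
                return &cache[i];              /* hit  */
        return NULL;                           /* miss */
    }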
  • Slide 26
  • TU-Delft TI1400/11-PDS 26 3.1.2. Associative Cache (2) [figure: any of the main memory blocks 0 .. 256 ... can be placed in any of the 128 cache blocks; each cache block stores a 12-bit tag]
  • Slide 27
  • TU-Delft TI1400/11-PDS 27 3.1. Mapping Function 3. Set-Associative Cache (1) Combination of direct mapped and associative Cache consists of sets Mapping of block to set is direct, determined by set number Each set is associative Main memory address: tag (6 bits) | set (6 bits) | word (4 bits)
  • Slide 28
  • TU-Delft TI1400/11-PDS 28 3.1.3. Set-Associative Cache (2) [figure: 128-block cache organized as 64 two-way sets (set 0, set 1, ...); each block has a 6-bit tag; main memory blocks are shown mapping to the sets] Q: What is wrong in this picture? Answer: there are 64 sets, so block 64 also goes to set 0
  • Slide 29
  • TU-Delft TI1400/11-PDS 29 3.1.3. Set-Associative Cache (3) [figure: the corrected picture: with 64 two-way sets, memory block j maps to set j mod 64, so blocks 0, 64, 128, ... share set 0]
  • Slide 30
  • TU-Delft TI1400/11-PDS 30 Question Main memory: 4 GByte Cache: 512 blocks of 64 bytes Cache: 8-way set-associative (set size is 8) All memories are byte addressable Q: How many bits are used for the: -byte address within a block -set number -tag
  • Slide 31
  • TU-Delft TI1400/11-PDS 31 Answer Main memory is 4 GByte, so a 32-bit address A block is 64 bytes, so a 6-bit byte address within a block 8-way set-associative cache with 512 blocks, so 512/8=64 sets, so a 6-bit set number So, 32-6-6=20-bit tag Main memory address: tag (20 bits) | set (6 bits) | word (6 bits)
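    The same split, written out for this configuration (a sketch; the sample address is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    /* 4 GB byte-addressable memory, 64-byte blocks, 512-block 8-way cache:
     * 6-bit byte offset, 6-bit set number (64 sets), 20-bit tag. */
    int main(void) {
        uint32_t addr   = 0xDEADBEEF;            /* arbitrary 32-bit byte address */
        unsigned offset =  addr       & 0x3F;    /* low 6 bits: byte within block */
        unsigned set    = (addr >> 6) & 0x3F;    /* next 6 bits: one of 64 sets   */
        unsigned tag    =  addr >> 12;           /* remaining 20 bits: tag        */
        printf("tag=0x%05X set=%u offset=%u\n", tag, set, offset);
        return 0;
    }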
  • Slide 32
  • TU-Delft TI1400/11-PDS 32 3.2. Replacement Algorithm Replacement (1) (Set-)associative replacement algorithms: Least Recently Used (LRU) -if 2^k blocks per set, implement with k-bit counters per block -hit: increase by 1 all counters lower than that of the referenced block, then set the referenced block's counter to 0 -miss and set not full: put the new block in an empty location, set its counter to 0, increase the rest -miss and set full: replace the block with the highest value (2^k - 1), set the counter of the new block to 0, increase the rest
  • Slide 33
  • TU-Delft TI1400/11-PDS 33 3.2.1. LRU: Example 1 [figure: k=2, 4 blocks per set; a HIT on the block with counter 10: counters 01 00 10 11 become 10 01 00 11 (lower counters increased, the referenced block now at the top with counter 00, the higher counter unchanged)]
  • Slide 34
  • TU-Delft TI1400/11-PDS 34 3.2.2. LRU: Example 2 [figure: k=2; miss and set not full: the new block takes the EMPTY location and gets counter 00 (now at the top), the other counters are increased]
  • Slide 35
  • TU-Delft TI1400/11-PDS 35 3.2.3. LRU: Example 3 [figure: k=2; miss and set full: the block with the highest counter (11) is replaced, the new block gets counter 00 (now at the top), the remaining counters are increased]
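    The counter scheme of slide 32 and the three cases above can be sketched in a few lines of C. The data structure and function names are assumptions made for this transcript, not part of the slides:

    #include <stdbool.h>

    #define WAYS 4                          /* 2^k blocks per set, with k = 2 */

    struct way { bool valid; unsigned tag; unsigned ctr; };   /* ctr 0 = most recent */

    /* Reference a block with the given tag in one set, maintaining the LRU counters. */
    int reference(struct way set[WAYS], unsigned tag) {
        for (int i = 0; i < WAYS; i++)
            if (set[i].valid && set[i].tag == tag) {           /* HIT (Example 1)    */
                for (int j = 0; j < WAYS; j++)
                    if (set[j].valid && set[j].ctr < set[i].ctr)
                        set[j].ctr++;                          /* increase lower ones */
                set[i].ctr = 0;                                /* now at the top      */
                return i;
            }
        int victim = 0;                     /* MISS: empty way (Example 2) or LRU way (Example 3) */
        for (int i = 0; i < WAYS; i++) {
            if (!set[i].valid) { victim = i; break; }
            if (set[i].ctr > set[victim].ctr) victim = i;
        }
        for (int i = 0; i < WAYS; i++)
            if (set[i].valid && i != victim) set[i].ctr++;     /* increase the rest   */
        set[victim].valid = true;
        set[victim].tag   = tag;
        set[victim].ctr   = 0;                                 /* new block at the top */
        return victim;
    }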
  • Slide 36
  • TU-Delft TI1400/11-PDS 36 3.2. Replacement Algorithm Replacement (2) Alternatives for LRU: -Replace oldest block, First-In-First-Out (FIFO) -Least-Frequently Used (LFU) -Random replacement
  • Slide 37
  • TU-Delft TI1400/11-PDS 37 3.3. Example (1): program int SUM = 0; for (j = 0; j < N; j++) SUM = SUM + A[0][j]; AVE = SUM / N; for (i = N-1; i >= 0; i--) A[0][i] = A[0][i] / AVE; Normalize the elements of row 0 of array A First pass: from start to end Second pass: from end to start
  • Slide 38
  • TU-Delft TI1400/11-PDS 38 3.3. Exam