Memory Hierarchy
Jin-Soo Kim ([email protected])
Computer Systems Laboratory
Sungkyunkwan University
http://csl.skku.edu
The CPU-Memory Gap
▪ The gap widens between DRAM, disk, and CPU speeds
[Figure: time (ns, log scale) versus year, 1985-2015, for disk seek time, SSD access time, DRAM access time, SRAM access time, CPU cycle time, and effective CPU cycle time]
Principle of Locality
▪ Temporal locality
• Recently referenced items are likely to be referenced in the near future
▪ Spatial locality
• Items with nearby addresses tend to be referenced close together in time
Principle of Locality: Example
▪ Data
• Reference array elements in succession → Spatial locality
• Reference sum each iteration → Temporal locality
▪ Instructions
• Reference instructions in sequence → Spatial locality
• Cycle through loop repeatedly → Temporal locality
sum = 0;
for (i = 0; i < n; i++)
    sum += a[i];
return sum;
Memory Hierarchy
▪ Some fundamental and enduring properties of hardware and software
• Fast storage technologies cost more per byte, have less capacity, and require more power
• The gap between CPU and main memory speed is widening
• Well-written programs tend to exhibit good locality
▪ These fundamental properties complement each other beautifully
▪ They suggest an approach for organizing memory and storage systems known as a memory hierarchy
Memory Hierarchy: Example
[Figure: memory hierarchy pyramid, from smaller, faster, and costlier (per byte) storage devices at the top to larger, slower, and cheaper (per byte) storage devices at the bottom]
L0: Registers
L1: L1 cache (SRAM)
L2: L2 cache (SRAM)
L3: L3 cache (SRAM)
L4: Main memory (DRAM)
L5: Local secondary storage (local disks)
L6: Remote secondary storage (e.g., Web cache)
Exploiting Locality
▪ How to exploit temporal locality?
• Speed up data accesses by caching data in faster storage
• Caching in multiple levels: form a memory hierarchy
• The lower levels of the memory hierarchy tend to be slower, but larger and cheaper
▪ How to exploit spatial locality?
• Larger cache line size
• Cache nearby data together
Cache
▪ A smaller, faster storage device that acts as a staging area for a subset of the data in a larger, slower device
▪ Improves the average access time
▪ Exploits both temporal and spatial locality
General Cache Concepts
[Figure: memory viewed as 16 blocks numbered 0-15; a 4-block cache currently holding blocks 8, 9, 14, and 3]
• The larger, slower, cheaper memory is viewed as partitioned into "blocks"
• Data is copied between the levels in block-sized transfer units
• The smaller, faster, more expensive cache holds a subset of the blocks
General Cache Concepts: Hit
[Figure: memory blocks 0-15; cache currently holding blocks 8, 9, 14, and 3]
• Request: 14 → data in block 14 is needed
• Block 14 is in the cache: hit!
General Cache Concepts: Miss
[Figure: memory blocks 0-15; cache currently holding blocks 8, 9, 14, and 3]
• Request: 12 → data in block 12 is needed
• Block 12 is not in the cache: miss!
• Block 12 is fetched from memory and stored in the cache
• Placement policy: determines where the fetched block goes
• Replacement policy: determines which block gets evicted (the victim)
(A small simulation sketch of these policies follows below.)
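To make the placement and replacement policies concrete, here is a minimal sketch (not from the slides) of the 16-block memory and 4-line cache above, assuming a direct-mapped placement (block b goes to line b % 4); the slide's figure does not commit to any particular scheme, so this is just one possible choice. With direct mapping the replacement decision is trivial, because each block has exactly one line it can occupy.

#include <stdio.h>
#include <stdbool.h>

/* Toy parameters matching the figure: 16 memory blocks, 4 cache lines. */
#define NUM_BLOCKS  16
#define CACHE_LINES 4

/* One cache line: which memory block it holds, and whether it holds one. */
struct line { int block; bool valid; };
static struct line cache[CACHE_LINES];

/* Access one block; report hit or miss and update the cache. */
static void access_block(int b)
{
    int idx = b % CACHE_LINES;                 /* placement: direct-mapped */
    if (cache[idx].valid && cache[idx].block == b) {
        printf("block %2d: hit  (line %d)\n", b, idx);
    } else if (cache[idx].valid) {             /* replacement: evict the old block */
        printf("block %2d: miss (line %d, evicts block %d)\n", b, idx, cache[idx].block);
        cache[idx].block = b;
    } else {                                   /* cold miss: line was empty */
        printf("block %2d: miss (line %d, was empty)\n", b, idx);
        cache[idx].block = b;
        cache[idx].valid = true;
    }
}

int main(void)
{
    /* Reference stream in the spirit of the slides: fill 8, 9, 14, 3,
     * then hit on 14, then miss on 12 (which evicts 8 here). */
    int refs[] = {8, 9, 14, 3, 14, 12, 14};
    for (unsigned i = 0; i < sizeof refs / sizeof refs[0]; i++)
        access_block(refs[i]);
    return 0;
}

A set-associative or fully associative cache would give each block several candidate lines, and then a real replacement policy (LRU, FIFO, random) decides which of them to evict, as listed on the next slide.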
CPU Cache Design Issues
▪ Cache size
• 8KB ~ 64KB for L1
▪ Cache line size
• Typically, 32 ~ 64 bytes for L1
▪ Lookup (an address-decomposition sketch follows after this slide)
• Fully associative
• Set associative: 2-way, 4-way, 8-way, 16-way, etc.
• Direct mapped
▪ Replacement
• LRU (Least Recently Used)
• FIFO (First-In First-Out), RANDOM, etc.
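To make the lookup alternatives concrete, here is a minimal address-decomposition sketch with assumed parameters (a 32 KB, 8-way set-associative cache with 64-byte lines, i.e. 32768 / (8 × 64) = 64 sets); the parameters are illustrative, not a specification of any particular CPU. The set index selects one set, and the tag is compared against all 8 lines in that set; a direct-mapped cache is the 1-way special case, and a fully associative cache has no index bits at all.

#include <stdio.h>
#include <stdint.h>

/* Assumed example parameters: 32 KB, 8-way set-associative, 64-byte lines. */
#define CACHE_SIZE   (32 * 1024)
#define WAYS         8
#define LINE_SIZE    64
#define NUM_SETS     (CACHE_SIZE / (WAYS * LINE_SIZE))   /* = 64 sets */
#define OFFSET_BITS  6                                   /* log2(LINE_SIZE) */
#define INDEX_BITS   6                                   /* log2(NUM_SETS)  */

int main(void)
{
    uint64_t addr = 0x7ffe12345678ULL;   /* an arbitrary example address */

    uint64_t offset = addr & (LINE_SIZE - 1);                  /* byte within the line */
    uint64_t index  = (addr >> OFFSET_BITS) & (NUM_SETS - 1);  /* which set to search  */
    uint64_t tag    = addr >> (OFFSET_BITS + INDEX_BITS);      /* compared in each way */

    printf("addr=0x%llx -> tag=0x%llx, set=%llu, offset=%llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)index, (unsigned long long)offset);
    return 0;
}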
Intel Core i7 Cache Hierarchy
[Figure: processor package with Core 0 through Core 3; each core has its own registers, L1 d-cache, L1 i-cache, and private unified L2 cache, and all cores share a unified L3 cache backed by main memory]
• L1 i-cache and d-cache: 32 KB, 8-way, access: 4 cycles
• L2 unified cache: 256 KB, 8-way, access: 10 cycles
• L3 unified cache: 8 MB, 16-way, access: 40-75 cycles
• Block size: 64 bytes for all caches
Cache Performance
▪ Average memory access time = Thit + Rmiss × Tmiss (a worked example follows below)
▪ Hit time (Thit)
• Time to deliver a line in the cache to the processor
• Includes time to determine whether the line is in the cache
• 1 clock cycle for L1, 3 ~ 8 clock cycles for L2
▪ Miss rate (Rmiss)
• Fraction of memory references not found in cache
• 3 ~ 10% for L1, < 1% for L2
▪ Miss penalty (Tmiss)
• Additional time required because of a miss
• Typically 25 ~ 100 cycles for main memory
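A quick worked example with assumed numbers that fall inside the ranges above (not taken from the slides): let Thit = 1 cycle, Rmiss = 5%, and Tmiss = 100 cycles. The average memory access time is then 1 + 0.05 × 100 = 6 cycles, so a 5% miss rate already makes the average access six times slower than a hit; halving the miss rate to 2.5% gives 1 + 0.025 × 100 = 3.5 cycles, which is why reducing misses in hot loops pays off so directly.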
Writing Cache Friendly Code
▪ Make the common case go fast
• Focus on the inner loops of the core functions
▪ Minimize the misses in the inner loops (a traversal-order sketch follows below)
• Repeated references to variables are good (temporal locality)
• Sequential reference patterns are good (spatial locality)
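As a minimal sketch of these two points (the function names and the dimension N below are illustrative, not from the slides), the two loops sum the same N × N matrix stored in C's row-major order; sum_rowwise walks memory with stride 1 and uses every element of each cache line it fetches, while sum_colwise strides by N doubles and, for large N, misses on nearly every access.

#include <stddef.h>

#define N 1024   /* assumed matrix dimension for illustration */

/* Row-major traversal: consecutive iterations touch adjacent addresses
 * (stride 1), so each fetched cache line is fully used (spatial locality). */
double sum_rowwise(const double a[N][N])
{
    double sum = 0.0;       /* repeated references to sum: temporal locality */
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            sum += a[i][j];
    return sum;
}

/* Column-major traversal of the same data: the stride is N doubles, so
 * for large N almost every access lands on a different cache line. */
double sum_colwise(const double a[N][N])
{
    double sum = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            sum += a[i][j];
    return sum;
}

The matrix multiplication study on the following slides is the same idea applied to a three-deep loop nest: reordering the loops changes the stride of the inner loop and therefore the miss rate.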
Matrix Multiplication (1)
▪ Description
• Multiply N x N matrices
• O(N^3) total operations
▪ Assumptions
• Line size = 32 bytes (big enough for 4 64-bit words)
• Matrix dimension (N) is very large
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Variable sum is held in a register.
[Figure: C(i,j) is the product of row A(i,*) and column B(*,j): A × B = C]
Matrix Multiplication (2)
▪ Matrix multiplication (ijk)
/* ijk */
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Inner loop (over k): a[i][k] walks row i of A (row-wise), b[k][j] walks column j of B (column-wise), and c[i][j] is fixed.
Misses per inner-loop iteration: A = 0.25, B = 1.0, C = 0.0
(With 4 doubles per 32-byte line, the row-wise scan of A misses once every 4 elements, the column-wise scan of B misses on every element, and the fixed element of C causes no misses in the inner loop.)
Matrix Multiplication (3)
▪ Matrix multiplication (jik)
/* jik */
for (j=0; j<n; j++) {
    for (i=0; i<n; i++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
Inner loop (over k): A row-wise, B column-wise, C fixed (the same access pattern as ijk).
Misses per inner-loop iteration: A = 0.25, B = 1.0, C = 0.0
Matrix Multiplication (4)
▪ Matrix multiplication (kij)
/* kij */
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}
Inner loop (over j): a[i][k] is fixed (held in r), B row-wise, C row-wise.
Misses per inner-loop iteration: A = 0.0, B = 0.25, C = 0.25
Matrix Multiplication (5)
▪ Matrix multiplication (ikj)
/* ikj */
for (i=0; i<n; i++) {
    for (k=0; k<n; k++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}
Inner loop (over j): a[i][k] is fixed (held in r), B row-wise, C row-wise (the same access pattern as kij).
Misses per inner-loop iteration: A = 0.0, B = 0.25, C = 0.25
Matrix Multiplication (6)
▪ Matrix multiplication (jki)
/* jki */
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
Inner loop (over i): A column-wise, b[k][j] is fixed (held in r), C column-wise.
Misses per inner-loop iteration: A = 1.0, B = 0.0, C = 1.0
Matrix Multiplication (7)
▪ Matrix multiplication (kji)
/* kji */
for (k=0; k<n; k++) {
    for (j=0; j<n; j++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
Inner loop (over i): A column-wise, b[k][j] is fixed (held in r), C column-wise (the same access pattern as jki).
Misses per inner-loop iteration: A = 1.0, B = 0.0, C = 1.0
Matrix Multiplication (8)
▪ Summary
ijk (& jik):
for (i=0; i<n; i++) {
    for (j=0; j<n; j++) {
        sum = 0.0;
        for (k=0; k<n; k++)
            sum += a[i][k] * b[k][j];
        c[i][j] = sum;
    }
}
• 2 loads, 0 stores
• misses/iter = 1.25 (A = 0.25, B = 1.0, C = 0.0)

kij (& ikj):
for (k=0; k<n; k++) {
    for (i=0; i<n; i++) {
        r = a[i][k];
        for (j=0; j<n; j++)
            c[i][j] += r * b[k][j];
    }
}
• 2 loads, 1 store
• misses/iter = 0.5 (A = 0.0, B = 0.25, C = 0.25)

jki (& kji):
for (j=0; j<n; j++) {
    for (k=0; k<n; k++) {
        r = b[k][j];
        for (i=0; i<n; i++)
            c[i][j] += a[i][k] * r;
    }
}
• 2 loads, 1 store
• misses/iter = 2.0 (A = 1.0, B = 0.0, C = 1.0)
Matrix Multiplication (9)
▪ Performance in Core i7
[Figure: cycles per inner-loop iteration (log scale, roughly 1 to 100) versus array size n from 50 to 700, measured on a Core i7; one curve per loop order, and the curves pair up as ijk/jik, jki/kji, and kij/ikj]
Summary
▪ Programmers can optimize for cache performance
• How data structures are organized
• How data are accessed
– Nested loop structure
– Blocking is a general technique (a sketch follows below)
▪ All systems favor "cache friendly code"
• Getting absolute optimum performance is very platform specific
– Cache sizes, line sizes, associativities, etc.
• Can get most of the advantage with generic code
– Keep working set reasonably small (temporal locality)
– Use small strides (spatial locality)
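As a minimal sketch of the blocking (tiling) technique mentioned above (the tile size B and the assumption that n is a multiple of B are illustrative choices, not from the slides), the kij-style update below is applied one B × B tile at a time, so the data touched by the three inner loops stays cache-resident and is walked row-wise:

#define B 32   /* assumed tile size; pick it so roughly 3 * B * B doubles fit in cache */

/* Blocked (tiled) matrix multiply: c += a * b, with n assumed to be a
 * multiple of B for brevity. Each (ii, jj, kk) step works on B x B tiles
 * of a, b, and c; within a tile the kij order keeps all three arrays
 * row-wise (spatial locality), and each tile is reused many times while
 * it is still in cache (temporal locality). */
void matmul_blocked(int n, double a[n][n], double b[n][n], double c[n][n])
{
    for (int ii = 0; ii < n; ii += B)
        for (int jj = 0; jj < n; jj += B)
            for (int kk = 0; kk < n; kk += B)
                for (int i = ii; i < ii + B; i++)
                    for (int k = kk; k < kk + B; k++) {
                        double r = a[i][k];             /* kij order inside the tile */
                        for (int j = jj; j < jj + B; j++)
                            c[i][j] += r * b[k][j];
                    }
}

Because the working set of one tile combination is about 3 × B × B doubles, choosing B so that this fits in the L1 or L2 cache keeps the miss rate low no matter how large n grows.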