Managing Distributed, Shared L2 Caches through OS-Level Page Allocation Sangyeun Cho and Lei Jin...
Managing Distributed, Shared L2 Caches through
OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science
University of Pittsburgh
Dec. 13 ’06 – MICRO-39
Multicore distributed L2 caches
L2 caches typically sub-banked and distributed
• IBM Power4/5: 3 banks
• Sun Microsystems T1: 4 banks
• Intel Itanium2 (L3): many “sub-arrays”
(Distributed L2 caches + switched NoC) → NUCA
Hardware-based management schemes
• Private caching
• Shared caching
• Hybrid caching
[Figure: multicore chip in which each tile contains a processor core, a local L2 cache, and a router]
Private caching
1. L1 miss
2. L2 access
• Hit
• Miss
3. Access directory
• A copy on chip
• Global miss
short hit latency (always local)
high on-chip miss rate
long miss resolution time
complex coherence enforcement
Shared caching
1. L1 miss
2. L2 access
• Hit
• Miss
low on-chip miss rate
straightforward data location
simple coherence (no replication)
long average hit latency
Our work
Placing “flexibility” as the top design consideration
OS-level data to L2 cache mapping
• Simple hardware based on shared caching
• Efficient mapping maintenance at page granularity
Demonstrating the impact using different policies
Talk roadmap
Data mapping, a key property
Flexible page-level mapping
• Goals
• Architectural support
• OS design issues
Management policies
Conclusion and future work
Data mapping, the key
Data mapping = deciding data location (i.e., cache slice)
Private caching
• Data mapping determined by program location
• Mapping created at miss time
• No explicit control
Shared caching
• Data mapping determined by address
  slice number = (block address) % (Nslice)
• Mapping is static
• Cache block installation at miss time
• No explicit control
• (Run-time can impact location within slice)
Mapping granularity = block
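A minimal sketch of the static shared-caching mapping above. The 64-byte block size, the 16 slices, and the name `block_slice` are illustrative assumptions, not values from the slides.

```python
BLOCK_SIZE = 64  # bytes per cache block (assumed)
N_SLICE = 16     # number of L2 cache slices (assumed)

def block_slice(addr: int, n_slice: int = N_SLICE,
                block_size: int = BLOCK_SIZE) -> int:
    """slice number = (block address) % (Nslice)."""
    block_address = addr // block_size
    return block_address % n_slice

# Consecutive blocks land on consecutive slices, regardless of
# which core the program runs on.
first_four = [block_slice(a) for a in range(0, 4 * BLOCK_SIZE, BLOCK_SIZE)]
```

Because the slice depends only on the address, neither the OS nor the program location has any say in where a block is cached.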
Changing cache mapping granularity
Memory blocks → Memory pages
miss rate?
impact on existing techniques? (e.g., prefetching)
latency?
Observation: page-level mapping
[Figure: memory pages of Program 1 and Program 2 assigned to cache slices through OS page allocation]
Mapping data to different $$ feasible
Key: OS page allocation policies
Flexible
Goal 1: performance management
Proximity-aware data mapping
Goal 2: power management
Usage-aware cache shut-off
Goal 3: reliability management
On-demand cache isolation
Goal 4: QoS management
Contract-based cache allocation
Architectural support
On an L1 miss, the data address (page_num, page offset) determines slice_num
Method 1: “bit selection”
• slice_num = (page_num) % (Nslice)
• data address: other bits | slice_num | page offset
Method 2: “region table”
• regionx_low ≤ page_num ≤ regionx_high
• reg_table: (region0_low, region0_high) → slice_num0; (region1_low, region1_high) → slice_num1
Method 3: “page table (TLB)”
• page_num ↔ slice_num
• TLB: (vpage_num0, ppage_num0) → slice_num0; (vpage_num1, ppage_num1) → slice_num1
Simple hardware support enough
Combined scheme feasible
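The three lookup methods can be sketched in software as follows. This is only an illustration of the mapping logic, not the hardware design. The function names, the table layouts, and the value of `N_SLICE` are assumptions.

```python
from typing import Dict, List, Optional, Tuple

N_SLICE = 16  # number of L2 cache slices (assumed)

def slice_bit_selection(page_num: int, n_slice: int = N_SLICE) -> int:
    """Method 1: slice_num = page_num % Nslice.

    When Nslice is a power of two this selects low-order page-number bits.
    """
    return page_num % n_slice

def slice_region_table(page_num: int,
                       reg_table: List[Tuple[int, int, int]]) -> Optional[int]:
    """Method 2: each row is (region_low, region_high, slice_num)."""
    for low, high, slice_num in reg_table:
        if low <= page_num <= high:
            return slice_num
    return None  # page falls outside every region

def slice_tlb(vpage_num: int,
              tlb: Dict[int, Tuple[int, int]]) -> Optional[int]:
    """Method 3: each TLB entry maps vpage_num -> (ppage_num, slice_num)."""
    entry = tlb.get(vpage_num)
    return None if entry is None else entry[1]

# Hypothetical table contents for illustration.
reg_table = [(0, 99, 3), (100, 199, 7)]
tlb = {5: (42, 11)}
```

Bit selection needs no table at all, while the TLB variant allows an arbitrary per-page choice at the cost of a wider TLB entry.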
Some OS design issues
Congruence group CG(i)
• Set of physical pages mapped to slice i
• A free list for each i → multiple free lists
On each page allocation, consider
• Data proximity
• Cache pressure
• (e.g.) Profitability function P = f(M, L, P, Q, C)
  M: miss rates
  L: network link status
  P: current page allocation status
  Q: QoS requirements
  C: cache configuration
Impact on process scheduling
Leverage existing frameworks
• Page coloring – multiple free lists
• NUMA OS – process scheduling & page allocation
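A minimal sketch of per-slice free lists, assuming bit selection defines the congruence groups. The slides do not define the profitability function f, so the sketch takes precomputed per-slice scores as input. `PageAllocator` and its methods are hypothetical names.

```python
from collections import deque
from typing import Dict, Tuple

class PageAllocator:
    """Keeps one free list per congruence group CG(i)."""

    def __init__(self, n_pages: int, n_slice: int) -> None:
        self.n_slice = n_slice
        self.free_lists = [deque() for _ in range(n_slice)]
        # Assumption: physical page p belongs to CG(p % n_slice).
        for ppage in range(n_pages):
            self.free_lists[ppage % n_slice].append(ppage)

    def allocate(self, profitability: Dict[int, float]) -> Tuple[int, int]:
        """Return (physical page, slice) from the most profitable slice
        that still has a free page. Unlisted slices score 0.0."""
        ranked = sorted(range(self.n_slice),
                        key=lambda s: profitability.get(s, 0.0),
                        reverse=True)
        for s in ranked:
            if self.free_lists[s]:
                return self.free_lists[s].popleft(), s
        raise MemoryError("no free physical pages")

alloc = PageAllocator(n_pages=32, n_slice=16)
scores = {4: 0.9, 6: 0.8, 5: 0.7}  # example scores, chosen for illustration
first = alloc.allocate(scores)
```

When the preferred group runs out of free pages, the allocator falls through to the next most profitable slice, which is one simple way to react to cache pressure.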
Working example
[Figure: a program on a chip with tiles numbered 0–15; figure labels: 5, 5, 5, 5, 4, 1, 6]
Example profitability values:
• P(4) = 0.9, P(6) = 0.8, P(5) = 0.7, …
• P(1) = 0.95, P(6) = 0.9, P(4) = 0.8, …
Static vs. dynamic mapping
Program information (e.g., profile)
Proper run-time monitoring needed
Page mapping policies
Simulating private caching
For a page requested from a program running on core i, map the page to cache slice i
[Figure: L2 cache latency (cycles) vs. L2 cache slice size (128kB, 256kB, 512kB) for SPEC2k INT and SPEC2k FP, comparing private caching with the OS-based scheme]
Simulating private caching is simple
Similar or better performance
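The private-caching policy can be sketched as a single lookup, assuming the OS keeps one free list of physical pages per slice. `allocate_private` and the list layout are hypothetical.

```python
from typing import List, Optional

def allocate_private(core: int,
                     free_lists: List[List[int]]) -> Optional[int]:
    """Map a page requested on core i to slice i.

    free_lists[i] is assumed to hold the free physical pages whose
    data maps to slice i. Returns None if that list is empty.
    """
    if free_lists[core]:
        return free_lists[core].pop(0)
    return None

# Hypothetical free lists for a 4-slice chip.
free_lists = [[0, 4], [1, 5], [2], [3]]
page_for_core1 = allocate_private(1, free_lists)
```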
Simulating “large” private caching
For a page requested from a program running on core i, map the page to cache slice i; also spread pages
[Figure: Relative performance (time⁻¹) of OS vs. private with a 512kB cache slice; SPEC2k INT: gcc, parser, eon, twolf; SPEC2k FP: wupwise, galgel, ammp, sixtrack; one off-scale bar labeled 1.93]
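The slides do not say how pages are spread. As one possible sketch, assume a 4×4 mesh with row-major tile numbering and a per-slice page limit: the local slice is preferred, and once it is full, pages spill to the nearest slice with room. All names and the limit are assumptions.

```python
from typing import List

MESH_WIDTH = 4  # assumed 4x4 mesh, tiles numbered row by row

def hops(a: int, b: int, width: int = MESH_WIDTH) -> int:
    """Manhattan distance between tiles a and b on the mesh."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)

def choose_slice_spread(core: int, pages_per_slice: List[int],
                        limit: int) -> int:
    """Prefer slice `core`; once it holds `limit` pages, pick the
    nearest slice below the limit (ties: fewer pages, then lower id)."""
    candidates = [s for s, n in enumerate(pages_per_slice) if n < limit]
    if not candidates:
        # Every slice is at the limit: fall back to all slices.
        candidates = list(range(len(pages_per_slice)))
    return min(candidates,
               key=lambda s: (hops(core, s), pages_per_slice[s], s))
```

With `limit` set very high this degenerates to plain private caching. Lowering it trades hit latency for effective capacity.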
Simulating shared caching
For a page requested from a program running on core i, map the page to all cache slices (round-robin, random, …)
[Figure: L2 cache latency (cycles) vs. L2 cache slice size (128kB, 256kB, 512kB) for SPEC2k INT and SPEC2k FP, comparing shared with OS; two off-scale bars labeled 129 and 106]
Simulating shared caching is simple
Mostly similar behavior/performance
Pathological cases (e.g., applu)
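A minimal sketch of the round-robin variant of this policy: successive pages are assigned to successive slices regardless of the requesting core. `make_round_robin` is a hypothetical helper.

```python
from itertools import count
from typing import Callable

def make_round_robin(n_slice: int) -> Callable[[], int]:
    """Return a function that yields slice numbers 0, 1, ..., n_slice-1
    in turn, one per page allocation."""
    counter = count()

    def next_slice() -> int:
        return next(counter) % n_slice

    return next_slice

rr = make_round_robin(4)
assigned = [rr() for _ in range(6)]
```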
Simulating clustered caching
For a page requested from a program running on core of group j, map the page to any cache slice within group (round-robin, random, …)
[Figure: chip with tiles numbered 0–15]
[Figure: Relative performance (time⁻¹) for FFT, LU, RADIX, and OCEAN, comparing private, OS, and shared; 4 cores used; 512kB cache slice]
Simulating clustered caching is simple
Lower miss traffic than private
Lower on-chip traffic than shared
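A sketch of clustered mapping with round-robin placement inside each group. The group membership shown (2×2 quadrants of a 4×4 chip) and the class name are assumptions made for illustration.

```python
from typing import List

class ClusteredMapper:
    """Maps pages from a core to the slices of that core's group."""

    def __init__(self, groups: List[List[int]]) -> None:
        self.groups = groups
        self.group_of = {core: g
                         for g, members in enumerate(groups)
                         for core in members}
        self.next_idx = [0] * len(groups)  # round-robin cursor per group

    def slice_for(self, core: int) -> int:
        g = self.group_of[core]
        members = self.groups[g]
        chosen = members[self.next_idx[g] % len(members)]
        self.next_idx[g] += 1
        return chosen

# Hypothetical grouping: two quadrants of a 4x4 chip.
mapper = ClusteredMapper([[0, 1, 4, 5], [2, 3, 6, 7]])
from_core5 = [mapper.slice_for(5) for _ in range(5)]
```

A group of size one behaves like private caching, and a single group covering every slice behaves like shared caching.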
Profile-driven page mapping
Using profiling, collect:
• Inter-page conflict information
• Per-page access count information
Page mapping cost function (per slice)
• Given program location, page to map, and previously mapped pages
• cost = (# conflicts × miss penalty) + weight × (# accesses × latency)
  first term: miss cost; second term: latency cost
• weight as a knob
  Larger value → more weight on proximity (than miss rate)
  Optimize both miss rate and data proximity
Theoretically important to understand limits
Can be practically important, too
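The per-slice cost can be sketched as below, reading the two terms as products (miss cost plus weighted latency cost). The function names and all numeric inputs are illustrative assumptions, not profile data from the talk.

```python
from typing import List

def mapping_cost(n_conflicts: int, n_accesses: int, latency: int,
                 miss_penalty: int, weight: int) -> int:
    """(# conflicts x miss penalty) + weight x (# accesses x latency)."""
    miss_cost = n_conflicts * miss_penalty
    latency_cost = n_accesses * latency
    return miss_cost + weight * latency_cost

def best_slice(conflicts: List[int], latencies: List[int],
               n_accesses: int, miss_penalty: int, weight: int) -> int:
    """Return the slice with the lowest cost for the page being mapped.

    conflicts[s]: profiled conflicts with pages already mapped to slice s.
    latencies[s]: access latency from the program location to slice s.
    """
    costs = [mapping_cost(c, n_accesses, lat, miss_penalty, weight)
             for c, lat in zip(conflicts, latencies)]
    return min(range(len(costs)), key=costs.__getitem__)

# Slice 0 is close but conflict-prone; slice 1 is far but conflict-free.
conflicts, latencies = [10, 0], [10, 30]
```

With these inputs, a weight of 0 selects the conflict-free remote slice, while a weight of 2 selects the nearby slice, which shows how the knob shifts the balance toward proximity.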
Profile-driven page mapping, cont’d
[Figure: Breakdown of L2 cache accesses (0%–100%) per benchmark as the weight varies; legend: remote, local, miss, on-chip hit; 256kB L2 cache slice; benchmarks: ammp, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, twolf, vortex, vpr, wupwise]
Profile-driven page mapping, cont’d
[Figure: # pages mapped (y-axis, 0–450) for GCC, with the program location marked; 256kB L2 cache slice]
Profile-driven page mapping, cont’d
[Figure: Performance improvement over shared caching (y-axis, -20% to 80%) per benchmark; 256kB L2 cache slice; benchmarks: ammp, art, bzip2, crafty, eon, equake, gap, gcc, gzip, mcf, mesa, mgrid, parser, twolf, vortex, vpr, wupwise; bar labels as extracted: -1%, 9%, 1%, 17%, 7%, 0%, 9%, 21%, 2%, 39%, 4%, 3%, 9%, 23%, 2%, 6%, plus one off-scale bar labeled 108%]
Room for performance improvement
Best of the two or better than the two
Dynamic mapping schemes desired
Isolating faulty caches
When there are faulty cache slices, avoid mapping pages to them
[Figure: Relative L2 cache latency (y-axis, 0–8) vs. # cache slice deletions (0, 1, 2, 4, 8), comparing shared with OS; 4 cores running a multiprogrammed workload; 512kB cache slice]
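Isolation amounts to removing faulty slices from the set the allocator may choose from. A minimal sketch, assuming round-robin placement over the remaining slices (the function names are hypothetical):

```python
from typing import List, Set

def usable_slices(n_slice: int, faulty: Set[int]) -> List[int]:
    """All slice numbers except the faulty ones."""
    return [s for s in range(n_slice) if s not in faulty]

def map_pages_avoiding_faults(n_slice: int, faulty: Set[int],
                              n_pages: int) -> List[int]:
    """Assign n_pages pages round-robin over the healthy slices."""
    healthy = usable_slices(n_slice, faulty)
    if not healthy:
        raise ValueError("all cache slices are marked faulty")
    return [healthy[i % len(healthy)] for i in range(n_pages)]

placement = map_pages_avoiding_faults(n_slice=4, faulty={1}, n_pages=5)
```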
Conclusion
“Flexibility” will become important in future multicores
• Many shared resources
• Allows us to implement high-level policies
OS-level page-granularity data-to-slice mapping
• Low hardware overhead
• Flexible
Several management policies studied
• Mimicking private/shared/clustered caching straightforward
• Performance-improving schemes
Future work
Dynamic mapping schemes
• Performance
• Power
Performance monitoring techniques
• Hardware-based
• Software-based
Data migration and replication support
Managing Distributed, Shared L2 Caches
through OS-Level Page Allocation
Sangyeun Cho and Lei Jin
Dept. of Computer Science
University of Pittsburgh
Thank you!
Multicores are here
AMD Opteron dual-core (2005)
IBM Power5 (2004)
Sun Micro. T1, 8 cores (2005)
Intel Core2 Duo (2006)
Quad cores (2007)
Intel 80 cores? (2010?)
A multicore outlook
???
A processor model
Many cores (e.g., 16)
[Figure: multicore chip in which each tile contains a processor core, a local L2 cache, and a router]
Private L1 I/D-$$
• 8kB~32kB
Local unified L2 $$
• 128kB~512kB
• 8~18 cycles
Switched network
• 2~4 cycles/switch
Distributed directory
• Scatter hotspots
Other approaches
Hybrid/flexible schemes
• “Core clustering” [Speight et al., ISCA2005]
• “Flexible CMP cache sharing” [Huh et al., ICS2004]
• “Flexible bank mapping” [Liu et al., HPCA2004]
Improving shared caching
• “Victim replication” [Zhang and Asanovic, ISCA2005]
Improving private caching
• “Cooperative caching” [Chang and Sohi, ISCA2006]
• “CMP-NuRAPID” [Chishti et al., ISCA2005]