Anurag Mukkara, Nathan Beckmann, Daniel Sanchez
MIT CSAIL
ASPLOS XXI - Atlanta, Georgia – 4 April 2016
WHIRLPOOL!
IMPROVING DYNAMIC CACHE MANAGEMENT
WITH STATIC DATA CLASSIFICATION
Processors are limited by data movement
Data movement often consumes >50% of time & energy
E.g., FP multiply-add: 20 pJ; DRAM access: 20,000 pJ
To scale performance, must keep data near where it’s used
But how do programs use memory?
Cache banks
Good: nearby cache banks
Bad: faraway cache banks
Terrible: DRAM access
Static policies have limitations
Diagram: program code → static analysis or profiling → binary with a fixed policy
E.g., scratchpads, bypass hints
Exploits program semantics
Can’t adapt to application phases, input-dependent behavior, or shared systems
Dynamic policies have limitations, too
Diagram: binary → observe loads/stores → dynamic policy
E.g., data migration & replication
Responsive to actual application behavior
Difficult to recover program semantics from loads/stores
Expensive mechanisms (e.g., extra data movement & directories)
Combining static and dynamic is best
Diagram: program code → static analysis or profiling → binary with Pools A, B, C, D; observe loads/stores → Policies A, B, C, D
Exploits program semantics at low overhead
Responsive to actual application behavior
Agenda
Case study
Manual classification
Parallel applications
WhirlTool
System configuration
Core
L1i L1d
Private L2
Non-uniform cache access (NUCA):
Cache banks have different access latencies
Baseline dynamic NUCA scheme
We apply Whirlpool to Jigsaw [Beckmann, PACT’13], a state-of-the-art NUCA cache
Allocates virtual caches, collections of parts of cache banks
Significantly outperforms prior D-NUCA schemes
Reduce cache misses
Reduce on-chip network traversals
Simple mechanisms
Dynamic policies can reduce data movement
App: Delaunay triangulation
Figure panels: Static NUCA; Jigsaw [Beckmann, PACT’13]
Dynamic policy performs somewhat better: 4% better performance, 12% lower energy
Static analysis can help!
[Figure: access intensity by data structure (Points, Vertices, Triangles); labels: Accesses, Footprint (MB)]
Jigsaw with Static Classification
Figure panels: Jigsaw [Beckmann, PACT’13]; Whirlpool!
Whirlpool vs. Jigsaw: 19% better performance, 42% lower energy
A few data structures are accessed more frequently than others
[Figure: access intensity by data structure (Points, Vertices, Triangles)]
Whirlpool – Manual classification
Organize application data into memory pools
// Points get their own pool
int poolPoints = pool_create();
Point* points = pool_malloc(sizeof(Point)*n, poolPoints);
// Small and large triangles share one pool
int poolTris = pool_create();
Tri* smallTris = pool_malloc(sizeof(Tri)*m, poolTris);
Tri* largeTris = pool_malloc(sizeof(Tri)*M, poolTris);
Insight: Group semantically similar data into a pool
Points, Triangles
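The slide uses pool_create() and pool_malloc() without showing what is behind them. As a rough illustration only, the sketch below backs that interface with one fixed-size, bump-allocated arena per pool. MAX_POOLS, POOL_BYTES, and pool_of() are hypothetical names introduced here, and the step that ties a pool to a virtual cache is not modeled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_POOLS 8            /* hypothetical limit on the number of pools */
#define POOL_BYTES (1u << 20)  /* hypothetical arena size: 1 MiB per pool */

typedef struct {
    unsigned char *base;  /* start of the pool's arena */
    size_t used;          /* bytes handed out so far */
} Pool;

static Pool pools[MAX_POOLS];
static int num_pools = 0;

/* Create a new pool and return its id, or -1 on failure. */
int pool_create(void) {
    if (num_pools >= MAX_POOLS) return -1;
    unsigned char *base = malloc(POOL_BYTES);
    if (base == NULL) return -1;
    pools[num_pools].base = base;
    pools[num_pools].used = 0;
    return num_pools++;
}

/* Allocate size bytes from pool id; NULL if the id is invalid or the arena is full. */
void *pool_malloc(size_t size, int id) {
    if (id < 0 || id >= num_pools) return NULL;
    Pool *p = &pools[id];
    size_t aligned = (size + 15u) & ~(size_t)15u;  /* round up to a multiple of 16 */
    if (aligned < size || aligned > POOL_BYTES - p->used) return NULL;
    void *ptr = p->base + p->used;
    p->used += aligned;
    return ptr;
}

/* Return the id of the pool whose arena contains ptr, or -1 if none does. */
int pool_of(const void *ptr) {
    uintptr_t addr = (uintptr_t)ptr;
    for (int i = 0; i < num_pools; i++) {
        uintptr_t lo = (uintptr_t)pools[i].base;
        if (addr >= lo && addr - lo < POOL_BYTES) return i;
    }
    return -1;
}
```

Because every allocation from a pool comes out of one contiguous arena, an address can be mapped back to its pool with a range check, which is the property a placement mechanism would need.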
Minor changes to programs

| Suite | Application | Pools | LOC |
|---|---|---|---|
| PBBS | Delaunay triangulation | 3 | 11 |
| PBBS | Maximal matching | 3 | 13 |
| PBBS | Delaunay refinement | 3 | 8 |
| PBBS | Maximal independent set | 3 | 13 |
| PBBS | Minimal spanning forest | 3 | 11 |
| SPEC CPU2006 | 401.bzip2 | 4 | 43 |
| SPEC CPU2006 | 470.lbm | 2 | 21 |
| SPEC CPU2006 | 429.mcf | 2 | 14 |
| SPEC CPU2006 | 436.cactusADM | 2 | 53 |
Whirlpool on NUCA placement
Use pools to improve Jigsaw’s decisions
Each pool is allocated to a virtual cache
Jigsaw transparently places pools in NUCA banks
Whirlpool requires no changes to core Jigsaw
Increase size of structures (few KBs)
Minor improvements, e.g. bypassing (see paper)
Pools are useful elsewhere, e.g., for dynamic prefetching
Significant improvements on some apps
[Figure: two bar charts over bzip2, refine, MST, lbm, mcf, cactus, matching, DT, MIS. Performance: "Speedup vs Jigsaw (%)", axis 0 to 14 with one clipped bar labeled 38. Energy: "Energy savings vs Jigsaw (%)", axis 0 to 60]
Up to 38% better performance; up to 53% lower energy
Conventional runtimes can harm locality
Optimize load balance, not locality
Whirlpool co-locates tasks and data
Break input into pools
Application indicates task affinity
Schedule and steal tasks near their data
Dynamically adapt data placement
Requires minimal changes to task-parallel runtimes
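One way to picture stealing tasks near their data is a victim search that prefers the closest non-empty queue. The sketch below is an illustration under stated assumptions, not the Whirlpool runtime: it assumes a 4x4 mesh of 16 cores with row-major core ids and Manhattan hop distance, and MESH_DIM, hop_distance(), and pick_victim() are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>

#define MESH_DIM 4                       /* assumed 4x4 mesh */
#define NUM_CORES (MESH_DIM * MESH_DIM)  /* 16 cores */

/* Manhattan hop count between two cores with row-major ids. */
int hop_distance(int a, int b) {
    int dx = abs(a % MESH_DIM - b % MESH_DIM);
    int dy = abs(a / MESH_DIM - b / MESH_DIM);
    return dx + dy;
}

/*
 * Pick the core to steal from: the closest core other than the thief
 * whose queue is non-empty. queue_len[i] is the number of pending tasks
 * on core i. Ties go to the lowest core id. Returns -1 if every other
 * queue is empty.
 */
int pick_victim(int thief, const int queue_len[NUM_CORES]) {
    assert(thief >= 0 && thief < NUM_CORES);
    int best = -1;
    int best_dist = 0;
    for (int c = 0; c < NUM_CORES; c++) {
        if (c == thief || queue_len[c] == 0) continue;
        int d = hop_distance(thief, c);
        if (best < 0 || d < best_dist) {
            best = c;
            best_dist = d;
        }
    }
    return best;
}
```

If tasks are first enqueued on the core nearest their pool, stealing from the nearest queue keeps a stolen task's data only a few hops away, while a uniformly random victim would ignore placement.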
Whirlpool improves locality
Whirlpool adapts schedule dynamically
Data placement implicitly schedules tasks
Significant improvements at 16 cores
[Figure: two bar charts over MS, FFT, TC, DT, PR, CC. "Speedup vs Jigsaw (%)", axis 0 to 70; "Energy savings vs Jigsaw", axis 1.0 to 3.0]
Up to 67% better performance; up to 2.6x lower energy
Applications
Divide and conquer algorithms: Mergesort (MS), FFT
Graph analytics: PageRank (PR), Triangle Counting (TC), Connected Components (CC)
Graphics: Delaunay Triangulation (DT)
Caveat: Splitting data into pools can be expensive!
WhirlTool – Automated classification
Modifying program code is not always practical
A profile-guided tool can automatically classify data into pools
Diagram (offline): WhirlTool Profiler → per-callpoint miss curves → WhirlTool Analyzer → callpoint-to-pool map
Diagram (at run time): Application malloc() → WhirlTool runtime → pool_malloc() → Whirlpool Allocator
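The run-time step in the diagram, turning a malloc() call into a pool_malloc() call by consulting the callpoint-to-pool map, can be sketched as a table lookup. This is an illustration only: the slides do not say how WhirlTool identifies callpoints at run time, so the numeric callpoint id, the sorted-array layout, DEFAULT_POOL, and lookup_pool() are all assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define DEFAULT_POOL 0  /* assumed fallback pool for unprofiled callpoints */

typedef struct {
    uintptr_t callpoint;  /* identifier of the allocation site (assumed numeric) */
    int pool;             /* pool chosen for that site by the analyzer */
} MapEntry;

/*
 * Look up the pool for a callpoint in a map sorted by callpoint,
 * using binary search. Unknown callpoints fall back to DEFAULT_POOL.
 */
int lookup_pool(const MapEntry *map, size_t n, uintptr_t callpoint) {
    assert(map != NULL || n == 0);
    size_t lo = 0, hi = n;  /* search the half-open range [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (map[mid].callpoint == callpoint) return map[mid].pool;
        if (map[mid].callpoint < callpoint) lo = mid + 1;
        else hi = mid;
    }
    return DEFAULT_POOL;
}
```

A malloc() wrapper would call lookup_pool() with the id of its caller and pass the result to pool_malloc(), so the application source stays unchanged.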
WhirlTool profiles miss curves
Groups allocations by callpoint
Profiles accesses to each pool
Periodically records per-callpoint miss curves
[Figure: application allocations (Alloc) and accesses (Accs) from callpoints A, B, C, …; miss curves (misses vs. cache size) recorded over time]
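A miss curve gives the number of misses as a function of cache size. The slides do not say how the profiler computes it. One classic approach, shown here purely as an assumption, is to record the LRU stack distance of every access in a trace of line addresses and accumulate those distances into a curve for a fully associative LRU cache. MAX_LINES and miss_curve() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LINES 64  /* assumed largest cache size of interest, in lines */

/*
 * Fill curve[s], for 0 <= s <= MAX_LINES, with the number of misses a
 * fully associative LRU cache of s lines would incur on the trace.
 */
void miss_curve(const unsigned long *trace, size_t n, size_t curve[MAX_LINES + 1]) {
    unsigned long stack[MAX_LINES];       /* LRU stack, most recent first */
    size_t depth = 0;                     /* number of valid stack entries */
    size_t hits_at[MAX_LINES + 1] = {0};  /* hits_at[d]: accesses with stack distance d */

    assert(trace != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        /* Search the stack; a line found at index pos has distance pos + 1. */
        size_t pos = 0;
        while (pos < depth && stack[pos] != trace[i]) pos++;
        if (pos < depth) {
            hits_at[pos + 1]++;
        } else if (depth < MAX_LINES) {
            depth++;              /* first touch: the stack grows by one */
        } else {
            pos = MAX_LINES - 1;  /* stack full: overwrite the least recent line */
        }
        /* Move the accessed line to the top of the stack. */
        for (size_t j = pos; j > 0; j--) stack[j] = stack[j - 1];
        stack[0] = trace[i];
    }

    /* A cache of s lines hits every access whose stack distance is at most s. */
    size_t hits = 0;
    for (size_t s = 0; s <= MAX_LINES; s++) {
        hits += hits_at[s];
        curve[s] = n - hits;
    }
}
```

Running this once per callpoint, over the accesses that touch that callpoint's allocations, would give one curve per callpoint, which is the shape of input the later clustering step needs.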
WhirlTool analyzes curves to find pools
Hardware can only support a limited number of pools
Jigsaw uses 3 virtual caches / thread
0.6% area overhead over LLC
Whirlpool adds 4 pools (each mapped to a virtual cache)
1.2% total area overhead over LLC
Must cluster callpoints into semantically similar groups
Diagram: per-callpoint miss curves → agglomerative clustering → callpoint-to-pool mapping
Example of agglomerative clustering
[Figure: clustering example with merge steps labeled 1, 2, 3]
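Agglomerative clustering starts with every item in its own cluster and repeatedly merges the two closest clusters until the target number of clusters remains. The sketch below is generic and rests on its own assumptions: distances come from a caller-supplied symmetric, non-negative matrix, and the distance between two clusters is the largest pairwise distance between their members (complete linkage). WhirlTool's actual linkage rule is not given on the slides, and N and cluster() are hypothetical names.

```c
#include <assert.h>

#define N 6  /* assumed number of callpoints to cluster */

/*
 * Cluster N items into k groups. dist is a symmetric N x N matrix of
 * non-negative distances. On return, label[i] in [0, k) is the group of
 * item i, numbered in order of first appearance.
 */
void cluster(double dist[N][N], int k, int label[N]) {
    assert(k >= 1 && k <= N);
    int group[N];  /* group[i]: smallest item index in item i's cluster */
    for (int i = 0; i < N; i++) group[i] = i;

    for (int clusters = N; clusters > k; clusters--) {
        /* Find the two clusters with the smallest complete-linkage distance. */
        int best_a = -1, best_b = -1;
        double best = 0.0;
        for (int a = 0; a < N; a++) {
            if (group[a] != a) continue;  /* a is not a cluster representative */
            for (int b = a + 1; b < N; b++) {
                if (group[b] != b) continue;
                double d = 0.0;
                for (int i = 0; i < N; i++)
                    for (int j = 0; j < N; j++)
                        if (group[i] == a && group[j] == b && dist[i][j] > d)
                            d = dist[i][j];
                if (best_a < 0 || d < best) {
                    best_a = a;
                    best_b = b;
                    best = d;
                }
            }
        }
        /* Merge cluster best_b into cluster best_a. */
        for (int i = 0; i < N; i++)
            if (group[i] == best_b) group[i] = best_a;
    }

    /* Renumber the surviving clusters to 0..k-1. */
    int remap[N];
    int next = 0;
    for (int i = 0; i < N; i++) remap[i] = -1;
    for (int i = 0; i < N; i++) {
        if (remap[group[i]] < 0) remap[group[i]] = next++;
        label[i] = remap[group[i]];
    }
}
```

With k set to the number of pools the hardware supports, label[] plays the role of a callpoint-to-pool mapping.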
WhirlTool’s distance metric
How many misses are saved by separating pools?
[Figure: two plots of misses vs. cache size for Pools 1, 2, and 3, comparing separated and combined miss curves; one case shows a small distance, the other a large distance]
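The question on the slide can be turned into a number when three miss curves are available: one for each pool on its own and one for the two pools sharing a single virtual cache. The sketch below assumes all three are given as arrays indexed by cache size in equal units, and that a separated configuration may split the capacity between the two pools in the best possible way. How WhirlTool obtains the combined curve and which cache sizes it evaluates are not stated on the slides; SIZES and pool_distance() are hypothetical names.

```c
#include <assert.h>

#define SIZES 5  /* curves are sampled at 0..SIZES-1 units of cache capacity */

/*
 * Misses saved at capacity total by giving pools A and B separate
 * partitions instead of one shared partition. Returns 0 when separating
 * the pools does not help.
 */
long pool_distance(const long a[SIZES], const long b[SIZES],
                   const long combined[SIZES], int total) {
    assert(total >= 0 && total < SIZES);
    /* Try every split of the capacity: s units to A, the rest to B. */
    long best_separated = a[0] + b[total];
    for (int s = 1; s <= total; s++) {
        long m = a[s] + b[total - s];
        if (m < best_separated) best_separated = m;
    }
    long saved = combined[total] - best_separated;
    return saved > 0 ? saved : 0;
}
```

A cache-friendly pool paired with a streaming pool yields a large value, because separation stops the streaming data from evicting the reusable data; two streaming pools yield zero, so merging them costs nothing.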
WhirlTool matches manual hints
[Figure: bar chart of "Speedup vs Jigsaw (%)" for WhirlTool and Manual classification over leslie, gcc, gems, bzip2, omnet, ray, refine, sphinx3, MST, lbm, setCover, soplex, xalanc, mcf, SA, cactus, matching, DT, MIS; axis 0 to 14, with clipped bars labeled 38]
Multiprogram mixes
4-core system with random SPEC CPU2006 apps
Including those that do not benefit
Whirlpool improves performance by (gmean over 20 mixes):
35% over S-NUCA
30% over idealized shared-private D-NUCA [Herrero, ISCA’10]
26% over R-NUCA [Hardavellas, ISCA’09]
18% over page placement by Awasthi et al. [Awasthi, HPCA’09]
5% over Jigsaw [Beckmann, PACT’13]
Conclusion
Semantic information from applications improves performance of dynamic policies
Coordinated data and task placement gives large improvements in parallel applications
Automated classification reduces programmer burden
THANKS FOR YOUR ATTENTION!
QUESTIONS ARE WELCOME!
WhirlTool code available at http://bit.ly/WhirlTool