sfma13-slidesomutlu/pub/tumanov_sfma13... · 2013. 5. 8. · Title: sfma13-slides.pptx Author:...
Asymmetry-aware execution placement on manycore chips
Alexey Tumanov, Joshua Wise, Onur Mutlu, Greg Ganger
CARNEGIE MELLON UNIVERSITY
Introduction: Core Scaling
Alexey Tumanov © April 2013!http://www.pdl.cmu.edu/ 2
• Moore’s Law continues: can still fit more transistors on a chip
?
Introduction: Memory Controllers
• Multi-core, multi-MC chips are gaining prevalence
• But now memory access latency is NOT:
  • uniform (UMA)
  • hierarchically non-uniform (NUMA)
[Figure legend: core, MC]
NUMA
• NUMA: symmetrically non-uniform
[Figure: bottom-left MC access from all cores; quadrants connected by a crossbar. Relative latency per core on the grid:]
2x 2x 2x 2x
2x 2x 2x 2x
x  x  2x 2x
x  x  2x 2x
ANUMA
• ANUMA: asymmetrically non-uniform
[Figure: bottom-left MC access from all cores; legend: core, MC]
NUMA vs. ANUMA
• Latency & bandwidth asymmetries in multi-MC manycore NoC systems are different
  • [NUMA] sharp division into local/remote that dwarfs other intra-node asymmetries
  • [ANUMA] gradual continuum of access latencies
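The contrast between the two models can be sketched in a few lines of Python. This is an illustration only, not the authors' model: the function names, the local/remote costs, and the linear per-hop cost are assumptions. The NUMA variant reproduces the x/2x quadrant pattern from the NUMA slide; the ANUMA variant assumes latency grows with Manhattan distance to the MC.

```python
from typing import List, Tuple

Grid = List[List[float]]


def numa_latency(size: int, mc: Tuple[int, int],
                 local: float = 1.0, remote: float = 2.0) -> Grid:
    """Quadrant NUMA model (assumed): cores in the MC's quadrant pay
    `local`, all other cores pay `remote`. `mc` is (row, col)."""
    half = size // 2
    mc_quadrant = (mc[0] >= half, mc[1] >= half)
    return [[local if (r >= half, c >= half) == mc_quadrant else remote
             for c in range(size)]
            for r in range(size)]


def anuma_latency(size: int, mc: Tuple[int, int],
                  base: float = 1.0, per_hop: float = 0.1) -> Grid:
    """ANUMA model (assumed): latency grows linearly with the Manhattan
    distance between the core and the MC, giving a gradual continuum."""
    return [[base + per_hop * (abs(r - mc[0]) + abs(c - mc[1]))
             for c in range(size)]
            for r in range(size)]


if __name__ == "__main__":
    bottom_left = (3, 0)  # row 3, column 0 on a 4x4 grid
    for row in numa_latency(4, bottom_left):
        print(row)
    for row in anuma_latency(4, bottom_left, per_hop=0.25):
        print(row)
```

In the NUMA grid only two latency values appear; in the ANUMA grid every hop count yields its own value, which is the "continuum" the slide refers to.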
Problem Statement
• MC access asymmetries can result in suboptimal placement w.r.t. MC access patterns
Motivating Experiments
• “Bare metal” latency differential ~ 14%
  • does it matter?
• GCC: 1 thread, 1 MC, 1 core at a time, Tile Linux
[Figure: GCC runtime (µs) vs. Manhattan distance to MC (0 to 12), with a per-core map over grid coordinates 0 to 7 in each dimension. The runtime axis spans roughly 1.1e+07 to 1.24e+07 µs and does not start at 0; the annotated differential is 14%.]
Motivation
• Latency differential gets much worse
  • 14% figure is NOT an upper bound
• Major factor: contention on the NoC
  • NoC node and link contention slows down packets
  • amplifying effect on “bare metal” asymmetries
• Examples:
  • up to 100% latency differential in a simulator
Exploring Solution Space: NUMA
• Designate grid quadrants as NUMA zones
(+) a well-explored approach
(+) kernel code readily available
(-) lacks flexibility
(-) contention avoidance
Solution Space: data placement
• OS allocates physical frames in a way that:
  • balances the NoC contention
  • optimizes some objective function
  • maximizes throughput
(+) proactive
(-) no crystal ball: which pages will be hot?
(-) once allocated, can’t be easily undone
  • inter-MC page migration worsens NoC contention
Solution Space: thread placement
• Move execution closer to MCs used
(-) waits for the problem to occur before acting
(-) cache thrashing
(+) better-informed decisions are possible
  • MC usage statistics collection
Solution Space: hybrid
• Combine execution and data placement
  • allocate frames, then adjust execution placement
  • place execution, then adjust data placement
(+) Corrective action possible
(-) Complexity
MC Usage Stats Collection
• Interface:
  • procfs control handle to start/stop collection
  • procfs data file for stats data extraction
    – pages accessed per MC per sampling cycle
• Implementation
  • instrumented VM to page fault
  • each page fault is attributed to corresponding MCi
  • very inefficient, but does work
  • adjustable sampling frequency (20Hz)
    – trades off overhead vs. accuracy
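The collection scheme above can be pictured with a small user-space model. This is a sketch under stated assumptions, not the kernel instrumentation from the talk: the class `McStats`, the helper `mc_of_page`, the page size, the MC count, and the round-robin page-to-MC striping are all hypothetical. It only illustrates attributing each page fault to one MC and snapshotting per-thread counts once per sampling cycle.

```python
from collections import defaultdict
from typing import Dict, List

PAGE_SIZE = 4096  # assumed page size in bytes
NUM_MCS = 4       # assumed number of memory controllers


def mc_of_page(phys_addr: int) -> int:
    """Hypothetical mapping: physical pages striped round-robin over MCs."""
    return (phys_addr // PAGE_SIZE) % NUM_MCS


class McStats:
    """Toy model of per-thread, per-MC page-access counting."""

    def __init__(self) -> None:
        self.collecting = False
        self._cycle: Dict[int, List[int]] = defaultdict(lambda: [0] * NUM_MCS)
        self.history: List[Dict[int, List[int]]] = []

    def start(self) -> None:
        # Stands in for writing "start" to the control handle.
        self.collecting = True

    def stop(self) -> None:
        self.collecting = False

    def on_page_fault(self, tid: int, phys_addr: int) -> None:
        """Attribute one fault by thread `tid` to the MC serving the page."""
        if self.collecting:
            self._cycle[tid][mc_of_page(phys_addr)] += 1

    def end_cycle(self) -> Dict[int, List[int]]:
        """Close the sampling cycle (e.g. every 1/20 s at 20 Hz) and
        return the counts gathered during it."""
        snapshot = {tid: list(c) for tid, c in self._cycle.items()}
        self.history.append(snapshot)
        self._cycle.clear()
        return snapshot


if __name__ == "__main__":
    stats = McStats()
    stats.start()
    stats.on_page_fault(tid=1, phys_addr=0)
    stats.on_page_fault(tid=1, phys_addr=5 * PAGE_SIZE)
    print(stats.end_cycle())
```

A shorter sampling cycle gives fresher access vectors at the price of more frequent faults and snapshots, which is the overhead vs. accuracy trade-off named on the slide.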
Placement Algorithm
• Place threads nearest to MCs they use
  • greedily based on memory intensity
• Weighted geometric placement algorithm
  • calculate thread’s desired coordinates as a function of
    – MC coordinates
    – thread’s access count vector (L1-normalized)
• Placement collisions resolved
  • find second choices in direction of preferred MCs
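A minimal sketch of the weighted geometric idea follows. The slides do not give the exact formula, so this assumes the desired coordinates are the average of the MC coordinates weighted by the thread's L1-normalized access counts. Collision handling is also a simplification: a thread whose preferred tile is taken gets the free tile closest (in Manhattan distance) to its desired point, rather than a search "in direction of preferred MCs". All names and the MC coordinates in the example are hypothetical.

```python
from typing import Dict, Hashable, List, Optional, Tuple

Coord = Tuple[int, int]


def desired_coords(counts: List[int],
                   mcs: List[Coord]) -> Optional[Tuple[float, float]]:
    """Average of MC coordinates weighted by L1-normalized access counts.
    Returns None for a thread with no recorded accesses."""
    total = sum(counts)
    if total == 0:
        return None
    x = sum(n * mx for n, (mx, _) in zip(counts, mcs)) / total
    y = sum(n * my for n, (_, my) in zip(counts, mcs)) / total
    return (x, y)


def place_threads(access: Dict[Hashable, List[int]], mcs: List[Coord],
                  width: int, height: int) -> Dict[Hashable, Coord]:
    """Greedy placement: most memory-intensive threads choose first."""
    if len(access) > width * height:
        raise ValueError("more threads than tiles")
    free = {(x, y) for x in range(width) for y in range(height)}
    placement: Dict[Hashable, Coord] = {}
    # Memory intensity = total access count; ties keep input order.
    order = sorted(access, key=lambda t: -sum(access[t]))
    for tid in order:
        target = desired_coords(access[tid], mcs)
        if target is None:
            # No preference: aim for the middle of the grid (assumption).
            target = ((width - 1) / 2, (height - 1) / 2)
        tx, ty = target
        # Nearest free tile; the tile tuple breaks ties deterministically.
        best = min(free, key=lambda c: (abs(c[0] - tx) + abs(c[1] - ty), c))
        free.remove(best)
        placement[tid] = best
    return placement


if __name__ == "__main__":
    mcs = [(0, 0), (3, 0), (0, 3), (3, 3)]  # hypothetical MC positions
    access = {"a": [100, 0, 0, 0], "b": [50, 0, 0, 0], "c": [0, 0, 10, 10]}
    print(place_threads(access, mcs, width=4, height=4))
```

Processing threads in order of memory intensity means the threads that stand to lose the most from a distant tile get first pick, while lighter threads absorb the collisions.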
Bandwidth Performance Results
[Figure: bandwidth results, built up over six slides, comparing the configurations Base, Overhead (Ovrhd), Weighted (Wghtd) and Brute for two groups of bars; value axis ticks run from 250 to 550. The axis titles and legend text were not recoverable from the extraction.]
Runtime Performance Results
[Figure: bench2 runtime in microseconds (microseconds to complete bench2) for 1, 2, 4, 8 and 16 threads, built up over three slides; series: minimums, medians, maximums; value axis from 4e+06 to 1.3e+07. Each thread count has four bars, labeled N, 0, 1, 2 on the axis and B, O, WG, Br in the overlay (presumably Base, Overhead, Weighted, Brute).]
Takeaways
• Promising, but stat collection overhead is high
• Call for per-core MC stats counters
  • in hardware
• Works despite high cost of migration
Future Work
• More experimentation to understand benefits
  • as a function of application properties
  • heterogeneous workloads
• ANUMCA
  • exploring how these same ideas apply to cache
  • cache access is asymmetrically non-uniform too
• Optimizing the application-level goal
  • SLAs, multi-threaded app. performance goals
Conclusion
• Promising, but stat collection overhead is high
• Call for per-core MC stats counters
• Works despite high cost of migration
• Questions?
References
[1] Alexey Tumanov, Joshua Wise, Onur Mutlu, Greg Ganger. “Asymmetry-Aware Execution Placement on Manycore Chips”. In Proc. of SFMA’13, EuroSys 2013, Czech Republic.
[2] Reetuparna Das, Rachata Ausavarungnirun, Onur Mutlu, Akhilesh Kumar, and Mani Azimi. “Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems”. In Proc. of HPCA’2013.
[3] Tilera Corporation. “The Processor Architecture Overview for the TILEPro Series”, Feb 2013.