Donald E. Porter and Emmett Witchel
![Page 1: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/1.jpg)
1
The University of Texas at Austin
UNDERSTANDING TRANSACTIONAL MEMORY PERFORMANCE
Donald E. Porter and Emmett Witchel
![Page 2: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/2.jpg)
2
Multicore is here
Only concurrent applications will perform better on new hardware
Intel Single-chip Cloud Computer: 48 cores
Tilera Tile GX: 100 cores
This laptop: 2 Intel cores
![Page 3: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/3.jpg)
3
Concurrent programming is hard
Locks are the state of the art
  Correctness problems: deadlock, priority inversion, etc.
  Scaling performance requires more complexity
Transactional memory makes correctness easy
  Trade correctness problems for performance problems
  Key challenge: performance tuning transactions
This work:
  Develops a TM performance model and tool
  Systems integration challenges for TM
![Page 4: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/4.jpg)
4
Simple microbenchmark
Intuition:
  Transactions execute optimistically
  TM should scale at low contention threshold
  Locks always execute serially

lock();
if (rand() < threshold)
  shared_var = new_value;
unlock();

xbegin();
if (rand() < threshold)
  shared_var = new_value;
xend();
![Page 5: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/5.jpg)
5
Ideal TM performance
[Figure: execution time (s), 0 to 3, versus probability of conflict (%), 0 to 100; legend: Locks 32 CPUs. Lower is better. Ideal, not real data.]
Performance win at low contention
Higher contention degrades gracefully

xbegin();
if (rand() < threshold)
  shared_var = new_value;
xend();
![Page 6: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/6.jpg)
6
Actual performance under contention
[Figure: execution time (s), 0 to 3, versus probability of conflict (%), 0 to 100; legend: Locks 32 CPUs. Lower is better. Actual data.]
Comparable performance at modest contention
40% worse at 100% contention

xbegin();
if (rand() < threshold)
  shared_var = new_value;
xend();
![Page 7: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/7.jpg)
7
First attempt at microbenchmark
[Figure: execution time (s), 0 to 3, versus probability of conflict (%), 0 to 100; legend: Locks 32 CPUs. Lower is better. Approximate data.]

xbegin();
if (rand() < threshold)
  shared_var = new_value;
xend();
![Page 8: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/8.jpg)
8
Subtle sources of contention
Microbenchmark code:
if (a < threshold)
  shared_var = new_value;

gcc optimized code:
eax = shared_var;
if (edx < threshold)
  eax = new_value;
shared_var = eax;

Compiler optimization to avoid branches
Optimization causes 100% restart rate
Can't identify the problem with source inspection and reasoning alone
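The effect of the optimization can be illustrated with a minimal C sketch. It is not the benchmark from the talk: `store_shared`, `store_count`, and `count_stores` are hypothetical helpers that stand in for a TM system's write-set tracking, and the branch-free function only mimics the shape of the optimized code on the slide. The point is that the branch-free form stores to `shared_var` on every execution, so every transaction writes the shared location regardless of the threshold.

```c
#include <assert.h>

/* Hypothetical stand-ins for a TM system's write-set tracking. */
static int shared_var;
static int store_count;              /* stores issued to shared_var */

static void store_shared(int value) {
    shared_var = value;
    store_count++;
}

/* Source-level form: stores only when the branch is taken. */
static void update_branching(int a, int threshold, int new_value) {
    if (a < threshold)
        store_shared(new_value);
}

/* Shape of the optimized code: load, select, unconditional store. */
static void update_branch_free(int a, int threshold, int new_value) {
    int tmp = shared_var;                        /* eax = shared_var  */
    tmp = (a < threshold) ? new_value : tmp;     /* conditional move  */
    store_shared(tmp);                           /* shared_var = eax  */
}

/* Calls fn with a = 0 .. iterations-1 and returns the number of stores. */
static int count_stores(void (*fn)(int, int, int),
                        int iterations, int threshold) {
    assert(iterations >= 0);
    store_count = 0;
    shared_var = 0;
    for (int a = 0; a < iterations; a++)
        fn(a, threshold, 42);
    return store_count;
}
```

With 100 iterations and a threshold of 10, the branching form issues 10 stores while the branch-free form issues 100, which is consistent with the slide's observation that every transaction conflicts.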
![Page 9: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/9.jpg)
9
Developers need TM tuning tools
Transactional memory can perform pathologically
  Contention
  Poor integration with system components
  HTM "best effort" not good enough
Causes can be subtle and counterintuitive
Syncchar: Model that predicts TM performance
  Predicts poor performance: remove contention
  Predicts good performance but observed performance is poor: system issue
![Page 10: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/10.jpg)
10
This talk
Motivating example
Syncchar performance model
Experiences with transactional memory
  Performance tuning case study
  System integration challenges
![Page 11: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/11.jpg)
11
The Syncchar model
Approximate transaction performance model
Intuition: scalability limited by serialized length of critical regions
Introduce two key metrics for critical regions:
  Data Independence: Likelihood executions do not conflict
  Conflict Density: How many threads must execute serially to resolve a conflict
Model inputs: samples of critical region executions
  Memory accesses and execution times
![Page 12: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/12.jpg)
12
Data independence ($I_n$)
Expected number of non-conflicting, concurrent executions of a critical region.
Formally: $I_n = n - |C_n|$
  $n$ = thread count
  $C_n$ = set of conflicting critical region executions
Linear speedup when all critical regions are data independent ($I_n = n$)
  Example: thread-private data structures
Serialized execution when $I_n = 0$
  Example: concurrent updates to a shared variable
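The definition above can be sketched in C for a single sample of n concurrent executions. This is an illustration under stated assumptions, not the Syncchar implementation: `struct cr_exec`, `conflicts`, and `data_independence` are hypothetical names, each address is modeled as one bit in a mask, and an execution is counted in C_n if it has a read-write or write-write overlap with at least one other execution. The slide defines the metric as an expectation, so a real tool would average this value over many samples.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One sampled critical-region execution; each bit is one address. */
struct cr_exec {
    unsigned reads;    /* addresses read    */
    unsigned writes;   /* addresses written */
};

/* Two executions conflict if either writes something the other touches. */
static bool conflicts(struct cr_exec a, struct cr_exec b) {
    return (a.writes & (b.reads | b.writes)) != 0 ||
           (b.writes & (a.reads | a.writes)) != 0;
}

/* Returns n - |C_n| for one set of n concurrent executions. */
static size_t data_independence(const struct cr_exec *execs, size_t n) {
    assert(execs != NULL || n == 0);
    size_t conflicting = 0;
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            if (i != j && conflicts(execs[i], execs[j])) {
                conflicting++;   /* execution i belongs to C_n */
                break;
            }
        }
    }
    return n - conflicting;
}
```

Three executions that each write a private address yield 3 (linear speedup), while three executions that all write the same address yield 0 (serialized execution).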
![Page 13: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/13.jpg)
13
Example:
[Figure: two timelines showing Threads 1, 2, and 3 reading and writing variable a over time.]
Same data independence (0)
Different serialization
![Page 14: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/14.jpg)
14
Conflict density ($D_n$)
Intuition: How many threads must be serialized to eliminate a conflict?
Similar to dependence density introduced by von Praun et al. [PPoPP '07]
[Figure: two timelines showing Threads 1, 2, and 3 reading and writing variable a over time, one labeled "Low density" and one labeled "High density".]
![Page 15: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/15.jpg)
15
Syncchar metrics in STAMP
[Figure: projected speedup over locking (0 to 12) for intruder, kmeans, bayes, and ssca2 at 8, 16, and 32 CPUs; legend: Conflict Density, Data Independence. Higher is better.]
![Page 16: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/16.jpg)
16
Predicting execution time
Speedup limited by conflict density
Amdahl's law: Transaction speedup limited to time executing transactions concurrently

$$\text{Execution\_Time} = \frac{\text{cs\_cycles} \times \max(D_n, 1)}{n} + \text{other}$$

  cs_cycles = time executing a critical region
  other = remaining execution time
  $D_n$ = Conflict density
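A small worked example may make the prediction concrete. The sketch below assumes the slide's formula reads Execution_Time = cs_cycles × max(D_n, 1) / n + other, with n the thread count; the function name `predicted_execution_time` and the sample values are hypothetical.

```c
#include <assert.h>

/* Predicted execution time under the assumed reading of the formula:
 *   cs_cycles * max(D_n, 1) / n + other
 * cs_cycles: time spent in critical regions
 * other:     remaining execution time
 * conflict_density: D_n
 * n:         thread count */
static double predicted_execution_time(double cs_cycles, double other,
                                       double conflict_density, unsigned n) {
    assert(n > 0);
    /* A density below 1 cannot reduce the critical-region time further. */
    double serial_factor = conflict_density > 1.0 ? conflict_density : 1.0;
    return cs_cycles * serial_factor / (double)n + other;
}
```

For cs_cycles = 800, other = 200, and n = 8, a conflict density of 0.5 gives 800 × 1 / 8 + 200 = 300, while a conflict density of 8 gives 800 × 8 / 8 + 200 = 1000, which is the fully serialized case.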
![Page 17: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/17.jpg)
17
Syncchar tool
Implemented as Simics machine simulator module
Samples lock-based application behavior
Predicts TM performance
Features:
  Identifies contention "hot spot" addresses
  Sorts by time spent in critical region
  Identifies potential asymmetric conflicts between transactions and non-transactional threads
![Page 18: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/18.jpg)
18
Syncchar validation: microbenchmark
[Figure: execution time (s), 0 to 3, versus probability of conflict (%), 0 to 100; legend: Locks 8 CPUs, TM 8 CPUs. Lower is better.]
Tracks trends, does not model pathologies
Balances accuracy with generality
![Page 19: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/19.jpg)
19
Syncchar validation: STAMP
[Figure: execution time (s), 0 to 2, for ssca2 and intruder at 8, 16, and 32 CPUs; legend includes Predicted.]
Coarse predictions track scaling trend
Mean error 25%
Additional benchmarks in paper
![Page 20: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/20.jpg)
20
Syncchar summary
Model: data independence and conflict density
  Both contribute to transactional speedup
Syncchar tool predicts scaling trends
  Predicts poor performance: remove contention
  Predicts good performance but observed performance is poor: system issue
Distinguishing high contention from system issues is a key step in performance tuning
![Page 21: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/21.jpg)
21
This talk
Motivating example
Syncchar performance model
Experiences with transactional memory
  Performance tuning case study
  System integration challenges
![Page 22: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/22.jpg)
22
TxLinux case study
TxLinux: modifies Linux synchronization primitives to use hardware transactions [SOSP 2007]
[Figure: % of kernel time spent synchronizing (0 to 14), split into aborts and spins, for Linux, TxLinux-xs, and TxLinux-cx on pmake, bonnie++, mab, find, config, and dpunish. 16 CPUs; graph taken from SOSP talk. Lower is better.]
![Page 23: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/23.jpg)
23
Bonnie++ pathology
Simple execution profiling indicated ext3 file system journaling code was the culprit
Code inspection yielded no clear culprit
What information is missing?
  What variable is causing the contention
  What other code is contending with the transaction
Syncchar tool showed:
  Contended variable
  High probability (88-92%) of asymmetric conflict
![Page 24: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/24.jpg)
24
Bonnie++ pathology, explained
False asymmetric conflicts for unrelated bits
Tuned by moving state lock to dedicated cache line

lock(buffer->state);
...
xbegin();
...
assert(locked(buffer->state));
...
xend();
...
unlock(buffer->state);

struct bufferhead {
  …
  bit state;
  bit dirty;
  bit free;
  …
};
Tx RW
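The tuning step, moving the state lock to a dedicated cache line, can be sketched in C11 as follows. This is not the Linux `buffer_head` layout or the actual TxLinux patch: the struct and field names are hypothetical, and a 64-byte cache line is an assumption. The sketch only shows how alignment separates the lock word from unrelated flag bits so that non-transactional writes to the flags no longer touch the line a transaction reads.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

#define CACHE_LINE_SIZE 64   /* assumption: 64-byte cache lines */

/* Before tuning: lock bit and unrelated flags share one word,
 * and therefore one cache line. */
struct buffer_head_packed {
    unsigned state : 1;   /* lock bit read inside the transaction */
    unsigned dirty : 1;
    unsigned free  : 1;
};

/* Unrelated flag bits, grouped so they can be aligned as a unit. */
struct buffer_flags {
    unsigned dirty : 1;
    unsigned free  : 1;
};

/* After tuning: the lock word sits alone on its own cache line. */
struct buffer_head_tuned {
    alignas(CACHE_LINE_SIZE) unsigned state_lock;
    alignas(CACHE_LINE_SIZE) struct buffer_flags flags;
};

/* Index of the cache line that holds a given byte offset. */
static size_t cache_line_of(size_t byte_offset) {
    return byte_offset / CACHE_LINE_SIZE;
}
```

The cost of this layout is space: under these assumptions the tuned struct occupies two full cache lines, where the packed form fits in a single word.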
![Page 25: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/25.jpg)
25
Tuned performance – 16 CPUs
[Figure: execution time (s), 0 to 1.2, for bonnie++, MAB, pmake, and radix; legend: TxLinux, TxLinux Tuned; one bar is off the scale at >10 s. Lower is better.]
Tuned performance strictly dominates TxLinux
![Page 26: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/26.jpg)
26
This talk
Motivating example
Syncchar performance model
Experiences with transactional memory
  Performance tuning case study
  System integration challenges
    Compiler (motivation)
    Architecture
    Operating system
![Page 27: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/27.jpg)
27
HTM designs must handle TLB misses
Some best effort HTM designs cannot handle TLB misses
  Sun Rock
What percent of STAMP txns would abort for TLB misses?
  2% for kmeans
  50-100%
How many times will these transactions restart?
  3 (ssca2)
  908 (bayes)
Practical HTM designs must handle TLB misses
![Page 28: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/28.jpg)
28
Input size
Simulation studies need scaled inputs
  Simulating 1 second takes hours to weeks
STAMP comes with parameters for real and simulated environments
![Page 29: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/29.jpg)
29
Input size
[Figure: speedup normalized to 1 CPU (0 to 30) for genome, ssca2, and yada at 8, 16, and 32 CPUs; legend: Big, Sim. Higher is better.]
Simulator inputs too small to amortize costs of scheduling threads
![Page 30: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/30.jpg)
30
System calls – memory allocation
xbegin();
malloc();
xend();

Common case behavior: Rollback of transaction rolls back heap bookkeeping
[Figure: Thread 1 allocating from a heap of 2 pages; legend: Allocated, Free.]
![Page 31: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/31.jpg)
31
System calls – memory allocation
xbegin();
malloc();
xend();

Uncommon case behavior:
  Allocator adds pages to heap
  Rolls back bookkeeping, leaking pages
[Figure: Thread 1 allocating from a heap that grows from 2 pages to 3 pages; legend: Allocated, Free.]
Pathological memory leaks in STAMP genome and labyrinth benchmarks
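The leak can be illustrated with a toy C model. It is only a sketch under stated assumptions: `struct toy_heap`, `toy_malloc`, and `aborted_tx_malloc` are hypothetical, a real allocator and HTM are far more involved, and the model assumes that the page-growing system call takes effect in the OS and is not undone when the transaction's user-level memory writes are rolled back.

```c
#include <assert.h>

#define SLOTS_PER_PAGE 4   /* hypothetical allocation slots per page */

/* Toy heap: pages_mapped models OS state that survives an abort;
 * pages_tracked and free_slots model allocator bookkeeping in user
 * memory, which a transaction abort rolls back. */
struct toy_heap {
    int pages_mapped;
    int pages_tracked;
    int free_slots;
};

static void toy_malloc(struct toy_heap *h) {
    if (h->free_slots == 0) {        /* uncommon case: grow the heap */
        h->pages_mapped++;           /* system call side effect      */
        h->pages_tracked++;          /* user-level bookkeeping       */
        h->free_slots += SLOTS_PER_PAGE;
    }
    h->free_slots--;
}

/* Runs malloc inside a transaction that aborts: only the user-level
 * bookkeeping is restored, the mapped page count is not. */
static void aborted_tx_malloc(struct toy_heap *h) {
    int saved_tracked = h->pages_tracked;   /* checkpoint at xbegin */
    int saved_free = h->free_slots;
    toy_malloc(h);
    h->pages_tracked = saved_tracked;       /* rollback on abort    */
    h->free_slots = saved_free;
}

/* Pages the OS has mapped that the allocator no longer knows about. */
static int leaked_pages(const struct toy_heap *h) {
    return h->pages_mapped - h->pages_tracked;
}
```

Starting from 2 pages with free slots remaining, an aborted allocation leaks nothing. Starting from 2 full pages, the aborted allocation leaves 3 pages mapped but only 2 tracked, so one page is leaked per abort.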
![Page 32: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/32.jpg)
32
System integration issues
Developers need tools to identify these subtle issues
  Indicated by poor performance despite good predictions from Syncchar
  Pain for early adopters, important for designers
System call support evolving in OS community
  xCalls [Volos et al., Eurosys 2009]
    Userspace compensation built on transactional pause
  TxOS [Porter et al., SOSP 2009]
    Kernel support for transactional system calls
![Page 33: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/33.jpg)
33
Related work
TM performance models
  von Praun et al. [PPoPP '07]: Dependence density
  Heindl and Pokam [Computer Networks 2009]: analytic model of STM performance
HTM conflict behavior
  Bobba et al. [ISCA 2007]
  Ramadan et al. [MICRO 2008]
  Pant and Byrd [ICS 2009]
  Shriraman and Dwarkadas [ICS 2009]
![Page 34: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/34.jpg)
34
Conclusion
Developers need tools for tuning TM performance
Syncchar provides practical techniques
Identified system integration challenges for TM
Code available at: http://syncchar.code.csres.utexas.edu
![Page 35: Donald E. Porter and Emmett Witchel](https://reader036.fdocuments.net/reader036/viewer/2022062305/56816102550346895dd0455d/html5/thumbnails/35.jpg)
35
Backup slides