Transactional Memory
yuuki-takano
Why Transactional Memory?
• locks are difficult to manage
• deadlock
• starvation
• priority inversion
• lock convoy
• transactional memory mitigates these problems
Deadlock
• Thread 1 and Thread 2 each hold one of Lock A and Lock B
• one thread tries to acquire A and fails; the other tries to acquire B and fails
• neither thread can proceed
Starvation
• High Priority Thread (acquire A) holds Lock A
• High Priority Thread (acquire B) holds Lock B
• Low Priority Thread (acquire A and B) tries to acquire A and fails, then tries to acquire B and fails
• Lock A is released and locked again while the low priority thread is still waiting
Priority Inversion
• Low Priority Thread is holding the lock
• High Priority Thread tries to acquire the lock and fails
Lock Convoy
• Scheduler runs Thread1, Thread2, Thread3, …, ThreadN
1. contention
2. event
3. contention (spin lock)
4. acquire and reschedule
• high overhead when there are many threads
Complexity of Multithread Programming
• ideal world: algorithm + data structure → simple source code
• real world: algorithm + data structure + parallelism → parallel algorithm + parallel data structure → complicated source code (buggy, difficult to maintain)
• actually we want: simple source code
Lock and Transactional Memory
• Lock
• execute critical section exclusively
• only one thread enters the critical section at a time
• Transactional Memory
• execute critical section speculatively
• multiple threads enter the same critical section simultaneously
• conflicts are detected both while executing the critical section and at the end of the critical section
Spin-lock by Atomic Operation
• CAS (compare-and-swap)
• compare and swap are performed atomically
• test-and-set, compare-and-add, etc.
• spin-lock is achieved by using CAS or test-and-set
    volatile int locked;

    void lock_spin(void) {
        // test-and-set: if locked is 0, set it to 1 and leave the loop
        while (__sync_lock_test_and_set(&locked, 1)) {
            while (locked)
                ;  // busy-wait until the lock looks free
        }
    }

    void unlock_spin(void) {
        __sync_lock_release(&locked);
    }
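The slide names CAS while its code uses test-and-set, so here is a minimal sketch of the same spin-lock written with an actual compare-and-swap from C11 `<stdatomic.h>`. The type and function names (`cas_spinlock`, `cas_try_lock`, `cas_lock`, `cas_unlock`) are made up for this illustration and are not from the slides.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* 0 = free, 1 = held. Names are illustrative only. */
typedef struct { atomic_int locked; } cas_spinlock;

/* One CAS attempt: atomically "if locked is 0, set 1". */
static bool cas_try_lock(cas_spinlock *l) {
    int expected = 0;
    return atomic_compare_exchange_strong_explicit(
        &l->locked, &expected, 1,
        memory_order_acquire, memory_order_relaxed);
}

static void cas_lock(cas_spinlock *l) {
    while (!cas_try_lock(l)) {
        /* Spin on a plain load so waiters do not keep issuing CAS. */
        while (atomic_load_explicit(&l->locked, memory_order_relaxed))
            ;
    }
}

static void cas_unlock(cas_spinlock *l) {
    atomic_store_explicit(&l->locked, 0, memory_order_release);
}
```

A caller would declare `cas_spinlock l = { 0 };` and wrap the critical section in `cas_lock(&l)` and `cas_unlock(&l)`.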
Syntax of Transactional Memory
• atomic, retry, orElse
    atomic {
        // transaction
        if (q.size() == 0) {
            // rollback and retry
            // the transaction is restarted when
            // the read-set is updated
            retry;
        }
        … // do something
    } orElse {
        // detect rollback and retry
    }
Software Transactional Memory
• TL2
• Dave Dice, Ori Shalev, and Nir Shavit, “Transactional Locking II”, 20th International Conference on Distributed Computing, DISC 2006
• LSA
• Torvald Riegel, Pascal Felber, and Christof Fetzer, “A Lazy Snapshot Algorithm with Eager Validation”, 20th International Conference on Distributed Computing, DISC 2006
• LogTM
• Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, David A. Wood, “LogTM: log-based transactional memory”, HPCA 2006: 254-265
• DEUCE
• Guy Korland, Nir Shavit and Pascal Felber, “Noninvasive Java Concurrency with Deuce STM”, MultiProg 2010
• etc.
Summary of TL2
• prepare a variable called the global version clock
• associate memory regions with version numbers
• update version numbers when writing
• detect conflicts when reading and writing by comparing the global clock value read at the start of the transaction with the memory version numbers
• retry the transaction when detecting conflicts
• otherwise commit
TL2 - Variables
Global Memory
• global version clock
• variable 1, version number 1, write-lock 1
• variable 2, version number 2, write-lock 2
• variable 3, version number 3, write-lock 3
Thread Local Memory (thread 1)
• read-version number 1
• write-version number 1
• read-set 1
• write-set 1
TL2 - Algorithm (1)
transaction { load var1; load var2; … store var3; }
1. load the global version clock and store it in the thread-local read-version number
TL2 - Algorithm (2)
2. run through a speculative execution of the transaction
TL2 - Algorithm (3)
2.1. log read addresses to the read-set
TL2 - Algorithm (4)
2.2. log write addresses and values to the write-set
TL2 - Algorithm (5)
Example: variable 3 is both stored and loaded
• read-set 1: pointer to variable 1, pointer to variable 2, pointer to variable 3
• write-set 1: pointer to variable 3, value of variable 3
Note that if a variable being loaded already appears in the write-set, its value is taken from the write-set to avoid a read-after-write hazard.
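The read-after-write rule can be sketched as a lookup in the write-set before touching global memory. This is only an illustration under simple assumptions: `int` variables, a fixed-capacity array, and the hypothetical names `write_set`, `wset_store` and `wset_load`, none of which come from the slides.

```c
#include <stdbool.h>
#include <stddef.h>

#define WSET_MAX 16   /* arbitrary capacity for this sketch */

typedef struct { int *addr; int value; } wentry;   /* pointer + buffered value */
typedef struct { wentry e[WSET_MAX]; size_t n; } write_set;

/* Buffer a store; a later store to the same address overwrites the entry. */
static bool wset_store(write_set *ws, int *addr, int value) {
    for (size_t i = 0; i < ws->n; i++)
        if (ws->e[i].addr == addr) { ws->e[i].value = value; return true; }
    if (ws->n == WSET_MAX) return false;            /* write-set is full */
    ws->e[ws->n++] = (wentry){ addr, value };
    return true;
}

/* Load: prefer the buffered value, otherwise read global memory. */
static int wset_load(const write_set *ws, int *addr) {
    for (size_t i = 0; i < ws->n; i++)
        if (ws->e[i].addr == addr) return ws->e[i].value;
    return *addr;
}
```

Global memory stays untouched until commit, so a load inside the transaction sees its own earlier store while other threads still see the old value.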
TL2 - Algorithm (6)
2.3. check that variables are not modified when loading: make sure that their version numbers are less than or equal to (<=) the read-version number
• if modified, abort the transaction
TL2 - Algorithm (7)
2.4. check that the write-locks are free
• if locked, abort the transaction
TL2 - Algorithm (8)
3. acquire the write-locks using a bounded spin lock
• if a write-lock cannot be acquired, abort the transaction
TL2 - Algorithm (9)
4. increment the global version clock (CAS operation) and store the result in the write-version number
TL2 - Algorithm (10)
5.1. validate the read-set: check that variables have not been modified since they were loaded, i.e. make sure that their version numbers are less than or equal to (<=) the read-version number
• if modified, abort the transaction
TL2 - Algorithm (11)
5.2. check that the write-locks of the read-set variables are not held by another thread
• if locked, abort the transaction
TL2 - Algorithm (12)
5.3. in the special case where read-version number + 1 = write-version number (rv + 1 = wv), it is not necessary to validate the read-set
TL2 - Algorithm (13)
6.1. commit the values of the write-set
TL2 - Algorithm (14)
6.2. set the version numbers of the written variables to the write-version number
TL2 - Algorithm (15)
6.3. release the write-locks
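The steps above can be put together in a compact sketch using C11 atomics. It is an illustration under stated assumptions, not the TL2 reference implementation: the version number and write-lock of each variable are packed into one word (bit 0 is the lock), lock acquisition makes a single attempt instead of a bounded spin, the clock is advanced with an atomic fetch-and-add rather than an explicit CAS loop, and all names (`tvar`, `tx`, `tx_begin`, `tx_load`, `tx_store`, `tx_commit`) are hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TX_MAX 8   /* capacity of the read-set and write-set (arbitrary) */

/* Global memory: value plus a versioned write-lock (bit 0 = lock, rest = version). */
typedef struct { atomic_int value; atomic_uint vlock; } tvar;
static atomic_uint global_clock;

/* Thread-local memory of one transaction. */
typedef struct {
    unsigned rv, wv;                                   /* read-/write-version number */
    tvar *rset[TX_MAX]; size_t nr;                     /* read-set */
    tvar *wset[TX_MAX]; int wval[TX_MAX]; size_t nw;   /* write-set */
} tx;

static int wset_find(const tx *t, const tvar *v) {
    for (size_t i = 0; i < t->nw; i++)
        if (t->wset[i] == v) return (int)i;
    return -1;
}

static void tx_begin(tx *t) {                          /* step 1 */
    t->rv = atomic_load(&global_clock);
    t->nr = t->nw = 0;
}

static bool tx_store(tx *t, tvar *v, int val) {        /* step 2.2 */
    int i = wset_find(t, v);
    if (i < 0) {
        if (t->nw == TX_MAX) return false;
        i = (int)t->nw++;
        t->wset[i] = v;
    }
    t->wval[i] = val;
    return true;
}

static bool tx_load(tx *t, tvar *v, int *out) {        /* steps 2.1, 2.3, 2.4 */
    int i = wset_find(t, v);
    if (i >= 0) { *out = t->wval[i]; return true; }    /* read-after-write */
    if (t->nr == TX_MAX) return false;
    unsigned pre = atomic_load(&v->vlock);
    int val = atomic_load(&v->value);
    unsigned post = atomic_load(&v->vlock);
    if (pre != post || (pre & 1u) || (pre >> 1) > t->rv)
        return false;                                  /* locked or modified: abort */
    t->rset[t->nr++] = v;
    *out = val;
    return true;
}

static void unlock_first(tx *t, size_t n) {            /* release the first n write-locks */
    for (size_t i = 0; i < n; i++)
        atomic_fetch_and(&t->wset[i]->vlock, ~1u);
}

static bool tx_commit(tx *t) {
    for (size_t i = 0; i < t->nw; i++) {               /* step 3 */
        unsigned cur = atomic_load(&t->wset[i]->vlock);
        if ((cur & 1u) ||
            !atomic_compare_exchange_strong(&t->wset[i]->vlock, &cur, cur | 1u)) {
            unlock_first(t, i);                        /* single attempt, then abort */
            return false;
        }
    }
    t->wv = atomic_fetch_add(&global_clock, 1u) + 1u;  /* step 4 */
    if (t->wv != t->rv + 1u) {                         /* step 5.3 */
        for (size_t i = 0; i < t->nr; i++) {           /* steps 5.1, 5.2 */
            unsigned cur = atomic_load(&t->rset[i]->vlock);
            bool locked_by_other = (cur & 1u) && wset_find(t, t->rset[i]) < 0;
            if ((cur >> 1) > t->rv || locked_by_other) {
                unlock_first(t, t->nw);
                return false;
            }
        }
    }
    for (size_t i = 0; i < t->nw; i++) {
        atomic_store(&t->wset[i]->value, t->wval[i]);  /* step 6.1 */
        atomic_store(&t->wset[i]->vlock, t->wv << 1);  /* steps 6.2, 6.3: new version, lock bit cleared */
    }
    return true;
}
```

A caller would loop: `tx_begin`, a series of `tx_load` and `tx_store` calls, then `tx_commit`, restarting from `tx_begin` whenever any of them returns false. Storing `wv << 1` publishes the new version number and releases the lock in one write, which is why steps 6.2 and 6.3 collapse into a single store here.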
Hardware Transactional Memory
• use the CPU cache to detect conflicts
• modify the cache coherence protocol to achieve transactional memory
Cache Coherence
• MESI protocol
• There are 4 states
• Modified, Exclusive, Shared, Invalid
MESI Modified State
• the cache line is dirty and must be written back to main memory
• not shared with other CPUs
MESI Exclusive State
• the cache line is not modified
• not shared with other CPUs
MESI Shared State
• the cache line is not modified
• shared with other CPUs
MESI Invalid State
• the cache line holds no meaningful data
MESI Exclusive Load
1. a CPU requests an exclusive load
2. the other cache writes the line back if modified
3. the other cache changes its state to invalid
4. the requesting cache loads the line in the exclusive state
MESI Shared Load
1. a CPU requests a shared load
2. the other cache writes the line back if modified
3. the other cache changes its state to shared
4. the requesting cache loads the line in the shared state
MESI eviction
1. write back if modified
2. discard the cache line
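The load and eviction steps above can be sketched as a tiny simulation with two caches and one tracked cache line. This is a simplified model that follows the slides, not a full MESI implementation: a shared load always ends in the shared state, and all names (`machine`, `exclusive_load`, `shared_load`, `cpu_store`, `evict`) are invented for the example.

```c
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef struct { mesi_state state; int data; } cache_line;

typedef struct {
    int memory;            /* main memory copy of the single tracked line */
    cache_line cache[2];   /* cache 0 (CPU0) and cache 1 (CPU1) */
} machine;

/* Write the line back to main memory if it is dirty. */
static void write_back_if_modified(machine *m, int cpu) {
    if (m->cache[cpu].state == MODIFIED)
        m->memory = m->cache[cpu].data;
}

/* Exclusive load by `cpu`: the other cache writes back and invalidates. */
static void exclusive_load(machine *m, int cpu) {
    int other = 1 - cpu;
    write_back_if_modified(m, other);
    m->cache[other].state = INVALID;
    if (m->cache[cpu].state == INVALID)
        m->cache[cpu].data = m->memory;
    if (m->cache[cpu].state != MODIFIED)
        m->cache[cpu].state = EXCLUSIVE;
}

/* Shared load by `cpu`: the other cache writes back and drops to shared. */
static void shared_load(machine *m, int cpu) {
    int other = 1 - cpu;
    if (m->cache[cpu].state != INVALID)
        return;                                   /* hit: nothing to do */
    write_back_if_modified(m, other);
    if (m->cache[other].state != INVALID)
        m->cache[other].state = SHARED;
    m->cache[cpu].data = m->memory;
    m->cache[cpu].state = SHARED;                 /* simplification taken from the slide */
}

/* Store by `cpu`: take exclusive ownership, then the line becomes dirty. */
static void cpu_store(machine *m, int cpu, int value) {
    exclusive_load(m, cpu);
    m->cache[cpu].data = value;
    m->cache[cpu].state = MODIFIED;
}

/* Eviction: write back if modified, then discard. */
static void evict(machine *m, int cpu) {
    write_back_if_modified(m, cpu);
    m->cache[cpu].state = INVALID;
}
```

Starting from `machine m = { .memory = 1 };` (both caches invalid), a `shared_load` on CPU0 followed by a `cpu_store` on CPU1 invalidates CPU0's copy, and main memory is only updated when the dirty line is written back by a later load or eviction.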
Transactional Cache Coherence (1)
• prepare a transactional bit in each cache line
• 0: not in transaction
• 1: in transaction
Transactional Cache Coherence (2)
• transactional bit is 1 and the line is in the shared or exclusive state
• abort the transaction if the MESI protocol invalidates the transactional entry
Transactional Cache Coherence (3)
• transactional bit is 1 and the line is in the modified state
• discard the modified value and abort the transaction if the MESI protocol invalidates or evicts the transactional entry
Transactional Cache Coherence (4)
• transactional bit is 1 and the line is evicted
• abort the transaction if the MESI protocol evicts the transactional entry, because the cache coherence protocol cannot detect conflicts on an evicted line
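The abort rules can be sketched for a single cache line that carries a transactional bit. This is a toy model written for illustration, assuming one line per cache and hypothetical names (`tx_cache`, `tx_read`, `tx_write`, `line_lost`, `tx_commit`); real hardware tracks whole read and write sets in the cache.

```c
#include <stdbool.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

typedef struct {
    mesi_state state;
    bool tx;        /* transactional bit: false = not in transaction, true = in transaction */
    int data;
} tx_line;

typedef struct { int memory; tx_line line; bool aborted; } tx_cache;

/* Transactional read: load the line if needed and set the transactional bit. */
static void tx_read(tx_cache *c) {
    if (c->line.state == INVALID) {
        c->line.data = c->memory;
        c->line.state = SHARED;
    }
    c->line.tx = true;
}

/* Transactional write: the speculative value lives only in the cache. */
static void tx_write(tx_cache *c, int value) {
    if (c->line.state == MODIFIED && !c->line.tx)
        c->memory = c->line.data;     /* keep the pre-transaction dirty value */
    c->line.data = value;
    c->line.state = MODIFIED;
    c->line.tx = true;
}

/* The line leaves the cache: invalidated by another CPU or evicted. */
static void line_lost(tx_cache *c) {
    if (c->line.tx)
        c->aborted = true;            /* speculative data is discarded, not written back */
    else if (c->line.state == MODIFIED)
        c->memory = c->line.data;     /* ordinary dirty line: normal write back */
    c->line.state = INVALID;
    c->line.tx = false;
}

/* Commit: succeeds only if no transactional line was lost. */
static bool tx_commit(tx_cache *c) {
    if (c->aborted) return false;
    c->line.tx = false;               /* the line becomes an ordinary cache line */
    return true;
}
```

Losing a line whose transactional bit is set aborts the transaction and leaves main memory unchanged, whereas after a successful commit the same line behaves like any other dirty line and is written back when it is evicted.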
Problems
Problem (1)
• infinite loop in a transaction
• checking variable versions inside loops would reduce performance significantly
• requirement of closed memory management
• code outside a transaction can read and update variables used in a transaction in languages like C and C++
• the compiler or runtime environment must take care of this
Problem (2)
    atomic {
        …
        launchMissile();
        …
    }
• missiles may be launched many times!
• I/O in a transaction must cause an abort
Problem (3)
• livelock
Implementation
Software Transactional Memory (STM) in Haskell
• Haskell provides STM in its concurrency library (the Control.Concurrent.STM module)
• the STM monad is provided to achieve STM
• example implementation
• https://gist.github.com/ytakano/228b68ef099c7bdd2f2c
Hardware Transactional Memory (HTM) Intel TSX
• HTM is available starting with Haswell
• Intel TSX HLE
• xacquire and xrelease instruction prefixes
• Intel TSX RTM
• xbegin and xend instructions
Intel TSX RTM
    xbegin ABORT
    . . .
    xend
    ABORT:
    // fallback
• if the transaction keeps aborting, execution must go to fallback code (such as a spin lock)
Lock by using tsx-tools
https://github.com/andikleen/tsx-tools
    volatile int lock = 0;

    void rtm_lock(void) {
        for (int i = 0; i < RTM_MAX_RETRY; i++) {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                if (!lock)
                    return;      // successfully started
                _xabort(0xff);   // lock is held: abort explicitly
            }

            if ((status & _XABORT_EXPLICIT) &&
                _XABORT_CODE(status) == 0xff &&
                !(status & _XABORT_NESTED)) {
                while (lock)
                    _mm_pause(); // busy-wait
            } else if (!(status & _XABORT_RETRY)) {
                break;
            }
        }

        // fallback to spin-lock
        while (__sync_lock_test_and_set(&lock, 1)) {
            while (lock)
                _mm_pause();     // busy-wait
        }
    }
Lock by using Intel TSX RTM
Unlock by using tsx-tools
https://github.com/andikleen/tsx-tools
    void rtm_unlock(void) {
        if (lock) {
            __sync_lock_release(&lock);  // lock was taken by the fallback path
        } else {
            _xend();                     // commit the transaction
        }
    }
Unlock by using Intel TSX RTM
Performance of Intel TSX
• Intel says that code using coarse-grained locks can perform comparably to code using fine-grained locks
• easy to write code that scales with the number of cores
[Figure: “Applying Intel® TSX”, an example of secondary benefits of Intel® TSX. Two plots of scaling vs. threads: “Application with Coarse Grain Lock” and “Application re-written with Finer Grain Locks”. Curves: Coarse Grain Lock, Coarse Grain Lock + Intel® TSX, Fine Grain Locks, Fine Grain Locks + Intel® TSX. Caption: “Fine Grain Behavior at Coarse Grain Effort”.]
Intel® Transactional Synchronization Extensions (Intel® TSX), from Intel Developer Forum 2012
[Figure: “Intel® TSX Can Enable Simpler Scalable Algorithms” (Enabling Simpler Algorithms, Real World Example). Two plots of Ops/sec vs. threads: “State of the art lock-free algorithm” and “TSX lock based algorithm”.]
Lock-Free Algorithm
• don’t use critical section locks
• developer manages concurrency
• very difficult to get correct and optimize
• constrains data structure selection
• highly contended atomic operations
Lock-Based + Intel® TSX
• use critical section locks for ease
• let hardware extract concurrency
• enables algorithm simplification
• flexible data structure selection
• the equivalent lock-free data structure algorithm is very hard to verify
Intel® Transactional Synchronization Extensions (Intel® TSX), from Intel Developer Forum 2012