Transactional Memory

Transactional Memory Yuuki Takano [email protected]

Transcript of Transactional Memory

Page 1: Transactional Memory

Transactional Memory

Yuuki Takano

[email protected]

Page 2: Transactional Memory

Why Transactional Memory?

• Locks are difficult to manage. Typical problems:

• deadlock

• starvation

• priority inversion

• lock convoy

• Transactional memory mitigates these problems.

Page 3: Transactional Memory

Deadlock

[Figure: timeline (t) of Thread 1 and Thread 2. Each thread holds one of Lock A and Lock B. One thread then tries to acquire A and fails, and the other tries to acquire B and fails.]

Page 4: Transactional Memory

Starvation

[Figure: timeline (t) of three threads. One high-priority thread acquires Lock A and another high-priority thread acquires Lock B. A low-priority thread needs both A and B: it tries to acquire A and fails; later it locks A, tries to acquire B and fails, and releases A.]

Page 5: Transactional Memory

Priority Inversion

[Figure: timeline (t) of a high-priority thread and a low-priority thread. The low-priority thread is holding the lock; the high-priority thread tries to acquire it and fails.]

Page 6: Transactional Memory

Lock Convoy

[Figure: a scheduler and threads Thread1, Thread2, Thread3, ..., ThreadN contending for one lock. Labeled steps: 1. contention, 2. event, 3. contention (spin lock), 4. reschedule, 4. acquire (Thread2).]

• high overhead when there are many threads

Page 7: Transactional Memory

Complexity of Multithread Programming

[Figure: In the ideal world, an algorithm and a data structure give simple source code, which is what we actually want. In the real world, adding parallelism requires a parallel algorithm and a parallel data structure, which give complicated source code that is buggy and difficult to maintain.]

Page 8: Transactional Memory

Lock and Transactional Memory

• Lock

• executes a critical section exclusively

• only one thread can be inside the critical section at a time

• Transactional Memory

• executes a critical section speculatively

• multiple threads can enter the same critical section simultaneously

• conflicts are detected both while the critical section is executing and at the end of the critical section

Page 9: Transactional Memory

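Spin-lock by Atomic Operation

• CAS (compare-and-swap)

• the comparison and the swap are performed as one atomic operation

• other atomic primitives: test-and-set, fetch-and-add, etc.

• a spin-lock can be built from CAS or, as in the code here, from test-and-set

volatile int locked = 0;

void lock_spin(void) {
    // atomically write 1 and return the old value; old value 0 means the lock was free
    while (__sync_lock_test_and_set(&locked, 1)) {
        while (locked)
            ; // busy-wait
    }
}

void unlock_spin(void) {
    __sync_lock_release(&locked); // atomically write 0
}

The lock is acquired only when locked was 0 and has been set to 1.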
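The slide names CAS as the building block, while its code uses test-and-set. As a minimal sketch of the CAS variant, assuming the GCC/Clang `__sync` builtins, the lock below is taken only when `locked` is atomically changed from 0 to 1. The names `cas_lock_t`, `cas_try_lock`, `cas_lock`, and `cas_unlock` are made up for this example.

```c
#include <stdbool.h>

/* Hypothetical lock type for this sketch; 0 = free, 1 = held. */
typedef struct {
    volatile int locked;
} cas_lock_t;

/* One attempt: atomically replace 0 with 1. True if we took the lock. */
bool cas_try_lock(cas_lock_t *l) {
    return __sync_bool_compare_and_swap(&l->locked, 0, 1);
}

/* Spin until the CAS succeeds. */
void cas_lock(cas_lock_t *l) {
    while (!cas_try_lock(l)) {
        while (l->locked)
            ; /* busy-wait on a plain read before retrying the CAS */
    }
}

/* Release: atomically write 0 with release semantics. */
void cas_unlock(cas_lock_t *l) {
    __sync_lock_release(&l->locked);
}
```

Spinning on a plain read between CAS attempts keeps the waiting threads from issuing an atomic write on every iteration, which is the same idea as the inner loop of `lock_spin`.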

Page 10: Transactional Memory

Syntax of Transactional Memory: atomic, retry, orElse

atomic {
    // transaction
    if (q.size() == 0) {
        // roll back and retry:
        // the transaction is restarted when
        // its read-set is updated
        retry;
    }
    … // do something
} orElse {
    // reached when a rollback and retry is detected
}

Page 11: Transactional Memory

Software Transactional Memory


Page 12: Transactional Memory

Software Transactional Memory

• TL2

• Dave Dice, Ori Shalev, and Nir Shavit, “Transactional Locking II”, 20th International Symposium on Distributed Computing, DISC 2006

• LSA

• Torvald Riegel, Pascal Felber, and Christof Fetzer, “A Lazy Snapshot Algorithm with Eager Validation”, 20th International Symposium on Distributed Computing, DISC 2006

• LogTM

• Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, and David A. Wood, “LogTM: Log-based Transactional Memory”, HPCA 2006: 254-265

• DEUCE

• Guy Korland, Nir Shavit, and Pascal Felber, “Noninvasive Java Concurrency with Deuce STM”, MultiProg 2010

• etc.

Page 13: Transactional Memory

Summary of TL2

• keep a shared variable called the global version clock

• associate each memory region with a version number

• update the version number when writing

• detect conflicts on reads and on writes by comparing the memory version numbers with the global clock value sampled by the transaction

• retry the transaction when a conflict is detected

• otherwise commit

Page 14: Transactional Memory

TL2 - Variables

[Figure: TL2 variables]

• Global Memory: the global version clock, and variables 1 to 3, each with its own version number and write-lock

• Thread Local Memory (thread 1): read-version number 1, write-version number 1, read-set 1, write-set 1

Page 15: Transactional Memory

TL2 - Algorithm (1)

transaction { load var1; load var2; … store var3; }

1. Load the global version clock and store it in the thread-local read-version number.

Page 16: Transactional Memory

TL2 - Algorithm (2)

2. Run the transaction as a speculative execution.

Page 17: Transactional Memory

TL2 - Algorithm (3)

2.1. Log the addresses that are read to the read-set.

Page 18: Transactional Memory

TL2 - Algorithm (4)

2.2. Log the addresses and values that are written to the write-set.

Page 19: Transactional Memory

TL2 - Algorithm (5)

[Figure: variable 3 is both stored and loaded. read-set 1 holds pointers to variables 1, 2, and 3; write-set 1 holds a pointer to variable 3 and the value of 3.]

Note: if a variable being read already appears in the write-set, read its value from the write-set to avoid a read-after-write hazard.
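The read-after-write rule can be sketched as follows, assuming a tiny fixed-size write-set of `int` variables. The types `wentry_t` and `wset_t`, the functions `tx_store` and `tx_load`, and the capacity `WSET_MAX` are hypothetical names for this illustration, not part of any TL2 implementation; version checks and read-set logging are left out.

```c
#include <stddef.h>

#define WSET_MAX 16 /* hypothetical capacity, fixed to keep the sketch short */

typedef struct {
    int *addr;  /* address the transaction wants to write */
    int value;  /* value to be written at commit time */
} wentry_t;

typedef struct {
    wentry_t entry[WSET_MAX];
    size_t n;
} wset_t;

/* Transactional store: buffer the value, do not touch memory yet.
 * Returns 0 on success, -1 if the write-set is full. */
int tx_store(wset_t *ws, int *addr, int value) {
    for (size_t i = 0; i < ws->n; i++) {
        if (ws->entry[i].addr == addr) { /* already logged: overwrite */
            ws->entry[i].value = value;
            return 0;
        }
    }
    if (ws->n == WSET_MAX)
        return -1;
    ws->entry[ws->n].addr = addr;
    ws->entry[ws->n].value = value;
    ws->n++;
    return 0;
}

/* Transactional load: a value written earlier in the same transaction
 * wins over the value in shared memory. */
int tx_load(const wset_t *ws, const int *addr) {
    for (size_t i = 0; i < ws->n; i++) {
        if (ws->entry[i].addr == addr)
            return ws->entry[i].value;
    }
    return *addr;
}
```

A linear scan is enough for a sketch; an implementation that cares about long transactions would more likely use a hash table or a Bloom filter in front of the write-set.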

Page 20: Transactional Memory

TL2 - Algorithm (6)

2.3. On each load, check that the variable has not been modified: its version number must be less than or equal to (<=) the read-version number. If it has been modified, abort the transaction.

Page 21: Transactional Memory

TL2 - Algorithm (7)

2.4. On each load, also check that the write-lock of the variable is free. If it is locked, abort the transaction.
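Steps 1, 2.3, and 2.4 can be sketched as below. This is a simplified illustration under stated assumptions: the version number and the write-lock are separate fields, memory barriers and read-set logging are omitted, and the names `tvar_t`, `tx_t`, `tx_begin`, and `tx_read` are invented for the example.

```c
#include <stdbool.h>

/* Hypothetical shared variable with its TL2 metadata. */
typedef struct {
    volatile unsigned long version; /* version number of the variable */
    volatile int write_lock;        /* 0 = free, 1 = held */
    volatile int value;
} tvar_t;

/* Hypothetical thread-local transaction descriptor. */
typedef struct {
    unsigned long rv; /* read-version number */
} tx_t;

volatile unsigned long global_clock = 0;

/* Step 1: sample the global version clock. */
void tx_begin(tx_t *tx) {
    tx->rv = global_clock;
}

/* Steps 2.3 and 2.4: returns false when the transaction must abort. */
bool tx_read(const tx_t *tx, const tvar_t *v, int *out) {
    int val = v->value;        /* speculative load */
    if (v->write_lock)         /* 2.4: another transaction is committing */
        return false;
    if (v->version > tx->rv)   /* 2.3: modified after we started */
        return false;
    *out = val;
    return true;
}
```

The value is loaded before the checks so that a commit finishing between the load and the checks is still noticed through the updated version number.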

Page 22: Transactional Memory

TL2 - Algorithm (8)

3. Acquire the write-locks of the variables in the write-set using a bounded spin lock. If a write-lock cannot be acquired, abort the transaction.

Page 23: Transactional Memory

TL2 - Algorithm (9)

4. Increment the global version clock (CAS operation) and store the new value in the write-version number.

Page 24: Transactional Memory

TL2 - Algorithm (10)

5.1. Validate the read-set: check that the loaded variables have not been modified, that is, their version numbers are less than or equal to (<=) the read-version number. If one has been modified, abort the transaction.

Page 25: Transactional Memory

TL2 - Algorithm (11)

5.2. Check that the write-locks of the variables in the read-set are free. If one is locked by another transaction, abort the transaction.

Page 26: Transactional Memory

TL2 - Algorithm (12)

5.3. Special case: if read-version number + 1 = write-version number (rv + 1 = wv), it is not necessary to validate the read-set, because no other transaction has advanced the global clock since this transaction started.

Page 27: Transactional Memory

TL2 - Algorithm (13)

6.1. Commit the values in the write-set to memory.

Page 28: Transactional Memory

TL2 - Algorithm (14)

6.2. Update the version numbers of the written variables to the write-version number.

Page 29: Transactional Memory

TL2 - Algorithm (15)

6.3. Release the write-locks.
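The commit phase (steps 3 to 6.3) can be put together roughly as follows. This is a sketch under assumptions, not the TL2 reference code: the lock and version are separate fields, memory barriers are omitted, each variable appears at most once in the write-set, and `tvar_t`, `tx_t`, `tx_commit`, `SET_MAX`, and `SPIN_LIMIT` are hypothetical names and values.

```c
#include <stdbool.h>
#include <stddef.h>

#define SET_MAX 8      /* hypothetical capacity of each set */
#define SPIN_LIMIT 100 /* hypothetical bound for the spin lock */

typedef struct {
    volatile unsigned long version;
    volatile int write_lock; /* 0 = free, 1 = held */
    volatile int value;
} tvar_t;

typedef struct { tvar_t *var; int value; } wentry_t;

typedef struct {
    unsigned long rv, wv;   /* read-version and write-version numbers */
    tvar_t *rset[SET_MAX];  size_t nr;
    wentry_t wset[SET_MAX]; size_t nw;
} tx_t;

volatile unsigned long global_clock = 0;

/* Release the write-locks of the first n write-set entries. */
static void release_locks(tx_t *tx, size_t n) {
    for (size_t i = 0; i < n; i++)
        __sync_lock_release(&tx->wset[i].var->write_lock);
}

static bool in_write_set(const tx_t *tx, const tvar_t *v) {
    for (size_t i = 0; i < tx->nw; i++)
        if (tx->wset[i].var == v)
            return true;
    return false;
}

/* Returns true on commit, false when the transaction must be retried. */
bool tx_commit(tx_t *tx) {
    /* 3. acquire the write-locks with a bounded spin */
    for (size_t i = 0; i < tx->nw; i++) {
        int tries = 0;
        while (__sync_lock_test_and_set(&tx->wset[i].var->write_lock, 1)) {
            if (++tries >= SPIN_LIMIT) {
                release_locks(tx, i);
                return false;
            }
        }
    }
    /* 4. increment the global version clock with a CAS loop */
    unsigned long old;
    do {
        old = global_clock;
    } while (!__sync_bool_compare_and_swap(&global_clock, old, old + 1));
    tx->wv = old + 1;
    /* 5. validate the read-set, unless rv + 1 = wv (5.3) */
    if (tx->wv != tx->rv + 1) {
        for (size_t i = 0; i < tx->nr; i++) {
            tvar_t *v = tx->rset[i];
            bool locked_by_other = v->write_lock && !in_write_set(tx, v);
            if (v->version > tx->rv || locked_by_other) {
                release_locks(tx, tx->nw);
                return false;
            }
        }
    }
    /* 6.1 write back, 6.2 update version numbers, 6.3 release locks */
    for (size_t i = 0; i < tx->nw; i++) {
        tvar_t *v = tx->wset[i].var;
        v->value = tx->wset[i].value;
        v->version = tx->wv;
        __sync_lock_release(&v->write_lock);
    }
    return true;
}
```

Note that an aborted commit in this sketch has already advanced the global clock; that is harmless for correctness, since the clock only needs to increase monotonically.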

Page 30: Transactional Memory

Hardware Transactional Memory


Page 31: Transactional Memory

Hardware Transactional Memory

• uses the CPU cache to detect conflicts

• extends the cache coherence protocol to implement transactional memory

Page 32: Transactional Memory

Cache Coherence

• MESI protocol

• There are 4 states

• Modified, Exclusive, Shared, Invalid


Page 33: Transactional Memory

MESI Modified State

[Figure: main memory; CPU0 with cache 0 and CPU1 with cache 1; one cache line is highlighted.]

• the cache line is dirty and must be written back to main memory

• it is not shared with other CPUs

Page 34: Transactional Memory

MESI Exclusive State

• the cache line is not modified

• it is not shared with other CPUs

Page 35: Transactional Memory

MESI Shared State

• the cache line is not modified

• it is shared with other CPUs

Page 36: Transactional Memory

MESI Invalid State

• the cache line holds no meaningful data

Page 37: Transactional Memory

MESI Exclusive Load

1. a CPU requests an exclusive load

2. the other cache writes the line back if it is modified

3. the other cache changes the state of its copy to invalid

4. the requesting cache loads the line in the exclusive state

Page 38: Transactional Memory

MESI Shared Load

1. a CPU requests a shared load

2. the other cache writes the line back if it is modified

3. the other cache changes the state of its copy to shared

4. the requesting cache loads the line in the shared state

Page 39: Transactional Memory

MESI eviction

1. write the line back if it is modified

2. discard the line
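The exclusive load, shared load, and eviction steps can be modeled as a small state machine. The sketch below assumes only two caches and a single memory word, and it follows the slides literally (a shared load always ends in the shared state, even if the other cache holds no copy). The names `mesi_t`, `line_t`, `exclusive_load`, `shared_load`, and `evict` are made up for this model.

```c
/* The four MESI states. */
typedef enum { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED } mesi_t;

/* Hypothetical model of one cache line holding one word. */
typedef struct {
    mesi_t state;
    int data;
} line_t;

/* local: requesting cache, remote: the other cache, mem: main memory word. */
void exclusive_load(line_t *local, line_t *remote, int *mem) {
    if (remote->state == MESI_MODIFIED)
        *mem = remote->data;          /* 2. write back if modified */
    remote->state = MESI_INVALID;     /* 3. change state to invalid */
    local->data = *mem;               /* 4. load in the exclusive state */
    local->state = MESI_EXCLUSIVE;
}

void shared_load(line_t *local, line_t *remote, int *mem) {
    if (remote->state == MESI_MODIFIED)
        *mem = remote->data;          /* 2. write back if modified */
    if (remote->state != MESI_INVALID)
        remote->state = MESI_SHARED;  /* 3. change state to shared */
    local->data = *mem;               /* 4. load in the shared state */
    local->state = MESI_SHARED;
}

void evict(line_t *line, int *mem) {
    if (line->state == MESI_MODIFIED)
        *mem = line->data;            /* 1. write back if modified */
    line->state = MESI_INVALID;       /* 2. discard */
}
```

For example, after CPU0 loads a line exclusively and modifies it, a shared load by CPU1 first forces the modified value into main memory and then leaves both copies in the shared state.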

Page 40: Transactional Memory

Transactional Cache Coherence (1)

[Figure: main memory; CPU0 with cache 0 and CPU1 with cache 1; a cache line tagged with the bit 0.]

• add a transactional bit to each cache line

• 0: not in a transaction, 1: in a transaction

Page 41: Transactional Memory

Transactional Cache Coherence (2)

• the cache line is in the shared or exclusive state and its transactional bit is 1

• abort the transaction if the MESI protocol invalidates this transactional entry

Page 42: Transactional Memory

Transactional Cache Coherence (3)

• the cache line is in the modified state and its transactional bit is 1

• discard the modified value and abort the transaction if the MESI protocol invalidates or evicts this transactional entry

Page 43: Transactional Memory

Transactional Cache Coherence (4)

• the cache line has been evicted and its transactional bit is 1

• abort the transaction if the MESI protocol evicts a transactional entry, because once the line has left the cache the cache coherence protocol cannot detect conflicts on it
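The abort rules of the four slides can be condensed into one handler. This is a toy model, assuming one word per line and a single `tx_bit`; the names `tline_t`, `coh_event_t`, and `on_coherence_event` are hypothetical and do not correspond to any real hardware interface.

```c
typedef enum { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED } mesi_t;

/* Coherence events that remove a line from the cache. */
typedef enum { EV_INVALIDATE, EV_EVICT } coh_event_t;

/* Hypothetical cache line with a transactional bit. */
typedef struct {
    mesi_t state;
    int tx_bit; /* 0: not in a transaction, 1: in a transaction */
    int data;
} tline_t;

/* Apply an invalidation or eviction to a line.
 * Returns 1 if the running transaction must abort, 0 otherwise. */
int on_coherence_event(tline_t *line, coh_event_t ev, int *mem) {
    (void)ev; /* both events are handled the same way in this model */
    int must_abort = line->tx_bit;
    if (line->state == MESI_MODIFIED && !line->tx_bit)
        *mem = line->data; /* ordinary dirty line: write back as usual */
    /* a transactional modified line is speculative: its value is dropped */
    line->state = MESI_INVALID;
    line->tx_bit = 0;
    return must_abort;
}
```

Because losing a transactional line always aborts, the capacity and associativity of the cache bound the size of a transaction in this kind of design.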

Page 44: Transactional Memory

Problems


Page 45: Transactional Memory

Problem (1)

• infinite loops in a transaction

• checking variable versions inside loops would reduce performance significantly

• requirement of closed memory management

• in languages like C and C++, code outside a transaction can read and update variables that are used inside a transaction

• the compiler or the runtime environment must take care of this

Page 46: Transactional Memory

Problem (2)


atomic { … launchMissile(); … }

Missiles may be launched many times!

I/O inside a transaction must cause an abort, because a transaction may be rolled back and re-executed.

Page 47: Transactional Memory

Problem (3)

• livelock


Page 48: Transactional Memory

Implementation


Page 49: Transactional Memory

Software Transactional Memory (STM) in Haskell

• Haskell provides STM through the Control.Concurrent.STM module

• the STM monad is provided to express transactions

• example implementation

• https://gist.github.com/ytakano/228b68ef099c7bdd2f2c

Page 50: Transactional Memory

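Hardware Transactional Memory (HTM): Intel TSX

• HTM is available in Intel CPUs since the Haswell microarchitecture

• Intel TSX HLE (Hardware Lock Elision)

• xacquire and xrelease instruction prefixes

• Intel TSX RTM (Restricted Transactional Memory)

• xbegin and xend instructions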

Page 51: Transactional Memory

Intel TSX RTM

    xbegin ABORT
    . . .
    xend
ABORT:
    // fallback

If the transaction keeps aborting, execution must eventually go to fallback code (such as a spin lock).

Page 52: Transactional Memory

Lock by using tsx-tools

https://github.com/andikleen/tsx-tools

#include <immintrin.h> // _xbegin, _xabort, _mm_pause (GCC/Clang: compile with -mrtm), or use rtm.h from tsx-tools

#define RTM_MAX_RETRY 3 // assumed value; the slide does not give one

volatile int lock = 0;

void rtm_lock(void) {
    for (int i = 0; i < RTM_MAX_RETRY; i++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (!lock)
                return;    // successfully started
            _xabort(0xff); // the fallback lock is held: abort explicitly
        }

        if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 0xff &&
            !(status & _XABORT_NESTED)) {
            while (lock)
                _mm_pause(); // busy-wait
        } else if (!(status & _XABORT_RETRY)) {
            break;
        }
    }

    // fall back to a spin-lock
    while (__sync_lock_test_and_set(&lock, 1)) {
        while (lock)
            _mm_pause(); // busy-wait
    }
}

lock by using Intel TSX RTM

Page 53: Transactional Memory

Unlock by using tsx-tools

https://github.com/andikleen/tsx-tools

void rtm_unlock(void) {
    if (lock) {
        __sync_lock_release(&lock); // the fallback spin-lock was taken
    } else {
        _xend();                    // commit the transaction
    }
}

unlock by using Intel TSX RTM

Page 54: Transactional Memory

Performance of Intel TSX

• Intel says that, with TSX, code using a coarse-grained lock can perform comparably to code using fine-grained locks

• this makes it easier to write code that scales across cores

Page 55: Transactional Memory

[Figure from Intel Developer Forum 2012, Intel® Transactional Synchronization Extensions (Intel® TSX), "Applying Intel® TSX": two plots of scaling versus threads, one for an application with a coarse grain lock and one for the application re-written with finer grain locks. Curves: Coarse Grain Lock, Coarse Grain Lock + Intel® TSX, Fine Grain Locks, Fine Grain Locks + Intel® TSX. Captions: "An example of secondary benefits of Intel® TSX" and "Fine Grain Behavior at Coarse Grain Effort".]

Page 56: Transactional Memory

Intel® TSX Can Enable Simpler Scalable Algorithms

Lock-Free Algorithm

• Don’t use critical section locks

• Developer manages concurrency

• Very difficult to get correct & optimize

– Constrain data structure selection

– Highly contended atomic operations

Lock-Based + Intel® TSX

• Use critical section locks for ease

• Let hardware extract concurrency

• Enables algorithm simplification

– Flexible data structure selection

– Equivalent data structure lock-free algorithm very hard to verify

[Figure: "Real World Example", two plots of ops/sec versus threads, one for a state of the art lock-free algorithm and one for a TSX lock based algorithm.]

Intel® Transactional Synchronization Extensions (Intel® TSX), from Intel Developer Forum 2012

Page 57: Transactional Memory

EOF
