Transactional Memory

Post on 12-Apr-2017

Yuuki Takano

ytakanoster@gmail.com

Why Transactional Memory?

• Locks are difficult to manage:
  • deadlock
  • starvation
  • priority inversion
  • lock convoy
• Transactional memory mitigates these problems.

Deadlock

Figure (timeline of Thread 1 and Thread 2): one thread holds lock A and the other holds lock B. Each then tries to acquire the lock held by the other and fails, so neither can proceed.

Starvation

Figure (timeline): one high-priority thread acquires lock A and another high-priority thread acquires lock B, while a low-priority thread needs both A and B. The low-priority thread tries to acquire A and fails; later it locks A, tries to acquire B and fails, and has to release A.

Priority Inversion

Figure (timeline): a low-priority thread is holding the lock; the high-priority thread tries to acquire it and fails.

Lock Convoy

Figure: a scheduler and threads Thread1 to ThreadN contending for one lock. Labels in the figure: 1. contention, 2. event, 3. contention (spin lock), 4. reschedule, 4. acquire (Thread2).

• high overhead when there are many threads

Complexity of Multithread Programming

• ideal world: algorithm + data structure, giving simple source code (what we actually want)
• real world: algorithm + data structure + parallelism, giving a parallel algorithm and a parallel data structure, and so complicated source code that is buggy and difficult to maintain

Lock and Transactional Memory

• Lock
  • executes the critical section exclusively
  • only one thread enters the critical section at a time
• Transactional Memory
  • executes the critical section speculatively
  • multiple threads may enter the same critical section simultaneously
  • conflicts are detected both while the critical section is executing and at the end of the critical section

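Spin-lock by Atomic Operation

• CAS (compare-and-swap)
  • the compare and the swap are performed as one atomic operation
  • related atomic operations: test-and-set, fetch-and-add, etc.
• a spin-lock can be built on such an atomic operation (the code here uses test-and-set)

    volatile int locked = 0;

    void lock_spin(void) {
        // atomically write 1 and return the old value;
        // an old value of 0 means the lock was free and is now ours
        while (__sync_lock_test_and_set(&locked, 1)) {
            while (locked)
                ; // busy-wait
        }
    }

    void unlock_spin(void) {
        __sync_lock_release(&locked); // write 0
    }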
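The slide names CAS but its code uses test-and-set. As a complement, here is a minimal sketch of the same lock built on compare-and-swap, using the GCC `__sync` builtins. The type `cas_lock_t` and the function names are made up for this illustration.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical lock type for this sketch: 0 = free, 1 = held. */
typedef struct { volatile int locked; } cas_lock_t;

/* One attempt: atomically replace 0 with 1. True means we now hold the lock. */
bool cas_try_lock(cas_lock_t *l) {
    return __sync_bool_compare_and_swap(&l->locked, 0, 1);
}

void cas_lock(cas_lock_t *l) {
    while (!cas_try_lock(l)) {
        while (l->locked)
            ;  /* busy-wait with plain reads until the lock looks free */
    }
}

void cas_unlock(cas_lock_t *l) {
    assert(l->locked == 1);           /* releasing a free lock is a bug */
    __sync_lock_release(&l->locked);  /* store 0 with release semantics */
}
```

The inner loop spins on plain reads so that waiting threads do not keep issuing atomic writes to the contended cache line.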

Syntax of Transactional Memory: atomic, retry, orElse

    atomic {
        // transaction
        if (q.size() == 0) {
            // roll back and retry;
            // the transaction is restarted when
            // its read-set is updated
            retry;
        }
        … // do something
    } orElse {
        // runs when a rollback and retry is detected
    }

Software Transactional Memory

• TL2
  • Dave Dice, Ori Shalev, and Nir Shavit, “Transactional Locking II”, 20th International Conference on Distributed Computing, DISC 2006
• LSA
  • Torvald Riegel, Pascal Felber, and Christof Fetzer, “A Lazy Snapshot Algorithm with Eager Validation”, 20th International Conference on Distributed Computing, DISC 2006
• LogTM
  • Kevin E. Moore, Jayaram Bobba, Michelle J. Moravan, Mark D. Hill, and David A. Wood, “LogTM: Log-based Transactional Memory”, HPCA 2006: 254-265
• DEUCE
  • Guy Korland, Nir Shavit, and Pascal Felber, “Noninvasive Java Concurrency with Deuce STM”, MultiProg 2010
• etc.

Summary of TL2

• prepare a shared variable called the global version clock
• associate memory regions with version numbers
• update the version numbers when writing
• detect conflicts when reading and writing by comparing the global clock value sampled at the start of the transaction with the memory version numbers
• retry the transaction when a conflict is detected
• otherwise commit

TL2 - Variables

Global Memory

• global version clock
• variable 1, with version number 1 and write-lock 1
• variable 2, with version number 2 and write-lock 2
• variable 3, with version number 3 and write-lock 3

Thread Local Memory (thread 1)

• read-version number 1
• write-version number 1
• read-set 1
• write-set 1

TL2 - Algorithm (1)

Example transaction used in the following steps:

    transaction {
        load var1;
        load var2;
        …
        store var3;
    }

1. Load the global version clock and store it in a thread-local read-version number.
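Step 1 can be sketched as follows. This is only an illustration: the names `global_version_clock`, `tl2_tx` and `tl2_begin` are hypothetical, and the read-set and write-set are reduced to counters.

```c
#include <assert.h>
#include <stddef.h>

/* All names below are hypothetical; the slides only name the concepts. */
unsigned long global_version_clock = 0;   /* shared by all threads */

typedef struct {
    unsigned long read_version;    /* rv: clock value sampled at start   */
    unsigned long write_version;   /* wv: assigned at commit (step 4)    */
    size_t        n_reads;         /* entries currently in the read-set  */
    size_t        n_writes;        /* entries currently in the write-set */
} tl2_tx;

/* Step 1: sample the global version clock into the thread-local rv. */
void tl2_begin(tl2_tx *tx) {
    assert(tx != NULL);
    /* Adding 0 atomically acts as an atomic read with a full barrier. */
    tx->read_version  = __sync_fetch_and_add(&global_version_clock, 0);
    tx->write_version = 0;
    tx->n_reads  = 0;
    tx->n_writes = 0;
}
```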

TL2 - Algorithm (2)

2. Run the transaction body as a speculative execution.

TL2 - Algorithm (3)

2.1. Log the addresses that are read to the read-set.

TL2 - Algorithm (4)

2.2. Log the addresses that are written, and the values written, to the write-set.

TL2 - Algorithm (5)

Example: variable 3 is both stored and loaded.

• read-set 1: pointer to 1, pointer to 2, pointer to 3
• write-set 1: pointer to 3, value of 3

Note that if a variable being read already appears in the write-set, the value is taken from the write-set, to avoid a read-after-write hazard.
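The read-after-write rule can be sketched as a load that searches the write-set before touching global memory. The types, the fixed capacity `MAX_WRITES` and the function names are assumptions of this sketch; read-set logging and version checks are left out.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WRITES 16   /* assumed fixed capacity, for brevity */

typedef struct { long *addr; long value; } write_entry;
typedef struct { write_entry e[MAX_WRITES]; size_t n; } write_set;

/* Step 2.2: buffer the store; global memory is not touched yet. */
void tx_store(write_set *ws, long *addr, long value) {
    for (size_t i = 0; i < ws->n; i++) {
        if (ws->e[i].addr == addr) {   /* overwrite an earlier store */
            ws->e[i].value = value;
            return;
        }
    }
    assert(ws->n < MAX_WRITES);
    ws->e[ws->n].addr  = addr;
    ws->e[ws->n].value = value;
    ws->n++;
}

/* Load: prefer the transaction's own pending store, if there is one. */
long tx_load(const write_set *ws, const long *addr) {
    for (size_t i = 0; i < ws->n; i++)
        if (ws->e[i].addr == addr)
            return ws->e[i].value;
    return *addr;   /* otherwise read global memory */
}
```

A linear search keeps the sketch short; a real implementation would more likely use a hash or a filter to make the lookup cheap.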

TL2 - Algorithm (6)

2.3. When loading, check that the variable has not been modified: make sure that its version number is less than or equal to the read-version number (version number <= read-version number). If it has been modified, abort the transaction.

TL2 - Algorithm (7)

2.4. Check that the write-locks are free. If a write-lock is held, abort the transaction.
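Steps 2.3 and 2.4 for a single load might look like the sketch below. The `tvar` layout and the function name are hypothetical, and the code is written for clarity in a single thread: a real implementation would use atomic loads with memory barriers around the value read.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-variable metadata, mirroring the "Variables" slide. */
typedef struct {
    long          value;
    unsigned long version;      /* set to wv by the last committed writer */
    int           write_lock;   /* 0 = free, 1 = held by a committer      */
} tvar;

/* Steps 2.3 and 2.4 for one load. Returns false if the transaction must abort. */
bool tx_checked_load(const tvar *v, unsigned long read_version, long *out) {
    assert(out != NULL);
    long value = v->value;
    if (v->write_lock != 0)
        return false;               /* 2.4: another transaction is committing   */
    if (v->version > read_version)
        return false;               /* 2.3: modified after we sampled the clock */
    *out = value;
    return true;
}
```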

TL2 - Algorithm (8)

3. Acquire the write-locks using a bounded spin lock. If a write-lock cannot be acquired, abort the transaction.

TL2 - Algorithm (9)

4. Increment the global version clock (CAS operation) and store the result in the write-version number.
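A minimal sketch of step 4 with a CAS loop, using the GCC `__sync` builtins. The names `global_version_clock` and `tl2_next_write_version` are hypothetical.

```c
#include <assert.h>

unsigned long global_version_clock = 0;   /* hypothetical name */

/* Step 4: atomically increment the clock; the new value becomes wv. */
unsigned long tl2_next_write_version(void) {
    unsigned long old;
    do {
        old = global_version_clock;
        /* The CAS succeeds only if nobody changed the clock since we read it. */
    } while (!__sync_bool_compare_and_swap(&global_version_clock, old, old + 1));
    assert(old + 1 != 0);   /* clock wrap-around is out of scope for this sketch */
    return old + 1;
}
```

If another thread increments the clock between the read and the CAS, the CAS fails and the loop simply tries again with the fresh value.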

TL2 - Algorithm (10)

5.1. Validate the read-set: check that the loaded variables have not been modified, i.e. that their version numbers are less than or equal to the read-version number (version number <= read-version number). If a variable has been modified, abort the transaction.

TL2 - Algorithm (11)

5.2. Check that the write-locks are free. If a write-lock is held, abort the transaction.

TL2 - Algorithm (12)

5.3. In the special case where read-version number + 1 = write-version number (rv + 1 = wv), it is not necessary to validate the read-set, since no other transaction has incremented the global version clock in between.

TL2 - Algorithm (13)

6.1. Commit the values in the write-set to global memory.

TL2 - Algorithm (14)

6.2. Update the version numbers of the written variables to the write-version number.

TL2 - Algorithm (15)

6.3. Release the write-locks.
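Steps 6.1 to 6.3 can be sketched together as one pass over the write-set. The `tvar` and `write_entry` types and the function name are hypothetical, and the sketch assumes the write-locks were already acquired in step 3.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    long          value;
    unsigned long version;
    int           write_lock;   /* 1 while the committing transaction holds it */
} tvar;

typedef struct { tvar *var; long value; } write_entry;

/* Steps 6.1 to 6.3 for a write-set whose locks are already held (step 3). */
void tl2_write_back(write_entry *ws, size_t n, unsigned long write_version) {
    for (size_t i = 0; i < n; i++) {
        tvar *v = ws[i].var;
        assert(v->write_lock == 1);            /* acquired in step 3   */
        v->value   = ws[i].value;              /* 6.1 commit the value */
        v->version = write_version;            /* 6.2 stamp with wv    */
        __sync_lock_release(&v->write_lock);   /* 6.3 release (store 0) */
    }
}
```

Releasing the lock last, with release semantics, means a reader that sees the lock free also sees the new value and version.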

Hardware Transactional Memory

• use the CPU cache to detect conflicts
• modify the cache coherence algorithm to achieve transactional memory

Cache Coherence

• MESI protocol
  • there are 4 states: Modified, Exclusive, Shared, Invalid

MESI Modified State

Figure: main memory, CPU0 with cache 0, CPU1 with cache 1, and one cache line.

• the cache line is dirty and must be written back to main memory
• the cache line is not shared with the other CPU

MESI Exclusive State

• the cache line is not modified
• the cache line is not shared with the other CPU

MESI Shared State

• the cache line is not modified
• the cache line is shared with the other CPU

MESI Invalid State

• the cache line holds no meaningful data

MESI Exclusive Load

1. One CPU requests an exclusive load.
2. The other cache writes the line back if it is modified.
3. The other cache changes the line's state to Invalid.
4. The requesting CPU loads the line in the Exclusive state.

MESI Shared Load

1. One CPU requests a shared load.
2. The other cache writes the line back if it is modified.
3. The other cache changes the line's state to Shared.
4. The requesting CPU loads the line in the Shared state.

MESI eviction

1. Write the line back if it is modified.
2. Discard the line.
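The load and eviction slides can be summarized as a small state machine for the cache that already holds the line. This is a simplified model, not a full MESI implementation; the types and the function `mesi_handle` are invented for the sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;
typedef enum { REMOTE_EXCLUSIVE_LOAD, REMOTE_SHARED_LOAD, EVICT } mesi_event;

/* Hypothetical model of one cache line in one CPU's cache. */
typedef struct {
    mesi_state state;
    int        data;          /* cached copy */
    int       *main_memory;   /* backing location in main memory */
} cache_line;

/* Apply an event to this cache's copy. Returns true if a write-back happened. */
bool mesi_handle(cache_line *line, mesi_event ev) {
    assert(line->main_memory != NULL);
    bool wrote_back = false;
    if (line->state == MODIFIED) {         /* dirty: write back first */
        *line->main_memory = line->data;
        wrote_back = true;
    }
    switch (ev) {
    case REMOTE_SHARED_LOAD:
        if (line->state != INVALID)
            line->state = SHARED;          /* keep a read-only copy */
        break;
    case REMOTE_EXCLUSIVE_LOAD:
    case EVICT:
        line->state = INVALID;             /* give up the copy */
        break;
    }
    return wrote_back;
}
```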

Transactional Cache Coherence (1)

• prepare a transactional bit in each cache line
  • 0: not in transaction
  • 1: in transaction

Transactional Cache Coherence (2)

• If the MESI protocol invalidates a transactional entry (transactional bit 1) that is in the Shared or Exclusive state, abort the transaction.

Transactional Cache Coherence (3)

• If the MESI protocol invalidates or evicts a transactional entry that is in the Modified state, discard the modified value and abort the transaction.

Transactional Cache Coherence (4)

• If the MESI protocol evicts a transactional entry, abort the transaction, because the cache coherence protocol can no longer detect conflicts on the evicted line.
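The rules on the four slides above can be condensed into one function that is called whenever a line is invalidated or evicted. It is a toy model with hypothetical names (`tx_cache_line`, `tx_line_lost`), not a description of any real CPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_state;

typedef struct {
    mesi_state state;
    bool       transactional;   /* the per-line transactional bit */
    int        data;
    int       *main_memory;
} tx_cache_line;

/* Called when the line is invalidated or evicted.
 * Returns true if the running transaction must abort. */
bool tx_line_lost(tx_cache_line *line) {
    assert(line->main_memory != NULL);
    bool abort_tx = line->transactional;
    if (line->state == MODIFIED && !line->transactional)
        *line->main_memory = line->data;   /* ordinary dirty line: write back */
    /* A transactional Modified line is discarded, never written back. */
    line->state = INVALID;
    line->transactional = false;
    return abort_tx;
}
```

Because speculative values only ever live in the cache, dropping the line is all that is needed to roll back a transactional write.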

Problems

Problem (1)

• infinite loops in transactions
  • checking variable versions inside loops would reduce performance significantly
• requirement of closed memory management
  • in languages like C and C++, code outside a transaction can read and update variables used inside a transaction
  • the compiler or the runtime environment has to take care of this

Problem (2)

    atomic {
        …
        launchMissile();
        …
    }

Missiles may be launched many times!

I/O inside a transaction must cause an abort.
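The problem can be reproduced with a toy retry loop. Everything here is hypothetical: `run_atomically` stands in for a TM runtime that re-executes the body on conflict, and a counter stands in for the irreversible I/O.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

int missiles_launched = 0;              /* stands in for an irreversible I/O effect */
void launch_missile(void) { missiles_launched++; }

/* Hypothetical transaction body: true = commit, false = conflict, so retry. */
typedef bool (*tx_body)(int attempt);

/* Naive retry loop: re-runs the whole body, side effects included. */
int run_atomically(tx_body body) {
    assert(body != NULL);
    int attempt = 0;
    while (!body(attempt))
        attempt++;          /* memory writes could be rolled back here; I/O cannot */
    return attempt + 1;     /* number of executions */
}

/* Simulated body that conflicts on its first two attempts. */
bool body_with_io(int attempt) {
    launch_missile();
    return attempt >= 2;
}
```

Calling `run_atomically(body_with_io)` executes the body three times, so the counter ends at 3 even though the transaction commits only once.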

Problem (3)

• livelock


Implementation


Software Transactional Memory (STM) in Haskell

• Haskell provides STM through the Control.Concurrent.STM module
• transactions are written in the STM monad
• example implementation
  • https://gist.github.com/ytakano/228b68ef099c7bdd2f2c

Hardware Transactional Memory (HTM) Intel TSX

• HTM is available on Intel CPUs from the Haswell microarchitecture onward
• Intel TSX HLE
  • XACQUIRE and XRELEASE instruction prefixes
• Intel TSX RTM
  • XBEGIN and XEND instructions

Intel TSX RTM

        xbegin ABORT
        . . .
        xend
    ABORT:
        // fallback

If the transaction aborts repeatedly, execution must go to fallback code (such as a spin lock).

Lock by using tsx-tools https://github.com/andikleen/tsx-tools

    volatile int lock = 0;

    void rtm_lock(void) {
        for (int i = 0; i < RTM_MAX_RETRY; i++) {
            unsigned status = _xbegin();
            if (status == _XBEGIN_STARTED) {
                if (!lock)
                    return; // successfully started
                _xabort(0xff); // the lock is held: abort explicitly
            }

            if ((status & _XABORT_EXPLICIT) && _XABORT_CODE(status) == 0xff &&
                !(status & _XABORT_NESTED)) {
                while (lock)
                    _mm_pause(); // busy-wait
            } else if (!(status & _XABORT_RETRY)) {
                break; // retrying is not expected to help
            }
        }

        // fallback to spin-lock
        while (__sync_lock_test_and_set(&lock, 1)) {
            while (lock)
                _mm_pause(); // busy-wait
        }
    }

lock by using Intel TSX RTM

Unlock by using tsx-tools https://github.com/andikleen/tsx-tools

    void rtm_unlock(void) {
        if (lock) {
            __sync_lock_release(&lock); // we hold the fallback spin-lock
        } else {
            _xend(); // we are inside a transaction: commit it
        }
    }

unlock by using Intel TSX RTM

Performance of Intel TSX

• Intel says that code using coarse-grained locks with TSX can perform comparably to code using fine-grained locks
• this makes it easier to write code that scales with the number of cores

Figure: "Applying Intel® TSX", from "Intel® Transactional Synchronization Extensions (Intel® TSX)", Intel Developer Forum 2012. Two plots of scaling versus number of threads, one for an application with a coarse grain lock and one for the application re-written with finer grain locks. Curves: Coarse Grain Lock, Coarse Grain Lock + Intel® TSX, Fine Grain Locks, Fine Grain Locks + Intel® TSX. Captions: "An example of secondary benefits of Intel® TSX" and "Fine Grain Behavior at Coarse Grain Effort".

Figure: "Intel® TSX Can Enable Simpler Scalable Algorithms" (real world example), from "Intel® Transactional Synchronization Extensions (Intel® TSX)", Intel Developer Forum 2012. Two plots of ops/sec versus number of threads, one for a state-of-the-art lock-free algorithm and one for a TSX lock-based algorithm.

• Lock-Free Algorithm
  • don't use critical section locks
  • developer manages concurrency
  • very difficult to get correct and optimize
    • constrains data structure selection
    • highly contended atomic operations
• Lock-Based + Intel® TSX
  • use critical section locks for ease
  • let hardware extract concurrency
  • enables algorithm simplification
    • flexible data structure selection
    • equivalent data structure lock-free algorithm very hard to verify