The Case for Hardware Transactional Memory

Lecture 20:Transactional Memory

CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012)

(CMU 15-418, Spring 2012)

Giving credit▪ Many of the slides in today’s talk are the work of Professor

Christos Kozyrakis (Stanford University)

(CMU 15-418, Spring 2012)

Raising level of abstraction for synchronization▪ Machine-level synchronization prims:- fetch-and-op, test-and-set, compare-and-swap

▪ We used these primitives to construct higher level, but still quite basic, synchronization prims:- lock, unlock, barrier

▪ Today:- transactional memory: higher level synchronization

(CMU 15-418, Spring 2012)

What you should know▪ What a transaction is

▪ The difference between the atomic construct and locks

▪ Design space of transactional memory implementations- data versioning policy- con"ict detection policy- granularity of detection

▪ Understand HW implementation of transaction memory (consider how it relates to coherence protocol implementations we’ve discussed in the past)

(CMU 15-418, Spring 2012)

Example

▪ Deposit is a read-modify-write operation: want “deposit” to be atomic with respect to other bank operations on this account.

▪ Lock/unlock pair is one mechanism to ensure atomicity(ensures mutual exclusion on the account)

void deposit(account, amount){ int t = bank.get(account); t = t + amount; bank.put(account, t); }

lock(account);

unlock(account);

(CMU 15-418, Spring 2012)

Programming with transactional memory

nDeclarative synchronizationnProgrammers says what but not hownNo explicit declaration or management of locks

nSystem implements synchronizationnTypically with optimistic concurrencynSlow down only on true con"icts (R-W or W-W)

void deposit(account, amount){ lock(account); int t = bank.get(account); t = t + amount; bank.put(account, t); unlock(account);}

void deposit(account, amount){ atomic { int t = bank.get(account); t = t + amount; bank.put(account, t); }}

(CMU 15-418, Spring 2012)

Declarative vs. imperative abstractions

▪ Declarative: programmer de#nes what should be done- Process all these 1000 tasks

▪ Imperative: programmer states how it should be done- Spawn N worker threads. Pull work from shared task queue

- Perform this set of operations atomically

- Acquire a lock, perform operations, release the lock

(CMU 15-418, Spring 2012)

Transactional Memory (TM)nMemory transaction

nAn atomic & isolated sequence of memory accesses nInspired by database transactions

nAtomicity (all or nothing) nAt commit, all memory writes take effect at oncenOn abort, none of the writes appear to take effect

nIsolationnNo other code can observe writes before commit

nSerializability nTransactions seem to commit in a single serial ordernThe exact order is not guaranteed though

(CMU 15-418, Spring 2012)

Advantages of transactional memory

nMap: Key → Value

public Object get(Object key) { int idx = hash(key); // Compute hash HashEntry e = buckets[idx]; // to find bucket while (e != null) { // Find element in bucket if (equals(key, e.key)) return e.value; e = e.next; } return null; }

nNot thread safenBut no lock overhead when not needed

(CMU 15-418, Spring 2012)

Another example: Java 1.4 HashMap

nJava 1.4 solution: synchronized layernConvert any map to thread-safe variantnUses explicit, coarse-grain locking specified by programmer

public Object get(Object key) { synchronized (mutex) { // mutex guards all accesses to hashMap return myHashMap.get(key); }}

nCoarse-grain synchronized HashMapnPros: thread-safe, easy to programnCons: limits concurrency, poor scalability

nOnly one thread can operate on map at any time

(CMU 15-418, Spring 2012)

Synchronized HashMap

(CMU 15-418, Spring 2012)

Better solution?

▪ Fined-grained synchronization: e.g., lock per bucket▪ Now thread safe: but lock overhead even if not needed

public Object get(Object key) { int idx = hash(key); // Compute hash HashEntry e = buckets[idx]; // to find bucket while (e != null) { // Find element in bucket if (equals(key, e.key)) return e.value; e = e.next; } return null; }

(CMU 15-418, Spring 2012)

Performance: Locks

0.2500

0.5000

0.7500

1.0000

1 2 4 8 16

Processors

coarse locks fine locks

1.2500

2.5000

3.7500

5.0000

1 2 4 8 16

Processors

coarse locks fine locks

nSimply enclose all operation in atomic blocknSystem ensures atomicity

public Object get(Object key) { atomic { // System guarantees atomicity return m.get(key); } }

nTransactional HashMapnPros: thread-safe, easy to programnQ: good performance & scalability?

nDepends on the implementation, but typically yes

(CMU 15-418, Spring 2012)

Transactional HashMap

Another example: tree update

Goal: Modify node 3 in a thread-‐safe way.

Slide credit: Austen McDonald

Synchronization Example

Locking prevents concurrencyGoals: Modify nodes 3 and 4 in a thread-‐safe way.

Transac/on AREAD: 1, 2, 3WRITE:

TM Example

Transac/on AREAD: 1, 2, 3WRITE: 3

TM Example: no con"icts

Transac/on BREAD: 1, 2, 4WRITE: 4

NO READ-‐WRITE, or WRITE-‐WRITE conflicts!Slide credit: Austen McDonald

TM Example: with con"icts

Transac/on BREAD: 1, 2, 3WRITE: 3

Conflicts exist! Must be serialized!Slide credit: Austen McDonald

(CMU 15-418, Spring 2012)

Performance: locks vs. transactions

0.2500

0.5000

0.7500

1.0000

1 2 4 8 16

Processors

coarse locks fine locks TCC

1.0000

2.0000

3.0000

4.0000

1 2 4 8 16

Processors

coarse locks fine locks TCC

TCC: a HW-based TM system

(CMU 15-418, Spring 2012)

Failure atomicity: locks

nManually catch exceptionsnProgrammer provides undo code on a case by case basis

nComplexity: what to undo and how… nSome side-effects may become visible to other threads

nE.g., an uncaught case can deadlock the system…

void transfer(A, B, amount) synchronized(bank) { try { withdraw(A, amount); deposit(B, amount); } catch(exception1) { /* undo code 1*/ } catch(exception2) { /* undo code 2*/ } … }

(CMU 15-418, Spring 2012)

Failure atomicity: transactions

nSystem processes exceptionsnAll but those explicitly managed by the programmernTransaction is aborted and updates are undonenNo partial updates are visible to other threads

nE.g., No locks held by a failing threads…

void transfer(A, B, amount) atomic { withdraw(A, amount); deposit(B, amount); }

(CMU 15-418, Spring 2012)

Composability: locks

nComposing lock-based code can be trickynRequires system-wide policies to get correctnBreaks software modularity

nBetween an extra lock & a hard placenFine-grain locking: good for performance, but can lead to deadlock

void transfer(A, B, amount) synchronized(A) { synchronized(B) { withdraw(A, amount); deposit(B, amount); } }

void transfer(B, A, amount) synchronized(B) { synchronized(A) { withdraw(B, amount); deposit(A, amount); } }

DEADLOCK!

(CMU 15-418, Spring 2012)

Composability: transactions

nTransactions compose gracefullynProgrammer declares global intent (atomic transfer)

n No need to know of global implementation strategynTransaction in transfer subsumes those in withdraw & deposit

nOutermost transaction defines atomicity boundary

nSystem manages concurrency as well as possiblenSerialization for transfer(A, B, $100) & transfer(B, A, $200)nConcurrency for transfer(A, B, $100) & transfer(C, D, $200)

void transfer(A, B, amount) atomic { withdraw(A, amount); deposit(B, amount); }

void transfer(B, A, amount) atomic { withdraw(B, amount); deposit(A, amount); }

(CMU 15-418, Spring 2012)

Advantages of transactional memory nEasy to use synchronization construct

nAs easy to use as coarse-grain locksnProgrammer declares, system implements

nOften performs as well as fine-grain locksnAutomatic read-read concurrency & fine-grain concurrency

nFailure atomicity & recoverynNo lost locks when a thread failsnFailure recovery = transaction abort + restart

nComposabilitynSafe & scalable composition of software modules

(CMU 15-418, Spring 2012)

Example integration with OpenMPnExample: OpenTM = OpenMP + TM

n OpenMP: master-slave parallel modelnEasy to specify parallel loops & tasks

n TM: atomic & isolation execution nEasy to specify synchronization and speculation

nOpenTM featuresn Transactions, transactional loops & sectionsn Data directives for TM (e.g., thread private data)n Runtime system hints for TM

nCode example#pragma omp transfor schedule (static, chunk=50) for (int i=0; i<N; i++) { bin[A[i]] = bin[A[i]]+1; }

(CMU 15-418, Spring 2012)

Atomic() ≠ lock()+unlock()nThe difference

nAtomic: high-level declaration of atomicitynDoes not specify implementation/blocking behaviornDoes not provide a consistency model

nLock: low-level blocking primitivenDoes not provide atomicity or isolation on its own

nKeep in mindnLocks can be used to implement atomic(), but…nLocks can be used for purposes beyond atomicity

nCannot replace all lock regions with atomic regionsnAtomic eliminates many data races, but..nProgramming with atomic blocks can still suffer from atomicity violations. e.g., atomic sequence incorrectly split into two atomic blocks

(CMU 15-418, Spring 2012)

Example: lock-based code that does not work with atomic

n What is the problem with replacing synchronized with atomic?

// Thread 1synchronized(lock1) { … flagB = true; while (flagA==0); …}

// Thread 2synchronized(lock2) { … flagA = true; while (flagB==0); …}

(CMU 15-418, Spring 2012)

Example: atomicity violation

n Programmer mistake: logically atomic code sequence separated into two atomic() blocks.

// Thread 1atomic() { … ptr = A; …}

atomic() { B = ptr-‐>field;}

// Thread 2atomic { … ptr = NULL; }

(CMU 15-418, Spring 2012)

Transactional memory: summary + bene#tsnTM = declarative synchronization

nUser specifies requirement (atomicity & isolation)nSystem implements in best possible way

nMotivation for TMnDifficult for user to get explicit sync right

nCorrectness vs. performance vs. complexitynExplicit sync is difficult to scale

nLocking scheme for 4 CPUs is not the best for 64nDifficult to do explicit sync with composable SW

nNeed a global locking strategynOther advantages: fault atomicity, …

nProductivity argument: system support for transactions can achieve 90% of the benefit of programming with fined-grained locks, with 10% of the development time

(CMU 15-418, Spring 2012)

Implementing transactional memory

(CMU 15-418, Spring 2012)

Recall: transactional memorynAtomicity (all or nothing)

nAt commit, all memory writes take effect at oncenOn abort, none of the writes appear to take effect

nIsolationnNo other code can observe writes before commit

nSerializability nTransactions seem to commit in a single serial ordernThe exact order is not guaranteed though

(CMU 15-418, Spring 2012)

TM implementation basicsnTM systems must provide atomicity and isolation

nWithout sacrificing concurrency

nBasic implementation requirementsnData versioning (ALLOWS abort)nConflict detection & resolution (WHEN to abort)

nImplementation optionsnHardware transactional memory (HTM)nSoftware transactional memory (STM)nHybrid transactional memory

ne.g., Hardware accelerated STMs

(CMU 15-418, Spring 2012)

Data versioningn Manage uncommited (new) and commited (old) versions of

data for concurrent transactions

1.Eager versioning (undo-log based)

2.Lazy versioning (write-buffer based)

(CMU 15-418, Spring 2012)

Eager versioning

Begin Xaction

Thread

X: 10 Memory

Undo Log

Write X←15

Thread

X: 15 Memory

Undo LogX: 10

Commit Xaction

Thread

X: 15 Memory

Undo LogX: 10

Abort Xaction

Thread

X: 10 Memory

Undo LogX: 10

Update memory immediately, maintain “undo log” in case of abort

(CMU 15-418, Spring 2012)

Lazy versioning

Begin Xaction

Thread

X: 10 Memory

Write Buffer

Write X←15

Thread

X: 10 Memory

Write BufferX: 15

Abort Xaction

Thread

X: 10 Memory

Write BufferX: 15

Commit Xaction

Thread

X: 15 Memory

Write BufferX: 15

Log memory updates in transaction write buffer, "ush buffer on commit

(CMU 15-418, Spring 2012)

Data versioningn Manage uncommited (new) and commited (old) versions of

data for concurrent transactions

1.Eager versioning (undo-log based)n Update memory location directlyn Maintain undo info in a log (per store penalty)+ Faster commit – Slower aborts, fault tolerance issues (crash in middle of trans)

2.Lazy versioning (write-buffer based)n Buffer data until commit in a write-buffern Update actual memory location on commit + Faster abort, no fault tolerance issues– Slower commits

(CMU 15-418, Spring 2012)

Con"ict detectionnDetect and handle conflicts between transactions

n read-write conflict: transaction A reads addr X, which was written to by pending transaction B

n write-write conflict: transactions A and B are pending, both write to address X.

n Must track the transaction’s read-set and write-set n Read-set: addresses read within the transactionn Write-set: addresses written within transaction

(CMU 15-418, Spring 2012)

Pessimistic detectionnCheck for conflicts during loads or stores

ne.g., HW implementation will check through coherence actions(will discuss later)

n“Contention manager” decides to stall or abort transactionnVarious priority policies to handle common case fast

(CMU 15-418, Spring 2012)

Pessimistic detection example

Case 1 Case 2 Case 3 Case 4

wr Ccheck

commitcommit

Success

commit

Early Detect

commit

restart

rd Acheck

No progress

rd Awr A

checkrestart

checkwr A

restart

rd Awr A

restart

(CMU 15-418, Spring 2012)

Optimistic detection

nDetect conflicts when a transaction attempts to commitnHW: validate write-set using coherence actions

nGet exclusive access for cache lines in write-set

nOn a conflict, give priority to committing transactionnOther transactions may abort later onnOn conflicts between committing transactions, use contention manager to decide priority

n Note: can use optimistic & pessimistic schemes togethern Several STM systems use optimistic for reads and pessimistic for

writes

(CMU 15-418, Spring 2012)

Optimistic detection

Case 1 Case 2 Case 3 Case 4

commit

Success

commit

restart

commit

Success

Forward progress

rd Awr A

commitcheck

restart

rd Awr A

commitcheck

(CMU 15-418, Spring 2012)

Con"ict detection trade-offs

1.Pessimistic conflict detection (a.k.a. “encounter” or “eager”)+ Detect conflicts early

• Undo less work, turn some aborts to stalls

– No forward progress guarantees, more aborts in some cases – Fine-grain communication– On critical path

2.Optimistic conflict detection (a.k.a. “commit” or “lazy”)+ Forward progress guarantees+ Potentially less conflicts, bulk communication– Detects conflicts late, can still have fairness problems

(CMU 15-418, Spring 2012)

Con"ict detection granularitynObject granularity (SW-based techniques)+Reduced overhead (time/space)+Close to programmer’s reasoning– False sharing on large objects (e.g. arrays)

nWord granularity+Minimize false sharing– Increased overhead (time/space)

nCache line granularity+Compromise between object & word

nMix & match è best of both words nWord-level for arrays, object-level for other data, …

(CMU 15-418, Spring 2012)

TM implementation space (examples)nHardware TM systems

nLazy + optimistic: Stanford TCCnLazy + pessimistic: MIT LTM, Intel VTMnEager + pessimistic: Wisconsin LogTMnEager + optimistic: not practical

nSoftware TM systemsnLazy + optimistic (rd/wr): Sun TL2nLazy + optimistic (rd)/pessimistic (wr): MS OSTMnEager + optimistic (rd)/pessimistic (wr): Intel STMnEager + pessimistic (rd/wr): Intel STM

nOptimal design remains an open questionnMay be different for HW, SW, and hybrid

(CMU 15-418, Spring 2012)

Hardware transactional memory (HTM)nData versioning in caches

nCache the write-buffer or the undo-lognNew cache meta-data to track read-set and write-setnCan do with private, shared, and multi-level caches

nConflict detection through cache coherence protocolnCoherence lookups detect conflicts between transactionsnWorks with snooping & directory coherence

nNotesnRegister checkpoint must be taken at transaction begin

nCache lines annotated to track read-set & write setnR bit: indicates data read by transaction; set on loadsnW bit: indicates data written by transaction; set on stores

n R/W bits can be at word or cache-line granularity

nR/W bits gang-cleared on transaction commit or abortnFor eager versioning, need a 2nd cache write for undo log

nCoherence requests check R/W bits to detect conflicts nShared request to W-word is a read-write conflictnExclusive request to R-word is a write-read conflictnExclusive request to W-word is a write-write conflict

(CMU 15-418, Spring 2012)

HTM design

V D E Tag R W Word 1 R W Word N. . .

(CMU 15-418, Spring 2012)

Example HTM: lazy optimistic

TM State

Tag DataV

Registers

n CPU changesnRegister checkpoint (available in many CPUs)nTM state registers (status, pointers to handlers, …)

(CMU 15-418, Spring 2012)

Example HTM: Lazy Optimistic

TM State

Tag DataVWR

Registers

nCache changesn R bit indicates membership to read-setn W bit indicates membership to write-set

(CMU 15-418, Spring 2012)

HTM transaction execution

XbeginLoad AStore B ⇐ 5

Load CXcommit

TM State

Tag DataV

Registers

nTransaction beginn Initialize CPU & cache staten Take register checkpoint

(CMU 15-418, Spring 2012)

Load CXcommit

TM State

Tag DataV

Registers

nLoad operationn Serve cache miss if neededn Mark data as part of read-set

A 3311 0

(CMU 15-418, Spring 2012)

Load CXcommit

TM State

Tag DataV

Registers

n Store operationn Serve cache miss if needed (eXclusive if not shared, Shared otherwise)n Mark data as part of write-set

A 3311 0B 510 1

(CMU 15-418, Spring 2012)

Load CXcommit

TM State

Tag DataV

Registers

n Fast, 2-phase commitn Validate: request exclusive access to write-set lines (if needed)n Commit: gang-reset R & W bits, turns write-set data to valid (dirty) data

A 3311 0B 510 1 upgradeX B

(CMU 15-418, Spring 2012)

HTM con"ict detection

Load CXcommit

TM State

Tag DataV

Registers

n Fast conflict detection & abortn Check: lookup exclusive requests in the read-set and write-setn Abort: invalidate write-set, gang-reset R and W bits, restore checkpoint

A 3311 0B 510 1

upgradeX D þ

ýupgradeX A

01 000 1

(CMU 15-418, Spring 2012)

Transactional memory summary▪ Atomic construct: declaration of atomic behavior- Motivating idea: increase simplicity of synchronization, without

sacri#cing performance▪ Transactional memory implementation- Many variants have been proposes: SW, HW, SW+HW- Differ in versioning policy (eager vs. lazy)- Con"ict detection policy (pessimistic vs. optimistic)- Detection granularity

▪ Hardware transactional memory- Versioned data kept in caches- Con"ict detection built upon coherence protocol

The Case for Hardware Transactional Memory

Documents

Transcript of The Case for Hardware Transactional Memory

Hardware Transactional Memory

Does Hardware Transactional Memory Change Everything?

Hybrid Transactional Memory - Virginia Tech2. We present a hybrid transactional memory scheme that com-bines the best of both hardware and software transactional memory: low performance

Transactional Memory

LogTM-SE: Decoupling Hardware Transactional Memory from Caches

Understanding Hardware Transactional Memory...Understanding Hardware Transactional Memory Gil Tene, CTO & co-Founder, Azul Systems @giltene ©2015 Azul Systems, Inc. Agenda Brief introduction

Understanding Hardware Transactional Memory - QCon...Understanding Hardware Transactional Memory Gil Tene, CTO & co-Founder, Azul Systems @giltene ©2016 Azul Systems, Inc. Agenda

TxLinux: Using and Managing Hardware Transactional Memory in an Operating System

Hardware Transactional Memoryece742/S20/slides/Hardware... · 2020. 5. 18. · Transactional memory exploits and extends multiprocessor cache-coherence protocol so that transactions

An Introduction to Transactional Memory - Engineering · 2010. 12. 9. · Software Transactional Memory •Non-Blocking Software Transactional Memory with Dynamic Software Transactional

KAUSHIK LAKSHMINARAYANAN MICHAEL ROZYCZKO VIVEK SESHADRI Transactional Memory: Hybrid Hardware/Software Approaches.

Hardware Acceleration of Transactional Memory on ......tures]: Parallel—Distributed Architectures General Terms Algorithms, Design, Performance Keywords Transactional Memory, FPGA,

Transactional Memory An Overview of Hardware Alternatives

Performance Pathologies in Hardware Transactional Memory

Hardware Transactional Memory

Hardware Acceleration of Software Transactional Memory

FaulTM: Fault-Tolerance Using Hardware Transactional Memory

Understanding Hardware Transactional Memory - QCon … · Understanding Hardware Transactional Memory Gil Tene, ... execute block atomically in relation to other blocks ... public

Improved Single Global Lock Fallback for Best-effort Hardware Transactional Memory

EazyHTM: Eager-Lazy Hardware Transactional Memory€¦ · EazyHTM: Eager-Lazy Hardware Transactional Memory SašaTomić, Cristian Perfumo, Chinmay Kulkarni, Adrià Armejach, Adrián