Dynamic Software Transactional Memory

Idan Igra

Topics in Reliable Distributed Computing (048961)
Technion, Nov 2008

Agenda

• Motivation
• Software Transactional Memory
• Dynamic Software Transactional Memory
• Fraser’s STM
• Dynamic STM vs. Fraser’s STM
• A blocking STM implementation
• Another obstruction-free STM implementation by Fraser
• DSTM contention management

William N. Scherer III
Department of Computer Science
University of Rochester
Rochester, NY 14620, USA

Mark Moir
Sun Microsystems Laboratories
1 Network Drive
Burlington, MA 01803, USA

Victor Luchangco
Sun Microsystems Laboratories
1 Network Drive
Burlington, MA 01803, USA

Maurice Herlihy
Department of Computer Science
Brown University
Providence, RI 02912, USA

Multicore history

• Parallel computing was used for HPC and networking.
– PRAM and other shared-memory models are not realistic.
– BSP and LogP (message-passing models) were used.

• It was only for HPC specialists.

• It demanded a complicated system analysis per application.

• Hardware constraints force multicore architectures.

• Today’s parallel programming is based on locks.
– Coarse-grained locking prevents parallelism; fine-grained locking is hard to use.
– Code reuse demands exposing internal locks.
– There is no conventional way to associate a mutex with the data it protects.

Nonblocking liveness properties

• Wait freedom: every process that attempts an operation completes it in a finite number of steps.

• Lock freedom: if any process attempts an operation, then some process (not necessarily the same one) completes an operation in a finite number of steps.

• Obstruction freedom: a process that runs by itself, without interference from other processes, completes its operation in a finite number of steps.

Atomic hardware primitives

• Load_Linked / Store_Conditional (LL/SC): LL(addr) returns the value stored at addr. A subsequent SC(addr, val) writes val to addr only if addr has not been written since that LL; otherwise it fails.

• Compare And Swap (CAS): CAS(addr, e, v) atomically writes v to addr if the value stored at addr equals e; otherwise it leaves addr unchanged.

• MCAS: m CAS operations performed as one atomic step (special case: DCAS, for two locations).
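
The CAS semantics above can be sketched in Python as follows. This is only an illustration: the class AtomicCell and the function atomic_increment are hypothetical names, and the internal lock merely stands in for the atomicity that the hardware instruction provides.

```python
import threading

class AtomicCell:
    """One memory word with a simulated compare-and-swap (hypothetical helper)."""

    def __init__(self, value):
        self._value = value
        self._guard = threading.Lock()  # models hardware atomicity only

    def load(self):
        with self._guard:
            return self._value

    def cas(self, expected, new):
        """Write `new` only if the cell still holds `expected`; report success."""
        with self._guard:
            if self._value == expected:
                self._value = new
                return True
            return False

def atomic_increment(cell):
    """Typical CAS retry loop: re-read and retry until our update wins."""
    while True:
        old = cell.load()
        if cell.cas(old, old + 1):
            return old + 1

def worker(cell, rounds):
    for _ in range(rounds):
        atomic_increment(cell)

counter = AtomicCell(0)
threads = [threading.Thread(target=worker, args=(counter, 1000)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

A failed CAS changes nothing, so the loop simply retries with a fresh value; no increment is lost even though no thread ever holds a lock across its read-modify-write.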

Helping methodology

• A methodology for non-blocking algorithms.

• A process that needs data held by another process helps that process finish its operation instead of waiting for it.

• The help is usually recursive.

• In particular, it is widely used in Transactional Memory for software implementations of MCAS (known as k-RMW).

Software Transactional Memory

• A transaction first tries to acquire all the data it needs.

• If it succeeds, it computes the transaction and releases the data.

• If it fails, it releases everything and retries.
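
The acquire / compute / release cycle above can be sketched as follows. This is a simplified illustration, not the published algorithm: Location, run_transaction and max_attempts are hypothetical names, ownership is modelled with a non-blocking try-lock, and the helping mechanism is left out.

```python
import threading

class Location:
    """A shared word plus an ownership flag (hypothetical layout)."""

    def __init__(self, value):
        self.value = value
        self._owner = threading.Lock()

    def try_acquire(self):
        # Never blocks: returns False at once if another transaction owns it.
        return self._owner.acquire(blocking=False)

    def release(self):
        self._owner.release()

def run_transaction(locations, compute, max_attempts=1000):
    """Acquire the whole data set, apply `compute`, release; retry on conflict."""
    for _ in range(max_attempts):
        acquired = []
        try:
            for loc in locations:
                if not loc.try_acquire():
                    break                      # conflict: abandon this attempt
                acquired.append(loc)
            else:
                # Every location is owned: compute and write the new values.
                values = compute([loc.value for loc in locations])
                for loc, value in zip(locations, values):
                    loc.value = value
                return True
        finally:
            for loc in acquired:               # release on success and on failure
                loc.release()
    return False

a, b = Location(100), Location(0)
committed = run_transaction([a, b], lambda v: [v[0] - 30, v[1] + 30])
```

Note that the data set (the list of locations) must be known before the transaction starts, which is the restriction that dynamic STM removes.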


Why Software Transactional Memory?

• Unexpected delays degrade the performance of locking methods, in addition to their inherent programming difficulties.
– Memory allocation and deallocation cause synchronization conflicts.

• Hardware Transactional Memory lacks platform support and portability, and suffers from delay anomalies.

• Methods such as translating the code into k-RMW actions are non-trivial.

• Working on a copy of the object does not suit large data structures.

• A programmable and flexible non-blocking method for parallel programming is needed.


Data set pre-acquiring

• Unintuitive programming.

• Reduces parallelism.
– Common data structures must be acquired in their entirety.

• Dynamic data structures are impossible.


Hardware support

• LL/SC is not commonly supported by hardware.

• The operating system can support it.
– Much slower.
– Reduces parallelism (forces some scheduling).
– A more useful primitive could be defined.


Wait freedom cost:

• Complicated acquiring code.

• Not flexible.

• Non-common primitives.

• Long locking time.

Dynamic STM

• Also supports dynamic transactions, whose data set changes as they run.

• Satisfies obstruction freedom.

• A modular contention manager enforces progress, implements priorities and adapts to the application.

Dynamic STM

Implementation principles:

• A TM object points to a Locator, which contains an old version, a new version and the last transaction that opened the object for writing.

• The current version is determined by that transaction's status (active / aborted / committed).

• All objects are committed at once by changing the status.

• Obstruction freedom is obtained by aborting a conflicting transaction (subject to the contention manager's agreement).
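
A minimal sketch of the Locator idea, under simplifying assumptions: the names Status, Transaction, Locator, TMObject, current_version and open_write are illustrative, the locator is installed with a plain assignment where a real implementation would use a CAS, and conflict resolution through the contention manager is omitted.

```python
from enum import Enum

class Status(Enum):
    ACTIVE = "active"
    ABORTED = "aborted"
    COMMITTED = "committed"

class Transaction:
    def __init__(self):
        self.status = Status.ACTIVE

class Locator:
    def __init__(self, writer, old_version, new_version):
        self.writer = writer            # last transaction that opened for writing
        self.old_version = old_version
        self.new_version = new_version

def current_version(locator):
    """Committed writer: new version. Active or aborted writer: old version."""
    if locator.writer.status is Status.COMMITTED:
        return locator.new_version
    return locator.old_version

class TMObject:
    def __init__(self, initial):
        creator = Transaction()
        creator.status = Status.COMMITTED
        self.locator = Locator(creator, None, initial)

    def open_write(self, tx):
        """Install a new locator whose new version is a private copy.

        Assumption: no other transaction is active on this object, so the
        CAS and the contention-manager call of a real DSTM are skipped.
        """
        base = current_version(self.locator)
        self.locator = Locator(tx, base, dict(base))
        return self.locator.new_version

x, y = TMObject({"v": 1}), TMObject({"v": 2})
tx = Transaction()
x.open_write(tx)["v"] = 10
y.open_write(tx)["v"] = 20
before = (current_version(x.locator)["v"], current_version(y.locator)["v"])
tx.status = Status.COMMITTED            # one status change commits both objects
after = (current_version(x.locator)["v"], current_version(y.locator)["v"])
```

Because every locator written by tx points at the same status field, flipping that one field switches all of its objects from their old versions to their new ones together. Setting it to ABORTED instead would leave every object at its old version.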

Dynamic STM

DSTM properties and results:

• Writing DSTM code, and converting sequential code into it, is much more natural.

• Releases can significantly increase performance.

• Reusing simpler algorithms to build a bigger one is easier with DSTM.

• Disadvantage: there is no way to know that an object was opened for reading.

Dynamic STM

• Obstruction freedom:
– is simple,
– is good enough for some applications,
– enables the implementation of priorities,
– enables separating correctness from progress,
– and, most importantly, removes the need for a helping mechanism.

• However, one can argue that it is not a real progress property.

Dynamic STM

Discussion

DSTM vs. STM:

• DSTM relates to STM as coarse-grained relates to fine-grained.

• But STM meets a real progress requirement rather than a weakened one (obstruction freedom).

• Releases, as an integral part of the mechanism, reduce conflicts (compared to locks).

• Non-blocking, and obstruction freedom in particular, is better because delayed or failed processes do not stop the whole system (a very strong point for DSTM).

• DSTM’s implementation might lose that gain on truly parallel systems.

• Letting the contention manager do the work is exactly like assuming the scheduler will do it.

Fraser’s STM

An STM should satisfy:

• Small, fixed storage overhead per object.

• A small number of shared-memory operations.

• Short contention time.
– Reduces the time during which transactions can meet.

Nice to have:

• Support for varying object sizes.

• Nested transactions.

Fraser’s STM

• Every object is represented as a pointer to an object handler, which consists of a version number and a pointer to the data block.

• Opening for read returns the data block pointer.

• Opening for write returns a pointer to a shadow copy.

• Commit is done by acquiring all the opened objects, using MCAS and helping.

Fraser’s STM

• Problem: acquiring and releasing read-only objects blocks transactions that do not otherwise conflict.
– Critical for data structures with a single entry point (such as the head of a linked list).

• Solution: do not acquire read-only objects.
– Add a read-checking state, in which the transaction checks all the objects it opened read-only to verify that no other transaction updated them in the meantime.

Fraser’s STM

• Deadlock prevention: T1 can abort T2 only if:
– both have status read-checking,
– T2 holds a location that T1 tries to read,
– T1 < T2 according to a given total order between transactions.
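
The three conditions can be written as a single predicate. The sketch below is illustrative only: the Tx class, its fields, and the use of an integer tx_id as the total order are all assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Tx:
    tx_id: int                                 # position in the assumed total order
    status: str = "active"
    held: set = field(default_factory=set)     # locations this transaction holds

def may_abort(t1, t2, location):
    """True only when all three conditions of the abort rule hold."""
    return (
        t1.status == "read-checking"
        and t2.status == "read-checking"
        and location in t2.held
        and t1.tx_id < t2.tx_id
    )

t1 = Tx(tx_id=1, status="read-checking", held={"y"})
t2 = Tx(tx_id=2, status="read-checking", held={"x"})
```

Since the order is total, at most one of two mutually waiting transactions is allowed to abort the other: here t1 may abort t2 over "x", but t2 may not abort t1 over "y".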

DSTM vs. FSTM

FSTM is much better:

• Lazy acquire exposes a transaction to others for a very short time, which reduces the number of conflicts.

• Levels of indirection decrease performance (mainly for read-only transactions).

• The contention manager required by obstruction freedom has a 5-10% overhead and is hard to design.

DSTM vs. FSTM

DSTM is much better:

• Eager acquire helps catch conflicts earlier.
– Possible thanks to the weakness of obstruction freedom.

• Fewer CASes (N+1 for DSTM vs. 2N+2 for FSTM).

• The implementation is simpler and more efficient.

• MCAS causes a lot of cache block thrashing.

DSTM vs. FSTM

DSTM is better for workloads in which:

• Many locations are opened.
– Mainly write accesses to the same location (IntSet).
– Transactions must be serialized (stack).

FSTM is better for workloads in which:

• Livelocks are common (RBTree).

• Transactions are small.
– Small conflict probability (IntSetRelease).

DSTM vs. FSTM

General remarks:

• Not validating repeatedly improves performance.

• How can inconsistent (aborted) transactions be avoided?

Contention Management

Recall: the DSTM contention manager should:

• ensure progress,
• eventually return from every call,
• eventually abort the conflicting transaction.

Management approaches are tested for:

• Various data sets.
• Visible/Invisible reads (optimistic/non-optimistic).
• Eliminating unnecessary aborts.


• Aggressive: always aborts the enemy transaction. A good baseline for comparison.

• Polite: backs off before aborting. Sensitive to preemption, page faults…

• Randomized: flips a (balanced) coin to decide whether to abort or wait (64ns).

• Eruption: a transaction helps the transaction blocking it by passing on its momentum (momentum = successful open attempts + momentum of the transactions it blocks).
– The reasoning is to let transactions that hold critical data finish.

• Karma: the older transaction (in terms of open attempts) wins. Attempts from previous aborted runs are counted too.

• Kindergarten: backs off first before aborting. After that, aborts are done in turns.

• KillBlocked: a transaction aborts the transaction blocking it if that one is itself blocked (or after a fixed time).

• Timestamp: the older transaction wins. A failure detector is used.

• QueueOnBlock: blocked transactions are released in queue order when the blocking transaction finishes (or after a fixed time).
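
Three of these policies can be sketched behind one interface. Everything below is a simplified reading rather than the published implementations: the method should_abort, the parameter max_rounds, the field opens and the helper resolve are hypothetical, and real managers would sleep between attempts instead of just counting them.

```python
class Tx:
    def __init__(self, name, opens=0):
        self.name = name
        self.opens = opens   # objects opened so far, kept across aborted runs

class Aggressive:
    def should_abort(self, me, enemy, attempt):
        return True                          # always abort the enemy

class Polite:
    def __init__(self, max_rounds=8):
        self.max_rounds = max_rounds         # assumed bound on backoff rounds

    def should_abort(self, me, enemy, attempt):
        return attempt >= self.max_rounds    # back off first, abort afterwards

class Karma:
    def should_abort(self, me, enemy, attempt):
        # Accumulated work plus time already spent waiting must exceed the
        # enemy's accumulated work before we are allowed to abort it.
        return me.opens + attempt > enemy.opens

def resolve(manager, me, enemy, max_attempts=64):
    """Number of waits before `manager` lets `me` abort `enemy`, else None."""
    for attempt in range(max_attempts):
        if manager.should_abort(me, enemy, attempt):
            return attempt
    return None

young, old = Tx("young", opens=2), Tx("old", opens=5)
waits = {
    "aggressive": resolve(Aggressive(), young, old),
    "polite": resolve(Polite(max_rounds=3), young, old),
    "karma young vs old": resolve(Karma(), young, old),
    "karma old vs young": resolve(Karma(), old, young),
}
```

Each policy eventually answers "abort", which matches the requirement that the manager eventually lets a transaction abort the one it conflicts with; the policies differ only in how long the enemy is given to finish.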


Results:

• Most managers, except Timestamp, are good for IntSetRelease with invisible reads.

• Aggressive, Randomized, Eruption and Polite perform badly.

• QueueOnBlock and KillBlocked perform well only for RBTree with invisible reads.

• Timestamp is good only for Counter.

• Kindergarten is excellent, except for IntSetRelease with visible reads and for RBTree.

• Karma is not good for IntSet or for LFUCache with visible reads.


Visible reads vs. invisible reads:

• In IntSet and Counter there is no difference, as all accesses are writes.

• In IntSetRelease visible reads are better (except for Kindergarten, which is bad for both).
– Visible reads offer a way to avoid conflicts on short accesses.

• In LFUCache for all managers, and in RBTree for all but Karma, invisible reads are much better.
– Most conflicts are between a reader that scans its path and a writer that updates the path to the root.

Blocking STM implementation

Why not worry about blocking (mainly compared to obstruction freedom)?

• Long transactions must be aborted anyway. Obstruction freedom guarantees progress only for a single transaction running alone.

• Context switches are not a problem.
– Temporary.
– Automatic adaptation by the OS.
– Platform support (priorities, etc.).

• Independent failure.
– Not common in multicore.
– Sequential programs also fail due to a single failure.


Non-blocking is bad because:

• Metadata and the object must be stored separately to satisfy non-blocking.
– This doubles the cache misses.

• Assume N active transactions on N processors: a new transaction must not be blocked, so the number of conflicts increases.


• Every transaction keeps, in its private data, a descriptor per opened object (consisting of the version, a pointer and (maybe) a copy).

• Every object has a lock (with deadlock prevention), which is used when trying to commit.

• Accesses wait for the object to be unlocked. Read accesses are optimistic.

• Priority mechanism.
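
A minimal sketch of this design, under stated assumptions: VersionedObject and Transaction are hypothetical names, deadlock prevention is modelled by locking written objects in a fixed global order, and the priority mechanism is omitted.

```python
import threading

class VersionedObject:
    def __init__(self, value):
        self.value = value
        self.version = 0
        self.lock = threading.Lock()     # held only while a commit is in progress

class Transaction:
    def __init__(self):
        self.reads = {}    # private descriptor: object -> version first seen
        self.writes = {}   # private descriptor: object -> private new value

    def read(self, obj):
        if obj in self.writes:
            return self.writes[obj]
        with obj.lock:                   # wait until no committer holds the object
            version, value = obj.version, obj.value
        self.reads.setdefault(obj, version)
        return value                     # optimistic: the object stays unlocked

    def write(self, obj, value):
        self.writes[obj] = value         # buffered privately until commit

    def commit(self):
        # Lock the written objects in one global order (here: by id) so that
        # two committing transactions can never wait on each other in a cycle.
        locked = sorted(self.writes, key=id)
        for obj in locked:
            obj.lock.acquire()
        try:
            # Validate: every object read must still be at the version we saw.
            if any(obj.version != seen for obj, seen in self.reads.items()):
                return False
            for obj, value in self.writes.items():
                obj.value = value
                obj.version += 1
            return True
        finally:
            for obj in locked:
                obj.lock.release()

account = VersionedObject(100)
t1, t2 = Transaction(), Transaction()
t1.write(account, t1.read(account) + 1)
t2.write(account, t2.read(account) + 1)
first, second = t1.commit(), t2.commit()
```

Both transactions read version 0; the first commit bumps the version, so the second fails validation and must retry. A fuller version would also guard read-only objects during validation, which this sketch skips.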


[Figure: CPU time for various numbers of processors]

[Figure: CPU time for various contention instances]


Discussion:

• Context switches ARE a problem, because of the long delays.

• Failures are more common in parallel programs than in sequential ones.

• Is delay more interesting than throughput?

Another STM

• As in DSTM, committing is done by changing a state, and the current version is determined by the state of the owner transaction.

• But as in FSTM, before committing the transaction tries to acquire all of its ownership records.

• A wait method is provided for waiting on acquired data before retrying.

Another STM

• An ownership record (orec) contains either the version number of one (or more) objects or a pointer to the owner transaction's descriptor.

• Before committing, a transaction tries to acquire the data it owns.

• If the data is already acquired, the transaction can abort the other transaction, wait for it to finish, or wake it (if it is sleeping).

References

• Robert Ennals (Jan 2006). Software Transactional Memory Should Not Be Obstruction-Free. Technical Report Nr. IRC-TR-06-052. Intel Research Cambridge Tech Report.

• K. Fraser. Practical Lock-Freedom. Technical Report UCAM-CL-TR-579, Cambridge University Computer Laboratory, February 2004.

• Tim Harris, Keir Fraser. Language support for lightweight transactions. Proceedings of the 18th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, October 26-30, 2003, Anaheim, California, USA.

• Maurice Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer III. Software Transactional Memory for Dynamic-Sized Data Structures. ACM Symposium on Principles of Distributed Computing (PODC): 92-101, 2003.

• Maurice Herlihy, Victor Luchangco. Distributed computing and the multicore revolution. ACM SIGACT News, v.39 n.1, March 2008.

• Virendra J. Marathe, William N. Scherer III and Michael L. Scott (Oct 2004). Design Tradeoffs in Modern Software Transactional Memory Systems. In: Proceedings of the 7th Workshop on Languages, Compilers, and Run-time Systems for Scalable Computers. Houston, TX.

• N. Shavit and D. Touitou. Software transactional memory. Distributed Computing, Special Issue(10): 99-116, 1997.

• William N. Scherer III and Michael L. Scott (Jul 2004). Contention Management in Dynamic Software Transactional Memory. In: Proceedings of the ACM PODC Workshop on Concurrency and Synchronization in Java Programs. St. John's, NL, Canada. In conjunction with PODC'04.

More reading

Ennals’ blocking STM:

• Robert Ennals. Efficient Software Transactional Memory. Intel Research Cambridge Technical Report: IRC-TR-05-051, 2005.

PRAM:

• S. Fortune and J. Wyllie. Parallelism in Random Access Machines. In Proceedings of the 10th Annual Symposium on Theory of Computing, pages 114-118, 1978.

• Phillip B. Gibbons, Yossi Matias, Vijaya Ramachandran. Can shared-memory model serve as a bridging model for parallel computation? Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures, p.72-83, June 23-25, 1997, Newport, Rhode Island, United States.

• P. B. Gibbons. A more practical PRAM model. Proceedings of the first annual ACM symposium on Parallel algorithms and architectures, p.158-168, June 18-21, 1989, Santa Fe, New Mexico, United States.

Popular older message-passing models:

• David Culler, Richard Karp, David Patterson, Abhijit Sahay, Klaus Erik Schauser, Eunice Santos, Ramesh Subramonian, Thorsten von Eicken. LogP: towards a realistic model of parallel computation. ACM SIGPLAN Notices, v.28 n.7, p.1-12, July 1993.

• Leslie G. Valiant. A bridging model for parallel computation. Communications of the ACM, v.33 n.8, p.103-111, Aug. 1990.

Memory allocation in multi-core:

• Andrei Gorine, Konstantin Knizhnik. Tackling memory allocation in multicore and multithreaded applications. MCObject LLC, May 29 2006. Available on the internet from http://www.embedded.com/columns/showArticle.jhtml?articleID=188101359

• Voon-Yee Vee, Wen-Jing Hsu. A Scalable and Efficient Storage Allocator on Shared Memory Multiprocessors. Proceedings of the 1999 International Symposium on Parallel Architectures, Algorithms and Networks (ISPAN '99), p.230, June 23-25, 1999.

• P.R. Wilson, M.S. Johnstone, M. Neely, and D. Boles. Dynamic storage allocation: A survey and critical review. In H.G. Baker, editor, Proceedings of International Workshop on Memory Management (IWMM'95), volume 986 of Lecture Notes in Computer Science, pages 1-116, Kinross, Scotland, Sept. 1995.
