
©2009 HP Confidential

A Proposal to Incorporate Software Transactional Memory (STM) Support in the Open64 Compiler

Dhruva R. Chakrabarti, HP Labs, USA


How is transactional memory better than locks?

Traditional threads programming is hard

Requires reasoning with locks

Prone to synchronization errors

Tools exist but complexities still remain

Transactional memory programming raises the abstraction level

Locks are not exposed

Program with atomic sections

Atomicity, Consistency, and Isolation (ACI) properties are guaranteed

Underlying system implements transactions

Deadlock freedom at the programmer level


Quantitative comparison between locks and TM

Source: Rossbach et al., Is Transactional Programming actually easier?, PPoPP 2010

User study of undergrads in an OS class

Same programs written with coarse grain locks, fine grain locks, monitors, and TM

Compared development effort, ease, and programming errors

Conclusion was that TM was harder to use than coarse grain locks but easier than fine grain locks

The study used API-based STM libraries for Java (DSTM2 and JDASTM)

Use of compiler support for atomic section based programming should be even easier

Synchronization errors were far fewer with transactions

On a similar programming problem, 70% errors with fine grain locks but 10% errors with transactions


What about performance?

Single-thread overhead is large because of logging costs

Multi-threaded performance is typically much better than coarse grain locks and approaches that of fine grain locks

Shown on microbenchmarks using hashtable, map, tree operations [source: PLDI 2006 papers on transactions]

Reasonable scalability using large transactions for STAMP benchmarks has been shown [source: http://stamp.stanford.edu ]

Good results on minimum spanning forest of sparse graphs [source: Kang et al., An Efficient Transactional Memory Algorithm for Computing Minimum Spanning Forest of Sparse Graphs, PPoPP 2009]

Large programs such as Quake and RMS applications have been transactified [http://www.bscmsrc.eu/research/software]

Numerous research papers have shown how to reduce overheads

Some are purely library-based approaches

Some optimize the calls made to STM, potentially reducing transaction regions

Some feed the STM with information that leads to an optimized STM


Tool support

Few exist:

Herlihy et al., tm_db (An open source generic debugging library for transactional programs)

Debugging/profiling support (Zyulkyarov et al., Debugging programs that use atomic blocks and transactional memory, PPoPP 2010)

Fine grain conflict graph to aid performance analysis (Chakrabarti et al., New abstractions for effective performance analysis of STM programs, PPoPP 2010)

Going forward, debuggers and performance analysis tools will be important to adoption of STMs

http://portal.acm.org/citation.cfm?id=1693463&dl=GUIDE&coll=GUIDE&CFID=101619311&CFTOKEN=71196825


State of STMs today

Has shown a lot of promise

Atomic sections are included in emerging languages, e.g. Fortress, X10, Chapel

Improved programmability over locks

Does require programmer annotations

Performance benefits have been shown but pathological situations exist

Debuggers and performance tools starting to show up

A small set of benchmarks exists, and some large programs have been transactified

More benchmarks and applications are required

Not all multi-threaded programming paradigms are easily expressed in terms of atomic sections

An example is cond-wait (or retry)

Atomic sections will have to co-exist with locks


Outline

What is an atomic section

What is an STM library

Basic STM API and a flavor of the draft spec

Basic STM library/compiler interface and a flavor of the Intel ABI

Platforms available today and their state

Proposal to incorporate STM in Open64 framework


Shared counter update

Lock-based:

    lock(L)
    ++counter
    unlock(L)

Atomic section-based:

    atomic { ++counter }

For atomic section-based code:
• No need to associate shared data with locks
• Still need to identify atomic sections
• No deadlocks at the programmer level since there are no locks
• Livelocks could be present but are usually resolved by the contention manager
• Data races could still be present


STM library/compiler interface

Atomic section in the source:

    atomic {
        x = y
        w = z
    }

The compiler lowers it into calls to the STM:

    TxStart()
    TxRead(y)
    TxWrite(x)
    TxRead(z)
    TxWrite(w)
    TxCommit()
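The lowering on this slide can be sketched in C++ roughly as follows. This is a toy, single-threaded stand-in for an STM with buffered updates and no conflict detection; the `Tx` handle, the address and value parameters, and `lowered_atomic_section` are assumptions made for illustration and are not taken from the Intel ABI or the draft spec.

```cpp
#include <cassert>
#include <unordered_map>

// Toy transaction descriptor (hypothetical): pending writes, address -> new value.
struct Tx {
    std::unordered_map<int*, int> redo_log;
};

Tx TxStart() { return Tx{}; }

// Read barrier: return a pending write if there is one, else the value in memory.
int TxRead(Tx& tx, int* addr) {
    auto it = tx.redo_log.find(addr);
    return it != tx.redo_log.end() ? it->second : *addr;
}

// Write barrier: buffer the new value; shared memory is not touched yet.
void TxWrite(Tx& tx, int* addr, int value) { tx.redo_log[addr] = value; }

// Commit: publish all buffered writes (no locking or validation in this toy).
void TxCommit(Tx& tx) {
    for (auto& entry : tx.redo_log) *entry.first = entry.second;
    tx.redo_log.clear();
}

// Hand-lowered form of: atomic { x = y; w = z; }
void lowered_atomic_section(int& x, int& y, int& w, int& z) {
    Tx tx = TxStart();
    int tmp_y = TxRead(tx, &y);
    TxWrite(tx, &x, tmp_y);
    int tmp_z = TxRead(tx, &z);
    TxWrite(tx, &w, tmp_z);
    TxCommit(tx);
}
```

Each shared access inside the atomic section becomes one barrier call, which is why the single-thread overhead of an STM is noticeable.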


Different implementation strategies

Non-blocking vs blocking

Strong vs weak isolation

It has been shown that strong isolation is very hard to provide

Most STMs today only support weak isolation

Direct vs deferred update

Flattened vs closed vs open nesting

Transaction granularity: object vs word

Pessimistic vs optimistic concurrency control


Main STM data structures (blocking implementation)

Shared lock table

A hash function maps a given address to an entry in the (tagless) hash table

Designed to get to the lock without locking the hash table

[Figure: shared addresses mapped to entries of the lock table]
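A minimal sketch of such a tagless lock table, assuming a power-of-two table size, word (8-byte) granularity, and a versioned lock word per entry. The names `kTableSize`, `kGrainShift`, `lock_index`, and `lock_for` are hypothetical, and the shift-and-mask hash is only one possible choice.

```cpp
#include <array>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kTableSize = 1024;  // assumed size; must be a power of two
constexpr unsigned kGrainShift = 3;       // assumed 8-byte (word) granularity

// One versioned lock word per entry (assumed layout: low bit = locked,
// remaining bits = version). Entries are zero-initialized.
std::array<std::atomic<std::uint64_t>, kTableSize> lock_table{};

// Map an address to a table index. No tag is stored, so the table cannot
// tell apart two addresses that hash to the same entry.
std::size_t lock_index(const void* addr) {
    auto bits = reinterpret_cast<std::uintptr_t>(addr);
    return (bits >> kGrainShift) & (kTableSize - 1);
}

// Reach the lock for an address with plain arithmetic; the table itself
// is never locked.
std::atomic<std::uint64_t>& lock_for(const void* addr) {
    return lock_table[lock_index(addr)];
}
```

Because the table is tagless, addresses that are `8 * kTableSize` bytes apart share an entry. Two transactions touching such unrelated addresses can therefore see a false conflict, which is the cost paid for a lookup that needs no table-level synchronization.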


Transactional read

Reads are typically optimistic: no locks are acquired

Read from shared memory location into local buffer

Validate readset to ensure its consistency

If validation fails, the transaction is rolled back

The address and its current version are entered into a readset
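The read steps on this slide might look like the following single-threaded sketch. It assumes a version stored next to each value (a real STM would typically keep it in the lock table and use atomic loads); `VersionedWord`, `ReadEntry`, `Tx`, `validate`, and `tx_read` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// A shared location together with its version (simplifying assumption).
struct VersionedWord {
    int value;
    std::uint64_t version;
};

// One read set entry: which location was read and the version seen.
struct ReadEntry {
    const VersionedWord* addr;
    std::uint64_t seen_version;
};

struct Tx {
    std::vector<ReadEntry> read_set;
};

// True if every location read so far still carries the recorded version.
bool validate(const Tx& tx) {
    for (const ReadEntry& e : tx.read_set)
        if (e.addr->version != e.seen_version) return false;
    return true;
}

// Optimistic read: no lock is taken. Returns false if validation fails,
// in which case the caller is expected to roll the transaction back.
bool tx_read(Tx& tx, const VersionedWord& loc, int& out) {
    std::uint64_t before = loc.version;
    out = loc.value;                         // copy into a local buffer
    tx.read_set.push_back({&loc, before});   // record address and version
    return validate(tx);
}
```

Validating the whole read set on every read is the simplest policy and also the most expensive one; it is shown here only because it keeps the sketch short.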


Transactional write

Buffered write

Make the change onto a local buffer

Add the location and the new value to a write set (redo log)

All subsequent reads of this location are serviced from the write set

The original location is unchanged

Direct update

Acquire the lock corresponding to the shared location. If already locked, abort

Log the old value into a write set (undo log)

Directly change the shared location
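The two write strategies can be contrasted with a small single-threaded sketch. Lock acquisition is deliberately left out, and `BufferedTx` and `DirectTx` are hypothetical types used only to show where the new and old values live.

```cpp
#include <cassert>
#include <unordered_map>
#include <utility>
#include <vector>

// Buffered write: new values go to a redo log; memory changes only at commit.
struct BufferedTx {
    std::unordered_map<int*, int> redo_log;

    void write(int* addr, int value) { redo_log[addr] = value; }

    // Reads of a location written earlier are serviced from the write set.
    int read(int* addr) const {
        auto it = redo_log.find(addr);
        return it != redo_log.end() ? it->second : *addr;
    }

    void commit() {
        for (auto& e : redo_log) *e.first = e.second;
        redo_log.clear();
    }

    void abort() { redo_log.clear(); }  // nothing to undo in memory
};

// Direct update: memory changes immediately; old values go to an undo log.
struct DirectTx {
    std::vector<std::pair<int*, int>> undo_log;

    void write(int* addr, int value) {
        undo_log.emplace_back(addr, *addr);  // remember the old value
        *addr = value;                       // update in place
    }

    void commit() { undo_log.clear(); }      // values are already in place

    // Restore old values in reverse order so the earliest one wins.
    void abort() {
        for (auto it = undo_log.rbegin(); it != undo_log.rend(); ++it)
            *it->first = it->second;
        undo_log.clear();
    }
};
```

The trade-off is visible in the method bodies: buffered writes make abort cheap and commit costly, while direct updates make commit cheap and abort costly.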


Transactional commit

Acquire locks for all entries of the writeset, if not already done

If a lock is held by another transaction, abort

Validate the read set, aborting if required

Copy all buffered data to shared locations, if required

Release all locks and update the versions of modified locations
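The commit sequence above could be sketched as below for a buffered-write STM. This is a single-threaded illustration: the `owner` pointer stands in for a real lock that would be acquired with an atomic compare-and-swap, and `Slot`, `Tx`, `release`, and `tx_commit` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <utility>
#include <vector>

// A shared location with its version and current lock owner (if any).
struct Slot {
    int value = 0;
    std::uint64_t version = 0;
    const void* owner = nullptr;
};

struct Tx {
    std::vector<std::pair<Slot*, std::uint64_t>> read_set;  // slot, version seen
    std::unordered_map<Slot*, int> write_set;               // slot, new value
};

// Drop every lock this transaction holds.
void release(Tx& tx) {
    for (auto& w : tx.write_set)
        if (w.first->owner == &tx) w.first->owner = nullptr;
}

// Returns true on commit, false if the transaction had to abort.
bool tx_commit(Tx& tx) {
    // 1. Acquire locks for the write set; abort if another transaction holds one.
    for (auto& w : tx.write_set) {
        Slot* s = w.first;
        if (s->owner != nullptr && s->owner != &tx) { release(tx); return false; }
        s->owner = &tx;
    }
    // 2. Validate the read set: versions unchanged and not locked by others.
    for (auto& r : tx.read_set) {
        Slot* s = r.first;
        bool locked_by_other = s->owner != nullptr && s->owner != &tx;
        if (locked_by_other || s->version != r.second) { release(tx); return false; }
    }
    // 3. Copy buffered data to the shared locations, bump versions, release locks.
    for (auto& w : tx.write_set) {
        Slot* s = w.first;
        s->value = w.second;
        ++s->version;
        s->owner = nullptr;
    }
    return true;
}
```

Aborting on the first held lock matches the slide; a contention manager could instead wait or back off before giving up.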


A more elaborate API

Refer to Draft Specification of Transactional Language Constructs for C++ [1] for more details

Irrevocability of certain statements introduces complications

Statements can be either safe or unsafe

A conventional atomic section can contain only safe statements

Necessitates use of attributes in certain cases

Annotation of functions called within a transaction

A relaxed transaction can contain unsafe statements

Supporting explicit abort/cancel of a transaction

Only a conventional atomic section can contain an abort/cancel statement

Allowing nesting of transactions

Exceptions, exception specifications

[1] http://software.intel.com/file/21569


Compiler/STM interface

Intel has released Intel® Transactional Memory Compiler and Runtime Application Binary Interface [2]

An interface that a compiler writer or an expert user calling the STM has to conform to

Enables use of different STMs without changing the application

A fixed naming convention of library routines is imposed

Standard interfaces for starting a transaction, getting a handle to a transaction, aborting/committing a transaction, reading/writing memory locations, etc.

[2] http://software.intel.com/file/8097


Platforms available today

Some examples:

Intel STM compiler and library

IBM xl C/C++ for transactional memory for AIX

SkySTM: Sun Studio based compiler and STM

Microsoft .NET implementation

gcc transactional memory support (currently on a branch)

TL2

TinySTM

…


Proposal to incorporate STM support in Open64

Writing an STM library

First step is to support the stock atomic section

Provide a blocking implementation

Support closed nesting with partial roll-back

Initial work is to provide support for the minimal set of entry points needed for simple programs to pass

Will follow the ABI document released by Intel

Could leverage work done in gcc space

In the compiler space, mostly front-end work for functional completeness

Main task is to lower the atomic section into calls to STM

Will need to support annotations to support static checking of proper use of TM constructs

Will follow the draft spec of the API

Could leverage work done in gcc space

Set up a small number of benchmarks and applications for testing


Possible optimizations

Inter-procedural optimizations can reduce overheads substantially

Not required for initial implementation but something that would possibly give an Open64-based implementation an edge over other frameworks

Redundant calls to STM could be removed

Read after read, write after read (of the same location) could be optimized

Recognition of local memory accesses could remove barriers altogether

IPA could feed STM with critical information to help reduce STM overheads
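As an illustration of the read-after-read case, the sketch below compares a naive lowering with one where the compiler reuses the first barrier's result. `TxRead` here is a counting stand-in rather than a real STM barrier, and `barrier_calls`, `naive_lowering`, and `optimized_lowering` are hypothetical names.

```cpp
#include <cassert>

int barrier_calls = 0;  // counts calls into the stand-in read barrier

// Stand-in for an STM read barrier; a real one would also log and validate.
int TxRead(const int* addr) {
    ++barrier_calls;
    return *addr;
}

// Naive lowering of: atomic { sum = a + a; }
// Every source-level read of a shared location becomes a barrier call.
int naive_lowering(const int* a) {
    return TxRead(a) + TxRead(a);
}

// After read-after-read elimination: the second read of the same location
// inside the same transaction reuses the value from the first barrier.
int optimized_lowering(const int* a) {
    int tmp = TxRead(a);
    return tmp + tmp;
}
```

The rewrite relies on the transaction observing a consistent value for the location for its whole duration, which is the guarantee the STM is expected to provide to code inside an atomic section.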

The STM library admits numerous optimizations

Use of pessimistic concurrency in addition to primarily optimistic

Use of application-specific policies

Reduction of false conflicts

Reduction of validation costs


Summary

An STM implementation based on Open64 will be useful to the community

Should conform to the draft API and the ABI

Will provide a great research platform for further advances in STMs

Will help development of more transactional applications
