Scheduling Memory Transactions
![Page 1: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/1.jpg)
Scheduling Memory Transactions
![Page 2: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/2.jpg)
Synchronization alternatives: Transactional Memory
A (memory) transaction is a sequence of memory reads and writes executed by a single thread that either commits or aborts
If a transaction commits, all the reads and writes appear to have executed atomically
If a transaction aborts, none of its operations take effect
Transaction operations aren't visible until they commit (if they do)
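The commit/abort semantics above can be illustrated with a toy write-buffering transaction. This is only a single-threaded sketch under stated assumptions: the `Transaction` class and the dict standing in for shared memory are hypothetical, and a real TM would also need conflict detection and an atomic commit step.

```python
class Transaction:
    """Toy transaction: writes are buffered and published only on commit."""

    def __init__(self, memory):
        self.memory = memory  # shared dict standing in for memory (assumption)
        self.writes = {}      # private write buffer, invisible to others

    def read(self, addr):
        # A transaction sees its own pending writes first.
        return self.writes.get(addr, self.memory.get(addr))

    def write(self, addr, value):
        self.writes[addr] = value

    def commit(self):
        # Publish all buffered writes in one step.
        self.memory.update(self.writes)
        self.writes = {}

    def abort(self):
        # Discard the buffer; shared memory is untouched.
        self.writes = {}


memory = {"x": 0, "y": 0}
t = Transaction(memory)
t.write("x", 1)
t.write("y", 2)
assert memory == {"x": 0, "y": 0}  # not visible before commit
t.abort()
assert memory == {"x": 0, "y": 0}  # an aborted transaction leaves no trace
```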
![Page 3: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/3.jpg)
Transactional Memory Implementations
Hardware Transactional Memory
o Transactional Memory [Herlihy & Moss, '93]
o Transactional Memory Coherence and Consistency [Hammond et al., '04]
o Unbounded transactional memory [Ananian, Asanovic, Kuszmaul, Leiserson, Lie, '05]
o …
Software Transactional Memory
o Software Transactional Memory [Shavit & Touitou, '97]
o DSTM [Herlihy, Luchangco, Moir, Scherer, '03]
o RSTM [Marathe et al., '06]
o WSTM [Harris & Fraser, '03], OSTM [Fraser, '04], ASTM [Marathe, Scherer, Scott, '05], SXM [Herlihy]
o …
![Page 4: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/4.jpg)
“Conventional” STM system high-level structure
[Diagram: OS-scheduler-controlled threads run inside the TM system. Contention detection asks the contention manager to arbitrate; the outcome is either proceed, or abort/retry, or wait.]
Contention managers: Greedy, Aggressive, Karma, Polka, Passive, Polite
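A contention manager can be thought of as a function that arbitrates between two conflicting transactions. The sketch below is a simplification, not the published algorithms: `TxInfo`, `WAIT` and the three function names are hypothetical, and the two non-trivial rules are only "in the spirit of" Greedy (timestamps) and Karma (accumulated work).

```python
from dataclasses import dataclass

ABORT_OTHER = "ABORT_OTHER"  # decision name that also appears later in the slides
WAIT = "WAIT"                # hypothetical name for "back off and retry later"


@dataclass
class TxInfo:
    start_time: int     # smaller value = older transaction
    work_done: int = 0  # e.g. number of objects opened so far


def aggressive(me: TxInfo, other: TxInfo) -> str:
    """Always abort the conflicting transaction."""
    return ABORT_OTHER


def oldest_wins(me: TxInfo, other: TxInfo) -> str:
    """Timestamp rule in the spirit of Greedy: the older transaction wins."""
    return ABORT_OTHER if me.start_time < other.start_time else WAIT


def most_work_wins(me: TxInfo, other: TxInfo) -> str:
    """Priority rule in the spirit of Karma: more accumulated work wins."""
    return ABORT_OTHER if me.work_done > other.work_done else WAIT


old = TxInfo(start_time=1, work_done=2)
new = TxInfo(start_time=9, work_done=7)
for policy in (aggressive, oldest_wins, most_work_wins):
    print(policy.__name__, policy(old, new))
```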
![Page 5: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/5.jpg)
Talk outline
Preliminaries
Memory Transactions Scheduling: Rationale
CAR-STM
Adaptive TM Schedulers
TM-scheduling OS support
![Page 6: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/6.jpg)
Conventional conflict resolution policies are often insufficient
Loser resumes execution after a pre-determined waiting period
o May resume execution too early
o May resume execution too late
Repeated collisions occur under high contention
o Livelocks
o Performance may become worse than with a single lock
Scheduling-based CM to the rescue.
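To see why a pre-determined waiting period can miss in both directions, consider randomized exponential backoff, a common choice for such waits. In this sketch the function name, `base` and `cap` are illustrative assumptions; the point is that the delay is picked without knowing when the winning transaction will actually finish.

```python
import random


def backoff_delay(attempt, base=1e-6, cap=1e-3, rng=random):
    """Delay in seconds before retry number `attempt` (0-based).

    The window doubles with each attempt up to `cap`. Nothing here depends
    on the winner's progress, so the loser may wake too early or too late.
    """
    window = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, window)


for attempt in range(4):
    print(attempt, backoff_delay(attempt))
```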
![Page 7: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/7.jpg)
TM schedulers: rationale
Transactional threads controlled by a TM-aware scheduler
o Kernel-level, user-level
Richer “tool-box” for reducing and/or preventing transaction conflicts
Improve performance under high contention
![Page 8: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/8.jpg)
The first TM schedulers
“Adaptive transaction scheduling for transactional memory systems”, Yoo & Lee, SPAA '08
“CAR-STM: Scheduling-based collision avoidance and resolution for software transactional memory”, Dolev, Hendler & Suissa, PODC '08
“Steal-on-abort: dynamic transaction reordering to reduce conflicts in transactional memory”, Ansari, Jarvis, Kirkham, Kotselidis, Lujan & Watson, HiPEAC '09
![Page 9: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/9.jpg)
Our work
“CAR-STM: Scheduling-based collision avoidance and resolution for software transactional memory” [Dolev, Hendler & Suissa, PODC '08]
“On the impact of Serializing Contention Management on STM performance” [Heber, Hendler & Suissa, OPODIS '09]
“Scheduling support for transactional memory contention management” [Fedorova, Felber, Hendler, Lawall, Maldonado, Marlier, Muller & Suissa, PPoPP '10]
![Page 10: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/10.jpg)
CAR-STM (Collision Avoidance and Resolution for STM): Design Goals
Limit parallelism to a single transaction per core (or hardware thread)
Serialize conflicting transactions
Contention avoidance
![Page 11: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/11.jpg)
CAR-STM high-level architecture
[Diagram: cores #1 … #k, each with its own transaction queue (#1 … #k) and TQ thread. Other components shown: transaction thread, T-Info, dispatcher, collision avoider, serializing contention manager.]
![Page 12: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/12.jpg)
TQ-Entry Structure
[Diagram: the CAR-STM architecture from the previous slide, with one TQ entry expanded.]
A TQ entry holds:
o wrapper method
o Transaction data
o T-Info
o Trans. thread
o Lock, condition var
![Page 13: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/13.jpg)
Transaction dispatching process
Enqueue the transaction in the most-conflicting queue. Put the thread to sleep and notify the TQ thread.
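The queue-selection step of dispatching can be sketched as follows. The conflict estimate is an assumption: transactions are modelled as sets of keys and `overlap` is a hypothetical stand-in for whatever the collision avoider computes. Putting the thread to sleep and notifying the TQ thread are not modelled.

```python
from collections import deque


def overlap(tx_a, tx_b):
    """Hypothetical conflict estimate: fraction of shared keys (Jaccard index)."""
    union = tx_a | tx_b
    return len(tx_a & tx_b) / len(union) if union else 0.0


def dispatch(tx, queues, estimate=overlap):
    """Append tx to the queue whose pending transactions it conflicts with most.

    Returns the index of the chosen queue (the first queue wins ties).
    """
    def score(i):
        return max((estimate(tx, other) for other in queues[i]), default=0.0)

    best = max(range(len(queues)), key=score)
    queues[best].append(tx)
    return best


queues = [deque([frozenset({"a", "b"})]), deque([frozenset({"x", "y"})])]
chosen = dispatch(frozenset({"x", "z"}), queues)
print(chosen)  # 1: the new transaction shares "x" with the second queue
```

Placing likely-conflicting transactions in the same queue means they run one after another on the same core instead of colliding on different cores.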
![Page 14: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/14.jpg)
Transaction execution
[Diagram: the TQ thread of core #i serving transaction queue #i. A TQ entry holds the wrapper method, transaction data, T-Info, the transaction thread, and a lock with a condition variable.]
![Page 15: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/15.jpg)
Serializing Contention Managers
When two transactions collide, fail the newer transaction and move it to the TQ of the older transaction
Fast elimination of livelock scenarios
Two SCMs implemented
o Basic (BSCM) – move failed transaction to end of the other transaction's TQ
o Permanent (PSCM) – make the failed transaction a subordinate transaction of the other transaction
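The basic policy (BSCM) can be sketched with plain queues. The representation is an assumption made for illustration: transactions are `(name, start_time)` tuples, `queues` maps a core id to its pending transactions, and `home` records which queue currently holds each transaction.

```python
from collections import deque


def bscm_resolve(a, b, queues, home):
    """Fail the newer of a and b and append it to the older one's queue.

    a, b   : (name, start_time) tuples; smaller start_time = older
    queues : dict core_id -> deque of pending transactions
    home   : dict name -> core_id of the queue holding that transaction
    """
    winner, loser = (a, b) if a[1] < b[1] else (b, a)
    queues[home[loser[0]]].remove(loser)   # leave the old queue
    queues[home[winner[0]]].append(loser)  # re-run behind the winner
    home[loser[0]] = home[winner[0]]
    return winner, loser


queues = {1: deque([("Ta", 5), ("Tc", 7)]), 2: deque([("Tb", 3)])}
home = {"Ta": 1, "Tc": 1, "Tb": 2}
winner, loser = bscm_resolve(("Ta", 5), ("Tb", 3), queues, home)
print(winner[0], loser[0], list(queues[2]))
```

Once the loser sits behind the winner in the same queue, the two can no longer run concurrently, which is what removes the livelock.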
![Page 16: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/16.jpg)
PSCM
[Diagram: Ta runs from transaction queue #1 on core #1 and Tb from transaction queue #k on core #k, each queue with its own TQ thread. Tc, Td and Te also appear in the figure.]
Transactions a and b collide, b is older
![Page 17: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/17.jpg)
PSCM
[Diagram: after the collision, Tb remains with transaction queue #k on core #k, and Ta and Tc are attached to Tb. Td and Te also appear in the figure.]
Losing transaction and its subordinates are made subordinates of winning transaction
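The subordination rule can be sketched with a per-transaction list. How CAR-STM actually stores subordinates is not given in the transcript, so the `Tx` class and the flat `subordinates` list are assumptions.

```python
class Tx:
    def __init__(self, name, start_time):
        self.name = name
        self.start_time = start_time  # smaller value = older
        self.subordinates = []        # transactions that must run after this one


def pscm_resolve(a, b):
    """The older transaction wins; the loser and its subordinates follow the winner."""
    winner, loser = (a, b) if a.start_time < b.start_time else (b, a)
    winner.subordinates.append(loser)
    winner.subordinates.extend(loser.subordinates)
    loser.subordinates = []
    return winner


def execution_order(tx):
    """Order in which a TQ thread would run tx and its subordinates."""
    return [tx.name] + [s.name for s in tx.subordinates]


ta, tb, tc = Tx("Ta", 5), Tx("Tb", 3), Tx("Tc", 7)
ta.subordinates.append(tc)           # Tc was already subordinate to Ta
winner = pscm_resolve(ta, tb)        # a and b collide, b is older
print(execution_order(winner))       # ['Tb', 'Ta', 'Tc']
```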
![Page 18: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/18.jpg)
Execution time: STMBench7, R/W-dominated workloads
![Page 19: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/19.jpg)
Throughput: STMBench7, R/W-dominated workloads
![Page 20: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/20.jpg)
CAR-STM Shortcomings
May restrict parallelism too much
o At most a single transactional thread per core/hardware thread
o Transitive serialization
High overhead
Non-adaptive
![Page 21: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/21.jpg)
Talk outline
Preliminaries
Memory Transactions Scheduling: Rationale
CAR-STM
Adaptive TM Scheduling
TM-scheduling OS support
![Page 22: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/22.jpg)
“On the impact of Serializing Contention Management on STM performance”
CBench – synthetic benchmark generating workloads with pre-determined length and abort probability.
A low-overhead serialization mechanism
Better understanding of adaptive serialization algorithms
![Page 23: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/23.jpg)
A Low Overhead Serialization Mechanism (LO-SER)
[Diagram: transactional threads and their condition variables]
![Page 24: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/24.jpg)
A Low Overhead Serialization Mechanism (cont'd)
1) t identifies a collision with t'
2) t calls the contention manager: ABORT_OTHER
3) t changes the status of t' to ABORT (and records that t is the winner)
4) t' identifies that it was aborted
![Page 25: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/25.jpg)
A Low Overhead Serialization Mechanism (cont'd)
5) t' rolls back transaction and goes to sleep on the condition variable of t
6) Eventually t commits and broadcasts on its condition variable…
![Page 26: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/26.jpg)
A Low Overhead Serialization Mechanism (cont'd)
![Page 27: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/27.jpg)
Requirements for serialization mechanism
Commit broadcasts only if transaction won a collision since last broadcast (or start of execution)
No waiting cycles (deadlock-freedom)
Avoid race conditions
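Steps 1 to 6 and the first and third requirements can be sketched with one condition variable per thread. This is not the LO-SER pseudo-code from the slides (which is not part of this transcript): the class, field and method names are assumptions, rollback is omitted, and waiting cycles are not handled.

```python
import threading


class TxThread:
    def __init__(self, name):
        self.name = name
        self.cond = threading.Condition()  # losers sleep here while waiting for us
        self.commits = 0                   # number of commits so far
        self.won_collision = False         # set when we abort another transaction
        self.waiting_for = None            # (winner, winner's commit count) if aborted

    def abort_other(self, loser):
        """Steps 2-3: mark the loser as aborted and record who won."""
        with self.cond:
            self.won_collision = True
            loser.waiting_for = (self, self.commits)

    def sleep_if_aborted(self):
        """Steps 4-5: after rolling back, sleep until the winner commits."""
        if self.waiting_for is None:
            return
        winner, seen = self.waiting_for
        self.waiting_for = None
        with winner.cond:
            # Re-check under the lock so a commit that already happened is not missed.
            while winner.commits == seen:
                winner.cond.wait()

    def commit(self):
        """Step 6: broadcast only if we won a collision since the last broadcast."""
        with self.cond:
            self.commits += 1
            if self.won_collision:
                self.won_collision = False
                self.cond.notify_all()


t, t_prime = TxThread("t"), TxThread("t'")
t.abort_other(t_prime)
sleeper = threading.Thread(target=t_prime.sleep_if_aborted)
sleeper.start()
t.commit()
sleeper.join(timeout=5)
print(sleeper.is_alive())  # False: t' was woken, or never needed to sleep
```

Recording the winner's commit count at abort time is one way to avoid the race where the winner commits before the loser goes to sleep.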
![Page 28: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/28.jpg)
LO-SER algorithm: data structures
![Page 29: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/29.jpg)
LO-SER algorithm: pseudo-code
![Page 30: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/30.jpg)
LO-SER algorithm: pseudo-code (cont'd)
![Page 31: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/31.jpg)
LO-SER algorithm: pseudo-code (cont'd)
![Page 32: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/32.jpg)
Adaptive algorithms
Collect (local or global) statistics on contention level.
Apply serialization only when contention is high. Otherwise, apply a “conventional” contention-management algorithm.
We find that Stabilized adaptive algorithms perform better.
First adaptive TM scheduler: “Adaptive transaction scheduling for transactional memory systems” [Yoo & Lee, SPAA '08]
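One way to apply serialization only under high contention is to track a running estimate of the abort rate and switch modes on thresholds. Everything in this sketch is an assumption: the moving-average estimate, the `alpha`, `high` and `low` values, and the use of two thresholds (hysteresis) to keep the mode from oscillating. It is not claimed to be the "stabilized" algorithm evaluated in the talk.

```python
class AdaptiveSerializer:
    """Decides per thread whether to serialize, based on recent abort history."""

    def __init__(self, alpha=0.3, high=0.5, low=0.2):
        self.alpha = alpha        # weight of the newest sample
        self.high = high          # start serializing above this level
        self.low = low            # stop serializing below this level
        self.contention = 0.0     # exponential moving average of aborts
        self.serializing = False

    def record(self, aborted):
        """Feed one transaction outcome; return True if serialization is on."""
        sample = 1.0 if aborted else 0.0
        self.contention = self.alpha * sample + (1 - self.alpha) * self.contention
        if not self.serializing and self.contention > self.high:
            self.serializing = True
        elif self.serializing and self.contention < self.low:
            self.serializing = False
        return self.serializing


cm = AdaptiveSerializer()
history = [cm.record(aborted) for aborted in (True, True, False, False, False)]
print(history)  # [False, True, True, True, False]
```

When `record` returns False, the caller would fall back to a conventional contention-management algorithm.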
![Page 33: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/33.jpg)
CBench Evaluation
CAR-STM incurs high overhead compared with the other algorithms
Always serializing is bad under medium contention
Always serializing is best under high contention
Always serializing incurs no overhead in the absence of contention
![Page 34: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/34.jpg)
CBench Evaluation
Adaptive serialization fares well at all contention levels
![Page 35: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/35.jpg)
CBench Evaluation
Conventional CM performance degrades under high contention
![Page 36: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/36.jpg)
CBench Evaluation (cont'd)
CAR-STM has the best efficiency but the worst throughput
![Page 37: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/37.jpg)
RandomGraph Evaluation
Stabilized algorithm improves throughput by up to 30%
Throughput and efficiency of conventional algorithms are poor
![Page 38: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/38.jpg)
Talk outline
Preliminaries
Memory Transactions Scheduling: Rationale
CAR-STM
Adaptive TM Schedulers
TM-scheduling OS support
![Page 39: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/39.jpg)
“Scheduling Support for Transactional Memory Contention Management”
Implement CM scheduling support in the kernel scheduler (Linux & OpenSolaris)
o (Strict) serialization
o Soft serialization
o Time-slice extension
Different mechanisms for communication between user-level STM library and kernel scheduler
![Page 40: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/40.jpg)
TM Library / Kernel Communication via Shared Memory Segment (Ser-k)
User code notifies the kernel of events such as transaction start, commit and abort (in which case the thread yields)
Kernel code handles moving thread between ready and blocked queues
![Page 41: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/41.jpg)
Soft Serialization
Instead of blocking, reduce loser thread priority and yield
Efficient in scenarios where loser transactions may take a different execution path when retrying (non-determinism)
Priority should be restored upon commit or when conflicting transactions terminate
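Soft serialization can be illustrated with a toy user-level model of a priority scheduler. This is not kernel code: `ToyScheduler`, its priority values and its method names are hypothetical, and it only shows that a demoted loser stays runnable but is picked after its peers until its priority is restored.

```python
class ToyScheduler:
    """Picks the runnable thread with the highest priority (ties: lowest name)."""

    def __init__(self, threads, base_priority=10, penalty=5):
        self.base = base_priority
        self.penalty = penalty
        self.priority = {t: base_priority for t in threads}

    def on_abort(self, loser):
        # Soft serialization: demote the loser instead of blocking it.
        self.priority[loser] = self.base - self.penalty

    def restore(self, thread):
        # Called when the thread commits or its conflicting transactions terminate.
        self.priority[thread] = self.base

    def pick_next(self):
        return min(self.priority, key=lambda t: (-self.priority[t], t))


sched = ToyScheduler(["A", "B"])
sched.on_abort("A")
print(sched.pick_next())  # B
sched.restore("A")
print(sched.pick_next())  # A
```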
![Page 42: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/42.jpg)
Time-slice extension
Preemption in the midst of a transaction increases conflict “window of vulnerability”
Defer preemption of transactional threads
o Avoid CPU monopolization by bounding the number of extensions and yielding after commit
May be combined with serialization/soft serialization
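The bounded-extension rule can be sketched as a decision taken at each timer tick. The function name and the `max_extensions` value are assumptions; the real mechanism lives in the kernel scheduler, and yielding after commit is not modelled here.

```python
def on_timer_tick(in_transaction, extensions_used, max_extensions=2):
    """Return (preempt_now, new_extensions_used) for the running thread."""
    if in_transaction and extensions_used < max_extensions:
        return False, extensions_used + 1  # grant one more time slice
    return True, 0                         # preempt and reset the budget


used = 0
decisions = []
for _ in range(4):
    preempt, used = on_timer_tick(True, used)
    decisions.append(preempt)
print(decisions)  # [False, False, True, False]
```

Bounding the number of extensions is what keeps a long transaction from monopolizing the CPU.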
![Page 43: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/43.jpg)
Evaluation (STMBench7, 16 core machine)
Conventional CM deteriorates when threads > cores
Serializing by local spinning is efficient as long as threads ≤ cores
![Page 44: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/44.jpg)
Evaluation - STMBench7 throughput
Serializing by sleeping on a condition variable is best when threads > cores, since system call overhead is negligible (long transactions)
![Page 45: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/45.jpg)
Evaluation - STMBench7 aborts data
![Page 46: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/46.jpg)
Conclusions
Scheduling-based CM results in
o Improved throughput under high contention
o Improved efficiency at all contention levels
LO-SER-based serialization incurs no visible overhead
Lightweight kernel support can improve performance and efficiency
Dynamically selecting best CM algorithm for workload at hand is a challenging research direction
![Page 47: Scheduling Memory Transactions](https://reader033.fdocuments.net/reader033/viewer/2022051117/56815a84550346895dc7f405/html5/thumbnails/47.jpg)
Thank you. Any questions?