Syed Ali Raza Jafri et al.1 LiteTM: Reducing HTM State Overhead T. N. Vijaykumar with Ali Jafri &...

Syed Ali Raza Jafri et al. 1

LiteTM: Reducing HTM State Overhead

T. N. Vijaykumar

with Ali Jafri & Mithuna Thottethodi in HPCA ’10

Transactional Memory (TM)

Multicores require parallel programming

• Significantly harder than sequential programming

Locks may cause incorrect behavior

• Deadlocks/livelocks and data races

TM appears to make correct programming easier

TM implementations can be efficient

Transactions may provide better programmability and performance than locks

Previous Work

Hardware, software and hybrid TMs

• HTMs piggyback conflict detection on coherence

• STMs and HybridTMs detect conflicts in software

Recent HTMs support many features

• Transaction time and footprint not limited by hardware

» Can exceed caches and even be swapped out of memory

• Transaction-OS interactions not restricted

» In-flight context switches, page/thread migrations

• Modest hardware complexity

» No coherence protocol changes (very big deal)

Supporting these features incurs high hardware cost

HTM Cost: State overhead

HTMs need large state throughout memory hierarchy

• Numerous state bits in L1 and L2

• Hijack memory ECC → weaker protection

» E.g., 25% fewer SECDED bits in TokenTM

Supporting all features → large state in caches + weaker memory ECC → high barrier for adoption

[Figure labels: 19 bits; 16 bits]
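The 25% figure is consistent with simple arithmetic under two assumptions the slides do not state: a 64-byte memory block and conventional (72,64) SECDED, which spends 8 check bits on every 64 data bits. A minimal sketch of that calculation (the constants and function names are illustrative, not taken from the paper):

```python
# Back-of-the-envelope check of the "25% fewer SECDED bits" figure.
# Assumptions (not stated on the slide): 64-byte blocks protected by
# (72,64) SECDED, i.e. 8 check bits for every 64 data bits.

BLOCK_BYTES = 64          # assumed memory block size
DATA_BITS_PER_WORD = 64   # data width of one (72,64) SECDED word
CHECK_BITS_PER_WORD = 8   # check bits of one (72,64) SECDED word


def ecc_bits_per_block(block_bytes: int = BLOCK_BYTES) -> int:
    """Total SECDED check bits available in one memory block."""
    words = block_bytes * 8 // DATA_BITS_PER_WORD
    return words * CHECK_BITS_PER_WORD


def fraction_hijacked(state_bits: int, block_bytes: int = BLOCK_BYTES) -> float:
    """Fraction of a block's ECC bits taken over by transactional state."""
    return state_bits / ecc_bits_per_block(block_bytes)


if __name__ == "__main__":
    print(f"16 state bits take {fraction_hijacked(16):.0%} of the block's ECC bits")
```

Under these assumptions a block carries 64 check bits, so 16 bits of transactional state account for one quarter of them.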

HTM State Overhead

Thread Id/sharer-count + state bits per block

Thread Id to determine conflictors or own blocks

Sharer count to track multiple readers

• Ideally, need all ids, but too much state → make do with counts

Avoid coherence changes → extra bits (beyond R, W)

• E.g., TokenTM uses 5 bits instead of usual R, W

Thread Ids + sharer-counts in hardware → detect conflicts + identify conflictors mostly in hardware

LiteTM: Key Observations

Most state information not needed in common case

Eliminate thread Ids and sharer-counts

• Intended for conflicts on L1-evicted blocks, but

• Conflict usually on L1-resident blocks

• Coherence trivially identifies L1-resident conflictors & count

Merge R,W into T

• Coherence’s “Modified” state can approximate W

• False positives possible but rare

Uncommon case: scan transactional log

LiteTM detects conflicts in h/w (all cases, like all HTMs); identifies conflictors in h/w (common case) & s/w (uncommon case)
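The merged T bit and the W approximation can be illustrated with a small model of one L1 block. This is a sketch under assumptions, not the hardware design: the class and function names are hypothetical, and a MESI-style coherence state is assumed.

```python
from dataclasses import dataclass
from enum import Enum


class Coherence(Enum):
    """Assumed MESI-style coherence states."""
    INVALID = "I"
    SHARED = "S"
    EXCLUSIVE = "E"
    MODIFIED = "M"


@dataclass
class L1Block:
    coherence: Coherence = Coherence.INVALID
    t_bit: bool = False  # single merged transactional bit (replaces R and W)

    def treated_as_written(self) -> bool:
        # W is approximated as "T set and coherence state is Modified".
        return self.t_bit and self.coherence is Coherence.MODIFIED


def remote_access_conflicts(block: L1Block, remote_is_write: bool) -> bool:
    """Would a remote access to this L1-resident block be flagged as a conflict?"""
    if not block.t_bit:
        return False  # block is not part of a transaction
    # A remote write conflicts with any transactional access;
    # a remote read conflicts only with a (possibly approximated) write.
    return remote_is_write or block.treated_as_written()


if __name__ == "__main__":
    # False positive: the block was dirtied before the transaction and
    # then only read transactionally, yet it is treated as written.
    dirty_then_read = L1Block(Coherence.MODIFIED, t_bit=True)
    print(remote_access_conflicts(dirty_then_read, remote_is_write=False))
```

The false positive arises only when a transactionally read block already sits in the Modified state, which is why the approximation is expected to cost little.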

LiteTM: Contributions (1)

LiteTM reduces transactional state

Average (worst) case 4% (10%) performance loss in STAMP (8 cores)

Key reduction is removal of thread id/count (W approx is secondary)

[Figure labels: 2 bits; 19 bits; 16 bits]
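To put the per-block numbers in absolute terms, the sketch below computes the transactional-state storage for one cache. The 32 KiB cache and 64-byte block sizes are assumptions chosen for illustration; only the 16-bit and 2-bit per-block figures come from the slides.

```python
def state_bytes(cache_bytes: int, block_bytes: int, bits_per_block: int) -> float:
    """Storage (in bytes) spent on transactional state across one cache."""
    blocks = cache_bytes // block_bytes
    return blocks * bits_per_block / 8


if __name__ == "__main__":
    L1_BYTES = 32 * 1024   # assumed L1 size
    BLOCK_BYTES = 64       # assumed block size
    for name, bits in (("TokenTM", 16), ("LiteTM", 2)):
        print(f"{name}: {state_bytes(L1_BYTES, BLOCK_BYTES, bits):.0f} bytes of state")
```

Whatever the cache size, the ratio between the two designs stays at 8x, since it depends only on the bits per block.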

LiteTM: Contributions (2)

LiteTM compensates for the loss of

• Thread Id

• Read-sharer count

• Separate R,W bits

via novel mechanisms

• Self-log walks

• Lazy clearing of L1-spilled transaction state

• W approximation

• All-log walks (a la TokenTM)

Smaller state in caches & fewer hijacked memory ECC bits → significantly lower barrier for adoption

LiteTM in the HTM-STM spectrum

LiteTM improves HTM by pushing more into software

• i.e., by moving HTMs closer to STMs!

LiteTM differs from HybridTMs in h/w-s/w split

• Hybrids: conflict detection in h/w if fits in cache; otherwise in s/w

• LiteTM: conflict detection always in h/w; resolution in s/w

Key point: Conflict detection

• Needed for all accesses → must be fast

• Is a global operation → usually hard to do fast in software

• Closely matches coherence, which is fast → easy to piggyback

• Hence, always in hardware in LiteTM (like all HTMs)

Outline

Introduction

LiteTM transactional state

Lazy clearing

Experimental Results

Conclusion

Transactional State in L1

TokenTM (~16 bits)

R, W – transactionally read/written

R',W' + id – read/written and moved to another cache upon coherence movement

• No change in coherence

• Identifies conflictor

R+ + count – fusion of multiple read copies

LiteTM (2 bits)

T + clean/modified – transactionally read/written

T' – T moved to another cache

• No id → all-log walk if conflict

Upon conflict, abort writer and all but one reader, or all readers

Transactional State in L2 & Memory

TokenTM (~16 bits)

States in L2 & memory

• Idle (transactionally clean)

• Single reader + id

• Single writer + id

• Multiple readers + count

Conflict on multiple readers → all-log walks

LiteTM (2 bits)

State in L2 & memory

• Idle

• Single reader

• Single writer

• Multiple readers

Conflict in any state → all-log walks

No id → self-log walk

No count → no decrement of count → lazy clearing of ‘Multiple readers’
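The two kinds of log walk can be sketched in software as follows. This is an illustrative model only: the log format (block address mapped to "R" or "W"), the state names, and the resolution order are assumptions, and the function runs only after hardware has already flagged an access to an L1-evicted block.

```python
from enum import Enum
from typing import Dict, Set


class MemState(Enum):
    """The four id-less, count-less states (encodable in 2 bits)."""
    IDLE = 0
    SINGLE_READER = 1
    SINGLE_WRITER = 2
    MULTIPLE_READERS = 3


Log = Dict[int, str]  # hypothetical log format: block address -> "R" or "W"


def self_log_walk(my_log: Log, addr: int) -> bool:
    """Without a thread id, a thread scans its own log to see if the block is its own."""
    return addr in my_log


def all_log_walk(requester: int, addr: int, is_write: bool,
                 logs: Dict[int, Log]) -> Set[int]:
    """Scan every other thread's log to identify the conflictors."""
    conflictors = set()
    for tid, log in logs.items():
        access = log.get(addr)
        if tid != requester and access is not None:
            if is_write or access == "W":
                conflictors.add(tid)
    return conflictors


def resolve(requester: int, addr: int, is_write: bool,
            state: MemState, logs: Dict[int, Log]) -> Set[int]:
    """Return the threads the requester conflicts with (empty set: no conflict)."""
    if state is MemState.IDLE:
        return set()
    if not is_write and state in (MemState.SINGLE_READER, MemState.MULTIPLE_READERS):
        return set()  # read sharing is allowed
    if state is not MemState.MULTIPLE_READERS and self_log_walk(logs[requester], addr):
        return set()  # the single reader/writer is the requester itself
    return all_log_walk(requester, addr, is_write, logs)
```

For example, with `logs = {0: {0x40: "R"}, 1: {0x40: "R", 0x80: "W"}, 2: {}}`, a read of block `0x80` by thread 0 in the `SINGLE_WRITER` state identifies thread 1 as the conflictor, while the same access by thread 1 is resolved by the self-log walk alone.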

Lazy Clearing

‘Multiple readers’ conflict/commit leaves state behind

• No count → don’t know who is last reader → cannot clear

• Lazy clear on next conflict via all-log walks

All-log walk check and state clearing should be atomic

• Hardware address buffers + software support

Details in HPCA ’10 paper
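A minimal sketch of lazy clearing, assuming a software-visible map of block states and per-thread read logs (both hypothetical names). The atomicity of the walk and the clear, which the slide assigns to hardware address buffers plus software support, is not modeled here; the sketch is single-threaded.

```python
from typing import Dict, Set

IDLE = "idle"
MULTIPLE_READERS = "multiple-readers"


def commit(tid: int, logs: Dict[int, Set[int]]) -> None:
    """Commit discards the thread's log; with no sharer count, the
    'multiple readers' state of its blocks cannot be cleared here."""
    logs[tid].clear()


def write_access(requester: int, addr: int,
                 mem_state: Dict[int, str],
                 logs: Dict[int, Set[int]]) -> Set[int]:
    """Handle a write that hits a non-idle block; return the live conflictors."""
    if mem_state.get(addr, IDLE) == IDLE:
        return set()
    # All-log walk: which other threads still have the block in their logs?
    readers = {tid for tid, log in logs.items()
               if tid != requester and addr in log}
    if not readers:
        mem_state[addr] = IDLE  # stale state left by committed readers: clear lazily
    return readers
```

After both readers of a block commit, the block's state is still `MULTIPLE_READERS`; the next conflicting write finds no live reader in any log, clears the state, and proceeds.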

Outline

Introduction

LiteTM transactional state

Lazy clearing

Experimental Results

Conclusion

Methodology

GEMS HTM simulator on top of Simics

8-core, 1 GHz in-order-issue processor

Typical memory hierarchy parameters

All STAMP benchmarks

Multiple runs for statistical significance

Transactional state bits: TokenTM 16 vs. LiteTM 2

• Also show LiteTM-1bit: read sharing triggers log walks

Hybrid-bound: emulate spilled transactions in hybrid TMs

• 1 extra hash-table write per first transactional access
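The cost charged by the Hybrid-bound emulation can be pictured as a count of first transactional accesses. The sketch below assumes a trace of (transaction id, block address) pairs, which is a hypothetical format, and simply counts one extra hash-table write for each block a transaction touches for the first time.

```python
from typing import Iterable, Tuple

Access = Tuple[int, int]  # hypothetical trace entry: (transaction id, block address)


def extra_hash_table_writes(trace: Iterable[Access]) -> int:
    """Count first transactional accesses: one per distinct (transaction, block) pair."""
    seen = set()
    extra = 0
    for txn, block in trace:
        if (txn, block) not in seen:
            seen.add((txn, block))
            extra += 1  # first touch of this block by this transaction
    return extra


if __name__ == "__main__":
    trace = [(1, 0x40), (1, 0x40), (1, 0x80), (2, 0x40)]
    print(extra_hash_table_writes(trace))
```

Repeated accesses to the same block within a transaction add no further cost in this model.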

LiteTM Performance

• Mostly 1-3% loss; contentious, long transactions → 10% loss

• Labyrinth’s contention hurts base optimistic TM → small loss

• Lack of distinction between read-sharing and conflict degrades LiteTM-1bit; conflict detection in s/w degrades Hybrid-bound

LiteTM Aborts & Log Walks

Benchmarks                         % false aborts due to W approx   Self-log walks per commit   All-log walks per commit
ssca2, km-low, km-high, intruder   0                                ~0                          ~0
genome                             2.5                              0.02                        ~0
vac-low                            0                                ~0                          ~0
vac-high                           0                                0.02                        0.01
yada                               0.9                              0.3                         ~0
bayes                              0.3                              3.9                         0.08
labyrinth                          0.1                              58                          0.94

Overhead increases with contention yet still low

Conclusion

Current HTMs support many key features

Incur high transactional state overhead

• Many state bits in all caches & hijacked memory ECC bits

• High barrier for adoption

LiteTM significantly reduces transactional state

• Most state information not needed in common case

• Employs novel mechanisms for uncommon case

LiteTM reduces TokenTM’s 16 bits/block to 2 bits

• Average (worst) case 4% (10%) performance loss in STAMP

LiteTM significantly lowers the barrier for adoption

A couple of points on Cliff’s talk

Main problem: Conflicts due to auxiliary data

This problem exists for all optimistic TMs

• HTMs, STMs, and hybrids

Options

• Learn from past conflicts to skew the schedule (prevent conflict)

• Repair transactional state - Martin et al. ISCA ’10 (cure conflict)

• Instead of learning, compiler can provide hints to aid prevention

These problems don’t seem big enough to give up on HTMs

Questions?

Is TokenTM overhead really high?

16 bits/L1-block is a lot in absolute terms

16 bits in memory may be hijacked from ECC

• 25% fewer SECDED bits → weaker protection

Or, 16 bits may be placed in main memory

• Increases the bandwidth requirements

Narrow Topic?

LiteTM separates

• Conflict detection (hardware)

• from conflictor identification (software)

Fundamental and can be applied to other unbounded HTMs

Focus on TokenTM

TokenTM is the only design which supports all features mentioned previously

Hence we attempt to improve TokenTM

Our design is applicable to other HTMs as well

• OneTM-concurrent's ids

• VTM's ids (pointers to XSW in XADT) and counts (#entries in XADT)

What about UFO?

UFO is not a TM

Supports strong atomicity in Hybrids/STMs

We compare against upper bound on hybrids

Read-sharing Support

LiteTM allows read sharing

Multiple L1's can have T-bits

L2 has multiple read sharing state

Disallows read sharing if T bit + Modified

• Uncommon

Should logs be locked to avoid racing conflicts?

Recall: Conflicting access faults and retries

Suppose thread F is checking thread N’s log

• Looking for block X

N makes racing access to X

N takes away coherence permissions from F

After log walk, F will RETRY the access to X

Coherence action will cause F to fault again

Backstop available to prevent livelock

Context switches handled similarly
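The fault-and-retry argument can be sketched as a loop: because the faulting access is retried after every log walk, a racing access that slipped past the walk simply causes another fault, so logs need no lock. The callback names and the bounded-retry backstop below are hypothetical; the slides do not say what the actual backstop is.

```python
from typing import Callable


def access_with_retry(try_access: Callable[[], bool],
                      walk_logs: Callable[[], None],
                      max_retries: int = 4) -> int:
    """Retry an access until it completes; return the number of faults taken.

    try_access returns False when the access faults on a conflict.
    walk_logs identifies and resolves conflictors after each fault.
    max_retries stands in for an unspecified livelock backstop.
    """
    for faults in range(max_retries + 1):
        if try_access():
            return faults
        walk_logs()  # a racing access missed here shows up as a fault on retry
    raise RuntimeError("backstop reached: too many conflicting retries")


if __name__ == "__main__":
    # F faults, walks the logs, loses the block to N's racing access,
    # faults again, walks again, and then succeeds.
    outcomes = iter([False, False, True])
    print(access_with_retry(lambda: next(outcomes), lambda: None))
```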

Coherence Actions are Completed

Invalidations of a reader

• T' bit sent to writer

• T' states that there exists a token

Read sharing of writer

• T' bit sent to reader

STM Acceleration Easier?

STM-acceleration provides weaker semantics

• Requires at least one bit per memory block

• UFO-like mechanisms

LiteTM only 2 bits per block

• No changes to coherence protocol

• Performs better than STM-accelerated approach

Shown by our hybrid-upper bound comparison

Smallest Input Dataset

8-core setup

Suitable scaling for all benchmarks

Reasonable simulation times

• Statistical variation

Hybrid better than signature HTMs

Signature saturation causes serialized execution

TokenTM and LiteTM use per-block metastate,

Support for SMT Cores

LiteTM can support multithreaded cores

• Replicates the T bits per hardware context

• Single T' bit

T' bit → remote transactional access

Bits everywhere are hard?

No, adding nacks/delays in coherence is hard

• Leads to deadlocks/livelocks

Adding bits is quite easy

Validity of Hybrid Bound

Upper bound on Hybrids which retry spilled TX in s/w

Does not apply to other self-proclaimed hybrids

• E.g., SigTM

SigTM uses signatures for conflict detection

Signature-based TMs have other issues

• Signature saturation causes serialization

TokenTM vs LiteTM: Transactional state for Conflict Detection

[Figure columns: TokenTM | LiteTM]

Sensitivity to Busy Buffers

No buffers → all L1 misses wait till lazy clearing → significant loss for high contention

[Figure legend: 4 buffers; no buffers]

Hybrid Upper-bound

Upper bound for any hybrid that retries transactions in an STM (with software conflict detection) after a failure in HTM

Transactional State Overheads

Thread Id/sharer-count + state bits per block

Avoid coherence changes → extra bits (beyond R, W)

• Previously, conflicting access is nacked (cannot complete)

• Such nacks are invasive changes to coherence (cause deadlocks)

• TokenTM allows coherence to complete even on a conflict

» Access itself does not complete & raises an exception

• Needs transactional state to move with blocks under coherence

• Tracks non-local transactional state

• E.g., TokenTM's R', W', R+

Thread Ids + sharer-counts in hardware → detect conflicts + identify conflictors mostly in hardware
