
CS 7810 Lecture 18

The Potential for Using Thread-Level DataSpeculation to Facilitate Automatic Parallelization

J.G. Steffan and T.C. Mowry
Proceedings of HPCA-4

February 1998

Multi-Threading

• CMPs advocate low-complexity, static approaches to parallelism extraction

• Resolving memory dependences for integer codes is not easy!

Large window: 100 in-flight instrs

Compiler-generated threads: 4 windows of 25 instrs each

Probable Conflicts

[Figure: two ambiguous pointers, p and q, whose accesses may conflict across threads]
Example: Compress

Example Execution


Compiler Optimizations

• Induction variables: in_count

• Reduction: out_count

• Parallel I/O: getchar() and putchar()

• Scalar forwarding: free_entries

• Ambiguous loads and stores: hash[…]
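The optimizations above can be illustrated with a hedged sketch: a compress-like loop split into epochs, where the induction variable (in_count) is recomputed from the epoch base rather than carried across epochs, and the reduction (out_count) is accumulated in per-epoch partials. The variable names follow the slides; the loop body is a stand-in, not the real compress code.

```python
# Sketch (assumed, illustrative) of the induction-variable and reduction
# rewrites that remove cross-epoch dependences in a compress-like loop.

def run_epochs(data, epoch_size):
    partials = []
    # Each epoch keeps a private partial count, so out_count no longer
    # serializes epochs; partials are combined after the parallel region.
    for e in range(0, len(data), epoch_size):
        epoch = data[e:e + epoch_size]
        local_out = 0
        for i, byte in enumerate(epoch):
            # Induction variable: derived from the epoch base instead of
            # being carried from the previous iteration/epoch.
            in_count = e + i
            if byte % 2 == 0:          # stand-in for "emitted a code"
                local_out += 1
        partials.append(local_out)
    return sum(partials)               # reduction combine step
```

With the carried dependences gone, each epoch's body can run speculatively in parallel; only ambiguous loads/stores (e.g. hash[...]) still need hardware checking.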

Methodology

• Threads (epochs) were constructed by hand

• The processors are in-order and instructions have unit latency

Ambiguous Loads and Stores

Average Run Lengths

Forwarding Registers and Scalars

Average Run Lengths

Realistic Models

• 10-cycle forwarding latency

• Sharing at cache line granularity

• Recovery from misspeculation

• Results are not sensitive to forwarding latency or cache line size

Hardware Support

• Cache coherence protocol for the L1 caches

• For each cache line, keep track of whether the line has been read/modified

• When the oldest thread writes to a cache line, an invalidate is sent to the other caches

• A younger thread sets a violation flag if it has speculatively loaded the line -- s/w recovery is initiated when that thread commits

• Cache line evictions also cause violations (not common)
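The detection mechanism above can be sketched as a minimal simulation, assuming (as an illustration, not the paper's exact protocol) one speculative-load set and one violation flag per L1 cache:

```python
# Minimal sketch of speculative violation detection: each L1 tracks which
# lines its epoch has speculatively loaded; a store by the oldest epoch
# invalidates the line in younger caches and flags a violation wherever
# the line was already read speculatively. Names are illustrative.

class SpecL1Cache:
    def __init__(self):
        self.spec_loaded = set()   # lines this epoch has speculatively read
        self.violation = False     # set -> s/w recovery runs at commit

    def load(self, line):
        self.spec_loaded.add(line)

def oldest_store(line, younger_caches):
    # Invalidate the line in all younger caches; any younger epoch that
    # speculatively loaded it has consumed a stale value and must flag
    # a violation so software recovery can re-execute it.
    for cache in younger_caches:
        if line in cache.spec_loaded:
            cache.violation = True
        cache.spec_loaded.discard(line)

young = SpecL1Cache()
young.load(0x40)               # younger epoch speculatively reads line 0x40
oldest_store(0x40, [young])    # oldest epoch then writes it -> violation
```

A store to a line no younger epoch has read simply invalidates it, with no violation; this is why infrequent true conflicts let the compiler parallelize aggressively.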

Role of the Compiler

• Profiling to identify epochs large enough to offset thread management and communication costs, yet small enough to keep speculative state small

• Estimate probability of violation (static/dynamic)

• Optimizations (induction, reduction, parallel I/O)

• Scalar forwarding and rescheduling

• Insertion of register recovery code
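The profiling criteria above suggest a simple selection heuristic; the sketch below is a hypothetical cost model with made-up thresholds, not one taken from the paper:

```python
# Hypothetical epoch-selection heuristic: an epoch is worth speculating on
# if it amortizes thread overhead, fits in bounded speculative state, and
# rarely violates. All constants are illustrative assumptions.

THREAD_OVERHEAD_CYCLES = 10   # assumed thread management + communication cost
MAX_SPEC_LINES = 64           # assumed cap on buffered speculative cache lines

def good_epoch(avg_cycles, spec_lines, violation_prob):
    big_enough = avg_cycles > 5 * THREAD_OVERHEAD_CYCLES  # amortize overhead
    small_enough = spec_lines <= MAX_SPEC_LINES           # bounded spec state
    likely_safe = violation_prob < 0.1                    # seldom mis-speculates
    return big_enough and small_enough and likely_safe
```

The violation probability could come from static analysis or from dynamic profiling, matching the static/dynamic estimate bullet above.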

Conclusions

• Hardware catches violations; compiler can parallelize aggressively

• Competitive implementation: large window with store sets prediction
