Implicitly-Multithreaded Processors Il Park and Babak Falsafi and T. N. Vijaykumar
Presented by: Ashay Rane
Published in: SIGARCH Computer Architecture News, 2003
Agenda
Overview (IMT, state-of-art)
IMT enhancements
Key results
Critique
Relation to Term Project
Implicitly Multithreaded Processor (IMT)
SMT with speculation
Optimizations to basic SMT support
Average performance improvement of 24%
Max: 69%
State-of-the-art
Pentium 4 HT
IBM POWER5
MIPS MT
Speculative SMT operation
When branch encountered, start executing likely path “speculatively”
i.e. allow for rollback (thread squash) in certain circumstances (misprediction, dependence)
Speculation adds cost and overhead, but the savings in execution time and power make it worth the effort
Commit is complicated because independent threads commit separately (one buffer per thread). Issue, register renaming, and cache and TLB conflicts also need handling.
If dependence violation, squash thread and restart execution
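The squash-and-restart rule above can be sketched in a few lines. This is a minimal illustration, not IMT's hardware: `SpecThread`, `store`, and the policy of also squashing every younger thread are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class SpecThread:
    """Hypothetical model of one speculative thread."""
    tid: int                                   # program order (lower = older)
    loads: set = field(default_factory=set)    # addresses read so far
    squashes: int = 0

    def load(self, addr):
        self.loads.add(addr)

    def squash(self):
        # Discard speculative state; the thread restarts from its beginning.
        self.loads.clear()
        self.squashes += 1


def store(threads, writer_tid, addr):
    """An older thread writes addr. Any younger thread that already read
    addr has violated a dependence; squash it and (assumed policy) every
    thread younger than it. Returns the squashed thread ids."""
    violators = [t.tid for t in threads if t.tid > writer_tid and addr in t.loads]
    if not violators:
        return []
    first = min(violators)
    squashed = []
    for t in threads:
        if t.tid >= first:
            t.squash()
            squashed.append(t.tid)
    return squashed
```

For example, if thread 1 has already loaded address 0x40 and thread 0 then stores to it, `store(threads, 0, 0x40)` squashes threads 1 and 2.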
How to buffer speculative data?
Load/Store Queue (LSQ): buffers data (along with its address), helps enforce dependence checks, and makes rollback possible
Cache-based approaches
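A minimal sketch of the LSQ-style buffering idea, assuming a dictionary stands in for memory. `StoreBuffer` and its method names are hypothetical; the point is only that speculative stores stay in the buffer with their addresses until commit, so rollback is just discarding them.

```python
class StoreBuffer:
    """Hypothetical buffer of speculative stores for one thread."""

    def __init__(self, memory):
        self.memory = memory      # committed state: addr -> value
        self.pending = []         # speculative (addr, value) pairs, program order

    def store(self, addr, value):
        # Speculative stores are buffered, not written to memory.
        self.pending.append((addr, value))

    def load(self, addr):
        # Forward from the youngest buffered store to the same address.
        for a, v in reversed(self.pending):
            if a == addr:
                return v
        return self.memory.get(addr, 0)

    def commit(self):
        # Thread became non-speculative: drain stores in program order.
        for a, v in self.pending:
            self.memory[a] = v
        self.pending.clear()

    def rollback(self):
        # Thread squashed: memory was never touched, so just drop the buffer.
        self.pending.clear()
```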
IMT: Most significant improvements
Assistance from Multiscalar compiler
Resource- and dependence-aware fetch policy
Multiplexing threads on a single hardware context
Overlapping thread startup operations with the previous thread’s execution
What does the compiler do?
Extracts threads from the program (loops)
Generates thread descriptor data about registers read and written and control flow exits (for rename tables)
Annotates instructions with special codes (“forward” & “release”) for dependence checking
Fetch Policy
Hardware keeps track of resource utilization
Resource requirement prediction from past four execution instances
When dependencies exist (detected from compiler-generated data), bias towards non-speculative threads
Goal is to reduce number of thread squashes
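The resource-aware part of the policy might look like the sketch below. The slides only say that the prediction uses the past four execution instances; taking the maximum of those four, and the names `ResourcePredictor` and `can_fetch`, are assumptions for illustration.

```python
from collections import deque


class ResourcePredictor:
    """Hypothetical predictor keyed by a thread's start PC."""

    def __init__(self, history=4):
        self.history = history
        self.table = {}           # start PC -> recent resource usage counts

    def record(self, pc, used):
        # Keep only the most recent `history` instances.
        self.table.setdefault(pc, deque(maxlen=self.history)).append(used)

    def predict(self, pc, default=0):
        # Assumed combining rule: worst case over the recorded instances.
        seen = self.table.get(pc)
        return max(seen) if seen else default


def can_fetch(predictor, pc, free_entries):
    """Fetch a new thread only if its predicted need fits what is free."""
    return predictor.predict(pc) <= free_entries
```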
Multiplexing threads on a single hardware context
Observations: threads are usually short, and the number of contexts is small (2-8)
Hence frequent switching and less overlap
Multiplexing (contd.)
Larger threads can lead to:
Speculation buffer overflow
Increased dependence mis-speculation
Hence thread squashing
Each execution context can further support multiple threads (3-6)
Multiplexing: Required Hardware
Per context, per thread: program counter and register rename table
LSQ shared among threads running on 1 execution context
Multiplexing: Implementation Issues
LSQ shared but it needs to maintain loads and stores for each thread separately
Therefore, create “gaps” for yet-to-be-fetched instructions / data
If space falls short, squash subsequent thread
What if threads from one program are mapped to different contexts?
IMT searches through other contexts
A separate LSQ per thread per context would be easier, but is poor in cost and power consumption
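The "gap" reservation in a shared LSQ can be sketched as simple bookkeeping. `SharedLSQ` is a hypothetical model, and it assumes each thread's entry count is known (or predicted) when the thread is mapped to the context.

```python
class SharedLSQ:
    """Hypothetical shared LSQ of one context, tracked as entry counts."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.reserved = {}        # thread id -> entries reserved as its gap

    def free(self):
        return self.capacity - sum(self.reserved.values())

    def reserve(self, tid, entries):
        """Reserve a gap for the thread's yet-to-be-fetched loads/stores.
        Returns False when space falls short, i.e. the thread cannot run
        and would be squashed."""
        if entries > self.free():
            return False
        self.reserved[tid] = entries
        return True

    def release(self, tid):
        # Thread committed or was squashed: its gap becomes free again.
        self.reserved.pop(tid, None)
```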
Register renaming
Required because multiple threads may use the same registers
Separate rename tables:
Master Rename Table (global)
Local Rename Table (per thread)
Pre-assign table (per thread)
Register renaming: Flow
Thread invocation:
Copy from Master table into Local table (to reflect current status)
Also use the “create” and “use” masks of the thread descriptor (for dependence checking)
Before every subsequent thread invocation:
Pre-assign rename maps into the Pre-assign table
Copy from the Pre-assign table to the Master table and mark the registers as “busy”, so no successor thread can use them before the current thread writes to them.
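The table flow above can be sketched as follows. This is an illustrative model only: `Renamer`, the identity initial mapping, and the free-list order are assumptions, and the sketch does not handle running out of physical registers.

```python
class Renamer:
    """Hypothetical model of the Master / Local / Pre-assign table flow."""

    def __init__(self, num_arch_regs, num_phys_regs):
        # Master table: architectural register -> physical register.
        self.master = {r: r for r in range(num_arch_regs)}
        self.free = list(range(num_arch_regs, num_phys_regs))
        self.busy = set()         # physical registers not yet written

    def invoke_thread(self, create_mask):
        # 1. Copy Master into the thread's Local table (current status).
        local = dict(self.master)
        # 2. Pre-assign a physical register for every register the thread
        #    will write (its "create" mask).
        pre_assign = {r: self.free.pop(0) for r in sorted(create_mask)}
        # 3. Copy Pre-assign into Master and mark the registers busy, so a
        #    successor sees the new mapping but must wait for the value.
        self.master.update(pre_assign)
        self.busy.update(pre_assign.values())
        return local, pre_assign

    def write(self, phys):
        # The producing thread wrote the register; successors may read it.
        self.busy.discard(phys)
```

A successor invoked after a thread that creates register 1 gets the pre-assigned physical register in its Local table and sees it marked busy until `write` is called.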
Hiding thread startup delay
Rename tables must be set up before execution begins
Occupies table bandwidth, hence cannot be done for a number of threads in parallel
Hence overlap setting up of rename tables with previous thread’s execution
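A rough cycle count shows why the overlap helps. The sketch assumes one context running threads back to back with a fixed setup cost; the function names and the numbers in the usage note are illustrative, not measurements from the paper.

```python
def serial_time(setup, execs):
    """Every thread pays its rename-table setup before it executes."""
    return sum(setup + e for e in execs)


def overlapped_time(setup, execs):
    """Only the first setup is exposed; each later setup runs while the
    previous thread executes, so a step costs max(execution, next setup)."""
    total = setup
    for i, e in enumerate(execs):
        next_setup = setup if i + 1 < len(execs) else 0
        total += max(e, next_setup)
    return total
```

With `setup=4` and thread execution times `[10, 10, 2]`, `serial_time` gives 34 cycles and `overlapped_time` gives 26.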
Load/Store Queue
Per context
Speculative load / store: Search through current and other contexts for dependence
No searching for non-speculative loads
Searching can take time, so load-dependent instructions are scheduled accordingly
Key Results
Average improvement: 24%
Reduction in data dependence stalls
Little overhead of optimizations
Improvement not seen in all benchmark programs
• Assuming 2-3 threads per context, 6-8 LSQ entries per thread.
• Performance relative to IMT with unlimited resources
• ICOUNT: Favor the thread with the fewest instructions remaining to be executed
• Biased-ICOUNT: Favor non-speculative threads
• Worst-case resource estimation
• Reduced thread squashing
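The difference between the two policies can be sketched as a choice of sort key. `FetchCandidate` and both functions are hypothetical, and the bias is modeled in the simplest possible way (non-speculative threads always win), which is an assumption rather than IMT's exact rule.

```python
from dataclasses import dataclass


@dataclass
class FetchCandidate:
    tid: int
    in_flight: int        # instructions fetched but not yet executed
    speculative: bool


def icount_pick(cands):
    """Plain ICOUNT: fetch for the thread with the fewest pending instructions."""
    return min(cands, key=lambda c: c.in_flight).tid


def biased_icount_pick(cands):
    """Biased ICOUNT (assumed form): prefer non-speculative threads, then
    break ties by the fewest pending instructions."""
    return min(cands, key=lambda c: (c.speculative, c.in_flight)).tid
```

With a non-speculative thread holding 12 pending instructions and two speculative threads holding 3 and 5, plain ICOUNT picks the speculative thread with 3, while the biased version picks the non-speculative thread.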
• TME: Executes both paths of an unpredictable branch (but such branches uncommon)
• DMT:
– Hardware selection of threads, so it spawns threads on a backward branch or function call instead of loops.
– Also spawns threads out of order, so branch prediction accuracy is lower.
Critique
Compiler Support
Improvement in applications compiled using Multiscalar compiler
Scientific computing applications, not for desktop applications
LSQ Limitations
LSQ size decides the size of a speculative thread
Pentium 4 (without SMT): 48 loads, 24 stores
Pentium 4 HT: 24 loads, 12 stores per thread
IBM POWER5: 32 loads, 32 stores per thread
LSQ Limitations: Alternative
Cache-based approach
i.e. Partition the cache to support different versions
Extra support required, but scalable
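One way to picture "different versions" in a cache is to keep a value per thread for each address. The sketch below is a toy model under stated assumptions: `VersionedCache`, the use of thread order numbers as version tags, and `COMMITTED = -1` for architectural state are all invented for illustration and are not a description of any real cache design.

```python
COMMITTED = -1    # assumed tag for the committed (architectural) version


class VersionedCache:
    """Hypothetical cache that holds one version of a line per thread."""

    def __init__(self):
        self.lines = {}           # addr -> {thread order number: value}

    def write(self, tid, addr, value):
        self.lines.setdefault(addr, {})[tid] = value

    def read(self, tid, addr, default=0):
        # A thread sees its own version or the closest older one.
        versions = self.lines.get(addr, {})
        older = [t for t in versions if t <= tid]
        return versions[max(older)] if older else default

    def squash(self, tid):
        # Dropping a thread's versions undoes its speculative writes.
        for versions in self.lines.values():
            versions.pop(tid, None)
```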
Register file size
IMT considers register file sizes of 128 and up.
Pentium 4 (as well as HT): Register file size = 128
IBM POWER5: Register file size = 80
Searching LSQ
Since loads and stores are organized per thread, a search involves all locations of other threads.
If loads/stores were organized by address, there would be fewer entries to search.
Can make use of associativity of cache
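The cost difference between the two organizations can be sketched by counting the entries examined per search. The data layout (`lsqs` as a dict of per-thread lists of `(addr, value)` pairs) and the function names are assumptions for the example; a dictionary stands in for an associative lookup.

```python
from collections import defaultdict


def search_by_thread(lsqs, addr):
    """Per-thread organization: every entry of every thread is examined.
    Returns (matches, entries_examined)."""
    examined = 0
    matches = []
    for tid, entries in lsqs.items():
        for a, v in entries:
            examined += 1
            if a == addr:
                matches.append((tid, v))
    return matches, examined


def build_address_index(lsqs):
    """Address organization: group entries by address, as a set-associative
    structure would."""
    index = defaultdict(list)
    for tid, entries in lsqs.items():
        for a, v in entries:
            index[a].append((tid, v))
    return index


def search_by_address(index, addr):
    """Only entries stored under the same address are examined."""
    hits = index.get(addr, [])
    return list(hits), len(hits)
```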
Searching LSQ (contd.)
So how is performance still high?
Assistance from the compiler
Resource and dependency-aware fetching
Multiple threads on an execution context
Overlapping rename table creation with execution
Term project
“Cache-based throughput improvement techniques for Speculative SMT processors”
Optimizations from IMT
Increasing granularity to reduce number of thread squashes
Thank you