Multithreaded processors ppt
Goal
Utilization of coarse-grained parallelism by CMPs and multithreaded processors
Focus is on processors designed to simultaneously execute threads of the
same or different processes (explicit multithreaded processors).
Explicit multithreaded processors aim to increase the performance (lower
execution time) of a multiprogramming workload, while single-threaded/implicit
multithreaded and superscalar processors increase the performance of a single
program.
CMP – Chip Multiprocessor (two or more processors on a single chip)
Multithreaded processors – interleave the execution of different threads of control in
the same pipeline.
What is it?
●Notion of thread
● Different from a software application thread
● Coarse-grained thread-level parallelism
● Implies a separate logical address space
●Implicit multithreading
● Find multiple lines of execution in a single sequential program.
●Explicit multithreading
● Multiple PCs and register contexts
● Different from RISC processors
Why do we need it?
• ILP is limited
• Memory latency problem: cover long-latency cycles with useful work
• Divide and branch interlocking: cover idle CPU time
• Latency: primary cache miss / secondary cache miss
• Several enabled instructions from different threads may be candidates for
execution.
• Context switching in a single-threaded processor is costly!
• Utilization of idle hardware
Multithreaded Processors –
Principal Approaches
●Techniques
● Fast context switching (how?)
●Interleaved multithreading technique
● Instructions from a different thread every cycle
●Blocked multithreading technique
● A thread continues until an event forces a switch
●Simultaneous multithreading
● Simultaneously issue multiple instructions from multiple threads (superscalar)
Interleaved multithreading (fine-grained)
• Processor switches to a different thread after each instruction fetch (IF)
• Context switch after every clock cycle
• Eliminates data and control hazards between pipeline stages
• Improves overall performance (execution time of the workload)
• Requires at least as many threads as pipeline stages
• Single-thread performance degrades
• Two techniques to overcome this:
• Dependence lookahead technique (Cray MTA)
• Interleaving technique
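The fetch policy described above can be sketched as a simple round-robin schedule. This is a hypothetical illustration (the function name and thread representation are assumptions, not part of any real processor's interface): each cycle the pipeline fetches from the next thread in order, so consecutive pipeline stages never hold instructions from the same thread and the data/control hazards between them disappear.

```python
# Hypothetical sketch: fine-grained (interleaved) multithreading.
# Each cycle the pipeline fetches from the next thread round-robin,
# so no two adjacent pipeline stages belong to the same thread.

from collections import deque

def interleaved_schedule(threads, cycles):
    """Return the thread id fetched on each cycle (round-robin)."""
    order = deque(range(len(threads)))
    schedule = []
    for _ in range(cycles):
        tid = order[0]
        order.rotate(-1)          # move current thread to the back
        schedule.append(tid)
    return schedule

# With 4 threads, successive cycles fetch 0, 1, 2, 3, 0, 1, ...
print(interleaved_schedule(["t0", "t1", "t2", "t3"], 8))
```

Note how the single-thread slowdown falls out of the sketch: each thread only gets one fetch slot every N cycles, which is why at least as many threads as pipeline stages are needed to keep the pipeline full.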
Cray MTA
• Interleaved multithreaded VLIW processor
• Uses an explicit lookahead technique
• 3 bits to encode the lookahead distance
• Supports 128 distinct threads
• Hides memory latency
• VLIW
• 64-bit instructions consist of 3 operations
• <M-op, A-op, C-op> – priority from high to low
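The 3-bit lookahead field can be read as a promise about independence. A minimal sketch, assuming a simplified model of the MTA encoding (the function name and arguments are illustrative, not the real ISA): the field states how many of the following instructions from the same thread do not depend on the current one, so the hardware may keep issuing from that thread while a long-latency operation is still outstanding instead of switching away.

```python
# Hypothetical sketch of a dependence-lookahead check.
# A 3-bit field (values 0..7) on each instruction says how many of
# the FOLLOWING instructions from the same thread are independent
# of it and may issue before it completes.

def can_issue_ahead(lookahead_bits, distance):
    """True if the instruction `distance` slots ahead may issue
    while the current long-latency instruction is still pending."""
    assert 0 <= lookahead_bits <= 7   # 3 bits encode 0..7
    return distance <= lookahead_bits
```

With a lookahead of 0 the thread must stall (or be switched out) immediately; with the maximum value of 7, up to seven subsequent instructions can overlap the latency.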
Blocked multithreading (coarse-grained)
• A thread continues execution until a context switch is forced
• Single thread can proceed at full speed
• Fewer threads needed compared to interleaved multithreading
• Context switch events:
• Switch-on-load
• Switch-on-store
• Switch-on-branch
• Switch-on-cache-miss
• Switch-on-signal (interrupts)
• Conditional switch
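The switch-on-cache-miss policy above can be sketched as follows. This is a hypothetical simulation (thread traces reduced to 'op'/'miss' events, names assumed for illustration): one thread runs at full speed until it misses in the cache, and only then does the processor switch to the next ready thread.

```python
# Hypothetical sketch: blocked (coarse-grained) multithreading with
# a switch-on-cache-miss policy. A thread keeps the pipeline until
# it misses; the miss forces a context switch to the next thread.

def blocked_schedule(traces):
    """traces: per-thread list of events, each 'op' or 'miss'.
    Returns the order in which (thread, event) pairs execute."""
    cursors = [0] * len(traces)
    order = []
    tid = 0
    while any(c < len(t) for c, t in zip(cursors, traces)):
        if cursors[tid] >= len(traces[tid]):
            tid = (tid + 1) % len(traces)   # this thread is finished
            continue
        event = traces[tid][cursors[tid]]
        cursors[tid] += 1
        order.append((tid, event))
        if event == "miss":                 # context switch is forced
            tid = (tid + 1) % len(traces)
    return order

# Thread 0 runs until its miss, then thread 1 takes over:
print(blocked_schedule([["op", "op", "miss", "op"],
                        ["op", "op"]]))
```

Contrast with the interleaved scheme: here a single thread proceeds at full speed between switch events, which is why blocked multithreading needs fewer threads to hide latency.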
MIT Sparcle
• Context switch only on a remote cache miss
• Small latencies are hidden by the compiler
• Implements fast context switching
• Also uses multiple register contexts and PCs
Simultaneous multithreading(SMT)
• Mix of superscalar and multithreading techniques
• All hardware contexts are active, leading to competition for resources
• Issues multiple instructions from multiple threads each cycle
• Both TLP and ILP come into play
• Issue slots in a cycle are filled from different threads, and multiple
issue slots are filled each cycle
• Resource organization
• Resource sharing
• Resource replication
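The issue-slot filling described above can be sketched as a greedy allocator. This is a hypothetical model (function name, round-robin priority, and per-thread ready counts are all assumptions for illustration): each cycle a fixed number of issue slots is shared, and ready instructions from all active threads compete for them, so TLP fills slots that the ILP of any single thread would leave empty.

```python
# Hypothetical sketch: SMT issue-slot filling. Each cycle has a
# fixed issue width; ready instructions from ALL active threads
# compete for the slots (here, round-robin across threads).

def smt_issue(ready_per_thread, issue_width):
    """ready_per_thread: ready-instruction count for each thread.
    Returns how many instructions each thread issues this cycle."""
    issued = [0] * len(ready_per_thread)
    slots = issue_width
    progress = True
    while slots > 0 and progress:
        progress = False
        for tid, ready in enumerate(ready_per_thread):
            if slots > 0 and issued[tid] < ready:
                issued[tid] += 1    # this thread takes one slot
                slots -= 1
                progress = True
    return issued

# 8-wide issue; threads with 5, 1 and 4 ready instructions:
print(smt_issue([5, 1, 4], 8))
```

A thread with little ILP (one ready instruction) still issues it, while the remaining slots are filled from the other threads — the competition between contexts that the slides mention.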
SMT Alpha 21164 processor
• Simulations conducted on an 8-threaded, 8-issue superscalar
• 3 floating-point units and 6 integer units are assumed
• Fetch policy
• Throughput
• 6.64 IPC on the SPEC92 benchmarks
Comparison
Chip Multiprocessors
1. Multiple processors on a single chip
2. Every unit is duplicated and works independently
3. Latency problem remains in multiple issue cycles
4. Every part of a processor is duplicated, so easier to implement
Multithreaded Processors
1. Multithreading comes into play
2. Multiple threads under execution, so multiple PCs and register sets
3. Latencies arising in one stream are filled by another thread, unlike RISC architectures
4. Hardware is either shared or replicated, so more complex