High Performance Architecturesricardo/Courses/CompArchII-2008/Material/Ar… · Trace 28/200:...
High Performance Architectures
Superscalar and VLIW
VLIW (very long instruction word, 1024 bits!)
Superscalar (sequential stream of instructions)
Basic Structure of VLIW Architecture
Basic principle of VLIW
Controlled by very long instruction words
− comprising a control field for each of the execution units (EUs)
Length of instruction depends on
− the number of execution units (5 to 30 EUs)
− the code length required for controlling each EU (16 to 32 bits)
− giving 256 to 1024 bits in total
Disadvantages: on average only some of the control fields will actually be used
− wastes memory space and memory bandwidth
− e.g. Fortran code is 3 times larger for VLIW (Trace processor)
VLIW: Static scheduling of instructions
Instruction scheduling is done entirely by the [software] compiler
Lower hardware complexity translates to
− an increased clock rate
− a higher degree of parallelism (more EUs); {Can this be utilized?}
Higher software (compiler) complexity
− the compiler needs to be aware of hardware details: the number of EUs, their latencies, repetition rates, memory load-use delay, and so on
− cache misses: the compiler has to take into account the worst-case delay value
− this hardware dependency restricts the use of the same compiler for a family of VLIW processors
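To make the compiler's job concrete, here is a minimal sketch of greedy list scheduling, one common way to pack operations into long instruction words. The machine description (`UNITS`), the operation names, and the latencies are all hypothetical; they only illustrate why the scheduler must know the number of EUs and their (worst-case) latencies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    name: str
    unit: str          # kind of execution unit the operation needs
    latency: int       # assumed worst-case cycles until the result is available
    deps: tuple = ()   # names of operations whose results are needed


def schedule(ops, units):
    """Greedy list scheduling: place each op in the earliest long word where
    its operands are ready and a matching execution unit is still free."""
    ready_at = {}  # op name -> cycle in which its result becomes available
    words = []     # words[c] maps unit kind -> op names issued in cycle c
    for op in ops:  # ops are assumed to be listed in a valid dependency order
        cycle = max((ready_at[d] for d in op.deps), default=0)
        while True:
            while len(words) <= cycle:
                words.append({kind: [] for kind in units})
            if len(words[cycle][op.unit]) < units[op.unit]:
                break
            cycle += 1  # no free unit of this kind: try the next long word
        words[cycle][op.unit].append(op.name)
        ready_at[op.name] = cycle + op.latency
    return words


# Hypothetical machine: number of control fields (EUs) of each kind per word.
UNITS = {"int": 2, "mem": 1, "fp": 1}
OPS = [
    Op("load_a", "mem", 3),
    Op("load_b", "mem", 3),
    Op("add", "int", 1, ("load_a", "load_b")),
    Op("mul", "fp", 4, ("add",)),
    Op("inc", "int", 1),
]

for cycle, word in enumerate(schedule(OPS, UNITS)):
    print(cycle, word)
```

In this toy run most control fields stay empty, and two long words contain no operation at all, which is the memory-space disadvantage noted for VLIW code.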
6.2 Overview of proposed and commercial VLIW architectures
6.3 Case study: Trace 200
Trace 7/200
256-bit VLIW words
Capable of executing 7 instructions/cycle
− 4 integer operations
− 2 FP
− 1 conditional branch
Found that every 5th to 8th operation on average is a conditional branch
Uses a sophisticated branching scheme: multiway branching capability
− executing multiple paths
− each branch is assigned a priority code corresponding to its relative order
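The priority rule for multiway branching can be sketched as follows. This is only an illustration, not the Trace hardware: the function name, the tuple layout, and the convention that a lower priority value means "earlier in the original program order" are assumptions.

```python
def multiway_branch(tests, fall_through):
    """Resolve several branch tests evaluated in the same long word.

    tests: list of (priority, condition, target) tuples; by assumption a
    lower priority value means the branch came earlier in the original
    sequential order. The earliest branch whose condition holds wins.
    """
    for _priority, condition, target in sorted(tests, key=lambda t: t[0]):
        if condition:
            return target
    return fall_through  # no condition held: continue with the next word


# Three branches tested together; the one with priority 2 is the earliest
# branch whose condition is true, so its target is taken.
tests = [(2, True, "L2"), (1, False, "L1"), (3, True, "L3")]
print(multiway_branch(tests, "next"))
```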
Trace 28/200: storing long instructions
1024 bits per instruction
A number of the 32-bit fields may be empty
Storing scheme to save space
− a 32-bit mask indicating whether each subfield is empty or not
− followed by all subfields that are not empty
− still results in 3 times larger memory space required to store Fortran code (vs. VAX object code)
− very complex hardware for cache fill and refill
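The mask-based storing scheme can be sketched in a few lines. This is a minimal illustration under stated assumptions: an empty field is encoded as 0 (`NOP`), a set mask bit means "field present", and fields are stored in index order. None of these encoding details come from the Trace documentation.

```python
FIELDS = 32  # 32 subfields of 32 bits each = 1024 bits per long instruction
NOP = 0      # assumption: an empty subfield is represented by 0


def pack(fields):
    """Store a long instruction as a 32-bit mask followed by its
    non-empty subfields only."""
    assert len(fields) == FIELDS
    mask = 0
    kept = []
    for i, field in enumerate(fields):
        if field != NOP:
            mask |= 1 << i  # assumption: bit i set means subfield i is present
            kept.append(field)
    return [mask] + kept


def unpack(words):
    """Rebuild the full 32-field instruction (what cache fill has to do)."""
    mask, kept = words[0], iter(words[1:])
    return [next(kept) if (mask >> i) & 1 else NOP for i in range(FIELDS)]


# A long instruction that uses only 5 of its 32 subfields.
instruction = [NOP] * FIELDS
for slot, value in [(0, 0x11), (3, 0x22), (7, 0x33), (20, 0x44), (31, 0x55)]:
    instruction[slot] = value

stored = pack(instruction)
print(len(stored), "words stored instead of", FIELDS)
assert unpack(stored) == instruction
```

The expansion step in `unpack` hints at why the cache fill and refill hardware becomes complex: stored instructions have variable length and must be expanded back to fixed positions.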
Performance data is impressive indeed!
Trace: Performance data
Transmeta: Crusoe
Fully x86-compatible: by dynamic code translation
High performance: 700 MHz in mobile platforms
Crusoe
“Remarkably low power consumption”
− “A full day of web browsing on a single battery charge”
Playing DVD: Conventional? 105 °C vs. Crusoe: 48 °C
Intel: Itanium (VLIW) {EPIC} Processor
Superscalar Processors vs. VLIW
Superscalar Processor: Intro
Parallel issue
Parallel execution
{Hardware} dynamic instruction scheduling
Currently the predominant class of processors
− Pentium
− PowerPC
− UltraSparc
− AMD K5
− HP PA7100
− DEC α
Emergence and spread of superscalar processors
Evolution of superscalar processor
Specific tasks of superscalar processing
Parallel decoding {and dependency check}
What needs to be done
Decoding and Predecoding
Superscalar processors tend to use 2 and sometimes even 3 or more pipeline cycles for decoding and issuing instructions
>> Predecoding:
− shifts a part of the decode task up into the loading phase
− results of predecoding:
  the instruction class
  the type of resources required for the execution
  in some processors (e.g. UltraSparc), the branch target address calculation as well
− the results are stored by attaching 4 to 7 bits to each instruction
+ shortens the overall cycle time or reduces the number of cycles needed
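A toy model of predecoding is sketched below. The 32-bit instruction format (opcode in the top 6 bits), the opcode table, and the layout of the 4-bit tag are all hypothetical; real processors each use their own predecode encoding. The point is only that classification happens once, when instructions are loaded into the instruction cache, so that the decode stage can read a few tag bits instead of re-examining the opcode.

```python
INSTR_BITS = 32
TAG_BITS = 4  # assumed tag width for this sketch

# Hypothetical encodings for the tag fields.
CLASSES = {"alu": 0, "load_store": 1, "branch": 2, "fp": 3}
RESOURCES = {"int_unit": 0, "mem_port": 1, "branch_unit": 2, "fp_unit": 3}

# Hypothetical opcode table: opcode -> (instruction class, resource needed).
OPCODES = {
    0x00: ("alu", "int_unit"),
    0x23: ("load_store", "mem_port"),
    0x04: ("branch", "branch_unit"),
    0x11: ("fp", "fp_unit"),
}


def predecode(instr):
    """Classify one instruction and attach the result as extra tag bits."""
    cls, resource = OPCODES[instr >> 26]
    tag = CLASSES[cls] | (RESOURCES[resource] << 2)
    return (tag << INSTR_BITS) | instr


def fill_cache_line(instrs):
    """Predecoding is done while the line is loaded into the I-cache."""
    return [predecode(i) for i in instrs]


def instruction_class(cached_word):
    """At decode time the class is read straight from the tag bits."""
    return (cached_word >> INSTR_BITS) & 0b11


line = fill_cache_line([(0x23 << 26) | 0x1234, (0x04 << 26) | 0x10])
print([instruction_class(w) for w in line])
```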
The principle of predecoding
Number of predecode bits used
Number of read and write ports
how many instructions may be written into (input ports) or read out from (output ports) a particular shelving buffer in a cycle
depends on whether individual, group, or central reservation stations are used
Operand fetch during instruction issue
Operand fetch during instruction dispatch
Scheme for checking the availability of operands: The principle of scoreboarding
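As a rough illustration of the scoreboarding principle, the sketch below keeps one status bit per register: the bit is cleared when an instruction that will write the register is issued, and set again when the result is written back. The class and method names are made up for this example, and the model ignores everything else a real scoreboard tracks.

```python
class Scoreboard:
    """One-bit-per-register scoreboard (simplified, illustrative model)."""

    def __init__(self, n_regs):
        self.valid = [True] * n_regs  # True: register holds an up-to-date value

    def operands_available(self, sources):
        """An instruction may proceed only if all its source registers are valid."""
        return all(self.valid[r] for r in sources)

    def issue(self, dest):
        """Mark the destination as pending until its result is produced."""
        self.valid[dest] = False

    def write_back(self, dest):
        """The result has been written: dependent instructions may now read it."""
        self.valid[dest] = True


sb = Scoreboard(8)
sb.issue(3)                            # e.g. r3 = r1 * r2 is in flight
print(sb.operands_available([3, 4]))   # r3 not yet available
sb.write_back(3)
print(sb.operands_available([3, 4]))   # now both operands are available
```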
Schemes for checking the availability of operands
Operands fetched during dispatch or during issue
Use of multiple buses for updating multiple individual reservation stations
Internal data paths of the PowerPC 604 42
Implementation of superscalar CISC processors using a superscalar RISC core
CISC instructions are first converted into RISC-like instructions <during decoding>.
− Simple CISC register-to-register instructions are converted to a single RISC operation (1-to-1)
− CISC ALU instructions referring to memory are converted to two or more RISC operations (1-to-(2 to 4))
  SUB EAX, [EDI]
  is converted to e.g.
  MOV EBX, [EDI]
  SUB EAX, EBX
− More complex CISC instructions are converted to long sequences of RISC operations (1-to-(more than 4))
On average one CISC instruction is converted to 1.5 to 2 RISC operations
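The first two conversion rules can be mimicked with a small string-level sketch. It handles only two-operand instructions of the form `OP dst, src`, and it uses EBX as the temporary register simply because the slide's example does; the function name and the `temp` parameter are hypothetical, and a real decoder works on binary encodings rather than text.

```python
def convert(instr, temp="EBX"):
    """Convert one two-operand CISC instruction into RISC-like operations.

    A register-to-register instruction maps 1-to-1; an ALU instruction with
    a memory source operand (written in brackets) becomes a load into a
    temporary register followed by a register-to-register operation.
    """
    op, operands = instr.split(None, 1)
    dst, src = [part.strip() for part in operands.split(",")]
    if src.startswith("["):
        return [f"MOV {temp}, {src}", f"{op} {dst}, {temp}"]
    return [f"{op} {dst}, {src}"]


print(convert("SUB EAX, [EDI]"))  # two RISC-like operations
print(convert("ADD EAX, ECX"))    # a single RISC-like operation
```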
The principle of superscalar CISC execution using a superscalar RISC core
PentiumPro: Decoding/converting CISC instructions to RISC operations (done in program order)
Case Studies: R10000
Core part of the microarchitecture of the R10000 67