Chapter 4 Instruction Set Examples
![Page 1: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/1.jpg)
Chapter 4: Instruction Set Examples
Advanced Computer Architecture
COE 501
![Page 2: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/2.jpg)
DLX Architecture
• Introduced by Hennessy and Patterson in 1990
– Derived from many different instruction set architectures from MIPS, Sun, IBM, Intel, HP, AMD, etc.
• DLX is a typical RISC architecture
– 32-bit fixed-length instructions
– 3 instruction formats
– Load/store architecture
– Simple branch conditions (no condition codes)
• DLX registers
– 32 32-bit general-purpose registers (R0 = 0)
– 32 32-bit (or 16 64-bit) floating-point registers
– Special-purpose registers (e.g., FP status and PC)
![Page 3: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/3.jpg)
DLX Design Decisions
• DLX is based on the following design decisions
– Use general purpose registers with a load-store architecture
– Support commonly used addressing modes
» displacement, immediate, and register deferred
– Support simple instructions that occur frequently
» load, store, add, subtract, move, and, shift, compare equal, branch, jump, call, and return
– Support commonly required data sizes
» 8 (byte), 16 (half word), and 32-bit (word) integers
» 32 (float) and 64-bit (double) floating point
– Use fixed length instructions that are easy to decode
– Provide plenty of general purpose registers and separate floating point registers
![Page 4: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/4.jpg)
DLX Instruction Formats
• Register-Immediate (I-type), e.g., SUB R1, R2, #3
– Fields: Op (bits 31-26), rs1 (25-21), rd (20-16), immediate (15-0)
– Used for ALU imm. operations, loads and stores, conditional branch, jump (and link) register
• Register-Register (R-type), e.g., ADD R1, R2, R3
– Fields: Op (bits 31-26), rs1 (25-21), rs2 (20-16), rd (15-11), func (low-order bits; the diagram marks field boundaries at bits 10/11 and 5/6)
– Used for ALU reg. operations, read/write special registers, and moves
• Jump / Call (J-type), e.g., JUMP end
– Fields: Op (bits 31-26), offset added to PC (25-0)
– Used for jump, jump and link, trap, and return from exception
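The field layouts above can be sketched as bit-packing functions. This is a minimal Python sketch, not the official DLX encoding: the function names are invented here, the opcode and func numbers passed in are placeholders, and placing func in bits 5-0 (leaving bits 10-6 zero) is an assumption read off the tick marks in the slide diagram.

```python
def encode_i_type(op, rs1, rd, imm):
    """I-type: Op[31-26] rs1[25-21] rd[20-16] immediate[15-0]."""
    return (op & 0x3F) << 26 | (rs1 & 0x1F) << 21 | (rd & 0x1F) << 16 | (imm & 0xFFFF)


def encode_r_type(op, rs1, rs2, rd, func):
    """R-type: Op[31-26] rs1[25-21] rs2[20-16] rd[15-11] func[5-0].

    Assumption: func occupies the low 6 bits and bits 10-6 stay zero.
    """
    return ((op & 0x3F) << 26 | (rs1 & 0x1F) << 21 | (rs2 & 0x1F) << 16
            | (rd & 0x1F) << 11 | (func & 0x3F))


def encode_j_type(op, offset):
    """J-type: Op[31-26] plus a 26-bit offset that is added to the PC."""
    return (op & 0x3F) << 26 | (offset & 0x3FFFFFF)


# Placeholder opcode/func numbers, chosen only for illustration.
word_i = encode_i_type(op=1, rs1=2, rd=1, imm=3)
word_r = encode_r_type(op=0, rs1=2, rs2=3, rd=1, func=0x20)
word_j = encode_j_type(op=2, offset=-4)  # negative offsets wrap into 26 bits
```

Because every format keeps Op in bits 31-26, a decoder can inspect those six bits first and then choose how to split the remaining 26, which is part of what makes fixed-length formats easy to decode.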
![Page 5: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/5.jpg)
Examples of DLX Instructions
• Data Transfer
– LW R1, 30(R2)    Regs[R1] <- Mem[30 + Regs[R2]]
– SD F0, 40(R3)    Mem[40 + Regs[R3]] <- Regs[F0]; Mem[44 + Regs[R3]] <- Regs[F1]
– Loads and stores are also available for bytes, half words, and floats
– How would you perform a register move? A no-op?
• Arithmetic and Logic
– SUB R1, R2, R3    Regs[R1] <- Regs[R2] - Regs[R3]
– SLLI R1, R2, #5    Regs[R1] <- Regs[R2] << 5
– LHI R1, #42    Regs[R1] <- 42 ## 0^16 (42 concatenated with 16 zero bits)
– SLT R1, R2, R3    if (Regs[R2] < Regs[R3]) Regs[R1] <- 1 else Regs[R1] <- 0
– How would you load a 32-bit immediate into a register?
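One common answer to the 32-bit immediate question is to pair LHI with an OR-immediate. The sketch below models that pairing in Python; the helper names are invented for illustration, and it assumes the logical immediate is zero-extended (as in MIPS), which the slides do not state.

```python
MASK32 = 0xFFFFFFFF


def lhi(imm16):
    """LHI: put imm16 in the upper half of the register, zero the lower 16 bits."""
    return ((imm16 & 0xFFFF) << 16) & MASK32


def ori(reg, imm16):
    """ORI: OR a 16-bit immediate into a register (assumed zero-extended)."""
    return (reg | (imm16 & 0xFFFF)) & MASK32


def load_imm32(value):
    """Build a 32-bit constant in two instructions."""
    r1 = lhi(value >> 16)        # LHI R1, #upper16
    r1 = ori(r1, value)          # ORI R1, R1, #lower16
    return r1
```

For example, `load_imm32(0x12345678)` first produces `0x12340000` and then fills in the low half, so two 16-bit immediates cover the full word.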
![Page 6: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/6.jpg)
Examples of DLX Instructions
• Control
– JALR R2    Regs[R31] <- PC + 4; PC <- Regs[R2]
– JR R3    PC <- Regs[R3]
– BNEZ R4, name    if (Regs[R4] != 0) PC <- name else PC <- PC + 4
– How would you implement a subroutine call and return?
• Floating Point
– MULF F1, F2, F3    Regs[F1] <- Regs[F2] * Regs[F3]
– ADDD F0, F2, F4    Regs[F0&F1] <- Regs[F2&F3] + Regs[F4&F5]
– What would be difficult about adding a floating-point multiply-and-add instruction to DLX?
![Page 7: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/7.jpg)
DLX Instruction Set (Appendix C.3)
• Data transfer
– Load/store word
– Load/store halfword or byte (signed/unsigned loads)
– Load/store floating point single/double
– Register moves
• Arithmetic and Logic
– Add/subtract (signed or unsigned, reg. or imm.)
– Multiply/divide (signed or unsigned, operands in FP reg.)
– And, or, xor (reg. or imm.)
– Load high word (loads upper half of a reg. with imm.)
– Shifts (LL, RL, RA) (reg. or imm.)
– Set conditionals (LT, GT, LE, GE, EQ, NE) (reg. or imm.)
![Page 8: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/8.jpg)
DLX Instruction Set
• Control
– Conditional branch on register (compare with zero)
– Conditional branch on FP status bit (bit true or false)
– Jump, jump register (26-bit imm. or reg.)
– Jump and link, jump and link register (26-bit imm. or reg.)
– Trap, return from exception (trap to and return from O.S.)
• Floating Point
– Add, subtract, multiply, divide (single or double)
– FP converts (convert between single, double, and integer)
– FP compares (single or double, sets bit in FP status)
![Page 9: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/9.jpg)
DLX Instruction Usage
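| Instr. | gcc | espresso | spice | nasa7 |
|---|---|---|---|---|
| int load/store | 43% | 29% | 23% | 1% |
| int. arith | 26% | 30% | 33% | 22% |
| control | 17% | 13% | 11% | 4% |
| logical | 10% | 23% | 5% | 1% |
| fp load/store | 0% | 0% | 8% | 33% |
| fp arith | 0% | 0% | 5% | 39% |
| misc | 4% | 0% | 5% | 0% |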
![Page 10: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/10.jpg)
DLX Summary
• Simple load/store architecture
– Only accesses memory on loads/stores
– All other operations use registers and immediates
• Designed for pipeline efficiency
– Fixed-length instruction encoding
– Simple instructions
• Easy to compile to
– Simple, frequently used instructions
– Orthogonal instruction set
– Few addressing modes
• Reduces execution time by
– reducing CPI
– reducing clock cycle time (i.e., allowing a higher clock rate)
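The execution-time claim can be read against the standard CPU performance equation. The symbols below (IC for instruction count, T_cycle for clock cycle time, f_clock for clock rate) are names introduced here, not taken from the slides.

```latex
\text{CPU time} = \text{IC} \times \text{CPI} \times T_{\text{cycle}}
               = \frac{\text{IC} \times \text{CPI}}{f_{\text{clock}}}
```

A simple RISC design may raise IC somewhat, so the argument rests on CPI and T_cycle shrinking by more than IC grows.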
![Page 11: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/11.jpg)
History of the Intel 80x86
• 1971: Intel invents the microprocessor (4004)
• 1975: 8080 introduced
– 8-bit microprocessor, used in the Altair personal computer
– Accumulator machine
• 1978: 8086 introduced
– 16-bit microprocessor
– Accumulator plus dedicated registers
• 1980: IBM selects 8088 as basis for IBM PC
– 8088 is 8-bit external bus version of 8086
• 1980: 8087 floating point coprocessor
– Adds 60 floating point instructions
– 80-bit floating point registers
– Uses hybrid stack/register scheme
![Page 12: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/12.jpg)
History of the Intel 80x86
• 1982: 80286 introduced
– 24-bit address
– Memory mapping & protection
• 1985: 80386 introduced
– 32-bit address, 32-bit GP registers
– Support for multitasking
• 1989: 80486 introduced
– Built-in math coprocessor
– More powerful cache and instruction pipelining
• 1992: Pentium introduced
– Superscalar processor (multiple instructions per cycle)
• 1995: Pentium Pro introduced
– More aggressive superscalar with register renaming, branch prediction, and speculative execution
![Page 13: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/13.jpg)
History of the Intel 80x86
• Pentium II introduced
– Incorporated MMX Technology
– 57 new instructions for processing video, audio, and graphics
– Supports single instructions that operate on multiple data (SIMD)
– 8 data of 8 bits, 4 data of 16 bits, or 2 data of 32 bits
• Pentium III introduced
– Features Internet Streaming SIMD Extensions
– Improves performance of 3D graphics and internet applications
– Allows one instruction to execute on 4 pairs of 32-bit floating point data
• Itanium introduced
– Release of IA-64 architecture
– New 64-bit RISC-like architecture
– 128-bit instruction bundles (3 instructions per bundle)
• The design of the Intel architecture was driven by the desire for backward compatibility
– Highly irregular architecture
– Over 50 million sold per year
![Page 14: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/14.jpg)
Intel 80x86 Integer Registers
![Page 15: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/15.jpg)
X86 Operand Types
• x86 instructions typically have two operands, where one operand is both a source and a destination operand.
• Possible combinations include

| Source/destination type | Second source type |
|---|---|
| Register | Register |
| Register | Immediate |
| Register | Memory |
| Memory | Register |
| Memory | Immediate |

• No memory-memory or immediate-immediate
• Immediates can be 8, 16, or 32 bits
![Page 16: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/16.jpg)
Intel 80x86 Floating Point Registers
• Operations on the top of stack and one register within the stack
![Page 17: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/17.jpg)
Usage of Intel 80x86 Floating Point Registers
| Second operand | NASA7 | Spice |
|---|---|---|
| Stack (2nd operand ST(1)) | 0.3% | 2.0% |
| Register (2nd operand ST(i), i>1) | 23.3% | 8.3% |
| Memory | 76.3% | 89.7% |

• Above are dynamic instruction percentages (i.e., based on counts of executed instructions)
• Stack unused by Solaris compilers for fastest execution
![Page 18: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/18.jpg)
80x86 Instruction Format
• Instruction sizes vary from 1 to 17 bytes
![Page 19: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/19.jpg)
80x86 Instructions
• Data movement (move, push, pop)
• Arithmetic and logic (logic ops, tests CCs, shifts, integer and decimal arithmetic)
• Control flow (branches, jumps, calls, returns)
• String instructions (move and compare)
• FP data movement (load, load const., store)
• Arithmetic instructions (add, subtract, multiply, divide, square root, absolute value)
• Comparisons (can send result to ALU)
• Transcendental functions (sin, cos, log, etc.)
![Page 20: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/20.jpg)
80x86 Addressing Mode Usage for 32-bit Mode
| Addressing Mode | Gcc | Espr. | NASA7 | Spice | Avg. |
|---|---|---|---|---|---|
| Register indirect | 10% | 10% | 6% | 2% | 7% |
| Base + 8-bit disp | 46% | 43% | 32% | 4% | 31% |
| Base + 32-bit disp | 2% | 0% | 24% | 10% | 9% |
| Indexed | 1% | 0% | 1% | 0% | 1% |
| Based indexed + 8b disp | 0% | 0% | 4% | 0% | 1% |
| Based indexed + 32b disp | 0% | 0% | 0% | 0% | 0% |
| Base + Scaled Indexed | 12% | 31% | 9% | 0% | 13% |
| Base + Scaled Index + 8b disp | 2% | 1% | 2% | 0% | 1% |
| Base + Scaled Index + 32b disp | 6% | 2% | 2% | 33% | 11% |
| 32-bit Direct | 19% | 12% | 20% | 51% | 26% |
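All of the modes in the table are special cases of one address computation: base + index × scale + displacement. The Python sketch below illustrates that idea; the function name and keyword arguments are invented here, and register contents are passed in as plain integers rather than read from a modeled register file.

```python
def effective_address(base=0, index=0, scale=1, disp=0):
    """32-bit x86-style effective address: base + index*scale + disp."""
    if scale not in (1, 2, 4, 8):
        raise ValueError("x86 scale factor must be 1, 2, 4, or 8")
    return (base + index * scale + disp) & 0xFFFFFFFF


# A few rows of the table expressed through the same function:
reg_indirect = effective_address(base=0x1000)                          # Register indirect
base_disp8 = effective_address(base=0x1000, disp=8)                    # Base + 8-bit disp
scaled = effective_address(base=0x1000, index=3, scale=4, disp=0x20)   # Base + Scaled Index + disp
direct = effective_address(disp=0x8000)                                # 32-bit Direct
```

Seen this way, the variety of modes mostly affects instruction encoding length, since each optional component adds bytes to the instruction.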
![Page 21: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/21.jpg)
80x86 Length Distribution
[Figure: bar chart of the % of instructions at each length, for lengths of 1 to 11 bytes (scale 0% to 30%); series: Espresso, Gcc, Spice, NASA7]
![Page 22: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/22.jpg)
Instruction Counts: 80x86 vs. DLX
| SPEC pgm | x86 | DLX | DLX ÷ x86 |
|---|---|---|---|
| gcc | 3,771,327,742 | 3,892,063,460 | 1.03 |
| espresso | 2,216,423,413 | 2,801,294,286 | 1.26 |
| spice | 15,257,026,309 | 16,965,928,788 | 1.11 |
| nasa7 | 15,603,040,963 | 6,118,740,321 | 0.39 |

• DLX tends to execute more instructions for integer programs, while the 80x86 executes more instructions for floating point programs
• 80x86 performs many more data transfers
– Two to four times more for floating point programs
– About 1.25 times more for integer programs
– Why so many more for floating point?
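The ratio column can be reproduced directly from the instruction counts. The short Python sketch below does only that arithmetic; the dictionary layout and variable names are choices made here.

```python
# (x86 instruction count, DLX instruction count) per SPEC program, from the table.
counts = {
    "gcc": (3_771_327_742, 3_892_063_460),
    "espresso": (2_216_423_413, 2_801_294_286),
    "spice": (15_257_026_309, 16_965_928_788),
    "nasa7": (15_603_040_963, 6_118_740_321),
}

# DLX count divided by x86 count, rounded to two decimals as on the slide.
ratios = {name: round(dlx / x86, 2) for name, (x86, dlx) in counts.items()}

for name, ratio in ratios.items():
    print(f"{name:10s} DLX/x86 = {ratio:.2f}")
```

A ratio above 1 means DLX executed more instructions than the 80x86 for that program; instruction count alone does not determine run time, since CPI and clock cycle time also differ.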
![Page 23: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/23.jpg)
Comparison
• How would you expect the x86 and MIPS architectures to compare on the following?
– CPI on SPEC benchmarks
– Ease of design and implementation
– Ease of writing assembly language & compilers
– Code density
– Overall performance
• What other advantages/disadvantages are there to the two architectures?
![Page 24: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/24.jpg)
Graphics and Multimedia Instruction Set Extensions
• Several companies have extended their computers’ instruction sets to better support graphics and multimedia applications.
– Intel’s MMX Technology
– Intel’s Internet Streaming SIMD Extensions
– AMD’s 3DNow! Technology
– Sun’s Visual Instruction Set
– Motorola’s and IBM’s AltiVec Technology
• These extensions improve the performance of
– Computer-aided design
– Internet applications
– Computer visualization
– Video games
– Speech recognition
![Page 25: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/25.jpg)
MMX Data Types
• MMX Technology supports operations on the following 64-bit integer data types.
– Packed byte (eight 8-bit elements)
– Packed word (four 16-bit elements)
– Packed double word (two 32-bit elements)
– Packed quad word (one 64-bit element)
![Page 26: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/26.jpg)
SIMD Operations
• MMX Technology allows a Single Instruction to work on Multiple pieces of Data (SIMD).
PADD[W]: Packed add word
• In the above example, 4 parallel adds are performed on 16-bit elements.
• Most MMX instructions only require a single cycle.
A3 A2 A1 A0
B3 B2 B1 B0
A3+B3 A2+B2 A1+B1 A0+B0
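The packed add in the diagram can be modeled in software to show that each 16-bit lane is added independently, with no carry crossing into the next lane. This is a Python sketch of the behavior, not of how the hardware is built; the helper names are invented here.

```python
def split_words(value64):
    """Split a 64-bit value into four 16-bit lanes, lowest lane first."""
    return [(value64 >> (16 * i)) & 0xFFFF for i in range(4)]


def join_words(words):
    """Pack four 16-bit lanes (lowest first) back into a 64-bit value."""
    result = 0
    for i, w in enumerate(words):
        result |= (w & 0xFFFF) << (16 * i)
    return result


def paddw(a, b):
    """Packed add word with wrap-around: lanes are summed independently."""
    return join_words([(x + y) & 0xFFFF
                       for x, y in zip(split_words(a), split_words(b))])


a = 0x0001_0002_0003_FFFF   # lanes A3..A0 = 1, 2, 3, 0xFFFF
b = 0x0001_0001_0001_0001   # lanes B3..B0 = 1, 1, 1, 1
total = paddw(a, b)         # lane 0 wraps to 0 without disturbing lane 1
```

An ordinary 64-bit add of the same operands would carry out of lane 0 into lane 1, which is exactly what the packed instruction suppresses.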
![Page 27: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/27.jpg)
Saturating Arithmetic
• Both wrap-around and saturating adds are supported.
• With saturating arithmetic, results that overflow are clamped to the largest representable value instead of wrapping around.
PADD[W]: Packed wrap-around add
PADDUS[W]: Packed saturating add (unsigned)
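The difference between the two adds is easiest to see on lanes that overflow. The Python sketch below models unsigned 16-bit lanes as a list of integers; the function names are invented for illustration.

```python
def add_wrap_u16(x, y):
    """Wrap-around add: keep only the low 16 bits of the sum."""
    return (x + y) & 0xFFFF


def add_sat_u16(x, y):
    """Unsigned saturating add: clamp the sum at the largest 16-bit value."""
    return min(x + y, 0xFFFF)


def packed(op, a, b):
    """Apply a lane-wise operation to two lists of lanes."""
    return [op(x, y) for x, y in zip(a, b)]


a = [0xFFFF, 0x7000, 1, 0]
b = [0x0001, 0x9000, 2, 0]
wrapped = packed(add_wrap_u16, a, b)     # overflowing lanes wrap to small values
saturated = packed(add_sat_u16, a, b)    # overflowing lanes stick at 0xFFFF
```

For pixel data, saturation is usually the desired behavior: brightening an already bright pixel should leave it white rather than wrap it to black.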
![Page 28: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/28.jpg)
Pack and Unpack Instructions
• Pack and unpack instructions provide conversion between standard data types and packed data types
PACKSS[DW]: Pack with signed saturation, packed double word to packed word
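A pack with signed saturation can be sketched as follows. This Python model treats each operand as a list of two signed 32-bit values (lowest lane first) and returns four signed 16-bit values; the function names and the list representation are choices made here.

```python
def saturate_s16(x):
    """Clamp a signed value into the signed 16-bit range."""
    return max(-32768, min(32767, x))


def packssdw(dst_dwords, src_dwords):
    """Narrow two pairs of signed double words into four signed words.

    The destination operand supplies the two low result lanes and the
    source operand the two high lanes.
    """
    return [saturate_s16(x) for x in dst_dwords + src_dwords]


packed_words = packssdw([100000, -5], [-100000, 7])
```

Values that fit in 16 bits pass through unchanged, while out-of-range values are clamped to 32767 or -32768 rather than truncated.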
![Page 29: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/29.jpg)
Multiply-Add Operations
• Many graphics applications require multiply-accumulate operations
– Vector Dot Products
– Matrix Multiplies
– Fast Fourier Transforms (FFTs)
– Filter implementations
PMADDWD: Packed multiply-add word to double
![Page 30: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/30.jpg)
Vector Dot Product
• A dot product on an 8-element vector can be performed using 9 MMX instructions
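• Without MMX, 40 instructions are required
[Figure: the partial sums a0*c0 + ... + a3*c3 and a4*c4 + ... + a7*c7 are combined into the final result a0*c0 + ... + a7*c7; the diagram also shows two lanes holding 0]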
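The multiply-add step behind the dot product can be sketched in Python. The model below is a simplification under stated assumptions: lanes are plain Python integers, 32-bit overflow is ignored, the function names are invented, and which elements end up in which partial sum depends on how the data is laid out in the registers, so the grouping may differ from the slide's diagram.

```python
def pmaddwd(a, c):
    """Multiply four 16-bit lanes pairwise and add adjacent products.

    Returns two 32-bit lanes: [a0*c0 + a1*c1, a2*c2 + a3*c3].
    """
    return [a[0] * c[0] + a[1] * c[1], a[2] * c[2] + a[3] * c[3]]


def dot8(a, c):
    """Dot product of two 8-element vectors using two multiply-adds."""
    lo = pmaddwd(a[:4], c[:4])                 # first four elements
    hi = pmaddwd(a[4:], c[4:])                 # last four elements
    sums = [lo[0] + hi[0], lo[1] + hi[1]]      # packed double word add
    return sums[0] + sums[1]                   # fold the two lanes together


result = dot8([1, 2, 3, 4, 5, 6, 7, 8], [8, 7, 6, 5, 4, 3, 2, 1])
```

Each multiply-add covers four products and two additions, which is where most of the instruction-count saving over scalar code comes from.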
![Page 31: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/31.jpg)
Packed Compare Instructions
• Packed compare instructions allow a bit mask to be set or cleared
• This is useful when images with certain qualities need to be extracted.
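A packed compare produces an all-ones or all-zeros mask per lane, which can then drive a branch-free select. The Python sketch below shows the idea on 16-bit lanes; the function names and the key-color example are illustrative assumptions, not taken from the slides.

```python
def pcmpeqw(a, b):
    """Packed compare equal: 0xFFFF where lanes match, 0 elsewhere."""
    return [0xFFFF if x == y else 0 for x, y in zip(a, b)]


def select(mask, if_set, if_clear):
    """Pick a lane from if_set where the mask is all ones, else from if_clear."""
    return [(m & s) | (~m & 0xFFFF & c)
            for m, s, c in zip(mask, if_set, if_clear)]


KEY = 0x07E0                       # hypothetical "transparent" pixel value
foreground = [0x07E0, 0x1234, 0x07E0, 0x5678]
background = [0x000A, 0x000B, 0x000C, 0x000D]

mask = pcmpeqw(foreground, [KEY] * 4)
composited = select(mask, background, foreground)   # show background through key pixels
```

The select step corresponds to a short AND / AND-NOT / OR sequence, so no per-pixel branch is needed.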
![Page 32: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/32.jpg)
MMX Instructions
• MMX Technology adds 57 new instructions to the x86 architecture.
• Some of these instructions include
– PADD(b, w, d): Packed addition
– PSUB(b, w, d): Packed subtraction
– PCMPEQ(b, w, d): Packed compare equal
– PMULLw: Packed word multiply low
– PMULHw: Packed word multiply high
– PMADDwd: Packed word multiply-add
– PSRL(w, d, q): Packed shift right logical
– PACKSS(wb, dw): Pack data
– PUNPCK(bw, wd, dq): Unpack data
– PAND, POR, PXOR: Packed logical operations
![Page 33: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/33.jpg)
Performance Comparison
• The following shows the performance of Pentium processors with and without MMX Technology

| Application | Without MMX | With MMX | Speedup |
|---|---|---|---|
| Video | 155.52 | 268.70 | 1.72 |
| Image Processing | 159.03 | 743.90 | 4.67 |
| 3D geometry | 161.52 | 166.44 | 1.03 |
| Audio | 149.80 | 318.90 | 2.13 |
| Overall | 156.00 | 255.43 | 1.64 |
![Page 34: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/34.jpg)
MMX Technology Summary
• MMX technology extends the Intel x86 architecture to improve the performance of multimedia and graphics applications.
• It provides a speedup of 1.5 to 2.0 for certain applications.
• MMX instructions are hand-coded in assembly or implemented as libraries to achieve high performance.
• MMX data types use the x86 floating point registers to avoid adding state to the processor.
– Makes it easy to handle context switches
– Makes it hard to perform MMX and floating point instructions at the same time
• MMX increases the chip area by only about 5%.
![Page 35: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/35.jpg)
Questions on MMX
• What are the strengths and weaknesses of MMX Technology?
• How could MMX Technology potentially be improved?
• How did the developers of MMX preserve backward compatibility with the x86 architecture?
– Why was this important?
– What are the disadvantages of this approach?
• What restrictions/limitations are there on the use of MMX Technology?
![Page 36: Chapter 4 Instruction Set Examples](https://reader033.fdocuments.net/reader033/viewer/2022061610/56815839550346895dc597d6/html5/thumbnails/36.jpg)
Internet Streaming SIMD Extensions
• Intel’s Internet Streaming SIMD Extensions (ISSE)
– Help improve the performance of video and 3D applications
– Are designed for streaming data, which is used once and then discarded
– Add 70 new instructions beyond MMX Technology
– Add new 128-bit registers
– Provide the ability to perform parallel floating point operations
» Four parallel operations on 32-bit numbers
» Reciprocal and reciprocal square root instructions (normalization)
» Packed average instruction (motion compensation)
– Provide data prefetch instructions
– Make certain applications 1.5 to 2.0 times faster