The performance Improving of Microprocessor World
By Ming-Haw Jing
• The current status of microprocessors
• The instruction set architecture
• The concept of pipelining
• The structure of pipelining
• Solving the problems of pipelining
• The computer system architecture
• The metrics of the computer
Original
Big Fishes Eating Little Fishes
1988 Computer Food Chain
[Figure: food chain of fish labeled Mainframe, Supercomputer, Mini-supercomputer, Mini-computer, Work-station, PC, and Massively Parallel Processors]
Technology Trends (Summary)
         Capacity          Speed (latency)
Logic    2x in 3 years     2x in 3 years
DRAM     4x in 3 years     2x in 10 years
Disk     4x in 3 years     2x in 10 years
Performance and Technology Trends
[Figure: performance (log scale, 0.1 to 1000) against year (1965 to 2000) for Microprocessors, Minicomputers, Mainframes, and Supercomputers]
1998 Computer Food Chain
Now who is eating whom?
[Figure: food chain of fish labeled Mainframe, Supercomputer, Mini-supercomputer, Massively Parallel Processors, Mini-computer, Work-station, PC, and Server]
Instruction Set Architecture (ISA)
Where Is The Instruction Set?
[Figure: the instruction set sits between software and hardware]
Levels, from top to bottom:
• Application
• Programming Language
• Compiler
• ISA
• Datapath, Control
• Function Units
• Transistors, Wires, Pins
Metrics shown alongside the levels:
• Answers per month
• Operations per second
• (millions) of Instructions per second: MIPS
• (millions) of (FP) operations per second: MFLOP/s
• Megabytes per second
• Cycles per second (clock rate)
Evolution of Instruction Sets
Single Accumulator (EDSAC 1950)
Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model from Implementation
• High-level Language Based (B5000 1963)
• Concept of a Family (IBM 360 1964)
General Purpose Register Machines
• Complex Instruction Sets (Vax, Intel 432 1977-80)
• Load/Store Architecture (CDC 6600, Cray 1 1963-76)
RISC (Mips, Sparc, HP-PA, IBM RS6000, PowerPC . . . 1987)
VLIW/EPIC (IA-64 . . . 1999)
Instruction Format of MIPS
Register-Register
  bits:   31-26   25-21   20-16   15-11   10-6   5-0
  field:  Op      Rs1     Rs2     Rd             Opx
Register-Immediate
  bits:   31-26   25-21   20-16   15-0
  field:  Op      Rs1     Rd      immediate
Branch
  bits:   31-26   25-21   20-16     15-0
  field:  Op      Rs1     Rs2/Opx   immediate
Jump / Call
  bits:   31-26   25-0
  field:  Op      target
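The bit positions on this slide (Op in bits 31-26, Rs1 in 25-21, a register field in 20-16, then either Rd and Opx or a 16-bit immediate, or a 26-bit jump target) can be turned into a small field extractor. The following is a minimal Python sketch, not something taken from the slides: the function names are hypothetical, and the sample words in the usage note are built from arbitrary field values rather than real instructions.

```python
def bits(word, hi, lo):
    """Return bits hi..lo (inclusive, bit 0 is least significant) of a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode_register_register(word):
    # Op | Rs1 | Rs2 | Rd | (bits 10-6 are unlabeled on the slide) | Opx
    return {
        "op": bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rs2": bits(word, 20, 16),
        "rd": bits(word, 15, 11),
        "opx": bits(word, 5, 0),
    }


def decode_register_immediate(word):
    # Op | Rs1 | Rd | immediate
    return {
        "op": bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rd": bits(word, 20, 16),
        "immediate": bits(word, 15, 0),
    }


def decode_branch(word):
    # Op | Rs1 | Rs2/Opx | immediate
    return {
        "op": bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rs2_opx": bits(word, 20, 16),
        "immediate": bits(word, 15, 0),
    }


def decode_jump(word):
    # Op | target
    return {"op": bits(word, 31, 26), "target": bits(word, 25, 0)}
```

For example, a word assembled as `(0x23 << 26) | (1 << 21) | (2 << 16) | 0xFFFC` decodes in the Register-Immediate format to Op 0x23, Rs1 1, Rd 2, and immediate 0xFFFC. The immediate is returned unsigned here; sign extension is left out of the sketch.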
Pipelining is Natural!
• Laundry Example
• Ann, Brian, Cathy, Dave each have one load of clothes to wash, dry, fold, and put away
• Washer takes 30 minutes
• Dryer takes 30 minutes
• "Folder" takes 30 minutes
• "Stasher" takes 30 minutes to put clothes into drawers
A B C D
Sequential Laundry
• Sequential laundry takes 8 hours for 4 loads
• If they learned pipelining, how long would laundry take?
[Figure: task order A, B, C, D against time from 6 PM to 2 AM; each load runs its four 30-minute steps back to back before the next load starts, 16 steps in total]
Pipelined Laundry: Start work ASAP
• Pipelined laundry takes 3.5 hours for 4 loads!
[Figure: task order A, B, C, D against time from 6 PM to 2 AM; each load starts 30 minutes after the previous one, so all four finish within seven 30-minute slots]
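The 8-hour and 3.5-hour figures follow from simple arithmetic: with four loads, four steps, and 30 minutes per step, sequential laundry takes loads × steps slots, while pipelined laundry takes steps + loads - 1 slots. A minimal Python sketch of that calculation, with hypothetical function names and the assumption that every step takes the same time:

```python
def sequential_minutes(loads, steps, step_minutes):
    """Each load finishes all of its steps before the next load starts."""
    return loads * steps * step_minutes


def pipelined_minutes(loads, steps, step_minutes):
    """A new load starts every step; the last load still needs all of its steps."""
    return (steps + loads - 1) * step_minutes


seq = sequential_minutes(loads=4, steps=4, step_minutes=30)   # 480 minutes = 8 hours
pipe = pipelined_minutes(loads=4, steps=4, step_minutes=30)   # 210 minutes = 3.5 hours
speedup = seq / pipe
```

With these numbers the speedup is 480 / 210, a little under 2.3, which is less than the number of steps because the pipeline takes time to fill and drain when there are only four loads.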
Pipelined Datapath of CPU
[Figure: datapath divided into five stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back]
The Five Stages of Load
• Ifetch: Instruction Fetch. Fetch the instruction from the Instruction Memory
• Reg/Dec: Registers Fetch and Instruction Decode
• Exec: Calculate the memory address
• Mem: Read the data from the Data Memory
• Wr: Write the data back to the register file
        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load    Ifetch    Reg/Dec   Exec      Mem       Wr
Visualizing Pipelining
Pipelined Execution Representation
(Time runs left to right; program flow runs top to bottom in instruction order.)
IFetch Dcd    Exec   Mem    WB
       IFetch Dcd    Exec   Mem    WB
              IFetch Dcd    Exec   Mem    WB
                     IFetch Dcd    Exec   Mem    WB
                            IFetch Dcd    Exec   Mem    WB
                                   IFetch Dcd    Exec   Mem    WB
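The staggered representation on this slide can be generated mechanically: instruction i enters the first stage in cycle i + 1 and then advances one stage per cycle. The sketch below is a hypothetical Python helper (the names `pipeline_rows` and `render` are not from the slides) that assumes an ideal pipeline with no stalls.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]


def pipeline_rows(n_instructions, stages=STAGES):
    """Return one row per instruction; row[c] is the stage occupied in cycle c + 1."""
    total_cycles = len(stages) + n_instructions - 1
    rows = []
    for i in range(n_instructions):
        row = [""] * i + list(stages)            # instruction i starts i cycles late
        row += [""] * (total_cycles - len(row))  # pad to the full width
        rows.append(row)
    return rows


def render(rows, width=7):
    """Format the rows as fixed-width text, one instruction per line."""
    lines = []
    for row in rows:
        lines.append("".join(cell.ljust(width) for cell in row).rstrip())
    return "\n".join(lines)


rows = pipeline_rows(6)
diagram = render(rows)
```

For the six instructions shown on the slide, each row has 10 columns, so the whole sequence occupies 5 + 6 - 1 = 10 cycles instead of the 30 that unpipelined execution of six five-cycle instructions would take.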
Why Pipeline?
[Figure: instruction order (Inst 0 to Inst 4) against time in clock cycles; each instruction passes through Im, Reg, ALU, Dm, Reg, starting one clock cycle after the previous instruction]
Data Hazard on R1
Instruction order, against time in clock cycles (stages IF, ID/RF, EX, MEM, WB):
add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or r8, r1, r9
xor r10, r1, r11
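To see why r1 is a problem, compare the cycle in which `add` writes r1 (its WB stage) with the cycle in which each later instruction reads its registers (its ID/RF stage). The following Python sketch does that bookkeeping for the five instructions on the slide. It is a simplified illustration under stated assumptions: one instruction issues per cycle, there are no stalls and no forwarding, and the helper names are hypothetical. Whether a read in the same cycle as the write sees the new value depends on the register-file design, so that case is reported separately rather than decided.

```python
STAGES = ["IF", "ID/RF", "EX", "MEM", "WB"]


def parse(line):
    """Split 'op dest, src1, src2' into (op, dest, [sources])."""
    op, rest = line.split(None, 1)
    regs = [r.strip() for r in rest.split(",")]
    return op, regs[0], regs[1:]


def classify_reads(program, producer_index=0):
    """Classify later reads of the producer's destination register.

    Instruction i (0-based) is in stage s (0-based) during cycle i + s + 1.
    """
    _, dest, _ = parse(program[producer_index])
    write_cycle = producer_index + STAGES.index("WB") + 1
    result = []
    for i in range(producer_index + 1, len(program)):
        op, _, sources = parse(program[i])
        if dest not in sources:
            continue
        read_cycle = i + STAGES.index("ID/RF") + 1
        if read_cycle < write_cycle:
            status = "hazard"       # reads before the new value is written
        elif read_cycle == write_cycle:
            status = "same cycle"   # outcome depends on the register file
        else:
            status = "safe"
        result.append((op, read_cycle, status))
    return result


program = [
    "add r1, r2, r3",
    "sub r4, r1, r3",
    "and r6, r1, r7",
    "or r8, r1, r9",
    "xor r10, r1, r11",
]
report = classify_reads(program)
```

Here `add` writes r1 in cycle 5, while `sub` and `and` read it in cycles 3 and 4, `or` reads it in cycle 5, and `xor` reads it in cycle 6. The sketch ignores the case where a later instruction overwrites the same register.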
Control Hazard Solutions
• Stall: wait until decision is clear for branch instruction
[Figure: instruction order Add, Beq, Load against time in clock cycles; each instruction passes through Mem, Reg, ALU, Mem, Reg]
Data Hazard on r1:
Instruction order, against time in clock cycles (stages IF, ID/RF, EX, MEM, WB):
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
[Figure: each instruction passes through Im, Reg, ALU, Dm, Reg, starting one clock cycle after the previous instruction]
Data Hazard Solution:
Instruction order, against time in clock cycles (stages IF, ID/RF, EX, MEM, WB):
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11
[Figure: each instruction passes through Im, Reg, ALU, Dm, Reg, starting one clock cycle after the previous instruction]
Pipeline Hazards
Structural Hazard
  I-Fetch  DCD  MemOpFetch  OpFetch  Exec  Store
           IFetch  DCD  ??
Control Hazard
  I-Fetch  DCD  OpFetch  Jump
           IFetch  DCD  ??
RAW (read after write) Data Hazard
  IF  DCD  EX  Mem  WB
      IF  DCD  OF  Ex  Mem
WAW (write after write) Data Hazard
  IF  DCD  EX  Mem  WB
WAR (write after read) Data Hazard
  IF  DCD  OF  Ex  RS
      IF  DCD  EX  Mem  WB
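The three data hazard names describe which register accesses of two instructions conflict: RAW when the second reads what the first writes, WAW when both write the same register, and WAR when the second writes what the first reads. A minimal Python sketch of that classification follows. The function name and the (destination, sources) tuple representation are hypothetical, and the sketch only reports potential conflicts by register name; whether one becomes a real hazard depends on the pipeline timing.

```python
def data_dependences(first, second):
    """Name the register conflicts between two instructions in program order.

    Each instruction is a (destination, sources) pair, for example
    ("r1", ["r2", "r3"]) for add r1,r2,r3.
    """
    first_dest, first_sources = first
    second_dest, second_sources = second
    found = set()
    if first_dest in second_sources:
        found.add("RAW")   # second reads what first writes
    if first_dest == second_dest:
        found.add("WAW")   # both write the same register
    if second_dest in first_sources:
        found.add("WAR")   # second writes what first reads
    return found


add_r1 = ("r1", ["r2", "r3"])   # add r1,r2,r3
sub_r4 = ("r4", ["r1", "r3"])   # sub r4,r1,r3
raw_example = data_dependences(add_r1, sub_r4)
```

For the `add`/`sub` pair used on the earlier data hazard slides, the result is a RAW dependence on r1.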
Loop Unrolling in Superscalar
        Integer instruction   FP instruction      Clock cycle
Loop:   LD F0,0(R1)                               1
        LD F6,-8(R1)                              2
        LD F10,-16(R1)        ADDD F4,F0,F2       3
        LD F14,-24(R1)        ADDD F8,F6,F2       4
        LD F18,-32(R1)        ADDD F12,F10,F2     5
        SD 0(R1),F4           ADDD F16,F14,F2     6
        SD -8(R1),F8          ADDD F20,F18,F2     7
        SD -16(R1),F12                            8
        SD -24(R1),F16                            9
        SUBI R1,R1,#40                            10
        BNEZ R1,LOOP                              11
        SD 8(R1),F20                              12
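Reading the instructions, each element is loaded, has the value in F2 added to it, and is stored back, and the loop body has been unrolled five times so that one index update and one branch serve five elements. The same transformation can be sketched at source level in Python. This is only an illustration of the idea, not the code the slide was compiled from: the function names are hypothetical, and the cleanup loop for lengths that are not a multiple of five is an addition the assembly does not show.

```python
def add_scalar_rolled(x, s):
    """One element per loop iteration."""
    out = list(x)
    for i in range(len(out)):
        out[i] = out[i] + s
    return out


def add_scalar_unrolled(x, s):
    """Five elements per loop iteration, then a cleanup loop for leftovers."""
    out = list(x)
    n = len(out)
    i = 0
    while i + 5 <= n:
        out[i] += s
        out[i + 1] += s
        out[i + 2] += s
        out[i + 3] += s
        out[i + 4] += s
        i += 5               # one index update and one branch per five elements
    while i < n:             # leftover elements when n is not a multiple of 5
        out[i] += s
        i += 1
    return out
```

Both functions return the same result for any input length; the unrolled version simply exposes five independent load/add/store groups per iteration, which is what lets the scheduler overlap them.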
Software Pipelining
[Figure: Iteration 0 through Iteration 4 of the original loop, with one software-pipelined iteration marked across them]
Loop Unrolling in VLIW
Memory reference 1  Memory reference 2  FP operation 1    FP op. 2          Int. op/branch  Clock
LD F0,0(R1)         LD F6,-8(R1)                                                            1
LD F10,-16(R1)      LD F14,-24(R1)                                                          2
LD F18,-32(R1)      LD F22,-40(R1)      ADDD F4,F0,F2     ADDD F8,F6,F2                     3
LD F26,-48(R1)                          ADDD F12,F10,F2   ADDD F16,F14,F2                   4
                                        ADDD F20,F18,F2   ADDD F24,F22,F2                   5
SD 0(R1),F4         SD -8(R1),F8        ADDD F28,F26,F2                                     6
SD -16(R1),F12      SD -24(R1),F16                                                          7
SD -32(R1),F20      SD -40(R1),F24                        SUBI R1,R1,#48    8
SD -0(R1),F28                                             BNEZ R1,LOOP      9
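The two schedules can be compared by dividing the clock count by the number of elements processed. The superscalar schedule uses 12 clocks for 5 load/add/store groups, and the VLIW schedule uses 9 clocks for 7. A small Python sketch of that arithmetic, with a hypothetical helper name and the counts read directly off the two tables:

```python
def clocks_per_element(total_clocks, elements):
    """Average clock cycles spent per array element in one unrolled iteration."""
    return total_clocks / elements


superscalar = clocks_per_element(total_clocks=12, elements=5)   # 2.4
vliw = clocks_per_element(total_clocks=9, elements=7)           # 9/7, about 1.29
```

Under these counts the VLIW schedule spends roughly half as many clocks per element, at the cost of unrolling further and leaving many operation slots empty.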
Dynamic Branch Prediction
Solution: 2-bit scheme where the prediction changes only after two mispredictions
[Figure: state diagram with four states, two labeled Predict Taken and two labeled Predict Not Taken, connected by transitions labeled T (taken) and NT (not taken)]
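One common way to realize a 2-bit scheme is a saturating counter with four states, two predicting taken and two predicting not taken. The Python sketch below uses that variant as an assumption; the exact transitions in the slide's state diagram may differ, and the class and function names are hypothetical.

```python
class TwoBitPredictor:
    """2-bit saturating counter: states 0 and 1 predict not taken, 2 and 3 predict taken."""

    def __init__(self, state=3):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the actual outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, state=3):
    """Run one predictor over a sequence of branch outcomes (True means taken)."""
    predictor = TwoBitPredictor(state)
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch that is taken nine times and then falls through, run three times.
loop_branch = ([True] * 9 + [False]) * 3
misses = count_mispredictions(loop_branch)
```

Starting from the strongly-taken state, the single not-taken outcome at each loop exit costs one misprediction but does not flip the prediction, so this sequence of 30 branches has 3 mispredictions.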
Branch Target Buffer (BTB): the address of the branch is used as an index to get the prediction AND the branch address (if taken)
• Predicted PC
• Branch Prediction: Taken or not Taken
Recap: Who Cares About the Memory Hierarchy?
Processor-DRAM Memory Gap (latency)
[Figure: performance (log scale, 1 to 1000) against time (1980 to 2000) for CPU and DRAM]
• CPU ("Moore's Law"): Proc = 60%/yr. (2X/1.5 yr)
• DRAM = 9%/yr. (2X/10 yrs)
• Processor-Memory Performance Gap: (grows 50% / year)
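The gap figure follows from compounding the two growth rates: if processor performance grows 60% per year and DRAM 9% per year, their ratio grows by 1.60 / 1.09 each year. A short Python sketch of that arithmetic, with hypothetical function names and the rates taken from the slide:

```python
import math


def relative_gap_growth(cpu_growth, dram_growth):
    """Yearly growth of the CPU/DRAM performance ratio, as a fraction."""
    return (1 + cpu_growth) / (1 + dram_growth) - 1


def doubling_time_years(annual_growth):
    """Years needed to double at a constant annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


gap = relative_gap_growth(cpu_growth=0.60, dram_growth=0.09)
cpu_doubling = doubling_time_years(0.60)
```

The ratio grows by about 47% per year, which the slide rounds to 50%, and 60% per year corresponds to doubling roughly every 1.5 years.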
Levels of the Memory Hierarchy
Level           Capacity      Access Time              Cost
CPU Registers   100s Bytes    <10s ns
Cache           K Bytes       10-100 ns                1-0.1 cents/bit
Main Memory     M Bytes       200ns-500ns              $.0001-.00001 cents/bit
Disk            G Bytes       10 ms (10,000,000 ns)    10^-5 - 10^-6 cents/bit
Tape            infinite      sec-min                  10^-8
Upper Level (faster) to Lower Level (larger), with the staging transfer unit between adjacent levels:
Registers
  Instr. Operands: prog./compiler, 1-8 bytes
Cache
  Blocks: cache cntl, 8-128 bytes
Memory
  Pages: OS, 512-4K bytes
Disk
  Files: user/operator, Mbytes
Tape
1 KB Direct Mapped Cache, 32B blocks
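Address fields:
• Cache Tag: bits 31-10 (Example: 0x50)
• Cache Index: bits 9-5 (Ex: 0x01)
• Byte Select: bits 4-0 (Ex: 0x00)
Each cache entry holds a Valid Bit, a Cache Tag (stored as part of the cache state; 0x50 in the example), and 32 bytes of Cache Data:
Cache Index   Cache Data
0             Byte 0, Byte 1, ..., Byte 31
1             Byte 32, Byte 33, ..., Byte 63
2
3
:
31            Byte 992, ..., Byte 1023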
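With 1 KB of data and 32-byte blocks there are 32 entries, so a 32-bit address splits into a 5-bit byte select (bits 4-0), a 5-bit cache index (bits 9-5), and a 22-bit tag (bits 31-10). The Python sketch below shows that split and a tag-and-valid-bit lookup. It is a minimal illustration with hypothetical names; it tracks only tags and valid bits, not the cached data, and it assumes a miss always fills the indexed entry.

```python
CACHE_BYTES = 1024
BLOCK_BYTES = 32
NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES      # 32 entries
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1   # 5 bits of byte select
INDEX_BITS = NUM_BLOCKS.bit_length() - 1     # 5 bits of cache index


def split_address(addr):
    """Return (tag, index, byte_select) for a 32-bit address."""
    byte_select = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_select


class DirectMappedCache:
    """Tracks valid bits and tags only; enough to tell hits from misses."""

    def __init__(self):
        self.valid = [False] * NUM_BLOCKS
        self.tags = [0] * NUM_BLOCKS

    def access(self, addr):
        tag, index, _ = split_address(addr)
        hit = self.valid[index] and self.tags[index] == tag
        if not hit:
            # On a miss, the block is brought in and replaces the old entry.
            self.valid[index] = True
            self.tags[index] = tag
        return hit


# Address built from the slide's example fields: tag 0x50, index 0x01, byte 0x00.
example_addr = (0x50 << 10) | (0x01 << 5) | 0x00
fields = split_address(example_addr)
```

The example address is 0x14020, and it splits back into tag 0x50, index 0x01, and byte select 0x00. An address exactly 1 KB higher has the same index but a different tag, so in a direct-mapped cache the two blocks evict each other.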
A Modern Memory Hierarchy
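[Figure: Processor (Control, Datapath, Registers, On-Chip Cache), Second Level Cache (SRAM), Main Memory (DRAM), Secondary Storage (Disk), Tertiary Storage (Disk/Tape)]
Level                                  Speed (ns)                   Size (bytes)
Registers                              1s                           100s
On-Chip Cache, Second Level Cache      10s                          Ks
Main Memory (DRAM)                     100s                         Ms
Secondary Storage (Disk)               10,000,000s (10s ms)         Gs
Tertiary Storage (Disk/Tape)           10,000,000,000s (10s sec)    Ts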
SPEC First Round
[Figure: bar chart of SPEC Perf (0 to 800) by benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]
SPECint95base Performance (Oct. 1997)
[Figure: bar chart of SPECint (0 to 20) by benchmark (go, m88ksim, gcc, compress, li, ijpeg, perl, vortex) for the PA-8000, 21164, and PPro]