The performance Improving of Microprocessor World

By Ming-Haw Jing

• The current status of microprocessors

• The instruction set architecture

• The concept of pipelining

• The structure of pipelining

• Solving the problems of pipelining

• The computer system architecture

• The metrics of the computer

Original Big Fishes Eating Little Fishes

1988 Computer Food Chain

[Figure: food chain of Mainframe, Supercomputer, Mini-supercomputer, Mini-computer, Work-station, PC, and Massively Parallel Processors]

Technology Trends (Summary)

         Capacity          Speed (latency)
Logic    2x in 3 years     2x in 3 years
DRAM     4x in 3 years     2x in 10 years
Disk     4x in 3 years     2x in 10 years
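To compare the rows of the summary, it can help to convert each "Nx in Y years" figure into an equivalent compound annual rate. The following is a minimal sketch; the function name `annual_rate` is illustrative and does not come from the slides.

```python
def annual_rate(factor: float, years: float) -> float:
    """Compound annual growth rate equivalent to `factor`x in `years` years."""
    return factor ** (1.0 / years) - 1.0


# The three distinct figures that appear in the Technology Trends summary.
trends = {
    "2x in 3 years": (2, 3),
    "4x in 3 years": (4, 3),
    "2x in 10 years": (2, 10),
}

for label, (factor, years) in trends.items():
    print(f"{label}: about {annual_rate(factor, years):.1%} per year")
```

Read this way, a capacity that grows 4x in 3 years compounds at roughly 59% per year, while a latency that improves 2x in 10 years compounds at only about 7% per year.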

Performance and Technology Trends

[Figure: performance (log scale, 0.1 to 1000) versus year (1965 to 2000) for microprocessors, minicomputers, mainframes, and supercomputers]

1998 Computer Food Chain

[Figure: Mini-supercomputer, Massively Parallel Processors, Mini-computer, PC, Work-station, Mainframe, Supercomputer, and Server]

Now who is eating whom?

Instruction Set Architecture (ISA)

[Figure: the instruction set shown as the interface between software and hardware]

Where Is The Instruction Set?

[Figure: layers of a computer system: Application; Programming Language; Compiler; ISA; Datapath, Control; Function Units; Transistors, Wires, Pins]

Metrics shown beside the layers:

• Answers per month, operations per second

• (Millions) of instructions per second: MIPS

• (Millions) of (FP) operations per second: MFLOP/s

• Megabytes per second

• Cycles per second (clock rate)
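The MIPS and MFLOP/s metrics named on this slide are both counts per second scaled by one million. A minimal sketch of the arithmetic follows; the function names and the workload numbers are hypothetical and chosen only for illustration.

```python
def mips(instruction_count: int, seconds: float) -> float:
    """Millions of instructions executed per second."""
    return instruction_count / (seconds * 1e6)


def mflops(fp_operation_count: int, seconds: float) -> float:
    """Millions of floating-point operations executed per second."""
    return fp_operation_count / (seconds * 1e6)


# Hypothetical run: 400 million instructions, 50 million of them
# floating-point operations, completed in 2 seconds.
print(mips(400_000_000, 2.0))    # 200.0
print(mflops(50_000_000, 2.0))   # 25.0
```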

Evolution of Instruction Sets

Single Accumulator (EDSAC 1950)
Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
Separation of Programming Model from Implementation
    High-level Language Based (B5000 1963)
    Concept of a Family (IBM 360 1964)
General Purpose Register Machines
    Complex Instruction Sets (Vax, Intel 432 1977-80)
    Load/Store Architecture (CDC 6600, Cray 1 1963-76)
RISC (Mips, Sparc, HP-PA, IBM RS6000, PowerPC . . . 1987)
VLIW/EPIC (IA-64 . . . 1999)

Instruction Format of MIPS

Register-Register
    Bits:    31-26   25-21   20-16   15-11   10-0
    Field:   Op      Rs1     Rs2     Rd      Opx

Register-Immediate
    Bits:    31-26   25-21   20-16   15-0
    Field:   Op      Rs1     Rd      immediate

Branch
    Bits:    31-26   25-21   20-16     15-0
    Field:   Op      Rs1     Rs2/Opx   immediate

Jump / Call
    Bits:    31-26   25-0
    Field:   Op      target
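The four formats can be unpacked from a 32-bit word with shifts and masks. The sketch below assumes the field boundaries read off the slide (Op in bits 31-26, then 25-21, 20-16, 15-11, and 10-0, 15-0, or 25-0 for the remaining field). The slides do not give the opcode-to-format mapping, so the caller supplies the format name; the function names and the example opcode value are hypothetical.

```python
def field(word: int, hi: int, lo: int) -> int:
    """Extract bits hi..lo inclusive (bit 0 is the least significant)."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def decode(word: int, fmt: str) -> dict:
    """Split a 32-bit instruction word into the fields of one format."""
    fields = {"op": field(word, 31, 26)}
    if fmt == "register-register":
        fields.update(rs1=field(word, 25, 21), rs2=field(word, 20, 16),
                      rd=field(word, 15, 11), opx=field(word, 10, 0))
    elif fmt == "register-immediate":
        fields.update(rs1=field(word, 25, 21), rd=field(word, 20, 16),
                      immediate=field(word, 15, 0))
    elif fmt == "branch":
        fields.update(rs1=field(word, 25, 21), rs2_opx=field(word, 20, 16),
                      immediate=field(word, 15, 0))
    elif fmt == "jump":
        fields.update(target=field(word, 25, 0))
    else:
        raise ValueError(f"unknown format: {fmt}")
    return fields


# Hypothetical register-immediate word: Op=8, Rs1=2, Rd=1, immediate=100.
word = (8 << 26) | (2 << 21) | (1 << 16) | 100
print(decode(word, "register-immediate"))
```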

Pipelining is Natural!

• Laundry Example

• Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold

• Washer takes 30 minutes

• Dryer takes 30 minutes

• Folder takes 30 minutes

• Stasher takes 30 minutes to put clothes into drawers

Sequential Laundry

[Figure: timeline from 6 PM to 2 AM; loads A, B, C, D in task order, each taking four 30-minute steps, one load after another]

• Sequential laundry takes 8 hours for 4 loads

• If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP

[Figure: timeline starting at 6 PM; loads A, B, C, D in task order, each starting 30 minutes after the previous one, seven 30-minute slots in total]

• Pipelined laundry takes 3.5 hours for 4 loads!
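The 8-hour and 3.5-hour figures in the laundry example follow from simple counting: done sequentially, every load occupies all four steps in turn; pipelined, the first load takes four steps and each later load finishes one step behind it. A minimal sketch of that arithmetic, with illustrative function names:

```python
def sequential_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """Each load runs through every stage before the next load starts."""
    return loads * stages * stage_minutes


def pipelined_minutes(loads: int, stages: int, stage_minutes: int) -> int:
    """The first load fills the pipeline; each later load adds one stage time."""
    return (stages + loads - 1) * stage_minutes


seq = sequential_minutes(loads=4, stages=4, stage_minutes=30)
pipe = pipelined_minutes(loads=4, stages=4, stage_minutes=30)
print(seq / 60, "hours sequential")       # 8.0
print(pipe / 60, "hours pipelined")       # 3.5
print(round(seq / pipe, 2), "x speedup")  # 2.29
```

The speedup for only four loads is below the four-stage ideal because the pipeline spends time filling and draining; it approaches 4x as the number of loads grows.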

Pipelined Datapath of CPU

[Figure: five-stage datapath: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc.; Memory Access; Write Back]

The Five Stages of Load

• Ifetch: Instruction Fetch (fetch the instruction from the Instruction Memory)

• Reg/Dec: Registers Fetch and Instruction Decode

• Exec: Calculate the memory address

• Mem: Read the data from the Data Memory

• Wr: Write the data back to the register file

        Cycle 1   Cycle 2   Cycle 3   Cycle 4   Cycle 5
Load    Ifetch    Reg/Dec   Exec      Mem       Wr

Visualizing Pipelining

[Figure: pipelined execution representation. Program flow (instruction order) runs down the page and time runs across; each instruction passes through IFetch, Dcd, Exec, Mem, WB]
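The staircase shape of a pipelined execution diagram can be regenerated in a few lines. This sketch assumes an ideal pipeline with no stalls, in which each instruction enters one cycle after the previous one; the stage names are taken from the slide, while `pipeline_diagram` and `render` are hypothetical helper names.

```python
STAGES = ["IFetch", "Dcd", "Exec", "Mem", "WB"]


def pipeline_diagram(num_instructions: int, stages=STAGES) -> list:
    """Return one row per instruction; column c holds the stage active in cycle c+1."""
    total_cycles = len(stages) + num_instructions - 1
    rows = []
    for i in range(num_instructions):
        row = [""] * total_cycles
        for s, name in enumerate(stages):
            row[i + s] = name  # instruction i starts in cycle i+1
        rows.append(row)
    return rows


def render(rows: list) -> str:
    """Format the rows as fixed-width text, one instruction per line."""
    width = max(len(cell) for row in rows for cell in row) + 1
    return "\n".join("".join(cell.ljust(width) for cell in row).rstrip()
                     for row in rows)


print(render(pipeline_diagram(3)))
```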

Why Pipeline?

[Figure: Inst 0 through Inst 4 in instruction order against time (clock cycles); each instruction passes through Im, Reg, ALU, Dm, Reg]

Data Hazard on R1

add r1, r2, r3
sub r4, r1, r3
and r6, r1, r7
or  r8, r1, r9
xor r10, r1, r11

[Figure: the instructions in order against time (clock cycles), with pipeline stages IF, ID/RF, EX, MEM, WB]
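Every instruction after the `add` in the sequence above reads r1, which the `add` writes. The sketch below finds such read-after-write dependences mechanically. It is a simplified illustration: whether a given dependence actually becomes a hazard depends on the pipeline timing, which this code does not model, and the tuple encoding of instructions is an assumption made for the example.

```python
# (opcode, destination, source1, source2), copied from the slide.
program = [
    ("add", "r1", "r2", "r3"),
    ("sub", "r4", "r1", "r3"),
    ("and", "r6", "r1", "r7"),
    ("or", "r8", "r1", "r9"),
    ("xor", "r10", "r1", "r11"),
]


def raw_dependences(instrs):
    """Return (producer, consumer, register) for each source register
    whose most recent writer is an earlier instruction in the list."""
    last_writer = {}
    deps = []
    for i, (_op, dest, *sources) in enumerate(instrs):
        for src in sources:
            if src in last_writer:
                deps.append((last_writer[src], i, src))
        last_writer[dest] = i
    return deps


for producer, consumer, reg in raw_dependences(program):
    distance = consumer - producer
    print(f"{program[consumer][0]} reads {reg} written by "
          f"{program[producer][0]} ({distance} instruction(s) earlier)")
```

The closer the consumer is to the producer, the more likely the value is not yet in the register file when it is needed.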

Control Hazard Solutions

• Stall: wait until decision is clear for branch instruction

[Figure: Add, Beq, Load in instruction order against time (clock cycles); each passes through Mem, Reg, ALU, Mem, Reg]

Data Hazard on r1:

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

[Figure: the five instructions in order against time (clock cycles); stages IF, ID/RF, EX, MEM, WB drawn as Im, Reg, ALU, Dm, Reg]

Data Hazard Solution:

add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or  r8,r1,r9
xor r10,r1,r11

[Figure: the five instructions in order against time (clock cycles); stages IF, ID/RF, EX, MEM, WB drawn as Im, Reg, ALU, Dm, Reg]

Pipeline Hazards

I-Fetch DCD MemOpFetch OpFetch Exec Store
IFetch DCD ??
Structural Hazard

I-Fetch DCD OpFetch Jump
IFetch DCD ??
Control Hazard

IF DCD EX Mem WB
IF DCD OF Ex Mem
RAW (read after write) Data Hazard

WAW (write after write) Data Hazard
IF DCD OF Ex RS
WAR (write after read) Data Hazard
IF DCD EX Mem WB
IF DCD EX Mem WB
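The three data hazard names describe which kind of access the later instruction makes to a register that the earlier instruction also touches. A minimal sketch that classifies a pair of instructions by their read and write sets follows; the `instr` and `classify` helpers are hypothetical, and the result is the set of potential hazards, since whether one actually occurs depends on the pipeline.

```python
def instr(dest: str, *sources: str) -> dict:
    """Describe an instruction by the registers it writes and reads."""
    return {"writes": {dest}, "reads": set(sources)}


def classify(first: dict, second: dict) -> set:
    """Potential data hazards when `first` precedes `second` in program order."""
    kinds = set()
    if first["writes"] & second["reads"]:
        kinds.add("RAW")  # second reads what first writes
    if first["writes"] & second["writes"]:
        kinds.add("WAW")  # both write the same register
    if first["reads"] & second["writes"]:
        kinds.add("WAR")  # second overwrites what first still has to read
    return kinds


add = instr("r1", "r2", "r3")   # add r1,r2,r3
sub = instr("r4", "r1", "r3")   # sub r4,r1,r3
print(sorted(classify(add, sub)))                       # ['RAW']
print(sorted(classify(add, instr("r1", "r4", "r5"))))   # ['WAW']
print(sorted(classify(sub, add)))                       # ['WAR']
```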

Loop Unrolling in Superscalar

                                                  Clock
Loop:  LD   F0,0(R1)                              1
       LD   F6,-8(R1)                             2
       LD   F10,-16(R1)      ADDD F4,F0,F2        3
       LD   F14,-24(R1)      ADDD F8,F6,F2        4
       LD   F18,-32(R1)      ADDD F12,F10,F2      5
       SD   0(R1),F4         ADDD F16,F14,F2      6
       SD   -8(R1),F8        ADDD F20,F18,F2      7
       SD   -16(R1),F12                           8
       SD   -24(R1),F16                           9
       SUBI R1,R1,#40                             10
       BNEZ R1,LOOP                               11
       SD   -32(R1),F20                           12

Software Pipelining

[Figure: Iteration 0 through Iteration 4 overlapped in time, with one software-pipelined iteration marked across them]

Loop Unrolling in VLIW

Memory            Memory            FP                 FP                 Int. op/          Clock
reference 1       reference 2       operation 1        op. 2              branch
LD F0,0(R1)       LD F6,-8(R1)                                                              1
LD F10,-16(R1)    LD F14,-24(R1)                                                            2
LD F18,-32(R1)    LD F22,-40(R1)    ADDD F4,F0,F2      ADDD F8,F6,F2                        3
LD F26,-48(R1)                      ADDD F12,F10,F2    ADDD F16,F14,F2                      4
                                    ADDD F20,F18,F2    ADDD F24,F22,F2                      5
SD 0(R1),F4       SD -8(R1),F8      ADDD F28,F26,F2                                         6
SD -16(R1),F12    SD -24(R1),F16                                                            7
SD -32(R1),F20    SD -40(R1),F24                                          SUBI R1,R1,#48    8
SD -0(R1),F28                                                             BNEZ R1,LOOP      9
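The two unrolled schedules in these slides can be compared by dividing the clocks per pass through the loop body by the number of original iterations each pass covers. The counts below are read off the slides (one ADDD per original iteration: five in the superscalar schedule over 12 clocks, seven in the VLIW schedule over 9 clocks); the dictionary layout is only an illustration.

```python
# Clocks per unrolled loop body and original iterations covered,
# as counted from the two schedules in the slides.
schedules = {
    "superscalar": {"clocks": 12, "iterations": 5},
    "VLIW": {"clocks": 9, "iterations": 7},
}


def clocks_per_iteration(schedule: dict) -> float:
    """Average clocks spent on each original loop iteration."""
    return schedule["clocks"] / schedule["iterations"]


for name, schedule in schedules.items():
    print(f"{name}: {clocks_per_iteration(schedule):.2f} clocks per iteration")
```

On these counts the superscalar schedule averages 2.40 clocks per iteration and the VLIW schedule about 1.29.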

Dynamic Branch Prediction

Solution: a 2-bit scheme that changes the prediction only after mispredicting twice

[Figure: four-state diagram with two Predict Taken states and two Predict Not Taken states; transitions are labeled T (taken) and NT (not taken)]

Branch Target Buffer (BTB): the address of the branch indexes the buffer to get the prediction AND the branch address (if taken)

[Figure: the BTB outputs the predicted PC and the branch prediction, taken or not taken]
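A 2-bit scheme can be sketched as a saturating counter. This is the common textbook variant and may differ in detail from the state diagram on the slide, which did not survive extraction; the class and function names are hypothetical. States 0 and 1 predict not taken, states 2 and 3 predict taken, so a predictor in a strong state must mispredict twice before its prediction flips.

```python
class TwoBitPredictor:
    """Saturating 2-bit counter: 0,1 predict not taken; 2,3 predict taken."""

    def __init__(self, state: int = 3):
        self.state = state

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes, predictor=None) -> int:
    """Run the predictor over a sequence of branch outcomes (True = taken)."""
    if predictor is None:
        predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict() != taken:
            misses += 1
        predictor.update(taken)
    return misses


# A loop branch taken 9 times and then not taken, executed 3 times over.
outcomes = ([True] * 9 + [False]) * 3
print(count_mispredictions(outcomes))  # 3: one miss per loop exit
```

In this example the single not-taken outcome at each loop exit only moves the counter from 3 to 2, so the next entry into the loop is still predicted taken.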

Recap: Who Cares About the Memory Hierarchy?

Processor-DRAM Memory Gap (latency)

[Figure: performance (log scale, 1 to 1000) versus time (1980 to 2000) for CPU and DRAM]

• Proc ("Moore’s Law"): 60%/yr. (2X/1.5 yr)

• DRAM: 9%/yr. (2X/10 yrs)

• Processor-Memory Performance Gap: grows 50% / year
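The gap growth follows from the two annual rates on the slide: each year the ratio of processor to DRAM performance is multiplied by 1.60 / 1.09. A minimal sketch of that arithmetic, with illustrative names; the slide's figures are rounded, so the computed values are close to, but not identical to, the ones quoted.

```python
import math

proc_growth = 0.60   # processor performance, per year (from the slide)
dram_growth = 0.09   # DRAM performance, per year (from the slide)

# Yearly growth of the processor/DRAM performance ratio.
gap_growth = (1 + proc_growth) / (1 + dram_growth) - 1


def doubling_time(rate: float) -> float:
    """Years needed to double at a compound annual growth `rate`."""
    return math.log(2) / math.log(1 + rate)


print(f"gap grows about {gap_growth:.0%} per year")
print(f"processor doubles every {doubling_time(proc_growth):.1f} years")
print(f"DRAM doubles every {doubling_time(dram_growth):.1f} years")
```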

Levels of the Memory Hierarchy

Level            Capacity      Access Time              Cost
CPU Registers    100s Bytes    <10s ns
Cache            K Bytes       10-100 ns                1-0.1 cents/bit
Main Memory      M Bytes       200ns-500ns              $.0001-.00001 cents/bit
Disk             G Bytes       10 ms (10,000,000 ns)    10^-5 - 10^-6 cents/bit
Tape             infinite      sec-min                  10^-8

Staging between levels (upper level: faster; lower level: larger):

Between                 Xfer Unit          Staged by         Size
Registers and Cache     Instr. Operands    prog./compiler    1-8 bytes
Cache and Memory        Blocks             cache cntl        8-128 bytes
Memory and Disk         Pages              OS                512-4K bytes
Disk and Tape           Files              user/operator     Mbytes

1 KB Direct Mapped Cache, 32B blocks

Address fields:

Field          Bits     Example
Cache Tag      31-10    0x50
Cache Index    9-5      0x01
Byte Select    4-0      0x00

Each cache entry holds a Valid Bit and a Cache Tag (stored as part of the cache state) plus 32 bytes of Cache Data:

Index    Cache Tag    Cache Data
0                     Byte 0, Byte 1, ..., Byte 31
1        0x50         Byte 32, Byte 33, ..., Byte 63
2
3
:
31                    Byte 992, ..., Byte 1023
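For a 1 KB direct mapped cache with 32-byte blocks there are 32 blocks, so an address splits into a 5-bit byte select (bits 4-0), a 5-bit cache index (bits 9-5), and the remaining upper bits as the tag. The sketch below performs that split and rebuilds the slide's example (tag 0x50, index 0x01, byte select 0x00); the constant and function names are illustrative.

```python
CACHE_BYTES = 1024
BLOCK_BYTES = 32
NUM_BLOCKS = CACHE_BYTES // BLOCK_BYTES       # 32 blocks
OFFSET_BITS = BLOCK_BYTES.bit_length() - 1    # 5 bits of byte select
INDEX_BITS = NUM_BLOCKS.bit_length() - 1      # 5 bits of cache index


def split_address(addr: int) -> tuple:
    """Return (tag, index, byte_select) for a byte address."""
    byte_select = addr & (BLOCK_BYTES - 1)
    index = (addr >> OFFSET_BITS) & (NUM_BLOCKS - 1)
    tag = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, byte_select


# Rebuild the address from the slide's example field values.
addr = (0x50 << 10) | (0x01 << 5) | 0x00
print(hex(addr))                               # 0x14020
print([hex(x) for x in split_address(addr)])   # ['0x50', '0x1', '0x0']
```

An access hits when the entry selected by the index is valid and its stored tag equals the tag field of the address.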

A Modern Memory Hierarchy

Level                                                        Speed (ns)                   Size (bytes)
Processor (Control, Datapath, Registers, On-Chip Cache)      1s                           100s
Second Level Cache (SRAM)                                    10s                          Ks
Main Memory (DRAM)                                           100s                         Ms
Secondary Storage (Disk)                                     10,000,000s (10s ms)         Gs
Tertiary Storage (Disk/Tape)                                 10,000,000,000s (10s sec)    Ts

SPEC First Round

[Figure: bar chart of SPEC Perf (0 to 800) by benchmark: gcc, espresso, spice, doduc, nasa7, li, eqntott, matrix300, fpppp, tomcatv]

SPECint95base Performance (Oct. 1997)

[Figure: bar chart of SPECint (0 to 20) by benchmark (go, 88ksim, gcc, compress, li, ijpeg, perl, vortex) for the PA-8000, 21164, and PPro]
