Lec3 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- Performance

ECE 4100/6100 Advanced Computer Architecture

Lecture 3 Performance

Prof. Hsien-Hsin Sean Lee
School of Electrical and Computer Engineering
Georgia Institute of Technology

Performance
• Execution/Response time (Latency)
– Elapsed time between start and completion of an event
– How long does my job take?
• Throughput (Bandwidth)
– Total amount of work done within a given period of time
– How many jobs are done per unit time on a system?

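CPU Performance
• Execution Time = Seconds / Program

$$\frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$$

Factors that influence each term:
• Instructions / Program: programmer, algorithms, ISA, compilers
• Cycles / Instruction: microarchitecture, system architecture
• Seconds / Cycle: microarchitecture, pipeline depth, circuit design, technology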
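The three-term product above can be sketched in a few lines of Python. This is only an illustration: the function name `execution_time` and the workload numbers (1e9 instructions, 1 GHz clock) are hypothetical and do not come from the lecture.

```python
def execution_time(instructions: float, cpi: float, clock_hz: float) -> float:
    """Seconds per program = instructions x cycles/instruction x seconds/cycle."""
    seconds_per_cycle = 1.0 / clock_hz
    return instructions * cpi * seconds_per_cycle


# Hypothetical workload: 1e9 instructions, CPI of 1.5, 1 GHz clock.
baseline = execution_time(1e9, 1.5, 1e9)
# Same instruction count and clock, but half the CPI.
improved = execution_time(1e9, 0.75, 1e9)

print(f"baseline: {baseline:.2f} s, improved: {improved:.2f} s")
```

Because the terms multiply, an improvement in one term only pays off if it does not worsen the others, for example a deeper pipeline that shortens the cycle but raises CPI.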

Pipeline Stage
[Figure: one pipeline stage; labels: F/F, Combinational Logic, F/F, 1 FO4]
• Optimal FO4 per pipe
– 6 to 8 [UT/Compaq, ISCA-29]
– 18 (15 + 3 latch) [IBM, MICRO-35]
• P4 pipe stage ~ 16 FO4
(Slide from Lecture 1 Pipelining)

Architecture Comparison
• Many architecture research studies simply make the following assumptions
• Instructions / program is fixed
– Same binary
– Same compiler
– Same benchmark
• Seconds per cycle is constant
– Same frequency
– Same pipeline depth
– Typically a bad assumption today
• Focus on IPC or CPI
• It is more complicated for today’s architects!

Example: Calculating CPI

Typical mix of instruction types in a program

Base Machine (Reg / Reg)
Op      Freq  Cycles  CPI(i)  (% Time)
ALU     50%   1       0.5     (33%)
Load    20%   2       0.4     (27%)
Store   10%   2       0.2     (13%)
Branch  20%   2       0.4     (27%)
Total                 1.5

Design guideline: Make the common case fast

MIPS 1% rule: only consider adding an instruction if it is shown to add 1% performance improvement on reasonable benchmarks.

Run benchmarks and collect workload characterization (simulate, machine counters, or sampling)
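The CPI(i) and % Time columns follow mechanically from the Freq and Cycles columns. A minimal sketch, using the table's numbers; the names `mix` and `cpi_breakdown` are illustrative only.

```python
# Instruction mix from the table: op -> (frequency, cycles per instruction).
mix = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}


def cpi_breakdown(mix):
    """Return total CPI, per-op CPI contribution, and per-op share of time."""
    contributions = {op: freq * cycles for op, (freq, cycles) in mix.items()}
    total = sum(contributions.values())
    time_share = {op: c / total for op, c in contributions.items()}
    return total, contributions, time_share


total_cpi, contributions, time_share = cpi_breakdown(mix)
for op in mix:
    print(f"{op:6s} CPI(i) = {contributions[op]:.1f}  time = {time_share[op]:.0%}")
print(f"Total CPI = {total_cpi:.1f}")
```

Note that ALU ops are half of all instructions but only a third of the time, which is why the time share, not the raw frequency, identifies the common case worth speeding up.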

Performance Comparison
• For some program running on machine X,
Performance_X = 1 / Execution time_X
• "X is n times faster than Y"
Performance_X / Performance_Y = n
n = speedup of X over Y
• Problem:
– Machine A runs a program in 20 seconds
– Machine B runs the same program in 25 seconds
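Applying the definition to the problem: since performance is the reciprocal of execution time, the ratio of performances is the inverse ratio of times. A small sketch (the helper name `speedup` is illustrative):

```python
def speedup(time_x: float, time_y: float) -> float:
    """How many times faster X is than Y: Performance_X / Performance_Y."""
    return time_y / time_x


n = speedup(time_x=20.0, time_y=25.0)
print(f"A is {n:.2f} times faster than B")
```

So machine A is 25 / 20 = 1.25 times faster than machine B.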

Performance Evaluation: Benchmark
• (Real) Programs
– In the form of a collection of programs
– E.g., SPEC, Winstone, SYSMARK, 3D Winbench, EEMBC
• Kernels
– Small key pieces of real programs
– E.g., Livermore Fortran Loops Kernels (LFK), Linpack
• Modified (or scripted)
– To focus on some particular aspects (e.g. remove I/O, focus on CPU)
• (Toy) Benchmarks
– Produce expected results
• Synthetic Benchmarks
– Representative instruction mix
– E.g., Dhrystone, Whetstone
• Important for
– Architectural and microarchitectural design trade-off
– Competitive analysis of real products

Performance Summary Measurement
• Average of total execution time
• This is the Arithmetic Mean (or Weighted Arithmetic Mean)

$$\frac{1}{n}\sum_{i=1}^{n} \text{Time}_i \quad\text{or}\quad \sum_{i=1}^{n} \text{Weight}_i \times \text{Time}_i$$
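Both forms of the mean are one-liners. In this sketch the execution times and weights are made-up values, and the weights are assumed to sum to 1.

```python
def arithmetic_mean(times):
    """Plain average: (1/n) * sum of Time_i."""
    return sum(times) / len(times)


def weighted_arithmetic_mean(weights, times):
    """Sum of Weight_i * Time_i; the weights are assumed to sum to 1."""
    return sum(w * t for w, t in zip(weights, times))


# Hypothetical execution times (seconds) of three benchmark programs.
times = [2.0, 4.0, 6.0]
weights = [0.5, 0.25, 0.25]

print(arithmetic_mean(times))
print(weighted_arithmetic_mean(weights, times))
```

With equal weights of 1/n the weighted form reduces to the plain arithmetic mean.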

Performance Summary Measurement
• Rate_i is a function of 1/Time_i
• Used to represent the average "rate", such as instructions per cycle (IPC)

$$\frac{n}{\sum_{i=1}^{n} \frac{1}{\text{Rate}_i}} \quad\text{or}\quad \frac{1}{\sum_{i=1}^{n} \frac{\text{Weight}_i}{\text{Rate}_i}}$$
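A minimal sketch of the two rate averages (the harmonic mean and its weighted form). The IPC values are hypothetical, and the weights are assumed to sum to 1.

```python
def harmonic_mean(rates):
    """n divided by the sum of 1 / Rate_i."""
    return len(rates) / sum(1.0 / r for r in rates)


def weighted_harmonic_mean(weights, rates):
    """1 divided by the sum of Weight_i / Rate_i; weights assumed to sum to 1."""
    return 1.0 / sum(w / r for w, r in zip(weights, rates))


# Hypothetical IPCs measured on three programs.
ipcs = [1.0, 2.0, 4.0]

print(f"harmonic mean IPC:   {harmonic_mean(ipcs):.3f}")
print(f"arithmetic mean IPC: {sum(ipcs) / len(ipcs):.3f}")
```

The harmonic mean is never larger than the arithmetic mean, and for IPC it equals total instructions over total cycles when every program executes the same number of instructions.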

Why Harmonic Mean?
• 30 mph for the first 10 miles
• 90 mph for the next 10 miles
• Average speed? (30+90)/2 = 60 mph??
• Wrong!
• Average speed = total distance / total time
• (10+10)/(10/30 + 10/90) = 45 mph
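Written out step by step, the total-distance-over-total-time calculation is exactly the harmonic mean of the two speeds, because both legs cover the same distance:

```latex
\frac{10 + 10}{\frac{10}{30} + \frac{10}{90}}
  = \frac{2}{\frac{1}{30} + \frac{1}{90}}
  = \frac{2}{\frac{4}{90}}
  = 45 \text{ mph}
```

The slow leg takes 20 minutes and the fast leg under 7 minutes, so the slow speed dominates the average.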

New Breed of Metrics
• Performance / Watt
– Performance achievable at the same cooling capacity
• Performance / Joule (Energy)
– Achievable performance over the lifetime of the same energy source (i.e., battery = energy)
– Equivalent to the reciprocal of the energy-delay product (ED product)
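The equivalence with the ED product follows in one step, assuming performance is defined as the reciprocal of execution time (delay), as in the rest of this lecture:

```latex
\text{Performance} = \frac{1}{\text{Delay}}
\quad\Rightarrow\quad
\frac{\text{Performance}}{\text{Energy}}
  = \frac{1}{\text{Energy} \times \text{Delay}}
```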

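Amdahl’s Law (Law of Diminishing Returns)
• Make the common case faster
• Speedup = Perf_new / Perf_old = T_old / T_new

$$\text{Speedup} = \frac{T_{old}}{T_{new}} = \frac{1}{(1 - f) + \frac{f}{P}}$$

where f is the fraction of T_old that can use the faster mode and P is how many times faster that mode is.
• Performance improvement from using the faster mode is limited by the fraction of time the faster mode can be applied.
[Figure: T_old split into f and (1 - f); T_new split into f / P and (1 - f)]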
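A small sketch of the speedup formula, showing how the gain saturates as the faster mode gets faster. The function name `amdahl_speedup` and the chosen values of f and P are illustrative.

```python
def amdahl_speedup(f: float, p: float) -> float:
    """Speedup when a fraction f of the old execution time runs p times faster."""
    return 1.0 / ((1.0 - f) + f / p)


# With half of the time in the faster mode, the speedup approaches
# but never reaches 1 / (1 - f) = 2, no matter how large p becomes.
for p in (2, 8, 64, 1024):
    print(f"f = 0.5, P = {p:4d}: speedup = {amdahl_speedup(0.5, p):.3f}")
```

The upper bound 1 / (1 - f) is the "diminishing returns" in the title: the part that cannot use the faster mode eventually dominates the run time.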

Amdahl’s Law Analogy
• Driving from Orlando to Atlanta
– 60 miles/hr from Orlando to Macon
– 120 miles/hr from Macon to Atlanta
– How much time can you save compared against driving all the way at 60 miles/hr from Orlando to Atlanta?
• 6hr 45min vs. 7hr 30min = ~11% speedup
• Key is to speed up the biggest portion, i.e. speed up frequently executed blocks
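The trip times can be read back into Amdahl's Law. Doubling the speed halves the time for the Macon-to-Atlanta leg, and the saving is 45 minutes, so that leg takes 1.5 hr at 60 miles/hr. This gives f = 1.5 / 7.5 = 0.2 and P = 2:

```latex
\text{Speedup} = \frac{T_{old}}{T_{new}} = \frac{7.5}{6.75}
  = \frac{1}{(1 - 0.2) + \frac{0.2}{2}}
  = \frac{1}{0.9} \approx 1.11
```

Even an infinitely fast second leg would only reach 1 / 0.8 = 1.25, because 80% of the trip time is still spent at 60 miles/hr.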

Parallelism vs. Speedup
[Figure: Amdahl's Law speed-up as a function of parallelism. x-axis: code portion in faster mode (f), 0 to 1; y-axis: speed-up, 1 to 100; one curve each for P = 1, 2, 4, 8, 16, 32, 64; annotated points at 1.11x, 1.33x, and 1.97x]

Gustafson’s Law
• Amdahl’s Law killed massive parallel processing (MPP)
• Gustafson came to the rescue
[Figure: along a time axis, T_new consists of Seq and Parallel; T_old consists of Seq and P * Parallel]
Assume: Seq + Parallel = 1 (T_new)
Speedup = Seq + p * (1 – Seq), where p = parallel factor
If Seq diminishes with increased problem size, Speedup → p
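A minimal sketch of the scaled-speedup formula, showing the speedup approaching p as the sequential share shrinks. The function name and the sample Seq values are illustrative; Seq is the sequential fraction of the parallel run time (T_new), as assumed above.

```python
def gustafson_speedup(seq: float, p: float) -> float:
    """Scaled speedup Seq + p * (1 - Seq), with Seq measured on the parallel run."""
    return seq + p * (1.0 - seq)


# As the problem grows and the sequential share shrinks, speedup approaches p.
for seq in (0.5, 0.1, 0.01):
    print(f"Seq = {seq:4.2f}, p = 64: speedup = {gustafson_speedup(seq, 64):.2f}")
```

The difference from Amdahl's Law is the baseline: Amdahl fixes the problem size, while Gustafson fixes the run time and lets the parallel part of the problem grow with p.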


Amdahl versus Gustafson

Who is right?

The Principle of Locality
• Knuth made the original observation about program locality in 1971.
– … less than 4 percent of a program generally accounts for more than half of its running time.
• 90/10 rule: a program spends 90% of its execution time in only 10% of the code
• Two types of locality
– Temporal locality (locality in time)
– Spatial locality (locality in space)
• Memory subsystem design heavily leverages the locality concept for better performance

Example of Performance Evaluation (I)

Operation            Frequency  Clock cycle count
ALU Ops (reg-reg)    43%        1
Loads                21%        2
Stores               12%        2
Branches             24%        2

Assume 25% of the ALU ops directly use a loaded operand that is not used again.

We propose adding ALU instructions that have one source operand in memory. These new reg-mem instructions take 2 clock cycles. Also assume that the extended instruction set increases the branch clock cycle count by 1, with no impact on cycle time. Would this change improve performance?


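$$\text{Cycles}_{new} = (0.25 \times 0.43) \times 2 + (0.43 - 0.25 \times 0.43) \times 1 + (0.21 - 0.25 \times 0.43) \times 2 + 0.12 \times 2 + 0.24 \times 3 = 1.703$$

$$\text{Cycles}_{old} = 0.43 \times 1 + 0.21 \times 2 + 0.12 \times 2 + 0.24 \times 2 = 1.57$$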
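The two cycle counts can be checked with a short script. Both are expressed per original instruction, and the variable names are illustrative. The key step is that each new reg-mem instruction replaces one ALU op and one load, so both of those counts drop by 25% of the ALU frequency.

```python
# Instruction frequencies from the table.
alu, load, store, branch = 0.43, 0.21, 0.12, 0.24

# ALU ops that absorb their load and become 2-cycle reg-mem instructions.
reg_mem = 0.25 * alu

old_cycles = alu * 1 + load * 2 + store * 2 + branch * 2
new_cycles = (
    reg_mem * 2             # new reg-mem ALU instructions
    + (alu - reg_mem) * 1   # remaining reg-reg ALU ops
    + (load - reg_mem) * 2  # loads that were not absorbed
    + store * 2
    + branch * 3            # branches now cost one extra cycle
)

print(f"old = {old_cycles:.4f}, new = {new_cycles:.4f}")
print("improves performance" if new_cycles < old_cycles else "does not improve performance")
```

Since the cycle time is unchanged, the comparison of total cycles decides the question: the new total is higher, so the change would not improve performance under these assumptions.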

Example of Performance Evaluation (II)
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2% of all instructions, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2
• Design Option 2: decrease the average CPI of all FP instructions to 2.5

Original CPI = 0.25*4 + 1.33*(1-0.25) ≈ 2.0

Option 1 CPI = 2.0 – 2%*(20-2) = 1.64

Option 2 CPI = 0.25*2.5 + 1.33*(1-0.25) ≈ 1.625

Speedup of Option 1 = 2/1.64 = 1.2195
Speedup of Option 2 = 2/1.625 = 1.2308
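The same comparison as a script, as a rough check. It does not round intermediate values, so the results differ slightly from the figures above, which treat 1.33 * 0.75 as 1.0. Variable names are illustrative.

```python
fp_frac, fp_cpi, other_cpi = 0.25, 4.0, 1.33
sqrt_frac, sqrt_cpi = 0.02, 20.0

original = fp_frac * fp_cpi + (1 - fp_frac) * other_cpi

# Option 1: only FPSQRT gets faster (CPI 20 -> 2).
option1 = original - sqrt_frac * (sqrt_cpi - 2.0)

# Option 2: every FP instruction gets faster on average (CPI 4.0 -> 2.5).
option2 = fp_frac * 2.5 + (1 - fp_frac) * other_cpi

for name, cpi in (("Option 1", option1), ("Option 2", option2)):
    print(f"{name}: CPI = {cpi:.4f}, speedup = {original / cpi:.4f}")
```

Either way the ordering is the same: with instruction count and clock unchanged, Option 2 yields the lower CPI and the slightly larger speedup.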

Example of Performance Evaluation (III)
Clock freq = 1.4 GHz
FP instructions = 25%
Average CPI of FP instructions = 4.0
Average CPI of other instructions = 1.33
FPSQRT = 2%, CPI of FPSQRT = 20
• Design Option 1: decrease the CPI of FPSQRT to 2, clock freq = 1.2 GHz
• Design Option 2: decrease the average CPI of all FP instructions to 2.5, clock freq = 1.1 GHz


Original CPI = 2.0, IPC = 1/2, Inst/Sec = ½*1.4G = 0.7G inst/s

Option 1 CPI = 1.64, IPC = 1/1.64, Inst/Sec = 1/1.64*1.2G = 0.73G inst/s

Option 2 CPI = 1.625, IPC = 1/1.625, Inst/Sec = 1/1.625*1.1G = 0.68G inst/s
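Once clock frequency differs between designs, CPI alone no longer ranks them; instructions per second (frequency / CPI) does. A short sketch using the CPI values above, with an illustrative helper name:

```python
def instructions_per_second(cpi: float, clock_hz: float) -> float:
    """Throughput = IPC * frequency = frequency / CPI."""
    return clock_hz / cpi


designs = {
    "Original": (2.0, 1.4e9),
    "Option 1": (1.64, 1.2e9),
    "Option 2": (1.625, 1.1e9),
}

rates = {name: instructions_per_second(cpi, hz) for name, (cpi, hz) in designs.items()}
for name, rate in rates.items():
    print(f"{name}: {rate / 1e9:.2f} G inst/s")
print("best:", max(rates, key=rates.get))
```

Option 2 had the best CPI in the previous example, but its lower clock makes it slower than the original design, while Option 1 comes out ahead.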
