CEN 316 Computer Organization and Design Assessing and Understanding Performance Mansour AL Zuair.

CEN 316Computer Organization and Design

Assessing and Understanding Performance

Mansour AL Zuair

CEN316 - Chapter 1-2 2

Chapter 2: Performance

• Performance: – What is it: measures of performance

• The CPU performance equation:– Execution time as the measure– What affects execution time– Examples

• Popular alternative metrics– Why they don’t work

• Benchmarks• Amdahl's law


Performance is Time

• Time to do the task (execution time)– Execution time, response time, latency

• Tasks per unit time (sec, minute, ...)– Throughput, bandwidth

Plane

Boeing 747

BAD/Sud Concorde

Speed

610 MPH

1350 MPH

DC to Paris

6.5 hours

3 hours

Passengers

470

132

Throughput (PMPH)

286,700

178,200


Throughput and Response Time

• Example– What is the effect of the following changes on throughput

and response time?

» Increasing processor speed?» Increasing the number of processors on the same system

(multitask)?

» Is there any relation between response time and throughput?» What about queuing?


Performance as Response Time

• Performance is most often measured as response time or execution time for some task.

• “X is n times faster than Y” means

• Example– Execution time of program P

» X is 5 sec » Y is 10 sec.

– X is 2 times faster than Y.

nTime(X)Execution

Time(Y)Execution

e(Y)Performanc

e(X)Performanc


What Time to Measure?

• Elapsed time, wall-clock time:– Actual time from start to completion.– Depends on CPU, system, I/O, etc.– Often used in real benchmarks.– Only suitable choice when I/O is included.

• CPU time:– Measure/analyze CPU performance only.– May be suitable when machine is timeshared.– Possibly both user and system component.– User CPU time is our focus for first part of course.

• Elapsed time = CPU time + idle time.– Usually and assuming time is accurately accounted for.


Metrics of performance

• Different performance metrics are appropriate at different levels:

Compiler

Programming Language

Application

DatapathControl

Transistors

ISAISA

Function Units

(millions) of Instructions per second – MIPS(millions) of (F.P.) operations per second – MFLOP/s

Cycles per second (clock rate)

Cycles per Instruction

Answers per monthOperations per second

Instruction Set Architecture


Relating Processor Metrics

• CPU execution time per program =

CPU clock cycles/program Clock cycle time =

CPU clock cycles/program ÷ Clock rate (frequency)• CPU clock cycles/program =

Instructions/program Clock cycles Per Instruction• Clock cycles Per Instruction (CPI) is an average measurement,

it depends on :– ISA, the implementation, and the program measured– CPI = CPU clock cycles/program ÷ Instructions/program – Also, Instructions per clock cycle or IPC = 1 / CPI

• CPU execution time = Instructions CPI Clock cycle


Aspects of CPU Performance

• Instead of reporting execution time in seconds, we often use cycles

• Clock “ticks” indicate when to start activities (one abstraction):

• cycle time = time between ticks = seconds per cycle

• clock rate (frequency) = cycles per second (1 Hz. = 1 cycle/sec)

A 200 Mhz. clock has a cycle time

cycle

seconds

ninstructio

cycles

program

nsinstructio

program

secondstimeCPU

time

snanosecond 5910 610200

1


Example

• Two machines A and B with the same ISA,• A has a clock cycle time of 1ns and a CPI=2 for certain

program. • B has clock cycle time of 2ns and a CPI=1.2 for the same

program.• Which machine is faster and by how much?• CPU time (A) = I x 2 x 1 ns• CPU time (B) = I x 1.2 x 2 ns

1.2 (B) eperformanc CPU

(A) eperformanc CPU


Example of Improving performance

Given the following Information– program run in 10

seconds on computer A, 4 GH clock

– Build a computer that runs the same program in 6 seconds

– Assumptions:» Clock rate can be increased

substantially» Increasing clock rate increases

the clock cycles for this program by 20%

What is the target clock rate?

rateClock

CyclesClock CPU timeCPU

A

AA

secondscycles

10 x 4

CyclesClock CPUseconds 10

9

A

cycles 10 x 40 CyclesClock CPU 9A

rateClock

CyclesClock 1.2xCPU timeCPU

B

AB

rateClock

1.2x40x10seconds 6

B

9Acycles

GHonds

8sec6

cycles 10 x x402.1rateclock

9

B


Organizational Trade-offs

• Instruction set (and hence Instruction Count), CPI, and Clock cycle time interact in complex ways:

Compiler

Programming Language

Application

Datapath

Transistors

ISA

Function Units

Instruction Mix

Cycle Time

CPI


Cycles Per Instruction (CPI)

• CPI = average number of clock cycles per instruction– CPI = Clock Cycles / Instruction Count

• Affected by both :– cost for each instruction type– the frequency of different instructions

» called the instruction mix

• Useful way to compute CPI (for n instruction types):

CountnInstructio

CountnInstructioFwhereFCPICPI i

i

n

1iii


Example comparing code segments

Given the following Info from HW designer

Number of instructions? 5 for sequence 1 and 6 For sequence 2

Which one is faster? CPU clock cycles1 = (2x1)+ (1x2) + (2x3)=10 cyclesCPU clock cycles2 = 9 cycles

What is the CPI for each sequence?CPI1 = 10/5 =2 CPI2 = 9/6=1.5

Instruction class

CPI

A

B

C

1

2

3

Code Sequence

Instruction count for instruction class

A B C

1

2

2

4

1

1

2

1


Example CPI Computation

• RISC processor: register-register ISA:

Typical Mix CPI = 1.5

Instruction Type

Frequency

Fi

Cycles

CPIi

CPI Contr

Fi CPIi

Time %

This Instr

ALU

Load

Store

Branch

50%

20%

10%

20%

1

2

2

2

0.5

0.4

0.2

0.4

33%

27%

13%

27%


Using the CPU Performance Equation

• Example:Consider adding ALU instructions that can have one memory operand to the MIPS ISA to produce MIPSE:– MIPSE = MIPS + ALU instrs with a memory operand.– Initial mix and cycle counts on MIPS:

Instr class Freq Cycles MIPS CPILoad 30% 2 0.6Store 15% 2 0.3ALU op 40% 1 0.4Branch 15% 2 0.3Overall CPI 1.6

– Assume: » CPI of the MIPSE instruction ALU-with-memory-instruction is 2 » Clock cycle 1.25 times the MIPS clock cycle» One half of the load instructions and a corresponding number of ALU

instructions are replaced by ALU-with-memory

– Which machine is faster?


Solution

• Normalize mix to 100 instructions

– can be easier to calculate and enhance intuition

Instruc Cycle Instruc Instruc CycleCycles Count Count Delta Count Count

ALU 1 40 40 -15 25 25Load 2 30 60 -15 15 30Store 2 15 30 15 30Branch 2 15 30 15 30ALU+mem 2 0 0 15 15 30Total 100 160 85 145

MIPS MIPSE

MIPS execution time = 160 cycles CCMIPS

MIPSE execution time = 145 cycles (1.25X CCMIPS)MIPSE takes ((145 1.25) / 160) times as long as MIPSMIPSE is 1.13 performance of MIPS


Alternative Performance Metrics: MIPS

• Use something other than time– Often good intention to find a simple metric

» bigger is better, general measure, summarizes performance

• Most common metric:MIPS (Millions of Instructions Per Second):

• Flaws in using MIPS– Machines with different instruction sets ?– Programs with different instruction mixes ?

» dynamic frequency of instructions

– Can vary inversely with performance!

66 10CPI

RateClock

10Time

CountnInstructioMIPS


Example of Problems

• Consider an optimized and unoptimized version of the same program:

• Here are cycle counts for instructions

Memory Instructions

ALU Instructions

Branch Instruction

FP Instruction

Total Instructions

Unoptimized Program

100M 100M 30M 40M 270M

Optimized Program

50M 50M 30M 40M 170M

Memory Cycles

ALU Cycles Branch Cycles

FP Cycles CPI

Per Instruction

2 1 3 5

Unoptimized Program

200M 100M 90M 200M 2.2

Optimized Program

100M 50M 90M 200M 2.6


Example continued

• Assuming a 200 MHz clock:

– MIPSunoptimized = 200/2.2 = 91

– MIPSoptimized = 200/2.6 = 77

– Performanceunoptimized > Performanceoptimized

• But look at Execution time!

– Execution timeunoptimized = CPI IC / CR = 2.2 270 / 200 = 3s

– Execution timeoptimized = CPI IC / CR = 2.6 170 / 200 = 2.2s

– Performanceoptimized > Performanceunoptimized

• MIPS measurement is inverse of reality!


Another Alternative: MFLOPS

• MFLOPS (Millions of FLoating Operations per Second):– common metric in scientific/engineering and supercomputer

arenas– MFLOPS = Floating point Operations

Time X 106– Machine dependent: what is a floating point op?– often not where time is spent (i.e. not in FP operations)– at best, no better than execution time– at worst, much less informative and more deceptive


What is benchmarks?

• Benchmarks – a set of programs that form a “workload” specifically chosen to measure performance

• SPEC (System Performance Evaluation Cooperative) creates standard sets of benchmarks starting with SPEC89. The latest is SPEC CPU2006 which consists of 12 integer benchmarks (CINT2006) and 17 floating-point benchmarks (CFP2006).

www.spec.org • There are also benchmark collections for power workloads

(SPECpower_ssj2008), for mail workloads (SPECmail2008), for multimedia workloads (mediabench), …


Comparing and Summarizing Performance

How do we summarize the performance for benchmark set with a single number?

First the execution times are normalized giving the “SPEC ratio” (bigger is faster, i.e., SPEC ratio is the inverse of execution time)

The SPEC ratios are then “averaged” using the geometric mean (GM)

• Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

GM = n SPEC ratioi

i = 1

n


Amdahl's Law

• Handy for evaluating impact of a change not tied to CPU performance equation

• Insight: No improvement of a feature enhances performance by more than the use of the feature.

• Suppose that enhancement E accelerates fraction F of a program by a factor S (remainder of the task is unaffected):

FS =

1–F

unenhancedenhanced TimeExecutionS

FF1TimeExecution


Example on Amdahl's Law

• Assume a program runs in 100 seconds on a machine whre multiply operations consumes 80 seconds of this time. How much do we have to improve the speed of multiplication if we want to run the program 5 times faster?

• Solution:– Execution time after improvement =

– 20 seconds = (80/n) + 20 seconds– What is the value of n???

Execution time affected by improvement

A mount of improvement+ Execution time unaffected


Summary : Performance

• Time is the measure of computer performance!– Performance equation includes three parts; all three

together determine performance• Good products created when have:

– Good benchmarks– Good ways to summarize performance

• Will need different performance metrics as well as a different set of applications to benchmark embedded and desktop computers, which are more focused on response time, versus servers, which are more focused on throughput

• Remember Amdahl’s Law: Speedup is limited by unimproved part of program

CEN 316 Computer Organization and Design Assessing and Understanding Performance Mansour AL Zuair.

Documents

Transcript of CEN 316 Computer Organization and Design Assessing and Understanding Performance Mansour AL Zuair.