
CSE2021: Computer Organization
Instructor: Dr. Amir Asif

Department of Computer Science, York University

Handout # 2: Measuring Performance

Topics:
1. Performance: Definition
2. Performance Metrics: CPU Execution Time and Throughput
3. Benchmarks: SPEC’95
4. Alternative Performance Metrics: MIPS and FLOPS

Patterson: Sections 1.4 – 1.9.

2

Analogy with Commercial Airplanes

— To know which of the four planes exhibits the best performance, we need to define a criterion for measuring “performance”.

Performance Criteria     Winner
Speed                    BAC/Sud Concorde
Capacity                 Boeing 747
Range                    Douglas DC-8-50
Throughput               Boeing 747

Airplane            Passenger capacity   Cruising range (miles)   Cruising speed (mph)   Passenger throughput (passengers × mph)
Boeing 777          375                  4630                     610                    375 × 610 = 228,750
Boeing 747          470                  4150                     610                    470 × 610 = 286,700
BAC/Sud Concorde    132                  4000                     1350                   132 × 1350 = 178,200
Douglas DC-8-50     146                  8720                     544                    146 × 544 = 79,424

What does it mean to say that one computer is better than another?

3

Computer Performance (1)

— The performance of a computer is assessed using the following criteria:

1. Response/Execution Time: Elapsed time between the start and the end of one task.

2. Throughput: Total number of tasks finished in a given interval of time.

— An IT manager is interested in a higher overall throughput, while an individual computer user wants a lower execution time for his or her task.

— Using execution time as the criterion, the performance of a machine X is defined as

Performance_X = 1 / Execution Time_X

— The performance ratio (n) between two machines X and Y is defined as

Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n
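As a rough illustration (not part of the original handout), these two definitions can be written as a couple of Python helpers; the 30 s and 45 s figures below are sample inputs only (the same numbers reused in Activity 1):

    # Sketch: performance and speedup computed from execution times (in seconds)
    def performance(execution_time):
        # Performance_X = 1 / Execution Time_X
        return 1.0 / execution_time

    def speedup(time_x, time_y):
        # Performance_X / Performance_Y = Execution Time_Y / Execution Time_X = n
        return time_y / time_x

    print(speedup(30.0, 45.0))   # 1.5, i.e., X is 1.5 times as fast as Y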

4

Computer Performance (2)

Activity 1: If machine X runs a program in 30 seconds and machine Y runs the same program in 45 seconds, how much faster is X than Y?

Activity 2: Discuss which of the following two options is suited to enhance performance from a user’s perspective:

(a) Upgrading a machine to a faster CPU

(b) Adding additional processors to the machine such that multiple processors are used for different tasks.

Repeat the discussion for an IT manager.

5

What is Execution Time (1)?

1. Elapsed Time/Response Time/Wall Clock Time: the clock time that elapses from the start to the end of a program or task.

— Since a computer is a time-shared machine (running several programs simultaneously), the elapsed time depends on the number and complexity of the other programs running on the machine.

2. CPU Time: the execution time that the CPU spends completing a task.

— CPU time does not include time spent running other programs or waiting for I/O to become free.

— CPU time can be further broken down into:

(a) User CPU time: CPU time spent executing the measured program

(b) System CPU time: CPU time spent in the operating system performing tasks on behalf of the measured program

3. The time command in Unix can be used to determine the elapsed time and CPU time for a particular program.

Syntax: time <name_of_program>
Result: 1.180u 0.130s 0:0:44.95 2.9%
        (user CPU time, system CPU time, elapsed time in h:m:s, CPU time as a percentage of elapsed time)
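As a hedged illustration of the elapsed-time versus CPU-time distinction (this sketch is not from the handout; busy_work is just a made-up workload), Python's standard library exposes both kinds of clock:

    import time

    def busy_work(n=1_000_000):
        # burns CPU cycles so that CPU time accumulates
        return sum(i * i for i in range(n))

    wall_start = time.perf_counter()    # wall-clock (elapsed) time
    cpu_start = time.process_time()     # user + system CPU time of this process
    busy_work()
    time.sleep(1.0)                     # waiting counts toward elapsed time, not CPU time
    wall_elapsed = time.perf_counter() - wall_start
    cpu_used = time.process_time() - cpu_start
    print(f"elapsed = {wall_elapsed:.2f} s, CPU = {cpu_used:.2f} s")

The sleep makes the gap obvious: elapsed time exceeds CPU time by roughly one second.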

6

What is Execution Time (2)?

4. Performance based on user CPU time is called CPU performance:

CPU Performance_X = 1 / CPU Execution Time_X

5. Performance based on system time is called system performance:

System Performance_X = 1 / System Execution Time_X

6. Vendors specify the speed of a computer in terms of clock cycles.

For example, a 1 GHz Pentium is generally believed to be faster than a 500 MHz Pentium. We define the clock cycles formally next.

Multiples defined:

K (kilo) = 10^3  ≈ 2^10
M (mega) = 10^6  ≈ 2^20
G (giga) = 10^9  ≈ 2^30
T (tera) = 10^12 ≈ 2^40
P (peta) = 10^15 ≈ 2^50

7

Clock (1)

— All events in a computer are synchronized to the clock signal.
— The clock signal is therefore received by every HW component in a computer.
— Clock cycle time: the duration of 1 cycle of the clock signal.
— Clock cycle rate: the inverse of the clock cycle time.

— CPU execution time is therefore defined as

CPU Execution Time for a program = CPU Clock Cycles for a program × Clock Cycle Time
                                 = CPU Clock Cycles for a program / Clock Rate

[Figure: clock signal waveform showing the binary 0 and binary 1 levels, the leading and trailing edges, and one clock cycle]
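A minimal sketch of the relation above (the 10 billion cycles and 2 GHz clock are assumed numbers, not figures from the handout):

    # CPU Execution Time = CPU Clock Cycles / Clock Rate = CPU Clock Cycles × Clock Cycle Time
    def cpu_time(clock_cycles, clock_rate_hz):
        return clock_cycles / clock_rate_hz

    cycles = 10e9                    # assumed: 10 billion clock cycles
    rate = 2e9                       # assumed: 2 GHz clock rate
    print(cpu_time(cycles, rate))    # 5.0 seconds
    print(cycles * (1 / rate))       # same result via cycles × clock cycle time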

8

Clock (2)

— Timing programs generally return the average number of clock cycles needed per instruction (denoted by CPI)

CPU Execution Time for a program = CPU Clock Cycles for a program / Clock Rate
                                 = [ Σ_{i=1..n} (Instruction Count_i × CPI_i) ] / Clock Rate

Activity 3: For a CPU, instructions from a high-level language are classified into 3 classes. Two SW implementations with the following instruction counts are being considered.

Which implementation executes the higher number of instructions? Which runs faster? What is the CPI for each implementation?

Instruction Class                A   B   C
CPI for the instruction class    1   2   3

Instruction counts per instruction class
                     A   B   C
Implementation 1     2   1   2
Implementation 2     4   1   1
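The following sketch checks Activity 3 numerically (the CPIs and instruction counts are those in the tables above; the 1 GHz clock is an assumption made only to turn cycles into seconds):

    cpi = {"A": 1, "B": 2, "C": 3}
    implementations = {
        "Implementation 1": {"A": 2, "B": 1, "C": 2},
        "Implementation 2": {"A": 4, "B": 1, "C": 1},
    }
    clock_rate = 1e9    # assumed 1 GHz clock

    for name, counts in implementations.items():
        instructions = sum(counts.values())
        cycles = sum(counts[c] * cpi[c] for c in counts)
        avg_cpi = cycles / instructions
        exec_time = cycles / clock_rate
        print(name, instructions, cycles, avg_cpi, exec_time)
    # Implementation 1: 5 instructions, 10 cycles, CPI = 2.0
    # Implementation 2: 6 instructions,  9 cycles, CPI = 1.5 (more instructions, yet faster)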

9

Performance Comparison (1)

To compare performance between two computers,

1. Select a set of programs that represent the workload

2. Run these programs on each computer

3. Compare the average execution time of each computer

4. At times, the geometric mean is used instead.

Activity 4: Based on the arithmetic and geometric means of execution time, which of the two computers is faster?

Execution Time    Computer A   Computer B
Program 1         2            10
Program 2         100          105
Program 3         1000         100
Program 4         25           75

Geometric Mean = ( Π_{i=1..n} Execution Time_i )^(1/n)
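A short sketch for Activity 4 (the execution times are the ones tabulated above; the code itself is only illustrative):

    from math import prod

    times_a = [2, 100, 1000, 25]
    times_b = [10, 105, 100, 75]

    def arithmetic_mean(xs):
        return sum(xs) / len(xs)

    def geometric_mean(xs):
        # (x1 × x2 × ... × xn) ** (1/n)
        return prod(xs) ** (1 / len(xs))

    print(arithmetic_mean(times_a), arithmetic_mean(times_b))   # 281.75 vs 72.5
    print(geometric_mean(times_a), geometric_mean(times_b))     # ≈ 47.3 vs ≈ 53.0

Note how the two summary statistics can rank the machines differently.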

10

Performance Comparison: Benchmarks (2)

— Benchmarks: are standard programs chosen to compare performance between different computers.

— Benchmarks are generally chosen from the applications that a user would typically use the computer to execute.

— Benchmarks can be classified into three categories:

1. Real applications reflecting the expected workload, e.g., multimedia, computer visualization, database, or Macromedia Director applications

2. Small benchmarks are specialized code segments with a mixture of different types of instructions

3. Benchmark suites containing a standard set of real programs and applications. A commonly used suite is SPEC (System Performance Evaluation Cooperative), with different versions available, e.g., the SPEC’89, SPEC’92, SPEC’95, SPEChpc96, SPEC CPU2006, and SPECint2006 suites.

11

Performance Comparison: SPEC Suite (2)

— The SPEC’95 suite has a total of 18 programs (integer and floating point), while SPEC CPU2000 has a total of 26 programs.

— The SPEC ratio for a program is defined as the ratio of the execution time of the program on a Sun Ultra 5/10 (300 MHz processor) to the execution time on the measured machine.

— CINT2000 is the geometric mean of the SPEC ratios obtained from the integer programs. The geometric mean is defined as

CINT2000 = ( Π_{i=1..n} SPEC ratio_i )^(1/n)

— CFP2000 is the geometric mean of the SPEC ratios from the floating-point programs.

Activity 5: Complete the following table to predict the performance of machines A and B.

                  Time on A    Time on B    Normalized to A    Normalized to B
                  (seconds)    (seconds)      A        B         A        B
Program 1         5            25
Program 2         125          25
Arithmetic Mean
Geometric Mean

Geometric Mean

12

Performance Comparison: CINT2006 for Opteron X4 2356 (3)

Name         Description                    Instr. count ×10^9   CPI     Clock cycle time (ns)   Exec time (s)   Reference time (s)   SPECratio (= Ref. time / Exec. time)
perl         Interpreted string processing  2,118                0.75    0.40                    637             9,777                15.3
bzip2        Block-sorting compression      2,389                0.85    0.40                    817             9,650                11.8
gcc          GNU C Compiler                 1,050                1.72    0.40                    724             8,050                11.1
mcf          Combinatorial optimization     336                  10.00   0.40                    1,345           9,120                6.8
go           Go game (AI)                   1,658                1.09    0.40                    721             10,490               14.6
hmmer        Search gene sequence           2,783                0.80    0.40                    890             9,330                10.5
sjeng        Chess game (AI)                2,176                0.96    0.40                    837             12,100               14.5
libquantum   Quantum computer simulation    1,623                1.61    0.40                    1,047           20,720               19.8
h264avc      Video compression              3,102                0.80    0.40                    993             22,130               22.3
omnetpp      Discrete event simulation      587                  2.94    0.40                    690             6,250                9.1
astar        Games/path finding             1,082                1.79    0.40                    773             7,020                9.1
xalancbmk    XML parsing                    1,058                2.70    0.40                    1,143           6,900                6.0
Geometric mean                                                                                                                        11.7

(The programs with unusually high CPI, such as mcf, suffer from high cache miss rates.)

CINT2006 = ( Π_{i=1..n} SPEC ratio_i )^(1/n)
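A sketch showing how one row of the table and the CINT2006 summary figure are obtained (all values are taken from the table above; the code is illustrative only):

    from math import prod

    # Execution time = Instruction count × CPI × Clock cycle time
    # SPECratio      = Reference time / Execution time
    exec_time_perl = 2118e9 * 0.75 * 0.40e-9    # ≈ 635 s (the table rounds to 637 s)
    print(9777 / exec_time_perl)                # ≈ 15.4 (table: 15.3)

    # CINT2006 = geometric mean of the twelve SPECratios
    ratios = [15.3, 11.8, 11.1, 6.8, 14.6, 10.5, 14.5, 19.8, 22.3, 9.1, 9.1, 6.0]
    print(prod(ratios) ** (1 / len(ratios)))    # ≈ 11.7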

13

Improving Performance

Performance of a CPU can be improved by:

1. Increasing the clock rate (decreasing the clock cycle time)

2. Enhancements in the Compiler to decrease the instruction count in a program

3. Improvement in the CPU to decrease the clock cycles per instruction (CPI)

Unfortunately factors (1 – 3) are not independent. For example, if you increase the clock frequency then the CPI may also increase.

CPU Execution Time for a program = ( Instruction Count in a program × CPI ) / Clock Rate

14

Power Trends (1)

— One way of enhancing performance is to increase the clock rate (implying a reduction in clock cycle time).

— A consequence of increasing clock rate is an increase in power dissipation.

15

Power Trends (2)

— Dynamic power dissipation is computed from the expression

Power = Capacitive load × Voltage² × Frequency

— In the previous figure, note that while clock rates increased by a factor of about 100 (25 MHz to 2667 MHz), power dissipation increased by only a factor of about 20 (4.9 W to 95 W).

— This is largely a consequence of the operating voltage in CMOS technology going down from 5 V to 1 V.

Activity: What is the impact on dynamic power dissipation if a new processor reduces the capacitive load, voltage, and clock frequency each by 15%?
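A sketch of the activity's computation, assuming "reduced by 15%" means each quantity becomes 0.85 of its old value:

    # Power = Capacitive load × Voltage^2 × Frequency
    def dynamic_power(cap_load, voltage, frequency):
        return cap_load * voltage ** 2 * frequency

    old = dynamic_power(1.0, 1.0, 1.0)       # old design, normalized to 1
    new = dynamic_power(0.85, 0.85, 0.85)    # every factor scaled to 85%
    print(new / old)                         # 0.85 ** 4 ≈ 0.52, i.e., roughly half the power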

16

Uniprocessor Performance

Constrained by power, instruction-level parallelism, memory latency

17

SPEC Power Benchmark

— SPECpower reports the power consumption of servers at different workload levels, in 10% increments, over a period of time.

Target Load %         Performance (ssj_ops/sec)   Average Power (Watts)
100%                  231,867                     295
90%                   211,282                     286
80%                   185,803                     275
70%                   163,427                     265
60%                   140,160                     256
50%                   118,324                     246
40%                   92,035                      233
30%                   70,500                      222
20%                   47,126                      206
10%                   23,066                      180
0%                    0                           141
Overall sum           1,283,590                   2,605
Σ ssj_ops / Σ power                               493

Overall ssj_ops per Watt = ( Σ_{i=0..10} ssj_ops_i ) / ( Σ_{i=0..10} power_i )
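A sketch of the summary metric (workload and power values are the ones from the table; the code is only an illustration):

    ssj_ops = [231867, 211282, 185803, 163427, 140160, 118324, 92035, 70500, 47126, 23066, 0]
    power_w = [295, 286, 275, 265, 256, 246, 233, 222, 206, 180, 141]

    # Overall ssj_ops per Watt = (sum of ssj_ops over all load levels) / (sum of power over all load levels)
    print(sum(ssj_ops) / sum(power_w))    # ≈ 493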

18

PITFALL: Amdahl’s Law (1)

Improving one aspect of a computer and expecting a proportional improvement in overall performance.

Suppose a program runs in 100 s on a computer, with multiply operations responsible for 80% of that time. How much must the speed of multiplication be improved to make the program run twice as fast? Five times as fast?

Solution: Using

T_improved = T_affected / improvement factor + T_unaffected

For the program to run 2 times faster: 100/2 = 80/n + 20, so n = 80/30 = 8/3; the speed of multiplication must be improved by a factor of 8/3.

Running 5 times faster is not possible: even with infinitely fast multiplication, the unaffected 20 s alone already equals the 100/5 = 20 s target.
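A sketch reproducing the two answers (100 s total run time, of which 80 s is affected by the improvement):

    # T_improved = T_affected / improvement_factor + T_unaffected
    def improved_time(t_affected, t_unaffected, factor):
        return t_affected / factor + t_unaffected

    # Twice as fast (target 50 s): 50 = 80/n + 20  =>  n = 80/30 = 8/3
    print(improved_time(80, 20, 8 / 3))           # 50.0 s
    # Five times as fast (target 20 s): even an infinite speedup leaves the 20 s unaffected part
    print(improved_time(80, 20, float("inf")))    # 20.0 s, so a 5x overall speedup is unattainable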

19

PITFALL: MIPS as a Performance Metric (2)

Do not use MIPS (million instructions per second) as a performance metric

MIPS = Instruction Count in a program / ( Execution time × 10^6 )

Activity 6: For a CPU, instructions from a high-level language are classified into 3 classes. Two SW implementations with the following instruction counts are being considered.

Assuming that the clock rate is 500 MHz, which code sequence runs faster based on (a) MIPS and (b) execution time?

Instruction Class                A   B   C
CPI for the instruction class    1   2   3

Instruction counts (in billions) per instruction class
                     A    B   C
Implementation 1     5    1   1
Implementation 2     10   1   1
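A sketch checking Activity 6 (instruction counts and CPIs from the tables above, 500 MHz clock as stated):

    cpi = {"A": 1, "B": 2, "C": 3}
    counts_billion = {
        "Implementation 1": {"A": 5, "B": 1, "C": 1},
        "Implementation 2": {"A": 10, "B": 1, "C": 1},
    }
    clock_rate = 500e6    # 500 MHz

    for name, counts in counts_billion.items():
        instr = sum(counts.values()) * 1e9
        cycles = sum(counts[c] * cpi[c] for c in counts) * 1e9
        exec_time = cycles / clock_rate
        mips = instr / (exec_time * 1e6)
        print(name, exec_time, mips)
    # Implementation 1: 20 s, 350 MIPS
    # Implementation 2: 30 s, 400 MIPS (higher MIPS, yet slower)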

20

PITFALLS: Do Not’s (3)

3. Do not use MFLOPS (million floating-point operations per second) as a performance metric.

4. Do not use peak performance as a performance metric.

Example:

i860 @ 50 MHz: 100 MFLOPS and 150 MOPS

MIPS R3000: 16 MFLOPS and 33 MOPS

Running the SPEC benchmarks, the R3000 was 15% faster in execution time based on the geometric mean.

MFLOPS = Number of floating-point operations in a program / ( Execution time × 10^6 )

21

PITFALLS: Do Not’s (4)

5. Do not use synthetic (vendor-created) benchmarks to predict performance.

— Commonly used synthetic benchmarks are Whetstone and Dhrystone.
— Whetstone was based on Algol programs in an engineering environment and was later converted to Fortran.
— Dhrystone was written in Ada for systems programming environments and was later converted to C.
— Such synthetic benchmarks do not reflect the applications typically run by a user.

6. Do not use the arithmetic mean of execution times to predict performance. Geometric means provide better estimates.