Computer Architecture Exercises with Solutions


Transcript of Computer Architecture Exercises with Solutions

Page 1: Computer Architecture Exercises with Solutions

Stalls and performance

• Stalls impede the progress of a pipeline and cause it to deviate from one instruction executing per clock cycle.

• CPI pipelined = Ideal CPI + Pipeline stall cycles per instruction

– = 1 + Pipeline stall cycles per instruction

• Ignoring overhead and assuming stages are balanced:

$$\text{Speedup} = \frac{\text{Pipeline depth}}{1 + \text{Pipeline stall cycles per instruction}}$$

• Ideally (no stalls), speedup is equal to the number of pipeline stages.

Stalls occur because of hazards!
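The relations on this slide can be checked numerically. The sketch below assumes the standard textbook form, speedup = pipeline depth / (1 + stall cycles per instruction), with an ideal CPI of 1; the function names and the 5-stage example are illustrative and do not come from the slides.

```python
def pipelined_cpi(stall_cycles_per_instr, ideal_cpi=1.0):
    """CPI of a pipelined machine: ideal CPI plus average stall cycles."""
    return ideal_cpi + stall_cycles_per_instr


def pipeline_speedup(depth, stall_cycles_per_instr):
    """Speedup over the unpipelined machine, ignoring overhead and
    assuming balanced stages and an ideal CPI of 1."""
    return depth / (1.0 + stall_cycles_per_instr)


# Hypothetical 5-stage pipeline with 0.25 stall cycles per instruction.
print(pipelined_cpi(0.25))        # 1.25
print(pipeline_speedup(5, 0.25))  # 4.0
print(pipeline_speedup(5, 0.0))   # 5.0 (no stalls: speedup equals depth)
```

With no stalls the speedup equals the number of stages; every stall cycle per instruction enlarges the denominator and pulls the speedup below that ideal.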

1Computer Architecture

Page 2: Computer Architecture Exercises with Solutions

Computer Performance

“X is N% faster than Y.”

$$\frac{\text{Execution Time of Y}}{\text{Execution Time of X}} = 1 + \frac{N}{100}$$

Amdahl’s law for overall speedup

$$\text{Overall Speedup} = \frac{1}{(1 - F) + \frac{F}{S}}$$

F = The fraction enhanced

S = The speedup of the enhanced fraction
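As a minimal sketch, both definitions on this slide can be written as small Python functions. The names `amdahl_speedup` and `percent_faster` are chosen here for illustration, and the execution times in the last line are made-up values.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f of execution time is sped up by s."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if s <= 0:
        raise ValueError("s must be positive")
    return 1.0 / ((1.0 - f) + f / s)


def percent_faster(time_y, time_x):
    """N such that X is N% faster than Y: time_y / time_x = 1 + N/100."""
    return (time_y / time_x - 1.0) * 100.0


print(round(amdahl_speedup(0.9, 10), 2))  # 5.26
print(round(percent_faster(15.0, 10.0)))  # 50
```

Note that F is a fraction of the original execution time, so even a very large S cannot push the overall speedup past 1 / (1 - F).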


Page 3: Computer Architecture Exercises with Solutions

Using Amdahl’s law

Overall speedup if we make 90% of a program run 10 times faster.

F = 0.9, S = 10

$$\text{Overall Speedup} = \frac{1}{(1 - 0.9) + \frac{0.9}{10}} = \frac{1}{0.1 + 0.09} = 5.26$$

Overall speedup if we make 80% of a program run 20% faster.

F = 0.8, S = 1.2

$$\text{Overall Speedup} = \frac{1}{(1 - 0.8) + \frac{0.8}{1.2}} = \frac{1}{0.2 + 0.667} \approx 1.15$$


Page 4: Computer Architecture Exercises with Solutions

[1]. You have a system that contains a special processor for doing floating-point operations. You have determined that 50% of your computations can use the floating-point processor. The speedup of the floating-point processor is 15.

a) Overall speedup achieved by using the floating-point processor.

F = 0.5, S = 15

$$\text{Overall speedup} = \frac{1}{(1 - 0.5) + \frac{0.5}{15}} = \frac{1}{0.5 + 0.033} \approx 1.875$$

b) Overall speedup achieved if you modify the compiler so that 75% of the computations can use the floating-point processor.

F = 0.75, S = 15

$$\text{Overall speedup} = \frac{1}{(1 - 0.75) + \frac{0.75}{15}} = \frac{1}{0.25 + 0.05} = 3.33$$


Page 5: Computer Architecture Exercises with Solutions

c) What fraction of the computations should be able to use the floating-point processor in order to achieve an overall speedup of 2.25?

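F = ?, S = 15

$$2.25 = \frac{1}{(1 - F) + \frac{F}{15}} = \frac{15}{15 - 15F + F} = \frac{15}{15 - 14F}$$

$$2.25 \times (15 - 14F) = 15$$

$$33.75 - 31.5F = 15$$

$$31.5F = 18.75$$

$$F = \frac{18.75}{31.5} = 0.595 \text{ or about } 60\%$$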
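The same result follows from solving Amdahl's law for F in closed form: F = (1 - 1/Speedup) / (1 - 1/S). The sketch below assumes that rearrangement; the function name `required_fraction` is hypothetical.

```python
def required_fraction(target_speedup, s):
    """Fraction F that must use the enhancement to reach target_speedup,
    given enhancement speedup s (Amdahl's law solved for F)."""
    if s <= 1.0:
        raise ValueError("s must exceed 1")
    f = (1.0 - 1.0 / target_speedup) / (1.0 - 1.0 / s)
    if not 0.0 <= f <= 1.0:
        raise ValueError("target speedup is not reachable with this s")
    return f


print(round(required_fraction(2.25, 15), 3))  # 0.595
```

The range check reflects the limit of the law: with S = 15 no fraction can give an overall speedup above 15, so such targets are rejected.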


Page 6: Computer Architecture Exercises with Solutions

[2]. You have a system that contains a special processor for doing floating-point operations. You have determined that 60% of your computations can use the floating-point processor. When a program uses the floating-point processor, it runs 40% faster than when it does not use it.

a) Overall speedup by using the floating-point processor.

F = 0.6, S = 1.4

$$\text{Overall speedup} = \frac{1}{(1 - 0.6) + \frac{0.6}{1.4}} = \frac{1}{0.4 + 0.429} \approx 1.21$$

b) In order to improve the speedup you are considering two options:

• Option 1: Modifying the compiler so that 70% of the computations can use the floating-point processor. Cost of this option is $50K.

• Option 2: Modifying the floating-point processor. With this change, a program runs 100% faster when it uses the floating-point processor than when it does not. Assume in this case that 50% of the computations can use the floating-point processor. Cost of this option is $60K.

Which option would you recommend? Justify your answer quantitatively.

Page 7: Computer Architecture Exercises with Solutions

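Option 1: F = 0.7, S = 1.4

$$\text{Overall speedup} = \frac{1}{(1 - 0.7) + \frac{0.7}{1.4}} = \frac{1}{0.3 + 0.5} = 1.25$$

Option 2: F = 0.5, S = 2

$$\text{Overall speedup} = \frac{1}{(1 - 0.5) + \frac{0.5}{2}} = \frac{1}{0.5 + 0.25} = 1.33$$

Option 1: Cost / Speedup = $50K / 1.25 = $40K

Option 2: Cost / Speedup = $60K / 1.33 = $45.1K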

Therefore, Option 1 is better because it has a smaller Cost/Speedup ratio.
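The comparison can be reproduced with a short script. This is a sketch under the exercise's own figures; `amdahl_speedup` is redefined so the block runs on its own, and the tuple layout is an illustrative choice. Because the script does not round the speedup before dividing, Option 2 comes out at $45.0K rather than $45.1K.

```python
def amdahl_speedup(f, s):
    """Overall speedup when a fraction f is sped up by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)


# (name, fraction enhanced, enhancement speedup, cost in $K) from Exercise 2b
options = [
    ("Option 1", 0.7, 1.4, 50.0),
    ("Option 2", 0.5, 2.0, 60.0),
]


def cost_per_speedup(option):
    """Cost in $K divided by the overall speedup the option achieves."""
    _, f, s, cost = option
    return cost / amdahl_speedup(f, s)


for opt in options:
    print(opt[0], round(amdahl_speedup(opt[1], opt[2]), 2),
          round(cost_per_speedup(opt), 1))

best = min(options, key=cost_per_speedup)
print("Recommended:", best[0])  # Recommended: Option 1
```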


Page 8: Computer Architecture Exercises with Solutions

[3]. Suppose you have a load/store computer with the following instruction mix:

Operation    Frequency    No. of clock cycles
ALU ops      35%          1
Loads        25%          2
Stores       15%          2
Branches     25%          3

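a) Compute the CPI.

$$CPI_{old} = (0.35 \times 1) + (0.25 \times 2) + (0.15 \times 2) + (0.25 \times 3) = 1.9$$

b) We observe that 35% of the ALU ops are paired with a load, and we propose to replace these ALU ops and their loads with a new instruction. The new instruction takes 1 clock cycle. With the new instruction added, branches take 5 clock cycles. Compute the CPI for the new version.

The paired ALU ops make up $0.35 \times 0.35 = 0.1225$ of all instructions. Each replacement removes one ALU op and one load and adds one new instruction.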


Page 9: Computer Architecture Exercises with Solutions

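The new instruction count is $(1 - 0.1225) = 0.8775$ of the old one, so:

$$CPI_{new} = \frac{(0.35 - 0.1225) \times 1 + (0.25 - 0.1225) \times 2 + 0.15 \times 2 + 0.25 \times 5 + 0.1225 \times 1}{1 - 0.1225} = \frac{2.155}{0.8775} \approx 2.46$$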
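Both CPI values can be verified with a weighted average over the instruction mix. In this sketch the dictionary layout and the names `old_mix`, `new_mix`, and `alu_load` are illustrative; dividing by the sum of the frequencies handles the renormalization to 0.8775 of the old instruction count.

```python
def cpi(mix):
    """Weighted-average CPI; mix maps name -> (frequency, cycles).
    Frequencies need not sum to 1: the result is normalized by their sum."""
    total = sum(freq for freq, _ in mix.values())
    return sum(freq * cycles for freq, cycles in mix.values()) / total


old_mix = {
    "alu": (0.35, 1), "load": (0.25, 2), "store": (0.15, 2), "branch": (0.25, 3),
}

fused = 0.35 * 0.35  # ALU ops paired with a load: 0.1225 of old instructions
new_mix = {
    "alu": (0.35 - fused, 1),
    "load": (0.25 - fused, 2),
    "store": (0.15, 2),
    "branch": (0.25, 5),      # branches now take 5 cycles
    "alu_load": (fused, 1),   # the new combined instruction
}

print(round(cpi(old_mix), 2))  # 1.9
print(round(cpi(new_mix), 2))  # 2.46
```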

c) If the clock of the old version is 20% faster than that of the new version, which version has the shorter CPU execution time, and by what percentage?

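$$\frac{CCT_{new}}{CCT_{old}} = 1.2 \quad\Rightarrow\quad CCT_{new} = 1.2 \times CCT_{old}$$

$$\text{CPU Exec. Time}_{old} = IC_{old} \times 1.9 \times CCT_{old}$$

$$\text{CPU Exec. Time}_{new} = 0.8775 \times IC_{old} \times 2.46 \times 1.2 \times CCT_{old} = 2.59 \times IC_{old} \times CCT_{old}$$

$$\frac{2.59}{1.9} = 1.36$$

So the old version is faster, by 36%.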
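The comparison rests on CPU execution time = IC × CPI × clock cycle time. A minimal sketch in normalized units follows; setting the old instruction count and old cycle time to 1 is an assumption made only to keep the numbers readable.

```python
def cpu_time(instruction_count, cpi, cycle_time):
    """CPU execution time = IC x CPI x clock cycle time."""
    return instruction_count * cpi * cycle_time


# Normalized units: IC_old = 1 and CCT_old = 1 (assumed for illustration).
time_old = cpu_time(1.0, 1.9, 1.0)
# New version: 12.25% fewer instructions, CPI 2.46, clock cycle 20% longer.
time_new = cpu_time(0.8775, 2.46, 1.2)

ratio = time_new / time_old
print(round(time_new, 2))  # 2.59
print(round(ratio, 2))     # 1.36, so the old version is about 36% faster
```

The new instruction lowers the instruction count, but the higher CPI and the slower clock outweigh that saving.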

Page 10: Computer Architecture Exercises with Solutions

[4]. For the purpose of solving a given application problem, you benchmark a program on two computer systems. On system A, the object code executed 80 million Arithmetic Logic Unit operations (ALU ops), 40 million load instructions, and 25 million branch instructions. On system B, the object code executed 50 million ALU ops, 50 million loads, and 40 million branch instructions. In both systems, each ALU op takes 1 clock cycle, each load takes 3 clock cycles, and each branch takes 5 clock cycles.

a) Compute the relative frequency of occurrence of each type of instruction executed in both systems.

Instruction type    System A            System B
ALU ops             80/145 ≈ 0.55       50/140 ≈ 0.36
Loads               40/145 ≈ 0.28       50/140 ≈ 0.36
Branches            25/145 ≈ 0.17       40/140 ≈ 0.28


Page 11: Computer Architecture Exercises with Solutions

b) Find the CPI for each system.

$$CPI_A = (0.55 \times 1) + (0.28 \times 3) + (0.17 \times 5) = 2.24$$

$$CPI_B = (0.36 \times 1) + (0.36 \times 3) + (0.28 \times 5) = 2.84$$

c) Assuming that the clock on system B is 10% faster than the clock on system A, which system is faster for the given application problem, and by what percentage?

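$$\frac{CCT_A}{CCT_B} = 1.1 \quad\Rightarrow\quad CCT_A = 1.1 \times CCT_B$$

$$\text{CPU Exec. Time}_A = 145 \times 10^6 \times 2.24 \times 1.1 \times CCT_B = 357.28 \times 10^6 \times CCT_B$$

$$\text{CPU Exec. Time}_B = 140 \times 10^6 \times 2.84 \times CCT_B = 397.6 \times 10^6 \times CCT_B$$

$$\frac{397.6}{357.28} = 1.11$$

So System A is faster, by 11%.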
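Exercise 4 can also be worked directly from the raw instruction counts, which avoids rounding the frequencies first. The sketch below is one way to do that, with illustrative names. Because nothing is rounded along the way, it gives CPI_B ≈ 2.86 and a ratio of about 1.12 rather than the 2.84 and 1.11 obtained from two-digit frequencies; the conclusion that System A is faster is unchanged.

```python
CYCLES = {"alu": 1, "load": 3, "branch": 5}

# Instruction counts in millions, from the exercise statement.
counts_a = {"alu": 80, "load": 40, "branch": 25}
counts_b = {"alu": 50, "load": 50, "branch": 40}


def total_cycles(counts):
    """Total clock cycles (in millions) for the given instruction counts."""
    return sum(counts[kind] * CYCLES[kind] for kind in counts)


def cpi(counts):
    """Average cycles per instruction."""
    return total_cycles(counts) / sum(counts.values())


# Time in units of (million cycles x CCT_B); CCT_A = 1.1 x CCT_B.
time_a = total_cycles(counts_a) * 1.1
time_b = total_cycles(counts_b) * 1.0

print(round(cpi(counts_a), 2), round(cpi(counts_b), 2))  # 2.24 2.86
print(round(time_b / time_a, 2))                         # 1.12
```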

Page 12: Computer Architecture Exercises with Solutions

A common memory hierarchy

Level            Capacity      Access time               Cost
CPU Registers    100s Bytes    <10s ns
Cache            K Bytes       10-100 ns                 1-0.1 cents/bit
Main Memory      M Bytes       200 ns-500 ns             $.0001-.00001 cents/bit
Disk             G Bytes       10 ms (10,000,000 ns)     10^-5 - 10^-6 cents/bit
Tape             infinite      sec-min                   10^-8

Registers → Cache → Memory → Disk → Tape (upper level to lower level: upper levels are faster, lower levels are larger)

Page 13: Computer Architecture Exercises with Solutions


Average Memory Access Time

AMAT = (Hit Time) + (1 - h) x (Miss Penalty)

• Hit time:

– basic time of every access.

• Hit rate (h):

– fraction of accesses that hit

• Miss penalty:

– extra time to fetch a block from the lower level, including the time to replace it in the CPU

• Introduces caches to improve hit time.
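The AMAT formula translates directly into code. The cycle counts and hit rate in this sketch are hypothetical values chosen for illustration, not figures from the slides.

```python
def amat(hit_time, hit_rate, miss_penalty):
    """Average memory access time = hit time + (1 - h) x miss penalty."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_time + (1.0 - hit_rate) * miss_penalty


# Hypothetical numbers: 1-cycle hit, 95% hit rate, 100-cycle miss penalty.
print(round(amat(1.0, 0.95, 100.0), 2))  # 6.0
```

Even a 5% miss rate makes the average access six times slower than a hit here, which is why the miss penalty dominates when the lower level is slow.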


Page 14: Computer Architecture Exercises with Solutions

Second-level caches

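• Introduces a new definition of AMAT:

$$\text{AMAT} = \text{Hit Time}_{L1} + \text{Miss Rate}_{L1} \times \text{Miss Penalty}_{L1}$$

– where

$$\text{Miss Penalty}_{L1} = \text{Hit Time}_{L2} + \text{Miss Rate}_{L2} \times \text{Miss Penalty}_{L2}$$

• So the 2nd-level miss rate is measured over 1st-level cache misses…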
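A minimal sketch of the two-level formula follows. The L2 miss rate is taken as the local miss rate, meaning the fraction of L1 misses that also miss in L2; all numeric values are hypothetical.

```python
def amat_two_level(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2,
                   miss_penalty_l2):
    """AMAT with a second-level cache.
    The L1 miss penalty is itself the average access time of L2."""
    miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * miss_penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1


# Hypothetical: L1 hit 1 cycle, 5% L1 misses; L2 hit 10 cycles,
# 20% of L1 misses also miss in L2, 100 cycles to main memory.
print(round(amat_two_level(1, 0.05, 10, 0.2, 100), 2))  # 2.5
```

With these numbers only 0.05 × 0.2 = 1% of all accesses reach main memory, which is the global L2 miss rate.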
