
Page 1: OLD DAT105 Exercise4

Department of Computer Science & Engineering
Chalmers University of Technology

DAT105: Computer Architecture

Exercise 4 (5.1, 5.2, 5.3, 5.4, 5.5)

By Minh Quang Do

2007-11-29

Page 2: OLD DAT105 Exercise4


Cache Access and Cycle Time Model (http://quid.hpl.hp.com:9081/cacti/)

References:

[1] "CACTI 4.0", David Tarjan et al., HPL-2006-86, HP Laboratories Palo Alto, USA, June 2, 2006

[2] "CACTI: An Enhanced Cache Access and Cycle Time Model", Steven J. E. Wilton et al., TR-1993, Western Research Laboratory, Palo Alto, USA, July 1994

[3] "eCACTI: Enhanced Power Estimation Model for On-chip Caches", Mahesh Mamidipaka et al., CECS Technical Report #04-28, University of California, Irvine, USA, Sept. 14, 2004

[4] "HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects", Yan Zhang et al., TR-CS-2003-05, University of Virginia, Dept. of Computer Science, USA, March 2003

Page 3: OLD DAT105 Exercise4


Cache Organization (CACTI)

Parameters [2]:
• C: cache size (bytes)
• B: block size (bytes)
• A: associativity
• b0: output width (bits)
• baddr: address width (bits)
• Ndwl, Ndbl, Nspd: data-array partitioning parameters
• Ntwl, Ntbl, Ntspd: tag-array partitioning parameters

Page 4: OLD DAT105 Exercise4


CACTI: Valid Transformations (Nspd, Ndbl, Ndwl)

Page 5: OLD DAT105 Exercise4


Cache Organization

Size of a tag field = s - (n + m + w), where:
• s: # of memory address bits
• w: byte offset (2^w = # of bytes per word)
• m: word offset (2^m = # of words per block)
• n: index (2^n = # of sets)

Direct-mapped cache, 4KB:
s = 32, w = 2, m = 0, n = 10 → tag = 20

Page 6: OLD DAT105 Exercise4


Direct-mapped cache, 64KB (4-word blocks):
s = 32, w = 2, m = 2, n = 12 → tag = 16

Page 7: OLD DAT105 Exercise4


Cache Organization

Size of a tag field = s - (n - log2(A) + m + w)
Here n is the index width of the equal-sized direct-mapped cache; an A-way cache has 1/A as many sets, so the actual index is n - log2(A) bits.

Page 8: OLD DAT105 Exercise4


4-way set associative cache, 4KB:
s = 32, w = 2, m = 0, index = n - log2(A) = 10 - 2 = 8 → tag = 22
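All three breakdowns follow the same arithmetic. As a quick check, a minimal Python sketch (the function name and the 4-byte default word size are my own choices, not from the slides):

    import math

    def tag_bits(s, cache_bytes, block_bytes, assoc=1, word_bytes=4):
        # Byte offset within a word, word offset within a block, index into the sets.
        w = int(math.log2(word_bytes))
        m = int(math.log2(block_bytes // word_bytes))
        n = int(math.log2(cache_bytes // (block_bytes * assoc)))
        return s - (n + m + w)

    print(tag_bits(32, 4 * 1024, 4))           # direct-mapped 4KB, 1-word blocks -> 20
    print(tag_bits(32, 64 * 1024, 16))         # direct-mapped 64KB, 4-word blocks -> 16
    print(tag_bits(32, 4 * 1024, 4, assoc=4))  # 4-way 4KB, 1-word blocks -> 22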

Page 9: OLD DAT105 Exercise4


Figure 5.3 (p 292): Memory Hierarchy

• L1: direct-mapped, 8KB
• L2: direct-mapped, 4MB
• L1 and L2 use 64B blocks
• Page size: 8KB
• TLB: direct-mapped, 256 entries
• Virtual address: 64 bits; physical address: 40 bits
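From these parameters the address fields can be worked out directly; a small sketch of the arithmetic (variable names are mine):

    import math

    virt, phys = 64, 40                         # address widths (bits)
    page_offset = int(math.log2(8 * 1024))      # 8KB pages -> 13 bits
    vpn = virt - page_offset                    # 51-bit virtual page number
    tlb_index = int(math.log2(256))             # direct-mapped, 256 entries -> 8 bits
    tlb_tag = vpn - tlb_index                   # 43 bits
    block_offset = int(math.log2(64))           # 64B blocks -> 6 bits
    l1_index = int(math.log2(8 * 1024 // 64))   # 8KB direct-mapped L1 -> 7 bits
    l1_tag = phys - l1_index - block_offset     # 40 - 7 - 6 = 27 bits
    print(page_offset, vpn, tlb_tag, l1_index, l1_tag)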

Page 10: OLD DAT105 Exercise4


SRAM Memory Partitioning

Block diagram of a typical physically partitioned SRAM array (within a bank)

Page 11: OLD DAT105 Exercise4


CACTI algorithm to find the best power x delay product, and area estimation for it (from [2])

Page 12: OLD DAT105 Exercise4


Web Interface CACTI 4.0

Page 13: OLD DAT105 Exercise4


Web Interface CACTI 4.2

http://quid.hpl.hp.com:9081/cacti/

Page 14: OLD DAT105 Exercise4


Web Interface


Page 16: OLD DAT105 Exercise4


Exercise 5.1a

Using CACTI 4.2, for direct-mapped, 2-way and 4-way set associative caches of 32KB with 64B block size, implemented in a 90-nm process (note: no leakage power is reported for 90 nm):

Cache size: 32KB with 64B line; 90 nm (Vdd = 1.04869097076)

                                             1-way          2-way          4-way
No. of banks                                 1              1              1
No. of sets per bank                         512            256            128
Ndwl                                         1              4              16
Ndbl                                         8              4              1
Nspd                                         1              2              2
Ntwl                                         32             16             1
Ntbl                                         4              1              16
Ntspd                                        16             32             1
Access time (ns)                             0.727237609    0.95916641708  0.883799463
Total dynamic read power at max. freq. (W)   0.041520158    0.05737028351  0.149968343
Cycle time (ns)                              0.474678046    0.47078647474  0.421119816
Total area, subbanked (mm^2)                 0.555439619    0.78984622349  0.743759781

Access time: 2-way is ~32% higher than direct-mapped; 4-way is ~21% higher than direct-mapped.
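The two percentages can be re-derived from the access-time row (values as reported by CACTI in the table above):

    access = {"1-way": 0.727237609, "2-way": 0.95916641708, "4-way": 0.883799463}
    for way in ("2-way", "4-way"):
        ratio = access[way] / access["1-way"]
        print(f"{way}: {ratio - 1:.1%} more than direct-mapped")  # ~31.9% and ~21.5%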

Page 17: OLD DAT105 Exercise4


Exercise 5.1b

Using CACTI 4.2, for 2-way set associative caches of 16KB, 32KB and 64KB with 64B block size, implemented in a 90-nm process:

Cache size (2-way) with 64B line; 90 nm (Vdd = 1.04869097076)

                                             16K            32K            64K
No. of banks                                 1              1              1
No. of sets per bank                         128            256            512
Ndwl                                         1              4              4
Ndbl                                         4              4              4
Nspd                                         0.5            2              2
Ntwl                                         8              16             8
Ntbl                                         1              1              1
Ntspd                                        16             32             16
Access time (ns)                             0.8154413713   0.95916641708  1.004488549
Total dynamic read power at max. freq. (W)   0.0668911783   0.05737028351  0.078455944
Cycle time (ns)                              0.4630303771   0.47078647474  0.495774489
Total area, subbanked (mm^2)                 0.3412720341   0.78984622349  1.142408636

Access time: 32K is ~18% higher than 16K; 64K is ~23% higher than 16K.

Page 18: OLD DAT105 Exercise4


Exercise 5.1c (1)

Using CACTI 4.2, for 2-way set associative caches of 8KB, 16KB, 32KB and 64KB with 64B block size, implemented in a 90-nm process:

Cache size (2-way) with 64B line; 90 nm (Vdd = 1.04869097076)

                                 8K             16K            32K            64K
No. of banks                     1              1              1              1
No. of sets per bank             64             128            256            512
Ndwl                             1              1              4              4
Ndbl                             4              4              4              4
Nspd                             0.25           0.5            2              2
Ntwl                             1              8              16             8
Ntbl                             4              1              1              1
Ntspd                            2              16             32             16
Access time (ns)                 0.721949734    0.81544137     0.959166417    1.0044885
Total dynamic read power (W)     0.075579576    0.06689117     0.057370283    0.0784559
Cycle time (ns)                  0.38809874     0.46303037     0.470786474    0.4957744
Total area, subbanked (mm^2)     0.184715483    0.34127203     0.789846223    1.1424086

~38% increase in access time for an 8X increase in size → roughly a log relation.
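Normalizing the access times to the 8KB cache makes the log-like growth explicit; this is the data behind the plot on the next slide:

    access = {8: 0.721949734, 16: 0.81544137, 32: 0.959166417, 64: 1.0044885}
    for kb, t in access.items():
        print(f"{kb:>2}KB: size x{kb // 8}, access time x{t / access[8]:.2f}")
    # 8X the capacity costs only ~1.39X the access time: roughly logarithmic growth.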

Page 19: OLD DAT105 Exercise4


Exercise 5.1c (2)

[Plot: increase in access time, normalized to the 8KB cache, vs. cache size normalized to 8KB (1, 2, 4, 8). ~38% increase in access time for an 8X increase in size → roughly a log relation.]

Page 20: OLD DAT105 Exercise4


Exercise 5.1d

According to Fig. 5.29, a 16 KB 8-way set-associative cache with 64-byte blocks has the lowest miss rate among 16 KB caches, except for a fully associative cache. The current version of CACTI gives the 8-way cache an access time of 0.88 ns, whereas a fully associative cache would have an access time greater than 0.90 ns.


Page 22: OLD DAT105 Exercise4


Exercise 5.2a: AMAT

Assumptions:
• Miss penalty = 20 cycles
• Miss rate (32KB L1 cache, 2-way, single-bank) = 0.0056101

→ AMAT = 0.0056101 x 20 + (1 - 0.0056101) x 1 = 1.106 cycles

Way-predicted cache (16KB, direct-mapped), 85% prediction accuracy, misprediction penalty = 2 cycles:

→ AMAT = 0.0056101 x 20 + (0.85 x 1 + (1 - 0.85) x 2) x (1 - 0.0056101) = 1.26 cycles
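The same AMAT pattern recurs in 5.2c and 5.3; a minimal helper (my own restatement of the slide's arithmetic, in which a miss pays the full penalty and a hit pays the average hit time) reproduces both values:

    def amat(miss_rate, miss_penalty, hit_time=1.0):
        # Average memory access time in cycles, as computed on these slides.
        return miss_rate * miss_penalty + (1 - miss_rate) * hit_time

    mr = 0.0056101
    print(amat(mr, 20))                                # ~1.106 cycles
    print(amat(mr, 20, hit_time=0.85 * 1 + 0.15 * 2))  # ~1.26 cycles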

Page 23: OLD DAT105 Exercise4


Exercise 5.2b

• Using CACTI 4.2, a 16KB direct-mapped cache with 64B block size implemented in a 90-nm process has: access time = 0.66318353 ns; cycle time = 0.36661061 ns; total dynamic power = 0.033881430 W.

• A 32KB 2-way set associative cache with 64B block size implemented in a 90-nm process has: access time = 0.95916641708 ns.

→ Improvement: 0.9591 / 0.6631 = 1.446, i.e. the direct-mapped cache is 44.6% faster.

Page 24: OLD DAT105 Exercise4


Exercise 5.2c: Way-prediction on a data cache

Assumptions:
• Miss penalty = 20 cycles
• Miss rate (32KB L1 cache, 2-way, single-bank) = 0.0056101
• Way-predicted data cache (16KB, direct-mapped), 85% prediction accuracy, misprediction penalty = 15 cycles

→ AMAT = 0.0056101 x 20 + (0.85 x 1 + (1 - 0.85) x 15) x (1 - 0.0056101) = 3.19 cycles

Increase: 3.19 - 1.26 = 1.93 cycles
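The same arithmetic, spelled out with the 15-cycle misprediction penalty:

    mr = 0.0056101
    hit = 0.85 * 1 + 0.15 * 15              # average hit time = 3.10 cycles
    print(mr * 20 + (1 - mr) * hit)         # ~3.19 cycles
    print(mr * 20 + (1 - mr) * hit - 1.26)  # ~1.93 cycles worse than in 5.2a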

Page 25: OLD DAT105 Exercise4


Exercise 5.2d

Using CACTI 4.2, for a 1MB 4-way cache with 64B block size, 144b read-out, 1 bank, 1 read/write port and 30b tags, implemented in a 90-nm process:

Cache size: 1MB with 64B line; 90 nm (Vdd = 1.04869097076)

                                             Normal         Fast           Serial
No. of banks                                 1              1              1
No. of sets per bank                         4096           4096           4096
Ndwl                                         32             8              8
Ndbl                                         8              32             32
Nspd                                         4.5            1.125          1.125
Ntwl                                         4              8              8
Ntbl                                         32             32             32
Ntspd                                        4              16             16
Access time (ns)                             2.542393257    1.715168589    3.500265285
Total dynamic read power at max. freq. (W)   0.360252018    0.611948165    0.302312556
Cycle time (ns)                              0.466345737    0.513336315    0.523116448
Total area, subbanked (mm^2)                 19.71918143    27.28437726    56.92118501

Serial access vs. normal: ~37% increase in access time, ~17% reduction in total dynamic read power.
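The summary line can be re-derived from the Normal and Serial columns of the table:

    acc = {"normal": 2.542393257, "serial": 3.500265285}
    pwr = {"normal": 0.360252018, "serial": 0.302312556}
    print(f"access: {acc['serial'] / acc['normal'] - 1:+.1%}")  # ~ +37.7%
    print(f"power:  {pwr['serial'] / pwr['normal'] - 1:+.1%}")  # ~ -16.1%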

Page 26: OLD DAT105 Exercise4


Exercise 5.3a: Pipelined vs. Banked

• Using CACTI 4.2, a 64KB 2-way cache, 2 banks, with 64B block size implemented in a 90-nm process has:
access time = 0.958448597337 ns; cycle time = 0.47078647474 ns;
total dynamic power = 0.114334683539 W; total area, subbanked = 1.64216420153 mm^2.

Number of potential pipe stages = 0.958 / 0.471 = 2.03

Page 27: OLD DAT105 Exercise4


Exercise 5.3b: Deeper Pipelining

Assumptions (from Fig. 5.29):
• Miss penalty = 40 cycles
• Miss rate (64KB L1 cache, 2-way, 1 bank) = 0.0036625

→ AMAT = 0.00367 x 40 + (1 - 0.00367) x 1 = 1.14 cycles

If pipelining adds 20% to the cache-access (hit) time:

→ AMAT = 0.00367 x 40 + (1 - 0.00367) x 1.2 = 1.34 cycles
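The same AMAT arithmetic, spelled out as a check:

    mr = 0.0036625
    print(mr * 40 + (1 - mr) * 1.0)  # ~1.14 cycles (1-cycle hit)
    print(mr * 40 + (1 - mr) * 1.2)  # ~1.34 cycles (20% longer hit time)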

Page 28: OLD DAT105 Exercise4


Exercise 5.3c

Assumptions:
• 2 banks; a bank conflict causes a 1-cycle delay
• A random distribution of addresses and a steady stream of accesses, so each access has a 50% probability of conflicting with the previous access
• Miss rate (as for a 64KB L1 cache, 2-way, 1 bank) = 0.0036625
• Miss penalty = 20 cycles

→ AMAT = 0.00367 x 20 + (0.5 x 1 + 0.5 x 2) x (1 - 0.00367) = 1.57 cycles
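And the banked case as a quick check:

    mr = 0.0036625
    hit = 0.5 * 1 + 0.5 * 2          # half of the accesses pay a 1-cycle bank conflict
    print(mr * 20 + (1 - mr) * hit)  # ~1.57 cycles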

Page 29: OLD DAT105 Exercise4


Exercise 5.4a: Early restart and critical word first

Assumptions:
• 1MB L2 with 64B block size and a 16B refill path
• L2 can be written with 16B every 4 processor cycles
• Time to receive data from main memory: the first 16B block in 100 cycles, each additional 16B in 16 cycles
• Ignore the cycles needed to transfer the miss request to L2 and the requested data to the L1 cache

• With critical word first and early restart: an L2 cache miss requires 100 cycles.
• Without critical word first and early restart: an L2 cache miss requires 100 + 3 x 16 = 148 cycles.
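A sketch of this miss-latency arithmetic (constants taken from the assumptions above):

    block_bytes, refill_bytes = 64, 16
    first, extra = 100, 16                       # cycles: first 16B, each additional 16B
    chunks = block_bytes // refill_bytes
    with_cwf = first                             # restart as soon as the critical 16B arrives
    without_cwf = first + (chunks - 1) * extra   # wait for the whole 64B block
    print(with_cwf, without_cwf)                 # 100, 148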

Page 30: OLD DAT105 Exercise4


Exercise 5.4b: Early restart and critical word first

It depends on:
1. The contribution of L1 and L2 cache misses to AMAT.
2. The percentage reduction in miss service time provided by critical word first and early restart.

If 2) is approximately the same for both L1 and L2, then the AMAT contributions of L1 and L2 decide how important critical word first and early restart are for each cache.

Page 31: OLD DAT105 Exercise4


Exercise 5.5: Optimizing Write Buffer

Assumptions:
• Write-through L1 and write-back L2 cache
• L2 write data bus: 16B wide; it can perform a write every 4 processor cycles

A) How many bytes wide should each write-buffer entry be?
→ Equal to the width of the L2 write data bus: 16B.

B) What speedup can a merging write buffer give for a stream of 32-bit (4B) stores, assuming each merging-buffer entry is 16B wide?
→ A non-merging buffer takes 4 cycles x (16B / 4B) = 16 cycles.
→ A merging buffer takes 4 cycles to write a full 16B entry.
→ Speedup: 16 / 4 = 4X.
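A quick sketch of the part-B arithmetic (entry width, store size and write cadence taken from the assumptions above):

    entry_bytes, store_bytes, cycles_per_write = 16, 4, 4
    nonmerging = cycles_per_write * (entry_bytes // store_bytes)  # one bus write per 4B store -> 16 cycles
    merging = cycles_per_write                                    # four stores merge into one 16B write -> 4 cycles
    print(nonmerging // merging)                                  # 4X speedup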