
Page 1: OLD DAT105 Exercise4

Department of Computer Science & Engineering
Chalmers University of Technology

DAT105: Computer Architecture

Exercise 4 (5.1, 5.2, 5.3, 5.4, 5.5)

By Minh Quang Do

2007-11-29

Page 2: OLD DAT105 Exercise4


Cache Access and Cycle Time Model (http://quid.hpl.hp.com:9081/cacti/)

References:

[1] "CACTI 4.0", David Tarjan et al., HPL-2006-86, HP Laboratories Palo Alto, USA, June 2, 2006

[2] "CACTI: An Enhanced Cache Access and Cycle Time Model", Steven J. E. Wilton et al., TR-1993, Western Research Laboratory, Palo Alto, USA, July 1994

[3] "eCACTI: Enhanced Power Estimation Model for On-chip Caches", Mahesh Mamidipaka et al., CECS Technical Report #04-28, University of California, Irvine, USA, Sept. 14, 2004

[4] "HotLeakage: A Temperature-Aware Model of Subthreshold and Gate Leakage for Architects", Yan Zhang et al., TR-CS-2003-05, University of Virginia, Dept. of Computer Science, USA, March 2003

Page 3: OLD DAT105 Exercise4


Cache Organization (CACTI)

Parameters [2]:
• C: cache size (bytes)
• B: block size (bytes)
• A: associativity
• b0: output width (bits)
• baddr: address width (bits)
• Ndwl, Ndbl, Nspd: data-array partitioning parameters
• Ntwl, Ntbl, Ntspd: tag-array partitioning parameters

Page 4: OLD DAT105 Exercise4


CACTI: Valid Transformations (Nspd, Ndbl, Ndwl)

Page 5: OLD DAT105 Exercise4


Cache Organization

Size of a tag field = s - (n + m + w), where:
• s: # of memory address bits
• w: byte offset (2^w = # of bytes per word)
• m: word offset (2^m = # of words per block)
• n: index (2^n = # of sets)

Direct-mapped cache, 4KB:
s = 32, w = 2, m = 0, n = 10 → tag = 20

Page 6: OLD DAT105 Exercise4


Direct-mapped cache, 64KB (4-word blocks):
s = 32, w = 2, m = 2, n = 12 → tag = 16

Page 7: OLD DAT105 Exercise4


Cache Organization

Size of a tag field = s - (n - log2(A) + m + w)
Here n is the index width of the equal-sized direct-mapped cache; an A-way cache has 1/A as many sets, so the actual index is n - log2(A) bits.

Page 8: OLD DAT105 Exercise4


4-way set associative cache, 4KB:
s = 32, w = 2, m = 0, index = n - log2(A) = 10 - 2 = 8 → tag = 22
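All three breakdowns follow the same arithmetic. As a quick check, a minimal Python sketch (the function name and the 4-byte default word size are my own choices, not from the slides):

    import math

    def tag_bits(s, cache_bytes, block_bytes, assoc=1, word_bytes=4):
        # Byte offset within a word, word offset within a block, index into the sets.
        w = int(math.log2(word_bytes))
        m = int(math.log2(block_bytes // word_bytes))
        n = int(math.log2(cache_bytes // (block_bytes * assoc)))
        return s - (n + m + w)

    print(tag_bits(32, 4 * 1024, 4))           # direct-mapped 4KB, 1-word blocks -> 20
    print(tag_bits(32, 64 * 1024, 16))         # direct-mapped 64KB, 4-word blocks -> 16
    print(tag_bits(32, 4 * 1024, 4, assoc=4))  # 4-way 4KB, 1-word blocks -> 22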

Page 9: OLD DAT105 Exercise4


Figure 5.3 (p 292): Memory Hierarchy

• L1: direct-mapped, 8KB
• L2: direct-mapped, 4MB
• L1 and L2 use 64B blocks
• Page size: 8KB
• TLB: direct-mapped, 256 entries
• Virtual address: 64 bits; physical address: 40 bits
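From these parameters the address fields can be worked out directly; a small sketch of the arithmetic (variable names are mine):

    import math

    virt, phys = 64, 40                         # address widths (bits)
    page_offset = int(math.log2(8 * 1024))      # 8KB pages -> 13 bits
    vpn = virt - page_offset                    # 51-bit virtual page number
    tlb_index = int(math.log2(256))             # direct-mapped, 256 entries -> 8 bits
    tlb_tag = vpn - tlb_index                   # 43 bits
    block_offset = int(math.log2(64))           # 64B blocks -> 6 bits
    l1_index = int(math.log2(8 * 1024 // 64))   # 8KB direct-mapped L1 -> 7 bits
    l1_tag = phys - l1_index - block_offset     # 40 - 7 - 6 = 27 bits
    print(page_offset, vpn, tlb_tag, l1_index, l1_tag)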

Page 10: OLD DAT105 Exercise4


SRAM Memory Partitioning

Block diagram of a typical physically partitioned SRAM array (within a bank)

Page 11: OLD DAT105 Exercise4


CACTI algorithm to find the best power x delay product, and area estimation for it (from [2])

Page 12: OLD DAT105 Exercise4


Web Interface CACTI 4.0

Page 13: OLD DAT105 Exercise4


Web Interface CACTI 4.2

http://quid.hpl.hp.com:9081/cacti/

Page 14: OLD DAT105 Exercise4


Web Interface


Page 16: OLD DAT105 Exercise4


Exercise 5.1a

Using CACTI 4.2, for direct-mapped, 2-way and 4-way set associative caches of 32KB with 64B block size, implemented in a 90-nm process (note: no leakage power is reported for 90 nm):

Cache size: 32KB with 64B line; 90 nm (Vdd = 1.04869097076)

                                             1-way          2-way          4-way
No. of banks                                 1              1              1
No. of sets per bank                         512            256            128
Ndwl                                         1              4              16
Ndbl                                         8              4              1
Nspd                                         1              2              2
Ntwl                                         32             16             1
Ntbl                                         4              1              16
Ntspd                                        16             32             1
Access time (ns)                             0.727237609    0.95916641708  0.883799463
Total dynamic read power at max. freq. (W)   0.041520158    0.05737028351  0.149968343
Cycle time (ns)                              0.474678046    0.47078647474  0.421119816
Total area, subbanked (mm^2)                 0.555439619    0.78984622349  0.743759781

Access time: 2-way is ~32% higher than direct-mapped; 4-way is ~21% higher than direct-mapped.
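The two percentages can be re-derived from the access-time row (values as reported by CACTI in the table above):

    access = {"1-way": 0.727237609, "2-way": 0.95916641708, "4-way": 0.883799463}
    for way in ("2-way", "4-way"):
        ratio = access[way] / access["1-way"]
        print(f"{way}: {ratio - 1:.1%} more than direct-mapped")  # ~31.9% and ~21.5%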

Page 17: OLD DAT105 Exercise4


Exercise 5.1b

Using CACTI 4.2, for 2-way set associative caches of 16KB, 32KB and 64KB with 64B block size, implemented in a 90-nm process:

Cache size (2-way) with 64B line; 90 nm (Vdd = 1.04869097076)

                                             16K            32K            64K
No. of banks                                 1              1              1
No. of sets per bank                         128            256            512
Ndwl                                         1              4              4
Ndbl                                         4              4              4
Nspd                                         0.5            2              2
Ntwl                                         8              16             8
Ntbl                                         1              1              1
Ntspd                                        16             32             16
Access time (ns)                             0.8154413713   0.95916641708  1.004488549
Total dynamic read power at max. freq. (W)   0.0668911783   0.05737028351  0.078455944
Cycle time (ns)                              0.4630303771   0.47078647474  0.495774489
Total area, subbanked (mm^2)                 0.3412720341   0.78984622349  1.142408636

Access time: 32K is ~18% higher than 16K; 64K is ~23% higher than 16K.

Page 18: OLD DAT105 Exercise4


Exercise 5.1c (1)

Using CACTI 4.2, for 2-way set associative caches of 8KB, 16KB, 32KB and 64KB with 64B block size, implemented in a 90-nm process:

Cache size (2-way) with 64B line; 90 nm (Vdd = 1.04869097076)

                                 8K             16K            32K            64K
No. of banks                     1              1              1              1
No. of sets per bank             64             128            256            512
Ndwl                             1              1              4              4
Ndbl                             4              4              4              4
Nspd                             0.25           0.5            2              2
Ntwl                             1              8              16             8
Ntbl                             4              1              1              1
Ntspd                            2              16             32             16
Access time (ns)                 0.721949734    0.81544137     0.959166417    1.0044885
Total dynamic read power (W)     0.075579576    0.06689117     0.057370283    0.0784559
Cycle time (ns)                  0.38809874     0.46303037     0.470786474    0.4957744
Total area, subbanked (mm^2)     0.184715483    0.34127203     0.789846223    1.1424086

~38% increase in access time for an 8X increase in size → roughly a log relation.
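Normalizing the access times to the 8KB cache makes the log-like growth explicit; this is the data behind the plot on the next slide:

    access = {8: 0.721949734, 16: 0.81544137, 32: 0.959166417, 64: 1.0044885}
    for kb, t in access.items():
        print(f"{kb:>2}KB: size x{kb // 8}, access time x{t / access[8]:.2f}")
    # 8X the capacity costs only ~1.39X the access time: roughly logarithmic growth.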

Page 19: OLD DAT105 Exercise4


Exercise 5.1c (2)

[Plot: increase in access time, normalized to the 8KB cache, vs. cache size normalized to 8KB (1, 2, 4, 8). ~38% increase in access time for an 8X increase in size → roughly a log relation.]

Page 20: OLD DAT105 Exercise4


Exercise 5.1d

According to Fig. 5.29, a 16 KB 8-way set-associative cache with 64-byte blocks has the lowest miss rate among 16 KB caches, except for a fully associative cache. The current version of CACTI gives the 8-way cache an access time of 0.88 ns, whereas a fully associative cache would have an access time greater than 0.90 ns.


Page 22: OLD DAT105 Exercise4


Exercise 5.2a: AMAT

Assumptions:
• Miss penalty = 20 cycles
• Miss rate (32KB L1 cache, 2-way, single-bank) = 0.0056101

→ AMAT = 0.0056101 x 20 + (1 - 0.0056101) x 1 = 1.106 cycles

Way-predicted cache (16KB, direct-mapped), 85% prediction accuracy, misprediction penalty = 2 cycles:

→ AMAT = 0.0056101 x 20 + (0.85 x 1 + (1 - 0.85) x 2) x (1 - 0.0056101) = 1.26 cycles
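The same AMAT pattern recurs in 5.2c and 5.3; a minimal helper (my own restatement of the slide's arithmetic, in which a miss pays the full penalty and a hit pays the average hit time) reproduces both values:

    def amat(miss_rate, miss_penalty, hit_time=1.0):
        # Average memory access time in cycles, as computed on these slides.
        return miss_rate * miss_penalty + (1 - miss_rate) * hit_time

    mr = 0.0056101
    print(amat(mr, 20))                                # ~1.106 cycles
    print(amat(mr, 20, hit_time=0.85 * 1 + 0.15 * 2))  # ~1.26 cycles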

Page 23: OLD DAT105 Exercise4


Exercise 5.2b

• Using CACTI 4.2, a 16KB direct-mapped cache with 64B block size implemented in a 90-nm process has: access time = 0.66318353 ns; cycle time = 0.36661061 ns; total dynamic power = 0.033881430 W.

• A 32KB 2-way set associative cache with 64B block size implemented in a 90-nm process has: access time = 0.95916641708 ns.

→ Improvement: 0.9591 / 0.6631 = 1.446, i.e. the direct-mapped cache is 44.6% faster.

Page 24: OLD DAT105 Exercise4


Exercise 5.2c: Way-prediction on a data cache

Assumptions:
• Miss penalty = 20 cycles
• Miss rate (32KB L1 cache, 2-way, single-bank) = 0.0056101
• Way-predicted data cache (16KB, direct-mapped), 85% prediction accuracy, misprediction penalty = 15 cycles

→ AMAT = 0.0056101 x 20 + (0.85 x 1 + (1 - 0.85) x 15) x (1 - 0.0056101) = 3.19 cycles

Increase: 3.19 - 1.26 = 1.93 cycles
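The same arithmetic, spelled out with the 15-cycle misprediction penalty:

    mr = 0.0056101
    hit = 0.85 * 1 + 0.15 * 15              # average hit time = 3.10 cycles
    print(mr * 20 + (1 - mr) * hit)         # ~3.19 cycles
    print(mr * 20 + (1 - mr) * hit - 1.26)  # ~1.93 cycles worse than in 5.2a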

Page 25: OLD DAT105 Exercise4


Exercise 5.2d

Using CACTI 4.2, for a 1MB 4-way cache with 64B block size, 144b read-out, 1 bank, 1 read/write port and 30b tags, implemented in a 90-nm process:

Cache size: 1MB with 64B line; 90 nm (Vdd = 1.04869097076)

                                             Normal         Fast           Serial
No. of banks                                 1              1              1
No. of sets per bank                         4096           4096           4096
Ndwl                                         32             8              8
Ndbl                                         8              32             32
Nspd                                         4.5            1.125          1.125
Ntwl                                         4              8              8
Ntbl                                         32             32             32
Ntspd                                        4              16             16
Access time (ns)                             2.542393257    1.715168589    3.500265285
Total dynamic read power at max. freq. (W)   0.360252018    0.611948165    0.302312556
Cycle time (ns)                              0.466345737    0.513336315    0.523116448
Total area, subbanked (mm^2)                 19.71918143    27.28437726    56.92118501

Serial access vs. normal: ~37% increase in access time, ~17% reduction in total dynamic read power.
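The summary line can be re-derived from the Normal and Serial columns of the table:

    acc = {"normal": 2.542393257, "serial": 3.500265285}
    pwr = {"normal": 0.360252018, "serial": 0.302312556}
    print(f"access: {acc['serial'] / acc['normal'] - 1:+.1%}")  # ~ +37.7%
    print(f"power:  {pwr['serial'] / pwr['normal'] - 1:+.1%}")  # ~ -16.1%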

Page 26: OLD DAT105 Exercise4


Exercise 5.3a: Pipelined vs. Banked

• Using CACTI 4.2, a 64KB 2-way cache, 2 banks, with 64B block size implemented in a 90-nm process has:
access time = 0.958448597337 ns; cycle time = 0.47078647474 ns;
total dynamic power = 0.114334683539 W; total area, subbanked = 1.64216420153 mm^2.

Number of potential pipe stages = 0.958 / 0.471 = 2.03

Page 27: OLD DAT105 Exercise4


Exercise 5.3b: Deeper Pipelining

Assumptions (from Fig. 5.29):
• Miss penalty = 40 cycles
• Miss rate (64KB L1 cache, 2-way, 1 bank) = 0.0036625

→ AMAT = 0.00367 x 40 + (1 - 0.00367) x 1 = 1.14 cycles

If pipelining adds 20% to the cache-access (hit) time:

→ AMAT = 0.00367 x 40 + (1 - 0.00367) x 1.2 = 1.34 cycles
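The same AMAT arithmetic, spelled out as a check:

    mr = 0.0036625
    print(mr * 40 + (1 - mr) * 1.0)  # ~1.14 cycles (1-cycle hit)
    print(mr * 40 + (1 - mr) * 1.2)  # ~1.34 cycles (20% longer hit time)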

Page 28: OLD DAT105 Exercise4


Exercise 5.3c

Assumptions:
• 2 banks; a bank conflict causes a 1-cycle delay
• A random distribution of addresses and a steady stream of accesses, so each access has a 50% probability of conflicting with the previous access
• Miss rate (as for a 64KB L1 cache, 2-way, 1 bank) = 0.0036625
• Miss penalty = 20 cycles

→ AMAT = 0.00367 x 20 + (0.5 x 1 + 0.5 x 2) x (1 - 0.00367) = 1.57 cycles
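And the banked case as a quick check:

    mr = 0.0036625
    hit = 0.5 * 1 + 0.5 * 2          # half of the accesses pay a 1-cycle bank conflict
    print(mr * 20 + (1 - mr) * hit)  # ~1.57 cycles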

Page 29: OLD DAT105 Exercise4


Exercise 5.4a: Early restart and critical word first

Assumptions:
• 1MB L2 with 64B block size and a 16B refill path
• L2 can be written with 16B every 4 processor cycles
• Time to receive data from main memory: the first 16B block in 100 cycles, each additional 16B in 16 cycles
• Ignore the cycles needed to transfer the miss request to L2 and the requested data to the L1 cache

• With critical word first and early restart: an L2 cache miss requires 100 cycles.
• Without critical word first and early restart: an L2 cache miss requires 100 + 3 x 16 = 148 cycles.
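A sketch of this miss-latency arithmetic (constants taken from the assumptions above):

    block_bytes, refill_bytes = 64, 16
    first, extra = 100, 16                       # cycles: first 16B, each additional 16B
    chunks = block_bytes // refill_bytes
    with_cwf = first                             # restart as soon as the critical 16B arrives
    without_cwf = first + (chunks - 1) * extra   # wait for the whole 64B block
    print(with_cwf, without_cwf)                 # 100, 148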

Page 30: OLD DAT105 Exercise4


Exercise 5.4b: Early restart and critical word first

It depends on:
1. The contribution of L1 and L2 cache misses to AMAT.
2. The percentage reduction in miss service time provided by critical word first and early restart.

If 2) is approximately the same for both L1 and L2, then the AMAT contributions of L1 and L2 decide how important critical word first and early restart are for each cache.

Page 31: OLD DAT105 Exercise4


Exercise 5.5: Optimizing Write Buffer

Assumptions:
• Write-through L1 and write-back L2 cache
• L2 write data bus: 16B wide; it can perform a write every 4 processor cycles

A) How many bytes wide should each write-buffer entry be?
→ Equal to the width of the L2 write data bus: 16B.

B) What speedup can a merging write buffer give for a stream of 32-bit (4B) stores, assuming each merging-buffer entry is 16B wide?
→ A non-merging buffer takes 4 cycles x (16B / 4B) = 16 cycles.
→ A merging buffer takes 4 cycles to write a full 16B entry.
→ Speedup: 16 / 4 = 4X.
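A quick sketch of the part-B arithmetic (entry width, store size and write cadence taken from the assumptions above):

    entry_bytes, store_bytes, cycles_per_write = 16, 4, 4
    nonmerging = cycles_per_write * (entry_bytes // store_bytes)  # one bus write per 4B store -> 16 cycles
    merging = cycles_per_write                                    # four stores merge into one 16B write -> 4 cycles
    print(nonmerging // merging)                                  # 4X speedup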