OLD DAT105 Exercise4
Department of Computer Science & Engineering
Chalmers University of Technology
DAT105: Computer Architecture
Exercise 4 (5.1, 5.2, 5.3, 5.4, 5.5)
By
Minh Quang Do
2007-11-29
11/8/2007 2
Cache Access and Cycle Time Model (http://quid.hpl.hp.com:9081/cacti/)
References:
[1] "CACTI 4.0", David Tarjan et al., HPL-2006-86, HP Laboratories Palo Alto, USA, June 2, 2006
[2] "CACTI: An Enhanced Cache Access and Cycle Time Model", Steven J.E. Wilton et al., TR-1993, Western Research Laboratory, Palo Alto, USA, July 1994
[3] "eCACTI: Enhanced Power Estimation Model for On-chip Caches", Mahesh Mamidipaka et al., CECS Technical Report #04-28, University of California, Irvine, USA, Sept. 14, 2004
[4] "HotLeakage: A temperature-aware model of subthreshold and gate leakage for architects", Yan Zhang et al., TR-CS-2003-05, University of Virginia, Dept. of Comp. Science, USA, March 2003
Cache Organization(CACTI)
Parameters: [2]
• C: cache size (Bytes)
• B: block size (Bytes)
• A: associativity
• b0: output width (bits)
• baddr: address width (bits)
• Ndwl, Ndbl, Nspd
• Ntwl, Ntbl, Ntspd
CACTI Valid Transformation (Nspd, Ndbl, Ndwl)
Cache Organization
Size of a Tag field = s – (n + m + w)
Where:
• s: # of memory address bits
• w: Byte offset (2^w = # of bytes per word)
• m: word offset (2^m = # of words per block)
• n: Index (2^n = # of sets)
Direct-mapped cache: 4KB
s=32, w=2, m=0, n=10 -> tag = 32 – (10 + 0 + 2) = 20
Direct-mapped cache: 64KB with 4-word (16B) blocks
s=32, w=2, m=2, n=12 -> tag = 32 – (12 + 2 + 2) = 16
Cache Organization
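Size of a Tag field = s – (n – log2(A) + m + w)
Where A is the associativity and 2^n = # of blocks in the cache, so the set index has n – log2(A) bits.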
4-way set-associative cache: 4KB
s=32, w=2, m=0, set index = 10 – log2(4) = 8 bits -> tag = 32 – (8 + 0 + 2) = 22
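The tag-size rule for both the direct-mapped and the set-associative case can be checked with a few lines of Python. This is a minimal sketch: the function name `tag_bits` and its parameters are illustrative and not part of the slides, and it assumes every size is a power of two.

```python
def log2_int(n):
    """Exact log2 of a power of two."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1


def tag_bits(address_bits, cache_bytes, block_bytes, associativity=1):
    """Return (tag, set_index, block_offset) widths in bits."""
    num_sets = cache_bytes // (block_bytes * associativity)
    block_offset = log2_int(block_bytes)   # m + w in the slide notation
    set_index = log2_int(num_sets)         # index bits actually used
    tag = address_bits - set_index - block_offset
    return tag, set_index, block_offset


# Direct-mapped 4KB cache, one 4-byte word per block (w=2, m=0).
direct_mapped = tag_bits(32, 4 * 1024, 4)
# 4-way set-associative 4KB cache with the same block size.
four_way = tag_bits(32, 4 * 1024, 4, associativity=4)

print(direct_mapped)  # (20, 10, 2)
print(four_way)       # (22, 8, 2)
```

Going from direct-mapped to 4-way at the same capacity removes log2(4) = 2 index bits, and those two bits move into the tag.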
Figure 5.3 (p 292): Memory Hierarchy
• L1: direct-mapped, 8KB
• L2: direct-mapped, 4MB
• L1 and L2 use 64B blocks
• Page size: 8KB
• TLB: direct-mapped with 256 entries
• Virtual address: 64b; physical address: 40b
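The address field widths implied by these parameters can be derived directly. The sketch below uses only the sizes listed for this hierarchy. It assumes that both caches are physically tagged and that the TLB is indexed with the low bits of the virtual page number. The figure itself is not reproduced in this transcript, so the constant and variable names are illustrative.

```python
def log2_int(n):
    """Exact log2 of a power of two."""
    assert n > 0 and n & (n - 1) == 0, "expected a power of two"
    return n.bit_length() - 1


VIRTUAL_BITS, PHYSICAL_BITS = 64, 40
PAGE_BYTES = 8 * 1024
TLB_ENTRIES = 256
BLOCK_BYTES = 64
L1_BYTES = 8 * 1024
L2_BYTES = 4 * 1024 * 1024

page_offset = log2_int(PAGE_BYTES)                 # bits untouched by translation
vpn = VIRTUAL_BITS - page_offset                   # virtual page number
tlb_index = log2_int(TLB_ENTRIES)                  # direct-mapped TLB
tlb_tag = vpn - tlb_index

block_offset = log2_int(BLOCK_BYTES)
l1_index = log2_int(L1_BYTES // BLOCK_BYTES)       # direct-mapped L1
l1_tag = PHYSICAL_BITS - l1_index - block_offset   # assumes a physical tag
l2_index = log2_int(L2_BYTES // BLOCK_BYTES)       # direct-mapped L2
l2_tag = PHYSICAL_BITS - l2_index - block_offset

print("page offset", page_offset, "VPN", vpn)
print("TLB index", tlb_index, "TLB tag", tlb_tag)
print("L1 index", l1_index, "L1 tag", l1_tag, "block offset", block_offset)
print("L2 index", l2_index, "L2 tag", l2_tag)
```

With these sizes the L1 index plus block offset is exactly as wide as the page offset, so the L1 can be indexed before translation finishes.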
SRAM Memory Partitioning
Block diagram of a typical physically partitioned SRAM array (within a bank)
CACTI Algorithm to find the best “Power*Delay”
Product and Area Estimation for it
(from [2])
Web Interface CACTI 4.0
Web Interface CACTI 4.2
http://quid.hpl.hp.com:9081/cacti/
Web Interface
Exercise 5.1a
Using CACTI 4.2, for direct-mapped, 2-way and 4-way set-associative caches of 32KB with 64B block size implemented in a 90-nm process (Note: no leakage power for 90 nm):
Cache size: 32KB with 64B line, 90nm (Vdd=1.04869097076)

                                              1-way          2-way            4-way
Access time (ns)                              0.727237609    0.95916641708    0.883799463
Total dynamic Read Power at max. freq. (W)    0.041520158    0.05737028351    0.149968343
Cycle Time (ns)                               0.474678046    0.47078647474    0.421119816
Total area subbanked (mm^2)                   0.555439619    0.78984622349    0.743759781
Nbank                                         1              1                1
N of sets per bank                            512            256              128
Ndwl                                          1              4                16
Ndbl                                          8              4                1
Nspd                                          1              2                2
Ntwl                                          32             16               1
Ntbl                                          4              1                16
Ntspd                                         16             32               1

Access time relative to the direct-mapped (1-way) cache: 2-way is 32% more; 4-way is 21% more.
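The relative access times for the three associativities can be recomputed from the CACTI figures. A minimal Python sketch (the dictionary and variable names are illustrative):

```python
# Access times (ns) from the CACTI 4.2 runs: 32KB cache, 64B line, 90 nm.
access_ns = {"1-way": 0.727237609, "2-way": 0.95916641708, "4-way": 0.883799463}

baseline = access_ns["1-way"]
increase_pct = {
    name: (t / baseline - 1.0) * 100.0 for name, t in access_ns.items()
}

for name, pct in increase_pct.items():
    print(f"{name}: {pct:+.1f}% access time vs. direct-mapped")
```

In this CACTI run the 2-way cache has a longer access time than the 4-way one, because the tool picked a different array partitioning (Ndwl, Ndbl, Nspd) for each organization.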
Exercise 5.1b
Using CACTI 4.2, for 2-way set associative caches of 16KB, 32KB and 64KB with 64B block size implemented in 90-nm process:
Cache size (2-way), 64B line, 90nm (Vdd=1.04869097076)

                                              16K             32K              64K
Access time (ns)                              0.8154413713    0.95916641708    1.004488549
Total dynamic Read Power at max. freq. (W)    0.0668911783    0.05737028351    0.078455944
Cycle Time (ns)                               0.4630303771    0.47078647474    0.495774489
Total area subbanked (mm^2)                   0.3412720341    0.78984622349    1.142408636
No of bank                                    1               1                1
No of sets per bank                           128             256              512
Ndwl                                          1               4                4
Ndbl                                          4               4                4
Nspd                                          0.5             2                2
Ntwl                                          8               16               8
Ntbl                                          1               1                1
Ntspd                                         16              32               16

Access time relative to the 16K cache: 32K is 18% more; 64K is 23% more.
Exercise 5.1c (1)
Using CACTI 4.2, for 2-way set associative caches of 8KB, 16KB, 32KB, 64KB with 64B block size implemented in 90-nm process:
Cache size (2-way), 64B line, 90nm (Vdd=1.04869097076)

                                   8K             16K           32K            64K
Access time (ns)                   0.721949734    0.81544137    0.959166417    1.0044885
Total dynamic Read Power (W)       0.075579576    0.06689117    0.057370283    0.0784559
Cycle Time (ns)                    0.38809874     0.46303037    0.470786474    0.4957744
Total area subbanked (mm^2)        0.184715483    0.34127203    0.789846223    1.1424086
Nbank                              1              1             1              1
N of sets per bank                 64             128           256            512
Ndwl                               1              1             4              4
Ndbl                               4              4             4              4
Nspd                               0.25           0.5           2              2
Ntwl                               1              8             16             8
Ntbl                               4              1             1              1
Ntspd                              2              16            32             16
About 39% increase in access time (1.0045 ns vs. 0.7219 ns) for 8X in size → Log relation
Exercise 5.1c (2)
[Chart: increase in access time (normalized to 8KB cache) versus cache size (normalized to 8KB cache; 1, 2, 4, 8). Access time grows roughly logarithmically with cache size.]
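The points in this chart can be recomputed from the CACTI 4.2 access times reported for the 8KB to 64KB 2-way caches. A minimal Python sketch (variable names are illustrative, not from the slides):

```python
import math

# Access times (ns) for 2-way caches with 64B lines, 90 nm, keyed by size in KB.
access_ns = {8: 0.721949734, 16: 0.81544137, 32: 0.959166417, 64: 1.0044885}

base_kb = 8
base_ns = access_ns[base_kb]

# Normalize both axes to the 8KB cache, as the chart does.
normalized = {kb // base_kb: t / base_ns for kb, t in access_ns.items()}

for size_ratio, time_ratio in sorted(normalized.items()):
    doublings = math.log2(size_ratio)
    print(f"{size_ratio}x size ({doublings:.0f} doublings): "
          f"{time_ratio:.3f}x access time")
```

Each doubling of capacity adds a roughly constant increment to the access time rather than multiplying it, which is what a logarithmic relation looks like.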
Exercise 5.1d
From Fig. 5.29, the current version of CACTI states that a 16KB 8-way set-associative cache with 64-byte blocks has an access time of 0.88 ns. This organization has the lowest miss rate among 16KB caches, except for a fully associative cache, which would have an access time greater than 0.90 ns.
Exercise 5.2a: AMAT
Miss penalty = 20 cycles
• Miss rate (32KB L1 cache, 2-way, single-bank) = 0.0056101
→ AMAT = 0.0056101 x 20 + (1 – 0.0056101) x 1 = 1.107 cycles
• Way-predicted cache (16KB, direct-mapped), 85% prediction accuracy, a mispredicted access takes 2 cycles:
→ AMAT = 0.0056101 x 20 + (0.85 x 1 + (1 – 0.85) x 2) x (1 – 0.0056101) = 1.26 cycles
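Both AMAT figures follow the same pattern: misses pay the miss penalty and hits pay an average hit time. The Python sketch below reproduces them. The function names `amat` and `way_predicted_hit_time` are illustrative, and the formula is the one used on this slide, not a general-purpose model.

```python
def amat(miss_rate, miss_penalty, hit_time):
    """Average memory access time in cycles, in the form used on the slide."""
    return miss_rate * miss_penalty + (1.0 - miss_rate) * hit_time


def way_predicted_hit_time(accuracy, mispredicted_cycles, predicted_cycles=1.0):
    """Average hit time when a fraction `accuracy` of hits is predicted correctly."""
    return accuracy * predicted_cycles + (1.0 - accuracy) * mispredicted_cycles


MISS_RATE = 0.0056101   # 32KB, 2-way, single bank
MISS_PENALTY = 20       # cycles

baseline = amat(MISS_RATE, MISS_PENALTY, hit_time=1.0)
predicted = amat(MISS_RATE, MISS_PENALTY, way_predicted_hit_time(0.85, 2.0))

print(f"baseline AMAT:      {baseline:.2f} cycles")
print(f"way-predicted AMAT: {predicted:.2f} cycles")
```

Changing `mispredicted_cycles` to 15 gives the data-cache case of Exercise 5.2c.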
Exercise 5.2b
• Using CACTI 4.2, for a 16KB direct-mapped cache with 64B block size implemented in a 90-nm process:
Access time = 0.66318353 ns
Cycle time = 0.36661061 ns
Total dynamic power = 0.033881430 W
• A 32KB 2-way set-associative cache with 64B block size implemented in a 90-nm process:
Access time = 0.95916641708 ns
→ Improvement in execution = 0.9591 / 0.6631 = 1.446, or 44.6% faster
Exercise 5.2c: Way-prediction on a data cache
Assumptions:
• Miss penalty = 20 cycles
• Miss rate (32KB L1 cache, 2-way, single-bank) = 0.0056101
• Way-predicted data cache (16KB, direct-mapped), 85% prediction accuracy, a mispredicted access takes 15 cycles:
→ AMAT = 0.0056101 x 20 + (0.85 x 1 + (1 – 0.85) x 15) x (1 – 0.0056101) = 3.19 cycles
Increase in AMAT: 3.19 – 1.26 = 1.93 cycles
Exercise 5.2d
• Using CACTI 4.2, for a 1MB 4-way cache with 64B block size, 144b read out, 1 bank, 1 read/write port, 30b tags, implemented in a 90-nm process:
Cache size: 1MB with 64B line, 90nm (Vdd=1.04869097076)

                                              Normal         Fast           Serial
Access time (ns)                              2.542393257    1.715168589    3.500265285
Total dynamic Read Power at max. freq. (W)    0.360252018    0.611948165    0.302312556
Cycle Time (ns)                               0.466345737    0.513336315    0.523116448
Total area subbanked (mm^2)                   19.71918143    27.28437726    56.92118501
Nbank                                         1              1              1
N of sets per bank                            4096           4096           4096
Ndwl                                          32             8              8
Ndbl                                          8              32             32
Nspd                                          4.5            1.125          1.125
Ntwl                                          4              8              8
Ntbl                                          32             32             32
Ntspd                                         4              16             16
Serial vs. normal access: about 38% increase in access time (3.50 ns vs. 2.54 ns), about 16% reduction in total dynamic read power (0.302 W vs. 0.360 W)
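The trade-off between the three access modes can be quantified from the CACTI numbers for the 1MB cache. A minimal Python sketch, with the normal mode as the reference (names are illustrative):

```python
# CACTI 4.2 results for the 1MB 4-way cache, 64B line, 90 nm.
access_ns = {"normal": 2.542393257, "fast": 1.715168589, "serial": 3.500265285}
read_power_w = {"normal": 0.360252018, "fast": 0.611948165, "serial": 0.302312556}


def pct_change(new, old):
    """Signed percentage change of `new` relative to `old`."""
    return (new / old - 1.0) * 100.0


serial_access_pct = pct_change(access_ns["serial"], access_ns["normal"])
serial_power_pct = pct_change(read_power_w["serial"], read_power_w["normal"])
fast_access_pct = pct_change(access_ns["fast"], access_ns["normal"])
fast_power_pct = pct_change(read_power_w["fast"], read_power_w["normal"])

print(f"serial: {serial_access_pct:+.1f}% access time, {serial_power_pct:+.1f}% read power")
print(f"fast:   {fast_access_pct:+.1f}% access time, {fast_power_pct:+.1f}% read power")
```

Serial access saves dynamic read power at the cost of a longer access, while fast access does the opposite.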
Exercise 5.3a: Pipelined vs. Banked
• Using CACTI 4.2, for a 64KB 2-way cache with 2 banks and 64B block size implemented in a 90-nm process:
Access time = 0.958448597337 ns
Cycle time = 0.47078647474 ns
Total dynamic power = 0.114334683539 W
Total area subbanked = 1.64216420153 mm^2
Number of potential pipe stages = 0.958 / 0.471 = 2.03, i.e. about 2 stages
Exercise 5.3b: Deeper Pipelined
Assumptions (from Fig. 5.29):
• Miss penalty = 40 cycles
• Miss rate (64KB L1 cache, 2-way, 1 bank) = 0.0036625
→ AMAT = 0.0036625 x 40 + (1 – 0.0036625) x 1 = 1.14 cycles
If 20% more cache-access pipe stages are added (hit time of 1.2 cycles):
→ AMAT = 0.0036625 x 40 + (1 – 0.0036625) x 1.2 = 1.34 cycles
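The effect of the deeper pipeline on AMAT can be reproduced with the same formula, changing only the hit time. A minimal Python sketch, using the unrounded miss rate from the assumptions (the function name `amat` is illustrative):

```python
def amat(miss_rate, miss_penalty, hit_time):
    """Average memory access time in cycles, in the form used on the slide."""
    return miss_rate * miss_penalty + (1.0 - miss_rate) * hit_time


MISS_RATE = 0.0036625   # 64KB, 2-way, 1 bank
MISS_PENALTY = 40       # cycles

amat_base = amat(MISS_RATE, MISS_PENALTY, hit_time=1.0)
amat_deeper = amat(MISS_RATE, MISS_PENALTY, hit_time=1.2)  # 20% longer hit time

print(f"baseline: {amat_base:.2f} cycles")
print(f"deeper:   {amat_deeper:.2f} cycles")
print(f"increase: {amat_deeper - amat_base:.2f} cycles")
```

Because hits dominate, almost all of the 0.2-cycle hit-time increase shows up in the AMAT.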
Exercise 5.3c
Assumptions:
• 2 banks; a bank conflict causes a 1-cycle delay
• A random distribution of addresses and a steady stream of accesses; each access has a 50% probability of conflicting with the previous access
• Miss rate (similar to that of the 64KB L1 cache, 2-way, 1 bank) = 0.0036625
• Miss penalty = 20 cycles
→ AMAT = 0.0036625 x 20 + (0.5 x 1 + 0.5 x 2) x (1 – 0.0036625) = 1.57 cycles
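The banked-cache AMAT uses an expected hit time that averages over conflicting and non-conflicting accesses. A minimal Python sketch under the stated assumptions (2 banks, 50% conflict probability, 1 extra cycle per conflict); the names are illustrative:

```python
def expected_hit_time(conflict_prob, conflict_delay, base_hit_time=1.0):
    """Average hit time when a fraction of accesses waits for a busy bank."""
    no_conflict = (1.0 - conflict_prob) * base_hit_time
    conflict = conflict_prob * (base_hit_time + conflict_delay)
    return no_conflict + conflict


def amat(miss_rate, miss_penalty, hit_time):
    """Average memory access time in cycles, in the form used on the slide."""
    return miss_rate * miss_penalty + (1.0 - miss_rate) * hit_time


MISS_RATE = 0.0036625
MISS_PENALTY = 20   # cycles

hit_time = expected_hit_time(conflict_prob=0.5, conflict_delay=1.0)
amat_banked = amat(MISS_RATE, MISS_PENALTY, hit_time)

print(f"expected hit time: {hit_time:.2f} cycles")
print(f"AMAT:              {amat_banked:.2f} cycles")
```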
Exercise 5.4a: Early restart and critical word first
Assumptions:
• 1 MB L2 with 64B block size and 16B refill path
• L2 can be written with 16B every 4 processor cycles
• Time to receive data from main memory: first 16B block in 100 cycles; each additional 16B in 16 cycles
• Ignore cycles to transfer the miss request to L2 and the requested data to L1 cache
• With critical word first and early restart:
→ L2 cache miss requires 100 cycles
• Without critical word first and early restart:
→ L2 cache miss requires: 100 + 3 x 16 = 148 cycles
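The two miss-service times follow from how many 16B chunks must arrive before the processor can continue. A minimal Python sketch under the slide's assumptions (the function name and parameters are illustrative):

```python
def l2_miss_cycles(block_bytes, chunk_bytes, first_chunk_cycles,
                   next_chunk_cycles, critical_word_first):
    """Cycles until the requested data is available after an L2 miss."""
    chunks = block_bytes // chunk_bytes
    if critical_word_first:
        # The requested chunk arrives first and the processor restarts at once.
        return first_chunk_cycles
    # Otherwise the processor waits for the whole block to arrive.
    return first_chunk_cycles + (chunks - 1) * next_chunk_cycles


with_cwf = l2_miss_cycles(64, 16, 100, 16, critical_word_first=True)
without_cwf = l2_miss_cycles(64, 16, 100, 16, critical_word_first=False)

print(with_cwf, without_cwf)  # 100 148
```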
Exercise 5.4b: Early restart and critical word first
It depends on:
1. The contribution of the L1 and L2 cache misses to AMAT
2. The percent reduction in miss service time provided by critical word first and early restart
If 2) is approximately the same for both L1 and L2, then the AMAT contribution of L1 and L2 decides the importance of critical word first and early restart.
Exercise 5.5: Optimizing Write Buffer
Assumptions:
• Write-through L1 and write-back L2 cache
• L2 write data bus: 16B wide; performs a write every 4 processor cycles
A) How many bytes wide should each write buffer entry be?
→ It should be equal to the L2 write data bus width: 16B
B) Speedup from using a merging buffer to execute 32b (4B) stores, if a merging buffer entry is assumed to be 16B wide?
→ A nonmerging buffer takes: 4 cycles x (16B / 4B) = 16 cycles
→ A merging buffer takes 4 cycles to write a full 16B entry
→ Speedup: 16 / 4 = 4X faster
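The write-buffer speedup is the number of stores that fit in one buffer entry. A minimal Python sketch under the slide's assumptions (16B bus, one L2 write per 4 cycles, 4B stores to consecutive addresses); the constant names are illustrative:

```python
BUS_BYTES = 16      # L2 write data bus width, also the merging entry width
WRITE_CYCLES = 4    # processor cycles per L2 write
STORE_BYTES = 4     # one 32-bit store

stores_per_entry = BUS_BYTES // STORE_BYTES

# Nonmerging: every store occupies its own entry and needs its own L2 write.
nonmerging_cycles = stores_per_entry * WRITE_CYCLES
# Merging: the stores are combined into one entry and written once.
merging_cycles = WRITE_CYCLES

speedup = nonmerging_cycles / merging_cycles
print(nonmerging_cycles, merging_cycles, speedup)  # 16 4 4.0
```

The 4X figure is an upper bound: it holds only when consecutive stores fall in the same 16B-aligned region and can therefore be merged.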