Krste Asanovic [email protected] inst.eecs.berkeley /~cs252/sp14
Highly-Associative Caches for Low-Power Processors Michael Zhang Krste Asanovic...
-
date post
21-Dec-2015 -
Category
Documents
-
view
224 -
download
0
Transcript of Highly-Associative Caches for Low-Power Processors Michael Zhang Krste Asanovic...
Highly-Associative Caches for Low-Power Processors
Michael Zhang
Krste Asanovic
{rzhang|krste}@lcs.mit.edu
Motivation
Cache uses 30-60% processor energy in embedded systems. Example: 43% for StrongArm-1
Many academic studies on cache [Albera, Bahar, ’98] – Power and performance trade-offs [Amrutur, Horowitz, ‘98,’00] – Speed and power scaling [Bellas, Hajj, Polychronopoulos, ’99] – Dynamic cache management [Ghose, Kamble,’99] – Power reduction through sub-banking, etc. [Inoue, Ishihara, Murakami,’99] – Way predicting set-associative cache [Kin,Gupta, Mangione-Smith, ’97] – Filter cache [Ko, Balsara, Nanda, ’98] – Multilevel caches for RISC and CISC [Wilton, Jouppi, ’94] – CACTI cache model
Many Industrial Low-Power Processors use CAM (content-addressable-memory) ARM3 – 64-way set-associative – [Furber et. al. ’89] StrongArm – 32-way set-associative – [Santhanam et. al. ’98] Intel XScale – 32-way set-associative – ’01
CAM: Fast and Energy-Efficient
Talk Outline
Structural Comparison
Area and Delay Comparison
Energy Comparison
Related work
Conclusion
Set-Associative RAM-tag Cache
Not energy-efficient All ways are read out
Two-phase approach More energy-efficient 2X latency
=? =?
Tag Status Data Tag Status Data
Tag Index Offset
Set-Associative RAM-tag Sub-bank
Not energy-efficient All ways are read out
Two-phase approach More energy-efficient 2X latency
Sub-banking
1 sub-bank = 1 way
Low-swing Bitlines Only for reads, writes
performed full-swing
Wordline Gating
I/O
BUS
addr
Ad
dre
ss D
ecod
er
gwl
lwl
Offset Dec.
offset
DataSRAM Cells
SenseAmps
lwl
Offset Dec.
offset
DataSRAM Cells
SenseAmps
32128
Tag SRAMCells
Tag Comp
BUS Cache
CAM-tag Cache
Only one sub-bank activated
Associativity within sub-bank
Easy to implement high associativity
Tag Status Data
HIT?
Word
Tag OffsetBank
Tag Status Data
HIT?HIT?
CAM-tag Cache Sub-bank
Only one sub-bank activated
Associativity within sub-bank
Easy to implement high associativity
I/O
BUS
tag
CA
M-t
ag
Arr
ay
gwl
lwl
Offset Dec.
offset
SRAM Cells
SenseAmps
lwl
Offset Dec.
offset
SRAM Cells
SenseAmps
32128
CAM Functionality and Energy Usage
WLBit Bit_b SBitSBit_b
match
10-T CAM CellWith Separate
Write/Search LinesAnd Low-Swing
Match Line
WL
Bit Bit_b SBitSBit_b
match
Match
1
01
1
10
WL
Bit Bit_b SBitSBit_b
match
Mismatch
1 0
011 0
CAM Energy Dissipation Search Lines Match Lines Drivers
SR
AM
XOR
CAM-tag Cache Sub-bank Layout
10% area overhead over RAM-tag cache
2x12x32 CAM Array
1-KB Cache Sub-bank implemented in 0.25 m CMOS technology
32x64 RAM Array
Delay Comparison
gwl
Global Wordline Decoding
lwl
Decoded offset
Local Wordline Decoding
Tag readout
Data readout
Index Bits
Tag bits
Data out
Tag bitsTag Comp.
Data out
RAM tag Cache Critical Path:
CAM tag Cache Critical Path:
Within 3% of each other
Tag bits broadcasting Tag bits
Tag Comp.
gwl
Data readout
Local Wordline Decoding
lwl
Decoded offset
Hit Energy ComparisonH
it E
nerg
y p
er
Access f
or
8K
B C
ach
e in
pJ
0
50
100
150
200
250
300
350
400
450
1-wayRAM
2-wayRAM
4-wayRAM
8-wayRAM
8-wayCAM
16-wayCAM
32-wayCAM
LZWijpegpegwitperlm88ksimgccAvg
Associativity and Implementation
Miss Rate Results
0
5
10
15
20
25
1-way 2-way 4-way 8-way 16-way 32-way 64-way
LZW
0
0.5
1
1.5
2
2.5
3
3.5
1-way 2-way 4-way 8-way 16-way 32-way 64-way
ijpeg
0
0.5
1
1.5
2
2.5
3
3.5
1-way 2-way 4-way 8-way 16-way 32-way 64-way
m88ksim
0
2
4
6
8
10
12
14
16
1-way 2-way 4-way 8-way 16-way 32-way 64-way
8KB
16KB
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
2
1-way 2-way 4-way 8-way 16-way 32-way 64-way
0
1
2
3
4
5
6
1-way 2-way 4-way 8-way 16-way 32-way 64-way
perl
pegwit
gcc
Total Access Energy (pegwit)
Pegwit – High miss rate for high associativity
Tota
l En
erg
y p
er
Access f
or
8K
B C
ach
e in
pJ
Miss Energy Expressed in Multiples of 32-bit Read Access Energy
0
500
1000
1500
2000
2500
32X 64X 128X 256X 512X 1024X
1-RAM2-RAM4-RAM8-RAM8-CAM16-CAM32-CAM
Total Access Energy (perl)
Perl – Very low miss rate for high associativity
Tota
l En
erg
y p
er
Access f
or
8K
B C
ach
e in
pJ
Miss Energy Expressed in Multiples of 32-bit Read Access Energy
0
50
100
150
200
250
300
350
400
450
500
32X 64X 128X 256X 512X 1024X
1-RAM2-RAM4-RAM8-RAM8-CAM16-CAM32-CAM
Other Advantages of CAM-tag
Hit signal generated earlier Simplifies pipelines
Simplified store operation Wordline only enabled during a hit Stores can happen in a single cycle No write buffer necessary
Related Work
CACTI and CACTI2 [Wilton and Jouppi ’94],[Reinman and Jouppi, ’99]Accurate delay and energy estimate
Results within 10%Energy estimate not suited for low-power designsTypical Low-power features not included in CACTI
Sub-banking Low-swing bitlines Wordline gating Separate CAM search line Low-swing match lines
Energy Estimation 10X greater than our model for one CAM-tag cache sub-bank Our results closely agree with [Amruthur and Horowitz, 98]
Conclusion
CAM tags – high performance and low-power Energy consumption of 32-way CAM < 2-way RAM Easy to implement highly-associative tags Low area overhead (10%) Comparable access delay Better CPI by reducing miss rate