Compiler Managed Partitioned Data Caches for Low Power
description
Transcript of Compiler Managed Partitioned Data Caches for Low Power
![Page 1: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/1.jpg)
1 University of MichiganElectrical Engineering and Computer Science
Compiler Managed Partitioned Data Compiler Managed Partitioned Data Caches for Low PowerCaches for Low Power
Rajiv Ravindran*, Michael Chu, and Scott Mahlke
Advanced Computer Architecture LabDepartment of Electrical Engineering and Computer Science
University of Michigan, Ann Arbor
* Currently with the Java, Compilers, and Tools Lab, Hewlett Packard, Cupertino, California
![Page 2: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/2.jpg)
2 University of MichiganElectrical Engineering and Computer Science
Introduction: Memory PowerIntroduction: Memory Power• On-chip memories are a major contributor to system energy• Data caches ~16% in StrongARM [Unsal et. al, ‘01]
Hardware Software
Banking, dynamic voltage/frequency,scaling, dynamic resizing
+ Transparent to the user + Handle arbitrary instr/data accesses – Limited program information – Reactive
Software controlled scratch-pad, data/code reorganization
+ Whole program information + Proactive – No dynamic adaptability – Conservative
![Page 3: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/3.jpg)
3 University of MichiganElectrical Engineering and Computer Science
Reducing Data Memory Power:Reducing Data Memory Power:Compiler Managed, Hardware AssistedCompiler Managed, Hardware Assisted
Hardware Software
Banking, dynamic voltage/frequency,scaling, dynamic resizing
+ Transparent to the user + Handle arbitrary instr/data accesses ー Limited program information ー Reactive
Software controlled scratch-pad, data/code reorganization
+ Whole program information + Proactive ー No dynamic adaptability ー Conservative
Global program knowledge Proactive optimizations Dynamic adaptability Efficient execution Aggressive software optimizations
![Page 4: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/4.jpg)
4 University of MichiganElectrical Engineering and Computer Science
Data Caches: TradeoffsData Caches: Tradeoffs
Advantages Disadvantages
+ Capture spatial/temporal locality+ Transparent to the programmer+ General than software scratch-pads+ Efficient lookups
– Fixed replacement policy– Set index no program locality– Set-associativity has high overhead– Activate multiple data/tag-array per access
![Page 5: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/5.jpg)
5 University of MichiganElectrical Engineering and Computer Science
tag set offset
=? =?=?=?
tag data lru tag data lru tag data lru tag data lru
4:1 mux
Replace
• Lookup Activate all ways on every access• Replacement Choose among all the ways
Traditional Cache ArchitectureTraditional Cache Architecture
![Page 6: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/6.jpg)
6 University of MichiganElectrical Engineering and Computer Science
Partitioned Cache ArchitecturePartitioned Cache Architecture
tag set offset
=? =?=?=?
tag data lru tag data lru tag data lru tag data lru
Ld/St Reg [Addr] [k-bitvector] [R/U]
4:1 mux
Replace
• Lookup Restricted to partitions specified in bit-vector if ‘R’, else default to all partitions• Replacement Restricted to partitions specified in bit-vector
P0 P3P2P1
• Advantages Improve performance by controlling replacement Reduce cache access power by restricting number of accesses
![Page 7: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/7.jpg)
7 University of MichiganElectrical Engineering and Computer Science
Partitioned Caches: ExamplePartitioned Caches: Example
tag data tag datatag data
ld1, st1, ld2, st2 ld5, ld6 ld3, ld4
way-0 way-2way-1
ld1 [100], Rld5 [010], Rld3 [001], R
for (i = 0; i < N1; i++) { … for (j = 0; j < N2; j++) y[i + j] += *w1++ + x[i + j] for (k = 0; k < N3; k++) y[i + k] += *w2++ + x[i + k]}
ld1/st1
ld2/st2
ld3
ld4
ld5
ld6
y w1/w2 x
• Reduce number of tag checks per iteration from 12 to 4 !
![Page 8: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/8.jpg)
8 University of MichiganElectrical Engineering and Computer Science
Compiler Controlled Data PartitioningCompiler Controlled Data Partitioning• Goal: Place loads/stores into cache partitions• Analyze application’s memory characteristics
– Cache requirements Number of partitions per ld/st– Predict conflicts
• Place loads/stores to different partitions– Satisfies its caching needs– Avoid conflicts, overlap if possible
![Page 9: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/9.jpg)
9 University of MichiganElectrical Engineering and Computer Science
Cache Analysis: Cache Analysis: Estimating Number of PartitionsEstimating Number of Partitions
X W1 Y Y X W1 Y Y X W2 Y Y X W2 Y Y
j-loop k-loop
MM MMB1 B1B1 B1
• M has working-set size = 1
• Minimal partitions to avoid conflict/capacity misses • Probabilistic hit-rate estimate
• Use the working-set to compute number of partitions
![Page 10: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/10.jpg)
10 University of MichiganElectrical Engineering and Computer Science
Cache Analysis:Cache Analysis:Estimating Number Of PartitionsEstimating Number Of Partitions
1
1
1
1
8
16
24
32
1 2 3 4
D = 0
.76
.98
1
1
8
16
24
32
1 2 3 4
D = 2
.87
1
1
1
8
16
24
32
1 2 3 4
D = 1
Avoid conflict/capacity misses for an instruction Estimates hit-rate based on
• Reuse-distance (D), total number of cache blocks (B), associativity (A)
Compute energy matrices in reality Pick most energy efficient configuration per instruction
(Brehob et. al., ’99)
![Page 11: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/11.jpg)
11 University of MichiganElectrical Engineering and Computer Science
Cache Analysis: Cache Analysis: Computing InterferencesComputing Interferences
• Avoid conflicts among temporally co-located references • Model conflicts using interference graph
M4D = 1
M3D = 1
M2D = 1
M1D = 1
X W1 Y Y X W1 Y Y X W2 Y Y X W2 Y Y
M4 M2 M1 M1 M4 M2 M1 M1 M4 M3 M1 M1 M4 M3 M1 M1
![Page 12: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/12.jpg)
12 University of MichiganElectrical Engineering and Computer Science
Partition AssignmentPartition Assignment Placement phase can overlap references Compute combined working-set Use graph-theoretic notion of a clique
For each clique, new D Σ D of each node Combined D for all overlaps Max (All cliques)
M4D = 1
M3D = 1
M2D = 1
M1D = 1
Clique 2
Clique 1
Clique 1 : M1, M2, M4 New reuse distance (D) = 3Clique 2 : M1, M3, M4 New reuse distance (D) = 3 Combined reuse distance Max(3, 3) = 3
![Page 13: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/13.jpg)
13 University of MichiganElectrical Engineering and Computer Science
Experimental SetupExperimental Setup• Trimaran compiler and simulator infrastructure• ARM9 processor model• Cache configurations:
– 1-Kb to 32-Kb– 32-byte block size– 2, 4, 8 partitions vs. 2, 4, 8-way set-associative cache
• Mediabench suite• CACTI for cache energy modeling
![Page 14: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/14.jpg)
14 University of MichiganElectrical Engineering and Computer Science
Reduction in Tag & Data-Array ChecksReduction in Tag & Data-Array Checks
0
1
2
3
4
5
6
7
8
1-K 2-K 4-K 8-K 16-K 32-K Average
Aver
age
way
acce
sses
8-part 4-part 2-part
• 36% reduction on a 8-partition cache
Cache size
![Page 15: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/15.jpg)
15 University of MichiganElectrical Engineering and Computer Science
Improvement in Fetch EnergyImprovement in Fetch Energy
0
10
20
30
40
50
60
rawc
audi
o
rawd
audi
o
g721
enco
de
g721
deco
de
mpe
g2de
c
mpe
g2en
c
pegw
itenc
pegw
itdec
pgpe
ncod
e
pgpd
ecod
e
gsm
enco
de
gsm
deco
de epic
unep
ic
cjpeg
djpe
g
Aver
age
Perc
enta
ge e
nerg
y im
prov
emen
t 2-part vs 2-way 4-part vs 4-way 8-part vs 8-way
16-Kb cache
![Page 16: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/16.jpg)
16 University of MichiganElectrical Engineering and Computer Science
SummarySummary• Maintain the advantages of a hardware-cache• Expose placement and lookup decisions to the compiler
– Avoid conflicts, eliminate redundancies• 24% energy savings for 4-Kb with 4-partitions• Extensions
– Hybrid scratch-pad and caches– Disable selected tags convert them to scratch-pads– 35% additional savings in 4-Kb cache with 1 partition as SP
![Page 17: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/17.jpg)
17 University of MichiganElectrical Engineering and Computer Science
Thank You&
Questions
![Page 18: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/18.jpg)
18 University of MichiganElectrical Engineering and Computer Science
Cache Analysis Cache Analysis Step 1: Instruction FusioningStep 1: Instruction Fusioning
• Combine ld/st that accesses the same set of objects• Avoids coherence and duplication• Points-to analysis
M1 M2
for (i = 0; i < N1; i++) { … for (j = 0; j < readInput1(); j++) y[i + j] += *w1++ + x[i + j] for (k = 0; k < readInput2(); k++) y[i + k] += *w2++ + x[i + k]}
ld1/st1
ld2/st2
ld3
ld4
ld5
ld6
![Page 19: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/19.jpg)
19 University of MichiganElectrical Engineering and Computer Science
Partition AssignmentPartition Assignment• Greedily place instructions based on its cache estimates• Overlap instructions if required• Compute number of partitions for overlapped instructions
– Enumerate cliques within interference graph– Compute combined working-set of all cliques
• Assign the R/U bit to control lookup
M4D = 1
M3D = 1
M2D = 1
M1D = 1
Clique 2
Clique 1
![Page 20: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/20.jpg)
20 University of MichiganElectrical Engineering and Computer Science
Related WorkRelated Work• Direct addressed, cool caches [Unsal ’01, Asanovic ’01]
– Tags maintained in registers that are addressed within loads/stores• Split temporal/spatial cache [Rivers ’96]
– Hardware managed, two partitions• Column partitioning [Devdas ’00]
– Individual ways can be configured as a scratch-pad– No load/store based partitioning
• Region based caching [Tyson ’02]– Heap, stack, globals– More finer grained control and management
• Pseudo set-associative caches [Calder ’96,Inou ’99,Albonesi ‘99]– Reduce tag check power– Compromises on cycle time– Orthogonal to our technique
![Page 21: Compiler Managed Partitioned Data Caches for Low Power](https://reader035.fdocuments.net/reader035/viewer/2022062315/5681632e550346895dd3a989/html5/thumbnails/21.jpg)
21 University of MichiganElectrical Engineering and Computer Science
Code Size OverheadCode Size Overhead
0
2
4
6
8
10
12
rawc
audi
o
rawd
audi
o
g721
enco
de
g721
deco
de
mpe
g2de
c
mpe
g2en
c
pegw
itenc
pegw
itdec
pgpe
ncod
e
pgpd
ecod
e
gsm
enco
de
gsm
deco
de epic
unep
ic
cjpeg
djpe
g
Aver
age
Perc
enta
ge in
stru
ctio
ns
Annotated LD/STs Extra MOV instructions15% 16%