Compiler Managed Partitioned Data Caches for Low Power

1 University of MichiganElectrical Engineering and Computer Science

Compiler Managed Partitioned Data Compiler Managed Partitioned Data Caches for Low PowerCaches for Low Power

Rajiv Ravindran*, Michael Chu, and Scott Mahlke

Advanced Computer Architecture LabDepartment of Electrical Engineering and Computer Science

University of Michigan, Ann Arbor

* Currently with the Java, Compilers, and Tools Lab, Hewlett Packard, Cupertino, California


Introduction: Memory PowerIntroduction: Memory Power• On-chip memories are a major contributor to system energy• Data caches ~16% in StrongARM [Unsal et. al, ‘01]

Hardware Software

Banking, dynamic voltage/frequency,scaling, dynamic resizing

+ Transparent to the user + Handle arbitrary instr/data accesses – Limited program information – Reactive

Software controlled scratch-pad, data/code reorganization

+ Whole program information + Proactive – No dynamic adaptability – Conservative


Reducing Data Memory Power:Reducing Data Memory Power:Compiler Managed, Hardware AssistedCompiler Managed, Hardware Assisted

Hardware Software

Banking, dynamic voltage/frequency,scaling, dynamic resizing

+ Transparent to the user + Handle arbitrary instr/data accesses ｰ Limited program information ｰ Reactive

Software controlled scratch-pad, data/code reorganization

+ Whole program information + Proactive ｰ No dynamic adaptability ｰ Conservative

Global program knowledge Proactive optimizations Dynamic adaptability Efficient execution Aggressive software optimizations


Data Caches: TradeoffsData Caches: Tradeoffs

Advantages Disadvantages

+ Capture spatial/temporal locality+ Transparent to the programmer+ General than software scratch-pads+ Efficient lookups

– Fixed replacement policy– Set index no program locality– Set-associativity has high overhead– Activate multiple data/tag-array per access


tag set offset

=? =?=?=?

tag data lru tag data lru tag data lru tag data lru

4:1 mux

Replace

• Lookup Activate all ways on every access• Replacement Choose among all the ways

Traditional Cache ArchitectureTraditional Cache Architecture


Partitioned Cache ArchitecturePartitioned Cache Architecture

tag set offset

=? =?=?=?

tag data lru tag data lru tag data lru tag data lru

Ld/St Reg [Addr] [k-bitvector] [R/U]

4:1 mux

Replace

• Lookup Restricted to partitions specified in bit-vector if ‘R’, else default to all partitions• Replacement Restricted to partitions specified in bit-vector

P0 P3P2P1

• Advantages Improve performance by controlling replacement Reduce cache access power by restricting number of accesses


Partitioned Caches: ExamplePartitioned Caches: Example

tag data tag datatag data

ld1, st1, ld2, st2 ld5, ld6 ld3, ld4

way-0 way-2way-1

ld1 [100], Rld5 [010], Rld3 [001], R

for (i = 0; i < N1; i++) { … for (j = 0; j < N2; j++) y[i + j] += *w1++ + x[i + j] for (k = 0; k < N3; k++) y[i + k] += *w2++ + x[i + k]}

ld1/st1

ld2/st2

ld3

ld4

ld5

ld6

y w1/w2 x

• Reduce number of tag checks per iteration from 12 to 4 !


Compiler Controlled Data PartitioningCompiler Controlled Data Partitioning• Goal: Place loads/stores into cache partitions• Analyze application’s memory characteristics

– Cache requirements Number of partitions per ld/st– Predict conflicts

• Place loads/stores to different partitions– Satisfies its caching needs– Avoid conflicts, overlap if possible


Cache Analysis: Cache Analysis: Estimating Number of PartitionsEstimating Number of Partitions

X W1 Y Y X W1 Y Y X W2 Y Y X W2 Y Y

j-loop k-loop

MM MMB1 B1B1 B1

• M has working-set size = 1

• Minimal partitions to avoid conflict/capacity misses • Probabilistic hit-rate estimate

• Use the working-set to compute number of partitions


Cache Analysis:Cache Analysis:Estimating Number Of PartitionsEstimating Number Of Partitions

1

1

1

1

8

16

24

32

1 2 3 4

D = 0

.76

.98

1

1

8

16

24

32

1 2 3 4

D = 2

.87

1

1

1

8

16

24

32

1 2 3 4

D = 1

Avoid conflict/capacity misses for an instruction Estimates hit-rate based on

• Reuse-distance (D), total number of cache blocks (B), associativity (A)

Compute energy matrices in reality Pick most energy efficient configuration per instruction

(Brehob et. al., ’99)


Cache Analysis: Cache Analysis: Computing InterferencesComputing Interferences

• Avoid conflicts among temporally co-located references • Model conflicts using interference graph

M4D = 1

M3D = 1

M2D = 1

M1D = 1

X W1 Y Y X W1 Y Y X W2 Y Y X W2 Y Y

M4 M2 M1 M1 M4 M2 M1 M1 M4 M3 M1 M1 M4 M3 M1 M1


Partition AssignmentPartition Assignment Placement phase can overlap references Compute combined working-set Use graph-theoretic notion of a clique

For each clique, new D Σ D of each node Combined D for all overlaps Max (All cliques)

M4D = 1

M3D = 1

M2D = 1

M1D = 1

Clique 2

Clique 1

Clique 1 : M1, M2, M4 New reuse distance (D) = 3Clique 2 : M1, M3, M4 New reuse distance (D) = 3 Combined reuse distance Max(3, 3) = 3


Experimental SetupExperimental Setup• Trimaran compiler and simulator infrastructure• ARM9 processor model• Cache configurations:

– 1-Kb to 32-Kb– 32-byte block size– 2, 4, 8 partitions vs. 2, 4, 8-way set-associative cache

• Mediabench suite• CACTI for cache energy modeling


Reduction in Tag & Data-Array ChecksReduction in Tag & Data-Array Checks

0

1

2

3

4

5

6

7

8

1-K 2-K 4-K 8-K 16-K 32-K Average

Aver

age

way

acce

sses

8-part 4-part 2-part

• 36% reduction on a 8-partition cache

Cache size


Improvement in Fetch EnergyImprovement in Fetch Energy

0

10

20

30

40

50

60

rawc

audi

o

rawd

audi

o

g721

enco

de

g721

deco

de

mpe

g2de

c

mpe

g2en

c

pegw

itenc

pegw

itdec

pgpe

ncod

e

pgpd

ecod

e

gsm

enco

de

gsm

deco

de epic

unep

ic

cjpeg

djpe

g

Aver

age

Perc

enta

ge e

nerg

y im

prov

emen

t 2-part vs 2-way 4-part vs 4-way 8-part vs 8-way

16-Kb cache


SummarySummary• Maintain the advantages of a hardware-cache• Expose placement and lookup decisions to the compiler

– Avoid conflicts, eliminate redundancies• 24% energy savings for 4-Kb with 4-partitions• Extensions

– Hybrid scratch-pad and caches– Disable selected tags convert them to scratch-pads– 35% additional savings in 4-Kb cache with 1 partition as SP


Thank You&

Questions


Cache Analysis Cache Analysis Step 1: Instruction FusioningStep 1: Instruction Fusioning

• Combine ld/st that accesses the same set of objects• Avoids coherence and duplication• Points-to analysis

M1 M2

for (i = 0; i < N1; i++) { … for (j = 0; j < readInput1(); j++) y[i + j] += *w1++ + x[i + j] for (k = 0; k < readInput2(); k++) y[i + k] += *w2++ + x[i + k]}

ld1/st1

ld2/st2

ld3

ld4

ld5

ld6


Partition AssignmentPartition Assignment• Greedily place instructions based on its cache estimates• Overlap instructions if required• Compute number of partitions for overlapped instructions

– Enumerate cliques within interference graph– Compute combined working-set of all cliques

• Assign the R/U bit to control lookup

M4D = 1

M3D = 1

M2D = 1

M1D = 1

Clique 2

Clique 1


Related WorkRelated Work• Direct addressed, cool caches [Unsal ’01, Asanovic ’01]

– Tags maintained in registers that are addressed within loads/stores• Split temporal/spatial cache [Rivers ’96]

– Hardware managed, two partitions• Column partitioning [Devdas ’00]

– Individual ways can be configured as a scratch-pad– No load/store based partitioning

• Region based caching [Tyson ’02]– Heap, stack, globals– More finer grained control and management

• Pseudo set-associative caches [Calder ’96,Inou ’99,Albonesi ‘99]– Reduce tag check power– Compromises on cycle time– Orthogonal to our technique


Code Size OverheadCode Size Overhead

0

2

4

6

8

10

12

rawc

audi

o

rawd

audi

o

g721

enco

de

g721

deco

de

mpe

g2de

c

mpe

g2en

c

pegw

itenc

pegw

itdec

pgpe

ncod

e

pgpd

ecod

e

gsm

enco

de

gsm

deco

de epic

unep

ic

cjpeg

djpe

g

Aver

age

Perc

enta

ge in

stru

ctio

ns

Annotated LD/STs Extra MOV instructions15% 16%

Compiler Managed Partitioned Data Caches for Low Power

Documents

Transcript of Compiler Managed Partitioned Data Caches for Low Power