Embedded Memory Systems


Description: Embedded systems

Transcript of Embedded Memory Systems

  • Advanced Computer Architecture

    Part II: Embedded Computing
    Embedded Memory Systems

    [email protected] I&C LAP

    (Largely based on slides by P. R. Panda, IITD and P. Marwedel, University of Dortmund)

  • Motivation

    Memories are the limiting performance factor. System-on-Chip memories and SRAMs embedded in FPGAs are fast (1-2 cycle access), but on-chip memory might not be enough; eDRAM or eFLASH may be coming into the picture.
    Memories are a key energy consumer on SoCs.
    Memory systems in embedded systems can be customised: a large design space to exploit for optimisation. SoC and FPGA technologies support irregular memory systems to a good extent.

  • Importance of Memory in SoCs

    Some rough rule-of-thumb figures:
    Area: 50-70% of chip area may be memory
    Performance: 10-90% of system performance may be memory related
    Power: 25-40% of system power may be memory related

  • Things Are Only Getting Worse

    [Figure: DRAM energy and access-time trends with sub-banking. Source: Marwedel, 2007]
    Applications are getting larger and larger. The energy cost of keeping access times low is very high.

  • Some More Recent Estimations

    [Figure: energy breakdown pie charts, average of over 200 benchmarks. Cacheless monoprocessor: Processor Energy 29%, Main Memory Energy 71%. Multiprocessor with I and D caches: Processor, I-Cache, D-Cache, and Main Memory Energy (28.1%, 51.9%, 14.8%, 5.2%). Source: Verma and Marwedel, Springer, 2007]

  • Outline

    Memory data layout
    Scratchpad memory
    Custom memory architectures

  • Memory Data Layout

    Problem statement: optimise the placement of data in memory to maximise cache effectiveness with minimal resources.
    Normally, compilers place data following language conventions and program order. Can they do any better if they see the complete embedded application?
    Similarity: this is reminiscent of the study of placement for DSP variables (but the problem and the strategy were different).

  • Array Layout and Data Cache

      int a[1024];
      int b[1024];
      int c[1024];
      ...
      for (i = 0; i < N; i++)
          c[i] = a[i] + b[i];

    [Figure: arrays a, b, and c laid out consecutively in memory; a[i], b[i], and c[i] all map to the same line of a direct-mapped, 512-word data cache]
    Problem: every access leads to a cache miss!

  • Aliasing Example

    Cache size C, line size M, array size N. Addresses and cache positions:
      a[i]: address i,      cache line (i mod C) / M
      b[i]: address i + N,  cache line ((i + N) mod C) / M
      c[i]: address i + 2N, cache line ((i + 2N) mod C) / M
    If N = kC, all cache positions are identical! C is normally a power of 2, and N is often a power of 2 too.
    Solutions? A set-associative cache, or make C larger than N?! Costly!
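    The mapping above can be checked with a few lines of C. A minimal sketch, assuming a direct-mapped cache of C = 512 words with M = 4-word lines and the back-to-back array layout of this slide (all parameter values are illustrative):

      /* Prints the cache line that a[i], b[i] and c[i] map to; when N is a
       * multiple of C, all three collide on every iteration. */
      #include <stdio.h>

      enum { C = 512, M = 4, N = 1024 };   /* cache words, line words, array words */

      static unsigned cache_line(unsigned word_addr)
      {
          return (word_addr % C) / M;      /* line index of a word address */
      }

      int main(void)
      {
          for (unsigned i = 0; i < 8; i++) /* a few iterations suffice to see it */
              printf("i=%u  a:%u  b:%u  c:%u\n",
                     i, cache_line(i), cache_line(i + N), cache_line(i + 2 * N));
          return 0;
      }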

  • Energy Cost of Associativity

    [Figure: energy per access as a function of cache associativity. Source: Banakar et al., IEEE 2002]

  • A Solution: Array Padding

      int a[1024];
      int b[1024];
      int c[1024];
      ...
      for (i = 0; i < N; i++)
          c[i] = a[i] + b[i];

    [Figure: arrays a, b, and c separated in memory by M-word DUMMY pads; a[i], b[i], and c[i] now map to different lines of the direct-mapped, 512-word data cache]
    Data alignment avoids cache conflicts.
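    The padding can also be made explicit in the source. A minimal sketch, assuming a line size of M = 4 words (in practice a compiler or linker may reorder separate globals, so the pads are often placed inside a struct or via the linker script):

      #define N   1024
      #define PAD 4                 /* assumed cache line size M = 4 words */

      int a[N];
      int pad_ab[PAD];              /* dummy words, never accessed */
      int b[N];
      int pad_bc[PAD];
      int c[N];

      void vector_add(void)
      {
          for (int i = 0; i < N; i++)
              c[i] = a[i] + b[i];   /* a[i], b[i], c[i] now hit different cache lines */
      }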

  • Classic Transformation: Loop Blocking

    Modify the loop iteration space in blocks (or tiles) so that all elements accessed at once fit in the cache.

    Original code:
      for i = 1 to N
        for k = 1 to N
          r = X[i,k]
          for j = 1 to N
            Z[i,j] = Z[i,j] + r * Y[k,j]

    Blocked code:
      for kk = 1 to N step B
        for jj = 1 to N step B
          for i = 1 to N
            for k = kk to min(kk+B-1, N)
              r = X[i,k]
              for j = jj to min(jj+B-1, N)
                Z[i,j] = Z[i,j] + r * Y[k,j]

    [Figure: B x B tiles within the N x N iteration space]
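    A minimal C rendering of the blocked loop nest above (0-based indexing, illustrative sizes; B should be chosen so that the tiles touched together fit in the data cache):

      #define N 512
      #define B 32

      static double X[N][N], Y[N][N], Z[N][N];

      static int min(int a, int b) { return a < b ? a : b; }

      void matmul_blocked(void)
      {
          for (int kk = 0; kk < N; kk += B)
              for (int jj = 0; jj < N; jj += B)
                  for (int i = 0; i < N; i++)
                      for (int k = kk; k < min(kk + B, N); k++) {
                          double r = X[i][k];
                          for (int j = jj; j < min(jj + B, N); j++)
                              Z[i][j] += r * Y[k][j];
                      }
      }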

  • Array Tiling Reduces Aliasing Too

    Idea: split the arrays in blocks or tiles and group the tiles of each array which are accessed at once.
    If the tiles are small enough, the set of tiles accessed at once will fit into the cache. Since they are adjacent in data memory, they will not conflict in the cache.
    [Figure: tiles of several arrays grouped together in data memory. Source: Panda et al., IEEE 2001]

  • Real-World Example: FFT

      double sigreal[2048];
      ...
      le = le / 2;
      for (i = j; i < 2048; i += 2*le) {
          ... = sigreal[i];
          ... = sigreal[i + le];
          sigreal[i] = ...;
          sigreal[i + le] = ...;
      }

    [Figure: 1st outer-loop iteration; accesses to the 2048-word array sigreal (e.g., indices 0 and 1024) map to the same line of the 512-word cache]

  • Padded FFT

      double sigreal[2048 + 16];
      ...
      le = le / 2;  le = le + le / 128;
      for (i = j; i < 2048; i += 2*le) {
          i = i + i / 128;
          ... = sigreal[i];
          ... = sigreal[i + le];
          sigreal[i] = ...;
          sigreal[i + le] = ...;
      }

    Pads: ~1 cache line, every cache size.
    [Figure: 1st outer-loop iteration; the formerly conflicting accesses (e.g., indices 0 and 1032 of the padded array) now map to different lines of the 512-word cache]
    Padding: 15% speed-up on a Sparc5.
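    The index remapping on this slide amounts to inserting one pad element after every 128 array elements. A minimal sketch of that arithmetic, assuming exactly this padding scheme (which also accounts for the 2048 + 16 declaration):

      #include <stdio.h>

      #define LOGICAL_N 2048
      #define PAD_EVERY 128
      #define PADDED_N  (LOGICAL_N + LOGICAL_N / PAD_EVERY)   /* 2048 + 16 */

      static double sigreal[PADDED_N];

      static int padded(int i) { return i + i / PAD_EVERY; }   /* i = i + i/128 */

      int main(void)
      {
          /* indices that collided in the 512-word direct-mapped cache before padding */
          printf("%d %d\n", padded(0), padded(1024));           /* 0 and 1032 */
          return 0;
      }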

  • Algorithms to Decide Data Layout

    Need to make the following decisions:
    Tile size computation: the largest possible tiles such that the working set fits the cache
    Pad size computation: the minimum pad size which eliminates aliasing
    Interleaving of tiled arrays: the arrangement of multiple arrays so that there is no aliasing among arrays and all working sets fit the cache

  • Algorithms to Decide Data Layout

    [Figure: tile sizes selected by TSS, ESS, LRW, and DAT for matrix multiplication with array sizes 35-350]
    DAT uses fixed tile dimensions; the others use widely varying sizes.

  • Outline

    Memory data layout
    Scratchpad memory
    Custom memory architectures

  • Scratchpad Memory: Idea

    [Figure: the CPU sees a single address space; addresses 0 to P-1 are served by an on-chip scratchpad (1-cycle access), while addresses P to N-1 go through an on-chip data cache (1-cycle access) to off-chip addressable memory (10-20 cycles)]

  • Scratchpad Memory Advantages

    An architecturally visible, static, software-managed cache:
    Avoid aliasing problems: decide explicitly which data will be reused; avoid evicting useful data from the cache
    Increase determinism: data are always in the cache when needed
    Save power: avoid the energy cost of high associativity; avoid caching irrelevant data; avoid using two levels of the memory system for temporaries

  • Energy Cost of Hardware Caches

    [Figure: energy per access [nJ] versus memory size (256 to 16384) for a scratchpad and for 2-way set-associative caches covering 4 GB, 16 MB, and 1 MB address spaces. Source: Banakar et al., IEEE 2002]
    Energy consumption in tags, comparators, and muxes is significant.

  • Timing Predictability

    Many embedded systems are real-time systems: computations must be finished in a given amount of time.
    Most memory hierarchies (i.e., caches) for PC-like systems are designed for good average-case behaviour, not for good worst-case behaviour.
    [Figure: worst-case execution time (WCET) larger with a cache than without, for G.721 using a unified cache on an ARM7TDMI. Source: Marwedel, 2007]

  • Scratchpad Memory

    Embedded processor-based system: processor core plus embedded memory
      Instruction and data caches
      Embedded SRAM
      Embedded DRAM
      Scratchpad memory
    Design problems:
      1. How much on-chip memory?
      2. How to partition the on-chip memory into cache and scratchpad?
      3. Which variables/arrays go in the scratchpad?
    Goals: improve performance, save power

  • Architecture Exploration

    Explore the design space exhaustively; this requires an algorithm to perform the partitioning between on- and off-chip memory.

    Algorithm Memory Explore:
      for on-chip memory size T (in powers of 2)
        for cache size C (in powers of 2, < T)
          SRAM size S = T - C
          Data Partition (S)
          for line size L (in powers of 2, < C, < MaxLine)
            Estimate Memory Performance (T, C, S)
      Select the (T, C, S, L) which maximises the optimisation goals
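    A C skeleton of this exploration loop. The helpers data_partition() and estimate_memory_performance() are hypothetical stand-ins for the partitioning algorithm and the performance/energy estimator; the dummy bodies and bounds below are only there to make the sketch compile:

      #define MAX_ONCHIP 16384   /* illustrative upper bound on on-chip size */
      #define MAX_LINE   64      /* illustrative upper bound on line size */

      /* dummy stand-ins for the real partitioner and estimator */
      static void   data_partition(unsigned sram_size) { (void)sram_size; }
      static double estimate_memory_performance(unsigned T, unsigned C, unsigned S)
      { (void)T; return (double)(C + S); }

      void explore(unsigned *bestT, unsigned *bestC, unsigned *bestS, unsigned *bestL)
      {
          double best = -1.0;
          for (unsigned T = 256; T <= MAX_ONCHIP; T *= 2)                 /* on-chip size */
              for (unsigned C = 128; C < T; C *= 2) {                     /* cache size   */
                  unsigned S = T - C;                                     /* scratchpad   */
                  data_partition(S);
                  for (unsigned L = 4; L < C && L <= MAX_LINE; L *= 2) {  /* line size    */
                      double score = estimate_memory_performance(T, C, S);
                      if (score > best) {
                          best = score;
                          *bestT = T; *bestC = C; *bestS = S; *bestL = L;
                      }
                  }
              }
      }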

  • Variation of On-chip Memory Allocation

    [Figure (example: histogram program): effect of different scratchpad/cache size ratios; total on-chip memory size = 2 KB]

  • Variation of Total On-chip Memory

    [Figure (example: histogram program): effect of the total on-chip memory size]

  • Data Partitioning (I)

      procedure Histogram_Evaluation
        char BrightnessLevel[512][512];   /* regular access: off-chip + cache */
        int Hist[256];                    /* irregular access: scratchpad */
        for (i = 0; i < 512; i++)
          for (j = 0; j < 512; j++) {
            /* for each pixel (i,j) in the image */
            level = BrightnessLevel[i][j];
            Hist[level] += 1;
          }
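    A minimal sketch of how this partitioning decision can be expressed in source code with GCC-style section attributes. The section name ".scratchpad" is an assumption: the actual name (or pragma) depends on the toolchain and on a linker script that maps the section onto the scratchpad address range:

      static unsigned char BrightnessLevel[512][512];               /* regular access: off-chip + cache */
      static int Hist[256] __attribute__((section(".scratchpad"))); /* irregular access: scratchpad */

      void histogram_evaluation(void)
      {
          for (int i = 0; i < 512; i++)
              for (int j = 0; j < 512; j++)
                  Hist[BrightnessLevel[i][j]] += 1;                 /* data-dependent index */
      }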

  • Data Partitioning (II)

      procedure Convolution
        int source[128][128], dest[128][128];
        int mask[4][4];
        for (all points x,y of source)
          new = 0;
          for (i scanning the mask horizontally)
            for (j scanning the mask vertically)
              new += source[x+i][y+j] * mask[i][j];
          dest[x][y] = new / norm;

    mask: small -> scratchpad
    source + dest: large and regular -> off-chip + cache
    [Figure: iterations (0,0) and (0,1) reuse the same mask from the scratchpad]

  • Data Partitioning

    Pre-partitioning scratchpad/off-chip:
      Scalar variables and constants to the scratchpad
      Large arrays to off-chip memory
    Detailed partitioning: identify the critical data for the scratchpad. Criteria:
      Life-times of arrays
      Access frequency of arrays
      Loop conflicts
    Similar reasoning can be applied to code.

  • Global Placement Optimization

    Which objects (arrays, loops, functions, etc.) should be stored in the scratchpad?
    Non-overlaying allocation:
      Gain g_k and size s_k for each object k
      Maximise the gain G = Σ g_k, respecting the scratchpad size: Σ s_k ≤ S_SP
      Solution: knapsack algorithm
    Overlaying allocation:
      Move objects back and forth between hierarchy levels
      Solution: more complex...
    [Figure: program objects (loops, functions, arrays) in main memory competing for a scratchpad memory of capacity S_SP attached to the processor. Source: Steinke et al., IEEE 2002]
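    Since the non-overlaying allocation is a 0/1 knapsack problem, it can be solved with the standard dynamic-programming recurrence. A minimal sketch with illustrative gains and sizes (the real gains come from profiling, e.g., cycles or energy saved per object):

      #include <stdio.h>
      #include <string.h>

      #define NOBJ 4
      #define SSP  1024                                 /* scratchpad capacity */

      int main(void)
      {
          const int s[NOBJ] = { 512, 256, 768, 128 };   /* object sizes s_k */
          const int g[NOBJ] = { 900, 400, 800, 300 };   /* object gains g_k */
          static long best[SSP + 1];                    /* best[c] = max gain within capacity c */

          memset(best, 0, sizeof best);
          for (int k = 0; k < NOBJ; k++)                /* classic 0/1 DP, capacity descending */
              for (int c = SSP; c >= s[k]; c--)
                  if (best[c - s[k]] + g[k] > best[c])
                      best[c] = best[c - s[k]] + g[k];

          printf("max gain within %d: %ld\n", SSP, best[SSP]);
          return 0;
      }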

  • Integer Linear Programming

    Symbols:
      S(var_k) = size of variable k
      n(var_k) = number of accesses to variable k
      e(var_k) = energy saved per access if var_k is migrated to the scratchpad
      E(var_k) = energy saved if variable var_k is migrated (= e(var_k) * n(var_k))
      x(var_k) = decision variable: 1 if variable k is migrated to the scratchpad, 0 otherwise
      K = set of variables
      Similar symbols for the set of functions I

    Integer programming formulation:
      Maximise    Σ_{k∈K} x(var_k) E(var_k) + Σ_{i∈I} x(F_i) E(F_i)
      Subject to  Σ_{i∈I} S(F_i) x(F_i) + Σ_{k∈K} S(var_k) x(var_k) ≤ S_SP

    [Source: Steinke et al., IEEE 2002]

  • Reduction in Energy and Runtime

    [Figure: cycles [x100] and energy [J] for the multi_sort benchmark (a mix of sorting algorithms). Source: Steinke et al., IEEE 2002]
    Numbers will change with technology, but the algorithms will remain unchanged.

  • Outline

    Memory data layout
    Scratchpad memory
    Custom memory architectures

  • Array-to-Memory Assignment and Memory Banking

    Exploit the possibility of designing a memory system that is not for general use (e.g., completely uniform) and not built out of standard components (e.g., off-the-shelf DRAMs):
      Ad-hoc bit widths (e.g., a 32-bit word-addressed architecture with specific 6-bit arrays)
      Clustering of arrays in memories: exploit features of eDRAMs; trade off power/energy/area
      Multiple accesses per cycle (e.g., allow concurrent accesses by coprocessors; multiple CPU buses, e.g., on DSPs)
    [Figure: address space split into Bank #1, Bank #2, and Bank #3, plus a small-bit-width bank; some banks are accessible at once]

  • Memory Banking Motivation: (e)DRAMs

      for (i = 0; i < 1000; i++) {
          A[i] += B[i] * C[2*i];
      }

    [Figure: arrays A, B, and C share one (e)DRAM; the row address (Addr[15:8]) selects a page into the page buffer and the column address (Addr[7:0]) selects the word within it]

  • Memory Banking Motivation: (e)DRAMs

      for (i = 0; i < 1000; i++) {
          A[i] += B[i] * C[2*i];
      }

    [Figure: A[i], B[i], and C[2i] each placed in a separate bank, each bank with its own row/column decoding and page buffer feeding the datapath]
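    A minimal sketch of expressing this bank assignment in source code. The section names .bank0/.bank1/.bank2 are assumptions that a linker script would have to map onto the physical (e)DRAM banks; with each array in its own bank, every bank keeps its own page open while the loop streams through A, B, and C:

      static int A[1000] __attribute__((section(".bank0")));
      static int B[1000] __attribute__((section(".bank1")));
      static int C[2000] __attribute__((section(".bank2")));

      void mac_loop(void)
      {
          for (int i = 0; i < 1000; i++)
              A[i] += B[i] * C[2 * i];   /* each array stays within its own page buffer */
      }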

  • Typical DRAM Tradeoffs: Also in High-End Servers

    DRAMs are complex objects:
      Multiple interleaved DRAM banks in a system
      Large premium for burst accesses
      Tradeoff between leaving a page open (neighbouring accesses are faster) and closing it (far accesses do not pay the precharge time)
    In servers, these optimisations are usually dynamic, performed by the memory controller subsystem and controlled by the BIOS/OS.
    In embedded computers, similar optimisations can be done statically and application-specifically.

  • Minimal Number of Banks to Remove Access Conflicts

    [Figure: from the data-flow graph (DFG) and its schedule, a conflict graph is built, from which a minimal bank allocation is derived. Source: modified from Panda et al., ACM 2001]
    Can be extended to model multiple simultaneous accesses to the same array (multi-ported memories).

  • Memory Allocation Exploration

    [Figure: useful exploration space of memory allocations. Source: Panda et al., ACM 2001]

  • Summary

    In SoCs and FPGAs the situation is different from general-purpose computers:
      Different design space (fast memories almost as fast as logic)
      Fewer constraints to use standard components: any size possible, more types of memory available (e.g., dual-port), etc.
      More bandwidth exploitable (no pins)
    Hence a different world where many more things are possible (whereas in classic computing or normal chip-based embedded systems there is not much freedom).
    A companion situation to the customisation of processors:
      Optimisations tailored to the data cache: memory data layout
      Memory architecture customised to a given application: scratchpad memory, memory banking

  • References

    P. R. Panda et al., Data and Memory Optimization Techniques for Embedded Systems, ACM Transactions on Design Automation of Electronic Systems, 6(2):149-206, April 2001
    M. Verma and P. Marwedel, Advanced Memory Optimization Techniques for Low Power Embedded Processors, Springer, 2007
    P. R. Panda (ed.), Memory Issues in Embedded Systems-on-Chip, Kluwer Academic, 1999
    IEEE Design & Test of Computers, Special Issue on Large Embedded Memories, May-June 2001