Embedded Memory Systems
-
Advanced Computer Architecture
Part II: Embedded Computing
Embedded Memory Systems
[email protected] I&C LAP
(Largely based on slides by P. R. Panda, IITD and P. Marwedel, University of Dortmund)
-
Motivation
Memories are the limiting performance factor
  System-on-Chip memories and SRAMs embedded in FPGAs are fast (1-2 cycles access) but:
    On-chip memory might not be enough
    eDRAM or eFLASH may be coming into the picture
Memories are a key energy consumer on SoC
Memory systems in embedded systems can be customised
  Large design space to exploit for optimisation
  SoC and FPGA technologies support to a good extent irregular memory systems
Ienne 2003-08AdvCompArch Embedded Memory Systems2
-
Importance of Memory in SoCs
Some rough rule-of-thumb figures:
  Area: 50-70% of chip area may be memory
  Performance: 10-90% of system performance may be memory related
  Power: 25-40% of system power may be memory related
-
Things Are Only Getting Worse
[Figure: plots of energy and access times, with a "Sub-banking" annotation. Source: Marwedel, 2007]
Applications are getting larger and larger
The energy cost of keeping access times low is very high
-
Some More Recent Estimations
[Figure: two pie charts of the energy breakdown. Source: Verma and Marwedel, Springer 2007]
  Cacheless monoprocessor: slices for Processor Energy and Main Mem. Energy (values 29% and 71%)
  Multiprocessor with I and D caches: slices for Proc. Energy, I-Cache Energy, D-Cache Energy and Main Mem. Energy (values 28.1%, 51.9%, 14.8% and 5.2%, listed in extraction order)
Average of over 200 benchmarks
-
Outline
  Memory data layout
  Scratchpad memory
  Custom memory architectures
-
Memory Data Layout
Problem statement
  Optimise the placement of data in memory to maximise cache effectiveness with minimal resources
Normally compilers place data following language conventions and program order
  Can they do any better if they see the complete embedded application?
Similarity
  This is reminiscent of the study of placement for DSP variables (but the problem and the strategy were different)
-
Array Layout and Data Cache
  int a[1024];
  int b[1024];
  int c[1024];
  ...
  for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
[Figure: arrays a, b and c stored one after the other in memory; a[i], b[i] and c[i] map to the same location of the data cache (direct-mapped, 512 words)]
Problem: Every access leads to a cache miss!
-
Aliasing Example
Cache size C, line size M, array size N
Addresses and cache position:
  a[i]: address i, cache position (i mod C) / M
  b[i]: address i + N, cache position ((i + N) mod C) / M
  c[i]: address i + 2N, cache position ((i + 2N) mod C) / M
If N = kC, all cache positions are identical!
  C is normally a power of 2
  N is often a power of 2, too
Solutions?
  Set-associative cache: costly!
  Make C larger than N?!
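The position formulas above can be checked with a few lines of C. This is only a sketch: the function names `cache_position` and `arrays_alias` are made up for illustration, addresses are counted in words, and the three arrays are assumed to sit back to back starting at address 0.

```c
#include <assert.h>
#include <stddef.h>

/* Cache position (line index) of a word address in a direct-mapped cache.
 * cache_words is C (cache size in words), line_words is M (line size in words). */
static size_t cache_position(size_t addr, size_t cache_words, size_t line_words)
{
    assert(line_words > 0 && cache_words % line_words == 0);
    return (addr % cache_words) / line_words;
}

/* Returns 1 if a[i], b[i] and c[i] fall on the same cache line, assuming the
 * three arrays of n words each are laid out back to back from address 0. */
static int arrays_alias(size_t i, size_t n, size_t cache_words, size_t line_words)
{
    size_t pa = cache_position(i,         cache_words, line_words);
    size_t pb = cache_position(i + n,     cache_words, line_words);
    size_t pc = cache_position(i + 2 * n, cache_words, line_words);
    return pa == pb && pb == pc;
}
```

With N = 1024 and C = 512 (so N = 2C) the three positions coincide for every i, whereas an array size that is not a multiple of C, such as 1000, separates them.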
-
Energy Cost of Associativity
[Figure: energy cost of cache associativity. Source: Banakar et al., IEEE 2002]
-
A Solution: Array Padding
  int a[1024];
  int b[1024];
  int c[1024];
  ...
  for (i = 0; i < N; i++)
    c[i] = a[i] + b[i];
[Figure: memory layout a, DUMMY (M words), b, DUMMY, c, DUMMY; a[i], b[i] and c[i] now map to different locations of the data cache (direct-mapped, 512 words)]
Data alignment avoids cache conflicts
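The effect of the DUMMY pads can be illustrated with a tiny direct-mapped cache model. This is a sketch under stated assumptions: the line size of 4 words, the access order a[i], b[i], c[i], the allocate-on-write policy, and the name `count_misses` are all illustrative choices, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_WORDS 512   /* direct-mapped data cache size, as on the slide */
#define LINE_WORDS  4     /* line size M: an assumed value */
#define NUM_LINES   (CACHE_WORDS / LINE_WORDS)

/* Counts misses for: for (i = 0; i < n; i++) c[i] = a[i] + b[i];
 * The arrays of n words each are placed back to back, separated by pad words. */
static unsigned long count_misses(size_t n, size_t pad)
{
    long tags[NUM_LINES];
    size_t base[3];
    unsigned long misses = 0;

    assert(n > 0);
    for (size_t l = 0; l < NUM_LINES; l++)
        tags[l] = -1;                       /* all lines start invalid */
    base[0] = 0;
    base[1] = n + pad;
    base[2] = 2 * (n + pad);

    for (size_t i = 0; i < n; i++) {
        for (int k = 0; k < 3; k++) {       /* order: a[i], b[i], c[i] */
            size_t addr = base[k] + i;
            size_t line = (addr % CACHE_WORDS) / LINE_WORDS;
            long tag = (long)(addr / CACHE_WORDS);
            if (tags[line] != tag) {        /* miss: fetch and replace the line */
                misses++;
                tags[line] = tag;
            }
        }
    }
    return misses;
}
```

In this model `count_misses(1024, 0)` reports a miss on every one of the 3072 accesses, while a pad of one line (`count_misses(1024, LINE_WORDS)`) leaves only the 768 compulsory misses, one per line touched.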
-
Classic Transformation: Loop Blocking
Modify the loop exploration space in blocks (or tiles) so that all elements accessed at once fit the cache

Original Code
  for i = 1 to N
    for k = 1 to N
      r = X[i,k]
      for j = 1 to N
        Z[i,j] += r * Y[k,j]

Blocked Code
  for kk = 1 to N step B
    for jj = 1 to N step B
      for i = 1 to N
        for k = kk to min(kk+B-1, N)
          r = X[i,k]
          for j = jj to min(jj+B-1, N)
            Z[i,j] += r * Y[k,j]

[Figure: N x N matrix with a B x B tile highlighted]
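The two loop nests above can be written as runnable C for comparison. This is a minimal sketch assuming square row-major matrices stored in flat arrays, 0-based indices, and an accumulating update of Z; the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

static size_t min_size(size_t a, size_t b) { return a < b ? a : b; }

/* Z += X * Y for n x n row-major matrices: the untiled reference version. */
static void matmul_plain(size_t n, const double *X, const double *Y, double *Z)
{
    for (size_t i = 0; i < n; i++)
        for (size_t k = 0; k < n; k++) {
            double r = X[i * n + k];
            for (size_t j = 0; j < n; j++)
                Z[i * n + j] += r * Y[k * n + j];
        }
}

/* Same computation, visiting Y in B x B tiles so that a tile can stay cached
 * while it is reused for every row i. */
static void matmul_blocked(size_t n, size_t B,
                           const double *X, const double *Y, double *Z)
{
    assert(B > 0);
    for (size_t kk = 0; kk < n; kk += B)
        for (size_t jj = 0; jj < n; jj += B)
            for (size_t i = 0; i < n; i++)
                for (size_t k = kk; k < min_size(kk + B, n); k++) {
                    double r = X[i * n + k];
                    for (size_t j = jj; j < min_size(jj + B, n); j++)
                        Z[i * n + j] += r * Y[k * n + j];
                }
}
```

Both versions add the contributions to each Z element in the same order of k, so they produce identical results; only the memory access pattern differs.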
-
Array Tiling Reduces Aliasing Too
Idea:
  Split the array in blocks or tiles and group tiles of each array which are accessed at once
  If the tiles are small enough, the set of tiles accessed at once will fit into the cache
  Since they are adjacent in data memory, they will not conflict in the cache
[Figure: tiled array layout. Source: Panda et al., IEEE 2001]
-
Real-World Example: FFT
  double sigreal[2048]
  le = le / 2
  for (i = j; i < 2048; i += 2*le)
  {
    ... = sigreal[i]
    ... = sigreal[i + le]
    sigreal[i] = ...
    sigreal[i + le] = ...
  }
[Figure: 1st outer loop iteration; array sigreal with elements 0 and 1024 marked, next to the cache (positions 0 to 511)]
-
Padded FFT
  double sigreal[2048 + 16]
  le = le / 2; le' = le + le / 128
  for (i = j; i < 2048; i += 2*le) {
    i' = i + i / 128
    ... = sigreal[i']
    ... = sigreal[i' + le']
    sigreal[i'] = ...
    sigreal[i' + le'] = ...
  }
Pads (~1 cache line, every cache size)
[Figure: 1st outer loop iteration; array sigreal with elements 0 and 1032 marked, next to the cache (positions 0 to 511)]
Padding: 15% speed-up on a Sparc5
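The index remapping used by the padded FFT can be sketched as a small helper. The mapping i + i/128 is the one on the slide; the cache geometry passed to `line_of` (sizes in array elements) is a hypothetical choice for illustration, and both function names are made up.

```c
#include <assert.h>
#include <stddef.h>

#define PAD_INTERVAL 128   /* one pad element every 128 elements, as on the slide */

/* Maps a logical index to its position in the padded array: i + i / 128. */
static size_t padded_index(size_t i)
{
    return i + i / PAD_INTERVAL;
}

/* Cache line of element idx in a direct-mapped cache; sizes are in elements. */
static size_t line_of(size_t idx, size_t cache_elems, size_t line_elems)
{
    assert(line_elems > 0 && cache_elems % line_elems == 0);
    return (idx % cache_elems) / line_elems;
}
```

Element 1024 moves to position 1032, matching the figure, and the last element (2047) lands at 2062, inside the 2048 + 16 elements allocated. With an assumed cache of 1024 elements and 4-element lines, elements 0 and 1024 share a line before padding and are two lines apart after it.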
-
Algorithms to Decide Data Layout
Need to make the following decisions:
  Tile Size Computation
    Largest possible tiles such that the working set fits the cache
  Pad Size Computation
    Minimum pad size which eliminates aliasing
  Interleaving of Tiled Arrays
    Arrangement of multiple arrays so that there is no aliasing among arrays and all working sets fit the cache
-
Algorithms to Decide Data Layout
Matrix Multiplication (Array Sizes 35-350)
[Figure: comparison of the TSS, ESS, LRW and DAT algorithms]
DAT uses fixed tile dimensions
Others use widely varying sizes
-
Outline
  Memory data layout
  Scratchpad memory
  Custom memory architectures
-
Scratchpad Memory Idea
[Figure: the CPU sees one addressable memory space 0..N-1. Addresses 0..P-1 are the on-chip scratchpad (1 cycle). Addresses P..N-1 are off-chip memory (10-20 cycles), accessed through the on-chip data cache (1 cycle)]
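The address split in the figure can be sketched as a decode function plus a simple latency model. The cycle counts follow the figure; treating a cache miss as costing exactly the off-chip latency is a simplifying assumption, and the names `decode` and `latency` are illustrative.

```c
#include <assert.h>
#include <stddef.h>

enum region { SCRATCHPAD, OFF_CHIP_VIA_CACHE };

/* Addresses 0..P-1 belong to the scratchpad, P..N-1 to off-chip memory. */
static enum region decode(size_t addr, size_t P, size_t N)
{
    assert(P <= N && addr < N);
    return addr < P ? SCRATCHPAD : OFF_CHIP_VIA_CACHE;
}

/* Access latency in cycles. miss_penalty is the off-chip access cost
 * (10-20 cycles in the figure); cache_hit is supplied by the caller. */
static unsigned latency(size_t addr, size_t P, size_t N,
                        int cache_hit, unsigned miss_penalty)
{
    if (decode(addr, P, N) == SCRATCHPAD)
        return 1;                 /* always one cycle: no tags, no misses */
    return cache_hit ? 1 : miss_penalty;
}
```

The point of the model is that a scratchpad access has a fixed cost, whereas the cost of a cached access depends on run-time state.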
-
Scratchpad Memory Advantages
Architecturally visible static software-managed cache
  Avoid aliasing problems
  Decide explicitly which data will be reused
    Avoid evicting useful data from the cache
  Increase determinism
    Data are always in the cache when needed
  Save power
    Avoid energy cost of high associativity
    Avoid caching irrelevant data
    Avoid using 2 levels of the memory system for temporaries
-
Energy Cost of Hardware Caches
[Figure: energy per access [nJ] versus memory size (256 to 16384) for: Scratch pad; Cache, 2way, 4GB space; Cache, 2way, 16 MB space; Cache, 2way, 1 MB space. Source: Banakar et al., IEEE 2002]
Energy consumption in tags, comparators, and muxes is significant
-
Timing Predictability
Many embedded systems are real-time systems: computations must be finished in a given amount of time
Most memory hierarchies (i.e., caches) for PC-like systems are designed for good average-case, not for good worst-case behavior
[Figure: G.721 using a unified cache on an ARM7TDMI. Source: Marwedel, 2007]
Worst case execution time (WCET) larger than without cache
-
Scratchpad Memory
Embedded processor-based system
  Processor core
  Embedded memory
    Instruction and Data Cache
    Embedded SRAM
    Embedded DRAM
    Scratch Pad Memory
Design problems
  1. How much on-chip memory?
  2. Partitioning of on-chip memory in cache and scratchpad?
  3. Which variables/arrays in the scratchpad?
Goals
  Improve performance
  Save power
-
Architecture Exploration
Explore exhaustively the design space
  Requires an algorithm to perform partitioning between on- and off-chip memory

Algorithm Memory Explore
  for On-chip Memory Size T (in powers of 2)
    for Cache Size C (in powers of 2, < T)
      SRAM Size S = T - C
      Data Partition (S)
      for line size L (in powers of 2, < C, < MaxLine)
        Estimate Memory Performance (T, C, S)
  Select (T, C, S, L) which maximize optimisation goals
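The exploration loop can be sketched in C as follows. Everything beyond the loop structure is an assumption: the estimator is passed in as a function pointer, it is treated as a cost to minimise (the slide speaks of maximising optimisation goals), the data partitioning step is only marked by a comment, and `toy_estimate` is an invented stand-in with no physical meaning.

```c
#include <assert.h>
#include <stddef.h>

struct config { size_t T, C, S, L; };

/* Hypothetical cost model supplied by the caller: lower is better. */
typedef double (*estimate_fn)(size_t T, size_t C, size_t S, size_t L);

/* Toy stand-in for a real estimator (an assumption, not from the slides):
 * small caches and scratchpads cost misses, large totals cost area. */
static double toy_estimate(size_t T, size_t C, size_t S, size_t L)
{
    return 4096.0 / (double)C + 4096.0 / (double)S
         + (double)T / 256.0 + 16.0 / (double)L;
}

static struct config memory_explore(size_t max_total, size_t max_line,
                                    estimate_fn estimate)
{
    struct config best = {0, 0, 0, 0};
    double best_cost = 0.0;
    int found = 0;

    assert(estimate != NULL);
    for (size_t T = 2; T <= max_total; T *= 2)       /* on-chip memory size */
        for (size_t C = 1; C < T; C *= 2) {          /* cache size, < T */
            size_t S = T - C;                        /* scratchpad SRAM size */
            /* Data Partition (S) would be performed here. */
            for (size_t L = 1; L < C && L < max_line; L *= 2) {
                double cost = estimate(T, C, S, L);
                if (!found || cost < best_cost) {
                    best_cost = cost;
                    best = (struct config){T, C, S, L};
                    found = 1;
                }
            }
        }
    return best;
}
```

With the toy estimator, `memory_explore(8192, 64, toy_estimate)` settles on T = 2048, C = 1024, S = 1024, L = 32. A real flow would replace the estimator with simulation or an analytical model of the application.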
-
Variation of On-chip Memory Allocation
[Figure: effect of different ratios of scratchpad/cache sizes. Example: Histogram]
Total on-chip memory size = 2 KB
-
Variation of Total On-chip Memory
[Figure: effect of on-chip memory size. Example: Histogram]
-
Data Partitioning (I)
  procedure Histogram_Evaluation
    char BrightnessLevel[512][512];
    int Hist[256];
    for (i = 0; i < 512; i++)
      for (j = 0; j < 512; j++) {
        /* for each pixel (i,j) in image */
        level = BrightnessLevel[i][j];
        Hist[level] += 1;
      }
BrightnessLevel: Regular Access, so Off-chip + Cache
Hist: Irregular Access, so Scratchpad
-
Data Partitioning (II)
  procedure Convolution
    int source[128][128], dest[128][128];
    int mask[4][4];
    for (all points x,y of source)
      new = 0;
      for (i scanning the mask horizontally)
        for (j scanning the mask vertically)
          new += source[x+i][y+j] * mask[i][j];
      dest[x][y] = new / norm;
[Figure: mask position over source at Iteration (0,0) and Iteration (0,1)]
mask: Small, so Scratchpad
source + dest: Large and Regular, so Off-chip + Cache
-
Data Partitioning
Pre-Partitioning Scratchpad/Off-chip
  Scalar variables and constants to scratchpad
  Large arrays to off-chip memory
Detailed Partitioning
  Identify critical data for scratchpad
  Criteria:
    Life-times of arrays
    Access frequency of arrays
    Loop conflicts
Similar reasoning can be applied to code
-
Global Placement Optimization
Which object (array, loop, etc.) to be stored in a scratchpad?
Non-overlaying allocation
  Gain g_k + size s_k for each object k
  Maximise gain G = \sum_k g_k, respecting the scratchpad size SSP \ge \sum_k s_k (sums over the objects placed in the scratchpad)
  Solution: knapsack algorithm
Overlaying allocation
  Moving objects back and forth between hierarchy levels
  Solution: more complex...
[Figure: main memory holding objects (for i ... { }, for j ... { }, while ..., repeat ..., function ..., Array ..., Int ...), a processor, and a scratchpad memory of capacity SSP. Source: Steinke et al., IEEE 2002]
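The non-overlaying case maps onto the textbook 0/1 knapsack problem, which a dynamic program solves exactly for integer sizes. This is a generic sketch, not the specific algorithm of the cited work; the function name, argument layout and integer gains are assumptions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Non-overlaying allocation as a 0/1 knapsack: choose objects that maximise
 * the total gain subject to total size <= capacity (SSP). chosen[k] is set to
 * 1 if object k goes to the scratchpad. Returns the maximal gain G, or -1 if
 * memory for the table cannot be allocated. */
static long knapsack(size_t n, const size_t *size, const long *gain,
                     size_t capacity, int *chosen)
{
    size_t cols = capacity + 1;
    long *best;
    long total;

    assert(size != NULL && gain != NULL && chosen != NULL);
    best = calloc((n + 1) * cols, sizeof *best);
    if (best == NULL)
        return -1;

    /* best[k][c] = best gain using the first k objects within capacity c */
    for (size_t k = 1; k <= n; k++)
        for (size_t c = 0; c <= capacity; c++) {
            long v = best[(k - 1) * cols + c];          /* leave object k-1 out */
            if (size[k - 1] <= c) {
                long w = best[(k - 1) * cols + c - size[k - 1]] + gain[k - 1];
                if (w > v)
                    v = w;                              /* put object k-1 in */
            }
            best[k * cols + c] = v;
        }

    total = best[n * cols + capacity];
    for (size_t k = n, c = capacity; k > 0; k--) {      /* trace back the choices */
        chosen[k - 1] = best[k * cols + c] != best[(k - 1) * cols + c];
        if (chosen[k - 1])
            c -= size[k - 1];
    }
    free(best);
    return total;
}
```

For three objects of sizes 3, 4 and 2 with gains 30, 50 and 15 and a scratchpad of capacity 6, the routine picks the second and third objects for a total gain of 65.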
-
Integer Linear Programming
Symbols:
  S(var_k) = size of variable k
  n(var_k) = number of accesses to variable k
  e(var_k) = energy saved per variable access, if var_k is migrated
  E(var_k) = energy saved if variable var_k is migrated (= e(var_k) n(var_k))
  x(var_k) = decision variable: = 1 if variable k is migrated to scratchpad, = 0 otherwise
  K = set of variables
  Similar for functions F_i, i \in I
Integer programming formulation:
  Maximize
    \[ \sum_{k \in K} x(var_k)\, E(var_k) + \sum_{i \in I} x(F_i)\, E(F_i) \]
  Subject to the constraint
    \[ \sum_{i \in I} S(F_i)\, x(F_i) + \sum_{k \in K} S(var_k)\, x(var_k) \le SSP \]
Source: Steinke et al., IEEE 2002
-
Reduction in Energy and Runtime
[Figure: Cycles [x100] and Energy for the multi_sort benchmark (mix of sorting algorithms). Source: Steinke et al., IEEE 2002]
Numbers will change with technology but algorithms will remain unchanged
-
Outline
  Memory data layout
  Scratchpad memory
  Custom memory architectures
-
Array-to-Memory Assignment and Memory Banking
Exploit the possibility of designing a memory system not for general use (e.g., completely uniform) and not of standard components (e.g., off-the-shelf DRAMs)
  Ad-hoc bit widths
    E.g., 32-bit word-addressed architecture with specific 6-bit arrays
  Clustering of arrays in memories
    Exploit features of eDRAMs
    Trade-offs power/energy/area
  Multiple accesses per cycle
    E.g., allow concurrent accesses by coprocessors
    Multiple CPU buses, e.g., DSPs
[Figure: address space split over Bank #1, Bank #2 and Bank #3; one bank has a small bitwidth; the banks are accessible at once]
-
Memory Banking Motivation: (e)DRAMs
  for (i = 0; i < 1000; i++)
  { A[i] += B[i] * C[2*i]; }
[Figure: a single DRAM holding arrays A, B and C. The row address Addr[15:8] selects a page, which is loaded into the page buffer; the column address Addr[7:0] selects the data within the page buffer]
-
Memory Banking Motivation: (e)DRAMs
  for (i = 0; i < 1000; i++)
  { A[i] += B[i] * C[2*i]; }
[Figure: three banks, one each for A[i], B[i] and C[2i]. Each bank has its own row and column address, its own page buffer, and its own path to the datapath]
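The benefit of giving each array its own bank can be illustrated by counting page-buffer misses for the loop above. This is a sketch under several assumptions: one array element per word address, hypothetical base addresses, an assumed access order within each iteration, and an invented function name `page_misses`. Only the row/column split Addr[15:8] / Addr[7:0] comes from the figure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NO_OPEN_ROW 0xFFFFu   /* marker: no page loaded in the page buffer */

/* Row = Addr[15:8]; the column Addr[7:0] selects within the open page. */
static unsigned row_of(uint16_t addr) { return (unsigned)addr >> 8; }

/* Counts page (row) misses for: for (i = 0; i < n; i++) A[i] += B[i] * C[2*i];
 * With banks == 1 all arrays share one page buffer; with banks == 3 each
 * array has its own bank and page buffer. Base addresses are hypothetical. */
static unsigned long page_misses(size_t n, int banks)
{
    const uint16_t base[3] = { 0x0000, 0x4000, 0x8000 };   /* A, B, C */
    const int arr[4] = { 1, 2, 0, 0 };  /* read B, read C, read A, write A */
    unsigned open_row[3] = { NO_OPEN_ROW, NO_OPEN_ROW, NO_OPEN_ROW };
    unsigned long misses = 0;

    assert(banks == 1 || banks == 3);
    for (size_t i = 0; i < n; i++) {
        for (int t = 0; t < 4; t++) {
            size_t idx = (arr[t] == 2) ? 2 * i : i;
            uint16_t addr = (uint16_t)(base[arr[t]] + idx);
            int bank = (banks == 3) ? arr[t] : 0;
            if (open_row[bank] != row_of(addr)) {
                misses++;               /* a new row must be opened */
                open_row[bank] = row_of(addr);
            }
        }
    }
    return misses;
}
```

In this model the single-bank layout opens a new page 3000 times for the 1000 iterations, because A, B and C keep displacing each other's page, while the three-bank layout needs only 16 page openings in total.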
-
Typical DRAM Tradeoffs Also in High-End Servers
DRAMs are complex objects:
  Multiple interleaved DRAM banks in a system
  Large premium for burst accesses
  Tradeoff between leaving a page open, which makes neighbouring accesses faster, and closing a page, which avoids the precharge time for far accesses
In servers, optimizations are usually of a dynamic nature, performed by the memory controller subsystem and controlled by the BIOS/OS
In embedded computers, similar optimizations can be done statically and in an application-specific way
-
Minimal Number of Banks to Remove Access Conflicts
[Figure: flow from DFG to Schedule to Conflict Graph to Minimal Allocation. Modified from Panda et al., ACM 2001]
Can be extended to model multiple simultaneous accesses to the same array (multi-ported memories)
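One way to read the step from conflict graph to minimal allocation is as graph colouring: arrays are nodes, an edge joins two arrays accessed in the same cycle, and each colour is a bank. Whether the cited work uses exactly this formulation is an assumption. The exhaustive search below is a sketch that is only practical for a handful of arrays; `MAX_ARRAYS` and the function names are illustrative.

```c
#include <assert.h>

#define MAX_ARRAYS 16

/* conflict[u][v] != 0 means arrays u and v are accessed in the same cycle
 * and therefore must not share a single-ported bank. Tries to assign banks
 * to arrays v..n-1 given the assignment of arrays 0..v-1. */
static int can_colour(int n, int conflict[MAX_ARRAYS][MAX_ARRAYS],
                      int banks, int *bank_of, int v)
{
    if (v == n)
        return 1;
    for (int b = 0; b < banks; b++) {
        int ok = 1;
        for (int u = 0; u < v; u++)
            if (conflict[u][v] && bank_of[u] == b) { ok = 0; break; }
        if (ok) {
            bank_of[v] = b;
            if (can_colour(n, conflict, banks, bank_of, v + 1))
                return 1;
        }
    }
    return 0;
}

/* Smallest number of banks such that no two conflicting arrays share one.
 * On return bank_of[0..n-1] holds one valid assignment. */
static int min_banks(int n, int conflict[MAX_ARRAYS][MAX_ARRAYS], int *bank_of)
{
    assert(n >= 0 && n <= MAX_ARRAYS && bank_of != 0);
    if (n == 0)
        return 0;
    for (int banks = 1; banks <= n; banks++)
        if (can_colour(n, conflict, banks, bank_of, 0))
            return banks;
    return n;   /* not reached: n banks always suffice */
}
```

For four arrays where the first three conflict pairwise and the fourth conflicts only with the first, three banks are needed, and the fourth array can share a bank with the second or third.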
-
Memory Allocation Exploration
[Figure: design points with the "Useful Exploration Space" marked. Source: Panda et al., ACM 2001]
-
Summary
In SoCs and FPGAs the situation is different from general-purpose computers
  Different design space (fast memories almost as fast as logic)
  Fewer constraints to use standard components: any size possible, more types of memory available (e.g., dual port), etc.
  More bandwidth exploitable (no pins)
Hence a different world where many more things are possible (whereas in classic computing or normal chip-based embedded systems there is not much freedom)
Companion situation to the customisation of processors:
  Optimizations tailored to Data Cache
    Memory Data Layout
  Memory architecture customized to a given application
    Scratchpad Memory
    Memory Banking
-
References
P. R. Panda et al., Data and Memory Optimization Techniques for Embedded Systems, ACM Transactions on Design Automation of Electronic Systems, 6(2):149-206, April 2001
M. Verma and P. Marwedel, Advanced Memory Optimization Techniques for Low Power Embedded Processors, Springer, 2007
P. R. Panda (ed.), Memory Issues in Embedded Systems-on-Chip, Kluwer Academic, 1999
IEEE Design & Test of Computers, Special issue on Large Embedded Memories, May-June 2001