Embedded Memory Systems


Description: Embedded systems

Transcript of Embedded Memory Systems

  • Advanced Computer Architecture

    Part II: Embedded Computing
    Embedded Memory Systems

    [email protected] I&C LAP

    (Largely based on slides by P. R. Panda, IITD and P. Marwedel, University of Dortmund)

  • Motivation

    Memories are the limiting performance factor. System-on-Chip memories and SRAMs embedded in FPGAs are fast (1-2 cycle access), but on-chip memory might not be enough; eDRAM or eFLASH may be coming into the picture.
    Memories are a key energy consumer on SoCs.
    Memory systems in embedded systems can be customised: a large design space to exploit for optimisation. SoC and FPGA technologies support irregular memory systems to a good extent.

  • Importance of Memory in SoCs

    Some rough rule-of-thumb figures:
    Area: 50-70% of chip area may be memory
    Performance: 10-90% of system performance may be memory related
    Power: 25-40% of system power may be memory related

  • Things Are Only Getting Worse

    [Figure: DRAM energy and access-time trends with sub-banking. Source: Marwedel, 2007]
    Applications are getting larger and larger. The energy cost of keeping access times low is very high.

  • Some More Recent Estimations

    [Figure: energy breakdown pie charts, average of over 200 benchmarks. Cacheless monoprocessor: Processor Energy 29%, Main Memory Energy 71%. Multiprocessor with I and D caches: Processor, I-Cache, D-Cache, and Main Memory Energy (28.1%, 51.9%, 14.8%, 5.2%). Source: Verma and Marwedel, Springer, 2007]

  • Outline

    Memory data layout
    Scratchpad memory
    Custom memory architectures

  • Memory Data Layout

    Problem statement: optimise the placement of data in memory to maximise cache effectiveness with minimal resources.
    Normally, compilers place data following language conventions and program order. Can they do any better if they see the complete embedded application?
    Similarity: this is reminiscent of the study of placement for DSP variables (but the problem and the strategy were different).

  • Array Layout and Data Cache

      int a[1024];
      int b[1024];
      int c[1024];
      ...
      for (i = 0; i < N; i++)
          c[i] = a[i] + b[i];

    [Figure: arrays a, b, and c laid out consecutively in memory; a[i], b[i], and c[i] all map to the same line of a direct-mapped, 512-word data cache]
    Problem: every access leads to a cache miss!

  • Aliasing Example

    Cache size C, line size M, array size N. Addresses and cache positions:
      a[i]: address i,      cache line (i mod C) / M
      b[i]: address i + N,  cache line ((i + N) mod C) / M
      c[i]: address i + 2N, cache line ((i + 2N) mod C) / M
    If N = kC, all cache positions are identical! C is normally a power of 2, and N is often a power of 2 too.
    Solutions? A set-associative cache, or make C larger than N?! Costly!
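    The mapping above can be checked with a few lines of C. A minimal sketch, assuming a direct-mapped cache of C = 512 words with M = 4-word lines and the back-to-back array layout of this slide (all parameter values are illustrative):

      /* Prints the cache line that a[i], b[i] and c[i] map to; when N is a
       * multiple of C, all three collide on every iteration. */
      #include <stdio.h>

      enum { C = 512, M = 4, N = 1024 };   /* cache words, line words, array words */

      static unsigned cache_line(unsigned word_addr)
      {
          return (word_addr % C) / M;      /* line index of a word address */
      }

      int main(void)
      {
          for (unsigned i = 0; i < 8; i++) /* a few iterations suffice to see it */
              printf("i=%u  a:%u  b:%u  c:%u\n",
                     i, cache_line(i), cache_line(i + N), cache_line(i + 2 * N));
          return 0;
      }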

  • Energy Cost of Associativity

    [Figure: energy per access as a function of cache associativity. Source: Banakar et al., IEEE 2002]

  • A Solution: Array Padding

      int a[1024];
      int b[1024];
      int c[1024];
      ...
      for (i = 0; i < N; i++)
          c[i] = a[i] + b[i];

    [Figure: arrays a, b, and c separated in memory by M-word DUMMY pads; a[i], b[i], and c[i] now map to different lines of the direct-mapped, 512-word data cache]
    Data alignment avoids cache conflicts.
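    The padding can also be made explicit in the source. A minimal sketch, assuming a line size of M = 4 words (in practice a compiler or linker may reorder separate globals, so the pads are often placed inside a struct or via the linker script):

      #define N   1024
      #define PAD 4                 /* assumed cache line size M = 4 words */

      int a[N];
      int pad_ab[PAD];              /* dummy words, never accessed */
      int b[N];
      int pad_bc[PAD];
      int c[N];

      void vector_add(void)
      {
          for (int i = 0; i < N; i++)
              c[i] = a[i] + b[i];   /* a[i], b[i], c[i] now hit different cache lines */
      }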

  • Classic Transformation: Loop Blocking

    Modify the loop iteration space in blocks (or tiles) so that all elements accessed at once fit in the cache.

    Original code:
      for i = 1 to N
        for k = 1 to N
          r = X[i,k]
          for j = 1 to N
            Z[i,j] = Z[i,j] + r * Y[k,j]

    Blocked code:
      for kk = 1 to N step B
        for jj = 1 to N step B
          for i = 1 to N
            for k = kk to min(kk+B-1, N)
              r = X[i,k]
              for j = jj to min(jj+B-1, N)
                Z[i,j] = Z[i,j] + r * Y[k,j]

    [Figure: B x B tiles within the N x N iteration space]
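    A minimal C rendering of the blocked loop nest above (0-based indexing, illustrative sizes; B should be chosen so that the tiles touched together fit in the data cache):

      #define N 512
      #define B 32

      static double X[N][N], Y[N][N], Z[N][N];

      static int min(int a, int b) { return a < b ? a : b; }

      void matmul_blocked(void)
      {
          for (int kk = 0; kk < N; kk += B)
              for (int jj = 0; jj < N; jj += B)
                  for (int i = 0; i < N; i++)
                      for (int k = kk; k < min(kk + B, N); k++) {
                          double r = X[i][k];
                          for (int j = jj; j < min(jj + B, N); j++)
                              Z[i][j] += r * Y[k][j];
                      }
      }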

  • Array Tiling Reduces Aliasing Too

    Idea: split the arrays in blocks or tiles and group the tiles of each array which are accessed at once.
    If the tiles are small enough, the set of tiles accessed at once will fit into the cache. Since they are adjacent in data memory, they will not conflict in the cache.
    [Figure: tiles of several arrays grouped together in data memory. Source: Panda et al., IEEE 2001]

  • Real-World Example: FFT

      double sigreal[2048];
      ...
      le = le / 2;
      for (i = j; i < 2048; i += 2*le) {
          ... = sigreal[i];
          ... = sigreal[i + le];
          sigreal[i] = ...;
          sigreal[i + le] = ...;
      }

    [Figure: 1st outer-loop iteration; accesses to the 2048-word array sigreal (e.g., indices 0 and 1024) map to the same line of the 512-word cache]

  • Padded FFT

      double sigreal[2048 + 16];
      ...
      le = le / 2;  le = le + le / 128;
      for (i = j; i < 2048; i += 2*le) {
          i = i + i / 128;
          ... = sigreal[i];
          ... = sigreal[i + le];
          sigreal[i] = ...;
          sigreal[i + le] = ...;
      }

    Pads: ~1 cache line, every cache size.
    [Figure: 1st outer-loop iteration; the formerly conflicting accesses (e.g., indices 0 and 1032 of the padded array) now map to different lines of the 512-word cache]
    Padding: 15% speed-up on a Sparc5.
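    The index remapping on this slide amounts to inserting one pad element after every 128 array elements. A minimal sketch of that arithmetic, assuming exactly this padding scheme (which also accounts for the 2048 + 16 declaration):

      #include <stdio.h>

      #define LOGICAL_N 2048
      #define PAD_EVERY 128
      #define PADDED_N  (LOGICAL_N + LOGICAL_N / PAD_EVERY)   /* 2048 + 16 */

      static double sigreal[PADDED_N];

      static int padded(int i) { return i + i / PAD_EVERY; }   /* i = i + i/128 */

      int main(void)
      {
          /* indices that collided in the 512-word direct-mapped cache before padding */
          printf("%d %d\n", padded(0), padded(1024));           /* 0 and 1032 */
          return 0;
      }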

  • Algorithms to Decide Data Layout

    Need to make the following decisions:
    Tile size computation: the largest possible tiles such that the working set fits the cache
    Pad size computation: the minimum pad size which eliminates aliasing
    Interleaving of tiled arrays: the arrangement of multiple arrays so that there is no aliasing among arrays and all working sets fit the cache

  • Algorithms to Decide Data Layout

    [Figure: tile sizes selected by TSS, ESS, LRW, and DAT for matrix multiplication with array sizes 35-350]
    DAT uses fixed tile dimensions; the others use widely varying sizes.

  • Outline

    Memory data layout
    Scratchpad memory
    Custom memory architectures

  • Scratchpad Memory: Idea

    [Figure: the CPU sees a single address space; addresses 0 to P-1 are served by an on-chip scratchpad (1-cycle access), while addresses P to N-1 go through an on-chip data cache (1-cycle access) to off-chip addressable memory (10-20 cycles)]

  • Scratchpad Memory Advantages

    An architecturally visible, static, software-managed cache:
    Avoid aliasing problems: decide explicitly which data will be reused; avoid evicting useful data from the cache
    Increase determinism: data are always in the cache when needed
    Save power: avoid the energy cost of high associativity; avoid caching irrelevant data; avoid using two levels of the memory system for temporaries

  • Energy Cost of Hardware Caches

    [Figure: energy per access [nJ] versus memory size (256 to 16384) for a scratchpad and for 2-way set-associative caches covering 4 GB, 16 MB, and 1 MB address spaces. Source: Banakar et al., IEEE 2002]
    Energy consumption in tags, comparators, and muxes is significant.

  • Timing Predictability

    Many embedded systems are real-time systems: computations must be finished in a given amount of time.
    Most memory hierarchies (i.e., caches) for PC-like systems are designed for good average-case behaviour, not for good worst-case behaviour.
    [Figure: worst-case execution time (WCET) larger with a cache than without, for G.721 using a unified cache on an ARM7TDMI. Source: Marwedel, 2007]

  • Scratchpad Memory

    Embedded processor-based system: processor core plus embedded memory
      Instruction and data caches
      Embedded SRAM
      Embedded DRAM
      Scratchpad memory
    Design problems:
      1. How much on-chip memory?
      2. How to partition the on-chip memory into cache and scratchpad?
      3. Which variables/arrays go in the scratchpad?
    Goals: improve performance, save power

  • Architecture Exploration

    Explore the design space exhaustively; this requires an algorithm to perform the partitioning between on- and off-chip memory.

    Algorithm Memory Explore:
      for on-chip memory size T (in powers of 2)
        for cache size C (in powers of 2, < T)
          SRAM size S = T - C
          Data Partition (S)
          for line size L (in powers of 2, < C, < MaxLine)
            Estimate Memory Performance (T, C, S)
      Select the (T, C, S, L) which maximises the optimisation goals
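    A C skeleton of this exploration loop. The helpers data_partition() and estimate_memory_performance() are hypothetical stand-ins for the partitioning algorithm and the performance/energy estimator; the dummy bodies and bounds below are only there to make the sketch compile:

      #define MAX_ONCHIP 16384   /* illustrative upper bound on on-chip size */
      #define MAX_LINE   64      /* illustrative upper bound on line size */

      /* dummy stand-ins for the real partitioner and estimator */
      static void   data_partition(unsigned sram_size) { (void)sram_size; }
      static double estimate_memory_performance(unsigned T, unsigned C, unsigned S)
      { (void)T; return (double)(C + S); }

      void explore(unsigned *bestT, unsigned *bestC, unsigned *bestS, unsigned *bestL)
      {
          double best = -1.0;
          for (unsigned T = 256; T <= MAX_ONCHIP; T *= 2)                 /* on-chip size */
              for (unsigned C = 128; C < T; C *= 2) {                     /* cache size   */
                  unsigned S = T - C;                                     /* scratchpad   */
                  data_partition(S);
                  for (unsigned L = 4; L < C && L <= MAX_LINE; L *= 2) {  /* line size    */
                      double score = estimate_memory_performance(T, C, S);
                      if (score > best) {
                          best = score;
                          *bestT = T; *bestC = C; *bestS = S; *bestL = L;
                      }
                  }
              }
      }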

  • Variation of On-chip Memory Allocation

    [Figure (example: histogram program): effect of different scratchpad/cache size ratios; total on-chip memory size = 2 KB]

  • Variation of Total On-chip Memory

    [Figure (example: histogram program): effect of the total on-chip memory size]

  • Data Partitioning (I)

      procedure Histogram_Evaluation
        char BrightnessLevel[512][512];   /* regular access: off-chip + cache */
        int Hist[256];                    /* irregular access: scratchpad */
        for (i = 0; i < 512; i++)
          for (j = 0; j < 512; j++) {
            /* for each pixel (i,j) in the image */
            level = BrightnessLevel[i][j];
            Hist[level] += 1;
          }
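    A minimal sketch of how this partitioning decision can be expressed in source code with GCC-style section attributes. The section name ".scratchpad" is an assumption: the actual name (or pragma) depends on the toolchain and on a linker script that maps the section onto the scratchpad address range:

      static unsigned char BrightnessLevel[512][512];               /* regular access: off-chip + cache */
      static int Hist[256] __attribute__((section(".scratchpad"))); /* irregular access: scratchpad */

      void histogram_evaluation(void)
      {
          for (int i = 0; i < 512; i++)
              for (int j = 0; j < 512; j++)
                  Hist[BrightnessLevel[i][j]] += 1;                 /* data-dependent index */
      }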

  • Data Partitioning (II)

      procedure Convolution
        int source[128][128], dest[128][128];
        int mask[4][4];
        for (all points x,y of source)
          new = 0;
          for (i scanning the mask horizontally)
            for (j scanning the mask vertically)
              new += source[x+i][y+j] * mask[i][j];
          dest[x][y] = new / norm;

    mask: small -> scratchpad
    source + dest: large and regular -> off-chip + cache
    [Figure: iterations (0,0) and (0,1) reuse the same mask from the scratchpad]

  • Data Partitioning

    Pre-partitioning scratchpad/off-chip:
      Scalar variables and constants to the scratchpad
      Large arrays to off-chip memory
    Detailed partitioning: identify the critical data for the scratchpad. Criteria:
      Life-times of arrays
      Access frequency of arrays
      Loop conflicts
    Similar reasoning can be applied to code.

  • Global Placement Optimization

    Which objects (arrays, loops, functions, etc.) should be stored in the scratchpad?
    Non-overlaying allocation:
      Gain g_k and size s_k for each object k
      Maximise the gain G = Σ g_k, respecting the scratchpad size: Σ s_k ≤ S_SP
      Solution: knapsack algorithm
    Overlaying allocation:
      Move objects back and forth between hierarchy levels
      Solution: more complex...
    [Figure: program objects (loops, functions, arrays) in main memory competing for a scratchpad memory of capacity S_SP attached to the processor. Source: Steinke et al., IEEE 2002]
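    Since the non-overlaying allocation is a 0/1 knapsack problem, it can be solved with the standard dynamic-programming recurrence. A minimal sketch with illustrative gains and sizes (the real gains come from profiling, e.g., cycles or energy saved per object):

      #include <stdio.h>
      #include <string.h>

      #define NOBJ 4
      #define SSP  1024                                 /* scratchpad capacity */

      int main(void)
      {
          const int s[NOBJ] = { 512, 256, 768, 128 };   /* object sizes s_k */
          const int g[NOBJ] = { 900, 400, 800, 300 };   /* object gains g_k */
          static long best[SSP + 1];                    /* best[c] = max gain within capacity c */

          memset(best, 0, sizeof best);
          for (int k = 0; k < NOBJ; k++)                /* classic 0/1 DP, capacity descending */
              for (int c = SSP; c >= s[k]; c--)
                  if (best[c - s[k]] + g[k] > best[c])
                      best[c] = best[c - s[k]] + g[k];

          printf("max gain within %d: %ld\n", SSP, best[SSP]);
          return 0;
      }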

  • Integer Linear Programming

    Symbols:
      S(var_k) = size of variable k
      n(var_k) = number of accesses to variable k
      e(var_k) = energy saved per access if var_k is migrated to the scratchpad
      E(var_k) = energy saved if variable var_k is migrated (= e(var_k) * n(var_k))
      x(var_k) = decision variable: 1 if variable k is migrated to the scratchpad, 0 otherwise
      K = set of variables
      Similar symbols for the set of functions I

    Integer programming formulation:
      Maximise    Σ_{k∈K} x(var_k) E(var_k) + Σ_{i∈I} x(F_i) E(F_i)
      Subject to  Σ_{i∈I} S(F_i) x(F_i) + Σ_{k∈K} S(var_k) x(var_k) ≤ S_SP

    [Source: Steinke et al., IEEE 2002]

  • Reduction in Energy and Runtime

    [Figure: cycles [x100] and energy [J] for the multi_sort benchmark (a mix of sorting algorithms). Source: Steinke et al., IEEE 2002]
    Numbers will change with technology, but the algorithms will remain unchanged.

  • Outline

    Memory data layout
    Scratchpad memory
    Custom memory architectures

  • Array-to-Memory Assignment and Memory Banking

    Exploit the possibility of designing a memory system that is not for general use (e.g., completely uniform) and not built out of standard components (e.g., off-the-shelf DRAMs):
      Ad-hoc bit widths (e.g., a 32-bit word-addressed architecture with specific 6-bit arrays)
      Clustering of arrays in memories: exploit features of eDRAMs; trade off power/energy/area
      Multiple accesses per cycle (e.g., allow concurrent accesses by coprocessors; multiple CPU buses, e.g., on DSPs)
    [Figure: address space split into Bank #1, Bank #2, and Bank #3, plus a small-bit-width bank; some banks are accessible at once]

  • Memory Banking Motivation: (e)DRAMs

      for (i = 0; i < 1000; i++) {
          A[i] += B[i] * C[2*i];
      }

    [Figure: arrays A, B, and C share one (e)DRAM; the row address (Addr[15:8]) selects a page into the page buffer and the column address (Addr[7:0]) selects the word within it]

  • Memory Banking Motivation: (e)DRAMs

      for (i = 0; i < 1000; i++) {
          A[i] += B[i] * C[2*i];
      }

    [Figure: A[i], B[i], and C[2i] each placed in a separate bank, each bank with its own row/column decoding and page buffer feeding the datapath]
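    A minimal sketch of expressing this bank assignment in source code. The section names .bank0/.bank1/.bank2 are assumptions that a linker script would have to map onto the physical (e)DRAM banks; with each array in its own bank, every bank keeps its own page open while the loop streams through A, B, and C:

      static int A[1000] __attribute__((section(".bank0")));
      static int B[1000] __attribute__((section(".bank1")));
      static int C[2000] __attribute__((section(".bank2")));

      void mac_loop(void)
      {
          for (int i = 0; i < 1000; i++)
              A[i] += B[i] * C[2 * i];   /* each array stays within its own page buffer */
      }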

  • Typical DRAM Tradeoffs: Also in High-End Servers

    DRAMs are complex objects:
      Multiple interleaved DRAM banks in a system
      Large premium for burst accesses
      Tradeoff between leaving a page open (neighbouring accesses are faster) and closing it (far accesses do not pay the precharge time)
    In servers, these optimisations are usually dynamic, performed by the memory controller subsystem and controlled by the BIOS/OS.
    In embedded computers, similar optimisations can be done statically and application-specifically.

  • Minimal Number of Banks to Remove Access Conflicts

    [Figure: from the data-flow graph (DFG) and its schedule, a conflict graph is built, from which a minimal bank allocation is derived. Source: modified from Panda et al., ACM 2001]
    Can be extended to model multiple simultaneous accesses to the same array (multi-ported memories).

  • Memory Allocation Exploration

    [Figure: useful exploration space of memory allocations. Source: Panda et al., ACM 2001]

  • Summary

    In SoCs and FPGAs the situation is different from general-purpose computers:
      Different design space (fast memories almost as fast as logic)
      Fewer constraints to use standard components: any size possible, more types of memory available (e.g., dual-port), etc.
      More bandwidth exploitable (no pins)
    Hence a different world where many more things are possible (whereas in classic computing or normal chip-based embedded systems there is not much freedom).
    A companion situation to the customisation of processors:
      Optimisations tailored to the data cache: memory data layout
      Memory architecture customised to a given application: scratchpad memory, memory banking

  • References

    P. R. Panda et al., Data and Memory Optimization Techniques for Embedded Systems, ACM Transactions on Design Automation of Electronic Systems, 6(2):149-206, April 2001
    M. Verma and P. Marwedel, Advanced Memory Optimization Techniques for Low Power Embedded Processors, Springer, 2007
    P. R. Panda (ed.), Memory Issues in Embedded Systems-on-Chip, Kluwer Academic, 1999
    IEEE Design & Test of Computers, Special Issue on Large Embedded Memories, May-June 2001