Partitioning an Embedded System for Multicore Design

  • 8/12/2019 2PartitioninganEmbeddedSystemforMulticoreDesign

    Microprocessors, Advanced

    Partitioning an Embedded System for Multicore Design

    January 31, 2012

    Jack Ganssle

    The Schedule Grows Faster Than The Code!

    IBM:

    Person-yrs    LOC/month
    1             439
    10            220
    100           110
    1000          55

    COCOMO: Schedule = C * KLOC^M

    (C and M are both > 1)
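
A minimal sketch of what an exponent greater than 1 does to the COCOMO relation above. The slide gives no values for C and M; the defaults here (3.6 and 1.20) are the Basic COCOMO embedded-mode coefficients, used only as an assumption for illustration, and the function names are hypothetical.

```python
def cocomo_effort(kloc: float, c: float = 3.6, m: float = 1.20) -> float:
    """Return C * KLOC**M.

    c and m are assumptions (Basic COCOMO embedded-mode values);
    the slide only says that both are greater than 1.
    """
    if kloc <= 0:
        raise ValueError("kloc must be positive")
    return c * kloc ** m


def growth_factor(size_ratio: float, m: float = 1.20) -> float:
    """Factor by which the result grows when the code grows by size_ratio."""
    return size_ratio ** m


if __name__ == "__main__":
    for kloc in (1, 10, 100, 1000):
        print(f"{kloc:>5} KLOC -> {cocomo_effort(kloc):10.1f}")
    # With M > 1, doubling the code more than doubles the result.
    print(f"2x code -> {growth_factor(2):.2f}x")
```

Whatever the real constants are, any M above 1 means a program twice as large costs more than twice as much, which is the case for keeping each partition small.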

    The Productivity Crash

    Partitioning Code

    Fact: The easiest way to write great modules fast is to keep them small, with few dependencies.

    Smaller functions:

    - have fewer bugs: the bug rate is 2 to 6x lower
    - are more likely to meet specs
    - are done faster

    Eye Scans

    We Turn Micros into Mainframes

    - Sensors
    - Interface

    1,000,000 lines of code

    8051

    Complexity is not linear with LOC

    A Better Design

    supervisory code

    I/O Code

    I/O Code

    I/O Code

    Small and cheap

    Interprocessor Communications

    Main CPU

    Serial/Encrypt

    Rangefinder

    Transaction Processing

    I2C: a fast serial interface

    National Airports Radar

    The Sergeant York

    The Tradeoff

    schedule

    features

    quality

    Feature Management

    Requirements Scrubbing

    Features removed    % of designs
    0 to 10%            21.4%
    10 to 30%           17.9%
    30 to 50%           47.7%
    More than 50%       25.9%

    30% or more: 47.7% + 25.9% = 73.6%!

    Don't Wait for Hardware

    - Build an I/O board that plugs into the PC
    - Simulate!
    - Virtualization: Virtutech, CoWare, VaST
    - Fitnesse: http://fitnesse.org/
    - Catsrunner: www.agilerules.com/projects/catsrunner/index.phtml

    What About Multicore?

    CPU: tens of MHz

    Memory: hundreds of nsec

    Then Came Prefetchers

    CPU: tens of MHz

    Queue

    Memory: under 100 nsec

    Then Came Pipelines

    CPU: tens of MHz

    Memory: 30-50 nsec

    Old: Fetch -> Decode -> Execute

    Pipelined: Fetch, Decode and Execute overlap in time

    Cache

    CPU: hundreds of MHz

    Cache: CPU speed

    Memory: 30-50 nsec

    Cache Splits in Two

    CPU: over 1 GHz

    L1 Cache: CPU speed

    L2 Cache: 3-5 nsec

    Memory: 30-50 nsec
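
The latencies on this slide can be combined into an average memory access time. This is a minimal sketch rather than something taken from the slides: the hit rates are made-up assumptions, the 1 nsec L1 latency assumes a 1 GHz core with single-cycle L1 access, and `average_access_ns` is a hypothetical helper name.

```python
def average_access_ns(l1_ns, l2_ns, mem_ns, l1_hit, l2_hit):
    """Average access time for a two-level cache in front of main memory.

    l1_hit: fraction of accesses that hit in L1.
    l2_hit: fraction of L1 misses that hit in L2.
    Each miss pays the next level's latency on top of the current one.
    """
    for rate in (l1_hit, l2_hit):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("hit rates must be between 0 and 1")
    l2_and_beyond = l2_ns + (1.0 - l2_hit) * mem_ns
    return l1_ns + (1.0 - l1_hit) * l2_and_beyond


if __name__ == "__main__":
    # Latencies are midpoints of the slide's ranges; hit rates are assumptions.
    fits_in_l1 = average_access_ns(1.0, 4.0, 40.0, l1_hit=1.0, l2_hit=1.0)
    spills_out = average_access_ns(1.0, 4.0, 40.0, l1_hit=0.90, l2_hit=0.80)
    print(f"fits in L1: {fits_in_l1:.1f} nsec, spills out: {spills_out:.1f} nsec")
```

Under these assumed hit rates, a program that spills out of L1 sees roughly double the average access time of one that fits, and the real figure depends entirely on the access pattern.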

    SMP

    Symmetric Multiprocessing (SMP): multiple identical CPUs working with a shared memory array.

    CPU Core CPU Core

    Shared memory

    Amdahl's Law for SMP

    Where:

    n = Number of processors

    f = Fraction of the operation that cannot be parallelized

    Max speedup = 1 / (f + (1 - f) / n)
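
Amdahl's Law in its usual form, with the slide's symbols n and f, can be sketched as follows. The function name `amdahl_speedup` is hypothetical, and f is taken as a fraction between 0 and 1.

```python
import math


def amdahl_speedup(f: float, n: float) -> float:
    """Maximum speedup on n processors when a fraction f cannot be parallelized."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("f must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    if math.isinf(n):
        # With unlimited processors only the serial part remains.
        return math.inf if f == 0 else 1.0 / f
    return 1.0 / (f + (1.0 - f) / n)


if __name__ == "__main__":
    f = 0.34  # 66% of the work is parallelizable
    for n in (1, 2, 4, 8, 16, 29):
        print(f"{n:>2} cores -> {amdahl_speedup(f, n):.2f}x")
    print(f"infinite cores -> {amdahl_speedup(f, math.inf):.2f}x")
```

As n grows the speedup approaches 1 / f, so with 66% of the work parallelizable no number of cores yields a 3x speedup.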

    With an Infinite # CPUs

    [Chart: speedup (0 to 12) versus portion not parallelizable (0.1 to 0.94)]

    Best Case: 66% Parallelizable

    [Chart: speedup (0 to 3) versus number of cores (1 to 29)]

    But Memory is a Bottleneck!

    CPU Core

    L1 Cache (typically 32KB)

    CPU Core

    L1 Cache (typically 32KB)

    Shared L2 Cache (typically 2-4MB)

    Memory

    And so is Comm

    Memory

    CPU Core

    L1 Cache

    CPU Core

    L1 Cache

    Shared L2 Cache

    CPU Core

    L1 Cache

    CPU Core

    L1 Cache

    Shared L2 Cache

    Then there's the cache coherency problem

    The Irony

    Programs in L1 run blazingly fast

    But why use a 32-bit CPU that can address 4 GB on a 32 KB program?

    A Colorimeter SMP Design

    Memory

    Common Bus

    A/D    A/D    A/D    Display    Display    Display

    Core R:
    - Read A/D
    - FIFO data
    - Do FIR
    - Calculate R
    - Display

    Core G:
    - Read A/D
    - FIFO data
    - Do FIR
    - Calculate R
    - Display

    Core B:
    - Read A/D
    - FIFO data
    - Do FIR
    - Calculate R
    - Display

    ASMP

    Asymmetric Multiprocessing (ASMP or AMP): multiple CPUs, identical or not, each running a specific activity

    CPU Core

    Memory

    CPU Core

    Memory

    Some comm link

    The Assembly Line

    A More Natural Design via AMP

    A/D -> FIFO -> FIR -> Calc B -> Display

    A/D -> FIFO -> FIR -> Calc G -> Display

    A/D -> FIFO -> FIR -> Calc R -> Display
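
One channel of this assembly line can be modeled as a chain of small stages, each doing a single job. This is only a sketch using Python generators on one machine; in an AMP design each stage would run on its own core and pass data over a comm link. The stage functions, the moving-average taps and the `gain` parameter are all hypothetical, since the slides do not say what the FIR or Calc steps compute.

```python
from collections import deque
from typing import Iterable, Iterator, List, Sequence


def adc_stage(samples: Iterable[float]) -> Iterator[float]:
    """Stand-in for the A/D stage: yields raw samples one at a time."""
    for sample in samples:
        yield sample


def fir_stage(samples: Iterable[float], taps: Sequence[float]) -> Iterator[float]:
    """FIFO + FIR stages: keep the last len(taps) samples and filter them."""
    fifo = deque(maxlen=len(taps))
    for sample in samples:
        fifo.append(sample)
        if len(fifo) == len(taps):
            # The newest sample is multiplied by taps[0].
            yield sum(t * x for t, x in zip(taps, reversed(fifo)))


def calc_stage(filtered: Iterable[float], gain: float) -> Iterator[float]:
    """Hypothetical Calc stage: scale the filtered value for one color."""
    for value in filtered:
        yield gain * value


def display_stage(values: Iterable[float], label: str) -> List[str]:
    """Stand-in for Display: format results instead of driving hardware."""
    return [f"{label}={value:.2f}" for value in values]


def run_channel(label, samples, taps=(0.25, 0.25, 0.25, 0.25), gain=1.0):
    """Connect the stages for one color channel."""
    filtered = fir_stage(adc_stage(samples), taps)
    return display_stage(calc_stage(filtered, gain), label)


if __name__ == "__main__":
    print(run_channel("R", [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))
```

Each stage only knows about its input stream, so a stage can be moved to another processor or replaced without touching the others.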

    Another Assembly Line

    CPU

    Memory

    CPU

    Memory

    CPU

    Memory

    Data

    Memory

    CPU

    Implications

    - Multicore can give huge performance improvements.
    - But for non-parallel problems it may not yield much improvement.
    - It's hard to impossible to predict speed improvements of most algorithms once they grow larger than L1.
    - Many embedded apps are hugely non-parallelizable.
    - In some cases AMP offers a better solution than SMP.

    Questions?