Multi Core 15213 Sp07

download Multi Core 15213 Sp07

of 67

Transcript of Multi Core 15213 Sp07

  • 7/30/2019 Multi Core 15213 Sp07

    1/67

    1

    Multi-core architectures

    Jernej Barbic

    15-213, Spring 2007May 3, 2007

  • 7/30/2019 Multi Core 15213 Sp07

    2/67

    2

    Single-core computer

  • 7/30/2019 Multi Core 15213 Sp07

    3/67

    3

    Single-core CPU chip

    the single core

  • 7/30/2019 Multi Core 15213 Sp07

    4/67

    4

    Multi-core architectures

    This lecture is about a new trend incomputer architecture:

    Replicate multiple processor cores on a

    single die.Core 1 Core 2 Core 3 Core 4

    Multi-core CPU chip

  • 7/30/2019 Multi Core 15213 Sp07

    5/67

    5

    Multi-core CPU chip

    The cores fit on a single processor socket

    Also called CMP (Chip Multi-Processor)

    c

    o

    r

    e

    1

    c

    o

    r

    e

    2

    c

    o

    r

    e

    3

    c

    o

    r

    e

    4

  • 7/30/2019 Multi Core 15213 Sp07

    6/67

    6

    The cores run in parallel

    c

    o

    r

    e

    1

    c

    o

    r

    e

    2

    c

    o

    r

    e

    3

    c

    o

    r

    e

    4

    thread 1 thread 2 thread 3 thread 4

  • 7/30/2019 Multi Core 15213 Sp07

    7/67

  • 7/30/2019 Multi Core 15213 Sp07

    8/67

    8

    Interaction with the

    Operating System

    OS perceives each core as a separate processor

    OS scheduler maps threads/processesto different cores

    Most major OS support multi-core today:

    Windows, Linux, Mac OS X,

  • 7/30/2019 Multi Core 15213 Sp07

    9/67

    9

    Why multi-core ?

    Difficult to make single-coreclock frequencies even higher

    Deeply pipelined circuits: heat problems

    speed of light problems

    difficult design and verification

    large design teams necessary

    server farms need expensive

    air-conditioning

    Many new applications are multithreaded

    General trend in computer architecture (shifttowards more parallelism)

  • 7/30/2019 Multi Core 15213 Sp07

    10/67

    10

    Instruction-level parallelism

    Parallelism at the machine-instruction level

    The processor can re-order, pipeline

    instructions, split them into

    microinstructions, do aggressive branch

    prediction, etc.

    Instruction-level parallelism enabled rapid

    increases in processor speeds over the

    last 15 years

  • 7/30/2019 Multi Core 15213 Sp07

    11/67

    11

    Thread-level parallelism (TLP)

    This is parallelism on a more coarser scale Server can serve each client in a separate

    thread (Web server, database server)

    A computer game can do AI, graphics, andphysics in three separate threads

    Single-core superscalar processors cannot

    fully exploit TLP

    Multi-core architectures are the next step in

    processor evolution: explicitly exploiting TLP

  • 7/30/2019 Multi Core 15213 Sp07

    12/67

  • 7/30/2019 Multi Core 15213 Sp07

    13/67

    13

    Multiprocessor memory types

    Shared memory:

    In this model, there is one (large) common

    shared memory for all processors

    Distributed memory:

    In this model, each processor has its own

    (small) local memory, and its content is not

    replicated anywhere else

  • 7/30/2019 Multi Core 15213 Sp07

    14/67

    14

    Multi-core processor is a special

    kind of a multiprocessor:All processors are on the same chip

    Multi-core processors are MIMD:

    Different cores execute different threads(Multiple Instructions), operating on differentparts of memory (Multiple Data).

    Multi-core is a shared memory multiprocessor:All cores share the same memory

  • 7/30/2019 Multi Core 15213 Sp07

    15/67

    15

    What applications benefit

    from multi-core?

    Database servers

    Web servers (Web commerce)

    Compilers

    Multimedia applications Scientific applications,

    CAD/CAM

    In general, applications with

    Thread-level parallelism(as opposed to instruction-level parallelism)

    Each can

    run on its

    own core

  • 7/30/2019 Multi Core 15213 Sp07

    16/67

    16

    More examples

    Editing a photo while recording a TV show

    through a digital video recorder

    Downloading software while running an

    anti-virus program

    Anything that can be threaded today will

    map efficiently to multi-core

    BUT: some applications difficult to

    parallelize

  • 7/30/2019 Multi Core 15213 Sp07

    17/67

    17

    A technique complementary to multi-core:

    Simultaneous multithreading

    Problem addressed:The processor pipelinecan get stalled:

    Waiting for the resultof a long floating point(or integer) operation

    Waiting for data to

    arrive from memoryOther execution unitswait unused BTB and I-TLB

    Decoder

    Trace Cache

    Rename/Alloc

    Uop queues

    Schedulers

    Integer Floating Point

    L1 D-Cache D-TLB

    uCodeROMBTBL2CacheandC

    ontrol

    Bus

    Source: Intel

  • 7/30/2019 Multi Core 15213 Sp07

    18/67

    18

    Simultaneous multithreading (SMT)

    Permits multiple independent threads to execute

    SIMULTANEOUSLY on the SAME core

    Weaving together multiple threads

    on the same core

    Example: if one thread is waiting for a floating

    point operation to complete, another thread canuse the integer units

  • 7/30/2019 Multi Core 15213 Sp07

    19/67

    19

    BTB and I-TLB

    Decoder

    Trace Cache

    Rename/Alloc

    Uop queues

    Schedulers

    Integer Floating Point

    L1 D-Cache D-TLB

    uCode ROMBTBL2Cacheand

    Control

    Bus

    Thread 1: floating point

    Without SMT, only a single thread

    can run at any given time

  • 7/30/2019 Multi Core 15213 Sp07

    20/67

    20

    Without SMT, only a single thread

    can run at any given time

    BTB and I-TLB

    Decoder

    Trace Cache

    Rename/Alloc

    Uop queues

    Schedulers

    Integer Floating Point

    L1 D-Cache D-TLB

    uCode ROMBTBL2Cacheand

    Control

    Bus

    Thread 2:

    integer operation

  • 7/30/2019 Multi Core 15213 Sp07

    21/67

    21

    SMT processor: both threads can

    run concurrently

    BTB and I-TLB

    Decoder

    Trace Cache

    Rename/Alloc

    Uop queues

    Schedulers

    Integer Floating Point

    L1 D-Cache D-TLB

    uCode ROMBTBL2Cacheand

    Control

    Bus

    Thread 1: floating pointThread 2:

    integer operation

  • 7/30/2019 Multi Core 15213 Sp07

    22/67

    22

    But: Cant simultaneously use the

    same functional unit

    BTB and I-TLB

    Decoder

    Trace Cache

    Rename/Alloc

    Uop queues

    Schedulers

    Integer Floating Point

    L1 D-Cache D-TLB

    uCode ROMBTBL2Cacheand

    Control

    Bus

    Thread 1 Thread 2

    This scenario is

    impossible with SMT

    on a single core

    (assuming a single

    integer unit)IMPOSSIBLE

  • 7/30/2019 Multi Core 15213 Sp07

    23/67

    23

    SMT not a true parallel processor

    Enables better threading (e.g. up to 30%)

    OS and applications perceive each

    simultaneous thread as a separate

    virtual processor

    The chip has only a single copy

    of each resource

    Compare to multi-core:

    each core has its own copy of resources

  • 7/30/2019 Multi Core 15213 Sp07

    24/67

    24

    Multi-core:

    threads can run on separate cores

    BTB and I-TLB

    Decoder

    Trace Cache

    Rename/Alloc

    Uop queues

    Schedulers

    Integer Floating Point

    L1 D-Cache D-TLB

    uCodeROM

    BTBL2CacheandC

    ontrol

    Bus

    BTB and I-TLB

    Decoder

    Trace Cache

    Rename/Alloc

    Uop queues

    Schedulers

    Integer Floating Point

    L1 D-Cache D-TLB

    uCodeROM

    BTBL2CacheandControl

    Bus

    Thread 1 Thread 2

  • 7/30/2019 Multi Core 15213 Sp07

    25/67

    25

    BTB and I-TLB

    Decoder

    Trace Cache

    Rename/Alloc

    Uop queues

    Schedulers

    Integer Floating Point

    L1 D-Cache D-TLB

    uCodeROM

    BTBL2CacheandC

    ontrol

    Bus

    BTB and I-TLB

    Decoder

    Trace Cache

    Rename/Alloc

    Uop queues

    Schedulers

    Integer Floating Point

    L1 D-Cache D-TLB

    uCodeROM

    BTBL

    2CacheandControl

    Bus

    Thread 3 Thread 4

    Multi-core:

    threads can run on separate cores

  • 7/30/2019 Multi Core 15213 Sp07

    26/67

    26

    Combining Multi-core and SMT

    Cores can be SMT-enabled (or not)

    The different combinations:

    Single-core, non-SMT: standard uniprocessor

    Single-core, with SMT

    Multi-core, non-SMT

    Multi-core, with SMT: our fish machines

    The number of SMT threads:

    2, 4, or sometimes 8 simultaneous threads

    Intel calls them hyper-threads

  • 7/30/2019 Multi Core 15213 Sp07

    27/67

    27

    SMT Dual-core: all four threads can

    run concurrently

    BTB and I-TLB

    Decoder

    Trace Cache

    Rename/Alloc

    Uop queues

    Schedulers

    Integer Floating Point

    L1 D-Cache D-TLB

    uCodeROM

    BTBL

    2CacheandC

    ontrol

    Bus

    BTB and I-TLB

    Decoder

    Trace Cache

    Rename/Alloc

    Uop queues

    Schedulers

    Integer Floating Point

    L1 D-Cache D-TLB

    uCodeROM

    BTBL

    2CacheandControl

    Bus

    Thread 1 Thread 3 Thread 2 Thread 4

  • 7/30/2019 Multi Core 15213 Sp07

    28/67

    28

    Comparison: multi-core vs SMT

    Advantages/disadvantages?

  • 7/30/2019 Multi Core 15213 Sp07

    29/67

    29

    Comparison: multi-core vs SMT

    Multi-core:

    Since there are several cores,each is smaller and not as powerful

    (but also easier to design and manufacture) However, great with thread-level parallelism

    SMT

    Can have one large and fast superscalar core

    Great performance on a single thread

    Mostly still only exploits instruction-levelparallelism

  • 7/30/2019 Multi Core 15213 Sp07

    30/67

    30

    The memory hierarchy

    If simultaneous multithreading only:

    all caches shared

    Multi-core chips:

    L1 caches private

    L2 caches private in some architectures

    and shared in others

    Memory is always shared

  • 7/30/2019 Multi Core 15213 Sp07

    31/67

    31

    Fish machines

    Dual-core

    Intel Xeon processors

    Each core is

    hyper-threaded

    Private L1 caches

    Shared L2 cachesmemory

    L2 cache

    L1 cache L1 cacheCO

    R

    E

    1

    CO

    R

    E

    0

    hyper-threads

  • 7/30/2019 Multi Core 15213 Sp07

    32/67

    32

    Designs with private L2 caches

    memory

    L2 cache

    L1 cache L1 cacheCOR

    E

    1

    C

    OR

    E

    0

    L2 cache

    memory

    L2 cache

    L1 cache L1 cacheCOR

    E

    1

    C

    OR

    E

    0

    L2 cache

    Both L1 and L2 are private

    Examples: AMD Opteron,

    AMD Athlon, Intel Pentium D

    L3 cache L3 cache

    A design with L3 caches

    Example: Intel Itanium 2

  • 7/30/2019 Multi Core 15213 Sp07

    33/67

    33

    Private vs shared caches?

    Advantages/disadvantages?

  • 7/30/2019 Multi Core 15213 Sp07

    34/67

    34

    Private vs shared caches

    Advantages of private:

    They are closer to core, so faster access

    Reduces contention

    Advantages of shared:

    Threads on different cores can share the

    same cache data

    More cache space available if a single (or afew) high-performance thread runs on the

    system

  • 7/30/2019 Multi Core 15213 Sp07

    35/67

    35

    The cache coherence problem

    Since we have private caches:How to keep the data consistent across caches?

    Each core should perceive the memory as a

    monolithic array, shared by all the cores

  • 7/30/2019 Multi Core 15213 Sp07

    36/67

    36

    The cache coherence problem

    Suppose variable x initially contains 15213

    Core 1 Core 2 Core 3 Core 4

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    Main memory

    x=15213

    multi-core chip

  • 7/30/2019 Multi Core 15213 Sp07

    37/67

    37

    The cache coherence problem

    Core 1 reads x

    Core 1 Core 2 Core 3 Core 4

    One or more

    levels of

    cache

    x=15213

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    Main memory

    x=15213

    multi-core chip

  • 7/30/2019 Multi Core 15213 Sp07

    38/67

    38

    The cache coherence problem

    Core 2 reads x

    Core 1 Core 2 Core 3 Core 4

    One or more

    levels of

    cache

    x=15213

    One or more

    levels of

    cache

    x=15213

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    Main memory

    x=15213

    multi-core chip

  • 7/30/2019 Multi Core 15213 Sp07

    39/67

    39

    The cache coherence problem

    Core 1 writes to x, setting it to 21660

    Core 1 Core 2 Core 3 Core 4

    One or more

    levels of

    cache

    x=21660

    One or more

    levels of

    cache

    x=15213

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    Main memory

    x=21660

    multi-core chip

    assuming

    write-throughcaches

  • 7/30/2019 Multi Core 15213 Sp07

    40/67

    40

    The cache coherence problem

    Core 2 attempts to read x gets a stale copy

    Core 1 Core 2 Core 3 Core 4

    One or more

    levels of

    cache

    x=21660

    One or more

    levels of

    cache

    x=15213

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    Main memory

    x=21660

    multi-core chip

  • 7/30/2019 Multi Core 15213 Sp07

    41/67

    41

    Solutions for cache coherence

    This is a general problem with

    multiprocessors, not limited just to multi-core

    There exist many solution algorithms,

    coherence protocols, etc.

    A simple solution:

    invalidation-based protocol with snooping

  • 7/30/2019 Multi Core 15213 Sp07

    42/67

    42

    Inter-core bus

    Core 1 Core 2 Core 3 Core 4

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    Main memory

    multi-core chip

    inter-core

    bus

  • 7/30/2019 Multi Core 15213 Sp07

    43/67

    43

    Invalidation protocol with snooping

    Invalidation:

    If a core writes to a data item, all other

    copies of this data item in other caches

    are invalidated

    Snooping:

    All cores continuously snoop (monitor)

    the bus connecting the cores.

  • 7/30/2019 Multi Core 15213 Sp07

    44/67

    44

    The cache coherence problem

    Revisited: Cores 1 and 2 have both read x

    Core 1 Core 2 Core 3 Core 4

    One or more

    levels of

    cache

    x=15213

    One or more

    levels of

    cache

    x=15213

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    Main memory

    x=15213

    multi-core chip

  • 7/30/2019 Multi Core 15213 Sp07

    45/67

    45

    The cache coherence problem

    Core 1 writes to x, setting it to 21660

    Core 1 Core 2 Core 3 Core 4

    One or more

    levels of

    cache

    x=21660

    One or more

    levels of

    cache

    x=15213

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    Main memory

    x=21660

    multi-core chip

    assuming

    write-throughcaches

    INVALIDATEDsends

    invalidation

    request

    inter-core

    bus

  • 7/30/2019 Multi Core 15213 Sp07

    46/67

    46

    The cache coherence problem

    After invalidation:

    Core 1 Core 2 Core 3 Core 4

    One or more

    levels of

    cache

    x=21660

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    Main memory

    x=21660

    multi-core chip

    Th h h bl

  • 7/30/2019 Multi Core 15213 Sp07

    47/67

    47

    The cache coherence problem

    Core 2 reads x. Cache misses,and loads the new copy.

    Core 1 Core 2 Core 3 Core 4

    One or more

    levels of

    cache

    x=21660

    One or more

    levels of

    cache

    x=21660

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    Main memory

    x=21660

    multi-core chip

    Alternative to invalidate protocol:

  • 7/30/2019 Multi Core 15213 Sp07

    48/67

    48

    Alternative to invalidate protocol:

    update protocol

    Core 1 writes x=21660:

    Core 1 Core 2 Core 3 Core 4

    One or more

    levels of

    cache

    x=21660

    One or more

    levels of

    cache

    x=21660

    One or more

    levels of

    cache

    One or more

    levels of

    cache

    Main memory

    x=21660

    multi-core chip

    assuming

    write-throughcaches

    UPDATED

    broadcasts

    updated

    value inter-core

    bus

    Whi h d thi k i b tt ?

  • 7/30/2019 Multi Core 15213 Sp07

    49/67

    49

    Which do you think is better?

    Invalidation or update?

  • 7/30/2019 Multi Core 15213 Sp07

    50/67

    50

    Invalidation vs update

    Multiple writes to the same location

    invalidation: only the first time

    update: must broadcast each write

    (which includes new variable value)

    Invalidation generally performs better:

    it generates less bus traffic

  • 7/30/2019 Multi Core 15213 Sp07

    51/67

    51

    Invalidation protocols

    This was just the basic

    invalidation protocol

    More sophisticated protocols

    use extra cache state bits

    MSI, MESI

    (Modified, Exclusive, Shared, Invalid)

  • 7/30/2019 Multi Core 15213 Sp07

    52/67

    52

    Programming for multi-core

    Programmers must use threads or

    processes

    Spread the workload across multiple cores

    Write parallel algorithms

    OS will map threads/processes to cores

  • 7/30/2019 Multi Core 15213 Sp07

    53/67

    53

    Thread safety very important

    Pre-emptive context switching:

    context switch can happen AT ANY TIME

    True concurrency, not just uniprocessortime-slicing

    Concurrency bugs exposed much fasterwith multi-core

    H N d t h i ti

  • 7/30/2019 Multi Core 15213 Sp07

    54/67

    54

    However: Need to use synchronization

    even if only time-slicing on a uniprocessor

    int counter=0;

    void thread1() {

    int temp1=counter;

    counter = temp1 + 1;

    }

    void thread2() {

    int temp2=counter;

    counter = temp2 + 1;

    }

    N d t h i ti if l

  • 7/30/2019 Multi Core 15213 Sp07

    55/67

    55

    Need to use synchronization even if only

    time-slicing on a uniprocessor

    temp1=counter;

    counter = temp1 + 1;

    temp2=counter;

    counter = temp2 + 1

    temp1=counter;

    temp2=counter;

    counter = temp1 + 1;

    counter = temp2 + 1

    gives counter=2

    gives counter=1

  • 7/30/2019 Multi Core 15213 Sp07

    56/67

    56

    Assigning threads to the cores

    Each thread/process has an affinity mask

    Affinity mask specifies what cores thethread is allowed to run on

    Different threads can have different masks

    Affinities are inherited across fork()

  • 7/30/2019 Multi Core 15213 Sp07

    57/67

    57

    Affinity masks are bit vectors

    Example: 4-way multi-core, without SMT

    1011

    core 3 core 2 core 1 core 0

    Process/thread is allowed to run on

    cores 0,2,3, but not on core 1

    Affinity masks when multi core and

  • 7/30/2019 Multi Core 15213 Sp07

    58/67

    58

    Affinity masks when multi-core and

    SMT combined

    Separate bits for each simultaneous thread

    Example: 4-way multi-core, 2 threads per core

    1

    core 3 core 2 core 1 core 0

    1 0 0 1 0 1 1

    thread

    1

    Core 2 cant run the process Core 1 can only use one simultaneous thread

    thread

    0

    thread

    1

    thread

    0

    thread

    1

    thread

    0

    thread

    1

    thread

    0

  • 7/30/2019 Multi Core 15213 Sp07

    59/67

    59

    Default Affinities

    Default affinity mask is all 1s:all threads can run on all processors

    Then, the OS scheduler decides whatthreads run on what core

    OS scheduler detects skewed workloads,migrating threads to less busy processors

  • 7/30/2019 Multi Core 15213 Sp07

    60/67

    60

    Process migration is costly

    Need to restart the execution pipeline

    Cached data is invalidated

    OS scheduler tries to avoid migration as

    much as possible:

    it tends to keeps a thread on the same core

    This is called soft affinity

  • 7/30/2019 Multi Core 15213 Sp07

    61/67

    61

    Hard affinities

    The programmer can prescribe her own

    affinities (hard affinities)

    Rule of thumb: use the default scheduler

    unless a good reason not to

  • 7/30/2019 Multi Core 15213 Sp07

    62/67

    62

    When to set your own affinities

    Two (or more) threads share data-structures in

    memory

    map to same core so that can share cache

    Real-time threads:Example: a thread running

    a robot controller:

    - must not be context switched,

    or else robot can go unstable

    - dedicate an entire core just to this threadSource: Sensable.com

  • 7/30/2019 Multi Core 15213 Sp07

    63/67

    63

    Kernel scheduler API

    #include

    int sched_getaffinity(pid_t pid,unsigned int len, unsigned long * mask);

    Retrieves the current affinity mask of process pid and

    stores it into space pointed to by mask.

    len is the system word size: sizeof(unsigned int long)

  • 7/30/2019 Multi Core 15213 Sp07

    64/67

    64

    Kernel scheduler API

    #include

    int sched_setaffinity(pid_t pid,unsigned int len, unsigned long * mask);

    Sets the current affinity mask of process pid to *mask

    len is the system word size: sizeof(unsigned int long)

    To query affinity of a running process:[barbic@bonito ~]$ taskset -p 3935

    pid 3935's current affinity mask: f

  • 7/30/2019 Multi Core 15213 Sp07

    65/67

    65

    Windows Task Manager

    core 2

    core 1

  • 7/30/2019 Multi Core 15213 Sp07

    66/67

    66

    Legal licensing issues

    Will software vendors charge a separate

    license per each core or only a single

    license per chip?

    Microsoft, Red Hat Linux, Suse Linux will

    license their OS per chip, not per core

  • 7/30/2019 Multi Core 15213 Sp07

    67/67

    Conclusion

    Multi-core chips an

    important new trend in

    computer architecture

    Several new multi-core

    chips in design phases

    Parallel programming techniques

    likely to gain importance