Shared Memory Architecture



    CS 267: Applications of Parallel Computers

    Lecture 5: Shared Memory Parallel Machines

    Horst D. Simon

    http://www.cs.berkeley.edu/~strive/cs267


    Basic Shared Memory Architecture

    [Diagram: processors P1 ... Pn, each with a local cache ($), connected through a network to a shared memory]

    Processors all connected to a large shared memory

    Local caches for each processor

    Cost: much cheaper to access cache than main memory

    Simple to program, but hard to scale

    Now take a closer look at structure, costs, limits


    Programming Shared Memory (review)

    Program is a collection of threads of control.

    Each thread has a set of private variables, e.g., local variables on the stack.

    Collectively, the threads also have a set of shared variables,

    e.g., static variables, shared common blocks, global heap.

    Communication and synchronization happen through shared variables (see the sketch below)

    [Diagram: each thread P0 ... Pn has a private address space; all threads also see one shared address space, e.g., one thread writes x = ... while another reads y = .. x ...]
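
    The private/shared split above maps directly onto a threads library. Below is a minimal sketch in C with POSIX threads; the names (partial_sum, worker) and the thread count are assumptions for illustration. Globals are shared by every thread, while stack variables are private to each.

        #include <pthread.h>
        #include <stdio.h>

        #define NTHREADS 4

        double partial_sum[NTHREADS];   /* shared: global array visible to all threads */

        void *worker(void *arg) {
            long id = (long)arg;        /* private: each thread's own stack copy       */
            double local = 0.0;         /* private scratch variable                    */
            for (int i = 0; i < 1000; i++)
                local += id * 0.001;
            partial_sum[id] = local;    /* communicate through a shared variable       */
            return NULL;
        }

        int main(void) {
            pthread_t t[NTHREADS];
            for (long i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
            double total = 0.0;
            for (int i = 0; i < NTHREADS; i++) {
                pthread_join(t[i], NULL);   /* join is the synchronization point here  */
                total += partial_sum[i];
            }
            printf("total = %f\n", total);
            return 0;
        }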


    Outline

    Historical perspective

    Bus-based machines

    Pentium SMP

    IBM SP node

    Directory-based (CC-NUMA) machine

    Origin 2000

    Global address space machines

    Cray T3D and (sort of) T3E


    60s Mainframe Multiprocessors

    Enhance memory capacity or I/O capabilities by adding memory modules or I/O devices

    How do you enhance processing capacity?

    Add processors

    Already need an interconnect between slow memory banks and processor + I/O channels

    cross-bar or multistage interconnection network

    [Diagram: processors and I/O channels connected to multiple memory modules through a cross-bar or multistage interconnect]


    70s Breakthrough: Caches

    Memory system scaled by adding memory modules

    Both bandwidth and capacity

    Memory was still a bottleneck

    Enter Caches!

    Cache does two things:

    Reduces average access time (latency)

    Reduces bandwidth requirements to memory

    [Diagram: fast processor with a cache (holding A: 17) in front of an interconnect, slow memory, and I/O devices]


    Technology Perspective

    DRAM
    Year    Size     Cycle Time
    1980    64 Kb    250 ns
    1983    256 Kb   220 ns
    1986    1 Mb     190 ns
    1989    4 Mb     165 ns
    1992    16 Mb    145 ns
    1995    64 Mb    120 ns
    Capacity 1000:1!   Cycle time only 2:1!

             Capacity         Speed
    Logic:   2x in 3 years    2x in 3 years
    DRAM:    4x in 3 years    1.4x in 10 years
    Disk:    2x in 3 years    1.4x in 10 years

    [Plot: SpecInt and SpecFP processor performance vs. year, 1986-1996]


    Approaches to Building Parallel Machines

    [Diagram: three organizations shown in order of increasing scale: Shared Cache (P1 ... Pn through a switch to a shared first-level cache and interleaved main memory), Centralized Memory / Dance Hall UMA (P1 ... Pn, each with a cache, connected by an interconnection network to interleaved memory), and Distributed Memory / NUMA (P1 ... Pn, each with a cache and local memory, connected by an interconnection network)]


    80s Shared Memory: Shared Cache

    [Plot: transistor count (10^3 to 10^8) vs. year, 1965-2005, for the i80x86, M68K, and MIPS families: i4004, i8086, i80286, i80386, i80486, Pentium, R3010, R4400, R10000]

    Alliant FX-8: early 80s, eight 68020s with a crossbar to a 512 KB interleaved cache

    Encore & Sequent: first 32-bit micros (N32032), two to a board with a shared cache

    [Diagram: shared-cache organization: P1 ... Pn through a switch to a shared first-level cache and interleaved main memory]


    Shared Cache: Advantages and Disadvantages

    Advantages

    Cache placement identical to a single cache: only one copy of any cached block

    Fine-grain sharing is possible

    Interference: one processor may prefetch data for another

    Can share data within a line without moving the line

    Disadvantages

    Bandwidth limitation

    Interference: one processor may flush another processor's data


    Approaches to Building Parallel Machines

    [Diagram repeated from earlier: Shared Cache, Centralized Memory (Dance Hall, UMA), and Distributed Memory (NUMA) organizations, ordered by scale]


    Intuitive Memory Model

    Reading an address should return the last value written to that address

    Easy in uniprocessors

    except for I/O

    Cache coherence problem in MPs is more pervasive and more performance critical

    More formally, this is called sequential consistency:

    A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. [Lamport, 1979]


    Cache Coherence: Semantic Problem

    p1 and p2 both have cached copies of x (as 0)

    p1 writes x=1

    May write through to memory

    p2 reads x, but gets the stale cached copy

    [Diagram: memory holds x = 0; p1 and p2 each hold a cached copy x = 0; p1 updates its copy to x = 1, while p2 still sees the stale x = 0]


    Cache Coherence: Semantic Problem

    What does this imply about program behavior?

    No process ever sees garbage values, i.e., a mix of two different values

    Processors always see values written by some processor

    The value seen is constrained by program order on all processors

    Time always moves forward

    Example: P1 writes x=1, then writes y=1

    P2 reads y, then reads x

    Initially x = 0, y = 0

    P1:        P2:
    x = 1      ... = y
    y = 1      ... = x

    If P2 sees the new value of y, it must see the new value of x
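
    A hedged sketch of this example in C with POSIX threads: under sequential consistency the reader can never observe y == 1 together with x == 0. (Real compilers and processors reorder plain loads and stores, so portable code would use C11 atomics or locks; the plain version below only illustrates the ordering argument on the slide.)

        #include <pthread.h>
        #include <stdio.h>

        int x = 0, y = 0;            /* both shared, both initially 0 */

        void *p1(void *arg) {        /* writer: x first, then y       */
            x = 1;
            y = 1;
            return NULL;
        }

        void *p2(void *arg) {        /* reader: y first, then x       */
            int ry = y;
            int rx = x;
            /* Under sequential consistency, ry == 1 implies rx == 1. */
            printf("y=%d x=%d\n", ry, rx);
            return NULL;
        }

        int main(void) {
            pthread_t a, b;
            pthread_create(&a, NULL, p1, NULL);
            pthread_create(&b, NULL, p2, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
        }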


    Snoopy Cache-Coherence Protocols

    Bus is a broadcast medium & caches know what they have

    Cache Controller snoops all transactions on the shared bus

    A transaction is a relevant transaction if it involves a cache block currently contained in this cache

    Take action to ensure coherence:

    invalidate, update, or supply value

    depends on state of the block and the protocol

    [Diagram: P1 ... Pn, each with a cache whose entries hold state, address, and data, sharing a bus with memory and I/O devices; each cache controller snoops bus transactions and performs its own cache-memory transactions]


    Basic Choices in Cache Coherence

    Cache may keep information such as:

    Valid/invalid

    Dirty (inconsistent with memory)

    Shared (in other caches)

    When a processor executes a write operation to shared data, the basic design choices are (see the sketch after this list):

    Write thru: do the write in memory as well as in the cache

    Write back: wait and do the write later, when the item is flushed

    Update: give all other processors the new value

    Invalidate: all other processors remove the block from their caches
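
    A minimal sketch of how a snooping cache controller might combine two of these choices, write-through plus invalidate, as in the example on the next slide. The state enum, data structures, and function names are assumptions for illustration, not an actual protocol implementation.

        #include <stdint.h>
        #include <stdio.h>

        typedef enum { INVALID, VALID } line_state_t;   /* write-through needs no Dirty state */

        typedef struct { line_state_t state; uint32_t tag, data; } cache_line_t;

        static uint32_t memory[1024];                   /* toy main memory                    */
        static void memory_write(uint32_t a, uint32_t v) { memory[a % 1024] = v; }
        static void broadcast_invalidate(uint32_t a);   /* would go out on the snooping bus   */

        /* Write-through + invalidate: update the cache and memory, then invalidate peers. */
        static void proc_write(cache_line_t *line, uint32_t addr, uint32_t value) {
            line->state = VALID; line->tag = addr; line->data = value;
            memory_write(addr, value);
            broadcast_invalidate(addr);
        }

        /* Every other controller snoops the bus and drops its copy on an invalidate. */
        static void snoop_invalidate(cache_line_t *line, uint32_t addr) {
            if (line->state == VALID && line->tag == addr) line->state = INVALID;
        }

        static cache_line_t peer;                        /* stand-in for one other cache      */
        static void broadcast_invalidate(uint32_t a) { snoop_invalidate(&peer, a); }

        int main(void) {
            cache_line_t mine = { INVALID, 0, 0 };
            peer = (cache_line_t){ VALID, 0x10, 5 };     /* peer caches u = 5 at address 0x10 */
            proc_write(&mine, 0x10, 7);                  /* this processor writes u = 7       */
            printf("peer is %s\n", peer.state == INVALID ? "INVALID" : "VALID");
            return 0;
        }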


    Example: Write-thru Invalidate

    Update and write-thru both use more memory bandwidth if there are repeated writes to the same address:

    Update to the other caches

    Write-thru to memory

    [Diagram: P1, P2, P3, each with a cache, on a bus with memory and I/O devices; (1) P1 reads u and caches u:5, (2) P3 reads u and caches u:5, (3) P3 writes u = 7, which is written through to memory and invalidates the other copies, (4) P1 reads u again, (5) P2 reads u]


    Write-Back/Ownership Schemes

    When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth.

    Reads by others cause the block to return to the shared state.

    Most bus-based multiprocessors today use such schemes.

    Many variants of ownership-based protocols


    Sharing: A Performance Problem

    True sharing

    Frequent writes to a variable can create a bottleneck

    OK for read-only or infrequently written data

    Technique: make copies of the value, one per processor, if this is possible in the algorithm

    Example problem: the data structure that stores the freelist/heap for malloc/free

    False sharing

    Cache blocks may also introduce artifacts

    Two distinct variables in the same cache block

    Technique: allocate data used by each processor contiguously, or at least avoid interleaving

    Example problem: an array of ints, one written frequently by each processor (see the sketch below)
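
    A hedged sketch in C with POSIX threads of the false-sharing problem and the padding fix; the 64-byte line size, iteration counts, and names are assumptions for illustration. Timing the two loops on a real SMP would show the padded version scaling much better.

        #include <pthread.h>
        #include <stdio.h>

        #define NTHREADS 4
        #define ITERS    10000000
        #define LINE     64                       /* assumed cache-line size in bytes   */

        /* False sharing: the four counters sit in the same cache line(s), so every
         * increment by one processor invalidates the line in the others' caches.     */
        int counters_bad[NTHREADS];

        /* Fix: pad each counter so it occupies its own cache line.                    */
        struct padded { int count; char pad[LINE - sizeof(int)]; } counters_good[NTHREADS];

        void *bump_bad(void *arg) {
            long id = (long)arg;
            for (int k = 0; k < ITERS; k++) counters_bad[id]++;
            return NULL;
        }

        void *bump_good(void *arg) {
            long id = (long)arg;
            for (int k = 0; k < ITERS; k++) counters_good[id].count++;
            return NULL;
        }

        int main(void) {
            pthread_t t[NTHREADS];
            for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, bump_bad, (void *)i);
            for (int i = 0; i < NTHREADS; i++)  pthread_join(t[i], NULL);
            for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, bump_good, (void *)i);
            for (int i = 0; i < NTHREADS; i++)  pthread_join(t[i], NULL);
            printf("%d %d\n", counters_bad[0], counters_good[0].count);
            return 0;
        }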


    Limits of Bus-Based Shared Memory

    [Diagram: two processors, each with a cache, sharing a bus with memory modules and I/O]

    Assume a 1 GHz processor without a cache:

    => 4 GB/s instruction bandwidth per processor (32-bit instructions)

    => 1.2 GB/s data bandwidth at 30% load-store, for 5.2 GB/s combined demand per processor

    Suppose a 98% instruction hit rate and a 95% data hit rate:

    => 80 MB/s instruction bandwidth per processor

    => 60 MB/s data bandwidth per processor

    => 140 MB/s combined bandwidth per processor

    Assuming 1 GB/s bus bandwidth, about 8 processors will saturate the bus
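
    A small C sketch that reproduces the arithmetic above; the 4-byte instruction size, 30% load-store ratio, hit rates, and bus bandwidth are the slide's assumptions, and the code just spells out the calculation.

        #include <stdio.h>

        int main(void) {
            double clock_hz  = 1e9;              /* 1 GHz processor                        */
            double inst_bw   = clock_hz * 4;     /* 4-byte instruction per cycle -> 4 GB/s */
            double data_bw   = inst_bw * 0.30;   /* 30% load-store -> 1.2 GB/s             */

            double inst_miss = inst_bw * (1.0 - 0.98);  /* 2% misses -> 80 MB/s to memory  */
            double data_miss = data_bw * (1.0 - 0.95);  /* 5% misses -> 60 MB/s to memory  */
            double per_proc  = inst_miss + data_miss;   /* 140 MB/s per processor          */

            double bus_bw    = 1e9;                     /* 1 GB/s bus                      */
            printf("per-processor demand: %.0f MB/s\n", per_proc / 1e6);
            printf("processors to saturate the bus: %.1f\n", bus_bw / per_proc);  /* ~7.1  */
            return 0;
        }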


    Engineering: Intel Pentium Pro Quad

    SMP for the masses:

    All coherence and multiprocessing glue is in the processor module

    Highly integrated, targeted at high volume

    Low latency and bandwidth

    [Diagram: P-Pro modules (each with CPU, 256-KB L2 cache, interrupt controller, bus interface, and MIU) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), together with a memory controller to 1-, 2-, or 4-way interleaved DRAM and two PCI bridges leading to PCI buses with PCI I/O cards]


    Engineering: SUN Enterprise

    Proc + mem cards and I/O cards: 16 cards of either type

    All memory accessed over bus, so symmetric

    Higher bandwidth, higher latency bus

    [Diagram: Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connecting CPU/memory cards (two processors, each with first- and second-level caches, plus a memory controller and bus interface) and I/O cards (bus interface/switch to SBUS slots, 2 FiberChannel, 100bT, and SCSI)]


    Approaches to Building Parallel Machines

    [Diagram repeated from earlier: Shared Cache, Centralized Memory (Dance Hall, UMA), and Distributed Memory (NUMA) organizations, ordered by scale]


    Directory-Based Cache-Coherence


    90s Scalable, Cache-Coherent Multiprocessors

    [Diagram: processors with caches on an interconnection network; for each memory block, the directory keeps presence bits 1 ... n and a dirty bit]
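
    A minimal sketch in C of what one directory entry and its handling of read and write misses might look like, with a presence bit per node and a dirty bit; the node count, field names, and omitted messaging are assumptions for illustration.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* One directory entry per memory block, stored at the block's home node. */
        typedef struct {
            uint64_t presence;      /* bit i set => node i has a cached copy          */
            bool     dirty;         /* true => one cache holds a modified copy;
                                       memory's copy is stale                          */
        } dir_entry_t;

        /* Read miss from 'node': if dirty, the owner would first write back (omitted),
         * then the requester is added to the sharers.                                 */
        void dir_read_miss(dir_entry_t *e, int node) {
            e->dirty = false;
            e->presence |= (uint64_t)1 << node;
        }

        /* Write miss from 'node': invalidations would be sent to every current sharer
         * (omitted); the requester becomes the single dirty owner.                    */
        void dir_write_miss(dir_entry_t *e, int node) {
            e->presence = (uint64_t)1 << node;
            e->dirty = true;
        }

        int main(void) {
            dir_entry_t e = { 0, false };
            dir_read_miss(&e, 3);
            dir_read_miss(&e, 7);
            dir_write_miss(&e, 3);
            printf("presence=%#llx dirty=%d\n", (unsigned long long)e.presence, e.dirty);
            return 0;
        }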


    SGI Origin 2000

    [Diagram: two Origin 2000 nodes, each with two processors (1-4 MB L2 cache each), a Hub chip, an Xbow I/O crossbar, 1-4 GB of main memory, and directory memory, joined by the interconnection network]

    Single 16-by-11 PCB

    Directory state in same or separate DRAMs, accessed in parallel

    Up to 512 nodes (2 processors per node)

    With the 195 MHz R10K processor, peak 390 MFLOPS or 780 MIPS per processor

    Peak SysAD bus bandwidth is 780 MB/s, as is Hub-Memory bandwidth

    Hub to router chip and to Xbow is 1.56 GB/s (both are off-board)


    Caches and Scientific Computing

    Caches tend to perform worst on demanding applications that operate on large data sets:

    transaction processing

    operating systems

    sparse matrices

    Modern scientific codes use tiling/blocking to become cache friendly (see the loop sketch below)

    easier for dense codes than for sparse

    tiling and parallelism are similar transformations
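
    A hedged illustration of tiling/blocking for a dense kernel in C, here a blocked matrix-matrix multiply; the matrix size N and block size B are illustrative, chosen so that a few B x B tiles fit in cache.

        #include <stdio.h>

        #define N 512
        #define B 64                 /* block size: three BxB tiles should fit in cache */

        static double A[N][N], Bm[N][N], C[N][N];

        /* Blocked (tiled) multiply: each (ii, jj, kk) step works on B x B tiles that
         * stay resident in cache, instead of streaming whole rows and columns.        */
        void matmul_blocked(void) {
            for (int ii = 0; ii < N; ii += B)
                for (int jj = 0; jj < N; jj += B)
                    for (int kk = 0; kk < N; kk += B)
                        for (int i = ii; i < ii + B; i++)
                            for (int j = jj; j < jj + B; j++) {
                                double sum = C[i][j];
                                for (int k = kk; k < kk + B; k++)
                                    sum += A[i][k] * Bm[k][j];
                                C[i][j] = sum;
                            }
        }

        int main(void) {
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) { A[i][j] = 1.0; Bm[i][j] = 1.0; C[i][j] = 0.0; }
            matmul_blocked();
            printf("C[0][0] = %.0f\n", C[0][0]);   /* expect 512 */
            return 0;
        }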


    Scalable Global Address Space


    Global Address Space: Structured Memory

    Processor performs a load

    The pseudo-memory controller turns it into a message transaction with a remote controller, which performs the memory operation and replies with the data.

    Examples: BBN butterfly, Cray T3D

    [Diagram: a load (ld R <- addr) at the source node passes through the MMU and pseudo-memory controller, which sends a read request (addr, dest, src, tag) over the scalable network; the remote pseudo-processor performs the memory read and returns a read response (tag, data)]
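
    A hedged sketch, in C, of how a pseudo-memory controller might turn a load into a request/response message pair; the message layout and function names are assumptions for illustration, not the actual Butterfly or T3D hardware interface.

        #include <stdint.h>
        #include <stdio.h>

        typedef struct { uint8_t op; uint16_t src, dest, tag; uint64_t addr, data; } msg_t;
        enum { READ_REQ = 1, READ_RSP = 2 };

        #define NODES 4
        #define WORDS 1024
        static uint64_t mem[NODES][WORDS];        /* each node's local memory             */

        /* Remote side: the pseudo-processor services a read request from local memory.  */
        static msg_t remote_handle(msg_t req) {
            msg_t rsp = { READ_RSP, req.dest, req.src, req.tag,
                          req.addr, mem[req.dest][req.addr % WORDS] };
            return rsp;
        }

        /* Local side: the pseudo-memory controller converts a load into a message
         * transaction and waits for the reply (a real machine would overlap this).      */
        static uint64_t remote_load(uint16_t me, uint16_t dest, uint64_t addr) {
            msg_t req = { READ_REQ, me, dest, /*tag*/ 7, addr, 0 };
            msg_t rsp = remote_handle(req);       /* stands in for the network round trip */
            return rsp.data;
        }

        int main(void) {
            mem[2][5] = 42;                       /* word 5 in node 2's memory            */
            printf("loaded %llu\n", (unsigned long long)remote_load(0, 2, 5));
            return 0;
        }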


    Cray T3D: Global Address Space machine

    2048 Alphas (150 MHz, 16 or 64 MB each) + fast network

    43-bit virtual address space, 32-bit physical

    32-bit and 64-bit load/store + byte manipulation on regs.

    no L2 cache

    non-blocking stores, load/store re-ordering, memory fence

    load-lock / store-conditional

    Direct global memory access via external segment regs

    DTB annex, 32 entries, remote processor number and mode

    atomic swap between special local reg and memory

    special fetch&inc register (see the C11 analogy after this list)

    global-OR, global-AND barriers

    Prefetch Queue

    Block Transfer Engine

    User-level Message Queue
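
    The T3D exposes atomic swap and fetch&inc directly in hardware registers. As a rough software analogy only (not the T3D programming interface), the same two primitives look like this with C11 atomics:

        #include <stdatomic.h>
        #include <stdio.h>

        atomic_int counter = 0;     /* plays the role of the fetch&inc register      */
        atomic_int flag    = 0;     /* target of an atomic swap                      */

        int main(void) {
            /* fetch&inc: atomically return the old value and increment.             */
            int ticket = atomic_fetch_add(&counter, 1);

            /* atomic swap: atomically store a new value and get the old one back.   */
            int old = atomic_exchange(&flag, 1);

            printf("ticket=%d old flag=%d\n", ticket, old);
            return 0;
        }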


    Cray T3E

    Scales up to 1024 processors, 480 MB/s links

    Memory system similar to T3D

    Memory controller generates a request message for non-local references

    No hardware mechanism for coherence

    [Diagram: a T3E node contains a processor with cache, local memory, a memory controller with network interface, external I/O, and a switch with X, Y, Z links into the 3D torus]


    What to Take Away?

    Programming shared memory machines

    May allocate data in a large shared region without too many worries about where

    Memory hierarchy is critical to performance

    Even more so than on uniprocessors, due to coherence traffic

    For performance tuning, watch sharing (both true and false)

    Semantics

    Need to lock access to shared variables for read-modify-write (see the sketch after this list)

    Sequential consistency is the natural semantics

    Architects worked hard to make this work

    Caches are coherent with buses or directories

    No caching of remote data on shared address space machines

    But compiler and processor may still get in the way

    Non-blocking writes, read prefetching, code motion
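
    A minimal sketch of the read-modify-write point in C with POSIX threads: counter++ is a read-modify-write, so without the lock, increments from different threads can be lost; the names and counts are illustrative.

        #include <pthread.h>
        #include <stdio.h>

        #define NTHREADS 4
        #define ITERS    100000

        long counter = 0;                                /* shared */
        pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        void *add(void *arg) {
            for (int i = 0; i < ITERS; i++) {
                pthread_mutex_lock(&lock);               /* without this, counter++ races */
                counter++;                               /* read, modify, write           */
                pthread_mutex_unlock(&lock);
            }
            return NULL;
        }

        int main(void) {
            pthread_t t[NTHREADS];
            for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, add, NULL);
            for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
            printf("counter = %ld (expect %d)\n", counter, NTHREADS * ITERS);
            return 0;
        }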


    Where Are Things Going?

    High-end: collections of almost complete workstations/SMPs on a high-speed network (Millennium, IBM SP machines)

    with specialized communication assist integrated with the memory system to provide global access to shared data (??)

    Mid-end:

    almost all servers are bus-based CC SMPs

    high-end servers are replacing the bus with a network: Sun Enterprise 10000, Cray SV1, HP/Convex SPP, SGI Origin 2000

    volume approach is the Pentium Pro quad pack + SCI ring (Sequent, Data General)

    Low-end:

    SMP desktop is here

    Major change ahead: SMP on a chip as a building block