Shared Memory Architecture



    CS 267: Applications of Parallel Computers

    Lecture 5: Shared Memory Parallel Machines

    Horst D. Simon

    http://www.cs.berkeley.edu/~strive/cs267


    Basic Shared Memory Architecture

    [Diagram: processors P1 ... Pn, each with a local cache ($), connected through a network to a shared memory]

    Processors all connected to a large shared memory

    Local caches for each processor

    Cost: much cheaper to access cache than main memory

    Simple to program, but hard to scale

    Now take a closer look at structure, costs, limits


    Programming Shared Memory (review)

    Program is a collection of threads of control.

    Each thread has a set of private variables, e.g., local variables on the stack.

    Collectively, the threads also have a set of shared variables,

    e.g., static variables, shared common blocks, global heap.

    Communication and synchronization happen through shared variables (see the sketch below)

    [Diagram: each thread P0 ... Pn has a private address space; all threads also see one shared address space, e.g., one thread writes x = ... while another reads y = .. x ...]
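
    The private/shared split above maps directly onto a threads library. Below is a minimal sketch in C with POSIX threads; the names (partial_sum, worker) and the thread count are assumptions for illustration. Globals are shared by every thread, while stack variables are private to each.

        #include <pthread.h>
        #include <stdio.h>

        #define NTHREADS 4

        double partial_sum[NTHREADS];   /* shared: global array visible to all threads */

        void *worker(void *arg) {
            long id = (long)arg;        /* private: each thread's own stack copy       */
            double local = 0.0;         /* private scratch variable                    */
            for (int i = 0; i < 1000; i++)
                local += id * 0.001;
            partial_sum[id] = local;    /* communicate through a shared variable       */
            return NULL;
        }

        int main(void) {
            pthread_t t[NTHREADS];
            for (long i = 0; i < NTHREADS; i++)
                pthread_create(&t[i], NULL, worker, (void *)i);
            double total = 0.0;
            for (int i = 0; i < NTHREADS; i++) {
                pthread_join(t[i], NULL);   /* join is the synchronization point here  */
                total += partial_sum[i];
            }
            printf("total = %f\n", total);
            return 0;
        }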


    Outline

    Historical perspective

    Bus-based machines

    Pentium SMP

    IBM SP node

    Directory-based (CC-NUMA) machine

    Origin 2000

    Global address space machines

    Cray T3D and (sort of) T3E


    60s Mainframe Multiprocessors

    Enhance memory capacity or I/O capabilities by adding memory modules or I/O devices

    How do you enhance processing capacity?

    Add processors

    Already need an interconnect between slow memory banks and processor + I/O channels

    cross-bar or multistage interconnection network

    [Diagram: processors and I/O channels connected to multiple memory modules through a cross-bar or multistage interconnect]


    70s Breakthrough: Caches

    Memory system scaled by adding memory modules

    Both bandwidth and capacity

    Memory was still a bottleneck

    Enter Caches!

    Cache does two things:

    Reduces average access time (latency)

    Reduces bandwidth requirements to memory

    [Diagram: fast processor with a cache (holding A: 17) in front of an interconnect, slow memory, and I/O devices]


    Technology Perspective

    DRAM
    Year    Size     Cycle Time
    1980    64 Kb    250 ns
    1983    256 Kb   220 ns
    1986    1 Mb     190 ns
    1989    4 Mb     165 ns
    1992    16 Mb    145 ns
    1995    64 Mb    120 ns
    Capacity 1000:1!   Cycle time only 2:1!

             Capacity         Speed
    Logic:   2x in 3 years    2x in 3 years
    DRAM:    4x in 3 years    1.4x in 10 years
    Disk:    2x in 3 years    1.4x in 10 years

    [Plot: SpecInt and SpecFP processor performance vs. year, 1986-1996]


    Approaches to Building Parallel Machines

    [Diagram: three organizations shown in order of increasing scale: Shared Cache (P1 ... Pn through a switch to a shared first-level cache and interleaved main memory), Centralized Memory / Dance Hall UMA (P1 ... Pn, each with a cache, connected by an interconnection network to interleaved memory), and Distributed Memory / NUMA (P1 ... Pn, each with a cache and local memory, connected by an interconnection network)]


    80s Shared Memory: Shared Cache

    [Plot: transistor count (10^3 to 10^8) vs. year, 1965-2005, for the i80x86, M68K, and MIPS families: i4004, i8086, i80286, i80386, i80486, Pentium, R3010, R4400, R10000]

    Alliant FX-8: early 80s, eight 68020s with a crossbar to a 512 KB interleaved cache

    Encore & Sequent: first 32-bit micros (N32032), two to a board with a shared cache

    [Diagram: shared-cache organization: P1 ... Pn through a switch to a shared first-level cache and interleaved main memory]


    Shared Cache: Advantages and Disadvantages

    Advantages

    Cache placement identical to a single cache: only one copy of any cached block

    Fine-grain sharing is possible

    Interference: one processor may prefetch data for another

    Can share data within a line without moving the line

    Disadvantages

    Bandwidth limitation

    Interference: one processor may flush another processor's data


    Approaches to Building Parallel Machines

    [Diagram repeated from earlier: Shared Cache, Centralized Memory (Dance Hall, UMA), and Distributed Memory (NUMA) organizations, ordered by scale]


    Intuitive Memory Model

    Reading an address should return the last value written to that address

    Easy in uniprocessors

    except for I/O

    Cache coherence problem in MPs is more pervasive and more performance critical

    More formally, this is called sequential consistency:

    A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. [Lamport, 1979]


    Cache Coherence: Semantic Problem

    p1 and p2 both have cached copies of x (as 0)

    p1 writes x=1

    May write through to memory

    p2 reads x, but gets the stale cached copy

    [Diagram: memory holds x = 0; p1 and p2 each hold a cached copy x = 0; p1 updates its copy to x = 1, while p2 still sees the stale x = 0]


    Cache Coherence: Semantic Problem

    What does this imply about program behavior?

    No process ever sees garbage values, i.e., a mix of two different values

    Processors always see values written by some processor

    The value seen is constrained by program order on all processors

    Time always moves forward

    Example: P1 writes x=1, then writes y=1

    P2 reads y, then reads x

    Initially x = 0, y = 0

    P1:        P2:
    x = 1      ... = y
    y = 1      ... = x

    If P2 sees the new value of y, it must see the new value of x
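
    A hedged sketch of this example in C with POSIX threads: under sequential consistency the reader can never observe y == 1 together with x == 0. (Real compilers and processors reorder plain loads and stores, so portable code would use C11 atomics or locks; the plain version below only illustrates the ordering argument on the slide.)

        #include <pthread.h>
        #include <stdio.h>

        int x = 0, y = 0;            /* both shared, both initially 0 */

        void *p1(void *arg) {        /* writer: x first, then y       */
            x = 1;
            y = 1;
            return NULL;
        }

        void *p2(void *arg) {        /* reader: y first, then x       */
            int ry = y;
            int rx = x;
            /* Under sequential consistency, ry == 1 implies rx == 1. */
            printf("y=%d x=%d\n", ry, rx);
            return NULL;
        }

        int main(void) {
            pthread_t a, b;
            pthread_create(&a, NULL, p1, NULL);
            pthread_create(&b, NULL, p2, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
        }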


    Snoopy Cache-Coherence Protocols

    Bus is a broadcast medium & caches know what they have

    Cache Controller snoops all transactions on the shared bus

    A transaction is a relevant transaction if it involves a cache block currently contained in this cache

    Take action to ensure coherence:

    invalidate, update, or supply value

    depends on state of the block and the protocol

    [Diagram: P1 ... Pn, each with a cache whose entries hold state, address, and data, sharing a bus with memory and I/O devices; each cache controller snoops bus transactions and performs its own cache-memory transactions]


    Basic Choices in Cache Coherence

    Cache may keep information such as:

    Valid/invalid

    Dirty (inconsistent with memory)

    Shared (in other caches)

    When a processor executes a write operation to shared data, the basic design choices are (see the sketch after this list):

    Write thru: do the write in memory as well as in the cache

    Write back: wait and do the write later, when the item is flushed

    Update: give all other processors the new value

    Invalidate: all other processors remove the block from their caches
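
    A minimal sketch of how a snooping cache controller might combine two of these choices, write-through plus invalidate, as in the example on the next slide. The state enum, data structures, and function names are assumptions for illustration, not an actual protocol implementation.

        #include <stdint.h>
        #include <stdio.h>

        typedef enum { INVALID, VALID } line_state_t;   /* write-through needs no Dirty state */

        typedef struct { line_state_t state; uint32_t tag, data; } cache_line_t;

        static uint32_t memory[1024];                   /* toy main memory                    */
        static void memory_write(uint32_t a, uint32_t v) { memory[a % 1024] = v; }
        static void broadcast_invalidate(uint32_t a);   /* would go out on the snooping bus   */

        /* Write-through + invalidate: update the cache and memory, then invalidate peers. */
        static void proc_write(cache_line_t *line, uint32_t addr, uint32_t value) {
            line->state = VALID; line->tag = addr; line->data = value;
            memory_write(addr, value);
            broadcast_invalidate(addr);
        }

        /* Every other controller snoops the bus and drops its copy on an invalidate. */
        static void snoop_invalidate(cache_line_t *line, uint32_t addr) {
            if (line->state == VALID && line->tag == addr) line->state = INVALID;
        }

        static cache_line_t peer;                        /* stand-in for one other cache      */
        static void broadcast_invalidate(uint32_t a) { snoop_invalidate(&peer, a); }

        int main(void) {
            cache_line_t mine = { INVALID, 0, 0 };
            peer = (cache_line_t){ VALID, 0x10, 5 };     /* peer caches u = 5 at address 0x10 */
            proc_write(&mine, 0x10, 7);                  /* this processor writes u = 7       */
            printf("peer is %s\n", peer.state == INVALID ? "INVALID" : "VALID");
            return 0;
        }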


    Example: Write-thru Invalidate

    Update and write-thru both use more memory bandwidth if there are repeated writes to the same address:

    Update to the other caches

    Write-thru to memory

    [Diagram: P1, P2, P3, each with a cache, on a bus with memory and I/O devices; (1) P1 reads u and caches u:5, (2) P3 reads u and caches u:5, (3) P3 writes u = 7, which is written through to memory and invalidates the other copies, (4) P1 reads u again, (5) P2 reads u]


    Write-Back/Ownership Schemes

    When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth.

    Reads by others cause the block to return to the shared state.

    Most bus-based multiprocessors today use such schemes.

    Many variants of ownership-based protocols


    Sharing: A Performance Problem

    True sharing

    Frequent writes to a variable can create a bottleneck

    OK for read-only or infrequently written data

    Technique: make copies of the value, one per processor, if this is possible in the algorithm

    Example problem: the data structure that stores the freelist/heap for malloc/free

    False sharing

    Cache blocks may also introduce artifacts

    Two distinct variables in the same cache block

    Technique: allocate data used by each processor contiguously, or at least avoid interleaving

    Example problem: an array of ints, one written frequently by each processor (see the sketch below)
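
    A hedged sketch in C with POSIX threads of the false-sharing problem and the padding fix; the 64-byte line size, iteration counts, and names are assumptions for illustration. Timing the two loops on a real SMP would show the padded version scaling much better.

        #include <pthread.h>
        #include <stdio.h>

        #define NTHREADS 4
        #define ITERS    10000000
        #define LINE     64                       /* assumed cache-line size in bytes   */

        /* False sharing: the four counters sit in the same cache line(s), so every
         * increment by one processor invalidates the line in the others' caches.     */
        int counters_bad[NTHREADS];

        /* Fix: pad each counter so it occupies its own cache line.                    */
        struct padded { int count; char pad[LINE - sizeof(int)]; } counters_good[NTHREADS];

        void *bump_bad(void *arg) {
            long id = (long)arg;
            for (int k = 0; k < ITERS; k++) counters_bad[id]++;
            return NULL;
        }

        void *bump_good(void *arg) {
            long id = (long)arg;
            for (int k = 0; k < ITERS; k++) counters_good[id].count++;
            return NULL;
        }

        int main(void) {
            pthread_t t[NTHREADS];
            for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, bump_bad, (void *)i);
            for (int i = 0; i < NTHREADS; i++)  pthread_join(t[i], NULL);
            for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, bump_good, (void *)i);
            for (int i = 0; i < NTHREADS; i++)  pthread_join(t[i], NULL);
            printf("%d %d\n", counters_bad[0], counters_good[0].count);
            return 0;
        }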


    Limits of Bus-Based Shared Memory

    [Diagram: two processors, each with a cache, sharing a bus with memory modules and I/O]

    Assume a 1 GHz processor without a cache:

    => 4 GB/s instruction bandwidth per processor (32-bit instructions)

    => 1.2 GB/s data bandwidth at 30% load-store, for 5.2 GB/s combined demand per processor

    Suppose a 98% instruction hit rate and a 95% data hit rate:

    => 80 MB/s instruction bandwidth per processor

    => 60 MB/s data bandwidth per processor

    => 140 MB/s combined bandwidth per processor

    Assuming 1 GB/s bus bandwidth, about 8 processors will saturate the bus
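
    A small C sketch that reproduces the arithmetic above; the 4-byte instruction size, 30% load-store ratio, hit rates, and bus bandwidth are the slide's assumptions, and the code just spells out the calculation.

        #include <stdio.h>

        int main(void) {
            double clock_hz  = 1e9;              /* 1 GHz processor                        */
            double inst_bw   = clock_hz * 4;     /* 4-byte instruction per cycle -> 4 GB/s */
            double data_bw   = inst_bw * 0.30;   /* 30% load-store -> 1.2 GB/s             */

            double inst_miss = inst_bw * (1.0 - 0.98);  /* 2% misses -> 80 MB/s to memory  */
            double data_miss = data_bw * (1.0 - 0.95);  /* 5% misses -> 60 MB/s to memory  */
            double per_proc  = inst_miss + data_miss;   /* 140 MB/s per processor          */

            double bus_bw    = 1e9;                     /* 1 GB/s bus                      */
            printf("per-processor demand: %.0f MB/s\n", per_proc / 1e6);
            printf("processors to saturate the bus: %.1f\n", bus_bw / per_proc);  /* ~7.1  */
            return 0;
        }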


    Engineering: Intel Pentium Pro Quad

    SMP for the masses:

    All coherence and multiprocessing glue is in the processor module

    Highly integrated, targeted at high volume

    Low latency and bandwidth

    [Diagram: P-Pro modules (each with CPU, 256-KB L2 cache, interrupt controller, bus interface, and MIU) on the P-Pro bus (64-bit data, 36-bit address, 66 MHz), together with a memory controller to 1-, 2-, or 4-way interleaved DRAM and two PCI bridges leading to PCI buses with PCI I/O cards]


    Engineering: SUN Enterprise

    Proc + mem cards and I/O cards: 16 cards of either type

    All memory accessed over bus, so symmetric

    Higher bandwidth, higher latency bus

    [Diagram: Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connecting CPU/memory cards (two processors, each with first- and second-level caches, plus a memory controller and bus interface) and I/O cards (bus interface/switch to SBUS slots, 2 FiberChannel, 100bT, and SCSI)]


    Approaches to Building Parallel Machines

    [Diagram repeated from earlier: Shared Cache, Centralized Memory (Dance Hall, UMA), and Distributed Memory (NUMA) organizations, ordered by scale]


    Directory-Based Cache-Coherence


    90s Scalable, Cache-Coherent Multiprocessors

    [Diagram: processors with caches on an interconnection network; for each memory block, the directory keeps presence bits 1 ... n and a dirty bit]
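
    A minimal sketch in C of what one directory entry and its handling of read and write misses might look like, with a presence bit per node and a dirty bit; the node count, field names, and omitted messaging are assumptions for illustration.

        #include <stdbool.h>
        #include <stdint.h>
        #include <stdio.h>

        /* One directory entry per memory block, stored at the block's home node. */
        typedef struct {
            uint64_t presence;      /* bit i set => node i has a cached copy          */
            bool     dirty;         /* true => one cache holds a modified copy;
                                       memory's copy is stale                          */
        } dir_entry_t;

        /* Read miss from 'node': if dirty, the owner would first write back (omitted),
         * then the requester is added to the sharers.                                 */
        void dir_read_miss(dir_entry_t *e, int node) {
            e->dirty = false;
            e->presence |= (uint64_t)1 << node;
        }

        /* Write miss from 'node': invalidations would be sent to every current sharer
         * (omitted); the requester becomes the single dirty owner.                    */
        void dir_write_miss(dir_entry_t *e, int node) {
            e->presence = (uint64_t)1 << node;
            e->dirty = true;
        }

        int main(void) {
            dir_entry_t e = { 0, false };
            dir_read_miss(&e, 3);
            dir_read_miss(&e, 7);
            dir_write_miss(&e, 3);
            printf("presence=%#llx dirty=%d\n", (unsigned long long)e.presence, e.dirty);
            return 0;
        }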


    SGI Origin 2000

    [Diagram: two Origin 2000 nodes, each with two processors (1-4 MB L2 cache each), a Hub chip, an Xbow I/O crossbar, 1-4 GB of main memory, and directory memory, joined by the interconnection network]

    Single 16-by-11 PCB

    Directory state in same or separate DRAMs, accessed in parallel

    Up to 512 nodes (2 processors per node)

    With the 195 MHz R10K processor, peak 390 MFLOPS or 780 MIPS per processor

    Peak SysAD bus bandwidth is 780 MB/s, as is Hub-Memory bandwidth

    Hub to router chip and to Xbow is 1.56 GB/s (both are off-board)


    Caches and Scientific Computing

    Caches tend to perform worst on demanding applications that operate on large data sets:

    transaction processing

    operating systems

    sparse matrices

    Modern scientific codes use tiling/blocking to become cache friendly (see the loop sketch below)

    easier for dense codes than for sparse

    tiling and parallelism are similar transformations
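
    A hedged illustration of tiling/blocking for a dense kernel in C, here a blocked matrix-matrix multiply; the matrix size N and block size B are illustrative, chosen so that a few B x B tiles fit in cache.

        #include <stdio.h>

        #define N 512
        #define B 64                 /* block size: three BxB tiles should fit in cache */

        static double A[N][N], Bm[N][N], C[N][N];

        /* Blocked (tiled) multiply: each (ii, jj, kk) step works on B x B tiles that
         * stay resident in cache, instead of streaming whole rows and columns.        */
        void matmul_blocked(void) {
            for (int ii = 0; ii < N; ii += B)
                for (int jj = 0; jj < N; jj += B)
                    for (int kk = 0; kk < N; kk += B)
                        for (int i = ii; i < ii + B; i++)
                            for (int j = jj; j < jj + B; j++) {
                                double sum = C[i][j];
                                for (int k = kk; k < kk + B; k++)
                                    sum += A[i][k] * Bm[k][j];
                                C[i][j] = sum;
                            }
        }

        int main(void) {
            for (int i = 0; i < N; i++)
                for (int j = 0; j < N; j++) { A[i][j] = 1.0; Bm[i][j] = 1.0; C[i][j] = 0.0; }
            matmul_blocked();
            printf("C[0][0] = %.0f\n", C[0][0]);   /* expect 512 */
            return 0;
        }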


    Scalable Global Address Space


    Global Address Space: Structured Memory

    Processor performs a load

    The pseudo-memory controller turns it into a message transaction with a remote controller, which performs the memory operation and replies with the data.

    Examples: BBN butterfly, Cray T3D

    [Diagram: a load (ld R <- addr) at the source node passes through the MMU and pseudo-memory controller, which sends a read request (addr, dest, src, tag) over the scalable network; the remote pseudo-processor performs the memory read and returns a read response (tag, data)]
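
    A hedged sketch, in C, of how a pseudo-memory controller might turn a load into a request/response message pair; the message layout and function names are assumptions for illustration, not the actual Butterfly or T3D hardware interface.

        #include <stdint.h>
        #include <stdio.h>

        typedef struct { uint8_t op; uint16_t src, dest, tag; uint64_t addr, data; } msg_t;
        enum { READ_REQ = 1, READ_RSP = 2 };

        #define NODES 4
        #define WORDS 1024
        static uint64_t mem[NODES][WORDS];        /* each node's local memory             */

        /* Remote side: the pseudo-processor services a read request from local memory.  */
        static msg_t remote_handle(msg_t req) {
            msg_t rsp = { READ_RSP, req.dest, req.src, req.tag,
                          req.addr, mem[req.dest][req.addr % WORDS] };
            return rsp;
        }

        /* Local side: the pseudo-memory controller converts a load into a message
         * transaction and waits for the reply (a real machine would overlap this).      */
        static uint64_t remote_load(uint16_t me, uint16_t dest, uint64_t addr) {
            msg_t req = { READ_REQ, me, dest, /*tag*/ 7, addr, 0 };
            msg_t rsp = remote_handle(req);       /* stands in for the network round trip */
            return rsp.data;
        }

        int main(void) {
            mem[2][5] = 42;                       /* word 5 in node 2's memory            */
            printf("loaded %llu\n", (unsigned long long)remote_load(0, 2, 5));
            return 0;
        }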


    Cray T3D: Global Address Space machine

    2048 Alphas (150 MHz, 16 or 64 MB each) + fast network

    43-bit virtual address space, 32-bit physical

    32-bit and 64-bit load/store + byte manipulation on regs.

    no L2 cache

    non-blocking stores, load/store re-ordering, memory fence

    load-lock / store-conditional

    Direct global memory access via external segment regs

    DTB annex, 32 entries, remote processor number and mode

    atomic swap between special local reg and memory

    special fetch&inc register (see the C11 analogy after this list)

    global-OR, global-AND barriers

    Prefetch Queue

    Block Transfer Engine

    User-level Message Queue
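
    The T3D exposes atomic swap and fetch&inc directly in hardware registers. As a rough software analogy only (not the T3D programming interface), the same two primitives look like this with C11 atomics:

        #include <stdatomic.h>
        #include <stdio.h>

        atomic_int counter = 0;     /* plays the role of the fetch&inc register      */
        atomic_int flag    = 0;     /* target of an atomic swap                      */

        int main(void) {
            /* fetch&inc: atomically return the old value and increment.             */
            int ticket = atomic_fetch_add(&counter, 1);

            /* atomic swap: atomically store a new value and get the old one back.   */
            int old = atomic_exchange(&flag, 1);

            printf("ticket=%d old flag=%d\n", ticket, old);
            return 0;
        }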


    Cray T3E

    Scales up to 1024 processors, 480 MB/s links

    Memory system similar to T3D

    Memory controller generates a request message for non-local references

    No hardware mechanism for coherence

    [Diagram: a T3E node contains a processor with cache, local memory, a memory controller with network interface, external I/O, and a switch with X, Y, Z links into the 3D torus]


    What to Take Away?

    Programming shared memory machines

    May allocate data in a large shared region without too many worries about where

    Memory hierarchy is critical to performance

    Even more so than on uniprocessors, due to coherence traffic

    For performance tuning, watch sharing (both true and false)

    Semantics

    Need to lock access to shared variables for read-modify-write (see the sketch after this list)

    Sequential consistency is the natural semantics

    Architects worked hard to make this work

    Caches are coherent with buses or directories

    No caching of remote data on shared address space machines

    But compiler and processor may still get in the way

    Non-blocking writes, read prefetching, code motion
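
    A minimal sketch of the read-modify-write point in C with POSIX threads: counter++ is a read-modify-write, so without the lock, increments from different threads can be lost; the names and counts are illustrative.

        #include <pthread.h>
        #include <stdio.h>

        #define NTHREADS 4
        #define ITERS    100000

        long counter = 0;                                /* shared */
        pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        void *add(void *arg) {
            for (int i = 0; i < ITERS; i++) {
                pthread_mutex_lock(&lock);               /* without this, counter++ races */
                counter++;                               /* read, modify, write           */
                pthread_mutex_unlock(&lock);
            }
            return NULL;
        }

        int main(void) {
            pthread_t t[NTHREADS];
            for (int i = 0; i < NTHREADS; i++) pthread_create(&t[i], NULL, add, NULL);
            for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], NULL);
            printf("counter = %ld (expect %d)\n", counter, NTHREADS * ITERS);
            return 0;
        }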


    Where Are Things Going?

    High-end: collections of almost complete workstations/SMPs on a high-speed network (Millennium, IBM SP machines)

    with specialized communication assist integrated with the memory system to provide global access to shared data (??)

    Mid-end:

    almost all servers are bus-based CC SMPs

    high-end servers are replacing the bus with a network: Sun Enterprise 10000, Cray SV1, HP/Convex SPP, SGI Origin 2000

    volume approach is the Pentium Pro quad pack + SCI ring (Sequent, Data General)

    Low-end:

    SMP desktop is here

    Major change ahead: SMP on a chip as a building block