CS 267: Applications of Parallel Computers
Lecture 5: Shared Memory Parallel Machines
Horst D. Simon
http://www.cs.berkeley.edu/~strive/cs267
-
Basic Shared Memory Architecture
[Diagram: processors P1, P2, ..., Pn, each with a local cache ($), connected by a network to a shared memory]
Processors all connected to a large shared memory
Local caches for each processor
  Cost: much cheaper to cache than main memory
Simple to program, but hard to scale
Now take a closer look at structure, costs, limits
-
Programming Shared Memory (review)
Program is a collection of threads of control.
Each thread has a set of private variables, e.g. local variables on the stack
Collectively with a set of shared variables, e.g., static variables, shared common blocks, global heap
Communication and synchronization through shared variables
[Diagram: one address space for processors P, P, ..., P: a shared region (x = ..., y = ..x ...) plus a private region per processor]
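A minimal sketch of this model in C with pthreads (an assumed API; the slides name no particular threads library): locals on each thread's stack are private, a global array is shared, and the join provides synchronization.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    double results[NTHREADS];            /* shared: visible to every thread */

    void *worker(void *arg) {
        int id = *(int *)arg;            /* private: on this thread's stack */
        double local = id * 10.0;        /* private scratch variable */
        results[id] = local;             /* communicate through shared memory */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&t[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);    /* synchronization point */
        printf("results[3] = %g\n", results[3]);
        return 0;
    }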
-
Outline
Historical perspective
Bus-based machines
  Pentium SMP
  IBM SP node
Directory-based (CC-NUMA) machines
  Origin 2000
Global address space machines
  Cray T3D and (sort of) T3E
-
60s Mainframe Multiprocessors
Enhance memory capacity or I/O capabilities by adding
memory modules or I/O devices
How do you enhance processing capacity?
Add processors
Already need an interconnect between slow memory banks and processor + I/O channels
cross-bar or multistage interconnection network
[Diagram: processors (P) and I/O channels (IOC) connected through a cross-bar interconnect to multiple memory banks (M)]
-
70s Breakthrough: Caches
Memory system scaled by adding memory modules
Both bandwidth and capacity
Memory was still a bottleneck
Enter Caches!
Cache does two things:
Reduces average access time (latency)
Reduces bandwidth requirements to memory
[Diagram: fast processor with a cache, connected over an interconnect to slow memory and I/O devices; the cache keeps a copy of memory location A (value 17) close to the processor]
-
Technology Perspective
DRAM
Year   Size     Cycle Time
1980   64 Kb    250 ns
1983   256 Kb   220 ns
1986   1 Mb     190 ns
1989   4 Mb     165 ns
1992   16 Mb    145 ns
1995   64 Mb    120 ns
Capacity grew 1000:1, but cycle time improved only about 2:1

         Capacity         Speed
Logic:   2x in 3 years    2x in 3 years
DRAM:    4x in 3 years    1.4x in 10 years
Disk:    2x in 3 years    1.4x in 10 years

[Plot: SpecInt and SpecFP ratings vs. year, 1986-1996]
-
Approaches to Building Parallel Machines
[Diagram: three organizations, in order of increasing scale:
  Shared cache: P1..Pn share a first-level $ through a switch to interleaved main memory
  Centralized memory (dance hall, UMA): each processor has its own $; all reach memory over an interconnection network
  Distributed memory (NUMA): each processor has its own $ and local memory; nodes communicate over an interconnection network]
-
80s Shared Memory: Shared Cache
[Plot: transistors per chip vs. year, 1965-2005, for the Intel i4004 through Pentium (i80x86 line), the M68K line, and MIPS processors from the R3010 through the R10000]
Alliant FX-8
  early 80s: eight 68020s with a cross-bar to a 512 KB interleaved cache
Encore & Sequent
  first 32-bit micros (NS32032)
  two to a board with a shared cache
[Diagram: P1..Pn connected through a switch to a shared first-level $ and interleaved main memory]
-
Shared Cache: Advantages and Disadvantages
Advantages
  Cache placement identical to single cache: only one copy of any cached block
  Fine-grain sharing is possible
  Interference: one processor may prefetch data for another
  Can share data within a line without moving the line
Disadvantages
  Bandwidth limitation
  Interference: one processor may flush another processor's data
-
Approaches to Building Parallel Machines
[Repeat of the overview diagram: shared cache, centralized memory (dance hall, UMA), distributed memory (NUMA), in order of increasing scale]
-
Intuitive Memory Model
Reading an address should return the last value written to that address
  Easy in uniprocessors, except for I/O
The cache coherence problem in multiprocessors is both more pervasive and more performance-critical
More formally, this is called sequential consistency:
  "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]
-
Cache Coherence: Semantic Problem
p1 and p2 both have cached copies of x (as 0)
p1 writes x=1
May write through to memory
p2 reads x, but gets the stale cached copy
[Diagram: memory holds x = 0; after the write, p1's cache holds x = 1 while p2's cache still holds the stale copy x = 0]
-
Cache Coherence: Semantic Problem
What does this imply about program behavior?
No process ever sees garbage values, i.e., a mix of two different values
Processors always see values written by some processor
The value seen is constrained by program order on all processors
  Time always moves forward
Example: P1 writes x=1, then writes y=1; P2 reads y, then reads x
  Initially x = 0, y = 0
  P1:  x = 1; y = 1
  P2:  ... = y; ... = x
  If P2 sees the new value of y, it must see the new value of x
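A minimal sketch of this example in C (C11 atomics and pthreads are assumptions, not from the slides): with the default sequentially consistent atomic operations, a reader that sees the new y is guaranteed to also see the new x.

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int x = 0, y = 0;

    void *p1(void *arg) {                /* writer: data first, then flag */
        atomic_store(&x, 1);
        atomic_store(&y, 1);
        return NULL;
    }

    void *p2(void *arg) {                /* reader: flag first, then data */
        if (atomic_load(&y) == 1)
            printf("x = %d\n", atomic_load(&x));  /* must print 1 under SC */
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, p1, NULL);
        pthread_create(&b, NULL, p2, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return 0;
    }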
-
Snoopy Cache-Coherence Protocols
Bus is a broadcast medium & caches know what they have
Cache Controller snoops all transactions on the shared bus
A transaction is relevant if it involves a cache block currently contained in this cache
  Take action to ensure coherence: invalidate, update, or supply the value
  The action depends on the state of the block and the protocol
[Diagram: P1..Pn caches on a shared bus with memory and I/O devices; each cache line holds state, address, and data; a bus snoop watches every cache-memory transaction]
-
Basic Choices in Cache Coherence
Cache may keep information such as:
  Valid/invalid
  Dirty (inconsistent with memory)
  Shared (present in other caches)
When a processor executes a write operation to shared data, the basic design choices are (a sketch follows below):
  Write-thru: do the write in memory as well as the cache
  Write-back: wait and do the write later, when the item is flushed
  Update: give all other processors the new value
  Invalidate: have all other processors remove the block from their caches
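A toy sketch of one point in this design space, write-thru combined with invalidate; the type and function names are invented for illustration.

    #include <stdio.h>

    enum state { INVALID, VALID };      /* write-thru needs no dirty state */

    struct line { enum state st; int addr; int data; };

    /* Writing processor: update its own cache and write through to the bus,
       so memory and all snoopers observe the transaction. */
    void proc_write(struct line *l, int addr, int data) {
        l->st = VALID; l->addr = addr; l->data = data;
        printf("bus write: addr=%d data=%d\n", addr, data);
    }

    /* Every other cache, on snooping that bus write: invalidate a matching
       block, forcing the next local read to miss and refetch from memory. */
    void snoop_write(struct line *l, int addr) {
        if (l->st == VALID && l->addr == addr)
            l->st = INVALID;
    }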
-
Example: Write-thru Invalidate
Update and write-thru both use more memory bandwidth if there are repeated writes to the same address
  Update: traffic to the other caches
  Write-thru: traffic to memory
[Diagram: P1, P2, P3 with caches on a bus to memory and I/O devices; u = 5 in memory:
  1. P1 reads u (caches u:5)
  2. P3 reads u (caches u:5)
  3. P3 writes u = 7 (written through to memory; P1's copy is invalidated)
  4. P1 reads u = ?
  5. P2 reads u = ?]
-
Write-Back/Ownership Schemes
When a single cache has ownership of a block, processor writes do not result in bus writes, thus conserving bandwidth
  Reads by others cause the block to return to shared state
Most bus-based multiprocessors today use such schemes
Many variants of ownership-based protocols exist
-
Sharing: A Performance Problem
True sharing
  Frequent writes to a variable can create a bottleneck
  OK for read-only or infrequently written data
  Technique: make copies of the value, one per processor, if this is possible in the algorithm
  Example problem: the data structure that stores the freelist/heap for malloc/free
False sharing
  Cache blocks may also introduce artifacts
  Two distinct variables in the same cache block
  Technique: allocate data used by each processor contiguously, or at least avoid interleaving (see the sketch below)
  Example problem: an array of ints, one written frequently by each processor
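A minimal sketch of the false-sharing example and its fix, in C with pthreads (the 64-byte block size and all names are assumptions): counters packed into one cache block make every increment contend for the block, while padding gives each counter its own block.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define ITERS 10000000L
    #define BLOCK 64                    /* assumed cache block size in bytes */

    /* Bad: adjacent counters share one cache block (false sharing). */
    int counters_bad[NTHREADS];

    /* Better: pad each counter so it occupies its own cache block. */
    struct padded { int count; char pad[BLOCK - sizeof(int)]; };
    struct padded counters_good[NTHREADS];

    void *work(void *arg) {
        int id = *(int *)arg;
        for (long i = 0; i < ITERS; i++)
            counters_good[id].count++;  /* no block ping-pong between caches */
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        int ids[NTHREADS];
        for (int i = 0; i < NTHREADS; i++) {
            ids[i] = i;
            pthread_create(&t[i], NULL, work, &ids[i]);
        }
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        printf("counter 0 = %d\n", counters_good[0].count);
        return 0;
    }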
-
Limits of Bus-Based Shared Memory
[Diagram: two processors with caches, plus memory and I/O, sharing one bus; per-processor demand is 5.2 GB/s without caches vs. 140 MB/s with caches]
Assume a 1 GHz processor w/o cache
  => 4 GB/s instruction BW per processor (32-bit)
  => 1.2 GB/s data BW at 30% load-store
Suppose 98% instruction hit rate and 95% data hit rate
  => 80 MB/s instruction BW per processor
  => 60 MB/s data BW per processor
  => 140 MB/s combined BW per processor
Assuming 1 GB/s bus bandwidth, 8 processors will saturate the bus
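Spelling out the arithmetic behind these numbers:

    4 GB/s   x (1 - 0.98) = 80 MB/s of instruction traffic per processor
    1.2 GB/s x (1 - 0.95) = 60 MB/s of data traffic per processor
    80 + 60 = 140 MB/s per processor
    1 GB/s / 140 MB/s ~= 7.1, so about 8 processors fill a 1 GB/s bus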
-
Engineering: Intel Pentium Pro Quad
SMP for the masses:
  All coherence and multiprocessing glue in the processor module
  Highly integrated, targeted at high volume
  Low latency and bandwidth
[Diagram: four P-Pro modules, each with CPU, 256-KB L2 $, interrupt controller, and bus interface, on the P-Pro bus (64-bit data, 36-bit address, 66 MHz); a memory controller and MIU drive 1-, 2-, or 4-way interleaved DRAM, and two PCI bridges connect PCI buses and PCI I/O cards]
-
Engineering: SUN Enterprise
Proc + mem cards and I/O cards
  Up to 16 cards of either type
  All memory accessed over the bus, so symmetric
  Higher bandwidth, higher latency bus
[Diagram: Gigaplane bus (256-bit data, 41-bit address, 83 MHz); CPU/mem cards each carry two processors with $ and L2 $ plus a memory controller; I/O cards carry a bus interface with SBUS slots, 2 FiberChannel, 100bT, and SCSI]
-
Approaches to Building Parallel Machines
[Repeat of the overview diagram: shared cache, centralized memory (dance hall, UMA), distributed memory (NUMA), in order of increasing scale]
-
Directory-Based Cache-Coherence
-
90s Scalable, Cache-Coherent Multiprocessors
[Diagram: processor/cache nodes on an interconnection network; memory is distributed, and each memory block has a directory entry holding presence bits (1..n, one per node) and a dirty bit]
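A toy sketch of the directory entry in the figure (field names invented): one presence bit per node plus a dirty bit, kept per memory block.

    #include <stdint.h>
    #include <stdbool.h>

    #define NNODES 64                /* assumed machine size; bits fit one word */

    struct dir_entry {
        uint64_t presence;           /* bit i set => node i caches this block */
        bool dirty;                  /* set => one cache holds a modified copy */
    };

    /* On a read miss from node i: record the new sharer. */
    void dir_add_sharer(struct dir_entry *d, int i) {
        d->presence |= (uint64_t)1 << i;
    }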
-
SGI Origin 2000
[Diagram: each node holds two processors with 1-4 MB L2 caches, a Hub, an Xbow, main memory (1-4 GB), and directory state; nodes connect through an interconnection network]
Single 16"-by-11" PCB
Directory state in same or separate DRAMs, accessed in parallel
Up to 512 nodes (2 processors per node)
With 195 MHz R10K processor, peak 390 MFLOPS or 780 MIPS per proc
Peak SysAD bus BW is 780 MB/s, and the same for Hub-Mem
Hub to router chip and to Xbow is 1.56 GB/s (both are off-board)
-
Caches and Scientific Computing
Caches tend to perform worst on demanding applications that operate on large data sets
  transaction processing
  operating systems
  sparse matrices
Modern scientific codes use tiling/blocking to become cache friendly (see the sketch below)
  easier for dense codes than for sparse
  tiling and parallelism are similar transformations
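A minimal sketch of tiling for a dense kernel, in C (sizes and names are assumptions): matrix multiply restructured so each T-by-T tile is reused from cache before being evicted.

    #define N 512
    #define T 64     /* tile size: three TxT tiles should fit in cache */

    void matmul_tiled(double C[N][N], double A[N][N], double B[N][N]) {
        for (int ii = 0; ii < N; ii += T)
            for (int jj = 0; jj < N; jj += T)
                for (int kk = 0; kk < N; kk += T)
                    /* all accesses below stay within three TxT tiles */
                    for (int i = ii; i < ii + T; i++)
                        for (int j = jj; j < jj + T; j++) {
                            double sum = C[i][j];
                            for (int k = kk; k < kk + T; k++)
                                sum += A[i][k] * B[k][j];
                            C[i][j] = sum;
                        }
    }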
-
Scalable Global Address Space
-
Global Address Space: Structured Memory
Processor performs a load; the pseudo-memory controller turns it into a message transaction with a remote controller, which performs the memory operation and replies with the data
Examples: BBN Butterfly, Cray T3D
[Diagram: at the source node, a load (ld R) goes through the MMU and $ to the pseudo-memory controller, which sends a read request (addr, dest, tag, src) over the scalable network; the remote pseudo-memory controller reads its memory and returns a read response (tag, data)]
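A toy sketch of that request/response exchange in C; the message layouts are invented, though the fields follow the figure.

    /* Read request carried over the network, per the fields in the figure. */
    struct read_req  { unsigned addr; int dest; int tag; int src; };
    struct read_resp { int tag; long data; };

    /* Source side: the pseudo-memory controller turns a load into a message. */
    struct read_req make_request(unsigned addr, int dest, int tag, int src) {
        struct read_req rq = { addr, dest, tag, src };
        return rq;                       /* injected into the scalable network */
    }

    /* Remote side: perform the memory operation and reply with the data. */
    struct read_resp serve_request(const long mem[], struct read_req rq) {
        struct read_resp rsp = { rq.tag, mem[rq.addr] };
        return rsp;                      /* routed back to node rq.src */
    }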
-
Cray T3D: Global Address Space machine
2048 Alphas (150 MHz, 16 or 64 MB each) + fast network
43-bit virtual address space, 32-bit physical
32-bit and 64-bit load/store + byte manipulation on regs
no L2 cache
non-blocking stores, load/store re-ordering, memory fence
load-lock / store-conditional
Direct global memory access via external segment regs
DTB annex, 32 entries, remote processor number and mode
atomic swap between special local reg and memory
special fetch&inc register
global-OR, global-AND barriers
Prefetch Queue
Block Transfer Engine
User-level Message Queue
-
Cray T3E
Scales up to 1024 processors, 480 MB/s links
Memory system similar to T3D
  Memory controller generates a request message for non-local references
No hardware mechanism for coherence
[Diagram: T3E node: processor + $, a switch with X, Y, Z network links and external I/O, and a memory controller / network interface (NI) to local memory]
-
What to Take Away?
Programming shared memory machines
  May allocate data in a large shared region without too many worries about where
  Memory hierarchy is critical to performance
    Even more so than on uniprocessors, due to coherence traffic
  For performance tuning, watch sharing (both true and false)
Semantics
  Need to lock access to a shared variable for read-modify-write (see the sketch below)
  Sequential consistency is the natural semantics
    Architects worked hard to make this work: caches are coherent with buses or directories
    No caching of remote data on shared address space machines
  But compiler and processor may still get in the way
    Non-blocking writes, read prefetching, code motion
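A minimal sketch of that locking rule, in C with pthreads (names are illustrative): without the lock, two concurrent read-modify-writes can lose an update.

    #include <pthread.h>

    long total = 0;                                   /* shared variable */
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;

    void add(long amount) {
        pthread_mutex_lock(&m);      /* make the read-modify-write atomic */
        total = total + amount;      /* read, modify, write */
        pthread_mutex_unlock(&m);
    }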
-
Where are things going?
High-end
  collections of almost complete workstations/SMPs on a high-speed network (Millennium, IBM SP machines)
  with a specialized communication assist integrated with the memory system to provide global access to shared data (??)
Mid-end
  almost all servers are bus-based CC SMPs
  high-end servers are replacing the bus with a network
    Sun Enterprise 10000, Cray SV1, HP/Convex SPP
    SGI Origin 2000
  volume approach is Pentium Pro quadpack + SCI ring
    Sequent, Data General
Low-end
  SMP desktop is here
Major change ahead: SMP on a chip as a building block