DSM Innovations - MultiView
Software Distributed Shared Memory (SDSM):
MultiView
1. SDSM, false sharing.
2. Solution: MultiView.
3. Granularity adaptation.
4. Integrated services.
Ayal Itzkovitz, Assaf Schuster
A multi-core system (simplified)
A parallel program may spawn processes (threads) in order to utilize all computing units
Processes communicate through shared memory, physically located on the local machine
[Figure: four cores attached to one local memory]
A distributed system
[Figure: several nodes, each with a core and its own local memory, connected by a network; a virtual shared memory layer spans the nodes]
Virtual Shared Memory: emulation of the same programming paradigm
Ultimately: no changes to source/binary code
The First SDSM System
The first SDSM system: Ivy [Li & Hudak, Yale, ’86]
Strict memory semantics (Lamport’s sequential consistency)
Page-based: memory pages are the units of sharing
The major performance limitation: the page size causes false sharing
Page size: 4 KB (and more)
Average object size: 28 bytes
About 150 objects on a page
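The "about 150 objects" figure follows from dividing the page size by the average object size. A minimal sketch of that arithmetic; the function name objects_per_page is made up for illustration:

```c
#include <assert.h>
#include <stddef.h>

/* Number of objects of obj_size bytes that fit entirely in one page. */
size_t objects_per_page(size_t page_size, size_t obj_size)
{
    assert(obj_size > 0);   /* a zero-sized object makes no sense here */
    return page_size / obj_size;
}
```

With a 4096-byte page and 28-byte objects this gives 146, so a write to any one of those objects invalidates the page that holds all of them.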
Object Distribution
Object Distribution – Memory View
[Figure: objects distributed over the memories of hosts connected by a network]
False Sharing
“…the conventional wisdom remains that the overhead of false sharing […] in page-based consistency protocols is the primary factor limiting the performance of software SDSM”
[Amza, Cox, Rajamani, and Zwaenepoel, PPoPP ’97]
“[The] conventional wisdom holds that fine-grain performance and false sharing doom page-based approaches”
[Buck and Keleher, IPPS ’98]
Solution: The MultiView Approach
“MultiView and Millipage – Fine-grain Sharing in Page-based SDSMs” [Itzkovitz and Schuster, OSDI ‘99]
Implement small-size pages through special memory configuration
Other goals:
No compromise on the strict memory consistency [ICS’04, EuroPar’04]
Utilizing low-latency networks (Myrinet, VIA/ServerNet-II, Infiniband) [Hot-Interconnects’03, IPDPS’04]
Transparency [EuroPar’03]
Adaptive sharing granularity [ICPP’00, IPDPS’01 best paper]
Maximize locality through migration and load sharing [DISC’01]
Additional “service layers” (garbage collection, data-race detection) [JPDC’01,JPDC02]
The Traditional Memory Layout
[Figure: traditional layout; the globals x, y, z and the heap objects w, v, u laid out in memory]
struct a { … };
struct b { … };
int x, y, z;

main() {
    w = malloc(sizeof(struct a));
    v = malloc(sizeof(struct a));
    u = malloc(sizeof(struct b));
    …
}
The MultiView Technique
[Figure sequence, Traditional vs. MultiView: the same variables x, y, z, w, v, u are shown in the traditional layout and in the MultiView layout, where they are reached through View 1, View 2, and View 3 of one memory object]
Protection is now set independently (RW, NA, R)
Variables reside in the same page but are not shared
Memory Layout
[Figure: Host A and Host B each map View 1, View 2, and View 3 onto their copy of the memory object; each view carries its own protection (R, RW, or NA) on each host]
Enabling Technology
Memory-mapped I/O, created for inter-process communication
[Figure: a shared memory object]
Implementation: Millipage
A shared memory object can also be used by a single process to provide the desired functionality
• Windows-NT (Solaris, BSD, Linux)
• CreateFileMapping(), MapViewOfFileEx() for allocating views
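The slide names the Windows calls; the same idea can be sketched with POSIX calls on Linux or BSD. The sketch below is an illustration under assumptions, not the Millipage code: the function name multiview_demo, the temporary file used as the memory object, and the placement of x and y are all made up. It maps one page twice, makes the first view read-only, and still updates a neighbouring variable through the second view.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Map one memory object twice and protect the two views independently.
 * Returns 0 if the views behave as expected, -1 on any failure. */
int multiview_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    char path[] = "/tmp/multiview-XXXXXX";   /* hypothetical backing file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                            /* the object lives while mapped */
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }

    /* Two views of the same memory object. */
    char *view1 = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    char *view2 = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (view1 == MAP_FAILED || view2 == MAP_FAILED)
        return -1;

    int *x = (int *)view1;                   /* x is accessed through view 1 */
    int *y = (int *)(view2 + sizeof(int));   /* y is accessed through view 2 */
    *x = 1;
    *y = 2;

    /* x becomes read-only; y shares the page but keeps its RW view. */
    if (mprotect(view1, (size_t)page, PROT_READ) != 0)
        return -1;
    *y = 3;

    /* Both views show the same bytes. */
    int ok = (*x == 1) && (*(int *)(view1 + sizeof(int)) == 3);

    munmap(view1, (size_t)page);
    munmap(view2, (size_t)page);
    return ok ? 0 : -1;
}
```

A write to x through view 1 after the mprotect call would fault, while the write to y does not, although both variables sit on the same physical page.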
Transparency
1999: Minipages are allocated at malloc time (via a malloc-like API)
Allocation routines should be slightly modified. Instead of one allocation for the whole matrix:
    mat = malloc(lines*cols*sizeof(int));
    …
    mat[i][j] = mat[i-1][j] + mat[i][j-1];
    …
allocate each line separately:
    mat = malloc(lines*sizeof(int*));
    for (i = 0; i < lines; i++)
        mat[i] = malloc(cols*sizeof(int));
    …
    mat[i][j] = mat[i-1][j] + mat[i][j-1];
    …
SOR and LU have not been modified at all
WATER: changed ~20 lines out of 783 lines
IS: changed 5 lines out of 93 lines
TSP: changed ~15 lines out of ~400 lines
2003: complete transparency, through binary instrumentation/interception of OS calls
SOR SPLASH-II Benchmark
[Figure: SOR speedup vs. number of threads. Series: Transparent DSM, Millipede 4.0, Transparent+Barrier, SMP (2 processors)]
Performance with Fixed Granularity (NBodyW on 8 nodes)
[Figure: run time [s] vs. allocation granularity]
False Sharing vs. Prefetching (WATER)
[Figure: compete requests (x 10), read/write faults, and efficiency vs. chunking level (1 to 6, none), each for 4 and 8 hosts]
Adapting Granularity
[Figure: shared data elements vs. application run time]
Adaptation is dynamic, automatic, transparent
Performance (VIA/ServerNet-II, 2004)
[Figure: Water-nsq speedup vs. nodes (1 to 12), two panels: one thread per node and two threads per node. Series: SC/MV - fine granularity, HLRC, Mixed consistency, SC/MV - best static granularity, SC/MV - dynamic granularity]
Integrating Data Race Detection
Detection in application variable granularity
[Figure: normalized execution time for SOR, LU, IS, TSP, WATER, two panels: overheads on 1 processor and overheads on 8 processors. Series: NO_DR, BAS, PCT, OPT]
Integrating Distributed Garbage Collection (Remote Reference Counting)
Collection in native application granularity.
[Figure: overhead vs. garbage creation ratio (obj/sec)]
Application   Garbage creation ratio (obj/sec)   Overhead
IS            0.8                                0.20%
LU            30                                 2.50%
WATER         31                                 2.60%
SOR           1140                               37.70%
Questions?
Types of Parallel Systems
1. In-core multi-threading
2. Multi-core/SMP multi-threading
3. Tightly-coupled cluster, customized interconnect (SGI’s Altix)
4. Tightly-coupled cluster, off-the-shelf interconnect (InfiniBand)
5. WAN, Internet, Grid, peer-to-peer
[Figure axes: scalability, communication efficiency]
Traditionally: 1+2 are programmable using shared memory, 3+4 are programmable using message passing, and in 5 peer processes communicate with central control only.
HDSM: systems in 3 move towards presenting a shared memory interface to a physically distributed system.
What about 4 and 5? Software Distributed Shared Memory = SDSM
Matrix Multiplication
[Figure: two threads; A and B are read-only matrices (R), C is the written matrix (W)]
A = malloc(MATSIZE);
B = malloc(MATSIZE);
C = malloc(MATSIZE);
parfor(n) mult(A, B, C);

mult(id):
    for (line = N*id .. N*(id+1))
        for (col = 0 .. N)
            C[line,col] = multline(A[line], B[col]);
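The pseudocode above can be turned into plain C roughly as follows. This is a sketch under assumptions: the matrix dimension N, the thread count NTHREADS, and the helper mult_all are made up, and the parfor is replaced by a sequential loop so that the example has no threading dependency. Each id writes only its own block of rows of C, while every id reads all of A and B.

```c
#include <assert.h>

#define N 4          /* matrix dimension (assumed) */
#define NTHREADS 2   /* "two threads", as on the slide */

double A[N][N], B[N][N], C[N][N];

/* Dot product of row `line` of A with column `col` of B. */
double multline(int line, int col)
{
    double sum = 0.0;
    for (int k = 0; k < N; k++)
        sum += A[line][k] * B[k][col];
    return sum;
}

/* Work of thread `id`: a contiguous block of N / NTHREADS rows of C. */
void mult(int id)
{
    int rows = N / NTHREADS;
    assert(id >= 0 && id < NTHREADS);
    for (int line = rows * id; line < rows * (id + 1); line++)
        for (int col = 0; col < N; col++)
            C[line][col] = multline(line, col);
}

/* Stand-in for parfor: run the per-thread work one id after another. */
void mult_all(void)
{
    for (int id = 0; id < NTHREADS; id++)
        mult(id);
}
```

In a page-based SDSM the two row blocks of C would fall on the same page unless the blocks are large or page-aligned, which is the situation the false sharing slides illustrate.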
Matrix Multiplication
[Figure sequence: two hosts connected by a network, each computing A x B = C; the pages of A and B are read-only (RO) on both hosts and are sent once; the pages of C are held RW by the host that writes them]
Matrix Multiplication - False Sharing
[Figure sequence: the two hosts write (W) to different parts of C that reside on the same page; the page is RW on one host and NA on the other, and the RW copy moves back and forth across the network, while the pages of A and B stay RO on both hosts]
First Approach: Weak Semantics
Example - Release Consistency:
Allow multiple writers to a page (assume exclusive update for any portion of the page)
Each page has a twin copy
At synchronization time, all pages perform a “diff” with their twins and send the diffs to managers
Managers hold master copies
[Figure: two hosts, each with an RW copy of the page and its twin; each host's diff is applied to the master copy]
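The twin/diff mechanism can be sketched in a few functions. This is an illustration, not the code of any of the cited systems: the names make_twin, compute_diff, apply_diff, diff_entry, and the constant DSM_PAGE_SIZE are assumptions, and the diff is a plain list of changed bytes rather than a run-length encoding.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define DSM_PAGE_SIZE 4096   /* assumed page size */

/* One changed byte: its offset within the page and its new value. */
typedef struct {
    size_t offset;
    unsigned char value;
} diff_entry;

/* Writer side: keep a pristine copy of the page before the first write. */
void make_twin(const unsigned char *page, unsigned char *twin)
{
    memcpy(twin, page, DSM_PAGE_SIZE);
}

/* Writer side, at synchronization time: record every byte that differs
 * from the twin. `out` must have room for DSM_PAGE_SIZE entries. */
size_t compute_diff(const unsigned char *page, const unsigned char *twin,
                    diff_entry *out)
{
    size_t n = 0;
    for (size_t i = 0; i < DSM_PAGE_SIZE; i++) {
        if (page[i] != twin[i]) {
            out[n].offset = i;
            out[n].value = page[i];
            n++;
        }
    }
    return n;
}

/* Manager side: merge one writer's diff into the master copy. */
void apply_diff(unsigned char *master, const diff_entry *diff, size_t n)
{
    assert(master != NULL);
    for (size_t i = 0; i < n; i++)
        master[diff[i].offset] = diff[i].value;
}
```

Two hosts that write disjoint bytes of the same page each produce a small diff, and applying both diffs to the master copy merges their updates. The comparison still scans the whole page, including locations that were never touched.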
First Approach: Weak Semantics
Allow memory to reside in an inconsistent state for time intervals
Enforce consistency only at synchronization points
Reaching a consistent view of the memory requires computation
Reduces (but does not always eliminate) false sharing
Reduces the number of protocol messages
Weak memory semantics
Involves both memory and processing time overhead
Still: coarse-grain sharing (why diff at locations not touched?)
Software DSM Evolution - Weak Semantics
Page-grain, relaxed consistency:
IVY, ’86 (Li & Hudak, Yale)
Munin, ’92: release consistency (Rice)
Midway, ’93: entry consistency (CMU)
Treadmarks, ’94: lazy release consistency (Rice)
Brazos, ’97: scope consistency (Rice)
Software DSM Evolution - Multithreading
Multithreading:
CVM, Millipede, ’96: multi-protocol (Maryland, Technion)
Quarks, ’98: protocol latency hiding (Utah)
Second Approach: Code Instrumentation
Example - Binary Rewriting: wrap each load and store with instructions that check whether the data is available locally

Source:
    line += 3;
    v = v - line;

Compiled:
    load  r1, ptr[line]
    load  r2, ptr[v]
    add   r1, 3h
    store r1, ptr[line]
    sub   r2, r1
    store r2, ptr[v]

After code instrumentation:
    push  ptr[line]
    call  __check_r
    load  r1, ptr[line]
    push  ptr[v]
    call  __check_r
    load  r2, ptr[v]
    add   r1, 3h
    push  ptr[line]
    call  __check_w
    store r1, ptr[line]
    push  ptr[line]
    call  __done
    sub   r2, r1
    push  ptr[v]
    call  __check_w
    store r2, ptr[v]
    push  ptr[v]
    call  __done

After optimization:
    push  ptr[line]
    call  __check_w
    load  r1, ptr[line]
    push  ptr[v]
    call  __check_w
    load  r2, ptr[v]
    add   r1, 3h
    store r1, ptr[line]
    push  ptr[line]
    call  __done
    sub   r2, r1
    store r2, ptr[v]
    push  ptr[v]
    call  __done
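At the source level, the optimized instrumentation corresponds roughly to the sketch below. Everything here is an assumption made for illustration: the block table, the names check_r, check_w, and instrumented_update (the slide's __check_r and __check_w are renamed because identifiers with a leading double underscore are reserved in C), and the counter that stands for a remote fetch.

```c
#include <assert.h>
#include <stddef.h>

#define NUM_BLOCKS 8   /* assumed number of fine-grain blocks */

enum block_state { INVALID, READ_ONLY, WRITABLE };

enum block_state state[NUM_BLOCKS];   /* zero-initialized: all INVALID */
int data[NUM_BLOCKS];                 /* one int per block, for brevity */
int remote_fetches = 0;               /* counts simulated protocol actions */

/* Software check before a load: obtain a readable copy if there is none. */
void check_r(size_t block)
{
    assert(block < NUM_BLOCKS);
    if (state[block] == INVALID) {
        remote_fetches++;             /* stands for fetching the block */
        state[block] = READ_ONLY;
    }
}

/* Software check before a store: obtain write permission if missing. */
void check_w(size_t block)
{
    assert(block < NUM_BLOCKS);
    if (state[block] != WRITABLE) {
        remote_fetches++;             /* stands for acquiring ownership */
        state[block] = WRITABLE;
    }
}

/* Instrumented form of "line += 3; v = v - line;" after optimization:
 * one write check per variable covers both its load and its store. */
void instrumented_update(size_t line, size_t v)
{
    check_w(line);
    check_w(v);
    data[line] += 3;
    data[v] = data[v] - data[line];
}
```

The checks run on every instrumented access, even when the data is already local, which is where the single-machine overhead of this approach comes from.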
Second Approach: Code Instrumentation
Provides fine-grain access control, thus avoids false sharing
Bypasses the page protection mechanism
Usually, fixed granularity for all application data (still, false sharing)
Needs a special compiler or binary-level rewriting tools
Cost:
High overheads (even on a single machine)
Inflated code
Not portable (among architectures)
Software DSM Evolution
Fine-grain, code instrumentation:
Blizzard, ’94: binary instrumentation (Wisconsin)
Shasta, ’97: transparent, works for commercial apps (Digital WRL)
MultiView - Overheads
Application: traverse an array of integers, all packed up in minipages
The number of minipages is derived from the value of max views in page
Limitations of the experiments:
1.63 GB contiguous address space available
Up to 1664 views
Need 64 bits!
MultiView - Overheads
As expected, committed (physical) memory is constant
Only a negligible overhead (< 4%), due to TLB misses
[Figure: slowdown vs. number of views (1 to 32) for array sizes 512 KB, 1 MB, 2 MB, 4 MB, 8 MB, 16 MB]
MultiView - Taking it to the extreme
Beyond critical points overhead becomes substantial
[Figure: slowdown vs. number of views for array sizes 512 KB, 1 MB, 2 MB, 4 MB, 8 MB, 16 MB; critical points are marked for 1 MB, 2 MB, 4 MB, 8 MB]
Number of minipages at critical points is 128K
Slowdown due to L2 cache exhausted by PTEs
The Transparent DSM: System Initialization
For most DSM systems, initialization is an almost trivial task
The transparent DSM system cannot use such a simple solution
In order to initialize a DSM system transparently we have to inject the initialization code into the loaded application
Standard Initialization
Standard C application:
crtStartup:
    …
    call c_init     ; initialize the C runtime library
    …
    call main       ; start the application
    …
main:
    … application code …
The startup code comes from the C standard library. This code is identical for all C applications. crtStartup is the entry point of the executable.
The call main instruction lies at a fixed offset from crtStartup. We denote this offset as main_call_offset.
Transparent DSM System Initialization
Application:
crtStartup:
    …
    call c_init     ; initialize the C runtime library
    …
    call main       ; the call target is patched from main to hookedMain
    …
main:
    … application code …
Injected DLL:
mainPtr dd NULL
hookedMain:
    dsm_init(…);                        ; initialize the DSM system (the OS API is intercepted, globals are moved to the DSM)
    dsm_create_thread(…, mainPtr, …);   ; the application main thread is created using the DSM system thread creation API
    …
DllMain:
    …
    crtStartup = get_entry_point();
    mainPtr = *(crtStartup + main_call_offset);
    *(crtStartup + main_call_offset) = hookedMain;
    …
The OS passes control to DllMain() after the DLL has been loaded. The main thread is then resumed.
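The redirection of the call to main can be mimicked in portable C with a function pointer. The sketch below is only an analogy under stated assumptions: a pointer variable (main_call_slot) stands in for the operand of the call instruction inside crtStartup, and all names (app_main, install_hook, crt_startup, dsm_initialized) are hypothetical. The real system patches code in the loaded executable from an injected DLL.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*main_fn)(void);

int dsm_initialized = 0;   /* set by the simulated dsm_init */
int app_saw_dsm = 0;       /* what the application observed at start */

/* The application's main, unaware of the DSM. */
int app_main(void)
{
    app_saw_dsm = dsm_initialized;
    return 0;
}

/* Stand-in for the target of "call main" inside crtStartup. */
main_fn main_call_slot = app_main;

/* Stand-in for mainPtr in the injected DLL. */
main_fn main_ptr = NULL;

/* Runs in place of main: initialize the DSM, then start the application. */
int hooked_main(void)
{
    dsm_initialized = 1;   /* stands for dsm_init(...) */
    assert(main_ptr != NULL);
    return main_ptr();     /* stands for dsm_create_thread(..., mainPtr, ...) */
}

/* Stand-in for DllMain: save the original target and redirect the call. */
void install_hook(void)
{
    main_ptr = main_call_slot;
    main_call_slot = hooked_main;
}

/* Stand-in for crtStartup: eventually calls whatever the slot holds. */
int crt_startup(void)
{
    return main_call_slot();
}
```

After install_hook, the startup code still believes it calls main, yet the DSM is initialized before the first line of application code runs.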
SDSMs on Emerging Fast Networks
Fast networking is an emerging technology
MultiView provides only one aspect: reducing message sizes
The next magnitude of improvement shifts from the network layer to the system architectures and protocols that use those networks
Challenges:
Efficiently employ and integrate fast networks
Provide a “thin” protocol layer: reduce protocol complexity, eliminate buffer copying, use home-based management, etc.
Adding the Privileged View
Constant Read/Write permissions
Separate application threads from SDSM injected threads
Atomic updates: DSM threads can access (and update) memory while application threads are prohibited
Direct send/receive: memory-to-memory, no buffer copying
[Figure: the application views of x, y, z carry the protections RW, NA, R; the privileged view of the same memory object is always RW]
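The privileged view can be sketched with POSIX mappings: one view stays inaccessible to the application while a second, always-writable view of the same memory object receives the update. The function name privileged_update_demo, the temporary backing file, and the payload string are assumptions made for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Update a page through a privileged RW view while the application view
 * is inaccessible, then open the application view for reading.
 * Returns 0 on success, -1 on any failure. */
int privileged_update_demo(void)
{
    long page = sysconf(_SC_PAGESIZE);
    assert(page > 0);

    char path[] = "/tmp/privview-XXXXXX";   /* hypothetical backing file */
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    if (ftruncate(fd, page) != 0) {
        close(fd);
        return -1;
    }

    /* Application view: no access until the data has arrived. */
    char *app = mmap(NULL, (size_t)page, PROT_NONE, MAP_SHARED, fd, 0);
    /* Privileged view: constant read/write permission, used by DSM threads. */
    char *priv = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);
    if (app == MAP_FAILED || priv == MAP_FAILED)
        return -1;

    /* The DSM thread installs the data; application threads touching
     * `app` at this point would fault instead of seeing a partial update. */
    memcpy(priv, "fresh", sizeof "fresh");

    /* Only now is the application view opened for reading. */
    if (mprotect(app, (size_t)page, PROT_READ) != 0)
        return -1;
    int ok = (strcmp(app, "fresh") == 0);

    munmap(app, (size_t)page);
    munmap(priv, (size_t)page);
    return ok ? 0 : -1;
}
```

Because incoming data can be received directly into the privileged view, no intermediate buffer and no temporary change of the application's protection are needed.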
Coarse Granularity
[Figure: Host 1, Host 2, Host 3, and a manager, with minipages 1 to 6; messages shown: memory access request (1-6), two forwarded requests, reply (data 2, 4, 5), reply (data 1, 3)]
Automatic Adaptation of Granularity
Recompose: when the same host accesses consecutive minipages (from fine to coarse granularity)
Split: when different hosts update different minipages (from coarse to fine granularity)
[Figure: minipages 1 to 6, held by Host A at coarse granularity, are split between Host A and Host B at fine granularity and later recomposed]
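The split and recompose rules can be written down as a small state machine. This is a simplified sketch under assumptions, not the published adaptation protocol: a chunk of six minipages is either one sharing unit (coarse) or six units (fine), and the names chunk_t, chunk_init, on_write, and try_recompose are made up.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_MINIPAGES 6

typedef struct {
    bool coarse;                    /* true: all minipages form one unit */
    int last_host[NUM_MINIPAGES];   /* host that last accessed each minipage */
} chunk_t;

/* Start at coarse granularity, with every minipage held by `host`. */
void chunk_init(chunk_t *c, int host)
{
    c->coarse = true;
    for (int i = 0; i < NUM_MINIPAGES; i++)
        c->last_host[i] = host;
}

/* Record an update. Split when a different host updates a different
 * minipage of a chunk that is currently shared as one unit. */
void on_write(chunk_t *c, int minipage, int host)
{
    assert(minipage >= 0 && minipage < NUM_MINIPAGES);
    if (c->coarse) {
        for (int i = 0; i < NUM_MINIPAGES; i++) {
            if (i != minipage && c->last_host[i] != host) {
                c->coarse = false;  /* split: fine granularity */
                break;
            }
        }
    }
    c->last_host[minipage] = host;
}

/* Recompose when one host has accessed all consecutive minipages. */
void try_recompose(chunk_t *c)
{
    for (int i = 1; i < NUM_MINIPAGES; i++)
        if (c->last_host[i] != c->last_host[0])
            return;                 /* still accessed by several hosts */
    c->coarse = true;
}
```

A chunk used by one host travels as a single unit, which gives the prefetching benefit of coarse granularity; once a second host starts updating a different minipage, the chunk falls back to fine granularity and false sharing is avoided.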
Memory Faults (Barnes)
[Figure: number of read faults and write faults (axis 0 to 50000); label: Millipede]
Water-nsq Performance (cont’d)
[Figure: Water-nsquared run time breakdown (sec) for SC/MV-f.g., HLRC, Mixed, SC/MV-b.g.; components: computation, read faults, write faults, barriers, locks]
[Figure: the effect of chunking in Water-nsquared; protocol overhead (read faults, write faults, compete requests) and run-time (sec) vs. chunking level (molecules, 1 to 8)]
Basic Costs in Millipage (Myrinet interconnect, 1998)
Operation               Cost
Access fault            26 usec
Get protection          7 usec
Set protection          12 usec
Messages (one way):
  header msg            12 usec
  data msg (1/2 KB)     22 usec
  data msg (1 KB)       34 usec
  data msg (4 KB)       90 usec
MPT translation         7 usec
Message sizes directly influence latency
The most compute-demanding operation: minipage translation, 7 usec
In relaxed consistency systems, protocol operations might take hundreds of usecs
Example: run-length diff for a 4 KB page, 250 usec
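To see how the message size feeds into the cost of one remote fetch, the table entries can be added up. The cost model below is an assumption made for illustration (one access fault, one header message as the request, one MPT translation, one data message as the reply, one protection change); it is not a measurement from the slides, and the function name fetch_cost_usec is made up.

```c
#include <assert.h>

/* Costs in microseconds, taken from the table above. */
enum {
    ACCESS_FAULT    = 26,
    SET_PROTECTION  = 12,
    HEADER_MSG      = 12,
    MPT_TRANSLATION = 7
};

/* Assumed model of one remote fetch: fault, request header,
 * minipage translation, data reply, protection change. */
int fetch_cost_usec(int data_msg_usec)
{
    assert(data_msg_usec >= 0);
    return ACCESS_FAULT + HEADER_MSG + MPT_TRANSLATION
         + data_msg_usec + SET_PROTECTION;
}
```

Under this model a 4 KB reply (90 usec) gives 147 usec per fetch, while a 1/2 KB minipage (22 usec) gives 79 usec, so the data message accounts for most of the difference.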
Scalability (IB vs. VIA interconnects, 2003)
[Figure: application speedups on 8 nodes (axis 0 to 16). Series: VIA/ServerNet - 1 thread, Kernel/IB - 1 thread, Kernel/IB - 2 threads]