DSM Innovations - MultiView

Software Distributed Shared Memory (SDSM):

MultiView

1. SDSM, false sharing.

2. Solution: MultiView.

3. Granularity adaptation.

4. Integrated services.

Ayal Itzkovitz, Assaf Schuster


A multi-core system (simplified)

A parallel program may spawn processes (threads) in order to utilize all computing units

Processes communicate through shared memory, physically located on the local machine

[Figure: four cores sharing one local memory]


A distributed system

[Figure: three nodes, each with a core and its own local memory, connected by a network]

Virtual Shared Memory

Emulation of the same programming paradigm
Ultimately: no changes to source/binary code


The First SDSM System

The first SDSM system: Ivy [Li & Hudak, Yale, ‘86]
Strict memory semantics (Lamport’s sequential consistency)

Page-based: memory pages as units of sharing

The major performance limitation: page size, which leads to false sharing
Page size: 4K (and more)
Average object size: 28 bytes
About 150 objects on a page
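The figure of about 150 objects per page follows directly from the two sizes quoted above. A minimal sketch of the arithmetic, assuming 4 KB pages and 28-byte objects packed without padding (the function name is illustrative, not part of any SDSM API):

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole objects of obj_size bytes that fit in one page,
 * assuming the objects are packed back to back with no padding. */
size_t objects_per_page(size_t page_size, size_t obj_size)
{
    return page_size / obj_size;
}
```

With `objects_per_page(4096, 28)` this gives 146 objects, so a write to any one of them can invalidate the page that holds all the others.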


Object Distribution

[Figure: object distribution across the network]

Object Distribution – Memory View

“…the conventional wisdom remains that the overhead of false sharing […] in page-based consistency protocols is the primary factor limiting the performance of software SDSM”

[Amza, Cox, Rajamani, and Zwaenepoel, PPoPP ‘97]

“[The] conventional wisdom holds that fine-grain performance and false sharing doom page-based approaches”

[Buck and Keleher, IPPS ‘98]

False Sharing


Solution: The MultiView Approach

“MultiView and Millipage – Fine-grain Sharing in Page-based SDSMs” [Itzkovitz and Schuster, OSDI ‘99]

Implement small-size pages through special memory configuration

Other goals:

Without compromising strict memory consistency [ICS’04, EuroPar’04]

Utilizing low-latency networks (Myrinet, VIA/ServerNet-II, Infiniband) [Hot-Interconnects’03, IPDPS’04]

Transparency [EuroPar’03]

Adaptive sharing granularity [ICPP’00, IPDPS’01 best paper]

Maximize locality through migration and load sharing [DISC’01]

Additional “service layers” (garbage collection, data-race detection) [JPDC’01, JPDC’02]


The Traditional Memory Layout

[Figure: traditional memory layout of x, y, z, w, v, u]

    struct a { … };
    struct b { … };
    int x, y, z;
    struct a *w, *v;
    struct b *u;

    main() {
        w = malloc(sizeof(struct a));
        v = malloc(sizeof(struct a));
        u = malloc(sizeof(struct b));
        …
    }


The MultiView Technique

[Figure: Traditional vs. MultiView layout of the variables x, y, z, w, v, u. Under MultiView the same memory object is mapped as View 1, View 2 and View 3.]

Protection is now set independently (RW, NA, R)

Variables reside in the same page but are not shared

Memory Layout

[Figure: Host A and Host B each map the memory object through View 1, View 2 and View 3. Each view on each host carries its own protection (R, RW or NA).]


Enabling Technology

Shared memory object

Memory-mapped I/O, created for inter-process communication


Implementation: Millipage

Can also be used by a single process to provide the desired functionality

• Windows-NT (Solaris, BSD, Linux)

• CreateFileMapping(), MapViewOfFileEx() for allocating views
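The slides name the Windows calls. As a rough POSIX analogue (an assumption, not the Millipage implementation), the same effect can be sketched by mapping one memory object several times with `mmap` and giving each mapping its own protection with `mprotect`. The names `make_views`, `protect_views` and `NVIEWS` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

#define NVIEWS 3

/* Map one page-sized memory object NVIEWS times. Every mapping is a
 * separate "view" of the same physical page. Returns 0 on success. */
int make_views(char *views[NVIEWS], size_t *len_out)
{
    long page = sysconf(_SC_PAGESIZE);
    FILE *f = tmpfile();                  /* backing memory object */
    if (f == NULL || page <= 0)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, page) != 0)
        return -1;
    for (int i = 0; i < NVIEWS; i++) {
        views[i] = mmap(NULL, (size_t)page, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);
        if (views[i] == MAP_FAILED)
            return -1;
    }
    fclose(f);                            /* mappings remain valid */
    *len_out = (size_t)page;
    return 0;
}

/* Give each view its own protection: RW, NA and R. */
int protect_views(char *views[NVIEWS], size_t len)
{
    if (mprotect(views[0], len, PROT_READ | PROT_WRITE) != 0) return -1;
    if (mprotect(views[1], len, PROT_NONE) != 0)              return -1;
    if (mprotect(views[2], len, PROT_READ) != 0)              return -1;
    return 0;
}
```

A byte written through `views[0]` is visible through `views[2]`, while any access through `views[1]` faults. This is the property MultiView relies on: variables that share a physical page are reached through different views, so each can carry a different protection.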


Transparency

1999: Minipages are allocated at malloc time (via a malloc-like API)
Allocation routines should be slightly modified

    /* one allocation for the whole matrix */
    int (*mat)[cols] = malloc(lines * cols * sizeof(int));
    …
    mat[i][j] = mat[i-1][j] + mat[i][j-1];
    …

    /* one allocation per line */
    int **mat = malloc(lines * sizeof(int *));
    for (i = 0; i < lines; i++)
        mat[i] = malloc(cols * sizeof(int));
    …
    mat[i][j] = mat[i-1][j] + mat[i][j-1];
    …

SOR and LU have not been modified at all
WATER: changed ~20 lines out of 783 lines
IS: changed 5 lines out of 93 lines
TSP: changed ~15 lines out of ~400 lines

2003: complete transparency
Through binary instrumentation/interception of OS calls
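To illustrate how a malloc-like API could hand out minipages, here is a minimal sketch of a placement policy. It is an assumption for illustration, not the Millipage allocator: each allocation becomes one minipage, and minipages that share a physical page are assigned different views. The names `mv_alloc_t`, `minipage_t` and `mv_malloc` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    size_t page_size;   /* size of a physical page in the memory object */
    size_t max_views;   /* number of views mapped over the object       */
    size_t next_offset; /* next free byte in the memory object          */
    size_t used_views;  /* minipages already placed in the current page */
} mv_alloc_t;

typedef struct {
    size_t view;        /* view through which the application accesses it */
    size_t offset;      /* byte offset inside the memory object           */
} minipage_t;

/* Place `size` bytes (0 < size <= page_size) as a new minipage.
 * Minipages in the same page get different views. When the page is
 * full or its views are used up, allocation moves to the next page. */
minipage_t mv_malloc(mv_alloc_t *a, size_t size)
{
    size_t in_page = a->next_offset % a->page_size;
    if (in_page == 0) {
        a->used_views = 0;                          /* fresh page */
    } else if (in_page + size > a->page_size ||
               a->used_views == a->max_views) {
        a->next_offset += a->page_size - in_page;   /* skip to next page */
        a->used_views = 0;
    }
    minipage_t m = { a->used_views, a->next_offset };
    a->next_offset += size;
    a->used_views++;
    return m;
}
```

The application pointer for a minipage would then be the base address of view `m.view` plus `m.offset`, so changing the protection of that view's page affects only this minipage.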


SOR SPLASH-II Benchmark

[Figure: SOR speedup vs. number of threads. Series: Transparent DSM, Millipede 4.0, Transparent+Barrier, SMP (2 processors).]


Performance with Fixed Granularity (NBodyW on 8 nodes)

[Figure: run time [s] vs. allocation granularity]


False Sharing vs. Prefetching (WATER)

[Figure: compete requests (x 10), read/write faults and efficiency vs. chunking level (1 to 6, none), each shown for 4 hosts and 8 hosts]


Adapting Granularity

[Figure: shared data elements vs. application run time]

Adaptation is dynamic, automatic, transparent


Performance (VIA/ServerNet-II, 2004)

[Figure: Water-nsq speedup vs. number of nodes, in two panels (one thread per node, two threads per node). Series: SC/MV - fine granularity, HLRC, Mixed consistency, SC/MV - best static granularity, SC/MV - dynamic granularity.]


Integrating Data Race Detection

Detection at application-variable granularity

[Figure: normalized execution time for SOR, LU, IS, TSP and WATER, in two panels (Overheads 1 proc, Overheads 8 proc). Series: NO_DR, BAS, PCT, OPT.]


Integrating Distributed Garbage Collection (Remote Reference Counting)

Collection in native application granularity.

Application   Garbage creation ratio (obj/sec)   Overhead
IS            0.8                                0.20%
LU            30                                 2.50%
WATER         31                                 2.60%
SOR           1140                               37.70%


Questions?


Types of Parallel Systems

1. In-core multi-threading
2. Multi-core/SMP multi-threading
3. Tightly-coupled cluster, customized interconnect (SGI’s Altix)
4. Tightly-coupled cluster, off-the-shelf interconnect (InfiniBand)
5. WAN, Internet, Grid, peer-to-peer

[Figure labels: Scalability, Communication Efficiency]

Traditionally: 1+2 are programmable using shared memory, 3+4 are programmable using message passing, and in 5 peer processes communicate with central control only.

HDSM: systems in 3 move towards presenting a shared memory interface to a physically distributed system.

What about 4, 5? Software Distributed Shared Memory = SDSM


Matrix Multiplication

[Figure: two threads; A and B are read-only matrices (R), C is the write matrix (W)]

    A = malloc(MATSIZE);
    B = malloc(MATSIZE);
    C = malloc(MATSIZE);

    parfor (id = 0 .. n)
        mult(id);

    mult(id):
        for (line = N*id .. N*(id+1))
            for (col = 0 .. N)
                C[line, col] = multline(A[line], B[col]);


[Figure sequence: A x B = C computed on two hosts connected by a network. The pages of the read-only matrices A and B are marked RO on both hosts and are each sent once.]

Matrix Multiplication - False Sharing

[Figure sequence: the pages of the write matrix C alternate between RW on one host and NA on the other as both hosts write to them.]

First Approach: Weak Semantics

Example - Release Consistency:
Allow multiple writers to a page (assume exclusive update for any portion of the page)
Each page has a twin copy
At synchronization time, all pages perform “diff” with their twins, and send diffs to managers
Managers hold master copies

[Figure: two hosts, each holding an RW page and its twin; each host's diff is applied to the master copy]
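The twin/diff mechanism described above can be sketched in a few lines. This is a simplified illustration that assumes a byte-granularity diff (real systems typically encode runs of changed bytes); the names `make_twin`, `make_diff`, `apply_diff` and `diff_entry_t` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct {
    size_t        offset;  /* position of the changed byte in the page */
    unsigned char value;   /* its new value                            */
} diff_entry_t;

/* Before the first write, save a twin (pristine copy) of the page. */
void make_twin(const unsigned char *page, unsigned char *twin, size_t n)
{
    memcpy(twin, page, n);
}

/* At synchronization time, record every byte that differs from the
 * twin. `out` must have room for n entries. Returns the entry count. */
size_t make_diff(const unsigned char *page, const unsigned char *twin,
                 size_t n, diff_entry_t *out)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        if (page[i] != twin[i]) {
            out[count].offset = i;
            out[count].value = page[i];
            count++;
        }
    }
    return count;
}

/* The manager applies a received diff to its master copy. */
void apply_diff(unsigned char *master, const diff_entry_t *d, size_t count)
{
    for (size_t i = 0; i < count; i++)
        master[d[i].offset] = d[i].value;
}
```

Two hosts that modify disjoint bytes of the same page produce diffs that merge cleanly into the master copy. Note that `make_diff` scans the whole page even if only one byte changed, which is the cost the slides point at when they ask why locations that were never touched are diffed.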


First Approach: Weak Semantics

Allow memory to reside in an inconsistent state for time intervals
Enforce consistency only at synchronization points
Reaching a consistent view of the memory requires computation

Reduces (but does not always eliminate) false sharing
Reduces the number of protocol messages

Weak memory semantics
Involves both memory and processing time overhead

Still: coarse-grain sharing (why diff at locations not touched?)


Software DSM Evolution - Weak Semantics

IVY, ‘86 (Li & Hudak, Yale)
Munin, ‘92: Release Consistency (Rice)
Midway, ‘93: Entry Consistency (CMU)
TreadMarks, ‘94: Lazy Release Consistency (Rice)
Brazos, ‘97: Scope Consistency (Rice)

Page-grain: Relaxed consistency


Software DSM Evolution - Multithreading



Page-grain: Relaxed consistency

Multithreading:
CVM, Millipede, ‘96: multi-protocol (Maryland, Technion)
Quarks, ‘98: protocol latency hiding (Utah)


Second Approach: Code Instrumentation

Example - Binary Rewriting: wrap each load and store with instructions that check whether the data is available locally

Source:

    line += 3;
    v = v - line;

Compile:

    load  r1, ptr[line]
    load  r2, ptr[v]
    add   r1, 3h
    store r1, ptr[line]
    sub   r2, r1
    store r2, ptr[v]

Code Instr.:

    push  ptr[line]
    call  __check_r
    load  r1, ptr[line]
    push  ptr[v]
    call  __check_r
    load  r2, ptr[v]
    add   r1, 3h
    push  ptr[line]
    call  __check_w
    store r1, ptr[line]
    push  ptr[line]
    call  __done
    sub   r2, r1
    push  ptr[v]
    call  __check_w
    store r2, ptr[v]
    push  ptr[v]
    call  __done

Opt.:

    push  ptr[line]
    call  __check_w
    load  r1, ptr[line]
    push  ptr[v]
    call  __check_w
    load  r2, ptr[v]
    add   r1, 3h
    store r1, ptr[line]
    push  ptr[line]
    call  __done
    sub   r2, r1
    store r2, ptr[v]
    push  ptr[v]
    call  __done
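The slides do not show what the inserted check calls do internally. The sketch below is one plausible reading, stated as an assumption: a per-block state table that a read check or write check consults before the access, fetching the block when its state is insufficient. `check_r` and `check_w` stand in for `__check_r` and `__check_w`; `__done` is omitted because its semantics are not given. The block size, table size and `fetch_block` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 64   /* assumed fixed access-control granularity */
#define NBLOCKS    16

enum block_state { INVALID, READ_ONLY, READ_WRITE };

static enum block_state state[NBLOCKS];  /* all INVALID at start */
static int fetches;                      /* protocol requests issued */

/* Stand-in for the protocol: obtain the block with the wanted rights. */
static void fetch_block(size_t b, enum block_state want)
{
    state[b] = want;
    fetches++;
}

/* Called before a load: the block must be at least readable. */
void check_r(size_t addr)
{
    size_t b = addr / BLOCK_SIZE;
    if (state[b] == INVALID)
        fetch_block(b, READ_ONLY);
}

/* Called before a store: the block must be writable. */
void check_w(size_t addr)
{
    size_t b = addr / BLOCK_SIZE;
    if (state[b] != READ_WRITE)
        fetch_block(b, READ_WRITE);
}
```

Under this reading, the optimized listing pays off because a read check followed by a write check on the same invalid block costs two requests, while a single write check before the load costs one.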


Second Approach: Code Instrumentation

Provides fine-grain access control, thus avoids false sharing
Bypasses the page protection mechanism
Usually, fixed granularity for all application data (still, false sharing)
Needs a special compiler or binary-level rewriting tools

Cost:
High overheads (even on a single machine)
Inflated code
Not portable (among architectures)


Software DSM Evolution



Page-grain: Relaxed consistency

Multithreading:
CVM, Millipede, ‘96: multi-protocol (Maryland, Technion)
Quarks, ‘98: protocol latency hiding (Utah)

Fine-grain: Code Instrumentation
Blizzard, ‘94: binary instrumentation (Wisconsin)
Shasta, ‘97: transparent, works for commercial apps (Digital WRL)


MultiView - Overheads

Application: traverse an array of integers, all packed up in minipages

The number of minipages is derived from the value of max views in page

Limitations of the experiments:
1.63GB contiguous address space available
Up to 1664 views
Need 64 bits!!!


MultiView - Overheads

As expected, committed (physical) memory is constant
Only a negligible overhead (< 4%), due to TLB misses

[Figure: slowdown (0.96 to 1.08) vs. number of views (1, 2, 4, 8, 16, 32) for array sizes 512 KB, 1 MB, 2 MB, 4 MB, 8 MB and 16 MB]


MultiView - Taking it to the extreme

Beyond critical points overhead becomes substantial

[Figure: slowdown (0 to 20) vs. number of views for array sizes 512 KB, 1 MB, 2 MB, 4 MB, 8 MB and 16 MB; critical points are marked for 8 MB, 4 MB, 2 MB and 1 MB]

Number of minipages at critical points is 128K
Slowdown due to L2 cache exhausted by PTEs
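A small sketch of the arithmetic behind the minipage count, assuming 4 KB pages and one page-table entry per minipage (one per page per view). The 4-byte entry size used in the usage note is also an assumption (32-bit x86 without PAE), and the function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Minipages created when an array of array_bytes is mapped through
 * `views` views, with one minipage per page per view. */
size_t minipage_count(size_t array_bytes, size_t page_bytes, size_t views)
{
    return (array_bytes / page_bytes) * views;
}

/* Page-table footprint in bytes, assuming pte_bytes per entry. */
size_t pte_footprint(size_t minipages, size_t pte_bytes)
{
    return minipages * pte_bytes;
}
```

For example, a 16 MB array with 32 views and an 8 MB array with 64 views both give 131072 (128K) minipages. At 4 bytes per entry that is 512 KB of page-table entries.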




The Transparent DSM: System Initialization

For most DSM systems, initialization is an almost trivial task

The transparent DSM system cannot use such a simple solution

In order to initialize a DSM system transparently we have to inject the initialization code into the loaded application


Standard Initialization

Standard C application:

    crtStartup:
        …
        call c_init
        …
        call main
        …

    main:
        …application code…

crtStartup: startup code from the C standard library. This code is identical for all C applications. crtStartup is the entry point of the executable.

call c_init: initialize the C runtime library

call main: start the application. This instruction lies at a fixed offset from crtStartup. We denote this offset as main_call_offset.


Transparent DSM System Initialization

Application:

    crtStartup:
        …
        call c_init
        …
        call main        (redirected: main becomes hookedMain)
        …

    main:
        …application code…

Injected DLL:

    mainPtr dd NULL

    hookedMain:
        dsm_init(…);
        dsm_create_thread(…, mainPtr, …);
        …

    DllMain:
        …
        crtStartup = get_entry_point();
        mainPtr = *(crtStartup + main_call_offset);
        *(crtStartup + main_call_offset) = hookedMain;
        …

The OS passes control to DllMain() after the DLL has been loaded
The main thread is resumed
Initialize the C runtime library
Initialize the DSM system (the OS API is intercepted, globals are moved to the DSM)
The application main thread is created using the DSM system thread creation API
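The redirection performed by DllMain can be mimicked in portable C with a function pointer standing in for the patched call target. This is only a simulation of the control flow under that assumption: the real system patches the call instruction in the loaded binary and starts main in a DSM-created thread, neither of which is reproduced here. All names are hypothetical.

```c
#include <assert.h>

typedef int (*main_fn)(void);

static int dsm_ready;                 /* set once the DSM is initialized */

/* The application's main: reports whether the DSM was up when it ran. */
static int app_main(void)
{
    return dsm_ready ? 42 : -1;
}

static main_fn call_slot = app_main;  /* stands in for "call main"   */
static main_fn main_ptr;              /* mainPtr in the injected DLL */

/* Replacement target: initialize the DSM, then run the original main. */
static int hooked_main(void)
{
    dsm_ready = 1;                    /* stands in for dsm_init(...) */
    return main_ptr();                /* slides: via dsm_create_thread */
}

/* What DllMain does: save the original target, install the hook. */
void dll_main(void)
{
    main_ptr = call_slot;
    call_slot = hooked_main;
}

/* The startup code simply calls whatever the slot points to. */
int crt_startup(void)
{
    return call_slot();
}
```

Before `dll_main()` runs, `crt_startup()` reaches the application directly. Afterwards the same call goes through `hooked_main`, so the DSM is initialized before any application code executes.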


SDSMs on Emerging Fast Networks

Fast networking is an emerging technology
MultiView provides only one aspect: reducing message sizes

The next magnitude of improvement shifts from the network layer to the system architectures and protocols that use those networks

Challenges:
Efficiently employ and integrate fast networks
Provide a “thin” protocol layer: reduce protocol complexity, eliminate buffer copying, use home-based management, etc.


Adding the Privileged View

Constant Read/Write permissions
Separate application threads from SDSM injected threads

Atomic updates: DSM threads can access (and update) memory while application threads are prohibited

Direct send/receive: memory-to-memory, no buffer copying

[Figure: the application views (RW, NA, R) and the privileged view (RW) of the same memory object]
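A minimal POSIX sketch of the privileged-view idea, under the assumption that `mmap` and `mprotect` stand in for the Windows calls used by the actual system. The DSM side writes through a view that is always read/write while the application view is still inaccessible, and only then opens the application view. The names `dual_view_t`, `dual_view_init` and `dsm_install` are hypothetical.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct {
    char  *app;   /* application view: protection changes over time */
    char  *priv;  /* privileged view: always read/write             */
    size_t len;
} dual_view_t;

/* Map one page twice: NA for the application, RW for the DSM. */
int dual_view_init(dual_view_t *dv)
{
    long page = sysconf(_SC_PAGESIZE);
    FILE *f = tmpfile();
    if (f == NULL || page <= 0)
        return -1;
    int fd = fileno(f);
    if (ftruncate(fd, page) != 0)
        return -1;
    dv->len  = (size_t)page;
    dv->app  = mmap(NULL, dv->len, PROT_NONE, MAP_SHARED, fd, 0);
    dv->priv = mmap(NULL, dv->len, PROT_READ | PROT_WRITE,
                    MAP_SHARED, fd, 0);
    fclose(f);                        /* mappings remain valid */
    return (dv->app == MAP_FAILED || dv->priv == MAP_FAILED) ? -1 : 0;
}

/* DSM thread: install n bytes (n <= len) of incoming data through the
 * privileged view, then make the application view readable. */
int dsm_install(dual_view_t *dv, const char *data, size_t n)
{
    memcpy(dv->priv, data, n);
    return mprotect(dv->app, dv->len, PROT_READ);
}
```

Because the data lands directly in the shared page through `priv`, there is no window in which an application thread can observe a partially updated minipage, and no intermediate buffer is needed.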


Coarse Granularity

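[Figure: minipages 1 to 6 managed at coarse granularity across Host 1, Host 2, Host 3 and the manager. A memory access request for minipages 1-6 goes to the manager, which issues further requests; the replies carry data (2,4,5) and data (1,3).]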


Automatic Adaptation of Granularity

Recompose: when the same host accesses consecutive minipages (fine granularity becomes coarse granularity)

Split: when different hosts update different minipages (coarse granularity becomes fine granularity)

[Figure: minipages 1 to 6 shown as one coarse unit accessed by Host A, and as separate fine-grain minipages accessed by Host A and Host B]
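The split and recompose rules can be sketched as a small state machine over a chunk of minipages. This is a simplified illustration, not the protocol from the cited papers: it assumes a fixed chunk of six minipages (as in the figure), treats every access alike, and uses hypothetical names throughout.

```c
#include <assert.h>

#define CHUNK 6   /* minipages per chunk, as in the figure */

typedef struct {
    int coarse;            /* 1: the chunk is managed as one unit       */
    int last_host[CHUNK];  /* last host to access each minipage, or -1  */
} chunk_t;

void chunk_init(chunk_t *c)
{
    c->coarse = 1;
    for (int i = 0; i < CHUNK; i++)
        c->last_host[i] = -1;
}

/* Record an access by `host` to minipage `idx` and adapt granularity. */
void on_access(chunk_t *c, int idx, int host)
{
    c->last_host[idx] = host;
    int all_same = 1, conflict = 0;
    for (int i = 0; i < CHUNK; i++) {
        if (c->last_host[i] != host) {
            all_same = 0;
            if (c->last_host[i] != -1)
                conflict = 1;  /* another host holds a different minipage */
        }
    }
    if (conflict)
        c->coarse = 0;         /* split: hosts use different minipages */
    else if (all_same)
        c->coarse = 1;         /* recompose: one host used them all    */
}
```

A chunk stays coarse while a single host walks through it, splits as soon as a second host touches another minipage, and recomposes once one host has again accessed all six.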


Memory Faults (Barnes)

[Figure: number of read faults and write faults (0 to 50000); one label reads Millipede]


Water-nsq Performance (cont’d)

[Figure: Water-nsquared breakdown. Run-time breakdowns (sec) for SC/MV-f.g., HLRC, Mixed and SC/MV-b.g.; components: computation, read faults, write faults, barriers, locks.]

[Figure: The effect of chunking in Water-nsquared. Protocol overhead (read faults, write faults, compete requests; scale x 10^4) and run-time (sec) vs. chunking level (molecules), 1 to 8.]


Basic Costs in Millipage (Myrinet interconnect, 1998)

Operation                    Cost
Access fault                 26 usec
Get protection               7 usec
Set protection               12 usec
Messages (one way):
  header msg                 12 usec
  data msg (1/2 KB)          22 usec
  data msg (1 KB)            34 usec
  data msg (4 KB)            90 usec
MPT translation              7 usec

Message sizes directly influence latency

The most compute-demanding operation: minipage translation, 7 usec

In relaxed consistency systems, protocol operations might take hundreds of usecs
Example: run-length diff for a 4KB page: 250 usec


Scalability (IB vs. VIA interconnects, 2003)

[Figure: application speedups on 8 nodes. Series: VIA/ServerNet - 1 thread, Kernel/IB - 1 thread, Kernel/IB - 2 threads.]
