Page 1:

Uppsala University, Information Technology

Department of Computer Systems, Uppsala Architecture Research Team [UART]

Removing the Overhead from Software-Based Shared Memory

Zoran Radovic and Erik Hagersten, {zoranr, eh}@it.uu.se

Page 2:

Supercomputing 2001 Uppsala Architecture Research Team (UART)

Problems with Traditional SW-DSMs

Page-sized coherence unit
• False Sharing! [e.g., Ivy, Munin, TreadMarks, Cashmere-2L, GeNIMA, …]

Protocol agent messaging is slow
• Most efficiency lost in interrupt/poll

[Figure: two nodes, each with CPUs, Mem, and a Prot. agent; one CPU issues LD x]
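The false-sharing problem of a page-sized coherence unit comes down to address arithmetic: two unrelated variables collide whenever they fall into the same unit. The following is a minimal sketch of that check. The 8 KB page size, the 64-byte line size, and the function name are assumptions chosen for illustration; the slides do not state them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed sizes for illustration only; the slides do not state them. */
#define PAGE_SIZE 8192u
#define LINE_SIZE 64u

/* Returns true when two addresses fall into the same coherence unit.
 * unit_size must be a power of two. */
bool same_coherence_unit(uintptr_t a, uintptr_t b, uintptr_t unit_size)
{
    assert(unit_size != 0 && (unit_size & (unit_size - 1)) == 0);
    return (a / unit_size) == (b / unit_size);
}
```

With these assumed sizes, addresses 0x10000 and 0x10400 share a page but not a cache line, so only a page-grained protocol would treat writes to them as conflicting.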

Page 3:

Our proposal: DSZOOM

Run entire protocol in requesting-processor
• No protocol agent communication!

Assumes user-level remote memory access
• put, get, and atomics [InfiniBand℠]

Fine-grain access-control checks [e.g., Shasta, Blizzard-S, Sirocco-S]

[Figure: two nodes, each with CPUs, Mem, and DIR; labels: Protocol, LD x, "atomic, get/put" on DIR, "get" on Mem]

Page 4:

Outline

• Motivation
• General DSZOOM Overview
• DSZOOM-WF Implementation Details
• Experimentation Environment
• Performance Results
• Conclusions

Page 5:

DSZOOM Cluster

DSZOOM Nodes:
• Each node consists of an unmodified SMP multiprocessor
• SMP hardware keeps coherence among the caches and the memory within each SMP node

DSZOOM Cluster Network:
• Non-coherent cluster interconnect
• Inexpensive user-level remote memory access
• Remote atomic operations [e.g., InfiniBand℠]

Page 6:

Squeezing Protocols into Binaries …

Static Binary Instrumentation

EEL: Machine-independent Executable Editing Library, implemented in C++
• Instrument global LOADs with snippets containing fine-grain access control checks
• Instrument global STOREs with MTAG snippets
• Insert calls to coherence protocols implemented in C

Page 7:

Fine-grain Access Control Checks

The “magic” value is a small integer corresponding to an IEEE floating-point NaN [e.g., Blizzard-S, Sirocco-S]

Floating-point load example:

    ld      [address], %reg      // original LOAD
    fcmps   %fcc0, %reg, %reg    // compare reg with itself
    fbe,pt  %fcc0, hit           // if (reg == reg) goto hit
    nop
    // Call global coherence load routine
  hit:

[Figure label: Coherence Protocols (C-code)]
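The self-comparison trick works because a NaN never compares equal to itself, so a single compare-and-branch separates ordinary data from the magic value. The following is a rough C analogue of the snippet, not the DSZOOM code. The constant `MAGIC_BITS` and the functions `coherence_load`, `invalidate`, and `checked_load` are hypothetical names; the bit pattern 0xFFFFFFFF is an assumption, chosen because it reads as the integer -1 and as an IEEE 754 single-precision NaN.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic pattern (assumption): 0xFFFFFFFF is the integer -1 and,
 * read as an IEEE 754 single, a NaN (exponent all ones, fraction nonzero). */
#define MAGIC_BITS 0xFFFFFFFFu

int protocol_calls = 0;   /* how often the slow path ran */

/* Hypothetical stand-in for the global coherence load routine: it installs a
 * caller-supplied value in place of the magic pattern and returns it. */
float coherence_load(float *addr, float fresh_value)
{
    protocol_calls++;
    *addr = fresh_value;
    return *addr;
}

/* Mark a word invalid by storing the magic bit pattern into it. */
void invalidate(float *addr)
{
    uint32_t bits = MAGIC_BITS;
    memcpy(addr, &bits, sizeof bits);
}

/* Instrumented floating-point load: C analogue of ld / fcmps / fbe. */
float checked_load(float *addr, float fresh_value)
{
    float reg = *addr;          /* original LOAD */
    if (reg == reg)             /* a NaN never compares equal to itself */
        return reg;             /* hit: fast path, no protocol call */
    return coherence_load(addr, fresh_value);   /* possible miss */
}
```

The fast path costs one compare and one branch. A genuine NaN in application data would also take the slow path, so a real coherence routine has to tell the two cases apart; that step is omitted here.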

Page 8:

Blocking Directory Protocols

Originally proposed to simplify the design and verification of HW-DSMs
• Eliminates race conditions

DSZOOM implements a distributed version of a blocking protocol

[Figure: Node 0 with G_MEM and the distributed DIR, one DIR_ENTRY per cache line; each DIR_ENTRY holds a LOCK bit and presence bits. Presence bits before MEM_STORE: 0 1 1 0 1 0 0 0; after MEM_STORE: 0 0 0 0 0 0 0 1]
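The directory entry in the figure can be modeled as one word that is locked with an atomic fetch-and-set and released with a plain write. The following is a minimal sketch using C11 atomics on local memory. The bit layout (LOCK in bit 8, node i in presence bit i) and the function names are assumptions; the slide shows only the LOCK field and eight presence bits.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

/* Assumed layout (illustrative): bit 8 is LOCK, bits 0..7 are presence bits,
 * with node i represented by bit i. */
#define LOCK_BIT      0x100u
#define PRESENCE_MASK 0x0FFu

typedef _Atomic uint32_t dir_entry_t;

/* Blocking acquire: repeat fetch-and-set until the LOCK bit was clear.
 * Returns the presence bits observed when the lock was taken. */
uint32_t dir_lock(dir_entry_t *e)
{
    uint32_t old;
    do {
        old = atomic_fetch_or(e, LOCK_BIT);
    } while (old & LOCK_BIT);
    return old & PRESENCE_MASK;
}

/* Release: write the new presence bits with the LOCK bit cleared. */
void dir_unlock(dir_entry_t *e, uint32_t presence)
{
    atomic_store(e, presence & PRESENCE_MASK);
}

/* MEM_STORE by `node`: afterwards the writer is the only sharer.
 * Returns the set of other nodes whose copies would need invalidating. */
uint32_t mem_store(dir_entry_t *e, unsigned node)
{
    assert(node < 8);
    uint32_t sharers = dir_lock(e);
    uint32_t writer = 1u << node;
    /* ...invalidations of the other sharers would be issued here... */
    dir_unlock(e, writer);
    return sharers & ~writer;
}
```

Starting from the presence bits 0 1 1 0 1 0 0 0 (0x68), a store by node 0 leaves 0 0 0 0 0 0 0 1 (0x01), matching the before and after states in the figure under the assumed bit order.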

Page 9:

Global Coherency Action

Read data from home node: 2-hop read

[Figure: the Requestor issues LD x and sends 1a. f&s to the DIR and 1b. get to Mem, receives the data, then sends 2. put. Legend: small packet (~10 bytes); large packet (~68 bytes); message on the critical path; message off the critical path]
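The 2-hop read can be sketched as a single function that runs entirely on the requesting node. This is a single-threaded simulation, not the DSZOOM implementation: the remote primitives are modeled as local operations, and the type `home_line_t`, the function names, the 64-byte line size, and the DIR_ENTRY layout are all assumptions. The sketch also assumes that 1a and 1b are issued together and that the fetched data is discarded if the entry turns out to be locked.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u      /* assumed cache-line size */
#define LOCK_BIT   0x100u   /* assumed layout: LOCK bit above 8 presence bits */

/* State of one cache line on its home node. */
typedef struct {
    uint32_t dir;               /* DIR_ENTRY: lock bit + presence bits */
    uint8_t  mem[LINE_BYTES];   /* home copy of the data */
} home_line_t;

/* Remote primitives modeled as local operations (single-threaded
 * simulation, so no real atomicity is needed here). */
uint32_t remote_fetch_and_set(uint32_t *word)
{
    uint32_t old = *word;
    *word = old | LOCK_BIT;
    return old;
}
void remote_get(void *dst, const void *src, size_t n) { memcpy(dst, src, n); }
void remote_put(uint32_t *word, uint32_t value)       { *word = value; }

/* 2-hop read executed by requesting node `me`. Returns 0 on success and
 * -1 if the entry was already locked (a real protocol would retry). */
int read_from_home(home_line_t *home, unsigned me, uint8_t *local_copy)
{
    uint32_t old = remote_fetch_and_set(&home->dir);          /* 1a. f&s */
    remote_get(local_copy, home->mem, LINE_BYTES);            /* 1b. get */
    if (old & LOCK_BIT)
        return -1;              /* someone else holds the entry */
    remote_put(&home->dir, (old | (1u << me)) & ~LOCK_BIT);   /* 2. put  */
    return 0;
}
```

The final put both records the requestor as a sharer and releases the lock, so no code runs on the home node at any point.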

Page 10:

Global Coherency Action

Read data modified in a third node: 3-hop read

[Figure: the Requestor issues LD x; messages: 1. f&s (DIR), 2a. f&s and 2b. get of the data (Mem, MTAG), 3a. put, 3b. put]
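One plausible reading of the 3-hop read is sketched below as a single-threaded simulation. The mapping of the numbered messages to targets is an assumption: 1 locks the home DIR, 2a locks the owner's MTAG, 2b fetches the data from the owner, 3a downgrades the owner's MTAG, and 3b updates and releases the DIR. The MTAG states, the type `node_line_t`, and all function names are hypothetical, and any write-back to the home memory is left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64u      /* assumed cache-line size */
#define LOCK_BIT   0x100u   /* assumed lock bit in both DIR and MTAG words */

enum { MTAG_SHARED = 1, MTAG_MODIFIED = 2 };   /* hypothetical MTAG states */

typedef struct {
    uint32_t dir;               /* meaningful on the home node only */
    uint32_t mtag;              /* state of this node's copy of the line */
    uint8_t  mem[LINE_BYTES];
} node_line_t;

/* Fetch-and-set modeled locally (single-threaded simulation). */
uint32_t fetch_and_set(uint32_t *w)
{
    uint32_t old = *w;
    *w = old | LOCK_BIT;
    return old;
}

/* Index of the single presence bit set in `bits`, or -1 otherwise. */
int sole_owner(uint32_t bits)
{
    if (bits == 0 || (bits & (bits - 1)) != 0)
        return -1;
    int i = 0;
    while (!(bits & 1u)) { bits >>= 1; i++; }
    return i;
}

/* 3-hop read by node `me`. Returns the previous owner, or -1 if the line
 * was locked or not held by exactly one node. */
int read_from_owner(node_line_t nodes[], unsigned home, unsigned me)
{
    uint32_t dir = fetch_and_set(&nodes[home].dir);         /* 1.  f&s DIR  */
    if (dir & LOCK_BIT)
        return -1;
    int owner = sole_owner(dir);
    if (owner < 0) { nodes[home].dir = dir; return -1; }    /* release DIR  */
    uint32_t tag = fetch_and_set(&nodes[owner].mtag);       /* 2a. f&s MTAG */
    if (tag & LOCK_BIT) { nodes[home].dir = dir; return -1; }
    memcpy(nodes[me].mem, nodes[owner].mem, LINE_BYTES);    /* 2b. get data */
    nodes[owner].mtag = MTAG_SHARED;                        /* 3a. put MTAG */
    nodes[me].mtag = MTAG_SHARED;                           /* local update */
    nodes[home].dir = dir | (1u << me);                     /* 3b. put DIR  */
    return owner;
}
```

As in the 2-hop case, every step is initiated by the requestor; the home and owner nodes only have their memory read or written remotely.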

Page 11:

Compilation Process

[Figure: flow diagram with the boxes: Unmodified SPLASH-2 Application; DSZOOM-WF Implementation of PARMACS Macros; m4; GNU gcc; (Un)executable a.out; EEL; Coherence Protocols (C-code); DSZOOM-WF Run-Time Library]

Page 12:

Instrumentation Performance

Program         Problem Size                       %LD    %ST    Instrumentation Overhead
FFT             1,048,576 points (48.1 MB)         19.0   16.5   1.38
LU-Cont         1024×1024, block 16 (8.0 MB)       15.5    9.4   1.59
LU-Non-Cont     1024×1024, block 16 (8.0 MB)       16.7   11.1   1.50
Radix           4,194,304 items (36.5 MB)          15.6   11.6   1.13
Barnes-Hut      16,384 bodies (32.8 MB)            23.8   31.1   1.03
FMM             32,768 particles (8.1 MB)          17.5   13.6   1.06
Ocean-Cont      514×514 (57.5 MB)                  27.0   23.9   1.34
Ocean-Non-Cont  258×258 (22.9 MB)                  11.6   28.0   1.24
Radiosity       Room (29.4 MB)                     26.3   27.2   1.07
Raytrace        Car (32.2 MB)                      19.0   18.1   1.21
Water-nsq       2,197 mols., 2 steps (2.0 MB)      13.4   16.2   1.06
Water-sp        2,197 mols., 2 steps (1.5 MB)      15.7   13.9   1.09
Average                                            18.4   18.3   1.22
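The overhead column can be summarized with a few lines of arithmetic. The sketch below uses the twelve overhead values from the table and computes their minimum, maximum, and mean; the type and function names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Instrumentation overheads from the table, in row order (FFT ... Water-sp). */
const double overhead[] = {1.38, 1.59, 1.50, 1.13, 1.03, 1.06,
                           1.34, 1.24, 1.07, 1.21, 1.06, 1.09};
const size_t n_programs = sizeof overhead / sizeof overhead[0];

typedef struct { double min, max, mean; } summary_t;

/* Minimum, maximum, and arithmetic mean of n values (n must be > 0). */
summary_t summarize(const double *v, size_t n)
{
    assert(n > 0);
    summary_t s = { v[0], v[0], 0.0 };
    for (size_t i = 0; i < n; i++) {
        if (v[i] < s.min) s.min = v[i];
        if (v[i] > s.max) s.max = v[i];
        s.mean += v[i];
    }
    s.mean /= (double)n;
    return s;
}
```

The minimum (1.03) and maximum (1.59) correspond to the 3–59% checking-overhead range quoted in the Conclusions, and the mean of about 1.225 agrees with the 1.22 in the Average row.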

Page 13:

Instrumentation Breakdown, Sequential Execution

[Figure: bar chart, 0% to 100%, one bar per program (FFT, LU-c, LU-nc, Radix, Barnes, FMM, Ocean-c, Ocean-nc, Radiosity, Raytrace, Water-nsq, Water-sp); legend: f-p-ST-snippet, int-ST-snippet, f-p-LD-snippet, int-LD-snippet, E6000 seq.]

Page 14:

Current DSZOOM Hardware

Two E6000 connected through a hardware-coherent interface (Sun-WildFire) with a raw bandwidth of 800 MB/s in each direction
• Data migration and coherent memory replication (CMR) are kept inactive

16 UltraSPARC II (250 MHz) CPUs per node and 8 GB memory
• Memory access times: 330 ns local / 1700 ns remote (lmbench latency)

Run as 16-way SMP, 2×8 CC-NUMA, and 2×8 SW-DSM

Page 15:

Process and Memory Distribution

[Figure: Cabinet 1 and Cabinet 2 each run several processes created with fork and bound with pset_bind. Every process has its own Stack, Text & Data, Heap, and PRIVATE_DATA, plus a G_MEM region at 0x80000000 made up of Cabinet_1_G_MEM and Cabinet_2_G_MEM. The two regions are shared-memory segments created with shmget (shmid = A in Physical Memory Cabinet 1, shmid = B in Physical Memory Cabinet 2) and attached with shmat; the mapping is labeled ”Aliasing”]
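The shmget, shmat, and fork calls named in the figure can be combined roughly as follows. This is a minimal POSIX sketch, not the DSZOOM-WF setup code: the function name and segment size are assumptions, the segments are attached at addresses chosen by the kernel rather than at 0x80000000, and the Solaris-specific pset_bind step is omitted.

```c
#include <assert.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Create two System V segments (standing in for Cabinet_1_G_MEM and
 * Cabinet_2_G_MEM), fork a child that writes one word into each, and
 * return the sum the parent reads back. Returns -1 on any failure. */
int shared_segments_demo(void)
{
    int id_a = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    int id_b = shmget(IPC_PRIVATE, 4096, IPC_CREAT | 0600);
    if (id_a < 0 || id_b < 0) {
        if (id_a >= 0) shmctl(id_a, IPC_RMID, NULL);
        if (id_b >= 0) shmctl(id_b, IPC_RMID, NULL);
        return -1;
    }

    int *a = shmat(id_a, NULL, 0);
    int *b = shmat(id_b, NULL, 0);
    /* Mark both segments for removal; they live until the last detach. */
    shmctl(id_a, IPC_RMID, NULL);
    shmctl(id_b, IPC_RMID, NULL);
    if (a == (void *)-1 || b == (void *)-1)
        return -1;

    *a = 0;
    *b = 0;
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {             /* child: attachments are inherited */
        *a = 40;
        *b = 2;
        _exit(0);
    }
    waitpid(pid, NULL, 0);      /* parent: wait, then read the child's writes */
    int sum = *a + *b;
    shmdt(a);
    shmdt(b);
    return sum;
}
```

Because both processes map the same physical segments, the parent observes the values written by the child, which is the property the G_MEM layout in the figure relies on.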

Page 16:

Results (1): Execution Times in Seconds (16 CPUs)

[Figure: bar chart; y-axis: Execution time [seconds], 0 to 10; one group per program (FFT, LU-c, LU-nc, Radix, Barnes, FMM, Ocean-c, Ocean-nc, Radiosity, Raytrace, Water-nsq, Water-sp); series: E6000 16 CPUs, CC-NUMA 2x8, DSZOOM-WF 1x16, DSZOOM-EMU 2x8, DSZOOM-WF 2x8; configuration icons labeled HW, SW, and EEL]

Page 17:

Results (2): Normalized Execution Time Breakdowns (16 CPUs)

[Figure: bar chart, 0% to 100%, one bar per program (FFT, LU-c, LU-nc, Radix, Barnes, FMM, Ocean-c, Ocean-nc, Radiosity, Raytrace, Water-nsq, Water-sp); legend: Store, Load, Locks, Barriers, ILC, Task; configuration icon labeled SW, EEL]

Page 18:

Conclusions

DSZOOM completely eliminates asynchronous messaging between protocol agents

Consistently competitive and stable performance in spite of high instrumentation overhead
• 30% slowdown compared to hardware
• State-of-the-art checking overheads are in the range of 5–35% (e.g., Shasta); DSZOOM: 3–59%

Page 19:

DSZOOM’s Home Page

http://www.it.uu.se/research/group/uart