Designing High Performance DSM Systems using InfiniBand...

Designing High Performance DSM Systems using InfiniBand Features Ranjit Noronha and Dhabaleswar K. Panda The Ohio State University NBC


Page 1: Designing High Performance DSM Systems using InfiniBand ...nowlab.cse.ohio-state.edu/static/media/publications/slide/noronha... · Evaluation zMicro-benchmarks (modified from TreadMarks

Designing High Performance DSM Systems using InfiniBand Features

Ranjit Noronha and Dhabaleswar K. Panda

The Ohio State University

NBC


Outline

• Introduction
• Motivation
• Design and Implementation
• Results
• Conclusions
• Future Work


Introduction

Software DSM

HLRC/VIA (Rutgers), TreadMarks (Rice), JIAJIA (ICT China)

Depends on user and software layer

Depends on communication protocols provided by the system such as TCP, UDP, etc.

Degraded performance because of false sharing and high overhead of communication

Has scaling problems


Introduction

Modern Interconnects (InfiniBand, Myrinet, Quadrics)

Low Latency (InfiniBand 5.0 µs)

High Bandwidth (InfiniBand 4X up to 10 Gbps)

Programmable NIC

User Level Protocols (VAPI, GM)

Can deliver performance close to that of the underlying hardware

RDMA Write/Read, Atomic Operations, Service Levels, Multicast


Motivation

Traditional DSM
• Uses Request/Response communication model (asynchronous)
• Separate signal handler thread needed
• Application processing interrupted
• Cache effects

Can network-based features be used to reduce interrupt overhead?

[Figure: node 0 sends a REQ to node 1; node 1 takes an interrupt, receives the REQ, processes it, and sends a RES back.]


Motivation

Asynchronous communication model
• Use network features to achieve the same effect (synchronous/hybrid communication model)

Potential Advantages
• Partial offload of protocol to network
• More application processing time
• Reduced copying
• Better caching

Potential Disadvantages
• Longer protocol execution time
• Ordering problems
• Consistency issues


Outline

• Introduction
• Motivation
• Design and Implementation
• Results
• Conclusions
• Future Work


Preliminaries

RDMA
• Remote Direct Memory Access
• Allows access to memory on a remote node
• No involvement from the remote node
• RDMA Write
• RDMA Read


RDMA Write Example

[Figure: hosts A and B, each attached to a NIC; buffer X appears on both hosts, illustrating an RDMA Write.]


RDMA Read Example

[Figure: hosts A and B, each attached to a NIC; page P appears on both hosts, illustrating an RDMA Read.]


Preliminaries - Remote Atomic Operations

Remote Atomic Operations
• Compare and Swap (CMP_AND_SWAP): conditionally change a location on a remote machine atomically
• Fetch and Add
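The semantics of the two atomic operations listed above can be sketched in a few lines of Python. This is only a local model under stated assumptions: `RemoteWord` is a hypothetical class standing in for one word of remote memory, and the lock stands in for the atomicity that the network hardware would provide. It is not the VAPI interface.

```python
import threading


class RemoteWord:
    """Hypothetical stand-in for one word in a remote node's memory."""

    def __init__(self, value=0):
        self._value = value
        # The lock models the atomicity provided by the network hardware.
        self._lock = threading.Lock()

    def compare_and_swap(self, compare, swap):
        """Store `swap` only if the word equals `compare`; return the old value."""
        with self._lock:
            old = self._value
            if old == compare:
                self._value = swap
            return old

    def fetch_and_add(self, delta):
        """Add `delta` to the word; return the value it held before the add."""
        with self._lock:
            old = self._value
            self._value = old + delta
            return old

    def read(self):
        with self._lock:
            return self._value


word = RemoteWord(5)
first = word.compare_and_swap(5, 9)    # matches, so the word becomes 9
second = word.compare_and_swap(5, 7)   # no match, so the word stays 9
before_add = word.fetch_and_add(3)     # the word becomes 12
```

In this model the caller learns whether a compare-and-swap succeeded by checking whether the returned old value equals the compare value it supplied.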


Remote Atomic Operations Example

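[Figure: Compare and Swap between hosts A and B through their NICs. The values shown are Z and S on the requesting side and Y at the remote location, with the test "Z == Y ?".]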


Preliminaries - HLRC

HLRC/VIA (Rutgers)
• Home Based Lazy Release Consistency Model
• Page Based DSM System

Basic Operations
• Page
• Diff
• Lock

Uses interrupts; referred to as ASYNC


HLRC Programming Example

Acquire_Lock(L1)
X = X * 2
Release_Lock(L1)

Acquire_Lock(L1)
X = X + 1
Release_Lock(L1)

[Figure: timeline for nodes A and B showing the two critical sections above.]

• Initial value of X = 0

• B is home node for page P containing X

Read page P (containing X) from B

Send diffs for P to B
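The example above can be traced with a small Python model of a home-based protocol. This is a sketch under assumptions, not the HLRC/VIA code: the class and function names are hypothetical, pages are modeled as dictionaries, the lock is modeled only by running the two critical sections one after the other, and the slide does not say which node runs which update or in what order, so the order used here (home node B increments first, then node A doubles) is an arbitrary choice.

```python
class HomeNode:
    """Hypothetical home node holding the master copy of its pages."""

    def __init__(self, pages):
        self.pages = {pid: dict(contents) for pid, contents in pages.items()}

    def fetch_page(self, pid):
        # A non-home node reads a copy of the page from the home.
        return dict(self.pages[pid])

    def apply_diff(self, pid, diff):
        # The home merges the modified entries into its master copy.
        self.pages[pid].update(diff)


def make_diff(twin, page):
    """Return only the entries that changed relative to the twin."""
    return {key: value for key, value in page.items() if twin.get(key) != value}


def remote_critical_section(home, pid, update):
    """Non-home node: fetch the page, twin it, update it, send the diff home."""
    page = home.fetch_page(pid)       # "Read page P (containing X) from B"
    twin = dict(page)                 # unmodified copy kept for diffing
    update(page)
    diff = make_diff(twin, page)
    home.apply_diff(pid, diff)        # "Send diffs for P to B"
    return diff


def home_critical_section(home, pid, update):
    """Home node: update the master copy directly, so no diff is needed."""
    update(home.pages[pid])


def double_x(page):
    page["X"] = page["X"] * 2


def increment_x(page):
    page["X"] = page["X"] + 1


node_b = HomeNode({"P": {"X": 0}})                   # B is home for page P
home_critical_section(node_b, "P", increment_x)      # assumed to run first
diff_from_a = remote_critical_section(node_b, "P", double_x)
final_x = node_b.pages["P"]["X"]
print(final_x)  # 2 with the order assumed above
```

With the opposite order (A doubles first, then B increments) the same model would end with X = 1, which is why the lock ordering matters in the example.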


HLRC Design

[Figure: HLRC is layered on ASYNC, which provides the Page, Diff, and Lock operations.]


Our Design

Design consists of 2 protocols
• ARDMAR (Atomic and RDMA Read)
• DRAW (Diff using RDMA Write)

• ARDMAR is a synchronous protocol
• DRAW is a hybrid protocol
• NEWGENDSM = ARDMAR + DRAW


NEWGENDSM

[Figure: HLRC layered on ASYNC (Page, Diff, Lock), shown next to HLRC layered on NEWGENDSM (Page via ARDMAR, Diff via DRAW, Lock).]


ASYNC (page fetch)

[Figure: message diagram among nodes A, B, and C, marking the default home and the home for page 2. Labels shown: REQ, RES, HOME, PAGE.]


ARDMAR (Atomic and RDMA Read)

[Figure: nodes A, B, and C, marking the home for page 2. An entry changes from "--" to "B". Operations shown: CMP AND SWAP (twice) and RDMA READ.]
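One possible reading of the ARDMAR figure is sketched below in Python. It is an interpretation, not the authors' implementation: it assumes that a home-directory entry starts out unassigned ("--"), that a node claims or looks up the home with a compare-and-swap on that entry, and that a node which finds another home then pulls the page with an RDMA Read. All names (`HomeDirectory`, `Node`, `rdma_read`, `fetch_page`, `NO_HOME`) are hypothetical, and the ordering and consistency issues mentioned earlier are ignored.

```python
NO_HOME = -1  # hypothetical marker for an unassigned entry ("--" in the figure)


class HomeDirectory:
    """Hypothetical table mapping each page to its home node id."""

    def __init__(self, num_pages):
        self.entries = [NO_HOME] * num_pages

    def compare_and_swap(self, page, compare, swap):
        # Models a remote atomic: store `swap` only if the entry equals `compare`.
        old = self.entries[page]
        if old == compare:
            self.entries[page] = swap
        return old


class Node:
    def __init__(self, node_id, num_pages, page_size=8):
        self.node_id = node_id
        self.memory = {p: bytearray(page_size) for p in range(num_pages)}


def rdma_read(target, page):
    # Models an RDMA Read: copies remote memory without running target code.
    return bytes(target.memory[page])


def fetch_page(requester, directory, nodes, page):
    """Return (home_id, page_bytes) using one atomic and at most one read."""
    old = directory.compare_and_swap(page, NO_HOME, requester.node_id)
    if old == NO_HOME:
        # The entry was unassigned, so the requester has just become the home.
        return requester.node_id, bytes(requester.memory[page])
    data = rdma_read(nodes[old], page)
    requester.memory[page][:] = data
    return old, data


nodes = {i: Node(i, num_pages=4) for i in range(3)}   # 0 = A, 1 = B, 2 = C
directory = HomeDirectory(num_pages=4)

home_seen_by_b, _ = fetch_page(nodes[1], directory, nodes, page=2)  # B claims page 2
nodes[1].memory[2][:] = b"B-page-2"                                 # B writes its page
home_seen_by_a, data_at_a = fetch_page(nodes[0], directory, nodes, page=2)
```

In this sketch neither call runs any handler code on the home node, which is the property the synchronous protocol is after.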


NEWGENDSM

[Figure: HLRC layered on ASYNC (Page, Diff, Lock), shown next to HLRC layered on NEWGENDSM (Page via ARDMAR, Diff via DRAW, Lock).]


ASYNC (diff)

[Figure: message diagram between nodes A and B for pages P1 and P2. Labels shown: DIFF (P1), ACK (P1), TIMESTAMP (P1), DIFF (P2), ACK (P2), TIMESTAMP (P2).]


DRAW

[Figure: message diagram between nodes A and B for pages P1 and P2. Labels shown: RDMA WRITE DIFF (P1), RDMA WRITE DIFF (P2), TIMESTAMP (P1 and P2).]


Outline

• Introduction
• Motivation
• Design and Implementation
• Results
• Conclusions
• Future Work


Experimental Setup

• HLRC/VIA (Rutgers) modified to work with VAPI
• InfiniScale MT43132 Eight 4X switch
• Mellanox InfiniHost MT23108 DualPort 4X HCAs
• SuperMicro SUPER P4DL6
  • Dual Pentium Xeon 2.4 GHz
  • 512 MB memory
  • 133 MHz PCI-X bus
• Linux 2.4.7-10 SMP kernel


Evaluation

Micro-benchmarks (modified from TreadMarks suite)
• Page: average time to fetch a page from a home node when a number of nodes are accessing it
• Diff: measures Compute Time and Apply Time for a small diff (single word) and a large diff (entire page)

Applications from SPLASH-2 suite (Barnes, TSP, 3Dfft, Radix)

Application   Parameter        Size
Barnes        Bodies           32678
3Dfft         Grid size        128
Radix         Number of keys   2621440
TSP           Tour size        20 (large)
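The two quantities in the Diff micro-benchmark described above, compute time and apply time, can be illustrated with a local Python sketch. It is not the TreadMarks-derived benchmark itself: the 4096-byte page, the 4-byte word, and the function names are assumptions, and the timings cover only local computation, not network transfer.

```python
import time

PAGE_SIZE = 4096  # assumed page size in bytes
WORD_SIZE = 4     # assumed word size in bytes


def compute_diff(twin, page):
    """Compare the page with its twin word by word; return (offset, word) pairs."""
    diff = []
    for offset in range(0, len(page), WORD_SIZE):
        word = page[offset:offset + WORD_SIZE]
        if word != twin[offset:offset + WORD_SIZE]:
            diff.append((offset, bytes(word)))
    return diff


def apply_diff(target, diff):
    """Write each modified word into the target copy of the page."""
    for offset, word in diff:
        target[offset:offset + len(word)] = word


def time_diff(twin, page, home_copy):
    """Return the diff plus the compute and apply times in seconds."""
    start = time.perf_counter()
    diff = compute_diff(twin, page)
    computed = time.perf_counter()
    apply_diff(home_copy, diff)
    applied = time.perf_counter()
    return diff, computed - start, applied - computed


twin = bytearray(PAGE_SIZE)

small_page = bytearray(twin)
small_page[0:WORD_SIZE] = b"\x01" * WORD_SIZE     # single modified word

large_page = bytearray(b"\xff" * PAGE_SIZE)        # every word modified

results = {}
for label, page in (("small", small_page), ("large", large_page)):
    home_copy = bytearray(twin)
    diff, compute_s, apply_s = time_diff(twin, page, home_copy)
    results[label] = (len(diff), home_copy == page)
    print(f"{label}: {len(diff)} words, compute {compute_s:.6f} s, apply {apply_s:.6f} s")
```

The small case produces a diff of one word and the large case a diff covering the whole page, matching the two cases named in the micro-benchmark description.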


Microbenchmarks (Page)

[Figure: Page microbenchmark. Page fetch time (usec) versus number of nodes (2 to 8), comparing ASYNC and ARDMAR.]

• Page fetch time with ARDMAR is lower than with ASYNC at 8 nodes


Microbenchmarks (Diff)

[Figure: Diff microbenchmark. Time (milliseconds) for each diff component (Compute (Small), Apply (Small), Compute (Large), Apply (Large)), comparing ASYNC and DRAW.]

• DRAW performs better than ASYNC in all cases


Application Speedup

[Figure: Application speedup on 8 nodes for Barnes, TSP, 3Dfft, and Radix, comparing ASYNC, ARDMAR, DRAW, and NEWGENDSM.]

• Speedup w.r.t. sequential running times
• Radix: NEWGENDSM speedup is 1.63 times that of ASYNC
• Barnes: NEWGENDSM speedup is 1.59 times that of ASYNC


Breakdown

• Diff time is a part of Barrier Compute Time
• Page time reduced significantly


Asynchronous Handler Time

• Asynchronous handler time substantially reduced for Barnes and 3Dfft

[Figure: Asynchronous handler time (milliseconds) for Barnes, TSP, 3Dfft, and Radix, comparing ASYNC, ARDMAR, DRAW, and NEWGENDSM.]


Conclusions

• Explored reducing asynchronous protocol processing time
• Used network features like RDMA Read/Write and atomic operations
• Incorporated in a protocol, NEWGENDSM
• Microbenchmark and application-level evaluation
• Improvement in parallel speedup of up to 1.63 times over ASYNC


Future Work

• Exploit small message latency to implement “critical word first”
• RDMA Read for “early restart”
• Atomic operations for locking
• Migrating home protocol


Web Pointers

http://nowlab.cis.ohio-state.edu/

E-mail: {noronha, panda}@cis.ohio-state.edu

NBC home page


Breakdown

• Page time reduced for Barnes