Designing High Performance DSM Systems using InfiniBand...

Designing High Performance DSM Systems using InfiniBand Features

Ranjit Noronha and

Dhabaleswar K. PandaThe Ohio State University

OutlineIntroductionMotivationDesign and ImplementationResults ConclusionsFuture Work

IntroductionSoftware DSM

HLRC/VIA (Rutgers), TreadMarks (Rice), JIAJIA (ICT China)

Depends on user and software layer

Depends on communication protocols provided by the system such as TCP, UDP, etc.

Degraded performance because of false sharing and high overhead of communication

Has scaling problems

Introduction

Modern Interconnects (InfiniBand, Myrinet, Quadrics)

Low Latency (InfiniBand 5.0 µs)

High Bandwidth (InfiniBand 4X upto 10 Gbps)

Programmable NIC

User Level Protocols (VAPI, GM)

Can deliver performance close to that of the underlying hardware

RDMA Write/Read, Atomic Operations, Service Levels, Multicast

MotivationTraditional DSM

Uses Request / Response Communication Model (asynchronous)Separate signal handler thread neededApplication Processing interruptedCache Effects

Can network based features be used to reduce interrupt overhead ?

0 1Send REQ Interrupt

Process

Send RESRecv REQ

MotivationAsynchronous communication model

Use network features to achieve the same effect (synchronous/hybrid communication model)

Potential AdvantagesPartial offload of protocol to networkMore application processing timeReduced CopyingBetter caching

Potential DisadvantagesLonger protocol execution time Ordering problemsConsistency Issues

Outline

IntroductionMotivationDesign and ImplementationResults ConclusionsFuture Work

Preliminaries

RDMARemote Direct Memory AccessAllows access to memory on a remote nodeNo involvement from the remote nodeRDMA WriteRDMA Read

RDMA Write Example

NIC NIC

A BHost HostX X

RDMA Read Example

NIC NIC

A BHost Host PP

Preliminaries - Remote Atomic Operations

Remote Atomic OperationsCompare and Swap (CMP_AND_SWAP)

Conditionally change a location on a remote machine atomically

Fetch and Add

Remote Atomic Operations Example

NIC NIC

A BHost Host Y

• Compare and Swap

Z == Y ?

Preliminaries - HLRCHLRC/VIA (Rutgers)

Home Based Lazy Release Consistency ModelPage Based DSM System

Basic OperationsPageDiffLock

Use interrupts Referred to as ASYNC

HLRC Programming Example

Acquire_Lock (L1)X=X * 2Release_Lock(L1)

Acquire_Lock (L1)

X = X + 1

Release_Lock(L1)Time

•Initial value of X = 0

•B is home node for page P containing X

Read page P (containing X) from B

Send diffs for P to B

HLRC Design

Page Diff Lock

Our Design

Design consists of 2 protocolsARDMAR (Atomic and RDMA Write)DRAW (Diff using RDMA Write)

ARDMAR is a synchronous protocolDRAW is a hybrid protocolNEWGENDSM = ARDMAR + DRAW

NEWGENDSM

Page Diff Lock Page (ARDMAR)

Diff (DRAW)

NEGENDSM

ASYNC (page fetch)A B C

DEFAULTHome for page 2

ARDMAR (Atomic and RDMA Write)

CMP AND SWAP

RDMA READ

Home for page 2

NEWGENDSM

Page Diff Lock Page (ARDMAR)

Diff (DRAW)

NEGENDSM

ASYNC (diff)A B

DIFF (P1)

ACK (P1)

DIFF (P2)

TIMESTAMP (P1)

TIMESTAMP (P2)

ACK (P2)

DRAWA B

RDMA WRITE DIFF (P1)

RDMA WRITE DIFF (P2)

TIMESTAMP (P1 and P2)

Outline

IntroductionMotivationDesign and ImplementationResultsConclusionsFuture Work

Experimental SetupHLRC/ VIA (Rutgers) modified to work with VAPI InfiniScale MT43132 Eight 4X switchMellanox InfiniHost MT23108 DualPort 4X HCA’s SuperMicro SUPER P4DL6

Dual Pentium Xeon 2.4 GHz512 MB memory133 MHz PCI-X bus

Linux 2.4.7-10 SMP kernel

EvaluationMicro-benchmarks (modified from TreadMarks suite)

Page Average time to fetch a page from a home node when a number of nodes are accessing it

Diff Measure Compute Time and Apply TimeSmall diff (single word) and Large diff (entire page)

Applications from SPLASH-2 suite (Barnes, TSP, 3Dfft, Radix)

20 (large)Tour sizeTSP

2621440Number of keysRadix128Grid size3Dfft32678BodiesBarnesSizeParameterApplication

Microbenchmarks (Page)

2 3 4 5 6 7 8

Number of nodes

Page microbenchmark

ASYNCARDMAR

• Page fetching in ARDMAR is lower than ASYNC at 8 nodes

Microbenchmarks (Diff)

01020304050

Compute(Small)

Apply(Small)

Compute(Large)

Apply(Large)

Diff Component

ASYNC DRAW

• DRAW performs better than ASYNC in all cases

Application Speedup

Barnes TSP 3Dfft RadixApplication

ASYNC ARDMAR DRAW NEWGENDSM

• Speedup w.r.t. sequential running times

•Radix NEWGENDSM speedup 1.63 times ASYNC

• Barnes NEGENDSM speedup 1.59 times ASYNC

•Diff time a part of Barrier Compute Time

•Page time reduced significantly

Breakdown

Asynchronous Handler Time

• Asynchronous handler time substantially reduced for Barnes and 3Dfft

Barnes TSP 3Dfft Radix

Application

ASYNC ARDMAR DRAW NEWGENDSM

Conclusions

Explored reducing asynchronous protocol processing timeUsed network features like RDMA Read/Write and atomic operationsIncorporated in a protocol NEWGENDSMMicrobenchmark/application level evaluationImprovement in parallel speedup upto 1.63

Future WorkExploit small message latency to implement “critical word first”

RDMA Read for “early restart”

Atomic operations for locking

Migrating home protocol

http://nowlab.cis.ohio-state.edu/

E-mail: {noronha, panda}@cis.ohio-state.edu

NBC home page

Web Pointers

•Page time reduced for Barnes

Breakdown

Designing High Performance DSM Systems using InfiniBand...

Documents

Transcript of Designing High Performance DSM Systems using InfiniBand...

Um Estudo sobre Memória Compartilhada Distribuída · projeto do TreadMarks teve como objetivo reduzir a quantidade de comunicação neces-sária para manter a consistência de memória.

Performance Analysis and Evaluation of Mellanox ConnectX ...nowlab.cse.ohio-state.edu/static/media/... · • InfiniBand is a popular interconnect used for High-Performance Computing

ORION 18 - ZMicro

Current and Future Challenges of the Tofu Interconnect for …nowlab.cse.ohio-state.edu/static/media/workshops/... · 2017. 7. 18. · Parallelization of Inchworm (Prof. Nanri’s

An Efficient Hardware-Software Approach to Network Fault ...nowlab.cse.ohio-state.edu/static/media/publications/slide/vishnu... · Hybrid-IBNFT Design Hardware-IBNFT and Software-IBNFT

Performance Evaluation of MPI Libraries on GPU-enabled ...nowlab.cse.ohio-state.edu/.../abstract/IWOPH_19_MPI_on_POWER.pdf · nodes, the concept of CUDA-aware MPI [28] has been introduced

TreadMarks: Shared Memory Computing on Networks of ...research.cs.wisc.edu/areas/os/Qual/papers/treadmarks.pdf · Shared memory ,dcilitates ... shared data structures by a lock acquire

TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems

Relaxed Consistency Models. Outline Lazy Release Consistency TreadMarks DSM system.

An Adaptive Heterogeneous Software DSMpapers/icpp06/ICPPW/papers/036_jwalters-DSM.… · shared memory/state systems. TreadMarks [2] is a DSM system with several advanced features,

DESIGNING HIGH PERFORMANCE AND SCALABLE DISTRIBUTED ...nowlab.cse.ohio-state.edu/static/media/... · Management Techniques, Processes, and Services (SMTPS ’07), held in conjunction

Implementing TreadMarks on GM over Myrinet: Challenges, Design

Treadmarks: Distributed Shared Memory on Standard Workstations and Operating Systems P. Keleher, A. Cox, S. Dwarkadas, and W. Zwaenepoel The Winter Usenix.

Designing’Scalable’Graph500’Benchmarkwith’ …nowlab.cse.ohio-state.edu/static/media/publications/...Designing’Scalable’Graph500’Benchmarkwith’ HybridMPI+OpenSHMEM’Programming’Models’

HIGH-PERFORMANCE MULTI-TRANSPORT MPI …nowlab.cse.ohio-state.edu/static/media/publications/...HIGH-PERFORMANCE MULTI-TRANSPORT MPI DESIGN FOR ULTRA-SCALE INFINIBAND CLUSTERS DISSERTATION

SCALABLE JOB STARTUP AND INTER-NODE ...nowlab.cse.ohio-state.edu/static/media/publications/...SCALABLE JOB STARTUP AND INTER-NODE COMMUNICATION IN MULTI-CORE INFINIBAND CLUSTERS A

High Performance Big Data (HiBD): Accelerating Hadoop ...nowlab.cse.ohio-state.edu/static/media/talks/slide/dk...– Based on Apache Hadoop 2.8.0 – Compliant with Apache Hadoop 2.8.0,

Can High-Performance Interconnects Benefit Hadoop Distributed File System?nowlab.cse.ohio-state.edu/.../slide/masvdc10-hdfs-ib_1.pdf · 2017. 7. 18. · – Hadoop Distributed Filesystem

Treadmarks : Distributed Shared Memory on Standard Workstations and Operating Systems

Sayantan Sur, Intel ExaComm Workshop held in conjunction ...nowlab.cse.ohio-state.edu/.../exacomm18/exacomm18-invited-talk-sayantan-sur.pdfSayantan Sur, Intel ExaComm Workshop held