Performance Evaluation of RDMA over IP: A Case Study with the Ammasso Gigabit Ethernet NIC

H.-W. Jin, S. Narravula, G. Brown,

K. Vaidyanathan, P. Balaji, and D.K. Panda

Network-Based Computing Laboratory

Department of Computer Science and Engineering

The Ohio State University

{ jinhy, narravul, browngre, vaidyana, balaji, panda}@cse.ohio-state.edu

Contents

• Introduction

• WAN Emulator for Cluster-of-Clusters

• Performance Evaluation of RDMA over IP

• Conclusions and Future Work

Introduction

• Sockets over TCP/IP
• RDMA over LAN
  – InfiniBand, Myrinet, Quadrics
  – HPC middleware (MPI) and file systems (PVFS)
• RDMA over WAN
  – iWARP, RDDP
  – Grid and Internet applications
• RDMA-enabled Gigabit Ethernet NIC
  – Ammasso

Ammasso Gigabit Ethernet NIC

[Figure: Ammasso Gigabit Ethernet NIC protocol stack. Component labels: Applications; Sockets Interface; CCIL (Cluster Core Interface Language); Sockets; TCP; IP; Device Driver; Gigabit Ethernet; RDMA; TOE (TCP/IP Offload Engine). Region labels: Ammasso Gigabit Ethernet NIC; Operating System.]

Problem Statement

• There has been no comprehensive quantitative evaluation of RDMA in a WAN environment

• How to Emulate the WAN Environment?

• What Kind of Performance Metrics?

• Sockets vs. CCIL

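Experimental WAN Setup

[Figure: experimental WAN setup. IP Network A and IP Network B, each attached to a GigE switch, are connected through a Linux workstation-based router (interfaces eth0 and eth1, device driver, IP layer) that performs the WAN emulation.]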

WAN Emulator for Cluster-of-Clusters

• Characteristics of WAN Environments
  – High network delay
  – Packet loss
  – Etc.

• User-Level or Kernel-Level Emulator?

• Blocking-based or Queueing-based Delay Adding?

Degen: Delay generator

[Figure: Degen architecture on the router. Components shown: eth0 and eth1 with their device drivers; IP layer with routing decision; Degen Netfilter hook; timestamp; delay queue; reinjection; Degen kernel module; Degen daemon.]
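The timestamp, delay queue, and reinjection stages named in the Degen diagram suggest a queueing-based way of adding delay. The following is a minimal user-space sketch of that idea in Python, not the Degen kernel module itself. The class name `DelayQueue`, its methods, and the millisecond clock values are all hypothetical and chosen only for illustration.

```python
from collections import deque


class DelayQueue:
    """Timestamp-based delay queue: a packet is held until arrival + delay."""

    def __init__(self, delay_ms):
        self.delay_ms = delay_ms
        self._queue = deque()  # (release_time_ms, packet) in arrival order

    def enqueue(self, packet, now_ms):
        # Stamp the packet with the earliest time it may be reinjected.
        self._queue.append((now_ms + self.delay_ms, packet))

    def reinject_due(self, now_ms):
        # Release every packet whose timestamp has expired, oldest first.
        released = []
        while self._queue and self._queue[0][0] <= now_ms:
            released.append(self._queue.popleft()[1])
        return released


def demo():
    q = DelayQueue(delay_ms=4)
    q.enqueue("p0", now_ms=0)
    q.enqueue("p1", now_ms=1)
    early = q.reinject_due(now_ms=3)   # nothing is due yet
    first = q.reinject_due(now_ms=4)   # p0 becomes due at t = 4 ms
    rest = q.reinject_due(now_ms=10)   # p1 became due at t = 5 ms
    return early, first, rest
```

Because every packet receives the same fixed delay, release times are already in arrival order, so a plain FIFO is enough and no priority queue is needed. Unlike a blocking approach, enqueueing a packet never stalls the packets behind it.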

Kernel Patch for CCIL WAN Communication

• Ammasso Setup
  – Ammasso 1100
  – Ammasso software version amso1100-1.2-ga2

• Packet Drops for CCIL WAN Communication
  – Timeout
  – Retransmission

• Kernel Patch on Router

Contents

• Introduction
• WAN Emulator for Cluster-of-Clusters
• Performance Evaluation of RDMA over IP
  – Basic communication latency
  – Computation and communication overlap
  – Communication progress
  – CPU resource requirements
  – Unification of communication interface
  – Bandwidth (throughput)

• Conclusions and Future Work

Basic Communication Latency

[Figure: two plots comparing Sockets and CCIL. Left: latency (us) versus message size (4 to 16384 bytes). Right: latency (us) versus network delay (0, 1, 2, 4, 8 ms). Plot label: 1KB Message Size.]

• No impact of zero-copy on the basic communication latency
• Basic communication latency is not an important metric

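Computation and Communication Overlap

[Figure: nodes n0 and n1 connected through a switch, the router, and a second switch. The timeline shows a Send and a Receive, with the Computation time (t1) marked inside the Total Time (t2).]

Overlap Ratio = t1 / t2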
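The definition Overlap Ratio = t1 / t2 can be made concrete with a small calculation. The sketch below is illustrative only: the `total_time` model (full overlap takes the longer of the two phases, no overlap takes their sum) is an idealized assumption rather than something stated on the slides, and the 8 ms communication time is a made-up value.

```python
def overlap_ratio(t1, t2):
    """Overlap Ratio = t1 / t2 (computation time over total time)."""
    if t2 <= 0:
        raise ValueError("total time must be positive")
    return t1 / t2


def total_time(computation_ms, communication_ms, overlapped):
    # Assumed idealized model (not from the slides):
    # fully overlapped communication hides behind the computation,
    # while non-overlapped communication is simply added on top.
    if overlapped:
        return max(computation_ms, communication_ms)
    return computation_ms + communication_ms


if __name__ == "__main__":
    comp_ms, comm_ms = 242, 8  # comm_ms is a hypothetical value
    for overlapped in (True, False):
        t2 = total_time(comp_ms, comm_ms, overlapped)
        print(overlapped, overlap_ratio(comp_ms, t2))
```

A ratio close to 1 means that almost all of the total time was spent computing, that is, communication was hidden behind computation.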

Computation and Communication Overlap

[Figure: two plots comparing Sockets and CCIL. Left: overlap ratio (0 to 1) versus computation time (0 to 422 ms). Right: overlap ratio (0 to 1) versus network delay (0, 1, 2, 4, 8 ms). Plot labels: 1KB Message Size; 242ms Computation. Annotations as extracted: 1098%, 114%.]

• RDMA can achieve better computation and communication overlap
• Its benefit decreases as the network delay increases

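Communication Progress

[Figure: nodes n0 and n1 connected through a switch, the router, and a second switch. A Request is followed by a Response; the Data Fetching Latency spans both and includes the Response Delay by Load.]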

Communication Progress

[Figure: two plots comparing Sockets and CCIL, with latency (us) on a logarithmic axis from 1 to 100000. Left: latency versus response delay by load (0, 1, 4, 16, 64 ms). Right: latency versus network delay (0, 1, 2, 4, 8 ms). Plot labels: 16ms Response Delay; 1KB Message Size. Annotations: 98%, 65%.]

• RDMA can achieve better communication progress
• Its benefit decreases as the network delay increases
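One way to see why the benefit shrinks as network delay grows is a simple additive model. The sketch below assumes, purely for illustration, that a two-sided (Sockets-style) fetch waits for both the network round trip and the loaded remote host's response delay, while a one-sided (RDMA-style) fetch waits only for the network round trip. This model and the function names are assumptions, not measurements or definitions from the slides.

```python
def fetch_latency_ms(network_rtt_ms, response_delay_ms, remote_cpu_involved):
    # Assumed model: the remote host's response delay is paid only
    # when the remote CPU has to take part in answering the request.
    if remote_cpu_involved:
        return network_rtt_ms + response_delay_ms
    return network_rtt_ms


def relative_benefit(network_rtt_ms, response_delay_ms):
    """Fraction of the two-sided latency saved by the one-sided fetch."""
    two_sided = fetch_latency_ms(network_rtt_ms, response_delay_ms, True)
    one_sided = fetch_latency_ms(network_rtt_ms, response_delay_ms, False)
    if two_sided == 0:
        return 0.0  # nothing to save when both latencies are zero
    return (two_sided - one_sided) / two_sided


if __name__ == "__main__":
    for rtt_ms in (1, 2, 4, 8):  # hypothetical round-trip times
        print(rtt_ms, relative_benefit(rtt_ms, response_delay_ms=16))
```

Under this model the saving equals response delay divided by (round trip + response delay), so it falls steadily as the network round trip comes to dominate the total.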

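CPU Resource Requirements

[Figure: nodes n0 and n1 connected through a switch, the router, and a second switch, with 40 communication streams between them. An application runs alongside the streams; the quantity measured is the application execution time.]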

CPU Resource Requirements

[Figure: two plots comparing Sockets and CCIL. Left: execution time (sec, 0 to 50) versus message size (1K to 16K bytes). Right: execution time (sec, 0 to 50) versus network delay (0, 1, 2, 4, 8 ms). Plot label: 16KB Message Size.]

• RDMA-based communication does not affect the application execution time
• RDMA has strong potential for saving CPU resources

Unification of Communication Interface

[Figure: two switches, with inter-cluster and intra-cluster communication paths marked, and a plot of latency (us, 0 to 250) versus message size (4 to 16384 bytes) for Sockets and CCIL. Annotation: 38%.]

• RDMA over IP can provide a unified communication interface
• RDMA can achieve lower latency for intra-cluster communication

Bandwidth

• Where is the bottleneck?
• Ethernet devices on the router
• TCP window size

[Figure: two plots comparing Sockets and CCIL. Left: bandwidth (Mbps, 0 to 600) versus message size (4 to 16384 bytes). Right: bandwidth (Mbps, 0 to 500) versus network delay (0, 1, 2, 4, 8 ms). Plot label: 16KB Message Size.]
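The "TCP window size" bottleneck listed on the Bandwidth slide follows from the standard bound that a sender can have at most one window of data in flight per round trip, so throughput cannot exceed window size divided by round-trip time. The sketch below evaluates that bound; the 64 KiB window, the round-trip times, and the 1000 Mbps link rate are assumed values for illustration, not the configuration used in the study.

```python
def window_limited_mbps(window_bytes, rtt_ms):
    """Upper bound on throughput when one window is sent per round trip."""
    if rtt_ms <= 0:
        return float("inf")  # with no delay the window is not the limit
    bits_per_window = window_bytes * 8
    return bits_per_window / (rtt_ms / 1000.0) / 1e6


def achievable_mbps(link_mbps, window_bytes, rtt_ms):
    # Throughput is capped by whichever is lower: the link or the window.
    return min(link_mbps, window_limited_mbps(window_bytes, rtt_ms))


if __name__ == "__main__":
    WINDOW_BYTES = 64 * 1024  # assumed window size, not from the slides
    LINK_MBPS = 1000.0        # nominal Gigabit Ethernet rate
    for rtt_ms in (0, 1, 2, 4, 8):  # assumed round-trip times
        print(rtt_ms, achievable_mbps(LINK_MBPS, WINDOW_BYTES, rtt_ms))
```

With a fixed window, the bound halves each time the round-trip time doubles, which is why a window that is ample on a LAN can become the limiting factor on a high-delay path.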

Conclusions

• The first quantitative study of RDMA over IP on a WAN setup
• WAN Emulator for Cluster-of-Clusters
  – Degen
• RDMA over IP can
  – Save CPU resources on the server side, even in a high-delay WAN environment
  – Achieve better
    • computation and communication overlap
    • communication progress
    • peak bandwidth
  – Provide a unified interface

Future Work

• Performance Evaluations
  – Other performance factors
    • impact of address exchange
    • bandwidth
  – Application-level performance
• WAN Emulator for Cluster-of-Clusters
  – Delay model
  – Other components

• RDMA-aware Middleware for Widely Distributed Systems over WAN

Acknowledgements

Our research is supported by the following organizations:

• Current Funding support by

• Current Equipment donations by

Thank You

{ jinhy, narravul, browngre, vaidyana, balaji, panda}@cse.ohio-state.edu

Network-Based Computing Laboratory

http://nowlab.cse.ohio-state.edu/