Accelerating Big Data with Hadoop (HDFS, MapReduce and ... · •Overview of Hadoop (HDFS,...

Accelerating Big Data with Hadoop (HDFS, MapReduce and HBase) and Memcached

Dhabaleswar K. (DK) Panda

The Ohio State University

E-mail: [email protected]

http://www.cse.ohio-state.edu/~panda

Talk at HPC Advisory Council Lugano Conference (2013)

by




Big Data

• Provides groundbreaking opportunities for enterprise

information management and decision making

• The rate of information growth appears to be exceeding

Moore’s Law

• The amount of data is exploding; companies are capturing

and digitizing more information that ever

• 35 zettabytes of data will be generated and consumed by

the end of this decade

Courtesy: John Gantz and David Reinsel, "The Digital Universe Decade

– Are You Ready?" May 2010.

http://www.emc.com/collateral/analyst-reports/idc-digital-

universe-are-you-ready.pdf

2 HPC Advisory Council Lugano Conference, Mar '13

http://www.emc.com/collateral/analyst-reports/idc-digital-universe-are-you-ready.pdf













Frequency of Big Data – An Example

Graph of 1026 Twitter users whose tweets contain ‘big data’ https://nodexlgraphgallery.org/Pages/Graph.aspx?graphID=1266

Tweets made over a period of 8 hours and 24 minutes


HPC Advisory Council Lugano Conference, Mar '13 4

Trends of Networking Technologies in TOP500 Systems

Interconnect Family – Performance Share Interconnect Family – Systems Share

HPC and Advanced Interconnects

• High-Performance Computing (HPC) has adopted advanced interconnects and protocols

– InfiniBand

– 10 Gigabit Ethernet/iWARP

– RDMA over Converged Enhanced Ethernet (RoCE)

• Very Good Performance

– Low latency (few micro seconds)

– High Bandwidth (100 Gb/s with dual FDR InfiniBand)

– Low CPU overhead (5-10%)

• OpenFabrics software stack with IB, iWARP and RoCE interfaces are driving HPC systems

• Many such systems in Top500 list


• Message Passing Interface (MPI)

• Parallel File Systems

• Storage Systems

• Almost 12 years of Research and Development since

InfiniBand was introduced in October 2000

• Other Programming Models are emerging to take

advantage of High-Performance Networks

– UPC

– OpenSHMEM

– Hybrid MPI+PGAS (OpenSHMEM and UPC)

6

Use of High-Performance Networks for Scientific Computing

HPC Advisory Council Lugano Conference, Mar '13

• Focuses on large data and data analysis

• Hadoop (HDFS, HBase, MapReduce) environment is gaining

a lot of momentum

• Memcached is also used for Web 2.0

7

Big Data/Enterprise/Commercial Computing


Can High-Performance Interconnects Benefit Enterprise Computing?

• Most of the current enterprise systems use 1GE

• Concerns for performance and scalability

• Usage of High-Performance Networks is beginning to draw interest

– Oracle, IBM, Google are working along these directions

• What are the challenges?

• Where do the bottlenecks lie?

• Can these bottlenecks be alleviated with new designs (similar to the designs adopted for MPI)?


• Overview of Hadoop (HDFS, MapReduce and HBase) and Memcached

• Challenges in Accelerating Enterprise Middleware

• Designs and Case Studies – Hadoop

• HDFS

• MapReduce

• HBase

• Combination (HDFS + HBase)

– Memcached

• Conclusion and Q&A

Presentation Outline


Big Data Processing with Hadoop

• The open-source implementation of MapReduce programming model for Big Data Analytics

• http://hadoop.apache.org/

• Framework includes: MapReduce, HDFS, and HBase

• Underlying Hadoop Distributed File System (HDFS) used by MapReduce and HBase

• Model scales but high amount of communication during intermediate phases

10 HPC Advisory Council Stanford Conference, Mar'13

Hadoop Distributed File System (HDFS)

• Primary storage of Hadoop;

highly reliable and fault-tolerant

• Adopted by many reputed

organizations

– eg:- Facebook, Yahoo!

• NameNode: stores the file system namespace

• DataNode: stores data blocks

• Developed in Java for platform-

independence and portability

• Uses sockets for communication!

(HDFS Architecture)


Network-Level Interaction Between Clients and Data Nodes in HDFS

12

High

Performance

Networks

(HDD/SSD)

(HDD/SSD)

(HDD/SSD)

...

...

(HDFS Data Nodes)(HDFS Clients)

...

...

HPC Advisory Council Stanford Conference, Mar '13


HBase

• Apache Hadoop Database

(http://hbase.apache.org/)

• Semi-structured database,

which is highly scalable

• Integral part of many

datacenter applications

– eg:- Facebook Social Inbox

• Developed in Java for platform-

independence and portability

• Uses sockets for

communication!

ZooKeeper

HBase

Client

HMaster

HRegion

Servers

HDFS

(HBase Architecture)

Network-Level Interaction Between HBase Clients, Region Servers and Data Nodes

14

(HBase Clients) (HRegion Servers) (Data Nodes)

(HDD/SSD)

(HDD/SSD)

(HDD/SSD)

...

......

... ...

...

High

Performance

Networks

High

Performance

Networks


Memcached Architecture

• Integral part of Web 2.0 architecture

• Distributed Caching Layer

– Allows to aggregate spare memory from multiple nodes

– General purpose

• Typically used to cache database queries, results of API calls

• Scalable model, but typical usage very network intensive

15

!"#$%"$#&

Web Frontend Servers (Memcached Clients)

(Memcached Servers)

(Database Servers)

High

Performance

Networks

High

Performance

Networks

Main

memoryCPUs

SSD HDD

High Performance Networks

... ...

...

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD

Main

memoryCPUs

SSD HDD





• HDFS

• MapReduce

• HBase

• Combination (HDFS and HBase)

– Memcached




Designing Communication and I/O Libraries for Enterprise Systems: Challenges


D at ac e n t e r M i ddl e war e (H D F S , H Ba s e , M a pRe duc e , M e m c a c he d)

N e t w orki ng T e c hnol ogi e s (Infi ni Ba nd, 1/ 10/ 40 G i G E ,

RN ICs & Int e l l i ge nt N ICs )

S t ora ge T e c hnol ogi e s (H D D or S S D )

P oi nt -t o-P oi nt

Com m uni c a t i on

Q oS

T hre a di ng M ode l s a nd

S ync hroni z a t i on

F a ul t T ol e ra nc e I/ O a nd F i l e s ys t e m s

Com m u n i c at i on an d I / O L i br ar y

P r ogr am m i n g M ode l s (S oc ke t )

Appl i c at i on s

Com m odi t y Com put i ng S ys t e m

A rc hi t e c t ure s (s i ngl e , dua l , qua d, ..)

M ul t i / M a ny-c ore A rc hi t e c t ure

a nd A c c e l e ra t ors

Common Protocols using Open Fabrics

18

Application

Verbs Sockets

Application Interface

SDP

RDMA

SDP

InfiniBand Adapter

InfiniBand Switch

RDMA

IB Verbs

InfiniBand Adapter

InfiniBand Switch

User

space

RDMA

RoCE

RoCE Adapter

User

space

Ethernet Switch

TCP/IP

Ethernet Driver

Kernel Space

Protocol

InfiniBand Adapter

InfiniBand Switch

IPoIB

Ethernet Adapter

Ethernet Switch

Adapter

Switch

1/10/40 GigE

iWARP

Ethernet Switch

iWARP

iWARP Adapter

User

space IPoIB

TCP/IP

Ethernet Adapter

Ethernet Switch

10/40 GigE-TOE

Hardware Offload

RSockets

InfiniBand Adapter

InfiniBand Switch

User

space

RSockets


Can Big Data Processing Systems be Designed with High-Performance Networks and Protocols?

19

Enhanced Designs

Application

Accelerated Sockets

10 GigE or InfiniBand

Verbs / Hardware Offload

Current Design

Application

Sockets

1/10 GigE Network

• Sockets not designed for high-performance

– Stream semantics often mismatch for upper layers (Memcached, HBase, Hadoop)

– Zero-copy not available for non-blocking sockets

Our Approach

Application

OSU Design

10 GigE or InfiniBand

Verbs Interface


Interplay between Storage and Interconnect/Protocols

• Most of the current generation enterprise systems use the

traditional hard disks

• Since hard disks are slower, high performance

communication protocols may not have impact

• SSDs and other storage technologies are emerging

• Does it change the landscape?





• HDFS

• MapReduce

• HBase


– Memcached




HDFS-RDMA Design Overview

• JNI Layer bridges Java based HDFS with communication library written in native code

• Only the communication part of HDFS Write is modified; No change in HDFS architecture

22

HDFS

IB Verbs

InfiniBand

Applications

1/10 GigE, IPoIB Network

Java Socket Interface

Java Native Interface (JNI)

Write Others

OSU Design

HDFS Write involves

replication; more

network intensive

HDFS Read is mostly

node-local

Enables high performance RDMA communication, while supporting traditional socket interface


Communication Times in HDFS

• Cluster B with 32 HDD DataNodes

– 30% improvement in communication time over IPoIB (32Gbps)

– 87% improvement in communication time over 1GigE

• Similar improvements are obtained for SSD DataNodes

0

10

20

30

40

50

60

70

80

90

100

2GB 4GB 6GB 8GB 10GB

Co

mm

un

icat

ion

Tim

e (s

)

File Size (GB)

1GigE

IPoIB (QDR)

OSU-IB (QDR)

Reduced by 30%

N. S. Islam, M. W. Rahman, J. Jose, R. Rajachandrasekar, H. Wang, H. Subramoni, C. Murthy and D. K. Panda ,

High Performance RDMA-Based Design of HDFS over InfiniBand , Supercomputing (SC), Nov 2012


Evaluations using TestDFSIO

• Cluster with 32 HDD DataNodes

– 13.5% improvement over IPoIB (32Gbps) for 8GB file size

• Cluster with 4 SSD DataNodes

– 15% improvement over IPoIB (32Gbps) for 8GB file size

0

50

100

150

200

250

300

350

2 4 6 8 10

Thro

ugh

pu

t (M

ega

Byt

es/

s)

File Size (GB)

1GigE IPoIB (QDR) OSU-IB (QDR)

0

20

40

60

80

100

120

140

2 4 6 8 10

Thro

ugh

pu

t (M

ega

Byt

es/

s)

File Size (GB)


Cluster with 32 HDD Nodes Cluster with 4 SSD Nodes

Increased by 13.5%

Increased by 15%


Evaluations using TestDFSIO

• Cluster with 4 DataNodes, 1 HDD per node

– 10% improvement over IPoIB (32Gbps) for 10GB file size

• Cluster with 4 DataNodes, 2 HDD per node

– 16.2% improvement over IPoIB (32Gbps) for 10GB file size

• 2 HDD vs 1 HDD

– 2.01x improvement for OSU-IB (32Gbps)

– 1.8x improvement for IPoIB (32Gbps)

Increased by 16.2%

0

50

100

150

200

250

300

350

2 4 6 8 10

Ave

rage

Th

rou

ghp

ut

(Meg

aByt

es/s

)

File Size (GB)

1GigE (1HDD) IPoIB(1HDD) OSU-IB (1HDD)

1GigE (2HDD) IPoIB (2HDD) OSU-IB (2HDD)


Evaluations using YCSB (Single Region Server: 100% Update)

• HBase using TCP/IP, running over HDFS-IB

• HBase Put latency for 360K records

– 204 us for OSU Design; 252 us for IPoIB (32Gbps)

• HBase Put throughput for 360K records

– 4.42 Kops/sec for OSU Design; 3.63 Kops/sec for IPoIB (32Gbps)

• 20% improvement in both average latency and throughput

Put Latency Put Throughput


0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

120K 240K 360K 480K

Thro

ugh

pu

t (M

ega

Byt

es/

s)

Number of Records (Record Size = 1KB)


0

50

100

150

200

250

300

350

400

450

500

120K 240K 360K 480K

Ave

rage

Lat

en

cy (

us)

Number of Record (Record Size = 1KB)


Evaluations using YCSB (32 Region Servers: 100% Update)

Put Latency Put Throughput

• HBase using TCP/IP, running over HDFS-IB

• HBase Put latency for 480K records

– 201 us for OSU Design; 272 us for IPoIB (32Gbps)

• HBase Put throughput for 480K records

– 4.42 Kops/sec for OSU Design; 3.63 Kops/sec for IPoIB (32Gbps)

• 26% improvement in average latency; 24% improvement in throughput


0

50

100

150

200

250

300

350

400

120K 240K 360K 480K

Ave

rage

Lat

en

cy (

us)



0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

5

120K 240K 360K 480K

Thro

ugh

pu

t (K

op

s/se

c)






• HDFS

• MapReduce

• HBase


– Memcached





MapReduce-RDMA Design Overview

• Enables high performance RDMA communication, while supporting traditional socket interface

RDMA-enabled MapReduce

IB Verbs

RDMA Capable Networks

(IB, 10GE/ iWARP, RoCE ..)

OSU-IB Design

Applications

1/10 GigE Network



Job

Tracker

Task

Tracker

Map

Reduce

• For 5-20 GB Sort, 4-node cluster with a single SSD per node

• For 25-40 GB Sort, 8-node cluster with a single HDD per node

• 46% improvement over IPoIB (32 Gbps) for 15 GB Sort

• 27% improvement over IPoIB (32 Gbps) for 40 GB Sort

Evaluations using Sort

0

50

100

150

200

250

300

350

400

450

500

5 10 15 20

Exec

uti

on

Tim

e (s

ec)

Sort Size (GB)


0

200

400

600

800

1000

1200

1400

25 30 35 40

Exec

uti

on

Tim

e (s

ec)

Sort Size (GB)



• Cluster Size 4 and 8 have 24 GB RAM in each node, Cluster Size 12 and 24

have 12 GB RAM in each node, all the nodes have single HDD

• 41% improvement over IPoIB (32Gbps) for 100 GB Terasort

Evaluations using TeraSort

0

500

1000

1500

2000

2500

3000

30 GB 60 GB 100 GB 200 GB

Cluster Size = 4 Cluster Size = 8 Cluster Size = 12 Cluster Size = 24

Exec

uti

on

Tim

e (s

ec) 1GigE IPoIB (QDR) OSU-IB (QDR)


• The DataSet for these workloads are taken from PUMA (Purdue MapReduce

Benchmark Suite)

• 46% improvement in Adjacency List over IPoIB (32Gbps) for 30 GB data size

• 33% improvement in Sequence Count over IPoIB (32 Gbps) for 50 GB data

size

Evaluations using PUMA Workload

0

500

1000

1500

2000

2500

3000

Adjacency List (30GB) Self Join (30 GB) Word Count (50 GB) Sequence Count (50 GB) Inverted Index (50 GB)

Exec

uti

on

Tim

e (s

ec)

Workloads

IPoIB (QDR) OSU-IB (QDR)





• HDFS

• MapReduce

• HBase


– Memcached





HBase Put/Get – Detailed Analysis

• HBase 1KB Put – Communication Time – 8.9 us – A factor of 6X improvement over 10GE for communication time

• HBase 1KB Get – Communication Time – 8.9 us – A factor of 6X improvement over 10GE for communication time

0

50

100

150

200

250

300

1GigE IPoIB 10GigE OSU-IB

Tim

e (

us)

HBase Put 1KB

0

50

100

150

200

250

1GigE IPoIB 10GigE OSU-IB

Tim

e (

us)

HBase Get 1KB

Communication

Communication Preparation

Server Processing

Server Serialization

Client Processing

Client Serialization

W. Rahman, J. Huang, J. Jose, X. Ouyang, H. Wang, N. Islam, H. Subramoni, Chet Murthy and D. K. Panda, Understanding the Communication Characteristics in HBase: What are the Fundamental Bottlenecks?, ISPASS’12


HBase-RDMA Design Overview

• JNI Layer bridges Java based HBase with communication library written in native code

• Enables high performance RDMA communication, while supporting traditional socket interface

HBase

IB Verbs

RDMA Capable Networks

(IB, 10GE/ iWARP, RoCE ..)

OSU-IB Design

Applications

1/10 GigE Network




HBase Single Server-Multi-Client Results

• HBase Get latency

– 4 clients: 104.5 us; 16 clients: 296.1 us

• HBase Get throughput

– 4 clients: 37.01 Kops/sec; 16 clients: 53.4 Kops/sec

• 27% improvement in throughput for 16 clients over 10GE

0

100

200

300

400

500

600

1 2 4 8 16

Tim

e (

us)

No. of Clients

0

10000

20000

30000

40000

50000

60000

1 2 4 8 16

Op

s/se

c

No. of Clients

IPoIB

OSU-IB

1GigE

10GigE

Latency Throughput


HBase – YCSB Read-Write Workload

• HBase Get latency (Yahoo! Cloud Service Benchmark)

– 64 clients: 2.0 ms; 128 Clients: 3.5 ms

– 42% improvement over IPoIB for 128 clients

• HBase Get latency

– 64 clients: 1.9 ms; 128 Clients: 3.5 ms

– 40% improvement over IPoIB for 128 clients

0

1000

2000

3000

4000

5000

6000

7000

8 16 32 64 96 128

Tim

e (

us)

No. of Clients

0

1000

2000

3000

4000

5000

6000

7000

8000

9000

10000

8 16 32 64 96 128

Tim

e (

us)

No. of Clients

IPoIB OSU-IB

1GigE 10GigE

Read Latency Write Latency

J. Huang, X. Ouyang, J. Jose, W. Rahman, H. Wang, M. Luo, H. Subramoni, Chet Murthy and D. K. Panda, High-Performance Design of HBase with RDMA over InfiniBand, IPDPS’12




• HDFS

• MapReduce

• HBase


– Memcached




• YCSB Evaluation with 4 RegionServers (100% update)

• HBase Put Latency and Throughput for 360K records

- 37% improvement over IPoIB (32Gbps)

- 18% improvement over OSU-IB HDFS only

HDFS and HBase Integration over IB (OSU-IB)

0

50

100

150

200

250

300

350

400

450

120K 240K 360K 480K

Late

ncy

(u

s)


1GigE IPoIB (QDR)

OSU-IB (HDFS) - IPoIB (HBase) (QDR) OSU-IB (HDFS) - OSU-IB (HBase) (QDR)

0

1

2

3

4

5

6

120K 240K 360K 480K

Thro

ugh

pu

t (K

op

s/se

c)


1GigE IPoIB (QDR)

OSU-IB (HDFS) - IPoIB (HBase) (QDR) OSU-IB (HDFS) - OSU-IB (HBase) (QDR)





• HDFS

• MapReduce

• HBase


– Memcached




Memcached-RDMA Design

41

• Memcached applications need not be modified; uses verbs interface if available

• Memcached Server can serve both socket and verbs clients simultaneously

• Native IB-verbs-level Design and evaluation with

– Memcached Server: 1.4.9

– Memcached Client: (libmemcached) 0.52

Sockets Client

RDMA Client

Master Thread

Sockets Worker Thread

Verbs Worker Thread

Sockets Worker Thread

Verbs Worker Thread

Shared

Data

Memory Slabs Items

…

1

1

2

2


42

• Memcached Get latency

– 4 bytes RC/UD – DDR: 6.82/7.55 us; QDR: 4.28/4.86 us

– 2K bytes RC/UD – DDR: 12.31/12.78 us; QDR: 8.19/8.46 us

• Almost factor of four improvement over 10GE (TOE) for 2K bytes on

the DDR cluster

Intel Clovertown Cluster (IB: DDR) Intel Westmere Cluster (IB: QDR)

0

20

40

60

80

100

120

140

160

180

1 2 4 8 16 32 64 128 256 512 1K 2K

Tim

e (

us)

Message Size

0

20

40

60

80

100

120

140

160

180

1 2 4 8 16 32 64 128 256 512 1K 2K

Tim

e (

us)

Message Size

SDP IPoIB

OSU-RC-IB 1GigE

10GigE OSU-UD-IB

Memcached Get Latency (Small Message)



Memcached Get TPS (4byte)

• Memcached Get transactions per second for 4 bytes

– On IB QDR 1.4M/s (RC), 1.3 M/s (UD) for 8 clients

• Significant improvement with native IB QDR compared to SDP and IPoIB

0

200

400

600

800

1000

1200

1400

1600

1 2 4 8 16 32 64 128 256 512 800 1K

Tho

usa

nd

s o

f Tr

ansa

ctio

ns

pe

r se

con

d (

TPS)

No. of Clients

SDP IPoIB

OSU-RC-IB 1GigE

OSU-UD-IB

0

200

400

600

800

1000

1200

1400

1600

4 8

Tho

usa

nd

s o

f Tr

ansa

ctio

ns

pe

r se

con

d (

TPS)

No. of Clients


Application Level Evaluation – Olio Benchmark

• Olio Benchmark

– RC – 1.6 sec, UD – 1.9 sec, Hybrid – 1.7 sec for 1024 clients

• 4X times better than IPoIB for 8 clients

• Hybrid design achieves comparable performance to that of pure RC design

0

500

1000

1500

2000

2500

64 128 256 512 1024

Tim

e (

ms)

No. of Clients

0

20

40

60

80

100

120

1 4 8

Tim

e (

ms)

No. of Clients

SDP

IPoIB

OSU-RC-IB

OSU-UD-IB

OSU-Hybrid-IB


Application Level Evaluation – Real Application Workloads

• Real Application Workload – RC – 302 ms, UD – 318 ms, Hybrid – 314 ms for 1024 clients

• 12X times better than IPoIB for 8 clients • Hybrid design achieves comparable performance to that of pure RC design

0

20

40

60

80

100

120

1 4 8

Tim

e (

ms)

No. of Clients

SDP

IPoIB

OSU-RC-IB

OSU-UD-IB

OSU-Hybrid-IB

0

50

100

150

200

250

300

350

64 128 256 512 1024

Tim

e (

ms)

No. of Clients

J. Jose, H. Subramoni, M. Luo, M. Zhang, J. Huang, W. Rahman, N. Islam, X. Ouyang, H. Wang, S. Sur and D. K.

Panda, Memcached Design on High Performance RDMA Capable Interconnects, ICPP’11

J. Jose, H. Subramoni, K. Kandalla, W. Rahman, H. Wang, S. Narravula, and D. K. Panda, Scalable Memcached

design for InfiniBand Clusters using Hybrid Transport, CCGrid’12

Designing Communication and I/O Libraries for Enterprise Systems: Solved a Few Initial Challenges


D at ac e n t e r M i ddl e war e (H D F S , H Ba s e , M a pRe duc e , M e m c a c he d)

N e t w orki ng T e c hnol ogi e s (Infi ni Ba nd, 1/ 10/ 40 G i G E ,

RN ICs & Int e l l i ge nt N ICs )

S t ora ge T e c hnol ogi e s (H D D or S S D )

P oi nt -t o-P oi nt

Com m uni c a t i on

Q oS

T hre a di ng M ode l s a nd

S ync hroni z a t i on

F a ul t T ol e ra nc e I/ O a nd F i l e s ys t e m s

Com m u n i c at i on an d I / O L i br ar y

P r ogr am m i n g M ode l s (S oc ke t )

Appl i c at i on s

Com m odi t y Com put i ng S ys t e m

A rc hi t e c t ure s (s i ngl e , dua l , qua d, ..)

M ul t i / M a ny-c ore A rc hi t e c t ure

a nd A c c e l e ra t ors

Upper level

Changes?

• Presented initial designs to take advantage of InfiniBand/RDMA

for HDFS, MapReduce, HBase and Memcached

• Results are promising

• Working on Integrated designs of all components

• Many other open issues need to be solved including design

changes at the upper layers

• Will enable BigData community to take advantage of modern

clusters and Hadoop middleware to carry out their analytics in a

fast and scalable manner


Concluding Remarks

47


Web Pointers


http://nowlab.cse.ohio-state.edu

MVAPICH Web Page

http://mvapich.cse.ohio-state.edu

[email protected]

48





http://nowlab.cse.ohio-state.edu/




http://mvapich.cse.ohio-state.edu/



mailto:[email protected]



Accelerating Big Data with Hadoop (HDFS, MapReduce and ... · •Overview of Hadoop (HDFS,...

Documents

Transcript of Accelerating Big Data with Hadoop (HDFS, MapReduce and ... · •Overview of Hadoop (HDFS,...