What does it mean to virtualize the Hadoop File System?

Tom Phelan, Chief Architect for BlueData

Post on 14-Dec-2015


It is HDFS …

Unless it is not

Outline

There are questions to be answered …

Three “Whats”:
• What is HDFS?
• What does it mean to virtualize HDFS?
• What are the different methods of virtualization?
   Instances
   Advantages and considerations

And a “When”:
• When to choose HDFS storage virtualization?

What is HDFS?

Before we can virtualize it, we need to understand what “it” is.

HDFS

It is a distributed file system built with NameNodes and DataNodes.

http://image.slidesharecdn.com/introtohadoop-javamug-110414122200-phpapp01/95/intro-to-the-hadoop-stack-april-2011-javamug-14-728.jpg?cb=1302793500

Source: David Engfer via slideshare.net

HDFS Implementation

Diagram labels: hadoop-hdfs.jar, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.hdfs.FileSystem, org.apache.hadoop.hdfs.DistributedFileSystem

• Hadoop Distributed File System API/Java Class
• Distributed File System Client Protocol at TCP/IP level (“over the wire”)

It is a stack of Java code used by Hadoop applications to access data.

• Generic Java classes: Java class org.apache.hadoop.fs.FileSystem
• HDFS over-the-wire protocol: Java class org.apache.hadoop.hdfs.DFSClient
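The two layers named above can be pictured with a small stand-alone sketch. It does not use the Hadoop jars: SketchFileSystem, SketchDistributedFileSystem, and SketchWireClient are hypothetical stand-ins for the roles played by org.apache.hadoop.fs.FileSystem, DistributedFileSystem, and DFSClient, and the "wire" is just an in-memory map. The point is only that the application codes against the generic class, so either layer beneath it can be swapped.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the generic API layer (the role played by
// org.apache.hadoop.fs.FileSystem). Applications code against this type only.
abstract class SketchFileSystem {
    abstract byte[] read(String path);
}

// Hypothetical stand-in for the wire-protocol layer (the role played by
// org.apache.hadoop.hdfs.DFSClient). A real client would talk TCP/IP to a
// NameNode and DataNodes; this one serves bytes from an in-memory map.
class SketchWireClient {
    private final Map<String, byte[]> blocks = new HashMap<>();

    void store(String path, String content) {
        blocks.put(path, content.getBytes(StandardCharsets.UTF_8));
    }

    byte[] fetch(String path) {
        byte[] data = blocks.get(path);
        if (data == null) {
            throw new IllegalArgumentException("no such path: " + path);
        }
        return data;
    }
}

// Hypothetical stand-in for the HDFS implementation of the generic API:
// it forwards every read to the wire client.
class SketchDistributedFileSystem extends SketchFileSystem {
    private final SketchWireClient client;

    SketchDistributedFileSystem(SketchWireClient client) {
        this.client = client;
    }

    @Override
    byte[] read(String path) {
        return client.fetch(path);
    }
}

class LayerDemo {
    // The application sees only the generic type, so either layer beneath it
    // can be replaced without changing this method.
    static String readAsText(SketchFileSystem fs, String path) {
        return new String(fs.read(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        SketchWireClient client = new SketchWireClient();
        client.store("/data/part-0", "hello");
        SketchFileSystem fs = new SketchDistributedFileSystem(client);
        System.out.println(readAsText(fs, "/data/part-0")); // prints "hello"
    }
}
```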

HDFS Layers of Potential Virtualization

Diagram: one host runs the NameNode and ResourceManager; each worker host runs a DataNode and NodeManager with local disks, plus an App that reads data through the HDFS Impl and DFSClient. The two labeled layers are the HDFS Implementation and the Wire Protocol.

HDFS Virtualization

The virtualization of either the HDFS Implementation or the Protocols


HDFS Virtualization Methods

• Virtualize the HDFS Implementation
• Implement one of the Hadoop Compatible File System (HCFS) protocols
   • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
   • Implement a HCFS via the FileSystem protocol (fs.FileSystem)

Virtualize the HDFS Implementation

This is the only method of HDFS virtualization that requires Hadoop compute virtualization.

Simple: install a Hadoop distribution into a cluster of virtualized compute nodes and run the HDFS services in that cluster, storing data on virtual disks (vdisks/VMDKs).

Instances of this type of HDFS virtualization include:
• VMware BDE
• OpenStack Sahara
• Cloudera Director
• Hortonworks Cloudbreak

Diagram: the NameNode/ResourceManager node and each DataNode/NodeManager node (with its App, HDFS Impl, DFSClient, and local disks) run inside a VM on a host.

Advantages:
• Simple
• No new Java code
• Compute/data locality

Considerations:
• Requires data ingest time
• The clusters become stateful


Implement a HCFS via the over-the-wire protocol

Use the unmodified hadoop-hdfs jar and set fs.defaultFS to hdfs://1.2.3.4:8020/path

Instance:
• EMC Isilon
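As a rough illustration of what that setting carries, the sketch below splits the example value from the slide with the standard java.net.URI class. The class name DefaultFsUri is made up for this example. The scheme is what selects the HDFS client, and the host and port name the endpoint that must answer the HDFS wire protocol, which in this method is the storage service rather than a Hadoop NameNode.

```java
import java.net.URI;

class DefaultFsUri {
    // Splits a default file system URI into the parts a client needs.
    static String describe(String value) {
        URI uri = URI.create(value);
        return "scheme=" + uri.getScheme()
                + " host=" + uri.getHost()
                + " port=" + uri.getPort()
                + " path=" + uri.getPath();
    }

    public static void main(String[] args) {
        // Example value taken from the slide.
        System.out.println(describe("hdfs://1.2.3.4:8020/path"));
        // prints "scheme=hdfs host=1.2.3.4 port=8020 path=/path"
    }
}
```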

Diagram: one host runs the NameNode and ResourceManager; each worker host runs a DataNode and NodeManager with an App, HDFS Impl, DFSClient, and local disks. A StorageService with its own local disks is added.

Advantages:
• Multi-protocol
• No new Java code
• Enterprise storage services

Considerations:
• Open source / proprietary
• No compute / data locality


Implement a HCFS via the FileSystem Java classes

Write the Java code that implements the class, build a jar file, put the jar file on the YARN services' class path, and edit the core-site.xml file.

Instances:
• S3 and S3a/S3n - org.apache.hadoop.fs.FileSystem
   https://github.com/Aloisius/hadoop-s3a
• GlusterFS - org.apache.hadoop.fs.FilterFileSystem
   https://github.com/gluster/glusterfs-hadoop
• Tachyon - org.apache.hadoop.fs.FileSystem
   https://github.com/amplab/tachyon
• Apache Ignite - org.apache.hadoop.fs.AbstractFileSystem
   https://github.com/apache/ignite
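The registration steps for this method (class on the class path, entry in core-site.xml) amount to a lookup from URI scheme to implementation class. The stand-alone sketch below mimics that lookup with java.util.Properties in place of Hadoop's configuration. The key pattern fs.<scheme>.impl follows Hadoop's convention, but SketchFs, CustomFs, SchemeResolver, the customfs scheme, and the storage-host name are all hypothetical; a real implementation would extend org.apache.hadoop.fs.FileSystem instead.

```java
import java.net.URI;
import java.util.Properties;

// Hypothetical minimal contract; the real one is org.apache.hadoop.fs.FileSystem.
interface SketchFs {
    String name();
}

// Hypothetical custom implementation that would ship in its own jar.
class CustomFs implements SketchFs {
    @Override
    public String name() {
        return "customfs";
    }
}

class SchemeResolver {
    // Looks up "fs.<scheme>.impl" for the URI's scheme and instantiates the
    // configured class reflectively, as a class-path based plug-in would be.
    static SketchFs resolve(Properties conf, String uriText) {
        String scheme = URI.create(uriText).getScheme();
        String className = conf.getProperty("fs." + scheme + ".impl");
        if (className == null) {
            throw new IllegalArgumentException("no file system for scheme: " + scheme);
        }
        try {
            return (SketchFs) Class.forName(className)
                    .getDeclaredConstructor()
                    .newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("cannot load " + className, e);
        }
    }

    public static void main(String[] args) {
        // Stands in for the entry that would be added to core-site.xml.
        Properties conf = new Properties();
        conf.setProperty("fs.customfs.impl", CustomFs.class.getName());

        SketchFs fs = resolve(conf, "customfs://storage-host/data");
        System.out.println(fs.name()); // prints "customfs"
    }
}
```

Because the application only names a URI, switching storage back ends under this method is a configuration change rather than an application change.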

Diagrams (two variants): one host runs the NameNode and ResourceManager; each worker host runs a DataNode and NodeManager with an App, HDFS Impl, DFSClient, and local disks. A CustomFS Impl is added on each worker host and connects to the StorageService instances.

Advantages:
• Open source / proprietary
• Multiple file access protocols supported

Considerations:
• These are file systems
• New Java code
• Possibly no compute / data locality
• May lag latest HDFS feature set

HDFS Virtualization

Is there another way?

HDFS Virtualization

• Virtualize the HDFS Implementation
• Implement a Hadoop Compatible File System – HCFS
   • Implement a HCFS via the over-the-wire protocol
   • Implement a HCFS via the FileSystem Java classes
• Virtualize the Hadoop Compatible File System Protocol

Virtualize the Hadoop Compatible File System Protocol

Instance:
• BlueData EPIC software – org.apache.hadoop.fs.FileSystem

Translate the Hadoop file system calls into native calls to the back-end file systems.

Insert an intelligent caching layer.
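A minimal sketch of the caching idea, under loud assumptions: this is not BlueData's implementation, and PageStore, CountingStore, and ReadAheadCache are hypothetical names. It shows only read-ahead in front of an arbitrary back end; write-back, eviction, and concurrency are left out. The cache implements the same interface as the back end, which is what keeps the application unaware of it.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical back-end contract: any store that can return a page of a file.
interface PageStore {
    byte[] readPage(String path, int pageIndex);
}

// Fake back end that counts calls so the effect of caching is visible.
class CountingStore implements PageStore {
    int reads = 0;

    @Override
    public byte[] readPage(String path, int pageIndex) {
        reads++;
        return (path + "#" + pageIndex).getBytes(StandardCharsets.UTF_8);
    }
}

// Read-ahead cache: on a miss it fetches the requested page plus the next
// `readAhead` pages. Unbounded and single-threaded, for illustration only.
class ReadAheadCache implements PageStore {
    private final PageStore backend;
    private final int readAhead;
    private final Map<String, byte[]> pages = new HashMap<>();

    ReadAheadCache(PageStore backend, int readAhead) {
        this.backend = backend;
        this.readAhead = readAhead;
    }

    private static String key(String path, int pageIndex) {
        return path + "@" + pageIndex;
    }

    @Override
    public byte[] readPage(String path, int pageIndex) {
        byte[] hit = pages.get(key(path, pageIndex));
        if (hit != null) {
            return hit;
        }
        for (int i = pageIndex; i <= pageIndex + readAhead; i++) {
            String k = key(path, i);
            if (!pages.containsKey(k)) {
                pages.put(k, backend.readPage(path, i));
            }
        }
        return pages.get(key(path, pageIndex));
    }

    public static void main(String[] args) {
        CountingStore store = new CountingStore();
        PageStore cache = new ReadAheadCache(store, 2);
        cache.readPage("/f", 0); // miss: fetches pages 0, 1, 2
        cache.readPage("/f", 1); // hit
        cache.readPage("/f", 2); // hit
        System.out.println(store.reads); // prints 3
    }
}
```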

Diagram: one host runs the NameNode and ResourceManager; each worker host runs a DataNode and NodeManager with an App, HDFS Impl, DFSClient, and local disks. A DTAP Impl and a DTAP Service are added on each worker host, and a separate host runs the StorageService with its own local disks.

HDFS mem cache
• Components: Page Cache, HDFS Implementation, DFSClient, DataNode
• Application is cache aware

Extend mem cache to any File System or Object storage
• Components: Page Cache, DTAP FileSystem Implementation, DTAPService
• Back ends: HDFS, GlusterFS, Object Store
• Application is cache unaware

Advantages:
• Not a file system
• Transparent in-memory cache (write back, read ahead)
• Supports multiple protocols
• Supports compute / data locality

Considerations:
• New Java code
• Open source / proprietary
• May lag latest HDFS feature set

Let’s Review

Outline

There are questions to be answered …

Three “Whats”:
• What is HDFS?
• What does it mean to virtualize HDFS?
• What are the different methods of virtualization?
   Instances
   Advantages and considerations

And a “When”:
• When to choose HDFS storage virtualization?

A Few Words about Performance

Performance measurements are an art as well as a science

• Bottlenecks in applications
• Bottlenecks in infrastructure: network, CPU, disk
• Configuration is key: block size, distro, security

Performance – VMware BDE (Virtualize the HDFS Implementation)

Source of graph: VMware Technical Paper, “Virtualized Hadoop Performance with VMware vSphere 6 on High Performance Servers”

Performance – Isilon (Implement a HCFS via the over-the-wire protocol)

Source of graph: Stefan Radtke blog post, http://stefanradtke.blogspot.com/2015/05/comparing-hadoop-performance-on-das-and.html

Performance – Tachyon (Implement a HCFS via the FileSystem Java classes)

Source of graph: Haoyuan Li, https://spark-summit.org/2014/wp-content/uploads/2014/07/Tachyon-Further-Improve-Sparks-Performance-Haoyuan-Li.pdf

Performance – BlueData (Virtualize the Hadoop Compatible File System Protocol)

Source of graph: BlueData customer proof-of-concept results

Virtualized HDFS solutions provide good performance

Even with remote storage

Even in virtualized environments

When it comes to Hadoop storage virtualization, speed is not the whole story

Other factors to consider when implementing a virtualized HDFS option:

•Use of a virtualized compute environment

•Open source / proprietary solution

•Required Hadoop File System features

•Lifespan of Hadoop cluster

Other factors to consider when selecting storage:

• Data accessibility
   Hadoop File System protocol
   NFS, object store, other protocols
• Enterprise storage services
   Data protection
   Geographical replication
   Offline backup


Consider a Virtualized HDFS Solution

When any of the following are true:

• Hadoop and non-Hadoop applications must access the same data, and you do not want to replicate it
• Enterprise storage data services are required
• Hadoop needs to run in a virtual compute environment

Hadoop File System

Volume, Velocity, Variety

Virtualization

Q & A

twitter: @tapbluedata

email: tap@bluedata.com

www.bluedata.com

Visit our booth in the Expo