What does it mean to virtualize the Hadoop File System? Tom Phelan Chief Architect for BlueData.
It is HDFS …
Unless it is not
Outline
There are questions to be answered …
Three “What”s:
• What is HDFS?
• What does it mean to virtualize HDFS?
• What are the different methods of virtualization?
  Instances
  Advantages and considerations
And a “When”:
• When to choose HDFS storage virtualization?
What is HDFS?
Before we can virtualize it, we need to understand what “it” is.
HDFS
It is a distributed file system built with NameNodes and DataNodes
http://image.slidesharecdn.com/introtohadoop-javamug-110414122200-phpapp01/95/intro-to-the-hadoop-stack-april-2011-javamug-14-728.jpg?cb=1302793500
Source: David Engfer via slideshare.net
hadoop-hdfs.jar
org.apache.hadoop.fs.FileSystem
org.apache.hadoop.hdfs.FileSystem
org.apache.hadoop.hdfs.DistributedFileSystem
HDFS Implementation
• Hadoop Distributed File System API/Java class
• Distributed File System client protocol at TCP/IP level – “over the wire”
HDFS Implementation
It is a stack of Java code used by Hadoop applications to access data.
[Diagram labels: YARN, HDFS Implementation]
Generic Java classes: Java class org.apache.hadoop.fs.FileSystem
HDFS over-the-wire protocol: Java class org.apache.hadoop.hdfs.DFSClient
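The two layers named above can be pictured with a small sketch. It does not use the Hadoop classes themselves: GenericFs, WireClient, DistributedFs, and InMemoryWire are hypothetical stand-ins invented for illustration, assuming only that applications call a generic file system type and that the concrete implementation forwards to a wire-protocol client.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

final class LayerDemo {
    // Application code: it only ever sees the generic type.
    static String roundTrip(GenericFs fs, String path, String text) {
        fs.write(path, text.getBytes(StandardCharsets.UTF_8));
        return new String(fs.read(path), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        GenericFs fs = new DistributedFs(new InMemoryWire());
        System.out.println(roundTrip(fs, "/data/a.txt", "hello"));
    }
}

// Hypothetical stand-in for the generic API layer
// (the role played by org.apache.hadoop.fs.FileSystem).
abstract class GenericFs {
    abstract byte[] read(String path);
    abstract void write(String path, byte[] data);
}

// Hypothetical stand-in for the wire-protocol client
// (the role played by org.apache.hadoop.hdfs.DFSClient).
interface WireClient {
    byte[] fetch(String path);
    void store(String path, byte[] data);
}

// Concrete file system: implements the generic API by delegating
// every call to the wire client.
final class DistributedFs extends GenericFs {
    private final WireClient client;

    DistributedFs(WireClient client) {
        this.client = client;
    }

    @Override
    byte[] read(String path) {
        return client.fetch(path);
    }

    @Override
    void write(String path, byte[] data) {
        client.store(path, data);
    }
}

// In-memory fake of the remote service, so the sketch runs without a cluster.
final class InMemoryWire implements WireClient {
    private final Map<String, byte[]> blocks = new HashMap<>();

    @Override
    public byte[] fetch(String path) {
        return blocks.get(path);
    }

    @Override
    public void store(String path, byte[] data) {
        blocks.put(path, data.clone());
    }
}
```

Replacing DistributedFs with another GenericFs subclass, or pointing the same client at a different service, leaves roundTrip unchanged. That is the property the virtualization methods in this talk rely on.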
HDFS Layers of Potential Virtualization
[Diagram: three hosts. One runs the NameNode and ResourceManager. The other two each run a DataNode, a NodeManager, and an App containing the HDFS Impl and DFSClient, with local disks. Labeled layers: HDFS Implementation, Wire Protocol.]
HDFS Virtualization
The virtualization of either the HDFS Implementation or the Protocols
HDFS Virtualization Methods
• Virtualize the HDFS Implementation
• Implement one of the Hadoop Compatible File System (HCFS) protocols
  • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
  • Implement a HCFS via the FileSystem protocol (fs.FileSystem)
Virtualize the HDFS Implementation
This is the only method of HDFS virtualization that requires Hadoop compute virtualization.
Simple. Install a Hadoop distro into a cluster of virtualized compute nodes and run the HDFS services in the cluster, storing data on vdisks/vmdks.
Instances of this type of HDFS virtualization include:
• VMware BDE
• OpenStack Sahara
• Cloudera Director
• Hortonworks Cloudbreak
[Diagram: three hosts, each running one VM. One VM holds the NameNode and ResourceManager. The other two VMs each hold a DataNode, a NodeManager, and an App containing the HDFS Impl and DFSClient, with local disks.]
Virtualize the HDFS Implementation
Advantages:
• Simple
• No new Java code
• Compute/data locality
Considerations:
• Requires data ingest time
• The clusters become stateful
HDFS Virtualization Methods
• Virtualize the HDFS Implementation
• Implement a Hadoop Compatible File System – HCFS
  • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
  • Implement a HCFS via the FileSystem protocol (fs.FileSystem)
Implement a HCFS via the over-the-wire protocol
Use the unmodified hadoop-hdfs jar
fs.defaultFS hdfs://1.2.3.4:8020/path
Instance:
• EMC Isilon
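The default file system value above is an ordinary URI, so its parts can be inspected with the JDK alone. The sketch below is only an illustration of how that string breaks down: the scheme names the protocol, and the host and port name the endpoint that must answer it. The class and method names are made up for this example.

```java
import java.net.URI;

final class DefaultFsUri {
    // Returns "scheme|host|port|path" for a file system URI.
    static String describe(String value) {
        URI uri = URI.create(value);
        return uri.getScheme() + "|" + uri.getHost() + "|"
                + uri.getPort() + "|" + uri.getPath();
    }

    public static void main(String[] args) {
        // The example value from the slide.
        System.out.println(describe("hdfs://1.2.3.4:8020/path"));
    }
}
```

Because the scheme stays hdfs, the Hadoop client side is unchanged. Only the service listening at the host and port differs.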
[Diagram: three Hadoop hosts (one with the NameNode and ResourceManager, two each with a DataNode, a NodeManager, and an App containing the HDFS Impl and DFSClient, with local disks) alongside a Storage Service with its own local disks.]
Implement a HCFS via the over-the-wire protocol
Advantages:
• Multi-protocol
• No new Java code
• Enterprise storage services
Considerations:
• Open source / proprietary
• No compute / data locality
HDFS Virtualization Methods
• Virtualize the HDFS Implementation
• Implement a Hadoop Compatible File System – HCFS
  • Implement a HCFS via the over-the-wire protocol (hdfs.DFSClient)
  • Implement a HCFS via the FileSystem protocol (fs.FileSystem)
Implement a HCFS via the FileSystem Java classes
Write the Java code that implements the class, build a jar file, put the jar file in the YARN services class path, and edit the core-site.xml file.
Instances:
• S3 and S3a/S3n – org.apache.hadoop.fs.FileSystem
  https://github.com/Aloisius/hadoop-s3a
• GlusterFS – org.apache.hadoop.fs.FilterFileSystem
  https://github.com/gluster/glusterfs-hadoop
• Tachyon – org.apache.hadoop.fs.FileSystem
  https://github.com/amplab/tachyon
• Apache Ignite – org.apache.hadoop.fs.AbstractFileSystem
  https://github.com/apache/ignite
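The core-site.xml step is what ties a URI scheme to the new class. Hadoop configurations commonly use keys of the form fs.<scheme>.impl for this. The sketch below imitates that lookup with a plain map so it runs without Hadoop. The scheme customfs and the class name com.example.CustomFileSystem are hypothetical, and Hadoop's real lookup does more than this.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

final class SchemeResolver {
    // Looks up the implementation class name configured for a URI's scheme,
    // following the fs.<scheme>.impl key convention.
    static String implFor(Map<String, String> conf, String uri) {
        String scheme = URI.create(uri).getScheme();
        String impl = conf.get("fs." + scheme + ".impl");
        if (impl == null) {
            throw new IllegalArgumentException(
                    "No file system configured for scheme: " + scheme);
        }
        return impl;
    }

    public static void main(String[] args) {
        // Stand-in for the entries a core-site.xml edit would add.
        Map<String, String> conf = new HashMap<>();
        conf.put("fs.customfs.impl", "com.example.CustomFileSystem"); // hypothetical
        System.out.println(implFor(conf, "customfs://store1/warehouse/t1"));
    }
}
```

An application that opens a customfs:// path would then be routed to the custom class, provided the jar holding it is on the class path.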
[Diagram: three Hadoop hosts (one with the NameNode and ResourceManager, two each with a DataNode, a NodeManager, and an App containing the HDFS Impl and DFSClient, with local disks). Each worker host also carries a CustomFS Impl connected to external Storage Services.]
Implement a HCFS via the FileSystem Java classes
[Diagram: the same layout with the NameNode host, a separate ResourceManager, and two worker hosts (DataNode, NodeManager, App with HDFS Impl and DFSClient, local disks). Each worker's CustomFS Impl connects to the Storage Services.]
Advantages:
• Open source / proprietary
• Multiple file access protocols supported
Considerations:
• These are file systems
• New Java code
• Possibly no compute / data locality
• May lag latest HDFS feature set
HDFS Virtualization
Is there another way?
HDFS Virtualization
• Virtualize the HDFS Implementation
• Implement a Hadoop Compatible File System – HCFS
  • Implement a HCFS via the over-the-wire protocol
  • Implement a HCFS via the FileSystem Java classes
• Virtualize the Hadoop Compatible File System Protocol
Virtualize the Hadoop Compatible File System Protocol
Instance:
• BlueData EPIC software – org.apache.hadoop.fs.FileSystem
Translate the Hadoop File System calls into native calls to the back-end file systems
Insert an intelligent caching layer
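To make the idea of a caching layer between the application and the back end concrete, here is a minimal sketch. It assumes a least-recently-used cache with deferred (write-back) writes, and it is not BlueData's implementation. Backend, MapBackend, and CachingLayer are hypothetical names, and read-ahead is left out to keep the example short.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

final class CacheDemo {
    public static void main(String[] args) {
        MapBackend backend = new MapBackend();
        CachingLayer cache = new CachingLayer(backend, 2);
        cache.write("/a", "v1");
        System.out.println("before flush: " + backend.data.get("/a"));
        cache.flush();
        System.out.println("after flush: " + backend.data.get("/a"));
    }
}

// Hypothetical contract for any back-end store (file system or object store).
interface Backend {
    String load(String path);
    void save(String path, String value);
}

// In-memory back end that counts reads, so cache hits are observable.
final class MapBackend implements Backend {
    final Map<String, String> data = new HashMap<>();
    int reads = 0;

    @Override
    public String load(String path) {
        reads++;
        return data.get(path);
    }

    @Override
    public void save(String path, String value) {
        data.put(path, value);
    }
}

final class CachingLayer {
    private final Backend backend;
    private final Set<String> dirty = new HashSet<>();
    private final LinkedHashMap<String, String> pages;

    CachingLayer(Backend backend, int capacity) {
        this.backend = backend;
        // An access-ordered LinkedHashMap yields least-recently-used eviction.
        this.pages = new LinkedHashMap<String, String>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String> eldest) {
                if (size() <= capacity) {
                    return false;
                }
                // Write a dirty page back before it is evicted.
                if (dirty.remove(eldest.getKey())) {
                    backend.save(eldest.getKey(), eldest.getValue());
                }
                return true;
            }
        };
    }

    String read(String path) {
        if (pages.containsKey(path)) {
            return pages.get(path); // cache hit: no back-end call
        }
        String value = backend.load(path);
        pages.put(path, value);
        return value;
    }

    void write(String path, String value) {
        dirty.add(path);        // mark first, so eviction always writes back
        pages.put(path, value); // the back end is updated later
    }

    void flush() {
        for (String path : dirty) {
            backend.save(path, pages.get(path));
        }
        dirty.clear();
    }
}
```

The application calls only read and write and never learns whether a page came from memory or from the back end.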
[Diagram: three Hadoop hosts (one with the NameNode and ResourceManager, two each with a DataNode, a NodeManager, and an App containing the HDFS Impl and DFSClient, with local disks). Each worker host also has a DTAP Impl and a DTAP Service, which connect to a separate host running a Storage Service with local disks.]
Virtualize the Hadoop Compatible File System Protocol
HDFS mem cache
[Diagram labels: Page Cache, HDFS Implementation, DFSClient, DataNode, page]
Application is cache aware
Extend mem cache to any file system or object storage
[Diagram labels: Page Cache, DTAP FileSystem Implementation, DTAPService, page; back ends: HDFS, GlusterFS, Object Store]
Application is cache unaware
Advantages:
• Not a file system
• Transparent in-memory cache
  write back, read ahead
• Supports multiple protocols
• Supports compute / data locality
Considerations:
• New Java code
• Open source / proprietary
• May lag latest HDFS feature set
Let’s Review
Outline
There are questions to be answered …
Three “What”s:
• What is HDFS?
• What does it mean to virtualize HDFS?
• What are the different methods of virtualization?
  Instances
  Advantages and considerations
And a “When”:
• When to choose HDFS storage virtualization?
A Few Words about Performance
Performance measurements are an art as well as a science
• Bottlenecks in applications
• Bottlenecks in infrastructure
  network, CPU, disk
• Configuration is key
  block size, distro, security
Performance – VMware BDE
Virtualize the HDFS Implementation
Source of graph: VMware Technical Paper – Virtualized Hadoop Performance with VMware vSphere 6 on High Performance Servers
Performance – Isilon
Implement a HCFS via the over-the-wire protocol
Source of graph: Stefan Radtke blog post
http://stefanradtke.blogspot.com/2015/05/comparing-hadoop-performance-on-das-and.html
Performance – Tachyon
Implement a HCFS via the FileSystem Java classes
Source of graph: Haoyuan Li
https://spark-summit.org/2014/wp-content/uploads/2014/07/Tachyon-Further-Improve-Sparks-Performance-Haoyuan-Li.pdf
Performance – BlueData
Virtualize the Hadoop Compatible File System Protocol
Source of graph: BlueData customer proof-of-concept results
Virtualized HDFS solutions provide good performance
Even with remote storage
Even in virtualized environments
When it comes to Hadoop storage virtualization, speed is not the whole story
Other factors to consider when implementing a virtualized HDFS option:
•Use of a virtualized compute environment
•Open source / proprietary solution
•Required Hadoop File System features
•Lifespan of Hadoop cluster
Other factors to consider when selecting storage:
• Data accessibility
  Hadoop File System protocol
  NFS, object store, other protocols
• Enterprise storage services
  data protection
  geographical replication
  offline backup
Consider a Virtualized HDFS Solution
When any of the following are true:
• Hadoop and non-Hadoop applications are required to access the same data
  Do not want to replicate the data
• Enterprise storage data services are required
• Need to run Hadoop in a virtual compute environment
Hadoop File System
Volume, Velocity, Variety
Virtualization