Second Workshop on Architectures and Systems for Big Data (ASBD 2012), June 9th, 2012
HFAA: A Generic Socket API for Hadoop File Systems
Adam Yee and Jeffrey Shafer, University of the Pacific
Hadoop MapReduce
- Hadoop: Open source framework for data-intensive computing
  - Inspired by Google's web indexing framework
  - Uses MapReduce parallel programming model
- Enables scalable computation on a commodity cluster computer
- Popular and in widespread use today
  - Amazon, Facebook, Microsoft Bing, Yahoo, …
ASBD'12
June 9th, 2012
Hadoop Software Stack
- Hadoop is an all-in-one software framework that ties the cluster together
  - Computation: Execute Map and Reduce tasks
  - Storage: User-level filesystem for applications
    - HDFS: Hadoop Distributed File System
  - Scheduling: Distribute jobs across cluster
  - Reliability: Data replication, re-spawning failed jobs
- Designed for portability (written in Java)
Motivating Challenge
1. I already have a cluster computer
2. I already have a different distributed file system running (and expertise in managing it)
   - File system examples: PVFS, Ceph, Lustre, GPFS
   - Will refer to them collectively as NewDFS for the remainder of talk
3. Hadoop (MapReduce) is only a small part of my computation workload
Hadoop needs to come to me, and not the other way around…
Motivating Challenge
- This is harder than it sounds!
- Hadoop is tightly integrated with HDFS (Hadoop Distributed File System)
  - Holds data input and computation output
How can I use Hadoop to process data stored in other distributed file systems?
Use Hadoop with NewDFS
- Method 1: Copy the data from NewDFS to HDFS
  - Pros: Easy :)
  - Cons: Slow and wastes storage space
- Method 2: Mount NewDFS using a POSIX driver (which can be directly accessed in Hadoop)
  - Pros: Easy; Faster than making a copy first! (but still slow)
  - Cons: Lose Hadoop performance optimizations (like data locality)
- Method 3: Custom software layer integrates directly with Hadoop
  - Pros: Near-native speed
  - Cons: Highest complexity; Requires detailed knowledge of Hadoop and NewDFS architecture
Using Hadoop with NewDFS
Past Projects
- Hadoop with CloudStore
- Hadoop with Ceph
- Hadoop with GPFS
- Hadoop with Lustre
- Hadoop with PVFS
- So is this a solved problem?
  - No!
Limitations of Prior Work
- All of these are point solutions
- Different implementation strategies
- Each requires different software patches to Hadoop
Hadoop Filesystem Agnostic API (HFAA)
- Universal, generic interface
- Allows Hadoop to run on any file system that supports network sockets
  - Since we're targeting distributed file systems, that should include everyone
- Design moves integration responsibilities outside of Hadoop
  - Does not require user or developer knowledge of the Hadoop framework
Traditional Hadoop
[Figure: software stack on one cluster node. Labels: Hadoop MapReduce (Application(s), TaskTracker, FileSystem + HDFS Client); HDFS DataNode (HDFS Daemon); Java Runtime; Operating System (Linux); Disk, Disk.]
- FileSystem is Hadoop's storage API class
  - HDFS
  - Local disk
  - Amazon S3
- Java implementation
Hadoop + HFAA
[Figure: software stack on one cluster node with HFAA. Labels: Hadoop MapReduce (Application(s), TaskTracker, FileSystem + HFAA Client); "NewDFS" (HFAA Server, "NewDFS" Daemon); Java Runtime; Operating System (Linux); Disk, Disk.]
- Client: Integrates with Hadoop (Java)
  - Reusable for any NewDFS
- Server: Integrates with NewDFS (any language)
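The client/server split can be sketched as a request and a reply crossing a socket. The slides do not describe the actual HFAA wire format, so everything here is an assumption made for illustration: the one-line text protocol, the `GET_FILE_STATE` operation name, the `handle_request` and `demo` helpers, and the in-memory dict standing in for NewDFS. `socket.socketpair()` stands in for the network connection between the Java client and the server.

```python
import socket

# Hypothetical wire format, for illustration only: one text line per request,
# "<OPERATION> <path>\n", answered by one text line.

def handle_request(conn, files):
    """Stand-in for the HFAA Server: read one request line, send one reply."""
    with conn.makefile("r", encoding="utf-8") as reader:
        request = reader.readline().strip()
    operation, _, path = request.partition(" ")
    if operation == "GET_FILE_STATE" and path in files:
        reply = f"OK {len(files[path])}"  # report the file length in bytes
    else:
        reply = "ERROR"
    conn.sendall((reply + "\n").encode("utf-8"))


def demo(path):
    """Stand-in for the HFAA Client: send one request and return the reply."""
    files = {"/data/input.txt": b"hello hadoop"}  # plays the role of NewDFS
    client_end, server_end = socket.socketpair()
    with client_end, server_end:
        client_end.sendall(f"GET_FILE_STATE {path}\n".encode("utf-8"))
        handle_request(server_end, files)
        with client_end.makefile("r", encoding="utf-8") as reader:
            return reader.readline().strip()
```

Only `handle_request` touches the storage backend, which mirrors the idea that the client side stays the same while the server side is rewritten per file system.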
HFAA Operations
- Fundamental Hadoop storage operations
- How do we know the API is complete?
  - Every class that extends FileSystem (including HFAA) must implement these abstract methods

Function
- Open
- Create
- Append
- Rename
- Delete
- List Status
- Make Directories
- Get File State
- Write
- Read
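The completeness argument above rests on abstract methods: a subclass that leaves any of them out cannot be used. A rough Python analogue of that mechanism is sketched below. The `StorageApi` class, its snake_case method names, and their parameters are assumptions of this sketch, chosen to mirror the ten operations listed; they are not Hadoop's actual `FileSystem` signatures.

```python
from abc import ABC, abstractmethod


class StorageApi(ABC):
    """Hypothetical analogue of an abstract storage API with ten operations."""

    @abstractmethod
    def open(self, path): ...

    @abstractmethod
    def create(self, path): ...

    @abstractmethod
    def append(self, path): ...

    @abstractmethod
    def rename(self, src, dst): ...

    @abstractmethod
    def delete(self, path): ...

    @abstractmethod
    def list_status(self, path): ...

    @abstractmethod
    def make_directories(self, path): ...

    @abstractmethod
    def get_file_state(self, path): ...

    @abstractmethod
    def write(self, handle, data): ...

    @abstractmethod
    def read(self, handle, length): ...


class IncompleteBackend(StorageApi):
    """Implements only one operation, so Python refuses to instantiate it."""

    def open(self, path):
        return f"handle:{path}"


# A do-nothing backend that overrides every abstract method, built in one step.
NoOpBackend = type(
    "NoOpBackend",
    (StorageApi,),
    {name: (lambda self, *args: None) for name in StorageApi.__abstractmethods__},
)


def can_instantiate(cls):
    """Return True if cls implements every abstract operation."""
    try:
        cls()
    except TypeError:
        return False
    return True
```

In this sketch, `can_instantiate(IncompleteBackend)` is false until all ten operations are overridden, which is the same guarantee the slide relies on for any class extending `FileSystem`.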
Evaluation
System
- HFAA prototype written for Hadoop + PVFS
  - Hadoop 0.20.204.0
    - Released Sept 5th, 2011
  - OrangeFS 2.8.4
    - Branch of PVFS
- Run on small 4-node cluster
- Streaming reads and writes
Architectures Compared
1. Raw disk (outside Hadoop)
2. Hadoop with native HDFS
3. Hadoop with OrangeFS using POSIX driver
4. Hadoop with OrangeFS using HFAA
Evaluation
[Figure: bar chart of bandwidth (MB/s, axis from 0 to 140) for streaming write and streaming read. Series: Raw Disk, Hadoop+HDFS, Hadoop+OFS, Hadoop+HFAA+OFS. Bar values are not recoverable from this transcript.]
Next Steps
Performance Optimizations
- Expose file system locality to Hadoop scheduler
  - More important when we test on larger clusters
- Socket re-use between requests
  - Lack of re-use is less detrimental because HDFS moves a 64 MB data block per request
- Tune, tune, tune!
Broader Compatibility
- Implement HFAA Server component for other popular file systems
  - Lustre?
  - Ceph?
  - Others?
- Release for Hadoop 1.0.x family
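The socket re-use point can be made concrete with a little arithmetic: the larger the block moved per request, the smaller the share of time spent opening a new connection. The numbers below (100 MB/s of sustained bandwidth and 1 ms of connection setup) are illustrative assumptions, not measurements from the talk, and `setup_overhead_fraction` is a helper named for this sketch only.

```python
MB = 1024 * 1024


def setup_overhead_fraction(block_bytes, bandwidth_bytes_per_s, setup_s):
    """Fraction of total request time spent on connection setup."""
    transfer_s = block_bytes / bandwidth_bytes_per_s
    return setup_s / (setup_s + transfer_s)


# Assumed figures for illustration: 100 MB/s bandwidth, 1 ms setup per request.
large_block = setup_overhead_fraction(64 * MB, 100 * MB, 0.001)
small_block = setup_overhead_fraction(64 * 1024, 100 * MB, 0.001)
```

Under these assumptions the setup cost is well under 1% of a 64 MB request, but more than half of a 64 KB request, which is why re-use would matter far more for small transfers.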
Summary
- Hadoop Filesystem Agnostic API (HFAA)
  - Generic interface supports any distributed file system
- Implementation includes two components
  - HFAA Client: Interfaces with Hadoop
  - HFAA Server: Interfaces with PVFS
- Client can be re-used with all future filesystems
  - Have a new filesystem you like?
  - You only need to understand your filesystem and our simple API to link it to Hadoop
  - You don't have to be a Hadoop expert
Questions?