HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ ·...

16
Second Workshop on Architectures and Systems for Big Data (ASBD 2012) June 9 th , 2012 HFAA: A Generic Socket API for Hadoop File Systems AdamYee and Jeffrey Shafer University of the Pacific

Transcript of HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ ·...

Page 1: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

ì  Second  Workshop  on  Architectures  and  Systems  for  Big  Data  (ASBD  2012)  June  9th,  2012  

HFAA:  A  Generic  Socket  API  for  Hadoop  File  Systems  

Adam  Yee  and  Jeffrey  Shafer  University  of  the  Pacific  

Page 2: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Hadoop  MapReduce  

ì  Hadoop:  Open  source  framework  for  data-­‐intensive  compu<ng  ì  Inspired  by  Google’s  web  indexing  framework  ì  Uses  MapReduce  parallel  programming  model  

ì  Enables  scalable  computa<on  on  a  commodity  cluster  computer  

ì  Popular  and  in  widespread  use  today  ì  Amazon,  Facebook,  MicrosoL  Bing,  Yahoo,  …  

ASBD'12  

2  

June  9th,  2012  

Page 3: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Hadoop  Software  Stack  

ì  Hadoop  is  an  all-­‐in-­‐one  so?ware  framework  that  <es  the  cluster  together  ì  ComputaDon  –  Execute  Map  and  Reduce  tasks  ì  Storage  –  User-­‐level  filesystem  for  applica<ons  

ì  HDFS  –  Hadoop  Distributed  File  System  

ì  Scheduling  –  Distribute  jobs  across  cluster  ì  Reliability  –  Data  replica<on,  re-­‐spawning  failed  

jobs  

ì  Designed  for  portability  (WriXen  in  Java)  

ASBD'12  

3  

June  9th,  2012  

Page 4: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Motivating  Challenge  

1.  I  already  have  a  cluster  computer  

2.  I  already  have  a  different  distributed  file  system  running  (and  exper<se  in  managing  it)  ì  File  system  examples:  PVFS,  Ceph,  Lustre,  GPFS  ì  Will  refer  to  them  collec<vely  as  NewDFS  for  the  

remainder  of  talk  

3.  Hadoop  (MapReduce)  is  only  a  small  part  of  my  computa<on  workload  

ASBD'12  

4  

June  9th,  2012  

Hadoop  needs  to  come  to  me,  and  not  the  other  way  around…  

Page 5: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Motivating  Challenge  

ì  This  is  harder  than  it  sounds!  

ì  Hadoop  is  Dghtly  integrated  with  HDFS  (Hadoop  Distributed  File  System)  ì  Holds  data  input  and  computa<on  output  

ASBD'12  

5  

June  9th,  2012  

How  can  I  use  Hadoop  to  process  data  stored  in  other  distributed  file  systems?  

Page 6: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Use  Hadoop  with  NewDFS  

ì  Method  1:  Copy  the  data  from  NewDFS  to  HDFS  ì  Pros:  Easy  J  ì  Cons:  Slow  and  wastes  storage  space  

ì  Method  2:  Mount  NewDFS  using  a  POSIX  driver    (which  can  be  directly  accessed  in  Hadoop)  ì  Pros:  Easy;  Faster  than  making  a  copy  first!  (but  s<ll  slow)  ì  Cons:  Lose  Hadoop  performance  op<miza<ons  (like  data  locality)  

ì  Method  3:  Custom  soLware  layer  integrates  directly  with  Hadoop  ì  Pros:  Near-­‐na<ve  speed  ì  Cons:  Highest  complexity;  Requires  detailed  knowledge  of  Hadoop  

and  NewDFS  architecture  

June  9th,  2012  ASBD'12  

6  

Page 7: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Using  Hadoop  with  NewDFS  

Past  Projects  ì  Hadoop  with  CloudStore  

ì  Hadoop  with  Ceph  

ì  Hadoop  with  GPFS  

ì  Hadoop  with  Lustre  

ì  Hadoop  with  PVFS  

ì  So  is  this  a  solved  problem?  ì  No!    

Limita<ons  of  Prior  Work  ì  All  of  these  are  point  

solu<ons  ì  Different  implementa<on  

strategies  ì  Each  require  different  

soLware  patches  to    Hadoop  

June  9th,  2012  ASBD'12  

7  

Page 8: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Hadoop  Filesystem  Agnostic  API  (HFAA)    

ì  Universal,  generic  interface  

ì  Allows  Hadoop  to  run  on  any  file  system  that  supports  network  sockets  ì  Since  we’re  targeBng  distributed  file  systems,  that  

should  include  everyone  

ì  Design  moves  integra<on  responsibili<es  outside  of  Hadoop  ì  Does  not  require  user  or  developer  knowledge  of  

the  Hadoop  framework  

ASBD'12  

8  

June  9th,  2012  

Page 9: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Traditional  Hadoop  

June  9th,  2012  ASBD'12  

9  

Hadoop  MapReduce  

HDFS  DataNode  

Applica<on(s)  

TaskTracker  

FileSystem  +  HDFS  Client   HDFS  Daemon  

Java  Run<me  

Opera<ng  System  (Linux)  

Disk   Disk  

Cluster  Node  ì  FileSystem  

is  Hadoop’s  storage  API  class  ì  HDFS  ì  Local  disk  ì  Amazon  S3  

ì  Java  implementa<on  

Page 10: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Hadoop  +  HFAA  

June  9th,  2012  ASBD'12  

10  

Hadoop  MapReduce  

“NewDFS”  

Applica<on(s)  

TaskTracker  

FileSystem  +  HFAA  Client   HFAA  Server  

Java  Run<me  

Opera<ng  System  (Linux)  

Disk   Disk  

Cluster  Node  

“NewDFS”  Daemon  

ì  Client:  Integrates  with  Hadoop  (Java)  ì  Reusable  for  

any  NewDFS    

ì  Server:  Integrates  with  NewDFS  (Any  language)  

Page 11: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

HFAA  Operations  

ì  Fundamental  Hadoop  storage  opera<ons  

ì  How  do  we  know  API  is  complete?  ì  Every  class  that  extends  FileSystem  

(including  HFAA)  must  implement  these  abstract  methods  

June  9th,  2012  ASBD'12  

11  

FuncDon  

Open  

Create  

Append  

Rename  

Delete  

List  Status  

Make  Directories  

Get  File  State  

Write  

Read  

Page 12: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Evaluation  

System  ì  HFAA  prototype  wriXen  for  

Hadoop  +  PVFS  ì  Hadoop  0.20.204.0  

ì  Released  Sept  5th,  2011  

ì  OrangeFS  2.8.4  ì  Branch  of  PVFS  

ì  Run  on  small  4-­‐node  cluster  

ì  Streaming  reads  and  writes  

Architectures  Compared  1.  Raw  disk  (outside  Hadoop)  

2.  Hadoop  with  na<ve  HDFS  

3.  Hadoop  with  OrangeFS    using  POSIX  driver  

4.  Hadoop  with  OrangeFS    using  HFAA  

June  9th,  2012  ASBD'12  

12  

Page 13: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Evaluation  

June  9th,  2012  ASBD'12  

13  

0  

20  

40  

60  

80  

100  

120  

140  

Streaming  Write   Streaming  Read  

Band

width  (M

B/s)  

Raw  Disk   Hadoop+HDFS  

Hadoop+OFS   Hadoop+HFAA+OFS  

Page 14: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Next  Steps  

Performance  Op<miza<ons  ì  Expose  file  system  locality  to  

Hadoop  scheduler  ì  More  important  when  we  

test  on  larger  clusters  

ì  Socket  re-­‐use  between  requests  ì  Less  detrimental  because  

HDFS  moves  64MB  data  block  per  request  

ì  Tune,  tune,  tune!  

Broader  Compa<bility  ì  Implement  HFAA  Server  

component  for  other  popular  file  systems  ì  Lustre?    ì  Ceph?  ì  Others?  

ì  Release  for  Hadoop  1.0.x  family  

June  9th,  2012  ASBD'12  

14  

Page 15: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Summary  

ì  Hadoop  Filesystem  AgnosDc  API  (HFAA)  ì  Generic  interface  supports  any  distributed  file  system  

ì  Implementa<on  includes  two  components  ì  HFAA  Client  –  Interfaces  with  Hadoop  ì  HFAA  Server  –  Interfaces  with  PVFS  

ì  Client  can  be  re-­‐used  with  all  future  filesystems  ì  Have  a  new  filesystem  you  like?  ì  You  only  need  to  understand  your  filesystem  and  our  

simple  API  to  link  it  to  Hadoop  ì  You  don’t  have  to  be  a  Hadoop  expert  

June  9th,  2012  ASBD'12  

15  

Page 16: HFAA:’A’Generic’Socket’API’ forHadoop’File’Systems’ · Hadoop’Software’Stack’! Hadoop!is!an!all9in9one!so?ware,frameworkthat,

Questions?  

June  9th,  2012  ASBD'12  

16  

?  ?  ?