Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition


Presentation given at SQLSaturday #326 Tampa, FL BA Edition https://www.sqlsaturday.com/326/schedule.aspx

Transcript of Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

Page 1: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

© 2014 Trace3, All rights reserved.

BIG DATA INTELLIGENCE PRACTICE

HADOOP: PAST, PRESENT AND FUTURE

Page 2: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

Roadmap (~1 hour)

1- What Makes Up Hadoop 1.x?
2- What's New In Hadoop 2.x?
3- The Future Of Hadoop …

Page 3: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

WHAT MAKES UP HADOOP 1.0?

Page 4: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

What's a "Node"?

Node aka Server

• Compute
• Storage
• Processes / Daemons / Services
• Memory
• Operating System

Page 5: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

Hadoop 1.0: HDFS + MapReduce

[Diagram: a Client, a NameNode, a JobTracker and four worker nodes, each running DataNode / TaskTracker. Blocks labelled 1-1, 1-2 and 1-3 are shown.]

Page 6: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

Hadoop 1.0: HDFS + MapReduce

[Diagram: the same cluster (Client, NameNode, JobTracker, four DataNode / TaskTracker nodes), now with blocks labelled 1-1 through 4-3 spread across the worker nodes and Map and Reduce tasks running on them.]
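The flow in this diagram, where Map tasks run against the blocks held on each DataNode / TaskTracker and Reduce tasks combine their output, can be sketched in a few lines of plain Python. This is only an illustration of the programming model: the function names (`map_block`, `shuffle`, `reduce_key`, `run_job`), the sample records and the max-temperature job are all hypothetical, and nothing here uses the Hadoop API.

```python
from collections import defaultdict

# Hypothetical input: each inner list stands in for one HDFS block
# holding "year,temperature" records.
blocks = [
    ["2012,31", "2013,28"],
    ["2012,35", "2014,30"],
    ["2013,33", "2014,27"],
]


def map_block(records):
    """Map phase: turn each record of one block into a (key, value) pair."""
    for record in records:
        year, temp = record.split(",")
        yield year, int(temp)


def shuffle(mapped_pairs):
    """Shuffle phase: group values by key across all map outputs."""
    grouped = defaultdict(list)
    for key, value in mapped_pairs:
        grouped[key].append(value)
    return grouped


def reduce_key(key, values):
    """Reduce phase: collapse the values for one key into a single result."""
    return key, max(values)


def run_job(blocks):
    """Run map over every block, shuffle, then reduce each key."""
    mapped = (pair for block in blocks for pair in map_block(block))
    grouped = shuffle(mapped)
    return dict(reduce_key(key, values) for key, values in grouped.items())


result = run_job(blocks)
print(result)
```

In a real cluster the map calls would run in parallel on the nodes that hold each block; here they run one after another in a single process.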

Page 7: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

MapReduce v1 Limitations

• Scalability: Maximum cluster size is 4,000 nodes and maximum concurrent tasks is 40,000.
• Availability: JobTracker failure kills all queued and running jobs.
• Resources Partitioned into Map and Reduce: Hard partitioning of Map and Reduce slots led to low resource utilization.
• No Support for Alternate Paradigms / Services: Only MapReduce batch jobs, nothing else.
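The cost of hard partitioning can be shown with a little arithmetic. The sketch below is a toy model, not Hadoop code: the slot counts, the pending task counts and the helper names are assumptions chosen for illustration, and it treats every task as needing one equally sized unit of capacity.

```python
def slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks):
    """Hard partitioning: map tasks use only map slots, reduce tasks only reduce slots."""
    running = min(map_slots, map_tasks) + min(reduce_slots, reduce_tasks)
    return running / (map_slots + reduce_slots)


def pooled_utilization(total_units, map_tasks, reduce_tasks):
    """Pooled capacity: any free unit can run any pending task."""
    running = min(total_units, map_tasks + reduce_tasks)
    return running / total_units


# Hypothetical scenario: 8 map slots and 4 reduce slots during a map-heavy phase.
map_slots, reduce_slots = 8, 4
map_tasks, reduce_tasks = 20, 0

partitioned = slot_utilization(map_slots, reduce_slots, map_tasks, reduce_tasks)
pooled = pooled_utilization(map_slots + reduce_slots, map_tasks, reduce_tasks)

print(f"hard-partitioned slots: {partitioned:.0%}")
print(f"pooled capacity:        {pooled:.0%}")
```

In this toy scenario the four reduce slots sit idle while map tasks queue, which is the kind of low utilization the slide describes.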

Page 8: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

Hadoop 1.0: Single Use System

HADOOP 1.0: Single Use System, Batch Apps

• HDFS (redundant, reliable storage)
• MapReduce (cluster resource management and data processing)
• Pig, Hive

Page 9: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

WHAT'S NEW IN HADOOP 2.0?

Page 10: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

YARN

YARN replaces MapReduce v1 as the cluster resource manager; MapReduce itself continues as one application running on YARN.

YARN: Yet Another Resource Negotiator

YARN will be the de facto distributed operating system for Big Data.

Page 11: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

YARN = BIG DATA

Page 12: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

YARN: No Longer Just Batch Apps

Store DATA in one place. Interact with that data in MULTIPLE WAYS, with Predictable Performance and Quality of Service.

Applications Run Natively IN Hadoop:

• BATCH (MapReduce)
• INTERACTIVE (Tez)
• ONLINE (HBase)
• STREAMING (DataTorrent)
• GRAPH (Giraph)

YARN (cluster resource management)

HDFS2 (redundant, reliable storage)

Page 13: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

YARN: Applications

All running on the same Hadoop cluster, giving applications access to the same source data!

• MapReduce v2
• Real-Time Stream Processing
• Master-Worker Online
• In-Memory
• Apache Storm

Page 14: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

YARN: Quickly Maturing

Timeline: 2010, 2011, 2012, 2013, 2014, Today

• Conceived at Yahoo!
• Alpha Releases – 2.0
• Beta Releases – 2.1
• GA Released – 2.2
• Version 2.3
• Version 2.4
• Version 2.5

200,000+ nodes, 800,000+ jobs daily, 10 million+ hours of compute daily

Page 15: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

YARN: What Has Changed?

YARN                                             MRv1
ResourceManager (RM) + ApplicationMaster (AM)    JobTracker (JT)
Scheduler                                        Scheduler
NodeManager (NM)                                 TaskTracker (TT)
Container                                        Map & Reduce Slot

[Diagram: under YARN, a ResourceManager with its Scheduler sits above NodeManagers that host an ApplicationMaster and Containers; under MRv1, a JobTracker with its Scheduler sits above TaskTrackers that host Map and Reduce slots.]

Page 16: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

The 6 Benefits Of YARN

• Scale
• New programming models and services
• Improved cluster utilization
• Agility
• Backwards compatible with MapReduce v1
• Mixed workloads on the same source of data

Page 17: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

THE FUTURE OF HADOOP

Page 18: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

SQL on Hadoop

• Speed: Deliver interactive query performance.
• SQL: Support a broad array of SQL semantics for analytic applications running against Hadoop.
• Scale: A SQL interface to Hadoop designed for queries that scale from terabytes to petabytes.

Page 19: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

SQL on Hadoop

• Hive on Apache Tez: Hortonworks HDP2
• Hive on Apache Spark: Cloudera CDH5
• Apache Drill: MapR M7
• Cloudera Impala: Cloudera CDH5
• Pivotal HAWQ: Pivotal Big Data Suite

Page 20: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

Apache Spark

Stack: Apache Spark (Databricks), running on YARN (cluster resource management), over HDFS2 (redundant, reliable storage)

• Programming Languages: Java, Scala, Python, R*
• Interactive Shell: Ability to write code and get output immediately.
• Faster by up to ~100x (compared with MapReduce): Due to how it handles data in memory.

Page 21: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

Apache Spark – Wordcount
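The code shown on this slide did not survive in the transcript. As a stand-in, here is a minimal word count in plain Python that follows the usual three-step shape of such examples: split lines into words, pair each word with a 1, then sum per word. It is a sketch using only the standard library, not the Spark API, and the sample lines and the `word_count` name are hypothetical.

```python
from collections import Counter
from itertools import chain


def word_count(lines):
    """Count word occurrences across an iterable of text lines."""
    # Step 1: split every line into words and flatten into one stream.
    words = chain.from_iterable(line.lower().split() for line in lines)
    # Step 2: pair each word with a count of 1.
    pairs = ((word, 1) for word in words)
    # Step 3: sum the counts per word.
    counts = Counter()
    for word, one in pairs:
        counts[word] += one
    return dict(counts)


# Hypothetical input standing in for a text file stored in HDFS.
lines = ["to be or not to be", "that is the question"]
counts = word_count(lines)
print(counts)
```

Everything here runs in one process on an in-memory list; a distributed engine would apply the same steps to partitions of a much larger dataset.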

Page 22: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

HOYA: HBase (NoSQL) on YARN

• Dynamic Scaling: On-demand cluster size. Increase and decrease the size with load.
• Easier Deployment: APIs to create, start, stop and delete HBase clusters.
• Availability: Recover from Region Server loss with a new container.

Page 23: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

Apache REEF

Retainable Evaluator Execution Framework

• Machine Learning: Framework well suited for building machine learning jobs.
• Scalable / Fault Tolerant: Makes it easy to implement scalable, fault-tolerant runtime environments for a range of computational models.
• Maintain State: Users can build jobs that utilize data from where it's needed and also maintain state after jobs are done.

Page 24: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

Real-Time Stream Processing

Apache Storm

Streaming

Page 25: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

Heterogeneous Storage

THEN: NameNode + Storage
NOW: NameNode + SATA / SSD / Fusion IO

Page 26: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

Hadoop Roadmap

• Apache Hadoop 2.5 (Q3 2014)
  – NodeManager Restart w/o disruption
• Apache Hadoop 2.6 (Q4 2014)
  – Memory As Storage Tier
  – Dynamic Resource Configuration
  – Support For Docker Containers

Page 27: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

I KNOW YOU HAVE QUESTIONS

Page 28: Hadoop: Past, Present and Future - v2.2 - SQLSaturday #326 - Tampa BA Edition

THANK YOU!

http://bigdatajoe.io/

http://bigdatacentric.com/

@bigdatajoerossi    

[email protected]