Good Ole Hadoop! How We Love Thee!

Good Ole Hadoop! How We Love Thee! Jerome E. Mitchell, Indiana University

Transcript of Good Ole Hadoop! How We Love Thee!

Page 1:

Good Ole Hadoop! How We Love Thee!

Jerome E. Mitchell

Indiana University

Page 2:

Outline

• Why Should You Care about Hadoop?
• What Exactly is Hadoop?
• Overview of Hadoop MapReduce and Hadoop Distributed File System (HDFS)
• Workflow
• Conclusions and References
• Getting Started with Hadoop

Page 3:

Why Should You Care? – Lots of Data

LOTS OF DATA EVERYWHERE

Page 4:

Why Should You Care? – Even Grocery Stores Care

September 2, 2009

Tesco, British Grocer, Uses Weather to Predict Sales
By Julia Werdigier

LONDON — Britain often conjures images of unpredictable weather, with downpours sometimes followed by sunshine within the same hour — several times a day. “Rapidly changing weather can be a real challenge,” Jonathan Church, a Tesco spokesman, said in a statement. “The system successfully predicted temperature drops during July that led to a major increase in demand for soup, winter vegetables and cold-weather puddings.”

Page 5:

What is Hadoop?

• At Google, MapReduce operations are run on a special file system called the Google File System (GFS) that is highly optimized for this purpose.

• GFS is not open source.
• Doug Cutting and Yahoo! reverse engineered GFS and called it the Hadoop Distributed File System (HDFS).

• The software framework that supports HDFS, MapReduce, and other related entities is called the Hadoop project, or simply Hadoop.

• This is open source and distributed by Apache.

Page 6:

What Exactly is Hadoop?

• A growing collection of subprojects

Page 7:

Why Hadoop for BIG Data?

• Most credible open source toolkit for large-scale, general purpose computing
• Backed by [company logo]
• Used by [company logos] and many others
• Increasing support from [company logo] web services
• Hadoop closely imitates infrastructure developed by [company logo]
• Hadoop processes petabytes daily, right now

Page 8:

Motivation for MapReduce

• Large-Scale Data Processing
  – Want to use 1000s of CPUs
• But don't want the hassle of managing things

• MapReduce Architecture provides
  – Automatic Parallelization and Distribution
  – Fault Tolerance
  – I/O Scheduling
  – Monitoring and Status Updates

Page 9:

MapReduce Model

• Input & Output: a set of key/value pairs
• Two primitive operations
  – map:    (k1, v1) → list(k2, v2)
  – reduce: (k2, list(v2)) → list(k3, v3)

• Each map operation processes one input key/value pair and produces a set of key/value pairs
• Each reduce operation
  – Merges all intermediate values (produced by map ops) for a particular key
  – Produces final key/value pairs

• Operations are organized into tasks
  – Map tasks: apply the map operation to a set of key/value pairs
  – Reduce tasks: apply the reduce operation to intermediate key/value pairs

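These two primitives map directly onto Hadoop's Mapper and Reducer classes. As a minimal, hedged sketch (not part of the original slides; the class and field names are ours), a word-count pair written against the org.apache.hadoop.mapreduce API might look like this:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// map: (k1, v1) -> list(k2, v2): one line of text in, (word, 1) pairs out
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer itr = new StringTokenizer(value.toString());
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // emit (word, 1)
        }
    }
}

// reduce: (k2, list(v2)) -> list(k3, v3): merge all counts for one word
class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        result.set(sum);
        context.write(key, result);      // emit (word, total count)
    }
}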

Page 10:

HDFS Architecture

Page 11:

The Workflow

• Load data into the Cluster (HDFS writes)
• Analyze the data (MapReduce)
• Store results in the Cluster (HDFS)
• Read the results from the Cluster (HDFS reads)
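The two HDFS steps in this workflow can also be driven from code. Below is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem API; the paths and file names are illustrative, not from the slides.

import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWorkflow {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();   // picks up core-site.xml from the classpath
        FileSystem fs = FileSystem.get(conf);

        // 1. Load data into the cluster (HDFS write)
        fs.copyFromLocalFile(new Path("File.txt"),
                             new Path("/user/demo/input/File.txt"));

        // 2./3. Analyze the data (MapReduce) and store results in the cluster:
        //        submitted separately, e.g. by the launching program shown later.

        // 4. Read the results from the cluster (HDFS read)
        Path result = new Path("/user/demo/output/part-00000");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(result)))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}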

Page 12:

Hadoop Server Roles

[Diagram] The Client talks to two masters. Distributed data processing (MapReduce): the master is the Job Tracker, and the slaves run Task Trackers. Distributed data storage (HDFS): the master is the Name Node (with a Secondary Name Node), and the slaves run Data Nodes. Each slave machine hosts both a Data Node and a Task Tracker.

Page 13:

Hadoop Cluster

[Diagram] Racks of servers (Rack 1, Rack 2, Rack 3, … Rack N), each server running a DataNode and TaskTracker (DN + TT). Every rack has its own switch, and the rack switches uplink to core switches.

Page 14:

Hadoop Rack Awareness – Why?

[Diagram] Data Nodes are spread across racks (Rack 1, Rack 5, Rack 9), each behind its own switch. The Name Node keeps two kinds of metadata: the block map for each file (Results.txt = BLK A: DN1, DN5, DN6; BLK B: DN7, DN1, DN2; BLK C: DN5, DN8, DN9) and a rack-aware map of which Data Nodes live in which rack (Rack 1: Data Node 1, 2, 3; Rack 5: Data Node 5, 6, 7; …).

Page 15:

Sample Scenario

Huge file (File.txt) containing all emails sent to Indiana University…

How many times did our customers type the word "refund" into emails sent to Indiana University?

Page 16:

Writing Files to HDFS

[Diagram] The Client asks the NameNode where to place File.txt. The file is split into blocks (BLK A, BLK B, BLK C), which are written to Data Nodes (Data Node 1, 5, 6, … N) and replicated.

Page 17:

Data Node Reading Files from HDFS

[Diagram] Data Nodes sit in racks (Rack 1, Rack 5, Rack 9) behind switches. The rack-aware Name Node supplies the block metadata (Results.txt = BLK A: DN1, DN5, DN6; BLK B: DN7, DN1, DN2; BLK C: DN5, DN8, DN9) along with the rack map (Rack 1: Data Node 1, 2, 3; …).

Page 18:

MapReduce: Three Phases

1. Map
2. Sort
3. Reduce

Page 19:

Data Processing: Map

[Diagram] Using block locations from the Name Node, the Job Tracker launches a Map Task on each Data Node that holds a block of File.txt (BLK A on Data Node 1, BLK B on Data Node 5, BLK C on Data Node 9). Each task produces a local count: Refund = 3, Refund = 0, Refund = 11.

Page 20:

MapReduce: The Map Step

[Diagram] Input key-value pairs (k, v) are fed to map tasks, and each map call emits a list of intermediate key-value pairs.

Page 21:

The Map (Example)

Inputs → map tasks (M = 3) → partitions (intermediate files) (R = 2)

• Map task 1, inputs: "When in the course of human events it …" and "It was the best of times and the worst of times…"
  – Partition 1: (in,1) (the,1) (of,1) (it,1) (it,1) (was,1) (the,1) (of,1) …
  – Partition 2: (when,1) (course,1) (human,1) (events,1) (best,1) …

• Map task 2, input: "This paper evaluates the suitability of the …"
  – Partition 1: (the,1) (of,1) (the,1) …
  – Partition 2: (this,1) (paper,1) (evaluates,1) (suitability,1) …

• Map task 3, input: "Over the past five years, the authors and many…"
  – Partition 1: (the,1) (the,1) (and,1) …
  – Partition 2: (over,1) (past,1) (five,1) (years,1) (authors,1) (many,1) …

Page 22:

Data Processing: Reduce

[Diagram] The Client submits the job to the Job Tracker, which assigns a Reduce Task (here on Data Node 3). The Reduce Task collects the intermediate counts from the Map Tasks on Data Nodes 1, 5, and 9 (Refund = 3, Refund = 0, …), combines them, and writes Results.txt to HDFS.

Page 23:

MapReduce: The Reduce Step

[Diagram] The intermediate key-value pairs are grouped by key into key-value groups (k, list(v)); each group is passed to a reduce call, which emits the output key-value pairs.

Page 24:

The Reduce (Example)

One reduce task reads its partition of the intermediate files (R = 2) from every map task:

  (in,1) (the,1) (of,1) (it,1) (it,1) (was,1) (the,1) (of,1) …
  (the,1) (of,1) (the,1) …
  (the,1) (the,1) (and,1) …

sort → (and,(1)) (in,(1)) (it,(1,1)) (the,(1,1,1,1,1,1)) (of,(1,1,1)) (was,(1))

reduce → (and,1) (in,1) (it,2) (of,3) (the,6) (was,1)

Note: only one of the two reduce tasks is shown.

Page 25:

Clients Reading Files from HDFS

[Diagram] The Client asks the NameNode for the block metadata of Results.txt (BLK A: DN1, DN5, DN6; BLK B: DN7, DN1, DN2; BLK C: DN5, DN8, DN9) and then reads the blocks directly from Data Nodes in the different racks (Rack 1, Rack 5, Rack 9).

Page 26:

Conclusions  

• We introduced the MapReduce programming model for processing large-scale data

• We discussed the supporting Hadoop Distributed File System

• The concepts were illustrated using a simple example
• We reviewed some important parts of the source code for the example

Page 27:

References    

1. Apache Hadoop Tutorial: http://hadoop.apache.org, http://hadoop.apache.org/core/docs/current/mapred_tutorial.html

2. Dean, J. and Ghemawat, S. 2008. MapReduce: simplified data processing on large clusters. Communications of the ACM 51, 1 (Jan. 2008), 107-113.

3. Cloudera Videos by Aaron Kimball: http://www.cloudera.com/hadoop-training-basic

Page 28:

FINALLY, A HANDS-ON ASSIGNMENT!

Page 29:

Page 30:

Lifecycle of a MapReduce Job

[Diagram] A program supplies a Map function and a Reduce function; Hadoop then runs this program as a MapReduce job.

Page 31:

Job Configuration Parameters

• 190+ parameters in Hadoop
• Set manually, or the defaults are used
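As a hedged sketch (not from the slides; the property names are real Hadoop parameters, but the particular values and the class name are illustrative), a job can override a few of these parameters instead of taking the defaults:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ConfiguredJob {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Override a couple of the 190+ parameters rather than accept the defaults.
        conf.setInt("io.sort.mb", 200);                       // map-side sort buffer, in MB
        conf.setBoolean("mapred.compress.map.output", true);  // compress intermediate map output

        Job job = Job.getInstance(conf, "configured job");
        job.setNumReduceTasks(4);   // same effect as setting mapred.reduce.tasks
        // ... set the Mapper, Reducer, and input/output paths as in the
        //     launching program sketched below, then submit the job ...
    }
}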

Page 32:

Page 33:

Even though the Hadoop framework is written in Java, programs for Hadoop need not be coded in Java; they can also be developed in other languages such as Python or C++.

Creating a launching program for your application. The launching program configures:
  – The Mapper and Reducer to use
  – The output key and value types (input types are inferred from the InputFormat)
  – The locations for your input and output

The launching program then submits the job and typically waits for it to complete.

A Map/Reduce job may specify how its input is to be read by specifying an InputFormat to be used.
A Map/Reduce job may specify how its output is to be written by specifying an OutputFormat to be used.
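A minimal sketch of such a launching program (not from the slides; TokenizerMapper and IntSumReducer refer to the earlier mapper/reducer sketch, and the input/output locations are taken from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);

        // The Mapper and Reducer to use
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);

        // The output key and value types (input types come from the InputFormat)
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // How the input is read and the output is written
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        // The locations for your input and output
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to complete
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Packaged into a jar, it would typically be run with something like: hadoop jar wordcount.jar WordCountDriver input output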

Page 34:

HANDS-ON SESSION

 

Page 35:

Now, Your Turn…

• Verify the output from the wordcount program
  – HINT: hadoop fs -ls output

• View the output from the wordcount program
  – HINT: hadoop fs -cat output/part-00000

Page 36:

Hadoop – The Good/Bad/Ugly

• Hadoop is Good – that is why we are all here

• Hadoop is not Bad – else we would not be here

• Hadoop is sometimes Ugly – why?
  – Out-of-the-box configuration – not friendly
  – Difficult to debug
  – Performance – tuning/optimization is black magic

Page 37:

Number of Maps/Reducers

• mapred.tasktracker.map/reduce.tasks.maximum
  – Maximum tasks (map/reduce) for a TaskTracker
• Default: 2
• Suggestions
  – Recommended range: (cores_per_node)/2 to 2x(cores_per_node), e.g. 4 to 16 slots on an 8-core node
  – This value should be set according to the hardware specification of the cluster nodes and the resource requirements of the tasks (map/reduce)
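Note that these are cluster-side settings that each TaskTracker reads from its mapred-site.xml at startup, not per-job values. A small hedged sketch (the class name is ours) that prints the two underlying property names and their default of 2:

import org.apache.hadoop.mapred.JobConf;

public class SlotCheck {
    public static void main(String[] args) {
        // JobConf pulls in mapred-default.xml and mapred-site.xml from the classpath.
        JobConf conf = new JobConf();

        // The slide's "mapred.tasktracker.map/reduce.tasks.maximum" is shorthand
        // for these two properties; both default to 2.
        int mapSlots = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2);

        System.out.println("map slots per TaskTracker:    " + mapSlots);
        System.out.println("reduce slots per TaskTracker: " + reduceSlots);
    }
}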