Big datatraining ranga_1

BIG DATA TRAINING Ranga Vadlamudi March 2014

description

Big Data Training Slides

Transcript of Big datatraining ranga_1

Page 1: Big datatraining ranga_1

BIG DATA TRAINING

Ranga Vadlamudi, March 2014

Page 2: Big datatraining ranga_1
Page 3: Big datatraining ranga_1

What is Big Data

• Volume: Large amounts of data at rest
• Velocity: Milliseconds to seconds to respond
• Variety: Data in many forms (structured, unstructured, multimedia, text, etc.)
• Veracity: Data in doubt

 

Page 4: Big datatraining ranga_1

• 30 billion pieces of content a month

• 1 petabyte of content every day
• 2 billion videos watched every day

• 3 billion people will be online
• Sharing 8 zettabytes of data

   

Page 5: Big datatraining ranga_1
Page 6: Big datatraining ranga_1

CAP THEOREM (Consistency, Availability, Partition Tolerance)

Page 7: Big datatraining ranga_1

Big Data Solutions

Big Data

Real Time Querying

Batch Querying

Mining & Analytics

Machine Learning

Storage

Page 8: Big datatraining ranga_1

Technology  

Page 9: Big datatraining ranga_1

Background

• Underlying technology invented by Google
• Google Bigtable & Google File System
• Doug Cutting created Nutch, and Hadoop was spun off from it at Yahoo
• Yahoo played a key role in developing Hadoop for enterprise applications

Page 10: Big datatraining ranga_1

Hadoop

• Is a framework
• Built on commodity hardware
• Implements computational paradigm called MapReduce
• Provides a distributed file system called HDFS to store data
• Node failures are automatically handled

Page 11: Big datatraining ranga_1

Data Becomes Bottleneck

• Getting data to processors is expensive
• Typical disk data transfer rate: 75 MB/sec
• 100 GB data transfer: approx. 22 mins
• New approach is needed
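
The 22-minute figure follows directly from the quoted transfer rate. A minimal sketch of the arithmetic, assuming decimal units (1 GB = 1000 MB); the function name is illustrative and not from the slides:

```python
def transfer_minutes(size_gb: float, rate_mb_per_s: float) -> float:
    """Minutes needed to stream size_gb through a single disk at rate_mb_per_s."""
    size_mb = size_gb * 1000  # assumption: decimal units, 1 GB = 1000 MB
    seconds = size_mb / rate_mb_per_s
    return seconds / 60


# The slide's example: 100 GB at 75 MB/sec
print(round(transfer_minutes(100, 75), 1))  # 22.2
```

With binary units (1 GB = 1024 MB) the result is closer to 22.8 minutes, so "approx. 22 mins" holds either way.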

Page 12: Big datatraining ranga_1

Hadoop Solves

• Problems where you have a lot of data
• Mixture of complex and structured data
• Speeds up computations by distribution
• Mantra: take computation to the data, don’t bring data to computation
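
One way to picture "take computation to the data" is a scheduler that prefers a machine already holding the block it needs. The sketch below is a toy illustration only, not Hadoop's actual scheduler; the node names, block IDs and the `pick_node` helper are all hypothetical.

```python
def pick_node(block_id, block_locations, free_nodes):
    """Return (node, is_local) for a task that must read block_id.

    Prefers a free node that already stores the block; otherwise falls
    back to any free node, which means the block has to cross the network.
    """
    local = [n for n in block_locations.get(block_id, []) if n in free_nodes]
    if local:
        return local[0], True           # data-local: no network copy needed
    return sorted(free_nodes)[0], False  # remote: data must be moved to the task


# Hypothetical cluster state: which nodes store which blocks, and which are idle.
block_locations = {"blk_1": ["node-a", "node-c"], "blk_2": ["node-b"]}
free_nodes = {"node-c", "node-d"}

print(pick_node("blk_1", block_locations, free_nodes))  # ('node-c', True)
print(pick_node("blk_2", block_locations, free_nodes))  # ('node-c', False)
```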

Page 13: Big datatraining ranga_1

Hadoop Distributions

Page 14: Big datatraining ranga_1

Hadoop Architecture

• Master-slave philosophy
• Designed to run on a large number of machines
• Machines don’t share memory or disk
• Rack them up and run Hadoop on each machine

Page 15: Big datatraining ranga_1

Hadoop Architecture

• Data is divided and spread across servers
• Hadoop keeps track of where the data is
• Hadoop replicates data into multiple copies to avoid a single point of failure
• MapReduce is a programming model to process large sets of data in parallel
• Map the operation out to all servers
• Shuffle the results
• Reduce the results back into one result set
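
The map, shuffle and reduce steps can be sketched in a few lines of plain Python. This is a single-process illustration of the programming model using word count as the example, not Hadoop's API; the function names are made up for this sketch.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    # On a real cluster each line (or block) would be mapped on a different server.
    mapped = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle_phase(mapped)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


print(word_count(["big data", "big cluster"]))  # {'big': 2, 'data': 1, 'cluster': 1}
```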

Page 16: Big datatraining ranga_1

Hadoop Components

Page 17: Big datatraining ranga_1

HDFS (Hadoop Distributed File System)

Page 18: Big datatraining ranga_1

HDFS

• Distributed file system
• Highly fault tolerant
• HDFS instance can span many servers
• Handles large datasets, from terabytes to petabytes
• Moving computation is cheaper than moving data
• Large block sizes (128 MB, for example)
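
To make the block size concrete, here is a rough sketch of how many blocks a file occupies and how much raw disk its replicas consume. It assumes the 128 MB example block size and a replication factor of 3, which is the stock HDFS default but is not stated in the slides; `hdfs_footprint` is an illustrative name.

```python
import math


def hdfs_footprint(file_mb, block_mb=128, replication=3):
    """Return (number_of_blocks, raw_storage_mb) for one file.

    block_mb=128 is the example size from the slide; replication=3 is an
    assumption (the usual HDFS default). The last block is not padded, so
    raw storage is simply the file size times the replication factor.
    """
    blocks = math.ceil(file_mb / block_mb)
    raw_mb = file_mb * replication
    return blocks, raw_mb


print(hdfs_footprint(1000))  # (8, 3000)
```

Large blocks keep the block count low, so the master has less metadata to track per file.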

Page 19: Big datatraining ranga_1
Page 20: Big datatraining ranga_1

HDFS Layout

Page 21: Big datatraining ranga_1

Cloudera Manager

• Management software to manage Hadoop ecosystem
• Helps install, manage and maintain a cluster
• Resource consumption tracking
• Proactive health checks
• Alerting
• Config changes

 

Page 22: Big datatraining ranga_1

Cloudera Capabilities

Page 23: Big datatraining ranga_1

Demo Cloudera
Demo Cassandra
Demo MongoDB

Page 24: Big datatraining ranga_1

Questions?