Building Applications on Hadoop

Mark Grover, Software Engineer, Cloudera (@mark_grover)
Jfokus 2014 (February 4th, 2014)

©2014 Cloudera, Inc. All Rights Reserved.


Page 2: Agenda

• Brief intro to Hadoop and the ecosystem
• Developing apps on Hadoop
• What’s the current problem?
• How are we fixing it?

Page 3: What is Apache Hadoop?

Apache Hadoop is an open source platform for data storage and processing that is:

✓ Scalable
✓ Fault tolerant
✓ Distributed

Core Hadoop system components:

• Hadoop Distributed File System (HDFS): self-healing, high-bandwidth clustered storage
• MapReduce: distributed computing framework

Has the flexibility to store and mine any type of data

• Ask questions across structured and unstructured data that were previously impossible to ask or solve
• Not bound by a single schema

Excels at processing complex data

• Scale-out architecture divides workloads across multiple nodes
• Flexible file system eliminates ETL bottlenecks

Scales economically

• Can be deployed on commodity hardware
• Open source platform guards against vendor lock-in
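To make the MapReduce component above more concrete, here is a minimal single-process sketch of the map, shuffle, and reduce phases in plain Java. It deliberately avoids the Hadoop API: the class name `WordCountSketch` and all of its methods are hypothetical and exist only for illustration. A real job would run these phases in parallel across many nodes, reading its input from HDFS.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustration only: the three MapReduce phases in one JVM, with no Hadoop dependency. */
final class WordCountSketch {

    /** Map phase: emit a (word, 1) pair for every word in one input line. */
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) {
                pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
            }
        }
        return pairs;
    }

    /** Shuffle phase: group all emitted values by key (sorted here for stable output). */
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        return grouped;
    }

    /** Reduce phase: collapse the values collected for one key into a single count. */
    static int reduce(List<Integer> values) {
        int sum = 0;
        for (int value : values) {
            sum += value;
        }
        return sum;
    }

    /** Runs map over every line, then shuffle, then reduce per key. */
    static Map<String, Integer> run(List<String> lines) {
        List<Map.Entry<String, Integer>> emitted = new ArrayList<>();
        for (String line : lines) {
            emitted.addAll(map(line));
        }
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> group : shuffle(emitted).entrySet()) {
            counts.put(group.getKey(), reduce(group.getValue()));
        }
        return counts;
    }

    public static void main(String[] args) {
        // prints {data=2, hadoop=2, processes=1, stores=1}
        System.out.println(run(Arrays.asList("hadoop stores data", "hadoop processes data")));
    }
}
```

Because each `map` call depends only on its own line and each `reduce` call only on one key's values, both phases can be spread across machines, which is what the scale-out bullet above refers to.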

Page 4: Kite SDK

Developing apps on Hadoop

Page 5: A typical system (zoom 100:1)

Page 6: Hadoop is incredibly powerful

Page 7: Hadoop is incredibly flexible

Page 8: Hadoop is incredibly low-level

Page 9: Hadoop is incredibly complex

Page 10:

“[I]t’s not enough to just build a scalable and stable system; the system also has to be easy enough for thousands of internal developers of all types and all skill levels to use.”

http://gigaom.com/data/how-disney-built-a-big-data-platform-on-a-startup-budget/

Page 11: A typical system (zoom 100:1)

Page 12: A typical system (zoom 10:1)

Page 13: A typical system (zoom 5:1)

Page 14: What you actually care about

• Getting data from A to B
• Using it later

Page 15: Infrastructure details

• Serialization, file formats, and compression
• Metadata capture and maintenance
• Dataset organization and partitioning
• Durability and delivery guarantees
• Well-defined failure semantics
• Performance and health instrumentation

Page 16: Wouldn’t it be nice…?

• Make Hadoop accessible to the enterprise developer
• Address the most common cases
• Codify expert patterns and practices for building data-oriented systems and applications
• Let developers focus on business logic, not plumbing or infrastructure
• Provide smart defaults for platform choices
• Support piecemeal adoption via loosely coupled modules

Page 17: Kite SDK

• An open source set of libraries, guides, and examples for building data-oriented systems and applications
• Provides higher-level APIs atop existing components of CDH
• Supports piecemeal adoption via loosely coupled modules

Page 18: Kite SDK Data Module

• Logical abstractions of records, datasets, and repositories, with implementations for HDFS and HBase (upcoming)
• APIs to drastically simplify working with datasets in Hadoop filesystems. The Data module:
  • Handles automatic serialization and deserialization of Java POJOs as well as Avro records
  • Applies automatic compression
  • Manages file and directory layout
  • Partitions data automatically based on configurable functions
  • Provides a metadata provider plugin interface to integrate with centralized metadata management systems

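Page 19:

Code

    // Imports for the Kite SDK classes (DatasetRepository, Dataset, and so on) are
    // omitted because their package names depend on the SDK version in use.
    import java.io.File;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.generic.GenericRecordBuilder;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Open a dataset repository rooted at /data on the default file system.
    DatasetRepository repo = new FileSystemDatasetRepository.Builder()
        .fileSystem(FileSystem.get(new Configuration()))
        .directory(new Path("/data"))
        .get();

    // Create the "events" dataset, hash-partitioned on userId into 53 buckets.
    Dataset events = repo.create("events",
        new DatasetDescriptor.Builder()
            .schema(new File("event.avsc"))
            .partitionStrategy(
                new PartitionStrategy.Builder().hash("userId", 53).get())
            .get());

    // Write one record built against the dataset's own Avro schema.
    DatasetWriter<GenericRecord> writer = events.getWriter();
    writer.open();
    try {
      writer.write(
          new GenericRecordBuilder(events.getDescriptor().getSchema())
              .set("userId", 1)
              .set("timeStamp", System.currentTimeMillis())
              .build());
    } finally {
      // Close the writer even if the write fails.
      writer.close();
    }

Data

    /data
      /events
        /.metadata
          /schema.avsc
          /descriptor.properties
        /userId=0
          /10000000.avro
          /10000001.avro
        /userId=1
          /20000000.avro
        /userId=2
          /30000000.avro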
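The link between the `hash("userId", 53)` partition strategy and the `userId=N` directories can be sketched in plain Java. This is a rough illustration, not Kite's implementation: the class `HashPartitionSketch`, both method names, and the choice of a non-negative `hashCode()` modulo the bucket count are assumptions made for the example.

```java
/** Illustration only: how a hash partitioner could map a field value to a directory. */
final class HashPartitionSketch {

    /**
     * Maps a field value to a bucket number in [0, buckets).
     * Assumption: the bucket is the value's hashCode() reduced modulo the bucket count.
     */
    static int bucketFor(Object value, int buckets) {
        if (buckets <= 0) {
            throw new IllegalArgumentException("buckets must be positive");
        }
        // floorMod keeps the result non-negative even for negative hash codes.
        return Math.floorMod(value.hashCode(), buckets);
    }

    /** Builds a partition directory such as /data/events/userId=1, mirroring the layout above. */
    static String partitionPath(String datasetRoot, String field, Object value, int buckets) {
        return datasetRoot + "/" + field + "=" + bucketFor(value, buckets);
    }

    public static void main(String[] args) {
        // prints /data/events/userId=1
        System.out.println(partitionPath("/data/events", "userId", 1, 53));
    }
}
```

Under this scheme, every record with the same `userId` lands in the same directory, and the number of directories is capped at the bucket count regardless of how many distinct users exist.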

Page 20: Kite SDK Morphlines Module

• Pluggable, configuration-driven data transform library
• Born out of Cloudera Search, but general purpose
• Configure record transform stages in a container library
• Use the library in Flume, MapReduce jobs, Storm, and other Java applications
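The idea of chaining record transform stages can be sketched in plain Java as follows. This is not the Morphlines API: `TransformPipelineSketch`, `Stage`, `requireField`, and `lowercase` are hypothetical names, and the stages are wired up in code here, whereas a configuration-driven library would read the stage list from a configuration file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Illustration only: a chain of record transform stages, each of which may drop the record. */
final class TransformPipelineSketch {

    /** A stage returns the (possibly modified) record, or empty to drop it. */
    interface Stage extends Function<Map<String, Object>, Optional<Map<String, Object>>> {}

    private final List<Stage> stages = new ArrayList<>();

    /** Appends a stage; stages run in the order they were added. */
    TransformPipelineSketch add(Stage stage) {
        stages.add(stage);
        return this;
    }

    /** Runs a copy of the record through every stage, stopping early if one drops it. */
    Optional<Map<String, Object>> process(Map<String, Object> record) {
        Map<String, Object> copy = new HashMap<>(record);
        Optional<Map<String, Object>> current = Optional.of(copy);
        for (Stage stage : stages) {
            if (!current.isPresent()) {
                break;
            }
            current = stage.apply(current.get());
        }
        return current;
    }

    /** Hypothetical stage: drop records that lack the given field. */
    static Stage requireField(String field) {
        return record -> {
            if (record.containsKey(field)) {
                return Optional.of(record);
            }
            return Optional.<Map<String, Object>>empty();
        };
    }

    /** Hypothetical stage: lowercase the given field when it holds a string. */
    static Stage lowercase(String field) {
        return record -> {
            Object value = record.get(field);
            if (value instanceof String) {
                record.put(field, ((String) value).toLowerCase());
            }
            return Optional.of(record);
        };
    }
}
```

A pipeline built as `new TransformPipelineSketch().add(requireField("userId")).add(lowercase("event"))` could then be embedded wherever records flow, which is the reuse across Flume, MapReduce, and Storm that the slide describes.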

Page 21: Other Modules

• Maven plugin
  • Package, deploy, and execute “apps”
  • Execute dataset operations
• Examples
  • POJO, generic, and generated entity ingest
  • Dataset administrative operations
  • Crunch and MR integration
  • ...

Page 22: Future

• HBase
  • Extending data APIs to support random access
  • Same automatic serialization, schema management, etc.
• Higher-order data management
  • Common tasks
  • Think background compaction, conversion, etc.
• Integration with existing middleware frameworks
• Give us all your good ideas (and code)!

Page 23: Kite SDK Resources

• Docs: http://kitesdk.org/docs/current/
• Examples: https://github.com/kite-sdk/kite-examples
• Source code: https://github.com/kite-sdk/
• Binary artifacts available from Cloudera’s Maven repository
• Twitter: @mark_grover
• Slides at http://www.slideshare.net/markgrover/applications-on-hadoop
• LinkedIn: linkedin.com/in/grovermark

Page 24: Co-authoring O’Reilly book

• Titled ‘Hadoop Application Architectures’
• How to build end-to-end solutions using Apache Hadoop and related tools
• Updates on Twitter: @hadooparchbook
• http://www.hadooparchitecturebook.com/