Kite SDK introduction for Portland Big Data

Kite SDK is a set of tools for building big data applications on Hadoop.

Transcript of Kite SDK introduction for Portland Big Data

1. Kite SDK: It's for developers
    Ryan Blue, Software Engineer

2. Resources
    © 2014 Cloudera, Inc. All rights reserved.
    Kite guide: http://tiny.cloudera.com/KiteGuide
    Dataset overview and intro: http://tiny.cloudera.com/Datasets
    Command-line tutorial: http://tiny.cloudera.com/KiteCLI
    Kite repository and examples:
        https://github.com/kite-sdk/kite
        https://github.com/kite-sdk/kite-examples

3. Agenda
    Kite background
    Kite data

4. What problem does Kite solve?
    Accessibility for getting started
        Easy to get started, without being an expert
        Use before understanding
    Save time for experienced developers
        Off-the-shelf tools for common tasks
        Quickly iterate and test configurations

5. Kite Datasets: Motivation
    Focus on using data, not managing files
    Developers shouldn't have to maintain data files
    Use through configuration, not code
    Need consistency across the platform

6. Kite Datasets: Motivation
    [Diagram: Application, Database, Data files. Labels: User code, Provided, Maintained by the database]

7. Kite Datasets: Motivation
    [Diagram: Application (x2), Database, Data files (x2), HBase. Label: User code]

8. Kite Datasets: Motivation
    [Diagram: Application (x3), Database, Data files (x3), Kite Data, HBase (x2). Label: Maintained by Kite]

9. Kite Datasets: Goals
    Think in terms of data: datasets, views, records
    Describe the data and layout, and Kite does the right thing
    Should work consistently across the platform
    Reliable

10. Kite Datasets: Compatibility
    Project       HDFS (avro)   HDFS (parquet)   HBase
    Kite          1.0           1.0              1.0
    Flume Sink    1.0           1.0              1.0
    MapReduce     1.0           1.0              1.0
    Crunch        1.0           1.0              1.0
    Hive          1.0           1.0              1.1
    Impala        1.0           1.0              *
    * depends on common HBase encoding format

11. Current compatibility (0.15.0)
    Project       HDFS (avro)   HDFS (parquet)   HBase
    Kite          1.0           1.0              1.0
    Flume Sink    1.0           1.0              1.0
    MapReduce     1.0           1.0              1.0
    Crunch        1.0           1.0              1.0
    Hive          1.0           1.0              1.1
    Impala        1.0           1.0              *
    * depends on common HBase encoding format

12. Agenda
    Kite background
    Kite data
    [Diagram: Application, Kite Data, Data files, HBase. Label: Maintained by Kite]

13. Datasets
    A collection of records or entities
    Like a Hive or relational table
    Generic, reflected, or generated objects
    Identified by URI
        dataset:hdfs:/data/ratings
        dataset:hive:/data/ratings
        dataset:hbase:zk1/ratings

    ratings = Datasets.load("dataset:hive:/data/ratings")

14. Dataset configuration, JSON
    Schema (Avro)
        Record fields, like a table definition

15. Dataset configuration, JSON
    Schema (Avro)
        Record fields, like a table definition
    Partition strategy
        Layout or key definition from record fields

16. Configuring partitioning
    Partition strategy
        [
          { "source" : "timestamp", "type" : "year" },
          { "source" : "timestamp", "type" : "month" },
          { "source" : "timestamp", "type" : "day" }
        ]

    Resulting layout
        datasets/
          ratings/
            year=1997/
              month=09/
                day=20/
                ...
                day=30/
              month=10/
                day=01/
                ...

17. Configuring key building
    Partition strategy for HBase
        [
          { "source" : "email", "type" : "hash", "buckets": 32 },
          { "source" : "email", "type" : "identity" }
        ]

    Example key
        (22, "buzz@pixar.com")
        \x80\x00\x00\x16buzz@pixar.com\x00\x00

18. Dataset configuration, JSON
    Schema (Avro)
        Record fields, like a table definition
    Partition strategy
        Layout or key definition from record fields
    Column mapping (HBase)
        Where to store record fields

19. Mapping example
    Schema
        {
          "type" : "record",
          "name" : "User",
          "fields" : [
            { "name" : "email", "type" : "string" },
            ...
          ]
        }
    HBase table
        family:           name                 counts    prefs
        row key           last        first    visits    flash
        buzz@pixar.com    Lightyear   Buzz     315       true

    Column mapping
        [
          { "source": "email", "type": "key" },
          ...
        ]

20. Mapping example
    Schema
        {
          "type" : "record",
          "name" : "User",
          "fields" : [
            { "name" : "lastName", "type" : "string" },
            ...
          ]
        }

    HBase table
        family:           name                 counts    prefs
        row key           last        first    visits    flash
        buzz@pixar.com    Lightyear   Buzz     315       true

    Column mapping
        [
          { "source": "lastName", "type": "column", "family": "name", "qualifier": "last" },
          ...
        ]

21. Command-line demo?
    1. Describe your data
        dataset obj-schema org.movielens.Rating --jar app.jar --output rating.avsc
    2. Describe your layout
        dataset partition-config ts:year ts:month ts:day --schema rating.avsc --output ymd.json
    3. Create a dataset
        dataset create ratings --schema rating.avsc --partition-by ymd.json

22. Command-line tool
    Executable jar download
    Inspects the environment
    Must be used on-cluster
    Classpath for HBase, Hive, etc.
    Debugging: debug=true ./dataset -v
    Requires MAPRED_HOME variable on CDH5

23. Resources
    Kite guide: http://tiny.cloudera.com/KiteGuide
    Dataset overview and intro: http://tiny.cloudera.com/Datasets
    Command-line tutorial: http://tiny.cloudera.com/KiteCLI
    Kite repository and examples:
        https://github.com/kite-sdk/kite
        https://github.com/kite-sdk/kite-examples

24. Questions
    Ryan Blue: blue@cloudera.com
    Kite mailing list: cdk-dev@cloudera.org

25. Maven parent POM
    Automatic Kite and Hadoop dependencies
    Inherit from kite-app-parent-cdh4
    CDH4 only, CDH5 support in 0.16.0

        <parent>
          <groupId>org.kitesdk</groupId>
          <artifactId>kite-app-parent-cdh4</artifactId>
          <version>0.15.0</version>
        </parent>

26. Maven Plugin
    Maven plugin manages datasets for an application
    Configured by app-parent POM
    Handles create, update, etc. in Maven goals

27. MapReduce
    DatasetKeyInputFormat
    DatasetKeyOutputFormat
    Values are always null

        View eventsBeforeToday = Datasets
            .load("dataset:hive:/data/events")
            .toBefore("timestamp", startOfToday());
        DatasetKeyInputFormat.configure(mrJob).readFrom(eventsBeforeToday);

28. Crunch
    CrunchDatasets.asSource
    CrunchDatasets.asTarget

        PCollection getPipeline().read(
            CrunchDatasets.asSource(eventsBeforeToday));

    Handle-existing support in 0.16.0
    Configure dependencies with Kite parent POM

29. DatasetSink
    Write to HDFS Avro and HBase
    http://tiny.cloudera.com/DatasetSink
    Proxy user support
    Automatic partitioning

        agent.sinks.name.type = org.apache.flume.sink.kite.DatasetSink
        agent.sinks.name.kite.repo.uri = repo:hdfs:/datasets
        agent.sinks.name.kite.dataset.name = events
        agent.sinks.name.auth.proxyUser = cloudera
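The year/month/day partition strategy on slide 16 maps each record's timestamp to a directory such as year=1997/month=09/day=20. The following is a minimal Java sketch of that mapping using only the standard library. The class PartitionPathSketch and the method partitionPath are hypothetical names for illustration and are not part of the Kite API; treating the timestamp as epoch milliseconds and interpreting it in UTC are assumptions as well.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.util.Locale;

public class PartitionPathSketch {

    // Hypothetical helper (not a Kite method): derives the relative
    // directory for a record from its timestamp, mirroring the
    // year / month / day strategy shown on the partitioning slide.
    // Assumption: the timestamp is epoch milliseconds, read in UTC.
    public static String partitionPath(long timestampMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(timestampMillis).atZone(ZoneOffset.UTC);
        // Month and day are zero-padded to two digits, as in month=09/day=01.
        return String.format(Locale.ROOT, "year=%d/month=%02d/day=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        long ts = Instant.parse("1997-09-20T12:00:00Z").toEpochMilli();
        // Prints: datasets/ratings/year=1997/month=09/day=20
        System.out.println("datasets/ratings/" + partitionPath(ts));
    }
}
```

With Kite itself, this mapping is driven by the JSON partition strategy, so application code does not normally build these paths by hand; the sketch only shows how one record lands in one directory.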