Kite (Big Data Applications Meetup @ Cask)

© Cloudera, Inc. All rights reserved.

Kite SDK: Helping Hadoop projects work together

Ryan Blue 23 June 2015


Quick poll

●Who has seen the movie Fear and Loathing in Las Vegas?


Oh, no. What did we do?

●Last thing I remember, we were at a NoSQL party

● I don’t remember much . . .

●Did we build a database?


Dinosaur tails and tape recorders

●Dinosaur tail: some tools work with tables, some with files

●Tape recorder: tables came later, and it wasn’t too bad to deal with it

●Result: table formats are reimplemented everywhere, and jobs commonly drop files into folders that back a database table



●Dinosaur tail: if you can dream up a file format, someone is using it in Hadoop

●Tape recorder: unstructured data was part of the appeal

●Result: it is easy to choose a format with lurking application problems



●Dinosaur tail: the de-facto table format mixes metadata into directory names

●Tape recorder: this format was intended to be simple and be a coarse index

●Result: needs an elaborate locking scheme to guarantee safety, which would cause low-latency queries to be slow
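
To make the "metadata in directory names" point concrete, here is a minimal sketch of how a reader has to recover partition values from a Hive-style `key=value` path. The function name and the example path are hypothetical, chosen only for illustration.

```python
def parse_partition_path(path):
    """Recover partition values from a Hive-style directory path."""
    values = {}
    for segment in path.strip("/").split("/"):
        if "=" not in segment:
            continue  # plain directory or data file name, not a partition
        key, _, value = segment.partition("=")
        values[key] = value  # the value comes back as a string
    return values


# Hypothetical file location inside a partitioned table
example = "warehouse/events/year=2015/month=06/day=23/part-00000.avro"
partition = parse_partition_path(example)
```

Note that every value comes back as a string: the path records what the partition values are, but not their types.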



●Dinosaur tail: schemas are missing key features

●Tape recorder: schema on read? I honestly don’t remember

●Result: schema evolution, data types, and behavior vary, and table schemas are sometimes missing
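
As an illustration of what consistent schema evolution can look like, the sketch below mimics one Avro resolution rule in plain Python: a field the reader expects but the writer never stored is filled from the reader schema's default, and a stored field the reader does not list is dropped. It is a simplified stand-in rather than Avro's implementation, and the `resolve` helper and the sample schema are hypothetical.

```python
def resolve(record, reader_schema):
    """Project a stored record onto the reader's schema, applying defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]  # field added after the data was written
        else:
            raise ValueError("no value or default for field: " + name)
    return resolved


# Reader schema, version 2: "email" was added later with a default
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

# Record written with version 1, which had a since-removed "nickname" field
old_record = {"id": 7, "nickname": "duke"}
new_view = resolve(old_record, reader_schema)
```

When every component applies the same rule, old files stay readable after the table schema changes.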


Building Hadoop applications is hard

●Early choices have big consequences for performance and compatibility

●Components and formats work slightly differently

●Table support is still done manually in most projects

●SQL engines can’t trust the files in a table

●Types are missing


How can we fix it?

●Collaborate on (strict) data storage specs and consistent schemas

● Implement table-level access everywhere, not file-level access

● Include partition handling for storage and retrieval

●Build a standard API so that storage can be versioned and evolved

●Build a common set of tools

● Improve the table format


What is Kite?

●A table-level API that allows storage to be versioned and evolved

●A common set of tools built around that API

●Datasets are identified by URI

●Defined by an Avro schema and partition configuration

●Compatible with Hive and Impala

●Provides an API for table-level access in MapReduce (MR) and Spark
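
A rough sketch of the three pieces that describe a dataset: a URI, an Avro schema, and a partition configuration. Avro schemas are JSON, so the schema is shown as a Python dict. The URI value, the record fields, the shape of the partition list, and the `partition_path` helper are all assumptions made for illustration; this is not Kite's API or its exact configuration format.

```python
import json
from datetime import datetime, timezone

# Hypothetical dataset URI
DATASET_URI = "dataset:hdfs:/data/default/events"

# Avro schema for the records (Avro schemas are plain JSON)
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "timestamp", "type": "long"},  # milliseconds since the epoch
        {"name": "message", "type": "string"},
    ],
}

# Assumed partition configuration: derive year, month, day from "timestamp"
PARTITION_STRATEGY = [
    {"type": "year", "source": "timestamp"},
    {"type": "month", "source": "timestamp"},
    {"type": "day", "source": "timestamp"},
]


def partition_path(record, strategy):
    """Compute the partition directory a record belongs in."""
    parts = []
    for partitioner in strategy:
        millis = record[partitioner["source"]]
        ts = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
        # "year", "month" and "day" are also datetime attribute names
        parts.append(f'{partitioner["type"]}={getattr(ts, partitioner["type"])}')
    return "/".join(parts)


schema_json = json.dumps(EVENT_SCHEMA)
event = {"id": 1, "timestamp": 1435017600000, "message": "demo"}
path = partition_path(event, PARTITION_STRATEGY)
```

The point of the design is that the writer never builds the path by hand: the partition configuration belongs to the dataset, so every tool that writes to it lays out data the same way.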


How does Kite differ from Cask?

●Kite is focused on storage

● How should objects be serialized?

● Provides compatibility across the ecosystem

●Cask is focused on application patterns

. . . yes, there is some overlap


Current efforts

●Date, time, and timestamp standardization in Avro and Parquet

●A new table format with snapshot isolation

●An HBase encoding specification for portability
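
For the date standardization item, a small sketch of the representation involved, assuming the days-since-epoch encoding that Avro's `date` logical type uses (an `int` counting days from 1 January 1970). The helper names are hypothetical.

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)


def date_to_days(d):
    """Encode a calendar date as the number of days since the Unix epoch."""
    return (d - EPOCH).days


def days_to_date(days):
    """Decode a days-since-epoch integer back into a calendar date."""
    return EPOCH + timedelta(days=days)


encoded = date_to_days(date(2015, 6, 23))
decoded = days_to_date(encoded)
```

Once the formats agree on one encoding like this, a date written by one engine reads back as the same date in another.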


Demo!


Thank you!

[email protected]

ingest.tips
