Kite (Big Data Applications Meetup @ Cask)
© Cloudera, Inc. All rights reserved.
Kite SDK: Helping Hadoop projects work together
Ryan Blue 23 June 2015
Quick poll
●Who has seen the movie Fear and Loathing in Las Vegas?
Oh, no. What did we do?
●Last thing I remember, we were at a NoSQL party
● I don’t remember much...
●Did we build a database?
Dinosaur tails and tape recorders
●Dinosaur tail: some tools work with tables, some with files
●Tape recorder: tables came later, and it wasn’t too bad to deal with it
●Result: table formats are reimplemented everywhere, and jobs commonly drop files into folders that back a database table
Dinosaur tails and tape recorders
●Dinosaur tail: if you can dream up a file format, someone is using it in Hadoop
●Tape recorder: unstructured data was part of the appeal
●Result: it is easy to choose a format with lurking application problems
Dinosaur tails and tape recorders
●Dinosaur tail: the de-facto table format mixes metadata into directory names
●Tape recorder: this format was intended to be simple and be a coarse index
●Result: guaranteeing safety needs an elaborate locking scheme, which would slow down low-latency queries
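To make "metadata in directory names" concrete: in the common Hive-style layout, partition values live in `key=value` folder names rather than in the data files. The sketch below recovers them from a path. The table root, file name, and helper name are made up for illustration.

```python
from pathlib import PurePosixPath


def partition_values(file_path, table_root):
    """Recover partition metadata from key=value directory names."""
    relative = PurePosixPath(file_path).relative_to(table_root)
    values = {}
    # Every path component except the file name should be a partition directory.
    for part in relative.parts[:-1]:
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"not a key=value partition directory: {part!r}")
        values[key] = value
    return values


# Hypothetical table root and data file, for illustration only.
path = "/warehouse/events/year=2015/month=06/day=23/part-00000.avro"
print(partition_values(path, "/warehouse/events"))
```

Every value comes back as a string: the directory name carries no type information, and nothing stops a job from dropping a file into a folder whose name does not match the data inside it.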
Dinosaur tails and tape recorders
●Dinosaur tail: schemas are missing key features
●Tape recorder: schema on read? I honestly don’t remember
●Result: schema evolution, data types, and behavior vary, and table schemas are sometimes missing
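As a reminder of what consistent schema evolution would look like, here is a deliberately simplified model of Avro-style schema resolution: fields the reader does not know are dropped, and fields missing from old records are filled from the reader's defaults. It is a sketch of the rule, not the Avro library, and the schema and record are invented for the example.

```python
def resolve(record, reader_schema):
    """Project a record written with an older schema onto a newer reader schema."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field added after the record was written: fall back to the default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields present in the record but absent from the reader schema are dropped.
    return resolved


# Hypothetical schema: "country" was added later, "legacy_flag" was removed.
reader_schema = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "country", "type": "string", "default": "unknown"},
    ],
}
old_record = {"id": 7, "legacy_flag": True}
print(resolve(old_record, reader_schema))
```

When every component applies the same rule, old files stay readable after a schema change. The slide's complaint is that components differ on exactly this behavior.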
Building Hadoop applications is hard
●Early choices have big consequences for performance and compatibility
●Components and formats work slightly differently
●Table support is still done manually in most projects
●SQL engines can’t trust the files in a table
●Types are missing
How can we fix it?
●Collaborate on (strict) data storage specs and consistent schemas
● Implement table-level access everywhere, not file-level access
● Include partition handling for storage and retrieval
●Build a standard API so that storage can be versioned and evolved
●Build a common set of tools
● Improve the table format
What is Kite?
●A table-level API that allows storage to be versioned and evolved
●A common set of tools built around that API
●Datasets are identified by URI
●Datasets are defined by an Avro schema and a partition configuration
●Datasets are compatible with Hive and Impala
●Kite provides an API for table-level access in MapReduce and Spark
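The three ingredients of a dataset named above (a URI, an Avro schema, a partition configuration) can be sketched as plain data. The URI, record name, and field names below are hypothetical. The partition-strategy shape (a list of `type`/`source` entries) is an assumption about Kite's JSON format and should be checked against the Kite documentation. The helper is illustrative and not part of Kite.

```python
import json

# Hypothetical dataset URI; the exact form depends on the storage backend.
DATASET_URI = "dataset:hive:default/events"

# An Avro record schema is ordinary JSON.
EVENT_SCHEMA = json.loads("""
{
  "type": "record",
  "name": "Event",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "timestamp", "type": "long"},
    {"name": "message", "type": ["null", "string"], "default": null}
  ]
}
""")

# Assumed shape of a partition configuration: each entry derives a
# partition value from a source field of the record.
PARTITION_STRATEGY = [
    {"type": "year", "source": "timestamp"},
    {"type": "month", "source": "timestamp"},
]


def undefined_partition_sources(schema, strategy):
    """Return partition source fields that the schema does not define."""
    field_names = {field["name"] for field in schema["fields"]}
    return [p["source"] for p in strategy if p["source"] not in field_names]


print(DATASET_URI, undefined_partition_sources(EVENT_SCHEMA, PARTITION_STRATEGY))
```

Keeping the partition configuration next to the schema lets a check like this run when the dataset is created, instead of leaving each job to discover the folder layout on its own.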
How does Kite differ from Cask?
●Kite is focused on storage
● How should objects be serialized?
● Provides compatibility across the ecosystem
●Cask is focused on application patterns
... yes, there is some overlap
Current efforts
●Date, time, and timestamp standardization in Avro and Parquet
●A new table format with snapshot isolation
●An HBase encoding specification for portability
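For the date and timestamp bullet, one concrete encoding is the one the Avro specification defines for its `date` and `timestamp-millis` logical types: an int counting days since 1970-01-01, and a long counting milliseconds since the Unix epoch in UTC. The sketch below shows that arithmetic only. It is not necessarily the design under discussion at the time of the talk, and the function names are illustrative.

```python
from datetime import date, datetime, timedelta, timezone

EPOCH_DATE = date(1970, 1, 1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_days_since_epoch(d):
    """Encode a calendar date as whole days since 1970-01-01."""
    return (d - EPOCH_DATE).days


def from_days_since_epoch(days):
    """Decode days since 1970-01-01 back into a calendar date."""
    return EPOCH_DATE + timedelta(days=days)


def to_timestamp_millis(dt):
    """Encode a timezone-aware datetime as milliseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


talk_day = date(2015, 6, 23)
encoded = to_days_since_epoch(talk_day)
print(encoded, from_days_since_epoch(encoded))
```

Storing a plain integer with an agreed meaning is what lets two engines read the same column and agree on the value.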
Demo!