Kite (Big Data Applications Meetup @ Cask)

© Cloudera, Inc. All rights reserved. Kite SDK: Helping Hadoop projects work together Ryan Blue 23 June 2015

Transcript of Kite (Big Data Applications Meetup @ Cask)

Page 1: Kite (Big Data Applications Meetup @ Cask)

© Cloudera, Inc. All rights reserved.

Kite SDK: Helping Hadoop projects work together

Ryan Blue 23 June 2015

Page 2: Kite (Big Data Applications Meetup @ Cask)

Quick poll

●Who has seen the movie Fear and Loathing in Las Vegas?

Page 3: Kite (Big Data Applications Meetup @ Cask)

Oh, no. What did we do?

●Last thing I remember, we were at a NoSQL party

●I don’t remember much . . .

●Did we build a database?

Page 4: Kite (Big Data Applications Meetup @ Cask)

Dinosaur tails and tape recorders

●Dinosaur tail: some tools work with tables, some with files

●Tape recorder: tables came later, and it wasn’t too bad to deal with it

●Result: table formats are reimplemented everywhere, and jobs commonly drop files into folders that back a database table

Page 5: Kite (Big Data Applications Meetup @ Cask)

Dinosaur tails and tape recorders

●Dinosaur tail: if you can dream up a file format, someone is using it in Hadoop

●Tape recorder: unstructured data was part of the appeal

●Result: it is easy to choose a format with lurking application problems

Page 6: Kite (Big Data Applications Meetup @ Cask)

Dinosaur tails and tape recorders

●Dinosaur tail: the de-facto table format mixes metadata into directory names

●Tape recorder: this format was intended to be simple and be a coarse index

●Result: needs an elaborate locking scheme to guarantee safety, which would cause low-latency queries to be slow
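
To make "metadata in directory names" concrete, here is a minimal sketch of recovering partition values from Hive-style `key=value` directories. The table root, the column names, and the helper `parse_partition_path` are hypothetical and chosen only for illustration; this is not code from Hive or Kite.

```python
from pathlib import PurePosixPath


def parse_partition_path(file_path: str, table_root: str) -> dict:
    """Recover partition values from key=value directory names (hypothetical helper)."""
    relative = PurePosixPath(file_path).relative_to(table_root)
    values = {}
    # Every directory between the table root and the data file is one partition level.
    for part in relative.parent.parts:
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"not a key=value partition directory: {part!r}")
        # The path only carries strings; column types are not recorded here.
        values[key] = value
    return values


# Hypothetical layout: /data/events is the table root.
print(parse_partition_path(
    "/data/events/year=2015/month=06/day=23/part-00000.avro",
    "/data/events",
))
```

Because the directory listing is the only record of which partitions exist, a reader that lists the folder while a job is still dropping files into it can see a partial result, which is why safety needs locking in this layout.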

Page 7: Kite (Big Data Applications Meetup @ Cask)

Dinosaur tails and tape recorders

●Dinosaur tail: schemas are missing key features

●Tape recorder: schema on read? I honestly don’t remember

●Result: schema evolution, data types, and behavior vary, and table schemas are sometimes missing
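
As a rough illustration of what consistent schema evolution means, the sketch below reads a record written with an old schema through a newer one. It follows the general shape of Avro-style resolution (fields missing from the data take the reader's default, fields unknown to the reader are dropped), but it is a simplification: the function `resolve` and the schemas are hypothetical, and type promotion, aliases, and nested records are left out.

```python
def resolve(record: dict, reader_schema: dict) -> dict:
    """Project a stored record onto the reader's schema (simplified sketch)."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            # Field was added after the record was written: use the default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    # Fields present in the record but absent from the reader schema are dropped.
    return resolved


# Hypothetical version 2 of a schema: "email" was added with a default.
USER_V2 = {
    "type": "record",
    "name": "User",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}

old_record = {"id": 1, "name": "kite"}  # written before "email" existed
print(resolve(old_record, USER_V2))
```

When each component applies its own version of these rules, the same files can read differently in different tools.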

Page 8: Kite (Big Data Applications Meetup @ Cask)

Building Hadoop applications is hard

●Early choices have big consequences for performance and compatibility

●Components and formats work slightly differently

●Table support is still done manually in most projects

●SQL engines can’t trust the files in a table

●Types are missing

Page 9: Kite (Big Data Applications Meetup @ Cask)

How can we fix it?

●Collaborate on (strict) data storage specs and consistent schemas

●Implement table-level access everywhere, not file-level access

●Include partition handling for storage and retrieval

●Build a standard API so that storage can be versioned and evolved

●Build a common set of tools

●Improve the table format

Page 11: Kite (Big Data Applications Meetup @ Cask)

What is Kite?

●A table-level API that allows storage to be versioned and evolved

●A common set of tools built around that API

●Datasets are identified by URI

●Defined by an Avro schema and partition configuration

●Compatible with Hive and Impala

●Provides an API for table-level access in MapReduce and Spark
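
The pieces named above (a dataset URI, an Avro schema, and a partition configuration) can be sketched as plain data. Everything here is an assumption for illustration: the URI, the `Event` schema, and the helper `partition_values` are hypothetical, and the partition entries are only loosely modeled on a JSON-style strategy with a `type` and a `source` field. This is not Kite's implementation.

```python
from datetime import datetime, timezone

# Hypothetical dataset URI; the scheme and path are illustrative only.
DATASET_URI = "dataset:hdfs:/tmp/data/events"

# Avro record schema describing the rows of the dataset.
EVENT_SCHEMA = {
    "type": "record",
    "name": "Event",
    "fields": [
        {"name": "id", "type": "long"},
        {"name": "timestamp", "type": "long"},  # milliseconds since the Unix epoch
        {"name": "message", "type": "string"},
    ],
}

# Assumed partition configuration: each entry derives one partition
# value from a source field of the record.
PARTITION_STRATEGY = [
    {"type": "year", "source": "timestamp"},
    {"type": "month", "source": "timestamp"},
    {"type": "day", "source": "timestamp"},
]

SUPPORTED = ("year", "month", "day")


def partition_values(record: dict, strategy: list) -> dict:
    """Compute the partition values a record falls into (hypothetical helper)."""
    values = {}
    for partitioner in strategy:
        kind = partitioner["type"]
        if kind not in SUPPORTED:
            raise ValueError(f"unsupported partitioner type: {kind!r}")
        millis = record[partitioner["source"]]
        moment = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
        values[kind] = getattr(moment, kind)
    return values
```

With the partition logic declared next to the schema, a writer never has to build directory names by hand, and a reader can prune partitions from the same declaration.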

Page 12: Kite (Big Data Applications Meetup @ Cask)

How does Kite differ from Cask?

●Kite is focused on storage

● How should objects be serialized?

● Provides compatibility across the ecosystem

●Cask is focused on application patterns

. . . yes, there is some overlap

Page 13: Kite (Big Data Applications Meetup @ Cask)

Current efforts

●Date, time, and timestamp standardization in Avro and Parquet

●A new table format with snapshot isolation

●An HBase encoding specification for portability
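
For context on the date and timestamp item: the Avro specification defines a `date` logical type as an `int` counting days from the Unix epoch, and `timestamp-millis` as a `long` counting milliseconds from the epoch in UTC. The sketch below shows that encoding with the Python standard library only. The function names are hypothetical and no Avro library is involved.

```python
from datetime import date, datetime, timedelta, timezone

# Schema fragments using Avro logical type annotations.
DATE_SCHEMA = {"type": "int", "logicalType": "date"}
TIMESTAMP_SCHEMA = {"type": "long", "logicalType": "timestamp-millis"}

EPOCH_DATE = date(1970, 1, 1)
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def encode_date(value: date) -> int:
    """Days since 1970-01-01, stored as an Avro int."""
    return (value - EPOCH_DATE).days


def decode_date(days: int) -> date:
    return EPOCH_DATE + timedelta(days=days)


def encode_timestamp_millis(value: datetime) -> int:
    """Milliseconds since the epoch in UTC, stored as an Avro long."""
    if value.tzinfo is None:
        raise ValueError("timestamp must be timezone-aware")
    # Integer division of timedeltas avoids floating-point rounding.
    return (value - EPOCH) // timedelta(milliseconds=1)


def decode_timestamp_millis(millis: int) -> datetime:
    return EPOCH + timedelta(milliseconds=millis)


print(encode_date(date(2015, 6, 23)))
print(encode_timestamp_millis(datetime(2015, 6, 23, tzinfo=timezone.utc)))
```

Agreeing on one encoding like this is what lets a value written by one engine be read as the same date by another.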

Page 14: Kite (Big Data Applications Meetup @ Cask)

Demo!

Page 15: Kite (Big Data Applications Meetup @ Cask)

Thank you

[email protected]

ingest.tips/

[email protected]