Building Data Pipelines with the Kite SDK
Joey Echeverria // Software Engineer
Problem
Hadoop
©2015 Cloudera, Inc. All rights reserved.
Logs
[Diagram: many Apache HTTPD servers writing log files and syslog to local disk, collected by Kafka and Flume and delivered to HDFS]
RDBMS
[Diagram: relational databases imported into HDFS with Sqoop]
Sea of text files
[Diagram: a sea of CSV files]
A note on Hadoop
Hadoop
• Technically: HDFS, YARN, MapReduce
• Hadoop ecosystem: Hadoop, HBase, Flume, Sqoop, Kafka, Oozie, Hive, Impala, Pig, Crunch, Spark, etc.
  – I’ll also call this just “Hadoop”
Introduction to the Kite SDK
Data
• Hadoop is all about data
• Bring all of your data to one platform
• Access data using the best engine for your use case
Open source core
• Hadoop ecosystem built from open source components
• Benefits:
  – Shared investments
  – No vendor lock-in
  – Fast evolution
• Costs:
  – APIs tend to be low-level
  – Integration is ad hoc
Storage APIs
• HDFS: filesystem
• HBase: byte array keys -> byte array values
Relational systems
[Diagram: the application is user code, the JDBC driver is provided, and the data files behind the database are maintained by the database]
Hadoop without Kite
[Diagram: applications read and write data files and HBase directly, so everything above the storage layer is user code, unlike the application/JDBC-driver/database split]
Hadoop with Kite
[Diagram: applications talk to Kite, and the data files and HBase tables are maintained by Kite, mirroring the application/JDBC-driver/database split]
Kite
• Kite is the data API for the Hadoop ecosystem
• Kite makes it easy to put your data into Hadoop and to use it once it’s there.
Abstractions
• Data is stored in datasets
• Datasets are made up of entities
• Related datasets are grouped into namespaces
Datasets
• A collection of entities/records
  – Like a relational database table
• Data types and field names defined by an Avro schema
• Identified by URI
  – dataset:hdfs:/datasets/movie/ratings
  – dataset:hive:movie/ratings
  – dataset:hbase:zk1,zk2,zk3/ratings
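Since a dataset’s field names and types come from an Avro schema, a ratings dataset like the ones above could be described by a schema along these lines (the field names here are illustrative, not taken from the deck; the `org.grouplens.Rating` name and `ts` timestamp field reappear in the CLI example later):

```json
{
  "type": "record",
  "name": "Rating",
  "namespace": "org.grouplens",
  "fields": [
    {"name": "userId",  "type": "long"},
    {"name": "movieId", "type": "long"},
    {"name": "rating",  "type": "double"},
    {"name": "ts",      "type": "long"}
  ]
}
```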
Entities
• A single record in a dataset
  – Think row in a relational database table
• Entities can be complex and nested
  – Avro compiled objects
  – Avro generic objects
  – Plain old Java objects (POJOs)
Namespaces
• Namespaces group related datasets
  – Think database or schema in a relational system
• Dataset names are unique within the same namespace
Dataset URIs

| Scheme | Pattern | Example |
| --- | --- | --- |
| Hive | dataset:hive:&lt;namespace&gt;/&lt;dataset-name&gt; | dataset:hive:movielens/movies |
| HDFS | dataset:hdfs:/&lt;path&gt;/&lt;namespace&gt;/&lt;dataset-name&gt; | dataset:hdfs:/datasets/movielens/movies |
| Local FS | dataset:file:/&lt;path&gt;/&lt;namespace&gt;/&lt;dataset-name&gt; | dataset:file:/tmp/data/movielens/movies |
| HBase | dataset:hbase:&lt;zookeeper-hosts&gt;/&lt;dataset-name&gt; | dataset:hbase:zoo-1,zoo-2,zoo-3/movies |

• Hive URIs accept an optional location parameter for external tables
  – dataset:hive:movielens/movies?location=/datasets/movielens/movies
• HDFS URIs accept an optional nameservice and host
  – dataset:hdfs://namenode:8020/datasets/movielens/movies
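To make the URI shapes concrete, here is a toy parser for the Hive pattern. It is purely illustrative; Kite resolves dataset URIs for you, and this class is not part of its API:

```java
// Toy parser for dataset:hive:<namespace>/<dataset-name> URIs.
// Illustrative only -- Kite handles URI resolution internally.
public class DatasetUriDemo {
    static String[] parseHiveUri(String uri) {
        String prefix = "dataset:hive:";
        if (!uri.startsWith(prefix)) {
            throw new IllegalArgumentException("not a Hive dataset URI: " + uri);
        }
        // Split the remainder into namespace and dataset name.
        String[] parts = uri.substring(prefix.length()).split("/", 2);
        if (parts.length != 2) {
            throw new IllegalArgumentException("expected <namespace>/<dataset-name>");
        }
        return parts; // { namespace, dataset name }
    }

    public static void main(String[] args) {
        String[] parsed = parseHiveUri("dataset:hive:movielens/movies");
        System.out.println("namespace=" + parsed[0] + " dataset=" + parsed[1]);
    }
}
```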
What Kite isn’t
• Ingestion framework
  – Integrates with Sqoop, Flume, and Kafka; doesn’t replace them
• ETL tool
  – Basic command-line tool
  – Complete ETL tools can build on Kite
• Processing language
  – SQL, Crunch, MapReduce, Spark, Pig, etc.
Ingest integration
• Flume
  – Stream log events directly into Kite datasets
• Sqoop
  – Ingest relational database tables into Kite datasets
• Kafka
  – Integration is through Flafka (Flume/Kafka integration)
Data processing integration
• MapReduce
  – Input/OutputFormats
• Crunch
  – Source and target
• Spark
  – Use Input/OutputFormats to convert datasets to RDDs
• Impala, Hive, Pig
  – Use underlying file format support
What does Kite do for you?
• Codifies best practices
• Interoperability
• Shields you from Hadoop, Hive, etc. version changes
• Get up and running faster
Open source
• Kite is Apache 2.0 licensed
• Hosted on GitHub
• Compatibility
  – Tested against upstream Apache Hadoop 1.0 and 2.3 as well as CDH4/5
• Contributors: Cloudera, Cerner, Capital One, Intel, Pivotal
• Distributions: Cloudera, Hortonworks, Pivotal, MapR
Resources
• Site: http://kitesdk.org
• Kite guide: http://tiny.cloudera.com/KiteGuide
• Data module overview: http://tiny.cloudera.com/Datasets
• Command-line interface tutorial: http://tiny.cloudera.com/KiteCLI
• Kite examples: https://github.com/kite-sdk/kite-examples
Using Kite
Architecture
[Diagram: the Kite CLI infers an Avro schema from CSV, creates a dataset in HDFS, and loads the CSV into it; Crunch processes the data in HDFS and Impala reports on the result]
31
Dataset schemes
• Pluggable dataset interface with multiple schemes
• Schemes determine underlying storage mechanism and metadata provider
• HDFS– Data stored in HDFS directories
– Metadata stored in an Avro schema file and a Java properties file in the dataset directory
• Hive– Data stored in HDFS directories
– Metadata stored in Hive metastore
• HBase– Data and metadata ©2015 Cloudera, Inc. All rights reserved.
![Page 32: Building data pipelines with kite](https://reader031.fdocuments.net/reader031/viewer/2022020106/55a4ddcc1a28ab43768b4629/html5/thumbnails/32.jpg)
32
Which scheme?
• HDFS– Best for raw data and intermediate data in an ETL pipeline
– No SQL access
• Hive– Best for data that is ready for query or SQL ETL
– No performance difference between Hive and HDFS-backed datasets
• HBase– Best for online serving applications
– Provides sorted keys
– Optimistic concurrency control
©2015 Cloudera, Inc. All rights reserved.
Dataset formats
• Physical serialization format
• Avro
  – Row-based storage format with schemas and compression
• Parquet
  – Column-based storage format optimized for query access
• CSV
  – Read-only format
  – Used by ETL jobs to read raw data files
Avro
[Diagram: records stored row by row, one complete record after another]
Parquet
[Diagram: values stored column by column across records]
When to choose which format
• Avro
  – Access all fields of a record at the same time
  – Intermediate/non-long-lived data
• Parquet
  – Access a subset of fields/columns at a time
  – SQL tables (Impala/Hive)
Compression type
• Uncompressed
  – Nope. Nope. Nope. Nope.
• Snappy
  – Default
  – Balances compression and speed
  – Fastest for query
• Deflate/gzip
  – Good for archived/infrequently accessed data
  – Slow writes, decent read performance
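The speed-versus-size trade-off can be felt with the JDK’s own deflate support. Snappy is not in the JDK, so this sketch uses java.util.zip compression levels as a stand-in; Kite itself delegates compression to the underlying file format’s codecs:

```java
// Compare deflate's fastest and smallest settings on repetitive data.
// Stand-in for the Snappy-vs-gzip trade-off; not Kite code.
import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.DeflaterOutputStream;

public class CompressionDemo {
    static int compressedSize(byte[] data, int level) throws Exception {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream dos =
                 new DeflaterOutputStream(out, new Deflater(level))) {
            dos.write(data);
        }
        return out.size();
    }

    public static void main(String[] args) throws Exception {
        byte[] sample = "the quick brown fox ".repeat(500).getBytes();
        System.out.println("uncompressed:    " + sample.length);
        System.out.println("fast (level 1):  "
            + compressedSize(sample, Deflater.BEST_SPEED));
        System.out.println("small (level 9): "
            + compressedSize(sample, Deflater.BEST_COMPRESSION));
    }
}
```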
Configuration
• Schema
  – Record fields, like a table definition
Demo
• Demo schema inference/generation
Configuration
• Schema
  – Record fields, like a table definition
• Partition strategy
  – Physical layout/storage key definition
Partitioning
• Map entity fields to partitions
• Unlike Hive, partitions are tied to per-entity data
• Common partition types: values, hashes, timestamp parsing
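To make the timestamp-parsing case concrete: a year/month/day strategy effectively maps a record’s timestamp field to a storage directory, roughly as sketched below. The class and method names here are hypothetical; Kite derives the real layout from the partition strategy definition:

```java
// Conceptual sketch of a ts:year/ts:month/ts:day partition strategy:
// map an epoch-millis timestamp to a Hive-style partition path.
// Not Kite's implementation -- illustrative only.
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

public class PartitionPathDemo {
    static String partitionPath(long tsMillis) {
        ZonedDateTime t = Instant.ofEpochMilli(tsMillis).atZone(ZoneOffset.UTC);
        return String.format("year=%d/month=%02d/day=%02d",
                t.getYear(), t.getMonthValue(), t.getDayOfMonth());
    }

    public static void main(String[] args) {
        // 2015-03-05T00:00:00Z
        System.out.println(partitionPath(1425513600000L));
    }
}
```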
Demo
• Demo partition definition
Command-line interface
• Experiment before understanding
• Creates configuration files
• Handles dataset lifecycle: create, update, delete
• Basic ETL tasks
  – Copy datasets
  – Transform individual records
• Import CSV
Example
1. Describe your data
   kite-dataset obj-schema org.grouplens.Rating \
     --jar group-lens-1.0.jar -o rating.avsc
2. Describe your layout
   kite-dataset partition-config ts:year ts:month ts:day \
     --schema rating.avsc -o ymd.json
3. Create a dataset
   kite-dataset create ratings --schema rating.avsc \
     --partition-by ymd.json
Command-line interface
• Two packages
  – Standalone for on-cluster use
  – Tarball with dependencies for remote access (CDH5-only)
• Environment variables: HIVE_HOME, HIVE_CONF_DIR, HBASE_HOME, HADOOP_MAPRED_HOME, HADOOP_COMMON_HOME
• Debug environment: debug=true ./kite-dataset <command>
• Verbose output: ./kite-dataset -v <command>
Demo
• Demo dataset creation with the CLI
• Demo dataset loading with the CLI
Maven parent POM
• Consolidated Kite and Hadoop dependencies
• To use: set kite-app-parent-cdh4 or kite-app-parent-cdh5 as your project’s parent POM

<parent>
  <groupId>org.kitesdk</groupId>
  <artifactId>kite-app-parent-cdh5</artifactId>
  <version>0.17.1</version>
</parent>
Demo
• Demo Maven project using the Kite parent POM
Crunch
• Java dataflow API
• Runs pipelines in memory, MapReduce, or Spark
• Parallel collections
Use Crunch with Kite
• CrunchDatasets helper class
  – CrunchDatasets.asSource(View view)
  – CrunchDatasets.asTarget(View view)
• Supports Crunch write modes: default, overwrite, and append

PCollection<Movie> movies = getPipeline().read(
    CrunchDatasets.asSource("dataset:hive:movies", Movie.class));

• Re-partition data before writing

PCollection<Movie> partitionedMovies =
    CrunchDatasets.partition(movies, targetDataset);
Demo
• Demo Crunch processing on Kite
Impala
• Massively parallel processing (MPP) database
• SQL
• Distributed
• Fast
Demo
• Demo querying a Kite dataset with Impala
Architecture
[Diagram: the Kite CLI infers an Avro schema from CSV, creates a dataset in HDFS, and loads the CSV into it; Crunch processes the data in HDFS and Impala reports on the result]
Thank you