Federated Queries with HAWQ - SQL on Hadoop and Beyond

Unless otherwise indicated, these slides are © 2013-2015 Pivotal Software, Inc. and licensed under a Creative Commons Attribution-NonCommercial license: http://creativecommons.org/licenses/by-nc/3.0/

SPRINGONE2GX, WASHINGTON, DC

By Christian Tzolov @christzolov


Safe Harbor Statement

The following is intended to outline the general direction of Pivotal's offerings. It is intended for information purposes only and may not be incorporated into any contract. Any information regarding pre-release of Pivotal offerings, future updates or other planned modifications is subject to ongoing evaluation by Pivotal and is subject to change. This information is provided without warranty of any kind, express or implied, and is not a commitment to deliver any material, code, or functionality, and should not be relied upon in making purchasing decisions regarding Pivotal's offerings. These purchasing decisions should only be based on features currently available. The development, release, and timing of any features or functionality described for Pivotal's offerings in this presentation remain at the sole discretion of Pivotal. Pivotal has no obligation to update forward-looking information in this presentation.


Bio

Christian Tzolov

Pivotal Technical Architect,

BigData, Hadoop, SpringXD

Apache Committer, Crunch PMC member

[email protected]

@christzolov

blog.tzolov.net


Contents

• Data System Concepts

• Data System Reference Architectures

• Real Life Use Case: OLTP and OLAP Data Systems Integration

• Passive Data Synchronization (Demo)

• Federated Queries With HAWQ

• HAWQ Web Tables

• HAWQ PXF Architecture

• Geode PXF (Demo)


How to Compute Arbitrary Functions on Arbitrary Data

Query = Function (All Data)
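The equation can be read literally: a query is a function applied to the complete data set. The following is a minimal sketch of that idea, not part of the talk. The class name `QueryAsFunction`, the helper names, and the list of visited URLs standing in for "all data" are assumptions made for illustration.

```java
import java.util.List;
import java.util.function.Function;

final class QueryAsFunction {

    // A query is a function applied to the complete data set.
    static <R> R query(List<String> allData, Function<List<String>, R> function) {
        return function.apply(allData);
    }

    // Example function: count how often one URL occurs in the visit facts.
    static long viewsFor(List<String> allData, String url) {
        Function<List<String>, Long> countViews =
                data -> data.stream().filter(url::equals).count();
        return query(allData, countViews);
    }

    public static void main(String[] args) {
        // Hypothetical fact data: one visited URL per entry.
        List<String> allData = List.of("/home", "/docs", "/home");
        System.out.println(viewsFor(allData, "/home"));
    }
}
```

The function is deterministic and reads immutable facts, so it can be re-run at any time over the same data.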


Data System Concepts


Data

What is Data?

• Data is Information

• Derived vs. Fact Information

• Information that is not derived will be called Data

• Data is held to be true simply because it exists

• All other information is derived from the Data


Data System

System that computes arbitrary functions on arbitrary data.

• Low Latency

• High Throughput

• Scalability

• Ad hoc queries

• Provenance


Data System Principles

The 6-elements Application = Data-Native System

• Fact Data

• Immutable Data

• Deterministic Functions

• Data-Lineage

• Data Locality - spatial or temporal

• All Data vs. Working Set


Data System – Key Properties


Data System Architectures


Data System Architectural Patterns

How to build Data Systems?

• Data Lake

• Lambda

• Kappa / Samza

• Tachyon


Data Lake (From ETL to ELT)

Single Store For All Data

• Store Raw Data

• No Data Schema Limitations

• No Data Volume Limitations

• Late evaluation

• Extract-Transform-Load (ETL) vs. Extract-Load-Transform (ELT)

• Data Lineage


Lambda Architecture

• Batch Layer

• Serving Layer

• Speed Layer
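In the Lambda architecture a query is typically answered by combining a view precomputed by the batch layer with the recent increments held by the speed layer. The sketch below illustrates that merge step only. The class name `LambdaQuery` and the page-view counts are hypothetical, and a real serving layer would read both views from dedicated stores rather than in-memory maps.

```java
import java.util.HashMap;
import java.util.Map;

final class LambdaQuery {

    // Merges the precomputed batch view with the incremental speed-layer view.
    static Map<String, Long> merge(Map<String, Long> batchView, Map<String, Long> realtimeView) {
        Map<String, Long> merged = new HashMap<>(batchView);
        // Add the counts that arrived after the last batch run.
        realtimeView.forEach((key, count) -> merged.merge(key, count, Long::sum));
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical views: page-view counts keyed by URL.
        Map<String, Long> batchView = Map.of("/home", 100L, "/docs", 40L);
        Map<String, Long> realtimeView = Map.of("/home", 3L, "/blog", 1L);
        System.out.println(merge(batchView, realtimeView));
    }
}
```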


Kappa Architecture

• Durable Queues (Kafka)

• Replaying History

• Versioned Outputs

• Dynamic Output Switch
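The Kappa bullets describe a reprocessing cycle: keep the events in a durable log, replay the log through a new version of the job, write the result to a new output, and switch readers over once it is ready. The sketch below mimics that cycle with an in-memory list instead of a Kafka topic. The class name `KappaReplay`, the event names, and both job versions are assumptions for illustration.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class KappaReplay {

    // Replays the full retained log through one job version and returns a fresh output.
    static Map<String, Long> replay(List<String> log, Function<String, String> keyOf) {
        Map<String, Long> output = new HashMap<>();
        for (String event : log) {
            output.merge(keyOf.apply(event), 1L, Long::sum);
        }
        return output;
    }

    public static void main(String[] args) {
        // Hypothetical durable log of events.
        List<String> log = List.of("Login", "login", "logout");

        // Version 1 of the job counts events by their raw name.
        Map<String, Long> outputV1 = replay(log, event -> event);
        Map<String, Long> serving = outputV1;

        // Version 2 normalises the name; it is built by replaying history.
        Map<String, Long> outputV2 = replay(log, String::toLowerCase);

        // Switch readers to the new versioned output once the replay is done.
        serving = outputV2;
        System.out.println(serving);
    }
}
```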


Tachyon

No write speed constraints

• Data-Lineage Based Fault Tolerance

• Asynchronous Checkpoints

• Unbounded Writes


Real Life Use Case


Real Life Use Case

• Integrate an in-memory data grid, Geode (e.g. GemFire), with a SQL-on-Hadoop OLAP system, HAWQ

• Provide a unified data view across both systems

• Use Geode as a Slowly Changing Dimensions (SCDs) store for HAWQ

• Keep the operational and historical data in sync


Apache Geode & GemFire (OLTP)

• Cache - Performance / Consistency / Resiliency

• Region

• Distributed java.util.Map on steroids

• Highly available, redundant

• Distributed

• Locator, Server, Client


Geode use cases…

China Railway Corporation

• 5,700 train stations

• 4.5 million tickets per day

• 20 million daily users

• 1.4 billion page views per day

• 40,000 visits per second

Indian Railways

• 7,000 stations

• 72,000 miles of track

• 23 million passengers daily

• 120,000 concurrent users

• 10,000 transactions per minute

* http://pivotal.io/big-data/pivotal-gemfire


HAWQ (SQL-on-HADOOP)

• Enterprise SQL on Hadoop analytic engine built around a high-performance Greenplum MPP Database

• HAWQ runs natively on Apache Hadoop, HDFS and YARN.

• Supports multiple Hadoop file formats: Parquet, HDFS and Avro

• 100% ANSI SQL compliant (ANSI SQL-92, SQL-99, and SQL-2003…)

• ODBC and JDBC connectivity

• Implemented in C and C++


HAWQ (SQL-on-HADOOP)

• Runs the TPC-DS benchmark in half the wall-clock time of Impala

• Outperforms Impala by 454% overall

• Compared to Hive, HAWQ provides an additional 344% performance improvement on complex queries

• Impala and Apache Hive™ do not support all standard TPC-DS queries

• HAWQ runs 100% of them natively.


Test Environment

• HDFS – Distributed file system. PHD or HDP Hadoop distributions.

• HAWQ - SQL-on-Hadoop.

• Geode/GemFire - In-memory data grid.

• SpringXD – runtime integration and streaming platform.

• Apache Ambari – Manages all Clusters.

• Apache Zeppelin – Web interface for interacting with different Data Systems.

[Diagram: test cluster components. Geode (Locator, Servers); HAWQ (Master, Segments); HDFS (NameNode, DataNodes); Spring XD (Admin, Containers); Zeppelin Notebook; Ambari]


Test Environment - Ambari Management


Passive Data Synchronization


Passive Synchronization (Case 1)


Passive Synchronization (Case 1) - Demo


Passive Synchronization (Case 2) - gpfdist


Passive Synchronization (Case 2): Demo


Federated Queries With HAWQ


HAWQ Web Table – Architecture

Access dynamic data sources on a web server or by executing OS scripts.


HAWQ Web Table

• HAWQ Web Table - access dynamic data sources on a web server or by executing OS scripts.

• Leverage Geode REST API and OQL.

• RESTful SpringBoot app to convert JSON into TSV.

CREATE EXTERNAL WEB TABLE EMPLOYEE_WEB_TABLE (...)
EXECUTE E'curl http://<adapter proxy>/gemfire-api/v1/queries/adhoc?q=<URLencoded OQL statement>'
ON MASTER
FORMAT 'text' (delimiter '|' null 'null' escape E'\\');
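Two details of the web table are easy to get wrong: the OQL statement must be URL-encoded before it goes into the `q` parameter, and the adapter has to emit rows that match the declared text format (`|` delimiter, `null` marker, backslash escape). The sketch below shows both steps under stated assumptions. The class `WebTableAdapterSketch`, its method names, the proxy host `adapter-proxy:8080`, and the `/Employee` region are hypothetical; only the `/gemfire-api/v1/queries/adhoc?q=` path and the format options come from the slide.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

final class WebTableAdapterSketch {

    static final String DELIMITER = "|";
    static final String NULL_MARKER = "null";

    // Builds the ad hoc query URL; the OQL statement is form-encoded.
    static String adhocQueryUrl(String proxyHostAndPort, String oql) {
        String encoded = URLEncoder.encode(oql, StandardCharsets.UTF_8);
        return "http://" + proxyHostAndPort + "/gemfire-api/v1/queries/adhoc?q=" + encoded;
    }

    // Escapes the escape character first, then the delimiter and newlines.
    static String escape(String value) {
        return value.replace("\\", "\\\\")
                    .replace(DELIMITER, "\\" + DELIMITER)
                    .replace("\n", "\\n");
    }

    // Renders one row in the text format declared by the web table.
    static String formatRow(List<String> fields) {
        return fields.stream()
                .map(field -> field == null ? NULL_MARKER : escape(field))
                .collect(Collectors.joining(DELIMITER));
    }

    public static void main(String[] args) {
        // Hypothetical proxy address and region name.
        System.out.println(adhocQueryUrl("adapter-proxy:8080", "SELECT * FROM /Employee"));
        System.out.println(formatRow(Arrays.asList("1", "Ann|Lee", null)));
    }
}
```

`URLEncoder` encodes spaces as `+`, which is the form-encoding convention; an endpoint that expects `%20` would need a different encoder.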


HAWQ Web Table – Limitations

• Not Scalable

• No “Filter Push Down”

• Inflexibility

• No Compression

• Requires Additional Components


HAWQ PXF/Geode External Table


Pivotal eXtension Framework (PXF)

• Java-Based

• Parallel, High Throughput Data Access

• Heterogeneous Data Sources.

• ANSI-compliant SQL On Any Dataset

• Wide variety of PXF plugins


PXF Architecture


PXF Data Model

• Data Source is modeled as a collection of one or more Fragments.

• Each Fragment consists of many Rows that in turn are split into typed Fields.

• Analyzer (optional) provides PXF statistical data for the HAWQ query optimizer

• Metadata about the data source locations, access attributes, table schemas, formats, SQL query filters, etc.
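The containment hierarchy described above (data source, fragments, rows, typed fields) can be pictured with a few plain classes. This is an illustrative sketch only: the types below are hypothetical stand-ins and are not the PXF API classes, whose real shapes differ.

```java
import java.util.List;

final class PxfDataModelSketch {

    // Hypothetical subset of field types.
    enum FieldType { INTEGER, TEXT }

    // A typed field: the smallest unit handed to the query engine.
    static final class Field {
        final FieldType type;
        final Object value;
        Field(FieldType type, Object value) { this.type = type; this.value = value; }
    }

    // A row is an ordered list of typed fields.
    static final class Row {
        final List<Field> fields;
        Row(List<Field> fields) { this.fields = fields; }
    }

    // A fragment is an independently readable slice of the data source.
    static final class Fragment {
        final String location;
        final List<Row> rows;
        Fragment(String location, List<Row> rows) { this.location = location; this.rows = rows; }
    }

    // A data source is a collection of one or more fragments.
    static final class DataSource {
        final List<Fragment> fragments;
        DataSource(List<Fragment> fragments) { this.fragments = fragments; }

        // Total number of rows across all fragments.
        long rowCount() {
            return fragments.stream().mapToLong(fragment -> fragment.rows.size()).sum();
        }
    }

    static Row row(int id, String name) {
        return new Row(List.of(new Field(FieldType.INTEGER, id), new Field(FieldType.TEXT, name)));
    }

    public static void main(String[] args) {
        DataSource source = new DataSource(List.of(
                new Fragment("host-a", List.of(row(1, "Ann"), row(2, "Bob"))),
                new Fragment("host-b", List.of(row(3, "Cid")))));
        System.out.println(source.rowCount());
    }
}
```

Because each fragment can be read on its own, fragments are the natural unit for reading a source in parallel.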


PXF Processors


PXF Runtime Model


PXF External Tables Definition

CREATE EXTERNAL TABLE ext_table_name <attr list, ...>
LOCATION('pxf://<host>:<port>/path/to/data?FRAGMENTER=package.name.FragmenterForX&ACCESSOR=package.name.AccessorForX&RESOLVER=package.name.ResolverForX&<Other custom user options>=<value>')
FORMAT 'custom' (formatter='pxfwritable_import');


Available PXF Plugins

• HdfsTextSimple

• HdfsTextMulti

• Hive

• HiveRC

• HiveText

• HBase

• Avro

• Accumulo

• Cassandra

• JSON

• Redis

• Geode/GemFire

• Pipes


Build Geode PXF Plugin


PXF Geode Profile

<profile>
  <name>GEMFIRE</name>
  <description>A profile for reading Gemfire data</description>
  <plugins>
    <fragmenter>io.pivotal.pxf.plugins.gemfire.GemfireFragmenter</fragmenter>
    <accessor>io.pivotal.pxf.plugins.gemfire.GemfireAccessor</accessor>
    <resolver>io.pivotal.pxf.plugins.gemfire.GemfireResolver</resolver>
  </plugins>
</profile>


Geode External Table Definition

CREATE EXTERNAL TABLE <GEMFIRE_TABLE_NAME> (...)
LOCATION('pxf://<namenode>/<path>?PROFILE=GEMFIRE&LOCATORS=<gemfire-server:port>&REGION=<region-name>')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
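The LOCATION value is a single URI whose options are joined with `&` and no surrounding spaces. A small helper can assemble it from its parts, as sketched below. The class `GeodeLocation`, its method name, and the sample host names, ports, path, and region are hypothetical; the `PROFILE`, `LOCATORS`, and `REGION` option names are the ones shown in the definition.

```java
final class GeodeLocation {

    // Assembles the pxf:// URI used in the LOCATION clause of the external table.
    static String locationUri(String namenode, String path, String locators, String region) {
        return "pxf://" + namenode + "/" + path
                + "?PROFILE=GEMFIRE"
                + "&LOCATORS=" + locators
                + "&REGION=" + region;
    }

    public static void main(String[] args) {
        // Hypothetical hosts, ports, path, and region name.
        System.out.println(locationUri("namenode:51200", "employees", "locator1:10334", "Employee"));
    }
}
```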


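Geode Fragmenter

import java.util.Arrays;
import java.util.List;
// PXF API imports (Fragmenter, Fragment, InputData) are omitted, as on the slide.

public class GemfireFragmenter extends Fragmenter {

    public GemfireFragmenter(InputData inputData) throws Exception {
        super(inputData);
    }

    // Returns a single fragment that covers the whole data source.
    public List<Fragment> getFragments() throws Exception {
        return Arrays.asList(new Fragment(
                inputData.getDataSource(),
                new String[] { inputData.getUserProperty("URL-HOST") },
                new byte[0]));
    }
}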


Geode Accessor

public class GemfireAccessor extends Plugin implements ReadAccessor {

    public GemfireAccessor(InputData input) throws Exception {
        super(input);
    }

    public boolean openForRead() throws Exception { … }

    public OneRow readNextObject() throws Exception { … }

    public void closeForRead() throws Exception { … }
}


Geode Resolver

public class GemfireResolver extends Plugin implements ReadResolver {

    public GemfireResolver(InputData input) {
        super(input);
    }

    public List<OneField> getFields(OneRow paramOneRow) throws Exception { }
}


Demo 3: Federated queries with PXF/Geode


Demo 3: Federated queries with PXF/Geode


By Christian Tzolov @christzolov

HADOOP WORKFLOWS AND DISTRIBUTED YARN APPS USING SPRING TECHNOLOGIES

Learn More. Stay Connected.

@springcentral Spring.io/video