Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014


Cloudera Impala Real-Time SQL


Page 1: Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014

Cloudera Impala

LV Big Data Monthly Meetup #1

November 5th 2014

Maxime Dumas

Systems Engineer

Page 2

Thirty Seconds About Max

• Systems Engineer

• aka Sales Engineer

• SoCal, AZ, NV

• former coder of PHP

• teaches meditation + yoga

• from Montreal, Canada

Page 3

What Does Cloudera Do?

• product

• distribution of Hadoop components, Apache licensed

• enterprise tooling

• support

• training

• services (aka consulting)

• community

Page 4

What This Talk Isn’t About

• deploying

• Puppet, Chef, Ansible, homegrown scripts, intern labor

• sizing & tuning

• depends heavily on data and workload

• coding

• unless you count XML or CSV or SQL

• algorithms

Page 5

What is Cloudera Impala?

Page 6

Image credit: IFCAR (public domain)

Page 7

cloud·e·ra im·pal·a


/kloudˈi(ə)rə imˈpalə/

noun

a modern, open source, MPP SQL query engine for Apache Hadoop.

“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”

Page 8

Impala adoption

Vendor support by component (founder in parentheses):

Component (and Founder)       Cloudera  MapR  Amazon  IBM  Pivotal  Hortonworks
Impala (Cloudera)             ✔         ✔     ✔       X    X        X
Hue (Cloudera)                ✔         ✔     X       X    X        ✔
Sentry (Cloudera)             ✔         ✔     X       ✔    ✔        X
Flume (Cloudera)              ✔         ✔     X       ✔    ✔        ✔
Parquet (Cloudera/Twitter)    ✔         ✔     X       ✔    ✔        X
Sqoop (Cloudera)              ✔         ✔     ✔       ✔    ✔        ✔
Ambari (Hortonworks)          X         X     X       X    ✔        ✔
Knox (Hortonworks)            X         X     X       X    X        ✔
Tez (Hortonworks)             X         X     X       X    X        ✔
Drill (MapR)                  X         ✔     X       X    X        X

Page 9

The Apache Hadoop Ecosystem

Quick and dirty, for context.

Page 10

Why Hadoop?

• Scalability
  • Scales by simply adding nodes
  • Local processing to avoid network bottlenecks

• Efficiency
  • Cost efficiency (<$1k/TB) on commodity hardware
  • Unified storage, metadata, security (no duplication or synchronization)

• Flexibility
  • All kinds of data (blobs, documents, records, etc.)
  • In all forms (structured, semi-structured, unstructured)
  • Store anything, then later analyze what you need

©2014 Cloudera, Inc. All rights reserved.

Page 11

Why “Ecosystem?”

• In the beginning, just Hadoop
  • HDFS
  • MapReduce

• Today, dozens of interrelated components
  • I/O
  • Processing
  • Specialty Applications
  • Configuration
  • Workflow

Page 12

HDFS

• Distributed, highly fault-tolerant filesystem

• Optimized for large streaming access to data

• Based on Google File System

• http://research.google.com/archive/gfs.html

Page 13

Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [OSCON ’07]

Page 14

MapReduce (MR)

• Programming paradigm

• Batch oriented, not realtime

• Works well with distributed computing

• Lots of Java, but other languages supported

• Based on Google’s paper

• http://research.google.com/archive/mapreduce.html
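The paradigm can be sketched in a few lines of plain Python, with no Hadoop involved. This is only an illustration of the map, shuffle, and reduce steps on a word count; the function names are made up for this sketch, and a real job would run these phases in parallel across many machines.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())


counts = word_count(["impala runs on hadoop", "hive runs on hadoop too"])
```

Each phase only sees its own slice of the data, which is why the model distributes well but stays batch oriented.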

Page 15

Apache Hive

• Abstraction of Hadoop’s Java API

• HiveQL “compiles” down to MR

• a “SQL-like” language

• Eases analysis using MapReduce

Page 16

Apache Hive Metastore

• Maps HDFS files to DB-like resources

• Databases

• Tables

• Column/field names, data types

• Roles/users

• InputFormat/OutputFormat
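As a rough mental model, a metastore entry is a record that ties a table name to a directory of files plus the schema and formats needed to read them. The sketch below is a toy in-memory stand-in, not the Hive Metastore API; the class names, the `cities` table, and its path are assumptions for illustration, and roles/users are left out.

```python
from dataclasses import dataclass, field


@dataclass
class TableEntry:
    """One hypothetical metastore record: a table name mapped to HDFS files."""
    database: str
    name: str
    location: str                                  # HDFS directory with the data files
    columns: dict = field(default_factory=dict)    # column name -> data type
    input_format: str = "org.apache.hadoop.mapred.TextInputFormat"
    output_format: str = "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"


class ToyMetastore:
    """In-memory stand-in for the metastore; not the real Hive API."""

    def __init__(self):
        self._tables = {}

    def register(self, entry):
        self._tables[(entry.database, entry.name)] = entry

    def lookup(self, database, name):
        return self._tables[(database, name)]


metastore = ToyMetastore()
metastore.register(TableEntry(
    database="default",
    name="cities",
    location="hdfs:///user/hive/warehouse/cities",   # example path only
    columns={"city": "string", "state": "string", "population": "bigint"},
))

entry = metastore.lookup("default", "cities")
```

A query engine that shares this catalog can resolve `default.cities` to files and a schema without copying or converting the data.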

Page 17

Architecture

CLOUDERA’S ENTERPRISE DATA HUB

[Architecture diagram. Labels:]

• 3RD PARTY APPS
• BATCH PROCESSING: MAPREDUCE, SPARK
• ANALYTIC SQL: IMPALA
• SEARCH: SOLR
• MACHINE LEARNING
• STREAM PROCESSING: SPARK
• WORKLOAD MANAGEMENT: YARN
• STORAGE FOR ANY TYPE OF DATA (UNIFIED, ELASTIC, RESILIENT, SECURE)
  • FILESYSTEM: HDFS
  • ONLINE NOSQL: HBASE
• DATA MANAGEMENT: CLOUDERA NAVIGATOR
• SYSTEM MANAGEMENT: CLOUDERA MANAGER
• SENTRY
• PARTNERS, MAHOUT

Page 18

But wait…

WHY DO WE NEED THIS?

Page 19

Page 20

Cloudera Impala

Familiar interface, but more powerful.

Page 21

Cloudera Impala

• Interactive query on Hadoop

• think seconds, not minutes

• ANSI-92 standard SQL

• compatible with HiveQL

• Native MPP query engine

• built for low-latency queries

• HDFS and HBase storage

Page 22

Cloudera Impala – Design Choices

• Native daemons, written in C/C++

• No JVM, no MapReduce

• Saturate disks on reads

• Uses in-memory HDFS caching

• Re-uses Hive metastore

• Not as fault-tolerant as MapReduce

Page 23

Benefits of Impala

Unlocks BI/analytics on Hadoop
• Interactive SQL in seconds
• Highly concurrent to handle 100s of users

Native Hadoop flexibility
• No data migration, conversion, or duplication required
• Query existing Hadoop data
• Run multiple frameworks on the same data at the same time
• Supports Parquet for best-of-breed columnar performance

Native MPP query engine designed into Hadoop:
• Unified Hadoop storage
• Unified Hadoop metadata (uses Hive and HCatalog)
• Unified Hadoop security
• Fine-grained role-based access controls with Sentry

Apache-licensed open source

Proven in Production

Page 24

Cloudera Impala – Architecture

• Impala Daemon
  • runs on every node
  • handles client requests
  • handles query planning & execution

• State Store Daemon
  • provides name service
  • metadata distribution
  • used for finding data

Page 25

Impala Query Execution

[Diagram: a SQL App sends a SQL request over ODBC to one of three nodes. Each node shows a Query Planner, Query Coordinator, and Query Executor, with HDFS DN and HBase beneath. Shared services: Hive Metastore, HDFS NN, Statestore.]

1) Request arrives via ODBC/JDBC/HUE/Shell

Page 26

Impala Query Execution

2) Planner turns request into collections of plan fragments

3) Coordinator initiates execution on impalad(s) local to data

Page 27

Impala Query Execution

4) Intermediate results are streamed between impalad(s)

5) Query results are streamed back to client
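The five steps can be sketched as a scatter-gather over per-node data. This is a simplified simulation, not Impala code: the node names, rows, and function names are invented, the SQL string is carried along but never parsed, and partial results are merged on the coordinator rather than streamed between daemons.

```python
from collections import Counter

# Hypothetical cluster: each impalad holds some rows of a (state, population) table.
LOCAL_DATA = {
    "node1": [("NV", 600), ("CA", 3900)],
    "node2": [("NV", 250), ("AZ", 1500)],
    "node3": [("CA", 1400), ("AZ", 500)],
}


def plan(query):
    """Steps 1-2: turn the request into one plan fragment per node holding data."""
    return [{"node": node, "op": query} for node in LOCAL_DATA]


def execute_fragment(fragment):
    """Step 3: each executor scans only its local rows and pre-aggregates them."""
    partial = Counter()
    for state, population in LOCAL_DATA[fragment["node"]]:
        partial[state] += population
    return partial


def coordinate(query):
    """Steps 4-5: merge the partial results and return the final answer."""
    final = Counter()
    for fragment in plan(query):
        final.update(execute_fragment(fragment))  # Counter.update adds counts
    return dict(final)


result = coordinate("SELECT state, SUM(population) FROM cities GROUP BY state")
```

Only small partial aggregates cross the network; the raw rows are read where they live.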

Page 28

Cloudera Impala – Results

• Allows for fast iteration/discovery

• How much faster?

• 3-4x faster on I/O bound workloads

• up to 45x faster on multi-MR queries

• up to 90x faster on in-memory cache

Page 29

Latest SQL Performance

Single User vs 10 User Response Time / Impala Times Faster
(Lower bars = better)

[Bar chart. Response time in seconds; multiples in parentheses are the chart's "Impala times faster" labels.]

Engine         Single User    10 Users
Impala         5              11
Spark SQL      25 (5.0x)      120 (10.6x)
Presto         37 (7.4x)      302 (27.4x)
Hive-on-Tez    77 (15.4x)     202 (18.3x)

Independent validation by IBM Research SQL-on-Hadoop VLDB paper: “Impala’s database architecture provides significant performance gains”

Page 30

Previous Milestones

[Timeline chart; vertical axis: Analytic Database Capabilities]

• Spring 2013: Impala 1.0 (GA)
• Summer 2013: Impala 1.1 (Security)
• Fall 2013: Impala 1.2 (Usability)
• Spring 2014: Impala 1.3 (Resource Management)
• Summer 2014: Impala 1.4 (Extensibility)
• Fall 2014: Impala 2.0 (SQL)

Page 31

Cloudera Impala 2.0

Window Functions

“Aggregate function applied to a partition of the result set” (SQL 2003)

Ex:
sum(population) OVER (PARTITION BY city)
rank() OVER (PARTITION BY state ORDER BY population)

We’ve implemented most of the spec
• PARTITION BY, ORDER BY
• WINDOW
  • PRECEDING, FOLLOWING
  • ROWS
• Any number of analytic functions in one query
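To make the semantics concrete, here is a plain-Python sketch of what a partitioned sum and a partitioned rank compute. It is an emulation for illustration, not how Impala evaluates them; the rows and helper names are invented, and the sum here is partitioned by state rather than city so that each partition has more than one row.

```python
from collections import defaultdict

# Made-up rows for illustration: (state, city, population).
rows = [
    ("NV", "Las Vegas", 600),
    ("NV", "Reno", 230),
    ("CA", "Los Angeles", 3900),
    ("CA", "San Diego", 1400),
    ("CA", "San Jose", 1000),
]


def sum_over_partition(rows, key, value):
    """sum(value) OVER (PARTITION BY key): every row keeps its partition total."""
    totals = defaultdict(int)
    for row in rows:
        totals[key(row)] += value(row)
    return [row + (totals[key(row)],) for row in rows]


def rank_over_partition(rows, key, order):
    """rank() OVER (PARTITION BY key ORDER BY order): ties share a rank."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[key(row)].append(row)
    ranks = {}
    for members in partitions.values():
        ordered = sorted(members, key=order)
        for position, row in enumerate(ordered, start=1):
            if position > 1 and order(row) == order(ordered[position - 2]):
                ranks[row] = ranks[ordered[position - 2]]  # tie: reuse the rank
            else:
                ranks[row] = position
    return [row + (ranks[row],) for row in rows]


with_totals = sum_over_partition(rows, key=lambda r: r[0], value=lambda r: r[2])
with_ranks = rank_over_partition(rows, key=lambda r: r[0], order=lambda r: r[2])
```

Unlike GROUP BY, the row count is unchanged: each input row comes back with one extra column.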

Page 32

Cloudera Impala 2.0

Subqueries

A query that is part of another query. Ex:

select col from t1
where col in
(select c2 from t2)

Support:

• Correlated and uncorrelated subqueries.

• IN, NOT IN, EXISTS, NOT EXISTS
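The slide's query can be tried on any SQL engine that supports subqueries. The sketch below uses Python's built-in sqlite3 module as a stand-in, not Impala, with made-up rows in `t1` and `t2`; it runs the uncorrelated IN example and adds a correlated NOT EXISTS variant for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (col INTEGER);
    CREATE TABLE t2 (c2 INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3), (4);
    INSERT INTO t2 VALUES (2), (4), (6);
""")

# Uncorrelated subquery: the inner query runs independently of the outer row.
in_rows = conn.execute(
    "SELECT col FROM t1 WHERE col IN (SELECT c2 FROM t2) ORDER BY col"
).fetchall()

# Correlated subquery: the inner query refers to the outer row (t1.col).
not_exists_rows = conn.execute(
    "SELECT col FROM t1 "
    "WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c2 = t1.col) "
    "ORDER BY col"
).fetchall()

conn.close()
```

The two result sets are complementary here: IN keeps the values of `t1` found in `t2`, and NOT EXISTS keeps the rest.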

Page 33

Cloudera Impala 2.0

Spill to disk joins & aggregations

• Previously, if a query ran out of memory, Impala would abort it
  • This meant some big joins (fact table – fact table) could never run.

• All operators that accumulate memory can now spill to disk if necessary.
  • Order by (Impala 1.4)
  • Join/Agg (Impala 2.0)
  • Analytic Functions (Impala 2.0)

• Transparent to existing workloads
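The general idea of spilling can be sketched with a toy aggregation: keep groups in memory while they fit, and write overflow rows to partitioned temp files for a second pass. This illustrates the technique only and is not Impala's implementation; `max_keys_in_memory` is an assumed stand-in for a real memory limit, and the second pass assumes each spilled partition fits in memory.

```python
import pickle
import tempfile
from collections import defaultdict


def spilling_sum(pairs, max_keys_in_memory, num_partitions=4):
    """Sum values per key, spilling rows to temp files once memory is 'full'."""
    in_memory = defaultdict(int)
    spill_files = [None] * num_partitions
    spilled = False

    for key, value in pairs:
        if key in in_memory or len(in_memory) < max_keys_in_memory:
            in_memory[key] += value          # fast path: stays in memory
            continue
        # Memory budget exhausted: append the row to its partition's spill file.
        spilled = True
        part = hash(key) % num_partitions
        if spill_files[part] is None:
            spill_files[part] = tempfile.TemporaryFile()
        pickle.dump((key, value), spill_files[part])

    result = dict(in_memory)
    # Second pass: aggregate each spilled partition on its own.
    for spill in spill_files:
        if spill is None:
            continue
        spill.seek(0)
        partial = defaultdict(int)
        while True:
            try:
                key, value = pickle.load(spill)
            except EOFError:
                break
            partial[key] += value
        spill.close()
        result.update(partial)   # spilled keys never overlap in-memory keys
    return result, spilled


pairs = [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("d", 5), ("c", 6)]
totals, did_spill = spilling_sum(pairs, max_keys_in_memory=2)
```

The caller gets the same totals whether or not anything spilled, which is the sense in which spilling is transparent to existing workloads; the cost is extra disk I/O.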

Page 34

Cloudera Impala 2.1+

• Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015)

• MERGE statement – enables merging updates into existing tables

• Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SETS

• SQL SET operators – MINUS, INTERSECT

• Apache HBase CRUD – allows use of Impala for inserts and updates into HBase

• UDTFs (user-defined table functions) – for more advanced user functions and extensibility

• Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of the performance gains of Impala

• Parquet enhancements – continued performance gains including index pages

• Amazon S3 integration

Page 35

Quick Demo

Hold onto something, folks.

Page 36

Try It Out

Apache-licensed open source

• Download: cloudera.com/downloads

• Email: [email protected]

• Join: groups.cloudera.org

Cloudera Live

Free, Interactive Tutorials at cloudera.com/live

Page 37

Special thanks: LAS VEGAS BIG DATA

Page 38

Questions?

Preferably related to the talk… or not.

Page 39

Thank You!

Maxime Dumas

[email protected]

We’re hiring.