Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014

Cloudera Impala

LV Big Data Monthly Meetup #1

November 5th 2014

Maxime Dumas, Systems Engineer

Thirty Seconds About Max

• Systems Engineer

• aka Sales Engineer

• SoCal, AZ, NV

• former coder of PHP

• teaches meditation + yoga

• from Montreal, Canada

What Does Cloudera Do?

• product

• distribution of Hadoop components, Apache licensed

• enterprise tooling

• support

• training

• services (aka consulting)

• community

What This Talk Isn’t About

• deploying

• Puppet, Chef, Ansible, homegrown scripts, intern labor

• sizing & tuning

• depends heavily on data and workload

• coding

• unless you count XML or CSV or SQL

• algorithms

What is Cloudera Impala?

Image: IFCAR, public domain

cloud·e·ra im·pal·a

/kloudˈi(ə)rə imˈpalə/

noun

a modern, open source, MPP SQL query engine for Apache Hadoop.

“Cloudera Impala provides fast, ad hoc SQL query capability for Apache Hadoop, complementing traditional MapReduce batch processing.”

Impala adoption

Vendor Support

Component (and Founder)      Cloudera  MapR  Amazon  IBM  Pivotal  Hortonworks
Impala (Cloudera)            ✔         ✔     ✔       X    X        X
Hue (Cloudera)               ✔         ✔     X       X    X        ✔
Sentry (Cloudera)            ✔         ✔     X       ✔    ✔        X
Flume (Cloudera)             ✔         ✔     X       ✔    ✔        ✔
Parquet (Cloudera/Twitter)   ✔         ✔     X       ✔    ✔        X
Sqoop (Cloudera)             ✔         ✔     ✔       ✔    ✔        ✔
Ambari (Hortonworks)         X         X     X       X    ✔        ✔
Knox (Hortonworks)           X         X     X       X    X        ✔
Tez (Hortonworks)            X         X     X       X    X        ✔
Drill (MapR)                 X         ✔     X       X    X        X

Quick and dirty, for context.

The Apache Hadoop Ecosystem

©2014 Cloudera, Inc. All rights reserved.

Why Hadoop?

• Scalability

• Simply scales just by adding nodes

• Local processing to avoid network bottlenecks

• Efficiency

• Cost efficiency (<$1k/TB) on commodity hardware

• Unified storage, metadata, security (no duplication or synchronization)

• Flexibility

• All kinds of data (blobs, documents, records, etc.)

• In all forms (structured, semi-structured, unstructured)

• Store anything then later analyze what you need

Why “Ecosystem?”

• In the beginning, just Hadoop

• HDFS

• MapReduce

• Today, dozens of interrelated components

• I/O

• Processing

• Specialty Applications

• Configuration

• Workflow

HDFS

• Distributed, highly fault-tolerant filesystem

• Optimized for large streaming access to data

• Based on Google File System

• http://research.google.com/archive/gfs.html

Lots of Commodity Machines

Image: Yahoo! Hadoop cluster [OSCON ’07]

MapReduce (MR)

• Programming paradigm

• Batch oriented, not realtime

• Works well with distributed computing

• Lots of Java, but other languages supported

• Based on Google’s paper

• http://research.google.com/archive/mapreduce.html
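
To make the paradigm concrete, here is a minimal single-process sketch of the map, shuffle, and reduce steps as a word count, written in Python. It only illustrates the idea; the function names are made up for this example and it does not use the Hadoop API.

```python
from collections import defaultdict
from itertools import chain


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group emitted values by key, the job a framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run map over every line, shuffle the pairs, then reduce each group."""
    mapped = chain.from_iterable(map_phase(line) for line in lines)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


counts = word_count(["big data big cluster", "data node"])
print(counts)
```

On a real cluster the map and reduce calls run on many machines and the shuffle moves data over the network, which is a large part of why the model is batch oriented.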

Apache Hive

• Abstraction of Hadoop’s Java API

• HiveQL “compiles” down to MR

• a “SQL-like” language

• Eases analysis using MapReduce

Apache Hive Metastore

• Maps HDFS files to DB-like resources

• Databases

• Tables

• Column/field names, data types

• Roles/users

• InputFormat/OutputFormat
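
As a rough mental model, a metastore entry can be pictured as a record that ties a table name to an HDFS location, a column list, and the classes used to read and write the files. The Python sketch below is hypothetical: the `Column` and `Table` classes, the `cities` table, and its path are invented for illustration and are not the metastore's actual schema.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    data_type: str


@dataclass
class Table:
    database: str
    name: str
    location: str        # HDFS directory that holds the table's files
    columns: list        # list of Column
    input_format: str    # class used to read the files
    output_format: str   # class used to write the files
    owner: str = "hive"


# Hypothetical entry: a text-format table stored under an assumed warehouse path.
cities = Table(
    database="default",
    name="cities",
    location="hdfs:///user/hive/warehouse/cities",
    columns=[
        Column("city", "string"),
        Column("state", "string"),
        Column("population", "bigint"),
    ],
    input_format="org.apache.hadoop.mapred.TextInputFormat",
    output_format="org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
)


def describe(table):
    """Return (column name, data type) pairs, roughly what DESCRIBE would list."""
    return [(column.name, column.data_type) for column in table.columns]


print(describe(cities))
```

The point of the sketch is that the files themselves stay in HDFS; only the description of how to interpret them lives in the metastore.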

Architecture


[Diagram: Cloudera’s Enterprise Data Hub. Processing: batch processing (MapReduce, Spark), analytic SQL (Impala), search (Solr), machine learning, stream processing (Spark), 3rd party apps. Other labels: partners, Mahout. Workload management: YARN. Storage for any type of data (unified, elastic, resilient, secure): filesystem (HDFS), online NoSQL (HBase). Data management: Cloudera Navigator. System management: Cloudera Manager. Sentry.]

But wait…

WHY DO WE NEED THIS?

Familiar interface, but more powerful.

Cloudera Impala

Cloudera Impala

• Interactive query on Hadoop

• think seconds, not minutes

• ANSI-92 standard SQL

• compatible with HiveQL

• Native MPP query engine

• built for low-latency queries

• HDFS and HBase storage

Cloudera Impala – Design Choices

• Native daemons, written in C/C++

• No JVM, no MapReduce

• Saturate disks on reads

• Uses in-memory HDFS caching

• Re-uses Hive metastore

• Not as fault-tolerant as MapReduce

Benefits of Impala

Unlocks BI/analytics on Hadoop

• Interactive SQL in seconds

• Highly concurrent to handle 100s of users

Native Hadoop flexibility

• No data migration, conversion, or duplication required

• Query existing Hadoop data

• Run multiple frameworks on the same data at the same time

• Supports Parquet for best-of-breed columnar performance

Native MPP query engine designed into Hadoop:

• Unified Hadoop storage

• Unified Hadoop metadata (uses Hive and HCatalog)

• Unified Hadoop security

• Fine-grained role-based access controls with Sentry

Apache-licensed open source

Proven in Production

Cloudera Impala – Architecture

• Impala Daemon

• runs on every node

• handles client requests

• handles query planning & execution

• State Store Daemon

• provides name service

• metadata distribution

• used for finding data

Impala Query Execution

[Diagram: a SQL app connects via ODBC and sends a SQL request to one of three nodes. Each node runs a Query Planner, Query Coordinator, and Query Executor alongside an HDFS DN and HBase. Shared services: Hive Metastore, HDFS NN, Statestore.]

1) Request arrives via ODBC/JDBC/HUE/Shell


2) Planner turns request into collections of plan fragments

3) Coordinator initiates execution on impalad(s) local to data


4) Intermediate results are streamed between impalad(s)

5) Query results are streamed back to client
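
The five steps can be pictured with a toy scatter-gather model in Python. This is only a sketch under heavy assumptions: the node names, the rows, and the `plan`, `execute_fragment`, and `coordinate` functions are invented here, the query string is never parsed, and none of this reflects how impalad is actually implemented.

```python
from collections import Counter

# Hypothetical data placement: each node holds its own local (state, population) rows.
NODE_DATA = {
    "node1": [("NV", 600), ("AZ", 1500)],
    "node2": [("NV", 250), ("CA", 3900)],
    "node3": [("AZ", 500), ("CA", 1400)],
}


def plan(query):
    """Step 2: turn the request into one fragment per node that holds data.

    The query text is not parsed in this toy; every request becomes a partial sum.
    """
    return [{"node": node, "op": "partial_sum"} for node in NODE_DATA]


def execute_fragment(fragment):
    """Step 3: an executor scans only its local rows and streams partial sums."""
    partial = Counter()
    for state, population in NODE_DATA[fragment["node"]]:
        partial[state] += population
    yield from partial.items()


def coordinate(query):
    """Steps 4 and 5: merge the streamed partial results, then return final rows."""
    totals = Counter()
    for fragment in plan(query):
        for state, subtotal in execute_fragment(fragment):
            totals[state] += subtotal
    return sorted(totals.items())


result = coordinate("SELECT state, sum(population) FROM cities GROUP BY state")
print(result)
```

The design point the sketch tries to capture is data locality: the work goes to the nodes that already hold the rows, and only small partial results travel between nodes.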

Cloudera Impala – Results

• Allows for fast iteration/discovery

• How much faster?

• 3-4x faster on I/O bound workloads

• up to 45x faster on multi-MR queries

• up to 90x faster on in-memory cache

Latest SQL Performance

[Chart: Single User vs 10 User Response Time, with Impala "Times Faster" multipliers (lower bars = better). Y axis: time in seconds, 0 to 350. Engines: Impala, Spark SQL, Presto, Hive-on-Tez. Bar labels for Impala: single user 5, 10 users 11. Bar labels for the other engines: single user 25, 37, 77; 10 users 120, 302, 202. Multiplier labels: 5.0x, 10.6x, 7.4x, 27.4x, 15.4x, 18.3x.]

Independent validation by IBM Research SQL-on-Hadoop VLDB paper: “Impala’s database architecture provides significant performance gains”

Previous Milestones

[Timeline: vertical axis is analytic database capabilities]

• Impala 1.0 (GA): Spring 2013

• Impala 1.1 (Security): Summer 2013

• Impala 1.2 (Usability): Fall 2013

• Impala 1.3 (Resource Management): Spring 2014

• Impala 1.4 (Extensibility): Summer 2014

• Impala 2.0 (SQL): Fall 2014

Cloudera Impala 2.0

Window Functions

“Aggregate function applied to a partition of the result set” (SQL 2003)

Ex:

sum(population) OVER (PARTITION BY city)

rank() OVER (PARTITION BY state ORDER BY population)

We’ve implemented most of the spec

• PARTITION BY, ORDER BY

• WINDOW

• PRECEDING, FOLLOWING

• ROWS

• Any number of analytic functions in one query
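
To try the window function syntax without a cluster, the sketch below runs two analytic functions in one query using Python's built-in sqlite3 module as a stand-in engine. It is not Impala: it assumes a SQLite build with window function support (3.25 or later), and the `cities` table and its population figures are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities (city TEXT, state TEXT, population INTEGER)")
conn.executemany(
    "INSERT INTO cities VALUES (?, ?, ?)",
    [
        # Hypothetical numbers, in thousands.
        ("Las Vegas", "NV", 600),
        ("Reno", "NV", 230),
        ("Phoenix", "AZ", 1500),
        ("Tucson", "AZ", 520),
    ],
)

# Two analytic functions in one query: a per-state total and a per-state rank.
rows = conn.execute(
    """
    SELECT city,
           state,
           sum(population) OVER (PARTITION BY state) AS state_total,
           rank() OVER (PARTITION BY state ORDER BY population DESC) AS pop_rank
    FROM cities
    ORDER BY state, pop_rank
    """
).fetchall()

for row in rows:
    print(row)
```

Unlike GROUP BY, the window version keeps every input row and attaches the aggregate to it, which is why each city still appears alongside its state total.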

Cloudera Impala 2.0

Subqueries

A query that is part of another query. Ex:

select col from t1

where col in

(select c2 from t2)

Support:

• Correlated and uncorrelated subqueries.

• IN, NOT IN, EXISTS, NOT EXISTS
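
The four subquery forms can be compared side by side with a small sketch. It uses Python's built-in sqlite3 module as a stand-in engine rather than Impala, reuses the `t1`/`t2` names from the slide, and fills them with made-up values; the EXISTS variants are correlated because the inner query refers to `t1.col`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE t1 (col INTEGER);
    CREATE TABLE t2 (c2 INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3), (4);
    INSERT INTO t2 VALUES (2), (4), (6);
    """
)


def query(sql):
    """Run a single-column query and return its values as a list."""
    return [row[0] for row in conn.execute(sql)]


# Uncorrelated: the inner query runs independently of the outer row.
in_rows = query("SELECT col FROM t1 WHERE col IN (SELECT c2 FROM t2) ORDER BY col")
not_in_rows = query("SELECT col FROM t1 WHERE col NOT IN (SELECT c2 FROM t2) ORDER BY col")

# Correlated: the inner query references t1.col from the outer query.
exists_rows = query(
    "SELECT col FROM t1 WHERE EXISTS (SELECT 1 FROM t2 WHERE t2.c2 = t1.col) ORDER BY col"
)
not_exists_rows = query(
    "SELECT col FROM t1 WHERE NOT EXISTS (SELECT 1 FROM t2 WHERE t2.c2 = t1.col) ORDER BY col"
)

print(in_rows, not_in_rows, exists_rows, not_exists_rows)
```

One caveat worth remembering in any SQL engine: NOT IN and NOT EXISTS can give different answers when the subquery returns NULLs, which this sample data deliberately avoids.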

Cloudera Impala 2.0

Spill to disk joins & aggregations

• Previously, if a query ran out of memory, Impala would abort it

• This meant some big joins (fact table – fact table) could never run.

• All operators that accumulate memory can now spill to disk if necessary.

• Order by (Impala 1.4)

• Join/Agg (Impala 2.0)

• Analytic Functions (Impala 2.0)

• Transparent to existing workloads
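
The general idea of spilling can be shown with a toy aggregation in Python: keep as many groups in memory as a budget allows, write rows for all other keys to hash-partitioned temp files, then aggregate each partition on its own. This is a sketch of the technique only, not Impala's implementation; `spilling_sum`, its parameters, and the tiny budget are invented for illustration.

```python
import pickle
import tempfile
from collections import Counter


def spilling_sum(rows, max_keys_in_memory=2, num_partitions=4):
    """Sum values per key, spilling rows to temp files once the key budget is used up."""
    in_memory = Counter()
    spill_files = [tempfile.TemporaryFile() for _ in range(num_partitions)]
    spilled = False

    for key, value in rows:
        if key in in_memory or len(in_memory) < max_keys_in_memory:
            in_memory[key] += value
        else:
            # Budget exhausted: append the raw row to the partition chosen by hash.
            pickle.dump((key, value), spill_files[hash(key) % num_partitions])
            spilled = True

    # A real engine would stream results out; this toy collects them in one dict.
    results = dict(in_memory)
    for spill_file in spill_files:
        spill_file.seek(0)
        partition = Counter()  # each partition is assumed small enough for memory
        while True:
            try:
                key, value = pickle.load(spill_file)
            except EOFError:
                break
            partition[key] += value
        results.update(partition)
        spill_file.close()
    return results, spilled


totals, did_spill = spilling_sum(
    [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("c", 5), ("d", 6)]
)
print(totals, did_spill)
```

Because a key is either always kept in memory or always spilled, the in-memory and on-disk groups never overlap, and the caller gets the same answer whether or not a spill happened.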

Cloudera Impala 2.1 +

• Nested data – enables queries on complex nested structures including maps, structs, and arrays (early 2015)

• MERGE statement – enables merging updates into existing tables

• Additional analytic SQL functionality – ROLLUP, CUBE, and GROUPING SETS

• SQL SET operators – MINUS, INTERSECT

• Apache HBase CRUD – allows use of Impala for inserts and updates into HBase

• UDTFs (user-defined table functions) – for more advanced user functions and extensibility

• Intra-node parallelized aggregations and joins – to provide even faster joins and aggregations on top of the performance gains of Impala

• Parquet enhancements – continued performance gains including index pages

• Amazon S3 integration

Hold onto something, folks.

Quick Demo


Try It Out

Apache-licensed open source

• Download: http://www.cloudera.com/downloads

• Email: [email protected]

• Join: http://groups.cloudera.org

Cloudera Live

Free, Interactive Tutorials at http://www.cloudera.com/live

Special thanks: LAS VEGAS BIG DATA

Preferably related to the talk… or not.

Questions?

Thank You!

Maxime Dumas

[email protected]

We’re hiring.
