Cloudera Impala Technical Overview

Cloudera Impala Jus/n Erickson | Senior Product Manager May 2013

Agenda

•  Why Impala? •  Architectural Overview •  Real-‐World Use Cases •  Alterna/ve Approaches •  The PlaKorm for Big Data

©2013 Cloudera, Inc. All Rights Reserved. 2

Why Hadoop?

•  Scalability •  Simply scales just by adding nodes •  Local processing to avoid network boSlenecks

•  Flexibility •  All kinds of data (blobs, documents, records, etc) •  In all forms (structured, semi-‐structured, unstructured) •  Store anything then later analyze what you need

•  Efficiency •  Cost efficiency (<$1k/TB) on commodity hardware •  Unified storage, metadata, security (no duplica/on or synchroniza/on)


What’s Impala?

•  Interac2ve SQL •  Typically 5-‐65x faster than Hive (observed up to 100x faster) •  Responses in seconds instead of minutes (some/mes sub-‐second)

•  Nearly ANSI-‐92 standard SQL queries with Hive SQL •  Compa/ble SQL interface for exis/ng Hadoop/CDH applica/ons •  Based on industry standard SQL

•  Na2vely on Hadoop/HBase storage and metadata •  Flexibility, scale, and cost advantages of Hadoop •  No duplica/on/synchroniza/on of data and metadata •  Local processing to avoid network boSlenecks

•  Separate run2me from MapReduce •  MapReduce is designed and great for batch •  Impala is purpose-‐built for low-‐latency SQL queries on Hadoop


Benefits of Impala

5

More & Faster Value from “Big Data” §  BI tools imprac/cal on Hadoop before Impala §  Move from 10s of Hadoop users per cluster to 100s of SQL users §  No delays from data migra/on

Flexibility §  Query across exis/ng data §  Select best-‐fit file formats (Parquet, Avro, etc.) §  Run mul/ple frameworks on the same data at the same /me

Cost Efficiency §  Reduce movement, duplicate storage & compute §  10% to 1% the cost of analy/c DBMS

Full Fidelity Analysis §  No loss from aggrega/ons or fixed schemas

©2013 Cloudera, Inc. All Rights Reserved.

Impala Query Execu/on

6

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

SQL App

ODBC Hive

Metastore HDFS NN Statestore

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

SQL request

1) Request arrives via ODBC/JDBC/Beeswax/Shell



7

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

SQL App

ODBC Hive


Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

2) Planner turns request into collec2ons of plan fragments 3) Coordinator ini2ates execu2on on impalad(s) local to data



8

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

SQL App

ODBC Hive


Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

Query Planner

Query Coordinator

Query Executor

HDFS DN HBase

4) Intermediate results are streamed between impalad(s) 5) Query results are streamed back to client

Query results


Impala and Hive

9

Shares Everything Client-‐Facing §  Metadata (table defini/ons) §  ODBC/JDBC drivers §  SQL syntax (Hive SQL) §  Flexible file formats §  Machine pool §  Hue GUI

But Built for Different Purposes §  Hive: runs on MapReduce and ideal for batch

processing §  Impala: na/ve MPP query engine ideal for

interac/ve SQL

Storage

Integra2on

Resource Management

Metad

ata

HDFS HBase

TEXT, RCFILE, PARQUET, AVRO, ETC. RECORDS

Hive SQL Syntax Impala

SQL Syntax + Compute Framework MapReduce

Compute Framework

Batch Processing

Interac/ve

SQL


Impala Use Cases

10

Interac/ve BI/analy/cs on more data

Asking new ques/ons

Query-‐able archive w/ full fidelity

Data processing with /ght SLAs

Cost-‐effec2ve, ad hoc query environment that offloads the data warehouse for:


Global Financial Services Company

11

Saving 90% on incremental EDW spend & improving performance by 5x

Offload data warehouse for query-‐able archive

Store decades of data cost-‐effec/vely

Process & analyze on the same system

Improve capabili/es through interac/ve query on more data


Six3 Systems

12

Boos2ng performance by 20X for mission-‐cri2cal, real-‐2me cyber security

Analyze unstructured data with flexibility & real-‐/me response

Integrate with exis/ng desktop & BI tools

Deploy in minutes with Cloudera Manager


Expedia

13

Implemen2ng self-‐service BI on big data, reducing data latency by 50%

Offload data warehouse for archiving, ETL & analy/cs

Unify IT environment

Con/nuously ingest & analyze at scale

Drive greater usability & adop/on of big data stack


Our Design Strategy

14

Storage

Integra2on

Resource Management

Metad

ata

Batch Processing MAPREDUCE, HIVE & PIG

…Interac/ve

SQL IMPALA

Machine

Learning MAHOUT, DATAFU

HDFS HBase

TEXT, RCFILE, PARQUET, AVRO, ETC. RECORDS

Engines

One pool of data

One metadata model

One security framework

One set of system resources

An Integrated Part of the Hadoop System


Not All SQL on Hadoop is Created Equal

15

Batch MapReduce Make MapReduce faster

Slow, s2ll batch

Remote Query Pull data from HDFS over the network to the DW

compute layer

Slow, expensive

Siloed DBMS Load data into a

proprietary database file

Rigid, siloed data, slow ETL

Impala Na/ve MPP query engine that’s integrated into

Hadoop

Fast, flexible, cost-‐effec2ve

$


The Impala Advantage

16

BI Partners: Building on the

Enterprise Standard POWERED BY

IMPALA


It’s Not Just About SQL on Hadoop

17

The Plaeorm for Big Data

Storage

Integra2on

Resource Management

Metad

ata

Batch Processing MAPREDUCE, HIVE & PIG

…Interac/ve

SQL IMPALA

Machine

Learning MAHOUT, DATAFU

HDFS HBase

TEXT, RCFILE, PARQUET, AVRO… RECORDS

Engines

Management | Support

Single plaKorm for processing & analy/cs

Scales to ‘000s of servers

No upfront schema

10% the cost per TB

Open source plaKorm


Cloudera Impala Technical Overview

Technology

Transcript of Cloudera Impala Technical Overview