Impala Presentation - Orlando


Transcript of Impala Presentation - Orlando

Page 1: Impala Presentation - Orlando

Impala: A Modern, Open-Source SQL Engine for Hadoop

Ricky Saltzer, Cloudera

Page 2: Impala Presentation - Orlando

About Me

• Tools Developer at Cloudera

Raleigh, NC

• Hadoop / Impala Enthusiast

• Co-Author of Impala in Action

Coupon: Save 45% with code impalafla (expires 01/25/2014) at www.manning.com/saltzer

Page 3: Impala Presentation - Orlando

About Cloudera

• Formed in 2008
• First company to release a commercial Hadoop distribution
• Employs more than 70 Hadoop committers, PMC members, and contributors
• 15 projects were founded by Cloudera employees
• 5 books published on Hadoop / related technologies
• Tens of thousands of nodes under management
• Creators of Cloudera Manager
  • Install, configure, and manage a Hadoop cluster in minutes

Page 4: Impala Presentation - Orlando

Agenda

• Quick overview of Hadoop
• What is Impala?
• Goals and user view of Impala
• Impala architecture
• Scaling
• Performance comparisons
• Future roadmap

Page 5: Impala Presentation - Orlando

What is Apache Hadoop?

Page 6: Impala Presentation - Orlando

What’s Wrong With MapReduce?

• Batch oriented
• High latency
• Not all paradigms fit
• Only for developers

Page 7: Impala Presentation - Orlando

Wait, What about Hive?

• SQL-to-MapReduce engine, originally created by Facebook

• Still a great tool for batch-oriented jobs (e.g. conversions)

• Was never designed to be real-time, or to handle a large volume of concurrent queries

Page 8: Impala Presentation - Orlando

What is Impala?

Page 9: Impala Presentation - Orlando

General Purpose SQL Engine

• Analytical workloads
• Linearly scalable
• Handles queries that run from milliseconds to hours
• Thousands of concurrent queries

Page 10: Impala Presentation - Orlando

Runs Directly in Hadoop

• Compatible with multiple storage managers
• Reads widely used Hadoop file formats
• Runs on the same nodes as Hadoop

Page 11: Impala Presentation - Orlando

High Performance

• True MPP query engine
• C++ instead of Java
• Runtime code generation (LLVM IR)
• Completely new execution engine (not MapReduce)
• Novel IO manager

Page 12: Impala Presentation - Orlando

Completely Open Source

Apache License 2.0

http://github.com/cloudera/impala

Page 13: Impala Presentation - Orlando

User View of Impala

• Runs as a distributed service; each node with data runs an Impala daemon
• Truly distributed: any Impala daemon can accept a query
• Highly available, no single point of failure
• Submit queries via JDBC, ODBC, Shell, or Hue
• Thrift application drivers (Ruby, Python, etc.)
• Queries are distributed to nodes with relevant data
• Uses the same metastore as Hive

Page 14: Impala Presentation - Orlando

There is NO Impala Format!

- Bring Your Own Data

Page 15: Impala Presentation - Orlando

Supported File Formats

• Parquet
  • High performance columnar format, based on Dremel
• RCFile
  • Original Hadoop columnar file (slower)
• Avro
  • Data serialization system supporting rich, complex types
• Sequence
  • Default MapReduce output, optimized for key-value data
• Text
  • Raw text data (e.g. CSV)
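As a sketch of the "bring your own data" idea, the DDL below points Impala at CSV files that already live in HDFS; the table name and path are hypothetical, and the statement uses the Hive-compatible DDL that Impala accepts.

-- Hypothetical CSV data already sitting in HDFS; no conversion to an "Impala format" needed
CREATE EXTERNAL TABLE web_logs (
  log_time STRING,
  user_id  INT,
  url      STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/data/web_logs';

The raw text files stay where they are; Impala simply reads them in place.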

Page 16: Impala Presentation - Orlando

Supported Compression

• Snappy
  • Fast, great compression
• GZIP
  • Slow, best compression
• BZIP2
  • Moderate speed, great compression
• LZO
  • Text only
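As a rough sketch of choosing a codec when writing Parquet data from Impala, the snippet below assumes the PARQUET_COMPRESSION_CODEC query option from the Impala 1.x releases (issued from impala-shell) and hypothetical table names.

-- Snappy: fast, great compression
SET PARQUET_COMPRESSION_CODEC=snappy;
INSERT INTO sales_parquet SELECT * FROM sales_text;

-- GZIP: slower writes, best compression
SET PARQUET_COMPRESSION_CODEC=gzip;
INSERT OVERWRITE TABLE sales_parquet SELECT * FROM sales_text;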

Page 17: Impala Presentation - Orlando

SQL

• SQL-92 minus correlated subqueries
• Similar to HiveQL
• ORDER BY requires LIMIT (coming soon…)
• No complex data types (coming soon…)
• Other goodies (example below):
  • INSERT INTO SELECT
  • CREATE TABLE AS SELECT
  • LOAD INTO
  • UDF, UDAF (native and Java)
• JOINs must fit in the aggregate memory of all executing nodes
• Continuing to add additional SQL functionality
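A minimal sketch of the goodies listed above, using hypothetical table names (daily_sales, sales_archive):

-- Append query results into an existing table
INSERT INTO sales_archive
SELECT * FROM daily_sales WHERE sale_date < '2014-01-01';

-- Create and populate a new table in one statement (CTAS)
CREATE TABLE top_customers AS
SELECT custid, SUM(revenue) AS revenue
FROM sales_archive
GROUP BY custid;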

Page 18: Impala Presentation - Orlando

Impala Architecture

• Impala Daemon
  • Receives, coordinates, and executes user queries; runs as a distributed service
• Statestore
  • Manages the cluster state
• Catalog Server
  • Provides automatic metadata propagation

Page 19: Impala Presentation - Orlando

Impala Daemon

• Runs as a distributed process on each HDFS DataNode
• Each Impala daemon is capable of facilitating a user query; there is no "master server"
• Subcomponents:
  • Planner - translates user queries into a logical execution plan
  • Coordinator - receives plans; coordinates execution on remote Impala daemons
  • Executor - scans local data; finishes work given by the coordinator as fast as possible

Page 20: Impala Presentation - Orlando

Statestore

• Central system state repository
  • Name service (membership)
  • Metadata
  • Future: scheduling-relevant and diagnostic state
• Soft-state
  • All data can be reconstructed from the rest of the system
  • Cluster continues to function without the statestore
• Heartbeats
  • Pushes new data
  • Checks for liveness

Page 21: Impala Presentation - Orlando

Metadata / Catalog Server

• Impala metadata
  • Hive Metastore
    • Logical table representations
  • HDFS
    • Block replica locations
    • Block replica volume ids
• Catalog Server
  • Caches metadata
  • Automatically propagates changes
  • Updates are atomic
  • Manual refresh for outside changes (e.g. DDL through Hive) - see the example below
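A quick sketch of the manual-refresh case: when data or table definitions change outside of Impala (for example through Hive), these statements tell Impala to pick up the change. The table name is hypothetical.

-- Reload file and block metadata for a table whose data changed outside Impala
REFRESH web_logs;

-- Rebuild the cached catalog entry, e.g. after the table was created or altered in Hive
INVALIDATE METADATA web_logs;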

Page 22: Impala Presentation - Orlando

What happens when I submit a query?

Page 23: Impala Presentation - Orlando

High Level Query Execution

[Diagram: a SQL application connects over ODBC/JDBC/Thrift to one of several Impala daemons, each composed of a Query Planner, Query Coordinator, and Query Executor and co-located with an HDFS DataNode and HBase; the Hive Metastore, HDFS NameNode, Statestore, and Catalog sit alongside the cluster.]

The client sends its SQL request via ODBC, JDBC, or Thrift.

Page 24: Impala Presentation - Orlando

High Level Query Execution

[Diagram: the same cluster; plan fragments fan out from the coordinating daemon to the Query Executors on the other nodes.]

The planner turns the request into collections of plan fragments; the coordinator initiates execution on the remote nodes.

Page 25: Impala Presentation - Orlando

High Level Query Execution

[Diagram: the same cluster; query results flow back from the executors to the coordinating daemon and on to the client.]

Intermediate results are streamed between nodes, and the final query results are streamed back to the client.

Page 26: Impala Presentation - Orlando

Query Planning

Page 27: Impala Presentation - Orlando

Query Planning: Overview

• 2-phase planning process:
  • Single-node plan: left-deep tree of plan operators
  • Plan partitioning: partition the single-node plan to maximize scan locality (i.e. less data movement)
• Parallelization of operators:
  • All query operators are fully distributed
  • Operations are heavily pipelined

Left-deep tree:

      X
     / \
    X   r6
   / \
  X   r5
 / \
r0  r1

Page 28: Impala Presentation - Orlando

Query Planning: Single-Node Plan

• Plan operators:
  • Scan
  • HashJoin
  • HashAggregation
  • Union
  • Top-N
  • Exchange

Page 29: Impala Presentation - Orlando

Single-Node: Example Query

SELECT t1.custid, SUM(t2.revenue) AS revenue
FROM largehdfstable t1
JOIN largehdfstable t2 ON (t1.id1 = t2.id)
JOIN smallhbasetable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC
LIMIT 10;

Page 30: Impala Presentation - Orlando

Query Planning: Distributed Plans

• Goals:
  • Maximize scan locality, minimize data movement
  • Full distribution of all query operators (where semantically correct)
• Parallel joins:
  • Broadcast
    • Join is colocated; the right-hand side table is broadcast to each executing node. Preferred for smaller tables.
  • Partitioned
    • Both tables are hash-partitioned on the join columns
  • Cost-based decisions are made when statistics are available (see the hint sketch below for overriding them)
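A small sketch of steering that decision by hand. The square-bracket hint syntax below is the form used in the Impala 1.x releases, and the table names are hypothetical.

-- Force the broadcast strategy: ship the small right-hand table to every executing node
SELECT f.order_id, d.region
FROM big_fact f JOIN [BROADCAST] small_dim d ON (f.region_id = d.region_id);

-- Force the partitioned strategy: hash-partition both sides on the join key
SELECT f.order_id, g.total
FROM big_fact f JOIN [SHUFFLE] other_fact g ON (f.order_id = g.order_id);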

Page 31: Impala Presentation - Orlando

Query Planning: Distributed Plans

• Parallel aggregation:
  • Pre-aggregation where data is first materialized
  • Merge aggregation partitioned by grouping columns
• Parallel Top-N:
  • Initial Top-N operation where data is first materialized
  • Final Top-N in a single-node plan fragment

Page 32: Impala Presentation - Orlando

Query Planning: Distributed Plans

1. All scans are local
2. First join is partitioned
3. Second join is broadcast
4. Pre-aggregate the join result
5. Merge aggregation after repartitioning
6. Initial Top-N in aggregation
7. Final Top-N in the coordinator

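To see how Impala partitioned a particular query, EXPLAIN prints the plan fragments without executing anything. A sketch against the earlier example query (the exact output format varies by release):

EXPLAIN
SELECT t1.custid, SUM(t2.revenue) AS revenue
FROM largehdfstable t1
JOIN largehdfstable t2 ON (t1.id1 = t2.id)
JOIN smallhbasetable t3 ON (t1.id2 = t3.id)
WHERE t3.category = 'Online'
GROUP BY t1.custid
ORDER BY revenue DESC
LIMIT 10;
-- The output lists the scan, join, aggregation, top-n, and exchange operators
-- and the plan fragment each one runs in.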

Page 33: Impala Presentation - Orlando

Under the Hood

Page 34: Impala Presentation - Orlando

Impala’s Execution Engine

• Written in C++ for minimal execution overhead
• Internal in-memory tuple format puts fixed-width data at fixed offsets
• Uses intrinsics / special CPU instructions for text parsing, CRC32 computation, etc.
• Runtime code generation using LLVM for "big loops"

Page 35: Impala Presentation - Orlando

Runtime Code Generation

• Example of a "big loop":
  • Insert a batch of rows into a hash table
• Known at query compile time:
  • Number of tuples in a batch
  • Tuple layout
  • Column types
• Generated code:
  • Unrolled loops
  • Inlines all function calls
  • Contains no dead code
  • Minimizes branching

Page 36: Impala Presentation - Orlando

Code Path Optimizations

Hot code path, called per row.

Interpreted:

void MaterializeTuple(char* tuple) {
  for (int i = 0; i < num_slots_; ++i) {
    char* slot = tuple + offsets_[i];
    switch (types_[i]) {
      case BOOLEAN:
        *slot = ParseBoolean();
        break;
      case INT:
        *slot = ParseInt();
        break;
      case FLOAT: …
      case STRING: …
      // etc.
    }
  }
}

Code generated (for one specific tuple layout):

void MaterializeTuple(char* tuple) {
  // i = 0
  *(tuple + 0) = ParseInt();
  // i = 1
  *(tuple + 4) = ParseBoolean();
  // i = 2
  *(tuple + 5) = ParseInt();
}

Page 37: Impala Presentation - Orlando

Was it worth it?

Page 38: Impala Presentation - Orlando

Yes.

Page 39: Impala Presentation - Orlando

Code Path Optimizations

Codegen on?   Time    Instructions       Branches         Branch Misses   Miss %
Yes           0.63s   52,605,701,380     9,050,446,359    145,461,106     1.607
No            1.7s    102,345,521,322    17,131,519,396   370,150,103     2.161

Page 40: Impala Presentation - Orlando

Impala’s IO Manager

• Maximizes disk throughput
• Designed to take advantage of the aggregate throughput of your disks
• Interleaves computation and IO as much as possible
• Services all queries on a single node
  • Thread per disk
  • Each thread keeps the disk busy with read-aheads
• Supports blocking and non-blocking IO

Page 41: Impala Presentation - Orlando

Impala’s IO Manager

[Diagram: inside an Impala daemon, the IO Manager sits between the concurrently running queries and the node's disks, with one thread per disk.]

Page 42: Impala Presentation - Orlando

Comparing Impala to Dremel

• What is Dremel?
  • Columnar storage for data with nested data structures
  • Distributed scalable aggregation on top of that
  • Scales linearly
  • Whitepaper (http://bit.ly/19txiiX) published by Google in 2010
• Columnar storage: Parquet
• Distributed aggregation: Impala
• Contains much more than Dremel had at its time, including JOINs

Page 43: Impala Presentation - Orlando

More About Parquet

• What is it?
  • Container format for all popular serialization formats: Avro, Thrift, Protocol Buffers
  • Jointly developed by Cloudera & Twitter, with many more contributors now!
  • Open source, on GitHub
• Features
  • Stores data in native types (i.e. bool, ints, doubles)
  • Supports fully shredded nested data
  • Support for fast index pages (fast lookups)
  • Extensible value encodings (i.e. run-length, dictionary, delta)
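A quick sketch of putting Parquet to use from Impala: rewriting an existing text table as Parquet with a single CTAS so later scans read only the columns they touch. The table names are hypothetical, and depending on the release the file-format keyword is PARQUET or PARQUETFILE.

-- Convert a raw text table into a columnar Parquet table in one statement
CREATE TABLE web_logs_parquet STORED AS PARQUET AS
SELECT * FROM web_logs;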

Page 44: Impala Presentation - Orlando

Why Columnar Storage?

[Diagram from Twitter's "Dremel Made Simple" blog: the same table laid out in row-major versus column-major order.]

The most efficient IO is one that never happens at all.

Page 45: Impala Presentation - Orlando

Nested Data?

[Illustration from Twitter's "Dremel Made Simple" blog.]

Page 46: Impala Presentation - Orlando

Does it Scale?

Page 47: Impala Presentation - Orlando

Scaling Impala

• Impala loves memory

• Impala will use all of your disks when possible

• Linearly scalable
  • Double your cluster size? Expect queries to run twice as fast.
  • HDFS takes care of block placement - no sharding!

Page 48: Impala Presentation - Orlando

Double the Hardware, Same Users

Page 49: Impala Presentation - Orlando

Double the Hardware, Double the Users

Page 50: Impala Presentation - Orlando

Double the Hardware, Double the Data

Page 51: Impala Presentation - Orlando

Wanna Race?

Page 52: Impala Presentation - Orlando

Performance Benchmark #1

Impala vs Hive 0.12

Hive 0.12 = Stinger Phase 2 on YARN/MR2

Page 53: Impala Presentation - Orlando

• Size: 3TB (TPC-DS scale 3,000)
• Machines: 5 nodes (8-core, 96GB memory, 12 disks, 1Gbps eth)
• Format: Hive: ORC, Impala: Parquet

Winner: Impala

Page 54: Impala Presentation - Orlando

Hive is not the bar. One trillion rows per second is the bar.

- Josh Wills, Director of Data Science, Cloudera

Page 55: Impala Presentation - Orlando

Performance Benchmark #2

Impala vs DBMS-Y

DBMS-Y = popular analytical database

Page 56: Impala Presentation - Orlando

• Size: 30TB (TPC-DS scale 30,000)
• Machines: 20 nodes (8-core, 96GB memory, 12 disks, 1Gbps eth)
• Format: DBMS-Y: proprietary columnar, Impala: Parquet

Winner: Impala

Page 57: Impala Presentation - Orlando

Future Roadmap

• Additional SQL
  • ORDER BY without LIMIT
  • Analytical window functions
  • DECIMAL datatype
  • Unstructured data types (remember Parquet)
• Resource management
  • We didn't talk about LLAMA: Impala on YARN
  • Confidently share a cluster between components (e.g. Impala, MR, Spark)

Page 58: Impala Presentation - Orlando

Future Roadmap

• Performance
  • Multi-threaded operators (yes, they are single-threaded now)
  • In-memory caching; ability to pin tables/partitions in memory
• Background / incremental statistics
• And more...

Page 59: Impala Presentation - Orlando

Feature Request? Please voice your requests!

Tell us which feature is important to you!

[email protected]

Page 60: Impala Presentation - Orlando

Thank You!

[email protected]

@monstrado

SELECT question FROM audience WHERE has_question = true;