Impala deep dive - running descriptive analytics in Hadoop

Impala Product Update

Justin Erickson | Director, Product Management

September 2014

Agenda

• Impala releases

• Impala roadmap

• Perf update

Key Milestones and Features

• Impala 1.0
  - ~SQL-92 (minus correlated subqueries)
  - Native Hadoop file formats (Parquet, Avro, text, Sequence, …)
  - Enterprise readiness (authentication, ODBC/JDBC drivers, etc.)
  - Service-level resource isolation with other Hadoop frameworks

• Impala 1.1
  - Fine-grained, role-based authorization via Apache Sentry
  - Auditing (Impala 1.1.1 and CM 4.7+)

• Impala 1.2
  - Custom language extensibility (UDFs, UDAFs)
  - Cost-based join-order optimization
  - On-par performance compared to traditional MPP query engines while maintaining native Hadoop data flexibility

• Impala 1.3 / CDH 5.0 (a version is also available for CDH 4.x)
  - Resource management

Just Released

Impala 1.4 / CDH 5.1 (a version is also available for CDH 4.x)

• Additional SQL:

• DECIMAL data type

• Additional built-in functions from EDW

• ORDER BY without LIMIT

• Continued performance gains:

• HDFS caching support (CDH 5 only)

• Faster selective joins

• Faster COMPUTE STATS

Impala near-term roadmap

Targeted for Impala 2.0 (fall 2014):

• Additional SQL:
  - Analytic/window functions
  - Subqueries in the WHERE clause
  - Additional data types (VARCHAR, CHAR)

• Disk-based joins and aggregations

• GRANT/REVOKE

Considerations for Impala 2.x (priority and inclusion based on your feedback):

• Nested/complex types (next highest priority)

• Navigator Lineage

• Updates via MERGE

• Incremental stats

• Additional SQL functions (GROUPING, ROLLUP, CUBE, MINUS, INTERSECT built-ins, etc.)

• UDTFs

• Intra-node parallel joins and aggregations

• Even faster performance

• S3 integration

SQL-on-Hadoop benchmark: Impala, Presto, Stinger, Spark SQL

• Upcoming benchmarks on latest versions of:
  - Impala (1.4.0)
  - Presto (0.74)
  - Stinger (final) phase 3, aka Hive 0.13.0
  - Spark SQL (1.1)

• Published with smaller memory configuration (64 GB / node)
  - Demonstrates leadership is independent of memory size

• Dropped Shark given retirement for Hive-on-Spark

• As always, our public benchmarks are:
  - Based on industry standards (TPC)
  - Repeatable (https://github.com/cloudera/impala-tpcds-kit)
  - Methodical testing with multiple runs on same hardware
  - Help competing software put its best foot forward
    - SQL-92 join style for engines without CBO
    - JVM tuning for Presto
    - Run on optimal file formats for each

Impala’s multi-user performance is over 10x faster: gap widening compared to May’s update

Faster = more work in less time: Impala enables over 8.7x throughput

Performance Takeaways

• Impala’s advantage expands from 5x single-user to >10x with just 10 users

• Performance gap is widening since May

• Single-user advantage over Presto went from 5x before to 7.5x now

• Single-user advantage over Hive/Tez went from 5x before to 9x now

• Mid-term trends will further favor Impala’s design approach

• More data sets move to memory (HDFS caching, in-memory joins, Intel joint roadmap)

• CPU efficiency will increase in importance

• Native code enables easy optimizations for CPU instruction sets (e.g. floating point operations, math operations, encrypt/decrypt)

• The Intel joint roadmap helps support these opportunities

Try It Out!

• 100% Apache-licensed open source

• Downloads on http://impala.io/:

• Live online

• VM

• Installation

• Questions/comments?

• Community: http://impala.io/community

• Email: impala-user@cloudera.org

Real Time Audience Dashboard

September 2014

Introduction

Tubular Labs

SaaS platform for online video audience development

(e.g. Big Data for YouTube videos)

David Koblas

VP Engineering, Tubular Labs

Overview

This presentation covers the work Tubular Labs has done to use Impala as one of the core components of our SaaS platform. We'll go through the pipeline for getting data into the system, how we've distributed responsibility across AWS instances, and other tips and tricks for getting real-time responses to our end-user queries over billions of data points.

User Story: Audience Also Watches

For any YouTube video, can we figure out who the audience is and what other videos and channels they are watching?

We also want the ability to slice the audience by demographic information,

…and have it all run interactively from a web SaaS platform.

Tubular App

Technology Options

• Pre-compute (e.g. Map/Reduce)

• MySQL or similar

• Data Warehouse

• Impala or Redshift

• Homebrew

Impala 0.7

Now we have a technology

Make it interactive

and make a bet on Cloudera

Now We Have a Technology: Time to Make It Fast and Economical

Source: Tubular Labs

Pipeline

Loading

• Sqoop - collect data from MySQL

• Hive - preprocess data

• Impala - interactive display

• Python - REST endpoint
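
As a rough illustration of the last stage, the sketch below shows what a Python handler behind a REST endpoint might do: run one parameterized query and return JSON. The table `audience_also_watches`, its columns, the `run_query` callable, and the `%(name)s` parameter style are all assumptions made for illustration; the slides do not describe the actual schema or driver.

```python
import json
from typing import Callable, Sequence, Tuple

# Hypothetical query against a table that Hive is assumed to have
# preprocessed; neither the table nor its columns come from the slides.
ALSO_WATCHED_SQL = (
    "SELECT other_video_id, viewers "
    "FROM audience_also_watches "
    "WHERE video_id = %(video_id)s "
    "ORDER BY viewers DESC "
    "LIMIT %(limit)s"
)

# A query runner takes SQL plus named parameters and returns rows.
QueryFn = Callable[[str, dict], Sequence[Tuple[str, int]]]


def also_watched(video_id: str, run_query: QueryFn, limit: int = 10) -> str:
    """Return a JSON body listing the videos this video's audience also watches."""
    rows = run_query(ALSO_WATCHED_SQL, {"video_id": video_id, "limit": limit})
    body = {
        "video_id": video_id,
        "also_watched": [{"video_id": vid, "viewers": n} for vid, n in rows],
    }
    return json.dumps(body)


def fake_run_query(sql: str, params: dict) -> Sequence[Tuple[str, int]]:
    """Stand-in for a real Impala client call, so the sketch runs on its own."""
    return [("vid_b", 5200), ("vid_c", 3100)][: params["limit"]]


print(also_watched("vid_a", fake_run_query, limit=1))
```

Passing the query runner in as an argument keeps the handler independent of any particular Impala client library, and binding `video_id` as a parameter avoids building SQL from user input.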

AWS EC2: Node types

• m1.xlarge

- 1.6TB of Instance Storage

- slow IO

• hi1.4xlarge

- 2TB of SSD

- expensive

Note: this would be an i2.4xlarge instance today

Managing costs

Problem

• hi1.4xlarge - expensive

• m1.xlarge - slow IO

Solution – HDFS rack replication for separation

• One copy of the data on each of the two racks

• Hive creates tables on the m1.xlarge instances

• Impala runs queries on the hi1.4xlarge instances
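
Hadoop can learn which rack a node belongs to from an administrator-supplied topology script (configured via `net.topology.script.file.name`), which receives host names or IP addresses as arguments and prints one rack path per argument. A minimal sketch of such a script for a two-rack split by instance type follows; the IP addresses and rack names are hypothetical, and the slides do not say that this is how the racks were actually defined.

```python
#!/usr/bin/env python3
import sys

# Hypothetical mapping: one "rack" per instance type.
RACKS = {
    "10.0.1.11": "/rack-m1",   # m1.xlarge node (Hive / batch work)
    "10.0.1.12": "/rack-m1",
    "10.0.2.21": "/rack-hi1",  # hi1.4xlarge node (Impala queries)
    "10.0.2.22": "/rack-hi1",
}
DEFAULT_RACK = "/default-rack"  # fallback for hosts not listed above


def resolve(hosts):
    """Return one rack path per host, in the same order as the input."""
    return [RACKS.get(host, DEFAULT_RACK) for host in hosts]


if __name__ == "__main__":
    # Hadoop passes the hosts as command-line arguments.
    print(" ".join(resolve(sys.argv[1:])))
```

With two racks defined this way and a replication factor of at least two, HDFS's default placement spreads each block's replicas across both racks, which is what lets batch and interactive workloads read the same data from different hardware.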

Interactive Performance

Problem

• Large tables take time to scan

• No indexes

• Need to deliver results in < 1 second

Solution – partitioning (duh!)

• Partitions are targeted to be between 100 and 200 MB

• The query log is your friend
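
A quick way to sanity-check a partitioning scheme against the 100 to 200 MB target is to estimate the partition count from the table size and flag partitions that fall outside the range. The sketch below is only arithmetic; the table size, partition names, and the 150 MB midpoint are illustrative assumptions.

```python
MB = 1024 ** 2
GB = 1024 ** 3

LOW, HIGH = 100 * MB, 200 * MB  # target range from the slide


def partitions_needed(table_bytes: int, target_bytes: int = 150 * MB) -> int:
    """Partitions needed so that each holds about target_bytes (rounded up)."""
    if table_bytes <= 0 or target_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-table_bytes // target_bytes)  # integer ceiling division


def classify(partition_sizes: dict) -> dict:
    """Label each partition as 'small', 'ok', or 'large' against the range."""
    labels = {}
    for name, size in partition_sizes.items():
        if size < LOW:
            labels[name] = "small"
        elif size > HIGH:
            labels[name] = "large"
        else:
            labels[name] = "ok"
    return labels


# Hypothetical example: a 300 GB table and three partitions of varying size.
print(partitions_needed(300 * GB))
print(classify({
    "day=2014-09-01": 40 * MB,
    "day=2014-09-02": 150 * MB,
    "day=2014-09-03": 900 * MB,
}))
```

Partitions labelled "large" are candidates for a finer partition key, while many "small" ones suggest the key is too fine.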

Tubular App

Summary

Impala can back your SaaS application

• We’re now running version 1.3

• We’re “spinning” 10TB of data

• Delivering queries in < 2 seconds

We’re hiring – but you already knew that.
