Cloudera Impala Internals

Impala Internals, by David Gruzman, BigDataCraft.com

Description

This presentation covers the internals of the Cloudera Impala open-source database engine. It is based on my study of its source code.

Transcript of Cloudera Impala Internals

Page 1: Cloudera Impala Internals

Impala internals

By David Gruzman

BigDataCraft.com

Page 2: Cloudera Impala Internals
Page 3: Cloudera Impala Internals

Impala

by David Gruzman

►Impala is

– a relational query engine

– open source

– massively parallel (MPP)

Page 4: Cloudera Impala Internals

Why do we care about internals?

► SQL is declarative, so in theory there is no need to know the internals...

► At the same time, even small problems in engine operation require a good understanding of its working principles to fix.

► It is hardly possible to optimize without understanding the algorithms under the hood.

► It is hard to make decisions about the engine's suitability for future needs without knowing its technical limitations.

Page 5: Cloudera Impala Internals

Engine parts

Page 6: Cloudera Impala Internals

How to understand an engine?

What it does

Main principle of operation

Main building block

Operation sequence

Operation environment

Efficiency

Design decisions

Materials

Main problems and fixes

Page 7: Cloudera Impala Internals

What it does

Impala is a relational engine. It executes SQL queries.

Data is append-only: there are no UPDATE or DELETE statements.

Page 8: Cloudera Impala Internals

Principle of operation

The main differentiators are:

Distribution of a query among the nodes (MPP)

LLVM and code generation: Impala is a compiler.

Reliance on HDFS

Use of external metadata – the Hive metastore.

Parallel query capability (per node, per cluster).

Page 9: Cloudera Impala Internals

Sequence of operation

Query parsing – translate SQL to an AST (abstract syntax tree)

Match objects to metadata

Query planning – create a physical execution plan.

In the MPP case – divide the plan into plan fragments for the nodes.

Distribute the plan fragments to the nodes

Execute the plan fragments (the whole flow is sketched below).
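For orientation, here is a minimal sketch of this lifecycle with stub bodies. All names (Parse, Analyze, Plan, Distribute, Execute, AstNode, PlanFragment) are my own placeholders, not Impala's actual classes:

    #include <string>
    #include <vector>

    struct AstNode {};       // result of SQL parsing
    struct PlanFragment {};  // a slice of the physical plan for one group of nodes

    AstNode Parse(const std::string& sql) { return {}; }        // SQL -> AST
    void Analyze(AstNode* ast) {}                               // match objects to metadata
    std::vector<PlanFragment> Plan(const AstNode& ast) {        // physical plan,
      return std::vector<PlanFragment>(3);                      // split into fragments
    }
    void Distribute(const std::vector<PlanFragment>& frags) {}  // ship fragments to nodes
    void Execute() {}                                           // run fragments, stream rows

    void RunQuery(const std::string& sql) {
      AstNode ast = Parse(sql);
      Analyze(&ast);
      std::vector<PlanFragment> fragments = Plan(ast);
      Distribute(fragments);
      Execute();
    }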

Page 10: Cloudera Impala Internals

Main building blocks

Front End. This is Java code that implements much of the logic that is not performance-critical:

- database objects: fe/src/main/java/com/cloudera/impala/analysis/

- execution plan parts: fe/src/main/java/com/cloudera/impala/planner/

Page 11: Cloudera Impala Internals

BackEnd (Be)

The backend is written in C++ and is used mostly for the performance-critical parts. Specifically:

- Execution of the plan fragments on the nodes

- Implementation of the services:

ImpalaD service

StateStore

Catalog Service

Page 12: Cloudera Impala Internals

Services - ImpalaD

This is the "main" service of Impala, which runs on each node. It logically consists of the following sub-services of interest.

ImpalaService – the service used to execute queries. The console and JDBC/ODBC clients connect here.

ImpalaInternalService – the service used to coordinate work within the Impala cluster, for example to coordinate the execution of query fragments on the planned Impala nodes.

What is interesting for us? Each node can take the query-master role, and each node can help the others. This is a step toward horizontal scalability.

Page 13: Cloudera Impala Internals

Dual role of ImpalaD service

Query coordinator

Fragment executor

Page 14: Cloudera Impala Internals

Services view

Front End

Impala Service

Impala Internal Service

Page 15: Cloudera Impala Internals

ImpalaService – main methods

Inherited from Beeswax:

ExecuteAndWait

Fetch

Explain

Impala specific :

ResetCatalog

GetRuntimeProfile

Page 16: Cloudera Impala Internals

ImpalaInternalService – main methods

ExecPlanFragment

ReportExecStatus

CancelPlanFragment

TransmitData
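In the sources these RPCs are defined as a Thrift service; the C++-style interface below is my simplified rendering of them, with placeholder request/response types rather than the actual Thrift structs:

    struct PlanFragmentParams {};  // fragment plus destinations (simplified)
    struct ExecStatus {};          // per-fragment progress / error report
    struct RowBatch {};            // a serialized batch of tuples

    class ImpalaInternalServiceIf {
     public:
      virtual ~ImpalaInternalServiceIf() = default;
      // Coordinator -> executor: start running one plan fragment.
      virtual void ExecPlanFragment(const PlanFragmentParams& params) = 0;
      // Executor -> coordinator: periodic status (progress, done, error).
      virtual void ReportExecStatus(const ExecStatus& status) = 0;
      // Coordinator -> executor: abort a running fragment.
      virtual void CancelPlanFragment() = 0;
      // Executor -> executor: stream intermediate rows between fragments.
      virtual void TransmitData(const RowBatch& batch) = 0;
    };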

Page 17: Cloudera Impala Internals

Services - StateStore

In many clusters we have to solve the "cluster synchronization" problem in one way or another.

In Impala it is solved by the StateStore – a publish/subscribe service, similar to ZooKeeper. Why is ZooKeeper not used?

It speaks with its clients in terms of topics, and clients can subscribe to different topics. So to find the "endpoints", look in the sources for usages of "StatestoreSubscriber" (a minimal pub/sub sketch follows).
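A toy illustration of the topic-based publish/subscribe pattern described above. The class and method names are mine, not the StateStore's actual API:

    #include <functional>
    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    using TopicCallback = std::function<void(const std::string& delta)>;

    class Statestore {
     public:
      void Subscribe(const std::string& topic, TopicCallback cb) {
        subscribers_[topic].push_back(std::move(cb));
      }
      // Push an update on a topic to every subscriber of that topic.
      void Publish(const std::string& topic, const std::string& delta) {
        for (auto& cb : subscribers_[topic]) cb(delta);
      }
     private:
      std::map<std::string, std::vector<TopicCallback>> subscribers_;
    };

    int main() {
      Statestore ss;
      ss.Subscribe("impala-membership", [](const std::string& d) {
        std::cout << "membership update: " << d << "\n";
      });
      ss.Publish("impala-membership", "node3 joined");
    }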

Page 18: Cloudera Impala Internals

StateStore – main topics

IMPALA_MEMBERSHIP_TOPIC – updates about attached and detached nodes.

IMPALA_CATALOG_TOPIC – updates about metadata changes.

IMPALA_REQUEST_QUEUE_TOPIC – updates in the queue of waiting queries.

Page 19: Cloudera Impala Internals

Admission control

There is a module called AdmissionController.

Via the impala-request-queue topic it knows about the queries currently running and their basic statistics, like memory and CPU consumption.

Based on this info it can decide to (a decision sketch follows the list):

- run the query

- queue the query

- reject the query
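A minimal sketch of such a decision, assuming a single per-pool memory budget. All thresholds and field names here are invented for illustration:

    #include <cstdint>

    enum class Decision { kAdmit, kQueue, kReject };

    struct PoolStats {
      int64_t mem_in_use;    // memory used by currently running queries
      int64_t mem_limit;     // pool memory limit
      int queued;            // queries already waiting
      int max_queue_size;    // beyond this, reject instead of queue
    };

    Decision Admit(const PoolStats& s, int64_t query_mem_estimate) {
      if (s.mem_in_use + query_mem_estimate <= s.mem_limit) return Decision::kAdmit;
      if (s.queued < s.max_queue_size) return Decision::kQueue;
      return Decision::kReject;  // queue is full: fail fast
    }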

Page 20: Cloudera Impala Internals

Catalog Service

It caches, in Java code, the metadata from the Hive metastore: /fe/src/main/java/com/cloudera/impala/catalog/

This is important since Hive's native partition pruning is slow, especially with a large number of partitions.

It uses C++ code (be/src/catalog/) to relay changes (deltas) to the other nodes via the StateStore.

Page 21: Cloudera Impala Internals

Difference from Hive

The Catalog Service stores metadata in memory and operates on it, leaving the MetaStore for persistence only.

Technically this means that decoupling from the MetaStore is not that complicated.

Page 22: Cloudera Impala Internals

ImpalaInternalService - details

This is where the real heavy lifting takes place.

Before diving in, here is what we want to understand:

Threading model

File System interface

Predicate pushdown

Resource management

Page 23: Cloudera Impala Internals

Threading model

DiskIoMgr schedules the access of all readers to all disks. It should include predicate evaluation as well.

This can give optimal concurrency. It sounds similar to the Intel TBB / Java ExecutorService approach: give me small tasks and I will schedule them (see the sketch below).

The rest of the operations – like joins and group-by – look single-threaded in the current version.

IMHO, sort-based joins and group-by are better suited for concurrency.
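A toy sketch of that "small tasks, central scheduler" idea: one queue and one worker thread per disk, with scan ranges submitted as small tasks. This is my illustration, not DiskIoMgr's actual structure:

    #include <condition_variable>
    #include <functional>
    #include <mutex>
    #include <queue>
    #include <thread>

    class DiskQueue {
     public:
      void Submit(std::function<void()> task) {
        { std::lock_guard<std::mutex> l(mu_); tasks_.push(std::move(task)); }
        cv_.notify_one();
      }
      // One worker thread per disk drains its own queue; an empty task
      // serves as a shutdown sentinel in this toy version.
      void RunWorker() {
        for (;;) {
          std::unique_lock<std::mutex> l(mu_);
          cv_.wait(l, [this] { return !tasks_.empty(); });
          std::function<void()> task = std::move(tasks_.front());
          tasks_.pop();
          l.unlock();
          if (!task) return;  // sentinel: stop the worker
          task();  // e.g. read one scan range, evaluate pushed-down predicates
        }
      }
     private:
      std::mutex mu_;
      std::condition_variable cv_;
      std::queue<std::function<void()>> tasks_;
    };

    int main() {
      DiskQueue disk0;
      std::thread worker(&DiskQueue::RunWorker, &disk0);
      disk0.Submit([] { /* read one scan range, apply predicates */ });
      disk0.Submit(nullptr);  // shutdown sentinel
      worker.join();
    }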

Page 24: Cloudera Impala Internals

File System interface

Impala works via libhdfs – so HDFS (rather than a generic DFS interface) is hard-coded.

Impala requires, and checks, that short-circuit reads are enabled.

During the planning phase, the names of the block files to be scanned are determined.

Page 25: Cloudera Impala Internals

Main "database" algorithms

It is interesting to see how the main operations are implemented and what options we have:

Group By,

Order By (Sort),

Join

Page 26: Cloudera Impala Internals

Join

The join is probably the most powerful and performance-critical part of any analytical RDBMS.

Impala implements a broadcast join and a grace hash join (be/src/exec/partitioned-hash-join-node.h). Both are kinds of hash join.

The basic idea of the grace hash join is to partition the data and then load into memory the corresponding partitions of the tables for the join (see the sketch below).
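A minimal sketch of the grace hash join idea with simplified types and a single integer join key: hash-partition both inputs, then join matching partition pairs one at a time, so only one build partition has to fit in memory at once. In the real engine the partitions live on disk; here they are just vectors:

    #include <cstdint>
    #include <functional>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    struct Row { int64_t key; /* payload columns omitted */ };

    constexpr int kNumPartitions = 16;
    int PartitionOf(int64_t key) { return std::hash<int64_t>{}(key) % kNumPartitions; }

    std::vector<std::pair<Row, Row>> GraceHashJoin(
        const std::vector<Row>& build, const std::vector<Row>& probe) {
      // Phase 1: partition both inputs (spilled to disk in the real engine).
      std::vector<std::vector<Row>> build_parts(kNumPartitions), probe_parts(kNumPartitions);
      for (const Row& r : build) build_parts[PartitionOf(r.key)].push_back(r);
      for (const Row& r : probe) probe_parts[PartitionOf(r.key)].push_back(r);

      // Phase 2: per partition pair, build an in-memory hash table on the
      // build side and probe it with the matching probe partition.
      std::vector<std::pair<Row, Row>> result;
      for (int p = 0; p < kNumPartitions; ++p) {
        std::unordered_multimap<int64_t, Row> table;
        for (const Row& r : build_parts[p]) table.emplace(r.key, r);
        for (const Row& r : probe_parts[p]) {
          auto range = table.equal_range(r.key);
          for (auto it = range.first; it != range.second; ++it)
            result.push_back({it->second, r});
        }
      }
      return result;
    }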

Page 27: Cloudera Impala Internals

[Diagram: grace hash join – partitions 1-5 of both tables are split between disk and memory; an in-memory hash join runs on the partitions currently in memory (e.g. parts 3-5) while the rest stay on disk.]

Page 28: Cloudera Impala Internals

BroadCast join

Just send the small table to all nodes and join it with the big one.

It is very similar to a map-side join in Hive.

The selection of the join algorithm can be hinted (e.g. JOIN [BROADCAST] or JOIN [SHUFFLE] in the query text).

Page 29: Cloudera Impala Internals

Group by

There are two main approaches – using a hash table (dictionary) or sorting.

Aggregation can run into memory problems when there are too many groups.

Impala uses a partitioned hash aggregation, which can spill to disk using the BufferedBlockMgr.

It is somewhat analogous to the join implementation (a minimal sketch follows).
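A minimal sketch of the hash-table approach for a COUNT(*) per group. The real partitioned aggregation additionally hash-partitions the groups so that individual partitions can be spilled via the BufferedBlockMgr; that part is omitted here:

    #include <cstdint>
    #include <unordered_map>
    #include <vector>

    std::unordered_map<int64_t, int64_t> CountByKey(const std::vector<int64_t>& keys) {
      std::unordered_map<int64_t, int64_t> groups;
      for (int64_t k : keys) ++groups[k];  // one entry per distinct group
      return groups;
    }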

Page 30: Cloudera Impala Internals

User defined functions

Impala supports two kinds of UDFs / UDAFs:

- native, written in C/C++

- Hive UDFs, written in Java.

Page 31: Cloudera Impala Internals

Caching

Impala does not cache data by itself.

It delegates it to the new HDFS caching capability.

In a nutshell – HDFS is capable of keeping a given directory in memory.

Zero-copy access via mmap is implemented (a minimal mmap example follows below).

Why is it better than the buffer cache?

Less task switching

No CRC Check
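For illustration, a minimal Linux example of zero-copy reading via mmap: the file's pages are mapped straight into the process address space, so there is no copy into a user-space buffer on each read. The path is hypothetical:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      int fd = open("/tmp/data.bin", O_RDONLY);  // hypothetical file
      if (fd < 0) return 1;
      struct stat st;
      fstat(fd, &st);
      if (st.st_size == 0) return 0;  // nothing to map
      // Map the whole file read-only; the kernel serves pages from its cache.
      void* p = mmap(nullptr, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
      if (p == MAP_FAILED) return 1;
      const char* bytes = static_cast<const char*>(p);
      printf("first byte: %d\n", bytes[0]);  // direct access, no read() copy
      munmap(p, st.st_size);
      close(fd);
    }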

Page 32: Cloudera Impala Internals

Spill to Disk

In order to be reliable, especially in the face of data skew, some way of spilling data to disk is needed.

Impala approaches this problem with the introduction of the

BufferedBlockMgr

It implements a mechanism somewhat similar to virtual memory – pin and unpin blocks, and persist them to disk.

It can use many disks to distribute the load.

It is used in all the places where memory may not be sufficient (a rough interface sketch follows).
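A rough sketch of what such a pin/unpin block interface can look like, in the spirit of the BufferedBlockMgr. The names and signatures are simplified placeholders, not the actual Impala API:

    #include <cstdint>

    class Block {
     public:
      // Bring the block's buffer into memory (reading it back from disk if
      // it was unpinned and written out). A pinned block cannot be evicted.
      virtual bool Pin() = 0;
      // Release the buffer; the manager may write the contents to one of
      // its scratch disks and reuse the memory for another block.
      virtual bool Unpin() = 0;
      virtual uint8_t* buffer() = 0;
      virtual ~Block() = default;
    };

    class BlockMgr {
     public:
      // Allocate a new pinned block, spilling unpinned blocks if the
      // per-query memory budget is exhausted.
      virtual Block* GetNewBlock(int64_t len) = 0;
      virtual ~BlockMgr() = default;
    };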

Page 33: Cloudera Impala Internals

Why not Virtual Memory?

Some databases offload all buffer management to the OS virtual memory. The most popular example: MongoDB.

Impala instead creates a BufferedBlockMgr per PlanFragment.

This gives control over how much memory is consumed by a single query on a given node.

We can summarize the answer as: better resource management.

Page 34: Cloudera Impala Internals

BufferedBlockMgr usage

Partitioned join

Sorting

Buffered Tuple Stream

Partitioned aggregation

Page 35: Cloudera Impala Internals

Memory Management

Impala BE has its own MemPool class for memory allocation.

It is used across the board by runtime primitives and plan nodes.
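A minimal arena-style pool in the spirit of MemPool: allocation is a cheap pointer bump, and all memory is released at once when the pool is destroyed. The class is my simplified illustration, not the actual MemPool code:

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    class ArenaPool {
     public:
      static constexpr size_t kChunkSize = 64 * 1024;

      // Returns memory valid until the pool itself is destroyed.
      uint8_t* Allocate(size_t len) {
        if (chunks_.empty() || offset_ + len > current_size_) {
          current_size_ = std::max(len, kChunkSize);  // oversized requests get their own chunk
          chunks_.push_back(new uint8_t[current_size_]);
          offset_ = 0;
        }
        uint8_t* p = chunks_.back() + offset_;
        offset_ += len;
        return p;
      }

      // Everything allocated from the pool is freed in one shot.
      ~ArenaPool() { for (uint8_t* c : chunks_) delete[] c; }

     private:
      std::vector<uint8_t*> chunks_;
      size_t current_size_ = 0;
      size_t offset_ = 0;
    };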

Page 36: Cloudera Impala Internals

Why own Runtime?

Impala has implemented its own runtime – memory management, virtual memory. Why?

IMHO, the existing runtimes (both the POSIX and the C++ runtime) are not multi-tenant: it is hard to track and limit resource usage by different requests within the same process.

To solve this problem Impala has its own runtime with tracking and limiting capabilities.

Page 37: Cloudera Impala Internals

YARN integration

When Impala runs as part of the Hadoop stack, resource sharing is an important question.

The two main options are:

- Just divide the resources between Impala and YARN using cgroups.

- Use YARN for the resource management.

Page 38: Cloudera Impala Internals

YARN/Impala impedance

YARN is built to schedule batch processing.

Impala is aimed at sub-second queries.

Running an application master per query does not sound "low latency".

Requesting resources "as execution goes" does not suit the pipelined execution of query fragments.

Page 39: Cloudera Impala Internals

L..LAMA ?

Page 40: Cloudera Impala Internals

LLAMA

Low Latency Application Master

Or

Long Living Application Master

It enables low-latency requests by living longer – for the whole application lifetime.

Page 41: Cloudera Impala Internals

How LLAMA works

1. There is a single LLAMA daemon that brokers resources between Impala and YARN.

2. Impala asks for all resources at once – "gang scheduling".

3. LLAMA caches resources before returning them to YARN.

Page 42: Cloudera Impala Internals

Important point

Impala is capable of:

- running real-time queries in a YARN environment

- asking for more resources (especially memory) when needed.

Main drawbacks: Impala implements its own resource management among concurrent queries, thus partially duplicating YARN functionality.

Deadlocks between two YARN applications are possible.

Page 43: Cloudera Impala Internals

Find 10 similarities

Page 44: Cloudera Impala Internals

What is the source of the similarity?

For all their differences, they solve a similar problem:

How to survive in Africa...

Oh, sorry –

How to run and coordinate a number of tasks in a cluster.

Page 45: Cloudera Impala Internals

Hadoop parallels

QueryPlanner – the developer, or Hive: somebody who creates the job.

Coordinator, ImpalaServer – the JobTracker.

PlanFragment – a task (map or reduce).

ImpalaInternalService – the TaskTracker.

RequestPoolService + Scheduler + AdmissionController – the Hadoop job scheduler.

StateStore – ZooKeeper.

Page 46: Cloudera Impala Internals

ImpalaToGo

While a great product, Impala is chained to the Hadoop stack:

- HDFS

- Management

Page 47: Cloudera Impala Internals

Why is it a problem?

HDFS is perfect for storing vast amounts of data.

HDFS is built from large, inexpensive SATA drives.

For interactive analytics we want fast storage.

We cannot afford flash drives for all of our big data.

Page 48: Cloudera Impala Internals

What is the solution?

We could create another Hadoop cluster on flash storage.

The minus – another NameNode to manage, and replication will waste space.

If the replication factor is one – any problem has to be repaired manually.

Page 49: Cloudera Impala Internals

Cache Layer in place of DFS

[Diagram: an ImpalaToGo cluster sits in front of the HDFS/Hadoop cluster, auto-loading data from it and caching it locally with LRU eviction.]

Page 50: Cloudera Impala Internals

Elasticity

With a cache layer in place of a distributed file system, it is much easier to resize the cluster.

ImpalaToGo uses consistent hashing for its data placement, to minimize the impact of a resize (a minimal sketch follows).
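A minimal consistent-hashing ring to illustrate why resizing moves little data: each file key maps to the first node clockwise from its hash position, so adding or removing one node only remaps the keys in that node's arc. This is an illustration (real implementations add virtual nodes for balance), not ImpalaToGo's actual code:

    #include <cstdint>
    #include <functional>
    #include <map>
    #include <string>

    class HashRing {
     public:
      void AddNode(const std::string& node) { ring_[Hash(node)] = node; }
      void RemoveNode(const std::string& node) { ring_.erase(Hash(node)); }
      // First node at or after the key's position, wrapping around the ring.
      // Assumes at least one node has been added.
      const std::string& NodeFor(const std::string& key) const {
        auto it = ring_.lower_bound(Hash(key));
        if (it == ring_.end()) it = ring_.begin();
        return it->second;
      }
     private:
      static uint64_t Hash(const std::string& s) { return std::hash<std::string>{}(s); }
      std::map<uint64_t, std::string> ring_;  // ring position -> node
    };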

Page 51: Cloudera Impala Internals

Who are we?

A group of like-minded developers working on making Impala even greater.

Page 52: Cloudera Impala Internals

Thank you!!!

Please ask questions!