Cloudera Impala and improvements in HDFS for real-time queries
Todd Lipcon, Software Engineer, Cloudera
Hello!
• Todd Lipcon
• Email: [email protected] / [email protected]
• Twitter: @tlipcon
• Software Engineer at Cloudera since 2009
• Hadoop and HBase PMC Member / Committer at the Apache Software Foundation
• Primarily work on HDFS and other storage projects
• Best known for writing about 80% of HDFS HA support
• …but also know a lot about the Impala code
Outline

• Cloudera and the Enterprise Data Hub (less technical)
• Impala introduction/architecture (medium technical)
• New requirements for HDFS (very technical)
  • Block replica / disk placement info
  • In-memory caching for hot files
  • Short-circuit reads, reduced copy overhead
• Performance benchmarks
Crossing the Chasm

[Diagram: technology adoption curve — Visionaries, Pragmatists, Conservatives, Skeptics — with the chasm between Visionaries and Pragmatists]

Source: Crossing the Chasm, Geoff Moore
Hadoop has matured

• Hadoop 2 with HDFS HA and YARN
• Security with Apache Sentry
• Data governance with emphasis on metadata, discovery, auditing and, most importantly, lineage
• Reliability with backup and disaster recovery
• Most importantly, open in both source and platform
Enterprise Data Hub
Inside an EDH

[Diagram: unified scale-out storage for any type of data — elastic, fault-tolerant, self-healing, with in-memory capabilities — under shared resource management, hosting an online NoSQL DBMS, an analytic MPP DBMS, a search engine, batch processing, stream processing, machine learning, SQL, streaming, and a file system (NFS); wrapped by system management and data management (metadata, security, audit, lineage)]

Attributes of an EDH:

• Store all data in original fidelity
• Secure data with standard enterprise assurances
• Access, process, and analyze all data with multiple real-time or batch engines
• Connect to your existing IT investments
Cloudera Impala

Interactive SQL for Hadoop
• Responses in seconds or milliseconds
• Nearly ANSI-92 standard SQL; fully compatible with HiveQL

Native MPP query engine
• Built for low-latency queries
• Separate runtime from MapReduce
• Designed as part of the Hadoop ecosystem

Open source
• Apache-licensed C++ and Java code base
• http://github.com/cloudera/impala
Not All SQL on Hadoop Is Created Equal

• Batch MapReduce: make MapReduce faster — slow, still batch
• Remote query: pull data from HDFS over the network to the DW compute layer — slow, expensive
• Separate DBMS: load data into a proprietary database file — rigid, separate data, slow ETL
• Impala: native MPP query engine that's integrated into Hadoop — fast, flexible, cost-effective
Impala and Hive

Shares everything client-facing:
• Metadata (table definitions)
• ODBC/JDBC drivers and BI applications
• SQL syntax (HiveQL)
• Flexible file formats (text, RCFile, Parquet, Avro, etc.)
• Machine pool
• Existing Hive UDFs

But built for different purposes:
• Hive: runs on MapReduce; ideal for batch processing
• Impala: native MPP query engine; ideal for interactive SQL

[Diagram: both engines share storage (HDFS, HBase), resource management, and the metastore; Hive pairs HiveQL syntax with the MapReduce compute framework for batch processing, while Impala combines SQL syntax with its own compute framework for interactive SQL]
Impala Query Execution

[Diagram: a SQL app connects via ODBC; each node runs an impalad (query planner, query coordinator, query executor) co-located with an HDFS DataNode and HBase; the Hive Metastore, HDFS NameNode, and Statestore provide metadata]

1) Request arrives via ODBC/JDBC/Beeswax/Shell
2) Planner turns request into collections of plan fragments
3) Coordinator initiates execution on impalad(s) local to data
4) Intermediate results are streamed between impalad instances
5) Query results are streamed back to client
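The five steps above amount to a scatter/gather: plan fragments are scattered to the executors holding the data and partial results are gathered at the coordinator. A minimal sketch of that shape, with made-up node names, partitions, and values (none of this is Impala's actual API):

```python
from collections import defaultdict

# Toy model of the query-execution steps: the coordinator splits a query
# into plan fragments, runs each fragment on the node local to its data,
# and merges the streamed partial results.
data_location = {"part1": "node1", "part2": "node2", "part3": "node1"}
tables = {"part1": [3, 1], "part2": [7], "part3": [2, 9]}

def run_fragment(partition):
    # Executor-side work: scan the local partition and pre-aggregate.
    return sum(tables[partition])

def coordinate(partitions):
    # Schedule each fragment on the node holding its data (data locality),
    # then merge the partial results at the coordinator.
    by_node = defaultdict(list)
    for p in partitions:
        by_node[data_location[p]].append(p)
    partials = [run_fragment(p) for ps in by_node.values() for p in ps]
    return sum(partials)

assert coordinate(["part1", "part2", "part3"]) == 22
```

In the real system the "merge" is itself streamed between impalad instances rather than collected in one list, but the locality-driven assignment is the same idea.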
What makes Impala fast?

• LLVM-based just-in-time (JIT) code generation
  • Query execution at maximum CPU speed, optimized for modern processors (SSE4.2, etc.)
• Low query startup overhead
  • Always-on service (no JVM launch overhead)
  • Cached table metadata
• Efficient file formats and I/O scheduling
  • Use full disk speed from all available disks
• HDFS developments specifically for Impala performance
HDFS Improvements: Motivation

• HDFS was originally built for large batch jobs (MapReduce)
• Impala is concerned with very low-latency queries
• Need to make the best use of available aggregate disk throughput and CPU resources
• Impala's more CPU-efficient execution engine is far more likely to be I/O-bound than Hive
  • For many queries, the best performance improvement will come from improved I/O
• How can we improve HDFS's raw bytes/second and bytes/cycle throughput?
HDFS Improvements: Outline

• Reducing disk contention with I/O scheduling
• In-memory caching of hot tables/files
• Reduced copies during reading; short-circuit reads
Reducing I/O contention on disks

• The problem: many plan fragments run concurrently, and may contend with each other on spinning disks
• Impala wants to make sure that concurrent plan fragments operate on data on separate disks
  • Avoid disk contention and seeks
  • Maximize aggregate available disk throughput
Scheduling without disk awareness

[Diagram: a query arrives that must process 12 blocks, randomly distributed across 4 disks; blocks are assigned in FIFO order to 4 CPU cores. Disks 1 and 2 are overloaded while disks 3 and 4 sit idle: 50% waste]
Block Replica Disk Location Data

• The solution: add a new API to return which disks each replica is stored on
• During the query planning phase, impalad…
  • determines all DNs the query's data is stored on
  • queries those DNs to get volume information
• During the query execution phase, impalad…
  • queues disk reads into per-disk queues so that each disk always has one active reader
• (included in CDH4 and Apache Hadoop 2.x)
Scheduling with disk awareness

[Diagram: the same query with 12 blocks; blocks are grouped by disk using the HDFS API, and each CPU core reads from a different disk: 100% throughput, no random I/O]
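The grouping step from the slides can be sketched in a few lines: given the per-replica disk info from the new API, build one read queue per disk so every spindle has exactly one active reader. The block-to-disk placement below is a made-up example, not real volume data:

```python
from collections import defaultdict

# 12 blocks spread over 4 disks (hypothetical placement, as in the slides).
block_disk = {f"block{i}": f"disk{(i % 4) + 1}" for i in range(1, 13)}

def disk_aware_queues(blocks):
    # Group blocks by the disk holding them: one read queue per disk,
    # so no disk is overloaded while another sits idle.
    queues = defaultdict(list)
    for b in blocks:
        queues[block_disk[b]].append(b)
    return queues

queues = disk_aware_queues(sorted(block_disk))
# Every disk gets work: 4 queues of 3 blocks each, no idle spindles.
assert len(queues) == 4
assert all(len(q) == 3 for q in queues.values())
```

With FIFO assignment (the "without disk awareness" slide), nothing prevents two cores from landing on the same disk; the grouping above is what restores the full aggregate throughput.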
In-memory Caching

• The problem: Impala queries are often bottlenecked at maximum disk throughput (~1 GB/sec for 10 disks)
• Memory throughput is much higher (~25 GB/sec)
• Memory is getting cheaper/denser
  • Slave nodes with 96+ GB of RAM
• If you put a table in RAM, we want to give a 25x speed boost
• How do we give users this 25x speed boost?
In-memory Caching

• The solution: let HDFS explicitly mlock specific files into memory
• Allows Impala to query data at full memory bandwidth (25+ GB/s/node)
• Gives the cluster operator control over which files/tables should be kept in RAM
  • For example, "cache all data in the table partitions for this week"
  • Later, we will add LRU/LFU policies
• Upcoming in Hadoop 2.3 / CDH5
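The mechanism behind this is the mlock(2) syscall: map the block file and pin its pages so reads never touch disk. A minimal, POSIX-only sketch of that idea (the 4 KiB file and fallback handling are illustrative; this is not the DataNode's actual code):

```python
import ctypes, ctypes.util, mmap, os, tempfile

libc = ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)

# Write a small stand-in for a block replica file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"x" * 4096)
    path = f.name

fd = os.open(path, os.O_RDWR)
mm = mmap.mmap(fd, 4096)

# mlock() the mapped region. This can fail (EPERM/ENOMEM) when
# RLIMIT_MEMLOCK is low -- the same ulimit a cluster operator must
# raise before enabling HDFS caching -- so we record success rather
# than assume it.
addr = ctypes.addressof(ctypes.c_char.from_buffer(mm))
pinned = libc.mlock(ctypes.c_void_p(addr), ctypes.c_size_t(4096)) == 0

first_bytes = mm[:4]   # served from the (pinned) page cache
os.close(fd)
os.unlink(path)
```

Once pages are pinned, a read is just memory traffic, which is where the ~25x figure over spinning disks comes from.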
Fewer Copies / Short-circuit Reads

• The problem: a typical read in HDFS resulted in many extra memory copies, wasting CPU
  • Buffer cache -> network socket (sendfile())
  • Network socket -> JVM buffer (read() syscall)
  • JVM buffer -> byte[] (Java InputStream.read())
  • For C clients: byte[] -> char* (libhdfs hdfsRead())
• Competition: traditional RDBMS engines have no copies in the data path
Fewer Copies / Short-circuit Reads

• The solution: allow reads to be performed directly on local files, with fewer copies
• Short-circuit reads: added a facility to HDFS that lets reads completely bypass the DataNode when the client is co-located with the block replica files
• Direct read API: added an API in libhdfs to supply direct "char*" buffers to HDFS read operations
  • Reduces the number of copies to the bare minimum (1 copy)
• Zero-copy read API: added an API in libhdfs which returns a buffer of cached data (zero-copy/mmap)
  • mmap-based zero-copy reads of cached files are "infinitely fast" (just a char* pointer assignment)
• This can also benefit apps like HBase and MR
• All available in Hadoop 2.x

libhdfs "direct read" support (HDFS-2834)
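The zero-copy idea is easy to demonstrate outside HDFS: mmap a file and hand out views into the mapping instead of copying bytes into a user buffer. A POSIX-only sketch (file contents are made up):

```python
import mmap, os, tempfile

# Write a file, then read it via mmap + memoryview: a slice of the
# memoryview is a view over the page cache, not a copy into a new buffer.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"columnar data " * 1000)
    path = f.name

fd = os.open(path, os.O_RDONLY)
mm = mmap.mmap(fd, 0, prot=mmap.PROT_READ)
view = memoryview(mm)

chunk = view[0:14]   # zero-copy: just a pointer + length into the mapping
assert bytes(chunk) == b"columnar data "
```

This is the same shape as the libhdfs zero-copy API: the caller receives a pointer into cached data rather than a freshly filled buffer, eliminating the read()/copy chain listed on the previous slide.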
Impala vs Hive 0.12 (latest)

Cluster: 5 nodes
Workload: TPC-DS geometric mean, 3 TB dataset
Impala: Parquet format; Hive 0.12: ORCFile format

Impala vs commercial MPP RDBMS

The "DeWitt Clause" prohibits using the DBMS vendor's name.
Test drive Impala

• Impala community:
  • Download: http://cloudera.com/impala
  • Github: https://github.com/cloudera/impala
  • User group: impala-user on groups.cloudera.org

Copyright © 2013 Cloudera Inc. All rights reserved.
Parquet File Format

• Open source, columnar Hadoop file format developed by Cloudera & Twitter
• Organizes data from the same column together in the file
• Reads only the columns that are required for a query
• Efficient encodings for each type (RLE, dictionary, etc.)
• Supports fully nested data
• Supports index pages for fast lookup

Why use the Parquet columnar format for HDFS?
• Well-defined open format: http://parquet.io/
  • Works in Impala, Pig, Hive & MapReduce
• I/O reduction by reading only the necessary columns
• Columnar layout compresses/encodes better
• Supports nested data by shredding columns
  • Uses techniques used by Google's ColumnIO
• Impala loads use Snappy compression by default
  • Gzip available: set PARQUET_COMPRESSION_CODEC=gzip;
• Quick word on Snappy vs. Gzip
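Run-length encoding is one reason the columnar layout encodes so well: values from the same column sit next to each other, so runs of repeats are common. A toy RLE codec to illustrate the idea (this is not Parquet's actual hybrid RLE/bit-packing format):

```python
# Toy run-length encoding over a column of values: each run of equal
# values is stored once with its count.
def rle_encode(values):
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1        # extend the current run
        else:
            runs.append([v, 1])     # start a new run
    return runs

def rle_decode(runs):
    out = []
    for v, n in runs:
        out.extend([v] * n)
    return out

# A sorted/clustered column compresses to a handful of runs.
col = ["US"] * 4 + ["DE"] * 2 + ["US"]
encoded = rle_encode(col)
assert encoded == [["US", 4], ["DE", 2], ["US", 1]]
assert rle_decode(encoded) == col
```

In a row-oriented layout the same values would be interleaved with other columns, breaking the runs; that is the compression argument for columnar storage in one example.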
Quick Note on Compression

• Snappy
  • Faster compression/decompression speeds
  • Fewer CPU cycles
  • Lower compression ratio
• Gzip/Zlib
  • Slower compression/decompression speeds
  • More CPU cycles
  • Higher compression ratio
• It's all about trade-offs
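The trade-off is easy to see empirically. Snappy is not in the Python standard library, so the sketch below uses zlib's compression levels as a stand-in: level 1 plays the "fast, lighter compression" role and level 9 the "slow, higher ratio" role (the sample data is made up):

```python
import zlib

# Repetitive, table-like data, the kind both codecs do well on.
data = b"ad_id=1234,clicks=0,country=US\n" * 2000

fast = zlib.compress(data, 1)   # cheap CPU, lower compression ratio
best = zlib.compress(data, 9)   # more CPU, higher compression ratio

# The slower setting never produces a larger result on this input,
# and both round-trip losslessly.
assert len(best) <= len(fast) < len(data)
assert zlib.decompress(best) == data
```

Which end of the spectrum wins depends on whether the workload is CPU-bound or I/O-bound, which is why Impala defaults to the cheap codec (Snappy) and leaves Gzip as an option.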
It's all about the compression codec

[Graphic: file sizes for ORCFile compressed with Gzip vs Parquet compressed with Snappy]

The Hortonworks graphic fails to call out that different codecs are used. Gzip compresses better than Snappy, but we all knew that.

Source: http://hortonworks.com/blog/orcfile-in-hdp-2-better-compression-better-performance/
HDFS Introduction

• HDFS is the Hadoop Distributed File System
• Append-only distributed file system
• Intended to store many very large files
  • Block sizes usually 64 MB – 512 MB
  • Files composed of several blocks
  • Scalable to 100+ PB
• Write a file once during ingest
• Read a file many times for analysis
HDFS Introduction (continued)

• HDFS was originally designed specifically for MapReduce
  • Each MR task typically operates on one HDFS block
  • MR tasks run co-located on HDFS nodes
  • Data locality: move the code to the data
• Each block of each file is replicated 3 times
  • For reliability when machines or drives fail
  • Provides a few options for data locality during processing
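The block and replication model above lends itself to quick back-of-envelope math: a file of size S with block size B occupies ceil(S/B) blocks, and 3x replication triples the stored copies. The file and block sizes below are hypothetical examples, not defaults from the talk:

```python
import math

def block_count(file_bytes, block_bytes):
    # A file is split into fixed-size blocks; the last block may be partial.
    return math.ceil(file_bytes / block_bytes)

file_size = 10 * 1024**3      # a 10 GiB file
block_size = 128 * 1024**2    # 128 MiB blocks (within the 64-512 MB range)

blocks = block_count(file_size, block_size)
replicas = blocks * 3         # default 3x replication

assert blocks == 80
assert replicas == 240
```

This is also why HDFS favors large blocks: fewer blocks per file means less NameNode metadata and fewer, longer sequential reads per MR task.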
HDFS Architecture

• Each cluster has…
  • A single NameNode (with high-availability failover)
    • Stores file system metadata
    • Stores the "block ID" -> DataNode mapping
  • Many DataNodes
    • Store actual file data on a local filesystem (e.g. ext4)
• Clients of HDFS…
  • Communicate with the NameNode to browse the file system and get block locations for files
  • Communicate directly with DataNodes to read/write files
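The read path described above can be modeled in a few lines: ask the "NameNode" for a block's replica locations, then fetch the bytes directly from a "DataNode". All host names, block IDs, and contents here are made up for illustration:

```python
# NameNode side: metadata only -- block ID -> DataNode hosts.
block_map = {
    "blk_1001": ["dn1", "dn3", "dn7"],
    "blk_1002": ["dn2", "dn3", "dn5"],
}

# DataNode side: actual bytes on each node's local filesystem.
datanodes = {
    "dn1": {"blk_1001": b"part-a"},
    "dn2": {"blk_1002": b"part-b"},
}

def read_block(block_id):
    # Step 1: NameNode lookup for replica locations.
    # Step 2: contact DataNodes directly, falling back across replicas.
    for host in block_map[block_id]:
        data = datanodes.get(host, {}).get(block_id)
        if data is not None:
            return data
    raise IOError("all replicas unavailable")

assert read_block("blk_1001") == b"part-a"
```

Note the data never flows through the NameNode: it serves only metadata, which is what lets a single NameNode front a cluster of many DataNodes.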
Cloudera's Enterprise Data Hub

• Improve IT operations: ETL acceleration, EDW rationalization, mainframe offload
• Powerful security solution: risk analysis, fraud prevention, compliance
• Advanced analytics engine: 360° customer view, recommendation engines, processing & analytics
• Infinite analytic storage: multi-structured data, in-place analytics, active archive
• Integration with over 200 ISVs: self-service BI, data exploration, visualization
• Flexible deployment options: on-premise or cloud, appliances, engineered systems