Deep dive into enterprise data lake through Impala
-
Upload
evans-ye -
Category
Engineering
-
view
306 -
download
9
Embed Size (px)
description
Transcript of Deep dive into enterprise data lake through Impala

Deep dive into enterprise data lake through Impala
Evans Ye2014.7.28
04/07/2023 Confidential | Copyright 2013 TrendMicro Inc. 1

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Agenda
• Introduction to Impala• Impala Architecture• Query Execution• Getting Started• Parquet File Format• ACL via Sentry
2

04/07/2023
3
Introduction to Impala
Confidential | Copyright 2013 TrendMicro Inc.

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Impala
• General-purpose SQL query engine• Support queries takes from milliseconds to
hours (near real-time)• Support most of the Hadoop file formats
– Text, SequenseFile, Avro, RCFile, Parquet• Suitable for data scientists or business
analysts
2

04/07/2023
5
Why do we need it?
Confidential | Copyright 2013 TrendMicro Inc.

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc. 2
SPEED

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Current adhoc-query solution - Pig
• Do hourly count on akamai log– A = load ‘/test_data/2014/07/20/00'
using PigStorage();B = foreach (group A all) COUNT_STAR(A);dump B;
– …0% complete100% complete(194202349)
2
4mins, 28sec

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Using Impala
• No memory cache– > select count(*) from test_data
where day=20140720 and hour=0– 194202349
• with OS cache
• Do a further query:– select count(*) from test_data where day=20140720
and hour=00 and c='US';– 41118019
2
96.46s
9.07s
6.57s

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc. 2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Status quo
• Developed by Cloudera• Open source under Apache License• 1.0 available in 05/2013• current version is 1.4• connect via ODBC/JDBC/hue/impala-shell• authenticate via Kerberos or LDAP• fine-grained ACL via Apache Sentry
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Benefits
• High Performance– C++– direct access to data (no Mapreduce)– in-memory query execution
• Flexibility– Query across existing data(no duplication)– support multiple Hadoop file format
• Scalable– scale out by adding nodes
2

04/07/2023
12
Impala Architecture
Confidential | Copyright 2013 TrendMicro Inc.

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Impala Architecture
Datanode
Tasktracker
Regionserver
impala daemon
2
NN, JT, HMActive
NN, JT, HMStandby
Datanode
Tasktracker
Regionserver
impala daemon
Datanode
Tasktracker
Regionserver
impala daemon
Datanode
Tasktracker
Regionserver
impala daemon
State store
Catalog
Hive Metastore

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Components
• Impala daemon– collocate with datanodes– handle all impala internal requests related to query
execution– User can submit request to impala daemon running on
any node and that node serve as coordinator node• State store daemon
– communicates with impala daemons to confirm which node is healthy and can accept new work
• Catalog daemon– broadcast metadata changes from impala SQL
statements to all the impala daemons
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Fault tolerance
• No fault tolerance for impala daemons– A node failed, the query failed
• state-store offline– query execution still function normally– can not update metadata(create, alter…)– if another impala daemon goes down, then
entire cluster can not execute any query• catalog offline
– can not update metadata
2

04/07/2023
16
Query Execution
Confidential | Copyright 2013 TrendMicro Inc.

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc. 2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc. 2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc. 2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc. 2

04/07/2023
21
Getting Started
Confidential | Copyright 2013 TrendMicro Inc.

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Impala-shell (sercure cluster)
• $ yum install impala-shell• $ kinit –kt evans_ye.keytab evans_ye• $ impala-shell --kerberos \
--impalad IMPALA_HOST
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Create and insert
• > create table t1 (col1 string, col2 int);• > insert into t1 values (‘foo’, 10);
– only supports writing to TEXT and PARQUET tables
– every insert creates 1 tiny hdfs file– by default, the file will be stored under
/user/hive/warehouse/t1/– use it for setting up small dimension table ,
experiment purpose, or with HBase table
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Create external table to read existing files
• > create external table t2 (col1 string, col2 int)row format delimited fields terminated by ‘\t’ location ‘/user/evans_ye/test_data’;– location must be a directory
(for example, pig output directory)– files to read:
• V /user/evans_ye/test_data/part-r-00000• X /user/evans_ye/test_data/_logs/history• X /user/evans_ye/test_data/20140701/00/part-r-00000• no recursive
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
No recursive?
• Then how to add external data with folder structure like this:– /user/evans_ye/test_data/20140701/00
/user/evans_ye/test_data/20140701/01…/user/evans_ye/test_data/20140701/02…/user/evans_ye/test_data/20140702/00
2

04/07/2023
26
Partitioning
Confidential | Copyright 2013 TrendMicro Inc.

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Create the table with partitions
• > create external table t3 (col1 string, col2 int)partitioned by (`date` int, hour tinyint) row format delimited fields terminated by ‘\t’;
• No need to specify the location on create
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Add partitions into the table
• > alter table t3 add partition (`date`=20130330, hour=0) location ‘/user/evans_ye/test_data/20130330/00‘;
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Partition number
• thousands of partitions per table– OK
• tens of thousands partitions per table– Not OK
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Compute table statistics
• > compute stats t3;• > show table stats t3;
• Help impala to optimize join query:broadcast join, partitioned join
2

04/07/2023
31
Parquet File Format
Confidential | Copyright 2013 TrendMicro Inc.

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Parquet
• apache incubator project• column-oriented file format
– compression is better since all the value would be the same type– encoding is better since value could often be the same and
repeated– SQL projection can avoid unnecessary read and decoding on
columns• Supported by Pig, Impala, Hive, MR and Cascading• impala by default use snappy with parquet• impala + parquet = google dremel
– dremel doesn’t support join– impala doesn’t support nested data structure(yet)
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Transform text files table into parquet format
• > create table t4 like t3 stored as parquet;• > insert overwrite t4
partition (`date`, hour) select * from t3
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Using parquet in Pig
• $ yum install parquet• $ pig• > A = load ‘/user/hive/warehouse/t4’
using parquet.pig.ParquetLoader as (x: chararray, y: int);
• > store A into ‘/user/evans_ye/parquet_out’ using parquet.pig.ParquetStorer;
2

04/07/2023
35
ACL via
Confidential | Copyright 2013 TrendMicro Inc.

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Sentry
• apache incubator project• provide fine-grained role based
authorization• currently integrates with Hive and Impala• require strong authentication such as
kerberos
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Enable Sentry for Impala
• turns on Sentry authorization for Impala– add two lines into impala daemon’s configuration
file(/etc/default/impala)
– auth-policy.ini Sentry policy file– server1 a symbolic name used in policy file
• all impala daemons must specify same server name
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Sentry policy file example
• roles:– on server1,
spn_user_role has permission to read(SELECT) all tables in spn database
• groups– evans_ye group has role spn_user_role
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Sentry policy file example
• roles:– evans_data has permission to access
/user/evans_ye• allows you to add data under /user/evans_ye as
partitions– foo_db_role can do anything in foo database
• create, alter, insert, drop
2

04/07/2023
Confidential | Copyright 2013 TrendMicro Inc.
Impala 2014 Roadmap
• 1.4 (now available)– order by without limit
• 2.0– nested data types (structs, arrays, maps)– disk-based joins and aggregations
2

41
Q&A
04/07/2023 Confidential | Copyright 2013 TrendMicro Inc.