Hive @ Bucharest Java User Group


Description: Introduction to Hive

Transcript of Hive @ Bucharest Java User Group

Page 1: Hive @ Bucharest Java User Group

HIVE
Bucharest Java User Group

July 3, 2014

Page 2: Hive @ Bucharest Java User Group

whoami
• Developer with the SQL Server team since 2001
• Apache contributor
  • Hive
  • Hadoop core (security)
• Stack Overflow user 105929
• @rusanu

Page 3: Hive @ Bucharest Java User Group

What is HIVE
• Data warehouse for querying and managing large datasets
• A query engine that uses Hadoop MapReduce for execution
• A SQL abstraction for creating MapReduce algorithms
• SQL interface to HDFS data
• Developed at Facebook

VLDB 2009: Hive - A Warehousing Solution Over a Map-Reduce Framework

• ASF top project since September 2010

Page 4: Hive @ Bucharest Java User Group

What is Hadoop

Hadoop Core
• Distributed execution engine
  • MapReduce
  • YARN
  • TEZ
• Distributed File System (HDFS)
• Tools for administering the execution engine and HDFS
• Libraries for writing MapReduce jobs

Hadoop Ecosystem

• HBase (BigTable)
• Pig (scripting query language)
• Hive (SQL)
• Storm (stream processing)
• Flume (data aggregator)
• Sqoop (RDBMS bulk data transfer)
• Oozie (workflow scheduling)
• Mahout (machine learning)
• Falcon (data lifecycle)
• Spark, Cassandra etc. (not based on Hadoop)
• hadoopecosystemtable.github.io

Page 5: Hive @ Bucharest Java User Group

How does Hadoop work
• JOB: binary code (Java JAR), configuration XML, any additional file(s)
  • The job gets uploaded into the cluster file system (usually HDFS)
• SPLIT: a fragment of data (file) to be processed
  • The input data is broken into several splits
• TASK: execution of the job JAR to process a split
  • The scheduler attempts to execute the task near the data split
• MAP: takes unsorted, unclustered data and outputs clustered data
• SHUFFLE: takes clustered data and produces sorted data
• REDUCE: takes sorted data and produces the desired output
• Synergies
  • Processing locality: execute the code near the data storage, avoid data transfer
  • Algorithm scalability:
    • The map phase can scale out because it assumes no sorting and no clustering
    • The reduce phase makes algorithms easy to write because data is guaranteed sorted and clustered
  • Execution reliability (monitoring, retry, preemptive execution etc.)
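
As a rough illustration of how these phases compose (a minimal sketch; the table and column names are hypothetical), a simple aggregation in Hive maps directly onto them:

  -- conceptually: the FROM scan runs in the map phase over each split,
  -- GROUP BY drives the shuffle (cluster and sort by word),
  -- and the aggregation is finalized in the reduce phase
  SELECT word, COUNT(*) AS cnt
  FROM words
  GROUP BY word;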

Page 6: Hive @ Bucharest Java User Group

MapReduce

Page 7: Hive @ Bucharest Java User Group

How does Hive work
• SQL submitted via CLI or Hiveserver(2)
• Metadata describing tables stored in an RDBMS
• Driver compiles/optimizes the execution plan
• Plan and execution engine submitted to Hadoop as a job
• MR invokes the Hive execution engine, which executes the plan

[Architecture diagram: CLI and Hiveserver2 clients (Shell, ODBC, JDBC, Beeswax) feed the Driver (compiles, optimizes), which uses the Metastore RDBMS / HCatalog and submits the job to the Hadoop Job Tracker; Map and Reduce tasks process splits stored in HDFS]
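
To inspect the plan the driver produces (a minimal sketch; the table is hypothetical), EXPLAIN prints the stage dependencies and the operator tree of each stage:

  EXPLAIN
  SELECT country, COUNT(*)
  FROM sales
  GROUP BY country;
  -- on the MR engine this typically shows Stage-1 (a map-reduce stage with the
  -- TableScan/Select/GroupBy/ReduceSink operators) and Stage-0 (the fetch stage)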

Page 8: Hive @ Bucharest Java User Group

Hive Query execution
• Compilation/optimization results in an AST containing operators, e.g.:
  • FetchOperator: scans source data (the input split)
  • SelectOperator: projects column values, computes expressions
  • GroupByOperator: aggregate functions (SUM, COUNT etc.)
  • JoinOperator: joins
• The plan forms a DAG of MR jobs
• The plan tree is serialized (Kryo)
• The Hive Driver dispatches jobs
  • Multiple stages can result in multiple jobs
• Task execution picks up the plan and starts iterating over it
  • MR emits values (rows) into the topmost operator (Fetch)
  • Rows propagate down the tree
  • ReduceSinkOperator emits map output for the shuffle
• Each operator implements both a map-side and a reduce-side algorithm
  • Executes the one appropriate for the current task
  • MR does the shuffle; many operators rely on it as part of their algorithm, e.g. SortOperator, GroupByOperator
• Multi-stage queries create intermediate output and the driver submits a new job to continue the next stage
• TEZ execution: map-reduce-reduce, usually eliminates multiple stages (more later)
• Vectorized execution mode emits batches of rows (1024 rows)
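
The TEZ and vectorized execution modes mentioned above are switched on per session; a minimal sketch, assuming Hive 0.13+ where both settings exist:

  SET hive.execution.engine=tez;               -- run the plan as a TEZ DAG instead of chained MR jobs
  SET hive.vectorized.execution.enabled=true;  -- process rows in batches of 1024 (ORC-backed tables)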

Page 9: Hive @ Bucharest Java User Group

Interacting with Hive
• hive from the shell prompt launches the CLI
  • Run SQL commands interactively
  • Can execute a batch of commands from a file
  • Results displayed in the console
• hiveserver2 is a daemon
  • JDBC and ODBC drivers for applications to connect to it
  • Queries submitted via JDBC/ODBC
  • Query results returned as JDBC/ODBC result sets
• Other applications embed the Hive driver, e.g. Beeswax

Page 10: Hive @ Bucharest Java User Group

Hive QL
• The dialect of SQL supported by Hive
• More similar to the MySQL dialect than to ANSI SQL
• Driving toward ANSI-92 compliance (syntax, data types)
• Query language: SELECT
• DDL: CREATE/ALTER/DROP DATABASE/TABLE/PARTITION
• DML: only bulk insert operations
  • LOAD
  • INSERT
• HIVE-5317: Implement insert, update, and delete in Hive with full ACID support

Page 11: Hive @ Bucharest Java User Group

Supported data types
• Numeric
  • tinyint, smallint, int, bigint
  • float, double
  • decimal(precision, scale)
• Date/Time
  • timestamp
  • date
• Character types
  • string
  • char(size)
  • varchar(size)
• Misc. types
  • boolean
  • binary
• Complex types
  • ARRAY<type>
  • MAP<type, type>
  • STRUCT<name:type, name:type>
  • UNIONTYPE<type, type, type>
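
A minimal sketch of a table mixing scalar and complex types, and how the complex fields are addressed (table and column names are hypothetical):

  CREATE TABLE employees (
    name    STRING,
    salary  DECIMAL(10,2),
    skills  ARRAY<STRING>,
    perks   MAP<STRING, FLOAT>,
    address STRUCT<street:STRING, city:STRING, zip:STRING>
  );

  SELECT name,
         skills[0],       -- array element by index
         perks['bonus'],  -- map value by key
         address.city     -- struct field
  FROM employees;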

Page 12: Hive @ Bucharest Java User Group

Storage Formats
• Text
  • ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';
  • Gzip or Bzip2 compression is automatically detected
• SEQUENCEFILE (default map-reduce output)
• ORC Files
  • Columnar, compressed
  • Certain features only enabled on ORC
• Parquet
  • Columnar, compressed
• Arbitrary SerDe (Serializer/Deserializer)
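
A minimal sketch contrasting a delimited text table with an ORC table (names hypothetical; orc.compress is one of the ORC table properties):

  CREATE TABLE logs_text (ts STRING, msg STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n'
  STORED AS TEXTFILE;

  CREATE TABLE logs_orc (ts STRING, msg STRING)
  STORED AS ORC
  TBLPROPERTIES ("orc.compress" = "ZLIB");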

Page 13: Hive @ Bucharest Java User Group

DDL/Databases/Tables
• CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name
    [COMMENT database_comment]
    [LOCATION hdfs_path]
    [WITH DBPROPERTIES (property_name=property_value, ...)];

• CREATE [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name
    [(col_name data_type [COMMENT col_comment], ...)]
    [COMMENT table_comment]
    [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]
    [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]
    [SKEWED BY (col_name, col_name, ...) ON ([(col_value, col_value, ...), ...|col_value, col_value, ...]) [STORED AS DIRECTORIES]]
    [ [ROW FORMAT row_format] [STORED AS file_format] | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)] ]
    [LOCATION hdfs_path]
    [TBLPROPERTIES (property_name=property_value, ...)]
    [AS select_statement]

• EXTERNAL tables are not owned by Hive (DROP TABLE leaves the files in place)
• Partitioning, bucketing and skew control allow precise control of file size (important for achieving balanced MR splits)
• ALTER TABLE … EXCHANGE PARTITION allows a fast (metadata-only) move of data
• ALTER TABLE … ADD PARTITION adds to the Hive metadata a partition already existing on disk
• MSCK REPAIR TABLE … scans the on-disk files to discover partitions and synchronizes the Hive metadata
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL
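
Putting a few of these statements together (a minimal sketch; paths and names are hypothetical):

  CREATE EXTERNAL TABLE web_logs (ip STRING, url STRING)
  PARTITIONED BY (dt STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  LOCATION '/data/web_logs';

  -- register a partition directory that already exists on HDFS
  ALTER TABLE web_logs ADD PARTITION (dt='2014-07-03') LOCATION '/data/web_logs/dt=2014-07-03';

  -- or discover all on-disk partitions in one pass
  MSCK REPAIR TABLE web_logs;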

Page 14: Hive @ Bucharest Java User Group

Data Load
• LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]
  • File format must match the table format (no transformations)
• INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
  INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;
  • OVERWRITE replaces the data in the table (TRUNCATE + INSERT)
  • INTO appends the data (leaves existing data intact)
• Dynamic partitioning
  • Creates new partitions based on the data
  • https://cwiki.apache.org/confluence/display/Hive/Tutorial#Tutorial-Dynamic-PartitionInsert
• INSERT OVERWRITE [LOCAL] DIRECTORY directory1 [ROW FORMAT row_format] [STORED AS file_format] SELECT ... FROM ...
  • Writes a file without creating a Hive table
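
A minimal sketch of a dynamic-partition insert (table and column names hypothetical; the two SET statements are the usual switches for non-strict dynamic partitioning):

  SET hive.exec.dynamic.partition=true;
  SET hive.exec.dynamic.partition.mode=nonstrict;

  -- the partition column (dt) is taken from the last column of the SELECT
  INSERT OVERWRITE TABLE sales_by_day PARTITION (dt)
  SELECT store, amount, dt
  FROM raw_sales;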

Page 15: Hive @ Bucharest Java User Group

Hive SELECT syntax
[WITH CommonTableExpression (, CommonTableExpression)*]
SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list]
[CLUSTER BY col_list | [DISTRIBUTE BY col_list] [SORT BY col_list]]
[HAVING having_condition]
[LIMIT number]
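
A minimal worked example using most of these clauses (table and columns are hypothetical):

  SELECT store, SUM(amount) AS total
  FROM sales
  WHERE dt >= '2014-01-01'
  GROUP BY store
  HAVING SUM(amount) > 1000
  LIMIT 10;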

Page 16: Hive @ Bucharest Java User Group

SELECT features
• REGEX column specifications
  • SELECT `(ds|hr)?+.+` FROM sales
• Virtual columns
  • INPUT__FILE__NAME
  • BLOCK__OFFSET__INSIDE__FILE
• Sampling
  • SELECT … FROM source TABLESAMPLE (BUCKET 3 OUT OF 32 ON rand());
  • SELECT … FROM source TABLESAMPLE (1 PERCENT);
  • SELECT … FROM source TABLESAMPLE (10M);
  • SELECT … FROM source TABLESAMPLE (100 ROWS);

Page 17: Hive @ Bucharest Java User Group

Clustering and Distribution
• ORDER BY
  • In strict mode it must be followed by LIMIT, because a single final reducer is required to sort all the output
• SORT BY
  • Only guarantees the order of rows within each final reducer
  • If there are multiple final reducers, the result is only partially ordered
• DISTRIBUTE BY
  • Specifies how to distribute the rows to reducers, but does not require order
• CLUSTER BY
  • Syntactic sugar for SORT BY and DISTRIBUTE BY
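
A minimal sketch of the difference (names hypothetical): DISTRIBUTE BY picks the reducer, SORT BY orders rows within it, and CLUSTER BY is the shorthand when both use the same columns:

  -- rows with the same store go to the same reducer and are sorted by store within it
  SELECT store, amount FROM sales DISTRIBUTE BY store SORT BY store;

  -- shorthand for the same thing
  SELECT store, amount FROM sales CLUSTER BY store;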

Page 18: Hive @ Bucharest Java User Group

Subqueries
• In the FROM clause
  • SELECT … FROM (SELECT … FROM …) AS alias …
• In the WHERE clause
  • SELECT … FROM … WHERE EXISTS (SELECT …)
  • SELECT … FROM … WHERE col IN (SELECT …)
  • Must appear on the right-hand side of the expression
  • IN/NOT IN must project exactly one column
  • EXISTS/NOT EXISTS must contain correlated predicates
    • Otherwise they're JOINs
• References to the parent query are only supported in WHERE clause subqueries
  • References are of course required for correlated subqueries

Page 19: Hive @ Bucharest Java User Group

Common Table Expressions (CTE)
• Supported for SELECT and INSERT
• Do not support recursive syntax
• with q1 as (select key, value from src where key = '5')
  from q1
  insert overwrite table s1
  select *;

Page 20: Hive @ Bucharest Java User Group

Lateral Views
• Aka CROSS APPLY
• Apply a table function to every row
  • SELECT … FROM table LATERAL VIEW explode(column) exTable AS exCol;
• OUTER clause to include rows for which the function generates nothing
  • Similar to ANSI-SQL OUTER APPLY
• Built-in table functions (UDTF); see the sketch after this list:
  • explode(ARRAY)
  • explode(MAP)
  • inline(STRUCT)
  • json_tuple(json, k1, k2, …)
    • Returns k1, k2 from the json as rows
  • parse_url_tuple(url, part, part, …)
    • Returns URL host, path, query:key
  • posexplode(ARRAY)
    • explode + index
  • stack(n, v1, v2, …, vk)
    • n rows, each with k/n columns
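
A minimal worked example with explode over a hypothetical employees(name STRING, skills ARRAY<STRING>) table; OUTER keeps rows whose array is empty:

  SELECT name, skill
  FROM employees
  LATERAL VIEW explode(skills) skill_view AS skill;

  -- LATERAL VIEW OUTER emits (name, NULL) when skills is empty
  SELECT name, skill
  FROM employees
  LATERAL VIEW OUTER explode(skills) skill_view AS skill;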

Page 21: Hive @ Bucharest Java User Group

Windowing and analytical functions
• LEAD, LAG, FIRST_VALUE, LAST_VALUE
• RANK, ROW_NUMBER, DENSE_RANK, PERCENT_RANK, NTILE
• OVER clause for aggregates (see the sketch after this list)
  • PARTITION BY
    • SELECT SUM(a) OVER (PARTITION BY b)
  • ORDER BY
    • SELECT SUM(a) OVER (PARTITION BY b ORDER BY c)
  • Window specification
    • SELECT SUM(a) OVER (PARTITION BY b ORDER BY c ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING)
  • WINDOW clause
    • SELECT SUM(b) OVER w
      FROM t
      WINDOW w AS (PARTITION BY b ORDER BY c ROWS BETWEEN CURRENT ROW AND 2 FOLLOWING)
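
A complete query combining a running aggregate with LAG (table and columns are hypothetical):

  SELECT store, dt, amount,
         SUM(amount) OVER (PARTITION BY store ORDER BY dt) AS running_total,
         LAG(amount, 1) OVER (PARTITION BY store ORDER BY dt) AS prev_amount
  FROM daily_sales;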

Page 22: Hive @ Bucharest Java User Group

GROUPING SETS, CUBE, ROLLUP
• GROUPING SETS
  • Logical equivalent of running the same query with different GROUP BY clauses and UNIONing the results
  • SELECT SUM(a) … GROUP BY a,b GROUPING SETS (a, (a,b))
    is equivalent to
    SELECT SUM(a) … GROUP BY a
    UNION
    SELECT SUM(a) … GROUP BY a,b;
• GROUP BY … WITH CUBE
  • Equivalent of adding all possible GROUPING SETS
  • GROUP BY a,b,c WITH CUBE
    is equivalent to
    GROUP BY a,b,c GROUPING SETS ((a,b,c), (a,b), (a,c), (b,c), (a), (b), (c), ())
• GROUP BY … WITH ROLLUP
  • Equivalent of adding all the GROUPING SETS that lead with the GROUP BY columns
  • GROUP BY a,b,c WITH ROLLUP
    is equivalent to
    GROUP BY a,b,c GROUPING SETS ((a,b,c), (a,b), (a), ())

Page 23: Hive @ Bucharest Java User Group

XPath functions
• xpath_...(xml_string, xpath_expression_string)
  • xpath_long returns a long
  • xpath_short returns a short
  • xpath_string returns a string
  • …
• xpath(xml, xpath) returns an array of strings
  • SELECT xpath(col, '//configuration/property[name="foo"]/value')

Page 24: Hive @ Bucharest Java User Group

User Defined Functions

package com.example.hive.udf;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public final class Lower extends UDF {
  public Text evaluate(final Text s) {
    if (s == null) { return null; }
    return new Text(s.toString().toLowerCase());
  }
}

CREATE FUNCTION myLower AS 'com.example.hive.udf.Lower' USING JAR 'hdfs:///path/to/jar';
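
A hypothetical invocation of the function registered above (table and column names assumed):

  SELECT myLower(name) FROM employees LIMIT 10;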

• Aggregate functions are also possible, but more complicated
  • Must track the map side vs. the reduce side and 'merge' the intermediate results
  • https://cwiki.apache.org/confluence/display/Hive/GenericUDAFCaseStudy

Page 25: Hive @ Bucharest Java User Group

TRANSFORM
• Plug custom scripts into query execution
• SELECT TRANSFORM(stuff)
  USING 'script'
  AS (thing1 INT, thing2 INT)

• FROM (
    FROM pv_users
    MAP pv_users.userid, pv_users.date
    USING 'map_script'
    CLUSTER BY key
  ) map_output
  INSERT OVERWRITE TABLE pv_users_reduced
  REDUCE map_output.key, map_output.value
  USING 'reduce_script'
  AS date, count;

• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Transform

Page 26: Hive @ Bucharest Java User Group

Hive Indexes
• Indexes are aimed at reducing the data read for range scans
  • Fewer splits, fewer map tasks, less IO
  • Relies on Predicate Push Down
• Order guarantees can simplify certain algorithms
  • GROUP BY aggregations can use streaming aggregates vs. hash aggregates
• Hive does not need/use indexes for 'seek' like OLTP RDBMSs
• Indexes are in almost every respect just another table with the same data
• The Query Optimizer uses rewrite rules to leverage indexes
• Indexes are not automatically maintained on LOAD/INSERT (see the rebuild sketch below)
• https://cwiki.apache.org/confluence/display/Hive/IndexDev
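
A minimal sketch of creating and refreshing a compact index (table and index names hypothetical; WITH DEFERRED REBUILD means the index is only populated when explicitly rebuilt):

  CREATE INDEX sales_dt_idx ON TABLE sales (dt)
  AS 'COMPACT' WITH DEFERRED REBUILD;

  -- re-run after loads, since index data is not maintained automatically
  ALTER INDEX sales_dt_idx ON sales REBUILD;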

Page 27: Hive @ Bucharest Java User Group

JOIN optimizations
• A difficult problem in MR
• The naïve join relies on the MR shuffle to partition the data
  • Reducers can implement the JOIN easily simply by merging the input, as it is sorted
  • It is a size-of-data copy through the MR shuffle
• MapJoin (see the sketch after this list)
  • If there is one big table (facts) and several small tables (dimensions)
  • Read all the small tables, hash them
    • Serialize the hash into the HDFS distributed cache
    • Done by the driver as stage-0, before launching the actual query
  • The MapJoinOperator loads the small tables in memory
  • The JOIN can be performed on-the-fly, on the map side, avoiding the big shuffle
  • Requires live RAM; task JVM memory settings must allow for enough memory
• Sort Merge Bucket (SMB) join
  • Between big tables that are bucketed by the same key
    • And the bucketing key is also the join key
  • The map task scans buckets from multiple tables in parallel
    • MR only knows about one of them
    • For the rest, the SMBJoinOperator simulates a MR environment to scan them
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+JoinOptimization
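
A minimal sketch of a map join (table names hypothetical); either the explicit MAPJOIN hint or the auto-conversion setting triggers it:

  SET hive.auto.convert.join=true;   -- let the optimizer convert joins against small tables

  SELECT /*+ MAPJOIN(d) */ f.order_id, d.region
  FROM fact_orders f
  JOIN dim_stores d ON (f.store_id = d.store_id);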

Page 28: Hive @ Bucharest Java User Group

Partitioning in Hive
• CREATE TABLE … PARTITIONED BY (…)
• A separate data directory is created for each distinct combination of partitioning column values
• Can result in many small files if abused
  • Use org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
• Use also bucketing
  • CREATE TABLE …
    PARTITIONED BY (…)
    CLUSTERED BY (…) SORTED BY (…) INTO … BUCKETS
• Bucketing helps many queries (a concrete sketch follows)
• https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL+BucketedTables
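
A concrete sketch of a partitioned, bucketed table (names and bucket count are hypothetical):

  CREATE TABLE page_views (user_id BIGINT, url STRING)
  PARTITIONED BY (dt STRING)
  CLUSTERED BY (user_id) SORTED BY (user_id) INTO 32 BUCKETS
  STORED AS ORC;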

Page 29: Hive @ Bucharest Java User Group

How to get started with Hive
• HDInsight 3.1 comes with Hive 0.13
• Hortonworks Sandbox (VM) has Hive 0.13
• Cloudera CDH 5 VM comes with Hive 0.12
• Build it yourself
  • https://cwiki.apache.org/confluence/display/Hive/AdminManual+Installation

• Mailing list: [email protected]