Hive Quick Start

    First Taste of Apache Hive

    About Hadoop

Hadoop consists of HDFS and MapReduce. When we talk about Hadoop, we usually refer to both components, but they can be deployed and can work independently.

Name Node: HDFS data dictionary. Also used to record various counters used by MapReduce jobs.

Job Tracker: used for MapReduce job management, covering resource management, job dispatch, and job status. YARN (Hadoop 2.0) has re-architected the functions provided by the Job Tracker and split resource management and job management into different modules.

[Diagram: MapReduce (distributed computing framework), managed by the Job Tracker, runs on top of HDFS (distributed storage and file system), managed by the Name Node.]

    About Hive

Hive is a SQL engine on top of Hadoop that behaves much like an RDBMS.

Hive can have native tables and external tables. The data for native tables lives inside Hadoop HDFS.

The source of an external table can be data stored in HDFS, another Hadoop-based NoSQL database such as HBASE, any other RDBMS via the JDBCStorageHandler, or other NoSQL databases through a supported storage handler.

Even with external tables from external data sources, Hive still needs HDFS for temp space and intermediate data storage.

Hive stores metadata (database/schema definitions, table definitions, index definitions, partition definitions, privileges) in a metadata store, an RDBMS such as the Java-based Derby, MySQL, Oracle, etc.

The Hive query language (QL) syntax is very similar to MySQL's SQL, but it provides only a subset of the functionality.

[Diagram: App/User Interface → Hive Driver (QL) → MapReduce/HDFS and External Data Sources, with the MetaDataStore (data dictionary) attached to the driver.]

    Hive Process Flow

[Diagram: App/User Interface → Driver (Compile, Execute) → Hadoop Job Tracker. Compile pipeline: Parse Tree → Logic Plan (SQL ops) → Logic Optimization → Physical Plan (MR tasks) → Physical Optimization.]

1. The user sends a query to Hive using its CLI, the JDBC/ODBC interface, or the Thrift server.

2. The query is passed to the Hive QL Driver.

3. Hive QL compiles (parses) the query, applies a small set of rule-based optimizations, and generates one or more MapReduce jobs (the EXPLAIN sketch after this list shows such a compiled plan).

4. The Hive QL Driver submits the generated MapReduce jobs to the Hadoop Job Tracker for execution.

5. Logic optimization is SQL optimization.

6. Physical optimization scans the generated MapReduce tasks to merge or eliminate redundant tasks.
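
To inspect the compiled plan this flow produces, Hive's EXPLAIN statement prints the stage graph. A minimal sketch; the table tick_simple and column sector are sample names used later in this deck:

    -- prints the parse/logical/physical stages Hive generated for the query
    EXPLAIN
    SELECT sector, count(*)
    FROM tick_simple
    GROUP BY sector;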

    Compare to Oracle

Think of a Hive query as an Oracle query with large parallel operations across a very large Oracle RAC, without cache fusion.

A normal Oracle parallel operation consists of two sets of PX processes, producers and consumers, plus a PX coordinator. For Hive, the producer is the Hadoop Mapper, the consumer is the Hadoop Reducer, and the PX coordinator is the Hadoop Job Tracker.

The differences:

A typical Oracle PX execution plan can have multiple operations, and the allocated PX processes are reused to execute those operations.

A typical Hive query can have one or more relatively independent Hadoop jobs (while the jobs are interdependent from Hive's point of view, they are independent from Hadoop's point of view). If more than one job is required for a Hive query, the Mapper and Reducer slots, Hive's counterparts of PX processes, have to be reallocated and re-launched. This is a significant source of latency for complicated Hive queries.

On the other hand, a Hive query requiring multiple Hadoop MapReduce jobs can be compared to an Oracle query plan with multiple PX coordinators to deal with unmergeable subqueries and subqueries inside the SELECT list that require PX.

    Quick Start

Cloudera provides preinstalled Hadoop VM images we can play with: https://ccp.cloudera.com/display/SUPPORT/Demo+VMs

Note that earlier-version VMs (CDH3 and earlier) could have broken Hive functionality, so try to use CDH4. CDH4 uses Hadoop 2.0 (YARN), which splits the Job Tracker into multiple modules.

A major issue for a Hive query with multiple MapReduce jobs is job dispatch from the Job Tracker and the allocation of Mapper and Reducer slots. One major purpose of Hadoop 2.0 (YARN) is to optimize resource allocation, but it looks like Hive has yet to take advantage of this feature.

    Play With Hive

The screenshots show stock lists and quotes downloaded from NASDAQ and Y! Finance, explored from the Hive command line. Callouts:

The log configuration file lets you adjust the log level.

The session-level log records query-related information such as the query text and plan.

It is easy to create a new database (schema).

SHOW DATABASES displays the database names (MySQL syntax).

USE stocks is equivalent to Oracle's alter session set current_schema=stocks (MySQL-style syntax).
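
A hedged sketch of that session (the database name stocks comes from the slide; the statements are standard HiveQL):

    -- create a new database (schema)
    CREATE DATABASE IF NOT EXISTS stocks;
    -- display database names (MySQL syntax)
    SHOW DATABASES;
    -- like Oracle's alter session set current_schema=stocks
    USE stocks;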

    Create Table and Load Data

1. create table creates a table whose data will be managed by Hive.

2. create external table creates a table linked to another data source, for example data stored on HDFS that was loaded or generated by methods unrelated to Hive.

3. If data is going to be loaded from another source, it is better to specify its row format and column format.

4. Here I loaded data from three local files using the into table syntax. The source data can also be HDFS data; Hive simply copies the data files. The overwrite into table syntax deletes whatever is already stored and copies in the new file; if I had used this option, I would have ended up with only one exchange list. (See the sketch after this list.)
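
A minimal sketch of these statements, assuming illustrative column names and file paths (the deck later queries a stock-list table called tick_simple with a sector column):

    -- native table: data managed by Hive inside HDFS
    CREATE TABLE tick_simple (
      symbol STRING,
      name   STRING,
      sector STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- external table: linked to data produced outside Hive
    CREATE EXTERNAL TABLE tick_ext (symbol STRING, name STRING, sector STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/ticks';   -- hypothetical HDFS path

    -- INTO TABLE appends, so loading three exchange files keeps all three
    LOAD DATA LOCAL INPATH '/tmp/nasdaq.csv' INTO TABLE tick_simple;
    -- OVERWRITE INTO TABLE would first delete what is already stored:
    -- LOAD DATA LOCAL INPATH '/tmp/nyse.csv' OVERWRITE INTO TABLE tick_simple;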

    A Simple Query At Work

Two MR jobs are generated: most likely one for the GROUP BY and one for the ORDER BY.

Hadoop MR job tracking information is printed, including a detail URL; from there, we can check performance metrics and task logs.

The line following the tracking URL is the command to kill the job, equivalent to Oracle's alter system kill session.
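
A query of the shape described above, as a sketch (sample names as before); the GROUP BY and the global ORDER BY each map to their own MR job:

    SELECT sector, count(*) AS cnt
    FROM tick_simple
    GROUP BY sector    -- first MR job
    ORDER BY cnt DESC; -- second MR job: global sort through a single reducer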

    Job Status

Details can be accessed from the menu above. Counters hold the job's performance metrics. The Map Tasks and Reduce Tasks pages hold performance data and logs for individual map and reduce tasks.

The job log can be accessed here as well.

1. Job counters are performance and resource-usage metrics specific to the job.

2. For a long-running job, look for the very large numbers, especially IO and network communication.

The job log can give detail about how the job is managed, for example how and when resources are allocated. It is equivalent to an Oracle 10046 trace file for the query coordinator of a PX operation. For a Hive query requiring multiple MR jobs on a busy cluster, job dispatch latency is going to be a huge headache.

This page tracks status for all Map Tasks; there is also a page for all Reduce Tasks. From there, we can drill down further into an individual task attempt, which is especially useful when most of the tasks have completed while one or two are still hanging around.

Individual task performance data: there are performance counters specific to the task. The log link is the more interesting one for app-dev troubleshooting.

The task log is very useful for finding business-logic errors. The timestamps can help you identify performance-bottleneck operations. It is equivalent to an Oracle 10046 trace file for a PX slave process. For a small MR job, the overhead of getting the job scheduled and the tasks launched is significant.

    Partition Support

Here I created a table PRICES to store historical stock prices. The table is partitioned by stock tick. Then I loaded stock-quote CSV files for several ticks; each tick gets its own file in Hive's HDFS storage.
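
A sketch of such a table; the quote columns are assumptions based on typical Yahoo! Finance CSVs, and the PARTITIONED BY clause is the relevant part:

    CREATE TABLE prices (
      trade_date STRING,
      open       DOUBLE,
      high       DOUBLE,
      low        DOUBLE,
      close      DOUBLE,
      volume     BIGINT
    )
    PARTITIONED BY (tick STRING)   -- one HDFS directory per tick
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- load one CSV per tick into its own partition
    LOAD DATA LOCAL INPATH '/tmp/MSFT.csv'
    INTO TABLE prices PARTITION (tick = 'MSFT');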

    Partition Pruning

The queries use the partition column explicitly. MSFT has a longer history, so its partition is larger; FB is new, so there is very little data inside its partition.
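
Sketches of such queries; because the WHERE clause names the partition column, Hive reads only the matching partition directory:

    SELECT count(*) FROM prices WHERE tick = 'MSFT';  -- larger partition
    SELECT count(*) FROM prices WHERE tick = 'FB';    -- tiny partition, fast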

    JOIN

Join Optimization: Map Join

1. A map join is equivalent to an Oracle PX hash join where one table is very small and broadcast is used for PX distribution.

2. In Hive, the hash table is built by the Hive Driver, which uses the Hadoop distributed cache to send the data to the nodes that will run the map join.

3. The join is done at the mapper stage (hence the name); no reducer is used. Because there is one fewer layer of resource allocation and data transfer, the total time is less than half of the time without this optimization. (See the sketch after this list.)
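
A sketch of triggering a map join on the sample tables above; the MAPJOIN hint is the classic form, and the hive.auto.convert.join parameter lets Hive choose it automatically when one side is small enough:

    -- build the hash table from the small table tick_simple and broadcast it
    SELECT /*+ MAPJOIN(t) */ t.sector, count(*)
    FROM prices p JOIN tick_simple t ON (p.tick = t.symbol)
    GROUP BY t.sector;

    -- or let Hive decide based on table sizes
    set hive.auto.convert.join=true;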

Join Optimization: Skewed Data

A usual Hive join is like a hash-partitioned merge-sort join: data from the two sources are hash partitioned and distributed to the reducers according to the hash values. For very large data sets, if the data are highly skewed on a few join key values, say 1 or 2 of them, the join will be concentrated on 1 or 2 reducers, and execution time could be very long, or the job could even run out of memory.

A skewed-data join is not a Hive-specific issue; Oracle hash join with PX suffers the same. Even without PX, skewed data also causes high CPU usage because of hash value collisions.

For Hive, if one data source is small, a map join can be used to reduce the impact of data skew.

Join Optimization: Skewed Data

Hive also introduces another join strategy: the skew join. It splits the job into two jobs: one for the non-skewed data and one for the skewed data. For the non-skewed data, the join runs as normal. For the skewed data, the join key set is small and most likely known at runtime, so Hive can use other join optimizations such as MAPJOIN.

Session-level parameters tell Hive to consider a skew join:

set hive.optimize.skewjoin=true;
set hive.skewjoin.key={a threshold for the row count on a skewed key; defaults to 100,000};

Note that the implementation is not very stable. The join is even sensitive to the order in which the tables appear in the query, and query execution can be terminated because of the table order.

Because it adds more Hive stages (Hive local jobs plus Hadoop jobs), this strategy should be considered only if a query runs intolerably long.

Aggregation Optimization: Map-Side Aggregation

A typical MapReduce or Pig job uses a Combiner for partial aggregation on the map side, to reduce the data size passed to the reducers. Oracle parallel operations have a similar feature for GROUP BY (you can frequently see two GROUP BY steps in a PX plan).

Hive does not support the Combiner. Instead, it introduces a concept called map-side aggregation, which is enabled by default:

set hive.map.aggr=true;

To verify the current setting, use:

set hive.map.aggr;

My sample query for the test (the impact is verified from the job status tracking):

select sector, count(*) from tick_simple group by sector;

With hive.map.aggr=true, the map input record count is 6430 and the map output count is 55, so only 55 records are sent to the reducer side.

With hive.map.aggr=false, the map input record count is 6430 and the map output count is 6430: about 116 times more data is sent to the reducer side.

It does not make much difference for this small test, but it makes a huge difference for production data with billions of input records.

    Function Support

Hive ships with the usual built-in functions:

Standard functions: abs, round, floor, ucase, concat, etc.

Aggregate functions: avg, sum, etc.

Table-generating functions (cf. Oracle table functions): explode. (See the sketch after this list.)
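
As a sketch of the table-generating case, explode is normally used through LATERAL VIEW to fan one row out into many (the events table and payload column are hypothetical):

    -- one output row per comma-separated field of each event
    SELECT e.event_id, field
    FROM events e
    LATERAL VIEW explode(split(e.payload, ',')) f AS field;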

User-defined functions can be written in Java, packaged in a JAR, and loaded on a per-session basis (a sketch follows below). However, making a user-defined function permanent requires modifying the Hive source code and rebuilding the Hive package.

User-defined functions do give us the opportunity to create common functions shared by multiple products, or to take advantage of Hive's general QL capability while wrapping project-specific requirements into user-defined functions.
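
Session-level loading looks roughly like this sketch (the JAR path, class name, and function name are hypothetical):

    -- make the jar visible to this session only
    ADD JAR /tmp/my-udfs.jar;
    -- register the Java class as a temporary (session-scoped) function
    CREATE TEMPORARY FUNCTION parse_event AS 'com.example.hive.ParseEventUDF';
    SELECT parse_event(line) FROM raw_events;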

For example, for RMX, the source data are events spilled out by ad servers, collected by the data highway, and dumped on the Hadoop grid; the RMX data pipelines then use MapReduce or Pig jobs to reprocess the data, which is consumed by other applications. A common requirement is to split/extract the data related to advertisers, publishers, networks, etc., in tabular format from each event. This can easily be written as a Hive table-generating function, leaving the remaining work as standard SQL operations.

Function At Work

Hive does not provide a built-in method to handle CSV with quotes, so my test table has data quality issues after the company-name column, because some names contain commas. As a result, my first test query failed to return correct sector names.

For the following test, I created a table with a single column and loaded the one tick-symbol file, nasdaq.csv. Then I used the split function with a CSV-aware regex to extract the correct sector name. (See the sketch below.)

However, I have not figured out how to use the SPLIT function to extract all columns to build another table with CTAS. But it is not hard to use the Java String split method to build a Hive table-generating function for that purpose.
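
A sketch of that extraction; the staging table is hypothetical, the regex matches commas only outside double quotes, and the [6] index assumes sector is the seventh field of the NASDAQ CSV:

    CREATE TABLE nasdaq_raw (line STRING);
    LOAD DATA LOCAL INPATH '/tmp/nasdaq.csv' INTO TABLE nasdaq_raw;

    -- split on commas that are not enclosed in quotes, then pick the sector field
    SELECT split(line, ',(?=([^"]*"[^"]*")*[^"]*$)')[6] AS sector
    FROM nasdaq_raw;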

    Other Interesting Things

Index: Hive has two types of indexes. A normal (compact) index stores (key, HDFS file name, key offset) tuples in another Hive table; Hive also supports bitmap indexes. However, since Hadoop typically uses a 64MB or 128MB block size, indexes are not really very useful in Hive. Furthermore, a Hive index only works with a partitioned table if Hive can prune the partitions based on the index data.
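
For reference, a compact index in this era of Hive was created roughly like this sketch (index and table names are illustrative):

    CREATE INDEX idx_sector ON TABLE tick_simple (sector)
    AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    WITH DEFERRED REBUILD;
    -- populate the index table
    ALTER INDEX idx_sector ON tick_simple REBUILD;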

RDBMS connections: Hive currently does not have a built-in storage handler to access data in other RDBMS databases such as Oracle and MySQL (there is a third-party JDBCStorageHandler, though). It would be interesting if we could use Hive rather than Sqoop to dump or access data from Oracle or MySQL directly.

NoSQL: Hive does provide built-in storage handlers to access data in some NoSQL databases such as HBASE, Cassandra, and DynamoDB.
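
The HBASE case uses Hive's HBaseStorageHandler; a sketch, where the HBase table name and column mapping are hypothetical:

    CREATE EXTERNAL TABLE hbase_ticks (key STRING, sector STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:sector')
    TBLPROPERTIES ('hbase.table.name' = 'ticks');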

If the log level is set appropriately, the Hive log file hive.log contains information similar to an Oracle 10053 trace file for the query parsing process. This can help figure out why a specific feature is or is not used. Hive also provides EXPLAIN for explain plans, and the execution plan can be found in the session-level log file (though it is not formatted in a user-friendly way).

    My Wish List

Better YARN (Hadoop 2.0) integration, so that Hive can take advantage of the new job and resource management and allocation architecture.

Better documentation and manuals: the user documents are sparse and incomplete.

Documentation about the source code: for open source products without professional technical support, a clear way to help users navigate the source code modules and understand the design ideas is very important.

    References

Book: Programming Hive, by Jason Rutherglen, Dean Wampler, and Edward Capriolo.

Home page: http://hive.apache.org; check its wiki for links to presentations and manuals.

Source code: http://svn.apache.org/viewvc/hive; this is the only place to find out why the feature you like does not work, or why your query encounters an error.

Hive internals: http://www.slideshare.net/mobile/recruitcojp/internal-hive

Side note: comparing Hive against your understanding of Oracle, especially query optimization, will help you understand Hive and make it perform better. However, Hive developers might take more hints from MySQL, which is also an open source project.
