Hive Quick Start
8/13/2019 Hive Quick Start
First Taste of Apache Hive
About Hadoop
Hadoop consists of HDFS and MapReduce. When we talk about Hadoop, we usually refer to both components, but they can be deployed and work independently.
Name Node: the HDFS data dictionary. Also used to record various counters used by MapReduce jobs.
Job Tracker: used for MapReduce job management, covering resource management, job dispatch, and job status. YARN (Hadoop 2.0) has re-architected the functions provided by the Job Tracker and split resource management and job management into different modules.
MapReduce: distributed computing framework
HDFS: distributed storage and file system
About Hive
Hive is an RDBMS-like SQL engine built on top of Hadoop.
Hive can have native tables and external tables. The data for native tables lives inside Hadoop HDFS.
The source of an external table can be data stored in HDFS, another Hadoop-based NoSQL database such as HBase, any other RDBMS via a JDBCStorageHandler, or any other NoSQL database with a supported storage handler.
Even with external tables from external data sources, Hive still needs HDFS for temp space and intermediate data storage.
Hive stores metadata (database/schema definitions, table definitions, index definitions, partition definitions, privileges) in a metadata store, an RDBMS such as Java Derby, MySQL, Oracle, etc.
Hive's query language, QL, is very similar in syntax to MySQL's SQL dialect, but it provides only a subset of the functions.
[Diagram: App/User Interface → Hive Driver (QL) → MapReduce/HDFS and External Data Sources, with a Metadata Store (Data Dictionary) attached to the Driver.]
Hive Process Flow
[Diagram: the App/User Interface sends the query to the Driver, which compiles it (Parser → Parse Tree → Logic Plan (SQL Ops) → Logic Optimization → Physical Plan (MR Tasks) → Physical Optimization) and then executes it via the Hadoop Job Tracker.]
1. The user sends a query to Hive using either its CLI interface, its JDBC/ODBC interface, or the Thrift server.
2. The query is passed to the Hive QL Driver.
3. Hive QL compiles (parses) the query, applies a small set of rule-based optimizations, and generates one or more MapReduce jobs.
4. The Hive QL Driver submits the generated MapReduce jobs to the Hadoop Job Tracker for execution.
5. Logic Optimization is SQL-level optimization.
6. Physical Optimization scans the generated MapReduce tasks to merge or eliminate redundant tasks.
Compare to Oracle
Think of a Hive query as an Oracle query with large parallel operations across a very large Oracle RAC, without cache fusion.
A normal Oracle PX operation consists of two sets of PX processes, producers and consumers, plus a PX coordinator. For Hive, the producer is the Hadoop Mapper, the consumer is the Hadoop Reducer, and the PX coordinator is the Hadoop Job Tracker.
The differences:
A typical Oracle PX execution plan can have multiple operations, and the allocated PX processes are reused to execute those operations.
A typical Hive query can have one or more relatively independent Hadoop jobs (while the jobs are interdependent from Hive's point of view, they are independent from Hadoop's point of view). If more than one job is required for a Hive query, the Mapper and Reducer slots (Hive's "PX processes") have to be reallocated and re-launched. This is a very significant source of latency for complicated Hive queries.
On the other hand, a Hive query requiring multiple Hadoop MapReduce jobs can be compared to an Oracle query plan with multiple PX coordinators to deal with unmergeable subqueries and subqueries inside the SELECT list requiring PX.
Quick Start
Cloudera provides preinstalled Hadoop VM images we can play with: https://ccp.cloudera.com/display/SUPPORT/Demo+VMs
Note that earlier-version (CDH3..) VMs could have broken Hive functionality, so try to use CDH4... CDH4 uses Hadoop 2.0 (YARN), which splits the job tracker into multiple modules.
A major issue for a Hive query with multiple MapReduce jobs is job dispatch by the job tracker and the allocation of Mapper and Reducer slots. One major purpose of Hadoop 2.0 (YARN) is to optimize resource allocation, but it looks like Hive has yet to take advantage of this feature.
Play With Hive
Stock list and quotes downloaded from NASDAQ and Y! Finance.
Hive command line. A log configuration file lets you adjust the log level.
Session-level query-related log, such as the query text and plan.
Easy to create a new database (schema).
Display database names (MySQL syntax).
Equivalent to Oracle alter session set current_schema=stocks (MySQL-style syntax).
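The annotated commands above can be sketched as a minimal Hive CLI session (only the stocks schema name comes from the slides):

```sql
CREATE DATABASE stocks;  -- easy to create a new database (schema)
SHOW DATABASES;          -- display database names (MySQL syntax)
USE stocks;              -- equivalent to Oracle: alter session set current_schema=stocks
```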
Create Table and Load Data
1. create table creates a table whose data will be managed by Hive.
2. create external table creates a table linked to another data source, for example data stored on HDFS that was loaded/generated by methods unrelated to Hive.
3. If data is going to be loaded from another source, it is better to specify its row format and column format.
4. Here I loaded data from three local files using the into table syntax. The source data can also be HDFS data; Hive will simply copy the data files. The overwrite into table syntax deletes whatever is already stored and copies the new file. If I had used that option, I would end up with only one exchange list.
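The four points above can be sketched roughly as follows; the column list and file paths are illustrative (only the tick_simple table and the idea of per-exchange files appear in the slides):

```sql
-- 1/3: a native table managed by Hive, with row/column format specified for CSV input.
CREATE TABLE tick_simple (
  symbol STRING,
  name   STRING,
  sector STRING
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',';

-- 4: INTO TABLE appends, so loading three exchange files keeps all three lists:
LOAD DATA LOCAL INPATH '/tmp/nasdaq.csv' INTO TABLE tick_simple;

-- OVERWRITE deletes whatever is already stored first (would leave only one list):
-- LOAD DATA LOCAL INPATH '/tmp/nyse.csv' OVERWRITE INTO TABLE tick_simple;

-- 2: an external table linked to data already on HDFS, not managed by Hive:
CREATE EXTERNAL TABLE tick_ext (symbol STRING, name STRING, sector STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/ticks';
```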
A Simple Query At Work
Two MR jobs are generated, most likely one for GROUP BY and one for ORDER BY.
Hadoop MR job tracking information. A detail URL is provided; from there, we can check performance metrics and task logs.
The line following the tracking URL is the command to kill the job, equivalent to Oracle's alter system kill session.
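A query of the shape the slide describes, assuming the tick_simple table from the earlier example; the GROUP BY compiles into one MR job and the global ORDER BY into a second:

```sql
SELECT sector, COUNT(*) AS cnt
FROM tick_simple
GROUP BY sector      -- MR job 1: aggregation
ORDER BY cnt DESC;   -- MR job 2: global sort
```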
Job Status
Details can be accessed from the menu above: Counters for performance metrics; Map Tasks and Reduce Tasks for individual map and reduce task performance data and logs.
The job log can also be accessed here.
1. Job counters are performance and resource usage metrics specific to the job.
2. For a long-running job, look for the very large numbers, especially IO and network communication.
The job log can give details about how the job is managed, for example how and when resources are allocated. It is equivalent to an Oracle 10046 trace file for the query coordinator of a PX operation. For a Hive query requiring multiple MR jobs on a busy cluster, job dispatch latency is going to be a huge headache.
Tracking status for all Map Tasks; there is also a page for all Reduce Tasks. From there, we can further drill down to an individual task attempt, especially when most of the tasks are completed while one or two are still hanging around.
Individual task performance data. There are performance counters specific to the task. The log link is more interesting for app-dev troubleshooting.
The task log is very useful for finding business logic errors. The timestamps can help you identify performance-bottleneck operations. It is equivalent to an Oracle 10046 trace file for a PX slave process. For a small MR job, the overhead of getting the job scheduled and the tasks launched is significant.
Partition Support
Here I created a table PRICES to store historical stock prices. The table is partitioned by stock tick. Then I loaded stock quote CSV files for several ticks. Each tick has its own file in Hive's HDFS storage.
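A hedged sketch of the PRICES table; the slides confirm only the table name and the tick partition key, so the quote columns and file paths are illustrative:

```sql
CREATE TABLE prices (
  price_date  STRING,
  open_price  FLOAT,
  high_price  FLOAT,
  low_price   FLOAT,
  close_price FLOAT,
  volume      BIGINT
)
PARTITIONED BY (tick STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Each tick's CSV lands in its own partition directory on HDFS:
LOAD DATA LOCAL INPATH '/tmp/MSFT.csv' INTO TABLE prices PARTITION (tick='MSFT');
LOAD DATA LOCAL INPATH '/tmp/FB.csv'   INTO TABLE prices PARTITION (tick='FB');
```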
Partition Pruning
The queries use the partition column explicitly. MSFT has a longer history and its partition is larger; FB is new, so there is very little data inside its partition.
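With the partition column in the WHERE clause, Hive reads only that partition's files (partition pruning). A sketch using the PRICES layout assumed earlier:

```sql
SELECT MIN(close_price), MAX(close_price) FROM prices WHERE tick = 'MSFT';  -- large partition
SELECT MIN(close_price), MAX(close_price) FROM prices WHERE tick = 'FB';    -- small partition
```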
JOIN
Join Optimization: Map Join
1. A map join is equivalent to an Oracle PX hash join where one table is very small and broadcast is used for the PX distribution.
2. For Hive, the hash table is built by the Hive Driver, and the Hadoop distributed cache is used to send the data to the nodes that will run the map join.
3. The join is done at the mapper stage (hence the name); no reducer is used. Because there is one fewer layer of resource allocation and data transfer, the final time is less than half of that without this optimization.
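A hedged sketch of requesting a map join; hive.auto.convert.join and the MAPJOIN hint are standard Hive knobs, and the table names reuse the earlier examples:

```sql
-- Let Hive convert eligible joins (one small side) into map joins automatically:
SET hive.auto.convert.join=true;

-- Or request it explicitly; tick_simple is the small side broadcast to the mappers:
SELECT /*+ MAPJOIN(t) */ p.tick, p.close_price, t.sector
FROM prices p
JOIN tick_simple t ON (p.tick = t.symbol);
```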
Join Optimization: Skewed Data
A usual Hive join is like a hash-partitioned merge-sort join: data from the two sources are hash-partitioned and distributed to reducers according to the hash values. For very large data sets, if the data are highly skewed on a few join key values, for example 1 or 2, the join will be concentrated on 1 or 2 reducers and the execution time can be very long, or the job can even run out of memory.
A skewed-data join is not a Hive-specific issue; an Oracle hash join with PX suffers the same. Even without PX, skewed data also causes high CPU usage because of hash value collisions.
For Hive, if one data source is small, a map join can be used to reduce the impact of data skew.
Join Optimization: Skewed Data
Hive also introduced another join strategy: the skew join. It splits the job into two jobs: one for non-skewed data and one for skewed data. For the non-skewed data, the join job runs as normal. For the skewed data, the join key set is small and most likely known at runtime, so Hive will use other join optimizations such as MAPJOIN.
Session-level parameters are used to tell Hive to consider a skew join:
set hive.optimize.skewjoin=true;
set hive.skewjoin.key={a threshold for the row count on a skewed key, default 100,000};
Note that the implementation is not very stable. The join is even sensitive to the order in which the tables appear in the query, and query execution can be terminated because of the table order.
Because it adds more Hive stages (Hive local jobs plus Hadoop jobs), this strategy should be considered only if a query runs intolerably long.
Aggregation Optimization: Map-Side Aggregation
A typical MapReduce or Pig job uses a Combiner for partial aggregation on the map side, to reduce the data size passed to the reducers. Note that Oracle parallel operations have a similar feature for GROUP BY (you can frequently see two GROUP BY steps in a PX plan).
Hive does not support the Combiner. Instead, it introduced a concept called map-side aggregation, which is enabled by default: set hive.map.aggr=true; To verify it, use: set hive.map.aggr;
My sample query for the test, with the impact verified from the job status tracking: select sector, count(*) from tick_simple group by sector;
With hive.map.aggr=true, the map input record count is 6430 and the map output count is 55, so only 55 records are sent to the reducer side.
With hive.map.aggr=false, the map input record count is 6430 and the map output count is 6430: about 116 times more data is sent to the reducer side.
It does not make much difference for this small test, but it makes a huge difference for production data with billions of input records.
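The comparison above, written out as a repeatable experiment (query and record counts are from the slides):

```sql
SET hive.map.aggr=true;   -- partial aggregation in the mappers (the default)
SELECT sector, COUNT(*) FROM tick_simple GROUP BY sector;
-- map output records ~ number of distinct sectors (55 in the slides)

SET hive.map.aggr=false;  -- every input row is shuffled to the reducers
SELECT sector, COUNT(*) FROM tick_simple GROUP BY sector;
-- map output records = map input records (6430 in the slides)
```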
Function Support
Hive supports the usual functions as built-in functions:
Standard functions: abs, round, floor, ucase, concat, etc.
Aggregate functions: avg, sum, etc.
Table-generating functions (Oracle table functions): explode.
User-defined functions can be written in Java, packaged in a JAR, and loaded on a per-session basis.
However, making a user-defined function permanent requires modifying the Hive source code and rebuilding the Hive package.
User-defined functions do give us the opportunity to create common functions shared by multiple products, or to take advantage of Hive's general QL capability while wrapping project-specific requirements into user-defined functions.
For example, for RMX, the source data are events spilled out from ad servers, collected by the data highway, and dumped on the Hadoop grid; the RMX data pipelines use MapReduce or Pig jobs to reprocess the data, which is then consumed by other applications. A common requirement is to split/extract the data related to advertiser, publisher, network, etc. in tabular format from each event. This can easily be written as a Hive table-generating function, and the remaining work is just standard SQL operations.
Function At Work
Hive does not provide a built-in method to handle CSV with quotes, so my test table has a data quality issue after the company name column, because some names have commas inside. As a result, my first test query failed to return correct data for the sector names.
For the following test, I create a table with a single column and load the one tick symbol file, nasdaq.csv. Then I use the split function with a CSV regex to extract the correct sector name.
However, I have not figured out how to use the SPLIT function to extract all columns to build another table with CTAS. But it is not hard to use the Java String split function to build a Hive table-generating function for that purpose.
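A hedged sketch of the single-column workaround; the regex splits only on commas outside double quotes, and the array index is illustrative (the slides do not show which position the sector column occupies):

```sql
CREATE TABLE nasdaq_raw (line STRING);
LOAD DATA LOCAL INPATH '/tmp/nasdaq.csv' INTO TABLE nasdaq_raw;

-- split() takes a Java regex; the lookahead matches commas not inside quotes.
SELECT split(line, ',(?=([^"]*"[^"]*")*[^"]*$)')[5] AS sector
FROM nasdaq_raw;
```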
Other Interesting Things
Index: Hive has two types of indexes. A normal index stores (key, HDFS file name, key offset) in another Hive table. Hive also supports bitmap indexes. However, since Hadoop typically uses a 64 MB or 128 MB block size, indexes are not really very useful in Hive. Furthermore, a Hive index only works with a partitioned table if Hive can prune the partitions based on the index data.
RDBMS connections: Hive currently does not have a built-in storage handler to access data from other RDBMS databases such as Oracle and MySQL (there is a third-party JDBCStorageHandler, though). It would be interesting if we could use Hive rather than Sqoop to dump or access data from Oracle or MySQL directly.
NoSQL: Hive does provide built-in storage handlers to access data in some NoSQL databases such as HBase, Cassandra, and DynamoDB.
If the log level is set appropriately, the Hive log file hive.log has information similar to an Oracle 10053 trace file for the query parsing process. This can help to figure out why a specific feature is or is not used. Hive also provides EXPLAIN for the explain plan. The execution plan can also be found in the session-level log file (though it is not formatted in a user-friendly way).
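The EXPLAIN statement mentioned above prints the compiled plan (the stage graph of MapReduce jobs and their operator trees) without running the query, for example:

```sql
EXPLAIN
SELECT sector, COUNT(*) FROM tick_simple GROUP BY sector;
```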
My Wish List
Better YARN (Hadoop 2.0) integration, so that Hive can take advantage of the new job and resource management and allocation architecture.
Better documentation and manuals: the user documents are sparse and incomplete.
Documentation about the source code: for open source products without professional technical support, a clear way to help users navigate the source code modules and understand the design ideas is very important.
References
Book: Programming Hive, by Edward Capriolo, Dean Wampler, and Jason Rutherglen.
Home page: http://hive.apache.org; check its wiki for links to presentations and manuals.
Source code: http://svn.apache.org/viewvc/hive: this is the only place to find out why a feature you like does not work, or why your query encounters an error.
Hive internals: http://www.slideshare.net/mobile/recruitcojp/internal-hive
Side note: comparing Hive against your understanding of Oracle, especially query optimization, will help you understand Hive and make it perform better. However, Hive's developers may take more hints from MySQL, which is also an open source project.