Hive Quick Start

    First Taste of Apache Hive

    About Hadoop

Hadoop consists of HDFS and MapReduce. When we talk about Hadoop, we usually refer to both components, but they can be deployed and can work independently.

Name Node: HDFS data dictionary. Also used to record various counters used by MapReduce jobs.

Job Tracker: used for MapReduce job management, covering resource management, job dispatch, and job status. YARN (Hadoop 2.0) has re-architected the functions provided by the Job Tracker and split resource management and job management into different modules.

[Diagram: MapReduce (distributed computing framework), managed by the Job Tracker, runs on top of HDFS (distributed storage and file system), managed by the Name Node.]

    About Hive

Hive is a SQL engine on top of Hadoop that behaves much like an RDBMS.

Hive can have native tables and external tables. The data for native tables lives inside Hadoop HDFS.

The source of an external table can be data stored in HDFS, another Hadoop-based NoSQL database such as HBASE, any other RDBMS via the JDBCStorageHandler, or other NoSQL databases through a supported storage handler.

Even with external tables from external data sources, Hive still needs HDFS for temp space and intermediate data storage.

Hive stores metadata (database/schema definitions, table definitions, index definitions, partition definitions, privileges) in a metadata store, an RDBMS such as the Java-based Derby, MySQL, Oracle, etc.

The Hive query language (QL) syntax is very similar to MySQL's SQL, but it provides only a subset of the functionality.

[Diagram: App/User Interface → Hive Driver (QL) → MapReduce/HDFS and External Data Sources, with the MetaDataStore (data dictionary) attached to the driver.]

    Hive Process Flow

[Diagram: App/User Interface → Driver (Compile, Execute) → Hadoop Job Tracker. Compile pipeline: Parse Tree → Logic Plan (SQL ops) → Logic Optimization → Physical Plan (MR tasks) → Physical Optimization.]

1. The user sends a query to Hive using its CLI, the JDBC/ODBC interface, or the Thrift server.

2. The query is passed to the Hive QL Driver.

3. Hive QL compiles (parses) the query, applies a small set of rule-based optimizations, and generates one or more MapReduce jobs (the EXPLAIN sketch after this list shows such a compiled plan).

4. The Hive QL Driver submits the generated MapReduce jobs to the Hadoop Job Tracker for execution.

5. Logic optimization is SQL optimization.

6. Physical optimization scans the generated MapReduce tasks to merge or eliminate redundant tasks.
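
To inspect the compiled plan this flow produces, Hive's EXPLAIN statement prints the stage graph. A minimal sketch; the table tick_simple and column sector are sample names used later in this deck:

    -- prints the parse/logical/physical stages Hive generated for the query
    EXPLAIN
    SELECT sector, count(*)
    FROM tick_simple
    GROUP BY sector;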

    Compare to Oracle

Think of a Hive query as an Oracle query with large parallel operations across a very large Oracle RAC, without cache fusion.

A normal Oracle parallel operation consists of two sets of PX processes, producers and consumers, plus a PX coordinator. For Hive, the producer is the Hadoop Mapper, the consumer is the Hadoop Reducer, and the PX coordinator is the Hadoop Job Tracker.

The differences:

A typical Oracle PX execution plan can have multiple operations, and the allocated PX processes are reused to execute those operations.

A typical Hive query can have one or more relatively independent Hadoop jobs (while the jobs are interdependent from Hive's point of view, they are independent from Hadoop's point of view). If more than one job is required for a Hive query, the Mapper and Reducer slots, Hive's counterparts of PX processes, have to be reallocated and re-launched. This is a significant source of latency for complicated Hive queries.

On the other hand, a Hive query requiring multiple Hadoop MapReduce jobs can be compared to an Oracle query plan with multiple PX coordinators to deal with unmergeable subqueries and subqueries inside the SELECT list that require PX.

    Quick Start

Cloudera provides preinstalled Hadoop VM images we can play with: https://ccp.cloudera.com/display/SUPPORT/Demo+VMs

Note that earlier-version VMs (CDH3 and earlier) could have broken Hive functionality, so try to use CDH4. CDH4 uses Hadoop 2.0 (YARN), which splits the Job Tracker into multiple modules.

A major issue for a Hive query with multiple MapReduce jobs is job dispatch from the Job Tracker and the allocation of Mapper and Reducer slots. One major purpose of Hadoop 2.0 (YARN) is to optimize resource allocation, but it looks like Hive has yet to take advantage of this feature.

    Play With Hive

The screenshots show stock lists and quotes downloaded from NASDAQ and Y! Finance, explored from the Hive command line. Callouts:

The log configuration file lets you adjust the log level.

The session-level log records query-related information such as the query text and plan.

It is easy to create a new database (schema).

SHOW DATABASES displays the database names (MySQL syntax).

USE stocks is equivalent to Oracle's alter session set current_schema=stocks (MySQL-style syntax).
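
A hedged sketch of that session (the database name stocks comes from the slide; the statements are standard HiveQL):

    -- create a new database (schema)
    CREATE DATABASE IF NOT EXISTS stocks;
    -- display database names (MySQL syntax)
    SHOW DATABASES;
    -- like Oracle's alter session set current_schema=stocks
    USE stocks;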

    Create Table and Load Data

1. create table creates a table whose data will be managed by Hive.

2. create external table creates a table linked to another data source, for example data stored on HDFS that was loaded or generated by methods unrelated to Hive.

3. If data is going to be loaded from another source, it is better to specify its row format and column format.

4. Here I loaded data from three local files using the into table syntax. The source data can also be HDFS data; Hive simply copies the data files. The overwrite into table syntax deletes whatever is already stored and copies in the new file; if I had used this option, I would have ended up with only one exchange list. (See the sketch after this list.)
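
A minimal sketch of these statements, assuming illustrative column names and file paths (the deck later queries a stock-list table called tick_simple with a sector column):

    -- native table: data managed by Hive inside HDFS
    CREATE TABLE tick_simple (
      symbol STRING,
      name   STRING,
      sector STRING
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    STORED AS TEXTFILE;

    -- external table: linked to data produced outside Hive
    CREATE EXTERNAL TABLE tick_ext (symbol STRING, name STRING, sector STRING)
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION '/data/ticks';   -- hypothetical HDFS path

    -- INTO TABLE appends, so loading three exchange files keeps all three
    LOAD DATA LOCAL INPATH '/tmp/nasdaq.csv' INTO TABLE tick_simple;
    -- OVERWRITE INTO TABLE would first delete what is already stored:
    -- LOAD DATA LOCAL INPATH '/tmp/nyse.csv' OVERWRITE INTO TABLE tick_simple;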

    A Simple Query At Work

Two MR jobs are generated: most likely one for the GROUP BY and one for the ORDER BY.

Hadoop MR job tracking information is printed, including a detail URL; from there, we can check performance metrics and task logs.

The line following the tracking URL is the command to kill the job, equivalent to Oracle's alter system kill session.
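
A query of the shape described above, as a sketch (sample names as before); the GROUP BY and the global ORDER BY each map to their own MR job:

    SELECT sector, count(*) AS cnt
    FROM tick_simple
    GROUP BY sector    -- first MR job
    ORDER BY cnt DESC; -- second MR job: global sort through a single reducer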

    Job Status

Details can be accessed from the menu above. Counters hold the job's performance metrics. The Map Tasks and Reduce Tasks pages hold performance data and logs for individual map and reduce tasks.

The job log can be accessed here as well.

1. Job counters are performance and resource-usage metrics specific to the job.

2. For a long-running job, look for the very large numbers, especially IO and network communication.

The job log can give detail about how the job is managed, for example how and when resources are allocated. It is equivalent to an Oracle 10046 trace file for the query coordinator of a PX operation. For a Hive query requiring multiple MR jobs on a busy cluster, job dispatch latency is going to be a huge headache.

This page tracks status for all Map Tasks; there is also a page for all Reduce Tasks. From there, we can drill down further into an individual task attempt, which is especially useful when most of the tasks have completed while one or two are still hanging around.

Individual task performance data: there are performance counters specific to the task. The log link is the more interesting one for app-dev troubleshooting.

The task log is very useful for finding business-logic errors. The timestamps can help you identify performance-bottleneck operations. It is equivalent to an Oracle 10046 trace file for a PX slave process. For a small MR job, the overhead of getting the job scheduled and the tasks launched is significant.

    Partition Support

Here I created a table PRICES to store historical stock prices. The table is partitioned by stock tick. Then I loaded stock-quote CSV files for several ticks; each tick gets its own file in Hive's HDFS storage.
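
A sketch of such a table; the quote columns are assumptions based on typical Yahoo! Finance CSVs, and the PARTITIONED BY clause is the relevant part:

    CREATE TABLE prices (
      trade_date STRING,
      open       DOUBLE,
      high       DOUBLE,
      low        DOUBLE,
      close      DOUBLE,
      volume     BIGINT
    )
    PARTITIONED BY (tick STRING)   -- one HDFS directory per tick
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

    -- load one CSV per tick into its own partition
    LOAD DATA LOCAL INPATH '/tmp/MSFT.csv'
    INTO TABLE prices PARTITION (tick = 'MSFT');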

    Partition Pruning

The queries use the partition column explicitly. MSFT has a longer history, so its partition is larger; FB is new, so there is very little data inside its partition.
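
Sketches of such queries; because the WHERE clause names the partition column, Hive reads only the matching partition directory:

    SELECT count(*) FROM prices WHERE tick = 'MSFT';  -- larger partition
    SELECT count(*) FROM prices WHERE tick = 'FB';    -- tiny partition, fast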

    JOIN

Join Optimization: Map Join

1. A map join is equivalent to an Oracle PX hash join where one table is very small and broadcast is used for PX distribution.

2. In Hive, the hash table is built by the Hive Driver, which uses the Hadoop distributed cache to send the data to the nodes that will run the map join.

3. The join is done at the mapper stage (hence the name); no reducer is used. Because there is one fewer layer of resource allocation and data transfer, the total time is less than half of the time without this optimization. (See the sketch after this list.)
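
A sketch of triggering a map join on the sample tables above; the MAPJOIN hint is the classic form, and the hive.auto.convert.join parameter lets Hive choose it automatically when one side is small enough:

    -- build the hash table from the small table tick_simple and broadcast it
    SELECT /*+ MAPJOIN(t) */ t.sector, count(*)
    FROM prices p JOIN tick_simple t ON (p.tick = t.symbol)
    GROUP BY t.sector;

    -- or let Hive decide based on table sizes
    set hive.auto.convert.join=true;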

Join Optimization: Skewed Data

A usual Hive join is like a hash-partitioned merge-sort join: data from the two sources are hash partitioned and distributed to the reducers according to the hash values. For very large data sets, if the data are highly skewed on a few join key values, say 1 or 2 of them, the join will be concentrated on 1 or 2 reducers, and execution time could be very long, or the job could even run out of memory.

A skewed-data join is not a Hive-specific issue; Oracle hash join with PX suffers the same. Even without PX, skewed data also causes high CPU usage because of hash value collisions.

For Hive, if one data source is small, a map join can be used to reduce the impact of data skew.

Join Optimization: Skewed Data

Hive also introduces another join strategy: the skew join. It splits the job into two jobs: one for the non-skewed data and one for the skewed data. For the non-skewed data, the join runs as normal. For the skewed data, the join key set is small and most likely known at runtime, so Hive can use other join optimizations such as MAPJOIN.

Session-level parameters tell Hive to consider a skew join:

set hive.optimize.skewjoin=true;
set hive.skewjoin.key={a threshold for the row count on a skewed key; defaults to 100,000};

Note that the implementation is not very stable. The join is even sensitive to the order in which the tables appear in the query, and query execution can be terminated because of the table order.

Because it adds more Hive stages (Hive local jobs plus Hadoop jobs), this strategy should be considered only if a query runs intolerably long.

Aggregation Optimization: Map-Side Aggregation

A typical MapReduce or Pig job uses a Combiner for partial aggregation on the map side, to reduce the data size passed to the reducers. Oracle parallel operations have a similar feature for GROUP BY (you can frequently see two GROUP BY steps in a PX plan).

Hive does not support the Combiner. Instead, it introduces a concept called map-side aggregation, which is enabled by default:

set hive.map.aggr=true;

To verify the current setting, use:

set hive.map.aggr;

My sample query for the test (the impact is verified from the job status tracking):

select sector, count(*) from tick_simple group by sector;

With hive.map.aggr=true, the map input record count is 6430 and the map output count is 55, so only 55 records are sent to the reducer side.

With hive.map.aggr=false, the map input record count is 6430 and the map output count is 6430: about 116 times more data is sent to the reducer side.

It does not make much difference for this small test, but it makes a huge difference for production data with billions of input records.

    Function Support

Hive ships with the usual built-in functions:

Standard functions: abs, round, floor, ucase, concat, etc.

Aggregate functions: avg, sum, etc.

Table-generating functions (cf. Oracle table functions): explode. (See the sketch after this list.)
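
As a sketch of the table-generating case, explode is normally used through LATERAL VIEW to fan one row out into many (the events table and payload column are hypothetical):

    -- one output row per comma-separated field of each event
    SELECT e.event_id, field
    FROM events e
    LATERAL VIEW explode(split(e.payload, ',')) f AS field;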

User-defined functions can be written in Java, packaged in a JAR, and loaded on a per-session basis (a sketch follows below). However, making a user-defined function permanent requires modifying the Hive source code and rebuilding the Hive package.

User-defined functions do give us the opportunity to create common functions shared by multiple products, or to take advantage of Hive's general QL capability while wrapping project-specific requirements into user-defined functions.
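
Session-level loading looks roughly like this sketch (the JAR path, class name, and function name are hypothetical):

    -- make the jar visible to this session only
    ADD JAR /tmp/my-udfs.jar;
    -- register the Java class as a temporary (session-scoped) function
    CREATE TEMPORARY FUNCTION parse_event AS 'com.example.hive.ParseEventUDF';
    SELECT parse_event(line) FROM raw_events;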

For example, for RMX, the source data are events spilled out by ad servers, collected by the data highway, and dumped on the Hadoop grid; the RMX data pipelines then use MapReduce or Pig jobs to reprocess the data, which is consumed by other applications. A common requirement is to split/extract the data related to advertisers, publishers, networks, etc., in tabular format from each event. This can easily be written as a Hive table-generating function, leaving the remaining work as standard SQL operations.

Function At Work

Hive does not provide a built-in method to handle CSV with quotes, so my test table has data quality issues after the company-name column, because some names contain commas. As a result, my first test query failed to return correct sector names.

For the following test, I created a table with a single column and loaded the one tick-symbol file, nasdaq.csv. Then I used the split function with a CSV-aware regex to extract the correct sector name. (See the sketch below.)

However, I have not figured out how to use the SPLIT function to extract all columns to build another table with CTAS. But it is not hard to use the Java String split method to build a Hive table-generating function for that purpose.
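
A sketch of that extraction; the staging table is hypothetical, the regex matches commas only outside double quotes, and the [6] index assumes sector is the seventh field of the NASDAQ CSV:

    CREATE TABLE nasdaq_raw (line STRING);
    LOAD DATA LOCAL INPATH '/tmp/nasdaq.csv' INTO TABLE nasdaq_raw;

    -- split on commas that are not enclosed in quotes, then pick the sector field
    SELECT split(line, ',(?=([^"]*"[^"]*")*[^"]*$)')[6] AS sector
    FROM nasdaq_raw;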

    Other Interesting Things

Index: Hive has two types of indexes. A normal (compact) index stores (key, HDFS file name, key offset) tuples in another Hive table; Hive also supports bitmap indexes. However, since Hadoop typically uses a 64MB or 128MB block size, indexes are not really very useful in Hive. Furthermore, a Hive index only works with a partitioned table if Hive can prune the partitions based on the index data.
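
For reference, a compact index in this era of Hive was created roughly like this sketch (index and table names are illustrative):

    CREATE INDEX idx_sector ON TABLE tick_simple (sector)
    AS 'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'
    WITH DEFERRED REBUILD;
    -- populate the index table
    ALTER INDEX idx_sector ON tick_simple REBUILD;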

RDBMS connections: Hive currently does not have a built-in storage handler to access data in other RDBMS databases such as Oracle and MySQL (there is a third-party JDBCStorageHandler, though). It would be interesting if we could use Hive rather than Sqoop to dump or access data from Oracle or MySQL directly.

NoSQL: Hive does provide built-in storage handlers to access data in some NoSQL databases such as HBASE, Cassandra, and DynamoDB.
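
The HBASE case uses Hive's HBaseStorageHandler; a sketch, where the HBase table name and column mapping are hypothetical:

    CREATE EXTERNAL TABLE hbase_ticks (key STRING, sector STRING)
    STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:sector')
    TBLPROPERTIES ('hbase.table.name' = 'ticks');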

If the log level is set appropriately, the Hive log file hive.log contains information similar to an Oracle 10053 trace file for the query parsing process. This can help figure out why a specific feature is or is not used. Hive also provides EXPLAIN for explain plans, and the execution plan can be found in the session-level log file (though it is not formatted in a user-friendly way).

    My Wish List

Better YARN (Hadoop 2.0) integration, so that Hive can take advantage of the new job and resource management and allocation architecture.

Better documentation and manuals: the user documents are sparse and incomplete.

Documentation about the source code: for open source products without professional technical support, a clear way to help users navigate the source code modules and understand the design ideas is very important.

    References

Book: Programming Hive, by Jason Rutherglen, Dean Wampler, and Edward Capriolo.

Home page: http://hive.apache.org; check its wiki for links to presentations and manuals.

Source code: http://svn.apache.org/viewvc/hive; this is the only place to find out why the feature you like does not work, or why your query encounters an error.

Hive internals: http://www.slideshare.net/mobile/recruitcojp/internal-hive

Side note: comparing Hive against your understanding of Oracle, especially query optimization, will help you understand Hive and make it perform better. However, Hive developers might take more hints from MySQL, which is also an open source project.
