Hortonworks Technical Workshop: Interactive Query with Apache Hive


Transcript of Hortonworks Technical Workshop: Interactive Query with Apache Hive

Page 1: Hortonworks Technical Workshop: Interactive Query with Apache Hive

Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved

Interactive Query With Apache Hive

Dec 4, 2014

Ajay Singh

Page 2: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Agenda

•  HDP 2.2

•  Apache Hive & Stinger Initiative

•  Stinger.Next

•  Putting It Together

•  Q&A

Page 3: Hortonworks Technical Workshop: Interactive Query with Apache Hive


HDP 2.2 Generally Available

Hortonworks Data Platform 2.2

[Stack diagram] YARN: Data Operating System (Cluster Resource Management) runs on top of HDFS (Hadoop Distributed File System).

Batch, interactive & real-time data access engines on YARN: Script (Pig) on Tez; SQL (Hive) on Tez; Java/Scala (Cascading) on Tez; Others (ISV Engines); Stream (Storm); Search (Solr); NoSQL (HBase, Accumulo) on Slider; In-Memory (Spark).

Security: Authentication, Authorization, Accounting, Data Protection – Storage: HDFS; Resources: YARN; Access: Hive, …; Pipeline: Falcon; Cluster: Knox, Ranger.

Governance: Data Workflow, Lifecycle & Governance – Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS.

Operations: Provision, Manage & Monitor – Ambari, Zookeeper; Scheduling – Oozie.

Deployment choice: Linux, Windows, on-premises, cloud.

YARN is the architectural center of HDP

Enables batch, interactive and real-time workloads

Provides comprehensive enterprise capabilities

The widest range of deployment options

Delivered Completely in the OPEN

Page 4: Hortonworks Technical Workshop: Interactive Query with Apache Hive


HDP IS Apache Hadoop

There is ONE Enterprise Hadoop: everything else is a vendor derivation

Hortonworks Data Platform 2.2

[Component version table] HDP 2.2 ships the Apache components Hadoop & YARN, Pig, Hive & HCatalog, HBase, Sqoop, Oozie, Zookeeper, Ambari, Storm, Flume, Knox, Phoenix, Accumulo, Falcon, Ranger, Spark, Kafka, Tez, Slider and Solr, grouped into Data Management, Data Access, Governance & Integration, Security and Operations. The slide shows the version of each component across HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014).

* version numbers are targets and subject to change at time of general availability in accordance with ASF release process

Page 5: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Complete List of New Features in HDP 2.2

Apache Hadoop YARN
•  Slide existing services onto YARN through 'Slider'
•  GA release of HBase, Accumulo, and Storm on YARN
•  Support long-running services: handling of logs, containers not killed when AM dies, secure token renewal, YARN Labels for tagging nodes for specific workloads
•  Support for CPU Scheduling and CPU Resource Isolation through CGroups

Apache Hadoop HDFS
•  Heterogeneous storage: support for archival
•  Rolling Upgrade (this applies to the entire HDP stack: YARN, Hive, HBase, everything. We now support comprehensive Rolling Upgrade across the HDP stack)
•  Multi-NIC support
•  Heterogeneous storage: support memory as a storage tier (TP)
•  HDFS Transparent Data Encryption (TP)

Apache Hive, Apache Pig, and Apache Tez
•  Hive Cost Based Optimizer: function pushdown & join re-ordering support for other join types: star & bushy
•  Hive SQL enhancements including: ACID support (INSERT, UPDATE, DELETE), temporary tables, metadata-only queries that return instantly
•  Pig on Tez, including DataFu for use with Pig
•  Vectorized shuffle
•  Tez debug tooling & UI

Hue
•  Support for HiveServer2
•  Support for ResourceManager HA

Apache Spark
•  Refreshed Tech Preview to Spark 1.1.0 (available now)
•  ORC file support & Hive 0.13 integration
•  Planned for GA of Spark 1.2.0
•  Operations integration via YARN ATS and Ambari
•  Security: authentication

Apache Solr
•  Added Banana, a rich and flexible UI for visualizing time series data indexed in Solr

Cascading
•  Cascading 3.0 on Tez distributed with HDP – coming soon

Apache Falcon
•  Authentication integration
•  Lineage – now GA (it's been a tech preview feature)
•  Improved UI for pipeline management & editing: list, detail, and create new (from existing elements)
•  Replicate to cloud – Azure & S3

Apache Sqoop, Apache Flume & Apache Oozie
•  Sqoop import support for Hive types via HCatalog
•  Secure Windows cluster support: Sqoop, Flume, Oozie
•  Flume streaming support: sink to HCat on a secure cluster
•  Oozie HA now supports secure clusters
•  Oozie Rolling Upgrade
•  Operational improvements for Oozie to better support Falcon: capture workflow job logs in HDFS, don't start new workflows for re-run, allow job property updates on running jobs

Apache HBase, Apache Phoenix, & Apache Accumulo
•  HBase & Accumulo on YARN via Slider
•  HBase HA: replicas update in real time, fully supports region split/merge, Scan API now supports standby RegionServers
•  HBase block cache compression
•  HBase optimizations for low latency
•  Phoenix robust secondary indexes
•  Performance enhancements for bulk import into Phoenix
•  Hive over HBase snapshots
•  Hive connector to Accumulo
•  HBase & Accumulo wire-level encryption
•  Accumulo multi-datacenter replication

Apache Storm
•  Storm-on-YARN via Slider
•  Ingest & notification for JMS (IBM MQ not supported)
•  Kafka bolt for Storm – supports sophisticated chaining of topologies through Kafka
•  Kerberos support
•  Hive update support – streaming ingest
•  Connector improvements for HBase and HDFS
•  Deliver Kafka as a companion component: install, start/stop via Ambari
•  Security authorization integration with Ranger

Apache Slider
•  Allow on-demand create and run of different versions of heterogeneous applications
•  Allow users to configure different application instances differently
•  Manage the operational lifecycle of application instances
•  Expand / shrink application instances
•  Provide an application registry for publish and discovery

Apache Knox & Apache Ranger (Argus) & HDP Security
•  Apache Ranger – support authorization and auditing for Storm and Knox
•  Introducing REST APIs for managing policies in Apache Ranger
•  Apache Ranger – support native grant/revoke permissions in Hive and HBase
•  Apache Ranger – support Oracle DB and storing of audit logs in HDFS
•  Apache Ranger runs on Windows environments
•  Apache Knox to protect the YARN RM
•  Apache Knox support for HDFS HA
•  Apache Ambari install, start/stop of Knox

Apache Ambari
•  Support for the HDP 2.2 stack, including support for Kafka, Knox and Slider
•  Enhancements to Ambari Web configuration management including: versioning, history and revert, setting final properties and downloading client configurations
•  Launch and monitor HDFS rebalance
•  Perform Capacity Scheduler queue refresh
•  Configure High Availability for ResourceManager
•  Ambari Administration framework for managing user and group access to Ambari
•  Ambari Views development framework for customizing the Ambari Web user experience
•  Ambari Stacks for extending Ambari to bring custom services under Ambari management
•  Ambari Blueprints for automating cluster deployments
•  Performance improvements and enterprise usability guardrails

Page 6: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Just How Many New Features are in HDP 2.2?

88 new features. An astonishing amount of innovation in the OPEN Apache Community.

HDP is Apache Hadoop

Page 7: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Apache Hive & Stinger Initiative

Page 8: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Hive – Single tool for all SQL use cases

[Diagram] Data sources (OLTP, ERP, CRM systems; unstructured documents, emails; clickstream; server logs; sentiment, web data; sensor, machine data; geolocation) feed Hive - SQL, which serves Interactive Analytics, Batch Reports / Deep Analytics, and ETL / ELT.

Page 9: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Hive Scales To Any Workload

•  The original developers of Hive.
•  More data than existing RDBMS could handle.
•  100+ PB of data under management.
•  15+ TB of data loaded daily.
•  60,000+ Hive queries per day.
•  More than 1,000 users per day.

Page 10: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Hive Join Strategies

Type: Shuffle Join
Approach: Join keys are shuffled using map/reduce and joins are performed reduce side.
Pros: Works regardless of data size or layout.
Cons: Most resource-intensive and slowest join type.

Type: Broadcast Join
Approach: Small tables are loaded into memory on all nodes; the mapper scans through the large table and joins.
Pros: Very fast, single scan through the largest table.
Cons: All but one table must be small enough to fit in RAM.

Type: Sort-Merge-Bucket Join
Approach: Mappers take advantage of co-location of keys to do efficient joins.
Pros: Very fast for tables of any size.
Cons: Data must be bucketed ahead of time.
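The strategies above are chosen by table layout and session settings; a minimal sketch (table and column names are hypothetical; settings as of Hive 0.13/0.14):

```sql
-- Both tables bucketed and sorted on the join key, with the same
-- bucket count, as the sort-merge-bucket join requires.
CREATE TABLE orders (id INT, amount DOUBLE)
  CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS STORED AS ORC;
CREATE TABLE customers (id INT, state STRING)
  CLUSTERED BY (id) SORTED BY (id) INTO 8 BUCKETS STORED AS ORC;

-- Let Hive pick broadcast or SMB joins automatically.
SET hive.auto.convert.join=true;               -- broadcast small tables
SET hive.auto.convert.sortmerge.join=true;     -- SMB join on bucketed, sorted tables
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

SELECT c.state, SUM(o.amount)
FROM orders o JOIN customers c ON (o.id = c.id)
GROUP BY c.state;
```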

Page 11: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Stinger Initiative

• Stinger Initiative – DELIVERED: next-generation SQL-based interactive query in Hadoop

Speed: Hive query performance improved by 100X to allow for interactive query times (seconds)

Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB

SQL: Support the broadest range of SQL semantics for analytic applications running against Hadoop


An Open Community at its finest: Apache Hive Contribution

1,672 Jira Tickets Closed

145 Developers

44 Companies

360,000 Lines Of Code Added… (2.5x)

[Diagram: Business Analytics and Custom Apps run through Apache Hive SQL on Apache Tez and Apache MapReduce, over Apache YARN and HDFS (Hadoop Distributed File System).]

13 months: Hive 10 queries took 100's to 1000's of seconds; Hive 13 takes seconds. Dramatically faster queries speed time to insight.

Page 12: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Stinger Initiative - Key Innovations

File Format (ORCFile) + Execution Engine (Tez) + Query Planner (CBO) = 100X

Page 13: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Tez (“Speed”)

• What is it? – A data processing framework as an alternative to MapReduce

• Who else is involved? – Hortonworks, Facebook, Twitter, Yahoo, Microsoft

• Why does it matter? – Widens the platform for Hadoop use cases – Crucial to improving the performance of low-latency applications – Core to the Stinger initiative – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop

Page 14: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Comparing: Hive/MR vs. Hive/Tez

SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state

[Diagram: with MapReduce the query runs as a chain of map/reduce jobs (JOIN(a, b), JOIN(a, c), GROUP BY a.state with COUNT(*) and AVG(c.price)), writing intermediate results to HDFS between each job; with Tez the same query runs as a single DAG, passing data directly between stages.]

Tez avoids unneeded writes to HDFS

Page 15: Hortonworks Technical Workshop: Interactive Query with Apache Hive


ORCFile – Columnar Storage for Hive

• Columns stored separately

• Knows types – Uses type-specific encoders – Stores statistics (min, max, sum, count)

• Has light-weight index – Skip over blocks of rows that don’t matter
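As a sketch, a table is backed by ORC simply by declaring it at creation time (the table, its columns, and the compression choice are hypothetical):

```sql
CREATE TABLE clicks (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES ("orc.compress"="ZLIB");  -- ZLIB is the default codec; SNAPPY trades size for speed
```

Column statistics and the lightweight index are written automatically as data is inserted, so queries with selective predicates can skip row groups without extra configuration.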


Page 16: Hortonworks Technical Workshop: Interactive Query with Apache Hive


ORCFile – Columnar Storage for Hive

Large block size ideal for map/reduce.

Columnar format enables high compression and high performance.

Page 17: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Query Planner – Cost Based Optimizer in Hive

The Cost-Based Optimizer (CBO) uses statistics within Hive tables to produce optimal query plans

Why cost-based optimization?
•  Ease of use – join reordering
•  Reduces the need for specialists to tune queries.
•  More efficient query plans lead to better cluster utilization.

Page 18: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Statistics: Foundations for CBO

Kind of statistics Table Statistics – Collected on load per partition •  Uncompressed size

•  Number of rows

•  Number of files

Column Statistics – Required by CBO •  NDV (Number of Distinct Values)

•  Nulls, Min, Max

Usability - How does the data get Statistics Analyze Table Command •  Analyze entire table

•  Run this command per partition

•  Run for some partitions and the compiler will extrapolate statistics

Collecting statistics on load •  Table stats can be collected if you insert via hive using set

hive.stats.autogather=true

•  Not with load data file
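The on-load path can be sketched as follows (the staging table name is hypothetical; the test table matches the one created later in this deck):

```sql
-- Gather table stats automatically as part of the INSERT.
SET hive.stats.autogather=true;

INSERT INTO TABLE test PARTITION (year='2014', month='12', day='04')
SELECT id, val FROM staging_test;

-- A plain LOAD DATA just moves files, so it cannot gather stats;
-- those partitions still need an explicit ANALYZE TABLE afterwards.
```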

Page 19: Hortonworks Technical Workshop: Interactive Query with Apache Hive


A Journey to SQL Compliance

Evolution of SQL Compliance in Hive

SQL Datatypes: INT/TINYINT/SMALLINT/BIGINT; FLOAT/DOUBLE; BOOLEAN; ARRAY, MAP, STRUCT, UNION; STRING; BINARY; TIMESTAMP; DECIMAL; DATE; VARCHAR; CHAR

SQL Semantics: SELECT, INSERT; GROUP BY, ORDER BY, HAVING; JOIN on explicit join key; inner, outer, cross and semi joins; sub-queries in the FROM clause; ROLLUP and CUBE; UNION; standard aggregations (sum, avg, etc.); custom Java UDFs; windowing functions (OVER, RANK, etc.); advanced UDFs (ngram, XPath, URL); JOINs in the WHERE clause; sub-queries for IN/NOT IN, HAVING

Legend (the slide shades each item by the release that introduced it): Hive 10 or earlier, Hive 11, Hive 12, Hive 13
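As a sketch of the newer semantics, a windowing query (table and column names are hypothetical):

```sql
-- Rank each order within its state by price using the OVER/RANK
-- windowing functions from the Hive 11/12 era.
SELECT state, id, price,
       RANK() OVER (PARTITION BY state ORDER BY price DESC) AS price_rank,
       AVG(price) OVER (PARTITION BY state)                 AS state_avg_price
FROM orders;
```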


Page 20: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the end of the beginning. – Winston Churchill

Hive 0.13

Page 21: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Stinger.Next

Page 22: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Stinger.Next: Delivery Themes

Hive 0.14
•  Transactions with ACID allowing insert, update and delete
•  Streaming ingest
•  Cost Based Optimizer optimizes star and bushy join queries

Sub-Second (1st half 2015)
•  Sub-second queries with LLAP
•  Hive-Spark machine learning integration
•  Operational reporting with Hive Streaming Ingest and Transactions

Richer Analytics (2nd half 2015)
•  Toward SQL:2011 Analytics
•  Materialized views
•  Cross-geo queries
•  Workload management via YARN and LLAP integration

Page 23: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Transaction Use Cases

Reporting with Analytics (YES)
•  Reporting on data with occasional updates
•  Corrections to the fact tables, evolving dimension tables
•  Low-concurrency updates, low TPS

Operational Reporting (YES)
•  High-throughput ingest from an operational (OLTP) database, replicated into Hive
•  Periodic inserts every 5-30 minutes
•  Requires tool support and changes in our transactions

Operational (OLTP) Database (NO)
•  Small transactions, each doing single-line inserts
•  High concurrency: hundreds to thousands of connections

Page 24: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Deep Dive: Transactions

Transaction support in Hive with ACID semantics
•  Hive native support for INSERT, UPDATE, DELETE.
•  Split into phases:
   •  Phase 1: Hive Streaming Ingest (append) [Done]
   •  Phase 2: INSERT / UPDATE / DELETE support [Done]
   •  Phase 3: BEGIN / COMMIT / ROLLBACK transactions [Next]

[Diagram: how ACID edits flow through ORC]
1. Original file: tasks read the latest read-optimized ORCFile.
2. Edits made: tasks read the read-optimized ORCFile and merge in the delta file containing the edits.
3. Edits merged: tasks read the updated, merged read-optimized ORCFile.

The Hive ACID compactor periodically merges the delta files in the background.

Page 25: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Transactions - Requirements

The table must be declared with the transactional table property.

The table must be in ORC format.

Tables must be bucketed.
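Once such a table exists (for example the test table created in Step 5 of this deck) and the transaction settings from Step 1 are in place, the new DML works directly; a sketch:

```sql
-- Hive 0.14 DML against a bucketed, ORC, transactional table.
INSERT INTO TABLE test PARTITION (year='2014', month='12', day='04')
VALUES (1, 'first'), (2, 'second');

UPDATE test SET val = 'corrected' WHERE id = 1;  -- written as a delta file
DELETE FROM test WHERE id = 2;                   -- also a delta, merged later by the compactor
```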


Page 26: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Putting It Together

Page 27: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Step 1 - Turn On Transactions

Hive Configuration

§  hive.support.concurrency=true

§  hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager

§  hive.compactor.initiator.on=true

§  hive.compactor.worker.threads=2

§  hive.enforce.bucketing=true

§  hive.exec.dynamic.partition.mode=nonstrict


Page 28: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Step 2 – Enable Concurrency By Defining Queues

YARN Configuration

§  yarn.scheduler.capacity.root.default.capacity=50

§  yarn.scheduler.capacity.root.hiveserver.capacity=50

§  yarn.scheduler.capacity.root.hiveserver.hive1.capacity=50

§  yarn.scheduler.capacity.root.hiveserver.hive1.user-limit-factor=4

§  yarn.scheduler.capacity.root.hiveserver.hive2.capacity=50

§  yarn.scheduler.capacity.root.hiveserver.hive2.user-limit-factor=4

§  yarn.scheduler.capacity.root.hiveserver.queues=hive1,hive2

§  yarn.scheduler.capacity.root.queues=default,hiveserver

[Diagram: cluster capacity split between the default queue and the hive1/hive2 sub-queues under the hiveserver queue]

Page 29: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Step 3 – Deliver Capacity Guarantees by Enabling YARN Preemption

YARN Configuration

§  yarn.resourcemanager.scheduler.monitor.enable=true

§  yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy

§  yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=1000

§  yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=5000

§  yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.4

Page 30: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Step 4 – Enable Tez Execution Engine & Tez Sessions

Enable sessions for Hive queues

Hive Configuration

§  hive.execution.engine=tez

§  hive.server2.tez.initialize.default.sessions=true

§  hive.server2.tez.default.queues=hive1,hive2

§  hive.server2.tez.sessions.per.default.queue=1

§  hive.server2.enable.doAs=false

§  hive.vectorized.groupby.maxentries=10240

§  hive.vectorized.groupby.flush.percent=0.1

Page 31: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Step 5 - Create Partitioned & Bucketed ORC Tables

CREATE TABLE IF NOT EXISTS test (id INT, val STRING)

PARTITIONED BY (year STRING, month STRING, day STRING)

CLUSTERED BY (id) INTO 7 BUCKETS

STORED AS ORC TBLPROPERTIES ("transactional"="true");

Note:
§  Transactions require bucketed tables in ORC format. Tables cannot be sorted.
§  transactional=true must be set in the table properties.
§  For performance, partitioning the table is recommended but not mandatory:
   §  Partition on filter columns with low cardinality
   §  For optimal performance stay below 1,000 partitions
§  Cluster on join columns
§  The number of buckets is contingent on dataset size

Page 32: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Step 6 - Loading Data into ORC Tables

§  Sqoop, Flume & Storm support direct ingestion into ORC tables
§  Have a text file?
   §  Load it into a table stored as textfile
   §  Transfer it into the ORC table using a Hive INSERT statement
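The text-file path can be sketched as follows (the staging table name and file path are hypothetical; the target is the test table from Step 5):

```sql
-- Land the raw text file in a staging table first.
CREATE TABLE test_staging (id INT, val STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
  STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/test.csv' INTO TABLE test_staging;

-- Rewrite the rows into the partitioned ORC table.
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE test PARTITION (year, month, day)
SELECT id, val, '2014', '12', '04' FROM test_staging;
```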

Page 33: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Step 7 - Compute Statistics

§  Compute table stats:

analyze table test partition(year,month,day) compute statistics;

§  Compute column stats:

analyze table test partition(year,month,day) compute statistics for columns;

§  Keep stats updated
§  Speed computation by limiting it to partitions that have changed

Note:
§  In Hive 0.14, column stats can be calculated for all partitions in a single statement
§  To limit computation to a specific partition, specify the partition keys

Page 34: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Sample Code – Sqoop Import To ORC Table

sqoop import --verbose --connect 'jdbc:mysql://localhost/people' --table persons --username root --hcatalog-table persons --hcatalog-storage-stanza "stored as orc" -m 1

Use HCatalog to import to an ORC table

Page 35: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Sample Code – Flume Configuration For Hive Streaming Ingest

## Agent

agent.sources = csvfile

agent.sources.csvfile.type = exec

agent.sources.csvfile.command = tail -F /root/test.txt

agent.sources.csvfile.batchSize = 1

agent.sources.csvfile.channels = memoryChannel

agent.sources.csvfile.interceptors = intercepttime

agent.sources.csvfile.interceptors.intercepttime.type = timestamp

## Channels

agent.channels = memoryChannel

agent.channels.memoryChannel.type = memory

agent.channels.memoryChannel.capacity = 10000

## Hive Streaming Sink

agent.sinks = hiveout

agent.sinks.hiveout.type = hive

agent.sinks.hiveout.hive.metastore=thrift://localhost:9083

agent.sinks.hiveout.hive.database=default

agent.sinks.hiveout.hive.table=test

agent.sinks.hiveout.hive.partition=%Y,%m,%d

agent.sinks.hiveout.serializer = DELIMITED

agent.sinks.hiveout.serializer.fieldnames =id,val

agent.sinks.hiveout.channel = memoryChannel

Page 36: Hortonworks Technical Workshop: Interactive Query with Apache Hive


Q&A