Hortonworks Technical Workshop: Interactive Query with Apache Hive
Transcript of Hortonworks Technical Workshop: Interactive Query with Apache Hive
Page 1 © Hortonworks Inc. 2011 – 2014. All Rights Reserved
Interactive Query With Apache Hive
Dec 4, 2014
Ajay Singh
Page 2
Agenda
• HDP 2.2
• Apache Hive & Stinger Initiative
• Stinger.Next
• Putting It Together
• Q&A
Page 3
HDP 2.2 Generally Available
Hortonworks Data Platform 2.2
YARN: Data Operating System (cluster resource management), running over HDFS (Hadoop Distributed File System)
Batch, interactive & real-time data access engines:
• Script: Pig (on Tez)
• SQL: Hive (on Tez)
• Java/Scala: Cascading (on Tez)
• Others: ISV engines
• Stream: Storm (on Slider)
• Search: Solr (on Slider)
• NoSQL: HBase, Accumulo (on Slider)
• In-Memory: Spark
Data workflow, lifecycle & governance: Falcon, Sqoop, Flume, Kafka, NFS, WebHDFS
Operations: Ambari (provision, manage & monitor), ZooKeeper, Oozie (scheduling)
Security: authentication, authorization, accounting and data protection across Storage (HDFS), Resources (YARN), Access (Hive, …), Pipeline (Falcon) and Cluster (Knox, Ranger)
Deployment choice: Linux, Windows, on-premises, cloud
YARN is the architectural center of HDP
Enables batch, interactive and real-time workloads
Provides comprehensive enterprise capabilities
The widest range of deployment options
Delivered Completely in the OPEN
Page 4
HDP IS Apache Hadoop
There is ONE Enterprise Hadoop: everything else is a vendor derivation
Hortonworks Data Platform 2.2
[Component/version matrix] Components span Data Management, Data Access, Governance & Integration, Security and Operations: Hadoop & YARN, Tez, Slider, Pig, Hive & HCatalog, HBase, Phoenix, Accumulo, Storm, Solr, Spark, Falcon, Sqoop, Flume, Kafka, Oozie, Knox, Ranger, ZooKeeper and Ambari, with versions advancing across HDP 2.0 (October 2013), HDP 2.1 (April 2014) and HDP 2.2 (October 2014), e.g. Hadoop 2.2.0 → 2.4.0 → 2.6.0 and Hive 0.12.0 → 0.13.0 → 0.14.0.
* version numbers are targets and subject to change at time of general availability in accordance with ASF release process
Data Access Governance & Integration Security Operations
Page 5
Complete List of New Features in HDP 2.2

Apache Hadoop YARN
• Slide existing services onto YARN through 'Slider'
• GA release of HBase, Accumulo, and Storm on YARN
• Support for long-running services: handling of logs, containers not killed when the AM dies, secure token renewal, YARN Labels for tagging nodes for specific workloads
• Support for CPU scheduling and CPU resource isolation through CGroups

Apache Hadoop HDFS
• Heterogeneous storage: support for archival
• Rolling Upgrade (applies to the entire HDP stack: YARN, Hive, HBase, everything; comprehensive Rolling Upgrade is now supported across the HDP stack)
• Multi-NIC support
• Heterogeneous storage: support for memory as a storage tier (TP)
• HDFS Transparent Data Encryption (TP)

Apache Hive, Apache Pig, and Apache Tez
• Hive Cost Based Optimizer: function pushdown & join re-ordering support for other join types (star & bushy)
• Hive SQL enhancements including ACID support (insert, update, delete), temporary tables, and metadata-only queries that return instantly
• Pig on Tez, including DataFu for use with Pig
• Vectorized shuffle
• Tez debug tooling & UI

Hue
• Support for HiveServer2
• Support for ResourceManager HA

Apache Spark
• Refreshed Tech Preview to Spark 1.1.0 (available now)
• ORC file support & Hive 0.13 integration
• Planned for GA of Spark 1.2.0
• Operations integration via YARN ATS and Ambari
• Security: authentication

Apache Solr
• Added Banana, a rich and flexible UI for visualizing time series data indexed in Solr

Cascading
• Cascading 3.0 on Tez distributed with HDP (coming soon)

Apache Falcon
• Authentication integration
• Lineage, now GA (previously a tech preview feature)
• Improved UI for pipeline management & editing: list, detail, and create new (from existing elements)
• Replicate to cloud: Azure & S3

Apache Sqoop, Apache Flume & Apache Oozie
• Sqoop import support for Hive types via HCatalog
• Secure Windows cluster support: Sqoop, Flume, Oozie
• Flume streaming support: sink to HCat on a secure cluster
• Oozie HA now supports secure clusters
• Oozie Rolling Upgrade
• Operational improvements for Oozie to better support Falcon: capture workflow job logs in HDFS, don't start new workflows for re-run, allow job property updates on running jobs

Apache HBase, Apache Phoenix, & Apache Accumulo
• HBase & Accumulo on YARN via Slider
• HBase HA: replicas update in real time, fully supports region split/merge, Scan API now supports standby RegionServers
• HBase block cache compression
• HBase optimizations for low latency
• Phoenix robust secondary indexes
• Performance enhancements for bulk import into Phoenix
• Hive over HBase snapshots
• Hive connector to Accumulo
• HBase & Accumulo wire-level encryption
• Accumulo multi-datacenter replication

Apache Storm
• Storm-on-YARN via Slider
• Ingest & notification for JMS (IBM MQ not supported)
• Kafka bolt for Storm, supporting sophisticated chaining of topologies through Kafka
• Kerberos support
• Hive update support: streaming ingest
• Connector improvements for HBase and HDFS
• Kafka delivered as a companion component: install, start/stop via Ambari; security authorization integration with Ranger

Apache Slider
• Create and run different versions of heterogeneous applications on demand
• Configure different application instances differently
• Manage the operational lifecycle of application instances
• Expand / shrink application instances
• Application registry for publish and discovery

Apache Knox & Apache Ranger (Argus) & HDP Security
• Apache Ranger: authorization and auditing for Storm and Knox
• REST APIs for managing policies in Apache Ranger
• Apache Ranger: native grant/revoke permissions in Hive and HBase
• Apache Ranger: Oracle DB support and storing of audit logs in HDFS
• Apache Ranger runs on Windows environments
• Apache Knox protection for the YARN ResourceManager
• Apache Knox support for HDFS HA
• Apache Ambari install, start/stop of Knox

Apache Ambari
• Support for the HDP 2.2 stack, including support for Kafka, Knox and Slider
• Enhancements to Ambari Web configuration management: versioning, history and revert, setting final properties and downloading client configurations
• Launch and monitor HDFS rebalance
• Perform Capacity Scheduler queue refresh
• Configure High Availability for ResourceManager
• Ambari Administration framework for managing user and group access to Ambari
• Ambari Views development framework for customizing the Ambari Web user experience
• Ambari Stacks for extending Ambari to bring custom services under Ambari management
• Ambari Blueprints for automating cluster deployments
• Performance improvements and enterprise usability guardrails
Page 6
Just How Many New Features are in HDP 2.2?
88. An astonishing amount of innovation in the OPEN Apache community.
HDP is Apache Hadoop
Page 7
Apache Hive & Stinger Initiative
Page 8
Hive – Single tool for all SQL use cases
Data sources: OLTP, ERP and CRM systems; unstructured documents and emails; clickstream; server logs; sentiment and web data; sensor and machine data; geolocation.
Workloads served through Hive SQL: interactive analytics; batch reports / deep analytics; ETL / ELT.
Page 9
Hive Scales To Any Workload
" The original developers of Hive. " More data than existing RDBMS could handle. " 100+ PB of data under management. " 15+ TB of data loaded daily. " 60,000+ Hive queries per day. " More than 1,000 users per day.
Page 10
Hive Join Strategies
Type: Shuffle Join
• Approach: join keys are shuffled using map/reduce and the join is performed reduce-side
• Pros: works regardless of data size or layout
• Cons: the most resource-intensive and slowest join type

Type: Broadcast Join
• Approach: small tables are loaded into memory on all nodes; the mapper scans through the large table and joins
• Pros: very fast; a single scan through the largest table
• Cons: all but one table must be small enough to fit in RAM

Type: Sort-Merge-Bucket Join
• Approach: mappers take advantage of co-location of keys to do efficient joins
• Pros: very fast for tables of any size
• Cons: data must be bucketed ahead of time
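Which strategy Hive picks can be steered through configuration. A minimal sketch of the relevant settings (property names as of Hive 0.13; the table names `big_table` and `small_table` are illustrative):

```sql
-- Broadcast (map) join: let Hive convert joins automatically when the
-- small side fits under the size threshold (in bytes).
SET hive.auto.convert.join=true;
SET hive.auto.convert.join.noconditionaltask.size=10000000;

-- Sort-Merge-Bucket join: both tables must be bucketed and sorted on the
-- join key with compatible bucket counts.
SET hive.optimize.bucketmapjoin=true;
SET hive.optimize.bucketmapjoin.sortedmerge=true;

-- When neither optimization applies, Hive falls back to a shuffle join.
SELECT s.id, b.val
FROM small_table s JOIN big_table b ON s.id = b.id;
```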
Page 11
Stinger Initiative
• Stinger Initiative – DELIVERED: next-generation SQL-based interactive query in Hadoop
Speed: Hive query performance improved by 100x to allow for interactive query times (seconds)
Scale: The only SQL interface to Hadoop designed for queries that scale from TB to PB
SQL: Support for the broadest range of SQL semantics for analytic applications running against Hadoop
An Open Community at its finest: Apache Hive Contribution
1,672 Jira Tickets Closed
145 Developers
44 Companies
360,000 Lines Of Code Added… (2.5x)
[Architecture] Business analytics tools and custom apps issue SQL to Apache Hive, which runs on Apache Tez (or Apache MapReduce) over Apache YARN and HDFS (Hadoop Distributed File System).
13 months: queries that took 100s to 1000s of seconds in Hive 0.10 run in seconds in Hive 0.13. Dramatically faster queries speed time to insight.
Page 12
Stinger Initiative - Key Innovations
File Format (ORCFile) + Execution Engine (Tez) + Query Planner (CBO) = 100X
Page 13
Tez (“Speed”)
• What is it? – A data processing framework as an alternative to MapReduce
• Who else is involved? – Hortonworks, Facebook, Twitter, Yahoo, Microsoft
• Why does it matter? – Widens the platform for Hadoop use cases – Crucial to improving the performance of low-latency applications – Core to the Stinger initiative – Evidence of Hortonworks leading the community in the evolution of Enterprise Hadoop
Page 14
Comparing: Hive/MR vs. Hive/Tez
SELECT a.state, COUNT(*), AVG(c.price)
FROM a
JOIN b ON (a.id = b.id)
JOIN c ON (a.itemId = c.itemId)
GROUP BY a.state
[DAG comparison] With Hive on MapReduce, the query runs as a chain of separate MR jobs: one job joins a and b, another joins the result with c, and a final job performs the GROUP BY and aggregations (COUNT(*), AVG(c.price)), with each job writing its intermediate output to HDFS. With Hive on Tez, the same query runs as a single DAG of map and reduce vertices, so no intermediate results are written to HDFS between stages.
Tez avoids unneeded writes to HDFS
Page 15
ORCFile – Columnar Storage for Hive
• Columns stored separately
• Knows types – Uses type-specific encoders – Stores statistics (min, max, sum, count)
• Has light-weight index – Skip over blocks of rows that don’t matter
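As a sketch, the ORC properties described above can be set directly in the table DDL; the table and column names below are illustrative, and the TBLPROPERTIES shown are the standard ORC knobs (compression codec, stripe size, light-weight index):

```sql
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING,
  ts      TIMESTAMP
)
STORED AS ORC
TBLPROPERTIES (
  "orc.compress" = "ZLIB",          -- per-stripe compression codec
  "orc.stripe.size" = "268435456",  -- 256 MB stripes, large-block friendly
  "orc.create.index" = "true"       -- light-weight index for skipping row groups
);
```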
Page 16
ORCFile – Columnar Storage for Hive
Large block size ideal for map/reduce.
Columnar format enables high compression and high performance.
Page 17
Query Planner – Cost Based Optimizer in Hive
The Cost-Based Optimizer (CBO) uses statistics within Hive tables to produce optimal query plans
Why cost-based optimization?
• Ease of use: join reordering
• Reduces the need for specialists to tune queries
• More efficient query plans lead to better cluster utilization
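A minimal sketch of turning the optimizer on (these properties exist in Hive 0.13/0.14; statistics must already have been gathered for the CBO to use them):

```sql
SET hive.cbo.enable=true;                  -- enable the cost-based optimizer
SET hive.compute.query.using.stats=true;   -- answer metadata-only queries from stats
SET hive.stats.fetch.column.stats=true;    -- let the planner read column stats
SET hive.stats.fetch.partition.stats=true; -- let the planner read partition stats
```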
Page 18
Statistics: Foundations for CBO
Kinds of statistics
Table statistics, collected on load, per partition:
• Uncompressed size
• Number of rows
• Number of files
Column statistics, required by the CBO:
• NDV (Number of Distinct Values)
• Nulls, Min, Max

Usability: how does the data get statistics?
The ANALYZE TABLE command:
• Analyze the entire table
• Run the command per partition
• Run it for some partitions and the compiler will extrapolate statistics
Collecting statistics on load:
• Table stats can be collected on insert via Hive with set hive.stats.autogather=true
• Not with LOAD DATA
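The commands behind the bullets above, sketched against a hypothetical table `t` partitioned by `dt` (table name and partition value are illustrative):

```sql
-- Gather table stats automatically on INSERT (has no effect on LOAD DATA).
SET hive.stats.autogather=true;

-- Statistics for one partition.
ANALYZE TABLE t PARTITION (dt='2014-12-04') COMPUTE STATISTICS;

-- Column statistics for that partition, required by the CBO.
ANALYZE TABLE t PARTITION (dt='2014-12-04') COMPUTE STATISTICS FOR COLUMNS;
```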
Page 19
A Journey to SQL Compliance
Evolution of SQL Compliance in Hive

SQL Datatypes                   SQL Semantics
INT/TINYINT/SMALLINT/BIGINT     SELECT, INSERT
FLOAT/DOUBLE                    GROUP BY, ORDER BY, HAVING
BOOLEAN                         JOIN on explicit join key
ARRAY, MAP, STRUCT, UNION       Inner, outer, cross and semi joins
STRING                          Sub-queries in the FROM clause
BINARY                          ROLLUP and CUBE
TIMESTAMP                       UNION
DECIMAL                         Standard aggregations (sum, avg, etc.)
DATE                            Custom Java UDFs
VARCHAR                         Windowing functions (OVER, RANK, etc.)
CHAR                            Advanced UDFs (ngram, XPath, URL)
                                JOINs in the WHERE clause
                                Sub-queries for IN/NOT IN, HAVING

Legend (features are shaded by release on the original slide): Hive 0.10 or earlier, Hive 0.11, Hive 0.12, Hive 0.13
Page 20
Now this is not the end. It is not even the beginning of the end. But it is, perhaps, the
end of the beginning. -Winston Churchill
Hive 0.13
Page 21
Stinger.Next
Page 22
Stinger.Next: Delivery Themes
Hive 0.14
• Transactions with ACID, allowing insert, update and delete
• Streaming ingest
• Cost Based Optimizer optimizes star and bushy join queries

Sub-Second (1st half 2015)
• Sub-second queries with LLAP
• Hive-Spark machine learning integration
• Operational reporting with Hive streaming ingest and transactions

Richer Analytics (2nd half 2015)
• Toward SQL:2011 analytics
• Materialized views
• Cross-geo queries
• Workload management via YARN and LLAP integration
Page 23
Transaction Use Cases

Reporting with analytics (YES)
• Reporting on data with occasional updates
• Corrections to the fact tables, evolving dimension tables
• Low-concurrency updates, low TPS

Operational reporting (YES)
• High-throughput ingest from an operational (OLTP) database
• Periodic inserts every 5-30 minutes
• Requires tool support and changes in our transactions

Operational (OLTP) database (NO)
• Small transactions, each doing single-line inserts
• High concurrency: hundreds to thousands of connections

(Diagram: the OLTP database replicates into Hive, which serves analytics and modifications; high-concurrency OLTP itself stays outside Hive.)
Page 24
Deep Dive: Transactions
Transaction support in Hive with ACID semantics
• Hive native support for INSERT, UPDATE, DELETE
• Split into phases:
  • Phase 1: Hive Streaming Ingest (append) [Done]
  • Phase 2: INSERT / UPDATE / DELETE support [Done]
  • Phase 3: BEGIN / COMMIT / ROLLBACK transactions [Next]
1. Original file: a task reads the latest read-optimized ORCFile.
2. Edits made: a task reads the ORCFile and merges in the delta file containing the edits.
3. Edits merged: a task reads the updated read-optimized ORCFile.
The Hive ACID compactor periodically merges the delta files in the background.
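With Phase 2 in place (Hive 0.14), the DML below runs against a bucketed, transactional ORC table; the table name, partition values and data are illustrative, and the transaction settings from Step 1 on Page 27 must be in effect:

```sql
-- Requires a bucketed ORC table created with "transactional"="true".
INSERT INTO TABLE test PARTITION (year='2014', month='12', day='04')
VALUES (1, 'a'), (2, 'b');

-- Updates may not touch bucketing or partition columns.
UPDATE test SET val = 'c' WHERE id = 1;

DELETE FROM test WHERE id = 2;
```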
Page 25
Transactions - Requirements
• The table must be declared with the transactional table property
• The table must be in ORC format
• The table must be bucketed
Page 26
Putting It Together
Page 27
Step 1 - Turn On Transactions Hive Configuration
§ hive.support.concurrency=true
§ hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager
§ hive.compactor.initiator.on=true
§ hive.compactor.worker.threads=2
§ hive.enforce.bucketing=true
§ hive.exec.dynamic.partition.mode=nonstrict
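With these properties set, the transaction machinery can be sanity-checked from the Hive CLI or Beeline; SHOW COMPACTIONS exists from Hive 0.13, and SHOW TRANSACTIONS from Hive 0.14:

```sql
SHOW TRANSACTIONS;  -- open/aborted transactions known to the DbTxnManager
SHOW COMPACTIONS;   -- compaction requests queued or running for delta files
```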
Page 28
Step 2 – Enable Concurrency By Defining Queues
YARN Configuration
§ yarn.scheduler.capacity.root.default.capacity=50
§ yarn.scheduler.capacity.root.hiveserver.capacity=50
§ yarn.scheduler.capacity.root.hiveserver.hive1.capacity=50
§ yarn.scheduler.capacity.root.hiveserver.hive1.user-limit-factor=4
§ yarn.scheduler.capacity.root.hiveserver.hive2.capacity=50
§ yarn.scheduler.capacity.root.hiveserver.hive2.user-limit-factor=4
§ yarn.scheduler.capacity.root.hiveserver.queues=hive1,hive2
§ yarn.scheduler.capacity.root.queues=default,hiveserver
Page 29
Step 3 – Deliver Capacity Guarantees by Enabling YARN Preemption
YARN Configuration
§ yarn.resourcemanager.scheduler.monitor.enable=true
§ yarn.resourcemanager.scheduler.monitor.policies=org.apache.hadoop.yarn.server.resourcemanager.monitor.capacity.ProportionalCapacityPreemptionPolicy
§ yarn.resourcemanager.monitor.capacity.preemption.monitoring_interval=1000
§ yarn.resourcemanager.monitor.capacity.preemption.max_wait_before_kill=5000
§ yarn.resourcemanager.monitor.capacity.preemption.total_preemption_per_round=0.4
Page 30
Step 4 – Enable Tez Execution Engine & Tez Sessions
Enable sessions for the Hive queues.
Hive Configuration
§ hive.execution.engine=tez
§ hive.server2.tez.initialize.default.sessions=true
§ hive.server2.tez.default.queues=hive1,hive2
§ hive.server2.tez.sessions.per.default.queue=1
§ hive.server2.enable.doAs=false
§ hive.vectorized.groupby.maxentries=10240
§ hive.vectorized.groupby.flush.percent=0.1
Page 31
Step 5 - Create Partitioned & Bucketed ORC Tables
Create table if not exists test (id int, val string)
partitioned by (year string,month string,day string)
clustered by (id) into 7 buckets
stored as orc TBLPROPERTIES ("transactional"="true");
Note:
§ Transactions require bucketed tables in ORC format; tables cannot be sorted
§ "transactional"="true" must be set in the table properties
§ For performance, table partitioning is recommended but not mandatory
§ Partition on filter columns with low cardinality
§ For optimal performance stay below 1000 partitions
§ Cluster on join columns
§ The number of buckets is contingent on dataset size
Page 32
Step 6 - Loading Data into an ORC Table
§ Sqoop, Flume & Storm support direct ingestion into ORC tables
§ Have a text file? Load it into a table stored as textfile, then transfer it to the ORC table using a Hive INSERT statement
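The text-file path sketched as HiveQL; the staging table, file path and columns are illustrative, and the target `test` table is the partitioned ORC table created in Step 5:

```sql
-- 1. Staging table over the raw delimited file.
CREATE TABLE test_staging (id INT, val STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH '/tmp/test.csv' INTO TABLE test_staging;

-- 2. Rewrite into the ORC table; partition values come from the SELECT
--    (hive.exec.dynamic.partition.mode=nonstrict was set in Step 1).
INSERT INTO TABLE test PARTITION (year, month, day)
SELECT id, val, '2014', '12', '04' FROM test_staging;
```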
Page 33
Step 7 - Compute Statistics
§ Compute table stats:
analyze table test partition(year,month,day) compute statistics;
§ Compute column stats:
analyze table test partition(year,month,day) compute statistics for columns;
§ Keep stats updated: speed up computation by limiting it to partitions that have changed

Note:
§ In Hive 0.14, column stats can be calculated for all partitions in a single statement
§ To limit computation to a specific partition, specify the partition keys
Page 34
Sample Code – Sqoop Import To ORC Table
sqoop import --verbose \
  --connect 'jdbc:mysql://localhost/people' \
  --table persons --username root \
  --hcatalog-table persons \
  --hcatalog-storage-stanza "stored as orc" \
  -m 1

Use HCatalog to import into an ORC table.
Page 35
Sample Code – Flume Configuration For Hive Streaming Ingest

## Agent
agent.sources = csvfile
agent.sources.csvfile.type = exec
agent.sources.csvfile.command = tail -F /root/test.txt
agent.sources.csvfile.batchSize = 1
agent.sources.csvfile.channels = memoryChannel
agent.sources.csvfile.interceptors = intercepttime
agent.sources.csvfile.interceptors.intercepttime.type = timestamp
## Channels
agent.channels = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000
## Hive Streaming Sink
agent.sinks = hiveout
agent.sinks.hiveout.type = hive
agent.sinks.hiveout.hive.metastore=thrift://localhost:9083
agent.sinks.hiveout.hive.database=default
agent.sinks.hiveout.hive.table=test
agent.sinks.hiveout.hive.partition=%Y,%m,%d
agent.sinks.hiveout.serializer = DELIMITED
agent.sinks.hiveout.serializer.fieldnames =id,val
agent.sinks.hiveout.channel = memoryChannel
Page 36
Q&A