Hive – A Warehousing Solution Over a MapReduce Framework
Bingbing Liu
2009-12-12
Outline
• Introduction
• Data Model
• Architecture
• HiveQL
What is Hive?
• A system for managing and querying structured data, built on top of Hadoop:
– Map-Reduce for execution
– HDFS for storage
– Metadata on raw files
• Key building principles:
– SQL as a familiar data warehousing tool
– Extensibility: types, functions, formats, scripts
– Scalability and performance
Hive/Hadoop Usage @ Facebook
• Types of applications:
– Reporting
  • E.g. daily/weekly aggregations of impression/click counts
  • Complex measures of user engagement
– Ad hoc analysis
  • E.g. how many group admins, broken down by state/country
– Data mining (assembling training data)
  • E.g. user engagement as a function of user attributes
– Spam detection
  • Anomalous patterns for site integrity
  • Application API usage patterns
– Ad optimization
– Too many to count ..
700 terabytes of data
5,000 queries/day
More than 100 users
Data Warehousing at Facebook Today
[Diagram components: Web Servers, Scribe Servers, Filers, Hive on Hadoop Cluster, Oracle RAC, Federated MySQL]
Data Model
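• How data is organized in Hive:
– Tables: conceptually similar to tables in an RDBMS; in storage, each table corresponds to an HDFS directory.
– Partitions: each table can have one or more partitions, which determine how its data is distributed across subdirectories.
– Buckets: within each partition, rows are assigned to buckets by hashing a column; each bucket is one file.
Example: to partition the data by column ds:
CREATE TABLE sc (sno int) PARTITIONED BY (ds string);
Rows with ds=2009-12-08 are then stored in the partition subdirectory
/sc/ds=2009-12-08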
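The mapping from a partition value to a storage location, and from a column value to a bucket, can be sketched in a few lines of Python. This is an illustrative model only, not Hive code: the helper names `partition_dir` and `bucket_for` are hypothetical, and the bucket function assumes a plain modulo over an integer column value, which may differ from the hash Hive actually applies.

```python
def partition_dir(table_dir: str, partition_col: str, value: str) -> str:
    """Return the subdirectory that holds one partition of a table."""
    return f"{table_dir.rstrip('/')}/{partition_col}={value}"


def bucket_for(column_value: int, num_buckets: int) -> int:
    """Pick a bucket for a row (assumption: simple modulo on an int column)."""
    return column_value % num_buckets
```

For the sc table above, `partition_dir("/sc", "ds", "2009-12-08")` yields `/sc/ds=2009-12-08`, matching the directory on the slide.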
Data Model
[Diagram: how table sc maps onto HDFS and the MetaStore. Labels: Logical Partitioning, Hash Partitioning]
• HDFS:
– /hive/sc
– /hive/sc/ds=2009-12-08
– /hive/sc/ds=2009-12-08/sc.txt
– …
• MetaStore (Metastore DB):
– Tables: sc, student, course
– Data Location
– Bucketing Info
– Partitioning Cols
Metastore
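• Stored locally or in a traditional RDBMS (not in HDFS).
• Database
– The namespace for all tables; the default is "default".
• Table
– Holds the list of columns and their types, plus storage and serialization/deserialization (SerDe) information.
– Storage information covers the underlying data location, the data format (type), and bucket information.
• Partition
– Each partition can have its own columns, SerDe information, and storage information.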
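The metastore objects described above can be pictured as a small set of nested records. The following Python sketch is only a mental model under stated assumptions: every class and field name here is hypothetical and chosen for illustration, and none of it reflects the actual Hive metastore schema or API.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Column:
    name: str
    type: str


@dataclass
class Storage:
    location: str          # where the data lives in the underlying store
    data_format: str       # illustrative value, e.g. "text"
    num_buckets: int = 0   # 0 means "not bucketed" in this sketch


@dataclass
class Partition:
    values: Dict[str, str]  # partition column -> value
    storage: Storage        # a partition carries its own storage info
    serde: str = "default"  # placeholder SerDe name


@dataclass
class Table:
    name: str
    columns: List[Column]
    storage: Storage
    serde: str = "default"
    partition_cols: List[Column] = field(default_factory=list)
    partitions: List[Partition] = field(default_factory=list)


@dataclass
class Database:
    name: str = "default"   # namespace for tables
    tables: Dict[str, Table] = field(default_factory=dict)


def example_metastore() -> Database:
    """Build the sc table with one partition, ds=2009-12-08."""
    sc = Table(
        name="sc",
        columns=[Column("sno", "int")],
        storage=Storage("/sc", "text"),
        partition_cols=[Column("ds", "string")],
    )
    sc.partitions.append(
        Partition({"ds": "2009-12-08"}, Storage("/sc/ds=2009-12-08", "text"))
    )
    db = Database()
    db.tables[sc.name] = sc
    return db
```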
Architecture
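[Diagram: Hive architecture components]
• Interfaces: Hive CLI (DDL, Queries, Browsing), Web UI, Thrift API
• Query processing: Parser, Planner, Optimizer, Execution
• MetaStore (backed by a DB)
• SerDe: Thrift, Jute, JSON, ..
• Hadoop: Map Reduce, HDFS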
HiveQL – Hive Query Language
• Supports:
– Select, project, aggregate, union all
– Loading data into a table from a local or HDFS directory
– Equi-joins
– Subqueries in the FROM clause
– Multi-table insert
– Multi-group-by
Example
• Student (sno int, sname string, class int);
• Course (cno int, cname string);
• Sc (sno int, cno int, grade int) partitioned by (ds string);
The traditional row-level insert, Insert into table test (1, 1, 1);, is not supported.
HiveQL- Join
• SQL:
INSERT OVERWRITE TABLE test
SELECT t1.sname,t2.cno
FROM student t1 JOIN sc t2 ON (t1.sno = t2.sno);
student
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

sc
Sno  Cno  Grade
1    1    90
1    2    80
2    1    79
2    2    80

test (= student X sc)
sname  cno
Wang   1
Wang   2
Zhang  1
Zhang  2
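HiveQL- Join in Map Reduce
Input: the student (Sno, Sname, Class) and sc (Sno, Cno, Grade) tables from the join example.

Map: each row is emitted with the join key sno; the value is tagged with its source table (0 = student, 1 = sc).

student
key  value
1    <0,Wang>
2    <0,Zhang>
3    <0,Zhou>
4    <0,Chen>

sc
key  value
1    <1,1>
1    <1,2>
2    <1,1>
2    <1,2>

Shuffle/Sort: values are grouped by key.

key  value
1    <0,Wang>
1    <1,1>
1    <1,2>

key  value
2    <0,Zhang>
2    <1,1>
2    <1,2>

key  value
3    <0,Zhou>
4    <0,Chen>

Reduce: for each key, the student values are combined with the sc values.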
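The map, shuffle/sort, and reduce steps of this join can be simulated in plain Python. This is a minimal sketch of a reduce-side join using the slide's data, not Hive's implementation; the function names are hypothetical, and the whole pipeline runs in one process purely for illustration.

```python
from collections import defaultdict

student = [(1, "Wang", 1), (2, "Zhang", 1), (3, "Zhou", 2), (4, "Chen", 2)]
sc = [(1, 1, 90), (1, 2, 80), (2, 1, 79), (2, 2, 80)]


def map_phase(student_rows, sc_rows):
    """Emit (sno, (tag, payload)); tag 0 marks student rows, tag 1 marks sc rows."""
    for sno, sname, _class in student_rows:
        yield sno, (0, sname)
    for sno, cno, _grade in sc_rows:
        yield sno, (1, cno)


def shuffle_sort(pairs):
    """Group values by key and order the keys, as the shuffle/sort step would."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items(), key=lambda item: item[0])


def reduce_phase(groups):
    """For each key, pair every student name with every course number."""
    for _sno, values in groups:
        names = [payload for tag, payload in values if tag == 0]
        courses = [payload for tag, payload in values if tag == 1]
        for sname in names:
            for cno in courses:
                yield sname, cno


def join(student_rows, sc_rows):
    """Run the three phases end to end and return (sname, cno) rows."""
    return list(reduce_phase(shuffle_sort(map_phase(student_rows, sc_rows))))
```

Calling `join(student, sc)` produces the four (sname, cno) rows of the test table. Keys 3 and 4 reach the reducer with only a student value, so they contribute no output rows.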
Query plan
Map:
• TableScanOperator: Table: student [sno int, sname string, class int]
• ReduceSinkOperator: Partition cols: col[0] [0 int, 1 string, 2 int]
• TableScanOperator: Table: sc [sno int, cno int, grade int]
• ReduceSinkOperator: Partition cols: col[0] [0 int, 1 int, 2 int]
Reduce:
• JoinOperator: Predicate: col[0,0] = col[1,0] [0 int, 1 string, 2 int, 3 int, 4 int, 5 int]
• SelectOperator: Expressions: [col[1], col[4]] [0 string, 1 int]
• FileOutputOperator: Table: test [0 string, 1 int]
Hive QL – Group By
SELECT student.class, count(1)
FROM student
GROUP BY student.class;
student
Sno  Sname  Class
1    Wang   1
2    Zhang  1
3    Zhou   2
4    Chen   2

Result
Class  count
1      2
2      2
Hive QL – Group By in Map Reduce
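Input splits of student:

Sno  Sname  Class
1    Wang   1
2    Zhang  1

Sno  Sname  Class
3    Zhou   2
4    Chen   2

Map: each row is emitted with class as the key and 1 as the value.

key  value
1    1
1    1

key  value
2    1
2    1

Shuffle/Sort: values are grouped by key.

key  value
1    1
1    1

key  value
2    1
2    1

Reduce: the values for each key are summed.

class  count
1      2

class  count
2      2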
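The same three phases can be simulated in plain Python for the count query. This is a minimal sketch using the slide's data, not Hive's implementation; the function names are hypothetical, and the two input splits are modeled as two lists.

```python
from collections import defaultdict

splits = [
    [(1, "Wang", 1), (2, "Zhang", 1)],  # first input split
    [(3, "Zhou", 2), (4, "Chen", 2)],   # second input split
]


def map_phase(split):
    """Emit (class, 1) for every student row in one split."""
    for _sno, _sname, class_ in split:
        yield class_, 1


def shuffle_sort(mapped_splits):
    """Merge the map outputs, group values by key, and order the keys."""
    groups = defaultdict(list)
    for pairs in mapped_splits:
        for key, value in pairs:
            groups[key].append(value)
    return sorted(groups.items(), key=lambda item: item[0])


def reduce_phase(groups):
    """Sum the 1s for each class to get its row count."""
    for class_, ones in groups:
        yield class_, sum(ones)


def group_by_count(input_splits):
    """Run map, shuffle/sort, and reduce; return (class, count) rows."""
    mapped = [list(map_phase(split)) for split in input_splits]
    return list(reduce_phase(shuffle_sort(mapped)))
```

Calling `group_by_count(splits)` returns one (class, count) row per class, matching the result table for the GROUP BY query.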
Query plan
• TableScanOperator: Table: student [sno int, sname string, class int]
• ReduceSinkOperator: Partition cols: col[2] [0 int, 1 string, 2 int]
• GroupByOperator: Aggregations: [count[2]], Keys: [col[2]] [0 int, 1 bigint]
• FileOutputOperator: Table: tmp1 [0 int, 1 bigint]
• TableScanOperator: Table: tmp1 [0 int, 1 bigint]
• ReduceSinkOperator: Partition cols: col[0] [0 int, 1 bigint]
• SelectOperator: Expressions: [col[0], col[1]] [0 int, 1 bigint]
Aggregation key: what if the query groups by sno, class? Would column 0 then be <int, int>?
Multi group by